I Introduction
Visualisation is a fundamental task in data mining which aims to represent data in a human-understandable graphical manner [fayyad1996data]. Visualisation is often touted as a useful tool for allowing practitioners to gain an understanding of the data being analysed, but state-of-the-art visualisation methods (e.g. t-Distributed Stochastic Neighbour Embedding (t-SNE) [maatenTSNE]) tend to be black boxes that give no insight into how the visualisation represents the original features of the data. Other methods such as autoencoders [hinton2006reducing] and parametric t-SNE [maaten2009learning] produce a parametric mapping from the original dataset to the two-dimensional visualisation, but these use complex deep neural networks which are opaque to humans and provide little “intuitive” understanding. Even the most recent advances such as Uniform Manifold Approximation and Projection (UMAP)
[2018arXivUMAP] acknowledge significant weaknesses with regards to interpretability. As McInnes et al. note (emphasis ours): “For a number of uses [sic] cases the interpretability of the reduced dimension results is of critical importance. Similarly to most non-linear dimension reduction techniques (including t-SNE and Isomap), UMAP lacks the strong interpretability of Principal Component Analysis (PCA) and related techniques such a [sic] Non-Negative Matrix Factorization (NMF). In particular the dimensions of the UMAP embedding space have no specific meaning, unlike PCA where the dimensions are the directions of greatest variance in the source data. Furthermore, since UMAP is based on the distance between observations rather than the source features, it does not have an equivalent of factor loadings that linear techniques such as PCA, or Factor Analysis can provide. If strong interpretability is critical we therefore recommend linear techniques such as PCA and NMF.” [2018arXivUMAP, p. 35]
While such linear techniques as PCA and NMF provide strong interpretability, they are inherently limited in the quality of their output dimensions by their use of linear weightings only. Tree-based genetic programming (GP) is often recognised for its ability to directly model functions by taking inputs as leaves and producing an output at the root [poli2008field], and is often used for dimensionality reduction in the form of feature construction (FC) [neshatian2012filter]. Significant progress has been made on producing interpretable GP trees, which are simple enough to be understood by a human expert, by using techniques such as parsimony pressure [poli2008field] and multi-objective optimisation [bleuler2001multiobjective], with multi-objective approaches showing particularly good results due to their ability to produce a range of solutions with different levels of complexity [wagner2012parsimony, cano2017multi]. Despite being clearly suitable for dimensionality reduction and for evolving interpretable trees, GP has never been used to produce interpretable models that create understandable visualisations.
We recently investigated the use of GP to perform manifold learning (GP-MaL) for dimensionality reduction [lensen2019can]. We found that GP was able to significantly reduce the size of feature sets while still retaining high-dimensional structure, by producing sophisticated constructed features which used a variety of non-linear and conditional transformations. Our findings also highlighted the considerable potential of GP-MaL for visualisation, but that the original fitness function needed substantial refinement and that the complexity of the evolved models would need to be significantly reduced.
In this paper, we aim to significantly expand our previous work by using a multi-objective GP approach for visualisation which is expected to address the above limitations. To our knowledge, this is the first multi-objective GP method proposed which treats visualisation quality and model complexity as competing objectives. This paper will:

Propose a holistic set of functions and terminals for creating powerful and interpretable models for visualisation;

Design a multi-objective approach to allow for the evolution of a solution front containing visualisations representing different levels of quality and complexity;

Compare the quality and interpretability of the evolved visualisations with those produced by state-of-the-art visualisation methods on a range of datasets; and

Perform an in-depth analysis of the trade-off between visualisation quality and tree complexity to demonstrate the unique advantages of the proposed approach.
This work is expected to provide an innovative approach which addresses the little-studied issue of model interpretability in visualisation. Of particular novelty is the in-depth analysis performed, which allows significant insight that is unattainable with existing single-solution black-box approaches.
II Background
II-A Dimensionality Reduction
Dimensionality reduction (DR), the process of reducing the number of features/attributes in a dataset to improve understanding and performance [liu2012feature], has continued to grow in importance as datasets become increasingly large and deep neural networks become more and more uninterpretable. Common techniques to address this problem include feature selection (FS) and feature construction (FC), which reduce the feature space through removing unwanted features or creating fewer, more complex meta-features, respectively. Evolutionary Computation (EC) methods have been applied to these NP-hard problems with significant success [xue2015survey, neshatian2012filter]. In particular, tree-based GP has proved to be a natural fit for FC problems due to its functional and interpretable nature; GP-based FC has been used in classification [tran2016genetic], image analysis [alsahaf2017automatically], clustering [lensen2017GPGC] and other domains [Ahmed2014Multiple, hart2017hybrid].
Manifold learning (MaL), also known as non-linear dimensionality reduction [lee2007nonlinear], can be regarded as an unsupervised FC approach, where the task is to build a set of features which represent the non-linear manifold present in the high-dimensional feature space. One way of approaching this task is to attempt to build a function which maps the high-dimensional space to the low-dimensional manifold; such an approach could provide an understandable mapping between the two. Until our recent work [lensen2019can], there had been no investigation into whether GP could be used to evolve such a mapping.
II-B Machine Learning for Performing Visualisation
The simplest commonly-used machine-learning visualisation methods are FS and Principal Component Analysis (PCA) [jolliffe2011pca]. By using FS to select only two (or perhaps three) features, one can visualise a dataset by plotting the feature values of each instance along the x- and y-axes. PCA uses a more sophisticated approach: it creates a series of “principal components”, which are linear combinations of the feature set, such that each successive component is orthogonal to all the preceding components. By plotting the first two components, a visualisation is created which is optimal among linear transformations.
However, linear transformations fail to give clear visualisations of any underlying manifold/structure in a dataset beyond simple, low-dimensional data. Unfortunately, finding optimal non-linear transformations is an NP-hard problem, and many machine learning methods have been proposed as a result. The earliest methods include techniques such as Isomap [tenenbaum2000isomap] and Locally Linear Embedding (LLE) [roweis2000lle], but t-SNE [maatenTSNE] is generally regarded as the mainstream state-of-the-art method.
t-SNE is an improvement to the previously proposed SNE method [hinton2002sne], which introduced a nuanced probabilistic mapping from the high- to low-dimensional space based on the similarity of neighbours in the high-dimensional space. t-SNE improved upon SNE by introducing a cost function that could be better optimised by gradient-based optimisers, and by using a heavy-tailed t-distribution in place of a Gaussian to address the tendency of points in the low-dimensional space to be pushed together at the cost of separation between points, a characteristic known as the crowding problem. A range of improvements to t-SNE have since been proposed, including a more efficient tree-based nearest-neighbour search [maaten2014accelerating] and a parametric version [maaten2009learning], but the cost function, which is the key aspect in our work, has remained relatively unchanged.
II-C Multi-objective Optimisation
Multi-objective optimisation (MO) is used when a problem intrinsically has two (or more) conflicting objectives between which any solution must make a trade-off. The quality of a solution in this context is often relative to the objective function values of other solutions. For example, given two candidate solutions $u$ and $v$ to a problem whose objectives are to be minimised, the following standard test can be used to determine whether $u$ is strictly better than (or dominates) $v$:

$$u \prec v \iff \forall i: f_i(u) \le f_i(v) \;\land\; \exists i: f_i(u) < f_i(v) \quad (1)$$

where $i \in \{1, \dots, k\}$ for $k$ objectives (in this paper, $k = 2$). A solution that is not dominated by any other solution (i.e. it is non-dominated) is called a Pareto-optimal solution. The set of Pareto-optimal solutions is often called the Pareto set. The Pareto front is then the image of the Pareto set in the objective space. In practice, it is infeasible or very difficult to find the true Pareto front, so MO algorithms instead search for the best approximation front.
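The dominance relation above can be sketched in a few lines of Python (a minimal illustration, assuming all objectives are minimised; the function names are our own):

```python
# Sketch of the strict-dominance test of Eq. (1), assuming minimisation.
# Solutions are tuples of objective values, e.g. (cost, complexity).

def dominates(u, v):
    """True if u dominates v: u is no worse in every objective
    and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

def pareto_set(solutions):
    """Keep only the non-dominated solutions."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

# Example: (3, 3) is dominated by (2, 2); (1, 5) and (2, 2) trade off.
print(pareto_set([(1, 5), (2, 2), (3, 3)]))  # → [(1, 5), (2, 2)]
```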
EC methods are some of the most successful and widely used approaches to MO, as their population-based structure naturally allows multiple non-dominated solutions to be found in one run. Of the Evolutionary Multi-objective Optimisation (EMO) algorithms, the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) [zhang2007moead] has seen widespread use in recent years. MOEA/D has been shown to produce approximation fronts with better spread and convergence than previous EMO methods [li2009mo]. Multi-objective genetic programming (MOGP) approaches have seen significant success in recent years [bhowan2013evolving, cano2017multi].
II-D Related Work
II-D1 GP for Manifold Learning and Visualisation
The only existing work we are aware of which uses GP for MaL is our previous GP-MaL approach [lensen2019can]. GP-MaL was proposed to tackle the general MaL problem, i.e. the reduction of a $d$-dimensional space to a $d'$-dimensional space, where $d' < d$. A (single-objective) fitness function was proposed that measured how well manifold structure was preserved, by measuring the preservation of neighbour orderings in the lower-dimensional space. GP-MaL was fundamentally designed for dimensionality reduction for data mining tasks (e.g. classification), and so was primarily used to retain as much structure of the data as possible. Some initial application of GP-MaL to visualisation was investigated with encouraging results, but it was found that a more specific fitness function would be needed to improve visualisation quality, and that tree size would need to be addressed for visualisations to be understandable.
Multi-objective GP approaches have been proposed to improve the visualisation quality of feature construction in a supervised learning context. The MOG3P method [icke2011multi] aimed to produce solutions with a trade-off between three objectives: classifiability, visual interpretability, and semantic interpretability. Unlike this work, MOG3P focused on performing supervised learning, using class labels to guide the learning process. The three objectives also do not appear to be strictly conflicting: on a naive classification dataset, it would be possible to achieve perfect classification performance with a simple hyperplane split, which could be represented by a simple GP tree producing a visualisation with two very distinct classes. Even on more complex datasets, there is an inherent relationship between the visual interpretability of a dataset (i.e. how well classes are separated) and classification performance: given well-separated classes, one would expect correct classification to result. A later approach focused on classifiability and visual interpretability only, but used several different measures for each of these objectives [cano2017multi]. That work is also a supervised approach, and so tackles a significantly different problem to that of this paper. GP for visualisation has also been applied to combinatorial optimisation problems, such as job shop scheduling [nguyen2018visualising].
II-D2 Other Approaches
A variety of non-EC approaches have been proposed for MaL, some of which produce models that could theoretically be interpreted to understand the manifold in terms of the original features. Parametric t-SNE [maaten2009learning] is a variation of t-SNE that allows learnt t-SNE representations to be reused on future samples. Parametric t-SNE constructs a mapping from the high- to low-dimensional space using restricted Boltzmann machines to pre-train a feed-forward neural network model. The neural network used over 10,000 neurons on the largest dataset, heavily restricting the potential for interpreting the network. Autoencoders [hinton2006reducing] are another neural-network-based approach, which attempts to compress the representation of the data into as narrow a middle hidden layer as possible, such that the original data can be recreated from the concise representation. To do so, autoencoders use a number of layers of varying sizes to encode and decode the data. This provides a mapping to and from the learnt representation, but it is unrealistic to interpret given the number of nodes and the fully-connected topology. Attempts have been made to improve the interpretability of autoencoders [sun2018evolving], but success is fundamentally limited by the need for architectures which are differentiable. Self-organising maps (SOMs) [kohonen1982som], a variation of an unsupervised neural network, have been used for visualisation, but the number of weights scales with dimensionality, and so their interpretability is limited on non-trivial datasets. The visualisation literature [nonato2019multidimensional] discusses a few examples of mapping “synthesised dimensions” to original dimensions (in particular, page 4 of the supplementary material). These examples generally use a post hoc approach, varying the model’s parameters or the instances [faust2019dimreader] and analysing the effect, or using tools such as heatmaps. There is a distinct lack of literature that directly evolves simple models that use minimal unique features.
III The Proposed Method: GP-tSNE
Generally, when applying GP to a problem there are two main decisions to be made: what terminal and function nodes are suitable, and how to formulate a fitness function to solve the problem. In addition, other improvements are often made to the standard GP evolutionary process to further tailor it to the problem. Each of these three considerations is discussed in the following subsections. While each of these components builds on established work, their combined use to produce a range of interpretable models giving high-quality visualisations is a new and substantial advance in this field. The full source code of GP-tSNE is available online (https://github.com/AndLen/gptsne).
III-A GP Architecture
In order to project a high-dimensional space onto a two-dimensional space for visualisation, it is important that a range of powerful and varied functions is available to the evolutionary process, in order to produce compact but representative models. As exactly two dimensions are required for visualisation, we use a multi-tree approach, where each individual contains two trees and each tree produces one dimension (i.e. the x- or y-axis) of the visualisation. A multi-tree representation is chosen rather than a cooperative co-evolution approach because the two trees must be tightly coupled (i.e. highly dependent) in order to give high-quality visualisations; cooperative co-evolution approaches tend to non-deterministically pair solutions from different subpopulations, which greatly decreases the ability for such coupling to occur.
III-A1 Function Set
Table I lists the nine functions and four terminals used in GP-tSNE. The four arithmetic functions are standard, with the exception of the addition and subtraction functions, which are the only functions that can take the Zero node as input. This allows a variable number of inputs to these functions, which lets the evolutionary process easily perform mutations that effectively remove whole subtrees (hopefully introns: “useless” subtrees whose output does not affect the tree output) in the pursuit of model simplification. The division function performs protected division: if the denominator is 0, it returns 1.
Table I: Function and terminal set of GP-tSNE.

Function | No. of Inputs | Description
Arithmetic Functions
+ | 1–5 | Flexible Addition
− | 1–2 | Std. Subtraction
× | 2 | Std. Multiplication
÷ | 2 | Protected Division
Non-Linear Functions
Sigmoid | 1 | 1 / (1 + e^(−x))
ReLU | 1 | max(0, x)
Conditional Functions
Max | 2 | max(x, y)
Min | 2 | min(x, y)
If | 3 | if x < 0: y; else z
Terminal Nodes
F | 0 | feature value
NF | 0 | feature value averaged over the instance’s three nearest neighbours
Constant | 0 | From a fixed real range
Zero | 0 | The number 0
The sigmoid and ReLU functions are unary operators included to allow easy transformation of (linear) inputs into a non-linear output space. These were chosen based on inspiration from autoencoders and other neural network methods. The three conditional operators provide a different kind of non-linear transform which is quite unique to GP due to their non-differentiability, and they are expected to allow a single tree to exhibit varied behaviour depending on its inputs.
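The function set above can be sketched as plain Python functions (a hedged illustration only: the function names, the protected-division return value of 1, and the exact form of the If condition are our own assumptions where the table does not pin them down):

```python
import math

# Sketch of the GP-tSNE function set. Conventions such as pdiv returning
# 1 on a zero denominator and the If node testing its first input against
# zero are assumptions, not confirmed details of the original method.

def add(*xs):          # flexible addition: 1-5 inputs (Zero may fill slots)
    return sum(xs)

def sub(x, y=0.0):     # subtraction: 1-2 inputs
    return x - y

def mul(x, y):
    return x * y

def pdiv(x, y):        # protected division: avoid division by zero
    return x / y if y != 0 else 1.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def if_node(x, y, z):  # conditional, non-differentiable (assumed form)
    return y if x < 0 else z
```

Non-differentiable nodes such as `if_node`, `max`, and `min` are exactly the operators that gradient-based methods like parametric t-SNE cannot use, which is why they are distinctive to GP here.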
III-A2 Terminal Set
The F terminal is commonly used in feature-based GP to feed a feature as an input into a GP tree. Each feature in a dataset is assigned a distinct terminal which returns the values of that feature for a given instance. While this provides a great deal of flexibility to the EC search process, it makes the search space large when the number of features ($m$) is high. To remedy this, we use PCA to produce the first principal component (PC) of a dataset, and then select the features from that PC which have the highest-magnitude weights. Features with higher-magnitude weights contribute more to a PC, and therefore choosing the highest-magnitude features is a simple form of unsupervised feature selection. Each of these features has an increased likelihood of being chosen from the terminal set, which helps the EC search process focus on a set of promising features, while still allowing it to select other features with a smaller likelihood. The number of PCA-selected features is chosen so that it increases slowly with the total number of features. Note that while we use PCA to weight promising features more strongly, we do not use the actual transformed PCs in our terminal set, as this would make the trees very difficult to interpret.
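The PCA-based feature weighting can be sketched with NumPy (a minimal illustration; the function name is ours, and the number of selected features `k` is left as a parameter since the paper’s exact setting is elided here):

```python
import numpy as np

# Sketch of selecting "promising" features via the first principal
# component: features with the largest-magnitude loadings on the first
# PC contribute most to the direction of greatest variance.

def pca_promising_features(X, k):
    """Return indices of the k features with the highest-magnitude
    weights (loadings) in the first principal component of X."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centred data are the PCs.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    first_pc = Vt[0]
    return np.argsort(-np.abs(first_pc))[:k]
```

These indices would then receive a boosted sampling likelihood when terminals are drawn during tree initialisation and mutation, rather than being used as hard feature selection.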
The NF terminal serves two purposes: it provides a lower-noise version of a given feature, making trees less sensitive to noisy instances; and it allows a tree to encode local structure into the low-dimensional space more effectively by considering an instance’s neighbourhood. While individual features (F) are suitable for embedding global structure, they are not as effective for preserving local structure. The three nearest neighbours of each instance are precomputed by ordering the other instances by Euclidean distance.
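A hedged sketch of precomputing NF values follows; averaging the feature over the neighbours (excluding the instance itself) is our assumption of how the “lower-noise version” is formed:

```python
import numpy as np

# Sketch of the NF terminal precomputation: for each instance, each
# feature's value averaged over its k nearest neighbours by Euclidean
# distance. Excluding the instance itself is an assumption.

def neighbour_features(X, k=3):
    """Return NF where NF[i, f] is feature f averaged over the
    k nearest neighbours of instance i."""
    diff = X[:, None, :] - X[None, :, :]
    dists = (diff ** 2).sum(axis=-1)      # pairwise squared distances
    np.fill_diagonal(dists, np.inf)       # exclude self from neighbours
    nn = np.argsort(dists, axis=1)[:, :k] # k nearest per instance
    return X[nn].mean(axis=1)
```

Because the neighbour lists depend only on the dataset, this table can be computed once before evolution begins and looked up in O(1) during fitness evaluation.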
The constant terminal is often used in GP, and is used here to allow different parts of the tree to have different impacts on the final output. The Zero node is included solely for the flexible addition and subtraction functions, and is not used by any other function, as it is either destructive or has no effect. The constant and Zero nodes are chosen from the terminal set with a lower likelihood, as they are less likely to be useful to a function than a feature-based node. A summary of the terminal nodes’ weightings is shown in Table II. Note that the likelihood calculation is per feature in the case of a feature-based terminal, e.g. there are $m$ different F terminals, each with a base likelihood of 1. All of the numerical values, including the “5” in the flexible addition, the use of three nearest neighbours, and the likelihood values, were chosen empirically by testing on a range of representative datasets.
Table II: Terminal node weightings (“—” marks values lost in extraction).

Terminal | Likelihood | Likelihood per feature
“Normal” F | 1 | 1
PCA-selected F | — | —
NF | 1 | 1
Constant | — | —
Zero | — | —
III-B Multi-Objective Approach
There is an intrinsic relationship in machine learning between the potential performance of a model and the complexity required to attain that performance. For example, the simplest model to separate two classes would be a decision boundary that simply thresholds at a certain point in space, whereas for three linearly-separable classes, at least two thresholds would be required (e.g. for some real thresholds $a < b$: class 1 if $x < a$, class 2 if $a \le x < b$, class 3 otherwise). The same is true in visualisation: the more granular (specific) a visualisation is, the more complex the function used to produce that visualisation must be. To recreate the high-dimensional structure of a complex dataset in two dimensions would require two very big and complex GP trees. Each node removed from a tree decreases the accuracy with which the tree can reproduce the high-dimensional probability distribution (in the case of t-SNE). As an analogy, consider evolving a very complex polynomial function with GP: the fewer components (nodes) in the evolved function, the fewer inflection points are available to approximate the target function.
In this work, we employ a multi-objective approach to produce a set of solutions that allows a trade-off between visualisation quality and model interpretability to be chosen. This is strictly a more difficult problem than optimising visualisation quality alone (as in t-SNE), as a model must be found which maps the high-dimensional space to the low-dimensional one. We choose to use MOEA/D [zhang2007moead] for the reasons discussed in Section II-C. The first objective uses t-SNE to measure the visualisation quality of a GP individual; the second objective uses tree size as a proxy measure for the interpretability of the individual. These will be discussed in turn.
III-C Objective 1: Visualisation Quality
t-SNE uses conditional probabilities to represent the similarity between instances in a given dimensional space. Given two instances $x_i$ and $x_j$ in the high-dimensional space, the conditional probability $p_{j|i}$ that $x_j$ would be chosen as a neighbour of $x_i$ is defined as [maatenTSNE]:

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \quad (2)$$
given a Gaussian with variance $\sigma_i^2$ centred at $x_i$. t-SNE employs a symmetric approach where joint probability distributions are used; $p_{ij}$ is computed from the symmetrised conditional probabilities, which for $n$ instances is defined as:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \quad (3)$$
In order to remedy the crowding problem (see [maatenTSNE] for further details), t-SNE uses a slightly different approach in the low-dimensional space, based on a Student t-distribution. The joint probability of two instances $y_i$ and $y_j$ in the low-dimensional space, called $q_{ij}$, is computed as:

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \quad (4)$$
The cost function $C$, which measures the extent to which the low-dimensional probability distribution fails to match the high-dimensional one, is the difference between the two distributions as measured by the Kullback-Leibler (KL) divergence:

$$C = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (5)$$
We use this cost function as the first objective, which is to be minimised. The bandwidth parameter $\sigma_i$ is computed from a fixed perplexity using the approach outlined in [maatenTSNE]. We choose this cost function as it is well-tested and well-regarded in the MaL and visualisation literature.
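The three quantities above can be sketched directly in NumPy. This is a simplified illustration, not the method’s implementation: in particular, it uses a single global sigma rather than per-instance sigmas set via a perplexity binary search, and the function names are ours:

```python
import numpy as np

# Sketch of the first objective: the t-SNE KL cost (Eqs. 2-5) between a
# fixed high-dimensional joint distribution P and an embedding Y. A single
# global sigma replaces the perplexity-calibrated per-instance sigmas.

def joint_p(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # p_{i|i} = 0
    P_cond = np.exp(-d2 / (2 * sigma ** 2))
    P_cond /= P_cond.sum(axis=1, keepdims=True)  # Eq. (2), row-wise
    n = X.shape[0]
    return (P_cond + P_cond.T) / (2 * n)         # Eq. (3), symmetrised

def joint_q(Y):
    d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    inv = 1.0 / (1.0 + d2)                       # Student-t kernel
    np.fill_diagonal(inv, 0.0)                   # q_{ii} = 0
    return inv / inv.sum()                       # Eq. (4)

def kl_cost(P, Q, eps=1e-12):
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))  # Eq. (5)
```

In GP-tSNE, `Y` would be the two-dimensional output of an individual’s two trees evaluated on every instance; `P` is computed once per dataset, so only `joint_q` and `kl_cost` need recomputing per evaluation.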
III-D Objective 2: Model Complexity
A common issue encountered in GP is the production of bloated trees, where a GP tree is significantly bigger than necessary to achieve a given level of fitness. Traditionally, there is no evolutionary pressure to encourage compact trees, and so trees may contain unnecessarily complex subtrees, or in the worst case introns. As well as being computationally inefficient, bloated trees are much harder for humans to interpret and understand.
Parsimony pressure, which treats minimisation of model size as a (minor) objective in the optimisation process, is the most frequently used method for controlling bloat in GP [poli2014parsimony]. Weighted-sum approaches, where a small component of overall fitness is based on tree size, have often been used to control bloat, but choosing the right weighting is difficult and generally must be done empirically. More recently, multi-objective approaches have been proposed for addressing bloat, whereby tree size is used as a secondary objective to be minimised [wagner2012parsimony]. This allows a trade-off between fitness and model complexity to be chosen along the approximation front produced by the GP process, while also producing a range of solutions of varying complexity in a single GP run, which can be compared by the user for better insight into the problem being tackled.
We use a simple formula for complexity in this work, based on the number of nodes in each of the two trees, $T_1$ and $T_2$, in a GP individual $I$:

$$Complexity(I) = nodes(T_1) + nodes(T_2) \quad (6)$$
A given node counts towards the complexity unless it is a Zero node; these are not counted, as they exist only to allow flexibility in the arity of the addition function and can be removed from the tree structure when interpreting it.
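The complexity objective can be sketched in a few lines; the nested-tuple tree representation and the `"zero"` marker are our own illustrative choices, not the method’s actual data structures:

```python
# Sketch of the complexity objective (Eq. 6): count nodes across both
# trees of an individual, skipping Zero nodes. A tree is a nested tuple
# (function, child, ...) and "zero" marks a Zero node (illustrative only).

def count_nodes(tree):
    if tree == "zero":
        return 0                      # Zero nodes are not counted
    if isinstance(tree, tuple):
        return 1 + sum(count_nodes(c) for c in tree[1:])
    return 1                          # terminal: feature or constant

def complexity(individual):
    """Sum the node counts of the individual's two trees."""
    return sum(count_nodes(t) for t in individual)

# Example: x-tree is add(f0, zero, mul(f1, c1)) = 5 nodes; y-tree is f2.
tree_x = ("add", "f0", "zero", ("mul", "f1", "c1"))
print(complexity((tree_x, "f2")))  # → 6
```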
III-E Optimisation of Tree Constants
To further increase visualisation quality without introducing additional model complexity, we use Particle Swarm Optimisation (PSO) [kennedy1995particle] to fine-tune the ephemeral random constants (ERCs) of each individual in the final front at the end of the evolutionary process. Standard GP has no ability to fine-tune its numerical parameters efficiently (as it randomly searches the parameter space); by employing PSO we can do so. Each ERC (from both trees) is allocated a dimension in the PSO representation, with the value of a given dimension representing how much the value of that ERC varies from its original value. We use a very small range of initial position values and low minimum and maximum velocities to focus the PSO search on fine-tuning the ERCs in the tree. One of the 30 particles is initialised with all values set to 0 (i.e. the original ERC values) so that the PSO search is guaranteed not to produce inferior solutions. The PSO search is single-objective (as the tree structure is fixed), with the fitness function being the same cost function used as the first objective in the GP search (Eq. 5). PSO is applied only at the end of the GP evolutionary process, both due to its computational cost and to prevent GP from falling into local minima that would result from fine-tuning during evolution.
Other techniques such as the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [cmaes2003] have also been used for numerical optimisation, but PSO has been shown to give considerably better results on ill-conditioned functions (those with a high condition number, whose output is overly sensitive to changes in the values of their variables) [hansen2011impacts]. GP trees of sufficient complexity are expected to have the characteristics of an ill-conditioned function, as a change in a given ERC value is likely to significantly change the output of the tree due to the interactions present across different subtrees. Preliminary testing confirmed that PSO could optimise the random constants of the final GP individuals more effectively and consistently than CMA-ES.
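The fine-tuning scheme can be sketched as a minimal global-best PSO over ERC deviation vectors. All hyperparameter values below (inertia, acceleration coefficients, iteration count) are our own assumptions; only the small initial range, the low velocity clamp, and the zero-initialised particle follow the description above:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_finetune(cost, n_ercs, n_particles=30, iters=50,
                 init_range=0.1, v_max=0.05, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO sketch: each dimension is the deviation applied to one
    ERC; cost(x) evaluates the tree with ERCs shifted by x (Eq. 5)."""
    X = rng.uniform(-init_range, init_range, (n_particles, n_ercs))
    X[0] = 0.0                          # one particle keeps original ERCs
    V = np.zeros_like(X)
    pbest = X.copy()
    pbest_cost = np.array([cost(x) for x in X])
    g = pbest[pbest_cost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(X.shape), rng.random(X.shape)
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (g - X)
        V = np.clip(V, -v_max, v_max)   # low max velocity: local search
        X = X + V
        costs = np.array([cost(x) for x in X])
        better = costs < pbest_cost
        pbest[better], pbest_cost[better] = X[better], costs[better]
        g = pbest[pbest_cost.argmin()].copy()
    return g, float(pbest_cost.min())
```

Because the zero particle’s personal best starts at the original constants, the returned cost can never exceed that of the unmodified individual, matching the no-inferior-solutions guarantee described above.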
III-F Other Considerations
We use a slight variation of the standard MOEA/D, whereby we prevent the breeding process from producing individuals which have already been created in earlier generations (the breeding process tries up to 10 times to produce a non-duplicate individual; otherwise it returns one of the parents, chosen randomly). While this adds a small amount of overhead to the evolutionary process, it gives a much more efficient and effective learning process, which was ultimately found to significantly improve the diversity and quality of the final approximated front. The weight vectors in MOEA/D were scaled to encourage a uniformly distributed front, with the cost and complexity objectives each scaled to comparable ranges. The evolutionary process was sped up through the use of multi-threaded evaluation, caching of quality values for previously-seen trees, and linear algebra libraries which allow significantly faster matrix operations through native code. An overview of the GP-tSNE algorithm is shown in Algorithm 1.
IV Experiment Setup
The proposed GP-tSNE method was applied to a range of representative datasets which contain well-formed classes that lend themselves well to visualisation. The nine datasets are summarised in Table III, and cover a range of numbers of instances, features, and classes. These datasets come from a number of different domains, including general classification, biology, and image analysis. Most were sourced from the UCI repository [uci]. The standard t-SNE implementation [maatenTSNE] was chosen as a baseline method, as it has the same optimisation measure as GP-tSNE and is regarded as the state of the art in visualisation techniques. We used a standard perplexity value (the same as in GP-tSNE). We also compare to our previous GP-MaL method [lensen2019can], which was the first GP method to perform MaL, using a single-objective approach. GP-MaL parameter settings were unchanged from the original paper.
Dataset  Instances  Features  Classes 

Iris  150  3  3 
Wine  178  13  3 
Dermatology  358  34  6 
Breast Cancer Wisconsin  683  9  2 
COIL20  1440  1024  20 
Isolet  1560  617  26 
MFAT  2000  649  10 
MNIST 2class  2000  784  2 
Image Segmentation  2310  19  7 
We found that the use of a multi-objective approach necessitated a large number of generations (2,500) in order to allow the biggest trees (with the best performance) to be trained sufficiently. However, only a small population size of 100 individuals was needed to achieve reasonable coverage of the front, especially due to the technique used to breed unique individuals. The remaining parameters (Table IV) are standard settings; tuning them further gave no change in performance.
Standard GP mutation is used, with one of the two trees in an individual randomly chosen to be mutated. Crossover is performed by randomly selecting either the first or the second tree, and then performing crossover between that tree in each parent, so that crossover occurs between the same axis of the visualisation.
Parameter  Setting  Parameter  Setting 

Generations  2500  Population Size  100 
Mutation  20%  Crossover  80% 
Min. Tree Depth  2  Max. Tree Depth  14 
No. Trees  2  Pop. Initialisation  Halfandhalf 
All three methods were run 30 times on each dataset. As our evaluation is primarily based on qualitative analysis (as is standard in visualisation [maatenTSNE, 2018arXivUMAP]), we chose the median result of the 30 runs according to the objective function used by each method: t-SNE’s cost function for t-SNE and GP-tSNE, and the fitness function proposed in [lensen2019can] for GP-MaL.
t-SNE runs (by default) for up to 1,000 iterations, corresponding to 1,000 evaluations of the t-SNE cost function. GP-tSNE runs for 2,500 generations with a population size of 100, giving 250,000 evaluations. Clearly, GP-tSNE is computationally more expensive, although the use of fitness caching and multi-threading reduces this expense. Also, GP-tSNE produces many visualisations in a single run, whereas t-SNE produces only one. On the biggest dataset, Image Segmentation, GP-tSNE takes 30 hours, whereas t-SNE takes around two minutes. We intend to significantly reduce this difference in future work, through the use of surrogate techniques and tuning of the GP design, but note that GP-tSNE is highly parallelisable, and so can run much more quickly in practice.
V Results and Discussion
The results across each of the nine datasets are shown in Figs. 1–9. Each figure contains eight plots, which we refer to as (a)–(h), reading left-to-right and top-to-bottom. Plot (a) shows the attained approximation front of the (median) GP result, with blue crosses marking each individual in the front and the black line showing the shape of the front. The y- and x-axes show each individual's value for Objective 1 (cost/visualisation quality) and Objective 2 (model complexity) respectively. The grey lines show the worst and best of the 30 GPtSNE runs, illustrating the variance of the GPtSNE algorithm. The green and yellow dotted lines represent the cost achieved by the baseline tSNE and GPMaL results respectively. Plot (b) shows the same front, but with the x-axis truncated at a complexity of 200, so as to focus on the distribution of the simpler/more interpretable trees. Plots (c)–(f) show representative visualisations, coloured by class label, produced by individuals at different levels of model complexity (highlighted in red on the attained front): (c) is the simplest model found by GPtSNE, where each tree is a single feature (i.e. feature selection (FS)); (d) and (e) are two characteristic samples from along the front; and (f) is the most complex model, which gives the most accurate visualisation and the lowest cost. A small amount of jitter is added to visualisations with low complexity (less than 20) so that overlapping points can be better distinguished. Plots (g) and (h) are the tSNE and GPMaL baseline results with median performance, where the cost written above each plot is calculated using tSNE's objective function (the first objective of GPtSNE) to allow for sensible comparisons.
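The jitter step applied to low-complexity visualisations can be sketched as follows. The noise scale, the Gaussian distribution, and the per-axis scaling are our assumptions; the paper only states that a small amount of jitter is added below a complexity of 20.

```python
import numpy as np

def add_jitter(points, complexity, threshold=20, scale=0.01, seed=0):
    """Add small Gaussian noise (relative to the range of each axis) to a
    low-complexity embedding so that points mapped to identical coordinates
    can be told apart in a scatter plot. Above the complexity threshold the
    points are returned untouched."""
    if complexity >= threshold:
        return points
    rng = np.random.default_rng(seed)
    spread = points.max(axis=0) - points.min(axis=0)
    return points + rng.normal(0.0, scale, size=points.shape) * spread
```

Scaling the noise by the per-axis spread keeps the jitter visually consistent regardless of the absolute range of the evolved trees' outputs.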
It is only possible to show a small sample of the evolved visualisations in this paper, but one of the significant advantages of GPtSNE is its ability to produce a series of visualisations that become progressively more complex but also more accurate representations of the high-dimensional feature space. To demonstrate this, we have included a video that shows each individual along the approximation front for each dataset, as the front is traversed from least to most complex. This is included with our submission, and is also available on YouTube (https://www.youtube.com/watch?v=Kz1jhBDHlY). Specific visualisations can be studied in more detail by lowering the playback speed (e.g. to 25% on YouTube).
V-A Specific Analysis
On the simplest dataset, Iris (Fig. 1), all three methods provide a clear 2D visualisation, although tSNE and GPtSNE separate the green class more distantly. The Sample 1 visualisation for GPtSNE is quite similar to that produced by GPMaL when rotated, but with less continuous spacing of points, likely because the trees are too small to convert the low-granularity feature space into a more complex output space. The front shows that diminishing returns are reached quickly on this dataset: after a tree size of 34 (Sample 2), quality improves only gradually as tree size increases. Indeed, the difference between the visualisations at complexities 34 and 1923 is very minor; clearly GPtSNE can capture the bulk of the structure in the simple Iris dataset using a simple model.
Figure 2 shows the results on another simple dataset: Wine. All three methods produce clear visualisations, with the main difference being that tSNE projects all three classes diagonally across the visualisation, whereas the two GP methods produce a curved pattern. Even at a low level of complexity, GPtSNE clearly separates the three classes, and higher levels of complexity mostly refine the local structure within each class. As on the Iris dataset, the cost achieved by GPtSNE at its maximum complexity is reasonably close to that of tSNE, despite being produced by a mapping rather than just an embedding.
Both tSNE and GPtSNE clearly separate the two classes on the Wisconsin Breast Cancer dataset (Fig. 3), whereas GPMaL does not give as clear a separation boundary. GPtSNE begins to show the division between the two classes as early as a complexity of 33, with the red class becoming more tightly packed as the complexity increases further. Indeed, the GP models appear to retain the same general “approach” in Samples 1 and 2 and at the maximum complexity, while becoming increasingly refined and accurate. In this way, it could be possible to use the simpler models to understand the more accurate patterns found by the bigger, uninterpretable models, by exploring how points move as complexity increases.
The Dermatology dataset (Fig. 4) further reinforces this pattern: as GPtSNE produces increasingly complex models, the visualisation becomes “higher-resolution”, with points appearing to lie in a continuous rather than a discrete space. Sample 1 shows the blue class clearly separated, and the red and purple classes beginning to separate from the other classes. Sample 2 shows a clear separation between all classes except the yellow and green ones, despite having a complexity of only 86. The maximum-complexity model separates the classes further and compresses each class more tightly. At its lowest cost, GPtSNE gives a very similar visualisation to tSNE, despite having a higher cost. GPMaL is clearly worse than GPtSNE, even when GPtSNE has a tree size as low as 86.
The COIL20 dataset (Fig. 5) highlights tSNE's ability to freely move points throughout the 2D space, as it clearly produces the best visualisation. It is interesting to note, however, that GPtSNE is able to separate some classes well even at low model complexity. For example, in Sample 1 (complexity of 22), a number of classes (dark green, pink, blue) are already clearly visible; this suggests that these classes may be particularly well-defined, as they are easily separated. As the complexity increases, additional classes separate out, and the curve topology of the blue/pink/red classes seen in the tSNE visualisation starts to form for GPtSNE too.
Figure 6 shows that the Isolet dataset has many overlapping classes, with only a few classes distinctly separated by tSNE. The GP methods generally overlap the same classes as tSNE, but do not manage to separate the groups quite as successfully. Isolet has the highest number of classes (26) of all the datasets, which makes it especially challenging for GP-based methods: evolving two functions with enough granularity to separate 26 classes requires suitably complex trees. Despite this, it may be possible to gain some insight into why given classes overlap, as the general overlapping patterns start to be visible even at lower complexities of 39 to 131.
On the MFAT dataset (Fig. 7), GPtSNE is able to group each class effectively at the maximum complexity, but does not show the clear separation between classes that tSNE provides. GPtSNE clearly performs significantly better than GPMaL, which overlaps the light green and purple classes onto other classes. The grouping of classes starts to appear at a complexity of 94, although the orange and purple classes still overlap. A complexity of 94 corresponds to two trees with 94 nodes in total, which is clearly quite restrictive given that the MFAT dataset has 649 features and 10 classes.
The visualisations in Fig. 8 (MNIST 2class) provide an example of a large dataset with a small number of classes. All three methods are able to separate the two classes reasonably well, with GPtSNE and tSNE producing quite similar results. This separation is evident even at low model complexity in GPtSNE – at a complexity of 51, it is possible to draw a vertical line through the middle of the visualisation that would separate the two classes with reasonable accuracy. At higher levels of complexity, GPtSNE is able to push the two classes apart to leave a clear gap between them. GPMaL produces a less clear separation and struggles to distribute the points in the visualisation space well.
On the final dataset, Image Segmentation (Fig. 9), all three methods are able to separate out the green and orange classes, with tSNE and GPtSNE providing clearer separation margins. Both GPtSNE and tSNE also separate the red class to some extent, with tSNE doing so slightly more effectively. An interesting observation is that the orange class, and to a lesser extent the green class, is separated at very low model complexity. The orange class is easily separated using only one feature, whereas the green class can be separated at a complexity of 31. This suggests that these may be more natural/intrinsic classes in the dataset; perhaps they exhibit characteristics that make them particularly distinct from the other instances. Another interesting finding is that the red class is separated well by GPtSNE at a complexity of 11, but then overlaps with the purple class at higher complexity. This may be a limitation of the tSNE objective function, or may suggest that the red class is actually a subclass of the purple class and is incorrectly over-separated at low complexity. Indeed, the red class corresponds to pixels that are part of a path in an image, and the purple class to cement pixels. Clearly, many different things can be constructed out of concrete, including paths, which explains the overlap that is evident at higher complexities. This phenomenon highlights another useful characteristic of GPtSNE: it shows the different relationships between classes at different levels of model complexity.
V-B General Findings
The visualisations produced by GPtSNE were consistently superior to those of our previous GP method, GPMaL. GPMaL was developed for more general MaL (i.e. not just reducing to two dimensions); by using a visualisation-specific quality measure, GPtSNE was able to clearly improve on it. As the number of instances increased, tSNE began to produce clearer visualisations than GPtSNE, although both methods showed the same general patterns. Given that GPtSNE is in essence tackling a strictly more difficult task, that of evolving a functional model to produce a visualisation, we see this as a success for the first such approach to the task. This pattern can be further seen in the convergence analysis shown in Fig. 10. On the easier datasets (the first few in the legend), GPtSNE converges before 1,000 generations. On the hardest datasets, such as MNIST and MFAT, GPtSNE is likely to benefit from additional generations. We are confident that future research in this new direction will be able to further improve GPtSNE on high-dimensional datasets with many classes.
One particularly interesting observation was how GPtSNE produced similar overall visualisations at different model complexities, with the more complex models being more granular/refined than the simpler ones. This demonstrates the advantage of an evolutionary approach: good subtrees of a complex tree can be used to improve simpler trees (and vice versa) via crossover, allowing for a more efficient search that produces models with different levels of tradeoff. In the following section, we explore this phenomenon in more depth to highlight the novel advantages of GPtSNE.
VI Further Analysis
The results of GPtSNE on the Dermatology dataset (Fig. 4) were of particular interest: GPtSNE and tSNE produced similar results; the same general visualisation appeared, at different qualities, across tree sizes; and good visualisations were achieved even at reasonably small model complexities. To further demonstrate the value of GPtSNE, this section analyses a selection of increasingly complex trees produced by the median GP run on this dataset.
VI-A Simple Models
The simplest model that gives a reasonable visualisation is shown in Fig. 13; one tree contains three nodes and the other ten, for a total complexity of 13. Even at such a low complexity, three of the classes start to appear, with the blue, purple, and red classes already showing clear separation from the others. The top tree, which gives the x-axis of the visualisation, can be written as . The bottom tree, which gives the y-axis, is simply . These two trees use a total of seven unique features out of the 34 in the original feature set, yet they are able to separate three classes of the dataset well. In particular, the blue, red and purple classes can be separated out by drawing two vertical lines through the plot. In other words, two thresholds can be applied to the output of the top tree (x-axis) in order to roughly separate the classes into three groups: red; orange and green; and purple and blue. Given that all features have the same range of and the only feature subtracted in the tree is , this suggests that the blue and purple classes have particularly small feature values for (as they have high x values), and the red class has particularly large values (as it has low x values). In the Dermatology dataset, corresponds to the “thinning of the suprapapillary epidermis” feature, which is commonly associated with the skin condition psoriasis [lever2015histopathology]. In the visualisation in Fig. 13, the red class corresponds to a diagnosis (class label) of psoriasis — and indeed, the red instances appear on the left side of the plot, consistent with a high value of being subtracted from the rest of the tree. This sort of analysis could be used by a clinician to understand that this feature alone could diagnose psoriasis with reasonable accuracy, and it also provides greater confidence in the visualisation's accuracy.
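Since the evolved expressions themselves are reproduced only in Fig. 13 (the equations are not legible in this copy), the following hypothetical sketch illustrates only the *form* such a low-complexity model takes: two small linear trees, each selecting a handful of features and mapping an instance directly to one visualisation axis. Every feature index below is invented for illustration and does not correspond to the actual evolved trees.

```python
def simple_model(instance):
    """Hypothetical example of a low-complexity GPtSNE model: two small
    linear trees over a 34-feature Dermatology-style instance. The feature
    indices are illustrative only, NOT the features the paper's evolved
    trees actually use."""
    f = instance                          # list of normalised feature values
    x = (f[3] + f[14] + f[25]) - f[21]    # "top" tree: x-axis coordinate
    y = f[4] + f[8] - f[32]               # "bottom" tree: y-axis coordinate
    return x, y
```

A model of this form is interpretable precisely because each axis is a short, explicit expression over a few named features, so thresholds on an axis translate directly back into statements about the underlying feature values.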
When the model complexity is increased to 19 (Fig. 13), the separation of points within classes starts to become better-defined, and the orange class starts to become more distinct from the yellow and green classes. The top tree actually uses fewer unique features (three) than at a complexity of 13, with significant weight put on : , whereas the bottom tree uses four unique features: , again for a total of seven unique features across both trees. The blue class (“lichen planus”) is clearly distinct from the other classes along the x-axis; given that the top tree weights heavily, it seems likely that a high value of this feature is characteristic of the “lichen planus” diagnosis. Indeed, the dermatology literature commonly reports this symptom as being indicative of this diagnosis [lever2015histopathology].
The trees shown in Fig. 13 appear very similar to the previous model, with the top tree varying only by the omission of one nf node, and the introduction of the subtraction of the nf feature. This is, however, sufficient to cause some of the orange points to be separated from the yellow and green points along the xaxis — indicating that higher values of nf may indicate a point belongs to the orange class rather than the yellow one. The major change in the bottom tree is the introduction of the nf and nf features, which helps to compact the blue class and starts to separate the red class more strongly.
The model in Fig. 16 further pushes the blue class away from the other classes, at the expense of squashing the other classes together. This is achieved by making the top tree (x-axis) weight f and f even more strongly. It is interesting to note that all the trees analysed so far have been strictly linear functions, utilising only addition and subtraction. This is perhaps not unexpected: more sophisticated functions are likely to require complex trees to be used to full advantage — for example, taking the maximum of two simple subtrees is unlikely to be more representative of the original dataset than simply adding features together. This also has the benefit of making the simpler trees easier to interpret and understand. While there are many methods for performing linear transformations for visualisation, such as PCA, these methods generally weight all features to different extents; GPtSNE selects only a subset of features, and so is inherently more interpretable. The sequential analysis of increasingly complex models clearly provides additional insight that would be lacking in a non-multiobjective approach. This analysis technique is also an exclusive benefit of GPtSNE among visualisation techniques, which are primarily black boxes producing one visualisation per run.
VI-B Complex Models
From here, the improvement in quality from additional complexity begins to show clear diminishing returns, with the cost decreasing by only about 0.1 as the complexity roughly doubles from 33 to 59, and again from 59 to 104. Figure 16 gives a visualisation that is very similar to that of the maximum complexity (see Fig. 4), with the yellow and green classes overlapping but all other classes clearly and distinctly separated. This is also the first stage at which constants and nonlinear operators such as and are introduced. The trees used in previous models are still recognisable as subtrees (or “building blocks”) of the more complex trees in Fig. 16; it may be possible to analyse the more complex trees in terms of what has been added to these subtrees. Figure 16 does not change the class separations present in Fig. 16, but better separates within-class instances, as well as moving the purple class away from the other classes.
We do not attempt to analyse models with a complexity of over 100 in detail, as such trees become difficult to interpret and eventually become as much of a black box as tSNE itself. The main aim of this paper is to produce good-quality (not state-of-the-art) visualisations that are interpretable, in order to provide deeper insight into the relationships within a dataset in terms of the features it uses. If visualisation quality is the primary goal, then methods such as tSNE are expected to be more appropriate, given that they are not constrained to finding a functional mapping from the original feature space to the visualisation axes. While the very complex GP trees cannot feasibly be interpreted, they are still important to the GPtSNE algorithm. During the evolutionary process, it is common to see complex trees with low cost being automatically simplified through the removal of superfluous subtrees via crossover and mutation, allowing reductions in cost to “trickle down” to simpler models along the front. When we restricted the EMO search to only simple trees, results were much poorer, with smaller trees having much higher cost. The complex trees are also useful in their own right, as they can be applied directly to visualise future instances (e.g. streaming data), whereas tSNE must re-optimise its embedding each time.
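Reusing an evolved model on unseen instances amounts to simply calling the two trees as a function; a minimal sketch (the function name and model interface are our own, for illustration):

```python
def visualise_stream(model, instances):
    """An evolved GPtSNE model is a reusable function: new instances can be
    mapped straight into the existing visualisation space with no
    re-optimisation. `model` maps one instance to an (x, y) pair; tSNE, by
    contrast, would have to re-embed the whole (augmented) dataset."""
    return [model(inst) for inst in instances]
```

This is what makes the functional mapping suitable for streaming settings: the cost of placing a new point is one model evaluation, independent of the dataset size.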
VI-C Summary
In this section, we showed how progressive examination of the approximation front from least to most complex models allowed insight into the structure of the data that is not easily achievable through traditional visualisation algorithms. Using the Dermatology dataset as a case study, we showed that even simple models could separate out some classes clearly, with the use of only a few features. As we increased the complexity, we found that the granularity/local structure within classes improved, but the overall patterns changed only slowly. In this way, it is possible to use simple, understandable models to provide insight into how more complex models are able to produce goodquality visualisations.
VII Conclusion
This paper highlighted the need for research into machine learning methods that can not only produce clear visualisations, but do so through a model that can itself be interpreted to give insight into how the visualisation arises from the features of the dataset. GP was suggested for this task due to its functional model structure, and a multiobjective approach was proposed that gave a staggered front of visualisations with different levels of tradeoff between visual clarity and model complexity. Results on nine datasets highlighted the promise of a GP approach for this task, with visualisations that showed clear characteristics of datasets even at model complexities low enough to be understood. This was further showcased through an in-depth analysis of an evolved front on the Dermatology dataset, where concrete insight into the structure of the dataset was gained by examining a range of models of different complexities.
As this is the first multiobjective GP approach to this problem, there is a clear need for continued future research. While the quality of the visualisations was encouraging, on more complex datasets they fell short of those produced by tSNE. This is not unexpected, given that tSNE produces an embedding of the data rather than a functional mapping. The measure of visualisation quality used in this work was based on that of tSNE, as it is the state-of-the-art measure. However, the design of tSNE was constrained by the need for a differentiable cost function; as an EC-based method, GPtSNE has no such constraint on its objective function, and there are likely better measures of visualisation quality that could be used in an EC context. It was also found that the more unique features a GP tree used, the more difficult it was to understand; some form of evolutionary pressure towards trees that use a small set of (cohesive) features may improve this.