Applications of Machine Learning and High Dimensional Visualization in Cancer Diagnosis and Detection

John F. McCarthy*, Kenneth A. Marx, Patrick Hoffman, Alex Gee, Philip O'Neil, M.L. Ujwal, and John Hotchkiss
AnVil, Inc., 25 Corporate Drive, Burlington, MA 01803
*corresponding author: [email protected]; (781) 828-4230

Abstract

{John M will provide}

Introduction

{John M. will provide}

0. Data Analysis by Machine Learning

Overview of Machine Learning and Visualization

Machine learning is the application of statistical techniques to derive general knowledge from specific data sets by searching through the possible hypotheses exemplified in the data. The goal is typically to build predictive or descriptive models from the characteristic features of a dataset and then use those features to draw conclusions from other, similar datasets [0.1]. In cancer diagnosis and detection, machine learning helps identify significant factors in high dimensional datasets of genomic, proteomic, chemical, or clinical data that can be used to understand or predict underlying disease. Machine learning techniques serve as tools for finding the "needle in the haystack" of possible hypotheses formulated by studying the correlation of protein or genomic expression with the presence or absence of disease. The same techniques can also be used to search chemical structure databases for correlations of chemical structure with biological activity.

In the process of analyzing how accurately concepts generalize from a dataset, high dimensional visualizations give the researcher time-saving tools for assessing the significance, biases, and strength of a possible hypothesis. In dealing with a potentially discontinuous high dimensional concept space, the researcher's intuition benefits greatly from a visual validation of the statistical correctness of a result.
The visualizations can also reveal sensitivity to variance, non-obvious correlations, and unusual higher order effects that are scientifically important but would require time-consuming mathematical analysis to discover.

Cancer diagnosis and detection rely on a group of machine learning techniques called classification techniques. Classification can be done by supervised learning, where the classes of objects are already known and are used to train the system to learn the attributes that most effectively discriminate among the members of each class. For example, given a set of gene expression data for samples with known disease classes, a supervised learning algorithm might learn to classify disease states based on patterns of gene expression. In unsupervised learning (also known simply as clustering), there are either no predetermined classes, or the class assignments are ignored, and data objects are grouped together by cluster analysis based on a wide variety of similarity measures. In both supervised and unsupervised classification, an explicit or implicit model is created from the data to help predict future data instances or to understand the physical processes responsible for generating the data. Creating these models can be a very computationally intensive task, given the large size and dimensionality of typical biological datasets. As a consequence, many such models are prone to "overfitting," which may reduce the external validity of the model when applied to other datasets of the same type. Feature selection and reduction techniques help with both compute time and overfitting by reducing the data attributes used in creating a data model to those that are most important in characterizing the hypothesis. This process can reduce analysis time and create simpler and (sometimes) more accurate models.
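As a concrete illustration of supervised classification, the sketch below trains a minimal k-nearest-neighbor classifier on a small synthetic expression-like table. The data values, function name, and parameter choices are ours for illustration only; they are not drawn from the cancer examples discussed in this chapter.

```python
import numpy as np

# Toy "expression" table: 6 samples x 4 features, two known classes.
# All values are synthetic illustrations, not real measurements.
X = np.array([
    [0.9, 0.8, 0.1, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.7, 0.8, 0.1, 0.3],
    [0.1, 0.2, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
    [0.3, 0.2, 0.7, 0.8],
])
y = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

print(knn_predict(X, y, np.array([0.85, 0.75, 0.15, 0.25])))  # a class-0-like profile -> 0
```

In practice the training classes would come from known disease states, and the features from gene or protein expression measurements.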
In the three cancer examples presented subsequently, both supervised and unsupervised classification, as well as feature reduction, are used and will be described. In addition, we discuss the use of high dimensional visualization in conjunction with these analytical techniques. One particular visualization, RadViz™, incorporates machine learning techniques in an intuitive and interactive visual display. Two other high dimensional visualizations, Parallel Coordinates and PatchGrid (similar to HeatMaps), are also used to analyze and display results. Below we summarize the classification, feature reduction, validation, and visualization techniques used in the examples that follow. Particular emphasis is placed on explaining RadViz™ and Principal Uncorrelated Record Selection (PURS), which have been developed by the authors.

Machine Learning Techniques: Classification techniques range from the statistically simple testing of sample features for significance to sophisticated probabilistic modeling techniques. The supervised learning techniques used in the following examples include Naïve Bayes, support vector machines, instance-based learning (k-nearest neighbor), logistic regression, and neural networks. Most of the work in the following examples is supervised learning, but it also includes some unsupervised hierarchical clustering using Pearson correlations. There are many excellent texts giving detailed descriptions of the implementation and use of these techniques [0.1].

Feature Reduction Techniques: In machine learning, a dataset is usually represented as a flat file or table consisting of m rows and n columns. A row is also called a record, a case, or an n-dimensional data point. The n columns are the dimensions, also called the "features," "attributes," or "variables" of the data points. One of the dimensions is typically the "class" label used in supervised learning.
Machine learning classifiers do best when the number of dimensions is small (less than 100) and the number of data points is large (greater than 1000). Unfortunately, in many biochemical datasets the situation is reversed: the number of dimensions is large (e.g., 30,000 genes) and the number of data points is small (e.g., 50 patient samples). The first task is therefore to reduce the dimensions or features so that machine learning techniques can be used effectively.

There are a number of statistical approaches to feature reduction that are quite useful. These include the application of pairwise t-statistics and F-statistics to select the features that best discriminate among classes. A more sophisticated approach is one we call Principal Uncorrelated Record Selection, or PURS. PURS begins by selecting seed features with a high t- or F-statistic. Features that correlate highly with the seed features are then repeatedly deleted, while a feature that does not correlate highly with any feature already in the seed set is added to that set. We repeat this process, reducing the correlation threshold, until the seed feature set is reduced to the desired dimensionality. We also use random feature selection to train and test classifiers; this provides a baseline against which to measure the improvement in predictive power of the more selectively chosen feature set.

Validation Techniques: Perhaps the most significant challenge in the application of machine learning to biological data is validation: the task of estimating the error rate a classifier will have when applied to a new dataset. The data used to create a model cannot be used to predict the performance of that model on other datasets. To evaluate the external validity of a given model, the features selected as important for classification must be tested against a different dataset that was not used in the creation of the original classifier.
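The PURS procedure described above can be sketched as follows. This is a minimal greedy reading of the description, not the authors' implementation; the function name, the starting threshold, and the step size are our assumptions.

```python
import numpy as np

def purs(data, scores, target_dim, threshold=0.95, step=0.05):
    """Greedy sketch of PURS: keep high-scoring, mutually uncorrelated features.

    data:    samples x features matrix
    scores:  per-feature significance (e.g. |t| or F statistic)
    Returns the indices of the selected features.
    """
    order = np.argsort(scores)[::-1]          # best-scoring features first
    corr = np.abs(np.corrcoef(data.T))        # feature-feature correlations
    while True:
        selected = []
        for f in order:
            # keep f only if it is not highly correlated with any kept feature
            if all(corr[f, s] < threshold for s in selected):
                selected.append(f)
        if len(selected) <= target_dim or threshold <= step:
            return selected[:target_dim]
        threshold -= step                     # tighten the threshold and retry
```

Lowering the threshold causes more features to count as "highly correlated" with the seeds, shrinking the selection toward the desired dimensionality, as in the text.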
An easy solution to this problem is to divide the data into a training set and a test set. However, since biological data are usually expensive to acquire, large datasets that allow this subdivision while retaining the statistical power to generalize knowledge from the training set are hard to find. To overcome this problem, a common machine learning technique called 10-fold cross-validation is sometimes used. This approach divides the data into 10 groups, creates the model using 9 of the 10 groups, and then tests it on the remaining group. The procedure is repeated iteratively until each of the 10 groups has served as the test group. The ten error estimates are then averaged to give an overall sense of the predictive power of the classification technique on that dataset.

Another technique used to help predict performance on limited datasets is an extension of the 10-fold idea called leave-one-out validation. In this technique one data point is left out of each iteration of model creation and is used to test the model. This is repeated until every data point in the dataset has been used once as test data. The approach is nicely deterministic, whereas 10-fold cross-validation requires careful random stratification of the ten groups. In contrast to the 10-fold approach, however, it does not give as useful a characterization of the accuracy of the model for some distributions of classes within the datasets.

High Dimensional Visualization

Although a number of conventional visualizations can help in understanding the correlation of a small number of dimensions with an attribute, high dimensional visualizations have been difficult to understand and use because of the potential loss of information that occurs in projecting high-dimensional data down to a two- or three-dimensional representation. There are numerous visualizations and a good number of valuable taxonomies of visual techniques [0.2].
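The k-fold and leave-one-out procedures described earlier can be sketched generically as below. The function names and the demonstration nearest-mean classifier are ours; any classifier of interest could be substituted for the error function.

```python
import numpy as np

def k_fold_error(X, y, error_fn, k=10, seed=0):
    """Average held-out error from k-fold cross-validation.

    error_fn(X_tr, y_tr, X_te, y_te) returns the error rate of some
    classifier trained on the training folds and scored on the test fold.
    Setting k = len(y) gives leave-one-out validation.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                  # shuffle before splitting
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(error_fn(X[train], y[train], X[test], y[test]))
    return float(np.mean(errs))

def nearest_mean_error(X_tr, y_tr, X_te, y_te):
    """Demo classifier: assign each test point to the nearer class mean."""
    m0 = X_tr[y_tr == 0].mean(axis=0)
    m1 = X_tr[y_tr == 1].mean(axis=0)
    pred = (np.linalg.norm(X_te - m1, axis=1)
            < np.linalg.norm(X_te - m0, axis=1)).astype(int)
    return float(np.mean(pred != y_te))
```

For example, `k_fold_error(X, y, nearest_mean_error, k=len(y))` performs leave-one-out validation, with each point tested exactly once. Note that this simple sketch shuffles randomly rather than stratifying by class, which is the very weakness of 10-fold validation the text mentions.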
The authors frequently make use of many different visualization techniques in the analysis of biological data, especially matrices of scatterplots [0.3], heat maps [0.3], parallel coordinates [0.4], RadViz™ [1.13], and principal component analysis [0.5]. Only RadViz™, however, is capable of dealing with ultra-high-dimensional (>10,000 dimensions) datasets, and it is very useful when used interactively in conjunction with specific machine learning and statistical techniques to explore the critical dimensions for accurate classification.

RadViz™ is a visualization, classification, and clustering tool that uses a spring analogy for the placement of data points while incorporating machine learning and feature reduction techniques as selectable algorithms. The "force" that any dimension exerts on a sample point is determined by Hooke's law: f = kd. The spring constant k, ranging from 0.0 to 1.0, is the value of the scaled dimension for that sample, and d is the distance between the sample point and the perimeter point on the RadViz™ circle assigned to that feature, as shown in Figure 0.1. The sample point is placed where the total force, determined vectorially from all features, is zero. In the RadViz™ layout illustrated in Figure 0.1 there are 16 variables or dimensions associated with the one data point plotted; however, variable 1 is a class label and is typically not used in the layout. Fifteen imaginary springs are anchored to the points on the circumference and attached to this one data point. In this example, the spring constants (or dimensional values) are higher for the darker springs and lower for the lighter springs. Normally, many data points are plotted without showing the spring lines. The values of the dimensions are normalized to lie between 0 and 1 so that all dimensions have "equal" weights. This spring paradigm layout has some interesting features.
For example, if all dimensions have the same normalized value, the data point will lie exactly at the center of the circle. If the point is a unit vector, it will lie exactly at the fixed point on the edge of the circle corresponding to its respective dimension. It should be noted that many points may map to the same position. The layout is a non-linear transformation of the data which preserves certain symmetries and produces an intuitive display. Some of the significant aspects of the RadViz™ visualization technique are as follows:

- it is intuitive: higher dimension values "pull" the data points closer to that dimension's point on the circumference
- points with approximately equal dimension values lie close to the center
- points with similar values whose dimensions are opposite each other on the circle lie near the center
- points with one or two dimension values greater than the others lie closer to those dimensions
- the relative locations of the dimension anchor points can drastically affect the layout
- an n-dimensional line maps to a line (or a single point) in RadViz™
- convex sets in n-space map into convex sets in RadViz™

The RadViz™ display combines the n data dimensions into a single point for the purpose of clustering, while also integrating embedded analytic algorithms in order to intelligently select and radially arrange the dimensional axes. This arrangement is performed through a set of algorithmic procedures based upon the dimensions' significance statistics, which optimize clustering by maximizing the distance separating clusters of points. The default arrangement is to have all dimensions equally spaced around the perimeter of the circle; the feature reduction and class discrimination algorithms then optimize the arrangement of features in order to increase the separation of the different classes of sample points.
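The force-balance placement described above has a closed form: setting the sum of the spring forces k_j(a_j - u) to zero gives u as the k-weighted average of the anchor positions a_j. A minimal sketch of that computation, with equally spaced anchors and our own function name, is:

```python
import numpy as np

def radviz_layout(points):
    """Place points (rows of values normalized to [0, 1]) in the RadViz circle.

    Dimension j is anchored at angle 2*pi*j/n on the unit circle.  The spring
    forces k_j * (anchor_j - u) on a sample sum to zero exactly when u is the
    k-weighted average of the anchors, which is what is computed here.
    """
    n = points.shape[1]
    angles = 2 * np.pi * np.arange(n) / n
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])
    weights = points / points.sum(axis=1, keepdims=True)  # assumes nonzero rows
    return weights @ anchors
```

This reproduces the properties listed above: a point with equal values in every dimension lands at the center, and a unit vector lands exactly on its dimension's anchor.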
The feature reduction technique used in all figures in the present work is based on the t-statistic with Bonferroni correction for multiple tests. The RadViz™ circle is divided into n equal sectors or "pie slices," one for each class. Features assigned to each class are spaced evenly within the sector for that class, counterclockwise in order of significance within the class. As an example, for a three-class problem such as that illustrated in Figure 1.2, features are assigned to class 1 based on the t-statistic comparing class 1 samples with class 2 and class 3 samples combined. Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1 and class 3 values combined, and class 3 features are assigned based on the t-statistic comparing class 3 values with class 1 and class 2 values combined.

Occasionally, when large portions of the perimeter of the circle have no features assigned to them, the data points would all cluster on one side of the circle, pulled by the unbalanced force of the features present in the other sectors. In this case, a variation of the spring force calculation is used in which the features are divided into qualitatively different forces comprising high and low k-value classes. This is done by allowing k to range from -1.0 to 1.0. The net effect is to make some of the features "pull" (high or +k values) and others "push" (low or -k values) the points within the RadViz™ display space while still maintaining their relative point separations.

The t-statistic is a standard method for feature reduction in machine learning approaches. RadViz™ has this machine learning feature embedded in it, and it performed the selections carried out here. The advantage of RadViz™ is that one immediately sees a "visual" clustering of the results of the t-statistic selection. Generally, the amount of visual class separation correlates with the accuracy of any classifier built from the reduced feature set.
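The one-vs-rest sector assignment described above can be sketched as follows. This is a minimal reading using a pooled-variance t-statistic; the function names are ours, and the p-value computation behind the Bonferroni cutoff is omitted.

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1.0 / na + 1.0 / nb))

def assign_features(X, y):
    """One-vs-rest |t| ranking per class, mirroring the sector layout above.

    Returns {class: feature indices, most significant first}.  A Bonferroni
    cutoff on the corresponding p-values would then decide how many of the
    top-ranked features each sector keeps.
    """
    sectors = {}
    for c in np.unique(y):
        in_c, rest = X[y == c], X[y != c]
        t = np.array([abs(t_statistic(in_c[:, j], rest[:, j]))
                      for j in range(X.shape[1])])
        sectors[c] = list(np.argsort(t)[::-1])
    return sectors
```

Each class's features would then be spaced evenly within that class's sector, in the returned order of significance.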
An additional advantage of this visualization technique is that subclusters, outliers, and misclassified points can quickly be seen in the graphical layout. One standard technique for visualizing clusters or class labels is to perform a Principal Component Analysis (PCA) and show the points in a two- or three-dimensional scatter plot using the first few principal components as axes. Often this display shows clear class separation, but the features contributing most to the PCA are not easily determined, because the PCA axes represent linear combinations of the original feature set. RadViz™ is a "visual" classifier that can help one understand the important features and how those features are related within the original feature space.