Applications of Machine Learning and High Dimensional
Visualization in Cancer Diagnosis and Detection
John F. McCarthy*, Kenneth A. Marx, Patrick Hoffman, Alex
Gee, Philip O’Neil, M.L. Ujwal, and John Hotchkiss
AnVil, Inc.
25 Corporate Drive
Burlington, MA 01803
*corresponding author
[email protected];
(781) 828-4230
Abstract
{John M will provide}
Introduction
{John M. will provide}
0. Data Analysis by Machine Learning
Overview of Machine Learning and Visualization
Machine learning is the application of statistical techniques to derive general knowledge
from specific data sets by searching through possible hypotheses exemplified in the data. The
goal is typically to build predictive or descriptive models from characteristic features of a dataset
and then use those features to draw conclusions from other similar datasets [0.1]. In cancer
diagnosis and detection, machine learning helps identify significant factors in high dimensional
datasets of genomic, proteomic, chemical, or clinical data that can be used to understand or
predict underlying disease. Machine learning techniques serve as tools for finding the “needle in
the haystack” of possible hypotheses formulated by studying the correlation of protein or
genomic expression with the presence or absence of disease. These same techniques can also be
used to search chemical structure databases for correlations of chemical structure with biological
activity.
In the process of analyzing the accuracy of the generalization of concepts from a dataset,
high dimensional visualizations give the researcher time-saving tools for analyzing the
significance, biases, and strength of a possible hypothesis. In dealing with a potentially
discontinuous high dimensional concept space, the researcher’s intuition benefits greatly from a
visual validation of the statistical correctness of a result. The visualizations can also reveal
sensitivity to variance, non-obvious correlations, and unusual higher order effects that are
scientifically important but would require time-consuming mathematical analysis to discover.
Cancer diagnosis and detection involves a group of techniques in machine learning called
classification techniques. Classification can be done by supervised learning, where the classes of
objects are already known and are used to train the system to learn the attributes that most
effectively discriminate among the members of each class. For example, given a set of gene
expression data for samples with known classes of disease, a supervised learning algorithm
might learn to classify disease states based on patterns of gene expression. In unsupervised
learning (also known simply as clustering), there are either no predetermined classes, or the class
assignments are ignored, and data objects are grouped together by cluster analysis based on a
wide variety of similarity measures. In both supervised and unsupervised classification an
explicit or implicit model is created from the data to help to predict future data instances, or
understand the physical processes responsible for generating the data. Creating these models can
be a very computationally intensive task, given the large size and dimensionality of typical
biological datasets. As a consequence, many such models are prone to the flaw of “overfitting,”
which may reduce the external validity of the model when applied to other datasets of the same
type. Feature selection and reduction techniques help with both compute time and overfitting
problems by reducing the data attributes used in creating a data model to those that are most
important in characterizing the hypothesis. This process can reduce analysis time and create
simpler and (sometimes) more accurate models.
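To make the distinction concrete, the following minimal sketch contrasts the two modes on synthetic “expression” data. The dataset, the scikit-learn library, and every parameter choice here are illustrative assumptions, not the methods used in the examples below:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))          # 60 samples x 200 "genes" (synthetic)
y = np.repeat([0, 1], 30)               # known disease classes
X[y == 1, :10] += 1.5                   # make 10 features informative

# Supervised: train on known classes, then predict held-out samples.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# Unsupervised: ignore the labels and group samples by similarity alone.
groups = AgglomerativeClustering(n_clusters=2).fit_predict(X)
```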
In the three cancer examples to be presented subsequently, both supervised and unsupervised
classification, as well as feature reduction, are used and will be described. In addition, we will
discuss the use of high dimensional visualization in conjunction with these analytical techniques.
One particular visualization, RadViz™, incorporates machine learning techniques with an
intuitive and interactive visual display. Two other high dimensional visualizations, Parallel
Coordinates and PatchGrid (similar to heat maps), are also used to analyze and display results.
Below we summarize the classification, feature reduction, validation, and visualization
techniques we use in the examples that follow. Particular emphasis is placed on explaining the
techniques of RadViz™ and Principal Uncorrelated Record Selection (PURS), which have been
developed by the authors.
Machine Learning Techniques:
Classification techniques vary from simple testing of sample features for statistical
significance to sophisticated probabilistic modeling techniques. The supervised
learning techniques used in the following examples include Naïve Bayes, support vector
machines, instance-based learning (K-nearest neighbor), logistic regression, and neural networks.
Much of the work in the following examples is supervised learning, but it also includes some
unsupervised hierarchical clustering using Pearson correlations. There are many excellent texts
giving detailed descriptions of the implementations and use of these techniques [0.1].
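As one brief illustration of the unsupervised case, the sketch below performs hierarchical clustering with a Pearson-correlation distance using SciPy; the synthetic data and parameter choices are assumptions for demonstration only:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 500))          # 30 samples x 500 features (synthetic)

# SciPy's "correlation" metric is 1 - Pearson r, so highly correlated
# samples are treated as close together.
Z = linkage(X, method="average", metric="correlation")
clusters = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
```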
Feature Reduction Techniques:
In machine learning, a dataset is usually represented as a flat file or table consisting of m rows
and n columns. A row is also called a record, a case or an n-dimensional data point. The n
columns are the dimensions, but are also called “features”, “attributes” or “variables” of the data
points. One of the dimensions is typically the “class” label used in supervised learning .
Machine learning classifiers do best when the number of dimensions are small ( less than 100)
and the number of data points are large ( greater than 1000). Unfortunately, in many biochemical datasets, the dimensions are large ( ex. 30,000 genes) and the data points (ex. 50 patient
samples) are small. The first task is to reduce the dimensions or features so that machine
learning techniques can be used effectively.
There are a number of statistical approaches to feature reduction that are quite useful. These
include the application of pairwise t-statistics and using F-statistics to select the features that best
discriminate among classes.
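A minimal sketch of this kind of statistical filtering, assuming synthetic data and SciPy's standard tests (the examples below use their own implementations):

```python
import numpy as np
from scipy.stats import ttest_ind, f_oneway

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 1000))         # 50 samples x 1000 features
y = rng.integers(0, 3, size=50)         # three classes

# Pairwise t-statistic (class 0 vs class 1), one test per feature column.
t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)

# F-statistic across all three classes, again one test per feature.
F, _ = f_oneway(X[y == 0], X[y == 1], X[y == 2], axis=0)

top20 = np.argsort(-np.abs(t))[:20]     # the 20 strongest discriminators
```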
A more sophisticated approach is one we call Principal Uncorrelated Record Selection, or PURS.
PURS begins by selecting a set of seed features with high t- or F-statistics. Features that
correlate highly with the seed features are then deleted; if a given feature does not correlate
highly with any feature already in the seed set, it is then added to that set. We repeat this
process, reducing the correlation threshold, until the seed feature set is reduced to the desired
dimensionality. We also use random feature selection to train and test classifiers; this provides
a baseline against which to measure the improvement in predictive power of the more selectively
chosen feature set.
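The exact PURS implementation is not spelled out above, so the following is only a sketch of the idea as described, with the threshold schedule, scoring input, and tie-breaking all assumptions of ours:

```python
import numpy as np

def purs(X, scores, n_keep, thresh=0.9, step=0.05):
    """Greedily keep high-scoring, mutually uncorrelated features.
    X: samples x features; scores: per-feature |t| or F statistic.
    Note: the full correlation matrix is practical only after the
    feature set has already been pre-filtered to modest size."""
    order = np.argsort(-scores)                 # best-scoring seeds first
    C = np.abs(np.corrcoef(X, rowvar=False))    # feature-feature |r|
    while True:
        kept = []
        for j in order:                         # keep j only if it is not
            if all(C[j, k] < thresh for k in kept):   # too correlated
                kept.append(j)
        if len(kept) <= n_keep or thresh <= step:
            return np.array(kept[:n_keep])
        thresh -= step                          # tighten and repeat
```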
Validation Techniques:
Perhaps the most significant challenge in the application of machine learning to biological data is
the problem of validation, or the task of determining the expected error rate from a classifier
when applied to a new dataset. The data used to create a model cannot be used to predict the
performance of that model on other datasets. In order to evaluate the external validity of a given
model, the features selected as important for classification must be tested against a different
dataset which was not used in the creation of the original classifier. An easy solution to this
problem is to divide the data into a training set and a test set. However, since biological data is
usually expensive to acquire, large datasets sufficient to allow this subdivision and still have the
statistical power to generalize knowledge from the training set are hard to find. In order to
overcome this problem, a common machine learning technique called 10-fold cross-validation is
sometimes used. This approach divides the data into 10 groups, creates the model using 9 of the
groups, and then tests it on the remaining group. This procedure is repeated iteratively
until each of the 10 groups has served as the test group. The ten error estimates are then
averaged to give an overall sense of the predictive power of the classification technique on that
dataset.
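A sketch of this procedure with scikit-learn, assuming synthetic data and a logistic-regression classifier purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))          # 100 samples x 20 features
y = rng.integers(0, 2, size=100)        # two classes

# Ten stratified groups: each fold keeps roughly the same class balance.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("mean accuracy over 10 folds:", scores.mean())
```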
Another technique used to help predict performance on limited datasets is an extension of the
10-fold idea called leave-one-out validation. In this technique, one data point is left out of
each iteration of model creation and is used to test the model. This is repeated until every
data point in the dataset has been used once as test data. This approach is nicely deterministic,
whereas 10-fold cross-validation requires careful random stratification of the ten groups. In
contrast to the 10-fold approach, however, it does not give as useful a characterization
of the accuracy of the model for some distributions of classes within the dataset.
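Continuing the preceding sketch, leave-one-out is the limiting case with one fold per sample; in scikit-learn it is a one-line change (again an illustrative assumption, not the code used here):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# One fold per sample: fully deterministic, no random stratification needed.
loo_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                             cv=LeaveOneOut())  # X, y from the sketch above
print("LOO accuracy:", loo_scores.mean())
```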
High Dimensional Visualization
Although there are a number of conventional visualizations that can help in understanding the
correlation of a small number of dimensions to an attribute, high dimensional visualizations have
been difficult to understand and use because of the potential loss of information that occurs in
projecting high-dimensional data down to a two or three-dimensional representation.
There are numerous visualizations and a good number of valuable taxonomies of visual
techniques [0.2]. The authors frequently make use of many different visualization techniques in
the analysis of biological data, especially: matrices of scatterplots [0.3]; heat maps [0.3]; parallel
coordinates [0.4]; RadViz™ [1.13]; and principal component analysis [0.5]. Only RadViz™,
however, is uniquely capable of dealing with ultra-high-dimensional (>10,000 dimensions)
datasets, and it is very useful when used interactively in conjunction with specific machine learning
and statistical techniques to explore the critical dimensions for accurate classification.
RadViz™ is a visualization, classification, and clustering tool that uses a spring analogy for
placement of data points while incorporating machine learning and feature reduction techniques
as selectable algorithms. The “force” that any dimension exerts on a sample point is determined
by Hooke’s law: f = kd. The spring constant k, ranging from 0.0 to 1.0, is the value of the
scaled dimension for that sample, and d is the distance between the sample point and the
perimeter point on the RadViz™ circle assigned to that feature, as shown in Figure 0.1. The
placement of a sample point is determined by the point where the total force, determined
vectorially from all features, is zero.
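The equilibrium condition sum_i k_i (a_i - p) = 0 solves to p = (sum_i k_i a_i) / (sum_i k_i), a weighted average of the anchor points. A small sketch of that calculation (our own reconstruction, not the RadViz™ source):

```python
import numpy as np

def radviz_point(k, n_dims):
    """k: normalized values in [0, 1] for one sample, length n_dims.
    Anchors a_i sit evenly spaced on the unit circle."""
    theta = 2 * np.pi * np.arange(n_dims) / n_dims
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])
    return (k[:, None] * anchors).sum(axis=0) / k.sum()

# One dominant dimension pulls the point toward that anchor ...
print(radviz_point(np.array([1.0, 0.1, 0.1, 0.1]), 4))
# ... while equal values in every dimension land exactly at the center.
print(radviz_point(np.full(8, 0.5), 8))
```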
In the RadViz™ layout illustrated in Figure 0.1, there are 16 variables or dimensions associated
with the one data point plotted; however, variable 1 is a class label and is typically not used in the
layout. Fifteen imaginary springs are anchored to the points on the circumference and attached
to this one data point. In this example, the spring constants (or dimensional values) are higher for
the darker springs and lower for the lighter springs. Normally, many data points are plotted
without showing the spring lines. The values of the dimensions are normalized to have values
between 0 and 1 so that all dimensions have “equal” weights. This spring paradigm layout has
some interesting features.
For example if all dimensions have the same normalized value, the data point will lie exactly in
the center of the circle. If the point is a unit vector, then that point will lie exactly at the fixed
point on the edge of the circle corresponding to its respective dimension. It should be noted that
many points may map to the same position. This represents a non-linear transformation of the
data which preserves certain symmetries and which produces an intuitive display. Some of the
significant aspects of the RadViz™ visualization technique are as follows:
• it is intuitive: higher dimension values “pull” the data points closer to that dimension on the circumference
• points with approximately equal dimension values will lie close to the center
• points with similar values, whose dimensions are opposite each other on the circle, will lie near the center
• points which have one or two dimension values greater than the others lie closer to those dimensions
• the relative locations of the dimension anchor points can drastically affect the layout
• an n-dimensional line gets mapped to a line (or a single point) in RadViz™
• convex sets in n-space map into convex sets in RadViz™
The RadViz™ display combines the n data dimensions into a single point for the purpose of
clustering, while also integrating embedded analytic algorithms in order to intelligently select
and radially arrange the dimensional axes. This arrangement is performed through a set of
algorithmic procedures based upon the dimensions’ significance statistics, which optimizes
clustering by maximizing the distance separating clusters of points. The default arrangement is to
have all dimensions equally spaced around the perimeter of the circle. However, the feature
reduction and class discrimination algorithms subsequently optimize the arrangement of features
in order to increase the separation of different classes of sample points. The feature reduction
technique used in all figures in the present work is based on the t-statistic with Bonferroni
correction for multiple tests. The RadViz™ circle is divided into n equal sectors or “pie slices,”
one for each class. Features assigned to each class are spaced evenly within the sector for that
class, counterclockwise in order of significance within a given class. As an example, for a
three-class problem such as that illustrated in Figure 1.2, features are assigned to class 1 based
on the t-statistic that compares class 1 samples with class 2 and 3 samples combined. Class
2 features are assigned based on the t-statistic comparing class 2 values with class 1 and 3
combined values, and Class 3 features are assigned based on the t-statistic comparing class 3
values with class 1 and class 2 combined. Occasionally, when large portions of the perimeter of
the circle have no features assigned to them, the data points all cluster on one side of the
circle, pulled by the unbalanced force of the features present in the other sectors. In this case, a
variation of the spring force calculation is used in which the features present are effectively
divided into qualitatively different forces comprising high and low k value classes. This is done
by allowing k to range from –1.0 to 1.0. The net effect is to make some of the features “pull”
(high or +k values) and others “push” (low or –k values) the points within the RadViz™ display
space while still maintaining their relative point separations.
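A sketch of the feature-to-sector assignment just described, using one-vs-rest t-statistics per class; this reconstruction assumes synthetic data and omits the Bonferroni correction and the ±k push/pull variant for brevity:

```python
import numpy as np
from scipy.stats import ttest_ind

def assign_sectors(X, y, classes):
    """Winning class index per feature (column of X), one-vs-rest."""
    t = np.vstack([ttest_ind(X[y == c], X[y != c], axis=0)[0]
                   for c in classes])        # classes x features
    return np.argmax(np.abs(t), axis=0)

rng = np.random.default_rng(5)
X, y = rng.normal(size=(60, 300)), rng.integers(0, 3, size=60)
sector = assign_sectors(X, y, [0, 1, 2])
# Class c owns the arc [c, c + 1) * 2*pi / 3; its features are spaced
# evenly inside it, ordered by significance.
print(np.bincount(sector))
```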
Feature reduction by t-statistic significance is a standard method in machine learning
approaches. RadViz™ has this machine learning feature embedded in it, and it is responsible for
the feature selections carried out here. The advantage of RadViz™ is that one immediately sees a
“visual” clustering of the results of the t-statistic selection. Generally, the amount of visual class
separation correlates to the accuracy of any classifier built from a reduced feature set. The
additional advantage of this visualization technique is that subclusters, outliers, and misclassified
points can quickly be seen in the graphical layout. One of the standard techniques to visualize
clusters or class labels is to perform a Principal Component Analysis (PCA) and show the points
in a two- or three-dimensional scatter plot using the first few principal components as
axes.
Often this display shows clear class separation, but the most important features
contributing to the PCA are not easily determined, as the PCA axes represent a linear combination
of the original feature set. RadViz™ is a “visual” classifier that can help one understand
important features and how these features are related within the original feature space.
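For contrast, the conventional PCA view just mentioned can be sketched as follows (scikit-learn and synthetic data are our assumptions); note how each axis mixes all original features, so per-feature influence must be dug out of the loadings rather than read off the plot:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 1000))             # 60 samples x 1000 features

pca = PCA(n_components=2).fit(X)
pts = pca.transform(X)                      # 2-D scatter coordinates

# Each component is a linear combination of all 1000 original features;
# per-feature influence is buried here, not visible in the scatter plot.
loadings = pca.components_                  # shape (2, 1000)
```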