Data Mining of Microarray Data Using
Multidimensional Analytic and
Visualization Techniques
Patrick Hoffman, Dave Pinkney, Jennifer Wu
AnVil Informatics
4th Floor, 600 Suffolk Street
Lowell, MA 01854
[email protected]
http://www.anvilinformatics.com
The insights that result from analyses of large microarray datasets represent an
important new focus in the drug discovery process. In this poster we
demonstrate the application of two machine learning techniques, supervised
and unsupervised learning, to microarray data. Additionally we present new
techniques that facilitate clustering comparisons using visual and analytical
approaches.
The microarray data sets we used are publicly available and result from
various yeast gene experiments. Our purpose was to demonstrate the value of
applying high dimensional analytic and visual data mining techniques to
discover trends and patterns in the data.
In our analyses, we compare many classification and clustering techniques on
both yeast diauxic shift data and yeast cell cycle data. Application of novel
visualization techniques (Parallel Coordinates, Circle Segments, Radviz, etc.)
to both datasets helps us gain insights into the gene expression data.
Supervised Clustering of
Microarray Data
Microarray experiments typically lead to the analysis of thousands of
gene expression profiles. Genes of similar function often have
similar expression profiles. This attribute can be exploited by
creating classifiers that are trained on the expression profiles of genes
with known function, and applied to unknown genes in order to
classify them based on expression profile.
We performed several experiments that built classifiers from 35
genes with 5 distinct expression profiles. The information came from
publicly available yeast gene expression data that was generated from
microarray experiments. Some of the results are shown in this poster.
Most classifiers, such as a Decision Tree, Neural Network and
Naive Bayes, can classify the 35 “training” genes perfectly. Once
trained, these tools can be used to automatically classify the 6000
remaining unclassified genes based on the characteristics of their
expression profiles.
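The train-then-classify workflow described above can be sketched in a few lines of Python. This is only an illustration, not the classifiers used for the poster: it swaps in a simple nearest-centroid rule in place of a Decision Tree, Neural Network, or Naive Bayes model, and the profiles, class names, and function names (`train_centroids`, `classify`) are made up for the example.

```python
import math
from collections import defaultdict

def train_centroids(profiles, labels):
    """Average the expression profiles of each known class into one centroid."""
    groups = defaultdict(list)
    for profile, label in zip(profiles, labels):
        groups[label].append(profile)
    return {
        label: [sum(col) / len(col) for col in zip(*members)]
        for label, members in groups.items()
    }

def classify(profile, centroids):
    """Assign a profile to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))

# Hypothetical training genes: 7 time points each, two made-up classes.
train_profiles = [
    [0.1, 0.2, 0.5, 1.0, 1.8, 2.5, 3.0],  # expression rises over time
    [0.0, 0.3, 0.6, 1.1, 1.7, 2.4, 2.9],
    [3.0, 2.4, 1.9, 1.0, 0.6, 0.2, 0.1],  # expression falls over time
    [2.8, 2.5, 1.7, 1.2, 0.5, 0.3, 0.0],
]
train_labels = ["induced", "induced", "repressed", "repressed"]

centroids = train_centroids(train_profiles, train_labels)

# An unclassified gene (also made up) is labeled by its nearest centroid.
unknown_gene = [0.2, 0.1, 0.7, 0.9, 1.6, 2.6, 3.1]
predicted = classify(unknown_gene, centroids)
```

The same two-step shape (fit on the labeled genes, then apply to the unlabeled ones) would hold for any of the classifiers named above; only the model inside changes.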
A Kohonen self-organizing map can be used to cluster the 35
“training” genes, and based on this clustering can be applied to the
classification of the unknown genes. The following four pictures
show the expression profiles of the “training” genes, the Kohonen
map built from the genes, and the expression profiles of the unknown
genes after being classified with the Kohonen Map.
A Parallel Coordinates visualization displaying gene expression
levels for 35 genes with distinct expression profiles. The genes were
classified based on their expression profile, which is shown plotted
over the 7 measured time intervals.
A Kohonen self-organizing map clusters the 35 genes with known
class by computing a new pair of axes and locating the genes
according to its idea of similarity. The Kohonen map can then be
used as a classifier if the operator designates which clusters
correspond to which gene function.
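As a rough sketch of how a self-organizing map can double as a classifier, the Python below trains a tiny 3 x 3 map and then labels each occupied node. Everything here is an assumption made for illustration: the grid size, learning-rate schedule, toy profiles, and function names are hypothetical, and a majority vote stands in for the operator's manual designation of clusters.

```python
import math
import random
from collections import Counter

def best_unit(x, weights):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(x, weights[pos]))

def train_som(data, rows=3, cols=3, epochs=40, seed=0):
    """Fit a small rows x cols map; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs       # shrinks from 1.0 toward 0
        rate = 0.5 * decay                 # learning rate
        sigma = max(0.5, 1.5 * decay)      # neighborhood radius on the grid
        for x in data:
            br, bc = best_unit(x, weights)
            for (r, c), w in weights.items():
                # Nodes near the winner on the grid move more than distant ones.
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                pull = rate * math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    w[i] += pull * (x[i] - w[i])
    return weights

def label_nodes(data, labels, weights):
    """Give each occupied node the majority class of the genes mapped to it."""
    votes = {}
    for x, label in zip(data, labels):
        votes.setdefault(best_unit(x, weights), Counter())[label] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in votes.items()}

def classify(x, weights, node_labels):
    """Label x with the class of its nearest labeled node."""
    pos = min(node_labels, key=lambda p: math.dist(x, weights[p]))
    return node_labels[pos]

# Made-up training genes with known class.
genes = [[0.1, 0.5, 0.9], [0.2, 0.6, 1.0], [0.9, 0.5, 0.1], [1.0, 0.4, 0.0]]
classes = ["rising", "rising", "falling", "falling"]

som = train_som(genes)
node_labels = label_nodes(genes, classes, som)
predictions = [classify(g, som, node_labels) for g in genes]
```

Unknown genes would be passed to `classify` in the same way as the training genes are here.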
After classification based on the genes of known expression profile,
the Kohonen self-organizing map shows the distribution of over
6000 microarray records (genes).
A Parallel Coordinates visualization shows the expression profiles
of the 6000 genes after classification by the Kohonen self-organizing
map.
Unsupervised Clustering of
Microarray Data
There are many clustering techniques that can be applied to
microarray data, such as Hierarchical, K-Means and Self-Organizing
Maps. We applied several clustering techniques to
publicly available microarray yeast gene expression data. The
expression levels were measured over two cell cycles and 800 genes
were identified algorithmically as being cell cycle regulated. These
genes were classified into 5 groups based on the cell cycle phase of
their expression. We analyzed and visualized the expression levels of
the 800 genes using several unsupervised clustering techniques; a few
excerpts of these analyses are shown.
In the following pictures we show several traditional and novel
techniques for visualizing data once it has been clustered or
classified, and then present the results of two unsupervised clustering
techniques.
If the data has already been clustered, graphs such as this average
expression profile plot can be used to present summary information
about the characteristics of each cluster. Here we see the average
expression profile, with standard deviation bars added, plotted for the
5 Peak clusters in the cell cycle data. The clusters clearly
demonstrate the cyclic nature of the data set.
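The summary behind such a plot is a per-cluster mean and standard deviation at each time point. A minimal Python sketch follows; the function name, toy values, and the choice of population (rather than sample) standard deviation are assumptions, since the poster does not say which was used.

```python
from collections import defaultdict
from statistics import mean, pstdev

def cluster_profiles(profiles, cluster_ids):
    """Return {cluster: (mean_per_timepoint, stdev_per_timepoint)}."""
    groups = defaultdict(list)
    for profile, cid in zip(profiles, cluster_ids):
        groups[cid].append(profile)
    summary = {}
    for cid, members in groups.items():
        columns = list(zip(*members))  # one tuple of values per time point
        summary[cid] = ([mean(c) for c in columns],
                        [pstdev(c) for c in columns])
    return summary

# Made-up expression values: three genes, two time points, two clusters.
profiles = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
cluster_ids = ["G1", "G1", "S"]
summary = cluster_profiles(profiles, cluster_ids)
```

Plotting each cluster's mean list as a line, with the matching standard deviations as error bars, gives the kind of chart described above.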
A novel extension to the average expression profile plot is this
Histogram Matrix visualization. The 5 Peak phase clusters are
displayed as a sequence of sixteen histograms for each cell cycle.
Rather than providing standard deviation bars, this visualization
presents all of the distribution information using a histogram at each
time point.
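The data behind a histogram matrix can be sketched as one histogram of expression values per cluster per time point. The Python below is an illustrative assumption rather than the poster's implementation: the bin count, value range, clamping of out-of-range values, and the name `histogram_matrix` are all hypothetical.

```python
def histogram_matrix(profiles, cluster_ids, bins=4, lo=-2.0, hi=2.0):
    """Return {cluster: [counts_at_t0, counts_at_t1, ...]}.

    Each counts list has `bins` equal-width bins spanning [lo, hi].
    """
    width = (hi - lo) / bins
    n_times = len(profiles[0])
    matrix = {}
    for profile, cid in zip(profiles, cluster_ids):
        rows = matrix.setdefault(cid, [[0] * bins for _ in range(n_times)])
        for t, value in enumerate(profile):
            # Clamp out-of-range values into the first or last bin.
            b = min(bins - 1, max(0, int((value - lo) // width)))
            rows[t][b] += 1
    return matrix

# Made-up expression values: three genes, two time points, two clusters.
profiles = [[-1.5, 0.5], [-1.2, 1.9], [0.5, -1.5]]
cluster_ids = ["G1", "G1", "S"]
matrix = histogram_matrix(profiles, cluster_ids)
```

Each inner list corresponds to one small histogram in the grid, so a cluster measured at sixteen time points would yield sixteen of them.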
Another powerful way of examining classified or clustered data is
with an interactive Parallel Coordinates visualization. This parallel
coordinates visualization is being used to examine all of the gene
expression values corresponding to two of the cell cycle clusters.
The phase difference between the expression times of the two clusters
can be clearly seen.
If the data has not been clustered, a common approach is to apply a
Hierarchical Agglomerative Clustering method, and to visualize
the results with the familiar Dendrogram visualization. A colored patch
grid corresponding to positive (green) and negative (red) data values
enhances the visual analysis and comprehension of the clustering.
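For readers unfamiliar with the agglomerative procedure, the following Python sketch repeatedly merges the two closest clusters and records the merge history, which is the information a dendrogram draws. Average linkage with Euclidean distance is an assumption here (the poster does not name the linkage or distance used), and the points are toy data.

```python
import math

def agglomerate(points):
    """Average-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of row indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair among the current clusters.
        a, b = min(
            ((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
            key=lambda pair: linkage(*pair),
        )
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Two obvious groups of made-up points.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
merges = agglomerate(points)
```

This brute-force version recomputes every pairwise linkage at each step, which is fine for a handful of rows but would be slow for thousands of genes.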
Another way to cluster data is to use Polyviz, a proprietary
high-dimensional clustering technique based on a spring force paradigm.
This Polyviz visualization clustered the microarray data using the
expression values and is colored by the Peak phase classification.
The Kohonen self-organizing map is another powerful clustering
technique that can be applied to unclustered data. This clustering of
the microarray data shows the relationship between gene expression
levels (based on cluster location) and the Peak classification column
(used to color the points).
Cluster Comparison Techniques
Scientists have many clustering techniques at their disposal, each
with its own set of advantages and disadvantages. How can scientists
determine which clustering technique is best for their data? How can
the results of two different clustering algorithms be meaningfully
compared? The answers to these questions are ongoing research
issues, but we present here three visual approaches and one analytic
approach towards answering these questions.
This custom visualization allows one to visually compare the results
of a K-means clustering technique that generates 30 clusters (on the
left) with a technique that produced 5 clusters (on the right).
Polylines are used to identify an individual record’s location within each
of the results. The visual comparison allows one to gain a
meaningful understanding of how the cluster results differ.
This visualization of two clustering techniques uses a jittered scatter
plot to enable the comparison of the clustering results. Five clusters
from one technique (along the Y-axis) are compared with 12 clusters
from another technique (along the X-axis). If the X-axis clusters
were a pure subset of the Y-axis clusters then there would only be
one clump per vertical line. In this case only the 12th cluster on the
X-axis is pure while the 1st is nearly so.
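The "one clump per vertical line" check has a simple analytic counterpart: cross-tabulate the two cluster assignments and see which X-axis clusters fall entirely inside a single Y-axis cluster. The Python below is a sketch with made-up assignments and hypothetical function names.

```python
from collections import Counter

def cross_tabulate(x_clusters, y_clusters):
    """Count records for every (x_cluster, y_cluster) pair."""
    return Counter(zip(x_clusters, y_clusters))

def pure_x_clusters(x_clusters, y_clusters):
    """X clusters whose records all fall in a single Y cluster."""
    seen = {}
    for x, y in zip(x_clusters, y_clusters):
        seen.setdefault(x, set()).add(y)
    return sorted(x for x, ys in seen.items() if len(ys) == 1)

# Made-up assignments of five records by two clustering techniques.
x_assign = [1, 1, 2, 2, 3]
y_assign = ["A", "A", "A", "B", "B"]

table = cross_tabulate(x_assign, y_assign)
pure = pure_x_clusters(x_assign, y_assign)
```

In this toy case X clusters 1 and 3 are pure, while cluster 2 is split across two Y clusters and would show up as two clumps on its vertical line.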
The Color Correlated Column visualization is another custom
visualization for comparing clustering results. This visualization
allows one to simultaneously compare the results of over 20 different
clusterings of the data. The records are sorted vertically by the Peak
class, which is represented by the colored bar on the right. The
predicted class is represented with a grayscale. If the change in
grayscale value corresponds to the change in color, then there is a
strong correlation between the true and predicted class.
Comparing Clustering Techniques
Rank  Clustering     Data      Number of  %correct  %correct  %correct  %correct  %correct
      Technique                Clusters   method-1  method-2  method-3  method-4  maximum
1     Kohonen 3      Norm      30         72.6      69.1      65.7      67.8      72.6
2     Kohonen 1      Norm      30         72.3      69.5      65.2      67.7      72.3
3     Kohonen 2      Norm      30         71.8      66.4      62.3      65.2      71.8
4     C K-means 1    Norm      30         71.1      66.4      59.7      65.1      71.1
5     SOM 4          Original  25         70.1      61.9      59.9      63.2      70.1
6     SOM 12         Original  27         69.3      64.0      60.1      63.0      69.3
7     Kohonen 2      Original  19         68.5      64.3      58.6      62.7      68.5
8     C K-means 1    Original  30         67.2      63.6      55.0      61.9      67.2
9     Kohonen 1      Original  19         67.1      59.8      53.6      58.8      67.1
10    Kohonen 3      Original  18         66.8      65.5      56.4      63.9      66.8
11    C K-means 2    Norm      5          66.8      61.1      56.4      58.6      66.8
12    SOM 7          Norm      12         62.5      57.8      49.6      52.8      62.5
13    M K-means 1    Original  5          59.7      51.8      48.4      54.7      59.7
14    Dendrogram 2   Original  6          58.8      54.5      46.8      47.5      58.8
15    K-means 2      Original  5          55.8      50.0      47.8      54.5      55.8
16    SOM 7          Original  5          54.8      51.8      42.8      55.1      55.1
17    Dendrogram 1   Original  6          45.6      43.1      32.7      33.4      45.6
18    SOM 12         Norm      30         44.2      38.5      31.0      36.0      44.2
19    M K-means 2    Original  30         43.7      36.6      29.3      35.9      43.7
20    M K-means 3    Original  17         39.5      30.8      23.5      30.2      39.5
21    random         Original  6          37.5      16.3      20.0      22.9      37.5
The results of several clustering techniques are analytically compared
with the Peak class in this example. For a given technique, each
generated cluster was considered to be a subset of one of the true
classes. The class chosen for each cluster was based on the majority
of “truth” classes for the genes in that cluster. After each cluster was
categorized, the resulting accuracies were calculated. The total
percent correct and the average accuracy for each class were
calculated and are presented in the method columns.
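The majority-class scoring described above can be sketched in Python as follows. This is an illustration under stated assumptions: the poster does not define how the four method columns differ, so the sketch simply returns the total percent correct and the mean per-class percent correct, and the cluster and class labels are made up.

```python
from collections import Counter, defaultdict

def majority_accuracy(cluster_ids, true_classes):
    """Label each cluster with its majority true class, then score the result.

    Returns (total_percent_correct, mean_per_class_percent_correct).
    """
    members = defaultdict(list)
    for cid, cls in zip(cluster_ids, true_classes):
        members[cid].append(cls)
    cluster_label = {cid: Counter(classes).most_common(1)[0][0]
                     for cid, classes in members.items()}

    total = Counter(true_classes)   # number of genes in each true class
    correct = Counter()             # genes whose cluster label matches
    for cid, cls in zip(cluster_ids, true_classes):
        if cluster_label[cid] == cls:
            correct[cls] += 1

    overall = 100.0 * sum(correct.values()) / len(true_classes)
    per_class = [100.0 * correct[cls] / total[cls] for cls in total]
    return overall, sum(per_class) / len(per_class)

# Six made-up genes in two clusters, with three true classes.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["G1", "G1", "S", "S", "S", "M"]
overall, per_class_mean = majority_accuracy(clusters, truth)
```

Here cluster 0 is labeled G1 and cluster 1 is labeled S, so four of six genes are counted as correct; class M can never be correct because no cluster carries its label, which pulls the per-class average below the overall figure.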
This Radviz visualization presents Mechanism of Action data from
the NCI chemical structure database, clustered by the fingerprint of
the chemicals.
This Polyviz visualization presents Mechanism of Action data from
the NCI chemical structure database, clustered by the fingerprint of
the chemicals.