NCI 7-31-03 Proceedings
Applications of Machine Learning Approaches
Integrating Analytic Methods and Statistics with High
Dimensional Visualizations to Different Problems in
Cancer Diagnosis and Detection
John McCarthy*, Kenneth A. Marx, Philip O’Neil, M.L.
Ujwal, Patrick Hoffman, Alex Gee and Natasha
Markuzon
AnVil, Inc.
25 Corporate Drive
Burlington, MA 01803
*corresponding author
[email protected];
(781) 272-1600 X 460
Abstract
Introduction to Data Analysis by Machine Learning
Overview of Machine Learning and Visualization
Three of the major techniques in machine learning are clustering, classification, and
feature reduction. Classification and clustering are also broadly known as supervised
and unsupervised learning, respectively. In supervised learning, the object is to learn predetermined class
assignments from other data attributes. For example, given a set of gene expression data
for samples with known diseases, a supervised learning algorithm might learn to classify
disease states based on patterns of gene expression. In unsupervised learning, there either
are no predetermined classes or class assignments are ignored. Cluster analysis is the
process by which data objects are grouped together based on some relationship defined
between objects. In both classification and clustering an explicit or implicit model is
created from the data, which can help to predict future data instances or understand the
physical process behind the data. Creating these models can be a very compute-intensive
task, such as training a neural network. Feature reduction or selection reduces the data
attributes used in creating a data model. This process can reduce analysis time and create
simpler and (sometimes) more accurate models.
In the three cancer examples presented, all three machine learning techniques are used
and will be described; however, one of the primary analysis techniques used is high
dimensional visualization. One particular visualization, RadViz, incorporates all three
machine learning techniques in an intuitive, interactive display. Two other high
dimensional visualizations, Parallel Coordinates and PatchGrid (similar to
HeatMap), are also used to analyze and display results.
Classification techniques used:
RadViz – rearranging dimensions based on T-statistic – a visual classifier
Naïve Bayes (Weka)
Support Vector Machines (Weka)
Instance Based or K – nearest neighbor (Weka)
Logistic Regression (Weka)
Neural Net (Weka)
Neural Net (Clementine)
Validation techniques:
10-fold cross-validation
Hold-one-out
Training and Test datasets
Clustering techniques:
RadViz – arranging dimensions not based on class label – ex. Principal Components
Hierarchical with Pearson correlation
Feature Reduction techniques used:
Pairwise t-statistic - equal variance used in RadViz
F-statistic – select top dimensions based on the highest F-statistic computed from
class labels
PURS – Principal Uncorrelated Record Selection
Initially select some “seed” dimensions, e.g. based on a high t or F statistic;
repeatedly delete dimensions that correlate highly with the seed dimensions, and
add any dimension that is not correlated to the “seed” set. Repeat, slowly
reducing the correlation threshold, until the “seed” dimensions are reduced to the
desired number.
Random – randomly selected dimensions and build/test classifier
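The PURS procedure described above can be sketched as follows. This is a minimal illustration of our reading of the description, not the production implementation; the function name and the parameters `thresh` and `step` are our own.

```python
import numpy as np

def purs(X, scores, n_keep, thresh=0.95, step=0.05):
    """Principal Uncorrelated Record Selection (sketch).

    X: (samples, dims) data matrix; scores: per-dimension relevance
    (e.g. t- or F-statistics). Scan dimensions from best-scoring down,
    keeping ("seeding") each one only if it is not highly correlated
    with an already-kept seed; lower the correlation threshold until
    at most n_keep seeds survive.
    """
    order = list(np.argsort(scores)[::-1])  # best-scoring dims first
    while True:
        seeds = []
        for d in order:
            # keep d only if it is not highly correlated with any seed
            if all(abs(np.corrcoef(X[:, d], X[:, s])[0, 1]) < thresh
                   for s in seeds):
                seeds.append(d)
        if len(seeds) <= n_keep or thresh <= step:
            return seeds[:n_keep]
        thresh -= step  # tighten threshold, pruning more dimensions
```

Lowering the threshold treats more dimension pairs as "correlated," so each pass prunes more redundant dimensions while retaining the highest-scoring representative of each correlated group.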
One of AnVil’s strengths is our ability to carry out integrated data mining and
visualization analyses on large, complex nonlinear datasets that may have as many as
50,000 data dimensions. Therefore, we have a practical way to overcome the need to
reduce dimensionality early on in addressing any specific problem. One advantage this
mechanism provides is the ability to simultaneously handle large numbers of data
dimensions, enabling us, for example, to add contextual knowledge into the already large-dimensionality datasets that researchers have to analyze; the contextual knowledge is
simply considered as additional data dimensions. We discuss the distinct advantages of
our technology in greater detail in the following sections.
The Importance of High-dimensional Data Visualization and its Integration with
Analytic Data Mining Techniques. Visualization, data mining, statistics, as well as
mathematical modeling and simulation are all methodologies that can be used to enhance
the discovery process [15]. AnVil’s expertise lies in a combination of analytic data
mining techniques integrated with advanced high-dimensional visualizations (HDVs).
There are numerous visualizations and a good number of valuable taxonomies (See [16]
for an overview of taxonomies). Most information visualization systems focus on tables
of numerical data (rows and columns), such as 2D and 3D scatterplots [17], although
many of the techniques apply to categorical data. Looking at the taxonomies, the
following stand out as high-dimensional visualizations: Matrix of scatterplots [17]; Heat
maps [17]; Height maps [17]; Table lens [18]; Survey plots [19]; Iconographic displays
[20]; Dimensional stacking (general logic diagrams) [21]; parallel coordinates [22]; Pixel
techniques, circle segments [23]; Multidimensional scaling [23]; Sammon plots [24];
Polar charts [17]; RadViz [25]; Principal component analysis [26]; Principal curve
analysis [27]; Grand Tours [28]; Projection pursuit [29]; Kohonen self-organizing maps
[30]. Grinstein et al. [31] have compared the capabilities of most of these visualizations.
Historically, static displays include histograms, scatterplots, and large numbers of their
extensions. These can be seen in most commercial graphics and statistical packages
(Spotfire, S-PLUS, SPSS, SAS, MATLAB, Clementine, Partek, Visual Insight’s Advisor,
and SGI’s Mineset, to name a few). Most software packages provide limited features that
allow interactive and dynamic querying of data.
HDVs have been limited to research applications and have not been incorporated into
many commercial products. However, HDVs are extremely useful because they provide
insight during the analysis process and guide the user to more targeted queries.
Visualizations fall into two main categories: (1) low-dimensional, which includes
scatterplots, with 2-9 variables (fields, columns, parameters), and (2) high-dimensional, with 100-1000+ variables. Parallel Coordinates or a spider chart or radar
display in Microsoft Excel can display up to 100 dimensions, but place a limit on the
number of records that can be interpreted. There are a few visualizations that deal with a
large number (>100) of dimensions quite well: Heatmaps, Heightmaps, Iconographic
Displays, Pixel Displays, Parallel Coordinates, Survey Plots, and RadViz. When more
than 1000 records are displayed, the lines overlap and cannot be distinguished. Of these,
only RadViz is uniquely capable of dealing with ultra–high-dimensional (>10,000
dimensions) datasets, and we discuss it in detail below.
RadViz™ is a visualization and classification tool that uses a spring analogy for
placement of data points and incorporates machine learning feature reduction techniques
as selectable algorithms [13-15]. The “force” that any feature exerts on a sample point is
determined by Hooke’s law: f = kd. The spring constant, k, ranging from 0.0 to 1.0, is
the value of the feature for that sample, and d is the distance between the sample point
and the perimeter point on the RadViz circle assigned to that feature (see Figure 1). The
placement of a sample point, as described in Figure 1 is determined by the point where
the total force determined vectorially from all features is 0. The RadViz display combines
the n data dimensions into a single point for the purpose of clustering, but it also
integrates analytic embedded algorithms in order to intelligently select and radially
arrange the dimensional axes. This arrangement is performed through Autolayout, a
unique, proprietary set of algorithmic features, based upon the dimensions’ significance
statistics, that optimizes clustering by maximizing the distance separating clusters of
points. The default arrangement is to have all features equally spaced around the
perimeter of the circle, but the feature reduction and class discrimination algorithms
arrange the features unevenly in order to increase the separation of different classes of
sample points. The feature reduction technique used in all figures in the present work is
based on the t statistic with Bonferroni correction for multiple tests. The circle is divided
into n equal sectors or “pie slices,” one for each class. Features assigned to each class are
spaced evenly within the sector for that class, counterclockwise in order of significance
(as determined by the t statistic, comparing samples in the class with all other samples).
As an example, for a 3 class problem, features are assigned to class 1 based on the
sample’s t-statistic, comparing class 1 samples with class 2 and 3 samples combined.
Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1
and 3 combined values, and Class 3 features are assigned based on the t-statistic
comparing class 3 values with class 1 and class 2 combined. Occasionally, when large
portions of the perimeter of the circle have no features assigned to them, the data points
would all cluster on one side of the circle, pulled by the unbalanced force of the features
present in other sectors. In this case, a variation of the spring force calculation is used,
where the features present are effectively divided into qualitatively different forces
comprised of high and low k value classes. This is done by requiring k to range from –1.0 to 1.0. The net effect is to make some of the features ‘pull’ (high or +k values) and
others ‘push’ (low or –k values) the points to spread them absolutely into the display
space, but maintaining the relative point separations. It should be stated that one can
simply do feature reduction by choosing the top features by t-statistic significance and
then apply those features to a standard classification algorithm. The t-statistic
significance is a standard method for feature reduction in machine learning approaches,
independently of RadViz. The most significant chemicals selected with the t-statistic are
the same as those selected by RadViz; RadViz has this machine learning feature
embedded in it, and it performed the selections carried out here. The advantage of
RadViz is that one immediately sees a “visual” clustering of the results of the t-statistic
selection. Generally, the amount of visual class separation correlates to the accuracy of
any classifier built from the reduced features. The additional advantage to this
visualization is that sub clusters, outliers and misclassified points can quickly be seen in
the graphical layout. One of the standard techniques to visualize clusters or class labels
is to perform a Principal Component Analysis and show the points in a 2D or 3D scatter
plot using the first few principal components as axes. Often this display shows clear
class separation, but the most important features contributing to the PCA are not easily
seen. RadViz is a “visual” classifier that can help one understand the important features
and how the features are related.
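The one-vs-rest t-statistic selection described above can be sketched as follows, assuming SciPy's `ttest_ind` for the equal-variance t-test; the function name, the return shape, and the `alpha` parameter are our own, and real RadViz additionally lays the surviving features out around the circle.

```python
import numpy as np
from scipy.stats import ttest_ind

def features_per_class(X, y, alpha=0.01):
    """For each class, test every feature with an equal-variance t-test
    of in-class samples vs. all other samples; keep features whose
    Bonferroni-corrected p-value clears alpha, ranked by |t| with the
    most significant first (the order used to lay out a RadViz sector)."""
    n_feat = X.shape[1]
    selected = {}
    for c in np.unique(y):
        t, p = ttest_ind(X[y == c], X[y != c], axis=0, equal_var=True)
        keep = np.flatnonzero(p * n_feat < alpha)  # Bonferroni correction
        selected[c] = sorted(keep, key=lambda f: -abs(t[f]))
    return selected
```

For a 3-class problem this reproduces the scheme in the text: class 1 features come from the test of class 1 samples against classes 2 and 3 combined, and so on for each class.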
The RadViz Layout:
An example of the RadViz layout is illustrated in Figure 1. There are 16 variables or
dimensions associated with the 1 point plotted (in red). Sixteen imaginary springs are
anchored to the points on the circumference and attached to one data point. The data
point is plotted where the sum of the forces is zero according to Hooke’s law (F = Kx):
where the force is proportional to the
distance x to the anchor point. The value K
for each spring is the value of the variable
for the data point. In this example the
spring constants (or dimensional values)
are higher for the yellow springs and lower
for the blue springs. Normally, many
points are plotted without showing the
spring lines. Generally, the dimensions
(variables) are normalized to have values
between 0 and 1 so that all dimensions
have “equal” weights. This spring
paradigm layout has some interesting
features.
Figure 1 One Point with 16 dimensions in RadViz
For example, if all dimensions have the same normalized value, the data point will lie
exactly in the center of the circle. If the point is a unit vector then that point will lie
exactly at the fixed point on the edge of the circle (where the spring for that dimension is
fixed). Many points can map to the same position. This represents a non-linear
transformation of the data which preserves certain symmetries and which produces an
intuitive display. Some features of this visualization include:
 it is intuitive: higher dimension values “pull” the data points closer to the dimension on the circumference
 points with approximately equal dimension values will lie close to the center
 points with similar values whose dimensions are opposite each other on the circle will lie near the center
 points which have one or two dimension values greater than the others lie closer to those dimensions
 the relative locations of the dimension anchor points can drastically affect the layout (the idea behind the “Class discrimination layout” algorithm)
 an n-dimensional line gets mapped to a line (or a single point) in RadViz
 convex sets in n-space map into convex sets in RadViz
 computation time is very fast
 1000’s of dimensions can be displayed in one visualization
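The spring layout described above has a closed form: setting the net Hooke's-law force Σ k_j(a_j − p) to zero gives p = Σ k_j a_j / Σ k_j, the k-weighted mean of the anchor points. A minimal sketch follows; the function name is our own, and real RadViz adds the Autolayout axis arrangement on top of this placement rule.

```python
import numpy as np

def radviz_positions(X):
    """Place each row of X inside the unit circle, RadViz-style.

    Feature j gets an anchor a_j on the circle; each normalized value
    k_ij acts as a spring (f = kd) pulling the point toward its anchor.
    The equilibrium where the forces sum to zero is the weighted mean
    of the anchors: p_i = sum_j k_ij a_j / sum_j k_ij.
    """
    n, m = X.shape
    # normalize each feature to [0, 1] so all dimensions get "equal" weight
    lo, hi = X.min(axis=0), X.max(axis=0)
    K = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    theta = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])  # (m, 2)
    weights = K.sum(axis=1, keepdims=True)
    return (K @ anchors) / np.where(weights > 0, weights, 1.0)
```

The listed properties fall out directly: a unit vector lands on its anchor, and a row of equal values lands at the weighted mean of all anchors, i.e. the center.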
We have studied the following systems related to cancer detection:
1. GI50 compound data for 60 cancer cell lines
2. Microarray lung cancer data
3. Proteomics MS dataset
1. Data Mining the Public Domain NCI Cancer Cell Line Compound
GI50 Data Set using Supervised Learning Techniques
Introduction to the Cheminformatics Problem.
Important objectives in the overall process of molecular design for drug discovery
are: 1) the ability to represent and identify important structure features of any small
molecule, and 2) to select useful molecular structures for further study, usually using
linear QSAR models and based upon simple partitioning of the structures in n-dimensional space. To date, partitioning using non-linear QSAR models has not been
widespread, but the complexity and high-dimensionality of the typical data set requires
them. The machine learning and visualization techniques that we describe and utilize here
represent an ideal set of methodologies with which to approach representing structural
features of small molecules followed by selecting molecules via constructing and
applying non-linear QSAR models. QSAR models might typically use calculated
chemical descriptors of compounds along with computed or experimentally determined
compound physical properties and interaction parameters (ΔG, Ka, kf, kr, LD50, GI50,
etc.) with other large molecules or whole cells. The former types of experimental data
would be generated in silico (ΔG) or via high throughput screening of compound libraries
against appropriate receptors or important signaling pathway macromolecules (Ka, kf,
kr), whereas the LD50, GI50 type of data would be generated against whole cells that are
appropriate to the disease model being investigated. When the data has been generated,
then the application of machine learning can take place. We provide a sample illustration
of this process below.
The National Cancer Institute’s Developmental Therapeutics Program maintains a
compound data set (>700,000 compounds) that is currently being systematically tested
for cytotoxicity (generating 50% growth inhibition, GI50, values) against a panel of 60
cancer cell lines representing 9 tissue types. Therefore, this dataset contains a wealth of
valuable information concerning potential cancer drug pharmacophores. In a data mining
study of the 8 largest public domain chemical structure databases, it was observed that
the NCI compound data set contained by far the largest number of unique compounds of
all the databases (32). The application of sophisticated machine learning techniques to
this unique NCI compound dataset represents an important open problem that motivated
the study we present in this report. Previously, this data set has been mined by supervised
learning techniques such as cluster correlation, principal component analysis and various
neural networks, as well as statistical techniques (33,34). These approaches have
identified compound class subsets such as: tubulin active compounds (35), pyrimidine
biosynthesis inhibitors (36) and topoisomerase II inhibitors (37), that possess similar
mechanisms of action (MOA), share similar structures or develop similar patterns of drug
resistance. Compound structure classes such as the ellipticine derivatives have also been
studied and point to the validity of the concept that fingerprint patterns of activity in the
NCI data set encode information concerning MOAs and other biological behavior of
tested compounds (38). More recently, gene expression analysis has been added to the
data mining activity of the NCI compound data set (39) to predict chemosensitivity, using
the GI50 test data for each compound, for a few hundred compound subset of the NCI
data set (40). After we completed our data mining analysis (41), gene expression data on
the 60 cancer cell lines was combined with NCI compound GI50 data and with a 27,000
feature database computed for the NCI compounds to calculate chemical features similar
to those identified in the following study and as we have presented elsewhere (42).
In the present data mining study, we use microarray based gene expression data to
first establish a number of ‘functional’ classes of the 60 cancer cell lines via a
hierarchical clustering technique. These functional classes are then used to supervise a 3-Class learning problem, using a small but complete subset of 1400 of the NCI
compounds’ GI50 values as the input to a clustering algorithm in the RadViz™ program
(43). At p < .01 significance, RadViz™ identifies two small compound subsets that
accurately classify the cancer cell line classes: melanoma from non-melanoma and
leukemia from non-leukemia (41). We then demonstrate that independent analytic
classifiers validate the two small compound subsets we selected. We found them to both
be significantly enriched in quinone compounds of two distinct subtypes. We conclude
that our machine learning approach has yielded important new molecular insights into a
class of compounds demonstrating a high level of specificity in cancer cell type toxicity.
Specific Methods Used.
For the ~ 4% missing values found in the 1400 compound data set, we tried and
compared two approaches to missing value replacement: 1) record average replacement;
2) multiple imputation using Schafer’s NORM software (44). Using either missing value
replacement method for the starting data set, there was close agreement (always > 90%)
between the NCI compound lists selected in identical 2-Class Problem classifications we
present below. Therefore, in the present study, we used the record average replacement
method for all the data presented.
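The record average replacement scheme settled on here (each missing value filled with the mean of the remaining values in the same record) can be sketched as follows; the function name is ours, and Schafer's NORM multiple imputation was the alternative it was compared against.

```python
import numpy as np

def record_average_replace(X):
    """Fill each NaN with the mean of the non-missing values
    in the same record (row) of the data matrix."""
    X = X.astype(float).copy()
    row_means = np.nanmean(X, axis=1)   # per-record mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = row_means[rows]
    return X
```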
Clustering of cell lines was done with R-Project software using the hierarchical
clustering algorithm with “average” linkage method and a dissimilarity matrix computed
as 1 – the Pearson correlations of the gene expression data. AnVil Corporation’s
RadViz™ software (45) was used for feature reduction and initial classification of the
cell lines based on the compound GI50 data. The selected features were validated using
several classifiers from Weka 3.1.9 (Waikato Environment for Knowledge Analysis,
University of Waikato, New Zealand). The classifiers used were IB1 (nearest neighbor),
IB3 (3 nearest neighbor), logistic regression, Naïve Bayes Classifier, support vector
machine, and neural network with back propagation. ChemOffice 6.0
(CambridgeSoft Corp.) and the NCI website were both used to identify compound structures
via their NSC numbers, and substructure searching to identify quinone compounds in the
larger data set was carried out using ChemFinder (CambridgeSoft).
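The cell-line clustering step (average linkage on a 1 − Pearson dissimilarity matrix) can be sketched in Python with SciPy, mirroring the R hclust setup described above; the function name and `n_clusters` parameter are our own.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_cell_lines(expr, n_clusters):
    """Average-linkage hierarchical clustering of the rows of expr
    (cell_lines x genes) using 1 - Pearson correlation as the
    dissimilarity, then a flat cut into n_clusters groups."""
    d = 1.0 - np.corrcoef(expr)       # 1 - Pearson between rows
    np.fill_diagonal(d, 0.0)          # guard against tiny float noise
    Z = linkage(squareform(d, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```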
Results and Discussion
Identifying functional cancer cell line classes using gene expression data.
Based upon gene expression data, we identified cancer cell line classes that we could use
in a subsequent supervised learning approach. In Figure 2, we present a hierarchical
clustering dendrogram using the 1-Pearson distances calculated from the T-Matrix,
comprised of 1376 gene expression values determined for the 60 NCI cancer cell lines
(43). There are five well defined clusters observed. Four of the clusters in Figure 2 (renal,
leukemia, ovarian and colorectal from second left to right) represent pure cell line
classes. Only the melanoma class contains members of another clinical tumor type: two breast cancer cell lines, MDA-MB-435 and MDA-N.
The 2 breast cancer cell lines behave functionally as melanoma cells and seem to be
related to melanoma cell lines via a neuroendocrine origin (43). The remaining cell lines
in the Figure 2 dendrogram, those not found in any of the five functional classes, are
defined as being in the sixth class: the non-(melanoma, leukemia, renal, ovarian,
colorectal) class. In the supervised learning studies that follow, we treat these six
functional clusters as the ground truth.
3-Class Cancer Cell Classifications and Validation of Selected Compounds.
High class number classification problems are difficult to implement where the data are
not clearly separable into distinct classes, and we could not successfully carry out a 6-class classification of the cancer cell line classes based upon the starting GI50 compound
data. Therefore, we implemented a 3-Class supervised learning classification utilizing
RadViz™ (25, 45-47). Starting with the small 1400 compounds’ GI50 data set that
contained no missing values for all 60 cell lines, those compounds were selected that
were effective in carrying out the classification at the p < .01 (Bonferroni corrected t
statistic) significance level. The 3-Class problem at p < .01 significance, for the
melanoma, leukemia, and non-melanoma, non-leukemia classes, is presented in Figure 3.
This produced clear and accurate class separations of the 60 cancer cell lines. There were
14 compounds selected as being most effective against melanoma class cells and 30
compounds were identified as most effective against the leukemia class cells. Similar
classification results were obtained for separate 2-Class problems: melanoma vs. non-melanoma and leukemia vs. non-leukemia (data not shown; [41]). For all other possible
2-Class problems, we found that few to no compounds could be selected at p < .01.
Our next goal was to validate these results, utilizing 6 independent analytic
classification techniques (Instance Based 1, Instance Based 3, Naïve Bayes, neural
networks, logistic regression, and support vector machines), with the same selected
compounds’ GI50 values as a classifier set, using the hold-one-out method (data not
shown; see 41). Using these
selected compounds resulted in a greater than 6-fold lowered level of error compared to
using the equivalent numbers of randomly selected compounds, thus validating our
selection methodology.
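The hold-one-out validation used for these comparisons can be sketched as follows; the classifier here is a simple IB1-style nearest-neighbor stand-in of our own, not the Weka implementations used in the study.

```python
import numpy as np

def loo_error(X, y, classify):
    """Hold-one-out error rate: train on all samples but one,
    predict the held-out sample, repeat over every sample.
    `classify` is any (Xtrain, ytrain, xtest) -> label function."""
    wrong = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        wrong += classify(X[mask], y[mask], X[i]) != y[i]
    return wrong / len(y)

def nearest_neighbor(Xtr, ytr, x):
    """IB1-style 1-nearest-neighbor classifier (Euclidean distance)."""
    return ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))]
```

Comparing `loo_error` on the selected compounds against equal-sized random compound subsets is the shape of the 6-fold error-reduction comparison reported in the text.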
Quinone Compound Subtypes preferentially effective against melanoma.
Next, we decided to examine the chemical identity of the compounds selected as most
effective against melanoma and leukemia. To summarize, for the 14 compounds selected
as most effective against melanoma, 11 are p-quinones. Of the 11 p-quinones, all 11 are
internal ring quinone structures (41). We display in Figure 4A the most cytotoxic of
these structures. These internal ring quinones possess either two neighboring aromatic
5- or 6-membered fused rings (some heteroatom-containing) on either side of the
quinone ring, or an aromatic fused-ring neighbor on one side and non-H substitutions off
the other side of the quinone. In nearly all cases, these substitutions have electronegative
atoms covalently bonded to either or both the o and m C positions of the quinone ring.
A recent analysis simultaneously correlating gene expression data for the 60
cancer cell lines with GI50 values, identified a sub-class of compounds containing a
benzothiophenedione core structure that were most highly correlated with the expression
patterns of Rab7 and other melanoma specific genes (42). There is clearly some overlap
between the internal quinone subtype we have defined in the present study and the
benzothiophenedione core structure members. Out of the 11 internal quinone compounds
we identified, 3 are of the benzothiophenedione core structure class, but they are not
amongst the most effective compounds we identified. The Rab7 gene is a member of the
GTP binding protein family involved in the docking of cellular transport vesicles and is a
key regulator of aggregation and fusion of late endocytic lysosomes (48). A number of
other genes whose expression levels highly correlate with the same compounds
express proteins involved in other lysosomal functions, suggesting a link between the
quinone oxidation potential, the proton pump and the electron transport chain. This
suggests the possibility that benzodithiophenedione compounds may act directly as
surrogate oxidizing agents, effectively competing with ubiquinone in the electron
transport chain, thereby disrupting cellular redox processes.
Quinone Compound Subtypes preferentially effective against leukemia. There
were 30 compounds selected as most effective against leukemia in the leukemia, non-leukemia 3-Class Problem, of which 8 are structures containing p-quinones (41). In
contrast to the internal ring quinones in the melanoma class, 6 out of the 8 leukemia p-quinones were external ring quinones. We display the most cytotoxic example of these
structures in Figure 4B. In contrast to the internal ring quinones, these external ring
quinones had only one aromatic fused ring neighbor, which had no ring heteroatoms in
all cases. Also different, the quinone was itself at the periphery of the molecule and had
no non-H substituents off the exterior side of the ring at either o or m C positions. Thus,
the ‘external’ and ‘internal’ quinone rings should possess different electron densities and
redox potentials for the quinoid oxygens. Besides redox potentials, other possible subtype
differences may exist such as: solubility, steric differences relative to metabolic enzyme
active sites, differential cellular adsorption, etc.
In the study discussed already (42), a sub-class of compounds comprised of an
indolonaphthoquinone core structure was identified that was most highly correlated with
the expression patterns of LCP1, lymphocyte cytosolic protein 1, HS1, a hematopoietic
lineage specific gene, and other leukemia specific genes. There is overlap between the
external quinone subtype in our study and the indolonaphthoquinone core structure
members. This overlap between the two studies is somewhat remarkable since we
included no gene expression data in our analysis of the GI50 values, whereas the other
study (42) did. This suggests that there is sufficient information inherent in the compound GI50
values to carry out the basic core discovery presented here using sophisticated machine
learning techniques, without the need to include gene expression data in the analysis.
Uniqueness of Two Quinone Subtypes. In order to ascertain the uniqueness of
the two quinone subsets we discovered, we first determined the extent of occurrence of p-quinones of all types in our starting data set, via substructure searching using the
ChemFinder 6.0 software. The internal and external quinone subtypes represent a
significant fraction, 25% (10/41), of all the internal quinones and 40% (6/15) of all the
external quinones in the entire data set. In addition, we determined that only one
compound, NSC 621179, which is not a quinone but an epoxide, was found to be
effective against both melanoma and leukemia in a 2-Class classification where one class
was both leukemia and melanoma cell lines and the second class was non-melanoma,
non-leukemia cell lines. This result attests to the uniqueness of the specificity of the two
quinone subtype classes. Also, the NCI data set lists 92 well studied compounds known
to fall within one of 6 Mechanism Of Action (MOA) Classes: alkylating agents,
antimitotic agents, topoisomerase I inhibitors, topoisomerase II inhibitors, RNA/DNA
antimetabolites, DNA antimetabolites (33). We determined that the most effective 14 and
30 compounds against melanoma and leukemia, respectively, that we identified in the 3-Class problem do not fall into clusters with any one of these 6 MOA compound classes.
Sub-classification of Leukemia Cell Lines. We next asked whether these
machine learning techniques could sub-classify either the melanoma or the leukemia cell
lines into distinct clinical sub-classes based upon using our 2 respective quinone subtype
classes. The answer is that we could with a 3-Class based leukemia cell sub-classification
for the acute lymphoblastic leukemia (ALL), non-ALL leukemia (other), and non-leukemia cell classes at p < .05. To carry out the sub-classification, we used the most
effective 30 compounds identified for the p < .01 selection criterion as most effective
against all leukemias and this result is presented in Figure 5. Six of the 30 compounds
were most effective against the ALL class; while 12 of the 30 compounds were most
effective against the non-ALL leukemia. In this result, it is clear that there is a separation
of the 2 ALL cell lines (CCRF-CEM and MOLT-4) from the non-ALL leukemia subclass. These two ALL cell lines were also the most closely clustered leukemia cells in the
Figure 2 gene expression based clustering dendrogram. These results suggest the
interesting possibility that the chemical identity of the compounds most effective against
the 2 ALL cell lines are linked to the gene functions most responsible for closely
clustering these 2 ALL cell lines in Figure 2.
NAD(P)H:quinone oxidoreductase 1 – Quinone substrates and Leukemias
Different redox potentials and enzymatic reactivities are likely to be the key to how these
quinone subtypes differentially affect melanoma and leukemia cells. In addition to the
gene candidates identified as potentially involved in quinone activity in the study already
discussed (42), a strong candidate enzyme for the differential toxicity we observed is
NAD(P)H:quinone oxidoreductase 1 (QRI, NQO1, also DT-diaphorase; EC 1.6.99.2).
This enzyme, catalyzing two electron reduction of substrates, most efficiently utilizes
quinones as substrates (49). The X-ray structures of the apoenzyme at 1.7 Å resolution
and its complex with the substrate duroquinone (2.5 Å) are known (50,51).
NAD(P)H:quinone oxidoreductase 1 is a chemoprotective enzyme that protects cells
from oxidative challenge. Antitumor quinones, of the type we have identified above in
the NCI data set, may be bioactivated by this enzyme to forms that are cytotoxic.
Interestingly, there are a number of reports that correlate altered forms or alleles of this
enzyme with leukemia (52-54). These reports, associating leukemias with particular
aspects of NAD(P)H:quinone oxidoreductase 1, suggest the enzyme as likely being a
significant factor in why the external quinone subtypes, acting as particularly potent and
effective substrates, exhibit their differential selectivity toward leukemias.
Conclusion
With this cheminformatics example we have demonstrated that the machine
learning approach described above, utilizing RadViz™, has produced a novel discovery.
Two quinone subtypes were identified that possess clearly different and specific toxicities
toward the leukemia and melanoma cancer cell types. We believe this example illustrates
the potential of sophisticated machine learning approaches for uncovering new and
valuable relationships in complex, high dimensional chemical compound data sets.
2. Microarrays
Analysis of High Throughput Gene Expression Experiments: Effects of
Normalization Methods on Gene Expression Analysis Clustering Results.
Completion of the Human Genome Project has made possible the study of the gene
expression levels of over 30,000 genes [14,15; although a ‘final’ human genome
sequence is scheduled for release in Spring, 2003]. Major technological advances have
made possible the use of DNA microarrays to speed up this analysis. Even though the
first microarray experiment was published only in 1995, by October 2002 a PubMed
query of the microarray literature yielded more than 2300 hits, indicating explosive growth
in the use of this powerful technique. DNA microarrays take advantage of the convergence
of a number of technologies and developments, including: robotics and miniaturization of
features to the micron scale (currently 20-200 µm surface feature sizes for
spotting/printing and immobilizing sequences for hybridization experiments), DNA
amplification by PCR, automated and efficient oligonucleotide synthesis and labeling
chemistries, and sophisticated bioinformatics approaches. It is this latter aspect of the
development of microarray technology that our Phase II proposal addresses.
One significant aspect of analyzing microarray gene expression data is the need for
normalization to remove non-biological sources of variation (noise), in order to make
meaningful comparisons of data from different microarrays. The noise results from
differences in individual chips, labeling chemistry, length of immobilized oligonucleotide
sequence, different optical properties of various data scanners and other sources. The
importance of understanding and controlling these variables has been underscored by the
apparent lack of reproducibility of some published microarray studies. This has led to the
establishment of the MIAME publication guidelines that detail the following
requirements for describing microarray experiments: 1) experimental design, 2) array
design and the name and location of array spots, 3) sample name, extraction and labeling,
4) hybridization protocols, 5) image measurement methods, and 6) controls used [16-18].
Normalization techniques that have been applied include simple linear scaling, locally
linear transformations, and other nonlinear methods. To some extent, the techniques used
depend on the type of array being used. In 2 channel arrays, for example cDNA
microarrays, the issue is primarily within-chip normalization to correct distortions based
on location and signal intensity. Between-chip normalization is less of an issue for these
arrays because one channel usually contains a reference tissue that is common to all
arrays in the experiment. Between-chip normalization has the potential of introducing
more noise than it eliminates. A number of thorough discussions of normalization
techniques for cDNA arrays have been presented [19,20]. These normalization
approaches include dye swap experiments to correct for differences between the two
channels, using the lowess function to correct for global intensity based differences (i.e.
across all genes on the chip), and using the lowess function locally to account for spatial
and print-tip differences.
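The global intensity-based lowess correction described above can be sketched in MA-plot form: compute M = log2(R/G) and A = ½·log2(R·G), estimate the intensity-dependent trend of M as a function of A, and subtract it. In this illustrative sketch a running-window median stands in for a full lowess fit (an assumption to keep the code dependency-free); the `red` and `green` channel arrays are hypothetical inputs.

```python
import numpy as np

def intensity_normalize(red, green, window=101):
    """MA-style global normalization for a two-channel array:
    remove the intensity-dependent trend in M = log2(R/G) as a
    function of A = 0.5*log2(R*G). A running-window median over
    spots ordered by A stands in for the lowess fit."""
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    order = np.argsort(a)                   # spots ordered by overall intensity
    m_sorted = m[order]
    trend = np.empty_like(m)
    half = window // 2
    for rank, idx in enumerate(order):
        lo, hi = max(0, rank - half), min(len(m), rank + half + 1)
        trend[idx] = np.median(m_sorted[lo:hi])
    return m - trend                        # normalized log-ratios, trend removed
```

After correction, the residual log-ratios are centered near zero and no longer track spot intensity, which is the property the lowess approaches above are designed to achieve.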
Affymetrix microarrays are used in the majority of applications. For these arrays,
between-chip normalization is an important issue, and is closely related to the method of
calculation of gene expression value from multiple probes for each gene. Techniques
proposed for calculating expression include the original Affymetrix method of average
difference between perfect match and mismatch probes, the Model Based Expression
Index approach of Li and Wong [21], and the Robust Multichip Average approach of
Irizarry et al [22]. Durbin et al [23] have suggested a variance-stabilizing transformation
to aid microarray analysis. There is the additional consideration of whether to normalize
data based on probe level measurements or expression calculations, and whether to use a
baseline array for comparison or to normalize over the complete set of data. Bolstad et al
[24] present comparisons of some of these techniques. They recommend probe level and
complete data methods in general, and quantile normalization in particular. They also
found that the invariant set normalization approach of Schadt et al [25] using a baseline
array gives results that are comparable to complete data methods. Our experience has
shown that quantile normalization works well even when probe level data are not
available. However, quantile normalization makes the implicit assumption that the data
on all chips have the same distribution. For some datasets this may not be appropriate.
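Quantile normalization as recommended by Bolstad et al [24] can be sketched in a few lines: force every chip to share one reference distribution, taken as the mean of the sorted columns. This is a minimal illustration (ties are broken arbitrarily here, a simplification of the published averaging rule), not a reimplementation of any particular package.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (chips) of X so every chip
    shares the same distribution: the mean of the sorted columns."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # within-chip rank of each value
    reference = np.sort(X, axis=0).mean(axis=1)         # common reference distribution
    return reference[ranks]
```

Note that the transform is monotonic within each chip, so the rank order of expression values on any one chip is preserved; only the distribution is replaced. This is exactly why the same-distribution assumption discussed above matters.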
Different normalization and modeling techniques can lead to widely varying
judgments and interpretations of differential gene expression. In this Phase II proposal,
we aim to investigate the effects of different data normalizations on clustering. We will
compare quantile normalization, invariant set normalization, lowess local regression, and
simple linear scaling. We will focus primarily on Affymetrix type arrays, but we will
ensure that the platform we develop supports the adaptation and application of these
techniques to two channel microarrays where appropriate. We will also investigate the
effects of different modeling techniques on clusters. The more successful a technique is
at removing noise, the more likely it is that the clusters generated will be accurate and
will have biological meaning. On the other hand, the quality and stability of clusters
could be a useful measure of the appropriateness of the normalization and modeling
techniques used. Therefore, a goal of this Phase II proposal is to provide users with
decision making tools to decide which normalization approach is optimal or close to
optimal for a given microarray dataset. Also, the normalization tools will be integrated
with the perturbation algorithm output, discussed below, to determine the stability of
clusters from different normalizations. In this way, we can provide users with the identity
of those genes that are most stable within clusters, and those that are unstable and jump
between clusters as a result of different normalizations.
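The cluster-stability idea above can be sketched as follows: cluster the genes under two different normalizations, match the resulting clusters by maximum overlap, and flag the genes that jump between clusters. The plain k-means and the greedy overlap-matching rule used here are illustrative assumptions, not the proposal's actual perturbation algorithm.

```python
import numpy as np

def kmeans_labels(X, k, seed=0, iters=100):
    """Plain k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def jumping_genes(labels_a, labels_b, k):
    """Relabel clustering B onto A by maximum overlap, then report the
    indices of genes whose cluster assignment changed between runs."""
    remap = {j: max(range(k), key=lambda i: np.sum((labels_b == j) & (labels_a == i)))
             for j in range(k)}
    remapped = np.array([remap[l] for l in labels_b])
    return np.flatnonzero(remapped != labels_a)
```

Genes returned by `jumping_genes` are the unstable ones in the sense described above; an empty result indicates the two normalizations produce the same partition.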
NCI Lung Cancer – 3 Classes
Introduction
An important use for gene expression data is the automatic distinction between
normal and lung cancer tissue samples. To assess the feasibility of such a task,
AnVil, in collaboration with the NCI, examined two example data sets of patients
with and without various lung cancers. The initial aim of AnVil’s task was simply to
determine whether a patient has lung cancer based on microarray data collected from
lung tissue samples. However, AnVil went one step further and analyzed a three-class
problem: distinguishing between normal tissue and two subclasses of non-small
cell lung carcinoma, adenocarcinoma and squamous cell carcinoma. Given the
numerous choices and various complexities of this task, AnVil took a systematic approach
that included three primary steps: selection, evaluation and relevance. The first step
involves making an intelligent selection of genes via some modeling technique. Because
the selection of genes depends on the number of genes and the selection algorithm, AnVil
experimented with multiple variations. Next, these selected genes are evaluated by some
classification algorithm to determine their ability to distinguish between normal and the
two cancer types. Here AnVil opted to try a number of different classification algorithms
and to check for consistency between these models. The final step adds domain
knowledge to the process by determining the biological relevance of these genes and their
known associations with lung cancer.
Available Data
AnVil was provided with two data sets of patients with and without lung cancer.
Both data sets included gene expressions of patient malignant or normal tissue samples
using Affymetrix’s Human Genome U95 Set [1]; only the first of five oligonucleotide
based GeneChip® arrays was used in this experiment. Chip A of the HG U95 array set
contains roughly 12,000 full-length genes and a number of controls.
The first data set was provided directly from NCI, courtesy of Jin Jen and Tatiana
Dracheva, and included 75 patient samples. This set contained 17 normal samples, 30
adenocarcinomas (6 doubles), and 28 squamous cell carcinomas (2 doubles). Doubles
represent replicate samples prepared at different times using different equipment from the
original sample preparation.
A second patient set of 157 samples was provided via public access, courtesy
of Matthew Meyerson at the Dana-Farber Cancer Institute [2]. This set included 17
normal samples, 139 adenocarcinomas (127 of these with supporting information) and 21
squamous cell carcinomas. In addition, the Meyerson data set also included 6 small cell
lung cancer samples and 20 pulmonary carcinoid tumors, which AnVil set aside
during this analysis.
Because AnVil was dealing with two data sets from different sources, with
microarray measurements taken at multiple times, we needed to consider a normalization
procedure. For this particular analysis we kept to a simple scaling of each sample to a
mean of 200. As with our systematic approach to selecting and validating sets of genes,
AnVil has also undertaken an analysis of various normalization techniques, though
no conclusions are yet available.
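The global scaling used here, rescaling each sample to a fixed mean of 200, is a one-liner; the genes-by-samples array orientation in this sketch is an assumption.

```python
import numpy as np

def scale_to_target_mean(X, target=200.0):
    """Rescale each column (sample) of a genes x samples matrix X
    so that its mean expression equals `target`."""
    return X * (target / X.mean(axis=0, keepdims=True))
```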
In addition to considering normalization of the samples within each data
set and between the two data sets, AnVil took this opportunity to treat each data set
independently. By keeping the data sets separate we could use one, the NCI data set, for
training and gene selection whilst using the second, Meyerson, data set for independent
validation of the selected genes.
Gene Sets
The first step of AnVil’s three-part analysis was the selection of genes that could
distinguish between normal lung tissue and the two types of non-small cell lung
carcinomas, adenocarcinomas and squamous cell carcinomas. When making a selection
of genes for this task we needed to consider two requirements: size and procedure. It is quite
clear that one does not need to include all the genes present on the HG U95 chip A;
there are over 12,000 genes and most of these provide no information, that is, many of
these genes do not provide adequate expression values when only looking at normal
versus cancerous lung cells. Consequently a decision had to be made as to how many
genes to select. Secondly, there needed to be a mechanism by which these genes
could be selected, a reproducible procedure for choosing the best set of genes that defines
the three tissue types.
In order to understand the best number of genes for this three-class problem,
AnVil took a systematic approach, generating sets of genes that varied in size from
very small to somewhat large relative to the 12,000 genes available. As such, we decided
to proceed by generating gene sets ranging from one up through one hundred genes to provide
an initial understanding as to how many genes might be best for distinguishing the three
tissue types. AnVil set the upper bound at one hundred since most published research
reports small gene sets, mostly around twenty or so genes.
Figure 1. Example RadViz™ Gene Selection
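The RadViz™ layout behind Figure 1 follows the published spring paradigm [45-47]: each feature sits as an anchor on a circle, and each normalized feature value pulls its sample toward the corresponding anchor. A minimal sketch of that projection rule (the normalization-to-[0,1] step is a common convention, assumed here), not AnVil's proprietary class-discrimination algorithm:

```python
import numpy as np

def radviz_positions(X):
    """Project rows of X (samples x features) into the unit circle.
    Anchors sit evenly on the circle; each sample lands at the
    weighted average of the anchors, weighted by its normalized values."""
    n_features = X.shape[1]
    theta = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)  # each feature to [0, 1]
    weights = Xn.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0
    return (Xn @ anchors) / weights
```

A sample whose weight is concentrated on one feature lands on that feature's anchor; balanced samples land near the center, which is what makes the sector layouts in the RadViz figures readable.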
Next came the selection procedures. Once again there are many possible ways by
which one might select subsets of genes from the initial 12,000, so the question was
which procedure would be the most fruitful. AnVil settled on four selection algorithms:
random, F-statistic, RadViz™, and PURS™. It was apparent that we needed some baseline
for how well any set of genes of a given size would perform, so we started by
generating random gene sets, ten independent gene sets for each gene selection size.
These random sets provided the best unintelligent estimate of how well any set of
genes distinguishes between normal tissue and the two cancer types. Secondly, we included an
algorithm using the F-statistic to select the genes with the highest
significance in distinguishing the three classes. One would assume that by adding some
intelligence about the data we could select more appropriate genes than by simply choosing
random sets. A third algorithm, proprietary to AnVil, involves applying the class
discrimination algorithm of RadViz™ to this three-class problem (see Figure 1 for an
example). The final algorithm, also proprietary to AnVil, is PURS™, or Principal
Uncorrelated Record Selection; here genes are selected based on their uniqueness in
defining the space of expression values, which works by selecting genes that are most
different from the currently selected genes. PURS™ chooses genes independently of the three
classes, and so the initial gene selection used to start this algorithm becomes important.
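Two of the four selection ideas can be sketched generically. These are illustrative stand-ins, not AnVil's proprietary implementations: a one-way ANOVA F-statistic ranking across the three classes, and a greedy "most dissimilar from those already chosen" rule in the spirit of PURS™ (using absolute correlation as the similarity, an assumption).

```python
import numpy as np

def f_statistic(values, labels):
    """One-way ANOVA F-statistic for a single gene across class labels."""
    groups = [values[labels == g] for g in np.unique(labels)]
    grand = values.mean()
    k, n = len(groups), len(values)
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return between / within

def top_f_genes(X, labels, n_genes):
    """Rank genes (rows of X) by F-statistic, highest first."""
    scores = np.array([f_statistic(row, labels) for row in X])
    return np.argsort(scores)[::-1][:n_genes]

def greedy_dissimilar_genes(X, n_genes, start=0):
    """Greedy selection: repeatedly add the gene least correlated with the
    genes already selected. Illustrative PURS-like rule; note the choice of
    the starting gene matters, as observed in the text."""
    corr = np.abs(np.corrcoef(X))
    chosen = [start]
    while len(chosen) < n_genes:
        max_sim = corr[:, chosen].max(axis=1)   # similarity to nearest chosen gene
        max_sim[chosen] = np.inf                # never re-pick a chosen gene
        chosen.append(int(max_sim.argmin()))
    return chosen
```

The F-statistic ranking is class-aware, while the greedy rule, like PURS™, ignores the class labels entirely and only seeks coverage of the expression space.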
Set Evaluation
After generating a number of gene sets ranging in size from small to large
using the four selection procedures mentioned above, these sets of genes needed to be
evaluated as to how well they truly distinguish the three tissue types: normal,
adenocarcinoma and squamous cell carcinoma. To accomplish this step AnVil applied a
number of classification algorithms to each gene set in order to fully compare the
relationship between different numbers of genes and the algorithms used to make the gene
selections. Furthermore, AnVil performed ten-fold and leave-one-out cross-validation
using the NCI data set and independent validation using the Meyerson data set. One
thing that was apparent during our independent testing was the unbalanced tissue sampling
in the Meyerson data set: 139 adenocarcinoma samples versus only 38 combined
normal and squamous cell carcinoma samples. In total AnVil used eleven classification
algorithm versions, including variations of K-nearest Neighbors, Naïve Bayes, Support
Vector Machines, and Neural Networks. Figure 2 provides a visual representation of the
ten-fold cross-validation results for all gene sets and algorithms by their associated best
classification score.
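The evaluation loop above can be sketched with a single nearest-neighbor classifier standing in for the eleven algorithm versions (an illustrative simplification, not the study's actual classifiers):

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """Classify each test row by its single nearest training row (Euclidean)."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[d2.argmin(axis=1)]

def ten_fold_accuracy(X, y, n_folds=10, seed=0):
    """Shuffle, split into folds, and report overall held-out accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    correct = 0
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)       # everything not in the held-out fold
        preds = one_nn_predict(X[train], y[train], X[fold])
        correct += int((preds == y[fold]).sum())
    return correct / len(y)
```

Running this once per (gene set, classifier) pair and keeping the best score per gene set is what a plot like Figure 2 summarizes.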
Figure 2. Classification Results. # of Variables – number of genes selected; light gray circles – random gene sets; yellow squares – F-statistic gene sets; blue circles – RadViz™ gene sets; red triangles – PURS™ gene sets.
Figure 3. Sample Misclassifications. Gray (left) – normal samples; blue (center) – adenocarcinomas; yellow (right) – squamous cell carcinomas. The top row indicates the known tissue type.
An interesting observation, when comparing the classifications of samples across
different gene sets and the various classification algorithms, was the finding of
consistently misclassified samples. In figure 3 we present an example visualization of the
classification results for each sample (displayed vertically) within the NCI data set.
Notice the two continuous vertical lines; these represent two samples
that were misclassified by all the classification algorithms. Given that we had no
supporting information for the NCI patients, we could not draw any inferences about
these cases other than recommending that these patients be resampled. When
analyzing the consistent misclassifications of the Meyerson samples we were able to
identify six patients, and after reviewing the patients’ supporting information we found
that these samples consisted of mixed tissue types and the classification algorithms caught
the differences.
Biological Relevance
ML’s stuff…
MeSH – Informax
GO ontology
Conclusion
[Overview of the approach taken]
1. Selection of gene sets
2. Evaluation of gene sets
3. Biological relevance
Random
F-statistics
Radviz
PURS - Intelligent Principal Uncorrelated Record Selection (dissimilar)
K-nearest Neighbors
Naïve Bayes
Support Vector Machines
Neural Network
References
1. Affymetrix, www.affymetrix.com.
2. Matthew Meyerson Lab, Dana-Farber Cancer Institute, http://research.dfci.harvard.edu/meyersonlab/lungca/data.html.
3. Proteomics
Conclusions
Acknowledgements
AnVil and the authors gratefully acknowledge support from two SBIR Phase I grants,
R43 CA94429-01 and R43 CA096179-01, from the National Cancer Institute. Also,
support is acknowledged from ………..X Y Z
References
1. A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Dissertation, The University of Texas at Austin, May 2002.
2. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
3. J. A. Hartigan. Clustering Algorithms. New York: John Wiley & Sons, 1975.
4. D. Fasulo. “An Analysis of Recent Work on Clustering Algorithms.” http://www.cs.washington.edu/homes/dfasulo/clustering.ps, April 26, 1999.
5. C. Fraley and A. E. Raftery. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Technical Report no. 380, Department of Statistics, University of Washington, Seattle, October 2000.
6. F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Chichester: John Wiley & Sons, 1999.
7. Everitt, B., Cluster Analysis, Halsted Press, New York (1980).
8. Schaffer, C., Selecting a classification method by cross-validation, Machine Learning, 13:135-143 (1993).
9. Feelders, A., Verkooijen, W., Which method learns most from the data? Proc. of 5th International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, January 1995, pp. 219-225 (1995).
10. Dietterich, T.G., Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1924.
11. Cheng, J., Greiner, R., Comparing Bayesian network classifiers. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI ’99), 101-107, Morgan Kaufmann Publishers (1999).
12. Salzberg, S. L., On Comparing Classifiers: A Critique of Current Research and Methods, Data Mining and Knowledge Discovery, 1999, 1:1-12, Kluwer Academic Publishers, Boston.
13. Ramaswamy, S., Ross, K.N., Lander, E.S. and Golub, T.R. A molecular signature of metastasis in primary solid tumors. Science, 22, 1-5.
14. Chaussabel, D. and Sher, A. Mining microarray expression data by literature profiling. Genome Biology, 3, 1-16.
15. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.) Advances in
knowledge discovery and data mining, AAAI/MIT Press, 1996.
16. B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy of Information
Visualization,” presented at IEEE Symposium on Visual Languages '96, Boulder, CO,
1996.
17. J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
18. R. Rao and S. K. Card, “The Table Lens: Merging Graphical and Symbolic
Representations in an Interactive Focus+Context Visualization for Tabular Information,”
presented at ACM CHI '94, Boston, MA, 1994.
19. D. F. Andrews, “Plots of High-Dimensional Data,” Biometrics, vol. 29, pp. 125-136, 1972.
20. H. Chernoff, “The Use of Faces to Represent Points in k-Dimensional Space Graphically,” Journal of the American Statistical Association, vol. 68, pp. 361-368, 1973.
21. J. Beddow, “Shape Coding of Multidimensional Data on a Microcomputer Display,” presented at IEEE Visualization '90, San Francisco, CA, 1990.
22. A. Inselberg, “The Plane with Parallel Coordinates,” Special Issue on Computational Geometry: The Visual Computer, vol. 1, pp. 69-91, 1985.
23. D. A. Keim and H.-P. Kriegel, “VisDB: Database Exploration Using Multidimensional Visualization,” IEEE Computer Graphics and Applications, vol. 14, pp. 40-49, 1994.
24. J. W. Sammon, Jr., “A Nonlinear Mapping for Data Structure Analysis,” IEEE Transactions on Computers, vol. 18, pp. 401-409, 1969.
25. P. Hoffman and G. Grinstein, “Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information Visualizations,” presented at NPIV '99 (Workshop on New Paradigms in Information Visualization and Manipulation), 1999.
26. H. Hotelling, “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, vol. 24, pp. 417-441, 498-520, 1933.
27. T. Hastie and W. Stuetzle, “Principal Curves,” Journal of the American Statistical Association, vol. 84, pp. 502-516, 1989.
28. D. Asimov, “The Grand Tour: A Tool for Viewing Multidimensional Data,” SIAM Journal on Scientific and Statistical Computing, vol. 6, pp. 128-143, 1985.
29. J. H. Friedman, “Exploratory Projection Pursuit,” Journal of the American Statistical Association, vol. 82, pp. 249-266, 1987.
30. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, “Engineering Applications of the Self-Organizing Map,” Proceedings of the IEEE, 1996.
31. G. Grinstein, P. E. Hoffman, S. Laskowski, and R. Pickett, “Benchmark Development for the Evaluation of Visualization for Data Mining,” in Information Visualization in Data Mining and Knowledge Discovery, The Morgan Kaufmann Series in Data Management Systems, U. Fayyad, G. Grinstein, and A. Wierse, Eds., 1st ed: Morgan Kaufmann Publishers, 2001.
32. Voigt, K. and Bruggeman, R. (1995) Toxicology databases in the metadatabank of online databases, Toxicology, 100, 225-240.
33. Weinstein, J.N., et al. (1997) An information-intensive approach to the molecular pharmacology of cancer, Science, 275, 343-349.
34. Shi, L.M., Fan, Y., Lee, J.K., Waltham, M., Andrews, D.T., Scherf, U., Paul, K.D., and Weinstein, J.N. (2000) J. Chem. Inf. Comput. Sci., 40, 367-379.
35. Bai, R.L., Paul, K.D., Herald, C.L., Malspeis, L., Pettit, G.R., and Hamel, E. (1991) Halichondrin B and homohalichondrin B, marine natural products binding in the vinca domain of tubulin: discovery of tubulin-based mechanism of action by analysis of differential cytotoxicity data, J. Biol. Chem., 266, 15882-15889.
36. Cleveland, E.S., Monks, A., Vaigro-Wolff, A., Zaharevitz, D.W., Paul, K., Ardalan, K., Cooney, D.A., and Ford, H. Jr. (1995) Site of action of two novel pyrimidine biosynthesis inhibitors accurately predicted by the COMPARE program, Biochem. Pharmacol., 49, 947-954.
37. Gupta, M., Abdel-Megeed, M., Hoki, Y., Kohlhagen, G., Paul, K., and Pommier, Y. (1995) Eukaryotic DNA topoisomerase-mediated DNA cleavage induced by a new inhibitor: NSC 665517, Mol. Pharmacol., 48, 658-665.
38. Shi, L.M., Myers, T.G., Fan, Y., O’Connor, P.M., Paul, K.D., Friend, S.H., and Weinstein, J.N. (1998) Mining the National Cancer Institute Anticancer Drug Discovery Database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity, Mol. Pharmacology, 53, 241-251.
39. Ross, D.T., et al. (2000) Systematic variation of gene expression patterns in human cancer cell lines, Nat. Genet., 24, 227-235.
40. Staunton, J.E.; Slonim, D.K.; Coller, H.A.; Tamayo, P.; Angelo, M.P.; Park, J.; Scherf, U.; Lee, J.K.; Reinhold, W.O.; Weinstein, J.N.; Mesirov, J.P.; Lander, E.S.; Golub, T.R. Chemosensitivity prediction by transcriptional profiling, Proc. Natl. Acad. Sci., 2001, 98, 10787-10792.
41. Marx, K.A.; O’Neil, P.; Hoffman, P.; Ujwal, M.L. Data Mining the NCI Cancer Cell Line Compound GI50 Values: Identifying Quinone Subtypes Effective Against Melanoma and Leukemia Cell Classes, J. Chem. Inf. Comput. Sci., 2003, in press.
42. Blower, P.E.; Yang, C.; Fligner, M.A.; Verducci, J.S.; Yu, L.; Richman, S.; Weinstein, J.N. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data, The Pharmacogenomics Journal, 2002, 2, 259-271.
43. Scherf, U.; Ross, D.T.; Waltham, M.; Smith, L.H.; Lee, J.K.; Tanabe, L.; Kohn, K.W.; Reinhold, W.C.; Myers, T.G.; Andrews, D.T.; Scudiero, D.A.; Eisen, M.B.; Sausville, E.A.; Pommier, Y.; Botstein, D.; Brown, P.O.; Weinstein, J.N. A gene expression database for the molecular pharmacology of cancer, Nat. Genet., 2000, 24, 236-247.
44. Schafer, J.L. Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall/CRC, 1997.
45. RadViz, URL: www.anvilinfo.com
46. Hoffman, P.; Grinstein, G.; Marx, K.; Grosse, I.; Stanley, E. DNA visual and analytical data mining, IEEE Visualization 1997 Proceedings, pp. 437-441, Phoenix.
47. Hoffman, P.; Grinstein, G. Multidimensional information visualization for data mining with application for machine learning classifiers, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, San Francisco, 2000.
48. Bucci, C.; Thomsen, P.; Nicoziani, P.; McCarthy, J.; van Deurs, B. Rab7: a key to lysosome biogenesis, Mol. Biol. Cell, 2000, 11, 467-480.
49. Ross, D. NAD(P)H:quinone oxidoreductases, Encyclopedia of Molecular Medicine, 2001, 2208-2212.
50. Faig, M.; Bianchet, M.A.; Talalay, P.; Chen, S.; Winski, S.; Ross, D.; Amzel, L.M. Structure of recombinant human and mouse NAD(P)H:quinone oxidoreductase: species comparison and structural changes with substrate binding and release, Proc. Natl. Acad. Sci., 2000, 97, 3177-3182.
51. Faig, M.; Bianchet, M.A.; Winski, S.; Moody, C.J.; Hudnott, A.H.; Ross, D.; Amzel, L.M. Structure-based development of anticancer drugs: complexes of NAD(P)H:quinone oxidoreductase 1 with chemotherapeutic quinones, Structure (Cambridge), 2001, 9, 659-667.
52. Smith, M.T.; Wang, Y.; Kane, E.; Rollinson, S.; Wiemels, J.L.; Roman, E.; Roddam, P.; Cartwright, R.; Morgan, G. Low NAD(P)H:quinone oxidoreductase 1 activity is associated with increased risk of acute leukemia in adults, Blood, 2001, 97, 1422-1426.
53. Wiemels, J.L.; Pagnamenta, A.; Taylor, G.M.; Eden, O.B.; Alexander, F.E.; Greaves, M.F. A lack of a functional NAD(P)H:quinone oxidoreductase allele is selectively associated with pediatric leukemias that have MLL fusions. United Kingdom Childhood Cancer Study Investigators, Cancer Res., 1999, 59, 4095-4099.
54. Naoe, T.; Takeyama, K.; Yokozawa, T.; Kiyoi, H.; Seto, M.; Uike, N.; Ino, T.; Utsunomiya, A.; Maruta, A.; Jin-nai, I.; Kamada, N.; Kubota, Y.; Nakamura, H.; Shimazaki, C.; Horiike, S.; Kodera, Y.; Saito, H.; Ueda, R.; Wiemels, J.; Ohno, R. Analysis of the genetic polymorphism in NQO1, GST-M1, GST-T1 and CYP3A4 in 469 Japanese patients with therapy-related leukemia/myelodysplastic syndrome and de novo acute myeloid leukemia, Clin. Cancer Res., 2000, 6, 4091-4095.
Other References (14-25 in CC Grant)
35. Venter, J.C., et al., The Sequence of the Human Genome. Science, 291, 1303-1351 (2001).
36. Lander, E.S., et al., Initial Sequencing and Analysis of the Human Genome. Nature, 409, 860-921 (2001).
37. Stoeckert, C.J., et al., Microarray databases: standards and ontologies. Nat. Genet. 32 (Suppl), 469-473.
38. No author, Microarray standards at last. Nature, 419, 323.
39. Ball, C., et al., Standards for microarray data. Science, 298, 539.
40. Quackenbush, J. (2001) Computational analysis of cDNA microarray data. Nature Reviews Genetics 2(6): 418-428.
41. Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1), 111-139.
42. Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error applications. Genome Biology, 2(8).
43. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K., Scherf, U., and Speed, T.P. (2003) Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics (in press).
44. Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics, 18, S105-S110.
45. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2): 185-193.
46. Schadt, E.E., Li, C., Ellis, B., and Wong, W.H. (2002) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem. 84(S37), 120-125.
Figure Legends
Figure 1. RadViz figure
Figure 2. Cancer cell line functional class definition using a hierarchical clustering (1 – Pearson coefficient) dendrogram for 60 cancer cell lines based upon gene expression data. Five well defined clusters are shown highlighted. We treat the highlighted cell line clusters as the truth for the purpose of carrying out studies to identify which chemical compounds are highly significant in their classifying ability.
Figure 3. RadViz™ result for the 3-class problem classification of melanoma, leukemia and non-melanoma, non-leukemia cancer cell types at the p < .01 criterion. Cell lines are symbol coded as described in the figure. A total of 14 compounds (bottom of layout) were most effective against melanoma and they are laid out on the melanoma sector (counterclockwise from most to least effective). For leukemia, 30 compounds were identified as most effective and are laid out in that sector. Some 8 compounds were found to be most effective against non-melanoma, non-leukemia cell lines and are laid out in that sector.
Figure 4. One example of each of the two quinone subtypes selected in Figure 3 is displayed. A. The most highly effective of the 11 internal quinone subtype compounds most effective against melanoma is shown. B. The most highly effective of the 6 external quinone subtype compounds most effective against leukemia is shown.
Figure 5. RadViz™ result for the 3-class problem classifying the following three classes: acute lymphoblastic leukemia (ALL), non-ALL leukemia (other-Leukemia) and non-leukemia cell classes at p < .05. We used as input the 30 compounds identified in the Figure 3 classification as most effective against all leukemias at the p < .01 selection
criterion. Cell lines are symbol coded as described in the figure. The NSC numbers of the compounds selected to classify the classes are presented in the order of their ranking from most effective to least effective moving counterclockwise within each class sector.
[Figure 2 graphic: cluster dendrogram of the 60 cancer cell lines (vertical axis: Height, 0.0 – 1.0); individual cell line labels omitted here.]
[Chemical structure drawings: one panel of 11 ranked compounds (A; NSC 670762, 670766, 642061, 658450, 602617, 690432, 690434, 644902, 642009, 656239, 628507) and one panel of 6 ranked compounds (B; NSC 648147, 641395, 618315, 641394, 640192, 641396), plus NSC 621179; atom-level residue from the drawings omitted.]