NCI 7-31-03 Proceedings
Applications of Machine Learning Approaches
Integrating Analytic Methods and Statistics with High
Dimensional Visualizations to Different Problems in
Cancer Diagnosis and Detection
John McCarthy*, Kenneth A. Marx, Philip O’Neil, M.L.
Ujwal, Patrick Hoffman, Alex Gee and Natasha
Markuzon
AnVil, Inc.
25 Corporate Drive
Burlington, MA 01803
*corresponding author
[email protected];
(781) 272-1600 X 460
Abstract
Introduction to Data Analysis by Machine Learning
Overview of Machine Learning and Visualization
Three of the major techniques in machine learning are clustering, classification, and
feature reduction. Classification and clustering are also broadly known as supervised
and unsupervised learning, respectively. In supervised learning, the object is to learn predetermined class
assignments from other data attributes. For example, given a set of gene expression data
for samples with known diseases, a supervised learning algorithm might learn to classify
disease states based on patterns of gene expression. In unsupervised learning, there either
are no predetermined classes or class assignments are ignored. Cluster analysis is the
process by which data objects are grouped together based on some relationship defined
between objects. In both classification and clustering an explicit or implicit model is
created from the data, which can help to predict future data instances or understand the
physical process behind the data. Creating these models can be a very compute-intensive
task, such as training a neural network. Feature reduction or selection reduces the data
attributes used in creating a data model. This process can reduce analysis time and create
simpler and (sometimes) more accurate models.
In the three cancer examples presented, all three machine learning techniques are used
and will be described; however, one of the primary analysis techniques used is high
dimensional visualization. One particular visualization, RadViz, incorporates all three
machine learning techniques in an intuitive, interactive display. Two other high
dimensional visualizations, Parallel Coordinates and PatchGrid (similar to
HeatMap), are also used to analyze and display results.
Classification techniques used:
RadViz – rearranging dimensions based on T-statistic – a visual classifier
Naïve Bayes (Weka)
Support Vector Machines (Weka)
Instance Based or K – nearest neighbor (Weka)
Logistic Regression (Weka)
Neural Net (Weka)
Neural Net (Clementine)
Validation techniques:
10-fold cross-validation
Hold-one-out
Training and Test datasets
Clustering techniques:
RadViz – arranging dimensions not based on class label – ex. Principal Components
Hierarchical with Pearson correlation
Feature Reduction techniques used:
Pairwise t-statistic - equal variance used in RadViz
F-statistic – select top dimensions based on the highest F-statistic computed from
class labels
PURS – Principal Uncorrelated Record Selection
Initially select some “seed” dimensions, e.g. based on a high t or F statistic;
repeatedly delete dimensions that correlate highly with the seed dimensions, and
add any dimension that is not correlated to the “seed” set. Repeat, slowly
reducing the correlation threshold, until the “seed” dimensions are reduced to the
desired number.
Random – randomly selected dimensions and build/test classifier
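The PURS procedure described above can be sketched as follows. This is a minimal illustration of our reading of the description, not the production implementation; the function name and the parameters `thresh` and `step` are our own.

```python
import numpy as np

def purs(X, scores, n_keep, thresh=0.95, step=0.05):
    """Principal Uncorrelated Record Selection (sketch).

    X: (samples, dims) data matrix; scores: per-dimension relevance
    (e.g. t- or F-statistics). Scan dimensions from best-scoring down,
    keeping ("seeding") each one only if it is not highly correlated
    with an already-kept seed; lower the correlation threshold until
    at most n_keep seeds survive.
    """
    order = list(np.argsort(scores)[::-1])  # best-scoring dims first
    while True:
        seeds = []
        for d in order:
            # keep d only if it is not highly correlated with any seed
            if all(abs(np.corrcoef(X[:, d], X[:, s])[0, 1]) < thresh
                   for s in seeds):
                seeds.append(d)
        if len(seeds) <= n_keep or thresh <= step:
            return seeds[:n_keep]
        thresh -= step  # tighten threshold, pruning more dimensions
```

Lowering the threshold treats more dimension pairs as "correlated," so each pass prunes more redundant dimensions while retaining the highest-scoring representative of each correlated group.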
One of AnVil’s strengths is our ability to carry out integrated data mining and
visualization analyses on large, complex nonlinear datasets that may have as many as
50,000 data dimensions. Therefore, we have a practical way to overcome the need to
reduce dimensionality early on in addressing any specific problem. One advantage this
mechanism provides is the ability to simultaneously handle large numbers of data
dimensions, enabling us, for example, to add contextual knowledge into the already large-dimensionality datasets that researchers have to analyze; the contextual knowledge is
simply considered as additional data dimensions. We discuss the distinct advantages of
our technology in greater detail in the following sections.
The Importance of High-dimensional Data Visualization and its Integration with
Analytic Data Mining Techniques. Visualization, data mining, statistics, as well as
mathematical modeling and simulation are all methodologies that can be used to enhance
the discovery process [15]. AnVil’s expertise lies in a combination of analytic data
mining techniques integrated with advanced high-dimensional visualizations (HDVs).
There are numerous visualizations and a good number of valuable taxonomies (See [16]
for an overview of taxonomies). Most information visualization systems focus on tables
of numerical data (rows and columns), such as 2D and 3D scatterplots [17], although
many of the techniques apply to categorical data. Looking at the taxonomies, the
following stand out as high-dimensional visualizations: Matrix of scatterplots [17]; Heat
maps [17]; Height maps [17]; Table lens [18]; Survey plots [19]; Iconographic displays
[20]; Dimensional stacking (general logic diagrams) [21]; parallel coordinates [22]; Pixel
techniques, circle segments [23]; Multidimensional scaling [23]; Sammon plots [24];
Polar charts [17]; RadViz [25]; Principal component analysis [26]; Principal curve
analysis [27]; Grand Tours [28]; Projection pursuit [29]; Kohonen self-organizing maps
[30]. Grinstein et al. [31] have compared the capabilities of most of these visualizations.
Historically, static displays include histograms, scatterplots, and large numbers of their
extensions. These can be seen in most commercial graphics and statistical packages
(Spotfire, S-PLUS, SPSS, SAS, MATLAB, Clementine, Partek, Visual Insight’s Advisor,
and SGI’s Mineset, to name a few). Most software packages provide limited features that
allow interactive and dynamic querying of data.
HDVs have been limited to research applications and have not been incorporated into
many commercial products. However, HDVs are extremely useful because they provide
insight during the analysis process and guide the user to more targeted queries.
Visualizations fall into two main categories: (1) low-dimensional, which includes
scatterplots, with 2-9 variables (fields, columns, parameters), and (2) high-dimensional, with 100-1000+ variables. Parallel Coordinates or a spider chart or radar
display in Microsoft Excel can display up to 100 dimensions, but place a limit on the
number of records that can be interpreted. There are a few visualizations that deal with a
large number (>100) of dimensions quite well: Heatmaps, Heightmaps, Iconographic
Displays, Pixel Displays, Parallel Coordinates, Survey Plots, and RadViz. When more
than 1000 records are displayed, the lines overlap and cannot be distinguished. Of these,
only RadViz is uniquely capable of dealing with ultra–high-dimensional (>10,000
dimensions) datasets, and we discuss it in detail below.
RadViz™ is a visualization and classification tool that uses a spring analogy for
placement of data points and incorporates machine learning feature reduction techniques
as selectable algorithms [13-15]. The “force” that any feature exerts on a sample point is
determined by Hooke’s law: f = kd. The spring constant, k, ranging from 0.0 to 1.0, is
the value of the feature for that sample, and d is the distance between the sample point
and the perimeter point on the RadViz circle assigned to that feature (see Figure 1). The
placement of a sample point, as described in Figure 1 is determined by the point where
the total force determined vectorially from all features is 0. The RadViz display combines
the n data dimensions into a single point for the purpose of clustering, but it also
integrates analytic embedded algorithms in order to intelligently select and radially
arrange the dimensional axes. This arrangement is performed through Autolayout, a
unique, proprietary set of algorithmic features, based upon the dimensions’ significance
statistics, that optimizes clustering by maximizing the distance separating clusters of
points. The default arrangement is to have all features equally spaced around the
perimeter of the circle, but the feature reduction and class discrimination algorithms
arrange the features unevenly in order to increase the separation of different classes of
sample points. The feature reduction technique used in all figures in the present work is
based on the t statistic with Bonferroni correction for multiple tests. The circle is divided
into n equal sectors or “pie slices,” one for each class. Features assigned to each class are
spaced evenly within the sector for that class, counterclockwise in order of significance
(as determined by the t statistic, comparing samples in the class with all other samples).
As an example, for a 3 class problem, features are assigned to class 1 based on the
sample’s t-statistic, comparing class 1 samples with class 2 and 3 samples combined.
Class 2 features are assigned based on the t-statistic comparing class 2 values with class 1
and 3 combined values, and Class 3 features are assigned based on the t-statistic
comparing class 3 values with class 1 and class 2 combined. Occasionally, when large
portions of the perimeter of the circle have no features assigned to them, the data points
would all cluster on one side of the circle, pulled by the unbalanced force of the features
present in other sectors. In this case, a variation of the spring force calculation is used,
where the features present are effectively divided into qualitatively different forces
comprised of high and low k value classes. This is done by requiring k to range from –1.0 to 1.0. The net effect is to make some of the features ‘pull’ (high or +k values) and
others ‘push’ (low or –k values) the points to spread them absolutely into the display
space, but maintaining the relative point separations. It should be stated that one can
simply do feature reduction by choosing the top features by t-statistic significance and
then apply those features to a standard classification algorithm. The t-statistic
significance is a standard method for feature reduction in machine learning approaches,
independently of RadViz. The most significant chemicals selected with the t-statistic are
the same as those selected by RadViz; RadViz has this machine learning feature
embedded in it, and it performed the selections carried out here. The advantage of
RadViz is that one immediately sees a “visual” clustering of the results of the t-statistic
selection. Generally, the amount of visual class separation correlates to the accuracy of
any classifier built from the reduced features. The additional advantage to this
visualization is that sub clusters, outliers and misclassified points can quickly be seen in
the graphical layout. One of the standard techniques to visualize clusters or class labels
is to perform a Principal Component Analysis and show the points in a 2D or 3D scatter
plot using the first few principal components as axes. Often this display shows clear
class separation, but the most important features contributing to the PCA are not easily
seen. RadViz is a “visual” classifier that can help one understand the important features
and how the features are related.
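The one-vs-rest t-statistic selection described above can be sketched as follows, assuming SciPy's `ttest_ind` for the equal-variance t-test; the function name, the return shape, and the `alpha` parameter are our own, and real RadViz additionally lays the surviving features out around the circle.

```python
import numpy as np
from scipy.stats import ttest_ind

def features_per_class(X, y, alpha=0.01):
    """For each class, test every feature with an equal-variance t-test
    of in-class samples vs. all other samples; keep features whose
    Bonferroni-corrected p-value clears alpha, ranked by |t| with the
    most significant first (the order used to lay out a RadViz sector)."""
    n_feat = X.shape[1]
    selected = {}
    for c in np.unique(y):
        t, p = ttest_ind(X[y == c], X[y != c], axis=0, equal_var=True)
        keep = np.flatnonzero(p * n_feat < alpha)  # Bonferroni correction
        selected[c] = sorted(keep, key=lambda f: -abs(t[f]))
    return selected
```

For a 3-class problem this reproduces the scheme in the text: class 1 features come from the test of class 1 samples against classes 2 and 3 combined, and so on for each class.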
The RadViz Layout:
An example of the RadViz layout is illustrated in Figure 1. There are 16 variables or
dimensions associated with the 1 point plotted (in red). Sixteen imaginary springs are
anchored to the points on the circumference and attached to one data point. The data
point is plotted where the sum of the forces is zero according to Hooke’s law (F = Kx):
where the force is proportional to the
distance x to the anchor point. The value K
for each spring is the value of the variable
for the data point. In this example the
spring constants (or dimensional values)
are higher for the yellow springs and lower
for the blue springs. Normally, many
points are plotted without showing the
spring lines. Generally, the dimensions
(variables) are normalized to have values
between 0 and 1 so that all dimensions
have “equal” weights. This spring
paradigm layout has some interesting
features.
Figure 1 One Point with 16 dimensions in RadViz
For example, if all dimensions have the same normalized value, the data point will lie
exactly in the center of the circle. If the point is a unit vector then that point will lie
exactly at the fixed point on the edge of the circle (where the spring for that dimension is
fixed). Many points can map to the same position. This represents a non-linear
transformation of the data which preserves certain symmetries and which produces an
intuitive display. Some features of this visualization include:
 it is intuitive: higher dimension values “pull” the data points closer to the dimension on the circumference
 points with approximately equal dimension values will lie close to the center
 points with similar values whose dimensions are opposite each other on the circle will lie near the center
 points which have one or two dimension values greater than the others lie closer to those dimensions
 the relative locations of the dimension anchor points can drastically affect the layout (the idea behind the “Class discrimination layout” algorithm)
 an n-dimensional line gets mapped to a line (or a single point) in RadViz
 convex sets in n-space map into convex sets in RadViz
 computation time is very fast
 1000’s of dimensions can be displayed in one visualization
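The spring layout described above has a closed form: setting the net Hooke's-law force Σ k_j(a_j − p) to zero gives p = Σ k_j a_j / Σ k_j, the k-weighted mean of the anchor points. A minimal sketch follows; the function name is our own, and real RadViz adds the Autolayout axis arrangement on top of this placement rule.

```python
import numpy as np

def radviz_positions(X):
    """Place each row of X inside the unit circle, RadViz-style.

    Feature j gets an anchor a_j on the circle; each normalized value
    k_ij acts as a spring (f = kd) pulling the point toward its anchor.
    The equilibrium where the forces sum to zero is the weighted mean
    of the anchors: p_i = sum_j k_ij a_j / sum_j k_ij.
    """
    n, m = X.shape
    # normalize each feature to [0, 1] so all dimensions get "equal" weight
    lo, hi = X.min(axis=0), X.max(axis=0)
    K = (X - lo) / np.where(hi > lo, hi - lo, 1.0)
    theta = 2 * np.pi * np.arange(m) / m
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])  # (m, 2)
    weights = K.sum(axis=1, keepdims=True)
    return (K @ anchors) / np.where(weights > 0, weights, 1.0)
```

The listed properties fall out directly: a unit vector lands on its anchor, and a row of equal values lands at the weighted mean of all anchors, i.e. the center.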
We have studied the following systems related to cancer detection:
1. GI50 compound data for 60 cancer cell lines
2. Microarray lung cancer data
3. Proteomics MS dataset
1. Data Mining the Public Domain NCI Cancer Cell Line Compound
GI50 Data Set using Supervised Learning Techniques
Introduction to the Cheminformatics Problem.
Important objectives in the overall process of molecular design for drug discovery
are: 1) the ability to represent and identify important structure features of any small
molecule, and 2) to select useful molecular structures for further study, usually using
linear QSAR models and based upon simple partitioning of the structures in n-dimensional space. To date, partitioning using non-linear QSAR models has not been
widespread, but the complexity and high-dimensionality of the typical data set requires
them. The machine learning and visualization techniques that we describe and utilize here
represent an ideal set of methodologies with which to approach representing structural
features of small molecules followed by selecting molecules via constructing and
applying non-linear QSAR models. QSAR models might typically use calculated
chemical descriptors of compounds along with computed or experimentally determined
compound physical properties and interaction parameters (ΔG, Ka, kf, kr, LD50, GI50,
etc.) with other large molecules or whole cells. The former types of experimental data
would be generated in silico (ΔG) or via high throughput screening of compound libraries
against appropriate receptors or important signaling pathway macromolecules (Ka, kf,
kr), whereas the LD50, GI50 type of data would be generated against whole cells that are
appropriate to the disease model being investigated. When the data has been generated,
then the application of machine learning can take place. We provide a sample illustration
of this process below.
The National Cancer Institute’s Developmental Therapeutics Program maintains a
compound data set (>700,000 compounds) that is currently being systematically tested
for cytotoxicity (generating 50% growth inhibition, GI50, values) against a panel of 60
cancer cell lines representing 9 tissue types. Therefore, this dataset contains a wealth of
valuable information concerning potential cancer drug pharmacophores. In a data mining
study of the 8 largest public domain chemical structure databases, it was observed that
the NCI compound data set contained by far the largest number of unique compounds of
all the databases (32). The application of sophisticated machine learning techniques to
this unique NCI compound dataset represents an important open problem that motivated
the study we present in this report. Previously, this data set has been mined by supervised
learning techniques such as cluster correlation, principal component analysis and various
neural networks, as well as statistical techniques (33,34). These approaches have
identified compound class subsets such as: tubulin active compounds (35), pyrimidine
biosynthesis inhibitors (36) and topoisomerase II inhibitors (37), that possess similar
mechanisms of action (MOA), share similar structures or develop similar patterns of drug
resistance. Compound structure classes such as the ellipticine derivatives have also been
studied and point to the validity of the concept that fingerprint patterns of activity in the
NCI data set encode information concerning MOAs and other biological behavior of
tested compounds (38). More recently, gene expression analysis has been added to the
data mining activity of the NCI compound data set (39) to predict chemosensitivity, using
the GI50 test data for each compound, for a few hundred compound subset of the NCI
data set (40). After we completed our data mining analysis (41), gene expression data on
the 60 cancer cell lines was combined with NCI compound GI50 data and with a 27,000
feature database computed for the NCI compounds to calculate chemical features similar
to those identified in the following study and as we have presented elsewhere (42).
In the present data mining study, we use microarray based gene expression data to
first establish a number of ‘functional’ classes of the 60 cancer cell lines via a
hierarchical clustering technique. These functional classes are then used to supervise a 3-Class learning problem, using a small but complete subset of 1400 of the NCI
compounds’ GI50 values as the input to a clustering algorithm in the RadViz™ program
(43). At p < .01 significance, RadViz™ identifies two small compound subsets that
accurately classify the cancer cell line classes: melanoma from non-melanoma and
leukemia from non-leukemia (41). We then demonstrate that independent analytic
classifiers validate the two small compound subsets we selected. We found them to both
be significantly enriched in quinone compounds of two distinct subtypes. We conclude
that our machine learning approach has yielded important new molecular insights into a
class of compounds demonstrating a high level of specificity in cancer cell type toxicity.
Specific Methods Used.
For the ~ 4% missing values found in the 1400 compound data set, we tried and
compared two approaches to missing value replacement: 1) record average replacement;
2) multiple imputation using Schafer’s NORM software (44). Using either missing value
replacement method for the starting data set, there was close agreement (always > 90%)
between the NCI compound lists selected in identical 2-Class Problem classifications we
present below. Therefore, in the present study, we used the record average replacement
method for all the data presented.
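The record average replacement scheme settled on here (each missing value filled with the mean of the remaining values in the same record) can be sketched as follows; the function name is ours, and Schafer's NORM multiple imputation was the alternative it was compared against.

```python
import numpy as np

def record_average_replace(X):
    """Fill each NaN with the mean of the non-missing values
    in the same record (row) of the data matrix."""
    X = X.astype(float).copy()
    row_means = np.nanmean(X, axis=1)   # per-record mean, ignoring NaNs
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = row_means[rows]
    return X
```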
Clustering of cell lines was done with R-Project software using the hierarchical
clustering algorithm with “average” linkage method and a dissimilarity matrix computed
as 1 – the Pearson correlations of the gene expression data. AnVil Corporation’s
RadViz™ software (45) was used for feature reduction and initial classification of the
cell lines based on the compound GI50 data. The selected features were validated using
several classifiers from Weka 3.1.9 (Waikato Environment for Knowledge Analysis,
University of Waikato, New Zealand). The classifiers used were IB1 (nearest neighbor),
IB3 (3 nearest neighbor), logistic regression, Naïve Bayes Classifier, support vector
machine, and neural network with back propagation. ChemOffice 6.0
(CambridgeSoft Corp.) and the NCI website were both used to identify compound structures
via their NSC numbers, and substructure searching to identify quinone compounds in the
larger data set was carried out using ChemFinder (CambridgeSoft).
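The cell-line clustering step (average linkage on a 1 − Pearson dissimilarity matrix) can be sketched in Python with SciPy, mirroring the R hclust setup described above; the function name and `n_clusters` parameter are our own.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_cell_lines(expr, n_clusters):
    """Average-linkage hierarchical clustering of the rows of expr
    (cell_lines x genes) using 1 - Pearson correlation as the
    dissimilarity, then a flat cut into n_clusters groups."""
    d = 1.0 - np.corrcoef(expr)       # 1 - Pearson between rows
    np.fill_diagonal(d, 0.0)          # guard against tiny float noise
    Z = linkage(squareform(d, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```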
Results and Discussion
Identifying functional cancer cell line classes using gene expression data.
Based upon gene expression data, we identified cancer cell line classes that we could use
in a subsequent supervised learning approach. In Figure 2, we present a hierarchical
clustering dendrogram using the 1-Pearson distances calculated from the T-Matrix,
comprised of 1376 gene expression values determined for the 60 NCI cancer cell lines
(43). There are five well defined clusters observed. Four of the clusters in Figure 2 (renal,
leukemia, ovarian and colorectal from second left to right) represent pure cell line
classes. Only the melanoma class contains members of another clinical tumor type: two breast cancer cell lines, MDA-MB-435 and MDA-N.
The 2 breast cancer cell lines behave functionally as melanoma cells and seem to be
related to melanoma cell lines via a neuroendocrine origin (43). The remaining cell lines
in the Figure 2 dendrogram, those not found in any of the five functional classes, are
defined as being in the sixth class: the non-(melanoma, leukemia, renal, ovarian,
colorectal) class. In the supervised learning studies that follow, we treat these six
functional clusters as the ground truth.
3-Class Cancer Cell Classifications and Validation of Selected Compounds.
High class number classification problems are difficult to implement where the data are
not clearly separable into distinct classes, and we could not successfully carry out a 6-class classification of the cancer cell line classes based upon the starting GI50 compound
data. Therefore, we implemented a 3-Class supervised learning classification utilizing
RadViz™ (25, 45-47). Starting with the small 1400 compounds’ GI50 data set that
contained no missing values for all 60 cell lines, those compounds were selected that
were effective in carrying out the classification at the p < .01 (Bonferroni corrected t
statistic) significance level. The 3-Class problem at p < .01 significance, for the
melanoma, leukemia, and non-melanoma, non-leukemia classes, is presented in Figure 3.
This produced clear and accurate class separations of the 60 cancer cell lines. There were
14 compounds selected as being most effective against melanoma class cells and 30
compounds were identified as most effective against the leukemia class cells. Similar
classification results were obtained for separate 2-Class problems: melanoma vs. non-melanoma and leukemia vs. non-leukemia (data not shown; [41]). For all other possible
2-Class problems, we found that few to no compounds could be selected at p < .01.
Our next goal was to validate these results, utilizing 6 independent analytic
classification techniques (Instance Based 1, Instance Based 3, Naïve Bayes, neural
networks, logistic regression, and support vector machines), with the same selected
compounds’ GI50 values as a classifier set, using the hold-one-out method (data not
shown; see 41). Using these
selected compounds resulted in a greater than 6-fold lowered level of error compared to
using the equivalent numbers of randomly selected compounds, thus validating our
selection methodology.
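The hold-one-out validation used for these comparisons can be sketched as follows; the classifier here is a simple IB1-style nearest-neighbor stand-in of our own, not the Weka implementations used in the study.

```python
import numpy as np

def loo_error(X, y, classify):
    """Hold-one-out error rate: train on all samples but one,
    predict the held-out sample, repeat over every sample.
    `classify` is any (Xtrain, ytrain, xtest) -> label function."""
    wrong = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        wrong += classify(X[mask], y[mask], X[i]) != y[i]
    return wrong / len(y)

def nearest_neighbor(Xtr, ytr, x):
    """IB1-style 1-nearest-neighbor classifier (Euclidean distance)."""
    return ytr[np.argmin(np.linalg.norm(Xtr - x, axis=1))]
```

Comparing `loo_error` on the selected compounds against equal-sized random compound subsets is the shape of the 6-fold error-reduction comparison reported in the text.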
Quinone Compound Subtypes preferentially effective against melanoma.
Next, we decided to examine the chemical identity of the compounds selected as most
effective against melanoma and leukemia. To summarize, for the 14 compounds selected
as most effective against melanoma, 11 are p-quinones. Of the 11 p-quinones, all 11 are
internal ring quinone structures (41). We display in Figure 4A the most cytotoxic of
these structures. These internal ring quinones possess either two neighboring aromatic
5- or 6-membered fused rings (some heteroatom-containing) on either side of the
quinone ring, or an aromatic fused-ring neighbor on one side and non-H substitutions off
the other side of the quinone. In nearly all cases, these substitutions have electronegative
atoms covalently bonded to either or both the o and m C positions of the quinone ring.
A recent analysis simultaneously correlating gene expression data for the 60
cancer cell lines with GI50 values, identified a sub-class of compounds containing a
benzothiophenedione core structure that were most highly correlated with the expression
patterns of Rab7 and other melanoma specific genes (42). There is clearly some overlap
between the internal quinone subtype we have defined in the present study and the
benzothiophenedione core structure members. Out of the 11 internal quinone compounds
we identified, 3 are of the benzothiophenedione core structure class, but they are not
amongst the most effective compounds we identified. The Rab7 gene is a member of the
GTP binding protein family involved in the docking of cellular transport vesicles and is a
key regulator of aggregation and fusion of late endocytic lysosomes (48). A number of
other genes whose expression levels highly correlate with the same compounds
express proteins involved in other lysosomal functions, suggesting a link between the
quinone oxidation potential, the proton pump and the electron transport chain. This
suggests the possibility that benzodithiophenedione compounds may act directly as
surrogate oxidizing agents, effectively competing with ubiquinone in the electron
transport chain, thereby disrupting cellular redox processes.
Quinone Compound Subtypes preferentially effective against leukemia. There
were 30 compounds selected as most effective against leukemia in the leukemia, non-leukemia 3-Class Problem, of which 8 are structures containing p-quinones (41). In
contrast to the internal ring quinones in the melanoma class, 6 out of the 8 leukemia p-quinones were external ring quinones. We display the most cytotoxic example of these
structures in Figure 4B. In contrast to the internal ring quinones, these external ring
quinones had only one aromatic fused ring neighbor, which had no ring heteroatoms in
all cases. Also different, the quinone was itself at the periphery of the molecule and had
no non-H substituents off the exterior side of the ring at either o or m C positions. Thus,
the ‘external’ and ‘internal’ quinone rings should possess different electron densities and
redox potentials for the quinoid oxygens. Besides redox potentials, other possible subtype
differences may exist such as: solubility, steric differences relative to metabolic enzyme
active sites, differential cellular adsorption, etc.
In the study discussed already (42), a sub-class of compounds comprised of an
indolonaphthoquinone core structure was identified that was most highly correlated with
the expression patterns of LCP1, lymphocyte cytosolic protein 1, HS1, a hematopoietic
lineage specific gene, and other leukemia specific genes. There is overlap between the
external quinone subtype in our study and the indolonaphthoquinone core structure
members. This overlap between the two studies is somewhat remarkable since we
included no gene expression data in our analysis of the GI50 values, whereas the other
study (42) did. This suggests that there is sufficient information inherent in the compound GI50
values to carry out the basic core discovery presented here using sophisticated machine
learning techniques, without the need to include gene expression data in the analysis.
Uniqueness of Two Quinone Subtypes. In order to ascertain the uniqueness of
the two quinone subsets we discovered, we first determined the extent of occurrence of p-quinones of all types in our starting data set, via substructure searching using the
ChemFinder 6.0 software. The internal and external quinone subtypes represent a
significant fraction, 25% (10/41), of all the internal quinones and 40% (6/15) of all the
external quinones in the entire data set. In addition, we determined that only one
compound, NSC 621179, which is not a quinone but an epoxide, was found to be
effective against both melanoma and leukemia in a 2-Class classification where one class
was both leukemia and melanoma cell lines and the second class was non-melanoma,
non-leukemia cell lines. This result attests to the uniqueness of the specificity of the two
quinone subtype classes. Also, the NCI data set lists 92 well studied compounds known
to fall within one of 6 Mechanism Of Action (MOA) Classes: alkylating agents,
antimitotic agents, topoisomerase I inhibitors, topoisomerase II inhibitors, RNA/DNA
antimetabolites, DNA antimetabolites (33). We determined that the most effective 14 and
30 compounds against melanoma and leukemia, respectively, that we identified in the 3-Class problem do not fall into clusters with any one of these 6 MOA compound classes.
Sub-classification of Leukemia Cell Lines. We next asked whether these
machine learning techniques could sub-classify either the melanoma or the leukemia cell
lines into distinct clinical sub-classes based upon using our 2 respective quinone subtype
classes. The answer is that we could with a 3-Class based leukemia cell sub-classification
for the acute lymphoblastic leukemia (ALL), non-ALL leukemia (other), and non-leukemia cell classes at p < .05. To carry out the sub-classification, we used the most
effective 30 compounds identified for the p < .01 selection criterion as most effective
against all leukemias and this result is presented in Figure 5. Six of the 30 compounds
were most effective against the ALL class; while 12 of the 30 compounds were most
effective against the non-ALL leukemia. In this result, it is clear that there is a separation
of the 2 ALL cell lines (CCRF-CEM and MOLT-4) from the non-ALL leukemia subclass. These two ALL cell lines were also the most closely clustered leukemia cells in the
Figure 2 gene expression based clustering dendrogram. These results suggest the
interesting possibility that the chemical identity of the compounds most effective against
the 2 ALL cell lines are linked to the gene functions most responsible for closely
clustering these 2 ALL cell lines in Figure 2.
NAD(P)H:quinone oxidoreductase 1 – Quinone substrates and Leukemias
Different redox potentials and enzymatic reactivities are likely to be the key to how these
quinone subtypes differentially affect melanoma and leukemia cells. In addition to the
gene candidates identified as potentially involved in quinone activity in the study already
discussed (42), a strong candidate enzyme for the differential toxicity we observed is
NAD(P)H:quinone oxidoreductase 1 (QRI, NQO1, also DT-diaphorase; EC 1.6.99.2).
This enzyme, catalyzing two electron reduction of substrates, most efficiently utilizes
quinones as substrates (49). The X-ray structures of the apoenzyme at 1.7 Å resolution
and its complex with the substrate duroquinone (2.5 Å) are known (50,51).
NAD(P)H:quinone oxidoreductase 1 is a chemoprotective enzyme that protects cells
from oxidative challenge. Antitumor quinones, of the type we have identified above in
the NCI data set, may be bioactivated by this enzyme to forms that are cytotoxic.
Interestingly, there are a number of reports that correlate altered forms or alleles of this
enzyme with leukemia (52-54). These reports, associating leukemias with particular
aspects of NAD(P)H:quinone oxidoreductase 1, suggest the enzyme as likely being a
significant factor in why the external quinone subtypes, acting as particularly potent and
effective substrates, exhibit their differential selectivity toward leukemias.
Conclusion
With this cheminformatics example we have demonstrated that the machine
learning approach described above, utilizing RadViz™, has produced a novel discovery.
Two quinone subtypes were identified that possess clearly different and specific toxicities
toward the leukemia and melanoma cancer cell types. We believe this example illustrates
the potential of sophisticated machine learning approaches for uncovering new and
valuable relationships in complex, high dimensional chemical compound data sets.
2. Microarrays
Analysis of High Throughput Gene Expression Experiments: Effects of
Normalization Methods on Gene Expression Analysis Clustering Results.
Completion of the Human Genome Project has made possible the study of the gene
expression levels of over 30,000 genes [14,15; although a ‘final’ human genome
sequence is scheduled for release in Spring, 2003]. Major technological advances have
made possible the use of DNA microarrays to speed up this analysis. Even though the
first microarray experiment was published only in 1995, by October 2002 a PubMed
query of the microarray literature yielded more than 2300 hits, indicating explosive growth
in the use of this powerful technique. DNA microarrays take advantage of the convergence
of a number of technologies and developments, including: robotics and miniaturization of
features to the micron scale (currently 20-200 µm surface feature sizes for
spotting/printing and immobilizing sequences for hybridization experiments), DNA
amplification by PCR, automated and efficient oligonucleotide synthesis and labeling
chemistries, and sophisticated bioinformatics approaches. It is this latter aspect of the
development of microarray technology that our Phase II proposal addresses.
One significant aspect of analyzing microarray gene expression data is the need for
normalization to remove non-biological sources of variation (noise), in order to make
meaningful comparisons of data from different microarrays. The noise results from
differences in individual chips, labeling chemistry, length of immobilized oligonucleotide
sequence, different optical properties of various data scanners and other sources. The
importance of understanding and controlling these variables has been underscored by the
apparent lack of reproducibility of some published microarray studies. This has led to the
establishment of the MIAME publication guidelines that detail the following
requirements for describing microarray experiments: 1) experimental design, 2) array
design and the name and location of array spots, 3) sample name, extraction and labeling,
4) hybridization protocols, 5) image measurement methods, and 6) controls used [16-18].
Normalization techniques that have been applied include simple linear scaling, locally
linear transformations, and other nonlinear methods. To some extent, the techniques used
depend on the type of array being used. In 2 channel arrays, for example cDNA
microarrays, the issue is primarily within-chip normalization to correct distortions based
on location and signal intensity. Between-chip normalization is less of an issue for these
arrays because one channel usually contains a reference tissue that is common to all
arrays in the experiment. Between-chip normalization has the potential of introducing
more noise than it eliminates. A number of thorough discussions of normalization
techniques for cDNA arrays have been presented [19,20]. These normalization
approaches include dye swap experiments to correct for differences between the two
channels, using the lowess function to correct for global intensity based differences (i.e.
across all genes on the chip), and using the lowess function locally to account for spatial
and print-tip differences.
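The global intensity-based lowess correction described above can be sketched in MA-plot form: compute M = log2(R/G) and A = ½·log2(R·G), estimate the intensity-dependent trend of M as a function of A, and subtract it. In this illustrative sketch a running-window median stands in for a full lowess fit (an assumption to keep the code dependency-free); the `red` and `green` channel arrays are hypothetical inputs.

```python
import numpy as np

def intensity_normalize(red, green, window=101):
    """MA-style global normalization for a two-channel array:
    remove the intensity-dependent trend in M = log2(R/G) as a
    function of A = 0.5*log2(R*G). A running-window median over
    spots ordered by A stands in for the lowess fit."""
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    order = np.argsort(a)                   # spots ordered by overall intensity
    m_sorted = m[order]
    trend = np.empty_like(m)
    half = window // 2
    for rank, idx in enumerate(order):
        lo, hi = max(0, rank - half), min(len(m), rank + half + 1)
        trend[idx] = np.median(m_sorted[lo:hi])
    return m - trend                        # normalized log-ratios, trend removed
```

After correction, the residual log-ratios are centered near zero and no longer track spot intensity, which is the property the lowess approaches above are designed to achieve.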
Affymetrix microarrays are used in the majority of applications. For these arrays,
between-chip normalization is an important issue, and is closely related to the method of
calculation of gene expression value from multiple probes for each gene. Techniques
proposed for calculating expression include the original Affymetrix method of average
difference between perfect match and mismatch probes, the Model Based Expression
Index approach of Li and Wong [21], and the Robust Multichip Average approach of
Irizarry et al [22]. Durbin et al [23] have suggested a variance-stabilizing transformation
to aid microarray analysis. There is the additional consideration of whether to normalize
data based on probe level measurements or expression calculations, and whether to use a
baseline array for comparison or to normalize over the complete set of data. Bolstad et al
[24] present comparisons of some of these techniques. They recommend probe level and
complete data methods in general, and quantile normalization in particular. They also
found that the invariant set normalization approach of Schadt et al [25] using a baseline
array gives results that are comparable to complete data methods. Our experience has
shown that quantile normalization works well even when probe level data are not
available. However, quantile normalization makes the implicit assumption that the data
on all chips have the same distribution. For some datasets this may not be appropriate.
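Quantile normalization as recommended by Bolstad et al [24] can be sketched in a few lines: force every chip to share one reference distribution, taken as the mean of the sorted columns. This is a minimal illustration (ties are broken arbitrarily here, a simplification of the published averaging rule), not a reimplementation of any particular package.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize the columns (chips) of X so every chip
    shares the same distribution: the mean of the sorted columns."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # within-chip rank of each value
    reference = np.sort(X, axis=0).mean(axis=1)         # common reference distribution
    return reference[ranks]
```

Note that the transform is monotonic within each chip, so the rank order of expression values on any one chip is preserved; only the distribution is replaced. This is exactly why the same-distribution assumption discussed above matters.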
Different normalization and modeling techniques can lead to widely varying
judgments and interpretations of differential gene expression. In this Phase II proposal,
we aim to investigate the effects of different data normalizations on clustering. We will
compare quantile normalization, invariant set normalization, lowess local regression, and
simple linear scaling. We will focus primarily on Affymetrix type arrays, but we will
ensure that the platform we develop supports the adaptation and application of these
techniques to two channel microarrays where appropriate. We will also investigate the
effects of different modeling techniques on clusters. The more successful a technique is
at removing noise, the more likely it is that the clusters generated will be accurate and
will have biological meaning. On the other hand, the quality and stability of clusters
could be a useful measure of the appropriateness of the normalization and modeling
techniques used. Therefore, a goal of this Phase II proposal is to provide users with
decision making tools to decide which normalization approach is optimal or close to
optimal for a given microarray dataset. Also, the normalization tools will be integrated
with the perturbation algorithm output, discussed below, to determine the stability of
clusters from different normalizations. In this way, we can provide users with the identity
of those genes that are most stable within clusters, and those that are unstable and jump
between clusters as a result of different normalizations.
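The cluster-stability idea above can be sketched as follows: cluster the genes under two different normalizations, match the resulting clusters by maximum overlap, and flag the genes that jump between clusters. The plain k-means and the greedy overlap-matching rule used here are illustrative assumptions, not the proposal's actual perturbation algorithm.

```python
import numpy as np

def kmeans_labels(X, k, seed=0, iters=100):
    """Plain k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels

def jumping_genes(labels_a, labels_b, k):
    """Relabel clustering B onto A by maximum overlap, then report the
    indices of genes whose cluster assignment changed between runs."""
    remap = {j: max(range(k), key=lambda i: np.sum((labels_b == j) & (labels_a == i)))
             for j in range(k)}
    remapped = np.array([remap[l] for l in labels_b])
    return np.flatnonzero(remapped != labels_a)
```

Genes returned by `jumping_genes` are the unstable ones in the sense described above; an empty result indicates the two normalizations produce the same partition.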
NCI Lung Cancer – 3 Classes
Introduction
An important use for gene expression data is the automatic distinction between
normal and lung cancer tissue samples. To assess the feasibility of such a task,
AnVil, in collaboration with the NCI, examined two example data sets of patients
with and without various lung cancers. The initial aim of AnVil’s task was simply to
determine whether a patient has lung cancer based on microarray data collected from
lung tissue samples. However, AnVil went one step further and analyzed a three-class
problem: distinguishing between normal tissue and two subclasses of non-small
cell lung carcinoma, adenocarcinoma and squamous cell carcinoma. Given the
numerous choices and various complexities of this task, AnVil took a systematic approach
that included three primary steps: selection, evaluation and relevance. The first step
involves making an intelligent selection of genes via some modeling technique. Because
the selection of genes depends on the number of genes and the selection algorithm, AnVil
experimented with multiple variations. Next, these selected genes are evaluated by some
classification algorithm to determine their ability to distinguish between normal and the
two cancer types. Here AnVil opted to try a number of different classification algorithms
and to check for consistency between these models. The final step adds domain
knowledge to the process by determining the biological relevance of these genes and their
known associations with lung cancer.
Available Data
AnVil was provided with two data sets of patients with and without lung cancer.
Both data sets included gene expressions of patient malignant or normal tissue samples
using Affymetrix’s Human Genome U95 Set [1]; only the first of five oligonucleotide
based GeneChip® arrays was used in this experiment. Chip A of the HG U95 array set
contains roughly 12,000 full-length genes and a number of controls.
The first data set was provided directly from NCI, courtesy of Jin Jen and Tatiana
Dracheva, and included 75 patient samples. This set contained 17 normal samples, 30
adenocarcinomas (6 doubles), and 28 squamous cell carcinomas (2 doubles). Doubles
represent replicate samples prepared at different times using different equipment from the
original sample preparation.
A second patient set of 157 samples was provided via public access, courtesy
of Matthew Meyerson at the Dana-Farber Cancer Institute [2]. This set included 17
normal samples, 139 adenocarcinomas (127 of these with supporting information) and 21
squamous cell carcinomas. In addition, the Meyerson data set also included 6 small cell
lung cancer samples and 20 pulmonary carcinoid tumors, which AnVil set aside
during this analysis.
Because AnVil was dealing with two data sets from different sources, with
microarray measurements taken at multiple times, we needed to consider a normalization
procedure. For this particular analysis we kept to a simple scaling of each sample to a
mean of 200. As with our systematic approach to selecting and validating sets of genes,
AnVil has also undertaken an analysis of various normalization techniques, though
no conclusions are yet available.
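The global scaling used here, rescaling each sample to a fixed mean of 200, is a one-liner; the genes-by-samples array orientation in this sketch is an assumption.

```python
import numpy as np

def scale_to_target_mean(X, target=200.0):
    """Rescale each column (sample) of a genes x samples matrix X
    so that its mean expression equals `target`."""
    return X * (target / X.mean(axis=0, keepdims=True))
```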
In addition to considering normalization of the samples within each data
set and between the two data sets, AnVil took this opportunity to treat each data set
independently. By keeping the data sets separate we could use one, the NCI data set, for
training and gene selection whilst using the second, Meyerson, data set for independent
validation of the selected genes.
Gene Sets
The first step of AnVil’s three-part analysis was the selection of genes that could
distinguish between normal lung tissue and the two types of non-small cell lung
carcinomas, adenocarcinomas and squamous cell carcinomas. When making a selection
of genes for this task we needed to consider two requirements: size and procedure. It is quite
clear that one does not need to include all the genes present on the HG U95 chip A;
there are over 12,000 genes and most of these provide no information, that is, many of
these genes do not provide adequate expression values when only looking at normal
versus cancerous lung cells. Consequently a decision had to be made as to how many
genes to select. Secondly, there needed to be a mechanism by which these genes
could be selected, a reproducible procedure for choosing the best set of genes that defines
the three tissue types.
In order to understand the best number of genes for this three-class problem,
AnVil took a systematic approach, generating sets of genes that varied in size from
very small to somewhat large relative to the 12,000 genes available. As such, we decided
to proceed by generating gene sets ranging from one up through one hundred genes to provide
an initial understanding as to how many genes might be best for distinguishing the three
tissue types. AnVil set the upper bound at one hundred since most published research
reports small gene sets, mostly around twenty or so genes.
Figure 1. Example RadViz™ Gene Selection
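The RadViz™ layout behind Figure 1 follows the published spring paradigm [45-47]: each feature sits as an anchor on a circle, and each normalized feature value pulls its sample toward the corresponding anchor. A minimal sketch of that projection rule (the normalization-to-[0,1] step is a common convention, assumed here), not AnVil's proprietary class-discrimination algorithm:

```python
import numpy as np

def radviz_positions(X):
    """Project rows of X (samples x features) into the unit circle.
    Anchors sit evenly on the circle; each sample lands at the
    weighted average of the anchors, weighted by its normalized values."""
    n_features = X.shape[1]
    theta = 2 * np.pi * np.arange(n_features) / n_features
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span > 0, span, 1.0)  # each feature to [0, 1]
    weights = Xn.sum(axis=1, keepdims=True)
    weights[weights == 0] = 1.0
    return (Xn @ anchors) / weights
```

A sample whose weight is concentrated on one feature lands on that feature's anchor; balanced samples land near the center, which is what makes the sector layouts in the RadViz figures readable.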
Next came the selection procedures. Once again there are many possible ways by
which one might select subsets of genes from the initial 12,000, so the question was
which procedure would be the most fruitful. AnVil settled on four selection algorithms:
random, F-statistic, RadViz™, and PURS™. It was apparent that we needed some baseline
for how well any set of genes of a given size would perform, so we started by
generating random gene sets, ten independent gene sets for each gene selection size.
These random sets provided the best unintelligent estimate of how well any set of
genes distinguishes between normal tissue and the two cancer types. Secondly, we included an
algorithm using the F-statistic to select the genes with the highest
significance in distinguishing the three classes. One would assume that by adding some
intelligence about the data we could select more appropriate genes than by simply choosing
random sets. A third algorithm, proprietary to AnVil, involves applying the class
discrimination algorithm of RadViz™ to this three-class problem (see Figure 1 for an
example). The final algorithm, also proprietary to AnVil, is PURS™, or Principal
Uncorrelated Record Selection; here genes are selected based on their uniqueness in
defining the space of expression values, which works by selecting genes that are most
different from the currently selected genes. PURS™ chooses genes independently of the three
classes, and so the initial gene selection used to start this algorithm becomes important.
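Two of the four selection ideas can be sketched generically. These are illustrative stand-ins, not AnVil's proprietary implementations: a one-way ANOVA F-statistic ranking across the three classes, and a greedy "most dissimilar from those already chosen" rule in the spirit of PURS™ (using absolute correlation as the similarity, an assumption).

```python
import numpy as np

def f_statistic(values, labels):
    """One-way ANOVA F-statistic for a single gene across class labels."""
    groups = [values[labels == g] for g in np.unique(labels)]
    grand = values.mean()
    k, n = len(groups), len(values)
    between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - k)
    return between / within

def top_f_genes(X, labels, n_genes):
    """Rank genes (rows of X) by F-statistic, highest first."""
    scores = np.array([f_statistic(row, labels) for row in X])
    return np.argsort(scores)[::-1][:n_genes]

def greedy_dissimilar_genes(X, n_genes, start=0):
    """Greedy selection: repeatedly add the gene least correlated with the
    genes already selected. Illustrative PURS-like rule; note the choice of
    the starting gene matters, as observed in the text."""
    corr = np.abs(np.corrcoef(X))
    chosen = [start]
    while len(chosen) < n_genes:
        max_sim = corr[:, chosen].max(axis=1)   # similarity to nearest chosen gene
        max_sim[chosen] = np.inf                # never re-pick a chosen gene
        chosen.append(int(max_sim.argmin()))
    return chosen
```

The F-statistic ranking is class-aware, while the greedy rule, like PURS™, ignores the class labels entirely and only seeks coverage of the expression space.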
Set Evaluation
After generating a number of gene sets ranging in size from small to large
using the four selection procedures mentioned above, these sets of genes needed to be
evaluated as to how well they truly distinguish the three tissue types: normal,
adenocarcinoma and squamous cell carcinoma. To accomplish this step AnVil applied a
number of classification algorithms to each gene set in order to fully compare the
relationship between different numbers of genes and the algorithms used to make the gene
selections. Furthermore, AnVil performed ten-fold and leave-one-out cross-validation
using the NCI data set and independent validation using the Meyerson data set. One
thing that was apparent during our independent testing was the unbalanced tissue sampling
in the Meyerson data set: 139 adenocarcinoma samples versus only 38 combined
normal and squamous cell carcinoma samples. In total AnVil used eleven classification
algorithm versions, including variations of K-nearest Neighbors, Naïve Bayes, Support
Vector Machines, and Neural Networks. Figure 2 provides a visual representation of the
ten-fold cross-validation results for all gene sets and algorithms by their associated best
classification score.
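The evaluation loop above can be sketched with a single nearest-neighbor classifier standing in for the eleven algorithm versions (an illustrative simplification, not the study's actual classifiers):

```python
import numpy as np

def one_nn_predict(train_X, train_y, test_X):
    """Classify each test row by its single nearest training row (Euclidean)."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[d2.argmin(axis=1)]

def ten_fold_accuracy(X, y, n_folds=10, seed=0):
    """Shuffle, split into folds, and report overall held-out accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    correct = 0
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)       # everything not in the held-out fold
        preds = one_nn_predict(X[train], y[train], X[fold])
        correct += int((preds == y[fold]).sum())
    return correct / len(y)
```

Running this once per (gene set, classifier) pair and keeping the best score per gene set is what a plot like Figure 2 summarizes.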
Figure 2. Classification Results. # of Variables – number of genes selected; light gray circles – random gene sets; yellow squares – F-statistic gene sets; blue circles – RadViz™ gene sets; red triangles – PURS™ gene sets.
Figure 3. Sample Misclassifications. Gray (left) – normal samples; blue (center) – adenocarcinomas; yellow (right) – squamous cell carcinomas. The top row indicates the known tissue type.
An interesting observation, when comparing the classifications of samples across
different gene sets and the various classification algorithms, was the finding of
consistently misclassified samples. In figure 3 we present an example visualization of the
classification results for each sample (displayed vertically) within the NCI data set.
Notice the two continuous vertical lines; these represent two samples
that were misclassified by all the classification algorithms. Given that we had no
supporting information for the NCI patients, we could not draw any inferences about
these cases other than recommending that these patients be resampled. When
analyzing the consistent misclassifications of the Meyerson samples we were able to
identify six patients, and after reviewing the patients’ supporting information we found
that these samples consisted of mixed tissue types and the classification algorithms caught
the differences.
Biological Relevance
ML’s stuff…
MeSH – Informax
GO ontology
Conclusion
[Overview of the approach taken]
1. Selection of gene sets
2. Evaluation of gene sets
3. Biological relevance
Random
F-statistics
Radviz
PURS - Intelligent Principal Uncorrelated Record Selection (dissimilar)
K-nearest Neighbors
Naïve Bayes
Support Vector Machines
Neural Network
References
1. Affymetrix, www.affymetrix.com.
2. Matthew Meyerson Lab, Dana-Farber Cancer Institute, http://research.dfci.harvard.edu/meyersonlab/lungca/data.html.
3. Proteomics
Conclusions
Acknowledgements
AnVil and the authors gratefully acknowledge support from two SBIR Phase I grants,
R43 CA94429-01 and R43 CA096179-01, from the National Cancer Institute. Also,
support is acknowledged from ………..X Y Z
References
1. A. Strehl. Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining. Dissertation, The University of Texas at Austin, May 2002.
2. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann, 2000.
3. J. A. Hartigan. Clustering Algorithms. New York: John Wiley & Sons, 1975.
4. D. Fasulo. “An Analysis of Recent Work on Clustering Algorithms.” http://www.cs.washington.edu/homes/dfasulo/clustering.ps, April 26, 1999.
5. C. Fraley and A. E. Raftery. “Model-Based Clustering, Discriminant Analysis, and Density Estimation.” Technical Report no. 380, Department of Statistics, University of Washington, Seattle, October 2000.
6. F. Höppner, F. Klawonn, R. Kruse, and T. Runkler. Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Chichester: John Wiley & Sons, 1999.
7. Everitt, B., Cluster Analysis, Halsted Press, New York (1980).
8. Schaffer, C., Selecting a classification method by cross-validation, Machine Learning, 13:135-143 (1993).
9. Feelders, A., Verkooijen, W., Which method learns most from the data? Proc. of 5th International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, January 1995, pp. 219-225 (1995).
10. Dietterich, T.G., Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895-1924.
11. Cheng, J., Greiner, R., Comparing Bayesian network classifiers. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI ’99), 101-107, Morgan Kaufmann Publishers (1999).
12. Salzberg, S. L., On Comparing Classifiers: A Critique of Current Research and Methods, Data Mining and Knowledge Discovery, 1999, 1:1-12, Kluwer Academic Publishers, Boston.
13. Ramaswamy, S., Ross, K.N., Lander, E.S. and Golub, T.R. A molecular signature of metastasis in primary solid tumors. Science, 22, 1-5.
14. Chaussabel, D. and Sher, A. Mining microarray expression data by literature profiling. Genome Biology, 3, 1-16.
15. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.) Advances in
knowledge discovery and data mining, AAAI/MIT Press, 1996.
16. B. Shneiderman, “The Eyes Have It: A Task by Data Type Taxonomy of Information
Visualization,” presented at IEEE Symposium on Visual Languages '96, Boulder, CO,
1996.
17. J. W. Tukey, Exploratory Data Analysis. Reading, MA: Addison-Wesley, 1977.
18. R. Rao and S. K. Card, “The Table Lens: Merging Graphical and Symbolic
Representations in an Interactive Focus+Context Visualization for Tabular Information,”
presented at ACM CHI '94, Boston, MA, 1994.
19. D. F. Andrews, “Plots of High-Dimensional Data,” Biometrics, vol. 29, pp. 125-136, 1972.
20. H. Chernoff, “The Use of Faces to Represent Points in k-Dimensional Space Graphically,” Journal of the American Statistical Association, vol. 68, pp. 361-368, 1973.
21. J. Beddow, “Shape Coding of Multidimensional Data on a Microcomputer Display,” presented at IEEE Visualization '90, San Francisco, CA, 1990.
22. A. Inselberg, “The Plane with Parallel Coordinates,” Special Issue on Computational Geometry: The Visual Computer, vol. 1, pp. 69-91, 1985.
23. D. A. Keim and H.-P. Kriegel, “VisDB: Database Exploration Using Multidimensional Visualization,” IEEE Computer Graphics and Applications, vol. 14, pp. 40-49, 1994.
24. J. W. Sammon, Jr., “A Nonlinear Mapping for Data Structure Analysis,” IEEE Transactions on Computers, vol. 18, pp. 401-409, 1969.
25. P. Hoffman and G. Grinstein, “Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information Visualizations,” presented at NPIV '99 (Workshop on New Paradigms in Information Visualization and Manipulation), 1999.
26. H. Hotelling, “Analysis of a Complex of Statistical Variables into Principal Components,” Journal of Educational Psychology, vol. 24, pp. 417-441, 498-520, 1933.
27. T. Hastie and W. Stuetzle, “Principal Curves,” Journal of the American Statistical Association, vol. 84, pp. 502-516, 1989.
28. D. Asimov, “The Grand Tour: A Tool for Viewing Multidimensional Data,” SIAM Journal on Scientific and Statistical Computing, vol. 6, pp. 128-143, 1985.
29. J. H. Friedman, “Exploratory Projection Pursuit,” Journal of the American Statistical Association, vol. 82, pp. 249-266, 1987.
30. T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, “Engineering Applications of the Self-Organizing Map,” Proceedings of the IEEE, 1996.
31. G. Grinstein, P. E. Hoffman, S. Laskowski, and R. Pickett, “Benchmark Development for the Evaluation of Visualization for Data Mining,” in Information Visualization in Data Mining and Knowledge Discovery, The Morgan Kaufmann Series in Data Management Systems, U. Fayyad, G. Grinstein, and A. Wierse, Eds., 1st ed: Morgan Kaufmann Publishers, 2001.
32. Voigt, K. and Bruggeman, R. (1995) Toxicology databases in the metadatabank of online databases, Toxicology, 100, 225-240.
33. Weinstein, J.N., et al. (1997) An information-intensive approach to the molecular pharmacology of cancer, Science, 275, 343-349.
34. Shi, L.M., Fan, Y., Lee, J.K., Waltham, M., Andrews, D.T., Scherf, U., Paul, K.D., and Weinstein, J.N. (2000) J. Chem. Inf. Comput. Sci., 40, 367-379.
35. Bai, R.L., Paul, K.D., Herald, C.L., Malspeis, L., Pettit, G.R., and Hamel, E. (1991) Halichondrin B and homohalichondrin B, marine natural products binding in the vinca domain of tubulin: discovery of tubulin-based mechanism of action by analysis of differential cytotoxicity data, J. Biol. Chem., 266, 15882-15889.
36. Cleveland, E.S., Monks, A., Vaigro-Wolff, A., Zaharevitz, D.W., Paul, K., Ardalan, K., Cooney, D.A., and Ford, H. Jr. (1995) Site of action of two novel pyrimidine biosynthesis inhibitors accurately predicted by the COMPARE program, Biochem. Pharmacol., 49, 947-954.
37. Gupta, M., Abdel-Megeed, M., Hoki, Y., Kohlhagen, G., Paul, K., and Pommier, Y. (1995) Eukaryotic DNA topoisomerase-mediated DNA cleavage induced by a new inhibitor: NSC 665517, Mol. Pharmacol., 48, 658-665.
38. Shi, L.M., Myers, T.G., Fan, Y., O’Connor, P.M., Paul, K.D., Friend, S.H., and Weinstein, J.N. (1998) Mining the National Cancer Institute Anticancer Drug Discovery Database: cluster analysis of ellipticine analogs with p53-inverse and central nervous system-selective patterns of activity, Mol. Pharmacology, 53, 241-251.
39. Ross, D.T., et al. (2000) Systematic variation of gene expression patterns in human cancer cell lines, Nat. Genet., 24, 227-235.
40. Staunton, J.E.; Slonim, D.K.; Coller, H.A.; Tamayo, P.; Angelo, M.P.; Park, J.; Scherf, U.; Lee, J.K.; Reinhold, W.O.; Weinstein, J.N.; Mesirov, J.P.; Lander, E.S.; Golub, T.R. Chemosensitivity prediction by transcriptional profiling, Proc. Natl. Acad. Sci., 2001, 98, 10787-10792.
41. Marx, K.A.; O’Neil, P.; Hoffman, P.; Ujwal, M.L. Data Mining the NCI Cancer Cell Line Compound GI50 Values: Identifying Quinone Subtypes Effective Against Melanoma and Leukemia Cell Classes, J. Chem. Inf. Comput. Sci., 2003, in press.
42. Blower, P.E.; Yang, C.; Fligner, M.A.; Verducci, J.S.; Yu, L.; Richman, S.; Weinstein, J.N. Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data, The Pharmacogenomics Journal, 2002, 2, 259-271.
43. Scherf, U.; Ross, D.T.; Waltham, M.; Smith, L.H.; Lee, J.K.; Tanabe, L.; Kohn, K.W.; Reinhold, W.C.; Myers, T.G.; Andrews, D.T.; Scudiero, D.A.; Eisen, M.B.; Sausville, E.A.; Pommier, Y.; Botstein, D.; Brown, P.O.; Weinstein, J.N. A gene expression database for the molecular pharmacology of cancer, Nat. Genet., 2000, 24, 236-247.
44. Schafer, J.L. Analysis of Incomplete Multivariate Data, Monographs on Statistics and Applied Probability 72, Chapman & Hall/CRC, 1997.
45. RadViz, URL: www.anvilinfo.com
46. Hoffman, P.; Grinstein, G.; Marx, K.; Grosse, I.; Stanley, E. DNA visual and analytical data mining, IEEE Visualization 1997 Proceedings, pp. 437-441, Phoenix.
47. Hoffman, P.; Grinstein, G. Multidimensional information visualization for data mining with application for machine learning classifiers, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, San Francisco, 2000.
48. Bucci, C.; Thomsen, P.; Nicoziani, P.; McCarthy, J.; van Deurs, B. Rab7: a key to lysosome biogenesis, Mol. Biol. Cell, 2000, 11, 467-480.
49. Ross, D. NAD(P)H:quinone oxidoreductases, Encyclopedia of Molecular Medicine, 2001, 2208-2212.
50. Faig, M.; Bianchet, M.A.; Talalay, P.; Chen, S.; Winski, S.; Ross, D.; Amzel, L.M. Structure of recombinant human and mouse NAD(P)H:quinone oxidoreductase: species comparison and structural changes with substrate binding and release, Proc. Natl. Acad. Sci., 2000, 97, 3177-3182.
51. Faig, M.; Bianchet, M.A.; Winski, S.; Moody, C.J.; Hudnott, A.H.; Ross, D.; Amzel, L.M. Structure-based development of anticancer drugs: complexes of NAD(P)H:quinone oxidoreductase 1 with chemotherapeutic quinones, Structure (Cambridge), 2001, 9, 659-667.
52. Smith, M.T.; Wang, Y.; Kane, E.; Rollinson, S.; Wiemels, J.L.; Roman, E.; Roddam, P.; Cartwright, R.; Morgan, G. Low NAD(P)H:quinone oxidoreductase 1 activity is associated with increased risk of acute leukemia in adults, Blood, 2001, 97, 1422-1426.
53. Wiemels, J.L.; Pagnamenta, A.; Taylor, G.M.; Eden, O.B.; Alexander, F.E.; Greaves, M.F. A lack of a functional NAD(P)H:quinone oxidoreductase allele is selectively associated with pediatric leukemias that have MLL fusions. United Kingdom Childhood Cancer Study Investigators, Cancer Res., 1999, 59, 4095-4099.
54. Naoe, T.; Takeyama, K.; Yokozawa, T.; Kiyoi, H.; Seto, M.; Uike, N.; Ino, T.; Utsunomiya, A.; Maruta, A.; Jin-nai, I.; Kamada, N.; Kubota, Y.; Nakamura, H.; Shimazaki, C.; Horiike, S.; Kodera, Y.; Saito, H.; Ueda, R.; Wiemels, J.; Ohno, R. Analysis of the genetic polymorphism in NQO1, GST-M1, GST-T1 and CYP3A4 in 469 Japanese patients with therapy-related leukemia/myelodysplastic syndrome and de novo acute myeloid leukemia, Clin. Cancer Res., 2000, 6, 4091-4095.
Other References (14-25 in CC Grant)
35. Venter, J.C., et al., The Sequence of the Human Genome. Science, 291, 1303-1351 (2001).
36. Lander, E.S., et al., Initial Sequencing and Analysis of the Human Genome. Nature, 409, 860-921 (2001).
37. Stoeckert, C.J., et al., Microarray databases: standards and ontologies. Nat. Genet. 32 (Suppl), 469-473.
38. No author, Microarray standards at last. Nature, 419, 323.
39. Ball, C., et al., Standards for microarray data. Science, 298, 539.
40. Quackenbush, J. (2001) Computational analysis of cDNA microarray data. Nature Reviews Genetics 2(6): 418-428.
41. Dudoit, S., Yang, Y.H., Speed, T.P., and Callow, M.J. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12(1), 111-139.
42. Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error applications. Genome Biology, 2(8).
43. Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K., Scherf, U., and Speed, T.P. (2003) Exploration, normalization and summaries of high density oligonucleotide array probe level data. Biostatistics (in press).
44. Durbin, B.P., Hardin, J.S., Hawkins, D.M., and Rocke, D.M. (2002) A variance-stabilizing transformation for gene expression microarray data. Bioinformatics, 18, S105-S110.
45. Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2): 185-193.
46. Schadt, E.E., Li, C., Ellis, B., and Wong, W.H. (2002) Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data. J. Cell. Biochem. 84(S37), 120-125.
Figure Legends
Figure 1. RadViz figure
Figure 2. Cancer cell line functional class definition using a hierarchical clustering (1 – Pearson coefficient) dendrogram for 60 cancer cell lines based upon gene expression data. Five well defined clusters are shown highlighted. We treat the highlighted cell line clusters as the truth for the purpose of carrying out studies to identify which chemical compounds are highly significant in their classifying ability.
Figure 3. RadViz™ result for the 3-class problem classification of melanoma, leukemia and non-melanoma, non-leukemia cancer cell types at the p < .01 criterion. Cell lines are symbol coded as described in the figure. A total of 14 compounds (bottom of layout) were most effective against melanoma and they are laid out on the melanoma sector (counterclockwise from most to least effective). For leukemia, 30 compounds were identified as most effective and are laid out in that sector. Some 8 compounds were found to be most effective against non-melanoma, non-leukemia cell lines and are laid out in that sector.
Figure 4. One example of each of the two quinone subtypes selected in Figure 3 is displayed. A. The most highly effective of the 11 internal quinone subtype compounds most effective against melanoma is shown. B. The most highly effective of the 6 external quinone subtype compounds most effective against leukemia is shown.
Figure 5. RadViz™ result for the 3-class problem classifying the following three classes: acute lymphoblastic leukemia (ALL), non-ALL leukemia (other-Leukemia) and non-leukemia cell classes at p < .05. We used as input the 30 compounds identified in the Figure 3 classification as most effective against all leukemias at the p < .01 selection
criterion. Cell lines are symbol coded as described in the figure. The NSC numbers of the compounds selected to classify the classes are presented in the order of their ranking from most effective to least effective moving counterclockwise within each class sector.
[Figure 2 graphic: cluster dendrogram of the 60 cancer cell lines (vertical axis: Height, 0.0 – 1.0); individual cell line labels omitted here.]
[Chemical structure drawings: one panel of 11 ranked compounds (A; NSC 670762, 670766, 642061, 658450, 602617, 690432, 690434, 644902, 642009, 656239, 628507) and one panel of 6 ranked compounds (B; NSC 648147, 641395, 618315, 641394, 640192, 641396), plus NSC 621179; atom-level residue from the drawings omitted.]