Microarray Data Analysis (MDA)

CS 491 (Individual Project), Summer Fall 2009
CalState Los Angeles, Department of Computer Science
Prof. Chengyu Sun
Prepared by Modi, Hardik
December 11th, 2009

Abstract

Microarray technology is widely used for simultaneous monitoring of the gene expression profiles of tens of thousands of genes. The protocol usually results in massive amounts of raw gene expression data. While software to analyze the data is freely available from academic institutions, open source software is not. In the present project, I have developed a Microarray Data Analysis (MDA) application to analyze raw gene expression data, which provides a basic framework for the development of a comprehensive MDA application. This application has been used to analyze the leukemia data published by Golub et al. (Science 1999). After primary analysis using this application, genes whose expression is associated with the leukemic phenotype were identified. The outcome variable was deduced using Student's t-test and is represented as a bar graph. Next, the application was used to predict the class membership of new leukemia samples using the k-nearest neighbor (kNN) algorithm. Results analyzed using MDA replicate the quantitative conclusion of Golub et al. Finally, hierarchical and k-means clustering analysis showed that different classes of leukemia samples are well correlated with their own classes. Overall this application provides 1) a tool for primary data analysis and visualization, 2) a successful implementation of a statistical method, 3) prediction of class membership and 4) clustering of gene expression data.

1. Introduction

Microarray technology (Appendix b) is widely used to monitor the gene expression profiles of tens of thousands of genes in parallel, from different cells and different experimental conditions. A typical experiment monitors gene expression levels in different groups of patients.
At the end of a typical microarray experiment an image (TIFF) file of the gene expression data is generated, and commercial software is used to report the primary data, which is then further analyzed. The application described here, MDA, is used to analyze the raw primary data generated by an experiment. Briefly, it first preprocesses the data, i.e., performs data normalization and transformation. This step eliminates control (housekeeping) genes; replaces expression values below a predetermined threshold value; eliminates genes with no change in fold expression level; and represents the data in a different format, for example as an absolute measurement, a relative measurement or expression ratio, or log2(expression ratio). These data are then represented in graph format for easy visualization. A major goal of this application is to make meaningful biological inferences about the set of genes or samples using class prediction. This is based on supervised data analysis methods that impose known groups on data sets, like the k-nearest neighbor (kNN) classification method. Finally, it also employs hierarchical clustering to visualize the data and to explore relationships among distance metric, variable selection and classification. In conclusion, this application provides a basic framework for complete analysis of raw microarray data.

Microarray Data Analysis, MDA, by Modi, Summer 2009.

1.1. Rationale

Microarray technology is commercially available from multiple vendors. However, each vendor requires the user to analyze the data using its proprietary software. The first step after a microarray experiment is data acquisition to generate the raw data, followed by data analysis. A major issue with microarray data analysis is cost, which may be very significant for laboratories that do not use microarrays routinely. Although free software from academic institutes is available, open source software is not.
Besides, there are many different methods for data analysis, both statistical and algorithmic. In addition, a laboratory usually needs to tailor its own software to analyze its data in multiple ways that may not be offered by the vendor. My objective is to develop open source microarray data analysis software, which will be used to analyze raw gene expression data.

1.2. Implementation

Perl (http://www.perl.org) is the language of choice for MDA because it has a number of features, such as built-in support for text processing, lists, and hash tables, that make it possible to express algorithms very concisely. It has simple-to-use database support with advanced features. It is also already widely used for server-side scripting (CGI) in web-based applications, and a large library of code (the modules of the BioPerl effort described at www.bioperl.org) is freely available to assist bioinformatics programmers. Last but not least, Perl is portable, i.e., it runs on all major operating systems, and is freely available at www.perl.org.

2. System Overview of MDA

[Workflow schematic. Boxes: Raw Data; Normalization of the Data (extraction of fluorescent value, removal of control genes, calculation of expression change, elimination of genes with less than two-fold change); Visual Representation of data in bar graph and scatter plot; Significance (t-test); Classification (k-nearest neighbor, kNN); Clustering (hierarchical clustering, k-means clustering).]

The schematic workflow above outlines the basic steps MDA uses to analyze raw gene expression data, given as a fluorescent intensity table. This table is generated at the end of a microarray experiment and is the input to the application for further analysis.
In these tables, rows represent genes, columns represent samples such as tissues or experimental conditions, and the number in each cell characterizes the expression level of a particular gene in a particular sample. The first step in this application is primary data analysis: it retrieves the value for a particular gene in a particular sample, followed by preprocessing, i.e., data normalization and transformation. This step eliminates housekeeping or control genes and genes with less than a two-fold change in expression level, to reduce noise and identify genes with a specific change in expression level. Next it identifies differentially expressed genes (i.e., up- or down-regulated) using the intensity ratio of the raw expression values, log2(ratio), or fold change. These primary data are also represented as a bar graph and a scatter plot for easy visualization and further analysis.

If the samples are replicated, differentially expressed genes can be found more effectively by statistical analysis. This is done with Student's t-test, which gives the associated probability. The t-test is a test statistic used to determine whether two samples are likely to have come from two underlying populations that have the same mean.

The next step is exploratory analysis and class prediction, i.e., accurately predicting classes based on patterns of expression across multiple genes from different samples. Class prediction is done by k-nearest neighbor (kNN), an instance-based learning method built on supervised data analysis that imposes known groups on data sets. Finally, hierarchical and k-means analysis are employed for clustering of gene expression data. This lets the user visualize the data by looking at changes in expression pattern rather than following the actual numerical or absolute changes.

3. Detail of MDA system

3.1. Preprocessing

The first step after data acquisition is preprocessing, which includes retrieving the fluorescent intensity value for a specific gene in a particular sample, followed by the primary data analysis that needs to be applied to the data before it is suitable for detailed analysis.

3.2. Removal of endogenous control gene

The expression ratio of a control or housekeeping gene should not change between two conditions, but often one finds that it deviates from 1. This may be due to various reasons, for example variation caused by differential labeling efficiency of the two fluorescent dyes, or different amounts of starting mRNA material in the two samples. Preprocessing or normalization is the term used to describe the process of eliminating such variations to allow appropriate comparison of data obtained from two samples. Here, the expression values of endogenous control genes and miscellaneous control genes are found by scanning through the data file and removed from further analysis.

3.3. Calculation of expression change

There are four commonly used measures of expression change: 1) absolute value, 2) intensity ratio, which is the raw expression measure, and 3) log ratio and 4) fold change, which are derived from the intensity ratio by transformation.

3.3.1. Absolute value

The expression values retrieved in the first step can be viewed as a gene expression matrix in which rows represent genes and columns represent particular conditions. Each cell contains a value, given in arbitrary units, that reflects the expression level of a gene under the corresponding condition. A negative expression level of a gene was replaced by a minimum value, e.g. 20, so that the data can be represented in log2 form.

3.3.2. Intensity Ratio

This is the simplest approach: the intensity of a gene in the sample is divided by the intensity of the same gene in the control.
The formula for two-color data is

\[ T_k = \frac{R_k}{G_k} \]

where T_k is the expression ratio, R_k represents the spot intensity metric for the test sample and G_k represents the spot intensity metric for the control sample. The intensity ratio is one for unchanged expression, less than one for down-regulated genes and larger than one for up-regulated genes.

3.3.3. Log Ratio

A log transformation can be applied to the absolute value or the intensity ratio to make the data symmetric (normal-like). The most commonly used log transformation is base 2 (log2). It is calculated by taking log2 of the intensity ratio:

\[ \text{log ratio} = \log_2(T_k) \]

After the log transformation, unchanged expression is zero, up-regulated genes take positive values and down-regulated genes take negative values, so both can range in magnitude from zero to infinity.

3.3.4. Fold change

Fold change is another means of making the intensity ratio more symmetric. The fold change equals the intensity ratio when the ratio is greater than one; below one it equals the inverse of the intensity ratio:

\[ \text{fold change} = \begin{cases} T_k = R_k / G_k & \text{if } T_k \geq 1 \\ T_k' = 1 / (R_k / G_k) & \text{if } T_k < 1 \end{cases} \]

As with the log transformation, fold change makes the distribution more symmetric, and both up- and down-regulated genes take values from one to infinity.

3.4. Removal of genes with less than two fold change

Generally the goal of a microarray experiment is to find a few genes for further study of a biologically interesting phenomenon. It is therefore practical to remove the uninteresting part of the data, i.e., genes that do not show any expression change during the experiment, before classification or clustering analysis. Usually the intensity ratio cut-offs for uninteresting (non-changing) genes are set at 0.5 and 2.0. One way is to remove genes with less than a two-fold change as just calculated.
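The measures of section 3.3 and this first filtering approach can be illustrated with a short Python sketch. MDA itself is written in Perl; the function names, the dictionary layout and the toy intensities below are assumptions made for illustration only, and the floor of 20 is the example value given in section 3.3.1.

```python
import math

MIN_INTENSITY = 20.0  # assumed floor for negative or very low values (section 3.3.1)


def intensity_ratio(test, control, floor=MIN_INTENSITY):
    """T_k = R_k / G_k, after raising both intensities to at least `floor`."""
    return max(test, floor) / max(control, floor)


def log_ratio(ratio):
    """Base-2 log ratio: 0 when unchanged, positive when up, negative when down."""
    return math.log2(ratio)


def fold_change(ratio):
    """The ratio itself when it is at least 1, otherwise its inverse."""
    return ratio if ratio >= 1.0 else 1.0 / ratio


def filter_two_fold(genes, cutoff=2.0):
    """Keep only genes whose fold change reaches `cutoff`.

    genes: dict mapping gene name -> (test intensity, control intensity)
    """
    kept = {}
    for name, (test, control) in genes.items():
        ratio = intensity_ratio(test, control)
        if fold_change(ratio) >= cutoff:
            kept[name] = {
                "ratio": ratio,
                "log2": log_ratio(ratio),
                "fold": fold_change(ratio),
            }
    return kept


# Hypothetical intensities, not taken from the Golub et al. data set.
genes = {
    "geneA": (800.0, 200.0),  # up-regulated four-fold
    "geneB": (150.0, 600.0),  # down-regulated four-fold
    "geneC": (310.0, 300.0),  # essentially unchanged
    "geneD": (-15.0, 80.0),   # negative reading, floored to 20
}
changed = filter_two_fold(genes)
for name, measures in sorted(changed.items()):
    print(name, measures)
```

A fold-change cut-off of 2 is equivalent to the ratio cut-offs of 0.5 and 2.0 mentioned above, which is why a single comparison is enough in `filter_two_fold`.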
The other way is to find the minimum and maximum expression values of a particular gene across the different experiments and to remove that gene from all experiments if the fold change between them is less than two.

3.5. Significance/Finding predictor gene

Up- and down-regulated genes can be found more effectively if the chips/samples are replicated. To construct a classifier, this application uses the absolute value of the two-sample t-statistic with unequal variances. This gives the statistical significance of a particular gene and makes it easy to find differentially expressed genes as well as the best predictor genes. The t statistic is calculated as

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} \qquad \text{where} \qquad s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

Here \bar{X} is the sample mean, s^2 is the unbiased estimator of the variance of each of the two samples, n = number of participants, 1 = group one, 2 = group two. For use in significance testing, the distribution of the test statistic is approximated as an ordinary Student's t distribution with the degrees of freedom calculated using

\[ \mathrm{d.f.} = \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\left( s_1^2/n_1 \right)^2 / (n_1 - 1) + \left( s_2^2/n_2 \right)^2 / (n_2 - 1)} \]

3.6. Classification

The kNN algorithm uses the genes with the highest prediction strength, determined from the training set in the previous step, followed by the kNN rule to classify the test sample. The number of nearest neighbors, K, is user-defined: the application prompts the user for K. Each sample from the test set is classified by finding the K nearest training samples based on the Euclidean distance of normalized expression intensity. The class memberships of the K nearest neighbors are counted as votes for each class. The class with the higher vote wins; when K is odd there is always a winner, whereas when K is even an equal number of votes leaves the sample unclassified.

3.6.1. K-nearest Neighbor algorithm:

a) Finds the k nearest samples (in Euclidean distance) in the training set to the new sample to be classified.
At this step it also prompts the user to choose classification based on the Euclidean distance between samples or between genes.
b) Determines the proportion of neighbor samples from each class and casts a 'vote' for each class.
c) Applies the majority rule at the end.
d) Allows a "no prediction" result if K is even and the votes are tied.

3.6.2. Calculation of Euclidean distance

Euclidean distance is one of the common distance measures used to calculate similarity between expression profiles. For example, the Euclidean distance between two points in two dimensions, say A = [a_1, a_2] and B = [b_1, b_2], is calculated as:

\[ D_{euc}(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2} \]

Thus for genes with expression data available for n conditions, represented as A = [a_1, a_2, a_3, ..., a_n] and B = [b_1, b_2, b_3, ..., b_n], the Euclidean distance is calculated as:

\[ D_{euc}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} \]

In other words, the Euclidean distance between two genes is the square root of the sum of the squares of the distances between the values in each condition.

3.7. Cross Validation

Cross validation tests how well the predictor genes and the prediction rule discriminate between classes. The cross-validation method, also known as the drop-one-out approach, removes one sample from the training set and uses it as a test sample. The remaining training samples are used to predict the removed test sample. The best predictor genes obtained in the training phase are also selected from the test samples for further analysis. Example of cross-validation on the AML/ALL data set (38 training samples in total):

a) Remove one leukemia sample.
b) Predict the membership of the removed leukemia sample using the prediction rule and data from the remaining 37 training samples.
c) Return the removed sample to the training set.
Remove another sample.
d) Repeat steps b and c until all samples have been predicted.

This leave-one-out approach detects samples whose expression differs from that of other samples in the same group. Thus, potential outlier samples can be detected during cross validation.

3.8. Clustering

Clustering organizes the data into a relatively small number of homogeneous groups. In analyzing microarray data we are interested in changes in expression patterns rather than the actual numeric changes, so we use normalized expression values. These methods are therefore used to find similar expression motifs irrespective of the expression level. If the expression profiles are correlated by shape, both low and high expression level genes can end up in the same cluster. Clustering methods can be classified in many different ways, such as supervised and unsupervised learning. Supervised methods assign predefined classes to a data set, whereas unsupervised methods make no prior assumptions. The most commonly used clustering methods are hierarchical clustering, K-means, self-organizing maps (SOMs) and principal component analysis (PCA).

3.8.1. Hierarchical Clustering

Hierarchical clustering is a statistical method for finding relatively similar clusters, and is divided into two phases. First, a distance matrix containing all the pairwise distances between the genes is calculated. Then the hierarchical algorithm iteratively joins the two closest clusters, starting from single-gene clusters (a bottom-up approach). After each step, the distance matrix between the newly formed cluster and the other clusters is recalculated. The best predictor genes found in the class prediction phase were used for clustering. For example, given a set of N genes to be clustered and an N×N distance (or similarity) matrix, hierarchical clustering is performed as follows.

a) Assign each gene to a cluster of its own.
b) Find the closest pair of clusters and merge them into a single cluster.
c) Compute the distances (similarities) between the new cluster and each of the old clusters using the single linkage method.
d) Repeat steps b and c until all genes are clustered.

3.8.2. K-Means Clustering

K-means, a non-hierarchical clustering method, is a least-squares partitioning method for which the number of groups, K, has to be provided or predetermined. The algorithm computes the cluster centroids, assigns each object to the nearest centroid, and then recomputes the centroids from the new assignment. However, it is also possible to estimate K (the number of groups) from the data, treating this as a mixture density estimation problem; for example, one can use the result of the hierarchical clustering. Briefly, the algorithm for k-means is:

a) The number of clusters can be chosen arbitrarily or estimated by first performing a hierarchical clustering of the data. (Here the data are divided into two clusters randomly.)
b) Initialization is performed by calculating the average expression profile (centroid) of each cluster.
c) Individual objects are reassigned from one cluster to the other depending on which centroid is closer to the gene or sample.
d) This procedure of calculating the centroid of each cluster and regrouping objects around the closest centroid is performed iteratively for a fixed number of rounds, or until convergence (the state in which the composition of the clusters is unaltered by further iterations).

4. Results and Discussion

4.1. Samples:

To test this application, the data described in the paper by Golub et al. 1999 were used. There were 72 samples in total, each with 7129 genes, obtained by microarray experiments using Affymetrix high-density oligonucleotide arrays.
Each sample belongs to one of two leukemia types, acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The data were collected in two separate phases, which allows a natural definition of training and test sets: a training set of 38 samples (27 ALL vs. 11 AML) and a test set of 34 samples (20 ALL vs. 14 AML). Golub et al. originally applied clustering, neighborhood analysis and a voting scheme to address visualization, correlation and predictive classification, respectively. Their class prediction correctly classified 29/34 samples into the ALL/AML subtypes; five samples were not classified, being considered too close to call.

4.2. Results:

4.2.1. Primary analysis.

To verify the Microarray Data Analysis (MDA) application developed here, previously published data sets were used. As mentioned, the gene expression values are first retrieved, followed by data normalization and transformation. This step leaves 6454 genes, i.e., after removing the housekeeping and miscellaneous control genes. Applying the t-test assuming unequal variances shows that 1640 genes (23.29%) from the training set are significant, with p value ≤ 0.05. These are also the best predictor genes, and of the 1640 the top 50-100 genes with the lowest p values are used for classification.

4.2.2. kNN classification

The best-performing classifier on the training set uses 100 genes and 6 neighbors (k). Applied to the test set, this classifier correctly classifies 25/35 samples (71%), with 17 out of 21 ALL and 8 out of 14 AML samples correct; 4 samples cannot be assigned to either group. Applied to the training set itself, it correctly classifies 21/27 samples, with 6/11 ALL and 7/11 AML samples, and 6 samples cannot be assigned to either group (see Summary table 1; detailed results are in appendix c, Table 1).
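The two steps behind these counts, ranking genes by the absolute unequal-variance t-statistic (section 3.5) and classifying by kNN vote (section 3.6), can be sketched in Python as follows. This is only an illustration under stated assumptions: MDA is written in Perl, the function names and toy expression values are invented for the example, the p-value lookup is omitted, and the profiles are used without the normalization step the application applies.

```python
import math
from collections import Counter
from statistics import mean, variance


def welch_t(group1, group2):
    """Two-sample t-statistic with unequal variances (section 3.5)."""
    se2 = variance(group1) / len(group1) + variance(group2) / len(group2)
    return (mean(group1) - mean(group2)) / math.sqrt(se2)


def top_genes(train, labels, n_top):
    """Return the n_top genes with the largest |t| between ALL and AML.

    train:  dict gene -> list of expression values, one per training sample
    labels: class label ("ALL" or "AML") of each training sample
    """
    scores = {}
    for gene, values in train.items():
        all_vals = [v for v, lab in zip(values, labels) if lab == "ALL"]
        aml_vals = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores[gene] = abs(welch_t(all_vals, aml_vals))
    return sorted(scores, key=scores.get, reverse=True)[:n_top]


def euclidean(a, b):
    """Euclidean distance between two expression profiles (section 3.6.2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(profile, train_profiles, labels, k):
    """Majority vote among the k nearest training samples; None on a tie."""
    order = sorted(
        range(len(train_profiles)),
        key=lambda i: euclidean(profile, train_profiles[i]),
    )
    votes = Counter(labels[i] for i in order[:k]).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None  # equal votes: the sample stays unclassified
    return votes[0][0]


# Hypothetical training data: three ALL and three AML samples, three genes.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
train = {
    "geneX": [10.0, 11.0, 12.0, 20.0, 21.0, 22.0],
    "geneY": [10.0, 14.0, 12.0, 11.0, 13.0, 12.5],
    "geneZ": [30.0, 29.0, 31.0, 10.0, 12.0, 11.0],
}
selected = top_genes(train, labels, 2)
train_profiles = [[train[g][i] for g in selected] for i in range(len(labels))]

new_sample = {"geneX": 11.5, "geneY": 12.0, "geneZ": 28.0}
prediction = knn_classify(
    [new_sample[g] for g in selected], train_profiles, labels, k=3
)
print(selected, prediction)
```

With an even k, as in the 6-neighbor classifier described above, `knn_classify` can return `None`, which corresponds to the "unclassified" samples in the results.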
Results for test sample classification

                       Classification     True             Classification
                       (~100 genes,       classification   (~1500 genes,
                       pVal = 0.00012)                     pVal = 0.05)
ALL total              17                 21               16
AML total              8                  14               11
Unclassified           4                                   4
Wrong classification   6

Results for training sample classification

                       Classification     True             Classification
                       (~100 genes,       classification   (~1500 genes,
                       pVal = 0.00012)                     pVal = 0.05)
ALL total              21                 27               21
AML total              7                  11               6
Unclassified           4                                   4
Wrong classification   6                                   7

Summary table 1: Classification results using kNN.

The robustness of classification performance was also explored under different choices of the number of genes selected and the number of neighbors used. The ALL samples are classified relatively well under any set of parameters except for a large number of genes and K=1. The AML samples are more difficult to classify, with 4 to 7 correctly classified. Analysis of the AML samples reveals that the mistakes are concentrated on two of the AML samples.

4.2.3. Hierarchical Clustering

Using the 100 genes with the highest absolute t-statistics from the training set, the samples from both the training and test data sets were clustered. A snapshot of the results is shown in Summary table 2 below and the detailed results are in appendix d, Table 2. Generally the genes from one group cluster well. Analysis of these results revealed that ALL samples have higher expression compared to AML samples.

STAGE   DISTANCE      CLUSTER 1           CLUSTER 2
0       3.90238609    M55150_at           U50136_rna1_at
1       4.736685142   HG1612-HT1612_at    M31303_rna1_at
2       4.756178629   M31523_at           U29175_at
3       4.882651946   M55150_at           M81695_s_at
...
46      16.74044878   M80254_at           U82759_at
47      17.26896308   M80254_at           X66533_at
48      17.9501259    M27891_at           M28130_rna1_s_at
49      21.20353106   M19507_at           M28130_rna1_s_at
50      22.7866839    M89957_at           X66533_at
51      28.170148     M12959_s_at         M28130_rna1_s_at

Summary table 2: Hierarchical Clustering results of best predictor genes.

4.2.4. K-means Clustering

Using the 100 genes with the highest absolute t-statistics from the training set, the samples from both the training and test data sets were clustered.
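The k-means runs reported in this subsection follow the steps of section 3.8.2: a random split into two clusters, centroid calculation, reassignment to the nearest centroid, and repetition until the clusters stop changing. A minimal Python sketch of that loop is given here for orientation. It is not the Perl code of MDA; the function names, the balanced random initial split, the rule for empty clusters and the toy profiles are all assumptions made for this example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(profiles):
    """Average expression profile of a cluster."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def k_means(profiles, k=2, max_rounds=100, seed=0):
    """Return (cluster index per profile, number of rounds performed)."""
    rng = random.Random(seed)
    order = list(range(len(profiles)))
    rng.shuffle(order)
    assign = [0] * len(profiles)
    for position, index in enumerate(order):
        assign[index] = position % k  # random but balanced initial split

    centroids = [None] * k
    for round_no in range(1, max_rounds + 1):
        # Recompute the centroid of every cluster that still has members;
        # an emptied cluster keeps its previous centroid (assumed rule).
        for c in range(k):
            members = [p for p, a in zip(profiles, assign) if a == c]
            if members:
                centroids[c] = centroid(members)
        # Reassign every profile to its nearest centroid.
        new_assign = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in profiles
        ]
        if new_assign == assign:  # convergence: composition unchanged
            return assign, round_no
        assign = new_assign
    return assign, max_rounds


# Hypothetical two-dimensional profiles forming two obvious groups.
profiles = [
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
]
assignment, rounds = k_means(profiles, k=2, seed=1)
print(assignment, rounds)
```

The returned round count plays the role of the STAGE column in the k-means results: it is the number of reassignment passes needed before the clusters stop changing.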
A snapshot of the results is shown in Summary table 3 below. For example, when clustering genes, the genes are first divided into two clusters randomly and the K-means algorithm is then applied. More than 90% of the genes stabilize into clusters and do not move after 8-9 rounds of clustering, although one or two genes keep moving from one cluster to the other, probably because they are at a similar distance from both centroids. Further visual analysis of these clusters shows that 70-72% of the genes are best predictor genes of their own class.

STAGE   Cluster1   Cluster2
1       53         52
2       92         13
3       82         23
4       87         18
5       79         26
6       72         33
7       71         34
8       70         35
9       71         34
10      70         35
11      71         34
12      70         35
13      71         34
100     70         34

Summary table 3: K-means Clustering results of best predictor genes.

Conclusion

Here, I have developed an application, MDA, to analyze the raw gene expression data generated by microarray experiments using Affymetrix chips. I first determined empirically which methods to choose for data normalization and transformation, classification and clustering, and then implemented them. Overall, this application provides a basic framework for analyzing gene expression data: it performs primary data analysis, represents data in graphical format, and carries out statistical analysis, classification and clustering. Further, the correct implementation of the approach used in MDA was verified by analyzing the data published by Golub et al. It should be noted that Golub et al. were able to correctly classify 29/34 samples, while this application was able to correctly classify only 24 samples. Further clustering by the hierarchical and k-means methods shows that 70-75% of the genes cluster very well with their own class. It also suggests that the remaining genes are still background noise and that data normalization and transformation require further improvement. Nevertheless, these results validate the correct implementation of the algorithms.
At present the application relies on sequential file input and output, so future work is, of course, to develop a user-friendly application with a wider choice of statistics and algorithms for analyzing the data.

Bibliography

a) Genomic Perl. Rex A. Dwyer.
b) An introduction to microarray data analysis. M. Madan Babu.
c) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander. Science, Vol. 286, 15 October 1999.

Appendices

a) Background, Central dogma of molecular biology

[Figure: the central dogma of molecular biology, showing replication, transcription and translation, with the label "Measured by Microarray".]

The central dogma of molecular biology relates DNA, RNA and proteins. Put simply, "DNA is transcribed into RNA, which is then translated into protein." Briefly, the Central Dogma makes the following claims (1). The amino acid sequence of a protein provides an adequate "blueprint" for the protein's production. Protein blueprints are encoded in DNA in the chromosomes. The encoded blueprint for a single protein is called a gene. A dividing cell passes on the blueprints to its daughter cells by making copies of its DNA in a process called replication. The blueprints are transmitted from the chromosomes to the protein factories in the cell in the form of RNA. The process of copying DNA into RNA is called transcription. The RNA blueprints are read and used to assemble proteins from amino acids in a process known as translation.

b) What Is MicroArray

Microarray technology takes advantage of the human genome sequencing project and compares the expression of the genes (DNA) of your sample to known genes.
A microarray is typically a glass slide (Figure 1) onto which known DNA molecules are fixed in an orderly manner at specific locations called spots or features. A single slide may contain thousands of spots, and each spot may contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The DNA in a spot may be either genomic DNA or a short stretch of oligonucleotide strands that corresponds to a gene. To accommodate more and more unique DNA, the size of the DNA is getting smaller and smaller. The spots are fixed onto the glass slide by the process of photolithography. (2)

Microarrays have many applications in both basic and clinical research; for example, they are used to compare gene expression levels in normal vs. cancerous tissue or in drug-treated vs. untreated cells, or for time-course studies of particular cells or tissues. They have been widely used to compare the expression level of drug-treated cells (condition A) with that of untreated cells (condition B). The steps involved are shown schematically in Figure 1B. First, mRNA, a type of nucleic acid, is extracted from the cells and reverse transcribed into cDNA using the enzyme reverse transcriptase (usually DNA is transcribed into RNA, but here RNA is used to make DNA, hence "reverse transcribed"). Next, the cDNAs are labeled with different fluorescent dyes, e.g., cDNA from condition A with a red dye and cDNA from condition B with a green dye. The differentially labeled cDNAs are then allowed to hybridize (bind) to the DNA on the glass slide. A cDNA binds to the slide if and only if the two DNA sequences match, i.e., the sequences are complementary. The amount of cDNA bound to a spot, and the fluorescence emitted, is directly proportional to the initial number of RNA molecules present for that gene in both samples.
Fluorescence emission is excited by a laser at a suitable wavelength to detect the red and green fluorescent dyes. For instance, if cDNA from condition A for a particular gene is in greater abundance than that from condition B, the spot is red. If it is the other way around, the spot is green. If the gene is expressed to the same extent in both conditions, the spot is yellow, and if the gene is not expressed in either condition, the spot is black. Thus, what is seen at the end of the experimental stage is an image of the microarray, in which each spot that corresponds to a gene has an associated fluorescence value representing the relative expression level of that gene.

Figure 1. A) Microarray representation and B) Schematic of the experimental protocol to study differential expression of genes. (Adapted from "An Introduction to Microarray Data Analysis", M. Madan Babu)

c) Table 1: Detailed analysis results of kNN classification

SAMPLE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38

ALL VOTES, AML VOTES
6 0 2 4 6 0 3 3 2 4 2 4 6 0 2 4 4 2 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6 0 2 4 6 0 3 3 6 0 3 3 6 0 2 4 4 2 3 3 6 0 2 4 6 0 2 4 6 0 2 4 6 0 3 3 2 4

Results for test sample classification: Classification (~100 genes, pVal = 0.00012); True classification; Classification (~1500 genes, pVal = 0.05)
ALL ALL ALL AML ALL AML ALL ALL ALL ALL ALL ALL AML ALL AML ALL ALL ALL ALL ALL AML AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML ALL AML ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML ALL AML ALL ALL ALL ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML ALL AML ALL AML ALL AML ALL AML AML AML AML AML AML AML AML AML AML AML ALL AML AML AML AML AML AML AML AML

                       Classification     True             Classification
                       (~100 genes,       classification   (~1500 genes,
                       pVal = 0.00012)                     pVal = 0.05)
ALL total              17                 21               16
AML total              8                  14               11
Unclassified           4                                   4
Wrong classification   6

Results for training sample classification: Classification (~100 genes, pVal = 0.00012); True classification
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML ALL ALL ALL AML ALL ALL ALL ALL ALL ALL ALL ALL AML ALL AML ALL ALL ALL ALL ALL ALL AML AML AML AML AML AML AML AML AML AML AML AML AML AML ALL AML AML AML ALL AML

Results for training sample classification: Classification (~1500 genes, pVal = 0.05)
ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL ALL AML ALL ALL ALL ALL ALL AML ALL ALL ALL AML AML ALL ALL AML AML AML ALL AML AML ALL AML ALL

                       Classification     True             Classification
                       (~100 genes,       classification   (~1500 genes,
                       pVal = 0.00012)                     pVal = 0.05)
ALL total              21                 27               21
AML total              7                  11               6
Unclassified           4                                   4
Wrong classification   6                                   7

d) Table 2. Clustering results of predictor genes from the training sets.

STAGE   DISTANCE      CLUSTER 1           CLUSTER 2
0       3.90238609    M55150_at           U50136_rna1_at
1       4.736685142   HG1612-HT1612_at    M31303_rna1_at
2       4.756178629   M31523_at           U29175_at
3       4.882651946   M55150_at           M81695_s_at
4       4.908484268   HG1612-HT1612_at    M92287_at
5       5.182083996   X74801_at           U29175_at
6       5.233545688   M23197_at           X70297_at
7       5.602330349   Z15115_at           U22376_cds2_s_at
8       5.935037805   M91432_at           X74262_at
9       6.033568191   M91432_at           X74801_at
10      6.038776923   L13278_at           X52142_at
11      6.083551181   U20998_at           M12959_s_at
12      6.383679177   Y12670_at           M81695_s_at
13      6.556810674   M23197_at           U12471_cds1_at
14      6.658057996   M91432_at           U32944_at
15      6.714705754   M77142_at           X15949_at
16      6.9358182     M91432_at           U62136_at
17      7.307777142   L08246_at           X04085_rna1_at
18      7.465346691   L13278_at           M83233_at
19      7.47753534    M92287_at           U22376_cds2_s_at
20      7.865546755   L13278_at           M77142_at
21      7.899660363   L08246_at           X17042_at
22      8.147036047   L08246_at           M63138_at
23      8.169115014   M37435_at           U12471_cds1_at
24      8.538012669   L13278_at           X66533_at
25      8.61081586    M16038_at           Y12670_at
26      9.031163475   M81933_at           U12471_cds1_at
27      9.815361402   M16038_at           L09209_s_at
28      10.0079682    U22376_cds2_s_at    M12959_s_at
29      10.21339449   U90546_at           M28170_at
30      10.47210596   D10495_at           X07743_at
31      10.76822899   M80254_at           M81933_at
32      10.82831893   M16038_at           X95735_at
33      10.83775045   X82240_rna1_at      L33930_s_at
34      11.02406968   D10495_at           M80254_at
35      11.24009344   D88422_at           U46499_at
36      11.42818776   X66533_at           U90546_at
37      11.61143783   M89957_at           L33930_s_at
38      11.68816622   L08246_at           M12959_s_at
39      12.02556804   M16038_at           X85116_rna1_s_at
40      12.02844307   U46499_at           M27783_s_at
41      12.89215862   U82759_at           X85116_rna1_s_at
42      13.22957654   U62136_at           U82759_at
43      13.36849708   M28130_rna1_s_at    Y00787_s_at
44      14.21324986   M80254_at           M27783_s_at
45      14.57555099   M80254_at           X61587_at
46      16.74044878   M80254_at           U82759_at
47      17.26896308   M80254_at           X66533_at
48      17.9501259    M27891_at           M28130_rna1_s_at
49      21.20353106   M19507_at           M28130_rna1_s_at
50      22.7866839    M89957_at           X66533_at
51      28.170148     M12959_s_at         M28130_rna1_s_at
52      30.04587055   X66533_at           M28130_rna1_s_at