Molecular Classification of Cancer
Class Discovery and Class Prediction by Gene Expression Monitoring

Overview
• Motivation
• Microarray Background
• Our Test Case
• Class Prediction
• Class Discovery

Motivation
• Importance of cancer classification
• Cancer classification has historically relied on specific biological insights
• We will discuss a systematic and unbiased approach for recognizing tumor subtypes

Microarray Background
• Microarrays enable simultaneous measurement of the expression levels of thousands of genes in a sample
• Microarray:
  – Glass slide with a matrix of thousands of spots printed on to it
  – Each spot contains probes which bind to a specific gene

Microarray Background (cont.)
• The process:
  – DNA samples are taken from the test subjects
  – Samples are dyed with fluorescent colors and placed on the Microarray
  – Hybridization of DNA and cDNA
• The result:
  – Spots in the array are dyed in shades of red to green

Microarray Background (cont.)
          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09
• Microarray data is translated into an n x p table (p – number of genes, n – number of samples)

Demonstration
http://www.bio.davidson.edu/courses/genomics/chip/chip.html

Our Test Case
• 38 bone marrow samples from acute leukemia patients (27 ALL, 11 AML)
• RNA from the samples was hybridized to microarrays containing probes for 6817 human genes
• For each gene, an expression level was obtained

Class Prediction
• Initial collection of samples belonging to known classes
• Goal: create a “class predictor” to classify new samples
  – Look for “informative genes”
  – Make a prediction based on these genes
  – Test the validity of the predictor

Informative genes
• Genes whose expression pattern is strongly correlated with the class distinction
[Figure: two example expression patterns, labeled "strongly correlated" and "poorly correlated"]

Neighborhood Analysis
• Are the observed correlations stronger than would be expected by chance?
• C represents the AML/ALL class distinction
• C* is a random permutation of C; it represents a random class distinction

Application to the Test Case
• Roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance

Make a Prediction
• Use a fixed subset of “informative genes” (most correlated with the class distinction)
• Make a prediction on the basis of the expression level of these genes in a new sample

Prediction Algorithm
• Each gene $G_i$ votes, depending on whether its expression level $X_i$ in the sample is closer to $\mu_{AML}$ or $\mu_{ALL}$
• The magnitude of the vote is $W_i V_i$
  – $W_i$ reflects how well the gene is correlated with the class distinction
  – $V_i = \left| X_i - \frac{\mu_{AML} + \mu_{ALL}}{2} \right|$ reflects the deviation of $X_i$ from the average of $\mu_{AML}$ and $\mu_{ALL}$

Prediction Algorithm (cont.)
• The votes for each class are summed to obtain total votes $V_{AML}$ and $V_{ALL}$

Prediction Algorithm (cont.)
• The prediction strength is calculated: $PS = \frac{V_{win} - V_{lose}}{V_{win} + V_{lose}}$
• The sample is assigned to the winning class provided that the PS exceeds a predetermined threshold (0.3 in the test case)

Testing the Validity of Class Predictors
• Cross Validation
  – withhold a sample
  – build a predictor based on the remaining samples
  – predict the class of the withheld sample
  – repeat for each sample
• Assess accuracy on an independent set of samples

Application to the Test Case
• 50 genes most highly correlated with the AML-ALL distinction were chosen
• A class predictor based on these genes was built

Application to the Test Case
• Performance in cross validation:
  – Out of 38 samples there were 36 predictions and 2 uncertainties (PS < 0.3)
  – 100% accuracy
  – PS median 0.77

Application to the Test Case (cont.)
• Performance on an independent set of samples:
  – Out of 34 samples there were 29 predictions and 5 uncertainties (PS < 0.3)
  – 100% accuracy
  – PS median 0.73

Comments
• Why 50 genes?
  – Large enough to be robust against noise
  – Small enough to be readily applied in a clinical setting
  – Predictors based on between 10 and 200 genes all performed well
• Genes useful for cancer class prediction may also provide insight into cancer pathogenesis and pharmacology

Comments (cont.)
• Creation of a new predictor involves expression analysis of thousands of genes
• Application of the predictor then requires only monitoring the expression level of a few informative genes

Class Discovery
• Cluster tumors by gene expression
  – Apply a clustering technique to produce presumed classes
• Evaluation of the Classes:
  – Are the classes meaningful?
  – Do they reflect true structure?

Clustering Technique - SOMs
• SOMs – Self Organizing Maps
• Well suited for identifying a small number of prominent classes
  – Find an optimal set of “centroids”
  – Partition the data set according to the centroids
  – Each centroid defines a cluster consisting of the data points nearest to it
• We won't go into details about the calculation of SOMs

Application of a two-cluster SOM to the test case
• Class A1: 24 ALL, 1 AML
• Class A2: 10 AML, 3 ALL
• Quite effective at automatically discovering the two types of leukemia
• Not perfect

Evaluation of the Classes
• How can we evaluate such classes if the “right” answer is not already known?
• Hypothesis: class discovery can be tested by class prediction
  – If the classes reflect true structure, then a class predictor based on them should perform well
• Let’s test this hypothesis...

Validity of Predictors Based on A1 and A2
• Predictors based on different numbers of informative genes performed well
• For example: a 20-gene predictor

Validity of Predictors Based on A1 and A2 (cont.)
• Performance on independent samples:
  – PS median 0.61
  – Prediction made for 74% of samples

Validity of Predictors Based on A1 and A2 (cont.)
• Performance in cross validation:
  – 34 accurate predictions with high prediction strength
  – One error
  – Three uncertain predictions
[Figure labels: the one cross-validation error; 2 of the 3 cross-validation uncertain samples]

Iterative Procedure
• Use a SOM to initially cluster the data
• Construct a predictor
• Remove samples that are not correctly predicted in cross-validation
• Use the remaining samples to generate an improved predictor
• Test on an independent data set

Validity of Predictors Based on Random Clusters
• Performance:
  – Poor accuracy in cross validation
  – Low PS on independent samples

Conclusion
• The AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge

Application of a 4-cluster SOM to the Test Case

Evaluation of the Classes
• Complement approach:
  – Construct class predictors to distinguish each class from its complement
• Pair-wise approach:
  – Construct class predictors to distinguish between each pair of classes Ci, Cj
  – Perform cross validation only on samples in Ci and Cj

Evaluation of the Classes
• Class predictors distinguished the classes from one another, with the exception of B3 versus B4

Conclusion
• The results suggest the merging of classes B3 and B4
• The distinction corresponding to AML, B-ALL and T-ALL was confirmed

Uses of Class Discovery
• Identify fundamental subtypes of any cancer
• Search for fundamental mechanisms that cut across distinct types of cancers

Questions?

Thank you for listening
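
Backup: Prediction Algorithm as code
The voting scheme from the Prediction Algorithm slides can be sketched roughly as follows. This is a minimal illustration on made-up toy data, not the code used for the test case. The slides only say that the weight of a gene "reflects how well the gene is correlated with the class distinction"; the sketch assumes a signal-to-noise score (difference of class means divided by the sum of class standard deviations) for that weight. The function names `train` and `predict` and the toy expression values are hypothetical.

```python
from statistics import mean, pstdev


def train(samples, labels, n_genes):
    """Pick the n_genes most informative genes from labeled samples.

    samples: list of expression vectors (one value per gene)
    labels:  "AML" or "ALL" for each sample
    Returns a list of (gene index, weight, class midpoint) tuples.
    """
    genes = []
    for i in range(len(samples[0])):
        aml = [s[i] for s, c in zip(samples, labels) if c == "AML"]
        all_ = [s[i] for s, c in zip(samples, labels) if c == "ALL"]
        mu_aml, mu_all = mean(aml), mean(all_)
        spread = pstdev(aml) + pstdev(all_)
        # ASSUMPTION: signal-to-noise weight; the slides do not give a formula.
        weight = (mu_aml - mu_all) / spread if spread else 0.0
        genes.append((i, weight, (mu_aml + mu_all) / 2))
    # Keep the genes most strongly correlated with the class distinction.
    genes.sort(key=lambda g: abs(g[1]), reverse=True)
    return genes[:n_genes]


def predict(model, sample, threshold=0.3):
    """Weighted voting; returns (class or "uncertain", prediction strength)."""
    votes = {"AML": 0.0, "ALL": 0.0}
    for i, weight, midpoint in model:
        # Positive when the expression level lies on the AML side of the midpoint.
        signed = weight * (sample[i] - midpoint)
        votes["AML" if signed > 0 else "ALL"] += abs(signed)
    v_win, v_lose = max(votes.values()), min(votes.values())
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total else 0.0
    winner = max(votes, key=votes.get)
    # Assign the sample only if PS exceeds the threshold (0.3 in the test case).
    return (winner if ps > threshold else "uncertain"), ps


# Hypothetical toy data: 4 training samples, 3 genes each.
train_x = [
    [9.0, 1.0, 5.0],  # AML
    [8.0, 2.0, 4.0],  # AML
    [2.0, 8.0, 5.0],  # ALL
    [1.0, 9.0, 4.0],  # ALL
]
train_y = ["AML", "AML", "ALL", "ALL"]
model = train(train_x, train_y, n_genes=2)

clear_case = predict(model, [8.5, 1.5, 4.5])       # both genes vote AML
borderline_case = predict(model, [5.5, 5.4, 4.5])  # genes disagree, low PS
```

In the toy data the third gene has equal class means, so it gets weight zero and is not selected. The borderline sample shows why the threshold matters: the two selected genes vote for different classes with similar magnitude, so the prediction strength falls below 0.3 and no call is made.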
