Class Prediction Based on Gene Expression Data
Issues in the Design and Analysis of Microarray Experiments

Michael D. Radmacher, Ph.D.
Biometric Research Branch
National Cancer Institute

One Potential of Gene Expression Data
• Specimens will be distinguishable by their gene expression profiles
• NCI Director’s Challenge: Toward a Molecular Classification of Tumors
  – “This challenge is intended to lay the groundwork for changing the basis of tumor classification from morphological to molecular characteristics.”
  – Purpose is “...to define comprehensive profiles of molecular alterations in tumors that can be used to identify subsets of patients.”
• So one important goal is: Classification

What is meant by “Classification”?
Two important and distinct answers:

Class Discovery
• Identification of previously unknown classes of specimens
• Use of “unsupervised” methods
  – Hierarchical Clustering
  – k-means Clustering
  – SOMs
  – Others
• Prevalent method used in literature for analysis of gene expression data.

Class Prediction
• Assignment of specimens into known classes
• Use of “supervised” methods
  – Logistic Regression
  – CART
  – Discriminant Analysis
  – Others
• Class prediction is more powerful than class discovery for distinguishing specimens based on a priori defined classes.

Example of Class Discovery: Distinct Types of Diffuse Large B-Cell Lymphoma
• DLBCL is clinically heterogeneous
• Specimens were clustered based on their expression profiles of GC B-cell associated genes.
• Two subgroups were discovered:
  – GC B-like DLBCL
  – Activated B-like DLBCL
(Figures and information taken from Alizadeh et al., Nature 403:503-11, 2000)

Study of Gene Expression in Breast Tumors (NHGRI, J. Trent)
cDNA Microarrays: Parallel Gene Expression Analysis (6526 genes/tumor)
• How similar are the gene expression profiles of BRCA1 and BRCA2 (+) and sporadic breast cancer patient biopsies?
• Can we identify a set of genes that distinguish the different tumor types?
• Tumors studied:
  – 7 BRCA1 +
  – 8 BRCA2 +
  – 7 Sporadic

BRCA1 +/- and BRCA2 +/- Classification: Results from Hierarchical Clustering
[Figure: two hierarchical clustering dendrograms, titled “BRCA1 Clustering” and “BRCA2 Clustering”, with each specimen labeled by mutation status (BRCA1+/BRCA1- and BRCA2+/BRCA2-).]

Class Prediction Paradigm
1 Begin with a data set that can be separated into known groups.
2 Choose a method of class prediction.
3 Perform class prediction on the data set using “leave-one-out” cross-validation.
  • Leave one specimen out of data set.
  • Build the class predictor using remaining data.
  • Predict class of the left out specimen.
  • Repeat so that a prediction is made for every specimen.
4 Use a permutation test to determine if there is a significant difference in expression patterns between the groups.
  • Permute class labels among specimens.
  • Perform class prediction on the permuted data.
  • Repeat many times.
  • Report the % of permuted sets with an error rate equivalent to or less than that for the actual data set.

The Compound Covariate Predictor (CCP)
• We consider only genes that are differentially expressed between the two groups (using a two-sample t-test with small α).
• The CCP
  – Motivated by J. Tukey, Controlled Clinical Trials, 1993
  – Simple approach that may serve better than complex multivariate analysis
  – A compound covariate is built from the basic covariates (log-ratios)

    $\mathrm{CCP}_i = \sum_j t_j x_{ij}$

    $t_j$ is the two-sample t-statistic for gene j.
    $x_{ij}$ is the log-ratio measure of sample i for gene j.
    Sum is over all differentially expressed genes.
• Threshold of classification: midpoint of the CCP means for the two classes.

BRCA1 +/- and BRCA2 +/- Classification: Results from Class Prediction with CCP

Classification       α        Number of significant genes   Number of misclassified tumors (m)   % of permutations with m or fewer misclassifications
BRCA1+ vs. BRCA1-    0.0001   9                             1 (0 BRCA1+, 1 BRCA1-)               0.3
BRCA2+ vs. BRCA2-    0.0001   11                            4 (3 BRCA2+, 1 BRCA2-)               4.0

Sample Size Considerations for Accurate Class Prediction
[Figure: two panels, “BRCA1 +/- Classification” and “BRCA2 +/- Classification”. x-axis: Size of training set (bootstrap data), 40 to 200. y-axis: Misclassification error rate of original data set, 0.0 to 0.2. Curves: Mean and 95th percentile.]

Summary
• Class discovery and prediction methods have distinct goals.
• When class information is known, class prediction is a more powerful method for detecting differences.
• BRCA1 and BRCA2 mutation positive tumors have distinguishable gene expression patterns.
  – BRCA1 distinction is stronger than BRCA2.
  – Some biological insight concerning misclassified specimens.
  – Not at level of clinical classification yet.
• Sample size issues

Collaborators
NCI
  Richard Simon
NHGRI
  Mike Bittner
  Yidong Chen
  David Duggan
  Ingrid Hedenfalk
  Jeff Trent
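
Illustrative sketch (not part of the original slides): the class prediction paradigm and the compound covariate predictor described above can be tried out on toy data roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the synthetic data, and the parameters `n_perm` and `seed` are hypothetical. To stay within the Python standard library, genes are selected with a cutoff on |t| (the hypothetical `t_threshold`) rather than a significance level α, and genes with no within-class spread are simply skipped.

```python
import random
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic for two lists of log-ratios."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # no within-class spread; this sketch simply skips such genes
    return (mean(a) - mean(b)) / (pooled * (1 / na + 1 / nb)) ** 0.5


def ccp_score(weights, row):
    """Compound covariate: sum of t_j * x_ij over the selected genes j."""
    return sum(t * row[j] for j, t in weights.items())


def train_ccp(X, y, t_threshold):
    """Select genes with |t| >= t_threshold and compute the midpoint threshold."""
    weights = {}
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        t = t_statistic(pos, neg)
        if abs(t) >= t_threshold:  # assumption: |t| cutoff stands in for small alpha
            weights[j] = t
    pos_scores = [ccp_score(weights, r) for r, label in zip(X, y) if label == 1]
    neg_scores = [ccp_score(weights, r) for r, label in zip(X, y) if label == 0]
    return weights, (mean(pos_scores) + mean(neg_scores)) / 2


def predict_ccp(model, row):
    """Class 1 if the compound covariate exceeds the midpoint, else class 0."""
    weights, midpoint = model
    return 1 if ccp_score(weights, row) > midpoint else 0


def loocv_errors(X, y, t_threshold):
    """Leave-one-out misclassifications; genes are re-selected in every fold."""
    errors = 0
    for i in range(len(X)):
        model = train_ccp(X[:i] + X[i + 1:], y[:i] + y[i + 1:], t_threshold)
        errors += predict_ccp(model, X[i]) != y[i]
    return errors


def permutation_test(X, y, t_threshold, n_perm=200, seed=0):
    """Return (m, fraction of permuted label sets with m or fewer errors)."""
    rng = random.Random(seed)
    m = loocv_errors(X, y, t_threshold)
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)  # permute class labels among specimens
        hits += loocv_errors(X, y_perm, t_threshold) <= m
    return m, hits / n_perm


# Synthetic toy data (not from the study): 6 samples per class, 4 genes.
# Genes 0 and 1 differ between the classes; genes 2 and 3 do not.
y = [1] * 6 + [0] * 6
X = []
for i, label in enumerate(y):
    sign = 1.0 if label == 1 else -1.0
    X.append([
        sign * 1.0 + 0.1 * (i % 3 - 1),
        sign * 0.8 + 0.1 * ((i + 1) % 3 - 1),
        0.5 * (i % 2 * 2 - 1),
        0.3 * (i % 4 - 1.5),
    ])

m, fraction = permutation_test(X, y, t_threshold=3.0)
print(f"misclassified: {m}; permuted sets with m or fewer errors: {100 * fraction:.1f}%")
```

Note that gene selection happens inside `train_ccp`, so it is repeated within every leave-one-out fold and every permutation. Selecting genes once on the full data set before cross-validating would let the left-out specimen influence its own predictor.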
