A Statistical Framework for Expression-Based Molecular Classification
Elizabeth Garrett
Sidney Kimmel Cancer Center
Johns Hopkins University

Molecular Classification of Cancer
• Goals
  – Short term:
    • To use gene expression array data to identify and hypothesize subtypes of cancer
    • To discover new cancer classes that are interpretable and amenable to further biological analysis
    • To translate classes into clinical tools
  – Long term:
    • To eventually refine individualized prognosis and therapy

Outline of Talk
• Molecular Classifications
  – the role of statistics in molecular classification
  – defining a molecular profile
• Modeling latent classes: POE (Probability of Expression)
  – Bayesian mixture models
  – visualization tools
• “Mining” using latent classes
• Using POE to combine across platforms

Botstein-Brown style of visualizing gene expression data
[Figure: (Garber et al. PNAS 2001)]

The fine print

Motivating Datasets
• Unclassified cancer samples: Are the gene expression patterns informative about subclasses?
  – Ductal breast cancers
  – Adenocarcinomas of the lung
  – Diffuse large B-cell lymphoma
• Related tissues: Are subtypes associated with prognosis?
  – Normal tissues and cancer tissues
  – Outcome data (e.g. survival, recurrence, response)
• Genes: Are hypothesized genes associated with cancer types?
  – Functional information
  – Custom array

General Approach of POE (Probability of Expression)
• Define a reference expression value:
  – “normal” vs. over expressed vs. under expressed
  – unsupervised in nature
• Use scale-independent measures of expression
  – allows combination of data across platforms
  – incorporates measurement errors
• Choose a molecular profile that predicts cancer class based on a small number of genes
  – yields clinical implications
  – choose genes using a combination of statistical and biological evidence
• Caveat: NOT intended for gene clustering and not for manual clustering of genes

Molecular Profiles (based on 3 genes A, B, and C)
27 = 3^3 possible profiles

              Gene A   Gene B   Gene C
  Profile 1     -1       -1       -1
  Profile 2     -1       -1        0
  Profile 3     -1       -1        1
  Profile 4     -1        0       -1
  Profile 5     -1        0        0
  Profile 6     -1        0        1
  ….            ….       ….       ….
  Profile 24     1        0        1
  Profile 25     1        1       -1
  Profile 26     1        1        0
  Profile 27     1        1        1

Mixture of Normal and Two Uniform Distributions
[Figure: Empirical Density of Expression Levels in One Gene Across 203 Lung Samples (Bhattacharjee, PNAS 2001)]

Latent Expression Classes
• Notation:
  – $e_{gt} = -1$: gene g has abnormally low expression in tumor t
  – $e_{gt} = 0$: gene g has normal expression in tumor t
  – $e_{gt} = 1$: gene g has abnormally high expression in tumor t
• Modeling observed gene expression, $a_{gt}$:
  $$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \qquad e \in \{-1, 0, 1\}$$
• For gene g, the proportions of differentially expressed tumors in the population of unclassified tumors are
  $$\pi_g^+ = P(e_{gt} = 1), \qquad \pi_g^- = P(e_{gt} = -1)$$

Probability Scale for Expression Data
$$p_{gt}^+ = P(e_{gt} = 1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{1,g}, f_{0,g}) = \frac{\pi_g^+ f_{1,g}(a_{gt})}{\pi_g^+ f_{1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$
Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters

$$p_{gt}^- = P(e_{gt} = -1 \mid a_{gt}, \pi_g^+, \pi_g^-, f_{-1,g}, f_{0,g}) = \frac{\pi_g^- f_{-1,g}(a_{gt})}{\pi_g^- f_{-1,g}(a_{gt}) + (1 - \pi_g^+ - \pi_g^-) f_{0,g}(a_{gt})}$$
Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters

Distributional Assumptions
Samples: Normal/Uniform mixture
$$f_{-1,g}(\cdot) = U(-\kappa_g^- + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$$
$$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$$
$$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^+)$$
Genes: Second stage model (the hyperparameter symbols are not recoverable from the source and are shown as $\cdot$)
$$\mu_g \mid \cdot \sim N(\cdot, \cdot)$$
$$\sigma_g^{-2} \mid \cdot \sim G(\cdot, \cdot)$$
$$\kappa_g^+ \mid \cdot \sim E(\cdot)$$
$$\kappa_g^- \mid \cdot \sim E(\cdot)$$
$$\mathrm{logit}(\pi_g^+) \mid \cdot \sim N(\cdot, \cdot)$$
$$\mathrm{logit}(\pi_g^-) \mid \cdot \sim N(\cdot, \cdot)$$

[Figure: Harvard Lung Cancer Data (Bhattacharjee, PNAS, 2001); panels: Original Scale, After Transformation]

MCMC Estimation Approach
• Relatively straightforward
• A couple of comments:
  – Data augmentation using unknown expression variables $e_{gt}$. Sampling of the $\kappa$'s is unconditional on the $e$'s: $[\kappa \mid \text{rest}] \times [e \mid \kappa, \text{rest}] \times [\text{rest} \mid \kappa, e]$
  – Starting conditions are critical. K-means clustering (k = 2 or 3) is useful for picking starting centers and spread
  – Constrain $\min(\kappa_g^+, \kappa_g^-) > k\,\sigma_g$

Denoising Expression Data
The posterior expectation of the expression level given $a_{gt}$ and the model parameters $\omega$ is
$$\mu_g + \alpha_t + (p_{gt}^+ + p_{gt}^-)(a_{gt} - \mu_g - \alpha_t)$$
Provides a “cleaner” version of the original expression level data.

Mining for Genes
• Two quantities of interest in looking for and grouping genes.
• Probability that gene g follows a specified pattern:
  $$P(e_{g1}, \ldots, e_{gT} \mid \omega) = \prod_t (p_{gt}^+)^{I(e_{gt} = 1)} (p_{gt}^-)^{I(e_{gt} = -1)} (1 - p_{gt}^+ - p_{gt}^-)^{I(e_{gt} = 0)}$$
• Probability that all genes in set $G_0$ have the same pattern across samples:
  $$q(G_0) = \prod_t \prod_{g, g' \in G_0} \left( p_{gt}^+ p_{g't}^+ + p_{gt}^- p_{g't}^- + (1 - p_{gt}^+ - p_{gt}^-)(1 - p_{g't}^+ - p_{g't}^-) \right)$$

Identifying Gene Groups
• Preselect proportions of over and under expressed genes (e.g. 20% under, 5% over)
• Select genes consistent with these proportions via $P(e_{g1}, \ldots, e_{gT} \mid \omega)$
• Choose genes which are similar in expression pattern to add to the group via $q(G_0)$.
• Look at the “mining” plot to identify genes which are biologically sensible.

[Figure: 5% underexpressed, 15% overexpressed, 4 sets]

Molecular Profiles

Combining Across Platforms
• Example: Stanford, Harvard, Michigan lung cancer datasets
• Publicly available
• Different platforms: Affymetrix, cDNA glass slides
• POE rescales to probability metric
• With some caveats, can combine data

• Statistics: G. Parmigiani, E. Garrett
• Arrays, Biology: E. Gabrielson, R. Anbazhagan
• http://astor.som.jhmi.edu/poe
• G. Parmigiani, E. Garrett, R. Anbazhagan, E. Gabrielson. A statistical framework for expression-based molecular classification in cancer. JRSS, in press.
• E. Garrett, G. Parmigiani. POE: Statistical Methods for Qualitative Analysis of Gene Expression. In The Analysis of Gene Expression Data: Methods and Software (eds. G. Parmigiani, E. Garrett, R. Irizarry, S. Zeger). To appear 2003.
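
Illustrative sketch: the probability-scale transformation on the "Probability Scale for Expression Data" slide and the shrinkage formula on the "Denoising Expression Data" slide can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions, not the POE software. It treats all model parameters as already known (in the talk they are estimated by MCMC), and the function names, argument names, and the numeric values in the usage lines are hypothetical, chosen only for illustration. Here `center` stands for the sum of the sample and gene effects, `sigma` for the normal standard deviation, `kappa_neg` and `kappa_pos` for the widths of the two uniform components, and `pi_neg` and `pi_pos` for the proportions of under and over expressed tumors.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def uniform_pdf(x, lo, hi):
    """Density of a uniform distribution on the open interval (lo, hi)."""
    return 1.0 / (hi - lo) if lo < x < hi else 0.0


def poe_probabilities(a, center, sigma, kappa_neg, kappa_pos, pi_neg, pi_pos):
    """Return (p_minus, p_plus) for one observed expression value a.

    Assumes known parameters; all argument names are hypothetical.
    """
    f0 = normal_pdf(a, center, sigma)                      # "normal" class
    f_neg = uniform_pdf(a, center - kappa_neg, center)     # under expressed class
    f_pos = uniform_pdf(a, center, center + kappa_pos)     # over expressed class

    w0 = (1.0 - pi_pos - pi_neg) * f0
    w_neg = pi_neg * f_neg
    w_pos = pi_pos * f_pos

    # Two-term denominators, as written on the slide. Guard against a zero
    # denominator, which can occur if the normal density underflows.
    p_plus = w_pos / (w_pos + w0) if (w_pos + w0) > 0.0 else 0.0
    p_minus = w_neg / (w_neg + w0) if (w_neg + w0) > 0.0 else 0.0
    return p_minus, p_plus


def denoise(a, center, p_minus, p_plus):
    """Shrink a toward center in proportion to the probability of normal expression."""
    return center + (p_plus + p_minus) * (a - center)


# Hypothetical parameter values, for illustration only.
params = dict(center=0.0, sigma=1.0, kappa_neg=6.0, kappa_pos=6.0,
              pi_neg=0.1, pi_pos=0.1)
for a in (-4.0, 0.5, 4.0):
    p_minus, p_plus = poe_probabilities(a, **params)
    print(a, p_minus, p_plus, denoise(a, params["center"], p_minus, p_plus))
```

Values far from the center are assigned a probability of differential expression close to 1 and are left nearly unchanged by `denoise`, while values near the center are pulled toward it. Because the output is a probability rather than a raw intensity, the same code applies regardless of the measurement scale of the platform.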