A Statistical Framework for Expression-Based Molecular Classification

Elizabeth Garrett
Sidney Kimmel Cancer Center, Johns Hopkins University

Molecular Classification of Cancer

• Goals
  – Short term:
    • To use gene expression array data to identify and hypothesize subtypes of cancer
    • To discover new cancer classes that are interpretable and amenable to further biological analysis
    • To translate classes into clinical tools
  – Long term:
    • To eventually refine individualized prognosis and therapy

Outline of Talk

• Molecular Classifications
  – the role of statistics in molecular classification
  – defining a molecular profile
• Modeling latent classes: POE (Probability of Expression)
  – Bayesian mixture models
  – visualization tools
• “Mining” using latent classes
• Using POE to combine across platforms

[Figure: Botstein-Brown style of visualizing gene expression data (Garber et al. PNAS 2001)]

[Figure: The fine print]

Motivating Datasets

• Unclassified cancer samples: Are the gene expression patterns informative about subclasses?
  – Ductal breast cancers
  – Adenocarcinomas of the lung
  – Diffuse large B-cell lymphoma
• Related tissues: Are subtypes associated with prognosis?
  – Normal tissues and cancer tissues
  – Outcome data (e.g. survival, recurrence, response)
• Genes: Are hypothesized genes associated with cancer types?
  – Functional information
  – Custom array

General Approach of POE (Probability of Expression)

• Define a reference expression value:
  – “normal” vs. over expressed vs. under expressed
  – unsupervised in nature
• Use scale-independent measures of expression
  – allows combination of data across platforms
  – incorporates measurement errors
• Choose molecular profile that predicts cancer class based on a small number of genes
  – yields clinical implications
  – choose genes using combination of statistical and biological evidence
• Caveat: NOT intended for gene clustering and not for manual clustering of genes

Molecular Profiles (based on 3 genes A, B, and C)

$27 = 3^3$ possible profiles

              Gene A   Gene B   Gene C
Profile 1       -1       -1       -1
Profile 2       -1       -1        0
Profile 3       -1       -1        1
Profile 4       -1        0       -1
Profile 5       -1        0        0
Profile 6       -1        0        1
…                …        …        …
Profile 24       1        0        1
Profile 25       1        1       -1
Profile 26       1        1        0
Profile 27       1        1        1

Mixture of Normal and Two Uniform Distributions

[Figure: Empirical Density of Expression Levels in One Gene Across 203 Lung Samples (Bhattacharjee, PNAS 2001)]

Latent Expression Classes

• Notation:
  – $e_{gt} = -1$: gene g has abnormally low expression in tumor t
  – $e_{gt} = 0$: gene g has normal expression in tumor t
  – $e_{gt} = 1$: gene g has abnormally high expression in tumor t
• Modeling observed gene expression, $a_{gt}$:

$$a_{gt} \mid (e_{gt} = e) \sim f_{e,g}(\cdot), \qquad e \in \{-1, 0, 1\}$$

• For gene g, the proportions of differentially expressed tumors in the population of unclassified tumors are

$$\pi_g^{+} = P(e_{gt} = 1), \qquad \pi_g^{-} = P(e_{gt} = -1)$$

Probability Scale for Expression Data

$$p_{gt}^{+} = P(e_{gt} = 1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{1,g}, f_{0,g}) = \frac{\pi_g^{+} f_{1,g}(a_{gt})}{\pi_g^{+} f_{1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is over expressed given observed expression and the model parameters

$$p_{gt}^{-} = P(e_{gt} = -1 \mid a_{gt}, \pi_g^{+}, \pi_g^{-}, f_{-1,g}, f_{0,g}) = \frac{\pi_g^{-} f_{-1,g}(a_{gt})}{\pi_g^{-} f_{-1,g}(a_{gt}) + (1 - \pi_g^{+} - \pi_g^{-}) f_{0,g}(a_{gt})}$$

Interpretation: The probability that gene g in tumor t is under expressed given observed expression and the model parameters

Distributional Assumptions

Samples: Normal/Uniform mixture

$$f_{-1,g}(\cdot) = U(-\kappa_g^{-} + \alpha_t + \mu_g,\ \alpha_t + \mu_g)$$
$$f_{0,g}(\cdot) = N(\alpha_t + \mu_g,\ \sigma_g)$$
$$f_{1,g}(\cdot) = U(\alpha_t + \mu_g,\ \alpha_t + \mu_g + \kappa_g^{+})$$

Genes: Second stage model

$$\mu_g \mid \theta_\mu, \tau_\mu \sim N(\theta_\mu, \tau_\mu)$$
$$\sigma_g^{-2} \mid \gamma, \lambda \sim G(\gamma, \lambda)$$
$$\kappa_g^{+} \mid \theta_\kappa^{+} \sim E(\theta_\kappa^{+})$$
$$\kappa_g^{-} \mid \theta_\kappa^{-} \sim E(\theta_\kappa^{-})$$
$$\mathrm{logit}(\pi_g^{+}) \mid \theta_\pi^{+}, \tau_\pi^{+} \sim N(\theta_\pi^{+}, \tau_\pi^{+})$$
$$\mathrm{logit}(\pi_g^{-}) \mid \theta_\pi^{-}, \tau_\pi^{-} \sim N(\theta_\pi^{-}, \tau_\pi^{-})$$

[Figure: Original Scale / After Transformation. Harvard Lung Cancer Data (Bhattacharjee, PNAS, 2001)]

MCMC Estimation Approach

• Relatively straightforward
• A couple of comments:
  – Data augmentation using unknown expression variables $e_{gt}$. Sampling of […]'s unconditional on $e$'s: [… | …] * [$e$ | …, …] * [… | …, $e$] *
  – Starting conditions are critical. K-means clustering (k = 2 or 3) is useful for picking starting centers and spread
  – Constrain $\min(\kappa_g^{+}, \kappa_g^{-}) > k\sigma_g$

Denoising Expression Data

$$E(\tilde{a}_{gt} \mid a_{gt}, \omega) = \mu_g + \alpha_t + (p_{gt}^{+} + p_{gt}^{-})(a_{gt} - \mu_g - \alpha_t)$$

where $\tilde{a}_{gt}$ denotes the denoised expression level and $\omega$ the model parameters. Provides a “cleaner” version of the original expression level data.

Mining for Genes

• Two quantities of interest in looking for and grouping genes.
• Probability that gene g follows a specified pattern:

$$P(e_{g1}, \ldots, e_{gT} \mid \omega) = \prod_t (p_{gt}^{-})^{I(e_{gt} = -1)} \, (p_{gt}^{+})^{I(e_{gt} = 1)} \, (1 - p_{gt}^{+} - p_{gt}^{-})^{I(e_{gt} = 0)}$$

• Probability that all genes in set $G_0$ have the same pattern across samples:

$$q(G_0) = \prod_t \prod_{g, g' \in G_0} \left( p_{gt}^{+} p_{g't}^{+} + p_{gt}^{-} p_{g't}^{-} + (1 - p_{gt}^{+} - p_{gt}^{-})(1 - p_{g't}^{+} - p_{g't}^{-}) \right)$$

Identifying Gene Groups

• Preselect proportions of over and under expressed genes (e.g. 20% under, 5% over)
• Select genes consistent with proportions via $P(e_{g1}, \ldots, e_{gT} \mid \omega)$
• Choose genes which are similar in expression pattern to add to the group via $q(G_0)$
• Look at the “mining” plot to identify genes which are sensible (biologically)

[Figure: 5% underexpressed, 15% overexpressed, 4 sets]

[Figure: Molecular Profiles]

Combining Across Platforms

• Example: Stanford, Harvard, Michigan lung cancer datasets
• Publicly available
• Different platforms: Affymetrix, cDNA glass slides
• POE rescales to probability metric
• With some caveats, can combine data

• Statistics: G. Parmigiani, E. Garrett
• Arrays, Biology: E. Gabrielson, R. Anbazhagan
• http://astor.som.jhmi.edu/poe
• G. Parmigiani, E. Garrett, R. Anbazhagan, E. Gabrielson. A statistical framework for expression-based molecular classification in cancer. JRSS, in press.
• E. Garrett, G. Parmigiani. POE: Statistical Methods for Qualitative Analysis of Gene Expression. In The Analysis of Gene Expression Data: Methods and Software (eds. G. Parmigiani, E. Garrett, R. Irizarry, S. Zeger). To appear 2003.
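As a rough illustration of the probability scale, the pattern probability, and the denoising step described in the slides above, here is a minimal Python sketch for a single gene. It is not the authors' POE software: every function and parameter name is hypothetical, the model parameters are treated as known fixed numbers rather than MCMC draws, and `center` stands in for the sum of the sample and gene effects. The two uniform components are given disjoint supports, so the three-term normalizer used here reduces to the two-term denominators shown on the slides.

```python
import math
from itertools import product


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def poe_probabilities(a, center, sigma, kappa_minus, kappa_plus, pi_minus, pi_plus):
    """Return (p_minus, p_plus) for one observed expression value `a`.

    Hypothetical helper: `center` plays the role of the sample effect plus
    the gene effect, and all parameters are assumed known.
    """
    # Uniform component below the center (under expression).
    f_minus = 1.0 / kappa_minus if center - kappa_minus <= a < center else 0.0
    # Normal component around the center (normal expression).
    f_zero = normal_pdf(a, center, sigma)
    # Uniform component above the center (over expression).
    f_plus = 1.0 / kappa_plus if center < a <= center + kappa_plus else 0.0

    w_minus = pi_minus * f_minus
    w_zero = (1.0 - pi_minus - pi_plus) * f_zero
    w_plus = pi_plus * f_plus
    total = w_minus + w_zero + w_plus
    return w_minus / total, w_plus / total


def pattern_probability(pattern, probs):
    """Probability that one gene follows `pattern` (entries in {-1, 0, 1}).

    `probs` is a list of (p_minus, p_plus) pairs, one per tumor.
    """
    result = 1.0
    for e, (p_minus, p_plus) in zip(pattern, probs):
        if e == -1:
            result *= p_minus
        elif e == 1:
            result *= p_plus
        else:
            result *= 1.0 - p_minus - p_plus
    return result


def denoise(a, center, p_minus, p_plus):
    """Shrink `a` toward `center` by the probability of normal expression."""
    return center + (p_minus + p_plus) * (a - center)


# All 27 = 3**3 profiles for three genes, in the order used in the profile table.
profiles = list(product((-1, 0, 1), repeat=3))

if __name__ == "__main__":
    # Made-up parameter values, for illustration only.
    params = dict(center=0.0, sigma=1.0, kappa_minus=5.0, kappa_plus=5.0,
                  pi_minus=0.1, pi_plus=0.2)
    observed = [-3.5, 0.2, 3.0]
    probs = [poe_probabilities(a, **params) for a in observed]
    for a, (p_minus, p_plus) in zip(observed, probs):
        print(a, round(p_minus, 3), round(p_plus, 3),
              round(denoise(a, params["center"], p_minus, p_plus), 3))
    print(pattern_probability((-1, 0, 1), probs))
    print(len(profiles), profiles[23])
```

In a full analysis these quantities would be averaged over posterior draws of the parameters rather than evaluated at a single fixed setting, which is what the plug-in version above glosses over.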
