Statistical challenges in micro-array data
Hans C. van Houwelingen
Department of Medical Statistics, Leiden University Medical Center, The Netherlands
email: [email protected]
London, 5 February 2003

Key paper
Golub TR, Slonim DK, Tamayo P, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286, 531-537, 1999.

Task Force Bio-informatics Leiden
Leiden University Medical Center: Human Genetics, Molecular Cell Biology, Pathology, Medical Statistics.
Faculty of Mathematics and Natural Sciences: Mathematical Statistics.
Collaboration between Medical Statistics and Mathematical Statistics led to ...

Micro-arrays
Micro-arrays are used for:
- measuring gene expression profiles: Affymetrix (single colour), cDNA (two colours);
- measuring gene amplification/deletion: Comparative Genome Micro-array (single colour).
In each array experiment measurements are obtained for very many genes (500-25000).

Practical problems
- The techniques are based on hybridization and optical reading, hence there are problems to distinguish the signal from the background and to carry out a proper background correction.
- The measurements can be far from perfect, leading to a lot of missing data.
- Absolute intensity levels vary from array to array (and from colour to colour); to make the data comparable, they have to be normalized. The simplest way is by setting the geometric mean = 1 (a small sketch of this step is given at the end of this introduction).

Sources of random variation
Random variation enters at different levels: between pixels within spots, between spots within arrays, between arrays within samples, between samples within individuals, and between individuals within groups. We ignore the variation within the array and assume that each array gives one measurement per gene, usually expressed as ln((relative) intensity).

Data sets / designs
Micro-arrays are still quite costly per array (but not per gene). Large data sets have about 100 arrays; data sets with only a few arrays are very common. Study designs depend on the field of application (plants/animals/human). In non-human applications material is often pooled to reduce the number of arrays. In medical research pooling of patients is less natural.

Simplified scheme
X: gene-expression data on many genes (for few patients).
Y: patient characteristic.

Designs / Questions / Analyses
1. Sequence of observations $X_1,\ldots,X_n$: cluster analysis.
2. Factorial designs (Y -> X): multivariate analysis of variance; searching for differentially expressed genes.
3. Classification and prediction (X -> Y): discriminant analysis / multiple regression; searching for influential genes.

Remark about the multiplicity problem
Having many genes has advantages and disadvantages.
- Disadvantage: the multiple testing problem. There is a big chance of false positives, and Bonferroni-type corrections could be far too conservative and too restrictive.
- Advantage: all genes give similar data, so information from the other genes can be used to make inference about one particular gene. That makes micro-array data the ideal playing field for empirical Bayes methodology.
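As a small aside on the normalization step mentioned above (setting the geometric mean of each array to 1), here is a minimal sketch of how that rescaling could be implemented. It is an illustration only, not the preprocessing actually used for the data in these slides; the function name and the (genes x arrays) layout are assumptions.

```python
import numpy as np

def normalize_geometric_mean(intensities):
    """Rescale each array (column) so that its geometric mean equals 1.

    intensities : (genes, arrays) matrix of strictly positive raw intensities.
    Dividing by the per-array geometric mean is the same as subtracting the
    column mean on the ln scale, which is the scale used in the rest of the talk.
    """
    log_x = np.log(intensities)
    log_x -= log_x.mean(axis=0, keepdims=True)   # geometric mean -> 1 per array
    return np.exp(log_x)
```

After this step the ln(intensity) values are directly comparable across arrays; in the presence of missing values the per-array mean would have to skip them (e.g. with np.nanmean).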
General challenges
- Computational: not really, as long as the number of experiments is relatively small; however, standard software has trouble coping with few rows and many columns.
- Conceptual: what do we want? What can be delivered?

Specific challenges
- Structured cluster analysis (design 1).
- Increasing the degrees of freedom when searching for differentially expressed genes (design 2).
- Finding reliable predictors (design 3).
- Finding influential genes (design 3).

Design 1. No structure between experiments
Typical data: $X_{ij}$ = ln(ratio experimental/control), $i=1,\ldots,G$, $j=1,\ldots,J$, where i stands for the gene and j for the different experiments. The "natural" value is $X_{ij} = 0$; a natural "cut-off" is $\ln(2)$ (two-fold change).

For J = 1, classification into under-expressed, normal and over-expressed can be based on one-dimensional cluster analysis or latent class models. [Figure: example of a histogram of the ln(ratio) values over all genes.]

For J > 1, cluster analysis (unsupervised learning) can be used to cluster the genes. Statistical challenges are:
- to quantify the uncertainty of the clusters and of cluster membership by proper statistical modelling;
- to limit the possible cluster profiles by proper modelling (pre-structured clustering).

Example: CG (comparative genomic) micro-arrays
Used in the diagnosis of Down syndrome: 21 individuals, 448 genes.
[Figure: data per "individual": ln(ratio) along the genome, with levels marked amplified, normal and deleted.]
[Figure: data per gene: ln(ratio) across the 21 patients.]

Interest is in clustering the genes.
Probabilistic clustering:
- true clusters overlap;
- each cluster has its own mean;
- the variation can depend on the sample (patient), not on the cluster;
- samples within a cluster are independent.
So, the model within cluster k is $X_{ij} \sim N(\mu_{kj}, \sigma_j^2)$.

The statistical procedure estimates the cluster means, the standard deviations, the prior probabilities of the clusters and the posterior cluster probabilities of the genes. It is easily fitted by EM, and missing data are no problem (a minimal sketch of such an EM fit is given after this part).

Results (mean ± 2 st. dev. per cluster):
- Cluster 1: prior prob. 0.42 [figure]
- Cluster 2: prior prob. 0.22 [figure]
- Cluster 3: prior prob. 0.24 [figure]
- Cluster 4: prior prob. 0.12 [figure]
[Figure: all cluster means in one picture.]

Structured probabilistic clustering
Put prior (biological/genetic/medical) knowledge into the model. Proposal for this kind of data: $\mu_{kj} = a_k \gamma_j$ with $a_k = 0$ for the cluster of normal genes. This is a one-factor latent-class model.
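The following is a minimal NumPy sketch of such an EM fit for the unstructured mixture model above ($X_{ij} \sim N(\mu_{kj}, \sigma_j^2)$ with cluster priors $\pi_k$). It is an illustration, not the code behind the slides; the function name, the random initialisation and the fixed number of iterations are my own choices. Missing entries are simply skipped in the likelihood, which is what makes EM convenient here. The structured variant would only change the M-step, constraining the cluster means to the one-factor form $\mu_{kj} = a_k \gamma_j$.

```python
import numpy as np

def em_cluster(X, K, n_iter=200, seed=0):
    """EM for the mixture model X_ij ~ N(mu_kj, sigma_j^2) within cluster k.

    X : (G, J) array of ln-ratios (genes x samples), np.nan for missing values.
    K : number of clusters.
    Returns cluster priors pi (K,), cluster means mu (K, J), sample variances
    sigma2 (J,) and posterior cluster probabilities r (G, K) per gene.
    """
    G, J = X.shape
    obs = ~np.isnan(X)                       # mask of observed entries
    Xf = np.where(obs, X, 0.0)               # missing entries contribute nothing
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(np.ones(K), size=G)    # random initial responsibilities

    for _ in range(n_iter):
        # M-step: cluster priors, cluster means per sample, variances per sample
        pi = r.mean(axis=0)
        w = r.T @ obs                                        # effective counts (K, J)
        mu = (r.T @ Xf) / np.maximum(w, 1e-12)
        resid2 = (Xf[:, None, :] - mu[None, :, :]) ** 2 * obs[:, None, :]
        sigma2 = (r[:, :, None] * resid2).sum(axis=(0, 1)) / obs.sum(axis=0)

        # E-step: posterior cluster probabilities, observed entries only
        loglik = -0.5 * (resid2 / sigma2
                         + np.log(2 * np.pi * sigma2) * obs[:, None, :])
        logr = np.log(pi)[None, :] + loglik.sum(axis=2)
        logr -= logr.max(axis=1, keepdims=True)              # stabilise the exp
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)

    return pi, mu, sigma2, r
```

For data shaped like the CGH example (448 genes, 21 individuals) one would call, e.g., `pi, mu, sigma2, post = em_cluster(X, K=4)` and read the per-gene cluster assignments from the posterior probabilities in `post`.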
Cluster means for the structured model:

  cluster          prior prob.    ln(ratio) (on relative scale)
  deleted          0.2477         -0.9319
  normal           0.6180          0
  amplified        0.1321          2.2988
  overamplified    0.0022          6.2428

Posteriors along the genome
[Figure: posterior probabilities of the deleted, normal, amplified and overamplified states plotted against distance along the genome, shown for chromosomes 1, 7 and 20.]

Points of discussion
- Check the fit of the model.
- Include dependencies in the cluster status along the chromosome (hidden (semi-)Markov model).

Design 2. Searching for differentially expressed genes
Example: Affymetrix array, 12488 genes, 9 experiments: 3 wild-type, 2 MDX, 3 HDMD, 1 HDMDxMDX.
Question: which genes show a difference between the groups?
[Figure: log(intensity) for some of the genes across the nine experiments.]

Typical data: $X_{icj}$, where $i=1,\ldots,G$ stands for the genes, $c=1,\ldots,C$ for the different conditions and $j=1,\ldots,J_c$ for the repetitions within the conditions. The usual case is C = 2 with $J_1$ and $J_2$ quite small (in the range 2-4).

Let $\mu_{ic} = E[X_{icj} \mid i,c]$ and $\sigma_i^2 = \mathrm{var}[X_{icj} \mid i,c]$. Due to the small sample sizes, it is impossible to carry out significance tests per gene. Much can be gained from modelling the variation of $(\mu_{i1},\ldots,\mu_{iC})$ and $\sigma_i^2$ over all genes. A simple model that relates $\sigma_i^2$ to $\mu_{i1}$ can help dramatically.

Back to the example. First attempt: F-test per gene.
[Figure: histogram of all p-values per gene (Mean = .517, Std. Dev = .28, N = 12488).]
[Figure: Q-Q plot of the p-values per gene against their standardized ranks.]
It is nearly perfectly uniform: no proof of any significance. Main reason: too few degrees of freedom in the denominator.

Look at the distribution of the within-group variances.
[Figure: histogram of the within-group variance per gene (Mean = .10, Std. Dev = .11, N = 12488).]
There is some overdispersion: CV = 1.1 instead of $\mathrm{CV} = \sqrt{2/5} \approx 0.63$.

Second attempt: using the average variance in the denominator.
[Figure: histogram of all p-values per gene using the average variance (Mean = .60, Std. Dev = .33, N = 12488).]
[Figure: Q-Q plot of these p-values against their ranks.]
It looks like the p-values are computed from the wrong distribution.

Third attempt: relating $\sigma_i^2$ to $\mu_{i1}$.
[Figure: within-group variance per gene (log scale) plotted against the mean of the wild-types; Rsq = 0.3950.]
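To make the three attempts concrete, here is a small NumPy/SciPy sketch of how the per-gene tests could be set up. It is an illustration only, not the analysis behind the slides: the function name, the linear regression of the log within-group variance on the wild-type mean, and the use of the plain F reference distribution in all three cases are assumptions, and the slides make clear that this reference is only approximate once the denominator borrows strength from other genes.

```python
import numpy as np
from scipy import stats

def per_gene_anova_pvalues(X, groups, denominator="per_gene"):
    """One-way ANOVA-type p-values per gene for a (genes x arrays) matrix.

    X           : (G, n) array of log-intensities
    groups      : length-n integer group labels (here: the four conditions;
                  group label 0 is assumed to be the wild-type)
    denominator : "per_gene"  - usual within-group variance per gene
                  "pooled"    - the average within-group variance over all genes
                  "predicted" - variance predicted from a regression of the
                                log within-group variance on the wild-type mean
    """
    groups = np.asarray(groups)
    G, n = X.shape
    labels = np.unique(groups)
    C = len(labels)
    df1, df2 = C - 1, n - C

    grand = X.mean(axis=1, keepdims=True)
    ss_between = np.zeros(G)
    ss_within = np.zeros(G)
    for g in labels:
        cols = X[:, groups == g]
        m = cols.mean(axis=1, keepdims=True)
        ss_between += cols.shape[1] * (m[:, 0] - grand[:, 0]) ** 2
        ss_within += ((cols - m) ** 2).sum(axis=1)

    ms_between = ss_between / df1
    s2_gene = ss_within / df2                    # within-group variance per gene

    if denominator == "per_gene":
        s2 = s2_gene
    elif denominator == "pooled":
        s2 = np.full(G, s2_gene.mean())
    else:  # "predicted": borrow strength via a mean-variance regression
        mu_wt = X[:, groups == labels[0]].mean(axis=1)   # wild-type mean per gene
        slope, intercept = np.polyfit(mu_wt, np.log(s2_gene + 1e-12), 1)
        s2 = np.exp(intercept + slope * mu_wt)

    F = ms_between / s2
    # The plain F(df1, df2) reference is exact only for the per-gene case; for
    # the pooled and predicted denominators it is an approximation.
    return stats.f.sf(F, df1, df2)
```

Comparing histograms and Q-Q plots of the three resulting sets of p-values against the uniform distribution is exactly the diagnostic shown on the slides above and below.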
Using the predicted variances instead of the mean variance:
[Figure: histogram of the p-values based on the estimated variance (Mean = .453, Std. Dev = .32, N = 12488).]
[Figure: Q-Q plot of these p-values against their ranks.]
This looks really promising!

The interesting issue is that the collection of p-values per gene gives a good impression of the validity of the test.
Alternative to the procedure above: a variance-stabilizing transform à la Huber et al.
Further improvement: a hierarchical model for the variation in the standard deviation à la Baldi and Long (work in progress).

Design 3. Classification and prediction
Data structure: expression data $X_{ij}$, where $i=1,\ldots,G$ again stands for the gene and $j=1,\ldots,J$ for the individual (patient); outcome $Y_j$ per patient.
Wanted: a model to predict Y from X.
Problem: the high dimension G of the design matrix.

Example: the Golub data set with dichotomous Y (ALL/AML); J = 38 individuals, G = 3571 genes ("bad" genes thrown away).
[Figure: histogram of the correlations of the outcome Y with all gene expressions in the Golub data set (Mean = .03, Std. Dev = .29, N = 3571).]

Natural model: logistic regression
$\ln\!\left(\frac{\pi(X)}{1-\pi(X)}\right) = \beta_0 + \beta'(X - \bar X)$ with $\pi(X) = P(Y=1 \mid X)$.
If you fit this to the complete data set you get $\hat\beta_i = \pm\infty$ and $\hat\pi(X_j) = Y_j$ (a perfect fit).
Way out: penalization (Eilers et al.) or empirical Bayes.

Empirical Bayes approach
$\beta_i$ i.i.d. $N(0, \tau^2)$, with unknown remaining parameters $\beta_0$ and $\tau^2$.
Applications:
- a simple test for no effect: $\tau^2 = 0$ vs $\tau^2 > 0$;
- a regularized estimate of $\beta$.

Score test for effect: $\tau^2 = 0$ vs $\tau^2 > 0$ (Goeman et al.)
Test statistic: $Q = (Y - \bar Y)'(X - \bar X)(X - \bar X)'(Y - \bar Y)$.
The p-value is based on the distribution of Y given the X's (and $\bar Y$), and is easily obtained by random permutation.
Result: awfully significant. [Graphs show the permutation distribution of Q and the position of the observed Q.]

Estimating $\beta_0, \tau^2$
Marginal likelihood: $L(\beta_0, \tau^2) = \int L(\beta_0, \beta_1, \ldots, \beta_G)\, f(\beta_1, \ldots, \beta_G \mid \tau^2)\, d\beta_1 \cdots d\beta_G$.
The integrals are very cumbersome to compute, and approximations are far from perfect.

Once the parameters $\beta_0, \tau^2$ have been estimated, the $\beta$'s are obtained from the posterior distribution. The posterior mode is the penalized likelihood estimator, which minimizes $-\ln L(\beta_0, \beta_1, \ldots, \beta_G) + 0.5 \sum_i \beta_i^2 / \tau^2$ (a minimal sketch of such a penalized fit is given below). The posterior distribution gives an impression of the precision of the linear predictor $X_j'\beta$ of individual j (doable) and of the individual $\beta_i$'s (hopeless).

Confidence intervals for the linear predictor look O.K. [Figure.]
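As a rough illustration of the posterior-mode / penalized-likelihood fit described above, here is a minimal Newton-Raphson (IRLS) sketch with the ridge penalty $0.5 \sum_i \beta_i^2/\tau^2$ on the gene coefficients. This is not the Eilers et al. or the empirical-Bayes implementation; the function name and defaults are assumptions, and in practice $\tau^2$ would come from the marginal-likelihood step discussed above.

```python
import numpy as np

def ridge_logistic(X, y, tau2=1.0, n_iter=25):
    """Posterior-mode fit of logit P(Y=1|X) = beta0 + beta'(X - Xbar)
    with ridge penalty 0.5 * sum(beta_i^2) / tau2 on the gene coefficients.

    X : (J, G) expression matrix (J patients, G genes); y : (J,) 0/1 outcome.
    Note: each Newton step solves a (G+1)x(G+1) system; for G in the thousands
    one would exploit the low rank of X, but this sketch keeps it simple.
    """
    J, G = X.shape
    Xc = X - X.mean(axis=0)                      # centre the gene expressions
    Z = np.hstack([np.ones((J, 1)), Xc])         # intercept column + centred X
    theta = np.zeros(G + 1)                      # (beta0, beta_1, ..., beta_G)
    P = np.eye(G + 1) / tau2                     # ridge penalty ...
    P[0, 0] = 0.0                                # ... but not on the intercept

    for _ in range(n_iter):
        eta = Z @ theta                          # linear predictor per patient
        p = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        W = p * (1.0 - p)                        # IRLS weights
        grad = Z.T @ (y - p) - P @ theta         # gradient of penalized log-lik
        hess = (Z * W[:, None]).T @ Z + P        # negative Hessian
        theta += np.linalg.solve(hess, grad)     # Newton-Raphson update

    return theta[0], theta[1:]                   # intercept, penalized beta's
```

The fitted linear predictor $\beta_0 + (X_j - \bar X)'\beta$ per patient is the quantity whose precision the confidence-interval figure above summarizes.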
Posterior modes look messy (much better for peaked priors). [Figure.]
Posterior Z-values look hopeless (independent of the prior). [Figure.]

Big problem: the lack of any prior structure among the different genes. This makes the selection of influential genes hopeless.
Big challenge: bring structure into the 20000 genes using either
- biology: pathways and the like, or
- statistics: meta-analysis on all data sets from the same platform.

References
Baldi P, Long AD, A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes, Bioinformatics 17, 509-519, 2001.
Eilers PHC, Boer J, van Ommen GJB, van Houwelingen JC, Classification of microarrays with penalized logistic regression, Proceedings of SPIE, Volume 4266, Progress in Biomedical Optics and Imaging 2, 187-198, 2001.
Goeman JJ, van de Geer SA, de Kort F, van Houwelingen JC, A global score test for differential expression of groups of genes, preprint, 2003.
Golub TR, Slonim DK, Tamayo P, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286, 531-537, 1999.
Huber W, von Heydebreck A, Sueltmann H, Poustka A, Vingron M, Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics 18 (Proceedings of ISMB 2002), Suppl 1, S96-S104, 2002.
Lee MLT, Kuo FC, Whitmore GA, et al., Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations, Proc Natl Acad Sci USA 97, 9834-9839, 2000.
de Menezes RX, Boer JM, van Houwelingen HC, Microarray data analysis: hierarchical modelling to handle heteroscedasticity, preprint, 2003.
Tusher VG, Tibshirani R, Chu G, Significance analysis of microarrays applied to the ionizing radiation response, Proc Natl Acad Sci USA 98, 5116-5121, 2001.
West M, Blanchette C, Dressman H, et al., Predicting the clinical status of human breast cancer by using gene expression profiles, Proc Natl Acad Sci USA 98, 11462-11467, 2001.