Data Analysis for Gene Chip Data
Part I: One-gene-at-a-time methods
Min-Te Chao
2002/10/28

Outline
• Simple description of gene chip data
• Earlier works
• Multiple t-test and SAM
• Lee’s ANOVA
• Wong’s factor models
• Efron’s empirical Bayes

Remarks
• Most works are statistical analysis, not really of the machine learning type
• The training sample is very small, not to mention the test sample
• Medical research needs scientific rigor wherever we can provide it

Arthritis and Rheumatism
• Guidelines for the submission and review of reports involving microarray technology, v.46, no.4, 859-861

Reproducibility
• Should document the accuracy and precision of the data, including the run-to-run variability of each gene
• No arbitrary setting of thresholds (e.g., 2-fold)
• Careful evaluation of the false discovery rate

Statistical Analysis
• Statistical analysis is absolutely necessary to support claims of an increase or decrease of gene expression
• Such rigor requires multiple experiments and analysis with standard statistical instruments

Sample Heterogeneity
• … Strongly recommends that investigators focus studies on homogeneous cell populations until other methodological and data analysis problems can be resolved.

Independent Confirmation
• It is important that the findings be confirmed using an independent method, preferably with separate samples rather than retesting of the original mRNA.

Microarray
• Other terms: DNA array, DNA chips, biochips, gene chips

• The underlying principle is the same for all microarrays, no matter how they are made
• Gene function is the key element researchers want to extract from the sequence
• The DNA array is one of the most important tools (Nature, v.416, April 2002, 885-891)

2 types of microarray
• cDNA
• Oligonucleotides
• DIY type

Gene expression
• A microarray allows researchers to determine which genes are being expressed in a given cell type at a particular time and under particular conditions

Basic data form
• On each array there are p “spots” (p > 1000, sometimes 20000). Each spot has k probes (k = 20 or so). There are usually 2k measurements (expressions) per spot, and the k differences, or the differences of logs, are used.
• Sometimes only a summary statistic per spot is given, e.g., the median or the mean

• Each spot corresponds to a “gene”
• For each study, we can arrange the chips so that the i-th spot represents the i-th gene (genes close in index may not be close physically at all)
• This means that when we read the i-th spot of all chips in one study, we get different measurements of the same i-th gene

• The data of one chip can be arranged in the form Y; X_1, X_2, …, X_p, just as in a regression setup. In practice, however, n (the number of chips used) is small compared with p.
• Y is the response: cell type, experimental condition, survival time, …

• For a spot with 20 probes, see Efron et al. (2001, JASA, p.1153).
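
The basic data form described above can be sketched in Python with NumPy. This is only an illustration under assumptions that the slides do not state: the 2k measurements per spot are taken to be k pairs, named `pm` and `mm` here after the PM-MM notation of Wong’s model, the numbers are simulated, and the function names are hypothetical.

```python
import numpy as np

def probe_differences(pm, mm, use_logs=False):
    """Return the k per-probe differences for each spot.

    pm, mm: arrays of shape (p, k) holding the two halves of the 2k
    measurements per spot (hypothetical names). Values must be positive
    when use_logs=True.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    if use_logs:
        return np.log(pm) - np.log(mm)  # difference of logs
    return pm - mm                      # plain difference

def spot_summary(diffs, how="median"):
    """Collapse the k differences of each spot into one number per spot."""
    if how == "median":
        return np.median(diffs, axis=1)
    if how == "mean":
        return np.mean(diffs, axis=1)
    raise ValueError("how must be 'median' or 'mean'")

# Toy chip: p = 5 spots with k = 4 probes each
# (real chips have p > 1000 and k around 20).
rng = np.random.default_rng(0)
pm = rng.uniform(100.0, 1000.0, size=(5, 4))
mm = rng.uniform(50.0, 500.0, size=(5, 4))

# One summary value per spot: X_1, ..., X_p for this chip.
x = spot_summary(probe_differences(pm, mm, use_logs=True))
```

Stacking the vector `x` from each of the n chips in a study, together with the response Y of each chip, gives the regression-like layout with n small compared with p.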

Earlier works
• Cluster analysis
• Fold methods
• Multiple t with Bonferroni correction

Multiple t with Bonferroni correction
• It is too conservative
• Family-wise error rate: among G tests, the probability of at least one false rejection. Without correction it goes to 1 at an exponential rate in G (for independent tests at level alpha it equals 1-(1-alpha)^G).

Sidak’s single-step adjusted p-value
    p’ = 1-(1-p)^G
Bonferroni’s single-step adjusted p-value
    p’ = min{Gp, 1}
Both are very conservative

FDR: false discovery rate
• Roughly: among all rejected cases, what fraction is rejected wrongly? (Benjamini and Hochberg 1995, JRSSB, 289-300)
• “Sequential p-method”

Sequential p-method
• Using the observed data, it estimates the rejection region so that the FDR is kept below a chosen level alpha
• Order all p-values from small to large and obtain a k so that the k hypotheses with the smallest p-values are rejected

• Since a different definition of error is controlled, the “power” increases
• For modifications, see Storey (2002, JRSSB, 479-498)
• These criteria are specifically designed to handle risk assessment when G is large

Role of permutation
• For tests (multiple or not), it is important to use a null distribution
• It is generated by a well-designed permutation of the columns of the data matrix; a column refers to an observation (array), not a gene

One simple example
• Say we look at the first gene, with n_1 arrays for treatment and n_2 arrays for control
• We use a t-statistic, say t_1. What is the p-value corresponding to this observed t_1?

• Permute the n = n_1 + n_2 columns of the data matrix and look at the first row (which corresponds to the first gene)
• Treat the first n_1 numbers as a fake “treatment” and the last n_2 numbers as a fake “control”, and compute a t-value; say we get s_1

• Permute again, do the same thing, and get s_2, …
• Do it B times and get s_1, s_2, …, s_B
• Treat these s’s as a (bootstrap) sample from the null distribution of the t_1 statistic
• The p-value of the earlier t_1 is found from the ecdf of the s_j, j = 1, 2, …, B

• Permutation plays a major role: it provides a reference measure of variation in various situations
• For a well-designed microarray experiment, DOE techniques will play an important role in determining how to do proper permutations

SAM: significance analysis of microarrays
• A standard method of microarray analysis, taught many times in Stanford short courses on data mining
• Modified multiple t-tests
• Uses permutations of certain data columns to evaluate the variation of the data for each gene

• The original paper is hard to read (Tusher, Tibshirani and Chu, PNAS 2001, v.98, no.9, 5116-5121), but the SAM manual is a lot easier to read for statisticians (free software for academic use)

• D(i) = {X_treatment - X_control} / {s(i) + s_0}, i = 1, 2, …, G, where X_treatment and X_control are the two group means for gene i
• The statistics are ordered: D(1) < D(2) < …
• s_0, used in SAM, is a carefully determined constant > 0

• The D(i)* are computed under a certain group of permutations of the columns; the D(i)* are also ordered
• Plot D vs. D*; points that deviate from the 45-degree line by more than a threshold Delta signal significant expression changes
• Vary Delta to obtain different FDRs

Other model-based methods
• Wong’s model: PM-MM = \theta \phi + \epsilon
• Outlier detection
• Model validation
• Li and Wong (2001, PNAS, v.98, no.1, 31-36)

Lee’s work
• ANOVA based
• Can handle unbalanced data, e.g., 7 microarray chips (Lee et al. 2000, PNAS, v.97, 9834-9839)

Empirical Bayes
• Efron et al. (2001), JASA, v.96, 1151-1160
• Use a mixture model f(z) = p_0 f_0(z) + p_1 f_1(z), with f_0 and f_1 estimated from the data.
p_1 = prior probability that a gene’s expression is affected (by a treatment)

• A key idea is to use permuted (columns) data to estimate f_0
• Use a tricky logistic regression method
• Eventually find p_1(Z) = the a posteriori probability that a gene at expression level Z is affected

Part I conclusion
• Earlier methods are relatively easy to understand, but it takes time to become familiar with the biological language
• More powerful data analytic methods will continue to be developed
• It is important to first understand the basic problems of biologists before we jump in with fancy statistical methods

• We may solve the wrong problem …
• But if the problem is relevant, even simple methods can get good recognition
• All methods so far are “first moment only”, i.e., not too different from multiple t-tests; in other words, they are all one-gene-at-a-time methods

• We did not address data cleaning, outlier detection, normalization, etc. Microarray data are highly noisy, and these problems are by no means trivial.
• As the cost per chip goes down, the number of chips per problem may grow. Still, well-designed experiments, e.g., fractional factorial designs, have room to play in this game.

• Statistical methods, as compared with machine-learning-based methods, will play a more important role for this type of data since, with a model, parametric or not, one can attach a measure of confidence to the claimed result. This is crucial for scientific development.

Quote:
• The statistical literature for microarrays, still in its infancy and with much of it unpublished, has tended to focus on frequentist data-analytical devices, such as cluster analysis, bootstrapping and linear models. (Efron, B. 2001)
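
The one-gene-at-a-time recipe from the earlier slides (permutation p-values, the single-step adjustments, and the sequential p-method) can be sketched in Python with NumPy. This is a minimal sketch under assumptions the slides do not fix: a pooled-variance two-sample t-statistic, two-sided p-values, random rather than exhaustive permutations, a +1 correction so that no p-value is exactly zero, and simulated data. All function names are hypothetical.

```python
import numpy as np

def t_stats(data, n1):
    """Pooled-variance two-sample t-statistic for every row (gene).

    data: array of shape (G, n); the first n1 columns are treatment,
    the remaining columns are control.
    """
    a, b = data[:, :n1], data[:, n1:]
    n2 = b.shape[1]
    pooled = ((n1 - 1) * a.var(axis=1, ddof=1)
              + (n2 - 1) * b.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(pooled * (1.0 / n1 + 1.0 / n2))

def permutation_pvalues(data, n1, B=1000, seed=0):
    """Two-sided permutation p-value for each gene, from B column shuffles."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(t_stats(data, n1))
    exceed = np.zeros(data.shape[0])
    for _ in range(B):
        perm = rng.permutation(data.shape[1])   # shuffle arrays (columns), not genes
        s = np.abs(t_stats(data[:, perm], n1))  # fake treatment vs. fake control
        exceed += s >= t_obs
    # Share of permuted statistics at least as extreme as the observed one.
    # The +1 terms are an assumption of this sketch, not part of the slides.
    return (exceed + 1.0) / (B + 1.0)

def sidak(p):
    """Sidak single-step adjusted p-values: 1 - (1 - p)^G."""
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - p) ** len(p)

def bonferroni(p):
    """Bonferroni single-step adjusted p-values: min(G * p, 1)."""
    p = np.asarray(p, dtype=float)
    return np.minimum(len(p) * p, 1.0)

def bh_reject(p, alpha=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values,
    where k is the largest index with p_(k) <= k * alpha / G."""
    p = np.asarray(p, dtype=float)
    G = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, G + 1) / G
    reject = np.zeros(G, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max() + 1
        reject[order[:k]] = True
    return reject

# Simulated study: G genes, n1 treatment arrays, n2 control arrays.
rng = np.random.default_rng(1)
G, n1, n2 = 200, 6, 6
data = rng.normal(size=(G, n1 + n2))
data[:10, :n1] += 3.0  # the first 10 genes are truly affected

p = permutation_pvalues(data, n1, B=500)
rejected = bh_reject(p, alpha=0.1)
```

Comparing `rejected` with `bonferroni(p) <= 0.1` on the same p-values shows how the choice of error criterion changes the number of genes declared significant. With only n_1 + n_2 = 12 arrays the number of distinct relabellings is small, which limits how small a permutation p-value can get.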
