Analysis of gene expression data (Nominal explanatory variables)
Shyamal D. Peddada
Biostatistics Branch, National Institute of Environmental Health Sciences (NIH)
Research Triangle Park, NC

Outline of the talk
– Two types of explanatory variables ("experimental conditions")
– Some scientific questions of interest
– A brief discussion of false discovery rate (FDR) analysis
– Some existing statistical methods for analyzing microarray data

Types of explanatory variables ("experimental conditions")
Nominal variables:
– No intrinsic order among the levels of the explanatory variable(s).
– No loss of information if we permute the labels of the conditions.
– E.g., comparison of gene expression in samples from "normal" tissue with samples from "tumor" tissue.
Ordinal/interval variables:
– Levels of the explanatory variables are ordered.
– E.g., comparison of gene expression across stages of lesion severity such as "normal", "hyperplasia", "adenoma", and "carcinoma" (categorically ordered), or time-course/dose-response experiments (numerically ordered).

Focus of this talk: nominal explanatory variables.

Types of microarray data
– Independent samples: e.g., comparison of gene expression in independent samples drawn from normal patients versus independent samples from tumor patients.
– Dependent samples: e.g., comparison of gene expression in samples drawn from normal tissue and tumor tissue of the same patient.

Possible questions of interest
– Identify significantly up-/down-regulated genes for a given "condition" relative to another "condition" (adjusted for other covariates).
– Identify genes that discriminate between various "conditions" and predict the "class/condition" of a future observation.
– Cluster genes according to patterns of expression over "conditions".
– Other questions?

Challenges
– Small sample size but a large number of genes.
– Multiple testing: since each microarray has thousands of genes/probes, several thousand hypotheses are tested simultaneously, which inflates the overall Type I error rate.
– Complex dependence structure between genes, and possibly among samples: it is difficult to model and/or account for the underlying dependence structure among genes.

Multiple testing: Type I errors and false discovery rates

The decision table

                 Not rejected H0   Rejected H0   Total
True H0                U                V          m0
True Ha                T                S          m1
Total                  W                R          m

(Only the totals W, R, and m are observable.)

Strong and weak control of Type I error rates
– Strong control: control the Type I error rate under any combination of true H0 and Ha.
– Weak control: control the Type I error rate only when all null hypotheses are true.
– Since we do not know a priori which hypotheses are true, we focus on strong control of the Type I error rate.

Consequences of multiple testing
Suppose we test each hypothesis at the 5% level of significance.
– If n = 10 independent tests are performed, the probability of declaring at least 1 of the 10 tests significant is 1 - 0.95^10 = 0.401.
– If 50,000 independent tests are performed, as in Affymetrix microarray data, you should expect 2,500 false positives!

Types of errors in the context of multiple testing
– Per-Family Error "Rate" (PFER): E(V), the expected number of false rejections of H0.
– Per-Comparison Error Rate (PCER): E(V)/m, the expected proportion of false rejections of H0 among all m hypotheses.
– Family-Wise Error Rate (FWER): P(V > 0), the probability of at least one false rejection of H0 among all m hypotheses.
– False Discovery Rate (FDR): the expected proportion of Type I errors among all rejected hypotheses.
  Benjamini–Hochberg (BH): set V/R = 0 if R = 0, so that
    FDR = E[(V/R) 1{R > 0}] = E(V/R | R > 0) P(R > 0).
  Storey: only interested in the case R > 0.
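The arithmetic on the "consequences" slide is easy to verify directly; a minimal sketch (the function names are mine, not from the talk):

```python
# Probability of at least one false positive among n independent
# tests when every null is true and each test is run at level alpha.
def familywise_false_positive_prob(n, alpha=0.05):
    return 1 - (1 - alpha) ** n

# Expected number of false positives (the PFER) under the same setup.
def expected_false_positives(n, alpha=0.05):
    return n * alpha

print(round(familywise_false_positive_prob(10), 3))   # 0.401
print(expected_false_positives(50_000))               # 2500.0
```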
(Positive FDR) Storey's pFDR:
  pFDR = E(V/R | R > 0).

Some useful inequalities
Since V <= R <= m,
  (V/R) 1{R > 0} >= V/m.                                  (1)
Again, since V <= R and {R > 0} contains {V > 0},
  V 1{R > 0} <= R 1{V > 0},
and thus
  (V/R) 1{R > 0} <= 1{V > 0}.                             (2)
Also
  1{V > 0} <= V.                                          (3)

Combining (1), (2), and (3):
  V/m <= (V/R) 1{R > 0} <= 1{V > 0} <= V.                 (4)
Taking expectations in (4):
  E(V)/m <= E[(V/R) 1{R > 0}] <= P(V > 0) <= E(V).        (5)

Thus we have:
  PCER <= FDR <= FWER <= PFER.                            (6)
Trivially,
  FDR <= pFDR.                                            (7)

Conclusion
– It is conservative to control FWER rather than FDR!
– It is conservative to control pFDR rather than FDR!

Question: is pFDR >= FWER?

Example: suppose m0 = m. Note that m0 = m implies m1 = 0, hence S = 0 and V = R. Then
  FDR = E[(V/R) 1{R > 0}] = E(1{V > 0}) = P(V > 0) = FWER.
But
  pFDR = E(V/R | R > 0) = E(1 | R > 0) = 1.
Hence if m0 = m, then 1 = pFDR >= FDR = FWER.
However, in most applications such as microarrays one expects m0 < m, and in general there is no proof of the statement pFDR >= FWER.

Some popular Type I error controlling procedures
Let P(1) <= P(2) <= ... <= P(m) denote the ordered p-values for the m tests being performed, and let a(1) <= a(2) <= ... <= a(m) denote the ordered levels of significance used for testing the corresponding null hypotheses H0(1), H0(2), ..., H0(m).

Step-down procedure:
Step 1: if P(1) <= a(1), reject H0(1) and go to Step 2; else stop.
Step 2: if P(2) <= a(2), reject H0(2) and go to Step 3; else stop.
Step 3: if P(3) <= a(3), reject H0(3) and go to Step 4; else stop.
...and so on.

Step-up procedure:
Step 1: if P(m) <= a(m), reject H0(i), i = 1, 2, ..., m, and stop; else go to Step 2.
Step 2: if P(m-1) <= a(m-1), reject H0(i), i = 1, 2, ..., m-1, and stop; else go to Step 3.
Step 3: if P(m-2) <= a(m-2), reject H0(i), i = 1, 2, ..., m-2, and stop; else go to Step 4.
...and so on!
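The pointwise chain V/m <= (V/R) 1{R > 0} <= 1{V > 0} <= V can be checked numerically for any admissible outcome of the decision table; a small sketch (the counts V, R, m below are hypothetical):

```python
# Verify the pointwise inequality chain behind PCER <= FDR <= FWER <= PFER
# for one realized decision table with 0 <= V <= R <= m.
def check_chain(V, R, m):
    assert 0 <= V <= R <= m and m > 0
    fdr_term = (V / R) if R > 0 else 0.0    # (V/R) 1{R > 0}
    fwer_term = 1.0 if V > 0 else 0.0       # 1{V > 0}
    return V / m <= fdr_term <= fwer_term <= V

# A few hypothetical outcomes: no rejections, only true rejections,
# a mix, and all rejections false.
for V, R, m in [(0, 0, 10), (0, 3, 10), (2, 5, 10), (5, 5, 10)]:
    assert check_chain(V, R, m)
```

Taking expectations of each term over repeated experiments then yields inequality (5) and hence the ordering PCER <= FDR <= FWER <= PFER.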
Some popular controlling procedures
Single-step procedure: a stepwise procedure with the same critical constant for all m hypotheses:
  a(1) = a(2) = ... = a(m).

Some typical stepwise procedures: FWER controlling procedures
– Bonferroni: a single-step procedure with a_i = a/m.
– Sidak: a single-step procedure with a_i = 1 - (1 - a)^(1/m).
– Holm: a step-down procedure with a_i = a/(m - i + 1).
– Hochberg: a step-up procedure with a_i = a/(m - i + 1).
– minP method: a resampling-based single-step procedure with a_i = c_a, where c_a is the a-quantile of the distribution of the minimum p-value.

Comments on the methods
– Bonferroni: very general but can be too conservative for a large number of hypotheses.
– Sidak: more powerful than Bonferroni, but applicable only when the test statistics are independent or have certain types of positive dependence.
– Holm: more powerful than Bonferroni and applicable under any dependence structure between the test statistics.
– Hochberg: more powerful than Holm's procedure, but the test statistics should be either independent or satisfy the MTP2 property.

Multivariate Total Positivity of Order 2 (MTP2): f(x) is said to be MTP2 if for all x, y in R^p,
  f(x v y) f(x ^ y) >= f(x) f(y),
where v and ^ denote the componentwise maximum and minimum.

Some typical stepwise procedures: FDR controlling procedure
– Benjamini–Hochberg: a step-up procedure with a_i = i a / m.

An illustration
Lobenhofer et al. (2002) data: breast cancer cells exposed to estradiol for 1 hour or for 12, 24, or 36 hours. Number of genes on the cDNA 2-spot array: 1,900. Number of samples per time point: 8. Compare 1 hour with 12, 24, and 36 hours using a two-sided bootstrap t-test.

Some popular methods of analysis

1. Fold change in gene expression
For gene g, compute the fold change between two conditions (e.g., treatment and control):
  f_g = X-bar_trt / X-bar_cont.
With pre-defined constants R1 and R2:
– f_g >= R1: gene g is "up-regulated".
– f_g <= R2: gene g is "down-regulated".
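The Benjamini–Hochberg critical constants a_i = i a / m translate directly into code; a minimal sketch of the step-up procedure (the p-values below are illustrative, not from the Lobenhofer data):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Step-up BH procedure: reject H0(1), ..., H0(k), where k is the
    largest i with P(i) <= i*alpha/m. Returns rejection flags in the
    original order of `pvalues`."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank                      # largest rank passing its cutoff
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k:
            reject[idx] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.6]
print(benjamini_hochberg(pvals, alpha=0.05))  # [True, True, False, False, False]
```

Note the step-up character: once the largest passing rank k is found, all k smallest p-values are rejected, even those that miss their own individual cutoff.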
Fold change in gene expression
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawbacks:
– Ignores variability in mean gene expression.
– Genes with subtle changes in expression can be overlooked, i.e., potentially high false negative rates.
– Conversely, high false positive rates are also possible.

2. t-test type procedures

2.1 Permutation t-test
For each gene g, compute the standard two-sample t-statistic:
  t_g = (X-bar_g,trt - X-bar_g,cont) / (S_g sqrt(1/n_trt + 1/n_cont)),
where X-bar_g,trt and X-bar_g,cont are the sample means and S_g is the pooled sample standard deviation.
Statistical significance of a gene is determined by computing the null distribution of t_g using either a permutation or a bootstrap procedure.
Strengths:
– Simple to implement.
– Biologists find it very easy to interpret.
– It is widely used.
Drawback:
– For some genes the pooled sample standard deviation can be very small, which may result in inflated Type I error rates and inflated false discovery rates.

2.2 SAM procedure (Significance Analysis of Microarrays; Tusher et al., PNAS 2001)
For each gene g, modify the standard two-sample t-statistic as:
  d_g = (X-bar_g,trt - X-bar_g,cont) / (s0 + S_g sqrt(1/n_trt + 1/n_cont)).
The "fudge factor" s0 is chosen so that the coefficient of variation of the test statistic is minimized.

3. F-test and its variations for more than 2 nominal conditions
– Usual F-test; p-values can be obtained by a suitable permutation procedure.
– Regularized F-test: a generalization of the Baldi and Long methodology to multiple groups. It better controls the false discovery rate, with power comparable to the F-test.
– Cui and Churchill (2003) is a good review paper.

4. Linear fixed effects models
Effects:
– Array (A): sample
– Dye (D)
– Variety (V): test groups
– Genes (G)
– Expression (Y)
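The permutation t-test and the SAM statistic differ only by the fudge factor s0 added to the denominator; a minimal sketch (function names and the sample values in the usage are mine, not from the talk):

```python
import math
import random

def pooled_t(trt, cont, s0=0.0):
    """Two-sample t-type statistic with pooled standard deviation.
    With s0 = 0 this is the usual t_g; with s0 > 0 it is the
    SAM-style statistic d_g, whose denominator cannot collapse to
    zero for genes with tiny pooled SD."""
    n1, n2 = len(trt), len(cont)
    m1, m2 = sum(trt) / n1, sum(cont) / n2
    ss = sum((x - m1) ** 2 for x in trt) + sum((x - m2) ** 2 for x in cont)
    sg = math.sqrt(ss / (n1 + n2 - 2))          # pooled SD, S_g
    se = sg * math.sqrt(1 / n1 + 1 / n2)
    return (m1 - m2) / (s0 + se)

def permutation_pvalue(trt, cont, n_perm=1000, seed=0):
    """Two-sided permutation p-value: shuffle the group labels and
    count how often the permuted |t| meets or exceeds the observed |t|."""
    rng = random.Random(seed)
    obs = abs(pooled_t(trt, cont))
    pooled = list(trt) + list(cont)
    n1, hits = len(trt), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled_t(pooled[:n1], pooled[n1:])) >= obs:
            hits += 1
    return hits / n_perm
```

Setting s0 > 0 in `pooled_t` illustrates exactly the drawback noted above for the plain t-test: a near-zero S_g can no longer inflate the statistic.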
Linear fixed effects models (Kerr, Martin, and Churchill, 2000)
Linear fixed effects model:
  log(Y_ijkg) = A_i + D_j + G_g + (AD)_ij + (AG)_ig + (DG)_jg + (VG)_kg + e_ijkg,
  e_ijkg ~ iid N(0, sigma^2).
  H0: (VG)_kg = 0 for all k = 1, 2, ..., v.
All effects are assumed to be fixed effects. Main drawback: all genes have the same variance!

5. Linear mixed effects models (Wolfinger et al., 2001)
Stage 1 (global normalization model):
  log(Y_gij) = T_i + A_j + (TA)_ij + r_gij.
Stage 2 (gene-specific model, fitted to the Stage 1 residuals):
  r-hat_gij = G_g + (GT)_gi + (GA)_gj + e_gij.
Assumptions:
  A_j ~ iid N(0, sigma_A^2),  (TA)_ij ~ iid N(0, sigma_TA^2),
  (GA)_gj ~ iid N(0, sigma_GA,g^2),  e_gij ~ iid N(0, sigma_g^2).
Inference is performed on the interaction term (GT)_gi.

A popular graphical representation: the volcano plot
– A scatter plot of -log10(p-value) versus log2(fold change).
– Genes with large fold change lie outside a pair of vertical "threshold" lines.
– Genes that are both highly significant and have large fold change lie in the upper right or upper left corner.

A useful review article
– Cui, X. and Churchill, G. (2003), Genome Biology.

Software
– R package: statistics for microarray analysis. http://www.stat.berkeley.edu/users/terry/zarray/Software/smacode.html
– SAM: Significance Analysis of Microarrays. http://www-stat.stanford.edu/%7Etibs/SAM

Supervised classification algorithms

A. Linear and quadratic discriminant analysis based methods
Strength:
– Well studied in the classical statistics literature.
Limitations:
– Based on normality.
– Imposes constraints on the covariance matrices; one also needs to be concerned about singularity issues.
– No convenient strategy has been proposed in the literature to select the "best" discriminating subset of genes.

B. Nonparametric classification using a genetic algorithm and K-nearest neighbors – Li et al.
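The volcano-plot coordinates are a simple transformation of each gene's p-value and fold change; a minimal sketch (the cutoffs and the example values are hypothetical):

```python
import math

def volcano_coords(p_value, fold_change):
    """Return (x, y) = (log2 fold change, -log10 p-value)."""
    return math.log2(fold_change), -math.log10(p_value)

def flag_gene(p_value, fold_change, p_cut=0.001, fc_cut=2.0):
    """A gene lands in an upper corner of the volcano plot when it is
    both highly significant (above the horizontal p-value threshold)
    and strongly regulated (outside the vertical fold-change lines)."""
    x, y = volcano_coords(p_value, fold_change)
    if y >= -math.log10(p_cut) and abs(x) >= math.log2(fc_cut):
        return "up" if x > 0 else "down"
    return "not significant"

print(flag_gene(1e-5, 4.0))   # up
print(flag_gene(1e-5, 0.2))   # down
print(flag_gene(0.2, 4.0))    # not significant
```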
(Bioinformatics, 2001)
Strengths:
– Entirely nonparametric.
– Takes into account the underlying dependence structure among genes.
– Does not require the estimation of a covariance matrix.
Weakness:
– Computationally very intensive.

GA/KNN methodology: a very brief description
– Compute the Euclidean distance between all pairs of samples based on a sub-vector of, say, 50 genes.
– Classify each sample into a treatment group (i.e., condition) based on its K nearest neighbors.
– Compute a fitness score for each subset of genes based on how many samples are correctly classified. This is the objective function.
– The objective function is optimized using a genetic algorithm.

(Figures: K-nearest neighbors classification with k = 3, and subcategories within a class; axes show expression levels of genes.)

Advantages of the KNN approach
– Simple; performs as well as or better than more complex methods.
– Free from assumptions such as normality of the distribution of expression levels.
– Multivariate: takes account of dependence in expression levels.
– Accommodates, or even identifies, distinct subtypes within a class.

Expression data: many genes and few samples
– There may be many subsets of genes that can statistically discriminate between the treated and the untreated.
– There are too many possible subsets to examine: with 3,000 genes, there are about 10^72 ways to make subsets of size 30.

The genetic algorithm
– A computer algorithm (John Holland) that works by mimicking Darwin's natural selection.
– Has been applied to many optimization problems, ranging from engine design to protein folding and sequence alignment.
– Effective in searching high-dimensional spaces.

GA works by mimicking evolution
– Randomly select sets ("chromosomes") of 30 genes from all the genes on the chip.
– Evaluate the "fitness" of each "chromosome": how well can it separate the treated from the untreated?
– Pass "chromosomes" randomly to the next generation, with preference for the fittest.

Summary
– Pay attention to the multiple testing problem.
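The K-nearest-neighbors vote at the core of GA/KNN can be sketched in a few lines (the toy two-gene expression profiles below are hypothetical):

```python
import math
from collections import Counter

def knn_classify(query, samples, labels, k=3):
    """Classify `query` by majority vote among its k nearest
    (Euclidean-distance) neighbors in `samples`."""
    nearest = sorted(
        range(len(samples)),
        key=lambda i: math.dist(query, samples[i]),
    )
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy expression profiles over two genes: three "control" samples
# near the origin, three "treated" samples near (2, 2).
samples = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
           (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
labels = ["control", "control", "control",
          "treated", "treated", "treated"]
print(knn_classify((0.0, 0.1), samples, labels))  # control
print(knn_classify((2.1, 2.0), samples, labels))  # treated
```

In GA/KNN, the genetic algorithm's fitness score for a candidate gene subset is simply the number of samples this vote classifies correctly.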
– Use FDR rather than FWER for large data sets such as gene expression microarrays.
– Linear mixed effects models may be used for comparing expression data between groups.
– For classification problems, one may want to consider the GA/KNN approach.