Download EXPERIMENTAL DESIGN is - Universitat de Barcelona

Some Statistical Issues in Microarray Data Analysis
Alex Sánchez
Estadística i Bioinformàtica, Departament d'Estadística, Universitat de Barcelona
Unitat d'Estadística i Bioinformàtica, IR-HUVH

Outline
- Introduction
- Experimental design
- Selecting differentially expressed genes
  - Statistical tests
  - Significance testing
  - Linear models and analysis of variance
  - Multiple testing
- Software for microarray data analysis

Introduction

Microarray experiments: overview
[Figure]

Why are we talking about statistics?
- A microarray experiment is, as the name says, an experiment. That is:
  - It is performed to determine whether some prior hypotheses are true or false (although it can also lead to new hypotheses)
  - It is subject to errors, which may arise from many sources

Sources of variability (Geschwind, Nature Reviews Neuroscience, 2001)
- Biological heterogeneity in the population
- Specimen collection/handling effects
  - Tumor: surgical biopsy, FNA
  - Cell line: culture condition, confluence level
- Biological heterogeneity in the specimen
- RNA extraction
- RNA amplification
- Fluor labeling
- Hybridization
- Scanning
  - PMT voltage
  - Laser power

Categories of variability
- Systematic variability
  - Amount of RNA in the biopsy
  - Efficiencies of lab procedures such as RNA extraction, reverse transcription, labeling or photodetection
- Random variation
  - PCR yield
  - DNA quality
  - Spotting efficiency, spot size
  - Cross-/unspecific hybridization
  - Stray signal

Dealing with systematic variability
- Systematic variability has similar effects on many measurements
- Corrections can be estimated from the data
- CALIBRATION or NORMALIZATION is the general name for processes that correct for systematic variability

Dealing with random variation
- Random variation cannot be explicitly accounted for
- The usual way to deal with it is to assume some ERROR MODEL (e.g. $e_i \sim N(0, \sigma^2)$)
- Assuming these error models are true:
  - EXPERIMENTAL DESIGN is (must be) used to control the action of random variation
  - STATISTICAL INFERENCE is (must be) used to draw conclusions in the presence of random variation

[Figure: workflow from biological question through experimental design, microarray experiment, image analysis, normalization and quality measurement (pass/failed) to analysis (estimation, testing, clustering, discrimination) and biological verification and interpretation; a "Today" label marks part of the workflow]

Experimental design

Why experimental design?
- The objective of experimental design is to make the analysis of the data and the interpretation of the results
  - as simple and as powerful as possible,
  - given the purpose of the experiment
  - and the constraints of the experimental material

Scientific aims and design choice
- The primary focus of the experiment needs to be clearly stated, whether it is
  - to identify differentially expressed genes
  - to search for specific gene-expression patterns
  - to identify phenotypic subclasses
- The aim of the experiment guides the design choice
  - Sometimes only one choice is reasonable
  - Sometimes different options are available

Designing microarray experiments
- The appropriate design of a microarray experiment must consider
  - Design of the array
  - Allocation of mRNA samples to the slides

I: Layout of the array
- Which sequences to use
  - cDNAs: selection of cDNA from a library (Riken, NIA, etc.)
  - Affymetrix: PM and MM probes
  - Oligo probe selection (from Operon, Agilent, etc.)
- Controls
  - What percentage?
  - Where should controls be placed?
- How many sequences to use
  - Should there be replicate spots within a slide?

II: Allocating samples to slides
- Types of samples
  - Replication: technical vs biological
  - Pooled vs individual samples
- Different design layout / data analysis:
  - Scientific aim of the experiment
  - Efficiency, robustness, extensibility
- Physical limitations (cost):
  - Number of slides
  - Amount of material

Basic principles of experimental design
- Apply the following principles to best attain the objectives of experimental design
  - Replication
  - Local control or blocking
  - Randomization

1. Replication
- It is important
  - To reduce uncertainty (increase precision): $\operatorname{var}(\bar{X}) = \sigma_X^2 / n$
  - To obtain sufficient power for the tests
  - As a formal basis for inferential procedures
- Consider different types of replicates
  - Technical: duplicate spots, multiple hybridizations from the same sample
  - Biological
- Replicate most at the level that is expected to vary most!

Biological vs technical replicates
[Figure: variance components $\sigma_B^2$, $\sigma_A^2$, $\sigma_e^2$. © Nature Reviews & G. Churchill (2002)]

Replication vs pooling
- mRNA from different samples is often combined to form a "pooled sample" or pool. Why?
  - If each sample doesn't yield enough mRNA
  - To compensate for an excess of variability
  - ?
- Statisticians tend not to like it, but pooling may be OK if properly done
  - Combine several samples in each pool
  - Use several pools from different samples
  - Do not use pools when individual information is important (e.g. paired designs)

2. Blocking
- Assume we wish to perform an experiment to compare two treatments
- The samples or their processing may not be homogeneous: there are blocks
  - Subjects: male/female
  - Arrays produced in two lots (February, March)
- If there are systematic differences between blocks, the effects of interest (e.g. treatment) may be confounded
  - Are the observed differences attributable to the treatment effect or to confounding factors?
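
The confounding question raised in the blocking slide can be checked mechanically before any array is hybridized. The following is a minimal sketch, not part of the original slides: the helper names (`crosstab`, `fully_confounded`, `balanced`) and the eight-sample layouts are assumptions made for illustration. It cross-tabulates treatment against a blocking factor and reports whether the two are inseparable or balanced.

```python
from collections import Counter

def crosstab(treatment, block):
    """Count samples for each (treatment, block) combination."""
    return Counter(zip(treatment, block))

def fully_confounded(treatment, block):
    """True when every block level holds a single treatment level,
    so block and treatment effects cannot be separated."""
    levels_per_block = {}
    for t, b in zip(treatment, block):
        levels_per_block.setdefault(b, set()).add(t)
    return all(len(ts) == 1 for ts in levels_per_block.values())

def balanced(treatment, block):
    """True when every treatment appears equally often in every block."""
    counts = crosstab(treatment, block)
    cells = [counts.get((t, b), 0)
             for t in sorted(set(treatment))
             for b in sorted(set(block))]
    return len(set(cells)) == 1 and cells[0] > 0

# Hypothetical layouts for 8 samples, 4 per treatment (illustration only).
treatment = ["A"] * 4 + ["B"] * 4
sex_bad = ["Male"] * 4 + ["Female"] * 4   # all A are male, all B are female
sex_good = ["Male", "Female"] * 4         # both sexes in both treatments

print("bad design confounded:", fully_confounded(treatment, sex_bad))
print("good design balanced: ", balanced(treatment, sex_good))
print(dict(crosstab(treatment, sex_good)))
```

The same check can be repeated for each blocking factor (sex, batch, array lot); a design that passes `balanced` for all of them lets the analysis separate block effects from the treatment effect.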

Confounding block with treatment effects

Awful design
Sample  Treatment  Sex     Batch
1       A          Male    1
2       A          Male    1
3       A          Male    1
4       A          Male    1
5       B          Female  2
6       B          Female  2
7       B          Female  2
8       B          Female  2

Balanced design
Sample  Treatment  Sex     Batch
1       A          Male    1
2       A          Female  2
3       A          Male    1
4       A          Female  2
5       B          Male    1
6       B          Female  2
7       B          Male    1
8       B          Female  2

Two alternative designs to investigate treatment effects
- Awful design: treatment effects are confounded with sex and batch effects
- Balanced design: treatments are balanced between blocks
  - The influence of blocks is automatically compensated
  - Statistical analysis may separate block from treatment effect

3. Randomisation
- Randomly assign samples to groups to eliminate unspecific disturbances
  - Randomly assign individuals to treatments
  - Randomise the order in which experiments are performed
- Randomisation is required to ensure the validity of statistical procedures
- Block what you can and randomise what you cannot

Experimental layout
- How are mRNA samples assigned to arrays?
- The experimental layout has to be chosen so that the resulting analysis is as efficient and robust as possible
  - Sometimes there is only one reasonable choice
  - Sometimes several choices are available

Example 1: Only one design choice
- Case 1: Meaningful biological control (C)
  - Samples: liver tissue from 4 mice treated with cholesterol-modifying drugs
  - Question 1: genes that respond differently between the treatments (T) and the control (C)
  - Question 2: genes that respond similarly across two or more treatments relative to control
- Case 2: Use of a universal reference
  - Samples: different tumor samples
  - Question: to discover tumor subtypes
[Figure: design diagrams with nodes T1, T2, T3, T4 and C (Case 1) and T1, T2, ..., Tn-1, Tn and Ref (Case 2)]

Example 2: A number of different designs are suitable for use (1)
- Time course experiments
- Design choice depends on the comparisons of interest
[Figure: four alternative design diagrams for time points T1, T2, T3, T4, one of them using a reference (Ref)]

How can we decide?
- A-optimality: choose the design which minimizes the variance of the estimates of the effects of interest
- A simple example: direct vs indirect estimates (two slides in each case, each log ratio having variance $\sigma^2$)
  - Direct (A hybridized with B): average(log(A/B)), variance $\sigma^2/2$
  - Indirect (A and B each hybridized with a reference R): log(A/R) - log(B/R), variance $2\sigma^2$

Summary
- Selection of mRNA samples is important
  - Most important: biological replicates
  - Technical replicates are also useful, but different
  - If needed and possible, use pooling wisely
- Choice of experimental layout is guided by
  - The scientific question
  - Experimental design principles
  - Efficiency and robustness considerations
- The correspondence between experimental designs, linear models and ANOVA can be exploited to select a model and analyze the data

Experimental design, linear models and analysis of variance
- In experimental design the different sources of variability influencing the observed response may be identified
- These sources can be related to the response using a linear model
- Analysis of variance can be used to separately estimate and test the relative importance of each source of variability

Statistical methods to detect differentially expressed genes

Class comparison: identifying differentially expressed genes
- Identify genes differentially expressed between different conditions such as
  - Treatment, cell type, ... (qualitative covariates)
  - Dose, time, ... (quantitative covariates)
  - Survival, infection time, ...
- Estimate effects/differences between groups, probably using log-ratios, i.e. the difference on the log scale: log(X) - log(Y) = log(X/Y)

What is a "significant change"?
- It depends on the variability within groups, which may differ from gene to gene
- To assess the statistical significance of differences, conduct a statistical test for each gene

Different settings for statistical tests
- Indirect comparisons: two groups, two samples, unpaired
  - E.g. 10 individuals: 5 with diabetes, 5 healthy
  - One sample from each individual
  - Typically: two-sample t-test or similar
- Direct comparisons: two groups, two samples, paired
  - E.g. 6 individuals with brain stroke
  - Two samples from each: one from the healthy region (region 1) and one from the affected region (region 2)
  - Typically: one-sample t-test (also called paired t-test) or similar, based on the individual differences between conditions

Different ways to do the experiment
- An experiment may use cDNA arrays ("two-colour") or Affymetrix arrays ("one-colour")
- Depending on the technology used, the allocation of conditions to slides changes

Experiment                        cDNA (2-col)                                     Affy (1-col)
10 indiv.: Diab (5), Heal (5)     Reference design: (5) Diab/Ref, (5) Heal/Ref     Comparison design: (5) Diab vs (5) Heal
6 indiv.: Region 1, Region 2      6 slides, 1 individual per slide: (6) reg1/reg2  12 slides: (6) paired differences

"Natural" measures of discrepancy
For direct comparisons in two-colour or paired one-colour experiments:
- Mean (log) ratio: $\bar{R} = \frac{1}{n_T} \sum_{i=1}^{n_T} R_i$ (R or M used interchangeably)
- Classical t-test: $t = \bar{R} / SE$ (SE estimates the standard error of $\bar{R}$)
- Robust t-test: use robust estimates of location and scale
For indirect comparisons in two-colour or direct comparisons in one-colour experiments:
- Mean difference: $\bar{T} - \bar{C} = \frac{1}{n_T} \sum_{i=1}^{n_T} T_i - \frac{1}{n_C} \sum_{i=1}^{n_C} C_i$
- Classical t-test: $t = (\bar{T} - \bar{C}) / (s_p \sqrt{1/n_T + 1/n_C})$
- Robust t-test: use robust estimates of location and scale

Some issues
- Can we trust average effect sizes (average difference of means) alone?
- Can we trust the t statistic alone?
- Here is evidence that the answer is no.

Gene  M1    M2    M3     M4    M5    M6    Mean   SD    t
A     2.5   2.7   2.5    2.8   3.2   2     2.61   0.40  16.10
B     0.01  0.05  -0.05  0.01  0     0     0.003  0.03  0.25
C     2.5   2.7   2.5    1.8   20    1     5.08   7.34  1.69
D     0.5   0     0.2    0.1   -0.3  0.3   0.13   0.27  1.19
E     0.1   0.11  0.1    0.1   0.11  0.09  0.10   0.01  33.09

- Averages can be driven by outliers (gene C).
- t statistics can be driven by tiny variances (gene E).
Courtesy of Y.H. Yang

Variations in t-tests (1)
- Let
  - $R_g$: mean observed log ratio for gene g
  - $SE_g$: standard error of $R_g$ estimated from data on gene g
  - $SE$: standard error of $R_g$ estimated from data across all genes
- Global t-test: $t = R_g / SE$
- Gene-specific t-test: $t = R_g / SE_g$

Some pros and cons of t-tests

Test                                   Pros                               Cons
Global t-test, t = R_g/SE              Yields a stable variance estimate  Assumes variance homogeneity; biased if false
Gene-specific t-test, t = R_g/SE_g     Robust to variance heterogeneity   Low power; yields unstable variance estimates (due to few data)

T-test extensions
- SAM (Tibshirani, 2001): $S = \frac{R_g}{c + SE_g}$
- Regularized t (Baldi, 2001): $t = \frac{R_g}{\sqrt{\frac{v_0 SE^2 + (n-1) SE_g^2}{v_0 + n - 2}}}$
- EB-moderated t (Smyth, 2003): $t = \frac{R_g}{\sqrt{\frac{d_0 SE_0^2 + d \, SE_g^2}{d_0 + d}}}$

Up to here: can we generate a list of candidate genes?
With the tools we have, the reasonable steps to generate a list of candidate genes may be:
1. Data
   Gene 1: M11, M12, ..., M1k
   Gene 2: M21, M22, ..., M2k
   ...
   Gene G: MG1, MG2, ..., MGk
2. For every gene, calculate Si = t(Mi1, Mi2, ..., Mik), e.g. t-statistics, S, B, ...
3. Statistics of interest: S1, S2, ..., SG
4. ? -> A list of candidate DE genes
- We need an idea of how significant these values are
- We'd like to assign them p-values

Significance testing

Nominal p-values
- After a test statistic is computed, it is convenient to convert it to a p-value: the probability that the test statistic, say S(X), takes a value equal to or greater than the one observed on the sample, say S(X0), under the assumption that the null hypothesis is true
  $p = P\{S(X) \ge S(X_0) \mid H_0 \text{ true}\}$

Significance testing
- Test of significance at the $\alpha$ level: reject the null hypothesis if the p-value is smaller than the significance level
  - It has advantages but is not free from criticism
- Genes with p-values falling below a prescribed level may be regarded as significant

Hypothesis testing overview for a single gene

State of nature ("truth")    H0 is rejected (gene is selected)                     H0 is accepted (gene not selected)
H0 is false (affected)       TP, prob: 1 - $\beta$                                 FN, prob: $\beta$ (type II error)
H0 is true (not affected)    FP, P[Rej H0 | H0] <= $\alpha$ (type I error)         TN, prob: >= 1 - $\alpha$

- Positive predictive value: TP/[TP+FP]
- Negative predictive value: TN/[TN+FN]
- Sensitivity: TP/[TP+FN]
- Specificity: TN/[TN+FP]

Calculation of p-values
- Standard methods for calculating p-values:
  (i) Refer to a statistical distribution table (Normal, t, F, ...), or
  (ii) Perform a permutation analysis

(i) Tabulated p-values
- Tabulated p-values can be obtained for standard test statistics (e.g. the t-test)
- They often rely on the assumption of normally distributed errors in the data
- This assumption can be checked (approximately) using a
  - Histogram
  - Q-Q plot

Example
- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- A t-test yields 1045 genes with p < 0.05

(ii) Permutation tests
- Based on data shuffling.
- No distributional assumptions
- Random interchange of labels between samples
- Estimate p-values for each comparison (gene) by using the permutation distribution of the t-statistics
- Repeat for every possible permutation, b = 1, ..., B:
  - Permute the n data points for the gene (x). The first n1 are referred to as "treatments", the remaining n2 as "controls"
  - For each gene, calculate the corresponding two-sample t-statistic, t_b
- After all B permutations are done, put p = #{b: |t_b| >= |t_observed|}/B

Permutation tests (2)
[Figure]

Volcano plot: fold change vs log(odds)
[Figure: volcano plot with regions labelled "Significant change detected" and "No change detected"]

Linear models and analysis of variance to analyze designed experiments

From experimental design to linear models
- Some weaknesses of the statistical framework presented so far
  - What to do if the treatment has more than 2 levels?
  - How to deal with more than one treatment or experimental condition?
  - How to deal with nuisance factors such as batch effects, covariates, etc.?
- Most of this can be solved with an alternative approach: analysis of variance

Multiple testing

How far can we trust the decision?
- The test "Reject H0 if p-value <= $\alpha$" is said to control the type I error because, under a certain set of assumptions, the probability of falsely rejecting H0 is less than a fixed small threshold
- Nothing is guaranteed about P[FN]
  - "Optimal" tests are built trying to minimize this probability
  - In practical situations it is often high

What if we wish to test more than one gene at once? (1)
- Consider more than one test at once (independent tests, all null hypotheses true)
  - Two tests, each at the 5% level: the probability of getting at least one false positive is 1 - 0.95 x 0.95 = 0.0975
  - Three tests: 1 - 0.95^3 = 0.1426
  - n tests: 1 - 0.95^n
  - This converges towards 1 as n increases
- Small p-values don't necessarily imply significance!
- We are not controlling the probability of type I error anymore

What if we wish to test more than one gene at once? (2): a simulation
- Simulation of this process for 6,000 genes with 8 treatments and 8 controls
- All the gene expression values were simulated i.i.d. from a N(0,1) distribution, i.e. NOTHING is differentially expressed in our simulation
- The number of genes falsely rejected will be on average 6000 · $\alpha$, i.e. if we wanted to reject all genes with a p-value of less than 1% we would falsely reject around 60 genes
- See example

Multiple testing: counting errors

State of nature ("truth")    H0 is rejected (genes selected)    H0 is accepted (genes not selected)              Total
H0 is false (affected)       $m_\alpha - \alpha m_0$ (S)        $(m - m_0) - (m_\alpha - \alpha m_0)$ (T)        $m - m_0$
H0 is true (not affected)    $\alpha m_0$ (V)                   $m_0 - \alpha m_0$ (U)                           $m_0$
Total                        $m_\alpha$ (R)                     $m - m_\alpha$ (m - R)                           $m$

- V = # type I errors [false positives]
- T = # type II errors [false negatives]
- All these quantities could be known if $m_0$ were known

How does type I error control extend to multiple testing situations?
- Selecting genes with a p-value less than $\alpha$ doesn't control P[FP] anymore
- What can be done?
  - Extend the idea of type I error: FWER and FDR are two such extensions
  - Look for procedures that control the probability of these extended error types
  - Mainly: adjust raw p-values

Two main error rate extensions
- Family-Wise Error Rate (FWER)
  - FWER is the probability of at least one false positive: FWER = Pr(# of false discoveries > 0) = Pr(V > 0)
- False Discovery Rate (FDR)
  - FDR is the expected proportion of false positives among the rejected null hypotheses: FDR = E[V/R; R>0] = E[V/R | R>0] · P[R>0]

FDR and FWER controlling procedures
- FWER
  - Bonferroni (adjusted p-value = min{n x p-value, 1})
  - Holm (1979)
  - Hochberg (1988)
  - Westfall & Young (1993) maxT and minP
- FDR
  - Benjamini & Hochberg (1995)
  - Benjamini & Yekutieli (2001)

Difference between controlling FWER or FDR
- FWER: controls the probability of having any (one or more) false positive
  - Gives many fewer genes (false positives),
  - but you are likely to miss many
  - Adequate if the goal is to identify a few genes that differ between two groups
- FDR: controls the proportion of false positives
  - If you can tolerate more false positives
  - you will get many fewer false negatives
  - Adequate if the goal is to pursue the study, e.g. to determine functional relationships among genes

Steps to generate a list of candidate genes revisited (2)
1. Data
   Gene 1: M11, M12, ..., M1k
   Gene 2: M21, M22, ..., M2k
   ...
   Gene G: MG1, MG2, ..., MGk
2. For every gene, calculate Si = t(Mi1, Mi2, ..., Mik), e.g. t-statistics, S, B, ...
3. Statistics of interest: S1, S2, ..., SG
4. Assumption on the null distribution (data normality) -> nominal p-values P1, P2, ..., PG
5. Adjusted p-values aP1, aP2, ..., aPG
6. Select genes with adjusted p-values smaller than $\alpha$ -> a list of candidate DE genes

Example
- Golub data, 27 ALL vs 11 AML samples, 3051 genes
- Bonferroni adjustment: 98 genes with p_adj < 0.05 (p_raw < 0.000016)

Extensions
- Some issues we have not dealt with
  - Replicates within and between slides
  - Several effects: use a linear model
  - ANOVA: are the effects equal?
  - Time series: selecting genes for trends
- Different solutions have been suggested for each problem
- Still many open questions

Examples

Example 1: Swirl zebrafish experiment
- Swirl is a point mutation causing defects in the organization of the developing embryo along its ventral-dorsal axis
- As a result some cell types are reduced and others are expanded
- A goal of this experiment was to identify genes with altered expression in the swirl mutant compared to the wild-type zebrafish

Example 1: Experimental design
- Each microarray contained 8848 cDNA probes (either genes or EST sequences)
- 4 replicate slides: 2 sets of dye-swap pairs
- For each pair, target cDNA of the swirl mutant was labeled using one of Cy5 or Cy3 and the target cDNA of the wild-type zebrafish was labeled using the other dye
[Figure: design diagram, wild type vs swirl, 2 slides in each dye orientation]

Example 1: Data analysis
- Gene expression data on 8848 genes for 4 samples (slides), each hybridized with mutant and wild type
- On a gene-by-gene basis this is a one-sample problem
- Hypothesis to be tested for each gene: $H_0: E[\log_2(R/G)] = 0$
- The decision will be based on average log-ratios

Example 2: Scavenger receptor BI (SR-BI) experiment
- Callow et al. (2000). A study of lipid metabolism and atherosclerosis susceptibility in mice.
- Transgenic mice with the SR-BI gene overexpressed have low HDL cholesterol levels.
- Goal: to identify genes with altered expression in the livers of transgenic mice overexpressing SR-BI (T) compared to "normal" FVB control mice (C).

Example 2: Experimental design
- 8 treatment mice (Ti) and 8 control mice (Ci)
- 16 hybridizations: liver mRNA from each of the 16 mice (Ti, Ci) is labelled with Cy5, while pooled liver mRNA from the control mice (C*) is labelled with Cy3
- Probes: ~6,000 cDNAs (genes), including 200 related to pathogenicity
[Figure: design diagram, 8 T and 8 C samples, each hybridized against the pooled control C*]

Example 2: Data analysis
- Gene expression data on 6348 genes for 16 samples: 8 for treatment (log(T/C*)) and 8 for control (log(C/C*))
- On a gene-by-gene basis this is a two-sample problem
- Hypothesis to be tested for each gene: $H_0: E[\log(R_1/G) - \log(R_2/G)] = 0$
- The decision will be based on the average difference of log ratios

Software for microarray data analysis

Introduction
- Microarray experiments generate huge quantities of data which have to be stored, managed, visualized, processed, ...
- Many options are available. However...
  - No tool satisfies all users' needs
- Trade-off. A tool must be
  - Powerful but user friendly
  - Complete but without too many options
  - Flexible but easy to start with and go further
  - Available, up to date, well documented but affordable

So, what you need is "R"?
- R is an open-source system for statistical computation and graphics. It consists of
  - A language
  - A run-time environment with graphics, a debugger, and access to certain system functions
- It can be used
  - Interactively, through a command language
  - Or by running programs stored in script files
- http://www.r-project.org/

Some pros and cons
Pros:
- Powerful, used by statisticians
- Easy to extend
  - Creating add-on packages
  - Many already available
- Freely available
- Unix, Windows and Mac
- Lots of documentation
Cons:
- Not very easy to learn
- Command-based
- Documentation sometimes cryptic
- Memory intensive
  - Worse on Windows
- Slow at times
We believe it is worth the effort!
- If you "just want to do statistical analysis": it is easy to find alternatives
- If you intend to do microarray data analysis: probably one of the best options

R and microarrays
- R is a popular tool among statisticians
- Once they started to work with microarrays they continued using it
  - To perform the analysis
  - To implement new tools
- This very quickly gave rise to lots of free R-based software to analyze microarrays
- The Bioconductor project groups many of these developments (but not all)

The Bioconductor project
- Open source and open development software project for the analysis and comprehension of genomic data
- Most early developments as R packages
- Extensive documentation and training material from short courses: http://www.bioconductor.org/workshop.html
- Has reached some stability but is still evolving: what is now a standard may not be so in the future

There's much more than R!
- Have a look at "My microarray software comparison": http://ihome.cuhk.edu.hk/~b400559/arraysoft.html
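
The multiple-testing arithmetic from the slides (the 1 - 0.95^n family-wise error, the Bonferroni adjustment min{n x p, 1}, and the roughly 60 false rejections expected among 6,000 null genes at the 1% level) can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the analysis used in the talk: the function names are hypothetical, the Benjamini & Hochberg step-up adjustment is written from its standard definition, and the null p-values are drawn directly from a uniform distribution (which is what p-values follow when nothing is differentially expressed) instead of running a t-test on simulated N(0,1) data.

```python
import random

def fwer(alpha, n):
    """Probability of at least one false positive in n independent
    tests at level alpha when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n

def bonferroni(pvals):
    """Bonferroni adjusted p-values: min(n * p, 1)."""
    n = len(pvals)
    return [min(n * p, 1.0) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini & Hochberg (1995) step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

print(fwer(0.05, 2))    # two tests at the 5% level, 0.0975 in the slides
print(fwer(0.05, 100))  # already close to 1

# Assumption: under the null hypothesis p-values are uniform on (0, 1),
# so 6000 null genes are simulated by 6000 uniform draws.
random.seed(1)
null_pvals = [random.random() for _ in range(6000)]
n_raw = sum(p < 0.01 for p in null_pvals)
n_bonf = sum(p < 0.01 for p in bonferroni(null_pvals))
n_bh = sum(p < 0.01 for p in benjamini_hochberg(null_pvals))
print("raw p < 0.01:", n_raw, "| Bonferroni:", n_bonf, "| BH:", n_bh)
```

Selecting on raw p-values should flag on the order of 60 genes even though none is truly affected, whereas either adjustment removes nearly all of them, which is the point of the simulation slide.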
