ODP and SVA
European Institute of Statistical Genetics, Liege, Belgium
September 4, 2007
Greg Gibson

What's the matter with t-tests?
1. SAM and ANOVA assume that all tests are independent, but they are not.
2. Some within-sample variances are underestimated, which artificially inflates test statistics; others are overestimated, which reduces power.
3. They fail to optimize the ETP (expected number of true positives) for a given FDR.

Optimal Discovery Procedure
Storey, Dai and Leek (2007) Biostatistics 8: 414-432
• The ODP is defined as the testing procedure that maximizes the ETP for each fixed EFP (expected number of false positives) level.
• A consequence of this optimality is that the rate of "missed discoveries" is minimized for each FDR level.
• Neyman–Pearson lemma: given a single set of observed data x, the optimal single-testing procedure is based on the statistic S_NP(x) = g(x) / f(x), the ratio of the alternative density g to the null density f.
• The ODP is similar, but considers the data for a single feature evaluated at all true probability density functions: S_ODP(x) = (sum of the true alternative densities at x) / (sum of the true null densities at x). When all true nulls share a common density, the denominator is proportional to that single null density.

ODP Principle
Fig. 1. Plots comparing the NP testing approach to the ODP testing approach through a simple example.
(a) NP approach. The null (gray) and alternative (black) probability density functions of a single test. For observed data x and y, the statistics are calculated by taking the ratio of the alternative to the null densities at each respective point. In this NP approach, the test with data y is more significant than the test with data x.
(b) ODP approach. The common null density (gray) for true null tests and the alternative densities (black) for several true alternative tests. For observed data x and y, the statistics are calculated by taking the ratio of the sum of alternative densities to the null density evaluated at each respective point. In this ODP approach, the test with data x is now more significant than the test with data y, because multiple alternative densities have similar positive means even though each one is smaller in magnitude than the single alternative density with negative mean.
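The reversal described in Fig. 1 can be reproduced with a small numerical sketch. Everything specific here is an assumption chosen for illustration and not taken from the paper: unit-variance normal densities, a null mean of 0, one alternative with mean -3, three alternatives with means 1.5, 2.0 and 2.5, and observations x = 2 and y = -2. The function names are hypothetical.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def np_statistic(x, alt_mean):
    """Neyman-Pearson statistic: one alternative density over the null density."""
    return normal_pdf(x, alt_mean) / normal_pdf(x, 0.0)

def odp_statistic(x, alt_means):
    """ODP-style statistic with a common null: sum of alternative densities over the null density."""
    return sum(normal_pdf(x, m) for m in alt_means) / normal_pdf(x, 0.0)

# Hypothetical setup: one strong negative alternative, several weaker positive ones.
ALT_MEANS = [-3.0, 1.5, 2.0, 2.5]
x_obs, y_obs = 2.0, -2.0

# Single-test view, using only the negative-mean alternative: y looks more significant.
print("NP  x:", np_statistic(x_obs, -3.0), " y:", np_statistic(y_obs, -3.0))

# ODP view, borrowing strength across all alternatives: x looks more significant.
print("ODP x:", odp_statistic(x_obs, ALT_MEANS), " y:", odp_statistic(y_obs, ALT_MEANS))
```

In practice the true densities are unknown and must be estimated, so this sketch only illustrates the form of the two statistics, not the estimated procedure.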
ODP Performance: BRCA data
A comparison of the ODP approach to five leading methods for identifying differentially expressed genes (described in the text). The number of genes found to be significant by each method over a range of estimated q-value cutoffs is shown. The methods involved in the comparison are the proposed ODP, SAM, the traditional t-test/F-test, a shrunken t-test/F-test, a nonparametric empirical Bayes "local FDR" method, and a model-based empirical Bayes method. A color version of the figure is given in the supplementary material available at Biostatistics online, Figure 9.
(a) Results for identifying differential expression between the BRCA1 and BRCA2 groups in the Hedenfalk and others data.
(b) Results for identifying differential expression between the BRCA1, BRCA2, and Sporadic groups in the Hedenfalk and others data. The model-based empirical Bayes method has not been detailed for a three-sample analysis, so it is omitted in this panel.

ODP Table 1
                                                    % increase by ODP, 2-sample    % increase by ODP, 3-sample
Thresholding method                                 Minimum   Median   Maximum     Minimum   Median   Maximum
SAM (Tusher et al., 2001)                           29        43       72          76        92       211
t/F-test (Dudoit et al., 2002; Kerr et al., 2000)   52        86       185         63        82       407
Shrunken t/F-test (Cui and others, 2005)            34        52       77          61        69       154
Bayesian local FDR (Efron and others, 2001)         58        87       117         76        92       211
Posterior probability (Lonnstedt & Speed, 2002)     44        60       113         n/a       n/a      n/a

Table 1. Improvements of the ODP approach over existing thresholding methods. Shown are the minimum, median, and maximum percentage increases in the number of genes called significant by the proposed ODP approach relative to the existing approaches among FDR levels 2%, 3%, ..., 10%. The exact same FDR methodology (Storey, 2002; Storey and Tibshirani, 2003) was applied to each gene-ranking method in order to make the comparisons fair. The model-based Bayesian method (Lonnstedt and Speed, 2002) is not defined for a three-sample analysis, so that case is omitted.

ODP algorithm
1. Estimate the true null hypotheses from the distribution of P-values from Kruskal-Wallis (KW) rank tests for all genes.
2. Determine the maximum likelihood distributions for all genes according to standard methods.
3. Evaluate the ODP statistic for each gene.
4. Use bootstrap resampling to obtain null statistics.
5. Contrast observed and expected ODP statistics to obtain q-values.

ODP Performance: simulated data
A comparison of the ODP approach to five leading methods for identifying differentially expressed genes (described in the text and Figure 2) based on simulated data. The number of genes found to be significant by each method over a range of estimated q-value cutoffs is shown for a single, representative data set from each scenario. The proposed ODP approach is in black and the other methods are in gray. In general, the data sets increase in complexity from panels (a) to (d).
(a) Two groups are compared, there is perfectly symmetric differential expression, and the variances are simulated from a unimodal, well-behaved distribution.
(b) Two groups are compared, there is moderate asymmetry in the differential expression, and the variances are simulated from a bimodal distribution.
(c) Three groups are compared, there is slight asymmetry in differential expression, and the variances are simulated from a unimodal, well-behaved distribution.
(d) Three groups are compared, there is moderate asymmetry in differential expression, and the variances are simulated from a bimodal distribution.

Surrogate Variable Analysis
Leek and Storey (2007) PLoS Genetics, In press
• In addition to the primary measured variables that are estimated as fixed or random effects in an analysis, there are usually also unmodeled factors that contribute to expression heterogeneity (EH).
• For example, age, time of day, and nutrition probably all affect an analysis without being directly studied, but they are more predictable than gene-specific noise.
• Sometimes the variable of interest may be confounded with the hidden factors (e.g. batch with population).
• In many situations, SVA can be used to improve power.

SVA Simulation
Simulated Example of Expression Heterogeneity
(A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays.
(B) Genes 1-300 in this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are shown as an indicator variable for each array.
(C) Genes 201-500 in each simulated study are affected by an independent factor that causes EH. This factor is distinct from, but possibly correlated with, the group variable. Here the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear function of the covariates.

SVA Table 1
The results of the significance analysis in the three real gene expression studies. The results of the genetics of gene expression study include the number of significant cis-linkages before and after adjusting for surrogate variables. The disease class results report the number of genes differentially expressed between BRCA1 and BRCA2 before and after adjusting for surrogate variables. For the timecourse study, the number of genes differentially expressed with respect to age is shown for an unadjusted analysis, an analysis adjusted for tissue type, and an SVA-adjusted analysis. An SVA-adjusted analysis may result in an increase or decrease in the number of significant results, depending on the direction and degree to which the unmodeled factors (now captured by surrogate variables) were confounded with the primary variables.

SVA Performance
Impact of Expression Heterogeneity
One thousand gene expression data sets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated Examples.
(A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies. Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA (Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal).
(B) For each simulated data set, a Kolmogorov-Smirnov test was employed to assess whether the p-values of null genes followed the correct null Uniform distribution (Supplementary Text). A quantile-quantile plot of the one thousand Kolmogorov-Smirnov p-values is shown for the SVA-adjusted analysis (solid line) and the unadjusted analysis (dashed line). The SVA-adjusted analysis provides correctly distributed null p-values, whereas the unadjusted analysis does not, because of EH.
(C) A plot of expected true positives versus false discovery rate for the SVA-adjusted (solid) and unadjusted (dashed) analyses. The SVA-adjusted analysis shows increased power to detect true differential expression.

SVA Procedure
1. Remove the signal due to the primary variable(s) of interest to obtain a residual expression matrix.
2. Apply a decomposition to the residual expression matrix to identify signatures of EH in terms of an orthogonal basis of singular vectors.
3. Use a statistical test to determine the singular vectors that represent significantly more variation than would be expected by chance.
4. Identify the subset of genes driving each orthogonal signature of EH.
5. For each subset of genes, build a surrogate variable based on the full EH signature of that subset in the original data.
6. Include all significant surrogate variables as covariates in subsequent regression analyses, allowing for gene-specific coefficients for each surrogate variable.

SVA: Trans-eQTL detection
SVA Captures EH Due to Genotype
(A) A plot of significant linkage peaks (p-value < 1e-7) for expression QTL in the Brem et al. [10, 21] study by marker location (x-axis) and expression trait location (y-axis).
(B) Significant linkage peaks (p-value < 1e-7) after adjusting for surrogate variables. Large trans-linkage peaks on Chromosomes II, III, VII, XII, XIV and XV have been eliminated without reducing cis-linkage peaks.

SVA: Breast Cancer Study
Surrogate Variables from Human Studies
(A) A plot of the top surrogate variable estimated from the breast cancer data [22]. The BRCA1 group is relatively homogeneous (triangles), but the BRCA2 group shows substantial heterogeneity (pluses).
(B) A plot of tissue type versus array for the Rodwell et al. [7] study (dotted line) and the top surrogate variable estimated from the expression data when tissue was ignored (dashed line). There is strong correlation between the top surrogate variable and the tissue type variable.

SVA: Moroccan study
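Looking back at the SVA Procedure slide (this is not specific to the Moroccan study), steps 1 and 2 can be illustrated with a deliberately simplified sketch. It is not the published SVA algorithm: it skips the significance test and gene-subset refinement of steps 3 to 5 and simply takes the leading right singular vector of the residual matrix as a single candidate surrogate variable. The simulated data (60 genes, 20 arrays, an alternating hidden factor), the effect sizes, and all function names are assumptions made for illustration; the singular vector is found by power iteration to keep the example dependency-free.

```python
import math
import random

def group_residuals(row, group):
    """Step 1: remove the primary-variable signal by subtracting each group's mean."""
    means = {}
    for g in set(group):
        vals = [v for v, gj in zip(row, group) if gj == g]
        means[g] = sum(vals) / len(vals)
    return [v - means[gj] for v, gj in zip(row, group)]

def top_right_singular_vector(matrix, n_iter=100, seed=0):
    """Step 2 (simplified): leading right singular vector via power iteration on R^T R."""
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(n_iter):
        u = [sum(r * vj for r, vj in zip(row, v)) for row in matrix]            # u = R v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = R^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Hypothetical simulated study: genes 0-19 respond to the group variable,
# genes 10-39 respond to an unmeasured factor (the source of EH).
rng = random.Random(1)
n_arrays = 20
group = [0] * 10 + [1] * 10
hidden = [1.0 if j % 2 == 0 else -1.0 for j in range(n_arrays)]
expression = []
for i in range(60):
    group_effect = 1.0 if i < 20 else 0.0
    hidden_effect = 2.0 if 10 <= i < 40 else 0.0
    expression.append([group_effect * group[j] + hidden_effect * hidden[j]
                       + rng.gauss(0.0, 0.5) for j in range(n_arrays)])

residuals = [group_residuals(row, group) for row in expression]
surrogate = top_right_singular_vector(residuals)

# The sign of a singular vector is arbitrary, so compare absolute correlation.
print("abs. correlation with hidden factor:", abs(pearson(surrogate, hidden)))
```

The recovered vector would then enter each gene's regression as an extra covariate with its own coefficient (step 6). A full analysis would also need to decide how many such vectors are significant, which this sketch does not attempt.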