
Praktikum der Microarray-Datenanalyse (Practical Course in Microarray Data Analysis)
Gene Set Analysis
Hans-Ulrich Klein, Christian Ruckert
Institut für Medizinische Informatik
Summer semester (SS) 2011

Organisation
1. 09.05.11: Normalization
2. 10.05.11: Identifying differentially expressed genes, experimental design
3. 11.05.11: Dimension reduction, cluster analysis
4. 12.05.11: Classification, Gene Set Analysis
5. 13.05.11: Analysis of survival times

Outline
1 Introduction
2 Methodological issues
3 GSEA approach
4 Globaltest

1 Introduction

Gene set analyses
• A (long) list of differentially expressed genes is only an intermediate result of a successful microarray experiment.
• This list is the starting point for a complicated interpretation process.
• Gene set analysis (GSA) methods formalize this interpretation process:
  → Group all genes annotated with the same annotation term into sets and analyse the experiment's results in terms of these sets.
• This shifts the level of analysis from single genes to sets of genes.
• Sources of annotation terms:
  • Gene Ontology (BP, MF, CC)
  • KEGG
  • chromosomal location
  • Ingenuity - IPA
  • presence of transcription factor binding sites
  • ...
• Many different gene set analysis methods have been proposed in the literature. Because they rest on different models and assumptions, their results must be interpreted carefully. Moreover, the foundations and the validity of some approaches have been questioned in recent articles.

2 Methodological issues

Gene Set Analysis methods: Input
• GSA methods can be classified by their input data:
  • raw expression data
  • statistics per gene (e.g. fold change, t-statistics, ...)
  • p-values per gene
  • list of differentially / not differentially expressed genes
• The last input type is the simplest and the most popular. The data can be summarized in a 2 × 2 table.

2 × 2 table for over-representation (1/2)
(A superscript c denotes the complement: $G^c$ = not in gene set, $D^c$ = not differentially expressed.)

                                     In gene set    Not in gene set    Total
  Differentially expressed gene      $m_{GD}$       $m_{G^c D}$        $m_D$
  Not differentially expressed gene  $m_{G D^c}$    $m_{G^c D^c}$      $m_{D^c}$
  Total                              $m_G$          $m_{G^c}$          $m$

2 × 2 table for over-representation (2/2)
• Many different tests have been proposed, including the $\chi^2$-test, the hypergeometric test and the binomial z-test for proportions. The differences tend to be unimportant in practice.
• Hypergeometric test:
  • Put $m_D$ red balls into an urn.
  • Put $m_{D^c}$ black balls into the same urn.
  • Draw $m_G$ balls without replacement.
  • The hypergeometric distribution gives the probability for the number of red balls ($m_{GD}$) in the drawn sample.
• → A non-parametric counterpart to the hypergeometric test can be constructed by permuting the gene set labels.

Null hypothesis (1/2)
• Two different null hypotheses for GSA:
  • $H_0^{comp}$ (competitive): The genes in $G$ are at most as often differentially expressed as the genes in $G^c$.
  • $H_0^{self}$ (self-contained): No genes in $G$ are differentially expressed.
• The presented 2 × 2 table methods test $H_0^{comp}$.
• A self-contained counterpart:
  • Flag genes with p-values ≤ α as differentially expressed.
  • Under $H_0^{self}$ and independence of genes, $m_{GD} \sim B(m_G, \alpha)$.

Null hypothesis (2/2)
Comparison of the two methods based on different null hypotheses:
1. Power
2. Relationship to single gene testing
3. Testing all genes on a chip; definition of $G^c$
4. Biological meaningfulness
5. Sampling model

Gene vs. subject sampling
• Subject sampling (experimental design for classical statistical tests):
  • The sample consists of n realizations (for n subjects): $(X_1, Y_1), \ldots, (X_n, Y_n)$.
  • Each subject gets the same fixed set of measurements (vector $X_i$ of m gene expression values).
  • Subjects are assumed to be i.i.d.
  • Replication: hybridize a new subject on the same type of microarray.
• Gene sampling (model behind the 2 × 2 table methods):
  • The sample consists of g realizations (for g genes): $(A_1, B_1), \ldots, (A_g, B_g)$.
  • Each gene gets the same fixed set of measurements ($A$ ≙ element of the gene set; $B$ ≙ differentially expressed).
  • The measurements of the g genes are assumed to be i.i.d.
  • Replication: measure new genes on the same subjects.
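
The two tests built on the 2 × 2 table can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and all counts are hypothetical, and only the standard library is used. The first function evaluates the upper tail of the hypergeometric urn model (competitive null hypothesis), the second the upper tail of $B(m_G, \alpha)$ (self-contained null hypothesis, assuming independent genes).

```python
from math import comb


def hypergeom_upper_tail(m_gd, m_g, m_d, m):
    """P(X >= m_gd), where X is the number of red balls among m_g balls drawn
    without replacement from an urn with m_d red and m - m_d black balls."""
    total = comb(m, m_g)
    upper = min(m_g, m_d)
    return sum(
        comb(m_d, k) * comb(m - m_d, m_g - k) for k in range(m_gd, upper + 1)
    ) / total


def binom_upper_tail(m_gd, m_g, alpha):
    """P(X >= m_gd) for X ~ B(m_g, alpha), i.e. every gene in the set is
    flagged independently with probability alpha under the null."""
    return sum(
        comb(m_g, k) * alpha**k * (1 - alpha) ** (m_g - k)
        for k in range(m_gd, m_g + 1)
    )


# Hypothetical counts (assumptions for illustration only):
m = 1000      # genes on the array
m_d = 100     # genes flagged as differentially expressed
m_g = 20      # genes in the gene set
m_gd = 6      # flagged genes inside the gene set
alpha = 0.05  # per-gene significance level used for flagging

p_comp = hypergeom_upper_tail(m_gd, m_g, m_d, m)
p_self = binom_upper_tail(m_gd, m_g, alpha)
print(f"competitive (hypergeometric) p-value: {p_comp:.4g}")
print(f"self-contained (binomial) p-value:    {p_self:.4g}")
```

Both p-values come from a gene-sampling calculation: the randomness is over genes, not over subjects, which is exactly the modelling choice the sampling-model comparison questions.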

Interpretation of p-values
• The meaning of a p-value relates to hypothetical replications of the experiment performed.
• If $H_0$ is true, no more than a fraction α of the replications will yield a p-value ≤ α.
• Subject-sampling p-value: replications involve taking a new sample of subjects and measuring the same genes.
  → A significant p-value gives confidence of finding the same associations in a new sample of subjects.
• Gene-sampling p-value: replications involve taking a new sample of genes measured on the same subjects.
  → A significant p-value gives confidence of finding the same association between the variables "membership of the gene set" and "being differentially expressed" in these subjects on a new array with different genes.

Excursus: iGA
• iGA is a variant of the 2 × 2 table approach.
• The cut-off for differential expression is based on fold change.
• In their abstract, Breitling et al. (2004) wrote: "In the extreme, iGA can even produce statistically meaningful results without any experimental replication."
• The gene-sampling urn model does not fit the actual experiment performed.
• It can easily lead to wrong interpretations.
• Unfortunately, the competitive null hypothesis is inherently linked with the gene-sampling model.

Independence assumption
• The subject-sampling model assumes independent subjects.
• The gene-sampling model assumes independent genes.
• However, strong correlations between genes (especially between functionally related genes) are known to occur frequently in microarray gene expression data.
• → p-value inflation for 2 × 2 table methods

3 GSEA approach

GSEA (1/2)
• One of the first published gene set analysis methods (http://www.broadinstitute.org/gsea/)
• uses a (weighted) KS-test statistic on the ranks of the genes' p-values
• uses a subject-sampling model, i.e., subjects' class labels are permuted to estimate the null distribution
• What is the correct interpretation of the resulting p-values?

GSEA (2/2)
[slide content not preserved]

4 Globaltest

Globaltest model (1/2)
• score test for the self-contained null hypothesis
• $X$ is an $n \times g$ matrix of (normalized) gene expression values
• $Y$ is a clinical variable (often categorical)
• model: $E(Y \mid \beta) = h(\alpha + X\beta)$
• test $H_0: \beta = 0$ vs. $H_A: \beta \neq 0$
• problem: $n < g$

Globaltest model (2/2)
• distributional assumptions for $\beta$:
  • $E(\beta) = 0$ and $E(\beta\beta^\top) = \tau^2 I_g$, with $I_g$ the $g \times g$ identity matrix
• The specification of the distribution of $\beta$ would be complete if a value for $\tau^2$ and a distributional shape were chosen.
• Now the self-contained null hypothesis can be formulated as follows: $\bar{H}_0: \tau^2 = 0$ vs. $\bar{H}_A: \tau^2 > 0$.
• Integrating over $\beta$ gives the likelihood for $\tau^2$: $\bar{L}(\tau^2; Y) = E_{\beta \mid \tau^2}\, L(\beta; Y)$

Score test
• Differentiating the log-likelihood of $\tau^2$,

  $$S(\tau^2) = \frac{d}{d\tau^2} \ln \bar{L}(\tau^2; Y) = \frac{d}{d\tau^2} \ln E_{\beta \mid \tau^2}\, L(\beta; Y),$$

• leads to the score statistic

  $$S(0) =: S = \tfrac{1}{2}\, s^\top s - \tfrac{1}{2} \operatorname{tr}(I),$$

  where
  $s = \frac{\partial}{\partial\beta} \ln L(0; Y)$ is the score function of $\beta$ (a column vector) and
  $I = -\frac{\partial^2}{\partial\beta\, \partial\beta^\top} \ln L(0; Y)$ is the information matrix of $\beta$.
• $S$ is easy to calculate, but its distribution is unknown in general.

Summary
• gene-sampling models test $H_0^{comp}$ (e.g. DAVID, http://david.abcc.ncifcrf.gov/)
• subject-sampling models test $H_0^{self}$ (e.g. Globaltest)
• hybrid methods (e.g. Gene Set Enrichment Analysis)
• correct interpretation of the p-values:
  • gene-sampling models: Are the genes in your set more often differentially expressed than the genes outside it?
  • subject-sampling models: Are there at least some significantly differentially expressed genes in your set?
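
To make the score statistic $S = \tfrac{1}{2} s^\top s - \tfrac{1}{2}\operatorname{tr}(I)$ from the Score test slide concrete, the following sketch evaluates it for one special case and obtains a reference distribution by permuting the subjects' values of $Y$ (subject sampling). Several things here are assumptions rather than content of the slides: the identity link $h$ with Gaussian errors, plugging in the sample mean for $\alpha$ and the sample variance for $\sigma^2$, the permutation scheme, the function names and the toy data. It is not the implementation of the globaltest package.

```python
import random


def score_statistic(X, y):
    """S = 0.5 * s's - 0.5 * tr(I) at beta = 0 for a linear model
    y_i = alpha + x_i beta + error (identity link, an assumption).
    X is a list of n rows with g gene expression values each."""
    n, g = len(y), len(X[0])
    mu = sum(y) / n                            # plug-in estimate of alpha
    resid = [yi - mu for yi in y]
    sigma2 = sum(r * r for r in resid) / n     # plug-in error variance
    if sigma2 == 0:
        raise ValueError("y must not be constant")
    # Score vector s = X'(y - mu) / sigma2
    s = [sum(X[i][j] * resid[i] for i in range(n)) / sigma2 for j in range(g)]
    # Trace of the information matrix I = X'X / sigma2
    trace_info = sum(X[i][j] ** 2 for i in range(n) for j in range(g)) / sigma2
    return 0.5 * sum(sj * sj for sj in s) - 0.5 * trace_info


def permutation_p_value(X, y, n_perm=999, seed=0):
    """Permute the subjects' y values to approximate the null distribution
    of S; returns the proportion of statistics >= the observed one."""
    rng = random.Random(seed)
    observed = score_statistic(X, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if score_statistic(X, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): 8 subjects, gene set of 3 genes, two classes.
y = [0, 0, 0, 0, 1, 1, 1, 1]
X = [
    [0.1, -0.2, 0.0], [-0.1, 0.2, 0.1], [0.0, 0.1, -0.1], [0.2, -0.1, 0.0],
    [1.1, 0.9, 1.0], [0.9, 1.2, 1.1], [1.0, 1.1, 0.9], [1.2, 0.8, 1.0],
]
print("S =", score_statistic(X, y))
print("permutation p-value =", permutation_p_value(X, y))
```

Because the sample mean and variance of $Y$ and the trace term do not change when $Y$ is permuted, only $s^\top s$ varies between permutations in this special case. The resulting p-value refers to replications with new subjects, in line with the subject-sampling interpretation.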
