MicroarrayDatenanalyse Introduction Methodological issues GSEA approach Globaltest

Practical Course in Microarray Data Analysis: Gene Set Analysis
Hans-Ulrich Klein, Christian Ruckert
Institut für Medizinische Informatik
Summer semester (SS) 2011

Organisation
1 09.05.11: Normalization
2 10.05.11: Identifying differentially expressed genes, experimental design
3 11.05.11: Dimension reduction, cluster analysis
4 12.05.11: Classification, gene set analysis
5 13.05.11: Survival analysis

1 Introduction
2 Methodological issues
3 GSEA approach
4 Globaltest

Gene set analyses
• A (long) list of differentially expressed genes is only an intermediate result of a successful microarray experiment.
• This list is the starting point for a complicated interpretation process.
• Gene set analysis (GSA) methods formalize this interpretation process:
  → Group all genes that are annotated to the same annotation term into sets and analyse the experiment's results in terms of these sets.
• This shifts the level of analysis from single genes to sets of genes.

Gene set analyses
• Sources of annotation terms:
  • Gene Ontology (BP, MF, CC)
  • KEGG
  • chromosomal location
  • Ingenuity - IPA
  • presence of transcription factor binding sites
  • ...
• Many different gene set analysis methods have been proposed in the literature. Because they rest on different models and assumptions, their results must be interpreted carefully. Moreover, the foundations and the validity of some approaches have been questioned in recent articles.
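The grouping step described above (collecting all genes annotated to the same term into one set) can be sketched in a few lines of Python. This is only an illustration: the function name `build_gene_sets` and the gene and term labels are hypothetical and do not come from Gene Ontology, KEGG, or any other real annotation source.

```python
from collections import defaultdict


def build_gene_sets(annotations):
    """Invert a gene -> annotation terms mapping into term -> set of genes."""
    gene_sets = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            gene_sets[term].add(gene)
    return dict(gene_sets)


# Hypothetical toy annotation: each gene may carry several terms.
annotations = {
    "geneA": ["term1", "term2"],
    "geneB": ["term1"],
    "geneC": ["term2", "term3"],
}

gene_sets = build_gene_sets(annotations)
for term in sorted(gene_sets):
    print(term, sorted(gene_sets[term]))
```

Because a gene can carry several terms, the resulting sets generally overlap, which is one reason why results for different gene sets are not independent of each other.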
2 Methodological issues

Gene Set Analysis methods: Input
• GSA methods can be classified by their input data:
  • raw expression data
  • statistics per gene (e.g. fold change, t-statistics, ...)
  • p-values per gene
  • list of differentially / not differentially expressed genes
• The last input type is the simplest and the most popular. The data can be represented clearly in a 2 × 2 table.

2 × 2 table for over-representation (1/2)

                                     In gene set    Not in gene set    Total
Differentially expressed gene        $m_{GD}$       $m_{G^cD}$         $m_D$
Not differentially expressed gene    $m_{GD^c}$     $m_{G^cD^c}$       $m_{D^c}$
Total                                $m_G$          $m_{G^c}$          $m$

2 × 2 table for over-representation (2/2)
• Many different tests have been proposed, including the $\chi^2$ test, the hypergeometric test, and the binomial z-test for proportions. The differences tend to be unimportant in practice.
• Hypergeometric test:
  • Put $m_D$ red balls into an urn.
  • Put $m_{D^c}$ black balls into the same urn.
  • Draw $m_G$ balls (genes) without replacement.
  • The hypergeometric distribution gives the probability for the number of red balls ($m_{GD}$) in the drawn sample.
• → A non-parametric counterpart to the hypergeometric test can be constructed by permuting the gene set labels.

Null hypothesis (1/2)
• Two different null hypotheses for GSA:
  • $H_0^{comp}$ (competitive): The genes in $G$ are at most as often differentially expressed as the genes in $G^c$.
  • $H_0^{self}$ (self-contained): No genes in $G$ are differentially expressed.
• The presented 2 × 2 table methods test $H_0^{comp}$.
• A self-contained counterpart:
  • Flag genes with p-values ≤ α as differentially expressed.
  • Under $H_0^{self}$ and independence of genes, $m_{GD} \sim B(m_G, \alpha)$.

Null hypothesis (2/2)
Comparison of the two methods based on different null hypotheses:
1 Power
2 Relationship to single gene testing
3 Testing all genes on a chip; definition of $G^c$
4 Biological meaningfulness
5 Sampling model

Gene vs. subject sampling
• Subject sampling (experimental design for classical statistical tests):
  • sample consists of n realizations (for n subjects): $(X_1, Y_1), \dots, (X_n, Y_n)$
  • each subject gets the same fixed set of measurements (vector $X_i$ of m gene expression values)
  • subjects are assumed to be i.i.d.
  • Replication: hybridize a new subject on the same type of microarray.
• Gene sampling (model behind 2 × 2 table methods):
  • sample consists of g realizations (for g genes): $(A_1, B_1), \dots, (A_g, B_g)$
  • each gene gets the same fixed set of measurements (A ≙ element of gene set; B ≙ differentially expressed)
  • measurements of the g genes are assumed to be i.i.d.
  • Replication: measure new genes on the same subjects.
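The two urn-style tests from the 2 × 2 table slides can be written down directly. The following is a minimal sketch using only the Python standard library: `hypergeom_upper_tail` gives the upper-tail probability of the hypergeometric test (competitive null hypothesis), and `binom_upper_tail` gives the upper-tail probability under the binomial model B(m_G, α) (self-contained null hypothesis, assuming independent genes). The function names and the counts in the example are hypothetical and chosen only for illustration.

```python
from math import comb


def hypergeom_upper_tail(m_gd, m_g, m_d, m):
    """P(X >= m_gd) when m_g balls are drawn without replacement from an urn
    holding m_d red (differentially expressed) and m - m_d black balls."""
    upper = min(m_g, m_d)
    total = comb(m, m_g)
    favourable = sum(
        comb(m_d, k) * comb(m - m_d, m_g - k) for k in range(m_gd, upper + 1)
    )
    return favourable / total


def binom_upper_tail(m_gd, m_g, alpha):
    """P(X >= m_gd) for X ~ B(m_g, alpha): the self-contained counterpart,
    valid only if the genes are independent."""
    return sum(
        comb(m_g, k) * alpha**k * (1 - alpha) ** (m_g - k)
        for k in range(m_gd, m_g + 1)
    )


# Hypothetical counts: 10000 genes on the chip, 500 flagged as differentially
# expressed, a gene set of 100 genes containing 12 flagged genes.
p_comp = hypergeom_upper_tail(m_gd=12, m_g=100, m_d=500, m=10000)
p_self = binom_upper_tail(m_gd=12, m_g=100, alpha=0.05)
print(f"competitive (hypergeometric): {p_comp:.3g}")
print(f"self-contained (binomial):    {p_self:.3g}")
```

Both p-values rest on the gene-sampling model, so they inherit its independence assumption between genes.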
Interpretation of p-values
• The meaning of a p-value relates to hypothetical replications of the experiment performed.
• If $H_0$ is true, no more than a fraction α of the replications will yield a p-value ≤ α.
• Subject-sampling p-value: replications involve taking a new sample of subjects and measuring the same genes.
  → A significant p-value gives confidence that the same associations will be found in a new sample of subjects.
• Gene-sampling p-value: replications involve taking a new sample of genes measured on the same subjects.
  → A significant p-value gives confidence that the same association between the variables "membership of gene set" and "being differentially expressed" will be found within these subjects on a new array with different genes.

Excursus: iGA
• iGA is a variant of the 2 × 2 table approach.
• The cut-off for differential expression is based on fold change.
• In their abstract, Breitling et al. (2004) wrote: "In the extreme, iGA can even produce statistically meaningful results without any experimental replication."
• The gene-sampling urn model does not fit the actual experiment performed.
• It can easily lead to wrong interpretations.
• Unfortunately, the competitive null hypothesis is inherently linked with the gene-sampling model.

Independence assumption
• The subject-sampling model assumes independent subjects.
• The gene-sampling model assumes independent genes.
• However, it is known that strong correlations between genes (especially between functionally related genes) occur frequently in microarray gene expression data.
• → p-value inflation for 2 × 2 table methods

3 GSEA approach

GSEA (1/2)
• One of the first published gene set analysis methods (http://www.broadinstitute.org/gsea/)
• uses a (weighted) KS-test statistic on the ranks of the genes' p-values
• uses a subject-sampling model, i.e., the subjects' class labels are permuted to estimate the null distribution
• What is the correct interpretation of the resulting p-values?

GSEA (2/2)
[Figure slide; the image was not preserved.]

4 Globaltest

Globaltest model (1/2)
• score test for the self-contained null hypothesis
• $X$ is an $n \times g$ matrix with (normalized) gene expression values
• $Y$ is a clinical variable (often categorical)
• model: $E(Y \mid \beta) = h(\alpha + X\beta)$
• test $H_0: \beta = 0$ vs. $H_A: \beta \neq 0$
• problem: $n < g$

Globaltest model (2/2)
• distributional assumptions for $\beta$:
  • $E(\beta) = 0$ and $E(\beta\beta') = \tau^2 I$
• The specification of the distribution of $\beta$ would be complete if a value for $\tau^2$ and a distributional shape were chosen.
• Now the self-contained null hypothesis can be formulated as follows: $\bar{H}_0: \tau^2 = 0$ vs. $\bar{H}_A: \tau^2 > 0$.
• Integration over $\beta$ gives the likelihood for $\tau^2$:
  $$\bar{L}(\tau^2; Y) = E_{\beta \mid \tau^2} L(\beta; Y)$$

Score test
• Derivative of the log-likelihood function of $\tau^2$:
  $$S(\tau^2) = \frac{d}{d\tau^2} \ln \bar{L}(\tau^2; Y) = \frac{d}{d\tau^2} \ln E_{\beta \mid \tau^2} L(\beta; Y)$$
• leads to the score statistic
  $$S(0) =: S = \frac{1}{2} s s' - \frac{1}{2} \operatorname{tr}(\mathcal{I}),$$
  where
  $s = \frac{\partial}{\partial \beta} \ln L(0; Y)$ is the score function of $\beta$, and
  $\mathcal{I} = -\frac{\partial^2}{\partial \beta \, \partial \beta'} \ln L(0; Y)$ is the information matrix of $\beta$ (written $\mathcal{I}$ here to distinguish it from the identity matrix $I$ above).
• $S$ is easy to calculate, but its distribution is unknown in general.

Summary
• gene-sampling models: $H_0^{comp}$ (e.g. http://david.abcc.ncifcrf.gov/)
• subject-sampling models: $H_0^{self}$ (e.g. Globaltest)
• hybrid methods (e.g. Gene Set Enrichment Analysis)
• correct interpretation of the p-values
  • gene-sampling models: Are there at least some significantly differentially expressed genes in your set?
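To make the score statistic concrete, the sketch below evaluates S = ½ s s′ − ½ tr(information matrix) for one special case and obtains a p-value by permuting the subjects' responses, in the spirit of the subject-sampling model. Several things here are assumptions that the slides do not state: a Gaussian linear model with identity link and known variance `sigma2` (so that s = X′(Y − α)/σ² and the information matrix is X′X/σ²), the intercept α replaced by the sample mean of Y, and permutation as the way to approximate the unknown null distribution. The function names are hypothetical; this is not the implementation of the Globaltest software.

```python
import random


def score_statistic(X, y, sigma2=1.0):
    """S = 1/2 * s s' - 1/2 * tr(information) at beta = 0, Gaussian linear model.

    X: list of n rows (subjects), each a list of g expression values.
    y: list of n responses; alpha is replaced by mean(y) (an assumption).
    """
    n, g = len(X), len(X[0])
    y_bar = sum(y) / n
    resid = [yi - y_bar for yi in y]
    # Score vector: s_j = sum_i x_ij * (y_i - alpha) / sigma^2
    s = [sum(X[i][j] * resid[i] for i in range(n)) / sigma2 for j in range(g)]
    # Trace of the information matrix X'X / sigma^2
    trace_info = sum(X[i][j] ** 2 for i in range(n) for j in range(g)) / sigma2
    return 0.5 * sum(sj * sj for sj in s) - 0.5 * trace_info


def permutation_p_value(X, y, n_perm=999, seed=0):
    """Approximate the null distribution of S by permuting the responses."""
    rng = random.Random(seed)
    observed = score_statistic(X, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if score_statistic(X, y_perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: 6 subjects, a gene set of 3 genes.
X = [
    [1.2, 0.8, 1.1],
    [0.9, 1.1, 1.3],
    [1.0, 0.7, 0.9],
    [-1.1, -0.9, -1.2],
    [-0.8, -1.2, -1.0],
    [-1.0, -0.6, -1.1],
]
y = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print("S =", score_statistic(X, y))
print("permutation p-value =", permutation_p_value(X, y))
```

Under permutation of the responses the trace term is constant, so the ranking of permuted statistics depends only on the squared length of the score vector.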