Microarray Data Analysis
Introduction
Methodological issues
GSEA approach
Globaltest
Practical Course in Microarray Data Analysis (Praktikum der Microarray-Datenanalyse)
– Gene Set Analysis –
Hans-Ulrich Klein
Christian Ruckert
Institut für Medizinische Informatik
Summer semester (SS) 2011
Organisation
1. 09.05.11 – Normalization
2. 10.05.11 – Identifying differentially expressed genes, experimental design
3. 11.05.11 – Dimensionality reduction, cluster analysis
4. 12.05.11 – Classification, Gene Set Analysis
5. 13.05.11 – Analysis of survival times
1 Introduction
2 Methodological issues
3 GSEA approach
4 Globaltest
Gene set analyses
• A (long) list of differentially expressed genes is only an intermediate result of a successful microarray experiment.
• This list is the starting point for a complicated interpretation process.
• Gene Set Analysis methods formalize this interpretation process:
  → Group all genes that are annotated to the same annotation term into sets and analyse the experiment's results in terms of these sets.
• This shifts the level of analysis from single genes to sets of genes.
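The grouping step described above can be sketched in a few lines of Python. This is only an illustration: the gene names and annotation terms are hypothetical placeholders, and the function name `build_gene_sets` is an assumption, not part of any annotation package.

```python
from collections import defaultdict

def build_gene_sets(annotations):
    """Invert a gene -> annotation terms mapping into term -> set of genes."""
    gene_sets = defaultdict(set)
    for gene, terms in annotations.items():
        for term in terms:
            gene_sets[term].add(gene)
    return dict(gene_sets)

# Hypothetical toy annotation; real terms would come from GO, KEGG, etc.
annotations = {
    "geneA": ["term1", "term2"],
    "geneB": ["term1"],
    "geneC": ["term2", "term3"],
}
gene_sets = build_gene_sets(annotations)
```

Each resulting set can then be analysed as one unit instead of looking at its genes one by one.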
• Sources of annotation terms:
  • Gene Ontology (BP, MF, CC)
  • KEGG
  • chromosomal location
  • Ingenuity - IPA
  • presence of transcription factor binding sites
  • ...
• Many different Gene Set Analysis methods have been proposed in the literature. Because they rest on different models and assumptions, the results of these methods must be interpreted carefully. Moreover, the foundations and the validity of some approaches have been questioned in recent articles.
Gene Set Analysis methods – Input
• GSA methods can be classified by their input data:
  • raw expression data
  • statistics per gene (e.g. fold change, t-statistics, ...)
  • p-values per gene
  • list of differentially / not differentially expressed genes
• The last input type is the simplest and most popular. The data can be clearly represented in a 2 × 2 table.
2 × 2 table for over-representation (1/2)

                                       In gene set    Not in gene set    Total
Differentially expressed gene          $m_{GD}$       $m_{G^c D}$        $m_D$
Not differentially expressed gene      $m_{GD^c}$     $m_{G^c D^c}$      $m_{D^c}$
Total                                  $m_G$          $m_{G^c}$          $m$
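As a rough sketch, the cell counts of this table can be derived from three collections of gene identifiers. The function name `two_by_two_table` and the dictionary keys are assumptions chosen to mirror the table's symbols; the gene names in the example are hypothetical.

```python
def two_by_two_table(all_genes, gene_set, de_genes):
    """Count the cells of the 2 x 2 over-representation table.

    all_genes: every gene measured on the array
    gene_set:  genes annotated to the term of interest (G)
    de_genes:  genes flagged as differentially expressed (D)
    """
    universe = set(all_genes)
    in_set = universe & set(gene_set)   # restrict G to measured genes
    de = universe & set(de_genes)       # restrict D to measured genes
    m, m_g, m_d = len(universe), len(in_set), len(de)
    m_gd = len(in_set & de)
    return {
        "m_GD": m_gd,                     # in set, differentially expressed
        "m_GcD": m_d - m_gd,              # not in set, differentially expressed
        "m_GDc": m_g - m_gd,              # in set, not differentially expressed
        "m_GcDc": m - m_g - m_d + m_gd,   # neither
        "m_G": m_g,
        "m_D": m_d,
        "m": m,
    }

# Hypothetical example with six genes.
table = two_by_two_table(
    all_genes=["g1", "g2", "g3", "g4", "g5", "g6"],
    gene_set=["g1", "g2", "g3"],
    de_genes=["g2", "g3", "g4"],
)
```

Restricting both lists to the measured genes keeps the four inner cells summing to the total number of genes on the array.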
2 × 2 table for over-representation (2/2)
• Many different tests have been proposed, including the $\chi^2$-test, the hypergeometric test, and the binomial z-test for proportions. The differences tend to be unimportant in practice.
• Hypergeometric test:
  • Put $m_D$ red balls into an urn.
  • Put $m_{D^c}$ black balls into the same urn.
  • Draw $m_G$ balls (genes) without replacement.
  • The hypergeometric distribution gives the probability of the number of red balls ($m_{GD}$) in the drawn sample.
• → A non-parametric counterpart to the hypergeometric test can be constructed by permuting the gene set labels.
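The urn model and its permutation counterpart might look as follows in plain Python. This is a minimal sketch using only the standard library; the function names are assumptions, and the one-sided (upper-tail) p-value is one common choice for over-representation rather than something the slides prescribe.

```python
import random
from math import comb

def hypergeom_pmf(k, m, m_d, m_g):
    """P(exactly k red balls) when drawing m_g balls without replacement
    from an urn holding m_d red and m - m_d black balls."""
    if k < max(0, m_g - (m - m_d)) or k > min(m_d, m_g):
        return 0.0
    return comb(m_d, k) * comb(m - m_d, m_g - k) / comb(m, m_g)

def overrepresentation_pvalue(m_gd, m, m_d, m_g):
    """Upper-tail p-value P(X >= m_gd) under the hypergeometric urn model."""
    upper = min(m_d, m_g)
    return sum(hypergeom_pmf(k, m, m_d, m_g) for k in range(m_gd, upper + 1))

def label_permutation_pvalue(in_set, is_de, n_perm=1000, seed=0):
    """Permutation counterpart: shuffle the gene set labels across genes.

    in_set, is_de: equally long lists of booleans, one entry per gene.
    """
    rng = random.Random(seed)
    observed = sum(a and b for a, b in zip(in_set, is_de))
    labels = list(in_set)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # reassign set membership at random
        if sum(a and b for a, b in zip(labels, is_de)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (1 + hits) / (1 + n_perm)
```

For example, with 10 genes, 3 of them differentially expressed, and a gene set of size 2 in which both genes are differentially expressed, `overrepresentation_pvalue(2, 10, 3, 2)` evaluates to 3/45 = 1/15.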
Null hypothesis (1/2)
• Two different null hypotheses for GSA:
  • $H_0^{\mathrm{comp}}$ (competitive): The genes in $G$ are at most as often differentially expressed as the genes in $G^c$.
  • $H_0^{\mathrm{self}}$ (self-contained): No genes in $G$ are differentially expressed.
• The presented 2 × 2 table methods test $H_0^{\mathrm{comp}}$.
• A self-contained counterpart:
  • Flag genes with p-values ≤ α as differentially expressed.
  • Under $H_0^{\mathrm{self}}$ and independence of genes, $m_{GD} \sim B(m_G, \alpha)$.
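A small sketch of the self-contained counterpart: under the binomial model above, an upper-tail p-value is the probability of seeing at least the observed number of flagged genes in the set. The function name `self_contained_pvalue` is an assumption, and the independence of genes is taken for granted here exactly as on the slide.

```python
from math import comb

def self_contained_pvalue(m_gd, m_g, alpha):
    """P(X >= m_gd) for X ~ Binomial(m_g, alpha).

    m_gd:  number of genes in the set with p-value <= alpha
    m_g:   size of the gene set
    alpha: per-gene significance threshold
    """
    return sum(
        comb(m_g, k) * alpha**k * (1 - alpha) ** (m_g - k)
        for k in range(m_gd, m_g + 1)
    )
```

Note that only genes inside the set enter the calculation; the rest of the array plays no role, which is what distinguishes this test from the 2 × 2 table methods.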
Null hypothesis (2/2)
Comparison of the two methods based on different null hypotheses:
1. Power
2. Relationship to single gene testing
3. Testing all genes on a chip; definition of $G^c$
4. Biological meaningfulness
5. Sampling model
Gene vs. subject sampling
• Subject-sampling (experimental design for classical statistical tests):
  • The sample consists of $n$ realizations (for $n$ subjects): $(X_1, Y_1), \ldots, (X_n, Y_n)$
  • Each subject gets the same fixed set of measurements (vector $X_i$ of $m$ gene expression values).
  • Subjects are assumed to be i.i.d.
  • Replication: Hybridize a new subject on the same type of microarray.
• Gene-sampling (model behind 2 × 2 table methods):
  • The sample consists of $g$ realizations (for $g$ genes): $(A_1, B_1), \ldots, (A_g, B_g)$
  • Each gene gets the same fixed set of measurements ($A$: member of the gene set; $B$: differentially expressed).
  • Measurements of the $g$ genes are assumed to be i.i.d.
  • Replication: Measure new genes on the same subjects.
Interpretation of p-values
• The meaning of a p-value relates to hypothetical replications of the experiment performed.
• If $H_0$ is true, no more than a fraction α of the replications will yield a p-value ≤ α.
• Subject-sampling p-value: replications involve taking a new sample of subjects and measuring the same genes.
  → A significant p-value gives confidence that the same associations will be found in a new sample of subjects.
• Gene-sampling p-value: replications involve taking a new sample of genes measured on the same subjects.
  → A significant p-value gives confidence that the same association between gene set membership and differential expression will be found within these subjects on a new array with different genes.
Excursus: iGA
• iGA (iterative Group Analysis) is a variant of the 2 × 2 table approach.
• The cut-off for differential expression is based on fold change.
• In their abstract, Breitling et al. (2004) wrote:
  "In the extreme, iGA can even produce statistically meaningful results without any experimental replication."
• The gene-sampling urn model does not fit the actual experiment performed.
• It can easily lead to wrong interpretations.
• Unfortunately, the competitive null hypothesis is inherently linked to the gene-sampling model.
Independence assumption
• The subject-sampling model assumes independent subjects.
• The gene-sampling model assumes independent genes.
• However, strong correlations between genes (especially between functionally related genes) are known to occur frequently in microarray gene expression data.
• → Inflated significance (p-values that are too small) for 2 × 2 table methods
GSEA (1/2)
• One of the first published Gene Set Analysis methods (http://www.broadinstitute.org/gsea/)
• Uses a (weighted) KS-test statistic (Kolmogorov–Smirnov) on the ranks of the genes' p-values
• Uses a subject-sampling model, i.e., the subjects' class labels are permuted to estimate the null distribution
• What is the correct interpretation of the resulting p-values?
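To make the two ingredients above concrete (a KS-type statistic on a ranked gene list, and a null distribution from permuted class labels), here is a simplified sketch with NumPy. It is not the Broad Institute implementation: the running sum is unweighted, genes are ranked by the absolute difference of class means as a stand-in for per-gene p-values, and all function names are assumptions. The gene set must be a non-empty proper subset of the gene indices.

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Unweighted KS-type running sum over a ranked gene list."""
    in_set = np.isin(ranked_genes, sorted(gene_set))
    n_hit = in_set.sum()
    n_miss = in_set.size - n_hit
    # Step up at set members and down at all other genes; the walk ends at 0.
    steps = np.where(in_set, 1.0 / n_hit, -1.0 / n_miss)
    running = np.cumsum(steps)
    # Signed maximum deviation of the walk from zero.
    return running[np.argmax(np.abs(running))]

def rank_genes(X, labels):
    """Order gene indices by absolute difference of class means (largest first).
    This is an assumed stand-in for ranking by per-gene p-values."""
    diff = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    return np.argsort(-np.abs(diff))

def gsea_pvalue(X, labels, gene_set, n_perm=1000, seed=0):
    """Permute the subjects' class labels to estimate the null distribution.

    X: subjects x genes expression matrix; labels: 0/1 class label per subject.
    """
    rng = np.random.default_rng(seed)
    observed = enrichment_score(rank_genes(X, labels), gene_set)
    null = np.array([
        enrichment_score(rank_genes(X, rng.permutation(labels)), gene_set)
        for _ in range(n_perm)
    ])
    # One-sided: is the set concentrated near the top of the ranking?
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p
```

The statistic compares the genes in the set with the remaining genes, while the null distribution comes from permuting subjects; this mixture is what makes the interpretation of the resulting p-value less obvious than for a pure gene-sampling or subject-sampling method.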
GSEA (2/2)
Globaltest model (1/2)
• Score test for the self-contained null hypothesis
• $X$ is an $n \times g$ matrix of (normalized) gene expression values
• $Y$ is a clinical variable (often categorical)
• Model: $E(Y \mid \beta) = h(\alpha + X\beta)$
• Test $H_0: \beta = 0$ versus $H_A: \beta \neq 0$.
• Problem: $n < g$ (fewer subjects than genes)
Globaltest model (2/2)
• Distributional assumptions for β:
  • $E(\beta) = 0$ and $E(\beta\beta') = \tau^2 I$
  • The specification of the distribution of β would be complete if a value for $\tau^2$ and a distributional shape were chosen.
• Now, the self-contained null hypothesis can be formulated as follows:
  $\bar{H}_0: \tau^2 = 0$ vs. $\bar{H}_A: \tau^2 > 0$.
• Integration over β gives the likelihood for $\tau^2$:
  $\bar{L}(\tau^2; Y) = E_{\beta \mid \tau^2}\, L(\beta; Y)$
Score test
• Derivative of the log-likelihood function with respect to $\tau^2$:
  $S(\tau^2) = \frac{d}{d\tau^2} \ln \bar{L}(\tau^2; Y) = \frac{d}{d\tau^2} \ln E_{\beta \mid \tau^2}\, L(\beta; Y)$
• This leads to the score statistic
  $S(0) =: S = \frac{1}{2} s s' - \frac{1}{2} \operatorname{tr}(\mathcal{I})$,
  where $s = \frac{\partial}{\partial\beta} \ln L(0; Y)$ is the score function of β and $\mathcal{I} = -\frac{\partial^2}{\partial\beta\,\partial\beta'} \ln L(0; Y)$ is the information matrix of β.
• $S$ is easy to calculate, but its distribution is unknown in general.
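For intuition, the score statistic can be written out for one special case. The sketch below assumes an identity link and Gaussian errors, with the intercept and the error variance replaced by their estimates under the null hypothesis; in that case the score is $s = X'(Y - \bar{y})/\sigma^2$ and the information matrix is $X'X/\sigma^2$. Because the distribution of the statistic is unknown in general, the sketch approximates it by permuting $Y$ across subjects. The function names are assumptions and this is not the code of the globaltest package.

```python
import numpy as np

def score_statistic(X, y):
    """S = 1/2 * s s' - 1/2 * tr(information), evaluated at beta = 0,
    assuming a linear model with Gaussian errors (identity link)."""
    r = y - y.mean()                  # residuals under H0 (intercept only)
    sigma2 = r.var()                  # variance estimate under H0; must be > 0
    s = X.T @ r / sigma2              # score vector of beta at beta = 0
    info_trace = np.sum(X * X) / sigma2   # tr(X'X) / sigma^2
    return 0.5 * s @ s - 0.5 * info_trace

def permutation_pvalue(X, y, n_perm=1000, seed=0):
    """Subject-sampling null distribution: permute y across subjects."""
    rng = np.random.default_rng(seed)
    observed = score_statistic(X, y)
    null = np.array([score_statistic(X, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Here `X` would hold only the columns of the genes in the set under test. Under permutation of `y`, the variance estimate and the trace term stay constant, so the permutation test is driven entirely by the quadratic form $s s'$.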
Summary
• Gene-sampling models – $H_0^{\mathrm{comp}}$ (e.g. http://david.abcc.ncifcrf.gov/)
• Subject-sampling models – $H_0^{\mathrm{self}}$ (e.g. Globaltest)
• Hybrid methods (e.g. Gene Set Enrichment Analysis)
• Correct interpretation of the p-values:
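  • gene-sampling models: Are the genes in your set more often differentially expressed than the other genes?
  • subject-sampling models: Are there at least some significantly differentially expressed genes in your set?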