Gene Set Enrichment Analysis
Petri Törönen
petri(DOT)toronen(AT)helsinki.fi
What, Why, How…
• Gene expression data/analysis
• Problems with gene expression data analysis
• Earlier solutions
• My solution
• Comparisons
• Conclusions / Warnings
Genome-wide gene expression
• Genome-wide Gene Expression (GE) analysis.
Standard lab tool
• Various methods
• Aim to understand biological differences
across the samples at gene level
• If you don’t work with GE data:
– Gene Set Methods can be used with most other
large scale data sets
Typical pipelines
• Shared first steps in each pipeline:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define differentially expressed genes
• Downstream steps (flowchart boxes on the original slide):
– Find over-represented biological processes
– Cluster selected genes
– Generate a classification of samples using GE profiles of genes
– Draw biological conclusions
– Classify unknown samples
What can go wrong?
• Is the definition of Differentially Expressed
genes always reasonable?
– datasets with large noise levels
– p-value thresholds
– sudden jump to signif. regulation
– genes with weak regulation
• Is the set of Diff. Expr. genes the main goal?
=> Biological Processes are usually more
informative.
What can go wrong?
Analysis of data with one threshold.
Biological process with weak regulation goes unnoticed
Solution
• Analyze sets of genes instead of genes
• Gene Set: Genes belonging to same pathway,
biological process, complex and/or Gene
Ontology class
• Benefits: A group of genes is less sensitive to error
than a single gene
• Benefits: Easy interpretation of the results
• Something to support the gene based analysis
Gene set analysis pipeline
• Inputs: expression data, sample labels (class data), pre-defined gene sets
• Gene level:
– Generate the GE data
– Pre-processing (normalization etc.)
– Define continuous diff. expr. score for genes
– Generate permuted data
• Gene set level:
– Calculate a gene set score for each gene set, for both real and permuted data
– Look for gene sets that show stronger signal in real data than in permuted data
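The pipeline above could be sketched roughly as follows. This is a minimal illustration, not the method used in the talk: the gene-level score (difference of class means over the overall standard deviation), the gene set score (plain mean of member scores) and the names `diff_expr_score`, `gene_set_scores` and `empirical_p_values` are all assumptions made for the example.

```python
import random
import statistics

def diff_expr_score(values, labels):
    """Gene level: difference of class means divided by the overall standard deviation."""
    a = [v for v, l in zip(values, labels) if l == 1]
    b = [v for v, l in zip(values, labels) if l == 0]
    spread = statistics.pstdev(values) or 1.0  # avoid division by zero for flat genes
    return (statistics.mean(a) - statistics.mean(b)) / spread

def gene_set_scores(expr, labels, gene_sets):
    """Gene set level: here simply the mean gene-level score of the members (assumption)."""
    gene_score = {g: diff_expr_score(v, labels) for g, v in expr.items()}
    return {name: statistics.mean(gene_score[g] for g in members)
            for name, members in gene_sets.items()}

def empirical_p_values(expr, labels, gene_sets, n_perm=200, seed=0):
    """Compare real gene set scores with scores from label-permuted data."""
    rng = random.Random(seed)
    real = gene_set_scores(expr, labels, gene_sets)
    exceed = {name: 0 for name in gene_sets}
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # generate permuted data by mixing the sample labels
        perm = gene_set_scores(expr, shuffled, gene_sets)
        for name in gene_sets:
            if abs(perm[name]) >= abs(real[name]):
                exceed[name] += 1
    # add-one correction keeps the estimate away from an exact zero
    return {name: (exceed[name] + 1) / (n_perm + 1) for name in gene_sets}
```

Here `expr` maps a gene ID to its expression values across samples, `labels` holds one 0/1 class label per sample, and `gene_sets` maps a set name to its member genes. Any other gene-level or gene set score can be dropped in without changing the permutation loop.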
Methods for gene set scoring
• Average based methods
• Rank based methods
• Other methods (omitted here)
Average based methods
• Calculate the average regulation of gene set
(Tian et al. PNAS)
• Can something go wrong with it?
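A small sketch of what can go wrong with averaging: when a gene set contains both up- and down-regulated genes, the scores cancel. The gene IDs, the scores and the function name `average_set_score` are made up for illustration.

```python
import statistics

def average_set_score(gene_scores, members):
    """Average-based gene set score: mean diff. expr. score of the set members."""
    return statistics.mean(gene_scores[g] for g in members)

# hypothetical diff. expr. scores
gene_scores = {"g1": 2.0, "g2": 1.5, "g3": -2.0, "g4": -1.5, "g5": 0.1, "g6": -0.1}

coherent = ["g1", "g2"]            # all members regulated in the same direction
mixed = ["g1", "g2", "g3", "g4"]   # strong regulation, but in both directions

print(average_set_score(gene_scores, coherent))  # 1.75
print(average_set_score(gene_scores, mixed))     # 0.0
```

The second set is strongly regulated, yet its average score is indistinguishable from that of a set with no regulation at all.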
Rank based methods
• Steps:
– order genes by differential expression
– test every possible threshold in the ordered list
– look for over(/under)-representation of the gene set above the threshold
– select the strongest score
• Expression values are (often) discarded!
• Iterative Group Analysis, Kolmogorov-Smirnov
test (KS), modified KS (Gene Set Enrichment
Analysis package, MIT)
[Figure: ordered gene expression data next to the analyzed gene classes, with the
threshold and the analyzed subset marked. Black = class member, white = not a member.]
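The steps above can be sketched as a threshold scan with a hypergeometric over-representation test. This is only an illustration in the spirit of the rank-based methods named above, not a reimplementation of any of them; the function names `hypergeom_tail` and `best_threshold` are assumptions, and only over-representation is checked.

```python
from math import comb

def hypergeom_tail(k, n_drawn, n_members, n_total):
    """P(X >= k) when n_drawn genes are taken from n_total, n_members of them in the set."""
    upper = min(n_drawn, n_members)
    total = comb(n_total, n_drawn)
    return sum(comb(n_members, i) * comb(n_total - n_members, n_drawn - i)
               for i in range(k, upper + 1)) / total

def best_threshold(ranked_genes, members):
    """Test every threshold in the ordered list and keep the strongest score."""
    members = set(members)
    n_total = len(ranked_genes)
    n_members = sum(1 for g in ranked_genes if g in members)
    best_p, best_n, hits = 1.0, 0, 0
    for n_drawn, gene in enumerate(ranked_genes, start=1):
        if gene in members:
            hits += 1  # set members seen above the current threshold
        p = hypergeom_tail(hits, n_drawn, n_members, n_total)
        if p < best_p:
            best_p, best_n = p, n_drawn
    return best_n, best_p
```

Note that only the order of the genes enters the calculation, so the expression values themselves are discarded. The minimum over all thresholds is also not a valid p-value on its own, which is one reason permutations are needed.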
Permutations
• Needed to evaluate significance
• Two types:
– Row Randomization: mix labels gene set / gene class
– Column Randomization: mix sample labels, used to
calculate diff. expr.
• Column Randomization preferred
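A minimal sketch of the two randomization types, with hypothetical function names. Row randomization replaces the gene set with a random set of the same size; column randomization shuffles the sample labels, after which the diff. expr. scores have to be recalculated.

```python
import random

def row_randomization(gene_ids, members, rng):
    """Mix gene set labels: draw a random gene set of the same size from all genes."""
    return set(rng.sample(gene_ids, len(members)))

def column_randomization(sample_labels, rng):
    """Mix sample labels; diff. expr. scores are then recalculated from them."""
    shuffled = list(sample_labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
random_set = row_randomization(genes, {"g1", "g2"}, rng)
random_labels = column_randomization([1, 1, 1, 0, 0, 0], rng)
```

Column randomization leaves every gene's expression profile untouched, so correlations between genes are preserved in the permuted data. Row randomization breaks them, which is a common argument for preferring column randomization when there are enough samples.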
Summary of methods
• Average-based methods are weak with non-coherent regulation
(up- and down-regulated genes in the same set cancel out)
• Rank-based methods usually omit gene
expression values => steps between all genes
are treated as equally significant
My brilliant proposal
• Combine two method groups:
– Order genes with diff. expr. scores
– Test every threshold position
– At each threshold calculate a difference score, Diff (formula on the original slide)
– Scale the difference with STD and average
estimates (Toronen et al. 2009)
– Get a Z-score scaling for the difference
=> Gene Set Z-score (GSZ)
My brilliant proposal
• An over-representation (hypergeometric)
score weighted with diff. expr. score
• GSZ compares the Diff to the mean and STD
we obtain when the class is randomly distributed in
the ordered list.
• Considers both: variance in the expr. values
and variance in the number of gene set members
in the list
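The idea can be illustrated with a rough sketch. Two things here are assumptions and not taken from the slides: Diff is taken to be the sum of member scores minus the sum of non-member scores above the threshold, and the mean and STD are estimated by randomly placing the class in the ordered list (Monte Carlo) instead of using the analytic estimates of Toronen et al. 2009. The names `diff_profile` and `z_profile` are hypothetical.

```python
import random
import statistics

def diff_profile(scores, member_flags):
    """Diff at each threshold: member scores minus non-member scores so far (assumption)."""
    profile, running = [], 0.0
    for s, is_member in zip(scores, member_flags):
        running += s if is_member else -s
        profile.append(running)
    return profile

def z_profile(scores, member_flags, n_random=500, seed=0):
    """Z-scale Diff at each threshold against random placements of the class."""
    rng = random.Random(seed)
    real = diff_profile(scores, member_flags)
    flags = list(member_flags)
    random_profiles = []
    for _ in range(n_random):
        rng.shuffle(flags)  # class randomly distributed in the ordered list
        random_profiles.append(diff_profile(scores, flags))
    z = []
    for i, d in enumerate(real):
        column = [p[i] for p in random_profiles]
        mean, std = statistics.mean(column), statistics.pstdev(column)
        z.append((d - mean) / std if std > 0 else 0.0)
    return z

# scores must already be ordered by diff. expr.; flags mark the gene set members
scores = [3.0, 2.5, 2.0, 0.5, 0.2, 0.0, -0.1, -0.3, -0.5, -1.0]
flags = [True, True, True, False, False, False, False, False, False, False]
profile = z_profile(scores, flags)
```

Because the expression scores enter Diff directly, a member with a large score moves the profile more than a member with a small one, which is the difference from the purely rank-based scan.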
My brilliant proposal
• Many popular Gene Set scoring methods are
variants of GSZ-method:
– hypergeometric testing
– Pearson correlation
– Max-Mean (Efron, Tibshirani)
– Random Sets (Newton et al.)
GSZ profile from ALL data (Chiaretti et al.) for one GO class vs.
7 quantiles (0, 5, 25, 50, 75, 95, 100) from 500 permutations.
Different positions correspond to other competing methods.
Evaluation
[Figure: GSZ with different parameter values. Third box shows default parameter values.]
• Stability of the scores as the threshold goes through the gene list?
• Red line: strongest signal from positive data (across all GO classes)
• Blue lines: various quantiles (same as before) across all GO classes
• Compare with KS and modified KS (right column; MIT, PNAS and Nature Gen.)
• Same data, same permutations!
Pay attention to the stability of the blue lines.
More evaluation
• GSZ is also stable against gene set size variations
– most methods are not
• Several Gene Set scoring methods were tested with
artificial positive and random datasets
– GSZ showed the best overall ability to separate the two dataset
types
• Methods were evaluated by splitting the real data into
two halves and testing how well the results from the halves match
– GSZ was best in predicting its own results from the other
half
– GSZ was best in predicting the summary of all methods from
the other half
More evaluation
• Compare different gene set scoring functions
• Test with two popular datasets against GO classes
• Calculate the empirical -log(p-values) for strongest GO classes from each
method
• Blue line = GSZ, green line = T-test, red = KS, magenta = iGA, cyan =
modified KS
[Figure: panels for class data, p53 dataset, ALL dataset and pooled data.]
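An empirical -log(p-value) of the kind plotted here could be computed along these lines. The base-10 logarithm, the add-one correction and the function name `empirical_neg_log_p` are assumptions; the slides do not specify them.

```python
import math

def empirical_neg_log_p(real_score, permuted_scores):
    """-log10 of the share of permuted scores at least as large as the real score."""
    exceed = sum(1 for s in permuted_scores if s >= real_score)
    # add-one correction: the estimate can never be exactly zero
    p = (exceed + 1) / (len(permuted_scores) + 1)
    return -math.log10(p)
```

With this estimate the largest reachable value is limited by the number of permutations, for example about 2.7 with 500 permutations, so methods can only be separated at the top if enough permutations are run.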
More evaluation
• Select biologically relevant GO classes as
biologically positive
• Look how many such classes each method
finds across the top ranks (GSZ = blue line)
Shown here: ALL dataset. GSZ outperforms
the others at larger ranks. Similar
results were obtained with the p53
dataset.
Comparison with other programs
• Selected SignalPathway (green line), GSEA (cyan)
and GSA (black) for comparison
• Evaluation was done again using the biologically
positive classes
• Comparing programs is less clear-cut (more variables)
Here again ALL dataset.
Similar results with p53
GSZ outperforms others at
large
Summary
• GSZ, weighted over-representation score
• Math link to many other popular methods
• Stable across GO class sizes and across gene
list positions
• Good performance in artificial datasets
• Best performance with many evaluations
from two real datasets
Other applications
• siRNA data vs. gene IDs (discussed)
• Linkage data vs. biological processes
(discussed)
• BLAST result list vs. descriptions (in use)
• BLAST result list vs. GO classes (in use)
Warnings
• Quality of gene expression data
• Enough samples for permutations
• Each gene should occur only once in the
expression data
• Filter genes without annotations (with GO
data)
• Use Column Permutations
• Quality of gene sets / annotations
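Two of the warnings above (one occurrence per gene, no unannotated genes) can be handled in a short preparation step. The sketch below is one possible way to do it; keeping the strongest absolute score among duplicates is an assumption, not a recommendation from the talk, and `prepare_scores` is a hypothetical name.

```python
def prepare_scores(records, annotations):
    """Keep one score per gene and drop genes without any gene set annotation.

    records: iterable of (gene_id, score) pairs, possibly with repeated gene IDs
    annotations: dict mapping gene_id to a set of class IDs (e.g. GO classes)
    """
    best = {}
    for gene, score in records:
        if not annotations.get(gene):
            continue  # filter genes without annotations
        # assumption: among duplicated entries keep the strongest absolute score
        if gene not in best or abs(score) > abs(best[gene]):
            best[gene] = score
    return best
```

Other ways of collapsing duplicates (mean, median, a fixed probe per gene) fit the same structure; the point is only that each gene enters the ordered list once.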
Wake up!!