Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Module 2
Finding over-represented pathways in
gene lists
Quaid Morris
http://morrislab.med.utoronto.ca
Outline
• Introduction to enrichment analysis
• Demo of DAVID, a web tool for doing
enrichment analysis
• Theory of enrichment analysis:
– Fisher’s Exact Test
– Multiple test corrections
– Other enrichment tests (GSEA)
• DAVID Lab
[Figure: a microarray experiment (gene expression table) and gene-set databases feed into the enrichment test, which outputs an enrichment table, e.g. Spindle: 0.00001, Apoptosis: 0.00025.]
Enrichment Analysis
• Given:
1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
2. Gene sets or annotations: e.g. Gene ontology, transcription factor binding sites in promoter
• Question: Are any of the gene annotations surprisingly enriched in the gene list?
• Details:
– Where do the gene lists come from?
– How to assess “surprisingly” (statistics)
– How to correct for repeating the tests
Two-class Design
[Figure: an expression matrix with Class-1 and Class-2 samples; genes are ranked by a differential statistic, and the UP and DOWN gene lists are selected by threshold.]
E.g.:
- Fold change
- Log (ratio)
- t-test
- Significance analysis of microarrays
Time-course Design
[Figure: an expression matrix over time points t1, t2, t3, …, tn, with genes grouped into clusters.]
E.g.:
- K-means Clusters
- K-medoids
- SOM
Enrichment Test
[Figure: a gene list (e.g. UP-regulated) taken from the microarray experiment (gene expression table) is compared with each gene set from the gene-set databases, against a background of all genes on the array.]
Enrichment Test
The output of an enrichment test is a P-value.
The P-value is the probability that a random sample of the array genes, of the same size as the gene list, overlaps the gene set at least as much as observed.
[Figure: random samples of array genes.]
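The random-sampling definition of the P-value can be illustrated with a small simulation. This is only a sketch: the gene names, the background of 5000 genes with 500 annotated, and the observed overlap of 2 are hypothetical values chosen so that a modest number of draws is enough.

```python
import random


def simulated_p_value(background, gene_set, list_size, observed_overlap,
                      n_draws=20000, seed=0):
    """Estimate P(overlap >= observed) by drawing random gene lists."""
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_draws):
        # One random "gene list" of the same size, drawn without replacement.
        sample = rng.sample(background, list_size)
        overlap = sum(1 for gene in sample if gene in gene_set)
        if overlap >= observed_overlap:
            at_least_as_large += 1
    return at_least_as_large / n_draws


# Hypothetical background: 5000 array genes, the first 500 carry the annotation.
background = [f"gene{i}" for i in range(5000)]
gene_set = set(background[:500])

# A small observed overlap keeps the P-value large enough to estimate
# from 20000 draws; very small P-values need an exact test instead.
p_estimate = simulated_p_value(background, gene_set,
                               list_size=5, observed_overlap=2)
print(f"estimated P-value: {p_estimate:.3f}")
```

Simulation becomes impractical for very small P-values, which is one reason an exact test is used in practice.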
DAVID demo
http://david.abcc.ncifcrf.gov/tools.jsp
Step 1: Define your gene list
• Either
– (a) Copy and paste your list in
– (b) Upload a gene list file
– (c) Choose an example gene list (so, click “demolist1”
on next slide)
Step 1: Define your gene list
If you choose the wrong identifier, you
may need to use the conversion tool
Click here to define type of list (now
we are doing a gene list, next slide
we will define the background)
Step 2: Choose background
You can either upload a background list
(in the upload tab, see previous
step) or choose one of the
background sets (shown here)
Example list is from the Human U95A
array, select it here.
Step 3: Check list & background
Step 4: Run Enrichment Analysis
Click here
Step 5: Select categories of gene lists to test
Click here to expand
Red indicates default
selections, click
check marks if you
want to change
Click here once you’ve
selected your sets
Step 6: View enrichment
Gene set name
EASE P-value
4.0E-4 means 4.0 x 10^-4
# of genes in set
with annotation
Step 7: Change parameters if desired
Set “count” to 1 and “EASE” to 1 if you
want maximum # of categories.
Beware: DAVID only corrects for multiple tests within each category.
Step 8: Download results
Download spreadsheet (in tab-delimited text format).
Beware: if you click, you may get the text in a browser window.
If this happens, “right-click” the link to save it as a file.
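A tab-delimited download can also be read directly in a script. The sketch below assumes the file has a header row with columns named "Term" and "PValue"; those column names and the three example rows are assumptions for illustration, so check them against the header of your own download.

```python
import csv
import io


def load_enrichment_table(handle, p_column="PValue"):
    """Read a tab-delimited table and return its rows sorted by P-value."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        # Values such as "4.0E-4" parse directly as floats.
        row[p_column] = float(row[p_column])
    return sorted(rows, key=lambda row: row[p_column])


# Hypothetical file contents standing in for a downloaded result file.
example_text = (
    "Term\tCount\tPValue\n"
    "translation\t4\t0.3\n"
    "spindle\t12\t1.0E-5\n"
    "apoptosis\t9\t2.5E-4\n"
)

table = load_enrichment_table(io.StringIO(example_text))
significant_terms = [row["Term"] for row in table if row["PValue"] < 0.001]
print(significant_terms)
```

For a real file, replace the `io.StringIO(...)` object with `open(path, newline="")`.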
Step 8a: What can happen if no right-click
Step 8b: How it looks in a spreadsheet
Outline of theory component
• Fisher’s Exact Test for calculating enrichment
P-values (also used for calculating EASE score)
• Multiple test corrections:
– Bonferroni
– Benjamini-Hochberg FDR
• Other enrichment tests
– GSEA for ranked lists (with a short intro)
Fisher’s exact test
a.k.a., the hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
Null hypothesis: List is a random sample from population
Alternative hypothesis: More black genes than expected
Background population:
500 black genes,
4500 red genes
Fisher’s exact test
a.k.a., the hypergeometric test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
[Figure: null distribution, with the P-value marked.]
Answer: P-value = 4.6 x 10^-4
Background population:
500 black genes,
4500 red genes
2x2 contingency table for Fisher’s Exact Test
Gene list
RRP6
MRD1
RRP7
RRP43
RRP42
                   In gene list   Not in gene list
In gene set             4               496
Not in gene set         1              4499
e.g.: http://www.graphpad.com/quickcalcs/contingency1.cfm
Background population:
500 black genes,
4500 red genes
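The one-sided Fisher's exact (hypergeometric) P-value for this 2x2 table can be computed directly from binomial coefficients. The following is a minimal sketch using only the Python standard library; the function and argument names are illustrative, not taken from any particular tool.

```python
from math import comb


def fisher_one_sided(in_list_in_set, in_list_not_set,
                     not_list_in_set, not_list_not_set):
    """P(overlap >= observed) when the gene list is a random sample."""
    list_size = in_list_in_set + in_list_not_set
    set_size = in_list_in_set + not_list_in_set
    population = list_size + not_list_in_set + not_list_not_set

    # Count the ways to draw a list with k annotated genes, for every
    # k at least as large as the observed overlap.
    favourable = 0
    for k in range(in_list_in_set, min(list_size, set_size) + 1):
        favourable += comb(set_size, k) * comb(population - set_size,
                                               list_size - k)
    return favourable / comb(population, list_size)


# Table from the slide: 4 of 5 list genes are in the gene set,
# background of 500 annotated and 4500 unannotated genes.
p_value = fisher_one_sided(4, 1, 496, 4499)
print(f"{p_value:.1e}")
```

With the counts from the table, the result should round to the 4.6 x 10^-4 quoted earlier.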
Important details
• To test for under-enrichment of “black”, test for over-enrichment of “red”.
• The EASE score used by DAVID subtracts one from the observed overlap between gene list and gene set to ensure that more than one gene from the list is in the gene set.
• Need to choose the “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), use only that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***
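The effect of the EASE adjustment can be sketched by re-running the one-sided test with one gene removed from the overlap cell. This assumes the EASE score is exactly the Fisher's exact P-value of that modified table, with the other three cells left unchanged; that reading of the bullet above is an assumption, and the function names are illustrative.

```python
from math import comb


def fisher_one_sided(in_list_in_set, in_list_not_set,
                     not_list_in_set, not_list_not_set):
    """One-sided Fisher's exact P-value for over-enrichment."""
    list_size = in_list_in_set + in_list_not_set
    set_size = in_list_in_set + not_list_in_set
    population = list_size + not_list_in_set + not_list_not_set
    favourable = sum(
        comb(set_size, k) * comb(population - set_size, list_size - k)
        for k in range(in_list_in_set, min(list_size, set_size) + 1)
    )
    return favourable / comb(population, list_size)


def ease_score(in_list_in_set, in_list_not_set,
               not_list_in_set, not_list_not_set):
    """Assumed EASE score: same test with the overlap reduced by one."""
    if in_list_in_set < 1:
        return 1.0  # nothing to subtract, so no evidence of enrichment
    return fisher_one_sided(in_list_in_set - 1, in_list_not_set,
                            not_list_in_set, not_list_not_set)


fisher_p = fisher_one_sided(4, 1, 496, 4499)
ease_p = ease_score(4, 1, 496, 4499)
print(f"Fisher: {fisher_p:.2e}  EASE (assumed form): {ease_p:.2e}")
```

Under this assumption the EASE score is always more conservative (larger) than the plain Fisher P-value.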
Multiple test corrections
How to win the P-value lottery, part 1
Random draws
… 7,834 draws later …
Expect a random draw
with observed enrichment
once every 1 / P-value
draws
Background population:
500 black genes,
4500 red genes
How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
[Figure: the observed draw (RRP6, MRD1, RRP7, RRP43, RRP42) shown under its original annotation and again under a different annotation.]
Simple P-value correction: Bonferroni
If M = # of annotations tested:
Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or
more of the observed enrichments could be due to random draws. The
jargon for this correction is “controlling for the Family-Wise Error Rate
(FWER)”
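The Bonferroni rule above is a one-line computation. In this sketch the corrected value is also capped at 1 so that it stays a valid probability; the cap is an addition not stated on the slide, and the example P-values are hypothetical.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests M, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical P-values for three annotations tested on one gene list.
original = [0.001, 0.01, 0.5]
corrected = bonferroni(original)
print(corrected)
```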
Bonferroni correction caveats
• Bonferroni correction is very stringent and can
“wash away” real enrichments.
• Often one is willing to accept a less stringent
condition, the “false discovery rate” (FDR),
which leads to a gentler correction when
there are real enrichments.
False discovery rate (FDR)
• FDR is the expected proportion of the
observed enrichments due to random
chance.
• Compare to Bonferroni correction which is a bound on the
probability that any one of the observed enrichments
could be due to random chance.
• Typically FDR corrections are calculated using the
Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value  FDR / Q-value
1     Transcriptional regulation   0.001
2     Transcription factor         0.01
3     Initiation of transcription  0.02
…     …                            …
50    Nuclear localization         0.04
51    RNAi activity                0.05
52    Cytoplasmic localization     0.06
53    Translation                  0.07
Sort P-values of all tests in increasing order (the smallest P-value gets rank 1)
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value       FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053
2     Transcription factor         0.01     0.01 x 53/2 = 0.27
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35
…     …                            …        …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042
51    RNAi activity                0.05     0.05 x 53/51 = 0.052
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07
Adjusted P-value = P-value x [# of tests] / Rank
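The adjusted P-value column can be reproduced with a few lines of arithmetic. The sketch below uses only the rows shown in the example (ranks 4 to 49 are elided there and are left out here too); the function name is illustrative.

```python
def adjusted_p_value(p_value, rank, n_tests):
    """Adjusted P-value = P-value x [# of tests] / Rank."""
    return p_value * n_tests / rank


N_TESTS = 53

# (rank, P-value) pairs for the rows visible in the example table.
visible_rows = [(1, 0.001), (2, 0.01), (3, 0.02),
                (50, 0.04), (51, 0.05), (52, 0.06), (53, 0.07)]

adjusted = {rank: adjusted_p_value(p, rank, N_TESTS)
            for rank, p in visible_rows}

for rank, value in adjusted.items():
    print(f"rank {rank:2d}: {value:.3f}")
```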
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value       FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053   0.042
2     Transcription factor         0.01     0.01 x 53/2 = 0.27     0.042
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35     0.042
…     …                            …        …                      …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042   0.042
51    RNAi activity                0.05     0.05 x 53/51 = 0.052   0.052
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061   0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07    0.07
Q-value = minimum adjusted P-value at the given rank or below it in the table (i.e. at that rank or any larger rank number)
Benjamini-Hochberg example
Rank  Category                     P-value  Adjusted P-value       FDR / Q-value  FDR < 0.05?
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053   0.042          Y
2     Transcription factor         0.01     0.01 x 53/2 = 0.27     0.042          Y
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35     0.042          Y
…     …                            …        …                      …              …
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042   0.042          Y
51    RNAi activity                0.05     0.05 x 53/51 = 0.052   0.052          N
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061   0.061          N
53    Translation                  0.07     0.07 x 53/53 = 0.07    0.07           N
P-value threshold for FDR < 0.05: 0.04 (rank 50)
The P-value threshold is the largest P-value for which the corresponding Q-value is below the desired significance threshold
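The whole procedure (sort, adjust, take the running minimum from the bottom of the table, then read off the threshold) can be sketched as follows. The four input P-values are hypothetical, and the function names are illustrative rather than taken from any library.

```python
def benjamini_hochberg(p_values):
    """Return Q-values in the same order as the input P-values."""
    n_tests = len(p_values)
    # Indices of the tests, from smallest to largest P-value (rank 1 first).
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    q_values = [0.0] * n_tests
    running_min = 1.0
    # Walk from the largest rank to rank 1, keeping the smallest
    # adjusted P-value seen so far ("at given rank or below").
    for rank in range(n_tests, 0, -1):
        index = order[rank - 1]
        adjusted = p_values[index] * n_tests / rank
        running_min = min(running_min, adjusted)
        q_values[index] = running_min
    return q_values


def p_value_threshold(p_values, fdr=0.05):
    """Largest P-value whose Q-value is below the FDR threshold, if any."""
    q_values = benjamini_hochberg(p_values)
    passing = [p for p, q in zip(p_values, q_values) if q < fdr]
    return max(passing) if passing else None


# Hypothetical P-values for four gene sets.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
print(example_q)
print(p_value_threshold(example_p, fdr=0.05))
```

Returning the Q-values in input order makes it easy to attach them back to the gene-set names they belong to.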
Reducing multiple test correction stringency
• The correction to the P-value threshold 
depends on the # of tests that you do, so, no
matter what, the more tests you do, the more
sensitive the test needs to be
• Can control the stringency by reducing the
number of tests: e.g. use GO slim; restrict
testing to the appropriate GO annotations; or
select only larger GO categories.
Multiple test correction in DAVID
• In DAVID, the “Benjamini-Hochberg” column
corresponds to the false discovery rate as it is
typically defined. It is unclear what the separate
“FDR” column means.
• DAVID does multiple test correction separately
within each category of gene sets, so adding
more categories does not change the FDRs or
P-values. Be careful how you report these
numbers.
Other enrichment tests
[Figure: two kinds of input to the enrichment test, each running from UP to DOWN.]
• Gene list: Fisher’s Exact Test, Binomial and Chi-squared.
• Ranked list (semi-quantitative): GSEA, Wilcoxon rank-sum, Mann-Whitney U, Kolmogorov-Smirnov
Why test enrichment in ranked lists?
• Possible problems with Fisher’s Exact Test
– No “natural” value for the threshold
– Different results at different threshold settings
– Possible loss of statistical power due to thresholding
• No resolution between significant signals with different
strengths
• Weak signals neglected
GSEA Enrichment Test
[Figure: a ranked gene list and gene-set databases feed into GSEA, which outputs an enrichment table.]
Enrichment Table
Gene-set    p-value   FDR
Spindle     0.0001    0.01
Apoptosis   0.025     0.09
GSEA: Method
Steps
1. Calculate the ES score
2. Generate the ES distribution for the null
hypothesis using permutations
• see permutation settings
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting
genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)
GSEA: Method
ES score calculation
Where are the gene-set genes located in the ranked list?
Is their distribution random, or is there an enrichment at either end?
GSEA: Method
ES score calculation
Every present gene (black vertical bar) gives a positive contribution,
every absent gene (no vertical bar) gives a negative contribution
[Figure: running ES score plotted against gene rank.]
GSEA: Method
ES score calculation
MAX (absolute value) running ES score --> Final ES Score
GSEA: Method
ES score calculation
High ES score <--> High local enrichment
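The running-sum calculation described on these slides can be sketched as below. This is a simplified, unweighted version in which every gene-set member adds the same amount and every non-member subtracts the same amount; the GSEA software can additionally weight hits by their ranking statistic, which is not modelled here. Gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum ES: the largest deviation from zero, signed."""
    is_hit = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(is_hit)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for hit in is_hit:
        # Present genes push the running score up, absent genes push it down.
        running += 1.0 / n_hits if hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical ranked list of ten genes, most up-regulated first.
ranked = [f"g{i}" for i in range(10)]
es_top = enrichment_score(ranked, {"g0", "g1"})      # set at the top
es_bottom = enrichment_score(ranked, {"g8", "g9"})   # set at the bottom
print(es_top, es_bottom)
```

Scaling the steps so that the running sum returns to zero at the end of the list makes scores comparable between gene sets of different sizes.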
GSEA: Method
Empirical p-value estimation (for every gene-set)
1. Generate null-hypothesis distribution from
randomized data (see permutation settings)
[Figure: histogram (counts vs. ES score) of the distribution of ES from N permutations (e.g. 2000).]
GSEA: Method
Estimate empirical p-value by comparing observed ES
score to null-hypothesis distribution from
randomized data (for every gene-set)
[Figure: histogram (counts vs. ES score) of the distribution of ES from N permutations (e.g. 2000), with the real ES score value marked.]
GSEA: Method
Estimate empirical p-value by comparing observed ES
score to null-hypothesis distribution from
randomized data (for every gene-set)
[Figure: histogram (counts vs. ES score) of the distribution of ES from N permutations (e.g. 2,000), with the real ES score value marked.]
Randomized with ES ≥ real: 4 / 2,000
--> Empirical p-value = 0.002
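The empirical p-value is the fraction of null ES scores at least as large as the real one. The sketch below builds the null distribution by drawing random gene sets of the same size, which is only one possible permutation setting and is an assumption here; it also uses a simplified unweighted ES and hypothetical gene names.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Simplified unweighted running-sum ES (largest deviation, signed)."""
    is_hit = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(is_hit)
    n_misses = len(ranked_genes) - n_hits
    running = best = 0.0
    for hit in is_hit:
        running += 1.0 / n_hits if hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


def empirical_p_value(null_scores, observed):
    """Fraction of null scores that are >= the observed (positive) score."""
    return sum(1 for s in null_scores if s >= observed) / len(null_scores)


def random_gene_set_null(ranked_genes, set_size, n_permutations=2000, seed=0):
    """Null ES distribution from random gene sets of the same size."""
    rng = random.Random(seed)
    return [enrichment_score(ranked_genes,
                             set(rng.sample(ranked_genes, set_size)))
            for _ in range(n_permutations)]


# The count from the slide: 4 of 2,000 randomized scores reach the real ES.
slide_p = empirical_p_value([0.1] * 1996 + [0.9] * 4, observed=0.8)

# Hypothetical data: 200 ranked genes, gene set = the ten top-ranked genes.
ranked = [f"g{i}" for i in range(200)]
real_es = enrichment_score(ranked, set(ranked[:10]))
null = random_gene_set_null(ranked, set_size=10)
top_set_p = empirical_p_value(null, real_es)
print(slide_p, top_set_p)
```

For a negative observed ES the comparison would use the opposite tail (null scores less than or equal to the observed score).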
Using GSEA
• Installation
• Launch Desktop Application from:
• http://www.broadinstitute.org/gsea/msigdb/downloads.jsp
• Notes:
– if you have sufficient RAM (*), go for the 1Gb option
– running GSEA will take some time (2-5 hrs depending on the system and the memory setting)
– you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL/Task Manager
MAC: check using Applications/Utilities/Activity Monitor
Summary
• Enrichment analysis:
– Statistical tests
• Gene list: Fisher’s Exact Test
• Gene rankings: GSEA, also see Wilcoxon rank-sum,
Mann-Whitney U-test, Kolmogorov-Smirnov test
– Multiple test correction
• Bonferroni: stringent, controls probability of at least
one false positive*
• FDR: more forgiving, controls expected proportion of
false positives* -- typically uses Benjamini-Hochberg
* False positive = Type 1 error, i.e. an enrichment that is observed even though there is no association
Lab Time
• Part 1
– Try out DAVID – use the yeast gene list and Module 2 gene list
– Follow Lab 2 protocol
– http://david.abcc.ncifcrf.gov/
• Part 2
– Continue with the Cytoscape protocol
– http://opentutorials.rbvi.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape
• Part 3
– Try out enrichment map – load the plugin from the plugin manager
– Load DAVID results – or - load the GSEA enrichment analysis file EM_EstrogenMCF7_TestData.zip (unzip) available at
• http://baderlab.org/Software/EnrichmentMap
Upload your gene list
If DAVID doesn’t recognize your genes, it can try to
detect the correct identifiers to use
Gene ID mapping results
Step 1
Run the enrichment analysis
We are on a Coffee Break &
Networking Session