Functional Enrichment and
Pathway Analysis – I
Daniele Merico
PhD, Molecular and Cellular Biology
Post-doctoral Research Fellow, CCBR, U. of T.
Outline of these lectures
Goal
Identifying functional “themes” and “patterns” in microarray data
Lesson 1: Gene-set Enrichment Analysis
• Data sources
• Statistical methods
• Visualization
Lesson 2: Networks and Pathways
• Networks: data sources and visualization
• Pathways
PART 1
Introduction
How do we relate
microarray expression data
to biological function?
Analysis Workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the Differential Genes
5. Identify the Functional Groups
From differential genes to biological functions
– How do my data relate to known biological functions?
– Are there specific functions that are characterized by gene expression changes?
Identification of Functional Groups
GENE SETS
– Examples: Spindle (Gene.1, Gene.2, Gene.3); P53 signaling (Gene.2, Gene.4, Gene.5)
– Score the set depending on the gene expression of its member genes
NETWORKS
– Just visual, or
– Identify modules satisfying some joint gene expression and topology requirement
PATHWAYS
– Just visual, or
– Score the pathways exploiting gene expression and topology
Identification of Functional Groups
– GENE SETS (e.g. Spindle, P53 signaling): this lecture
– NETWORKS and PATHWAYS: next week’s lecture
PART 2
Gene-set Enrichment Analysis
What is gene-set enrichment analysis? How does it help to interpret microarray data?
What’s Gene-set Enrichment Analysis?
Break down cellular function into gene sets
- Every set of genes is associated with a specific cellular function, process, component or pathway
Examples:
- Nuclear Pore: Gene.AAA, Gene.ABA, Gene.ABC
- Cell Cycle: Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5
- Ribosome: Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4
- P53 signaling: Gene.CC1, Gene.CK1, Gene.PPP
What’s Gene-set Enrichment Analysis?
Microarray data can be related to gene sets in order to mine their functional meaning
- Which gene-sets best summarize the gene expression patterns?
This is the meaning of significant enrichment
We will see the “statistical” definition of enrichment in PART 4
PART 3
Gene-set Enrichment: Data
What data sources are available for
gene-set enrichment analysis?
Gene-set Data Sources
Break down cellular function into gene sets
Where can I get these gene-sets?
How were the gene-sets compiled?
How are they structured?
Gene Ontology (GO)
Gene Ontology is:
– a hierarchically-structured,
• Functional categories are organized hierarchically, i.e. a system of
inter-related sets with increasing scope specificity
(parent-child relations)
– controlled vocabulary
• Functional categories are defined by experts, and then must be
used consistently for annotation
– for gene product function annotation
• Gene products (i.e. proteins) are annotated using GO functional
categories (“terms”)
– It is general for all species
Gene Ontology: Example
Terms are organized hierarchically
– Terms at the top are more general, terms at the bottom are narrower in scope
– If a protein is annotated as Spindle, the annotation should be automatically inferred also for all ancestors of Spindle (up-propagation)
Gene Ontology: Example
Gene Ontology and the corresponding gene-sets
– PARENT gene-set: ABB1, C5A75, DUCZ, ACAP3, LUC2 (plus all genes of the CHILD)
– CHILD gene-set: TRAC1, POF5, ZUMM
The set corresponding to the CHILD is a subset of the one corresponding to the PARENT
Gene Ontology: Partitions
GO has three independent partitions, which are
not interconnected:
– Molecular Function
• Describes biochemical activities, in-vitro binding specificities, etc…
• Example: Ligase Activity, Kinase Activity, DNA Binding
– Cellular Component
• Describes parts of the cell
• Example: Mitochondrion, Spindle Microtubule
– Biological Process
• Describes processes at the intra-cellular and organism level
• Example: DNA Replication, Apoptosis, Development
Gene Ontology: Partitions
[Figure: first-level children of the Molecular Function, Cellular Component and Biological Process partitions]
Gene Ontology Levels
Every partition has several levels: ROOT, LEVEL-1, LEVEL-2, …, LEVEL-N
However, terms at the same level don’t necessarily have the same degree of granularity (i.e. specificity of scope)
– Example: Signaling, Immune System Process and Pigmentation all sit directly under Biological Process, yet they have different granularity
Gene Ontology Annotations
How are genes annotated with GO terms?
Human curators go through the literature, mining it for gene functions
- Different genomic databases take part in this effort
- Evidence Codes are used to keep track of the type of evidence for annotation
- IEA annotations are directly imported from databases, without human curation
Important Note:
Primary annotations are not propagated using the ontology; therefore, when you download GO gene-sets, always make sure that up-propagation was done
Gene Ontology Evidence Codes
• ISS: Inferred from Sequence/Structural Similarity
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
• IC: Inferred by Curator
• ND: No Data available
• IEA: Inferred from Electronic Annotation
More at: http://www.geneontology.org/GO.evidence.shtml
Gene Ontology Evidence Codes
How should I use evidence codes?
– Quality Filter for Gene-set Enrichment
• Sometimes IEA (Electronic Annotations) are considered
less reliable, and are not used for analysis
• However, this should be evaluated very carefully and
cannot be generalized
– Gene Browsing
• If you are interested in the function of a specific gene, you can check whether multiple lines of evidence are available
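The two uses of evidence codes described above can be sketched in a few lines of Python. This is only an illustration: the annotation records, the gene names and the helper names `filter_by_evidence` and `evidence_per_annotation` are made up for the example and do not come from any GO distribution or library.

```python
# Hypothetical annotation records: (gene, GO term, evidence code).
annotations = [
    ("GeneA", "GO:0005819", "IDA"),
    ("GeneA", "GO:0006260", "IEA"),
    ("GeneB", "GO:0005819", "IEA"),
    ("GeneB", "GO:0005819", "IMP"),
    ("GeneC", "GO:0006915", "TAS"),
]


def filter_by_evidence(records, excluded=frozenset({"IEA"})):
    """Quality filter: keep records whose evidence code is not excluded."""
    return [record for record in records if record[2] not in excluded]


def evidence_per_annotation(records):
    """Gene browsing: map (gene, term) to its distinct evidence codes."""
    codes = {}
    for gene, term, code in records:
        codes.setdefault((gene, term), set()).add(code)
    return {key: sorted(value) for key, value in codes.items()}


curated = filter_by_evidence(annotations)        # IEA records dropped
support = evidence_per_annotation(annotations)   # codes backing each annotation
```

Whether dropping IEA is appropriate depends on the data set, as noted above; the `excluded` argument is left configurable for that reason.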
Annotation Inheritance
There are primary and inherited annotations
– Primary Annotations
• Originally defined by curators
– Inherited Annotations
• Back-propagated along the hierarchy
Always check if the gene ontology annotation
resource you are using includes inherited annotations!
Annotation Inheritance
Primary Annotation: Spindle
Inherited Annotations:
– Microtubule Cytoskeleton
– Cytoskeletal Part
– Cytoskeleton
– Intracellular Organelle Part
– …
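Up-propagation can be illustrated with a small sketch. The hierarchy below is a toy chain assembled from the terms named on the slide, not the real GO graph, and the genes `GeneX` and `GeneY` as well as the function names are hypothetical.

```python
# Toy hierarchy (child -> parents). A simplified, hypothetical arrangement of
# the terms on the slide; the real GO graph is larger and shaped differently.
parents = {
    "Spindle": ["Microtubule Cytoskeleton"],
    "Microtubule Cytoskeleton": ["Cytoskeletal Part"],
    "Cytoskeletal Part": ["Cytoskeleton", "Intracellular Organelle Part"],
    "Cytoskeleton": [],
    "Intracellular Organelle Part": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def up_propagate(primary, parents):
    """Build term -> gene-set, adding each gene to all ancestors of its terms."""
    gene_sets = {}
    for gene, terms in primary.items():
        for term in terms:
            for target in {term} | ancestors(term, parents):
                gene_sets.setdefault(target, set()).add(gene)
    return gene_sets


# Primary annotations only (hypothetical genes).
primary = {"GeneX": ["Spindle"], "GeneY": ["Cytoskeleton"]}
gene_sets = up_propagate(primary, parents)
```

After propagation, the set of a child term is always contained in the set of each of its ancestors, which is the subset relation shown in the PARENT/CHILD example.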
Gene Ontology: Multi-function
Besides hierarchical term organization,
genes can be multi-functional,
i.e. annotated by many independent terms
– In the following slide we see an excerpt of the annotations of p53 (the “Guardian of the Genome”), as reported by the NCBI database Entrez-Gene
http://www.ncbi.nlm.nih.gov/gene/7157
Gene Ontology: Statistics
Terms (http://www.geneontology.org/GO.downloads.ontology.shtml)
– 29,922 Total Terms
– 8,688 Molecular Function
– 2,689 Cellular Component
– 18,545 Biological Process
Annotated Genes (Entrez-Gene)
– 17,482 Human
– 18,028 Mouse
Exploring Gene Ontology: QuickGO
http://www.ebi.ac.uk/QuickGO/
Gene-sets: Beyond Gene Ontology
There are many other sources and types of gene-sets:
- Pathways (e.g. KEGG)
- Protein Families / Domains (e.g. PFAM)
- Predicted Targets of Regulators (e.g. MSigDB-c3)
  - miRNA, Transcription Factors
- Protein-protein Interaction Modules
- Gene Expression
  - Up/down after treatment or in relation to disease (e.g. MSigDB-c2)
  - Co-expression across many conditions (e.g. MSigDB-c4)
- Genotype-phenotype association (e.g. DiseaseHub)
- Genomic position (e.g. MSigDB-c1)
Pathways and GO Biol. Process
How do pathways and processes differ?
– From a purely biological perspective, the question is philosophical (still worth speculating…)
– From a bioinformatics perspective:
• A gene is annotated for a GO Biological Process if the curators deem that it (significantly) contributes to the process (which is at the cellular or organ level), according to several lines of evidence
• Pathways include the “wiring” of genes/gene products, hence they rely on a more intensive curation process
• Some pathways include large ubiquitous actors (such as the proteasome) that may confound enrichment analysis, whereas these are usually absent from GO processes
A pathway example: the MAPK cascade in KEGG
(http://www.genome.jp/kegg/pathway/hsa/hsa04010.html)
Major Gene-set Resources A-Z
• Bioconductor
– GO: GO.db + org.Xx.eg.db (org.Xx.egGO2ALLEGS)
– KEGG: KEGG.db + org.Xx.eg.db (org.Xx.egPATH)
– PFAM: PFAM.db + org.Xx.eg.db (org.Xx.egPFAM)
– Note: Xx has to be replaced with the species id {Hs, Mm, Rn, etc…}
• DiseaseHub (http://zldev.ccbr.utoronto.ca/~ddong/diseaseHub/)
– Phenotype-genotype (OMIM, GAD, HGMD, PharmGKB, CGP, GWAS)
• MSigDB (http://broad.harvard.edu/gsea/msigdb/index.jsp)
– GO (*no IEA), Pathways (KEGG, Biocarta, STKE, GenMAPP, PharmGKB,
GEArray), Predicted Targets (miRNA: ?, TF: Transfac),
Gene Expression, Genomic Positions
• PathwayCommons
(http://www.pathwaycommons.org/pc-snapshot/gsea/by_species/)
– Pathways: Reactome, NCI, Cell map
• WhichGenes (www.whichgenes.org)
– GO, Pathways (KEGG, Biocarta, Reactome), Genomic Positions, Regulators
(miRNA: TargetScan, miRBase), Phenotype-genotype (geneCards Disease,
CancerGenes)
Exploring MSigDB (1)
http://broad.harvard.edu/gsea/msigdb/index.jsp
Exploring MSigDB (2)
Alzheimer
Exploring MSigDB (3)
Select this gene-set
Exploring MSigDB (4)
Exploring MSigDB (5)
I now want to see how the
gene-set I was interested in
overlaps with other gene-sets
in the collection
(I selected only a few types)
Exploring MSigDB (6)
We will see how this p-value is computed and what it means in the next part (enrichment methods)
Gene-set Resources
Tips to navigate the resource ocean / 1:
– Start your analysis using only a few, reliable
sources (e.g. GO, KEGG)
• GO also has a very large gene coverage
– After the first-pass analysis, expand your gene-set
collection to types you are interested in
– Don’t try everything together from the beginning
– Remember quality and clarity!
• Target predictions may be unreliable
• Gene expression-derived sets are often hard to
interpret
Gene-set Resources
Tips to navigate the resource ocean / 2:
– If you are confident with R, start from Bioconductor, and supplement the missing pathways by shopping around
• GO: Bioconductor
• Pathways: Pathway Commons
• Phenotype-genotype: DiseaseHub
• Gene Expression: MSigDB
Useful scripts available at:
http://baderlab.org/DanieleMerico/Code/Bioc2GMT
http://baderlab.org/DanieleMerico/Code/Read_GMT
Gene-set Resources
Tips to navigate the resource ocean / 2:
– If you are not confident with R, and you are a
GSEA user, use MSigDB and Pathway Commons
• From both resources you can download GMT files
(GMT is the format used by GSEA)
• Remember that GO gene-sets in MSigDB do not have
IEA-backed annotations
– Both Bioconductor and MSigDB incorporate GO
inherited annotations (back-propagated)
Summary of PART 3
Gene-set Data Sources
– Gene Ontology, a hierarchically structured
controlled vocabulary for gene function
annotation, is the main source of gene-sets
– Other valuable sources are available, such as pathway databases
In the next part we will see how to use gene-sets for enrichment analysis…
Now, take a…
And ready to dive again!
PART 4
Gene-set Enrichment: Methods
What statistical methods can I use to
score gene-sets for enrichment?
Enrichment Test
Inputs:
– Microarray Experiment (gene expression table): experimental data
– Gene-set Databases: a priori knowledge + existing experimental data
Output:
– Enrichment Table, used for interpretation & hypotheses

Gene-set    p-value
Spindle     0.00001
Apoptosis   0.00025

Gene-sets behind the table (examples):
– Spindle: SPP1, SPP2, CCCP, MTC1, …
– Apoptosis: FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …
How do we go from the microarray experiment (gene expression table) to the enrichment test?
Two-class Design
Expression Matrix (Class-1, Class-2) --> Genes Ranked by Differential Statistic (from UP to DOWN) --> Selection by Threshold (UP, DOWN)
Differential statistic, e.g.:
- Fold change
- Log (ratio)
- t-test
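The two-class workflow (score each gene, rank, then select by threshold) could look like the sketch below. The expression values, the helper names `log2_ratio` and `welch_t`, and the cut-off of 2.0 are all assumptions made for illustration; a real analysis would use a dedicated package and a multiple-testing correction.

```python
import math
from statistics import mean, variance


def log2_ratio(class1, class2):
    """Log2 of the ratio between class means (expects positive intensities)."""
    return math.log2(mean(class1) / mean(class2))


def welch_t(class1, class2):
    """Welch's t statistic; needs at least 2 replicates per class."""
    se = math.sqrt(variance(class1) / len(class1) + variance(class2) / len(class2))
    return (mean(class1) - mean(class2)) / se


# Hypothetical expression values: gene -> (Class-1 replicates, Class-2 replicates)
expression = {
    "Gene.1": ([8.0, 9.0, 10.0], [2.0, 3.0, 4.0]),
    "Gene.2": ([5.0, 6.0, 7.0], [5.0, 6.0, 7.0]),
    "Gene.3": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
}

# Score every gene, then rank from most UP to most DOWN.
scores = {gene: welch_t(c1, c2) for gene, (c1, c2) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Selection by threshold (arbitrary cut-off, chosen only for this example).
threshold = 2.0
up = [gene for gene in ranked if scores[gene] >= threshold]
down = [gene for gene in ranked if scores[gene] <= -threshold]
```

The ranked list is what whole-distribution methods consume, while the thresholded `up` and `down` lists are the input of threshold-dependent tests.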
Time-course Design
Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters
E.g.:
- K-means
- K-medoids
- SOM
Other Designs
Expression Matrix --> Significant Genes
E.g.:
- ANOVA
- Linear Model
Enrichment Test
– The microarray experiment (gene expression table) splits the array genes into significant genes (e.g. UP) and background genes (array genes not significant)
– A gene-set from the gene-set databases overlaps with the significant genes
– Is this overlap larger than expected by randomly sampling the array genes?
Statistical Model: Fisher’s Exact Test
Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (Hypergeometric Distribution)
http://en.wikipedia.org/wiki/Fisher's_exact_test
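Fisher’s Exact Test
For Gene-set Enrichment
a, b, c, d are the sizes of the four subsets (each subset has a different color in the figure), and n = a + b + c + d
Probability of the observed configuration:
$$P = \frac{(a+b)!\,(a+c)!\,(c+d)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$
The enrichment p-value is obtained by summing this probability over the observed configuration and all more extreme ones
R: help(fisher.test)
MEMO:
P-value ~ 0 --> significant
P-value ~ 1 --> not significant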
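As a rough sketch of how the factorial formula above turns into an enrichment p-value, the snippet below evaluates the hypergeometric probability with binomial coefficients and sums it over all overlaps at least as large as the observed one. The assignment of a, b, c, d to cells is an assumption here (a: significant and in the gene-set, b: significant and not in the gene-set, c: not significant and in the gene-set, d: neither), and the function names are illustrative.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins.

    Algebraically equal to (a+b)!(a+c)!(c+d)!(b+d)! / (n! a! b! c! d!).
    """
    n = a + b + c + d
    return comb(a + c, a) * comb(b + d, b) / comb(n, a + b)


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's exact test: P(overlap >= a) with margins fixed."""
    max_overlap = min(a + b, a + c)
    total = 0.0
    for k in range(a, max_overlap + 1):
        shift = k - a  # move `shift` genes into the overlap, keeping margins
        total += table_probability(k, b - shift, c - shift, d + shift)
    return total


# Toy example: 8 array genes, 4 significant, gene-set of 4, overlap of 3.
p = enrichment_p_value(3, 1, 1, 3)
```

For real data, R's `fisher.test` with `alternative = "greater"` performs the same one-sided test.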
Fisher’s Exact Test
For Gene-set Overlap
We can also use Fisher’s Exact Test to evaluate the overlap
between gene-sets from databases
Going back to MSigDB…
Now we know where these p-values come from!
Web Resources for Fisher’s Exact Test
ConceptGen
http://conceptgen.ncibi.org/core/conceptGen/index.jsp
Note: free account required
DAVID
http://david.abcc.ncifcrf.gov/summary.jsp
Note: thorough description of how to use in this paper:
Huang da W, Sherman BT, Lempicki RA.
Systematic and integrative analysis of large gene lists using DAVID bioinformatics
resources.
Nat Protoc. 2009;4(1):44-57.
PMID: 19131956
Beyond Fisher’s Test
Two types of enrichment test:
– Threshold-dependent (e.g. Fisher’s Test): uses only the genes selected as UP or DOWN
– Whole-distribution (e.g. GSEA): uses the whole ranked list, from UP to DOWN
Beyond Fisher’s Test
Whole-distribution methods have been shown to be more stable and statistically powerful; threshold-dependent methods have these drawbacks:
– No “natural” value for the threshold
– Different results at different threshold settings
– Loss of information due to thresholding
• No resolution between significant signals with different strengths
• Weak signals neglected
--> Use whole-distribution methods whenever possible
GSEA Enrichment Test / 1
Two-class comparison: Expression Matrix (Class-1, Class-2) --> Ranked Gene List
- Fold change
- Log (ratio)
- t-test
- SAM
Correlation to phenotype: Expression Matrix + Quantitative Phenotype --> Ranked Gene List
- Pearson correlation
GSEA Enrichment Test / 2
Ranked Gene List + Gene-set Databases --> GSEA --> Enrichment Table

Gene-set    p-value   FDR
Spindle     0.0001    0.01
Apoptosis   0.025     0.09

– The p-value depends only on the single gene-set performance
– The FDR depends on the performance of all gene-sets
GSEA: Method
Steps
1. Calculate the ES score
2. Generate the ES distribution for the null
hypothesis using permutations
• see permutation settings
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for
interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)
GSEA: Method
ES score calculation
– Where are the gene-set genes located in the ranked list? Is their distribution random, or is there an enrichment at either end?
– Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution to the running ES score
– MAX running ES score --> Final ES Score
– High ES score <--> High local enrichment
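The running-score idea can be sketched as follows. This is a deliberately simplified, unweighted version (each gene-set member adds the same amount), whereas the published GSEA method weights each hit by the gene's ranking statistic; the function names and the toy ranked list are assumptions for illustration.

```python
def running_enrichment(ranked_genes, gene_set):
    """Unweighted running sum: +1/n_hits at each member, -1/n_misses otherwise."""
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene-set must cover some, but not all, ranked genes")
    running, total = [], 0.0
    for gene in ranked_genes:
        total += 1.0 / n_hits if gene in members else -1.0 / n_misses
        running.append(total)
    return running


def enrichment_score(ranked_genes, gene_set):
    """Final ES: the maximum of the running score (enrichment at the top)."""
    return max(running_enrichment(ranked_genes, gene_set))


# Hypothetical ranked list (most UP first) and two toy gene-sets.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
top_set = {"g1", "g2", "g3"}        # concentrated at the top of the list
spread_set = {"g2", "g5", "g8"}     # scattered along the list

es_top = enrichment_score(ranked, top_set)
es_spread = enrichment_score(ranked, spread_set)
```

A set concentrated at one end of the list drives the running score far from zero, while a scattered set keeps it close to zero, which is why a high ES indicates local enrichment.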
GSEA: Method
Empirical p-value estimation (for every gene-set)
1. Generate null-hypothesis distribution from randomized data (see permutation settings)
– Distribution of ES from N permutations (e.g. 2000), plotted as number of instances per ES score
2. Estimate empirical p-value by locating the real ES score value in this distribution
– Randomized with ES ≥ real: 4 / 2000
--> Empirical p-value = 0.002
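The counting step behind the empirical p-value is simple enough to sketch. The code below is a minimal illustration under assumptions: `permutation_null` shuffles gene positions (gene permutation) and accepts any scoring function, and `top_half_fraction` is a made-up stand-in statistic, not the GSEA ES.

```python
import random


def empirical_p_value(real_score, null_scores):
    """Fraction of null scores that are at least as large as the real score."""
    at_least = sum(1 for score in null_scores if score >= real_score)
    return at_least / len(null_scores)


def permutation_null(ranked_genes, gene_set, score, n_permutations=2000, seed=0):
    """Scores obtained after shuffling gene positions (gene permutation)."""
    rng = random.Random(seed)
    shuffled = list(ranked_genes)
    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        null_scores.append(score(shuffled, gene_set))
    return null_scores


def top_half_fraction(ranked_genes, gene_set):
    """Stand-in statistic (NOT the GSEA ES): share of the set in the top half."""
    top = set(ranked_genes[: len(ranked_genes) // 2])
    return len(top & set(gene_set)) / len(gene_set)


# The slide's numbers: 4 of 2000 randomized scores reach the real score.
slide_null = [0.9] * 4 + [0.1] * 1996
slide_p = empirical_p_value(0.8, slide_null)

# A small end-to-end run on hypothetical data.
ranked = [f"g{i}" for i in range(1, 21)]
gene_set = {"g1", "g2", "g3", "g4"}
null = permutation_null(ranked, gene_set, top_half_fraction, n_permutations=200, seed=1)
toy_p = empirical_p_value(top_half_fraction(ranked, gene_set), null)
```

With N permutations the smallest non-zero p-value that can be resolved is 1/N, which is one reason to use a few thousand permutations.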
GSEA Settings: Permutation
Permutation settings have important implications, which we will not discuss in detail
Practical suggestions:
– When biological replicates are very similar within classes and classes are well separated
--> gene permutation
– When biological replicates tend to be dissimilar,
or stratified according to hidden experimental factors
--> use other whole-distribution enrichment methods
of self-contained type (e.g. SAM-GS)
GSEA Settings: Gene-set Filter
Gene-sets for enrichment analysis are usually filtered by size
– Large gene-sets are undesired, if they are derived from
Gene Ontology or other functional resources, as they
usually correspond to uninformative concepts (e.g.
Regulation of Biopolymer Catabolism)
– Small gene-sets are undesired as their statistics are quite
noisy, and they may decrease the FDR of other sets
– See Using GSEA section for the specific value of size
filtering settings
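A size filter of this kind could be sketched as below. The default bounds reuse the values 10 and 600 that appear on the parameter-setting slide; treating them as minimum and maximum gene-set size is an assumption, and the function name and toy data are hypothetical.

```python
def filter_by_size(gene_sets, array_genes, min_size=10, max_size=600):
    """Keep gene-sets whose size, restricted to the array genes, is in range."""
    array_genes = set(array_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = set(genes) & array_genes  # only genes measured on the array
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Hypothetical data: 1000 array genes and three gene-sets of different sizes.
array_genes = [f"g{i}" for i in range(1000)]
gene_sets = {
    "tiny": {f"g{i}" for i in range(3)},
    "medium": {f"g{i}" for i in range(50)},
    "huge": {f"g{i}" for i in range(800)},
}
kept = filter_by_size(gene_sets, array_genes)
```

Sizes are counted after intersecting with the array genes, since genes that were not measured cannot contribute to the enrichment statistic.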
Using GSEA
Installation
Launch Desktop Application from:
http://www.broadinstitute.org/gsea/msigdb/downloads.jsp
Notes:
– if you have sufficient RAM (*), go for the 1Gb option
– running GSEA will take some time
(2-5 hrs depending on the system and the memory setting)
– you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL/Task Manager
MAC: check using Applications/Utilities/Activity Monitor
Using GSEA
Data Format
There are three data files you will need:
– Gene-set (.GMT)
– Gene Expression Table (.txt)
– Gene Expression Phenotypes (.CLS)
The format requirements follow.
More on GSEA data formats:
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats
Using GSEA
Data Format: gene-set file (.GMT)
Syntax:
<<GS-Name>> [\tab] <<GS-Description>> [\tab] <<Gene-ID>> [\tab] <<Gene-ID>>
Notes:
• Either use the gene-set ID for the Name (e.g. GO ID) and the gene-set full name for the Description
• Or use the gene-set full name for the Name and the source database for the Description
Example (fields are tab-separated):
regulation of DNA recombination [\tab] GO:0000018 [\tab] 604 [\tab] 641 [\tab] 3458
transition metal ion transport [\tab] GO:0000041 [\tab] 475 [\tab] 538 [\tab] 540
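A minimal reader and writer for this layout might look like the sketch below. It assumes tab-separated fields in the order name, description, gene IDs, as in the syntax line above; the function names are illustrative and error handling is kept to a minimum.

```python
def parse_gmt(lines):
    """Parse GMT lines into {name: (description, [gene IDs])}."""
    gene_sets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"expected name, description and genes: {line!r}")
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = (description, [gene for gene in genes if gene])
    return gene_sets


def format_gmt_line(name, description, genes):
    """Build one tab-separated GMT line (without the trailing newline)."""
    return "\t".join([name, description, *genes])


# The two example gene-sets from the slide, with explicit tabs.
example = [
    "regulation of DNA recombination\tGO:0000018\t604\t641\t3458\n",
    "transition metal ion transport\tGO:0000041\t475\t538\t540\n",
]
gene_sets = parse_gmt(example)
```

In practice the lines would come from an open file, e.g. `parse_gmt(open(path))`, where `path` points at a downloaded .GMT file.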
Using GSEA
Data Format: gene expression table file (.txt)
Syntax: table
<<NAME>> [\tab] <<DESCRIPTION>> [\tab] <<Value-S1>> [\tab] << Value-S2>>
Notes:
• Use the gene ID for the Name (e.g. EntrezGene ID) and the gene symbol and/or full name for the Description
• I recommend using EntrezGene IDs, for a number of reasons
• Gene IDs must be consistent between the GMT and this file
Example:
Using GSEA
Data Format: expression phenotypes file (.CLS)
Example:
9 3 1
# Tg-A Tg-B WT
Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT
Notes:
• First line: number of samples (9), number of classes (3), and a third value that is always 1
• Second line: class labels, preceded by #
• Third line: phenotype labels for all samples in the gene expression table
• Use space as separator
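Writing the phenotype file by hand is error-prone, so a small helper can generate it from the per-sample labels. The sketch below follows the three-line layout of the nine-sample, three-class example above; the function names `format_cls` and `parse_cls` are hypothetical.

```python
def format_cls(sample_labels):
    """Build the three lines of a categorical .CLS file from per-sample labels."""
    classes = list(dict.fromkeys(sample_labels))  # unique, in order of appearance
    header = f"{len(sample_labels)} {len(classes)} 1"
    lines = [header, "# " + " ".join(classes), " ".join(sample_labels)]
    return "\n".join(lines) + "\n"


def parse_cls(text):
    """Read a categorical .CLS file and check the counts in its first line."""
    lines = text.strip().splitlines()
    n_samples, n_classes, _ = (int(value) for value in lines[0].split())
    classes = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(classes) != n_classes:
        raise ValueError("header counts do not match the labels")
    return classes, labels


# Labels must follow the column order of the gene expression table.
labels = ["Tg-A"] * 3 + ["Tg-B"] * 3 + ["WT"] * 3
cls_text = format_cls(labels)
classes, parsed = parse_cls(cls_text)
```

Class labels must not contain spaces, since the space is the field separator.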
Using GSEA
Load the data
Using GSEA
Run the analysis – Parameter setting / 1
– Load gene-set (.GMT) file here
– Load gene expression table here
– Load phenotype file (.CLS) here
– Number of permutations: 2000
– Permutation type: gene-set
– If your gene expression table has probe IDs already matching with the .GMT file, you don’t need this.
– If your gene expression table has probe IDs already matching with the .GMT file, set this to FALSE.
Using GSEA
Run the analysis – Parameter setting / 2
– Differential statistic: use t-test (or signal-to-noise) if you have at least 3 replicates.
– Minimum gene-set size: 10 is usually good. Keep between 7-8 and 15.
– Maximum gene-set size: 600 is usually good. Keep between 500 and 800.
Using GSEA
GSEA Pre-ranked
– If you wish to use a statistic for differential expression other than those computed by GSEA, you can use the Pre-ranked mode
More on GSEA pre-ranked data format:
http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#RNK:_Ranked_list_file_format_.28.2A.rnk.29
Summary of PART 4
Methods for Gene-set Enrichment
– Fisher’s Exact Test can be used for any given set of
experimental genes
– When possible, use GSEA to achieve greater
power
– Both GSEA and Fisher’s Exact Test require scoring genes for significance/differential expression; how this is done depends on the microarray design
Now, take a…
And ready to dive again!
PART 5
Gene-set Enrichment: Visualization
How to use enrichment analysis to
functionally map cellular activity. Or,
everything finally coming together.
Gene-set Enrichment:
Redundancy Problem
Many redundant gene-sets
– Gene Ontology has a very large number of gene-sets, often with slight differences
– Different pathway databases have different yet
overlapping definitions of pathways
– Globally, it is useful to grasp the overlap relations
between enriched gene-sets
--> we need a visualization framework
going beyond the enrichment table
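One simple way to quantify the overlap relations between enriched gene-sets is a pairwise similarity score, which can then be drawn as edges of a network. The sketch below uses the Jaccard index and the overlap coefficient; the gene members, the cutoff of 0.5 and the function names are assumptions made for illustration.

```python
from itertools import combinations


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def overlap_coefficient(set_a, set_b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(set_a), len(set_b))
    return len(set_a & set_b) / smaller if smaller else 0.0


def overlap_edges(gene_sets, cutoff=0.5):
    """List (set A, set B, Jaccard) for every pair at or above the cutoff."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(gene_sets.items(), 2):
        score = jaccard(a, b)
        if score >= cutoff:
            edges.append((name_a, name_b, score))
    return edges


# Hypothetical members for three enriched gene-sets.
enriched = {
    "taxis": {"g1", "g2", "g3", "g4"},
    "chemotaxis": {"g1", "g2", "g3"},
    "heart development": {"g7", "g8"},
}
edges = overlap_edges(enriched)
```

The overlap coefficient equals 1 whenever one set is contained in the other, which makes it convenient for spotting parent-child redundancy in Gene Ontology.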
GO.id | GO.name | p.value | cover | cover.rat | Deg.mdn | Deg.iqr
GO:0042330 | taxis | 2.18E-06 | 23 | 0.056930693 | 54.94499375 | 9.139238998
GO:0006935 | chemotaxis | 2.18E-06 | 23 | 0.060209424 | 54.94499375 | 9.139238998
GO:0002460 | adaptive immune response based on somatic recombination | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002250 | adaptive immune response | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002443 | leukocyte mediated immunity | 0.000419328 | 23 | 0.097046414 | 58.27890582 | 15.58333739
GO:0019724 | B cell mediated immunity | 0.000683758 | 20 | 0.114285714 | 57.84161096 | 15.03496347
GO:0030099 | myeloid cell differentiation | 0.000691589 | 24 | 0.089219331 | 62.22171598 | 10.35284833
GO:0002252 | immune effector process | 0.000775626 | 31 | 0.090116279 | 58.27890582 | 23.86214773
GO:0050764 | regulation of phagocytosis | 0.000792138 | 8 | 0.2 | 53.54786293 | 5.742849971
GO:0050766 | positive regulation of phagocytosis | 0.000792138 | 8 | 0.216216216 | 53.54786293 | 5.742849971
GO:0002449 | lymphocyte mediated immunity | 0.00087216 | 22 | 0.101851852 | 57.84161096 | 16.13171132
GO:0019838 | growth factor binding | 0.000913285 | 15 | 0.068181818 | 83.0405088 | 10.58734852
GO:0051258 | protein polymerization | 0.00108876 | 17 | 0.080952381 | 57.97543252 | 17.31639968
GO:0005789 | endoplasmic reticulum membrane | 0.001178198 | 18 | 0.036072144 | 64.02284752 | 12.05209158
GO:0016064 | immunoglobulin mediated immune response | 0.001444464 | 19 | 0.113095238 | 58.27890582 | 15.58333739
GO:0007507 | heart development | 0.001991562 | 26 | 0.052313883 | 84.02538284 | 18.60761304
GO:0009617 | response to bacterium | 0.002552999 | 10 | 0.027173913 | 52.75249873 | 23.23104637
GO:0030100 | regulation of endocytosis | 0.002658555 | 11 | 0.099099099 | 56.38041132 | 16.02486889
GO:0002526 | acute inflammatory response | 0.002660742 | 24 | 0.103004292 | 57.80098769 | 24.94311116
GO:0045807 | positive regulation of endocytosis | 0.002903401 | 9 | 0.147540984 | 54.94499375 | 6.769909171
GO:0002274 | myeloid leukocyte activation | 0.002969661 | 7 | 0.077777778 | 54.94499375 | 16.07042339
GO:0008652 | amino acid biosynthetic process | 0.003502921 | 7 | 0.017241379 | 45.19797271 | 31.18248579
GO:0050727 | regulation of inflammatory response | 0.004999055 | 7 | 0.084337349 | 54.94499375 | 7.737346076
GO:0002253 | activation of immune response | 0.00500146 | 23 | 0.116161616 | 60.29679989 | 18.41103376
GO:0002684 | positive regulation of immune system process | 0.006581245 | 27 | 0.111570248 | 60.29679989 | 22.05051447
GO:0050778 | positive regulation of immune response | 0.006581245 | 27 | 0.113924051 | 60.29679989 | 22.05051447
GO:0019882 | antigen processing and presentation | 0.007244488 | 7 | 0.029661017 | 54.94499375 | 16.58797889
GO:0002682 | regulation of immune system process | 0.007252134 | 29 | 0.099656357 | 61.05645008 | 22.65935206
GO:0050776 | regulation of immune response | 0.007252134 | 29 | 0.102112676 | 61.05645008 | 22.65935206
GO:0043086 | negative regulation of enzyme activity | 0.008017022 | 9 | 0.040723982 | 53.28031076 | 17.48904224
GO:0006909 | phagocytosis | 0.008106069 | 10 | 0.080645161 | 55.66270253 | 12.47536747
GO:0002573 | myeloid leukocyte differentiation | 0.008174948 | 10 | 0.092592593 | 62.86577216 | 9.401887596
GO:0006959 | humoral immune response | 0.008396095 | 16 | 0.044568245 | 55.05654091 | 18.94209565
GO:0046649 | lymphocyte activation | 0.009044401 | 29 | 0.059917355 | 61.92213317 | 21.03553355
GO:0030595 | leukocyte chemotaxis | 0.009707319 | 7 | 0.101449275 | 56.33116709 | 6.945510559
GO:0006469 | negative regulation of protein kinase activity | 0.010782155 | 7 | 0.046357616 | 52.22863516 | 12.58524145
GO:0051348 | negative regulation of transferase activity | 0.010782155 | 7 | 0.04516129 | 52.22863516 | 12.58524145
GO:0007179 | transforming growth factor beta receptor signaling pathw | 0.012630825 | 13 | 0.071038251 | 83.49440788 | 12.63256309
GO:0005520 | insulin-like growth factor binding | 0.012950071 | 9 | 0.097826087 | 81.41963394 | 7.528247832
GO:0042110 | T cell activation | 0.013410548 | 20 | 0.064516129 | 59.77891783 | 26.06174863
GO:0002455 | humoral immune response mediated by circulating immunogl | 0.016780163 | 10 | 0.125 | 54.70766244 | 14.2572143
GO:0005830 | cytosolic ribosome (sensu Eukaryota) | 0.016907351 | 8 | 0.01843318 | 61.68933284 | 7.814673781
GO:0006487 | protein amino acid N-linked glycosylation | 0.01791078 | 7 | 0.044585987 | 56.50635337 | 6.780726553
GO:0051240 | positive regulation of multicellular organismal process | 0.017931228 | 31 | 0.096573209 | 62.2953212 | 23.86214773
GO:0042379 | chemokine receptor binding | 0.018849666 | 12 | 0.095238095 | 55.13915015 | 19.08254406
GO:0008009 | chemokine activity | 0.018849666 | 12 | 0.096774194 | 55.13915015 | 19.08254406
GO:0016055 | Wnt receptor signaling pathway | 0.020088086 | 18 | 0.04400978 | 85.47935979 | 20.92435897
GO.id | GO.name | p.value | cover | cover.rat | Deg.mdn | Deg.iqr
GO:0042330 | taxis | 2.18E-06 | 23 | 0.056930693 | 54.94499375 | 9.139238998
GO:0006935 | chemotaxis | 2.18E-06 | 23 | 0.060209424 | 54.94499375 | 9.139238998
GO:0002460 | adaptive immune response based on somatic recombination | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002250 | adaptive immune response | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002443 | leukocyte mediated immunity | 0.000419328 | 23 | 0.097046414 | 58.27890582 | 15.58333739
GO:0019724 | B cell mediated immunity | 0.000683758 | 20 | 0.114285714 | 57.84161096 | 15.03496347
GO:0030099 | myeloid cell differentiation | 0.000691589 | 24 | 0.089219331 | 62.22171598 | 10.35284833
GO:0002252 | immune effector process | 0.000775626 | 31 | 0.090116279 | 58.27890582 | 23.86214773
GO:0050764 | regulation of phagocytosis | 0.000792138 | 8 | 0.2 | 53.54786293 | 5.742849971
GO:0050766 | positive regulation of phagocytosis | 0.000792138 | 8 | 0.216216216 | 53.54786293 | 5.742849971
GO:0002449 | lymphocyte mediated immunity | 0.00087216 | 22 | 0.101851852 | 57.84161096 | 16.13171132
GO:0019838 | growth factor binding | 0.000913285 | 15 | 0.068181818 | 83.0405088 | 10.58734852
GO:0051258 | protein polymerization | 0.00108876 | 17 | 0.080952381 | 57.97543252 | 17.31639968
GO:0005789 | endoplasmic reticulum membrane | 0.001178198 | 18 | 0.036072144 | 64.02284752 | 12.05209158
GO:0016064 | immunoglobulin mediated immune response | 0.001444464 | 19 | 0.113095238 | 58.27890582 | 15.58333739
GO:0007507 | heart development | 0.001991562 | 26 | 0.052313883 | 84.02538284 | 18.60761304
GO:0009617 | response to bacterium | 0.002552999 | 10 | 0.027173913 | 52.75249873 | 23.23104637
GO:0030100 | regulation of endocytosis | 0.002658555 | 11 | 0.099099099 | 56.38041132 | 16.02486889
GO:0002526 | acute inflammatory response | 0.002660742 | 24 | 0.103004292 | 57.80098769 | 24.94311116
GO:0045807 | positive regulation of endocytosis | 0.002903401 | 9 | 0.147540984 | 54.94499375 | 6.769909171
GO:0002274 | myeloid leukocyte activation | 0.002969661 | 7 | 0.077777778 | 54.94499375 | 16.07042339
GO:0008652 | amino acid biosynthetic process | 0.003502921 | 7 | 0.017241379 | 45.19797271 | 31.18248579
GO:0050727 | regulation of inflammatory response | 0.004999055 | 7 | 0.084337349 | 54.94499375 | 7.737346076
GO:0002253 | activation of immune response | 0.00500146 | 23 | 0.116161616 | 60.29679989 | 18.41103376
GO:0002684 | positive regulation of immune system process | 0.006581245 | 27 | 0.111570248 | 60.29679989 | 22.05051447
GO:0050778 | positive regulation of immune response | 0.006581245 | 27 | 0.113924051 | 60.29679989 | 22.05051447
GO:0019882 | antigen processing and presentation | 0.007244488 | 7 | 0.029661017 | 54.94499375 | 16.58797889
GO:0002682 | regulation of immune system process | 0.007252134 | 29 | 0.099656357 | 61.05645008 | 22.65935206
GO:0050776 | regulation of immune response | 0.007252134 | 29 | 0.102112676 | 61.05645008 | 22.65935206
GO:0043086 | negative regulation of enzyme activity | 0.008017022 | 9 | 0.040723982 | 53.28031076 | 17.48904224
GO:0006909 | phagocytosis | 0.008106069 | 10 | 0.080645161 | 55.66270253 | 12.47536747
GO:0002573 | myeloid leukocyte differentiation | 0.008174948 | 10 | 0.092592593 | 62.86577216 | 9.401887596
GO:0006959 | humoral immune response | 0.008396095 | 16 | 0.044568245 | 55.05654091 | 18.94209565
GO:0046649 | lymphocyte activation | 0.009044401 | 29 | 0.059917355 | 61.92213317 | 21.03553355
GO:0030595 | leukocyte chemotaxis | 0.009707319 | 7 | 0.101449275 | 56.33116709 | 6.945510559
GO:0006469 | negative regulation of protein kinase activity | 0.010782155 | 7 | 0.046357616 | 52.22863516 | 12.58524145
GO:0051348 | negative regulation of transferase activity | 0.010782155 | 7 | 0.04516129 | 52.22863516 | 12.58524145
GO:0007179 | transforming growth factor beta receptor signaling pathw | 0.012630825 | 13 | 0.071038251 | 83.49440788 | 12.63256309
GO:0005520 | insulin-like growth factor binding | 0.012950071 | 9 | 0.097826087 | 81.41963394 | 7.528247832
GO:0042110 | T cell activation | 0.013410548 | 20 | 0.064516129 | 59.77891783 | 26.06174863
GO:0002455 | humoral immune response mediated by circulating immunogl | 0.016780163 | 10 | 0.125 | 54.70766244 | 14.2572143
GO:0005830 | cytosolic ribosome (sensu Eukaryota) | 0.016907351 | 8 | 0.01843318 | 61.68933284 | 7.814673781
GO:0006487 | protein amino acid N-linked glycosylation | 0.01791078 | 7 | 0.044585987 | 56.50635337 | 6.780726553
GO:0051240 | positive regulation of multicellular organismal process | 0.017931228 | 31 | 0.096573209 | 62.2953212 | 23.86214773
GO:0042379 | chemokine receptor binding | 0.018849666 | 12 | 0.095238095 | 55.13915015 | 19.08254406
GO:0008009 | chemokine activity | 0.018849666 | 12 | 0.096774194 | 55.13915015 | 19.08254406
GO:0016055 | Wnt receptor signaling pathway | 0.020088086 | 18 | 0.04400978 | 85.47935979 | 20.92435897
adaptive immune response based on somatic recombination
adaptive immune response
leukocyte mediated immunity
B cell mediated immunity
myeloid cell differentiation
immune effector process
regulation of phagocytosis
positive regulation of phagocytosis
lymphocyte mediated immunity
Gene-set Enrichment: Redundancy Problem
How to handle the redundancy problem?
– Statistical solutions:
• Correct for redundancy among gene-sets and prioritize the most enriched gene-sets
• These don’t always work well and are not available for all tests (not discussed here)
– Visualization solution:
• Visualize gene-set overlap as a network
• Software: Enrichment Map (Cytoscape plugin)
http://baderlab.org/Software/EnrichmentMap
Enrichment Map
[Figure legend: enrichment significance; Class A (e.g. UP); Class B (e.g. DOWN)]
[Figure: overlap between gene-set A and gene-set B]
Overlap coefficient: $\frac{|A \cap B|}{\min(|A|, |B|)}$
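The overlap formula on this slide can be sketched in a few lines of Python. This is only an illustration: the gene symbols are made up, and treating an empty gene-set as having zero overlap is an assumption, not something stated in the lesson.

```python
def overlap_coefficient(a, b):
    """Return |A ∩ B| / min(|A|, |B|) for two gene-sets given as iterables of gene IDs."""
    set_a, set_b = set(a), set(b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        # Assumption: an empty gene-set overlaps with nothing.
        return 0.0
    return len(set_a & set_b) / smaller


# Made-up gene-sets, for illustration only.
parent_term = {"CCL2", "CCL5", "CXCL10", "CCR7", "IL8"}
child_term = {"CCL2", "CCL5", "CXCL10"}

print(overlap_coefficient(parent_term, child_term))  # 1.0: the smaller set is fully contained
```

Because the denominator is the size of the smaller set, a child term fully contained in its parent scores 1.0, which is why hierarchically related GO terms end up tightly connected.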
Application Example
Estrogen treatment of Breast Cancer Cells
Overall Design:
- 2 classes (treated, untreated)
- 3 time points
Class | 12 hrs | 24 hrs | 48 hrs
Estrogen-treated | 3 | 3 | 3
Untreated | 3 | 3 | 3
We will start by analyzing only the 24-hour time point,
which has the maximal induction,
although it is functionally similar to the 12-hour time point
Clusters were manually identified and tagged;
they represent highly inter-related gene-sets
Condition Comparison
Enrichment Map can be used to compare enrichments
Use cases:
– Different experiments
– Different condition comparisons within the same experiment
Example: same data-set (Estrogen treatment)
Class | 12 hrs | 24 hrs | 48 hrs
Estrogen-treated | 3 | 3 | 3
Untreated | 3 | 3 | 3
Now we can analyze the 12- and 24-hour time points together
Notice that we are always comparing the treated to the untreated
Heat-map Feature
Heat-maps can be used to explore gene expression patterns
– Microarray data are typically normalized by row for heat-map visualization:
i. Subtract the row mean
ii. Divide by the row standard deviation
– This setting is available in Enrichment Map
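As a minimal sketch of the two normalization steps, here is a pure-Python version applied to one gene (one row) at a time. The gene names and values are made up, and using the sample standard deviation is an assumption; the lesson does not say which estimator Enrichment Map uses.

```python
from statistics import mean, stdev


def row_normalize(row):
    """Subtract the row mean, then divide by the row standard deviation."""
    row_mean = mean(row)
    row_sd = stdev(row)  # sample standard deviation (assumption); needs >= 2 values
    if row_sd == 0:
        # A constant row carries no pattern; map it to zeros instead of dividing by 0.
        return [0.0 for _ in row]
    return [(value - row_mean) / row_sd for value in row]


# Made-up expression values: one list per gene, one value per sample.
expression = {
    "GENE_A": [5.0, 7.0, 9.0],
    "GENE_B": [100.0, 100.0, 130.0],
}
normalized = {gene: row_normalize(values) for gene, values in expression.items()}
print(normalized["GENE_A"])  # [-1.0, 0.0, 1.0]
```

After this step every gene is on the same scale, so the heat-map colors reflect the pattern across samples rather than the absolute expression level.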
[Figure: heat-map color scale, from Down to Up]
Gene Ontology Restructured
Gene Ontology is hierarchical, and its terms are highly redundant / inter-related / inter-dependent
Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms
Enrichment Map: How-to
Installation
1. Install Cytoscape
http://www.cytoscape.org/download.php?file=cyto2_6_3
2. Download the Enrichment Map plugin
http://baderlab.org/Software/EnrichmentMap#Plugin_Download
3. Copy the plugin into the Cytoscape plugin folder
Windows: C:\Program Files\Cytoscape\plugins
Mac: Applications/Cytoscape/plugins
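Step 3 is normally done by hand in the file browser. If you prefer to script it, a minimal sketch could look like the following; the function name and the jar file name in the usage comment are hypothetical, so use the name of the file you actually downloaded.

```python
import shutil
from pathlib import Path


def install_plugin(plugin_file, plugins_dir):
    """Copy a downloaded plugin file into the Cytoscape plugins folder; return the new path."""
    plugins_dir = Path(plugins_dir)
    if not plugins_dir.is_dir():
        raise FileNotFoundError(f"Plugins folder not found: {plugins_dir}")
    # shutil.copy2 returns the path of the copied file inside the folder.
    return Path(shutil.copy2(plugin_file, plugins_dir))


# Hypothetical usage on Windows (the jar name is a placeholder):
# install_plugin("EnrichmentMap.jar", r"C:\Program Files\Cytoscape\plugins")
```

Restart Cytoscape after copying so that the plugin is picked up.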
Enrichment Map: How-to
Load Data
– Open Cytoscape and load the Enrichment Map plugin from the menu: plugins/Enrichment Map/Load Enrichment Results
1. Format: GSEA
– Use the generic format if you generated the enrichment results outside GSEA; follow the manual for formatting instructions
2. Load the gene-set file (GMT)
3. Load the expression matrix (tab-separated txt)
4. This is optional
5. Change the settings as follows:
– Set the p-value cut-off to 0.001
– Set the FDR q-value cut-off to 0.05 (5%)
– Select the overlap coefficient
More at: http://baderlab.org/Software/EnrichmentMap/UserManual
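To make the effect of the two cut-offs concrete, here is a small sketch of the filtering logic. It assumes a gene-set is shown only if both its p-value and its FDR q-value are at or below the cut-offs; the gene-set names and values are made up, and this is not the plugin's actual code.

```python
P_VALUE_CUTOFF = 0.001  # value suggested in the settings above
FDR_Q_CUTOFF = 0.05     # 5%


def passes_cutoffs(p_value, fdr_q, p_cutoff=P_VALUE_CUTOFF, q_cutoff=FDR_Q_CUTOFF):
    """Return True if a gene-set satisfies both significance cut-offs (assumed rule)."""
    return p_value <= p_cutoff and fdr_q <= q_cutoff


# Made-up enrichment results, for illustration only.
results = [
    {"name": "SET_1", "p_value": 0.0002, "fdr_q": 0.01},
    {"name": "SET_2", "p_value": 0.0300, "fdr_q": 0.04},
    {"name": "SET_3", "p_value": 0.0005, "fdr_q": 0.20},
]
kept = [r["name"] for r in results if passes_cutoffs(r["p_value"], r["fdr_q"])]
print(kept)  # ['SET_1']
```

SET_2 fails the p-value cut-off and SET_3 fails the FDR cut-off, so only SET_1 would become a node in the map.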
Enrichment Map: How-to
Browse results
– Enrichment Map is a Cytoscape plugin
– We will fully learn how to use Cytoscape in the next lesson
– In this lesson, we will just see the essential functionalities
[Screenshot callouts]
• Nodes can be dragged and dropped, or deleted
• Use this panel to move the view of the network around
• Click on nodes to access the heat-map view
• Normalization setting: Row Normalize Data
• These parameters can be tuned to include/exclude gene-sets from the map, depending on their enrichment scores
• Rerun the layout from: Layout/Cytoscape Layouts/Force Directed Layout/Weighted
Summary of PART 5
Visualization of Gene-set Enrichment
– Gene-set enrichment is valuable for summarizing the functional landscape of cellular activity
(in our case, gene expression)
– Gene-sets are highly redundant; organizing them as a network greatly facilitates navigation and
interpretation
• Software: Enrichment Map
Further Readings
Enrichment Analysis (Methods):
• Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008 May;9(3):189-97. PMID: 18202032
• Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform. 2009 Jan;10(1):24-34. PMID: 18836208
Enrichment Map:
• Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics. 2010 Feb 1. PMID: 20127684
Assignment
Rules
– Forum discussion:
• Of course, you are free to discuss general topics on the forums
• Please don’t discuss assignment results until I’ve received them all
• You can discuss the results of optional assignments on the forum at any time, if you wish
– Send me ([email protected]) the following material:
• GSEA input files (zipped)
• GSEA output files (zipped)
• Cytoscape session
• Any ppt or doc elaborating on what you did and answering the questions (please be concise!)
Assignment
Estrogen Treatment Data
– Run GSEA
• Phenotypes: 12 and 24 hrs X treated vs untreated
• Differential statistic: t-test
– Explore results using Enrichment Map
• Can you reproduce the view in the lesson slides?
• What can you infer about the effect of estrogen on the
cellular gene expression program?
• Use the heat-maps to inspect the differences between
12 and 24 hours: what do you notice? What are the
implications for the comparison design?
Assignment
Estrogen Treatment Data: Source
– The original microarray data are available on GEO
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11352
– The raw .CEL data were processed using rma in
R/Bioconductor
– The rma gene expression matrix and the gene-set
(GMT) file are also available at:
http://baderlab.org/Software/EnrichmentMap#Sample_Data_Download
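If you want to inspect the gene-set file outside GSEA, the GMT format is easy to read: each line is tab-separated, with the gene-set name, a description, and then the member genes. The sketch below is a minimal reader under that assumption; the function name and the example lines are made up for illustration.

```python
def parse_gmt(lines):
    """Parse GMT lines (name <TAB> description <TAB> gene1 <TAB> gene2 ...) into a dict of sets."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and gene-sets without any gene
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Made-up GMT content, for illustration only.
example_lines = [
    "SET_A\tmade-up description\tGENE1\tGENE2\tGENE3\n",
    "SET_B\tanother description\tGENE2\tGENE4\n",
]
gene_sets = parse_gmt(example_lines)
print(sorted(gene_sets["SET_A"]))  # ['GENE1', 'GENE2', 'GENE3']
```

To read a real file, pass an open file handle instead of the list, e.g. `with open(path) as handle: parse_gmt(handle)`.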
Optional Assignments / 1
Do these assignments if you have time and wish to
explore more
– Run GSEA with ratio-of-classes
• Are the results globally similar?
• What differences do you notice in the Enrichment Map?
– Make a gene-set (GMT) file with GO and KEGG using
R/Bioconductor
• Are the enriched KEGG pathways insightful?
– Run Enrichment Map with different values of the overlap
coefficient (e.g. 0.4, 0.6)
• In our experience, 0.5 is the optimal value for large maps (> 200 gene-sets)
• Which setting do you like the best? Why?
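To build intuition for this optional exercise, the sketch below shows how the overlap cut-off changes the number of edges in a toy map. It assumes two gene-sets are connected when their overlap coefficient is at or above the cut-off; the gene-sets are made up, and the real plugin may differ in detail.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|), with 0.0 for empty sets (assumption)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def build_edges(gene_sets, cutoff):
    """Return (set_a, set_b, score) for every pair whose overlap reaches the cut-off."""
    edges = []
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        score = overlap_coefficient(gene_sets[name_a], gene_sets[name_b])
        if score >= cutoff:
            edges.append((name_a, name_b, score))
    return edges


# Made-up gene-sets, for illustration only.
toy_sets = {
    "SET_1": {"g1", "g2", "g3", "g4", "g5"},
    "SET_2": {"g4", "g5", "g6", "g7", "g8"},
    "SET_3": {"g1", "g2", "g3", "g9", "g10", "g11"},
    "SET_4": {"g6", "g7", "g12", "g13"},
}
for cutoff in (0.4, 0.5, 0.6):
    print(cutoff, len(build_edges(toy_sets, cutoff)))
# 0.4 3
# 0.5 2
# 0.6 1
```

A lower cut-off gives a denser, more connected map; a higher cut-off splits it into smaller, tighter clusters.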
Optional Assignments / 2
Do these assignments if you have time and wish to
explore more
1. Compute the t-test p-value in R, then select the top (a) 750, (b)
2000 up- and down-regulated genes
2. Run the enrichment analysis in ConceptGen
3. Visualize the enrichment as a network in ConceptGen
– Can you recognize functional clusters?
– Are there similarities with the Enrichment Map view?
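The assignment asks for R, but the gene-selection logic of step 1 can be sketched in Python as well. Note the assumptions: this sketch ranks genes by a Welch-style t statistic rather than by p-value (a p-value needs the t distribution, which would come from a statistics library), the function names are hypothetical, and the expression values are made up.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(treated, untreated):
    """Welch-style t statistic: difference of means over its standard error."""
    diff = mean(treated) - mean(untreated)
    se = sqrt(variance(treated) / len(treated) + variance(untreated) / len(untreated))
    if se == 0:
        # No variability in either group: no difference, or an unbounded one.
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def top_genes(expression, n):
    """Return the n most up-regulated and the n most down-regulated genes."""
    scores = {gene: welch_t(t, u) for gene, (t, u) in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    up = [gene for gene in ranked[:n] if scores[gene] > 0]
    down = [gene for gene in ranked[::-1][:n] if scores[gene] < 0]
    return up, down


# Made-up data: gene -> (treated values, untreated values).
expression = {
    "GENE_UP": ([9.0, 10.0, 11.0], [5.0, 6.0, 7.0]),
    "GENE_DOWN": ([2.0, 3.0, 4.0], [8.0, 9.0, 10.0]),
    "GENE_FLAT": ([5.0, 6.0, 7.0], [5.0, 6.0, 7.0]),
}
print(top_genes(expression, 1))  # (['GENE_UP'], ['GENE_DOWN'])
```

With the real data, n would be 750 or 2000, and the two lists would be submitted as the up- and down-regulated gene lists.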
At least for this lesson…