Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week-8:
WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)
Simarjeet K. Negi, Ph.D. candidate
(Guda Lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
__________________________________________________________________________________________________
10/16/2015
GCBA 815
Why perform enrichment analysis?
• High-throughput analyses produce large gene lists
• Deciphering the biology
• Organize expression changes into meaningful functional themes
• Gene enrichment analysis increases the likelihood of identifying the molecular processes/functions most pertinent to the study
__________________________________________________________________________________________________
Principle of Enrichment Analysis
• If a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by high-throughput screening technologies
• The analytic conclusion rests on a group of relevant genes rather than on a single gene, which increases the likelihood of identifying the biological processes most pertinent to the study
• Enrichment tools map a large number of ‘interesting’ genes to biological annotation terms (e.g. GO terms or pathways)
• The enrichment of the user's genes is then examined statistically for each annotation term by comparing the outcome to a control (or reference) background
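The comparison against a reference background is commonly done with a one-sided hypergeometric test. The sketch below assumes that test for illustration; it is not necessarily the exact statistic any tool in this lecture uses, and the function name and counts are hypothetical.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """P(X >= k) when n genes are drawn from a background of N genes,
    K of which carry the annotation term (one-sided hypergeometric test)."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical counts: 10 of 100 user genes carry the term,
# versus 200 of 10000 genes in the reference background.
p = enrichment_p_value(k=10, n=100, K=200, N=10000)
print(f"p = {p:.3g}")
```

A small p-value means the term appears in the user's list more often than the background frequency would predict by chance.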
__________________________________________________________________________________________________
Classification of Enrichment Tools
• Based on differences in their algorithms, current enrichment tools can be broadly divided into three classes:
• Singular enrichment analysis (SEA), e.g. WebGestalt (overrepresentation approach)
• Gene set enrichment analysis (GSEA), e.g. GSEA (aggregate score approach)
• Modular enrichment analysis (MEA), e.g. DAVID (overrepresentation approach)
• Note: some tools with diverse capabilities belong to more than one class
__________________________________________________________________________________________________
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
• Input: the user’s preselected ‘interesting’ genes (e.g. genes differentially expressed between experimental and control samples)
• Iteratively tests the enrichment of each annotation term, one by one, in a linear mode
• Integrates functional enrichment analysis with information visualization
• Constantly updated
• Efficiently processes large gene lists
• Weakness: the output list of terms can be large, which dilutes the focus and obscures the interrelationships of relevant terms
__________________________________________________________________________________________________
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
• DAVID inherits the basic enrichment calculation found in WebGestalt
• Input: user-defined gene list
• Incorporates extra network discovery algorithms that consider term-to-term relationships
• Improves discovery sensitivity and specificity by considering the interrelationships of GO terms in the enrichment calculations
• Joint terms may carry a unique biological meaning for a given study that individual terms do not
• Weakness: not updated in recent years; the user's input gene list is limited to 3000 genes
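One common way to quantify a term-to-term relationship is Cohen's kappa computed over gene membership: two terms that annotate largely the same genes score near 1. The sketch below assumes this measure for illustration only, since the slides do not name the algorithm; the function name, gene symbols and term memberships are hypothetical.

```python
def term_kappa(genes_a, genes_b, all_genes):
    """Cohen's kappa for the agreement of two annotation terms over a gene list."""
    n = len(all_genes)
    both = sum(1 for g in all_genes if g in genes_a and g in genes_b)
    neither = sum(1 for g in all_genes if g not in genes_a and g not in genes_b)
    in_a = sum(1 for g in all_genes if g in genes_a)
    in_b = sum(1 for g in all_genes if g in genes_b)
    observed = (both + neither) / n
    expected = (in_a * in_b + (n - in_a) * (n - in_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both terms label every gene identically
    return (observed - expected) / (1.0 - expected)

# Hypothetical gene list and term memberships.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
term_x = {"G1", "G2", "G3"}
term_y = {"G1", "G2", "G3"}   # same genes as term_x
term_z = {"G4", "G5", "G6"}   # complementary to term_x
print(term_kappa(term_x, term_y, genes), term_kappa(term_x, term_z, genes))
```

Terms with high pairwise agreement can then be grouped, so that redundant annotations are reported together rather than as separate entries.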
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
(http://www.broadinstitute.org/gsea/)
• Identifies the enriched pathways/gene sets between two biological states
• The program uses an underlying database (MSigDB) of about 11,000 gene sets
that include KEGG, BIOCARTA pathways, curated sets from disease states, etc.
__________________________________________________________________________________________________
Seven Broader Collections of GSEA
Using MSigDB
• Search
• Browse
• Examine gene sets
• Investigate
• Download
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
• GSEA program (download to your PC)
• Input: expression dataset (covering two conditions); phenotype labels for the two states; gene sets in gmx/gmt format (MSigDB, supplied by GSEA)
• GSEA implements a ‘no-cutoff’ strategy: it takes all genes from a microarray experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05 and fold change ≥ 2)
• The GSEA method requires one summarized biological value per gene (e.g. fold change)
• Weakness:
• It is sometimes difficult to summarize many biological aspects of a gene into one meaningful value; examples: SNP arrays, clinical microarray studies
• GSEA is less powerful at detecting a gene set that mixes genes with positive and negative associations with the phenotype
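The ‘no-cutoff’ idea can be sketched as follows: every gene gets one summary value and all genes are kept in a single ranked list. The sketch uses log2 fold change, the example named on the slide; GSEA itself offers several ranking metrics. The function name, phenotype labels and expression values are hypothetical, and the values are assumed to be positive and not log-transformed.

```python
from math import log2
from statistics import mean

def rank_by_log2_fold_change(expression, labels, case, control):
    """Return (gene, log2 fold change) pairs, most up-regulated first."""
    case_idx = [i for i, lab in enumerate(labels) if lab == case]
    ctrl_idx = [i for i, lab in enumerate(labels) if lab == control]
    scores = {}
    for gene, values in expression.items():
        case_mean = mean([values[i] for i in case_idx])
        ctrl_mean = mean([values[i] for i in ctrl_idx])
        scores[gene] = log2(case_mean / ctrl_mean)
    # No significance filter: every gene stays in the ranked list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: three genes, two samples per phenotype.
expression = {
    "GENE_A": [8.0, 8.0, 2.0, 2.0],
    "GENE_B": [5.0, 5.0, 5.0, 5.0],
    "GENE_C": [1.0, 1.0, 4.0, 4.0],
}
labels = ["MUT", "MUT", "WT", "WT"]
ranked = rank_by_log2_fold_change(expression, labels, case="MUT", control="WT")
print(ranked)
```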
__________________________________________________________________________________________________
Tutorial
__________________________________________________________________________________________________
WebGestalt: example dataset
• 487 colorectal cancer prognosis genes downloaded from Shi et al. 2012
• 11521 genes as the reference gene set from the protein-protein interaction
network used in the same paper
• Genes are from a human study
__________________________________________________________________________________________________
WebGestalt: WEB-based Gene SeT AnaLysis Toolkit
(http://bioinfo.vanderbilt.edu/webgestalt/)
Upload settings shown in the screenshot: organism hsapiens; ID type hsapiens_gene_symbol; gene list Colorectal_cancer_genes
__________________________________________________________________________________________________
Reference set shown in the screenshot: PPI_network; ID type hsapiens_gene_symbol
__________________________________________________________________________________________________
GO Analysis
Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents
__________________________________________________________________________________________________
KEGG Analysis
Genes highlighted in red in the pathway map are enriched in the user input
__________________________________________________________________________________________________
DAVID: example dataset
• 408 genes involved in the cellular responses to HIV envelope protein
infection in resting or suboptimally activated peripheral blood mononuclear
cells; Cicala et al. 2002
• Affymetrix U95A microarray chip (genome wide expression) as the
reference gene set
__________________________________________________________________________________________________
DAVID: Database for Annotation, Visualization and Integrated
Discovery (https://david.ncifcrf.gov/home.jsp)
__________________________________________________________________________________________________
Gene list upload (screenshot; callouts 1, 2 and 3 mark the steps on the image). Gene list name: HIV_genes
• When multiple species pop up, click on the species of interest and press ‘Select Species’
• If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’
__________________________________________________________________________________________________
Percentage, e.g. 33/398 (involved genes/total genes)
__________________________________________________________________________________________________
KEGG Pathway and BIOCARTA: list genes are shown as red stars
__________________________________________________________________________________________________
Table Report is a gene-centric view that lists the genes and their associated annotation terms (selected only). No statistics are applied in this report
__________________________________________________________________________________________________
• User input genes are classified into big gene functional groups
• Measure of the importance of a gene group in the user’s gene list
• Check whether any other genes in the gene list or in the genome are functionally similar to this gene group
• Key biology of this gene group
• How the members share common annotations/biology
__________________________________________________________________________________________________
GSEA dataset
• Transcriptional profiles from p53+ and p53 mutant cancer cell lines
• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct
'Collapsed' refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with gene symbols
• Phenotype labels (e.g. tumor vs. normal): P53.cls
• Gene set: c1.v2.symbols.gmt
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
(http://www.broadinstitute.org/gsea/)
http://www.broadinstitute.org/gsea/datasets.jsp
__________________________________________________________________________________________________
GCT file format; expression data file
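A minimal sketch of reading this format, assuming the usual tab-delimited GCT layout: a `#1.2` version line, a line with the row and column counts, a header line (NAME, Description, then sample names), and one row per gene or probe. The function name and the toy data are hypothetical.

```python
def parse_gct(text):
    """Parse GCT text into (sample_names, {row_name: [values]})."""
    lines = text.strip().splitlines()
    if lines[0].strip() != "#1.2":
        raise ValueError("expected '#1.2' on the first line")
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    samples = lines[2].split("\t")[2:]          # skip NAME and Description
    data = {}
    for line in lines[3:3 + n_rows]:
        fields = line.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:]]
    if len(samples) != n_cols or len(data) != n_rows:
        raise ValueError("dimensions do not match the header")
    return samples, data

# Hypothetical two-gene, three-sample file.
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "NAME\tDescription\tS1\tS2\tS3\n"
    "TP53\tna\t1.5\t2.0\t0.5\n"
    "MDM2\tna\t3.0\t1.0\t4.0\n"
)
samples, data = parse_gct(gct_text)
print(samples, data)
```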
__________________________________________________________________________________________________
CLS file format; phenotype file
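A minimal sketch of reading a categorical CLS file, assuming the usual three-line layout: the sample and class counts followed by 1, a `#` line naming the classes, and one label per sample in the same order as the columns of the expression file. The function name and the labels are hypothetical.

```python
def parse_cls(text):
    """Parse categorical CLS text into (class_names, per_sample_labels)."""
    lines = text.strip().splitlines()
    n_samples, n_classes = (int(x) for x in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("counts on line 1 do not match lines 2 and 3")
    return class_names, labels

# Hypothetical file: six samples, two phenotype classes.
cls_text = "6 2 1\n# MUT WT\nMUT MUT MUT WT WT WT\n"
class_names, sample_labels = parse_cls(cls_text)
print(class_names, sample_labels)
```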
__________________________________________________________________________________________________
GMT file format; gene sets
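A minimal sketch of reading this format, assuming the usual GMT layout: one gene set per line, tab-delimited, with the set name, a description, and then the member genes. The function name, set names and gene lists are hypothetical.

```python
def parse_gmt(text):
    """Parse GMT text into {gene_set_name: [genes]}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _description, *genes = line.split("\t")
        gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Hypothetical file with two gene sets.
gmt_text = "SET_A\tna\tTP53\tMDM2\tCDKN1A\nSET_B\tna\tBAX\tBBC3\n"
gene_sets = parse_gmt(gmt_text)
print(gene_sets)
```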
__________________________________________________________________________________________________
GSEA: Gene Set Enrichment Analysis
Chip annotation file: ftp.broad.mit.edu://pub/gsea/annotations/HG_U95Av2.chip
__________________________________________________________________________________________________
Interpreting GSEA Results
GSEA Statistics
GSEA computes four key statistics for the gene set enrichment analysis report:
● Enrichment Score (ES)
● Normalized Enrichment Score (NES)
● False Discovery Rate (FDR)
● Nominal P Value
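As a rough illustration of how two of these statistics relate to the ES, the sketch below assumes a null distribution of ES values is already available (for example from permuting the phenotype labels). It normalizes the observed ES by the mean of the same-signed null values and takes the nominal p-value as the fraction of those values at least as extreme. This is a simplified reading of the method, not GSEA's code; FDR estimation is omitted, and the function name and numbers are hypothetical.

```python
def nes_and_p_value(es, null_es):
    """Return (NES, nominal p-value) for an observed ES and null ES values.

    Assumes null_es contains at least one value with the same sign as es.
    """
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    mean_abs = sum(abs(x) for x in same_sign) / len(same_sign)
    nes = es / mean_abs
    p_value = sum(1 for x in same_sign if abs(x) >= abs(es)) / len(same_sign)
    return nes, p_value

# Hypothetical observed ES and a tiny null distribution.
nes, p_value = nes_and_p_value(0.6, [0.2, 0.3, 0.4, 0.7, -0.3, -0.5])
print(nes, p_value)
```

Normalizing in this way makes scores comparable across gene sets of different sizes.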
__________________________________________________________________________________________________
Enrichment plot; Enrichment Score (ES)
• The enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes
• GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not
• The magnitude of the increment depends on the correlation of the gene with the phenotype
• The ES is the maximum deviation from zero encountered in walking the list
• A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list
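The walk down the ranked list can be sketched as below. The increment for a hit is weighted by the absolute value of the gene's score, and the decrement for a miss is assumed to be uniform so that the running sum returns to zero at the end of the list. The function name, the `weight` exponent and the toy data are assumptions for illustration, and the sketch expects at least one hit with a nonzero score and at least one miss.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, running_sum) for one gene set over a ranked gene list."""
    hits = [g in gene_set for g in ranked_genes]
    hit_weights = [abs(s) ** weight if h else 0.0 for s, h in zip(scores, hits)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(hits)
    running, running_sum = 0.0, []
    for w, h in zip(hit_weights, hits):
        # Step up on a hit (scaled by its score), step down on a miss.
        running += w / total_hit if h else -1.0 / n_miss
        running_sum.append(running)
    es = max(running_sum, key=abs)  # maximum deviation from zero, sign kept
    return es, running_sum

# Hypothetical ranked list (best to worst) with its ranking scores.
genes = ["A", "B", "C", "D", "E", "F"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, run_top = enrichment_score(genes, scores, {"A", "B"})
es_bottom, run_bottom = enrichment_score(genes, scores, {"E", "F"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a positive ES and a set concentrated at the bottom yields a negative ES.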
__________________________________________________________________________________________________
GSEA Report
__________________________________________________________________________________________________
• Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)
• The leading edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero
• Outputs heatmaps and set-to-set overlaps of leading edge subsets between pairs of enriched gene sets
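A minimal sketch of extracting a leading edge subset from a running sum. For a positive ES it takes the set members at or before the peak, as described above. For a negative ES it takes the members after the trough, which is an assumption that goes beyond the slide. The function name and toy data are hypothetical.

```python
def leading_edge(ranked_genes, running_sum, gene_set):
    """Return the gene set members that build up the enrichment score."""
    peak = max(range(len(running_sum)), key=lambda i: abs(running_sum[i]))
    if running_sum[peak] >= 0:
        window = ranked_genes[:peak + 1]   # at or before the maximum
    else:
        window = ranked_genes[peak + 1:]   # after the trough (assumed convention)
    return [g for g in window if g in gene_set]

# Hypothetical ranked list and running sums for two gene sets.
genes = ["A", "B", "C", "D", "E", "F"]
edge_top = leading_edge(genes, [0.5, 0.25, 0.75, 0.5, 0.25, 0.0], {"A", "C"})
edge_bottom = leading_edge(genes, [-0.25, -0.5, -0.75, -1.0, -0.5, 0.0], {"E", "F"})
print(edge_top, edge_bottom)
```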
__________________________________________________________________________________________________
Interpreting Leading Edge Analysis Results
Panels shown: HeatMap, Gene in Subsets, Set-to-Set, Histogram
__________________________________________________________________________________________________
Interpreting Leading Edge Analysis Results
Heat map: shows the (clustered) genes in the leading edge subsets. Expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest)
Set-to-Set graph: uses color intensity to show the overlap between subsets; the darker the color, the greater the overlap between the subsets
Gene in Subsets graph: shows each gene and the number of subsets in which it appears
Histogram: the Jaccard index is the intersection divided by the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0)
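The overlap measure used in the histogram, intersection divided by union, can be computed as in the sketch below. The function name and the gene subsets are hypothetical.

```python
def jaccard(subset_a, subset_b):
    """Intersection size divided by union size for two leading edge subsets."""
    a, b = set(subset_a), set(subset_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical leading edge subsets.
overlap = jaccard(["TP53", "MDM2", "BAX"], ["BAX", "BBC3"])
no_overlap = jaccard(["TP53", "MDM2"], ["BAX", "BBC3"])
print(overlap, no_overlap)
```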
__________________________________________________________________________________________________