Harvard School of Public Health
Department of Biostatistics
Program in Quantitative Genomics
“Tutorials for Analyzing Quantitative 'Omic Data”
State-of-the-art Biological
Processes Enrichment Using
Gene Ontology
Pierre R. Bushel, Ph.D.
Microarray and Genome Informatics
Biostatistics Branch
National Institute of Environmental Health Sciences
[email protected]
Course Description
The Gene Ontology (GO) is a biological resource that contains the
annotation (in terms of controlled vocabulary) of the molecular
characteristics of genes and gene products. The tool has been
extremely useful for research investigators to glean insight into the
molecular pathways that govern biological conditions. However, the
topology of GO poses challenges for reliable enrichment of biological
processes. This tutorial will 1) present an overview of GO, 2) touch on
the limitations of typical methods for performing gene set enrichment,
and 3) address key considerations for improved detection of
overrepresented GO terms in a data set. The tutorial will
conclude with a short demonstration of GOEAST, a web-based tool
that performs gene set enrichment analysis while taking the inherent GO
hierarchical structure into account.
Course Outline
•  Presentation: Exploration of biological
processes
–  Overview of the Gene Ontology (GO)
–  Modeling enrichment/over-representation of
biological categories
–  Limitations of certain approaches
–  Leveraging the GO topology
•  Demonstration: GOEAST
–  Overview of, and tutorial on, GOEAST
–  Experimental design of a gene expression study
–  Analysis of differentially expressed genes
Exploration of Biological Processes
A Widely Used Biological Resource
•  The Gene Ontology (GO) Consortium was established in
1998 to develop a shared, structured vocabulary (an
ontology) for the annotation of molecular
characteristics across different organisms.
–  a collaborative effort to address the need for
consistent descriptions of genes and gene products
in different databases
–  Original members of the consortium: SGD,
FlyBase and MGD
•  Two primary purposes for an ontology:
1.  to facilitate communication between people and
organizations
2.  to improve interoperability between
systems
Goals of the GO Initiative
1.  to compile a comprehensive structured vocabulary of terms
describing different elements of molecular biology that are shared
among life forms
2.  to describe biological objects (in the model organism database of
each contributing member) using these terms
3.  to provide tools for querying and manipulating these vocabularies
4.  to provide tools enabling curators to assign GO terms to
biological objects
What GO is not
•  GO is not a way to unify biological databases
•  Not a dictated standard derived from the self-interest of users to
mandate nomenclature across databases
•  Does not serve to define homologies between gene products
from different organisms.
Structure of GO
•  The ontologies are structured
vocabularies in the form of directed
acyclic graphs (DAGs)
•  The DAG represents a network (not
a tree) in which each term may be a
child of one or more than one parent
•  The relationships of child to parent
can be of the “is a” type or the “part
of” type
Toy Example of a Relationship in GO
•  Each node in the graph contains some genes
•  The parent of a node contains all the genes of its children
•  A node can contain genes that are not found in the children
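The three properties in the toy example can be sketched in a few lines of Python. The term names (root, A, B, C) and gene names (g1 to g4) are hypothetical, not taken from GO; the sketch only illustrates how genes annotated to a child are inherited by every ancestor in a DAG where a term may have more than one parent.

```python
# Toy GO-like DAG: each term lists its parents (a term may have several).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],  # two parents, so the graph is a network, not a tree
}

# Genes annotated directly to each term (hypothetical gene names).
DIRECT = {
    "root": set(),
    "A": {"g1"},
    "B": {"g2"},
    "C": {"g3", "g4"},
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def propagate(direct, parents):
    """Give each term its direct genes plus the genes of all its descendants."""
    full = {term: set(genes) for term, genes in direct.items()}
    for term, genes in direct.items():
        for anc in ancestors(term, parents):
            full[anc] |= genes
    return full


full = propagate(DIRECT, PARENTS)
for term in sorted(full):
    print(term, sorted(full[term]))
```

Here A ends up with g1 (its own annotation, not found in any child) plus g3 and g4 inherited from C.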
Ontologies within GO
•  molecular function describing
activities, such as catalytic or
binding activities, at the molecular
level
•  biological process referring to a
biological objective to which the
gene product contributes
•  cellular component referring to the
place in the cell (i.e. the location)
where a gene product is found
http://www.geneontology.org
Gene Expression to Enriched
Biological Processes
[Figure: from gene expression data (including next-gen RNA-Seq) to biological processes. Adapted from Werner (2008), Current Opinion in Biotech., 19:50-64]
Input: Gene list from microarray or
RNA-Seq
– Cluster of genes with
similar expression
– Up/down regulated genes
Question:
– Are GO biological process terms
overrepresented in the gene list?
Methods:
– Hypergeometric (parametric) test
– Kolmogorov-Smirnov (nonparametric) test
Hypergeometric Distribution
A discrete probability distribution that describes the number of successes
(k) in a sequence of n draws from a finite population without replacement:
$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$
for k = 0, 1, 2, …, n, with k <= m and n-k <= N-m
•  An urn with two types of marbles:
   –  4 red
   –  8 white
•  Drawing a red marble is a success!
•  Drawing a white marble is a failure!
•  N is the total # of marbles (population size)
•  m is the # of red marbles (# of successes [red
marbles] in the population)
•  n is the # of marbles randomly selected out
of the urn (sample size)
•  k is the # of successes (red marbles) in the
sample
P(k=2, n=6, m=4, N=12) = 0.455
Cumulative dist.: P(k<=2) = 0.727
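The two urn figures can be checked with a short Python sketch of the standard hypergeometric probability mass function, C(m, k) C(N-m, n-k) / C(N, n). The function names are illustrative, and the argument order (k, n, m, N) follows the slide.

```python
from math import comb


def hypergeom_pmf(k, n, m, N):
    """P(X = k): k successes in n draws from N items, m of which are successes."""
    if k < 0 or k > n or k > m or n - k > N - m:
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def hypergeom_cdf(k, n, m, N):
    """P(X <= k), summed term by term from the pmf."""
    return sum(hypergeom_pmf(i, n, m, N) for i in range(0, k + 1))


# Urn with 4 red and 8 white marbles, 6 drawn without replacement.
print(round(hypergeom_pmf(2, 6, 4, 12), 3))  # 0.455
print(round(hypergeom_cdf(2, 6, 4, 12), 3))  # 0.727
```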
Hypergeometric Distribution
in the context of gene expression
Example
80 (n) DEGs
20000 (N) gene array
10 (k) of 80 (n) genes annotated to oxidative stress
100 (m) genes annotated with oxidative stress on the array
p = 2.17 × 10^-13
Models the probability of observing k genes from a cluster
of n genes by chance in a pathway or biological process
category containing m genes from a total genome (or array)
size of N genes.
The closer the probability (p-value) is to 0, the less likely it is
that the observed overlap arose by chance, i.e. the stronger the
evidence that the cluster is enriched for that biological function
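In enrichment testing the quantity usually reported is the upper tail, the probability of seeing k or more annotated genes by chance. A minimal sketch, assuming that convention (the slide does not state which tail or software produced its figure), applied to the N, m, n and k of the oxidative stress example:

```python
from math import comb


def enrichment_pvalue(k, n, m, N):
    """Upper-tail hypergeometric probability P(X >= k).

    k: annotated genes in the cluster, n: cluster size,
    m: annotated genes on the array, N: genes on the array.
    """
    upper = min(n, m)
    # Exact integer arithmetic; only the final ratio is converted to a float.
    numer = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return numer / comb(N, n)


# Values from the oxidative stress example.
p = enrichment_pvalue(k=10, n=80, m=100, N=20000)
print(f"{p:.3g}")
```

The printed value is deliberately not quoted here; it depends on the tail convention and need not agree with the figure on the slide.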
Fisher’s Exact Test (FET)
one-tailed, right
Contingency table

                 DEGs    Not DEGs    totals
With GO term     a       b           a+b
w/o GO term      c       d           c+d
totals           a+c     b+d         n (genes on array)

Fisher showed that the probability of obtaining any such
arrangement of the values is given by the hypergeometric
distribution
Calculate significance of each GO term independently
Account for multiple testing using a Bonferroni correction or
false discovery rate (FDR)
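As a rough sketch of the workflow on this slide: a right-tailed Fisher's exact test computed from the contingency table cells a, b, c, d, followed by Bonferroni and Benjamini-Hochberg adjustment. Benjamini-Hochberg is one common FDR procedure and is an assumption here (the slide only says "FDR"); the four p-values are hypothetical.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-tailed (right) Fisher's exact test for the table [[a, b], [c, d]]."""
    with_term = a + b           # row total: genes with the GO term
    degs = a + c                # column total: differentially expressed genes
    total = a + b + c + d       # all genes on the array
    upper = min(with_term, degs)
    numer = sum(
        comb(with_term, i) * comb(total - with_term, degs - i)
        for i in range(a, upper + 1)
    )
    return numer / comb(total, degs)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(round(fisher_right_tail(2, 2, 4, 4), 3))
raw = [0.01, 0.04, 0.03, 0.5]   # hypothetical per-term p-values
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```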
Biological Processes Over-represented
NIAID’s DAVID Database
Database for Annotation, Visualization and Integrated Discovery
http://david.abcc.ncifcrf.gov/
Gene Set Enrichment Analysis
•  Determines whether an a priori defined set of
genes shows statistically significant,
concordant differences
between two biological states (i.e.
phenotypes)
•  Sets of genes can be those within a
pathway, biological process, etc.
•  Statistical significance determined by
permutation (shuffling of the data)
•  GSEA-P: www.broadinstitute.org/gsea
Gene Set Enrichment Analysis Strategy
•  The genes are ordered on the basis of the parameter
from the statistical test
•  For each gene set, compute an enrichment score (ES): a
measure of how relevant or associated a biological
process is for discerning the difference between the
two biological states
•  Essentially the max running sum of a normalized
Kolmogorov-Smirnov (non-parametric test) statistic.
•  Permute the class labels a large # of times, each time
recording the maximum ES over all gene sets.
•  Compare the observed ES score to the distribution of
the ES scores from the permuted data.
•  Test the hypothesis that no gene set is associated with
the class distinction
Mootha et al., Nature Genetics, 2003, 34(3):267-273
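The running-sum idea can be sketched as follows. This is a simplified illustration, not the GSEA-P implementation: the step sizes follow the normalized Kolmogorov-Smirnov form described for the original method, and the null distribution is built by shuffling the gene order rather than permuting class labels as the strategy above prescribes, because the sketch has no expression matrix to re-rank. Gene names and the gene set are hypothetical.

```python
import random
from math import sqrt


def enrichment_score(ranked_genes, gene_set):
    """Maximum of a normalized Kolmogorov-Smirnov-style running sum."""
    total = len(ranked_genes)
    hits = sum(1 for gene in ranked_genes if gene in gene_set)
    if hits == 0 or hits == total:
        return 0.0
    step_up = sqrt((total - hits) / hits)    # added when a gene is in the set
    step_down = sqrt(hits / (total - hits))  # subtracted otherwise
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += step_up if gene in gene_set else -step_down
        best = max(best, running)
    return best


def permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Compare the observed ES with ES values from shuffled gene orders."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    shuffled = list(ranked_genes)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if enrichment_score(shuffled, gene_set) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)


ranked = [f"g{i}" for i in range(20)]   # genes ordered by the test statistic
gene_set = {"g0", "g1", "g2", "g3"}     # all members sit at the top of the list
es, pval = permutation_pvalue(ranked, gene_set)
print(es, pval)
```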
Enriched Biological Processes in the Samples
[Figure: enriched biological processes, with samples grouped as adipocyte-like and osteoblast-like]
1000 permutations of the gene sets, log2 ratio of classes, exclusion [15,500], FDR < 25%
Major Concerns
•  GO hierarchy
–  Typical methods treat each term independently and hence
ignore the structure of the GO hierarchy
•  Correlation among genes
–  The methods assume that the genes are
uncorrelated
•  Permutation/bootstrap resampling
–  Loses power with small sample size
–  Requires a logical null hypothesis for
reliable results
–  Can be computationally expensive
Parent Child Consideration for a GO term
pa denotes the parent of a GO term; Npa and npa are the numbers of genes
annotated to the parent in the population and in the study set
To calculate significance, sum the probabilities of detecting k or more
annotations, up to min(m, npa)
If the GO term has more than one parent:
a) define the set of parents of a term as the union of the genes
annotated to the parents (parent-child-union): Npa and npa = # of genes
annotated to any of the parents, in their respective sets
b) define the set of parents of a term as the intersection of the
genes annotated to the parents (parent-child-intersection): take into account only
the genes annotated to all the parents (the common/overlap set).
Grossmann et al., Bioinformatics, 2007, 23(22):3024-3031
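A minimal sketch of the parent-child idea, assuming the hypergeometric test is simply re-run with the parent's genes as the population (Npa) and the parent's study genes as the sample (npa). Function and variable names are hypothetical, at least one parent is required, and the term's genes are assumed to be contained in each parent's genes.

```python
from math import comb


def parent_child_pvalue(term_genes, parent_gene_sets, study, population,
                        mode="union"):
    """P(X >= k) for a term, conditioned on the genes annotated to its parents."""
    parents = [set(p) & population for p in parent_gene_sets]
    if mode == "union":
        pa = set().union(*parents)        # genes annotated to any parent
    else:
        pa = set.intersection(*parents)   # genes annotated to all parents
    term = set(term_genes) & pa
    N_pa = len(pa)                        # parent genes in the population
    n_pa = len(pa & study)                # parent genes in the study set
    m = len(term)                         # term genes in the population
    k = len(term & study)                 # term genes in the study set
    numer = sum(
        comb(m, i) * comb(N_pa - m, n_pa - i)
        for i in range(k, min(m, n_pa) + 1)
    )
    return numer / comb(N_pa, n_pa)


population = {f"g{i}" for i in range(1, 13)}
parent_1 = {f"g{i}" for i in range(1, 9)}     # g1..g8
parent_2 = {f"g{i}" for i in range(5, 11)}    # g5..g10
term = {"g5", "g6", "g7"}                     # contained in both parents
study = {"g1", "g5", "g6", "g11"}

p_union = parent_child_pvalue(term, [parent_1, parent_2], study, population)
p_inter = parent_child_pvalue(term, [parent_1, parent_2], study, population,
                              mode="intersection")
print(round(p_union, 4), round(p_inter, 4))
```

The intersection variant uses a smaller reference set, so the same term can look less surprising than it does against the union of its parents.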
Leveraging the GO Topology
•  Adrian Alexa (Max-Planck-Institute for Informatics)
developed two algorithms that use the
GO topology
–  Assess local dependencies of GO terms
•  Parent-child and neighboring GO terms
•  Apply weights to account for the local
dependencies
•  Implemented in the R package
topGO
http://www.bioconductor.org/packages/2.2/bioc/html/topGO.html
Two GO Term Weighting Algorithms
elim algorithm
• Nodes are processed bottom-up in the GO graph
• Removes the genes annotated to significant GO terms
from more general GO terms.
weight algorithm
• The genes are weighted by their relevance in the
significant nodes.
• The enrichment score of a node u is compared with the
scores of its children.
• Children with a better score than u represent the
interesting genes better. Therefore, their significance is
increased
• Children with a lower score than u have their significance
reduced.
Alexa et al., Bioinformatics, 2006, 22(13):1600-1607
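The elim idea can be sketched in Python as follows. This is a simplified illustration, not the topGO implementation: terms are visited deepest first, each is tested with an upper-tail hypergeometric test, and the genes of any significant term are removed from all of its ancestors before those are tested. The DAG, gene names and alpha value are hypothetical.

```python
from math import comb


def upper_tail(k, n, m, N):
    """Hypergeometric P(X >= k) for n draws, m successes among N items."""
    numer = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, min(n, m) + 1))
    return numer / comb(N, n)


def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), list(parents[term])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def depth(term, parents):
    """Length of the longest path from a root down to `term`."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])


def elim(parents, genes, study, population, alpha=0.01):
    """Bottom-up testing; `genes` maps each term to its full gene set."""
    removed = {term: set() for term in parents}
    pvalues = {}
    n = len(study & population)
    # Deepest terms first, so every child is tested before its parents.
    for term in sorted(parents, key=lambda t: depth(t, parents), reverse=True):
        kept = (genes[term] - removed[term]) & population
        pvalues[term] = upper_tail(len(kept & study), n, len(kept), len(population))
        if pvalues[term] < alpha:
            for anc in ancestors(term, parents):
                removed[anc] |= genes[term]
    return pvalues


parents = {"root": [], "A": ["root"], "C": ["A"]}
population = {f"g{i}" for i in range(1, 21)}
genes = {
    "root": set(population),
    "A": {"g1", "g2", "g3", "g4", "g5", "g6"},
    "C": {"g1", "g2", "g3", "g4"},
}
study = {"g1", "g2", "g3", "g4"}

result = elim(parents, genes, study, population, alpha=0.05)
classic_A = upper_tail(4, 4, 6, 20)   # term A tested on its own
print(result, classic_A)
```

In this toy case A looks significant when tested on its own, but once the genes of its significant child C are removed, nothing in the study set is left to support it.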
Decorrelating GO
Graph Structure
•  Apply weight to children (genes) of parent
(gene node) u
•  Children with better scores than u represent
the more interesting genes
•  These genes should contribute less to the
enrichment score of any ancestor of u
•  These genes receive smaller weights in all
ancestors of node u, so those ancestors are less
likely to be reported as significant
•  The score is recomputed based on the newly
assigned weights
Methods Supported by topGO
Algorithms
Statistical tests
•  classic – doesn’t account for GO topology
(independent of significant neighboring nodes)
•  elim – weights restricted to 0 and 1
•  weight (WT) – weights range between 0 and 1
•  topgo (WT01) – combines elim and weight approaches
•  parentChild (PC) – intersection or union
Classic vs Elimination Method
Classic vs Weight Method
Top 20 Biological Processes Ordered by
p-value from Weight (WT) Scheme
A Few Other Software Tools/Approaches
•  GeneGo MetaCore
•  GOMiner
•  GOEx – specifically for proteomics
•  BiNGO & ClueGO
– integrated with Cytoscape
•  FunCluster
•  sigPathway & GOStat
– R/Bioconductor based
•  FuncAssociate
•  FatiGO
•  GOEAST
•  Gene Set Analysis (GSA)
Demo: GOEAST
Institute of Genetics and Developmental Biology
Chinese Academy of Sciences
http://omicslab.genetics.ac.cn/GOEAST/
Demo: GOEAST
•  Web-based tool for Gene Ontology enrichment analysis
•  Easy-to-use interface; results are returned as a web link by email
•  Supports analysis of data from various sources:
–  Affymetrix, Illumina, Agilent and other customized microarrays
–  non-microarray based experimental data as well
•  Provides visualization of results
•  Supports comparison of multiple analyses/experiments
Differential Expression
Shockley et al., J. Cell. Biochem, 2009, 106:232-246
peroxisome proliferator-activated
receptor gamma
Rosiglitazone
•  mouse cell lines stably
transfected with PPAR-γ2
•  Affymetrix Mouse 430 v2
arrays (~45K probe sets)
•  Data RMA normalized
3730 classic 18 minutes
Top 500 classic 11 minutes
Top500 WT 15mins
ANOVA model
3730 DE probe sets (Q-value < 0.01)
[1634 up-regulated, 2096 down-regulated]
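A small sketch of how a probe set list like this might be prepared for upload: filter on the Q-value cutoff, optionally keep the top-ranked entries, and split by direction of change. The record layout, the probe IDs and the choice to rank by Q-value are assumptions made for illustration; the slide does not say how its lists were built.

```python
def select_degs(records, q_cutoff=0.01, top=None):
    """Filter and split differentially expressed probe sets.

    records: iterable of (probe_id, q_value, log2_fold_change) tuples.
    Returns two lists of probe IDs: up-regulated and down-regulated.
    """
    significant = sorted(
        (rec for rec in records if rec[1] < q_cutoff),
        key=lambda rec: rec[1],          # smallest Q-value first
    )
    if top is not None:
        significant = significant[:top]
    up = [probe for probe, _, lfc in significant if lfc > 0]
    down = [probe for probe, _, lfc in significant if lfc < 0]
    return up, down


# Hypothetical records; real input would come from the ANOVA results table.
records = [
    ("probe_1", 0.0004, 1.8),
    ("probe_2", 0.2, 0.1),
    ("probe_3", 0.003, -2.1),
    ("probe_4", 0.009, 0.7),
    ("probe_5", 0.0001, -1.2),
]

up, down = select_degs(records, top=3)
# One ID per line, ready to paste into a web form.
print("\n".join(up + down))
```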
Open a web browser and navigate to:
http://omicslab.genetics.ac.cn/GOEAST/
At the header tool bar, click on Tutorial
Click on tools on the side menu bar to display platforms
Click here to start an Affymetrix gene expression based
analysis
Step 1) Choose the species.
Select Mus musculus as the species.
The web page will dynamically populate the platforms available for this species
Step 2) Choose the microarray platform
Select mouse genome 430 2.0 Array as the microarray platform.
Step 3) Select the background (population) type. Leave whole chip selected
Step 4) Upload probe set.
Cut and paste the list of Affymetrix probe set IDs for the differentially expressed
genes into the entry box.
Use probe set IDs in 72_hr_DEGs_Q_top_500.txt
Use the default option for parameter settings.
Step 5) Enter a valid email address that you have access to.
Give the analysis a distinct name to identify it from the results that will be emailed
to you.
Click Start analysis
An email notification is sent when the analysis is complete
Click on the hyperlink to display the results in your web browser
Use Advanced settings to run GO enrichment using Adrian Alexa’s
weighting approach.
Enriched Categories Based on GO Topology
Enriched Categories Based on Classic Way
Acknowledgements
•  Dr. Xihong Lin for the invitation to present
•  Dr. Adita Hazra for the opportunity to give the tutorial
•  Ms. Shaina Andelman for the wonderful travel arrangements
•  GOEAST development team for use of their server
•  Dr. Keith Shockley for the gene expression data