
Harvard School of Public Health
Department of Biostatistics
Program in Quantitative Genomics
“Tutorials for Analyzing Quantitative 'Omic Data”

State-of-the-art Biological Processes Enrichment Using Gene Ontology

Pierre R. Bushel, Ph.D.
Microarray and Genome Informatics
Biostatistics Branch
National Institute of Environmental Health Sciences
[email protected]

Course Description

The Gene Ontology (GO) is a biological resource that contains the annotation (in terms of a controlled vocabulary) of the molecular characteristics of genes and gene products. The resource has been extremely useful for research investigators seeking insight into the molecular pathways that govern biological conditions. However, the topology of GO poses challenges for reliable enrichment of biological processes. This tutorial will 1) present an overview of GO, 2) touch on the limitations of typical methods for performing gene set enrichment and 3) address key considerations for more reliably detecting overrepresented GO terms in a data set. The tutorial will conclude with a short demonstration of GOEAST, a web-based tool that performs gene set enrichment analysis while taking the inherent GO hierarchical structure into account.

Course Outline

•  Presentation: Exploration of biological processes
   –  Overview of the Gene Ontology (GO)
   –  Modeling enrichment/over-representation of biological categories
   –  Limitations of certain approaches
   –  Leveraging the GO topology
•  Demonstration: GOEAST
   –  Overview of, and tutorial on, GOEAST
   –  Experimental design of a gene expression study
   –  Analysis of differentially expressed genes

Exploration of Biological Processes

A Widely Used Biological Resource

•  The Gene Ontology (GO) Consortium was established in 1998 to develop a shared, structured vocabulary (an ontology) for the annotation of molecular characteristics across different organisms.
   –  a collaborative effort to address the need for consistent descriptions of genes and gene products in different databases
   –  Original members of the consortium: SGD, FlyBase and MGD
•  Two primary purposes for an ontology:
   1.  to facilitate communication between people and organizations
   2.  to improve the interoperability between systems

Goals of the GO Initiative

1.  to compile a comprehensive structured vocabulary of terms describing different elements of molecular biology that are shared among life forms
2.  to describe biological objects (in the model organism database of each contributing member) using these terms
3.  to provide tools for querying and manipulating these vocabularies
4.  to provide tools enabling curators to assign GO terms to biological objects

What GO is not

•  GO is not a way to unify biological databases
•  GO is not a dictated standard, derived from the self-interest of users, that mandates nomenclature across databases
•  GO does not serve to define homologies between gene products from different organisms

Structure of GO

•  The ontologies are structured vocabularies in the form of directed acyclic graphs (DAGs)
•  The DAG represents a network (not a tree) in which each term may be a child of one or more parents
•  The relationship of child to parent can be of the “is a” type or the “part of” type

Toy Example of a Relationship in GO

•  Each node in the graph contains some genes
•  The parent of a node contains all the genes of its children
•  A node can contain genes that are not found in its children

Ontologies within GO

•  molecular function: describes activities, such as catalytic or binding activities, at the molecular level
•  biological process: refers to a biological objective to which the gene product contributes
•  cellular component: refers to the place in the cell (i.e. the location) where a gene product is found

http://www.geneontology.org

Gene Expression to Enriched Biological Processes

[Figure: from gene expression data (including Next Gen RNA-Seq) to biological processes. Adapted from Werner (2008), Current Opinion in Biotech., 19:50-64]

Input: Gene list from microarray or RNA-Seq
   –  Cluster of genes with similar expression
   –  Up/down regulated genes

Question:
   –  Are GO biological process terms overrepresented in the gene list?

Methods:
   –  Hypergeometric (parametric) test
   –  Kolmogorov-Smirnov (nonparametric) test

Hypergeometric Distribution

A discrete probability distribution that describes the number of successes (k) in a sequence of n draws from a finite population without replacement:

$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$

for k = 0, 1, 2, …, n, with k ≤ m and n − k ≤ N − m.

•  An urn with two types of marbles:
   –  4 red
   –  8 white
•  Drawing a red marble is a success!
•  Drawing a white marble is a failure!
•  N is the total # of marbles (population size)
•  m is the # of red marbles (# of successes [red marbles] in the population)
•  n is the # of marbles randomly selected out of the urn (sample size)
•  k is the # of successes (red marbles) in the sample

P(k=2, n=6, m=4, N=12) = 0.455
Cumulative dist.: P(k<=2) = 0.727

Hypergeometric Distribution in the context of gene expression

Example:
•  80 (n) differentially expressed genes (DEGs)
•  20000 (N) gene array
•  10 (k) of the 80 (n) genes annotated to oxidative stress
•  100 (m) genes annotated with oxidative stress on the array
•  p = 2.17 × 10^-13

Models the probability of observing, by chance, k genes from a cluster of n genes in a pathway or biological process category containing m genes, given a total genome (or array) size of N genes.
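The urn numbers above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names (`hypergeom_pmf`, `hypergeom_cdf`, `enrichment_pvalue`) are illustrative and are not taken from the tutorial or from any particular package. The upper-tail sum is the quantity usually reported as an over-representation p-value.

```python
from math import comb


def hypergeom_pmf(k, n, m, N):
    """P(X = k): exactly k successes in n draws without replacement
    from a population of N items, m of which are successes."""
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


def hypergeom_cdf(k, n, m, N):
    """P(X <= k): cumulative probability of at most k successes."""
    return sum(hypergeom_pmf(i, n, m, N) for i in range(0, k + 1))


def enrichment_pvalue(k, n, m, N):
    """P(X >= k): upper tail, i.e. the chance of seeing k or more
    successes in the sample if draws were purely random."""
    return sum(hypergeom_pmf(i, n, m, N) for i in range(k, min(n, m) + 1))


# Urn example: 12 marbles (N), 4 red (m), 6 drawn (n), 2 red in the sample (k)
p_exact = hypergeom_pmf(2, 6, 4, 12)
p_cum = hypergeom_cdf(2, 6, 4, 12)
print(round(p_exact, 3), round(p_cum, 3))  # 0.455 0.727
```

For a gene list, the same functions apply with N as the number of genes on the array, m the genes annotated to the category, n the number of DEGs and k the number of DEGs annotated to the category.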
The closer the probability (p-value) is to 0, the less likely it is that the observed overlap arose by chance, i.e. the more strongly the biological function is enriched among the genes in the cluster.

Fisher’s Exact Test (FET), one-tailed (right)

Contingency table:

                 DEGs    Not DEGs   Totals
With GO term     a       b          a+b
w/o GO term      c       d          c+d
Totals           a+c     b+d        n (genes on array)

•  Fisher showed that the probability of obtaining any such arrangement of the values is given by the hypergeometric distribution
•  Calculate the significance of each GO term independently
•  Account for multiple testing using a Bonferroni correction or the false discovery rate (FDR)

Biological Processes Over-represented

NIAID’s DAVID Database: Database for Annotation, Visualization and Integrated Discovery
http://david.abcc.ncifcrf.gov/

Gene Set Enrichment Analysis

•  Determines whether an a priori defined set of genes shows statistically significant (presumably concordant) differences between two biological states (i.e. phenotype distinction)
•  Sets of genes can be those within a pathway, biological process, etc.
•  Statistical significance is determined by permutation (shuffling of the data)
•  GSEA-P: www.broadinstitute.org/gsea

Gene Set Enrichment Analysis Strategy

•  The genes are ordered on the basis of the parameter from the statistical test
•  For each gene set, compute an enrichment score (ES): a measure of how relevant or associated a biological process is for discerning the difference between the two biological states
•  The ES is essentially the maximum running sum of a normalized Kolmogorov-Smirnov (nonparametric) statistic
•  Permute the class labels a large # of times, each time recording the maximum ES over all gene sets
•  Compare the observed ES to the distribution of the ES values from the permuted data
•  Test the hypothesis that no gene set is associated with the class distinction

Mootha et al., Nature Genetics, 2003, 34(3):267-273

Enriched Biological Processes in the Samples

[Figure: enriched biological processes in adipocyte-like and osteoblast-like samples]
Settings: 1000 permutations of the gene sets, log2 ratio of classes, exclusion [15,500], FDR < 25%

Major Concerns

•  GO hierarchy
   –  Typical methods treat each term independently and hence ignore the structure of the GO hierarchy
•  Correlation among genes
   –  The methods assume that the genes are uncorrelated
•  Permutation/bootstrap resampling
   –  Loses power with small sample sizes
   –  Requires a logical null hypothesis for reliable results
   –  Can be computationally expensive

Parent Child Consideration for a GO term

•  pa denotes the parent of a GO term
•  To calculate significance, sum over the probabilities of detecting n_pa or more annotations up to min(m, n_pa)
•  If the GO term has more than one parent:
   a) define the set of parents of a term as the union of the genes annotated to the parents (parent-child-union): N_pa and n_pa = # of genes annotated to any of the parents of their respective sets
   b) define the set of parents of a term as the intersection of the genes annotated to the parents (parent-child-intersection): take into account only the genes annotated to all the parents (the common/overlap set)

Grossmann et al., Bioinformatics, 2007, 23(22):3024-3031

Leveraging the GO Topology

•  Adrian Alexa (Max-Planck-Institute for Informatics) developed two algorithms that use the GO topology
   –  Assess local dependencies of GO terms
      •  Parent, child and neighboring GO terms
      •  Apply weights to account for the local dependencies
•  Implemented in the R package topGO
   http://www.bioconductor.org/packages/2.2/bioc/html/topGO.html

Two GO Term Weighting Algorithms

elim algorithm
•  Nodes are processed bottom-up in the GO graph
•  Genes annotated to significant GO terms are removed from more general GO terms
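The elim idea can be illustrated on a toy three-term graph. The following is a simplified sketch, not topGO's implementation: the gene identifiers, the 40-gene universe, the significance cutoff `alpha` and the helper names (`upper_tail`, `depth`, `ancestors`, `elim`) are all assumptions made for illustration, and each term is scored with a plain one-tailed hypergeometric test against a fixed universe.

```python
from math import comb


def upper_tail(k, n, m, N):
    """One-tailed hypergeometric p-value P(X >= k)."""
    total = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, min(n, m) + 1))
    return total / comb(N, n)


def depth(term, parents):
    """Length of the longest path from a root down to `term`."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])


def ancestors(term, parents):
    """All terms reachable by following parent links upward."""
    found = set()
    for p in parents[term]:
        found.add(p)
        found |= ancestors(p, parents)
    return found


def elim(parents, genes_of, universe, de_genes, alpha=0.01):
    """Score terms bottom-up; when a term is significant, remove its
    genes from every ancestor before that ancestor is scored."""
    removed = {term: set() for term in parents}
    pvals = {}
    for term in sorted(parents, key=lambda t: depth(t, parents), reverse=True):
        kept = genes_of[term] - removed[term]
        k = len(kept & de_genes)
        pvals[term] = upper_tail(k, len(de_genes), len(kept), len(universe))
        if pvals[term] < alpha:
            for anc in ancestors(term, parents):
                removed[anc] |= genes_of[term]
    return pvals


# Hypothetical toy data: 40 genes, a chain of three terms, 5 DE genes
universe = {f"g{i}" for i in range(40)}
parents = {
    "biological_process": [],
    "response to stress": ["biological_process"],
    "response to oxidative stress": ["response to stress"],
}
genes_of = {
    "biological_process": set(universe),
    "response to stress": {f"g{i}" for i in range(10)},
    "response to oxidative stress": {f"g{i}" for i in range(5)},
}
de_genes = {f"g{i}" for i in range(5)}

# Classic test on the parent term, ignoring the graph structure
classic_stress = upper_tail(5, len(de_genes), 10, len(universe))
elim_p = elim(parents, genes_of, universe, de_genes)
for term, p in elim_p.items():
    print(f"{term}: {p:.3g}")
```

In this toy case the classic test flags both "response to oxidative stress" and its parent, whereas the elim pass leaves only the more specific term significant, because the parent is scored without the genes already explained by its child.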
weight algorithm
•  The genes are weighted by their relevance in the significant nodes
•  The enrichment score of a node u is compared with the scores of its children
•  Children with a better score than u represent the interesting genes better; therefore, their significance is increased
•  Children with a lower score than u have their significance reduced

Alexa et al., Bioinformatics, 2006, 22(13):1600-1607

Decorrelating GO Graph Structure

•  Apply weights to the children (genes) of parent (gene node) u
•  Children with better scores than u represent the more interesting genes
•  These genes should contribute less to the enrichment score of any ancestor of u
•  The genes receive smaller weights in all ancestors of node u, and hence those ancestors should not be reported as significant
•  The score is recomputed based on the newly assigned weights

Methods Supported by topGO

[Table: algorithms and statistical tests]

Algorithms:
•  classic: doesn’t account for GO topology (independent of significant neighboring nodes)
•  elim: weights restricted to 0 and 1
•  weight (WT): weights range between 0 and 1
•  topgo (WT01): combines the elim and weight approaches
•  parentChild (PC): intersection or union

[Figure: Classic vs Elimination Method]

[Figure: Classic vs Weight Method]

[Figure: Top 20 Biological Processes Ordered by p-value from Weight (WT) Scheme]

A Few Other Software Tools/Approaches

•  GeneGo MetaCore
•  GOMiner
•  GOEx: specifically for proteomics
•  BiNGO & ClueGO: integrated with Cytoscape
•  FunCluster
•  sigPathway & GOStat: R/Bioconductor based
•  FuncAssociate
•  FatiGO
•  GOEAST
•  Gene Set Analysis (GSA)

Demo: GOEAST

Institute of Genetics and Developmental Biology
Chinese Academy of Sciences
http://omicslab.genetics.ac.cn/GOEAST/

•  Web-based tool for Gene Ontology enrichment analysis
•  Easy-to-use interface; results are returned as a web link sent by email
•  Supports analysis of data from various resources
   –  Affymetrix, Illumina, Agilent and other customized microarrays
   –  non-microarray based experimental data as well
•  Provides visualization of results
•  Supports comparison of multiple analyses/experiments

Differential Expression

Shockley et al., J. Cell. Biochem, 2009, 106:232-246

Peroxisome proliferator-activated receptor gamma (PPAR-γ) and Rosiglitazone
•  mouse cell lines stably transfected with PPAR-γ2
•  Affymetrix Mouse 430 v2 arrays (~45K probe sets)
•  Data RMA normalized
•  ANOVA model: 3730 DE probe sets (Q-value < 0.01) [1634 up-regulated, 2096 down-regulated]

Analysis times:
•  3730 probe sets, classic: 18 minutes
•  Top 500, classic: 11 minutes
•  Top 500, WT: 15 minutes

GOEAST walk-through

•  Open a web browser and navigate to: http://omicslab.genetics.ac.cn/GOEAST/
•  At the header tool bar, click on Tutorial
•  Click on tools on the side menu bar to display the platforms
•  Click the link to start an Affymetrix gene expression based analysis

Step 1) Choose the species. Select Mus musculus as the species. The web page will dynamically populate the platforms available for this species.
Step 2) Choose the microarray platform. Select mouse genome 430 2.0 Array as the microarray platform.
Step 3) Select the background (population) type. Leave whole chip selected.
Step 4) Upload the probe sets. Copy and paste the list of Affymetrix probe set IDs for the differentially expressed genes into the entry box. Use the probe set IDs in 72_hr_DEGs_Q_top_500.txt. Use the default options for the parameter settings.
Step 5) Enter a valid email address that you have access to. Give the analysis a distinct name so that you can identify it among the results that will be emailed to you. Click Start analysis.

•  An email notification is sent when the analysis is complete
•  Click on the hyperlink in the email to display the results in your web browser
•  Use Advanced settings to run GO enrichment using Adrian Alexa’s weighting approach
[Figure: Enriched Categories Based on GO Topology]

[Figure: Enriched Categories Based on Classic Way]

Acknowledgements

•  Dr. Xihong Lin for the invitation to present
•  Dr. Adita Hazra for the opportunity to give the tutorial
•  Ms. Shaina Andelman for the wonderful travel arrangements
•  GOEAST development team for use of their server
•  Dr. Keith Shockley for the gene expression data
