Tools and Algorithms in Bioinformatics
GCBA815, Fall 2015
Week-8: WebGestalt, DAVID, Gene Set Enrichment Analysis (GSEA)
Simarjeet K. Negi, Ph.D. candidate (Guda Lab)
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center
10/16/2015 GCBA 815

Why perform enrichment analysis?
• Large gene lists resulting from high-throughput analysis
• Deciphering the biology
• Organize expression changes into meaningful functional themes
• Gene enrichment analysis increases the likelihood of identifying the molecular processes/functions most pertinent to the study

Principle of Enrichment Analysis
• If a biological process is abnormal in a given study, the co-functioning genes should have a higher (enriched) potential to be selected as a relevant group by the high-throughput screening technologies
• The analytic conclusion is based on a group of relevant genes, which increases the likelihood of identifying the biological processes most pertinent to the study
• Enrichment tools map a large number of ‘interesting’ genes to biological annotation terms (e.g.
GO Terms or Pathways)
• Statistical examination of the enrichment of the user's genes for each annotation term, by comparing the outcome to the control (or reference) background

Classification of Enrichment Tools
• Based on differences in their algorithms, current enrichment tools can be broadly divided into three classes:
• Singular enrichment analysis (SEA); WebGestalt
• Gene set enrichment analysis (GSEA); GSEA
• Modular enrichment analysis (MEA); DAVID
• SEA and MEA are overrepresentation approaches; GSEA is an aggregate score approach
• Note: some tools with diverse capabilities belong to more than one class

WebGestalt: WEB-based Gene SeT AnaLysis Toolkit (http://bioinfo.vanderbilt.edu/webgestalt/)
• Input: user’s preselected (e.g.
differentially expressed genes selected between experimental versus control samples) ‘interesting’ genes
• Iteratively tests the enrichment of each annotation term, one by one, in a linear mode
• Integrates functional enrichment analysis with information visualization
• Constantly updated
• Efficiently processes large gene lists
• Weakness: the output of terms can be large, thereby diluting the focus and obscuring the interrelationships of relevant terms

DAVID: Database for Annotation, Visualization and Integrated Discovery (https://david.ncifcrf.gov/home.jsp)
• DAVID inherits the basic enrichment calculation found in WebGestalt
• Input: user-defined gene list
• Incorporates extra network discovery algorithms by considering the term-to-term relationships
• Improves discovery sensitivity and specificity by considering interrelationships of GO terms in the enrichment calculations
• Joint terms may contain unique biological meaning for a given study, not held by individual terms
• Weakness: not updated in recent years; user input gene list size is limited to 3000 genes

GSEA: Gene Set Enrichment Analysis (http://www.broadinstitute.org/gsea/)
• Identifies the enriched pathways/gene sets between two biological states
• The program uses an underlying database (MSigDB) of about 11,000 gene sets that include KEGG and BIOCARTA pathways, curated sets from disease states, etc.
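
For contrast with GSEA, the per-term overrepresentation calculation behind SEA tools can be sketched as a one-sided hypergeometric test. The slides do not say which statistic WebGestalt or DAVID applies, so the choice of test here is an assumption (it is a common one for this kind of analysis), and the function name and all counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(list_size, list_hits, ref_size, ref_hits):
    """One-sided p-value P(X >= list_hits) for a single annotation term.

    list_size: genes in the user's 'interesting' list
    list_hits: list genes annotated with the term
    ref_size:  genes in the reference (background) set
    ref_hits:  reference genes annotated with the term
    """
    if not (0 <= list_hits <= list_size <= ref_size and 0 <= ref_hits <= ref_size):
        raise ValueError("inconsistent counts")
    max_hits = min(list_size, ref_hits)
    total = comb(ref_size, list_size)
    # Sum the hypergeometric probabilities of seeing list_hits or more.
    tail = sum(
        comb(ref_hits, k) * comb(ref_size - ref_hits, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical term: 300 of 11521 reference genes carry it,
# and 30 of a 487-gene list do.
p_value = hypergeom_enrichment_p(487, 30, 11521, 300)
print(f"p = {p_value:.3g}")
```

Because the test is repeated for every annotation term, the resulting p-values would normally be corrected for multiple testing before interpretation.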

Seven Broader Collections of GSEA
Using MSigDB:
• Search
• Browse
• Examine gene sets
• Investigate
• Download

GSEA: Gene Set Enrichment Analysis
• GSEA program (download to your PC)
• Input: expression dataset (between two conditions); phenotype labels for the two states; gene sets in gmx/gmt format (MSigDB, supplied by GSEA)
• GSEA implements a ‘no-cutoff’ strategy, taking all genes from a microarray experiment without first selecting significant genes (e.g. genes with P-value ≤ 0.05 and fold change ≥ 2)
• The GSEA method requires a summarized biological value (e.g. fold change) for each gene
• Weakness:
  • It is sometimes difficult to summarize many biological aspects of a gene into one meaningful value; examples: SNP arrays, clinical microarray studies
  • GSEA is less powerful at detecting a gene set that mixes genes with positive and negative associations with the phenotype

Tutorial

WebGestalt: example dataset
• 487 colorectal cancer prognosis genes downloaded from Shi et al.
2012
• 11521 genes as the reference gene set, taken from the protein-protein interaction network used in the same paper
• Genes are from a human study

WebGestalt: WEB-based Gene SeT AnaLysis Toolkit (http://bioinfo.vanderbilt.edu/webgestalt/)
• Gene list settings shown: hsapiens; hsapiens_gene_symbol; Colorectal_cancer_genes
• Reference set settings shown: PPI_network; hsapiens_gene_symbol

GO Analysis
• Nodes with red labels represent enriched categories; nodes with black labels represent their non-enriched parents

KEGG Analysis
• Genes highlighted in red in the pathway map are enriched in the user input

DAVID: example dataset
• 408 genes involved in the cellular responses to HIV envelope protein infection in resting or suboptimally activated peripheral blood mononuclear cells; Cicala et al.
2002
• Affymetrix U95A microarray chip (genome-wide expression) as the reference gene set

DAVID: Database for Annotation, Visualization and Integrated Discovery (https://david.ncifcrf.gov/home.jsp)
• Uploaded gene list: HIV_genes
• When multiple species pop up, click on the species of interest and press ‘Select Species’
• If multiple gene lists are open in the program, select the gene list of interest and click on ‘Use’
• Percentage, e.g. 33/398 (involved genes/total genes)
• KEGG Pathway and BIOCARTA maps: list genes are shown as red stars
• Table Report is a gene-centric view which lists the genes and their associated annotation terms (selected only).
No statistics are applied in this report.

Gene functional groups in DAVID:
• User input genes are classified into big gene functional groups
• Measure of the importance of a gene group in the user’s gene list
• Check whether any other genes in the gene list or in the genome are functionally similar to this gene group
• Key biology of this gene group
• How the members share common annotations/biology

GSEA dataset
• Transcriptional profiles from p53+ and p53 mutant cancer cell lines
• Expression datasets: P53_hgu95av2.gct, P53_collapsed.gct. ‘Collapsed’ refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with symbols
• Phenotype labels (e.g. tumor vs. normal): P53.cls
• Gene set: c1.v2.symbols.gmt

GSEA: Gene Set Enrichment Analysis (http://www.broadinstitute.org/gsea/)
• Example datasets: http://www.broadinstitute.org/gsea/datasets.jsp
• GCT file format; expression data file
• CLS file format; phenotype file
• GMT file format; gene sets
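
To make two of the input formats concrete, here is a minimal sketch of readers for GMT and CLS files. It assumes the usual layouts: GMT is tab-delimited with one gene set per line (name, description, then genes), and a categorical CLS file has three lines (counts, class names after a `#`, then one label per sample). The function names and the toy file contents are hypothetical and are not taken from the P53 example files.

```python
import io


def parse_gmt(handle):
    """Return {gene_set_name: [genes]} from a tab-delimited GMT file."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def parse_cls(handle):
    """Return (class_names, labels) from a categorical CLS file."""
    lines = [line.strip() for line in handle if line.strip()]
    header, names_line, labels_line = lines[:3]
    n_samples, n_classes, _ = (int(x) for x in header.split())
    class_names = names_line.lstrip("#").split()
    labels = labels_line.split()
    if len(labels) != n_samples or len(class_names) != n_classes:
        raise ValueError("CLS header does not match its contents")
    return class_names, labels


# Toy inputs (hypothetical names, not real MSigDB content).
toy_gmt = io.StringIO("SET_A\tna\tGENE1\tGENE2\tGENE3\nSET_B\tna\tGENE2\tGENE4\n")
toy_cls = io.StringIO("4 2 1\n# MUT WT\nMUT MUT WT WT\n")

gene_sets = parse_gmt(toy_gmt)
class_names, labels = parse_cls(toy_cls)
print(gene_sets)
print(class_names, labels)
```

The GSEA desktop program reads these files directly; a parser like this is only useful for inspecting or sanity-checking inputs before a run.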

Chip annotation: ftp.broad.mit.edu://pub/gsea/annotations/HG_U95Av2.chip

Interpreting GSEA Results
GSEA Statistics
GSEA computes four key statistics for the gene set enrichment analysis report:
• Enrichment Score (ES)
• Normalized Enrichment Score (NES)
• False Discovery Rate (FDR)
• Nominal P Value

Enrichment plot; Enrichment Score (ES)
• The enrichment score (ES) reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes
• GSEA calculates the ES by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not
• The magnitude of the increment depends on the correlation of the gene with the phenotype
• The ES is the maximum deviation from zero encountered in walking the list
• A positive ES indicates gene set enrichment at the top of the ranked list; a negative ES indicates gene set enrichment at the bottom of the ranked list

GSEA Report
• Leading edge analysis identifies the subset of genes that actually contribute to the enrichment score (ES)
• The leading edge subset of a gene set consists of those genes that appear in the ranked list at or before the point at which the running sum reaches its maximum deviation from zero
• Outputs heat maps and set-to-set overlaps of leading edge subsets between pairs of enriched gene sets
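
The running-sum walk and the leading edge subset described above can be sketched as follows. This is a simplified illustration, not the GSEA program's code: it assumes the weighted form in which each hit is incremented in proportion to |correlation| raised to a power `weight`, and each miss is decremented by a constant. For a negative ES it takes the set members at or after the peak, mirroring the positive case. The function name, gene names and correlation values are hypothetical.

```python
def enrichment_score(ranked_genes, correlations, gene_set, weight=1.0):
    """Return (ES, leading_edge_genes, running_sum_profile).

    ranked_genes must already be sorted by correlation with the phenotype,
    from most positive to most negative; correlations is in the same order.
    """
    members = set(gene_set)
    hit_weights = [
        abs(c) ** weight if g in members else 0.0
        for g, c in zip(ranked_genes, correlations)
    ]
    n_hits = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hits
    hit_total = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or hit_total == 0:
        raise ValueError("gene set must partially overlap the ranked list")

    running, es, peak = 0.0, 0.0, -1
    profile = []
    for i, gene in enumerate(ranked_genes):
        if gene in members:
            running += hit_weights[i] / hit_total  # step up on a hit
        else:
            running -= 1.0 / n_miss                # step down on a miss
        profile.append(running)
        if abs(running) > abs(es):                 # track max deviation from zero
            es, peak = running, i

    if es >= 0:
        leading = [g for g in ranked_genes[: peak + 1] if g in members]
    else:
        leading = [g for g in ranked_genes[peak:] if g in members]
    return es, leading, profile


# Hypothetical six-gene ranking with made-up correlations.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corrs = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, edge_top, _ = enrichment_score(genes, corrs, {"g1", "g2"})
es_bottom, edge_bottom, _ = enrichment_score(genes, corrs, {"g5", "g6"})
print(es_top, edge_top)
print(es_bottom, edge_bottom)
```

The NES, nominal P value and FDR in the report come from comparing this ES against a null distribution built by permutation, which this sketch leaves out.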

Interpreting Leading Edge Analysis Results
Panels: Heat Map, Gene in Subsets, Set-to-Set, Histogram
• Heat map: shows the (clustered) genes in the leading edge subsets. Expression values are represented as colors, where the range of colors (red, pink, light blue, dark blue) shows the range of expression values (high, moderate, low, lowest)
• Set-to-Set graph: uses color intensity to show the overlap between subsets; the darker the color, the greater the overlap between the subsets
• Gene in Subsets graph: shows each gene and the number of subsets in which it appears
• Histogram: the Jaccard index is the size of the intersection divided by the size of the union for a pair of leading edge subsets. Number of Occurrences is the number of leading edge subset pairs in a particular bin. In this example, most subset pairs have no overlap (Jaccard = 0)
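
The Jaccard index behind the histogram is simple to compute by hand. The sketch below scores every pair of leading edge subsets; the subset names and gene names are hypothetical, and the helper names are illustrative rather than part of GSEA.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two gene sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(subsets):
    """Return {(name1, name2): jaccard} for every pair of named subsets."""
    return {
        (n1, n2): jaccard(subsets[n1], subsets[n2])
        for n1, n2 in combinations(sorted(subsets), 2)
    }


# Hypothetical leading edge subsets from three enriched gene sets.
leading_edges = {
    "SET_A": {"GENE1", "GENE2", "GENE3"},
    "SET_B": {"GENE2", "GENE3", "GENE4"},
    "SET_C": {"GENE5"},
}
overlaps = pairwise_jaccard(leading_edges)
for pair, score in overlaps.items():
    print(pair, score)
```

Binning these pairwise scores gives the histogram; a pile-up at 0 means the enriched gene sets are driven by largely different genes.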