Functional Enrichment and Pathway Analysis – I
Daniele Merico, PhD, Molecular and Cellular Biology
Post-doctoral Research Fellow, CCBR, U. of T.

Outline of these lectures
Goal: identifying functional “themes” and “patterns” in microarray data
Lesson 1: Gene-set Enrichment Analysis
• Data sources
• Statistical methods
• Visualization
Lesson 2: Networks and Pathways
• Networks: data sources and visualization
• Pathways

PART 1
Introduction
How do we relate microarray expression data to biological function?

Analysis Workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the Differential Genes
5. Identify the Functional Groups

From differential genes to biological functions
- How do my data relate to known biological functions?
- Are there specific functions that are characterized by gene expression changes?
Identification of Functional Groups
GENE SETS (this lecture)
- Example sets: Spindle (Gene.1, Gene.2, Gene.3); P53 signaling (Gene.2, Gene.4, Gene.5)
- Score the set depending on the gene expression of its member genes
NETWORKS (next week's lecture)
- Just visual, or
- Identify modules satisfying some joint gene expression and topology requirement
PATHWAYS (next week's lecture)
- Just visual, or
- Score the pathways exploiting gene expression and topology

PART 2
Gene-set Enrichment Analysis
What is gene-set enrichment analysis? How does it help in interpreting microarray data?

What’s Gene-set Enrichment Analysis?
Break down cellular function into gene sets
- Every set of genes is associated with a specific cellular function, process, component or pathway
Example gene-sets:
- Nuclear Pore: Gene.AAA, Gene.ABA, Gene.ABC
- Cell Cycle: Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5
- Ribosome: Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4
- P53 signaling: Gene.CC1, Gene.CK1, Gene.PPP

What’s Gene-set Enrichment Analysis?
Microarray data can be related to gene sets in order to mine their functional meaning
- Which gene-sets best summarize the gene expression patterns?
- This is the meaning of significant enrichment
- We will see the “statistical” definition of enrichment in PART 4

PART 3
Gene-set Enrichment: Data
What data sources are available for gene-set enrichment analysis?

Gene-set Data Sources
Break down cellular function into gene sets
- Where can I get these gene-sets?
- How were the gene-sets compiled?
- How are they structured?
Gene Ontology (GO)
Gene Ontology is:
- a hierarchically structured
  • Functional categories are organized hierarchically, i.e. as a system of inter-related sets with increasing specificity of scope (parent-child relations)
- controlled vocabulary
  • Functional categories are defined by experts, and must then be used consistently for annotation
- for gene product function annotation
  • Gene products (e.g. proteins) are annotated using GO functional categories (“terms”)
- general for all species

Gene Ontology: Example
Terms are organized hierarchically
- Terms at the top are more general, terms at the bottom are narrower in scope
- If a protein is annotated as Spindle, the annotation should also be inferred automatically for all ancestors of Spindle (up-propagation)

Gene Ontology: Example
Gene Ontology and the corresponding gene-sets
[Figure: a PARENT term and a CHILD term with their gene-sets; genes ABB1, C5A75, DUCZ, ACAP3, LUC2 are shown next to PARENT, genes TRAC1, POF5, ZUMM next to CHILD]
- The set corresponding to the CHILD is a subset of the one corresponding to the PARENT

Gene Ontology: Partitions
GO has three independent partitions, which are not interconnected:
- Molecular Function
  • Describes biochemical activities, in-vitro binding specificities, etc.
  • Examples: Ligase Activity, Kinase Activity, DNA Binding
- Cellular Component
  • Describes parts of the cell
  • Examples: Mitochondrion, Spindle Microtubule
- Biological Process
  • Describes processes at the intra-cellular and organism level
  • Examples: DNA Replication, Apoptosis, Development
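The up-propagation rule described above (a gene annotated to a term is also assigned to every ancestor of that term) can be sketched in a few lines of Python. This is only an illustrative sketch: the three-term hierarchy (CHILD, PARENT, ROOT) is hypothetical, the function names `ancestors` and `up_propagate` are made up for this example, and treating the slide's gene lists as primary annotations is an assumption.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return all terms reachable from `term` by following child -> parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def up_propagate(primary, parents):
    """Turn primary annotations into gene-sets that include inherited annotations."""
    gene_sets = defaultdict(set)
    for term, genes in primary.items():
        # Each gene goes to its own term and to every ancestor of that term.
        for target in {term} | ancestors(term, parents):
            gene_sets[target] |= set(genes)
    return dict(gene_sets)


# Hypothetical toy hierarchy: CHILD is a child of PARENT, PARENT a child of ROOT.
parents = {"CHILD": ["PARENT"], "PARENT": ["ROOT"]}

# Assumed primary annotations (gene names borrowed from the slide figure).
primary = {
    "PARENT": {"ABB1", "C5A75", "DUCZ", "ACAP3", "LUC2"},
    "CHILD": {"TRAC1", "POF5", "ZUMM"},
}

gene_sets = up_propagate(primary, parents)
print(sorted(gene_sets["CHILD"]))
print(len(gene_sets["PARENT"]))
```

After propagation the CHILD set is contained in the PARENT set, which is the subset relation stated on the slide. Real GO is a directed acyclic graph in which a term can have several parents; the `parents` mapping above allows for that by storing a list per term.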
Gene Ontology: Partitions
[Figure: first-level children (list) of MOLECULAR FUNCTION, CELLULAR COMPONENT and BIOLOGICAL PROCESS]

Gene Ontology Levels
Every partition has several levels: ROOT, LEVEL-1, LEVEL-2, ..., LEVEL-N

Gene Ontology Levels
However, terms at the same level don’t necessarily have the same degree of granularity (i.e. specificity of scope)
[Figure: under BIOLOGICAL PROCESS, the terms SIGNALING, IMMUNE SYSTEM PROCESS and PIGMENTATION sit at the same level but have different granularity]

Gene Ontology Annotations
How are genes annotated with GO terms?
- Human curators go through the literature, mining it for gene functions
- Different genomic databases take part in this effort
- Evidence Codes are used to keep track of the type of evidence for annotation
- IEA annotations are directly imported from databases, without human curation
Important Note: primary annotations are not propagated using the ontology; therefore, when you download GO gene-sets, always make sure that up-propagation was done

Gene Ontology Evidence Codes
• ISS: Inferred from Sequence/Structural Similarity
• IDA: Inferred from Direct Assay
• IPI: Inferred from Physical Interaction
• IMP: Inferred from Mutant Phenotype
• IGI: Inferred from Genetic Interaction
• IEP: Inferred from Expression Pattern
• TAS: Traceable Author Statement
• NAS: Non-traceable Author Statement
• IC: Inferred by Curator
• ND: No Data available
• IEA: Inferred from Electronic Annotation
More at: http://www.geneontology.org/GO.evidence.shtml

Gene Ontology Evidence Codes
How should I use evidence codes?
- Quality Filter for Gene-set Enrichment
  • IEA (electronic) annotations are sometimes considered less reliable, and are not used for analysis
  • However, this should be evaluated very carefully and cannot be generalized
- Gene Browsing
  • If you are interested in the function of a specific gene, you can check whether multiple lines of evidence are available

Annotation Inheritance
There are primary and inherited annotations
- Primary Annotations
  • Originally defined by curators
- Inherited Annotations
  • Propagated up the hierarchy, from each term to its ancestors (up-propagation)
Always check whether the Gene Ontology annotation resource you are using includes inherited annotations!

Annotation Inheritance
- Primary Annotation: Spindle
- Inherited Annotations: Microtubule Cytoskeleton, Cytoskeletal Part, Cytoskeleton, Intracellular Organelle Part, …

Gene Ontology: Multi-function
Besides hierarchical term organization, genes can be multi-functional, i.e. annotated by many independent terms
- The following slide shows an excerpt of the annotations of p53 (the “Guardian of the Genome”), as reported by the NCBI database Entrez-Gene
http://www.ncbi.nlm.nih.gov/gene/7157

Gene Ontology: Statistics
Terms (http://www.geneontology.org/GO.downloads.ontology.shtml)
- Total Terms: 29,922
- Molecular Function: 8,688
- Cellular Component: 2,689
- Biological Process: 18,545
Annotated Genes (Entrez-Gene)
- Human: 17,482
- Mouse: 18,028

Exploring Gene Ontology: QuickGO
http://www.ebi.ac.uk/QuickGO/

Gene-sets: Beyond Gene Ontology
There are many other sources and types of gene-sets:
- Pathways (e.g. KEGG)
- Protein Families / Domains (e.g. PFAM)
- Predicted Targets of Regulators (e.g. MSigDB-c3)
  • miRNA, Transcription Factors
- Protein-protein Interaction Modules
- Gene Expression
  • Up/down after treatment or in relation to disease (e.g. MSigDB-c2)
  • Co-expression across many conditions (e.g. MSigDB-c4)
- Genotype-phenotype association (e.g. DiseaseHub)
- Genomic position (e.g. MSigDB-c1)

Pathways and GO Biol. Process
How do pathways and processes differ?
- From a purely biological perspective, the question is philosophical (still worth speculating…)
- From a bioinformatics perspective:
  • A gene is annotated for a GO Biological Process if the curators deem that it contributes (significantly) to the process (which is at the cellular or organ level), according to several lines of evidence
  • Pathways include the “wiring” of genes/gene products, hence they rely on a more intensive curation process
  • Some pathways include large ubiquitous actors (such as the proteasome) that may confound enrichment analysis, whereas these are usually absent from GO processes
A pathway example: the MAPK cascade in KEGG (http://www.genome.jp/kegg/pathway/hsa/hsa04010.html)

Major Gene-set Resources A-Z
• Bioconductor
  - GO: GO.db + org.Xx.eg.db (org.Xx.egGO2ALLEGS)
  - KEGG: KEGG.db + org.Xx.eg.db (org.Xx.egPATH)
  - PFAM: PFAMEDE + org.Xx.eg.db (org.Xx.egPFAM)
  - Note: Xx has to be replaced with the species id {Hs, Mm, Rn, etc.}
• DiseaseHub (http://zldev.ccbr.utoronto.ca/~ddong/diseaseHub/)
  - Phenotype-genotype (OMIM, GAD, HGMD, PharmGKB, CGP, GWAS)
• MSigDB (http://broad.harvard.edu/gsea/msigdb/index.jsp)
  - GO (*no IEA), Pathways (KEGG, Biocarta, STKE, GenMAPP, PharmGKB, GEArray), Predicted Targets (miRNA: ?, TF: Transfac), Gene Expression, Genomic Positions
• PathwayCommons (http://www.pathwaycommons.org/pc-snapshot/gsea/by_species/)
  - Pathways: Reactome, NCI, Cell map
• WhichGenes (www.whichgenes.org)
  - GO, Pathways (KEGG, Biocarta, Reactome), Genomic Positions, Regulators (miRNA: TargetScan, miRBase), Phenotype-genotype (geneCards Disease, CancerGenes)

Exploring MSigDB (1)
http://broad.harvard.edu/gsea/msigdb/index.jsp
Exploring MSigDB (2)
Alzheimer
Exploring MSigDB (3)
Select this gene-set
Exploring MSigDB (4)
Exploring MSigDB (5)
I now want to see how the gene-set I was interested in overlaps with other gene-sets in the collection (I selected only a few types)
Exploring MSigDB (6)
We will see how this p-value is computed and what it means in the next
part (enrichment methods)

Gene-set Resources
Tips to navigate the resource ocean / 1:
- Start your analysis using only a few reliable sources (e.g. GO, KEGG)
  • GO also has very large gene coverage
- After the first-pass analysis, expand your gene-set collection to the types you are interested in
- Don’t try everything together from the beginning
- Remember quality and clarity!
  • Target predictions may be unreliable
  • Gene expression-derived sets are often hard to interpret

Gene-set Resources
Tips to navigate the resource ocean / 2:
- If you are confident with R, start from Bioconductor, and supplement what is missing by shopping around
  • GO: Bioconductor
  • Pathways: Pathway Commons
  • Phenotype-genotype: DiseaseHub
  • Gene Expression: MSigDB
Useful scripts available at:
http://baderlab.org/DanieleMerico/Code/Bioc2GMT
http://baderlab.org/DanieleMerico/Code/Read_GMT

Gene-set Resources
Tips to navigate the resource ocean / 2:
- If you are not confident with R, and you are a GSEA user, use MSigDB and Pathway Commons
  • From both resources you can download GMT files (GMT is the format used by GSEA)
  • Remember that GO gene-sets in MSigDB do not have IEA-backed annotations
- Both Bioconductor and MSigDB incorporate GO inherited annotations (up-propagated)

Summary of PART 3
Gene-set Data Sources
- Gene Ontology, a hierarchically structured controlled vocabulary for gene function annotation, is the main source of gene-sets
- Other valuable sources are available, such as pathway databases
In the next part we will see how to use gene-sets for enrichment analysis…

Now, take a… And ready to dive again!

PART 4
Gene-set Enrichment: Methods
What statistical methods can I use to score gene-sets for enrichment?
Enrichment Test
[Diagram: Microarray Experiment (gene expression table) + Gene-set Databases --> ENRICHMENT TEST --> Enrichment Table]
- Microarray experiment: experimental data
- Gene-set databases: a priori knowledge + existing experimental data
- Enrichment table: interpretation & hypotheses
Example enrichment table:
Gene-set | p-value
Spindle | 0.00001
Apoptosis | 0.00025
Example gene-sets:
- Spindle: SPP1, SPP2, CCCP, MTC1, …
- Apoptosis: FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …

Enrichment Test
How do we go from the microarray experiment (gene expression table) to the enrichment test? It depends on the design:
- Two-class Design: Expression Matrix (Class-1, Class-2) --> Genes Ranked by Differential Statistic (UP ... DOWN) --> Selection by Threshold (UP, DOWN)
  • E.g.: Fold change, Log (ratio), t-test
- Time-course Design: Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters
  • E.g.: K-means, K-medoids, SOM
- Other Designs: Expression Matrix --> Significant Genes
  • E.g.: ANOVA, Linear Model

Enrichment Test
[Diagram: the array genes are split into significant genes (e.g. UP) and background genes (array genes not significant); a gene-set from the gene-set databases overlaps both groups]
- Overlap between significant genes and gene-set
- Is this overlap larger than expected by random sampling the array genes?
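The question above can be made concrete with a small Python sketch. It estimates how often a random sample of array genes overlaps the gene-set at least as much as the significant genes do, and compares that estimate with the exact tail probability of the hypergeometric distribution, which the following slides introduce through Fisher's Exact Test. Everything here is an assumption made for illustration: the toy array of 1000 genes named `g0` to `g999`, the gene-set of 40 genes, the 100 significant genes, and the function names `sampling_pvalue` and `hypergeom_tail`.

```python
import random
from math import comb


def hypergeom_tail(n_array, n_set, n_sig, overlap):
    """Exact P(X >= overlap) when n_sig genes are drawn without replacement
    from n_array array genes, n_set of which belong to the gene-set."""
    upper = min(n_set, n_sig)
    total = comb(n_array, n_sig)
    favourable = sum(
        comb(n_set, k) * comb(n_array - n_set, n_sig - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


def sampling_pvalue(array_genes, gene_set, n_sig, overlap, n_draws=2000, seed=0):
    """Fraction of random samples of n_sig array genes whose overlap with
    the gene-set is at least as large as the observed overlap."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        sample = rng.sample(array_genes, n_sig)
        if len(gene_set.intersection(sample)) >= overlap:
            hits += 1
    return hits / n_draws


# Toy data (hypothetical): 1000 array genes, a gene-set of 40 genes,
# 100 significant genes of which 10 fall in the gene-set.
array_genes = [f"g{i}" for i in range(1000)]
gene_set = set(array_genes[:40])
significant = set(array_genes[:10]) | set(array_genes[500:590])
overlap = len(significant & gene_set)

p_exact = hypergeom_tail(len(array_genes), len(gene_set), len(significant), overlap)
p_sim = sampling_pvalue(array_genes, gene_set, len(significant), overlap)
print(overlap, p_exact, p_sim)
```

With these toy numbers the expected overlap under random sampling is 100 × 40 / 1000 = 4 genes, so an observed overlap of 10 is unusual. The simulated and exact values should agree up to sampling noise, which is why the random sampling does not actually have to be performed in practice.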
Enrichment Test
[Diagram: the overlap between the significant genes and the gene-set is compared with the overlap obtained from a random sample of array genes]

Statistical Model: Fisher’s Exact Test
Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (Hypergeometric Distribution)
http://en.wikipedia.org/wiki/Fisher's_exact_test

Fisher’s Exact Test For Gene-set Enrichment
[Figure: 2x2 table with cells a, b, c, d]
Enrichment P-value:
$$p = \frac{(a+b)!\,(a+c)!\,(c+d)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}, \qquad n = a + b + c + d$$
© by Black Box Inc.
a, b, c, d are the sizes of the four subsets (each subset has a different color)
Note: this expression is the probability of observing exactly the table (a, b, c, d); the enrichment p-value sums it over all tables with an overlap at least as large as the observed one
R: help(fisher.test)
MEMO: P-value ~ 0 --> significant; P-value ~ 1 --> not significant

Fisher’s Exact Test For Gene-set Overlap
We can also use Fisher’s Exact Test to evaluate the overlap between gene-sets from databases
Going back to MSigDB… Now we know where these p-values come from!

Web Resources for Fisher’s Exact Test
- ConceptGen: http://conceptgen.ncibi.org/core/conceptGen/index.jsp
  Note: free account required
- DAVID: http://david.abcc.ncifcrf.gov/summary.jsp
  Note: a thorough description of how to use it is given in this paper: Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57. PMID: 19131956

Beyond Fisher’s Test
- Threshold-dependent enrichment test (separate UP and DOWN gene lists): e.g. Fisher’s Test
- Whole-distribution enrichment test (whole ranked list, from UP to DOWN): e.g.
GSEA

Beyond Fisher’s Test
Whole-distribution methods have been shown to be more stable and statistically powerful than threshold-dependent methods, which suffer from:
- No “natural” value for the threshold
- Different results at different threshold settings
- Loss of information due to thresholding
  • No resolution between significant signals with different strengths
  • Weak signals neglected
--> Use whole-distribution methods whenever possible

GSEA Enrichment Test / 1
Two ways to obtain the Ranked Gene List:
- Two-class comparison: Expression Matrix (Class-1, Class-2)
  • Fold change, Log (ratio), t-test, SAM
- Correlation to phenotype: Expression Matrix + Quantitative Phenotype
  • Pearson correlation

GSEA Enrichment Test / 2
Ranked Gene List + Gene-set Databases --> GSEA --> Enrichment Table
Gene-set | p-value | FDR
Spindle | 0.0001 | 0.01
Apoptosis | 0.025 | 0.09
- The p-value depends only on the single gene-set performance
- The FDR depends on the performance of all gene-sets

GSEA: Method
Steps
1. Calculate the ES score
2. Generate the ES distribution for the null hypothesis using permutations
   • see permutation settings
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)

GSEA: Method
ES score calculation
- Where are the gene-set genes located in the ranked list?
- Is their distribution random, or is there an enrichment at either end?
- Every present gene (black vertical bar) gives a positive contribution, and every absent gene (no vertical bar) a negative contribution, to the running ES score
- MAX running ES score --> Final ES Score
- High ES score <--> High local enrichment

GSEA: Method
Empirical p-value estimation (for every gene-set)
1.
Generate null-hypothesis distribution from randomized data (see permutation settings)
   [Histogram: distribution of ES from N permutations (e.g. 2000); x-axis: ES Score, y-axis: number of instances]
2. Estimate empirical p-value
   [Histogram: the real ES score value is marked on the distribution of ES from N permutations (e.g. 2000)]
   Randomized with ES ≥ real: 4 / 2000 --> Empirical p-value = 0.002

GSEA Settings: Permutation
Permutation settings have important implications, which we will not discuss in detail
Practical suggestions:
- When biological replicates are very similar within classes and classes are well separated --> gene permutation
- When biological replicates tend to be dissimilar, or stratified according to hidden experimental factors --> use other whole-distribution enrichment methods of self-contained type (e.g. SAM-GS)

GSEA Settings: Gene-set Filter
Gene-sets for enrichment analysis are usually filtered by size
- Large gene-sets are undesired if they are derived from Gene Ontology or other functional resources, as they usually correspond to uninformative concepts (e.g.
Regulation of Biopolymer Catabolism)
- Small gene-sets are undesired as their statistics are quite noisy, and they may worsen the FDR of other sets
- See the Using GSEA section for the specific values of the size filtering settings

Using GSEA
Installation
Launch Desktop Application from: http://www.broadinstitute.org/gsea/msigdb/downloads.jsp
Notes:
- if you have sufficient RAM (*), go for the 1 GB option
- running GSEA will take some time (2-5 hrs depending on the system and the memory setting)
- you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL / Task Manager
    MAC: check using Applications/Utilities/Activity Monitor

Using GSEA
Data Format
There are three data files you will need:
- Gene-set (.GMT)
- Gene Expression Table (.txt)
- Gene Expression Phenotypes (.CLS)
The format requirements follow.
More on GSEA data formats: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

Using GSEA
Data Format: gene-set file (.GMT)
Syntax: <<GS-Name>> [\tab] <<GS-Description>> [\tab] <<Gene-ID>> [\tab] <<Gene-ID>>
Notes:
• Either use the gene-set ID for the Name (e.g. GO ID) and the gene-set full name for the Description
• Or use the gene-set full name for the Name and the source database for the Description
Example:
regulation of DNA recombination [\tab] GO:0000018 [\tab] 604 [\tab] 641 [\tab] 3458
transition metal ion transport [\tab] GO:0000041 [\tab] 475 [\tab] 538 [\tab] 540

Using GSEA
Data Format: gene expression table file (.txt)
Syntax: table
<<NAME>> [\tab] <<DESCRIPTION>> [\tab] <<Value-S1>> [\tab] <<Value-S2>>
Notes:
• Use the gene ID for the Name (e.g.
EntrezGene ID) and the gene symbol and/or full name for the Description
• I recommend using EntrezGene IDs, for a number of reasons
• Gene IDs must be consistent between the GMT and this file
Example: [table image not recovered]

Using GSEA
Data Format: expression phenotypes file (.CLS)
- Line 1: number of samples, number of classes, always 1
- Line 2: # followed by the class labels
- Line 3: phenotype labels for all samples in the gene expression table
Example:
9 3 1
# Tg-A Tg-B WT
Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT
Use space as separator

Using GSEA
Load the data

Using GSEA
Run the analysis – Parameter setting / 1
- Load gene-set (.GMT) file here
- Load gene expression table here
- Load phenotype file (.CLS) here
- [Screenshot annotations: 2000; gene-set]
- If your gene expression table has probe IDs already matching the .GMT file, you don’t need this.
- If your gene expression table has probe IDs already matching the .GMT file, set this to FALSE.

Using GSEA
Run the analysis – Parameter setting / 2
- Differential statistic. Use t-test (or signal-to-noise) if you have at least 3 replicates.
- [Gene-set size filter, lower bound] 10 is usually good. Keep between 7-8 and 15.
- [Gene-set size filter, upper bound] 600 is usually good. Keep between 500 and 800.

Using GSEA
GSEA Pre-ranked
- If you wish to use a statistic for differential expression other than those provided by GSEA, you can do so using the Pre-ranked mode
More on GSEA pre-ranked data format: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#RNK:_Ranked_list_file_format_.28.2A.rnk.29

Summary of PART 4
Methods for Gene-set Enrichment
- Fisher’s Exact Test can be used for any given set of experimental genes
- When possible, use GSEA to achieve greater power
- Both GSEA and Fisher’s Exact Test require scoring genes for significance/differential expression; how this is done depends on the microarray design

Now, take a… And ready to dive again!

PART 5
Gene-set Enrichment: Visualization
How to use enrichment analysis to functionally map cellular activity. Or, everything finally coming together.
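Before moving on to visualization, the GSEA method steps from PART 4 (running ES score, null distribution from permutations, empirical p-value) can be recapped with a minimal Python sketch. It makes several simplifying assumptions that are not in the slides: the running sum is unweighted (every gene-set member adds the same amount, whereas the published GSEA method weights members by their ranking statistic), the null distribution is built by drawing random gene-sets of the same size (a gene permutation), and the FDR step is left out. The function names and the toy ranked list of 200 genes are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Maximum of the running sum along the ranked list (unweighted variant).

    Genes in the set add 1/n_hit, genes outside the set subtract 1/n_miss,
    so the running sum returns to zero at the end of the list.
    """
    n_hit = sum(1 for gene in ranked_genes if gene in gene_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene-set must cover some, but not all, ranked genes")
    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in gene_set else -1.0 / n_miss
        best = max(best, running)
    return best


def empirical_pvalue(ranked_genes, gene_set, n_perm=2000, seed=0):
    """Return (real ES, fraction of random same-size gene-sets with ES >= real ES)."""
    rng = random.Random(seed)
    real = enrichment_score(ranked_genes, gene_set)
    size = sum(1 for gene in ranked_genes if gene in gene_set)
    n_at_least = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if enrichment_score(ranked_genes, random_set) >= real:
            n_at_least += 1
    return real, n_at_least / n_perm


# Toy ranked list (hypothetical): g0 is the most UP gene, g199 the most DOWN gene.
ranked = [f"g{i}" for i in range(200)]
# Toy gene-set concentrated near the top of the ranking.
gene_set = {"g0", "g1", "g2", "g4", "g7"}

es, p_value = empirical_pvalue(ranked, gene_set, n_perm=500)
print(round(es, 3), p_value)
```

This variant only detects enrichment at the top (UP) end of the list, since it takes the maximum of the running sum; a gene-set concentrated at the bottom would get a score near zero. Detecting DOWN enrichment would require looking at the minimum of the running sum as well.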
Gene-set Enrichment: Redundancy Problem
Many redundant gene-sets
- Gene Ontology has a very large number of gene-sets, often with slight differences
- Different pathway databases have different yet overlapping definitions of pathways
- Globally, it is useful to grasp the overlap relations between enriched gene-sets
--> we need a visualization framework going beyond the enrichment table

GO.id | GO.name | p.value | cover | cover.rat | Deg.mdn | Deg.iqr
GO:0042330 | taxis | 2.18E-06 | 23 | 0.056930693 | 54.94499375 | 9.139238998
GO:0006935 | chemotaxis | 2.18E-06 | 23 | 0.060209424 | 54.94499375 | 9.139238998
GO:0002460 | adaptive immune response based on somatic recombination | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002250 | adaptive immune response | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002443 | leukocyte mediated immunity | 0.000419328 | 23 | 0.097046414 | 58.27890582 | 15.58333739
GO:0019724 | B cell mediated immunity | 0.000683758 | 20 | 0.114285714 | 57.84161096 | 15.03496347
GO:0030099 | myeloid cell differentiation | 0.000691589 | 24 | 0.089219331 | 62.22171598 | 10.35284833
GO:0002252 | immune effector process | 0.000775626 | 31 | 0.090116279 | 58.27890582 | 23.86214773
GO:0050764 | regulation of phagocytosis | 0.000792138 | 8 | 0.2 | 53.54786293 | 5.742849971
GO:0050766 | positive regulation of phagocytosis | 0.000792138 | 8 | 0.216216216 | 53.54786293 | 5.742849971
GO:0002449 | lymphocyte mediated immunity | 0.00087216 | 22 | 0.101851852 | 57.84161096 | 16.13171132
GO:0019838 | growth factor binding | 0.000913285 | 15 | 0.068181818 | 83.0405088 | 10.58734852
GO:0051258 | protein polymerization | 0.00108876 | 17 | 0.080952381 | 57.97543252 | 17.31639968
GO:0005789 | endoplasmic reticulum membrane | 0.001178198 | 18 | 0.036072144 | 64.02284752 | 12.05209158
GO:0016064 | immunoglobulin mediated immune response | 0.001444464 | 19 | 0.113095238 | 58.27890582 | 15.58333739
GO:0007507 | heart development | 0.001991562 | 26 | 0.052313883 | 84.02538284 | 18.60761304
GO:0009617 | response to bacterium | 0.002552999 | 10 | 0.027173913 | 52.75249873 | 23.23104637
GO:0030100 | regulation of endocytosis | 0.002658555 | 11 | 0.099099099 | 56.38041132 | 16.02486889
GO:0002526 | acute inflammatory response | 0.002660742 | 24 | 0.103004292 | 57.80098769 | 24.94311116
GO:0045807 | positive regulation of endocytosis | 0.002903401 | 9 | 0.147540984 | 54.94499375 | 6.769909171
GO:0002274 | myeloid leukocyte activation | 0.002969661 | 7 | 0.077777778 | 54.94499375 | 16.07042339
GO:0008652 | amino acid biosynthetic process | 0.003502921 | 7 | 0.017241379 | 45.19797271 | 31.18248579
GO:0050727 | regulation of inflammatory response | 0.004999055 | 7 | 0.084337349 | 54.94499375 | 7.737346076
GO:0002253 | activation of immune response | 0.00500146 | 23 | 0.116161616 | 60.29679989 | 18.41103376
GO:0002684 | positive regulation of immune system process | 0.006581245 | 27 | 0.111570248 | 60.29679989 | 22.05051447
GO:0050778 | positive regulation of immune response | 0.006581245 | 27 | 0.113924051 | 60.29679989 | 22.05051447
GO:0019882 | antigen processing and presentation | 0.007244488 | 7 | 0.029661017 | 54.94499375 | 16.58797889
GO:0002682 | regulation of immune system process | 0.007252134 | 29 | 0.099656357 | 61.05645008 | 22.65935206
GO:0050776 | regulation of immune response | 0.007252134 | 29 | 0.102112676 | 61.05645008 | 22.65935206
GO:0043086 | negative regulation of enzyme activity | 0.008017022 | 9 | 0.040723982 | 53.28031076 | 17.48904224
GO:0006909 | phagocytosis | 0.008106069 | 10 | 0.080645161 | 55.66270253 | 12.47536747
GO:0002573 | myeloid leukocyte differentiation | 0.008174948 | 10 | 0.092592593 | 62.86577216 | 9.401887596
GO:0006959 | humoral immune response | 0.008396095 | 16 | 0.044568245 | 55.05654091 | 18.94209565
GO:0046649 | lymphocyte activation | 0.009044401 | 29 | 0.059917355 | 61.92213317 | 21.03553355
GO:0030595 | leukocyte chemotaxis | 0.009707319 | 7 | 0.101449275 | 56.33116709 | 6.945510559
GO:0006469 | negative regulation of protein kinase activity | 0.010782155 | 7 | 0.046357616 | 52.22863516 | 12.58524145
GO:0051348 | negative regulation of transferase activity | 0.010782155 | 7 | 0.04516129 | 52.22863516 | 12.58524145
GO:0007179 | transforming growth factor beta receptor signaling pathw | 0.012630825 | 13 | 0.071038251 | 83.49440788 | 12.63256309
GO:0005520 | insulin-like growth factor binding | 0.012950071 | 9 | 0.097826087 | 81.41963394 | 7.528247832
GO:0042110 | T cell activation | 0.013410548 | 20 | 0.064516129 | 59.77891783 | 26.06174863
GO:0002455 | humoral immune response mediated by circulating immunogl | 0.016780163 | 10 | 0.125 | 54.70766244 | 14.2572143
GO:0005830 | cytosolic ribosome (sensu Eukaryota) | 0.016907351 | 8 | 0.01843318 | 61.68933284 | 7.814673781
GO:0006487 | protein amino acid N-linked glycosylation | 0.01791078 | 7 | 0.044585987 | 56.50635337 | 6.780726553
GO:0051240 | positive regulation of multicellular organismal process | 0.017931228 | 31 | 0.096573209 | 62.2953212 | 23.86214773
GO:0042379 | chemokine receptor binding | 0.018849666 | 12 | 0.095238095 | 55.13915015 | 19.08254406
GO:0008009 | chemokine activity | 0.018849666 | 12 | 0.096774194 | 55.13915015 | 19.08254406
GO:0016055 | Wnt receptor signaling pathway | 0.020088086 | 18 | 0.04400978 | 85.47935979 | 20.92435897

[Zoom on the table rows: adaptive immune response based on somatic recombination; adaptive immune response; leukocyte mediated immunity; B cell mediated immunity; myeloid cell differentiation; immune effector process; regulation of phagocytosis; positive regulation of phagocytosis; lymphocyte mediated immunity]

Gene-set Enrichment: Redundancy Problem
How to handle the redundancy problem?
- Statistical solutions:
  • Correct for inter-redundancy and prioritize the most enriched gene-sets
  • Don’t always work well, not available for all tests --> not discussed here
- Visualization solution:
  • visualize gene-set overlap as a network

Enrichment Map (Cytoscape plugin)
http://baderlab.org/Software/EnrichmentMap

Enrichment Map
[Figure legend] Enrichment Significance: Class A (e.g. UP), Class B (e.g.
DOWN)

Enrichment Map
Edge between gene-sets A and B: overlap coefficient = |A ∩ B| / min(|A|, |B|)

Application Example
Estrogen treatment of Breast Cancer Cells
Overall Design:
- 2 classes (treated, untreated)
- 3 time points

Class | 12 hrs | 24 hrs | 48 hrs
Estrogen-treated | 3 | 3 | 3
Untreated | 3 | 3 | 3

We will start by analyzing only the 24-hour time point, which shows the maximal induction, although it is functionally similar to the 12-hour time point.
Clusters were manually identified and tagged; they represent highly inter-related gene-sets.

Condition Comparison
Enrichment Map can be used to compare enrichments.
Use cases:
– Different experiments
– Different condition comparisons within the same experiment
Example: same data-set (Estrogen treatment)

Class | 12 hrs | 24 hrs | 48 hrs
Estrogen-treated | 3 | 3 | 3
Untreated | 3 | 3 | 3

Now we can analyze the 12- and 24-hour time points together.
Notice that we are always comparing the treated to the untreated.

Heat-map Feature
Heat-maps can be used to explore gene expression patterns
– Microarray data are typically normalized by row for heat-map visualization:
  i. Subtract the mean
  ii. Divide by the standard deviation
– This setting is available in Enrichment Map
Legend: Down / Up

Gene Ontology Restructured
Gene Ontology is hierarchical, and terms are highly redundant / inter-related / inter-dependent.
Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms.

Enrichment Map: How-to
Installation
1. Install Cytoscape http://www.cytoscape.org/download.php?file=cyto2_6_3
2. Download the Enrichment Map plugin http://baderlab.org/Software/EnrichmentMap#Plugin_Download
3. Copy the plugin into the Cytoscape plugin folder
   win: C:\Program Files\Cytoscape\plugins
   mac: Applications/Cytoscape/plugins

Enrichment Map: How-to
Load Data
– Open Cytoscape, load the Enrichment Map plugin from the menu: plugins/ Enrichment Map/ Load Enrichment Results
1. Format: GSEA
   – Use the generic format if you have generated enrichment results outside GSEA; follow the manual for formatting instructions
2. Load the gene-set file (GMT)
3. Load the expression matrix (tab-sep txt)
4. This is optional
5. Change the settings as follows:
   – Set the p-value cut-off to 0.001
   – Set the FDR q-value cut-off to 0.05 (5%)
   – Select the overlap coefficient
More at: http://baderlab.org/Software/EnrichmentMap/UserManual

Enrichment Map: How-to
Browse results
– Enrichment Map is a Cytoscape plugin
– We will fully learn how to use Cytoscape in the next lesson
– In this lesson, we will just see the essential functionalities
Nodes can be dragged and dropped, or deleted.
Use this panel to move the view of the network around.
Heat-map view: click on nodes to access.
Normalization setting: Row Normalize Data
These parameters can be tuned to include/exclude gene-sets from the map, depending on their enrichment scores.
Rerun the layout from: Layout/Cytoscape Layouts/ Force Directed Layout/ Weighted

Summary of PART 5
Visualization of Gene-set Enrichment
– Gene-set enrichment is valuable to summarize the functional landscape of cellular activity (in our case, gene expression)
– Gene-sets are highly redundant; organizing them as a network greatly facilitates navigation and interpretation
  • Software: Enrichment Map

Further Readings
Enrichment Analysis (Methods):
• Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008 May;9(3):189-97. PMID: 18202032
• Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform. 2009 Jan;10(1):24-34. PMID: 18836208
Enrichment Map:
• Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics. 2010 Feb 1.
PMID: 20127684

Assignment Rules
– Forum discussion:
  • Of course, you are free to discuss general topics on the forums
  • Please don’t discuss assignment results until I’ve received them all
  • You can discuss results of optional assignments on the forum any time, if you wish
– Send me ([email protected]) the following material:
  • GSEA input files (zipped)
  • GSEA output files (zipped)
  • Cytoscape Session
  • Any ppt or doc elaborating on what you did and answering the questions (please, be concise!)

Assignment
Estrogen Treatment Data
– Run GSEA
  • Phenotypes: 12 and 24 hrs X treated vs untreated
  • Differential statistic: t-test
– Explore results using Enrichment Map
  • Can you reproduce the view in the lesson slides?
  • What can you infer about the effect of estrogen on the cellular gene expression program?
  • Use the heat-maps to inspect the differences between 12 and 24 hours: what do you notice? What are the implications for the comparison design?

Assignment
Estrogen Treatment Data: Source
– The original microarray data are available on GEO http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11352
– The raw .CEL data were processed using rma in R/Bioconductor
– The rma gene expression matrix and the gene-set (GMT) file are also available at: http://baderlab.org/Software/EnrichmentMap#Sample_Data_Download

Optional Assignments / 1
Do these assignments if you have time and wish to explore more
– Run GSEA with ratio-of-classes
  • Are the results globally similar?
  • What differences do you notice in the Enrichment Map?
– Make a gene-set (GMT) file with GO and KEGG using R/Bioconductor
  • Are the enriched KEGG pathways insightful?
– Run Enrichment Map with different values of the overlap coefficient (e.g. 0.4, 0.6)
  • In our experience, 0.5 is the optimal value for large maps (> 200 gs)
  • Which setting do you like the best? Why?

Optional Assignments / 2
Do these assignments if you have time and wish to explore more
1. Compute the t-test p-value in R, select the top (a) 750, (b) 2000 up- and down-regulated genes
2. Run the enrichment analysis in ConceptGen
3. Visualize the enrichment as a network in ConceptGen
   – Can you recognize functional clusters?
   – Are there similarities with the Enrichment Map view?

At least for this lesson…
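
Worked sketch: overlap coefficient
The overlap coefficient used to connect gene-sets in the map, |A ∩ B| / min(|A|, |B|), can be tried out on the toy gene-sets from the introduction. This is only a minimal illustration in Python, not the plugin's own code; the function name and the handling of empty sets are assumptions made for the example.

```python
def overlap_coefficient(set_a, set_b):
    """Return |A ∩ B| / min(|A|, |B|) for two gene-sets."""
    a, b = set(set_a), set(set_b)
    if not a or not b:
        return 0.0  # assumption: an empty gene-set overlaps with nothing
    return len(a & b) / min(len(a), len(b))


# Toy gene-sets from the introductory slides.
spindle = {"Gene.1", "Gene.2", "Gene.3"}
p53_signaling = {"Gene.2", "Gene.4", "Gene.5"}

score = overlap_coefficient(spindle, p53_signaling)
connected = score >= 0.5  # 0.5 is the cut-off suggested above for large maps
print(round(score, 3), connected)  # prints: 0.333 False
```

With a cut-off of 0.5 these two sets share only one gene out of three, so they would not be linked; lowering the cut-off (e.g. to 0.3) would draw the edge.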
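
Worked sketch: row normalization for heat-maps
The by-row normalization described for the heat-map view (subtract the mean, divide by the standard deviation) can be sketched as follows. The expression values are made up, the function name is hypothetical, and the use of the sample standard deviation (n - 1) is an assumption; the slides do not say which variant Enrichment Map applies.

```python
import statistics


def row_normalize(matrix):
    """Z-score each row: subtract the row mean, divide by the row SD."""
    normalized = []
    for row in matrix:
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample SD (n - 1): an assumption
        if sd == 0:
            # Assumption: a gene with no variation is shown as all zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(x - mean) / sd for x in row])
    return normalized


# Hypothetical expression matrix: 3 untreated then 3 treated samples per gene.
expression = [
    [8.0, 8.5, 9.0, 11.0, 11.5, 12.0],  # higher in the treated samples
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],     # flat gene
]
z = row_normalize(expression)
for row in z:
    print([round(v, 2) for v in row])
```

After normalization every gene is on the same scale, so a heat-map shows the pattern across samples rather than the absolute expression level.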
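
Worked sketch: selecting the top up- and down-regulated genes
For the optional assignment that selects the top up- and down-regulated genes by t-test, the selection step can be sketched as below. The assignment asks for R; this Python version only illustrates the logic. It assumes a pooled-variance (Student) t-test, and it ranks by the t statistic instead of computing p-values: with the same number of samples per class for every gene, the degrees of freedom are identical, so a larger |t| always means a smaller two-sided p-value. Gene names, values, and function names are hypothetical.

```python
import math
import statistics


def t_statistic(treated, untreated):
    """Pooled-variance two-sample t statistic; positive means up in treated."""
    n1, n2 = len(treated), len(untreated)
    pooled_var = ((n1 - 1) * statistics.variance(treated)
                  + (n2 - 1) * statistics.variance(untreated)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    diff = statistics.mean(treated) - statistics.mean(untreated)
    if se == 0:
        # Assumption: no variation at all gives an infinite or zero score.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / se


def top_up_down(expression, treated_cols, untreated_cols, n_top):
    """Return the n_top most up- and most down-regulated gene names."""
    scores = {
        gene: t_statistic([values[i] for i in treated_cols],
                          [values[i] for i in untreated_cols])
        for gene, values in expression.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    up = [g for g in ranked if scores[g] > 0][:n_top]
    down = [g for g in reversed(ranked) if scores[g] < 0][:n_top]
    return up, down


# Hypothetical data: columns 0-2 untreated, columns 3-5 estrogen-treated.
expression = {
    "Gene.1": [5.0, 5.2, 4.9, 8.0, 8.3, 7.9],
    "Gene.2": [7.0, 7.1, 6.8, 3.0, 3.2, 2.9],
    "Gene.3": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],
}
up, down = top_up_down(expression, treated_cols=[3, 4, 5],
                       untreated_cols=[0, 1, 2], n_top=1)
print(up, down)
```

In the assignment n_top would be 750 or 2000, and the two resulting gene lists are what gets submitted to the enrichment tool.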