Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 4: Pathway and Network Analysis
Lincoln Stein

Classes of Gene Set Analysis
[Figure labels: DAVID; GSEA; Reactome FI network; PARADIGM.]
Khatri et al. PLOS Comp Bio. 8:1 2012

Module 4 bioinformatics.ca

Limitations of Gene Set Enrichment Analysis
• Many possible gene sets – diseases, molecular function, biological process, cellular compartment, pathways...
• Gene sets are heavily overlapping; need to sort through lists of enriched gene sets!
• “Bags of genes” obscure regulatory relationships among them.

Pathway Databases
• Advantages:
  – Usually curated.
  – Biochemical view of biological processes.
  – Cause and effect captured.
  – Human-interpretable visualizations.
• Disadvantages:
  – Sparse coverage of genome.
  – Different databases disagree on boundaries of pathways.

KEGG

Reactome
• Hand-curated pathways in human.
• Rigorous curation standards – every reaction traceable to primary literature.
• Automatically-projected pathways to non-human species.
• 22 species; 1112 human pathways; 5078 proteins.
• Features:
  – Google-map style reaction diagrams with overlays;
  – Find pathways containing your gene list;
  – Calculate gene overrepresentation in pathways;
  – Find corresponding pathways in other species.
• Open access.

Pathway Commons

Pathway Colorization
• Main feature offered by all pathway databases.
• Upload a gene list.
• Database calculates an enrichment score on each pathway and displays ranked list.
• Browse into pathways of interest; download colorized pictures.

Example from Reactome

Networks
• Pathways capture only the “well understood” portion of biology.
• Networks cover less well understood relationships:
  – Genetic interactions
  – Physical interactions
  – Coexpression
  – GO term sharing
  – Adjacency in pathways

Biological Networks are Scale Free
Properties:
1. The degree (# connections) of nodes follows a power law: the fraction of nodes with degree k falls off as k^(-γ), so highly connected nodes are rare but far more common than in a random network.
2. The local clustering coefficient (tendency of nodes to interconnect) is independent of the degree of the node.
Implications:
1. A small number of genes have a large number of connections (chokepoints).
2. A large number of genes have a small number of connections (leaves).
3. Genes cluster (functional groups).
4. The cluster sizes are also scale-free (many small clusters, few large clusters).
Nature Reviews Genetics 5, 101-113 (February 2004) | doi:10.1038/nrg1272

Network Databases
• Can be built automatically or via curation.
• Popular sources of curated networks:
  – BioGRID – Curated interactions from literature; 529,000 genes, 167,000 interactions.
  – IntAct – Curated interactions from literature; 60,000 genes, 203,000 interactions.
  – MINT – Curated interactions from literature; 31,000 genes, 83,000 interactions.

Uncurated Interaction Sources
• Text mining approaches
  – Computationally extract gene relationships from text, such as PubMed abstracts.
  – Much faster than hand curation.
  – Not perfect:
    • Problems recognizing gene names. Is hedgehog a gene or a species?
    • Natural language processing is difficult.
  – Popular resources:
    • iHOP
    • PubGene

Uncurated Interaction Sources
• Experimental techniques
  – Yeast two-hybrid (Y2H) protein interactions.
  – Protein complex pulldowns/mass spec.
  – Genetic screens, such as synthetic lethals, enhancer/suppressor screens.
  – Not perfect:
    • Y2H interactions have taken proteins out of natural context; physical interaction != biological interaction.
    • Protein complex pulldowns plagued by “sticky” proteins such as actin.
    • Genetic screens highly sensitive to genetic background (“network effects”).

Integrative Approaches
• Combine multiple sources of evidence to increase accuracy.
• Simple example:
  – “Party hubs” are Y2H interactions that have been filtered for those partners that share the same temporal-spatial location.
• Complex example:
  – Combine multiple sources of curated and uncurated evidence.

Example: Reactome FI Network
Curated Human Data – Version 35.
• 5078 proteins
• 4166 reactions
• 3870 complexes
• 1112 pathways
Only ~25% of genome!
Goal: add a “corona” of uncurated interaction data around scaffold of curated pathway data.

Expanding Reactome’s Coverage
[Diagram labels: Curated Pathways; Uncurated Information; human PPI; PPI inferred from fly, worm & yeast; PPI from text mining; GeneWays; Gene co-expression; CellMap; TRED; GO annotation on biological processes; Protein domain-domain interactions; Naïve Bayes Classifier; Annotated Functional Interactions; Predicted Functional Interactions.]
Wu et al. (2010) Genome Biology

Integrated Functional Interaction (FI) Network
• 10,956 proteins (9,542 genes).
• 209,988 FIs.
• ~50% coverage of genome.
• False (+) rate < 1%
• False (-) rate ~80%
[Figure: 5% of network shown here.]

Active Network Extraction
[Diagram labels: Machine Learning + Curated Pathway Dbs; Uncurated Interaction Evidence; Reactome Functional Interaction Network (~11,000 proteins; 200,000 interactions); Extract and Cluster Altered Genes; Disease “modules” (10-30).]

Clustering of TCGA Breast Cancer Mutations
[Figure labels: Cadherin signaling; Signaling by Tyrosine Kinase receptors; NOTCH and Wnt signaling; Focal adhesion; ECM-Receptor interaction; Neuroactive ligand-receptor interaction; Mucin cluster; Cell adhesion molecules; Ubiquitin-mediated proteolysis; Metabolism of proteins; Signaling by Rho GTPases; DNA repair; Cell cycle; Axon guidance; Calcium signaling; M phase; G2/M Transition.]

Pancreatic Mutation Modules
• Module 0: MAPK, Hedgehog, TGFβ signaling
• Module 1: Heterotrimeric G-protein signaling
• Module 2: B-cell receptor, ERBB, FGFR, EGFR signaling
• Module 3: Translation
• Module 4: ECM, focal adhesion, integrin signaling
• Module 5: Wnt & Cadherin signaling
• Module 6: Ca2+ signaling
• Module 7: Axon guidance
• Module 8: MHC class II antigen presentation
• Module 9: Axon guidance
• Module 10: muscle contraction

256 Pancreatic Cancer Mutations
[Heatmap axes: Patient Samples; Genes.]

Modules After Hierarchical Clustering
[Heatmap axes: Patient Samples; Modules.]

Network-Based Clustering Algorithms
• Reactome FI network (Wu & Stein, Genome Biol. 2012 13(12):R112)
  – Expression or SNV analysis
  – Online analysis via Cytoscape Plugin (lab)
• HotNet (Vandin et al. J Comput Biol. 2011 Mar;18(3):507-22)
  – Expression or SNV analysis
  – Local installation with Python & MatLab
  – Cytoscape visualization
• WGCNA (Langfelder et al. 2008 BMC Bioinformatics 9: 559)
  – Expression analysis
  – Local installation as R package
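
The "Extract and Cluster Altered Genes" step can be sketched in a few lines. This is a minimal illustration and not the algorithm used by the Reactome FI plugin, HotNet, or WGCNA: it assumes the network is supplied as a list of gene pairs, keeps only interactions whose two endpoints are both altered, and reports connected components as candidate modules. The function name `extract_modules`, the `min_size` parameter, the toy edge list, and the use of connected components in place of a real clustering method are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def extract_modules(fi_edges, altered_genes, min_size=2):
    """Return connected groups of altered genes in an interaction network.

    fi_edges: iterable of (gene_a, gene_b) pairs (hypothetical toy format).
    altered_genes: genes mutated or differentially expressed in the cohort.
    Connected components stand in for a real clustering algorithm here.
    """
    altered = set(altered_genes)
    # Build the subnetwork induced by the altered genes.
    neighbors = defaultdict(set)
    for a, b in fi_edges:
        if a in altered and b in altered and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    modules, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            gene = queue.popleft()
            component.add(gene)
            for nxt in neighbors[gene]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(component) >= min_size:
            modules.append(sorted(component))
    # Largest modules first; ties broken alphabetically for stable output.
    return sorted(modules, key=lambda m: (-len(m), m))

# Toy network: these edges are invented for illustration, not taken from Reactome.
toy_edges = [("KRAS", "RAF1"), ("RAF1", "MAP2K1"), ("SMAD4", "TGFBR2"),
             ("TP53", "MDM2"), ("KRAS", "PIK3CA")]
toy_altered = ["KRAS", "RAF1", "MAP2K1", "SMAD4", "TGFBR2", "TP53"]
modules = extract_modules(toy_edges, toy_altered)
```

In this toy case TP53 is altered but has no altered neighbor, so it is not assigned to any module. A real analysis would replace the connected-components step with one of the clustering methods listed above.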

Classification of Tumors via Molecular Phenotype
[Diagram labels: Test; Classify; Proteomics; Transcriptomics; Genomics.]

Risk Stratification
[Diagram labels: TEST; Low risk – reduce treatment; High risk – treat aggressively; Don’t Treat; Treat; 10-20% progress; No Relapse; Relapse.]

Challenges in Biomarker Discovery
• Overtraining
  – 22,000 genes; any given cancer may show alterations in 1000s of them; patient cohorts are in 100s.
  – Can find a set of gene alterations that nicely predicts survival in a single cohort by chance.
  – Field is littered with biomarkers that didn’t replicate in independent cohorts.
• Disease Heterogeneity
  – If there are many subtypes of disease then need even larger cohorts.
• Tumor Heterogeneity
  – A single primary tumor may carry high-risk and low-risk subclones simultaneously.

Using Network Architecture to Accelerate Biomarker Selection
Guanming Wu
• Disease Module Map
• Expression analysis of tumours from multiple patients
• Principal component analysis on modules
• Correlate principal components with clinical parameters
Genome Biol. 2012 Dec 10;13(12):R112

Breast Cancer Expression Biomarker: Samples Used
• Built the network using NEJM: van de Vijver et al. 2002
  – 295 Samples, ~12,000 genes
  – Event: death
• Validated with GSE4922: Ivshina et al. Cancer Res. 2006
  – 249 Samples, ~13,000 genes
  – Event: recurrence or death

PC Analysis Identifies Module 2 as Explaining Much of Variation in Survival

Same Signature Predicts Survival in Independent Data Set

And Three More Data Sets as Well…

Module 2: Kinetochore + Aurora B Signaling

Integration of Multiple Data Sets
• Experimental samples can be interrogated many ways:
  – RNA expression
  – Genome/exome sequencing
  – Copy number changes/loss of heterozygosity
  – shRNA knockdown screens
• Integrate multiple functional data types using network/pathway relationships?

PARADIGM
Factor graph: directed graph connecting genes; each gene is activated, inactivated, or unchanged in a single patient.
Vaske, Benz et al. Bioinformatics 26:i237 2010

PARADIGM: The Bad News
• Distributed in source code form only
  – Requires several third-party math/graph libraries (all open source).
  – Tedious to compile!
• Scant documentation.
• No repositories of formatted pathway data.
• No examples of converting experimental data into input files.
• Good news: we are working on a web service implementation for a Reactome-based implementation.

Take Home Messages
• Pathway/network analysis can provide context to altered gene lists.
• Pathway/network analysis differs greatly in complexity, power, and usability:
  – SIMPLE: Pathway diagram colorization
  – MODERATE: Reactome FI network extraction
  – COMPLEX: PARADIGM
• This type of analysis is work-in-progress, but promises ability to integrate data across many dimensions.
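
For the SIMPLE tier, pathway colorization, the slides say only that the database calculates an enrichment score for each pathway. One common choice for such a score is a one-sided hypergeometric (overrepresentation) test; whether a given pathway database uses exactly this statistic is an assumption here. The sketch below uses only the Python standard library, and the function name, parameter names, and example counts are hypothetical.

```python
from math import comb

def overrepresentation_pvalue(n_universe, n_pathway, n_query, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    n_universe: number of genes that could have been in the list.
    n_pathway:  number of those genes annotated to the pathway.
    n_query:    size of the uploaded gene list.
    n_overlap:  query genes that fall in the pathway.
    """
    if not (0 <= n_pathway <= n_universe and 0 <= n_query <= n_universe):
        raise ValueError("set sizes must lie between 0 and n_universe")
    max_overlap = min(n_pathway, n_query)
    # Count gene lists with at least n_overlap pathway genes, in exact integers.
    favorable = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_query - i)
        for i in range(max(n_overlap, 0), max_overlap + 1)
    )
    return favorable / comb(n_universe, n_query)

# Small case that can be checked by hand: all 5 query genes hit a 5-gene
# pathway in a 10-gene universe, so the p-value is 1 / C(10, 5) = 1/252.
p_small = overrepresentation_pvalue(10, 5, 5, 5)

# Invented counts on a more realistic scale (not taken from the slides).
p_example = overrepresentation_pvalue(20000, 100, 200, 10)
```

Because a gene list is scored against many pathways at once, the resulting p-values need a multiple-testing correction before pathways are ranked or reported.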

URLs
• KEGG – www.genome.jp/kegg
• Biocarta – www.biocarta.com
• WikiPathways – www.wikipathways.org
• Reactome – www.reactome.org
• NCI/PID – pid.nci.nih.gov
• Ingenuity – www.ingenuity.com
• Pathway Commons – www.pathwaycommons.org/pc/
• PARADIGM – http://sbenz.github.com/Paradigm/
• BioGRID – www.thebiogrid.org
• IntAct – www.ebi.ac.uk/intact
• MINT – mint.bio.uniroma2.it
• iHOP – www.ihop-net.org/UniPub/iHOP
• PubGene – www.pubgene.org

We are on a Coffee Break & Networking Session