Network-based approach for GO Term and pathway enrichment analysis
Girolamo Giudice

Network or Graph
• Nodes = Vertices (points)
• Links = Edges (lines)
• Network theory = Graph theory
• Network = Graph
• Large graphs = Networks
[Figure: example graph with nodes 0 to 6]

Network or Graph
[Figure: directed graph and undirected graph]
[Figure: a network split into Subnetwork 1, Subnetwork 2 and Subnetwork 3]

Network – Path and Shortest path
[Figure: paths and shortest paths between two nodes labelled "Barrio del pilar" and "Cnic"]

Network – index: Degree
The degree of a node is the number of edges connected to the node.
• The node with the highest degree (3) is node 4
• The node with the lowest degree is node 2

Network – index: Betweenness
• The node with the highest betweenness is node 4, followed by node 6
[Figure: example graph with nodes 0 to 8]

Why are networks important?
• Complex system: the whole is greater than the sum of its parts
• How to describe a system as a whole?

Databases
• A database for protein-protein interaction data
• A database for ontologies
• A database for pathway annotations

Databases
• PPI (STRING vs IntAct)
• Ontology (Gene Ontology)
• Pathway (Reactome vs KEGG)
• Common identifier

Protein Interaction – STRING vs IntAct
Number of interactions in human:
Database   Interactions
STRING     4850630
IntAct     144596

PPI Experiments
[Figure: binary association between a bait and a prey versus association between a bait and a complex]

STRING vs IntAct
[Figure: search in Ensembl]

Topology of Networks before 1999
• Regular: each node is linked to its neighbors
• Random: most nodes have 3 or more links

Protein interaction networks are scale-free
• Hubs

Gene Ontology
The Gene Ontology project is a collection of dynamic controlled vocabulary terms, called ontologies, used for annotating attributes of genes or proteins for a variety of different organisms.
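The network indices introduced in the slides above (degree, shortest path, betweenness) can be sketched in a few lines of standard-library Python. This is only an illustration: the edge list is a hypothetical toy graph, not the graph drawn on the slides, and the function names are made up for this example. Betweenness is computed with Brandes' algorithm for unweighted, undirected graphs, which is one common choice and not necessarily the one used in this work.

```python
from collections import deque


def build_graph(edges):
    """Return an undirected adjacency map {node: set(neighbours)}."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def degree(graph):
    """Degree = number of edges connected to each node."""
    return {node: len(neigh) for node, neigh in graph.items()}


def shortest_path(graph, source, target):
    """Breadth-first search; returns one shortest path, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neigh in sorted(graph[node]):
            if neigh not in previous:
                previous[neigh] = node
                queue.append(neigh)
    return None


def betweenness(graph):
    """Unnormalised betweenness (Brandes), unweighted undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # every unordered pair was visited from both endpoints
    return {v: b / 2 for v, b in score.items()}


# Hypothetical toy graph: 0 - 1 - 2, with 2 also linked to 3 and 4.
toy = build_graph([(0, 1), (1, 2), (2, 3), (2, 4)])
print(degree(toy))
print(shortest_path(toy, 0, 4))
print(betweenness(toy))
```

In this toy graph node 2 has both the highest degree and the highest betweenness, because every path between the two sides of the graph passes through it.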
Introduction – Gene Ontology
[Figure: Gene Ontology]

Introduction – KEGG vs Reactome
• The proteins and small chemicals found inside the cell can interact with each other to perform a task
• A pathway can be defined as a series of reactions that can lead to the formation of new molecular products or to changes in the cellular state
• To download KEGG = $2000

Enrichment Analysis – Fisher's Test
1. State the relevant null hypothesis to be tested
2. Formulate the assumptions
3. Set a significance threshold (α)
4. Compute the relevant test statistic
5. Determine the P-value given the test statistic
6. The decision rule is (usually) to reject the null hypothesis if P-value < α

                         H0 is true                          H1 is true                           Total
Accept null hypothesis   Right decision (a)                  Type II error (b) (false negative)   a+b
Reject null hypothesis   Type I error (c) (false positive)   Right decision (d)                   c+d
Total                    a+c                                 b+d                                  N

Enrichment Analysis – Fisher's Test
• Fisher demonstrated that the probability of observing the sample in a contingency table follows a hypergeometric distribution and, if the null hypothesis is true, is equal to:

$$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{N}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,N!}$$

• where p is the probability of observing the sample a, b, c, d if the null hypothesis is true.

Multiple testing – FDR vs Bonferroni
Why does multiple testing matter?
• Omics = lots of data = lots of hypothesis tests
• Adjustment of the p-value is needed to reduce the probability of obtaining false positive results when multiple tests are executed on a single data set
• Bonferroni: the risk is to throw away real enrichments
• FDR: is true only for independence

Independence and FDR
• Example UniProt ID: P02830
• GO:0003677: DNA binding
• GO:0003700: sequence-specific DNA binding transcription factor activity
• GO:0008134: transcription factor binding
• GO:0043565: sequence-specific DNA binding

Multiple testing – FDR vs Bonferroni
[Figure: FDR vs Bonferroni]

Why a Network-based approach
High-throughput genomic experiments often lead to the identification of lists of potentially disease-related genes/proteins.
Interpreting a gene list involves enrichment methods that assess functional associations with known pathways and Gene Ontology terms.
Major limitations of classical enrichment analysis:
• Proteins do not work in isolation
• The underlying molecular interaction network is not taken into account

Related Work: PINA
The aim of PINA [1] is to identify modules [2], highly and densely interconnected regions of groups of functionally correlated genes, and to use them to assess the functional annotation.
• Markov clustering (MCL)
• Molecular complex detection (MCODE)
[1] Cowley, M.J., Pinese, M., Kassahn, K.S., Waddell, N., Pearson, J.V., Grimmond, S.M., Biankin, A.V., Hautaniemi, S. and Wu, J. (2012) PINA v2.0: mining interactome modules. Nucleic Acids Res, 40, D862-865.
[2] Rives AW, Galitski T. Modular organization of cellular networks. Proc. Natl Acad. Sci. USA 2003;100:1128-1133.

Related Work: EnrichNet
• The starting protein set is mapped onto the STRING network
• The distance between nodes and pathways and/or processes is calculated
• A score is assigned
• The output of this procedure is a ranking table of pathways or processes with association scores
E. Glaab, A. Baudot, N. Krasnogor, R. Schneider, A. Valencia. EnrichNet: network-based gene set enrichment analysis. Bioinformatics, Vol. 28, No. 18 (2012), pp. i451.

The 3 phases of processing
1. Preprocessing phase
   • Build the PPI network
   • Integrate data
2. Processing phase
   • Extract the minimal connecting interaction network (MCN)
   • Simplify the MCN
3. Perform the enrichment analysis on the simplified MCN

Phase 1: Building the PPI network
• Nodes are proteins, linked by edges
• Homo sapiens: 9872 nodes, 32968 edges

Phase 1: Data Integration
Each protein is annotated with:
• gene ontology terms from each of the three categories
• the possible pathways in which it is involved

Phase 2: Minimal Connecting Network
[Figure: minimal connecting network]

Phase 2: Minimal Connecting Network - Problems
[Figure: minimal connecting network]

Visualizing Network
[Figure: forces used for the network layout: Hooke's law, Coulomb's law, friction force]

Phase 2: Simplify the MCN
To simplify means to eliminate "less important" nodes and their associated edges from the MCN while preserving the connectivity between the starting MCN nodes.
• Graph clustering
• Graph filtering

Phase 2: Filtering Network
A score is assigned to each node, except those of the initial set, according to one of the following metrics:
• Degree
• Weight
• Betweenness
• Modified betweenness
[Figure and table: node labels with their degree]

Phase 2: Filtering Network
[Figure: filtered network]

Phase 3: Enrichment analysis
Node   UniProt ID
0      Q9NSB2
1      Q9NWB1
3      Q16787
4      Q9NR09
6      P25054
9      Q14156
5      Q5JSH3
• Perform Fisher's exact test with Bonferroni correction
• The nodes belonging to the filtered network constitute the target set
• The IntAct database is the background

Testing: The "adenoma-carcinoma" sequence in CRC
The transformation of normal colonic epithelium into adenoma and then into adenocarcinoma is called the "adenoma-carcinoma sequence".
• Adenoma: 12 genes with SNV
• Adenocarcinoma: 40 genes with 42 SNV
Zhou D, Yang L, Zheng L, Ge W, Li D, et al. (2013) Exome Capture Sequencing of Adenoma Reveals Genetic Alterations in Multiple Cellular Pathways at the Early Stage of Colorectal Tumorigenesis. PLoS ONE.
Maria S. Pino, Daniel C. Chung, The Chromosomal Instability Pathway in Colon Cancer, Gastroenterology, Volume 138, Issue 6, May 2010, Pages 2059-2072, ISSN 0016-5085.
Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318: 1108–1113.
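Phase 3 (Fisher's exact test on the target set against the background, followed by a multiple-testing correction) can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the implementation behind the slides: the test is taken to be one-sided (over-representation only), the FDR procedure is assumed to be Benjamini-Hochberg, the function names are invented for this example, and the p-values at the bottom are made up.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table under the null.

    Assumed layout for enrichment (not the decision table of the slides):
    a = target proteins annotated with the term, b = target proteins not
    annotated, c = other background proteins annotated, d = other
    background proteins not annotated.
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_enrichment(a, b, c, d):
    """One-sided p-value: probability of at least `a` annotated targets."""
    p = 0.0
    # Move towards more extreme tables while keeping all margins fixed.
    while b >= 0 and c >= 0:
        p += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return p


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests (capped at 1)."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for three hypothetical GO terms.
raw = [0.01, 0.04, 0.03]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

With many GO terms tested at once, the Bonferroni-adjusted values grow much faster than the Benjamini-Hochberg ones, which is the trade-off raised in the multiple-testing slides.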
Test case: Adenoma (GO process)
Adenoma with modified betweenness filter
GO term ID   P-value    GO term description
GO:0043065   1.45E-07   positive regulation of apoptotic process
GO:0006915   4.66E-07   apoptotic process
It has been shown [1] [2] that apoptosis plays an important role in the elimination of damaged cells. Several papers have shown an increased frequency of cell death [2] [3] [4] [5] during the transition from normal mucosa to adenoma.
[1] Bedi A, Pasricha PJ, Akhtar AJ, et al. Inhibition of apoptosis during development of colorectal cancer. Cancer Res 1995.
[2] Hawkins N, Lees J, Hargrave R, O'Connor T, Meagher A, Ward R. Pathological and genetic correlates of apoptosis in the progression of colorectal neoplasia. Tumour Biol 1997.
[3] Aotake T, Lu CD, Chiba Y, Muraoka R, Tanigawa N. Changes of angiogenesis and tumor cell apoptosis during colorectal carcinogenesis. Clin Cancer Res 1999.
[4] Sinicrope FA, Roddey G, McDonnell TJ, Shen Y, Cleary KR, Stephens LC. Increased apoptosis accompanies neoplastic development in the human colorectum. Clin Cancer Res 2006.
[5] Koike M. Significance of spontaneous apoptosis during colorectal tumorigenesis. J Surg Oncol.

Test case: Adenoma (Pathway)
Adenoma pathways with modified betweenness filter
Pathway ID   P-value     Description
2644607      0.0001658   Beta-catenin phosphorylation cascade [1]
[1] B. Mann, M. Gelos, A. Siedow et al. Target genes of beta-catenin-T-cell-factor/lymphoid-enhancer-factor signaling in human colorectal carcinomas.

Test case Adenocarcinoma: PINA and EnrichNet
EnrichNet: enriched biological process terms
GO term ID   P-value    GO term description
GO:0031000   4.23E-08   response to caffeine

PINA: enriched biological process terms
GO term ID   P-value    Description
GO:0000289   0.001107   nuclear-transcribed mRNA poly(A) tail shortening
GO:0000288   0.005218   nuclear-transcribed mRNA catabolic process, deadenylation-dependent decay
GO:0031124   0.006576   mRNA 3'-end processing
GO:0044260   0.006576   cellular macromolecule metabolic process
GO:0000956   0.006576   nuclear-transcribed mRNA catabolic process

Test case: Adenocarcinoma (GO process)
Adenocarcinoma with modified betweenness filter
GO term ID   P-value    Description
GO:0019048   2.83E-17   virus-host interaction [1] [2]
GO:0006955   7.52E-14   immune response
GO:0006974   1.05E-13   response to DNA damage stimulus [1] [2]
GO:0045087   2.95E-10   innate immune response
GO:0016567   7.32E-10   protein ubiquitination
GO:0030512   9.74E-10   negative regulation of transforming growth factor beta receptor signaling pathway [3] [4]
GO:0051403   1.25E-09   stress-activated MAPK cascade
GO:0038124   3.47E-09   toll-like receptor signaling pathway
GO:0007179   8.85E-09   transforming growth factor beta receptor signaling pathway [3] [4]
GO:0007050   1.63E-08   cell cycle arrest
GO:0006915   1.63E-08   apoptotic process
GO:0031572   2.63E-07   G2 DNA damage checkpoint [1] [2]
[1] Danielle Collins, Aisling M Hogan, Desmond C Winter, Microbial and viral pathogens in colorectal cancer, The Lancet Oncology, Volume 12, Issue 5, May 2011, Pages 504-512.
[2] Parsonnet J. Bacterial infection as a cause of cancer. Environ Health Perspect 1995; 103: 263–68.
[3] Markowitz SD, Roberts AB. Tumor suppressor activity of the TGF-beta pathway in human cancers. Cytokine Growth Factor Rev 1996;7:93–102.
[4] Kim SJ, Im YH, Markowitz SD, Bang YJ. Molecular mechanisms of inactivation of TGF-beta receptors during carcinogenesis. Cytokine Growth Factor Rev 2000;11:159–68.

Test case: Adenocarcinoma (pathway)
Adenocarcinoma pathways with modified betweenness filter
Pathway ID   P-value    Description
2644607      1.26E-05   Loss of Function of FBXW7 in Cancer and NOTCH1 Signaling [1]
450294       2.21E-05   MAP kinase activation in TLR cascade
168180       4.53E-05   TRAF6 Mediated Induction of proinflammatory cytokines [2]
446652       9.80E-05   Interleukin-1 signaling
[1] Rajagopalan H, Jallepalli PV, Rago C, Velculescu VE, Kinzler KW, Vogelstein B, Lengauer C. Inactivation of hCDC4 can cause chromosomal instability. Nature 2004;428:77–81.
[2] Inoue J, Ishida T, Tsukamoto N, Kobayashi N, Naito A, Azuma S, Yamamoto T (January 2000). "Tumor necrosis factor receptor-associated factor (TRAF) family: adapter proteins that mediate cytokine signaling". Exp. Cell Res. 254 (1): 14–24. doi:10.1006/excr.1999.4733. PMID 10623461.

Enrichment issue
• Randomly choose a set of proteins (5, 10, 20, 30) from UniProt
• Perform enrichment analysis
UniProt is biased towards certain GO terms.
[Figure: chart for the random protein sets]

Enrichment issues
• Adopt Bonferroni correction: no significantly enriched terms
• Databases are biased towards certain terms
Enrichment analysis is completely useless!!!!
BUT

Enrichment
"According to the laws of aerodynamics, the bumblebee can't fly because of the shape and weight of his body in relation to the total wing area, but the bumblebee doesn't know, so it goes ahead and flies anyway."

Enrichment
"According to statistics, the bioinformaticians can't perform enrichment analysis, because of Bonferroni corrections and the bias of databases, but bioinformaticians don't know (till now), so they go ahead and perform enrichment analysis anyway."

Thank you