Network-based approach for GO Term and pathway enrichment analysis
Girolamo Giudice
Network or Graph
[Figure: example graph with numbered nodes]
Nodes = Vertices (points)
Links = Edges (Lines)
Network Theory = Graph Theory
Network = Graph
Large graphs = Networks
Network or Graph
[Figure: example graphs with numbered nodes and weighted edges; panels: Directed graph, Undirected graph]
[Figure: three panels: Subnetwork 1, Subnetwork 2, Subnetwork 3]
Network – Path and Shortest path
[Figure: two panels, "Paths" and "Shortest paths", on the example graph, with the labels Barrio del Pilar and CNIC]
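A path is any sequence of edges joining two nodes; a shortest path is one that uses the fewest edges. The following is a minimal sketch of finding one with breadth-first search. The adjacency dictionary `EXAMPLE_GRAPH` and the function name `shortest_path` are illustrative assumptions and do not reproduce the graph drawn on the slide.

```python
from collections import deque


def shortest_path(adjacency, source, target):
    """Return one shortest path (fewest edges) from source to target, or None."""
    previous = {source: None}  # node -> node it was first reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back from the target to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency[node]:
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # target is not reachable from source


# Hypothetical undirected graph, stored as node -> list of neighbours.
EXAMPLE_GRAPH = {
    0: [1, 3],
    1: [0, 4],
    3: [0, 4, 5],
    4: [1, 3, 6],
    5: [3, 6],
    6: [4, 5],
}

path = shortest_path(EXAMPLE_GRAPH, 0, 6)
print(path)
```

Several shortest paths may exist between the same pair of nodes; breadth-first search returns whichever it reaches first.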
Network - index
[Figure: example graph with numbered nodes]
Degree
The degree of a node is the number of edges connected to it.
The node with the highest degree (3) is node 4.
The node with the lowest degree is node 2.
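Counting degrees from an edge list can be sketched as follows. The edge list `EXAMPLE_EDGES` is a hypothetical graph chosen for illustration, not a reconstruction of the figure.

```python
def degrees(edges):
    """Count, for every node, the number of edges that touch it."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


# Hypothetical undirected edge list (an assumption, not the slide's graph).
EXAMPLE_EDGES = [(0, 1), (1, 4), (0, 3), (3, 4), (4, 6), (6, 5), (5, 2)]

degree = degrees(EXAMPLE_EDGES)
highest = max(degree, key=degree.get)
lowest = min(degree, key=degree.get)
print("highest degree:", highest, degree[highest])
print("lowest degree:", lowest, degree[lowest])
```

In this hypothetical graph node 4 has the highest degree (3) and node 2 the lowest (1).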
Network - index
Betweenness
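[Figure: example graph with numbered nodes]
The betweenness of a node counts the shortest paths between other pairs of nodes that pass through it.
The node with the highest betweenness is node 4, followed by node 6.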
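Betweenness can be computed with Brandes' algorithm. The sketch below assumes an unweighted, undirected graph stored as an adjacency dictionary and returns unnormalised scores; the function name `betweenness` and the small star graph are illustrative assumptions.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalised betweenness for an unweighted, undirected graph (Brandes)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search from source, counting shortest paths.
        order = []
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        n_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    n_paths[neighbour] += n_paths[node]
                    predecessors[neighbour].append(node)
        # Accumulate dependencies from the farthest nodes back to the source.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = n_paths[pred] / n_paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    # Every undirected pair was counted once from each end.
    return {node: value / 2 for node, value in score.items()}


# Hypothetical star graph: node 0 in the centre, three leaves.
STAR = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(betweenness(STAR))
```

In the star graph every shortest path between two leaves goes through the centre, so only node 0 gets a non-zero score.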
Why are networks important?
• Complex system: the whole is greater than the sum of its parts
• How can a system be described as a whole?
Databases
• A database for protein-protein interaction data
• A database for ontologies
• A database for pathway annotations
Databases
PPI (STRING vs IntAct)
Ontology (Gene Ontology)
Common identifier
Pathway (Reactome vs KEGG)
Protein Interaction – String vs Intact
Number of interactions in human
STRING: 4850630
IntAct: 144596
PPI Experiments
[Figure: bait and prey diagrams; panels: Binary association, Association (complex)]
STRING vs IntAct
Search in Ensembl
Topology of Networks before 1999
Regular: each node is linked to its neighbors
Random: most nodes have 3 or more links
Protein interaction networks are scale-free, with a few highly connected hubs
Gene Ontology
The Gene Ontology project is a collection of dynamic controlled vocabulary
terms, called ontologies, used for annotating attributes of genes or proteins
for a variety of different organisms.
Introduction – Gene Ontology
Introduction – Kegg vs Reactome
The proteins and small chemicals found inside the cell can interact with each other to perform a task.
A pathway can be defined as a series of reactions that leads to the formation of new molecular products or to changes in the cellular state.
Cost to download KEGG: $2000
Enrichment Analysis – Fisher´s Test
1. State the relevant null hypothesis to be tested
2. Formulate the assumptions
3. Set a significance threshold (α)
4. Compute the relevant test statistic
5. Determine the P-value given the test statistic
6. The decision rule is (usually) to reject the null hypothesis if P-value < α
                          H0 is true                          H1 is true                           Total
Accept null hypothesis    Right decision (a)                  Type II error (b), false negative    a+b
Reject null hypothesis    Type I error (c), false positive    Right decision (d)                   c+d
Total                     a+c                                 b+d                                  N
Enrichment Analysis – Fisher´s Test
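• Fisher demonstrated that, when the null hypothesis is true, the probability of observing a given contingency table follows a hypergeometric distribution and is equal to:
\[ p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{N}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,N!} \]
• Here p is the probability of observing the sample a, b, c, d if the null hypothesis is true, and N = a + b + c + d.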
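The hypergeometric probability of a 2x2 table, and the one-sided P-value built from it, can be sketched with the standard library only. The function names are illustrative assumptions; the one-sided P-value sums the probabilities of all tables with the same margins whose top-left cell is at least a.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_sided(a, b, c, d):
    """P-value for over-representation: P(top-left cell >= a)."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    upper = min(row1, col1)  # largest value the top-left cell can take
    favourable = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return favourable / comb(n, col1)


# Worked example with a = 3, b = 1, c = 1, d = 3 (N = 8).
print(table_probability(3, 1, 1, 3))  # 16/70
print(fisher_one_sided(3, 1, 1, 3))   # 17/70
```

For this table the only more extreme outcome with the same margins is a = 4, which contributes 1/70, hence 17/70 in total.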
Multiple testing – FDR vs Bonferroni
Why does multiple testing matter?
Omics = lots of data = lots of hypothesis tests
The p-values must be adjusted to reduce the probability of obtaining false positive results when multiple tests are run on a single data set.
• Bonferroni: the risk is throwing away real enrichments
• FDR: holds only under independence
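The two corrections can be compared on a small list of p-values. This is a minimal sketch: the function names and the example p-values are assumptions, and the FDR adjustment shown is the Benjamini-Hochberg step-up procedure.

```python
def bonferroni(p_values):
    """Multiply every p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four tests.
P_VALUES = [0.01, 0.04, 0.03, 0.005]
ALPHA = 0.05

bonf = bonferroni(P_VALUES)
fdr = benjamini_hochberg(P_VALUES)
print("Bonferroni keeps:", sum(p < ALPHA for p in bonf))
print("FDR keeps:", sum(p < ALPHA for p in fdr))
```

With these four values Bonferroni keeps two tests at α = 0.05 while Benjamini-Hochberg keeps all four, which illustrates how the stricter correction can discard real signals.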
Independence and FDR
• Example UniProt ID: P02830
• GO:0003677: DNA binding
• GO:0003700: sequence-specific DNA binding transcription factor activity
• GO:0008134: transcription factor binding
• GO:0043565: sequence-specific DNA binding
These terms annotate the same protein and are closely related, so the tests on them are not independent.
FDR vs Bonferroni
Why a Network-based approach
High-throughput genomic experiments often lead to the identification of lists of potentially disease-related genes/proteins.
Interpreting a gene list involves enrichment methods that assess functional associations with known pathways and gene ontology terms.
Major limitations of classical enrichment analysis:
• Proteins do not work in isolation
• The underlying molecular interaction network is not taken into account
Related Work: Pina
The aim of PINA [1] is to identify modules [2], highly and densely interconnected regions made up of functionally correlated genes, and to use them to assess the functional annotation.
Markov clustering (MCL)
Molecular complex detection (MCODE)
[1] Cowley MJ, Pinese M, Kassahn KS, Waddell N, Pearson JV, Grimmond SM, Biankin AV, Hautaniemi S, Wu J. (2012) PINA v2.0: mining interactome modules. Nucleic Acids Res, 40, D862-865.
[2] Rives AW, Galitski T. Modular organization of cellular networks. Proc. Natl Acad. Sci. USA 2003;100:1128-1133.
Related Work: Enrichnet
• The starting protein set is mapped onto the STRING network
• The distance between nodes and pathways and/or processes is calculated
• A score is assigned
• The output of this procedure is a ranking table of pathways or processes with association scores
E. Glaab, A. Baudot, N. Krasnogor, R. Schneider, A. Valencia. EnrichNet: network-based gene set enrichment analysis. Bioinformatics, Vol. 28, No. 18 (2012), pp. i451.
The 3 phases of processing
1. Preprocessing phase
• Build the PPI network
• Integrate data
2. Processing phase
• Extract the minimal connecting interaction networks (MCN)
• Simplify the MCN
3. Perform the enrichment analysis on the simplified MCN
Phase 1: Building the PPI network
[Figure: PPI network; each node is a protein, each link is an edge]
Homo sapiens: 9872 nodes, 32968 edges
Phase 1: Data Integration
Each protein is annotated with:
• gene ontology terms from each of the three categories
• the pathways in which it may be involved
Phase 2: Minimal Connecting Network
Phase 2: Minimal Connecting Network - Problems
Visualizing Network
[Figure: forces used to lay out the network: Hooke's law, Coulomb's law, friction force]
Phase 2: Simplify the MCN
Simplify means to eliminate “less important” nodes and
associated edges from the MCN while preserving connectivity
properties between the starting MCN nodes.
• Graph clustering
• Graph filtering
Phase 2: Filtering Network
A score is assigned to every node except those of the initial set, according to one of these metrics:
• Degree
• Weight
• Betweenness
• Modified Betweenness
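A minimal sketch of score-based filtering, using degree as the score. The removal rule is an assumption made for illustration, since the slides do not state it: non-seed nodes whose degree is below a hypothetical `min_degree` threshold are dropped, lowest degree first, but only when the initial (seed) nodes stay connected to each other.

```python
from collections import deque


def seeds_connected(graph, seeds):
    """True if every seed can be reached from the first seed."""
    seeds = list(seeds)
    seen = {seeds[0]}
    queue = deque([seeds[0]])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return all(seed in seen for seed in seeds)


def simplify(adjacency, seeds, min_degree):
    """Drop low-degree non-seed nodes while the seeds remain connected."""
    graph = {node: set(nbrs) for node, nbrs in adjacency.items()}
    candidates = sorted(
        (node for node in graph if node not in seeds),
        key=lambda node: len(graph[node]),
    )
    for node in candidates:
        if len(graph[node]) >= min_degree:
            continue  # score is high enough, keep the node
        trial = {n: nbrs - {node} for n, nbrs in graph.items() if n != node}
        if seeds_connected(trial, seeds):
            graph = trial  # removal is safe, accept it
    return graph


# Hypothetical network: seeds 0 and 2 are joined through node 1.
NETWORK = {0: [1, 4], 1: [0, 2, 3], 2: [1], 3: [1], 4: [0]}
SEEDS = {0, 2}
print(sorted(simplify(NETWORK, SEEDS, min_degree=2)))
```

Nodes 3 and 4 are removed because they only dangle from the network, while node 1 is kept because the two seeds would be disconnected without it.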
[Figure and table: example graph listing each node with its label and degree]
Phase 3: Enrichment analysis
[Figure: filtered network with numbered nodes]
Node    UniProt ID
0       Q9NSB2
1       Q9NWB1
3       Q16787
4       Q9NR09
6       P25054
9       Q14156
5       Q5JSH3
• Perform Fisher’s exact test with Bonferroni correction.
• The nodes of the filtered network constitute the target set.
• The IntAct database is the background.
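The three points above can be sketched as a small enrichment routine: a one-sided Fisher (hypergeometric) test per term, followed by a Bonferroni correction. All names and the toy data are assumptions; in particular, correcting only for the terms that annotate at least one target protein is a design choice of this sketch, not something stated in the slides.

```python
from math import comb


def hypergeom_tail(k, n_target, n_annotated, n_background):
    """P(X >= k): at least k annotated proteins in a target of size n_target."""
    upper = min(n_target, n_annotated)
    favourable = sum(
        comb(n_annotated, i) * comb(n_background - n_annotated, n_target - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_background, n_target)


def enrich(target, background, annotations, alpha=0.05):
    """Return (term, p, adjusted p) for terms enriched in the target set."""
    background = set(background)
    target = set(target) & background
    # Keep only terms that annotate at least one target protein.
    tested = {}
    for term, members in annotations.items():
        members = set(members) & background
        if members & target:
            tested[term] = members
    results = []
    for term, members in tested.items():
        k = len(members & target)
        p = hypergeom_tail(k, len(target), len(members), len(background))
        adjusted = min(1.0, p * len(tested))  # Bonferroni correction
        if adjusted < alpha:
            results.append((term, p, adjusted))
    return sorted(results, key=lambda item: item[2])


# Toy data: 20 background proteins, 4 of them in the target set.
BACKGROUND = set(range(20))
TARGET = {0, 1, 2, 3}
ANNOTATIONS = {
    "term_A": {0, 1, 2, 3},
    "term_B": {3, 10, 11, 12, 13, 14, 15, 16, 17, 18},
}
enriched = enrich(TARGET, BACKGROUND, ANNOTATIONS)
print(enriched)
```

Only `term_A` survives here: all four of its proteins are in the target set, whereas `term_B` covers half of the background and hits the target once.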
Testing: The “adenoma-carcinoma” sequence in CRC
The transformation of normal colonic epithelium into adenomas and then into adenocarcinoma is called the “adenoma-carcinoma sequence”.
Adenoma: 12 genes with SNV
Adenocarcinoma: 40 genes with 42 SNV
Zhou D, Yang L, Zheng L, Ge W, Li D, et al. (2013) Exome Capture Sequencing of Adenoma Reveals Genetic Alterations in Multiple Cellular Pathways at the Early Stage of Colorectal Tumorigenesis. PLoS ONE
Pino MS, Chung DC. The Chromosomal Instability Pathway in Colon Cancer. Gastroenterology, Volume 138, Issue 6, May 2010, Pages 2059-2072, ISSN 0016-5085
Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318: 1108–1113.
Test case: Adenoma (GO process)
Modified betweenness
Go term id    P-value     Go term description
GO:0043065    1.45E-07    positive regulation of apoptotic process
GO:0006915    4.66E-07    apoptotic process
It has been shown [1] [2] that apoptosis plays an important role in the elimination of damaged cells. Several papers have shown an increased frequency of cell death [2] [3] [4] [5] during the transition from normal mucosa to adenoma.
[1] Bedi A, Pasricha PJ, Akhtar AJ, et al. Inhibition of apoptosis during development of colorectal cancer. Cancer Res 1995.
[2] Hawkins N, Lees J, Hargrave R, O’Connor T, Meagher A, Ward R. Pathological and genetic correlates of apoptosis in the progression of colorectal neoplasia. Tumour Biol 1997.
[3] Aotake T, Lu CD, Chiba Y, Muraoka R, Tanigawa N. Changes of angiogenesis and tumor cell apoptosis during colorectal carcinogenesis. Clin Cancer Res 1999.
[4] Sinicrope FA, Roddey G, McDonnell TJ, Shen Y, Cleary KR, Stephens LC. Increased apoptosis accompanies neoplastic development in the human colorectum. Clin Cancer Res 2006.
[5] Koike M. Significance of spontaneous apoptosis during colorectal tumorigenesis. J Surg Oncol.
Test case: Adenoma (Pathway)
Adenoma pathways with modified betweenness filter
Pathway id    P-value      Description
2644607       0.0001658    Beta-catenin phosphorylation cascade [1]
[1] B. Mann, M. Gelos, A. Siedow et al. Target genes of beta-catenin-T-cell-factor/lymphoid-enhancer-factor signaling in human colorectal carcinomas.
Test case Adenocarcinoma: PINA and EnrichNet
EnrichNet: enriched biological process terms
Go term id    P-value     Description
GO:0031000    4.23E-08    response to caffeine
PINA: enriched biological process terms
Go term id    P-value     Description
GO:0000289    0.001107    nuclear-transcribed mRNA poly(A) tail shortening
GO:0000288    0.005218    nuclear-transcribed mRNA catabolic process, deadenylation-dependent decay
GO:0031124    0.006576    mRNA 3'-end processing
GO:0044260    0.006576    cellular macromolecule metabolic process
GO:0000956    0.006576    nuclear-transcribed mRNA catabolic process
Test case: Adenocarcinoma (GO process)
Adenocarcinoma with modified betweenness filter
Go term id    P-value     Description
GO:0019048    2.83E-17    virus-host interaction [1] [2]
GO:0006955    7.52E-14    immune response
GO:0006974    1.05E-13    response to DNA damage stimulus [1] [2]
GO:0045087    2.95E-10    innate immune response
GO:0016567    7.32E-10    protein ubiquitination
GO:0030512    9.74E-10    negative regulation of transforming growth factor beta receptor signaling pathway [3] [4]
GO:0051403    1.25E-09    stress-activated MAPK cascade
GO:0038124    3.47E-09    toll-like receptor signaling pathway
GO:0007179    8.85E-09    transforming growth factor beta receptor signaling pathway [3] [4]
GO:0007050    1.63E-08    cell cycle arrest
GO:0006915    1.63E-08    apoptotic process
GO:0031572    2.63E-07    G2 DNA damage checkpoint [1] [2]
[1] Collins D, Hogan AM, Winter DC. Microbial and viral pathogens in colorectal cancer. The Lancet Oncology, Volume 12, Issue 5, May 2011, Pages 504-512.
[2] Parsonnet J. Bacterial infection as a cause of cancer. Environ Health Perspect 1995; 103: 263–68.
[3] Markowitz SD, Roberts AB. Tumor suppressor activity of the TGF-beta pathway in human cancers. Cytokine Growth Factor Rev 1996;7:93–102.
[4] Kim SJ, Im YH, Markowitz SD, Bang YJ. Molecular mechanisms of inactivation of TGF-beta receptors during carcinogenesis. Cytokine Growth Factor Rev 2000;11:159–68.
Test case: Adenocarcinoma (pathway)
Adenocarcinoma pathways with modified betweenness filter
Pathway id    P-value     Description
2644607       1.26E-05    Loss of Function of FBXW7 in Cancer and NOTCH1 Signaling [1]
450294        2.21E-05    MAP kinase activation in TLR cascade
168180        4.53E-05    TRAF6 Mediated Induction of proinflammatory cytokines [2]
446652        9.80E-05    Interleukin-1 signaling
[1] Rajagopalan H, Jallepalli PV, Rago C, Velculescu VE, Kinzler KW, Vogelstein B, Lengauer C. Inactivation of hCDC4 can cause chromosomal instability. Nature 2004;428:77–81.
[2] Inoue J, Ishida T, Tsukamoto N, Kobayashi N, Naito A, Azuma S, Yamamoto T (January 2000). "Tumor necrosis factor receptor-associated factor (TRAF) family: adapter proteins that mediate cytokine signaling". Exp. Cell Res. 254 (1): 14–24. doi:10.1006/excr.1999.4733. PMID 10623461.
Enrichment issue
• Randomly choose a set of proteins (5, 10, 20 or 30) from UniProt
• Perform enrichment analysis
UniProt is biased towards certain GO terms.
[Figure: chart with a vertical axis from 0 to 5]
Enrichment issues
• With the Bonferroni correction: no significantly enriched terms
• Databases are biased towards certain terms
Enrichment analysis is completely useless!!!!
BUT
“According to the laws of aerodynamics, the bumblebee can’t fly because of the shape and weight of his body in relation to the total wing area, but the bumblebee doesn’t know, so it goes ahead and flies anyway.”
“According to statistics, bioinformaticians can’t perform enrichment analysis, because of Bonferroni corrections and the bias of databases, but bioinformaticians don’t know (till now), so they go ahead and perform enrichment analysis anyway.”
Thank you