Storytelling and Clustering for Cellular
Signaling Pathways
M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys
Department of Computer Science,
Virginia Tech, Blacksburg, VA 24061.
Objective
• STKE Dataset
  - Cell interactions through chemical signals
• Discover relationships between the pathways
  - Graph structure
  - Subgraph discovery problem
• Pathways relationships
  - Clustering
  - Storytelling
Myocyte Adrenergic Pathway (CMP_9043)
Dataset properties
Total Pathways = 50
[Figure: histogram of Number of Pathways in Size Range (0 to 12) by Size Range (bins 1-10 through 100-110)]
Design Pipeline
[Diagram components: STKE Dataset, Preprocessor, Pathway Graphs, Frequent Subgraph Discovery, Frequent Subgraphs, Clustering, NN, Storytelling]
Subsequent Candidate Generation
• Apriori – incremental approach [17]
• FSG [2]
• Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.
• The cost of comparing subgraphs (and core subgraphs) is reduced by using a hash code for each subgraph object.
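The join step described above can be sketched in a few lines of Python. This is a simplified illustration, not the FSG implementation from [2]: it assumes every node label is unique, so a subgraph can be represented as a `frozenset` of edges and no isomorphism test is needed. The helper names `edge` and `generate_candidates` are hypothetical.

```python
from itertools import combinations

def edge(u, v):
    """Undirected edge stored with a canonical (sorted) node order."""
    return tuple(sorted((u, v)))

def generate_candidates(k_subgraphs):
    """Join pairs of k-edge subgraphs that share k-1 edges.

    Assumption: node labels are unique, so a subgraph is just a frozenset
    of edges and the shared core is the set intersection. frozenset is
    hashable, so duplicate candidates collapse through hash lookups.
    """
    candidates = set()
    for a, b in combinations(list(k_subgraphs), 2):
        if len(a) == len(b) and len(a & b) == len(a) - 1:
            candidates.add(a | b)  # union has exactly k+1 edges
    return candidates

# Two 2-edge subgraphs sharing the 1-edge core (m, n), plus an unrelated one.
s1 = frozenset({edge("l", "m"), edge("m", "n")})
s2 = frozenset({edge("m", "n"), edge("n", "o")})
s3 = frozenset({edge("p", "q"), edge("q", "r")})
three_edge_candidates = generate_candidates([s1, s2, s3])
```

Every pair of k-edge subgraphs is inspected here, which is the comparison cost the hash code is meant to reduce.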
[Figure: example subgraphs over nodes l, m, n, o, p, q illustrating candidate generation]
Subsequent Candidate Generation
[Figure: candidate subgraphs produced by joining pairs of subgraphs over nodes l through z, including a candidate marked "Not generated"]
• Instance:
  - Number of 5-edge subgraphs: 21
  - Core subgraph comparisons for s1: 20
Master Pathway Graph (MPG)
Total Unique Nodes: 1205
Total Relations: 1376
SEG - Subgraph Extension Generation
[Figure: example subgraph over nodes l, m, n, o, p and its extensions by neighboring nodes q, r, s]
• Neighborhood Extension
  - Neighborhood list: {q, r, s}
• Comparison is not required.
  - Subgraph is extended from physical evidence
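A minimal sketch of neighborhood extension, under the assumption that SEG grows a subgraph by adding edges of the Master Pathway Graph (MPG) that touch one of its nodes. The function name `extend_subgraph` and the toy edge set are hypothetical and only illustrate why no pairwise subgraph comparison is needed.

```python
def extend_subgraph(subgraph, mpg_edges):
    """Return every (k+1)-edge extension of a k-edge subgraph.

    Assumption: an extension adds one MPG edge that is not yet in the
    subgraph and shares at least one node with it. Each candidate comes
    straight from the MPG, so it exists in the data by construction.
    """
    nodes = {n for e in subgraph for n in e}
    return {
        subgraph | {e}
        for e in mpg_edges
        if e not in subgraph and (e[0] in nodes or e[1] in nodes)
    }

# Toy MPG: a path l-m-n-o plus a detached edge p-q.
mpg = {("l", "m"), ("m", "n"), ("n", "o"), ("p", "q")}
start = frozenset({("l", "m")})
extensions = extend_subgraph(start, mpg)
```

Only the edge (m, n) touches the starting subgraph, so a single 2-edge extension is produced and the detached edge (p, q) is never considered.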
Subgraph Discovery
• What is so novel about pruning edges?
min_sup = 2%
k    # of Subgraphs generated
1    1,376
2    5,380
3    29,565
4    187,508
5    1,274,852
Time (sec.), Existing: 41, 149, 971, 7518 (remaining cells: ---)
‘Importance Factor’ of a subgraph: sfipf
• For the i-th subgraph and the j-th pathway:
  Subgraph frequency: $sf_j = \frac{1}{n_j}$
  Inverse pathway frequency: $ipf_i = \frac{|D|}{|\{p_j : s_i \subseteq p_j\}|}$
  $sfipf_{i,j} = sf_j \times ipf_i$
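The importance factor can be worked through numerically. The sketch below multiplies a subgraph frequency 1/n_j by the ratio of total pathways to pathways containing the subgraph, which is how the slide's formula reads; no logarithm is visible there, so the optional `log_scaled` flag (a hypothetical parameter, as are all names here) covers the tf-idf-style variant in case one was intended.

```python
import math

def sfipf(n_j, total_pathways, pathways_with_subgraph, log_scaled=False):
    """Importance factor of a subgraph in pathway j (illustrative only).

    n_j                    -- the denominator of sf_j (meaning assumed)
    total_pathways         -- |D|, the number of pathways in the dataset
    pathways_with_subgraph -- number of pathways containing the subgraph
    log_scaled             -- assumption: take log of the ratio, as in tf-idf
    """
    sf = 1.0 / n_j
    ratio = total_pathways / pathways_with_subgraph
    ipf = math.log(ratio) if log_scaled else ratio
    return sf * ipf

# Hypothetical numbers: n_j = 10, 50 pathways, subgraph found in 5 of them.
plain_score = sfipf(10, 50, 5)
log_score = sfipf(10, 50, 5, log_scaled=True)
```

A subgraph that appears in every pathway gets the smallest ipf, so a min_sfipf threshold removes ubiquitous subgraphs first.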
Dataset Properties (sfipf)
[Figure: two plots against min_sfipf (0.00 to 0.20): # of pathways left (0 to 50) and # of edges left (0 to 1400)]
Number of edges in MPG = 1376
Total pathways = 50
Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: Time (ms), 0 to 400x10^3, versus k (3 to 21) for FSG and SEG]
Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: Time (ms), 0 to 3000, versus k (3 to 21) for FSG and SEG]
Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: # of Attempts, 0 to 1,250,000, versus k (3 to 21) for FSG and SEG]
k    Number of Subgraphs    Time Saved (%)    Attempts Saved (%)
2    186                    99.83             98.98
3    246                    98.33             86.15
4    305                    98.57             86.38
5    323                    98.95             86.91
6    313                    98.96             85.64
7    279                    98.88             83.25
8    263                    98.67             78.91
9    292                    98.38             74.76
10   364                    98.58             74.75
11   470                    98.76             78.08
12   608                    99.04             81.84
13   785                    99.22             85.02
14   980                    99.38             87.63
15   1117                   99.48             89.48
16   1075                   99.53             90.26
17   804                    99.51             89.40
18   430                    99.34             85.22
19   141                    98.76             71.22
20   20                     96.15             9.19
21   1                      75.74             -574.47
Overall attempts saved = 89.52%
Overall time saved = 99.39%
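A short sketch of how a "saved" percentage of this kind can be computed, assuming it is measured relative to FSG as the baseline. That definition is an assumption, and the function name and sample numbers are hypothetical; it shows why a value can go negative when SEG makes more attempts than FSG.

```python
def percent_saved(baseline, improved):
    """Percentage of the baseline cost that the improved method avoids.

    Assumption: baseline is the FSG cost and improved is the SEG cost.
    A negative result means the improved method cost more.
    """
    return (baseline - improved) / baseline * 100.0

# Hypothetical costs, not values from the experiments.
fewer_attempts = percent_saved(200, 50)    # improved method is cheaper
more_attempts = percent_saved(100, 150)    # improved method is costlier
```

Under this definition a strongly negative entry in the attempts column would mean SEG tried several times as many extensions as FSG tried joins at that k.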
Clustering
• Hierarchical Agglomerative Clustering (HAC)
• k-means
• Unsupervised measure of clusters’ validity
  - Average Silhouette Coefficient (ASC) [19]
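For reference, the Average Silhouette Coefficient can be computed from a pairwise distance matrix and a cluster assignment. The sketch below follows the standard textbook definition; the function name and the toy data are illustrative assumptions, not taken from the slides.

```python
def average_silhouette(dist, labels):
    """Average silhouette coefficient over all points.

    dist   -- square matrix, dist[i][j] is the distance between i and j
    labels -- labels[i] is the cluster of point i (at least two clusters)
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

# Four points on a line; the natural grouping is {0, 1} and {10, 11}.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = average_silhouette(dist, [0, 0, 1, 1])
bad = average_silhouette(dist, [0, 1, 0, 1])
```

Values near 1 indicate compact, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.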
Clustering
min_sup = 4%, min_sfipf = 0.01
[Figures: ASC (0.0 to 0.4) versus # of Clusters (2 to 20), one plot for HAC and one for k-means; legend: Cosine, sfipf, Dice, Jaccard, Overlap]
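The Cosine, Dice, Jaccard and Overlap names in the plot legend are standard similarity measures. As a hedged illustration, the sketch below gives their set-based forms, assuming each pathway is represented by the set of frequent subgraphs it contains; that representation and the function names are assumptions, not details from the slides.

```python
import math

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def dice(a, b):
    size = len(a) + len(b)
    return 2 * len(a & b) / size if size else 0.0

def overlap(a, b):
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

def cosine(a, b):
    # Cosine similarity of the binary indicator vectors of the two sets.
    norm = math.sqrt(len(a) * len(b))
    return len(a & b) / norm if norm else 0.0

# Two hypothetical pathways described by the ids of their subgraphs.
pathway_a = {1, 2, 3}
pathway_b = {2, 3, 4}
scores = {f.__name__: f(pathway_a, pathway_b)
          for f in (jaccard, dice, overlap, cosine)}
```

All four measures depend only on the size of the intersection and the sizes of the two sets, so they differ mainly in how strongly they penalize size mismatch.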
Clustering
[Figures: ASC contour map for 10 clusters using HAC and ASC contour map for 10 clusters using k-means; axes: min_sup (4 to 12) and min_sfipf]
Pathway Relations (StoryTelling)
• Bidirectional Search
• Cover tree for NN
[Diagram: story search between S and T through pathways p1, p2, p3, p7, p8, p9]
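A minimal sketch of bidirectional search over nearest-neighbor links, under several assumptions: items are described by feature sets, distance is Jaccard distance, and a brute-force scan stands in for the cover tree [16]. All function names and the toy data are hypothetical. In each consecutive pair of the returned story, one item is among the other's b nearest neighbors.

```python
from collections import deque

def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def nearest_neighbors(name, items, b):
    """Brute-force b nearest neighbors (a cover tree would replace this)."""
    others = [n for n in items if n != name]
    others.sort(key=lambda n: (jaccard_distance(items[name], items[n]), n))
    return others[:b]

def _join(meet, parents, children):
    """Stitch the forward and backward search trees at the meeting node."""
    story, node = [], meet
    while node is not None:
        story.append(node)
        node = parents[node]
    story.reverse()
    node = children[meet]
    while node is not None:
        story.append(node)
        node = children[node]
    return story

def find_story(items, start, target, b):
    """Alternate one breadth-first level from each end until they meet."""
    if start == target:
        return [start]
    parents, children = {start: None}, {target: None}
    fwd, bwd = deque([start]), deque([target])
    while fwd and bwd:
        for frontier, seen, other in ((fwd, parents, children),
                                      (bwd, children, parents)):
            for _ in range(len(frontier)):
                node = frontier.popleft()
                for nb in nearest_neighbors(node, items, b):
                    if nb not in seen:
                        seen[nb] = node
                        if nb in other:
                            return _join(nb, parents, children)
                        frontier.append(nb)
    return None  # the two searches never met

# Toy items whose feature sets overlap in a chain from S to T.
items = {"S": {1, 2, 3}, "p1": {2, 3, 4}, "p2": {3, 4, 5},
         "p3": {4, 5, 6}, "T": {5, 6, 7}}
story = find_story(items, "S", "T", b=2)
no_story = find_story(items, "S", "T", b=1)
```

With b=1 the two searches dead-end before meeting, while b=2 lets them meet at p2; a larger branching factor admits more candidate links at the cost of more neighbor queries.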
Day-to-day life example
[Diagram: movie neighborhoods and a story connecting two movies. Titles in slide order:]
Roman Holiday, Sabrina, Funny Face, Deep in my Heart, Singing in the rain, An American in Paris
From Roman Holiday
Kismet, Take me out to the Ball Game, On the Town, Anchors Aweigh, High Society, Kiss me Kate
Terminator 3, Collateral damage, Blade: Trinity, van Helsing, Lethal Weapon 4, Die Hard 2, Speed, Air Force One
From Terminator 3
Roman Holiday, Sabrina, Breakfast at Tiffany’s, Some Like it Hot, The day after Tomorrow, Die Another Day, Golden Eye, U.S. Marshals, Rear Window, Terminator 3, S.W.A.T., 2001: A Space Odyssey
From: Roman Holiday
To: Terminator 3
Examples in STKE
• http://people.cs.vt.edu/msh/infoviz/3/
Pathway Relations (StoryTelling)
Numbers of varying length stories for different branching factor
[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16), for b = 2, 4, 6, 8]
Pathway Relations (StoryTelling)
Numbers of varying length stories for different branching factor
[Figure: Number of t-length stories (0 to 350) versus Story length, t (3 to 16), for b = 2 through b = 10]
Pathway Relations (StoryTelling)
[Figures, each plotted against Branching factor, b (2 to 10): Length of the longest story; Total stories from all pairs; Time to generate all stories (ms), 0 to 1.4x10^6]
Future Directions
• Compare our SEG graph methods with text-based clustering and storytelling
• Examine costs and benefits of combining text and graph mining techniques
References
[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of
Cell Signaling", http://stke.sciencemag.org/cm/
[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs",
IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.
[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of
individual tumor samples", BMC Bioinformatics, June 29, 2005.
[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE
Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.
[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for
Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.
[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC.
2007, Vol. 2, pp. 319-323.
[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific Bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105-110.
[8] Abello, J., van Ham, F., and Krishnan, N., "ASKGraphView: A Large Scale Graph Visualization
System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669-676.
[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H. "A clustering method for comparative
analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced
Applications, March 2003, pp. 327-334.
[10] Yan, X., and Han, J., "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-724.
[11] Moti, C., and Ehud, G. "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on
Research issues in data mining and knowledge discovery, 2004, pp. 51-58.
[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J. "Subdue: Compression-based Frequent
Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005,
pp. 71-76.
[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very
Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.
[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with
Background Knowledge", ICML 2001, pp. 577-584.
[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of
practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.
[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.
[17] Agrawal, R., and Srikant, R. "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very
Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.
[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A. and Bollinger, T. "The Quest Data Mining
System", KDD'96, USA, 1996, pp. 244-249.
[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN:
0321321367, April 2005, pp. 539-547.
[20] http://people.cs.vt.edu/amonika/infoviz/
Thank You