EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The UNIVERSITY of Kansas
Model-Based Clustering
What is model-based clustering?
Attempt to optimize the fit between the given data and some mathematical
model
Based on the assumption: data are generated by a mixture of underlying
probability distributions
Typical methods
Statistical approach
EM (Expectation maximization), AutoClass
Machine learning approach
COBWEB, CLASSIT
Neural network approach
SOM (Self-Organizing Feature Map)
10/04/2006
Model-based Clustering
Mining Biological Data
KU EECS 800, Luke Huan, Fall’06
slide2
EM — Expectation Maximization
EM — A popular iterative refinement algorithm
An extension to k-means
Assign each object to a cluster according to a weight (prob. distribution)
New means are computed based on weighted measures
General idea
Starts with an initial estimate of the parameter vector
Iteratively rescores the patterns against the mixture density produced by the parameter
vector
The rescored patterns are used to update the parameter estimates
Patterns belong to the same cluster if their scores place them in the same
component
The algorithm converges fast but may stop at a local rather than the global optimum
AutoClass (Cheeseman and Stutz, 1996)
1D Gaussian Mixture Model
Given a set of data distributed in a 1D space, how do we
cluster the data set?
General idea: factorize the p.d.f. into a mixture of simple
models.
Discrete values: Bernoulli distribution
Continuous values: Gaussian distribution
The EM (Expectation Maximization) Algorithm
Initially, randomly assign k cluster centers
Iteratively refine the clusters based on two steps
Expectation step: assign each data point Xi to cluster Ci with the following
probability
Maximization step:
Estimation of model parameters
$\mu_k = \sum_i x_i \, P(x_i, C_k) \,/\, \sum_i P(x_i, C_k)$
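The two EM steps can be sketched for a 1D Gaussian mixture as follows. This is a minimal illustration, not the course's implementation: the function names, the starting variance of 1.0, the uniform starting weights, and the variance floor are all assumptions, and the initial centers are passed in explicitly (rather than drawn at random) so that the run is reproducible.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_1d(points, init_means, iterations=50):
    """Fit a 1D Gaussian mixture; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)      # initial cluster centers
    variances = [1.0] * k         # assumed starting variance
    weights = [1.0 / k] * k       # assumed uniform starting weights
    for _ in range(iterations):
        # Expectation step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            scores = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(scores)
            if total == 0.0:      # all densities underflowed: fall back to uniform
                resp.append([1.0 / k] * k)
            else:
                resp.append([s / total for s in scores])
        # Maximization step: weighted re-estimation of the model parameters.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            if mass == 0.0:       # empty component: leave its parameters unchanged
                continue
            means[c] = sum(r[c] * x for r, x in zip(resp, points)) / mass
            spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, points)) / mass
            variances[c] = max(spread, 1e-6)  # floor avoids a collapsed component
            weights[c] = mass / len(points)
    return weights, means, variances
```

Unlike k-means, every point contributes to every cluster mean, weighted by its membership probability; the mean update is the weighted average from the maximization step.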
Another Way of K-means?
Pros:
AutoClass can adapt to different (convex) shapes of clusters; k-means assumes spherical clusters
Solid statistical foundation
Cons:
Computationally expensive
Model Based Subspace Clustering
Microarray
Bi-clustering
δ-clustering
p-clustering
OP-clustering
MicroArray Dataset
Gene Expression Matrix
$$\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}$$
Rows: genes
Columns: conditions (time points, cancer tissues)
Data Mining: Clustering
K-means clustering minimizes
$\sum_{t=1}^{k} \sum_{i \in c_t} \mathrm{dist}(x_i, c_t)^2$
where $\mathrm{dist}(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}$
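As a small worked example of the k-means objective, the sketch below evaluates the sum of squared Euclidean distances for a given assignment of points to centers. The function names and the representation of an assignment as a list of cluster indices are assumptions made for illustration.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a point x and a center c."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))


def kmeans_objective(points, centers, assignment):
    """Sum, over all points, of the squared distance to the assigned center."""
    return sum(squared_distance(x, centers[t]) for x, t in zip(points, assignment))
```

K-means alternates between reassigning points and recomputing centers so that this value never increases.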
Clustering by Pattern Similarity (p-Clustering)
The micro-array “raw” data shows 3 genes and their
values in a multi-dimensional space
Parallel Coordinates Plots
Difficult to find their patterns
“non-traditional” clustering
Clusters Are Clear After Projection
Motivation
DNA microarray analysis
Gene    CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3   4392    284   4108    280    228
VPS8     401    281    120    275    298
EFB1     318    280     37    277    215
SSA1     401    292    109    580    238
FUN14   2857    285   2576    271    226
SP07     228    290     48    285    224
MDM10    538    272    266    277    236
CYS3     322    288     41    278    219
DEP1     312    272     40    273    232
NTG1     329    296     33    274    228
Motivation
[Figure: strength (0 to 450) plotted against condition (CH1I, CH1D, CH2B)]
Motivation
The selected objects exhibit strong coherence on the
selected attributes.
They are not necessarily close to each other but rather bear a
constant shift.
Object/attribute bias
bi-cluster
Challenges
The set of objects and the set of attributes are usually
unknown.
Different objects/attributes may possess different biases
and such biases
may be local to the set of selected objects/attributes
are usually unknown in advance
May have many unspecified entries
Previous Work
Subspace clustering
Identifying a set of objects and a set of attributes such that
the set of objects are physically close to each other on the
subspace formed by the set of attributes.
Collaborative filtering: Pearson R
Only considers global offset of each object/attribute.
$$R = \frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}$$
where $\bar{o}_1$ and $\bar{o}_2$ are the mean values of objects $o_1$ and $o_2$
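The Pearson R computation can be sketched as below. The function name is an assumption; the sample values are the VPS8 and EFB1 rows of the DNA microarray table restricted to conditions CH1I, CH1D, and CH2B, where the two genes differ by a constant shift of 83.

```python
import math


def pearson_r(o1, o2):
    """Pearson correlation between two objects given as equal-length value lists."""
    n = len(o1)
    mean1 = sum(o1) / n
    mean2 = sum(o2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(o1, o2))
    var1 = sum((a - mean1) ** 2 for a in o1)
    var2 = sum((b - mean2) ** 2 for b in o2)
    return cov / math.sqrt(var1 * var2)


vps8 = [401, 120, 298]   # VPS8 on CH1I, CH1D, CH2B
efb1 = [318, 37, 215]    # EFB1 on the same conditions
r = pearson_r(vps8, efb1)
```

On these three conditions the correlation is perfect because the offset is constant. Computed over all attributes at once, R measures only a global offset, which is the limitation noted on the slide.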
bi-cluster Terms
Consists of a (sub)set of objects and a (sub)set of
attributes
Corresponds to a submatrix
Occupancy threshold α
Each object/attribute must have at least a fraction α of its entries specified.
Volume: number of specified entries in the submatrix
Base: average value of each object/attribute (in the bicluster)
Biclustering of Expression Data, Cheng & Church
ISMB’00
bi-cluster
Gene        CH1I   CH1D   CH2B   Obj base
VPS8         401    120    298        273
EFB1         318     37    215        190
CYS3         322     41    219        194
Attr base    347     66    244        219
The remaining genes (CTFC3, SSA1, FUN14, SP07, MDM10, DEP1, NTG1) and conditions (CH1B, CH2I) are not part of the bi-cluster.
17 conditions (columns 0 to 16), 40 genes (rows)
Gene       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
YBL069W  139  69   0  69 139 139 139 139  69   0   0  69 110   0  69   0   0
YBL097W    0  69  69  69 110 110 110 110  69   0  69  69 110 110  69   0  69
YBR064W  139 110   0  69  69 110 139 139 139   0  69  69 139  69  69   0   0
YBR065C  139 110   0  69 110 110 110 139 110   0  69 110 139  69  69   0  69
YBR114W  208 179 110  69 110 110 110 161 161   0  69  69 110   0  69   0  69
YCL013W    0   0   0  69  69 139 161 179 139   0  69   0 110   0  69   0  69
YDR149C    0   0   0   0 110 110 110  69 110   0   0   0  69   0  69   0  69
YDR461W  179 161  69  69  69 110  69 110 110   0   0  69   0   0  69   0  69
YDR526C   69 110  69 110 110 161 110  69 139  69  69 110 110 139 110  69 110
YHR061C   69   0  69  69 110 139 110   0   0   0  69  69 110  69  69   0   0
YIL092W  139 161 110 110 139 179 139 110 139  69  69  69 110 110 110  69  69
YIR043C  179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
YJL010C  179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
YJL023C  161 161  69 110 139 161 139 110 161  69 110 139  69  69 110  69  69
YJL033W  208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
YJL076W  161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
YJR162C  139 161 139 161 139 179 161 139  69  69 139  69  69 179 179 110  69
YKL068W  304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
YKL134C   69  69   0  69 110 110  69   0  69   0  69  69 139  69  69   0   0
YLR219W  283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
YLR380W  337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
YLR381W  161 161 220 195 161 195 161 110 110 110 195 179 179  69 139 110 110
YLR382C  208 195 220 161 139 161 161 110 139 110 195 195 195  69 161 139 139
YLR383W  248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
YLR384C  264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
YLR386W  230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
YLR388W  439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
YLR392C  256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
YLR395C  374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
YLR400W  139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
YLR401C  230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
YLR406C  494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
YLR408C  326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
YLR411W  179 139 110  69  69 110  69  69  69  69  69  69  69  69 110  69  69
YLR413W  326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
YLR450W  161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
YLR451W  220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
YLR452C  220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
YLR453C  179 195 110 161 139 179 161 179 161  69 110 139 139 139 161 139 161
YLR454W  283 318 195 271 195 304 289 283 289 304 330 264 256 271 309 277 256
Motivation
[Figure: expression level (0 to 600) versus condition (0 to 16)]
Motivation
[Figure: expression level (0 to 600) versus condition for conditions 3, 5, 9, 14, 15 and the co-regulated genes YBL069W, YBL097W, YBR064W, YBR065C, YBR114W, YCL013W, YDR149C, YDR461W, YDR526C, YHR061C, YIL092W, YIR043C, YJL010C, YJL023C, YJL033W, YJL076W, YJR162C, YKL068W, YKL134C, YLR219W]
bi-cluster
Perfect δ-cluster
$d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}$
$d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}$
$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
[Figure: a bi-cluster with entry $d_{ij}$, object base $d_{iJ}$, attribute base $d_{Ij}$, and bi-cluster base $d_{IJ}$]
Imperfect δ-cluster
Residue:
$$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\ 0, & d_{ij} \text{ is unspecified} \end{cases}$$
bi-cluster
The smaller the average residue, the stronger the
coherence.
Objective: identify δ-clusters with residue smaller than a
given threshold
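The residue of a candidate bi-cluster can be computed as sketched below. Unspecified entries are represented by `None`, the bases are averages over the specified entries only, and the cluster-level residue is taken as the mean of the absolute entry residues over the volume; that averaging choice and the function name are assumptions made for illustration.

```python
def bicluster_residue(matrix):
    """Average absolute residue of a bi-cluster given as a list of rows.

    Unspecified entries are None and contribute a residue of 0 by definition.
    Every row and column is assumed to hold at least one specified entry.
    """
    cols = len(matrix[0])

    def mean(values):
        known = [v for v in values if v is not None]
        return sum(known) / len(known)

    row_base = [mean(row) for row in matrix]                              # d_iJ
    col_base = [mean([row[j] for row in matrix]) for j in range(cols)]    # d_Ij
    cluster_base = mean([v for row in matrix for v in row])               # d_IJ

    total, volume = 0.0, 0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:
                continue
            total += abs(value - row_base[i] - col_base[j] + cluster_base)
            volume += 1
    return total / volume


# VPS8, EFB1, CYS3 on conditions CH1I, CH1D, CH2B (values from the microarray table).
example = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
```

For the example the object bases are 273, 190, 194, the attribute bases are 347, 66, 244, and the bi-cluster base is 219, so every entry residue is 0: the three genes form a perfect δ-cluster.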
Cheng-Church Algorithm
Find one bi-cluster.
Replace the data in the first bi-cluster with random data
Find the second bi-cluster, and go on.
The quality of the bi-cluster degrades (smaller volume,
higher residue) due to the insertion of random data.
The FLOC algorithm
1. Generate initial clusters
2. Determine the best action for each row and each column
3. Perform the best action of each row and column sequentially
4. Improved? If yes, go back to step 2; if no, stop
Yang et al. delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02
The FLOC algorithm
Action: the change of membership of a row(or column)
with respect to a cluster
Example matrix with N=3 rows and M=4 columns:
         column
         1  2  3  4
row 1    3  4  2  2
row 2    1  3  2  3
row 3    4  2  0  4
M+N actions are performed at each iteration
The FLOC algorithm
Gain of an action: the residue reduction incurred by
performing the action
Order of action:
Fixed order
Random order
Weighted random order
Complexity: O((M+N)MNkp)
The FLOC algorithm
Additional features
Maximum allowed overlap among clusters
Minimum coverage of clusters
Minimum volume of each cluster
Can be enforced by “temporarily blocking” certain actions
during the mining process if such actions would violate
some constraint.
Performance
Microarray data: 2884 genes, 17 conditions
100 bi-clusters with smallest residue were returned.
Average residue = 10.34
The average residue of clusters found by the state-of-the-art
method in computational biology is 12.54
The average volume is 25% bigger
The response time is an order of magnitude faster
Concluding Remarks
The bi-cluster model is proposed to capture coherent
objects in an incomplete data set.
base
residue
Many additional features can be accommodated (nearly
for free).
p-Clustering: Clustering by Pattern Similarity
Given objects x, y in O and features a, b in T, the pScore is defined on the 2 by 2
matrix
$$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
A pair (O, T) is a δ-pCluster if for any 2 by 2 matrix X in (O, T),
pScore(X) ≤ δ for some δ > 0
For scaling patterns, taking the logarithm of
$$\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$$
leads to the pScore form
H. Wang, et al., Clustering by pattern similarity in large data sets, SIGMOD’02.
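The pScore test can be sketched as a brute-force check over every 2 by 2 submatrix. This illustrates the definition only and is not the mining algorithm of the cited paper; the function names and the list-of-rows representation are assumptions.

```python
import math
from itertools import combinations


def p_score(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 by 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))


def is_delta_pcluster(matrix, delta):
    """True if every 2 by 2 submatrix of `matrix` (a list of rows) has pScore <= delta."""
    n_cols = len(matrix[0])
    for x, y in combinations(matrix, 2):
        for a, b in combinations(range(n_cols), 2):
            if p_score(x[a], x[b], y[a], y[b]) > delta:
                return False
    return True


def log_transform(matrix):
    """Element-wise natural logarithm, turning scaling patterns into shifting patterns."""
    return [[math.log(v) for v in row] for row in matrix]
```

For example, the rows `[1, 2, 4]` and `[3, 6, 12]` differ by a factor of 3, so they fail the check for small δ on the raw values but pass it after `log_transform`.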
Coherent Cluster
Want to accommodate noise but not outliers
Coherent Cluster
Coherent cluster
Subspace clustering
Pair-wise disparity
For a 2×2 (sub)matrix consisting of objects {x, y} and
attributes {a, b}
$$D\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$
where $(d_{xa} - d_{ya})$ is the mutual bias of attribute a and $(d_{xb} - d_{yb})$ is the mutual bias of attribute b
[Figure: objects x, y, z plotted over attributes a and b, with the values $d_{xa}$, $d_{ya}$, $d_{xb}$, $d_{yb}$ marked]
Coherent Cluster
A 2×2 (sub)matrix is a δ-coherent cluster if its D value is
less than or equal to δ.
An m×n matrix X is a δ-coherent cluster if every 2×2
submatrix of X is a δ-coherent cluster.
A δ-coherent cluster is a maximum δ-coherent cluster if it is not
a submatrix of any other δ-coherent cluster.
Objective: given a data matrix and a threshold δ, find all
maximum δ-coherent clusters.
Coherent Cluster
Challenges:
Finding subspace clustering based on distance itself is already a
difficult task due to the curse of dimensionality.
The (sub)set of objects and the (sub)set of attributes that form
a cluster are unknown in advance and may not be adjacent to
each other in the data matrix.
The actual values of the objects in a coherent cluster may
be far apart from each other.
Each object or attribute in a coherent cluster may bear some
relative bias (which is unknown in advance), and such bias may be
local to the coherent cluster.
Coherent Cluster
Compute the maximum coherent
attribute sets for each pair of objects
Two-way Pruning
Construct the lexicographical tree
Post-order traverse the tree to
find maximum coherent clusters
Coherent Cluster
Observation: Given a pair of objects {o1, o2} and a (sub)set of
attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent
cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$ of the attributes $a_i$
do not differ from each other by more than δ.
[Figure: values of o1 and o2 over attributes a1 to a5; the mutual biases are 3, 2, 3.5, 2, 2.5, all within [2, 3.5]]
If δ = 1.5,
then {a1,a2,a3,a4,a5} is a
coherent attribute set (CAS)
of (o1,o2).
Coherent Cluster
Observation: given a subset of objects {o1, o2, …, ol} and a subset
of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent
cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of
objects (oi,oj) where 1 ≤ i, j ≤ l.
[Figure: a data matrix with objects o1 to o6 as rows and attributes a1 to a7 as columns]
Coherent Cluster
Strategy: find the maximum coherent attribute sets for each
pair of objects with respect to the given threshold δ.
[Figure: two objects r1 and r2 plotted over attributes a1 to a5, first in the original order (mutual biases 3, 2, 3.5, 2, 2.5), then with the attributes sorted by mutual bias as a2, a4, a5, a1, a3 (2, 2, 2.5, 3, 3.5); δ = 1]
The maximum coherent attribute sets define the search space
for maximum coherent clusters.
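Sorting the attributes by mutual bias reduces the search for maximum coherent attribute sets of one object pair to a sliding window over the sorted list. The sketch below is one way to do that; the function name and the dictionary representation are assumptions, and the object values are hypothetical, chosen only to reproduce the mutual biases 3, 2, 3.5, 2, 2.5 from the slide.

```python
def maximum_coherent_attribute_sets(obj1, obj2, delta):
    """Return all maximal attribute sets whose mutual biases span at most delta.

    obj1 and obj2 map attribute names to values.
    """
    # Mutual bias of each attribute, sorted in ascending order.
    biases = sorted((obj1[a] - obj2[a], a) for a in obj1)
    result = []
    start = 0
    for end in range(len(biases)):
        # Shrink the window from the left until its range fits within delta.
        while biases[end][0] - biases[start][0] > delta:
            start += 1
        # The window is maximal if it cannot be extended to the right.
        is_last = end == len(biases) - 1
        if is_last or biases[end + 1][0] - biases[start][0] > delta:
            result.append([a for _, a in biases[start:end + 1]])
    return result


# Hypothetical values giving the mutual biases a1: 3, a2: 2, a3: 3.5, a4: 2, a5: 2.5.
r1 = {"a1": 5, "a2": 4, "a3": 6.5, "a4": 3, "a5": 4.5}
r2 = {"a1": 2, "a2": 2, "a3": 3, "a4": 1, "a5": 2}
```

With δ = 1 the sorted biases 2, 2, 2.5, 3, 3.5 yield two maximal windows, {a2, a4, a5, a1} and {a5, a1, a3}; with δ = 1.5 all five attributes fit in a single set.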
Two Way Pruning
       a0    a1    a2
o0      1     4     2
o1      2     5     5
o2      3     6     5
o3      4   200     7
o4    300     7     6
delta = 1, nc = 3, nr = 3
MCAS
(o0,o2) → (a0,a1,a2)
(o1,o2) → (a0,a1,a2)
MCOS
(a0,a1) → (o0,o1,o2)
(a0,a2) → (o1,o2,o3)
(a1,a2) → (o1,o2,o4)
(a1,a2) → (o0,o2,o4)
Coherent Cluster
Strategy: group object pairs by their CAS and, for
each group, find the maximum clique(s).
Implementation: use a lexicographical tree to
organize the object pairs and to generate all maximum
coherent clusters with a single post-order traversal of the
tree.
[Figure: data matrix with attributes as columns and objects as rows]
      a0    a1    a2    a3
o0     1     4     2     5
o1     2     5     5     8
o2     3     6     5     7
o3     4    20     7     2
o4    30     7     6     6
assume δ = 1
Coherent attribute sets per object pair:
(o0,o1) : {a0,a1}, {a2,a3}
(o0,o2) : {a0,a1,a2,a3}
(o0,o4) : {a1,a2}
(o1,o2) : {a0,a1,a2}, {a2,a3}
(o1,o3) : {a0,a2}
(o1,o4) : {a1,a2}
(o2,o3) : {a0,a2}
(o2,o4) : {a1,a2}
Object pairs grouped by attribute set:
{a0,a1} : (o0,o1) (o1,o2) (o0,o2)
{a0,a2} : (o1,o3),(o2,o3) (o1,o2) (o0,o2)
{a1,a2} : (o0,o4),(o1,o4),(o2,o4) (o1,o2) (o0,o2)
{a2,a3} : (o0,o1),(o1,o2) (o0,o2)
{a0,a1,a2} : (o1,o2) (o0,o2)
{a0,a1,a2,a3} : (o0,o2)
[Figure: lexicographical tree over attributes a0 to a3, with the object pairs stored at the nodes of their attribute sets]
Coherent Cluster
High expressive power
The coherent cluster can capture many
interesting and meaningful patterns
overlooked by previous clustering methods.
Efficient and highly scalable
Wide applications
Gene expression analysis
Collaborative filtering
[Figure: average response time (sec, 0 to 12000) versus number of conditions (10, 20, 50, 100, 200, 500) for subspace cluster and coherent cluster]
Remark
Compared to bi-cluster
Separates noise from outliers well
No random data insertion and replacement
Produces the optimal solution
Definition of OP-Cluster
Let I be a subset of genes in the
database. Let J be a subset of
conditions. We say <I, J>
forms an Order Preserving Cluster
(OP-Cluster), if
one of the following relationships
exists for any pair of conditions
$j_1, j_2 \in J$, $j_1 \ne j_2$:
(1) $\forall i \in I$, $D_{ij_1} < D_{ij_2}$
(2) $\forall i \in I$, $D_{ij_1} > D_{ij_2}$
(3) $\forall i \in I$, $D_{ij_1} \approx D_{ij_2}$, when
$\max_{j_1, j_2 \in J} |D_{ij_1} - D_{ij_2}| < \delta \times \min(|D_{ij_1}|, |D_{ij_2}|)$
[Figure: expression levels plotted over conditions A1 to A4]
Problem Statement
Given a gene expression matrix, our goal is to find all the
statistically significant OP-Clusters. The significance is
ensured by the minimal size thresholds nc and nr.
Conversion to Sequence Mining Problem
(1) $D_{ij_1} < D_{ij_2} \Rightarrow j_1 \prec j_2$
(2) $D_{ij_1} > D_{ij_2} \Rightarrow j_1 \succ j_2$
(3) $D_{ij_1} \approx D_{ij_2} \Rightarrow \mathrm{CanonicalOrder}(j_1, j_2)$
Sequence:
$A_1 \prec A_4 \prec A_3 \prec A_2$
[Figure: expression levels plotted over conditions A1 to A4]
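The conversion of one gene's expression row into a sequence of conditions can be sketched as a sort. This is a simplified illustration: the function name is an assumption, the similarity case (3) is reduced to exact ties, and alphabetical order of the condition names stands in for the canonical order. The expression values are hypothetical, chosen to reproduce the sequence A1, A4, A3, A2.

```python
def to_condition_sequence(expression):
    """Order condition names by ascending expression value.

    Ties are broken by condition name, standing in for the canonical order.
    """
    ordered = sorted(expression.items(), key=lambda item: (item[1], item[0]))
    return [name for name, _ in ordered]


# Hypothetical expression levels for one gene.
gene = {"A1": 10, "A2": 80, "A3": 60, "A4": 30}
sequence = to_condition_sequence(gene)
```

Once every gene is represented by such a sequence, a set of genes that share a common subsequence of conditions preserves the same order on those conditions.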
Mining OP-Clusters: A naïve approach
A naïve approach
Enumerate all possible
subsequences in a prefix
tree.
For each subsequence,
collect all genes that contain
the subsequence.
Challenge:
The total number of distinct
subsequences is
$\sum_{1 \le i \le N} i! \binom{m}{i}$
[Figure: A Complete Prefix Tree with 4 items {a,b,c,d}, rooted at root]
Mining OP-Clusters: Prefix Tree
Goal:
Build a compact prefix tree that
includes only the subsequences
occurring in the original database.
Input sequences:
g1: adbc
g2: abdc
g3: badc
Strategies:
1. Depth-First Traversal
2. Suffix concatenation: Visit only
subsequences that exist in
the input sequences.
3. Apriori Property: Visit only
subsequences that are sufficiently
supported in order to derive
longer subsequences.
[Figure: prefix tree rooted at Root; each node is labeled item:supporting genes, with the labels a:1,2,3, a:1,2, a:3, b:1, b:2, b:3, c:1,2,3, c:1,3, c:1, c:2, c:3, d:1,2,3, d:1,3, d:1, d:2, d:3]
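The role of the Apriori property can be illustrated with a brute-force sketch over the three sequences g1, g2, g3. It grows candidate subsequences depth-first and extends only those that enough genes support. This is not the compact prefix-tree construction itself: it recomputes support by scanning every sequence, and the function names and the `min_support` parameter are assumptions.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` appear in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def frequent_subsequences(sequences, min_support):
    """Map each subsequence supported by at least min_support genes to those genes."""
    items = sorted(set().union(*sequences.values()))
    found = {}

    def grow(prefix):
        for item in items:
            if item in prefix:
                continue
            candidate = prefix + item
            genes = [g for g, s in sequences.items() if is_subsequence(candidate, s)]
            if len(genes) >= min_support:   # Apriori: extend only supported patterns
                found[candidate] = genes
                grow(candidate)

    grow("")
    return found


sequences = {"g1": "adbc", "g2": "abdc", "g3": "badc"}
shared = frequent_subsequences(sequences, min_support=3)
```

With `min_support=3`, the longest subsequence shared by all three genes is `adc`, which corresponds to the path a:1,2,3, d:1,2,3, c:1,2,3 in the prefix tree.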
References
J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation
in a large data set, Proceedings of the 18th IEEE International Conference on
Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data
sets, to appear in Proceedings of the ACM SIGMOD International Conference on
Management of Data (SIGMOD), 2002.
Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its
applications to gene expression data, Bioinformatics and Bioengineering, 2004.
J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional
space, ICDM’03.