Download Image-Based Modeling and 5th

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Transcript
Bi-Clustering
COMP 790-90 Seminar
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
• Let I be a subset of genes in
the database. Let J be a subset
of conditions. We say <I, J>
forms an Order Preserving
Cluster (OP-Cluster), if
one of the following
relationships exists for any pair
of conditions.
Expression Levels
Definition of OP-Cluster
j1 , j2  J , j1  j2
A1
A2
A3
A4
(1)i  I , Dij1  Dij2
(2)i  I , Dij1  Dij2
(3)i  I , Dij1  Dij2
max | Dij1  Dij2 |   min (| Dij1 |, | Dij2 |)
j1 , j2 J
2
when
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Problem Statement
• Given a gene expression matrix, our
goal is to find all the statistically
significant OP-Clusters. The
significance is ensured by the minimal
size threshold nc and nr.
3
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Conversion to Sequence Mining
Problem
(1) Dij1  Dij2  j1  j2
(2) Dij1  Dij2  j1  j2
Expression Levels
(3) Dij1  Dij2  CanonicalO rder ( j1 , j2 )
Sequence:
A1  A4  A3  A2
A1
4
A2
A3
A4
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Ming OP-Clusters: A naïve
approach
• A naïve approach
root
 Enumerate all possible
subsequences in a prefix
tree.
 For each subsequences,
collect all genes that
contain the
subsequences.
• Challenge:
 The total number of
distinct subsequences
are
i
 i! m 
1i  N
5


a
b
b
c
d
c
a
d
c
…
d
c
d
b
d
b
c
c
d
a
d
…
d
c
d
b
c
b
d
c
d
a
…
A Complete Prefix Tree
with 4 items {a,b,c,d}
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Mining OP-Clusters: Prefix
Tree
Goal:
Build a compact prefix tree that
includes all sub-sequences only
occurring in the original database.
Strategies:
g1
adbc
g2
abdc
g3
badc
1. Depth-First Traversal
Root
2. Suffix concatenation: Visit
subsequences that only exist in
the input sequences.
3. Apriori Property: Visit
subsequences that are sufficiently
supported in order to derive
longer subsequences.
a:1,2,3
a:1,2
d:1,2,3
d:1,3
d:1
b:1
c:1
6
c:1,2,3
c:1,3
b:3
b:2
a:3
d:2
d:3
c:2
c:3
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
References
• J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing
subspace correlation in a large data set, Proceedings of the 18th
IEEE International Conference on Data Engineering (ICDE),
pp. 517-528, 2002.
• H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern
similarity in large data sets, to appear in Proceedings of the
ACM SIGMOD International Conference on Management of
Data (SIGMOD), 2002.
• Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced
pClustering and its applications to gene expression data
Bioinformatics and Bioengineering, 2004.
• J. Liu and W. Wang, OP-Cluster: clustering by tendency in
high dimensional space, ICDM’03.
7
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL