Pattern-based Clustering
How to cluster the five objects?
Hard to define a global similarity measure
University at Buffalo The State University of New York
What Is Pattern-based Clustering?
A cluster: a set of objects following the same pattern in a subset of dimensions (Wang et al., 2002)
Challenges
Most clustering approaches do not address the temporal variations in time-series gene expression data, an important feature that affects performance.
Previous approaches try to find coherent patterns and
clusters w.r.t. the entire set of attributes
Patterns may be embedded in sub attribute spaces
Only a subset of genes participates in any cellular process of interest
Any cellular process may take place only in a subset of experimental conditions.
[Figure: a) raw data; b) shifting patterns; c) scaling patterns]
Gene-Sample-Time (GST) Microarray Data
• The GST microarray data consist of three dimensions
• The samples often exhibit various phenotypes, e.g., cancer vs. control
[Figure: 2D time-series data; a collection of samples; 3D gene-sample-time data]
Challenges of Mining GST Data
Most clustering algorithms were designed for 2D data and cannot be directly extended to 3D data.
Challenges        2D data                  3D data
Mining process    Partition genes          Partition genes and samples simultaneously
Cluster model     Two types of variables   Three types of variables
Coherent Gene Cluster
A 3D GST data set
The 2D representation
A coherent gene cluster
• The group of samples (sj1, sj2, sj3 ) may exhibit the
same phenotype
• The group of genes (gi1,gi2,gi3) may be strongly
correlated to the phenotype shared by (sj1, sj2, sj3 )
Results from a Real Data Set
The Multiple Sclerosis (MS) data consist of
4324 genes
13 MS patients
10 time points before and after IFN-β treatment
25 coherent gene clusters were reported
[Figure: an example of a coherent gene cluster (107 genes, 8 samples); one panel per sample, Samples A through H]
Other Types of Coherent Clusters
Problem Definition
Given a GST microarray data matrix M, a maximal coherent gene cluster C = (G × S) is a combination of a subset of genes G and a subset of samples S such that:
Coherent: the genes in G are coherent across the samples in S;
Significant: |G| ≥ ming and |S| ≥ mins, where ming and mins are user-specified parameters;
Maximal: inserting any gene g ∉ G or any sample s ∉ S makes C no longer coherent.
The problem of mining coherent gene clusters is to find the complete set of maximal coherent gene clusters in M.
Coherence Measure
Various coherence measures exist.
Measure selection is application dependent.
A general coherence model
Given a coherence measure sim(•) and a user-specified threshold δ,
A gene ga is coherent on samples si and sj if sim(pai, paj) ≥ δ.
Coherent gene matrix (G1, S1): every gene gi ∈ G1 is coherent across the samples in S1.
Trivial coherent gene matrix: ({gi}, {sj}), (G, {sj})
We choose Pearson's correlation coefficient.
Other coherence measures are also applicable.
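As an illustration of the general coherence model, the sketch below tests whether a gene is coherent on two samples using Pearson's correlation coefficient. It assumes that pai denotes the time series of gene ga on sample si. The function names, the example profiles, and the threshold 0.8 are illustrative choices, not the authors' implementation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def is_coherent(p_ai, p_aj, delta):
    """A gene is coherent on two samples if sim(p_ai, p_aj) >= delta."""
    return pearson(p_ai, p_aj) >= delta

# Hypothetical time series of one gene on four samples.
p_a1 = [1.0, 2.0, 4.0, 3.0]
p_a2 = [3.0, 4.0, 6.0, 5.0]   # p_a1 shifted by +2
p_a3 = [2.0, 4.0, 8.0, 6.0]   # p_a1 scaled by 2
p_a4 = [4.0, 3.0, 1.0, 2.0]   # opposite trend

print(is_coherent(p_a1, p_a2, delta=0.8))  # shifting pattern
print(is_coherent(p_a1, p_a3, delta=0.8))  # scaling pattern
print(is_coherent(p_a1, p_a4, delta=0.8))  # opposite trend
```

Pearson's coefficient is unchanged by shifting or positive scaling of a profile, so the first two checks pass and the third fails.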
Related Work
Clustering algorithms on Gene-Sample or
Gene-Time microarray data
The cluster model is completely different
Subspace clustering
Find subsets of objects coherent with subsets of
attributes
Frequent pattern mining
Find subsets of items frequently appearing in
transaction databases
Algorithm Outline
Phase 1 (Pre-processing) : For each gene g,
find the complete set of maximal coherent
sample sets of gene g.
Phase 2: Compute the complete set of
maximal coherent gene clusters based on
pre-processing results.
Coherent Sample Sets
Given a gene g, a maximal coherent sample set of g is a subset of samples Si such that:
coherent: g is coherent across Si;
significant: |Si| ≥ mins;
maximal: there exists no superset S' ⊃ Si such that g is also coherent across S'.
(g × Si) is a building block for coherent gene clusters that include g.
Preprocessing Phase
Suppose mins = 3
      s1  s2  s3  s4  s5  s6
s1     1   1   0   1   0   0
s2     1   1   0   0   0   0
s3     0   0   1   1   1   1
s4     1   0   1   1   1   1
s5     0   0   1   1   1   1
s6     0   0   1   1   1   1
The coherence matrix of gene g
[Figure: the coherence graph of gene g on samples s1 to s6, and its subgraph on s3, s4, s5, s6]
{s3,s4,s5,s6} is a coherent sample set of gene g
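One way to read the pre-processing example is that the maximal coherent sample sets of gene g are the maximal cliques of its coherence graph that contain at least mins samples. The sketch below works under that assumption, using the basic Bron-Kerbosch recursion on the coherence matrix from the slide. The function names are hypothetical and the slides do not say which clique algorithm the authors use.

```python
SAMPLES = ["s1", "s2", "s3", "s4", "s5", "s6"]

# Coherence matrix of gene g as given on the slide (1 = coherent pair).
MATRIX = [
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
]

def neighbours(matrix):
    """Adjacency sets of the coherence graph (self-loops ignored)."""
    n = len(matrix)
    return {i: {j for j in range(n) if j != i and matrix[i][j]}
            for i in range(n)}

def maximal_cliques(adj):
    """Basic Bron-Kerbosch enumeration of all maximal cliques."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)      # r cannot be extended any further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}            # v is done: move it from P to X
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques

def maximal_coherent_sample_sets(matrix, names, min_s):
    """Maximal cliques with at least min_s samples, as sorted name lists."""
    adj = neighbours(matrix)
    return sorted(sorted(names[i] for i in clique)
                  for clique in maximal_cliques(adj)
                  if len(clique) >= min_s)

print(maximal_coherent_sample_sets(MATRIX, SAMPLES, min_s=3))
```

With mins = 3 the only set that survives is {s3,s4,s5,s6}; the smaller maximal cliques {s1,s2} and {s1,s4} are filtered out by the size threshold.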
Sample-gene Search
Set enumeration tree
Enumerate all subsets of samples
systematically.
Each node on the tree corresponds to a subset
of samples.
For each node S
Find the maximal set of genes Gs which is
coherent with S
Set Enumeration Tree
{}
{a}  {b}  {c}  {d}
{a,b}  {a,c}  {a,d}  {b,c}  {b,d}  {c,d}
{a,b,c}  {a,b,d}  {a,c,d}  {b,c,d}
{a,b,c,d}
The set enumeration tree for {a,b,c,d}
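A set enumeration tree can be walked depth-first with a short recursive generator. The sketch below is a generic illustration with a hypothetical function name: each node extends its parent only with items that come later in the fixed order, so every subset is generated exactly once.

```python
def enumerate_subsets(items):
    """Yield every subset of items in depth-first set-enumeration order."""
    def visit(prefix, start):
        yield prefix
        # A child adds one item that comes after the last item of its parent.
        for k in range(start, len(items)):
            yield from visit(prefix + [items[k]], k + 1)

    yield from visit([], 0)

for subset in enumerate_subsets(["a", "b", "c", "d"]):
    print("{" + ",".join(subset) + "}")
```

The walk visits 16 nodes, starting {}, {a}, {a,b}, {a,b,c}, {a,b,c,d}, {a,b,d}, {a,c}, and so on.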
Find the Maximal Coherent
Subset of Genes
After the pre-processing phase:
g1: {s1, s2, s3, s4, s5}
g2: {s1, s2, s4}, {s1, s5}
g3: {s1, s2, s3, s4, s5}
g4: {s1, s2, s3}, {s5, s6}
g5: {s1, s5, s6}
Given a subset of samples S, how to find the maximal coherent set
of genes GS?
Expensive approach: scan the table once
For each S, GS can be derived by a single scan of the maximal coherent sample sets of all genes: if S ⊆ Sj for some maximal coherent sample set Sj of gene g, then g ∈ GS.
Efficient approach: use the inverted list.
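The scan-based approach can be sketched as a single pass over the pre-processing table: a gene belongs to GS when one of its maximal coherent sample sets contains S. The table is copied from the slide; the function name is hypothetical.

```python
# Pre-processing result from the slide: gene -> maximal coherent sample sets.
MAXIMAL_SAMPLE_SETS = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}

def scan_for_genes(samples, table):
    """Return G_S: genes with some maximal coherent sample set containing S."""
    s = set(samples)
    return {gene for gene, sample_sets in table.items()
            if any(s <= sj for sj in sample_sets)}

print(sorted(scan_for_genes({"s1", "s2", "s3"}, MAXIMAL_SAMPLE_SETS)))
```

Every query touches every sample set of every gene, which is what makes this approach expensive when it is repeated for each node of the enumeration tree.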
The Inverted List
Gene    Maximal coherent sample sets
g1      {s1, s2, s3, s4, s5}
g2      {s1, s2, s4} (g2.b1), {s1, s5} (g2.b2)
g3      {s1, s2, s3, s4, s5}
g4      {s1, s2, s3}, {s5, s6}
g5      {s1, s5, s6}
The table of maximal coherent sample sets for genes
Sample  The inverted list
s1      {g1.b1, g2.b1, g2.b2, g3.b1, g4.b1, g5.b1}
s2      {g1.b1, g2.b1, g3.b1, g4.b1}
s3      {g1.b1, g3.b1, g4.b1}
s4      {g1.b1, g2.b1, g3.b1}
s5      {g1.b1, g2.b2, g3.b1, g4.b2, g5.b1}
s6      {g4.b2, g5.b1}
The table of inverted lists for samples
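The inverted lists can be derived mechanically from the table of maximal coherent sample sets. The sketch below assumes the labelling seen on the slide, where the k-th sample set of gene g is named g.bk; the function name is hypothetical.

```python
# Gene -> maximal coherent sample sets, in the order shown on the slide.
MAXIMAL_SAMPLE_SETS = {
    "g1": [{"s1", "s2", "s3", "s4", "s5"}],
    "g2": [{"s1", "s2", "s4"}, {"s1", "s5"}],
    "g3": [{"s1", "s2", "s3", "s4", "s5"}],
    "g4": [{"s1", "s2", "s3"}, {"s5", "s6"}],
    "g5": [{"s1", "s5", "s6"}],
}

def build_inverted_lists(table):
    """Map each sample to the set of blocks (gene.bk) that contain it."""
    inverted = {}
    for gene, sample_sets in table.items():
        for k, samples in enumerate(sample_sets, start=1):
            block = f"{gene}.b{k}"
            for sample in samples:
                inverted.setdefault(sample, set()).add(block)
    return inverted

inverted = build_inverted_lists(MAXIMAL_SAMPLE_SETS)
for sample in sorted(inverted):
    print(sample, sorted(inverted[sample]))
```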
Intersection Instead of Scanning
Given a subset of samples S = {si1, …, sik}, intersect the inverted lists of si1, …, sik.
For example, given S = {s1, s2, s3}, Ls1 ∩ Ls2 ∩ Ls3 = {g1.b1, g3.b1, g4.b1}, so GS = {g1, g3, g4}.
Suppose the parent of S is S' = {si1, …, sik-1}; then LS = LS' ∩ Lsik.
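The intersection step, including the incremental form that reuses the parent's list, can be sketched as follows. The inverted lists are copied from the slides; the function names are hypothetical.

```python
# Inverted lists from the slides: sample -> blocks (gene.bk) containing it.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}

def list_of(samples, inverted):
    """L_S: intersection of the inverted lists of all samples in S (S non-empty)."""
    return set.intersection(*(inverted[s] for s in samples))

def extend(parent_list, new_sample, inverted):
    """L_S = L_S' ∩ L_sik, reusing the list already computed for the parent S'."""
    return parent_list & inverted[new_sample]

def genes_of(blocks):
    """G_S: the genes that own at least one surviving block."""
    return {block.split(".")[0] for block in blocks}

parent = list_of(["s1", "s2"], INVERTED)      # list of the parent node {s1,s2}
child = extend(parent, "s3", INVERTED)        # list of the child {s1,s2,s3}
print(sorted(child), sorted(genes_of(child)))
```

Because each child only intersects its parent's list with one more inverted list, the work per tree node is a single set intersection.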
Anti-monotonic Property
Given a combination (G × S),
if G is not coherent on S,
then for any superset S' ⊃ S, G cannot be coherent on S'.
For any descendant S' of S on the tree,
let GS be the maximal coherent gene set of S,
let GS' be the maximal coherent gene set of S';
since S' ⊃ S, we have GS' ⊆ GS.
Pruning Irrelevant Samples
Given a subset of samples S = {si1, …, sik}, a sample sj ∈ tails if
j > ik, and
there exist at least ming genes g such that g is coherent with S ∪ {sj}.
Samples sl ∉ tails (irrelevant samples) cannot be used to extend S.
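The tail computation can be sketched on top of the inverted lists. The value ming = 3 used below is an assumption chosen for illustration, and the function names are hypothetical.

```python
# Inverted lists from the slides: sample -> blocks (gene.bk) containing it.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]

def genes_of(blocks):
    """Genes owning at least one of the given blocks."""
    return {block.split(".")[0] for block in blocks}

def tail_samples(s, order, inverted, min_g):
    """Samples after the last sample of S that keep at least min_g genes coherent.

    S must be a non-empty list of samples.
    """
    blocks = set.intersection(*(inverted[x] for x in s))
    last = max(order.index(x) for x in s)
    tails = []
    for sj in order[last + 1:]:
        # Genes coherent with S ∪ {sj} own a block in L_S ∩ L_sj.
        if len(genes_of(blocks & inverted[sj])) >= min_g:
            tails.append(sj)
    return tails

print(tail_samples(["s1"], ORDER, INVERTED, min_g=3))
print(tail_samples(["s1", "s2"], ORDER, INVERTED, min_g=3))
```

Under the assumed ming = 3, sample s6 is irrelevant for {s1} because only g5 remains coherent with {s1, s6}.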
Pruning Unpromising Nodes
Given a subset of samples S = {si1, …, sik},
if |S| + |tails| < mins, then prune the subtree of S.
Let the maximal coherent subset of genes of S be GS;
if there exists a cluster (G' × S') such that
(S ∪ tails) ⊆ S' and
GS ⊆ G',
then prune the subtree of S.
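The two pruning tests can be written as one predicate. This is a minimal sketch that assumes the clusters (G' × S') to compare against are those already reported by the search; the function name and the example values are hypothetical.

```python
def is_unpromising(s, tails, g_s, found, min_s):
    """Return True if the subtree rooted at S can be pruned.

    s, tails: iterables of sample names; g_s: genes coherent on S;
    found: list of (genes, samples) pairs for clusters found so far.
    """
    s, tails, g_s = set(s), set(tails), set(g_s)
    # Rule 1: even using every tail sample, S cannot reach min_s samples.
    if len(s) + len(tails) < min_s:
        return True
    # Rule 2: everything reachable from S is covered by a known cluster.
    reach = s | tails
    return any(reach <= set(samples) and g_s <= set(genes)
               for genes, samples in found)

found = [({"g1", "g3", "g4"}, {"s1", "s2", "s3"})]
print(is_unpromising({"s1", "s3"}, set(), {"g1", "g3", "g4"}, found, min_s=3))
print(is_unpromising({"s2"}, {"s3", "s4"}, {"g1", "g2", "g3", "g4"}, found, min_s=3))
```

In the second call neither rule fires: S ∪ tails has three samples, and {s2, s3, s4} is not contained in the sample set of the known cluster.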
Determination of Maximal Coherent Gene Clusters
The depth-first search strategy:
For any superset S' of S, S' is either
visited before S,
or a descendant of S.
To determine whether a coherent gene cluster (GS × S) is maximal,
check (GS × S) after visiting all its children,
report (GS × S) if it is not subsumed.
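Putting the pieces together, the depth-first strategy can be sketched as a post-order check: a node's cluster is tested for subsumption only after all of its children have been visited. The sketch below applies only the tail test for pruning, assumes ming = 3 and mins = 3, and uses hypothetical function names; it is an illustration on the slide's inverted lists, not the authors' implementation.

```python
# Inverted lists from the slides: sample -> blocks (gene.bk) containing it.
INVERTED = {
    "s1": {"g1.b1", "g2.b1", "g2.b2", "g3.b1", "g4.b1", "g5.b1"},
    "s2": {"g1.b1", "g2.b1", "g3.b1", "g4.b1"},
    "s3": {"g1.b1", "g3.b1", "g4.b1"},
    "s4": {"g1.b1", "g2.b1", "g3.b1"},
    "s5": {"g1.b1", "g2.b2", "g3.b1", "g4.b2", "g5.b1"},
    "s6": {"g4.b2", "g5.b1"},
}
ORDER = ["s1", "s2", "s3", "s4", "s5", "s6"]

def genes_of(blocks):
    """Genes owning at least one of the given blocks."""
    return frozenset(block.split(".")[0] for block in blocks)

def mine(inverted, order, min_g, min_s):
    """Return maximal coherent gene clusters as (genes, samples) pairs."""
    found = []

    def visit(s, blocks, start):
        # Visit children first, extending S only with tail samples.
        for k in range(start, len(order)):
            lst = inverted[order[k]]
            child = set(lst) if blocks is None else blocks & lst
            if len(genes_of(child)) >= min_g:
                visit(s + [order[k]], child, k + 1)
        # All supersets of S have now been visited, so check S itself.
        if s and len(s) >= min_s:
            genes, samples = genes_of(blocks), frozenset(s)
            subsumed = any(genes <= g2 and samples <= s2 for g2, s2 in found)
            if not subsumed:
                found.append((genes, samples))

    visit([], None, 0)
    return found

for genes, samples in mine(INVERTED, ORDER, min_g=3, min_s=3):
    print(sorted(genes), sorted(samples))
```

With the assumed thresholds the sketch reports two clusters: ({g1, g3, g4} × {s1, s2, s3}) and ({g1, g2, g3} × {s1, s2, s4}).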
[Figure: sample-gene search on the inverted lists above; each node S of the enumeration tree is shown with its tails and, where the slide gives one, its intersected list LS]
{}
{s1}: tails {s2,s3,s4,s5}
{s1,s2}: tails {s3,s4}; list {g1.b1, g2.b1, g3.b1, g4.b1}
{s1,s2,s3}: tails {}; list {g1.b1, g3.b1, g4.b1}
{s1,s2,s4}: tails {}; list {g1.b1, g2.b1, g3.b1}
{s1,s3}: tails {}; list {g1.b1, g3.b1, g4.b1}
{s1,s4}: tails {}; list {g1.b1, g2.b1, g3.b1}
{s2}: tails {s3,s4}
{s2,s3}: tails {}; list {g1.b1, g3.b1, g4.b1}
{s2,s4}: tails {}; list {g1.b1, g2.b1, g3.b1}
{s3}: tails {}
{s4}: tails {}
Mining Coherent Gene
Clusters
Systematic enumeration of genes and
samples
Sample-Gene Search
Gene-Sample Search
Pruning rules
Determination of whether a coherent gene cluster (G × S) is maximal
Gene-sample Search
                         Sample-Gene Search                      Gene-Sample Search
Subjects to enumerate    samples                                 genes
Number of subjects       10^1 ~ 10^2                             10^3 ~ 10^4
Coherent objects         Single set of maximal coherent genes    Single or multiple sets of maximal coherent samples
Efficiency on GST data   High                                    Low
Experiment Data Sets
Real-world gene expression data
4324 genes
13 multiple sclerosis (MS) patients
before and at 1,2,4,8,24,48,120 and 168 hours after IFN treatment
Synthetic data
Given the number of genes NG, samples NS and coherent
gene clusters NC
Simulate the pre-processing results
Embed NC maximal coherent gene clusters (G × S)
A Coherent Gene Cluster from Real Data
Effect of Parameters
[Figure: number of clusters vs. ming (mins = 3, δ = 0.8)]
[Figure: number of clusters vs. mins (ming = 10, δ = 0.8)]
[Figure: number of clusters vs. δ (ming = 10, mins = 3)]
Scalability
Scalability of phase 1
[Figure: scalability w.r.t. number of genes (number of samples: 30)]
[Figure: scalability w.r.t. number of samples (number of genes: 3,000)]
Conclusion
We define the new problem of mining coherent gene clusters from the novel gene-sample-time microarray data.
We propose two approaches: the sample-gene search and the gene-sample search.
We conduct an extensive empirical evaluation on both real and synthetic data sets.
Future Work
New problems from the gene-sample-time
microarray data:
Coherent sample clusters (G × S):
for each s ∈ S, any pair of genes gi, gj ∈ G has coherent patterns.
Coherent gene-sample clusters (G × S):
both a coherent gene cluster and a coherent sample cluster.