Mining Phenotype Structures
Chun Tang and Aidong Zhang
Bioinformatics, 20(6):829-838, 2004
University at Buffalo The State University of New York
Microarray Data Analysis
Analysis from two angles
sample as object, gene as attribute
gene as object, sample/condition as attribute
Supervised Analysis
Select training samples (hold out…)
Sort genes (t-test, ranking…)
Select informative genes (top 50 ~ 200)
Cluster based on informative genes
        Class 1    Class 2
g1      1 1 … 1    0 0 … 0
g2      1 1 … 1    0 0 … 0
⋮
g4131   0 0 … 0    1 1 … 1
g4132   0 0 … 0    1 1 … 1
Unsupervised Analysis
We will focus on unsupervised sample partition, which assumes no phenotype information is assigned to any sample.
Since the initial biological identification of sample classes has been slow, typically evolving through years of hypothesis-driven research, automatically discovering sample patterns represents a significant contribution to microarray data analysis.
Many mature statistical methods cannot be applied without the phenotypes of the samples being known in advance.
Unsupervised Automatic Phenotype Structure Mining

[Figure: samples 1 2 3 | 4 5 6 7 8 9 10; gene1–gene4 are informative genes, gene5–gene7 are noninformative genes.]
An informative gene is a gene that manifests the samples' phenotype distinction.
Phenotype structure: sample partition + informative genes.
Automatic Phenotype Structure Mining
[Figure: gene expression matrix → mining → result: a phenotype distinction over the samples ({1 2 3} vs. {4 5 6 7}) and the informative genes (gene1, gene2, gene3).]
Given an n × m data matrix M and the number K of sample phenotypes, the goal is to find K mutually exclusive groups of the samples matching their empirical phenotypes, and to find the set of informative genes which manifests this phenotype distinction.
Requirements
The expression levels of each informative
gene should be similar over the samples
within each phenotype
The expression levels of each informative
gene should display a clear dissimilarity
between each pair of phenotypes
Challenges (1)
The number of genes is very large while the number of samples is very limited, so no distinct class structure of the samples can be properly detected by the existing techniques.
Challenges (2)
[Figure: a list of genes gene1–gene15, of which only a few are informative.]
The limited informative genes are buried in a large amount of noise.
Challenges (3)
Gene LTC4 synthase U50136
Gene Fumarylacetoacetate M55150
Gene C-myb U22376
Gene PROTEASOME IOTA X59417
The values within the data matrices are all real numbers.
None of the informative genes follows the ideal “high–low” pattern.
Related Work
New tools using traditional methods:
Tools: TreeView, CLUTO, CIT, CNIO, GeneSpring, J-Express, CLUSFAVOR
Methods:
• SOM
• K-means
• Hierarchical clustering
• Graph-based clustering
• PCA
The similarity measures used in these methods are based on the
full gene space.
PCs do not necessarily have strong correlation with informative
genes.
Related Work (Cont’d)
Clustering with feature selection:
(CLIFF, two-way ordering, SamCluster)
1. Filter the invariant genes
• Rank variance
• PCA
• CV
2. Partition the samples
• Ncut, Min-Max Cut
• Hierarchical clustering
3. Prune genes based on the partition
• Markov blanket filter
• T-test
Related Work (Cont’d)
Subspace clustering:
• Bi-clustering
• δ-clustering
Related Work (Cont’d)
Subspace clustering only measures trend similarity. In our model, however, we require each gene to show consistent signals on the samples of the same phenotype.
Related Work (Cont’d)
Subspace clustering algorithms only detect locally correlated features and objects, without considering the dissimilarity between different clusters. We want the genes which can differentiate all phenotypes.
Our Contributions
We transformed the phenotype structure mining problem into an optimization problem.
A series of statistics-based metrics are defined as objective functions.
A heuristic search method and a mutual reinforcing adjustment approach are proposed to find phenotype structures.
Model - Measurements
[Figure: samples split into groups S1 and S2; genes gene1–gene3 form the set G'. Intra-consistency is measured within S1 and within S2; inter-divergence is measured between S1 and S2.]
Phenotype Quality Measurement
Intra-consistency

Measurement   Data(A), NOT consistent   Data(B), consistent
residue       0.1975                    0.4506
MSR           0.0494                    0.4012
Ours          339.0667                  5.3000
Intra-pattern-consistency (Cont’d)
For a subset of genes (candidate informative genes), does every gene have good consistency over a set of samples?

Variance of a single gene on the samples within one phenotype:

  Var(i, S') = (1 / (|S'| - 1)) * Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Intra-pattern-consistency: average row variance

  Con(G', S') = (1 / |G'|) * Σ_{gi ∈ G'} Var(i, S')
              = (1 / (|G'| (|S'| - 1))) * Σ_{gi ∈ G'} Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Average variance over the subset of genes – the smaller this value, the better the intra-phenotype consistency.
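As a minimal sketch (our reading of the definition above, not the authors' code; the toy matrix is invented for illustration), the intra-pattern-consistency can be computed with NumPy:

```python
import numpy as np

def con(M, genes, samples):
    """Intra-pattern-consistency Con(G', S'): the average row variance,
    using the 1/(|S'|-1) normalization (ddof=1) from the definition."""
    sub = M[np.ix_(genes, samples)]        # |G'| x |S'| submatrix
    return sub.var(axis=1, ddof=1).mean()  # mean of per-gene variances

# Toy expression matrix (invented): rows = genes, columns = samples.
M = np.array([[1.0, 1.1, 0.9, 5.0, 5.2],
              [2.0, 2.1, 1.9, 7.0, 6.8]])
print(con(M, [0, 1], [0, 1, 2]))  # small value: consistent within the group
print(con(M, [0, 1], [0, 1, 3]))  # large value: the group mixes two phenotypes
```

A sample group that mixes the two expression regimes inflates the per-gene variance, which is exactly what the measure penalizes.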
Inter-pattern-divergence
How well can a subset of genes (candidate informative genes) discriminate two phenotypes of samples?

Both the intra-pattern-consistency and the inter-pattern-divergence of the same gene are reflected.

Average block distance:

  Div(G', S1, S2) = (1 / |G'|) * Σ_{gi ∈ G'} |w̄_{i,S1} - w̄_{i,S2}|

Average difference of the gene means between the two phenotypes – the larger the inter-phenotype divergence, the better.
Pattern Quality
The purpose of pattern discovery is to identify the empirical patterns where the intra-pattern-consistency inside each phenotype is high and the inter-pattern-divergence between each pair of phenotypes is large.

  Ω(G', {S1, ..., SK}) = Σ_{Si, Sj (1 ≤ i < j ≤ K)} Div(G', Si, Sj) / (Con(G', Si) + Con(G', Sj))

The higher the value, the better the quality.
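The three measures fit together as in the following sketch (our reconstruction of the slide formulas, with an invented toy matrix; the sum runs over unordered pairs of sample groups):

```python
import numpy as np

def con(M, genes, samples):
    """Con(G', S'): average row variance (1/(|S'|-1) normalization)."""
    return M[np.ix_(genes, samples)].var(axis=1, ddof=1).mean()

def div(M, genes, s1, s2):
    """Div(G', S1, S2): mean absolute difference of per-gene group means."""
    m1 = M[np.ix_(genes, s1)].mean(axis=1)
    m2 = M[np.ix_(genes, s2)].mean(axis=1)
    return np.abs(m1 - m2).mean()

def quality(M, genes, groups):
    """Phenotype quality: sum over group pairs of Div / (Con + Con)."""
    total = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            total += div(M, genes, groups[i], groups[j]) / (
                con(M, genes, groups[i]) + con(M, genes, groups[j]))
    return total

# Toy matrix (invented): two genes, two clear sample groups.
M = np.array([[1.0, 1.2, 0.8, 5.0, 5.1, 4.9],
              [2.0, 2.1, 1.9, 7.0, 7.2, 6.8]])
print(quality(M, [0, 1], [[0, 1, 2], [3, 4, 5]]))  # higher = clearer structure
```

A partition that matches the phenotype structure yields small Con values and a large Div, so Ω is large; misassigning a sample drives Con up and Ω down.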
Measurements
Intra-consistency:

  Con(G', S') = (1 / (|G'| (|S'| - 1))) * Σ_{gi ∈ G'} Σ_{j ∈ S'} (w_{i,j} - w̄_{i,S'})²

Inter-divergence:

  Div(G', S1, S2) = (1 / |G'|) * Σ_{gi ∈ G'} |w̄_{i,S1} - w̄_{i,S2}|

Phenotype quality:

  Ω(G', {S1, ..., SK}) = Σ_{Si, Sj (1 ≤ i < j ≤ K)} Div(G', Si, Sj) / (Con(G', Si) + Con(G', Sj))
Phenotype Quality
      Data(A)   Data(B)   Data(C)
Con   4.25      3.44      4.52
Div   41.60     25.20     46.16
Ω     14.2687   9.6074    15.3526

Data(C) has the highest phenotype quality.
Model - Formalized Problem
Input
m samples and n genes
the corresponding gene expression matrix M
the number of phenotypes K
Output
A K-partition of the samples (phenotypes) and a subset of genes (the informative space) such that the phenotype quality is maximized.
Strategy
Maintain a candidate phenotype structure and iteratively adjust
the candidate structure toward the optimal solution.
Basic elements:
A candidate structure:
A partition of samples {S1, S2, …, SK}
A subset of genes G' ⊆ G
The corresponding phenotype quality Ω
An adjustment:
For a gene gi ∉ G', insert it into G'
For a gene gi ∈ G', remove it from G'
For a sample si in a group S', move it to another group
The quality gain measures the change of phenotype quality before and after the adjustment.
Heuristic Searching
[Flowchart: candidate structure generation → iterative adjusting: pick up an object, try a gene/sample adjustment on the intermediate candidate structure; if ΔΩ > 0, adjust; otherwise adjust with probability p = exp(ΔΩ / T(i)).]
Heuristic Searching
Start with a random K-partition of samples and a subset of genes as the candidate informative space.
Iteratively adjust the partition and the gene set toward a better solution (random order of genes and samples):
for each gene, try a possible insert/remove;
for each sample, try the best movement.
[Figure: the three adjustment types – insert a gene, remove a gene, move a sample.]
Heuristic Search
For each possible adjustment, compute the quality gain ΔΩ
For each gene, try possible insert/remove
For each sample, try the best movement
ΔΩ > 0: conduct the adjustment
ΔΩ < 0: conduct the adjustment with probability

  p = exp(ΔΩ / T(i))

where T(i) is a decreasing simulated-annealing function and i is the iteration number; T(0) = 1, T(i) = 1/(i + 1) in our implementation.
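The acceptance rule can be sketched as follows (`accept` is a hypothetical helper, not the authors' code; the quality gain ΔΩ is assumed to come from the model above):

```python
import math
import random

def T(i):
    """Cooling schedule from the slides: T(0) = 1, T(i) = 1/(i + 1)."""
    return 1.0 / (i + 1)

def accept(delta_omega, i, rng=random.random):
    """Accept an adjustment: always if the quality gain is positive,
    otherwise with probability p = exp(delta_omega / T(i))."""
    if delta_omega > 0:
        return True
    return rng() < math.exp(delta_omega / T(i))

# The same negative gain is accepted less and less often as i grows.
print(math.exp(-0.5 / T(0)))  # acceptance probability at iteration 0
print(math.exp(-0.5 / T(9)))  # much smaller probability at iteration 9
```

Accepting occasional quality-decreasing moves early on lets the search escape local maxima, while the shrinking temperature makes late iterations effectively greedy.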
Mutual Reinforcing Adjustment - Motivation
Drawbacks of the heuristic search method: blind initialization, equal chance given to samples and genes, noisy samples.
The phenotype quality of a subset of the informative genes and a partial phenotype structure should also be high.
Mining phenotypes and informative genes directly from high-dimensional noisy data is difficult, so we start from small groups whose data distributions and patterns are much easier to detect.
Mining of phenotypes and mining of informative genes should mutually reinforce each other.
Mutual Reinforcing Adjustment - Motivation
[Figure: sample distributions at stages A, B, and C of the adjustment.]
Mutual Reinforcing Adjustment - Major Steps
Partition the Matrix: divide the original matrix into a
series of exclusive sub-matrices based on partitioning
both the samples and genes.
Reference Partition Detection: post a partial or
approximate phenotype structure called a reference
partition of samples.
compute the reference degree for each sample group;
select K groups of samples;
do partition adjustment.
Gene Adjustment: adjust the candidate informative genes.
compute Ω for the reference partition on G';
perform possible adjustments of each gene.
Refinement Phase
Method Detail - Iteration Phase
[Flowchart of one iteration: partitioning the matrix over all samples and the informative genes G' → reference partition detection → gene adjustment (yielding updated informative genes G'') → to the next iteration.]
Partitioning the Matrix
Partition the samples and genes into multiple groups
Use CAST
A threshold t decides the size of each group
Based on Pearson's correlation coefficient:

  ρ_{X,Y} = Σ_{i=1}^{k} (x_i - x̄)(y_i - ȳ) / √( Σ_{i=1}^{k} (x_i - x̄)² · Σ_{i=1}^{k} (y_i - ȳ)² )

Outliers will be filtered out of any group
Samples or genes in the same group share similar patterns
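For reference, the similarity measure can be sketched directly from the formula (a standalone sketch, not tied to any particular CAST implementation; the profiles are invented):

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(pearson([1, 2, 3], [2, 4, 6]))  # 1.0: identical trend
print(pearson([1, 2, 3], [6, 4, 2]))  # -1.0: opposite trend
```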
Reference Partition Detection
Select the groups of samples as potential
phenotypes
Pick the first group with the highest reference degree:

  ref(S_j) = log|S_j| · Σ_{Gi ∈ G'} 1 / Con(Gi, S_j)

Select the other groups by considering the inter-phenotype divergence w.r.t. the already-selected groups:

  Ran(S_px) = log|S_px| · Σ_{Gi ∈ G'} ( Σ_{t=0}^{x-1} Div(Gi, S_px, S_pt) ) / Con(Gi, S_px)
Check the Missing Samples
Probabilistically insert the remaining samples not in the selected groups into the most probable matching group
In iterations, use the gene candidate sets to
improve the reference partition
Gene Adjustment
Gene adjustment: Test the possible adjustments that
lead to improvement
Insert a gene
Remove a gene
Method-Refinement Phase
The partition corresponding to the best state may not
cover all the samples.
Add every sample not covered by the reference partition into its matching group → the phenotypes of the samples.
Then, a gene adjustment phase is conducted. We execute all adjustments with a positive quality gain → the informative space.
Time complexity: O(n·m²·I)
Mining Multiple Phenotype Structures
[Figure: samples 1–10 over genes gene1–gene9; one subset of genes manifests the empirical phenotype structure, while another subset manifests a hidden phenotype structure.]
Output: p phenotype structures where the tth structure is a Kt-partition of
samples (phenotypes) and a subset of genes (informative space) which
manifest the sample partition. The overall phenotype quality is maximized.
Extended Algorithm Strategy
Maintain p candidate phenotype structures and iteratively adjust
them toward the optimal solution.
Basic elements of each candidate structure:
A Kt-partition of samples
A subset of genes Gt ⊆ G
The corresponding phenotype quality Ωt
An adjustment:
For a gene gi ∉ Gt, insert it into Gt
For a gene gi ∈ Gt', move it to Gt (t ≠ t') or remove it from all structures
For a sample si in group S', move it to another group
The quality gain measures the change of pattern quality of the states before and after the adjustment.
The Extended Algorithm (Cont’d)
[Figure: gene adjustments (insert, move, remove) and sample moves across candidate structure 1 and candidate structure 2.]
Mining Multiple Phenotype Structures
(Cont’d)
Partially informative genes
Formalized Problem
Input
•m samples and n genes
•the corresponding gene expression matrix M
•the number of phenotype structures p
•the set of numbers {K1, K2, …, Kp}
Output
p phenotype structures where the tth structure is a
Kt-partition of samples (phenotypes) and a subset
of genes (informative space) which manifest the
sample partition. The overall phenotype quality is
maximized.
The Algorithm
Candidate Structure Generation
cluster genes into p' groups (p' > p) (CAST)
generate sample partitions one by one on the clusters of genes; select the best-quality genes.
Iterative Adjustment
for each gene, try possible insert/move/remove
for each sample:
- examine all possible adjustments
- select the best movement.
The Algorithm (Cont’d)
Gene: p possible adjustments (insert, move, remove)
Sample: Kt - 1 possible adjustments for each partition
The Algorithm (Cont’d)
Data Standardization
the original gene intensity values → relative values:

  w'_{i,j} = (w_{i,j} - w̄_i) / σ_i , where w̄_i = (1/m) Σ_{j=1}^{m} w_{i,j} and σ_i = √( Σ_{j=1}^{m} (w_{i,j} - w̄_i)² / (m - 1) )

Random order of genes and samples
Conduct a negative action with a probability (simulated-annealing technique):

  p = exp(ΔΩ / T(i)) , with T(0) = 1 and T(i) = 1 / (1 + i).
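The per-gene z-score transformation can be sketched as follows (a minimal sketch assuming rows are genes and columns are samples; the matrix is invented):

```python
import numpy as np

def standardize(M):
    """Per-gene z-score: w'_ij = (w_ij - mean_i) / sigma_i, using the
    (m-1)-denominator sample standard deviation (ddof=1) per row."""
    mean = M.mean(axis=1, keepdims=True)
    std = M.std(axis=1, ddof=1, keepdims=True)
    return (M - mean) / std

# Toy matrix (invented): two genes on very different intensity scales.
M = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
W = standardize(M)
print(W)  # each row now has mean 0 and unit sample variance
```

After standardization the two genes become directly comparable, which is why the measures above can be averaged across genes.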
Experiments
Data Sets:
Multiple-sclerosis data
MS-IFN : 4132 * 28 (14 MS vs. 14 IFN)
MS-CON : 4132 * 30 (15 MS vs. 15 Control)
Leukemia data
7129 * 38 (27 ALL vs. 11 AML)
7129 * 34 (20 ALL vs. 14 AML)
Colon cancer data
2000 * 62 (22 normal vs. 40 tumor colon tissue)
Hereditary breast cancer data
3226 * 22 ( 7 BRCA1, 8 BRCA2, 7 Sporadics)
Rand Index
Rand Index -A measurement of “agreement”
between the ground-truth (P) and the results (Q) :
“a”: the number of pairs of objects that are in the same class in P and in the same class in Q;
“b”: the number of pairs of objects that are in the same class in P but not in the same class in Q;
“c”: the number of pairs of objects that are in the same class in Q but not in the same class in P;
“d”: the number of pairs of objects that are in different classes in P and in different classes in Q.

  RI = (a + d) / (a + b + c + d)
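The Rand index can be computed by direct pair counting, as in this sketch (toy labelings invented for illustration):

```python
from itertools import combinations

def rand_index(P, Q):
    """Rand index between two labelings of the same objects:
    RI = (a + d) / (a + b + c + d) over all object pairs."""
    a = b = c = d = 0
    for i, j in combinations(range(len(P)), 2):
        same_p, same_q = P[i] == P[j], Q[i] == Q[j]
        if same_p and same_q:
            a += 1        # together in both P and Q
        elif same_p:
            b += 1        # together in P only
        elif same_q:
            c += 1        # together in Q only
        else:
            d += 1        # apart in both
    return (a + d) / (a + b + c + d)

print(rand_index([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0: perfect agreement
print(rand_index([1, 1, 2, 2], [1, 2, 1, 2]))  # lower: partial agreement
```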
[Figure: example partitions P and Q over samples s1 and s2, illustrating the pair counts a, b, c, d.]
Phenotype Structure Detection
Data Set     MS-IFN    MS-CON    Leukemia-G1   Leukemia-G2   Colon     Breast
Data Size    4132*28   4132*30   7129*38       7129*34       2000*62   3226*22
J-Express    0.4815    0.4851    0.5092        0.4965        0.4939    0.4112
CLUTO        0.4815    0.4828    0.5775        0.4866        0.4966    0.6364
CIT          0.4841    0.4851    0.6586        0.4920        0.4966    0.5844
CNIO         0.4815    0.4920    0.6017        0.4920        0.4939    0.4112
CLUSFAVOR    0.5238    0.5402    0.5092        0.4920        0.4939    0.5844
δ-cluster    0.4894    0.4851    0.5007        0.4538        0.4796    0.4719
Heuristic    0.8052    0.6230    0.9761        0.7086        0.6293    0.8638
Mutual       0.8387    0.6513    0.9778        0.7558        0.6827    0.8749
Experiments
Data Size   Iterations (mean / std dev)   Running time (mean / std dev)
4132*28     158 / 27.2                    180 / 35.1
4132*30     168 / 29.5                    195 / 37.8
7129*38     171 / 16.1                    436 / 51.9
7129*34     198 / 35.9                    458 / 101.2
2000*62     133 / 17.8                    479 / 98.5
3226*22     157 / 22.2                    167 / 35.6

The mean value and standard deviation of the number of iterations and response time (in seconds) with respect to the matrix size.
Phenotype Structure Detection (Cont'd)
Experimental Results
The mutual reinforcing approach as applied to the MS-IFN group.
(A) shows the distribution of the original 28 samples. Each point represents a sample with 4132 genes mapped to two-dimensional space.
(B) shows the distribution in the middle of the adjustment.
(C) shows the distribution of the same 28 samples after the iterations. 76 genes were selected as the informative space.
Informative Gene Selection - Experimental Results
Phenotype Structures
Informative Gene Selection - Experimental Results (Cont'd)
Scalability Evaluation
Experimental Results
Conclusion from the Experiments
The work is motivated by the needs of emerging
microarray data analysis.
The strategy is designed for data which have the
following properties:
The number of samples is limited but the gene
dimension is very large.
Large volumes of irrelevant and redundant genes
prevent accurate grouping of samples;
Analyzing objects along one dimension can enhance the detection of meaningful patterns in the other dimension.