Biclustering of Expression Data
by Yizong Cheng and George M. Church
Presented by Bojun Yan
March 25, 2004
Outline
1. Microarray and related research
1.1 Microarray Gene Expression Data
1.2 Main research on Microarray
2. Why Bicluster?
2.1 Preceding research and its faults
2.2 The concept of Bicluster
2.3 Similarity measure
3. The hardness of Biclustering
4. Methods proposed by this paper
4.1 Related Work and the paper’s goal
4.2 Definition of mean squared residue score
4.3 Scores of some special matrices
4.4 Theorems deduced by the authors
4.5 Algorithms proposed by this paper
5. Experiment
5.1 Data preparation
5.2 Determining Algorithm Parameters
5.3 Final Algorithm
5.4 Results and Display
1. Microarray and related research
1.1 Microarray gene expression data:
Generated by DNA chips and other microarray techniques.
Rows are genes; columns are conditions or samples.
1.2 Main research on Microarray
(1) Gene Clustering: Finding genes that have similar functions
(2) Condition Clustering: Helpful for case analysis
(3) Classification: Tumor classification, cancer prediction
(4) Gene Selection: Finding the genes related to a disease
(5) Gene Network: Exploring the regulatory interactions between genes
1.3 Paper Target: Biclustering
2. Why Bicluster?
2.1 Preceding research and its faults
• Goal: Discover regulatory patterns or condition similarities
• Methods: Based on Euclidean distance or the dot product between vectors (all dimensions equally weighted)
(1) Group genes (rows)
(2) Group conditions (columns)
• Result: Genes or conditions are partitioned into mutually exclusive groups or hierarchies
• Faults: Discovering some similarity groups obscures others
2.2 The concept of Bicluster
Clustering genes (rows) and conditions (columns) simultaneously; a form of subspace clustering
2.3 Similarity Measure
(1) Distance metrics, such as Minkowski distances
(2) Cosine measure
(3) Pearson correlation
(4) Extended Jaccard similarity
(5) Mean squared residue (proposed by this paper)
+ A measure of the coherence of the genes and conditions in the bicluster
+ A symmetric function of the genes and conditions
+ Groups genes and conditions simultaneously
3. Hardness of biclustering
• The problem of finding a maximum bicluster with a score lower than a threshold includes, as a special case, the problem of finding a maximum biclique in a bipartite graph
• Finding the largest constant square submatrix is proven to be NP-hard (Johnson, 1987)
• The problem of finding a minimum set of biclusters, either mutually exclusive or overlapping, that covers all the elements in a data matrix has been shown to be NP-hard (Orlin, 1977)
4. Methods proposed by this paper
4.1 Related Work and the paper’s goal
(1) Related Work
• Divisive algorithm: partitioning data into sets with approximately constant values, proposed by Morgan and Sonquist (1963) and Hartigan (1972)
• Hartigan mentioned that the criterion for partitioning may be a two-way analysis of variance model, similar to the mean squared residue scoring proposed in this article
• Mirkin (1996) presents a node addition algorithm.
• The term “biclustering” was used by Mirkin (1996) to mean simultaneous clustering of both the row and column sets of a data matrix.
• The terms “direct clustering” (Hartigan, 1972) and “box clustering” (Mirkin, 1996) have the same meaning.
(2) The Paper’s Goal and Criterion:
• Goal: Find a set of genes showing strikingly similar up-regulation and down-regulation under a set of conditions.
• Criterion: A low mean squared residue score plus a large variation from the constant identifies these genes and conditions
• Overlapping: Biclusters should be allowed to overlap in expression data analysis
4.2 Definition of mean squared residue score
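The formula shown on this slide did not survive in the transcript. For reference, the mean squared residue of a submatrix with row set I and column set J, as it is usually stated for this paper, is restated below. The notation a_iJ, a_Ij and a_IJ (row, column and overall means of the submatrix) is used here as a reader's aid and is not taken from the slide text.

```latex
a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \qquad
a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}, \qquad
a_{IJ} = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J} a_{ij}

H(I,J) = \frac{1}{|I|\,|J|} \sum_{i \in I,\, j \in J}
         \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^{2}
```

A submatrix is called a δ-bicluster when H(I,J) ≤ δ.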
The row variance:
An accompanying score used to reject trivial or constant biclusters.
4.3 Scores of some special matrices
• A special case of a perfect score (a zero mean squared residue score) is a constant bicluster whose elements all have a single value
• For the matrix a_ij = ij, i,j > 0, no submatrix of a size larger than a single cell has a score lower than 0.5
• A K×K matrix of all 0s except one 1 has the score (K-1)^2 / K^4
• A matrix with elements randomly and uniformly generated in the range [a,b] has an expected score of (b-a)^2 / 12. For example, if the range is [0,800], the expected score is 53,333.
1
1


1
1
1
1
1 1
1
 1
1
 1
1
1
1
1 1 1  1
 4 4 4 4


7 7 7 7
4 4 4
1 2 3  2
 4 5 6 5


7 8 9 8
4 5 6
H (I , J )  0
H (I , J )  0
H (I , J )  0
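The special cases above can be checked numerically. The following is a minimal sketch in plain Python, assuming the usual definition of the mean squared residue (the average of the squared residues a_ij - a_iJ - a_Ij + a_IJ). The function and variable names are illustrative and do not come from the slides. For the K×K matrix with a single 1, working the definition through by hand gives (K-1)^2 / K^4, which the sketch compares against.

```python
def mean_squared_residue(matrix):
    """Return H(I, J) for a submatrix given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


def single_one(k):
    """K x K matrix of zeros with a single 1 in the top-left corner."""
    matrix = [[0] * k for _ in range(k)]
    matrix[0][0] = 1
    return matrix


constant = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
constant_rows = [[1, 1, 1], [4, 4, 4], [7, 7, 7]]
additive = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

for name, m in [("constant", constant),
                ("constant rows", constant_rows),
                ("additive", additive)]:
    print(name, mean_squared_residue(m))

k = 5
print("single 1:", mean_squared_residue(single_one(k)), (k - 1) ** 2 / k ** 4)
```

All three example matrices score zero because every row differs from every other row by a constant offset, which is exactly the pattern the residue is designed to ignore.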
• Some characteristics of the mean squared residue score
(1) Adding a constant to every element of the matrix does not affect the H(I,J) score
(2) Multiplying every element by a constant scales the score by the square of the constant
(3) Neither operation affects the ranking of the biclusters in a matrix
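The shift and scale properties can be confirmed on a small example. This is a sketch only, again assuming the usual definition of the mean squared residue; the matrix `base` and the constants 10 and 3 are arbitrary illustrative choices.

```python
def mean_squared_residue(matrix):
    """Return H(I, J) for a submatrix given as a list of equal-length rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    return sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows) for j in range(n_cols)
    ) / (n_rows * n_cols)


base = [[1, 2, 4], [3, 7, 5], [6, 8, 13]]
shifted = [[value + 10 for value in row] for row in base]  # add a constant
scaled = [[3 * value for value in row] for row in base]    # multiply by a constant

h_base = mean_squared_residue(base)
print("base   :", h_base)
print("shifted:", mean_squared_residue(shifted))  # same as base, up to rounding
print("scaled :", mean_squared_residue(scaled))   # 3**2 times base, up to rounding
```

Because scaling multiplies every score by the same positive factor, the order of any two biclusters by score is unchanged, which is property (3).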
4.4 Theorems deduced by the authors
4.5 Algorithms proposed by this paper
Comments on Algorithm 0:
• Algorithm 0, although it runs in polynomial time, is not efficient enough for a quick analysis of most expression data matrices.
• The complexity of Algorithm 0 is O((n+m)nm)
Comments on Algorithm 1:
• Each iteration requires a complete recalculation of step 1 and step 2
• The time complexity of Algorithm 1 is O(nm)
• More efficient than Algorithm 0, but not the best.
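The slides comment on Algorithm 1 without reproducing it. The sketch below assumes it is a single node deletion procedure: compute the score, stop if it is at most δ, otherwise remove the one row or column that contributes the largest mean squared residue, and repeat. The function names, the example matrix and the guard that stops at a single row or column are illustrative assumptions, not the authors' code.

```python
def scores(matrix, rows, cols):
    """Return (H, d_row, d_col) for the submatrix selected by rows and cols."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    d_row = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, d_row, d_col


def single_node_deletion(matrix, delta):
    """Greedily delete one row or column at a time until H <= delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while True:
        # Everything is recomputed from scratch in each iteration.
        h, d_row, d_col = scores(matrix, rows, cols)
        # Assumption: also stop at a single row or column, whose score is 0.
        if h <= delta or len(rows) == 1 or len(cols) == 1:
            return rows, cols, h
        worst_row = max(rows, key=d_row.get)
        worst_col = max(cols, key=d_col.get)
        if d_row[worst_row] >= d_col[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)


# Three rows that differ by constant offsets, plus one unrelated row.
data = [[1, 2, 3, 4], [2, 3, 4, 5], [4, 5, 6, 7], [9, 1, 8, 2]]
found_rows, found_cols, found_h = single_node_deletion(data, delta=0.01)
print(found_rows, found_cols, found_h)
```

On this toy input the unrelated last row is removed first, leaving the three coherent rows over all four columns.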
Comments on Algorithm 2:
• The parameter α > 1 must be selected properly
• The score is not updated after the removal of each node
• The time complexity of Algorithm 2 is O(log n + log m)
• Some large δ-biclusters may be missed
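Algorithm 2 is likewise not reproduced on the slides. The sketch below assumes a multiple node deletion scheme consistent with the comments above: in each pass, every row whose mean squared residue exceeds α·H(I,J) is removed at once, the scores are recomputed, and the same is done for columns. When a pass removes nothing, this sketch falls back to deleting the single worst node. That fallback, the extra stopping checks and all names are assumptions made for illustration.

```python
def scores(matrix, rows, cols):
    """Return (H, d_row, d_col) for the submatrix selected by rows and cols."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
          for i in rows for j in cols}
    h = sum(sq.values()) / (len(rows) * len(cols))
    d_row = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    d_col = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return h, d_row, d_col


def multiple_node_deletion(matrix, delta, alpha):
    """Delete batches of rows and columns until H <= delta (alpha > 1)."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while len(rows) > 1 and len(cols) > 1:
        h, d_row, d_col = scores(matrix, rows, cols)
        if h <= delta:
            break
        size_before = len(rows) + len(cols)
        # Remove every row above the threshold without rescoring in between.
        rows = [i for i in rows if d_row[i] <= alpha * h]
        h, d_row, d_col = scores(matrix, rows, cols)
        if h <= delta:  # assumption: stop early once the target is reached
            break
        cols = [j for j in cols if d_col[j] <= alpha * h]
        if len(rows) + len(cols) == size_before:
            # Nothing exceeded the threshold: delete the single worst node.
            worst_row = max(rows, key=d_row.get)
            worst_col = max(cols, key=d_col.get)
            if d_row[worst_row] >= d_col[worst_col]:
                rows.remove(worst_row)
            else:
                cols.remove(worst_col)
    h, _, _ = scores(matrix, rows, cols)
    return rows, cols, h


data = [[1, 2, 3, 4], [2, 3, 4, 5], [4, 5, 6, 7], [9, 1, 8, 2]]
kept_rows, kept_cols, kept_h = multiple_node_deletion(data, delta=0.01, alpha=1.2)
print(kept_rows, kept_cols, kept_h)
```

Removing nodes in batches needs fewer passes than deleting one node at a time, at the cost of possibly discarding rows or columns that a more careful search would have kept, which is why some large δ-biclusters may be missed.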
Comments on Algorithm 3:
• The time complexity is O(mn)
• The resulting δ-bicluster may still not be maximal, for two reasons:
(1) Lemma 3 gives only a sufficient condition for adding rows and columns
(2) After rows and columns are added, the score may drop to a value much smaller than δ
5. Experiment
5.1 Data preparation
Datasets and Parameters:
(1) Yeast data, δ = 300, n = 100
(2) Human data, δ = 1200, n = 100
Missing Data Replacement:
Missing values are replaced with random numbers drawn from a uniform distribution
Biclusters are compared with the clustering results from
(1) Tavazoie et al. (1999)
(2) Alizadeh et al. (2000)
5.2 Determining Algorithm Parameters
Parameters were chosen with reference to the clusters reported by Tavazoie et al. (1999) and Alizadeh et al. (2000)
For yeast data, δ = 300, α = 1.2
For human gene data, δ = 1200, α = 1.2
The number of biclusters is n = 100
Masking discovered biclusters: Because the algorithms are deterministic, each time a bicluster is discovered its elements are replaced by random numbers, so that the next run does not return the same bicluster
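The masking step can be sketched as follows. This is an illustration only: the function name, the fixed seed and the value range [0, 800] are assumptions, and the slides do not state which range the random numbers are drawn from.

```python
import random


def mask_bicluster(matrix, rows, cols, low, high, rng):
    """Overwrite the cells of a discovered bicluster, in place, with uniform random values."""
    for i in rows:
        for j in cols:
            matrix[i][j] = rng.uniform(low, high)


data = [[1, 2, 3, 4], [2, 3, 4, 5], [4, 5, 6, 7], [9, 1, 8, 2]]
rng = random.Random(0)  # fixed seed so the run is repeatable

# Suppose a bicluster over rows 0-2 and columns 0-1 has just been reported.
mask_bicluster(data, rows=[0, 1, 2], cols=[0, 1], low=0, high=800, rng=rng)
for row in data:
    print(row)
```

After masking, the same deterministic search run on the modified matrix is unlikely to report the same rows and columns again, because the coherent pattern in those cells has been replaced by noise.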
5.3 Final Algorithm
[Figures: Biclusters for yeast data]
[Figures: Biclusters for human data]