Clustering approaches for temporal microarray gene expression data
Irsal Jasebel Alsanea
Electrical Engineering and Computer Science Department, Northwestern University
Abstract: Clustering is the division of data into groups of similar objects. Temporal
gene expression data has the potential to generate a great deal of biological knowledge
using microarray technology. In this paper, we explore and implement different gene
clustering methods for temporal microarray expression data, and propose a combined
method to improve upon previous methods. We propose that pre-clustering with a
Transitional State Discrimination algorithm (Template-Based), as well as a
TC_linkage_infer algorithm (Shape-Based), and clustering with a Pointwise similarity
algorithm will reduce information loss in the processing of genomic data.
Keywords: Clustering, time-series, microarray, bioinformatics, fuzzy c-means
algorithm, k-means clustering algorithm
Received December 8, 2012
For a compiled list of software used, go to http://collablab.northwestern.edu/irsal/bioinformatics/
1. Introduction
Recent developments in microarray technology have yielded revolutionary contributions to genomics. Microarrays allow the expression of tens of thousands of genes to be monitored in parallel. The analysis of microarray data is increasingly becoming a major bottleneck in the utilization of the technology [1].

Microarray experiments can be divided into two main types: static and time-series. In static experiments, gene expression measurements are taken once from each of a number of samples. In time-series experiments, gene expression levels are measured in a single sample at a number of points in time [4]. Clustering is used to make sense of microarray data. Similar to clustering approaches in the social and physical sciences, it divides large sets of gene expression profiles into smaller sets of comparably similar profiles, grouped by different distance or correlation measures.

Time-series microarray experiments have far greater applications than static experiments [4]. First, they are used to discover the dynamics behind various biological systems. Second, time-series microarrays are used to study the development of different controlling genes (for example, does gene 1 express or suppress gene 2?). Third, they allow scientists to study disease progression (such as cancer) over time and in greater depth. Fourth, time-series microarrays enable novel methods of drug discovery by allowing for the observation of genetic responses to varied cues.

Expression levels in microarrays are measured by the intensity and frequency of the fluorescence tags (or dye). The tags (depending on the experiment) reveal genes that have been inhibited (typically red fluorescence) or activated (typically green fluorescence). Temporal data is taken at time points, which can be in hour, minute or second units, depending on the temporal activity of the genes.

In this paper, we describe different time-series clustering algorithms and propose a combined method that improves upon existing methods.

2. Discriminative Algorithms
Time-series clustering approaches for microarray data can be broken down into two types: discriminative and generative algorithms. Discriminative algorithms define a pairwise similarity function and then apply that function to cluster similar data points together.

2.1 Pointwise Similarity
Pointwise similarity algorithms are the simplest and easiest to implement of all clustering algorithms. K-means is a well-known pointwise algorithm and partitioning method [1]. Genes are classified as belonging to one of k groups, with k chosen a priori. Cluster membership is determined by calculating the centroid of each group, finding the proximity (via Euclidean distance, Manhattan distance, the Pearson correlation coefficient, etc.) of a gene to each centroid, and assigning the gene to the closest centroid [2]. The algorithm seeks to minimize the total distance from each gene to its assigned centroid.

Here we use the Pearson correlation rather than other distance formulas and coefficient functions, as a control in the comparison with other algorithms. The Pearson correlation coefficient between any two series of numbers X = {X1, X2, …, XN} and Y = {Y1, Y2, …, YN} is defined as

\[ r(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \]

where \bar{X} and \bar{Y} are the means of X and Y.
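As a minimal sketch, the Pearson correlation coefficient used as the similarity measure in Section 2.1 can be computed from its standard definition as follows. The function name `pearson` and the example profiles are illustrative assumptions, not part of the software used in this paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression profiles over four time points.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, different scale
gene_c = [4.0, 3.0, 2.0, 1.0]   # inverted shape
```

Because the coefficient depends on the shape of a profile rather than its magnitude, `gene_a` and `gene_b` are perfectly correlated while `gene_a` and `gene_c` are perfectly anti-correlated.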
(a) Pseudocode
Input: a set S of objects and an integer k, the number of clusters.
Output: a partition of S into S1, S2, …, Sk.
Program:
• Choose an integer k as the number of clusters.
• Initialize the codebook vectors of the k clusters randomly.
• For every new sample vector:
  o Compute the Pearson correlation coefficient between the new vector and every cluster’s centroid.
  o Assign each gene to the closest centroid.
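k-means works by calculating the centroid of each cluster S_i, denoted \bar{x}_i, and optimizing the cost function:

\[ \mathrm{cost}(S_i) = \sum_{x_r \in S_i} d(\bar{x}_i, x_r) \tag{1} \]

where d is the chosen dissimilarity between a gene and a centroid. The goal is to minimize the total cost:

\[ \mathrm{cost}(S) = \sum_{i=1}^{k} \mathrm{cost}(S_i) \tag{2} \]

• After each assignment pass, recompute each centroid as the mean of the genes assigned to it:

\[ \bar{x}_i = \frac{1}{|S_i|} \sum_{x_r \in S_i} x_r \tag{3} \]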
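The k-means procedure with a Pearson similarity can be sketched as below. This is an illustration under stated assumptions, not the Cluster implementation: the names `kmeans_pearson` and `init` are hypothetical, centroids are initialized from input profiles (chosen at random unless `init` is given), and a constant centroid is given correlation 0.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant (assumption)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def mean_profile(profiles):
    """Centroid of a group: the mean expression level at each time point."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]


def kmeans_pearson(profiles, k, init=None, max_iter=100, seed=0):
    """Return a cluster label per profile; 'closest' means highest correlation."""
    if init is None:
        init = random.Random(seed).sample(range(len(profiles)), k)
    centroids = [list(profiles[i]) for i in init]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each gene goes to its most correlated centroid.
        new_labels = [max(range(k), key=lambda i: pearson(p, centroids[i]))
                      for p in profiles]
        if new_labels == labels:      # partition unchanged: converged
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for i in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == i]
            if members:
                centroids[i] = mean_profile(members)
    return labels


# Hypothetical data: three rising and three falling profiles.
profiles = [[1, 2, 3, 4], [1, 3, 4, 6], [0, 2, 5, 6],
            [4, 3, 2, 1], [6, 4, 3, 1], [6, 5, 2, 0]]
labels = kmeans_pearson(profiles, k=2, init=[0, 3])   # -> [0, 0, 0, 1, 1, 1]
```

Passing `init` makes the run reproducible; with random initialization the result can depend on the starting centroids, which is a known property of k-means.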
(b) Analysis
The k-means algorithm is a popular choice among clustering algorithms. Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations the algorithm takes to converge. Its space complexity is O(k+n).

Pointwise algorithms form the basis of all the methods that follow. The next three algorithms improve upon the Pointwise Similarity method by comparing gene pairs through other models or vectors, rather than through the raw data as Pointwise algorithms do.
(c) Implementation
We used Cluster and TreeView to implement a k-means algorithm. A yeast dataset was filtered and run through the k-means algorithm. TreeView was then used to visualize the clustering of genes in a structure similar to a phylogenetic tree. (See http://rana.lbl.gov/manuals/ClusterTreeView.pdf for more information.)

Figure 1: TreeView interface of raw data passed through the Cluster algorithm (Pointwise method only). A Pearson correlation coefficient can be seen as the ‘Selected Array Node Correlation.’

2.2 Feature-Based Similarity
Feature-Based Similarity algorithms are more complex than Pointwise Similarity algorithms. Whereas Pointwise Similarity algorithms compare raw temporal expression data, Feature-Based Similarity algorithms extract a set of features from the data and use those features as the basis for comparison.

(a) Pseudocode
In general, Feature-Based methods first transform each gene expression vector into a feature vector [8], which encompasses the time and direction of step-wise temporal transitions (which the authors consider to be most important); then they use traditional clustering algorithms, as in the Pointwise Similarity methods, such as k-means and hierarchical clustering.
Input: a determinant feature, a set S of objects and an integer k, the number of clusters.
Output: a partition of S into S1, S2, …, Sk.
Definitions (see Figure 2):
• One-step (Up): gene expressions transition from a low value to a high value
• One-step (Down): gene expressions transition from a high value to a low value
• Binary two-step (Up-Down): gene expressions transition from low to high and return to the same low value
• Binary two-step (Down-Up): gene expressions transition from high to low and return to the same high value
• F1 statistic: represents how well the one-step model fits the data
• F2 statistic: represents how well the two-step model fits the data
• F12 statistic: represents the relative goodness of fit of a one-step versus a two-step pattern
Program (StepMiner):
• Find the one- or two-step function, built from binary transitions, that best fits the expression profile over n time points, X1, X2, …, Xn.
  o Find F1 and F2, which follow an F-distribution with (m-1, n-m) degrees of freedom for the respective one- and two-step functions.
• Find the best feature vector:
SelectBestModels(){
oneStep = F-Significant(F1) && Not-F-Significant(F12)
twoStep = F-Significant(F2) && NotIn(oneStep)
other = NotIn(oneStep, twoStep)
}
• Use a Pointwise Similarity algorithm (k-means) to compute clusters by comparing the similarity or dissimilarity of the feature vectors found (Section 2.1).

(b) Analysis
The strength of this method lies in the use of statistical parameters in creating the feature vector. By transforming the data into one- and two-step functions, StepMiner [8] creates a temporal ordering of measurements. Whereas Pointwise Similarity methods treat time as just another parameter, Feature-Based methods take time into account and use it to pre-order temporal data.

StepMiner is ideal for users interested in binary models of gene expression time courses. The downside is that binary models abstract away from other features (essentially, we isolate only one feature: the change in expression level from low to high, or vice versa). Such is the case for most Feature-Based implementations, where only one feature is isolated. In StepMiner, however, the creators did take into account how well the binary model fits the temporal variation in gene expression by fixing a p-value.

Table 1: GO annotations of different gene groups [8]. The extracted binary patterns from the StepMiner algorithm correspond to different cellular functions (by gene groups), all with low p-values.

(c) Implementation
In the implementation of this method, we used StepMiner, which relies on the assumption that transitions in expression levels are the most important features of an expression profile.

Each function group at time n (one- and two-step, both up, down and other combinations) corresponds to a gene group responsible for a particular cellular process (or other function). For example, the gene group whose expression profiles fit the one-step (Down) function at time 9.25 h is associated with protein biosynthesis, with a p-value of 3.4E-51 (see Table 1).

Figure 2: Image on the left shows raw data that has been clustered using a Pointwise Similarity method only. Image on the right shows data that has been passed through the StepMiner (Feature-Based) algorithm, then a Pointwise Similarity method [8].
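The one-step fit at the core of the StepMiner program in Section 2.2 can be illustrated with a short least-squares sketch. This is not the StepMiner code: `fit_one_step` and `f_statistic` are hypothetical names, the two-step fit and the significance test against the F-distribution are omitted, and the value m = 3 used in the example is an assumption about the number of fitted parameters.

```python
def fit_one_step(values):
    """Best least-squares one-step fit.

    Tries every step position, fits a constant level on each side, and
    returns (step_index, direction, sse) for the smallest squared error.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two time points")
    best = None
    for split in range(1, n):            # step lies between split-1 and split
        left, right = values[:split], values[split:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best[2]:
            direction = "up" if mean_right > mean_left else "down"
            best = (split, direction, sse)
    return best


def f_statistic(values, sse, m):
    """Regression F statistic with (m-1, n-m) degrees of freedom."""
    n = len(values)
    mean = sum(values) / n
    ss_total = sum((v - mean) ** 2 for v in values)
    if sse == 0:
        return float("inf")              # perfect fit
    return ((ss_total - sse) / (m - 1)) / (sse / (n - m))


# Hypothetical profile: low for three time points, then high.
profile = [1.0, 1.2, 0.9, 5.1, 4.8, 5.0]
split, direction, sse = fit_one_step(profile)
f1 = f_statistic(profile, sse, m=3)      # m=3 is an assumption for illustration
```

The resulting step position and direction would form the feature vector that is passed on to the Pointwise clustering step.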
2.3 Shape-Based Similarity
Shape-Based Similarity algorithms use the shape of expression profiles over time to compare gene pairs. Shape-Based algorithms are commonly based on the popular Smith-Waterman algorithm for local sequence alignment, which assigns to each pair of gene expression profiles a score and a relationship: simultaneous, time-delayed, inverted, or inverted time-delayed [7]. This is analogous to the StepMiner algorithm (which uses one feature: one-step up, one-step down, two-step up-down or two-step down-up), except that it captures a broader pattern of change in the expression profile over time. The score can then be taken as a measure of similarity and used for clustering the genes. Below, we show a more sophisticated method that relies on the basic principle behind Shape-Based Similarity.

(a) Pseudocode
The following algorithm improves upon previous Shape-Based Similarity algorithms by transforming gene expression vectors into “change trend” vectors, which contain the direction of change in the gene expression levels for successive time points, prior to local shape alignment [6].
Input: a set S of objects and an integer k, the number of clusters.
Output: a partition of S into S1, S2, …, Sk.
Assumptions:
• Functional relationships (in gene expression profiles over time) with high statistical significance must be possible.
• Said functional relationships should have high biological significance.
Definitions:
• sc: maximal local alignment of change trend between each gene pair
• cc: correlation coefficient between the maximal alignment
Program (TC_linkage_infer):
• Randomize a dataset by shuffling normalized gene expression levels at different time points among each gene expression profile in the original dataset.
• Calculate sc between each pair in the random dataset.
• Calculate cc for each gene pair in the random dataset.
• Find the frequency of sc, f(sc), as a function of sc.
• Find the distribution of cc for gene pairs that have the same sc.
• Calculate p-values for the two scores sc and cc (Psc(s ≥ sc), Pcc(c ≥ cc)) by integrating the frequency distribution.
• Calculate the sc and cc between each gene pair in the original dataset.
• Extract gene pairs with significantly high sc values at a certain preset p-value.
  o The correlation coefficient cc is regarded as a second index when the gene pairs have the same score sc.
• Extract gene pairs with a statistically significant high value of the combined scores of sc and cc.
• Find the local alignment of gene expression profiles over time.
• Use a Pointwise Similarity algorithm (k-means) to compute clusters by comparing the similarity or dissimilarity of the feature vectors found (Section 2.1).

(b) Analysis
The major advantage of Shape-Based algorithms is their ability to identify as similar two expression profiles that are shifted, inverted, or both (see Figure 3). Biologically, a time shift would suggest that one gene is regulating the other. An inverted shape would suggest that a particular mechanism is activating one gene and inhibiting the other gene of the pair (for that time interval).

The major disadvantage of this method is that the slow process of finding the best local sequence alignment must be performed many times. The best local sequence alignment algorithm has a time complexity of O(mn), where m is the number of genes and n is the number of time points. Thus, this algorithm has a time complexity of O(kmn), where k is the number of clusters defined. Heuristic approaches have been applied to this method.

Since one of the first steps in the algorithm above is randomization of the dataset (and thus the time points), the inclusion of penalized gaps would reduce problems caused by non-uniform sampling of time points.

(c) Implementation
This algorithm was implemented using a C program and CLUTO, a clustering software package.
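The change-trend transformation and local alignment described in Section 2.3 can be sketched as follows. This is a simplified illustration, not TC_linkage_infer: the function names, the +1/-1 scoring, the absence of gaps, and the omission of the randomization and p-value steps are all assumptions made to keep the example short.

```python
def change_trend(profile, tol=0.0):
    """Direction of change between successive time points: +1, -1 or 0."""
    trend = []
    for prev, cur in zip(profile, profile[1:]):
        diff = cur - prev
        trend.append(1 if diff > tol else -1 if diff < -tol else 0)
    return trend


def local_alignment_score(a, b, match=1, mismatch=-1):
    """Smith-Waterman-style local alignment score without gaps."""
    best = 0
    prev_row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Extend the diagonal run, or restart at zero.
            row[j] = max(0, prev_row[j - 1] + step)
            best = max(best, row[j])
        prev_row = row
    return best


def best_relationship(p, q):
    """Compare the direct and inverted alignments of two profiles."""
    tp, tq = change_trend(p), change_trend(q)
    direct = local_alignment_score(tp, tq)
    inverted = local_alignment_score(tp, [-t for t in tq])
    return ("inverted", inverted) if inverted > direct else ("direct", direct)


# Hypothetical profiles: g2 is g1 delayed by one time point, g3 is g1 inverted.
g1 = [0, 1, 2, 3, 2, 3, 4]
g2 = [0, 0, 1, 2, 3, 2, 3]
g3 = [4, 3, 2, 1, 2, 1, 0]
```

Because the alignment is local, the delayed profile `g2` still scores highly against `g1`, and negating the trend vector lets the same routine detect the inverted relationship with `g3`.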
Figure 3: (A) Expression profiles that match simultaneously. (B) Expression profiles that are time-delayed. (C) Expression profiles that are inverted [7].

3. Generative Algorithms
Generative algorithms pre-process data to determine optimal parameters for grouping clusters, then use those parameters to identify similar profiles generated by models.

3.1 Template-Based Similarity
A template-based algorithm takes expression vectors and transforms them into pattern vectors. Pattern vectors show the change between consecutive expression levels.

(a) Pseudocode
Below we describe a template-based algorithm called fuzzy c-varieties clustering with Transitional State Discrimination pre-clustering (FCV-TSD). This is a two-step approach that identifies groups of points ordered linearly in temporal locations, and orientations of the data space that correspond to similar expressions in the time domain [6].
Input: a determinant feature, a set S of objects and an integer k, the number of clusters.
Output: a partition of S into S1, S2, …, Sk.
Definitions:
• ng: number of genes
• nt: number of time points
• xg = [xg,1, xg,2, …, xg,nt]: gene expression vector for the gth gene, where 1 ≤ g ≤ ng
• ns: number of defined states for the pattern vector function
Program (FCV-TSD):
• Define the pattern vector function pg(xg, t) with ns states.
• For all genes and time points, evaluate the pattern vector function pg(xg, t).
• Generate pattern vector functions or prototypes.
• For all prototypes and all genes, match the gene to the corresponding pattern vector via a fuzzy c-means algorithm.
  o A gene g belongs to the cluster represented by prototype p.
• Fuzzy c-means algorithm:
  o Randomly initialize the membership matrix (U) subject to the constraints in Equation 4, where u_{ig} is the membership of gene g in cluster i.

\[ \sum_{i=1}^{k} u_{ig} = 1, \quad \forall g = 1, \ldots, n_g \tag{4} \]

  o Calculate centroids (ci) by using Equation 5, where m > 1 is the fuzziness exponent.

\[ c_i = \frac{\sum_{g=1}^{n_g} u_{ig}^{m} x_g}{\sum_{g=1}^{n_g} u_{ig}^{m}} \tag{5} \]

  o Compute dissimilarity between centroids and data points using Equation 6, where d_{ig} = \lVert c_i - x_g \rVert.

\[ J(U, c_1, \ldots, c_k) = \sum_{i=1}^{k} \sum_{g=1}^{n_g} u_{ig}^{m} d_{ig}^{2} \tag{6} \]

  o Compute a new membership matrix (U’).

(b) Analysis
The template-matching algorithm above does not require researchers to choose a candidate profile, because it includes every possible pattern vector as a template profile (compare to Feature-Based Similarity algorithms, which isolate one feature). The downside of this flexibility, however, is that as the number of time points increases (longer time series), the number of template profiles, and therefore the number of clusters, becomes large relative to, and at times larger than, the number of genes.

Template-Based algorithms’ use of pattern vectors rather than raw data (as in Pointwise methods) makes them robust to noise. These algorithms work well with short time series, which is important because over eighty percent of all time series in the Stanford Microarray Database contain fewer than nine time points.

A major disadvantage of Template-Based Similarity methods (similar to Feature-Based methods) is the loss of information from the transformation of raw data into a pattern vector. In the next section (Section 4), we propose a method to overcome this disadvantage.
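The fuzzy c-means iteration used in the pseudocode of Section 3.1 can be sketched in plain Python as follows. This is the textbook fuzzy c-means update under stated assumptions (Euclidean distance, fuzziness exponent m = 2 by default, a hypothetical `init` argument for reproducible runs); it is not the FCV-TSD MATLAB code.

```python
import math
import random


def fuzzy_c_means(points, k, m=2.0, max_iter=100, tol=1e-6, init=None, seed=0):
    """Return (u, centroids); u[i][g] is the membership of point g in cluster i."""
    n, dim = len(points), len(points[0])
    if init is None:
        # Random memberships, normalised so each point's memberships sum to 1.
        rng = random.Random(seed)
        raw = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(k)]
        totals = [sum(raw[i][g] for i in range(k)) for g in range(n)]
        u = [[raw[i][g] / totals[g] for g in range(n)] for i in range(k)]
    else:
        u = [row[:] for row in init]
    centroids = []
    for _ in range(max_iter):
        # Centroids: membership-weighted means of the points.
        centroids = []
        for i in range(k):
            w = [u[i][g] ** m for g in range(n)]
            total = sum(w)
            centroids.append([sum(w[g] * points[g][d] for g in range(n)) / total
                              for d in range(dim)])
        # Distances between every centroid and every point.
        dist = [[math.dist(centroids[i], points[g]) for g in range(n)]
                for i in range(k)]
        # New membership matrix from the relative distances.
        new_u = [[0.0] * n for _ in range(k)]
        for g in range(n):
            zero = [i for i in range(k) if dist[i][g] == 0.0]
            if zero:                      # point coincides with a centroid
                for i in zero:
                    new_u[i][g] = 1.0 / len(zero)
                continue
            for i in range(k):
                new_u[i][g] = 1.0 / sum(
                    (dist[i][g] / dist[j][g]) ** (2.0 / (m - 1.0)) for j in range(k))
        change = max(abs(new_u[i][g] - u[i][g]) for i in range(k) for g in range(n))
        u = new_u
        if change < tol:                  # memberships have stabilised
            break
    return u, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [[0.6, 0.6, 0.6, 0.4, 0.4, 0.4],
         [0.4, 0.4, 0.4, 0.6, 0.6, 0.6]]
u, centroids = fuzzy_c_means(points, k=2, init=start)
```

Unlike k-means, every point keeps a membership value in every cluster; a hard assignment can be obtained afterwards by taking the cluster with the largest membership.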
(c) Implementation
MATLAB was used to implement the FCV-TSD algorithm.

Figure 4: Template-Based method for an artificial dataset in MATLAB. The horizontal axis denotes time, and the vertical axis denotes expression level (fluorescence) [6].

4. Proposed Method
The latter three methods (Feature-Based, Shape-Based and Template-Based) each involve a pre-clustering algorithm followed by a Pointwise Similarity algorithm (k-means or fuzzy c-means). For our proposed method, we plan on running two algorithms in the pre-clustering phase, then using a Pointwise algorithm (in this case, fuzzy c-means) to cluster the gene expressions.

4.1 Template-Shape-Based
Both Template-Based and Shape-Based Similarity methods have their drawbacks. This proposed combined method is an attempt to eliminate the respective disadvantages of both methods.

(a) Pseudocode
Input: a set S of objects and an integer k, the number of clusters.
Output: a partition of S into S1, S2, …, Sk.
Assumptions (modified from Section 2.3a):
• Functional relationships after initial pre-clustering with high statistical significance must be possible.
• Said functional relationships (again, after clustering) should have high biological significance.
Program (Combined Method):
• Pre-clustering:
  o Run raw gene expression data through the TSD algorithm (Section 3.1a).
  o Run the pre-clustered data through the TC_linkage_infer algorithm (Section 2.3a).
• Clustering:
  o Run the pre-processed data through a fuzzy c-means (Section 3.1a) or k-means (Section 2.1a) algorithm.

(b) Analysis
The approach above would first group the genes based on very general shared patterns and then further distinguish within any individual group based on the more complex features of the expression profiles.

The major advantages of this combined approach are the reduction of information loss (a weakness of Template-Based Similarity methods) and the reduction of the time needed to process a dataset (a weakness of the Shape-Based method). The slowness would not be an issue because pre-clustering via Template-Based algorithms reduces the dataset to smaller subsets, so only a small subset at a time would need to pass through the Shape-Based algorithm.

This combined approach would also be robust to noise because of its two-step pre-clustering (first with a Template-Based algorithm, then with a Shape-Based algorithm). Additionally, the use of a fuzzy c-means algorithm for the Pointwise clustering portion of the method would ensure that every input point yields a membership value in each of the clusters [5].

(c) Implementation
In future work, we would implement this method in MATLAB. We would convert the C program TC_linkage_infer to MATLAB code, and run raw artificial temporal data via a MATLAB script through three MATLAB files: tsd.m, timeshift.m, and finally fcv.m (see http://collablab.northwestern.edu/irsal/bioinformatics/Template-Shape-Based%20Method/ for the files and script).
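The first pre-clustering stage of the combined method can be illustrated with a small sketch that groups genes by their pattern vectors. It is only an assumption-laden illustration of the TSD idea: the three states (+1 rise, -1 fall, 0 no change), the `threshold` parameter and the function names are hypothetical, and the TC_linkage_infer and fuzzy c-means stages are not shown.

```python
from collections import defaultdict


def pattern_vector(profile, threshold=0.5):
    """Three-state pattern of change between consecutive time points."""
    states = []
    for prev, cur in zip(profile, profile[1:]):
        diff = cur - prev
        if diff > threshold:
            states.append(1)       # expression rises
        elif diff < -threshold:
            states.append(-1)      # expression falls
        else:
            states.append(0)       # change too small to count
    return tuple(states)


def precluster(profiles, threshold=0.5):
    """Group gene names by identical pattern vector."""
    groups = defaultdict(list)
    for name, profile in profiles.items():
        groups[pattern_vector(profile, threshold)].append(name)
    return dict(groups)


# Hypothetical expression profiles over four time points.
profiles = {
    "g1": [0.0, 1.0, 2.0, 2.1],
    "g2": [1.0, 3.0, 5.0, 5.2],
    "g3": [2.0, 1.0, 0.0, 0.1],
}
groups = precluster(profiles)
# -> {(1, 1, 0): ["g1", "g2"], (-1, -1, 0): ["g3"]}
```

Each resulting group is a smaller subset that could then be passed to the Shape-Based stage, which is where the expected saving in alignment time would come from.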
5. Conclusion
This paper explored and compared different data clustering algorithms in bioinformatics, particularly for the gene clustering of temporal microarray data. The most popular algorithms for general clustering (hierarchical, k-means, etc.) are also the most popular algorithms for gene clustering (Section 2.1); we categorized these as the Pointwise Similarity method. The other methods we have described also use these popular clustering algorithms, but only as the last step. Feature-Based, Shape-Based and Template-Based methods use a pre-clustering step that transforms the raw data into vectors, and a clustering step that uses a similarity scale or matrix to group similar genes over time.

We have proposed a new method intended to minimize information loss and reduce processing time for the Shape-Based method. We plan to implement this combined method in a later iteration, and currently have a framework for how it will be implemented. Further comparisons of these algorithms can be performed using a set of factors and comparing their effectiveness with statistical significance. It is important to note that each algorithm is used at different points of microarray experiments (Pointwise is used in early analyses, while Shape-Based is used in later analyses); thus there is no single best method. Rather, we have methods that work for shorter time-series (Shape-Based), methods that are restrictive to scientists (Feature-Based), and methods that are prone to information loss (Template-Based and Pointwise). Our proposed method is designed to address these problems.
Acknowledgements
We would like to thank Professor Kao for his excellent teaching of a very difficult subject matter (Bioinformatics Algorithms), and his guidance for this paper.
References
[1] Osama Abu Abbas. Comparisons Between Data Clustering Algorithms. The International Arab Journal of Information Technology, 5:3-320, 2008.
[2] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95:14863–14868, 1998.
[3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl 1:i159–168, 2005.
[4] L. Kuenzel. Gene clustering methods for time series microarray data. Biochemistry 215, 2010.
[5] M. A. Amin, N. V. Afzulpurkar, M. N. Dailey, V. E. Esichaikul, and D. N. Batanov. Fuzzy-c-mean determines the principle component pairs to estimate the degree of emotion from facial expressions. In: International Conference on Natural Computation and International Conference on Fuzzy Systems and Knowledge Discovery, pp 484–493, 2005.
[6] C. S. Moller-Levet, K. H. Cho, and O. Wolkenhauer. Microarray data clustering based on temporal variation: FCV with TSD preclustering. Appl Bioinformatics, 2:35–45, 2003.
[7] J. Qian, M. Dolled-Filhart, J. Lin, H. Yu, and M. Gerstein. Beyond synexpression relationships: Local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol, 314:1053–1066, 2001.
[8] D. Sahoo, D. L. Dill, R. Tibshirani, and S. K. Plevritis. Extracting binary signals from microarray time-course data. Nucleic Acids Res, 35:3705–3712, 2007.