Clustering Time-Series Gene Expression Data Using Smoothing Spline Derivatives
S. DÉJEAN, P. G. P. MARTIN, A. BACCINI, AND P. BESSE
EURASIP JOURNAL ON BIOINFORMATICS AND SYSTEMS BIOLOGY 2007

Outline
- Introduction
- Biological Experiment
- Statistical Methodology
- Results
- Discussion

Introduction
- Time-series gene expression data call for clustering of temporal profiles.
- This paper focuses on the shapes of the curves rather than on the absolute level of expression.
- The shapes of the curves may provide meaningful information on coordinate gene regulation.

Introduction
- How can the shapes of the curves be described?
- We use splines for a continuous representation, as in Bar-Joseph et al., and the derivative to capture the shape.

Biological Experiment
- Experimental design:
  - 44 mice were subjected to 11 different fasting periods ranging from 0 to 72 hours.
  - At each time point (0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72 hours), 4 mice were euthanized and their livers were used for RNA extraction.
  - The sampling rate decreases over time, because most of the gene expression changes were assumed to occur at the beginning of fasting.

Biological Experiment
- Data preprocessing:
  - All data were log-transformed.
  - Because of missing expression values, only 130 of the 200 genes were selected for analysis.
- The data set therefore includes:
  - 130 genes
  - 11 time points
  - 4 samples at each time point

Statistical Methodology
- The methodology is composed of two steps:
  - Signal extraction
  - Clustering of the derivatives of the smoothed curves
- The data set: 130 genes observed at 11 time points, with 4 samples per time point.

Signal Extraction
- Consider the observed gene expression values.
- Two assumptions:
  - The values are noisy observations of the "true" value.
  - The biological phenomenon should be a regular, and therefore differentiable, function of time.

Signal Extraction
- Under these assumptions, the model for each gene is
  $y_{ij} = f(t_i) + \varepsilon_{ij}$,
  where $y_{ij}$ denotes the observation for mouse $j$ at time $t_i$ and $\varepsilon_{ij}$ is the noise.
- How can $f$, a continuous and differentiable function, be estimated?

Signal Extraction
- With cubic spline smoothing, the estimate of the gene expression curve is the solution of the optimization problem
  $\hat{f} = \arg\min_f \sum_i \left(\bar{y}_i - f(t_i)\right)^2 + \lambda \int \left(f''(t)\right)^2 \, dt$,
  where $\bar{y}_i$ is the mean of the observations at time $t_i$.
  - $\lambda$ is the smoothing parameter.
  - The first term forces the solution to be close to the mean values.
  - The second term controls the regularity of the function.

Signal Extraction
- The shape of the solution and its smoothness depend directly on $\lambda$.
- How should the smoothing parameter be tuned to extract the informative part of the signal?

Tuning the smoothing parameter
- Consider the influence of $\lambda$ on the smoothed curves.

Tuning the smoothing parameter
- A single value of $\lambda$ is used for all genes.
- It is chosen by a heuristic approach combining two levels of reflection:
  - Eigenelements of the PCA performed a posteriori.
  - Biological interpretation of the results.

Eigenvalues and eigenvectors smoothness
- For each candidate value of $\lambda$:
  - Each gene expression profile is smoothed with the same value.
  - First derivatives are computed and discretized.

Eigenvalues and eigenvectors smoothness
- A PCA is then computed, leading to a scree graph.
- When $\lambda$ is large, the smoothed curve approaches a straight line and its derivative is constant, so the PCA gives only one large eigenvalue.
- As $\lambda$ decreases, other eigenvalues emerge.

Eigenvalues and eigenvectors smoothness
- The first two eigenvectors: as $\lambda$ decreases, they become much more irregular and more difficult to interpret.

Biological interpretation
- Consistency with biological relevance should be considered.
  - For higher values of $\lambda$, two or three time points suffice to interpret the phenomena.
  - As $\lambda$ decreases, the additional oscillations in the eigenvectors may be biologically irrelevant.
- Taking this into account is important to avoid misinterpretation.

Synthesis
- Combining the two levels of reflection yields the value chosen for the smoothing parameter $\lambda$.
- With this value, there are clearly two separate eigenvalues, and the corresponding eigenvectors are smooth enough to interpret the gene expression profiles.

Clustering
- Each gene expression profile is represented by the derivative of its smoothed curve, evaluated at 20 equally spaced points between 0 and 72 hours.
- The data can thus be presented as 130 genes with 20 derivative values for each gene.

Clustering
- Two kinds of clustering are applied:
  - Hierarchical clustering
  - K-means clustering
- Hierarchical clustering:
  - Ward criterion: each fusion of two clusters minimizes the increase in the total within-cluster sum of squares.
- K-means clustering:
  - To correct improper fusions made during hierarchical clustering, k-means is run with the k centroids obtained from the hierarchical clustering as initial centers.

Results
- Hierarchical clustering: the dendrogram is cut into four clusters.

Results
- The four clusters correspond to four temporal expression profiles:
  - Weakly increasing (hc1)
  - Stationary (hc2)
  - Decreasing (hc3)
  - Strongly increasing (hc4)

Results
- To make the clustering more robust, k-means clustering is performed, with the centers of the classes obtained by cutting the dendrogram as initial centers.
- Clustering changes between the hierarchical and k-means partitions (table, with the main change highlighted).

Results
- The four clusters of the k-means clustering.

Graphical display
- Representation of variables (time points) and individuals (genes), with clusters km1, km2, km3, and km4.

Discussion
- Before the clustering step, spline smoothing is applied as a de-noising method.
- Based on this work, one could choose time points adequately depending on the scientific aims.
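
Sketch of the two-step methodology
- The steps above (smooth each gene, differentiate, inspect the PCA eigenvalues, then cluster) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code.
- Assumptions: the data are synthetic stand-ins generated at random; the slopes, noise level, and `lam = 1e3` are hypothetical values; the helper names `smoothed_derivatives` and `pca_eigenvalues` are invented for this sketch; `make_smoothing_spline` requires SciPy 1.10 or later.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # SciPy >= 1.10
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

# Fasting times in hours (11 time points, as in the experimental design).
times = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
n_genes, n_rep, n_clusters = 130, 4, 4

# Synthetic stand-in for the log-expression data, shape (genes, times, mice).
# The slopes and the noise level are hypothetical, chosen only for illustration.
rng = np.random.default_rng(0)
slopes = rng.choice([-0.03, 0.0, 0.01, 0.04], size=n_genes)
signal = slopes[:, None] * times[None, :]
data = signal[:, :, None] + rng.normal(scale=0.2, size=(n_genes, times.size, n_rep))


def smoothed_derivatives(data, times, lam, n_grid=20):
    """Fit a cubic smoothing spline to each gene's per-time means and
    return its first derivative on an equally spaced grid."""
    grid = np.linspace(times[0], times[-1], n_grid)
    means = data.mean(axis=2)  # average over the mice at each time point
    derivs = np.empty((data.shape[0], n_grid))
    for g, y in enumerate(means):
        spline = make_smoothing_spline(times, y, lam=lam)
        derivs[g] = spline.derivative()(grid)
    return derivs


def pca_eigenvalues(x):
    """Eigenvalues of the covariance matrix of x, largest first."""
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / (x.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[::-1]


lam = 1e3  # hypothetical smoothing parameter, shared by all genes
derivs = smoothed_derivatives(data, times, lam)  # shape (130, 20)
eigvals = pca_eigenvalues(derivs)                # values for a scree graph

# Hierarchical clustering with the Ward criterion, cut into four clusters.
tree = linkage(derivs, method="ward")
hc_labels = fcluster(tree, t=n_clusters, criterion="maxclust")

# k-means started from the centers of the hierarchical clusters.
init = np.vstack([derivs[hc_labels == c].mean(axis=0) for c in np.unique(hc_labels)])
centroids, km_labels = kmeans2(derivs, init, minit="matrix")

print("leading eigenvalues:", eigvals[:3])
print("genes that changed cluster:", int(np.sum(hc_labels - 1 != km_labels)))
```

- Rerunning the sketch with several values of `lam` and comparing `eigvals` mimics the scree-graph heuristic used to tune the smoothing parameter.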
