Clustering Time-Series Gene Expression Data Using Smoothing Spline Derivatives
S. Déjean, P. G. P. Martin, A. Baccini, and P. Besse
EURASIP Journal on Bioinformatics and Systems Biology, 2007

Outline
Introduction
Biological Experiment
Statistical Methodology
Results
Discussion

Introduction
For time-series gene expression data, the task is to cluster temporal profiles. This paper focuses on the shapes of the curves rather than on the absolute level of expression, because the shapes may provide meaningful information on coordinate gene regulation.
How can the shapes of the curves be described? Splines give a continuous representation of each profile, as in Bar-Joseph et al., and the derivative of the spline describes its shape.

Biological Experiment
Experimental design: 44 mice were subjected to 11 different fasting periods ranging from 0 to 72 hours. At each time point (0, 3, 6, 9, 12, 18, 24, 36, 48, 60, and 72 hours), 4 mice were euthanized and their livers were used for RNA extraction. The sampling rate decreases over time because most of the gene expression changes were assumed to occur at the beginning of fasting.
Data preprocessing: All data were log-transformed. Because of missing expression values, only 130 of the 200 genes were selected for analysis. The data set therefore comprises 130 genes, 11 time points, and 4 samples at each time point.

Statistical Methodology
The methodology has two steps: signal extraction, then clustering of the derivatives of the smoothed curves.

Signal Extraction
Consider the observed gene expression values. Two assumptions are made. First, the values are noisy observations of a "true" value. Second, the underlying biological phenomenon is a regular, and therefore differentiable, function of time.
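The experimental design and the first assumption can be illustrated with synthetic data. The sketch below is only an illustration: the curve `true_profile`, the noise level `noise_sd`, and the function names are hypothetical and not taken from the paper, and the last sampling time is assumed to be 72 hours (the end of the stated fasting range).

```python
import numpy as np

# Sampling times in hours; the interval between samples grows over time.
TIMES = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
N_REPLICATES = 4  # mice per time point


def true_profile(t):
    """Hypothetical smooth 'true' log-expression curve (not from the paper)."""
    return 1.0 - np.exp(-t / 12.0)


def simulate_gene(rng, noise_sd=0.2):
    """Return a (replicates, time points) array of noisy observations."""
    signal = true_profile(TIMES)  # shape (11,)
    noise = rng.normal(0.0, noise_sd, size=(N_REPLICATES, TIMES.size))
    return signal + noise  # the signal is broadcast over the replicates


rng = np.random.default_rng(0)
obs = simulate_gene(rng)          # shape (4, 11): 4 mice at each of 11 times
mean_profile = obs.mean(axis=0)   # per-time-point mean over the 4 mice
```

Each column of `obs` holds the four replicate measurements at one time point, and `mean_profile` is the kind of per-time-point summary that a smoothing step can then work from.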
According to these assumptions, the model for the expression of each gene is

$$ y_{ij} = f(t_j) + \varepsilon_{ij}, $$

where $y_{ij}$ denotes the observation for the $i$-th mouse at time $t_j$ and $\varepsilon_{ij}$ is a noise term. How can $f$, a continuous and differentiable function, be estimated?
Using cubic spline smoothing, the estimate of the gene expression curve is the solution of the optimization problem

$$ \hat{f} = \arg\min_{f} \sum_{j} \left( \bar{y}_j - f(t_j) \right)^2 + \lambda \int \left( f''(t) \right)^2 \, dt, $$

where $\bar{y}_j$ is the mean of the observations at time $t_j$ and $\lambda$ is the smoothing parameter. The first term forces the solution to be close to the mean values; the second term controls the regularity of the function.
The shape of the solution and its smoothness depend directly on $\lambda$. How should the smoothing parameter be tuned to extract the informative part of the signal?

Tuning the smoothing parameter
A unique value of $\lambda$ is used for all genes. It is chosen by a heuristic approach combining two levels of reflection: the eigenelements of a PCA performed a posteriori, and the biological interpretation of the results.

Eigenvalues and eigenvectors smoothness
For each candidate value of $\lambda$, every gene expression profile is smoothed with that same value, and the first derivatives are computed and discretized. A PCA is then computed on the discretized derivatives, leading to a scree graph. When $\lambda$ is large, each derivative is nearly constant, so the PCA gives only one large eigenvalue. As $\lambda$ decreases, other eigenvalues emerge.
The first two eigenvectors: as $\lambda$ decreases, the first two eigenvectors become much more irregular and more difficult to interpret.

Biological interpretation
Consistency with biological relevance should also be considered. For high values of $\lambda$, two or three time points would be enough to describe the phenomenon. As $\lambda$ decreases, the additional oscillations in the eigenvectors may be biologically irrelevant. Taking this into account is important to avoid misinterpretation.

Synthesis
Combining the two levels of reflection yields a single value of the smoothing parameter $\lambda$. At that value there are clearly two separate eigenvalues, and the corresponding eigenvectors are smooth enough to interpret the gene expression profiles.
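The tuning procedure described above (smooth every gene with the same smoothing parameter, differentiate, discretize, then run a PCA) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses `scipy.interpolate.make_smoothing_spline` (available in SciPy 1.10 and later), whose `lam` argument weights the integrated squared second derivative; the data are synthetic stand-ins; the candidate `lam` values are arbitrary and need not be on the same scale as the values used in the paper.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline  # requires SciPy >= 1.10

TIMES = np.array([0, 3, 6, 9, 12, 18, 24, 36, 48, 60, 72], dtype=float)
GRID = np.linspace(0.0, 72.0, 20)  # points at which derivatives are discretized


def derivative_matrix(mean_profiles, lam):
    """Smooth each gene with the same lam and return derivatives on GRID."""
    rows = []
    for y in mean_profiles:
        spline = make_smoothing_spline(TIMES, y, lam=lam)
        rows.append(spline.derivative()(GRID))
    return np.vstack(rows)  # shape (genes, 20)


def pca_eigenvalues(matrix):
    """Eigenvalues of the covariance of the column-centred matrix, descending."""
    centred = matrix - matrix.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    return singular_values**2 / (matrix.shape[0] - 1)


# Synthetic stand-in data (hypothetical): 130 genes, each with a random
# linear trend, an early transient, and measurement noise.
rng = np.random.default_rng(0)
slopes = rng.normal(0.0, 0.02, size=(130, 1))
bumps = rng.normal(0.0, 0.5, size=(130, 1))
mean_profiles = (slopes * TIMES + bumps * np.exp(-TIMES / 10.0)
                 + rng.normal(0.0, 0.1, size=(130, TIMES.size)))

first_share = {}
for lam in (1e-2, 1e0, 1e2, 1e6):
    eig = pca_eigenvalues(derivative_matrix(mean_profiles, lam))
    first_share[lam] = eig[0] / eig.sum()
    print(f"lam={lam:g}: share of variance on first axis = {first_share[lam]:.3f}")
```

With a very large `lam` the fitted curves are close to straight lines, so each derivative is almost constant and nearly all the variance falls on the first axis. Smaller values let more axes carry variance, which is the pattern the scree graph is used to inspect.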
Clustering
To describe each gene expression profile by the derivative of its smoothed curve, the derivative is evaluated at 20 points equally spaced between 0 and 72 hours. The data can then be presented as 130 genes with 20 values for each gene.
Two kinds of clustering are applied: hierarchical clustering and k-means clustering.
Hierarchical clustering uses the Ward criterion: when fusing two clusters, it minimizes the increase in the total within-cluster sum of squares.
K-means clustering: to correct any improper fusion made during hierarchical clustering, k-means is run with the k centroids obtained from the hierarchical clustering as initial centers.

Results
Hierarchical clustering: the dendrogram is cut into four clusters, which correspond to four temporal expression profiles: weakly increasing (hc1), stationary (hc2), decreasing (hc3), and strongly increasing (hc4).
To make the clustering more robust, k-means clustering is performed with the initial centers set to the centers of the classes obtained when cutting the dendrogram.
[Figure: clustering changes between hierarchical and k-means clustering, highlighting the main change]
[Figure: the four clusters of the k-means clustering]
[Figure: graphical display representing variables (time points) and individuals (genes), with clusters km1 to km4]

Discussion
Before the clustering step, spline smoothing is applied as a de-noising method. Based on this work, one could choose the time points appropriately depending on the scientific aims.
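The two-stage clustering described in the Clustering section (a Ward dendrogram cut into k clusters, followed by k-means started from the resulting centroids) can be sketched with SciPy as follows. This is an illustrative sketch, not the authors' code: the function name `ward_then_kmeans`, the group levels, and the noise scale are hypothetical, and the 130 x 20 matrix is synthetic rather than the real matrix of discretized derivatives.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2


def ward_then_kmeans(data, n_clusters):
    """Cut a Ward dendrogram, then refine the partition with k-means."""
    tree = linkage(data, method="ward")
    hc_labels = fcluster(tree, t=n_clusters, criterion="maxclust")  # labels 1..k
    # Centroids of the hierarchical clusters serve as initial k-means centres.
    init = np.vstack([data[hc_labels == c].mean(axis=0)
                      for c in np.unique(hc_labels)])
    centres, km_labels = kmeans2(data, init, minit="matrix")
    return hc_labels - 1, km_labels, centres  # both label sets are 0-based


# Synthetic stand-in for the 130 x 20 matrix of discretized derivatives.
rng = np.random.default_rng(0)
levels = np.array([0.01, 0.0, -0.02, 0.05])  # hypothetical mean derivative per group
group = rng.integers(0, 4, size=130)
data = levels[group][:, None] + rng.normal(0.0, 0.002, size=(130, 20))

hc_labels, km_labels, centres = ward_then_kmeans(data, n_clusters=4)
moved = int(np.sum(hc_labels != km_labels))
print(f"{moved} genes changed cluster between the two steps")
```

Because the k-means centres are initialised row by row from the hierarchical clusters, the two label vectors use the same numbering, so comparing them directly shows which genes were reassigned by the k-means step.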