Introduction to Time-Course Gene Expression Data
STAT 675, R. Guerra, April 21, 2008

Outline
• The Data
• Clustering – nonparametric, model-based
• A case study
• A new model

The Data
• DNA microarrays: collections of microscopic DNA spots, often representing single genes, attached to a solid surface

The Data
• Gene expression changes over time due to environmental stimuli or the changing needs of the cell
• Measuring gene expression against time leads to time-course data sets

Time-Course Gene Expression
• Each row represents a single gene
• Each column represents a single time point
• These data sets can be massive, analyzing many genes simultaneously

Time-Course Gene Expression
• K-means clustering:
• "in the budding yeast Saccharomyces cerevisiae, clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data…" Eisen et al. (1998)

Clustering Expression Data
• When these data sets first became available, it was common to cluster using nonparametric clustering techniques like K-means and hierarchical clustering

Yeast Data Set
• Spellman et al. (1998) measured mRNA levels in yeast (Saccharomyces cerevisiae)
– 18 equally spaced time points
– Of 6300 genes, nearly 800 were categorized as cell-cycle regulated
– A subset of 433 genes with no missing values is a commonly used data set in papers detailing new time-course methods
– The original and follow-up papers clustered the genes using K-means and hierarchical clustering

[Figure: Spellman et al. (1998) yeast cell-cycle data; rows = genes (row labels = cell-cycle phase), columns = time points (column labels = experiments)]

Yeast Data Set (Spellman et al.)
[Figure: K-means vs. hierarchical clustering of the yeast data]
• Which method gives the "right" result?

Non-Parametric Clustering
1. Data curves
2. Apply a distance metric to get a distance matrix
3. Cluster

Issues with Non-Parametric Clustering
• Technical
– Require the number of clusters to be chosen a priori
– Do not take into account the time-ordering of the data
– Hard to incorporate covariate data, e.g., gene ontology
• The yeast analysis had the number of clusters chosen based on the number of cell-cycle groups … no statistical validation showing that these were the best clustering assignments

Model-Based Clustering
• In response to the limitations of nonparametric methods, model-based methods were proposed
– Time series
– Spline methods
– Hidden Markov models
– Bayesian clustering models
• Little consensus over which method is "best" for clustering time-course data

K-Means Clustering
• Relocation method: the number of clusters is predetermined and curves can change clusters at each iteration (an R sketch follows below)
– Initially, data are assigned at random to k clusters
– A centroid is computed for each cluster
– Each curve is reassigned to the cluster whose centroid is closest to it
– The algorithm repeats until there is no further change in the assignment of data to clusters
– The Hartigan rule is used to select the "optimal" number of clusters
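A minimal R sketch of the relocation algorithm with multiple starting chains, using base R's kmeans() on a small simulated gene-by-timepoint matrix; the data, cluster count, and noise level are made up for illustration and this is not the thesis code. Note that kmeans() uses Euclidean distance; clustering on Pearson correlation would need, for example, row-standardised profiles or a different implementation.

```r
## Simulated gene-by-timepoint matrix: 100 genes, 18 time points, two patterns
set.seed(675)
n_genes <- 100
n_times <- 18
tt      <- seq(0, 2 * pi, length.out = n_times)

expr <- rbind(
  t(replicate(n_genes / 2, sin(tt) + rnorm(n_times, sd = 0.3))),
  t(replicate(n_genes / 2, cos(tt) + rnorm(n_times, sd = 0.3)))
)

## kmeans() implements the relocation method: curves are reassigned to the
## nearest centroid until assignments stop changing; nstart repeats the whole
## run from several random initialisations ("starting chains") and keeps the
## run with the smallest total within-cluster sum of squares.
fit <- kmeans(expr, centers = 2, nstart = 25)

table(fit$cluster)   # cluster sizes
fit$tot.withinss     # within-cluster sum of squares of the best run
```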
K-means: Hartigan Rule
• For n curves, let k1 = k groups and k2 = k + 1 groups.
• If E1 and E2 are the within-cluster sums of squares for k1 and k2 respectively, then add the extra group if:
(E1/E2 − 1)(n − k − 1) > 10

K-means: Distance Metric
• Euclidean distance
• Pearson correlation

K-means: Starting Chains
• Initially, data are randomly assigned to k clusters, but this choice of k cluster centers can affect the final clustering
• The R implementation of K-means allows the "number of initial starting chains" to be chosen; the run with the smallest sum of within-cluster sums of squares is the run given as output

K-Means: Starting Chains
For j = 1 to B
  Randomly assign data to k clusters
  wj = within-cluster sum of squares
End j
Pick the clustering with min(wj)

Hierarchical Clustering
• Hierarchical clustering is an addition (agglomerative) or subtraction (divisive) method.
• Initially each curve is assigned its own cluster
– The two closest clusters are joined into one branch to create a clustering tree
– The clustering stops when the algorithm terminates via a stopping rule

Hierarchical Clustering
• Nearest neighbor: the distance between two clusters is the minimum of all distances between pairs of curves, one from each cluster
• Furthest neighbor: the distance between two clusters is the maximum of all distances between pairs of curves, one from each cluster
• Average linkage: the distance between two clusters is the average of all distances between pairs of curves, one from each cluster

Hierarchical Clustering
• Normally the algorithm stops at a predetermined number of clusters, or when the distance between two clusters reaches some predetermined threshold
– There is no universal rule of thumb for finding an optimal number of clusters with this algorithm

Model-Based Clustering
• Many use mixture models, with splines or piecewise polynomial functions used to approximate the curves
• Can better incorporate covariate information

Models Using Splines
• Time-course profiles are assumed to be observations from some underlying smooth expression curve
• Each data curve is represented as the sum of:
– a smooth population mean spline (dependent on time and cluster assignment)
– a spline function representing individual (gene) effects
– Gaussian measurement noise

SSCLUST Software (Pan)

MCLUST Software
• Yeung et al. (2001), "Model-based clustering and data transformations for gene expression data," Bioinformatics 17:977–987

Validation Methods
• BIC(C) = 2L(C) − m log(n)
• L(C) is the maximized log-likelihood for the model with C clusters, m is the number of independent parameters to be estimated, and n is the number of genes
• Strikes a balance between goodness of fit and model complexity (an R sketch contrasting BIC with the Hartigan rule follows below)
• The non-model-based methods have no such validation method

[Figure: clustering the yeast data using SSClust]
[Figure: clustering the yeast data in MCLUST]

Comparison of Methods
• Ma et al. (2006): Smoothing Spline Clustering (SSClust)
• Simulation study
• SSClust better than MCLUST and nonparametric methods
• Comparison: misclassification rates

[Figure: functional form of the Ma et al. (2006) simulation cluster centers]

MR and OSR
• Misclassification rate: MR = (# of misclassified curves) / (total # of curves)
• Overall success rate: OSR = (% of runs finding the correct # of clusters) × (1 − MR)
– To calculate OSR, the MR is taken only over the cases where the correct number of clusters is found
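A short R sketch of two ways to choose the number of clusters, reusing the simulated matrix expr from the k-means sketch above. The helper function hartigan_k() and all tuning values are illustrative assumptions; only kmeans() and the mclust package's Mclust() are real APIs.

```r
library(mclust)   # Gaussian mixture models with BIC-based model selection

## (1) Hartigan's rule for k-means: with E1, E2 the within-cluster sums of
## squares for k and k+1 clusters, add the extra cluster while
## (E1/E2 - 1) * (n - k - 1) > 10.
hartigan_k <- function(x, k_max = 10, nstart = 25) {
  n   <- nrow(x)
  wss <- sapply(1:k_max, function(k) kmeans(x, k, nstart = nstart)$tot.withinss)
  k <- 1
  while (k < k_max && (wss[k] / wss[k + 1] - 1) * (n - k - 1) > 10) k <- k + 1
  k
}
hartigan_k(expr)   # "optimal" k by Hartigan's rule

## (2) BIC with mclust: fit mixture models over a range of cluster numbers
## and keep the model maximising BIC = 2 * logLik - m * log(n).
fit_mc <- Mclust(expr, G = 1:8)
fit_mc$G           # number of clusters chosen by BIC
```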
Comparison of Methods
• From the Ma et al. (2006) paper:

Clustering Method   Distance Metric   MR (%)   Correct # of Clusters (%)   OSR (%)
K-means             Euclidean          9.73     N/A                         N/A
K-means             Pearson            2.64     N/A                         N/A
MCLUST              N/A                0.38     77                          69.5
SSClust             N/A                0.13     100                         98.7

SSClust Methods Paper
• Concluded that SSClust was the superior clustering method
• Looking at the data, the differences in scale between the four true curves are large
– Typical time-course clusters differ in location and spread, but not in scale to this extreme
– Their conclusions are based on a data set that is not representative of the type of data this clustering method would be used for

Alternative Simulation
[Figure: functional forms for five cluster centers]

Example of SSClust Breaking Down
• Linear curves are joined while sine curves are arbitrarily split into 2 clusters

Simulation Configuration
• Distance metric
– Euclidean or Pearson
• Number of curves
– Small (100), large (3000)
• Resolution of time points
– 13 or 25 time points
– Evenly spaced or unevenly spaced
• Types of underlying curves
– Small (4), large (8)

Simulation Configuration
• Distribution of curves across clusters
– Equally distributed versus unequally distributed
• Noise level
– Small (< 0.5 × SD of the data set)
– Large (> 0.5 × SD of the data set)
• For these cases, found the misclassification rates and the percentage of times that the correct number of clusters was found

[Figure: functional forms of 7 cluster centers]

[Figure: simulation analysis]

Conclusions from Simulations
• MCLUST performed better than SSClust and K-means in terms of misclassification rate and finding the correct number of clusters
• Clustering methods were affected by the level of noise but, in general, not by the number of curves, the number of time points, or the distribution of curves across clusters

[Figure: effect of the number of profiles on OSR]

Comparison Based on Real Data
• Applied these same clustering techniques to real data
• Different numbers of clusters were found by different methods for each real data set

[Figure: yeast data]
[Figure: human fibroblast data]

Simulations Based on Real Data
– Start with real data, like the yeast data set
– Cluster the data using a given clustering method
– Perturb the original data (add noise at each point)
– Evaluate how different the new clustering is from the original clustering (an R sketch of this procedure follows below)
• Use MR and OSR

[Figure: simulations based on yeast data]
[Figure: simulations based on HF data]

Conclusions from these Simulations
• SSClust better than MCLUST and K-means
– This was in contrast to the prior simulations, where MCLUST was best
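A rough R sketch of the perturbation procedure from the "Simulations Based on Real Data" slide, using the simulated matrix expr as a stand-in for a real data set. The thesis summarizes agreement with MR and OSR; here the adjusted Rand index from the mclust package is used as a simpler stand-in agreement measure, and the function name and noise level are illustrative.

```r
library(mclust)   # for adjustedRandIndex()

## Cluster, perturb with Gaussian noise, re-cluster, and compare the new
## clustering with the original one (1 = perfect agreement).
perturb_and_compare <- function(x, k, noise_sd = 0.5 * sd(as.vector(x)), n_rep = 20) {
  base <- kmeans(x, k, nstart = 25)$cluster
  replicate(n_rep, {
    x_pert <- x + matrix(rnorm(length(x), sd = noise_sd), nrow(x), ncol(x))
    adjustedRandIndex(base, kmeans(x_pert, k, nstart = 25)$cluster)
  })
}

summary(perturb_and_compare(expr, k = 2))
```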
Gene Ontology
• So far I have described my work analyzing and comparing clustering results on gene expression data
• Some, like Pan (2006), have argued that clustering methods, even newer model-based clustering methods, are incomplete because they ignore gene function and other biological aspects in the clustering

Gene Ontology
• The expectation is that incorporating biological data alongside the expression data will yield better clustering

Gene Ontology
• The Gene Ontology project (Ashburner et al. 2000) provides a structured vocabulary to describe genes and gene products in organisms
• Three ontologies were developed
– Biological Process (e.g. …)
– Molecular Function (e.g. …)
– Cellular Component (e.g. …)

Annotations
• Gene Ontology annotations are associations made between gene products and the GO terms describing them
[Figure: directed acyclic graph for a gene from the HF data set, using GO Molecular Function annotation]

Clustering Using GO Data
• First, need a distance metric
• Two metrics used are based on the union-intersection distance and the longest-path distance, both developed in Gentleman (2005) and extended by Christian (2007)
• I used the union-intersection distance in my clustering

GO Distances
• The union-intersection distance between two genes is defined from their induced GO graphs as 1 − |V1 ∩ V2| / |V1 ∪ V2|, where V1 and V2 are the two node (term) sets
• Example using two DAGs:
– Min = 0 when the two DAGs are identical
– Max = 1 when the two DAGs have nothing in common

Showing UI Distance
[Figure: example of the UI distance between two DAGs]

Clustering Using All Data
• It is an open question how to cluster genes using both time-course expression data and gene ontology data together
• Two of the methods I used are from Boratyn et al. (2007) and Fang et al. (2006)

Boratyn et al. (2007) Method
• Clusters are based on adding individually scaled distance matrices (an R sketch of the UI distance and this combination appears at the end of this chapter)
– Take the distance matrix from expression clustering and the distance matrix from gene ontology clustering
– Put them on the same scale, [0, 1]
– Add the scaled distance matrices together
– Cluster using this new distance matrix, which captures differences in both expression profiles and gene function

[Figure: yeast data, 12 clusters on the combined distance metric]

Fang et al. (2006) Method
• In this method,
– Gene ontology is a guide for clustering the expression profiles
– Biological Process is the GO annotation used
– The mean squared residual score is used to assess the expression correlation of genes within a cluster from the clustering by GO data

Effect of the Choice of Ontology
• Examined the effect of the choice of ontology (BP, CC, or MF) in my clustering
• Fang et al. (2006) use BP in their method, as it has tended to be the most closely correlated with gene function among the three ontologies

[Figure: effect of the choice of ontology]

Conclusions from the GO Chapter
• Clustering using expression and ontology data together proved to provide expression clusters as good as or better than when the expression data are clustered alone, with the added bonus of a biological basis filtering out potentially nonsensical clusterings

Conclusions from the Paper as a Whole
• Expression clustering by model-based and non-model-based clustering methods has no uniform "best clustering method" in all cases
– But the methods are robust in terms of data apportionment per cluster and the number of curves per data set (important for massive gene data banks)
• Clustering using expression and GO data together improves upon expression clustering alone, and again the methods vary in complexity, performance, and ease of use

Further Extensions
• The GO analysis all used K-means and hierarchical clustering
– Extend GO clustering to model-based clustering techniques like MCLUST and SSClust (currently, GO data can be used as initial conditions in these models but not as some notion of prior model parameters)
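A minimal R sketch of the two GO-related steps above: the union-intersection distance and the Boratyn-style combination of scaled distance matrices. For simplicity the induced GO graphs are represented as flat sets of term IDs rather than full DAGs, and all gene names, GO IDs, and expression values are made up.

```r
## Made-up gene-to-GO-term annotations (stand-ins for real induced GO graphs)
go_terms <- list(
  g1 = c("GO:0001", "GO:0002", "GO:0003"),
  g2 = c("GO:0001", "GO:0002"),
  g3 = c("GO:0007", "GO:0008")
)

## Union-intersection distance: 1 - |A intersect B| / |A union B|
## (0 when the annotation sets are identical, 1 when they share nothing)
ui_dist <- function(a, b) 1 - length(intersect(a, b)) / length(union(a, b))

n <- length(go_terms)
d_go <- matrix(0, n, n, dimnames = list(names(go_terms), names(go_terms)))
for (i in 1:n) for (j in 1:n) d_go[i, j] <- ui_dist(go_terms[[i]], go_terms[[j]])

## Boratyn-style combination: rescale each distance matrix to [0, 1] and add.
## d_expr stands for a Euclidean distance matrix on the same genes' profiles.
d_expr  <- as.matrix(dist(rbind(g1 = c(1, 2, 1), g2 = c(1.1, 2.1, 0.9), g3 = c(-2, 0, 2))))
scale01 <- function(m) (m - min(m)) / (max(m) - min(m))
d_comb  <- scale01(d_expr) + scale01(d_go)

## Cluster on the combined distance, e.g. average-linkage hierarchical clustering
hc <- hclust(as.dist(d_comb), method = "average")
cutree(hc, k = 2)
```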
P. falciparum: Examination of Correlation Between Spatial Location and Temporal Expression of Genes

Motivations
• Evidence for correlation in the literature
– Printing artifact
– Biological
• Develop a visualization and statistical testing methodology

Biological Motivations
[Diagram: operon control (bacteria): promoter, ORF1, ORF2 transcribed as one mRNA; upstream activating sequences (yeast): UAS1 with ORF1 and UAS2 with ORF2, separate mRNAs; locus control region (mammalian globin cluster): LCR with ORF1 and ORF2, multiple mRNAs]

Hypothesis and Statistic
• Statistical: is there correlation between chromosomal location and gene expression?
• Biological: is gene order random?
• H0: no correlation between location on the chromosome and expression
• Consider correlations within partitions

Approach
• Covariogram: general tool
• Partition the chromosome, develop a statistic
• Permutation testing framework
• Check for confounding factors
• Biological significance

Issues
• Confounding (printing) or other artifacts
• Account for inter-gene distances (as opposed to adjacent pairwise correlation)
• Significance of correlation (operon)

Methods: Data
• Need gene information (plasmodb.org has annotated fastA files):

TCAAGCAATTGTTAGATGAGAACAATAGGAAGAATTTAAATTTTAATGAT
CTGGTTATACACCCTTGGTGGTCTTATAAGAATTAA
>Pfa3D7|pfal_chr1|PFA0135w|Annotation|Sanger(protein coding) hypothetical protein Location=join(124752..124823,124961..125719)
ATGATATTTCATAAATGCTTTAAAATTTGTTCGCTCTCTTGTACTGTTTT
ATGGGTTACCGCCATATCATCGATCATTCAACCAGACAAACAACAAGAAA

• Normalized gpr files (2-D loess, centered and scaled)

Methods: Data
• FastA sequence: 5400 predicted genes (e.g., PFA0135w, 124752:125719 bp)
• QC'd microarray: 3800 genes, 5100 probes (e.g., PFA0135w, probe a16122_1, t1, t2, …, t48)
• Intersection: 3500 genes with a common gene name (e.g., PFA0135w, 124752:125719 bp, probe a16122_1, t1, t2, …, t48)

Methods: Covariograms
• Covariogram: γ(x, y; d_a, d_b) = Ave[ ρ(x, y) | d_a ≤ dist(x, y) ≤ d_b ]
• Covariogram 1: distance is chromosomal location:
d(g_i, g_j) = | g_i,midpt(chrloc) − g_j,midpt(chrloc) |
• Covariogram 2: distance is printed microarray location:
d(g_i, g_j) = sqrt( (g_i,x − g_j,x)^2 + (g_i,y − g_j,y)^2 )

[Figures: Chr 10, covariograms 1 and 2; Chr 6, covariograms 1 and 2]

Methods: Partitioning
• Partition the chromosome (e.g., 0 kb to 60 kb, 60 kb to 120 kb, …)
• Within each partition, take the average of all pairwise Pearson correlations (an R sketch of this statistic and the permutation test follows below)
– e.g., 7 genes in 0–60 kb give 21 pairwise correlations: r̄_1 = (1/21) Σ_{i=1}^{21} r_i
– e.g., 3 genes in 60–120 kb give 3 pairwise correlations: r̄_2 = (1/3) Σ_{i=1}^{3} r_i

Methods: Partitioning
• Chr 6, 40 kb partition
• Significant?
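A small R sketch of the window statistic (mean pairwise Pearson correlation among genes in a chromosomal interval) and the permutation test described on the following slides, on simulated data. The gene positions, profile values, window, and function names are all illustrative assumptions.

```r
set.seed(675)
n_genes  <- 60
n_times  <- 48
profiles <- matrix(rnorm(n_genes * n_times), n_genes, n_times)  # rows = genes
midpt    <- sort(runif(n_genes, 0, 240e3))                      # gene midpoints (bp)

## Average of all pairwise correlations among genes with midpoints in [lo, hi)
window_rbar <- function(x, midpt, lo, hi) {
  idx <- which(midpt >= lo & midpt < hi)
  if (length(idx) < 2) return(NA)
  cors <- cor(t(x[idx, ]))            # gene-by-gene correlation matrix
  mean(cors[upper.tri(cors)])
}

## Permutation test: reassign expression profiles to genes at random while
## keeping gene positions fixed, and recompute the window statistic.
perm_test <- function(x, midpt, lo, hi, n_perm = 1000) {
  obs  <- window_rbar(x, midpt, lo, hi)
  null <- replicate(n_perm, window_rbar(x[sample(nrow(x)), ], midpt, lo, hi))
  c(r_obs = obs, p_value = mean(null >= obs))
}

perm_test(profiles, midpt, lo = 0, hi = 40e3)   # e.g. the first 40 kb window
```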
Methods: Permutation Test
• r̄ ≈ 0.50 in a 40 kb interval on chr 6
• Permutation test
• Null distribution
• Estimated p-values

gene   obs   Perm(1)   Perm(2)   …   Perm(n)
g1     e1    e4        e3            e2
g2     e2    e2        e4            e3
g3     e3    e1        e2            e1
g4     e4    e3        e1            e4
…

Methods: Permutation Test
[Figure: null distribution of r̄ in a 40 kb interval; r_obs = 0.57, n_genes = 2, p-val = 0.22]

Methods: Permutation Test
[Figure: null distribution of r̄ in a 40 kb interval; r_obs = 0.72, n_genes = 6, p-val = 0.001]

Methods: Permutation Test
[Figure: null distribution of r̄ in a 40 kb interval; r_obs = 0.49, n_genes = 9, p-val = 0.002]

Methods: Permutation Test
[Figure: null distribution of r̄ in a 40 kb interval; r_obs = 0.018, n_genes = 12, p-val = 0.475]

Significant Intervals (Chr 7)
[Figure, shown over several slides: significant intervals at 100 kb, 80 kb, 60 kb, 40 kb, 20 kb, and 10 kb window sizes]
• Genes in one significant region:
– MAL6P1.257: hypothetical protein
– MAL6P1.258: malate:quinone oxidoreductase
– MAL6P1.259: hypothetical protein
– MAL6P1.260: hypothetical protein
– MAL6P1.263: hypothetical protein
– MAL6P1.265: pyridoxine kinase
– MAL6P1.266: hypothetical protein
– MAL6P1.267: hypothetical protein
– MAL6P1.268: hypothetical protein
– MAL6P1.271: cdc2-like protein kinase
– MAL6P1.272: ribonuclease
– MAL6P1.273: hypothetical protein

Results: Summary Table

          10 kb     60 kb    100 kb   10 kb in 60 kb
Chr 3     3/400     0/68     0/40     0
Chr 4     10/476    5/80     2/48     4
Chr 5     6/528     1/88     3/56     0
Chr 14    4/1304    2/220    1/132    0

Conclusions
• Statistical: significance both for small regions of strong correlation and for large regions of weak correlation
• Biological: evidence for regulation at multiple levels