Download ppt - University of Connecticut

Application of Clustering to Identify Cell Types from Single-Cell mRNA Expression Data
University of Connecticut School of Engineering
Elham Sherafat, Ion Mandoiu
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269

INTRODUCTION
Recent RNA-seq technologies have made it possible to generate single-cell transcriptome data. However, tools and computational strategies that work for bulk cell-population RNA-seq data often cannot be applied successfully to gene expression at the single-cell level, so new analysis tools are needed. In this work, we examine the application of clustering methods to different RNA-seq datasets, in particular the identification of heterogeneous cell types from single-cell transcriptomes. Previous studies have clustered multiple tissue samples by their bulk expression profiles. Clustering methods can also be used to identify hidden tissue heterogeneity on the basis of expression profiles. There are two groups of clustering methods in this context [3]; which one applies depends mainly on the availability of prior knowledge or expectations regarding the relationships between cells. Gene expression profiles are the input to the clustering method. In most cases the values are measured or estimated imprecisely because of biological or technical noise, which affects the accuracy of the clustering result. We applied different clustering methods to different datasets and evaluated how well they handle noisy, high-dimensional data.
DATASET DESCRIPTION
Dataset #1: RNA-Seq of neural cells (MiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: Top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = euclidean; method = complete; I = 200; S = 200; n = 5; r = 0.7; m = 0.5

Dataset #2: RNA-Seq of neural cells (HiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: Top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = cityblock; method = complete; I = 200; S = 200; n = 5; r = 0.4; m = 0.4

Dataset #3: RNA-Seq of mouse somatosensory cortex and hippocampal CA1 cells [6]
• 3005 cells
• Ground truth clusters: Astrocytes_ependymal, Endothelial-mural, Interneurons, Microglia, Oligodendrocytes, Pyramidal CA1, Pyramidal SS and 47 subtypes
• Feature selection: Top 500 genes using ceftools published by [6]
• Best parameters: K = 47; d = 1.2; metric = seuclidean; method = ward; I = 200; S = 200

Dataset #4: qPCR of mouse hematopoietic system [1]
• 327 cells
• Ground truth clusters: HSC (Hematopoietic stem cells), CMP (Common myeloid progenitors), GMP (Granulocyte/monocyte progenitors), MEP (Megakaryocyte/erythroid progenitors), CLP (Common lymphoid progenitors), MPP (Multipotent progenitor cells)
• Feature selection: Top 280 genes from [1]
• Best parameters: K = 6; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.6; m = 0.7

Dataset #5: RNA-Seq of mouse distal lung epithelial cells [4]
• 80 cells
• Ground truth clusters: Clara (Scgb1a1), Ciliated (Foxj1), AT1 (Pdpn, Ager), AT2 (Sftpc, Sftpb), BP (alveolar bipotential progenitor)
• Number of genes: 8,578
• Feature selection: Top 8,578 genes from [4], followed by PCA, keeping the first 10 principal components
• Best parameters: K = 5; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.7; m = 0.9

METHODS
We selected five clustering methods, K-means, fuzzy c-means, hierarchical clustering, EM clustering and SNN-Cliq [5], as tools to reveal cell-type heterogeneity in six different datasets. SNN-Cliq was developed recently specifically for clustering RNA-seq data. Some datasets come with their own rules for reducing the number of features. Even though this already reduces the dimensionality substantially, applying a feature extraction method such as PCA sometimes further improves the final performance of the algorithms. Each algorithm has parameters, listed in the table below. Results are reported for the best parameter setting found for each dataset.

Algorithm                        Parameters
K-means                          K = number of clusters
Fuzzy c-means clustering (FCM)   K = number of clusters
                                 d = degree of fuzziness
Hierarchical clustering (HCS)    Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
                                 Method = average, centroid, complete, median, single
EM clustering                    K = number of clusters
                                 S = number of initial seeds
                                 I = number of iterations
SNN-Cliq                         n = size of the nearest neighbor list
                                 r = density threshold of quasi-cliques
                                 m = threshold on the overlapping rate for merging

[Figure panel: PCA]

RESULTS

EVALUATION METRICS
Seven measures are used to evaluate the clustering algorithms; they are defined as follows.
Purity
$$\mathrm{Purity} = \frac{1}{N} \sum_{i} \max_{j} \lvert v_i \cap u_j \rvert$$
U = {u_j}: set of ground truth classes; V = {v_i}: set of computed clusters; N: total number of objects in the dataset.

The pair-based measures below count pairs of objects: TP (same class, same cluster), FP (different classes, same cluster), FN (same class, different clusters) and TN (different classes, different clusters).

Adjusted Rand Index (ARI)
$$\mathrm{ARI} = \frac{\binom{N}{2}(TP + TN) - \left[(TP + FP)(TP + FN) + (FN + TN)(FP + TN)\right]}{\binom{N}{2}^{2} - \left[(TP + FP)(TP + FN) + (FN + TN)(FP + TN)\right]}$$

Rand Index (RI)
$$\mathrm{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$

F1 Score
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}$$

Mirkin's index (MI)
Counts the disagreements in data pairs between two clusterings: it is the ratio of the number of disagreeing pairs to the total number of pairs. A lower value of Mirkin's index indicates a better clustering.

Hubert's index (HI)
$$\mathrm{HI} = \mathrm{RI} - \mathrm{MI}$$

Corr
Maximum weighted Pearson correlation between the average expression values of each ground truth class and each computed cluster.

CONCLUSION & FUTURE WORK
• Hierarchical clustering (HC) and model-based clustering (EM) performed well on most datasets; the other clustering methods had less consistent performance.
• The best method and the best parameters depend on the dataset.
• A limitation of most current methods is that they do not model the noise in the expression level estimates. Density-based clustering may be a good way of handling noise.
• An additional advantage of model-based methods is that they can incorporate prior knowledge in the inference process; the value of incorporating such prior knowledge is currently under evaluation.
• Further increases in accuracy may come from using time-series data, as in SCUBA [7].
• As the number of cells increases, scalability becomes a concern. In ongoing work we will explore scalable alignment-free clustering methods.

REFERENCES
[1] Guo, Guoji, et al. "Mapping cellular hierarchy by single-cell analysis of the cell surface repertoire." Cell Stem Cell 13.4 (2013): 492-505.
[2] Pollen, Alex A., et al. "Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex." Nature Biotechnology (2014).
[3] Stegle, Oliver, et al. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16.3 (2015): 133-145.
[4] Treutlein, B., et al. "Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq." Nature 509 (2014): 371-375.
[5] Xu, Chen, and Zhengchang Su. "Identification of cell types from single-cell transcriptomes using a novel clustering method." Bioinformatics (2015): btv088.
[6] Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
[7] Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.
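
The purity and pair-counting measures defined under EVALUATION METRICS can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `purity`, `pair_counts` and `pair_metrics` are hypothetical, the label vectors in the usage note are made-up toy data, and Mirkin's index is computed as (FP + FN) divided by the total number of pairs, which is one reading of the verbal definition given on the poster. The Corr measure is omitted because it needs expression values, not just labels.

```python
from collections import Counter
from itertools import combinations


def purity(truth, pred):
    """Purity = (1/N) * sum over clusters of the largest overlap with any class."""
    overlap = Counter(zip(pred, truth))  # (cluster, class) -> number of objects
    best = {}
    for (cluster, _cls), count in overlap.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(truth)


def pair_counts(truth, pred):
    """Count TP, FP, FN, TN over all unordered pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_metrics(truth, pred):
    """RI, MI, HI, F1 and ARI from the pair counts.

    Assumes non-degenerate inputs; a single class with a single cluster
    would make the ARI denominator zero.
    """
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn  # equals C(N, 2)
    ri = (tp + tn) / total
    mi = (fp + fn) / total  # assumed formula for Mirkin's index
    chance = (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)
    ari = (total * (tp + tn) - chance) / (total ** 2 - chance)
    return {
        "RI": ri,
        "MI": mi,
        "HI": ri - mi,
        "F1": 2 * tp / (2 * tp + fp + fn),
        "ARI": ari,
    }
```

For example, with toy labels `truth = [0, 0, 0, 1, 1, 1]` and `pred = [0, 0, 1, 1, 1, 1]`, `pair_counts(truth, pred)` gives 4 TP, 3 FP, 2 FN and 6 TN over the 15 pairs. The explicit loop over pairs is quadratic in the number of cells, which is fine for the smaller datasets here; for thousands of cells a contingency-table formulation would be a more practical choice.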
