Application of Clustering to Identify Cell Types from
Single-Cell mRNA Expression Data
University of Connecticut
School of Engineering
Elham Sherafat, Ion Mandoiu
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269
INTRODUCTION
Recent RNA-seq technologies have made it possible to generate single-cell transcriptome data.
However, tools and computational strategies that work for bulk cell-population RNA-seq data
often cannot be applied successfully to the study of gene expression at the single-cell level,
so new analysis tools are needed. In this work, we examine the application of clustering
methods to different RNA-seq datasets, in particular the identification of heterogeneous
cell types from single-cell transcriptomes.
Clustering of multiple tissue samples by their bulk expression profiles has been done in
previous studies. Clustering methods can also be used to identify hidden tissue heterogeneity
on the basis of expression profiles. There are two groups of clustering methods in this
context [3]; the choice between them mainly depends on the availability of prior knowledge or
expectations regarding the relationships between cells. Gene expression profiles are the
input to the clustering methods. These values are often measured or estimated imprecisely
because of biological or technical noise, which can substantially affect the accuracy of the
clustering result. We applied different clustering methods to different datasets and
evaluated how well they handle noisy, high-dimensional data.
DATASET DESCRIPTION
Dataset #1
RNA-Seq of neural cells (MiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: Top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = euclidean; method = complete; I = 200; S = 200; n = 5; r = 0.7; m = 0.5
Dataset #2
RNA-Seq of neural cells (HiSeq) [2]
• 65 cells
• Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
• Feature selection: Top 500 genes from [2]
• Best parameters: K = 4; d = 1.4; metric = cityblock; method = complete; I = 200; S = 200; n = 5; r = 0.4; m = 0.4
Dataset #3
RNA-Seq of mouse somatosensory cortex and hippocampal CA1 cells [6]
• 3005 cells
• Ground truth clusters: Astrocytes_ependymal, Endothelial-mural, Interneurons, Microglia, Oligodendrocytes, Pyramidal CA1, Pyramidal SS and 47 subtypes
• Feature selection: Top 500 genes using ceftools published by [6]
• Best parameters: K = 47; d = 1.2; metric = seuclidean; method = ward; I = 200; S = 200
Dataset #4
qPCR of mouse hematopoietic system [1]
• 327 cells
• Ground truth clusters: HSC (Hematopoietic stem cells), CMP (Common myeloid progenitors), GMP (Granulocyte/monocyte progenitors), MEP (Megakaryocyte/erythroid progenitors), CLP (Common lymphoid progenitors), MPP (Multipotent progenitor cells)
• Feature selection: Top 280 genes from [1]
• Best parameters: K = 6; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.6; m = 0.7
Dataset #5
RNA-Seq of mouse distal lung epithelial [4]
• 80 cells
• Ground truth clusters: Clara (Scgb1a1), Ciliated (Foxj1), AT1 (Pdpn, Ager), AT2 (Sftpc, Sftpb), BP (alveolar bipotential progenitor)
• Number of genes: 8,578
• Feature selection: The 8,578 genes from [4], then the first 10 principal components after applying PCA
• Best parameters: K = 5; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.7; m = 0.9
METHODS
We selected five clustering methods (K-means, fuzzy c-means, hierarchical clustering, EM
clustering, and SNN-Cliq [5]) as tools to reveal cell-type heterogeneity in six different
datasets. SNN-Cliq was developed recently for clustering RNA-seq data. Some datasets come
with their own rules for reducing the number of features. Although this already reduces
dimensionality considerably, applying a feature extraction method such as PCA sometimes
improves the final performance of the algorithms further. Each algorithm has parameters,
listed in the table below. Results are reported for the best parameter setting found for
each dataset.
Algorithm                        Parameters
K-means                          K = number of clusters
Fuzzy c-means clustering (FCM)   K = number of clusters
                                 d = degree of fuzziness
Hierarchical clustering (HCS)    Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
                                 Method = average, centroid, complete, median, single
EM clustering                    K = number of clusters
                                 S = number of initial seeds
                                 I = number of iterations
SNN-Cliq                         n = size of the nearest-neighbor list
                                 r = density threshold of quasi-cliques
                                 m = threshold on the overlapping rate for merging
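The role of the fuzziness parameter d in fuzzy c-means can be illustrated with a minimal sketch. It computes the textbook FCM membership of a single cell in each cluster from its distances to the cluster centers. The function name `fcm_memberships` and the toy distances are hypothetical, and this is the standard update rule rather than necessarily the implementation behind the reported results.

```python
def fcm_memberships(distances, d):
    """Textbook fuzzy c-means membership of one point in each cluster.

    distances: distances from the point to every cluster center.
    d: degree of fuzziness (must be greater than 1).
    """
    if d <= 1:
        raise ValueError("degree of fuzziness d must be greater than 1")
    # A point lying exactly on one or more centers belongs fully to them.
    zeros = [i for i, dist in enumerate(distances) if dist == 0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(distances))]
    exponent = 2.0 / (d - 1.0)
    # u_k = 1 / sum_j (dist_k / dist_j) ** (2 / (d - 1))
    return [
        1.0 / sum((dist_k / dist_j) ** exponent for dist_j in distances)
        for dist_k in distances
    ]


# Hypothetical distances from one cell to three cluster centers.
toy_distances = [1.0, 2.0, 4.0]
crisp = fcm_memberships(toy_distances, d=1.2)  # close to a hard assignment
soft = fcm_memberships(toy_distances, d=1.4)   # memberships are more spread out
```

Values of d close to 1 push the memberships toward a hard, K-means-like assignment, while larger values spread each cell's membership over more clusters.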
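The metric and method parameters of hierarchical clustering can be sketched in a few lines of plain Python. The example implements three of the listed distance metrics and a naive complete-linkage agglomeration that stops at K clusters. All function names and the toy profiles are hypothetical, and the cubic-time loop is meant for illustration only, not as the implementation used for the experiments.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def correlation(a, b):
    """Correlation distance: 1 minus the Pearson correlation of two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)


def complete_linkage(profiles, k, metric=euclidean):
    """Merge clusters bottom-up until k remain; returns lists of row indices."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                dist = max(metric(profiles[i], profiles[j])
                           for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical two-gene expression profiles for five cells.
toy_profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.0, 0.0]]
toy_clusters = complete_linkage(toy_profiles, k=3, metric=cityblock)
```

Swapping `metric=cityblock` for `euclidean` or `correlation` changes only the pairwise distance, which is why the best metric can differ from one dataset to the next.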
RESULTS
EVALUATION METRICS
Seven measures are used to evaluate the clustering algorithms. Their definitions are as
follows.
Purity
$$\mathrm{Purity} = \frac{1}{N} \sum_{i} \max_{j} \lvert v_i \cap u_j \rvert$$
U = {u_j}: set of ground truth classes; V = {v_i}: set of computed clusters; N: total
number of objects in the dataset
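Purity matches each computed cluster with the ground truth class it overlaps most and reports the fraction of objects covered by those matches. A minimal sketch, with hypothetical function and variable names and made-up toy labels:

```python
from collections import Counter


def purity(true_classes, computed_clusters):
    """Purity = (1/N) * sum over clusters of the largest class overlap."""
    members = {}
    for cls, cluster in zip(true_classes, computed_clusters):
        members.setdefault(cluster, []).append(cls)
    # For each cluster, count the objects of its most frequent class.
    majority_total = sum(
        Counter(classes).most_common(1)[0][1] for classes in members.values()
    )
    return majority_total / len(true_classes)


# Toy labels (not taken from any of the datasets).
toy_true = ["I", "I", "I", "II", "II", "II"]
toy_pred = [0, 0, 1, 1, 2, 2]
toy_purity = purity(toy_true, toy_pred)  # 5 of 6 objects fall in a majority class
```

Purity rewards many small clusters (it equals 1 when every object is its own cluster), which is one reason to report it alongside pair-counting indices.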
Adjusted Rand Index (ARI)
$$\mathrm{ARI} = \frac{\binom{N}{2}(TP + TN) - [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)]}{\binom{N}{2}^{2} - [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)]}$$
TP, FP, FN and TN count pairs of objects: TP pairs share both a class and a cluster, FP
pairs share a cluster but not a class, FN pairs share a class but not a cluster, and TN
pairs share neither.
Rand Index (RI)
RI = (TP + TN) / (TP + FP + FN + TN)
F1 Score
F1 Score = 2×TP / (2×TP + FP + FN)
Mirkin's index (MI)
Counts the disagreements in data pairs between two clusterings: it is the ratio of the
number of disagreeing pairs to the total number of pairs. A lower value of Mirkin's index
indicates a better clustering.
Hubert's index (HI)
HI = RI - MI
Corr
Maximum weighted Pearson correlation between the average expression values of each ground
truth class and each computed cluster
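The pair-based indices (RI, ARI, F1, MI, HI) can all be derived from four pair counts. The sketch below assumes the usual pair-counting convention, which the poster does not spell out: a pair of objects is a TP if it shares both class and cluster, an FP if it shares only the cluster, an FN if it shares only the class, and a TN otherwise. Function names and the toy labels are hypothetical.

```python
from itertools import combinations


def pair_counts(true_classes, computed_clusters):
    """Count TP, FP, FN, TN over all unordered pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(true_classes)), 2):
        same_class = true_classes[i] == true_classes[j]
        same_cluster = computed_clusters[i] == computed_clusters[j]
        if same_cluster and same_class:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_indices(true_classes, computed_clusters):
    """RI, ARI, F1, MI and HI; undefined (division by zero) for degenerate
    inputs such as a single class combined with a single cluster."""
    tp, fp, fn, tn = pair_counts(true_classes, computed_clusters)
    total = tp + fp + fn + tn  # equals N choose 2
    chance = (tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)
    ri = (tp + tn) / total
    mi = (fp + fn) / total  # share of disagreeing pairs
    return {
        "RI": ri,
        "ARI": (total * (tp + tn) - chance) / (total ** 2 - chance),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MI": mi,
        "HI": ri - mi,
    }


# Toy labels (not taken from any of the datasets).
toy_true = ["I", "I", "I", "II", "II", "II"]
toy_pred = [0, 0, 1, 1, 2, 2]
toy_counts = pair_counts(toy_true, toy_pred)
toy_scores = pair_indices(toy_true, toy_pred)
```

Enumerating all pairs is quadratic in the number of cells, which is fine for datasets with tens or hundreds of cells; for thousands of cells a contingency-table formulation would be the more practical choice.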
CONCLUSION & FUTURE WORK
• Hierarchical clustering (HC) and model-based clustering (EM) performed well on most
datasets; the other clustering methods performed less consistently.
• The best method and the best parameters depend on the dataset.
• A limitation of most current methods is that they do not model the noise in the
expression level estimates. Density-based clustering may be a good way of handling noise.
• An additional advantage of model-based methods is that they can incorporate prior
knowledge in the inference process; the value of incorporating such prior knowledge is
currently under evaluation.
• Further gains in accuracy may come from exploiting time-series data, as methods such as
SCUBA [7] do.
• As the number of cells increases, scalability becomes a concern. In ongoing work we will
explore scalable alignment-free clustering methods.
REFERENCES
[1] Guo, Guoji, et al. "Mapping cellular hierarchy by single-cell analysis of
the cell surface repertoire." Cell Stem Cell 13.4 (2013): 492-505.
[2] Pollen, Alex A., et al. "Low-coverage single-cell mRNA sequencing
reveals cellular heterogeneity and activated signaling pathways in
developing cerebral cortex." Nature Biotechnology (2014).
[3] Stegle, Oliver, et al. "Computational and analytical challenges in
single-cell transcriptomics." Nature Reviews Genetics 16.3 (2015): 133-145.
[4] Treutlein, B., et al. "Reconstructing lineage hierarchies of the distal lung
epithelium using single-cell RNA-seq." Nature 509 (2014): 371-375.
[5] Xu, Chen, and Zhengchang Su. "Identification of cell types from
single-cell transcriptomes using a novel clustering method." Bioinformatics (2015):
btv088.
[6] Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus
revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
[7] Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression
data reveals epigenetic landscape." Proceedings of the National Academy of
Sciences 111.52 (2014): E5643-E5650.