Intro Data Clustering - Genomics & Bioinformatics at Purdue
An Overview of Clustering Methods
Michael D. Kane, Ph.D.

Topics
• What is clustering?
• Clustering mechanics (how the computer does it).
• Parameter choices and their effect.
• Examples.

What is clustering?
Grouping by similarity.
• Gene clustering (similar genes): group genes that have similar expression profiles when observed over multiple samples.
• Sample clustering (similar samples): group samples that are similar when observed over multiple genes.
[Figure: genes-by-samples expression matrix, grouped along the gene axis for gene clustering and along the sample axis for sample clustering]

Why cluster?
• Similar gene expression implies common biology. The function of uncharacterized genes may be deduced from coexpression with known genes.
• Associate expression patterns with response to environmental change and with disease pathology/progression.

Clustering Mechanics
For gene clustering, we must measure similarity between genes.
[Figure: genes a to f plotted by their expression in samples E1 and E2, each axis running from - to +]

Distance (similarity) measure: Euclidean distance
[Figure: the same E1/E2 plot with the distance d_be marked between the points (1.0, 1.7) and (4.6, 0.5)]
$$d_{be} = \sqrt{(4.6 - 1.0)^2 + (0.5 - 1.7)^2} \approx 3.8$$

Distance measure: Pearson correlation
$$S(a, b) = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{a_i - \bar{a}}{\sigma_a} \right) \left( \frac{b_i - \bar{b}}{\sigma_b} \right)$$
S ranges from -1 to +1. Used in “Eisen” clustering.

Hierarchical Clustering
[Figure: the E1/E2 plot of genes a to f with the resulting tree over the leaves a, b, c, d, e, f]

Measuring distance between clusters
• Single linkage: the minimum distance between clusters. May form loose clusters. Produces “chained” clusters.
• Complete linkage: the maximum distance between clusters. Tends to form compact clusters.

Methods for joining clusters
• UPGMA, unweighted pair group method (average linkage): the average distance between clusters.
• Weighted pair group method: same as UPGMA, but the distance is weighted by cluster size. Use when clusters are expected to be significantly uneven in size!

Effect of distance measure
[Figure panels: Euclidean, single linkage; Euclidean, complete linkage]

Effect of distance measure
[Figure panels: Euclidean, UPGMA; Euclidean, Ward’s method]

Alternatives to hierarchical clustering: k-means
• Number of clusters specified by user.
• Good when prior knowledge available.

k-means clustering
1. Number of clusters specified by user.
2. Genes randomly assigned to clusters.
3. Assess inter- and intra-cluster similarity.
4. Move genes to an alternative cluster if distance is reduced.
[Figure: genes a to f on the E1/E2 plot]

Alternatives to hierarchical clustering: SOM (self-organizing maps)
• Number of clusters specified by user.
• Good when prior knowledge available.

SOM
[Figure: E1/E2 expression profiles (+, 0, -) for genes a to f and for the representations of clusters 1 to 3]
• User-specified number of clusters. Each is initially given a random expression representation.
• For a gene, find the most similar cluster representation.
• Increase similarity by adjusting the cluster representation.
• Iteratively train the cluster representations (“training”).
• After training, assign each gene to the most similar cluster.

Gene clustering
Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS v95, 14863-14868, 1998
• 24-hour time course after re-introduction of serum to serum-deprived human fibroblasts.
• Pearson correlation, average linkage.
[Figure: clustered expression data with labeled clusters: cholesterol biosynthesis, cell cycle, immediate-early response, signaling, wound healing]

Sample clustering
Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics v24, 227-235, 2000
• 64 cancer cell lines clustered. 8,000 genes.
• Clustering performed with 2 different subsets of genes. Similar results.
• Pearson correlation, average linkage.
• Note the breast cancer cell lines derived from the same patient.

Summary
• Different methods often provide different clusters.
• No overall “best” clustering method.
• Clustering applied to unrelated data will still provide clusters.
• Use biological insight in method selection and interpretation.
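To make the clustering mechanics in the slides concrete, here is a minimal Python sketch of the two distance measures and of hierarchical (agglomerative) clustering with single, complete, or average linkage. It is an illustration under stated assumptions, not the method used in the lecture or in the cited papers: the gene coordinates in `PROFILES` are made up for the example, the function names are hypothetical, and the weighted pair group and Ward's methods are not covered.

```python
import math
from itertools import combinations


def euclidean(p, q):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def pearson(a, b):
    """Pearson correlation as on the slide: mean product of z-scores.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    return sum((x - mean_a) / sd_a * (y - mean_b) / sd_b
               for x, y in zip(a, b)) / n


# How the pairwise gene distances between two clusters are summarized.
LINKAGE = {
    "single": min,                           # minimum distance
    "complete": max,                         # maximum distance
    "average": lambda d: sum(d) / len(d),    # UPGMA
}


def cluster_distance(c1, c2, profiles, linkage):
    pairwise = [euclidean(profiles[g], profiles[h]) for g in c1 for h in c2]
    return LINKAGE[linkage](pairwise)


def agglomerate(profiles, linkage="average", n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > n_clusters:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair, profiles, linkage),
        )
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return clusters, merges


# Hypothetical (E1, E2) expression values for genes a to f; NOT from the slides.
PROFILES = {
    "a": (-3.0, 1.0), "b": (-2.5, 0.5),
    "c": (1.0, 3.0), "d": (1.5, 2.5),
    "e": (3.0, -2.0), "f": (3.5, -2.5),
}

# The two points that do appear on the Euclidean distance slide.
print(euclidean((1.0, 1.7), (4.6, 0.5)))

clusters, merges = agglomerate(PROFILES, linkage="single", n_clusters=3)
print(clusters)
```

The list of merges returned by `agglomerate` is the information a dendrogram displays. Changing `linkage` while keeping the data fixed is a simple way to see the "different methods often provide different clusters" point from the summary.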
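The four k-means steps listed in the slides can be sketched in Python as follows. This is a minimal illustration, assuming Euclidean distance to the cluster mean as the similarity measure; the gene coordinates in `GENES`, the function names, and the `seed` and `max_iter` parameters are hypothetical and not taken from the lecture.

```python
import math
import random


def centroid(points):
    """Mean expression profile of a group of genes."""
    n = len(points)
    return tuple(sum(axis) / n for axis in zip(*points))


def k_means(profiles, k, seed=0, max_iter=100):
    # Step 1: the number of clusters k is specified by the user.
    rng = random.Random(seed)
    names = list(profiles)

    # Step 2: genes are randomly assigned to clusters. Dealing the shuffled
    # genes out in turn keeps every cluster non-empty at the start.
    rng.shuffle(names)
    assignment = {name: i % k for i, name in enumerate(names)}

    for _ in range(max_iter):
        # Step 3: assess similarity by computing each cluster's centroid.
        # A cluster that has lost all its genes simply drops out.
        centers = {
            c: centroid([profiles[g] for g, a in assignment.items() if a == c])
            for c in set(assignment.values())
        }

        # Step 4: move a gene to another cluster if that reduces its distance.
        moved = False
        for gene, current in list(assignment.items()):
            best = min(centers,
                       key=lambda c: math.dist(profiles[gene], centers[c]))
            if (math.dist(profiles[gene], centers[best])
                    < math.dist(profiles[gene], centers[current])):
                assignment[gene] = best
                moved = True
        if not moved:
            break  # no gene changed cluster, so the assignment is stable
    return assignment


# Hypothetical (E1, E2) expression values for genes a to f; NOT from the slides.
GENES = {
    "a": (-3.0, 1.0), "b": (-2.5, 0.5),
    "c": (1.0, 3.0), "d": (1.5, 2.5),
    "e": (3.0, -2.0), "f": (3.5, -2.5),
}

print(k_means(GENES, k=3, seed=1))
```

Because the starting assignment is random, different seeds can settle on different groupings, and the procedure returns clusters even for data with no real structure, which is the caution raised in the summary.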