Clustering
Jarno Tuimala

Clustering
• Aim
  – Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters
• Exploratory data analysis
  – View all data simultaneously
  – Identify clusters and patterns in data
• Uses:
  – Time series analysis
  – Visualization of known classes

Unsupervised vs. Supervised

Clustering methods
• Hierarchical clustering
  – Single, average (UPGMA) and complete linkage
• Non-hierarchical clustering
  – K-means

Hierarchical clustering
• Two phases
  – Pick a distance method
    • Euclidean
    • Pearson / Spearman correlation
  – Pick the dendrogram drawing method
    • Single linkage
    • Average linkage
    • Complete linkage

Distances
• Euclidean
  – Difference in absolute expression levels between gene or chip expression profiles
  – Similar values are clustered together
• Correlation
  – Difference in trends
  – Similar trends are clustered together
  – Typically: Pearson or Spearman correlation

Dendrogram drawing
Single, average, and complete linkage

UPGMA example

Hierarchical Clustering
[Figure: sequence of tree panels titled "10 gene tree", "non-binary", "8 gene tree", "7 gene tree", "5 gene tree", "3 gene...", "2..."]
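
The two phases of hierarchical clustering (pick a distance, then pick a linkage rule) can be illustrated with a minimal sketch. It uses only the Python standard library; the function names (`euclidean`, `pearson_distance`, `hierarchical`) and the three toy gene profiles are assumptions made for illustration and do not come from the slides or from any particular microarray package.

```python
import math

def euclidean(a, b):
    """Euclidean distance: reacts to differences in absolute values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: reacts to differences in trend only."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)

def hierarchical(profiles, distance, linkage="average"):
    """Agglomerative clustering of named profiles.

    Returns (tree, merges): the tree as nested tuples of names, and the
    list of merges as (left_subtree, right_subtree, distance).
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairwise = [distance(profiles[a], profiles[b])
                            for a in clusters[i][1] for b in clusters[j][1]]
                if linkage == "single":      # closest pair of members
                    d = min(pairwise)
                elif linkage == "complete":  # farthest pair of members
                    d = max(pairwise)
                else:                        # average linkage (UPGMA)
                    d = sum(pairwise) / len(pairwise)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        merges.append((clusters[i][0], clusters[j][0], d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0], merges

# Hypothetical toy profiles (not data from the slides).
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [10.0, 20.0, 30.0],  # same trend as geneA, much larger values
    "geneC": [3.0, 2.0, 1.0],     # similar values to geneA, opposite trend
}

tree_euclid, merges_euclid = hierarchical(profiles, euclidean)
tree_pearson, merges_pearson = hierarchical(profiles, pearson_distance)
```

In this toy case the Euclidean run joins geneA with geneC first (similar values), while the correlation run joins geneA with geneB first (similar trend), which is the contrast the Distances slide describes. The exhaustive pairwise search is written for readability and would be too slow for thousands of genes.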
[Figure: hierarchical clustering trees for the genes Gata3, Kcnd2, Api6, Dyrk1b, Cyb561, Casp12, Gria4, Gpcr25, Fgfr1 and Gdf3, with accession labels X55123, Y18280, L06443, U16297, U39827, Y13090 and M33760; samples are labelled by Time (0, 4, 24) and Strain (chocolate_addict, normal). Source: Silicon Genetics, 2003]

Heatmap

K-means clustering
• Partitioning method
  – The dataset is divided into K clusters
  – The user needs to decide on K before the run
• K-means is a heuristic algorithm, so different runs can give dissimilar results
  – Make several runs, and select the one giving the minimum sum of within-cluster variance

K-means Clustering
[Figure: K-means clustering example. Source: Silicon Genetics, 2003]
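
The advice to make several runs and keep the one with the smallest within-cluster variance can be sketched as follows. This is a minimal illustration using only the Python standard library; the names `kmeans_once` and `kmeans`, the choice of random data points as starting centers, and the toy points are all assumptions, not the procedure used by any specific tool shown in the slides.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start.

    Returns (within-cluster sum of squares, labels, centers).
    """
    centers = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:     # converged: assignments did not change
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:              # an empty cluster keeps its old center
                centers[c] = mean(members)
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return wss, labels, centers

def kmeans(points, k, n_runs=10, seed=0):
    """Repeat K-means n_runs times and keep the run with the smallest
    within-cluster sum of squares."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[0])

# Hypothetical toy data: two well-separated groups of three points.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
          [5.0, 5.1], [4.9, 5.0], [5.1, 4.9]]
wss, labels, centers = kmeans(points, k=2)
```

The cluster numbers themselves are arbitrary and can differ between runs; only the grouping and the within-cluster sum of squares are comparable across runs.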
Visualization

Gene selection
• Genes are usually filtered before clustering.
  – This decreases calculation time.
• Typically a few hundred genes with the highest variance (or standard deviation) are selected.
• If you have, e.g., two types of cancer, do not use a t-test for selecting genes. The clustering will then always separate the two cancer types, because the genes were selected for exactly that difference.
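
A variance filter of the kind described above can be sketched in a few lines. It ranks genes without looking at any class labels, which is why it does not bias the clustering the way a t-test selection would. The function name `top_variance_genes` and the toy expression values are assumptions for illustration only.

```python
import statistics

def top_variance_genes(expression, n_keep):
    """Return the names of the n_keep genes with the highest sample
    variance across chips, highest first."""
    variances = {gene: statistics.variance(values)
                 for gene, values in expression.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:n_keep]

# Hypothetical toy data: gene name -> expression value on each of four chips.
expression = {
    "flat":   [5.0, 5.1, 4.9, 5.0],
    "up":     [1.0, 3.0, 6.0, 9.0],
    "noisy":  [2.0, 2.5, 1.8, 2.2],
    "switch": [0.0, 0.0, 8.0, 8.0],
}

selected = top_variance_genes(expression, n_keep=2)
```

In a real analysis `n_keep` would be a few hundred, as the slide suggests, and the selected genes would then be passed on to the clustering step.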