Clustering
Jarno Tuimala
Clustering
• Aim
– Grouping objects (genes or chips) into clusters so that
the objects inside one cluster are more closely related
to each other than to objects in other clusters
• Exploratory data analysis
– View all data simultaneously
– Identify clusters and patterns in data
• Uses:
– Time series analysis
– Visualization of known classes
Unsupervised vs. Supervised
Clustering methods
• Hierarchical clustering
– single, average (UPGMA) and complete
linkage
• Non-hierarchical clustering
– K-means
Hierarchical clustering
• Two phases
– Pick a distance method
• Euclidean
• Pearson / Spearman correlation
– Pick the dendrogram drawing method
• Single linkage
• Average linkage
• Complete linkage
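The three linkage rules differ only in how the distance between two clusters is derived from the distances between their members. The following is a minimal sketch in plain Python, not the method of any particular package; the function names (cluster_distance, agglomerate) and the four toy profiles are made up for illustration, and Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)


def cluster_distance(a, b, points, linkage):
    """Distance between clusters a and b, given as lists of point indices."""
    pairwise = [dist(points[i], points[j]) for i in a for j in b]
    if linkage == "single":    # closest pair of members
        return min(pairwise)
    if linkage == "complete":  # farthest pair of members
        return max(pairwise)
    if linkage == "average":   # mean over all member pairs (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points, linkage),
        )
        height = cluster_distance(a, b, points, linkage)
        merges.append((sorted(a), sorted(b), height))
        clusters = [c for c in clusters if c is not a and c is not b] + [a + b]
    return merges


# Hypothetical toy data: two tight pairs of profiles, far apart from each other.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
single_merges = agglomerate(profiles, linkage="single")
complete_merges = agglomerate(profiles, linkage="complete")
```

Each entry of the merge history holds the two merged clusters and the height at which they join, which is the information a dendrogram draws. On this toy data the final merge is higher with complete linkage than with single linkage, because complete linkage uses the farthest pair of members.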
Distances
• Euclidean
– Average difference between gene or chip
expression profiles
– Similar values are clustered together
• Correlation
– Difference in trends
– Similar trends are clustered together
– Typically: Pearson or Spearman correlation
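The difference between the two kinds of distance can be illustrated with a small sketch. The three profiles are invented, and "1 - r" is assumed as the correlation distance (a common convention, not something stated on the slides).

```python
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed convention: 0 for identical trends, 2 for opposite trends."""
    return 1.0 - pearson(x, y)


# Hypothetical expression profiles over four chips.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]  # same trend as gene_a, much higher values
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar values to gene_a, opposite trend

euclid_ab = dist(gene_a, gene_b)
euclid_ac = dist(gene_a, gene_c)
corr_ab = correlation_distance(gene_a, gene_b)
corr_ac = correlation_distance(gene_a, gene_c)
```

With Euclidean distance gene_a lies closer to gene_c (similar values); with correlation distance it lies closer to gene_b (similar trend).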
Dendrogram drawing
[Figure: single, average, and complete linkage]
UPGMA example
Hierarchical Clustering
[Figure: gene trees (10 gene tree, 10 gene tree non-binary, 8, 7 and 5 gene trees, 3 gene..., 2...)
Genes: Gata3, Kcnd2, Api6, Dyrk1b, Cyb561, Casp12, Gria4, Gpcr25, Fgfr1, Gdf3
Labels: X55123, Y18280, L06443, U16297, U39827, Y13090, M33760
Conditions: Time 0, 4, 24; Strain chocolate_addict, normal]
Silicon Genetics, 2003
Heatmap
K-means clustering
• Partitioning method
– The dataset is divided into K clusters
– User needs to decide on K before the run
• K-means is a heuristic algorithm, so different
runs can give different results
– Make several runs, and select the one giving
the minimum sum of within-cluster variances
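The advice to make several runs and keep the best one can be sketched as follows. This is a minimal plain-Python version of the standard assign/update iteration (Lloyd's algorithm), assuming Euclidean distance and random initial centers drawn from the data; the function names, the fixed seed and the six toy points are illustrative only.

```python
import random
from math import dist


def kmeans(points, k, rng, max_iter=100):
    """One K-means run from k randomly chosen starting centers."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squared distances to the centers.
    wss = sum(dist(p, centers[i]) ** 2 for i, g in enumerate(groups) for p in g)
    return centers, groups, wss


def best_of_runs(points, k, runs=10, seed=0):
    """Repeat K-means and keep the run with the smallest within-cluster sum."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)), key=lambda r: r[2])


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centers, groups, wss = best_of_runs(points, k=2)
```

K is passed in by the caller, as on the slide; the sketch does nothing to choose it.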
K-means Clustering
Silicon Genetics, 2003
Visualization
Gene selection
• Genes are usually filtered before
clustering.
– This decreases calculation time.
• Typically a few hundred genes with
highest variance (or standard deviation)
are selected.
• If you have, e.g., two types of cancer, do
not use a t-test for selecting genes. You will
always get a result where the clusters
separate the cancer types.
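A variance filter of the kind described above might look like the sketch below. The gene names and expression values are invented, and the function name is hypothetical. Unlike a t-test, the filter never looks at the sample classes, so it cannot force the clusters to follow them.

```python
from statistics import variance


def top_variance_genes(expression, n):
    """Return the names of the n genes with the highest variance across chips."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:n]


# Hypothetical expression values: one list of chip measurements per gene.
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],  # nearly constant
    "geneB": [1.0, 9.0, 2.0, 8.0],  # strongly varying
    "geneC": [3.0, 4.0, 3.0, 4.0],  # mildly varying
}
selected = top_variance_genes(expression, 2)
```

Here the nearly constant geneA is dropped and the two more variable genes are passed on to clustering.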