Clustering and Data Mining in R
Introduction
Thomas Girke
December 7, 2012
Clustering and Data Mining in R
Slide 1/40
Introduction
Data Preprocessing
Data Transformations
Distance Methods
Cluster Linkage
Hierarchical Clustering
Approaches
Tree Cutting
Non-Hierarchical Clustering
K-Means
Principal Component Analysis
Multidimensional Scaling
Biclustering
Clustering with R and Bioconductor
What is Clustering?
Clustering is the classification of data objects into similarity
groups (clusters) according to a defined distance measure.
It is used in many fields, such as machine learning, data
mining, pattern recognition, image analysis, genomics,
systems biology, etc.
Why Clustering and Data Mining in R?
Efficient data structures and functions for clustering
Reproducible and programmable
Comprehensive set of clustering and machine learning libraries
Integration with many other data analysis tools
Useful Links
Cluster Task Views: Link
Machine Learning Task Views: Link
UCR Manual: Link
Data Transformations
Choice depends on the data set!

Center & standardize
1. Center: subtract from each vector its mean
2. Standardize: divide by the standard deviation
⇒ Mean = 0 and STDEV = 1

Center & scale with the scale() function
1. Center: subtract from each vector its mean
2. Scale: divide each centered vector by its root mean square (rms)

$x_{\mathrm{rms}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} x_i^2}$

⇒ Mean = 0 and STDEV = 1

Log transformation
Rank transformation: replace measured values by ranks
No transformation
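Both centering steps are handled by the scale() function mentioned above; a minimal sketch on a small random matrix (the data here are made up for illustration):

```r
## Row-wise centering and standardization with scale().
## scale() operates column-wise, so the matrix is transposed twice.
set.seed(1)
m <- matrix(rnorm(20), 4, 5)      # 4 rows (e.g. genes) x 5 columns (e.g. samples)
mscaled <- t(scale(t(m)))         # center and scale each row
round(rowMeans(mscaled), 10)      # all 0: rows are centered
round(apply(mscaled, 1, sd), 10)  # all 1: rows are standardized
```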
Distance Methods
List of the most common ones!

Euclidean distance for two profiles X and Y:

$d(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Disadvantages: not scale invariant, not suitable for negative correlations

Maximum, Manhattan, Canberra, binary, Minkowski, ...

Correlation-based distance: 1 − r

Pearson correlation coefficient (PCC):

$r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left(n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right)\left(n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right)}}$

Disadvantage: sensitive to outliers

Spearman correlation coefficient (SCC)
Same calculation as PCC but with ranked values!
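Both correlation-based distances translate into one line of R each; a sketch on made-up data (the matrix m is hypothetical):

```r
## Correlation-based distances 1 - r for PCC and SCC (hypothetical data)
set.seed(2)
m <- matrix(rnorm(30), 6, 5)
d_pcc <- as.dist(1 - cor(t(m), method="pearson"))   # 1 - PCC
d_scc <- as.dist(1 - cor(t(m), method="spearman"))  # 1 - SCC (PCC on ranks)
range(d_pcc)  # correlation-based distances always fall in [0, 2]
```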
Cluster Linkage
Single Linkage
Complete Linkage
Average Linkage
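The three linkage rules are selected via the method argument of hclust(); a minimal sketch on hypothetical data:

```r
## Comparing linkage rules in hclust() (hypothetical data)
set.seed(3)
d <- dist(matrix(rnorm(25), 5, 5))
hs <- hclust(d, method="single")    # distance of the closest pair between clusters
hc <- hclust(d, method="complete")  # distance of the most distant pair
ha <- hclust(d, method="average")   # mean of all pairwise distances (UPGMA)
## Single linkage tends to merge at lower heights than complete linkage
c(single=max(hs$height), average=max(ha$height), complete=max(hc$height))
```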
Hierarchical Clustering Steps
1. Identify clusters (items) with closest distance
2. Join them to new clusters
3. Compute distance between clusters (items)
4. Return to step 1
Hierarchical Clustering
Agglomerative Approach

[Figure: agglomerative clustering of items g1–g5 in three panels (a–c); clusters are joined stepwise at increasing heights 0.1, 0.4, 0.5 and 0.6 until all items form one tree.]
Hierarchical Clustering Approaches
1. Agglomerative approach (bottom-up): hclust() and agnes()
2. Divisive approach (top-down): diana()
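In R, hclust() is in the base stats package, while agnes() and diana() come from the cluster package; a minimal sketch on hypothetical data:

```r
## Agglomerative vs. divisive hierarchical clustering (hypothetical data)
library(cluster)  # provides agnes() and diana()
set.seed(4)
d <- dist(matrix(rnorm(25), 5, 5))
hr <- hclust(d, method="average")  # agglomerative, base R
ag <- agnes(d, method="average")   # agglomerative, cluster package
di <- diana(d)                     # divisive
## All three can be drawn as dendrograms, e.g. plot(as.dendrogram(hr))
```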
Tree Cutting to Obtain Discrete Clusters
1. Node height in tree
2. Number of clusters
3. Search tree nodes by distance cutoff
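The first two strategies correspond to the h and k arguments of cutree(), which is also used later in these slides; a sketch on hypothetical data:

```r
## Tree cutting by height vs. by cluster count (hypothetical data)
set.seed(5)
hr <- hclust(dist(matrix(rnorm(40), 8, 5)))
cl_h <- cutree(hr, h=max(hr$height)/2)  # cut at half the maximum tree height
cl_k <- cutree(hr, k=3)                 # request exactly 3 clusters
table(cl_k)                             # sizes of the 3 clusters
```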
Non-Hierarchical Clustering
Selected Examples
K-Means Clustering
1. Choose the number of k clusters
2. Randomly assign items to the k clusters
3. Calculate a new centroid for each of the k clusters
4. Calculate the distance of all items to the k centroids
5. Assign items to the closest centroid
6. Repeat until cluster assignments are stable
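The iteration above is what kmeans() runs internally (a sketch on hypothetical data; note that R's kmeans() starts from random centers rather than random item assignments):

```r
## K-means with k = 3 on hypothetical data
set.seed(6)
m <- matrix(rnorm(50), 10, 5)
km <- kmeans(m, centers=3)
km$cluster  # final cluster assignment of each row
km$centers  # the 3 centroids
```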
K-Means
[Figure: k-means iterations in three panels (a–c); the X symbols mark the cluster centroids as they move toward convergence.]
Principal Component Analysis (PCA)
Principal component analysis (PCA) is a data reduction technique that simplifies multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis.
Basic PCA Steps
Center (and standardize) the data
First principal component axis
  Passes through the centroid of the data cloud
  The distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud
Second principal component axis
  Orthogonal to the first principal component
  Along the maximum remaining variation in the data
The 1st PCA axis becomes the x-axis and the 2nd PCA axis the y-axis
Continue the process until the necessary number of principal components is obtained
PCA on Two-Dimensional Data Set
[Figure: PCA on a two-dimensional data set; the 1st and 2nd principal component axes are rotated to become the new x- and y-axes.]
Identifies the Amount of Variability between Components
Example

Principal Component   Proportion of Variance
1st                   62%
2nd                   34%
3rd                   3%
Other                 rest

The 1st and 2nd principal components explain 96% of the variance.
Multidimensional Scaling (MDS)
Alternative dimensionality reduction approach
Represents distances in 2D or 3D space
Starts from distance matrix (PCA uses data points)
Biclustering
Finds in a matrix subgroups of rows and columns that are as similar as possible to each other and as different as possible from the remaining data points.
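These slides show no biclustering code; one commonly used option (an assumption on my part, not part of the original material) is the CRAN biclust package, e.g. with the Cheng & Church algorithm:

```r
## Hypothetical sketch; requires install.packages("biclust")
library(biclust)
set.seed(7)
m <- matrix(rnorm(400), 20, 20)
bc <- biclust(m, method=BCCC(), delta=1.5, alpha=1, number=5)
bc@Number  # number of biclusters found
```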
[Figure: unclustered matrix ⇒ biclustered matrix]
Remember: There Are Many Additional Techniques!
Additional details can be found in the Clustering Section of the
R/Bioc Manual Link
Clustering with R and Bioconductor
Data Preprocessing
Scaling and Distance Matrices
> ## Sample data set
> set.seed(1410)
> y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""),
+             paste("t", 1:5, sep="")))
> dim(y)
[1] 10  5
> ## Scaling
> yscaled <- t(scale(t(y))) # Centers and scales y row-wise
> apply(yscaled, 1, sd)
 g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
  1   1   1   1   1   1   1   1   1   1
> ## Euclidean distance matrix
> dist(y[1:4,], method = "euclidean")
         g1       g2       g3
g2 4.793697
g3 4.932658 6.354978
g4 4.033789 4.788508 1.671968
Correlation-based Distances
Correlation matrix
> c <- cor(t(y), method="pearson")
> as.matrix(c)[1:4,1:4]
            g1         g2          g3         g4
g1  1.00000000 -0.2965885 -0.00206139 -0.4042011
g2 -0.29658847  1.0000000 -0.91661118 -0.4512912
g3 -0.00206139 -0.9166112  1.00000000  0.7435892
g4 -0.40420112 -0.4512912  0.74358925  1.0000000

Correlation-based distance matrix
> d <- as.dist(1-c)
> as.matrix(d)[1:4,1:4]
         g1       g2        g3        g4
g1 0.000000 1.296588 1.0020614 1.4042011
g2 1.296588 0.000000 1.9166112 1.4512912
g3 1.002061 1.916611 0.0000000 0.2564108
g4 1.404201 1.451291 0.2564108 0.0000000
Hierarchical Clustering with hclust I
Hierarchical clustering with complete linkage and basic tree plotting
> hr <- hclust(d, method = "complete", members=NULL)
> names(hr)
[1] "merge"       "height"      "order"       "labels"      "method"
[6] "call"        "dist.method"
> par(mfrow = c(1, 2)); plot(hr, hang = 0.1); plot(hr, hang = -1)

[Figure: two cluster dendrograms of g1–g10 from hclust(d, "complete"), plotted with hang = 0.1 and hang = -1.]
Tree Plotting I
Plot trees horizontally
> plot(as.dendrogram(hr), edgePar=list(col=3, lwd=4), horiz=T)
[Figure: horizontal dendrogram of g1–g10 with thick green edges (edgePar=list(col=3, lwd=4)); the height axis runs from 2.0 down to 0.0.]
Tree Plotting II
The ape library provides more advanced features for tree plotting
> library(ape)
> plot.phylo(as.phylo(hr), type="p", edge.col=4, edge.width=2,
+            show.node.label=TRUE, no.margin=TRUE)

[Figure: phylogram-style tree of g1–g10 drawn with plot.phylo.]
Tree Cutting
Accessing information in hclust objects
> hr

Call:
hclust(d = d, method = "complete", members = NULL)

Cluster method   : complete
Number of objects: 10

> ## Print row labels in the order they appear in the tree
> hr$labels[hr$order]
 [1] "g10" "g3"  "g4"  "g2"  "g9"  "g6"  "g7"  "g1"  "g5"  "g8"

Tree cutting with cutree
> mycl <- cutree(hr, h=max(hr$height)/2)
> mycl[hr$labels[hr$order]]
g10  g3  g4  g2  g9  g6  g7  g1  g5  g8
  3   3   3   2   2   5   5   1   4   4
Heatmaps
All in one step: clustering and heatmap plotting
> library(gplots)
> heatmap.2(y, col=redgreen(75))

[Figure: heatmap of y (rows g1–g10, columns t1–t5) with red/green color scheme, row and column dendrograms, and a color key with histogram.]
Customizing Heatmaps
Customizes row and column clustering and shows tree cutting result in row color bar.
Additional color schemes can be found here Link
> hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete")
> mycol <- colorpanel(40, "darkblue", "yellow", "white")
> heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=mycol,
+           scale="row", density.info="none", trace="none",
+           RowSideColors=as.character(mycl))

[Figure: row-scaled heatmap of y with custom row and column dendrograms and a row color bar showing the cutree clusters.]
K-Means Clustering with PAM
Runs k-medoids clustering with the PAM (partitioning around medoids) algorithm, a robust variant of k-means, and shows the result in the color bar of the hierarchical clustering result from before.
> library(cluster)
> pamy <- pam(d, 4)
> (kmcol <- pamy$clustering)
 g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
  1   2   3   3   4   4   4   4   2   3
> heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=mycol,
+           scale="row", density.info="none", trace="none",
+           RowSideColors=as.character(kmcol))

[Figure: same heatmap as on the previous slide, with the row color bar now showing the PAM clusters.]
K-Means Fuzzy Clustering
Runs fuzzy clustering with fanny(), where each item receives membership coefficients for every cluster.
> library(cluster)
> fannyy <- fanny(d, k=4, memb.exp = 1.5)
> round(fannyy$membership, 2)[1:4,]
   [,1] [,2] [,3] [,4]
g1 1.00 0.00 0.00 0.00
g2 0.00 0.99 0.00 0.00
g3 0.02 0.01 0.95 0.03
g4 0.00 0.00 0.99 0.01
> fannyy$clustering
 g1  g2  g3  g4  g5  g6  g7  g8  g9 g10
  1   2   3   3   4   4   4   4   2   3
> ## Returns multiple cluster memberships for coefficients above a certain
> ## value (here >0.1)
> fannyyMA <- round(fannyy$membership, 2) > 0.10
> apply(fannyyMA, 1, function(x) paste(which(x), collapse="_"))
   g1    g2    g3    g4    g5    g6    g7    g8    g9   g10
  "1"   "2"   "3"   "3"   "4"   "4"   "4" "2_4"   "2"   "3"
Multidimensional Scaling (MDS)
Performs MDS analysis on the geographic distances between European cities
> loc <- cmdscale(eurodist)
> ## Plots the MDS results in 2D plot. The minus is required in this example to
> ## flip the plotting orientation.
> plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)")
> text(loc[,1], -loc[,2], rownames(loc), cex=0.8)
[Figure: 2D MDS plot titled "cmdscale(eurodist)"; European cities (Stockholm, Copenhagen, Hamburg, Paris, Rome, Athens, Lisbon, Madrid, Gibraltar, etc.) are arranged according to their pairwise distances, roughly reproducing the map of Europe.]
Principal Component Analysis (PCA)
Performs PCA after scaling the data. prcomp() returns an object of class prcomp that contains five components: (1) the standard deviations (sdev) of the principal components, (2) the matrix of eigenvectors (rotation), (3) the principal component data (x), (4) the centering (center) and (5) the scaling (scale) used.
> library(scatterplot3d)
> pca <- prcomp(y, scale=TRUE)
> names(pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"
> summary(pca) # Prints variance summary for all principal components.
Importance of components:
                          PC1    PC2    PC3     PC4    PC5
Standard deviation     1.3611 1.1777 1.0420 0.69264 0.4416
Proportion of Variance 0.3705 0.2774 0.2172 0.09595 0.0390
Cumulative Proportion  0.3705 0.6479 0.8650 0.96100 1.0000
> scatterplot3d(pca$x[,1:3], pch=20, color="blue")

[Figure: 3D scatter plot of the samples in the space of the first three principal components (PC1–PC3), generated with scatterplot3d.]
Additional Exercises
See here: Link
Session Information
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] scatterplot3d_0.3-33 cluster_1.14.3       gplots_2.11.0
 [4] MASS_7.3-22          KernSmooth_2.23-8    caTools_1.13
 [7] bitops_1.0-4.2       gdata_2.12.0         gtools_2.7.0
[10] ape_3.0-6

loaded via a namespace (and not attached):
[1] gee_4.13-18     lattice_0.20-10 nlme_3.1-105    tools_2.15.2