Computational Methods for Data Analysis – 2015/16
Lab 9: Clustering
In this lab we’ll explore the use of clustering methods for data analysis. The lab
broadly follows Chapter 9 of Machine Learning for Hackers.
1. A toy example
In this first example we’ll create a random matrix to represent the results of a
fictitious data collection exercise in which four customers were asked to rate six
products. The rows of the matrix represent customers, and the columns
products. The ratings can be 1 (liked), -1 (disliked), or 0 (skipped).
set.seed(851982) # To make sure results are consistent
ex.matrix <- matrix(sample(c(-1, 0, 1), 24, replace = TRUE),
                    nrow = 4,
                    ncol = 6)
row.names(ex.matrix) <- c('A', 'B', 'C', 'D')
colnames(ex.matrix) <- c('P1', 'P2', 'P3', 'P4', 'P5', 'P6')
ex.matrix
> ex.matrix
  P1 P2 P3 P4 P5 P6
A  0 -1  0 -1  0  0
B -1  0  1  1  1  0
C  0  0  0  1 -1  1
D  1  0  1 -1  0  0
Each row of the matrix is a vector ‘representing’ that customer. We can use these
vectors to compare customers with each other. One way to do this is to multiply
the matrix by its transpose. The transpose of a matrix is another matrix in
which the rows have become the columns and vice versa:
> t(ex.matrix)
A B C D
P1 0 -1 0 1
P2 -1 0 0 0
P3 0 1 0 1
P4 -1 1 1 -1
P5 0 1 -1 0
P6 0 0 1 0
Multiplying the customers × products matrix by the products × customers matrix
gives a customers × customers matrix that we can use to compare each customer’s
ratings with those of all the others. Matrix multiplication is a basic operation of
linear algebra: we can multiply ex.matrix by its transpose t(ex.matrix) to obtain an
N × N customer-by-customer matrix ex.mult as follows:
ex.mult <- ex.matrix %*% t(ex.matrix)
ex.mult
This produces a matrix whose diagonal elements give the number of products each
customer actually rated (liked or disliked), and whose off-diagonal elements
measure the extent of agreement between two customers – positive for agreement,
negative for disagreement:
> ex.mult
A B C D
A 2 -1 -1 1
B -1 4 0 -1
C -1 0 3 -1
D 1 -1 -1 3
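We can verify the reading of the diagonal directly: each diagonal entry is the dot
product of a customer’s rating vector with itself, and since both 1 and -1 square
to 1 while 0 squares to 0, it simply counts that customer’s non-zero ratings:
diag(ex.mult)
rowSums(ex.matrix != 0) # same counts, computed directly from the ratings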
From this square matrix we can compute the distance between customers. R has
a function called dist() that computes distances according to several different
metrics – Euclidean distance by default. The output is a distance matrix:
ex.dist <- dist(ex.mult)
ex.dist
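Note that dist() computes distances between the rows of ex.mult, so each customer
is compared through their pattern of agreement with everyone else. As a quick
check, the Euclidean distance between A and B can be reproduced by hand:
sqrt(sum((ex.mult['A', ] - ex.mult['B', ])^2)) # sqrt(39), about 6.24 – the A–B entry of ex.dist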
This distance matrix can be used to produce a spatial layout of the customers
on the basis of the distances just calculated. This is done via
Multi-Dimensional Scaling (MDS) – a technique that produces a two-dimensional
visualization of the distances between points in a multi-dimensional space. The R
function cmdscale() takes as input a distance matrix and produces as output an
object that can be plotted:
ex.mds <- cmdscale(ex.dist)
plot(ex.mds, type = 'n')
text(ex.mds, c('A', 'B', 'C', 'D'))
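If you want to inspect the coordinates themselves, cmdscale() returns a matrix
with one row per customer and two columns – the (x, y) positions used by plot()
and text() above:
ex.mds # a 4 x 2 matrix of MDS coordinates, one row per customer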
Exercise: analyze the diagram produced by this procedure. Do the clusters
respect your intuitions?
2. Using kmeans to find clusters
We can now use kmeans to find clusters in the data:
fit <- kmeans(ex.dist, centers = 2)
The object fit contains information about the results of the clustering. In particular,
the component cluster specifies the cluster to which each customer belongs:
fit$cluster
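Besides cluster, the object returned by kmeans() contains several other standard
components worth inspecting:
fit$size     # number of customers assigned to each cluster
fit$withinss # within-cluster sum of squares, one value per cluster
fit$centers  # cluster centres, in the space the clustering was run in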
We can add this information about the clusters to the data as follows:
ex.df <- data.frame(ex.matrix, fit$cluster)
We can get the cluster means as follows:
aggregate(ex.df, by = list(fit$cluster), FUN = mean)
To find the number of clusters using the ‘elbow’ method discussed in the
lectures, we can use the within-cluster sum of squares (wss) as a measure of cost.
The vector wss contains the total within-cluster sum of squares for each number of clusters:
# Total sum of squares for a single cluster, computed on the same
# data that kmeans() sees (the distance matrix, coerced to a matrix)
wss <- (nrow(ex.matrix) - 1) * sum(apply(as.matrix(ex.dist), 2, var))
for (i in 2:3)
  wss[i] <- sum(kmeans(ex.dist, centers = i)$withinss)
plot(1:3, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
3. Using hierarchical clustering
Lastly, hierarchical clustering can be tried on the toy example. The following
command runs Ward’s hierarchical clustering algorithm on the distance matrix:
fit.hclust <- hclust(ex.dist, method="ward.D")
And the following produces a dendrogram of the model:
plot(fit.hclust)
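To turn the tree into flat clusters comparable with the kmeans result, the
dendrogram can be cut at a chosen number of groups using the standard functions
cutree() and rect.hclust():
groups <- cutree(fit.hclust, k = 2) # assign each customer to one of 2 groups
groups
rect.hclust(fit.hclust, k = 2, border = 'red') # highlight the cut on the plot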