Computational Methods for Data Analysis – 2015/16

Lab 9: Clustering

In this lab we'll explore the use of clustering methods for data analysis. The lab broadly follows Chapter 9 of Machine Learning for Hackers.

1. A toy example

In this first example we'll create a random matrix to represent the results of a fictitious data collection exercise in which four customers were asked to rate six products. The rows of the matrix represent customers, and the columns products. The ratings can be 1 (liked), -1 (disliked), or 0 (skipped).

set.seed(851982) # To make sure results are consistent
ex.matrix <- matrix(sample(c(-1, 0, 1), 24, replace = TRUE), nrow = 4, ncol = 6)
row.names(ex.matrix) <- c('A', 'B', 'C', 'D')
colnames(ex.matrix) <- c('P1', 'P2', 'P3', 'P4', 'P5', 'P6')
ex.matrix

> ex.matrix
  P1 P2 P3 P4 P5 P6
A  0 -1  0 -1  0  0
B -1  0  1  1  1  0
C  0  0  0  1 -1  1
D  1  0  1 -1  0  0

Each row of the matrix is a vector 'representing' that customer. We can use these vectors to compare customers with each other. One way to do this is to multiply the matrix by its transpose. The transpose of a matrix is another matrix in which the rows have become the columns and vice versa:

> t(ex.matrix)
    A  B  C  D
P1  0 -1  0  1
P2 -1  0  0  0
P3  0  1  0  1
P4 -1  1  1 -1
P5  0  1 -1  0
P6  0  0  1  0

Multiplying the customers × products matrix by the products × customers matrix gives a customers × customers matrix that we can use to compare a customer's ratings to all the others. Matrix multiplication is a basic part of linear algebra: we can multiply ex.matrix by its transpose t(ex.matrix) to obtain an N × N customer-by-customer matrix ex.mult as follows:

ex.mult <- ex.matrix %*% t(ex.matrix)
ex.mult

This produces a matrix whose diagonal elements represent the number of products each customer reviewed, and whose off-diagonal elements measure the extent of agreement between two customers – positive for agreement, negative for disagreement:

> ex.mult
   A  B  C  D
A  2 -1 -1  1
B -1  4  0 -1
C -1  0  3 -1
D  1 -1 -1  3

From this square matrix we can compute the distance between customers. R has a function called dist() that computes this distance according to different metrics – by default, Euclidean distance. The output is a distance matrix:

ex.dist <- dist(ex.mult)
ex.dist

This distance matrix can be used to produce a spatial layout of the customers on the basis of the distances just calculated. This is done via Multi-Dimensional Scaling – a technique that produces a two-dimensional visualization of the distances between points in a multi-dimensional space. The R function cmdscale takes a distance matrix as input and produces as output an object that can be plotted:

ex.mds <- cmdscale(ex.dist)
plot(ex.mds, type = 'n')
text(ex.mds, c('A', 'B', 'C', 'D'))

Exercise: analyze the diagram produced by this procedure. Do the clusters respect your intuitions?

2. Using kmeans to find clusters

We can now use kmeans to find clusters in the data:

fit <- kmeans(ex.dist, 2)

The object fit contains information about the results of the clustering. In particular, the component cluster specifies the cluster to which every customer belongs:

fit$cluster

We can add this information about the clusters to the data as follows:

ex.df <- data.frame(ex.matrix, fit$cluster)

We can get the cluster means as follows:

aggregate(ex.df, by = list(fit$cluster), FUN = mean)

To find the number of clusters using the 'elbow' method discussed in the lectures, we can use the within-cluster sum of squares (wss) as a measure of the cost. The vector wss, built below, contains the sum of squares for each number of clusters.
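Note that for a single fit this quantity is already reported by the kmeans object itself, so you can inspect it directly before building the full vector (a quick check, reusing the fit object from above):

fit$withinss        # one value per cluster
sum(fit$withinss)   # total within-cluster sum of squares for k = 2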
wss <- (nrow(ex.df) - 1) * sum(apply(ex.df, 2, var))
for (i in 2:3) wss[i] <- sum(kmeans(ex.dist, centers = i)$withinss)
plot(1:3, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

3. Using hierarchical clustering

Lastly, hierarchical clustering can be tried on the toy example. The following command runs Ward's hierarchical clustering algorithm on the distance matrix:

fit.hclust <- hclust(ex.dist, method = "ward.D")

And the following produces a dendrogram of the model:

plot(fit.hclust)
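To turn the dendrogram into flat clusters that can be compared with the kmeans solution, one option is R's cutree function (a minimal sketch: the variable name hclust.groups is ours, and k = 2 simply mirrors the kmeans call above):

hclust.groups <- cutree(fit.hclust, k = 2)  # cut the tree into 2 clusters
table(hclust.groups, fit$cluster)           # cross-tabulate with the kmeans assignment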