CS 401R Class Group Activity
Data Clustering
To give you hands-on experience, you will do this activity using the R software, as it is one of
the richest and most versatile free tools for statistical analysis and data mining. Some information
is given here to get you started. There are also a number of online resources for R that you can
refer to as you complete this activity and beyond.
You should have done the following prior to coming to class. If not, please do so quickly.
1. Download and install the latest version of R on your computer. The following is a direct
link: https://www.r-project.org.
Note that there also exists a rather nice IDE for R known as RStudio (https://www.rstudio.com).
While not necessary for this activity, you may still wish to download it and use it. Similarly, there is
a platform known as Revolution R (http://www.revolutionanalytics.com/products) that extends R to
run on big data in parallel and distributed environments. It is not needed for this activity, but again
you may wish to have a look.
Complete the following activities as a group.
2. Install the following packages: datasets, stats, animation, and dbscan.
To do so, click on Packages & Data in the menu bar, and select Package Installer. Click on Get
List. A list of packages will then appear. Click on the above packages in the list (if they are not
there, they were probably installed by default along with R, as is the case for datasets and stats;
skip to the next step). Make sure you tick the Install Dependencies box before you click Install
Selected. Note that you can get the same result by typing:
install.packages("packagename", dependencies=TRUE) at the R prompt. If you use RStudio, package
installation is under the Tools menu.
3. After the packages have been installed, click on Packages & Data again in the menu bar,
and this time select Package Manager. Click on the above packages so they get loaded.
Note that it is possible to get the same result by typing: library(packagename) at the R
prompt.
Details on the various functions implemented in each package, as well as examples of usage, may be
found in the Package Manager by selecting the package of interest. Note that you can get the same
result by typing: help(packagename) at the R prompt, or help(package="packagename") for an index of
the functions in that package.
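Steps 2 and 3 can also be done entirely from the R prompt. The following is a minimal sketch, not the only way to do it. The variable names wanted and to_install are illustrative, and the install call is left commented out because it needs an internet connection.

```r
# Packages named in step 2. datasets and stats ship with R itself,
# so only animation and dbscan can actually be missing.
wanted <- c("datasets", "stats", "animation", "dbscan")

# Which of them are not yet installed on this machine?
to_install <- setdiff(wanted, rownames(installed.packages()))
print(to_install)

# Uncomment to install whatever is missing (requires internet access).
# install.packages(to_install, dependencies = TRUE)

# Load every package that is available. library() needs
# character.only = TRUE when the name is held in a variable.
for (pkg in setdiff(wanted, to_install)) {
  library(pkg, character.only = TRUE)
}
```

After this runs, search() lists the loaded packages as entries such as "package:stats".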
4. Download the following file: http://dml.cs.byu.edu/~cgc/docs/CS401R/DS1.csv, and load it
into R using the following command.
ds <- read.csv("pathname/DS1.csv")
This places the content of the DS1.csv file in a variable called ds that you can now use.
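Once a file is loaded, it is worth checking what read.csv() produced. The sketch below writes a small stand-in file to a temporary location so that it runs without DS1.csv. The column names x and y and the values are assumptions for illustration, not the contents of DS1.csv.

```r
# Stand-in for DS1.csv: a tiny two-column file with made-up values.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1.0,2.0", "1.5,1.8", "8.0,8.2"), tmp)

# read.csv() returns a data frame; by default the first line is
# treated as the header row.
ds_demo <- read.csv(tmp)

str(ds_demo)    # column names and types
dim(ds_demo)    # number of rows and columns
head(ds_demo)   # first few rows
```

The same three inspection calls can be applied to ds. If a file has no header row, read.csv(..., header=FALSE) keeps the first line as data.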
5. Cluster the data in ds, as follows.
a. Use the k-means algorithm with k=3
i. The k-means algorithm in R is known as kmeans(). The animation package
offers a nice animation for k-means (for 2-dimensional data only), which can
be run with: kmeans.ani(ds,3).
ii. Run it a few times and observe what is happening.
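For a non-animated run, kmeans() from the stats package can be called directly. The sketch below uses synthetic two-dimensional data (an assumption, standing in for ds) and hints at why repeated runs can differ: the initial centers are chosen at random, and the nstart argument controls how many random starts are tried.

```r
set.seed(401)  # makes the synthetic data reproducible

# Three well-separated blobs of 50 points each, standing in for ds.
centers <- rbind(c(0, 0), c(5, 5), c(0, 5))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(x = rnorm(50, centers[i, 1], 0.5),
        y = rnorm(50, centers[i, 2], 0.5))
}))

# k-means with k = 3; nstart = 10 keeps the best of 10 random starts.
km <- kmeans(toy, centers = 3, nstart = 10)

table(km$cluster)   # cluster sizes
km$centers          # estimated cluster centers
km$tot.withinss     # total within-cluster sum of squares

# Color each point by its assigned cluster and mark the centers.
plot(toy, col = km$cluster, pch = 19)
points(km$centers, pch = 8, cex = 2)
```

Setting nstart = 1 and rerunning without a fixed seed is one way to see the run-to-run variation that the animation shows.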
b. Use the hierarchical agglomerative clustering algorithm
i. The HAC algorithm in R is known as hclust(). You can find out more about it
by typing: help(hclust). You will note that the input to hclust must be a
distance matrix. However, ds is simply a list of data points. It is possible to
produce the corresponding distance matrix, ds1, from ds, by typing: ds1 <-
dist(ds). You may now use hclust on ds1. Be sure to store the result in some
variable, e.g., cl <- hclust(ds1, …).
ii. Note that R includes a nice plot() function that can be used to display most
data types. For example, you can look at the result of HAC by typing:
plot(cl).
iii. Try hclust with various linkage techniques (set through the method argument,
e.g., method="single", "complete", or "average") and observe the results.
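The HAC steps above can be sketched as follows. The data are synthetic (an assumption, standing in for ds), and cutree() is shown as one way to turn a dendrogram into flat cluster labels, which the activity itself does not require.

```r
set.seed(401)

# Two well-separated blobs of 20 points each, standing in for ds.
toy <- rbind(
  cbind(rnorm(20, 0, 0.5), rnorm(20, 0, 0.5)),
  cbind(rnorm(20, 5, 0.5), rnorm(20, 5, 0.5))
)

d <- dist(toy)   # pairwise distances, Euclidean by default

# Run HAC with several linkage methods and draw each dendrogram.
for (m in c("single", "complete", "average")) {
  hc <- hclust(d, method = m)
  plot(hc, main = paste("Linkage:", m), labels = FALSE)
}

# cutree() cuts the tree so that exactly k clusters remain.
hc <- hclust(d, method = "average")
labels2 <- cutree(hc, k = 2)
table(labels2)
```

On data this cleanly separated the linkage methods tend to agree; differences usually show up on elongated or noisy clusters.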
c. Use the dbscan algorithm
i. The dbscan algorithm in R is known as dbscan(). You can find out more about
it by typing: help(dbscan). You will note that the input to dbscan must be a
matrix. However, ds is simply a list of data points. It is possible to produce
the corresponding matrix, ds2, from ds, by typing: ds2 <- as.matrix(ds). You
may now use dbscan on ds2. Be sure to store the result in some variable, e.g.,
cl1 <- dbscan(ds2, …).
ii. As before, the plot() function may come in handy. In this case, however, you
cannot just plot(cl1) since the result is not a dendrogram, but a list of cluster
assignments for each point in ds2. So, you need to specify both the original
data and the cluster assignments to the plot() function for it to display what
you expect. This is done by typing: plot(ds2,col=cl1$cluster).
iii. Try dbscan with various values of eps and minPts, and observe the results.
For example, try eps=0.5 with minPts=5, and then eps=3 with minPts=8.
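A parameter sweep like the one described above might look as follows. This is a sketch on synthetic data (an assumption, standing in for ds2), and the eps values are chosen for that data rather than for DS1.csv. The code is skipped if the dbscan package is not installed. Note that dbscan labels noise points as cluster 0, and color 0 is the background color in base R graphics, so the sketch plots with cl$cluster + 1 to keep noise points visible.

```r
set.seed(401)

# Two tight blobs plus a few scattered points, standing in for ds2.
toy <- rbind(
  cbind(rnorm(50, 0, 0.3), rnorm(50, 0, 0.3)),
  cbind(rnorm(50, 5, 0.3), rnorm(50, 5, 0.3)),
  cbind(runif(5, -3, 8), runif(5, -3, 8))
)

if (requireNamespace("dbscan", quietly = TRUE)) {
  for (eps in c(0.2, 0.5, 3)) {
    cl <- dbscan::dbscan(toy, eps = eps, minPts = 5)

    # max(cl$cluster) is the number of clusters found (0 if all noise).
    cat("eps =", eps, ": clusters =", max(cl$cluster),
        ", noise points =", sum(cl$cluster == 0), "\n")

    # Shift labels by 1 so that noise (cluster 0) is drawn in color 1.
    plot(toy, col = cl$cluster + 1, pch = 19, main = paste("eps =", eps))
  }
}
```

Small eps values tend to break clusters apart or mark many points as noise, while large values tend to merge everything into one cluster.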
6. Based on the data (you can use plot(ds) to look at it), and your understanding of the different
clustering algorithms, discuss your findings.
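When comparing algorithms, a cross-tabulation of two sets of cluster labels can support the discussion. The sketch below is one possible approach on synthetic data (an assumption, standing in for ds); the names km_labels, hac_labels, and agreement are illustrative.

```r
set.seed(401)

# Two well-separated blobs of 30 points each, standing in for ds.
toy <- rbind(
  cbind(rnorm(30, 0, 0.5), rnorm(30, 0, 0.5)),
  cbind(rnorm(30, 5, 0.5), rnorm(30, 5, 0.5))
)

km_labels  <- kmeans(toy, centers = 2, nstart = 10)$cluster
hac_labels <- cutree(hclust(dist(toy), method = "complete"), k = 2)

# Rows are k-means clusters, columns are HAC clusters. Cluster numbers
# are arbitrary, so look at the pattern of counts, not the diagonal.
agreement <- table(kmeans = km_labels, hac = hac_labels)
print(agreement)
```

If the two algorithms found the same grouping, each row of the table has a single non-zero cell.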
7. Repeat step 5 with the file: http://dml.cs.byu.edu/~cgc/docs/CS401R/DS2.csv