Assignment 2: k-Means and DBSCAN clustering
This assignment focuses on two clustering techniques: k-Means and DBSCAN. k-Means is a partitional clustering method; it is one of the most commonly used clustering methods because it is easy to understand and implement. DBSCAN [1] is a density-based clustering method. (The paper is available on the lab course homepage.) You will implement both algorithms in AmosII and examine their characteristics on two different 4-dimensional data sets.
We have scheduled four labs in a room with a limited number of computers. To avoid overcrowded labs,
each group should sign up for one lab. Lab lists will be posted on the board outside 1321.
Assignment
The data sets for assignment 2 reside in the following two files, which are accessible from the lab course homepage:
- patterns.nt: a 150×4 matrix where each row contains one pattern
- patterns2.nt: a 200×4 matrix where each row contains one pattern
The following tasks should be performed on both data sets; see Sect. 3 for how you should report on them.
1. Use principal component analysis (PCA) to project the 4-dimensional pattern vectors onto the first two principal components. PCA lets you view the patterns in two dimensions, which gives you some hints as to how many clusters there are; this in turn is useful when deciding which parameters to use for the different algorithms. Note that the actual clustering should be done on the four-dimensional data set. (A sketch of the projection is given after this task list.)
2. Implement k-Means clustering and find a suitable value for k. Remember that the clustering produced by the k-Means algorithm depends on the initial placement of the centroids. It is therefore wise to run the program several times for each k, with different initial centroids, and for each k choose the clustering that gives the lowest mean distance to the centroids. (See the restart sketch after this list.)
3. Implement DBSCAN. Set the MinPts value to 5. Create a graph of the 5-dist values of the patterns (as described in [1]) and use it to estimate the amount of noise in the data set. Then choose an Eps value that gives you that amount of noise. (See the k-dist and DBSCAN sketch after this list.)
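
For task 1, the following is a minimal sketch of the PCA projection in Python/NumPy, given only as an illustration; your actual implementation belongs in AmosII. The function name pca_project is ours, and the random matrix merely stands in for the data read from patterns.nt.

import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first n_components principal components."""
    Xc = X - X.mean(axis=0)                   # center each dimension
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T           # coordinates in the PC basis

X = np.random.rand(150, 4)                    # stand-in for the 150×4 patterns.nt matrix
P = pca_project(X)                            # shape (150, 2); plot to count apparent clusters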
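For task 2, here is a sketch of k-Means with restarts, again as illustrative Python rather than the required AmosII code. Each run starts from different randomly chosen centroids, and the run with the lowest mean distance to the centroids is kept, as the task asks. The helper names kmeans and best_kmeans are ours and only mirror the required AmosII functions.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, mean distance to centroid)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]       # random initial centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)   # pattern-to-centroid distances
        labels = d.argmin(axis=1)                                  # assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])      # recompute centroids
        if np.allclose(new, centroids):
            break                                                  # converged
        centroids = new
    mean_dist = d[np.arange(len(X)), labels].mean()
    return centroids, labels, mean_dist

def best_kmeans(X, k, n_runs=10):
    """Run k-Means n_runs times; keep the run with the lowest mean distance."""
    return min((kmeans(X, k, seed=s) for s in range(n_runs)), key=lambda r: r[2])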
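For task 3, a sketch of the k-dist graph and of DBSCAN itself, once more as illustrative Python under the same caveat. The sorted k-dist curve is read as in [1]: choose Eps near the "knee" of the curve; points whose k-dist exceeds that Eps are the noise candidates. These Python functions only mirror the required AmosII kdist and dbscan.

import numpy as np

def kdist(X, k=5):
    """k-dist of every pattern (distance to its k-th nearest neighbour), sorted descending."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)   # full distance matrix
    d.sort(axis=1)                                     # column 0 is the distance to self
    return np.sort(d[:, k])[::-1]                      # plot this curve to pick Eps

def dbscan(X, eps, minpts):
    """Cluster id per pattern; -1 marks noise."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    labels = np.full(len(X), -1)
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1:
            continue                                   # already assigned to a cluster
        neigh = np.flatnonzero(d[i] <= eps)
        if len(neigh) < minpts:
            continue                                   # not a core point; may remain noise
        labels[i] = cluster                            # start a new cluster at core point i
        seeds = list(neigh)
        while seeds:                                   # expand the cluster
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                jn = np.flatnonzero(d[j] <= eps)
                if len(jn) >= minpts:                  # j is also core: expand through it
                    seeds.extend(jn)
        cluster += 1
    return labels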
Your assignment is to implement k-Means and DBSCAN in AmosII. Download the skeleton files for the assignment: a2kmeans.osql and a2dbscan.osql. Some pre-defined functions are available in these skeleton files. Follow the instructions and complete, replace, or correct the code where necessary (marked with TODO) according to the function signatures, descriptions, HINTS, and NOTES. You are encouraged to experiment with explicit or implicit options and alternative implementations.
For further reading you are referred to the AmosII manual, which is available on the lab course home page. We suggest that you read Section 2.4.5 on transitive closures, and also look at the tutorial slides on this subject. You might also find the following sections useful: collections in Section 2.6; average, stdev, and count in Section 2.6.1; groupby in Section 2.6.2; and aggv.
Validation
At the examination, your implementations will be executed in AmosII using:
< 'a2kmeans.osql';
< 'a2dbscan.osql';
When the execution is done, we expect the following functions and the sub-functions they call to be fully
and correctly defined:
- kmeans(Integer k)->Bag of Vector
- disp_clustering3d(Bag of Vector c)->Integer
- minimal_kmeans(Integer k, Integer n)->Bag of Vector
- dbscan(Number eps, Integer minpts)->Bag of Vector
- kdist()->Bag of Number
References
[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in
Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge
Discovery and Data Mining, pages 226-231, 1996.