Assignment 2: k-Means and DBSCAN clustering

This assignment focuses on two clustering techniques: k-Means and DBSCAN. k-Means is a partitional clustering method. It is one of the most commonly used clustering methods because it is easy to understand and implement. DBSCAN [1] is a density-based clustering method. (The paper is available on the lab course homepage.) This assignment asks you to implement both algorithms and examine their characteristics on two different 4-dimensional data sets. You will implement k-Means and DBSCAN in AmosII.

We have scheduled four labs in a room with a limited number of computers. To avoid overcrowded labs, each group should sign up for one lab. Lab lists will be posted on the board outside 1321.

Assignment

The data sets for assignment 2 reside in the following two files, which are accessible from the lab course homepage:

patterns.nt    a 150*4 matrix where each row contains one pattern
patterns2.nt   a 200*4 matrix where each row contains one pattern

These tasks should be performed on both data sets; see Sect. 3 on how you should report on them.

1. Use principal components analysis (PCA) to project the 4-dimensional pattern vectors onto the first two principal components. PCA lets you view the patterns in two dimensions. This will give you some hints as to how many clusters there are, which in turn will help you decide which parameters to use for the different algorithms. Note that the actual clustering should be done on the four-dimensional data set.

2. Implement k-Means clustering and find a suitable value for k. Remember that the clustering produced by the k-Means algorithm depends on the initial placement of the centroids. It is therefore wise to run the program several times for each k, with different initial centroids. For each k, choose the clustering that gives the lowest mean distance to the centroids.

3. Then implement DBSCAN. Set the MinPts value to 5.
Create a graph of the 5-dist value of the patterns (as described in [1]) and use it to estimate the amount of noise in the data set. Then choose a value of Eps that gives you the estimated amount of noise.

Your assignment is to implement k-Means and DBSCAN in AmosII. Download the skeleton files for the assignment: a2kmeans.osql and a2dbscan.osql. Some pre-defined functions are available in these skeleton files. Follow the instructions and complete, replace, or correct the code where necessary (marked with TODO) according to the function signatures, descriptions, HINTS, and NOTES. You are encouraged to experiment with explicit or implicit options and alternative implementations.

For further reading, see the AmosII manual, which is available on the lab course homepage. We suggest that you read section 2.4.5 about transitive closures and look at the tutorial slides on this subject. You might also find the following sections useful: Collections in 2.6; average, stdev, and count in 2.6.1; groupby in 2.6.2; and aggv.

Validation

At the examination, your implementations will be executed in AmosII using:

< 'a2kmeans.osql';
< 'a2dbscan.osql';

When the execution is done, we expect the following functions, and the sub-functions they call, to be fully and correctly defined:

kmeans(Integer k)->Bag of Vector
disp_clustering3d(Bag of Vector c)->Integer
minimal_kmeans(Integer k, Integer n)->Bag of Vector
dbscan(Number eps, Integer minpts)->Bag of Vector
kdist()->Bag of Number

References

[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pages 226-231, 1996.
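
Illustration (not part of the required AmosII solution)

The restart procedure described in task 2 can be sketched outside AmosII as follows. This is a minimal Python illustration, not the skeleton code and not AmosQL. The Python function names only mirror the handout's names for readability; their parameters and return values (for example the explicit `points` list, the `seed` argument, and the returned tuple) are assumptions made for this sketch and differ from the required AmosII signatures. It assumes Euclidean distance and initial centroids drawn at random from the patterns.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng, max_iter=100):
    """One k-Means run, starting from k patterns chosen at random."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every pattern.
        new_assignment = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no pattern changed cluster, so the run has converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignment


def mean_distance(points, centroids, assignment):
    """Mean distance from each pattern to the centroid of its cluster."""
    total = sum(distance(p, centroids[a]) for p, a in zip(points, assignment))
    return total / len(points)


def minimal_kmeans(points, k, n, seed=0):
    """Run k-Means n times and keep the run with the lowest mean distance."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    best = None
    for _ in range(n):
        centroids, assignment = kmeans(points, k, rng)
        score = mean_distance(points, centroids, assignment)
        if best is None or score < best[0]:
            best = (score, centroids, assignment)
    return best
```

Calling `minimal_kmeans(points, k, n)` with `points` given as a list of 4-tuples returns the best mean distance together with the corresponding centroids and cluster indices. Comparing that mean distance across several values of k is one way to approach the choice of k asked for in task 2.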