CE417 - Data Mining Course
Homework Assignment #3
Fall 2007

1. Briefly outline how to compute the dissimilarity between objects described by the following types of variables:
   a. Numerical (interval-scaled) variables
   b. Asymmetric binary variables
   c. Categorical variables
   d. Ratio-scaled variables
   e. Nonmetric vector objects

2. Briefly describe and give examples in each case for the following approaches to clustering:
   a. Partitioning methods
   b. Hierarchical methods
   c. Density-based methods
   d. Grid-based methods
   e. Model-based methods
   f. Methods for high-dimensional data
   g. Constraint-based methods

3. Both k-means and k-medoids algorithms can perform effective clustering. Illustrate the strength and weakness of k-means in comparison with the k-medoids algorithm. Also, illustrate the strength and weakness of these schemes in comparison with a hierarchical clustering scheme (such as AGNES).

4. Describe each of the following clustering algorithms in terms of the following criteria: (i) shapes of clusters that can be determined; (ii) input parameters that must be specified; and (iii) limitations.
   a. k-means
   b. k-medoids
   c. CLARA
   d. BIRCH
   e. ROCK
   f. Chameleon
   g. DBSCAN

5. Design a privacy-preserving clustering method so that a data owner would be able to ask a third party to mine the data for quality clustering without worrying about the potential inappropriate disclosure of certain private or sensitive information stored in the data.

6. Local outlier factor (LOF) is an interesting notion for the discovery of local outliers in an environment where data objects are distributed rather unevenly. However, its performance should be further improved in order to efficiently discover local outliers. Can you propose an efficient method for effective discovery of local outliers in large data sets?

7. (Extra Credit) This question focuses on two clustering techniques: k-means and hierarchical clustering in MATLAB.
   The data sets can be found in the file assignment3.mat, which is accessible through the course's assignments page. When you load this file in MATLAB, you will find two matrices:
   a. Patterns: a 150 × 4 matrix in which each row contains one pattern
   b. Patterns2: a 200 × 4 matrix in which each row contains one pattern

   Run the k-means and hierarchical clustering algorithms on both data sets. Remember that the clustering produced by the k-means algorithm depends on the initial placement of the cluster centers, so it is wise to make several runs for each k and choose the clustering that gives the lowest mean distance to the cluster centers. Find a suitable value for k. Explain the reasoning behind your choice of parameters; in particular, explain how you have chosen the values K1, K2 for k-means. Plot figures using different colors or different markers to show which cluster each data point belongs to. Explain the differences between the two data sets based on the results of the clustering. Finally, suggest how a k-dist graph can be used to remove noise when the k-means algorithm is used.

Notes:
• All homework must be solved and written independently. If you use someone else's work, including books, papers or any other material, you have to acknowledge it and directly cite those resources in every place in your document where they are used.
• You should submit your solutions in PDF format to [email protected] before the 25th of Azar. The subject of the email should conform to the following format: [DMC][HW3][your student number(s)]. For example: [DMC][HW3][87777777-86666666]. Your email should have one PDF attachment that contains your solutions. The name of the file should be your student number and the file should reflect your full name.
• You should also deliver a hard copy of your solutions to Dr. Abolhassani in the first session of the class after the deadline.
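
Illustration for question 7 (a sketch only, not a required or reference solution): the advice to make several k-means runs for each k and keep the clustering with the lowest mean distance to the cluster centers could look roughly like the outline below. It is written in Python with the standard library rather than MATLAB, and every name in it (`kmeans_once`, `best_of_restarts`, `mean_distance`, the `toy` data) is hypothetical. It does not load assignment3.mat; the six 4-dimensional points are made up purely to show the mechanics.

```python
import math
import random


def mean_distance(points, centers, labels):
    """Mean Euclidean distance from each point to its assigned center."""
    total = sum(math.dist(p, centers[l]) for p, l in zip(points, labels))
    return total / len(points)


def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run (Lloyd's iterations) from randomly chosen initial centers."""
    centers = rng.sample(points, k)  # initial centers are k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Repeat k-means and keep the run with the lowest mean distance to center."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        centers, labels = kmeans_once(points, k, rng)
        score = mean_distance(points, centers, labels)
        if best is None or score < best[0]:
            best = (score, centers, labels)
    return best


# Made-up toy data: two well-separated groups of 4-dimensional patterns.
toy = [
    (0.0, 0.0, 0.0, 0.0), (0.1, 0.0, 0.1, 0.0), (0.0, 0.1, 0.0, 0.1),
    (5.0, 5.0, 5.0, 5.0), (5.1, 5.0, 5.1, 5.0), (5.0, 5.1, 5.0, 5.1),
]
score, centers, labels = best_of_restarts(toy, k=2)
print(labels, round(score, 3))
```

Repeating `best_of_restarts` for a range of k values and comparing the resulting scores is one possible way to support a choice of k; the number of restarts and the seed used here are arbitrary.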