CE417 - Data Mining Course
Homework Assignment #3
Fall 2007
1. Briefly outline how to compute the dissimilarity between objects described by the following types
of variables:
a. Numerical (interval-scaled) variables
b. Asymmetric binary variables
c. Categorical variables
d. Ratio-scaled variables
e. Nonmetric vector objects
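For orientation, the following minimal Python sketch shows one common dissimilarity measure per variable type. It is an illustration under stated assumptions, not the only acceptable answer: the function names are hypothetical, interval-scaled values are assumed to be standardized already, ratio-scaled values are assumed to be strictly positive and are handled by a logarithmic transformation (one of several possible treatments), and cosine dissimilarity is used as one common choice for nonmetric vector objects.

```python
import math

def interval_dissimilarity(x, y):
    """Euclidean distance between two interval-scaled vectors (assumed standardized)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def asymmetric_binary_dissimilarity(x, y):
    """(r + s) / (q + r + s): negative matches (0, 0) are ignored."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both 1
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # 1 in x only
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # 1 in y only
    denom = q + r + s
    return (r + s) / denom if denom else 0.0

def categorical_dissimilarity(x, y):
    """Simple matching: fraction of attributes whose categories differ."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)  # number of matches
    return (p - m) / p

def ratio_scaled_dissimilarity(x, y):
    """Log-transform positive ratio-scaled values, then treat them as interval-scaled."""
    return interval_dissimilarity([math.log(a) for a in x],
                                  [math.log(b) for b in y])

def cosine_dissimilarity(x, y):
    """1 - cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

if __name__ == "__main__":
    print(interval_dissimilarity([0, 0], [3, 4]))
    print(asymmetric_binary_dissimilarity([1, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
    print(categorical_dissimilarity(["red", "small", "round"], ["red", "large", "round"]))
```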
2. Briefly describe and give examples in each case for the following approaches to clustering:
a. Partitioning methods
b. Hierarchical methods
c. Density-based methods
d. Grid-based methods
e. Model-based methods
f. Methods for high-dimensional data
g. Constraint-based methods
3. Both the k-means and k-medoids algorithms can perform effective clustering. Illustrate the strengths
and weaknesses of k-means in comparison with the k-medoids algorithm. Also illustrate the
strengths and weaknesses of these schemes in comparison with a hierarchical clustering scheme
(such as AGNES).
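One difference between the two representatives can be seen on a toy example. The sketch below is a minimal Python illustration with made-up one-dimensional data and hypothetical function names: it compares the mean used by k-means with the medoid used by k-medoids for a single cluster that contains one extreme value.

```python
def mean_center(points):
    """Cluster representative used by k-means: the arithmetic mean."""
    return sum(points) / len(points)

def medoid(points):
    """Cluster representative used by k-medoids: the member point that
    minimizes the total absolute distance to all other members."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))

# Made-up one-dimensional cluster with a single extreme value.
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

print(mean_center(cluster))  # 22.0
print(medoid(cluster))       # 3.0
```

The mean costs one pass over the cluster, whereas this naive medoid search compares every pair of members, which hints at the trade-off between robustness and computational cost.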
4. Describe each of the following clustering algorithms in terms of the following criteria: (i) shapes
of clusters that can be determined; (ii) input parameters that must be specified; and (iii)
limitations.
a. k-means
b. k-medoids
c. CLARA
d. BIRCH
e. ROCK
f. Chameleon
g. DBSCAN
5. Design a privacy-preserving clustering method so that a data owner would be able to ask a third
party to mine the data for quality clustering without worrying about the potential inappropriate
disclosure of certain private or sensitive information stored in the data.
6. Local outlier factor (LOF) is an interesting notion for the discovery of local outliers in an
environment where data objects are distributed rather unevenly. However, its performance should
be further improved in order to efficiently discover local outliers. Can you propose an efficient
method for effective discovery of local outliers in large data sets?
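As a baseline for comparison, the standard LOF definition can be written down directly. The following Python sketch is a brute-force version that computes all pairwise distances, so it is a reference for the definition rather than the efficient method the question asks for. The function name `lof_scores` is hypothetical, Euclidean distance is assumed, and the data points are assumed to be distinct.

```python
import math

def lof_scores(points, k):
    """Brute-force local outlier factor for every point (quadratic in the number of points)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist = []     # distance from each point to its k-th nearest neighbor
    neighbors = []  # k-distance neighborhood (may hold more than k points on ties)
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neighbors.append([j for d, j in others if d <= kd])

    def local_reachability_density(i):
        # reach-dist(i, j) = max(k-distance(j), d(i, j))
        total = sum(max(k_dist[j], dist[i][j]) for j in neighbors[i])
        return len(neighbors[i]) / total

    lrd = [local_reachability_density(i) for i in range(n)]

    # LOF(i) = average of lrd(j) / lrd(i) over the neighborhood of i
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    for point, score in zip(data, lof_scores(data, k=2)):
        print(point, round(score, 3))
```

Points inside a uniformly dense region score close to 1, while a point whose neighborhood is much sparser than those of its neighbors scores well above 1.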
7. (Extra Credit) This question focuses on two clustering techniques, k-means and hierarchical
clustering, in MATLAB. The data sets can be found in the file assignment3.mat, which is accessible
through the course’s assignments page. When you load this file in MATLAB, you will find two
matrices:
a. Patterns: a 150×4 matrix in which each row contains one pattern
b. Patterns2: a 200×4 matrix in which each row contains one pattern
Run the k-means and hierarchical clustering algorithms. Remember that the clustering produced by
the k-means algorithm depends on the initial placement of the cluster centers, so it is wise to
make several runs for each k and choose the clustering that gives the lowest mean distance to
the cluster center. Find a suitable value for k. Explain the reasoning behind your choice of
parameters, in particular how you chose the values K1, K2 for k-means. Plot figures using
different colors or different markers to show which cluster each data point belongs to. Explain the
differences between the two data sets based on the results of the clustering. Finally, suggest
how a k-dist graph can be used to remove noise when the k-means algorithm is used.
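The restart procedure described above can be sketched as follows. This is a minimal Python illustration rather than MATLAB code, and it makes several assumptions: the function names are hypothetical, distances are Euclidean, initial centers are drawn at random from the data points, and a small made-up data set stands in for the matrices in assignment3.mat. The last function computes the sorted k-dist values that a k-dist graph plots.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's k-means starting from k randomly chosen data points."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the run has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    mean_dist = sum(math.dist(p, centers[l])
                    for p, l in zip(points, labels)) / len(points)
    return labels, centers, mean_dist

def best_kmeans(points, k, n_runs=10, seed=0):
    """Repeat k-means n_runs times and keep the run with the lowest
    mean distance to the assigned cluster center."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(n_runs)),
               key=lambda run: run[2])

def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted in
    descending order (the values plotted in a k-dist graph)."""
    values = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(d[k - 1])
    return sorted(values, reverse=True)

if __name__ == "__main__":
    # Made-up data: two well-separated groups of four points each.
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    labels, centers, mean_dist = best_kmeans(data, k=2)
    print(labels, centers, mean_dist)
    print(k_dist(data, k=2))
```

Repeating `best_kmeans` for several values of k and comparing the resulting mean distances is one way to support a choice of k; the mean distance alone always decreases as k grows, so it needs to be read for where the improvement levels off.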
Notes:
• All homework must be solved and written independently. If you use someone else’s work, including books,
papers, or any other material, you must acknowledge it and cite those resources directly at every
place in your document where they are used.
• You should submit your solutions in PDF format to [email protected] before the 25th of Azar. The
subject of the email should conform to the following format:
[DMC][HW3][your student number(s)]
For example: [DMC][HW3][87777777-86666666]
• Your email should have one PDF attachment that contains your solutions. The name of the file should be
your student number and the file should reflect your full name. You should also deliver a hard copy of your
solutions to Dr. Abolhassani in the first session of the class after the deadline.