Data Mining Project
Part II: Clustering and Classification
Task 1: Cluster your data
Description
The goal of this task is to choose a clustering algorithm, implement it, and then test it
on a real dataset from http://archive.ics.uci.edu/ml/. You can choose one of the
following two topics, or propose your own topic:
1) Topic 1: Partitioning and Density-based Clustering
Implement K-means and DBSCAN, and examine their characteristics. References for
both algorithms can be found on the course web page. Address the problem of noise
removal when the K-means algorithm is used. Report all the results and discuss them.
2) Topic 2: Hierarchical Clustering
Implement the BIRCH algorithm and examine its characteristics. The reference for
the algorithm can be found on the course web page. In this task you should address the
problem of noise removal. You are free to choose a contrasting hierarchical
clustering algorithm to implement alongside it. If the group consists of a single
student, implementing the BIRCH algorithm alone is enough.
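As an illustration of the noise-removal question raised in Topic 1, the following is a minimal sketch of K-means (Lloyd's algorithm) followed by a simple distance-based filter. The filter rule (drop points farther from their centroid than `factor` times the mean distance within their cluster), the function names, and the `init` and `factor` parameters are assumptions made for this example; they are not prescribed by the assignment or by the course references.

```python
import math
import random


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, init=None, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    `init` optionally fixes the starting centroids (hypothetical parameter,
    handy for reproducible experiments); otherwise k points are sampled.
    """
    centroids = list(init) if init else random.Random(seed).sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the centroid to the coordinate-wise mean of its members.
                new_centroids.append(
                    tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                # Keep the old centroid if the cluster became empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assign(points, centroids)


def remove_noise(points, centroids, labels, factor=2.0):
    """Split points into (kept, noise) using an assumed heuristic:
    a point is noise if its distance to its centroid exceeds
    `factor` times the mean distance within its cluster."""
    dists = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
    kept, noise = [], []
    for p, lab, d in zip(points, labels, dists):
        cluster = [x for x, l in zip(dists, labels) if l == lab]
        mean_dist = sum(cluster) / len(cluster)
        (noise if d > factor * mean_dist else kept).append(p)
    return kept, noise
```

One possible workflow is to run `kmeans`, call `remove_noise` on the result, and then run `kmeans` again on the kept points, comparing the centroids before and after. Note that a single far-away point that ends up alone in its own cluster is not detected by this heuristic, which is one of the characteristics worth discussing in the report.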
Deliverable:
1- A report of at most 3 pages explaining the findings
2- The source code of the algorithms
Deadline: 21 December 2012
Task 2: Classify your data
This task consists of analyzing the behavior of different classification algorithms
using the dataset from the first task, if it is labeled. Otherwise, choose a dataset that
is suitable for classification. To analyze the behavior of classification algorithms, you
should use Weka, a data mining software package (http://www.cs.waikato.ac.nz/ml/weka/),
where you can find documentation, jar packages, and the source code. You can also
download the Weka application to explore its features through the graphical interface.
Note that the software is already installed on your working machines. Weka takes datasets
as input and supports a large variety of data mining techniques. The only requirement is
that your dataset be in the ARFF file format. If your dataset is not in that
format, write a program that converts it to ARFF. You can find information
about this format at http://www.cs.waikato.ac.nz/~ml/weka/arff.html
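A conversion program of the kind described above could look like the following minimal sketch, which turns CSV text with a header row into ARFF text. It assumes that attribute names and values contain no spaces, commas, or quotes, that missing values are already written as `?`, and that a column is numeric whenever all of its non-missing values parse as numbers; other columns are declared as nominal. The function names and the relation name used below are hypothetical.

```python
import csv
import io


def is_number(value):
    """True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = attribute names) to ARFF text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        # Ignore missing values ("?") when inferring the attribute type.
        column = [row[i] for row in data if row[i] != "?"]
        if all(is_number(v) for v in column):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in sorted order.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(column)))))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(row) for row in data)
    return "\n".join(lines) + "\n"
```

For a two-column input with a numeric column and a class column, the result contains one `@attribute ... numeric` line, one `@attribute ... {...}` line listing the class values, and the rows unchanged under `@data`. Values that need quoting, dates, and string attributes are not handled by this sketch.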
After getting familiar with Weka, choose a set of classification algorithms, apply
them to your dataset, and observe their impact on the results. For example, in the
decision tree algorithm J48, you can use different values of the confidence factor
parameter, which determines the amount of pruning. For logistic regression, you
can vary the ridge parameter. For the KNN classifier, you can vary the value of the
parameter K. Additionally, you can choose a subset of attributes or construct new
attributes and see the impact on the final results. You can also vary the sizes of the
training and test datasets. Based on these experiments, which choices lead to the best
results?
1- Which settings did you try?
2- What are your best results?
3- How reliable are the results?
4- What evaluation measure did you use to compare the results and why?
5- Justify any decision you make regarding the choice of the parameters or the
selection
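To make the questions about reliability and evaluation measures concrete, here is a minimal sketch of k-fold cross-validation for a nearest-neighbor classifier, written in plain Python rather than with Weka. It is only meant to show what "accuracy averaged over folds, with its spread" means when the parameter K is varied; the function names are hypothetical, accuracy is just one possible measure, and the data is assumed to be shuffled beforehand so that every fold contains a mix of classes.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training items nearest to `query`.

    `train` is a list of (features, label) pairs with numeric features.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(data, k, folds=5):
    """Return (mean, standard deviation) of accuracy over `folds` folds.

    Item i goes to fold i % folds, so `data` should be shuffled first.
    """
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [item for i, item in enumerate(data) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    mean = sum(scores) / folds
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / folds)
    return mean, std
```

Calling `cross_val_accuracy(data, k)` for several values of `k` and comparing the means together with their standard deviations gives a rough sense of whether a difference between two settings is larger than the fold-to-fold variation. In Weka itself, the same idea corresponds to evaluating a classifier with the cross-validation test option instead of a single train/test split.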
Deliverable:
1- A report of at most 3 pages explaining the findings
2- The source code of the algorithms
Deadline: 15 January 2013