Data Mining Project Part II: Clustering and Classification

Task 1: Cluster your data

Description

The goal of this task is to choose a clustering algorithm, implement it, and then test it on a real dataset from http://archive.ics.uci.edu/ml/. You can choose one of the two following topics, or choose your own topic:

1) Topic 1: Partitioning and Density-based Clustering
Implement K-means and DBSCAN, and examine their characteristics. The references for both algorithms can be found on the course web page. Address the problem of noise removal when the k-means algorithm is used. Report all the results and discuss them.

2) Topic 2: Hierarchical Clustering
Implement the BIRCH algorithm and examine its characteristics. The reference for the algorithm can be found on the course web page. In this task you should address the problem of noise removal. You are free to choose which contrasting hierarchical clustering algorithm to implement alongside it. If the group consists of only one student, implementing the BIRCH algorithm alone is enough.

Deliverable:
1- A report of at most 3 pages explaining the findings
2- The source code of the algorithms

Deadline: 21 December 2012

Task 2: Classify your data

This task consists of analyzing the behavior of different classification algorithms using the dataset of the first task, if it is labeled. Otherwise, choose a dataset that is suitable for classification.

To analyze the behavior of classification algorithms, you should use Weka, a data mining software package available at http://www.cs.waikato.ac.nz/ml/weka/, where you can find documentation, jar packages, and the source code. You can also download the Weka application to explore its different services visually. Note that the software is already installed on your working machines.

Weka takes datasets as input and performs a large variety of data mining techniques. The only requirement is that your dataset be in the ARFF file format. If your dataset is not in that format, write a program that converts it to ARFF.
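As an illustration of the conversion program mentioned above, here is a minimal sketch that turns CSV text into ARFF. It assumes the first CSV row holds the attribute names, treats a column as numeric when every non-missing value parses as a number, and otherwise declares it nominal. The function name `csv_to_arff` and the sample data are hypothetical, and the sketch ignores string and date attributes, so check its output against the ARFF documentation before relying on it.

```python
import csv
import io


def _is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def _quote(value):
    """Single-quote a name or value that contains characters special to ARFF."""
    if value == "" or any(ch in value for ch in " ,{}%'\"\t"):
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"
    return value


def csv_to_arff(csv_text, relation="dataset"):
    """Convert CSV text (header row first) into ARFF text.

    Assumption: empty cells and "?" both mean "missing value".
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + _quote(relation), ""]
    for col, name in enumerate(header):
        values = [row[col] for row in data if row[col] not in ("", "?")]
        if not values:
            attr_type = "string"  # nothing to infer from, fall back to string
        elif all(_is_number(v) for v in values):
            attr_type = "numeric"
        else:
            # Nominal attribute: list the distinct values in sorted order.
            attr_type = "{" + ",".join(_quote(v) for v in sorted(set(values))) + "}"
        lines.append("@attribute " + _quote(name) + " " + attr_type)
    lines += ["", "@data"]
    for row in data:
        lines.append(",".join("?" if v in ("", "?") else _quote(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical sample: two attributes, one missing numeric value.
sample = "sepal_length,species\n5.1,setosa\n,virginica\n"
print(csv_to_arff(sample, relation="iris_sample"))
```

Inferring attribute types from the data keeps the sketch short, but for a real dataset it is usually safer to declare the types explicitly, since a numeric-looking column may in fact be a categorical code.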
You can find information about this format at http://www.cs.waikato.ac.nz/~ml/weka/arff.html

After getting familiar with Weka, choose a set of classification algorithms, apply them to your dataset, and observe their impact on the results. For example, in the decision tree algorithm J48, you can use different values of the confidence factor parameter, which determines the amount of pruning. For logistic regression, you can vary the ridge parameter. For the KNN classifier, you can vary the value of the parameter K. Additionally, you can choose a subset of attributes or construct new attributes and see the impact on the final results. You can also vary the sizes of the training and test datasets.

Based on these experiments, which choices lead to the best results?
1- Which settings did you try?
2- What are your best results?
3- How reliable are the results?
4- What evaluation measure did you use to compare the results and why?
5- Justify any decision you make regarding the choice of the parameters or the selection

Deliverable:
1- A report of at most 3 pages explaining the findings
2- The source code of the algorithms

Deadline: 15 January 2013
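To make the Task 2 parameter experiments more concrete, the following is a minimal sketch of varying K for a nearest-neighbour classifier and judging reliability through the spread of accuracy across cross-validation folds. It is a plain-Python stand-in, not Weka's implementation; the toy data, the function names, and the choice of accuracy as the evaluation measure are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def cross_val_accuracy(data, k, folds=5, seed=0):
    """Return the per-fold accuracies of k-NN under shuffled cross-validation."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    scores = []
    for f in range(folds):
        # Every folds-th item (offset f) is held out; the rest is training data.
        test = shuffled[f::folds]
        train = [item for i, item in enumerate(shuffled) if i % folds != f]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(correct / len(test))
    return scores


def mean_and_std(values):
    """Return the mean and population standard deviation of a list of numbers."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


# Hypothetical toy data: two well-separated 2-D Gaussian blobs.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
data += [((rng.gauss(6, 1), rng.gauss(6, 1)), "b") for _ in range(50)]

for k in (1, 3, 5, 7):
    mean, std = mean_and_std(cross_val_accuracy(data, k))
    print(f"k={k}: accuracy {mean:.3f} +/- {std:.3f}")
```

Reporting the standard deviation alongside the mean gives one possible answer to the reliability question: a setting whose accuracy swings widely between folds is less trustworthy than one with a slightly lower but stable score. On imbalanced datasets, accuracy alone can be misleading, so a per-class measure may be a better basis for comparison.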