
CIS526: Homework 5
Assigned: Oct 19, 2004
Due: in class, Tue, Oct 26, 2004
Homework Policy
 All assignments are INDIVIDUAL! You may discuss the problems with your colleagues, but you must solve
the homework by yourself. Please acknowledge all sources you use in the homework (papers, code or ideas from
someone else).
Assignments should be submitted in class on the day when they are due. No credit is given for assignments
submitted at a later time, unless you have a medical problem.
Problem 1:
Download and install WEKA 3: Data Mining Software in Java. Learn how to use the "Explorer GUI" – it is very user-friendly and should not take long. Download the "Adult Database".
a) Briefly explain the properties of the data set "pima.txt" (how much data there is, what the attributes mean, and what the target is). Assign each attribute to either the nominal or the numeric type.
b) Select the first 5000 data points from the data set (this will allow you to perform more experiments).
Reformat the data to WEKA format. Run 5-fold cross-validation classification experiments using the
following algorithms (you may leave each algorithm's parameters at their defaults):
a. ZeroR (trivial predictor)
b. J48 (decision tree)
c. NaiveBayes
d. IBk (k-nearest neighbor)
e. MultilayerPerceptron (neural network)
f. SMO (support vector machine)
g. Bagging of 30 decision trees (meta learning algorithm)
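The "WEKA format" mentioned in part b) is ARFF. As a rough sketch of what the reformatted file should look like (the attribute names and values below are hypothetical placeholders, not the actual schema of the assignment's data set):

```
@relation adult

@attribute age        numeric
@attribute workclass  {Private, Self-emp, Government}
@attribute income     {<=50K, >50K}

@data
39, Private, <=50K
52, Self-emp, >50K
```

Nominal attributes list their possible values in braces; numeric attributes are declared with the keyword `numeric`. Each line after `@data` is one instance, with attribute values in declaration order.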
Based on the J48 tree result, discuss which attributes are important for classification and which are not.
Comment on whether this agrees with your intuition. Report the accuracy of each algorithm, and rank the
algorithms by their speed.
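WEKA runs the cross-validation for you, but it may help to see what 5-fold cross-validation of the trivial ZeroR predictor amounts to. Below is a minimal, stdlib-only Python sketch (not part of the assignment; the function names and the two-class labels are illustrative assumptions):

```python
from collections import Counter

def zero_r_fit(train_labels):
    """ZeroR: ignore all attributes and always predict the majority class."""
    return Counter(train_labels).most_common(1)[0][0]

def five_fold_cv(labels, k=5):
    """Mean accuracy of ZeroR over k contiguous cross-validation folds."""
    n = len(labels)
    fold_size = n // k
    accuracies = []
    for i in range(k):
        # Hold out one fold for testing, train on the remaining k-1 folds.
        test_idx = set(range(i * fold_size, (i + 1) * fold_size))
        train = [labels[j] for j in range(n) if j not in test_idx]
        prediction = zero_r_fit(train)
        test = [labels[j] for j in sorted(test_idx)]
        accuracies.append(sum(y == prediction for y in test) / len(test))
    return sum(accuracies) / k
```

Note that this sketch uses contiguous folds for simplicity, whereas WEKA stratifies the folds by class; ZeroR's accuracy equals the majority-class frequency and serves as the baseline the other six algorithms must beat.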
c) Try to improve the accuracy of each of the above algorithms by changing some of the default parameters.
Explain your choices and present the results. Hopefully, you will be able to improve the accuracy of every
algorithm other than ZeroR.
Problem 2:
Write a one-page report (font 11pt, 1-inch margins) describing the motivation, methodology, and experimental results of the
following paper:
Opitz, D., Maclin, R.: Popular ensemble methods: an empirical study. Journal of Artificial Intelligence Research, Vol.
11 (1999), 169-198.
Within the one-page limit, please also discuss what you consider to be the strongest or most
important parts of the paper, and what its drawbacks or problems are. The paper is a very nice
example of high-quality empirical work in data mining. Pay special attention to the style of presentation,
since you will be expected to emulate it in your course project report.
Good Luck!!