Data Mining and Knowledge Discovery (KSE525)
Assignment #3 (May 7, 2013, due: May 21)
1. [15 points] Suppose that you have the weather data set below. The last attribute is the class label.
1) Calculate the conditional probabilities required for the Naïve Bayes classifier. For ease of calculation, use the original definition of probability estimation.
2) Predict the class label of a new instance 𝑋 = (Overcast, Cool, High, True).
Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rainy     Mild         High      True   No
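The hand calculation in question 1 can be checked mechanically. The sketch below (an assumption of this note, not part of the assignment) enters the table as a data frame and scores the new instance with plain frequency estimates, i.e., the "original definition of probability estimation" without smoothing; the names `weather`, `cond_prob`, and `x` are hypothetical.

```r
# Weather data from the table above, entered by hand.
weather <- data.frame(
  Outlook     = c("Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                  "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"),
  Temperature = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                  "Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
  Humidity    = c("High","High","High","High","Normal","Normal","Normal",
                  "High","Normal","Normal","Normal","High","Normal","High"),
  Windy       = c("False","True","False","False","False","True","True",
                  "False","False","False","True","True","False","True"),
  Play        = c("No","No","Yes","Yes","Yes","No","Yes",
                  "No","Yes","Yes","Yes","Yes","Yes","No")
)

# P(attribute = value | Play = class) by the plain (non-smoothed) count ratio.
cond_prob <- function(attr, value, class) {
  sum(weather[[attr]] == value & weather$Play == class) /
    sum(weather$Play == class)
}

# Score the new instance X = (Overcast, Cool, High, True) for each class:
# prior P(class) times the product of the conditional probabilities.
x <- c(Outlook = "Overcast", Temperature = "Cool",
       Humidity = "High", Windy = "True")
for (cl in c("Yes", "No")) {
  prior <- mean(weather$Play == cl)
  lik   <- prod(sapply(names(x), function(a) cond_prob(a, x[[a]], cl)))
  cat(cl, ":", prior * lik, "\n")
}
```

Note that with non-smoothed estimates, any attribute value unseen for a class (here, Overcast never occurs with Play = No) drives that class's score to zero.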
2. [7 points] Discuss the advantages and disadvantages of lazy classification (e.g., k-nearest neighbor classification) in comparison with eager classification.
3. [8 points] A notable problem of the information gain is that it prefers attributes with a large number of distinct values. Explain why the information gain suffers from this problem and why the gain ratio does not.
4. [20 points] Install R and then two packages: party and randomForest. Answer the following questions using R. For each question, hand in your R code as well as your answer (result).
1) Build a decision tree for the "GlaucomaM" data set and plot the decision tree. This data set is included in the "ipred" package. Use the whole data set for training. [Hint: use the "ctree" function in the party package.]
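Following the hint, a minimal sketch for 1) (the object name `gl_tree` is an arbitrary choice, not prescribed by the assignment):

```r
library(party)                        # provides ctree()
data("GlaucomaM", package = "ipred")  # glaucoma data set; class column: Class

# Fit a conditional-inference tree on the whole data set and plot it.
gl_tree <- ctree(Class ~ ., data = GlaucomaM)
plot(gl_tree)
```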
2) Show the confusion matrix of the result of the decision tree. Please note that there are two class labels: glaucoma and normal.
3) Run the random forest algorithm with ntree=100 for the same data set and plot the error rates as the number of trees grows. [Hint: use the "plot" function for showing the error rates.]
4) Find out the most important variable for classification in this data set, according to the random forest built. [Hint: use the "varImpPlot" function.]
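A combined sketch for 3) and 4), following both hints (the seed and the object name `gl_rf` are illustrative assumptions; error rates vary with the seed because the forest is randomized):

```r
library(randomForest)
data("GlaucomaM", package = "ipred")

set.seed(42)  # hypothetical seed, for reproducible error rates
gl_rf <- randomForest(Class ~ ., data = GlaucomaM, ntree = 100)

plot(gl_rf)        # OOB and per-class error rates vs. number of trees
varImpPlot(gl_rf)  # variable importance; the top variable answers 4)
```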
Manuals for R packages:
- party: http://cran.r-project.org/web/packages/party/party.pdf
- randomForest: http://cran.r-project.org/web/packages/randomForest/randomForest.pdf