Data Mining and Knowledge Discovery (KSE525)
Assignment #3 (May 7, 2013, due: May 21)

1. [15 points] Suppose that you have the weather data set below. The last attribute is the class label.
   1) Calculate the conditional probabilities required for the Naïve Bayes classifier. For ease of calculation, use the original definition of probability estimation.
   2) Predict the class label of a new instance X = (Overcast, Cool, High, True).

   Outlook   Temperature  Humidity  Windy  Play
   Sunny     Hot          High      False  No
   Sunny     Hot          High      True   No
   Overcast  Hot          High      False  Yes
   Rainy     Mild         High      False  Yes
   Rainy     Cool         Normal    False  Yes
   Rainy     Cool         Normal    True   No
   Overcast  Cool         Normal    True   Yes
   Sunny     Mild         High      False  No
   Sunny     Cool         Normal    False  Yes
   Rainy     Mild         Normal    False  Yes
   Sunny     Mild         Normal    True   Yes
   Overcast  Mild         High      True   Yes
   Overcast  Hot          Normal    False  Yes
   Rainy     Mild         High      True   No

2. [7 points] Discuss the advantages and disadvantages of lazy classification (e.g., k-nearest neighbor classification) in comparison with eager classification.

3. [8 points] A notable problem of the information gain is that it prefers attributes with a large number of distinct values. Explain why the information gain suffers from this problem and why the gain ratio does not.

4. [20 points] Install R and the two packages party and randomForest. Answer the following questions using R. For each question, hand in your R code as well as your answer (result).
   1) Build a decision tree for the "GlaucomaM" data set and plot the decision tree. This data set is included in the "ipred" package. Use the whole data set for training. [Hint: use the "ctree" function in the party package.]
   2) Show the confusion matrix of the result of the decision tree. Please note that there are two class labels: glaucoma and normal.
   3) Run the random forest algorithm with ntree=100 for the same data set and plot the error rates as the number of trees grows. [Hint: use the "plot" function for showing the error rates.]
   4) Find out the most important variable for classification in this data set, according to the random forest built. [Hint: use the "varImpPlot" function.]

Manuals for R packages:
party: http://cran.r-project.org/web/packages/party/party.pdf
randomForest: http://cran.r-project.org/web/packages/randomForest/randomForest.pdf
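Not part of the assignment, but the hand calculation in question 1 is easy to get wrong, so here is a small self-contained Python sketch you can use to check your answer. It uses the plain (unsmoothed) relative-frequency probability estimates the question asks for; the data is the weather table above.

```python
# Naïve Bayes check for question 1.
# Probabilities are estimated by simple counting (no smoothing),
# matching the "original definition of probability estimation".
data = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def score(x, label):
    """P(label) * product over attributes of P(x_i | label)."""
    rows = [r for r in data if r[-1] == label]
    p = len(rows) / len(data)  # prior P(label)
    for i, value in enumerate(x):
        p *= sum(r[i] == value for r in rows) / len(rows)
    return p

x = ("Overcast", "Cool", "High", "True")
scores = {label: score(x, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print(scores)      # Yes ≈ 9/14 * 4/9 * 3/9 * 3/9 * 3/9; No = 0 (no Overcast/No row)
print(prediction)  # Yes
```

Because Outlook=Overcast never occurs with Play=No, the unsmoothed estimate makes the "No" score exactly zero; this is the zero-frequency issue that Laplace smoothing (not required here) is designed to avoid.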