Homework 3: Classification – Decision Trees
Data Mining 2/2557, CE, KMITL

Consult the tools' manuals/tutorials as you feel necessary.

weather (weather_nominal.arff)

@relation weather.nominal
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
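
Question 1 below asks you to build the ID3 tree by hand. If you want to check your entropy and information-gain arithmetic, the following minimal Python sketch (purely illustrative, not part of the assignment; it assumes weather_nominal.arff sits in the working directory) computes the information gain of each attribute at the root node:

from collections import Counter
from math import log2

# Read the 14 instances from weather_nominal.arff (data lines only).
with open("weather_nominal.arff") as f:
    rows = [ln.split(",") for ln in (line.strip() for line in f)
            if ln and not ln.startswith(("@", "%"))]

ATTRS = ["outlook", "temperature", "humidity", "windy"]  # class "play" is last

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, i):
    """Information gain of splitting the rows on attribute index i."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in set(r[i] for r in rows):
        subset = [r[-1] for r in rows if r[i] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

for i, name in enumerate(ATTRS):
    print(f"gain({name}) = {info_gain(rows, i):.3f}")

# outlook has the highest gain (about 0.247), so ID3 places it at the root;
# recurse on each outlook branch in the same way to finish the tree by hand.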
1. Construct, by hand, a decision tree model from weather_nominal.arff with the ID3 classifier. Draw your tree.

2. Accuracy -- without a test dataset
   Launch Weka (or RapidMiner) and invoke the ID3 classifier on weather_nominal.arff. Note that Weka's default test option is 10-fold cross-validation.
   a. Check out the decision tree generated. Is it the same as what you got in 1?
   b. Report model accuracy, e.g., the number of instances correctly and incorrectly classified during the cross-validation.
   c. What does it mean to perform a 10-fold cross-validation? For what purpose is cross-validation used?
   d. Run the classifier again using 5-fold cross-validation. Observe model accuracy. What do you find in comparison with 10-fold?
   e. "Leave one out" cross-validation is explained in the box right below; a short scikit-learn sketch of these cross-validation schemes follows the box.
      (ref: http://en.wikipedia.org/wiki/Cross-validation_(statistics)#Leave-one-out_cross-validation)
Leave-one-out cross-validation
As the name suggests, leave-one-out cross-validation (LOOCV) involves using a single
observation from the original sample as the validation data, and the remaining observations
as the training data. This is repeated such that each observation in the sample is used once as
the validation data. This is the same as a K-fold cross-validation with K being equal to the
number of observations in the original sample. Leave-one-out cross-validation is usually very
expensive from a computational point of view because of the large number of times the
training process is repeated.
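
The box above describes leave-one-out in prose. As a purely illustrative companion to parts c-f (it uses scikit-learn rather than Weka, one-hot encodes the nominal attributes, and trains a CART-style tree instead of ID3), the following sketch runs 10-fold, 5-fold, and leave-one-out cross-validation on the same 14 instances:

from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Load the nominal attributes (X) and the play labels (y) from the ARFF file.
with open("weather_nominal.arff") as f:
    rows = [ln.split(",") for ln in (line.strip() for line in f)
            if ln and not ln.startswith(("@", "%"))]
X = [r[:-1] for r in rows]
y = [r[-1] for r in rows]

# scikit-learn trees need numeric inputs, so one-hot encode the nominal
# attributes; the learner is CART with entropy, not ID3, but the
# cross-validation mechanics are the same.
model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(criterion="entropy", random_state=0))

schemes = [("10-fold", KFold(n_splits=10, shuffle=True, random_state=0)),
           ("5-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
           ("leave one out", LeaveOneOut())]
for name, cv in schemes:
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.2f} over {scores.size} folds")

# LeaveOneOut() is equivalent to KFold with n_splits equal to the number of
# instances, which is the key to the fold-count question below.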
      Suppose you wish to perform a "leave one out" cross-validation on this weather_nominal.arff dataset. How many folds must you specify to achieve this?
   f. Run the classifier again using "leave one out" cross-validation. Observe model accuracy. What do you find in comparison with 10-fold and 5-fold?
   g. You may want to add more instances to the dataset (make it at least 20 instances) and try again.
   h. Check out the other test options.

3. Accuracy -- with a test dataset
   a. Test the model with the weather_test.arff dataset. (You can save the model and load it to run against the test dataset, or you can select the test dataset and re-evaluate the model against it.)
   b. What is the accuracy of the model when run against the test dataset?
   c. You may want to add more instances to the test dataset.

bank (bank.arff)

1. Look into bank.arff. Perform data preprocessing as you see necessary. Save the new version for 2.
   a. What data preprocessing did you do, and why?
   b. Use the C4.5 classifier (J48 in Weka).
   c. Observe model accuracy.
   d. You may want to vary the classifier parameters and see the effect.
2. Create a test dataset (bank_test.arff) with the same structure as bank.arff. Take some (50, maybe) instances from bank.arff (remove them from bank.arff). You may do this in WordPad/Notepad.
3. Create the model again (from bank.arff).
4. Test the model with bank_test.arff. Observe accuracy.

drug (drug_train.arff: training dataset; drug_test.arff: test dataset)

1. Construct a decision tree with the C4.5 classifier against the training dataset. Take a close look at these two parameters, unpruned and minNumObj: "unpruned" prevents pruning the decision tree after building it, and "minNumObj" sets the minimum number of instances in any leaf node to limit the growth of the tree.
2. Vary those two parameters. Generate pruned and unpruned trees with a minimum number of instances per leaf between 1 and 4. (A scikit-learn analogue of this experiment is sketched at the end of this document.)
3. Explain why pruning affects the performance of the decision tree algorithm.
4. Explain how different limits on the number of instances per leaf influence the performance (prediction accuracy) of the decision tree algorithm.

You don't have to turn in anything. However, be prepared to discuss your results and findings in class individually. I will randomly call on you to explain them.
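
The drug exercise above varies J48's unpruned and minNumObj options against a held-out test file. The sketch below shows an analogous experiment in scikit-learn rather than Weka: min_samples_leaf plays roughly the role of minNumObj, and cost-complexity pruning (ccp_alpha) is only a rough stand-in for J48's pruning, so treat the numbers as indicative. It assumes drug_train.arff and drug_test.arff contain purely nominal, comma-separated attributes with the class last; adapt the loader if your files have numeric columns.

from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def load_arff(path):
    """Crude ARFF reader: returns (attribute values, class labels).
    Assumes a purely nominal, comma-separated @data section with the class
    as the last attribute."""
    with open(path) as f:
        rows = [ln.split(",") for ln in (line.strip() for line in f)
                if ln and not ln.startswith(("@", "%"))]
    return [r[:-1] for r in rows], [r[-1] for r in rows]

X_train, y_train = load_arff("drug_train.arff")
X_test, y_test = load_arff("drug_test.arff")

for pruned in (False, True):
    for min_leaf in (1, 2, 3, 4):
        # ccp_alpha > 0 turns on cost-complexity pruning (a stand-in for
        # J48's pruned mode); min_samples_leaf mirrors minNumObj.
        model = make_pipeline(
            OneHotEncoder(handle_unknown="ignore"),
            DecisionTreeClassifier(criterion="entropy",
                                   min_samples_leaf=min_leaf,
                                   ccp_alpha=0.01 if pruned else 0.0,
                                   random_state=0))
        model.fit(X_train, y_train)
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"pruned={pruned}, min_samples_leaf={min_leaf}: "
              f"test accuracy = {acc:.3f}")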