ADDIS ABABA UNIVERSITY
BUSINESS INTELLIGENCE, DATA WAREHOUSING, AND
DATA MINING
CIT 828: Summer 2009
HW Assignment 3
Due: July 9, 2009
Credit risk is a key problem in today’s economy, and it is also a relevant application of data mining methods. In this assignment, you will
build classification models to predict whether a person will be a good or bad credit payer. You will need to download the TRAINING
(TRAIN3.arff) and TEST (TEST3.arff) files from the course website. In both data sets, approximately 50% of the examples in the sample are
positive. The nominal target variable is class (good or bad customer). Predictors include nominal and numeric attributes related to the
credit history of the customer, personal characteristics, and financial status, among others. The names and values of the attributes are self-explanatory; for more details, visit http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data) (NOTE: we have modified
the sample somewhat for this assignment). If you want to know more about the problem of credit scoring and its connection with data
mining methods, you may find it useful to skim some articles available on the course website: We will never have a perfect model of risk FT.pdf, Recent developments in consumer credit risk assessment.pdf.
This assignment will walk you through the basic steps of building classifiers, evaluating and comparing models, analyzing overfitting, and
performing feature selection.
Step 0: Loading the files
From the Start menu select Weka, then Weka 3-4. You will see the Weka GUI Chooser. Select Explorer; this will launch the Weka Explorer. You will
find the training set, TRAIN3.arff, on the course website. Load the training set into Weka.
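If you prefer to script these steps rather than click through the Explorer GUI, the same operations are available through Weka's Java API. A minimal loading sketch (the file path is an assumption; the class attribute is assumed to be the last column, as in the Statlog data):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // Path is an assumption; point it at wherever you saved TRAIN3.arff.
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        // Weka does not know which attribute is the class;
        // here we assume it is the last one.
        train.setClassIndex(train.numAttributes() - 1);
        System.out.println("Loaded " + train.numInstances() + " instances with "
                + train.numAttributes() + " attributes.");
    }
}
```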
Step 1: Evaluation - Building/Comparing/Selecting Models
Select the tab that says Classify. In the box that says Classifier, you can choose your classifier. Compare the results of some of the methods we will
discuss in class: J48 (Decision Trees), Naïve Bayes, Logistic Regression (under classifiers.functions.Logistic), and K Nearest Neighbors (under lazy.IBk).
Build each model using 10-fold cross-validation on your training set (select Cross-validation in the test options). In n-fold cross-validation, the training data set is divided into n subsets and n different models are built, each time leaving one subset out of the training set; this procedure gives more reliable estimates of a model's performance.
For each model, report the % of Correctly Classified Instances (accuracy). What are the two best methods based on this measure?
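As a sketch of what the Explorer does here, the same 10-fold cross-validation comparison can be scripted with the Java API (all classifiers use their default settings; the random seed of 1 matches the Explorer's default):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        train.setClassIndex(train.numAttributes() - 1);

        Classifier[] models = { new J48(), new NaiveBayes(), new Logistic(), new IBk() };
        for (Classifier model : models) {
            Evaluation eval = new Evaluation(train);
            // 10-fold cross-validation, seed 1 (the Explorer's default).
            eval.crossValidateModel(model, train, 10, new Random(1));
            System.out.printf("%-12s accuracy = %.2f%%%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```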
Step 2: Evaluation - ROC Curves
You will use another tool to evaluate a predictive model: the ROC curve. Read pages 98-99 of the textbook.
For each of the two “best” models identified above, do the following:
Instead of running each of the two methods on your training set using 10-fold cross-validation, apply the model to the TEST set (TEST3.arff)
provided with your homework (select Supplied test set in the test options, select your TEST3.arff file, and run the respective models).
Right-click on the appropriate run in the “Result list”. Select Visualize Threshold Curve, and then the class ‘good’ (class 1). You will see a window
pop up. In the top two drop-down menus, select X: False Positive Rate on the left and Y: True Positive Rate on the right. This will give you a
visualization of the ROC curve. Above the visualization, you will see a value for Area Under ROC. Record this value for each model.
What is the Area Under the ROC Curve for each model?
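For reference, a sketch of the same supplied-test-set evaluation in the Java API (J48 stands in for whichever of your two best models you are checking; the class index passed to areaUnderROC is an assumption you should verify against the order of class values in your ARFF header):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class TestSetAUC {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("TEST3.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);        // train on TRAIN3.arff
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);     // evaluate on the supplied test set
        // Class index 0 assumes 'good' is the first class value; check your ARFF header.
        System.out.printf("Accuracy = %.2f%%, AUC = %.3f%n",
                eval.pctCorrect(), eval.areaUnderROC(0));
    }
}
```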
Step 3: Evaluation - Learning Curves
Go back to the Preprocess tab and create 4 additional samples of your training set (20%, 40%, 60%, and 80% of the data).
Directions for sampling:
Open the original TRAIN3.arff file.
In the filter box push the Choose Button.
From the directory of filters, select weka, filters, supervised, instance, Resample.
Click in the Resample box, right next to the Choose button. In the popup, set biasToUniformClass to 0 and sampleSizePercent to 20
for 20% of the data.
Then push the Apply Button.
Go to the classifier screen and build your model using the sampled data as training (already in memory once you hit Apply) for the top two methods found
in Step 1. For each method, supply the same test set (TEST3.arff). Record the accuracy and the area under the ROC curve for each training-set sample
size/method pair, as in the sketch below. NOTE: once you record the values for a given training sample size, do not forget to open the original TRAIN3.arff file in Weka again,
and repeat the process for the remaining sample sizes.
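A sketch of this resample, train, and evaluate loop in the Java API (J48 again stands in for one of your top two methods; the class index in areaUnderROC is an assumption, as above):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.instance.Resample;

public class LearningCurve {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("TEST3.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        for (double pct : new double[] { 20, 40, 60, 80, 100 }) {
            // Resample from the full, original training set each time.
            Resample resample = new Resample();
            resample.setBiasToUniformClass(0.0);
            resample.setSampleSizePercent(pct);
            resample.setInputFormat(train);
            Instances sample = Filter.useFilter(train, resample);

            J48 tree = new J48();
            tree.buildClassifier(sample);
            Evaluation eval = new Evaluation(sample);
            eval.evaluateModel(tree, test);
            System.out.printf("%.0f%% of data: accuracy = %.2f%%, AUC = %.3f%n",
                    pct, eval.pctCorrect(), eval.areaUnderROC(0));
        }
    }
}
```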
Create 2 plots:
Plot the sample size (20, 40, 60, 80, 100) on the horizontal axis and accuracy on the vertical axis.
Plot the sample size (20, 40, 60, 80, 100) on the horizontal axis and area under the ROC curve (AUC) on the vertical axis.
How do the methods compare with less training data?
How do they compare with more training data?
In general, do you think your models would perform a lot better with more training data? Why or why not?
Step 4: Overfitting
In class we have discussed that better performance on the training set does not necessarily imply better predictive ability on the test set. You will now
explore overfitting in classification trees.
Open the original file TRAIN3.arff in Weka and make sure you are not using the Resample filter. Build a J48 classifier using the default settings:
In the test options, select ‘Supplied test set’ (TEST3.arff) and register the % of Correctly Classified Instances. Repeat the process, now selecting the
option ‘Use training set’ in the test options.
To explore overfitting, repeat the process indicated above, setting the parameter minNumObj=1. Then try setting both minNumObj=1 and unpruned=True.
Report the values you obtained in the following table:
J48 model                    | Accuracy (TRAIN) | Accuracy (TEST)
-----------------------------|------------------|----------------
Default setting              |                  |
minNumObj=1                  |                  |
minNumObj=1 & unpruned=True  |                  |
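A sketch of these three J48 configurations in the Java API, evaluated on both the training set and the supplied test set so that the table can be filled in directly (file paths and the last-column class index are assumptions, as before):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;

public class OverfitTable {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        Instances test = new Instances(new BufferedReader(new FileReader("TEST3.arff")));
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        String[] labels = { "Default setting", "minNumObj=1", "minNumObj=1 & unpruned" };
        J48[] trees = { new J48(), new J48(), new J48() };
        trees[1].setMinNumObj(1);            // allow leaves covering a single instance
        trees[2].setMinNumObj(1);
        trees[2].setUnpruned(true);          // disable pruning entirely

        for (int i = 0; i < trees.length; i++) {
            trees[i].buildClassifier(train);
            Evaluation onTrain = new Evaluation(train);
            onTrain.evaluateModel(trees[i], train);   // accuracy on the training set
            Evaluation onTest = new Evaluation(train);
            onTest.evaluateModel(trees[i], test);     // accuracy on the supplied test set
            System.out.printf("%-25s TRAIN = %.2f%%, TEST = %.2f%%%n",
                    labels[i], onTrain.pctCorrect(), onTest.pctCorrect());
        }
    }
}
```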
Step 5: Feature selection
Finally, analyze which variables are most relevant for explaining credit risk in our data set. Build logistic regression models using the ‘supplied test set’ in
the test options; then go to the Select attributes tab, set the field ‘Attribute evaluator’ to InfoGainAttributeEval and the ‘Search method’ to
Ranker. In the ‘Attribute selection mode’, select Use full training set.
What are the 5 most important variables according to this criterion? Build a logistic regression model using only these 5 most important attributes as
predictors (NOTE: in assignment 2 you learned how to remove attributes using Invert; you will need to do this for both the TRAIN3.arff and TEST3.arff
files, and save the resulting files under different names). Register and report the accuracy and area under the ROC curve that you obtain using the 5 most important
attributes as predictors, and compare with the results you obtain when you use all the attributes in the model; use the supplied test set in the test options in
all the experiments. Does your model perform better when you include all 20 features instead of just the 5 most important?
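A sketch of the same information-gain ranking through the Java API (setNumToSelect(5) keeps the 5 highest-ranked attributes; to build the reduced files, you can then pass the selected indices to the weka.filters.unsupervised.attribute.Remove filter with invertSelection enabled, mirroring the Invert step from assignment 2):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances train = new Instances(new BufferedReader(new FileReader("TRAIN3.arff")));
        train.setClassIndex(train.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(5);            // keep the 5 highest-ranked attributes
        selector.setSearch(ranker);
        selector.SelectAttributes(train);    // note Weka's capitalized method name

        // selectedAttributes() returns the kept indices, with the class index appended.
        for (int idx : selector.selectedAttributes()) {
            System.out.println(train.attribute(idx).name());
        }
    }
}
```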