ISM/UNEX 270, Winter 2008
Service Engineering
Homework 3: Data Mining with SAS Enterprise Miner (Part 2 of 2)

Prof. Kevin Ross, [email protected]
T.A. Geoff Ryder, [email protected]

Due: beginning of class, Thursday, February 14 (three weeks)

Assessment: H3, Parts 1 and 2 together, is worth 10% of the course grade. Part 1 has questions Q1, Q2, and Q3 (see separate handout). This is Part 2, with questions Q4 through Q10. Each question is worth 1%.

Background

The data we wish to mine for answers in these problems contain consumer demographic and purchase history information; our goal is to develop a mailing list of customers who are likely to buy from our hypothetical firm. If we think of the data as stored in a table, then there are 1,966 independent customer records (table rows, or observations) and 50 variables (table columns). Of the column variables, 31 are numerical (“interval” variables in SAS), and 18 are non-numeric, binary or the like (“class” or “categorical” variables in SAS). The remaining variable, the Account Number, is just a row identifier that does not influence the calculations.

The questions ask you to adjust parameters in the SAS Enterprise Miner blocks (or nodes) to construct a model that can predict who our best customers are. We will use linear regression, logistic regression, a tree classifier, and a neural network. The last block, Assessment, lets us compare how well our different predictive strategies perform. The SAS help files are useful, but if you follow the instructions you won’t need to refer to them much. This homework is based roughly on the example for the Assessment Node.

Instructions

If you are a student on the Mountain View side, go to one of the four machines running SAS EM at Silicon Valley Center: two in room 2076, one in room 2071, and one in room 2072. If you are in Santa Cruz, go to the computing lab in room BE109, Baskin Engineering, where several SAS EM licenses are available over the LAN.
We will take some class time to get you started, but you will likely need to come in on your own time to finish. Go step by step through the instructions, and answer questions Q4 through Q10 (Q1, Q2, and Q3 were in the separate Part 1 handout). Most require only a number or a few words to complete.

Step-by-step procedure and questions

1. Open SAS.

2. Menu: choose Solutions -> Analysis -> Enterprise Miner.

3. Menu: choose File -> New -> Project. Type a file name you’ll remember (your last name, say), choose a directory, and hit the Create button.

4. In the Diagrams tab (near the bottom of the screen), under your new project name, a blank diagram called Untitled is now open. Click on Untitled and rename it H3.

5. Let’s set up the data mining application. Click on the Tools tab, near the Diagrams tab. In the Tools list, locate the blocks that you see in the diagram below, and drag them into your diagram in the sequence shown. (The SAMPSIO.DMEXA1 label you see is actually an annotation for the Input Data Source node.)

6. Are they all in position? Now, right-click somewhere in the diagram and select “Connect items” from the mouse menu. Then draw one-way arrows connecting all the blocks as shown. Good: your outline is done. Next, we open the blocks roughly in sequence from left to right and set up the parameters.

7. Open the first block, Input Data Source. Click on Select, and choose the SAMPSIO library, DMEXA1 data set. You should see Source Data: SAMPSIO.DMEXA1. If you wish, you can look at the variables from the Variables, Interval Variables, and Class Variables tabs.

8. Right-click on the Multiplot block, and choose Run. SAS should report back that the run was successful; go ahead and say Yes to the “view results?” prompt. For this data set EM creates 49 histograms for us, one for each variable we might use in a computation. This is a great way to get a quick look at how the data are distributed.

Q4. Which bin of the EDLEVEL histogram is largest? What gender were most of this firm’s customers?

9. Skip the Filter Outliers block, and open the Data Set Attributes block. Click the Variables tab. The Amount variable is in the second row. Click in the fourth column, New Model Role; choose Set New Model Role; and select Target for the new role of the Amount variable. Note that the target of our analysis is now an interval (continuous numerical) variable.

10. Skip the Data Partition block, and open the Regression block. Click the Data tab, and observe that the regression type is “Linear”. This is the same type of regression we performed in Homework 1, Problem 2, but now the regression equation has 48 input variables (x1, x2, …, x48) instead of just one. In H1P2 we wanted to build a model where the output variable (response variable, predicted variable, target variable) y was Revenue, and the input variable (or effect) was Adwords spending. Here we are building a model that predicts how large the output variable Amount, the amount spent by each customer, will be, based on each customer’s demographic data and past sales data: the 48 variables mentioned above.

Now run the Regression node, either from the right-click mouse menu or from the little running icon at the top. When done, say Yes to View Results. It will open to a distribution of T-scores in the Estimates tab; we can ignore this for now. Instead choose the Output tab. You should see a long report listing the effect of various parameters on the model.

Q5. From the Model Fit Statistics section, what is the Adjusted R² value? From the Type 3 Analysis of Effects section, what are the top three effects (the three effects with the largest F-values)?

11. Next, try running the Tree block. Say Yes to View Results.

Q6. (i) What is the R² value on the Valid data set?
(ii) Click on the View menu at the top left, and choose Tree. Observing the tree and its branches, what input variable has been chosen as the classifying criterion at the very top of the tree?

Q7. Was your answer to Q6 (ii) one of the top three effects in the linear regression model?

Q8. Based on the R² values from Q5 and Q6, which of our models is the more effective predictor: the linear regression or the tree?

12. Shift gears now and use the data to predict a different target output variable. This time the target will be a binary variable stating whether a consumer made a purchase or not. The model we build can then predict, for data on any new customers we may gather, whether or not those new customers are likely to buy something from our company. This yes/no, will buy/won’t buy variable is different from the continuous Amount variable we predicted before. That fact has a significant impact on the Regression node, because when SAS EM detects a binary target it switches the regression to logistic mode. Open the Data Set Attributes block again. Click the Variables tab, and under New Model Role, change the Amount variable back to an input, and change the Purchase variable to be the target.

13. Open the Data Partition block. Don’t change any parameters, but observe the percentages assigned to Train, Validation, and Test. This is an important step in data mining: take your database and split it into three parts for different purposes. Step 1: train your model, whether a regression, tree, neural network, or some other method. But there are many parameters and procedures that could be varied in the model building and training step; how do you know if your adjustments were good ones? Hence Step 2: use fresh data from the same database as inputs to your newly created model, and “validate” your adjustments. You can check the validity because your database already contains values of the variable you want to predict.
For example, you already have values for Purchase (yes or no) for all your customers. So randomly split your database into two parts, train a model with the first part, and then verify your model with the second. SAS EM provides three categories: train, validate, and test. The validate and test functions are similar and will be of interest to advanced users of the regression and neural network nodes. If you would like to investigate these functions as part of your class project, you may be interested in tuning the percentages in the Data Partition block to improve regression and NN performance. It would also be interesting to filter the input data using the Filter Outliers and Data Set Attributes blocks. However, for Homework 3, just accept the default settings.

14. Click on the Assessment node and run it. It should run several of the previous blocks, including the regression, tree, and NN nodes. Say Yes to View Results. In the Models tab of the results, you should see Regression, Tree, and Neural Network in successive rows of the first column. Control-click on all three so they are highlighted at the same time. Then, from the Tools menu at the top, select Lift Chart.

Q9. Suppose we want to target 30% of this customer base with a direct advertising campaign. Based on this lift chart, which of the three models does the best job of predicting the likelihood of purchases for this customer group?

Q10. Suppose we want to target 70% of this customer base instead of 30%. Which model would be the best predictor of purchasing for this larger customer group?
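
Optional illustration (not part of the assignment). The ideas behind steps 13 and 14 can be sketched outside SAS EM in a few lines of Python. This is only a rough sketch under stated assumptions, not how SAS EM implements its Data Partition or Assessment nodes. The function names partition and cumulative_lift, the 50/25/25 split, and the toy scores and outcomes are all made up for illustration; the fractions are not the SAS EM defaults you observe in step 13, and none of the numbers come from SAMPSIO.DMEXA1. Lift at a given depth is computed here as the purchase rate among the top-scored fraction of customers divided by the overall purchase rate.

```python
import random


def partition(rows, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly split rows into train, validation, and test lists.

    The fractions are illustrative assumptions, not SAS EM defaults.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_valid = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


def cumulative_lift(scores, outcomes, depth):
    """Purchase rate among the top `depth` fraction of customers,
    ranked by model score, divided by the overall purchase rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * depth))
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)
    return top_rate / base_rate


# Toy data (hypothetical): 1 = purchased, 0 = did not purchase.
outcomes = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
# Hypothetical predicted purchase probabilities from two models.
model_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
model_b = [0.6, 0.9, 0.2, 0.8, 0.5, 0.1, 0.7, 0.4, 0.3, 0.05]

# Split 20 hypothetical record IDs into the three partitions.
train, valid, test = partition(range(20))
print(len(train), len(valid), len(test))

# Compare the two models at the targeting depths used in Q9 and Q10.
for depth in (0.3, 0.7):
    print(depth,
          round(cumulative_lift(model_a, outcomes, depth), 2),
          round(cumulative_lift(model_b, outcomes, depth), 2))
```

In this toy example the second model has the higher lift at a 30% depth, while the two models are tied at 70%. That is why a lift chart should be read at the depth you actually plan to target.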