LINKÖPING UNIVERSITY
Department of Computer and Information Science
Division of statistics

Written exam in Data mining and statistical learning (732A20)
Date: 2008-01-14, at 8.30 – 18.00
Examiner: Anders Grimvall

Rules for the preparation of reports
1) Reports shall be mailed to [email protected]
2) Each person shall prepare his/her own report without communicating with any other person.
3) The datasets shall be individualized by random sampling from common datasets. Please indicate in your report which seeds you have used in the random number generator.
4) Textbooks, the internet and old lab reports can be used without any restrictions.

Task 1. Credit scoring
The data file creditscoring.xls contains data retrieved from a database in a private enterprise. Each row contains information about one customer. The variable good/bad indicates how the customers have managed their credits. The other variables are potential predictors. Your task is to derive a prediction model that can be used to predict whether or not a new customer will handle his/her credit in a good manner.
a) Import the file creditscoring.xls to SAS and set suitable model roles for all variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed seed to select a sample comprising 90% of the 1000 customers. This data set will then be your raw data. Include the seed in your report.
b) Use the Tree node in EM to fit a decision tree to your data. Make a tree plot with five leaf nodes and determine the misclassification rate for this tree.
c) Use the Regression node in EM to fit a logistic regression model to your data and determine the model's misclassification rate when forward selection is used.
d) Use the output from the logistic regression to determine which of the predictors make a significant contribution to the classification.
e) Explain how the odds ratio can be interpreted for an interval variable and an ordinal variable.
f) Explain how the logistic regression model will change if an ordinal or nominal variable that takes at least three values is redefined as an interval variable.
g) Mention at least two more methods that might be used to distinguish between good and bad customers.

Task 2: Prediction and model selection
The data file choppedmeat.xls contains data regarding the protein content and the absorbance of light in 100 different channels for a total of 240 samples of finely chopped meat. Your task is to select suitable prediction models for the protein content and to estimate their predictive power.
a) Import the file choppedmeat.xls to SAS and set suitable model roles for all variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed seed to select a sample comprising 90% of the 240 meat samples. This data set will then be your raw data. Include the seed in your report.
b) Use the Partition node in EM to divide your data into a training set (70%) and a validation set (30%).
c) Use the Regression node with forward selection to derive a prediction model of the protein content and compute the average squared prediction error for the validation set.
d) Use the Neural network node with default settings to fit an artificial neural network to your training set and compute the average squared prediction error for the validation set.
e) Explain how the parameters in the estimated artificial neural network can be interpreted.
f) Use proc PLS to derive a suitable partial least squares model of the protein content and determine the (validation set) average squared prediction error. Note that the validation set shall be the same as in task d.
g) Explain how the results in task f can guide you in your search for a suitable artificial neural network model of the protein content.
h) Investigate whether or not the (validation set) average squared prediction error for a suitable neural network can be made smaller than the ASE obtained with PLS in task f.

Task 3: Residual analysis and modelling
The data file softdrinks_res.xls contains data on the quantities of soft drinks sold by a shop in a residential area. Your task is to step by step build a prediction model for the quantities sold.
a) Import the file softdrinks_res.xls to SAS and set suitable model roles for all variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed seed to select a sample comprising 90% of the data. This data set will then be your raw data. Include the seed in your report.
b) Use proc gam in SAS to fit a model with log_quantity as target and weekday dummies and spline functions of the day_of_year and time as predictors.
c) Plot the residuals in task b against suitably selected variables and explain how such plots can help you modify the degrees of freedom of the spline functions.
d) Introduce a spline function of temperature in the additive model and plot the sum of the linear and nonlinear parts of this spline function. Also determine whether or not these components are statistically significant.
e) Introduce a set of holiday dummies in your model and use visual inspection of suitably selected residual plots to assess the presence of interaction effects influencing the target variable.
f) Explain how certain interaction effects can be incorporated into generalized additive models.
g) Examine the presence of substantial interaction effects in your model of the log quantity sold and discuss whether or not these effects can be handled in proc gam.

To hand in
Highlighted items with clear and well organized answers and insightful comments and explanations.

Grading
The solutions to the three tasks will be graded from 0 to 10, and a minimum of 12 points is required to pass the course with grade D. Higher scores will yield higher grades. The clarity and quality of motivations and comments play an important role in the grading.
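
Note on sampling and error measures
The tasks above rely on three mechanical steps: a seeded 90% random sample (tasks 1a, 2a, 3a), a 70/30 split into training and validation sets (task 2b), and the computation of a misclassification rate or an average squared prediction error. The following is a minimal Python sketch of those steps, not the SAS Enterprise Miner procedure itself. All function names, the seed value 12345 and the stand-in data (row indices 0 to 239) are assumptions made for illustration, and a seed used here will not reproduce a sample drawn by the Sampling node in EM.

```python
import random


def sample_fraction(rows, fraction, seed):
    """Draw a simple random sample without replacement (hypothetical helper)."""
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = round(fraction * len(rows))
    return rng.sample(rows, k)


def partition(rows, train_fraction, seed):
    """Shuffle the rows and split them into training and validation sets."""
    rng = random.Random(seed)
    shuffled = rows[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def average_squared_error(actual, predicted):
    """Mean of the squared differences between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def misclassification_rate(actual, predicted):
    """Share of cases whose predicted class differs from the observed class."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# Stand-in for the 240 meat samples: row indices only, no real measurements.
rows = list(range(240))
raw = sample_fraction(rows, 0.90, seed=12345)    # assumed seed, report yours
train, valid = partition(raw, 0.70, seed=12345)  # 70% training, 30% validation
print(len(raw), len(train), len(valid))
```

Both error measures should be evaluated on the validation set only, so that models fitted by different methods are compared on the same unseen cases.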