LINKÖPING UNIVERSITY
Department of Computer and Information Science
Division of statistics
Written exam in Data mining and statistical learning (732A20)
Date: 2008-01-14, at 8.30 – 18.00
Examiner: Anders Grimvall
Rules for the preparation of reports
1) Reports shall be mailed to [email protected]
2) Each person shall prepare his/her own report without communicating with any other
person.
3) The datasets shall be individualized by random sampling from common datasets.
Please indicate in your report which seeds you have used in the random number
generator.
4) Textbooks, the internet and old lab reports can be used without any restrictions.
Task 1: Credit scoring
The data file creditscoring.xls contains data retrieved from a database in a private
enterprise. Each row contains information about one customer. The variable good/bad
indicates how the customers have managed their credits. The other variables are potential
predictors. Your task is to derive a prediction model that can be used to predict whether
or not a new customer will handle his/her credit in a good manner.
a) Import the file creditscoring.xls to SAS and set suitable model roles for all
variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed
seed to select a sample comprising 90% of the 1000 customers. This data set will
then be your raw data. Include the seed in your report.
b) Use the Tree node in EM to fit a decision tree to your data. Make a tree plot with
five leaf nodes and determine the misclassification rate for this tree.
c) Use the Regression node in EM to fit a logistic regression model to your data and
determine the model’s misclassification rate when forward selection is used.
d) Use the output from the logistic regression to determine which of the predictors
make a significant contribution to the classification.
e) Explain how the odds ratio can be interpreted for an interval variable and an
ordinal variable.
f) Explain how the logistic regression model will change if an ordinal or nominal
variable that takes at least three values is redefined as an interval variable.
g) Mention at least two more methods that might be used to distinguish between
good and bad customers.
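Note on seeded sampling and the misclassification rate: the sketch below is a plain Python illustration of two notions used in Task 1, not a reproduction of the Sampling node in Enterprise Miner. The function names, the seed 12345 and the toy labels are assumptions made for illustration only; none of them come from creditscoring.xls.

```python
import random


def sample_rows(n_rows, fraction, seed):
    """Return sorted row indices of a simple random sample without replacement."""
    rng = random.Random(seed)  # a fixed, reported seed makes the sample reproducible
    k = round(n_rows * fraction)
    return sorted(rng.sample(range(n_rows), k))


def misclassification_rate(actual, predicted):
    """Share of cases whose predicted class differs from the observed class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)


# 90% of 1000 customers, drawn with a manually chosen (hypothetical) seed
rows = sample_rows(n_rows=1000, fraction=0.9, seed=12345)
print(len(rows))  # 900

# Hypothetical toy labels: two of the five predictions are wrong
actual = ["good", "good", "bad", "good", "bad"]
predicted = ["good", "bad", "bad", "good", "good"]
print(misclassification_rate(actual, predicted))  # 0.4
```

Rerunning sample_rows with the same seed returns the same rows, which is why the seed has to be stated in the report.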
Task 2: Prediction and model selection
The data file choppedmeat.xls contains data regarding the protein content and the
absorbance of light in 100 different channels for a total of 240 samples of finely chopped
meat. Your task is to select suitable prediction models for the protein content and to
estimate their predictive power.
a) Import the file choppedmeat.xls to SAS and set suitable model roles for all
variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed
seed to select a sample comprising 90% of the 240 meat samples. This data set
will then be your raw data. Include the seed in your report.
b) Use the Partition node in EM to divide your data into a training set (70%) and a
validation set (30%).
c) Use the regression node with forward selection to derive a prediction model of the
protein content and compute the average squared prediction error for the
validation set.
d) Use the Neural network node with default settings to fit an artificial neural
network to your training set and compute the average squared prediction error for
the validation set.
e) Explain how the parameters in the estimated artificial neural network can be
interpreted.
f) Use proc PLS to derive a suitable partial least squares model of the protein
content and determine the (validation set) average squared prediction error. Note
that the validation set shall be the same as in task d.
g) Explain how the results in task f can guide you in your search for a suitable
artificial neural network model of the protein content.
h) Investigate whether or not the (validation set) average squared prediction error for
a suitable neural network can be made smaller than the ASE obtained with PLS in
task f.
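Note on the validation-set error: the sketch below shows, in plain Python, one way to form a 70/30 partition from a seed and to compute the average squared prediction error on the validation part. It does not reproduce the Partition node in Enterprise Miner. The function names, the seed 2008 and the toy protein values are assumptions made for illustration, not values from choppedmeat.xls.

```python
import random


def partition(indices, train_fraction, seed):
    """Shuffle the indices with a fixed seed and split them into two parts."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def average_squared_error(observed, predicted):
    """Mean of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


# 216 samples correspond to a 90% sample of the 240 meat samples
train, valid = partition(range(216), train_fraction=0.7, seed=2008)
print(len(train), len(valid))  # 151 65

# Hypothetical protein contents and predictions for four validation samples
observed = [17.0, 18.5, 20.0, 16.5]
predicted = [17.5, 18.0, 19.0, 16.5]
print(average_squared_error(observed, predicted))  # 0.375
```

Keeping the seed fixed keeps the validation set fixed, so errors from different models are computed on the same samples and can be compared directly.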
Task 3: Residual analysis and modelling
The data file softdrinks_res.xls contains data on the quantities of soft drinks sold by a
shop in a residential area. Your task is to build, step by step, a prediction model for the
quantities sold.
a) Import the file softdrinks_res.xls to SAS and set suitable model roles for all
variables. Use the Sampling node in Enterprise Miner (EM) and a manually typed
seed to select a sample comprising 90% of the data. This data set will then be
your raw data. Include the seed in your report.
b) Use proc gam in SAS to fit a model with log_quantity as target and weekday
dummies and spline functions of the day_of_year and time as predictors.
c) Plot the residuals in task b against suitably selected variables and explain how
such plots can help you modify the degrees of freedom of the spline functions.
d) Introduce a spline function of temperature in the additive model and plot the sum
of the linear and nonlinear parts of this spline function. Also determine whether or
not these components are statistically significant.
e) Introduce a set of holiday dummies in your model and use visual inspection of
suitably selected residual plots to assess the presence of interaction effects
influencing the target variable.
f) Explain how certain interaction effects can be incorporated into generalized
additive models.
g) Examine the presence of substantial interaction effects in your model of the log
quantity sold and discuss whether or not these effects can be handled in proc gam.
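Note on weekday dummies and day of year: the sketch below illustrates, in plain Python, how a calendar date could be turned into weekday dummies (with one reference level left out) and a day-of-year value of the kind used as predictors in Task 3. How softdrinks_res.xls actually stores these variables is not stated here, so the function names, the choice of Monday as reference level and the example dates are assumptions.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def weekday_dummies(d, reference="Mon"):
    """Return 0/1 indicators for every weekday except the reference level."""
    name = WEEKDAYS[d.weekday()]  # date.weekday() returns 0 for Monday
    return {w: int(w == name) for w in WEEKDAYS if w != reference}


def day_of_year(d):
    """Return the day number within the year, starting at 1 for 1 January."""
    return d.timetuple().tm_yday


saturday = date(2008, 1, 19)
print(weekday_dummies(saturday))  # only the "Sat" indicator equals 1
print(day_of_year(saturday))      # 19

monday = date(2008, 1, 14)
print(sum(weekday_dummies(monday).values()))  # 0: the reference level has all dummies at zero
```

Dropping one level avoids perfect collinearity with the intercept; the effect of each remaining dummy is then measured relative to the reference weekday.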
To hand in
Highlighted items with clear and well organized answers and insightful comments and
explanations.
Grading
The solution to each of the three tasks will be graded from 0 to 10, and a minimum of 12 points
is required to pass the course with grade D. Higher scores will yield higher grades. The
clarity and quality of motivations and comments play an important role in the grading.