Model Comparison - NCSU Statistics

Framingham data comparisons.
(1) In a recent model comparison lecture, we looked at the two-feature data set generated to show the
neural net's flexibility, and we compared 3 models. Here, we look at some real data, the famous
Framingham heart study data. Using stepwise logistic regression we already found that the important
predictors were age, smoke (yes, no), SBP32 (systolic blood pressure), and Cholest2 (cholesterol).
Get the data from your AAEM folder. Declare firstchd and smoker as binary. Set the 4 variables above
to inputs, firstchd to target, and everything else to rejected. Nothing to write up for part (1).
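(For orientation only: the assignment is done entirely by point-and-click in SAS Enterprise Miner, but the selected model corresponds roughly to the sketch below in open-source software. The file name framingham.csv and the column names firstchd, age, smoker, sbp, and cholesterol are hypothetical stand-ins, not the actual AAEM variable names.)

# Hedged sketch of the stepwise-selected logistic model; all names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

frmgham = pd.read_csv("framingham.csv")  # hypothetical file name
fit = smf.logit("firstchd ~ age + smoker + sbp + cholesterol", data=frmgham).fit()
print(fit.summary())  # coefficients for the four selected predictors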
(2) Use a search engine (e.g., Google) to research and write a brief, one-paragraph introduction to the
Framingham study as a start to your report. Also research and write a second paragraph describing
firstchd (first-stage coronary heart disease).
(3) Continue your report being sure to include these items:
Prepare a diagram with 75% training and 25% validation data and fit a regression, a neural net, and a
class probability tree. Be sure to mention what subtree assessment measure is used for this tree (i.e. for
getting probability estimates). Describe any other properties that you choose to change from their
defaults (your option) in any of these model nodes. Look at the output of the neural network node. Did it
converge? Did it use all 50 iterations? If not, how many did it use?
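(For students who want to check their SAS/EM results against open-source tools, a rough analogue of the 75%/25% split and the three models is sketched below. The data generated here are synthetic stand-ins, not the Framingham data, and the model settings are illustrative assumptions rather than the SAS/EM defaults.)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Framingham data)
rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([rng.normal(50, 10, n),    # age (stand-in)
                     rng.integers(0, 2, n),    # smoker (stand-in)
                     rng.normal(140, 20, n),   # systolic blood pressure (stand-in)
                     rng.normal(230, 40, n)])  # cholesterol (stand-in)
y = rng.integers(0, 2, n)                      # stand-in for the firstchd target

# 75% training / 25% validation, as in the SAS/EM data partition
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1)

# Three competing models: regression, neural net, and a tree used for
# class probabilities (settings here are illustrative, not SAS/EM defaults)
models = {
    "regression": LogisticRegression(max_iter=200),
    "neural net": MLPClassifier(hidden_layer_sizes=(3,), max_iter=50, random_state=1),
    "class probability tree": DecisionTreeClassifier(min_samples_leaf=50, random_state=1),
}
for name, model in models.items():
    model.fit(X_train, y_train)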
Connect these 3 models to a model comparison node. Under data selection and model selection,
leave the entries as "Default," which should already be their setting when you start. When you do this, SAS/EM
will list the measure used to select the model and the data set (training or validation) as the first of the
various available model comparison statistics. Which statistic was used and what are its values for the
three models? Note that the best of these models is listed first and a Y appears next to it (this means
that model will be used for scoring a future data set). For your 3 models, what are the areas under the
ROC curve (ROC index) and the lifts as reported in the model comparison output? Since the lift chart is a
curve, not a number, it must have been evaluated at some depth (i.e. percentile). Looking at the results,
report what that percentile is. For each of these other two assessment statistics, which model would
have been chosen?
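(Continuing the sketch above, the ROC index is simply the area under the ROC curve on the validation data, and lift at a given depth is the event rate among the top-scored fraction of cases divided by the overall event rate. The 10% depth below is an arbitrary illustration, not necessarily the depth SAS/EM reports; finding that depth is part of the question.)

import numpy as np
from sklearn.metrics import roc_auc_score

def lift_at_depth(y_true, p_hat, depth):
    """Event rate in the top `depth` fraction of cases, ranked by predicted
    probability, divided by the overall event rate."""
    y_true = np.asarray(y_true)
    k = max(1, int(round(depth * len(y_true))))
    top = y_true[np.argsort(p_hat)[::-1][:k]]
    return top.mean() / y_true.mean()

for name, model in models.items():
    p = model.predict_proba(X_valid)[:, 1]  # predicted probability of the event
    print(name,
          "ROC index:", round(roc_auc_score(y_valid, p), 3),
          "lift at 10% depth:", round(lift_at_depth(y_valid, p, 0.10), 2))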
(4) (perhaps an appendix to your report) Describe what happens if you change to a decision tree instead
of a class probability tree by adjusting the model selection statistic property. Include a discussion of the
splits that result. What will be our misclassification rate if we simply say that nobody will get first-stage
coronary heart disease? This may explain what you observed.
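(Hint: if a fraction p of the subjects in the data actually experienced first-stage CHD, then the rule "nobody gets CHD" misclassifies exactly those subjects, so its misclassification rate equals the event rate p. A minimal sketch, assuming the target y is coded 0/1 with 1 = event:)

import numpy as np

def baseline_misclassification(y):
    """Misclassification rate of always predicting 'no event' for a 0/1 target:
    it equals the proportion of 1's (the event rate) in y."""
    return np.asarray(y).mean()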
Optional (not graded): Turning to the neural net, it appears from the iteration plot that the error
function on the validation data set gets worse from the beginning. If, however, you turn off the default
“preliminary training” as we did in the class demo, which iteration gives the best validation error
function?
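(If you read the validation error off the iteration plot, or export it from the results, the best iteration is simply the one that minimizes it. The values below are made-up placeholders, not results from the Framingham run.)

import numpy as np

valid_error = np.array([0.520, 0.495, 0.471, 0.468, 0.473, 0.480])  # placeholder values
best_iteration = int(np.argmin(valid_error)) + 1  # 1-based iteration number
print("best validation iteration:", best_iteration)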