Methodology
Pairwise scatterplots were used to identify unusual observations. After a careful
examination of the data, xx values were removed because their data were clearly not
consistent with the bulk of the data. We suspect that transcription or measurement errors
occurred for these animals.
Logistic regression was used to develop a linear prediction equation to separate
plains and wood bison. In logistic regression, a linear combination of variables is
developed that predicts the probability of a bison being a plains bison:
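$$\operatorname{logit}(p_{\text{plains}}) = b_0 + b_{AHA}\,\mathrm{AHA} + b_{TRI\_1}\,\mathrm{TRI\_1} + \cdots$$

where

$$\operatorname{logit}(p) = \log\left(\frac{p}{1-p}\right) \quad\text{and}\quad p = \frac{1}{1 + e^{-\operatorname{logit}(p)}}$$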
[For example, logit(.5)=0, logit(.1)=-2.20, etc. The logit() function maps
probabilities in the range 0-1 onto the range -infinity to +infinity, so any value of
the linear combination converts back to a predicted probability between 0 and 1].
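The two transformations can be checked numerically. The following is a minimal Python sketch; the function names `logit` and `inv_logit` are illustrative choices, not part of the original analysis.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Map a value on the logit scale back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The worked values quoted in the text
half = logit(0.5)    # log(1) = 0
tenth = logit(0.1)   # log(1/9), about -2.197

# Round trip: a probability survives logit followed by inv_logit
round_trip = inv_logit(logit(0.3))
```

Any real number passed to `inv_logit` returns a value strictly between 0 and 1, which is why the linear combination can be left unrestricted.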
Stepwise selection was used to identify important predictor variables for each
sex/age class combination. Stepwise selection proceeds through a series of steps.
At each step, the variables under consideration are divided into those
currently in the prediction model and those outside of the prediction model. First, the
variables currently outside the prediction model are searched to find the most important
new predictor; if it significantly improves the model, it is added to the prediction
equation. Second, the variables now in the prediction equation are searched to find the
least important variable; if it does not significantly improve the model, it is
removed from the prediction model. These steps are repeated until the model cannot be
improved by the addition of new predictors and there are no longer any variables in the
prediction equation that can be removed.
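The add/remove loop described above can be sketched in Python. This is not the SAS procedure used for the analysis. It assumes a likelihood-ratio chi-square test at the 5% level (critical value 3.841 on 1 df) for both entry and removal, fits the logistic model by plain gradient ascent, and runs on made-up data in which the hypothetical variable `x1` carries signal and the hypothetical variable `x2` does not.

```python
import math

CHI2_1DF_05 = 3.841  # assumed entry/removal threshold: 5% point of chi-square, 1 df

def fit_deviance(rows, y, cols, steps=2000, lr=0.5):
    """Fit a logistic model on the chosen columns by gradient ascent
    and return its deviance (-2 * log-likelihood)."""
    beta = [0.0] * (len(cols) + 1)  # intercept first
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, yi in zip(rows, y):
            xs = [1.0] + [row[c] for c in cols]
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xs))))
            for j, x in enumerate(xs):
                grad[j] += (yi - p) * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    dev = 0.0
    for row, yi in zip(rows, y):
        xs = [1.0] + [row[c] for c in cols]
        p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xs))))
        dev -= 2.0 * (yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return dev

def stepwise(rows, y, candidates, threshold=CHI2_1DF_05):
    """Alternate an entry step and a removal step until neither changes the model."""
    selected = []
    while True:
        changed = False
        # Entry step: add the outside variable that lowers the deviance most,
        # provided the drop exceeds the threshold.
        current = fit_deviance(rows, y, selected)
        outside = [c for c in candidates if c not in selected]
        gains = {c: current - fit_deviance(rows, y, selected + [c]) for c in outside}
        if gains:
            best = max(gains, key=gains.get)
            if gains[best] > threshold:
                selected.append(best)
                changed = True
        # Removal step: drop the included variable whose removal costs least,
        # provided that cost is below the threshold.
        if selected:
            current = fit_deviance(rows, y, selected)
            losses = {c: fit_deviance(rows, y, [s for s in selected if s != c]) - current
                      for c in selected}
            worst = min(losses, key=losses.get)
            if losses[worst] < threshold:
                selected.remove(worst)
                changed = True
        if not changed:
            return selected

# Made-up data: x1 separates the two groups with some overlap, x2 is balanced noise.
x1_vals = [-2.0, -1.5, -1.0, 0.5, -0.5, 1.0, 1.5, 2.0]
y_vals = [0, 0, 0, 0, 1, 1, 1, 1]
rows, y = [], []
for x1, yi in zip(x1_vals, y_vals):
    for x2 in (0.0, 1.0):
        rows.append({"x1": x1, "x2": x2})
        y.append(yi)

chosen = stepwise(rows, y, ["x1", "x2"])
```

Using the same threshold for entry and removal means a variable that has just been added cannot be removed in the same pass, which keeps the loop from cycling.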
However, there are some drawbacks to stepwise selection. The selection is typically
unstable, i.e. sensitive to small perturbations in the data: addition or deletion of a small
number of observations can change the chosen model markedly (see Breiman (1995)
and Steyerberg et al. (2000)). Reported standard errors of regression coefficients are
biased low, confidence intervals are too narrow, p-values are too small, and R² or
analogous measures are inflated. Several recommendations have been made to improve
stepwise methods. Shtatland et al. (2003) developed a three-step procedure, which
combines conventional stepwise logistic regression, information criteria, and
finally best subsets regression to avoid some of the problems with stepwise selection.
If the same set of data is used to build the prediction model and then reused to
estimate the probability of correct predictions, the resulting estimates of
performance are optimistic (i.e. the probability of correct classification is overstated). An improved
estimate of performance is obtained by cross-validation. In cross-validation, each
observation in turn is “dropped” from the data, the final model is refit to the remaining
observations, and the refitted model is then used to
classify the dropped observation. Consequently, no observation is used both to fit
the model and to assess its own prediction. (reference here)
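A minimal Python sketch of this leave-one-out procedure follows. It assumes a single hypothetical predictor, a logistic model fitted by plain gradient ascent rather than SAS, and made-up data. The function names are illustrative only.

```python
import math

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    """Fit p = 1 / (1 + exp(-(a + b*x))) by gradient ascent on the log-likelihood."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p
            gb += (y - p) * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

def predict(model, x):
    """Classify as group 1 when the fitted probability exceeds 0.5."""
    a, b = model
    return 1 if a + b * x > 0 else 0

def resubstitution_rate(xs, ys):
    """Fraction classified correctly when the same data fit and test the model."""
    model = fit_logistic(xs, ys)
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(xs)

def leave_one_out_rate(xs, ys):
    """Fraction classified correctly when each observation is held out in turn."""
    correct = 0
    for i in range(len(xs)):
        model = fit_logistic(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        correct += predict(model, xs[i]) == ys[i]
    return correct / len(xs)

# Made-up measurements with overlap between the two groups
xs = [-2.0, -1.5, -1.0, 0.25, 0.75, -0.75, -0.25, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

resub = resubstitution_rate(xs, ys)
loo = leave_one_out_rate(xs, ys)
```

Only the coefficients are refitted in each round. The set of variables in the model stays fixed, which matches the description above of refitting the final model.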
The fit of the logistic models was assessed by examining the usual residual plots,
and the influence of each observation on the fitting procedure was examined using
influence statistics (need reference here).
These procedures were used to build suitable prediction models using SAS
(Version 9.2).
Results:
Here I refer to various pages in the anal.pdf file; until we settle on which
datasets/age groups to use, there isn’t much point in going into too much detail.
page 1. Some of the raw data that I read in
page 2. Summary of the number of animals by sex/age and group. Does this seem
correct?
pages 3 and 4. Scatterplots of the important variables (used in your logistic regression
model). The (i,j) plot has variable (i) on the vertical axis and variable (j) on the
horizontal axis. The (i,j) plot is exactly the same as the (j,i) plot except the axes are
reversed. A few outliers are apparent (e.g. high vnm_tbh ratios).
pages 5 and 6. List of outlier points removed, with the “reason” variable giving why each
was removed.
pages 7 and 8. Scatter plots again with outliers removed.
pages 9 to 59. Ignore as this is just an output dump from logistic regression.
page 60. Here is a summary of the models selected for each age/sex class. Age/sex B* is
males B4, B5, and B6 combined. In all of the male datasets, HPA is included in the
model, and for B*, B4, and B5, the VNM/TBH ratio is important.
Notice that for datasets B6 and C3, the model fitting procedure needs to be reviewed as it
ran into what is known as “complete separation” (perfect prediction). Ironically, perfect
prediction is problematic because it leads to parameter estimates that tend to wander off
to +/- infinity and are very unstable (there are then many combinations of parameter
estimates that lead to perfect predictions and it is hard to choose which is best). The
reported model here for B6 and C3 should not be trusted.
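The drift toward infinity can be seen in a minimal Python sketch. It assumes made-up data in which one hypothetical measurement separates the two groups perfectly at zero. The log-likelihood then keeps rising as the slope grows, so no finite estimate is best.

```python
import math

def log_likelihood(slope, xs, ys):
    """Log-likelihood of a logistic model with intercept 0 and the given slope."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = slope * x
        # log(p) when y == 1 and log(1 - p) when y == 0, in a numerically stable form
        signed = eta if y == 1 else -eta
        total -= math.log1p(math.exp(-signed))
    return total

# Made-up, perfectly separated data: every negative x is group 0, every positive x is group 1
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

slopes = [1, 5, 25, 125]
values = [log_likelihood(s, xs, ys) for s in slopes]
```

Each entry of `values` is closer to zero than the one before it, yet none reaches zero. A fitting routine following this surface keeps increasing the slope until it stops on a convergence or iteration limit, which is why the reported coefficients become huge and unstable.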
page 61. This is a table of the performance of the models, i.e. what fraction of each
species/group/age is correctly predicted. A value of .90 indicates a 90% correct
prediction. Prediction rates are high except for C3 and B6, but this is just an artefact of the
problem in building the models mentioned earlier.
pages 62-76. Simple plots of three variables showing the separation by species.
I then used the logistic model as outlined in the Bison MAP spreadsheets. The same data
were used to fit a model with the parameters listed on the MAP spreadsheets to the various
sex/age combinations as before.
Again, ignore everything until page 97.
Here again are the prediction equations for each combination of sex/age. These don’t
seem to match the prediction equation in your MAP Excel spreadsheet, so I don’t know
which data was used to develop the equation. As I mentioned in my last email, there is
some concern about the huge coefficients in your prediction equation and these may be
the result of “complete separation”.
Finally, on page 98 is a summary of how well the prediction equations work. Again, good
separation with some potential problems with 100% predictive ability.
Carl.