Methodology

Pairwise scatterplots were used to identify unusual observations. After a careful examination of the data, xx values were removed because their data were clearly not consistent with the bulk of the data. We suspect that transcription or measurement errors occurred for these animals.

Logistic regression was used to develop a linear prediction equation to separate plains and wood bison. In logistic regression, a linear combination of variables is developed that predicts the probability of a bison being a plains bison:

    logit(p_plains) = b0 + b_AHA * AHA + b_TRI_1 * TRI_1 + ...

where

    logit(p) = log(p / (1 - p))   and   p = 1 / (1 + e^(-logit(p)))

[For example, logit(0.5) = 0, logit(0.1) = -2.20, etc. The logit() function is used to keep predicted probabilities in the range of 0-1 by mapping them onto the range of -infinity to +infinity.]

Stepwise selection was used to identify important predictor variables for each sex/age class combination. Stepwise selection proceeds as a series of steps. At each step, the variables under consideration are divided into those currently in the prediction model and those outside of the prediction model. First, the variables currently outside the prediction model are searched to find the most important new predictor that significantly improves the model; that variable is added to the prediction equation. Second, the variables now in the prediction equation are searched to find the least important variable, i.e. one that does not significantly improve the model; that variable is removed from the prediction model. These steps are repeated until the model cannot be improved by the addition of new predictors and there are no longer any variables in the prediction equation that can be removed. However, there are some drawbacks to stepwise selection. The selection is typically unstable and sensitive to small perturbations in the data.
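The logit transform and its inverse described above can be sketched in Python (a quick numerical illustration only; the actual analysis was done in SAS):

```python
import math

def logit(p):
    """Map a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Map a linear predictor back into the range 0-1."""
    return 1 / (1 + math.exp(-x))

print(round(logit(0.5), 3))              # 0.0
print(round(logit(0.1), 3))              # -2.197
print(round(inv_logit(logit(0.3)), 3))   # 0.3
```

Because inv_logit is the inverse of logit, any linear predictor, however large, maps back to a valid probability between 0 and 1.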
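The alternating add/remove procedure described above can be sketched generically. Here `score` is a hypothetical stand-in for whatever criterion the fitting software minimizes (e.g. an information criterion from a refitted logistic regression); lower is better. This illustrates the algorithm only, not the SAS implementation:

```python
def stepwise(candidates, score, tol=1e-6):
    """Alternate forward inclusion and backward elimination until stable.

    candidates: list of variable names under consideration.
    score: callable taking a list of variables, returning a model
           criterion to minimize (assumed interface, for illustration).
    """
    selected = []
    improved = True
    while improved:
        improved = False
        # Forward step: add the best new predictor if it improves the score.
        outside = [v for v in candidates if v not in selected]
        if outside:
            best = min(outside, key=lambda v: score(selected + [v]))
            if score(selected + [best]) < score(selected) - tol:
                selected.append(best)
                improved = True
        # Backward step: drop a predictor if the model is better without it.
        if selected:
            worst = min(selected,
                        key=lambda v: score([u for u in selected if u != v]))
            if score([u for u in selected if u != worst]) < score(selected) - tol:
                selected.remove(worst)
                improved = True
    return selected

# Toy criterion: each variable costs 1, but two (hypothetical) variables
# genuinely reduce the score; stepwise should keep exactly those two.
toy_score = lambda vs: len(vs) - 10 * ("AHA" in vs) - 5 * ("HPA" in vs)
print(stepwise(["AHA", "HPA", "TRI_1", "VNM_TBH"], toy_score))
```

The instability noted above corresponds to the fact that small changes in `score` (i.e. in the data) can flip which variable wins a forward or backward step, changing the final model markedly.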
Addition or deletion of a small number of observations can change the chosen model markedly (see Breiman (1995) and Steyerberg et al. (2000)). Reported standard errors of regression coefficients are biased low, confidence intervals are too narrow, p-values are too small, and R2 or analogous measures are inflated. Several recommendations have been made to improve stepwise methods. Shtatland et al. (2003) developed a three-step procedure, which incorporates conventional stepwise logistic regression, information criteria, and finally best-subsets regression to avoid some of the problems with stepwise selection.

If the same set of data is used to build the prediction model and then reused to estimate the probability of correct predictions, this leads to optimistic estimates of performance (i.e. the probability of correct classification is overstated). An improved estimate of performance is obtained by cross-validation. In cross-validation, each observation in turn is "dropped" from the data, the final model is refit, and the refitted model is then used to classify the dropped observation. Consequently, no observation is used for both model fitting and model assessment. (reference here)

The fit of the logistic models was assessed by examining the usual residual plots, and the influence of each observation on the fitting procedure was examined using influence statistics (need reference here). These procedures were used to build suitable prediction models using SAS (Version 9.2).

Results

Here I will make reference to various pages in the anal.pdf file; until we settle on which datasets/age groups to use, there isn't much point in going into too much detail.

page 1. Some of the raw data that I read in.

page 2. Summary of the number of animals by sex/age and group. Does this seem correct?

pages 3 and 4. Scatterplots of the important variables (those used in your logistic regression model). The (i,j) plot has variable i on the vertical axis and variable j on the horizontal axis.
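The leave-one-out scheme described above can be sketched as follows. Here `fit` is a hypothetical stand-in for the model-fitting step (it trains on the remaining rows and returns a prediction function); this is an illustration of the idea, not the SAS procedure used in the analysis:

```python
def loo_accuracy(data, fit):
    """Leave-one-out cross-validated classification rate.

    data: list of (features, label) pairs.
    fit: callable taking a training list and returning a
         classifier function (assumed interface, for illustration).
    """
    correct = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]   # drop observation i
        model = fit(train)                # refit without it
        if model(x) == y:                 # classify the held-out point
            correct += 1
    return correct / len(data)

# Toy check with a majority-class "classifier" (hypothetical data).
def fit_majority(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [((0,), "plains")] * 3 + [((1,), "wood")]
print(loo_accuracy(data, fit_majority))   # 0.75
```

Note that the held-out observation never influences the model that classifies it, which is what removes the optimism of resubstitution estimates.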
The (i,j) plot is exactly the same as the (j,i) plot except that the axes are reversed. A few outliers are apparent (e.g. high vnm_tbh ratios).

pages 5 and 6. List of outlier points removed, with the "reason" variable indicating why each was removed.

pages 7 and 8. Scatterplots again with outliers removed.

pages 9 to 59. Ignore, as this is just an output dump from the logistic regressions.

page 60. Here is a summary of the models selected for each age/sex. Age/sex B* is males B4, B5, and B6 combined. In all of the male datasets, HPA is included in the model, and for B*, B4, and B5 the VNM/TBH ratio is important. Notice that for datasets B6 and C3 the model-fitting procedure needs to be reviewed, as it ran into what is known as "complete separation" (perfect prediction). Ironically, perfect prediction is problematic because it leads to parameter estimates that tend to wander off to +/- infinity and are very unstable (there are then many combinations of parameter estimates that lead to perfect predictions, and it is hard to choose which is best). The reported models here for B6 and C3 should not be trusted.

page 61. This is a table of the performance of the models, i.e. what fraction of each species/group/age is correctly predicted. A value of .90 indicates a 90% correct prediction rate. Prediction rates are high except for C3 and B6, but this is just an artifact of the problem in building the models mentioned earlier.

pages 62-76. Simple plots of three variables showing the separation by species.

I then used the logistic model as outlined in the Bison MAP spreadsheets. The same data were used to fit a model with the parameters listed on the MAP spreadsheets to the various sex/age combinations as before. Again, ignore everything until page 97. Here again are the prediction equations for each combination of sex/age. These don't seem to match the prediction equation in your MAP Excel spreadsheet, so I don't know which data were used to develop the equation.
As I mentioned in my last email, there is some concern about the huge coefficients in your prediction equation; these may be the result of "complete separation". Finally, on page 98 is a summary of how well the prediction equations work. Again, good separation, with some potential problems with the 100% predictive ability.

Carl.