Boosted Regression Trees
A method to explore biology-environment relationships
Sophie Mormede, Matt Pinkerton
National Institute of Water and Atmospheric Research, Wellington, NZ
May 2010

Two main uses of BRT
• To investigate the ecological dependence of a species on the environment
• To determine "habitat preference" in order to extrapolate patchy biological data to a larger domain

An example
• WHAT: Predict toothfish and bycatch species distributions over the Ross Sea (88.1 & 882A–B)
• WHY:
  – layers for bioregionalisation
  – input to systematic conservation planning
  – to investigate overlap of TOA (Antarctic toothfish) and prey species
  – to consider potential changes in species distribution under climate change scenarios
  – to help in estimating biomass from the small number of research trawls (WGR)
• HOW: GLM / GAM (not very satisfactory), BRT, General Dissimilarity Modelling (GDM), …

Project outcomes so far
• Predictions and their confidence intervals seem to make sense
• Quality of depth data is critical (use gebco08, modified with fishing depth)
• Still need to validate models on a different area (882E?, Kerguelen?)

BRT – what is it all about then?
• Regression Tree:
  – Recursive binary splits
  – Stopping criterion
  – Allows interactions natively if wanted (tree complexity)
• Boosting = forward stagewise model fitting:
  – Fit a truncated tree (1-10 splits)
  – Compute the fitted values and residuals
  – Fit a new tree to the residuals and add it to the model, repeating many times (number of trees > 1000)

More about BRT
• Boosting with stochasticity:
  – At each step a random proportion of the dataset (the bag fraction) is selected for fitting, which improves model performance
• Cross validation (CV):
  – To avoid overfitting, test the model on withheld parts of the data
  – Also estimates the degree of overfitting
• You can bootstrap BRTs (I used 1000 bootstraps)

Pros of BRT
• Copes with NAs
• Copes with non-normally distributed environmental variables (no transforms)
• Copes with outliers
• Allows multiple levels of interactions
• Unlikely to overfit as much as GLM, and quantifies the overfitting
• 20-30% improvement of fits compared with GLM / GAM
• Runs in R

Cons of BRT
• Cons of BRT
  – Does not give smooth / monotonic responses
  – Still some overfitting: need to be careful
  – Slow when using bootstrapping
• Cons of any prediction method
  – Only as good as the environmental layers
  – Predict only in the domain we have data for (need to mask other areas)

BRT process
• Optimise BRT setup (which variables, how many interactions, based on deviance)
• Run full models and bootstraps
• Run reduced models with only the variables that were significant
• Bootstrap predictions based on the reduced model, and calculate CI
• Plot

Back to the example: environmental variables we used
• Bathymetry (Gebco 2008, modified for fishing depth)
• Chlorophyll a, summer (remote sensing)
• Ice15 and ice85 (satellite data) – not used
• Rugosity (Gebco08)
• Near-bottom current speed, temperature and salinity (HIGEM circulation model)
• Use only variables that make biological sense!
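
The boosting steps listed under "BRT – what is it all about then?" and the bag fraction from "More about BRT" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the toothfish models (which ran in R): it assumes a single predictor, one-split trees (stumps), squared-error loss, and a learning rate (shrinkage) parameter that the slides do not mention. All function names and the toy data are made up.

```python
import random


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean residual
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_brt(xs, ys, n_trees=1000, learning_rate=0.05, bag_fraction=0.5, seed=0):
    """Forward stagewise fitting: each new stump is fitted to the current residuals."""
    rng = random.Random(seed)
    intercept = sum(ys) / len(ys)
    fitted = [intercept] * len(ys)
    stumps = []
    n_bag = min(len(xs), max(2, int(bag_fraction * len(xs))))
    for _ in range(n_trees):
        residuals = [y - f for y, f in zip(ys, fitted)]
        # Stochastic step: fit to a random subset (the bag) drawn without replacement.
        bag = rng.sample(range(len(xs)), n_bag)
        stump = fit_stump([xs[i] for i in bag], [residuals[i] for i in bag])
        stumps.append(stump)
        # Add the new, shrunken tree to the fitted values of every observation.
        fitted = [
            f + learning_rate * predict_stump(stump, x) for f, x in zip(fitted, xs)
        ]
    return intercept, learning_rate, stumps


def predict_brt(model, x):
    intercept, learning_rate, stumps = model
    return intercept + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data (made up): a step response along one environmental gradient.
xs = [i / 10 for i in range(50)]
ys = [1.0 if x > 2.5 else 0.0 for x in xs]

model = fit_brt(xs, ys)
mean_y = sum(ys) / len(ys)
mse_baseline = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - predict_brt(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE {mse_baseline:.4f}, boosted MSE {mse_model:.6f}")
```

Here the error is measured on the training data only, so it says nothing about overfitting; in practice the number of trees would be chosen by cross validation on withheld data, as noted under "More about BRT".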
Predicted variables
• For each species, predict the proportion of hooks that caught a fish
  – Akin to binomial per hook
• Transform to normalise data
  – Y = arcsin [ sqrt (fish per hook) ]
• Predict with BRT using a Gaussian error distribution
• Also predict binomial for all but toothfish (only 5% null catch)
• Could also do fish per line

Example – TOA prediction, preliminary results

Other example – Oithona similis
Pinkerton et al. (2010)
[Figure: Oithona similis relative abundance (scale 10, 30, 100); panels: CPR database, BRT]
Oithona similis: the most abundant animal in the world?

Last example – species richness
Leathwick et al. (2006)

Other methods to consider

General Dissimilarity Modelling
• General Dissimilarity Modelling: multivariate response variable
• Pros
  – Predict communities based on environmental variables (multiple species analysed)
  – Classification is part of the process
• Cons
  – No bootstrapping
  – How many species??

Classification
• Classifications (clusters): separate areas based on layers (environment, biology etc.)
• Options
  – Use biology layers from BRT?
  – Use environmental layers too? (double-dipping?)
  – Use GDM directly for predictions and classifications?
• Number of classes…
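
As a small worked illustration of the transform used earlier, Y = arcsin [ sqrt (fish per hook) ], the sketch below applies it to a few catch rates and inverts it again. The back-transform p = sin(Y)^2 is not on the slides; it is included here as an assumption about how predictions would be returned to the proportion scale. The function names and the catch numbers are hypothetical.

```python
import math


def arcsine_sqrt(proportion):
    """Y = arcsin(sqrt(p)) for a proportion p in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie in [0, 1]")
    return math.asin(math.sqrt(proportion))


def back_transform(y):
    """Invert the transform: p = sin(y)^2, valid for y in [0, pi/2]."""
    return math.sin(y) ** 2


# Hypothetical longline sets: (fish caught, hooks set), made-up numbers.
sets = [(0, 1000), (50, 1000), (250, 1000)]
transformed = [arcsine_sqrt(fish / hooks) for fish, hooks in sets]
recovered = [back_transform(y) for y in transformed]

for (fish, hooks), y, p in zip(sets, transformed, recovered):
    print(f"{fish}/{hooks} hooks -> Y = {y:.4f} -> back-transformed p = {p:.4f}")
```

Model predictions on the transformed scale can fall outside [0, pi/2], so they would need clamping to that range before back-transforming.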