Alternative Forecasting Methods: Bootstrapping
Bryce Bucknell, Jim Burke, Ken Flores, Tim Metts

Agenda
- Scenario
- Obstacles
- Regression Model
- Bootstrapping
- Applications and Uses
- Results

Scenario
You have recently been hired as the statistician for the University of Notre Dame football team and tasked with performing a statistical analysis of the first year of the Charlie Weis era. Specifically, you have been asked to develop a regression model that explains the relationship between key statistical categories and the number of points scored by the offense. You have a limited number of data points, so you must also find a way to ensure that the regression results generated by the model are reliable and significant.

Problems/Obstacles
- Central Limit Theorem
- Replication of data
- Sampling
- Variance of error terms

Constrained by the Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̄ can be approximated by a normal probability distribution as the sample size becomes large. It is generally accepted that the sample size must be 30 or greater to satisfy the large-sample condition of the theorem.
[Figure: sampling distributions of the mean for samples of N = 1, 2, 3, and 4.]
1. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm

Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical procedures: the distribution of the phenomenon under study does NOT have to be normal, because the distribution of its sample average WILL tend toward normality.

Why is the assumption of a normal distribution important?
- A normal distribution allows the application of the empirical rule: roughly 68%, 95%, and 99.7% of values fall within one, two, and three standard deviations of the mean, respectively.
- Without normality, Chebyshev's theorem gives only weaker guarantees: no more than 1/4 of the values are more than 2 standard deviations away from the mean, no more than 1/9 are more than 3 standard deviations away, no more than 1/25 are more than 5 standard deviations away, and so on.
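The theorem is easy to see in a small simulation (an illustrative sketch, not part of the original slides): draw many samples of size 30 from a strongly skewed exponential population and look at the distribution of the sample means.

```python
import random
import statistics

random.seed(42)  # reproducible illustration

# Population: exponential with rate 0.1, so its true mean is 10.
# Individual draws are heavily right-skewed -- nothing like a normal.
LAMBDA = 0.1
N = 30          # sample size satisfying the n >= 30 rule of thumb
TRIALS = 5000   # number of independent samples drawn

sample_means = [
    statistics.mean(random.expovariate(LAMBDA) for _ in range(N))
    for _ in range(TRIALS)
]

# The sample means cluster symmetrically around the population mean (10),
# with spread close to the CLT prediction sigma/sqrt(n) = 10/sqrt(30).
print(f"mean of sample means:   {statistics.mean(sample_means):.2f}")
print(f"stdev of sample means:  {statistics.stdev(sample_means):.2f}")
```

A histogram of `sample_means` would look approximately bell-shaped even though the underlying population is not, which is exactly the leverage the theorem provides.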
The assumption of normally distributed data allows descriptive statistics to be used to explain the nature of the population.

Not Enough Data Available?
Monte Carlo simulation, a type of spreadsheet simulation, randomly generates values for uncertain variables over and over to simulate a model.
- Monte Carlo methods randomly select values to create scenarios.
- The random selection process is repeated many times to create multiple scenarios.
- Together, the scenarios give a range of possible solutions, some more probable and some less probable.
- As the process is repeated many times (10,000 or more), the average solution gives an approximate answer to the problem.
- Accuracy can be improved by increasing the number of scenarios selected.

Sampling without Replacement
- A simple random sample from a population is a sample chosen randomly, so that each possible sample has the same probability of being chosen.
- In small populations such sampling is typically done "without replacement": each member of the population is deliberately chosen no more than once.
- This process should be used when outcomes are mutually exclusive, e.g., poker hands.

Sampling with Replacement
- Our initial data set is not large enough for simple random sampling without replacement, but through Monte Carlo simulation we have been able to replicate the original population.
- Units are sampled from the population one at a time, with each unit being replaced before the next is sampled, so one outcome does not affect the others.
- This allows a greater number of potential outcomes than sampling without replacement.
- If observations were not replaced, there would not be enough independent observations to create a sample size of n ≥ 30.

Heteroscedasticity vs. Homoscedasticity
[Figure: residual plots against X illustrating constant vs. nonconstant variance.]
Homoscedasticity (constant variance):
- All random variables have the same finite variance.
- Simplifies mathematical and computational treatment.
- Leads to good estimation results in data mining and regression.
Heteroscedasticity (nonconstant variance):
- Random variables may have different variances.
- Standard errors of regression coefficients may be understated, so t-ratios may be larger than actual.
- More common with cross-sectional data.

Regression Model for ND Points Scored
ND Points = 38.54 + 0.079*b1 - 0.170*b2 - 0.662*b3 - 3.16*b4
where b1 = Total Yards Gained, b2 = Penalty Yards, b3 = Total Plays, b4 = Turnovers

Audit Trail -- Coefficient Table (Multiple Regression Selected)

Description   In Model    Coefficient   Std Error   T-test   F-test
ND Points     Dependent   38.54         14.26        2.70     7.31
Total YDS     Yes          0.08          0.02        5.29    27.97
Penalty YDS   Yes         -0.17          0.06       -2.64     6.99
Total Plays   Yes         -0.66          0.23       -2.84     8.05
Turnovers     Yes         -3.16          2.50       -1.26     1.59
Overall F-test: 8.92

4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
2. Are the slope terms statistically significant?
3. How well does the model fit the data?
4. Is there any serial correlation?

Check 1: Do the coefficients have the correct sign?
More yards should mean more points, while penalties and turnovers should cost points; the signs of the coefficients (+0.08, -0.17, -0.66, -3.16) match this intuition. The negative coefficient on Total Plays is the least obvious: could this represent a big-play factor, where gaining the same yardage on fewer plays produces more points?

Check 2: Are the slope terms statistically significant?
The t-statistics for Total YDS (5.29), Penalty YDS (-2.64), and Total Plays (-2.84) all exceed 2 in absolute value; Turnovers (-1.26) is not statistically significant.
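The fitted equation, ND Points = 38.54 + 0.079*TotalYds - 0.170*PenaltyYds - 0.662*TotalPlays - 3.16*Turnovers, can be evaluated directly as a prediction function. The sketch below just plugs values into the reported coefficients; the game statistics in the example call are invented for illustration, not taken from an actual box score.

```python
def predict_nd_points(total_yards: float, penalty_yards: float,
                      total_plays: float, turnovers: float) -> float:
    """Evaluate the fitted regression from the slides:
    ND Points = 38.54 + 0.079*TotalYds - 0.170*PenaltyYds
                - 0.662*TotalPlays - 3.16*Turnovers
    """
    return (38.54
            + 0.079 * total_yards
            - 0.170 * penalty_yards
            - 0.662 * total_plays
            - 3.16 * turnovers)

# Hypothetical game: 450 yards gained, 40 penalty yards,
# 70 plays run, 1 turnover (values invented for illustration).
print(round(predict_nd_points(450, 40, 70, 1), 1))  # prints 17.8
```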
Check 3: How well does the model fit the data?
[Figure: ND Points, actual vs. fitted values.]
Adjusted R² = 74.22%

Check 4: Is there any serial correlation?
The data is cross-sectional, so serial correlation is not a concern.

With limited data points, however, how useful is this regression in describing how well the model fits the actual data? Is there a way to test its reliability?

How to Test the Significance of the Analysis
What happens when the sample size is not large enough (n ≥ 30)?
- Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
- Commonly used statistical significance tests determine the likelihood of a result given a random sample of size n.
- If the population is not random and does not allow a large enough sample to be drawn, the central limit theorem does not hold true, and thus the statistical significance of the results cannot be established.
- Bootstrapping uses replication of the original data to simulate a larger population, allowing many samples to be drawn and statistical tests to be calculated.

How It Works
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample. The bootstrap procedure is a means of estimating the statistical accuracy . . . from the data in a single sample.
- Bootstrapping is used to mimic the process of selecting many samples when the population is too small to do otherwise.
- The samples are generated from the data in the original sample by copying it many times (Monte Carlo simulation).
- Samples can then be selected at random, and descriptive statistics calculated or regressions run for each sample.
- The results generated from the bootstrap samples can be treated as if they were the result of actual sampling from the original population.

Characteristics of Bootstrapping
- Sampling with replacement
- Full sample

Bootstrapping Example
- Original data set: one observation per 2005 game, a limited number of observations (Pittsburgh, Michigan, Michigan State, Washington, Purdue, USC, BYU, Tennessee, Navy, Syracuse, Stanford, Ohio State).
- Random sampling with replacement can be employed to create multiple independent samples for analysis.
- Making 109 copies of each observation creates a much larger sample with which to work.
- A first random sample drawn with replacement will repeat some games (e.g., Washington twice, Ohio State and Stanford several times) while omitting others.

When It Should Be Used
- Bootstrapping is especially useful in situations where no analytic formula for the sampling distribution is available.
- Traditional forecasting methods, like exponential smoothing, work well when demand is constant: patterns are easily recognized by software.
- In contrast, when demand is irregular, patterns may be difficult to recognize.
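The replicate-then-resample procedure above can be sketched in a few lines. The per-game point totals below are invented for illustration (not Notre Dame's actual 2005 results). Note that making 109 copies of each observation and then sampling uniformly is statistically equivalent to resampling the original observations with replacement directly, which is what the sketch does.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

# A small original sample: hypothetical per-game point totals
# (values invented for illustration).
original = [42, 17, 36, 21, 49, 41, 24, 34, 38, 20, 41, 31]

def bootstrap_sample(data, n):
    """Draw n observations from `data` with replacement."""
    return [random.choice(data) for _ in range(n)]

# Draw many bootstrap samples of size n >= 30 (as the slides suggest,
# replacement lets the sample exceed the original 12 games) and
# record each sample's mean.
boot_means = [statistics.mean(bootstrap_sample(original, 30))
              for _ in range(1000)]

# The spread of the bootstrap means estimates the standard error of
# the sample mean -- no normality assumption about the data required.
print(f"original mean:        {statistics.mean(original):.2f}")
print(f"bootstrap SE of mean: {statistics.stdev(boot_means):.2f}")
```

The same loop works for any statistic: replace `statistics.mean` with a regression fit and collect a bootstrap distribution of adjusted R², which is exactly what the ND Points analysis later in the deck does.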
Therefore, when faced with irregular demand, bootstrapping may be used to provide more accurate forecasts, subject to some important assumptions.

Assumptions and Methodology
- Bootstrapping makes no distributional assumptions about the population: no normality of error terms, no equal variance. This allows for accurate forecasts of intermittent demand.
- If the sample is a good approximation of the population, the sampling distribution may be estimated by generating a large number of new samples.
- For small data sets, taking a small representative sample of the data and replicating it will yield superior results.

Applications and Uses: Criminology
- Statistical significance testing is important in criminology and criminal justice: the six most popular journals in the field are dominated by quantitative methods that rely on it.
- However, it poses two potential problems: tautology and violations of assumptions.
- Tautology: the null hypothesis is always false, because virtually any null hypothesis may be rejected at some sample size.
- Violations of regression assumptions: that errors are homogeneous and that errors of independent variables are normally distributed.
- Bootstrapping provides a user-friendly alternative to cross-validation and the jackknife for augmenting statistical significance testing.

Applications and Uses: Actuarial Practice
- The process of developing an actuarial model begins with the creation of probability distributions of input variables, generally asset-side cash flows (financial) or cash flows generated from the liabilities side (underwriting).
- Traditional actuarial methodologies are rooted in parametric approaches, which fit a prescribed distribution of losses to the data.
- However, experience from the last two decades has shown greater interdependence of loss variables with asset variables.
- Increased complexity has been accompanied by increased competitive
pressures and more frequent insolvencies.
- There is therefore a need to use nonparametric methods in modeling loss distributions; bootstrap standard errors and confidence intervals are used to derive the distribution.

Applications and Uses: Classifications Used by Ecologists
- Ecologists often use cluster analysis as a tool in the classification and mapping of entities such as communities or landscapes.
- However, the researcher has to choose an adequate group-partition level, and cluster analysis techniques will always reveal groups, whether or not they are meaningful.
- The bootstrap can be used to test statistically for fuzziness of the partitions in cluster analysis: partitions found in bootstrap samples are compared to the observed partition by the similarity of the sampling units that form the groups.

Applications and Uses: Human Nutrition
- Inverse regression was used to estimate the vitamin B-6 requirement of young women.
- Standard statistical methods were used to estimate the mean vitamin B-6 requirement.
- A bootstrap procedure was used as a further check on the estimate by examining the standard error estimates and confidence intervals.

Applications and Uses: Outsourcing
- Agilent Technologies determined it was time to transfer manufacturing of its 3070 in-circuit test systems from Colorado to Singapore.
- A major concern was the change in environmental test conditions (dry vs. humid).
- Because Agilent tests to tighter factory limits ("guard banding"), it needed to adjust the guard band for the Singapore facility.
- The bootstrap was used to determine the appropriate guard band.

An Alternative to the Bootstrap: The Jackknife
- A statistical method for estimating and removing bias* and for deriving robust estimates of standard errors and confidence intervals.
- Created by systematically dropping out subsets of data one at a time and assessing the resulting variation.
- *Bias: a statistical sampling or testing error caused by systematically favoring some outcomes over others.

A Comparison of the Bootstrap and Jackknife
Bootstrap:
- Yields slightly different results when repeated on the same data (when estimating the standard error).
- Not bound to theoretical distributions.
Jackknife:
- A less general technique that explores sample variation differently.
- Yields the same result each time.
- Similar data requirements.

Another Alternative Method: Cross-Validation
The practice of partitioning a sample of data into sub-samples, such that the initial analysis is conducted on one sub-sample (the training data), while the other sub-samples (the test or validation data) are retained "blind" for subsequent use in confirming and validating the initial analysis.

Bootstrap vs. Cross-Validation
Bootstrap:
- Requires only a small amount of data.
- A more complex, time-consuming technique.
Cross-Validation:
- Not a resampling technique.
- Requires large amounts of data.
- Extremely useful in data mining and artificial intelligence.

Methodology for the ND Points Model
- Use bootstrapping on the ND points-scored regression model. Goal: determine the reliability of the model.
- Replication, random sampling, and numerous independent regressions.
- Calculation of a confidence interval for adjusted R².

Bootstrapping Results: Adjusted R² Data

Sample #   Adjusted R²     Sample #   Adjusted R²
1          0.7351          13         0.7482
2          0.7545          14         0.8719
3          0.7438          15         0.7391
4          0.7968          16         0.9025
5          0.5164          17         0.8634
6          0.6449          18         0.7927
7          0.9951          19         0.6797
8          0.9253          20         0.6765
9          0.8144          21         0.8226
10         0.7631          22         0.9902
11         0.8257          23         0.8812
12         0.9099          24         0.9169

The mean, standard deviation, and 95% and 99% confidence intervals are then calculated in Excel from the 24 observations:

Mean:  0.8046
StDev: 0.1131
95% CI: ±0.0453, or 75.93% - 84.98%
99% CI: ±0.0595, or 74.51% - 86.41%

So what does this mean for the results of the regression? Can we rely on this model to help predict the number of points per game that will be scored by the 2006 team?

Questions?
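The summary statistics and confidence intervals reported above can be reproduced from the 24 adjusted R² values with a short script. This is a sketch assuming the Excel calculation used normal critical values (z = 1.96 for 95%, z = 2.576 for 99%), which matches the reported figures.

```python
import statistics
from math import sqrt

# The 24 adjusted R^2 values from the bootstrap regressions above.
adj_r2 = [0.7351, 0.7545, 0.7438, 0.7968, 0.5164, 0.6449,
          0.9951, 0.9253, 0.8144, 0.7631, 0.8257, 0.9099,
          0.7482, 0.8719, 0.7391, 0.9025, 0.8634, 0.7927,
          0.6797, 0.6765, 0.8226, 0.9902, 0.8812, 0.9169]

n = len(adj_r2)
mean = statistics.mean(adj_r2)   # ~0.8046, as reported
sd = statistics.stdev(adj_r2)    # ~0.1131 (sample standard deviation)

# Half-widths of the confidence intervals for the mean adjusted R^2,
# using normal critical values.
half_95 = 1.96 * sd / sqrt(n)    # ~0.0453
half_99 = 2.576 * sd / sqrt(n)   # ~0.0595

print(f"mean = {mean:.4f}, sd = {sd:.4f}")
print(f"95% CI: {mean - half_95:.4f} to {mean + half_95:.4f}")
print(f"99% CI: {mean - half_99:.4f} to {mean + half_99:.4f}")
```

With only 24 bootstrap samples, a t critical value (about 2.07 for 95% with 23 degrees of freedom) would give a slightly wider interval; the normal approximation is what reproduces the slide's numbers.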