Statistical Weather Forecasting
Independent study, Daria Kluver. From "Statistical Methods in the Atmospheric Sciences" by Daniel Wilks.

Outline
- Review of least-squares regression: simple linear regression, distribution of the residuals, the analysis-of-variance table, goodness-of-fit measures, sampling distributions of the regression coefficients, examining residuals, prediction intervals
- Multiple linear regression; derived predictor variables in multiple regression
- Objective forecasts without NWP: stratification and compositing; when the predictand is a probability
- Predictor selection: screening regression, stopping rules, cross-validation
- Analog forecasting

Background
Statistical methods are necessary because the atmosphere is a nonlinear dynamical system and is not perfectly predictable in a deterministic sense. Classical statistical forecast methods do not operate with information from fluid-dynamical NWP models, yet they are still useful at very short lead times (hours in advance) and at very long lead times (weeks or more).

Review of Least-Squares Regression
Least-squares regression describes the linear relationship between two variables. Usually more than one predictor (independent) variable is needed in practical forecasts, but the ideas of multiple linear regression are covered here in the simple case.

Simple Linear Regression
Basically, simple linear regression seeks to summarize the relationship between two variables by a straight line. The regression procedure chooses the line that produces the least error for predictions of y based on x. Why least squares? Line fitting is fairly tolerant of small discrepancies between line and points, but it is not resistant to outliers.

Example
[Figure: total seasonal snowfall (cm) by snow season, 1899-1999, for a station at 46°N 121°W, 15.0 miles (24.1 km) NW of Goldendale, Klickitat, WA, USA, approximate altitude 1089 m (3572 ft), with the linear regression as calculated by SPSS: ŷ = 40.88x − 76956.87.]

The regression line is y = a + bx, and the errors, or residuals, are

  eᵢ = yᵢ − ŷ(xᵢ)

Combining the line and the residuals gives the regression equation

  yᵢ = ŷ(xᵢ) + eᵢ = a + bxᵢ + eᵢ

meaning that the true value of the predictand is the sum of the predicted value and the residual. To minimize the sum of squared residuals,

  Σeᵢ² = Σ[yᵢ − ŷ(xᵢ)]² = Σ(yᵢ − [a + bxᵢ])²

set the derivatives with respect to the parameters a and b to zero and solve. Rearranging gives the normal equations:

  Σyᵢ = na + bΣxᵢ
  Σxᵢyᵢ = aΣxᵢ + bΣxᵢ²

Solving the normal equations for the regression parameters:

  b = [nΣxᵢyᵢ − (Σxᵢ)(Σyᵢ)] / [nΣxᵢ² − (Σxᵢ)²]
  a = ȳ − bx̄

Distribution of the Residuals
It is conventional to assume that the eᵢ are independent random variables with zero mean and constant variance; it is also sometimes assumed that these residuals follow a Gaussian distribution. The constant-variance assumption really means that the variance of the residuals is constant in x, so a given residual is equally likely to occur at any part of the regression line. (Fig 6.2: the conditional distributions are the same, but their means shift with the level of the regression line.) To make statistical inferences in the regression setting, this constant residual variance must be estimated from the sample of residuals. [Figure: residuals eᵢ plotted against snow season, 1880-2020.]

The sample average of the squared residuals is

  s²ₑ = [1/(n−2)] Σeᵢ² = 1847651.6

and, since eᵢ = yᵢ − ŷ(xᵢ),

  s²ₑ = [1/(n−2)] Σ[yᵢ − ŷ(xᵢ)]²

is used to compute the estimated residual variance.
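The normal-equation solution above can be sketched directly in a few lines of Python; a minimal example, with made-up data lying exactly on y = 2x + 1:

```python
def fit_simple_linear(x, y):
    """Least-squares slope b and intercept a via the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = sy / n - b * sx / n  # a = ybar - b * xbar
    return a, b

# toy data on the line y = 2x + 1
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
a, b = fit_simple_linear(x, y)
```

With noise-free data the fitted parameters recover the underlying line exactly; with real data they minimize the sum of squared residuals.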
Commonly this is written as SST = SSR + SSE.

SST, the total sum of squares:

  SST = Σ(yᵢ − ȳ)² = Σyᵢ² − nȳ²

is proportional (by a factor of n−1) to the sample variance of y, and measures the overall variability of the predictand.

SSR, the regression sum of squares:

  SSR = Σ[ŷ(xᵢ) − ȳ]²

is the sum of squared differences between the regression predictions and the sample mean of y. It relates to the regression equation by

  SSR = b²Σ(xᵢ − x̄)² = b²[Σxᵢ² − nx̄²]

SSE, the sum of squared errors, is the sum of squared differences between the residuals and their mean (zero):

  SSE = Σeᵢ²

So

  s²ₑ = [1/(n−2)](SST − SSR) = [1/(n−2)]{Σyᵢ² − nȳ² − b²[Σxᵢ² − nx̄²]}

which is the computational form.

Analysis-of-Variance Table
Output from a regression analysis is often given in an ANOVA table:

  Source      df      SS    MS
  Total       n−1     SST
  Regression  k       SSR   MSR = SSR/k
  Residual    n−k−1   SSE   MSE = SSE/(n−k−1)

with F = MSR/MSE. The SPSS ANOVA table for the snowfall example:

  Model 1     Sum of Squares   df   Mean Square     F        Sig.
  Regression  139304361.091     1   139304361.091   75.405   .000(a)
  Residual    181046193.251    98   1847410.135
  Total       320350554.342    99

Goodness-of-Fit Measures
The ANOVA table allows computation of three related measures of the "fit" of a regression, where fit means the correspondence between the regression line and a scatterplot of the data.
- MSE is fundamental because it indicates the variability of the observed (predictand) y values around the forecast regression line, and so reflects the accuracy of the resulting forecasts. MSE = s²ₑ shows how the residuals cluster around the regression line.
- R², the coefficient of determination: R² = SSR/SST = 1 − SSE/SST is the portion of the variation of the predictand "described" by the regression, and √R² equals the absolute value of the Pearson correlation.
- The F ratio = MSR/MSE: a strong x-y relationship increases MSR and decreases MSE. In multiple regression, problems of multiplicity invalidate it as a formal test.
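The SST = SSR + SSE decomposition and R² can be checked numerically; the data below are invented for illustration:

```python
# toy data: roughly linear with small noise (invented for illustration)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

# least-squares parameters from the normal equations
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = sy / n - b * sx / n

ybar = sy / n
yhat = [a + b * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - ybar) ** 2 for yh in yhat)            # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # sum of squared errors
r2 = ssr / sst                                        # coefficient of determination
```

For least-squares estimates the identity SST = SSR + SSE holds exactly (up to floating-point error), and with nearly linear data R² is close to 1.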
[SPSS Model Summary for the snowfall example: R = .659, R² = .435, adjusted R² = .429, standard error of the estimate = 1359.19466, Durbin-Watson = 1.333.]

Sampling Distributions of the Regression Coefficients
Estimating the sampling distributions of a and b allows you to construct confidence intervals around the parameter values obtained (and is also used for hypothesis testing about population values). Gaussian distributions are assumed. The equations for the standard errors of the intercept and slope show that the precision with which a and b can be estimated depends on the estimated standard deviation of the residuals, sₑ. Some packages output a "t ratio": the ratio of an estimated parameter to its standard error, so a one-sample t test is implied, with the null hypothesis that the underlying population value of the parameter is zero. The null hypothesis is rejected at the 5% level when the estimated slope is roughly twice as large as its standard error.

Examining Residuals
It is important to check that all of the assumptions you made hold true, and these pertain to the residuals. Plot the residuals as a function of the predicted value ŷ. [Figure: residuals versus ŷ (predicted total snowfall).]

Heteroscedasticity is non-constant residual variance. One solution is to apply a logarithmic transformation to y, which reduces larger values more strongly (or a power transformation such as y², which stretches values at the other end of the scale). Remember to recover the predictand afterward: y = e^(ln y) = e^(a + bx).

To determine graphically whether the residuals follow a Gaussian distribution, make a quantile-quantile (Q-Q) plot.

Also investigate the degree to which the residuals are mutually independent: plot the regression residuals as a function of time. If positive or negative residuals cluster together, time correlation is suspected. A formal test is the Durbin-Watson test:

  H₀: the residuals are serially independent
  H₁: the residuals are consistent with a first-order autoregressive process

  d = Σ(eᵢ − eᵢ₋₁)² / Σeᵢ²

If the residuals are positively correlated, d is small; if they are randomly distributed, d is large, and H₀ is not rejected. For this example d = 1.33, so reject the null hypothesis.
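The Durbin-Watson statistic is simple to compute directly from the residual series; a minimal sketch:

```python
def durbin_watson(residuals):
    """d = sum_i (e_i - e_{i-1})^2 / sum_i e_i^2.
    Small d suggests positive serial correlation; values near 2 are
    consistent with serially independent residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den
```

Slowly varying (positively correlated) residuals give small d, while sign-alternating residuals push d toward its upper limit of 4.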
Fig 6.7 shows critical values of d at the 5% level. A trend in the residuals may indicate that the nature of the relationship is changing with time, and that another predictor variable may be needed.

Prediction Intervals
To calculate confidence intervals around forecast values, the residuals must follow a Gaussian distribution. MSE = s²ₑ is the estimated residual variance, so the 95% confidence interval is approximately bounded by ŷ ± 2sₑ (Table B.1); this works well with a large sample size. The square root of s²ₑ is the standard deviation of the residuals, 1359.3, so the interval is ±2718.6. But the intercept and slope are themselves subject to sampling variations, so for a forecast of y using the predictor value x₀ the prediction variance is

  s²ŷ = s²ₑ[1 + 1/n + (x₀ − x̄)² / Σ(xᵢ − x̄)²]

valid for simple linear regression. [Figure: total seasonal snowfall with the regression line and upper and lower prediction bounds, 1899-1999.]

Multiple Linear Regression
The prediction equation (where k is the number of predictors) is

  ŷ = b₀ + b₁x₁ + b₂x₂ + … + b_k x_k

Several things are the same as in the simple case, but with vectors and matrix algebra. MSE = SSE/(n−k−1). R² is defined the same way, but it is NOT the square of the Pearson correlation coefficient between y and any of the x variables.

[SPSS Model Summary and ANOVA for two models. Model 1 predictors: (Constant), season; Model 2 predictors: (Constant), season, stations; dependent variable: total.

  Model  R        R²    Adj. R²  Std. Error   R² Change  F Change  df1  df2
  1      .659(a)  .435  .429     1359.19466   .435       75.405    1    98
  2      .714(b)  .510  .499     1272.73544   .075       14.767    1    97

  Model 1: Regression SS 139304361.091 (df 1, MS 139304361.091), Residual SS 181046193.251 (df 98, MS 1847410.135), Total SS 320350554.342 (df 99), F = 75.405
  Model 2: Regression SS 163224572.001 (df 2, MS 81612286.000), Residual SS 157125982.341 (df 97, MS 1619855.488), Total SS 320350554.342 (df 99), F = 50.382]
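The prediction-variance formula can be turned into a small helper; `prediction_stddev` is a hypothetical name and the inputs below are illustrative, not the snowfall data:

```python
import math

def prediction_stddev(x, se2, x0):
    """Standard deviation of the prediction error at predictor value x0:
    s_yhat^2 = se^2 * [1 + 1/n + (x0 - xbar)^2 / sum((xi - xbar)^2)]
    (valid for simple linear regression)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return math.sqrt(se2 * (1.0 + 1.0 / n + (x0 - xbar) ** 2 / sxx))
```

Note that the interval is narrowest at x₀ = x̄ and widens as the forecast point moves away from the center of the training data.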
[Sig. F Change: .000 for both models; Durbin-Watson = 1.554.]

[SPSS Coefficients (dependent variable: total):

  Model  Predictor   B           Std. Error  Beta  t       Sig.  95% CI for B                Tolerance  VIF
  1      (Constant)  -76956.869  9180.468          -8.383  .000  [-95175.210, -58738.528]
         season      40.888      4.709       .659  8.684   .000  [31.544, 50.232]            1.000      1.000
  2      (Constant)  -27394.476  15499.904         -1.767  .080  [-58157.494, 3368.542]
         season      14.915      8.070       .241  1.848   .068  [-1.102, 30.931]            .299       3.350
         stations    228.467     59.454      .500  3.843   .000  [110.468, 346.466]          .299       3.350]

Derived Predictor Variables in Multiple Regression
"Derived" predictors are mathematical transformations of the predictor variables. Examples: powers (x₂ = x₁²); trigonometric terms (sin, cos); and binary or "dummy" variables (1 or 0 based on a threshold).

Objective Forecasts Without NWP
Classical statistical forecasting is the construction of weather forecasts through purely statistical means, without information from NWP models; it is mostly based on multiple regression equations. It is objective in the sense that a particular set of inputs will always produce the same forecast once the equations are developed. However, several subjective decisions go into the development of the forecast equations. You need developmental data: past values of the predictand, and a matching collection of potential predictors whose values will be known prior to the forecast time. The regression equation is fit to this historical "training" data. Predictors that are not available in time for operational production of forecasts are of NO VALUE.

Stratification and Compositing
Physical and statistical relationships may change according to conditions (e.g., the seasons). One approach is to use a predictor that is a function of the date, but forecasts are sometimes better when the data are stratified according to time of year and separate forecast equations are produced for each month or season. Data can also be stratified by meteorological or climatological criteria, for example separate temperature forecasts for snow-covered versus snow-free ground, or long-range forecasts for El Niño versus non-El Niño conditions. You can also composite datasets to increase sample size.
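The derived-predictor transformations described above (powers, trigonometric terms for the annual cycle, and binary dummies) can be sketched as follows; the function name, the threshold, and the 365.25-day period are illustrative assumptions, not from the source:

```python
import math

def derived_predictors(x1, day_of_year, threshold=0.0):
    """Hypothetical derived predictors built from a raw predictor x1 and
    the day of year: a power term, annual harmonics, and a binary dummy."""
    return {
        "x1_squared": x1 ** 2,                                         # power transformation
        "sin_annual": math.sin(2 * math.pi * day_of_year / 365.25),    # annual cycle, sine
        "cos_annual": math.cos(2 * math.pi * day_of_year / 365.25),    # annual cycle, cosine
        "dummy": 1.0 if x1 > threshold else 0.0,                       # threshold indicator
    }

p = derived_predictors(3.0, 0.0)
```

Each derived variable then enters the multiple regression as just another column of the predictor matrix.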
Compositing can draw on data from nearby (climatologically similar) stations.

When the Predictand is a Probability
An advantage of statistical over (deterministic) dynamical forecasting is the capacity to produce probability forecasts: an explicit expression of the inherent uncertainty, which allows users to extract more value in decision making. To produce probability information about a predictand, first transform the predictand to a binary variable, then do the regression. Two regression approaches:
- Regression Estimation of Event Probabilities (REEP), the simplest: multiple linear regression is used to derive a forecast equation for the binary predictand. Forecast values between 0 and 1 are treated as specifications of the probability that y = 1. It is no more computationally demanding than any other linear regression problem.
- Logistic regression, more theoretically satisfying: also uses a binary predictand, but fits the regression parameters to a nonlinear equation that is guaranteed to produce properly bounded probability estimates. The parameters must be estimated using iterative techniques, which is computationally intensive.

[SPSS regression output for a binary predictand, y = 1 if total snowfall exceeds the 100-year mean. Predictors: (Constant), EA, stations, season; dependent variable VAR00017. ANOVA: Regression SS .679 (df 3, MS .226), Residual SS 7.901 (df 46, MS .172), Total SS 8.580 (df 49), F = 1.317, Sig. .280(a). Model summary: R = .281(a), R² = .079, adjusted R² = .019, standard error of the estimate .41445, Durbin-Watson 2.154. The regression is not statistically significant (F = 1.317, Sig. = .280).]
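The contrast between REEP and logistic regression can be illustrated with a toy sketch; the coefficients here are made up for illustration, not fitted to any data:

```python
import math

def reep_probability(b0, b1, x):
    """REEP: ordinary linear regression on a binary predictand; the forecast
    is read as a probability, but it is not bounded to [0, 1]."""
    return b0 + b1 * x

def logistic_probability(b0, b1, x):
    """Logistic form p = 1 / (1 + exp(-(b0 + b1 * x))):
    always a properly bounded probability."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# made-up coefficients, for illustration only
b0, b1 = -1.0, 0.5
p_reep = reep_probability(b0, b1, 10.0)       # exceeds 1: not a valid probability
p_logit = logistic_probability(b0, b1, 10.0)  # stays inside (0, 1)
```

The same linear combination b₀ + b₁x drives both forecasts; the logistic transformation is what guarantees bounded probabilities, at the cost of iterative parameter estimation.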
Predictor Selection
How do we find a good subset of potential predictors? There are dangers in including too many predictor variables, as an example with nonsense predictors and a perfect fit shows. The development sample (also called the dependent sample or training sample) is the portion of the available data used to produce the forecast equation. Any K = n − 1 predictors will produce a perfect regression fit to any predictand for which there are n observations. This is called "overfitting", and such an equation will not work for independent, or verification, data: data not used in the development of the equation.

Using only the first 50 years:

[SPSS output, predictors (Constant), stations1, year; dependent variable total1. Model summary: R = .805(a), R² = .648, adjusted R² = .633, standard error of the estimate 904.77237, R² Change .648, F Change 42.369, df1 2, df2 46, Sig. F Change .000. ANOVA: Regression SS 69367224.396 (df 2, MS 34683612.198), Residual SS 37656200.290 (df 46, MS 818613.050), Total SS 107023424.685 (df 48), F = 42.369, Sig. .000(a). Coefficients: (Constant) B = -55797.734 (SE 25901.806, t = -2.154, Sig. .036, 95% CI [-107935.366, -3660.102]); year B = 29.564 (SE 13.508, Beta .283, t = 2.189, Sig. .034, CI [2.374, 56.753]); stations1 B = 584.760 (SE 131.753, Beta .574, t = 4.438, Sig. .000, CI [319.554, 849.965]).]

Lessons learned:
- Choose only physically reasonable or meaningful predictors.
- A tentative regression equation needs to be tested on a sample of data not involved in its development (a very large difference in performance between the dependent and independent samples would lead to suspicion that the equation was overfit).
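The K = n − 1 overfitting point can be demonstrated in the smallest possible case: n = 2 observations and one predictor plus an intercept. The "data" below are arbitrary noise, yet the training fit is perfect:

```python
def fit_line_through(p1, p2):
    """With n = 2 observations, an intercept plus K = n - 1 = 1 predictor
    always fits the training data perfectly: the line through both points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)
    a = y1 - b * x1
    return a, b

# arbitrary "data": even pure noise is fit exactly, so training SSE = 0
obs = [(1.0, 5.3), (2.0, -0.7)]
a, b = fit_line_through(*obs)
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in obs)
```

A zero training SSE here says nothing about skill on independent data, which is exactly why verification on withheld data is needed.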
You also need a reasonably large developmental sample if the resulting regression equation is to be stable, meaning that the regression coefficients are applicable to independent (future) data and do not come from an overfit. In forecasting, little is gained from more than about 12 predictors.

Screening Regression
Relevant predictor variables are almost always mutually correlated, so there is redundant information. Screening regression addresses the problem of selecting a good set of predictors from a pool of potential predictors. Common types of screening:
- Forward selection, also known as stepwise regression: add the candidate predictor giving the highest R², lowest MSE, or largest F ratio.
- Backward elimination: remove the predictor with the smallest t ratio.

Stopping Rules
Both forward selection and backward elimination require a stopping criterion; note, for example, that every time the regression is recalculated, the F ratios change. In practice, less rigorous stopping criteria are used:
- Stop adding predictors when none of the remaining predictors would change the R² by more than some amount (e.g., 0.05%).
- Use the MSE: √MSE directly reflects the anticipated forecast accuracy of the regression. For example, if MSE = 0.01 °F², why add more predictors? That MSE already indicates a ±2 standard deviation (~95%) confidence interval of ±2√0.01 = ±0.2 °F. The stopping criterion is the point where MSE does not decline appreciably with the addition of more predictors.

Artificial skill is the underestimate of the operational MSE provided by the performance of a forecast equation on its developmental data set; sometimes the MSE-minimizing number of predictors K is smaller for independent data. For a "black box" forecast you just minimize MSE and need not worry about the identities of the predictors; for scientific understanding it is not the reduction of MSE that is most important, but the relationships among the variables suggested by the regression procedure.

Cross-Validation
Cross-validation is a resampling technique: the available data are repeatedly divided into developmental and verification data subsets.
Often the developmental data sets are of size n − 1 and each verification data set contains the single remaining observation of the predictand, giving n partitions of the data. The cross-validation estimate of the prediction MSE is computed by forecasting each omitted observation, computing the squared difference between the prediction and the predictand, and averaging the n squared differences. This could be done for any m withheld observations and developmental data sets of size n − m.

Analog Forecasting
Analog forecasting searches archives of climatological synoptic data for maps closely resembling the current observations, and assumes that the future evolution of the atmosphere will be similar to the flows that followed the historical analog; in effect, the atmosphere is used as a perfect model of itself. But the atmosphere apparently does not ever exactly repeat itself, and you must match all relevant fields (not just heights and temperatures, but SST and so on). Statistics can be used to search for analogs and to rank historical cases for closeness to the current state; the average outcome of all good analogs is then taken as the forecast.
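The leave-one-out procedure described above can be sketched for simple linear regression; the data in the test are synthetic and exactly linear, so the cross-validated MSE is zero:

```python
def loo_cv_mse(x, y):
    """Leave-one-out cross-validation estimate of the prediction MSE for
    simple linear regression: n partitions, each withholding one observation."""
    n = len(x)
    sq_errors = []
    for k in range(n):
        # developmental data set of size n - 1 (observation k withheld)
        xs = [xi for i, xi in enumerate(x) if i != k]
        ys = [yi for i, yi in enumerate(y) if i != k]
        m = len(xs)
        sx, sy = sum(xs), sum(ys)
        sxy = sum(xi * yi for xi, yi in zip(xs, ys))
        sxx = sum(xi * xi for xi in xs)
        b = (m * sxy - sx * sy) / (m * sxx - sx * sx)
        a = sy / m - b * sx / m
        # forecast the omitted observation and record the squared error
        pred = a + b * x[k]
        sq_errors.append((y[k] - pred) ** 2)
    return sum(sq_errors) / n
```

Because every forecast is made for an observation outside its developmental sample, this estimate is largely free of the artificial skill that inflates performance on the training data.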