Stat 501
Some Theory for Activity 1 of the Dec. 1 Lab

The Starting Model
y_t = β0 + β1 x_t + ε_t   (1)
with y_t = Gas index at time t and x_t = Oil index at time t. Keep in mind that the overall goal is to estimate the parameters of this model.

The Difficulty
In Part C, we see a linear relationship between e_t and e_{t−1}. This is evidence against the assumption that the errors are independent. Ordinary least squares calculations don't provide "correct" answers in the presence of dependent errors.
Note: The Durbin-Watson test in Part B also gives evidence that the errors at two consecutive times are correlated.

Autoregressive Model for Errors
The first-order autoregression model for the errors is
ε_t = ρ ε_{t−1} + u_t   (2)
with u_t ~ iid(0, σ²), and the u's and the ε's independent of each other.
Note 1: It can be proved that the parameter ρ equals the correlation between ε_t and ε_{t−1}.
Note 2: Notice that no intercept is used in model (2). Suppose that we do include an intercept. Remember that the errors have mean 0 (and so does u_t), so if we take expected values (averages) on both sides of the model ε_t = intercept + ρ ε_{t−1} + u_t, the result is 0 = intercept. By theory the intercept is 0, so it's not necessary to include it as a parameter.

Transforming to a model with random errors
Model (2) can be expressed as ε_t − ρ ε_{t−1} = u_t. Recall that u_t ~ iid(0, σ²), so the variable ε_t − ρ ε_{t−1} has the desired properties for an error term in ordinary least squares. This leads to the "trick" used to estimate the parameters in model (1).
Model (1) is y_t = β0 + β1 x_t + ε_t.
At time t−1, model (1) is y_{t−1} = β0 + β1 x_{t−1} + ε_{t−1}.
Multiply all elements of the model at time t−1 by ρ (the correlation between ε_t and ε_{t−1}):
ρ y_{t−1} = ρ β0 + ρ β1 x_{t−1} + ρ ε_{t−1}   (3)
Subtract equation (3) from equation (1), and do a bit of algebraic organization, to get
y_t − ρ y_{t−1} = β0 (1 − ρ) + β1 (x_t − ρ x_{t−1}) + (ε_t − ρ ε_{t−1})   (4)
Model (4) is what we're after. Recall that ε_t − ρ ε_{t−1} will be a random error term (= u_t).

Parameter estimation
Use model (4) to estimate β0 and β1, the parameters of model (1).
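Once ρ is in hand, fitting model (4) is just ordinary least squares on the transformed series. Here is a minimal sketch in Python; the data are made up and noise-free (β0 = 5, β1 = 2) so the recovered values are easy to check, and ρ is treated as known (estimating ρ from the residuals is described below).

```python
# Transform the series as in model (4): y*_t = y_t - rho*y_{t-1} and
# x*_t = x_t - rho*x_{t-1}; the first observation is lost.
rho = 0.5
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [5.0 + 2.0 * xi for xi in x]      # toy data: beta0 = 5, beta1 = 2, no error

ys = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
xs = [x[t] - rho * x[t - 1] for t in range(1, len(x))]

# Ordinary least squares on the transformed variables.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys))
      / sum((xi - xbar) ** 2 for xi in xs))

# The slope of model (4) estimates beta1 directly; the intercept of
# model (4) estimates beta0*(1 - rho), so divide by (1 - rho).
a = ybar - b1 * xbar
b0 = a / (1 - rho)
print(b0, b1)                         # recovers 5.0 and 2.0
```

Because the toy data are noise-free, the transformed fit returns the original parameters exactly; with real data the recovered values would only approximate β0 and β1.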
The estimated slope for model (4) directly estimates β1. The estimated intercept for model (4) estimates β0 (1 − ρ), so we have to divide the sample intercept of model (4) by 1 − ρ.

Estimating ρ
To carry out the estimation based on model (4), we need an estimate of ρ, the correlation between errors at two consecutive times. To do this, we lag the residuals from model (1) by one time period and then find the correlation between the residuals and the lagged residuals. Alternatively, regress the residuals on the lagged residuals and use the slope to estimate the autocorrelation.

Example
The data are U.S. oil and gas price index values for a different 84 months than used in the lab assignment. These months are after oil prices were deregulated in the United States. The months used in the assignment were before deregulation.

Ordinary least squares results
The regression equation is Gas = 82.7 + 0.800 Oil

Predictor   Coef      SE Coef   T      P
Constant    82.73     11.34     7.30   0.000
Oil         0.80047   0.02065   38.76  0.000

Source          DF  SS       MS       F        P
Regression       1  1088509  1088509  1501.99  0.000
Residual Error  82    59426      725
Total           83  1147935

Durbin-Watson statistic = 0.460775
The value of the statistic is well below the value of dL in Table B.7 for either n = 80 or n = 85. We conclude there is significant autocorrelation.

Estimate the autocorrelation
Pearson correlation of RESI1 and reslag1 = 0.768

Transformed Model
The response is y_t − 0.768 y_{t−1} and the predictor is x_t − 0.768 x_{t−1}.
The regression equation is gasnew = 26.5 + 0.736 oilnew

Predictor  Coef     SE Coef  T      P
Constant   26.509   5.832    4.55   0.000
oilnew     0.73580  0.04593  16.02  0.000

The estimate of β0 is 26.509 / (1 − 0.768) = 114.263, and the estimate of β1 is 0.7358. This leads to the regression line
Mean Gas = 114.263 + 0.7358 Oil
----------------------------------------------------------------------
Prediction (not in the lab assignment)
When calculating predicted values, it's important to use ε_t = ρ ε_{t−1} + u_t. With this, model (1) can be written
y_t = β0 + β1 x_t + ε_t = β0 + β1 x_t + ρ ε_{t−1} + u_t.
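The autocorrelation estimate (0.768 in the example above) can be computed directly from the residuals. A short sketch with made-up residual values, using the regression-through-the-origin version (no intercept, matching model (2)):

```python
# Residuals from a model (1) fit (made-up values for illustration).
resid = [2.0, 1.5, 1.8, 0.9, 1.1, 0.2, -0.5, -0.1]

# Pair e_t with its lag e_{t-1}; the through-origin regression slope
# sum(e_{t-1} * e_t) / sum(e_{t-1}^2) is one common estimate of rho.
e_t = resid[1:]
e_lag = resid[:-1]
rho_hat = sum(a * b for a, b in zip(e_lag, e_t)) / sum(a * a for a in e_lag)
print(rho_hat)
```

The Pearson correlation of the residuals with their lags (as used in the example) gives a slightly different number than this no-intercept slope; either is a reasonable estimate of ρ.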
If we know what happened at time t−1, the predicted value at time t can be computed as
ŷ_t = β̂0 + β̂1 x_t + ρ̂ e_{t−1}.
Here, ŷ_t = 114.263 + 0.7358 x_t + 0.768 e_{t−1}. Values of ŷ_t are computed iteratively. Start by assuming e_0 = 0 (the error before time 1 is 0) and compute ŷ_1 and e_1 = y_1 − ŷ_1. Use the value of e_1 = y_1 − ŷ_1 when computing ŷ_2 = 114.263 + 0.7358 x_2 + 0.768 e_1. Determine e_2 = y_2 − ŷ_2, and use that when computing ŷ_3, and so on. I did the calculations for this example using Excel, and found SSE = 37,999, a much lower value than for ordinary least squares (59,426 in the AOV table above).

Stat 501
Dec. 3
Matrix Notation for Regression

In matrix notation, the regression model is written Y = Xβ + ε.

Y is a column vector containing the y-values: Y = (Y_1, Y_2, ..., Y_n)ᵀ. Note that there are n rows.

ε is a column vector containing the errors: ε = (ε_1, ε_2, ..., ε_n)ᵀ. Note that there are n rows.

β is a column vector containing the coefficients: β = (β0, β1, ..., β_{p−1})ᵀ. Notice the subscript numbering of the β's. As an example, for simple regression, β = (β0, β1)ᵀ.

The X matrix is a matrix in which each row gives the data for a different observation. The first column equals 1 for all observations, and each column after the first gives the data for a different variable. There is a column for each variable, including any added interactions, transformations, indicators, and so on. The abstract formulation is

    [ 1  X_11  ...  X_{1,p−1} ]
X = [ 1  X_21  ...  X_{2,p−1} ]
    [ .   .    ...     .      ]
    [ 1  X_n1  ...  X_{n,p−1} ]

In the subscripting, the first value is the observation number and the second is the variable number. With sample data, the columns for the variables are the same as the Minitab columns of the data for the predictor variables. The first column is always a column of 1's. The X matrix has n rows and p columns.

Coefficient estimates
The least-squares estimates of the β coefficients are calculated as b = (XᵀX)⁻¹ XᵀY. In this formula, Xᵀ means the transpose of X. In the transpose of a matrix, the rows are the columns of the original matrix. For example, if

    [ 1  3 ]
X = [ 1  0 ]         then  Xᵀ = [ 1  1  1 ]
    [ 1  1 ]                    [ 3  0  1 ]

(XᵀX)⁻¹ is the matrix inverse of XᵀX. This means that (XᵀX)⁻¹ (XᵀX) = I, where I is the identity matrix (of p rows and p columns), with the value 1 down the main diagonal (top left to bottom right) and the value 0 in all other locations.

Linear Dependence
The columns of a matrix are linearly dependent if one column can be expressed as a linear combination of the other columns. If there is a linear dependence in X, then (XᵀX)⁻¹ does not exist. In the regression setting this means that we can't determine estimates of the coefficients (the formula for doing so needs (XᵀX)⁻¹). Statistically, a linear dependence in the X matrix occurs due to a perfect multicollinearity among the X variables.

PRACTICE PROBLEMS

1. For the X matrix given toward the top of this page, calculate XᵀX.

2. Verify that
(XᵀX)⁻¹ = [ 10/14  −4/14 ]
          [ −4/14   3/14 ]
by multiplying it by your answer to problem 1 to see if the answer = I.

3. Suppose data for a y-variable and two x-variables are:

Y   X1  X2
6   1   1
5   1   2
10  3   1
12  5   1
14  3   2
18  5   2

For the model y_i = β0 + β1 X_i1 + β2 X_i2 + β3 X_i1 X_i2 + ε_i, write out each of Y, X, β, and ε.

4. Suppose that Y = muscle mass, X1 = age, X2 = 1 if male and 0 if female, and X3 = 1 if female and 0 if male. The data are

Y    Age  Sex
60   40   Male
50   45   Female
70   43   Male
42   60   Female
50   60   Male
45   65   Female

(a) Write out the X matrix for the model y_i = β0 + β1 X_i1 + β2 X_i2 + β3 X_i3 + ε_i.
(b) There is a linear dependence in the X matrix. Explain what it is, and what you would do about it in practice.

NOTE: The Exam 3 study guide said you should know something about the variance-covariance of the coefficient estimates. I've changed my mind – don't worry about that.

Stat 501
Dec. 3
Family error rates and the Bonferroni Inequality

In most statistical studies, researchers may calculate many different significance tests.
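As a numerical aside before continuing, practice problems 1 and 2 from the matrix-notation handout above can be checked directly in Python (a sketch using the 3×2 example X from that handout):

```python
# X from the transpose example: rows (1, 3), (1, 0), (1, 1).
X = [[1, 3],
     [1, 0],
     [1, 1]]

# Transpose, then X^T X (practice problem 1).
Xt = [[row[j] for row in X] for j in range(2)]
XtX = [[sum(Xt[i][k] * X[k][j] for k in range(3)) for j in range(2)]
       for i in range(2)]
print(XtX)                    # [[3, 4], [4, 10]]

# Practice problem 2: multiplying X^T X by its proposed inverse
# should give the 2x2 identity matrix I (up to rounding).
inv = [[10 / 14, -4 / 14],
       [-4 / 14, 3 / 14]]
prod = [[sum(XtX[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
print(prod)
```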
For instance, in a multiple regression with ten x-variables, the analysis is likely to include looking at the t-tests for the ten different β coefficients that multiply the predictors. As we increase the number of inference procedures we carry out, we also increase the risk that at least one of the conclusions will be wrong.

As an example, the 0.05 significance level means that for 5% of all cases where the null hypothesis really is true, the conclusion will be to reject. Thus if we examine 20 situations where the null is the truth, we might incorrectly reject the null 1 or so times. If we examine 40 situations where the null is the truth, we might incorrectly reject the null 2 or so times.

Wrong decisions in favor of the null also are likely when we do multiple inferences. The power of a statistical test is the probability of picking the alternative when the alternative is true. Power is a function of sample size, among other things. Suppose a sample size is relatively small and the power for various significance tests is 0.60. This means that we may pick the alternative in only about 60% of the tests we do in situations where the alternative actually is true.

A type 1 error is the error of rejecting the null when actually the null is true, the problem described in the first paragraph of this handout. Suppose we carry out a number of significance tests. The family-wide type 1 error rate is the probability that we make at least one type 1 error in our conclusions.

Table 1 shows the probabilities of making and not making any type 1 errors when carrying out s independent significance tests using a 0.05 significance level for each test. For s independent tests, each with a 0.05 significance level, the probability of making 0 type 1 errors is (0.95)^s. The probability of at least one type 1 error is 1 − (0.95)^s. The last column of the table gives the family-wide type 1 error rate when each test is done with a 0.05 level of significance.
Two cautions about Table 1 are in order. First, it's rare that all tests you do in a study are independent of one another. Second, keep in mind that a type 1 error has to do with situations where the null is really true. Thus Table 1 applies only to situations where the truth for all s tests is that the null is true.

Table 1.
s = number of tests | Probability of no type 1 errors | Probability of at least one type 1 error
 1                  | 0.95                            | 0.05
 5                  | 0.774                           | 0.226
10                  | 0.599                           | 0.401
20                  | 0.358                           | 0.642
40                  | 0.129                           | 0.871

Bonferroni Inequality
If we do s tests (independent or not), each with a level of significance α/s, the family type 1 error rate is less than or equal to α.

Examples
(1) For s = 5 tests, if we want the family type 1 error rate to be less than or equal to 0.05, the level of significance for each test should be 0.05/5 = 0.01. That is, in each test, only reject the null if the p-value is less than 0.01.
(2) For s = 3 tests, if we want the family type 1 error rate to be less than or equal to 0.06, the level of significance for each test should be 0.06/3 = 0.02. That is, in each test, only reject the null if the p-value is less than 0.02.
(3) For s = 40 tests, if we want the family type 1 error rate to be less than or equal to 0.10, the level of significance for each test should be 0.10/40 = 0.0025. That is, in each test, only reject the null if the p-value is less than 0.0025.

Conservatism of the Bonferroni Inequality
Example (3) illustrates the problem with the Bonferroni inequality. If we use 0.0025 as a significance level for a test, we're making it hard to reject the null hypothesis. With such a rigorous significance level, we're close to the most extreme and certain way to prevent a type 1 error, which is to never reject the null. The problem is that in many situations (maybe most) the alternative hypothesis is the correct decision. We shouldn't use such a difficult standard for significance that we lose the power to pick the alternative when it's correct to do so.

Recommendations:
1.
Only use the Bonferroni inequality for small s.
2. Don't obsess too much about incorrectly rejecting the null. The risk of doing so is that you decrease the power of the test(s).

Confidence Intervals
The Bonferroni inequality also can be applied to multiple confidence intervals. An error for a confidence interval is that it doesn't cover the true value of the parameter. For a 95% confidence interval, the error rate is 0.05 (5%). For a family of s intervals, the family-wide error rate is the probability that at least one of the intervals doesn't capture the true value of the parameter.

If we construct s confidence intervals (independent or not), each with an error rate of α/s (and a confidence level of 1 − α/s), the family error rate is less than or equal to α.

Examples
(4) For s = 5 intervals, if we want the family error rate to be less than or equal to 0.05, the error rate for each interval should be 0.05/5 = 0.01. That is, for each individual confidence interval use a confidence level of 1 − 0.01 = 0.99.
(5) For s = 20 intervals, if we want the family error rate to be less than or equal to 0.10, the error rate for each interval should be 0.10/20 = 0.005. That is, for each individual confidence interval use a confidence level of 1 − 0.005 = 0.995.
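The entries in Table 1 and the Bonferroni adjustments in the examples can all be reproduced in a few lines of Python (a sketch; the function name here is ours, not a standard library routine):

```python
# Table 1: with s independent tests, each at the 0.05 level,
# P(no type 1 errors) = 0.95**s and P(at least one) = 1 - 0.95**s.
for s in [1, 5, 10, 20, 40]:
    p_none = 0.95 ** s
    print(s, round(p_none, 3), round(1 - p_none, 3))

# Bonferroni adjustment: the per-test significance level (equivalently,
# the per-interval error rate) is alpha / s, so each interval gets
# confidence level 1 - alpha / s.
def per_test_level(alpha, s):
    return alpha / s
```

For instance, per_test_level(0.05, 5) reproduces example (1)'s per-test level of 0.01, and 1 − per_test_level(0.10, 20) reproduces example (5)'s per-interval confidence level of 0.995.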