Stat 501
Some Theory for Activity 1 of the Dec. 1 Lab
The Starting Model
(1) $y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$, with $y_t$ = Gas index at time t and $x_t$ = Oil index at time t
Keep in mind that the overall goal is to estimate the parameters of this model.
The Difficulty
In Part C, we see a linear relationship between $e_t$ and $e_{t-1}$. This is evidence against the assumption that the errors are independent. Ordinary least squares calculations don't provide "correct" answers in the presence of dependent errors. Note: The Durbin-Watson test in Part B also gives evidence that the errors at two consecutive times are correlated.
Autoregressive Model for Errors
The first-order autoregression model for the errors is

(2) $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$

with $u_t \sim \text{iid}(0, \sigma^2)$, and the u's and the ε's independent of each other.
Note 1: It can be proved that the parameter $\rho$ equals the correlation between $\varepsilon_t$ and $\varepsilon_{t-1}$.
Note 2: Notice that no intercept is used in model (2). Suppose that we do include an intercept. Remember that the errors have mean 0 (and so does $u_t$), so if we take expected values (averages) on both sides of the model $\varepsilon_t = \text{intercept} + \rho\,\varepsilon_{t-1} + u_t$, the result is 0 = intercept. By theory the intercept is 0, so it's not necessary to include it as a parameter.
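Spelling that expectation step out (all of $\varepsilon_t$, $\varepsilon_{t-1}$, and $u_t$ have mean 0):

$E[\varepsilon_t] = \text{intercept} + \rho\,E[\varepsilon_{t-1}] + E[u_t] \implies 0 = \text{intercept} + \rho \cdot 0 + 0 = \text{intercept}.$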
Transforming to a model with random errors
Model (2) can be expressed as $\varepsilon_t - \rho\,\varepsilon_{t-1} = u_t$.
Recall that $u_t \sim \text{iid}(0, \sigma^2)$, so the variable $\varepsilon_t - \rho\,\varepsilon_{t-1}$ has the desired properties for an error term in ordinary least squares. This leads to the "trick" used to estimate the parameters in model (1).
Model (1) is

$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$

At time $t-1$, model (1) is

$y_{t-1} = \beta_0 + \beta_1 x_{t-1} + \varepsilon_{t-1}$

Multiply all elements of the model at time $t-1$ by $\rho$ (the correlation between $\varepsilon_t$ and $\varepsilon_{t-1}$):

(3) $\rho\,y_{t-1} = \rho\,\beta_0 + \rho\,\beta_1 x_{t-1} + \rho\,\varepsilon_{t-1}$

Subtract equation (3) from equation (1), and do a bit of algebraic organization to get

(4) $y_t - \rho\,y_{t-1} = \beta_0(1 - \rho) + \beta_1(x_t - \rho\,x_{t-1}) + (\varepsilon_t - \rho\,\varepsilon_{t-1})$

Model (4) is what we're after. Recall that $\varepsilon_t - \rho\,\varepsilon_{t-1}$ will be a random error term ($= u_t$).
Parameter estimation
Use model (4) to estimate $\beta_0$ and $\beta_1$, the parameters of model (1). The estimated slope for model (4) directly estimates $\beta_1$. The estimated intercept for model (4) estimates $\beta_0(1 - \rho)$, so we have to divide the sample intercept of model (4) by $1 - \rho$.
Estimating $\rho$
To carry out the transformation in model (4), we need an estimate of $\rho$, the correlation between errors at two consecutive times. To get this estimate, we lag the residuals from model (1) by one time period and then find the correlation between the residuals and the lagged residuals. Alternatively, regress the residuals on the lagged residuals and use the slope to estimate the autocorrelation.
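As a concrete illustration (not part of the lab, which uses Minitab), here is a minimal Python sketch of the whole procedure. The simulated oil and gas series are hypothetical stand-ins for the real index data, and the helper fit_ols is ours, not from the handout:

import numpy as np

def fit_ols(x, y):
    # Ordinary least squares for y = b0 + b1*x; returns (b0, b1).
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b[0], b[1]

# Hypothetical data: an oil index, and a gas index with AR(1) errors
rng = np.random.default_rng(1)
oil = 100 + np.cumsum(rng.normal(0, 2, size=84))
err = np.zeros(84)
for t in range(1, 84):
    err[t] = 0.8 * err[t - 1] + rng.normal(0, 5)   # model (2) with rho = 0.8
gas = 80 + 0.8 * oil + err                         # model (1)

# Step 1: fit model (1) by OLS and keep the residuals
b0, b1 = fit_ols(oil, gas)
resid = gas - (b0 + b1 * oil)

# Step 2: estimate rho as the no-intercept slope of the residuals
# regressed on the lagged residuals
rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)

# Step 3: fit the transformed model (4)
gasnew = gas[1:] - rho * gas[:-1]
oilnew = oil[1:] - rho * oil[:-1]
a0, a1 = fit_ols(oilnew, gasnew)

# Step 4: recover the parameters of model (1)
beta1_hat = a1              # the slope carries over directly
beta0_hat = a0 / (1 - rho)  # divide the intercept by (1 - rho)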
Example
The data are U.S. oil and gas price index values for 84 months different from those used in the lab assignment.
These months are after oil prices were deregulated in the United States. The months used in the assignment
were before deregulation.
Ordinary least squares results
The regression equation is
Gas = 82.7 + 0.800 Oil

Predictor   Coef      SE Coef   T       P
Constant    82.73     11.34      7.30   0.000
Oil          0.80047   0.02065  38.76   0.000

Source          DF        SS        MS        F      P
Regression       1   1088509   1088509   1501.99  0.000
Residual Error  82     59426       725
Total           83   1147935
Durbin-Watson statistic
Durbin-Watson statistic = 0.460775
The value of the statistic is well below the value of $d_L$ in table B.7 for either n = 80 or n = 85. We conclude there is significant autocorrelation.
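The statistic itself is easy to compute from the residuals; a minimal sketch of the standard formula $DW = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$:

import numpy as np

def durbin_watson(resid):
    # Values near 2 suggest uncorrelated errors; values near 0
    # suggest positive autocorrelation (as seen here).
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)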
Estimate the autocorrelation
Pearson correlation of RESI1 and reslag1 = 0.768
Transformed Model
Response is $y_t - 0.768\,y_{t-1}$ and predictor is $x_t - 0.768\,x_{t-1}$.
The regression equation is
gasnew = 26.5 + 0.736 oilnew

Predictor   Coef      SE Coef   T       P
Constant    26.509    5.832      4.55   0.000
oilnew       0.73580  0.04593   16.02   0.000
From this, $\hat\beta_0 = \dfrac{26.509}{1 - 0.768} = 114.263$ and $\hat\beta_1 = 0.7358$. This leads to this regression line:

Mean Gas = 114.263 + 0.7358 Oil
----------------------------------------------------------------------
Prediction
(Not in the lab assignment.)
When calculating predicted values, it's important to utilize $\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t$. With this, model (1) can be written $y_t = \beta_0 + \beta_1 x_t + \varepsilon_t = \beta_0 + \beta_1 x_t + \rho\,\varepsilon_{t-1} + u_t$. If we know what happened at time $t-1$, the predicted value at time $t$ can be computed as

$\hat{y}_t = \hat\beta_0 + \hat\beta_1 x_t + \hat\rho\, e_{t-1}$

Here, $\hat{y}_t = 114.263 + 0.7358\, x_t + 0.768\, e_{t-1}$.
Values of $\hat{y}_t$ are computed iteratively. Start by assuming $e_0 = 0$ (the error before time 1 is 0), then compute $\hat{y}_1$ and $e_1 = y_1 - \hat{y}_1$. Use the value of $e_1 = y_1 - \hat{y}_1$ when computing $\hat{y}_2 = 114.263 + 0.7358\, x_2 + 0.768\, e_1$. Determine $e_2 = y_2 - \hat{y}_2$, and use that when computing $\hat{y}_3$, and so on. I did the calculations for this example using Excel, and found SSE = 37,999, a much lower value than for ordinary least squares (59,426 in the AOV table above).
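A minimal Python sketch of this loop, assuming x and y are numpy arrays of the oil and gas values and using the estimates from this example:

import numpy as np

def ar1_predictions(x, y, b0=114.263, b1=0.7358, rho=0.768):
    # yhat_t = b0 + b1*x_t + rho*e_{t-1}, starting from e_0 = 0
    # and updating e_t = y_t - yhat_t at each step.
    yhat = np.empty(len(y))
    e_prev = 0.0
    for t in range(len(y)):
        yhat[t] = b0 + b1 * x[t] + rho * e_prev
        e_prev = y[t] - yhat[t]   # this residual feeds the next prediction
    return yhat

# SSE for comparison with ordinary least squares:
# sse = np.sum((y - ar1_predictions(x, y)) ** 2)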
Stat 501 Dec. 3
Matrix Notation for Regression
In matrix notation, the regression model is written $Y = X\beta + \varepsilon$.

Y is a column vector containing the y-values: $Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$. Note that there are n rows.

ε is a column vector containing the errors: $\varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$. Note that there are n rows.

β is a column vector containing the coefficients: $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix}$. Notice the subscript numbering of the β's: they run from 0 to p−1. As an example, for simple regression, $\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$.
The X matrix is a matrix in which each row gives the data for a different observation. The first
column equals 1 for all observations, and each column after the first gives the data for a different
variable. There is a column for each variable, including any added interactions, transformations,
indicators, and so on. The abstract formulation is
$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1,p-1} \\ 1 & X_{21} & \cdots & X_{2,p-1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{n,p-1} \end{pmatrix}$
In the subscripting, the first value is the observation number and the second number is the variable
number. With sample data, the columns for the variables are the same as the Minitab columns of the
data for the predictor variables. The first column is always a column of 1’s. The X matrix has n
rows and p columns.
Coefficient estimates
The least-squares estimates of the β coefficients are calculated as $b = (X^T X)^{-1} X^T Y$.

In this formula, $X^T$ means the transpose of X. In the transpose of a matrix, the rows are the columns of the original matrix. For example, if $X = \begin{pmatrix} 1 & 3 \\ 1 & 0 \\ 1 & 1 \end{pmatrix}$, then $X^T = \begin{pmatrix} 1 & 1 & 1 \\ 3 & 0 & 1 \end{pmatrix}$.

$(X^T X)^{-1}$ is the matrix inverse of $X^T X$. This means that $(X^T X)^{-1}(X^T X) = I$, where I is the identity matrix (of p rows and p columns) with the value 1 down the main diagonal (top left to bottom right) and the value 0 in all other locations.
Linear Dependence
The columns of a matrix are linearly dependent if one column can be expressed as a linear combination of the other columns. If there is a linear dependence in X, then $(X^T X)^{-1}$ does not exist. In the regression setting this means that we can't determine estimates of the coefficients (the formula for doing so needs $(X^T X)^{-1}$). Statistically, a linear dependence in the X matrix occurs due to a perfect multicollinearity among the X variables.
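A numeric illustration of the breakdown: here the third column equals one minus the second (like the male/female indicators in practice problem 4 below), so X'X is singular and numpy refuses to invert it:

import numpy as np

x1 = np.array([1.0, 0.0, 1.0, 0.0])        # indicator for "male"
x2 = 1.0 - x1                              # indicator for "female": redundant
X = np.column_stack([np.ones(4), x1, x2])  # column of 1's equals x1 + x2

print(np.linalg.matrix_rank(X.T @ X))      # 2, not 3: X'X is singular
try:
    np.linalg.inv(X.T @ X)
except np.linalg.LinAlgError as err:
    print("inverse does not exist:", err)

(With these exact values the singularity is detected outright; with real data one would check the rank or condition number of X'X.)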
PRACTICE PROBLEMS
1. For the X matrix given toward the top of this page, calculate $X^T X$.

2. Verify that $(X^T X)^{-1} = \begin{pmatrix} 10/14 & -4/14 \\ -4/14 & 3/14 \end{pmatrix}$ by multiplying it by your answer to problem 1 to see if the answer = I.

3. Suppose data for a y-variable and two x-variables are:

Y    X1   X2
 6    1    1
 5    1    2
10    3    1
12    5    1
14    3    2
18    5    2

For the model $y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i1} X_{i2} + \varepsilon_i$, write out each of Y, X, β, and ε.
4. Suppose that Y = muscle mass, X1 = age, X2 = 1 if male and 0 if female, and X3 = 1 if female and 0 if male. The data are

Y    Age   Sex
60   40    Male
50   45    Female
70   43    Male
42   60    Female
50   60    Male
45   65    Female

(a) Write out the X matrix for the model $y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$.
(b) There is a linear dependence in the X matrix. Explain what it is, and what you would do about it in practice.
NOTE: The Exam 3 study guide said you should know something about the variance-covariance of the coefficient estimates. I've changed my mind; don't worry about that.
Stat 501 Dec. 3
Family error rates and the Bonferroni Inequality
In most statistical studies, researchers may calculate many different significance tests. For instance, in a multiple regression with ten x-variables, the analysis is likely to include looking at the t-tests for the ten different β coefficients multiplying the predictors. As we increase the number of inference procedures we carry out, we also increase the risk that at least one of the conclusions will be wrong.
As an example, the 0.05 significance level means that for 5% of all cases where the null hypothesis
really is true, the conclusion will be to reject. Thus if we examine 20 situations where the null is the
truth, we might incorrectly reject the null 1 or so times. If we examine 40 situations where the null
is the truth, we might incorrectly reject the null 2 or so times.
Wrong decisions in favor of the null also are likely when we do multiple inferences. The
power of a statistical test is the probability of picking the alternative when the alternative is true.
Power is a function of sample size, among other things. Suppose a sample size is relatively small
and the power for various significance tests is 0.60. This means that we may pick the alternative
only in about 60% of the tests we do in situations where the alternative actually is true.
A type 1 error is the error of rejecting the null when actually the null is true, the problem
described in the first paragraph of this handout. Suppose we carry out a number of significance
tests. The family wide type 1 error rate is the probability that we make at least one type 1 error in
our conclusions.
Table 1 shows the probabilities of making and not making any type 1 errors when carrying out s independent significance tests using a 0.05 significance level for each test. For s independent tests, each with a 0.05 significance level, the probability of making 0 type 1 errors is $(0.95)^s$. The probability of at least one type 1 error is $1 - (0.95)^s$. The last column in the table gives the family wide type 1 error rate when each test is done with a 0.05 level of significance.

Two cautions about Table 1 are in order. First, it's rare that all tests you do in a study are independent of one another. Second, keep in mind that type 1 error has to do with situations where the null is really true. Thus Table 1 applies only to situations where the truth for all s tests is that the null is true.
Table 1.
s = number of   Probability of no   Probability of at least
tests           type 1 errors       one type 1 error
 1              0.95                0.05
 5              0.774               0.226
10              0.599               0.401
20              0.358               0.642
40              0.129               0.871
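Table 1 can be reproduced in a few lines of Python:

for s in (1, 5, 10, 20, 40):
    p_none = 0.95 ** s    # probability of no type 1 errors
    print(s, round(p_none, 3), round(1 - p_none, 3))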
Bonferroni Inequality
If we do s tests (independent or not), each with a level of significance α/s, the family type 1 error rate is less than or equal to α.
Examples
(1) For s = 5 tests, if we want the family type 1 error rate to be less than or equal to 0.05, the level of significance for each test should be 0.05/5 = 0.01. That is, in each test, only reject the null if the p-value is less than 0.01.
(2) For s = 3 tests, if we want the family type 1 error rate to be less than or equal to 0.06, the level of significance for each test should be 0.06/3 = 0.02. That is, in each test, only reject the null if the p-value is less than 0.02.
(3) For s = 40 tests, if we want the family type 1 error rate to be less than or equal to 0.10, the level of significance for each test should be 0.10/40 = 0.0025. That is, in each test, only reject the null if the p-value is less than 0.0025.
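A minimal sketch of the rule applied to a list of (hypothetical) p-values:

def bonferroni_reject(p_values, family_alpha=0.05):
    # Reject each null only if its p-value is below family_alpha / s.
    cutoff = family_alpha / len(p_values)
    return [p < cutoff for p in p_values]

# s = 5 tests, family rate 0.05, so the per-test cutoff is 0.01:
print(bonferroni_reject([0.003, 0.04, 0.011, 0.2, 0.008]))
# [True, False, False, False, True]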
Conservatism of Bonferroni Inequality
Example (3) illustrates the problem with the Bonferroni inequality. If we use 0.0025 as a significance level for a test, we're making it very hard to reject the null hypothesis. With such a rigorous significance level, we're close to the most extreme and certain way to prevent a type 1 error, which is to never reject the null. The problem is that in many situations (maybe most) the alternative hypothesis is the correct decision. We shouldn't use such a difficult standard for significance that we lose the power to pick the alternative when it's correct to do so.
Recommendations:
1. Only use the Bonferroni inequality for small s.
2. Don’t obsess too much about incorrectly rejecting the null. The risk of doing so is that you
decrease the power of the test(s).
Confidence Intervals
The Bonferroni Inequality also can be applied to multiple confidence intervals. An error for a
confidence interval is that it doesn’t cover the true value of the parameter. For a 95% confidence
interval, the error rate is 0.05 (5%). For a family of s intervals, the family wide error rate is the
probability that at least one of the intervals doesn't capture the true value of the parameter.
If we do s confidence intervals (independent or not), each with an error rate of α/s (and confidence level = 1 − α/s), the family error rate is less than or equal to α.
Examples
(4) For s = 5 intervals, if we want the family error rate to be less than or equal to 0.05, the error rate for each interval should be 0.05/5 = 0.01. That is, for each individual confidence interval use a confidence level = 1 − 0.01 = 0.99.
(5) For s = 20 intervals, if we want the family error rate to be less than or equal to 0.10, the error rate for each interval should be 0.10/20 = 0.005. That is, for each individual confidence interval use a confidence level = 1 − 0.005 = 0.995.
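The arithmetic in examples (4) and (5) amounts to one line:

def per_interval_level(family_alpha, s):
    # Confidence level to use for each of s intervals.
    return 1 - family_alpha / s

print(per_interval_level(0.05, 5))    # 0.99, as in example (4)
print(per_interval_level(0.10, 20))   # 0.995, as in example (5)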