Transcript
Regression
Idea behind Regression
[Scatter plot with axes X and Y.]
We have a scatter of points, and we want to find the line that best fits that scatter.
For example, we might want to
know the relationship between
Exam score and hours studied, or
Wheat yield and fertilizer usage, or
Job performance and job training, or
Sales revenue and advertising expenditure.
Imagine that there is a true relationship behind
the variables in which we are interested. That
relationship is known perhaps to some supreme
being.
However, we are mere mortals, and the best we
can do is to estimate that relationship based on
a sample of observations.
Perhaps the supreme being feels that the world would be too
boring if a particular number of hours studied was always
associated with the same exam score, a particular amount of
job training always led to the same job performance, etc.
So the supreme being tosses in a random error.
Then the equation of the true relationship is:
$Y_i = \alpha + \beta X_i + \varepsilon_i$
The subscript i indicates which observation or which point we
are considering.
Xi is the value of the independent variable for observation i.
Yi is the value of the dependent variable.
α is the true intercept.
β is the true slope.
εi is the random error.
Again the equation of the true relationship is:
$Y_i = \alpha + \beta X_i + \varepsilon_i$
Our estimated equation is:
$Y_i = a + b X_i + e_i$
a is our estimated intercept.
b is our estimated slope.
ei is the estimation error.
Let’s look at our regression line and one particular observation.
[Graph: the estimated regression line $\hat{Y}_i = a + bX_i$ with one observed point. $Y_i$ is the observed value of the dependent variable, $\hat{Y}_i$ is the predicted value of the dependent variable, and $X_i$ is the observed value of the independent variable.]
The estimation error, $e_i = Y_i - \hat{Y}_i$, is the gap between the observed value and the predicted value of the dependent variable.
Fitting a scatter of points with a line by
eye is too subjective.
We need a more rigorous method.
We will consider three possible criteria.
Criterion 1:
minimize the sum of the vertical errors
$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)$
[Graph: the regression line $\hat{Y}_i = a + bX_i$ with an observed point $(X_i, Y_i)$ and its vertical error $e_i = Y_i - \hat{Y}_i$.]
Problem: The best fit by this
criterion may not be very good.
For points below the estimated
regression line, we have a
negative error ei.
Positive and negative errors
cancel each other out.
So the points could be far from
the line, but we may have a
small sum of vertical errors.
Criterion 2:
minimize the sum of the absolute values of the vertical errors
$\sum_{i=1}^{n} |e_i| = \sum_{i=1}^{n} |Y_i - \hat{Y}_i|$
[Graph: the same scatter and regression line, now showing the absolute values of the vertical errors.]
This avoids our previous
problem of positive and
negative errors canceling
each other out.
However, the absolute value
function is not differentiable,
so using calculus to minimize
will not work.
Criterion 3:
minimize the sum of the squares of the vertical errors
$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$
[Graph: the same scatter and regression line, now showing the squared vertical errors.]
This also avoids the
problem of positive and
negative errors canceling
each other out.
In addition, the square
function is differentiable,
so using calculus to
minimize will work.
Minimizing the sum of the squared errors
is the criterion that we will be using.
The technique is called least squares or
ordinary least squares (OLS).
Using calculus, it can be shown that the values of a and
b that give the line with the best fit can be calculated as:
slope:
$b = \frac{\sum_{i=1}^{n} X_i Y_i - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2}$
intercept:
$a = \bar{Y} - b\bar{X}$
Sometimes we omit the subscripts, since they are understood,
and it’s less cumbersome without them. Then the equations are:
slope:
$b = \frac{\sum XY - \frac{1}{n}\left(\sum X\right)\left(\sum Y\right)}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}$
intercept:
$a = \bar{Y} - b\bar{X}$
Another equivalent formula for b that is
sometimes used is:
slope:
$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}$
You may use either formula for b in this class.
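As a quick check that the two formulas agree, here is a minimal Python sketch using a small made-up data set (the numbers are purely illustrative and not part of the lecture):

# Minimal sketch: the two slope formulas give the same b.
x = [1, 2, 3, 4]          # illustrative data only
y = [2, 3, 5, 4]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum(xi * yi for xi, yi in zip(x, y))
sx2 = sum(xi ** 2 for xi in x)

b1 = (sxy - sum(x) * sum(y) / n) / (sx2 - sum(x) ** 2 / n)   # first formula
b2 = (sxy - n * x_bar * y_bar) / (sx2 - n * x_bar ** 2)      # second formula
print(b1, b2)   # both 0.8 for this data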
Example: Determine the least squares regression line for
Y = wheat yield and X = fertilizer, using the following data.
X      Y
100    40
200    50
300    50
400    70
500    65
600    65
700    80

(We will also add columns for XY and X².)
We need the sums of the X’s, the Y’s, the XY’s and the X2’s
X       Y      XY         X²
100     40     4,000      10,000
200     50     10,000     40,000
300     50     15,000     90,000
400     70     28,000     160,000
500     65     32,500     250,000
600     65     39,000     360,000
700     80     56,000     490,000
2,800   420    184,500    1,400,000   (column sums)
We also need the means of X and of Y.
$\bar{X} = \frac{2800}{7} = 400 \qquad \bar{Y} = \frac{420}{7} = 60$
Next, we calculate the estimated slope b.
1
 XY   n  X  Y
b
2
2  1
X
X

   
n
1
184,500   ( 2800)( 420)
7

1
1,400,000   28002
7

184,500  168,000
1,400,000  1,120,000

16,500
 0.059
280,000
Then we calculate the estimated intercept a.
$a = \bar{Y} - b\bar{X} = 60 - (0.059)(400) = 36.4$
So our estimated regression line is
$\hat{Y} = 36.4 + 0.059\,X$
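As an arithmetic check, here is a minimal Python sketch (standard library only, variable names my own) that reproduces the sums and the least squares estimates for the wheat data:

# Minimal sketch: least squares slope and intercept from the sum formulas.
x = [100, 200, 300, 400, 500, 600, 700]   # fertilizer
y = [40, 50, 50, 70, 65, 65, 80]          # wheat yield
n = len(x)

sum_x = sum(x)                                   # 2,800
sum_y = sum(y)                                   # 420
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 184,500
sum_x2 = sum(xi ** 2 for xi in x)                # 1,400,000

b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)  # ≈ 0.059
a = sum_y / n - b * sum_x / n                                 # ≈ 36.4
print(a, b)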
Given certain assumptions, the OLS
estimators can be shown to have certain
desirable properties. The assumptions are
• The Y values are independent of each other.
• The conditional distributions of Y given X
are normal.
• The conditional standard deviations of Y
given X are equal for all values of X.
Gauss-Markov Theorem: If the previous assumptions hold,
then the OLS estimators a, b, and Ŷ of α, β, and μY·X (the conditional mean of Y given X)
are best, linear, unbiased estimators (BLUE).
Linear means that the estimators are linear functions of the
observed Y values. (There are no Y² terms or square roots of Y, etc.)
Unbiased means that the expected values of the estimators are
equal to the parameters you are trying to estimate.
Best means that the estimator has the lowest variance of any
linear unbiased estimator of the parameter.
Let’s look at our wheat example using our graph.
Consider the fertilizer amount Xi = 700.
[Graph: the regression line $\hat{Y}_i = a + bX_i$ with the observation at $X_i = 700$ marked, where $Y_i = 80$, $\hat{Y}_i = 77.7$, and $\bar{Y} = 60$.]
The average of all Y values is $\bar{Y} = 60$.
The observed value of Y corresponding to X = 700 is Y = 80.
The predicted value of Y corresponding to X = 700 is $\hat{Y}_i = 36.4 + (0.059)(700) = 77.7$.
[Graph: at $X_i = 700$, the total deviation $Y_i - \bar{Y}$ is split into the explained deviation $\hat{Y}_i - \bar{Y}$ and the unexplained deviation $Y_i - \hat{Y}_i$, with $Y_i = 80$, $\hat{Y}_i = 77.7$, and $\bar{Y} = 60$.]
The difference between the
predicted value of Y and the
average value is called the
explained deviation.
The difference between the
observed value of Y and the
predicted value is the unexplained
deviation.
The difference between the
observed value of Y and the
average value is the total
deviation.
If we sum the squares of those deviations, we get
SST = sum of squares total = $\sum (Y_i - \bar{Y})^2$, from the total deviations.
SSR = sum of squares regression = $\sum (\hat{Y}_i - \bar{Y})^2$, from the explained deviations.
SSE = sum of squares error = $\sum (Y_i - \hat{Y}_i)^2$, from the unexplained deviations.
It can be shown that SST = SSR + SSE.
The Sums of Squares are often reported
in a Regression ANOVA Table
Source of Variation | Sum of squares     | Degrees of freedom | Mean square
Regression          | SSR = Σ(Ŷi − Ȳ)²   | 1                  | MSR = SSR/1
Error               | SSE = Σ(Yi − Ŷi)²  | n − 2              | MSE = SSE/(n − 2)
Total               | SST = Σ(Yi − Ȳ)²   | n − 1              | MST = SST/(n − 1)
Two measures of how well
our regression line fits our data.
The first measure is the standard error of the estimate or
the standard error of the regression, se or SER.
The se or SER tells you the typical error of fit, or how far
the observed value of Y is from the expected value of Y.
The second measure of “goodness of fit” is the
coefficient of determination or R2.
The R2 tells you the proportion of the total variation in
the dependent variable that is explained by the regression
on the independent variable (or variables).
standard error of the estimate
or standard error of the regression
$s_e = SER = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum e_i^2}{n-2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}} = \sqrt{\frac{\sum Y_i^2 - a\sum Y_i - b\sum X_i Y_i}{n-2}}$
There is a 2 in the denominator, because we estimated
2 parameters, the intercept a and the slope b.
Later, we’ll have more parameters and this will change.
Coefficient of determination or R2
$R^2 = \frac{SSR}{SST} = \frac{\text{explained variation}}{\text{total variation}} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{SSE}{SST} = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$

$0 \le R^2 \le 1$
If the line fits the scatter of points perfectly,
the points are all on the regression line and
R2 = 1.
If the line doesn’t fit at all and the scatter is
just a jumble of points, then R2 = 0.
Let’s return to our data and calculate se or SER and R2.
First, let’s add a column for Y2.
X       Y      XY         X²          Y²
100     40     4,000      10,000      1,600
200     50     10,000     40,000      2,500
300     50     15,000     90,000      2,500
400     70     28,000     160,000     4,900
500     65     32,500     250,000     4,225
600     65     39,000     360,000     4,225
700     80     56,000     490,000     6,400
2,800   420    184,500    1,400,000   26,350   (column sums)
Remember that a = 36.4 and b = 0.059. Then
$s_e = SER = \sqrt{\frac{\sum Y_i^2 - a\sum Y_i - b\sum X_i Y_i}{n-2}} = \sqrt{\frac{26{,}350 - 36.4(420) - 0.059(184{,}500)}{7-2}} \approx 5.94$
Again, a = 36.4 and b = 0.059.
$R^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2} = \frac{36.4(420) + 0.059(184{,}500) - 7(60)^2}{26{,}350 - 7(60)^2} = \frac{973.5}{1150} \approx 0.846$
So about 85% of the
variation in wheat
yield is explained by
the regression on
fertilizer.
SSR, SSE, and SST for wheat example
On the previous slide, we found that R2 = 973.5 / 1150 = 0.846.
The numerator, 973.5, is SSR, and the denominator, 1150, is SST.
The sum of squares error, SSE, is the difference:
SSE = SST − SSR = 1150 − 973.5 = 176.5.
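Here is a minimal Python sketch (standard library only) that recomputes the sums of squares, the SER, and R² for the wheat data. It keeps a and b at full precision, so the results differ slightly from the slide values, which use the rounded a = 36.4 and b = 0.059:

import math

# Minimal sketch: sums of squares and goodness of fit for the wheat data.
x = [100, 200, 300, 400, 500, 600, 700]
y = [40, 50, 50, 70, 65, 65, 80]
n = len(x)

# Least squares estimates, same formulas as before.
b = (sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n) / \
    (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)
a = sum(y) / n - b * sum(x) / n

y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]                       # predicted values

sst = sum((yi - y_bar) ** 2 for yi in y)               # total:       1150
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained: ≈ 177.7 (176.5 on the slides)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained:   ≈ 972.3 (973.5 on the slides)

ser = math.sqrt(sse / (n - 2))   # ≈ 5.96 (5.94 on the slides)
r_squared = ssr / sst            # ≈ 0.846
print(sst, ssr, sse, ser, r_squared)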
What is the square root of R2?
It is the sample correlation coefficient, usually denoted
by lower case r.
If b > 0, then r = +√R²; if b < 0, then r = −√R².
If you don’t already have R2 calculated, the sample
correlation coefficient r can also be calculated from this
formula.
$r = \frac{\sum XY - \frac{1}{n}\left(\sum X\right)\left(\sum Y\right)}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}\,\sqrt{\sum Y^2 - \frac{1}{n}\left(\sum Y\right)^2}}$
For example, in our wheat problem, we had a = 36.4 and b = 0.059.
$r = \frac{\sum XY - \frac{1}{n}\left(\sum X\right)\left(\sum Y\right)}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}\,\sqrt{\sum Y^2 - \frac{1}{n}\left(\sum Y\right)^2}} = \frac{184{,}500 - \frac{1}{7}(2800)(420)}{\sqrt{1{,}400{,}000 - \frac{1}{7}(2800)^2}\,\sqrt{26{,}350 - \frac{1}{7}(420)^2}} \approx 0.92$

Also, $r = \sqrt{R^2} = \sqrt{0.846} \approx 0.92$.
The sample correlation coefficient r is
often used to estimate the population
correlation coefficient ρ (rho).
$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
where $\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$ is the covariance of X and Y, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y respectively.
The correlation coefficient (and the covariance)
tell how the variables move with each other.
1    1
 = 1: There is a perfect positive linear relation.
 = -1: There is a perfect negative linear relation.
 = 0: There is no linear relation.
Correlation Coefficient Graphs
[Graphs: scatter plots illustrating ρ = 1, ρ = −1, ρ = 0.8, ρ = 0.5, and ρ = 0.]
R2 adjusted or corrected for degrees of freedom
$R_c^2 = 1 - \frac{\sum (Y - \hat{Y})^2 / (n-2)}{\sum (Y - \bar{Y})^2 / (n-1)}$
or
$R_c^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-2}\right)$
It is possible to compare specifications that would otherwise not
be comparable by using the adjusted R2 .
The “2” is because we are estimating 2 parameters, α and β.
This will change when we are estimating more parameters.
Adjusted R2 for wheat example
$R_c^2 = 1 - (1 - R^2)\left(\frac{n-1}{n-2}\right) = 1 - (1 - 0.846)\left(\frac{7-1}{7-2}\right) \approx 0.815$
Test on the correlation coefficient
H0: ρ = 0 versus H1: ρ ≠ 0
$t_{n-2} = \frac{r}{\sqrt{(1 - r^2)/(n-2)}}$
Test at the 5% level H0: ρ = 0 versus H1: ρ ≠ 0 for the wheat example. Recall that r = 0.92 and n = 7.
$t_{n-2} = \frac{r}{\sqrt{(1 - r^2)/(n-2)}} = \frac{0.92 - 0}{\sqrt{(1 - 0.92^2)/(7-2)}} \approx 5.25$
From our t table, we see that for 5 dof and a 2-tailed critical region, our cut-off points are −2.571 and 2.571.
[Graph: t5 distribution with critical regions of 0.025 in each tail, beyond −2.571 and 2.571.]
Since our t value of 5.25 is in the critical
region, we reject H0 and accept H1 that
the population correlation ρ is not zero.
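A minimal Python sketch of this t-test on the correlation coefficient, using the rounded r from the slides and the table cut-off of 2.571:

import math

# Minimal sketch: test H0: rho = 0 against H1: rho != 0 for the wheat data.
r, n = 0.92, 7
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))   # ≈ 5.25
t_crit = 2.571   # two-tailed 5% cut-off for 5 dof, from the t table
print(t_stat, abs(t_stat) > t_crit)   # True, so we reject H0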
If our regression line slope estimate b is close to zero, that would indicate that the true slope β might be zero.
To test if β equals zero, we need to know the distribution of b.
If the error ε is normally distributed with mean 0 and standard deviation σ, then b is normally distributed with mean β and standard deviation (or standard error)
$\sigma_b = \frac{\sigma}{\sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}.$
Then
$Z = \frac{b - \beta}{\sigma_b} = \frac{b - \beta}{\sigma \Big/ \sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$
is a standard normal variable.
Since we usually don’t know σ, we estimate it using SER = se, and use a tn−2 instead of the Z.
So for our test statistic, we have
$t_{n-2} = \frac{b - \beta}{SER \Big/ \sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = \frac{b - \beta}{s_b}$
For the wheat example, test at the 5% level H0: β = 0 vs. H1: β ≠ 0.
Recall: b = 0.059, n = 7, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000.
$t_{n-2} = t_5 = \frac{b - \beta}{SER \Big/ \sqrt{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = \frac{0.059 - 0}{5.94 \Big/ \sqrt{1{,}400{,}000 - \frac{1}{7}(2800)^2}} = \frac{0.059 - 0}{0.0112} \approx 5.27$
From our t table, we see that for 5 dof and a 2-tailed critical region, our cut-off points are −2.571 and 2.571.
[Graph: t5 distribution with critical regions of 0.025 in each tail, beyond −2.571 and 2.571.]
Since our t value of 5.27 is in the critical
region, we reject H0 and accept H1 that
the slope β is not zero.
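A minimal Python sketch of the t-test on the slope, using the rounded values from the slides:

import math

# Minimal sketch: test H0: beta = 0 against H1: beta != 0 for the wheat data.
b, n, ser = 0.059, 7, 5.94
sum_x, sum_x2 = 2800, 1_400_000

s_b = ser / math.sqrt(sum_x2 - sum_x ** 2 / n)   # ≈ 0.0112
t_stat = (b - 0) / s_b                           # ≈ 5.26 (5.27 on the slides, from rounding)
t_crit = 2.571                                   # two-tailed 5% cut-off for 5 dof
print(t_stat, abs(t_stat) > t_crit)              # True, so we reject H0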
Notice that the value of the statistic we calculated when testing H0: ρ = 0 vs. H1: ρ ≠ 0 was 5.25, which is very close to the value of 5.27 that we found for the statistic when testing H0: β = 0 vs. H1: β ≠ 0.
This is not a coincidence. When dealing with a
regression with a single X value on the right side of
the equation, testing whether there is a linear
correlation between the 2 variables (ρ = 0) and testing
whether the slope is zero (β = 0) are equivalent. Our
values differ only because of rounding error.
We can do an ANOVA test based on the amount of variation in
the dependent variable Y that is explained by the regression.
This is referred to as testing the significance of the regression.
H0: there is no linear relationship between X and Y (this is the same thing as β equals zero).
H1: there is a linear relationship between X and Y (this is the same thing as β is not zero).
The statistic is
$F_{1,\,n-2} = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)}$
Example: Test the significance of the regression in the wheat
problem at the 5% level. Recall SSR = 973.5 and SSE = 176.5.
$F_{1,\,n-2} = F_{1,5} = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} = \frac{973.5/1}{176.5/5} \approx 27.58$
The F table shows that for
1 and 5 degrees of freedom,
the 5% critical value is 6.61.
Since our F has a value of 27.58,
we reject H0: no linear relation
and accept H1: there is a linear
relation between wheat yield
and fertilizer.
[Graph: F1,5 distribution, with the acceptance region below the 5% critical value 6.61 and our statistic 27.58 falling in the critical region.]
For a regression with just one independent variable X on
the right side of the equation, testing the significance of the
regression is equivalent to testing whether the slope is zero.
Therefore, you might expect there to be a relationship
between the statistics used for these tests, and there is one.
The F-statistic for this test is the square of the t-statistic for the test on β.
In our wheat example, the t-statistic for the test on β was 5.27 and the critical value or cut-off point was 2.571.
For the F-test, the statistic was 27.58 ≈ (5.27)² and the critical value or cut-off point was 6.61 ≈ (2.571)². (The
numbers don’t match exactly because of rounding error.)
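A minimal Python sketch of the F-test for the significance of the regression, using SSR and SSE from the wheat example:

# Minimal sketch: F-test for the significance of the wheat regression.
ssr, sse, n = 973.5, 176.5, 7

msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse    # ≈ 27.58
f_crit = 6.61         # 5% critical value for F with 1 and 5 dof, from the F table
print(f_stat, f_stat > f_crit)   # True, so we reject H0: no linear relation

# With a single X variable, f_stat ≈ t_stat ** 2 and f_crit = t_crit ** 2
# (27.58 ≈ 5.27 ** 2 and 6.61 ≈ 2.571 ** 2, up to rounding).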
We can also calculate confidence intervals for
the slope β.
$b - t_{n-2}\,s_b \le \beta \le b + t_{n-2}\,s_b$
Calculate a 95% confidence interval for the slope β for the wheat example. Recall that b = 0.059, n = 7, and sb = 0.0112. We also found the critical values for a 2-tailed t with 5 dof are 2.571 and −2.571.
$b - t_{n-2}\,s_b \le \beta \le b + t_{n-2}\,s_b$
$0.059 - 2.571(0.0112) \le \beta \le 0.059 + 2.571(0.0112)$
$0.059 - 0.028 \le \beta \le 0.059 + 0.028$
$0.031 \le \beta \le 0.087$
Our 95% confidence interval
$0.031 \le \beta \le 0.087$
means that we are 95% sure that the true slope of the relationship is between 0.031 and 0.087.
Since zero is not in this interval, the results also imply that for a 5% test level, we would reject H0: β = 0 and accept H1: β ≠ 0.
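A minimal Python sketch of the 95% confidence interval for the slope, using the rounded values from the slides:

# Minimal sketch: 95% confidence interval for beta in the wheat example.
b, s_b, t_crit = 0.059, 0.0112, 2.571

margin = t_crit * s_b                # ≈ 0.029
low, high = b - margin, b + margin   # ≈ 0.030 to 0.088 (0.031 to 0.087 on the slides)
print(low, high)                     # zero is not in the interval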
Sometimes we want to calculate forecasting intervals for
predicted Y values.
For example, perhaps we’re working for an agricultural agency.
A farmer calls to ask us for an estimate of the wheat yield that
might be expected based on a particular fertilizer usage level on
the farmer’s wheat field.
We might reply that we are 95% certain that the yield would be
between 60 and 80 bushels per acre.
A representative from a cereal company might ask for an estimate
of the average wheat yield that might be expected based on that
same fertilizer usage level on many wheat fields.
To that question, we might reply that we are 95% certain that the
yield would be between 65 and 75 bushels per acre.
Our intervals would both be centered around the same number
(70 in this example), but we can give a more precise prediction
for an average of many fields, than we can for an individual field.
The width of our forecasting intervals also depends on our level
of expertise with the specified value of the independent variable.
Recall that the fertilizer values in our wheat problem had a mean
of 400 and were all between 100 and 700.
If someone asks about applying 2000 units of fertilizer to a
field, we would probably feel less comfortable with our
prediction than we would if the person asked about applying
500 units of fertilizer.
The closer the value of X is to the mean value of our sample, the
more comfortable we are with our numbers, and the narrower
the interval required for a particular confidence level.
Forecasting intervals for the individual case and for the
mean of many cases.
[Graph: the regression line with two bands around it. The outer band marks the upper and lower endpoints of the forecasting interval for the individual case; the inner band marks the upper and lower endpoints of the forecasting interval for the mean of many cases.]
Notice that the intervals for the individual case are wider than
those for the mean of many cases.
Also all the intervals are narrower
near the sample mean of the
independent variable.
For the given level of X requested by our callers, we
would have the following.
[Graph: at X = Xgiven, the confidence interval for the individual case runs from about 60 to 80, and the confidence interval for the mean of many cases runs from about 65 to 75, both centered at 70.]
Formulae for forecasting intervals
For both of the following intervals, $\hat{Y}_g = a + bX_g$.
forecasting interval for the individual case:
$\hat{Y}_g - t_{n-2}\,s_{ind} \le Y_g \le \hat{Y}_g + t_{n-2}\,s_{ind}$
where $s_{ind} = SER\sqrt{1 + \frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$
forecasting interval for the mean of many cases:
$\hat{Y}_g - t_{n-2}\,s_{mean} \le \mu_{Y \cdot X_g} \le \hat{Y}_g + t_{n-2}\,s_{mean}$
where $s_{mean} = SER\sqrt{\frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}}$
Example: If 550 pounds of fertilizer are applied in our wheat
example, find the 95% forecasting interval for the mean wheat
yield if we fertilized many fields.
Recall: a = 36.4, b = 0.059, n = 7, t5,.05 = 2.571, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000.
$\hat{Y}_g = a + bX_g = 36.4 + 0.059(550) = 68.8$
$s_{mean} = SER\sqrt{\frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = 5.94\sqrt{\frac{1}{7} + \frac{(550 - 400)^2}{1{,}400{,}000 - \frac{1}{7}(2800)^2}} \approx 2.81$
$\hat{Y}_g - t_{n-2}\,s_{mean} \le \mu_{Y \cdot X_g} \le \hat{Y}_g + t_{n-2}\,s_{mean}$
$68.8 - 2.571(2.81) \le \mu_{Y \cdot X_g} \le 68.8 + 2.571(2.81)$
$61.6 \le \mu_{Y \cdot X_g} \le 76.0$
Example: If 550 pounds of fertilizer are applied in our wheat
example, find the 95% forecasting interval for the wheat yield
if we fertilized one field.
Recall: a = 36.4, b = 0.059, n = 7, t5,.05 = 2.571, SER = 5.94, ΣX = 2800, ΣX² = 1,400,000.
$\hat{Y}_g = a + bX_g = 36.4 + 0.059(550) = 68.8$
$s_{ind} = SER\sqrt{1 + \frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum X^2 - \frac{1}{n}\left(\sum X\right)^2}} = 5.94\sqrt{1 + \frac{1}{7} + \frac{(550 - 400)^2}{1{,}400{,}000 - \frac{1}{7}(2800)^2}} \approx 6.56$
$\hat{Y}_g - t_{n-2}\,s_{ind} \le Y_g \le \hat{Y}_g + t_{n-2}\,s_{ind}$
$68.8 - 2.571(6.56) \le Y_g \le 68.8 + 2.571(6.56)$
$51.9 \le Y_g \le 85.7$
Notice that, as we stated previously, the
interval for the mean of many cases is
narrower than the interval for the
individual case.
$51.9 \le Y_g \le 85.7$ (individual case)
$61.6 \le \mu_{Y \cdot X_g} \le 76.0$ (mean of many cases)
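A minimal Python sketch (standard library only) that reproduces both forecasting intervals at Xg = 550, using the rounded values from the slides:

import math

# Minimal sketch: 95% forecasting intervals at X_g = 550 for the wheat example.
a, b, n, ser, t_crit = 36.4, 0.059, 7, 5.94, 2.571
sum_x, sum_x2 = 2800, 1_400_000
x_bar = sum_x / n         # 400
x_g = 550

y_hat_g = a + b * x_g                      # ≈ 68.8
ssx = sum_x2 - sum_x ** 2 / n              # 280,000

s_mean = ser * math.sqrt(1 / n + (x_g - x_bar) ** 2 / ssx)      # ≈ 2.81
s_ind = ser * math.sqrt(1 + 1 / n + (x_g - x_bar) ** 2 / ssx)   # ≈ 6.57

# Mean of many cases: ≈ 61.6 to 76.1 (61.6 to 76.0 on the slides).
print(y_hat_g - t_crit * s_mean, y_hat_g + t_crit * s_mean)
# Individual case: ≈ 52.0 to 85.7 (51.9 to 85.7 on the slides).
print(y_hat_g - t_crit * s_ind, y_hat_g + t_crit * s_ind)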