Regression
Econ 240A
Retrospective

Week One
• Descriptive statistics
• Exploratory Data Analysis

Week Two
• Probability
• Binomial Distribution

Week Three
• Normal Distribution
• Interval Estimation, Hypothesis Testing,
Decision Theory
Week Four
• Bivariate Relationships
• Correlation and Analysis of Variance
Outline
• A cognitive device to help understand the formulas for estimating the slope and the intercept, as well as the analysis of variance
• Table of Analysis of Variance (ANOVA) for regression
• F distribution for testing the significance of the regression, i.e. does the independent variable, x, significantly explain the dependent variable, y?
Outline (Cont.)
• The Coefficient of Determination, R2, and the Coefficient of Correlation, r
• Estimate of the error variance, s2
• Hypothesis tests on the slope, b
Part I: A Cognitive Device
A Cognitive Device: The Conceptual Model
• (1) yi = a + b*xi + ei
• Take expectations, E:
  (2) E yi = a + b*E xi + E ei, where we assume (3) E ei = 0
• Subtract (2) from (1) to obtain the model in deviations:
  (4) [yi - E yi] = b*[xi - E xi] + ei
• Multiply (4) by [xi - E xi] and take expectations:
A Cognitive Device (Cont.)
• (5) E{[yi - E yi][xi - E xi]} = b*E[xi - E xi]2 + E{ei [xi - E xi]}, where we assume E{ei [xi - E xi]} = 0, i.e. e and x are independent
• By definition, (6) cov yx = b*var x, i.e.
  (7) b = cov yx / var x
• The corresponding empirical estimate, by the method of moments:
  (8) b̂ = Σi [y(i) - ȳ][x(i) - x̄] / Σi [x(i) - x̄]2
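As a sketch of how the moment formulas work in practice, the slope (8) and intercept (9) can be computed directly from sample deviations; the data below are made up for illustration and are not from the course:

```python
# Method-of-moments estimates from the slides: (8) slope b_hat and
# (9) intercept a_hat = ybar - b_hat*xbar. Illustrative data only.
def slope_hat(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))  # ~ cov yx
    den = sum((xi - xbar) ** 2 for xi in x)                       # ~ var x
    return num / den

x = [0, 1, 2, 3, 4]
y = [1.0, 3.2, 4.9, 7.1, 9.0]            # roughly y = 1 + 2x plus noise
b_hat = slope_hat(x, y)
a_hat = sum(y) / len(y) - b_hat * (sum(x) / len(x))
print(round(b_hat, 2), round(a_hat, 2))  # 1.99 1.06
```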
A Cognitive Device (Cont.)
• The empirical counterpart to (2):
  ȳ = â + b̂*x̄, so (9) â = ȳ - b̂*x̄
• Square both sides of (4), and take expectations:
  (10) E[yi - E yi]2 = b2*E[xi - E xi]2 + 2E{ei*[xi - E xi]} + E[ei]2,
  where E{ei*[xi - E xi]} = 0, i.e. the explanatory variable x and the error e are assumed to be independent, cov ex = 0
A Cognitive Device (Cont.)
• From (10), by definition,
  (11) var y = b2 * var x + var e;
  this is the partition of the total variance in y into the variance explained by x, b2 * var x, and the unexplained or error variance, var e.
• The empirical counterpart to (11) is that the total sum of squares equals the explained sum of squares plus the unexplained sum of squares:
  (12) Σi [y(i) - ȳ]2 = b̂2 Σi [x(i) - x̄]2 + Σi [ê(i)]2
A Cognitive Device (Cont.)
• From Eq. (7), substitute for b in Eq. (11):
  var y = [cov yx]2 / var x + var e
• Divide by var y:
  1 = [cov yx]2 / (var y * var x) + var e / var y
• or 1 = r2 + var e / var y, where r is the correlation coefficient
Population Model and Sample Model Side by Side

Conceptual Vs. Fitted Model
Conceptual:
• (1) yi = a + b*xi + ei
• Take expectations, E:
  (2) E y = a + b*E x + E ei, (3) where E ei = 0
• Subtract (2) from (1):
  (4) [yi - E y] = b*[xi - E x] + ei
Fitted:
• (i) ŷ(i) = â + b̂*x(i)
• (ii) ê(i) = y(i) - ŷ(i)
• Minimize
  (iii) Σi [ê(i)]2 =
  (iv) Σi [y(i) - â - b̂x(i)]2
Conceptual Vs. Fitted (Cont.)
Conceptual:
• Multiply (4) by [xi - E x] and take expectations, E:
  E[yi - E y][xi - E x] = b*E[xi - E x]2 + E ei*[xi - E x],
  (5) where E ei*[xi - E x] = 0
• (6) cov[y*x] = b*var x
• (7) b = cov[y*x] / var x
Fitted:
• First order condition:
  (v) Σi [y(i) - â - b̂x(i)] = 0
  (vi) Σi ê(i) = 0
• Compare (3) & (vi)
• From (v), the fitted line goes through the sample means:
  (vii) ȳ = â + b̂x̄
Conceptual vs. Fitted (Cont.)
(viii) Σi [y(i) - â - b̂x(i)]*x(i) = 0
(ix) Σi ê(i)*x(i) = 0
Part II: ANOVA in Regression
ANOVA
• Testing the significance of the regression, i.e. does x significantly explain y?
• F1,n-2 = EMS/UMS
• Distributed with the F distribution with 1 degree of freedom in the numerator and n-2 degrees of freedom in the denominator
Table of Analysis of Variance (ANOVA)

Source of Variation   Sum of Squares         Degrees of Freedom   Mean Square
Explained, ESS        b̂2 Σi [x(i) - x̄]2      1                    b̂2 Σi [x(i) - x̄]2
Error, USS            Σi [ê(i)]2             n-2                  Σi [ê(i)]2 / (n-2)
Total, TSS            Σi [y(i) - ȳ]2         n-1                  Σi [y(i) - ȳ]2 / (n-1)

F1,n-2 = Explained Mean Square / Error Mean Square
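The entries of this table can be computed directly from the sums of squares; a sketch on illustrative data (not the lab's series):

```python
# Building the bivariate-regression ANOVA quantities: TSS, ESS, USS, and
# F(1, n-2) = EMS/UMS. Data are illustrative, not the UC budget series.
def anova_table(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    tss = sum((yi - ybar) ** 2 for yi in y)                      # total SS
    ess = b * b * sxx                                            # explained SS
    uss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))    # error SS
    f = (ess / 1) / (uss / (n - 2))                              # EMS / UMS
    return tss, ess, uss, f

tss, ess, uss, f = anova_table([0, 1, 2, 3, 4], [1.0, 3.2, 4.9, 7.1, 9.0])
print(round(abs(tss - (ess + uss)), 9))   # 0.0: TSS = ESS + USS, eq. (12)
```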
Example from Lab Four
• Linear Trend Model for UC Budget
[Chart: UC Budget, General Fund Component, Millions of Nominal $, by Fiscal Year, 1968-69 through 2004-05. Fitted linear trend: y = 81.613x + 19.497, R2 = 0.933; data label $2670.529]
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.965901681
R Square             0.932966058
Adjusted R Square    0.931050802
Standard Error       240.1544701
Observations         37

ANOVA
             df   SS            MS         F          Significance F
Regression   1    28094446.39   28094446   487.12355  3.992E-22
Residual     35   2018595.933   57674.17
Total        36   30113042.32

               Coefficients   Standard Error   t Stat     P-value    Lower 95%
Intercept      101.1096814    77.38814601      1.306527   0.1998941  -55.99679935
X Variable 1   81.61255073    3.697748649      22.07088   3.992E-22  74.10571271

RESIDUAL OUTPUT
Observation   Predicted Y   Residuals
1             101.1096814   190.1903186
2             182.7222321   146.5777679
Example from Lab Four
• Exponential trend model for UC Budget:
  UCBud(t) = exp[a + b*t + e(t)]
• Taking the logarithm of both sides:
  ln UCBud(t) = a + b*t + e(t)
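A sketch of this procedure: fit the exponential trend by running ordinary least squares on the logged series. The data here are synthetic (an exact exponential), standing in for the actual UC budget series:

```python
import math

# Log-linear trend fit: regress ln(budget) on t, then recover the level
# exp(a) and the growth rate b. Synthetic data, not the UC budget.
t = list(range(10))
budget = [0.40 * math.exp(0.065 * ti) for ti in t]

ly = [math.log(v) for v in budget]
n = len(t)
tbar, lbar = sum(t) / n, sum(ly) / n
b = sum((l - lbar) * (ti - tbar) for ti, l in zip(t, ly)) / sum((ti - tbar) ** 2 for ti in t)
a = lbar - b * tbar
print(round(math.exp(a), 3), round(b, 4))   # 0.4 0.065: level and growth rate recovered
```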
[Chart: UC Budget, $ Billions, by time index t = 0 to 40. Fitted exponential trend: y = 0.3623e^(0.0654x), R2 = 0.9131]
Time index, t = 0 for 1968-69, t=1 for 1969-70 etc
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.95557214
R Square             0.91311811
Adjusted R Square    0.91063577
Standard Error       0.22161289
Observations         37

ANOVA
             df   SS            MS         F          Significance F
Regression   1    18.06574216   18.06574   367.8458   3.77305E-20
Residual     35   1.718929586   0.049112
Total        36   19.78467174

               Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept      -0.94975318    0.071413248      -13.2994   3.01E-15   -1.094729955   -0.80478
X Variable 1   0.06544472     0.003412257      19.17931   3.77E-20   0.058517462    0.072372

RESIDUAL OUTPUT
Observation   Predicted Y    Residuals
1             -0.94975318    -0.283648439
2             -0.88430846    -0.226477634
3             -0.81886374    -0.272078047
4             -0.75341901    -0.268232233
5             -0.68797429    -0.267317175

Exp(-0.950) = 0.387
Part III: The F Distribution
The F Distribution
• The density function of the F distribution:
  f(F) = { [(n1+n2-2)/2]! / ( [(n1-2)/2]! [(n2-2)/2]! ) } * (n1/n2)^(n1/2) * F^((n1-2)/2) / [1 + n1F/n2]^((n1+n2)/2), for F > 0
• n1 and n2 are the numerator and denominator degrees of freedom.
The F Distribution
• This density function generates a rich family of distributions, depending on the values of n1 and n2.
[Plots: F densities for n1 = 5, n2 = 10 vs. n1 = 50, n2 = 10, and for n1 = 5, n2 = 10 vs. n1 = 5, n2 = 1]
Determining Values of F
• The values of the F variable can be found in the F table, Table 6(a) in Appendix B, for a type I error of 5%, or in Excel.
• The entries in the table are the values of the F variable for the right-hand tail probability A, for which P(Fn1,n2 > FA) = A.
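As a sketch of the software route (assuming scipy is available; it is not part of the course materials), the tabled critical value can be computed directly:

```python
from scipy.stats import f

# Critical value F_A with right-tail probability A = 5%, numerator df = 1,
# denominator df = 35, in place of Table 6(a). scipy is an assumption here.
f_crit = f.ppf(1 - 0.05, dfn=1, dfd=35)
print(round(f_crit, 2))   # 4.12
```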
[Repeat of the linear-trend chart and regression output above, with the F-statistic highlighted.]
F1,35 = (n-2)*[R2/(1 - R2)] = 35*(0.933/0.067) = 487
30
1 dof
35 dof
F1,35 = 4.12
31
Part IV: The Pearson Coefficient of Correlation, r
• The Pearson coefficient of correlation, r, is
  (13) r = cov yx / {[var x]1/2 [var y]1/2}
• Estimated counterpart:
  (14) r̂ = Σi [y(i) - ȳ][x(i) - x̄] / { (Σi [y(i) - ȳ]2)1/2 (Σi [x(i) - x̄]2)1/2 }
• Comparing (13) to (7), note that
  (15) r * {[var y]1/2 / [var x]1/2} = b
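Equations (14) and (15) can be checked numerically; a sketch on illustrative data (not from the course):

```python
import math

# Estimated correlation (14), and the identity (15): the slope equals r_hat
# times the ratio of the y and x deviation magnitudes. Illustrative data.
x = [0, 1, 2, 3, 4]
y = [1.0, 3.2, 4.9, 7.1, 9.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxy = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)

r_hat = sxy / math.sqrt(sxx * syy)        # eq. (14)
b_hat = sxy / sxx                         # eq. (8)
print(round(r_hat, 4))                    # close to 1: nearly perfect linear fit
print(abs(r_hat * math.sqrt(syy / sxx) - b_hat) < 1e-12)   # True: eq. (15)
```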
A Cognitive Device (Cont.)
[Repeat, for reference, of the earlier cognitive-device slide: (7) b = cov yx / var x, with empirical counterpart (8) b̂ = Σi [y(i) - ȳ][x(i) - x̄] / Σi [x(i) - x̄]2.]
Part IV (Cont.): The Coefficient of Determination, R2
• For a bivariate regression of y on a single explanatory variable, x, R2 = r2, i.e. the coefficient of determination equals the square of the Pearson coefficient of correlation
• Using (14) to square the estimate of r:
  (16) [r̂]2 = {Σi [y(i) - ȳ][x(i) - x̄]}2 / { Σi [y(i) - ȳ]2 Σi [x(i) - x̄]2 }
Part IV (Cont.)
• Using (8), (16) can be expressed as
  (19) r̂2 = b̂2 Σi [x(i) - x̄]2 / Σi [y(i) - ȳ]2 = ESS/TSS
• And so
  (20) 1 - r̂2 = 1 - [ESS/TSS] = [TSS - ESS]/TSS = USS/TSS
• In general, including multivariate regression, the estimate of the coefficient of determination, R̂2, can be calculated from
  (21) R̂2 = 1 - USS/TSS
Part IV (Cont.)
• For the bivariate regression, the F-test can be calculated from
  F1,n-2 = [(n-2)/1] [ESS/TSS] / [USS/TSS]
  F1,n-2 = [(n-2)/1] [ESS/USS] = (n-2) R̂2 / [1 - R̂2]
• For a multivariate regression with k explanatory variables, the F-test can be calculated as
  Fk,n-k-1 = [(n-k-1)/k] [ESS/USS]
  Fk,n-k-1 = [(n-k-1)/k] R̂2 / [1 - R̂2]
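The bivariate formula can be checked against the Lab Four linear-trend output, which reports R Square = 0.932966058, 37 observations, and F = 487.12355:

```python
# F = (n-2) * R^2 / (1 - R^2), with R^2 and n taken from the Excel output above.
r2, n = 0.932966058, 37
f_stat = (n - 2) * r2 / (1 - r2)
print(round(f_stat, 2))   # 487.12, matching the ANOVA table's F
```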
[Repeat of the linear-trend regression output above, with the sums of squares highlighted.]
R2 = 1 - USS/TSS = 1 - 2,018,596/30,113,042 = 0.933
Part V: Estimate of the Error Variance
• Var ei = s2
• The estimate is the unexplained mean square, UMS:
  ŝ2 = Σi [ê(i)]2 / (n-2) = Σi [y(i) - ŷ(i)]2 / (n-2)
• The standard error of the regression is ŝ
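A quick check of this estimate against the Lab Four output, which reports Residual SS = 2018595.933 with 35 degrees of freedom and Standard Error = 240.1544701:

```python
import math

# s_hat^2 = UMS = USS/(n-2); its square root is the "Standard Error" line
# in the Excel regression output. Numbers copied from the output above.
uss, df = 2018595.933, 35
s2_hat = uss / df          # the unexplained mean square
s_hat = math.sqrt(s2_hat)  # the standard error of the regression
print(round(s2_hat, 2), round(s_hat, 2))   # 57674.17 240.15
```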
[Repeat of the linear-trend regression output above, with the standard error highlighted.]
ŝ = 240.15 = (UMS)1/2 = (57674.17)1/2
Part VI: Hypothesis Tests on the Slope
• Hypotheses: H0: b = 0; HA: b > 0
• Test statistic:
  t = [b̂ - E(b̂)] / ŝ(b̂), where E(b̂) = b under the H0
• Set the probability for the type I error, say 5%
• Note: for bivariate regression, the square of the t-statistic for the null that the slope is zero is the F-statistic
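Applying this test statistic to the Lab Four output (slope 81.61255073, standard error 3.697748649, copied from the Excel table):

```python
# Slope t-statistic under H0: b = 0, and the bivariate identity t^2 = F.
b_hat, se_b = 81.61255073, 3.697748649
t_stat = (b_hat - 0) / se_b      # E(b_hat) = 0 under the null
print(round(t_stat, 2))          # 22.07, the reported t Stat
print(round(t_stat ** 2, 1))     # 487.1: t^2 matches the reported F
```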
t = [81.6 - 0]/3.70 = 22.1
[Repeat of the linear-trend regression output above, with the slope's t Stat highlighted.]
t2 = F, i.e. 22.1*22.1 = 488 ≈ 487
Part VII: Student’s t-Distribution
The Student t Distribution
• The Student t density function:
  f(t) = { [(n-1)/2]! / ( (nπ)1/2 [(n-2)/2]! ) } [1 + t2/n]^(-(n+1)/2)
• n is the parameter of the Student t distribution
• E(t) = 0, V(t) = n/(n-2) (for n > 2)
The Student t Distribution
[Plots: Student t densities for n = 3 and n = 10]
Determining Student t Values
• The Student t distribution is used extensively in statistical inference.
• Thus, it is important to determine values of tA associated with a given number of degrees of freedom.
• We can do this using:
  – t tables (Table 4, Appendix B)
  – Excel
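As a sketch of the software route (scipy assumed available, standing in for Excel), the tabled value with 10 degrees of freedom:

```python
from scipy.stats import t

# Right-tail value t_A with A = .05 and 10 degrees of freedom,
# computed in software instead of Table 4. scipy is an assumption here.
t_05 = t.ppf(1 - 0.05, df=10)
print(round(t_05, 3))   # 1.812, the Table 4 entry
```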
Using the t Table
• The table provides the t values (tA) for which P(tn > tA) = A.
• The t distribution is symmetrical around 0: for A = .05 and 10 degrees of freedom, tA = 1.812 and -tA = -1.812.

Degrees of Freedom   t.100   t.05    t.025    t.01     t.005
1                    3.078   6.314   12.706   31.821   63.657
2                    1.886   2.92    4.303    6.965    9.925
.
.
10                   1.372   1.812   2.228    2.764    3.169
.
.
200                  1.286   1.653   1.972    2.345    2.601
∞                    1.282   1.645   1.96     2.326    2.576
Problem 6.32 in Text
Table of Joint Probabilities

             Manual Calc.   Computer Calc.
Quant Ed.    0.23           0.36
Other Ed.    0.11           0.30
Problem 6.32
The method of instruction in college and university applied statistics courses is changing. Historically, most courses were taught with an emphasis on manual calculation. The alternative is to employ a computer and a software package to perform the calculations. An analysis of applied statistics courses investigated whether the instructor's educational background is primarily mathematics (or statistics) or some other field.
Problem 6.32
• A. What is the probability that a randomly selected applied statistics course instructor whose education was in statistics emphasizes manual calculations?
• B. What proportion of applied statistics courses employ a computer and software?
• C. Are the educational background of the instructor and the way his or her course is taught independent?
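A sketch of the arithmetic behind the three parts, reading the joint-probability table above (part A is interpreted as conditioning on the quantitative/statistics row):

```python
# Joint probabilities from the table: rows are education, columns are method.
p_qm, p_qc = 0.23, 0.36   # quantitative ed.: manual, computer
p_om, p_oc = 0.11, 0.30   # other ed.: manual, computer

p_quant = p_qm + p_qc                # marginal P(quantitative ed.) = 0.59
p_manual = p_qm + p_om               # marginal P(manual) = 0.34
print(round(p_qm / p_quant, 2))      # A: P(manual | quant ed.) = 0.39
print(round(p_qc + p_oc, 2))         # B: P(computer) = 0.66
print(abs(p_qm - p_quant * p_manual) < 1e-9)   # C: False, so not independent
```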
Midterm 2000
• (15 points) The following table shows the results of regressing the natural logarithm of California General Fund expenditures, in billions of nominal dollars, against year, beginning in 1968 and ending in 2000. A plot of actual, estimated, and residual values follows.
  – How much of the variance in the dependent variable is explained by trend?
  – What is the meaning of the F statistic in the table? Is it significant?
  – Interpret the estimated slope.
  – If General Fund expenditures were $68.819 billion in California for fiscal year 2000-2001, provide a point estimate for state expenditures for 2001-2002.
Midterm 2000 (Cont.)
• A state senator believes that state expenditures in nominal dollars have grown over time at 7% a year. Is the senator in the ballpark, or is his impression significantly below the estimated rate, using a 5% level of significance?
• If you were an aide to the Senator, how might you criticize this regression?
Table
Dependent Variable: LNGENFND
Method: Least Squares
Sample: 1968 2000
Included observations: 33

Variable   Coefficient   Std. Error   t-Statistic   Prob.
YEAR       0.086958      0.003895     22.32804      0.0000
C          -169.4787     7.726922     -21.93353     0.0000

R-squared            0.941459   Mean dependent var      3.046404
Adjusted R-squared   0.939570   S.D. dependent var      0.866594
S.E. of regression   0.213030   Akaike info criterion   -0.196076
Sum squared resid    1.406835   Schwarz criterion       -0.105379
Log likelihood       5.235258   F-statistic             498.5416
Durbin-Watson stat   0.118575   Prob(F-statistic)       0.000000
Plot
[Plot: Actual, Fitted, and Residual values from the regression of the Logarithm of General Fund Expenditures ($B) on Year, 1970–2000]