Inference for Regression
Inference about the Regression Model and
Using the Regression Line
Chapter 10, Guan ch10, 11
Research question

Do we have a negative demand curve?

Chen et al (2006) JPE
[Figure: demand curve with prices P1, P2, P3 on the P axis and quantity Q]

P    Q
10   2
10   3
10   4
10   5
 7   4
 7   5
 7   6
 7  10
 4   5
 4   9
 4  15
 4  20

What is the demand elasticity, (dQ/Q)/(dP/P)?
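As an illustration (not in the original slides), the elasticity can be estimated as the slope of a log-log regression of Q on P using the data above; a minimal NumPy sketch:

```python
import numpy as np

# Price-quantity pairs from the table above.
P = np.array([10, 10, 10, 10, 7, 7, 7, 7, 4, 4, 4, 4], dtype=float)
Q = np.array([2, 3, 4, 5, 4, 5, 6, 10, 5, 9, 15, 20], dtype=float)

# In the log-log regression ln Q = b0 + b1 ln P, the slope b1 estimates
# the demand elasticity (dQ/Q)/(dP/P).
x, y = np.log(P), np.log(Q)
b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()
print(b1)  # a negative slope indicates a downward-sloping demand curve
```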
Research question

Does garbage predict stock returns?
 "Asset Pricing with Garbage" (Journal of Finance, 2011), by Alexi Savov.
 Let d(garbage) be the annual growth in total garbage. For a 1% increase in garbage growth, we get a 2.44% increase in stock returns.
Objectives
Inference about the regression model and using the regression line
 Statistical model for simple linear regression
 Estimating the regression parameters
 Conditions for regression inference
 Confidence intervals and significance tests
 Inference about correlation
 Confidence and prediction intervals for μy
 Standard errors
 Analysis of variance for regression
 Some advanced issues
[Figure: scatterplot of one random sample with its fitted least-squares line, ŷ = 0.125x + 41.4]
The data in a scatterplot are a random sample from a population that may exhibit a linear relationship between x and y. Different sample → different plot.
Now we want to describe the population mean response μy as a function of the explanatory variable x: μy = β0 + β1x,
and to assess whether the observed relationship is statistically significant (not entirely explained by chance events due to random sampling).
Some terminology

Y variable (應變數; 因變數; 被解釋變數; 反應變數)
   Dependent variable
   Response
   Regressand
 X variable (自變數; 解釋變數)
   Independent variable
   Explanatory variable
 Ordinary least squares (普通最小平方法)
 Residual (殘差)
Statistical model for simple linear regression
In the population, the linear regression equation is μy = β0 + β1x.
Sample data then fit the model:
Data = fit + residual
$y_i = (\beta_0 + \beta_1 x_i) + (\varepsilon_i)$
where the εi are independent and identically distributed N(0, σ) (written εi i.i.d. ~ N(0, σ)).
Linear regression assumes equal variance of y
(σ is the same for all values of x).
Estimating the regression parameters
μy = β0 + β1x
The intercept (截距) β0, the slope (斜率) β1, and the standard deviation σ of y are the unknown parameters of the regression model. We rely on the random sample data to provide unbiased estimates of these parameters.
 The value of ŷ from the least-squares regression line is really a prediction of the mean value of y (μy) for a given value of x.
 The least-squares regression line (ŷ = b0 + b1x) obtained from sample data is the best estimate of the true population regression line (μy = β0 + β1x).
ŷ: unbiased estimate for the mean response μy
b0: unbiased estimate for the intercept β0
b1: unbiased estimate for the slope β1
The population standard deviation σ for y at any given value of x represents the spread of the normal distribution of the εi around the mean μy.
The regression standard error, s, for n sample data points is
calculated from the residuals (yi – ŷi):
$s = \sqrt{\dfrac{\sum \text{residual}^2}{n-2}} = \sqrt{\dfrac{\sum_i (y_i - \hat y_i)^2}{n-2}}$
s estimates the regression standard deviation σ (s² is an unbiased estimate of the regression variance σ²).
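A minimal sketch of this computation (plain NumPy; the array names are illustrative):

```python
import numpy as np

def regression_std_error(y, y_hat):
    """Regression standard error: s = sqrt(sum of squared residuals / (n - 2))."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    residuals = y - y_hat
    return np.sqrt((residuals ** 2).sum() / (len(y) - 2))
```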
Conditions for regression inference

The observations are independent.
 The relationship is indeed linear.
   β0 + β1X² is a linear model (linear in the parameters).
   β0 + X^β1 is not a linear model (nonlinear in the parameters).
 The standard deviation of y, σ, is the same for all values of x.
 The response y varies normally around its mean.
Using residual plots to check for regression validity
The residuals (y − ŷ) give useful information about the contribution of
individual data points to the overall pattern of scatter.
We view the residuals in
a residual plot:
If the residuals are scattered randomly around 0 with uniform variation, this indicates that the data fit a linear model, with normally distributed residuals for each value of x and constant standard deviation σ.
Residuals are randomly scattered → good!
Curved pattern → the relationship is not linear.
Change in variability across the plot → σ is not equal for all values of x.
What is the relationship between job stress and locus of control (LOC,內/
外控量表)?
We plot stress level against LOC for a random sample of 100 accountants. The
relationship is approximately linear with no outliers.
Residual plot:
The spread of the residuals is reasonably random, with no clear pattern. The relationship is indeed linear.
Normal quantile plot for residuals:
The plot is fairly straight, supporting the assumption of normally distributed residuals.
→ Data okay for inference.
Look in more detail at the statistical properties and inference:
Ordinary Least Squares Method
Find β0 and β1 to minimize the errors. We take the squared errors to avoid the positive and negative values cancelling each other out.
Objective function:
$Q(\beta_0, \beta_1) = \dfrac{1}{n}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2 = \dfrac{1}{n}\sum_{i=1}^n U_i^2.$
 OLS finds the line that makes the sum of squared errors minimal.
 We then get the first-order conditions:
$\dfrac{\partial}{\partial \beta_0} Q(\beta_0, \beta_1) = -\dfrac{2}{n}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i) = 0,$
$\dfrac{\partial}{\partial \beta_1} Q(\beta_0, \beta_1) = -\dfrac{2}{n}\sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)\,X_i = 0.$
Ordinary Least Squares Method

From the first-order conditions, we solve for the ordinary least squares estimators (最小平方估計式, OLS estimators):
$\hat\beta_{1,n} = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sum_{i=1}^n (X_i - \bar X_n)^2},$
$\hat\beta_{0,n} = \bar Y_n - \hat\beta_{1,n}\,\bar X_n.$
 What happens if X = X₀, a constant?
 What happens if we restrict β0 = 0?
Regression without intercept: $\hat Y = \hat\beta_{1,n} X$
Estimated regression line (估計的迴歸線): $\hat Y = \hat\beta_{0,n} + \hat\beta_{1,n} X$
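A minimal sketch of the OLS estimator formulas above (plain NumPy; the function name is illustrative):

```python
import numpy as np

def ols_fit(x, y):
    """OLS estimates from the first-order conditions:
    b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2); b0 = Ybar - b1 * Xbar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1
```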
Some other expressions for OLS estimator

$\hat\beta_{1,n}$ can also be expressed as:
$\hat\beta_{1,n} = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sum_{i=1}^n (X_i - \bar X_n)^2} = \dfrac{\sum_{i=1}^n X_i Y_i - n\bar X_n \bar Y_n}{\sum_{i=1}^n X_i^2 - n\bar X_n^2} = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)\,Y_i}{\sum_{i=1}^n (X_i - \bar X_n)^2} = \sum_{i=1}^n K_i Y_i, \quad \text{where } K_i = \dfrac{X_i - \bar X_n}{\sum_{j=1}^n (X_j - \bar X_n)^2}.$
Some properties under OLS

Given $\hat U_i = Y_i - \hat Y_i$, we have the following properties from the F.O.C.:
$\sum_{i=1}^n \hat U_i = 0, \qquad \sum_{i=1}^n X_i \hat U_i = 0, \qquad \sum_{i=1}^n \hat Y_i \hat U_i = 0.$
 The estimated regression line must pass through $(\bar X_n, \bar Y_n)$.
Confidence interval for regression parameters
Estimating the regression parameters β0, β1 is a case of one-sample inference with unknown population variance.
→ We rely on the t distribution, with n − 2 degrees of freedom.
A level C confidence interval for the slope β1 is: b1 ± t* SEb1
A level C confidence interval for the intercept β0 is: b0 ± t* SEb0
t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
What SEb1 and SEb0 are, we leave to a later section.
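A minimal sketch of the slope interval (SciPy's t distribution; the inputs b1, se_b1, and n are assumed to be already computed):

```python
from scipy import stats

def slope_confidence_interval(b1, se_b1, n, level=0.95):
    """Level-C confidence interval for the slope: b1 +/- t* SE(b1), df = n - 2."""
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)
    return b1 - t_star * se_b1, b1 + t_star * se_b1
```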
Significance test for the slope
We can test the hypothesis H0: β1 = 0 versus a one- or two-sided alternative.
We calculate
t = b1 / SEb1
which has the t (n – 2)
distribution to find the
p-value of the test.
Note: Software typically provides
two-sided p-values.
Testing the hypothesis of no relationship
We may look for evidence of a significant relationship between variables x and y in the population from which our data were drawn.
For that, we can test the hypothesis that the regression slope parameter β1 is equal to zero.
H0: β1 = 0 vs. Ha: β1 ≠ 0
Since the slope $b_1 = r\,\dfrac{s_y}{s_x}$, testing H0: β1 = 0 also allows us to test the hypothesis of no correlation between x and y in the population.
Note: A test of hypothesis for β0 is usually irrelevant (the value x = 0, at which β0 is the mean response, is often not even attainable).
Using technology
Computer software runs all the computations for regression analysis.
Here is some software output for the stress/LOC example.
[Software output: the "Intercept" row gives the intercept, the "LOC" row gives the slope; the output also reports p-values for the tests of significance and confidence intervals.]
The t-test for the regression slope is highly significant (p = 0.002).
There is a significant relationship between stress and LOC.
Inference for correlation
To test the null hypothesis of no linear association, we also have the choice of using the correlation parameter ρ.
 When x is clearly the explanatory variable, this test is equivalent to testing the hypothesis H0: β1 = 0, since $b_1 = r\,\dfrac{s_y}{s_x}$.
 When there is no clear explanatory variable (e.g., arm length vs. leg length), a regression of x on y is no more legitimate than one of y on x. In that case, the correlation test of significance should be used.
 When both x and y are normally distributed, H0: ρ = 0 tests for no association of any kind between x and y, not just linear associations.
The test of significance for ρ uses the one-sample t-test for H0: ρ = 0.
We compute the t statistic for sample size n and correlation coefficient r:
$t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$
The p-value is the area under t(n − 2) for values of T as extreme as t or more in the direction of Ha.
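A minimal sketch of this t-test for ρ (SciPy; reports a two-sided p-value, as software typically does):

```python
import numpy as np
from scipy import stats

def correlation_t_test(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2); two-sided p-value from t(n - 2)."""
    t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
    p_two_sided = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p_two_sided
```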
Confidence Interval for Regression Response
Using inference, we can also calculate a confidence interval for the
population mean μy of all responses y when x takes the value x*
(within the range of data tested):
This interval is centered on ŷ, the unbiased estimate of μy.
The true value of the population mean μy at a given
value of x will indeed be within our confidence
interval in C% of all intervals calculated
from many different random samples.
The level C confidence interval for the mean response μy at a given value x* of x is:
ŷ ± t* SEμ̂
t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*. We again leave the formula for SEμ̂ to a later section.
A separate confidence interval is calculated for μy at each of the values that x takes. Graphically, the series of confidence intervals is shown as a continuous band on either side of ŷ.
[Figure: 95% confidence interval for μy]
Prediction interval for regression response
One use of regression is for predicting the value of y, ŷ, for any value
of x within the range of data tested: ŷ = b0 + b1x.
But the regression equation depends on the particular sample drawn.
More reliable predictions require statistical inference:
To estimate an individual response y for a given value of x, we use a
prediction interval.
If we randomly sampled many times, there
would be many different values of y
obtained for a particular x following
N(0, σ) around the mean response µy.
The level C prediction interval for a single observation on y when x takes the value x* is:
ŷ ± t* SEŷ
t* is the critical value for the t(n − 2) distribution with area C between −t* and +t*.
The prediction interval represents mainly the error from the normal distribution of the residuals εi. Graphically, the series of prediction intervals is shown as a continuous band on either side of ŷ.
[Figure: 95% prediction interval for ŷ]
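A minimal sketch computing both intervals at a given x* (NumPy/SciPy, using the standard-error formulas given in a later section):

```python
import numpy as np
from scipy import stats

def mean_ci_and_prediction_interval(x, y, x_star, level=0.95):
    """Confidence interval for the mean response and prediction interval at x*."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    sxx = ((x - x.mean()) ** 2).sum()
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(((y - b0 - b1 * x) ** 2).sum() / (n - 2))  # regression std. error
    y_hat = b0 + b1 * x_star
    se_mu = s * np.sqrt(1 / n + (x_star - x.mean()) ** 2 / sxx)        # mean response
    se_pred = s * np.sqrt(1 + 1 / n + (x_star - x.mean()) ** 2 / sxx)  # one new y
    t_star = stats.t.ppf((1 + level) / 2, df=n - 2)
    ci = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return ci, pi
```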
1918 flu epidemic
We look at the relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.
[Figures: weekly # cases diagnosed and # deaths reported, weeks 1–17 of the 1918 influenza epidemic]
The line graphs suggest that about 7 to 8% of those diagnosed with the flu died within about a week of their diagnosis.

1918 influenza epidemic data:
Date      # Cases   # Deaths
week 1       36         0
week 2      531         0
week 3     4233       130
week 4     8682       552
week 5     7164       738
week 6     2229       414
week 7      600       198
week 8      164        90
week 9       57        56
week 10     722        50
week 11    1517        71
week 12    1828       137
week 13    1539       178
week 14    2416       194
week 15    3148       290
week 16    3465       310
week 17    1440       149

r = 0.91
1918 flu epidemic: Relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.
EXCEL
Regression Statistics:
Multiple R = 0.911; R Square = 0.830; Adjusted R Square = 0.82; Standard Error = 85.07 (s); Observations = 16

            Coefficients    St. Error       t Stat   P-value   Lower 95%   Upper 95%
Intercept    49.292          29.845          1.652    0.1209    −14.720     113.304
FluCases0     0.072 (b1)      0.009 (SEb1)   8.263    0.0000      0.053       0.091

The P-value for H0: β1 = 0 is very small → reject H0 → β1 is significantly different from 0.
There is a significant relationship between the number of flu cases and the number of deaths from flu a week later.
SPSS
CI for the mean weekly death count one week after 4000 flu cases are diagnosed: μy within about 300–380.
Prediction interval for a weekly death count one week after 4000 flu cases are diagnosed: ŷ within about 180–500 deaths.
[Figure: least-squares regression line with the 95% prediction interval for ŷ and the 95% confidence interval for μy]
Standard errors
To estimate the parameters of the regression, we calculate the standard errors of the estimated regression coefficients.
The standard error of the least-squares slope b1 is:
$SE_{b_1} = \dfrac{s}{\sqrt{\sum_i (x_i - \bar x)^2}}$
The standard error of the intercept b0 is:
$SE_{b_0} = s\sqrt{\dfrac{1}{n} + \dfrac{\bar x^2}{\sum_i (x_i - \bar x)^2}}$
To estimate or predict future responses, we calculate the following standard errors.
The standard error of the mean response μy at x* is:
$SE_{\hat\mu} = s\sqrt{\dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$
The standard error for predicting an individual response ŷ at x* is:
$SE_{\hat y} = s\sqrt{1 + \dfrac{1}{n} + \dfrac{(x^* - \bar x)^2}{\sum_i (x_i - \bar x)^2}}$
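A minimal sketch of the coefficient standard errors (plain NumPy; s is the regression standard error computed earlier):

```python
import numpy as np

def coefficient_standard_errors(x, s):
    """SE of the OLS slope and intercept, given regression standard error s."""
    x = np.asarray(x, float)
    n = len(x)
    sxx = ((x - x.mean()) ** 2).sum()
    se_b1 = s / np.sqrt(sxx)
    se_b0 = s * np.sqrt(1 / n + x.mean() ** 2 / sxx)
    return se_b1, se_b0
```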
Analysis of variance for regression
The regression model is:
Data = fit + residual
$y_i = (\beta_0 + \beta_1 x_i) + (\varepsilon_i)$
where the εi are independent and identically distributed N(0, σ), and σ is the same for all values of x.
It resembles an ANOVA, which also assumes equal variance, where
SST = SSmodel + SSerror and DFT = DFmodel + DFerror
Goodness of fit (配適度)

Different independent (X) variables may have different explanatory power for the dependent variable (Y). If we can find a model that better explains the variation in Y, then we have a better model interpretation. The criterion used to select models is goodness of fit (配適度).
 For example, if the inflation rate has a better goodness of fit for stock returns than the GDP growth rate does, then we say the inflation rate is a better model for explaining the pattern of stock returns.
Goodness of fit

We look at the variation in Y by detrending (removing the mean of Y):
$Y_i - \bar Y_n = (\hat Y_i + \hat U_i) - \bar Y_n = (\hat Y_i - \bar Y_n) + \hat U_i,$
$Y_i - \bar Y_n = (\hat Y_i - \bar{\hat Y}_n) + \hat U_i,$
$\sum_{i=1}^n (Y_i - \bar Y_n)^2 = \sum_{i=1}^n (\hat Y_i - \bar{\hat Y}_n)^2 + \sum_{i=1}^n \hat U_i^2 + 2\sum_{i=1}^n (\hat Y_i - \bar{\hat Y}_n)\hat U_i,$
$\sum_{i=1}^n (Y_i - \bar Y_n)^2 = \sum_{i=1}^n (\hat Y_i - \bar{\hat Y}_n)^2 + \sum_{i=1}^n \hat U_i^2.$
 The first term is the Sum of Squares Total (SST, also called TSS; 總平方和).
 The second term is the Sum of Squares Regression (SSR, or RSS, or SSM; 迴歸平方和).
   Some econometricians call this term SSE (Sum of Squares Explained), which makes things confusing!
 The third term is the Sum of Squared Errors (SSE; 殘差平方和).
   Some econometricians call this term SSR (Sum of Squared Residuals), which makes things confusing!
Degrees of Freedom
 In SST, we use one degree of freedom in calculating the sample mean, so the df for SST is n − 1.
 SSE comes from OLS, which imposes two first-order conditions, so we lose two degrees of freedom. The df for SSE is n − 2.
 The difference between the degrees of freedom for SST and SSE is the df for SSR (or SSM), which is 1.
For a simple linear relationship, the ANOVA tests the hypotheses
H0: β1 = 0 versus Ha: β1 ≠ 0
by comparing MSM (model) to MSE (error): F = MSM/MSE (only valid
for simple regression model)
When H0 is true, F follows
the F(1, n − 2) distribution.
The p-value is P(F > f ).
The ANOVA test and the two-sided t-test for H0: β1 = 0 yield the same p-value.
Software output for regression may provide t, F, or both, along with the p-value.
ANOVA table
Source               Sum of squares SS                  DF      Mean square MS   F          P-value
Model (Regression)   $\sum_i (\hat y_i - \bar y)^2$     1       SSM/DFM          MSM/MSE    Tail area above F
Error (Residual)     $\sum_i (y_i - \hat y_i)^2$        n − 2   SSE/DFE
Total                $\sum_i (y_i - \bar y)^2$          n − 1

SST = SSM + SSE
DFT = DFM + DFE
The regression standard error, s, for n sample data points is calculated from the residuals ei = yi − ŷi:
$s^2 = \dfrac{\sum_i e_i^2}{n-2} = \dfrac{\sum_i (y_i - \hat y_i)^2}{n-2} = \dfrac{SSE}{DFE} = MSE$
s² is an unbiased estimate of the regression variance σ².
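A minimal sketch of the ANOVA computation (NumPy/SciPy; returns the quantities in the table above):

```python
import numpy as np
from scipy import stats

def regression_anova(x, y):
    """ANOVA for simple linear regression: returns SSM, SSE, F, and p-value."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    y_hat = b0 + b1 * x
    ssm = ((y_hat - y.mean()) ** 2).sum()  # model sum of squares, df = 1
    sse = ((y - y_hat) ** 2).sum()         # error sum of squares, df = n - 2
    f = (ssm / 1) / (sse / (n - 2))        # F = MSM / MSE
    p = stats.f.sf(f, 1, n - 2)            # tail area above F
    return ssm, sse, f, p
```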
Coefficient of determination, r2
The coefficient of determination, r² (判定係數), the square of the correlation coefficient, is the percentage of the variance in y (vertical scatter from the regression line) that can be explained by changes in x.
r² = (variation in y caused by x, i.e., by the regression line) / (total variation in the observed y values around the mean)
$r^2 = \dfrac{\sum_i (\hat y_i - \bar y)^2}{\sum_i (y_i - \bar y)^2} = \dfrac{SSM}{SST}$
Two types of coefficient of determination
We have two types of R²: the centered (置中) and the non-centered (非置中) coefficient of determination. When there is no indication, it usually refers to the former.
$\text{Centered } R^2 = \dfrac{\sum_{i=1}^n (\hat Y_i - \bar Y_n)^2}{\sum_{i=1}^n (Y_i - \bar Y_n)^2} = 1 - \dfrac{\sum_{i=1}^n \hat U_i^2}{\sum_{i=1}^n (Y_i - \bar Y_n)^2}.$
$\text{Non-centered } R^2 = \dfrac{\sum_{i=1}^n \hat Y_i^2}{\sum_{i=1}^n Y_i^2} = 1 - \dfrac{\sum_{i=1}^n \hat U_i^2}{\sum_{i=1}^n Y_i^2}.$
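A minimal sketch of the two versions (plain NumPy):

```python
import numpy as np

def r_squared(y, y_hat, centered=True):
    """Centered R^2 = 1 - SSE / sum((Yi - Ybar)^2);
    non-centered R^2 = 1 - SSE / sum(Yi^2)."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    sse = ((y - y_hat) ** 2).sum()
    denom = ((y - y.mean()) ** 2).sum() if centered else (y ** 2).sum()
    return 1 - sse / denom
```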
How can we interpret R²?
 The higher R² is, the more of the variation in Y is captured by X, and the better the goodness of fit.
   When R² = 1, there is no residual; we call this a perfect fit (完全配適).
   When R² = 0, SSE = SST, so there is no explanatory power at all.
Some properties of R²
 What if we let Y′ = Y/100?
 A different expression:
$R^2 = \dfrac{\left[\sum_{i=1}^n (Y_i - \bar Y_n)(\hat Y_i - \bar{\hat Y}_n)\right]^2}{\left[\sum_{i=1}^n (Y_i - \bar Y_n)^2\right]\left[\sum_{i=1}^n (\hat Y_i - \bar{\hat Y}_n)^2\right]} = \dfrac{\left[\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)\right]^2}{\left[\sum_{i=1}^n (Y_i - \bar Y_n)^2\right]\left[\sum_{i=1}^n (X_i - \bar X_n)^2\right]}.$
Interpreting regression results
Using software: SPSS
r² = SSM/SST = 2.157/22.114
ANOVA and the t-test give the same p-value.
1918 influenza epidemic (same data and figures as above): r = 0.91. Relationship between the number of deaths in a given week and the number of new diagnosed cases one week earlier.
MINITAB - Regression Analysis: FluDeaths1 versus FluCases0
The regression equation is
FluDeaths1 = 49.3 + 0.0722 FluCases0

Predictor    Coef        SE Coef            T       P
Constant     49.29       29.85 (SEb0)       1.65    0.121
FluCases      0.072222    0.008741 (SEb1)   8.26    0.000

S = 85.07 (s = √MSE)    R-Sq = 83.0% (r² = SSM/SST)    R-Sq(adj) = 81.8%

Analysis of Variance
Source            DF    SS              MS                F        P
Regression         1    494041 (SSM)    494041            68.27    0.000
Residual Error    14    101308          7236 (MSE = s²)
Total             15    595349 (SST)

P-value for H0: β1 = 0; Ha: β1 ≠ 0
Relation between repurchases and the book-to-market ratio: 8,434 repurchase announcement returns, using SAS.
 What do you find?
Advanced Analyses on Regressions*

BLUE (Best Linear Unbiased Estimator)

Asymptotic Least Squares Method

MLE

Conditional heteroskedasticity (條件變異數不齊一)

Weighted Least Square Estimator (WLS estimator; 加權最小平方法)
Statistical properties of OLS

Specify the linear regression as:
$Y_i = \alpha + \beta X_i + U_i, \quad i = 1, \dots, n.$
 Two classical conditions of regression analysis:
   [B1] $X_i$, i = 1, …, n, are non-stochastic variables.
   [B2] α0 and β0 exist so that
$Y_i = \alpha_0 + \beta_0 X_i + V_i, \quad i = 1, \dots, n.$
Classical conditions

Vi satisfies:
(i) $E(V_i) = 0, \quad i = 1, \dots, n.$
(ii) $\mathrm{var}(V_i) = \sigma_0^2, \quad i = 1, \dots, n,$ and $\mathrm{cov}(V_i, V_j) = 0$ for $i \neq j$.
 In [B2], Vi is the error term under the parameters α0 and β0; Vi is also known as the disturbance (干擾項).
 When the Vi have identical variance, the Vi are homoscedastic (變異數齊一性); otherwise the Vi are heteroskedastic (變異數不齊一性).
Regression variance
 In [B2](ii), to estimate the parameter σ0², the estimator under OLS is the sum of squared errors divided by the degrees of freedom:
$\hat\sigma_n^2 = \dfrac{1}{n-2}\sum_{i=1}^n \hat U_i^2$ is the regression variance.
 If we consider a model with only an intercept, then the regression variance is exactly the sample variance:
$\hat\sigma_n^2 = \dfrac{1}{n-1}\sum_{i=1}^n \hat U_i^2 = \dfrac{1}{n-1}\sum_{i=1}^n (Y_i - \bar Y_n)^2.$
Recall OLS estimators
 Given [B1] and [B2](i), the OLS estimators are linear and unbiased estimators of α0 and β0.
 Given [B1] and [B2], then
$\sigma_{\hat\beta}^2 = \mathrm{var}(\hat\beta_n) = \dfrac{\sigma_0^2}{\sum_{i=1}^n (X_i - \bar X_n)^2},$
$\sigma_{\hat\alpha}^2 = \mathrm{var}(\hat\alpha_n) = \sigma_0^2\left(\dfrac{1}{n} + \dfrac{\bar X_n^2}{\sum_{i=1}^n (X_i - \bar X_n)^2}\right),$
$\sigma_{\hat\alpha\hat\beta} = \mathrm{cov}(\hat\alpha_n, \hat\beta_n) = -\sigma_0^2\,\dfrac{\bar X_n}{\sum_{i=1}^n (X_i - \bar X_n)^2}.$
Best Linear Unbiased Estimator (BLUE)
 Assuming [B1] and [B2] hold, the Gauss-Markov theorem (高斯-馬可夫定理) tells us that the OLS estimators are the Best Linear Unbiased Estimators (BLUE; 最佳線性不偏估計式).
 Assuming [B1] and [B2] hold, then
$E(\hat U_i) = 0,$
$\mathrm{var}(\hat U_i) = \sigma_0^2\left(1 - \dfrac{1}{n} - \dfrac{(X_i - \bar X_n)^2}{\sum_{j=1}^n (X_j - \bar X_n)^2}\right), \quad i = 1, \dots, n,$
$\mathrm{cov}(\hat U_i, \hat U_j) = -\sigma_0^2\left(\dfrac{1}{n} + \dfrac{(X_i - \bar X_n)(X_j - \bar X_n)}{\sum_{k=1}^n (X_k - \bar X_n)^2}\right), \quad i \neq j.$
Best Linear Unbiased Estimator
 Given [B1] and [B2], $\hat\sigma_n^2$ is an unbiased estimator of σ0². Yet it is not the best linear estimator.
 Given [B1] and [B2], the parameter-variance estimators below are unbiased estimators of $\sigma_{\hat\beta}^2$, $\sigma_{\hat\alpha}^2$, and $\sigma_{\hat\alpha\hat\beta}$:
$S_{\hat\beta}^2 = \dfrac{\hat\sigma_n^2}{\sum_{i=1}^n (X_i - \bar X_n)^2},$
$S_{\hat\alpha}^2 = \hat\sigma_n^2\left(\dfrac{1}{n} + \dfrac{\bar X_n^2}{\sum_{i=1}^n (X_i - \bar X_n)^2}\right),$
$S_{\hat\alpha\hat\beta} = -\hat\sigma_n^2\,\dfrac{\bar X_n}{\sum_{i=1}^n (X_i - \bar X_n)^2}.$
Restricting the classical conditions of regression
 We did not require a distributional form for Vi. Now we add more restrictions and learn the exact distribution of the parameter estimators:
 [B2′] α0 and β0 exist such that
$Y_i = \alpha_0 + \beta_0 X_i + V_i, \quad i = 1, \dots, n,$
where the Vi are independent variables obeying N(0, σ0²).
 If [B1] and [B2′] hold, then
$\hat\beta_n \sim N\!\left(\beta_0,\ \dfrac{\sigma_0^2}{\sum_{i=1}^n (X_i - \bar X_n)^2}\right),$
$\hat\alpha_n \sim N\!\left(\alpha_0,\ \sigma_0^2\left(\dfrac{1}{n} + \dfrac{\bar X_n^2}{\sum_{i=1}^n (X_i - \bar X_n)^2}\right)\right),$
$\hat U_i \sim N\!\left(0,\ \sigma_0^2\left(1 - \dfrac{1}{n} - \dfrac{(X_i - \bar X_n)^2}{\sum_{j=1}^n (X_j - \bar X_n)^2}\right)\right).$
Restricting the classical conditions of regression
 If [B1] and [B2′] hold, then
   $\hat\alpha_n$ and $\hat\sigma_n^2$ are independent; $\hat\beta_n$ and $\hat\sigma_n^2$ are independent.
   $(n-2)\,\hat\sigma_n^2 / \sigma_0^2 \sim \chi^2(n-2).$
 Without the normal distribution assumption, the OLS estimators are not necessarily independent of the variance estimator, and the χ² distribution does not necessarily hold.
 Under the normal distribution assumption, the OLS estimator is not only the BLUE but also the Best Unbiased Estimator.
Asymptotic Least Squares
 We next relax the condition of non-stochastic X.
 [C1] {(X1,Y1), …, (Xn,Yn)} are i.i.d. random variables with finite variances.
 [C2] α0 and β0 exist so that
$Y_i = \alpha_0 + \beta_0 X_i + V_i, \quad i = 1, \dots, n,$
in which
(i) $E(V_i \mid X_i) = 0.$
(ii) $\mathrm{var}(V_i \mid X_i) = \sigma_0^2, \quad i = 1, \dots, n.$
OLS under large samples
 Under [C1], we can apply the weak law of large numbers and get:
$\hat\beta_n = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)/(n-1)}{\sum_{i=1}^n (X_i - \bar X_n)^2/(n-1)} \xrightarrow{P} \dfrac{\sigma_{XY}}{\sigma_X^2},$
$\hat\alpha_n = \bar Y_n - \hat\beta_n \bar X_n \xrightarrow{P} \mu_Y - \dfrac{\sigma_{XY}}{\sigma_X^2}\,\mu_X,$
$R^2 = \dfrac{\left[\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)/(n-1)\right]^2}{\left[\sum_{i=1}^n (Y_i - \bar Y_n)^2/(n-1)\right]\left[\sum_{i=1}^n (X_i - \bar X_n)^2/(n-1)\right]} \xrightarrow{P} \dfrac{\sigma_{XY}^2}{\sigma_X^2 \sigma_Y^2},$
$\hat\sigma_n^2 \xrightarrow{P} \sigma_Y^2\,(1 - \rho_{XY}^2).$
OLS under large samples
 When [C1] and [C2](i) hold,
β0 = σXY/σX², α0 = μY − μX σXY/σX².
 Under [C1] and [C2],
σ0² = σY²(1 − ρ²XY).
 Therefore, the parameters are determined by the data (variances and correlations) rather than by the model setting.
 Again, if we set the regression model Yi = α + βXi + Ui, then:
   Under [C1] and [C2](i), the OLS estimators $\hat\beta_n$ and $\hat\alpha_n$ are consistent estimators of β0 and α0.
   Under [C1] and [C2], $\hat\sigma_n^2$ is a consistent estimator of σ0².
Asymptotic distribution of estimators
 Under [C1], [C2], and the linear model shown above, we get
$\sqrt{n}\,(\hat\beta_n - \beta_0) \overset{A}{\sim} N(0,\ \sigma_0^2/\sigma_X^2),$
$\sqrt{n}\,(\hat\alpha_n - \alpha_0) \overset{A}{\sim} N\!\left(0,\ \sigma_0^2\left(1 + \dfrac{\mu_X^2}{\sigma_X^2}\right)\right).$
 [C1] requires a random sample, making the random variables independent.
 If the weak law of large numbers and the central limit theorem hold, then consistency and the asymptotic properties hold as well.
Maximum Likelihood Estimation (MLE)
 Under [B1] and [B2′], given α0 and β0 so that
$Y_i = \alpha_0 + \beta_0 X_i + V_i, \quad i = 1, \dots, n,$
then Yi ~ N(α0 + β0 Xi, σ0²).
 Yi and Yj are independent, so we use the maximum likelihood method (最大概似法) to estimate the parameters.
 Let θ0 := (α0, β0, σ0²). With $\{Y_i\}_{i=1}^n \sim N(\alpha_0 + \beta_0 X_i, \sigma_0^2)$, the likelihood function is
$L(\alpha, \beta, \sigma^2) = \prod_{i=1}^n \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(Y_i - \alpha - \beta X_i)^2}{2\sigma^2}\right).$
Maximum Likelihood Estimation
 We then take the natural logarithm of the likelihood function, and the first-order conditions give
$\tilde\beta_n = \dfrac{\sum_{i=1}^n (X_i - \bar X_n)(Y_i - \bar Y_n)}{\sum_{i=1}^n (X_i - \bar X_n)^2},$
$\tilde\alpha_n = \bar Y_n - \tilde\beta_n \bar X_n.$
 Biased estimator:
$\tilde\sigma_n^2 = \dfrac{1}{n}\sum_{i=1}^n \hat U_i^2$ (in the intercept-only model this is $\dfrac{1}{n}\sum_{i=1}^n (Y_i - \bar Y_n)^2$),
which divides by n rather than n − 2 and is therefore biased.
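A minimal sketch contrasting the MLE variance estimator with the OLS one (plain NumPy; under normality the MLE slope and intercept coincide with OLS):

```python
import numpy as np

def normal_mle_fit(x, y):
    """Normal-model MLE: slope and intercept coincide with OLS;
    the variance estimate divides by n (biased), not n - 2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    sigma2_tilde = ((y - a - b * x) ** 2).sum() / len(y)  # 1/n, not 1/(n - 2)
    return a, b, sigma2_tilde
```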
Conditional heteroskedasticity
 Now we further relax the homoscedasticity assumption:
 [H1] {(X1, Y1), …, (Xn, Yn)} are i.i.d. random variables with at least fourth moments.
 [H2] α0 and β0 exist so that
$Y_i = \alpha_0 + \beta_0 X_i + V_i, \quad i = 1, \dots, n,$
in which $E(V_i \mid X_i) = 0.$
 We do not assume Var(Vi | Xi) to be unrelated to Xi, so Var(Vi | Xi) can be a function of Xi under [H1] and [H2].
White (1980) adjustment
 One way to deal with heteroskedasticity is White's method.
 White's test grew out of what are known as "White standard errors," introduced in his Econometrica (1980) paper.
 These are also sometimes called "robust standard errors" or "heteroskedasticity-consistent standard errors."
 Under [H1] and [H2], we obtain
$\sqrt{n}\,(\hat\beta_n - \beta_0) \overset{A}{\sim} N\!\left(0,\ \dfrac{E(Z_i^2)}{\sigma_X^4}\right), \qquad \sqrt{n}\,(\hat\alpha_n - \alpha_0) \overset{A}{\sim} N\!\left(0,\ \left(\dfrac{E(X_i^2)}{\sigma_X^2}\right)^2 E(H_i^2 V_i^2)\right),$
in which $Z_i := (X_i - \mu_X)V_i$ and $H_i := 1 - \dfrac{\mu_X}{E(X_i^2)}\,X_i.$
White (1980) adjustment
 We get the estimator variances (for the t-tests) by replacing the population moments in the asymptotic variances with their sample analogues, and the errors Vi with the OLS residuals $\hat U_i$.
 The resulting estimator variances are consistent estimators under [H1] and [H2].
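A minimal sketch using statsmodels (simulated data; HC0 is statsmodels' name for White's adjustment):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data whose error variance depends on X (hypothetical example).
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=200)
y = 1.0 + 2.0 * x + x * rng.normal(size=200)  # error s.d. grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit(cov_type="HC0")  # White's (1980) adjustment
print(fit.bse)  # heteroskedasticity-consistent standard errors
```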


Weighted Least Squares Estimator
 Another way to deal with the heteroskedasticity issue is the Weighted Least Squares (WLS) method (加權最小平方法).
 Given [C1] and [C2](i), we have conditional heteroskedasticity of the form
$\mathrm{Var}(V_i \mid X_i) = h(X_i),$
where h(·) is a given function.
 We can use the following transformation to estimate the parameters, dividing the equation through by $\sqrt{h(X_i)}$:
$\dfrac{Y_i}{\sqrt{h(X_i)}} = \alpha\,\dfrac{1}{\sqrt{h(X_i)}} + \beta\,\dfrac{X_i}{\sqrt{h(X_i)}} + \dfrac{V_i}{\sqrt{h(X_i)}}.$
Weighted Least Squares Estimator
 Applying the OLS method to the transformation above, we get the WLS estimator.
 We can verify that the transformed variables satisfy [C1] and [C2](i). Defining $V_i^* := V_i/\sqrt{h(X_i)}$, then
$E(V_i^* \mid X_i) = 0 \quad \text{and} \quad \mathrm{Var}(V_i^* \mid X_i) = \dfrac{\mathrm{Var}(V_i \mid X_i)}{h(X_i)} = 1,$
so the transformation yields conditional homoscedasticity.
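A minimal sketch of WLS using statsmodels (simulated data; h(Xi) = Xi² is a hypothetical choice of variance function):

```python
import numpy as np
import statsmodels.api as sm

# Same hypothetical setup: Var(V_i | X_i) = h(X_i) = X_i**2.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 5.0, size=200)
y = 1.0 + 2.0 * x + x * rng.normal(size=200)

X = sm.add_constant(x)
fit = sm.WLS(y, X, weights=1.0 / x ** 2).fit()  # weights = 1 / h(x)
print(fit.params)  # WLS estimates of the intercept and slope
```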