Chapter 2. Simple Linear Regression Model
Background
Suppose we wish to learn about the effect of education (x) on wage rate (y) in the U.S.
We believe that the average (expected) wage rate depends on the education level
𝐸(π‘Œπ‘– ) = 𝛽0 + 𝛽1 𝑋𝑖
The deviation of an individual's wage rate from this average is random.
π‘Œπ‘– = 𝐸(π‘Œπ‘– ) + 𝑒𝑖 = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖
Yi=dependent variable, or explained variable
Xi=independent variable, or explanatory variable
ui=error term, or disturbance term
𝛽0 =intercept parameter, or intercept coefficient, or constant term
𝛽1 =slope parameter, or slope coefficient
- The error term captures the effects of all other variables, some of which are observable and some are
not.
- Properties of error terms play an important role in determining the properties of parameter estimates. We
will talk about this much later.
- The slope parameter represents the marginal effect of X on Y. If X increases by one unit, Y changes by
𝛽1 .
How do we estimate these parameters? See Excel file 2.1
- collect data on x and y from n individuals
- plot them on x-y plane. This is called the scatterplot. Each point represents an individual
- Since the model is specified as a linear model, we wish to find a straight line that captures the
relationship between the two variables.
- There can be many lines that appear to fit well.
- We need to pick one line. Which line is the best?
- We need to decide what we mean by the "best".
Estimation Methods
We will start with a simpler approach, called the Ordinary Least Squares (OLS) estimation method.
To learn about the OLS method, let
𝛽̂0 , 𝛽̂1 estimated parameters
π‘ŒΜ‚π‘– = 𝛽̂0 + 𝛽̂1 𝑋𝑖 predicted (or estimated) dependent variable
𝑒̂𝑖 = π‘Œπ‘– βˆ’ π‘ŒΜ‚π‘– (regression) residuals, or prediction errors
[Figure: scatterplot of the data with a fitted line; the vertical distances from the points to the line are the residuals (e.g., u4 and u5 in the plot).]
See Excel 2.2
The line of predicted y for each x is the regression line we are looking for.
The objective of estimation is to find a line that is closest to the sample points.
- We want parameter estimates that make the predicted value π‘ŒΜ‚π‘– and the actual observed value Yi as close as possible.
- But, we cannot make all predicted values close to actual values. Some are close and some are not.
How do we measure the overall closeness?
Idea 0: measure by the sum of residuals. This does not work because positive and negative residuals
cancel each other.
Idea 1. to avoid the cancellation problem, we can take the sum of absolute residuals (SAR):
𝑆𝐴𝑅 = βˆ‘π‘›π‘–=1 |𝑒̂𝑖 |
And we find parameters that make this SAR the smallest. Such an estimator is called the Minimum Absolute
Deviation (MAD) estimator, or the Least Absolute Deviation (LAD) estimator.
Idea 2. Another way to avoid the cancellation problem is to take the sum of squared residuals (SSR):
𝑆𝑆𝑅 = βˆ‘π‘›π‘–=1 𝑒̂𝑖 2 = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ 𝛽̂0 βˆ’ 𝛽̂1 𝑋𝑖 )2
The parameter values that make the SSR the smallest are called the Ordinary Least Squares (OLS)
estimator.
Remarks:
- The LAD estimator of the parameters represents the intercept and the marginal effect of Xi on the median of
Yi. That is, the median of Yi is equal to 𝛽0 + 𝛽1 𝑋𝑖
- The OLS estimator of the parameters represents the intercept and the marginal effect of Xi on the mean of
Yi. That is, the mean of Yi is equal to 𝛽0 + 𝛽1 𝑋𝑖
- We may be interested in knowing the effect of Xi on the quantiles (e.g., the 25% quantile or the 75%
quantile) of Yi. Since the 50% quantile is the median, this is a generalization of the idea of the LAD
estimator, and it is called Quantile Regression instead of Linear Regression.
- Objective functions of the LAD and quantile (Ο„) estimators:
LAD estimator:
\min_{\beta_0,\beta_1} SAR = \sum_{i=1}^{n} |\hat u_i| = \sum_{i=1}^{n} |Y_i - \hat\beta_0 - \hat\beta_1 X_i|
Quantile (Ο„) estimator:
\min_{\beta_0,\beta_1} Q(\tau) = \sum_{\hat u_i > 0} \tau |\hat u_i| + \sum_{\hat u_i \le 0} (1-\tau) |\hat u_i|
- Note that the LAD estimator gives the same weight (equal to 1) to the absolute error regardless of
the sign of the error (whether it is positive or negative).
- Note that the quantile estimator gives asymmetric weights to the absolute error: weight Ο„ on
the positive error (under-prediction) and weight (1-Ο„) on the negative error (over-prediction). A numerical sketch of both objectives follows below.
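As a rough numerical illustration of these objective functions, the sketch below minimizes SAR and Q(Ο„) directly. This is not part of the course materials; it assumes Python with numpy and scipy are available, and the wage/education numbers are made up for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative (made-up) data: x = years of education, y = hourly wage
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([4.1, 5.0, 6.2, 5.8, 7.5, 9.1, 8.4, 11.0])

def sar(b):                     # sum of absolute residuals (LAD objective)
    return np.sum(np.abs(y - b[0] - b[1] * x))

def q_loss(b, tau):             # asymmetric quantile objective Q(tau)
    u = y - b[0] - b[1] * x
    return np.sum(np.where(u > 0, tau * np.abs(u), (1 - tau) * np.abs(u)))

b_start = np.array([0.0, 0.0])  # starting values for (beta0, beta1)
lad = minimize(sar, b_start, method="Nelder-Mead").x
q75 = minimize(q_loss, b_start, args=(0.75,), method="Nelder-Mead").x
print("LAD (median) fit:  ", lad)
print("0.75-quantile fit: ", q75)
```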
How do we compute the estimator of parameters? See Excel 2.3
(1) Use Excel’s intercept and slope commands
(2) Use Excel’s Regression function
(3) Use excel’s Solver function
(4) Use algebraic solutions
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X
\hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2}
where \bar X and \bar Y are the sample means of X and Y.
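A minimal sketch of the algebraic solution in (4), assuming Python with numpy; the wage/education data are made up for illustration (the Excel files are not reproduced here).

```python
import numpy as np

# Illustrative (made-up) data: x = years of education, y = hourly wage
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([4.1, 5.0, 6.2, 5.8, 7.5, 9.1, 8.4, 11.0])

# Algebraic OLS solution
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cross-check against numpy's built-in least-squares line fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b0, b1)          # algebraic solution
print(b0_np, b1_np)    # same values from np.polyfit
```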
Derivation of OLS estimators
SSR = \sum_{i=1}^{n} \hat u_i^2 = \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2
\frac{\partial SSR}{\partial \hat\beta_0} = -2 \sum_{i=1}^{n} (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 \;\Rightarrow\; \hat\beta_0 = \bar Y - \hat\beta_1 \bar X
\frac{\partial SSR}{\partial \hat\beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} X_i \hat u_i = 0
Interpretation of OLS estimator
For proper interpretation of the estimation results, you have to remember
(i) the measurement unit of each variable
(ii) the functional form of the regression
- Measurement Unit
Example
wage=-0.90+0.54 educ
wage is measured in dollars and educ is measured in years
Slope estimate 0.54: one more year of education is expected to raise the hourly wage by 0.54
dollars (54 cents) on the average. (1976 data)
Intercept estimate -0.90: A person with no education receives negative 90 cents per hour -- silly
result. This is caused by the lack of sample with zero education and hence the estimate at this low
end is not good.
Prediction: The predicted wage of a person who has 10 years of education is $4.50/hour, which is
computed by substituting 10 for educ: 4.50 = -0.90 + 0.54Γ—10.
- Functional form
ln(wage)=0.584+0.083 educ
wage is measured in dollars and educ is measured in years
Slope estimate 0.083: one more year of education is expected to increase the hourly wage by
8.3% (100Γ—0.083)
Intercept estimate 0.584: A person with no education is expected to receive wage equal to
$1.79=exp(0.584)
Prediction: The predicted log wage of a person who has 10 years of education is ln(wage) = 0.584 + 0.083Γ—10 = 1.414,
so the predicted wage is exp(1.414) β‰ˆ $4.11/hour.
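A small check of the two predictions above, using the estimated coefficients reported in the text (a sketch; Python with numpy assumed).

```python
import numpy as np

educ = 10

# Level model: wage = -0.90 + 0.54*educ  (wage in dollars, educ in years)
wage_level = -0.90 + 0.54 * educ        # -> 4.50 dollars/hour

# Log model: ln(wage) = 0.584 + 0.083*educ
log_wage = 0.584 + 0.083 * educ         # -> 1.414
wage_log_model = np.exp(log_wage)       # -> about 4.11 dollars/hour

print(wage_level, wage_log_model)
```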
Unexpected Results
We expect that the education level has a positive effect on the wage, i.e., 𝛽1 is positive.
What if its estimate is negative? i.e., unexpected result?
It is due to misspecification of the model such as
- omitted explanatory variables
- wrong functional form
- the coefficient (the marginal effect of educ) is not the same for all levels of education.
- and others
What do we have to do?
Example of Phillips Curve
- simple regression model gives a positively sloped Phillips curve.
- this is caused by the shift of the curve over time.
- See Excel file 2.4
A Measure of Goodness of Fit: R-squared (R2)
Example
- Differences in wage rates can be explained partly by the differences in education levels
- Remaining part of differences in wage is due to unobservable random factors (error terms)
Question
- Is this model any good?
- Does the education explain the wage well?
- How much of the differences in years of education explain the differences in wage rates across
individuals?
- How can we measure it?
To answer this question, we may compare two models
- One model that uses information on education... unrestricted model
π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖
- One model that does not use information on education... restricted model
π‘Œπ‘– = 𝛽0 + 𝑒𝑖
The restriction is 𝛽1 = 0
How do we measure the relative performances of these two models?
- The Least Squares estimators are the estimators that make the SSR as small as possible.
- Smaller SSR means a relatively better overall prediction.
- Therefore, we can compare the minimum SSR of the two models
Estimate both models using the OLS method, and let their SSR’s be denoted by
SSRu from the unrestricted model
SSRr from the restricted model
If education has a strong explanatory power, we would expect SSRu is much smaller than SSRr.
What is the percentage reduction in the SSR when the education level is used to predict wages? This ratio,
expressed as a fraction rather than a percentage, is called R2:
R^2 = 1 - \frac{SSR_u}{SSR_r}, \qquad 0 \le R^2 \le 1
A high R2 means that the differences in education have a strong explanatory power for the differences in
wage rates, and hence our model is good, and vice versa.
Note: The estimator of 𝛽0 in the restricted model is 𝛽̃0 = π‘ŒΜ…. Therefore,
π‘†π‘†π‘…π‘Ÿ = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ… )2
This is called the total sum of squares (TSS) and SSRu is the usual SSR that we discussed before.
Therefore, R2 is usually written as
R^2 = 1 - \frac{SSR}{TSS} = \frac{TSS - SSR}{TSS} = \frac{SSE}{TSS}
where SSE is the explained sum of squares.
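A short sketch of the restricted/unrestricted comparison, assuming Python with numpy and the same made-up wage/education data as in the earlier sketches.

```python
import numpy as np

# Illustrative (made-up) data
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([4.1, 5.0, 6.2, 5.8, 7.5, 9.1, 8.4, 11.0])

# Unrestricted model: OLS of y on a constant and x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
ssr_u = np.sum((y - b0 - b1 * x) ** 2)

# Restricted model (beta1 = 0): the OLS estimate of beta0 is ybar
ssr_r = np.sum((y - y.mean()) ** 2)    # this is also the TSS

r2 = 1 - ssr_u / ssr_r
print(r2)
```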
Sampling Variations of OLS Estimator
π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖
Collect a data and estimate parameters: 𝛽̂0 , 𝛽̂1
A simple regression model:
Collect another data and estimate parameters
These estimates will be different from previous estimates
Repeat the procedure and get different estimated values
What can we say about these different estimated values of the same unknown parameters?
Average of these estimates?
Dispersion of these estimates?
Basic facts
ui is a random variable and Yi is a function of ui; therefore, Yi is also random.
Xi can be a random variable too.
Since \hat\beta_0 = \bar Y - \hat\beta_1 \bar X and \hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2} are functions of the data, the estimators are random variables also.
Therefore, the estimators vary from sample to sample (i.e., with the realized values of the error terms). The simulation sketch below illustrates this.
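A minimal Monte Carlo sketch of this sampling variation, assuming Python with numpy; the parameter values, sample size, and number of replications are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma_u, n = 1.0, 0.5, 2.0, 100

estimates = []
for _ in range(5000):                      # draw many independent samples
    x = rng.uniform(0, 16, size=n)         # regressor
    u = rng.normal(0, sigma_u, size=n)     # error term
    y = beta0 + beta1 * x + u
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

estimates = np.array(estimates)
print(estimates.mean())   # close to the true beta1 = 0.5 (unbiasedness)
print(estimates.std())    # dispersion of beta1-hat across samples
```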
Desired properties of estimators
Unbiasedness: The average of estimators is equal to their true values
Precision of estimators: Smallest dispersion
If an estimator is unbiased and has the smallest dispersion among all unbiased estimators, it is called the best
unbiased estimator.
Can we claim that the OLS estimators are the best unbiased estimators?
The answer is yes under certain conditions.
Assumption 1. Conditional distribution of ui given Xi has a zero mean.
Assumption 2. (Xi,Yi) are independently and identically distributed.
Assumption 3. Large outliers are unlikely.
Under these assumptions, the OLS estimators are the BLUE (Best Linear Unbiased Estimator)
To understand these concepts, we will do a very brief review of basic statistic theory.
Review of Statistics Theory
Random Experiment: an experiment whose outcome cannot be predicted with certainty.
Probability: A numerical measure of the relative likelihood (frequencies) of various outcomes of a
random experiment. It takes a value in a closed interval [0,1].
Toss a coin with an outcome of either Head (H) or Tail (T). Let
ΞΈ = prob(H), and 1-ΞΈ = prob(T).
If ΞΈ = 1/2, it is a fair coin.
Define a random variable Y
Y=1 if the coin lands on H
Y=0 if the coin lands on T
Probability density function (pdf) of Y: f(y)
f(y)=P(Y=y): f(1)=ΞΈ, f(0)=1-ΞΈ
Graph of the pdf
Cumulative distribution function (cdf): F(y)
F(y)=P(Y<=y)
F(y) = 0 if y < 0
     = 1-ΞΈ if 0 ≀ y < 1
     = 1 if y β‰₯ 1
Graph of the cdf
Moments of random variable Y: Expected value and Variance
Summary value of the characteristics of the probability distribution
Mean or Expected value of Y
The expected value of Y is a measure of the location of the β€œcenter” of the probability distribution
It is a probability weighted average of the outcomes of Y
ΞΌ = E(Y) = ΞΈΓ—1 + (1-ΞΈ)Γ—0 = ΞΈ
Variance and standard deviation of Y
The variance is a measure of the degree of dispersion or spread of the random outcomes around the
mean
𝜎 2 = π‘£π‘Žπ‘Ÿ(π‘Œ) = 𝐸[(π‘Œ βˆ’ πœ‡)2 ]=E[Y2-2ΞΌY+ΞΌ2]=E(Y2)- ΞΌ2
𝜎 = √𝜎 2
For the example above,
𝜎 2 = πœƒ(1 βˆ’ πœ‡)2 + (1 βˆ’ πœƒ)(0 βˆ’ πœ‡)2 = πœƒ(1 βˆ’ πœƒ)2 + (1 βˆ’ πœƒ)(0 βˆ’ πœƒ)2 = πœƒ(1 βˆ’ πœƒ)
This shows that the variance is the probability weighted average of the squared distance of outcomes
from their mean.
Example: more than two outcomes
Toss two balanced coins (or a balanced coin twice)
Potential outcomes: {HH}, {HT}, {TH}, {TT}
Each outcome is equally likely, and hence the probability of each outcome is 1/4
You win a prize in dollars that is equal to the number of heads
Let Y denote the amount of prize that you can win
Y can take $2, $1, $1, and $0
Expected prize:
μ=E(Y)=1/42+1/41+1/41+1/40=1
Variance of prize:
𝜎 2 =var(Y)=1/4(2-1)^2+1/4(1-1)^2+1/4(1-1)^2+1/4(0-1)^2=1/2
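A quick computation of this expected value and variance (a sketch; Python with numpy assumed).

```python
import numpy as np

# Prize Y from tossing two fair coins: number of heads
values = np.array([2.0, 1.0, 0.0])        # possible prizes
probs  = np.array([0.25, 0.50, 0.25])     # P(HH), P(HT or TH), P(TT)

mu  = np.sum(probs * values)              # E(Y) = 1
var = np.sum(probs * (values - mu) ** 2)  # var(Y) = 1/2
print(mu, var)
```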
Joint Distribution of two random variables
Toss a fair coin twice
Possible outcomes: {HH}, {HT}, {TH}, {TT}
equally likely: probability of each outcome is ¼
Two prizes X and Y as specified in the joint probability table below (cells are joint probabilities f(x,y)):

                      Y = -2     Y = 0          Y = 2
                      {TT}       {HT or TH}     {HH}      f(x)
X = -1 {HT or TH}      0           1/2            0        1/2
X =  2 {HH or TT}     1/4           0            1/4       1/2
f(y)                  1/4          1/2           1/4        1
Jane is given ticket Y and John is given ticket X.
Probability of individual random variable - marginal probability
What are the probabilities that Jane wins $-2? $0? $2?
What are the probabilities that John wins $2? loses $1?
Joint Probability
(i) What is the probability that both win $2?
This happens only if {HH} and its probability is 1/4.
(ii) What is the probability that John wins $2 and Jane loses $2?
This happens only if {TT}, and its probability is 1/4.
The last column shows the marginal probability density of X
The last row shows the marginal probability density of Y
πœ‡π‘₯ =0.5, 𝜎π‘₯2 =E(X2)-πœ‡π‘₯2 =2.5-0.25=2.25
πœ‡π‘¦ =0.0, πœŽπ‘¦2 =E(Y2)-πœ‡π‘¦2 =2.0-0.0=2.0
Covariance of two random variables X and Y
A measure of the degree of covariation of the two random variables
It is computed by
𝜎π‘₯𝑦 =cov(X,Y)=E[(X-πœ‡π‘₯ )(Y-πœ‡π‘¦ )]=E(XY)-E(X) πœ‡π‘¦ βˆ’ πœ‡π‘₯ 𝐸(π‘Œ) + πœ‡π‘₯ πœ‡π‘¦ = 𝐸(π‘‹π‘Œ) βˆ’ πœ‡π‘₯ πœ‡π‘¦
In the example above, 𝜎π‘₯𝑦 = 𝐸(π‘‹π‘Œ) βˆ’ πœ‡π‘₯ πœ‡π‘¦ =-1+1-0=0
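The marginal moments and the covariance above can be computed directly from the joint probability table; a sketch assuming Python with numpy follows.

```python
import numpy as np

# Joint pmf f(x, y): rows X = -1, 2; columns Y = -2, 0, 2
x_vals = np.array([-1.0, 2.0])
y_vals = np.array([-2.0, 0.0, 2.0])
f = np.array([[0.00, 0.50, 0.00],
              [0.25, 0.00, 0.25]])

fx = f.sum(axis=1)                          # marginal pmf of X
fy = f.sum(axis=0)                          # marginal pmf of Y
mu_x, mu_y = fx @ x_vals, fy @ y_vals       # 0.5 and 0.0
var_x = fx @ (x_vals - mu_x) ** 2           # 2.25
var_y = fy @ (y_vals - mu_y) ** 2           # 2.0
exy = np.sum(f * np.outer(x_vals, y_vals))  # E(XY) = 0
cov_xy = exy - mu_x * mu_y                  # 0
print(mu_x, mu_y, var_x, var_y, cov_xy)
```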
Covariance can take a value of positive, negative or zero.
Value of covariance depends on the measurement unit: A covariance of 2 will change to 20000 if the
measurement unit changes from dollars to cents.
Let X and Y be measured in dollars, and W and Z be measured in cents. Then
W=100X and Z=100Y. Therefore,
πœ‡π‘€ = E(W) = 100E(X), πœ‡π‘§ = E(Z) = 100E(Y)
cov(W,Z) = cov(100X,100Y) = E[(100X - 100πœ‡π‘₯)(100Y - 100πœ‡π‘¦)] = 10000 E[(X - πœ‡π‘₯)(Y - πœ‡π‘¦)] = 10000 cov(X,Y)
Correlation Coefficient of X and Y
To avoid the dependence of covariance on the measurement unit, standardize it
It is computed by
π‘π‘œπ‘£(𝑋,π‘Œ)
𝜎π‘₯𝑦
𝜌π‘₯𝑦 =corr(X,Y)=𝑠𝑑(𝑋)𝑠𝑑(π‘Œ)=𝜎
π‘₯ πœŽπ‘¦
Its value lies between -1 and 1:
𝜌π‘₯𝑦 =+1: X and Y are perfectly positively correlated
𝜌π‘₯𝑦 =-1: X and Y are perfectly negatively correlated
𝜌π‘₯𝑦 =0: X and Y are uncorrelated
Conditional Probability and Conditional Moments
                      Y = -2     Y = 0          Y = 2
                      {TT}       {HT or TH}     {HH}      f(x)
X = -1 {HT or TH}      0           1/2            0        1/2
X =  2 {HH or TT}     1/4           0            1/4       1/2
f(y)                  1/4          1/2           1/4        1
Suppose that you are told that X=-1 (i.e., you are told that outcomes are either TH or HT).
What is the probability of Y=-2? That is, when the outcome is known to be either TH or HT, what is
the probability that the outcome is {TT}? It is zero because {TT} will not occur when the outcome is
either TH or HT.
A similar reasoning gives P(Y=2|X=-1)=0, and P(Y=0|X=-1)=1
P(Y=-2|X=2)=1/2, P(Y=0|X=2)=0, P(Y=2|X=2)=1/2
Conditional probability is denoted by f(y|x)=P(Y=y|X=x)
The conditional probability density is computed by
𝑓(𝑦|π‘₯) =
𝑓(π‘₯,𝑦)
𝑓(π‘₯)
𝑓(π‘₯) β‰  0
To understand this formula, suppose Y takes values 𝑦1 , 𝑦2 , and 𝑦3 . For a given X=x, we wish to find
𝑓(𝑦1 |π‘₯), 𝑓(𝑦2 |π‘₯), and 𝑓(𝑦3 |π‘₯). These conditional probabilities must satisfy two properties:
(1) They must sum to 1: 𝑓(𝑦1 |π‘₯) + 𝑓(𝑦2 |π‘₯) + 𝑓(𝑦3 |π‘₯) = 1
(2) They must maintain the relative probabilities of the outcomes of Y: \frac{f(y_i|x)}{f(y_j|x)} = \frac{f(x,y_i)}{f(x,y_j)}
Joint probabilities of course satisfy property (2), but their sum is not equal to 1; their sum is the marginal
probability f(x). Therefore, we divide each joint probability by f(x), so that
f(y_1|x) + f(y_2|x) + f(y_3|x) = \frac{f(x,y_1)}{f(x)} + \frac{f(x,y_2)}{f(x)} + \frac{f(x,y_3)}{f(x)} = \frac{f(x,y_1)+f(x,y_2)+f(x,y_3)}{f(x)} = \frac{f(x)}{f(x)} = 1
The mean and the variance based on the conditional probabilities are called the conditional mean (or
conditional expected value) and the conditional variance.
E(Y|X=-1)=(-2)×0+(0×1)+2×0=0
E(Y|X=2)=(-2)×1/2+(0×0)+2×1/2=0
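The conditional probabilities and conditional means above can be reproduced from the joint table; a sketch assuming Python with numpy follows.

```python
import numpy as np

x_vals = np.array([-1.0, 2.0])
y_vals = np.array([-2.0, 0.0, 2.0])
f = np.array([[0.00, 0.50, 0.00],    # row X = -1
              [0.25, 0.00, 0.25]])   # row X = 2

fx = f.sum(axis=1)                   # marginal pmf of X
f_y_given_x = f / fx[:, None]        # f(y|x) = f(x, y) / f(x)
e_y_given_x = f_y_given_x @ y_vals   # conditional means E(Y|X=-1), E(Y|X=2)
print(f_y_given_x)                   # rows: [0, 1, 0] and [1/2, 0, 1/2]
print(e_y_given_x)                   # both equal 0
```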
Statistical Independence of Random Variables
Random variables X and Y are statistically independent if information about X does not change the
marginal probability of Y, i.e., P(Y=y|X=x)=P(Y=y).
Equivalently, X and Y are independent if f(x,y)=f(x)f(y), and f(y|x)=f(y).
                      Y = -2     Y = 0          Y = 2
                      {TT}       {HT or TH}     {HH}      f(x)
X = -1 {HT or TH}      0           1/2            0        1/2
X =  2 {HH or TT}     1/4           0            1/4       1/2
f(y)                  1/4          1/2           1/4        1
We have shown that X and Y are uncorrelated. But they are not statistically independent. This is easily
verified by checking whether the joint probabilities are the products of the marginal probabilities, as in the sketch below.
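A one-line check of this point (a sketch; Python with numpy assumed).

```python
import numpy as np

f = np.array([[0.00, 0.50, 0.00],
              [0.25, 0.00, 0.25]])
fx = f.sum(axis=1)
fy = f.sum(axis=0)

# Under independence every joint probability equals the product of the marginals
print(np.allclose(f, np.outer(fx, fy)))   # False: uncorrelated but not independent
```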
Computations of moments of a function of random variables
Moments of functions of random variables
Consider two random variables X and Y, and let a, b, c and d be constants.
(i) Let Z=a+bX. Then, E(Z)=E(a+bX)=a+b E(X) and πœŽπ‘§2 =var(Z)=𝑏 2 𝜎π‘₯2
Prove these results.
(ii) Let Z=a+bX+cY
E(Z)=E(a+bX+cY)=a+b E(X)+c E(Y)
πœŽπ‘§2 =𝑏 2 𝜎π‘₯2 + 𝑐 2 πœŽπ‘¦2 + 2π‘π‘πœŽπ‘₯𝑦
If X and Y are uncorrelated (or independent), then
πœŽπ‘§2 =𝑏 2 𝜎π‘₯2 + 𝑐 2 πœŽπ‘¦2
Prove these results.
(iii) cov(a+bX, c+dY)=bc cov(X,Y)
Desired Properties of Estimators
Toss a coin n times. Let Xi denote the outcome of the ith toss:
Xi=1 if H and Xi=0 if T
The coin is not necessarily balanced. Let ΞΈ=prob(H).
We decided to estimate ΞΈ by the fraction of the number of heads in n tosses:
\hat\theta = \frac{\#\text{ of heads}}{n} = \frac{\sum_{i=1}^{n} X_i}{n}
What can we say about the statistical properties of this estimator?
(a) E(\hat\theta) = \theta for any ΞΈ
(b) var(\hat\theta) = \frac{\theta(1-\theta)}{n}
Unbiased Estimator: An estimator that satisfies property (a) is called an unbiased estimator.
bias= 𝐸(πœƒΜ‚) βˆ’ πœƒ
Best Estimator: An unbiased estimator that has the smallest variance among all unbiased estimators is
called the Best Unbiased Estimator.
Alternative unbiased estimator
\tilde\theta = \frac{X_1 + X_3 + X_5}{3}
E(\tilde\theta) = \frac{E(X_1) + E(X_3) + E(X_5)}{3} = \theta
... \tilde\theta is an unbiased estimator
var(\tilde\theta) = \frac{\theta(1-\theta)}{3}
... If n > 3, then \hat\theta has a smaller variance than \tilde\theta
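A small simulation comparing the two estimators (a sketch assuming Python with numpy; the value of ΞΈ, the sample size, and the number of replications are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 0.3, 20, 20000

x = rng.binomial(1, theta, size=(reps, n))      # reps samples of n coin tosses
theta_hat   = x.mean(axis=1)                    # fraction of heads
theta_tilde = x[:, [0, 2, 4]].mean(axis=1)      # (X1 + X3 + X5) / 3

print(theta_hat.mean(),  theta_tilde.mean())              # both close to theta (unbiased)
print(theta_hat.var(),   theta * (1 - theta) / n)         # close to theta(1-theta)/n
print(theta_tilde.var(), theta * (1 - theta) / 3)         # close to theta(1-theta)/3
```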
Statistical Properties of OLS estimators
Linear Model: π‘Œπ‘– = 𝛽0 + 𝛽1 𝑋𝑖 + 𝑒𝑖
OLS estimators:
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X, \qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^{n}(X_i - \bar X)^2}
Best Linear Unbiased Estimator (BLUE)
An estimator is called the best linear unbiased estimator if it is a linear estimator of Yi and unbiased, and
its variance is the smallest among all linear unbiased estimators.
Assumption 1. Conditional distribution of ui given Xi has a zero mean.
Assumption 2. (Xi,Yi) are independently and identically distributed.
Assumption 3. Large outliers are unlikely.
Assumption 1. E(ui|Xi)=0 β‡’ 𝐸(π‘Œπ‘– |𝑋𝑖 ) = 𝛽0 + 𝛽1 𝑋𝑖 ,
𝐸(𝑒𝑖 ) = 𝐸[𝐸(𝑒𝑖 |𝑋𝑖 )] = 𝐸[0] = 0
Example of the wage rate of individuals: Individuals with 12 years of education (Xi=12) get 𝛽0 + 𝛽1 𝑋𝑖
on average. Some individuals get a higher and some get a lower wage rate than this average, and
the average of the deviations from the mean is zero.
Violation of Assumption 1.
The classical case is the simultaneous equation model, e.g., the supply-demand equilibrium model:
Q^s = a + bP + u^s, \qquad Q^d = \alpha - \beta P + u^d, \qquad Q = Q^s = Q^d
Solving for the equilibrium price:
P = \frac{\alpha - a}{b + \beta} + \frac{u^d - u^s}{b + \beta}
From Q^s = Q^d, the demand error can be written as
u^d = (a - \alpha) + (b + \beta)P + u^s
so that
E(u^d|P) = (a - \alpha) + (b + \beta)P + E(u^s|P)
which depends on P in general, so Assumption 1 is violated.
Feedback Policy
GDP growth rate depends on the interest rate and the monetary authority sets the target interest rate
considering the GDP growth.
The Connecticut Huskies won their 100th consecutive game.
The performance of a team depends on the quality of its players, and the quality of new players also depends on
the performance of the team.
Omitted variables. An omitted variable may cause correlation between the error term and the explanatory
variable. The regression below estimates the effect of the percentage of children who are eligible for the free lunch program
(lunch) on the percentage (math) of 10th graders who pass the math exam. The regression result is
math = 32.14 - 0.319 lunch, n=408, R2=0.171
This indicates that a 10 percentage point increase in the share of students who are eligible for free lunch is
predicted to reduce the passing percentage by 3.19 percentage points. A naive policy implication is that the government should tighten the
eligibility criteria to increase the passing percentage. This regression result doesn't seem right. It is
likely that the explanatory variable (the percentage of eligible students) is correlated with the poverty level,
school quality, and the resources of the school, which are contained in the error term. This makes the OLS
estimator biased.
Assumption 2.
This assumption means that samples are drawn randomly. This is possible in experiments. But, with
observed data, we hope that they are reasonably independent.
Independence of Yi and Yj means that ui and uj are independent. That is, ui's are i.i.d. random variables.
𝐸(𝑒𝑖 |𝑋𝑖 ) = 𝐸(𝑒𝑖 ) = 0
Homoskedasticity: π‘£π‘Žπ‘Ÿ(𝑒𝑖 |𝑋𝑖 ) = π‘£π‘Žπ‘Ÿ(𝑒𝑖 ) = πœŽπ‘’2
Heteroskedasticity: π‘£π‘Žπ‘Ÿ(𝑒𝑖 |𝑋𝑖 ) = πœŽπ‘–2 , which varies across observations
Theorem. The OLS estimator of coefficients are the BLUE under assumptions 1,2 and 3.
Proof of unbiasedness
(a) The OLS estimators can be written as
\hat\beta_1 = \beta_1 + \sum_{i=1}^{n} w_i u_i, \quad \text{where } w_i = \frac{X_i - \bar X}{\sum_{i=1}^{n}(X_i - \bar X)^2}
\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = (\beta_0 + \beta_1 \bar X + \bar u) - \hat\beta_1 \bar X = \beta_0 - (\hat\beta_1 - \beta_1)\bar X + \bar u
Proof.
Note that
\bar Y = \beta_0 + \beta_1 \bar X + \bar u
\sum_{i=1}^{n}(X_i - \bar X)\bar Y = \sum_{i=1}^{n}(X_i - \bar X)\beta_0 = 0 \quad (\text{since } \sum_{i=1}^{n}(X_i - \bar X) = 0)
\sum_{i=1}^{n}(X_i - \bar X)X_i = \sum_{i=1}^{n}(X_i - \bar X)(X_i - \bar X) = \sum_{i=1}^{n}(X_i - \bar X)^2
Using these relationships we can write
\sum_{i=1}^{n}(X_i - \bar X)(Y_i - \bar Y) = \sum_{i=1}^{n}(X_i - \bar X)Y_i = \sum_{i=1}^{n}(X_i - \bar X)(\beta_0 + \beta_1 X_i + u_i)
= \beta_1 \sum_{i=1}^{n}(X_i - \bar X)X_i + \sum_{i=1}^{n}(X_i - \bar X)u_i = \beta_1 \sum_{i=1}^{n}(X_i - \bar X)^2 + \sum_{i=1}^{n}(X_i - \bar X)u_i
Therefore,
\hat\beta_1 = \frac{\beta_1 \sum_{i=1}^{n}(X_i - \bar X)^2 + \sum_{i=1}^{n}(X_i - \bar X)u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2} = \beta_1 + \frac{\sum_{i=1}^{n}(X_i - \bar X)u_i}{\sum_{i=1}^{n}(X_i - \bar X)^2} = \beta_1 + \sum_{i=1}^{n} w_i u_i
Taking conditional expectation
𝐸(𝛽̂1 |𝑋𝑖 , 𝑖 = 1,2, … , 𝑛) = 𝛽1 + βˆ‘π‘›π‘–=1 𝑀𝑖 𝐸(𝑒𝑖 | 𝑋𝑖 , 𝑖 = 1,2, … , 𝑛) = 𝛽1
𝐸(𝛽̂0 |𝑋𝑖 , 𝑖 = 1,2, … , 𝑛) = 𝛽0 βˆ’ 𝐸[(𝛽̂1 βˆ’ 𝛽1 )𝑋̅|𝑋𝑖 , 𝑖 = 1,2, … , 𝑛] + 𝐸[𝑒̅|𝑋𝑖 , 𝑖 = 1,2, … , 𝑛] = 𝛽0
Theorem. Variance and covariance of coefficient estimators under homoscedasticity
Under the assumptions listed above, the variances of OLS estimators are given by
πœŽΜ‚2 = πœŽπ‘’2 𝑄0
πœŽΜ‚2 = πœŽπ‘’2 𝑄1
πœŽΜ‚ Μ‚ = π‘π‘œπ‘£(𝛽̂0 , 𝛽̂1 ) = βˆ’π‘‹Μ…π‘£π‘Žπ‘Ÿ(𝛽̂1 ) = βˆ’πœŽπ‘’2 𝑋̅𝑄1
𝛽0
𝛽0 ,𝛽1
𝛽1
βˆ‘π‘› 𝑋 2
𝑖
𝑄0 = 𝑛 βˆ‘π‘› 𝑖=1
(𝑋 βˆ’π‘‹Μ…)2
𝑖=1
𝑖
𝑄1 = βˆ‘π‘›
1
Μ… 2
𝑖=1(𝑋𝑖 βˆ’π‘‹)
Remarks:
1. Estimators become less precise (i.e., higher variances) as there is more uncertainty in the error term
(i.e., higher value of πœŽπ‘’2 ).
2. Estimators become more precise (i.e., lower variances) as the regressor X is more widely dispersed
around its mean, i.e., the denominator term is larger.
3. Estimators become more precise (i.e., lower variances) as the sample size n increases.
4. Variance of intercept term increases if values of regressor X are far away from the origin 0.
5. 𝛽̂0 and 𝛽̂1 are negatively (positively) correlated if 𝑋̅ is positive (negative) because the regression line
must pass through the point of sample means (𝑋̅, π‘ŒΜ…).
6. Variances of coefficient estimators are unknown because they involve πœŽπ‘’2 which is unknown. To
compute their variances, we need an estimator for πœŽπ‘’2 .
Least squares estimator of πœŽπ‘’2
Variances of least squares estimators of coefficients and their covariance cannot be computed from the
given data because πœŽπ‘’2 is unknown. How do we estimate it?
- πœŽπ‘’2 is the variance of error term: πœŽπ‘’2 = π‘£π‘Žπ‘Ÿ(𝑒𝑖 ) = 𝐸[𝑒𝑖 βˆ’ 𝐸(𝑒𝑖 )]2 = 𝐸[𝑒𝑖 ]2
- The expected value is the theoretical counterpart of the sample mean.
- If we have observed data on ui, we may estimate the variance by sample mean βˆ‘π‘›π‘–=1 𝑒𝑖2 /𝑛.
- Since we don’t have observed values of error terms, we may use its estimate 𝑒̂𝑖2
- A problem is that not all 𝑒̂𝑖2 can take independent values:
For example, βˆ‘π‘›π‘–=1 𝑒̂𝑖 = 0. This means that if we have values of the first n-1 estimated error
terms, the last one is automatically determined.
This is called the loss of degrees of freedom.
- The loss in degrees of freedom is equal to the number of parameters we estimate;
in the simple linear regression model, it is two.
- When we add up all estimated residuals, we are actually adding only n-2 independent values.
- The degrees of freedom is therefore n-2.
- We therefore estimate the variance by the average of the n-2 independent residuals:
\hat\sigma_u^2 = \frac{\sum_{i=1}^{n} \hat u_i^2}{n-2}
Estimated variances and covariance of the coefficient estimators are obtained by replacing the unknown
\sigma_u^2 with its unbiased estimate \hat\sigma_u^2:
\hat\sigma^2_{\hat\beta_0} = \hat\sigma_u^2 Q_0, \qquad \hat\sigma^2_{\hat\beta_1} = \hat\sigma_u^2 Q_1, \qquad \hat\sigma_{\hat\beta_0,\hat\beta_1} = -\bar X \hat\sigma^2_{\hat\beta_1}
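A minimal sketch of these variance and standard-error formulas, assuming Python with numpy and the same made-up wage/education data used in the earlier sketches.

```python
import numpy as np

# Illustrative (made-up) data
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([4.1, 5.0, 6.2, 5.8, 7.5, 9.1, 8.4, 11.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
uhat = y - b0 - b1 * x

sigma2_hat = np.sum(uhat ** 2) / (n - 2)       # unbiased estimate of sigma_u^2
q1 = 1.0 / np.sum((x - x.mean()) ** 2)
q0 = np.sum(x ** 2) / (n * np.sum((x - x.mean()) ** 2))

var_b1 = sigma2_hat * q1
var_b0 = sigma2_hat * q0
cov_b0_b1 = -x.mean() * var_b1
print(np.sqrt(var_b0), np.sqrt(var_b1), cov_b0_b1)   # standard errors and covariance
```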
Remark: Goodness of Fit R2
The idea of the R2 measure of goodness of fit was to compare the SSR in models with and without
information about the regressor and measure the fraction of the reduction in the SSR
R^2 = \frac{SSR_r - SSR_u}{SSR_r} = 1 - \frac{SSR_u}{SSR_r}
Another idea is to see how close predicted values π‘ŒΜ‚π‘– are to the observed π‘Œπ‘– . The closeness is measured by
the sample correlation coefficient or its squared value
π‘π‘œπ‘Ÿπ‘Ÿ(π‘Œπ‘– , π‘ŒΜ‚π‘– ) =
π‘π‘œπ‘£(π‘Œπ‘– ,π‘ŒΜ‚π‘– )
𝑆𝐷(π‘Œπ‘– )𝑆𝐷(π‘ŒΜ‚π‘– )
Μ‚
2
[π‘π‘œπ‘£(π‘Œπ‘– ,π‘Œπ‘– )]
𝑅 2 = [π‘π‘œπ‘Ÿπ‘Ÿ(π‘Œπ‘– , π‘ŒΜ‚π‘– )]2 =
π‘£π‘Žπ‘Ÿ(π‘Œ )π‘£π‘Žπ‘Ÿ(π‘ŒΜ‚ )
𝑖
𝑖
2
To show the last expression is the same as the previous expression of R , we first show
βˆ‘π‘›π‘–=1 𝑒̂𝑖 = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ 𝛽̂0 βˆ’ 𝛽̂1 𝑋𝑖 ) = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ…) + 𝛽̂1 βˆ‘π‘›π‘–=1(𝑋𝑖 βˆ’ 𝑋̅) = 0 + 0 = 0
βˆ‘π‘›π‘–=1 𝑋𝑖 𝑒̂𝑖 = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ 𝛽̂0 βˆ’ 𝛽̂1 𝑋𝑖 ) = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ…) + 𝛽̂1 βˆ‘π‘›π‘–=1(𝑋𝑖 βˆ’ 𝑋̅) = 0 + 0 = 0
Noting
Y_i = \hat Y_i + \hat u_i \;\Rightarrow\; \bar Y = \bar{\hat Y} + \bar{\hat u} = \bar{\hat Y}
it is easy to show
cov(Y_i, \hat Y_i) = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar Y)(\hat Y_i - \bar{\hat Y}) = \frac{1}{n}\sum_{i=1}^{n}(\hat Y_i + \hat u_i - \bar{\hat Y})(\hat Y_i - \bar{\hat Y}) = \frac{1}{n}\sum_{i=1}^{n}(\hat Y_i - \bar{\hat Y})^2
where the last equality holds because \sum_{i=1}^{n} \hat u_i (\hat Y_i - \bar{\hat Y}) = 0, which follows from \sum_{i=1}^{n} \hat u_i = 0 and \sum_{i=1}^{n} X_i \hat u_i = 0. This shows
var(\hat Y_i) = \frac{1}{n}\sum_{i=1}^{n}(\hat Y_i - \bar{\hat Y})^2 = cov(Y_i, \hat Y_i)
Therefore,
[π‘π‘œπ‘£(π‘Œ ,π‘ŒΜ‚ )]2
π‘£π‘Žπ‘Ÿ(π‘ŒΜ‚ )
𝑖 𝑖
𝑅 2 = π‘£π‘Žπ‘Ÿ(π‘Œ )π‘£π‘Žπ‘Ÿ(π‘Œ
= π‘£π‘Žπ‘Ÿ(π‘Œπ‘–)
Μ‚)
𝑖
𝑖
𝑖
Note that
var(Y_i) = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar Y)^2 = \frac{1}{n} SSR_r
Now, we will show that n \times var(\hat Y_i) = SSR_r - SSR_u.
βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ…)2 = βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ‚π‘– + π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…)2
2
= βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ‚π‘– )2 + βˆ‘π‘›π‘–=1(π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…) + 2 βˆ‘π‘›π‘–=1(π‘Œπ‘– βˆ’ π‘ŒΜ‚π‘– )(π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…)
2
= βˆ‘π‘›π‘–=1 𝑒̂𝑖 2 + βˆ‘π‘›π‘–=1(π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…) + 2 βˆ‘π‘›π‘–=1 𝑒̂𝑖 (π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…)
The last term is zero
βˆ‘π‘›π‘–=1 𝑒̂𝑖 (π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…) = βˆ‘π‘›π‘–=1 𝑒̂𝑖 (π‘ŒΜ‚π‘– βˆ’ π‘ŒΜ…) = βˆ‘π‘›π‘–=1 𝑒̂𝑖 (𝛽̂0 + 𝛽̂1 𝑋𝑖 βˆ’ π‘ŒΜ…) = 0
where the last equality is due to
βˆ‘π‘›π‘–=1 𝑒̂𝑖 (𝛽̂0 βˆ’ π‘ŒΜ…) = (𝛽̂0 βˆ’ π‘ŒΜ…) βˆ‘π‘›π‘–=1 𝑒̂𝑖 = 0
βˆ‘π‘›π‘–=1 𝑒̂𝑖 𝑋𝑖 = 0
This is shown before.
Putting all these results together, we have
R^2 = \frac{[cov(Y_i, \hat Y_i)]^2}{var(Y_i)\, var(\hat Y_i)} = \frac{var(\hat Y_i)}{var(Y_i)} = \frac{SSR_r - SSR_u}{SSR_r} = 1 - \frac{SSR_u}{SSR_r}