Chapter 2. Simple Linear Regression Model

Background
Suppose we wish to learn about the effect of education (X) on the wage rate (Y) in the U.S. We believe that the average (expected) wage rate depends on the education level:
$E(Y_i) = \beta_0 + \beta_1 X_i$
The deviation of an individual's wage rate from this average is random:
$Y_i = E(Y_i) + u_i = \beta_0 + \beta_1 X_i + u_i$
$Y_i$ = dependent variable, or explained variable
$X_i$ = independent variable, or explanatory variable
$u_i$ = error term, or disturbance term
$\beta_0$ = intercept parameter, or intercept coefficient, or constant term
$\beta_1$ = slope parameter, or slope coefficient
- The error term captures the effects of all other variables, some of which are observable and some are not.
- The properties of the error term play an important role in determining the properties of the parameter estimates. We will discuss this much later.
- The slope parameter represents the marginal effect of X on Y: if X increases by one unit, Y changes by $\beta_1$.

What is estimation? See Excel file 2.1
- Collect data on X and Y from n individuals.
- Plot them on the x-y plane. This is called a scatterplot; each point represents an individual.
- Since the model is specified as a linear model, we wish to find a straight line that captures the relationship between the two variables.
- There can be many lines that appear to be a good line.
- We need to pick one line. Which line is the best?
- We first need to decide what we mean by the "best."

Estimation Methods
We will start with a simpler approach, called the Ordinary Least Squares (OLS) estimation method. To learn about the OLS method, let
$\hat\beta_0, \hat\beta_1$ denote the estimated parameters,
$\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$ the predicted (or estimated) dependent variable, and
$\hat u_i = Y_i - \hat Y_i$ the (regression) residuals, or prediction errors.

[Figure: scatterplot of the sample points with a candidate line; the vertical deviations of the points from the line, such as $\hat u_4$ and $\hat u_5$, are the residuals.]

See Excel 2.2
The line of predicted Y for each X is the regression line we are looking for. The objective of estimation is to find the line that is closest to the sample points.
- We want parameter estimates that make the predicted value $\hat Y_i$ and the actual observed value $Y_i$ as close as possible.
- But we cannot make all predicted values close to the actual values; some are close and some are not. How do we measure the overall closeness?

Idea 0: measure closeness by the sum of residuals. This does not work, because positive and negative residuals cancel each other.

Idea 1: to avoid the cancellation problem, we can take the sum of absolute residuals (SAR):
$SAR = \sum_{i=1}^n |\hat u_i|$
and find the parameters that make this SAR the smallest. Such an estimator is called the Minimum Absolute Deviation (MAD) estimator, or the Least Absolute Deviation (LAD) estimator.

Idea 2: another way to avoid the cancellation problem is to take the sum of squared residuals (SSR):
$SSR = \sum_{i=1}^n \hat u_i^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$
The parameter values that make the SSR the smallest are called the Ordinary Least Squares (OLS) estimator.

Remarks:
- The LAD estimates of the parameters represent the intercept and the marginal effect of $X_i$ on the median of $Y_i$; that is, the median of $Y_i$ is equal to $\beta_0 + \beta_1 X_i$.
- The OLS estimates of the parameters represent the intercept and the marginal effect of $X_i$ on the mean of $Y_i$; that is, the mean of $Y_i$ is equal to $\beta_0 + \beta_1 X_i$.
- We may also be interested in the effect of $X_i$ on the quantiles (e.g., the 25% quantile, or the 75% quantile) of $Y_i$. Since the 50% quantile is the median, this is a generalization of the idea of the LAD estimator, and it is called Quantile Regression instead of Linear Regression.

- Objective functions of the LAD and quantile estimators ($\tau$ = quantile):
LAD estimator: $\min_{\beta_0,\beta_1} SAR = \sum_{i=1}^n |\hat u_i| = \sum_{i=1}^n |Y_i - \hat\beta_0 - \hat\beta_1 X_i|$
Quantile estimator: $\min_{\beta_0,\beta_1} Q(\tau) = \sum_{\hat u_i > 0} \tau |\hat u_i| + \sum_{\hat u_i \le 0} (1-\tau) |\hat u_i|$
- Note that the LAD estimator gives the same weight (equal to 1) to the absolute error regardless of the sign of the error (whether it is positive or negative).
- Note that the quantile estimator gives asymmetric weights to the absolute errors: weight $\tau$ on positive errors (under-prediction) and weight $(1-\tau)$ on negative errors (over-prediction). A numerical comparison of these objective functions is sketched below.
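To make the three "closeness" measures concrete, here is a minimal Python sketch (not part of the original Excel examples; the data points and candidate lines are made up for illustration). It evaluates the raw sum of residuals, the SAR, the SSR, and the quantile objective for a candidate line, showing why the raw sum is useless while SAR and SSR are not.

```python
import numpy as np

# Made-up sample (x = education, y = wage) -- illustration only.
x = np.array([2.0, 4.0, 5.0, 6.0, 8.0, 9.0])
y = np.array([2.1, 3.9, 4.0, 5.2, 6.8, 7.1])

def objectives(b0, b1, tau=0.5):
    """Return (sum of residuals, SAR, SSR, quantile objective) for a candidate line."""
    u = y - (b0 + b1 * x)          # residuals u_i = y_i - yhat_i
    sum_resid = u.sum()            # Idea 0: positive and negative residuals cancel
    sar = np.abs(u).sum()          # Idea 1: LAD objective
    ssr = (u ** 2).sum()           # Idea 2: OLS objective
    # Quantile objective: weight tau on positive residuals, (1 - tau) on the rest
    q = np.where(u > 0, tau * np.abs(u), (1 - tau) * np.abs(u)).sum()
    return sum_resid, sar, ssr, q

# Two very different candidate lines both have a small raw sum of residuals...
print(objectives(0.5, 0.75))
print(objectives(5.0, 0.0))   # ...but SAR and SSR clearly tell them apart.
```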
How do we compute the parameter estimates? See Excel 2.3
(1) Use Excel's INTERCEPT and SLOPE functions
(2) Use Excel's Regression tool
(3) Use Excel's Solver function
(4) Use the algebraic solutions (a code sketch follows at the end of this subsection):
$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$
$\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$
where $\bar X$ and $\bar Y$ are the sample means of X and Y.

Derivation of the OLS estimators
$SSR = \sum_{i=1}^n \hat u_i^2 = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i)^2$
Setting the partial derivatives of the SSR to zero:
$\frac{\partial SSR}{\partial \hat\beta_0} = -2 \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 \;\Rightarrow\; \hat\beta_0 = \bar Y - \hat\beta_1 \bar X$
$\frac{\partial SSR}{\partial \hat\beta_1} = -2 \sum_{i=1}^n X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0 \;\Rightarrow\; \sum_{i=1}^n X_i \hat u_i = 0$

Interpretation of the OLS estimates
For proper interpretation of the estimation results, you have to remember
(i) the measurement unit of each variable, and
(ii) the functional form of the regression.

- Measurement unit
Example: wage = -0.90 + 0.54 educ
where wage is measured in dollars and educ is measured in years.
Slope estimate 0.54: one more year of education is expected to raise the hourly wage by 0.54 dollars (54 cents) on average. (1976 data)
Intercept estimate -0.90: a person with no education receives negative 90 cents per hour -- a silly result. This is caused by the lack of observations with zero education, so the estimate at this low end is not good.
Prediction: the predicted wage of a person who has 10 years of education is $4.50/hour, computed by substituting 10 for educ: 4.50 = -0.90 + 0.54(10).

- Functional form
Example: ln(wage) = 0.584 + 0.083 educ
where wage is measured in dollars and educ is measured in years.
Slope estimate 0.083: one more year of education is expected to increase the hourly wage by about 8.3% (100 x 0.083).
Intercept estimate 0.584: a person with no education is expected to receive a wage equal to $1.79 = exp(0.584).
Prediction: the predicted wage of a person who has 10 years of education is computed by substituting 10 for educ and converting back from logs: exp(0.584 + 0.083(10)) = exp(1.414), about $4.11/hour.

Unexpected Results
We expect that the education level has a positive effect on the wage, i.e., that $\beta_1$ is positive. What if its estimate is negative, i.e., an unexpected result? It is usually due to misspecification of the model, such as
- omitted explanatory variables
- wrong functional form
- a coefficient (the marginal effect of educ) that is not the same for all levels of education
- and others
What do we have to do?

Example: the Phillips curve
- A simple regression model gives a positively sloped Phillips curve.
- This is caused by the shift of the curve over time.
- See Excel file 2.4
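Outside of Excel, option (4) is easy to reproduce directly. Below is a minimal Python sketch of the algebraic OLS solution, with made-up (educ, wage) pairs standing in for the Excel data; it also shows the log-level prediction step, where the fitted value must be converted back from logs.

```python
import numpy as np

# Made-up (educ, wage) pairs -- stand-ins for the Excel examples.
educ = np.array([8, 10, 11, 12, 12, 14, 16, 16, 18], dtype=float)
wage = np.array([3.2, 4.1, 4.9, 5.6, 5.1, 6.9, 8.3, 7.8, 9.5])

xbar, ybar = educ.mean(), wage.mean()

# Algebraic OLS solutions from the first-order conditions
b1 = ((educ - xbar) * (wage - ybar)).sum() / ((educ - xbar) ** 2).sum()
b0 = ybar - b1 * xbar
print(f"wage_hat = {b0:.3f} + {b1:.3f} * educ")

# Level-level prediction at educ = 10 (effect measured in dollars per year)
print("predicted wage at educ=10:", b0 + b1 * 10)

# Log-level model: fit on ln(wage), then undo the log for the prediction.
lw = np.log(wage)
g1 = ((educ - xbar) * (lw - lw.mean())).sum() / ((educ - xbar) ** 2).sum()
g0 = lw.mean() - g1 * xbar
print("predicted wage at educ=10 (log model):", np.exp(g0 + g1 * 10))
```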
A Measure of Goodness-of-Fit: R Squared ($R^2$)
Example
- Differences in wage rates can be explained partly by the differences in education levels.
- The remaining part of the differences in wages is due to unobservable random factors (the error terms).
Questions
- Is this model any good?
- Does education explain wages well?
- How much of the differences in wage rates across individuals is explained by the differences in years of education?
- How can we measure it?
To answer these questions, we may compare two models:
- one model that uses the information on education, the unrestricted model: $Y_i = \beta_0 + \beta_1 X_i + u_i$
- one model that does not use the information on education, the restricted model: $Y_i = \beta_0 + u_i$
The restriction is $\beta_1 = 0$.
How do we measure the relative performances of these two models?
- The least squares estimators are the estimators that make the SSR as small as possible.
- A smaller SSR means a relatively better overall prediction.
- Therefore, we can compare the minimized SSRs of the two models.
Estimate both models using the OLS method, and denote their SSRs by
$SSR_u$ from the unrestricted model,
$SSR_r$ from the restricted model.
If education has strong explanatory power, we would expect $SSR_u$ to be much smaller than $SSR_r$. What is the proportional reduction in SSR when the education level is used to predict wages? This ratio, expressed as a decimal fraction, is called $R^2$:
$R^2 = 1 - \frac{SSR_u}{SSR_r}, \qquad 0 \le R^2 \le 1$
A high $R^2$ means that the differences in education have strong explanatory power for the differences in wage rates, and hence our model is good, and vice versa.
Note: the estimator of $\beta_0$ in the restricted model is $\hat\beta_0 = \bar Y$. Therefore
$SSR_r = \sum_{i=1}^n (Y_i - \bar Y)^2$
This is called the total sum of squares (TSS), and $SSR_u$ is the usual SSR that we discussed before. Therefore, $R^2$ is usually written as
$R^2 = 1 - \frac{SSR}{TSS} = \frac{TSS - SSR}{TSS}$
where SSE = TSS - SSR is the explained sum of squares.
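A quick numerical check of the $R^2$ definition, continuing the Python sketch above (same made-up data; the restricted model predicts every $Y_i$ by $\bar Y$):

```python
import numpy as np

educ = np.array([8, 10, 11, 12, 12, 14, 16, 16, 18], dtype=float)
wage = np.array([3.2, 4.1, 4.9, 5.6, 5.1, 6.9, 8.3, 7.8, 9.5])

xbar, ybar = educ.mean(), wage.mean()
b1 = ((educ - xbar) * (wage - ybar)).sum() / ((educ - xbar) ** 2).sum()
b0 = ybar - b1 * xbar

ssr_u = ((wage - (b0 + b1 * educ)) ** 2).sum()  # unrestricted: uses educ
ssr_r = ((wage - ybar) ** 2).sum()              # restricted (beta1 = 0): the TSS

r2 = 1 - ssr_u / ssr_r
print(f"SSRu = {ssr_u:.3f}, SSRr = TSS = {ssr_r:.3f}, R^2 = {r2:.3f}")
```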
Sampling Variation of the OLS Estimator
A simple regression model: $Y_i = \beta_0 + \beta_1 X_i + u_i$.
Collect a data set and estimate the parameters: $\hat\beta_0, \hat\beta_1$.
Collect another data set and estimate the parameters; these estimates will be different from the previous estimates. Repeat the procedure and get different estimated values each time. What can we say about these different estimated values of the same unknown parameters? The average of these estimates? The dispersion of these estimates?

Basic facts
$u_i$ is a random variable, and $Y_i$ is a function of $u_i$; therefore $Y_i$ is also random. $X_i$ can be a random variable too. From the formulas $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$ and $\hat\beta_1 = \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) / \sum_{i=1}^n (X_i - \bar X)^2$, the estimators are random variables as well. Therefore, the estimators vary from sample to sample (with the realized values of the error terms).

Desired properties of estimators
- Unbiasedness: the average of the estimates is equal to the true value.
- Precision: the smallest dispersion.
If an estimator is unbiased and has the smallest dispersion among all unbiased estimators, it is called the best unbiased estimator.
Can we claim that the OLS estimators are the best unbiased estimators? The answer is yes, under certain conditions.
Assumption 1. The conditional distribution of $u_i$ given $X_i$ has a zero mean.
Assumption 2. $(X_i, Y_i)$ are independently and identically distributed.
Assumption 3. Large outliers are unlikely.
Under these assumptions, the OLS estimators are BLUE (the Best Linear Unbiased Estimator).
To understand these concepts, we will do a very brief review of basic statistical theory.

Review of Statistical Theory
Random experiment: an experiment whose outcome cannot be predicted with certainty.
Probability: a numerical measure of the relative likelihood (frequency) of the various outcomes of a random experiment. It takes a value in the closed interval [0, 1].
Toss a coin with an outcome of either Head (H) or Tail (T). Let $p = \text{prob}(H)$, so $1-p = \text{prob}(T)$. If $p = 1/2$, it is a fair coin.
Define a random variable Y:
Y = 1 if the coin lands on H
Y = 0 if the coin lands on T
Probability density function (pdf) of Y: $f(y) = P(Y = y)$, so $f(1) = p$ and $f(0) = 1-p$.
[Graph of the pdf]
Cumulative distribution function (cdf): $F(y) = P(Y \le y)$
$F(y) = 0$ if $y < 0$; $F(y) = 1-p$ if $0 \le y < 1$; $F(y) = 1$ if $y \ge 1$.
[Graph of the cdf]

Moments of the random variable Y: expected value and variance
These are summary values of the characteristics of the probability distribution.
Mean, or expected value, of Y
The expected value of Y is a measure of the location of the "center" of the probability distribution. It is a probability-weighted average of the outcomes of Y:
$\mu = E(Y) = p \cdot 1 + (1-p) \cdot 0 = p$
Variance and standard deviation of Y
The variance is a measure of the degree of dispersion, or spread, of the random outcomes around their mean:
$\sigma^2 = \text{var}(Y) = E[(Y - \mu)^2] = E[Y^2 - 2\mu Y + \mu^2] = E(Y^2) - \mu^2$
$\sigma = \sqrt{\sigma^2}$
For the example above,
$\sigma^2 = p(1-p)^2 + (1-p)(0-p)^2 = p(1-p)$
This shows that the variance is the probability-weighted average of the squared distances of the outcomes from their mean.

Example: more than two outcomes
Toss two balanced coins (or a balanced coin twice).
Potential outcomes: {HH}, {HT}, {TH}, {TT}. Each outcome is equally likely, and hence the probability of each outcome is 1/4.
You win a prize in dollars equal to the number of heads. Let Y denote the amount of the prize you can win; Y can take $2, $1, $1, and $0.
Expected prize: $\mu = E(Y) = \frac14 \cdot 2 + \frac14 \cdot 1 + \frac14 \cdot 1 + \frac14 \cdot 0 = 1$
Variance of the prize: $\sigma^2 = \text{var}(Y) = \frac14 (2-1)^2 + \frac14 (1-1)^2 + \frac14 (1-1)^2 + \frac14 (0-1)^2 = \frac12$

Joint Distribution of Two Random Variables
Toss a fair coin twice.
Possible outcomes: {HH}, {HT}, {TH}, {TT}, equally likely: the probability of each outcome is 1/4.
Two prizes X and Y are specified as follows: X = -1 on {HT or TH} and X = 2 on {HH or TT}; Y = -2 on {TT}, Y = 0 on {HT or TH}, and Y = 2 on {HH}. The joint and marginal probabilities are:

          Y = -2   Y = 0   Y = 2   f(x)
X = -1      0       1/2     0      1/2
X =  2     1/4       0     1/4     1/2
f(y)       1/4      1/2    1/4      1

Jane is given ticket Y and John is given ticket X.
Probability of an individual random variable -- the marginal probability
What are the probabilities that Jane wins -$2? $0? $2?
What are the probabilities that John wins $2? loses $1?
Joint probability
(i) What is the probability that both win $2? This happens only on {HH}, and its probability is 1/4.
(ii) What is the probability that John wins $2 and Jane loses $2? This happens only on {TT}, and its probability is 1/4.
The last column shows the marginal probability density of X, and the last row shows the marginal probability density of Y.
$\mu_x = 0.5$, $\sigma_x^2 = E(X^2) - \mu_x^2 = 2.5 - 0.25 = 2.25$
$\mu_y = 0.0$, $\sigma_y^2 = E(Y^2) - \mu_y^2 = 2.0 - 0.0 = 2.0$

Covariance of two random variables X and Y
A measure of the degree of covariation of the two random variables. It is computed by
$\sigma_{xy} = \text{cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)] = E(XY) - E(X)\mu_y - \mu_x E(Y) + \mu_x \mu_y = E(XY) - \mu_x \mu_y$
In the example above,
$\sigma_{xy} = E(XY) - \mu_x \mu_y = -1 + 1 - 0 = 0$
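The moments just computed can be verified mechanically from the joint probability table. A short Python sketch (the table is exactly the one above; nothing else is assumed):

```python
import numpy as np

x_vals = np.array([-1.0, 2.0])            # rows: outcomes of X
y_vals = np.array([-2.0, 0.0, 2.0])       # columns: outcomes of Y
f = np.array([[0.00, 0.50, 0.00],         # joint probabilities f(x, y)
              [0.25, 0.00, 0.25]])

fx = f.sum(axis=1)                        # marginal of X: [1/2, 1/2]
fy = f.sum(axis=0)                        # marginal of Y: [1/4, 1/2, 1/4]

mu_x = (x_vals * fx).sum()                # 0.5
mu_y = (y_vals * fy).sum()                # 0.0
var_x = (x_vals**2 * fx).sum() - mu_x**2  # 2.25
var_y = (y_vals**2 * fy).sum() - mu_y**2  # 2.0

# E(XY) from the joint table, then the covariance shortcut E(XY) - mu_x * mu_y
e_xy = (np.outer(x_vals, y_vals) * f).sum()
cov_xy = e_xy - mu_x * mu_y
print(mu_x, mu_y, var_x, var_y, cov_xy)   # 0.5 0.0 2.25 2.0 0.0
```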
Covariance can take a positive, negative, or zero value. The value of the covariance depends on the measurement units: a covariance of 2 will change to 20000 if the measurement unit changes from dollars to cents. Let X and Y be measured in dollars, and W and Z be measured in cents. Then W = 100X and Z = 100Y. Therefore,
$\mu_w = E(W) = 100 E(X)$, $\mu_z = E(Z) = 100 E(Y)$
$\text{cov}(W,Z) = \text{cov}(100X, 100Y) = E[(100X - 100\mu_x)(100Y - 100\mu_y)] = 10000\,E[(X - \mu_x)(Y - \mu_y)] = 10000\,\text{cov}(X,Y)$

Correlation Coefficient of X and Y
To avoid the dependence of the covariance on the measurement units, standardize it. It is computed by
$\rho_{xy} = \text{corr}(X,Y) = \frac{\text{cov}(X,Y)}{sd(X)\,sd(Y)} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Its value lies between -1 and 1:
$\rho_{xy} = +1$: X and Y are perfectly positively correlated
$\rho_{xy} = -1$: X and Y are perfectly negatively correlated
$\rho_{xy} = 0$: X and Y are uncorrelated

Conditional Probability and Conditional Moments
Consider again the joint distribution of the previous example:

          Y = -2   Y = 0   Y = 2   f(x)
X = -1      0       1/2     0      1/2
X =  2     1/4       0     1/4     1/2
f(y)       1/4      1/2    1/4      1

Suppose that you are told that X = -1 (i.e., you are told that the outcome is either TH or HT). What is the probability that Y = -2? That is, when the outcome is known to be either TH or HT, what is the probability that the outcome is {TT}? It is zero, because {TT} will not occur when the outcome is either TH or HT. Similar reasoning gives
$P(Y=2|X=-1) = 0$ and $P(Y=0|X=-1) = 1$
$P(Y=-2|X=2) = 1/2$, $P(Y=0|X=2) = 0$, $P(Y=2|X=2) = 1/2$
Conditional probability is denoted by $f(y|x) = P(Y=y|X=x)$. The conditional probability density is computed by
$f(y|x) = \frac{f(x,y)}{f(x)}, \qquad f(x) \ne 0$
To understand this formula, suppose Y takes the values $y_1$, $y_2$, and $y_3$. For a given X = x, we wish to find $f(y_1|x)$, $f(y_2|x)$, and $f(y_3|x)$. These conditional probabilities must satisfy two properties:
(1) They must sum to 1: $f(y_1|x) + f(y_2|x) + f(y_3|x) = 1$
(2) They must maintain the relative probabilities of the outcomes of Y: $\frac{f(y_i|x)}{f(y_j|x)} = \frac{f(x,y_i)}{f(x,y_j)}$
The joint probabilities of course satisfy property (2), but their sum is not equal to 1: their sum is the marginal probability f(x). Therefore, we divide them by f(x), so that
$f(y_1|x) + f(y_2|x) + f(y_3|x) = \frac{f(x,y_1)}{f(x)} + \frac{f(x,y_2)}{f(x)} + \frac{f(x,y_3)}{f(x)} = \frac{f(x,y_1) + f(x,y_2) + f(x,y_3)}{f(x)} = \frac{f(x)}{f(x)} = 1$
The mean and the variance based on the conditional probabilities are called the conditional mean (conditional expected value) and the conditional variance.
$E(Y|X=-1) = (-2) \times 0 + 0 \times 1 + 2 \times 0 = 0$
$E(Y|X=2) = (-2) \times \frac12 + 0 \times 0 + 2 \times \frac12 = 0$

Statistical Independence of Random Variables
Random variables X and Y are statistically independent if information on X does not change the marginal probability of Y, i.e., $P(Y=y|X=x) = P(Y=y)$. Equivalently, X and Y are independent if $f(x,y) = f(x)f(y)$ for all x and y.
We have shown that X and Y in the example above are uncorrelated. But they are not statistically independent. This is easily verified by checking whether the joint probabilities are the products of the marginal probabilities.

Computation of moments of functions of random variables
Consider two random variables X and Y, and let a, b, c and d be constants.
(i) Let Z = a + bX. Then $E(Z) = E(a+bX) = a + b\,E(X)$ and $\sigma_z^2 = \text{var}(Z) = b^2 \sigma_x^2$. Prove these results.
(ii) Let Z = a + bX + cY. Then $E(Z) = E(a+bX+cY) = a + b\,E(X) + c\,E(Y)$ and $\sigma_z^2 = b^2 \sigma_x^2 + c^2 \sigma_y^2 + 2bc\,\sigma_{xy}$. If X and Y are uncorrelated (or independent), then $\sigma_z^2 = b^2 \sigma_x^2 + c^2 \sigma_y^2$. Prove these results.
(iii) $\text{cov}(a+bX, c+dY) = bd\,\text{cov}(X,Y)$
These rules are verified numerically in the sketch below.
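As a companion to the proofs, here is a minimal numerical verification of rule (ii) against the same joint table (the constants a, b, c are arbitrary; the helper E is a hypothetical convenience, not notation from the notes):

```python
import numpy as np

x_vals = np.array([-1.0, 2.0])
y_vals = np.array([-2.0, 0.0, 2.0])
f = np.array([[0.00, 0.50, 0.00],
              [0.25, 0.00, 0.25]])

def E(g):
    """Expectation of g(x, y) under the joint probability table."""
    return sum(f[i, j] * g(x_vals[i], y_vals[j])
               for i in range(len(x_vals)) for j in range(len(y_vals)))

a, b, c = 3.0, 2.0, -1.0
mu_x, mu_y = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mu_x) ** 2)
var_y = E(lambda x, y: (y - mu_y) ** 2)
cov_xy = E(lambda x, y: (x - mu_x) * (y - mu_y))

# Rule (ii): mean and variance of Z = a + bX + cY, computed both ways
z_mean = E(lambda x, y: a + b * x + c * y)
z_var = E(lambda x, y: (a + b * x + c * y - z_mean) ** 2)
print(z_mean, a + b * mu_x + c * mu_y)                          # equal: 4.0
print(z_var, b**2 * var_x + c**2 * var_y + 2 * b * c * cov_xy)  # equal: 11.0
```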
Desired Properties of Estimators
Toss a coin n times, and let $X_i$ denote the outcome of the i-th toss: $X_i = 1$ if H and $X_i = 0$ if T. The coin is not necessarily balanced; let $\theta = \text{prob}(H)$.
We decide to estimate $\theta$ by the fraction of heads in the n tosses:
$\hat\theta = \frac{\sum_{i=1}^n X_i}{n} = \frac{\#\text{ of heads}}{n}$
What can we say about the statistical properties of this estimator?
(a) $E(\hat\theta) = \theta$ for any $\theta$
(b) $\text{var}(\hat\theta) = \frac{\theta(1-\theta)}{n}$
Unbiased estimator: an estimator that satisfies property (a) is called an unbiased estimator. bias $= E(\hat\theta) - \theta$.
Best estimator: an unbiased estimator that has the smallest variance among all unbiased estimators is called the best unbiased estimator.
An alternative unbiased estimator:
$\tilde\theta = \frac{X_1 + X_3 + X_5}{3}$
$E(\tilde\theta) = \frac{E(X_1) + E(X_3) + E(X_5)}{3} = \theta$ ... $\tilde\theta$ is an unbiased estimator
$\text{var}(\tilde\theta) = \frac{\theta(1-\theta)}{3}$
If n > 3, then $\hat\theta$ has a smaller variance than $\tilde\theta$.

Statistical Properties of the OLS Estimators
Linear model: $Y_i = \beta_0 + \beta_1 X_i + u_i$
OLS estimators: $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$, $\hat\beta_1 = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}$
Best Linear Unbiased Estimator (BLUE): an estimator is called the best linear unbiased estimator if it is a linear estimator of $Y_i$ and unbiased, and its variance is the smallest among all linear unbiased estimators.
Assumption 1. The conditional distribution of $u_i$ given $X_i$ has a zero mean.
Assumption 2. $(X_i, Y_i)$ are independently and identically distributed.
Assumption 3. Large outliers are unlikely.

Assumption 1: $E(u_i|X_i) = 0$, which implies $E(Y_i|X_i) = \beta_0 + \beta_1 X_i$ and $E(u_i) = E[E(u_i|X_i)] = E[0] = 0$.
Example (wage rates of individuals): individuals with 12 years of education ($X_i = 12$) get $\beta_0 + 12\beta_1$ on average. Some individuals get a higher and some a lower wage rate than the average rate, and the average of the deviations from the mean is zero.

Violations of Assumption 1
The classical case is the simultaneous equations model, e.g. a supply-demand equilibrium model:
Supply: $Q^s = a + bP + u^s$
Demand: $Q^d = \alpha - \beta P + u^d$
In equilibrium, $Q = Q^s = Q^d$, which determines the price:
$P = \frac{\alpha - a}{b + \beta} + \frac{u^d - u^s}{b + \beta}$
so the equilibrium price depends on both error terms. Solving $Q^s = Q^d$ for $u^d$,
$u^d = (a - \alpha) + (b + \beta)P + u^s$
$E(u^d|P) = (a - \alpha) + (b + \beta)P + E(u^s|P) \ne 0$
so the error term of the demand equation is correlated with the regressor P.

Feedback policy: the GDP growth rate depends on the interest rate, and the monetary authority sets the target interest rate considering the GDP growth.
The Connecticut Huskies won their 100th consecutive game: the performance of a team depends on the quality of its players, and the quality of new players also depends on the performance of the team.

Omitted variables: an omitted variable may cause correlation between the error term and the explanatory variable. The regression below measures the effect of the percentage of children who are eligible for the free lunch program (lunch) on the percentage (math) of 10th graders who pass the math exam. The regression result is
math = 32.14 - 0.319 lunch,  n = 408, $R^2$ = 0.171
This indicates that a 10 point increase in the percentage of students eligible for free lunch will reduce the passing percentage by 3.19 points. A policy implication is that the government must tighten the eligibility criteria to increase the passing percentage. This regression result doesn't seem right: it is likely that the explanatory variable (the percentage of eligible students) is correlated with the poverty level, school quality, and school resources, which are contained in the error term. This makes the OLS estimator biased.

Assumption 2: this assumption means that the sample is drawn randomly. This is possible in experiments; with observational data, we hope that the observations are reasonably independent. Independence of $Y_i$ and $Y_j$ means that $u_i$ and $u_j$ are independent. That is, the $u_i$'s are i.i.d. random variables, with
$E(u_i|X_i) = E(u_i) = 0$
Homoskedasticity: $\text{var}(u_i|X_i) = \text{var}(u_i) = \sigma_u^2$, the same for all i.
Heteroskedasticity: $\text{var}(u_i|X_i) = \text{var}(u_i) = \sigma_i^2$, varying across i.
A small Monte Carlo illustration of the resulting sampling behavior of the OLS estimators is sketched below.
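Before the algebraic proof of the next section, here is a small Monte Carlo sketch of the unbiasedness and sampling-variation claims. The true parameter values, the X distribution, and the error variance are all chosen arbitrarily for the simulation; the draws satisfy Assumptions 1 and 2 by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma_u = 1.0, 0.5, 2.0   # assumed "true" values for the simulation
n, reps = 50, 10_000

b1_draws = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, size=n)       # i.i.d. draws (Assumption 2)
    u = rng.normal(0, sigma_u, size=n)   # E(u|x) = 0 (Assumption 1)
    y = beta0 + beta1 * x + u
    xbar = x.mean()
    b1_draws[r] = ((x - xbar) * (y - y.mean())).sum() / ((x - xbar) ** 2).sum()

print("mean of b1 estimates:", b1_draws.mean())   # close to 0.5 -> unbiased
print("sampling std dev of b1:", b1_draws.std())  # the sampling variation
```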
Theorem. The OLS coefficient estimators are BLUE under Assumptions 1, 2 and 3.

Proof of unbiasedness
(a) The OLS estimators can be written as
$\hat\beta_1 = \beta_1 + \sum_{i=1}^n w_i u_i$, where $w_i = \frac{X_i - \bar X}{\sum_{j=1}^n (X_j - \bar X)^2}$
$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = (\beta_0 + \beta_1 \bar X + \bar u) - \hat\beta_1 \bar X = \beta_0 - (\hat\beta_1 - \beta_1)\bar X + \bar u$
Proof. Note that $\bar Y = \beta_0 + \beta_1 \bar X + \bar u$, and since $\sum_{i=1}^n (X_i - \bar X) = 0$,
$\sum_{i=1}^n (X_i - \bar X)\bar Y = \sum_{i=1}^n (X_i - \bar X)\beta_0 = 0$
$\sum_{i=1}^n (X_i - \bar X)X_i = \sum_{i=1}^n (X_i - \bar X)(X_i - \bar X) = \sum_{i=1}^n (X_i - \bar X)^2$
Using these relationships, we can write
$\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) = \sum_{i=1}^n (X_i - \bar X)Y_i = \sum_{i=1}^n (X_i - \bar X)(\beta_0 + \beta_1 X_i + u_i) = \beta_1 \sum_{i=1}^n (X_i - \bar X)^2 + \sum_{i=1}^n (X_i - \bar X)u_i$
Therefore,
$\hat\beta_1 = \frac{\beta_1 \sum (X_i - \bar X)^2 + \sum (X_i - \bar X)u_i}{\sum (X_i - \bar X)^2} = \beta_1 + \sum_{i=1}^n w_i u_i$
Taking conditional expectations,
$E(\hat\beta_1 | X_i, i=1,\dots,n) = \beta_1 + \sum w_i E(u_i | X_i, i=1,\dots,n) = \beta_1$
$E(\hat\beta_0 | X_i, i=1,\dots,n) = \beta_0 - E[(\hat\beta_1 - \beta_1)\bar X | X_i, i=1,\dots,n] + E[\bar u | X_i, i=1,\dots,n] = \beta_0$

Theorem. Variances and covariance of the coefficient estimators under homoskedasticity.
Under the assumptions listed above, the variances of the OLS estimators are given by
$\sigma^2_{\hat\beta_0} = \sigma_u^2 a_0$, $\quad \sigma^2_{\hat\beta_1} = \sigma_u^2 a_1$, $\quad \sigma_{\hat\beta_0,\hat\beta_1} = \text{cov}(\hat\beta_0, \hat\beta_1) = -\bar X\,\text{var}(\hat\beta_1) = -\sigma_u^2 \bar X a_1$
where
$a_0 = \frac{\sum_{i=1}^n X_i^2 / n}{\sum_{i=1}^n (X_i - \bar X)^2}$, $\quad a_1 = \frac{1}{\sum_{i=1}^n (X_i - \bar X)^2}$

Remarks:
1. The estimators become less precise (i.e., higher variances) when there is more uncertainty in the error term (i.e., a higher value of $\sigma_u^2$).
2. The estimators become more precise (i.e., lower variances) when the regressor X is more widely dispersed around its mean, i.e., when the denominator term is larger.
3. The estimators become more precise (i.e., lower variances) as the sample size n increases.
4. The variance of the intercept estimator increases if the values of the regressor X are far away from the origin 0.
5. $\hat\beta_0$ and $\hat\beta_1$ are negatively (positively) correlated if $\bar X$ is positive (negative), because the regression line must pass through the point of sample means $(\bar X, \bar Y)$.
6. The variances of the coefficient estimators are unknown because they involve $\sigma_u^2$, which is unknown. To compute their variances, we need an estimator of $\sigma_u^2$.

Least squares estimator of $\sigma_u^2$
The variances of the least squares coefficient estimators and their covariance cannot be computed from the given data because $\sigma_u^2$ is unknown. How do we estimate it?
- $\sigma_u^2$ is the variance of the error term: $\sigma_u^2 = \text{var}(u_i) = E[u_i - E(u_i)]^2 = E[u_i^2]$
- The expected value is the theoretical counterpart of the sample mean.
- If we had observed data on $u_i$, we could estimate the variance by the sample mean $\sum_{i=1}^n u_i^2 / n$.
- Since we do not have observed values of the error terms, we use their estimates $\hat u_i^2$.
- A problem is that not all $\hat u_i$ can take independent values: for example, $\sum_{i=1}^n \hat u_i = 0$. This means that if we have the values of the first n-1 estimated error terms, the last one is automatically determined. This is called the loss of degrees of freedom.
- The loss in degrees of freedom is equal to the number of parameters we estimate; in the simple linear regression model, it is two.
- When we add all the squared residuals, we are actually adding only n-2 independent values; the degrees of freedom are therefore n-2.
- So we estimate the variance by the average over the n-2 independent residuals:
$\hat\sigma_u^2 = \frac{\sum_{i=1}^n \hat u_i^2}{n-2}$
A simulation comparing this estimator with the naive average $\sum \hat u_i^2 / n$ is sketched below.
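A quick simulation sketch of why we divide by n-2 rather than n (the simulation settings are arbitrary): averaged over many samples, SSR/(n-2) centers on the true $\sigma_u^2$, while SSR/n is systematically too small.

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma_u = 1.0, 0.5, 2.0   # assumed true values for the simulation
n, reps = 10, 20_000                    # a small n makes the bias visible

naive, unbiased = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    y = beta0 + beta1 * x + rng.normal(0, sigma_u, size=n)
    xbar = x.mean()
    b1 = ((x - xbar) * (y - y.mean())).sum() / ((x - xbar) ** 2).sum()
    b0 = y.mean() - b1 * xbar
    ssr = ((y - b0 - b1 * x) ** 2).sum()
    naive[r] = ssr / n          # biased downward: residuals have only n-2 df
    unbiased[r] = ssr / (n - 2)

print("true sigma_u^2:", sigma_u**2)          # 4.0
print("mean of SSR/n:", naive.mean())         # about 3.2 = 4 * (n-2)/n
print("mean of SSR/(n-2):", unbiased.mean())  # about 4.0
```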
Estimated variances and covariance of the coefficient estimators are obtained by replacing the unknown $\sigma_u^2$ with its unbiased estimate $\hat\sigma_u^2$:
$\hat\sigma^2_{\hat\beta_0} = \hat\sigma_u^2 a_0$, $\quad \hat\sigma^2_{\hat\beta_1} = \hat\sigma_u^2 a_1$, $\quad \hat\sigma_{\hat\beta_0,\hat\beta_1} = -\bar X \hat\sigma^2_{\hat\beta_1}$

Remark: Goodness of fit $R^2$
The idea of the $R^2$ measure of goodness of fit was to compare the SSR in the models with and without information on the regressor, and to measure the fraction of the reduction in the SSR:
$R^2 = \frac{SSR_r - SSR_u}{SSR_r} = 1 - \frac{SSR_u}{SSR_r}$
Another idea is to see how close the predicted values $\hat Y_i$ are to the observed $Y_i$. The closeness is measured by the sample correlation coefficient, or its squared value:
$\text{corr}(Y_i, \hat Y_i) = \frac{\text{cov}(Y_i, \hat Y_i)}{SD(Y_i)\,SD(\hat Y_i)}$
$R^2 = [\text{corr}(Y_i, \hat Y_i)]^2 = \frac{[\text{cov}(Y_i, \hat Y_i)]^2}{\text{var}(Y_i)\,\text{var}(\hat Y_i)}$
To show that this last expression is the same as the previous expression for $R^2$, we first show that the residuals satisfy
$\sum_{i=1}^n \hat u_i = \sum_{i=1}^n (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = \sum_{i=1}^n (Y_i - \bar Y) - \hat\beta_1 \sum_{i=1}^n (X_i - \bar X) = 0 - 0 = 0$
(substituting $\hat\beta_0 = \bar Y - \hat\beta_1 \bar X$), and
$\sum_{i=1}^n X_i \hat u_i = \sum_{i=1}^n X_i (Y_i - \hat\beta_0 - \hat\beta_1 X_i) = 0$
which is the second first-order condition of the OLS derivation.
Noting $Y_i = \hat Y_i + \hat u_i$ and $\bar Y = \bar{\hat Y} + \bar{\hat u} = \bar{\hat Y}$ (the mean residual is zero), it is easy to show
$\text{cov}(Y_i, \hat Y_i) = \frac1n \sum (Y_i - \bar Y)(\hat Y_i - \bar{\hat Y}) = \frac1n \sum (\hat Y_i + \hat u_i - \bar{\hat Y})(\hat Y_i - \bar{\hat Y}) = \frac1n \sum (\hat Y_i - \bar{\hat Y})^2$
where the last equality holds because $\sum \hat u_i \hat Y_i = \hat\beta_0 \sum \hat u_i + \hat\beta_1 \sum \hat u_i X_i = 0$ and $\sum \hat u_i \bar{\hat Y} = 0$. This shows
$\text{var}(\hat Y_i) = \frac1n \sum (\hat Y_i - \bar{\hat Y})^2 = \text{cov}(Y_i, \hat Y_i)$
Therefore,
$R^2 = \frac{[\text{cov}(Y_i, \hat Y_i)]^2}{\text{var}(Y_i)\,\text{var}(\hat Y_i)} = \frac{\text{var}(\hat Y_i)}{\text{var}(Y_i)}$
Note that
$\text{var}(Y_i) = \frac1n \sum_{i=1}^n (Y_i - \bar Y)^2 = \frac{TSS}{n}$
Now we show that $n \times \text{var}(\hat Y_i) = TSS - SSR_u$:
$\sum (Y_i - \bar Y)^2 = \sum (Y_i - \hat Y_i + \hat Y_i - \bar Y)^2 = \sum (Y_i - \hat Y_i)^2 + \sum (\hat Y_i - \bar Y)^2 + 2 \sum (Y_i - \hat Y_i)(\hat Y_i - \bar Y) = \sum \hat u_i^2 + \sum (\hat Y_i - \bar Y)^2 + 2 \sum \hat u_i (\hat Y_i - \bar Y)$
The last term is zero:
$\sum \hat u_i (\hat Y_i - \bar Y) = \sum \hat u_i (\hat\beta_0 + \hat\beta_1 X_i - \bar Y) = (\hat\beta_0 - \bar Y) \sum \hat u_i + \hat\beta_1 \sum \hat u_i X_i = 0$
where the last equality uses $\sum \hat u_i = 0$ and $\sum \hat u_i X_i = 0$, shown before.
Putting all these results together, we have
$R^2 = \frac{[\text{cov}(Y_i, \hat Y_i)]^2}{\text{var}(Y_i)\,\text{var}(\hat Y_i)} = \frac{\text{var}(\hat Y_i)}{\text{var}(Y_i)} = \frac{SSR_r - SSR_u}{SSR_r} = 1 - \frac{SSR_u}{SSR_r}$
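Finally, a numerical check that the three expressions for $R^2$ coincide, on the same made-up data as the earlier sketches:

```python
import numpy as np

educ = np.array([8, 10, 11, 12, 12, 14, 16, 16, 18], dtype=float)
wage = np.array([3.2, 4.1, 4.9, 5.6, 5.1, 6.9, 8.3, 7.8, 9.5])

xbar, ybar = educ.mean(), wage.mean()
b1 = ((educ - xbar) * (wage - ybar)).sum() / ((educ - xbar) ** 2).sum()
b0 = ybar - b1 * xbar
yhat = b0 + b1 * educ
resid = wage - yhat

r2_ssr = 1 - (resid ** 2).sum() / ((wage - ybar) ** 2).sum()  # 1 - SSR/TSS
r2_var = yhat.var() / wage.var()                              # var(yhat)/var(y)
r2_corr = np.corrcoef(wage, yhat)[0, 1] ** 2                  # corr(y, yhat)^2
print(r2_ssr, r2_var, r2_corr)  # all three agree
```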