Econometrics I
Professor William Greene, Stern School of Business, Department of Economics

Part 3: Least Squares Algebra

Vocabulary
Some terms to be used in the discussion:
  Population characteristics and entities vs. sample quantities and analogs
  Residuals and disturbances
  Population regression line and sample regression line
Objective: learn about the conditional mean function; estimate β and σ².
First step: the mechanics of fitting a line (hyperplane) to a set of data.

Fitting Criteria
The set of points in the sample. Fitting criteria: what are they? Least absolute deviations (LAD), least squares, and so on. Why least squares? A fundamental result: sample moments are "good" estimators of their population counterparts. We will spend the next few weeks using this principle and applying it to least squares computation.

An Analogy Principle for Estimating β
In the population: E[y|X] = Xβ, so E[y - Xβ | X] = 0 and E[xi εi] = 0 for each i.
Summing, and exchanging Σi and E[·]:
  Σi E[xi εi] = E[Σi xi εi] = E[X'(y - Xβ)] = E[X'ε] = 0.
Choose b, the estimator of β, to mimic this population result, i.e., mimic the population mean with the sample mean. Find b such that
  (1/N) X'e = (1/N) X'(y - Xb) = 0.
As we will see, the solution is the least squares coefficient vector.

Population and Sample Moments
We showed that E[εi|xi] = 0 and Cov[xi, εi] = 0. If so, and if E[y|X] = Xβ, then
  β = (Var[xi])^-1 Cov[xi, yi].
This provides a population analog to the statistics we compute with the data.

U.S. Gasoline Market, 1960-1995 (data plot)

Least Squares Example
The example will be a regression of Gi on xi = [1, PGi, Yi].
Fitting criterion: the fitted equation will be ŷi = b1 xi1 + b2 xi2 + ... + bK xiK.
The criterion is based on the residuals: ei = yi - b1 xi1 - b2 xi2 - ... - bK xiK.
Make the ei as small as possible: form a criterion and minimize it.

Fitting Criteria
  Sum of residuals: Σi=1..N ei
  Sum of squares: Σi=1..N ei²
  Sum of absolute values of residuals: Σi=1..N |ei|
  Absolute value of the sum of residuals: |Σi=1..N ei|
We focus on Σi ei² now and Σi |ei| later.

Least Squares Algebra
  Σi=1..N ei² = e'e = (y - Xb)'(y - Xb)
A digression on multivariate calculus: matrix and vector derivatives; the derivative of a scalar with respect to a vector; the derivative of a column vector with respect to a row vector; other derivatives.

Least Squares Normal Equations
  ∂ Σi=1..N (yi - xi'b)² / ∂b = ∂(y - Xb)'(y - Xb) / ∂b = -2X'(y - Xb) = 0
Dimensions: ∂(1x1)/∂(Kx1): (-2)(NxK)'(Nx1) = (-2)(KxN)(Nx1) = Kx1. Note: the derivative of a 1x1 with respect to a Kx1 is a Kx1 vector.
Solution: -2X'(y - Xb) = 0 implies X'y = X'Xb.

Least Squares Solution
Assuming it exists: b = (X'X)^-1 X'y.
Note the analogy: β = [Var(x)]^-1 Cov(x,y) and b = [(1/N)X'X]^-1 [(1/N)X'y].
This suggests something desirable about least squares.

Second Order Conditions
Necessary condition: first derivatives equal zero:
  ∂(y - Xb)'(y - Xb)/∂b = -2X'(y - Xb)   (a Kx1 column vector)
Sufficient condition: second derivatives:
  ∂²(y - Xb)'(y - Xb)/∂b∂b' = 2X'X   (a KxK matrix)

Does b Minimize e'e?
  ∂²e'e/∂b∂b' = 2X'X = 2 [ Σi xi1²       Σi xi1 xi2   ...  Σi xi1 xiK
                           Σi xi2 xi1    Σi xi2²      ...  Σi xi2 xiK
                           ...           ...          ...  ...
                           Σi xiK xi1    Σi xiK xi2   ...  Σi xiK²    ]
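The normal equations and their solution can be checked numerically. Below is a minimal numpy sketch, not part of the slides, using simulated data (all names and values are illustrative): it solves X'Xb = X'y, verifies that the residuals are orthogonal to X, and confirms that 2X'X is positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: N observations, K regressors including a constant term.
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta = np.array([1.0, 0.5, -2.0])
y = X @ beta + rng.normal(size=N)

# Normal equations: X'Xb = X'y, so b = (X'X)^-1 X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

print(b)                                        # least squares coefficients
print(X.T @ e)                                  # X'e = 0 up to rounding
print(np.linalg.eigvalsh(2 * X.T @ X).min() > 0)  # 2X'X positive definite
```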
If there were a single b, we would require this second derivative to be positive, which it would be: 2x'x = 2 Σi=1..N xi² > 0. The matrix counterpart of a positive number is a positive definite matrix.

Sample Moments - Algebra
X'X is the KxK matrix of sums of squares and cross products of the regressors:
  X'X = [ Σi xi1²      Σi xi1 xi2   ...  Σi xi1 xiK
          Σi xi2 xi1   Σi xi2²      ...  Σi xi2 xiK
          ...          ...          ...  ...
          Σi xiK xi1   Σi xiK xi2   ...  Σi xiK²    ]  =  Σi=1..N xi xi',
where xi is the Kx1 vector of regressor values for observation i.

Positive Definite Matrix
Matrix C is positive definite if a'Ca > 0 for any nonzero a. Generally this is hard to check; it requires a look at characteristic roots (later in the course). For some matrices it is easy to verify, and X'X is one of these:
  a'X'Xa = (a'X')(Xa) = (Xa)'(Xa) = v'v = Σk vk² ≥ 0.
Could v = 0? v = 0 means Xa = 0 for a ≠ 0, which is not possible when X has full column rank.
Conclusion: b = (X'X)^-1 X'y does indeed minimize e'e.

Algebraic Results - 1
In the population: E[X'ε] = 0.  In the sample: (1/N) Σi=1..N xi ei = 0.

Residuals vs. Disturbances
Disturbances (population): yi = xi'β + εi. Partitioning y: y = E[y|X] + ε = conditional mean + disturbance.
Residuals (sample): yi = xi'b + ei. Partitioning y: y = Xb + e = projection + residual.
(Note: the projection is into the column space of X, i.e., the set of linear combinations of the columns of X; Xb is one of these.)

Algebraic Results - 2
The "residual maker": M = I - X(X'X)^-1 X'.
  e = y - Xb = y - X(X'X)^-1 X'y = My.
My = the residuals that result when y is regressed on X.
MX = 0. (This result is fundamental!) How do we interpret it in terms of residuals? When a column of X is regressed on all of X, we get a perfect fit and zero residuals.
(Therefore) My = MXb + Me = Me = e. (You should be able to prove this.)
y = Py + My, where P = X(X'X)^-1 X' = I - M, and PM = MP = 0. Py is the projection of y into the column space of X.

The M Matrix
M = I - X(X'X)^-1 X' is an NxN matrix.
M is symmetric: M = M'.
M is idempotent: MM = M (just multiply it out).
M is singular: M^-1 does not exist. (We will prove this later as a side result in another derivation.)

Results when X Contains a Constant Term
X = [1, x2, ..., xK]: the first column of X is a column of ones. Since X'e = 0, x1'e = 0, so the residuals sum to zero.
y = Xb + e. Define i = [1,1,...,1]', a column of N ones, so i'y = Σi=1..N yi = N·ȳ.
i'y = i'Xb + i'e = i'Xb implies (after dividing by N) that ȳ = x̄'b: the regression line passes through the means.
These results do not apply if the model has no constant term.

Dummy Variable for One Observation
A dummy variable that isolates a single observation. What does this do? Define d to be the dummy variable in question and Z = all the other regressors, so X = [Z, d]. In the multiple regression of y on X we know that X'e = 0, where e is the column vector of residuals. That means d'e = 0, which says that ej = 0 for that particular observation: the observation is predicted perfectly. A fairly important result; important to know.

(Figure slides: Least Squares Algebra; Least Squares; Residuals; Least Squares Residuals. Note the peculiar pattern in the residuals.)
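Before turning to the numerical issues below, the algebraic results above (MX = 0, My = e, the residuals summing to zero, and the dummy-variable result) can be verified directly. This is a minimal sketch, again on simulated data of no particular meaning; the observation index used for the dummy is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=N)

# Residual maker M and projection matrix P.
P = X @ np.linalg.solve(X.T @ X, X.T)     # X(X'X)^-1 X'
M = np.eye(N) - P
e = M @ y                                  # residuals from regressing y on X

print(np.allclose(M @ X, 0))               # MX = 0
print(np.allclose(M @ M, M))               # M is idempotent
print(np.allclose(M, M.T))                 # M is symmetric
print(np.isclose(e.sum(), 0))              # residuals sum to zero (X has a constant)

# Dummy variable isolating one observation: that residual is exactly zero.
d = np.zeros(N)
d[7] = 1.0                                 # observation 7 chosen arbitrarily
Xd = np.column_stack([X, d])
ed = y - Xd @ np.linalg.solve(Xd.T @ Xd, Xd.T @ y)
print(np.isclose(ed[7], 0))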
Least Squares Algebra - 3
  e = [I - X(X'X)^-1 X'] y = My.
M is NxN: potentially huge.

Least Squares Algebra - 4
MX = 0 in the algebra, but it is not identically zero in a digital computer: rounding error.

Econometrics I, Part 3.1 - Regression Algebra and Fit

The Fit of the Regression
"Variation": in the context of the "model" we speak of covariation of a variable as movement of the variable, usually associated with (not necessarily caused by) movement of another variable.
  Total variation = Σi=1..n (yi - ȳ)² = y'M0y,
where M0 = I - i(i'i)^-1 i' is the M matrix for X = a column of ones.

Decomposing the Variation
  yi = xi'b + ei
  yi - ȳ = xi'b - x̄'b + ei = (xi - x̄)'b + ei
  Σi=1..N (yi - ȳ)² = Σi=1..N [(xi - x̄)'b]² + Σi=1..N ei²   (the sum of cross products is zero)
Total variation = regression variation + residual variation. Recall the decomposition
  Var[y] = Var[E[y|x]] + E[Var[y|x]]
         = variation of the conditional mean around the overall mean
           + variation around the conditional mean function.

Decomposing the Variation of Vector y
Decomposition (this all assumes the model contains a constant term; one of the columns in X is i):
  y = Xb + e, so M0y = M0Xb + M0e = M0Xb + e.   (Deviations from means. Why is M0e = e?)
  y'M0y = b'(X'M0)(M0X)b + e'e = b'X'M0Xb + e'e.   (M0 is idempotent and e'M0X = e'X = 0.)
Total sum of squares = regression sum of squares (SSR) + residual sum of squares (SSE).

The Sum of Squared Residuals
b minimizes e'e = (y - Xb)'(y - Xb). Algebraic equivalences at the solution b = (X'X)^-1 X'y:
  e'e = y'e   (why? e' = y' - b'X' and X'e = 0, which is the F.O.C. for least squares)
  e'e = y'y - y'Xb = y'y - b'X'y = e'y, since e'X = 0 (or e'y = y'e).

A Fit Measure
y = Xb + e; M0y = M0Xb + M0e, with M0e = e. Then y'M0y = b'X'M0Xb + e'e, so
  1 = b'X'M0Xb / y'M0y + e'e / y'M0y
  R² = b'X'M0Xb / y'M0y = Regression Variation / Total Variation = 1 - e'e / Σi=1..N (yi - ȳ)².
R² is bounded by zero and one if and only if (a) there is a constant term in X and (b) the line is computed by linear least squares.

Minimizing e'e
Any other coefficient vector has a larger sum of squares. A quick proof: let d be a vector not equal to b, and let
  u = y - Xd = y - Xb + Xb - Xd = e - X(d - b).
Then u'u = (y - Xd)'(y - Xd) = [e - X(d - b)]'[e - X(d - b)]. Expanding, and using X'e = 0 so the cross products vanish,
  u'u = e'e + (d - b)'X'X(d - b) = e'e + v'v > e'e.

Dropping a Variable
An important special case. Suppose bX,z = [b, c] are the regression coefficients in a regression of y on [X, z], and bX = [d, 0] is the same, but computed to force the coefficient on z to equal 0. This removes z from the regression. We are comparing the results that we get with and without the variable z in the equation. Results we can show: dropping a variable (or variables) cannot improve the fit, that is, it cannot reduce the sum of squared residuals; adding a variable (or variables) cannot degrade the fit, that is, it cannot increase the sum of squared residuals.

Adding a Variable Never Increases the Sum of Squares
Theorem 3.5 in the text (page 38). Let u be the residual in the regression of y on [X, z] and e the residual in the regression of y on X alone. Then
  u'u = e'e - c²(z*'z*) ≤ e'e,   where z* = MXz and c is the coefficient on z in the long regression.
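Theorem 3.5 is easy to confirm numerically. The sketch below is illustrative only (simulated data, a small helper function of my own): it runs the short and long regressions, forms z* = MXz, and checks u'u = e'e - c²(z*'z*).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 80
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
z = rng.normal(size=N)
y = X @ np.array([1.0, 0.5, -2.0]) + 0.3 * z + rng.normal(size=N)

def ols(A, y):
    """Return OLS coefficients and residuals for a regression of y on A."""
    b = np.linalg.solve(A.T @ A, A.T @ y)
    return b, y - A @ b

_, e = ols(X, y)                              # short regression: y on X
b_long, u = ols(np.column_stack([X, z]), y)   # long regression: y on [X, z]
c = b_long[-1]                                # coefficient on z
_, z_star = ols(X, z)                         # z* = M_X z

print(np.isclose(u @ u, e @ e - c**2 * (z_star @ z_star)))  # Theorem 3.5
print(u @ u <= e @ e)                                        # adding z never degrades the fit
```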
Adding Variables
R² never falls when a variable z is added to the regression. A useful general result: R² with both X and z equals R² with only X, plus the increase in fit due to z after X is accounted for:
  R²(X,z) = R²(X) + [1 - R²(X)] r*²(y,z|X),
where r*²(y,z|X) is the squared partial correlation between y and z, controlling for X.

Adding Variables to a Model
What is the effect of adding PN, PD, PS? (See the regressions below.)

Comparing Fits of Regressions
Make sure the denominator in R² is the same, i.e., the same left hand side variable. Example: linear vs. loglinear. The loglinear model will almost always appear to fit better because taking logs reduces variation.

Adjusted R Squared
Adjusted R² (adjusted for degrees of freedom?):
  Adjusted R² = 1 - [(n-1)/(n-K)](1 - R²)
The "degrees of freedom" adjustment suggests something about "unbiasedness," but the ratio is not unbiased. Adjusted R² includes a penalty for variables that do not add much fit, and it can fall when a variable is added to the equation.

Adjusted R²
What is being adjusted? The penalty for using up degrees of freedom.
  Adjusted R² = 1 - [e'e/(n - K)] / [y'M0y/(n - 1)]
uses the ratio of two 'unbiased' estimators. Is the ratio unbiased? Equivalently,
  Adjusted R² = 1 - [(n-1)/(n-K)](1 - R²).
Will adjusted R² rise when a variable is added to the regression? Adjusted R² is higher with z than without z if and only if the t ratio on z, when z is added to the regression, is larger than one in absolute value.
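The two gasoline-market regressions that follow illustrate this rule: R² rises slightly when PD is added, while adjusted R² falls because the t ratio on PD is well below one. The rule itself can be checked directly; below is a minimal numpy sketch on simulated data (the nearly irrelevant regressor z and the helper function are illustrative), using conventional OLS standard errors for the t ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
z = rng.normal(size=n)                      # nearly irrelevant regressor
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

def fit(A, y):
    """Return R-squared, adjusted R-squared, and the t ratio on the last column of A."""
    K = A.shape[1]
    b = np.linalg.solve(A.T @ A, A.T @ y)
    e = y - A @ b
    tss = ((y - y.mean()) ** 2).sum()
    r2 = 1 - e @ e / tss
    adj = 1 - (n - 1) / (n - K) * (1 - r2)
    s2 = e @ e / (n - K)
    se = np.sqrt(s2 * np.linalg.inv(A.T @ A).diagonal())
    return r2, adj, b[-1] / se[-1]

r2_X, adj_X, _ = fit(X, y)
r2_Xz, adj_Xz, t_z = fit(np.column_stack([X, z]), y)

print(r2_Xz >= r2_X)                           # R-squared never falls
print((adj_Xz > adj_X) == (abs(t_z) > 1.0))    # adjusted R-squared rises iff |t| > 1
```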
Full Regression (Without PD)
Ordinary least squares regression
LHS=G       Mean                =  226.09444
            Standard deviation  =   50.59182
            Number of observs.  =         36
Model size  Parameters          =          9
            Degrees of freedom  =         27
Residuals   Sum of squares      =  596.68995
            Standard error of e =    4.70102
Fit         R-squared           =     .99334  <**********
            Adjusted R-squared  =     .99137  <**********
Model test  F[ 8, 27] (prob)    = 503.3 (.0000)
--------+---------------------------------------------------------------
Variable| Coefficient   Standard Error   t-ratio   P[|T|>t]   Mean of X
--------+---------------------------------------------------------------
Constant|  -8220.38**      3629.309       -2.265     .0317
PG      |   -26.8313***       5.76403     -4.655     .0001      2.31661
Y       |      .02214***       .00711      3.116     .0043    9232.86
PNC     |    36.2027          21.54563     1.680     .1044      1.67078
PUC     |    -6.23235          5.01098    -1.244     .2243      2.34364
PPT     |     9.35681          8.94549     1.046     .3048      2.74486
PN      |    53.5879*         30.61384     1.750     .0914      2.08511
PS      |   -65.4897***       23.58819    -2.776     .0099      2.36898
YEAR    |     4.18510**        1.87283     2.235     .0339   1977.50
--------+---------------------------------------------------------------

PD added to the model: R² rises, adjusted R² falls.
Ordinary least squares regression
LHS=G       Mean                =  226.09444
            Standard deviation  =   50.59182
            Number of observs.  =         36
Model size  Parameters          =         10
            Degrees of freedom  =         26
Residuals   Sum of squares      =  594.54206
            Standard error of e =    4.78195
Fit         R-squared           =     .99336   (was .99334)
            Adjusted R-squared  =     .99107   (was .99137)
--------+---------------------------------------------------------------
Variable| Coefficient   Standard Error   t-ratio   P[|T|>t]   Mean of X
--------+---------------------------------------------------------------
Constant|  -7916.51**      3822.602       -2.071     .0484
PG      |   -26.8077***       5.86376     -4.572     .0001      2.31661
Y       |      .02231***       .00725      3.077     .0049    9232.86
PNC     |    30.0618          29.69543     1.012     .3207      1.67078
PUC     |    -7.44699          6.45668    -1.153     .2592      2.34364
PPT     |     9.05542          9.15246      .989     .3316      2.74486
PD      |    11.8023          38.50913      .306     .7617      1.65056   (note the low t ratio)
PN      |    47.3306          37.23680     1.271     .2150      2.08511
PS      |   -60.6202**        28.77798    -2.106     .0450      2.36898
YEAR    |     4.02861*         1.97231     2.043     .0514   1977.50
--------+---------------------------------------------------------------

y | x1, x2, ..., x1000: what is the right model?
  Adjusted R² = 1 - [e'e/(n - K)] / [y'M0y/(n - 1)] = 1 - (estimated σε²)/(estimated σy²)
Information criteria:
  log-likelihood = -(n/2)[1 + log 2π + log(e'e/n)]
  Akaike IC  = -2 logL + 2K
  Bayesian IC = -2 logL + K log n
"Cross validation" (training samples and test samples).

Econometrics I, Part 3.2 - Transformed Data

(Linearly) Transformed Data
How does a linear transformation affect the results of least squares? Let Z = XP for a KxK nonsingular P.
Based on X, b = (X'X)^-1 X'y. You can show (just multiply it out) that the coefficients when y is regressed on Z are c = P^-1 b.
The "fitted value" is Zc = XPP^-1 b = Xb. The same!
The residuals from using Z are y - Zc = y - Xb (we just proved this). The same!
The sum of squared residuals must be identical, as y - Xb = e = y - Zc.
R² must also be identical, as R² = 1 - e'e/y'M0y.

Understanding Linear Transformation
Xb is the projection of y into the column space of X. Zc is the projection of y into the column space of Z. But since the columns of Z are just linear combinations of those of X, the column space of Z must be identical to that of X. Therefore, the projection of y into the former must be the same as the latter, which produces the other results.
What are the practical implications of this result? Transformation does not affect the fit of a model to a body of data. Transformation does affect the "estimates." If b is an estimate of something (β), then c cannot be an estimate of β; it must be an estimate of P^-1 β, which might have no meaning at all.

ECONOMETRIC 911
"I have a simple question for you. Yesterday, I was estimating a regional production function with yearly dummies. The coefficients of the dummies are usually interpreted as a measure of technical change with respect to the base year (the excluded dummy variable). However, I felt that it could be more interesting to redefine the dummy variables in such a way that the coefficients measure technical change from one year to the next. You could get the same result by subtracting two coefficients in the original regression, but you would have to compute the standard error of the difference if you want to do inference. Is this a well known procedure?"

Production Function for 247 Dairy Farms: 1993-1998. (results slide)

A model of the probability that an individual will visit the doctor at least once in the survey year. (results slide)
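Returning to the linear-transformation result in Part 3.2, which is also what is at work in the "ECONOMETRIC 911" question (re-defining the yearly dummies is a nonsingular transformation Z = XP, so the fit is unchanged and the new coefficients are P^-1 times the old ones), here is a minimal numpy sketch on simulated data. The random P is almost surely nonsingular; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=N)

P = rng.normal(size=(K, K))                    # some nonsingular KxK matrix
Z = X @ P                                      # linearly transformed regressors

b = np.linalg.solve(X.T @ X, X.T @ y)          # regression of y on X
c = np.linalg.solve(Z.T @ Z, Z.T @ y)          # regression of y on Z

print(np.allclose(c, np.linalg.solve(P, b)))   # c = P^-1 b
print(np.allclose(Z @ c, X @ b))               # identical fitted values
print(np.allclose(y - Z @ c, y - X @ b))       # identical residuals, e'e, and R2
```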
Reducing the Dimensionality of X
Principal components: Z = XC.
Why do we do this?
  Fewer columns than X
  Includes as much 'variation' of X as possible
  Columns of Z are orthogonal
Uses: collinearity; combining variables of ambiguous identity, such as test scores as measures of 'ability.'
How do we do this? Later in the course; it requires some further results from matrix algebra.

What is a Principal Component?
X = a data matrix (deviations from means). z = Xp = a linear combination of the columns of X. Choose p to maximize the variation of z. How? p = the eigenvector that corresponds to the largest eigenvalue of X'X.

Movie Regression. Opening Week Box for 62 Films
Ordinary least squares regression
LHS=LOGBOX  Mean                =   16.47993
            Standard deviation  =    .9429722
            Number of observs.  =          62
Residuals   Sum of squares      =   20.54972
            Standard error of e =    .6475971
Fit         R-squared           =    .6211405
            Adjusted R-squared  =    .5283586
--------+--------------+----------------+--------+--------+----------
Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X
--------+--------------+----------------+--------+--------+----------
Constant|  12.5388***       .98766        12.695    .0000
LOGBUDGT|    .23193          .18346        1.264    .2122    3.71468
STARPOWR|    .00175          .01303         .135    .8935   18.0316
SEQUEL  |    .43480          .29668        1.466    .1492     .14516
MPRATING|   -.26265*         .14179       -1.852    .0700    2.96774
ACTION  |   -.83091***       .29297       -2.836    .0066     .22581
COMEDY  |   -.03344          .23626        -.142    .8880     .32258
ANIMATED|   -.82655**        .38407       -2.152    .0363     .09677
HORROR  |    .33094          .36318         .911    .3666     .09677
4 INTERNET BUZZ VARIABLES
LOGADCT |    .29451**        .13146        2.240    .0296    8.16947
LOGCMSON|    .05950          .12633         .471    .6397    3.60648
LOGFNDGO|    .02322          .11460         .203    .8403    5.95764
CNTWAIT3|   2.59489***       .90981        2.852    .0063     .48242
--------+------------------------------------------------------------

Ordinary least squares regression
LHS=LOGBOX  Mean                =   16.47993
            Standard deviation  =    .9429722
            Number of observs.  =          62
Residuals   Sum of squares      =   25.36721
            Standard error of e =    .6984489
Fit         R-squared           =    .5323241
            Adjusted R-squared  =    .4513802
--------+--------------+----------------+--------+--------+----------
Variable| Coefficient  | Standard Error |t-ratio |P[|T|>t]| Mean of X
--------+--------------+----------------+--------+--------+----------
Constant|  11.9602***       .91818        13.026    .0000
LOGBUDGT|    .38159**        .18711        2.039    .0465    3.71468
STARPOWR|    .01303          .01315         .991    .3263   18.0316
SEQUEL  |    .33147          .28492        1.163    .2500     .14516
MPRATING|   -.21185          .13975       -1.516    .1356    2.96774
ACTION  |   -.81404**        .30760       -2.646    .0107     .22581
COMEDY  |    .04048          .25367         .160    .8738     .32258
ANIMATED|   -.80183*         .40776       -1.966    .0546     .09677
HORROR  |    .47454          .38629        1.228    .2248     .09677
PCBUZZ  |    .39704***       .08575        4.630    .0000    9.19362
--------+------------------------------------------------------------
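The second movie regression replaces the four internet buzz variables with PCBUZZ, which the discussion above suggests is their first principal component. The construction on the "What is a Principal Component?" slide is easy to sketch; the code below is illustrative only, on simulated correlated columns rather than the actual buzz data.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 62
# Four correlated columns standing in for the buzz variables (purely illustrative).
F = rng.normal(size=(N, 1))
X = F + 0.5 * rng.normal(size=(N, 4))

Xd = X - X.mean(axis=0)                      # deviations from means
vals, vecs = np.linalg.eigh(Xd.T @ Xd)       # eigendecomposition of X'X (symmetric)
p = vecs[:, np.argmax(vals)]                 # eigenvector for the largest eigenvalue
z = Xd @ p                                   # first principal component

# z has the largest variance among normalized linear combinations of the columns.
a = rng.normal(size=4)
a /= np.linalg.norm(a)
print(z.var() >= (Xd @ a).var())
```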