DmcKCh2 (notes on Davidson-MacKinnon, Chapter 2)
What is the difference between the numerical and statistical properties of the OLS estimator?
Numerical properties hold irrespective of how the data are collected. The geometry of vector
spaces is useful for studying them.
The scalar product of two vectors is x^T y, also denoted <x, y>. It is called a scalar product because
the ANSWER is a scalar.
The Euclidean length of a vector is the NORM of the vector.
In R, length() simply means the number of elements (do not confuse the two).
page 45
The vector x = (x1, x2) is drawn as the hypotenuse with those coordinates.
See eq. (2.07).
x and ax are always parallel vectors (a a scalar).
p. 47: the z vector is of Euclidean length 1, shown in the figure on page 48.
The coordinates of z must be (cos θ, sin θ), and eq. (2.06) holds.
w is the (1, 0) vector.
Eq. (2.08), the Cauchy-Schwarz inequality, comes from cos θ always lying between -1 and +1, applied to (2.07).
Linear dependence is an important concept for checking the collinearity of economic data. If
one column of data is obtained as a weighted sum of other columns, the data are
collinear, e.g. (2.14). Try this in Excel and R.
The file named historyr.txt has this in machine-readable form.
> X=matrix(c(1, 0, 1, 1, 4, 0,1,0,1,1,4,0,1,0,1),byrow=T,5,3)
# a comma is needed after each entry; without byrow=T the matrix is filled by column; dim is 5 by 3
> X
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 1 4 0
[3,] 1 0 1
[4,] 1 4 0
[5,] 1 0 1
>
> XtX=t(X)%*%X #this is matrix multiplication
> XtX
[,1] [,2] [,3]
[1,] 5 8 3
[2,] 8 32 0
[3,] 3 0 3
> xtxinv=solve(XtX)
Error in solve.default(XtX) : Lapack routine dgesv: system is exactly singular
> det(XtX)
[1] 0
Since the determinant is zero, XtX is a singular matrix, which is why solve() failed.
>
> eigen(XtX)
$values
[1] 3.421267e+01 5.787330e+00 1.475650e-15
$vectors
            [,1]       [,2]       [,3]
[1,] -0.26648790  0.6664351  0.6963106
[2,] -0.96349787 -0.2033933 -0.1740777
[3,] -0.02561344  0.7172834 -0.6963106
#note the smallest eigenvalue is near zero
# eigen is a powerful command to compute the eigenvalues and eigenvectors
Singular value decomposition is a more fundamental way of studying the X data in a
regression model.
svdx=svd(X) # decomposes X into 3 matrices; X is a T by p matrix
X = H Λ^{0.5} G^T.
H is T by p, like X. It holds the sample principal coordinates of X, standardized in the sense that
H^T H = I.
The middle matrix Λ^{0.5} is p by p and diagonal, returned as d in R software. It contains the singular
values of X. The eigenvalue-eigenvector decomposition is for X^T X, not for X itself:
it is given by X^T X = G Λ G^T. Thus the eigenvalues are the squares of the singular values.
The matrix G is p by p, containing columns g1 to gp. The g_i are direction-cosine vectors
which orient the i-th principal axis of X with respect to the given original axes of the X
data.
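A quick check of this relation in R, assuming X is still the 5 by 3 toy matrix defined above:
svd(X)$d^2                 # squared singular values of X
eigen(t(X) %*% X)$values   # eigenvalues of X^T X: the same numbers, up to rounding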
Define the condition number K# = max(singular value) / min(singular value).
If K# > 30, we have severe collinearity.
If the units are such that X^T X is a correlation matrix, then orthogonal regressors have
K# = 1, and K# exceeding p is considered an indication of collinear data.
Sometimes the units of measurement of different regressors are too disparate, with very
different standard deviations. It is advisable to have regressors measured in numerically
similar units. This may not always be possible, since some variables, like interest
rates, move in narrow bands while others, like GDP, are large. Still, it is possible to
measure GDP in billions or tens of billions to make its units closer to the other variables in the
model, for numerical reliability.
The cor(X) command in R computes the matrix of correlation coefficients, and one can
compute the svd of this matrix to assess collinearity. Now K# is the square root of the ratio
of the largest to the smallest eigenvalue. K# exceeding p, the column dimension of X, is an
indication of collinearity. For perfectly orthogonal regressors the correlation matrix is the
identity matrix, with K# = 1. But this almost never happens in econometrics.
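A minimal sketch of this check in R, using made-up, nearly collinear regressors (the variable names and data below are hypothetical; x3 is almost a linear combination of x1 and x2):
set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd = 0.01)   # nearly collinear third regressor
R  <- cor(cbind(x1, x2, x3))            # correlation matrix of the regressors
ev <- eigen(R)$values                   # its eigenvalues
sqrt(max(ev) / min(ev))                 # K#: far above p = 3, flagging collinearity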
y= X +u is the original model then with svd we have
y= X +u = H 0.5 GT +u = H  + u
Note that here we define  =0.5 GT so that the regressors are perfectly orthogonal (no
problem of collinearity?) Obviously Not!
near-collinearity is a serious problem which does not disappear with mere reparameterization of the problem. The simplest solution is to omit variables. More
advanced solution is ridge regression.
Consider the following parameterization (change of notation) with uncorrelated components:
y = Xβ + u = H Λ^{0.5} G^T β + u = X*γ + u, where X* = H Λ^{0.5} and where γ = G^T β.
y = X*γ + u is the model, so the OLS estimator is c = (X*^T X*)^{-1} X*^T y.
Since X* = H Λ^{0.5}, its transpose is Λ^{0.5} H^T (the transpose of a diagonal matrix is the same diagonal matrix).
X*^T X* is Λ^{0.5} H^T H Λ^{0.5} = Λ, since H^T H = I cancels.
The inverse of (X*^T X*) is simply the inverse of Λ = diag(λ_i).
c = (X*^T X*)^{-1} X*^T y
c simplifies to Λ^{-1} Λ^{0.5} H^T y, that is,
c = Λ^{-0.5} H^T y. WARNING! The power of Λ here is NEGATIVE 0.5.
What is the relation between c, as an estimate of γ, and the original regression coefficients β or
the OLS estimates b?
By definition, γ = G^T β.
Hence c = G^T b. Let us call c the uncorrelated components of b. Why? We need some
discussion before we can say why.
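A minimal numerical check in R that the two formulas for c agree; the data here are made up (full-rank, n = 20, p = 3):
set.seed(1)
X <- cbind(1, rnorm(20), rnorm(20))      # hypothetical full-rank regressor matrix
y <- X %*% c(2, 1, -1) + rnorm(20)
b  <- solve(t(X) %*% X, t(X) %*% y)      # OLS b
sv <- svd(X)                             # X = H diag(d) G^T, with d = sqrt(lambda_i)
c1 <- diag(1 / sv$d) %*% t(sv$u) %*% y   # c = Lambda^{-0.5} H^T y
c2 <- t(sv$v) %*% b                      # c = G^T b
cbind(c1, c2)                            # the two columns agree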
How good an estimator is depends on how close it is to the true values! Obvious, right?
The Euclidean distance between the estimator b and the true value β is ||b - β||; its square is the sum of squared
errors.
The sum of squared errors over all elements of the vector b is ||b - β||^2 = (b - β)^T (b - β), which is a
scalar (a 1 by 1 matrix).
The average of the squared error, when we allow for the fact that b is a random variable, is called the
mean squared error or MSE.
Averages are given by expected values. Hence the mean or average of the squared errors is given
by applying the expectation operator. We have MSE = E[(b - β)^T (b - β)].
It is desirable to minimize the MSE, since that means we are close to the truth (β).
For the model y = Xβ + u, the forecast of the y vector is given by Xb,
where b is the OLS vector of dimension p by 1.
What is the forecast error?
Xb - Xβ.
What is the sum of squared forecast errors?
||Xb - Xβ||^2 = (Xb - Xβ)^T (Xb - Xβ) = (b - β)^T (X^T X)(b - β).
Define the weighted squared error as (b - β)^T W (b - β), where W is a positive
semidefinite matrix; its expectation is the weighted MSE, WMSE = E[(b - β)^T W (b - β)].
Then the sum of squared forecast errors is this weighted squared error with W = X^T X as the weights.
It is desirable to minimize this WMSE, since that means our forecasts are close to the truth
(Xβ).
MSE and WMSE are better considered in a multivariate matrix setting instead of as
scalars. So we define
MtxMSE(b) = E[(b - β)(b - β)^T], where the transpose is on the second vector.
The scalar MSE is simply trace(MtxMSE), the sum of the diagonal elements of MtxMSE.
[Recall that the trace is a matrix operation.]
Let b* be some biased estimator of β, so that E(b*) does not equal β.
Now MtxMSE(b*) = E[(b* - β)(b* - β)^T].
We can add and subtract Eb* with no change in value:
b* - β = b* - β + Eb* - Eb* = (b* - Eb*) + (Eb* - β) = P1 + P2, as two parts.
Note that the second part, P2, is the bias vector of dimension p by 1.
The first part, P1, has the interesting property that E(P1) = E(b* - Eb*) = Eb* - Eb* = 0
(by definition of expectation).
The second part, P2, is not a random variable at all:
E(P2) = Eb* - β is still a bias vector, and it is nonzero for our biased estimator b*.
Now MtxMSE(b*) = E[(b* - β)(b* - β)^T] = E[(P1 + P2)(P1 + P2)^T] has 4 terms:
= E(P1 P1^T) + E(P1 P2^T) + E(P2 P1^T) + E(P2 P2^T).
We study the 4 terms one by one.
Consider the first of the 4 terms:
E(P1 P1^T) = E[(b* - Eb*)(b* - Eb*)^T] = the variance-covariance matrix of b*.
If x is a random vector with Ex = 0 and a is a constant vector, then E(x a^T) = 0 and E(a x^T) = 0. We apply this
to show that the expectations of the second and third terms are zero: a is like the bias vector P2 and
x is like b* - Eb* here, with zero expectation.
Now only the last, positive semidefinite term of the four remains, so
MtxMSE(b*) = Var(b*) + (bias)(bias)^T.
So no generality is lost in considering MtxMSE, and instead of lumping all regression
coefficients into one scalar we are looking more closely at them individually.
The OLS estimator is unbiased. Hence Eb = β,
hence b - β = b - Eb, and the
mean squared error matrix MtxMSE(b) is simply its variance-covariance matrix:
MtxMSE(b) = σ^2 (X^T X)^{-1}.
The scalar MSE(b) = σ^2 trace[(X^T X)^{-1}], and (X^T X)^{-1} = G Λ^{-1} G^T.
trace(ABC) = trace(CAB) always holds (the trace is invariant to cyclic permutations),
hence
scalar MSE(b) = σ^2 trace[G Λ^{-1} G^T] = σ^2 trace[G^T G Λ^{-1}] = σ^2 trace[Λ^{-1}], which is proportional to
the sum of reciprocals of the eigenvalues of X^T X.
When the data are near-collinear, the smallest eigenvalue is close to zero, so its reciprocal is
very large, so the sum of reciprocals of all eigenvalues of X^T X, including the last one, is also very
large. So the variance of each regression coefficient is inflated by collinearity.
What does collinearity do? It increases MSE(b), which means b is far away in Euclidean distance
from the true vector β, which means we have garbage estimates.
How do we pinpoint the culprit in collinearity?
Now we turn to showing that the c are indeed uncorrelated components of b.
Recall that, by definition, γ = G^T β, and c = G^T b is the OLS estimator of γ.
What about the variance-covariance matrix of c?
V(c) = σ^2 (X*^T X*)^{-1} = σ^2 Λ^{-1}. Since the off-diagonal elements (the covariances) are all zero,
we call them uncorrelated components. Recall c = Λ^{-0.5} H^T y.
The variance of c1, the first element of c, is σ^2 / λ_1.
If the eigenvalues are ordered from the largest to the smallest, the variance of c_p is the culprit
in collinearity, with the largest variance σ^2 / λ_p.
Note that the sensible thing to do is to de-emphasize, down-weight, or shrink the last
uncorrelated component of c, namely c_p, the culprit. In principal components regression
the last component is simply deleted (weight = 0); in ridge regression the weights are
progressively reduced with reference to a biasing parameter k.
In the usual testing of significance, coefficients are zero under the null hypothesis, in the
absence of evidence to the contrary. The shrinkage methods shrink toward zero as a
conservative choice, rather than making an explicit commitment to some value of the estimator
assumed known a priori.
The ridge estimator is defined as a family of estimators parameterized by the biasing
parameter k > 0: b_k = (X'X + kI)^{-1} X'y, denoting transpose by a prime when convenient.
A large number of choices of k in [0, ∞) is possible. Each choice gives a new ridge
estimator. This is why we denote it with the subscript k. If k = 0 we have OLS; if k = ∞ you are
dividing by infinity, making the ratio equal to zero, or multiplying by zero. It is like deleting
the coefficient. This is commonly done for collinearity. But here we are making all
coefficients exactly zero if we choose k infinitely large.
The motivation behind adding a constant k is to improve the conditioning of the matrix
being inverted. Verify that the condition number
K# = sqrt(max(λ_i) / min(λ_i)) becomes Kridge# = sqrt(max(λ_i + k) / min(λ_i + k)),
which dramatically changes from a large number, possibly exceeding 30, to a reasonable number.
For example, if max(λ_i) = 9 and min(λ_i) = 0.01, then K# = sqrt(9/0.01) = 30, which means
collinearity exists. Now, choosing a rather small biasing parameter k = 0.1,
Kridge# = sqrt(9.1/(0.01 + 0.1)) = 9.095453.
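The same arithmetic in R (the eigenvalues are just the hypothetical numbers from the example above):
lambda <- c(9, 1, 0.01)                   # hypothetical eigenvalues of X'X
k <- 0.1
sqrt(max(lambda) / min(lambda))           # 30: severe collinearity
sqrt(max(lambda + k) / min(lambda + k))   # about 9.1: much better conditioned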
What does ridge regression really do to b, the OLS regression coefficients?
To see this, let us do the eigenvalue-eigenvector decomposition X'X = G Λ G', where G is a
matrix of eigenvectors (G is an orthogonal matrix, G'G = I = GG', that is, its inverse equals its
transpose) and Λ is a diagonal matrix of ordered eigenvalues, diag(λ_i).
Also do the svd of X as X = H Λ^{0.5} G'. Now substituting these into the formula
b_k = (X'X + kI)^{-1} X'y, we have
b_k = (G Λ G' + kI)^{-1} [H Λ^{0.5} G']' y = (G Λ G' + kI)^{-1} G Λ^{0.5} H'y.
Since G is orthogonal, G'G = I = GG' holds, so we may replace kI by kGG':
b_k = (G Λ G' + kGG')^{-1} G Λ^{0.5} H'y.
Recall that Λ = diag(λ_i) are the eigenvalues of X'X in decreasing order.
In the above formula Λ is raised to the power +0.5, but in the formula for c it is raised to the power
-0.5 (recall the warning above). Hence let us add 1 and subtract 0.5 in the exponent:
since Λ^{0.5} = Λ Λ^{-0.5}, we can replace Λ^{0.5} by Λ Λ^{-0.5}.
Peel off G on the left and G' on the right inside the ( … ) above to write
b_k = G (Λ + kI)^{-1} G'G Λ^{0.5} H'y
    = G (Λ + kI)^{-1} Λ^{0.5} H'y,
using G'G = I.
Now, replacing Λ^{0.5} by Λ Λ^{-0.5},
b_k = G (Λ + kI)^{-1} Λ Λ^{-0.5} H'y. Now define a diagonal matrix of shrinkage factors
Δ = diag(δ_i), such that δ_i = λ_i / (λ_i + k), from the matrix multiplication (Λ + kI)^{-1} Λ. Then
b_k = G Δ Λ^{-0.5} H'y = G Δ c (recalling the definition c = Λ^{-0.5} H'y).
Now, by definition, γ = G'β, and hence, replacing β by the OLS estimate b, we have the
OLS estimator c of γ as simply G'b. Thus c = G'b always holds.
This shows that we are shrinking the i-th uncorrelated component c_i by δ_i.
So far we have b_k = G Δ c. Now substitute c = G'b to yield the ridge family of
estimators as b_k = G Δ G' b.
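A minimal sketch in R verifying that b_k = G Δ G' b matches the direct ridge formula; the data, dimensions, and k below are all hypothetical:
set.seed(2)
X <- cbind(1, rnorm(25), rnorm(25))
y <- rnorm(25)
k <- 0.5
bk_direct <- solve(t(X) %*% X + k * diag(3), t(X) %*% y)   # (X'X + kI)^{-1} X'y
eg  <- eigen(t(X) %*% X)                                   # X'X = G Lambda G'
G   <- eg$vectors; lam <- eg$values
b   <- solve(t(X) %*% X, t(X) %*% y)                       # OLS b
Delta <- diag(lam / (lam + k))                             # shrinkage factors delta_i
bk_eigen <- G %*% Delta %*% t(G) %*% b                     # G Delta G' b
cbind(bk_direct, bk_eigen)                                 # identical up to rounding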
Since i eigenvalues are declining, i=i / [i+k] are also declining (k>0 constant).
This has profound implications.
The weights i are declining. Smallest weight is on the last cp and highest weight is on
the first c1, which is eminently sensible. The component with highest variance is given
the smallest weight.
This makes collinearity go away, removes wrong signs and makes the coefficients not so
sensitive to minor perturbations in data sometimes beyond the rounding digits.
See the bottom of page 49.
In general, the two-dimensional space of the page is spanned by two basis vectors: (x1, x2) =
[1, 0] for the horizontal basis vector and (x1, x2) = [0, 1] for the vertical basis vector of the
Cartesian coordinate system. Note that the x1 coordinate is anything and the x2 coordinate is
zero for all points along the horizontal axis. That is exactly what the horizontal axis means.
Data on income and education would be two vectors in this space. They too can span the same
space. In fact, any two vectors which are not parallel to each other can span the space of the
page. See the figure on page 56, panel b.
Eq. (2.9) defines the column space spanned by the data on the regressors, such as income and
education, denoted S(X).
E^n is 3-dimensional when consumption, income, and education are the n = 3 dimensions of the
Euclidean space. (The higher the income/education, the higher the consumption?)
What is the orthogonal complement of S(X), the column space of the regressors? It is S┴(X).
First find what remains of the 3-dimensional space after the two dimensions for
income and education are taken into account. The equation on page 50 defines the orthogonal (i.e.
perpendicular) complement. All vectors in that orthogonal space are perpendicular to the two-
dimensional regression plane. The figures on page 56 clarify that if we drop a perpendicular
from the y vector (the data on consumption) onto the regression plane (spanned by income and
education), the residual vector u^ is orthogonal to the regression plane. Thus u^ lies in the
orthogonal complement of S(X), denoted S┴(X).
Recall the law of iterated expectations on page 32. (2.15) is an orthogonality condition in the sense that the
vector u^ is orthogonal to each regressor's data x_i. Hence it is orthogonal to every linear
combination of the regressors. Since Xβ^, the vector of fitted values, is such a linear combination, it belongs to the
column space of X.
The data on y = consumption form a vector in n-dimensional space; so do the data on x1 = own
income and x2 = parents' income. x1 and x2 span the column space of the regressors (it is a
plane). Now y rises up from that space. If u^ is to be orthogonal to the plane, it must be
the vertical vector obtained by dropping a perpendicular from y onto the plane. The shortest
distance to the plane is along the perpendicular (shortest = least, as in least squares). See in panel b
of Fig. 2.11 that the fitted-value vector is one of the vectors in the plane of the regressors.
Panel c gives the Pythagoras theorem in (2.17).
Closed-book quiz problem: write (2.17) in sum-of-squares notation.
Projection: maps a point in E^n into a point in a subspace.
Invariant: leaves all points in that subspace unchanged.
An orthogonal projection maps any point into the point of the subspace that is closest to it.
Premultiplying by a projection matrix carries out the projection;
if a point is already in the invariant subspace, it is mapped into itself.
Orthogonal projections formalize the notion of dropping a perpendicular!
The hat matrix P_X and M_X = (I - hat matrix) are two such projections.
For any y, P_X y lies in the column space of X because P_X y = Xb, a linear combination of the
columns of X. It is in fact the vector of fitted values of y.
The image of M_X = (I - hat matrix) is the orthogonal complement of the column space of X.
Applying M_X to y gives the vector of regression residuals.
P_X and M_X annihilate each other, eq. (2.25).
Since the two complementary projections are also symmetric, the spaces they project
onto are orthogonal.
p. 60, lines 1-3. See page 50; (2.26) is just Pythagoras again.
P_X (the hat matrix) and M_X are convenient and important for theory, but they should not be
used in the computation of regressions or residuals, for obvious reasons: they are n by n, too
big.
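Still, for a tiny made-up data set the projections can be formed explicitly and compared with lm(); a minimal sketch (n = 10, everything below is hypothetical):
set.seed(3)
X <- cbind(1, rnorm(10))
y <- rnorm(10)
P <- X %*% solve(t(X) %*% X) %*% t(X)                    # hat matrix P_X
M <- diag(10) - P                                        # M_X
fit <- lm(y ~ X - 1)                                     # the same regression via lm
all.equal(as.vector(P %*% y), as.vector(fitted(fit)))    # TRUE: P_X y gives the fitted values
all.equal(as.vector(M %*% y), as.vector(resid(fit)))     # TRUE: M_X y gives the residuals
max(abs(P %*% M))                                        # essentially zero: P_X and M_X annihilate each other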
Some questions to ponder and try in R:
1) What happens when we post-multiply X by a matrix A, where A is nonsingular? This is
like a change of scale for the regressors.
2) What is the hat matrix for XA?
The column space of X equals the column space of XA; hence the hat matrix for X is the same as the hat matrix for XA.
3) What is the reversal rule for inverses? For transposes? The order reverses: (AB)^{-1} = B^{-1} A^{-1} and (AB)^T = B^T A^T.
Why do we care about the X versus XA business? In regression, y = Xβ + u and the reparameterized model with
regressors XA should give exactly the same fitted values and residuals, but not the same b: b simply becomes A^{-1} b.
Change of units from Fahrenheit to Celsius
p. 62
Fht = 32 ι + (9/5) Cels.
Verify using R that the residuals and fitted values do not change with this change of
variables; a sketch is given below.
(2.29) is the relation between the coefficients in the Celsius scale and the coefficients in the Fahrenheit scale.
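A minimal sketch of this verification, with made-up temperature data (all numbers below are hypothetical):
set.seed(4)
cels <- runif(20, 0, 30)
y    <- 5 + 0.3 * cels + rnorm(20)
fht  <- 32 + (9/5) * cels                 # same information, Fahrenheit units
m_c <- lm(y ~ cels)
m_f <- lm(y ~ fht)
all.equal(fitted(m_c), fitted(m_f))       # TRUE: fitted values unchanged
all.equal(resid(m_c), resid(m_f))         # TRUE: residuals unchanged
coef(m_c); coef(m_f)                      # the slope rescales by 5/9 and the intercept adjusts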
4) What if we use log X instead of X?
THIS IS NOT LIKE MULTIPLICATION BY NONSINGULAR MATRIX A
The FWL theorem says that (2.33) and (2.40) give identical coefficients and residuals.
(2.33) says y = X1 β1 + X2 β2 + u, with two groups of regressors.
(2.40) says M1 y = M1 X2 β2 + residuals, where M1 = I - P1 and P1 is the hat matrix for X1.
The hat matrix for ι alone is an n by n matrix all of whose elements are 1/n (here 5 by 5 with elements 1/5 = 0.2).
(2.31)
The projection (I - hat matrix for ι) = M_ι is also known as the centering projection; M_ι y gives the deviations of y from its mean ȳ.
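A tiny check of the centering projection in R (the numbers are made up):
yv <- c(3, 1, 4, 1, 5)
n  <- length(yv)
Mi <- diag(n) - matrix(1/n, n, n)   # I minus the hat matrix for iota
as.vector(Mi %*% yv)                # same as ...
yv - mean(yv)                       # ... the deviations from the mean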
p. 64
The figure shows that adding a constant does not change the slope.
p. 65
(2.32): two groups of regressors; the first group has ι and the second has z.
If we work with deviations from the mean for both y and z, the slope coefficient does not
change, the residuals do not change, and the fitted values do not change.
If ι is one of the regressors, the residuals being orthogonal to ι means that ι^T u^ = sum
of the residuals = zero.
Even if ι does not explicitly occur in the list of regressors, if the dummy variables
sum to one for each data point, then again the sum of the residuals u^ equals zero.
Eq. (2.33) has X1 and X2 as two groups of regressors.
X is partitioned; instead of the subscript X1, just use 1 as the subscript.
The projections P1 and M1 are likewise defined for X1.
p. 66, last 3 lines:
the column space of X contains all the columns of X1, so
the projection P_X applied to the columns of X1 does not change them at all.
Hence (2.35) is true in general.
p. 67
Similar to the figure on page 64: what happens if we add X1 A to X2?
The projection M1 applied to X2 is similar to the centering projection operation.
(2.38) and (2.39) must yield the same β2. Why?
In (2.38), X1 is sort of like the ι in (2.32), and M1 X2 is like z in (2.32).
It is interesting that (2.33), with both X1 and X2, also gives the same β2, but not the same residuals.
Why?
We need to change the left side of (2.39) from y to M1 y, as in (2.40).
p. 68, last 2 lines: M1 annihilates X1 and M_X annihilates X2.
p. 69, the line before (2.45).
p. 70: Applications of the FWL theorem. Quarterly data are common, so use seasonal dummies. (2.46)
defines the dummies, and (2.47) shows why there is a linear dependence (the four dummies sum to ι).
Which s you drop does not matter, but the interpretation will be different; (2.48) is the drop-ι case.
p. 71: retain the constant and drop s1.
The s′ variables are defined as differences from s4: always subtract s4
(3 lines below (2.49)).
(2.51) is like the FWL decomposition into the two groups X1 and X2.
p. 72: we can even define a residual-creating projection M_S for the seasonals. This projection is like
seasonal adjustment. If the seasonal adjustment is made in this particular way, we get
exactly the same results for the important slope coefficients. (2.51) uses unadjusted data;
(2.52) uses seasonally adjusted data.
The projection not only de-seasonalizes but also centers the data.
Trick to avoid centering: use only s1 to s3.
To seasonally adjust or not? FWL says go ahead: either way you get the same results! (A sketch in R follows.)
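A minimal sketch of this FWL claim in R, with made-up quarterly data (the data-generating numbers below are hypothetical):
set.seed(5)
n <- 40
q <- factor(rep(1:4, n / 4))                       # quarter labels
x <- rnorm(n)
y <- 1 + 0.5 * x + rep(c(0, 0.3, -0.2, 0.1), n / 4) + rnorm(n)
S  <- model.matrix(~ q)                            # intercept plus three seasonal dummies
MS <- diag(n) - S %*% solve(t(S) %*% S) %*% t(S)   # the 'seasonal adjustment' projection M_S
ys <- as.vector(MS %*% y)                          # deseasonalized (and centered) y
xs <- as.vector(MS %*% x)                          # deseasonalized (and centered) x
coef(lm(y ~ q + x))["x"]                           # slope on x from the unadjusted regression
coef(lm(ys ~ xs - 1))                              # the same slope from the adjusted data (FWL)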
p. 73: De-trending as a second application of the FWL theorem.
Make the time trend orthogonal to the intercept ι. Then we have two possibilities, depending
on whether n = number of data points is even or odd!
The time variable for the odd case looks like the equation in the middle of p. 73.
FWL says that the β estimates and residuals are exactly the same whether we work with
the detrended data or the original data, in the cases consistent with the FWL specification.
To detrend or not? FWL says go ahead: either way you get the same results! (See the sketch below.)
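The same kind of check for detrending, again with hypothetical data:
set.seed(6)
n  <- 21
tt <- 1:n                          # time trend
x  <- rnorm(n)
y  <- 2 + 0.1 * tt + 0.5 * x + rnorm(n)
x_dt <- residuals(lm(x ~ tt))      # x with the constant and trend projected out
y_dt <- residuals(lm(y ~ tt))      # y with the constant and trend projected out
coef(lm(y ~ tt + x))["x"]          # slope on x in the full regression
coef(lm(y_dt ~ x_dt - 1))          # identical slope from the detrended data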
Goodness of fit
p. 74: the uncentered R^2 is worse than the centered R^2 on page 75. Why?
The uncentered version is sensitive to changes in units.
Just by adding a constant to y, the uncentered R^2 goes up. This is bad news for regressions where the
intercept is absent. See the example in R below.
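A minimal sketch of this effect in R, with made-up data; the uncentered R^2 is computed by hand as 1 minus the residual sum of squares over the raw sum of squares of y:
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)
R2_uncentered <- function(y, x) { f <- lm(y ~ x); 1 - sum(resid(f)^2) / sum(y^2) }
R2_uncentered(y, x)
R2_uncentered(y + 100, x)              # jumps toward 1, although the fit is no better
summary(lm(y ~ x))$r.squared           # centered R^2 ...
y_shift <- y + 100
summary(lm(y_shift ~ x))$r.squared     # ... is unchanged by adding a constant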
p. 75, last lines:
in general, if least squares is not used, R^2 is not reliable in some cases.
p. 76
Each element of b is a weighted linear combination of the elements of y.
c_i is the i-th row of (X'X)^{-1} X'.
It measures the effect of the i-th observation on the regression (is it influential?).
The figure shows how a high-leverage point pulls the regression line toward itself.
p. 77
The x coordinate of an outlying point decides whether the point has high leverage;
the y coordinate of an outlying point decides whether it is influential in changing things.
The real issue: are the outliers (influential points) correct data? If there is nothing wrong with
the data, we may have to accept the facts as they are and not grumble about them.
LEVERAGE: what if I omit the t-th observation?
Define the e_t vector (a unit basis vector, a column of the identity matrix; together the e_t span the Euclidean space).
(2.56) studies what happens when we include e_t as a regressor.
Seems silly to have a regressor that picks out a single observation, right?
p. 78
The projection M based on the e_t vector amounts to deleting the t-th observation.
The jackknife is leave-one-out (Tukey's term).
(2.56) and the equation before (2.61) apply the FWL theorem: same β.
(2.60) measures the effect of dropping the t-th observation on the predicted y (the left side); it is known once we
know α (the coefficient on e_t), the hat matrix of X, and the e_t vector from the identity matrix (on the RHS of (2.60)).
Getting from (2.60) ultimately to (2.63) is our goal.
(2.62) says what α should be, nicely known from the hat-matrix diagonals and the t-th residual.
p. 79: we are essentially pre-multiplying (2.60) by the generalized inverse of X to get rid of
the X on the left side of (2.60).
(2.63) has a lot of insight:
it depends on both h_t and u^t.
h_t, from the diagonal of the hat matrix, is a measure of LEVERAGE.
Not every high-leverage point is INFLUENTIAL.
The HAT matrix DIAGONALS have special information in them.
(2.65) says:
the lower bound on h_t, the diagonal elements of the hat matrix, is zero;
if an intercept is present, the lower bound is 1/n
(the projection matrix for ι has diagonal elements 1/n, last line of p. 79;
ι is in the column space of X, first line of p. 80).
We also know that the average of the h_t values is k/n:
(2.68) shows that the trace of the hat matrix is k if there are k columns in X.
When is it a balanced design? If the h_t values are close to their average k/n. (A small check in R follows.)
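A small check with made-up data (n = 50, one regressor plus an intercept, so k = 2):
set.seed(8)
n <- 50
x <- rnorm(n)
fit <- lm(rnorm(n) ~ x)
h <- hatvalues(fit)      # the diagonal elements h_t of the hat matrix
range(h)                 # all between 1/n and 1 (an intercept is present)
sum(h)                   # equals k = 2, the trace of the hat matrix
mean(h)                  # the average leverage k/n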
p. 81
Some observations, far away from the center, have more leverage than others. In the
figure, N(0,1) is used for x.
What was the advantage of the geometric viewpoint?
The fact that some matrices are idempotent becomes quite clear as soon as one
understands the notion of orthogonal projections.
Many exercises can be checked in the R language.
In R there is a package called perturb;
it gives lots of leverage and influence diagnostics.