DmcKCh2 (notes on Davidson-MacKinnon, Chapter 2)
What is the difference between the numerical and statistical properties of the OLS estimator?
Numerical properties hold irrespective of how the data are collected. The geometry of vector
spaces is useful for studying them.
The scalar product of two vectors is x^T y, also denoted <x, y>. It is called a scalar product because
the ANSWER is a scalar.
The Euclidean length of a vector is the NORM of the vector.
In R, length() simply means the number of elements (do not confuse the two).
page 45
The vector x = (x1, x2) is drawn as the hypotenuse with those coordinates.
See eq. (2.07).
x and ax are always parallel vectors (a a scalar).
p. 47: the z vector is of Euclidean length 1, shown in the figure on page 48.
The coordinates of z must be (cos θ, sin θ), and eq. (2.06) holds.
w is the (1, 0) vector.
Eq. (2.08), the Cauchy-Schwarz inequality, comes from cos θ always lying between -1 and +1, applied to (2.07).
Linear dependence is an important concept for checking the collinearity of economic data. If
one column of data is obtained as a weighted sum of other columns, the data are
collinear, e.g. (2.14). Try this in Excel and R.
The file named historyr.txt has this in machine-readable form.
> X=matrix(c(1, 0, 1, 1, 4, 0,1,0,1,1,4,0,1,0,1),byrow=T,5,3)
# a comma is needed after each entry; without byrow=T the matrix is filled by column; dim is 5 by 3
> X
[,1] [,2] [,3]
[1,] 1 0 1
[2,] 1 4 0
[3,] 1 0 1
[4,] 1 4 0
[5,] 1 0 1
>
> XtX=t(X)%*%X #this is matrix multiplication
> XtX
[,1] [,2] [,3]
[1,] 5 8 3
[2,] 8 32 0
[3,] 3 0 3
> xtxinv=solve(XtX)
Error in solve.default(XtX) : Lapack routine dgesv: system is exactly singular
> det(XtX)
[1] 0
Since the determinant is zero, XtX is a singular matrix, which is why solve() failed.
>
> eigen(XtX)
$values
[1] 3.421267e+01 5.787330e+00 1.475650e-15
$vectors
            [,1]       [,2]       [,3]
[1,] -0.26648790  0.6664351  0.6963106
[2,] -0.96349787 -0.2033933 -0.1740777
[3,] -0.02561344  0.7172834 -0.6963106
#note the smallest eigenvalue is near zero
# eigen is a powerful command to compute the eigenvalues and eigenvectors
Singular value decomposition is a more fundamental way of studying the X data in a
regression model.
svdx=svd(X) # decomposes X into 3 matrices; X is a T by p matrix
X = H Λ^{0.5} G^T.
H is T by p, like X. It holds the sample principal coordinates of X, standardized in the sense that
H^T H = I.
The middle matrix Λ^{0.5} is p by p and diagonal, returned as d in R software. It contains the singular
values of X. The eigenvalue-eigenvector decomposition is for X^T X, not for X itself:
it is given by X^T X = G Λ G^T. Thus the eigenvalues are the squares of the singular values.
The matrix G is p by p, containing columns g1 to gp. The g_i are direction-cosine vectors
which orient the i-th principal axis of X with respect to the given original axes of the X
data.
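A quick check of this relation in R, assuming X is still the 5 by 3 toy matrix defined above:
svd(X)$d^2                 # squared singular values of X
eigen(t(X) %*% X)$values   # eigenvalues of X^T X: the same numbers, up to rounding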
Define the condition number K# = max(singular value) / min(singular value).
If K# > 30, we have severe collinearity.
If the units are such that X^T X is a correlation matrix, then orthogonal regressors have
K# = 1, and K# exceeding p is considered an indication of collinear data.
Sometimes the units of measurement of different regressors are too disparate, with very
different standard deviations. It is advisable to have regressors measured in numerically
similar units. This may not always be possible, since some variables, like interest
rates, move in narrow bands while others, like GDP, are large. Still, it is possible to
measure GDP in billions or tens of billions to make its units closer to the other variables in the
model, for numerical reliability.
The cor(X) command in R computes the matrix of correlation coefficients, and one can
compute the svd of this matrix to assess collinearity. Now K# is the square root of the ratio
of the largest to the smallest eigenvalue. K# exceeding p, the column dimension of X, is an
indication of collinearity. For perfectly orthogonal regressors the correlation matrix is the
identity matrix, with K# = 1. But this almost never happens in econometrics.
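A minimal sketch of this check in R, using made-up, nearly collinear regressors (the variable names and data below are hypothetical; x3 is almost a linear combination of x1 and x2):
set.seed(42)
x1 <- rnorm(100); x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd = 0.01)   # nearly collinear third regressor
R  <- cor(cbind(x1, x2, x3))            # correlation matrix of the regressors
ev <- eigen(R)$values                   # its eigenvalues
sqrt(max(ev) / min(ev))                 # K#: far above p = 3, flagging collinearity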
y= X +u is the original model then with svd we have
y= X +u = H 0.5 GT +u = H  + u
Note that here we define  =0.5 GT so that the regressors are perfectly orthogonal (no
problem of collinearity?) Obviously Not!
near-collinearity is a serious problem which does not disappear with mere reparameterization of the problem. The simplest solution is to omit variables. More
advanced solution is ridge regression.
Consider the following parameterization (change of notation) with uncorrelated components:
y = Xβ + u = H Λ^{0.5} G^T β + u = X*γ + u, where X* = H Λ^{0.5} and where γ = G^T β.
y = X*γ + u is the model, so the OLS estimator is c = (X*^T X*)^{-1} X*^T y.
Since X* = H Λ^{0.5}, its transpose is Λ^{0.5} H^T (the transpose of a diagonal matrix is the same diagonal matrix).
X*^T X* is Λ^{0.5} H^T H Λ^{0.5} = Λ, since H^T H = I cancels.
The inverse of (X*^T X*) is simply the inverse of Λ = diag(λ_i).
c = (X*^T X*)^{-1} X*^T y
c simplifies to Λ^{-1} Λ^{0.5} H^T y, that is,
c = Λ^{-0.5} H^T y. WARNING! The power of Λ here is NEGATIVE 0.5.
What is the relation between c, as an estimate of γ, and the original regression coefficients β or
the OLS estimates b?
By definition, γ = G^T β.
Hence c = G^T b. Let us call c the uncorrelated components of b. Why? We need some
discussion before we can say why.
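A minimal numerical check in R that the two formulas for c agree; the data here are made up (full-rank, n = 20, p = 3):
set.seed(1)
X <- cbind(1, rnorm(20), rnorm(20))      # hypothetical full-rank regressor matrix
y <- X %*% c(2, 1, -1) + rnorm(20)
b  <- solve(t(X) %*% X, t(X) %*% y)      # OLS b
sv <- svd(X)                             # X = H diag(d) G^T, with d = sqrt(lambda_i)
c1 <- diag(1 / sv$d) %*% t(sv$u) %*% y   # c = Lambda^{-0.5} H^T y
c2 <- t(sv$v) %*% b                      # c = G^T b
cbind(c1, c2)                            # the two columns agree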
How good an estimator is depends on how close it is to the true values! Obvious, right?
The Euclidean distance between the estimator b and the true value β is ||b - β||; its square is the sum of squared
errors.
The sum of squared errors over all elements of the vector b is ||b - β||^2 = (b - β)^T (b - β), which is a
scalar (a 1 by 1 matrix).
The average of the squared error, when we allow for the fact that b is a random variable, is called the
mean squared error or MSE.
Averages are given by expected values. Hence the mean or average of the squared errors is given
by applying the expectation operator. We have MSE = E[(b - β)^T (b - β)].
It is desirable to minimize the MSE, since that means we are close to the truth (β).
For the model y = Xβ + u, the forecast of the y vector is given by Xb,
where b is the OLS vector of dimension p by 1.
What is the forecast error?
Xb - Xβ.
What is the sum of squared forecast errors?
||Xb - Xβ||^2 = (Xb - Xβ)^T (Xb - Xβ) = (b - β)^T (X^T X)(b - β).
Define the weighted squared error as (b - β)^T W (b - β), where W is a positive
semidefinite matrix; its expectation is the weighted MSE, WMSE = E[(b - β)^T W (b - β)].
Then the sum of squared forecast errors is this weighted squared error with W = X^T X as the weights.
It is desirable to minimize this WMSE, since that means our forecasts are close to the truth
(Xβ).
MSE and WMSE are better considered in a multivariate matrix setting instead of as
scalars. So we define
MtxMSE(b) = E[(b - β)(b - β)^T], where the transpose is on the second vector.
The scalar MSE is simply trace(MtxMSE), the sum of the diagonal elements of MtxMSE.
[Recall that the trace is a matrix operation.]
Let b* be some biased estimator of β, so that E(b*) does not equal β.
Now MtxMSE(b*) = E[(b* - β)(b* - β)^T].
We can add and subtract Eb* with no change in value:
b* - β = b* - β + Eb* - Eb* = (b* - Eb*) + (Eb* - β) = P1 + P2, as two parts.
Note that the second part, P2, is the bias vector of dimension p by 1.
The first part, P1, has the interesting property that E(P1) = E(b* - Eb*) = Eb* - Eb* = 0
(by definition of expectation).
The second part, P2, is not a random variable at all:
E(P2) = Eb* - β is still a bias vector, and it is nonzero for our biased estimator b*.
Now MtxMSE(b*) = E[(b* - β)(b* - β)^T] = E[(P1 + P2)(P1 + P2)^T] has 4 terms:
= E(P1 P1^T) + E(P1 P2^T) + E(P2 P1^T) + E(P2 P2^T).
We study the 4 terms one by one.
Consider the first of the 4 terms:
E(P1 P1^T) = E[(b* - Eb*)(b* - Eb*)^T] = the variance-covariance matrix of b*.
If x is a random vector with Ex = 0 and a is a constant vector, then E(x a^T) = 0 and E(a x^T) = 0. We apply this
to show that the expectations of the second and third terms are zero: a is like the bias vector P2 and
x is like b* - Eb* here, with zero expectation.
Now only the last, positive semidefinite term of the four remains, so
MtxMSE(b*) = Var(b*) + (bias)(bias)^T.
So no generality is lost in considering MtxMSE, and instead of lumping all regression
coefficients into one scalar we are looking more closely at them individually.
The OLS estimator is unbiased. Hence Eb = β,
hence b - β = b - Eb, and the
mean squared error matrix MtxMSE(b) is simply its variance-covariance matrix:
MtxMSE(b) = σ^2 (X^T X)^{-1}.
The scalar MSE(b) = σ^2 trace[(X^T X)^{-1}], and (X^T X)^{-1} = G Λ^{-1} G^T.
trace(ABC) = trace(CAB) always holds (the trace is invariant to cyclic permutations),
hence
scalar MSE(b) = σ^2 trace[G Λ^{-1} G^T] = σ^2 trace[G^T G Λ^{-1}] = σ^2 trace[Λ^{-1}], which is proportional to
the sum of reciprocals of the eigenvalues of X^T X.
When the data are near-collinear, the smallest eigenvalue is close to zero, so its reciprocal is
very large, so the sum of reciprocals of all eigenvalues of X^T X, including the last one, is also very
large. So the variance of each regression coefficient is inflated by collinearity.
What does collinearity do? It increases MSE(b), which means b is far away in Euclidean distance
from the true vector β, which means we have garbage estimates.
How do we pinpoint the culprit in collinearity?
Now we turn to showing that the c are indeed uncorrelated components of b.
Recall that, by definition, γ = G^T β, and c = G^T b is the OLS estimator of γ.
What about the variance-covariance matrix of c?
V(c) = σ^2 (X*^T X*)^{-1} = σ^2 Λ^{-1}. Since the off-diagonal elements (the covariances) are all zero,
we call them uncorrelated components. Recall c = Λ^{-0.5} H^T y.
The variance of c1, the first element of c, is σ^2 / λ_1.
If the eigenvalues are ordered from the largest to the smallest, the variance of c_p is the culprit
in collinearity, with the largest variance σ^2 / λ_p.
Note that the sensible thing to do is to de-emphasize, down-weight, or shrink the last
uncorrelated component of c, namely c_p, the culprit. In principal components regression
the last component is simply deleted (weight = 0); in ridge regression the weights are
progressively reduced with reference to a biasing parameter k.
In the usual testing of significance, coefficients are zero under the null hypothesis, in the
absence of evidence to the contrary. The shrinkage methods shrink toward zero as a
conservative choice, rather than making an explicit commitment to some value of the estimator
assumed known a priori.
The ridge estimator is defined as a family of estimators parameterized by the biasing
parameter k > 0: b_k = (X'X + kI)^{-1} X'y, denoting transpose by a prime when convenient.
A large number of choices of k in [0, ∞) is possible. Each choice gives a new ridge
estimator. This is why we denote it with the subscript k. If k = 0 we have OLS; if k = ∞ you are
dividing by infinity, making the ratio equal to zero, or multiplying by zero. It is like deleting
the coefficient. This is commonly done for collinearity. But here we are making all
coefficients exactly zero if we choose k infinitely large.
The motivation behind adding a constant k is to improve the conditioning of the matrix
being inverted. Verify that the condition number
K# = sqrt(max(λ_i) / min(λ_i)) becomes Kridge# = sqrt(max(λ_i + k) / min(λ_i + k)),
which dramatically changes from a large number, possibly exceeding 30, to a reasonable number.
For example, if max(λ_i) = 9 and min(λ_i) = 0.01, then K# = sqrt(9/0.01) = 30, which means
collinearity exists. Now, choosing a rather small biasing parameter k = 0.1,
Kridge# = sqrt(9.1/(0.01 + 0.1)) = 9.095453.
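The same arithmetic in R (the eigenvalues are just the hypothetical numbers from the example above):
lambda <- c(9, 1, 0.01)                   # hypothetical eigenvalues of X'X
k <- 0.1
sqrt(max(lambda) / min(lambda))           # 30: severe collinearity
sqrt(max(lambda + k) / min(lambda + k))   # about 9.1: much better conditioned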
What does ridge regression really do to b, the OLS regression coefficients?
To see this, let us do the eigenvalue-eigenvector decomposition X'X = G Λ G', where G is a
matrix of eigenvectors (G is an orthogonal matrix, G'G = I = GG', that is, its inverse equals its
transpose) and Λ is a diagonal matrix of ordered eigenvalues, diag(λ_i).
Also do the svd of X as X = H Λ^{0.5} G'. Now substituting these into the formula
b_k = (X'X + kI)^{-1} X'y, we have
b_k = (G Λ G' + kI)^{-1} [H Λ^{0.5} G']' y = (G Λ G' + kI)^{-1} G Λ^{0.5} H'y.
Since G is orthogonal, G'G = I = GG' holds, so we may replace kI by kGG':
b_k = (G Λ G' + kGG')^{-1} G Λ^{0.5} H'y.
Recall that Λ = diag(λ_i) are the eigenvalues of X'X in decreasing order.
In the above formula Λ is raised to the power +0.5, but in the formula for c it is raised to the power
-0.5 (recall the warning above). Hence let us add 1 and subtract 0.5 in the exponent:
since Λ^{0.5} = Λ Λ^{-0.5}, we can replace Λ^{0.5} by Λ Λ^{-0.5}.
Peel off G on the left and G' on the right inside the ( … ) above to write
b_k = G (Λ + kI)^{-1} G'G Λ^{0.5} H'y
    = G (Λ + kI)^{-1} Λ^{0.5} H'y,
using G'G = I.
Now, replacing Λ^{0.5} by Λ Λ^{-0.5},
b_k = G (Λ + kI)^{-1} Λ Λ^{-0.5} H'y. Now define a diagonal matrix of shrinkage factors
Δ = diag(δ_i), such that δ_i = λ_i / (λ_i + k), from the matrix multiplication (Λ + kI)^{-1} Λ. Then
b_k = G Δ Λ^{-0.5} H'y = G Δ c (recalling the definition c = Λ^{-0.5} H'y).
Now, by definition, γ = G'β, and hence, replacing β by the OLS estimate b, we have the
OLS estimator c of γ as simply G'b. Thus c = G'b always holds.
This shows that we are shrinking the i-th uncorrelated component c_i by δ_i.
So far we have b_k = G Δ c. Now substitute c = G'b to yield the ridge family of
estimators as b_k = G Δ G' b.
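A minimal sketch in R verifying that b_k = G Δ G' b matches the direct ridge formula; the data, dimensions, and k below are all hypothetical:
set.seed(2)
X <- cbind(1, rnorm(25), rnorm(25))
y <- rnorm(25)
k <- 0.5
bk_direct <- solve(t(X) %*% X + k * diag(3), t(X) %*% y)   # (X'X + kI)^{-1} X'y
eg  <- eigen(t(X) %*% X)                                   # X'X = G Lambda G'
G   <- eg$vectors; lam <- eg$values
b   <- solve(t(X) %*% X, t(X) %*% y)                       # OLS b
Delta <- diag(lam / (lam + k))                             # shrinkage factors delta_i
bk_eigen <- G %*% Delta %*% t(G) %*% b                     # G Delta G' b
cbind(bk_direct, bk_eigen)                                 # identical up to rounding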
Since i eigenvalues are declining, i=i / [i+k] are also declining (k>0 constant).
This has profound implications.
The weights i are declining. Smallest weight is on the last cp and highest weight is on
the first c1, which is eminently sensible. The component with highest variance is given
the smallest weight.
This makes collinearity go away, removes wrong signs and makes the coefficients not so
sensitive to minor perturbations in data sometimes beyond the rounding digits.
See the bottom of page 49.
In general, the two-dimensional space of the page is spanned by two basis vectors: (x1, x2) =
[1, 0] for the horizontal basis vector and (x1, x2) = [0, 1] for the vertical basis vector of the
Cartesian coordinate system. Note that the x1 coordinate is anything and the x2 coordinate is
zero for all points along the horizontal axis. That is exactly what the horizontal axis means.
Data on income and education would be two vectors in this space. They too can span the same
space. In fact, any two vectors which are not parallel to each other can span the space of the
page. See the figure on page 56, panel b.
Eq. (2.9) defines the column space spanned by the data on the regressors, such as income and
education, denoted S(X).
E^n is 3-dimensional when consumption, income, and education are the n = 3 dimensions of the
Euclidean space. (The higher the income/education, the higher the consumption?)
What is the orthogonal complement of S(X), the column space of the regressors? It is S┴(X).
First find what remains of the 3-dimensional space after the two dimensions for
income and education are taken into account. The equation on page 50 defines the orthogonal (i.e.
perpendicular) complement. All vectors in that orthogonal space are perpendicular to the two-
dimensional regression plane. The figures on page 56 clarify that if we drop a perpendicular
from the y vector (the data on consumption) onto the regression plane (spanned by income and
education), the residual vector u^ is orthogonal to the regression plane. Thus u^ lies in the
orthogonal complement of S(X), denoted S┴(X).
Recall the law of iterated expectations on page 32. (2.15) is an orthogonality condition in the sense that the
vector u^ is orthogonal to each regressor's data x_i. Hence it is orthogonal to every linear
combination of the regressors. Since Xβ^, the vector of fitted values, is such a linear combination, it belongs to the
column space of X.
The data on y = consumption form a vector in n-dimensional space; so do the data on x1 = own
income and x2 = parents' income. x1 and x2 span the column space of the regressors (it is a
plane). Now y rises up from that space. If u^ is to be orthogonal to the plane, it must be
the vertical vector obtained by dropping a perpendicular from y onto the plane. The shortest
distance to the plane is along the perpendicular (shortest = least, as in least squares). See in panel b
of Fig. 2.11 that the fitted-value vector is one of the vectors in the plane of the regressors.
Panel c gives the Pythagoras theorem in (2.17).
Closed-book quiz problem: write (2.17) in sum-of-squares notation.
Projection: maps a point in E^n into a point in a subspace.
Invariant: leaves all points in that subspace unchanged.
An orthogonal projection maps any point into the point of the subspace that is closest to it.
Premultiplying by a projection matrix carries out the projection;
if a point is already in the invariant subspace, it is mapped into itself.
Orthogonal projections formalize the notion of dropping a perpendicular!
The hat matrix P_X and M_X = (I - hat matrix) are two such projections.
For any y, P_X y lies in the column space of X because P_X y = Xb, a linear combination of the
columns of X. It is in fact the vector of fitted values of y.
The image of M_X = (I - hat matrix) is the orthogonal complement of the column space of X.
Applying M_X to y gives the vector of regression residuals.
P_X and M_X annihilate each other, eq. (2.25).
Since the two complementary projections are also symmetric, the spaces they project
onto are orthogonal.
p. 60, lines 1-3. See page 50; (2.26) is just Pythagoras again.
P_X (the hat matrix) and M_X are convenient and important for theory, but they should not be
used in the computation of regressions or residuals, for obvious reasons: they are n by n, too
big.
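Still, for a tiny made-up data set the projections can be formed explicitly and compared with lm(); a minimal sketch (n = 10, everything below is hypothetical):
set.seed(3)
X <- cbind(1, rnorm(10))
y <- rnorm(10)
P <- X %*% solve(t(X) %*% X) %*% t(X)                    # hat matrix P_X
M <- diag(10) - P                                        # M_X
fit <- lm(y ~ X - 1)                                     # the same regression via lm
all.equal(as.vector(P %*% y), as.vector(fitted(fit)))    # TRUE: P_X y gives the fitted values
all.equal(as.vector(M %*% y), as.vector(resid(fit)))     # TRUE: M_X y gives the residuals
max(abs(P %*% M))                                        # essentially zero: P_X and M_X annihilate each other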
Some questions to ponder and try in R:
1) What happens when we post-multiply X by a matrix A, where A is nonsingular? This is
like a change of scale for the regressors.
2) What is the hat matrix for XA?
The column space of X equals the column space of XA; hence the hat matrix for X is the same as the hat matrix for XA.
3) What is the reversal rule for inverses? For transposes? The order reverses: (AB)^{-1} = B^{-1} A^{-1} and (AB)^T = B^T A^T.
Why do we care about the X versus XA business? In regression, y = Xβ + u and the reparameterized model with
regressors XA should give exactly the same fitted values and residuals, but not the same b: b simply becomes A^{-1} b.
Change of units from Fahrenheit to Celsius
p. 62
Fht = 32 ι + (9/5) Cels.
Verify using R that the residuals and fitted values do not change with this change of
variables; a sketch is given below.
(2.29) is the relation between the coefficients in the Celsius scale and the coefficients in the Fahrenheit scale.
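A minimal sketch of this verification, with made-up temperature data (all numbers below are hypothetical):
set.seed(4)
cels <- runif(20, 0, 30)
y    <- 5 + 0.3 * cels + rnorm(20)
fht  <- 32 + (9/5) * cels                 # same information, Fahrenheit units
m_c <- lm(y ~ cels)
m_f <- lm(y ~ fht)
all.equal(fitted(m_c), fitted(m_f))       # TRUE: fitted values unchanged
all.equal(resid(m_c), resid(m_f))         # TRUE: residuals unchanged
coef(m_c); coef(m_f)                      # the slope rescales by 5/9 and the intercept adjusts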
4) What if we use log X instead of X?
THIS IS NOT LIKE MULTIPLICATION BY NONSINGULAR MATRIX A
The FWL theorem says that (2.33) and (2.40) give identical coefficients and residuals.
(2.33) says y = X1 β1 + X2 β2 + u, with two groups of regressors.
(2.40) says M1 y = M1 X2 β2 + residuals, where M1 = I - P1 and P1 is the hat matrix for X1.
The hat matrix for ι alone is an n by n matrix all of whose elements are 1/n (here 5 by 5 with elements 1/5 = 0.2).
(2.31)
The projection (I - hat matrix for ι) = M_ι is also known as the centering projection; M_ι y gives the deviations of y from its mean ȳ.
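A tiny check of the centering projection in R (the numbers are made up):
yv <- c(3, 1, 4, 1, 5)
n  <- length(yv)
Mi <- diag(n) - matrix(1/n, n, n)   # I minus the hat matrix for iota
as.vector(Mi %*% yv)                # same as ...
yv - mean(yv)                       # ... the deviations from the mean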
p. 64
The figure shows that adding a constant does not change the slope.
p. 65
(2.32): two groups of regressors; the first group has ι and the second has z.
If we work with deviations from the mean for both y and z, the slope coefficient does not
change, the residuals do not change, and the fitted values do not change.
If ι is one of the regressors, the residuals being orthogonal to ι means that ι^T u^ = sum
of the residuals = zero.
Even if ι does not explicitly occur in the list of regressors, if the dummy variables
sum to one for each data point, then again the sum of the residuals u^ equals zero.
Eq. (2.33) has X1 and X2 as two groups of regressors.
X is partitioned; instead of the subscript X1, just use 1 as the subscript.
The projections P1 and M1 are likewise defined for X1.
p. 66, last 3 lines:
the column space of X contains all the columns of X1, so
the projection P_X applied to the columns of X1 does not change them at all.
Hence (2.35) is true in general.
p. 67
Similar to the figure on page 64: what happens if we add X1 A to X2?
The projection M1 applied to X2 is similar to the centering projection operation.
(2.38) and (2.39) must yield the same β2. Why?
In (2.38), X1 is sort of like the ι in (2.32), and M1 X2 is like z in (2.32).
It is interesting that (2.33), with both X1 and X2, also gives the same β2, but not the same residuals.
Why?
We need to change the left side of (2.39) from y to M1 y, as in (2.40).
p. 68, last 2 lines: M1 annihilates X1 and M_X annihilates X2.
p. 69, the line before (2.45).
p. 70: Applications of the FWL theorem. Quarterly data are common, so use seasonal dummies. (2.46)
defines the dummies, and (2.47) shows why there is a linear dependence (the four dummies sum to ι).
Which s you drop does not matter, but the interpretation will be different; (2.48) is the drop-ι case.
p. 71: retain the constant and drop s1.
The s′ variables are defined as differences from s4: always subtract s4
(3 lines below (2.49)).
(2.51) is like the FWL decomposition into the two groups X1 and X2.
p. 72: we can even define a residual-creating projection M_S for the seasonals. This projection is like
seasonal adjustment. If the seasonal adjustment is made in this particular way, we get
exactly the same results for the important slope coefficients. (2.51) uses unadjusted data;
(2.52) uses seasonally adjusted data.
The projection not only de-seasonalizes but also centers the data.
Trick to avoid centering: use only s1 to s3.
To seasonally adjust or not? FWL says go ahead: either way you get the same results! (A sketch in R follows.)
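A minimal sketch of this FWL claim in R, with made-up quarterly data (the data-generating numbers below are hypothetical):
set.seed(5)
n <- 40
q <- factor(rep(1:4, n / 4))                       # quarter labels
x <- rnorm(n)
y <- 1 + 0.5 * x + rep(c(0, 0.3, -0.2, 0.1), n / 4) + rnorm(n)
S  <- model.matrix(~ q)                            # intercept plus three seasonal dummies
MS <- diag(n) - S %*% solve(t(S) %*% S) %*% t(S)   # the 'seasonal adjustment' projection M_S
ys <- as.vector(MS %*% y)                          # deseasonalized (and centered) y
xs <- as.vector(MS %*% x)                          # deseasonalized (and centered) x
coef(lm(y ~ q + x))["x"]                           # slope on x from the unadjusted regression
coef(lm(ys ~ xs - 1))                              # the same slope from the adjusted data (FWL)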
p. 73: De-trending as a second application of the FWL theorem.
Make the time trend orthogonal to the intercept ι. Then we have two possibilities, depending
on whether n = number of data points is even or odd!
The time variable for the odd case looks like the equation in the middle of p. 73.
FWL says that the β estimates and residuals are exactly the same whether we work with
the detrended data or the original data, in the cases consistent with the FWL specification.
To detrend or not? FWL says go ahead: either way you get the same results! (See the sketch below.)
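The same kind of check for detrending, again with hypothetical data:
set.seed(6)
n  <- 21
tt <- 1:n                          # time trend
x  <- rnorm(n)
y  <- 2 + 0.1 * tt + 0.5 * x + rnorm(n)
x_dt <- residuals(lm(x ~ tt))      # x with the constant and trend projected out
y_dt <- residuals(lm(y ~ tt))      # y with the constant and trend projected out
coef(lm(y ~ tt + x))["x"]          # slope on x in the full regression
coef(lm(y_dt ~ x_dt - 1))          # identical slope from the detrended data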
Goodness of fit
p. 74: the uncentered R^2 is worse than the centered R^2 on page 75. Why?
The uncentered version is sensitive to changes in units.
Just by adding a constant to y, the uncentered R^2 goes up. This is bad news for regressions where the
intercept is absent. See the example in R below.
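A minimal sketch of this effect in R, with made-up data; the uncentered R^2 is computed by hand as 1 minus the residual sum of squares over the raw sum of squares of y:
set.seed(7)
x <- rnorm(30)
y <- x + rnorm(30)
R2_uncentered <- function(y, x) { f <- lm(y ~ x); 1 - sum(resid(f)^2) / sum(y^2) }
R2_uncentered(y, x)
R2_uncentered(y + 100, x)              # jumps toward 1, although the fit is no better
summary(lm(y ~ x))$r.squared           # centered R^2 ...
y_shift <- y + 100
summary(lm(y_shift ~ x))$r.squared     # ... is unchanged by adding a constant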
p. 75, last lines:
in general, if least squares is not used, R^2 is not reliable in some cases.
p. 76
Each element of b is a weighted linear combination of the elements of y.
c_i is the i-th row of (X'X)^{-1} X'.
It measures the effect of the i-th observation on the regression (is it influential?).
The figure shows how a high-leverage point pulls the regression line toward itself.
p. 77
The x coordinate of an outlying point decides whether the point has high leverage;
the y coordinate of an outlying point decides whether it is influential in changing things.
The real issue: are the outliers (influential points) correct data? If there is nothing wrong with
the data, we may have to accept the facts as they are and not grumble about them.
LEVERAGE: what if I omit the t-th observation?
Define the e_t vector (a unit basis vector, a column of the identity matrix; together the e_t span the Euclidean space).
(2.56) studies what happens when we include e_t as a regressor.
Seems silly to have a regressor that picks out a single observation, right?
p. 78
The projection M based on the e_t vector amounts to deleting the t-th observation.
The jackknife is leave-one-out (Tukey's term).
(2.56) and the equation before (2.61) apply the FWL theorem: same β.
(2.60) measures the effect of dropping the t-th observation on the predicted y (the left side); it is known once we
know α (the coefficient on e_t), the hat matrix of X, and the e_t vector from the identity matrix (on the RHS of (2.60)).
Getting from (2.60) ultimately to (2.63) is our goal.
(2.62) says what α should be, nicely known from the hat-matrix diagonals and the t-th residual.
p. 79: we are essentially pre-multiplying (2.60) by the generalized inverse of X to get rid of
the X on the left side of (2.60).
(2.63) has a lot of insight:
it depends on both h_t and u^t.
h_t, from the diagonal of the hat matrix, is a measure of LEVERAGE.
Not every high-leverage point is INFLUENTIAL.
The HAT matrix DIAGONALS have special information in them.
(2.65) says:
the lower bound on h_t, the diagonal elements of the hat matrix, is zero;
if an intercept is present, the lower bound is 1/n
(the projection matrix for ι has diagonal elements 1/n, last line of p. 79;
ι is in the column space of X, first line of p. 80).
We also know that the average of the h_t values is k/n:
(2.68) shows that the trace of the hat matrix is k if there are k columns in X.
When is it a balanced design? If the h_t values are close to their average k/n. (A small check in R follows.)
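A small check with made-up data (n = 50, one regressor plus an intercept, so k = 2):
set.seed(8)
n <- 50
x <- rnorm(n)
fit <- lm(rnorm(n) ~ x)
h <- hatvalues(fit)      # the diagonal elements h_t of the hat matrix
range(h)                 # all between 1/n and 1 (an intercept is present)
sum(h)                   # equals k = 2, the trace of the hat matrix
mean(h)                  # the average leverage k/n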
p. 81
Some observations, far away from the center, have more leverage than others. In the
figure, N(0,1) is used for x.
What was the advantage of the geometric viewpoint?
The fact that some matrices are idempotent becomes quite clear as soon as one
understands the notion of orthogonal projections.
Many exercises can be checked in the R language.
In R there is a package called perturb;
it gives lots of leverage and influence diagnostics.