Simple Linear Regression - Part Deux
Just the Essentials of Linear Algebra
2
 5
Vector: A set of ordered numbers, written, by default, as a column. Ex., v   
1
 
3
Transpose of a Vector: The transpose of vector v , written vT , is defined as the corresponding ordered row of
numbers. Ex., the transpose of the vector v above is vT  2
5 1 3
2
 1
 5
3
Vector Addition: Addition is defined componentwise. Ex., for vectors v    and u    ,
1
 2 
 
 
3
4
 2   1  2  1   1 
 5  3   5  3  2
     uv
vu    
 1   2  1  2   1
    
  
 3   4   3 4   7 
Vector Multiplication: The inner product (or dot product) of vectors v and u is defined as
2
vT u 
5 1 3  1
3
    2  1   5 3  1 2    3 4   2  15  2  12  7
 2 
 
4
Note 1: The inner product of two vectors is a number!
Note 2: vT u  uT v
Orthogonal: Vectors v and u are orthogonal if vT u  0 . The geometrical interpretation is that the vectors are
perpendicular. Orthogonality of vectors plays a big role in linear models.
Length of a Vector: The length of vector v is given by v  vT v . This extends the Pythagorean theorem to
vectors with an arbitrary number of components. Ex., v  vT v  39
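As a quick check of the examples above, here is a minimal NumPy sketch (assuming NumPy is available; the vectors are the ones used in the examples):

```python
import numpy as np

v = np.array([2, 5, -1, 3])
u = np.array([-1, -3, 2, 4])

print(v + u)           # componentwise sum: [1 2 1 7]
print(v @ u)           # inner product v^T u: -7
print(u @ v)           # same number, since v^T u = u^T v
print(np.sqrt(v @ v))  # length of v: sqrt(39), about 6.245
```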
2
 1
Matrices: A is an m by n matrix if it is an array of numbers with m rows and n columns. Ex., A  
1

4
1
3
 is a 4x2
3

2
matrix.
Matrix Addition: Matrix addition is componentwise, as for vectors.
Note: A vector is a special case of a matrix with a single column (or single row in the case of its transpose).
Matrix Multiplication: If $A$ is m x n and $B$ is n x p, the product $AB$ is defined as the m x p array of all inner products of row vectors of $A$ with column vectors of $B$. Ex., if $A_{4 \times 2} = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ 1 & 3 \\ 4 & 2 \end{pmatrix}$ and $B_{2 \times 3} = \begin{pmatrix} 1 & 7 & 3 \\ -1 & 2 & -1 \end{pmatrix}$, then
$$AB_{4 \times 3} = \begin{pmatrix} 1 & 16 & 5 \\ -4 & -1 & -6 \\ -2 & 13 & 0 \\ 2 & 32 & 10 \end{pmatrix}.$$
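A hand computation like this one is easy to confirm with a short NumPy sketch (the matrices below are the values as reconstructed in the example above; NumPy is assumed):

```python
import numpy as np

A = np.array([[ 2,  1],
              [-1,  3],
              [ 1,  3],
              [ 4,  2]])
B = np.array([[ 1,  7,  3],
              [-1,  2, -1]])

print(A @ B)   # the 4 x 3 array of all row-by-column inner products
# [[ 1 16  5]
#  [-4 -1 -6]
#  [-2 13  0]
#  [ 2 32 10]]
```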
Transpose of a Matrix: The transpose of the m x n matrix $A$ is the n x m matrix $A^T$ obtained by swapping rows for columns in $A$. Ex., if $A_{4 \times 2} = \begin{pmatrix} 2 & 1 \\ -1 & 3 \\ 1 & 3 \\ 4 & 2 \end{pmatrix}$, then $A^T_{2 \times 4} = \begin{pmatrix} 2 & -1 & 1 & 4 \\ 1 & 3 & 3 & 2 \end{pmatrix}$.
Transpose of a Product: $(AB)^T = B^T A^T$
Symmetric Matrix: The square matrix $A_{n \times n}$ is symmetric if $a_{ij} = a_{ji}$ for all $1 \le i \le n$, $1 \le j \le n$. Ex., in
$$A_{3 \times 3} = \begin{pmatrix} 1 & -3 & 1 \\ -3 & 7 & -2 \\ 1 & -2 & 5 \end{pmatrix}: \quad a_{12} = a_{21} = -3; \ a_{13} = a_{31} = 1; \ a_{23} = a_{32} = -2.$$
(Note: If $A_{n \times n}$ is symmetric, then $A^T = A$.)
Theorem: For any m x n matrix $A$, $A^T A$ is an n x n symmetric matrix, and $A A^T$ is an m x m symmetric matrix.
Linear Independence: The set of n vectors $v_1, v_2, v_3, \ldots, v_n$ form a linearly independent set if the linear combination $c_1 v_1 + c_2 v_2 + c_3 v_3 + \cdots + c_n v_n = 0$ implies that all of the constant coefficients, $c_i$, equal zero.
rank(A): The rank of matrix A, rank(A), is the number of linearly independent columns (or rows) in the matrix. A
matrix with all columns, or all rows, independent is said to be "full rank." Full rank matrices play an important role
in linear models.
Theorem about Rank: rank($A^T A$) = rank($A A^T$) = rank($A$). This theorem will be used in regression.
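The rank theorem is easy to check numerically. A minimal sketch, assuming NumPy and reusing the 4 x 2 matrix $A$ from the examples above:

```python
import numpy as np

A = np.array([[ 2,  1],
              [-1,  3],
              [ 1,  3],
              [ 4,  2]])

print(np.linalg.matrix_rank(A))           # 2
print(np.linalg.matrix_rank(A.T @ A))     # 2: rank(A^T A) = rank(A)
print(np.linalg.matrix_rank(A @ A.T))     # 2: rank(A A^T) = rank(A)
print(np.allclose(A.T @ A, (A.T @ A).T))  # True: A^T A is symmetric
```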
Inverse of a Matrix: The inverse of the n x n square matrix $A$ is the n x n square matrix $A^{-1}$ such that $A A^{-1} = A^{-1} A = I$, where $I$ is the n x n "Identity" matrix,
$$I_{n \times n} = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}_{n \times n}$$
Theorem: The n x n square matrix A has an inverse if and only if it is full rank, i.e., rank(A) = n.
Theorem: The invertible 2 x 2 matrix $A_{2 \times 2} = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ has inverse $A^{-1} = \dfrac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$. Ex., if $A = \begin{pmatrix} 2 & 4 \\ -3 & -1 \end{pmatrix}$, then
$$A^{-1} = \frac{1}{(2)(-1) - (4)(-3)} \begin{pmatrix} -1 & -4 \\ 3 & 2 \end{pmatrix} = \frac{1}{10} \begin{pmatrix} -1 & -4 \\ 3 & 2 \end{pmatrix}.$$
You should carry out the products $A A^{-1}$ and $A^{-1} A$ to confirm that they equal $I_{2 \times 2} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$.
Solving a System of n Linear Equations in n Unknowns
The system of n linear equations in n unknowns,
$$\begin{aligned} a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n &= b_1 \\ a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n &= b_2 \\ &\ \ \vdots \\ a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n &= b_n, \end{aligned}$$
can be represented by the matrix equation $Ax = b$, where
$$A_{n \times n} = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}.$$
The system has a unique solution if and only if $A$ is invertible, i.e., $A^{-1}$ exists. In that case, the solution is given by $x = A^{-1} b$.
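As an illustration of $x = A^{-1} b$, here is a minimal NumPy sketch; the 2 x 2 matrix is the one from the inverse example above, and the right-hand side $b$ is made up purely for illustration (it does not come from the notes):

```python
import numpy as np

A = np.array([[ 2,  4],
              [-3, -1]])
b = np.array([6, 1])                  # illustrative right-hand side, not from the notes

x_via_inverse = np.linalg.inv(A) @ b  # the x = A^{-1} b form from the text
x = np.linalg.solve(A, b)             # the usual way to solve Ax = b in practice

print(x_via_inverse, x)               # both give [-1.  2.]
print(np.allclose(A @ x, b))          # True
```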
Representing the Simple Linear Regression Model as a Matrix Equation
For a sample of n observations on the bivariate distribution of the variables X and Y, the simple linear regression model $Y = \beta_0 + \beta_1 X + \varepsilon$ leads to the system of n equations
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 X_1 + \varepsilon_1 \\ \beta_0 + \beta_1 X_2 + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 X_n + \varepsilon_n \end{pmatrix} = \begin{pmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix},$$
which can be written
$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$
The shorthand for this is $Y = X\beta + \varepsilon$, where the n x 2 matrix $X$ is called the "design" matrix because the values of the variable X are often fixed by the researcher in the design of the experiment used to investigate the relationship between the variables X and Y. Note: Since $X$ refers to the design matrix above, we will use $x$ to refer to the vector of values for X, $x^T = \begin{pmatrix} X_1 & X_2 & \cdots & X_n \end{pmatrix}$.
Fitting the Best Least Squares Regression Line to the n Observations
Ideally, we would solve the matrix equation $Y = X\beta + \varepsilon$ for the vector of regression coefficients $\beta$, i.e., the true intercept $\beta_0$ and true slope $\beta_1$ in the model $Y = \beta_0 + \beta_1 X + \varepsilon$. In practice, however, this is never possible because the equation $Y = X\beta + \varepsilon$ has no solution! (Why?) When faced with an equation that we cannot solve, we do what mathematicians usually do: we find a related equation that we can solve. First, since we don't know the errors in our observations, we forget about the vector of the errors $\varepsilon$. (Here, it is important that we not confuse the errors $\varepsilon_i$, which we never know, with the residuals $e_i$ associated with our estimated regression line.)
Unable to determine the parameters $\beta_0$ and $\beta_1$, we look to estimate them, i.e., solve for $b_0$ and $b_1$ in the matrix equation $Y = Xb$, but the system involves n equations in the two unknowns $b_0$ and $b_1$. If you remember your algebra, overdetermined systems of linear equations rarely have solutions. What we need is a related system of linear equations that we can solve. (Of course, we'll have to show that the solution to the new system has relevance to our problem of estimating $\beta_0$ and $\beta_1$.) Finally (drum roll please), the system we'll actually solve is
$$X^T X b = X^T Y \qquad (1.1)$$
Next, we show what the system above looks like, and, as we go, we'll see why it's solvable and why it's relevant.
$$X^T X = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix} \begin{pmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix} = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} \qquad (1.2)$$
$$X^T X b = \begin{pmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} n b_0 + b_1 \sum X_i \\ b_0 \sum X_i + b_1 \sum X_i^2 \end{pmatrix} \qquad (1.3)$$
$$X^T Y = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{pmatrix} \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} \sum Y_i \\ \sum X_i Y_i \end{pmatrix} \qquad (1.4)$$
Now we're in a position to write out the equations for the system $X^T X b = X^T Y$:
$$\begin{aligned} n b_0 + b_1 \sum X_i &= \sum Y_i \\ b_0 \sum X_i + b_1 \sum X_i^2 &= \sum X_i Y_i \end{aligned} \qquad (1.5)$$
If these equations look familiar, they are the equations derived in the previous notes by applying the Least Squares criterion for the best fitting regression line. Thus, the solution to the matrix equation (1.1) is the least squares solution to the problem of fitting a line to data! (Although we could have established this directly using theorems from linear algebra and vector spaces, the calculus argument made in Part I of the regression notes is simpler and probably more convincing.) Finally, the system of two equations in the two unknowns $b_0$ and $b_1$ has a solution!
To solve the system $X^T X b = X^T Y$, we note that the 2 x 2 matrix $X^T X$ has an inverse. We know this since rank($X^T X$) = rank($X$) = 2, and full rank square matrices have inverses. (Note: rank($X$) = 2 because the two columns of the matrix $X$ are linearly independent.) The solution has the form
$$b = \left( X^T X \right)^{-1} X^T Y \qquad (1.6)$$
where
$$\left( X^T X \right)^{-1} = \frac{1}{n \sum X_i^2 - \left( \sum X_i \right)^2} \begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix} = \frac{1}{n \sum \left( X_i - \bar{X} \right)^2} \begin{pmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{pmatrix},$$
and, after a fair amount of algebra, equation (1.6) returns the estimated intercept $b_0$ and slope $b_1$ given in Part I of the regression notes:
$$b_1 = \frac{\sum X_i Y_i - \frac{1}{n} \left( \sum X_i \right) \left( \sum Y_i \right)}{\sum X_i^2 - \frac{1}{n} \left( \sum X_i \right)^2} = \frac{\sum \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)}{\sum \left( X_i - \bar{X} \right)^2} \qquad (1.7)$$
$$b_0 = \bar{Y} - b_1 \bar{X} \qquad (1.8)$$
Example
Although you will have plenty of opportunities to use software to determine the best fitting regression line for a sample of bivariate data, it might be useful to go through the process once by hand for the "toy" data set:

x:  1  2  4  5
y:  8  4  6  2

Rather than using equations (1.7) and (1.8), start with $X^T X b = X^T Y$ and find (a) $X^T X$, (b) $X^T Y$, (c) $\left( X^T X \right)^{-1}$ (using only the definition of the inverse of a 2 x 2 matrix given in the linear algebra review), and (d) $\left( X^T X \right)^{-1} X^T Y$. Perform the analysis again using software or a calculator to confirm the answer.
The (Least Squares) Solution to the Regression Model
The equation $Y = Xb + e$ is the estimate of the simple linear regression model $Y = X\beta + \varepsilon$, where $b = \begin{pmatrix} b_0 \\ b_1 \end{pmatrix}$ is the least squares estimate of the intercept and slope of the true regression line $Y = X\beta$ given by equations (1.8) and (1.7), respectively, and $e$ is the n x 1 vector of residuals. Now, $Y = Xb + e$ can be written $Y = \hat{Y} + e$, where $\hat{Y} = Xb$ is the n x 1 vector of predictions made for the observations used to construct the estimate $b$. The n points $\left( X_i, \hat{Y}_i \right)$ will, of course, all lie on the final regression line.
The Hat Matrix
A useful matrix that shows up again and again in regression analysis is the n x n matrix $H = X \left( X^T X \right)^{-1} X^T$, called the "hat" matrix. To see how it gets its name, note that $\hat{Y} = Xb = X \left( X^T X \right)^{-1} X^T Y = HY$. Thus the matrix $H$ puts a "hat" on the vector $Y$.
The hat matrix $H$ has many nice properties (each is verified numerically in the sketch after this list). These include:
- $H$ is symmetric, i.e., $H^T = H$, as is easily proven.
- $H$ is idempotent, which is a fancy way of saying that $HH = H$. The shorthand for this is $H^2 = H$. This is also easily proven.
- The matrix $(I - H)$, where $I$ is the n x n identity matrix, is both symmetric and idempotent.
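A minimal sketch of that check, assuming NumPy and reusing the toy data set:

```python
import numpy as np

x = np.array([1, 2, 4, 5])
y = np.array([8, 4, 6, 2])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T   # the hat matrix H = X (X^T X)^{-1} X^T
I = np.eye(len(x))

print(np.allclose(H, H.T))                    # True: H is symmetric
print(np.allclose(H @ H, H))                  # True: H is idempotent
print(np.allclose((I - H) @ (I - H), I - H))  # True: I - H is idempotent
print(H @ y)                                  # HY = Y-hat, the fitted values [7. 6. 4. 3.]
```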
The Error Sum of Squares, SSE, in Matrix Form
From the equation $Y = \hat{Y} + e$, $e = Y - \hat{Y}$. Using the hat matrix, this becomes $e = Y - HY = (I - H) Y$, where we have right-factored the vector $Y$. The error sum of squares, SSE, is just the squared length of the vector of residuals, i.e., SSE $= \sum e_i^2 = e^T e$. In terms of the hat matrix this becomes SSE $= Y^T (I - H) Y$, since $e^T e = Y^T (I - H)^T (I - H) Y = Y^T (I - H) Y$, where we have used the symmetry and idempotency of the matrix $(I - H)$ to simplify the result.
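A quick numerical check that the two expressions for SSE agree, on the toy data (a minimal sketch, assuming NumPy):

```python
import numpy as np

x = np.array([1, 2, 4, 5])
y = np.array([8, 4, 6, 2])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(len(x))
e = (I - H) @ y           # residual vector e = (I - H)Y

print(e @ e)              # SSE as e^T e: 10.0 for the toy data
print(y @ (I - H) @ y)    # SSE as Y^T (I - H) Y: the same 10.0
```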
The Geometry of Least Squares
$$Y = \hat{Y} + e \qquad (1.9)$$
One of the attractions of linear models, and especially of the least squares solutions to them, is the wealth of geometrical interpretations that spring from them. The n x 1 vectors $Y$, $\hat{Y}$, and $e$ are vectors in the vector space $\mathbb{R}^n$ (an n-dimensional space whose components are real numbers, as opposed to complex numbers). From equation (1.9) above, we know that the vectors $\hat{Y}$ and $e$ sum to $Y$, but we can show a much more surprising result that will eventually lead to powerful conclusions about the sums of squares SSE, SSR, and SST.
First, we have to know where each of the vectors $Y$, $\hat{Y}$, and $e$ "live":
- The vector of observations $Y$ has no restrictions placed on it and therefore can lie anywhere in $\mathbb{R}^n$.
- $\hat{Y} = Xb = b_0 \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} + b_1 \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}$ is restricted to the two-dimensional subspace of $\mathbb{R}^n$ spanned by the columns of the design matrix $X$, called (appropriately) the column space of $X$.
- We will shortly show that $e$ lives in a subspace of $\mathbb{R}^n$ orthogonal to the column space of $X$.
Next, we derive the critical result that the vectors $\hat{Y}$ and $e$ are orthogonal (perpendicular) in $\mathbb{R}^n$. (Remember: vectors $v$ and $u$ are orthogonal if and only if $v^T u = 0$.)
$$\hat{Y}^T e = (HY)^T (I - H) Y = Y^T H^T (I - H) Y = Y^T (H - HH) Y = Y^T (H - H) Y = 0,$$
where we made repeated use of the fact that $H$ and $I - H$ are symmetric and idempotent. Therefore, $e$ is restricted to an n - 2 dimensional subspace of $\mathbb{R}^n$ (because it must be orthogonal to every vector in the column space of $X$).
Combining the facts that the vectors $\hat{Y}$ and $e$ sum to $Y$ and are orthogonal to each other, we conclude that $\hat{Y}$ and $e$ form the legs of a right triangle (in $\mathbb{R}^n$) with hypotenuse $Y$. By the Pythagorean Theorem for right triangles,
$$\| \hat{Y} \|^2 + \| e \|^2 = \| Y \|^2, \qquad (1.10)$$
or equivalently, $\hat{Y}^T \hat{Y} + e^T e = Y^T Y$.
A slight modification of the argument above shows that the vectors $\hat{Y} - \bar{Y}$ and $e$ sum to $Y - \bar{Y}$ and are orthogonal to each other (here $\bar{Y}$ denotes the n x 1 vector each of whose components equals the sample mean $\bar{Y}$), whence we conclude that $\hat{Y} - \bar{Y}$ and $e$ form the legs of a right triangle (in $\mathbb{R}^n$) with hypotenuse $Y - \bar{Y}$. By the Pythagorean Theorem for right triangles,
$$\left( \hat{Y} - \bar{Y} \right)^T \left( \hat{Y} - \bar{Y} \right) + e^T e = \left( Y - \bar{Y} \right)^T \left( Y - \bar{Y} \right), \qquad (1.11)$$
or equivalently, SSR + SSE = SST. This last equality is the famous one involving the three sums of squares in regression.
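Both the orthogonality of $\hat{Y}$ and $e$ and the identity SSR + SSE = SST can be confirmed on the toy data. A minimal sketch, assuming NumPy:

```python
import numpy as np

x = np.array([1, 2, 4, 5])
y = np.array([8, 4, 6, 2])
X = np.column_stack([np.ones_like(x), x])

b = np.linalg.solve(X.T @ X, X.T @ y)  # least squares coefficients [8., -1.]
y_hat = X @ b                          # fitted values
e = y - y_hat                          # residuals
y_bar = y.mean()

print(np.isclose(y_hat @ e, 0))        # True: Y-hat and e are orthogonal
SSR = np.sum((y_hat - y_bar) ** 2)     # 10.0
SSE = np.sum(e ** 2)                   # 10.0
SST = np.sum((y - y_bar) ** 2)         # 20.0
print(np.isclose(SSR + SSE, SST))      # True: SSR + SSE = SST
```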
We've actually done more than just derive the equation involving the three sums of squares. It turns out that the dimensions of the subspaces the vectors live in also determine their "degrees of freedom," so we've also shown that SSE has n - 2 degrees of freedom because the vector of residuals, $e$, is restricted to an n - 2 dimensional subspace of $\mathbb{R}^n$. (The term "degrees of freedom" is actually quite descriptive because the vector $e$ is only "free" to assume values in this n - 2 dimensional subspace.)
The Analysis of Variance (ANOVA) Table
The computer output of a regression analysis always contains a table containing the sums of squares SSR, SSE, and SST. The table is called an analysis of variance, or ANOVA, table for reasons we will see later in the course. The table has the form displayed below. (Note: The mean square of the sum of squared residuals is the "mean squared error" MSE, the estimate of the variance $\sigma^2$ of the error variable $\varepsilon$ in the regression model.)

Source        Degrees of Freedom (df)    Sums of Squares (SS)                Mean Square (MS) = SS/df
Regression    1                          $\sum ( \hat{Y}_i - \bar{Y} )^2$    $\sum ( \hat{Y}_i - \bar{Y} )^2 / 1$
Residual      n - 2                      $\sum ( Y_i - \hat{Y}_i )^2$        $\sum ( Y_i - \hat{Y}_i )^2 / (n - 2)$
Total         n - 1                      $\sum ( Y_i - \bar{Y} )^2$
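The table can be reproduced for the toy data with a few lines of code. A minimal sketch, assuming NumPy:

```python
import numpy as np

x = np.array([1, 2, 4, 5])
y = np.array([8, 4, 6, 2])
n = len(y)
X = np.column_stack([np.ones_like(x), x])

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
y_bar = y.mean()

SSR = np.sum((y_hat - y_bar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SST = np.sum((y - y_bar) ** 2)

print(f"{'Source':<12}{'df':>6}{'SS':>10}{'MS':>12}")
print(f"{'Regression':<12}{1:>6}{SSR:>10.1f}{SSR / 1:>12.1f}")
print(f"{'Residual':<12}{n - 2:>6}{SSE:>10.1f}{SSE / (n - 2):>12.1f}")
print(f"{'Total':<12}{n - 1:>6}{SST:>10.1f}")
```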