LINEAR ALGEBRA

Author: Shaina Race

North Carolina State University • 920 Main Campus Drive, Suite 530 • Raleigh, NC 27606 • http://analytics.ncsu.edu

© 2014
CONTENTS

1  The Basics                                                    1
   1.1  Conventional Notation                                    1
        1.1.1  Matrix Partitions                                 2
        1.1.2  Special Matrices and Vectors                      3
        1.1.3  n-space                                           4
   1.2  Vector Addition and Scalar Multiplication                4
   1.3  Exercises                                                7

2  Norms, Inner Products and Orthogonality                       9
   2.1  Norms and Distances                                      9
   2.2  Inner Products                                          13
        2.2.1  Covariance                                       13
        2.2.2  Mahalanobis Distance                             15
        2.2.3  Angular Distance                                 16
        2.2.4  Correlation                                      16
   2.3  Orthogonality                                           17
   2.4  Outer Products                                          19

3  Linear Combinations and Linear Independence                  23
   3.1  Linear Combinations                                     23
   3.2  Linear Independence                                     26
        3.2.1  Determining Linear Independence                  27
   3.3  Span of Vectors                                         28

4  Basis and Change of Basis                                    32

5  Least Squares                                                38

6  Eigenvalues and Eigenvectors                                 43
   6.1  Diagonalization                                         47
   6.2  Geometric Interpretation of Eigenvalues and Eigenvectors 49

7  Principal Components Analysis                                51
   7.1  Comparison with Least Squares                           57
   7.2  Covariance or Correlation Matrix?                       57
   7.3  Applications of Principal Components                    58
        7.3.1  PCA for dimension reduction                      58

8  Singular Value Decomposition (SVD)                           62
   8.1  Resolving a Matrix into Components                      63
        8.1.1  Data Compression                                 64
        8.1.2  Noise Reduction                                  64
        8.1.3  Latent Semantic Indexing                         65

9  Advanced Regression Techniques                               68
   9.1  Biased Regression                                       68
        9.1.1  Principal Components Regression (PCR)            69
        9.1.2  Ridge Regression                                 72
CHAPTER 1

THE BASICS
1.1 Conventional Notation
Linear Algebra has some conventional ways of representing certain types of
numerical objects. Throughout this course, we will stick to the following basic
conventions:
• Bold and uppercase letters like A, X, and U will be used to refer to
matrices.
• Occasionally, the size of the matrix will be specified by subscripts, like
Am×n , which means that A is a matrix with m rows and n columns.
• Bold and lowercase letters like x and y will be used to reference vectors.
Unless otherwise specified, these vectors will be thought of as columns,
with x T and yT referring to the row equivalent.
• The individual elements of a vector or matrix will often be referred to
with subscripts, so that Aij (or sometimes aij ) denotes the element in
the ith row and jth column of the matrix A. Similarly, xk denotes the kth
element of the vector x. These references to individual elements are not
generally bolded because they refer to scalar quantities.
• Scalar quantities are written as unbolded Greek letters like α, δ, and λ.
• The trace of a square matrix An×n , denoted Tr (A) or Trace(A), is the
sum of the diagonal elements of A,
$$\mathrm{Tr}(\mathbf{A}) = \sum_{i=1}^{n} A_{ii}.$$
Beyond these basic conventions, there are other common notational tricks
that we will become familiar with. The first of these is writing a partitioned
matrix.
1.1.1 Matrix Partitions
We will often want to consider a matrix as a collection of either rows or columns
rather than individual elements. As we will see in the next chapter, when we
partition matrices in this form, we can view their multiplication in simplified
form. This often leads us to a new view of the data which can be helpful for
interpretation.
When we write A = (A1 |A2 | . . . |An ) we are viewing the matrix A as a collection of column vectors, Ai , in the following way:
$$\mathbf{A} = (\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n) = \begin{pmatrix}\uparrow & \uparrow & \uparrow & \dots & \uparrow\\ \mathbf{A}_1 & \mathbf{A}_2 & \mathbf{A}_3 & \dots & \mathbf{A}_n\\ \downarrow & \downarrow & \downarrow & \dots & \downarrow\end{pmatrix}$$
Similarly, we can write A as a collection of row vectors:
$$\mathbf{A} = \begin{pmatrix}\mathbf{A}_1\\ \mathbf{A}_2\\ \vdots\\ \mathbf{A}_m\end{pmatrix} = \begin{pmatrix}\longleftarrow & \mathbf{A}_1 & \longrightarrow\\ \longleftarrow & \mathbf{A}_2 & \longrightarrow\\ & \vdots & \\ \longleftarrow & \mathbf{A}_m & \longrightarrow\end{pmatrix}$$
Sometimes, we will want to refer to both rows and columns in the same
context. The above notation is not sufficient for this as we have A j referring to
either a column or a row. In these situations, we may use A•j to reference the jth column and Ai• to reference the ith row:
$$\begin{array}{c}\begin{matrix}\mathbf{A}_{\bullet 1} & \mathbf{A}_{\bullet 2} & \cdots & \mathbf{A}_{\bullet n}\end{matrix}\\ \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ \vdots & \vdots & & \vdots\\ a_{i1} & \cdots & a_{ij} & a_{in}\\ \vdots & & & \vdots\\ a_{m1} & \cdots & \cdots & a_{mn}\end{pmatrix}\end{array} \qquad \begin{matrix}\mathbf{A}_{1\bullet}\\ \vdots\\ \mathbf{A}_{i\bullet}\\ \vdots\\ \mathbf{A}_{m\bullet}\end{matrix} \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n}\\ \vdots & \vdots & & \vdots\\ a_{i1} & \cdots & a_{ij} & a_{in}\\ \vdots & & & \vdots\\ a_{m1} & \cdots & \cdots & a_{mn}\end{pmatrix}$$
1.1.2 Special Matrices and Vectors
The bold capital letter I is used to denote the identity matrix. Sometimes this
matrix has a single subscript to specify the size of the matrix. More often, the
size of the identity is implied by the matrix equation in which it appears.
$$\mathbf{I}_4 = \begin{pmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\end{pmatrix}$$
The bold lowercase ej is used to refer to the jth column of I. It is simply a vector of zeros with a one in the jth position. We do not often specify the size of the vector ej; the number of elements is generally assumed from the context of the problem.
$$\mathbf{e}_j = \begin{pmatrix}0\\ \vdots\\ 0\\ 1\\ 0\\ \vdots\\ 0\end{pmatrix} \leftarrow j\text{th position}$$
The vector e with no subscript refers to a vector of all ones.
$$\mathbf{e} = \begin{pmatrix}1\\ 1\\ \vdots\\ 1\end{pmatrix}$$
A diagonal matrix is a matrix for which the off-diagonal elements, Aij, i ≠ j, are zero. For example:
$$\mathbf{D} = \begin{pmatrix}\sigma_1&0&0&0\\ 0&\sigma_2&0&0\\ 0&0&\sigma_3&0\\ 0&0&0&\sigma_4\end{pmatrix}$$
Since the off-diagonal elements are 0, we need only define the diagonal elements
for such a matrix. Thus, we will frequently write
D = diag{σ1 , σ2 , σ3 , σ4 }
or simply
Dii = σi .
1.1.3 n-space
You are already familiar with the concept of “ordered pairs" or coordinates
( x1 , x2 ) on the two-dimensional plane (in Linear Algebra, we call this plane
"2-space"). Fortunately, we do not live in a two-dimensional world! Our data
will more often consist of measurements on a number (let's call that number
n) of variables. Thus, our data points belong to what is known as n-space.
They are represented by n-tuples which are nothing more than ordered lists of
numbers:
( x1 , x2 , x3 , . . . , x n ).
An n-tuple defines a vector with the same n elements, and so these two concepts
should be thought of interchangeably. The only difference is that the vector
has a direction, away from the origin and toward the n-tuple.
You will recall that the symbol R is used to denote the set of real numbers.
R is simply 1-space. It is a set of vectors with a single element. In this sense
any real number, x, has a direction: if it is positive, it lies to one side of the origin; if it is negative, it lies to the opposite side. That number, x, also has a
magnitude: | x | is the distance between x and the origin, 0.
n-space (the set of real n-tuples) is denoted Rn . In set notation, the formal
mathematical definition is simply:
Rn = {( x1 , x2 , . . . , xn ) : xi ∈ R, i = 1, . . . , n} .
We will often use this notation to define the size of an arbitrary vector.
For example, x ∈ R p simply means that x is a vector with p entries: x =
( x1 , x2 , . . . , x p ).
Many (all, really) of the concepts we have previously considered in 2- or
3-space extend naturally to n-space and a few new concepts become useful as
well. One very important concept is that of a norm or distance metric, as we
will see in Chapter 2. Before discussing norms, let’s revisit the basics of vector
addition and scalar multiplication.
1.2 Vector Addition and Scalar Multiplication
You’ve already learned how vector addition works algebraically: it occurs
element-wise between two vectors of the same length:
$$\mathbf{a} + \mathbf{b} = \begin{pmatrix}a_1\\ a_2\\ a_3\\ \vdots\\ a_n\end{pmatrix} + \begin{pmatrix}b_1\\ b_2\\ b_3\\ \vdots\\ b_n\end{pmatrix} = \begin{pmatrix}a_1+b_1\\ a_2+b_2\\ a_3+b_3\\ \vdots\\ a_n+b_n\end{pmatrix}$$
Geometrically, vector addition is visualized by placing the two vectors, a
and b, tail-to-head. The result, a + b, is the vector from the open tail to the open
head. This is called the parallelogram law and is demonstrated in Figure 1.1a.
[Figure 1.1: Vector Addition and Subtraction Geometrically: Tail-to-Head. (a) Addition of vectors; (b) Subtraction of vectors.]
When subtracting vectors as a − b we simply add −b to a. The vector −b
has the same length as b but points in the opposite direction. This vector has
the same length as the one which connects the two heads of a and b as shown
in Figure 1.1b.
Example 1.2.1: Vector Subtraction: Centering Data
One thing we will do frequently in this course is consider centered
and/or standardized data. To center a group of variables, we merely
subtract the mean of each variable from each observation. Geometrically,
this amounts to a translation (shift) of the data so that its center (or
mean) is at the origin. The following graphic illustrates this process
using 4 data points.
[Figure: centering illustrated with four data points in the (x1, x2) plane; subtracting the mean vector x̄ translates the point cloud so that its center sits at the origin.]
Scalar multiplication is another operation which acts element-wise:
$$\alpha\mathbf{a} = \alpha\begin{pmatrix}a_1\\ a_2\\ a_3\\ \vdots\\ a_n\end{pmatrix} = \begin{pmatrix}\alpha a_1\\ \alpha a_2\\ \alpha a_3\\ \vdots\\ \alpha a_n\end{pmatrix}$$
Scalar multiplication changes the length of a vector but not the overall
direction (although a negative scalar will scale the vector in the opposite
direction through the origin). We can see this geometric interpretation of scalar
multiplication in Figure 1.2.
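To make these operations concrete, here is a short NumPy sketch (not part of the original text; the arrays are made-up examples) that checks element-wise addition, subtraction, scalar multiplication, and the centering operation from Example 1.2.1:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, -1.0])

# Vector addition and subtraction act element-wise.
print(a + b)        # [5. 2. 2.]
print(a - b)        # [-3.  2.  4.]

# Scalar multiplication stretches (or flips) a vector without leaving its line.
print(2 * a)        # [2. 4. 6.]
print(-0.5 * a)     # [-0.5 -1.  -1.5]

# Centering: subtract the mean of each variable (column) from each observation.
X = np.array([[2.0, 4.0], [4.0, 8.0], [6.0, 12.0]])
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0))   # [0. 0.] -- the centered data has mean zero
```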
[Figure 1.2: Geometric Effect of Scalar Multiplication, showing a, 2a, and −0.5a.]
1.3 Exercises
1. For a general matrix Am×n describe what the following products will
provide. Also give the size of the result (i.e. "n × 1 vector" or "scalar").
a. Aej
b. eiᵀA
c. eiᵀAej
d. Ae
e. eᵀA
f. (1/n) eᵀA
2. Let Dn×n be a diagonal matrix with diagonal elements Dii . What effect
does multiplying a matrix Am×n on the left by D have? What effect does
multiplying a matrix An×m on the right by D have? If you cannot see
this effect in a general sense, try writing out a simple 3 × 3 matrix as an
example first.
3. What is the inverse of a diagonal matrix, D = diag{d11 , d22 , . . . , dnn }?
4. Suppose you have a matrix of data, An× p , containing n observations
on p variables. Suppose the standard deviations of these variables are
σ1 , σ2 , . . . , σp . Give a formula for a matrix that contains the same data but
with each variable divided by its standard deviation. Hint: you should use
exercises 2 and 3.
5. Suppose we have a network/graph as shown in Figure 1.3. This particular
network has 6 numbered vertices (the circles) and edges which connect
the vertices. Each edge has a certain weight (perhaps reflecting some level
of association between the vertices) which is given as a number.
[Figure 1.3: An example of a graph or network, with 6 numbered vertices and weighted edges.]
a. The adjacency matrix of a graph is defined to be the matrix A such
that element Aij reflects the weight of the edge connecting vertex
i and vertex j. Write out the adjacency matrix for this graph.
b. The degree of a vertex is defined as the sum of the weights of the
edges connected to that vertex. Create a vector d such that di is the
degree of node i.
c. Write d as a matrix-vector product in two different ways using the
adjacency matrix, A, and e.
CHAPTER 2

NORMS, INNER PRODUCTS AND ORTHOGONALITY
2.1 Norms and Distances
In applied mathematics, Norms are functions which measure the magnitude or
length of a vector. They are commonly used to determine similarities between
observations by measuring the distance between them. As we will see, there
are many ways to define distance between two points.
Definition 2.1.1: Vector Norms and Distance Metrics
A Norm, or distance metric, is a function that takes a vector as input
and returns a scalar quantity ( f : Rn → R). A vector norm is typically
denoted by two vertical bars surrounding the input vector, ‖x‖, to signify that it is not just any function, but one that satisfies the following criteria:
1. If c is a scalar, then ‖cx‖ = |c| ‖x‖
2. The triangle inequality: ‖x + y‖ ≤ ‖x‖ + ‖y‖
3. ‖x‖ = 0 if and only if x = 0.
4. ‖x‖ ≥ 0 for any vector x
We will not spend any time on these axioms or on the theoretical aspects of
norms, but we will put a couple of these functions to good use in our studies,
the first of which is the Euclidean norm or 2-norm.
Definition 2.1.2: Euclidean Norm, ‖·‖₂
The Euclidean Norm, also known as the 2-norm, simply measures the Euclidean length of a vector (i.e. a point's distance from the origin). Let x = (x1, x2, . . . , xn). Then
$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$
If x is a column vector, then
$$\|\mathbf{x}\|_2 = \sqrt{\mathbf{x}^T\mathbf{x}}.$$
Often we will simply write ‖·‖ rather than ‖·‖₂ to denote the 2-norm, as it is by far the most commonly used norm.
This is merely the distance formula from undergraduate mathematics,
measuring the distance between the point x and the origin. To compute the
distance between two different points, say x and y, we’d calculate
$$\|\mathbf{x}-\mathbf{y}\|_2 = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2 + \cdots + (x_n-y_n)^2}$$
Example 2.1.1: Euclidean Norm and Distance
Suppose I have two vectors in 3-space:
x = (1, 1, 1) and y = (1, 0, 0)
Then the magnitude of x (i.e. its length or distance from the origin) is
$$\|\mathbf{x}\|_2 = \sqrt{1^2+1^2+1^2} = \sqrt{3}$$
and the magnitude of y is
$$\|\mathbf{y}\|_2 = \sqrt{1^2+0^2+0^2} = 1$$
and the distance between point x and point y is
$$\|\mathbf{x}-\mathbf{y}\|_2 = \sqrt{(1-1)^2+(1-0)^2+(1-0)^2} = \sqrt{2}.$$
The Euclidean norm is crucial to many methods in data analysis as it
measures the closeness of two data points.
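As a quick numerical check of Example 2.1.1, the NumPy sketch below (illustrative only, not part of the original text) recomputes the two magnitudes and the distance between x and y:

```python
import numpy as np

x = np.array([1.0, 1.0, 1.0])
y = np.array([1.0, 0.0, 0.0])

# 2-norm (Euclidean length) of each vector
print(np.linalg.norm(x))        # 1.732... = sqrt(3)
print(np.linalg.norm(y))        # 1.0

# Euclidean distance between the two points
print(np.linalg.norm(x - y))    # 1.414... = sqrt(2)

# The norm can also be written via an inner product: ||x||^2 = x^T x
print(np.isclose(np.linalg.norm(x) ** 2, x @ x))   # True
```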
Thus, to turn any vector into a unit vector, a vector with a length of 1, we
need only to divide each of the entries in the vector by its Euclidean norm.
This is a simple form of standardization used in many areas of data analysis.
For a unit vector x, x T x = 1.
Perhaps without knowing it, we’ve already seen many formulas involving
the norm of a vector. Examples 2.1.2 and 2.1.3 show how some of the most
important concepts in statistics can be represented using vector norms.
Example 2.1.2: Standard Deviation and Variance
Suppose a group of individuals has the following heights, measured in
inches: (60, 70, 65, 50, 55). The mean height for this group is 60 inches.
The formula for the sample standard deviation is typically given as
$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}$$
We want to subtract the mean from each observation, square the numbers, sum the result, take the square root, and divide by √(n − 1).
If we let x̄ = x̄e = (60, 60, 60, 60, 60) be a vector containing the mean, and x = (60, 70, 65, 50, 55) be the vector of data, then the standard deviation in matrix notation is:
$$s = \frac{1}{\sqrt{n-1}}\|\mathbf{x}-\bar{\mathbf{x}}\|_2 = 7.9$$
The sample variance of this data is merely the square of the sample
standard deviation:
$$s^2 = \frac{1}{n-1}\|\mathbf{x}-\bar{\mathbf{x}}\|_2^2$$
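The height example is easy to verify numerically. Below is a NumPy sketch (ours, not part of the original text) confirming that the norm-based formula matches the usual sample standard deviation and variance:

```python
import numpy as np

x = np.array([60.0, 70.0, 65.0, 50.0, 55.0])
n = len(x)
xbar = np.full(n, x.mean())            # the mean repeated n times

s = np.linalg.norm(x - xbar) / np.sqrt(n - 1)
print(s)                                # 7.905... (about 7.9)
print(np.isclose(s, x.std(ddof=1)))     # True: matches the sample standard deviation
print(np.isclose(s**2, x.var(ddof=1)))  # True: the squared norm gives the sample variance
```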
Example 2.1.3: Residual Sums of Squares
Another place we’ve seen a similar calculation is in linear regression.
You’ll recall the objective of our regression line is to minimize the sum
of squared residuals between the predicted value ŷ and the observed
value y:
$$\sum_{i=1}^{n}(\hat{y}_i - y_i)^2.$$
In vector notation, we’d let y be a vector containing the observed data
and ŷ be a vector containing the corresponding predictions and write
this summation as
$$\|\hat{\mathbf{y}}-\mathbf{y}\|_2^2$$
In fact, in any situation where the phrase "sum of squares" is encountered, the 2-norm is generally implicated.
Example 2.1.4: Coefficient of Determination, R2
Since variance can be expressed using the Euclidean norm, so can the
coefficient of determination or R2 .
$$R^2 = \frac{SS_{reg}}{SS_{tot}} = \frac{\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} = \frac{\|\hat{\mathbf{y}}-\bar{\mathbf{y}}\|^2}{\|\mathbf{y}-\bar{\mathbf{y}}\|^2}$$
Other useful norms and distances
1-norm, ‖·‖₁. If x = (x1, x2, . . . , xn), then the 1-norm of x is
$$\|\mathbf{x}\|_1 = \sum_{i=1}^{n}|x_i|.$$
This metric is often referred to as Manhattan distance, city block distance, or taxicab
distance because it measures the distance between points along a rectangular
grid (as a taxicab must travel on the streets of Manhattan, for example). When
x and y are binary vectors, the 1-norm is called the Hamming Distance, and
simply measures the number of elements that are different between the two
vectors.
Figure 2.1: The lengths of the red, yellow, and blue paths represent the 1-norm distance between the two points. The green line shows the Euclidean measurement (2-norm).
∞-norm, ‖·‖∞. The infinity norm, also called the Supremum or Max distance, is:
$$\|\mathbf{x}\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_p|\}$$
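For comparison, here is a small NumPy sketch (with an arbitrary example vector, not from the text) computing the 1-, 2-, and ∞-norms with np.linalg.norm:

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(v, 1))        # 8.0   : sum of absolute values (Manhattan)
print(np.linalg.norm(v, 2))        # 5.099...: Euclidean length
print(np.linalg.norm(v, np.inf))   # 4.0   : largest absolute entry

# Hamming-style count for binary vectors: the 1-norm of the difference
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 0, 1])
print(np.linalg.norm(x - y, 1))    # 2.0   : two positions differ
```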
2.2 Inner Products
The inner product of vectors is a notion that you've already seen; it is what's called the dot product in most physics and calculus textbooks.
Definition 2.2.1: Vector Inner Product
The inner product of two n × 1 vectors x and y is written xᵀy (or sometimes as ⟨x, y⟩) and is the sum of the products of corresponding elements:
$$\mathbf{x}^T\mathbf{y} = \begin{pmatrix}x_1 & x_2 & \dots & x_n\end{pmatrix}\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix} = x_1y_1 + x_2y_2 + \cdots + x_ny_n = \sum_{i=1}^{n}x_iy_i.$$
When we take the inner product of a vector with itself, we get the square of the 2-norm:
$$\mathbf{x}^T\mathbf{x} = \|\mathbf{x}\|_2^2.$$
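A one-line check in NumPy (a sketch with made-up vectors) confirms both the inner-product formula and the identity xᵀx = ‖x‖₂²:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -1.0, 2.0])

# Inner (dot) product: sum of element-wise products
print(x @ y)                        # 8.0 = 1*4 + 2*(-1) + 3*2
print(np.dot(x, y))                 # same thing

# Inner product of a vector with itself equals its squared 2-norm
print(np.isclose(x @ x, np.linalg.norm(x) ** 2))   # True
```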
Inner products are at the heart of every matrix product. When we multiply
two matrices, Xm×n and Yn× p , we can represent the individual elements of the
result as inner products of rows of X and columns of Y as follows:
$$\mathbf{XY} = \begin{pmatrix}\mathbf{X}_{1\bullet}\\ \mathbf{X}_{2\bullet}\\ \vdots\\ \mathbf{X}_{m\bullet}\end{pmatrix}\begin{pmatrix}\mathbf{Y}_{\bullet 1} & \mathbf{Y}_{\bullet 2} & \cdots & \mathbf{Y}_{\bullet p}\end{pmatrix} = \begin{pmatrix}\mathbf{X}_{1\bullet}\mathbf{Y}_{\bullet 1} & \mathbf{X}_{1\bullet}\mathbf{Y}_{\bullet 2} & \cdots & \mathbf{X}_{1\bullet}\mathbf{Y}_{\bullet p}\\ \mathbf{X}_{2\bullet}\mathbf{Y}_{\bullet 1} & \mathbf{X}_{2\bullet}\mathbf{Y}_{\bullet 2} & \cdots & \mathbf{X}_{2\bullet}\mathbf{Y}_{\bullet p}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{X}_{m\bullet}\mathbf{Y}_{\bullet 1} & \mathbf{X}_{m\bullet}\mathbf{Y}_{\bullet 2} & \cdots & \mathbf{X}_{m\bullet}\mathbf{Y}_{\bullet p}\end{pmatrix}$$

2.2.1 Covariance
Another important statistical measurement that is represented by an inner
product is covariance. Covariance is a measure of how much two random
variables change together. The statistical formula for covariance is given as
$$\mathrm{Covariance}(\mathbf{x}, \mathbf{y}) = E[(\mathbf{x} - E[\mathbf{x}])(\mathbf{y} - E[\mathbf{y}])] \qquad (2.1)$$
where E[·] is the expected value of the variable. If larger values of one variable
correspond to larger values of the other variable and at the same time smaller
values of one correspond to smaller values of the other, then the covariance
between the two variables is positive. In the opposite case, if larger values of
one variable correspond to smaller values of the other and vice versa, then the
covariance is negative. Thus, the sign of the covariance shows the tendency
of the linear relationship between variables, however the magnitude of the
covariance is not easy to interpret. Covariance is a population parameter - it is
a property of the joint distribution of the random variables x and y. Definition
2.2.2 provides the mathematical formulation for the sample covariance. This is
our best estimate for the population parameter when we have data sampled
from a population.
Definition 2.2.2: Sample Covariance
If x and y are n × 1 vectors containing n observations for two different
variables, then the sample covariance of x and y is given by
$$\mathrm{cov}(\mathbf{x},\mathbf{y}) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \frac{1}{n-1}(\mathbf{x}-\bar{\mathbf{x}})^T(\mathbf{y}-\bar{\mathbf{y}})$$
where again x̄ and ȳ are vectors that contain x̄ and ȳ repeated n times. It should be clear from this formulation that
$$\mathrm{cov}(\mathbf{x},\mathbf{y}) = \mathrm{cov}(\mathbf{y},\mathbf{x}).$$
When we have p vectors, v1, v2, . . . , vp, each containing n observations for p different variables, the sample covariances are most commonly given by the sample covariance matrix, Σ, where
$$\Sigma_{ij} = \mathrm{cov}(\mathbf{v}_i, \mathbf{v}_j).$$
This matrix is symmetric, since Σij = Σji. If we create a matrix V whose columns are the vectors v1, v2, . . . , vp once the variables have been centered to have mean 0, then the covariance matrix is given by:
$$\mathrm{cov}(\mathbf{V}) = \boldsymbol{\Sigma} = \frac{1}{n-1}\mathbf{V}^T\mathbf{V}.$$
The jth diagonal element of this matrix gives the variance of vj since
$$\Sigma_{jj} = \mathrm{cov}(\mathbf{v}_j,\mathbf{v}_j) = \frac{1}{n-1}(\mathbf{v}_j-\bar{\mathbf{v}}_j)^T(\mathbf{v}_j-\bar{\mathbf{v}}_j) = \frac{1}{n-1}\|\mathbf{v}_j-\bar{\mathbf{v}}_j\|_2^2 = \mathrm{var}(\mathbf{v}_j) \qquad (2.2)\text{–}(2.4)$$
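The matrix formula Σ = (1/(n−1))VᵀV is easy to check numerically. The NumPy sketch below (ours, not part of the original text) uses the small height/weight/wrist data matrix that appears later in Example 2.4.2 and compares the hand-built covariance matrix to np.cov:

```python
import numpy as np

# 5 observations (rows) on 3 variables (columns)
X = np.array([[60, 102, 5.5],
              [72, 170, 7.5],
              [66, 110, 6.0],
              [69, 128, 6.5],
              [63, 130, 7.0]], dtype=float)
n = X.shape[0]

V = X - X.mean(axis=0)              # center each variable
Sigma = (V.T @ V) / (n - 1)         # sample covariance matrix

# np.cov treats rows as variables by default, so pass rowvar=False
print(np.allclose(Sigma, np.cov(X, rowvar=False)))             # True

# Diagonal entries are the sample variances of the individual variables
print(np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1)))      # True
```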
When two variables are completely uncorrelated, their covariance is zero.
This lack of correlation would be seen in a covariance matrix with a diagonal
structure. That is, if v1 , v2 , . . . , v p are uncorrelated with individual variances
σ12 , σ22 , . . . , σp2 respectively then the corresponding covariance matrix is:
$$\boldsymbol{\Sigma} = \begin{pmatrix}\sigma_1^2 & 0 & \cdots & 0\\ 0 & \sigma_2^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma_p^2\end{pmatrix}$$
Furthermore, for variables which are independent and identically distributed (take for instance the error terms in a linear regression model, which are assumed to be independent and normally distributed with mean 0 and constant variance σ²), the covariance matrix is a multiple of the identity matrix:
$$\boldsymbol{\Sigma} = \begin{pmatrix}\sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & \sigma^2\end{pmatrix} = \sigma^2\mathbf{I}$$
Transforming our variables in such a way that their covariance matrix
becomes diagonal will be our goal in Chapter 7.
Theorem 2.2.1: Properties of Covariance Matrices
The following mathematical properties stem from Equation 2.1. Let
Xn× p be a matrix of data containing n observations on p variables. If A
is a constant matrix (or vector, in the first case) then
$$\mathrm{cov}(\mathbf{XA}) = \mathbf{A}^T\mathrm{cov}(\mathbf{X})\mathbf{A} \qquad\text{and}\qquad \mathrm{cov}(\mathbf{X}+\mathbf{A}) = \mathrm{cov}(\mathbf{X})$$

2.2.2 Mahalanobis Distance
Mahalanobis Distance is similar to Euclidean distance, but takes into account
the correlation of the variables. This metric is relatively common in data
mining applications like classification. Suppose we have p variables which
have some covariance matrix, Σ. Then the Mahalanobis distance between two
observations, x = (x1, x2, . . . , xp)ᵀ and y = (y1, y2, . . . , yp)ᵀ, is given by
$$d(\mathbf{x},\mathbf{y}) = \sqrt{(\mathbf{x}-\mathbf{y})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\mathbf{y})}.$$
If the covariance matrix is diagonal (meaning the variables are uncorrelated)
then the Mahalanobis distance reduces to Euclidean distance normalized by
the variance of each variable:
$$d(\mathbf{x},\mathbf{y}) = \sqrt{\sum_{i=1}^{p}\frac{(x_i-y_i)^2}{s_i^2}} = \|\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\mathbf{y})\|_2.$$
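Here is a brief NumPy sketch of the Mahalanobis computation (the points and the diagonal covariance matrix are made-up examples, not from the text):

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 5.0])
Sigma = np.diag([4.0, 1.0])           # variances 4 and 1, no correlation

diff = x - y
d_mahal = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)
print(d_mahal)                        # sqrt( (-2)^2/4 + (-3)^2/1 ) = sqrt(10)

# With a diagonal covariance this is just Euclidean distance scaled by each variance
d_scaled = np.sqrt(np.sum(diff**2 / np.diag(Sigma)))
print(np.isclose(d_mahal, d_scaled))  # True
```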
2.2.3 Angular Distance
The inner product between two vectors can provide useful information about
their relative orientation in space and about their similarity. For example, to
find the cosine of the angle between two vectors in n-space, the inner product
of their corresponding unit vectors will provide the result. This cosine is often
used as a measure of similarity or correlation between two vectors.
Definition 2.2.3: Cosine of Angle between Vectors
The cosine of the angle between two vectors in n-space is given by
$$\cos(\theta) = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}$$
[Figure: two vectors x and y with the angle θ between them.]
This angular distance is at the heart of Pearson’s correlation coefficient.
2.2.4 Correlation
Pearson’s correlation is a normalized version of the covariance, so that not
only the sign of the coefficient is meaningful, but its magnitude is meaningful in
measuring the strength of the linear association.
Example 2.2.1: Pearson’s Correlation and Cosine Distance
You may recall the formula for Pearson’s correlation between variable x
and y with a sample size of n to be as follows:
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$
If we let x̄ be a vector that contains x̄ repeated n times, like we did
in Example 2.1.2, and let ȳ be a vector that contains ȳ then Pearson’s
coefficient can be written as:
$$r = \frac{(\mathbf{x}-\bar{\mathbf{x}})^T(\mathbf{y}-\bar{\mathbf{y}})}{\|\mathbf{x}-\bar{\mathbf{x}}\|\,\|\mathbf{y}-\bar{\mathbf{y}}\|}$$
In other words, it is just the cosine of the angle between the two vectors
once they have been centered to have mean 0.
This makes sense: correlation is a measure of the extent to which the
two variables share a line in space. If the cosine of the angle is positive
or negative one, this means the angle between the two vectors is 0° or 180°; thus, the two vectors are perfectly correlated or collinear.
It is difficult to visualize the angle between two variable vectors because
they exist in n-space, where n is the number of observations in the dataset.
Unless we have fewer than 3 observations, we cannot draw these vectors or
even picture them in our minds. As it turns out, this angular measurement
does translate into something we can conceptualize: Pearson's correlation coefficient is the cosine of the angle formed between the two possible regression lines using the centered data: y regressed on x and x regressed on y. This is illustrated in
Figure 2.2.
To compute the matrix of pairwise correlations between variables x1, x2, x3, . . . , xp (columns containing n observations for each variable), we'd first center them to have mean zero, then normalize them to have length ‖xi‖ = 1, and then compose the matrix
$$\mathbf{X} = [\mathbf{x}_1|\mathbf{x}_2|\mathbf{x}_3|\dots|\mathbf{x}_p].$$
Using this centered and normalized data, the correlation matrix is simply
$$\mathbf{C} = \mathbf{X}^T\mathbf{X}.$$
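This recipe (center, normalize to unit length, then form XᵀX) can be checked against np.corrcoef. The NumPy sketch below uses a small random data matrix purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 observations on 3 variables

Xc = X - X.mean(axis=0)               # center each column
Xn = Xc / np.linalg.norm(Xc, axis=0)  # scale each column to unit length

C = Xn.T @ Xn                         # correlation matrix via inner products
print(np.allclose(C, np.corrcoef(X, rowvar=False)))   # True
print(np.allclose(np.diag(C), 1.0))   # each variable has correlation 1 with itself
```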
2.3 Orthogonality
Orthogonal (or perpendicular) vectors have an angle between them of 90◦ ,
meaning that their cosine (and subsequently their inner product) is zero.
[Figure 2.2: Correlation Coefficient r and Angle between Regression Lines; the angle θ between the lines y = f(x) and x = f(y) satisfies r = cos(θ).]
Definition 2.3.1: Orthogonality
Two vectors, x and y, are orthogonal in n-space if their inner product is
zero:
xT y = 0
Combining the notion of orthogonality and unit vectors we can define an
orthonormal set of vectors, or an orthonormal matrix. Remember, for a unit
vector, x T x = 1.
Definition 2.3.2: Orthonormal Sets
The n × 1 vectors {x1 , x2 , x3 , . . . , x p } form an orthonormal set if and
only if
1. xiᵀxj = 0 when i ≠ j, and
2. xiᵀxi = 1 (equivalently ‖xi‖ = 1)
In other words, an orthonormal set is a collection of unit vectors which
are mutually orthogonal.
If we form a matrix, X = (x1 |x2 |x3 | . . . |x p ), having an orthonormal set of
vectors as columns, we will find that multiplying the matrix by its transpose
provides a nice result:
$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix}\mathbf{x}_1^T\\ \mathbf{x}_2^T\\ \mathbf{x}_3^T\\ \vdots\\ \mathbf{x}_p^T\end{pmatrix}\begin{pmatrix}\mathbf{x}_1 & \mathbf{x}_2 & \mathbf{x}_3 & \cdots & \mathbf{x}_p\end{pmatrix} = \begin{pmatrix}\mathbf{x}_1^T\mathbf{x}_1 & \mathbf{x}_1^T\mathbf{x}_2 & \cdots & \mathbf{x}_1^T\mathbf{x}_p\\ \mathbf{x}_2^T\mathbf{x}_1 & \mathbf{x}_2^T\mathbf{x}_2 & \cdots & \mathbf{x}_2^T\mathbf{x}_p\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{x}_p^T\mathbf{x}_1 & \mathbf{x}_p^T\mathbf{x}_2 & \cdots & \mathbf{x}_p^T\mathbf{x}_p\end{pmatrix} = \begin{pmatrix}1 & 0 & \cdots & 0\\ 0 & 1 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & 1\end{pmatrix} = \mathbf{I}_p$$
We will be particularly interested in these types of matrices when they are
square. If X is a square matrix with orthonormal columns, the arithmetic above
means that the inverse of X is XT (i.e. X also has orthonormal rows):
XT X = XXT = I.
Square matrices with orthonormal columns are called orthogonal matrices.
Definition 2.3.3: Orthogonal (or Orthonormal) Matrix
A square matrix, U, with orthonormal columns also has orthonormal rows and is called an orthogonal matrix. Such a matrix has an inverse which is equal to its transpose:
$$\mathbf{U}^T\mathbf{U} = \mathbf{U}\mathbf{U}^T = \mathbf{I}$$
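The defining property is easy to verify numerically. The sketch below (illustrative; it builds an orthogonal matrix from the Q factor of a QR decomposition rather than using any matrix from the text) checks all three equivalent statements:

```python
import numpy as np

# Take the Q factor of a QR decomposition of a random square matrix;
# Q is then an orthogonal matrix.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))

print(np.allclose(Q.T @ Q, np.eye(4)))     # True: orthonormal columns
print(np.allclose(Q @ Q.T, np.eye(4)))     # True: orthonormal rows as well
print(np.allclose(np.linalg.inv(Q), Q.T))  # True: the inverse is just the transpose
```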
2.4 Outer Products
The outer product of two vectors x ∈ Rᵐ and y ∈ Rⁿ, written xyᵀ, is an m × n matrix with rank 1. To see this basic fact, let's just look at an example.
Example 2.4.1: Outer Product
Let x = (1, 2, 3, 4)ᵀ and let y = (2, 1, 3)ᵀ. Then the outer product of x and y is:
$$\mathbf{x}\mathbf{y}^T = \begin{pmatrix}1\\ 2\\ 3\\ 4\end{pmatrix}\begin{pmatrix}2 & 1 & 3\end{pmatrix} = \begin{pmatrix}2 & 1 & 3\\ 4 & 2 & 6\\ 6 & 3 & 9\\ 8 & 4 & 12\end{pmatrix}$$
which clearly has rank 1. It should be clear from this example that computing an outer product will always result in a matrix whose rows and columns are multiples of each other.
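Reproducing Example 2.4.1 in NumPy (a sketch; np.outer builds the same matrix as xyᵀ):

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([2, 1, 3])

M = np.outer(x, y)          # same as x.reshape(-1, 1) @ y.reshape(1, -1)
print(M)
# [[ 2  1  3]
#  [ 4  2  6]
#  [ 6  3  9]
#  [ 8  4 12]]

print(np.linalg.matrix_rank(M))   # 1 -- every row/column is a multiple of the others
```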
Example 2.4.2: Centering Data with an Outer Product
As we’ve seen in previous examples, many statistical formulas involve
the centered data, that is, data from which the mean has been subtracted
so that the new mean is zero. Suppose we have a matrix of data
containing observations of individuals’ heights (h) in inches, weights
(w), in pounds and wrist sizes (s), in inches:
$$\mathbf{A} = \bordermatrix{ & h & w & s \cr \text{person1} & 60 & 102 & 5.5 \cr \text{person2} & 72 & 170 & 7.5 \cr \text{person3} & 66 & 110 & 6.0 \cr \text{person4} & 69 & 128 & 6.5 \cr \text{person5} & 63 & 130 & 7.0 \cr}$$
The average values for height, weight, and wrist size are as follows:
$$\bar{h} = 66, \qquad \bar{w} = 128, \qquad \bar{s} = 6.5 \qquad (2.5)\text{–}(2.7)$$
To center all of the variables in this data set simultaneously, we could
compute an outer product using a vector containing the means and a
vector of all ones:
$$\begin{pmatrix}60 & 102 & 5.5\\ 72 & 170 & 7.5\\ 66 & 110 & 6.0\\ 69 & 128 & 6.5\\ 63 & 130 & 7.0\end{pmatrix} - \begin{pmatrix}1\\ 1\\ 1\\ 1\\ 1\end{pmatrix}\begin{pmatrix}66 & 128 & 6.5\end{pmatrix} = \begin{pmatrix}60 & 102 & 5.5\\ 72 & 170 & 7.5\\ 66 & 110 & 6.0\\ 69 & 128 & 6.5\\ 63 & 130 & 7.0\end{pmatrix} - \begin{pmatrix}66 & 128 & 6.5\\ 66 & 128 & 6.5\\ 66 & 128 & 6.5\\ 66 & 128 & 6.5\\ 66 & 128 & 6.5\end{pmatrix} = \begin{pmatrix}-6 & -26 & -1.0\\ 6 & 42 & 1.0\\ 0 & -18 & -0.5\\ 3 & 0 & 0\\ -3 & 2 & 0.5\end{pmatrix}$$
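The same centering can be reproduced in NumPy (a sketch using the data from this example; np.ones plays the role of the vector e):

```python
import numpy as np

A = np.array([[60, 102, 5.5],
              [72, 170, 7.5],
              [66, 110, 6.0],
              [69, 128, 6.5],
              [63, 130, 7.0]])
means = A.mean(axis=0)                 # [66, 128, 6.5]
e = np.ones(A.shape[0])

# Subtract the rank-1 outer product e * means^T to center every column at once
A_centered = A - np.outer(e, means)
print(A_centered)
print(np.allclose(A_centered.mean(axis=0), 0))   # True
```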
Exercises
1. Let u = (1, 2, −4, −2)ᵀ and v = (1, −1, 1, −1)ᵀ.
a. Determine the Euclidean distance between u and v.
b. Find a vector of unit length in the direction of u.
c. Determine the cosine of the angle between u and v.
d. Find the 1- and ∞-norms of u and v.
e. Suppose these vectors are observations on four independent variables, which have the following covariance matrix:
$$\boldsymbol{\Sigma} = \begin{pmatrix}2 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 2 & 0\\ 0 & 0 & 0 & 1\end{pmatrix}$$
Determine the Mahalanobis distance between u and v.
2. Let
$$\mathbf{U} = \frac{1}{3}\begin{pmatrix}-1 & 2 & 0 & -2\\ 2 & 2 & 0 & 1\\ 0 & 0 & 3 & 0\\ -2 & 1 & 0 & 2\end{pmatrix}$$
a. Show that U is an orthogonal matrix.
b. Let b = (1, 1, 1, 1)ᵀ. Solve the equation Ux = b.
3. Write a matrix expression for the correlation matrix, C, for a matrix of
centered data, X, where Cij = rij is Pearson’s correlation measure between
variables xi and xj. To do this, we need more than an inner product; we need to normalize the rows and columns by the norms ‖xi‖. For a hint,
see Exercise 2 in Chapter 1.
4. Suppose you have a matrix of data, An× p , containing n observations on
p variables. Develop a matrix formula for the standardized data (where
the mean of each variable should be subtracted from the corresponding
column before dividing by the standard deviation). Hint: use Exercises 1(f)
and 4 from Chapter 1 along with Example 2.4.2.
5. Explain why, for any norm or distance metric, ‖x − y‖ = ‖y − x‖.
6. Find two vectors which are orthogonal to x = (1, 1, 1)ᵀ.
7. Pythagorean Theorem. Show that x and y are orthogonal if and only if
‖x + y‖₂² = ‖x‖₂² + ‖y‖₂²
(Hint: Recall that ‖x‖₂² = xᵀx)
CHAPTER 3

LINEAR COMBINATIONS AND LINEAR INDEPENDENCE
One of the most central ideas in all of Linear Algebra is that of linear independence. For regression problems, it is repeatedly stressed that multicollinearity is
problematic. Multicollinearity is simply a statistical term for linear dependence.
It’s bad. We will see the reason for this shortly, but first we have to develop the
notion of a linear combination.
3.1 Linear Combinations
Definition 3.1.1: Linear Combination
A linear combination is constructed from a set of terms v1 , v2 , . . . , vn
by multiplying each term by a constant and adding the result:
$$\mathbf{c} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 + \cdots + \alpha_n\mathbf{v}_n = \sum_{i=1}^{n}\alpha_i\mathbf{v}_i$$
The coefficients αi are scalar constants and the terms, {vi } can be scalars,
vectors, or matrices.
If we dissect our formula for a system of linear equations, Ax = b, we will
find that the right-hand side vector b can be expressed as a linear combination
of the columns in the coefficient matrix, A.
$$\begin{aligned}\mathbf{b} &= \mathbf{A}\mathbf{x} &(3.1)\\ \mathbf{b} &= (\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n)\begin{pmatrix}x_1\\ x_2\\ \vdots\\ x_n\end{pmatrix} &(3.2)\\ \mathbf{b} &= x_1\mathbf{A}_1 + x_2\mathbf{A}_2 + \cdots + x_n\mathbf{A}_n &(3.3)\end{aligned}$$
A concrete example of this expression is given in Example 3.1.1.
Example 3.1.1: Systems of Equations as Linear Combinations
Consider the following system of equations:
$$\begin{aligned}3x_1 + 2x_2 + 9x_3 &= 1 &(3.4)\\ 4x_1 + 2x_2 + 3x_3 &= 5 &(3.5)\\ 2x_1 + 7x_2 + x_3 &= 0 &(3.6)\end{aligned}$$
We can write this as a matrix-vector product Ax = b where
$$\mathbf{A} = \begin{pmatrix}3 & 2 & 9\\ 4 & 2 & 3\\ 2 & 7 & 1\end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix}x_1\\ x_2\\ x_3\end{pmatrix}, \quad\text{and}\quad \mathbf{b} = \begin{pmatrix}1\\ 5\\ 0\end{pmatrix}$$
We can also write b as a linear combination of the columns of A:
$$x_1\begin{pmatrix}3\\ 4\\ 2\end{pmatrix} + x_2\begin{pmatrix}2\\ 2\\ 7\end{pmatrix} + x_3\begin{pmatrix}9\\ 3\\ 1\end{pmatrix} = \begin{pmatrix}1\\ 5\\ 0\end{pmatrix}$$
Similarly, if we have a matrix-matrix product, we can write each column
of the result as a linear combination of columns of the first matrix. Let Am×n ,
Xn×p, and Bm×p be matrices. If we have AX = B then
$$(\mathbf{A}_1|\mathbf{A}_2|\dots|\mathbf{A}_n)\begin{pmatrix}x_{11} & x_{12} & \dots & x_{1p}\\ x_{21} & x_{22} & \dots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} & \dots & x_{np}\end{pmatrix} = (\mathbf{B}_1|\mathbf{B}_2|\dots|\mathbf{B}_p)$$
and we can write
$$\mathbf{B}_j = \mathbf{A}\mathbf{X}_j = x_{1j}\mathbf{A}_1 + x_{2j}\mathbf{A}_2 + x_{3j}\mathbf{A}_3 + \cdots + x_{nj}\mathbf{A}_n.$$
A concrete example of this expression is given in Example 3.1.2.
Example 3.1.2: Linear Combinations in Matrix-Matrix Products
Suppose we have the following matrix formula AX = B, where
$$\mathbf{A} = \begin{pmatrix}2 & 1 & 3\\ 1 & 4 & 2\\ 3 & 2 & 1\end{pmatrix}, \qquad \mathbf{X} = \begin{pmatrix}5 & 6\\ 9 & 5\\ 7 & 8\end{pmatrix}.$$
Then
$$\mathbf{B} = \begin{pmatrix}2 & 1 & 3\\ 1 & 4 & 2\\ 3 & 2 & 1\end{pmatrix}\begin{pmatrix}5 & 6\\ 9 & 5\\ 7 & 8\end{pmatrix} = \begin{pmatrix}2(5)+1(9)+3(7) & 2(6)+1(5)+3(8)\\ 1(5)+4(9)+2(7) & 1(6)+4(5)+2(8)\\ 3(5)+2(9)+1(7) & 3(6)+2(5)+1(8)\end{pmatrix} \qquad (3.7)\text{–}(3.8)$$
and we can immediately notice that the columns of B are linear combinations of the columns of A:
$$\mathbf{B}_1 = 5\begin{pmatrix}2\\ 1\\ 3\end{pmatrix} + 9\begin{pmatrix}1\\ 4\\ 2\end{pmatrix} + 7\begin{pmatrix}3\\ 2\\ 1\end{pmatrix} \qquad \mathbf{B}_2 = 6\begin{pmatrix}2\\ 1\\ 3\end{pmatrix} + 5\begin{pmatrix}1\\ 4\\ 2\end{pmatrix} + 8\begin{pmatrix}3\\ 2\\ 1\end{pmatrix}$$
We may also notice that the rows of B can be expressed as linear combinations of the rows of X:
$$\begin{aligned}\mathbf{B}_{1\bullet} &= 2\,(5\ \ 6) + 1\,(9\ \ 5) + 3\,(7\ \ 8)\\ \mathbf{B}_{2\bullet} &= 1\,(5\ \ 6) + 4\,(9\ \ 5) + 2\,(7\ \ 8)\\ \mathbf{B}_{3\bullet} &= 3\,(5\ \ 6) + 2\,(9\ \ 5) + 1\,(7\ \ 8)\end{aligned}$$
Linear combinations are everywhere, and they can provide subtle but important meaning in the sense that they can break data down into a sum of parts.
You should convince yourself of one final view of matrix multiplication, as the sum of outer products. In this case B is the sum of 3 outer products (3 matrices of rank 1) involving the columns of A and corresponding rows of X:
$$\mathbf{B} = \mathbf{A}_{\bullet 1}\mathbf{X}_{1\bullet} + \mathbf{A}_{\bullet 2}\mathbf{X}_{2\bullet} + \mathbf{A}_{\bullet 3}\mathbf{X}_{3\bullet}.$$
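These different views of the product AX are easy to confirm numerically. The NumPy sketch below (ours) uses the matrices from Example 3.1.2 and checks the column view, the row view, and the sum-of-outer-products view:

```python
import numpy as np

A = np.array([[2, 1, 3],
              [1, 4, 2],
              [3, 2, 1]])
X = np.array([[5, 6],
              [9, 5],
              [7, 8]])
B = A @ X

# Column view: each column of B is a linear combination of the columns of A
B_col1 = 5 * A[:, 0] + 9 * A[:, 1] + 7 * A[:, 2]
print(np.array_equal(B_col1, B[:, 0]))    # True

# Row view: each row of B is a linear combination of the rows of X
B_row1 = 2 * X[0, :] + 1 * X[1, :] + 3 * X[2, :]
print(np.array_equal(B_row1, B[0, :]))    # True

# Outer-product view: B is a sum of rank-1 pieces, one per column of A / row of X
B_outer = sum(np.outer(A[:, k], X[k, :]) for k in range(3))
print(np.array_equal(B_outer, B))         # True
```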
Example 3.1.2 turns out to have important implications for our interpretation of matrix factorizations. In this context we'd call AX a factorization of the
matrix B. We will see how to use these expressions to our advantage in later
chapters.
We don’t necessarily have to use vectors as the terms for a linear combination. Example 3.1.3 shows how we can write any m × n matrix as a linear
combination of nm matrices with rank 1.
Example 3.1.3: Linear Combination of Matrices
Write the matrix A = $\begin{pmatrix}1 & 3\\ 4 & 2\end{pmatrix}$ as a linear combination of the following matrices:
$$\begin{pmatrix}1 & 0\\ 0 & 0\end{pmatrix}, \quad \begin{pmatrix}0 & 1\\ 0 & 0\end{pmatrix}, \quad \begin{pmatrix}0 & 0\\ 1 & 0\end{pmatrix}, \quad \begin{pmatrix}0 & 0\\ 0 & 1\end{pmatrix}$$
Solution:
$$\mathbf{A} = \begin{pmatrix}1 & 3\\ 4 & 2\end{pmatrix} = 1\begin{pmatrix}1 & 0\\ 0 & 0\end{pmatrix} + 3\begin{pmatrix}0 & 1\\ 0 & 0\end{pmatrix} + 4\begin{pmatrix}0 & 0\\ 1 & 0\end{pmatrix} + 2\begin{pmatrix}0 & 0\\ 0 & 1\end{pmatrix}$$
Now that we understand the concept of Linear Combination, we can develop the important concept of Linear Independence.
3.2 Linear Independence
Definition 3.2.1: Linear Dependence and Linear Independence
A set of vectors {v1 , v2 , . . . , vn } is linearly dependent if we can express
the zero vector, 0, as a non-trivial linear combination of the vectors. In
other words there exist some constants α1 , α2 , . . . αn (non-trivial means
that these constants are not all zero) for which
α1 v1 + α2 v2 + · · · + αn vn = 0.
(3.9)
A set of terms is linearly independent if Equation 3.9 has only the
trivial solution (α1 = α2 = · · · = αn = 0).
Another way to express linear dependence is to say that we can write one
of the vectors as a linear combination of the others. If there exists a non-trivial
set of coefficients α1 , α2 , . . . , αn for which
α1 v1 + α2 v2 + · · · + αn vn = 0
then for αj ≠ 0 we could write
$$\mathbf{v}_j = -\frac{1}{\alpha_j}\sum_{\substack{i=1\\ i\neq j}}^{n}\alpha_i\mathbf{v}_i$$
Example 3.2.1: Linearly Dependent Vectors
The vectors $\mathbf{v}_1 = \begin{pmatrix}1\\ 2\\ 2\end{pmatrix}$, $\mathbf{v}_2 = \begin{pmatrix}1\\ 2\\ 3\end{pmatrix}$, and $\mathbf{v}_3 = \begin{pmatrix}3\\ 6\\ 7\end{pmatrix}$ are linearly dependent because
$$\mathbf{v}_3 = 2\mathbf{v}_1 + \mathbf{v}_2$$
or, equivalently, because
$$2\mathbf{v}_1 + \mathbf{v}_2 - \mathbf{v}_3 = \mathbf{0}$$
3.2.1 Determining Linear Independence
You should realize that the linear combination expressed in Definition 3.2.1 can be written as a matrix-vector product. Let Am×n = (A1 |A2 | . . . |An ) be a matrix.
Then by Definition 3.2.1, the columns of A are linearly independent if and only
if the equation
Ax = 0
(3.10)
has only the trivial solution, x = 0. Equation 3.10 is commonly known as the
homogeneous linear equation. For this equation to have only the trivial solution,
it must be the case that under Gauss-Jordan elimination, the augmented matrix
(A|0) reduces to (I|0). We have already seen this condition in our discussion
about matrix inverses - if a square matrix A reduces to the identity matrix under
Gauss-Jordan elimination then it is equivalently called full rank, nonsingular, or
invertible. Now we add an additional condition equivalent to the others - the
matrix A has linearly independent columns (and rows).
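In practice we rarely carry out the Gauss-Jordan reduction by hand; the rank of the matrix tells the same story. Here is a NumPy sketch (ours) using the vectors from Example 3.2.1:

```python
import numpy as np

v1 = np.array([1, 2, 2])
v2 = np.array([1, 2, 3])
v3 = np.array([3, 6, 7])

A = np.column_stack([v1, v2, v3])

# If the columns were independent, the rank would equal the number of columns (3).
print(np.linalg.matrix_rank(A))           # 2 -> the columns are linearly dependent

# The dependence found in the example: v3 = 2*v1 + v2
print(np.array_equal(2 * v1 + v2, v3))    # True
```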
In Theorem 3.2.1 an important list of equivalent conditions regarding linear
independence and invertibility is given.
Theorem 3.2.1: Equivalent Conditions for Matrix Invertibility
Let A be an n × n matrix. The following statements are equivalent. (If one of these statements is true, then all of these statements are true.)
• A is invertible (A−1 exists)
• A has full rank (rank (A) = n)
• The columns of A are linearly independent
• The rows of A are linearly independent
• The system Ax = b, b ≠ 0 has a unique solution
• Ax = 0 =⇒ x = 0
• A is nonsingular
• A reduces to I under Gauss-Jordan elimination
3.3 Span of Vectors
Definition 3.3.1: Vector Span
The span of a single vector v is the set of all scalar multiples of v:
span(v) = {αv for any constant α}
The span of a collection of vectors, V = {v1 , v2 , . . . , vn } is the set of all
linear combinations of these vectors:
span(V) = {α1 v1 + α2 v2 + · · · + αn vn for any constants α1 , . . . , αn }
Recall that addition of vectors can be done geometrically using the head-to-tail method shown in Figure 3.1.
Figure 3.1: Geometrical addition of vectors: Head-to-tail
If we have two linearly independent vectors on a coordinate plane, then any third vector can be written as a linear combination of them. This is because two vectors are sufficient to span the entire 2-dimensional plane. You should
take a moment to convince yourself of this geometrically.
In 3-space, two linearly independent vectors can still only span a plane.
Figure 3.2 depicts this situation. The set of all linear combinations of the two vectors a and b (i.e. the span(a, b)) carves out a plane. We call this two-dimensional collection of vectors a subspace of R3. A subspace is formally
defined in Definition 3.3.2.
Figure 3.2: The span(a, b) in R3 creates a plane (a 2-dimensional subspace)
Definition 3.3.2: Subspace
A subspace, S of Rn is thought of as a “flat” (having no curvature)
surface within Rn . It is a collection of vectors which satisfies the
following conditions:
1. The origin (0 vector) is contained in S
2. If x and y are in S then the sum x + y is also in S
3. If x is in S and α is a constant then αx is also in S
The span of two vectors a and b is a subspace because it satisfies these three
conditions. (Can you prove it? See exercise 4).
Example 3.3.1: Span
Let a = (1, 3, 4)ᵀ and b = (3, 0, 1)ᵀ. Explain why or why not each of the following vectors is contained in the span(a, b).
a. x = (5, 6, 9)ᵀ
• To determine if x is in the span(a, b) we need to find coefficients α1, α2 such that α1 a + α2 b = x. Thus, we attempt to solve the system
$$\begin{pmatrix}1 & 3\\ 3 & 0\\ 4 & 1\end{pmatrix}\begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix} = \begin{pmatrix}5\\ 6\\ 9\end{pmatrix}.$$
After Gaussian elimination, we find that the system is consistent with the solution
$$\begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix} = \begin{pmatrix}2\\ 1\end{pmatrix}$$
and so x is in fact in the span(a, b).
b. y = (2, 4, 6)ᵀ
• We could follow the same procedure as we did in part (a) to learn that the corresponding system is not consistent and thus that y is not in the span(a, b).
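The two checks in Example 3.3.1 amount to asking whether a least-squares solution reproduces the target exactly. A NumPy sketch (ours) of the same computations:

```python
import numpy as np

a = np.array([1.0, 3.0, 4.0])
b = np.array([3.0, 0.0, 1.0])
M = np.column_stack([a, b])

def in_span(target, M):
    # Find the best coefficients in the least-squares sense, then test
    # whether they reproduce the target exactly (i.e. the system is consistent).
    coeffs, *_ = np.linalg.lstsq(M, target, rcond=None)
    return np.allclose(M @ coeffs, target), coeffs

print(in_span(np.array([5.0, 6.0, 9.0]), M))   # True, coefficients approximately [2, 1]
print(in_span(np.array([2.0, 4.0, 6.0]), M))   # False, only a best approximation exists
```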
Exercises
1. Six views of matrix multiplication: Let Am×k , Bk×n , and Cm×n be matrices such that
AB = C.
a. Express the first column of C as a linear combination of the columns of A.
b. Express the first column of C as a matrix-vector product.
c. Express C as a sum of outer products.
d. Express the first row of C as a linear combination of the rows of B.
e. Express the first row of C as a matrix-vector product.
f. Express the element Cij as an inner product of row or column vectors
from A and B.
2. Determine whether or not the vectors
$$\mathbf{x}_1 = \begin{pmatrix}2\\ 3\\ 1\end{pmatrix}, \quad \mathbf{x}_2 = \begin{pmatrix}1\\ 1\\ 0\end{pmatrix}, \quad \mathbf{x}_3 = \begin{pmatrix}0\\ 1\\ 1\end{pmatrix}$$
are linearly independent.
3. Let a = (1, 3, 4)ᵀ and b = (3, 0, 1)ᵀ.
a. Show that the zero vector, 0 = (0, 0, 0)ᵀ, is in the span(a, b).
b. Determine whether or not the vector (1, 0, 1)ᵀ is in the span(a, b).
4. Prove that the span of vectors is a subspace by showing that it satisfies the three conditions from Definition 3.3.2. You can simply show this fact
for the span of two vectors and notice how the concept will hold for more
than two vectors.
5. True/False Mark each statement as true or false. Justify your response.
• If Ax = b has a solution then b can be written as a linear combination of the columns of A.
• If Ax = b has a solution then b is in the span of the columns of A.
• If the vectors v1, v2, and v3 form a linearly dependent set, then v1
is in the span(v2 , v3 ).
CHAPTER 4

BASIS AND CHANGE OF BASIS
When we think of coordinate pairs, or coordinate triplets, we tend to think
of them as points on a grid where each axis represents one of the coordinate
directions:
[Figure: the points (2, 3), (−4, −2), and (5, 2) plotted on a grid whose axes are span(e1) and span(e2).]
When we think of our data points this way, we are considering them as
linear combinations of elementary basis vectors
$$\mathbf{e}_1 = \begin{pmatrix}1\\ 0\end{pmatrix} \quad\text{and}\quad \mathbf{e}_2 = \begin{pmatrix}0\\ 1\end{pmatrix}.$$
For example, the point (2, 3) is written as
$$\begin{pmatrix}2\\ 3\end{pmatrix} = 2\begin{pmatrix}1\\ 0\end{pmatrix} + 3\begin{pmatrix}0\\ 1\end{pmatrix} = 2\mathbf{e}_1 + 3\mathbf{e}_2. \qquad (4.1)$$
We consider the coefficients (the scalars 2 and 3) in this linear combination
as coordinates in the basis B1 = {e1 , e2 }. The coordinates, in essence, tell us
how much “information” from the vector/point (2, 3) lies along each basis
direction: to create this point, we must travel 2 units along the direction of e1
and then 3 units along the direction of e2 .
We can also view Equation 4.1 as a way to separate the vector (2, 3) into
orthogonal components. Each component is an orthogonal projection of
the vector onto the span of the corresponding basis vector. The orthogonal
projection of vector a onto the span another vector v is simply the closest point
to a contained on the span(v), found by “projecting” a toward v at a 90◦ angle.
Figure 4.1 shows this explicitly for a = (2, 3).
[Figure 4.1: Orthogonal Projections onto basis vectors, showing the projections of a onto span(e1) and span(e2).]
Definition 4.0.3: Elementary Basis
For any vector a = ( a1 , a2 , . . . , an ), the basis B = {e1 , e2 , . . . , en } (recall
ei is the ith column of the identity matrix In ) is the elementary basis
and a can be written in this basis using the coordinates a1 , a2 , . . . , an as
follows:
$$\mathbf{a} = a_1\mathbf{e}_1 + a_2\mathbf{e}_2 + \cdots + a_n\mathbf{e}_n.$$
The elementary basis B1 is convenient for many reasons, one being its
orthonormality:
$$\mathbf{e}_1^T\mathbf{e}_1 = \mathbf{e}_2^T\mathbf{e}_2 = 1, \qquad \mathbf{e}_1^T\mathbf{e}_2 = \mathbf{e}_2^T\mathbf{e}_1 = 0$$
However, there are many (infinitely many, in fact) ways to represent the
data points on different axes. If I wanted to view this data in a different
way, I could use a different basis. Let’s consider, for example, the following
orthonormal basis, drawn in green over the original grid in Figure 4.2:
$$\mathcal{B}_2 = \{\mathbf{v}_1, \mathbf{v}_2\} = \left\{\frac{\sqrt{2}}{2}\begin{pmatrix}1\\ 1\end{pmatrix},\ \frac{\sqrt{2}}{2}\begin{pmatrix}1\\ -1\end{pmatrix}\right\}$$
[Figure 4.2: New basis vectors, v1 and v2, shown on the original plane.]
The scalar multipliers √2/2 are simply normalizing factors so that the basis vectors have unit length. You can convince yourself that this is an orthonormal basis by confirming that
$$\mathbf{v}_1^T\mathbf{v}_1 = \mathbf{v}_2^T\mathbf{v}_2 = 1, \qquad \mathbf{v}_1^T\mathbf{v}_2 = \mathbf{v}_2^T\mathbf{v}_1 = 0$$
If we want to change the basis from the elementary B1 to the new green basis
vectors in B2 , we need to determine a new set of coordinates that direct us to
the point using the green basis vectors as a frame of reference. In other words
we need to determine (α1 , α2 ) such that travelling α1 units along the direction
v1 and then α2 units along the direction v2 will lead us to the point in question.
For the point (2, 3) that means
$$\begin{pmatrix}2\\ 3\end{pmatrix} = \alpha_1\mathbf{v}_1 + \alpha_2\mathbf{v}_2 = \alpha_1\begin{pmatrix}\tfrac{\sqrt{2}}{2}\\ \tfrac{\sqrt{2}}{2}\end{pmatrix} + \alpha_2\begin{pmatrix}\tfrac{\sqrt{2}}{2}\\ -\tfrac{\sqrt{2}}{2}\end{pmatrix}.$$
This is merely a system of equations Va = b:
$$\frac{\sqrt{2}}{2}\begin{pmatrix}1 & 1\\ 1 & -1\end{pmatrix}\begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix} = \begin{pmatrix}2\\ 3\end{pmatrix}$$
The 2 × 2 matrix V on the left-hand side has linearly independent columns
and thus has an inverse. In fact, V is an orthonormal matrix which means its
inverse is its transpose. Multiplying both sides of the equation by V−1 = VT
yields the solution
$$\mathbf{a} = \begin{pmatrix}\alpha_1\\ \alpha_2\end{pmatrix} = \mathbf{V}^T\mathbf{b} = \begin{pmatrix}\tfrac{5\sqrt{2}}{2}\\ -\tfrac{\sqrt{2}}{2}\end{pmatrix}$$
This result tells us that in order to reach the red point (formerly known as (2, 3) in our previous basis), we should travel 5√2/2 units along the direction of v1 and then −√2/2 units along the direction of v2 (note that v2 points toward the southeast corner and we want to move northwest, hence the coordinate is negative). Another way (a more mathematical way) to say this is that the length of the orthogonal projection of a onto the span of v1 is 5√2/2, and the length of the orthogonal projection of a onto the span of v2 is −√2/2. While it may seem that these are difficult distances to plot, they work out quite well if we examine our drawing in Figure 4.2, because the diagonal of each square is √2.
In the same fashion, we can re-write all 3 of the red points on our graph
in the new basis by solving the same system simultaneously for all the points.
Let B be a matrix containing the original coordinates of the points and let A be
a matrix containing the new coordinates:
$$\mathbf{B} = \begin{pmatrix}-4 & 2 & 5\\ -2 & 3 & 2\end{pmatrix} \qquad \mathbf{A} = \begin{pmatrix}\alpha_{11} & \alpha_{12} & \alpha_{13}\\ \alpha_{21} & \alpha_{22} & \alpha_{23}\end{pmatrix}$$
Then the new data coordinates on the rotated plane can be found by solving:
$$\mathbf{V}\mathbf{A} = \mathbf{B}$$
And thus
$$\mathbf{A} = \mathbf{V}^T\mathbf{B} = \frac{\sqrt{2}}{2}\begin{pmatrix}-6 & 5 & 7\\ -2 & -1 & 3\end{pmatrix}$$
Using our new basis vectors, our alternative view of the data is the one shown in Figure 4.3.
In the above example, we changed our basis from the original elementary
basis to a new orthogonal basis which provides a different view of the data. All
of this amounts to a rotation of the data around the origin. No real information
has been lost - the points maintain their distances from each other in nearly
every distance metric. Our new variables, v1 and v2 are linear combinations
of our original variables e1 and e2 , thus we can transform the data back to its
original coordinate system by again solving a linear system (in this example,
we’d simply multiply the new coordinates again by V).
In general, we can change bases using the procedure outlined in Theorem
4.0.1.
[Figure 4.3: Points plotted in the new basis, B2, with axes span(v1) and span(v2).]
Theorem 4.0.1: Changing Bases
Given a matrix of coordinates (in columns), A, in some basis, B1 =
{x1 , x2 , . . . , xn }, we can change the basis to B2 = {v1 , v2 , . . . , vn } with
the new set of coordinates in a matrix B by solving the system
XA = VB
where X and V are matrices containing (as columns) the basis vectors
from B1 and B2 respectively.
Note that when our original basis is the elementary basis, X = I, our
system reduces to
A = VB.
When our new basis vectors are orthonormal, the solution to this system
is simply
B = VT A.
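Here is a NumPy sketch of the worked example above (ours, not part of the original text). Following the example's notation, B holds the original elementary-basis coordinates of the three points and A holds the new coordinates; since the new basis is orthonormal, A = VᵀB:

```python
import numpy as np

# Orthonormal basis vectors v1, v2 from the example, as columns of V
V = (np.sqrt(2) / 2) * np.array([[1.0,  1.0],
                                 [1.0, -1.0]])

# Original (elementary-basis) coordinates of the three points, as columns
B = np.array([[-4.0, 2.0, 5.0],
              [-2.0, 3.0, 2.0]])

# Since V is orthogonal, solving VA = B is just A = V^T B
A = V.T @ B
print(A)                       # (sqrt(2)/2) * [[-6, 5, 7], [-2, -1, 3]]

# Changing back to the original basis recovers the points exactly
print(np.allclose(V @ A, B))   # True
```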
Definition 4.0.4: Basis Terminology
A basis for the vector space Rn can be any collection of n linearly
independent vectors in Rn ; n is said to be the dimension of the vector
space Rn . When the basis vectors are orthonormal (as they were in our
example), the collection is called an orthonormal basis.
The preceding discussion dealt entirely with bases for Rn (our example
was for points in R2 ). However, we will need to consider bases for subspaces of
Rn . Recall that the span of two linearly independent vectors in R3 is a plane.
This plane is a 2-dimensional subspace of R3 . Its dimension is 2 because 2
basis vectors are required to represent this space. However, not all points from
R3 can be written in this basis - only those points which exist on the plane.
In the next chapter, we will discuss how to proceed in a situation where the
point we’d like to represent does not actually belong to the subspace we are
interested in. This is the foundation for Least Squares.
Exercises
1. Show that the vectors v1 = (3, 1)ᵀ and v2 = (−2, 6)ᵀ are orthogonal. Create an orthonormal basis for R2 using these two direction vectors.
2. Consider a1 = (1, 1) and a2 = (0, 1) as coordinates for points in the
elementary basis. Write the coordinates of a1 and a2 in the orthonormal
basis found in exercise 1. Draw a picture which reflects the old and new
basis vectors.
3. Write the orthonormal basis vectors from exercise 1 as linear combinations
of the original elementary basis vectors.
4. What is the length of the orthogonal projection of a1 onto v1 ?
CHAPTER 5

LEAST SQUARES
The least squares problem arises in almost all areas where mathematics is
applied. Statistically, the idea is to find an approximate mathematical relationship between predictor and target variables such that the sum of squared
errors between the true value and the approximation is minimized. In two
dimensions, the goal would be to develop a line as depicted in Figure 5.1 such
that the sum of squared vertical distances (the residuals, in green) between the
true data (in red) and the mathematical prediction (in blue) is minimized.
[Figure 5.1: Least Squares Illustrated in 2 dimensions, showing the residual r1 between an observed point (x1, y1) and its fitted value (x1, ŷ1).]
If we let r be a vector containing the residual values (r1, r2, . . . , rn) then the sum of squared residuals can be written in linear algebraic notation as
$$\sum_{i=1}^{n} r_i^2 = \mathbf{r}^T\mathbf{r} = (\mathbf{y}-\hat{\mathbf{y}})^T(\mathbf{y}-\hat{\mathbf{y}}) = \|\mathbf{y}-\hat{\mathbf{y}}\|^2$$
Suppose we want to regress our target variable y on p predictor variables,
x1 , x2 , . . . , x p . If we have n observations, then the ideal situation would be to
find a vector of parameters β containing an intercept, β 0 along with p slope
parameters, β1, . . . , βp, such that
$$\underbrace{\begin{pmatrix}1 & x_{11} & x_{12} & \cdots & x_{1p}\\ 1 & x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{np}\end{pmatrix}}_{\mathbf{X}} \underbrace{\begin{pmatrix}\beta_0\\ \beta_1\\ \vdots\\ \beta_p\end{pmatrix}}_{\boldsymbol{\beta}} = \underbrace{\begin{pmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{pmatrix}}_{\mathbf{y}} \qquad (5.1)$$
where each row of X corresponds to one of the n observations and the columns after the leading column of ones contain the predictor variables x1, . . . , xp.
With many more observations than variables, this system of equations will
not, in practice, have a solution. Thus, our goal becomes finding a vector of
parameters β̂ such that X β̂ = ŷ comes as close to y as possible. Using the
design matrix, X, the least squares solution β̂ is the one for which
$$\|\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 = \|\mathbf{y} - \hat{\mathbf{y}}\|^2$$
is minimized. Theorem 5.0.2 characterizes the solution to the least squares
problem.
Theorem 5.0.2: Least Squares Problem and Solution
For an n × m matrix X and n × 1 vector y, let r = Xβ̂ − y. The least squares problem is to find a vector β̂ that minimizes the quantity
$$\sum_{i=1}^{n} r_i^2 = \|\mathbf{y}-\mathbf{X}\hat{\boldsymbol{\beta}}\|^2.$$
Any vector β̂ which provides a minimum value for this expression is called a least-squares solution.
• The set of all least squares solutions is precisely the set of solutions to the so-called normal equations,
$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}.$$
• There is a unique least squares solution if and only if rank(X) = m (i.e. linear independence of variables or no perfect multicollinearity!), in which case XᵀX is invertible and the solution is given by
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
Example 5.0.2: Solving a Least Squares Problem
In 2014, data was collected regarding the percentage of linear algebra exercises done by students and the grade they received on their examination. Based on this data, what is the expected effect of completing an additional 10% of the exercises on a student's exam grade?

ID   % of Exercises   Exam Grade
1         20              55
2        100             100
3         90             100
4         70              70
5         50              75
6         10              25
7         30              60
To find the least squares regression line, we want to solve the equation Xβ = y:
$$\begin{pmatrix}1 & 20\\ 1 & 100\\ 1 & 90\\ 1 & 70\\ 1 & 50\\ 1 & 10\\ 1 & 30\end{pmatrix}\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix} = \begin{pmatrix}55\\ 100\\ 100\\ 70\\ 75\\ 25\\ 60\end{pmatrix}$$
This system is obviously inconsistent. Thus, we want to find the least
squares solution β̂ by solving XT X β̂ = XT y:
$$\begin{pmatrix}7 & 370\\ 370 & 26900\end{pmatrix}\begin{pmatrix}\beta_0\\ \beta_1\end{pmatrix} = \begin{pmatrix}485\\ 30800\end{pmatrix}$$
Now, since multicollinearity was not a problem, we can simply find the
inverse of XT X and multiply it on both sides of the equation:
$$\begin{pmatrix}7 & 370\\ 370 & 26900\end{pmatrix}^{-1} = \begin{pmatrix}0.5233 & -0.0072\\ -0.0072 & 0.0001\end{pmatrix}$$
and so
$$\begin{pmatrix}\hat{\beta}_0\\ \hat{\beta}_1\end{pmatrix} = \begin{pmatrix}0.5233 & -0.0072\\ -0.0072 & 0.0001\end{pmatrix}\begin{pmatrix}485\\ 30800\end{pmatrix} = \begin{pmatrix}32.1109\\ 0.7033\end{pmatrix}$$
Thus, for each additional 10% of exercises completed, exam grades
are expected to increase by about 7 points. The data along with the
regression line
$$\text{grade} = 32.1109 + 0.7033\,(\text{percent exercises})$$
is shown below.
[Figure: scatterplot of the exercise/grade data with the fitted least squares line.]
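The whole calculation in Example 5.0.2 can be reproduced with a few lines of NumPy (a sketch; np.linalg.lstsq gives the same answer as solving the normal equations):

```python
import numpy as np

percent = np.array([20, 100, 90, 70, 50, 10, 30], dtype=float)
grade = np.array([55, 100, 100, 70, 75, 25, 60], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(percent), percent])

# Normal equations: (X^T X) beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ grade)
print(beta)                     # approximately [32.11, 0.7033]

# Same answer from the built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, grade, rcond=None)
print(np.allclose(beta, beta_lstsq))   # True

# Expected change in grade for an additional 10% of exercises completed
print(10 * beta[1])             # about 7 points
```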
Why the normal equations? The solution of the normal equations has a nice
geometrical interpretation. It involves the idea of orthogonal projection, a
concept which will be useful for understanding future topics.
In order for a system of equations, Ax = b to have a solution, b must be
a linear combination of columns of A. That is simply the definition of matrix
multiplication and equality. If A is m × n then
Ax = b =⇒ b = x1 A1 + x2 A2 + · · · + xn An .
As discussed in Chapter 3, another way to say this is that b is in the span of
the columns of A. The span of the columns of A is called the column space of
A. In Least-Squares applications, the problem is that b is not in the column
space of A. In essence, we want to find the vector b̂ that is closest to b but
exists in the column space of A. Then we know that Ax̂ = b̂ does have a unique
solution, and that the right hand side of the equation comes as close to the
original data as possible. By multiplying both sides of the original equation
by AT what we are really doing is projecting b orthogonally onto the column
space of A. We should think of the column space as a flat surface (perhaps
a plane) in space, and b as a point that exists off of that flat surface. There
are many ways to draw a line from a point to plane, but the shortest distance
would always be travelled perpendicular (orthogonal) to the plane. You may
recall from undergraduate calculus or physics that a normal vector to a plane is
a vector that is orthogonal to that plane. The normal equations, AT Ax = AT b,
help us find the closest point to b that belongs to the column space of A by
means of an orthogonal projection. This geometrical development is depicted
in Figure 5.2.
[Figure 5.2: The normal equations yield the vector b̂ = Ax̂ = A(AᵀA)⁻¹Aᵀb in the column space of A (the span of A1 and A2) which is closest to the original right-hand side vector b; the perpendicular segment from b down to b̂ has length ‖b − b̂‖ = ‖r‖.]
CHAPTER 6

EIGENVALUES AND EIGENVECTORS
Definition 6.0.5: Eigenvalues and Eigenvectors
For a square matrix An×n , a scalar λ is called an eigenvalue of A if
there is a nonzero vector x such that
Ax = λx.
Such a vector, x is called an eigenvector of A corresponding to the
eigenvalue λ. We sometimes refer to the pair (λ, x) as an eigenpair.
Eigenvalues and eigenvectors have numerous applications throughout mathematics, statistics and other fields. First, we must get a handle on the definition
which we will do through some examples.
Example 6.0.3: Eigenvalues and Eigenvectors
Determine whether $\mathbf{x} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$ is an eigenvector of $A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$ and, if so, find the corresponding eigenvalue.
To determine whether x is an eigenvector, we want to compute Ax and
observe whether the result is a multiple of x. If this is the case, then the
multiplication factor is the corresponding eigenvalue:
$$A\mathbf{x} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 \\ 4 \end{pmatrix} = 4\begin{pmatrix} 1 \\ 1 \end{pmatrix}$$
From this it follows that x is an eigenvector of A and the corresponding eigenvalue is λ = 4.
Is the vector $\mathbf{y} = \begin{pmatrix} 2 \\ 2 \end{pmatrix}$ an eigenvector?
$$A\mathbf{y} = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}\begin{pmatrix} 2 \\ 2 \end{pmatrix} = \begin{pmatrix} 8 \\ 8 \end{pmatrix} = 4\begin{pmatrix} 2 \\ 2 \end{pmatrix} = 4\mathbf{y}$$
Yes, it is, and it corresponds to the same eigenvalue, λ = 4.
Example 6.0.3 shows a very important property of eigenvalue-eigenvector
pairs. If (λ, x) is an eigenpair then any scalar multiple of x is also an eigenvector
corresponding to λ. To see this, let (λ, x) be an eigenpair for a matrix A (which
means that Ax = λx) and let y = αx be any scalar multiple of x. Then we have,
Ay = A(αx) = α(Ax) = α(λx) = λ(αx) = λy
which shows that y (or any scalar multiple of x) is also an eigenvector associated
with the eigenvalue λ.
Thus, for each eigenvalue we have infinitely many eigenvectors. In the
preceding example, the eigenvectors associated with λ = 4 will be scalar multiples of $\mathbf{x} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$. You may recall from Chapter 3 that the set of all scalar
multiples of x is denoted span(x). The span(x) in this example represents the
eigenspace of λ. Note: when using software to compute eigenvectors, it is standard
practice for the software to provide the normalized/unit eigenvector.
In some situations, an eigenvalue can have multiple eigenvectors which are
linearly independent. The number of linearly independent eigenvectors associated with an eigenvalue is called the geometric multiplicity of the eigenvalue.
Example 6.0.4 clarifies this concept.
Example 6.0.4: Geometric Multiplicity
Consider the matrix $A = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}$. It should be straightforward to see that $\mathbf{x}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$ and $\mathbf{x}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$ are both eigenvectors corresponding to the eigenvalue λ = 3. x1 and x2 are linearly independent, therefore the geometric multiplicity of λ = 3 is 2.
What happens if we take a linear combination of x1 and x2? Is that also an eigenvector? Consider $\mathbf{y} = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2\mathbf{x}_1 + 3\mathbf{x}_2$. Then
$$A\mathbf{y} = \begin{pmatrix} 3 & 0 \\ 0 & 3 \end{pmatrix}\begin{pmatrix} 2 \\ 3 \end{pmatrix} = \begin{pmatrix} 6 \\ 9 \end{pmatrix} = 3\begin{pmatrix} 2 \\ 3 \end{pmatrix} = 3\mathbf{y}$$
shows that y is also an eigenvector associated with λ = 3.
The eigenspace corresponding to λ = 3 is the set of all linear combinations of x1 and x2 , i.e. the span(x1 , x2 ).
We can generalize the result that we saw in Example 6.0.4 for any square
matrix and any geometric multiplicity. Let An×n have an eigenvalue λ with
geometric multiplicity k. This means there are k linearly independent eigenvectors, x1 , x2 , . . . , xk such that Axi = λxi for each eigenvector xi . Now if we let y
be a vector in the span(x1 , x2 , . . . , xk ) then y is some linear combination of the
xi ’s:
$$\mathbf{y} = \alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_k\mathbf{x}_k$$
Observe what happens when we multiply y by A:
$$\begin{aligned} A\mathbf{y} &= A(\alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_k\mathbf{x}_k)\\ &= \alpha_1(A\mathbf{x}_1) + \alpha_2(A\mathbf{x}_2) + \cdots + \alpha_k(A\mathbf{x}_k)\\ &= \alpha_1(\lambda\mathbf{x}_1) + \alpha_2(\lambda\mathbf{x}_2) + \cdots + \alpha_k(\lambda\mathbf{x}_k)\\ &= \lambda(\alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_k\mathbf{x}_k)\\ &= \lambda\mathbf{y} \end{aligned}$$
which shows that y (or any vector in the span(x1 , x2 , . . . , xk )) is an eigenvector
of A corresponding to λ.
This proof allows us to formally define the concept of an eigenspace.
Definition 6.0.6: Eigenspace
Let A be a square matrix and let λ be an eigenvalue of A. The set of all
eigenvectors corresponding to λ, together with the zero vector, is called
the eigenspace of λ. The number of basis vectors required to form the
eigenspace is called the geometric multiplicity of λ.
Now, let’s attempt the eigenvalue problem from the other side. Given an
eigenvalue, we will find the corresponding eigenspace in Example 6.0.5.
Example 6.0.5: Eigenvalues and Eigenvectors
Show that λ = 5 is an eigenvalue of $A = \begin{pmatrix} 1 & 2 \\ 4 & 3 \end{pmatrix}$ and determine the eigenspace of λ = 5.
Attempting the problem from this angle requires slightly more work.
We want to find a vector x such that Ax = 5x. Setting this up, we have:
Ax = 5x.
What we want to do is move both terms to one side and factor out the
vector x. In order to do this, we must use an identity matrix, otherwise
the equation wouldn’t make sense (we’d be subtracting a constant from
a matrix).
$$\begin{aligned} A\mathbf{x} - 5\mathbf{x} &= \mathbf{0}\\ (A - 5I)\mathbf{x} &= \mathbf{0}\\ \left(\begin{pmatrix} 1 & 2 \\ 4 & 3 \end{pmatrix} - \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}\right)\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix}\\ \begin{pmatrix} -4 & 2 \\ 4 & -2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \end{aligned}$$
Clearly, the matrix A − λI is singular (i.e. does not have linearly independent rows/columns). This will always be the case by the definition
Ax = λx, and is often used as an alternative definition.
In order to solve this homogeneous system of equations, we use Gaussian elimination:
$$\left(\begin{array}{cc|c} -4 & 2 & 0 \\ 4 & -2 & 0 \end{array}\right) \longrightarrow \left(\begin{array}{cc|c} 1 & -\frac{1}{2} & 0 \\ 0 & 0 & 0 \end{array}\right)$$
This implies that any vector x for which $x_1 - \frac{1}{2}x_2 = 0$ satisfies the eigenvector equation. We can pick any such vector, for example $\mathbf{x} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$, and say that the eigenspace of λ = 5 is
$$\mathrm{span}\left\{\begin{pmatrix} 1 \\ 2 \end{pmatrix}\right\}$$
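The same eigenvalues and eigenvectors can be found numerically. Below is a minimal SAS/IML sketch (as noted earlier, software returns unit-length eigenvectors, so the columns of V will be scalar multiples of (1, 2)ᵀ and (1, −1)ᵀ):

proc iml;
   A = {1 2, 4 3};
   lambda = eigval(A);              /* eigenvalues: 5 and -1               */
   V      = eigvec(A);              /* columns are the (unit) eigenvectors */
   resid  = A*V - V*diag(lambda);   /* ~0, verifying A*v = lambda*v        */
   print lambda V resid;
quit;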
If we didn’t know either an eigenvalue or eigenvector of A and instead
wanted to find both, we would first find eigenvalues by determining all possible
λ such that A − λI is singular and then find the associated eigenvectors. There
are some tricks which allow us to do this by hand for 2 × 2 and 3 × 3 matrices,
but beyond that the computation is not worth the effort by hand. Now that we
have a good understanding of how to interpret eigenvalues and eigenvectors
algebraically, let’s take a look at some of the things that they can do, starting
with one important fact.
Theorem 6.0.3: Eigenvalues and the Trace of a Matrix
Let A be an n × n matrix with eigenvalues λ1 , λ2 , . . . , λn . Then the sum
of the eigenvalues is equal to the trace of the matrix (recall that the trace
of a matrix is the sum of its diagonal elements).
$$\mathrm{Trace}(A) = \sum_{i=1}^{n} \lambda_i.$$
Example 6.0.6: Trace of Covariance Matrix
Suppose that we had a collection of n observations on p variables,
x1 , x2 , . . . , x p . After centering the data to have zero mean, we can compute the sample variances as:
$$\mathrm{var}(\mathbf{x}_i) = \frac{1}{n-1}\mathbf{x}_i^T\mathbf{x}_i = \frac{1}{n-1}\|\mathbf{x}_i\|^2$$
These variances form the diagonal elements of the sample covariance
matrix,
$$\Sigma = \frac{1}{n-1}X^TX$$
Thus, the total variance of this data is
$$\frac{1}{n-1}\sum_{i=1}^{p}\|\mathbf{x}_i\|^2 = \mathrm{Trace}(\Sigma) = \sum_{i=1}^{p}\lambda_i.$$
In other words, the sum of the eigenvalues of a covariance matrix
provides the total variance in the variables x1 , . . . , x p .
6.1 Diagonalization
Let's take another look at Example 6.0.5. We already showed that λ1 = 5 and $\mathbf{v}_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ is an eigenpair for the matrix $A = \begin{pmatrix} 1 & 2 \\ 4 & 3 \end{pmatrix}$. You may verify that λ2 = −1 and $\mathbf{v}_2 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$ is another eigenpair. Suppose we create a matrix of eigenvectors:
$$V = (\mathbf{v}_1, \mathbf{v}_2) = \begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}$$
and a diagonal matrix containing the corresponding eigenvalues:
$$D = \begin{pmatrix} 5 & 0 \\ 0 & -1 \end{pmatrix}$$
Then it is easy to verify that AV = VD:
$$AV = \begin{pmatrix} 1 & 2 \\ 4 & 3 \end{pmatrix}\begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix} = \begin{pmatrix} 5 & -1 \\ 10 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 2 & -1 \end{pmatrix}\begin{pmatrix} 5 & 0 \\ 0 & -1 \end{pmatrix} = VD$$
If the columns of V are linearly independent, which they are in this case, we
can write:
V−1 AV = D
What we have just done is develop a way to transform a matrix A into a
diagonal matrix D. This is known as diagonalization.
Definition 6.1.1: Diagonalizable
An n × n matrix A is said to be diagonalizable if there exists an invertible matrix P and a diagonal matrix D such that
P−1 AP = D
This is possible if and only if the matrix A has n linearly independent
eigenvectors (known as a complete set of eigenvectors). The matrix
P is then the matrix of eigenvectors and the matrix D contains the
corresponding eigenvalues on the diagonal.
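As a quick numerical check of the definition, the following minimal SAS/IML sketch diagonalizes the matrix from the running example using its eigenvectors as the columns of P:

proc iml;
   A = {1 2, 4 3};
   P = {1  1,
        2 -1};           /* columns are the eigenvectors v1 and v2 */
   D = inv(P)*A*P;       /* should equal diag(5, -1)               */
   print D;
quit;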
Determining whether or not a matrix An×n is diagonalizable is a little
tricky. Having rank(A) = n is not a sufficient condition for having n linearly
independent eigenvectors. The following matrix stands as a counter example:
$$A = \begin{pmatrix} -3 & 1 & -3 \\ 20 & 3 & 10 \\ 2 & -2 & 4 \end{pmatrix}$$
This matrix has full rank but only two linearly independent eigenvectors. Fortunately, for our primary application of diagonalization, we will be dealing
with a symmetric matrix, which can always be diagonalized. In fact, symmetric matrices have an additional property which makes this diagonalization
particularly nice, as we will see in Chapter 7.
6.2 Geometric Interpretation of Eigenvalues and Eigenvectors
Since any scalar multiple of an eigenvector is still an eigenvector, let’s consider
for the present discussion unit eigenvectors x of a square matrix A - those with length ‖x‖ = 1. By the definition, we know that
Ax = λx
We know that geometrically, if we multiply x by A, the resulting vector points
in the same direction as x. Geometrically, it turns out that multiplying the unit
circle or unit sphere by a matrix A carves out an ellipse, or an ellipsoid. We
can see eigenvectors visually by watching how multiplication by a matrix A
changes the unit vectors. Figure 6.1 illustrates this. The blue arrows represent
(a sampling of) the unit circle, all vectors x for which ‖x‖ = 1. The red
arrows represent the image of the blue arrows after multiplication by A, or
Ax for each vector x. We can see how almost every vector changes direction
when multiplied by A, except the eigenvector directions which are marked in
black. Such a picture provides a nice geometrical interpretation of eigenvectors
for a general matrix, but we will see in Chapter 7 just how powerful these
eigenvector directions are when we look at symmetric matrices.
Figure 6.1: Visualizing eigenvectors (in black) using the image (in red) of the
unit sphere (in blue) after multiplication by A.
Exercises
1. Show that v is an eigenvector of A and find the corresponding eigenvalue:
a. $A = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}$,  $\mathbf{v} = \begin{pmatrix} 3 \\ -3 \end{pmatrix}$
b. $A = \begin{pmatrix} -1 & 1 \\ 6 & 0 \end{pmatrix}$,  $\mathbf{v} = \begin{pmatrix} 1 \\ -2 \end{pmatrix}$
c. $A = \begin{pmatrix} 4 & -2 \\ 5 & -7 \end{pmatrix}$,  $\mathbf{v} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$
2. Show that λ is an eigenvalue of A and list two eigenvectors corresponding
to this eigenvalue:
a. $A = \begin{pmatrix} 0 & 4 \\ -1 & 5 \end{pmatrix}$,  λ = 4
b. $A = \begin{pmatrix} 0 & 4 \\ -1 & 5 \end{pmatrix}$,  λ = 1
3. Based on the eigenvectors you found in Exercise 2, can the matrix A be
diagonalized? Why or why not? If diagonalization is possible, explain
how it would be done.
CHAPTER 7
PRINCIPAL COMPONENTS ANALYSIS
We now have the tools necessary to discuss one of the most important concepts in mathematical statistics: Principal Components Analysis (PCA). PCA
involves the analysis of eigenvalues and eigenvectors of the covariance or
correlation matrix. Its development relies on the following important facts:
Theorem 7.0.1: Diagonalization of Symmetric Matrices
All n × n real valued symmetric matrices (like the covariance and correlation matrix) have two very important properties:
1. They have a complete set of n linearly independent eigenvectors,
{v1 , . . . , vn }, corresponding to eigenvalues
λ1 ≥ λ2 ≥ · · · ≥ λ n .
2. Furthermore, these eigenvectors can be chosen to be orthonormal
so that if V = [v1 | . . . |vn ] then
VT V = I
or equivalently, V−1 = VT .
Letting D be a diagonal matrix with Dii = λi , by the definition of
eigenvalues and eigenvectors we have for any symmetric matrix S,
SV = VD
Thus, any symmetric matrix S can be diagonalized in the following
way:
VT SV = D
Covariance and Correlation matrices (when there is no perfect multicollinearity in variables) have the additional property that all of their
eigenvalues are positive (nonzero). They are positive definite matrices.
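These properties are easy to check numerically for any small symmetric matrix. Below is a minimal SAS/IML sketch (the matrix used here is the covariance matrix that will appear in Example 7.0.1):

proc iml;
   S = {5.6 4.8,
        4.8 6.0111};             /* a small symmetric matrix               */
   call eigen(lambda, V, S);     /* eigenvalues (descending order) and an
                                    orthonormal set of eigenvectors        */
   VtV = t(V)*V;                 /* identity, so V is orthogonal           */
   D   = t(V)*S*V;               /* diagonal matrix of the eigenvalues     */
   print lambda V VtV D;
quit;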
Now that we know we have a complete set of eigenvectors, it is common
to order them according to the magnitude of their corresponding eigenvalues.
From here on out, we will use (λ1 , v1 ) to represent the largest eigenvalue of a
matrix and its corresponding eigenvector. When working with a covariance or
correlation matrix, this eigenvector associated with the largest eigenvalue is
called the first principal component and points in the direction for which the
variance of the data is maximal. Example 7.0.1 illustrates this point.
Example 7.0.1: Eigenvectors of the Covariance Matrix
Suppose we have a matrix of data for 10 individuals on 2 variables, x1
and x2 . Plotted on a plane, the data appears as follows:
[Scatterplot of the raw data in the (x1, x2) plane.]
Our data matrix for these points is:
$$X = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 2 & 4 \\ 3 & 1 \\ 4 & 4 \\ 5 & 2 \\ 6 & 4 \\ 6 & 6 \\ 7 & 6 \\ 8 & 8 \end{pmatrix}$$
the means of the variables in X are:
$$\bar{\mathbf{x}} = \begin{pmatrix} 4.4 \\ 3.7 \end{pmatrix}.$$
When thinking about variance directions, our first step should be to
center the data so that it has mean zero. Eigenvectors measure the
spread of data around the origin. Variance measures spread of data
around the mean. Thus, we need to equate the mean with the origin.
To center the data, we simply compute
$$X_c = X - \mathbf{e}\bar{\mathbf{x}}^T = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 2 & 4 \\ 3 & 1 \\ 4 & 4 \\ 5 & 2 \\ 6 & 4 \\ 6 & 6 \\ 7 & 6 \\ 8 & 8 \end{pmatrix} - \begin{pmatrix} 4.4 & 3.7 \\ 4.4 & 3.7 \\ \vdots & \vdots \\ 4.4 & 3.7 \end{pmatrix} = \begin{pmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{pmatrix}.$$
Examining the new centered data, we find that we’ve only translated
our data in the plane - we haven’t distorted it in any fashion.
Thus the covariance matrix is:
$$\Sigma = \frac{1}{9}X_c^TX_c = \begin{pmatrix} 5.6 & 4.8 \\ 4.8 & 6.0111 \end{pmatrix}$$
The eigenvalue and eigenvector pairs of Σ are (rounded to 2 decimal
places) as follows:
$$(\lambda_1, \mathbf{v}_1) = \left(10.6100,\ \begin{pmatrix} 0.69 \\ 0.72 \end{pmatrix}\right) \quad\text{and}\quad (\lambda_2, \mathbf{v}_2) = \left(1.0012,\ \begin{pmatrix} -0.72 \\ 0.69 \end{pmatrix}\right)$$
Let’s plot the eigenvector directions on the same graph:
[Scatterplot of the centered data with the eigenvector directions v1 and v2 drawn from the origin.]
The eigenvector v1 is called the first principal component. It is the direction along which the variance of the data is maximal. The eigenvector
v2 is the second principal component. In general, the second principal
component is the direction, orthogonal to the first, along which the
variance of the data is maximal (in two dimensions, there is only one
direction possible.)
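The eigenvalues and eigenvectors above can be reproduced with a few lines of SAS/IML (a minimal sketch; PROC PRINCOMP on the raw data would give the same directions, up to sign):

proc iml;
   X  = {1 1, 2 1, 2 4, 3 1, 4 4, 5 2, 6 4, 6 6, 7 6, 8 8};
   n  = nrow(X);
   Xc = X - j(n,1,1)*X[:,];        /* center each variable at its mean    */
   Sigma = t(Xc)*Xc / (n-1);       /* sample covariance matrix            */
   call eigen(lambda, V, Sigma);   /* eigenvalues and PC directions       */
   print Sigma lambda V;           /* eigenvectors are unique up to sign  */
quit;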
Why is this important? Let’s consider what we’ve just done. We started
with two variables, x1 and x2 , which appeared to be correlated. We then
derived new variables, v1 and v2 , which are linear combinations of the original
variables:
$$\begin{aligned} \mathbf{v}_1 &= 0.69\mathbf{x}_1 + 0.72\mathbf{x}_2 \qquad & (7.1)\\ \mathbf{v}_2 &= -0.72\mathbf{x}_1 + 0.69\mathbf{x}_2 \qquad & (7.2) \end{aligned}$$
These new variables are completely uncorrelated. To see this, let’s represent
our data according to the new variables - i.e. let’s change the basis from
B1 = [x1 , x2 ] to B2 = [v1 , v2 ].
Example 7.0.2: The Principal Component Basis
Let’s express our data in the basis defined by the principal components.
We want to find coordinates (in a 10 × 2 matrix A) such that our original
(centered) data can be expressed in terms of principal components. This
is done by solving for A in the following equation (see Chapter 4 and
note that the rows of X define the points rather than the columns):
$$\underbrace{\begin{pmatrix} -3.4 & -2.7 \\ -2.4 & -2.7 \\ -2.4 & 0.3 \\ -1.4 & -2.7 \\ -0.4 & 0.3 \\ 0.6 & -1.7 \\ 1.6 & 0.3 \\ 1.6 & 2.3 \\ 2.6 & 2.3 \\ 3.6 & 4.3 \end{pmatrix}}_{X_c} = \underbrace{\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \\ \vdots & \vdots \\ a_{10,1} & a_{10,2} \end{pmatrix}}_{A} \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \end{pmatrix} \qquad (7.3)$$
Conveniently, our new basis is orthonormal, meaning that V is an orthogonal matrix, so
$$A = X_cV.$$
The new data coordinates reflect a simple rotation of the data around
the origin:
[Scatterplot of the scores: the same data expressed in the (v1, v2) coordinates.]
Visually, we can see that the new variables are uncorrelated. You may
wish to confirm this by calculating the covariance. In fact, we can do
this in a general sense. If A = Xc V is our new data, then the covariance
matrix is diagonal:
$$\begin{aligned} \Sigma_A &= \frac{1}{n-1}A^TA\\ &= \frac{1}{n-1}(X_cV)^T(X_cV)\\ &= \frac{1}{n-1}V^T(X_c^TX_c)V\\ &= \frac{1}{n-1}V^T((n-1)\Sigma_X)V\\ &= V^T\Sigma_XV\\ &= V^T(VDV^T)V\\ &= D \end{aligned}$$
where Σ_X = VDVᵀ comes from the diagonalization in Theorem 7.0.1.
By changing our variables to principal components, we have managed to “hide” the correlation between x1 and x2 while keeping the spatial relationships between data points intact. Transformation back to
variables x1 and x2 is easily done by using the linear relationships in
Equations 7.1 and 7.2.
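Continuing the numerical sketch from Example 7.0.1, the scores and their covariance matrix can be computed directly (a minimal SAS/IML sketch):

proc iml;
   X  = {1 1, 2 1, 2 4, 3 1, 4 4, 5 2, 6 4, 6 6, 7 6, 8 8};
   n  = nrow(X);
   Xc = X - j(n,1,1)*X[:,];
   call eigen(lambda, V, t(Xc)*Xc/(n-1));

   A      = Xc*V;               /* scores: the data in the PC basis      */
   SigmaA = t(A)*A / (n-1);     /* covariance of the scores              */
   print A SigmaA;              /* SigmaA is diag(lambda1, lambda2)      */
quit;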
7.1 Comparison with Least Squares
In least squares regression, our objective is to maximize the amount of variance
explained in our target variable. It may look as though the first principal
component from Example 7.0.1 points in the direction of the regression line.
This is not the case however. The first principal component points in the
direction of a line which minimizes the sum of squared orthogonal distances
between the points and the line. Regressing x2 on x1 , on the other hand,
provides a line which minimizes the sum of squared vertical distances between
points and the line. This is illustrated in Figure 7.1.
Figure 7.1: Principal Components vs. Regression Lines
The first principal component about the mean of a set of points can be
represented by that line which most closely approaches the data points. In
contrast, linear least squares tries to minimize the distance in the y direction
only. Thus, although the two use a similar error metric, linear least squares
is a method that treats one dimension of the data preferentially, while PCA
treats all dimensions equally.
7.2 Covariance or Correlation Matrix?
Principal components analysis can involve eigenvectors of either the covariance
matrix or the correlation matrix. When we perform this analysis on the
covariance matrix, the geometric interpretation is simply centering the data
and then determining the direction of maximal variance. When we perform
this analysis on the correlation matrix, the interpretation is standardizing the
data and then determining the direction of maximal variance. The correlation
matrix is simply a scaled form of the covariance matrix. In general, these two
methods give different results, especially when the scales of the variables are
different.
The covariance matrix is the default for R. The correlation matrix is the
default in SAS. The covariance matrix method is invoked by the option:
proc princomp data=X cov;
var x1--x10;
run;
Choosing between the covariance and correlation matrix can sometimes
pose problems. The rule of thumb is that the correlation matrix should be used when the scales of the variables vary greatly; otherwise, the variables with the highest variance will dominate the first principal component. The argument
against automatically using correlation matrices is that it is quite a brutal way
of standardizing your data.
7.3 Applications of Principal Components
Principal components have a number of applications across many areas of
statistics. In the next sections, we will explore their usefulness in the context of
dimension reduction. In Chapter 9 we will look at how PCA is used to solve
the issue of multicollinearity in biased regression.
7.3.1 PCA for dimension reduction
It is quite common for an analyst to have too many variables. There are two
different solutions to this problem:
1. Feature Selection: Choose a subset of existing variables to be used in a
model.
2. Feature Extraction: Create a new set of features which are combinations
of original variables.
Feature Selection
Let’s think for a minute about feature selection. What are we really doing when
we consider a subset of our existing variables? Take the two dimensional data
in Example 7.0.2 (while two-dimensions rarely necessitate dimension reduction,
the geometrical interpretation extends to higher dimensions as usual!). The
centered data appears as follows:
[Scatterplot of the centered data in the (x1, x2) plane.]
Now say we perform some kind of feature selection (there are a number of
ways to do this, chi-square tests for instance) and we determine that the variable x2 is more important than x1. So we throw out x1 and we've reduced
the dimensions from p = 2 to k = 1. Geometrically, what does our new data
look like? By dropping x1 we set all of those horizontal coordinates to zero. In
other words, we project the data orthogonally onto the x2 axis:
Figure 7.2: Geometrical Interpretation of Feature Selection. (a) Projecting Data Orthogonally; (b) New One-Dimensional Data.
Now, how much information (variance) did we lose with this projection?
The total variance in the original data is
$$\|\mathbf{x}_1\|^2 + \|\mathbf{x}_2\|^2.$$
The variance of our data reduction is
$$\|\mathbf{x}_2\|^2.$$
Thus, the proportion of the total information (variance) we’ve kept is
$$\frac{\|\mathbf{x}_2\|^2}{\|\mathbf{x}_1\|^2 + \|\mathbf{x}_2\|^2} = \frac{6.01}{5.6 + 6.01} = 51.7\%.$$
Our reduced dimensional data contains only 51.7% of the variance of the
original data. We’ve lost a lot of information!
The fact that feature selection omits variance in our predictor variables
does not make it a bad thing! Obviously, getting rid of variables which have
no relationship to a target variable (in the case of supervised modeling like
prediction and classification) is a good thing. But, in the case of unsupervised
learning techniques, where there is no target variable involved, we must be
extra careful when it comes to feature selection. In summary,
• Feature Selection is important. Examples include:
– Removing variables which have little to no impact on a target variable in supervised modeling (forward/backward/stepwise selection).
– Removing variables which have obvious strong correlation with
other predictors.
– Removing variables that are not interesting in unsupervised learning
(For example, you may not want to use the words “th” and “of”
when clustering text).
• Feature Selection is an orthogonal projection of the original data onto the
span of the variables you choose to keep.
• Feature selection should always be done with care and justification.
– In regression, could create problems of endogeneity (errors correlated with predictors - omitted variable bias).
– For unsupervised modelling, could lose important information.
Feature Extraction
PCA is the most common form of feature extraction. The rotation of the space
shown in Example 7.0.2 represents the creation of new features which are
linear combinations of the original features. If we have p potential variables
for a model and want to reduce that number to k, then the first k principal
components combine the individual variables in such a way that is guaranteed
to capture as much “information” (variance) as possible. Again, take our
two-dimensional data as an example. When we reduce our data down to one dimension using principal components, we essentially do the same orthogonal
projection that we did in Feature Selection, only in this case we conduct that
projection in the new basis of principal components. Recall that for this data,
our first principal component v1 was
$$\mathbf{v}_1 = \begin{pmatrix} 0.69 \\ 0.72 \end{pmatrix}.$$
Projecting the data onto the first principal component is illustrated in Figure 7.3.

Figure 7.3: Illustration of Feature Extraction via PCA. (a) Projecting Data Orthogonally; (b) New One-Dimensional Data.

How much variance do we keep with k principal components? The proportion of variance explained by each principal component is the ratio of the corresponding eigenvalue to the sum of the eigenvalues (which gives the total amount of variance in the data).
Theorem 7.3.1: Proportion of Variance Explained
The proportion of variance explained by the projection of the data onto
principal component vi is
$$\frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j}.$$
Similarly, the proportion of variance explained by the projection of the data onto the first k principal components (k < p) is
$$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j}$$
In our simple two-dimensional example we were able to keep
$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{10.61}{10.61 + 1.00} = 91.38\%$$
of our variance in one dimension.
CHAPTER 8
SINGULAR VALUE DECOMPOSITION (SVD)
The Singular Value Decomposition (SVD) is one of the most important concepts
in applied mathematics. It is used for a number of applications including
dimension reduction and data analysis. Principal Components Analysis (PCA)
is a special case of the SVD. Let’s start with the formal definition, and then see
how PCA relates to that definition.
Definition 8.0.1: Singular Value Decomposition
For any m × n matrix A with rank (A) = r, there are orthogonal matrices
Um×m and Vn×n and a diagonal matrix Dr×r = diag(σ1 , σ2 , . . . , σr ) such
that
$$A = U\underbrace{\begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}}_{m\times n}V^T \quad\text{with}\quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r \ge 0 \qquad (8.1)$$
The σi ’s are called the nonzero singular values of A. (When
r < p = min{m, n} (i.e. when A is not full-rank), A is said to have
an additional p − r zero singular values). This factorization is called
a singular value decomposition of A, and the columns of U and
V are called the left- and right-hand singular vectors for A, respectively.
Properties of the SVD
• The left-hand singular vectors are a set of orthonormal eigenvectors for AAT .
• The right-hand singular vectors are a set of orthonormal eigenvectors for AT A.
• The singular values are the square roots of the eigenvalues for
AT A and AAT , as these matrices have the same eigenvalues.
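These relationships are easy to verify numerically. Below is a minimal SAS/IML sketch using the built-in SVD call on an arbitrary small matrix (chosen only for illustration):

proc iml;
   A = {4 0, 3 -5, 0 2};                  /* an arbitrary 3 x 2 example matrix   */
   call svd(U, q, V, A);                  /* A = U*diag(q)*V`                    */
   sv_from_eig = sqrt(eigval(t(A)*A));    /* square roots of the eigenvalues of
                                             A`A equal the singular values in q  */
   print q sv_from_eig U V;
quit;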
When we studied PCA, one of the goals was to find the new coordinates, or
scores, of the data in the principal components basis. If our original (centered
or standardized) data was contained in the matrix X and the eigenvectors of
the covariance/correlation matrix (XT X) were columns of a matrix V, then to
find the scores (call these S) of the observations on the eigenvectors we used
the following equation:
X = SVT .
This equation mimics Equation 8.1 because the matrix VT in Equation 8.1 is
also a matrix of eigenvectors for AT A. This means that the principal component
scores S are a set of unit eigenvectors for AAT scaled by the singular values in
D:
D 0
S=U
.
0 0
8.1 Resolving a Matrix into Components
One of the primary goals of the singular value decomposition is to resolve
the data in A into r mutually orthogonal components by writing the matrix
factorization as a sum of outer products using the corresponding columns of
U and rows of VT :
$$A = U\begin{pmatrix} D & 0 \\ 0 & 0 \end{pmatrix}V^T = \begin{pmatrix} \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \end{pmatrix} \begin{pmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{pmatrix} \begin{pmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_n^T \end{pmatrix} = \sigma_1\mathbf{u}_1\mathbf{v}_1^T + \sigma_2\mathbf{u}_2\mathbf{v}_2^T + \cdots + \sigma_r\mathbf{u}_r\mathbf{v}_r^T,$$
with σ1 ≥ σ2 ≥ · · · ≥ σr.
For simplicity, let Zi = ui viT act as basis matrices for this expansion, so we have
$$A = \sum_{i=1}^{r}\sigma_iZ_i. \qquad (8.2)$$
This representation can be regarded as a Fourier expansion. The coefficient
(singular value) σi can be interpreted as the proportion of A lying in the
“direction" of Zi . When σi is small, omitting that term from the expansion will
cause only a small amount of the information in A to be lost. This fact has
important consequences for compression and noise reduction.
8.1.1 Data Compression
We’ve already seen how PCA can be used to reduce the dimensions of our
data while keeping the most amount of variance. The way this is done is by
simply ignoring those components for which the proportion of variance is
small. Supposing we keep k principal components, this amounts to truncating
the sum in Equation 8.2 after k terms:
$$A \approx \sum_{i=1}^{k}\sigma_iZ_i. \qquad (8.3)$$
As it turns out, this truncation has important consequences in many applications. One example is that of image compression. An image is simply an array
of pixels. Supposing the image size is m pixels tall by n pixels wide, we can
capture this information in an m × n matrix if the image is in grayscale, or an
m × 3n matrix for a [r,g,b] color image (we’d need 3 values for each pixel to
recreate the pixel’s color). These matrices can get very large (a 6 megapixel
photo is 6 million pixels).
Rather than store the entire matrix, we can store an approximation to the
matrix using only a few (well, more than a few) singular values and singular
vectors.
This is the basis of image compression. An approximated photo will not be
as crisp as the original - some information will be lost - but most of the time
we can store much less than the original matrix and still get a good depiction
of the image.
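The following minimal SAS/IML sketch illustrates the idea on a tiny matrix standing in for an image (a real grayscale image would simply be a much larger matrix of pixel intensities; the values below are made up):

proc iml;
   A = {9 8 7 6,
        8 7 6 5,
        7 6 5 4,
        1 0 2 9};                             /* stand-in for an image matrix */
   call svd(U, q, V, A);

   k   = 2;                                   /* number of components kept    */
   Ak  = U[,1:k]*diag(q[1:k])*t(V[,1:k]);     /* truncated expansion (8.3)    */
   err = sqrt(ssq(A - Ak));                   /* Frobenius norm of the error  */
   print q Ak err;
quit;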
8.1.2 Noise Reduction
Many applications arise where the relevant information contained in a matrix
is contaminated by a certain level of noise. This is particularly common with
video and audio signals, but also arises in text data and other types of (usually
high dimensional) data. The truncated SVD (Equation 8.3) can actually reduce
the amount of noise in data and increase the overall signal-to-noise ratio under
certain conditions.
Let’s suppose, for instance, that our matrix Am×n contains data which is
contaminated by noise. If that noise is assumed to be random (or nondirectional) in the sense that the noise is distributed more or less uniformly across
the components Zi , then there is just as much noise “in the direction” of one Zi
as there is in the other. If the amount of noise along each direction is approximately the same, and the σi ’s tell us how much (relevant) information in A
is directed along each component Zi , then it must be that the ratio of “signal”
(relevant information) to noise is decreasing across the ordered components,
since
σ1 ≥ σ2 ≥ · · · ≥ σr
implies that the signal is greater in earlier components. So letting SNR(σi Zi )
denote the signal-to-noise ratio of each component, we have
SNR(σ1 Z1 ) ≥ SNR(σ2 Z2 ) ≥ · · · ≥ SNR(σr Zr )
This explains why the truncated SVD,
$$A \approx \sum_{i=1}^{k}\sigma_iZ_i \quad\text{where}\quad k < r$$
can, in many scenarios, filter out some of the noise without losing much of the
significant information in A.
8.1.3 Latent Semantic Indexing
Text mining is another area where the SVD is used heavily. In text mining, our
data structure is generally known as a Term-Document Matrix. The documents
are any individual pieces of text that we wish to analyze, cluster, summarize or
discover topics from. They could be sentences, abstracts, webpages, or social
media updates. The terms are the words contained in these documents. The
term-document matrix represents what's called the “bag-of-words” approach: the order of the words is removed and the data becomes unstructured in the
sense that each document is represented by the words it contains, not the order
or context in which they appear. The (i, j) entry in this matrix is the number of
times term j appears in document i.
Definition 8.1.1: Term-Document Matrix
Let m be the number of documents in a collection and n be the number
of terms appearing in that collection, then we create our term-document
matrix A as follows:
$$A_{m\times n} = \begin{pmatrix} f_{11} & \cdots & f_{1j} & \cdots & f_{1n} \\ \vdots & & \vdots & & \vdots \\ f_{i1} & \cdots & f_{ij} & \cdots & f_{in} \\ \vdots & & \vdots & & \vdots \\ f_{m1} & \cdots & f_{mj} & \cdots & f_{mn} \end{pmatrix}$$
where the rows correspond to Doc 1, . . . , Doc m, the columns correspond to term 1, . . . , term n, and f_ij is the frequency of term j in document i. A binary term-document matrix will simply have A_ij = 1 if term j is contained in document i.
Term-document matrices tend to be large and sparse. Term-weighting
schemes are often used to downplay the effect of commonly used words and
bolster the effect of rare but semantically important words. The most popular
weighting method is known as “Term Frequency-Inverse Document Frequency”
(TF-IDF). For this method, the raw term-frequencies f ij in the matrix A are
multiplied by global weights (inverse document frequencies), w j , for each term.
These weights reflect the commonality of each term across the entire collection.
The inverse document frequency of term j is:
$$w_j = \log\left(\frac{\text{total \# of documents}}{\text{\# of documents containing term } j}\right)$$
To put this weight in perspective for a collection of n = 10, 000 documents
we have 0 ≤ w j ≤ 9.2, where w j = 0 means the word is contained in every
document (i.e. it’s not important semantically) and w j = 9.2 means the word
is contained in only 1 document (i.e. it’s quite important). The document
vectors are often normalized to have unit 2-norm, since their directions (not
their lengths) in the term-space is what characterizes them semantically.
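As a minimal sketch of this weighting scheme in SAS/IML (the tiny term-document matrix below is made up purely for illustration):

proc iml;
   /* rows = documents, columns = terms; entries are raw frequencies f_ij */
   A = {2 0 1 0,
        1 1 0 0,
        0 3 0 1,
        1 0 0 2};
   m = nrow(A);

   docfreq = (A > 0)[+, ];          /* number of documents containing each term */
   w  = log(m / docfreq);           /* inverse document frequencies              */
   Aw = A # (j(m,1,1)*w);           /* scale each term column by its weight      */
   print w, Aw;
quit;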
The noise-reduction property of the SVD was extended to text processing in
1990 by Susan Dumais et al, who named the effect Latent Semantic Indexing (LSI).
LSI involves the singular value decomposition of the term-document matrix
defined in Definition 8.1.1. In other words, it is like a principal components
analysis using the unscaled, uncentered inner-product matrix AT A. If the
documents are normalized to have unit length, this is a matrix of cosine
similarities (see Chapter 2). In text-mining, the cosine similarity is the most
common measure of similarity between documents. If the term-document
matrix is binary, this is often called the co-occurrence matrix because each
entry gives the number of times two words occur in the same document.
It certainly seems logical to view text data in this context as it contains
both an informative signal and semantic noise. LSI quickly grew roots in the
information retrieval community, where it is often used for query processing. The idea is to remove semantic noise, due to variation and ambiguity
in vocabulary and presentation style, without losing significant amounts of
information. For example, a human may not differentiate between the words
“car” and “automobile”, but indeed the words will become two separate entities
in the raw term-document matrix. The main idea in LSI is that the realignment
of the data into fewer directions should force related documents (like those
containing “car” and “automobile”) closer together in an angular sense, thus
revealing latent semantic connections.
Purveyors of LSI suggest that the use of the Singular Value Decomposition to
project the documents into a lower-dimensional space results in a representation
which reflects the major associative patterns of the data while ignoring less
important influences. This projection is done with the simple truncation of the
SVD shown in Equation 8.3.
As we have seen with other types of data, the very nature of dimension
reduction makes possible for two documents with similar semantic properties
to be mapped closer together. Unfortunately, the mixture of signs (positive
and negative) in the singular vectors (think principal components) makes
the decomposition difficult to interpret. While the major claims of LSI are
legitimate, this lack of interpretability is still conceptually problematic for
some folks. In order to make this point as clear as possible, consider the
original “term basis” representation for the data, where each document (from
a collection containing m total terms in the dictionary) could be written as:
$$A_j = \sum_{i=1}^{m}f_{ij}\mathbf{e}_i$$
where f ij is the frequency of term i in the document, and ei is the ith column of
the m × m identity matrix. The truncated SVD gives us a new set of coordinates
(scores) and basis vectors (principal component features):
r
Aj ≈
∑ α i ui
i =1
but the features ui live in the term space, and thus ought to be interpretable
as a linear combination of the original “term basis.” However the linear
combination, having both positive and negative coefficients, is semantically
meaningless in context: these new features cannot, generally, be thought of as
meaningful topics.
CHAPTER 9
ADVANCED REGRESSION TECHNIQUES
9.1 Biased Regression
When severe multicollinearity occurs between our predictor variables, least
squares estimates are still unbiased, but their variances are large so they may
be far from the true value. Biased regression techniques intentionally bias
the estimation of the regression coefficients. By adding a degree of bias to
the estimates, we can reduce the standard errors (increase the precision). It is
hoped that the net effect will be more reliable parameter estimates.
The precision is generally measured by the mean-squared error of our
estimate,
$$MSE(\hat{\beta}) = [\mathrm{Bias}(\hat{\beta})]^2 + \mathrm{Var}(\hat{\beta}).$$
For ordinary least squares regression, the bias is zero. In biased regression techniques, we'll allow for some bias in order to minimize the
variance of our estimate.
Ideally, the criteria for deciding when biased regression techniques are better than OLS would depend on the true values of the parameters, which are unknown (we cannot even estimate the bias in our parameter estimates). Since this is not possible, there is no completely objective way to decide. Principal Components Regression (PCR) and Ridge Regression are two such techniques. Ridge regression
tends to be the more popular of the two methods, but PCR is a little more
straightforward.
9.1.1 Principal Components Regression (PCR)
As we saw in Chapter 7, every linear regression model can be restated in terms
of a new set of orthogonal predictor variables that are a linear combination of
the original variables - principal components. Let x1 , x2 , . . . x p be our predictor
variables. Then the principal components (PCs) are just linear combinations
of these predictor variables with coefficients from the rows of the eigenvector
matrix:
PCj = v1j x1 + v2j x2 + · · · + v pj x p
The variance-covariance matrix of the principal components is diagonal
(diag(λ1 , . . . , λ p )) because the principal components are orthogonal. If λ j = 0
then the corresponding PC has no variance (i.e. constant). This reveals linear
structure in variables. For example, suppose one of our principal components
is
PC2 = −0.5x1 + 2x2 with corresponding eigenvalue λ2 = 0
This means that when we compute 2x2 − 0.5x1 from the original data, the result
has zero variability. It is constant for every observation. Thus, it must be that
for all observations x2 is completely determined by x1 and vice-versa. The two
variables are perfectly correlated. When λ j is nearly zero, we are very close to
the same situation, which violates the assumptions of our regression model.
Let’s look at an applied example.
Example: French Economy
We are going to examine data from the French economy reported by Malinvaud
(1968).
• The Variables:
1. Imports (Target)
2. Domestic Production
3. Stock Formation
4. Domestic Consumption
• All measured in billions of French francs between 1949 and 1966.
Let's try to run a simple linear regression to predict Imports using the 3 predictor variables above. We are assuming there is some underlying need to understand the relationship of all three variables to Imports - we do not want to drop any variables from the analysis. When we run the regression, we should always pay attention to the Variance Inflation Factors (VIFs) to see if any multicollinearity is affecting the variability in our parameter estimates.
proc reg data=advanced.french;
   model Import = DoProd Stock Consum / vif;
run;
quit;

proc princomp data=advanced.french out=frenchPC;
   var DoProd Stock Consum;
run;
The VIF output from the regression model clearly indicates strong multicollinearity. The principal component output in Figure 9.1 makes it clear that
the difference between two of our variables is essentially constant or has no
variability, illuminating the exact source of that multicollinearity.
Figure 9.1: SAS Output
Domestic Consumption is essentially equal to Domestic Production. This
is something that matches with realistic expectations. Now there is a “new”
set of variables (the PCs) that are orthogonal to each other. Does this new
set of variables eliminate multicollinearity concerns? No! In the first model
listed in the next block of code (PC Model 1), we have not really changed
anything! We’ve just rotated our data. Using all 3 principal components we are
not incorporating bias into the model or removing the multicollinearity - we
are just hiding it! It isn’t until we drop some of the PCs (the second model) that
we are able to introduce bias and eliminate the underlying multicollinearity.
/* First we must standardize our dependent variable. */
/* Be aware of covariance vs. correlation PCA!       */
/* What would be the difference?                     */
proc standard data=frenchPC mean=0 std=1 out=frenchPC2;
   var Import;
run;

proc reg data=frenchPC2;
   PCModel1: model Import = Prin1 Prin2 Prin3 / vif;
   PCModel2: model Import = Prin1 Prin2 / vif;
run;
quit;
In order to compute meaningful coefficients we have to do some algebra
and take into account the standard deviations of our variables (because both the
independent variables and the dependent variables were centered and scaled
when forming the principal components - there is a difference here if you use
the covariance matrix, so understand and be careful!):
$$\begin{aligned} Y &= \alpha_1PC_1 + \alpha_2PC_2 + \cdots + \alpha_pPC_p + e\\ PC_j &= v_{1j}x_1 + v_{2j}x_2 + \cdots + v_{pj}x_p\\ Y &= \beta_0 + \beta_1x_1 + \cdots + \beta_px_p + e \end{aligned}$$
where
$$\beta_j = \frac{s_y}{s_{x_j}}\left(v_{j1}\alpha_1 + v_{j2}\alpha_2 + \cdots + v_{jp}\alpha_p\right), \qquad \beta_0 = \bar{Y} - \beta_1\bar{x}_1 - \cdots - \beta_p\bar{x}_p
$$
SAS can actually do this in PROC PLS (Partial Least Squares) as demonstrated in the next block of code. The caveat is that this procedure can only
drop the later PCs keeping the first nfac=n components. Usually this is in
fact what you want to accomplish unless you have a principal component that
is being driven by some variable that is not significant in your model and
you wish to drop that component but keep others after it. In such cases, the
coefficients will have to be computed by hand.
proc pls data=advanced.french method=pcr nfac=2;
   model Import = DoProd Stock Consum / solution;
run;
quit;
PCR - Cautions
PCR may not always work, in the sense that it may have trouble explaining
variability in the response variable. You should never blindly drop PCs - you
should always be using the justifications set forth above. Outliers / influential
observations can severely distort the principal components because they alter
the variance-covariance matrix - you should be aware of this fact and always
examine your principal components.
9.1.2 Ridge Regression
Ridge regression is a biased regression technique to use in the presence of
multicollinearity. It produces estimates that tend to have lower MSE (but higher bias) than OLS estimates. It works with standardized values for each of the
variables in the model (similar to PCR):
Ỹ = θ1 x̃1 + θ2 x̃2 + · · · + θ p x̃ p
where Ỹ, x̃ represent the standardized values. Recall that solving for OLS
estimates involves the normal equations:
$$Z^TZ\hat{\theta} = Z^TY$$
Rearranging the normal equations leads to the following way to solve for the OLS estimates:
$$\begin{aligned} \theta_1 + r_{12}\theta_2 + \cdots + r_{1p}\theta_p &= r_{1y}\\ r_{21}\theta_1 + \theta_2 + \cdots + r_{2p}\theta_p &= r_{2y}\\ &\ \vdots\\ r_{p1}\theta_1 + r_{p2}\theta_2 + \cdots + \theta_p &= r_{py} \end{aligned}$$
where rij is the correlation between predictors i and j (so rij = r ji ) and r jy is
the correlation between the response and predictor j.
Ridge Adjustments Solving for ridge estimates involves changing the normal
equations to
$$Z^TZ\hat{\theta} = Z^TY \longrightarrow (Z^TZ + kI)\hat{\theta}_R = Z^TY$$
Rearranging the changed normal equations leads to the following way to solve
for the ridge estimates:
$$\begin{aligned} (1+k)\theta_1 + r_{12}\theta_2 + \cdots + r_{1p}\theta_p &= r_{1y}\\ r_{21}\theta_1 + (1+k)\theta_2 + \cdots + r_{2p}\theta_p &= r_{2y}\\ &\ \vdots\\ r_{p1}\theta_1 + r_{p2}\theta_2 + \cdots + (1+k)\theta_p &= r_{py} \end{aligned}$$
The higher the value of k, the more bias is introduced in the estimates of the
model. The hardest part about ridge regression is choosing the appropriate
value of k because many different ways have been proposed over the years:
• Fixed Point (1975 by Hoerl, Kennard, Baldwin)
$$k = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,OLS}^2}$$
where σ̂2 is the MSE for the model. This is one of the most popular
estimates, sometimes referred to as HKB estimate.
• Iterative Method (1976 by Hoerl, Kennard)
$$k_0 = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,OLS}^2}, \qquad k_1 = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,k_0}^2}, \qquad \ldots, \qquad k_n = \frac{p\hat{\sigma}^2}{\sum_{i=1}^{p}\hat{\theta}_{i,k_{n-1}}^2}$$
This is repeated until the change in k is negligible. In practice, very few iterations are usually needed.
• Ridge Trace
– Plot of many different estimates of θ̂i across a series of k values.
– Use the plot to approximate when the estimates become stable.
Example: Fixed Point Method
The code below highlights the method for implementing the Fixed Point
method in SAS. Here we create macro variables to represent the MSE of the
model and the value for k. The MSE of the model will be used for implementing
the Iterative Method. The last PROC REG statement outputs the VIFs, the
standard errors for the betas (SEB) and the parameter estimates to the output
data set ’B’. The reason we must output these parameters to a dataset is that
the SAS output will not show the VIF values for the ridge regression, only
from the ordinary OLS model. The RIDGE option allows us to use our macro
variable for the parameter k.
proc standard data=advanced.french mean=0 std=1 out=frenchstd;
   var Import DoProd Stock Consum;
run;

proc reg data=frenchstd outest=B;
   model Import = DoProd Stock Consum / vif;
run;
quit;

data _null_;
   set B;
   call symput('MSE', RMSE**2);
   call symput('k', 3*RMSE**2/(DoProd**2+Stock**2+Consum**2));
run;

proc reg data=frenchstd outvif outseb outest=B ridge=&k;
   model Import = DoProd Stock Consum / vif;
run;
quit;
Example: Iterative Method
The code for the iterative method simply extends the code for the fixed point
method. We again create macro variables to represent the MSE of the model
and the resulting value for k.
proc reg data=frenchstd outvif outseb outest=B ridge=&k;
   model Import = DoProd Stock Consum / vif;
run;
quit;

proc print data=B;
run;

data _null_;
   set B;
   where _TYPE_='RIDGE';
   call symput('k', 3*&MSE / (DoProd**2+Stock**2+Consum**2));
run;

proc reg data=frenchstd outvif outseb outest=B ridge=&k;
   model Import = DoProd Stock Consum / vif;
run;
quit;

proc print data=B;
run;
This code would be repeated using the latest ridge model. In practice, we
don’t often need to go beyond a few iterations before witnessing convergence.
In this example, the VIF values drop below 1 in the second iteration which is
something we have to be careful about - we probably do not want to use this
iteration.
Example: Ridge Trace Method
The Ridge Trace method is implemented by simply inputting the RIDGE
parameter as a sequence as shown below:
proc reg data=frenchstd outvif outest=B ridge=0 to 0.08 by 0.002;
model Import = DoProd Stock Consum / vif;
run;
quit;
proc reg data=frenchstd outvif outseb outest=B ridge=0.04;
   model Import = DoProd Stock Consum / vif;
run;
quit;

proc print data=B;
run;
The code above will output the VIF values and the standardized coefficients
for each variable in the model for a range of values of k. These values are given
in the output plots shown in Figure 9.2. The goal is to choose a value for k
where the lines on the graph become approximately horizontal.
Ridge Regression - Cautions
Due to the uncertainty in how to calculate k, some dislike the use of ridge regression (or any other biased regression technique). Both Principal Components Regression and Ridge Regression should be used as a last resort. Deleting or combining variables is preferred because it doesn't introduce bias. These methods, however, should not be shunned.
Figure 9.2: SAS Output: Ridge Regression with Ridge Trace Method