2006-01-20 princomp, ridge, PLS
2005-03-30
Supplemental notes,
BIOINF 2054/BIOSTAT 2018
Statistical Foundations for Bioinformatics Data Mining
Target readings:
Hastie, Tibshirani, Friedman
Chapter 3:
Linear regression, principal components, ridge regression, partial least squares
No classes on Jan 25 or 27.
Gauss-Markov Theorem:
The least-squares estimator, which minimizes RSS, is the "best linear unbiased estimator": it has the smallest variance among all linear unbiased estimators.
To estimate a linear combination $\theta = a^T\beta$ with an unbiased linear estimator $c^T Y$, the minimum-variance choice is
$$\hat\theta = a^T \hat\beta^{\mathrm{ls}}.$$
But, by accepting some bias, you can do much better.
Best subset selection: Note that selecting the "best subset" is NOT a linear estimator. Why not?
Ridge Regression
See Fig 3.7.
The Principle: Add a "ridge" of size $\lambda$ to the diagonal of $X^TX$, to stabilize the matrix inverse:
$$\hat\beta^{\mathrm{ridge}} = (X^TX + \lambda I)^{-1} X^T Y.$$
Another view: penalized likelihood,
$$\hat\beta^{\mathrm{ridge}} := \arg\min_\beta \left\{ (Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|^2 \right\}.$$
This can also be thought of as maximizing a Bayesian posterior, where the prior is $\beta \sim N\!\left(0, (2\lambda)^{-1} I_p\right)$.
This is also an example of data augmentation:
Let
$$X^{\mathrm{aug}} = \begin{pmatrix} X \\ \sqrt{\lambda}\, I_p \end{pmatrix}, \qquad Y^{\mathrm{aug}} = \begin{pmatrix} Y \\ 0 \end{pmatrix}.$$
Then OLS applied to $(X^{\mathrm{aug}}, Y^{\mathrm{aug}})$ will yield $\hat\beta^{\mathrm{ridge}}$.
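A minimal R sketch of these two equivalent computations, on simulated data (the object names X, Y, lambda, beta_ridge, beta_aug are illustrative, not from the text):

set.seed(1)
n <- 50; p <- 4; lambda <- 2
X <- scale(matrix(rnorm(n * p), n, p))    # centered and scaled predictors
Y <- X %*% c(1, -1, 0.5, 0) + rnorm(n)
Y <- Y - mean(Y)                          # center Y so no intercept is needed

# Closed form: (X'X + lambda I)^{-1} X'Y
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% Y)

# Data augmentation: stack sqrt(lambda) I_p under X and p zeros under Y, then run OLS
X_aug <- rbind(X, sqrt(lambda) * diag(p))
Y_aug <- c(Y, rep(0, p))
beta_aug <- coef(lm(Y_aug ~ X_aug - 1))   # "- 1" drops the intercept

cbind(beta_ridge, beta_aug)               # the two columns agree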
Another view:
Solve a constrained optimization problem
(restricting the model space):
$$\hat\beta := \arg\min_\beta (Y - X\beta)^T (Y - X\beta) \quad \text{restricted to the set } \{\beta : \|\beta\|^2 \le K\}.$$
Note the error in Fig 3.12:
if $X^TX = I_p$, then as in Table 3.4,
$$\hat\beta^{\mathrm{ridge}} = \frac{1}{1 + \lambda}\, \hat\beta^{\mathrm{ls}},$$
so the ellipse's major or minor axis must go through the origin.
The singular value decomposition of X (svd(X)) is
$$X = U D V^T,$$
where
U is N by p, $U^TU = I_p$, and $UU^T = X(X^TX)^{-1}X^T = H$ (the "hat" matrix).
U transforms data points in "scatterplot space" (rows of X in $\mathbb{R}^p$), creating a new dataset $U^TX = DV^T$.
V is p by p, $V^TV = VV^T = I_p$; V rotates data points in "variable space" (columns of X in $\mathbb{R}^N$), defining new variables $XV = UD$.
D is diagonal, with singular values $d_1 \ge \dots \ge d_p \ge 0$.
[What are the eigenvalues of $X^TX$?]
Then
$$X\hat\beta^{\mathrm{ls}} = \hat Y = HY = UU^TY.$$
Ridge Regression: degrees of freedom (equation 3.47)
First, note that
$$\left(UU^T\right)_{i_1 i_2} = \sum_{j=1}^{p} U_{i_1 j}\,(U^T)_{j i_2} \qquad \text{(definition of matrix multiplication)}$$
$$= \sum_{j=1}^{p} U_{i_1 j}\,U_{i_2 j} = \sum_{j=1}^{p} \left(u_j u_j^T\right)_{i_1 i_2} \qquad \text{(outer product of column $j$ with itself).}$$
Therefore $UU^T = \sum_{j=1}^{p} u_j u_j^T$.
Similarly,
$$\left(U\,\mathrm{diag}(a)\,U^T\right)_{i_1 i_2} = \sum_{j=1}^{p} U_{i_1 j}\, a_j\, (U^T)_{j i_2} = \sum_{j=1}^{p} a_j\, U_{i_1 j}\, U_{i_2 j} = \sum_{j=1}^{p} a_j \left(u_j u_j^T\right)_{i_1 i_2}.$$
Therefore
$$U\,\mathrm{diag}(a)\,U^T = \sum_{j=1}^{p} a_j\, u_j u_j^T \ \ \text{(regarding $a_j$ as a scalar multiplier)} = \sum_{j=1}^{p} u_j\, a_j\, u_j^T \ \ \text{(regarding $a_j$ as a $1 \times 1$ matrix).}$$
In 3.47,
$$\mathrm{diag}(a) = D\,(D^2 + \lambda I)^{-1} D, \qquad a_j = \frac{d_j^2}{d_j^2 + \lambda}.$$
We conclude:
$$X\hat\beta^{\mathrm{ridge}} = U D (D^2 + \lambda I)^{-1} D\, U^T Y = \sum_{j=1}^{p} u_j\, \frac{d_j^2}{d_j^2 + \lambda}\, u_j^T Y.$$
Recall that
$$X\hat\beta^{\mathrm{ls}} = U U^T Y.$$
For a linear smoother $\hat Y = X\hat\beta = SY$,
the effective degrees of freedom are
$$\mathrm{df} = \mathrm{tr}(S) = S_{11} + \dots + S_{NN} = \mathrm{sum}(\mathrm{diag}(S))$$
(see 5.4.1, 7.6).
So for ridge regression
$$\mathrm{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$$
and for least squares, $\mathrm{df} = \mathrm{df}(0) = p$.
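A short R sketch of these formulas, computing the effective degrees of freedom both as tr(S) and from the singular values (simulated centered data; names illustrative):

set.seed(3)
N <- 40; p <- 5; lambda <- 3
X <- scale(matrix(rnorm(N * p), N, p), scale = FALSE)   # centered columns
Y <- rnorm(N); Y <- Y - mean(Y)

d <- svd(X)$d

# Ridge smoother matrix S = X (X'X + lambda I)^{-1} X'
S <- X %*% solve(t(X) %*% X + lambda * diag(p)) %*% t(X)

# Effective degrees of freedom two ways
c(trace = sum(diag(S)), svd_form = sum(d^2 / (d^2 + lambda)))

# lambda = 0 recovers least squares: df = p (when X has full column rank)
sum(d^2 / (d^2 + 0))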
Lasso Regression
See Fig 3.9.
$$\hat\beta^{\mathrm{lasso}} := \arg\min_\beta \left\{ (Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|_1 \right\}, \qquad \|\beta\|_1 = \sum_j |\beta_j|.$$
Note the FIRST power of the coefficients in the penalty function (the L1 norm), in contrast to the squared length used by ridge.
Another view:
Solve a constrained optimization problem
(restricting the model space):
$$\hat\beta := \arg\min_\beta (Y - X\beta)^T (Y - X\beta) \quad \text{restricted to the set } \{\beta : \|\beta\|_1 \le K\}.$$
Be prepared to compare ridge regression to lasso regression.
See Fig 3.12 and 3.13.
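A brief R sketch of the lasso fit (this assumes the glmnet package is installed; glmnet with alpha = 1 uses the L1 penalty, and its lambda plays the role of the penalty parameter above):

library(glmnet)

set.seed(4)
N <- 60; p <- 8
X <- matrix(rnorm(N * p), N, p)
Y <- X[, 1] - 2 * X[, 2] + rnorm(N)

fit <- glmnet(X, Y, alpha = 1)        # alpha = 1 gives the lasso (L1) penalty
plot(fit, xvar = "lambda")            # coefficient paths: coefficients hit exactly zero

cvfit <- cv.glmnet(X, Y, alpha = 1)   # cross-validation to choose lambda
coef(cvfit, s = "lambda.min")         # sparse coefficient vector at the selected lambda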
Principal Components
Recall: $X = UDV^T$. See Fig 3.10.
The principal component weights are the columns of V, $v_1, \dots, v_p$.
$$X^TXV = VDU^TUDV^TV = VD^2,$$
so $X^TXv_j = d_j^2 v_j$ (the $v_j$ are eigenvectors).
The principal components are the linear combinations
$$z_j = Xv_j, \quad j = 1, \dots, p.$$
Note that $Z = (z_1 \dots z_p) = XV = UD$.
This is a derived covariate technique. Z replaces X.
Algorithm for generating principal components:
The successive principal components solve
$$v_j = \arg\max_\alpha \mathrm{Var}(X\alpha)$$
over all $\alpha$ of length 1 and orthogonal to $v_1, \dots, v_{j-1}$.
(Important: note that Y does not enter into this.)
See Fig 3.8.
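A short R sketch of these relationships (simulated centered X; the svd route and the built-in prcomp agree up to column signs; all names illustrative):

set.seed(5)
N <- 50; p <- 4
X <- scale(matrix(rnorm(N * p), N, p), scale = FALSE)   # centered columns

s <- svd(X)
V <- s$v                            # principal component weights: columns v_1, ..., v_p
Z <- X %*% V                        # principal components z_j = X v_j; equals U D

pc <- prcomp(X, center = FALSE)     # X is already centered
round(abs(pc$rotation) - abs(V), 8)         # approximately zero: same directions
round(crossprod(Z) / (N - 1), 6)            # diagonal: the components are uncorrelated
s$d^2                                       # eigenvalues of X'X: the squared singular values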
Principal components regression is the model
$$\hat y^{\mathrm{pcr}} = \bar y + \sum_{j=1}^{M} \hat\theta_j z_j, \qquad \text{so} \qquad \hat\beta^{\mathrm{pcr}} = \sum_{j=1}^{M} \hat\theta_j v_j,$$
where $\hat\theta_j$ is the coefficient from regressing $y$ on $z_j$ alone (the $z_j$ are orthogonal, so these univariate regressions suffice).
Note that M, the number of components to include, is a
model complexity tuning parameter.
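A minimal R sketch of principal components regression with M components, done by hand via the svd (the pcr function in the pls package automates this; M, theta_hat, and the other names are illustrative):

set.seed(6)
N <- 60; p <- 6; M <- 3
X <- scale(matrix(rnorm(N * p), N, p), scale = FALSE)
Y <- drop(X %*% rnorm(p)) + rnorm(N)

s <- svd(X)
Z <- X %*% s$v                      # all p principal components
ybar <- mean(Y)

# theta_hat_j: univariate regression coefficient of (Y - ybar) on z_j
theta_hat <- sapply(1:M, function(j) sum(Z[, j] * (Y - ybar)) / sum(Z[, j]^2))

yhat_pcr <- ybar + Z[, 1:M, drop = FALSE] %*% theta_hat   # fitted values using M components
beta_pcr <- s$v[, 1:M, drop = FALSE] %*% theta_hat        # implied coefficient vector
# M = p reproduces the least-squares fit on the centered data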
Partial Least Squares
The successive PLS components solve
$$\hat\varphi_j = \arg\max_\alpha \mathrm{Cov}(Y, X\alpha)$$
over all $\alpha$ of length 1 and orthogonal to $\hat\varphi_1, \dots, \hat\varphi_{j-1}$.
This is the same as
$$\hat\varphi_j = \arg\max_\alpha \mathrm{corr}^2(Y, X\alpha)\,\mathrm{Var}(X\alpha).$$
Contrast this with principal components, where Y plays no role.
PLS regression is the model
$$\hat y^{\mathrm{pls}} = \bar y + \sum_{j=1}^{M} \hat\theta_j z_j, \quad \text{where } z_j = X\hat\varphi_j,$$
so
$$\hat\beta^{\mathrm{pls}} = \sum_{j=1}^{M} \hat\theta_j \hat\varphi_j$$
(depends on M, a "smoothing" or "model complexity" parameter).
This is another derived covariates method.
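A minimal R sketch: the first PLS direction computed by hand (for centered, scaled X it is proportional to X'Y), plus the plsr function from the pls package, which is assumed installed; names are illustrative:

library(pls)

set.seed(7)
N <- 60; p <- 6
X <- scale(matrix(rnorm(N * p), N, p))   # centered and scaled predictors
Y <- drop(X %*% rnorm(p)) + rnorm(N)
Yc <- Y - mean(Y)

# First PLS direction: each weight is proportional to Cov(x_j, Y)
phi1 <- crossprod(X, Yc)
phi1 <- phi1 / sqrt(sum(phi1^2))         # normalize to length 1
z1 <- X %*% phi1                         # first PLS component
theta1 <- sum(z1 * Yc) / sum(z1^2)       # univariate regression of Y on z1

# The pls package computes all M components at once
fit <- plsr(Y ~ X, ncomp = 3)            # M = 3 components
coef(fit, ncomp = 3)                     # implied regression coefficients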
Comparing the methods:
See Fig 3.6, 3.11, Table 3.3.
Summary: What do you need to remember about these methods? List here.
Exercises due Feb 1:
Go to http://www-stat.stanford.edu/~tibs/ElemStatLearn/ . Obtain the
prostate cancer data set. Load it into R. Carry out OLS regression, ridge
regression, principal components regression, and partial least squares
regression.
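A hedged starting point for this exercise, assuming the prostate data have been downloaded from the course site and saved locally as prostate.data (a text file with the predictors, the response lpsa, and a train indicator; adjust the read.table call to the file's actual layout), and assuming the MASS and pls packages are installed:

library(MASS)   # lm.ridge
library(pls)    # pcr, plsr

prostate <- read.table("prostate.data", header = TRUE)
train <- subset(prostate, train == TRUE, select = -train)

ols_fit   <- lm(lpsa ~ ., data = train)                        # ordinary least squares
ridge_fit <- lm.ridge(lpsa ~ ., data = train,
                      lambda = seq(0, 50, by = 0.5))           # ridge over a grid of lambda
pcr_fit   <- pcr(lpsa ~ ., data = train, validation = "CV")    # principal components regression
pls_fit   <- plsr(lpsa ~ ., data = train, validation = "CV")   # partial least squares

summary(ols_fit)
select(ridge_fit)         # lambda suggested by GCV and related criteria
validationplot(pcr_fit)   # CV error versus number of components
validationplot(pls_fit)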
Also do exercises 3.1- 3.7 (skip 3.3b), 3.9, 3.11, 3.17.
As usual, bring to class at least one AHA and one Question about Chapter 3.
You will read Ch. 4.1 – 4.3 for Friday Feb 3.