2005-03-30
Supplemental notes,
BIOINF 2054/BIOSTAT 2018
Statistical Foundations for Bioinformatics Data Mining
Target readings:
Hastie, Tibshirani, Friedman
Chapter 3:
Linear regression, principal components, ridge regression, partial least squares
No classes on Jan 25 or 27.
Gauss-Markov Theorem:
The least squares estimator, which minimizes RSS, is the “best linear unbiased estimator”:
to estimate a linear combination $\theta = a^T\beta$ with an unbiased linear combination $c^TY$, choose $\hat\theta = a^T\hat\beta^{\,ls}$; among all unbiased linear estimators it has the smallest variance.
But, by accepting some bias, you can do much better.
Best subset selection: Note that selecting the “best
subset” is NOT a linear estimator. Why not??
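(The subset chosen depends on Y itself, so the fit is not a fixed linear function of Y.) A minimal R sketch of best-subset search, assuming the leaps package is installed; the data here are simulated, purely for illustration:

    ## Best-subset selection with leaps::regsubsets (simulated data)
    library(leaps)
    set.seed(1)
    n <- 50; p <- 6
    X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
    y <- X[, 1] - 0.5 * X[, 3] + rnorm(n)
    fit <- regsubsets(x = X, y = y, nvmax = p)   # exhaustive search over subsets
    summary(fit)$which   # which variables enter the best model of each size;
                         # perturbing y can change the selected subsets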
Ridge Regression
See Fig 3.7.
The Principle: add a “ridge” of size $\lambda$ to the diagonal of $X^TX$, to stabilize the matrix inverse:
$$\hat\beta^{ridge} = (X^TX + \lambda I)^{-1}\,X^T Y$$
Another view: penalized likelihood
$$\hat\beta^{ridge} := \arg\min_\beta\; (Y - X\beta)^T(Y - X\beta) + \lambda\,\|\beta\|^2$$
This can also be thought of as maximizing a Bayesian posterior, where the prior is $\beta \sim N\!\left(0,\,(2\lambda)^{-1} I_p\right)$.
This is also an example of data augmentation:
Let $X_{aug} = \begin{pmatrix} X \\ \sqrt{\lambda}\,I_p \end{pmatrix}$, $Y_{aug} = \begin{pmatrix} Y \\ 0 \end{pmatrix}$. Then OLS on the augmented data yields $\hat\beta^{ridge}$.
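A minimal R sketch of the augmentation trick, with simulated data and an arbitrary $\lambda$ (no intercept, for simplicity); the point is that plain lm() on the augmented data reproduces the ridge formula:

    ## Ridge regression via data augmentation (simulated data, arbitrary lambda)
    set.seed(2)
    n <- 40; p <- 5; lambda <- 2
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(1, -1, 0.5, 0, 0)) + rnorm(n)
    beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)  # direct formula
    X_aug <- rbind(X, sqrt(lambda) * diag(p))   # append sqrt(lambda) * I_p rows
    y_aug <- c(y, rep(0, p))                    # append p zeros
    beta_aug <- coef(lm(y_aug ~ X_aug - 1))     # OLS on the augmented data
    cbind(beta_ridge, beta_aug)                 # the two columns agree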
Another view: solve a constrained optimization problem (restricting the model space):
$$\hat\beta := \arg\min_\beta\; (Y - X\beta)^T(Y - X\beta) \quad\text{restricted to the set } \{\beta : \|\beta\|^2 \le K\}.$$
Note error in Fig 3.12: if $X^TX = I_p$, then as in Table 3.4,
$$\hat\beta^{ridge} = \frac{1}{1+\lambda}\,\hat\beta^{ls},$$
so the ellipse’s major or minor axis must go through the origin.
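This shrinkage is easy to verify numerically; a small R sketch with an orthonormal design (simulated data, $\lambda = 3$ chosen arbitrarily):

    ## With X^T X = I_p, ridge rescales the least-squares coefficients by 1/(1 + lambda)
    set.seed(3)
    n <- 40; p <- 3; lambda <- 3
    X <- qr.Q(qr(matrix(rnorm(n * p), n, p)))   # orthonormal columns
    y <- rnorm(n)
    beta_ls    <- coef(lm(y ~ X - 1))
    beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
    cbind(beta_ridge, beta_ls / (1 + lambda))   # the two columns agree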
The singular value decomposition of $X$ (svd($X$)) is
$$X = UDV^T$$
where
$U$ is $N \times p$, $U^TU = I_p$, and $UU^T = X(X^TX)^{-1}X^T = H$ (the “hat” matrix).
$U$ transforms data points in “scatterplot space” (rows of $X$ in $\mathbb{R}^p$), creating a new data set $U^TX = DV^T$.
$V$ is $p \times p$, $V^TV = VV^T = I_p$; $V$ rotates data points in “variable space” (columns of $X$ in $\mathbb{R}^N$), defining new variables $XV = UD$.
$D$ is diagonal, with singular values $d_1 \ge \cdots \ge d_p \ge 0$.
[What are the eigenvalues of $X^TX$?]
Then
$$X\hat\beta^{ls} = \hat Y = HY = UU^TY.$$
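These identities can be checked directly in R with svd(); a minimal sketch on simulated data:

    ## SVD of X, the hat matrix, and the least-squares fit (simulated data)
    set.seed(4)
    n <- 30; p <- 4
    X <- matrix(rnorm(n * p), n, p)
    y <- rnorm(n)
    s <- svd(X)                              # s$u is N x p, s$d the singular values, s$v is p x p
    H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
    max(abs(H - s$u %*% t(s$u)))             # ~ 0:  H = U U^T
    max(abs(eigen(t(X) %*% X)$values - s$d^2))               # eigenvalues of X^T X are d_j^2
    max(abs(fitted(lm(y ~ X - 1)) - s$u %*% t(s$u) %*% y))   # ~ 0:  X beta_ls = U U^T y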
Ridge Regression: degrees of freedom (equation 3.47)
First, note that
$$\left(UU^T\right)_{i_1 i_2} = \sum_{j=1}^{p} U_{i_1 j}\,(U^T)_{j i_2} \qquad \text{(definition of matrix multiplication)}$$
$$= \sum_{j=1}^{p} U_{i_1 j}\,U_{i_2 j} = \sum_{j=1}^{p} \left(u_j u_j^T\right)_{i_1 i_2} \qquad \text{(outer product of column $j$ with itself)}.$$
Therefore $UU^T = \sum_{j=1}^{p} u_j u_j^T$.
Similarly,
$$\left(U\,\mathrm{diag}(a)\,U^T\right)_{i_1 i_2} = \sum_{j=1}^{p} U_{i_1 j}\,a_j\,(U^T)_{j i_2} = \sum_{j=1}^{p} a_j\,U_{i_1 j}\,U_{i_2 j} = \sum_{j=1}^{p} a_j \left(u_j u_j^T\right)_{i_1 i_2}.$$
Therefore
$$U\,\mathrm{diag}(a)\,U^T = \sum_{j=1}^{p} a_j\,u_j u_j^T \;\;\text{(regarding $a_j$ as a scalar multiplier)} = \sum_{j=1}^{p} u_j\,a_j\,u_j^T \;\;\text{(regarding $a_j$ as a $1\times 1$ matrix).}$$
In (3.47),
$$\mathrm{diag}(a) = D(D^2 + \lambda I)^{-1}D, \qquad a_j = \frac{d_j^2}{d_j^2 + \lambda}.$$
We conclude:
$$X\hat\beta^{ridge} = UD(D^2 + \lambda I)^{-1}DU^T\,Y = \sum_{j=1}^{p} u_j\,\frac{d_j^2}{d_j^2 + \lambda}\,u_j^T\,Y.$$
Recall that $X\hat\beta^{ls} = UU^T Y$.
For a linear smoother $\hat Y = X\hat\beta = SY$, the effective degrees of freedom are
$$\mathrm{df} = \mathrm{tr}(S) = S_{11} + \cdots + S_{NN} = \mathrm{sum}(\mathrm{diag}(S))$$
(see 5.4.1, 7.6).
So for ridge regression
$$\mathrm{df}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda},$$
and for least squares, $\mathrm{df} = \mathrm{df}(0) = p$.
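A short R sketch of $\mathrm{df}(\lambda)$ and the shrunken fit, using the SVD directly (simulated data; the $\lambda$ values are arbitrary):

    ## Effective degrees of freedom and shrunken fit for ridge regression
    set.seed(5)
    n <- 50; p <- 8
    X <- scale(matrix(rnorm(n * p), n, p))   # centered and scaled predictors
    y <- rnorm(n)
    s <- svd(X)
    df_ridge <- function(lambda) sum(s$d^2 / (s$d^2 + lambda))
    df_ridge(0)                              # equals p: least squares
    sapply(c(0.1, 1, 10, 100), df_ridge)     # df shrinks as lambda grows
    lambda <- 1
    fit_ridge <- s$u %*% diag(s$d^2 / (s$d^2 + lambda)) %*% t(s$u) %*% y  # X beta_ridge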
Lasso Regression
See Fig 3.9.
$$\hat\beta := \arg\min_\beta\; (Y - X\beta)^T(Y - X\beta) + \lambda\,\|\beta\|_1$$
Note the FIRST power of the coefficient magnitudes in the penalty (absolute values, not squares).
Another view: solve a constrained optimization problem (restricting the model space):
$$\hat\beta := \arg\min_\beta\; (Y - X\beta)^T(Y - X\beta) \quad\text{restricted to the set } \{\beta : \|\beta\|_1 \le K\}.$$
Be prepared to compare ridge regression to lasso regression.
See Fig 3.12 and 3.13.
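A minimal R sketch of the lasso coefficient path, assuming the lars package (Efron, Hastie, et al.) is installed; the data are simulated, standing in for a real design matrix:

    ## Lasso coefficient path with lars (cf. Fig 3.9)
    library(lars)
    set.seed(6)
    n <- 60; p <- 6
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(2, 0, -1, 0, 0, 0.5)) + rnorm(n)
    fit <- lars(X, y, type = "lasso")
    plot(fit)     # coefficients as a function of the L1 constraint
    coef(fit)     # coefficients at each step of the path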
Principal Components
See Fig 3.10.
Recall: $X = UDV^T$.
The principal component weights are the columns of $V$: $v_1, \ldots, v_p$.
$$X^TXV = VDU^TUDV^TV = VD^2,$$
so $X^TX v_j = d_j^2\, v_j$ (the $v_j$ are eigenvectors).
The principal components are the linear combinations
$$z_j = Xv_j, \quad j = 1, \ldots, p.$$
Note that $Z = (z_1 \cdots z_p) = XV = UD$.
This is a derived covariate technique. Z replaces X.
Algorithm for generating principal components:
The successive principal components solve
$$v_j = \arg\max_\alpha\; \mathrm{Var}(X\alpha)$$
over all $\alpha$ of length 1 and orthogonal to $v_1, \ldots, v_{j-1}$.
(Important: note that Y does not enter into this.)
See Fig 3.8.
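A minimal R sketch, computing the weights and components from the SVD and checking against prcomp (simulated, column-centered data):

    ## Principal component weights V and components Z = XV = UD
    set.seed(7)
    n <- 50; p <- 4
    X <- scale(matrix(rnorm(n * p), n, p), scale = FALSE)   # column-centered
    s <- svd(X)
    V <- s$v                              # columns v_1, ..., v_p (the weights)
    Z <- X %*% V                          # principal components z_j = X v_j
    max(abs(Z - s$u %*% diag(s$d)))       # ~ 0:  Z = U D
    prcomp(X, center = FALSE)$rotation    # same directions as V, up to column signs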
Principal components regression is the model
$$\hat y^{\,pcr} = \bar y + \sum_{j=1}^{M} \hat\theta_j\, z_j,$$
so
$$\hat\beta^{\,pcr} = \sum_{j=1}^{M} \hat\theta_j\, v_j.$$
Note that M, the number of components to include, is a
model complexity tuning parameter.
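A minimal R sketch of PCR with M components, built from the SVD and lm on simulated data (pls::pcr is a packaged alternative, if that package is available):

    ## Principal components regression with M components
    set.seed(8)
    n <- 60; p <- 6; M <- 3
    X <- scale(matrix(rnorm(n * p), n, p))      # standardized predictors
    y <- drop(X %*% c(1, 1, 0, 0, 0, 0)) + rnorm(n)
    s <- svd(X)
    Z <- X %*% s$v                              # all p principal components
    fit <- lm(y ~ Z[, 1:M])                     # regress y on the first M components
    theta <- coef(fit)[-1]                      # theta_hat_1, ..., theta_hat_M
    beta_pcr <- s$v[, 1:M] %*% theta            # beta_pcr = sum_j theta_hat_j v_j
    beta_pcr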
Partial Least Squares
The successive PLS components solve
$$\hat\varphi_j = \arg\max_\alpha\; \mathrm{Cov}(Y, X\alpha)$$
over all $\alpha$ of length 1 and orthogonal to $\hat\varphi_1, \ldots, \hat\varphi_{j-1}$.
This is the same as
$$\hat\varphi_j = \arg\max_\alpha\; \mathrm{corr}^2(Y, X\alpha)\,\mathrm{Var}(X\alpha).$$
Contrast this with principal components, where Y plays no role.
PLS regression is the model
$$\hat y^{\,pls} = \bar y + \sum_{j=1}^{M} \hat\theta_j\, z_j, \qquad\text{where } z_j = X\hat\varphi_j,$$
so
$$\hat\beta^{\,pls} = \sum_{j=1}^{M} \hat\theta_j\,\hat\varphi_j$$
(depends on M, a “smoothing” or “model complexity” parameter).
This is another derived covariates method.
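A minimal R sketch, assuming the pls package is installed (simulated data, M = 2 components):

    ## Partial least squares regression with pls::plsr
    library(pls)
    set.seed(9)
    n <- 60; p <- 6; M <- 2
    X <- matrix(rnorm(n * p), n, p)
    y <- drop(X %*% c(1, -1, 0, 0, 0, 0)) + rnorm(n)
    fit <- plsr(y ~ X, ncomp = M)
    coef(fit, ncomp = M)       # beta_pls using the first M PLS components
    scores(fit)[, 1:M]         # the derived components z_1, ..., z_M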
Comparing the methods:
See Fig 3.6, 3.11, Table 3.3.
Summary: What do you need to remember about these methods? List here.
Exercises due Feb 1:
Go to http://www-stat.stanford.edu/~tibs/ElemStatLearn/ . Obtain the
prostate cancer data set. Load it into R. Carry out OLS regression, ridge
regression, principal components regression, and partial least squares
regression.
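A getting-started sketch for this exercise; the exact file name and column layout below are assumptions about how the prostate data are posted on the ElemStatLearn site, so check the site if the path differs:

    ## Load the prostate data and fit the four models (file path is an assumption)
    url <- "http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/prostate.data"
    prostate <- read.table(url, header = TRUE)
    train <- subset(prostate, train == TRUE, select = -train)   # training rows only
    ols <- lm(lpsa ~ ., data = train)                           # OLS regression
    library(MASS); library(pls)
    ridge  <- lm.ridge(lpsa ~ ., data = train, lambda = seq(0, 50, 0.5))
    pcrfit <- pcr(lpsa ~ ., data = train, validation = "CV")    # principal components regression
    plsfit <- plsr(lpsa ~ ., data = train, validation = "CV")   # partial least squares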
Also do exercises 3.1- 3.7 (skip 3.3b), 3.9, 3.11, 3.17.
As usual, bring to class at least one AHA and one Question about Chapter 3.
You will read Ch. 4.1–4.3 for Friday, Feb 3.