2005-03-30  Supplemental notes, BIOINF 2054/BIOSTAT 2018
Statistical Foundations for Bioinformatics Data Mining

Target readings: Hastie, Tibshirani, Friedman, Chapter 3: linear regression, principal components, ridge regression, partial least squares. No classes on Jan 25, 27.

Gauss-Markov Theorem: The "best linear unbiased estimator" minimizes RSS. To estimate a linear combination \theta = a^T \beta with an unbiased linear estimator c^T y, choose \hat{\theta} = a^T \hat{\beta}. But, by accepting some bias, you can do much better.

Best subset selection: Note that selecting the "best subset" is NOT a linear estimator. Why not?

Ridge Regression

See Fig 3.7. The principle: add a "ridge" of size \lambda to the diagonal of X^T X, to stabilize the matrix inverse:

    \hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y.

Another view: penalized least squares,

    \hat{\beta}_{ridge} = \arg\min_\beta \, (Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|^2.

This can also be thought of as maximizing a Bayesian posterior, where the prior is [\beta] \sim N(0, (\lambda / \sigma^2)^{-1} I_p).

This is also an example of data augmentation: let

    X_{aug} = [ X ; \sqrt{\lambda}\, I_p ],   Y_{aug} = [ Y ; 0 ]

(append p extra "observations" with response 0). Then OLS on the augmented data yields \hat{\beta}_{ridge}.

Another view: solve a constrained optimization problem (restricting the model space):

    \hat{\beta} = \arg\min_\beta \, (Y - X\beta)^T (Y - X\beta)   restricted to the set { \beta : \|\beta\|^2 \le K }.

Note the error in Fig 3.12: if X^T X = I_p, then as in Table 3.4,

    \hat{\beta}_{ridge} = \frac{1}{1+\lambda} \hat{\beta}_{ls},

so the ellipse's major or minor axis must go through the origin.

The singular value decomposition of X, svd(X), is

    X = U D V^T,

where:
  U is N by p, with U^T U = I_p and U U^T = X (X^T X)^{-1} X^T = H (the "hat" matrix). U transforms data points in "scatterplot space" (rows of X, in R^p), creating a new dataset U^T X = D V^T.
  V is p by p, with V^T V = V V^T = I_p. V rotates data points in "variable space" (columns of X, in R^N), defining new variables XV = UD.
  D is diagonal; d_1 \ge ... \ge d_p \ge 0 are the singular values.
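The data-augmentation view of ridge regression can be checked numerically. A minimal sketch in Python/NumPy (the simulated dataset and all variable names are illustrative, not from the course materials; the course exercises use R):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
X = rng.standard_normal((N, p))
Y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.standard_normal(N)
lam = 3.0

# Closed form: beta_ridge = (X'X + lam*I)^{-1} X'Y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Data augmentation: append sqrt(lam)*I_p as p extra rows of X and
# p zeros to Y; then ordinary least squares on the augmented data
# gives exactly the ridge estimate.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_aug, *_ = np.linalg.lstsq(X_aug, Y_aug, rcond=None)

assert np.allclose(beta_ridge, beta_aug)
```

The equivalence holds because the augmented residual sum of squares is (Y - X\beta)^T(Y - X\beta) + \lambda \|\beta\|^2, the ridge criterion.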
[What are the eigenvalues of X^T X?]

Then X \hat{\beta}_{ls} = \hat{Y} = H Y = U U^T Y.

Ridge regression: degrees of freedom (equation 3.47)

First, note that

    (U U^T)_{i_1 i_2} = \sum_{j=1}^p U_{i_1 j} (U^T)_{j i_2}    (definition of matrix multiplication)
                      = \sum_{j=1}^p U_{i_1 j} U_{i_2 j}
                      = \sum_{j=1}^p (u_j u_j^T)_{i_1 i_2}      (outer product of column j with itself).

Therefore U U^T = \sum_{j=1}^p u_j u_j^T.

Similarly,

    (U \, diag(a) \, U^T)_{i_1 i_2} = \sum_{j=1}^p U_{i_1 j} a_j (U^T)_{j i_2}
                                    = \sum_{j=1}^p a_j U_{i_1 j} U_{i_2 j}
                                    = \sum_{j=1}^p a_j (u_j u_j^T)_{i_1 i_2}.

Therefore

    U \, diag(a) \, U^T = \sum_{j=1}^p a_j u_j u_j^T    (regarding a_j as a scalar multiplier)
                        = \sum_{j=1}^p u_j a_j u_j^T    (regarding a_j as a 1-by-1 matrix).

In 3.47, diag(a) = D (D^2 + \lambda I)^{-1} D, i.e.

    a_j = \frac{d_j^2}{d_j^2 + \lambda}.

We conclude:

    X \hat{\beta}_{ridge} = U D (D^2 + \lambda I)^{-1} D U^T Y = \sum_{j=1}^p u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T Y.

Recall that X \hat{\beta}_{ls} = U U^T Y.

For a linear smoother \hat{Y} = X \hat{\beta} = S Y, the effective degrees of freedom are

    df = tr(S) = S_{11} + ... + S_{NN} = sum(diag(S))    (see 5.4.1, 7.6).

So for ridge regression

    df(\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda},

and for least squares, df = df(0) = p.

Lasso Regression

See Fig 3.9.

    \hat{\beta} = \arg\min_\beta \, (Y - X\beta)^T (Y - X\beta) + \lambda \|\beta\|_1.

Note the FIRST power of the length in the penalty function. Another view: solve a constrained optimization problem (restricting the model space):

    \hat{\beta} = \arg\min_\beta \, (Y - X\beta)^T (Y - X\beta)   restricted to the set { \beta : \|\beta\|_1 \le K }.

Be prepared to compare ridge regression to lasso regression. See Fig 3.12 and 3.13.

Principal Components

Recall: see Fig 3.10, and X = U D V^T. The principal component weights are the columns of V, v_1, ..., v_p:

    X^T X V = V D U^T U D V^T V = V D^2,

so X^T X v_j = d_j^2 v_j (the v_j are eigenvectors). The principal components are the linear combinations z_j = X v_j, j = 1, ..., p. Note that Z = (z_1 ... z_p) = X V = U D. This is a derived covariate technique: Z replaces X.
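The SVD identities above (eigenvectors of X^T X, Z = XV = UD, and the ridge degrees-of-freedom formula) can all be verified numerically. A minimal Python/NumPy sketch with simulated data (names and data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 4
X = rng.standard_normal((N, p))
lam = 2.0

# SVD: X = U D V^T with U'U = I_p, V'V = I_p
U, d, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# The v_j are eigenvectors of X'X with eigenvalues d_j^2
for j in range(p):
    assert np.allclose(X.T @ X @ V[:, j], d[j] ** 2 * V[:, j])

# Principal components: Z = XV = UD (U * d scales column j of U by d_j)
assert np.allclose(X @ V, U * d)

# Ridge smoother matrix S = X (X'X + lam*I)^{-1} X';
# its trace equals sum_j d_j^2 / (d_j^2 + lam)
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
assert np.isclose(np.trace(S), np.sum(d**2 / (d**2 + lam)))

# lam = 0 recovers least squares, with df = p
S0 = X @ np.linalg.solve(X.T @ X, X.T)
assert np.isclose(np.trace(S0), p)
```

(ESL works with centered inputs for principal components; the SVD identities checked here hold for any X.)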
Algorithm for generating principal components: the successive principal components solve

    v_j = \arg\max_\alpha \, Var(X \alpha)

over all \alpha of length 1 and orthogonal to v_1, ..., v_{j-1}. (Important: note that Y does not enter into this.) See Fig 3.8.

Principal components regression is the model

    \hat{y}_{pcr} = \bar{y} + \sum_{j=1}^M \hat{\theta}_j z_j,   so   \hat{\beta}_{pcr} = \sum_{j=1}^M \hat{\theta}_j v_j.

Note that M, the number of components to include, is a model complexity tuning parameter.

Partial Least Squares

The successive PLS components solve

    \hat{\varphi}_j = \arg\max_\alpha \, Cov(Y, X \alpha)

over all \alpha of length 1 and orthogonal to \hat{\varphi}_1, ..., \hat{\varphi}_{j-1}. This is the same as

    \hat{\varphi}_j = \arg\max_\alpha \, corr^2(Y, X \alpha) \, Var(X \alpha).

Contrast this with principal components, where Y plays no role. PLS regression is the model

    \hat{y}_{pls} = \bar{y} + \sum_{j=1}^M \hat{\theta}_j z_j,   where z_j = X \hat{\varphi}_j,   so   \hat{\beta}_{pls} = \sum_{j=1}^M \hat{\theta}_j \hat{\varphi}_j

(this depends on M, a "smoothing" or "model complexity" parameter). This is another derived covariates method.

Comparing the methods: see Fig 3.6, 3.11, Table 3.3.

Summary: What do you need to remember about these methods? List here.

Exercises due Feb 1: Go to http://www-stat.stanford.edu/~tibs/ElemStatLearn/ . Obtain the prostate cancer data set. Load it into R. Carry out OLS regression, ridge regression, principal components regression, and partial least squares regression. Also do Exercises 3.1-3.7 (skip 3.3b), 3.9, 3.11, 3.17. As usual, bring to class at least one AHA and one Question about Chapter 3. You will read Ch. 4.1-4.3 for Friday Feb 3.
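As a study aid, here is a minimal Python/NumPy sketch of principal components regression on simulated data (the course exercises use R; the data and names below are illustrative only). It builds the components from the SVD of the centered inputs, regresses y on each component separately (valid because the z_j are orthogonal), and checks that retaining all p components reproduces ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 40, 3
X = rng.standard_normal((N, p))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(N)

Xc = X - X.mean(axis=0)            # PCR uses centered inputs
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt.T                      # principal components z_j = X v_j (= UD)

M = 2                              # number of components retained
theta = np.array([Z[:, j] @ y / (Z[:, j] @ Z[:, j]) for j in range(M)])
beta_pcr = Vt.T[:, :M] @ theta     # beta_pcr = sum_{j<=M} theta_j v_j
yhat = y.mean() + Z[:, :M] @ theta # fitted values for the M-component model

# With M = p, PCR reproduces least squares on the centered data.
theta_full = np.array([Z[:, j] @ y / (Z[:, j] @ Z[:, j]) for j in range(p)])
beta_full = Vt.T @ theta_full
beta_ols, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
assert np.allclose(beta_full, beta_ols)
```

Choosing M < p discards the low-variance directions, which is where the shrinkage (and the bias) comes from.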