2006-04-12  BIOSTAT 2015  Statistical Foundations for Bioinformatics Data Mining
Target readings: Hastie, Tibshirani, Friedman, Chapter 10

Boosting

The boosting idea: to predict a group membership $Y \in \{-1, 1\}$ given $X$, combine a set of "weak" classifiers $\{x \mapsto \hat{G}_m(x) : m = 1, \dots, M\}$ by
$$\hat{G}(x) = \operatorname{sign}\Big( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \Big).$$
That is, the prediction is the winner of an "election" in which classifier $m$ gets a vote of size $\alpha_m$.

Algorithm:
1. Initialize weights $w_i = 1/N$ for $i = 1, \dots, N$.
2. For $m = 1, \dots, M$:
   a. fit classifier $\hat{G}_m$ to the training data using the weights $w_i$;
   b. compute the weighted average error
      $$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\{y_i \neq \hat{G}_m(x_i)\}}{\sum_{i=1}^{N} w_i};$$
   c. compute a vote size $\alpha_m = \log\dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m}$;
   d. re-compute the weights, $w_i \leftarrow w_i \exp\big(\alpha_m I\{y_i \neq \hat{G}_m(x_i)\}\big)$, i.e.
      $$w_i \leftarrow \begin{cases} w_i & \text{if } y_i = \hat{G}_m(x_i), \\[2pt] w_i \,\dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m} & \text{if } y_i \neq \hat{G}_m(x_i). \end{cases}$$
3. Output $\hat{G}(x) = \operatorname{sign}\Big( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \Big)$.

(See Fig 10.1.)

Boosting seems to do remarkably well. In Fig 10.2 the training data are 2000 points with $Y = 1$ outside a sphere:
$$X_1, \dots, X_{10} \sim N(0,1) \ \text{i.i.d.}, \qquad Y_i = 2\, I\Big\{ \sum_{j=1}^{10} X_{ij}^2 > \chi^2_{10}(0.5) \Big\} - 1 \in \{-1, 1\}.$$
(Note the use of the $\{0,1\} \to \{-1,1\}$ transformation.) The method is boosting "stumps" (simple thresholding on one variable). Up to M = 400 stumps, test-data performance is still improving.

Boosting seems to be surprisingly resistant to overfitting. It can even keep improving the test error well after the training error has gone to zero:

[Figure: training and test error curves for boosted C4.5.]

From: Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, Schapire, Freund, Bartlett, Lee, The Annals of Statistics, 26(5):1651-1686, 1998. See also Fig 10.3 of HTF.

Boosting fits an additive model

$\hat{G}(x) = \operatorname{sign}\big( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \big)$ is roughly a special case of a dictionary method,
$$E(Y \mid X = x) = f(x) = \sum_{m=1}^{M} \beta_m \, b(x; \gamma_m).$$
Other examples: single-layer neural networks (Ch 11), wavelets in signal processing (Ch 5), MARS (multivariate adaptive regression splines, Ch 9), and CART (Ch 9).

Forward stagewise (stepwise) fitting, in general

Replace the global optimization
$$\big(\hat{\beta}_1, \dots, \hat{\beta}_M, \hat{\gamma}_1, \dots, \hat{\gamma}_M\big) = \arg\min_{\beta_1 \dots \beta_M, \; \gamma_1 \dots \gamma_M} \; \sum_{i=1}^{N} L\Big( y_i, \sum_{m=1}^{M} \beta_m \, b(x_i; \gamma_m) \Big)$$
by a simpler problem that builds the fit in a stagewise, "greedy" manner:
$$\big(\hat{\beta}_m, \hat{\gamma}_m\big) = \arg\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L\big( y_i, \hat{f}_{m-1}(x_i) + \beta_m \, b(x_i; \gamma_m) \big) = \arg\min_{\beta_m, \gamma_m} \sum_{i=1}^{N} L\big( r_{im}, \beta_m \, b(x_i; \gamma_m) \big),$$
where the residual is $r_{im} = y_i - \sum_{m'=1}^{m-1} \beta_{m'} \, b(x_i; \gamma_{m'})$. The last equality assumes that the loss function depends only on the difference between the predicted and true values; this includes squared-error loss and misclassification error loss.

AdaBoost = forward stagewise with exponential loss

See Fig 10.4. Recall the loss function
$$L\big(y, f(x)\big) = e^{-y f(x)} = \begin{cases} e^{-1} & \text{if } y = f(x), \\ e^{+1} & \text{if } y \neq f(x), \end{cases}$$
for $f(x) \in \{-1, 1\}$.

Let's work out what "stagewise" gives when the basis $\{b\}$ consists of all members of a weak-classifier family, so $b(x; \gamma_m) = G(x) \in \{-1, 1\}$, e.g. $b(x; j, \tau_j) = 2\, I\{x_j > \tau_j\} - 1$. Then
$$\big(\hat{\beta}_m, \hat{G}_m\big) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\Big( -y_i \big[ \hat{f}_{m-1}(x_i) + \beta\, G(x_i) \big] \Big).$$
But
$$\sum_{i=1}^{N} \exp\Big( -y_i \hat{f}_{m-1}(x_i) - \beta\, y_i G(x_i) \Big) = \sum_{i=1}^{N} w_i^{(m)} e^{-\beta\, y_i G(x_i)} = e^{-\beta} \!\!\sum_{i:\, y_i = G(x_i)}\!\! w_i^{(m)} \; + \; e^{\beta} \!\!\sum_{i:\, y_i \neq G(x_i)}\!\! w_i^{(m)},$$
where $w_i^{(m)} = \exp\big( -y_i \hat{f}_{m-1}(x_i) \big)$. Equivalently, the objective is $(e^{\beta} - e^{-\beta}) \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)} + e^{-\beta} \sum_{i=1}^{N} w_i^{(m)}$, so for any fixed $\beta > 0$,
$$\hat{G}_m = \arg\min_{G} \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)}.$$
Next, minimizing over $\beta$ gives
$$\hat{\beta}_m = \tfrac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad \text{where } \mathrm{err}_m = \frac{\sum_{i:\, y_i \neq \hat{G}_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}}.$$
Finally, $\hat{f}_m(x) = \hat{f}_{m-1}(x) + \hat{\beta}_m \hat{G}_m(x)$, so
$$w_i^{(m+1)} = \exp\big( -y_i \hat{f}_m(x_i) \big) = w_i^{(m)} \exp\big( -\hat{\beta}_m y_i \hat{G}_m(x_i) \big) = w_i^{(m)} \exp\big( 2\hat{\beta}_m I\{ y_i \neq \hat{G}_m(x_i) \} \big) \, e^{-\hat{\beta}_m}.$$
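To make the algorithm concrete, here is a minimal Python/NumPy sketch of AdaBoost with decision stumps, run on the ten-variable nested-spheres simulation of Fig 10.2. The names (make_data, fit_stump, adaboost) and the threshold grid are illustrative choices of mine, not from HTF or any package, and M is set to 100 rather than 400 just to keep the run time short.

```python
# Minimal AdaBoost.M1 with decision stumps on the nested-spheres example:
# 10 i.i.d. N(0,1) features, Y = +1 if the sum of squares exceeds the chi^2_10 median.
# All function names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, p=10):
    X = rng.standard_normal((n, p))
    y = np.where((X**2).sum(axis=1) > 9.34, 1, -1)   # chi^2_10 median is approx. 9.34
    return X, y

def fit_stump(X, y, w):
    """Best weighted stump: one variable, one threshold, one orientation."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, 1)                        # (weighted err, feature, threshold, sign)
    for j in range(p):
        for t in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, M=100):
    n = len(y)
    w = np.full(n, 1.0 / n)                           # step 1: uniform weights
    stumps, alphas = [], []
    for _ in range(M):                                # step 2
        err, j, t, s = fit_stump(X, y, w)             # 2a-b: weighted fit and error
        err = np.clip(err, 1e-12, 1 - 1e-12)          # guard against log(0)
        alpha = np.log((1 - err) / err)               # 2c: the vote size alpha_m
        pred = s * np.where(X[:, j] > t, 1, -1)
        w = w * np.exp(alpha * (pred != y))           # 2d: up-weight misclassified points
        w /= w.sum()
        stumps.append((j, t, s))
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    F = np.zeros(len(X))
    for (j, t, s), a in zip(stumps, alphas):
        F += a * s * np.where(X[:, j] > t, 1, -1)
    return np.sign(F)                                 # step 3: weighted majority vote

Xtr, ytr = make_data(2000)
Xte, yte = make_data(10000)
stumps, alphas = adaboost(Xtr, ytr, M=100)
print("test error:", np.mean(predict(stumps, alphas, Xte) != yte))
```

The clip on err_m guards against the degenerate case where a stump classifies the weighted sample perfectly, which would otherwise make alpha_m infinite.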
(This uses the transformation $-y_i \hat{G}_m(x_i) = 2\, I\{ y_i \neq \hat{G}_m(x_i) \} - 1$, the same $\{0,1\} \to \{-1,1\}$ trick as before.) Note that $2\hat{\beta}_m$ is what we called $\alpha_m$ in the AdaBoost algorithm, and the factor $e^{-\hat{\beta}_m}$ cancels out when the weights are normalized.

See Table 12.1, p. 381. The population minimizer of the binomial log-likelihood loss function is
$$f(X) = \log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)}.$$
Similarly, the population minimizer of the exponential loss function is
$$f(X) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)},$$
because, writing $p_1 = \Pr(Y = 1 \mid X)$ and $p_{-1} = \Pr(Y = -1 \mid X)$,
$$E\big[ e^{-Ya} \mid X \big] = p_1 e^{-a} + p_{-1} e^{a}, \qquad \frac{\partial}{\partial a} E\big[ e^{-Ya} \mid X \big] = -p_1 e^{-a} + p_{-1} e^{a} = 0 \;\Longrightarrow\; e^{2a} = \frac{p_1}{p_{-1}} \;\Longrightarrow\; a = \frac{1}{2} \log \frac{p_1}{p_{-1}}.$$

– Table 10.1 (Sec. 10.7, p. 313) shows characteristics of different learning methods.
– Section 10.10 shows how to improve performance by using a loss function that is less sensitive to extremely wrong predictions.
– Sections 10.11-12 show how to improve performance by limiting M or by regularization.

The great debate: why does boosting work?

– Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, Schapire, Freund, Bartlett, Lee, The Annals of Statistics, 26(5):1651-1686, 1998. See also Schapire's boosting page, http://www.cs.princeton.edu/~schapire/boost.html .
   o "Achieving a large margin on the training set results in an improved bound on the generalization error."
– Breiman's 2002 Ann. Stat. article (Wald Lectures), http://statwww.berkeley.edu/users/breiman/wald2002-1.pdf
   o Cites a counter-example: "increasing the margin" is not the answer.
   o Shows that if the weak-learner set is all trees with T terminal nodes, and T > #covariates, then the class is complete: {all linear combinations of the trees} spans {all square-integrable functions}.
   o Equality is not enough: e.g. f(x1, x2) = I{sign(x1) = sign(x2)}.
   o Consistency for finite samples may not be achieved unless regularization is used.
   o "A big win is possible with weak learners as long as their correlation and bias are low."

Example in bioinformatics: see Dettling's page, http://stat.ethz.ch/~dettling/boosting.html , for R code ("LogitBoost") and an application to the Golub leukemia data.
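Returning to the exponential loss: the population-minimizer calculation above is easy to confirm numerically. The short Python check below (SciPy-based; purely illustrative, not from the text) minimizes $p_1 e^{-a} + p_{-1} e^{a}$ over $a$ for a few values of $p_1$ and compares the result with $\tfrac{1}{2}\log(p_1/p_{-1})$.

```python
# Numerical check that argmin_a E[exp(-Y a) | X] = 0.5 * log(p1 / (1 - p1)),
# where p1 = Pr(Y = 1 | X).  Illustrative only.
import numpy as np
from scipy.optimize import minimize_scalar

for p1 in (0.1, 0.3, 0.5, 0.8, 0.95):
    risk = lambda a, p=p1: p * np.exp(-a) + (1 - p) * np.exp(a)   # E[e^{-Ya} | X]
    a_hat = minimize_scalar(risk, bounds=(-10, 10), method="bounded").x
    a_theory = 0.5 * np.log(p1 / (1 - p1))                        # half the log-odds
    print(f"p1 = {p1:4.2f}   numerical minimizer = {a_hat:+.4f}   0.5*log-odds = {a_theory:+.4f}")
```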