2006-04-12
BIOSTAT 2015
Statistical Foundations for Bioinformatics Data Mining
Target readings: Hastie, Tibshirani, Friedman, Chapter 10 Boosting
The boosting idea:
To predict a group membership $Y \in \{-1, 1\}$ given $X$, combine a set of "weak" classifiers $\{x \mapsto \hat{G}_m(x) : m = 1, \ldots, M\}$ by

$$\hat{G}(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \right).$$

That is, the prediction is the winner of an "election" in which each classifier gets a vote of $\alpha_m$. Algorithm:
1. Initialize weights $w_i = 1/N$ for $i = 1, \ldots, N$.
2. For $m = 1, \ldots, M$:
   a. fit classifier $\hat{G}_m$ to the training data using the weights $w_i$,
   b. compute the weighted average error
      $$\mathrm{err}_m = \frac{\sum_{i=1}^{N} w_i \, I\{y_i \neq \hat{G}_m(x_i)\}}{\sum_{i=1}^{N} w_i},$$
   c. compute a vote size, $\alpha_m = \log \dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m}$,
   d. re-compute the weights:
      $$w_i \leftarrow w_i \exp\left( \alpha_m I\{y_i \neq \hat{G}_m(x_i)\} \right) = \begin{cases} w_i & \text{if } y_i = \hat{G}_m(x_i) \\[4pt] w_i \dfrac{1 - \mathrm{err}_m}{\mathrm{err}_m} & \text{if } y_i \neq \hat{G}_m(x_i). \end{cases}$$
3. Output $\hat{G}(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \right)$.
(See Fig 10.1.)
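The algorithm is compact enough to sketch in code. The following Python implementation is a minimal illustration (my own sketch, not part of the course materials): it assumes numpy is available, uses decision stumps fit by exhaustive search as the weak classifiers, codes the labels in {-1, +1}, and clips err_m away from 0 and 1 so the vote size stays finite.

```python
# A minimal sketch of the boosting algorithm above (illustrative, not course code).
# Assumptions: numpy is available; the weak classifier G_m is a decision stump
# fit by exhaustive search over (feature, threshold, sign); labels y_i are in {-1, +1}.
import numpy as np

def fit_stump(X, y, w):
    """Return (feature j, threshold t, sign s) minimizing the weighted error of s*sign(x_j - t)."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.where(X[:, j] > t, 1, -1)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best

def stump_predict(X, stump):
    j, t, s = stump
    return s * np.where(X[:, j] > t, 1, -1)

def adaboost(X, y, M=50):
    N = len(y)
    w = np.full(N, 1.0 / N)                    # 1. initialize weights w_i = 1/N
    stumps, alphas = [], []
    for m in range(M):                         # 2. for m = 1, ..., M
        stump = fit_stump(X, y, w)             # 2a. fit G_m using the weights w_i
        miss = stump_predict(X, stump) != y
        err = np.sum(w * miss) / np.sum(w)     # 2b. weighted average error err_m
        err = np.clip(err, 1e-12, 1 - 1e-12)   # keep the vote size finite
        alpha = np.log((1 - err) / err)        # 2c. vote size alpha_m
        w = w * np.exp(alpha * miss)           # 2d. up-weight the misclassified points
        stumps.append(stump)
        alphas.append(alpha)

    def G(Xnew):                               # 3. sign of the weighted "election"
        votes = sum(a * stump_predict(Xnew, s) for a, s in zip(alphas, stumps))
        return np.sign(votes)
    return G
```

With this sketch, G = adaboost(X, y, M) returns a function that computes the weighted vote for new data.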
Boosting seems to do remarkably well. In Fig 10.2 the training data are 2000 points with Y=1 outside a circle:
$$X_1, \ldots, X_{10} \sim N(0,1) \text{ i.i.d.}, \qquad Y_i \mid X = 2\, I\left\{ \textstyle\sum_j X_j^2 > \chi^2_{10}(0.5) \right\} - 1 \in \{-1, 1\}.$$
(Note use of the $\{0,1\} \to \{-1,1\}$ transformation.)
The method is boosting "stumps" (simple thresholding on one variable).
Up to M=400 stumps, test data performance is still improving.
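For concreteness, data of this kind can be simulated as follows; this is my own sketch (not the authors' code) and assumes numpy and scipy are available, with chi2.ppf(0.5, df=10) ≈ 9.34 playing the role of $\chi^2_{10}(0.5)$.

```python
# A sketch of the simulated example described above: ten i.i.d. N(0,1) features,
# with Y = +1 when the squared radius exceeds the chi-squared(10) median and
# Y = -1 otherwise (the {0,1} -> {-1,1} transformation). Assumes numpy and scipy.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
N, p = 2000, 10
X = rng.standard_normal((N, p))
y = 2 * (np.sum(X**2, axis=1) > chi2.ppf(0.5, df=p)).astype(int) - 1
```

A test set drawn the same way can then be used to track test error as the number of stumps M grows toward 400.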
Boosting seems to be surprisingly resistant to overfitting. It can even
improve test error, well after the training error has gone to zero:
[Figure: training and test error of boosted C4.5, with test error continuing to fall after the training error reaches zero.]
From: Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, Schapire, Freund, Bartlett, Lee, The Annals of Statistics, 26(5):1651-1686, 1998.
See also Fig 10.3 HTF.
Boosting fits an additive model
$$\hat{G}(x) = \operatorname{sign}\left( \sum_{m=1}^{M} \alpha_m \hat{G}_m(x) \right)$$
is roughly a special case of a dictionary method,
$$E(Y \mid X = x) = f(x) = \sum_{m=1}^{M} \beta_m b(x; \gamma_m).$$
Other examples: single-layer neural networks (Ch 11), wavelets in signal
processing (Ch 5), MARS (multivariate adaptive regression splines, Ch 9), and
CART (Ch 9).
Forward Stagewise (stepwise), in general
Replace the global optimization
$$\left\{ \hat{\beta}_1, \ldots, \hat{\beta}_M,\ \hat{\gamma}_1, \ldots, \hat{\gamma}_M \right\} = \arg\min_{\beta_1 \ldots \beta_M,\ \gamma_1 \ldots \gamma_M} \sum_{i=1}^{N} L\left( y_i,\ \sum_{m=1}^{M} \beta_m b(x_i; \gamma_m) \right)$$
by the simpler problem, building in a stagewise “greedy” method:
$$\left( \hat{\beta}_m, \hat{\gamma}_m \right) = \arg\min_{\{\beta_m, \gamma_m\}} \sum_{i=1}^{N} L\left( y_i,\ f_{m-1}(x_i) + \beta_m b(x_i; \gamma_m) \right) = \arg\min_{\{\beta_m, \gamma_m\}} \sum_{i=1}^{N} L\left( r_{im},\ \beta_m b(x_i; \gamma_m) \right),$$
where the residual is $r_{im} = y_i - \sum_{m'=1}^{m-1} \beta_{m'} b(x_i; \gamma_{m'})$.
The last equality assumes that the loss function depends only on the difference between the predicted and true values. This includes squared-error loss and misclassification error loss.
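To make the stagewise recipe concrete, here is a small Python sketch of forward stagewise fitting under squared-error loss (my own illustration, assuming numpy): the basis functions b(x; γ) are regression stumps, each stage fits the current residuals r_im, and the coefficient β_m is absorbed into the stump's two fitted levels.

```python
# Forward stagewise fitting with squared-error loss: a minimal sketch assuming numpy.
# The base learner b(x; gamma) is a regression stump (one split, two fitted levels);
# each stage greedily fits the current residuals r_im = y_i - f_{m-1}(x_i).
import numpy as np

def fit_regression_stump(X, r):
    """Exhaustive search for the (feature, threshold) stump minimizing squared error on r."""
    best_sse, best = np.inf, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:          # skip the largest value (empty right side)
            left = X[:, j] <= t
            cl, cr = r[left].mean(), r[~left].mean()
            sse = np.sum((r[left] - cl) ** 2) + np.sum((r[~left] - cr) ** 2)
            if sse < best_sse:
                best_sse, best = sse, (j, t, cl, cr)
    return best

def forward_stagewise(X, y, M=100):
    f = np.zeros(len(y))                           # f_0 = 0
    stumps = []
    for m in range(M):
        r = y - f                                  # residuals under squared-error loss
        j, t, cl, cr = fit_regression_stump(X, r)  # greedy fit of the m-th basis function
        f += np.where(X[:, j] <= t, cl, cr)        # f_m = f_{m-1} + beta_m * b(x; gamma_m)
        stumps.append((j, t, cl, cr))
    return stumps, f
```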
AdaBoost = Forward stagewise with exponential loss
See Fig 10.4.
Recall the loss function
$$L\left( y, f(x) \right) = e^{-y f(x)} = \begin{cases} e^{-1} & \text{if } y = f(x) \\ e^{+1} & \text{if } y \neq f(x) \end{cases}$$
(the two-case form applies when $f(x) \in \{-1, 1\}$).
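As a small illustration (my own, assuming numpy), the snippet below tabulates the exponential loss as a function of the margin $y f(x)$ when $f$ is not restricted to $\{-1, 1\}$: positive margins are penalized mildly, while confidently wrong predictions are penalized exponentially, in contrast to 0-1 loss.

```python
# Exponential loss exp(-y*f(x)) versus 0-1 loss as a function of the margin y*f(x);
# a small illustration assuming numpy only.
import numpy as np

margins = np.linspace(-2.0, 2.0, 9)           # y*f(x); positive means a correct classification
exp_loss = np.exp(-margins)                   # L(y, f(x)) = exp(-y f(x))
zero_one = (margins < 0).astype(float)        # I{y != sign(f(x))}
for m, e, z in zip(margins, exp_loss, zero_one):
    print(f"margin {m:+.1f}:  exp loss {e:6.3f}   0-1 loss {z:.0f}")
```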
Let's figure out what "Stagewise" gives when the basis $\{b\}$ = {all members of a weak classifier family}, so that $b(x; \gamma_m) = G(x) \in \{-1, 1\}$, e.g. $b(x; j, \tau_j) = 2\, I\{x_j > \tau_j\} - 1$:
$$\left( \hat{\beta}_m, \hat{G}_m \right) = \arg\min_{\beta, G} \sum_{i=1}^{N} \exp\left[ -y_i \left( \hat{f}_{m-1}(x_i) + \beta\, G(x_i) \right) \right].$$
But
$$\sum_{i=1}^{N} \exp\left[ -y_i \left( \hat{f}_{m-1}(x_i) + \beta\, G(x_i) \right) \right] = \sum_{i=1}^{N} w_i^{(m)} \exp\left( -\beta\, y_i G(x_i) \right) = e^{-\beta} \sum_{i:\, y_i = G(x_i)} w_i^{(m)} + e^{\beta} \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)},$$
where $w_i^{(m)} = \exp\left( -y_i \hat{f}_{m-1}(x_i) \right)$.

So for any fixed $\beta > 0$,
$$\hat{G}_m = \arg\min_{G} \sum_{i:\, y_i \neq G(x_i)} w_i^{(m)}.$$

Next, minimizing over $\beta$ gives
$$e^{2 \hat{\beta}_m} = \frac{\sum_{i:\, y_i = \hat{G}_m(x_i)} w_i^{(m)}}{\sum_{i:\, y_i \neq \hat{G}_m(x_i)} w_i^{(m)}}, \qquad \text{i.e.} \qquad \hat{\beta}_m = \frac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m}, \qquad \text{where } \mathrm{err}_m = \frac{\sum_{i:\, y_i \neq \hat{G}_m(x_i)} w_i^{(m)}}{\sum_{i=1}^{N} w_i^{(m)}}.$$

Finally,
$$\hat{f}_m(x) = \hat{f}_{m-1}(x) + \hat{\beta}_m \hat{G}_m(x),$$
and
$$w_i^{(m+1)} = \exp\left( -y_i \hat{f}_m(x_i) \right) = w_i^{(m)} \exp\left( -\hat{\beta}_m y_i \hat{G}_m(x_i) \right) = w_i^{(m)} \exp\left( 2 \hat{\beta}_m\, I\{y_i \neq \hat{G}_m(x_i)\} \right) \exp\left( -\hat{\beta}_m \right).$$
2006-04-12
BIOSTAT 2015
Statistical Foundations for Bioinformatics Data Mining
(The last step uses the transformation $-y_i \hat{G}_m(x_i) = 2\, I\{y_i \neq \hat{G}_m(x_i)\} - 1$, i.e. the $\{0,1\} \to \{-1,1\}$ map.)
Note that $2\hat{\beta}_m$ is what we called $\alpha_m$ in the Boost algorithm, and $\exp(-\hat{\beta}_m)$ cancels out when the weights are normalized.
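This correspondence is easy to check numerically; the following is my own sketch (assuming numpy), confirming that reweighting by $\exp(2\hat{\beta}_m I\{y_i \neq \hat{G}_m(x_i)\}) \exp(-\hat{\beta}_m)$ and then normalizing gives the same weights as the boosting update with $\alpha_m = 2\hat{\beta}_m$.

```python
# Numeric check that the forward-stagewise weight update matches the boosting
# update once the weights are normalized (a sketch, assuming numpy).
import numpy as np

w = np.array([0.05, 0.20, 0.10, 0.15, 0.25, 0.05, 0.10, 0.10])          # current weights w_i^(m)
miss = np.array([True, False, False, True, False, False, False, True])  # I{y_i != G_m(x_i)}

err = np.sum(w * miss) / np.sum(w)
beta = 0.5 * np.log((1 - err) / err)
alpha = 2 * beta                                                  # the vote size used earlier

w_boost = w * np.exp(alpha * miss)                                # step 2d of the algorithm
w_stage = w * np.exp(2 * beta * miss) * np.exp(-beta)             # update derived above
print(np.allclose(w_boost / w_boost.sum(), w_stage / w_stage.sum()))   # True
```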
See Table 12.1, p. 381. The population minimizer of the binomial log-likelihood loss function is
$$f(X) = \log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)}.$$
Similarly, the population minimizer of the exponential loss function is
$$f(X) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid X)}{\Pr(Y = -1 \mid X)}$$
because
$$E_{Y \mid X}\, e^{-Ya} = p_1 e^{-a} + p_{-1} e^{a},$$
$$\frac{\partial}{\partial a} E_{Y \mid X}\, e^{-Ya} = p_1 (-e^{-a}) + p_{-1} e^{a} = 0$$
$$\Rightarrow\ \frac{p_1}{p_{-1}} = \frac{e^{a}}{e^{-a}} = e^{2a}$$
$$\Rightarrow\ a = \frac{1}{2} \log \frac{p_1}{p_{-1}}.$$
– Table 10.1 (Sec. 10.7, p. 313) shows characteristics of different learning methods.
– Section 10.10 shows how to increase performance by using a loss function less sensitive to extremely wrong predictions.
– Sections 10.11-12 show how to increase performance by limiting M or by regularization.
The great debate: why does boosting work??
 Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods, Schapire, Freund, Bartlett, Lee, The Annals of Statistics, 26(5):1651-1686, 1998. See also Schapire's boosting page, http://www.cs.princeton.edu/~schapire/boost.html .
   o “achieving a large margin on the training set results in an improved bound on the generalization error”.
 Breiman's 2002 Ann Stat article http://statwww.berkeley.edu/users/breiman/wald2002-1.pdf
   o Cites a counter-example: “increasing the margin” is not the answer.
   o Shows that if the weak learner set is all trees with T terminal nodes, and T > #covariates, then the class is complete: {all linear combinations of the trees} spans {all square-integrable functions}.
   o Equality is not enough: e.g. f(x1, x2) = I{sign(x1) = sign(x2)}.
   o Consistency for finite samples may not be achieved, unless regularization is used.
   o “A big win is possible with weak learners as long as their correlation and bias are low.”
Example in Bioinformatics:
 Dettling’s http://stat.ethz.ch/~dettling/boosting.html for an R
package “LogistBoost” and application to the Golub leukemia
data.