Download Boosting to predict unidentified account status

Document related concepts

Mixture model wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Multinomial logistic regression wikipedia , lookup

Transcript
Boosting for prediction and
inference
By Marc Sobel
Sometimes you just need a boost!!!
Learning
 Consider a setting in which we have
training (or labeled) data and test (or
unlabeled) data. (Example: (i)
learning who are the consumers of a
particular product using a training
data set in which the consumers are
identified.) We would like to ‘learn’
the unlabelled data using the labeled
data.
Some methods used in the past for
learning
Greedy Algorithms
 Greedy algorithms for learning
optimize some loss function at each
stage.
 A) CART (classification and
regression trees)
 B) MARS (multivariate adaptive
regression splines)
 C) stepwise regression.
Greed!!!
Stepwise regression (from
wikipedia)


In this example from engineering, necessity and sufficiency
are usually determined by F-tests. For additional
consideration, when planning an experiment, computer
simulation, or scientific survey to collect data for this
model, one must keep in mind the number of parameters,
P, to estimate and adjust the sample size accordingly. For K
variables, P = 1(Start)+ K(Stage I)+ (K2-K)/2(Stage II)+
3K(Stage III)= .5K2+ 3.5K + 1. For K<17, an efficient
design of experiments exists for this type of model, a BoxBehnken design,[4] augmented with positive and negative
axial points of length min(2,sqrt(int(1.5+K/4))), plus
point(s) at the origin. There are more efficient designs,
requiring fewer runs, even for K>16.
MARS (from wikipedia) [use
Bayesian Mars – see A.F.M. Smith]




Multivariate adaptive regression splines is a non-parametric
technique that builds flexible models by fitting piecewise linear
regressions.
An important concept associated with regression splines is that of
a knot. Knot is where one local regression model gives way to
another and thus is the point of intersection between two splines.
In multivariate and adaptive regression splines, basis functions are
the tool used for generalizing the search for knots. Basis functions
are a set of functions used to represent the information contained
in one or more variables. Multivariate and Adaptive Regression
Splines model almost always creates the basis functions in pairs.
Multivariate and adaptive regression spline approach deliberately
overfits the model and then prunes to get to the optimal model.
The algorithm is computationally very intensive and in practice we
are required to specify an upper limit on the number of basis
functions.
CART (according to wikipedia) [use
Bayesian Cart – see Edward George]






Decision tree learning is a common method used in data mining. Each
interior node corresponds to a variable; an arc to a child represents a
possible value of that variable. A leaf represents a possible value of target
variable given the values of the variables represented by the path from the
root.
A tree can be "learned" by splitting the source set into subsets based on an
attribute value test. This process is repeated on each derived subset in a
recursive manner. The recursion is completed when splitting is either nonfeasible, or a singular classification can be applied to each element of the
derived subset. A random forest classifier uses a number of decision trees,
in order to improve the classification rate.
In data mining, trees can be described also as the combination of
mathematical and computing techniques to aid the description,
categorisation and generalisation of a given set of data.
Data comes in records of the form:
(x, y) = (x1, x2, x3..., xk, y)
The dependent variable, Y, is the variable that we are trying to understand,
classify or generalise. The other variables, x1, x2, x3 etc., are the variables
that will help with that task.
CART (classification and regression
tree’s) – Assume 2 labels
 At each stage, we pick a cut point c(iˆ)
for a predictor random variable Xi
which optimally divides the responses
into two groups so that the resulting
entropy for the two children reduces
the entropy of the adult. (see next for
the formula)
CART (classification and regression
tree’s) – Assume 2 labels
 At each stage, we pick a cut point c(iˆ) for the
predictor variable Xi which optimally
increases the information:
 (Ij is the information from node j: n0,j is the
number of responses in node j for which y=0)
 (IG is information gain against the split nodes
j0,j1)
n  n
n 
n
 0, j  1, j
 1, j 
log 

log



nj
n
n
n
j


 j 

 j 

 nj 
 n j0 
1
IG ( j; j0 , j1 )  I j  
I 
I
 n j  j1  n j  j0




Ij 
0, j
Why are greedy algorithms ‘weak’?

The weakness of Greedy (i.e., weak
learner) algorithms
1) Greedy algorithms depend on the
‘order’ in which the algorithm is
preformed.
2) Greedy algorithms typically choose
locally optimum (rather than globally
optimal) results.
Making weak learners strong:
Generate many weak learner
algorithms by biulding them from
bootstrap samples (taken from the
training set).
A) Vote on labels for the test data.
The majority wins. This is Bagging.
B) Biuld better algorithms by
‘correcting’ the weights used by the
bootstrap samples. This is boosting.
AdaBoost
 Start with a weak learner based on a
bootstrap sample: (Xi*,Yi*) (i=1,…,n).
(Taken from the training set). We use the
notation WLt(Xi*) for the label predicted by
the weak learner WLt (at time t) . We
employ the terminology Wi,t (i=1,…,n) for
the weights at time t. We refer to each
member of the training set as an ‘instance’.
Calculate the overall weight error:
 (i)
error[t ] 

WLt ( X i* ) Yi*
Wi,t
AdaBoost (continued)
 (ii) Update the weights as follows: (errors are
typically controlled to be below (1/2))
error[t ]
t 
;
1  error[t ]
 t if WL t (X*i )  Yi*
Wi,t 1  Wi,t 
if not
1
 (iii) Select the next bootstrap sample using the new
weights. Return to step 1.
 (iv) Choose the label which minimizes the errors
over time for each given item.
 In other words, when our weak learner is correct, we
decrease the weight whereas when our weak learner
is incorrect we increase the weight. The next
bootstrap sample is selected with the new weights.
Boosting (continued again)
 (iv) Choose the label which minimizes the
errors over time for each given item. This
means that:
 t 
  t 
1  T
WL
(
x
)


log
  log 

 t



2 t 1
t 1
1   t 
1   t 
T
Where does boosting come from?
Again assume 2 labels, +1 and -1. Suppose we form a
label estimator based on indicator functions,
φ1,…,φN (with values -1,1) taking the form,
k
f ( x ,  )    j j ( x )
j 1
Assume a risk function of the form,
n
Risk (  )   exp  yi f ( xi ,  )
i 1
Note that when f(xi) agrees with yi then the
corresponding term in the sum is small whereas
when it disagrees the term gets large.
More on where boosting comes
from
 An iterative algorithm for biulding a good
classifier involves choosing the correct β’s.
We can do this step by step as follows:
 (1) Suppose β1,…,βl are known; write the
classifier involving these as:
l
f1:l ( x,  )    ji ( x)
j 1
(2) We can write the risk function for
estimating βl+1 as Risk ( )  n exp  y f ( x )  y 
l 1
i 1
i 1:l
i
i l 1l 1 ( xi )
More on where boosting comes
from
 Now we differentiate Risk ( l 1)
with respect to
βl+1 and set the result equal to 0:
 We get: (see derivation in appendix)


exp

y
f
(
x
)



i 1:l i 

1
 l 1 ( xi )yi 1

 l 1  log 

2

exp  yi f1:l ( xi ) 

l 1 ( xi )yi 1

 But this is just minus (1/2) times the log of the
ratio of the error weight to its complement.

Explanation
 Examples: The phi’s could be
decision trees based on different
bootstrap samples. The weights then
represent the amount of conditional
information which is supplied by
tree’s given their successors.
Bayesian Viewpoint (see
BART=Bayesian additive regression
Trees)
 Let P(Yi≠WL(Xi))=εt; Let Y be the
‘true’ value of the response; prior
distribution πi=.5. We then accept
Y=1 if,
P Y  1| WL1 ( x),...,WLT ( x)  (1/ 2)

t

(1   t )
WLt ( x )  0
WLt ( x )  0

WLt ( x ) 1
(1   t ) 

WLt ( x ) 1
t
Bayesian Interpretation
 Taking logs, we get,
T
T
t 1
T
t 1
T
t 1
t 1
 (1  WLt ( x)) log  t   WLt ( x) log(1   t ) 
 WLt ( x) log  t   (1  WLt ( x)) log(1   t )
 Or
T
 t  T
 t 
 (1  WLt ( x)) log 
   h( X i ) log 
  0;
1   t  t 1
1   t 
t 1
 t 
 t 
1 T
  WLt ( x) log 
      log 

 2  t 1 1   t 
1   t 
t 1
T
Bounding the error made by
boosting
 Theorem: Put
We have that
  Pw ( y  WL( x));
 t  Pwt ( yt  WLt ( xi ));
T
   4 t (1   t )
i 1
Proof: We have, for the weights associated
with incorrect classification that:
 wi,t 1
1 WLt ( X i ) Yi
=  wi,t t

( r  1  (1   )r )

<  w i,t 1  1  t  1  WLt ( X i )  Yi 
Boosting error bound (continued)
 We have that:


 w i,t 1  1  t  1  WL( X i )  Yi 


   wi,t  1  1  t 1   t 


 Putting this together over all the iterations:
T
 w i,T+1   1  1  t 1   t 
i 1
Lower bounding the error
 We can lower bound the weights by:
 The final hypothesis makes a mistake on
the predicting yi if (see Bayes)
1

 
T  WL ( x )  y
T
t i
i
  t  2 
 t
t 1
i 1
 The final weight on an instance
T
1 yi WLt ( xi )
wi,T 1  wi,0  t
t 1
Lower Bound on weights
 Putting together the last slide:
N
 wi,T 1 
i 1

WLT 1 ( xi )  yi

wi,T 1


w


i,0   t 
WLT+1 ( xi )  yi
t 1 
T


=   t 
t 1 
T
(1/ 2)
(1/ 2)
Conclusion of Proof:
 Putting the former result together with
T
 w i,T+1   1  1  t 1   t 
i 1
 We get the conclusion.
Overview
1. Introduction to the Problem;
2. Overview of Arcing (Boosting) for
purposes of Prediction.
3. Solution to the Problem
4. Results
The Problem
Over time, a utility company tends to
accumulate financial accounts which
are unidentified with regards to
whether they have already been paid
or not. These accounts can be e.g.,
duplicates of money already paid to
customers or money still owed, or etc..
Each account includes additional
(covariate) information about payment
amounts, dates, source, etc..
The Problem (continued)

A Small set of 380 training (labeled) accounts (totaling
$179,380) where the duplication status is known are
observed. (Of these training accounts, 159 were
duplicated and 221 were not). This is distinguished
from the 4970 test (unlabeled) accounts (totaling
$586,504) whose duplication status is unknown.
Covariate information is observed for both the training
and test accounts (e.g., amounts, dates, source). Note
that the training accounts correspond to much more
money per account than the test accounts (roughly $900
more per account). This is due to selection bias.

The company would like to predict, for the test accounts,
whether it owes money or not, and, more generally, the
total or proportional amount of money owed for all the
accounts.
A Solution: Boosting Cart Models
Many statistical solutions to this problem exist but
most (e.g., logistic regression) suffer from a
failure to capture the complex relationship
between covariates and response. We employ
techniques which boost CART (tree) models for
this purpose. Other solutions which do not preform
as well include bagging CART (tree) models.
References for such techniques include:
Breiman (et al) [1984,1996]
Hasti, Tibshirani, et al [2001]
Efron and Tibshirani [1993]
Freund and Schapire (adaboost) [1996, etc..]
Boosting CART Models
1. Rescale the training accounts to
adjust for the training selection bias.
2. Use re-sampling to construct CART
classifiers for the training data.
3. Boost the classifiers by adjusting the
resampling weights in conformity with
the duplication status of the training
data.
4. Calculate appropriate scaling factors
designed to make the training and test
data comparable (i.e., adjust for the
selection bias)
Boosting (continued)
5. Use the boosted classifiers to predict the
duplication status of the test data. (Below, we
prefer ‘accuracy’ to ‘error’ measures).
The training accuracy is the proportion of accounts
used for training (i.e., labeled accounts) which
were predicted correctly; the test accuracy is the
proportion of accounts not used for training (i.e.,
unlabeled accounts) which were predicted
correctly.
6. Calculate the training and test (or generalization)
accuracies.
7. Calculate specific and total amounts owed for the
unidentified accounts (i.e., for each account, the
boosted probability of being duplicated times the
amount)
Conclusion:
 Boosting and related (arcing) procedures have been
shown to be useful in predicting the labels of
unlabeled data. They are useful because:
1. The training accuracy (under suitable resampling)
is very high.
2. The test (generalization) accuracy (under suitable
resampling) is high – approaching (the best
possible) Bayes accuracy.
3. confidence intervals for estimates of parameters,
like the amounts owed, can be accurately
predicted.
Boosting induced changes in the
training sample weights
The values of accounts versus the
estimated amount owed by them
The Percentage of untrained
accounts predicted correctly
The Percentage of accounts (used
for training) which are predicted
correctly
Generalized Boosting
 Consider the problem of analyzing surveys.
A large variety of people are surveyed to
determine how likely they are to vote for
conviction on juries. It is advantageous to
design surveys which link their gender,
political affiliations, etc.. to conviction. It is
also advantageous to ordinally divide
conviction into 5 categories which
correspond to how strongly people feel
about conviction.
Generalized Boosting example
 For the response variable, we have 5
separate values; the higher the response
the greater the tendency to convict. We
would like to predict how likely participants
are to go for conviction based on their sex,
participation in sports, etc… Logistic
discrimination does not work in this
example because it does not capture the
complicated relationships between the
predictor and response variables.
Generalized boosting for the
conviction problem
 We assign a score h(x,y)=ηφ{|y-ycorrect|/σ}
which increases in proportion to how close
it is to the correct response.
 We put weights on all possible responses
(xi,y) for y=yi and also y≠yi. We update not
only the former, but also the latter weights
in a two stage procedure. First, we update
weights for each case. Second, we update
weights within each single case. We
update weights via,
Generalized boosting
 We update weights via:
qi,t ( y ) 
wi,t ( y )
wi,t
; (y  yi ); (second stage)
wi,t   qi ,t ( y )
(first stage)
y  yi


1
 t   wi,t 1  l ( xi , yi )   qi ,t ( y )h( xi , y )  ;


2 i
y

y
i


1
1 h( xi , yi )  h( xi , y ) 


wi(,ty1)  wi(,ty) t 2 
t
t 
1  t
Generalized boosting (explained)
 The error incorporates all possible
mistaken possibilities rather than just
a single one. The algorithm differs
from 2-valued boosting in that it
updates a matrix rather than a vector
of weights. This algorithm works
much better than the comparable one
which gives weight 1 to mistakes and
weight 0 to correct responses.
Why does generalized boosting
work?
 The pseudo-loss of WL on training
data (xi,yi), defined by


ploss=(1/ 2) 1  WL( xi , yi )   qi, yWL( xi , y ) 


y


 Defines the error made by the weak
learner in case i.
 The goal is to minimize the
(weighted) average of the pseudolosses.
Stochastic Gradient Boosting
 Assume training data; (Xi,Yi);
(i=1,…,n) with responses taking e.g.,
real values. We want to estimate the
relationship between the X’s and Y’s.
 Assume a model of the form,
Yi   i gi ( X i , i )
Stochastic Gradient Boosting
(continued)
 We use a two stage procedure:
l
*
*
*
 Define Resl X i , Yi  Yi    j X i*


j 1
 First, given β1,…,βl+1 and Θ1,…,Θl we
minimize

n
l 1  argmin l+1   Resl ( X i* , Yi* )  l 1gi ( X i* | l 1 )
i=1
 
2
Stochastic Gradient Boosting
 We fit the beta’s via (using the
bootstrap)

n
 l 1  argmin l+1   Resl ( X i* , Yi* )  l 1gi ( X i* | l 1)
i=1
n


 Resl X i* , Yi* gl 1( X i* | l 1)
 l+1  i 1
n
*
2
g
(
X
|

)
 l 1 i l 1
i 1
 
2
Stochastic Gradient Boosting
 Note that the new weights (i.e., the
beta’s) are proportional to the
residual error. Bootstrapping
estimates for the new parameters has
the effect of making them robust.
Proof of log ratio result
 Recall that we had the risk function
n
Risk ( l 1 )   exp  yi f1:l ( xi )  yi l 1l 1 ( xi )
i 1
 Which we would like to minimize in βl+1.
 First divide up the sum into two parts; the
first is where φl+1 correctly predicts y; the
second where it does not:
Risk ( l 1 ) 

yil 1 ( xi ) 1
+

exp  yi f1:l ( xi )  l 1
yil 1 ( xi ) 1
exp  yi f1:l ( xi )  l 1
conclusion
 We have that:
Risk ( l 1 )  exp  l 1

yil 1 ( xi ) 1
+ exp l 1

exp  yi f1:l ( xi )
yil 1 ( xi ) 1
exp  yi f1:l ( xi )
Bounding the error made by
boosting
 Theorem: Put
We have that
  Pw ( y  h( x));
 t  Pwt ( yt  h( xi ));
T
   4 t (1   t )
i 1
Proof: We have, for the weights associated
with incorrect classification that:
Pwt 1 (Yi  h( X i ))   wi ,t 1
1 h( X i ) Yi
=  wi,t t

( r  1  (1   )r )

<  w i,t 1  1  t  1  h( X i )  Yi 
Boosting error bound (continued)
 We have that:


 w i,t 1  1  t  1  h( X i )  Yi 


   wi,t  1  1  t 1   t 


 Putting this together over all the iterations:
T
 w i,T+1   1  1  t 1   t 
i 1
Lower bounding the error
 We can lower bound the weights by:
 The final hypothesis makes a mistake on
the predicting yi if (see Bayes)
1

 
T
T
 ht ( xi )  yi
  t  2 
 t
t 1
i 1
 The final weight on an instance
T
1 yi  ht ( xi )
wi,T 1  wi,0  t
t 1
Lower Bound on weights
 Putting together the last slide:
N
 wi ,T 1 
i 1

hT 1 ( xi )  yi

wi ,T 1


w


i,0  
t
h T+1 ( xi )  yi
t 1

T


=    t 
t 1

T
 (1/ 2)
 (1/ 2)
Conclusion of Proof:
 Putting the former result together with
T
 w i,T+1   1  1  t 1   t 
i 1
 We get the conclusion.