Regression trees and regression graphs:
Efficient estimators for
Generalized Additive Models
Adam Tauman Kalai
TTI-Chicago
Outline
• Generalized Additive Models (GAM)
• Computationally efficient regression
  – Model [Valiant] [Kearns&Schapire]
• Thm: Regression graph algorithm efficiently learns GAMs [New]
• Regression tree algorithm
• Regression graph algorithm [Mansour&McAllester]
• Correlation boosting [New]
Generalized Additive Models
[Hastie & Tibshirani]
Dist. D over X × Y = R^d × R
f(x) = E[y|x] = u(f1(x(1)) + f2(x(2)) + … + fd(x(d)))
monotonic u: R → R, arbitrary fi: R → R
• e.g., generalized linear models
  – u(w·x), monotonic u
  – linear/logistic models
• e.g., f(x) = e^{−||x||²} = e^{−x(1)² − x(2)² − … − x(d)²}
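For concreteness, a minimal Python sketch (not from the slides) of evaluating a GAM; the logistic link and the particular coordinate functions fi are illustrative assumptions:

```python
import numpy as np

def logistic(z):
    # A monotonic link u (Lipschitz, since |u'(z)| <= 1/4)
    return 1.0 / (1.0 + np.exp(-z))

def gam_predict(x, component_fns, link=logistic):
    # f(x) = u(f1(x(1)) + f2(x(2)) + ... + fd(x(d)))
    return link(sum(fi(xi) for fi, xi in zip(component_fns, x)))

# Illustrative choice of d = 3 arbitrary univariate functions fi
fs = [np.sin, lambda z: z ** 2, lambda z: -abs(z)]
print(gam_predict([0.5, -1.0, 2.0], fs))
```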
Non-Hodgkin’s Lymphoma International Prognostics Index
[NEJM ‘93]
# Risk factors   Relapse < 5 years   Relapse < 2 years   Death < 5 years   Death < 2 years
0,1              30%                 21%                 30%               16%
2                50%                 34%                 50%               34%
3                51%                 41%                 51%               46%
4,5              60%                 42%                 60%               66%
Risk factors: age > 60, # sites > 1, perf. status > 1, LDH > normal, stage > 2
Setup
X = R^d, Y = [0,1]
Training sample: (x1,y1),…,(xn,yn) drawn from a distribution D over X × Y
[Figure: the labeled training examples are fed to the regression algorithm, which outputs a hypothesis h: X → [0,1].]
"Training error": ε(h, train) = (1/n) Σᵢ (h(xᵢ) − yᵢ)²
"True error": ε(h) = E[(h(x) − y)²]
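As a small illustration of these two quantities (a sketch, not part of the slides; h, X, y, and sample_xy are hypothetical names), the training error averages squared residuals over the sample, while the true error is an expectation over D, approximated here by Monte Carlo:

```python
import numpy as np

def training_error(h, X, y):
    # eps(h, train) = (1/n) * sum_i (h(x_i) - y_i)^2
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds - np.asarray(y)) ** 2))

def true_error_estimate(h, sample_xy, m=100_000):
    # Monte Carlo approximation of eps(h) = E[(h(x) - y)^2],
    # where sample_xy() draws one fresh (x, y) pair from D.
    draws = [sample_xy() for _ in range(m)]
    return float(np.mean([(h(x) - y) ** 2 for x, y in draws]))
```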
Computationally-efficient regression
[Kearns&Schapire]
Definition: A efficiently learns a family of target functions F:
∀ distributions D over X × [0,1] with f(x) = E[y|x] ∈ F,
given n examples, the learning algorithm A outputs h: X → [0,1] such that,
with probability 1 − δ, the true error satisfies
  ε(h) = E[(h(x) − y)²] ≤ E[(f(x) − y)²] + poly(|f|, 1/δ)/n^c
A's runtime must be poly(n, |f|)
Outline
• Generalized Additive Models (GAM)
• Computationally efficient regression
  – Model [Valiant] [Kearns&Schapire]
• Thm: Regression graph algorithm efficiently learns GAMs [New]
• Regression tree algorithm
• Regression graph algorithm [Mansour&McAllester]
• Correlation boosting [New]
Results for GAMs [New]
[Figure: n samples ∈ X × [0,1], X ⊆ R^d, are fed to the Regression Graph Learner, which outputs h: R^d → [0,1].]
Thm: reg. graph learner efficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM, ∀ δ, with probability 1 − δ,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV log(dn/δ)) / n^{1/7}
  – runtime = poly(n, d)
Results for GAMs [New]
• f(x) = u(Σᵢ fi(x(i)))
  – u: R → R, monotonic, L-Lipschitz (L = max |u′(z)|)
  – fi: R → R, bounded total variation, V = Σᵢ ∫ |fi′(z)| dz
Thm: reg. graph learner efficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV log(dn/δ)) / n^{1/7}
  – runtime = poly(n, d)
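As a worked illustration of the parameters L and V (my example, not the slides'): take the logistic link and identity coordinate functions restricted to [0,1]; then

```latex
u(z) = \frac{1}{1 + e^{-z}}, \qquad
L = \max_z |u'(z)| = \max_z u(z)\bigl(1 - u(z)\bigr) = \tfrac{1}{4},
\qquad
f_i(z) = z \text{ on } [0,1] \;\Rightarrow\;
V = \sum_{i=1}^{d} \int_0^1 |f_i'(z)|\,dz = d .
```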
Results for GAMs [New]
[Figure: n samples ∈ X × [0,1], X ⊆ R^d, are fed to the Regression Tree Learner, which outputs h: R^d → [0,1].]
Thm: reg. tree learner inefficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV) · (log(d)/log(n))^{1/4}
  – runtime = poly(n, d)
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: a single leaf holding all of the data and predicting avg(y1,y2,…,yn).]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left leaf holds (xi,yi) with x(j) < θ and predicts avg(yi: xi(j) < θ); the right leaf holds (xi,yi) with x(j) ≥ θ and predicts avg(yi: xi(j) ≥ θ).]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left leaf predicts avg(yi: xi(j) < θ); the right child is a further test "x(j′) ≥ θ′?" whose leaves hold (xi,yi) with x(j) ≥ θ and x(j′) < θ′, predicting avg(yi: x(j) ≥ θ ∧ x(j′) < θ′), and (xi,yi) with x(j) ≥ θ and x(j′) ≥ θ′, predicting avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Tree Algorithm
• n = amount of training data
• Put all data into one leaf
• Repeat until size(RT)=n/log2(n):
  – Greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RT, train) = Σᵢ (RT(xᵢ) − yᵢ)²/n (equivalent to the "Gini" criterion)
  – Divide the data in the split node into two new leaves
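A minimal Python sketch of this greedy procedure (illustrative only, not the talk's implementation; the leaf representation and helper names are my own). Each step picks the leaf/split pair that most reduces the training squared error:

```python
import numpy as np

def best_split(X, y):
    # Best axis-aligned split x(j) >= theta for one leaf, by summed squared error.
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            right = X[:, j] >= theta
            left = ~right
            if left.sum() == 0 or right.sum() == 0:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if best is None or sse < best[2]:
                best = (j, theta, sse)
    return best

def grow_regression_tree(X, y, max_leaves):
    # Start with all data in one leaf; each leaf is (indices, predicted mean).
    leaves = [(np.arange(len(y)), y.mean())]
    while len(leaves) < max_leaves:
        # For every leaf, find its best split and the resulting error decrease.
        candidates = []
        for li, (idx, _) in enumerate(leaves):
            s = best_split(X[idx], y[idx])
            if s is not None:
                j, theta, sse = s
                gain = ((y[idx] - y[idx].mean()) ** 2).sum() - sse
                candidates.append((gain, li, j, theta))
        if not candidates:
            break
        # Greedily apply the split that most decreases eps(RT, train).
        gain, li, j, theta = max(candidates)
        idx, _ = leaves.pop(li)
        left, right = idx[X[idx, j] < theta], idx[X[idx, j] >= theta]
        leaves += [(left, y[left].mean()), (right, y[right].mean())]
    return leaves
```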
Regression Graph Algorithm
[Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left child tests "x(j″) ≥ θ″?" and the right child tests "x(j′) ≥ θ′?", giving four leaves that predict avg(yi: x(j) < θ ∧ x(j″) < θ″), avg(yi: x(j) < θ ∧ x(j″) ≥ θ″), avg(yi: x(j) ≥ θ ∧ x(j′) < θ′), and avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Graph Algorithm
[Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: the same splits, but the two middle leaves are merged into a single node holding (xi,yi) with (x(j) < θ and x(j″) ≥ θ″) or (x(j) ≥ θ and x(j′) < θ′), predicting avg(yi: (x(j) < θ ∧ x(j″) ≥ θ″) ∨ (x(j) ≥ θ ∧ x(j′) < θ′)); the outer leaves still predict avg(yi: x(j) < θ ∧ x(j″) < θ″) and avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Graph Algorithm
[Mansour&McAllester]
• Put all n training data into one leaf
• Repeat until size(RG) = n^{3/7}:
  – Split: greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RG, train) = Σᵢ (RG(xᵢ) − yᵢ)²/n
    • Divide the data in the split node into two new leaves
    • Let Δ be the decrease in ε(RG, train) from this split
  – Merge(s):
    • Greedily choose the two leaves whose merger increases ε(RG, train) as little as possible
    • Repeat merging while the total increase in ε(RG, train) from merges is ≤ Δ/2
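Continuing the sketch above (same caveats: illustrative names, leaves represented as (indices, mean) pairs), one merge round greedily pools pairs of leaves as long as the total increase in training error stays within half of the last split's decrease Δ:

```python
import numpy as np

def merge_cost(leaf_a, leaf_b, y):
    # Increase in summed squared error if the two leaves share one pooled mean.
    idx = np.concatenate([leaf_a[0], leaf_b[0]])
    pooled = ((y[idx] - y[idx].mean()) ** 2).sum()
    separate = ((y[leaf_a[0]] - leaf_a[1]) ** 2).sum() + \
               ((y[leaf_b[0]] - leaf_b[1]) ** 2).sum()
    return pooled - separate, idx

def merge_round(leaves, y, delta):
    # Merge greedily while the cumulative error increase is <= delta / 2.
    budget = delta / 2.0
    while len(leaves) > 1:
        best = min(
            (merge_cost(leaves[a], leaves[b], y) + (a, b)
             for a in range(len(leaves)) for b in range(a + 1, len(leaves))),
            key=lambda t: t[0],
        )
        cost, idx, a, b = best
        if cost > budget:
            break
        budget -= cost
        # Replacing two leaves by one merged node is what turns the tree into
        # a graph: the merged node now has more than one parent.
        leaves = [l for i, l in enumerate(leaves) if i not in (a, b)]
        leaves.append((idx, y[idx].mean()))
    return leaves
```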
Two useful lemmas
• Uniform generalization bound: for any n, with high probability over training sets (x1,y1),…,(xn,yn), a bound holds uniformly over every regression graph R. …
• Existence of a correlated split: there always exists a split I(x(i) ≤ θ) s.t. …
Motivating natural example
• X = {0,1}^d, f(x) = (x(1) + x(2) + … + x(d))/d, uniform D
• Size(RT) ≈ exp(Size(RG)^c), e.g., d = 4:
[Figure: the full depth-4 regression tree splitting in turn on x(1) > ½, x(2) > ½, x(3) > ½, x(4) > ½, with leaf predictions 0, .25, .5, .75, 1.]
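One way to see the size gap in this example (an illustration consistent with the slide's claim, not taken from it): the tree needs a leaf per assignment, while a graph can merge all depth-k nodes that share the same partial sum x(1) + … + x(k), since f depends only on that sum.

```python
from itertools import product

def tree_and_graph_sizes(d):
    # Full regression tree: one leaf per assignment in {0,1}^d -> 2^d leaves.
    tree_leaves = {bits: sum(bits) / d for bits in product([0, 1], repeat=d)}
    # Regression graph: at depth k only the partial sum x(1)+...+x(k) matters,
    # so k+1 merged nodes per level suffice -- about d^2/2 nodes in total.
    graph_nodes = sum(k + 1 for k in range(d + 1))
    return len(tree_leaves), graph_nodes

print(tree_and_graph_sizes(4))    # (16, 15)
print(tree_and_graph_sizes(10))   # (1024, 66)
```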
Regression boosting
• Incremental learning
– Suppose you find something with positive correlation with y; then regression graphs make progress
– "Weak regression" implies strong regression, i.e., small correlations can be efficiently combined to get correlation near 1 (error near 0)
– Generalizes binary classification boosting
[Kearns&Valiant, Schapire, Mansour&McAllester,…]
Conclusions
• Generalized additive models are very general
• Regression graphs, i.e., regression trees with
merging, provably estimate GAMs using
polynomial data and runtime
• Regression boosting generalizes binary
classification boosting
• Future work
– Improve algorithm/analysis
– Room for interesting work in statistics ∩ computational learning theory