Regression trees and regression graphs:
Efficient estimators for
Generalized Additive Models
Adam Tauman Kalai
TTI-Chicago
Outline
• Generalized Additive Models (GAM)
• Computationally efficient regression
  – Model [Valiant] [Kearns&Schapire]
• Thm: Regression graph algorithm efficiently learns GAMs [New]
• Regression tree algorithm
• Regression graph algorithm [Mansour&McAllester]
• Correlation boosting [New]
Generalized Additive Models
[Hastie & Tibshirani]
Dist. D over X × Y = R^d × R
f(x) = E[y|x] = u(f1(x(1)) + f2(x(2)) + … + fd(x(d)))
monotonic u: R → R, arbitrary fi: R → R
• e.g., generalized linear models
  – u(w·x), monotonic u
  – linear/logistic models
• e.g., f(x) = e^{−||x||²} = e^{−x(1)² − x(2)² − … − x(d)²}
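For concreteness, a minimal Python sketch (not from the slides) of evaluating a GAM; the logistic link and the particular coordinate functions fi are illustrative assumptions:

```python
import numpy as np

def logistic(z):
    # A monotonic link u (Lipschitz, since |u'(z)| <= 1/4)
    return 1.0 / (1.0 + np.exp(-z))

def gam_predict(x, component_fns, link=logistic):
    # f(x) = u(f1(x(1)) + f2(x(2)) + ... + fd(x(d)))
    return link(sum(fi(xi) for fi, xi in zip(component_fns, x)))

# Illustrative choice of d = 3 arbitrary univariate functions fi
fs = [np.sin, lambda z: z ** 2, lambda z: -abs(z)]
print(gam_predict([0.5, -1.0, 2.0], fs))
```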
Non-Hodgkin’s Lymphoma International Prognostics Index
[NEJM ‘93]
# Risk factors   Relapse < 5 years   Relapse < 2 years   Death < 5 years   Death < 2 years
0,1              30%                 21%                 30%               16%
2                50%                 34%                 50%               34%
3                51%                 41%                 51%               46%
4,5              60%                 42%                 60%               66%
Risk factors: age > 60, # sites > 1, perf. status > 1, LDH > normal, stage > 2
Setup
X = R^d, Y = [0,1]
Training sample: (x1,y1),…,(xn,yn) drawn from a distribution D over X × Y
[Figure: the labeled training examples are fed to the regression algorithm, which outputs a hypothesis h: X → [0,1].]
"Training error": ε(h, train) = (1/n) Σᵢ (h(xᵢ) − yᵢ)²
"True error": ε(h) = E[(h(x) − y)²]
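As a small illustration of these two quantities (a sketch, not part of the slides; h, X, y, and sample_xy are hypothetical names), the training error averages squared residuals over the sample, while the true error is an expectation over D, approximated here by Monte Carlo:

```python
import numpy as np

def training_error(h, X, y):
    # eps(h, train) = (1/n) * sum_i (h(x_i) - y_i)^2
    preds = np.array([h(x) for x in X])
    return float(np.mean((preds - np.asarray(y)) ** 2))

def true_error_estimate(h, sample_xy, m=100_000):
    # Monte Carlo approximation of eps(h) = E[(h(x) - y)^2],
    # where sample_xy() draws one fresh (x, y) pair from D.
    draws = [sample_xy() for _ in range(m)]
    return float(np.mean([(h(x) - y) ** 2 for x, y in draws]))
```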
Computationally-efficient regression
[Kearns&Schapire]
Definition: A efficiently learns a family of target functions F:
∀ distributions D over X × [0,1] with f(x) = E[y|x] ∈ F,
given n examples, the learning algorithm A outputs h: X → [0,1] such that,
with probability 1 − δ, the true error satisfies
  ε(h) = E[(h(x) − y)²] ≤ E[(f(x) − y)²] + poly(|f|, 1/δ)/n^c
A's runtime must be poly(n, |f|)
Outline
• Generalized Additive Models (GAM)
• Computationally efficient regression
  – Model [Valiant] [Kearns&Schapire]
• Thm: Regression graph algorithm efficiently learns GAMs [New]
• Regression tree algorithm
• Regression graph algorithm [Mansour&McAllester]
• Correlation boosting [New]
Results for GAMs [New]
[Figure: n samples ∈ X × [0,1], X ⊆ R^d, are fed to the Regression Graph Learner, which outputs h: R^d → [0,1].]
Thm: reg. graph learner efficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM, ∀ δ, with probability 1 − δ,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV log(dn/δ)) / n^{1/7}
  – runtime = poly(n, d)
Results for GAMs [New]
• f(x) = u(Σᵢ fi(x(i)))
  – u: R → R, monotonic, L-Lipschitz (L = max |u′(z)|)
  – fi: R → R, bounded total variation, V = Σᵢ ∫ |fi′(z)| dz
Thm: reg. graph learner efficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV log(dn/δ)) / n^{1/7}
  – runtime = poly(n, d)
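As a worked illustration of the parameters L and V (my example, not the slides'): take the logistic link and identity coordinate functions restricted to [0,1]; then

```latex
u(z) = \frac{1}{1 + e^{-z}}, \qquad
L = \max_z |u'(z)| = \max_z u(z)\bigl(1 - u(z)\bigr) = \tfrac{1}{4},
\qquad
f_i(z) = z \text{ on } [0,1] \;\Rightarrow\;
V = \sum_{i=1}^{d} \int_0^1 |f_i'(z)|\,dz = d .
```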
Results for GAMs [New]
[Figure: n samples ∈ X × [0,1], X ⊆ R^d, are fed to the Regression Tree Learner, which outputs h: R^d → [0,1].]
Thm: reg. tree learner inefficiently learns GAMs
• ∀ dist. D over X × Y with E[y|x] = f(x) ∈ GAM,
  – E[(h(x) − y)²] ≤ E[(f(x) − y)²] + O(LV) · (log(d)/log(n))^{1/4}
  – runtime = poly(n, d)
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: a single leaf holding all of the data and predicting avg(y1,y2,…,yn).]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left leaf holds (xi,yi) with x(j) < θ and predicts avg(yi: xi(j) < θ); the right leaf holds (xi,yi) with x(j) ≥ θ and predicts avg(yi: xi(j) ≥ θ).]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left leaf predicts avg(yi: xi(j) < θ); the right child is a further test "x(j′) ≥ θ′?" whose leaves hold (xi,yi) with x(j) ≥ θ and x(j′) < θ′, predicting avg(yi: x(j) ≥ θ ∧ x(j′) < θ′), and (xi,yi) with x(j) ≥ θ and x(j′) ≥ θ′, predicting avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Tree Algorithm
• n = amount of training data
• Put all data into one leaf
• Repeat until size(RT)=n/log2(n):
  – Greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RT, train) = Σᵢ (RT(xᵢ) − yᵢ)²/n (equivalent to the "Gini" criterion)
  – Divide the data in the split node into two new leaves
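A minimal Python sketch of this greedy procedure (illustrative only, not the talk's implementation; the leaf representation and helper names are my own). Each step picks the leaf/split pair that most reduces the training squared error:

```python
import numpy as np

def best_split(X, y):
    # Best axis-aligned split x(j) >= theta for one leaf, by summed squared error.
    best = None
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            right = X[:, j] >= theta
            left = ~right
            if left.sum() == 0 or right.sum() == 0:
                continue
            sse = ((y[left] - y[left].mean()) ** 2).sum() + \
                  ((y[right] - y[right].mean()) ** 2).sum()
            if best is None or sse < best[2]:
                best = (j, theta, sse)
    return best

def grow_regression_tree(X, y, max_leaves):
    # Start with all data in one leaf; each leaf is (indices, predicted mean).
    leaves = [(np.arange(len(y)), y.mean())]
    while len(leaves) < max_leaves:
        # For every leaf, find its best split and the resulting error decrease.
        candidates = []
        for li, (idx, _) in enumerate(leaves):
            s = best_split(X[idx], y[idx])
            if s is not None:
                j, theta, sse = s
                gain = ((y[idx] - y[idx].mean()) ** 2).sum() - sse
                candidates.append((gain, li, j, theta))
        if not candidates:
            break
        # Greedily apply the split that most decreases eps(RT, train).
        gain, li, j, theta = max(candidates)
        idx, _ = leaves.pop(li)
        left, right = idx[X[idx, j] < theta], idx[X[idx, j] >= theta]
        leaves += [(left, y[left].mean()), (right, y[right].mean())]
    return leaves
```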
Regression Graph Algorithm
[Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root test "x(j) ≥ θ?"; the left child tests "x(j″) ≥ θ″?" and the right child tests "x(j′) ≥ θ′?", giving four leaves that predict avg(yi: x(j) < θ ∧ x(j″) < θ″), avg(yi: x(j) < θ ∧ x(j″) ≥ θ″), avg(yi: x(j) ≥ θ ∧ x(j′) < θ′), and avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Graph Algorithm
[Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: the same splits, but the two middle leaves are merged into a single node holding (xi,yi) with (x(j) < θ and x(j″) ≥ θ″) or (x(j) ≥ θ and x(j′) < θ′), predicting avg(yi: (x(j) < θ ∧ x(j″) ≥ θ″) ∨ (x(j) ≥ θ ∧ x(j′) < θ′)); the outer leaves still predict avg(yi: x(j) < θ ∧ x(j″) < θ″) and avg(yi: x(j) ≥ θ ∧ x(j′) ≥ θ′).]
Regression Graph Algorithm
[Mansour&McAllester]
• Put all n training data into one leaf
• Repeat until size(RG) = n^{3/7}:
  – Split: greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RG, train) = Σᵢ (RG(xᵢ) − yᵢ)²/n
    • Divide the data in the split node into two new leaves
    • Let Δ be the decrease in ε(RG, train) from this split
  – Merge(s):
    • Greedily choose the two leaves whose merger increases ε(RG, train) as little as possible
    • Repeat merging while the total increase in ε(RG, train) from merges is ≤ Δ/2
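Continuing the sketch above (same caveats: illustrative names, leaves represented as (indices, mean) pairs), one merge round greedily pools pairs of leaves as long as the total increase in training error stays within half of the last split's decrease Δ:

```python
import numpy as np

def merge_cost(leaf_a, leaf_b, y):
    # Increase in summed squared error if the two leaves share one pooled mean.
    idx = np.concatenate([leaf_a[0], leaf_b[0]])
    pooled = ((y[idx] - y[idx].mean()) ** 2).sum()
    separate = ((y[leaf_a[0]] - leaf_a[1]) ** 2).sum() + \
               ((y[leaf_b[0]] - leaf_b[1]) ** 2).sum()
    return pooled - separate, idx

def merge_round(leaves, y, delta):
    # Merge greedily while the cumulative error increase is <= delta / 2.
    budget = delta / 2.0
    while len(leaves) > 1:
        best = min(
            (merge_cost(leaves[a], leaves[b], y) + (a, b)
             for a in range(len(leaves)) for b in range(a + 1, len(leaves))),
            key=lambda t: t[0],
        )
        cost, idx, a, b = best
        if cost > budget:
            break
        budget -= cost
        # Replacing two leaves by one merged node is what turns the tree into
        # a graph: the merged node now has more than one parent.
        leaves = [l for i, l in enumerate(leaves) if i not in (a, b)]
        leaves.append((idx, y[idx].mean()))
    return leaves
```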
Two useful lemmas
• Uniform generalization bound: for any n, with high probability over training sets (x1,y1),…,(xn,yn), a bound holds uniformly over every regression graph R. …
• Existence of a correlated split: there always exists a split I(x(i) ≤ θ) s.t. …
Motivating natural example
• X = {0,1}^d, f(x) = (x(1) + x(2) + … + x(d))/d, uniform D
• Size(RT) ≈ exp(Size(RG)^c), e.g., d = 4:
[Figure: the full depth-4 regression tree splitting in turn on x(1) > ½, x(2) > ½, x(3) > ½, x(4) > ½, with leaf predictions 0, .25, .5, .75, 1.]
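One way to see the size gap in this example (an illustration consistent with the slide's claim, not taken from it): the tree needs a leaf per assignment, while a graph can merge all depth-k nodes that share the same partial sum x(1) + … + x(k), since f depends only on that sum.

```python
from itertools import product

def tree_and_graph_sizes(d):
    # Full regression tree: one leaf per assignment in {0,1}^d -> 2^d leaves.
    tree_leaves = {bits: sum(bits) / d for bits in product([0, 1], repeat=d)}
    # Regression graph: at depth k only the partial sum x(1)+...+x(k) matters,
    # so k+1 merged nodes per level suffice -- about d^2/2 nodes in total.
    graph_nodes = sum(k + 1 for k in range(d + 1))
    return len(tree_leaves), graph_nodes

print(tree_and_graph_sizes(4))    # (16, 15)
print(tree_and_graph_sizes(10))   # (1024, 66)
```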
Regression boosting
• Incremental learning
– Suppose you find something with positive correlation with y; then regression graphs make progress
– "Weak regression" implies strong regression, i.e., small correlations can be efficiently combined to get correlation near 1 (error near 0)
– Generalizes binary classification boosting
[Kearns&Valiant, Schapire, Mansour&McAllester,…]
Conclusions
• Generalized additive models are very general
• Regression graphs, i.e., regression trees with
merging, provably estimate GAMs using
polynomial data and runtime
• Regression boosting generalizes binary
classification boosting
• Future work
– Improve algorithm/analysis
– Room for interesting work in statistics ∩ computational learning theory