Learning and smoothed analysis
Adam Kalai, Microsoft Research, Cambridge, MA
Alex Samorodnitsky*, Hebrew University, Jerusalem
Shang-Hua Teng*, University of Southern California
*while visiting Microsoft
In this talk…
• Revisit classic learning problems
  – e.g., learn DNFs from random examples (drawn from product distributions)
• Barrier = worst-case complexity
• Solve in a new model!
• Smoothed analysis sheds light on the structure of hard problem instances
• Also show: a DNF can be recovered from its heavy “Fourier coefficients”
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  1 |  –1
x³      |  0  1  1  1  1  1  1  1 |  –1
x⁴      |  1  1  1  0  0  0  0  1 |  +1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | NIGERIA BANK VIAGRA ADAM LASER SALE FREE IN  | f(x)
x¹      | YES     YES  YES    NO   NO    YES  NO   YES | SPAM
x²      | YES     NO   NO     YES  YES   YES  YES  YES | LEGIT
x³      | NO      YES  YES    YES  YES   YES  YES  YES | LEGIT
x⁴      | YES     YES  YES    NO   NO    NO   NO   YES | SPAM
x⁵      | YES     YES  YES    YES  YES   NO   YES  YES | SPAM
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  0 |  –1
x³      |  0  1  1  1  1  1  1  1 |  +1
x⁴      |  0  1  1  0  0  0  0  1 |  –1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
Output: h: X → {–1,+1} with
err(h)=Prx←D[h(x)≠f(x)] ≤ ε
*OPTIONAL* “Proper” learning: h is an AND
Agnostic P.A.C. learning AND’s!? [Kearns-Schapire-Sellie92]
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  0 |  –1
x³      |  0  1  1  1  1  1  1  1 |  +1
x⁴      |  0  1  1  0  0  0  0  1 |  –1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
Output: h: X → {–1,+1} with
err(h) = Prx←D[h(x)≠f(x)] ≤ ε + opt, where opt = minAND g err(g)
Some related work
• AND, e.g., x2˄x4˄x7˄x9
  – PAC: EASY
  – Agnostic: ?
• Decision trees, e.g., [figure: example decision tree branching on x1, x2, x7, x9 with ± leaves]
  – PAC: Uniform D + Mem queries [Kushilevitz-Mansour’91; Goldreich-Levin’89]; Mem queries [Bshouty’94]
  – Agnostic: Uniform D + Mem queries [Gopalan-K-Klivans’08]
• DNF, e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
  – PAC: Uniform D + Mem queries [Jackson’94]
  – Agnostic: ?
Some related work
• AND, e.g., x2˄x4˄x7˄x9
  – PAC: EASY
  – Agnostic: ?
• Decision trees, e.g., [figure: example decision tree branching on x1, x2, x7, x9 with ± leaves]
  – PAC: Product D + Mem queries [Kushilevitz-Mansour’91; Goldreich-Levin’89]; Mem queries [Bshouty’94]; Product D [KST’09] (smoothed analysis)
  – Agnostic: Product D + Mem queries [Gopalan-K-Klivans’08]; Product D [KST’09] (smoothed analysis)
• DNF, e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
  – PAC: Product D + Mem queries [Jackson’94]; Product D [KST’09] (smoothed analysis)
  – Agnostic: ?
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why DNF can be recovered from heavy coefficients (information-theoretically)
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Feature Construction “Heuristic”
≈ [SuttonMatheus91]
Approach: greedily learn a sparse polynomial, bottom-up, using least-squares regression.
1. Normalize the input (x1,y1),(x2,y2),…,(xm,ym) so that each attribute xi has mean 0 and variance 1.
2. F := {1, x1, x2, …, xn}
3. Repeat m¼ times: F := F ∪ { t·xi } for the t ϵ F and xi of minimum regression error; e.g., for F = {1, x1, x2, x3, x1², x1x3}:

$$\min_{w}\ \sum_{j=1}^{m}\Bigl(w_0 + w_1 x_1^j + w_2 x_2^j + w_3 x_3^j + w_4 (x_1^j)^2 + w_5 x_1^j x_3^j - y^j\Bigr)^2$$
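For concreteness, here is a minimal (and deliberately unoptimized) Python sketch of this greedy feature-construction loop; the function and variable names are mine, and solving the regression from scratch for every candidate is just the simplest way to mirror the steps above.

```python
import numpy as np

def feature_construction(X, y, rounds=None):
    """Greedy bottom-up sparse polynomial regression (sketch).

    X: (m, n) array of 0/1 attributes; y: (m,) labels in {-1, +1}.
    Returns the constructed feature columns and the fitted weights.
    """
    m, n = X.shape
    # 1. Normalize each attribute to mean 0, variance 1.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # 2. Start with the constant feature and the n singletons.
    feats = [np.ones(m)] + [Z[:, i] for i in range(n)]
    rounds = int(m ** 0.25) if rounds is None else rounds

    def lstsq_err(cols):
        A = np.column_stack(cols)
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.mean((A @ w - y) ** 2), w

    # 3. Each round, add the single product t * x_i (t already in F) that
    #    gives the smallest least-squares error when regressed together with F.
    for _ in range(rounds):
        best = None
        for t in feats:
            for i in range(n):
                cand = t * Z[:, i]
                err, _ = lstsq_err(feats + [cand])
                if best is None or err < best[0]:
                    best = (err, cand)
        feats.append(best[1])

    _, w = lstsq_err(feats)
    return feats, w
```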
Guarantee for that Heuristic
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1.
For any size s decision tree f: {0,1}ⁿ → {–1,+1}, with
probability ≥ 0.99 over uniformly random μ ϵ [0.49,0.51]ⁿ
and m=poly(ns/ε) training examples (xj,f(xj))j≤m with xj iid
from πμ, the heuristic outputs h with
Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
Guarantee for that Heuristic
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1.
For any size s decision tree f: {0,1}ⁿ → {–1,+1} and any
ν ϵ [.02,.98]ⁿ, with probability ≥ 0.99 over uniformly
random μ ϵ ν+[–.01,.01]ⁿ and m=poly(ns/ε) training
examples (xj,f(xj))j≤m with xj iid from πμ, the heuristic
outputs h with
Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
*same statement for DNF alg.
Smoothed analysis assumption
[figure: the adversary picks a decision tree f: {0,1}ⁿ → {–1,+1} and a cube ν+[–.01,.01]ⁿ; μ is drawn from the cube; training examples (x⁽¹⁾, f(x⁽¹⁾)),…,(x⁽ᵐ⁾, f(x⁽ᵐ⁾)) are drawn iid from the product distribution πμ and fed to the learning algorithm, which outputs h with Pr[h(x)≠f(x)] ≤ ε]
“Hard” instance picture
[figure: the cube of product distributions μ ϵ [0,1]ⁿ, μi = Pr[xi=1], for a fixed tree f: {0,1}ⁿ → {–1,1} and product distribution πμ; red marks the μ where the heuristic fails; “can’t be this” (a large red region)]
“Hard” instance picture
[figure: the same cube of product distributions μ ϵ [0,1]ⁿ, μi = Pr[xi=1], for a fixed tree f; red marks the μ where the heuristic fails]
Theorem 1: “hard” instances are few and far between for any tree.
Fourier over product distributions
• x ϵ {0,1}ⁿ, μ ϵ [0,1]ⁿ, Ex←πμ[xi] = μi
• Coordinates xi are normalized to mean 0, variance 1:

$$\tilde x_i = \frac{x_i - \mu_i}{\sqrt{\mu_i(1-\mu_i)}}, \qquad \tilde x_S = \prod_{i\in S}\tilde x_i \ \text{ for } S \subseteq [n] \quad (\text{also called } \chi_S(x,\mu))$$

$$f(x) = \sum_S \hat f(S)\,\tilde x_S, \quad \text{where } \hat f(S) = \mathbb{E}_{\mu}\bigl[f(x)\,\tilde x_S\bigr]$$

$$\|\hat f\|_2^2 = \sum_S \hat f^2(S) = 1 \ \text{(Parseval)}, \qquad \|\hat f\|_1 = \sum_S |\hat f(S)|$$
Heuristic over product distributions

$$\tilde x_i = \frac{x_i - \mu_i}{\sqrt{\mu_i(1-\mu_i)}} \quad (\mu_i \text{ can easily be estimated from data}), \qquad \hat f(S) \approx \frac{1}{m}\sum_{j=1}^m \tilde x_S^{\,j}\, y^j \quad (\text{easy to approximate any individual coefficient})$$

1) F := {1, x̃1, x̃2, …, x̃n}
2) Repeat m¼ times: F := F ∪ {x̃S}, where S is chosen to maximize |f̂(S)| subject to S = T ∪ {i} with x̃T ϵ F
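A small Python sketch of this coefficient-growing view, assuming access only to samples; the helper names, the clipping of the estimated μi, and the number of rounds are illustrative choices rather than the talk's.

```python
import numpy as np

def normalize(X, mu):
    """Column-wise map x_i -> (x_i - mu_i) / sqrt(mu_i (1 - mu_i))."""
    return (X - mu) / np.sqrt(mu * (1 - mu))

def estimate_coeff(Z, y, S):
    """Empirical f_hat(S) = (1/m) * sum_j chi_S(x^j) * y^j from normalized data Z."""
    chi = np.prod(Z[:, list(S)], axis=1) if S else np.ones(len(y))
    return float(np.mean(chi * y))

def grow_heavy_sets(X, y, rounds=None):
    """Bottom-up search: only extend sets already kept, one coordinate at a time,
    always adding the candidate with the largest estimated |f_hat(S)|."""
    m, n = X.shape
    mu = np.clip(X.mean(axis=0), 1e-3, 1 - 1e-3)   # mu_i estimated from the data
    Z = normalize(X, mu)
    F = [frozenset()] + [frozenset([i]) for i in range(n)]
    rounds = int(m ** 0.25) if rounds is None else rounds
    for _ in range(rounds):
        candidates = {T | frozenset([i]) for T in F for i in range(n) if i not in T}
        candidates -= set(F)
        best = max(candidates, key=lambda S: abs(estimate_coeff(Z, y, S)))
        F.append(best)
    return {S: estimate_coeff(Z, y, S) for S in F}
```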
Example
• f(x) = x2x4x9
• For uniform μ = (.5,.5,.5), xi ϵ {–1,+1}: f(x) = x2x4x9
• For μ = (.4,.6,.55),†
  f(x) = .9x2x4x9 + .1x2x4 + .3x4x9 + .2x2x9 + .2x2 – .2x4 + .1x9
[figure: the depth-3 decision tree over x2, x4, x9 computing f, with + and – leaves]
†figures not to scale
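To see the effect concretely, the following brute-force enumeration (restricted to the three relevant bits; the helper names are mine) expands the same target, i.e., the ±1 parity of those bits, in the μ-biased basis; since the slide's numbers are explicitly not to scale, the exact values printed here differ from the ones above.

```python
import itertools
import numpy as np

def chi(S, x, mu):
    """mu-biased character: product over i in S of (x_i - mu_i)/sqrt(mu_i(1-mu_i))."""
    return float(np.prod([(x[i] - mu[i]) / np.sqrt(mu[i] * (1 - mu[i])) for i in S]))

def coeffs(f, mu):
    """Exact f_hat(S) = E_mu[f(x) chi_S(x, mu)] by enumerating {0,1}^n (tiny n only)."""
    n = len(mu)
    out = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            out[S] = sum(
                np.prod([mu[i] if x[i] else 1 - mu[i] for i in range(n)]) * f(x) * chi(S, x, mu)
                for x in itertools.product([0, 1], repeat=n)
            )
    return out

# f is the +/-1 parity of the three bits, i.e., the normalized product of them
# under the uniform distribution.
parity = lambda x: (2 * x[0] - 1) * (2 * x[1] - 1) * (2 * x[2] - 1)

print(coeffs(parity, [0.5, 0.5, 0.5]))    # only the top set {0,1,2} has weight
print(coeffs(parity, [0.4, 0.6, 0.55]))   # now every subset carries some (possibly small) weight
```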
Fourier structure over random product distributions
Lemma. For any f: {0,1}ⁿ → {–1,1}, α,β > 0, and d ≥ 1,

$$\Pr_{\mu \in [.49,.51]^n}\Bigl[\,\exists\, S \supseteq T \ \text{s.t.}\ |\hat f(S)| \ge \alpha \ \wedge\ |\hat f(T)| \le \beta \ \wedge\ |T| \le d\,\Bigr] \;\le\; 200\, n^{2d}\, \frac{\beta}{\alpha^{5}}$$
Lemma. Let p: Rⁿ → R be a degree-d multilinear polynomial with a leading coefficient of 1. Then, for any ε > 0,

$$\Pr_{x \leftarrow [-1,1]^n}\bigl[\,|p(x)| \le \varepsilon\,\bigr] \;\le\; 2d\,\varepsilon^{1/d}$$

e.g., p(x) = x1x2x9 + .3x7 – 0.2
An older perspective
• [Kushilevitz-Mansour’91] and [Goldreich-Levin’89] find the heavy Fourier coefficients: f_τ(x) = Σ_{S: |f̂(S)| ≥ τ} f̂(S)·x̃S
• Really use the fact that ‖f̂_τ – f̂‖∞ ≤ τ
• Every decision tree is well approximated by its heavy coefficients because ‖f̂‖₁ ≤ #leaves
In the smoothed product-distribution setting, the heuristic finds the heavy (log-degree) coefficients.
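For completeness, here is the standard calculation behind the ‖f̂‖₁ ≤ #leaves bullet, which the slide does not spell out; it is stated for the uniform distribution, where the normalized coordinates x̃i take values ±1:

$$f(x) \;=\; \sum_{\text{leaves }\ell} (\pm 1)\cdot \mathbf{1}[x \text{ reaches } \ell],
\qquad
\mathbf{1}[x \text{ reaches } \ell] \;=\; \prod_{i \in \text{path}(\ell)} \frac{1 \pm \tilde x_i}{2},$$

and each factor (1 ± x̃i)/2 has Fourier ℓ₁ norm 1, so ‖f̂‖₁ ≤ #leaves. Discarding every coefficient with |f̂(S)| < τ then costs at most Σ_{|f̂(S)|<τ} f̂(S)² ≤ τ·‖f̂‖₁ ≤ τ·#leaves in squared ℓ₂ error, so keeping only the heavy coefficients approximates the tree well.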
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why DNF can be recovered from heavy coefficients (information-theoretically)
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Learning DNF
• Adversary picks a DNF f(x) = C1(x)˅C2(x)˅…˅Cs(x) (and ν ϵ [.02,.98]ⁿ)
• Step 1: find f≥ε, the part of f made up of its Fourier coefficients of magnitude ≥ ε
• [BFJKMR’94, Jackson’95]: “KM gives a weak learner,” combined with careful boosting
• Cannot use boosting in the smoothed setting
• Solution: learn the DNF from f≥ε alone!
  – Design a robust membership-query DNF learning algorithm, and give it query access to f≥ε
DNF learning algorithm
f(x) = C1(x)˅C2(x)˅…˅Cs(x), e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
Each Ci is a “linear threshold function,” e.g., sgn(x1 + x4 – 1.5)
[KKanadeMansour’09] approach + other stuff
I’m a burier (of details)
burier, noun, pl. –s: one that buries.
DNF recoverable from heavy coefficients
Information-theoretic lemma (uniform distribution). For any s-term DNF f and any g: {0,1}ⁿ → {–1,1},

$$\Pr[f(x) \ne g(x)] \;\le\; \Bigl(s + \tfrac12\Bigr)\,\|\hat f - \hat g\|_\infty$$

Thanks, Madhu! Maybe similar to Bazzi/Braverman/Razborov?
DNF recoverable from heavy coefficients
Information-theoretic lemma (uniform distribution). For any s-term DNF f and any g: {0,1}ⁿ → {–1,1},

$$\Pr[f(x) \ne g(x)] \;\le\; \Bigl(s + \tfrac12\Bigr)\,\|\hat f - \hat g\|_\infty$$

Proof. Write f(x) = C1(x)˅…˅Cs(x), where f(x) ϵ {–1,1} but Ci(x) ϵ {0,1}. Pointwise, the indicator of a disagreement is at most (g(x) – f(x))·(½ – Σi Ci(x)): this is 0 when g = f, equals 1 when f = –1 and g = +1 (no term satisfied), and is ≥ 1 when f = +1 and g = –1 (some term satisfied). Hence

$$\Pr[f \ne g] \;\le\; \mathbb{E}\Bigl[(g(x) - f(x))\Bigl(\tfrac12 - \sum_i C_i(x)\Bigr)\Bigr] \;=\; (\hat g - \hat f)\cdot\Bigl(\tfrac{\hat 1}{2} - \sum_i \hat C_i\Bigr) \;\le\; \|\hat g - \hat f\|_\infty\Bigl(\tfrac12 + \sum_i \|\hat C_i\|_1\Bigr) \;\le\; \Bigl(\tfrac12 + s\Bigr)\|\hat g - \hat f\|_\infty,$$

since each ‖Ĉi‖₁ ≤ 1, e.g., x1˄x2 = ((1+x1)/2)·((1+x2)/2).
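A brute-force sanity check of the lemma under the uniform distribution (exhaustive over a tiny cube; the particular DNF, the perturbations of g, and the helper names are all illustrative choices of mine, not the talk's):

```python
import itertools
import random
from math import prod

def fourier(vals, n):
    """Uniform-distribution Fourier coefficients of a function given as a dict x -> +/-1."""
    return {
        S: sum(v * prod(2 * x[i] - 1 for i in S) for x, v in vals.items()) / 2 ** n
        for k in range(n + 1) for S in itertools.combinations(range(n), k)
    }

n, terms = 4, [(0, 2), (1, 3)]                   # s = 2 terms: (x1 /\ x3) \/ (x2 /\ x4)
cube = list(itertools.product([0, 1], repeat=n))
f = {x: 1 if any(all(x[i] for i in t) for t in terms) else -1 for x in cube}

random.seed(0)
for flips in (1, 2, 4):
    g = dict(f)
    for x in random.sample(cube, flips):         # g agrees with f except on a few points
        g[x] = -g[x]
    lhs = sum(f[x] != g[x] for x in cube) / 2 ** n
    fc, gc = fourier(f, n), fourier(g, n)
    rhs = (len(terms) + 0.5) * max(abs(fc[S] - gc[S]) for S in fc)
    print(f"Pr[f!=g] = {lhs:.3f}  <=  (s+1/2)*||f^-g^||_inf = {rhs:.3f}")
    assert lhs <= rhs + 1e-9
```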
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why heavy coefficients characterize a DNF
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Agnostically learning decision trees
• Adversary picks an arbitrary f: {0,1}ⁿ → {–1,+1} and ν ϵ [.02,.98]ⁿ
• Nature picks μ ϵ ν + [–.01,.01]ⁿ
• These determine the best size-s decision tree f*
• Guarantee: err(h) ≤ opt + ε, where opt = err(f*)
Agnostically learning decision trees
Design a robust membership-query learning algorithm that works as long as queries are answered by some g with ‖f̂ – ĝ‖∞ ≤ τ.
• Solve:

$$\min_{h:\ \|\hat h\|_1 \le s}\ \mathbb{E}\bigl[\,|1 - f(x)h(x)|\,\bigr]$$

(approximately solved using the [GKK’08] approach)
• Robustness:

$$\bigl|\mathbb{E}\bigl[g(x)h(x) - f(x)h(x)\bigr]\bigr| \;=\; \bigl|(\hat g - \hat f)\cdot \hat h\bigr| \;\le\; \|\hat g - \hat f\|_\infty \cdot \|\hat h\|_1 \;\le\; \tau s$$
The gradient-projection descent algorithm
• Find f≥ε: {0,1}ⁿ → R using the heuristic.
• h¹ = 0
• For t = 1,…,T:
  – h^(t+1) = proj_s( KM( h^t(x) + τ₁·sgn( f≥ε(x) – h^t(x) ) ) )
• Output h(x) = sgn(h^t(x) – θ) for the t ≤ T and θ ϵ [–1,1] that minimize error on a held-out data set.
Closely following [GopalanKKlivans’08]
Projection
• proj_s(h) = argmin over g with ‖ĝ‖₁ ≤ s of ‖ĝ – ĥ‖₂
From [GopalanKKlivans’08]
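By Parseval this is just the Euclidean projection of the coefficient vector onto an ℓ₁ ball, which soft-thresholding computes; below is a standard sketch of that step (my own code, not the talk's), operating on an explicit vector of coefficients.

```python
import numpy as np

def project_l1(v, s):
    """Euclidean projection of the coefficient vector v onto the l1 ball of radius s:
    argmin_{||u||_1 <= s} ||u - v||_2, computed by soft-thresholding."""
    if np.abs(v).sum() <= s:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - s))[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)         # shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Example: project a dense coefficient vector onto the ball ||.||_1 <= 2.
coeffs = np.array([0.9, -0.6, 0.5, 0.3, -0.2, 0.1])
p = project_l1(coeffs, 2.0)
print(p, np.abs(p).sum())                        # l1 norm comes out to 2.0
```

In the full algorithm, the projection would be applied to the (sparse) list of coefficients returned by KM at each iteration.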
Conclusions
• Smoothed complexity [SpielmanTeng01]
  – Compromise between worst-case and average-case
  – Novel application to learning over product distributions
• Assumption: not a completely adversarial relationship between the target f and the distribution D
• Weaker than “margin” assumptions
• Future work
  – Non-product distributions
  – Other smoothed analysis applications
Thanks!
Sorry!
Average-case complexity [JacksonServedio05]
• [JS05] gives a polytime algorithm that learns most DTs under the uniform distribution on {0,1}ⁿ
• Random DTs are sometimes easier than real ones
“Random is not typical” (courtesy of Dan Spielman)