Learning and smoothed analysis
Adam Kalai, Microsoft Research, Cambridge, MA
Alex Samorodnitsky*, Hebrew University, Jerusalem
Shang-Hua Teng*, University of Southern California
*while visiting Microsoft
In this talk…
• Revisit classic learning problems
  – e.g., learn DNFs from random examples (drawn from product distributions)
• Barrier = worst-case complexity
• Solve in a new model!
• Smoothed analysis sheds light on the structure of hard problem instances
• Also show: a DNF can be recovered from its heavy “Fourier coefficients”
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  1 |  –1
x³      |  0  1  1  1  1  1  1  1 |  –1
x⁴      |  1  1  1  0  0  0  0  1 |  +1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | NIGERIA BANK VIAGRA ADAM LASER SALE FREE IN  | f(x)
x¹      | YES     YES  YES    NO   NO    YES  NO   YES | SPAM
x²      | YES     NO   NO     YES  YES   YES  YES  YES | LEGIT
x³      | NO      YES  YES    YES  YES   YES  YES  YES | LEGIT
x⁴      | YES     YES  YES    NO   NO    NO   NO   YES | SPAM
x⁵      | YES     YES  YES    YES  YES   NO   YES  YES | SPAM
P.A.C. learning AND’s!? [Valiant84] (noiseless)
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  0 |  –1
x³      |  0  1  1  1  1  1  1  1 |  +1
x⁴      |  0  1  1  0  0  0  0  1 |  –1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
Output: h: X → {–1,+1} with
err(h)=Prx←D[h(x)≠f(x)] ≤ ε
*OPTIONAL* “Proper” learning: h is an AND
Agnostic P.A.C. learning AND’s!? [Kearns-Schapire-Sellie92]
X = {0,1}ⁿ, f: X → {–1,+1}
PAC assumption: target is an AND, e.g., f(x) = x2˄x4˄x7
Input: training data (xʲ from D, f(xʲ)) for j ≤ m
example | x1 x2 x3 x4 x5 x6 x7 x8 | f(x)
x¹      |  1  1  1  0  0  1  0  1 |  +1
x²      |  1  0  0  1  1  1  1  0 |  –1
x³      |  0  1  1  1  1  1  1  1 |  +1
x⁴      |  0  1  1  0  0  0  0  1 |  –1
x⁵      |  1  1  1  1  1  0  1  1 |  +1
1. Succeed with prob. ≥ 0.99
2. m = # examples = poly(n/ε)
3. Polytime learning algorithm
Output: h: X → {–1,+1} with
err(h) = Prx←D[h(x)≠f(x)] ≤ ε + opt, where opt = minAND g err(g)
Some related work
• AND, e.g., x2˄x4˄x7˄x9
  – PAC: EASY
  – Agnostic: ?
• Decision trees, e.g., [figure: example decision tree branching on x1, x2, x7, x9 with ± leaves]
  – PAC: Uniform D + Mem queries [Kushilevitz-Mansour’91; Goldreich-Levin’89]; Mem queries [Bshouty’94]
  – Agnostic: Uniform D + Mem queries [Gopalan-K-Klivans’08]
• DNF, e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
  – PAC: Uniform D + Mem queries [Jackson’94]
  – Agnostic: ?
Some related work
• AND, e.g., x2˄x4˄x7˄x9
  – PAC: EASY
  – Agnostic: ?
• Decision trees, e.g., [figure: example decision tree branching on x1, x2, x7, x9 with ± leaves]
  – PAC: Product D + Mem queries [Kushilevitz-Mansour’91; Goldreich-Levin’89]; Mem queries [Bshouty’94]; Product D [KST’09] (smoothed analysis)
  – Agnostic: Product D + Mem queries [Gopalan-K-Klivans’08]; Product D [KST’09] (smoothed analysis)
• DNF, e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
  – PAC: Product D + Mem queries [Jackson’94]; Product D [KST’09] (smoothed analysis)
  – Agnostic: ?
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why DNF can be recovered from heavy coefficients (information-theoretically)
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Feature Construction “Heuristic”
≈ [SuttonMatheus91]
Approach: greedily learn a sparse polynomial, bottom-up, using least-squares regression.
1. Normalize the input (x1,y1),(x2,y2),…,(xm,ym) so that each attribute xi has mean 0 and variance 1.
2. F := {1, x1, x2, …, xn}
3. Repeat m¼ times: F := F ∪ { t·xi } for the t ϵ F and xi of minimum regression error; e.g., for F = {1, x1, x2, x3, x1², x1x3}:

$$\min_{w}\ \sum_{j=1}^{m}\Bigl(w_0 + w_1 x_1^j + w_2 x_2^j + w_3 x_3^j + w_4 (x_1^j)^2 + w_5 x_1^j x_3^j - y^j\Bigr)^2$$
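For concreteness, here is a minimal (and deliberately unoptimized) Python sketch of this greedy feature-construction loop; the function and variable names are mine, and solving the regression from scratch for every candidate is just the simplest way to mirror the steps above.

```python
import numpy as np

def feature_construction(X, y, rounds=None):
    """Greedy bottom-up sparse polynomial regression (sketch).

    X: (m, n) array of 0/1 attributes; y: (m,) labels in {-1, +1}.
    Returns the constructed feature columns and the fitted weights.
    """
    m, n = X.shape
    # 1. Normalize each attribute to mean 0, variance 1.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # 2. Start with the constant feature and the n singletons.
    feats = [np.ones(m)] + [Z[:, i] for i in range(n)]
    rounds = int(m ** 0.25) if rounds is None else rounds

    def lstsq_err(cols):
        A = np.column_stack(cols)
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.mean((A @ w - y) ** 2), w

    # 3. Each round, add the single product t * x_i (t already in F) that
    #    gives the smallest least-squares error when regressed together with F.
    for _ in range(rounds):
        best = None
        for t in feats:
            for i in range(n):
                cand = t * Z[:, i]
                err, _ = lstsq_err(feats + [cand])
                if best is None or err < best[0]:
                    best = (err, cand)
        feats.append(best[1])

    _, w = lstsq_err(feats)
    return feats, w
```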
Guarantee for that Heuristic
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1.
For any size s decision tree f: {0,1}ⁿ → {–1,+1}, with
probability ≥ 0.99 over uniformly random μ ϵ [0.49,0.51]ⁿ
and m=poly(ns/ε) training examples (xj,f(xj))j≤m with xj iid
from πμ, the heuristic outputs h with
Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
Guarantee for that Heuristic
For μ ϵ [0,1]ⁿ, let πμ be the product distribution where Ex←πμ[xᵢ] = μᵢ.
Theorem 1.
For any size s decision tree f: {0,1}ⁿ → {–1,+1} and any
ν ϵ [.02,.98]ⁿ, with probability ≥ 0.99 over uniformly
random μ ϵ ν+[–.01,.01]ⁿ and m=poly(ns/ε) training
examples (xj,f(xj))j≤m with xj iid from πμ, the heuristic
outputs h with
Prx←πμ[sgn(h(x))≠f(x)] ≤ ε.
*same statement for DNF alg.
Smoothed analysis assumption
[figure: the adversary picks a decision tree f: {0,1}ⁿ → {–1,+1} and a cube ν+[–.01,.01]ⁿ; μ is drawn from the cube; training examples (x⁽¹⁾, f(x⁽¹⁾)),…,(x⁽ᵐ⁾, f(x⁽ᵐ⁾)) are drawn iid from the product distribution πμ and fed to the learning algorithm, which outputs h with Pr[h(x)≠f(x)] ≤ ε]
“Hard” instance picture
[figure: the cube of product distributions μ ϵ [0,1]ⁿ, μi = Pr[xi=1], for a fixed tree f: {0,1}ⁿ → {–1,1} and product distribution πμ; red marks the μ where the heuristic fails; “can’t be this” (a large red region)]
“Hard” instance picture
[figure: the same cube of product distributions μ ϵ [0,1]ⁿ, μi = Pr[xi=1], for a fixed tree f; red marks the μ where the heuristic fails]
Theorem 1: “hard” instances are few and far between for any tree.
Fourier over product distributions
• x ϵ {0,1}ⁿ, μ ϵ [0,1]ⁿ, Ex←πμ[xi] = μi
• Coordinates xi are normalized to mean 0, variance 1:

$$\tilde x_i = \frac{x_i - \mu_i}{\sqrt{\mu_i(1-\mu_i)}}, \qquad \tilde x_S = \prod_{i\in S}\tilde x_i \ \text{ for } S \subseteq [n] \quad (\text{also called } \chi_S(x,\mu))$$

$$f(x) = \sum_S \hat f(S)\,\tilde x_S, \quad \text{where } \hat f(S) = \mathbb{E}_{\mu}\bigl[f(x)\,\tilde x_S\bigr]$$

$$\|\hat f\|_2^2 = \sum_S \hat f^2(S) = 1 \ \text{(Parseval)}, \qquad \|\hat f\|_1 = \sum_S |\hat f(S)|$$
Heuristic over product distributions

$$\tilde x_i = \frac{x_i - \mu_i}{\sqrt{\mu_i(1-\mu_i)}} \quad (\mu_i \text{ can easily be estimated from data}), \qquad \hat f(S) \approx \frac{1}{m}\sum_{j=1}^m \tilde x_S^{\,j}\, y^j \quad (\text{easy to approximate any individual coefficient})$$

1) F := {1, x̃1, x̃2, …, x̃n}
2) Repeat m¼ times: F := F ∪ {x̃S}, where S is chosen to maximize |f̂(S)| subject to S = T ∪ {i} with x̃T ϵ F
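A small Python sketch of this coefficient-growing view, assuming access only to samples; the helper names, the clipping of the estimated μi, and the number of rounds are illustrative choices rather than the talk's.

```python
import numpy as np

def normalize(X, mu):
    """Column-wise map x_i -> (x_i - mu_i) / sqrt(mu_i (1 - mu_i))."""
    return (X - mu) / np.sqrt(mu * (1 - mu))

def estimate_coeff(Z, y, S):
    """Empirical f_hat(S) = (1/m) * sum_j chi_S(x^j) * y^j from normalized data Z."""
    chi = np.prod(Z[:, list(S)], axis=1) if S else np.ones(len(y))
    return float(np.mean(chi * y))

def grow_heavy_sets(X, y, rounds=None):
    """Bottom-up search: only extend sets already kept, one coordinate at a time,
    always adding the candidate with the largest estimated |f_hat(S)|."""
    m, n = X.shape
    mu = np.clip(X.mean(axis=0), 1e-3, 1 - 1e-3)   # mu_i estimated from the data
    Z = normalize(X, mu)
    F = [frozenset()] + [frozenset([i]) for i in range(n)]
    rounds = int(m ** 0.25) if rounds is None else rounds
    for _ in range(rounds):
        candidates = {T | frozenset([i]) for T in F for i in range(n) if i not in T}
        candidates -= set(F)
        best = max(candidates, key=lambda S: abs(estimate_coeff(Z, y, S)))
        F.append(best)
    return {S: estimate_coeff(Z, y, S) for S in F}
```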
Example
• f(x) = x2x4x9
• For uniform μ = (.5,.5,.5), xi ϵ {–1,+1}: f(x) = x2x4x9
• For μ = (.4,.6,.55),†
  f(x) = .9x2x4x9 + .1x2x4 + .3x4x9 + .2x2x9 + .2x2 – .2x4 + .1x9
[figure: the depth-3 decision tree over x2, x4, x9 computing f, with + and – leaves]
†figures not to scale
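To see the effect concretely, the following brute-force enumeration (restricted to the three relevant bits; the helper names are mine) expands the same target, i.e., the ±1 parity of those bits, in the μ-biased basis; since the slide's numbers are explicitly not to scale, the exact values printed here differ from the ones above.

```python
import itertools
import numpy as np

def chi(S, x, mu):
    """mu-biased character: product over i in S of (x_i - mu_i)/sqrt(mu_i(1-mu_i))."""
    return float(np.prod([(x[i] - mu[i]) / np.sqrt(mu[i] * (1 - mu[i])) for i in S]))

def coeffs(f, mu):
    """Exact f_hat(S) = E_mu[f(x) chi_S(x, mu)] by enumerating {0,1}^n (tiny n only)."""
    n = len(mu)
    out = {}
    for k in range(n + 1):
        for S in itertools.combinations(range(n), k):
            out[S] = sum(
                np.prod([mu[i] if x[i] else 1 - mu[i] for i in range(n)]) * f(x) * chi(S, x, mu)
                for x in itertools.product([0, 1], repeat=n)
            )
    return out

# f is the +/-1 parity of the three bits, i.e., the normalized product of them
# under the uniform distribution.
parity = lambda x: (2 * x[0] - 1) * (2 * x[1] - 1) * (2 * x[2] - 1)

print(coeffs(parity, [0.5, 0.5, 0.5]))    # only the top set {0,1,2} has weight
print(coeffs(parity, [0.4, 0.6, 0.55]))   # now every subset carries some (possibly small) weight
```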
Fourier structure over random product distributions
Lemma. For any f: {0,1}ⁿ → {–1,1}, α,β > 0, and d ≥ 1,

$$\Pr_{\mu \in [.49,.51]^n}\Bigl[\,\exists\, S \supseteq T \ \text{s.t.}\ |\hat f(S)| \ge \alpha \ \wedge\ |\hat f(T)| \le \beta \ \wedge\ |T| \le d\,\Bigr] \;\le\; 200\, n^{2d}\, \frac{\beta}{\alpha^{5}}$$
Lemma. Let p: Rⁿ → R be a degree-d multilinear polynomial with a leading coefficient of 1. Then, for any ε > 0,

$$\Pr_{x \leftarrow [-1,1]^n}\bigl[\,|p(x)| \le \varepsilon\,\bigr] \;\le\; 2d\,\varepsilon^{1/d}$$

e.g., p(x) = x1x2x9 + .3x7 – 0.2
An older perspective
• [Kushilevitz-Mansour’91] and [Goldreich-Levin’89] find the heavy Fourier coefficients: f_τ(x) = Σ_{S: |f̂(S)| ≥ τ} f̂(S)·x̃S
• Really use the fact that ‖f̂_τ – f̂‖∞ ≤ τ
• Every decision tree is well approximated by its heavy coefficients because ‖f̂‖₁ ≤ #leaves
In the smoothed product-distribution setting, the heuristic finds the heavy (log-degree) coefficients.
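For completeness, here is the standard calculation behind the ‖f̂‖₁ ≤ #leaves bullet, which the slide does not spell out; it is stated for the uniform distribution, where the normalized coordinates x̃i take values ±1:

$$f(x) \;=\; \sum_{\text{leaves }\ell} (\pm 1)\cdot \mathbf{1}[x \text{ reaches } \ell],
\qquad
\mathbf{1}[x \text{ reaches } \ell] \;=\; \prod_{i \in \text{path}(\ell)} \frac{1 \pm \tilde x_i}{2},$$

and each factor (1 ± x̃i)/2 has Fourier ℓ₁ norm 1, so ‖f̂‖₁ ≤ #leaves. Discarding every coefficient with |f̂(S)| < τ then costs at most Σ_{|f̂(S)|<τ} f̂(S)² ≤ τ·‖f̂‖₁ ≤ τ·#leaves in squared ℓ₂ error, so keeping only the heavy coefficients approximates the tree well.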
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why DNF can be recovered from heavy coefficients (information-theoretically)
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Learning DNF
• Adversary picks a DNF f(x) = C1(x)˅C2(x)˅…˅Cs(x) (and ν ϵ [.02,.98]ⁿ)
• Step 1: find f≥ε, the part of f made up of its Fourier coefficients of magnitude ≥ ε
• [BFJKMR’94, Jackson’95]: “KM gives a weak learner,” combined with careful boosting
• Cannot use boosting in the smoothed setting
• Solution: learn the DNF from f≥ε alone!
  – Design a robust membership-query DNF learning algorithm, and give it query access to f≥ε
DNF learning algorithm
f(x) = C1(x)˅C2(x)˅…˅Cs(x), e.g., (x1˄x4)˅(x2˄x4˄x7˄x9)
Each Ci is a “linear threshold function,” e.g., sgn(x1 + x4 – 1.5)
[KKanadeMansour’09] approach + other stuff
I’m a burier (of details)
burier, noun, pl. –s: one that buries.
DNF recoverable from heavy coefficients
Information-theoretic lemma (uniform distribution). For any s-term DNF f and any g: {0,1}ⁿ → {–1,1},

$$\Pr[f(x) \ne g(x)] \;\le\; \Bigl(s + \tfrac12\Bigr)\,\|\hat f - \hat g\|_\infty$$

Thanks, Madhu! Maybe similar to Bazzi/Braverman/Razborov?
DNF recoverable from heavy coefficients
Information-theoretic lemma (uniform distribution). For any s-term DNF f and any g: {0,1}ⁿ → {–1,1},

$$\Pr[f(x) \ne g(x)] \;\le\; \Bigl(s + \tfrac12\Bigr)\,\|\hat f - \hat g\|_\infty$$

Proof. Write f(x) = C1(x)˅…˅Cs(x), where f(x) ϵ {–1,1} but Ci(x) ϵ {0,1}. Pointwise, the indicator of a disagreement is at most (g(x) – f(x))·(½ – Σi Ci(x)): this is 0 when g = f, equals 1 when f = –1 and g = +1 (no term satisfied), and is ≥ 1 when f = +1 and g = –1 (some term satisfied). Hence

$$\Pr[f \ne g] \;\le\; \mathbb{E}\Bigl[(g(x) - f(x))\Bigl(\tfrac12 - \sum_i C_i(x)\Bigr)\Bigr] \;=\; (\hat g - \hat f)\cdot\Bigl(\tfrac{\hat 1}{2} - \sum_i \hat C_i\Bigr) \;\le\; \|\hat g - \hat f\|_\infty\Bigl(\tfrac12 + \sum_i \|\hat C_i\|_1\Bigr) \;\le\; \Bigl(\tfrac12 + s\Bigr)\|\hat g - \hat f\|_\infty,$$

since each ‖Ĉi‖₁ ≤ 1, e.g., x1˄x2 = ((1+x1)/2)·((1+x2)/2).
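A brute-force sanity check of the lemma under the uniform distribution (exhaustive over a tiny cube; the particular DNF, the perturbations of g, and the helper names are all illustrative choices of mine, not the talk's):

```python
import itertools
import random
from math import prod

def fourier(vals, n):
    """Uniform-distribution Fourier coefficients of a function given as a dict x -> +/-1."""
    return {
        S: sum(v * prod(2 * x[i] - 1 for i in S) for x, v in vals.items()) / 2 ** n
        for k in range(n + 1) for S in itertools.combinations(range(n), k)
    }

n, terms = 4, [(0, 2), (1, 3)]                   # s = 2 terms: (x1 /\ x3) \/ (x2 /\ x4)
cube = list(itertools.product([0, 1], repeat=n))
f = {x: 1 if any(all(x[i] for i in t) for t in terms) else -1 for x in cube}

random.seed(0)
for flips in (1, 2, 4):
    g = dict(f)
    for x in random.sample(cube, flips):         # g agrees with f except on a few points
        g[x] = -g[x]
    lhs = sum(f[x] != g[x] for x in cube) / 2 ** n
    fc, gc = fourier(f, n), fourier(g, n)
    rhs = (len(terms) + 0.5) * max(abs(fc[S] - gc[S]) for S in fc)
    print(f"Pr[f!=g] = {lhs:.3f}  <=  (s+1/2)*||f^-g^||_inf = {rhs:.3f}")
    assert lhs <= rhs + 1e-9
```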
Outline
1. PAC learn decision trees over smoothed (constant-bounded) product distributions
   • Describe practical heuristic
   • Define smoothed product distribution setting
   • Structure of Fourier coeff’s over random prod. dist.
2. PAC learn DNFs over smoothed (constant-bounded) product distributions
   • Why heavy coefficients characterize a DNF
3. Agnostically learn decision trees over smoothed (constant-bounded) product distributions
   • Rough idea of algorithm
Agnostically learning decision trees
• Adversary picks an arbitrary f: {0,1}ⁿ → {–1,+1} and ν ϵ [.02,.98]ⁿ
• Nature picks μ ϵ ν + [–.01,.01]ⁿ
• These determine the best size-s decision tree f*
• Guarantee: err(h) ≤ opt + ε, where opt = err(f*)
Agnostically learning decision trees
Design a robust membership-query learning algorithm that works as long as queries are answered by some g with ‖f̂ – ĝ‖∞ ≤ τ.
• Solve:

$$\min_{h:\ \|\hat h\|_1 \le s}\ \mathbb{E}\bigl[\,|1 - f(x)h(x)|\,\bigr]$$

(approximately solved using the [GKK’08] approach)
• Robustness:

$$\bigl|\mathbb{E}\bigl[g(x)h(x) - f(x)h(x)\bigr]\bigr| \;=\; \bigl|(\hat g - \hat f)\cdot \hat h\bigr| \;\le\; \|\hat g - \hat f\|_\infty \cdot \|\hat h\|_1 \;\le\; \tau s$$
The gradient-projection descent algorithm
• Find f≥ε: {0,1}ⁿ → R using the heuristic.
• h¹ = 0
• For t = 1,…,T:
  – h^(t+1) = proj_s( KM( h^t(x) + τ₁·sgn( f≥ε(x) – h^t(x) ) ) )
• Output h(x) = sgn(h^t(x) – θ) for the t ≤ T and θ ϵ [–1,1] that minimize error on a held-out data set.
Closely following [GopalanKKlivans’08]
Projection
• proj_s(h) = argmin over g with ‖ĝ‖₁ ≤ s of ‖ĝ – ĥ‖₂
From [GopalanKKlivans’08]
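By Parseval this is just the Euclidean projection of the coefficient vector onto an ℓ₁ ball, which soft-thresholding computes; below is a standard sketch of that step (my own code, not the talk's), operating on an explicit vector of coefficients.

```python
import numpy as np

def project_l1(v, s):
    """Euclidean projection of the coefficient vector v onto the l1 ball of radius s:
    argmin_{||u||_1 <= s} ||u - v||_2, computed by soft-thresholding."""
    if np.abs(v).sum() <= s:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                 # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - s))[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)         # shrinkage threshold
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

# Example: project a dense coefficient vector onto the ball ||.||_1 <= 2.
coeffs = np.array([0.9, -0.6, 0.5, 0.3, -0.2, 0.1])
p = project_l1(coeffs, 2.0)
print(p, np.abs(p).sum())                        # l1 norm comes out to 2.0
```

In the full algorithm, the projection would be applied to the (sparse) list of coefficients returned by KM at each iteration.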
Conclusions
• Smoothed complexity [SpielmanTeng01]
  – Compromise between worst-case and average-case
  – Novel application to learning over product distributions
• Assumption: not a completely adversarial relationship between the target f and the distribution D
• Weaker than “margin” assumptions
• Future work
  – Non-product distributions
  – Other smoothed analysis applications
Thanks!
Sorry!
Average-case complexity [JacksonServedio05]
• [JS05] gives a polytime algorithm that learns most DTs under the uniform distribution on {0,1}ⁿ
• Random DTs are sometimes easier than real ones
“Random is not typical” (courtesy of Dan Spielman)