Rademacher complexity properties 1: Lipschitz losses, finite class lemma
Administrative/Meta:
• Homework 2 is out.
• Project presentations are on December 8 at noon (location TBA); details forthcoming on webpage.
Overview
The culmination of Rademacher/symmetrization was the following bound.
Theorem. Let functions F be given with |f(z)| ≤ c almost surely for every f ∈ F. With probability ≥ 1 − δ over an i.i.d. draw S := (Z₁, . . . , Zₙ), every f ∈ F satisfies

Ef ≤ Êf + 2 Rad(F|S) + 3c √(2 ln(2/δ) / n),
where

Ef := E(f(Z)),
Êf := (1/n) Σ_{i=1}^n f(Zᵢ),
Rad(U) := E_σ sup_{v∈U} ⟨σ, v⟩ₙ,
⟨a, b⟩ₙ := (1/n) Σ_{i=1}^n aᵢbᵢ,
F|S := {(f(Z₁), . . . , f(Zₙ)) : f ∈ F}.
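To make these definitions concrete, here is a minimal numerical sketch (my addition, not part of the notes): it Monte Carlo estimates Rad(F|S) directly from the definition, for a small finite class of threshold functions on a Gaussian sample. All names here (estimate_rad, thresholds) are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def estimate_rad(F_S, num_draws=10000):
    """Estimate Rad(F|S) = E_sigma sup_{v in F|S} <sigma, v>_n by sampling sigma.

    F_S has shape (|F|, n): one row per vector (f(Z_1), ..., f(Z_n)).
    """
    n = F_S.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))  # uniform sign vectors
    sups = (sigma @ F_S.T / n).max(axis=1)                # sup over the class, per draw
    return sups.mean()                                    # average over sigma draws

n = 50
S = rng.normal(size=n)                                        # the sample (Z_1, ..., Z_n)
thresholds = np.linspace(-2, 2, 9)
F_S = np.stack([(S <= t).astype(float) for t in thresholds])  # F|S for f_t(z) = 1[z <= t]
print("estimated Rad(F|S):", estimate_rad(F_S))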
Goals for today.
1. Bounds on R_ℓ(f) := E(ℓ(−Y f(X))) when ℓ is Lipschitz.
2. A Rademacher bound for finite classes. We’ll use this in the next class to discuss shatter coefficients and VC dimension.
R_ℓ for Lipschitz ℓ
Before getting the tools to work with Lipschitz losses, let’s see how easily we can control ℓ₂-bounded linear functions with Rademacher complexity; this will let us derive bounds for logistic regression soon after.
Lemma. Set X := sup_{x∈S} ‖x‖₂. Then

Rad({x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ W}|S) ≤ (W/n) √( Σᵢ ‖xᵢ‖₂² ) ≤ W X/√n.
Proof. For any fixed σ ∈ {−1, +1}ⁿ, setting x_σ := Σ_{i=1}^n σᵢxᵢ/n, the equality case of Cauchy-Schwarz grants

sup_{‖w‖₂≤W} (1/n) Σ_{i=1}^n σᵢ ⟨w, xᵢ⟩ = sup_{‖w‖₂≤W} ⟨w, x_σ⟩ = { 0 if x_σ = 0; ⟨W x_σ/‖x_σ‖₂, x_σ⟩ otherwise } = W ‖x_σ‖₂.

Now to handle the expectation, we invoke the only inequality in the whole proof: by Jensen’s inequality,

E_σ ‖x_σ‖₂ = E_σ √(‖x_σ‖₂²) ≤ √( E_σ ‖x_σ‖₂² ).
The rest is again equalities: since E_σ(σᵢσⱼ) = 1[i = j],

E_σ ‖x_σ‖₂² = (1/n²) Σ_{j=1}^d E_σ( Σ_{i=1}^n σᵢ x_{i,j} )² = (1/n²) Σ_{j=1}^d ( Σ_{i=1}^n x_{i,j}² + Σ_{i≠l} E_σ(σᵢσ_l) x_{i,j} x_{l,j} ) = Σ_{i=1}^n ‖xᵢ‖₂² / n².
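As a quick sanity check on this lemma (my addition, with arbitrary choices of n, d, and W): the proof’s closed form sup_{‖w‖₂≤W} ⟨w, x_σ⟩ = W‖x_σ‖₂ lets us Monte Carlo the left-hand side exactly and compare it against both upper bounds.

import numpy as np

rng = np.random.default_rng(0)
n, d, W = 100, 5, 2.0
xs = rng.normal(size=(n, d))                       # the sample S
X = np.linalg.norm(xs, axis=1).max()               # X := sup_{x in S} ||x||_2

sigma = rng.choice([-1.0, 1.0], size=(20000, n))
x_sigma = sigma @ xs / n                           # rows are x_sigma = sum_i sigma_i x_i / n
rad = W * np.linalg.norm(x_sigma, axis=1).mean()   # Rad = E_sigma W ||x_sigma||_2
bound1 = W * np.sqrt((np.linalg.norm(xs, axis=1) ** 2).sum()) / n
bound2 = W * X / np.sqrt(n)
print(rad, "<=", bound1, "<=", bound2)             # observe rad <= bound1 <= bound2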
Remark. This proof is particularly clean for ‖·‖₂, but in the homework we’ll see a clean trick to handle other norms.
Lemma. Let functions ℓ⃗ = (ℓᵢ)_{i=1}^n be given with each ℓᵢ : R → R L-Lipschitz, and for any vector v ∈ Rⁿ define the coordinate-wise composition ℓ⃗ ∘ v := (ℓᵢ(vᵢ))_{i=1}^n, and similarly ℓ⃗ ∘ U := {ℓ⃗ ∘ v : v ∈ U}. Then

Rad(ℓ⃗ ∘ U) ≤ L Rad(U).
Remark. The proof seems straightforward, but it is a little magical. It uses a step akin to symmetrization, but quite different. The difficulty arises since |ℓᵢ(a) − ℓᵢ(b)| ≤ L|a − b| by the definition of Lipschitz, but we need to erase that absolute value.
Proof. We will show that we can replace ℓ₁ with L, whereupon the proof is complete by recursing on {2, . . . , n}. The idea is that we need to produce two terms that depend on ℓ₁ in order to invoke the Lipschitz property. Proceeding from the definition of Rad, and writing E_{σ₁} as an average over its two equally likely signs (with the supremum variable renamed to w in the σ₁ = −1 term),

Rad(ℓ⃗ ∘ U) = E_σ sup_{v∈U} (1/n) Σᵢ σᵢ ℓᵢ(vᵢ)
= (1/(2n)) E_{σ_{2:n}} [ sup_{v∈U} ( ℓ₁(v₁) + Σ_{i≥2} σᵢ ℓᵢ(vᵢ) ) + sup_{w∈U} ( −ℓ₁(w₁) + Σ_{i≥2} σᵢ ℓᵢ(wᵢ) ) ]
≤ (1/(2n)) E_{σ_{2:n}} sup_{v,w∈U} ( L|v₁ − w₁| + Σ_{i≥2} σᵢ (ℓᵢ(vᵢ) + ℓᵢ(wᵢ)) )
= (1/(2n)) E_{σ_{2:n}} sup_{v,w∈U, v₁≥w₁} ( L|v₁ − w₁| + Σ_{i≥2} σᵢ (ℓᵢ(vᵢ) + ℓᵢ(wᵢ)) )
= (1/(2n)) E_{σ_{2:n}} sup_{v,w∈U, v₁≥w₁} ( L(v₁ − w₁) + Σ_{i≥2} σᵢ (ℓᵢ(vᵢ) + ℓᵢ(wᵢ)) )
= (1/n) E_σ sup_{v∈U} ( σ₁ L v₁ + Σ_{i≥2} σᵢ ℓᵢ(vᵢ) ),

where restricting to v₁ ≥ w₁ loses nothing by the symmetry between v and w, and the final equality drops that constraint (swapping v and w can only increase L(v₁ − w₁)) and reassembles the two suprema into an expectation over σ₁. The same technique is now applied for i ∈ {2, . . . , n}.
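A small numerical illustration of this lemma (my addition; tanh is just one convenient 1-Lipschitz choice, and rad() mirrors the definition of Rad on a finite set):

import numpy as np

rng = np.random.default_rng(0)

def rad(U, num_draws=20000):
    n = U.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    return (sigma @ U.T / n).max(axis=1).mean()

n = 30
U = rng.normal(size=(40, n))      # a finite U: 40 vectors in R^n
L = 1.0
lU = np.tanh(U)                   # coordinate-wise composition; each l_i = tanh is 1-Lipschitz
print(rad(lU), "<=", L * rad(U))  # observe Rad(l o U) <= L Rad(U)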
This gives the following useful bound.
Theorem. Let functions F and loss ℓ be given. Suppose ℓ is L-Lipschitz and |ℓ(−yf(x))| ≤ c and |y| ≤ 1 almost surely. Then with probability at least 1 − δ over S, every f ∈ F satisfies

R_ℓ(f) ≤ R̂_ℓ(f) + 2L Rad(F|S) + 3c √(2 ln(2/δ) / n).
Proof. The proof follows from the previous Rademacher rules by noting that ℓᵢ(z) := ℓ(−yᵢ z) is L-Lipschitz (since |yᵢ| ≤ 1), just like ℓ.
Remark (logistic regression). Combining these pieces, with ‖x‖₂ ≤ X, functions F := {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ W}, and a nondecreasing L-Lipschitz loss (e.g., the logistic loss z ↦ ln(1 + exp(z)) is 1-Lipschitz), we get, with probability ≥ 1 − δ, that each f ∈ F satisfies

R_ℓ(f) ≤ R̂_ℓ(f) + 2LWX/√n + 3(LWX + ℓ(0)) √(2 ln(2/δ) / n).
We had roughly this bound with SGD, but via a very different analysis!
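To see the scale of this bound, here is a quick plug-in computation (my addition; the values of W, X, n, and δ are arbitrary choices):

import numpy as np

W, X, n, delta = 1.0, 1.0, 10000, 0.05
L = 1.0                            # logistic loss z -> ln(1 + exp(z)) is 1-Lipschitz
c = L * W * X + np.log(2)          # l(0) = ln 2, so |l(-y f(x))| <= L W X + ln 2 a.s.

rad_term = 2 * L * W * X / np.sqrt(n)                # bounds 2 L Rad(F|S)
conc_term = 3 * c * np.sqrt(2 * np.log(2 / delta) / n)
print("excess risk term:", rad_term + conc_term)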
Finite classes
The main Rademacher tool here is as follows. [ In class we also discussed shatter coefficients and VC dimension, but these will appear in the next lecture notes. ]
Theorem (Massart finite lemma).

Rad(U) ≤ (max_{v∈U} ‖v‖₂) √(2 ln(|U|)) / n.
While this has a fancy name, it’s a consequence of the following lemma from homework.
Lemma. If (X₁, . . . , Xₙ) are c²-subgaussian (but not necessarily independent or identically distributed), then

E maxᵢ Xᵢ ≤ c √(2 ln(n)).
Proof. As in the homework, this follows by noting maxᵢ Xᵢ ≤ inf_{t>0} t⁻¹ ln Σᵢ exp(tXᵢ), using the definition of c²-subgaussian, and optimizing t. (The homework problem had 2n rather than n, but it controlled maxᵢ |Xᵢ|.)
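For completeness, a worked version of that optimization (my expansion of the homework argument; the middle inequality is Jensen applied to ln, and the subgaussian definition bounds each moment generating function):

\[
\mathbb{E}\max_i X_i
\le \frac{1}{t}\ln \sum_i \mathbb{E}\, e^{tX_i}
\le \frac{1}{t}\ln\bigl(n\, e^{t^2 c^2/2}\bigr)
= \frac{\ln n}{t} + \frac{t c^2}{2},
\]
and minimizing over $t > 0$ gives $t = \sqrt{2\ln(n)}/c$, hence $\mathbb{E}\max_i X_i \le c\sqrt{2\ln(n)}$.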
Next we need to see how subgaussianity transfers from individual random variables to their averages.
Lemma. If (Z₁, . . . , Zₙ) are cᵢ²-subgaussian and independent, then Σᵢ Zᵢ/n is (Σᵢ cᵢ²/n²)-subgaussian.

Proof. Set Z̄ := Σᵢ Zᵢ/n. For any t ∈ R,

E(exp(tZ̄)) = Πᵢ E(exp(tZᵢ/n)) ≤ Πᵢ exp(cᵢ² t²/(2n²)) = exp( (Σᵢ cᵢ²) t²/(2n²) ).
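A quick numeric check of this lemma (my addition): each coordinate below is uniform on [−1, 1], hence 1-subgaussian by the Hoeffding lemma, so the lemma says the average is (1/n)-subgaussian; the empirical moment generating function indeed stays below exp(t²/(2n)).

import numpy as np

rng = np.random.default_rng(0)
n = 20
Z = rng.uniform(-1.0, 1.0, size=(200000, n))     # independent 1-subgaussian coordinates
Zbar = Z.mean(axis=1)                            # Z-bar = sum_i Z_i / n
for t in (1.0, 5.0, 10.0):
    mgf = np.exp(t * Zbar).mean()                # empirical E exp(t Z-bar)
    print(t, mgf, "<=", np.exp(t**2 / (2 * n)))  # sum_i c_i^2 / n^2 = 1/n here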
These lemmas suffice to prove the Rademacher bound.
Proof (of Massart finite lemma). For each v ∈ U, define the random variable X_v := ⟨σ, v⟩ₙ; crucially, the distribution of X_v is determined by the distribution of σ. Moreover, σᵢvᵢ is vᵢ²-subgaussian by the Hoeffding lemma, meaning X_v is (‖v‖₂²/n²)-subgaussian by the preceding lemma, which together with the lemma on maxima of subgaussians gives the bound.
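Finally, an end-to-end check (my addition; the sizes n and |U| are arbitrary choices): a Monte Carlo estimate of Rad(U) for a random finite U, compared against the Massart bound.

import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 100
U = rng.normal(size=(m, n))                       # |U| = m vectors in R^n

sigma = rng.choice([-1.0, 1.0], size=(20000, n))
rad_est = (sigma @ U.T / n).max(axis=1).mean()    # Monte Carlo Rad(U)

massart = np.linalg.norm(U, axis=1).max() * np.sqrt(2 * np.log(m)) / n
print(rad_est, "<=", massart)                     # observe the Massart bound holds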