Rademacher complexity properties 1: Lipschitz losses, finite class lemma

Administrative/Meta:
• Homework 2 is out.
• Project presentations are on December 8 at noon (location TBA); details forthcoming on the webpage.

Overview

The culmination of Rademacher/symmetrization was the following bound.

Theorem. Let functions $\mathcal{F}$ be given with $|f(z)| \le c$ almost surely for every $f \in \mathcal{F}$. With probability $\ge 1 - \delta$ over an i.i.d. draw $S := (Z_1, \ldots, Z_n)$, every $f \in \mathcal{F}$ satisfies
\[
\mathbb{E} f \le \widehat{\mathbb{E}} f + 2\,\mathrm{Rad}(\mathcal{F}_{|S}) + 3c\sqrt{\frac{\ln(2/\delta)}{n}},
\]
where
\[
\mathbb{E} f := \mathbb{E}(f(Z)), \qquad
\widehat{\mathbb{E}} f := \frac{1}{n}\sum_{i=1}^n f(Z_i), \qquad
\mathrm{Rad}(U) := \mathbb{E}_\sigma \sup_{v \in U} \langle \sigma, v \rangle_n,
\]
\[
\langle a, b \rangle_n := \frac{1}{n}\sum_{i=1}^n a_i b_i, \qquad
\mathcal{F}_{|S} := \{ (f(Z_1), \ldots, f(Z_n)) : f \in \mathcal{F} \}.
\]

Goals for today.
1. Bounds on $\mathcal{R}_\ell(f) := \mathbb{E}(\ell(-Y f(X)))$ when $\ell$ is Lipschitz.
2. A Rademacher bound for finite classes. We'll use this in the next class to discuss shatter coefficients and VC dimension.

$\mathcal{R}_\ell$ for Lipschitz $\ell$

Before developing the tools for Lipschitz losses, let's see how easily we can control $\ell_2$-bounded linear functions with Rademacher complexity; this will allow us to get bounds for logistic regression soon after.

Lemma. Set $X := \sup_{x \in S} \|x\|_2$. Then
\[
\mathrm{Rad}\big(\{x \mapsto \langle w, x \rangle : \|w\|_2 \le W\}_{|S}\big)
\le \frac{W}{n}\sqrt{\sum_{i=1}^n \|x_i\|_2^2}
\le \frac{WX}{\sqrt{n}}.
\]

Proof. For any fixed $\sigma \in \{-1,+1\}^n$, setting $x_\sigma := \sum_{i=1}^n \sigma_i x_i / n$,
\[
\sup_{\|w\|_2 \le W} \frac{1}{n}\sum_{i=1}^n \sigma_i \langle w, x_i \rangle
= \sup_{\|w\|_2 \le W} \langle w, x_\sigma \rangle
= W \|x_\sigma\|_2,
\]
where the last step is the equality case of Cauchy–Schwarz (attained by any $w$ if $x_\sigma = 0$, since then $0 = W\|x_\sigma\|_2$, and by $w = W x_\sigma / \|x_\sigma\|_2$ otherwise). Now to handle the expectation, we invoke the only inequality in the whole proof: by Jensen's inequality,
\[
\mathbb{E}_\sigma \|x_\sigma\|_2 = \mathbb{E}_\sigma \sqrt{\|x_\sigma\|_2^2} \le \sqrt{\mathbb{E}_\sigma \|x_\sigma\|_2^2}.
\]
The rest is again equalities: since $\mathbb{E}_\sigma(\sigma_i \sigma_j) = \mathbb{1}[i = j]$,
\[
\mathbb{E}\|x_\sigma\|_2^2
= \frac{1}{n^2}\sum_{j=1}^d \mathbb{E}_\sigma\Big(\sum_{i=1}^n x_{i,j}\sigma_i\Big)^2
= \frac{1}{n^2}\sum_{j=1}^d \Big(\sum_{i=1}^n x_{i,j}^2 + \sum_{i \ne l} x_{i,j} x_{l,j}\,\mathbb{E}_\sigma(\sigma_i \sigma_l)\Big)
= \sum_{i=1}^n \|x_i\|_2^2 / n^2.
\]

Remark. This proof is particularly clean for $\|\cdot\|_2$, but in the homework we'll see a clean trick to handle other norms. (A numerical sanity check of this lemma appears after the theorem below.)

Lemma. Let functions $\vec{\ell} = (\ell_i)_{i=1}^n$ be given with each $\ell_i : \mathbb{R} \to \mathbb{R}$ $L$-Lipschitz, and for any vector $v \in \mathbb{R}^n$ define the coordinate-wise composition $\vec{\ell} \circ v := (\ell_i(v_i))_{i=1}^n$, and similarly $\vec{\ell} \circ U := \{\vec{\ell} \circ v : v \in U\}$. Then $\mathrm{Rad}(\vec{\ell} \circ U) \le L\,\mathrm{Rad}(U)$.

Remark. The proof seems straightforward, but it is a little magical. It uses a step akin to symmetrization, but quite different. The difficulty arises since $|\ell_i(a) - \ell_i(b)| \le L|a - b|$ by the definition of Lipschitz, but we need to erase that absolute value.

Proof. We will show that we can replace $\ell_1$ with $L$ (times the identity), and the proof is complete by recursing on $\{2, \ldots, n\}$. The idea is that we need to get two terms that depend on $\ell_1$ in order to invoke the Lipschitz property. Proceeding from the definition of $\mathrm{Rad}$,
\[
\begin{aligned}
\mathrm{Rad}(\vec{\ell} \circ U)
&= \mathbb{E}_\sigma \sup_{v \in U} \frac{1}{n}\sum_i \sigma_i \ell_i(v_i) \\
&= \frac{1}{2n}\,\mathbb{E}_{\sigma_{2:n}} \Big[ \sup_{v \in U}\Big(\ell_1(v_1) + \sum_{i \ge 2} \sigma_i \ell_i(v_i)\Big) + \sup_{w \in U}\Big(-\ell_1(w_1) + \sum_{i \ge 2} \sigma_i \ell_i(w_i)\Big) \Big] \\
&\le \frac{1}{2n}\,\mathbb{E}_{\sigma_{2:n}} \sup_{v, w \in U}\Big( L|v_1 - w_1| + \sum_{i \ge 2} \sigma_i \big(\ell_i(v_i) + \ell_i(w_i)\big) \Big) \\
&= \frac{1}{2n}\,\mathbb{E}_{\sigma_{2:n}} \sup_{\substack{v, w \in U \\ v_1 \ge w_1}}\Big( L|v_1 - w_1| + \sum_{i \ge 2} \sigma_i \big(\ell_i(v_i) + \ell_i(w_i)\big) \Big) \\
&= \frac{1}{2n}\,\mathbb{E}_{\sigma_{2:n}} \sup_{\substack{v, w \in U \\ v_1 \ge w_1}}\Big( L(v_1 - w_1) + \sum_{i \ge 2} \sigma_i \big(\ell_i(v_i) + \ell_i(w_i)\big) \Big) \\
&= \frac{1}{n}\,\mathbb{E}_\sigma \sup_{v \in U}\Big( \sigma_1 L v_1 + \sum_{i \ge 2} \sigma_i \ell_i(v_i) \Big).
\end{aligned}
\]
(The second equality expands the expectation over $\sigma_1 \in \{-1,+1\}$; the restriction to $v_1 \ge w_1$ costs nothing since swapping $v$ and $w$ leaves the expression invariant apart from the sign of $v_1 - w_1$.) The same technique is now applied for $i \in \{2, \ldots, n\}$.

This gives the following useful bound.

Theorem. Let functions $\mathcal{F}$ and loss $\ell$ be given. Suppose $\ell$ is $L$-Lipschitz and $|\ell(-y f(x))| \le c$ and $|y| \le 1$ almost surely. Then with probability at least $1 - \delta$ over $S$, every $f \in \mathcal{F}$ satisfies
\[
\mathcal{R}_\ell(f) \le \widehat{\mathcal{R}}_\ell(f) + 2L\,\mathrm{Rad}(\mathcal{F}_{|S}) + 3c\sqrt{\frac{\ln(2/\delta)}{n}}.
\]

Proof. The proof follows from the previous Rademacher rules by noting $\ell_i(z) := \ell(-y_i z)$ is $L$-Lipschitz, just like $\ell$.
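As flagged in the remark above, the $\ell_2$ linear-class lemma is easy to sanity-check numerically: the inner supremum has the closed form $W\|x_\sigma\|_2$, so $\mathrm{Rad}(\mathcal{F}_{|S})$ can be estimated by averaging over random sign vectors. The following is a minimal sketch (not from the notes), assuming synthetic Gaussian data; the values of $n$, $d$, $W$, the seed, and the number of Monte Carlo draws are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed for reproducibility
n, d, W = 100, 5, 2.0                   # illustrative sample size, dimension, radius
xs = rng.normal(size=(n, d))            # synthetic sample S = (x_1, ..., x_n)

# By the equality case of Cauchy-Schwarz in the proof above,
#   sup_{||w||_2 <= W} <w, x_sigma> = W ||x_sigma||_2,  x_sigma = sum_i sigma_i x_i / n,
# so Rad(F_{|S}) = E_sigma [ W ||x_sigma||_2 ] can be estimated by Monte Carlo over sigma.
sigmas = rng.choice([-1.0, 1.0], size=(10_000, n))   # each row is a sign vector sigma
rad_est = W * np.linalg.norm(sigmas @ xs / n, axis=1).mean()

bound_jensen = W * np.sqrt((xs ** 2).sum()) / n                   # W sqrt(sum_i ||x_i||^2) / n
bound_simple = W * np.linalg.norm(xs, axis=1).max() / np.sqrt(n)  # W X / sqrt(n)
print(rad_est, bound_jensen, bound_simple)  # expect rad_est <= bound_jensen <= bound_simple
```

The gap between the estimate and the first bound comes only from the single Jensen step in the proof (plus Monte Carlo error), which is why the lemma is so tight in practice.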
Remark (logistic regression). Combining these pieces, with $\|x\|_2 \le X$, functions $\mathcal{F} := \{x \mapsto \langle w, x \rangle : \|w\|_2 \le W\}$, and a nondecreasing $L$-Lipschitz loss (e.g., the logistic loss $z \mapsto \ln(1 + \exp(z))$ is $1$-Lipschitz), we get, with probability $\ge 1 - \delta$, that each $f \in \mathcal{F}$ satisfies
\[
\mathcal{R}_\ell(f) \le \widehat{\mathcal{R}}_\ell(f) + \frac{2LWX}{\sqrt{n}} + 3\big(LWX + \ell(0)\big)\sqrt{\frac{\ln(2/\delta)}{n}}.
\]
We had roughly this bound with SGD, but via a very different analysis!

Finite classes

The main Rademacher tool here is as follows. [In class we also discussed shatter coefficients and VC dimension, but these will be in the next lecture notes.]

Theorem (Massart finite lemma).
\[
\mathrm{Rad}(U) \le \frac{\max_{v \in U} \|v\|_2}{n}\sqrt{2\ln(|U|)}.
\]

While this has a fancy name, it's a consequence of the following lemma from the homework.

Lemma. If $(X_1, \ldots, X_n)$ are $c^2$-subgaussian (but not necessarily independent or identically distributed), then
\[
\mathbb{E}\max_i X_i \le c\sqrt{2\ln(n)}.
\]

Proof. As in the homework, this follows by noting $\max_i X_i \le \inf_{t > 0} t^{-1} \ln \sum_i \exp(t X_i)$, using the definition of $c^2$-subgaussian, and optimizing over $t$. (The homework problem had $2n$ rather than $n$, but it controlled $\max_i |X_i|$.)

Next we need to see how subgaussianity transfers from individual random variables to averages of them.

Lemma. If $(Z_1, \ldots, Z_n)$ are $c_i^2$-subgaussian and independent, then $\sum_i Z_i / n$ is $\big(\sum_i c_i^2 / n^2\big)$-subgaussian.

Proof. Set $\bar{Z} := \sum_i Z_i / n$. For any $t \in \mathbb{R}$, by independence,
\[
\mathbb{E}\big(\exp(t\bar{Z})\big)
= \prod_i \mathbb{E}\big(\exp(t Z_i / n)\big)
\le \prod_i \exp\big(c_i^2 t^2 / (2n^2)\big)
= \exp\Big(\sum_i c_i^2 t^2 / (2n^2)\Big).
\]

These lemmas suffice to prove the Rademacher bound.

Proof (of Massart finite lemma). For each $v \in U$, define the random variable $X_v := \langle \sigma, v \rangle_n$; crucially, the distribution of $X_v$ is determined by the distribution of $\sigma$. Moreover, $\sigma_i v_i$ is $v_i^2$-subgaussian by the Hoeffding lemma, meaning $X_v$ is $(\|v\|_2^2 / n^2)$-subgaussian by the preceding lemma, which together with the lemma on maxima of subgaussian random variables gives the bound.
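The Massart bound can be checked by simulation in the same way as the linear-class lemma earlier: for a small finite $U$, estimate $\mathrm{Rad}(U) = \mathbb{E}_\sigma \max_{v \in U} \langle \sigma, v \rangle_n$ directly from the definition. A minimal sketch, again assuming an arbitrary synthetic $U$ and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
n, N = 50, 20                           # illustrative dimension and class size |U|
U = rng.normal(size=(N, n))             # a finite class U of N vectors in R^n

# Estimate Rad(U) = E_sigma max_{v in U} <sigma, v>_n by Monte Carlo over sigma.
sigmas = rng.choice([-1.0, 1.0], size=(10_000, n))
rad_est = (sigmas @ U.T / n).max(axis=1).mean()

# Massart finite lemma: Rad(U) <= (max_{v in U} ||v||_2 / n) sqrt(2 ln |U|).
massart = np.linalg.norm(U, axis=1).max() / n * np.sqrt(2 * np.log(N))
print(rad_est, massart)                 # expect rad_est <= massart
```

Rerunning with larger $N$ (while keeping the norms comparable) makes the $\sqrt{2\ln|U|}$ growth visible, which is the dependence the shatter coefficient and VC arguments will exploit next lecture.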