How to estimate the mean of a random variable?

Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona

based on joint work with Christian Brownlees (Pompeu Fabra, Barcelona), Luc Devroye (McGill, Montreal), Émilien Joly (Nanterre), Matthieu Lerasle (CNRS, Nice), and Roberto Imbuzeiro Oliveira (IMPA, Rio)

estimating the mean

Given $X_1, \dots, X_n$, a real i.i.d. sequence, estimate $\mu = \mathbb{E} X_1$.

"Obvious" choice: the empirical mean
$$\mu_n = \frac{1}{n}\sum_{i=1}^n X_i .$$

By the central limit theorem, if $X$ has a finite variance $\sigma^2$,
$$\lim_{n\to\infty} \mathbb{P}\left\{ \sqrt{n}\,|\mu_n - \mu| > \sigma\sqrt{2\log(2/\delta)} \right\} \le \delta .$$

We would like non-asymptotic inequalities of a similar form.

If the distribution is sub-Gaussian, $\mathbb{E}\,e^{\lambda(X-\mu)} \le e^{\sigma^2\lambda^2/2}$, then with probability at least $1-\delta$,
$$|\mu_n - \mu| \le \sigma\sqrt{\frac{2\log(2/\delta)}{n}} .$$

empirical mean–heavy tails

The empirical mean is computationally attractive, requires no a priori knowledge, and automatically scales with $\sigma$.

If the distribution is not sub-Gaussian, we still have Chebyshev's inequality: with probability $\ge 1-\delta$,
$$|\mu_n - \mu| \le \sigma\sqrt{\frac{2}{n\delta}} .$$

An exponentially weaker bound. It especially hurts when many means are estimated simultaneously.

This is the best one can say: Catoni (2012) shows that for each $\delta$ there exists a distribution with finite variance such that
$$\mathbb{P}\left\{ |\mu_n - \mu| \ge \sigma\sqrt{\frac{c}{n\delta}} \right\} \ge \delta .$$

sub-Gaussian estimators

Our question: under what conditions do there exist sub-Gaussian mean estimators?

More formally: let $\mathcal{P}$ be a class of probability distributions on $\mathbb{R}$ with finite variance. Does there exist a mean estimator $\hat\mu_n$ such that for all distributions in $\mathcal{P}$, with probability at least $1-\delta$,
$$|\hat\mu_n - \mu| \le L\sigma\sqrt{\frac{\log(2/\delta)}{n}}$$
for some constant $L = L(\mathcal{P})$?

median of means

A simple estimator is median-of-means. It goes back to Nemirovsky and Yudin (1983), Jerrum, Valiant, and Vazirani (1986), and Alon, Matias, and Szegedy (2002). Split the sample into $k$ blocks of size $N = n/k$ and take the median of the block means:
$$\hat\mu_{\mathrm{MM}} \stackrel{\mathrm{def}}{=} \mathrm{median}\left( \frac{1}{N}\sum_{t=1}^{N} X_t,\; \dots,\; \frac{1}{N}\sum_{t=(k-1)N+1}^{kN} X_t \right) .$$

Lemma. Let $\delta \in (0,1)$, $k = 8\log\delta^{-1}$, and $N = \frac{n}{8\log\delta^{-1}}$. Then with probability at least $1-\delta$,
$$|\hat\mu_{\mathrm{MM}} - \mu| \le 8\sigma\sqrt{\frac{\log(2/\delta)}{n}} .$$

proof

By Chebyshev's inequality, each block mean is within distance $\sigma\sqrt{8/N}$ of $\mu$ with probability at least $3/4$. The probability that the median is not within distance $\sigma\sqrt{8/N}$ of $\mu$ is at most $\mathbb{P}\{\mathrm{Bin}(k,1/4) > k/2\}$, which is exponentially small in $k$.

median of means

• Sub-Gaussian deviations.
• Scales automatically with $\sigma$.
• Parameters depend on the required confidence level $\delta$.
• See Lerasle and Oliveira (2012), Hsu and Sabato (2013), and Minsker (2014) for generalizations.
• Also works when the variance is infinite: if $\mathbb{E}|X - \mathbb{E}X|^{1+\alpha} = M$ for some $\alpha \le 1$, then with probability at least $1-\delta$,
$$|\hat\mu_{\mathrm{MM}} - \mu| \le 8\left( \frac{(12M)^{1/\alpha}\ln(1/\delta)}{n} \right)^{\alpha/(1+\alpha)} .$$
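As an aside (not part of the talk), a minimal Python sketch of the estimator; the block count $k \approx 8\log(1/\delta)$ mirrors the lemma above, and `np.array_split` simply handles a $k$ that does not divide $n$:

```python
import numpy as np

def median_of_means(x, delta):
    """Median-of-means estimate of E[X] at confidence level delta.

    Sketch following the lemma above: split into k ~ 8 log(1/delta)
    blocks of (roughly) equal size, return the median of block means.
    """
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(8 * np.log(1.0 / delta)))))
    block_means = [b.mean() for b in np.array_split(x, k)]
    return float(np.median(block_means))

# Example on a heavy-tailed sample (Pareto with tail index 2.5, so the
# variance is finite but the empirical mean has poor deviations).
rng = np.random.default_rng(0)
sample = rng.pareto(2.5, size=10_000)
print(median_of_means(sample, delta=0.01))
```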
catoni's estimator

Catoni (2012) introduced a novel M-estimator. Let
$$\varphi(x) = \left( \mathbb{1}_{\{x\ge 0\}} - \mathbb{1}_{\{x<0\}} \right) \log\left( 1 + |x| + x^2/2 \right) .$$
Catoni's estimator (with parameter $\alpha > 0$) $\hat\mu_C$ is defined as the solution of
$$\sum_{t=1}^n \varphi(\alpha(X_t - \mu)) = 0 .$$

catoni's estimator

Let $\delta \in (0,1)$ and let $v$ be an upper bound on the variance of $X$. Then with $\alpha = \sqrt{\frac{2\log\delta^{-1}}{v(n + o(n))}}$, with probability $\ge 1-\delta$,
$$|\hat\mu_C - \mathbb{E}X| \le \sqrt{\frac{2v\log(1/\delta)}{n - 2\log(1/\delta)}} .$$

With $\alpha = \sqrt{2/(nv)}$,
$$|\hat\mu_C - \mathbb{E}X| \le (1 + \log(2/\delta))\sqrt{\frac{v}{n}} .$$

catoni's estimator

• Sub-Gaussian deviations with optimal constants.
• Scales with (an upper bound on) $\mathrm{Var}(X)$.
• Requires knowledge of (an upper bound on) $\mathrm{Var}(X)$.
• Works (with the optimal constants) only for finite variance.
• The estimator depends on $\delta$ for sub-Gaussian performance; a confidence-independent choice yields sub-exponential bounds.

catoni's proof

Define $r_n(m) = \frac{1}{n\alpha}\sum_{t=1}^n \varphi(\alpha(X_t - m))$, a decreasing function of $m$. Then for each fixed $m$,
$$\mathbb{E}\,e^{n\alpha r_n(m)} \le \exp\left( n\alpha(\mu - m) + \frac{n\alpha^2}{2}\left( v + (m-\mu)^2 \right) \right)$$
and by Markov's inequality,
$$\mathbb{P}\left\{ r_n(m) < (\mu - m) + \frac{\alpha}{2}\left( v + (m-\mu)^2 \right) + \frac{\log(1/\delta)}{n\alpha} \right\} > 1-\delta .$$
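A numerical sketch (again not from the talk): since the defining sum is strictly decreasing in $\mu$, the equation can be solved by bracketing and bisection. The bracket $[\min_i X_i - 1, \max_i X_i + 1]$ and the simplified $\alpha$ are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

def phi(x):
    # Catoni's influence function: sign(x) * log(1 + |x| + x^2/2).
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

def catoni(x, v, delta=None):
    """Catoni's M-estimator of the mean; v upper-bounds Var(X).

    delta=None uses the confidence-independent alpha = sqrt(2/(n v));
    otherwise alpha = sqrt(2 log(1/delta) / (n v)), a simplification of
    the exact choice in Catoni (2012).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if delta is None:
        alpha = np.sqrt(2.0 / (n * v))
    else:
        alpha = np.sqrt(2.0 * np.log(1.0 / delta) / (n * v))
    # r(m) = sum_t phi(alpha (X_t - m)) is continuous and strictly
    # decreasing in m: positive below min(x), negative above max(x).
    r = lambda m: float(np.sum(phi(alpha * (x - m))))
    return brentq(r, x.min() - 1.0, x.max() + 1.0)
```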
why sub-Gaussian?

Sub-Gaussian bounds are the best one can hope for when the variance is finite.

In fact, for any $M > 0$, $\alpha \in (0,1]$, $\delta > 2e^{-n/4}$, and mean estimator $\hat\mu_n$, there exists a distribution with $\mathbb{E}|X - \mathbb{E}X|^{1+\alpha} = M$ such that, with probability at least $\delta$,
$$|\hat\mu_n - \mu| \ge \left( \frac{M^{1/\alpha}\ln(1/\delta)}{n} \right)^{\alpha/(1+\alpha)} .$$

Proof: the distributions $P_+(0) = 1-p$, $P_+(c) = p$ and $P_-(0) = 1-p$, $P_-(-c) = p$ are indistinguishable if all $n$ samples are equal to $0$.

why sub-Gaussian?

This shows optimality of the median-of-means estimator for all $\alpha$. It also shows that finite variance is necessary even for the rate $n^{-1/2}$. One cannot hope to get anything better than sub-Gaussian tails. Catoni proved that the sample mean is optimal for the class of Gaussian distributions.

multiple-δ estimators

Do there exist estimators that are sub-Gaussian simultaneously for all confidence levels?

An estimator is multiple-δ sub-Gaussian for a class of distributions $\mathcal{P}$ and $\delta_{\min}$ if for all $\delta \in [\delta_{\min}, 1)$ and all distributions in $\mathcal{P}$,
$$|\hat\mu_n - \mu| \le L\sigma\sqrt{\frac{\log(2/\delta)}{n}} .$$

The picture is more complex than before.

known variance

Given $0 < \sigma_1 \le \sigma_2 < \infty$, define the class
$$\mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]} = \{ P : \sigma_1^2 \le \sigma_P^2 \le \sigma_2^2 \} .$$
Let $R = \sigma_2/\sigma_1$.

• If $R$ is bounded, then there exists a multiple-δ sub-Gaussian estimator with $\delta_{\min} = 4e^{1-n/2}$.
• If $R$ is unbounded, then there is no multiple-δ sub-Gaussian estimator for any $L$ and $\delta_{\min} \to 0$.

A sharp distinction. The exponentially small value of $\delta_{\min}$ is best possible.

construction of multiple-δ estimator

Reminiscent of Lepski's method of adaptive estimation.

For $k = 1, \dots, K = \log_2(1/\delta_{\min})$, use the median-of-means estimator to construct confidence intervals $I_k$ such that $\mathbb{P}\{\mu \notin I_k\} \le 2^{-k}$. (This is where knowledge of $\sigma_2$ and boundedness of $R$ are used.)

Define
$$\hat k = \min\left\{ k : \bigcap_{j=k}^K I_j \ne \emptyset \right\} .$$
Finally, let $\hat\mu_n$ be the midpoint of $\bigcap_{j=\hat k}^K I_j$.

proof

For any $k = 1, \dots, K$,
$$\mathbb{P}\{ |\hat\mu_n - \mu| > |I_k| \} \le \mathbb{P}\{ \exists j \ge k : \mu \notin I_j \},$$
because if $\mu \in \bigcap_{j=k}^K I_j$, then $\bigcap_{j=k}^K I_j$ is non-empty and therefore $\hat\mu_n \in \bigcap_{j=k}^K I_j$. But
$$\mathbb{P}\{ \exists j \ge k : \mu \notin I_j \} \le \sum_{j=k}^K \mathbb{P}\{ \mu \notin I_j \} \le 2^{1-k} .$$
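A sketch of this construction in Python. The interval half-widths and block counts below are hypothetical stand-ins for the constants of the actual analysis (this is where the variance upper bound $\sigma_2^2$ enters); only the Lepski-style intersection logic is the point:

```python
import numpy as np

def mom(x, k):
    # median of k block means
    return float(np.median([b.mean() for b in np.array_split(x, k)]))

def multiple_delta_estimate(x, sigma2_max, delta_min):
    """Lepski-style multiple-delta estimator (sketch).

    Builds median-of-means confidence intervals I_k with miss
    probability 2^{-k}, finds the smallest khat whose tail
    intersection I_khat cap ... cap I_K is nonempty, and returns
    the midpoint of that intersection.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    K = int(np.ceil(np.log2(1.0 / delta_min)))
    intervals = []
    for k in range(1, K + 1):
        blocks = max(1, min(n, int(np.ceil(8 * (k + 1) * np.log(2)))))
        center = mom(x, blocks)
        half_width = 8.0 * np.sqrt(sigma2_max * (k + 1) * np.log(2) / n)
        intervals.append((center - half_width, center + half_width))
    # Intersect from I_K downward; the last nonempty running
    # intersection corresponds to the smallest admissible khat.
    lo, hi, mid = -np.inf, np.inf, None
    for a, b in reversed(intervals):
        lo, hi = max(lo, a), min(hi, b)
        if lo > hi:
            break
        mid = (lo + hi) / 2.0
    return mid
```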
higher moments

For $\eta \ge 1$ and $\alpha \in (2,3]$, define
$$\mathcal{P}_{\alpha,\eta} = \{ P : \mathbb{E}|X - \mu|^\alpha \le (\eta\sigma)^\alpha \} .$$
Then for some $C = C(\alpha, \eta)$ there exists a multiple-δ estimator with a constant $L$ and $\delta_{\min} = e^{-n/C}$ for all sufficiently large $n$.

k-regular distributions

This follows from a more general result. Define
$$p_-(j) = \mathbb{P}\left\{ \sum_{i=1}^j X_i \le j\mu \right\} \quad\text{and}\quad p_+(j) = \mathbb{P}\left\{ \sum_{i=1}^j X_i \ge j\mu \right\} .$$
A distribution is k-regular if for all $j \ge k$, $\min(p_+(j), p_-(j)) \ge 1/3$. For this class there exists a multiple-δ estimator with a constant $L$ and $\delta_{\min} = e^{-n/k}$ for all $n$.

multivariate distributions

Let $X$ be a random vector taking values in $\mathbb{R}^d$ with mean $\mu = \mathbb{E}X$ and covariance matrix $\Sigma = \mathbb{E}(X-\mu)(X-\mu)^T$. Given an i.i.d. sample $X_1, \dots, X_n$, we want to estimate $\mu$ with sub-Gaussian performance.

What is sub-Gaussian here? If $X$ has a multivariate Gaussian distribution, the sample mean $\mu_n = (1/n)\sum_{i=1}^n X_i$ satisfies, with probability at least $1-\delta$,
$$\|\mu_n - \mu\| \le \sqrt{\frac{\mathrm{Tr}(\Sigma)}{n}} + \sqrt{\frac{2\lambda_{\max}\log(1/\delta)}{n}},$$
where $\lambda_{\max}$ is the largest eigenvalue of $\Sigma$. Can one construct mean estimators with similar performance for a large class of distributions?

coordinate-wise median of means

Coordinate-wise median of means yields the bound
$$\|\hat\mu_{\mathrm{MM}} - \mu\| \le K\sqrt{\frac{\mathrm{Tr}(\Sigma)\log(d/\delta)}{n}} .$$
We can do better.

multivariate median of means

Hsu and Sabato (2013) and Minsker (2015) extended the median-of-means estimate. Minsker proposes an analogous estimate that uses the multivariate median
$$\mathrm{Med}(x_1, \dots, x_N) = \operatorname*{argmin}_{y\in\mathbb{R}^d} \sum_{i=1}^N \|y - x_i\| .$$
For this estimate, with probability at least $1-\delta$,
$$\|\hat\mu_{\mathrm{MM}} - \mu\| \le K\sqrt{\frac{\mathrm{Tr}(\Sigma)\log(1/\delta)}{n}} .$$
No further assumption or knowledge of the distribution is required. Computationally feasible. Almost sub-Gaussian, but not quite. Dimension free.
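Minsker's estimate is computationally friendly because the geometric median can be computed by Weiszfeld's classical iteration. A minimal sketch (the iteration cap, tolerance, and block count are arbitrary illustrative choices):

```python
import numpy as np

def geometric_median(pts, n_iter=200, tol=1e-8):
    """Weiszfeld iteration for Med(x_1,...,x_N) = argmin_y sum_i ||y - x_i||."""
    pts = np.asarray(pts, dtype=float)
    y = pts.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(pts - y, axis=1), tol)  # avoid /0
        w = 1.0 / d
        y_new = (w[:, None] * pts).sum(axis=0) / w.sum()
        shift = np.linalg.norm(y_new - y)
        y = y_new
        if shift < tol:
            break
    return y

def multivariate_mom(x, delta):
    """Minsker-style estimate (sketch): geometric median of block means,
    with the same hypothetical block count 8 log(1/delta) as before."""
    x = np.asarray(x, dtype=float)
    k = max(1, min(len(x), int(np.ceil(8 * np.log(1.0 / delta)))))
    means = np.stack([b.mean(axis=0) for b in np.array_split(x, k)])
    return geometric_median(means)
```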
sub-Gaussian estimate

We construct a sub-Gaussian estimator in several steps. We start with an estimator that works for "spherical" distributions.

For $w \in S^{d-1}$, define $m_n^{(\delta)}(w)$ as the median-of-means estimate of $w^T\mu = \mathbb{E}\,w^T X$ at confidence level $\delta$. Let $N_{1/2}$ be a minimal $1/2$-cover of $S^{d-1}$; note that $|N_{1/2}| \le 8^d$. Since $\mathrm{Var}(w^T X) \le \lambda_{\max}$, with probability at least $1-\delta$,
$$\sup_{w\in N_{1/2}} \left| m_n^{(\delta/8^d)}(w) - w^T\mu \right| \le 2e\sqrt{\frac{2\lambda_{\max}\ln(e\,8^d/\delta)}{n}} .$$

sub-Gaussian estimate

Let $P_{\delta,\lambda}$ be the polytope of points $x \in \mathbb{R}^d$ such that
$$\sup_{w\in N_{1/2}} \left| m_n^{(\delta/8^d)}(w) - w^T x \right| \le 2e\sqrt{\frac{2\lambda\ln(e\,8^d/\delta)}{n}} .$$
Then with probability at least $1-\delta$, $\mu \in P_{\delta,\lambda_{\max}}$; in particular, on this event $P_{\delta,\lambda_{\max}}$ is nonempty. Suppose that an upper bound $\lambda \ge \lambda_{\max}$ is available. Then we may define
$$\hat\mu_{n,\lambda} = \begin{cases} \text{any element } y \in P_{\delta,\lambda} & \text{if } P_{\delta,\lambda} \ne \emptyset, \\ 0 & \text{otherwise.} \end{cases}$$

spherical distributions

For this estimate we have the following: for any distribution with mean $\mu$ and covariance matrix $\Sigma$ such that $\lambda_{\max} \le \lambda$, with probability at least $1-\delta$,
$$\|\hat\mu_{n,\lambda} - \mu\| \le 8e\sqrt{\frac{2\lambda(d\ln 8 + \ln(e/\delta))}{n}} .$$
In particular, if the distribution is c-spherical (i.e., $d\lambda_{\max} \le c\,\mathrm{Tr}(\Sigma)$) and $\lambda \le 2\lambda_{\max}$, then
$$\|\hat\mu_{n,\lambda} - \mu\| \le 16e\sqrt{\frac{c\,\mathrm{Tr}(\Sigma)\ln 8 + \lambda_{\max}\ln(e/\delta)}{n}},$$
a sub-Gaussian performance. Challenge: make the estimate computationally feasible.

extension to non-spherical distributions

We decompose $\mathbb{R}^d$ into orthogonal subspaces such that the projection of $X$ onto each subspace is nearly spherical.

We need an extra assumption:
$$\mathbb{E}\left( (X-\mu)^T v \right)^4 \le K (v^T\Sigma v)^2 .$$

For each $u$ in a γ-cover of $S^{d-1}$, calculate the median-of-means estimate $V_n(u)$ of
$$u^T\Sigma u = \mathbb{E}(u^T(X-\mu))^2 = \tfrac{1}{2}\mathbb{E}(u^T(X - X'))^2$$
based on the i.i.d. sample $\tfrac{1}{2}(u^T(X_1 - X_{n/2+1}))^2, \dots, \tfrac{1}{2}(u^T(X_{n/2} - X_n))^2$. From this, we construct an estimated covariance matrix $\hat\Sigma$.

empirical eigendecomposition

Find the spectral decomposition
$$\hat\Sigma_n = \sum_{i=1}^d \hat\lambda_i \hat v_i \hat v_i^T$$
with $\hat\lambda_1 \ge \dots \ge \hat\lambda_d$. Let $V_1$ be the subspace spanned by the eigenvectors corresponding to eigenvalues falling in $[\hat\lambda_1/2, \hat\lambda_1]$. One can show that the projection of $X$ onto $V_1$ has a 4-spherical distribution. In the orthogonal complement of $V_1$ we repeat the procedure, using fresh independent data. Repeat for $s = O(\log d)$ steps, obtaining $\mathbb{R}^d = V_1 \oplus \dots \oplus V_s \oplus V_{s+1}$.

final estimate

$V_1, \dots, V_s$ are all 4-spherical. Estimate the mean of each projection by our previous estimate. The projection onto $V_{s+1}$ has a covariance matrix with norm at most $\lambda_{\max}/d$; we may use Minsker's estimate there. We obtain an estimate such that, with probability at least $1-\delta$,
$$\|\hat\mu_n - \mu\| \le C\left( \sqrt{\frac{\mathrm{Tr}(\Sigma)}{n}} + \sqrt{\frac{\lambda_{\max}(\log(1/\delta) + \log\log d)}{n}} \right),$$
for all $n \ge C(d + \log(1/\delta))\log d$. Sub-Gaussian behavior, but not quite dimension free.
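Two primitives of this construction, sketched in Python (illustrative only: the cover over directions and the recursion over orthogonal complements are omitted, and the block count is again the hypothetical $8\log(1/\delta)$):

```python
import numpy as np

def directional_variance_mom(x, u, delta):
    """Median-of-means estimate V_n(u) of u^T Sigma u, from the sample
    (1/2)(u^T(X_i - X_{n/2+i}))^2 built on independent pairs."""
    x = np.asarray(x, dtype=float)
    m = len(x) // 2
    z = 0.5 * ((x[:m] - x[m:2 * m]) @ u) ** 2
    k = max(1, min(m, int(np.ceil(8 * np.log(1.0 / delta)))))
    return float(np.median([b.mean() for b in np.array_split(z, k)]))

def top_eigenspace_split(sigma_hat):
    """Split R^d into V_1, spanned by eigenvectors of the estimated
    covariance with eigenvalues in [lambda_1/2, lambda_1], and its
    orthogonal complement, as in the empirical eigendecomposition."""
    vals, vecs = np.linalg.eigh(sigma_hat)   # eigenvalues in ascending order
    top = vals >= vals[-1] / 2.0
    return vecs[:, top], vecs[:, ~top]       # bases of V_1 and its complement
```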
empirical risk minimization

Let $X$ take values in a space $\mathcal{X}$ and let $\mathcal{F}$ be a set of non-negative functions defined on $\mathcal{X}$. Define the risk of $f$ by $m_f = \mathbb{E}f(X)$; $m^* = \inf_{f\in\mathcal{F}} m_f$ denotes the optimal risk.

Given $X_1, \dots, X_n$ i.i.d., one aims at finding a function with small risk. The empirical risk minimizer is
$$f_{\mathrm{ERM}} = \operatorname*{argmin}_{f\in\mathcal{F}} \frac{1}{n}\sum_{i=1}^n f(X_i) .$$

The performance is measured by the excess risk $m_{\mathrm{ERM}} - m^*$, where $m_{\mathrm{ERM}} = \mathbb{E}[f_{\mathrm{ERM}}(X) \mid X_1, \dots, X_n]$.

Small uniform deviations guarantee a small excess risk:
$$m_{\mathrm{ERM}} - m^* \le 2\sup_{f\in\mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - m_f \right| .$$

Symmetrization and Dudley's classical chaining bound give
$$\mathbb{E}\,m_{\mathrm{ERM}} - m^* \le \frac{c}{\sqrt{n}}\,\mathbb{E}\int_0^\infty \sqrt{\log N_{n,2}(\mathcal{F},\epsilon)}\,d\epsilon,$$
where $N_{n,2}(\mathcal{F},\epsilon)$ is the $\epsilon$-covering number of $\mathcal{F}$ under the distance
$$d_{2,n}(f,g) = \sqrt{\frac{1}{n}\sum_{i=1}^n (f(X_i) - g(X_i))^2} .$$

chaining

The key property needed for chaining is that for all $f, g \in \mathcal{F}$ and $\epsilon > 0$,
$$\mathbb{P}\{ \tilde\mu_f - \tilde\mu_g > \epsilon \} \le \exp\left( \frac{-n\epsilon^2}{2D(f,g)^2} \right)$$
for $\tilde\mu_f = \frac{1}{n}\sum_{i=1}^n f(X_i)$.

empirical risk minimization–heavy tails

Chaining bounds are impossible to obtain under heavy tails. It is natural to select a function by minimizing a robust estimate. Chaining seems to be difficult to adapt for median-of-means, so we work with Catoni's estimate.

Recall that
$$\varphi(x) = -\mathbb{1}_{\{x<0\}}\log\left(1 - x + \tfrac{x^2}{2}\right) + \mathbb{1}_{\{x>0\}}\log\left(1 + x + \tfrac{x^2}{2}\right) .$$
Catoni's estimator $\hat\mu_f$ of $m_f = \mathbb{E}f(X)$ is the unique value for which $\hat r_f(\hat\mu_f) = 0$, where
$$\hat r_f(\mu) = \frac{1}{n\alpha}\sum_{i=1}^n \varphi(\alpha(f(X_i) - \mu)) .$$
We choose $\alpha = \sqrt{2/(nv)}$.

minimization of Catoni's estimate

Empirical risk minimization becomes
$$\hat f = \operatorname*{argmin}_{f\in\mathcal{F}} \hat\mu_f .$$
What can we say about the excess risk $m_{\hat f} - m^*$?

If $\mathcal{F}$ is finite and small, the union bound may be used: with probability $\ge 1-\delta$,
$$m_{\hat f} - m^* \le c\sqrt{\frac{\log(|\mathcal{F}|/\delta)}{n}} .$$
However, we do not have sub-Gaussian increments to apply chaining directly.
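For a finite class, the selection rule is straightforward to sketch, reusing a Catoni root-finder like the one above (the variance bound `v` is assumed known, as in the talk):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_risk(losses, v):
    """Catoni estimate of E[loss] with alpha = sqrt(2/(n v))."""
    losses = np.asarray(losses, dtype=float)
    alpha = np.sqrt(2.0 / (len(losses) * v))
    phi = lambda x: np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)
    r = lambda m: float(np.sum(phi(alpha * (losses - m))))
    return brentq(r, losses.min() - 1.0, losses.max() + 1.0)

def robust_erm(functions, x, v):
    """Select fhat = argmin over a finite class of the Catoni estimate
    of m_f = E f(X). `functions` is a list of callables."""
    risks = [catoni_risk([f(xi) for xi in x], v) for f in functions]
    return functions[int(np.argmin(risks))]
```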
performance bound

Assume that $\sup_{f\in\mathcal{F}} \mathrm{Var}(f(X)) \le v$. Then with probability at least $1-\delta$,
$$m_{\hat f} - m_{f^*} \le C\left( \sqrt{\frac{v\ln(\delta^{-1})}{n}} + \frac{1}{\sqrt{n}}\int_0^\infty \sqrt{\log N_2(\mathcal{F},\epsilon)}\,d\epsilon + \frac{1}{n}\int_0^\infty \log N_\infty(\mathcal{F},\epsilon)\,d\epsilon \right),$$
where $N_2(\mathcal{F},\epsilon)$ and $N_\infty(\mathcal{F},\epsilon)$ are covering numbers with respect to the distances
$$d(f,f') = \left( \mathbb{E}\left( f(X) - f'(X) \right)^2 \right)^{1/2} \quad\text{and}\quad D(f,f') = \sup_{x\in\mathcal{X}} |f(x) - f'(x)| .$$

The main term in the bound is
$$\sqrt{\frac{v}{n}} + \frac{1}{\sqrt{n}}\int_0^\infty \sqrt{\log N_2(\mathcal{F},\epsilon)}\,d\epsilon .$$
However, the presence of
$$\frac{1}{n}\int_0^\infty \log N_\infty(\mathcal{F},\epsilon)\,d\epsilon$$
requires $\mathcal{F}$ to be totally bounded in the sup norm.

alternative bound

With probability $\ge 1-\delta$,
$$m_{\hat f} - m_{f^*} \le 6K\Gamma_\delta\sqrt{\frac{1 + \ln(\delta^{-1})}{n}} + \sqrt{\frac{2v\ln(\delta^{-1})}{n}},$$
where $\Gamma_\delta$ is the $1 - \delta/2$ quantile of
$$\int_0^\infty \sqrt{\log N_{n,2}(\mathcal{F},\epsilon)}\,d\epsilon,$$
the covering number being taken under the random distance
$$d_{2,n}(f,g) = \sqrt{\frac{1}{n}\sum_{i=1}^n (f(X_i) - g(X_i))^2} .$$

application to L1 regression

Let $(Z_1,Y_1), \dots, (Z_n,Y_n)$ be i.i.d. pairs taking values in $\mathcal{Z}\times\mathbb{R}$. $\mathcal{G}$ is a class of functions $\mathcal{Z}\to\mathbb{R}$ such that
$$\Delta \stackrel{\mathrm{def}}{=} \sup_{g,g'\in\mathcal{G}}\, \sup_{z\in\mathcal{Z}} |g(z) - g'(z)| < \infty .$$
The risk of $g \in \mathcal{G}$ is $R(g) = \mathbb{E}|g(Z) - Y|$. Let $g^* = \operatorname*{argmin}_{g\in\mathcal{G}} R(g)$.

Statistical learning problem: find a function $\hat g \in \mathcal{G}$ such that $R(\hat g) - R(g^*)$ is small. The standard procedure minimizes $(1/n)\sum_{i=1}^n |g(Z_i) - Y_i|$ over $g \in \mathcal{G}$.

L1 regression

If $Y$ has a heavy tail, standard empirical risk minimization may fail. We choose $\hat g$ by minimizing Catoni's estimate. We only assume $\mathbb{E}Y^2 \le \sigma^2$ for some $\sigma > 0$. Then
$$\sup_{g\in\mathcal{G}} \mathrm{Var}(|g(Z) - Y|) \le \sigma^2 + \sup_{g\in\mathcal{G}}\sup_{z\in\mathcal{Z}} |g(z)|^2 \stackrel{\mathrm{def}}{=} v$$
is a known and finite constant.

For $g \in \mathcal{G}$ and $\mu \in \mathbb{R}$, define
$$\hat r_g(\mu) = \frac{1}{n\alpha}\sum_{i=1}^n \varphi(\alpha(|g(Z_i) - Y_i| - \mu)) .$$
$\hat R(g)$ is the unique value for which $\hat r_g(\hat R(g)) = 0$, and $\hat g = \operatorname*{argmin}_{g\in\mathcal{G}} \hat R(g)$.

With probability $\ge 1-\delta$,
$$R(\hat g) - R(g^*) \le C\left( \sqrt{\frac{v\log\frac{1}{\delta}}{n}} + \frac{1}{\sqrt{n}}\int_0^\infty \sqrt{\log N_\infty(\mathcal{G},\epsilon)}\,d\epsilon \right) .$$

L1 regression–experiments

Suppose $Z_i \in \mathbb{R}^4$ is such that each component of $Z_i$ is uniform on $[-1,1]$ and $Y_i = Z_i^T\theta + \epsilon_i$, where $\theta = (0.25, -0.25, 0.50, 0.70)$ and $\epsilon_i$ has Student's t distribution with $d$ degrees of freedom. [Figure: plot of the expected risk estimation error $\mathbb{E}|R(\hat g) - \hat R(\hat g)|$ for $n = 50$.]

k-means clustering

Let $X$ take values in $\mathbb{R}^d$ with distribution $P$. A clustering scheme is given by a set of $k \ge 2$ cluster centers $C = \{y_1, \dots, y_k\} \subset \mathbb{R}^d$ and a quantizer $q : \mathbb{R}^d \to C$. Given a distortion measure $\ell : \mathbb{R}^d\times\mathbb{R}^d \to [0,\infty)$, define the expected distortion
$$D_k(P,q) = \mathbb{E}\,\ell(X, q(X)) .$$
Typical choices are $\ell(x,y) = \|x-y\|^\alpha$ with $\alpha = 1$ or $2$. We consider $\ell(x,y) = \|x-y\|$.

If $\mathbb{E}\|X\| < \infty$, then there exists an optimal quantizer $q^*$:
$$D_k(P,q^*) \stackrel{\mathrm{def}}{=} D_k^*(P) .$$
$q^*$ is a nearest-neighbor quantizer, that is, $\|x - q^*(x)\| = \min_{y_i\in C} \|x - y_i\|$.

empirical quantizer design

Given an i.i.d. sample $X_1, \dots, X_n$ from $P$, we want to find $q_n$ such that
$$D_k(P,q_n) = \mathbb{E}\left[ \|X - q_n(X)\| \mid X_1, \dots, X_n \right]$$
is close to $D_k^*(P)$. One usually minimizes
$$D_k(P_n, q) = \frac{1}{n}\sum_{i=1}^n \|X_i - q(X_i)\| .$$
Pollard (1981) showed that if $\mathbb{E}\|X\| < \infty$, then $\lim_{n\to\infty} D_k(P,q_n) = D_k^*(P)$ almost surely.

When $\|X\|$ is bounded, one has
$$\mathbb{E}\,D_k(P,q_n) - D_k^*(P) \le C(P,k,d)\,n^{-1/2} .$$
For heavy-tailed $X$ such bounds are not possible. Telgarsky and Dasgupta (2013) prove bounds of the order $n^{-1/2+2/p}$ when the $p$-th moment is finite. Minimizing Catoni's estimate of the distortion, we can get $O(n^{-1/2})$ when the second moment is finite.

empirical quantizer design

Some care is needed to make sure the class $\mathcal{F}$ is totally bounded in the sup norm. One can show that it suffices to minimize over quantizers with uniformly bounded centers. Let $\mathcal{C}_\rho$ be the set of all collections $C = \{y_1, \dots, y_k\} \in (\mathbb{R}^d)^k$ with $\|y_j\| \le \rho$. For $C \in \mathcal{C}_\rho$, we estimate $D(P,q_C) = \mathbb{E}\|X - q_C(X)\|$ by the unique $\mu \in \mathbb{R}$ such that
$$\frac{1}{n\alpha}\sum_{i=1}^n \varphi\left( \alpha\left( \min_{j=1,\dots,k} \|X_i - y_j\| - \mu \right) \right) = 0 .$$
Let $\hat q_n$ be any quantizer minimizing the estimated distortion.
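A sketch of this robust distortion estimate for one candidate set of centers (how one searches over $\mathcal{C}_\rho$, e.g. by a Lloyd-type alternation or a grid, is left open and is not specified in the talk):

```python
import numpy as np
from scipy.optimize import brentq

def catoni_distortion(x, centers, v):
    """Catoni estimate of D(P, q_C) = E min_j ||X - y_j|| for the
    nearest-neighbor quantizer with the given centers; v is an upper
    bound on Var(||X||), and alpha = sqrt(2/(n v)) as before."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # distance of each sample point to its nearest center
    dists = np.min(np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2),
                   axis=1)
    alpha = np.sqrt(2.0 / (len(x) * v))
    phi = lambda t: np.sign(t) * np.log1p(np.abs(t) + 0.5 * t * t)
    r = lambda m: float(np.sum(phi(alpha * (dists - m))))
    return brentq(r, dists.min() - 1.0, dists.max() + 1.0)
```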
empirical quantizer design

Assume $V := \mathrm{Var}(\|X\|) < \infty$ and that $\rho$ is known. Then with probability $\ge 1-\delta$,
$$D(P,\hat q_n) - D(P,q^*) \le C(\rho)\left( \sqrt{\frac{V\log\frac{1}{\delta}}{n}} + \sqrt{\frac{kd}{n}} \right) .$$
In the bounded case a bound of the order $k/\sqrt{n}$ is possible.

references

L. Devroye, M. Lerasle, G. Lugosi, and R. Imbuzeiro Oliveira. Sub-Gaussian mean estimators. Annals of Statistics, to appear, 2016.

E. Joly and G. Lugosi. Robust estimation of U-statistics. Stochastic Processes and their Applications, to appear, 2015.

C. Brownlees, E. Joly, and G. Lugosi. Empirical risk minimization for heavy-tailed losses. Annals of Statistics, 43:2507–2536, 2015.

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59:7711–7717, 2013.

E. Joly, G. Lugosi, and R. Imbuzeiro Oliveira. On the estimation of the mean of a random vector. In preparation.