How to estimate the mean of a random variable?
Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona
based on joint work with
Christian Brownlees (Pompeu Fabra, Barcelona)
Luc Devroye (McGill, Montreal)
Émilien Joly (Nanterre)
Matthieu Lerasle (CNRS, Nice)
Roberto Imbuzeiro Oliveira (IMPA, Rio)
estimating the mean
Given X1, . . . , Xn, a real i.i.d. sequence, estimate µ = EX1.
“Obvious” choice: empirical mean
µn = (1/n) ∑_{i=1}^{n} Xi
By the central limit theorem, if X has a finite variance σ²,
lim_{n→∞} P{ √n |µn − µ| > σ √(2 log(2/δ)) } ≤ δ.
We would like non-asymptotic inequalities of a similar form.
If the distribution is sub-Gaussian, E exp(λ(X − µ)) ≤ exp(σ²λ²/2), then with probability at least 1 − δ,
|µn − µ| ≤ σ √(2 log(2/δ)/n).
empirical mean–heavy tails
The empirical mean is computationally attractive.
Requires no a priori knowledge and automatically scales with σ.
If the distribution is not sub-Gaussian, we still have Chebyshev’s inequality: w.p. ≥ 1 − δ,
|µn − µ| ≤ σ √(2/(nδ)).
Exponentially weaker bound. Especially hurts when many means are estimated simultaneously.
This is the best one can say. Catoni (2012) shows that for each δ there exists a distribution with finite variance such that
P{ |µn − µ| ≥ σ √(c/(nδ)) } ≥ δ.
sub-Gaussian estimators
Our question: Under what conditions do there exist sub-Gaussian mean estimators?
More formally: let P be a class of probability distributions on R with finite variance. Does there exist a mean estimator µ̂n such that for all distributions in P, with probability at least 1 − δ,
|µ̂n − µ| ≤ Lσ √(log(2/δ)/n)
for some constant L = L(P)?
median of means
A simple estimator is median-of-means. Goes back to Nemirovsky, Yudin (1983), Jerrum, Valiant, and Vazirani (1986), Alon, Matias, and Szegedy (2002).
µ̂_MM = median( (1/N) ∑_{t=1}^{N} Xt , . . . , (1/N) ∑_{t=(k−1)N+1}^{kN} Xt )
Lemma
Let δ ∈ (0, 1), k = 8 log δ⁻¹ and N = n/(8 log δ⁻¹). Then with probability at least 1 − δ,
|µ̂_MM − µ| ≤ 8σ √(log(2/δ)/n)
proof
By Chebyshev, each block mean is within distance σ √(8/N) of µ with probability 3/4.
The probability that the median is not within distance σ √(8/N) of µ is at most P{Bin(k, 1/4) > k/2}, which is exponentially small in k.
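For concreteness, here is a minimal numerical sketch of the median-of-means estimator described above (illustrative, not part of the talk); the choice of roughly k = 8 log(1/δ) blocks follows the lemma, and the constants are only indicative.

    import numpy as np

    def median_of_means(x, delta=0.05):
        """Median-of-means estimate of EX from the sample x.

        The number of blocks is tied to the confidence level, roughly
        k = 8 log(1/delta) as in the lemma above (constants illustrative).
        """
        x = np.asarray(x, dtype=float)
        n = len(x)
        k = min(n, max(1, int(np.ceil(8 * np.log(1.0 / delta)))))
        # Split the sample into k blocks of (almost) equal size,
        # average each block, and report the median of the block means.
        block_means = [block.mean() for block in np.array_split(x, k)]
        return float(np.median(block_means))

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Heavy-tailed sample with finite variance (Pareto/Lomax, tail index 2.5).
        sample = rng.pareto(2.5, size=1000)
        print("empirical mean :", sample.mean())
        print("median of means:", median_of_means(sample, delta=0.01))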
median of means
• Sub-Gaussian deviations.
• Scales automatically with σ.
• Parameters depend on required confidence level δ.
• See Lerasle and Oliveira (2012), Hsu and Sabato (2013), Minsker (2014) for generalizations.
• Also works when the variance is infinite. If E|X − EX|^{1+α} = M for some α ≤ 1, then, with probability at least 1 − δ,
  |µ̂_MM − µ| ≤ 8 ( (12M)^{1/α} ln(1/δ) / n )^{α/(1+α)}
catoni’s estimator
Catoni (2012) introduced a novel M-estimator.
Let φ(x) = (1{x≥0} − 1{x<0}) log(1 + |x| + x²/2).
Catoni’s estimator (with constant α > 0) µ̂_C is defined as the solution of
∑_{t=1}^{n} φ(α(Xt − µ)) = 0.
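A minimal sketch of this M-estimator (illustrative, not from the talk): the defining equation is solved by root finding, and the plug-in sample variance used to set α in the demo is only a stand-in for the known upper bound v assumed by the theory.

    import numpy as np
    from scipy.optimize import brentq

    def catoni_phi(x):
        # phi(x) = sign(x) * log(1 + |x| + x^2/2), as defined above.
        return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

    def catoni_mean(x, alpha):
        """Catoni's M-estimator: the root in mu of sum_t phi(alpha*(X_t - mu)) = 0."""
        x = np.asarray(x, dtype=float)

        def score(mu):
            return catoni_phi(alpha * (x - mu)).sum()

        # The score is decreasing in mu and changes sign on [min(x)-1, max(x)+1].
        return brentq(score, x.min() - 1.0, x.max() + 1.0)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.standard_t(df=3, size=2000) + 5.0   # heavy-tailed sample, true mean 5
        v = x.var()                                  # plug-in variance, for the demo only
        alpha = np.sqrt(2.0 / (len(x) * v))          # the choice alpha = sqrt(2/(n v))
        print("empirical mean:", x.mean())
        print("Catoni mean   :", catoni_mean(x, alpha))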
catoni’s estimator
Let δ ∈ (0, 1) and v an upper bound on the variance of X. Then with α = √( 2 log δ⁻¹ / (nv + o(n)) ), with probability ≥ 1 − δ,
|µ̂_C − EX| ≤ √( 2v log(1/δ) / (n − 2 log(1/δ)) ).
With α = √(2/(nv)),
|µ̂_C − EX| ≤ (1 + log(2/δ)) √(v/n).
catoni’s estimator
• Sub-Gaussian deviations with optimal constants.
• Scales with (an upper bound on) Var (X ).
• Requires knowledge of (an upper bound on) Var (X ).
• Works (with the optimal constants) only for finite variance.
• Estimator depends on δ for sub-Gaussian performance. A
confidence-independent choice yields sub-exponential bounds.
catoni’s proof
Define rn(m) = (1/(nα)) ∑_{t=1}^{n} φ(α(Xt − m)), a decreasing function. Then for each fixed m,
E e^{nα rn(m)} ≤ exp( nα(µ − m) + (nα²/2)(v + (m − µ)²) )
and by Markov’s inequality,
P{ rn(m) < (µ − m) + (α/2)(v + (m − µ)²) + log(1/δ)/(nα) } > 1 − δ.
why sub-Gaussian?
Sub-Gaussian bounds are the best one can hope for when the variance is finite.
In fact, for any M > 0, α ∈ (0, 1], δ > 2e^{−n/4}, and mean estimator µ̂n, there exists a distribution with E|X − EX|^{1+α} = M such that, with probability at least δ,
|µ̂n − µ| ≥ ( M^{1/α} ln(1/δ) / n )^{α/(1+α)}.
Proof: The distributions P+(0) = 1 − p, P+(c) = p and P−(0) = 1 − p, P−(−c) = p are indistinguishable if all n samples are equal to 0.
why sub-Gaussian?
This shows optimality of the median-of-means estimator for all α.
It also shows that finite variance is necessary even for rate n^{−1/2}.
One cannot hope to get anything better than sub-Gaussian tails.
Catoni proved that the sample mean is optimal for the class of Gaussian distributions.
multiple-δ estimators
Do there exist estimators that are sub-Gaussian simultaneously for all confidence levels?
An estimator is multiple-δ-sub-Gaussian for a class of distributions P and δmin if for all δ ∈ [δmin, 1), and all distributions in P,
|µ̂n − µ| ≤ Lσ √(log(2/δ)/n).
The picture is more complex than before.
known variance
Given 0 < σ1 ≤ σ2 < ∞, define the class
P2^{[σ1², σ2²]} = {P : σ1² ≤ σP² ≤ σ2²}.
Let R = σ2/σ1.
• If R is bounded then there exists a multiple-δ-sub-Gaussian estimator with δmin = 4e^{1−n/2};
• If R is unbounded then there is no multiple-δ-sub-Gaussian estimate for any L and δmin → 0.
A sharp distinction.
The exponentially small value of δmin is best possible.
construction of multiple-δ estimator
Reminiscent of Lepski’s method of adaptive estimation.
For k = 1, . . . , K = log2(1/δmin), use the median-of-means estimator to construct confidence intervals Ik such that
P{µ ∉ Ik} ≤ 2^{−k}.
(This is where knowledge of σ2 and boundedness of R is used.)
Define
k̂ = min{ k : ∩_{j=k}^{K} Ij ≠ ∅ }.
Finally, let
µ̂n = midpoint of ∩_{j=k̂}^{K} Ij
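A rough sketch of this Lepski-style aggregation (illustrative, not from the talk): median-of-means confidence intervals I_1, . . . , I_K with coverage 1 − 2^{−k} are intersected from the least confident level upwards. The interval half-width constant and the assumed upper bound sigma_max on the standard deviation are placeholders for the quantities that the theory supplies (this is where knowledge of σ2 enters).

    import numpy as np

    def mom_interval(x, delta, sigma_max):
        """Median-of-means confidence interval of coverage 1 - delta, with
        half-width 8 * sigma_max * sqrt(log(2/delta)/n) as in the lemma above."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        k = min(n, max(1, int(np.ceil(8 * np.log(2.0 / delta)))))
        center = np.median([b.mean() for b in np.array_split(x, k)])
        half_width = 8.0 * sigma_max * np.sqrt(np.log(2.0 / delta) / n)
        return center - half_width, center + half_width

    def multiple_delta_estimator(x, delta_min, sigma_max):
        """Build I_1, ..., I_K with P{mu not in I_k} <= 2^{-k}, find the smallest k
        whose tail intersection is nonempty, and return that intersection's midpoint."""
        K = int(np.ceil(np.log2(1.0 / delta_min)))
        intervals = [mom_interval(x, 2.0 ** (-k), sigma_max) for k in range(1, K + 1)]
        for k in range(K):
            lo = max(a for a, _ in intervals[k:])   # intersection of I_{k+1}, ..., I_K
            hi = min(b for _, b in intervals[k:])
            if lo <= hi:
                return 0.5 * (lo + hi)
        a, b = intervals[-1]                        # fallback: widest interval alone
        return 0.5 * (a + b)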
proof
For any k = 1, . . . , K,
P{|µ̂n − µ| > |Ik|} ≤ P{∃j ≥ k : µ ∉ Ij}
because if µ ∈ ∩_{j=k}^{K} Ij, then ∩_{j=k}^{K} Ij is non-empty and therefore µ̂n ∈ ∩_{j=k}^{K} Ij.
But
P{∃j ≥ k : µ ∉ Ij} ≤ ∑_{j=k}^{K} P{µ ∉ Ij} ≤ 2^{1−k}
higher moments
For η ≥ 1 and α ∈ (2, 3], define
Pα,η = {P : E|X − µ|^α ≤ (ησ)^α}.
Then for some C = C(α, η) there exists a multiple-δ estimator with a constant L and δmin = e^{−n/C} for all sufficiently large n.
k-regular distributions
This follows from a more general result:
Define
p−(j) = P{ ∑_{i=1}^{j} Xi ≤ jµ }  and  p+(j) = P{ ∑_{i=1}^{j} Xi ≥ jµ }.
A distribution is k-regular if
∀j ≥ k, min(p+(j), p−(j)) ≥ 1/3.
For this class there exists a multiple-δ estimator with a constant L and δmin = e^{−n/k} for all n.
multivariate distributions
Let X be a random vector taking values in R^d with mean µ = EX and covariance matrix Σ = E(X − µ)(X − µ)^T.
Given an i.i.d. sample X1, . . . , Xn, we want an estimator of µ that has sub-Gaussian performance.
What is sub-Gaussian?
If X has a multivariate Gaussian distribution, the sample mean µn = (1/n) ∑_{i=1}^{n} Xi satisfies, with probability at least 1 − δ,
‖µn − µ‖ ≤ √(Tr(Σ)/n) + √(2 λmax log(1/δ)/n).
Can one construct mean estimators with similar performance for a large class of distributions?
coordinate-wise median of means
Coordinate-wise median of means yields the bound:
‖µ̂_MM − µ‖ ≤ K √(Tr(Σ) log(d/δ)/n).
We can do better.
multivariate median of means
Hsu and Sabato (2013), Minsker (2015) extended the median-of-means estimate.
Minsker proposes an analogous estimate that uses the multivariate median
Med(x1, . . . , xN) = argmin_{y∈R^d} ∑_{i=1}^{N} ‖y − xi‖.
For this estimate, with probability at least 1 − δ,
‖µ̂_MM − µ‖ ≤ K √(Tr(Σ) log(1/δ)/n).
• No further assumption or knowledge of the distribution is required.
• Computationally feasible.
• Almost sub-Gaussian but not quite.
• Dimension free.
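A minimal sketch of a Minsker-style estimator (illustrative): block means are combined through the geometric median, computed here with Weiszfeld's iteration. The number of blocks, which the theory ties to the confidence level δ, is left as a free parameter.

    import numpy as np

    def geometric_median(points, n_iter=200, tol=1e-8):
        """Geometric median argmin_y sum_i ||y - x_i|| via Weiszfeld's iteration."""
        y = points.mean(axis=0)
        for _ in range(n_iter):
            dist = np.maximum(np.linalg.norm(points - y, axis=1), tol)
            w = 1.0 / dist
            y_new = (w[:, None] * points).sum(axis=0) / w.sum()
            if np.linalg.norm(y_new - y) < tol:
                return y_new
            y = y_new
        return y

    def multivariate_median_of_means(X, n_blocks):
        """Average each block of the sample, then take the geometric median
        of the block means."""
        blocks = np.array_split(np.asarray(X, dtype=float), n_blocks)
        block_means = np.vstack([b.mean(axis=0) for b in blocks])
        return geometric_median(block_means)

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        X = rng.standard_t(df=3, size=(2000, 5))   # heavy-tailed coordinates, mean 0
        print(multivariate_median_of_means(X, n_blocks=20))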
sub-Gaussian estimate
We construct a sub-Gaussian estimator in various steps.
We start with an estimator that works for “spherical” distributions.
For w ∈ S^{d−1}, define mn^{(δ)}(w) as the median-of-means estimate of w^T µ = E w^T X.
Let N_{1/2} be a minimal 1/2-cover of S^{d−1}. Note that |N_{1/2}| ≤ 8^d.
Since Var(w^T X) ≤ λmax, with probability at least 1 − δ,
sup_{w∈N_{1/2}} | mn^{(δ/8^d)}(w) − w^T µ | ≤ 2e √( 2λmax ln(e 8^d/δ) / n ).
sub-Gaussian estimate
Let Pδ,λ be the polytope of points x ∈ R^d with
sup_{w∈N_{1/2}} | mn^{(δ/8^d)}(w) − w^T x | ≤ 2e √( 2λ ln(e 8^d/δ) / n ).
Then with probability at least 1 − δ, µ ∈ P_{δ,λmax}.
In particular, on this event, P_{δ,λmax} is nonempty.
Suppose that an upper bound λ ≥ λmax is available. Then we may define
µ̂_{n,λ} = any element y ∈ Pδ,λ if Pδ,λ ≠ ∅, and 0 otherwise.
spherical distributions
For this estimate, we have the following:
For any distribution with mean µ and covariance matrix Σ such that λmax ≤ λ, with probability at least 1 − δ,
‖µ̂_{n,λ} − µ‖ ≤ 8e √( 2λ (d ln 8 + ln(e/δ)) / n ).
In particular, if the distribution is c-spherical (i.e., d λmax ≤ c Tr(Σ)) and λ ≤ 2λmax, then
‖µ̂_{n,λ} − µ‖ ≤ 16e √( (c Tr(Σ) ln 8 + λmax ln(e/δ)) / n ),
a sub-Gaussian performance.
Challenge: make the estimate computationally feasible.
extension to non-spherical distributions
We decompose R^d into orthogonal subspaces such that the projection of X to each subspace is nearly spherical.
We need an extra assumption:
E((X − µ)^T v)^4 ≤ K (v^T Σ v)².
For each u in a γ-cover of S^{d−1}, calculate the median-of-means estimate Vn(u) of
u^T Σ u = E(u^T (X − µ))² = (1/2) E(u^T (X − X′))²
based on the i.i.d. sample
(1/2)(u^T (X1 − X_{n/2+1}))², . . . , (1/2)(u^T (X_{n/2} − Xn))².
From this, we construct an estimated covariance matrix Σ̂.
empirical eigendecomposition
Find the spectral decomposition
Σ̂n = ∑_{i=1}^{d} λ̂i v̂i v̂i^T
with λ̂1 ≥ · · · ≥ λ̂d.
Let V1 be the subspace spanned by eigenvectors corresponding to eigenvalues falling in [λ̂1/2, λ̂1].
One can show that the projection of X to V1 has a 4-spherical distribution.
In the orthogonal complement of V1 we repeat the procedure, using fresh independent data.
Repeat for s = O(log d) steps, obtaining
R^d = V1 ⊕ · · · ⊕ Vs ⊕ Vs+1.
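A simplified sketch of the spectral splitting step (illustrative): eigenvectors are grouped so that each group's eigenvalues lie within a factor 2 of the group's largest eigenvalue, which makes the projection onto each subspace nearly spherical. Unlike the actual procedure, a single covariance estimate is reused here rather than re-estimated on fresh data in each orthogonal complement.

    import numpy as np

    def dyadic_eigen_subspaces(Sigma_hat):
        """Group eigenvectors of Sigma_hat into subspaces V_1, V_2, ... so that
        within each group the eigenvalues lie within a factor 2 of the group's
        largest eigenvalue."""
        eigvals, eigvecs = np.linalg.eigh(Sigma_hat)        # ascending order
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
        subspaces, start, d = [], 0, len(eigvals)
        while start < d:
            lam_top = eigvals[start]
            end = start + 1
            while end < d and eigvals[end] >= lam_top / 2.0:
                end += 1
            subspaces.append(eigvecs[:, start:end])         # orthonormal basis of V_s
            start = end
        return subspaces

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        A = rng.normal(size=(6, 6))
        for s, V in enumerate(dyadic_eigen_subspaces(A @ A.T), start=1):
            print(f"V{s}: dimension {V.shape[1]}")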
final estimate
V1, . . . , Vs are all 4-spherical. Estimate the mean of each projection by our previous estimate.
The projection to Vs+1 has a covariance matrix with norm at most λmax/d. We may use Minsker’s estimate here.
We obtain an estimate such that, with probability at least 1 − δ,
‖µ̂n − µ‖ ≤ C ( √(Tr(Σ)/n) + √( λmax (log(1/δ) + log log d) / n ) ),
for all n ≥ C(d + log(1/δ)) log d.
Sub-Gaussian behavior but not quite dimension free.
empirical risk minimization
Let X take values in a space 𝒳 and let F be a set of non-negative functions defined on 𝒳.
Define the risk of f by mf = E f(X).
m* = inf_{f∈F} mf denotes the optimal risk.
Given X1, . . . , Xn i.i.d., one aims at finding a function with small risk.
The empirical risk minimizer is
fERM = argmin_{f∈F} (1/n) ∑_{i=1}^{n} f(Xi)
empirical risk minimization
The performance is measured by the excess risk mERM − m*, where
mERM = E[ fERM(X) | X1, . . . , Xn ].
Small uniform deviations guarantee a small excess risk:
mERM − m* ≤ 2 sup_{f∈F} | (1/n) ∑_{i=1}^{n} f(Xi) − mf |
Symmetrization and Dudley’s classical chaining bound gives
E mERM − m* ≤ (c/√n) E ∫_0^∞ √( log N_{n,2}(F, ε) ) dε,
where N_{n,2}(F, ε) is the ε-covering number of F under the distance
d_{2,n}(f, g) = √( (1/n) ∑_{i=1}^{n} (f(Xi) − g(Xi))² )
chaining
The key property needed for chaining is that for all f, g ∈ F and ε > 0,
P{ µ̃f − µ̃g > ε } ≤ exp( −nε² / (2 D(f, g)²) )
for µ̃f = (1/n) ∑_{i=1}^{n} f(Xi).
empirical risk minimization–heavy tails
Chaining bounds are impossible to obtain under heavy tails.
It is natural to select a function by minimizing a robust estimate.
Chaining seems to be difficult to adapt for median-of-means.
We work with Catoni’s estimate.
Recall that
φ(x) = −1{x<0} log(1 − x + x²/2) + 1{x>0} log(1 + x + x²/2).
Catoni’s estimator of mf = E f(X) is the unique value µ̂f for which r̂f(µ̂f) = 0, where
r̂f(µ) = (1/(nα)) ∑_{i=1}^{n} φ(α(f(Xi) − µ)).
We choose α = √(2/(nv)).
minimization of Catoni’s estimate
Empirical risk minimization becomes
f̂ = argmin_{f∈F} µ̂f
What can we say about the excess risk m_{f̂} − m*?
If F is finite and small, the union bound may be used: with probability ≥ 1 − δ,
m_{f̂} − m* ≤ c √( log(|F|/δ) / n )
However, we do not have sub-Gaussian increments to apply chaining directly.
performance bound
Assume that sup_{f∈F} Var(f(X)) ≤ v. Then with probability at least 1 − δ,
m_{f̂} − m_{f*} ≤ C ln(δ⁻¹) ( √(v/n) + (1/√n) ∫_0^∞ √( log N2(F, ε) ) dε + (1/n) ∫_0^∞ log N∞(F, ε) dε ),
where N2(F, ε) and N∞(F, ε) are covering numbers with respect to the distances
d(f, f′) = ( E( f(X) − f′(X) )² )^{1/2}
and
D(f, f′) = sup_{x∈𝒳} | f(x) − f′(x) |.
performance bound
The main term in the bound is
√(v/n) + (1/√n) ∫_0^∞ √( log N2(F, ε) ) dε
However, the presence of
(1/n) ∫_0^∞ log N∞(F, ε) dε
requires F to be totally bounded in the sup norm.
alternative bound
With probability 1 − δ,
m_{f̂} − m_{f*} ≤ 6 √(2v/n) (1 + ln(δ⁻¹)) + (K Γδ/√n) √( ln(δ⁻¹) )
where Γδ is the 1 − δ/2 quantile of
∫_0^∞ √( log N_{n,2}(F, ε) ) dε
Covering number is under the random distance
d_{2,n}(f, g) = √( (1/n) ∑_{i=1}^{n} (f(Xi) − g(Xi))² )
application to L1 regression
Let (Z1, Y1), . . . , (Zn, Yn) be i.i.d. taking values in 𝒵 × R.
G is a class of functions 𝒵 → R such that
∆ := sup_{g,g′∈G} sup_{z∈𝒵} | g(z) − g′(z) | < ∞
The risk of g ∈ G is R(g) = E|g(Z) − Y|.
Let g* = argmin_{g∈G} R(g).
Statistical learning problem: find a function ĝ ∈ G such that
R(ĝ) − R(g*)
is small.
The standard procedure minimizes (1/n) ∑_{i=1}^{n} |g(Zi) − Yi| over g ∈ G.
L1 regression
If Y has a heavy tail, standard empirical risk minimization may fail.
We choose ĝ by minimizing Catoni’s estimate.
We only assume EY² ≤ σ² for some σ > 0.
Then
sup_{g∈G} Var(|g(Z) − Y|) ≤ σ² + sup_{g∈G} sup_{z∈𝒵} |g(z)|² =: v
is a known and finite constant.
For g ∈ G and µ ∈ R, define
r̂g(µ) = (1/(nα)) ∑_{i=1}^{n} φ(α(|g(Zi) − Yi| − µ))
R̂(g) is the unique value for which r̂g(R̂(g)) = 0.
ĝ = argmin_{g∈G} R̂(g).
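A small illustrative sketch of this selection rule for linear predictors on a finite candidate grid (the grid, the noise distribution, and the variance bound v are placeholders that loosely mirror the experimental design described later, not the talk's actual setup).

    import numpy as np
    from scipy.optimize import brentq

    def catoni_phi(x):
        # phi(x) = sign(x) * log(1 + |x| + x^2/2)
        return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

    def catoni_risk(losses, alpha):
        """Catoni estimate R_hat(g): root of (1/(n alpha)) sum_i phi(alpha*(loss_i - mu)) = 0."""
        losses = np.asarray(losses, dtype=float)
        score = lambda mu: catoni_phi(alpha * (losses - mu)).sum()
        return brentq(score, losses.min() - 1.0, losses.max() + 1.0)

    def robust_l1_selection(Z, Y, candidate_thetas, v):
        """Select the linear predictor g_theta(z) = z . theta, over a finite grid,
        minimizing the Catoni estimate of the L1 risk, with alpha = sqrt(2/(n v))."""
        alpha = np.sqrt(2.0 / (len(Y) * v))
        risks = [catoni_risk(np.abs(Z @ theta - Y), alpha) for theta in candidate_thetas]
        best = int(np.argmin(risks))
        return candidate_thetas[best], risks[best]

    if __name__ == "__main__":
        rng = np.random.default_rng(4)
        n, theta_true = 50, np.array([0.25, -0.25, 0.50, 0.70])
        Z = rng.uniform(-1.0, 1.0, size=(n, 4))
        Y = Z @ theta_true + rng.standard_t(df=3, size=n)    # heavy-tailed noise
        candidates = [theta_true + rng.normal(scale=0.3, size=4) for _ in range(200)]
        theta_hat, risk_hat = robust_l1_selection(Z, Y, candidates, v=10.0)
        print("selected theta :", np.round(theta_hat, 2))
        print("estimated risk :", round(risk_hat, 3))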
L1 regression
With probability ≥ 1 − δ,
R(ĝ) − R(g*) ≤ C log(1/δ) ( √(v/n) + (1/√n) ∫_0^∞ √( log N∞(G, ε) ) dε ).
L1 regression–experiments
Suppose Zi ∈ R^4 such that each component of Zi is uniform on [−1, 1] and
Yi = Zi^T θ + εi,
where θ = (0.25, −0.25, 0.50, 0.70) and εi has Student’s t distribution with d degrees of freedom.
Plot of E|R(ĝ) − R̂(ĝ)| for n = 50.
k-means clustering
Let X take values in R^d with distribution P.
A clustering scheme is given by a set of k ≥ 2 cluster centers C = {y1, . . . , yk} ⊂ R^d and a quantizer q : R^d → C.
Given a distortion measure ℓ : R^d × R^d → [0, ∞), define the expected distortion
Dk(P, q) = E ℓ(X, q(X))
Typical choices are ℓ(x, y) = ‖x − y‖^α with α = 1 or 2.
We consider ℓ(x, y) = ‖x − y‖.
If E‖X‖ < ∞, then there exists an optimal quantizer q*:
Dk(P, q*) =: Dk*(P).
q* is a nearest neighbor quantizer, that is,
‖x − q*(x)‖ = min_{yi∈C} ‖x − yi‖.
empirical quantizer design
Given an i.i.d. sample X1, . . . , Xn from P, we want to find qn such that
Dk(P, qn) = E[ ‖X − qn(X)‖ | X1, . . . , Xn ]
is close to Dk*(P).
One usually minimizes
Dk(Pn, q) = (1/n) ∑_{i=1}^{n} ‖Xi − q(Xi)‖.
Pollard (1981) showed that if E‖X‖ < ∞, then
lim_{n→∞} Dk(P, qn) = Dk*(P) a.s.
empirical quantizer design
When ‖X‖ is bounded, one has
E Dk(P, qn) − Dk*(P) ≤ C(P, k, d) n^{−1/2}.
For heavy-tailed X such bounds are not possible.
Telgarsky and Dasgupta (2013) prove bounds of the order n^{−1/2+2/p} when the p-th moment is finite.
Minimizing Catoni’s estimate of the distortion we can get O(n^{−1/2}) when the second moment is finite.
empirical quantizer design
Some care is needed to make sure the class F is totally bounded in the sup norm.
One can show that it suffices to minimize over quantizers with uniformly bounded centers.
Let Cρ be the set of all collections C = {y1, . . . , yk} ∈ (R^d)^k with ‖yj‖ ≤ ρ.
For C ∈ Cρ, we estimate D(P, qC) = E‖X − qC(X)‖ by the unique µ ∈ R such that
(1/(nα)) ∑_{i=1}^{n} φ( α( min_{j=1,...,k} ‖Xi − yj‖ − µ ) ) = 0
Let q̂n be any quantizer minimizing the estimated distortion.
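A sketch of the Catoni estimate of the distortion of one fixed codebook (illustrative, not from the talk). The robust quantizer q̂n would then be obtained by searching over codebooks in Cρ for the one minimizing this quantity, for instance by a Lloyd-type local search, which is not shown here.

    import numpy as np
    from scipy.optimize import brentq

    def catoni_phi(x):
        # phi(x) = sign(x) * log(1 + |x| + x^2/2), as before
        return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x * x)

    def catoni_distortion(X, centers, alpha):
        """Catoni estimate of E min_j ||X - y_j|| for a fixed codebook {y_1, ..., y_k}."""
        X = np.asarray(X, dtype=float)
        centers = np.asarray(centers, dtype=float)
        # Distance of each sample point to its nearest centre.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
        score = lambda mu: catoni_phi(alpha * (dists - mu)).sum()
        return brentq(score, dists.min() - 1.0, dists.max() + 1.0)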
empirical quantizer design
Assume V := Var(‖X‖) < ∞ and ρ is known. Then with probability ≥ 1 − δ,
D(P, q̂n) − D(P, q*) ≤ C(ρ) log(1/δ) ( √(Vk/n) + √(d/n) )
In the bounded case a bound of the order k/√n is possible.
references
L. Devroye, M. Lerasle, G. Lugosi, and R. Imbuzeiro Oliveira.
Sub-Gaussian mean estimators.
Annals of Statistics, to appear, 2016.
E. Joly and G. Lugosi.
Robust estimation of U-statistics.
Stochastic Processes and their Applications, to appear, 2015.
C. Brownlees, E. Joly, and G. Lugosi.
Empirical risk minimization for heavy-tailed losses.
Annals of Statistics, 43:2507–2536, 2015.
S. Bubeck, N. Cesa-Bianchi, and G. Lugosi.
Bandits with heavy tail.
IEEE Transactions on Information Theory, 59:7711-7717, 2013.
E. Joly, G. Lugosi, and R. Imbuzeiro Oliveira.
On the estimation of the mean of a random vector.
(in preparation)