Download arXiv:1205.1005v1 [math.ST] 4 May 2012

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Statistics wikipedia , lookup

History of statistics wikipedia , lookup

Probability wikipedia , lookup

Ars Conjectandi wikipedia , lookup

Transcript
Some Refinements of Large Deviation Tail Probabilities
László Györfia , Peter Harremoësb,∗, Gábor Tusnádyc
arXiv:1205.1005v1 [math.ST] 4 May 2012
a
Budapest University of Technology and Economics, Budapest, Hungary
b Copenhagen Business College, Copenhagen, Denmark
c Rényi Institute of Mathematics, Budapest, Hungary
Abstract
We study tail probabilities via some Gaussian approximations. Our results make refinements to large deviation theory. The proof builds on classical results by Bahadur and
Rao. Binomial distributions and their tail probabilities are discussed in more detail.
Keywords: Binomial distribution, Gaussian distribution, large deviations, tail
probability.
2000 MSC: primary 60F10, secondary 60E15
1. Introduction
Let X1 , . . . , Xn be i.i.d. random variables such that the moment generating function
E [exp (βX1 )] is finite in a neighborhood of the origin. For fixed µ > E [X1 ], the aim of
this paper is to approximate the tail distribution:
( n
)
1X
Pn,µ := P
Xi ≥ µ .
n i=1
If µ is close to the mean of X1 one would usually approximate Pn,µ by a tail probability
of a Gaussian random variable. If µ is far from the mean of X1 the tail probability can be
estimated using large deviation theory. According to the Sanov theorem the probability
that the deviation from the mean is as large as µ is of the order exp (−nD) where D is a
constant. Bahadur and Rao [2] improved the estimate of this large deviation probability,
and the goal of this paper is to extend the Gaussian tail approximations into situations
where one normally uses large deviation techniques.
Let φ and Φ be the density function and the distribution function of the standard
Gaussian, respectively. Let P0 denote a probability measure describing the distribution
of a random variable X. Consider the 1-dimensional exponential family (Pβ ) based on P0
and given by
dPβ
exp (β · x)
(x) =
dP0
Z (β)
∗ Corresponding
author
Email addresses: [email protected] (László Györfi), [email protected] (Peter Harremoës),
[email protected] (Gábor Tusnády)
Preprint submitted to Statistics and Probability Letters
May 7, 2012
where the denominator is the moment generating function (partition function) given by
Z
Z (β) = exp (β · x) dP0 x = E eβX .
The mean value of Pβ is
Z ′ (β)
Z (β)
(1)
and the range of this function will be denoted M and will be called the mean value range
of the exponential family.
For µ in interior of M the maximum likelihood estimate β̂ (µ) equals the β such that
the mean value of Pβ equals µ, which in this case is the average of the i.i.d. samples.
Put P µ = Pβ̂(µ) . An equivalent definition of β̂ (µ) can be as the solution of the equation
Z β̂(µ)
eβ̂(µ)µ
=
i
h
E eβ̂(µ)X
eβ̂(µ)µ
E eβX
Z(β)
= min
= min βµ .
βµ
β>0
β>0 e
e
Let V (µ) denote the variance of P µ .
Information divergence is given by
D (P µ kP0 ) =
Z
ln
dP µ
(x)
dP0
dP µ x.
We see that
D (P µ kP0 ) = − ln
h
i
E eβ̂(µ)X
eβ̂(µ)µ
= β̂ (µ) µ − ln Z β̂ (µ) .
(2)
2. Approximation of tail distributions for non-lattice valued variables
Introduce the notation
µ∗ := sup{µ > µ0 ; D (P µ kP0 ) < ∞} = sup M.
Bahadur and Rao [2] proved a refined version of the large deviation bound, but some
aspects of their result dates back to Cramér [4] and part of it was proved by a different
method by Blackwell and Hodges [3]. For µ∗ > µ > µ0 , the Sanov theorem implies that
Pn
ln P n1 i=1 Xi ≥ µ
−
→ D (P µ kP0 ) for n → ∞.
n
Bahadur and Rao [2] verified the following improvement of the Sanov theorem
( n
)
1X
exp (−nD (P µ kP0 ))
1
√
for n → ∞
P
Xi ≥ µ =
1
+
O
1/2
n i=1
n
(2πnV (µ)) β̂ (µ)
for non lattice random variables.
We will write D (µ) as short for D ( P µ k P0 ) .
2
(3)
Theorem 1. For µ∗ > µ > µ0 , one has that
( n
)
1X
cµ 1/2
1
P
Xi ≥ µ = Φ −n1/2 2D µ −
1+O √
for n → ∞, (4)
n i=1
n
n
where
1/2
cµ =
ln V(2D(µ))
(µ)1/2 β̂(µ)
β̂ (µ)
.
(5)
Proof. The cµ defined by (5) satisfies the equation
2D(µ)
V (µ)
1/2
β̂ (µ) ecµ β̂(µ)
= 1.
(6)
The tail probabilities of the standard Gaussian satisfy
1
φ (z)
φ (z)
1 − 2 ≤ Φ(−z) ≤
z
z
z
for z > 0, (cf. Feller [5, p. 179]), which implies that
c exp −nD µ − nµ
1
cµ 1/2
1/2
1+O
,
2D µ −
= Φ −n
cµ 1/2
1/2
n
n
(2πn)
2D µ − n
and so
cµ n
1/2
(2πn) (2D(µ))1/2
exp −nD µ −
cµ 1/2
1
1+O
.
= Φ −n1/2 2D µ −
n
n
(7)
Because of (1) and (2), the derivative can be calculated as
d
D (µ) = β̂ (µ) ,
dµ
leading to the following Taylor expansion
cµ 1
cµ
D µ−
.
+O
= D (µ) − β̂ (µ) ·
n
n
n2
Thus,
c exp −nD µ − nµ
(2πn)1/2 (2D(µ))1/2
=
=
=
exp −n D (µ) − β̂ (µ) ·
cµ
n
+O
(2πn)1/2 (2D(µ))1/2
exp −nD(µ) + β̂(µ)cµ + O
1
n
1
n2
(2πn)1/2 (2D(µ))1/2
exp (−nD(µ)) ecµ β̂(µ)
1
1
+
O
1/2
n
(2πn) (2D(µ))1/2
3
(8)
According to (3) we also have
( n
)
1X
exp (−nD (µ))
1
P
for n → ∞,
Xi ≥ µ =
1+O √
1/2
n i=1
n
(2πnV (µ)) β̂ (µ)
therefore applying (6), (7), (8) and (9) the proof of Theorem 1 is complete.
(9)
Remark 1. If in the approximation cµ is replaced by any other constant c then the ratio
of the two approximations tends to a number, which is not equal to 1:
c exp −nD µ − nµ
c c = exp −nD µ − µ + nD µ −
c
n
n
exp −nD µ − n
1
= exp β̂ (µ) · (cµ − c) + O
n
≈ exp β̂ (µ) · (cµ − c)
6=
1.
Remark 2. If X1 has a density with respect to the Lebesgue measure then Bahadur and
Rao [2] proved the stronger result that
)
( n
1
exp (−nD (P µ kP0 ))
1X
1+O
Xi ≥ µ =
.
P
1/2
n i=1
n
(2πnV (µ)) β̂ (µ)
Using this result we get the following theorem: If X1 has a density with respect to the
Lebesgue measure then
( n
)
1X
cµ 1/2
1
1/2
P
2D µ −
Xi ≥ µ = Φ −n
1+O
for n → ∞,
n i=1
n
n
for any µ∗ > µ > µ0 .
3. Results for lattice valued variables
Now assume that X1 , X2 , . . . is a sequence of i.i.d. random variables with values in
a lattice of the type {kd + δ | k ∈ Z} . For such a sequence Bahadur and Rao [2] proved
that
)
( n
exp (−nD ( P µ k P0 ))
1
1X
Xi ≥ µ =
(10)
1+O
P
n i=1
n
1/2 1−exp(−dβ̂(µ))
(2πnV (µ))
d
Pn
for any n such that P n1 i=1 Xi = µ > 0. We note that the result (3) for non-lattice
variables can be considered as a limiting version of (10) for small d > 0 because
1 − exp (−dβ)
→ β for d → 0.
d
4
∗
Theorem 2. Assume that X1 has values
1 Pnin the lattice
{kd + δ | k ∈ Z} and that µ >
µ > µ0 . Then for any n such that P n i=1 Xi = µ > 0 one has
)
( n
cµ 1/2
1
1X
Xi ≥ µ = Φ −n1/2 2D µ −
1+O
for n → ∞,
P
n i=1
n
n
where
ln
cµ =
(2D(µ))1/2
V
1−exp(−dβ̂(µ))
(µ)1/2
d
β̂ (µ)
.
Proof. If X1 is lattice valued then the proof of Theorem 1 can be modified by replacing
1−exp(−dβ̂(µ))
at the appropriate places throughout the proof. There is no
β̂ (µ) by
d
modification in the use of a Taylor expansion.
We now turn to the special case, where X1 , . . . , Xn are i.i.d. Bernoulli random variables with
1 with probability p,
Xi =
0 with probability 1 − p.
Pn
In this case d = 1, and
i=1 Xi is a binomial (n, p) random variable. For various
refinements of (10), see Bahadur [1], Littlewood [8] and McKay [9].
Corollary 1. Put
µn := ⌈nµ⌉/n.
Then for 1 > µ > p one has that
)
( n
cµn 1/2
1
1X
1/2
2D µn −
Xi ≥ µ = Φ −n
1+O
for n → ∞,
P
n i=1
n
n
where
D(µ) = D(µkp) = µ ln
and
cµ =
1
+
2
ln
µ
1−µ
+ (1 − µ) ln
p
1−p
2D(µkp)
p (1 −
(µ−p)2
µ(1−p)
2 ln p(1−µ)
p)
.
Proof. Because of the definition of µn ,
( n
)
)
( n
1X
1X
P
Xi ≥ µ = P
X i ≥ µn ,
n i=1
n i=1
Pn
and the condition P n1 i=1 Xi = µn > 0 is satisfied, and so Theorem 2 implies that
( n
)
1X
cµn 1/2
1
1/2
P
2D µn −
Xi ≥ µ = Φ −n
1+O
for n → ∞.
n i=1
n
n
5
We have to evaluate cµ . The distribution Pβ has
Pβ (Xi = 1) =
peβ
1 − p + peβ
which is also the mean of Pβ . The equation
µ=
peβ
1 − p + peβ
is equivalent to
eβ =
implying that
µ (1 − p)
p (1 − µ)
p (1 − µ)
µ−p
1 − e−dβ
= 1 − e−β = 1 −
=
.
d
µ (1 − p)
µ (1 − p)
The variance function is
V (µ) = µ (1 − µ) .
Thus, we have
ln
cµ =
ln
=
=
1
+
2
2D(µkp)
V (µ)
1/2
β̂ (µ)
1/2
2D(µkp)
µ(1−µ)
ln
1
1−e−β̂(µ)
µ(1−p)
µ−p
µ(1−p)
ln p(1−µ)
2D(µkp)
p (1 −
(µ−p)2
µ(1−p)
2 ln p(1−µ)
p)
.
Remark 3. For p = 1/2, 0.5 < cµ < 0.534 and Table 1 shows some numerical values
for cµ ≈ 0.5 + (µ − 0.5)/12.
µ
cµ
0.6
0.508
0.65
0.512
0.7
0.516
0.75
0.520
0.8
0.524
0.85
0.528
0.9
0.532
Table 1: Numerical values
4. Discussion
As discussed by Reiczigel, Rejtő and Tusnády [10] and by Harremoës and Tusnády
[6] there are some strong indications that these asymptotic results can be strengthened
6
to sharp inequalities. Such sharp inequalities would imply the present asymptotic results
as corollaries. We hope that the asymptotics presented here can help in proving the
conjectured sharp inequalities. Related sharp inequalities have been discussed by Leon
and Perron [7] and Talagrand [11]. Numerical experiments have also shown that our tail
estimates are useful even for small values of n.
References
[1] Bahadur, R. R.: 1960, Some approximations to the binomial distribution function. Annals of Mathematical Statistics, 31:43–54.
[2] Bahadur, R. R. and Rao, R. R.: 1960, On deviation of the sample mean. Annals of Mathematical
Statistics, 31, 1015–1027.
[3] Blackwell, D. and Hodges, J. L.: 1959, The probability in the extreme tail of a convolution, Annals
of Mathematical Statistics, 30, 1113–1120.
[4] Cramér, H.: 1938, Sur un nouveau théoréme-limite de la théorie des probabilités, Actualités Scientifiques et Industrielles (Number 736, Hermann Cie, Paris).
[5] Feller, W.: 1957, An Introduction to Probability and its Applications. Vol. I, Wiley, New York.
[6] Harremoës, P. and Tusnády, G.: 2012, Information divergence is more χ2 - distributed than the χ2 statistic, 2012 IEEE International Symposium on Information Theory (ISIT 2012), Cambridge,
Massachusetts, USA. Accepted. URL: http://arxiv.org/abs/1202.1125
[7] Leon, C. A. and Perron, F.: 2003, Extremal properties of sums of Bernoulli random variables,
Statistics and Probability Letters, 62(4), 345–354.
[8] Littlewood, J. E.: 1969, On the probability in the tail of binomial distribution. Advanced Applied
Probability, 1:43–72.
[9] McKay, B. D.: 1989, On Littlewood’s estimate for the binomial distribution. Advanced Applied
Probability, 21:475–478.
[10] Reiczigel, J., Rejtő, L. and Tusnády, G.: 2011, A sharpning of Tusnády’s inequality. ArXiv
1110.3627v2.
[11] Talagrand, M.: 1995, The missing factor in Hoeffding’s inequality, Ann. Inst. Henri Poincare, 31,
689–702.
7