Some Refinements of Large Deviation Tail Probabilities

László Györfi (a), Peter Harremoës (b,*), Gábor Tusnády (c)

(a) Budapest University of Technology and Economics, Budapest, Hungary
(b) Copenhagen Business College, Copenhagen, Denmark
(c) Rényi Institute of Mathematics, Budapest, Hungary

arXiv:1205.1005v1 [math.ST] 4 May 2012

Abstract

We study tail probabilities via some Gaussian approximations. Our results make refinements to large deviation theory. The proof builds on classical results by Bahadur and Rao. Binomial distributions and their tail probabilities are discussed in more detail.

Keywords: Binomial distribution, Gaussian distribution, large deviations, tail probability.
2000 MSC: primary 60F10, secondary 60E15

1. Introduction

Let $X_1, \dots, X_n$ be i.i.d. random variables such that the moment generating function $E[\exp(\beta X_1)]$ is finite in a neighborhood of the origin. For fixed $\mu > E[X_1]$, the aim of this paper is to approximate the tail distribution:

$$P_{n,\mu} := P\left\{ \frac{1}{n} \sum_{i=1}^{n} X_i \ge \mu \right\}.$$

If $\mu$ is close to the mean of $X_1$ one would usually approximate $P_{n,\mu}$ by a tail probability of a Gaussian random variable. If $\mu$ is far from the mean of $X_1$ the tail probability can be estimated using large deviation theory. According to the Sanov theorem the probability that the deviation from the mean is as large as $\mu$ is of the order $\exp(-nD)$ where $D$ is a constant. Bahadur and Rao [2] improved the estimate of this large deviation probability, and the goal of this paper is to extend the Gaussian tail approximations into situations where one normally uses large deviation techniques.

Let $\phi$ and $\Phi$ be the density function and the distribution function of the standard Gaussian, respectively.

Let $P_0$ denote a probability measure describing the distribution of a random variable $X$.
Consider the 1-dimensional exponential family $(P_\beta)$ based on $P_0$ and given by

$$\frac{dP_\beta}{dP_0}(x) = \frac{\exp(\beta \cdot x)}{Z(\beta)},$$

where the denominator is the moment generating function (partition function) given by

$$Z(\beta) = \int \exp(\beta \cdot x) \, dP_0 x = E\left[e^{\beta X}\right].$$

(* Corresponding author. Preprint submitted to Statistics and Probability Letters, May 7, 2012.)

The mean value of $P_\beta$ is

$$\frac{Z'(\beta)}{Z(\beta)} \tag{1}$$

and the range of this function will be denoted $M$ and will be called the mean value range of the exponential family. For $\mu$ in the interior of $M$ the maximum likelihood estimate $\hat\beta(\mu)$ equals the $\beta$ such that the mean value of $P_\beta$ equals $\mu$, which in this case is the average of the i.i.d. samples. Put $P^\mu = P_{\hat\beta(\mu)}$. An equivalent definition of $\hat\beta(\mu)$ is as the solution of the equation

$$\frac{Z(\hat\beta(\mu))}{e^{\hat\beta(\mu)\mu}} = \frac{E\left[e^{\hat\beta(\mu)X}\right]}{e^{\hat\beta(\mu)\mu}} = \min_{\beta > 0} \frac{E\left[e^{\beta X}\right]}{e^{\beta\mu}} = \min_{\beta > 0} \frac{Z(\beta)}{e^{\beta\mu}}.$$

Let $V(\mu)$ denote the variance of $P^\mu$. Information divergence is given by

$$D(P^\mu \| P_0) = \int \ln\left(\frac{dP^\mu}{dP_0}(x)\right) dP^\mu x.$$

We see that

$$D(P^\mu \| P_0) = -\ln\frac{E\left[e^{\hat\beta(\mu)X}\right]}{e^{\hat\beta(\mu)\mu}} = \hat\beta(\mu)\,\mu - \ln Z(\hat\beta(\mu)). \tag{2}$$

2. Approximation of tail distributions for non-lattice valued variables

Introduce the notation

$$\mu^* := \sup\{\mu > \mu_0 \,;\, D(P^\mu \| P_0) < \infty\} = \sup M,$$

where $\mu_0 = E[X_1]$ is the mean of $P_0$. Bahadur and Rao [2] proved a refined version of the large deviation bound, but some aspects of their result date back to Cramér [4], and part of it was proved by a different method by Blackwell and Hodges [3]. For $\mu^* > \mu > \mu_0$, the Sanov theorem implies that

$$\frac{1}{n} \ln P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} \to -D(P^\mu \| P_0) \quad \text{for } n \to \infty.$$

Bahadur and Rao [2] verified the following improvement of the Sanov theorem

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \frac{\exp(-nD(P^\mu \| P_0))}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)} \left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right) \quad \text{for } n \to \infty \tag{3}$$

for non-lattice random variables. We will write $D(\mu)$ as short for $D(P^\mu \| P_0)$.

Theorem 1.
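For $\mu^* > \mu > \mu_0$, one has that

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \Phi\left(-n^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right) \quad \text{for } n \to \infty, \tag{4}$$

where

$$c_\mu = \frac{\ln\dfrac{(2D(\mu))^{1/2}}{V(\mu)^{1/2}\,\hat\beta(\mu)}}{\hat\beta(\mu)}. \tag{5}$$

Proof. The $c_\mu$ defined by (5) satisfies the equation

$$\frac{(2D(\mu))^{1/2}}{V(\mu)^{1/2}\,\hat\beta(\mu)\,e^{c_\mu\hat\beta(\mu)}} = 1. \tag{6}$$

The tail probabilities of the standard Gaussian satisfy

$$\frac{\phi(z)}{z}\left(1 - \frac{1}{z^2}\right) \le \Phi(-z) \le \frac{\phi(z)}{z} \quad \text{for } z > 0$$

(cf. Feller [5, p. 179]), which implies that

$$\Phi\left(-n^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}\right) = \frac{\exp\left(-nD\left(\mu - \frac{c_\mu}{n}\right)\right)}{(2\pi n)^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}}\left(1 + O\left(\frac{1}{n}\right)\right),$$

and so

$$\frac{\exp\left(-nD\left(\mu - \frac{c_\mu}{n}\right)\right)}{(2\pi n)^{1/2}(2D(\mu))^{1/2}} = \Phi\left(-n^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{n}\right)\right). \tag{7}$$

Because of (1) and (2), the derivative can be calculated as

$$\frac{d}{d\mu} D(\mu) = \hat\beta(\mu),$$

leading to the following Taylor expansion

$$D\left(\mu - \frac{c_\mu}{n}\right) = D(\mu) - \hat\beta(\mu) \cdot \frac{c_\mu}{n} + O\left(\frac{1}{n^2}\right).$$

Thus,

$$\begin{aligned}
\frac{\exp\left(-nD\left(\mu - \frac{c_\mu}{n}\right)\right)}{(2\pi n)^{1/2}(2D(\mu))^{1/2}}
&= \frac{\exp\left(-n\left(D(\mu) - \hat\beta(\mu) \cdot \frac{c_\mu}{n} + O\left(\frac{1}{n^2}\right)\right)\right)}{(2\pi n)^{1/2}(2D(\mu))^{1/2}} \\
&= \frac{\exp\left(-nD(\mu) + \hat\beta(\mu)\,c_\mu + O\left(\frac{1}{n}\right)\right)}{(2\pi n)^{1/2}(2D(\mu))^{1/2}} \\
&= \frac{\exp(-nD(\mu))\,e^{c_\mu\hat\beta(\mu)}}{(2\pi n)^{1/2}(2D(\mu))^{1/2}}\left(1 + O\left(\frac{1}{n}\right)\right).
\end{aligned} \tag{8}$$

According to (3) we also have

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \frac{\exp(-nD(\mu))}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)}\left(1 + O\left(\frac{1}{\sqrt{n}}\right)\right) \quad \text{for } n \to \infty, \tag{9}$$

therefore applying (6), (7), (8) and (9) the proof of Theorem 1 is complete.

Remark 1. If in the approximation $c_\mu$ is replaced by any other constant $c$ then the ratio of the two approximations tends to a number which is not equal to 1:

$$\begin{aligned}
\frac{\exp\left(-nD\left(\mu - \frac{c_\mu}{n}\right)\right)}{\exp\left(-nD\left(\mu - \frac{c}{n}\right)\right)}
&= \exp\left(-nD\left(\mu - \frac{c_\mu}{n}\right) + nD\left(\mu - \frac{c}{n}\right)\right) \\
&= \exp\left(\hat\beta(\mu) \cdot (c_\mu - c) + O\left(\frac{1}{n}\right)\right) \\
&\approx \exp\left(\hat\beta(\mu) \cdot (c_\mu - c)\right) \ne 1.
\end{aligned}$$

Remark 2. If $X_1$ has a density with respect to the Lebesgue measure then Bahadur and Rao [2] proved the stronger result that

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \frac{\exp(-nD(P^\mu \| P_0))}{(2\pi n V(\mu))^{1/2}\,\hat\beta(\mu)}\left(1 + O\left(\frac{1}{n}\right)\right).$$

Using this result we get the following theorem: If $X_1$ has a density with respect to the Lebesgue measure then

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \Phi\left(-n^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{n}\right)\right) \quad \text{for } n \to \infty,$$

for any $\mu^* > \mu > \mu_0$.

3. Results for lattice valued variables

Now assume that $X_1, X_2, \dots$ is a sequence of i.i.d.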
random variables with values in a lattice of the type $\{kd + \delta \mid k \in \mathbb{Z}\}$. For such a sequence Bahadur and Rao [2] proved that

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \frac{\exp(-nD(P^\mu \| P_0))}{(2\pi n V(\mu))^{1/2}\,\dfrac{1 - \exp(-d\hat\beta(\mu))}{d}}\left(1 + O\left(\frac{1}{n}\right)\right) \tag{10}$$

for any $n$ such that $P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right\} > 0$. We note that the result (3) for non-lattice variables can be considered as a limiting version of (10) for small $d > 0$ because

$$\frac{1 - \exp(-d\beta)}{d} \to \beta \quad \text{for } d \to 0.$$

Theorem 2. Assume that $X_1$ has values in the lattice $\{kd + \delta \mid k \in \mathbb{Z}\}$ and that $\mu^* > \mu > \mu_0$. Then for any $n$ such that $P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i = \mu\right\} > 0$ one has

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \Phi\left(-n^{1/2}\left(2D\left(\mu - \frac{c_\mu}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{n}\right)\right) \quad \text{for } n \to \infty,$$

where

$$c_\mu = \frac{\ln\dfrac{(2D(\mu))^{1/2}}{V(\mu)^{1/2}\,\dfrac{1 - \exp(-d\hat\beta(\mu))}{d}}}{\hat\beta(\mu)}.$$

Proof. If $X_1$ is lattice valued then the proof of Theorem 1 can be modified by replacing $\hat\beta(\mu)$ by $\frac{1 - \exp(-d\hat\beta(\mu))}{d}$ at the appropriate places throughout the proof. There is no modification in the use of a Taylor expansion.

We now turn to the special case where $X_1, \dots, X_n$ are i.i.d. Bernoulli random variables with

$$X_i = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases}$$

In this case $d = 1$, and $\sum_{i=1}^{n} X_i$ is a binomial $(n, p)$ random variable. For various refinements of (10), see Bahadur [1], Littlewood [8] and McKay [9].

Corollary 1. Put $\mu_n := \lceil n\mu \rceil / n$. Then for $1 > \mu > p$ one has that

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \Phi\left(-n^{1/2}\left(2D\left(\mu_n - \frac{c_{\mu_n}}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{n}\right)\right) \quad \text{for } n \to \infty,$$

where

$$D(\mu) = D(\mu \| p) = \mu \ln\frac{\mu}{p} + (1 - \mu)\ln\frac{1 - \mu}{1 - p}$$

and

$$c_\mu = \frac{1}{2} + \frac{\ln\dfrac{2D(\mu \| p)\,p(1 - p)}{(\mu - p)^2}}{2\ln\dfrac{\mu(1 - p)}{p(1 - \mu)}}.$$

Proof. Because of the definition of $\mu_n$,

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu_n\right\},$$

and the condition $P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i = \mu_n\right\} > 0$ is satisfied, and so Theorem 2 implies that

$$P\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \mu\right\} = \Phi\left(-n^{1/2}\left(2D\left(\mu_n - \frac{c_{\mu_n}}{n}\right)\right)^{1/2}\right)\left(1 + O\left(\frac{1}{n}\right)\right) \quad \text{for } n \to \infty.$$

We have to evaluate $c_\mu$. The distribution $P_\beta$ has

$$P_\beta(X_i = 1) = \frac{pe^{\beta}}{1 - p + pe^{\beta}},$$

which is also the mean of $P_\beta$. The equation

$$\mu = \frac{pe^{\beta}}{1 - p + pe^{\beta}}$$

is equivalent to

$$e^{\beta} = \frac{\mu(1 - p)}{p(1 - \mu)},$$

implying that

$$\frac{1 - e^{-d\beta}}{d} = 1 - e^{-\beta} = 1 - \frac{p(1 - \mu)}{\mu(1 - p)} = \frac{\mu - p}{\mu(1 - p)}.$$

The variance function is $V(\mu) = \mu(1 - \mu)$. Thus, we have

$$\begin{aligned}
c_\mu &= \frac{\ln\left(\left(\dfrac{2D(\mu \| p)}{V(\mu)}\right)^{1/2}\dfrac{1}{1 - e^{-\hat\beta(\mu)}}\right)}{\hat\beta(\mu)} \\
&= \frac{\ln\left(\left(\dfrac{2D(\mu \| p)}{\mu(1 - \mu)}\right)^{1/2}\dfrac{\mu(1 - p)}{\mu - p}\right)}{\ln\dfrac{\mu(1 - p)}{p(1 - \mu)}} \\
&= \frac{1}{2} + \frac{\ln\dfrac{2D(\mu \| p)\,p(1 - p)}{(\mu - p)^2}}{2\ln\dfrac{\mu(1 - p)}{p(1 - \mu)}}.
\end{aligned}$$

Remark 3. For $p = 1/2$ one has $0.5 < c_\mu < 0.534$, and Table 1 shows some numerical values, for which $c_\mu \approx 0.5 + (\mu - 0.5)/12$.

  µ       c_µ
  0.6     0.508
  0.65    0.512
  0.7     0.516
  0.75    0.520
  0.8     0.524
  0.85    0.528
  0.9     0.532

Table 1: Numerical values

4. Discussion

As discussed by Reiczigel, Rejtő and Tusnády [10] and by Harremoës and Tusnády [6] there are some strong indications that these asymptotic results can be strengthened to sharp inequalities. Such sharp inequalities would imply the present asymptotic results as corollaries. We hope that the asymptotics presented here can help in proving the conjectured sharp inequalities. Related sharp inequalities have been discussed by Leon and Perron [7] and Talagrand [11]. Numerical experiments have also shown that our tail estimates are useful even for small values of $n$.

References

[1] Bahadur, R. R.: 1960, Some approximations to the binomial distribution function, Annals of Mathematical Statistics, 31, 43–54.
[2] Bahadur, R. R. and Rao, R. R.: 1960, On deviation of the sample mean, Annals of Mathematical Statistics, 31, 1015–1027.
[3] Blackwell, D. and Hodges, J. L.: 1959, The probability in the extreme tail of a convolution, Annals of Mathematical Statistics, 30, 1113–1120.
[4] Cramér, H.: 1938, Sur un nouveau théorème-limite de la théorie des probabilités, Actualités Scientifiques et Industrielles (Number 736, Hermann Cie, Paris).
[5] Feller, W.: 1957, An Introduction to Probability Theory and its Applications, Vol. I, Wiley, New York.
[6] Harremoës, P. and Tusnády, G.: 2012, Information divergence is more $\chi^2$-distributed than the $\chi^2$ statistic, 2012 IEEE International Symposium on Information Theory (ISIT 2012), Cambridge, Massachusetts, USA. Accepted. URL: http://arxiv.org/abs/1202.1125
[7] Leon, C. A.
and Perron, F.: 2003, Extremal properties of sums of Bernoulli random variables, Statistics and Probability Letters, 62(4), 345–354.
[8] Littlewood, J. E.: 1969, On the probability in the tail of a binomial distribution, Advances in Applied Probability, 1, 43–72.
[9] McKay, B. D.: 1989, On Littlewood's estimate for the binomial distribution, Advances in Applied Probability, 21, 475–478.
[10] Reiczigel, J., Rejtő, L. and Tusnády, G.: 2011, A sharpening of Tusnády's inequality, arXiv:1110.3627v2.
[11] Talagrand, M.: 1995, The missing factor in Hoeffding's inequality, Ann. Inst. Henri Poincaré, 31, 689–702.
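
Numerical illustration (a sketch added for the reader, not part of the original analysis). The binomial approximation of Corollary 1 can be compared with the exact tail probability using only the Python standard library. All function names below (kl_bernoulli, c_mu, std_normal_cdf, refined_tail, exact_tail) are hypothetical helpers introduced here. As a simplifying assumption, the threshold is passed as an integer count k, so that $\mu_n = k/n$ exactly and no floating-point rounding of $\lceil n\mu \rceil$ is needed. For $p = 1/2$ the computed $c_\mu$ values should land within roughly $10^{-3}$ of the Table 1 entries.

```python
import math


def kl_bernoulli(mu: float, p: float) -> float:
    """Information divergence D(mu || p) between Bernoulli(mu) and Bernoulli(p)."""
    return mu * math.log(mu / p) + (1 - mu) * math.log((1 - mu) / (1 - p))


def c_mu(mu: float, p: float) -> float:
    """Correction constant c_mu from Corollary 1; requires p < mu < 1."""
    beta_hat = math.log(mu * (1 - p) / (p * (1 - mu)))
    ratio = 2 * kl_bernoulli(mu, p) * p * (1 - p) / (mu - p) ** 2
    return 0.5 + math.log(ratio) / (2 * beta_hat)


def std_normal_cdf(z: float) -> float:
    """Standard Gaussian distribution function Phi(z), via erfc."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def refined_tail(n: int, k: int, p: float) -> float:
    """Gaussian-type approximation of P{S_n >= k} for S_n ~ binomial(n, p).

    Uses mu_n = k / n (assumption: k is given as an integer count).
    """
    if k >= n or k / n <= p:
        raise ValueError("need p < k/n < 1")
    mu_n = k / n
    shifted = mu_n - c_mu(mu_n, p) / n
    if shifted <= p:
        raise ValueError("shifted mean must stay above p; n is too small")
    return std_normal_cdf(-math.sqrt(2 * n * kl_bernoulli(shifted, p)))


def exact_tail(n: int, k: int, p: float) -> float:
    """Exact P{S_n >= k} by direct summation of binomial probabilities."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    p = 0.5
    for n, k in [(10, 6), (50, 30), (100, 60)]:
        exact = exact_tail(n, k, p)
        approx = refined_tail(n, k, p)
        print(f"n={n:4d} k={k:3d} exact={exact:.6f} approx={approx:.6f}")
```

Direct summation in exact_tail is adequate for moderate $n$ only; for large $n$ a log-space computation would be a safer choice.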