Estimating the Cumulative Distribution Function and Statistical Functionals

John Hughes

September 8, 2014

Estimating the Cumulative Distribution Function

Suppose we have an iid sample $X_1, \dots, X_n \sim F$, where $F(x) = P(X \le x)$ is a distribution function defined on the real numbers. If $P$ is the relevant probability measure, we can also define $F$ in terms of $P$ as $F(x) = P\{(-\infty, x]\}$, where $P(A)$ denotes the probability of the set $A$.

We estimate $F$ using the empirical distribution function $\hat{F}_n$. This is the distribution function that puts mass $1/n$ at each data point. In other words, $\hat{F}_n$ is a step function such that the steps all have the same height ($1/n$) but generally do not have the same depth. Formally,
\[
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n I(X_i \le x),
\]
where $I(\cdot)$ is the indicator function:
\[
I(A) = \begin{cases} 0 & \text{if } A \text{ is false} \\ 1 & \text{if } A \text{ is true.} \end{cases}
\]

[Figure 1: The empirical cdf for a sample of size three from the standard normal distribution. The sampled values were $-0.67$, $-0.02$, and $0.90$. Since the sample size is three, the height of each step is $1/3$.]

Notice that $I(X_i \le x) = 1$ with probability $P(X_i \le x) = F(x)$, and $I(X_i \le x) = 0$ with probability $P(X_i > x) = 1 - F(x)$. This implies that $I(X_i \le x)$ is a Bernoulli random variable with "success" probability $F(x)$. Since the $X_i$ are independent, so are the $I(X_i \le x)$. Thus
\[
Y_n = \sum_{i=1}^n I(X_i \le x)
\]
is a sum of independent $\mathrm{Ber}\{F(x)\}$ random variables. This means that $Y_n$ is a $\mathrm{Bin}\{n, F(x)\}$ random variable, which implies that $\hat{F}_n(x) \sim \frac{1}{n}\,\mathrm{Bin}\{n, F(x)\}$. These facts provide some insight regarding the first part of the following theorem.

Theorem 1.

1. For any $x$,
\[
E\hat{F}_n(x) = F(x), \qquad V\hat{F}_n(x) = \frac{F(x)\{1 - F(x)\}}{n}.
\]

2. (Glivenko-Cantelli theorem)
\[
\sup_x |\hat{F}_n(x) - F(x)| \overset{a.s.}{\longrightarrow} 0.
\]

3. (Dvoretzky-Kiefer-Wolfowitz inequality) For any $\varepsilon > 0$,
\[
P\left(\sup_x |\hat{F}_n(x) - F(x)| > \varepsilon\right) \le 2e^{-2n\varepsilon^2}.
\]

Part 1 of the theorem implies that $\hat{F}_n(x) \overset{P}{\longrightarrow} F(x)$. This follows immediately from the weak law of large numbers, or it can easily be proved directly using Chebyshev's inequality:
\[
P(|\hat{F}_n(x) - F(x)| \ge \varepsilon) \le \frac{F(x)\{1 - F(x)\}}{n\varepsilon^2} \to 0.
\]
And Part 3 of the theorem can be used to construct a nonparametric $(1 - \alpha)100\%$ confidence band for $F(x)$.

Theorem 2. Let
\[
L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}, \qquad H(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\},
\]
where $\varepsilon_n = \{(2n)^{-1} \log(2\alpha^{-1})\}^{1/2}$. Then, for all $F$ and $n$,
\[
P\{L(x) \le F(x) \le H(x) \text{ for all } x\} \ge 1 - \alpha.
\]

The homework assignment will ask you to compute a bootstrap confidence band for a real dataset and compare the result with the band provided by Theorem 2.

The Kolmogorov-Smirnov and Cramér-von Mises Tests

The K-S and C-vM tests are nonparametric tests of the null hypothesis that a sample came from a certain distribution, or that two samples came from the same distribution. More specifically, in the one-sample case, the K-S and C-vM tests allow us to test
\[
H_0: F = F_0 \quad \text{versus} \quad H_1: F \ne F_0
\]
for a fixed $F_0$, assuming that $X_1, \dots, X_n$ is an iid sample from a continuous cdf $F$. The K-S statistic is
\[
D_n(F_0) = \sup_x |\hat{F}_n(x) - F_0(x)|.
\]
From Part 2 of Theorem 1 we know that $D_n(F_0) \overset{a.s.}{\longrightarrow} 0$ iff $H_0$ is true. Thus a K-S test rejects $H_0$ when $D_n(F_0) > c$ for some constant $c$ that of course depends on the null distribution of the statistic.
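The ECDF and the band of Theorem 2 are straightforward to compute. The following Python sketch is not part of the original notes; it assumes only NumPy, and the function names (`ecdf`, `dkw_band`) and the simulated normal sample are illustrative.

```python
import numpy as np

def ecdf(data):
    """Return the empirical cdf: x -> (# observations <= x) / n."""
    x_sorted = np.sort(data)
    n = len(x_sorted)
    return lambda x: np.searchsorted(x_sorted, x, side="right") / n

def dkw_band(data, alpha=0.05):
    """(1 - alpha)100% DKW confidence band of Theorem 2: returns (L, H)."""
    n = len(data)
    eps_n = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))  # {(2n)^{-1} log(2/alpha)}^{1/2}
    F_hat = ecdf(data)
    L = lambda x: np.maximum(F_hat(x) - eps_n, 0.0)
    H = lambda x: np.minimum(F_hat(x) + eps_n, 1.0)
    return L, H

# Illustrative use: a 95% band for a standard normal sample.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
L, H = dkw_band(x)
grid = np.linspace(-3.0, 3.0, 7)
print(np.column_stack([grid, L(grid), H(grid)]))
```

Because the band is uniform in $x$, the same $\varepsilon_n$ is added and subtracted everywhere, with the results clipped to $[0, 1]$.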
We may want to test a more specific hypothesis, such as
\[
H_0: F = F_0 \quad \text{versus} \quad H_1: F \le F_0,\ F \ne F_0,
\]
or
\[
H_0: F = F_0 \quad \text{versus} \quad H_1: F \ge F_0,\ F \ne F_0.
\]
The corresponding K-S statistics are
\[
D_n^-(F_0) = \sup_x \{F_0(x) - \hat{F}_n(x)\}
\]
and
\[
D_n^+(F_0) = \sup_x \{\hat{F}_n(x) - F_0(x)\}.
\]
Let $X_{(1)}, \dots, X_{(n)}$ be the order statistics, and define $X_{(0)} = -\infty$ and $X_{(n+1)} = \infty$. Since $\hat{F}_n(x) = i/n$ when $X_{(i)} \le x < X_{(i+1)}$ $(i = 0, 1, \dots, n)$, we can write
\begin{align*}
D_n^+(F_0) &= \max_{0 \le i \le n} \sup_{X_{(i)} \le x < X_{(i+1)}} \{i/n - F_0(x)\} \\
&= \max_{0 \le i \le n} \left\{ i/n - \inf_{X_{(i)} \le x < X_{(i+1)}} F_0(x) \right\} \\
&= \max_{0 \le i \le n} \{i/n - F_0(X_{(i)})\}.
\end{align*}
And $D_n^-(F_0)$ can be written in an analogous fashion. When $F_0$ is continuous, $F_0(X_{(i)})$ is the $i$th order statistic of a sample of size $n$ from the standard uniform distribution, and so the distribution of $D_n^+(F_0)$ does not depend on $F_0$. Due to symmetry, $D_n^-(F_0)$ has the same distribution as $D_n^+(F_0)$. Since $D_n(F_0) = \max\{D_n^-(F_0), D_n^+(F_0)\}$, the distribution of $D_n(F_0)$ does not depend on $F_0$, which means the distributions of K-S statistics are known under $H_0$. Note that this is not true if some or all parameters of $F_0$ have been estimated from the data. In this case a Lilliefors-type test can be used [Lilliefors, 1967].

The two-sample K-S statistic is
\[
D_{n_1, n_2} = \sup_x |\hat{F}_{1,n_1}(x) - \hat{F}_{2,n_2}(x)|,
\]
where $n_1, n_2$ are the sample sizes and $\hat{F}_{1,n_1}, \hat{F}_{2,n_2}$ are the empirical cdfs. A test of limiting size $\alpha$ can be obtained using
\[
\lim_{n_1, n_2 \to \infty} P[\{n_1 n_2 / (n_1 + n_2)\}^{1/2} D_{n_1, n_2} \le t] = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2 j^2 t^2}, \qquad t > 0.
\]

The one-sample C-vM statistic is
\[
C_n(F_0) = \int \{\hat{F}_n(x) - F_0(x)\}^2 \, dF_0(x).
\]
The distribution of $C_n(F_0)$ also does not depend on $F_0$. (Recall the probability integral transform: if $X \sim F$, where $F$ is continuous, then $F(X)$ is a standard uniform random variable.)
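The order-statistic formula above gives a direct way to compute $D_n(F_0)$, and $n\,C_n(F_0)$ has a standard closed-form computing formula. Below is a minimal Python sketch, not from the notes, assuming NumPy and SciPy (at least 1.6 for `cramervonmises`); the helper names and the simulated sample are illustrative. Note that SciPy reports the scaled C-vM statistic $n\,C_n$, often denoted $\omega^2$.

```python
import numpy as np
from scipy import stats  # SciPy >= 1.6 for cramervonmises

def ks_statistic(x, F0):
    """One-sample K-S statistic via the order-statistic formulas:
    D_n^+ = max_i {i/n - F0(X_(i))}, D_n^- = max_i {F0(X_(i)) - (i-1)/n}."""
    u = np.sort(F0(np.asarray(x)))  # F0(X_(1)), ..., F0(X_(n))
    n = len(u)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - u), np.max(u - (i - 1) / n))

def cvm_statistic(x, F0):
    """C_n(F0) = integral of (F_hat_n - F0)^2 dF0, via the standard formula
    n C_n = 1/(12n) + sum_i {(2i-1)/(2n) - F0(X_(i))}^2."""
    u = np.sort(F0(np.asarray(x)))
    n = len(u)
    i = np.arange(1, n + 1)
    return (1.0 / (12.0 * n) + np.sum(((2 * i - 1) / (2.0 * n) - u) ** 2)) / n

# Illustrative check against SciPy on a standard normal sample.
rng = np.random.default_rng(1)
x = rng.standard_normal(200)
print(ks_statistic(x, stats.norm.cdf), stats.kstest(x, "norm").statistic)
print(len(x) * cvm_statistic(x, stats.norm.cdf),
      stats.cramervonmises(x, "norm").statistic)  # SciPy reports n * C_n
```

Both statistics depend on the data only through the transformed values $F_0(X_{(i)})$, which is exactly why their null distributions are free of $F_0$.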
An interesting paper regarding the K-S and C-vM tests is Fan [1996]. Fan takes a deeper look at these statistics by first formulating an equivalent version of $H_0: F = F_0$ versus $H_1: F \ne F_0$. If the null hypothesis is true, i.e., if the unknown cdf $F$ is in fact equal to $F_0$, the sample $U_1 = F_0(X_1), \dots, U_n = F_0(X_n)$ is an independent sample of size $n$ from the standard uniform distribution. Thus the problem of testing $H_0: F = F_0$ versus $H_1: F \ne F_0$ for the original sample is equivalent to testing
\[
H_0: F = \text{the uniform cdf} \quad \text{versus} \quad H_1: F \ne \text{the uniform cdf} \tag{1}
\]
for the transformed sample $U_1, \dots, U_n$. Fan's next step is to cleverly reformulate (1). Consider the Fourier coefficients
\begin{align*}
\theta_{2j-1} &= \int_0^1 \cos(2\pi j x)\, f(x)\,dx \\
\theta_{2j} &= \int_0^1 \sin(2\pi j x)\, f(x)\,dx \qquad (j = 1, 2, \dots),
\end{align*}
where $f$ is the density for $F$. If the null hypothesis is true, i.e., if $F$ is the standard uniform cdf, the density $f$ is of course equal to $I(x \in [0, 1])$. This implies that the $\theta_{2j-1}$ and $\theta_{2j}$ are all equal to 0. Thus (1) is equivalent to
\[
H_0: \theta_k = 0 \ (k = 1, 2, \dots) \quad \text{versus} \quad H_1: \text{at least one } \theta_k \ne 0. \tag{2}
\]
Fan then rewrites the C-vM statistic as
\[
C_n = (2\pi^2)^{-1} \sum_{j=1}^{\infty} j^{-2} \left( \hat{\theta}_{2j-1}^2 + \hat{\theta}_{2j}^2 \right),
\]
where $\hat{\theta}_{2j-1}$ and $\hat{\theta}_{2j}$ are estimates of $\theta_{2j-1}$ and $\theta_{2j}$. This form of $C_n$ is revealing, for it shows that $C_n$ downweights high-frequency components. More specifically, suppose that $F$ has high energy at a large frequency, i.e., large $\theta_{2j-1}$ and/or $\theta_{2j}$ for a large value of $j$. Then $\hat{\theta}_{2j-1}$ and/or $\hat{\theta}_{2j}$ will be large, but their contribution might easily be overwhelmed when divided by $j^2$. This means that the C-vM test lacks power when the true cdf $F$ deviates from the uniform cdf in certain ways. This is true not just for the C-vM test but for any distribution-based test statistic. Fan goes on to formulate more powerful tests that he calls smoothing-based tests.

Estimating Statistical Functionals

Let $F$ be a cdf. Define $\int g(x)\,dF(x)$ to be $\int g(x) f(x)\,dx$ if $F$ is continuous, and $\sum_i g(x_i) f(x_i)$ if $F$ is discrete, where $f$ is the density or mass function, respectively, for $F$. This is handy because it allows us to use one notation to handle both situations. If you have studied measure theory, you know that the two situations are in fact equivalent in the sense that $\sum_i g(x_i) f(x_i)$ can be written as an integral with respect to counting measure.

A statistical functional is a function $T(F)$ of the cdf $F$. Common examples are the mean $\mu = \int x\,dF(x)$ and the variance $\sigma^2 = \int (x - \mu)^2\,dF(x)$. These are linear functionals, i.e., functionals of the form $\int g(x)\,dF(x)$. Another example is $F^{-1}(p)$, the $p$th quantile, which is a nonlinear functional.

The so-called plug-in estimator of a functional $\theta = T(F)$ is $\hat{\theta}_n = T(\hat{F}_n)$. In the case of a linear functional, since $\hat{F}_n$ is discrete, we can write $\hat{\theta}_n$ as
\[
T(\hat{F}_n) = \int g(x)\,d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n g(X_i).
\]
We can often apply this definition recursively to find plug-in estimators of functionals that contain unknown parameters. For example, consider the skewness
\[
\gamma_1 = \frac{E(X - \mu)^3}{\sigma^3} = \frac{\int (x - \mu)^3\,dF(x)}{\left\{ \int (x - \mu)^2\,dF(x) \right\}^{3/2}}.
\]
We do not know $\mu$, but we know that its plug-in estimator is $\hat{\mu} = \bar{X}_n$. Thus
\[
\hat{\gamma}_1 = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\left\{ \frac{1}{n}\sum_i (X_i - \hat{\mu})^2 \right\}^{3/2}} = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\{\hat{\sigma}^2\}^{3/2}} = \frac{\frac{1}{n}\sum_i (X_i - \hat{\mu})^3}{\hat{\sigma}^3}.
\]
As for the nonlinear functional $F^{-1}(p)$ mentioned above, the estimator is $\hat{F}_n^{-1}(p)$. Since $\hat{F}_n$ does not have a true inverse, we define
\[
\hat{F}_n^{-1}(p) = \inf\{x : \hat{F}_n(x) \ge p\}.
\]
We use this definition not just for $\hat{F}_n$ but for all cdfs that have one or more "jumps."

Influence Functions

The influence function is useful for approximating the standard error of a plug-in estimator. The influence function is defined in terms of the Gâteaux derivative of $T$ at $F$ in the direction $G$:
\[
L_F(G) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon)F + \varepsilon G\} - T(F)}{\varepsilon}.
\]
If $G = \delta_x$ is a point mass (also called a Dirac mass) at $x$, we write $L_F(x) \equiv L_F(\delta_x)$ and call $L_F(x)$ the influence function. This implies that
\[
L_F(x) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon)F + \varepsilon \delta_x\} - T(F)}{\varepsilon}.
\]
Informally, the influence function describes the effect of an infinitesimal contamination at $x$ on the functional of interest. The empirical influence function is the plug-in estimator of $L_F$:
\[
\hat{L}(x) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon)\hat{F}_n + \varepsilon \delta_x\} - T(\hat{F}_n)}{\varepsilon}.
\]

Theorem 3. Let $T(F) = \int g(x)\,dF(x)$ be a linear functional. Then:

1. $L_F(x) = g(x) - T(F)$ and $\hat{L}(x) = g(x) - T(\hat{F}_n)$.

2. For any $G$,
\[
T(G) = T(F) + \int L_F(x)\,dG(x).
\]

3. $\int L_F(x)\,dF(x) = 0$.

4. Let $\tau^2 = \int L_F^2(x)\,dF(x)$. Then $\tau^2 = \int \{g(x) - T(F)\}^2\,dF(x)$, and if $\tau^2 < \infty$,
\[
\sqrt{n}\{T(F) - T(\hat{F}_n)\} \Rightarrow N(0, \tau^2).
\]

5. Let
\[
\hat{\tau}^2 = \frac{1}{n} \sum_i \hat{L}^2(X_i) = \frac{1}{n} \sum_i \{g(X_i) - T(\hat{F}_n)\}^2.
\]
Then $\hat{\tau}^2 \overset{P}{\longrightarrow} \tau^2$ and $\widehat{\mathrm{se}}/\mathrm{se} \overset{P}{\longrightarrow} 1$, where $\widehat{\mathrm{se}} = \hat{\tau}/\sqrt{n}$ and $\mathrm{se} = \sqrt{V T(\hat{F}_n)}$.

6.
\[
\frac{\sqrt{n}\{T(F) - T(\hat{F}_n)\}}{\hat{\tau}} \Rightarrow N(0, 1).
\]
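To make the plug-in and influence-function machinery concrete, here is a minimal Python sketch, not part of the original notes and assuming only NumPy; the helper names and the exponential sample are illustrative. It computes the plug-in skewness $\hat{\gamma}_1$ and, for a linear functional $T(F) = \int g(x)\,dF(x)$ with $g(x) = x$, the standard error $\widehat{\mathrm{se}} = \hat{\tau}/\sqrt{n}$ from Part 5 of Theorem 3.

```python
import numpy as np

def plug_in_skewness(x):
    """Plug-in skewness: (1/n) sum (Xi - mu_hat)^3 / sigma_hat^3,
    where mu_hat = X_bar and sigma_hat^2 uses the 1/n convention."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))
    return np.mean((x - mu_hat) ** 3) / sigma_hat ** 3

def linear_functional_se(x, g):
    """For T(F) = integral of g(x) dF(x): plug-in estimate T(F_hat_n),
    empirical influence values L_hat(Xi) = g(Xi) - T(F_hat_n),
    and se_hat = tau_hat / sqrt(n)."""
    gx = g(np.asarray(x, dtype=float))
    t_hat = gx.mean()       # T(F_hat_n) = (1/n) sum g(Xi)
    L_hat = gx - t_hat      # Theorem 3, Part 1
    tau_hat = np.sqrt(np.mean(L_hat ** 2))
    return t_hat, tau_hat / np.sqrt(len(gx))

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=500)
print(plug_in_skewness(x))      # population skewness of Exp(1) is 2
t_hat, se_hat = linear_functional_se(x, lambda v: v)  # T(F) = the mean
print(t_hat, se_hat)
```

For the mean, $\hat{L}(X_i) = X_i - \bar{X}_n$, so $\widehat{\mathrm{se}}$ reduces to the familiar sample standard deviation (with the $1/n$ convention) divided by $\sqrt{n}$.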
This theorem shows that the influence function behaves like the score function in parametric settings. More specifically, if
\[
s_\theta(x) = \dot{\ell}(\theta \mid x) = \frac{d}{d\theta} \log f(x \mid \theta)
\]
is the score function for the parametric model $f(x \mid \theta)$, we have (assuming certain regularity conditions hold)
\[
E s_\theta(X) = 0, \qquad V s_\theta(X) = E s_\theta^2(X).
\]
In words, the variance of the score is the second raw moment of the score because the first raw moment is equal to 0. The same is true for the influence function of a linear functional, and so we have
\[
E L_F(X) = 0, \qquad V T(\hat{F}_n) \approx \frac{1}{n} V L_F(X) = \frac{1}{n} E L_F^2(X).
\]
If the functional is not linear, Part 2 of Theorem 3 will not hold, but it may be approximately true.

Theorem 4. If $T$ is Hadamard differentiable with respect to $d(F, G) = \sup_x |F(x) - G(x)|$, then
\[
\sqrt{n}\{T(\hat{F}_n) - T(F)\} \Rightarrow N(0, \tau^2),
\]
where $\tau^2 = E L_F^2(X)$, and
\[
\frac{T(\hat{F}_n) - T(F)}{\widehat{\mathrm{se}}} \Rightarrow N(0, 1),
\]
where $\widehat{\mathrm{se}} = \hat{\tau}/\sqrt{n}$ and
\[
\hat{\tau}^2 = \frac{1}{n} \sum_i \hat{L}^2(X_i).
\]

The approximation $\{T(\hat{F}_n) - T(F)\}/\widehat{\mathrm{se}} \mathrel{\dot\sim} N(0, 1)$ is called the nonparametric delta method.

The case $T(F) = F^{-1}(p)$, where $F$ has positive density $f$, is an instructive example. Letting $\theta = F^{-1}(p)$, the influence function is
\[
L(x) = \begin{cases} \dfrac{p - 1}{f(\theta)}, & x \le \theta \\[1ex] \dfrac{p}{f(\theta)}, & x > \theta. \end{cases}
\]
Thus the asymptotic variance of $T(\hat{F}_n)$ is
\[
\tau^2 = \int L^2(x)\,dF(x) = \frac{p(1 - p)}{f^2(\theta)},
\]
which implies that we must estimate the density $f$ in order to estimate $\tau^2$. For the median ($p = 1/2$), for example, $\tau^2 = 1/\{4 f^2(\theta)\}$.

References

Jianqing Fan. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association, 91(434):674-688, 1996.

Hubert Lilliefors. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399-402, 1967.