Estimating the Cumulative Distribution Function
and Statistical Functionals
John Hughes
September 8, 2014
Estimating the Cumulative Distribution Function
Suppose we have an iid sample X1, . . . , Xn ∼ F, where F(x) = P(X ≤ x) is a distribution function defined on the real numbers. If P is the relevant probability measure, we can also define F in terms of P as F(x) = P{(−∞, x]}, where P(A) denotes the probability of the set A.

We estimate F using the empirical distribution function F̂n. This is the distribution function that puts mass 1/n at each data point. In other words, F̂n is a step function such that the steps all have the same height (1/n) but generally do not have the same depth. Formally,

\[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \le x), \]

where I(·) is the indicator function:

\[ I(A) = \begin{cases} 0 & \text{if } A \text{ is false} \\ 1 & \text{if } A \text{ is true.} \end{cases} \]

[Figure 1: The empirical cdf F̂n (vertical axis) plotted against x (horizontal axis) for a sample of size three from the standard normal distribution. The sampled values were −0.67, −0.02, and 0.90. Since the sample size is three, the height of each step is 1/3.]

Notice that I(Xi ≤ x) = 1 with probability P(Xi ≤ x) = F(x), and I(Xi ≤ x) = 0 with probability P(Xi > x) = 1 − F(x). This implies that I(Xi ≤ x) is a Bernoulli random variable with "success" probability F(x). Since the Xi are independent, so are the I(Xi ≤ x). Thus

\[ Y_n = \sum_{i=1}^{n} I(X_i \le x) \]

is a sum of independent Ber{F(x)} random variables. This means that Yn is a Bin{n, F(x)} random variable, which implies that

\[ \hat{F}_n(x) \sim \frac{1}{n}\,\mathrm{Bin}\{n, F(x)\}. \]

These facts provide some insight regarding the first part of the following theorem.

Theorem 1

1. For any x,

\[ \mathrm{E}\,\hat{F}_n(x) = F(x), \qquad \mathrm{V}\,\hat{F}_n(x) = \frac{F(x)\{1 - F(x)\}}{n}. \]
2. (Glivenko-Cantelli theorem)

\[ \sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s.} 0. \]

3. (Dvoretzky-Kiefer-Wolfowitz inequality) For any ε > 0,

\[ P\Big( \sup_x |\hat{F}_n(x) - F(x)| > \varepsilon \Big) \le 2 e^{-2 n \varepsilon^2}. \]
Part 1 of the theorem implies that F̂n(x) converges in probability to F(x). This follows immediately from the weak law of large numbers, or it can easily be proved directly using Chebyshev's inequality:

\[ P(|\hat{F}_n(x) - F(x)| \ge \varepsilon) \le \frac{F(x)\{1 - F(x)\}}{n \varepsilon^2} \to 0. \]

And Part 3 of the theorem can be used to construct a nonparametric (1 − α)100% confidence band for F(x).
Theorem 2 Let

\[ L(x) = \max\{\hat{F}_n(x) - \varepsilon_n, 0\}, \qquad H(x) = \min\{\hat{F}_n(x) + \varepsilon_n, 1\}, \]

where

\[ \varepsilon_n = \left\{ \frac{1}{2n} \log \frac{2}{\alpha} \right\}^{1/2}. \]

Then, for all F and n,

\[ P\{ L(x) \le F(x) \le H(x) \text{ for all } x \} \ge 1 - \alpha. \]
The homework assignment will ask you to compute a bootstrap
confidence band for a real dataset and compare the result with the
band provided by Theorem 2.
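To make Theorem 2 concrete, here is a minimal Python sketch (numpy only, assumed available; the helper names ecdf and dkw_band are illustrative, not from any particular library) that evaluates the empirical cdf and the band L(x), H(x) at the order statistics:

    import numpy as np

    def ecdf(sample):
        """Return a function x -> F_hat_n(x) for the given sample."""
        data = np.sort(np.asarray(sample, dtype=float))
        n = data.size
        return lambda x: np.searchsorted(data, x, side="right") / n

    def dkw_band(sample, alpha=0.05):
        """Evaluate the band of Theorem 2 at the order statistics."""
        data = np.sort(np.asarray(sample, dtype=float))
        n = data.size
        eps_n = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))   # {(2n)^{-1} log(2/alpha)}^{1/2}
        F_hat = np.arange(1, n + 1) / n                    # F_hat_n at each order statistic
        lower = np.maximum(F_hat - eps_n, 0.0)
        upper = np.minimum(F_hat + eps_n, 1.0)
        return data, lower, upper

    # The sample from Figure 1
    sample = [-0.67, -0.02, 0.90]
    print(ecdf(sample)(0.0))             # 2/3, since two of the three values are <= 0
    print(dkw_band(sample, alpha=0.05))  # 95% band evaluated at the data points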
The Kolmogorov-Smirnov and Cramér-von Mises Tests
The K-S and C-vM tests are nonparametric tests of the null hypothesis that a sample came from a certain distribution, or that two samples came from the same distribution. More specifically, in the one-sample case, the K-S and C-vM tests allow us to test the null hypothesis H0 : F = F0 versus H1 : F ≠ F0 for a fixed F0, assuming that X1, . . . , Xn is an iid sample from a continuous cdf F.
The K-S statistic is

\[ D_n(F_0) = \sup_x |\hat{F}_n(x) - F_0(x)|. \]
From Part 2 of Theorem 1 we know that Dn(F0) → 0 almost surely iff H0 is true. Thus a K-S test rejects H0 when Dn(F0) > c for some constant c that of course depends on the null distribution of the statistic.
We may want to test a more specific hypothesis, such as

H0 : F = F0 versus H1 : F ≤ F0, F ≠ F0,

or

H0 : F = F0 versus H1 : F ≥ F0, F ≠ F0.
The corresponding K-S statistics are

\[ D_n^-(F_0) = \sup_x \{ F_0(x) - \hat{F}_n(x) \} \]

and

\[ D_n^+(F_0) = \sup_x \{ \hat{F}_n(x) - F_0(x) \}. \]
Let X(1), . . . , X(n) be the order statistics, and define X(0) = −∞ and X(n+1) = ∞. Since F̂n(x) = i/n when X(i) ≤ x < X(i+1) (i = 0, 1, . . . , n), we can write

\[
\begin{aligned}
D_n^+(F_0) &= \max_{0 \le i \le n} \; \sup_{X_{(i)} \le x < X_{(i+1)}} \{ i/n - F_0(x) \} \\
           &= \max_{0 \le i \le n} \Big\{ i/n - \inf_{X_{(i)} \le x < X_{(i+1)}} F_0(x) \Big\} \\
           &= \max_{0 \le i \le n} \{ i/n - F_0(X_{(i)}) \}.
\end{aligned}
\]
And Dn−(F0) can be written in an analogous fashion.

When F0 is continuous, F0(X(i)) is the ith order statistic of a sample of size n from the standard uniform distribution, and so the distribution of Dn+(F0) does not depend on F0. Due to symmetry, Dn−(F0) has the same distribution as Dn+(F0). Since

\[ D_n(F_0) = \max\{ D_n^-(F_0), D_n^+(F_0) \}, \]

the distribution of Dn(F0) does not depend on F0, which means the distributions of K-S statistics are known under H0. Note that this is not true if some or all parameters of F0 have been estimated from data. In this case a Lilliefors-type test can be used [Lilliefors, 1967].
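The derivation above also gives a convenient way to compute the one-sample statistics. Here is a small numpy sketch (the function name ks_one_sample is illustrative; for a fixed, fully specified F0 it computes the same statistic as standard K-S routines):

    import numpy as np

    def ks_one_sample(sample, F0):
        """Compute D_n^+, D_n^-, and D_n for a continuous null cdf F0."""
        x = np.sort(np.asarray(sample, dtype=float))
        n = x.size
        F0x = F0(x)                              # F0 evaluated at the order statistics
        i = np.arange(1, n + 1)
        d_plus = max(0.0, np.max(i / n - F0x))   # max over i of i/n - F0(X_(i)); the i = 0 term is 0
        d_minus = np.max(F0x - (i - 1) / n)      # analogous expression for D_n^-
        return d_plus, d_minus, max(d_plus, d_minus)

    # Example: H0 says the data are standard uniform
    u = np.array([0.12, 0.35, 0.58, 0.77, 0.93])
    print(ks_one_sample(u, lambda t: np.clip(t, 0.0, 1.0)))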
The two-sample K-S statistic is

\[ D_{n_1, n_2} = \sup_x |\hat{F}_{1 n_1}(x) - \hat{F}_{2 n_2}(x)|, \]

where n1, n2 are the sample sizes and F̂1n1, F̂2n2 are the empirical cdfs. A test of limiting size α can be obtained using

\[ \lim_{n_1, n_2 \to \infty} P\big[ \{ n_1 n_2 / (n_1 + n_2) \}^{1/2} D_{n_1, n_2} \le t \big] = \sum_{j=-\infty}^{\infty} (-1)^j e^{-2 j^2 t^2}, \qquad t > 0. \]

The one-sample C-vM statistic is

\[ C_n(F_0) = \int \{ \hat{F}_n(x) - F_0(x) \}^2 \, dF_0(x). \]

The distribution of Cn(F0) also does not depend on F0.
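For the two-sample test, the statistic and the limiting distribution above can be computed directly. The following numpy sketch is illustrative (scipy.stats.ks_2samp, if available, returns the same statistic); the infinite series is truncated at a finite number of terms:

    import numpy as np

    def ks_two_sample(x1, x2):
        """D_{n1,n2} = sup_x |F1_hat(x) - F2_hat(x)|; the sup is attained at a data point."""
        x1 = np.sort(np.asarray(x1, dtype=float))
        x2 = np.sort(np.asarray(x2, dtype=float))
        grid = np.concatenate([x1, x2])
        F1 = np.searchsorted(x1, grid, side="right") / x1.size
        F2 = np.searchsorted(x2, grid, side="right") / x2.size
        return np.max(np.abs(F1 - F2))

    def ks_limit_cdf(t, terms=100):
        """Truncated version of sum_{j=-inf}^{inf} (-1)^j exp(-2 j^2 t^2)."""
        j = np.arange(-terms, terms + 1)
        return np.sum((-1.0) ** j * np.exp(-2.0 * j ** 2 * t ** 2))

    rng = np.random.default_rng(0)
    x1, x2 = rng.normal(size=100), rng.normal(loc=0.3, size=120)
    d = ks_two_sample(x1, x2)
    t = np.sqrt(100 * 120 / 220) * d
    print(d, 1.0 - ks_limit_cdf(t))   # statistic and approximate p-value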
(Recall that if X ∼ F, where F is continuous, then F(X) is a standard uniform random variable. This is called the probability integral transform.)
An interesting paper regarding the K-S and C-vM tests is Fan [1996]. Fan takes a deeper look at these statistics by first formulating an equivalent version of H0 : F = F0 versus H1 : F ≠ F0. If the null hypothesis is true, i.e., if the unknown cdf F is in fact equal to F0, the sample U1 = F0(X1), . . . , Un = F0(Xn) is an independent sample of size n from the standard uniform distribution. Thus the problem of testing H0 : F = F0 versus H1 : F ≠ F0 for the original sample is equivalent to testing

H0 : F = the uniform cdf versus H1 : F ≠ the uniform cdf   (1)

for the transformed sample U1, . . . , Un.
Fan's next step is to cleverly reformulate (1). Consider the Fourier transform

\[ \theta_{2j-1} = \int_0^1 \cos(2\pi j x)\, f(x)\, dx, \qquad \theta_{2j} = \int_0^1 \sin(2\pi j x)\, f(x)\, dx \qquad (j = 1, 2, \dots), \]
where f is the density for F. If the null hypothesis is true, i.e., if F is the standard uniform cdf, the density f is of course equal to I(x ∈ [0, 1]). This implies that θ2j−1 and θ2j are equal to 0. Thus (1) is equivalent to

H0 : θk = 0 (k = 1, 2, . . .)   (2)

versus

H1 : at least one θk ≠ 0.
Fan then rewrites the C-vM statistic as

\[ C_n = (2\pi^2)^{-1} \sum_{j=1}^{\infty} j^{-2} \big( \hat{\theta}_{2j-1}^2 + \hat{\theta}_{2j}^2 \big), \]
where θ̂2j−1 and θ̂2j are estimates of θ2j−1 and θ2j. This form of Cn is revealing, for it shows that Cn downweights high-frequency components. More specifically, suppose that F has high energy at a large frequency, i.e., large θ2j−1 and/or θ2j for a large value of j. Then θ̂2j−1 and/or θ̂2j will be large, but their contribution might easily be overwhelmed when divided by j². This means that the C-vM test lacks power when the true cdf F deviates from the uniform cdf in certain ways. This is true not just for the C-vM test but for any distribution-based test statistic. Fan goes on to formulate more powerful tests that he calls smoothing-based tests.
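To see the downweighting numerically, one can estimate the θ's from the transformed sample and evaluate a truncated version of Fan's series. The moment estimators below (sample means of cos(2πjU) and sin(2πjU)) are an illustrative choice, not necessarily the estimators Fan uses, and the truncation point is arbitrary:

    import numpy as np

    def cvm_fourier(u, n_terms=200):
        """Truncated series (2 pi^2)^{-1} sum_j j^{-2} (theta_hat_{2j-1}^2 + theta_hat_{2j}^2)."""
        u = np.asarray(u, dtype=float)                  # transformed sample U_i = F0(X_i)
        j = np.arange(1, n_terms + 1)[:, None]          # frequencies as a column vector
        theta_odd = np.mean(np.cos(2 * np.pi * j * u), axis=1)   # estimates of theta_{2j-1}
        theta_even = np.mean(np.sin(2 * np.pi * j * u), axis=1)  # estimates of theta_{2j}
        return np.sum((theta_odd ** 2 + theta_even ** 2) / j.ravel() ** 2) / (2 * np.pi ** 2)

    # Under H0 the U_i are standard uniform and the statistic should be small
    rng = np.random.default_rng(0)
    print(cvm_fourier(rng.uniform(size=500)))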
Estimating Statistical Functionals
Let F be a cdf. Define ∫ g(x) dF(x) to be ∫ g(x) f(x) dx if F is continuous, and ∑i g(xi) f(xi) if F is discrete, where f is the density or mass function, respectively, for F. This is handy because it allows us to use one notation to handle both situations. If you have studied measure theory, you know that the two situations are in fact equivalent in the sense that ∑i g(xi) f(xi) can be written as an integral with respect to counting measure.
A statistical functional is a function T(F) of the cdf F. Common examples are the mean µ = ∫ x dF(x) and the variance σ² = ∫ (x − µ)² dF(x). These are linear functionals, i.e., functionals of the form ∫ g(x) dF(x). Another example is F⁻¹(p), the pth quantile, which is a nonlinear functional.
The so-called plug-in estimator of a functional θ = T(F) is θ̂n = T(F̂n). In the case of a linear functional, since F̂n is discrete, we can write θ̂n as

\[ T(\hat{F}_n) = \int g(x)\, d\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(X_i). \]
We can often apply this definition recursively to find plug-in estimators of linear functionals that contain unknown parameters. For example, consider the skewness

\[ \gamma_1 = \frac{\mathrm{E}(X - \mu)^3}{\sigma^3} = \frac{\int (x - \mu)^3\, dF(x)}{\{ \int (x - \mu)^2\, dF(x) \}^{3/2}}. \]
We do not know µ, but we know that its plug-in estimator is µ̂ = X̄n. Thus

\[
\hat{\gamma}_1 = \frac{\tfrac{1}{n} \sum_i (X_i - \hat{\mu})^3}{\{ \tfrac{1}{n} \sum_i (X_i - \hat{\mu})^2 \}^{3/2}}
             = \frac{\tfrac{1}{n} \sum_i (X_i - \hat{\mu})^3}{\{ \hat{\sigma}^2 \}^{3/2}}
             = \frac{\tfrac{1}{n} \sum_i (X_i - \hat{\mu})^3}{\hat{\sigma}^3}.
\]
As for the nonlinear functional F⁻¹(p) mentioned above, the estimator is F̂n⁻¹(p). Since F̂n does not have a true inverse, we define F̂n⁻¹(p) = inf{x : F̂n(x) ≥ p}. We use this definition not just for F̂n but for all cdfs that have one or more "jumps."
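The plug-in recipe translates directly into code. Here is a short numpy sketch (function names are illustrative) for the plug-in skewness γ̂1 and the plug-in quantile F̂n⁻¹(p):

    import numpy as np

    def plugin_skewness(sample):
        """gamma_hat_1 = (1/n) sum (X_i - mu_hat)^3 / sigma_hat^3."""
        x = np.asarray(sample, dtype=float)
        mu_hat = x.mean()                                 # plug-in estimator of mu
        sigma_hat = np.sqrt(np.mean((x - mu_hat) ** 2))   # plug-in sd (divides by n, not n - 1)
        return np.mean((x - mu_hat) ** 3) / sigma_hat ** 3

    def plugin_quantile(sample, p):
        """F_hat_n^{-1}(p) = inf{x : F_hat_n(x) >= p}, for p in (0, 1]."""
        x = np.sort(np.asarray(sample, dtype=float))
        n = x.size
        i = int(np.ceil(p * n))            # smallest i with i/n >= p
        return x[max(i, 1) - 1]

    x = np.random.default_rng(0).exponential(size=1000)
    print(plugin_skewness(x))        # roughly 2 for the exponential distribution
    print(plugin_quantile(x, 0.5))   # roughly log 2 for the exponential distribution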
Influence Functions
The influence function is useful for approximating the standard error of a plug-in estimator. The influence function is defined in terms of the Gâteaux derivative of T at F in the direction G:

\[ L_F(G) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon) F + \varepsilon G\} - T(F)}{\varepsilon}. \]
If G = δx is a point mass (also called a Dirac mass) at x, we write L_F(x) ≡ L_F(δx) and call L_F(x) the influence function. This implies that

\[ L_F(x) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon) F + \varepsilon \delta_x\} - T(F)}{\varepsilon}. \]

Informally, the influence function describes the effect of an infinitesimal contamination at x on the functional of interest. The empirical influence function is the plug-in estimator of L_F:

\[ \hat{L}(x) = \lim_{\varepsilon \to 0} \frac{T\{(1 - \varepsilon) \hat{F}_n + \varepsilon \delta_x\} - T(\hat{F}_n)}{\varepsilon}. \]

Theorem 3 Let T(F) = ∫ g(x) dF(x) be a linear functional. Then:
1. L_F(x) = g(x) − T(F) and L̂(x) = g(x) − T(F̂n).

2. For any G,

\[ T(G) = T(F) + \int L_F(x)\, dG(x). \]

3. ∫ L_F(x) dF(x) = 0.

4. Let τ² = ∫ L_F²(x) dF(x). Then τ² = ∫ {g(x) − T(F)}² dF(x), and if τ² < ∞,

\[ \sqrt{n}\,\{ T(F) - T(\hat{F}_n) \} \Rightarrow N(0, \tau^2). \]
5. Let

\[ \hat{\tau}^2 = \frac{1}{n} \sum_i \hat{L}^2(X_i) = \frac{1}{n} \sum_i \{ g(X_i) - T(\hat{F}_n) \}^2. \]

Then τ̂² → τ² and ŝe/se → 1, both in probability, where ŝe = τ̂/√n and se = √{V T(F̂n)}.

6.

\[ \frac{\sqrt{n}\,\{ T(F) - T(\hat{F}_n) \}}{\hat{\tau}} \Rightarrow N(0, 1). \]
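To connect the limit definition of L̂(x) with the closed form in Part 1, the following numpy sketch (helper names are illustrative) evaluates a linear functional on the contaminated distribution (1 − ε)F̂n + εδx for a small ε and compares the difference quotient with g(x) − T(F̂n):

    import numpy as np

    def T_mean(support, weights):
        """Linear functional with g(x) = x: T(G) = sum of weights * support."""
        return np.sum(weights * support)

    def empirical_influence(T, sample, x, eps=1e-6):
        """Difference-quotient approximation of L_hat(x)."""
        sample = np.asarray(sample, dtype=float)
        n = sample.size
        support = np.append(sample, x)                 # support of the contaminated cdf
        w_Fn = np.append(np.full(n, 1.0 / n), 0.0)     # weights of F_hat_n
        w_mix = (1.0 - eps) * w_Fn
        w_mix[-1] += eps                               # add mass eps at the point x
        return (T(support, w_mix) - T(support, w_Fn)) / eps

    data = np.array([1.0, 2.0, 4.0, 7.0])
    print(empirical_influence(T_mean, data, x=10.0))   # approximately 10 - X_bar = 6.5
    print(10.0 - data.mean())                          # closed form g(x) - T(F_hat_n)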
This theorem shows that the influence function behaves like the score function in parametric settings. More specifically, if

\[ s_\theta(x) = \dot{\ell}(\theta \mid x) = \frac{d}{d\theta} \log f(x \mid \theta) \]

is the score function for parametric model f(x | θ), we have (assuming certain regularity conditions hold)

\[ \mathrm{E}\, s_\theta(x) = 0, \qquad \mathrm{V}\, s_\theta(x) = \mathrm{E}\, s_\theta^2(x). \]
In words, the variance of the score is the second raw moment of the score because the first raw moment is equal to 0. The same is true for the influence function of a linear functional, and so we have

\[ \mathrm{E}\, L_F(x) = 0, \qquad \mathrm{V}\, T(\hat{F}_n) \approx \frac{1}{n} \mathrm{V}\, L_F(x) = \frac{1}{n} \mathrm{E}\, L_F^2(x). \]
If the functional is not linear, Part 2 of Theorem 3 will not hold,
but it may be approximately true.
Theorem 4 If T is Hadamard differentiable with respect to d(F, G) = supₓ |F(x) − G(x)|, then

\[ \sqrt{n}\,\{ T(\hat{F}_n) - T(F) \} \Rightarrow N(0, \tau^2), \]

where τ² = E L_F²(x), and

\[ \frac{T(\hat{F}_n) - T(F)}{\hat{\mathrm{se}}} \Rightarrow N(0, 1), \]

where ŝe = τ̂/√n and

\[ \hat{\tau}^2 = \frac{1}{n} \sum_i \hat{L}^2(X_i). \]
The approximate distributional result {T(F̂n) − T(F)}/ŝe ∼ N(0, 1) is called the nonparametric delta method. The case T(F) = F⁻¹(p), where F has positive density f, is an instructive example. Writing θ = F⁻¹(p), the influence function is

\[ L(x) = \begin{cases} \dfrac{p - 1}{f(\theta)}, & x \le \theta, \\[1ex] \dfrac{p}{f(\theta)}, & x > \theta. \end{cases} \]
Thus the asymptotic variance of T(F̂n) is

\[ \tau^2 = \int L^2(x)\, dF(x) = \frac{p(1 - p)}{f^2(\theta)}, \]

which implies that we must estimate the density f in order to estimate τ².
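As a sketch of that last point (numpy only; the Gaussian-kernel density estimate and the normal-reference bandwidth below are illustrative choices, not prescribed by the notes), one can estimate the standard error of a plug-in quantile by plugging a density estimate f̂(θ̂) into τ² = p(1 − p)/f²(θ):

    import numpy as np

    def gaussian_kde_at(sample, t, bandwidth=None):
        """Simple Gaussian-kernel density estimate evaluated at the point t."""
        x = np.asarray(sample, dtype=float)
        n = x.size
        if bandwidth is None:
            bandwidth = 1.06 * x.std() * n ** (-1 / 5)   # normal-reference rule of thumb
        z = (t - x) / bandwidth
        return np.mean(np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)) / bandwidth

    def quantile_se(sample, p):
        """Plug-in p-th quantile and its delta-method standard error."""
        x = np.sort(np.asarray(sample, dtype=float))
        n = x.size
        theta_hat = x[int(np.ceil(p * n)) - 1]           # plug-in quantile
        f_hat = gaussian_kde_at(x, theta_hat)            # estimate of f(theta)
        tau2_hat = p * (1.0 - p) / f_hat ** 2
        return theta_hat, np.sqrt(tau2_hat / n)

    sample = np.random.default_rng(0).normal(size=500)
    print(quantile_se(sample, 0.5))   # sample median and its approximate standard error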
References

Jianqing Fan. Test of significance based on wavelet thresholding and Neyman's truncation. Journal of the American Statistical Association, 91(434):674-688, 1996.

Hubert Lilliefors. On the Kolmogorov-Smirnov test for normality with mean and variance unknown. Journal of the American Statistical Association, 62(318):399-402, June 1967.