GSTF Journal of Mathematics, Statistics and Operations Research (JMSOR) Vol.2 No.2, October 2014. DOI: 10.5176/2251-3388_2.2.46
ON THE EXACT AND THE APPROXIMATE
MEAN INTEGRATED SQUARE ERROR FOR
THE KERNEL DISTRIBUTION FUNCTION
ESTIMATOR
Abdel-Razzaq Mugdadi, and Rawan Bani-Melhem
Abstract—The asymptotic mean integrated square error (AMISE) is used as an approximate measure of error for the mean integrated square error (MISE). The exact MISE for the kernel density estimator was discussed by Marron and Wand [1]. In this investigation we discuss the exact MISE and the AMISE for the cumulative distribution function estimator. Also, we compare the optimal bandwidth that minimizes the AMISE with the one that minimizes the MISE. In addition, through simulations these optimal bandwidths are compared with the bandwidth selectors obtained using the least squares cross-validation (LSCV), biased cross-validation (BCV), and direct plug-in (DPI) techniques.
Index Terms—cumulative distribution function, density estimation, bandwidth, kernel method, mean square error

Abdel-Razzaq Mugdadi, Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan, email: [email protected]
Rawan Bani-Melhem, Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan, email: [email protected]

I. INTRODUCTION

Let $X_1, \ldots, X_n$ be a random sample from a continuous probability density function $f$. The kernel density estimator of $f$ was proposed in 1956 by Rosenblatt and is given by

\[
\hat f(x,h) = (nh)^{-1}\sum_{i=1}^{n} k\Big(\frac{x - X_i}{h}\Big),
\]

where $k$ is called the kernel function; it is a known function such that $\int k(x)\,dx = 1$ and it is symmetric around zero, and $h$ is a positive real number called the bandwidth. Using the rescaling notation $k_h(u) = h^{-1}k(u/h)$, the estimator can be written as $\hat f(x,h) = n^{-1}\sum_{i=1}^{n} k_h(x - X_i)$.

To study the performance of the kernel density estimator, one must have suitable error criteria. The most common criterion is the mean integrated squared error (MISE). Marron and Wand [1] derived the exact MISE expression for the kernel density estimator, and Mugdadi and Jeter [2] performed a simulation study for the exact and the approximate MISE for the kernel density estimator.

The oldest known technique to estimate the cumulative distribution function $F(x)$ is the empirical estimator. Nadaraya [3] proposed the kernel technique to estimate $F(x)$, given by

\[
\hat F(x,h) = n^{-1}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big),
\]

where $K(x) = \int_{-\infty}^{x} k(t)\,dt$. In this investigation we use the Epanechnikov kernel $k(x) = \frac{3}{4}(1 - x^{2})\,I[|x| \le 1]$ to estimate $F(x)$. The choice of the bandwidth is a crucial part of the kernel distribution estimator: the bandwidth $h$ still controls the smoothness of $\hat F$. Unfortunately, the values of $h$ that optimize global measures of the accuracy of $\hat F(x,h)$ are different from those of $\hat f(x,h)$, so bandwidths chosen for density estimation are not directly applicable here. The typical measure of error used to study the performance of $\hat F(x,h)$ is the mean integrated square error (MISE), defined as

\[
MISE(\hat F(x,h), w) = E\Big[\int_{-\infty}^{\infty} \{\hat F(x,h) - F(x)\}^{2} w(x)\,dx\Big],
\]

where $w(x)$ is a weight function. Swanepoel [4] derived an expression for $MISE\{\hat F(x,h)\}$ when $w(x) = f(x)$, and Altman and Leger [5] derived the asymptotic mean integrated square error (AMISE) for $\hat F(x,h)$. In this investigation we denote the optimal bandwidth obtained by minimizing $AMISE\{\hat F(x,h)\}$ by $h_{AMISE}$, while the bandwidth obtained by minimizing the MISE is denoted by $h_{MISE}$. The MISE optimal bandwidth can only be calculated if the random sample comes from a known density, which is often not the case. Therefore, methods are needed to select the bandwidth when the underlying density is unknown, and there are several techniques to select the bandwidth for $\hat F(x,h)$. We will derive an expression for $MISE\{\hat F(x,h)\}$ when the probability density function is assumed to be the exponential distribution. We will also compare the MISE and the AMISE, compare $h_{AMISE}$ with $h_{MISE}$, and study how close they are to each other. In addition, we will compare the optimal bandwidths with the data-based bandwidth selectors.
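For concreteness, the estimator $\hat F(x,h)$ with the Epanechnikov kernel can be coded as in the following minimal Python sketch; this is our illustration (NumPy assumed, function names are ours), not part of the original paper.

import numpy as np

def epanechnikov_cdf(t):
    """K(t): the integral of k(u) = 0.75*(1 - u^2) on [-1, 1] from -infinity up to t."""
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

def kernel_cdf_estimate(x, data, h):
    """F_hat(x, h) = n^{-1} * sum_i K((x - X_i) / h)."""
    x = np.asarray(x, dtype=float)
    u = (x[..., None] - data[None, :]) / h      # shape (..., n)
    return epanechnikov_cdf(u).mean(axis=-1)

# Example: estimate F at a few points from a simulated exponential sample (rate lambda = 1).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=40)
print(kernel_cdf_estimate([0.5, 1.0, 2.0], sample, h=0.3))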
II. MISE AND AMISE OPTIMAL BANDWIDTHS

Before a kernel distribution function estimator can be created from a random sample, a value for the bandwidth must be chosen. The obvious choice for the bandwidth is the value of
h that minimizes the MISE; this value is denoted by $h_{MISE}$ and is also called the MISE optimal bandwidth. To find $h_{MISE}$, the probability density function $f(x)$ or the cumulative distribution function $F(x)$ must be known. Thus, to study this technique we simulate from a known probability density function $f(x)$ and substitute $F(x)$ in the MISE. In this section, numerous values of $h_{MISE}$ are calculated using the Epanechnikov kernel. While this may seem like a simple task, doing so by hand would be impractical, if not impossible; therefore, these values of $h_{MISE}$ are calculated numerically through a minimization program written in Mathematica. Under some conditions, Swanepoel [4] obtained the exact MISE expression when $w(x) = f(x)$ by the following formula:
\[
\begin{aligned}
MISE\{\hat F(x,h)\} ={}& E\Big[\int_{-\infty}^{\infty}\{\hat F_{n,h}(x) - F(x)\}^{2}\,w(x)\,dx\Big]\\
={}& (6n)^{-1} + (1 - 1/n)\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big]^{2} f(x)\,dx\\
& + \frac{2}{n}\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)K(t)\{F(x-ht) - F(x)\}\,dt\Big] f(x)\,dx\\
& - \frac{2}{n}\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big] F(x)f(x)\,dx. \qquad (1)
\end{aligned}
\]

The exact MISE expression for $\hat F(x,h)$ when the random variable $X$ has an exponential distribution with mean $1/\lambda$ and the Epanechnikov kernel is derived. Let

\[
\begin{aligned}
\psi[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)K(t)\{F(x-ht) - F(x)\}\,dt\Big] f(x)\,dx\\
={}& \frac{1}{256h^{6}\lambda^{6}}\,e^{-3\lambda h}\big(288(10 + 10\lambda h + 3\lambda^{2}h^{2})
 - 96e^{2\lambda h}(30 - 30\lambda h + 9\lambda^{2}h^{2} + 4\lambda^{3}h^{3})\\
& + e^{3\lambda h}(45 - 36\lambda^{2}h^{2} + 18\lambda^{3}h^{3} - 72\lambda^{5}h^{5} + 16\lambda^{6}h^{6})
 - e^{\lambda h}(45 + 90\lambda h + 54\lambda^{2}h^{2} - 24\lambda^{3}h^{3} - 48\lambda^{4}h^{4} + 64\lambda^{6}h^{6})\big),
\end{aligned}
\]

\[
\begin{aligned}
\omega[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big] F(x)f(x)\,dx\\
={}& \frac{1}{864\lambda^{3}h^{3}}\,e^{-4\lambda h}\big(-432(1 + \lambda h) - 216e^{3\lambda h}(7 + \lambda h) - 27e^{2\lambda h}(-21 + 16\lambda^{3}h^{3})\\
& + e^{8\lambda h}(79 + 75\lambda h + 36\lambda^{3}h^{3}) + e^{4\lambda h}(745 - 234\lambda^{2}h^{2} + 72\lambda^{3}h^{3})\big),
\end{aligned}
\]

and

\[
\begin{aligned}
\eta[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big]^{2} f(x)\,dx\\
={}& \frac{1}{15552h^{6}\lambda^{6}}\,e^{-5\lambda h}\big(-11664(1 + \lambda h)^{2}
 + 729e^{3\lambda h}(25 + 50\lambda h + 22h^{2}\lambda^{2} - 24h^{3}\lambda^{3} + 16h^{4}\lambda^{4})\\
& + 864e^{\lambda h}(-1 - 4\lambda h - 3h^{2}\lambda^{2} + 18h^{3}\lambda^{3} + 18h^{4}\lambda^{4})
 - 108e^{4\lambda h}(1207 + 883\lambda h + 198h^{2}\lambda^{2} - 54h^{3}\lambda^{3} + 72h^{4}\lambda^{4})\\
& + e^{2\lambda h}(6481 + 1947\lambda h - 4590h^{2}\lambda^{2} - 144h^{3}\lambda^{3} + 432h^{4}\lambda^{4} - 1296h^{6}\lambda^{6})\\
& + e^{5\lambda h}(98735 - 26820h^{2}\lambda^{2} + 8460h^{3}\lambda^{3} + 5022h^{4}\lambda^{4} - 4536h^{5}\lambda^{5} + 1296h^{6}\lambda^{6})\big).
\end{aligned}
\]

Thus,

\[
MISE\{\hat F(x,h)\} = (6n)^{-1} + (1 - 1/n)\,\eta[h,\lambda] + \frac{2}{n}\,\psi[h,\lambda] - \frac{2}{n}\,\omega[h,\lambda].
\]
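To make the role of equation (1) concrete, the following Python sketch (our illustration, not the authors' Mathematica program) evaluates the exact MISE for the exponential case by numerically integrating (1) and minimizes it over h; NumPy and SciPy are assumed.

import numpy as np
from scipy import integrate, optimize

lam = 1.0                                       # exponential rate, so F(x) = 1 - exp(-lam*x) for x >= 0

def F(x):
    return np.where(x > 0.0, 1.0 - np.exp(-lam * np.maximum(x, 0.0)), 0.0)

def f(x):
    return np.where(x > 0.0, lam * np.exp(-lam * np.maximum(x, 0.0)), 0.0)

def k(t):                                       # Epanechnikov kernel
    return 0.75 * (1.0 - t ** 2)

def K(t):                                       # its antiderivative (CDF) on [-1, 1]
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

nodes, weights = np.polynomial.legendre.leggauss(60)   # Gauss-Legendre rule on [-1, 1] for the t-integrals

def exact_mise(h, n):
    """Equation (1) with w = f, evaluated by numerical integration."""
    def B(x):                                   # inner integral with k(t)
        return np.sum(weights * k(nodes) * (F(x - h * nodes) - F(x)))
    def C(x):                                   # inner integral with k(t)K(t)
        return np.sum(weights * k(nodes) * K(nodes) * (F(x - h * nodes) - F(x)))
    sq = integrate.quad(lambda x: B(x) ** 2 * f(x), 0.0, np.inf)[0]
    kk = integrate.quad(lambda x: C(x) * f(x), 0.0, np.inf)[0]
    bf = integrate.quad(lambda x: B(x) * F(x) * f(x), 0.0, np.inf)[0]
    return 1.0 / (6.0 * n) + (1.0 - 1.0 / n) * sq + (2.0 / n) * kk - (2.0 / n) * bf

n = 20
res = optimize.minimize_scalar(lambda h: exact_mise(h, n), bounds=(1e-3, 3.0), method="bounded")
print(res.x, res.fun)                           # h_MISE and the minimized exact MISE for this n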
We evaluated the h that minimizes $MISE\{\hat F(x,h)\}$ for $\lambda = 1$ and then compared the MISE and the AMISE; other cases are available from the authors.

As seen above, it is quite tedious to determine the exact expressions for the MISE that are needed to determine $h_{MISE}$. This leads to another benefit of the asymptotic mean integrated squared error (AMISE), which provides an asymptotic approximation to the MISE for large sample sizes with the benefit of depending upon h in a simple way. Altman and Leger [5] derived an expression for the AMISE under some regularity assumptions:

\[
AMISE\{\hat F(x,h)\} = V_1 n^{-1} - V_2 h n^{-1} + B_3 h^{4},
\]

where $V_1 = D_1(F)$, $V_2 = 2A_1(k)D_2(F)$, $B_3 = 0.25[A_2(k)]^{2}D_3(F)$, $A_1(k) = \int xK(x)k(x)\,dx$, $A_2(k) = \int x^{2}k(x)\,dx$, $D_1(F) = \int F(x)[1 - F(x)]f(x)w(x)\,dx$, $D_2(F) = \int [f(x)]^{2}w(x)\,dx$, and $D_3(F) = \int [f'(x)]^{2}f(x)w(x)\,dx$. Therefore,

\[
h_{AMISE} = \Big[\frac{0.25\,V_2}{n B_3}\Big]^{1/3}.
\]

However, $h_{AMISE}$ is still disappointing since it depends on the unknown underlying density $f$.

In this section, the exponential density function used to find $h_{MISE}$ is also used to find $h_{AMISE}$. Before $h_{AMISE}$ can be calculated for this density, $V_2$ and $B_3$ must be calculated. After simplification we have

\[
D_2(F) = \int_{0}^{\infty} \lambda^{3}e^{-3\lambda x}\,dx = \frac{\lambda^{2}}{3},
\]
and

\[
D_3(F) = \int_{0}^{\infty} (\lambda^{4}e^{-2\lambda x})\,\lambda e^{-\lambda x}\,\lambda e^{-\lambda x}\,dx = \frac{\lambda^{5}}{4}.
\]

Thus,

\[
h_{AMISE} = 2.04653\Big(\frac{1}{n\lambda^{3}}\Big)^{1/3}.
\]
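As a quick check of this closed form, the short Python sketch below (ours, not from the paper) evaluates $h_{AMISE} = [0.25V_2/(nB_3)]^{1/3}$ for the exponential case, using $V_2 = 2A_1(k)D_2(F)$ with $A_1(k) = 9/70$ for the Epanechnikov kernel (so $V_2 = 3\lambda^{2}/35$, the value also used in Section III) and $B_3 = 0.25[A_2(k)]^{2}D_3(F) = \lambda^{5}/400$.

def h_amise_exponential(n, lam):
    """h_AMISE = [0.25 * V2 / (n * B3)]^(1/3) for Exp(lam) with the Epanechnikov kernel."""
    V2 = 3.0 * lam ** 2 / 35.0      # V2 = 2 * A1(k) * D2(F), with A1(k) = 9/70 and D2(F) = lam^2 / 3
    B3 = lam ** 5 / 400.0           # B3 = 0.25 * [A2(k)]^2 * D3(F), with A2(k) = 1/5 and D3(F) = lam^5 / 4
    return (0.25 * V2 / (n * B3)) ** (1.0 / 3.0)

# For n = 5, lam = 1 this gives (60/7/5)**(1/3), approximately 1.1968, i.e. 2.04653*(1/(n*lam^3))^(1/3),
# which agrees with the first h_AMISE entry of Table 2 below.
print(h_amise_exponential(5, 1.0))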
Fig. 1. MISE optimal bandwidth (bottom curve) versus AMISE optimal bandwidth (top curve) when λ = 1.
Table 1: MISE Optimal Bandwidth for the Exponential Distribution Function when λ = 1

 n    h_MISE      MISE          AMISE
 5    0.861139    0.0306692     0.00327906
10    0.482836    0.016192      0.00433061
15    0.318424    0.010966      0.00376169
20    0.23139     0.0082737     0.00318216
25    0.179662    0.00663713    0.00271995
30    0.146119    0.00553896    0.00236143
35    0.122479    0.00475171    0.00208157
40    0.105662    0.00415997    0.00185723
Table 2 provides the values of $h_{AMISE}$ obtained for $\lambda = 1$. The table also gives the MISE and the AMISE when each corresponding value of h is used to create a kernel distribution estimate for F.

Table 2: AMISE Optimal Bandwidth for the Exponential Distribution when λ = 1

 n    h_AMISE     MISE           AMISE
 5    1.19682     0.0313453      0.00127902
10    0.949914    0.0173189      0.00222674
15    0.829827    0.0121889      0.00199916
20    0.753947    0.00947741     0.00174326
25    0.699903    0.0077856      0.00153358
30    0.658634    0.000662358    0.00136642
35    0.625646    0.00577351     0.0012318
40    0.598408    0.00512324     0.0011216

Many insights can be gained from the analysis of Tables 1 and 2. (Several similar tables for other cases are available from the authors.) First, in each table the AMISE approaches the MISE as the sample size gets large, since the AMISE is a large-sample approximation to the MISE. In addition, the MISE and the AMISE become smaller as the sample size increases, which means the kernel distribution estimator is becoming more accurate. Next, focus on the values of h in each table: as the sample size increases, the bandwidth decreases. In other words, less smoothing is needed with larger sample sizes. Insights can also be gained by comparing the two tables for each value of $\lambda$ for the exponential distribution; as n becomes large, $h_{MISE}$ and $h_{AMISE}$ become closer to 0, as shown in Figure 1.

III. BANDWIDTH SELECTORS

The practical implementation of the kernel distribution estimator requires the specification of the bandwidth h. This choice of h is very important, as many authors have noted (e.g., Wand and Jones [6]). The obvious choice of h is the one that minimizes the MISE, which we discussed earlier. However, the true density of the random sample must be known to calculate $h_{MISE}$.

There are many different bandwidth selectors. In this section, we investigate through simulations three techniques to select the bandwidth for estimating F(x): least squares cross-validation (LSCV), biased cross-validation (BCV), and direct plug-in (DPI). The first two are inspired by the minimization of $MISE\{\hat F(x,h)\}$, while the last one is based on minimizing $AMISE\{\hat F(x,h)\}$. For each bandwidth selection technique, a simulation was conducted to create bandwidth estimates; the estimates were obtained by taking various values of $\lambda$ and samples of sizes 10 to 40.

The Least Squares Cross-Validation: The LSCV technique for estimating f(x), which was proposed by Rudemo [7] and Bowman [8], is inspired by expanding $MISE\{\hat f(x,h)\}$ to

\[
MISE\{\hat f(x,h)\} = E\Big[\int \hat f(x,h)^{2}\,dx - 2\int \hat f(x,h)f(x)\,dx\Big] + \int f^{2}(x)\,dx.
\]

Since $\int f^{2}(x)\,dx$ does not depend on h, minimizing $MISE\{\hat f(x,h)\}$ is equivalent to minimizing

\[
MISE\{\hat f(x,h)\} - \int f^{2}(x)\,dx = E\Big[\int \hat f(x,h)^{2}\,dx - 2\int \hat f(x,h)f(x)\,dx\Big].
\]

It can be shown that an unbiased estimator of the right-hand side of the above equation is

\[
LSCV(h) = \int \hat f(x,h)^{2}\,dx - 2n^{-1}\sum_{i=1}^{n}\hat f_{-i}(X_i;h),
\]

where $\hat f_{-i}(x,h) = (n-1)^{-1}\sum_{j\neq i} k_h(x - X_j)$.
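The following is a minimal Python sketch of LSCV(h) for the density estimator as written above (our own code; computing $\int \hat f(x,h)^{2}\,dx$ numerically on a grid is an implementation choice of this sketch, not a prescription of the paper).

import numpy as np
from scipy import integrate, optimize

def k(t):                                      # Epanechnikov kernel
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def f_hat(x, data, h):                         # kernel density estimate at the points x
    x = np.asarray(x, dtype=float)
    return k((x[..., None] - data[None, :]) / h).sum(axis=-1) / (len(data) * h)

def lscv(h, data):
    """LSCV(h) = int f_hat(x,h)^2 dx - (2/n) * sum_i f_hat_{-i}(X_i; h)."""
    n = len(data)
    grid = np.linspace(data.min() - h, data.max() + h, 2000)   # f_hat vanishes outside this range
    int_sq = integrate.trapezoid(f_hat(grid, data, h) ** 2, grid)
    kmat = k((data[:, None] - data[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)                                # leave-one-out: drop the i = j terms
    loo = kmat.sum(axis=1) / ((n - 1) * h)                     # f_hat_{-i}(X_i; h) for each i
    return int_sq - 2.0 * loo.sum() / n

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=30)
res = optimize.minimize_scalar(lambda h: lscv(h, data), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                   # a data-based bandwidth for the density estimate f_hat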
Sarda [9] considered such an estimator, but argued that the resulting score function will produce a small bandwidth. Instead, he introduced a so-called cross-validation criterion

\[
CV(h) = n^{-1}\sum_{i=1}^{n}\big[\hat F_{n,-i}(x_i) - F_n(x_i)\big]^{2} w(x_i),
\]

where $\hat F_{n,-i}$ is the kernel estimator computed by leaving out $x_i$,

\[
\hat F_{n,-i}(x_i) = (n-1)^{-1}\sum_{j\neq i} K\Big(\frac{X_i - X_j}{h}\Big),
\]

$F_n(x_i)$ is the empirical distribution function, and $w(x_i)$ is the weight function. The bandwidth estimate is chosen by minimizing CV(h); this value of h is denoted by $\hat h_{LSCV}$.
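A small Python sketch of Sarda's criterion as displayed above; the weight is taken as $w(x_i) \equiv 1$, which is a simplification made in this sketch rather than in the paper.

import numpy as np
from scipy import optimize

def K(t):                                      # Epanechnikov kernel CDF
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

def cv(h, data):
    """Sarda's CV(h), with weight w(x_i) = 1."""
    n = len(data)
    kmat = K((data[:, None] - data[None, :]) / h)            # K((X_i - X_j)/h)
    np.fill_diagonal(kmat, 0.0)
    F_loo = kmat.sum(axis=1) / (n - 1)                       # leave-one-out kernel CDF at X_i
    F_emp = (data[:, None] >= data[None, :]).mean(axis=1)    # empirical CDF at X_i
    return np.mean((F_loo - F_emp) ** 2)

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=30)                   # lambda = 0.5, as in Table 5
res = optimize.minimize_scalar(lambda h: cv(h, data), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                                 # the quantity denoted h_LSCV in the text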
The biased cross-validation: The BCV technique was proposed by Scott and Terrell [10]. It is based on the formula for the asymptotic MISE. For density estimation, the BCV objective function is obtained by replacing the unknown $R(f'')$ in the AMISE by the estimator

\[
\widetilde{R(f'')} = R(\hat f''(x,h)) - (nh^{5})^{-1}R(k'') = n^{-2}\sum_{i\neq j}(k''_h * k''_h)(x_i - x_j),
\]

where $(k''_h * k''_h)(x_i - x_j) = \int k''_h(x_i - x_j - t)\,k''_h(t)\,dt$.

Now for the distribution function case, we use the same technique as that of the kernel density estimator. The AMISE of $\hat F$ is $AMISE(\hat F(x,h)) = V_1 n^{-1} - V_2 h n^{-1} + B_3 h^{4}$, where $V_1$, $V_2$ and $B_3$ are defined earlier. The BCV objective function is obtained by replacing $D_3(F)$ with the estimator introduced by Hall and Marron [11], denoted by $\hat D_3(F)$, which is given by

\[
\hat D_3(F) = \frac{1}{n^{3}h^{4}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} k'\Big(\frac{X_i - X_j}{h}\Big)k'\Big(\frac{X_i - X_k}{h}\Big)w(x_i),
\]

where $k'$ is the derivative of the kernel k. Thus,

\[
BCV(h) = V_1 n^{-1} - V_2 h n^{-1} + \big\{0.25[A_2(k)]^{2}\hat D_3(F)\big\}h^{4}.
\]

The bandwidth estimate is chosen by minimizing BCV(h); this value of h is denoted by $\hat h_{BCV}$. Using the exponential density we obtain $V_1(F) = \lambda/12$, $V_2(F) = 3\lambda^{2}/35$, and $[A_2(k)]^{2} = 1/25$, so

\[
BCV(h) = \frac{\lambda}{12n} - \frac{3\lambda^{2}h}{35n} + \frac{0.25h^{4}}{25}\,\hat D_3(F).
\]
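A Python sketch of the exponential-case BCV objective above; here the weight in $\hat D_3(F)$ is taken as $w(x_i) \equiv 1$ (our simplification), and the triple sum is computed as $\sum_i\big[\sum_j k'((X_i - X_j)/h)\big]^{2}$.

import numpy as np
from scipy import optimize

def k_prime(t):                                # derivative of the Epanechnikov kernel k(t) = 0.75*(1 - t^2)
    return np.where(np.abs(t) <= 1.0, -1.5 * t, 0.0)

def d3_hat(h, data):                           # Hall-Marron type estimator of D3(F), with w(x_i) = 1
    n = len(data)
    S = k_prime((data[:, None] - data[None, :]) / h).sum(axis=1)   # S_i = sum_j k'((X_i - X_j)/h)
    return (S ** 2).sum() / (n ** 3 * h ** 4)                      # triple sum = sum_i S_i^2

def bcv(h, data, lam):
    """BCV(h) = lam/(12n) - 3*lam^2*h/(35n) + (0.25*h^4/25)*D3_hat(F), as displayed above."""
    n = len(data)
    return lam / (12.0 * n) - 3.0 * lam ** 2 * h / (35.0 * n) \
        + (0.25 * h ** 4 / 25.0) * d3_hat(h, data)

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=30)     # lambda = 0.5, as in Table 5
res = optimize.minimize_scalar(lambda h: bcv(h, data, lam=0.5), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                   # the quantity denoted h_BCV in the text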
Direct Plug-In: The practical rules proposed by Scott, Tapia, and Thompson [12] and Sheather [13] focus on plugging estimates of the unknown quantities into the formula for $h_{AMISE}$. This method depends upon the functionals $\psi_r = \int f^{(r)}(x)f(x)\,dx$ and the choice of a pilot bandwidth g. A kernel estimator of $\psi_r$ is

\[
\hat\psi_r(g) = n^{-1}\sum_{i=1}^{n}\hat f^{(r)}(X_i; g) = n^{-2}\sum_{i=1}^{n}\sum_{j=1}^{n} L_g^{(r)}(X_i - X_j),
\]

where g is a bandwidth, possibly different from h, and L is a kernel, possibly different from k. Obviously, $\hat\psi_r(g)$ depends upon the choice of the bandwidth g, which is called the pilot bandwidth. The asymptotic mean square error (AMSE) optimal bandwidth is used for g and is given by

\[
g_{AMSE} = \Big[\frac{k!\,L^{(r)}(0)}{-\mu_k(L)\,\psi_{r+2}\,n}\Big]^{1/(r+3)}.
\]

A discussion of these two quantities can be found in Wand and Jones [6]. Using r = 2 and replacing $\psi_4$ by the kernel estimator $\hat\psi_4(g)$, the DPI bandwidth selector for the density is obtained:

\[
h_{DPI} = \Big[\frac{R(k)}{n\,\mu_2(k)^{2}\,\hat\psi_4(g)}\Big]^{1/5}.
\]

Now in the distribution function case, an alternative approach to select the bandwidth is to use an estimator of the asymptotically optimal bandwidth $\hat h_{AMISE}$, which was proposed by Altman and Leger, where $\hat V_2 = 2A_1(k)\hat D_2(F)$,

\[
\hat D_2(F) = \frac{1}{n(n-1)\alpha_v}\sum_{i\neq j} k^{*}\Big(\frac{X_i - X_j}{\alpha_v}\Big)w(x_i),
\]

$\alpha_v$ is the bandwidth, and $\hat B_3 = 0.25[A_2(k)]^{2}\hat D_3(F)$, where

\[
\hat D_3(F) = \frac{1}{\alpha_b^{4} n^{3}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} k'\Big(\frac{X_i - X_j}{\alpha_b}\Big)k'\Big(\frac{X_i - X_k}{\alpha_b}\Big)w(x_i).
\]

Finally, the selected bandwidth is $\hat h_{DPI} = \{0.25\hat V_2/(n\hat B_3)\}^{1/3}$. For the exponential case we can find $\hat h_{DPI}$ as follows: let $\Phi(h) = \{0.25\hat V_2/(n\hat B_3)\}^{1/3}$ and set $h = \Phi(h)$, so we find h such that $h - \Phi(h) = 0$, where $\hat V_2 = \frac{9}{35}\hat D_2(F)$ and $\hat B_3 = 0.01\,\hat D_3(F)$.
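Finally, a Python sketch of the plug-in selection for the distribution function described above. Several choices here are ours rather than the paper's: the weight is taken as $w \equiv 1$, the kernel $k^{*}$ in $\hat D_2(F)$ is taken to be the Epanechnikov kernel itself, and the pilot bandwidths $\alpha_v, \alpha_b$ are set equal to h, which turns the selection into the fixed-point problem $h = \Phi(h)$ mentioned in the text.

import numpy as np
from scipy import optimize

def k(t):                                      # Epanechnikov kernel; also used for k* (our assumption)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def k_prime(t):                                # its derivative
    return np.where(np.abs(t) <= 1.0, -1.5 * t, 0.0)

def d2_hat(a, data):                           # (n(n-1)a)^{-1} sum_{i != j} k*((X_i - X_j)/a), with w = 1
    n = len(data)
    km = k((data[:, None] - data[None, :]) / a)
    np.fill_diagonal(km, 0.0)
    return km.sum() / (n * (n - 1) * a)

def d3_hat(a, data):                           # (a^4 n^3)^{-1} sum_i [sum_j k'((X_i - X_j)/a)]^2, with w = 1
    n = len(data)
    S = k_prime((data[:, None] - data[None, :]) / a).sum(axis=1)
    return (S ** 2).sum() / (n ** 3 * a ** 4)

def phi(h, data):
    """Phi(h) = {0.25 * V2_hat / (n * B3_hat)}^(1/3), with the pilot bandwidths set to h."""
    n = len(data)
    V2 = (9.0 / 35.0) * d2_hat(h, data)        # V2_hat = (9/35) * D2_hat(F)
    B3 = 0.01 * d3_hat(h, data)                # B3_hat = 0.01 * D3_hat(F)
    return (0.25 * V2 / (n * B3)) ** (1.0 / 3.0)

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=30)     # lambda = 0.5, as in Table 5

# Solve h = Phi(h): scan for a sign change of h - Phi(h), then refine with a root finder.
hs = np.linspace(0.2, 4.0, 80)
g = np.array([h - phi(h, data) for h in hs])
j = int(np.where(np.sign(g[:-1]) != np.sign(g[1:]))[0][0])   # a wider scan may be needed for other samples
h_dpi = optimize.brentq(lambda h: h - phi(h, data), hs[j], hs[j + 1])
print(h_dpi)                                   # the quantity denoted h_DPI in the text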
Comparison of Bandwidth Selectors and Optimal Bandwidths: To compare the techniques, we simulated from the exponential distribution with mean 2 for different sample sizes. Table 5 shows that the LSCV estimates are less biased than the BCV estimates; also, as the sample size increases, both estimates become more accurate. The DPI estimates are smaller than the AMISE optimal bandwidths; again, as the sample size increases, the DPI estimates are larger than the MISE optimal bandwidths, as shown in Figure 4.

Table 5: Bandwidth Estimates using the Exponential Distribution when λ = 0.5

 n    h_MISE      h_AMISE    h_LSCV    h_BCV    h_DPI
10    0.965671    1.89983    1.43      1.475    1.159
20    0.462784    1.50789    0.4999    0.761    1.135
30    0.292054    1.31727    0.351     0.66     1.046
40    0.211148    1.19682    0.238     0.482    0.874

Also, from Figures 2, 3, and 4, we conclude that the LSCV bandwidth is close to the MISE bandwidth.
Fig. 2. MISE optimal bandwidth (bottom curve) versus LSCV bandwidth estimate (top curve).

Fig. 4. MISE optimal bandwidth (bottom curve) versus DPI bandwidth estimate (top curve).
AUTHORS' PROFILE
Abdel-Razzaq Mugdadi is an Associate Professor of Statistics in the Department of Mathematics and Statistics at Jordan University of Science and Technology. He received his Ph.D. in Statistics from Northern Illinois University in 1999. His research interests are in the areas of density estimation, reliability, and life testing. His publications have appeared in the Journal of Statistical Planning and Inference, the Journal of Nonparametric Statistics, Computational Statistics and Data Analysis, IEEE Transactions on Reliability, and others.
Rawan Bani-Melhem is a part-time lecturer at Jordan University of Science and Technology. Her research interest is in the area of density estimation. The current paper is based on her master's thesis at Jordan University of Science and Technology under the supervision of Dr. Mugdadi.
Fig. 3. MISE optimal bandwidth (bottom curve) versus BCV bandwidth estimate (top curve).
REFERENCES
[1] Marron, J.S. and Wand, M.P. (1992). Exact mean integrated squared error. Ann. Statist., 20(2), 712-736.
[2] Mugdadi, A.R. and Jeter, J. (2010). A simulation study for the bandwidth selection in the kernel density estimation based on the exact and the asymptotic MISE. Pakistan Journal of Statistics, 26(1), 239-265.
[3] Nadaraya, E.A. (1964). Some new estimates for distribution functions. Theory of Probability and its Applications, 9, 497-500.
[4] Swanepoel, W.H. (1988). Mean integrated square error properties and optimal kernel when estimating a distribution function, 17, 3785-3799.
[5] Altman, N. and Leger, C. (1995). Bandwidth selection for kernel distribution function estimation. J. Statist. Planning and Inference, 46, 195-214.
[6] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing, First Edition.
[7] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 9, 65-78.
[8] Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353-360.
[9] Sarda, P. (1993). Smoothing parameter selection for smooth distribution functions. J. Statist. Plann. Inference, 35, 65-75.
[10] Scott, D.W. and Terrell, G.R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc., 82, 1131-1146.
[11] Hall, P. and Marron, J.S. (1987). Estimation of integrated squared density derivatives. Statist. Probab. Lett., 6, 109-115.
[12] Scott, D.W., Tapia, R.A. and Thompson, J.R. (1977). Kernel density estimation revisited. Nonlinear Anal. Theory Meth. Applic., 1, 339-372.
[13] Sheather, S.J. (1983). A data-based algorithm for choosing the window width when estimating the density at a point. Comp. Statist. Data Anal., 1, 229-238.