GSTF Journal of Mathematics, Statistics and Operations Research (JMSOR) Vol.2 No.2, October 2014. DOI: 10.5176/2251-3388_2.2.46
ON THE EXACT AND THE APPROXIMATE
MEAN INTEGRATED SQUARE ERROR FOR
THE KERNEL DISTRIBUTION FUNCTION
ESTIMATOR
Abdel-Razzaq Mugdadi, and Rawan Bani-Melhem
Abstract—The asymptotic mean integrated square error (AMISE) is used as an approximate measure of error for the mean integrated square error (MISE). The exact MISE for the kernel density estimator was discussed by Marron and Wand [1]. In this investigation we discuss the exact MISE and the AMISE for the cumulative distribution function estimator. Also, we compare the optimal bandwidth that minimizes the AMISE with the one that minimizes the MISE. In addition, through simulations these optimal bandwidths are compared with the bandwidth selectors obtained using the least squares cross-validation (LSCV), biased cross-validation (BCV), and direct plug-in (DPI) techniques.
Index Terms—cumulative distribution function, density estimation, bandwidth, kernel method, mean square error

Abdel-Razzaq Mugdadi, Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan, email: [email protected]
Rawan Bani-Melhem, Department of Mathematics and Statistics, Jordan University of Science and Technology, Irbid, Jordan, email: [email protected]

I. INTRODUCTION

Let $X_1, \ldots, X_n$ be a random sample from a continuous probability density function $f$. The kernel density estimator of $f$ was proposed in 1956 by Rosenblatt and is given by

\[
\hat f(x,h) = (nh)^{-1}\sum_{i=1}^{n} k\Big(\frac{x - X_i}{h}\Big),
\]

where $k$ is called the kernel function; it is a known function such that $\int k(x)\,dx = 1$ and it is symmetric around zero, and $h$ is a positive real number called the bandwidth. Using the rescaling notation $k_h(u) = h^{-1}k(u/h)$, the estimator can be written as $\hat f(x,h) = n^{-1}\sum_{i=1}^{n} k_h(x - X_i)$.

To study the performance of the kernel density estimator, one must have suitable error criteria. The most common criterion is the mean integrated squared error (MISE). Marron and Wand [1] derived the exact MISE expression for the kernel density estimator, and Mugdadi and Jeter [2] performed a simulation study for the exact and the approximate MISE for the kernel density estimator.

The oldest known technique to estimate the cumulative distribution function $F(x)$ is the empirical estimator. Nadaraya [3] proposed the kernel technique to estimate $F(x)$, given by

\[
\hat F(x,h) = n^{-1}\sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big),
\]

where $K(x) = \int_{-\infty}^{x} k(t)\,dt$. In this investigation we use the Epanechnikov kernel $k(x) = \frac{3}{4}(1 - x^{2})\,I[|x| \le 1]$ to estimate $F(x)$. The choice of the bandwidth is a crucial part of the kernel distribution estimator: the bandwidth $h$ still controls the smoothness of $\hat F$. Unfortunately, the values of $h$ that optimize global measures of the accuracy of $\hat F(x,h)$ are different from those of $\hat f(x,h)$, so bandwidths chosen for density estimation are not directly applicable here. The typical measure of error used to study the performance of $\hat F(x,h)$ is the mean integrated square error (MISE), defined as

\[
MISE(\hat F(x,h), w) = E\Big[\int_{-\infty}^{\infty} \{\hat F(x,h) - F(x)\}^{2} w(x)\,dx\Big],
\]

where $w(x)$ is a weight function. Swanepoel [4] derived an expression for $MISE\{\hat F(x,h)\}$ when $w(x) = f(x)$, and Altman and Leger [5] derived the asymptotic mean integrated square error (AMISE) for $\hat F(x,h)$. In this investigation we denote the optimal bandwidth obtained by minimizing $AMISE\{\hat F(x,h)\}$ by $h_{AMISE}$, while the bandwidth obtained by minimizing the MISE is denoted by $h_{MISE}$. The MISE optimal bandwidth can only be calculated if the random sample comes from a known density, which is often not the case. Therefore, methods are needed to select the bandwidth when the underlying density is unknown, and there are several techniques to select the bandwidth for $\hat F(x,h)$. We will derive an expression for $MISE\{\hat F(x,h)\}$ when the probability density function is assumed to be the exponential distribution. We will also compare the MISE and the AMISE, compare $h_{AMISE}$ with $h_{MISE}$, and study how close they are to each other. In addition, we will compare the optimal bandwidths with the data-based bandwidth selectors.
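For concreteness, the estimator $\hat F(x,h)$ with the Epanechnikov kernel can be coded as in the following minimal Python sketch; this is our illustration (NumPy assumed, function names are ours), not part of the original paper.

import numpy as np

def epanechnikov_cdf(t):
    """K(t): the integral of k(u) = 0.75*(1 - u^2) on [-1, 1] from -infinity up to t."""
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

def kernel_cdf_estimate(x, data, h):
    """F_hat(x, h) = n^{-1} * sum_i K((x - X_i) / h)."""
    x = np.asarray(x, dtype=float)
    u = (x[..., None] - data[None, :]) / h      # shape (..., n)
    return epanechnikov_cdf(u).mean(axis=-1)

# Example: estimate F at a few points from a simulated exponential sample (rate lambda = 1).
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=40)
print(kernel_cdf_estimate([0.5, 1.0, 2.0], sample, h=0.3))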
II. MISE AND AMISE OPTIMAL BANDWIDTHS

Before a kernel distribution function estimator can be created from a random sample, a value for the bandwidth must be chosen. The obvious choice for the bandwidth is the value of
h that minimizes the MISE; this value is denoted by $h_{MISE}$ and is also called the MISE optimal bandwidth. To find $h_{MISE}$, the probability density function $f(x)$ or the cumulative distribution function $F(x)$ must be known. Thus, to study this technique we simulate from a known probability density function $f(x)$ and substitute $F(x)$ in the MISE. In this section, numerous values of $h_{MISE}$ are calculated using the Epanechnikov kernel. While this may seem like a simple task, doing so by hand would be impractical, if not impossible; therefore, these values of $h_{MISE}$ are calculated numerically through a minimization program written in Mathematica. Under some conditions, Swanepoel [4] obtained the exact MISE expression when $w(x) = f(x)$ by the following formula:
\[
\begin{aligned}
MISE\{\hat F(x,h)\} ={}& E\Big[\int_{-\infty}^{\infty}\{\hat F_{n,h}(x) - F(x)\}^{2}\,w(x)\,dx\Big]\\
={}& (6n)^{-1} + (1 - 1/n)\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big]^{2} f(x)\,dx\\
& + \frac{2}{n}\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)K(t)\{F(x-ht) - F(x)\}\,dt\Big] f(x)\,dx\\
& - \frac{2}{n}\int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big] F(x)f(x)\,dx. \qquad (1)
\end{aligned}
\]

The exact MISE expression for $\hat F(x,h)$ when the random variable $X$ has an exponential distribution with mean $1/\lambda$ and the Epanechnikov kernel is derived. Let

\[
\begin{aligned}
\psi[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)K(t)\{F(x-ht) - F(x)\}\,dt\Big] f(x)\,dx\\
={}& \frac{1}{256h^{6}\lambda^{6}}\,e^{-3\lambda h}\big(288(10 + 10\lambda h + 3\lambda^{2}h^{2})
 - 96e^{2\lambda h}(30 - 30\lambda h + 9\lambda^{2}h^{2} + 4\lambda^{3}h^{3})\\
& + e^{3\lambda h}(45 - 36\lambda^{2}h^{2} + 18\lambda^{3}h^{3} - 72\lambda^{5}h^{5} + 16\lambda^{6}h^{6})
 - e^{\lambda h}(45 + 90\lambda h + 54\lambda^{2}h^{2} - 24\lambda^{3}h^{3} - 48\lambda^{4}h^{4} + 64\lambda^{6}h^{6})\big),
\end{aligned}
\]

\[
\begin{aligned}
\omega[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big] F(x)f(x)\,dx\\
={}& \frac{1}{864\lambda^{3}h^{3}}\,e^{-4\lambda h}\big(-432(1 + \lambda h) - 216e^{3\lambda h}(7 + \lambda h) - 27e^{2\lambda h}(-21 + 16\lambda^{3}h^{3})\\
& + e^{8\lambda h}(79 + 75\lambda h + 36\lambda^{3}h^{3}) + e^{4\lambda h}(745 - 234\lambda^{2}h^{2} + 72\lambda^{3}h^{3})\big),
\end{aligned}
\]

and

\[
\begin{aligned}
\eta[h,\lambda] ={}& \int_{-\infty}^{\infty}\Big[\int_{-\infty}^{\infty} k(t)\{F(x-ht) - F(x)\}\,dt\Big]^{2} f(x)\,dx\\
={}& \frac{1}{15552h^{6}\lambda^{6}}\,e^{-5\lambda h}\big(-11664(1 + \lambda h)^{2}
 + 729e^{3\lambda h}(25 + 50\lambda h + 22h^{2}\lambda^{2} - 24h^{3}\lambda^{3} + 16h^{4}\lambda^{4})\\
& + 864e^{\lambda h}(-1 - 4\lambda h - 3h^{2}\lambda^{2} + 18h^{3}\lambda^{3} + 18h^{4}\lambda^{4})
 - 108e^{4\lambda h}(1207 + 883\lambda h + 198h^{2}\lambda^{2} - 54h^{3}\lambda^{3} + 72h^{4}\lambda^{4})\\
& + e^{2\lambda h}(6481 + 1947\lambda h - 4590h^{2}\lambda^{2} - 144h^{3}\lambda^{3} + 432h^{4}\lambda^{4} - 1296h^{6}\lambda^{6})\\
& + e^{5\lambda h}(98735 - 26820h^{2}\lambda^{2} + 8460h^{3}\lambda^{3} + 5022h^{4}\lambda^{4} - 4536h^{5}\lambda^{5} + 1296h^{6}\lambda^{6})\big).
\end{aligned}
\]

Thus,

\[
MISE\{\hat F(x,h)\} = (6n)^{-1} + (1 - 1/n)\,\eta[h,\lambda] + \frac{2}{n}\,\psi[h,\lambda] - \frac{2}{n}\,\omega[h,\lambda].
\]
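To make the role of equation (1) concrete, the following Python sketch (our illustration, not the authors' Mathematica program) evaluates the exact MISE for the exponential case by numerically integrating (1) and minimizes it over h; NumPy and SciPy are assumed.

import numpy as np
from scipy import integrate, optimize

lam = 1.0                                       # exponential rate, so F(x) = 1 - exp(-lam*x) for x >= 0

def F(x):
    return np.where(x > 0.0, 1.0 - np.exp(-lam * np.maximum(x, 0.0)), 0.0)

def f(x):
    return np.where(x > 0.0, lam * np.exp(-lam * np.maximum(x, 0.0)), 0.0)

def k(t):                                       # Epanechnikov kernel
    return 0.75 * (1.0 - t ** 2)

def K(t):                                       # its antiderivative (CDF) on [-1, 1]
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

nodes, weights = np.polynomial.legendre.leggauss(60)   # Gauss-Legendre rule on [-1, 1] for the t-integrals

def exact_mise(h, n):
    """Equation (1) with w = f, evaluated by numerical integration."""
    def B(x):                                   # inner integral with k(t)
        return np.sum(weights * k(nodes) * (F(x - h * nodes) - F(x)))
    def C(x):                                   # inner integral with k(t)K(t)
        return np.sum(weights * k(nodes) * K(nodes) * (F(x - h * nodes) - F(x)))
    sq = integrate.quad(lambda x: B(x) ** 2 * f(x), 0.0, np.inf)[0]
    kk = integrate.quad(lambda x: C(x) * f(x), 0.0, np.inf)[0]
    bf = integrate.quad(lambda x: B(x) * F(x) * f(x), 0.0, np.inf)[0]
    return 1.0 / (6.0 * n) + (1.0 - 1.0 / n) * sq + (2.0 / n) * kk - (2.0 / n) * bf

n = 20
res = optimize.minimize_scalar(lambda h: exact_mise(h, n), bounds=(1e-3, 3.0), method="bounded")
print(res.x, res.fun)                           # h_MISE and the minimized exact MISE for this n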
We evaluated the h that minimizes $MISE\{\hat F(x,h)\}$ for $\lambda = 1$ and then compared the MISE and the AMISE; other cases are available from the authors.

As seen above, it is quite tedious to determine the exact expressions for the MISE that are needed to determine $h_{MISE}$. This leads to another benefit of the asymptotic mean integrated squared error (AMISE), which provides an asymptotic approximation to the MISE for large sample sizes with the benefit of depending upon h in a simple way. Altman and Leger [5] derived an expression for the AMISE under some regularity assumptions:

\[
AMISE\{\hat F(x,h)\} = V_1 n^{-1} - V_2 h n^{-1} + B_3 h^{4},
\]

where $V_1 = D_1(F)$, $V_2 = 2A_1(k)D_2(F)$, $B_3 = 0.25[A_2(k)]^{2}D_3(F)$, $A_1(k) = \int xK(x)k(x)\,dx$, $A_2(k) = \int x^{2}k(x)\,dx$, $D_1(F) = \int F(x)[1 - F(x)]f(x)w(x)\,dx$, $D_2(F) = \int [f(x)]^{2}w(x)\,dx$, and $D_3(F) = \int [f'(x)]^{2}f(x)w(x)\,dx$. Therefore,

\[
h_{AMISE} = \Big[\frac{0.25\,V_2}{n B_3}\Big]^{1/3}.
\]

However, $h_{AMISE}$ is still disappointing since it depends on the unknown underlying density $f$.

In this section, the exponential density function used to find $h_{MISE}$ is also used to find $h_{AMISE}$. Before $h_{AMISE}$ can be calculated for this density, $V_2$ and $B_3$ must be calculated. After simplification we have

\[
D_2(F) = \int_{0}^{\infty} \lambda^{3}e^{-3\lambda x}\,dx = \frac{\lambda^{2}}{3},
\]
and

\[
D_3(F) = \int_{0}^{\infty} (\lambda^{4}e^{-2\lambda x})\,\lambda e^{-\lambda x}\,\lambda e^{-\lambda x}\,dx = \frac{\lambda^{5}}{4}.
\]

Thus,

\[
h_{AMISE} = 2.04653\Big(\frac{1}{n\lambda^{3}}\Big)^{1/3}.
\]
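As a quick check of this closed form, the short Python sketch below (ours, not from the paper) evaluates $h_{AMISE} = [0.25V_2/(nB_3)]^{1/3}$ for the exponential case, using $V_2 = 2A_1(k)D_2(F)$ with $A_1(k) = 9/70$ for the Epanechnikov kernel (so $V_2 = 3\lambda^{2}/35$, the value also used in Section III) and $B_3 = 0.25[A_2(k)]^{2}D_3(F) = \lambda^{5}/400$.

def h_amise_exponential(n, lam):
    """h_AMISE = [0.25 * V2 / (n * B3)]^(1/3) for Exp(lam) with the Epanechnikov kernel."""
    V2 = 3.0 * lam ** 2 / 35.0      # V2 = 2 * A1(k) * D2(F), with A1(k) = 9/70 and D2(F) = lam^2 / 3
    B3 = lam ** 5 / 400.0           # B3 = 0.25 * [A2(k)]^2 * D3(F), with A2(k) = 1/5 and D3(F) = lam^5 / 4
    return (0.25 * V2 / (n * B3)) ** (1.0 / 3.0)

# For n = 5, lam = 1 this gives (60/7/5)**(1/3), approximately 1.1968, i.e. 2.04653*(1/(n*lam^3))^(1/3),
# which agrees with the first h_AMISE entry of Table 2 below.
print(h_amise_exponential(5, 1.0))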
Fig. 1. MISE optimal bandwidth (bottom curve) versus AMISE optimal bandwidth (top curve) when λ = 1.
Table 1: MISE Optimal Bandwidth for the Exponential Distribution Function when λ = 1

 n    h_MISE      MISE          AMISE
 5    0.861139    0.0306692     0.00327906
10    0.482836    0.016192      0.00433061
15    0.318424    0.010966      0.00376169
20    0.23139     0.0082737     0.00318216
25    0.179662    0.00663713    0.00271995
30    0.146119    0.00553896    0.00236143
35    0.122479    0.00475171    0.00208157
40    0.105662    0.00415997    0.00185723
Table 2 provides the values of $h_{AMISE}$ obtained for $\lambda = 1$. The table also gives the MISE and the AMISE when each corresponding value of h is used to create a kernel distribution estimate for F.

Table 2: AMISE Optimal Bandwidth for the Exponential Distribution when λ = 1

 n    h_AMISE     MISE           AMISE
 5    1.19682     0.0313453      0.00127902
10    0.949914    0.0173189      0.00222674
15    0.829827    0.0121889      0.00199916
20    0.753947    0.00947741     0.00174326
25    0.699903    0.0077856      0.00153358
30    0.658634    0.000662358    0.00136642
35    0.625646    0.00577351     0.0012318
40    0.598408    0.00512324     0.0011216

Many insights can be gained from the analysis of Tables 1 and 2. (Several similar tables for other cases are available from the authors.) First, in each table the AMISE approaches the MISE as the sample size gets large, since the AMISE is a large-sample approximation to the MISE. In addition, the MISE and the AMISE become smaller as the sample size increases, which means the kernel distribution estimator is becoming more accurate. Next, focus on the values of h in each table: as the sample size increases, the bandwidth decreases. In other words, less smoothing is needed with larger sample sizes. Insights can also be gained by comparing the two tables for each value of $\lambda$ for the exponential distribution; as n becomes large, $h_{MISE}$ and $h_{AMISE}$ become closer to 0, as shown in Figure 1.

III. BANDWIDTH SELECTORS

The practical implementation of the kernel distribution estimator requires the specification of the bandwidth h. This choice of h is very important, as many authors have noted (e.g., Wand and Jones [6]). The obvious choice of h is the one that minimizes the MISE, which we discussed earlier. However, the true density of the random sample must be known to calculate $h_{MISE}$.

There are many different bandwidth selectors. In this section, we investigate through simulations three techniques to select the bandwidth for estimating F(x): least squares cross-validation (LSCV), biased cross-validation (BCV), and direct plug-in (DPI). The first two are inspired by the minimization of $MISE\{\hat F(x,h)\}$, while the last one is based on minimizing $AMISE\{\hat F(x,h)\}$. For each bandwidth selection technique, a simulation was conducted to create bandwidth estimates; the estimates were obtained by taking various values of $\lambda$ and samples of sizes 10 to 40.

The Least Squares Cross-Validation: The LSCV technique for estimating f(x), which was proposed by Rudemo [7] and Bowman [8], is inspired by expanding $MISE\{\hat f(x,h)\}$ to

\[
MISE\{\hat f(x,h)\} = E\Big[\int \hat f(x,h)^{2}\,dx - 2\int \hat f(x,h)f(x)\,dx\Big] + \int f^{2}(x)\,dx.
\]

Since $\int f^{2}(x)\,dx$ does not depend on h, minimizing $MISE\{\hat f(x,h)\}$ is equivalent to minimizing

\[
MISE\{\hat f(x,h)\} - \int f^{2}(x)\,dx = E\Big[\int \hat f(x,h)^{2}\,dx - 2\int \hat f(x,h)f(x)\,dx\Big].
\]

It can be shown that an unbiased estimator of the right-hand side of the above equation is

\[
LSCV(h) = \int \hat f(x,h)^{2}\,dx - 2n^{-1}\sum_{i=1}^{n}\hat f_{-i}(X_i;h),
\]

where $\hat f_{-i}(x,h) = (n-1)^{-1}\sum_{j\neq i} k_h(x - X_j)$.
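The following is a minimal Python sketch of LSCV(h) for the density estimator as written above (our own code; computing $\int \hat f(x,h)^{2}\,dx$ numerically on a grid is an implementation choice of this sketch, not a prescription of the paper).

import numpy as np
from scipy import integrate, optimize

def k(t):                                      # Epanechnikov kernel
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def f_hat(x, data, h):                         # kernel density estimate at the points x
    x = np.asarray(x, dtype=float)
    return k((x[..., None] - data[None, :]) / h).sum(axis=-1) / (len(data) * h)

def lscv(h, data):
    """LSCV(h) = int f_hat(x,h)^2 dx - (2/n) * sum_i f_hat_{-i}(X_i; h)."""
    n = len(data)
    grid = np.linspace(data.min() - h, data.max() + h, 2000)   # f_hat vanishes outside this range
    int_sq = integrate.trapezoid(f_hat(grid, data, h) ** 2, grid)
    kmat = k((data[:, None] - data[None, :]) / h)
    np.fill_diagonal(kmat, 0.0)                                # leave-one-out: drop the i = j terms
    loo = kmat.sum(axis=1) / ((n - 1) * h)                     # f_hat_{-i}(X_i; h) for each i
    return int_sq - 2.0 * loo.sum() / n

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=30)
res = optimize.minimize_scalar(lambda h: lscv(h, data), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                   # a data-based bandwidth for the density estimate f_hat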
Sarda [9] considered such an estimator, but argued that the resulting score function will produce a small bandwidth. Instead, he introduced a so-called cross-validation criterion

\[
CV(h) = n^{-1}\sum_{i=1}^{n}\big[\hat F_{n,-i}(x_i) - F_n(x_i)\big]^{2} w(x_i),
\]

where $\hat F_{n,-i}$ is the kernel estimator computed by leaving out $x_i$,

\[
\hat F_{n,-i}(x_i) = (n-1)^{-1}\sum_{j\neq i} K\Big(\frac{X_i - X_j}{h}\Big),
\]

$F_n(x_i)$ is the empirical distribution function, and $w(x_i)$ is the weight function. The bandwidth estimate is chosen by minimizing CV(h); this value of h is denoted by $\hat h_{LSCV}$.
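A small Python sketch of Sarda's criterion as displayed above; the weight is taken as $w(x_i) \equiv 1$, which is a simplification made in this sketch rather than in the paper.

import numpy as np
from scipy import optimize

def K(t):                                      # Epanechnikov kernel CDF
    t = np.clip(t, -1.0, 1.0)
    return 0.25 * (2.0 + 3.0 * t - t ** 3)

def cv(h, data):
    """Sarda's CV(h), with weight w(x_i) = 1."""
    n = len(data)
    kmat = K((data[:, None] - data[None, :]) / h)            # K((X_i - X_j)/h)
    np.fill_diagonal(kmat, 0.0)
    F_loo = kmat.sum(axis=1) / (n - 1)                       # leave-one-out kernel CDF at X_i
    F_emp = (data[:, None] >= data[None, :]).mean(axis=1)    # empirical CDF at X_i
    return np.mean((F_loo - F_emp) ** 2)

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=30)                   # lambda = 0.5, as in Table 5
res = optimize.minimize_scalar(lambda h: cv(h, data), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                                 # the quantity denoted h_LSCV in the text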
The biased cross-validation: The BCV technique was proposed by Scott and Terrell [10]. It is based on the formula for the asymptotic MISE. For density estimation, the BCV objective function is obtained by replacing the unknown $R(f'')$ in the AMISE by the estimator

\[
\widetilde{R(f'')} = R(\hat f''(x,h)) - (nh^{5})^{-1}R(k'') = n^{-2}\sum_{i\neq j}(k''_h * k''_h)(x_i - x_j),
\]

where $(k''_h * k''_h)(x_i - x_j) = \int k''_h(x_i - x_j - t)\,k''_h(t)\,dt$.

Now for the distribution function case, we use the same technique as that of the kernel density estimator. The AMISE of $\hat F$ is $AMISE(\hat F(x,h)) = V_1 n^{-1} - V_2 h n^{-1} + B_3 h^{4}$, where $V_1$, $V_2$ and $B_3$ are defined earlier. The BCV objective function is obtained by replacing $D_3(F)$ with the estimator introduced by Hall and Marron [11], denoted by $\hat D_3(F)$, which is given by

\[
\hat D_3(F) = \frac{1}{n^{3}h^{4}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} k'\Big(\frac{X_i - X_j}{h}\Big)k'\Big(\frac{X_i - X_k}{h}\Big)w(x_i),
\]

where $k'$ is the derivative of the kernel k. Thus,

\[
BCV(h) = V_1 n^{-1} - V_2 h n^{-1} + \big\{0.25[A_2(k)]^{2}\hat D_3(F)\big\}h^{4}.
\]

The bandwidth estimate is chosen by minimizing BCV(h); this value of h is denoted by $\hat h_{BCV}$. Using the exponential density we obtain $V_1(F) = \lambda/12$, $V_2(F) = 3\lambda^{2}/35$, and $[A_2(k)]^{2} = 1/25$, so

\[
BCV(h) = \frac{\lambda}{12n} - \frac{3\lambda^{2}h}{35n} + \frac{0.25h^{4}}{25}\,\hat D_3(F).
\]
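A Python sketch of the exponential-case BCV objective above; here the weight in $\hat D_3(F)$ is taken as $w(x_i) \equiv 1$ (our simplification), and the triple sum is computed as $\sum_i\big[\sum_j k'((X_i - X_j)/h)\big]^{2}$.

import numpy as np
from scipy import optimize

def k_prime(t):                                # derivative of the Epanechnikov kernel k(t) = 0.75*(1 - t^2)
    return np.where(np.abs(t) <= 1.0, -1.5 * t, 0.0)

def d3_hat(h, data):                           # Hall-Marron type estimator of D3(F), with w(x_i) = 1
    n = len(data)
    S = k_prime((data[:, None] - data[None, :]) / h).sum(axis=1)   # S_i = sum_j k'((X_i - X_j)/h)
    return (S ** 2).sum() / (n ** 3 * h ** 4)                      # triple sum = sum_i S_i^2

def bcv(h, data, lam):
    """BCV(h) = lam/(12n) - 3*lam^2*h/(35n) + (0.25*h^4/25)*D3_hat(F), as displayed above."""
    n = len(data)
    return lam / (12.0 * n) - 3.0 * lam ** 2 * h / (35.0 * n) \
        + (0.25 * h ** 4 / 25.0) * d3_hat(h, data)

rng = np.random.default_rng(3)
data = rng.exponential(scale=2.0, size=30)     # lambda = 0.5, as in Table 5
res = optimize.minimize_scalar(lambda h: bcv(h, data, lam=0.5), bounds=(0.05, 3.0), method="bounded")
print(res.x)                                   # the quantity denoted h_BCV in the text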
Direct Plug-In: The practical rules proposed by Scott, Tapia, and Thompson [12] and Sheather [13] focus on plugging estimates of the unknown quantities into the formula for $h_{AMISE}$. This method depends upon the functionals $\psi_r = \int f^{(r)}(x)f(x)\,dx$ and the choice of a pilot bandwidth g. A kernel estimator of $\psi_r$ is

\[
\hat\psi_r(g) = n^{-1}\sum_{i=1}^{n}\hat f^{(r)}(X_i; g) = n^{-2}\sum_{i=1}^{n}\sum_{j=1}^{n} L_g^{(r)}(X_i - X_j),
\]

where g is a bandwidth, possibly different from h, and L is a kernel, possibly different from k. Obviously, $\hat\psi_r(g)$ depends upon the choice of the bandwidth g, which is called the pilot bandwidth. The asymptotic mean square error (AMSE) optimal bandwidth is used for g and is given by

\[
g_{AMSE} = \Big[\frac{k!\,L^{(r)}(0)}{-\mu_k(L)\,\psi_{r+2}\,n}\Big]^{1/(r+3)}.
\]

A discussion of these two quantities can be found in Wand and Jones [6]. Using r = 2 and replacing $\psi_4$ by the kernel estimator $\hat\psi_4(g)$, the DPI bandwidth selector for the density is obtained:

\[
h_{DPI} = \Big[\frac{R(k)}{n\,\mu_2(k)^{2}\,\hat\psi_4(g)}\Big]^{1/5}.
\]

Now in the distribution function case, an alternative approach to select the bandwidth is to use an estimator of the asymptotically optimal bandwidth $\hat h_{AMISE}$, which was proposed by Altman and Leger, where $\hat V_2 = 2A_1(k)\hat D_2(F)$,

\[
\hat D_2(F) = \frac{1}{n(n-1)\alpha_v}\sum_{i\neq j} k^{*}\Big(\frac{X_i - X_j}{\alpha_v}\Big)w(x_i),
\]

$\alpha_v$ is the bandwidth, and $\hat B_3 = 0.25[A_2(k)]^{2}\hat D_3(F)$, where

\[
\hat D_3(F) = \frac{1}{\alpha_b^{4} n^{3}}\sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} k'\Big(\frac{X_i - X_j}{\alpha_b}\Big)k'\Big(\frac{X_i - X_k}{\alpha_b}\Big)w(x_i).
\]

Finally, the selected bandwidth is $\hat h_{DPI} = \{0.25\hat V_2/(n\hat B_3)\}^{1/3}$. For the exponential case we can find $\hat h_{DPI}$ as follows: let $\Phi(h) = \{0.25\hat V_2/(n\hat B_3)\}^{1/3}$ and set $h = \Phi(h)$, so we find h such that $h - \Phi(h) = 0$, where $\hat V_2 = \frac{9}{35}\hat D_2(F)$ and $\hat B_3 = 0.01\,\hat D_3(F)$.
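Finally, a Python sketch of the plug-in selection for the distribution function described above. Several choices here are ours rather than the paper's: the weight is taken as $w \equiv 1$, the kernel $k^{*}$ in $\hat D_2(F)$ is taken to be the Epanechnikov kernel itself, and the pilot bandwidths $\alpha_v, \alpha_b$ are set equal to h, which turns the selection into the fixed-point problem $h = \Phi(h)$ mentioned in the text.

import numpy as np
from scipy import optimize

def k(t):                                      # Epanechnikov kernel; also used for k* (our assumption)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t ** 2), 0.0)

def k_prime(t):                                # its derivative
    return np.where(np.abs(t) <= 1.0, -1.5 * t, 0.0)

def d2_hat(a, data):                           # (n(n-1)a)^{-1} sum_{i != j} k*((X_i - X_j)/a), with w = 1
    n = len(data)
    km = k((data[:, None] - data[None, :]) / a)
    np.fill_diagonal(km, 0.0)
    return km.sum() / (n * (n - 1) * a)

def d3_hat(a, data):                           # (a^4 n^3)^{-1} sum_i [sum_j k'((X_i - X_j)/a)]^2, with w = 1
    n = len(data)
    S = k_prime((data[:, None] - data[None, :]) / a).sum(axis=1)
    return (S ** 2).sum() / (n ** 3 * a ** 4)

def phi(h, data):
    """Phi(h) = {0.25 * V2_hat / (n * B3_hat)}^(1/3), with the pilot bandwidths set to h."""
    n = len(data)
    V2 = (9.0 / 35.0) * d2_hat(h, data)        # V2_hat = (9/35) * D2_hat(F)
    B3 = 0.01 * d3_hat(h, data)                # B3_hat = 0.01 * D3_hat(F)
    return (0.25 * V2 / (n * B3)) ** (1.0 / 3.0)

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=30)     # lambda = 0.5, as in Table 5

# Solve h = Phi(h): scan for a sign change of h - Phi(h), then refine with a root finder.
hs = np.linspace(0.2, 4.0, 80)
g = np.array([h - phi(h, data) for h in hs])
j = int(np.where(np.sign(g[:-1]) != np.sign(g[1:]))[0][0])   # a wider scan may be needed for other samples
h_dpi = optimize.brentq(lambda h: h - phi(h, data), hs[j], hs[j + 1])
print(h_dpi)                                   # the quantity denoted h_DPI in the text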
Comparison of Bandwidth Selectors and Optimal Bandwidths: To compare the techniques, we simulated from the exponential distribution with mean 2 for different sample sizes. Table 5 shows that the LSCV estimates are less biased than the BCV estimates; also, as the sample size increases, both estimates become more accurate. The DPI estimates are smaller than the AMISE optimal bandwidths; again, as the sample size increases, the DPI estimates are larger than the MISE optimal bandwidths, as shown in Figure 4.

Table 5: Bandwidth Estimates using the Exponential Distribution when λ = 0.5

 n    h_MISE      h_AMISE    h_LSCV    h_BCV    h_DPI
10    0.965671    1.89983    1.43      1.475    1.159
20    0.462784    1.50789    0.4999    0.761    1.135
30    0.292054    1.31727    0.351     0.66     1.046
40    0.211148    1.19682    0.238     0.482    0.874

Also, from Figures 2, 3, and 4, we conclude that the LSCV bandwidth is close to the MISE bandwidth.
Fig. 2. MISE optimal bandwidth (bottom curve) versus LSCV bandwidth estimate (top curve).

Fig. 4. MISE optimal bandwidth (bottom curve) versus DPI bandwidth estimate (top curve).
AUTHORS' PROFILE
Abdel-Razzaq Mugdadi is an Associate Professor of Statistics in the Department of Mathematics and Statistics at Jordan University of Science and Technology. He received his Ph.D. in Statistics from Northern Illinois University in 1999. His research interests are in the areas of density estimation, reliability, and life testing. His publications have appeared in the Journal of Statistical Planning and Inference, the Journal of Nonparametric Statistics, Computational Statistics and Data Analysis, IEEE Transactions on Reliability, and others.
Rawan Bani-Melhem is a part-time lecturer at Jordan University of Science and Technology. Her research interest is in the area of density estimation. The current paper is based on her master's thesis at Jordan University of Science and Technology under the supervision of Dr. Mugdadi.
Fig. 3. MISE optimal bandwidth (bottom curve) versus BCV bandwidth estimate (top curve).
REFERENCES
[1] Marron, J.S. and Wand, M.P. (1992). Exact mean integrated squared error. Ann. Statist., 20(2), 712-736.
[2] Mugdadi, A.R. and Jeter, J. (2010). A simulation study for the bandwidth selection in the kernel density estimation based on the exact and the asymptotic MISE. Pakistan Journal of Statistics, 26(1), 239-265.
[3] Nadaraya, E.A. (1964). Some new estimates for distribution functions. Theory of Probability and its Applications, 9, 497-500.
[4] Swanepoel, W.H. (1988). Mean integrated square error properties and optimal kernel when estimating a distribution function, 17, 3785-3799.
[5] Altman, N. and Leger, C. (1995). Bandwidth selection for kernel distribution function estimation. J. Statist. Planning and Inference, 46, 195-214.
[6] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing, First Edition.
[7] Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 9, 65-78.
[8] Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353-360.
[9] Sarda, P. (1993). Smoothing parameter selection for smooth distribution functions. J. Statist. Plann. Inference, 35, 65-75.
[10] Scott, D.W. and Terrell, G.R. (1987). Biased and unbiased cross-validation in density estimation. J. Amer. Statist. Assoc., 82, 1131-1146.
[11] Hall, P. and Marron, J.S. (1987). Estimation of integrated squared density derivatives. Statist. Probab. Lett., 6, 109-115.
[12] Scott, D.W., Tapia, R.A. and Thompson, J.R. (1977). Kernel density estimation revisited. Nonlinear Anal. Theory Meth. Applic., 1, 339-372.
[13] Sheather, S.J. (1983). A data-based algorithm for choosing the window width when estimating the density at a point. Comp. Statist. Data Anal., 1, 229-238.