CALIFORNIA STATE UNIVERSITY, NORTHRIDGE

KERNEL REGRESSION ESTIMATION FOR INCOMPLETE DATA

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Mathematics

by Timothy Reese

May 2015

The thesis of Timothy Reese is approved:

Dr. Mark Schilling, Date
Dr. Ann E. Watkins, Date
Dr. Majid Mojirsheibani, Chair, Date

California State University, Northridge

Dedication

This thesis is dedicated in memory of my grandmother Patrica, who first inspired me to pursue higher education. I also dedicate this thesis to the friendship and memory of my good friend Mathew Willerford, whose passing in 2014 was difficult. Mathew was caring and always put family and friends before himself. He left the world too early.

Acknowledgments

I would first like to express my sincerest gratitude to Dr. Majid Mojirsheibani. The guidance and support provided by Dr. Mojirsheibani have been crucial to the completion of this thesis, and the patience he has demonstrated throughout this process is greatly appreciated. Dr. Mojirsheibani has shown by example the traits necessary to produce meaningful work in the field of statistics; I hope that during this time I have adopted a portion of these traits. I would also like to thank Dr. Mark Schilling for introducing me to the beautiful world of statistics; without such a great introduction to this field, I may have instead pursued a master's degree in computer science. Furthermore, I would like to thank my committee members Dr. Ann E. Watkins and Dr. Mark Schilling for their time and insightful comments. Additionally, I would like to extend thanks to many of the people in the math department who have provided support in one form or another. Finally, I am always grateful to my family for being supportive throughout my life, so I extend thanks to my parents Ron and Linda, and my sister Katrina.
Table of Contents

Signature page
Dedication
Acknowledgments
Abstract
Introduction

1 Introduction to regression
  1.1 What is regression?
  1.2 Defining the relationship between response and predictors
  1.3 Parametric regression
    1.3.1 Least-squares
    1.3.2 Supremum norm covers and total boundedness
    1.3.3 Empirical Lp covers
  1.4 Nonparametric regression
    1.4.1 Partitioning Regression Estimator
    1.4.2 k-Nearest Neighbors (k-NN)
    1.4.3 Kernel Regression
  1.5 Comparison between parametric least-squares and kernel regression estimators

2 Missing data
  2.1 Missing patterns
  2.2 Selection probability
  2.3 Regression in the presence of missing values

3 Kernel regression estimation with incomplete covariates
  3.1 Main Results
    3.1.1 A kernel-based estimator of the selection probability
    3.1.2 A least-squares based estimator of the selection probability
  3.2 An application to classification with missing covariates
  3.3 Kernel regression simulation with missing covariates

4 Kernel regression estimation with incomplete response
  4.1 Main results
    4.1.1 A kernel-based estimator of the selection probability
    4.1.2 Least-squares-based estimator of the selection probability
  4.2 Applications to classification with partially labeled data

Future work
Bibliography
Appendix A Probability Review
Appendix B Technical Lemmas

ABSTRACT

KERNEL REGRESSION ESTIMATION FOR INCOMPLETE DATA

By Timothy Reese
Master of Science in Mathematics

The problem of estimating the regression function from incomplete data is explored. Chapter 3 concerns the case in which the data are incomplete in the sense that a subset of the covariate vector is missing at random; chapter 4 analyzes the case in which the response variable is missing at random. In each of these chapters, the problem of incomplete data is handled by constructing kernel regression estimators that utilize Horvitz-Thompson-type inverse weights, where the selection probabilities are estimated using both kernel regression and least-squares methods. Exponential upper bounds are derived on the performance of the $L_p$ norms of the proposed estimators, which can then be used to establish various strong convergence results. Chapters 3 and 4 each conclude with an application to nonparametric classification in which kernel classifiers are constructed from incomplete data.

Introduction

Statisticians working in any field are often interested in the relationship between a response variable $Y$ and a vector of covariates $Z=(Z_1,\dots,Z_d)$. This relationship is best described by the regression function $m(z)=E[Y\,|\,Z=z]$. The regression function is estimated using data collected on the variables of interest. In practice the data are often incomplete, having missing values among the covariates or the response variables.
Values may be unavailable for a variety of reasons: corruption may occur in the data collection or storage process, a responding unit may fail to answer a portion of the survey questions, or the measurement of a variable may be too expensive to collect for the entire sample. Regardless of the reason, missing values are problematic: standard regression methods are no longer directly applicable, and if the statistician performs a complete-case analysis, the results may be inefficient at best and, at worst, highly biased. Two common alternatives to the complete-case analysis have been established in the literature. The most commonly employed method is imputation, which attempts to replace each missing value by some estimate. The second approach utilizes a Horvitz-Thompson-type inverse scaling method (Horvitz and Thompson (1952)), which weights the complete cases by the inverse of the selection probabilities. This thesis approaches the problem of estimating the regression function by way of the nonparametric method of kernel regression; the problem of incomplete data is handled by employing the Horvitz-Thompson-type inverse weighting technique. We start with a review of the problem of regression for the case in which the data are complete. Both parametric and nonparametric regression estimators are investigated, and convergence results for each estimator to the regression function are explored. Chapter 2 begins with an analysis of the patterns in which data can become unavailable. Next, the selection probabilities are clearly defined for a multitude of cases in which the data may be missing. The chapter concludes with a discussion of the methods for estimating the regression function in the presence of missing data, i.e., complete-case analysis, imputation, and Horvitz-Thompson-type inverse weighting. In chapters 3 and 4, kernel regression estimators are proposed for handling the problem of missing data.
Chapter 3 explores the case in which a subset of the covariate vector may be missing. Chapter 4 develops results similar to those of chapter 3 for the case of a missing response. In both chapters, exponential upper bounds are established on the performance of the $L_p$ norms of each estimator; these bounds are used to establish almost sure convergence results. Each chapter concludes with an application to the problem of nonparametric classification in the presence of incomplete data.

1 Introduction to regression

1.1 What is regression?

Regression is the study of the relationship between a dependent variable $Y$, called the response variable, and a set of independent variables $Z=(Z_1,\dots,Z_d)$, called the predictors or covariates. The response variable is assumed to have some functional relationship to the predictor variables, i.e., if we observe $Z=z$ then $Y=f(z)+\epsilon$, where $\epsilon$ is a random error term. This relationship is studied in part by the regression function $m(z)=E[Y\,|\,Z=z]$. The importance of the regression function lies in the fact that $m(\cdot)$ is "close" to $Y$. Closeness here is measured by the $L_2$ risk, more commonly referred to as the mean squared error,
\[
E|Y-f(Z)|^2. \qquad (1.1)
\]
It is easily shown that $m(\cdot)$ minimizes (1.1), i.e.,
\[
m(z) = \operatorname*{argmin}_{f:\mathbb{R}^d\to\mathbb{R}} E|Y-f(Z)|^2.
\]
Let $f$ be any measurable function from $\mathbb{R}^d$ into $\mathbb{R}$; then
\begin{align*}
E|Y-f(Z)|^2 &= E|Y-m(Z)+m(Z)-f(Z)|^2 \\
&= E|Y-m(Z)|^2 + E|m(Z)-f(Z)|^2 + 2E\big[(Y-m(Z))(m(Z)-f(Z))\big] \\
&= E|Y-m(Z)|^2 + E|m(Z)-f(Z)|^2. \qquad (1.2)
\end{align*}
The last line follows by observing that, almost surely,
\begin{align*}
E\big[(Y-m(Z))(m(Z)-f(Z))\big] &= E\Big[E\big[(Y-m(Z))(m(Z)-f(Z))\,\big|\,Z\big]\Big] \\
&= E\big[(m(Z)-f(Z))\,E[(Y-m(Z))\,|\,Z]\big] \\
&= E\big[(m(Z)-f(Z))(m(Z)-m(Z))\big] \\
&= 0.
\end{align*}
Therefore, since the second term of (1.2) is nonnegative, (1.2) is minimized when $f$ is chosen to be the regression function. The regression function is important to many different fields of study.
The following are a few examples of areas where regression techniques can be applied.

1. In gene expression profiling, information from thousands of genes can be used to predict recurrence rates for breast cancer.

2. Regression techniques can be used to predict an inmate's propensity for violence. One such method is the Risk Assessment Scale for Prison (RASP-Potosi), which attempts to measure an inmate's propensity for violence from a vector of covariates including age, length of sentence, education, prior prison terms, prior probated sentences, conviction for a property offense, and years served (Davis 2013).

3. The success of a business often relies on advertising strategies, and regression techniques may be applied to acquire the information necessary to choose the right strategy. For example, one could model the sales level in relation to the amount of money spent on advertising in different media, such as TV, newspapers, billboards, and the internet.

In each of these examples the researcher observes some data and then utilizes the data to construct an estimator of the regression function. The regression estimate can then be used to make inferences.

1.2 Defining the relationship between response and predictors

Let $(Z,Y)$ be an $\mathbb{R}^d\times\mathbb{R}$-valued random vector with cumulative distribution function (CDF) $F(z,y)=P(Z\le z,\,Y\le y)$. The random vector $Z=(Z_1,\dots,Z_d)$ is the covariate vector and $Y$ is the response variable of interest. One typically wishes to examine the relationship between these variables by way of the regression function
\[
m(z) = E[Y\,|\,Z=z]. \qquad (1.3)
\]
For example, the covariate vector $Z$ could consist of the predictor variables monthly income, amount of debt, previous political contributions, and asset value, and the response variable of interest $Y$ could be the political contribution.
The relationship of the potential contribution to the predictors could then be modeled as some function of the covariates with an additional random error term, i.e., $Y=f(Z)+\epsilon$. The mean response of this relationship is portrayed by the regression function and could be used to determine the amount of political pandering a politician should expend on a particular person or group. In practice, however, the regression function is unavailable, since the distribution of $(Z,Y)$ is typically unknown. Instead one attempts to estimate the regression function by utilizing some observed data $D_n=\{(Z_1,Y_1),\dots,(Z_n,Y_n)\}$, in which the $(Z_i,Y_i)$'s are independent and identically distributed (i.i.d.) random vectors with common distribution $F$.

To determine the utility of an estimator, it is important to show that the estimator converges to the true regression function. The usefulness of an estimator $m_n(\cdot)$ can be measured by the $L_p$ statistic, a commonly used measure of global accuracy defined as
\[
I_n(p) = \int |m_n(z)-m(z)|^p\,\mu(dz), \qquad 1\le p<\infty, \qquad (1.4)
\]
where $\mu$ is the probability measure of $Z$. $I_n(2)$, the mean squared error, is the most commonly used measure of accuracy for regression estimators. The primary concern is to show that the $L_p$ statistic converges to 0 in probability or almost surely (see Appendix A) as $n\to\infty$, for some fixed value of $p$; this convergence demonstrates the consistency of $m_n(\cdot)$ as an estimator of $m(\cdot)$.

There are two main approaches to estimation of the regression function. The first is a parametric-based estimator, which assumes some partial knowledge of the underlying functional relationship between the response and the vector of covariates; once the model assumptions have been made, one proceeds to fit the data under some error-minimization criterion. The second methodology utilizes local averaging techniques.
These techniques function without any reliance on restrictive model assumptions. In the following subsections we briefly discuss both approaches, paying particular attention to the method of least-squares and the method of kernel regression.

1.3 Parametric regression

When there is prior knowledge about the functional relationship between the covariates and the response variable, it is important to use this information in the estimation of the regression function. In parametric regression one makes assumptions about the model $Y=f(Z)+\epsilon$; specifically, the function $f$ is assumed to belong to some class of functions $\mathcal{F}$. The observed data are then used to select the best function among the class $\mathcal{F}$, chosen so that the residual error is small for the given data under some chosen minimization criterion. The popular least-squares criterion may be used to minimize the residual error over the given data.

Figure 1.1: (a) Best-fit line is a good fit. (b) Poor simple linear least-squares fit.

1.3.1 Least-squares

The method of least-squares selects the function from the assumed class of functions that minimizes the sum of the squared errors, and is defined as follows:
\[
\widehat{m}_{LS} = \operatorname*{argmin}_{f\in\mathcal{F}} \sum_{i=1}^n (Y_i-f(Z_i))^2. \qquad (1.5)
\]
The following is an example of the simplest case of least-squares regression, in which the functional relationship between the response variable $Y$ and a single predictor $Z$ is assumed to be linear.

Example 1 (Simple Linear Model) Let $Z$ be a single predictor random variable and let $Y$ be the corresponding response. Assume the model $Y=\beta_0+\beta_1 Z+\epsilon$, i.e., assume that the class of functions $\mathcal{F}$ in (1.5) is the class of linear functions of one predictor.
Let the sample data be $D_n=\{(Z_1,Y_1),\dots,(Z_n,Y_n)\}$. The least-squares method
\[
\operatorname*{argmin}_{\beta_0,\beta_1} \sum_{i=1}^n (Y_i-\beta_0-\beta_1 Z_i)^2
\]
yields, by simple calculus minimization techniques,
\[
\widehat{\beta}_0 = \bar{Y} - \widehat{\beta}_1 \bar{Z}, \qquad
\widehat{\beta}_1 = \frac{\frac{1}{n}\sum_{i=1}^n Z_iY_i - \bar{Z}\,\bar{Y}}{\frac{1}{n}\sum_{i=1}^n Z_i^2 - \bar{Z}^2},
\]
where $\bar{Z}=\frac{1}{n}\sum_{i=1}^n Z_i$ and $\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i$ denote the averages of the $Z_i$ and $Y_i$. These results yield the best-fit line
\[
\widehat{y} = \widehat{\beta}_0 + \widehat{\beta}_1 z.
\]
Figure 1.1 displays two cases: case (a), in which the best-fit line is a good fit, and case (b), in which it is a poor fit. Based on these figures, we can see that model assumptions are crucial to a proper parametric fit.

The most common way of analyzing the performance of the least-squares estimator (1.5) as an estimator of the true regression function is by way of the $L_2$ risk; as such, consider the following result, which establishes an upper bound on the $L_2$ risk of the least-squares estimator.

Lemma 1.1 (Györfi et al. (2002)) Let $\mathcal{F}_n=\mathcal{F}_n(D_n)$ be a class of functions $f:\mathbb{R}^d\to\mathbb{R}$ depending on the data $D_n=\{(Z_1,Y_1),\dots,(Z_n,Y_n)\}$. If $\widehat{m}_{LS}$ is as in (1.5), then
\[
\int |\widehat{m}_{LS}(z)-m(z)|^2\,\mu(dz)
\le 2\sup_{f\in\mathcal{F}_n}\left|\frac{1}{n}\sum_{i=1}^n |Y_i-f(Z_i)|^2 - E|Y-f(Z)|^2\right|
+ \inf_{f\in\mathcal{F}_n}\int |m(z)-f(z)|^2\,\mu(dz).
\]

Remark. For the proof, and for an explanation of why the second term converges to 0, see Lemma 10.1 of Györfi et al. (2002, p. 161). Therefore, to establish the convergence of the $L_2$ risk in Lemma 1.1 to 0, it remains to show that the empirical $L_2$ risk $\frac{1}{n}\sum_{i=1}^n (Y_i-f(Z_i))^2$ converges uniformly over $\mathcal{F}$ to the actual $L_2$ risk $E|Y-f(Z)|^2$, i.e., we need to establish
\[
\sup_{f\in\mathcal{F}}\left|\frac{1}{n}\sum_{i=1}^n (Y_i-f(Z_i))^2 - E|Y-f(Z)|^2\right| \xrightarrow{a.s.} 0. \qquad (1.6)
\]

Figure 1.2: An adaptation of Figure 9.1 of Györfi et al. (2002). The dashed lines represent a $\pm\epsilon$ translation of $\tilde{f}$ (in red). The figure portrays the supremum-norm distance between the function $f$ and one of the class members $\tilde{f}$ being contained within $\epsilon$.
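The closed-form estimates of Example 1 can be sketched numerically. The following is a minimal illustration (the simulated model, its coefficients, and all variable names are ours, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate from an illustrative linear model Y = 2 + 3 Z + eps.
n = 200
Z = rng.normal(size=n)
Y = 2.0 + 3.0 * Z + rng.normal(size=n)

# Closed-form simple least-squares estimates, matching Example 1:
#   beta1_hat = (mean(Z*Y) - mean(Z) mean(Y)) / (mean(Z^2) - mean(Z)^2)
#   beta0_hat = mean(Y) - beta1_hat * mean(Z)
beta1_hat = (np.mean(Z * Y) - Z.mean() * Y.mean()) / (np.mean(Z**2) - Z.mean()**2)
beta0_hat = Y.mean() - beta1_hat * Z.mean()

print(beta0_hat, beta1_hat)  # should be close to (2, 3)
```

The estimates recover the true coefficients up to sampling noise of order $1/\sqrt{n}$.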
To achieve the uniform convergence in (1.6), we must formally define the idea of totally bounded classes of functions.

1.3.2 Supremum norm covers and total boundedness

As stated before, parametric regression methods assume partial knowledge of the functional relationship of the response variable $Y$ to the vector of covariates $Z$; in other words, the regression function is assumed to belong to some class of functions $\mathcal{F}$. To establish the convergence in (1.6), we assume that our class of functions $\mathcal{F}$ is totally bounded with respect to some norm. First, consider the case in which the class is totally bounded with respect to the supremum norm, as defined below.

Definition 1.1 (Totally Bounded with Respect to the Supremum Norm) We say that a class $\mathcal{F}$ of functions $f:\mathbb{R}^d\to\mathbb{R}$ is totally bounded with respect to the supremum norm if for every $\epsilon>0$ there is a finite subclass $\mathcal{F}_\epsilon=\{f_1,\dots,f_{N(\epsilon)}\}$ such that for every $f\in\mathcal{F}$ there exists an $\tilde{f}\in\mathcal{F}_\epsilon$ satisfying
\[
\|f-\tilde{f}\|_\infty := \sup_{z\in\mathbb{R}^d} |f(z)-\tilde{f}(z)| < \epsilon.
\]
We say that the class $\mathcal{F}_\epsilon$ is an $\epsilon$-cover of $\mathcal{F}$. The cardinality of the smallest $\epsilon$-cover is denoted by $N_\infty(\epsilon,\mathcal{F})$.

To establish the convergence in (1.6), let $|Y|\le M<\infty$ and suppose $f$ belongs to some known class of functions $\mathcal{F}$. Put $\mathcal{G}$ to be the class of functions
\[
\mathcal{G} = \big\{g_f(z,y)=(y-f(z))^2 \;\big|\; f\in\mathcal{F}\big\}, \qquad g_f:\mathbb{R}^d\times\mathbb{R}\to[0,B], \quad B=4M^2.
\]
Then we have the following theorem.

Theorem 1.1 (see Györfi et al. (2002, Lemma 9.1)) Let $\mathcal{G}$ be a class of functions taking values in $[0,B]$ that is totally bounded with respect to the supremum norm. Then for every $\epsilon>0$ and $n$ large enough,
\[
P\left\{\sup_{g\in\mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^n g(Z_i,Y_i)-E[g(Z,Y)]\right|>\epsilon\right\}
\le 2N_\infty(\epsilon/3,\mathcal{G})\,e^{-2n\epsilon^2/(9B^2)}. \qquad (1.7)
\]
Remark. Let $g,\tilde{g}\in\mathcal{G}$ and observe that
\begin{align*}
\|g-\tilde{g}\|_\infty &= \sup_{z,y} \big|(y-f(z))^2-(y-\tilde{f}(z))^2\big| \\
&\le \sup_{z,y} \big|(y-f(z))-(y-\tilde{f}(z))\big|\cdot\big|(y-f(z))+(y-\tilde{f}(z))\big| \\
&\le 4M \sup_{z\in\mathbb{R}^d} \big|f(z)-\tilde{f}(z)\big|.
\end{align*}
It then follows that if $\mathcal{F}_{\epsilon/4M}=\{f_1,\dots,f_N\}$ is a minimal $\epsilon/4M$-cover of $\mathcal{F}$ with respect to the supremum norm, the class of functions $\mathcal{G}_\epsilon=\{g_{f_1},\dots,g_{f_N}\}$ is an $\epsilon$-cover of $\mathcal{G}$. Therefore the covering numbers of $\mathcal{F}$ and $\mathcal{G}$ satisfy $N_\infty(\epsilon,\mathcal{G})\le N_\infty(\epsilon/4M,\mathcal{F})$, and hence (1.7) is also bounded by $2N_\infty(\epsilon/12M,\mathcal{F})\,e^{-2n\epsilon^2/(9B^2)}$. As a result, applying the Borel-Cantelli lemma yields the convergence in (1.6).

Many important classes of functions are in fact totally bounded with respect to the supremum norm; the following are some examples.

Example 2 Consider the exponential class of functions
\[
\mathcal{F} = \big\{f_\theta(x)=e^{-\theta|x|},\ \theta\in[0,1],\ |x|<1\big\}.
\]
This class is totally bounded: for every $\epsilon>0$ we have $N_\infty(\epsilon,\mathcal{F})\le 2+\lfloor 1/(2\epsilon)\rfloor$; we note that this is polynomial in $1/\epsilon$. To see this, observe that
\[
\sup_{|x|\le 1} \big|e^{-\theta|x|}-e^{-\tilde{\theta}|x|}\big| \le |\theta-\tilde{\theta}|,
\]
which follows since $e^{-\theta|x|}$ is continuously differentiable and $|x|\le 1$. This means the subclass of functions needed to cover $\mathcal{F}$ depends only on the parameter $\theta$. If we let $\theta_i=2\epsilon i$, then taking the set $\Theta_\epsilon=\{\theta_0,\dots,\theta_i,\dots,\theta_{\lfloor 1/(2\epsilon)\rfloor},1\}$, we have that for any $\theta$ there is a $\tilde{\theta}\in\Theta_\epsilon$ such that $|\theta-\tilde{\theta}|\le\epsilon$. Thus $\mathcal{F}$ is totally bounded with respect to the supremum norm. Similar bounds hold for the general case $f(x)=\exp\{-(\sum_{i=1}^d \theta_i x_i^2)^{1/2}\}$, provided that $\theta_i\in[0,1]$ and $|x_i|<1$, $i=1,\dots,d$ (see Devroye et al. (1996, Ch. 28)).

Example 3 (Differentiable functions) Let $k_1,\dots,k_s\ge 0$ be non-negative integers and put $k=k_1+\dots+k_s$. Also, for any $\psi:\mathbb{R}^s\to\mathbb{R}$, define $D^{(k)}\psi(u)=\partial^k\psi(u)/\partial u_1^{k_1}\cdots\partial u_s^{k_s}$. Consider the class of functions with bounded partial derivatives of order $r$:
\[
\Psi = \left\{\psi:[0,1]^d\to\mathbb{R} \;\middle|\; \sup_u \sum_{k\le r} \big|D^{(k)}\psi(u)\big| \le C < \infty\right\}.
\]
Then, for every $\epsilon>0$, $\log N_\infty(\epsilon,\Psi)\le M\epsilon^{-\alpha}$, where $\alpha=d/r$ and $M\equiv M(d,r)$; this is due to Kolmogorov and Tikhomirov (1959).

Example 4 (Lipschitz bounded convex functions) Let $\mathcal{F}$ be the class of all convex functions $f:C\to[0,1]$, where $C\subset\mathbb{R}^d$ is compact and convex. If $f$ satisfies $|f(z_1)-f(z_2)|\le L\|z_1-z_2\|$ for all $z_1,z_2\in C$, where $L>0$ is finite, then $\log N_\infty(\epsilon,\mathcal{F})\le M\epsilon^{-d/2}$ for all $\epsilon>0$, where the constant $M$ depends on $d$ and $L$ only; see van der Vaart and Wellner (1996).

1.3.3 Empirical Lp covers

Restricting a class of functions to be bounded with respect to the supremum norm may be too restrictive, since the covers required can be exceptionally large; if the supremum norm covers are too large, then we may not obtain convergence in (1.7). An alternative is to use the smaller $L_p$ norm covers.

Definition 1.2 (Totally Bounded with Respect to the Lp Norm) We say a class $\mathcal{F}$ of functions $f:\mathbb{R}^d\to\mathbb{R}$ is totally bounded with respect to the $L_p$ norm if for every $\epsilon>0$ there is a subclass $\mathcal{F}_\epsilon=\{f_1,\dots,f_{N(\epsilon)}\}$ such that for every $f\in\mathcal{F}$ there exists an $\tilde{f}\in\mathcal{F}_\epsilon$ such that
\[
\|f-\tilde{f}\|_p = \left(\int \big|f(z)-\tilde{f}(z)\big|^p\,\mu(dz)\right)^{1/p} < \epsilon. \qquad (1.8)
\]
The cardinality of the smallest $\epsilon$-cover is denoted by $N_p(\epsilon,\mathcal{F})$.

However, these $L_p$ covers depend on the unknown distribution of the random variable $Z$. An alternative is to construct data-dependent covers, obtained by replacing the probability measure $\mu$ in (1.8) with the empirical measure; they are defined as follows.

Definition 1.3 (Totally Bounded with Respect to the Empirical Lp Norm) We say a class $\mathcal{F}$ of functions $f:\mathbb{R}^d\to\mathbb{R}$ is totally bounded with respect to the empirical $L_p$ norm if for every $\epsilon>0$ there is a subclass $\mathcal{F}_\epsilon=\{f_1,\dots,f_{N(\epsilon)}\}$ such that for every $f\in\mathcal{F}$ there exists an $\tilde{f}\in\mathcal{F}_\epsilon$ satisfying, for the fixed points $\{(z_1,y_1),\dots,(z_n,y_n)\}$,
\[
\left(\frac{1}{n}\sum_{i=1}^n \big|f(z_i)-\tilde{f}(z_i)\big|^p\right)^{1/p} < \epsilon. \qquad (1.9)
\]
The cardinality of the smallest $\epsilon$-cover for the empirical $L_p$ norm is denoted by $N_p(\epsilon,\mathcal{F},D_n)$.
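As a numerical sanity check (our own illustration, not from the thesis), the grid cover of Example 2 can be verified directly, and the same cover is automatically an empirical $L_1$ cover in the sense of Definition 1.3, since an average over sample points is bounded by the supremum:

```python
import numpy as np

eps = 0.05
# Cover of the exponential class of Example 2: theta_i = 2*eps*i, plus the endpoint 1.
cover = np.append(np.arange(0.0, 1.0, 2 * eps), 1.0)
assert len(cover) <= 2 + int(1 / (2 * eps))  # the bound N(eps) <= 2 + floor(1/(2 eps))

x = np.linspace(-1.0, 1.0, 2001)   # grid approximating the sup over |x| <= 1
rng = np.random.default_rng(1)
z = rng.uniform(-1, 1, 100)        # sample points z_1, ..., z_n for the empirical norm

def f(theta, pts):
    return np.exp(-theta * np.abs(pts))

for theta in np.linspace(0.0, 1.0, 200):
    sup_d = min(np.max(np.abs(f(theta, x) - f(t, x))) for t in cover)
    emp_d = min(np.mean(np.abs(f(theta, z) - f(t, z))) for t in cover)
    assert sup_d < eps   # sup-norm cover property (Definition 1.1)
    assert emp_d < eps   # hence also an empirical L1 cover (Definition 1.3)
```

Every member of the class lies within $\epsilon$ of some cover element, in both norms.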
Observe that the covering number $N_p(\epsilon,\mathcal{F},D_n)$ relies on the data $D_n$; hence the covering number is a random variable, and as a result we will be concerned with its expected value. There exist classes of functions that are not totally bounded with respect to the supremum norm but are totally bounded with respect to the empirical $L_p$ norm. To appreciate this, consider the following example.

Example 5 (Non-Lipschitz bounded convex functions (Guntuboyina and Sen 2013)) Let $\mathcal{F}$ be the class of all convex functions $f:C\to[0,1]$, where $C\subset\mathbb{R}^d$ is compact and convex. Consider the functions
\[
f_j(t) := \max(0,\,1-2^j t), \qquad t\in[0,1],\ j>1.
\]
The functions $f_j$ belong to $\mathcal{F}$ but are not Lipschitz, and they are not totally bounded with respect to the supremum norm:
\[
\|f_j-f_k\|_\infty \ge |f_j(2^{-k})-f_k(2^{-k})| \ge 1-2^{j-k} \ge 1/2 \qquad \text{for all } j<k.
\]
However, it has been shown (see Guntuboyina and Sen 2013) that the class of all such convex functions is totally bounded with respect to the $L_p$ norm, even if the functions are not Lipschitz: they show that $\log N_p(\epsilon,\mathcal{F})\le M\epsilon^{-d/2}$, where $M$ is a positive constant independent of $\epsilon$.

To derive the convergence result (1.6), first suppose that $|Y|\le M<\infty$ and let $f$ belong to a given (known) class of functions $\mathcal{F}$. Consider the class of functions
\[
\mathcal{G} = \big\{g_f(z,y)=(y-f(z))^2 \;\big|\; f\in\mathcal{F}\big\}, \qquad g_f:\mathbb{R}^d\times\mathbb{R}\to[0,B], \quad B=4M^2.
\]
We have the following result from empirical process theory.

Theorem 1.2 (see Pollard (1984)) Let $\mathcal{G}$ be a class of functions taking values in $[0,B]$. Then for every $\epsilon>0$ and $n$ large enough,
\[
P\left\{\sup_{g\in\mathcal{G}}\left|\frac{1}{n}\sum_{i=1}^n g(Z_i,Y_i)-E[g(Z,Y)]\right|>\epsilon\right\}
\le 8\,E\big[N_1(\epsilon/8,\mathcal{G},D_n)\big]\,e^{-n\epsilon^2/(128B^2)}. \qquad (1.10)
\]
Remark. To achieve the convergence in (1.6), first let $g,g^\dagger\in\mathcal{G}$ and observe
\begin{align*}
\frac{1}{n}\sum_{i=1}^n \big|g(Z_i,Y_i)-g^\dagger(Z_i,Y_i)\big|
&= \frac{1}{n}\sum_{i=1}^n \big|(Y_i-f(Z_i))^2-(Y_i-f^\dagger(Z_i))^2\big| \\
&\le \frac{1}{n}\sum_{i=1}^n \big|(Y_i-f(Z_i))-(Y_i-f^\dagger(Z_i))\big|\cdot\big|(Y_i-f(Z_i))+(Y_i-f^\dagger(Z_i))\big| \\
&\le 4M\,\frac{1}{n}\sum_{i=1}^n \big|f(Z_i)-f^\dagger(Z_i)\big|.
\end{align*}
Therefore, if $\mathcal{F}_{\epsilon/4M}=\{f_1,\dots,f_N\}$ is a minimal $\epsilon/4M$-cover of $\mathcal{F}$ with respect to the empirical $L_1$ norm, then the class of functions $\mathcal{G}_\epsilon=\{g_{f_1},\dots,g_{f_N}\}$ is an $\epsilon$-cover of $\mathcal{G}$, and (1.10) can be bounded in terms of the covering number of $\mathcal{F}$, i.e.,
\[
8E\big[N_1(\epsilon/8,\mathcal{G},D_n)\big]e^{-n\epsilon^2/(128B^2)} \le 8E\big[N_1(\epsilon/32M,\mathcal{F},D_n)\big]e^{-n\epsilon^2/(128B^2)}.
\]
Thus, Theorem 1.2 combined with an application of the Borel-Cantelli lemma yields the almost sure convergence in (1.6), provided that
\[
\log\big(E[N_1(\epsilon/32M,\mathcal{F},D_n)]\big)/n \to 0 \qquad \text{as } n\to\infty.
\]

1.4 Nonparametric regression

When there is little or no knowledge about the underlying relationship between the covariates and the response variable, parametric methods are problematic, since the model assumptions need to be accurate to produce reasonable estimates of the regression function. An alternative to the parametric methodology is the nonparametric approach, which performs without any assumptions on the functional relationship. Nonparametric methods use a type of local averaging of the response variables $Y_i$: the $Y_i$'s are weighted according to how "close" each corresponding $Z_i$ is to the point $z$ of interest, and these weighted $Y_i$'s are summed to produce the predicted response. The corresponding nonparametric regression estimator is
\[
m_n(z) = \sum_{i=1}^n W_{n,i}(z)\,Y_i, \qquad (1.11)
\]
where each weight $W_{n,i}(z)$ is assigned a real value determined by the "closeness" of $z$ to $Z_i$. The way in which these weights are chosen determines the type of nonparametric regression estimator to be used. In this section we give an overview of three standard nonparametric regression estimators.
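The generic local-averaging form (1.11) can be sketched as a higher-order function taking the weight rule as an argument. This is our own illustration; the particular Gaussian proximity weights below are a hypothetical choice, not one of the thesis's three estimators:

```python
import numpy as np

def local_average(z, Z, Y, weight_fn):
    """Generic local-averaging estimator (1.11): m_n(z) = sum_i W_{n,i}(z) Y_i."""
    W = weight_fn(z, Z)          # weights W_{n,1}(z), ..., W_{n,n}(z)
    return float(np.dot(W, Y))

def gaussian_weights(z, Z, h=0.3):
    # Hypothetical example weights: normalized Gaussian proximity, summing to 1.
    w = np.exp(-0.5 * ((Z - z) / h) ** 2)
    return w / w.sum()

rng = np.random.default_rng(4)
Z = rng.uniform(-1, 1, 500)
Y = Z**2 + 0.05 * rng.normal(size=500)   # true regression function m(z) = z^2

print(local_average(0.0, Z, Y, gaussian_weights))  # small, near m(0) = 0
```

Each of the three estimators that follow is a specific choice of `weight_fn`.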
1.4.1 Partitioning Regression Estimator

The first nonparametric estimator to be discussed is the partitioning regression estimator, or regressogram, as Tukey referred to it in 1961. The partitioning regression estimator starts by partitioning the covariate space $\mathbb{R}^d$ into a collection of nonoverlapping cells $\{A_{n,1},A_{n,2},\dots\}$. The regression estimate at $z$ is then determined by averaging the $Y_i$'s corresponding to the $Z_i$'s in the cell in which $z$ resides. For example, in Figure 1.3 the responses corresponding to the points in the circle would be averaged, since they are the only ones in the cell containing the new observation $z$.

Figure 1.3: An arbitrary partition of $\mathbb{R}^2$.

Formally, let $A_n[z]$ denote the cell containing $z$; then the regression estimate is
\[
m_n(z) = \frac{\sum_{i=1}^n Y_i\,I\{Z_i\in A_n[z]\}}{\sum_{i=1}^n I\{Z_i\in A_n[z]\}}, \qquad (1.12)
\]
where $I\{\cdot\}$ denotes the indicator function, which is one if $Z_i$ belongs to the cell containing $z$ and zero otherwise. In this case the weight function in (1.11) is chosen to be $W_{n,i}(z)=I\{Z_i\in A_n[z]\}\big/\sum_{j=1}^n I\{Z_j\in A_n[z]\}$.

A computationally efficient way to choose the partition is to use cubic or rectangular partitions, which divide the space into cells of equal dimensions. However, this method can result in some cells containing a large fraction of the data while other cells contain only a couple of points. An alternative to equally spaced cells is statistically equivalent blocks, constructed to satisfy the criterion that each cell contains an equal number of points.

The almost sure convergence of (1.12) to the regression function is shown below; for a proof, refer to Györfi et al. (2002).

Theorem 1.3 Let $m_n(z)$ be defined as in (1.12). If for each sphere $S$ centered at the origin
\[
\lim_{n\to\infty}\ \max_{j:\,A_{n,j}\cap S\ne\emptyset} D(A_{n,j}) = 0,
\]
where $D$ denotes the diameter of a cell, and
\[
\lim_{n\to\infty} \frac{|\{j:\,A_{n,j}\cap S\ne\emptyset\}|}{n} = 0,
\]
then
\[
E\big[\,|m_n(Z)-m(Z)|\;\big|\;D_n\big] \to 0 \quad \text{almost surely, as } n\to\infty,
\]
provided that $|Y|\le M<\infty$.

1.4.2 k-Nearest Neighbors (k-NN)

An alternative to the partitioning method, or regressogram, is the k-nearest neighbors (k-NN) estimator. Instead of partitioning the space of the covariates, the k-NN estimator sorts the data by how "far" each $Z_i$ is from $z$ and then averages the $Y_i$'s of the $k$ closest. Let the data used for estimating the regression function be $D_n=\{(Z_1,Y_1),\dots,(Z_n,Y_n)\}$, an i.i.d. sample from the distribution of $(Z,Y)$, and denote by
\[
D_n' = \{(Z_{(1)},Y_{(1)}),\dots,(Z_{(k)},Y_{(k)}),\dots,(Z_{(n)},Y_{(n)})\}
\]
the data set reordered by how far each $Z_i$ is from $z$; here distance is measured using some distance function (for example, Euclidean distance). The corresponding k-NN estimator is defined as
\[
m_n(z) = \sum_{i=1}^k \frac{1}{k}\,Y_{(i)}. \qquad (1.13)
\]
The expression in (1.13) is simply the average of the $Y_i$'s corresponding to the $k$ closest $Z_i$'s to $z$; in this case the weight function in (1.11) is chosen to be
\[
W_{n,i}(z) = \begin{cases} \dfrac{1}{k} & \text{if } Z_i \text{ is among the } k \text{ closest to } z, \\[4pt] 0 & \text{otherwise.} \end{cases}
\]
The observant reader may recognize that issues can arise when there are $Z_i$'s of equal distance from $z$. If one is willing to assume that such events happen with probability 0 (true for the case when $Z$ has a density), then it can be shown that $m_n$ in (1.13) converges to the regression function almost surely in the $L_1$ sense, provided that $Y$ is bounded (Devroye and Györfi (1985)). If the distribution of $Z$ is not absolutely continuous, then tie-breaking methods must be employed to derive the almost sure convergence of the estimator. A variety of tie-breaking methods have been proposed. Stone (1977) proposed a pseudo-k-NN method which, in the event of a tie, averages the responses of the tied values; Stone showed the convergence in probability of this pseudo-k-NN estimator.
An alternative to Stone's method is randomization, in which a new uniform random variable $U$, independent of $Z$, is introduced, producing the random vector $(Z,U)$. We then artificially grow the data, producing
\[
D_n' = \{(Z_1,U_1,Y_1),\dots,(Z_n,U_n,Y_n)\},
\]
where the $U_i$'s are i.i.d. Uniform(0,1) random variables. Next, the data are reordered by the distance of the new observation $(z,u)$ from $(Z_i,U_i)$, $i=1,\dots,n$: if no tie exists for the distance between $Z_i$ and $z$, reorder as usual; otherwise, if a tie exists between, say, $Z_i$ and $Z_j$, compare the distances from $U_i$ to $u$ and from $U_j$ to $u$ to break the tie. This method results in the reordered data set
\[
D_n'' = \{(Z_{(1)},Y_{(1)}),\dots,(Z_{(n)},Y_{(n)})\}.
\]
The estimator then averages the first $k$ responses in the reordered data set to produce the estimate. The almost sure convergence of the k-NN estimator under the randomization tie-breaking rule is provided by the following result.

Theorem 1.4 (Devroye et al. (1994)) Let $m_n(z)$ be the k-NN regression estimator in (1.13), and suppose $|Y|\le M<\infty$; if the distribution of $(Z,Y)$ is not absolutely continuous, apply the randomization rule. Then for every $\epsilon>0$ and $n$ large enough,
\[
P\left\{\int |m_n(z)-m(z)|\,\mu(dz) > \epsilon\right\} \le e^{-an},
\]
where $a$ is a positive constant that does not depend on $n$.

Remark. Furthermore, we obtain $\int |m_n(z)-m(z)|\,\mu(dz)\xrightarrow{a.s.}0$ by the Borel-Cantelli lemma, provided that $k\to\infty$ and $k/n\to0$ as $n\to\infty$; therefore $m_n(z)$ converges almost surely to $m(z)$ in the $L_1$ sense. Further results hold; in particular, the above theorem can be extended to the general $L_p$ statistics using the fact that $Y$ is bounded (Devroye et al. (1994)).
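The estimator (1.13) can be sketched in a few lines for the density case, where ties occur with probability zero (a minimal illustration of our own; the simulated model and all names are assumptions):

```python
import numpy as np

def knn_regress(z, Z, Y, k):
    """k-NN regression estimate (1.13): average the Y's of the k
    covariates Z_i closest to the query point z.
    Assumes ties occur with probability zero (Z has a density)."""
    dist = np.abs(Z - z) if Z.ndim == 1 else np.linalg.norm(Z - z, axis=1)
    nearest = np.argsort(dist)[:k]   # indices of the k nearest neighbors
    return Y[nearest].mean()

rng = np.random.default_rng(2)
n = 1000
Z = rng.uniform(-2, 2, size=n)
Y = np.sin(Z) + 0.1 * rng.normal(size=n)   # true regression function m(z) = sin(z)

est = knn_regress(0.5, Z, Y, k=25)
print(est, np.sin(0.5))  # estimate should be close to sin(0.5), about 0.479
```

Consistent with the remark above, the fit improves as $k\to\infty$ with $k/n\to0$.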
1.4.3 Kernel Regression

Another local averaging method is the Nadaraya-Watson kernel regression estimator (Nadaraya (1964), Watson (1964)), which assigns weights that decrease the "farther" Z_i is from z.

Figure 1.4: Three commonly used single-variable kernels: (a) the uniform kernel K(t) = \frac{1}{2} I{|t| ≤ 1}; (b) the triangular kernel K(t) = (1 − |t|) I{|t| ≤ 1}; (c) the Gaussian kernel K(t) = \frac{1}{\sqrt{2π}} e^{−t²/2}.

The estimator is defined as follows:

m_n(z) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{z − Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z − Z_i}{h_n}\right)} .   (1.14)

The kernel function K : R^d → R is a smooth function of the distance between z and Z_i. The choice of kernel is at the discretion of the statistician, but it is typically chosen to be a probability density function. Three common choices of kernels for the univariate case are displayed in Figure 1.4. The term h_n > 0 in (1.14), called the smoothing parameter of the kernel and typically a small positive number, controls the smoothness of the kernel regression estimate. The smoothing parameter depends on the sample size n: as n gets larger, h_n gets smaller, but at a rate that does not decrease too fast. One typically estimates the smoothing parameter from the data at hand.

Various modes of convergence of I_n(p), defined in (1.4), to zero have been established in the literature; see, for example, Devroye and Wagner (1980) and Spiegelman and Sacks (1980), whose results yield I_n(1) → 0 in probability as n → ∞. The quantity I_n(1) also plays an important role in statistical classification; see, for example, Devroye et al. (1996, Sec. 6.2), Devroye and Wagner (1980), and Krzyżak and Pawlak (1984). For the almost sure convergence of I_n(1) to 0 see, for example, Devroye (1981) and Devroye and Krzyżak (1989). In this last cited work, Devroye and Krzyżak develop a number of equivalent results under the assumption that |Y| ≤ M < ∞ and the kernel is regular.
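A minimal sketch of the Nadaraya-Watson estimate (1.14) with the Gaussian kernel of Figure 1.4(c); the data-generating model sin(z) is our own choice, used only for illustration.

```python
import numpy as np

def nw_regress(z, Z, Y, h):
    """Nadaraya-Watson estimate (1.14) with a univariate Gaussian kernel."""
    w = np.exp(-0.5 * ((z - Z) / h) ** 2)  # kernel weights K((z - Z_i)/h)
    return (w * Y).sum() / w.sum()         # locally weighted average of the Y_i's

rng = np.random.default_rng(1)
Z = rng.uniform(-2, 2, size=500)
Y = np.sin(Z) + 0.1 * rng.normal(size=500)
est = nw_regress(0.5, Z, Y, h=0.2)
```

Observations with Z_i far from z (relative to h) receive nearly zero weight, so the estimate is a local average around z; as h grows, the estimate flattens toward the overall sample mean.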
Definition. A nonnegative kernel K is said to be regular if there are positive constants b > 0 and r > 0 for which K(z) ≥ b I{z ∈ S_{0,r}} and \int \sup_{y ∈ z + S_{0,r}} K(y)\, dz < ∞, where S_{0,r} is the ball of radius r centered at the origin.

Many of the kernels used in practice satisfy the above definition. A widely used regular kernel is the Gaussian kernel

K(z) = (2π)^{−d/2} e^{−z′z/2}.

The results of Devroye and Krzyżak (1989) are of particular importance for later chapters of this thesis; in Chapters 3 and 4 we shall produce results similar to the following theorem for the case of incomplete data.

Theorem 1.5 (Devroye and Krzyżak (1989)). Assume that K is a regular kernel. Let m_n(z) be the kernel regression estimate in (1.14), and let 0/0 be defined as 0. Then for every distribution of (Z, Y) with |Y| ≤ M < ∞, for every ε > 0 and n large enough, we have

P\left\{ \int |m_n(z) − m(z)| µ(dz) > ε \right\} ≤ e^{−cn},

where c = min{ ε²/[128 M² (1 + c_1)], ε/[32 M (1 + c_1)] }.

Remark. This result provides

\int |m_n(z) − m(z)| µ(dz) → 0  a.s.,

by the Borel-Cantelli lemma, provided that lim_{n→∞} h_n = 0 and lim_{n→∞} n h_n^d = ∞. Hence the almost sure convergence of m_n(z) to m(z) is established. For an in-depth treatment of kernel regression and kernel density estimation see the monograph by Wand and Jones (1995).

1.5 Comparison between parametric least-squares and kernel regression estimators

In this section a simulation is presented to compare kernel regression estimators with parametric least-squares regression estimators. A sample D_n = {(X_1, Y_1), ..., (X_75, Y_75)} is constructed from Y_i = e^{−2|X_i|+5} + ε_i, where the X_i's and ε_i's, i = 1, ..., n, are standard normal random variables. The kernel regression estimate is fitted using a Gaussian kernel for the fixed smoothing parameters h = 0.5, 0.3, 0.2, 0.1. Four parametric models are chosen to demonstrate the importance of correct model specification.
Each regression estimator is tested on an independent sample of 1000 points; the results are plotted in Figure 1.5 and the corresponding errors of each regression fit are provided in Table 1.1. The errors in the table are constructed by comparing the fitted value to the actual Y_i for i = 1, ..., 1000.

Figure 1.5: (a) Y = β_0 + β_1 x + β_2 x², kernel h = 0.5; (b) Y = β_0 + β_1 x + β_2 x² + β_3 x³ + β_4 x⁴, kernel h = 0.3; (c) Y = β_0 + β_1 e^{−|x|}, kernel h = 0.2; (d) Y = e^{β_1 |x| + β_2}, kernel h = 0.1.

Table 1.1: Empirical L_1, L_2, and max errors.

Fit                 | L_1 error | L_2 error  | Max error
(a) parametric      | 45.11667  | 3681.28500 | 161.22510
(a) kernel, h = 0.5 | 13.29512  | 289.98340  | 58.77908
(b) parametric      | 22.54348  | 776.74680  | 110.51710
(b) kernel, h = 0.3 | 5.83773   | 81.51567   | 44.82134
(c) parametric      | 14.21205  | 302.08270  | 31.90593
(c) kernel, h = 0.2 | 4.69530   | 48.69061   | 38.57000
(d) parametric      | 0.06784   | 0.01039    | 0.28397
(d) kernel, h = 0.1 | 1.71803   | 7.16343    | 17.75663

The error is measured using the following criteria. The standard L_1 error (or least absolute deviations) is given by

\frac{1}{n} \sum_{i=1}^{n} |Y_i − \hat{Y}_i|,   (1.15)

the empirical L_2 risk by

\frac{1}{n} \sum_{i=1}^{n} (Y_i − \hat{Y}_i)²,   (1.16)

and the max error by

\max_{1 ≤ i ≤ 1000} |Y_i − \hat{Y}_i|.   (1.17)

In Figure 1.5(a), both estimators fit poorly. The parametric model Y = β_0 + β_1 x + β_2 x² is a poor choice considering that the actual model is nonlinear. For the kernel regression estimate, the smoothing parameter h = 0.5 oversmooths for a sample size of n = 75. In Figure 1.5(b), the kernel regression estimator improves; the smoothing parameter choice h = 0.3 is closer to the correct one. The parametric fourth-degree polynomial is better, as we would expect, but still inadequate, as seen in the figure and in the error rates of Table 1.1. In Figure 1.5(c), both methods improve. Considering that the parametric fit Y = β_0 + β_1 e^{−|x|} is a linear model (linear in its parameters), it performs surprisingly well at approximating the nonlinear truth. However, the linear model appears to have some trouble capturing the curvature of the relationship.
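The three empirical criteria (1.15)-(1.17) are straightforward to compute on a test sample; a small sketch (the function name is ours):

```python
import numpy as np

def empirical_errors(y, y_hat):
    """Empirical L1 error (1.15), L2 risk (1.16), and max error (1.17)."""
    r = np.abs(y - y_hat)                  # absolute residuals |Y_i - Yhat_i|
    return r.mean(), (r ** 2).mean(), r.max()

y_test = np.array([1.0, 2.0, 3.0])
y_fit = np.array([1.5, 2.0, 2.0])
l1, l2, lmax = empirical_errors(y_test, y_fit)
```

The same function applied to the 1000 test points of the simulation would reproduce the columns of Table 1.1.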
Finally, in Figure 1.5(d) the parametric fit performs extremely well. The correct nonlinear model Y = e^{β_1 |x| + β_2} was specified, and the result is an excellent approximation, as both the errors and visual inspection confirm. The kernel regression estimate with smoothing parameter h = 0.1 appears visually unstable, yet its error rates dropped further.

This simulation demonstrates the heavy dependence of parametric regression on correct model specification. When a model is incorrectly specified, the error will be large. Visual inspection of the plots allows easy identification of model inadequacies. In practice, however, such plots are usually unavailable, because the number of predictors is typically large. The statistician must be confident in the validity of any model assumptions or risk producing incorrect inferences. If the statistician is unable to rely on model assumptions, it is prudent to use nonparametric methods, which are not hindered by model misspecification. The kernel regression method, however, relies on a correct choice of the smoothing parameter h. If the smoothing parameter is too large (relative to the sample size), the estimate will be oversmoothed, while too small an h leads to over-fitting. This dependence of the kernel regression method on the smoothing parameter can be seen in the simulation. Additional difficulties arise for both estimators when the data contain missing values; this is the focus of the next chapter.

2 Missing data

The analysis of the regression function was of primary interest in the previous chapter, where both parametric and nonparametric estimators were explored. In particular, the nonparametric kernel regression estimator is of special importance to Chapters 3 and 4 of this thesis. Each of the estimators used the data D_n = {(Z_1, Y_1), ..., (Z_n, Y_n)} to construct an estimate of the regression function.
In practice, statistical researchers encounter data with missing values. In particular, a subset of the covariate vector may be missing due, for instance, to the failure of a survey participant to respond to questions deemed too personal. For example, a poll may include questions on income, sexual preference, and prior drug use, and some of those polled may consider these particular questions too personal to answer. The response variable can also be missing, a situation common in experimental design, where the covariates are controlled by the researcher. The presence of missing data poses additional problems for the statistician to overcome: the standard regression estimators are no longer available, new methods need to be considered, and knowledge of the behavior of the missing data is required to produce reliable regression estimators. This chapter provides a brief presentation of the problem of missing data; see Little and Rubin (2002) for an in-depth account. First, the different patterns in which data may be unavailable are described. Second, the uncertainty of observing or not observing the i-th observation is modeled as a Bernoulli random variable, and different assumptions on the conditional dependencies of the probability associated with this random variable, called the selection probability or missing probability mechanism, are presented. Lastly, three common procedures for producing regression estimators in the presence of incomplete data are examined.

2.1 Missing patterns

There exist a variety of patterns in which data may be unobservable. We shall briefly discuss five of these patterns; see Little and Rubin (2002) for a more complete treatment of missing data patterns. Figure 2.1 gives a visual representation of these five patterns, in which any of the variables Y_1, ..., Y_5 can represent either the covariates or the response variable.
The first missing pattern is the univariate nonresponse pattern. This pattern represents the case in which several variables are completely observable while one variable may have missing values. The variable with missing values can be either part of the vector of covariates or the response variable; in regression analysis, however, this pattern is more representative of the case of a missing response. This missing pattern is common in experimental design scenarios, where the covariate vector is controlled by the researcher. Little and Rubin (2002) provide an example in which the yield of a particular crop under different input values of the covariates is of interest; the crop yield may be incomplete due to the failure of the seed to germinate.

Figure 2.1: An adaptation of Figure 1.1 of Little and Rubin (2002). Each rectangle represents the available sample size for the given variable.

The multivariate two-pattern problem is the missing pattern in which there are two distinct subsets of variables. The first subset is always available, and the second subset has missing values, with the same observations missing for each variable in that subset. This is common in survey sampling, in which individuals in the sample fail to report some of the predictor variables. Another scenario that results in this pattern is when some of the variables are extremely costly to measure. For example, consider a car manufacturer who wishes to measure the safety and durability of its vehicles. Some of the measurements can be cheap, like stopping time and tire durability. However, measurements under accident scenarios can result in the complete destruction of the vehicle; these measurements are expensive, and it would not be in the interest of the company to perform them repeatedly.

When there is a monotonic decrease in the availability of observations, we denote this pattern as the monotone pattern.
This pattern is common in longitudinal studies, in which measurements are conducted over a long period of time. The first measurement is always complete, since availability at the start of the experiment determines the initial sample size. When studies progress over a long period of time, unforeseeable circumstances lead to a decrease in the number of responding units.

The file matching pattern is a pattern in which two subsets of the variables are never observed at the same time. This behavior can be seen in Figure 2.1(e), where Y_2 and Y_3 are never observable together. This pattern can result in unique problems, arising from the inability to estimate the relationship between variables that are never observed together. For example, in quantum mechanics, Heisenberg's uncertainty principle states that a particle's position and momentum cannot both be simultaneously known; this behavior could lead to the file matching pattern. This specialized example portrays a situation in which the pattern occurs as a direct result of the experimental procedure being performed. The merging of two large government databases is a more common situation that may result in this type of pattern: the databases being merged may share some common variables, but each database also has a set of variables specific to the department responsible for that particular database. For more on the file matching pattern and other missing patterns, one may refer to Little and Rubin (2002).

To further understand the behavior of missing data, methods need to be employed to deal with the uncertainty of a vector being observable. This uncertainty is explored in the next section, where probability models are developed to represent the probability that a vector is observed.

2.2 Selection probability

When dealing with incomplete data, the uncertainty of observing a variable must be modeled with some probability mechanism.
The underlying missing probability mechanism, called the selection probability, represents the probability that a variable will be available. To simplify the presentation of the selection probability, consider the following situation. Let (Z, Y) be an R^{d+s} × R random vector, with Z = (X, V), X ∈ R^d, V ∈ R^s, and d, s ≥ 1. Suppose the subset V of the covariate vector Z has the possibility of being unobservable. Introduce the new random variable ∆, where ∆ = 1 if V is observable and ∆ = 0 otherwise. The selection probability can then be defined as follows.

Definition 2.1 (Missing Probability Mechanism)

P{∆ = 1 | Z = z, Y = y} = P{∆ = 1 | X = x, V = v, Y = y}.   (2.1)

The conditional probability in (2.1) defines the probability that ∆ = 1 (V is observable) given (X, V, Y). When this conditional probability depends on all the variables, including the possibly missing V, the mechanism is called missing not at random (MNAR). When the right-hand side of (2.1) equals a constant, the missing probability mechanism is called missing completely at random (MCAR), i.e.,

Definition 2.2 (Missing Completely at Random (MCAR))

P{∆ = 1 | Z = z, Y = y} = P{∆ = 1} = p.   (2.2)

The MCAR assumption may be reasonable if the data are missing due to simple corruption in the gathering or storage of the sample. This situation is denoted as ignorable, since proceeding as if there were no missing data yields consistent and unbiased estimates. However, this situation is unlikely in practice, and assuming it will often lead to biased estimates. Instead, a more common assumption is that of missing at random (MAR), a compromise between the fully general MNAR mechanism and the highly unlikely MCAR assumption. We define the MAR mechanism as follows.

Definition 2.3 (Missing at Random (MAR))

P{∆ = 1 | Z = z, Y = y} = P{∆ = 1 | X = x, Y = y}.   (2.3)

Here, (2.3) assumes that the probability that V is missing depends only on the always-observable vector (X, Y).
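The mechanisms in Definitions 2.2 and 2.3 can be simulated directly. In the following hedged sketch, the logistic form of the MAR selection probability and all numerical values are our own toy choices, not prescribed by the thesis.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
X = rng.normal(size=n)            # always-observed covariate
V = X + rng.normal(size=n)        # covariate subject to missingness
Y = X + V + rng.normal(size=n)    # response

# MCAR (2.2): P(Delta = 1) is a constant p, here p = 0.7.
delta_mcar = rng.random(n) < 0.7

# MAR (2.3): P(Delta = 1 | X, Y) depends only on the observed (X, Y), never on V.
eta = 1.0 / (1.0 + np.exp(-(0.5 * X + 0.25 * Y)))   # toy logistic selection probability
delta_mar = rng.random(n) < eta
```

Under MAR the observed V's form a biased sample: here selection favors large X and Y, which are positively correlated with V, so the complete-case mean of V overestimates E[V] = 0.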
2.3 Regression in the presence of missing values

Recall that our main goal has been to estimate the regression function m(z) = E[Y | Z = z] when the data D_n = {(Z_1, Y_1), ..., (Z_n, Y_n)} may contain missing values. A variety of techniques have been developed to deal with the problem of missing values among the data. Here we briefly introduce three common methods of estimating the regression function from incomplete data and analyze the relative merits of each method.

Complete case analysis

The first and most obvious method is to use only the cases with no missing values to estimate the regression function. This method is called complete case analysis: any observation that has a missing value is ignored in the process of estimating the regression function. The data used in estimating the regression function is then no longer D_n = {(Z_1, Y_1), ..., (Z_n, Y_n)}; instead the effective data is D'_n = {(Z_1, Y_1), ..., (Z_m, Y_m)}. In this case the sample size is reduced to m = \sum_{i=1}^{n} ∆_i, with ∆_i = 1 if there are no missing values in the i-th observation and ∆_i = 0 otherwise. This method is only recommended if the number of incomplete cases is small relative to the sample size. If the number of incomplete cases is large, using complete case analysis will result in an increase in the variance of the estimator, which may lead to a drastic loss of efficiency. Another issue with complete case methods is that the reduced sample size m = \sum_{i=1}^{n} ∆_i is a random variable. This randomness leads to difficulties, since the majority of regression estimation methods assume that the sample size is a fixed quantity. Furthermore, if the selection probability is not MCAR, the complete-case regression estimator is likely to be biased.

Imputation methods

The reduction in sample size in complete case analysis is an undesirable property. Imputation methods attempt to reconstruct the sample by imputing, or "substituting" for, the missing values.
To fill in a missing value in a case, an estimate of this value is constructed from the available data and inserted in place of the missing value. To be more precise, suppose we have data D_n = {(Z_1, Y_1), ..., (Z_n, Y_n)} and that a subvector V_i of Z_i = (X_i, V_i) could be missing. The imputation method then produces the new sample D̂_n = {(X_1, V̂_1, Y_1), ..., (X_n, V̂_n, Y_n)}, where V̂_i is V_i itself if V_i is observable, and an estimate of V_i if V_i is missing. Two common methods used in the area of survey sampling are hot deck and cold deck imputation. In hot deck imputation, the value of a similar available unit in the sample is used in place of the missing value; cold deck imputation uses a similar unit from an alternative sample (perhaps a previous realization) in place of the missing value of the current sample. Another obvious possibility is to replace a missing value with the mean of the observed values of the same variable; this is referred to as mean imputation. Mean imputation is not recommended: it can result in inaccurate summary statistics of the sample, in particular underestimating the variance and shifting the sample correlation between variables toward zero. Another common method is regression imputation, in which the variable subject to missingness is regressed on the observed variables using the complete cases, and the resulting regression estimate is used to predict the missing value. Regression imputation conditions on the observed variables, which helps reduce bias due to nonresponse.

The above methods of imputation are referred to as single imputation, since one single value is estimated to replace each missing value. Single imputation methods have the disadvantage of treating the imputed value as a known quantity, thereby removing the uncertainty associated with the missing value. An alternative to single imputation is multiple imputation.
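The variance and correlation distortions of mean imputation described above are easy to reproduce. A hedged sketch under MCAR missingness (all numerical values are our own toy choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
v = rng.normal(size=n)
y = v + 0.5 * rng.normal(size=n)
observed = rng.random(n) < 0.6                       # about 40% of v is missing (MCAR)

v_imp = np.where(observed, v, v[observed].mean())    # mean imputation of the missing v's

var_ratio = v_imp.var() / v.var()                    # < 1: the variance is underestimated
corr_full = np.corrcoef(v, y)[0, 1]
corr_imp = np.corrcoef(v_imp, y)[0, 1]               # shrunk toward zero
```

The imputed entries are a constant, so the sample variance of v_imp is roughly the observed fraction (here 0.6) times the true variance, and the correlation with y shrinks accordingly.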
Multiple imputation replaces the missing values with several estimates, resulting in multiple complete data sets. Inferences on each data set can be combined to achieve one overall inference that properly portrays the uncertainty associated with the presence of missing values.

Horvitz-Thompson inverse weights

When the data are not MCAR, complete case analysis can result in biased estimates of the regression function. The complete-case regression estimate can be bias-corrected by using Horvitz-Thompson-type inverse weights (Horvitz and Thompson (1952)). This method works by weighting the complete cases by the inverse of the missing data probability mechanism. As a result of weighting by the inverse missing probabilities, values that were observed but had a high probability of being unavailable are given more weight, thus correcting the bias of the complete-case regression estimate. However, one typically does not have knowledge of the missing probability mechanism, and as a result the method must use some estimator of these probabilities. This method has previously been used by some authors to estimate the mean of a distribution with missing data; see, for example, Hirano et al. (2003). Robins, Rotnitzky and Zhao (1994) developed a class of locally and globally adaptive semiparametric estimators using inverse probability weighted estimating equations for the case of covariates missing at random. For a comparison between imputation methods and Horvitz-Thompson-type inverse weights, see Vansteelandt (2010). In the following chapters we utilize the method of Horvitz-Thompson-type inverse weights to construct kernel regression estimators for incomplete data.

3 Kernel regression estimation with incomplete covariates

In the previous chapter we introduced the problem of regression estimation when the data used in the estimation are incomplete. In this chapter we study nonparametric regression for the specific case of missing covariates.
Until recently, there has been little work in the area of nonparametric regression with missing covariates. Mojirsheibani (2007) considered a multiple imputation method to estimate the regression function when part of the covariate vector could be missing at random; this is similar to the work of Cheng and Chu (1996) in that the regression estimate imputes a specific function of a missing value (and not the missing value itself). Efromovich (2011) constructed an adaptive orthogonal series estimator and derived the minimax rate of the MISE when the regression function belongs to a k-fold Sobolev class. Wand et al. (2012) developed variational Bayes algorithms for penalized spline nonparametric regression when a single covariate variable is susceptible to missingness. Our contribution is to construct kernel regression estimators from data in which a subset of the covariate vector may be missing at random. We propose kernel regression estimators that use Horvitz-Thompson-type inverse weights, in which the selection probabilities are estimated using kernel regression or least-squares methods. Exponential upper bounds are derived for the L_p norms of these estimators; these results yield almost sure convergence of the L_p statistic to 0 by an application of the Borel-Cantelli lemma. As an immediate application of these results, the problem of nonparametric classification with missing covariates will be studied.

Now, let us formally define the problem. Let (Z, Y) be an R^d × R-valued random vector and let m(z) = E[Y | Z = z] be the regression function. Given the data D_n = {(Z_1, Y_1), ..., (Z_n, Y_n)}, where the (Z_i, Y_i)'s are independent and identically distributed (i.i.d.) random vectors with the same distribution as that of (Z, Y), the regression function m(z) can be estimated by the Nadaraya-Watson kernel estimate (Nadaraya (1964), Watson (1964))

m_n(z) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{z − Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z − Z_i}{h_n}\right)},   (3.1)

which was defined in (1.14).
Here 0 < h_n ≡ h(n) is a smoothing parameter depending on n, and the kernel K is an absolutely integrable function. The case in which the data are completely observable was already explored in Chapter 1, where the almost sure convergence of m_n(·) to m(·) was established. The result of Devroye and Krzyżak (1989) in Theorem 1.5 will be of particular importance to the results of this chapter. We are interested in extending these results to the case where a subset of the covariate vector Z_i may be unavailable in D_n. Clearly, the estimator (3.1) is then no longer available. In the following sections we present our proposed estimators.

3.1 Main Results

Let (Z, Y) be an R^{d+s} × R-valued random vector and Z = (X′, V′)′, with X ∈ R^d, V ∈ R^s, and d, s ≥ 1. Our goal is to estimate the regression function m(z) = E[Y | Z = z] when V may be missing at random. Under the missing at random assumption, the selection probability is defined as follows:

P(∆ = 1 | Z = z, Y = y) = P(∆ = 1 | X = x, Y = y) =: η*(x, y).   (3.2)

We first consider the unrealistic case in which the selection probability η* is known. In this case we estimate the regression function m(z) by constructing a modified kernel regression estimator from the i.i.d. sample D_n = {(Z_1, Y_1, ∆_1), ..., (Z_n, Y_n, ∆_n)} = {(X_1, V_1, Y_1, ∆_1), ..., (X_n, V_n, Y_n, ∆_n)}, with ∆_i = 1 if V_i is observable and 0 otherwise, defined as

\hat{m}_{η*}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i Y_i}{η*(X_i, Y_i)} K\left(\frac{z − Z_i}{h_n}\right)}{\sum_{i=1}^{n} \frac{∆_i}{η*(X_i, Y_i)} K\left(\frac{z − Z_i}{h_n}\right)}.   (3.3)

Observe that the estimator in (3.3) works by weighting the complete cases by the inverses of the selection probabilities; this is very much in the spirit of the classical Horvitz-Thompson estimator (Horvitz and Thompson (1952)).
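A minimal sketch of the inverse-weighted estimator (3.3) for the idealized case of known selection probability. The Gaussian kernel, the toy model m(z) = z², and the two-level selection probability are our own illustrative choices; for simplicity the whole covariate is treated as subject to missingness.

```python
import numpy as np

def ht_kernel_regress(z, Z, Y, delta, eta, h):
    """Inverse-weighted Nadaraya-Watson estimate (3.3): complete cases
    (delta_i = 1) are weighted by 1/eta*(X_i, Y_i)."""
    K = np.exp(-0.5 * ((z - Z) / h) ** 2)   # Gaussian kernel K((z - Z_i)/h_n)
    w = delta / eta                          # Horvitz-Thompson inverse weights
    return (w * K * Y).sum() / (w * K).sum()

rng = np.random.default_rng(4)
n = 5_000
Z = rng.uniform(-1, 1, size=n)
Y = Z ** 2 + 0.1 * rng.normal(size=n)               # m(z) = z^2
eta = np.where(Y > np.median(Y), 0.8, 0.3)          # toy known selection probability
delta = (rng.random(n) < eta).astype(float)
est = ht_kernel_regress(0.5, Z, Y, delta, eta, h=0.1)
```

Because selection here favors large responses, the unweighted complete-case local average would be biased upward; dividing by η* restores, on average, each case's original contribution.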
As for the usefulness of \hat{m}_{η*}(z) as an estimator of m(z), observe that (3.3) can be rewritten as

\hat{m}_{η*}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i Y_i}{η*(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right) \Big/ \sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)}{\sum_{i=1}^{n} \frac{∆_i}{η*(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right) \Big/ \sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)},

which is a ratio of kernel estimators of the following ratio of two conditional expectations:

\frac{E\left[\frac{∆Y}{η*(X,Y)} \mid Z\right]}{E\left[\frac{∆}{η*(X,Y)} \mid Z\right]} = \frac{E\left[\frac{Y}{η*(X,Y)} E[∆ \mid Z, Y] \mid Z\right]}{E\left[\frac{1}{η*(X,Y)} E[∆ \mid Z, Y] \mid Z\right]} = \frac{E[Y \mid Z]}{E[1 \mid Z]} = m(Z),  a.s.

Therefore, when the missing probability mechanism η* is known, (3.3) can be seen as a kernel regression estimator of the regression function E[Y | Z] = m(Z).

To explore the accuracy of our estimator in (3.3), we examine the convergence of its L_p statistic to 0. We shall assume in what follows that the selected kernel K is regular, as defined in (1.4.3), which we recall is:

Definition. A nonnegative kernel K is said to be regular if there are positive constants b > 0 and r > 0 for which K(z) ≥ b I{z ∈ S_{0,r}} and \int \sup_{y ∈ z + S_{0,r}} K(y)\, dz < ∞, where S_{0,r} is the ball of radius r centered at the origin.

Many of the kernels used in practice are regular kernels; in fact, the popular Gaussian kernel is a regular kernel. The regular kernel is the type of kernel used by Devroye and Krzyżak (1989). For more on regular kernels the reader is referred to Györfi et al. (2002). Before proceeding we need to state the following condition.

Condition A1. inf_{x ∈ R^d, y ∈ R} P(∆ = 1 | X = x, Y = y) =: η_0 > 0, where η_0 can be arbitrarily small.

Condition A1 is necessary so that V is observed with a nonzero probability. This is a rather standard assumption in the literature on missing data; see, for example, Wang and Qin (2010) and Cheng and Chu (1996). The following theorem employs the result of the main theorem of Devroye and Krzyżak (1989).

Theorem 3.1. Let \hat{m}_{η*}(z) be the kernel regression estimator defined in (3.3), where K is a regular kernel, and suppose Condition A1 holds.
If |Y| ≤ M < ∞ and h_n → 0, n h_n^{d+s} → ∞ as n → ∞, then for every ε > 0 and n large enough,

P\left\{ \int |\hat{m}_{η*}(z) − m(z)|^p µ(dz) > ε \right\} ≤ 8 e^{−n a_1},

where a_1 ≡ a_1(ε) = min{ ε² η_0^{2p} / [2^{2p+7} M^{2p} (1+η_0)^{2p−2} (1+c_1)], ε η_0^p / [2^{p+5} M^p (1+η_0)^{p−1} (1+c_1)] }, with c_1 the positive constant of Lemma B.1.

Remark. Theorem 3.1 in conjunction with the Borel-Cantelli lemma yields E[|\hat{m}_{η*}(Z) − m(Z)|^p | D_n] → 0 a.s., as n → ∞.

PROOF OF THEOREM 3.1. Let \tilde{m}_{η*} be the kernel regression estimator of m(z) and m_{η*} the kernel regression estimator of the constant function Y ≡ 1, defined as

\tilde{m}_{η*}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i Y_i}{η*(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)}, \qquad m_{η*}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i}{η*(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)}.   (3.4)

Observing that |\tilde{m}_{η*}(z)/m_{η*}(z)| ≤ M, we have

|\hat{m}_{η*}(z) − m(z)| = \left| \frac{\tilde{m}_{η*}(z)}{m_{η*}(z)} − \frac{m(z)}{1} \right| ≤ M |m_{η*}(z) − 1| + |\tilde{m}_{η*}(z) − m(z)|.   (3.5)

As a result of (3.5), we have that for every ε > 0 and n large enough,

P\left\{ \int |\hat{m}_{η*}(z) − m(z)|^p µ(dz) > ε \right\}
  ≤ P\left\{ \int \left[ M |m_{η*}(z) − 1| + |\tilde{m}_{η*}(z) − m(z)| \right]^p µ(dz) > ε \right\}
  ≤ P\left\{ \int \left[ 2^{p−1} M^p |m_{η*}(z) − 1|^p + 2^{p−1} |\tilde{m}_{η*}(z) − m(z)|^p \right] µ(dz) > ε \right\}
  ≤ P\left\{ 2^{p−1} M^p \int (|m_{η*}(z)| + 1)^{p−1} |m_{η*}(z) − 1| µ(dz) > \frac{ε}{2} \right\}
    + P\left\{ 2^{p−1} \int (|\tilde{m}_{η*}(z)| + |m(z)|)^{p−1} |\tilde{m}_{η*}(z) − m(z)| µ(dz) > \frac{ε}{2} \right\}
  ≤ P\left\{ 2^{p−1} M^p \left( \frac{1}{η_0} + 1 \right)^{p−1} \int |m_{η*}(z) − 1| µ(dz) > \frac{ε}{2} \right\}
    + P\left\{ 2^{p−1} \left( \frac{M}{η_0} + M \right)^{p−1} \int |\tilde{m}_{η*}(z) − m(z)| µ(dz) > \frac{ε}{2} \right\}
  = P\left\{ \int |m_{η*}(z) − 1| µ(dz) > ε' \right\} + P\left\{ \int |\tilde{m}_{η*}(z) − m(z)| µ(dz) > ε' M \right\}
    (where ε' = ε η_0^{p−1} / [2^p M^p (1 + η_0)^{p−1}])
  ≤ 8 e^{−n a_1}  (by (1.5), in view of the result of Devroye and Krzyżak (1989)).

Here, a_1 ≡ a_1(ε) = min{ ε² η_0^{2p} / [2^{2p+7} M^{2p} (1+η_0)^{2p−2} (1+c_1)], ε η_0^p / [2^{p+5} M^p (1+η_0)^{p−1} (1+c_1)] }, with c_1 the positive constant in Lemma B.1.

The kernel estimator defined in (3.3) requires knowledge of the missing probability mechanism η*(X, Y) = E[∆ | X, Y] (an unrealistic case). If η* is unknown, then it has to be replaced by some estimator in (3.3). We consider two different estimators of η*: a kernel regression estimator and a least-squares regression type estimator.
3.1.1 A kernel-based estimator of the selection probability

We first consider replacing η*(X_i, Y_i) in (3.3) by one of the two following estimators:

\hat{η}(X_i, Y_i) = \frac{\sum_{j=1, j≠i}^{n} ∆_j I\{Y_i = Y_j\} H\left(\frac{X_i − X_j}{λ_n}\right)}{\sum_{j=1, j≠i}^{n} I\{Y_i = Y_j\} H\left(\frac{X_i − X_j}{λ_n}\right)},   (3.6)

\hat{η}_c(X_i, Y_i) = \frac{\sum_{j=1, j≠i}^{n} ∆_j H\left(\frac{U_i − U_j}{λ_n}\right)}{\sum_{j=1, j≠i}^{n} H\left(\frac{U_i − U_j}{λ_n}\right)}, with U_i = (X_i′, Y_i)′,   (3.7)

where H : R^d → R_+ in (3.6) and H : R^{d+1} → R_+ in (3.7) are kernels with smoothing parameter λ_n satisfying λ_n → 0 as n → ∞, with the convention that 0/0 = 0. The estimator in (3.6) is used if Y is a discrete random variable; otherwise one estimates the selection probability with (3.7). We derive an exponential bound for the more complicated case of discrete Y, i.e., when the selection probability is estimated by the estimator in (3.6); the continuous case follows by simpler arguments. Now, suppose that Z ∈ R^{d+s} and Y ∈ 𝒴, where 𝒴 is a countable subset of R. Then our modified kernel-type regression estimator of m(z) = E[Y | Z = z] is given by

\hat{m}_{\hat{η}}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i Y_i}{\hat{η}(X_i, Y_i)} K\left(\frac{z − Z_i}{h_n}\right)}{\sum_{i=1}^{n} \frac{∆_i}{\hat{η}(X_i, Y_i)} K\left(\frac{z − Z_i}{h_n}\right)},   (3.8)

where \hat{η} is as in (3.6). To study (3.8) we first state some standard regularity conditions.

Condition A2. The kernel H in (3.6) satisfies \int_{R^d} H(u)\, du = 1 and \int_{R^d} |u_i| H(u)\, du < ∞ for i = 1, ..., d, where u = (u_1, ..., u_d)′. Furthermore, the smoothing parameter λ_n satisfies λ_n → 0 and n λ_n^d → ∞ as n → ∞.

Condition A3. The random vector X has a compactly supported probability density function f(x) = \sum_{y ∈ 𝒴} f_y(x) P(Y = y), which is bounded away from zero on its support, where f_y(x) is the conditional density of X given Y = y. Additionally, both f and its first-order partial derivatives are uniformly bounded on its support.

Condition A4. The partial derivatives ∂η*(x, y)/∂x_i exist for i = 1, ..., d and are bounded uniformly in x on the compact support of f.

Condition A2 is not restrictive, since the choice of the kernel H is at our discretion.
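A hedged sketch of the leave-one-out kernel smoother (3.7) for continuous Y; the Gaussian product kernel and the logistic form of η* are our own toy choices, and the discrete-Y version (3.6) would additionally restrict each sum to indices with Y_j = Y_i.

```python
import numpy as np

def eta_hat_continuous(X, Y, delta, lam):
    """Leave-one-out kernel estimate (3.7): a Nadaraya-Watson smooth of the
    indicators delta_j on U_j = (X_j, Y_j), with a Gaussian kernel H."""
    U = np.column_stack([X, Y])
    D2 = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)  # squared pairwise distances
    H = np.exp(-0.5 * D2 / lam ** 2)
    np.fill_diagonal(H, 0.0)                             # leave the i-th point out (j != i)
    return (H * delta).sum(axis=1) / H.sum(axis=1)

rng = np.random.default_rng(5)
n = 800
X = rng.normal(size=n)
Y = X + 0.5 * rng.normal(size=n)
eta_true = 1.0 / (1.0 + np.exp(-X))        # toy eta*(x, y), here a function of x alone
delta = (rng.random(n) < eta_true).astype(float)
eta_est = eta_hat_continuous(X, Y, delta, lam=0.5)
```

The fitted values \hat{η}(X_i, Y_i) would then replace η*(X_i, Y_i) in (3.3), giving the plug-in estimator (3.8).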
Condition A3 is usually imposed in nonparametric regression in order to avoid having unstable estimates (in the tail of the p.d.f. of X). Condition A4 is technical and has already been used in the literature; see, for example, Cheng and Chu (1996).

Theorem 3.2. Let \hat{m}_{\hat{η}}(z) be defined as in (3.8), where K is a regular kernel. Suppose that |Y| ≤ M < ∞. If h_n → 0 and n h_n^{d+s} → ∞ as n → ∞, then under Conditions A1, A2, A3, and A4, for every ε > 0, every p ∈ [1, ∞), and n large enough,

P\left\{ \int |\hat{m}_{\hat{η}}(z) − m(z)|^p µ(dz) > ε \right\} ≤ 8 e^{−n a_2} + 2 e^{−n a_3} + 8 n e^{−(n−1) λ_n^d a_4} + 8 n e^{−(n−1) λ_n^d a_5},

where

a_2 ≡ a_2(ε) = min{ ε² η_0^{2p} / [2^{4p+7} M^{2p} (1+η_0)^{2p−2} (1+c_1)], ε η_0^p / [2^{2p+5} M^p (1+η_0)^{p−1} (1+c_1)] },

a_4 ≡ a_4(ε) = min{ ε² η_0^{2p+2} ψ_0² / [2^{4p+12} 3^{2p−2} c_1² M^{2p} ‖H‖_∞ (‖f‖_∞ + ψ_0/12)], ε² η_0^{2p} / [2^{4p+8} 3^{2p} c_1² M^{2p}] },

a_3 = a_5 = η_0² ψ_0² / (2⁷ ‖H‖_∞ [‖f‖_∞ + ψ_0/12]),

with ψ_0 given by 0 < ψ_0 := f_0 · inf_{y ∈ 𝒴} P(Y = y) and c_1 the positive constant in Lemma B.1.

Remark. In view of Theorem 3.2 and the Borel-Cantelli lemma, if log n / (n λ_n^d) → 0 as n → ∞, then E[|\hat{m}_{\hat{η}}(Z) − m(Z)|^p | D_n] → 0 a.s.

PROOF OF THEOREM 3.2. To prove Theorem 3.2, we first need to define a couple of terms. Let

\tilde{m}_{\hat{η}}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i Y_i}{\hat{η}(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)}, \qquad m_{\hat{η}}(z) = \frac{\sum_{i=1}^{n} \frac{∆_i}{\hat{η}(X_i,Y_i)} K\left(\frac{z−Z_i}{h_n}\right)}{\sum_{i=1}^{n} K\left(\frac{z−Z_i}{h_n}\right)}.   (3.9)

Then, upon observing that |\tilde{m}_{\hat{η}}(z)/m_{\hat{η}}(z)| ≤ M, one obtains

|\hat{m}_{\hat{η}}(z) − m(z)| = \left| \frac{\tilde{m}_{\hat{η}}(z)}{m_{\hat{η}}(z)} − \frac{m(z)}{1} \right| ≤ M |m_{\hat{η}}(z) − 1| + |\tilde{m}_{\hat{η}}(z) − m(z)|.
Therefore, with $\tilde m_{\hat\eta}(z)$ and $\bar m_{\hat\eta}(z)$ as in (3.9), and $\tilde m_{\eta^*}(z)$ and $\bar m_{\eta^*}(z)$ as in (3.4), we have for every $\epsilon > 0$

\begin{align*}
P\Big\{\int |\hat m_{\hat\eta}(z) - m(z)|^p\,\mu(dz) > \epsilon\Big\}
&\le P\Big\{2^{p-1} M^p \int |\bar m_{\hat\eta}(z) - 1|^p\,\mu(dz) > \frac{\epsilon}{2}\Big\} + P\Big\{2^{p-1} \int |\tilde m_{\hat\eta}(z) - m(z)|^p\,\mu(dz) > \frac{\epsilon}{2}\Big\} \\
&\le P\Big\{2^{2p-2} M^p \int |\bar m_{\hat\eta}(z) - \bar m_{\eta^*}(z)|^p\,\mu(dz) > \frac{\epsilon}{4}\Big\} + P\Big\{2^{2p-2} M^p \int |\bar m_{\eta^*}(z) - 1|^p\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} \int |\tilde m_{\hat\eta}(z) - \tilde m_{\eta^*}(z)|^p\,\mu(dz) > \frac{\epsilon}{4}\Big\} + P\Big\{2^{2p-2} \int |\tilde m_{\eta^*}(z) - m(z)|^p\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\le P\Big\{2^{2p-2} M^p \Big(\frac{1}{\wedge_{i=1}^n \hat\eta(X_i,Y_i)} + \frac{1}{\eta_0}\Big)^{p-1} \int |\bar m_{\hat\eta}(z) - \bar m_{\eta^*}(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} M^p \Big(\frac{1}{\eta_0} + 1\Big)^{p-1} \int |\bar m_{\eta^*}(z) - 1|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} \Big(\frac{M}{\wedge_{i=1}^n \hat\eta(X_i,Y_i)} + \frac{M}{\eta_0}\Big)^{p-1} \int |\tilde m_{\hat\eta}(z) - \tilde m_{\eta^*}(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} \Big(\frac{M}{\eta_0} + M\Big)^{p-1} \int |\tilde m_{\eta^*}(z) - m(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&:= P_{n,1} + P_{n,2} + P_{n,3} + P_{n,4}, \text{ (say)}.
\end{align*}

In view of Theorem 3.1, it follows that

$$P_{n,2} + P_{n,4} \le 8e^{-n a_2}, \qquad (3.10)$$

where $a_2 \equiv a_2(\epsilon)$ is as in the statement of the theorem. To deal with the term $P_{n,1}$, first observe that

\begin{align*}
|\bar m_{\hat\eta}(z) - \bar m_{\eta^*}(z)| &= \frac{\big|\sum_{i=1}^n \Delta_i\big(\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\big)\, K\big(\frac{z-Z_i}{h_n}\big)\big|}{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)} \\
&\le \frac{\big|\sum_{i=1}^n \Delta_i\big(\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\big)\, K\big(\frac{z-Z_i}{h_n}\big)\big|}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} \\
&\quad + \sum_{i=1}^n \Big|\Delta_i\Big(\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\Big)\Big|\, K\Big(\frac{z-Z_i}{h_n}\Big) \cdot \left|\frac{1}{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)} - \frac{1}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]}\right| \\
&:= S_{n,1}(z) + S_{n,2}(z), \text{ (say)}.
\end{align*}

However, we can bound $\int S_{n,1}(z)\,\mu(dz)$ as follows:

\begin{align*}
\int S_{n,1}(z)\,\mu(dz) &\le \int \frac{\sum_{i=1}^n \big|\Delta_i\big(\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta(X_i,Y_i)}\big)\big|\, K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]}\,\mu(dz) \\
&\le \frac{1}{n}\sum_{i=1}^n \left|\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta(X_i,Y_i)}\right| \cdot \sup_w \int \frac{K\big(\frac{z-w}{h_n}\big)}{E\big[K\big(\frac{z-Z}{h_n}\big)\big]}\,\mu(dz) \\
&\le c_1 \left(\frac{1}{n}\sum_{i=1}^n \left|\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta(X_i,Y_i)}\right|\right), \qquad (3.11)
\end{align*}

where the last line follows from Lemma B.1 in the appendix and the fact that $|\Delta_i| \le 1$, for all $i = 1, \dots, n$. Here $c_1$ is the positive constant of Lemma B.1. To bound $\int S_{n,2}(z)\,\mu(dz)$, we have

$$\int S_{n,2}(z)\,\mu(dz) \le \max_{1\le i\le n}\left|\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta(X_i,Y_i)}\right| \int \left|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\right|\,\mu(dz). \qquad (3.12)$$

Therefore, we have

\begin{align*}
P_{n,1} &\le P\Big\{2^{2p-2} M^p \Big(\frac{1}{\eta_0} + \frac{1}{\wedge_{i=1}^n \hat\eta(X_i,Y_i)}\Big)^{p-1} \int (S_{n,1}(z) + S_{n,2}(z))\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\le P\Big\{\Big[2^{2p-2} M^p \Big(\frac{1}{\eta_0} + \frac{1}{\wedge_{i=1}^n \hat\eta(X_i,Y_i)}\Big)^{p-1} \frac{c_1}{n}\sum_{i=1}^n \frac{|\eta^*(X_i,Y_i) - \hat\eta(X_i,Y_i)|}{\eta^*(X_i,Y_i)\,\hat\eta(X_i,Y_i)} > \frac{\epsilon}{8}\Big] \cap \bigcap_{i=1}^n \Big[\hat\eta(X_i,Y_i) \ge \frac{\eta_0}{2}\Big]\Big\} \\
&\quad + P\Big\{\Big[2^{2p-2} M^p \Big(\frac{1}{\eta_0} + \frac{1}{\wedge_{i=1}^n \hat\eta(X_i,Y_i)}\Big)^{p-1} \max_{1\le i\le n}\Big|\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta(X_i,Y_i)}\Big| \\
&\qquad\qquad \times \int \Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\,\mu(dz) > \frac{\epsilon}{8}\Big] \cap \bigcap_{i=1}^n \Big[\hat\eta(X_i,Y_i) \ge \frac{\eta_0}{2}\Big]\Big\} + P\Big\{\bigcup_{i=1}^n \Big[\hat\eta(X_i,Y_i) < \frac{\eta_0}{2}\Big]\Big\} \\
&\le P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon\,\eta_0^{p+1}}{2^{2p+2}\,3^{p-1} M^p c_1}\Big\} \qquad (3.13) \\
&\quad + P\Big\{\int \Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\,\mu(dz) > \frac{\epsilon\,\eta_0^{p}}{2^{2p+1}\,3^{p} M^p}\Big\} \qquad (3.14) \\
&\quad + P\Big\{\bigcup_{i=1}^n \Big[\hat\eta(X_i,Y_i) < \frac{\eta_0}{2}\Big]\Big\}. \qquad (3.15)
\end{align*}

The bound for the term $P_{n,3}$ follows in a similar fashion as that for $P_{n,1}$; we first note that

\begin{align*}
|\tilde m_{\hat\eta}(z) - \tilde m_{\eta^*}(z)| &\le \frac{\big|\sum_{i=1}^n \Delta_i Y_i\big(\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\big)\, K\big(\frac{z-Z_i}{h_n}\big)\big|}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} \\
&\quad + \sum_{i=1}^n \Big|\Delta_i Y_i\Big(\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\Big)\Big|\, K\Big(\frac{z-Z_i}{h_n}\Big) \cdot \left|\frac{1}{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)} - \frac{1}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]}\right| \\
&:= S_{n,1}'(z) + S_{n,2}'(z), \text{ (say)}.
\end{align*}

Similar to $\int S_{n,1}(z)\,\mu(dz)$, the bound for $\int S_{n,1}'(z)\,\mu(dz)$ is

$$\int S_{n,1}'(z)\,\mu(dz) \le M c_1 \left(\frac{1}{n}\sum_{i=1}^n \left|\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right|\right).$$

The above line follows from Lemma B.1 in the appendix and the fact that $|\Delta_i Y_i| \le M$, for all $i = 1, \dots, n$ ($c_1$ as in Lemma B.1). For the term $\int S_{n,2}'(z)\,\mu(dz)$, we have

$$\int S_{n,2}'(z)\,\mu(dz) \le M \max_{1\le i\le n}\left|\frac{1}{\hat\eta(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right| \int \left|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\right|\,\mu(dz).$$

Therefore, by the same intermediate steps that lead to (3.13)–(3.15), $P_{n,3}$ can be bounded as follows:

\begin{align*}
P_{n,3} &\le P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon\,\eta_0^{p+1}}{2^{2p+2}\,3^{p-1} M^p c_1}\Big\} \qquad (3.16) \\
&\quad + P\Big\{\int \Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\,\mu(dz) > \frac{\epsilon\,\eta_0^{p}}{2^{2p+1}\,3^{p} M^p}\Big\} \qquad (3.17) \\
&\quad + P\Big\{\bigcup_{i=1}^n \Big[\hat\eta(X_i,Y_i) < \frac{\eta_0}{2}\Big]\Big\}. \qquad (3.18)
\end{align*}

Therefore, in view of (3.13)–(3.18), we find

\begin{align*}
P_{n,1} + P_{n,3} &\le 2\sum_{i=1}^n P\Big\{|\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon\,\eta_0^{p+1}}{2^{2p+2}\,3^{p-1} M^p c_1}\Big\} \\
&\quad + 2P\Big\{\int \Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\,\mu(dz) > \frac{\epsilon\,\eta_0^{p}}{2^{2p+1}\,3^{p} M^p}\Big\} + 2\sum_{i=1}^n P\Big\{\hat\eta(X_i,Y_i) < \frac{\eta_0}{2}\Big\} \\
&:= T_{n,1} + T_{n,2} + T_{n,3}, \text{ (say)}. \qquad (3.19)
\end{align*}

First note that upon taking $Y_i = 1$ for $i = 1, \dots, n$ and $m(z) = 1$ in Lemma B.2, we find

$$T_{n,2} \le 2e^{-n a_3}, \qquad (3.20)$$

where $a_3 \equiv a_3(\epsilon) = \epsilon^2\eta_0^{2p}/(2^{4p+8}\,3^{2p}\,c_1^2 M^{2p})$. To deal with the term $T_{n,1}$, define the following terms:

$$\Phi(X_i,Y_i) = \eta^*(X_i,Y_i)\, f(X_i)\, P(Y = Y_i \mid X = X_i), \qquad \hat\Phi(X_i,Y_i) = \frac{1}{(n-1)\lambda_n^d}\sum_{j=1,\,j\neq i}^n \Delta_j\, I\{Y_i = Y_j\}\, H\Big(\frac{X_i - X_j}{\lambda_n}\Big), \qquad (3.22)$$

$$\Psi(X_i,Y_i) = f(X_i)\, P(Y = Y_i \mid X = X_i), \qquad \hat\Psi(X_i,Y_i) = \frac{1}{(n-1)\lambda_n^d}\sum_{j=1,\,j\neq i}^n I\{Y_i = Y_j\}\, H\Big(\frac{X_i - X_j}{\lambda_n}\Big), \qquad (3.23)$$

where $\lambda_n$ and $H$ are as in (3.6). Notice that

$$\frac{\Phi(X_i,Y_i)}{\Psi(X_i,Y_i)} = \eta^*(X_i,Y_i), \qquad \frac{\hat\Phi(X_i,Y_i)}{\hat\Psi(X_i,Y_i)} = \hat\eta(X_i,Y_i), \qquad \text{and} \qquad \left|\frac{\hat\Phi(X_i,Y_i)}{\hat\Psi(X_i,Y_i)}\right| \le 1.$$

Using the above definitions, observe that

$$|\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| = \left|\frac{\hat\Phi(X_i,Y_i)}{\hat\Psi(X_i,Y_i)} - \frac{\Phi(X_i,Y_i)}{\Psi(X_i,Y_i)}\right| \le \frac{\big|\hat\Psi(X_i,Y_i) - \Psi(X_i,Y_i)\big|}{\Psi(X_i,Y_i)} + \frac{\big|\hat\Phi(X_i,Y_i) - \Phi(X_i,Y_i)\big|}{\Psi(X_i,Y_i)}.$$
Now, let $0 < f_0 := \inf_x f(x)$ (see condition A3); in view of this we have

$$\Psi(X_i,Y_i) \ge f_0 \inf_{y\in\mathcal{Y}} P(Y = y) =: \psi_0 > 0. \qquad (3.21)$$

Therefore,

$$T_{n,1} \le 2\sum_{i=1}^n P\Big\{\big|\hat\Psi(X_i,Y_i) - \Psi(X_i,Y_i)\big| > t\Big\} + 2\sum_{i=1}^n P\Big\{\big|\hat\Phi(X_i,Y_i) - \Phi(X_i,Y_i)\big| > t\Big\},$$

where $t = \epsilon\,\eta_0^{p+1}\psi_0/(3^{p-1} M^p c_1 2^{2p+3}) > 0$, with $\psi_0$ as in (3.21). Now, put

$$\Gamma_j(X_i,Y_i) = \lambda_n^{-d}\left\{\Delta_j\, I\{Y_i = Y_j\}\, H\Big(\frac{X_i - X_j}{\lambda_n}\Big) - E\left[\Delta_j\, I\{Y_i = Y_j\}\, H\Big(\frac{X_i - X_j}{\lambda_n}\Big)\,\Big|\, X_i, Y_i\right]\right\},$$

for $i = 1, \dots, n$, $j = 1, \dots, n$, $j \neq i$, and observe that

\begin{align*}
P\Big\{\big|\hat\Phi(X_i,Y_i) - \Phi(X_i,Y_i)\big| > t\Big\}
&= P\Big\{\big|\hat\Phi(X_i,Y_i) \pm E[\hat\Phi(X_i,Y_i)\mid X_i,Y_i] - \Phi(X_i,Y_i)\big| > t\Big\} \\
&\le P\Big\{\big|\hat\Phi(X_i,Y_i) - E[\hat\Phi(X_i,Y_i)\mid X_i,Y_i]\big| + \frac{t}{2} > t\Big\} \quad (\text{for large enough } n, \text{ by Lemma B.3 in the appendix}) \\
&= E\left[P\Big\{\big|\hat\Phi(X_i,Y_i) - E[\hat\Phi(X_i,Y_i)\mid X_i,Y_i]\big| > \frac{t}{2}\,\Big|\, X_i, Y_i\Big\}\right] \\
&= E\left[P\Big\{\Big|(n-1)^{-1}\sum_{j=1,\,j\neq i}^n \Gamma_j(X_i,Y_i)\Big| > \frac{t}{2}\,\Big|\, X_i, Y_i\Big\}\right].
\end{align*}

But, conditional on $(X_i, Y_i)$, the terms $\Gamma_j(X_i,Y_i)$, $j = 1, \dots, n$, $j \neq i$, are independent zero-mean random variables, bounded between $-\lambda_n^{-d}\|H\|_\infty$ and $+\lambda_n^{-d}\|H\|_\infty$. Also, observe that $\mathrm{Var}[\Gamma_j(X_i,Y_i)\mid X_i,Y_i] \le E[\Gamma_j^2(X_i,Y_i)\mid X_i,Y_i] \le \lambda_n^{-d}\|H\|_\infty\|f\|_\infty$. Therefore, by an application of Bernstein's inequality (Bernstein 1946), we have

$$P\Big\{\Big|(n-1)^{-1}\sum_{j=1,\,j\neq i}^n \Gamma_j(X_i,Y_i)\Big| > \frac{t}{2}\,\Big|\, X_i, Y_i\Big\} \le 2\exp\left\{\frac{-(n-1)\,(t/2)^2}{2\big[\lambda_n^{-d}\|H\|_\infty\|f\|_\infty + \frac{1}{3}\lambda_n^{-d}\|H\|_\infty\,\frac{t}{2}\big]}\right\} \le 2\exp\left\{\frac{-(n-1)\lambda_n^d\,\epsilon^2\eta_0^{2p+2}\psi_0^2}{2^{4p+12}\,3^{2p-2}\,c_1^2 M^{2p}\|H\|_\infty\big[\|f\|_\infty + \psi_0/12\big]}\right\},$$

for $n$ large enough. The last line uses the fact that, in bounding $P\{|\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| > \epsilon\eta_0^{p+1}/(2^{2p+2}3^{p-1}c_1 M^p)\}$, we only need to consider $\epsilon \in \big(0,\ 2^{2p+2}3^{p-1}c_1 M^p/\eta_0^{p+1}\big)$, because $|\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| \le 1$; this keeps $t/6 \le \psi_0/12$. Similarly, since $\hat\Psi(X_i,Y_i)$ is a special case of $\hat\Phi(X_i,Y_i)$, with $\Delta_j = 1$ for $j = 1, \dots, n$ (compare (3.22) with (3.23)), one finds

$$P\Big\{\big|\hat\Psi(X_i,Y_i) - \Psi(X_i,Y_i)\big| > t\Big\} \le 2\exp\left\{\frac{-(n-1)\lambda_n^d\,\epsilon^2\eta_0^{2p+2}\psi_0^2}{2^{4p+12}\,3^{2p-2}\,c_1^2 M^{2p}\|H\|_\infty\big[\|f\|_\infty + \psi_0/12\big]}\right\},$$

for $n$ large enough.
Combining the above two bounds, we obtain

$$T_{n,1} \le 8n\,e^{-(n-1)\lambda_n^d a_4}, \qquad (3.24)$$

where $a_4 \equiv a_4(\epsilon) = \dfrac{\epsilon^2\eta_0^{2p+2}\psi_0^2}{2^{4p+12}\,3^{2p-2}\,c_1^2 M^{2p}\|H\|_\infty\big[\|f\|_\infty + \psi_0/12\big]}$.

Finally, to deal with the remaining term $T_{n,3}$, notice that

\begin{align*}
P\Big\{\hat\eta(X_i,Y_i) < \frac{\eta_0}{2}\Big\} &= P\Big\{\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i) < \frac{\eta_0}{2} - \eta^*(X_i,Y_i)\Big\} \\
&\le P\Big\{\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i) < \frac{-\eta_0}{2}\Big\} \le P\Big\{|\hat\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| \ge \frac{\eta_0}{2}\Big\}.
\end{align*}

Therefore, employing the arguments that lead to the derivation of the bound on $T_{n,1}$, we have

$$T_{n,3} \le 8n\,e^{-(n-1)\lambda_n^d a_5}, \qquad (3.25)$$

where $a_5 = \eta_0^2\psi_0^2/\big(2^7\|H\|_\infty[\|f\|_\infty + \psi_0/12]\big)$, which does not depend on $\epsilon$ or $n$. Therefore, in view of (3.19), (3.20), (3.24), and (3.25), one finds

$$P_{n,1} + P_{n,3} \le 2e^{-\frac{n\epsilon^2\eta_0^{2p}}{2^{4p+8}3^{2p}c_1^2 M^{2p}}} + 8n\,e^{-(n-1)\lambda_n^d \frac{\epsilon^2\eta_0^{2p+2}\psi_0^2}{2^{4p+12}3^{2p-2}c_1^2 M^{2p}\|H\|_\infty[\|f\|_\infty+\psi_0/12]}} + 8n\,e^{-(n-1)\lambda_n^d \frac{\eta_0^2\psi_0^2}{2^7\|H\|_\infty[\|f\|_\infty+\psi_0/12]}}.$$

This completes the proof of Theorem 3.2.

3.1.2 A least-squares based estimator of the selection probability

Suppose it is known in advance that the selection probability belongs to a given (known) class of functions. In this case the least-squares estimator is an alternative to the kernel estimator of $\eta^*(x,y)$. This works as follows. Let $\eta^*$ belong to a known class of functions $\mathcal{M}$ of the form $\eta : \mathbb{R}^d \times \mathbb{R} \to [\eta_0, 1]$, where $\eta_0 = \inf_{x,y} P(\Delta = 1 \mid X = x, Y = y) > 0$, as described in assumption A1. The least-squares estimator of the function $\eta^*$ is

$$\hat\eta_{LS} = \mathop{\mathrm{argmin}}_{\eta\in\mathcal{M}}\ \frac{1}{n}\sum_{i=1}^n \big(\Delta_i - \eta(X_i,Y_i)\big)^2. \qquad (3.26)$$

Upon replacing $\eta^*(X_i,Y_i)$ in (3.3) with the least-squares estimator in (3.26), we obtain the revised regression estimator

$$\hat m_{\hat\eta_{LS}}(z) = \frac{\sum_{i=1}^n \frac{\Delta_i Y_i}{\hat\eta_{LS}(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}{\sum_{i=1}^n \frac{\Delta_i}{\hat\eta_{LS}(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}. \qquad (3.27)$$

To study the performance of $\hat m_{\hat\eta_{LS}}(z)$, we employ results from empirical process theory (see, for example, van der Vaart and Wellner (1996), p. 83, Pollard (1984), p. 25, or Györfi et al. (2002), p. 134).
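When the class $\mathcal{M}$ is finite — the simplest case covered by (3.26) — the least-squares estimator reduces to picking the candidate with the smallest empirical squared error. The following is a minimal illustrative sketch (it is not the thesis's R code; the constant-function class is an assumption made for the example):

```python
import numpy as np

def ls_selection_estimator(candidates, X, Y, Delta):
    """Least-squares estimator (3.26): among candidate selection-probability
    functions eta: (X, Y) -> [eta0, 1], return the one minimizing
    (1/n) * sum_i (Delta_i - eta(X_i, Y_i))^2."""
    losses = [np.mean((Delta - eta(X, Y)) ** 2) for eta in candidates]
    return candidates[int(np.argmin(losses))]

def constant_eta(p):
    """Illustrative class member: a constant selection probability p."""
    return lambda X, Y: np.full(np.shape(Y), p)
```

With $\Delta_i$ drawn i.i.d. Bernoulli(0.8) and candidates $\{0.2, 0.4, 0.6, 0.8, 1.0\}$, the empirical squared error $(\Delta - p)^2$ is minimized at the grid point nearest the sample mean of $\Delta$, so the selected candidate is the constant $0.8$; the fitted $\hat\eta_{LS}$ then plugs into (3.27) exactly as $\hat\eta$ did in (3.8).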
We say that $\mathcal{M}$ is totally bounded with respect to the empirical $L_1$ norm if for every $\epsilon > 0$ there exists a finite subclass of functions $\mathcal{M}_\epsilon = \{\eta_1, \dots, \eta_N\}$ such that for every $\eta \in \mathcal{M}$ there exists an $\eta^\dagger \in \mathcal{M}_\epsilon$ with the property that, for the fixed points $(x_1,y_1), \dots, (x_n,y_n)$,

$$\frac{1}{n}\sum_{i=1}^n \big|\eta(x_i,y_i) - \eta^\dagger(x_i,y_i)\big| < \epsilon.$$

The subclass $\mathcal{M}_\epsilon$ is called an $\epsilon$-cover of $\mathcal{M}$. The cardinality of the smallest such cover is called the $\epsilon$-covering number of $\mathcal{M}$ and is denoted by $N_1(\epsilon, \mathcal{M}, D_n)$.

Theorem 3.3. Let $\hat m_{\hat\eta_{LS}}(z)$ be as in (3.27), where $\hat\eta_{LS}$ is the least-squares estimator of $\eta^*$ described in (3.26), with $\eta^*$ a member of a known class of functions $\mathcal{M}$ that is totally bounded with respect to the empirical $L_1$ norm. Also, suppose that condition A1 holds and let $K$ in (3.27) be a regular kernel with smoothing parameter $h_n$. Then, provided that $h_n \to 0$ and $nh_n^{d+s} \to \infty$ as $n \to \infty$, and $|Y| \le M < \infty$, we have for every $\epsilon > 0$, every $p \in [1, \infty)$, and $n$ large enough,

$$P\left\{\int |\hat m_{\hat\eta_{LS}}(z) - m(z)|^p\,\mu(dz) > \epsilon\right\} \le 8e^{-nb_1} + 2e^{-nb_2} + 16\,E\big[N_1(b_3, \mathcal{M}, D_n)\big]\,e^{-nb_4} + 16\,E\big[N_1(b_5, \mathcal{M}, D_n)\big]\,e^{-nb_6}.$$

Here the constants $b_1, \dots, b_6$ (note that no attempt was made to find the optimal constants) are defined as

$$b_1 \equiv b_1(\epsilon) = \min\left\{\frac{\epsilon^2\eta_0^{2p}}{2^{4p+7} M^{2p}(1+\eta_0)^{2p-2}(1+c_1)},\ \frac{\epsilon\,\eta_0^{p}}{2^{2p+5} M^{p}(1+\eta_0)^{p-1}(1+c_1)}\right\},$$

$$b_2 \equiv b_2(\epsilon) = \frac{\epsilon^2\eta_0^{2p}}{2^{6p+8} M^{2p} c_1^2}, \qquad b_3 \equiv b_3(\epsilon) = \frac{\epsilon\,\eta_0^{p+1}}{2^{3p+3} M^{p} c_1}, \qquad b_4 \equiv b_4(\epsilon) = \frac{\epsilon^2\eta_0^{2p+2}}{2^{6p+7} M^{2p} c_1^2},$$

$$b_5 \equiv b_5(\epsilon) = \frac{\epsilon^2\eta_0^{2p+2}}{2^{6p+4} M^{2p} c_1^2}, \qquad b_6 \equiv b_6(\epsilon) = \frac{\epsilon^4\eta_0^{4p+4}}{2^{12p+6} M^{4p} c_1^4}.$$

Remark. Combining an application of the Borel–Cantelli lemma with the conclusion of Theorem 3.3 yields $E\big[\,|\hat m_{\hat\eta_{LS}}(Z) - m(Z)|^p \,\big|\, D_n\big] \stackrel{a.s.}{\to} 0$, provided that $\log\big(E\big[N_1(b_3 \wedge b_5, \mathcal{M}, D_n)\big]\big)/n \to 0$ as $n \to \infty$.
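The covering-number definition above can be made concrete for a finite class. The greedy procedure below (an illustration, not from the thesis) returns a valid $\epsilon$-cover under the empirical $L_1$ norm; since any valid cover is at least as large as the smallest one, its cardinality upper-bounds $N_1(\epsilon, \mathcal{M}, D_n)$.

```python
import numpy as np

def greedy_l1_cover(fvals, eps):
    """Greedy epsilon-cover under the empirical L1 norm.
    fvals[k] holds the values (f_k(x_1,y_1), ..., f_k(x_n,y_n)) of the k-th
    function in a finite class.  Returns indices of cover centers; every
    function ends up within eps (empirical L1) of some returned center, so
    len(result) upper-bounds the covering number N_1(eps, ., D_n)."""
    m = len(fvals)
    covered = np.zeros(m, dtype=bool)
    centers = []
    while not covered.all():
        k = int(np.argmin(covered))          # first still-uncovered function
        centers.append(k)
        dists = np.mean(np.abs(fvals - fvals[k]), axis=1)  # empirical L1 distances
        covered |= dists < eps
    return centers
```

For the three constant functions $\{0, 0.5, 1\}$ on any $n$ points, a radius of $0.6$ needs two centers (picking $0$ first covers $0$ and $0.5$, then $1$ covers itself), while a radius of $0.1$ forces each function to be its own center.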
PROOF OF THEOREM 3.3

First note that, as in the proof of Theorem 3.2, we put

$$\tilde m_{\hat\eta_{LS}}(z) = \frac{\sum_{i=1}^n \frac{\Delta_i Y_i}{\hat\eta_{LS}(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}{\sum_{i=1}^n K\big(\tfrac{z-Z_i}{h_n}\big)}, \quad\text{and}\quad \bar m_{\hat\eta_{LS}}(z) = \frac{\sum_{i=1}^n \frac{\Delta_i}{\hat\eta_{LS}(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}{\sum_{i=1}^n K\big(\tfrac{z-Z_i}{h_n}\big)}. \qquad (3.28)$$

Then, upon observing that $|\tilde m_{\hat\eta_{LS}}(z)/\bar m_{\hat\eta_{LS}}(z)| \le M$, one obtains

$$|\hat m_{\hat\eta_{LS}}(z) - m(z)| \le M\,|\bar m_{\hat\eta_{LS}}(z) - 1| + |\tilde m_{\hat\eta_{LS}}(z) - m(z)|.$$

Therefore, with $\tilde m_{\hat\eta_{LS}}$ and $\bar m_{\hat\eta_{LS}}$ as in (3.28), and $\tilde m_{\eta^*}$ and $\bar m_{\eta^*}$ as in (3.4), we have for every $\epsilon > 0$ (using the fact that $\hat\eta_{LS} \ge \eta_0$, by the definition of the class $\mathcal{M}$)

\begin{align*}
P\Big\{\int |\hat m_{\hat\eta_{LS}}(z) - m(z)|^p\,\mu(dz) > \epsilon\Big\}
&\le P\Big\{2^{2p-2} M^p \Big(\frac{1}{\eta_0} + \frac{1}{\eta_0}\Big)^{p-1} \int |\bar m_{\hat\eta_{LS}}(z) - \bar m_{\eta^*}(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} M^p \Big(\frac{1}{\eta_0} + 1\Big)^{p-1} \int |\bar m_{\eta^*}(z) - 1|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} \Big(\frac{M}{\eta_0} + \frac{M}{\eta_0}\Big)^{p-1} \int |\tilde m_{\hat\eta_{LS}}(z) - \tilde m_{\eta^*}(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&\quad + P\Big\{2^{2p-2} \Big(\frac{M}{\eta_0} + M\Big)^{p-1} \int |\tilde m_{\eta^*}(z) - m(z)|\,\mu(dz) > \frac{\epsilon}{4}\Big\} \\
&:= Q_{n,1} + Q_{n,2} + Q_{n,3} + Q_{n,4}, \text{ (say)}.
\end{align*}

As a result of Theorem 3.1, we have

$$Q_{n,2} + Q_{n,4} \le 8e^{-n b_1}, \qquad (3.29)$$

where $b_1 \equiv b_1(\epsilon)$ is as in the statement of the theorem. To bound $Q_{n,1}$, first note that, exactly as in the bound on $|\bar m_{\hat\eta}(z) - \bar m_{\eta^*}(z)|$ in the proof of Theorem 3.2,

$$|\bar m_{\hat\eta_{LS}}(z) - \bar m_{\eta^*}(z)| \le \pi_{n,1}(z) + \pi_{n,2}(z), \text{ (say)},$$

where $\pi_{n,1}(z)$ and $\pi_{n,2}(z)$ are the counterparts of $S_{n,1}(z)$ and $S_{n,2}(z)$ with $\hat\eta$ replaced by $\hat\eta_{LS}$. But, using the arguments that lead to (3.11), we have

$$\int \pi_{n,1}(z)\,\mu(dz) \le c_1\left(\frac{1}{n}\sum_{i=1}^n \left|\frac{1}{\hat\eta_{LS}(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right|\right). \qquad (3.30)$$

The term $\int \pi_{n,2}(z)\,\mu(dz)$ can be bounded similarly to (3.12):

$$\int \pi_{n,2}(z)\,\mu(dz) \le \max_{1\le i\le n}\left|\frac{1}{\hat\eta_{LS}(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right| \int\left|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\right|\mu(dz). \qquad (3.31)$$

Now, put $\epsilon_1 = \epsilon\,\eta_0^{p-1}/[2^{3p-1} M^p]$; then combining (3.30) and (3.31) yields

\begin{align*}
Q_{n,1} &\le P\Big\{\int (\pi_{n,1}(z) + \pi_{n,2}(z))\,\mu(dz) > \epsilon_1\Big\} \\
&\le P\Big\{\frac{c_1}{n}\sum_{i=1}^n \frac{|\hat\eta_{LS}(X_i,Y_i) - \eta^*(X_i,Y_i)|}{\eta^*(X_i,Y_i)\,\hat\eta_{LS}(X_i,Y_i)} > \frac{\epsilon_1}{2}\Big\} \\
&\quad + P\Big\{\max_{1\le i\le n}\Big|\frac{1}{\eta^*(X_i,Y_i)} - \frac{1}{\hat\eta_{LS}(X_i,Y_i)}\Big| \int\Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\mu(dz) > \frac{\epsilon_1}{2}\Big\} \\
&\le P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon_1\eta_0^2}{2 c_1}\Big\} \qquad (3.32) \\
&\quad + P\Big\{\int\Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\mu(dz) > \frac{\epsilon_1\eta_0}{4}\Big\}. \qquad (3.33)
\end{align*}

To bound $Q_{n,3}$ we use the arguments employed for $Q_{n,1}$, first observing that $|\tilde m_{\hat\eta_{LS}}(z) - \tilde m_{\eta^*}(z)| \le \pi_{n,1}'(z) + \pi_{n,2}'(z)$, where, as in (3.11) and (3.12),

$$\int \pi_{n,1}'(z)\,\mu(dz) \le M c_1\left(\frac{1}{n}\sum_{i=1}^n \left|\frac{1}{\hat\eta_{LS}(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right|\right), \qquad (3.34)$$

$$\int \pi_{n,2}'(z)\,\mu(dz) \le M \max_{1\le i\le n}\left|\frac{1}{\hat\eta_{LS}(X_i,Y_i)} - \frac{1}{\eta^*(X_i,Y_i)}\right| \int\left|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\right|\mu(dz). \qquad (3.35)$$

Therefore, with $\epsilon_1 = \epsilon\,\eta_0^{p-1}/[2^{3p-1} M^p]$ as before, the bound for $Q_{n,3}$ follows similarly to that of $Q_{n,1}$:

\begin{align*}
Q_{n,3} &\le P\Big\{\int (\pi_{n,1}'(z) + \pi_{n,2}'(z))\,\mu(dz) > M\epsilon_1\Big\} \\
&\le P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon_1\eta_0^2}{2 c_1}\Big\} \qquad (3.36) \\
&\quad + P\Big\{\int\Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\mu(dz) > \frac{\epsilon_1\eta_0}{4}\Big\}. \qquad (3.37)
\end{align*}

Thus, by (3.32), (3.33), (3.36), and (3.37), it follows that

\begin{align*}
Q_{n,1} + Q_{n,3} &\le 2P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \eta^*(X_i,Y_i)| > \frac{\epsilon_1\eta_0^2}{2 c_1}\Big\} + 2P\Big\{\int\Big|\frac{\sum_{i=1}^n K\big(\frac{z-Z_i}{h_n}\big)}{n E\big[K\big(\frac{z-Z}{h_n}\big)\big]} - 1\Big|\mu(dz) > \frac{\epsilon_1\eta_0}{4}\Big\} \\
&:= U_{n,1} + U_{n,2}, \text{ (say)}. \qquad (3.38)
\end{align*}

Observe that, upon taking $Y_i = 1$ for $i = 1, \dots, n$ and $m(z) = 1$ in Lemma B.2, one obtains

$$U_{n,2} \le 2e^{-n b_2}, \qquad (3.39)$$

where $b_2 \equiv b_2(\epsilon) = \epsilon^2\eta_0^{2p}/(2^{6p+8} M^{2p} c_1^2)$. To deal with the term $U_{n,1}$, observe that

\begin{align*}
U_{n,1} &\le 2P\Big\{\frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \eta^*(X_i,Y_i)| - E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|\,\big|\,D_n\big] > \frac{\epsilon_1\eta_0^2}{4 c_1}\Big\} \qquad (3.40) \\
&\quad + 2P\Big\{E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|\,\big|\,D_n\big] > \frac{\epsilon_1\eta_0^2}{4 c_1}\Big\}. \qquad (3.41)
\end{align*}

To handle the term (3.40), first put $\epsilon_2 = \epsilon_1\eta_0^2/(4 c_1) > 0$ and let the class of functions $\mathcal{G}$ be defined as $\mathcal{G} = \{g_\eta(x,y) = |\eta(x,y) - \eta^*(x,y)| \,:\, \eta \in \mathcal{M}\}$. Now observe that for any $g_\eta, g_{\eta^\dagger} \in \mathcal{G}$ one has

$$\frac{1}{n}\sum_{i=1}^n \big|g_\eta(X_i,Y_i) - g_{\eta^\dagger}(X_i,Y_i)\big| = \frac{1}{n}\sum_{i=1}^n \Big|\,|\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| - |\eta^\dagger(X_i,Y_i) - \eta^*(X_i,Y_i)|\,\Big| \le \frac{1}{n}\sum_{i=1}^n \big|\eta(X_i,Y_i) - \eta^\dagger(X_i,Y_i)\big|.$$

Therefore, it follows that if $\mathcal{M}_{\epsilon_2} = \{\eta_1, \dots, \eta_{N_2}\}$ is a minimal $\epsilon_2$-cover of $\mathcal{M}$ with respect to the empirical $L_1$ norm, the class of functions $\mathcal{G}_{\epsilon_2} = \{g_{\eta_1}, \dots, g_{\eta_{N_2}}\}$ will be an $\epsilon_2$-cover of $\mathcal{G}$. Furthermore, the $\epsilon_2$-covering numbers of $\mathcal{G}$ and $\mathcal{M}$ satisfy $N_1(\epsilon_2, \mathcal{G}, D_n) \le N_1(\epsilon_2, \mathcal{M}, D_n)$. Upon employing standard results from empirical process theory (see Pollard (1984), p. 25, or Theorem 9.1 of Györfi et al. (2002), p. 136), the upper bound for (3.40) follows:

$$(3.40) \le 2P\Big\{\sup_{\eta\in\mathcal{M}}\Big|\frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \eta^*(X_i,Y_i)| - E|\eta(X,Y) - \eta^*(X,Y)|\Big| > \frac{\epsilon_1\eta_0^2}{4 c_1}\Big\} \le 16\,E\big[N_1(b_3, \mathcal{M}, D_n)\big]\,e^{-n b_4}, \qquad (3.42)$$

where $b_3 \equiv b_3(\epsilon) = \epsilon\eta_0^{p+1}/(2^{3p+3} M^p c_1)$ and $b_4 \equiv b_4(\epsilon) = \epsilon^2\eta_0^{2p+2}/(2^{6p+7} M^{2p} c_1^2)$.

To deal with (3.41), note that by the Cauchy–Schwarz inequality we have

$$(3.41) \le 2P\Big\{\sqrt{E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|^2\,\big|\,D_n\big]} > \frac{\epsilon_1\eta_0^2}{4 c_1}\Big\} = 2P\Big\{E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|^2\,\big|\,D_n\big] > \frac{\epsilon_1^2\eta_0^4}{16 c_1^2}\Big\}. \qquad (3.43)$$

Since

$$E\big[|\hat\eta_{LS}(X,Y) - \Delta|^2\,\big|\,D_n\big] = \inf_{\eta\in\mathcal{M}} E|\eta(X,Y) - \Delta|^2 + E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|^2\,\big|\,D_n\big],$$

we find

$$E\big[|\hat\eta_{LS}(X,Y) - \eta^*(X,Y)|^2\,\big|\,D_n\big] = E\big[|\hat\eta_{LS}(X,Y) - \Delta|^2\,\big|\,D_n\big] - \inf_{\eta\in\mathcal{M}} E|\eta(X,Y) - \Delta|^2.$$

Now, observe that

\begin{align*}
&E\big[|\hat\eta_{LS}(X,Y) - \Delta|^2\,\big|\,D_n\big] - \inf_{\eta\in\mathcal{M}} E|\eta(X,Y) - \Delta|^2 \\
&= \sup_{\eta\in\mathcal{M}}\Big\{ E\big[|\hat\eta_{LS}(X,Y) - \Delta|^2\,\big|\,D_n\big] - \frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \Delta_i|^2 \\
&\qquad + \frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \Delta_i|^2 - \frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 + \frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 - E|\eta(X,Y) - \Delta|^2 \Big\} \\
&\le \Big( E\big[|\hat\eta_{LS}(X,Y) - \Delta|^2\,\big|\,D_n\big] - \frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \Delta_i|^2 \Big) + \sup_{\eta\in\mathcal{M}}\Big\{\frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 - E|\eta(X,Y) - \Delta|^2\Big\} \\
&\qquad \Big(\text{by the fact that } \frac{1}{n}\sum_{i=1}^n |\hat\eta_{LS}(X_i,Y_i) - \Delta_i|^2 - \frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 \le 0,\ \forall\eta\in\mathcal{M}\Big) \\
&\le 2\sup_{\eta\in\mathcal{M}}\Big|\frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 - E|\eta(X,Y) - \Delta|^2\Big|. \qquad (3.44)
\end{align*}

To bound (3.44), let $\mathcal{F} = \{f_\eta = |\eta(x,y) - \Delta|^2 \,:\, \eta \in \mathcal{M}\}$ and put $\epsilon_3 = \epsilon_1^2\eta_0^4/(2^5 c_1^2)$; then for any $f_\eta, f_{\eta^{\dagger\dagger}} \in \mathcal{F}$, observe that

\begin{align*}
\frac{1}{n}\sum_{i=1}^n \big|f_\eta(X_i,Y_i) - f_{\eta^{\dagger\dagger}}(X_i,Y_i)\big|
&= \frac{1}{n}\sum_{i=1}^n \big|(\eta(X_i,Y_i) - \Delta_i) - (\eta^{\dagger\dagger}(X_i,Y_i) - \Delta_i)\big| \cdot \big|(\eta(X_i,Y_i) - \Delta_i) + (\eta^{\dagger\dagger}(X_i,Y_i) - \Delta_i)\big| \\
&\le 2\cdot\frac{1}{n}\sum_{i=1}^n \big|\eta(X_i,Y_i) - \eta^{\dagger\dagger}(X_i,Y_i)\big|.
\end{align*}

In other words, if $\mathcal{M}_{\epsilon_3/2} = \{\eta_1, \dots, \eta_{N_3}\}$ is a minimal $\epsilon_3/2$-cover of $\mathcal{M}$ with respect to the empirical $L_1$ norm, then $\mathcal{F}_{\epsilon_3} = \{f_{\eta_1}, \dots, f_{\eta_{N_3}}\}$ is an $\epsilon_3$-cover of $\mathcal{F}$. Additionally, one has $N_1(\epsilon_3, \mathcal{F}, D_n) \le N_1(\epsilon_3/2, \mathcal{M}, D_n)$; thus, by standard results from empirical process theory (see, for example, Pollard (1984), p. 25, or Theorem 9.1 of Györfi et al. (2002), p. 136), one has the following bound for (3.43):

$$(3.43) \le 2P\Big\{\sup_{\eta\in\mathcal{M}}\Big|\frac{1}{n}\sum_{i=1}^n |\eta(X_i,Y_i) - \Delta_i|^2 - E|\eta(X,Y) - \Delta|^2\Big| > \frac{\epsilon_1^2\eta_0^4}{2^5 c_1^2}\Big\} \le 16\,E\big[N_1(b_5, \mathcal{M}, D_n)\big]\,e^{-n b_6}, \qquad (3.45)$$

where $b_5 \equiv b_5(\epsilon) = \epsilon^2\eta_0^{2p+2}/(2^{6p+4} M^{2p} c_1^2)$ and $b_6 \equiv b_6(\epsilon) = \epsilon^4\eta_0^{4p+4}/(2^{12p+6} M^{4p} c_1^4)$. Therefore, in view of (3.38), (3.39), (3.42), and (3.45), we have

$$Q_{n,1} + Q_{n,3} \le 2e^{-\frac{n\epsilon^2\eta_0^{2p}}{2^{6p+8} M^{2p} c_1^2}} + 16\,E\Big[N_1\Big(\frac{\epsilon\eta_0^{p+1}}{2^{3p+3} M^p c_1}, \mathcal{M}, D_n\Big)\Big]\, e^{-\frac{n\epsilon^2\eta_0^{2p+2}}{2^{6p+7} M^{2p} c_1^2}} + 16\,E\Big[N_1\Big(\frac{\epsilon^2\eta_0^{2p+2}}{2^{6p+4} M^{2p} c_1^2}, \mathcal{M}, D_n\Big)\Big]\, e^{-\frac{n\epsilon^4\eta_0^{4p+4}}{2^{12p+6} M^{4p} c_1^4}}.$$

This completes the proof of Theorem 3.3.

3.2 An application to classification with missing covariates

In this section we consider an application of our previous results to the problem of statistical classification. Let $(Z, Y)$ be an $\mathbb{R}^{d+s} \times \{1, \dots, N\}$-valued random vector, where $Y$ is the class label to be predicted from the explanatory variables $Z = (Z_1, \dots, Z_{d+s})$. In classification one searches for a function $\phi : \mathbb{R}^{d+s} \to \{1, \dots, N\}$ such that the misclassification error probability $L(\phi) = P\{\phi(Z) \neq Y\}$ is as small as possible. Let $\pi_k(z) = P\{Y = k \mid Z = z\}$, $z \in \mathbb{R}^{d+s}$, $1 \le k \le N$; then the best classifier (i.e., the one that minimizes $L(\phi)$) is called the Bayes classifier and is given by (see, for example, Devroye and Györfi (1985, pp. 253–254))

$$\phi_B(z) = \mathop{\mathrm{argmax}}_{1\le k\le N}\ \pi_k(z). \qquad (3.46)$$

We note that the Bayes classifier satisfies $\max_{1\le k\le N} \pi_k(z) = \pi_{\phi_B(z)}(z)$. In practice, the Bayes classifier is unavailable, because the distribution of $(Z, Y)$ is usually unknown. Instead, one only has access to a random sample $D_n = \{(Z_1, Y_1), \dots, (Z_n, Y_n)\}$ from the distribution of $(Z, Y)$. Now, let $\hat\phi_n(Z)$ be any sample-based classifier used to predict $Y$ from the data $D_n$ and $Z$. The misclassification error of this classifier, conditioned on the data, is given by

$$L_n(\hat\phi_n) = P\big\{\hat\phi_n(Z) \neq Y \,\big|\, D_n\big\}.$$

Then it can be shown (see, for example, Devroye and Györfi (1985, p. 254)) that

$$0 \le L_n(\hat\phi_n) - L(\phi_B) \le \sum_{k=1}^N \int |\hat\pi_k(z) - \pi_k(z)|\,\mu(dz). \qquad (3.47)$$

Therefore, to show $L_n(\hat\phi_n) - L(\phi_B) \stackrel{a.s.}{\longrightarrow} 0$, it is sufficient to show that $E\big[\,|\hat\pi_k(Z) - \pi_k(Z)| \,\big|\, D_n\big] \stackrel{a.s.}{\longrightarrow} 0$, as $n \to \infty$, for $k = 1, \dots, N$. Here, we consider the problem of classification with missing covariates, i.e., the case where some subsets of the $Z_i$'s may be unavailable.
For some relevant results along these lines see Mojirsheibani (2012). Define

$$\hat\pi_k(z) = \frac{\sum_{i=1}^n \frac{\Delta_i\, I\{Y_i = k\}}{\breve\eta(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}{\sum_{i=1}^n \frac{\Delta_i}{\breve\eta(X_i,Y_i)}\, K\big(\tfrac{z-Z_i}{h_n}\big)}, \qquad k = 1, \dots, N, \qquad (3.48)$$

where $\breve\eta$ is any estimator of $\eta^*$, and define the classifier

$$\hat\phi_n(z) = \mathop{\mathrm{argmax}}_{1\le k\le N}\ \hat\pi_k(z). \qquad (3.49)$$

The following result summarizes the performance of $\hat\phi_n$.

Theorem 3.4. Let $\hat\phi_n$ be the classifier defined by (3.48) and (3.49).

(i) If $\breve\eta$ in (3.48) is chosen to be the kernel estimator $\hat\eta$ defined in (3.6), then, under the conditions of Theorem 3.2, $\hat\phi_n$ is strongly consistent, i.e., $L_n(\hat\phi_n) - L(\phi_B) \stackrel{a.s.}{\to} 0$, as $n \to \infty$.

(ii) If $\breve\eta$ in (3.48) is chosen to be the least-squares estimator $\hat\eta_{LS}$ defined in (3.26), then, under the conditions of Theorem 3.3, $\hat\phi_n$ is strongly consistent, i.e., $L_n(\hat\phi_n) - L(\phi_B) \stackrel{a.s.}{\to} 0$, as $n \to \infty$.

Proof of Theorem 3.4

(i) Let $\breve\eta$ be the kernel estimator $\hat\eta$ defined in (3.6). Then by (3.47) one has, for every $\epsilon > 0$ and $n$ large enough,

$$P\big\{L_n(\hat\phi_n) - L(\phi_B) > \epsilon\big\} \le \sum_{k=1}^N P\left\{\int |\hat\pi_k(z) - \pi_k(z)|\,\mu(dz) > \frac{\epsilon}{N}\right\}.$$

The proof now follows from an application of Theorem 3.2 and the Borel–Cantelli lemma.

(ii) The proof of part (ii) is similar and will not be given.

3.3 Kernel regression simulation with missing covariates

In this section we study the performance of the estimators $\hat m_{\eta^*}$, $\hat m_{\hat\eta_c}$, and $\hat m_{\hat\eta_{LS}}$. The performance of each estimator is compared with the complete-case estimator of Chapter 2 (denoted by $\hat m_{Co}$). To explore the performance of these estimators, consider the following simulation. A sample of size 100 is taken from the linear model $Y_i = .5V_i + .9X_i + \epsilon_i$, where $V_i$, $X_i$, and $\epsilon_i$ are independent normal random variables with means $\mu_1 = 5$, $\mu_2 = 9$, $\mu_3 = 0$ and standard deviations $\sigma_1 = 1$, $\sigma_2 = 6$, $\sigma_3 = 1$, respectively, for $i = 1, \dots, 100$.
In the simulation, $V_i$ is sometimes missing, with missing probability mechanism

$$P(\Delta_i = 1 \mid X_i, Y_i) = \frac{e^{a|X_i| + b|Y_i|}}{1 + e^{a|X_i| + b|Y_i|}}, \qquad (3.50)$$

where $\Delta_i = 1$ if $V_i$ is available and $\Delta_i = 0$ otherwise, for $i = 1, \dots, n$; the constants $a$ and $b$ are chosen to produce approximately 25% or 75% missingness among the $V_i$'s. The least-squares estimator of the selection probability in the regression estimator $\hat m_{\hat\eta_{LS}}$ is correctly specified, and its coefficients are estimated using the nls method in the R programming language. For the smoothing parameters $h = .2, .4, .6,$ and $.8$ we perform a series of fifty Monte Carlo runs and compute the empirical $L_1$ and $L_2$ errors (defined in (1.15) and (1.16)) for each estimator on an independent sample of 500 complete observations. The smoothing parameter for the kernel regression estimate of the selection probability in (3.7) is selected by the cross-validation method of Racine and Li (2004), as implemented in the R package "np" (Racine and Hayfield (2008)). This method utilizes a conjugate gradient search pattern and returns a multi-dimensional bandwidth; the vector of bandwidths is converted to a single constant by taking its geometric mean.

Approximately 25% missing (entries are means, with standard deviations in parentheses)

                    h = 0.2                        h = 0.4                        h = 0.6                        h = 0.8
                L1             L2              L1             L2              L1             L2              L1             L2
m̂_Co        6.439 (0.450)  66.228 (8.432)  5.123 (0.369)  43.144 (5.790)  4.640 (0.360)  34.740 (5.036)  4.511 (0.341)  32.506 (4.498)
m̂_η*        6.426 (0.449)  66.041 (8.416)  5.111 (0.364)  42.925 (5.717)  4.643 (0.356)  34.775 (4.965)  4.530 (0.336)  32.783 (4.461)
m̂_η̂c        6.180 (0.407)  61.689 (7.404)  4.874 (0.359)  38.808 (5.724)  4.575 (0.358)  33.547 (4.745)  4.508 (0.335)  32.384 (4.322)
m̂_η̂LS       6.409 (0.471)  65.771 (8.767)  5.113 (0.368)  42.925 (5.734)  4.648 (0.356)  34.792 (4.974)  4.540 (0.335)  32.906 (4.519)

Table 3.1: Comparison of empirical L1 and L2 errors of the estimators when Vi is missing at random with missing mechanism as in (3.50). The missing mechanism has parameters a = −2 and b = 1.95; this choice of parameters results in approximately 25% of the Vi's missing.

Approximately 75% missing (entries are means, with standard deviations in parentheses)

                    h = 0.2                        h = 0.4                        h = 0.6                        h = 0.8
                L1             L2              L1             L2              L1             L2              L1             L2
m̂_Co        7.491 (0.725)  85.428 (13.452)  5.672 (0.417)  52.704 (7.001)  4.83 (0.366)   38.086 (5.352)  4.581 (0.357)  33.799 (4.761)
m̂_η*        7.399 (0.706)  83.779 (13.071)  5.600 (0.403)  51.453 (6.767)  4.798 (0.363)  37.536 (5.186)  4.595 (0.358)  33.931 (4.768)
m̂_η̂c        6.915 (0.653)  75.163 (11.928)  5.202 (0.373)  44.349 (6.066)  4.619 (0.373)  34.376 (4.986)  4.524 (0.371)  32.810 (4.832)
m̂_η̂LS       7.391 (0.701)  83.623 (12.984)  5.594 (0.402)  51.353 (6.767)  4.794 (0.365)  37.495 (5.218)  4.594 (0.361)  33.901 (4.788)

Table 3.2: Comparison of empirical L1 and L2 errors of the estimators when Vi is missing at random with missing mechanism as in (3.50). The missing mechanism has parameters a = −1 and b = .7; this choice of parameters results in approximately 75% of the Vi's missing.

In Tables 3.1 and 3.2 one may observe that the kernel regression estimator $\hat m_{\hat\eta_c}$ defined in Section 3.1.1 outperforms the other estimators in every situation. A non-intuitive observation is that the estimator that assumes knowledge of the missing probability mechanism, i.e., $\hat m_{\eta^*}$, performs considerably worse than the estimator $\hat m_{\hat\eta_c}$, which uses an estimate of the missing probability mechanism. This behavior has been observed in the literature; see, for example, Robins and Rotnitzky (1995). In almost all cases the complete-case estimator performs worse than each of our estimators.
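A simplified replication of the study above can be sketched as follows. This is illustrative Python rather than the thesis's R code: a single run, a fixed Gaussian kernel, and the true selection probability $\eta^*$ in place of the cross-validated kernel and nls fits are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def draw(n, a=-1.0, b=0.7):
    """One sample from the model Y = .5V + .9X + e, with MAR-missing V per (3.50)."""
    V = rng.normal(5.0, 1.0, n)
    X = rng.normal(9.0, 6.0, n)
    Y = 0.5 * V + 0.9 * X + rng.normal(0.0, 1.0, n)
    t = np.clip(a * np.abs(X) + b * np.abs(Y), -30, 30)
    eta = 1.0 / (1.0 + np.exp(-t))            # selection probability (3.50)
    Delta = rng.binomial(1, eta)              # Delta = 1 iff V is observed
    return np.column_stack([V, X]), Y, Delta, eta

def nw(z, Zs, Ys, w, h):
    """Weighted Nadaraya-Watson estimate at the point z."""
    K = np.exp(-0.5 * np.sum(((z - Zs) / h) ** 2, axis=1))
    den = (w * K).sum()
    return (w * Ys * K).sum() / den if den > 0 else Ys.mean()

Ztr, Ytr, Dtr, eta_tr = draw(100)             # training sample of size 100
Zte, Yte, _, _ = draw(500)                    # evaluation on complete observations
h = 0.6
m_true = 0.5 * Zte[:, 0] + 0.9 * Zte[:, 1]    # true regression function
w_cc = Dtr.astype(float)                      # complete-case weights (m_hat_Co)
w_ipw = Dtr / eta_tr                          # inverse-probability weights (m_hat_eta*)
L1_cc = np.mean([abs(nw(z, Ztr, Ytr, w_cc, h) - m) for z, m in zip(Zte, m_true)])
L1_ipw = np.mean([abs(nw(z, Ztr, Ytr, w_ipw, h) - m) for z, m in zip(Zte, m_true)])
```

Averaging `L1_cc` and `L1_ipw` over many Monte Carlo runs reproduces the qualitative pattern of Tables 3.1–3.2; a single run is noisy, so no ordering between the two errors is asserted here.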
4 Kernel regression estimation with incomplete response

This chapter explores the construction of kernel regression estimators in which the response variable may be missing at random. The results in this chapter mirror those of the missing-covariates case in Chapter 3. We derive exponential upper bounds for the $L_p$ norms of our proposed estimators, yielding almost sure convergence of the $L_p$ statistic to 0. At the conclusion of this chapter we once again apply the results to nonparametric classification, constructing kernel classifiers from partially labeled data.

4.1 Main results

We propose kernel estimators of an unknown regression function $m(z) = E[Y \mid Z = z]$ in the presence of missing responses. Our estimators weigh the complete cases, i.e., the observed $Y_i$'s, by the inverse of the missing data probabilities. First consider the simple, but unrealistic, case where the missing probability mechanism (also called the selection probability)

$$\pi^*(z) = P\{\Delta = 1 \mid Z = z\} = E[\Delta \mid Z = z] \qquad (4.1)$$

is completely known, where $\Delta = 0$ if $Y$ is missing (and $\Delta = 1$, otherwise), and define the revised kernel estimator

$$\hat m_{\pi^*}(z) = \frac{\sum_{i=1}^n \frac{\Delta_i Y_i}{\pi^*(Z_i)}\, K\big(\tfrac{z-Z_i}{h}\big)}{\sum_{i=1}^n K\big(\tfrac{z-Z_i}{h}\big)}. \qquad (4.2)$$

As for the usefulness of $\hat m_{\pi^*}(z)$ as an estimator of $m(z)$, first observe that (4.2) can be viewed as the kernel estimator of $E\big[\frac{\Delta Y}{\pi^*(Z)} \,\big|\, Z = z\big]$. Furthermore, when the MAR assumption (2.3) holds, one finds

\begin{align*}
E\left[\frac{\Delta Y}{\pi^*(Z)}\,\Big|\, Z\right] &= E\left[E\left(\frac{\Delta Y}{\pi^*(Z)}\,\Big|\, Z, Y\right)\Big|\, Z\right] = E\left[\frac{Y}{\pi^*(Z)}\, E[\Delta \mid Z, Y]\,\Big|\, Z\right] \\
&= E\left[\frac{Y}{\pi^*(Z)}\,\pi^*(Z)\,\Big|\, Z\right] \quad\text{(by (2.3))} \\
&= E[Y \mid Z] := m(Z), \quad\text{a.s.}
\end{align*}

In other words, (4.2) can be viewed as a kernel regression estimator of the regression function $E[Y \mid Z = z]$ whenever the function $\pi^*$ in (4.1) is completely known.

How good is the estimator $\hat m_{\pi^*}(z)$? To answer this question, and in what follows, we shall assume that the selected kernel $K$ is regular, as described earlier in (1.4.3), and we state the following condition.

Condition B1. $\inf_z P\{\Delta = 1 \mid Z = z\} := \pi_0 > 0$, where $\pi_0$ can be arbitrarily small.

Condition B1 guarantees, in a sense, that $Y$ will be observed with a nonzero probability when $Z = z$, for all $z$. The following result is an immediate consequence of the main theorem of Devroye and Krzyżak (1989).

Theorem 4.1. Let $\hat m_{\pi^*}(z)$ be the kernel estimator defined in (4.2), where $K$ is a regular kernel, and suppose that condition B1 holds. If $h \to 0$ and $nh^d \to \infty$, as $n \to \infty$, then for any distribution of $(Z, Y)$ satisfying $|Y| \le M < \infty$, for every $\epsilon > 0$ and $n$ large enough,

$$P\left\{\int |\hat m_{\pi^*}(z) - m(z)|\,\mu(dz) > \epsilon\right\} \le e^{-a n},$$

where $a \equiv a(\epsilon) = \min\left\{\dfrac{\pi_0^2\epsilon^2}{128 M^2 (1 + c_1)},\ \dfrac{\pi_0\epsilon}{32 M (1 + c_1)}\right\}$. Here $c_1$ is a constant that depends on the kernel $K$ only.

Remarks. The above result continues to hold for $\int |\hat m_{\pi^*}(z) - m(z)|^p\,\mu(dz)$, for all $1 \le p < \infty$, with $a(\epsilon)$ replaced by $a(\tilde\epsilon)$, where $\tilde\epsilon = \epsilon/\big(M(\pi_0^{-1} + 1)\big)^{p-1}$. To appreciate this, simply note that for all $z \in \mathbb{R}^d$ and $p \in [1, \infty)$

$$|\hat m_{\pi^*}(z) - m(z)|^p \le \big(|\hat m_{\pi^*}(z)| + |m(z)|\big)^{p-1}\,|\hat m_{\pi^*}(z) - m(z)| \le \left(\frac{M}{\pi_0} + M\right)^{p-1} |\hat m_{\pi^*}(z) - m(z)|.$$

Therefore, under the conditions of Theorem 4.1, and in view of the Borel–Cantelli lemma, one finds $E\big[\,|\hat m_{\pi^*}(Z) - m(Z)|^p \,\big|\, D_n\big] \stackrel{a.s.}{\longrightarrow} 0$, as $n \to \infty$.

Of course, the estimator $\hat m_{\pi^*}$ defined by (4.2) is useful only if the missing probability mechanism $\pi^*(z) = P\{\Delta = 1 \mid Z = z\}$ is known. If this is not the case, $\pi^*(z)$ must be replaced by some estimator in (4.2). In what follows, we consider kernel as well as least-squares regression type estimators of $\pi^*(z)$.

4.1.1 A kernel-based estimator of the selection probability

Our first estimator of $\pi^*(Z_i) = E(\Delta_i \mid Z_i)$ is defined by

$$\hat\pi(Z_i) = \frac{\sum_{j=1,\,j\neq i}^n \Delta_j\, H\big(\tfrac{Z_i - Z_j}{a_n}\big)}{\sum_{j=1,\,j\neq i}^n H\big(\tfrac{Z_i - Z_j}{a_n}\big)}, \qquad (4.3)$$

with the convention $0/0 = 0$, where $H : \mathbb{R}^d \to \mathbb{R}_+$ is a kernel with smoothing parameter $a_n$ ($\to 0$, as $n \to \infty$). The kernel regression estimator in (4.3) is a weighted average of the $\Delta_j$'s; it provides an estimate of the probability that the $i$th observation of the response is available.
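A direct transcription of (4.3) can be sketched as follows (an illustrative example with a Gaussian kernel $H$; not the thesis's code):

```python
import numpy as np

def pi_hat(Z, Delta, a_n):
    """Leave-one-out kernel estimate (4.3) of pi*(Z_i) = P(Delta=1 | Z=Z_i),
    using a Gaussian kernel H and the convention 0/0 = 0.  Z has shape (n, d)."""
    n = Z.shape[0]
    out = np.empty(n)
    for i in range(n):
        j = np.arange(n) != i                 # leave the i-th observation out
        w = np.exp(-0.5 * np.sum(((Z[i] - Z[j]) / a_n) ** 2, axis=1))
        s = w.sum()
        out[i] = (Delta[j] * w).sum() / s if s > 0 else 0.0
    return out
```

Since each $\hat\pi(Z_i)$ is a weighted average of indicators, it always lies in $[0, 1]$; when every response is observed ($\Delta \equiv 1$) the estimate is identically 1, and for a very large bandwidth it approaches the overall observed fraction.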
Our first proposed kernel-type regression estimator of $m(z) = E[Y \mid Z = z]$ is given by

$$\hat m_{\hat\pi}(z) = \frac{\sum_{i=1}^n \frac{\Delta_i Y_i}{\hat\pi(Z_i)}\, K\big(\tfrac{z-Z_i}{h}\big)}{\sum_{i=1}^n K\big(\tfrac{z-Z_i}{h}\big)}. \qquad (4.4)$$

To investigate the performance of the regression estimator (4.4) we first state the following additional conditions.

Condition B2. The kernel $H$ in (4.3) satisfies $\int_{\mathbb{R}^d} H(u)\,du = 1$ and $\int_{\mathbb{R}^d} |u_i| H(u)\,du < \infty$, for $i = 1, \dots, d$, where $u = (u_1, \dots, u_d)'$. Furthermore, the smoothing parameter $a_n$ satisfies $a_n \to 0$ and $na_n^d \to \infty$, as $n \to \infty$.

Condition B3. The random vector $Z$ has a compactly supported probability density function, $f(z)$, where $f$ is bounded away from zero on its compact support. Additionally, both $f$ and its first-order partial derivatives are uniformly bounded.

Condition B4. The partial derivatives $\frac{\partial}{\partial z_i} E[\Delta \mid Z = z]$ exist for $i = 1, \dots, d$ and are bounded uniformly, in $z$, on the compact support of $f$.
PROOF OF THEOREM 4.2 60 First note that Z Z p p−1 p P |m b πb (z) − m(z)| µ(dz) > ≤ P 2 |m b πb (z) − m b π∗ (z)| µ(dz) > 2 Z p−1 p +P 2 |m b π∗ (z) − m(z)| µ(dz) > 2 := Pn,1 + Pn,2 , (say) . (4.5) But, by Theorem 4.1 (and the remarks that follow Theorem 4.1), Z p−1 −1 Pn,2 ≤ P 2(M π0 + M ) |m b π∗ (z) − m(z)| > ≤ 4e−bn , 2 for n large enough, with π02 2 π0 2 b ≡ b() = min , . 29 M 2 (1 + c1 )[2(M π0−1 + M )]2p−2 26 M (1 + c1 )[2(M π0−1 + M )]p−1 To deal with the term Pn,1 , first note that P z−Zi n 1 1 − ∆ Y K i i i=1 πb(Zi ) π∗ (Zi ) h |m b πb (z) − m b π∗ (z)| = Pn z−Zi K i=1 h P z−Zi n 1 1 − ∆ Y K i i i=1 πb(Zi ) π∗ (Zi ) h ≤ z−Z nE K h ! n X 1 1 z − Z i + − ∗ ∆i Yi K π b (Z ) π (Z ) h i i i=1 ! 1 1 − × P n z−Zi nE K z−Z i=1 K h h =: Un,1 (z) + Un,2 (z) , (say). 61 (4.6) Therefore, by the definition of m b πb (z) in (4.4) and the fact that |Y | ≤ M , p |m b πb (z) − m b π∗ (z)| ≤ M M + n ∧i=1 π b(Zi ) π0 p−1 |Un,1 (z)| + |Un,2 (z)| . (4.7) But Pn 1 z−Zi 1 Z Z − |∆ Y | K i i ∗ i=1 π b(Zi ) π (Zi ) h µ(dz) Un,1 (z)µ(dz) ≤ z−Z nE[K h ] Z n z−Zi 1 K 1X 1 h = − µ(dz) |∆i Yi | n i=1 π b(Zi ) π ∗ (Zi ) E[K z−Z ] h Z n X K z−u 1 1 1 h ≤ sup µ(dz) × − |∆ Y | i i π ∗ (Z ) n b (Z ) π E[K z−Z ] u i i h i=1 n M c1 X 1 1 ≤ − ∗ , (4.8) b(Zi ) π (Zi ) n i=1 π where the last line follows from Lemma B.1 and the fact that |∆i Yi | ≤ M , for all i = 1, · · · , n; here c1 is a finite positive constant depending on the kernel K only. Now, for 62 every > 0, (Z ) p−1 M M P 2p−1 + Un,1 (z) µ(dz) > ∧ni=1 π b(Zi ) π0 4 ) ( p−1 n X 1 M M 1 1 > + × − ≤ P 2p−1 ∧ni=1 π b(Zi ) π0 n i=1 π b(Zi ) π ∗ (Zi ) 4M c1 (" # p−1 n ∗ X π b (Z ) − π (Z ) M M 1 i i ≤ P 2p−1 + > × ∧ni=1 π b(Zi ) π0 n i=1 π b(Zi )π ∗ (Zi ) 4M c1 n h \ π0 i ∩ π b(Zi ) ≥ 2 i=1 ( ) ) π0 i π b(Zi ) < +P 2 i=1 ( ) p−1 n X 3M 2 ≤ P 2p−1 b(Zi ) − π ∗ (Zi ) > π 2 π π 4M c1 0 0 i=1 n n X π0 o + P π b(Zi ) < . 
As for the second term in (4.6), observe that since $|\Delta_iY_i|\le M$, for all $i=1,\cdots,n$, one finds
\[
U_{n,2}(z)\le M\max_{1\le i\le n}\left|\frac{1}{\widehat{\pi}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right|
\cdot\left|\frac{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}-1\right|.
\]
Now, for every $\epsilon>0$,
\begin{align*}
&P\left\{2^{p-1}\left(\frac{M}{\wedge_{i=1}^n\widehat{\pi}(Z_i)}+\frac{M}{\pi_0}\right)^{p-1}\int U_{n,2}(z)\,\mu(dz)>\frac{\epsilon}{4}\right\}\\
&\quad\le P\left\{\left[2^{p-1}\left(\frac{M}{\wedge_{i=1}^n\widehat{\pi}(Z_i)}+\frac{M}{\pi_0}\right)^{p-1}\int U_{n,2}(z)\,\mu(dz)>\frac{\epsilon}{4}\right]\cap\bigcap_{i=1}^n\Big[\widehat{\pi}(Z_i)\ge\frac{\pi_0}{2}\Big]\right\}
+P\left\{\bigcup_{i=1}^n\Big[\widehat{\pi}(Z_i)<\frac{\pi_0}{2}\Big]\right\}\\
&\quad\le P\left\{\frac{2M}{\pi_0}\big[2(3M\pi_0^{-1})\big]^{p-1}\int\left|\frac{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}-1\right|\mu(dz)>\frac{\epsilon}{4}\right\}
+\sum_{i=1}^nP\left\{\widehat{\pi}(Z_i)<\frac{\pi_0}{2}\right\}.
\end{align*}
Therefore, in view of (4.7), the term $P_{n,1}$ in (4.5) can be upper-bounded as follows:
\begin{align*}
P_{n,1}&\le\sum_{i=1}^nP\left\{\big|\widehat{\pi}(Z_i)-\pi^*(Z_i)\big|>\frac{\pi_0^2\,\epsilon}{8Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}\\
&\quad+P\left\{\int\left|\frac{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}-1\right|\mu(dz)>\frac{\pi_0\,\epsilon}{8M\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}
+2\sum_{i=1}^nP\left\{\widehat{\pi}(Z_i)<\frac{\pi_0}{2}\right\}\\
&:=\Delta_{n,1}+\Delta_{n,2}+\Delta_{n,3},\ \text{(say)}.
\end{align*}
First observe that by Lemma B.2,
\[
\Delta_{n,2}\le\exp\left(\frac{-n\pi_0^2\,\epsilon^2}{(64M^2c_1)^2\big[2(3M\pi_0^{-1})\big]^{2p-2}}\right),\ \text{for $n$ large enough};
\]
(this is a special case of Lemma B.2 with $Y_i=1$ for $i=1,\cdots,n$, and $m(z)=1$). To bound the term $\Delta_{n,1}$ we start by defining
\[
\psi(Z)=\pi^*(Z)\cdot f(Z), \tag{4.9}
\]
\[
\widehat{\psi}(Z_i)=\frac{1}{(n-1)a_n^d}\sum_{j=1,\,j\ne i}^n\Delta_jH\Big(\frac{Z_i-Z_j}{a_n}\Big), \tag{4.10}
\]
\[
\widehat{f}(Z_i)=\frac{1}{(n-1)a_n^d}\sum_{j=1,\,j\ne i}^nH\Big(\frac{Z_i-Z_j}{a_n}\Big), \tag{4.11}
\]
where $a_n$ and the kernel $H$ are as in (4.3). Since $\widehat{\psi}(Z_i)/\widehat{f}(Z_i)\le 1$, one finds
\begin{align*}
\big|\widehat{\pi}(Z_i)-\pi^*(Z_i)\big|
&=\left|\frac{\widehat{\psi}(Z_i)}{\widehat{f}(Z_i)}-\frac{\psi(Z_i)}{f(Z_i)}\right|
=\left|\frac{\widehat{\psi}(Z_i)-\psi(Z_i)}{f(Z_i)}-\frac{\widehat{\psi}(Z_i)}{\widehat{f}(Z_i)}\cdot\frac{\widehat{f}(Z_i)-f(Z_i)}{f(Z_i)}\right|\\
&\le\frac{1}{f(Z_i)}\big|\widehat{f}(Z_i)-f(Z_i)\big|+\frac{1}{f(Z_i)}\big|\widehat{\psi}(Z_i)-\psi(Z_i)\big|.
\end{align*}
Therefore,
\[
\Delta_{n,1}\le\sum_{i=1}^nP\left\{\frac{\big|\widehat{\psi}(Z_i)-\psi(Z_i)\big|}{f(Z_i)}>\frac{\pi_0^2\,\epsilon}{16Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}
+\sum_{i=1}^nP\left\{\frac{\big|\widehat{f}(Z_i)-f(Z_i)\big|}{f(Z_i)}>\frac{\pi_0^2\,\epsilon}{16Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}.
\]
Now, observe that
\begin{align*}
&P\left\{\frac{\big|\widehat{\psi}(Z_i)-\psi(Z_i)\big|}{f(Z_i)}>\frac{\pi_0^2\,\epsilon}{16Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}\\
&\quad\le P\left\{\Big|\widehat{\psi}(Z_i)-E\big[\widehat{\psi}(Z_i)\big|Z_i\big]\Big|+\Big|E\big[\widehat{\psi}(Z_i)\big|Z_i\big]-\psi(Z_i)\Big|>\frac{\pi_0^2f_0\,\epsilon}{16Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}\\
&\qquad\text{(where $f_0=\inf_z f(z)>0$, by condition B3)}\\
&\quad\le P\left\{\Big|\widehat{\psi}(Z_i)-E\big[\widehat{\psi}(Z_i)\big|Z_i\big]\Big|>\frac{\pi_0^2f_0\,\epsilon}{32Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}
\quad\text{(for $n$ large enough, by Lemma B.4)}\\
&\quad=E\left[P\left\{\Big|\widehat{\psi}(Z_i)-E\big[\widehat{\psi}(Z_i)\big|Z_i\big]\Big|>\frac{\pi_0^2f_0\,\epsilon}{32Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\ \Big|\ Z_i\right\}\right]\\
&\quad\le 2\exp\left(\frac{-(n-1)a_n^d\,\pi_0^4f_0^2\,\epsilon^2}{2(32Mc_1)^2\|H\|_\infty\Big[2\|f\|_\infty+a_n^{2d}\pi_0^2f_0\,\epsilon\big/\big(96Mc_1[2(3M\pi_0^{-1})]^{p-1}\big)\Big]\big[2(3M\pi_0^{-1})\big]^{2p-2}}\right)
\ \text{(by Lemma B.5)}\\
&\quad\le 2\exp\left(\frac{-(n-1)a_n^d\,\pi_0^4f_0^2\,\epsilon^2}{2(32Mc_1)^2\|H\|_\infty\big[2\|f\|_\infty+f_0/12\big]\big[2(3M\pi_0^{-1})\big]^{2p-2}}\right),
\end{align*}
for $n$ large enough, where we have also used the fact that in bounding $P\big\{|\widehat{\pi}(Z_i)-\pi^*(Z_i)|>\pi_0^2\epsilon/(8Mc_1[2(3M\pi_0^{-1})]^{p-1})\big\}$ one only needs to consider $0<\epsilon<8Mc_1[2(3M\pi_0^{-1})]^{p-1}/\pi_0^2$ (because $|\widehat{\pi}(Z_i)-\pi^*(Z_i)|\le 1$). Similarly, since $\widehat{f}$ is a special case of $\widehat{\psi}$, with $\Delta_j=1$ for all $j$'s (compare (4.10) and (4.11)), one finds
\[
P\left\{\frac{\big|\widehat{f}(Z_i)-f(Z_i)\big|}{f(Z_i)}>\frac{\pi_0^2\,\epsilon}{16Mc_1\big[2(3M\pi_0^{-1})\big]^{p-1}}\right\}
\le 2\exp\left(\frac{-(n-1)a_n^d\,\pi_0^4f_0^2\,\epsilon^2}{2(32Mc_1)^2\|H\|_\infty\big[2\|f\|_\infty+f_0/12\big]\big[2(3M\pi_0^{-1})\big]^{2p-2}}\right),
\]
for $n$ large enough. Putting all the above together, we find
\[
\Delta_{n,1}\le 4ne^{-(n-1)a_n^dC_{10}\epsilon^2},
\]
where $C_{10}=\pi_0^4f_0^2\Big/\Big(2(32Mc_1)^2\|H\|_\infty\big[2\|f\|_\infty+f_0/12\big]\big[2(3M\pi_0^{-1})\big]^{2p-2}\Big)$. Finally, since $P\{\widehat{\pi}(Z_i)<\pi_0/2\}\le P\{|\widehat{\pi}(Z_i)-\pi^*(Z_i)|>\pi_0/2\}$, the arguments that lead to the derivation of the bound on $\Delta_{n,1}$ yield
\begin{align*}
\Delta_{n,3}&\le 2\sum_{i=1}^nP\left\{\big|\widehat{\pi}(Z_i)-\pi^*(Z_i)\big|>\frac{\pi_0}{2}\right\}
\le 4n\exp\left(\frac{-(n-1)a_n^d\,\pi_0^2f_0^2}{128\|H\|_\infty\big[2\|f\|_\infty+a_n^{2d}\pi_0f_0/24\big]}\right)\\
&\le 4n\,e^{-(n-1)a_n^dC_{11}},\ \text{(for $n$ large enough)},
\end{align*}
where $C_{11}=\pi_0^2f_0^2\big/\big(128\|H\|_\infty[2\|f\|_\infty+\pi_0f_0/24]\big)$ does not depend on $n$. This completes the proof of Theorem 4.2.

4.1.2 Least-squares-based estimator of the selection probability

As in Chapter 3, if we have prior knowledge about the behavior of the missing-probability mechanism, we can use least-squares methods.
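As a rough numerical illustration of this least-squares idea (a sketch only: the finite grid of logistic-type candidate functions below is an assumption made for illustration, not the thesis's construction), one can pick the candidate selection probability minimizing the empirical squared error against the missingness indicators:

```python
import numpy as np

def fit_pi_ls(Z, Delta, candidates):
    # Empirical least squares over a finite class of candidate functions:
    # pick the pi minimizing (1/n) * sum_i (Delta_i - pi(Z_i))^2.
    risks = [np.mean((Delta - pi(Z)) ** 2) for pi in candidates]
    return candidates[int(np.argmin(risks))]

def make_candidate(a, b, pi0=0.1):
    # A hypothetical logistic-type candidate pi(z), truncated below at
    # pi0 > 0 so that every member of the class stays in [pi0, 1].
    return lambda Z: np.clip(1.0 / (1.0 + np.exp(-(a + b * Z[:, 0]))), pi0, 1.0)

# A finite grid of candidates standing in for the class of functions.
candidates = [make_candidate(a, b)
              for a in np.linspace(-2.0, 2.0, 9)
              for b in np.linspace(-3.0, 3.0, 13)]
```

The fitted function can then replace the unknown selection probability in the inverse-weighted kernel regression estimator.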
As a result, our second estimator of $\pi^*(z)$ is based on a least-squares approach and works as follows. Suppose that $\pi^*$ belongs to a known class of functions $\mathcal{P}$ of the form $\pi:\mathbb{R}^d\to[\pi_0,1]$, where $\pi_0=\inf_z\pi(z)>0$ is as before (see assumption B1). Then the least-squares estimator of the function $\pi^*$ is
\[
\widehat{\pi}_{LS}=\operatorname*{argmin}_{\pi\in\mathcal{P}}\frac{1}{n}\sum_{i=1}^n\big(\Delta_i-\pi(Z_i)\big)^2. \tag{4.12}
\]
Replacing $\pi^*$ by $\widehat{\pi}_{LS}$ in (4.2), one obtains
\[
\widehat{m}_{\widehat{\pi}_{LS}}(z)=\frac{\sum_{i=1}^n\frac{\Delta_iY_i}{\widehat{\pi}_{LS}(Z_i)}K\big(\frac{z-Z_i}{h}\big)}{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}. \tag{4.13}
\]
To study the performance of the regression function estimator in (4.13), we assume that our class of functions is totally bounded with respect to the supnorm. Recall from Chapter 1 that a class of functions $\mathcal{P}$ is said to be totally bounded with respect to the supnorm if for every $\epsilon>0$ there is a finite subclass $\mathcal{P}_\epsilon=\{\pi_1,\cdots,\pi_{N(\epsilon)}\}$ such that for every $\pi\in\mathcal{P}$ there is a $\tilde{\pi}\in\mathcal{P}_\epsilon$ satisfying $\|\pi-\tilde{\pi}\|_\infty<\epsilon$. The class $\mathcal{P}_\epsilon$ is called an $\epsilon$-cover of $\mathcal{P}$. The cardinality of the smallest such $\epsilon$-cover is denoted by $N_\infty(\epsilon,\mathcal{P})$.

Theorem 4.3 Let $\widehat{m}_{\widehat{\pi}_{LS}}(z)$ be as in (4.13), where $\widehat{\pi}_{LS}$ is the least-squares estimator of $\pi^*\in\mathcal{P}$, and where $\mathcal{P}$ is totally bounded with respect to the supnorm. Let $K$ be a regular kernel in (4.13) and suppose that condition B1 holds. If $h\to 0$ and $nh^d\to\infty$, as $n\to\infty$, then for any distribution of $(Z,Y)$ satisfying $|Y|\le M<\infty$, and for every $\epsilon>0$ and every $p\in[1,\infty)$, and $n$ large enough,
\[
P\left\{\int\big|\widehat{m}_{\widehat{\pi}_{LS}}(z)-m(z)\big|^p\mu(dz)>\epsilon\right\}
\le 4e^{-bn}+2N_\infty(C_4\epsilon,\mathcal{P})\,e^{-C_5n\epsilon^2}
+2N_\infty\big(C_6\epsilon^2,\mathcal{P}\big)\,e^{-C_7n\epsilon^4}+e^{-C_8n\epsilon^2},
\]
where $b$ is as in Theorem 4.2, and
\begin{align*}
C_4&=\frac{\pi_0^2}{32Mc_1\big[2(M\pi_0^{-1}+M)\big]^{p-1}}, &
C_5&=\frac{\pi_0^4}{128M^2c_1\big[2(M\pi_0^{-1}+M)\big]^{2p-2}},\\
C_6&=\frac{\pi_0^4}{1024M^2c_1\big[2(M\pi_0^{-1}+M)\big]^{2p-2}}, &
C_7&=\frac{\pi_0^8}{2^{15}M^4c_1\big[2(M\pi_0^{-1}+M)\big]^{4p-4}},\\
C_8&=\frac{\pi_0^2}{1024M^4c_1\big[2(M\pi_0^{-1}+M)\big]^{2p-2}},
\end{align*}
where $c_1$ is a constant that depends on the kernel $K$ only. Theorem 4.3 in conjunction with the Borel-Cantelli lemma yields
\[
E\big[\big|\widehat{m}_{\widehat{\pi}_{LS}}(Z)-m(Z)\big|^p\,\big|\,D_n\big]\stackrel{a.s.}{\longrightarrow}0,\ \text{as }n\to\infty.
\]
Remark.
The upper bound in Theorem 4.3 is meaningful when the class $\mathcal{P}$ is totally bounded with respect to the supnorm. See Section 1.3.2 for examples of classes of functions totally bounded with respect to the supnorm.

PROOF OF THEOREM 4.3

First note that for the least-squares estimator $\widehat{\pi}_{LS}$ we have
\[
\big|\widehat{m}_{\widehat{\pi}_{LS}}(z)\big|\le\frac{M}{\wedge_{i=1}^n\widehat{\pi}_{LS}(Z_i)}\le\frac{M}{\pi_0}.
\]
Therefore, employing the arguments used in the proof of Theorem 4.2, we have
\[
P\left\{\int\big|\widehat{m}_{\widehat{\pi}_{LS}}(z)-m(z)\big|^p\mu(dz)>\epsilon\right\}\le q_{n,1}+q_{n,2},
\]
where
\[
q_{n,1}=P\left\{\big[2(M\pi_0^{-1}+M)\big]^{p-1}\int\big|\widehat{m}_{\pi^*}(z)-m(z)\big|\mu(dz)>\frac{\epsilon}{2}\right\}\le 4e^{-bn},\ \text{(for large $n$)},
\]
where $b$ is as in Theorem 4.2, and where
\[
q_{n,2}=P\left\{\big[2(M\pi_0^{-1}+M)\big]^{p-1}\int\big|\widehat{m}_{\widehat{\pi}_{LS}}(z)-\widehat{m}_{\pi^*}(z)\big|\mu(dz)>\frac{\epsilon}{2}\right\}. \tag{4.14}
\]
To upper-bound $q_{n,2}$, first note that
\begin{align*}
\big|\widehat{m}_{\widehat{\pi}_{LS}}(z)-\widehat{m}_{\pi^*}(z)\big|
&\le\frac{\left|\sum_{i=1}^n\left(\frac{1}{\widehat{\pi}_{LS}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right)\Delta_iY_iK\big(\frac{z-Z_i}{h}\big)\right|}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}\\
&\quad+\left|\sum_{i=1}^n\left(\frac{1}{\widehat{\pi}_{LS}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right)\Delta_iY_iK\Big(\frac{z-Z_i}{h}\Big)\right|
\times\left|\frac{1}{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}-\frac{1}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}\right|\\
&:=|S_{n,1}(z)|+|S_{n,2}(z)|,\ \text{(say)}. \tag{4.15}
\end{align*}
Furthermore, as in (4.8),
\[
\int|S_{n,1}(z)|\,\mu(dz)\le\frac{Mc_1}{n}\sum_{i=1}^n\left|\frac{1}{\widehat{\pi}_{LS}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right|,
\]
which yields, for every $t>0$,
\begin{align*}
P\left\{\int|S_{n,1}(z)|\,\mu(dz)>t\right\}
&\le P\left\{\frac{1}{n}\sum_{i=1}^n\left|\frac{1}{\widehat{\pi}_{LS}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right|>\frac{t}{Mc_1}\right\}\\
&=P\left\{\frac{1}{n}\sum_{i=1}^n\frac{\big|\widehat{\pi}_{LS}(Z_i)-\pi^*(Z_i)\big|}{\widehat{\pi}_{LS}(Z_i)\pi^*(Z_i)}>\frac{t}{Mc_1}\right\}\\
&\le P\left\{\frac{1}{n}\sum_{i=1}^n\big|\widehat{\pi}_{LS}(Z_i)-\pi^*(Z_i)\big|>\frac{\pi_0^2t}{Mc_1}\right\}\\
&\qquad\text{(since B1 and the definition of the class $\mathcal{P}$ yield $\inf_z\widehat{\pi}_{LS}(z)\ge\inf_z\pi^*(z)=\pi_0>0$)}\\
&\le P\left\{\frac{1}{n}\sum_{i=1}^n\Big[\big|\widehat{\pi}_{LS}(Z_i)-\pi^*(Z_i)\big|-E\big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|\,\big|\,D_n\big]\Big]>\frac{\pi_0^2t}{2Mc_1}\right\} \tag{4.16}\\
&\quad+P\left\{E\big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|\,\big|\,D_n\big]>\frac{\pi_0^2t}{2Mc_1}\right\}. \tag{4.17}
\end{align*}
To deal with the term (4.16), put $\epsilon_0=\pi_0^2t/(8Mc_1)$ and let $\mathcal{P}_{\epsilon_0}$ be a minimal $\epsilon_0$-cover of $\mathcal{P}$. In other words, for every $\pi\in\mathcal{P}$, there is a $\tilde{\pi}\in\mathcal{P}_{\epsilon_0}$ such that $\|\pi-\tilde{\pi}\|_\infty\le\epsilon_0$.
Let $g(z)=|\pi(z)-\pi^*(z)|$ and $\tilde{g}(z)=|\tilde{\pi}(z)-\pi^*(z)|$, and note that
\begin{align*}
&\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\pi^*(Z_i)\big|-E\big|\pi(Z)-\pi^*(Z)\big|\right|\\
&\quad\le\frac{1}{n}\left|\sum_{i=1}^n\big|\pi(Z_i)-\pi^*(Z_i)\big|-\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\pi^*(Z_i)\big|\right|
+\Big|E\big|\tilde{\pi}(Z)-\pi^*(Z)\big|-E\big|\pi(Z)-\pi^*(Z)\big|\Big|\\
&\qquad+\left|\frac{1}{n}\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\pi^*(Z_i)\big|-E\big|\tilde{\pi}(Z)-\pi^*(Z)\big|\right|\\
&\quad\le 2\|g-\tilde{g}\|_\infty+\left|\frac{1}{n}\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\pi^*(Z_i)\big|-E\big|\tilde{\pi}(Z)-\pi^*(Z)\big|\right|\\
&\quad\le\frac{\pi_0^2t}{4Mc_1}+\left|\frac{1}{n}\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\pi^*(Z_i)\big|-E\big|\tilde{\pi}(Z)-\pi^*(Z)\big|\right|,
\end{align*}
where the last line follows from the fact that
\[
|g(z)-\tilde{g}(z)|=\Big||\pi(z)-\pi^*(z)|-|\tilde{\pi}(z)-\pi^*(z)|\Big|\le|\pi(z)-\tilde{\pi}(z)|\le\|\pi-\tilde{\pi}\|_\infty\le\frac{\pi_0^2t}{8Mc_1}.
\]
Therefore, with $\epsilon_0=\pi_0^2t/(8Mc_1)$,
\begin{align*}
(4.16)&\le P\left\{\sup_{\pi\in\mathcal{P}}\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\pi^*(Z_i)\big|-E\big|\pi(Z)-\pi^*(Z)\big|\right|>\frac{\pi_0^2t}{2Mc_1}\right\}\\
&\le P\left\{\sup_{\pi\in\mathcal{P}_{\epsilon_0}}\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\pi^*(Z_i)\big|-E\big|\pi(Z)-\pi^*(Z)\big|\right|+\frac{\pi_0^2t}{4Mc_1}>\frac{\pi_0^2t}{2Mc_1}\right\}\\
&\le N_\infty\Big(\frac{\pi_0^2t}{8Mc_1},\mathcal{P}\Big)\times\sup_{\pi\in\mathcal{P}_{\epsilon_0}}P\left\{\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\pi^*(Z_i)\big|-E\big|\pi(Z)-\pi^*(Z)\big|\right|>\frac{\pi_0^2t}{4Mc_1}\right\}\\
&\le 2N_\infty\Big(\frac{\pi_0^2t}{8Mc_1},\mathcal{P}\Big)\cdot\exp\left(\frac{-\pi_0^4t^2n}{8M^2c_1}\right), \tag{4.18}
\end{align*}
where the exponential bound in (4.18) follows from an application of Hoeffding's (1963) inequality. Next, to bound (4.17), we first note that
\begin{align*}
(4.17)&\le P\left\{\sqrt{E\Big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|^2\,\Big|\,D_n\Big]}>\frac{\pi_0^2t}{2Mc_1}\right\}\quad\text{(by the Cauchy-Schwarz inequality)}\\
&=P\left\{E\Big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|^2\,\Big|\,D_n\Big]>\frac{\pi_0^4t^2}{4M^2c_1}\right\}\\
&\le P\left\{\sup_{\pi\in\mathcal{P}}\Big|\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\Big|>\frac{\pi_0^4t^2}{8M^2c_1}\right\} \tag{4.19}\\
&\qquad\text{(by Lemma B.6, where $\nu_n(\pi)=n^{-1}\sum_{i=1}^n(\pi(Z_i)-\Delta_i)^2$)}.
\end{align*}
Now, put $\epsilon_{00}=\pi_0^4t^2/(64M^2c_1)$ and let $\mathcal{P}_{\epsilon_{00}}$ be a minimal $\epsilon_{00}$-cover of $\mathcal{P}$ with respect to the supnorm. Therefore, given $\pi\in\mathcal{P}$, there is a $\tilde{\pi}\in\mathcal{P}_{\epsilon_{00}}$ such that $\|\pi-\tilde{\pi}\|_\infty<\epsilon_{00}$. Since
\[
\Big||\pi(Z)-\Delta|^2-|\tilde{\pi}(Z)-\Delta|^2\Big|
\le\big|(\pi(Z)-\Delta)+(\tilde{\pi}(Z)-\Delta)\big|\times\big|(\pi(Z)-\Delta)-(\tilde{\pi}(Z)-\Delta)\big|
\le 2\big|\pi(Z)-\tilde{\pi}(Z)\big|\le 2\|\pi-\tilde{\pi}\|_\infty\le 2\epsilon_{00},
\]
one finds
\begin{align*}
\Big|\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\Big|
&\le\frac{1}{n}\left|\sum_{i=1}^n\big|\pi(Z_i)-\Delta_i\big|^2-\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\Delta_i\big|^2\right|
+\Big|E\big|\pi(Z)-\Delta\big|^2-E\big|\tilde{\pi}(Z)-\Delta\big|^2\Big|\\
&\quad+\left|\frac{1}{n}\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\Delta_i\big|^2-E\big|\tilde{\pi}(Z)-\Delta\big|^2\right|\\
&\le 4\epsilon_{00}+\left|\frac{1}{n}\sum_{i=1}^n\big|\tilde{\pi}(Z_i)-\Delta_i\big|^2-E\big|\tilde{\pi}(Z)-\Delta\big|^2\right|.
\end{align*}
Therefore, with $\epsilon_{00}=\pi_0^4t^2/(64M^2c_1)$,
\begin{align*}
(4.19)&\le P\left\{\sup_{\pi\in\mathcal{P}_{\epsilon_{00}}}\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\Delta_i\big|^2-E\big|\pi(Z)-\Delta\big|^2\right|>\frac{\pi_0^4t^2}{16M^2c_1}\right\}\\
&\le N_\infty\Big(\frac{\pi_0^4t^2}{64M^2c_1},\mathcal{P}\Big)\times\sup_{\pi\in\mathcal{P}_{\epsilon_{00}}}P\left\{\left|\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\Delta_i\big|^2-E\big|\pi(Z)-\Delta\big|^2\right|>\frac{\pi_0^4t^2}{16M^2c_1}\right\}\\
&\le 2N_\infty\Big(\frac{\pi_0^4t^2}{64M^2c_1},\mathcal{P}\Big)\cdot\exp\left(\frac{-\pi_0^8t^4n}{128M^4c_1}\right), \tag{4.20}
\end{align*}
where the last line follows by an application of Hoeffding's inequality. To deal with the term $|S_{n,2}(z)|$ in (4.15), first note that
\[
|S_{n,2}(z)|\le M\cdot\max_{1\le i\le n}\left|\frac{1}{\widehat{\pi}_{LS}(Z_i)}-\frac{1}{\pi^*(Z_i)}\right|\cdot\left|\frac{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}-1\right|
\le\frac{M}{\pi_0}\left|\frac{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)}{nE\big[K\big(\frac{z-Z}{h}\big)\big]}-1\right|.
\]
Now, in view of Lemma B.2 with $Y=1$ (and thus $m(z)=1$, for all $z$), we find that for every constant $t>0$,
\[
P\left\{\int|S_{n,2}(z)|\,\mu(dz)>t\right\}\le P\left\{\int\big|m_n^*(z)-1\big|\,\mu(dz)>\frac{t\pi_0}{M}\right\}\le\exp\left(\frac{-n\pi_0^2t^2}{64M^4c_1}\right) \tag{4.21}
\]
for $n$ large enough, where $m_n^*(z)=\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)\big/\big(nE\big[K\big(\frac{z-Z}{h}\big)\big]\big)$ is a special case of $m_n^*(z)$ of Lemma B.2 with $Y,Y_1,\cdots,Y_n=1$ (i.e., here $m_n^*(z)$ is the kernel regression estimator of $m(z)=1$). Therefore, in view of (4.14), (4.15), (4.18), (4.20), and (4.21), we have
\begin{align*}
q_{n,2}&\le P\left\{\big[2(M\pi_0^{-1}+M)\big]^{p-1}\int|S_{n,1}(z)|\,\mu(dz)>\frac{\epsilon}{4}\right\}
+P\left\{\big[2(M\pi_0^{-1}+M)\big]^{p-1}\int|S_{n,2}(z)|\,\mu(dz)>\frac{\epsilon}{4}\right\}\\
&\le 2N_\infty\left(\frac{\pi_0^2\,\epsilon}{32Mc_1\big[2(M\pi_0^{-1}+M)\big]^{p-1}},\mathcal{P}\right)e^{-\pi_0^4n\epsilon^2\big/\big(128M^2c_1[2(M\pi_0^{-1}+M)]^{2p-2}\big)}\\
&\quad+2N_\infty\left(\frac{\pi_0^4\,\epsilon^2}{1024M^2c_1\big[2(M\pi_0^{-1}+M)\big]^{2p-2}},\mathcal{P}\right)e^{-\pi_0^8n\epsilon^4\big/\big(2^{15}M^4c_1[2(M\pi_0^{-1}+M)]^{4p-4}\big)}\\
&\quad+e^{-n\pi_0^2\epsilon^2\big/\big(1024M^4c_1[2(M\pi_0^{-1}+M)]^{2p-2}\big)}.
\end{align*}
This completes the proof of Theorem 4.3.

4.2 Applications to classification with partially labeled data

Here we consider an application of the results developed in the previous sections to the problem of statistical classification. Let $(Z,Y)$ be an $\mathbb{R}^d\times\{1,\cdots,M\}$-valued random pair. In classification one has to predict the integer-valued label $Y$ based on the covariate vector $Z$. More formally, one seeks to find a function (a classifier) $\phi:\mathbb{R}^d\to\{1,\cdots,M\}$ for which the probability of misclassification error (incorrect prediction), i.e., $P\{\phi(Z)\ne Y\}$, is as small as possible.
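A classifier of this kind is easy to realize numerically once estimates of the conditional class probabilities are available. The following sketch (all names are illustrative; a Gaussian kernel and fully labeled data are assumed here) builds the argmax rule from kernel estimates of $P\{Y=k\,|\,Z=z\}$:

```python
import numpy as np

def kernel_posterior(Z, Y, h, label):
    # Nadaraya-Watson estimate of p_k(z) = P(Y = k | Z = z): smooth the
    # indicators 1{Y_i = k} with a Gaussian kernel (choice illustrative).
    def p_hat(z_eval):
        u = (z_eval[:, None, :] - Z[None, :, :]) / h
        w = np.exp(-0.5 * np.sum(u ** 2, axis=-1))
        return (w @ (Y == label).astype(float)) / np.maximum(w.sum(axis=1), 1e-12)
    return p_hat

def plug_in_classifier(p_hats):
    # The argmax rule: classify z to the label whose estimated conditional
    # probability is largest.
    def phi(z_eval):
        P = np.column_stack([p(z_eval) for p in p_hats])
        return 1 + np.argmax(P, axis=1)  # labels in {1, ..., M}
    return phi
```

With partially labeled data, the estimated posteriors would additionally carry the inverse weights $\Delta_i/\breve{\pi}(Z_i)$; the argmax step is unchanged.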
Let $p_k(z)=P\{Y=k\,|\,Z=z\}$, $z\in\mathbb{R}^d$, $1\le k\le M$. It is straightforward to show that the best classifier (i.e., the one with the lowest probability of error) is given by
\[
\phi_B(z)=\operatorname*{argmax}_{1\le k\le M}p_k(z), \tag{4.22}
\]
i.e., the best classifier $\phi_B$ satisfies $\max_{1\le k\le M}p_k(z)=p_{\phi_B(z)}(z)$; see, for example, Devroye and Györfi (1985, pp. 253-254). Since $\phi_B$ is almost always unknown, one uses the data to construct estimates of $\phi_B$. Let $D_n=\{(Z_1,Y_1),\cdots,(Z_n,Y_n)\}$ be a random sample (the data) from the distribution of $(Z,Y)$, where each $(Z_i,Y_i)$ is fully observable. Let $\widehat{\phi}_n$ be any sample-based classifier. In other words, $\widehat{\phi}_n(Z)$ is the predicted value of $Y$, based on $D_n$ and $Z$. Let
\[
L_n(\widehat{\phi}_n)=P\big\{\widehat{\phi}_n(Z)\ne Y\,\big|\,D_n\big\}
\]
be the conditional probability of error of the sample-based classifier $\widehat{\phi}_n$. Then $\widehat{\phi}_n$ is said to be strongly consistent if $L_n(\widehat{\phi}_n)\stackrel{a.s.}{\longrightarrow}L(\phi_B):=P\{\phi_B(Z)\ne Y\}$, as $n\to\infty$. For $k=1,\cdots,M$, let $\widehat{p}_k(z)$ be any sample-based estimator of $p_k(z)=P\{Y=k|Z=z\}$ and define the classification rule $\widehat{\phi}_n$ by
\[
\widehat{\phi}_n(z)=\operatorname*{argmax}_{1\le k\le M}\widehat{p}_k(z).
\]
In other words, $\widehat{\phi}_n$ satisfies $\max_{1\le k\le M}\widehat{p}_k(z)=\widehat{p}_{\widehat{\phi}_n(z)}(z)$. Then it can be shown (see, for example, Devroye and Györfi (1985, p. 254)) that
\[
0\le L_n(\widehat{\phi}_n)-L(\phi_B)\le\sum_{k=1}^M\int\big|\widehat{p}_k(z)-p_k(z)\big|\,\mu(dz). \tag{4.23}
\]
Therefore, to show $L_n(\widehat{\phi}_n)-L(\phi_B)\stackrel{a.s.}{\longrightarrow}0$, it is sufficient to show that $E\big[|\widehat{p}_k(Z)-p_k(Z)|\,\big|\,D_n\big]\stackrel{a.s.}{\longrightarrow}0$, as $n\to\infty$, for $k=1,\cdots,M$. Now, consider the case of partially labeled data, where some of the $Y_i$'s (the labels) may be unavailable in $D_n$. Let
\[
\widehat{p}_k(z)=\frac{\sum_{i=1}^n\frac{\Delta_iI\{Y_i=k\}}{\breve{\pi}(Z_i)}K\big(\frac{z-Z_i}{h}\big)}{\sum_{i=1}^nK\big(\frac{z-Z_i}{h}\big)},\quad k=1,\cdots,M, \tag{4.24}
\]
where $\breve{\pi}$ is taken to be either $\widehat{\pi}$ in (4.3) or $\widehat{\pi}_{LS}$ as in (4.12), and where $\Delta_i=1$ if $Y_i$ is available ($\Delta_i=0$ if $Y_i$ is missing). With $\widehat{p}_k$ as in (4.24), define the classifier
\[
\widehat{\phi}_n(z)=\operatorname*{argmax}_{1\le k\le M}\widehat{p}_k(z). \tag{4.25}
\]
How good is $\widehat{\phi}_n$ in (4.25) as an estimator of the optimal classifier $\phi_B$ in (4.22)? The answer is given by the following result.
Theorem 4.4 Let $\widehat{\phi}_n$ be the classifier defined via (4.25) and (4.24).
(i) If $\breve{\pi}$ in (4.24) is chosen to be the kernel estimator $\widehat{\pi}$ defined in (4.3) then, under the conditions of Theorem 4.2, $\widehat{\phi}_n$ is strongly consistent, i.e., $L_n(\widehat{\phi}_n)-L(\phi_B)\stackrel{a.s.}{\longrightarrow}0$.
(ii) If $\breve{\pi}$ in (4.24) is chosen to be the least-squares estimator $\widehat{\pi}_{LS}$ defined in (4.12) then, under the conditions of Theorem 4.3, $\widehat{\phi}_n$ is strongly consistent.

PROOF OF THEOREM 4.4

(i) Let $\breve{\pi}$ be the kernel regression estimator $\widehat{\pi}$ defined in (4.3) and observe that, by (4.23), for every $\epsilon>0$,
\[
P\big\{L_n(\widehat{\phi}_n)-L(\phi_B)>\epsilon\big\}\le\sum_{k=1}^MP\left\{\int\big|\widehat{p}_k(z)-p_k(z)\big|\,\mu(dz)>\frac{\epsilon}{M}\right\}.
\]
The proof now follows from an application of Theorem 4.2 in conjunction with the Borel-Cantelli lemma.
(ii) The proof is similar to that of part (i).

Conclusion and Future Work

In Chapter 3, kernel regression estimators were constructed for the situation in which a subset of the covariate vector may be missing at random. In that case we implicitly assumed that the missingness follows the multivariate two-pattern problem discussed in Chapter 2. In practice the data can exhibit multiple missing patterns, not just the multivariate two-pattern case; to handle multiple missing patterns, modifications to the estimators of Chapter 3 are necessary.

The primary focus of this thesis has been on kernel regression estimators in the presence of incomplete data. The partition regression estimator and the k-NN estimator of Chapter 1 could potentially yield results similar to those of the kernel regression estimators constructed in Chapters 3 and 4. The estimators in Chapters 3 and 4 dealt with the problem of missing data by using inverse weighting techniques. One weakness of this method is that the resulting estimator is still a function of only the complete cases, which amounts to a smaller effective sample size. Constructing kernel regression estimators that combine the methods of imputation and inverse weighting may result in estimators that perform more efficiently.
References

[1] Bernstein, S. (1946). The Theory of Probabilities. Gastehizdat Publishing House, Moscow.
[2] Cheng, P.E. and Chu, C.K. (1996). Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica, 6, 63-78.
[3] Davis, J. and Sorensen, J.R. (2013). Using base rates and correlational data to supplement clinical risk assessments. Journal of the American Academy of Psychiatry and the Law, 41(3), 391-400.
[4] Devroye, L. (1981). On the almost everywhere convergence of nonparametric regression function estimates. The Annals of Statistics, 9, 1310-1319.
[5] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
[6] Devroye, L., Györfi, L., Krzyżak, A., and Lugosi, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, 1371-1385.
[7] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
[8] Devroye, L. and Krzyżak, A. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 23, 71-82.
[9] Devroye, L. and Wagner, T. (1980). On the L1 convergence of kernel estimators of regression functions with applications in discrimination. Z. Wahrsch. Verw. Gebiete, 51, 15-25.
[10] Efromovich, S. (2011). Nonparametric regression with predictors missing at random. Journal of the American Statistical Association, 106, 306-319.
[11] Guntuboyina, A. and Sen, B. (2013). Covering numbers for convex functions. IEEE Transactions on Information Theory, 59(4), 1957-1965.
[12] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York.
[13] Hirano, K.I. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161-1189.
[14] Hoeffding, W.
(1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.
[15] Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.
[16] Kohler, M., Krzyżak, A., and Walk, H. (2003). Strong consistency of automatic kernel regression estimates. Annals of the Institute of Statistical Mathematics, 55, 287-308.
[17] Kolmogorov, A.N. and Tikhomirov, V.M. (1959). $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces. Uspekhi Mat. Nauk, 14, 3-86.
[18] Krzyżak, A. and Pawlak, M. (1984). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Transactions on Information Theory, 30, 78-81.
[19] Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis With Missing Data. Wiley, New York.
[20] Mojirsheibani, M. (2007). Nonparametric curve estimation with missing data: a general empirical process approach. Journal of Statistical Planning and Inference, 137, 2733-2758.
[21] Mojirsheibani, M. (2012). Some results on classifier selection with missing covariates. Metrika, 75, 521-539.
[22] Nadaraya, E.A. (1964). On estimating regression. Theory of Probability and Its Applications, 9, 141-142.
[23] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.
[24] Racine, J. and Li, Q. (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1), 99-130.
[25] Racine, J. and Hayfield, T. (2008). Nonparametric econometrics: the np package. Journal of Statistical Software, 27(5).
[26] Robins, J.M., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
[27] Rotnitzky, A. and Robins, J.M. (1995). Semiparametric regression estimation in the presence of dependent censoring. Biometrika, 82(4), 805-820.
[28] Spiegelman, C.
and Sacks, J. (1980). Consistent window estimation in nonparametric regression. The Annals of Statistics, 8, 240-246.
[29] Stone, C.J. (1977). Consistent nonparametric regression (with discussion). The Annals of Statistics, 5, 595-645.
[30] Tukey, J.W. (1961). Curves as parameters, and touch estimation. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, Univ. California Press, Berkeley, Calif., 681-694.
[31] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
[32] Vansteelandt, S., Carpenter, J., and Kenward, M. (2010). Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 37-48.
[33] Walk, H. (2002b). Almost sure convergence properties of Nadaraya-Watson regression estimates. In Modeling Uncertainty, 201-223, Internat. Ser. Oper. Res. Management Sci., 46, Kluwer Acad. Publ., Boston, MA.
[34] Wand, M.P., Faes, C., and Ormerod, J. (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495), 959-971.
[35] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall/CRC, Boca Raton.
[36] Wang, Q. and Qin, Y. (2010). Empirical likelihood confidence bands for distribution functions with missing responses. Journal of Statistical Planning and Inference, 140, 2778-2789.
[37] Watson, G.S. (1964). Smooth regression analysis. Sankhya, Ser. A, 26, 359-372.

A Probability Review

This section provides a brief review of some of the important probabilistic definitions used in establishing convergence of the regression estimators.

Definition A.1 (Convergence in Probability) Let $X_n$ be a sequence of random variables. We say that $X_n$ converges in probability to $X$ if, for all $\epsilon>0$, $P\{|X_n-X|>\epsilon\}\to 0$ as $n\to\infty$. The notation $X_n\stackrel{p}{\longrightarrow}X$ denotes "$X_n$ converges to $X$ in probability."
A stronger mode of convergence is almost sure convergence; this is the mode of convergence used in all our results.

Definition A.2 (Almost Sure Convergence) The random variable $X_n$ is said to converge almost surely (with probability 1) to $X$ if
\[
P\Big\{\lim_{n\to\infty}X_n=X\Big\}=1.
\]
The notation $X_n\stackrel{a.s.}{\longrightarrow}X$ denotes "$X_n$ converges to $X$ almost surely." This definition is not commonly used directly to establish almost sure convergence. We provide an alternative route, which can more easily be used to obtain the almost sure convergence of an estimator, by way of the Borel-Cantelli lemma.

Lemma A.1 Let $\{A_n:n\ge 1\}$ be a sequence of events in a probability space. If
\[
\sum_{n=1}^\infty P(A_n)<\infty,
\]
then
\[
P\left(\bigcap_{k=1}^\infty\bigcup_{n=k}^\infty A_n\right)=0,
\]
i.e., the probability that the events $A_n$ occur infinitely often is 0.

The following inequalities were used to establish exponential upper bounds for our regression estimators in Chapters 3 and 4. See Györfi et al. (2002, p. 594).

Theorem A.1 (Hoeffding (1963)) Let $X_1,\ldots,X_n$ be independent random variables such that $|X_i|\le M$ with probability 1. Let $\bar{X}=\frac{1}{n}\sum_{i=1}^nX_i$ denote their average; then for any $\epsilon>0$,
\[
P\big\{\big|\bar{X}-E[\bar{X}]\big|>\epsilon\big\}\le 2\exp\left(\frac{-2n\epsilon^2}{M^2}\right).
\]

Theorem A.2 (Bernstein (1946)) Let $X_1,\ldots,X_n$ be independent random variables with $|X_i|\le M$ almost surely, for all $i=1,\cdots,n$, and $E[X_i^2]<\infty$. Put $\sigma^2=\frac{1}{n}\sum_{i=1}^n\mathrm{var}(X_i)$; then for any $\epsilon>0$,
\[
P\left\{\left|\frac{1}{n}\sum_{i=1}^n\big(X_i-E[X_i]\big)\right|>\epsilon\right\}\le 2\exp\left(\frac{-n\epsilon^2}{2\sigma^2+2M\epsilon/3}\right).
\]

B Technical Lemmas

A number of important lemmas were used to establish the results of Chapters 3 and 4. These lemmas are provided in this section.

Lemma B.1 Let $K$ be a regular kernel, and let $\mu$ be any probability measure on the Borel sets of $\mathbb{R}^{d+s}$. Then there exists a finite positive constant $c_1$, depending only on the kernel $K$, such that for all $h_n>0$,
\[
\sup_{u\in\mathbb{R}^{d+s}}\int\frac{K\big(\frac{z-u}{h_n}\big)}{E\big[K\big(\frac{z-Z}{h_n}\big)\big]}\,\mu(dz)\le c_1.
\]

PROOF OF LEMMA B.1

This lemma and its proof are given in Devroye and Krzyżak (1989; Lemma 1).
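Exponential inequalities such as Hoeffding's (Theorem A.1) can also be sanity-checked by simulation. A minimal sketch (not part of the thesis; all names and parameter choices are illustrative) for variables uniform on $[0,1]$, so that the range is $M=1$:

```python
import numpy as np

def hoeffding_violation_rate(n=200, eps=0.1, reps=5000, seed=0):
    # Empirical frequency of {|Xbar - E[Xbar]| > eps} for X_i uniform
    # on [0, 1]; Hoeffding's bound here is 2 * exp(-2 * n * eps^2).
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(reps, n))
    return np.mean(np.abs(X.mean(axis=1) - 0.5) > eps)

empirical = hoeffding_violation_rate()
bound = 2.0 * np.exp(-2.0 * 200 * 0.1 ** 2)  # about 0.0366
```

In this regime the empirical violation rate is far below the bound, reflecting that Hoeffding's inequality is distribution-free and therefore conservative.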
Also, see Devroye and Wagner (1980) as well as Spiegelman and Sacks (1980).

Lemma B.2 Let $K$ be a regular kernel, and put $m_n^*(z)=\sum_{i=1}^nY_iK\big(\frac{z-Z_i}{h_n}\big)\big/\big(nE\big[K\big(\frac{z-Z}{h_n}\big)\big]\big)$, where the $(Z_i,Y_i)$'s are i.i.d. $\mathbb{R}^{d+s}\times\mathbb{R}$-valued random vectors. Then, if $|Y|\le M<\infty$, we have for every $\epsilon>0$ and $n$ large enough,
\[
P\left\{\int\big|m_n^*(z)-m(z)\big|\,\mu(dz)>\epsilon\right\}\le\exp\left(\frac{-n\epsilon^2}{64M^2c_1^2}\right).
\]
Here, $\mu$ is the probability measure of $Z$, $m(z)=E[Y|Z=z]$, and $c_1$ is as in Lemma B.1.

PROOF OF LEMMA B.2

For a proof of this result see, for example, Györfi et al. (2002; Lemma 23.9).

Lemma B.3 Let $\phi(X_i,Y_i)=\eta^*(X_i,Y_i)f(X_i)P(Y=Y_i|Y_i)$, where $\eta^*(X_i,Y_i)$ is as in (3.2), and put $\widehat{\phi}(X_i,Y_i)=\lambda_n^{-d}n^{-1}\sum_{j=1}^n\Delta_jI\{Y_j=Y_i\}H\big((X_i-X_j)/\lambda_n\big)$, where $\lambda_n$ and the kernel $H$ are as in (3.6). Suppose that conditions A2, A3, and A4 hold. Then,
\[
\Big|E\big[\widehat{\phi}(X_i,Y_i)\,\big|\,X_i,Y_i\big]-\phi(X_i,Y_i)\Big|\le c_2\lambda_n\quad a.s.,
\]
where $c_2$ is a positive constant not depending on $n$. The proof is similar to that of Lemma 3 of Mojirsheibani (2012) and goes as follows.

PROOF OF LEMMA B.3

First note that
\begin{align*}
&E\big[\widehat{\phi}(X_i,Y_i)\,\big|\,X_i,Y_i\big]-\phi(X_i,Y_i)\\
&\quad=E\left[\lambda_n^{-d}n^{-1}\sum_{j=1}^n\Delta_jI\{Y_j=Y_i\}H\Big(\frac{X_i-X_j}{\lambda_n}\Big)\,\Big|\,X_i,Y_i\right]-\phi(X_i,Y_i)\\
&\quad\stackrel{a.s.}{=}\lambda_n^{-d}E\left[I\{Y_1=Y_i\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)E\big[\Delta_1\,\big|\,X_1,Y_1,X_i,Y_i\big]\,\Big|\,X_i,Y_i\right]-\phi(X_i,Y_i)\\
&\quad=\lambda_n^{-d}E\left[I\{Y_1=Y_i\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)\eta^*(X_1,Y_1)\,\Big|\,X_i,Y_i\right]-\eta^*(X_i,Y_i)f(X_i)P(Y=Y_i|Y_i)\\
&\qquad\text{(since $\Delta_1$ is independent of $(X_i,Y_i)$)}\\
&\quad=E\left[\lambda_n^{-d}\big(\eta^*(X_1,Y_1)-\eta^*(X_i,Y_i)\big)I\{Y_1=Y_i\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)\,\Big|\,X_i,Y_i\right]\\
&\qquad+E\left[\eta^*(X_i,Y_i)\left(\lambda_n^{-d}I\{Y_1=Y_i\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)-f(X_i)P(Y=Y_i|Y_i)\right)\Big|\,X_i,Y_i\right]\\
&\quad:=\Delta_{n,i}(1)+\Delta_{n,i}(2)\ \text{(say)},
\end{align*}
where the second-to-last step follows by adding and subtracting the term with $\eta^*(X_i,Y_i)$ in place of $\eta^*(X_1,Y_1)$.
Utilizing a one-term Taylor expansion, we can bound $\Delta_{n,i}(1)$ as follows:
\[
|\Delta_{n,i}(1)|\le\lambda_n^{-d}E\left[\sum_{k=1}^d\left|\frac{\partial\eta^*(X^\dagger,Y_i)}{\partial X_k}\right||X_{i,k}-X_{1,k}|\,H\Big(\frac{X_i-X_1}{\lambda_n}\Big)\,\Big|\,X_i,Y_i\right],
\]
where $X_{1,k}$ and $X_{i,k}$ are the $k$th components of $X_1$ and $X_i$, respectively, and $X^\dagger$ is on the interior of the line segment joining $X_1$ and $X_i$. Therefore,
\begin{align*}
|\Delta_{n,i}(1)|&\le\alpha_1E\left[\sum_{k=1}^d|X_{i,k}-X_{1,k}|\,\lambda_n^{-d}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)\,\Big|\,X_i,Y_i\right]
\quad\left(\text{where }\alpha_1=\max_{1\le k\le d}\,\sup_{v\in\mathbb{R}^d,\,y\in\mathcal{Y}}\left|\frac{\partial\eta^*(x,y)}{\partial x_k}\bigg|_{x=v}\right|\right)\\
&=\alpha_1\sum_{k=1}^d\int|X_{i,k}-x_k|\,\lambda_n^{-d}H\Big(\frac{X_i-x}{\lambda_n}\Big)f(x)\,dx\\
&\le\alpha_1\|f\|_\infty\sum_{k=1}^d\int_{\mathbb{R}^d}\lambda_n|u_k|H(u)\,du\qquad\text{(by condition A3)}\\
&\le\alpha\lambda_n\qquad\text{(by condition A2)}.
\end{align*}
To bound the term $\Delta_{n,i}(2)$, first note that
\[
\Delta_{n,i}(2)=\eta^*(X_i,Y_i)\,E\left[\lambda_n^{-d}I\{Y_1=Y_i\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)-f(X_i)P(Y=Y_i|Y_i)\,\Big|\,X_i,Y_i\right].
\]
Now, let $f_y(x)$ be the conditional p.d.f. of $X$, given $Y=y$, and observe that
\[
E\left[\lambda_n^{-d}I\{Y_i=Y_1\}H\Big(\frac{X_i-X_1}{\lambda_n}\Big)\,\Big|\,X_i,Y_i\right]
=\sum_{y\in\mathcal{Y}}P(Y=y)I\{Y_i=y\}\int_{\mathbb{R}^d}\lambda_n^{-d}H\Big(\frac{X_i-x}{\lambda_n}\Big)f_y(x)\,dx.
\]
Therefore, utilizing condition A2 and the fact that
\[
P(Y=Y_i|Y_i)=\sum_{y\in\mathcal{Y}}P(Y=y)I\{Y_i=y\},
\]
one finds
\[
\Delta_{n,i}(2)=\eta^*(X_i,Y_i)\sum_{y\in\mathcal{Y}}P(Y=y)I\{Y_i=y\}\int_{\mathbb{R}^d}\lambda_n^{-d}H\Big(\frac{X_i-x}{\lambda_n}\Big)\big(f_y(x)-f_y(X_i)\big)\,dx.
\]
Performing a one-term Taylor expansion, and using the facts that $|\eta^*(X_i,Y_i)|\le 1$ and $I\{Y_i=y\}\le 1$ for every $y\in\mathcal{Y}$, yields
\begin{align*}
|\Delta_{n,i}(2)|&\le\sum_{y\in\mathcal{Y}}P(Y=y)\left|\int_{\mathbb{R}^d}H(u)\big(f_y(X_i-\lambda_nu)-f_y(X_i)\big)\,du\right|\\
&\le\lambda_n\sum_{y\in\mathcal{Y}}P(Y=y)\sum_{k=1}^d\sup_{v\in\mathbb{R}^d}\left|\frac{\partial f_y(x)}{\partial x_k}\bigg|_{x=v}\right|\int_{\mathbb{R}^d}|u_k|H(u)\,du\\
&\le d\lambda_n\max_{1\le k\le d}\sup_{v\in\mathbb{R}^d}\left|\frac{\partial f(x)}{\partial x_k}\bigg|_{x=v}\right|\int_{\mathbb{R}^d}|u_k|H(u)\,du\\
&\le\beta\lambda_n\qquad\text{(by conditions A2, A3, and A4)}.
\end{align*}
This completes the proof of Lemma B.3.

The following lemma is similar to Lemma B.3, with slight modifications to deal with the simpler case of missing responses instead of missing covariates; it is used in the proof of Theorem 4.2.

Lemma B.4 Let $\psi(Z)=\pi^*(Z)f(Z)$, where $f$ is the probability density of $Z$, and $\pi^*(z)=P\{\Delta=1|Z=z\}$. Put $\widehat{\psi}(Z)=n^{-1}a_n^{-d}\sum_{j=1}^n\Delta_jH\big((Z_j-Z)/a_n\big)$, where $a_n$ and the kernel $H$ are as in (4.3). Suppose that conditions B2, B3, and B4 hold. Then
\[
\Big|E\big[\widehat{\psi}(Z)\,\big|\,Z\big]-\psi(Z)\Big|\le\kappa a_n\quad a.s.,
\]
where $\kappa$ is a positive constant not depending on $n$.

PROOF OF LEMMA B.4

The proof is similar to that of Lemma 4.2 of Mojirsheibani (2007) and goes as follows:
\begin{align*}
E\big[\widehat{\psi}(Z)\,\big|\,Z\big]-\psi(Z)
&=a_n^{-d}E\big[\Delta_1H\big((Z_1-Z)/a_n\big)\,\big|\,Z\big]-\pi^*(Z)f(Z)\\
&\stackrel{a.s.}{=}a_n^{-d}E\Big[H\big((Z_1-Z)/a_n\big)\,E[\Delta_1|Z,Z_1]\,\Big|\,Z\Big]-\pi^*(Z)f(Z). \tag{B.1}
\end{align*}
But $E[\Delta_1|Z,Z_1]=E[\Delta_1|Z_1]=\pi^*(Z_1)$, because $Z$ is independent of $\Delta_1$ and $Z_1$. Therefore
\begin{align*}
(B.1)&=a_n^{-d}E\Big[\big(\pi^*(Z_1)-\pi^*(Z)\big)H\big((Z_1-Z)/a_n\big)\,\Big|\,Z\Big]
+E\Big[\pi^*(Z)\Big(a_n^{-d}H\big((Z_1-Z)/a_n\big)-f(Z)\Big)\,\Big|\,Z\Big]\\
&:=T_{n,1}(Z)+T_{n,2}(Z),\ \text{(say)}.
\end{align*}
A one-term Taylor expansion gives
\[
T_{n,1}(Z)=a_n^{-d}E\left[\left(\sum_{i=1}^d(Z_{1,i}-Z_i)\frac{\partial\pi^*(Z^\dagger)}{\partial Z_i}\right)H\big((Z_1-Z)/a_n\big)\,\Big|\,Z\right],
\]
where $Z_i$ and $Z_{1,i}$ are the $i$th components of $Z$ and $Z_1$, respectively, and $Z^\dagger$ is a point on the interior of the line segment joining $Z$ and $Z_1$. Therefore,
\begin{align*}
|T_{n,1}(Z)|&\le C_{14}\sum_{i=1}^dE\Big[|Z_{1,i}-Z_i|\,a_n^{-d}H\big((Z_1-Z)/a_n\big)\,\Big|\,Z\Big]\qquad\text{(by condition B4)}\\
&\qquad\Big(\text{where }C_{14}=\max_{1\le i\le d}\sup_z\big|\partial\pi^*(z)/\partial z_i\big|\Big)\\
&=C_{14}\sum_{i=1}^d\int|z_i-Z_i|\,a_n^{-d}H\big((z-Z)/a_n\big)f(z)\,dz\\
&\le C_{14}\|f\|_\infty\sum_{i=1}^d\int a_n|v_i|H(v)\,dv=C_{15}\cdot a_n\qquad\text{(by conditions B2 and B4)},
\end{align*}
where we have made the substitution $v_i=(z_i-Z_i)/a_n$, $i=1,\cdots,d$. Next, to bound $T_{n,2}(Z)$, first note that
\[
T_{n,2}(Z)=\pi^*(Z)\int a_n^{-d}H\big((z-Z)/a_n\big)\big[f(z)-f(Z)\big]\,dz
=\pi^*(Z)\int\big[f(Z+a_ny)-f(Z)\big]H(y)\,dy.
\]
Finally, a one-term Taylor expansion gives
\[
|T_{n,2}(Z)|\le a_n\cdot d\,\|f'\|_\infty\left[\sum_{i=1}^d\int|y_i|H(y)\,dy\right]\qquad\text{(by condition B2)}.
\]
This completes the proof of Lemma B.4.

The following lemma is used in the proof of Theorem 4.2.

Lemma B.5 Let $\widehat{\psi}$ be as in Lemma B.4. Then, for every $t>0$, one has
\[
P\left\{\Big|\widehat{\psi}(Z)-E\big[\widehat{\psi}(Z)\,\big|\,Z\big]\Big|>t\ \Big|\ Z=z\right\}
\le 2\exp\left(\frac{-na_n^dt^2}{2\|H\|_\infty\big[2\|f\|_\infty+a_n^{2d}t/3\big]}\right).
\]

PROOF OF LEMMA B.5

Let
\[
\Gamma_j(Z)=a_n^{-d}\left[\Delta_jH\Big(\frac{Z-Z_j}{a_n}\Big)-E\Big[\Delta_jH\Big(\frac{Z-Z_j}{a_n}\Big)\,\Big|\,Z\Big]\right]
\]
and note that, conditional on $Z$, the terms $\Gamma_j(Z)$ are independent, zero-mean random variables, bounded between $-a_n^{-d}\|H\|_\infty$ and $+a_n^{-d}\|H\|_\infty$. Also,
\[
\mathrm{var}\big(\Gamma_j(Z)\,\big|\,Z\big)=E\big[\Gamma_j^2(Z)\,\big|\,Z\big]\le 2a_n^{-d}\|H\|_\infty\|f\|_\infty.
\]
Therefore, by Bernstein's (1946) inequality,
\[
P\left\{\Big|\widehat{\psi}(Z)-E\big[\widehat{\psi}(Z)\,\big|\,Z\big]\Big|>t\ \Big|\ Z=z\right\}
=P\left\{\left|\frac{1}{n}\sum_{j=1}^n\Gamma_j(Z)\right|>t\ \Big|\ Z=z\right\}
\le 2\exp\left(\frac{-nt^2}{2\big[2a_n^{-d}\|H\|_\infty\|f\|_\infty+\big(a_n^{d}\|H\|_\infty t\big)/3\big]}\right),
\]
which does not depend on $z$.

Lemma B.6 Let $\widehat{\pi}_{LS}$ be as in (4.12) and suppose that $\pi^*\in\mathcal{P}$, where $\pi^*(z)=E[\Delta|Z=z]$. Then
\[
E\Big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|^2\,\Big|\,D_n\Big]\le 2\sup_{\pi\in\mathcal{P}}\Big|\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\Big|,
\]
where
\[
\nu_n(\pi)=\frac{1}{n}\sum_{i=1}^n\big|\pi(Z_i)-\Delta_i\big|^2.
\]

PROOF OF LEMMA B.6

Using the elementary decomposition
\[
E\Big[\big|\widehat{\pi}_{LS}(Z)-\Delta\big|^2\,\Big|\,D_n\Big]=E\big|\pi^*(Z)-\Delta\big|^2+E\Big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|^2\,\Big|\,D_n\Big],
\]
one has
\begin{align*}
E\Big[\big|\widehat{\pi}_{LS}(Z)-\pi^*(Z)\big|^2\,\Big|\,D_n\Big]
&=\left(E\Big[\big|\widehat{\pi}_{LS}(Z)-\Delta\big|^2\,\Big|\,D_n\Big]-\inf_{\pi\in\mathcal{P}}E\big|\pi(Z)-\Delta\big|^2\right)\\
&\quad+\left(\inf_{\pi\in\mathcal{P}}E\big|\pi(Z)-\Delta\big|^2-E\big|\pi^*(Z)-\Delta\big|^2\right)
=:T_{n,1}+T_{n,2}.
\end{align*}
But the term $T_{n,2}=0$, because $\pi^*\in\mathcal{P}$. As for the term $T_{n,1}$, with $\nu_n(\pi)=\frac{1}{n}\sum_{i=1}^n|\pi(Z_i)-\Delta_i|^2$, observe that
\begin{align*}
T_{n,1}&=\sup_{\pi\in\mathcal{P}}\left\{E\Big[\big|\widehat{\pi}_{LS}(Z)-\Delta\big|^2\,\Big|\,D_n\Big]-\nu_n(\widehat{\pi}_{LS})+\nu_n(\widehat{\pi}_{LS})-\nu_n(\pi)+\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\right\}\\
&\le\Big\{-\nu_n(\widehat{\pi}_{LS})+E\Big[\big|\widehat{\pi}_{LS}(Z)-\Delta\big|^2\,\Big|\,D_n\Big]\Big\}+\sup_{\pi\in\mathcal{P}}\Big\{\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\Big\}\\
&\qquad\text{(because $\nu_n(\widehat{\pi}_{LS})-\nu_n(\pi)\le 0$, for all $\pi\in\mathcal{P}$)}\\
&\le 2\sup_{\pi\in\mathcal{P}}\Big|\nu_n(\pi)-E\big|\pi(Z)-\Delta\big|^2\Big|.
\end{align*}
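The inequality in Lemma B.6 can also be verified numerically on a toy example (all choices below, namely a class of constant functions, a constant true selection probability, and the grid, are illustrative assumptions, not part of the thesis):

```python
import numpy as np

# Toy check of the lemma: a class of constant functions pi_c(z) = c on a
# grid, with a constant true selection probability p_star = 0.6, so that
# pi* is a member of the class, as the lemma requires.
rng = np.random.default_rng(3)
n, p_star = 400, 0.6
Delta = (rng.uniform(size=n) < p_star).astype(float)
grid = np.linspace(0.1, 1.0, 19)

nu_n = np.array([np.mean((c - Delta) ** 2) for c in grid])   # empirical risk
risk = (grid - p_star) ** 2 + p_star * (1.0 - p_star)        # E|pi(Z) - Delta|^2
c_ls = grid[np.argmin(nu_n)]                                 # least-squares fit

lhs = (c_ls - p_star) ** 2                   # E[|pi_LS - pi*|^2 | D_n]
rhs = 2.0 * np.max(np.abs(nu_n - risk))      # 2 sup |nu_n - E|pi - Delta|^2|
```

For constant candidates the empirical risk is minimized at the grid point nearest the sample mean of the $\Delta_i$'s, and the lemma's bound `lhs <= rhs` holds for every realization of the data.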