Download KERNEL REGRESSION ESTIMATION FOR INCOMPLETE DATA

Document related concepts

Data assimilation wikipedia , lookup

German tank problem wikipedia , lookup

Instrumental variables estimation wikipedia , lookup

Choice modelling wikipedia , lookup

Regression toward the mean wikipedia , lookup

Linear regression wikipedia , lookup

Coefficient of determination wikipedia , lookup

Regression analysis wikipedia , lookup

Transcript
CALIFORNIA STATE UNIVERSITY, NORTHRIDGE
KERNEL REGRESSION ESTIMATION FOR INCOMPLETE
DATA
A thesis submitted in partial fulfillment of the requirements
For the degree of Master of Science
in Mathematics
by
Timothy Reese
May 2015
The thesis of Timothy Reese is approved:
Dr. Mark Schilling
Date
Dr. Ann E. Watkins
Date
Dr. Majid Mojirsheibani, Chair
Date
California State University, Northridge
ii
Dedication
This thesis is dedicated in memory of my grandmother Patrica, who first inspired me to
pursue higher education. I also dedicate this thesis to the friendship and memory of my
good friend Mathew Willerford, his passing in 2014 was difficult. Mathew was caring and
always put family and friends before himself. He left the world too early.
iii
Acknowledgments
I would first like to express my sincerest gratitude to Dr. Majid Mojirsheibani. The guidance and support provided by Dr. Mojirsheibani has been crucial to the completion of
this thesis. The patience Dr. Mojirsheibani has demonstrated through out this process is
greatly appreciated. Dr. Mojirsheibani has shown by example, the necessary traits required to produce meaningful work in the field of statistics. I hope that during this time I
have adopted a portion of these traits. I would also like to thank Dr. Mark Schilling for
introducing me to the beautiful world of statistics. Without such a great introduction to
this field, I may have instead pursued a masters degree in computer science. Furthermore,
I would like to thank my committee members Dr. Ann E. Watkins and Dr. Mark Schilling
for their time and insightful comments. Additionally, I would like to extend thanks to
many of the people in the math department who have provided support in one form or
another. Finally, I am always grateful to my family for being supportive throughout my
life, so I extend thanks to my parents Ron and Linda, and sister Katrina.
iv
Table of Contents
Signature page
iii
Dedication
iii
Acknowledgments
iv
Abstract
vii
Introduction
1
1
Introduction to regression
1.1 What is regression? . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.2 Defining the relationship between response and predictors . . . . . . . .
1.3 Parametric regression . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1 Least-squares . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1.3.2 Supremum norm covers and total boundedness . . . . . . . . . .
1.3.3 Empirical Lp covers . . . . . . . . . . . . . . . . . . . . . . . .
1.4 Nonparametric regression . . . . . . . . . . . . . . . . . . . . . . . . . .
1.4.1 Partitioning Regression Estimator . . . . . . . . . . . . . . . . .
1.4.2 k-Nearest Neighbors(k-NN) . . . . . . . . . . . . . . . . . . . .
1.4.3 Kernel Regression . . . . . . . . . . . . . . . . . . . . . . . . .
1.5 Comparison between parametric least-squares and kernel regression estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2
2
3
4
5
7
9
12
12
14
15
2
Missing data
2.1 Missing patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 Selection probability . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.3 Regression in the presence of missing values . . . . . . . . . . . . . . . .
21
21
23
24
3
Kernel regression estimation with incomplete covariates
3.1 Main Results . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1.1 A kernel-based estimator of the selection probability . . .
3.1.2 A least-squares based estimator of the selection probability
3.2 An application to classification with missing covariates . . . . . .
3.3 Kernel regression simulation with missing covariates. . . . . . . .
27
28
32
43
53
55
v
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
17
4
Kernel regression estimation with incomplete response
4.1 Main results . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 A kernel-based estimator of the selection probability . . .
4.1.2 Least-squares-based estimator of the selection probability
4.2 Applications to classification with partially labeled data . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
57
57
59
67
74
Future work
77
Bibliography
78
Appendix A Probability Review
81
Appendix B Technical Lemmas
83
vi
ABSTRACT
KERNEL REGRESSION ESTIMATION FOR INCOMPLETE DATA
By
Timothy Reese
Master of Science in Mathematics
The problem of estimating the regression function from incomplete data is explored. The
case in which the data is incomplete in the sense that a subset of the covariate vector is
missing at random is the concern of chapter 3. Chapter 4 analyzes the case when the
response variable is missing at random. In each of the aforementioned chapters, the problem of incomplete data is handled by constructing kernel regression estimators that utilize
Horvitz-Thompson-type inverse weights, where the selection probabilities are estimated
utilizing both kernel regression and least-squares methods. Exponential upper bounds are
derived on the performance of the Lp norms of the proposed estimators, which can then be
used to establish various strong convergence results. We conclude chapters 3 and 4 with
an application to nonparametric classification in which kernel classifiers are constructed
from incomplete data.
vii
Introduction
Statisticians working in any field are often interested in the relationship between a response variable Y and a vector of covariates Z = (Z1 , · · · , Zd ). This relationship can best
be described by the regression function m(z) = E[Y |Z = z]. The regression function is
estimated utilizing data which has been collected on the variables of interest. In practice
the data is often incomplete, having missing values among the covariates or the response
variables. Values may be unavailable for a variety of reasons, for example, corruption
may occur in the data collection or storage process, a responding unit may fail to answer a
portion of the survey questions, or perhaps the measurement of a variable is too expensive
to collect for the entire sample. Regardless of the reason, missing values are problematic, standard regression methods are no longer available and if the statistician performs a
complete case analysis, the results may be inefficient at best, and at worst highly biased.
Two common alternatives to the complete case analysis have been established in the literature. The most commonly employed method is that of imputation. Imputation methods
attempt to replace the missing values by some estimate. The second approach utilizes a
Horvitz-Thompson-type inverse scaling method (Horvitz and Thompson (1952)) which
works by weighting the complete cases by the inverse of the missing probabilities. This
thesis approaches the problem of estimating the regression function by way of the nonparametric method of kernel regression. The problem of incomplete data is handled by
employing the Horvitz-Thompson-type inverse weighting technique.
We start with a review of the problem of regression for the case in which the data is complete. Both parametric and nonparametric regression estimators are investigated. Convergence results for each estimator to the regression function are explored. Chapter 2 begins
with an analysis of the patterns in which data can become unavailable. Next, the selection
probabilities are clearly defined for a multitude of cases in which the data may be missing.
We shall conclude the chapter with a discussion on the methods for estimating the regression function in the presence of missing data i.e., complete case analysis, imputation, and
Horvitz-Thompson-type inverse weighting.
In chapters 3 and 4, kernel regression estimators are proposed for handling the problem
of missing data. Chapter 3 explores the case when a subset of the covariate vector may
be missing. Chapter 4 develops similar results to that of chapter 3 for the case of missing
response. In both chapters exponential upper bounds are established on the performance
of the Lp norms for each estimator. These bounds are used to establish almost sure convergence results. Each chapter concludes with an application to the problem of nonparametric
classification in the presence of incomplete data.
1
1 Introduction to regression
1.1
What is regression?
Regression is the study of the relationship between some dependent variable Y called
the response variable with a set of independent variables Z = (Z1 , · · · , Zd ) called the
predictors or covariates. The response variable is assumed to have some functional relationship to the predictor variables i.e., if we observe Z = z then Y = f (z) + , where is some random error term. This relationship is studied in part by the regression function
m(z) = E[Y |Z = z]. The importance of the regression function lies in the fact that m(·)
is ”close” to Y . Closeness here is measured by the L2 risk or more commonly referred to
as the mean squared error. The mean squared error is given by
E |Y − f (Z)|2 .
(1.1)
It is easily shown that m(·) minimizes (1.1), i.e.,
m(z) = argmin E |Y − f (Z)|2 .
f :Rd →R
Let f be any measurable function from Rd into R, then
E |Y − f (Z)|2 = E |Y − m(Z) + m(Z) − f (Z)|2
= E |Y − m(Z)|2 + E |m(Z) − f (Z)|2 + 2E [(Y − m(Z)) (m(Z) − f (Z))]
= E |Y − m(Z)|2 + E |m(Z) − f (Z)|2 .
(1.2)
The last line follows by observing that
a.s.
E [(Y − m(Z)) (m(Z) − f (Z))] =
=
=
=
E [E [(Y − m(Z)) (m(Z) − f (Z)) |Z]]
E [(m(Z) − f (Z)) E [(Y − m(Z)) |Z]]
E [(m(Z) − f (Z)) (m(Z) − m(Z))]
0.
Therefore, since (1.2) is greater than or equal to zero, we have that (1.2) is minimized
when f is chosen to be the regression function.
The regression function is important to many different fields of study. The following are a
few examples of areas where regression techniques can be applied.
2
1. In gene expression profiling, information from thousands of genes can be used to
predict recurrence rates for breast cancer.
2. Regression techniques can be used to predict an inmates propensity for violence.
One such method is the Risk Assesment Scale for Prison (RASP-Potosi), RASPPotosi attempts to measure an inmates propensity for violence from a vector of
covariates which include age, length of sentence, education, prior prison terms,
prior probated sentences, conviction for a property offense, and years served (Davis
2013).
3. The success of a business often relies on advertising strategies. Regression techniques may be applied to acquire the necessary information to choose the right advertising strategy. For example, one could model the sales level in relation to the
amount of money spent on advertising in different mediums, such as tv, newspaper,
billboards, and internet.
In each of these examples the researcher observes some data and then utilizes the data to
construct an estimator of the regression function. The regression estimate can then be used
to make inferences.
1.2
Defining the relationship between response and predictors
Let (Z, Y ) be a Rd ×R-valued random vector, with cumulative distribution function (CDF)
F (Z, Y ) = P (Z ≤ z, Y ≤ y). The random vector Z = (Z1 , · · · , Zd ) is the covariate
vector and Y is the response variable of interest. One typically wishes to examine the
relationship between these variables by way of the regression function
m(z) = E [Y |Z = z] .
(1.3)
For example, the covariate vector Z could consist of the following predictor variables
monthly income, amount of debt, previous political contributions, and asset value, the
response variable of interest Y could be political contribution. The relationship of the
potential contribution to the predictors could then be modeled as some function of the
covariates with an additional random error term i.e., Y = f (Z) + . The mean response of
this relationship is portrayed by the regression function and could be used to determine the
amount of political pandering a politician should expend on a particular person or group.
However, in practice the regression function is unavailable, since the distribution of (Z, Y )
is typically unknown. Instead one attempts to estimate the regression function by utilizing
some observed data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}, in which (Zi , Yi )’s are independent
and identically distributed (i.i.d) random vectors with common distribution F (Z, Y ).
3
To determine the utility of an estimator it is important to show that the estimator converges
to the true regression function. The usefulness of the estimator can be measured by the Lp
statistic. The Lp statistic is a commonly used measure of global accuracy for the regression
estimator denoted as mn (·), which is defined as follows
Z
In (p) = |mn (z) − m (z)|p µ(dz), 1 ≤ p < ∞ ,
(1.4)
where µ is the probability measure of Z. In (2) is the mean squared error, this is the most
commonly used measure of accuracy for regression estimators. The primary concern is
to show that the Lp statistic converges to 0 in probability or almost surely (see Appendix
A) as n → ∞, for some fixed value of p. The convergence of the Lp statistic to zero
demonstrates the consistency of mn (·) as an estimator of m(·).
There are two main approaches to estimation of the regression function. The first method
is a parametric-based estimator, which assumes some partial knowledge of the underlying
functional relationship between the response and vector of covariates. Once the model
assumptions have been made, one proceeds to fit the data under some error minimization
criterion. The second methodology utilizes local averaging techniques. These techniques
function without any reliance on restrictive model assumptions. In the following subsections we briefly discuss both approaches, paying particular attention to the method of
least-squares, and the method of kernel regression.
1.3
Parametric regression
When there is prior knowledge about the functional relationship between the covariates
and the response variable, it is important to use this information in the estimation of the
regression function. In parametric regression one makes assumptions about the model
Y = f (Z) + , specifically the function f is assumed to belong to some class of functions
F. The observed data is then used to select the best function among the class of functions
F. This function is chosen such that the residual error is small for the given data under
some chosen minimization criterion. The popular least-squares criterion may be used to
minimize the residual error over the given data.
4
(a)
(b)
Figure 1.1: (a) Best fit line is a good fit. (b) Poor simple linear least-squares fit.
1.3.1
Least-squares
The method of least-squares selects the function from the assumed class of functions that
minimizes the sum of the squared errors, and is defined as follows
m
b LS = argmin
f ∈F
n
X
(Yi − f (Zi ))2 .
(1.5)
i=1
The following is an example of the simplest case of least-squares regression. In this case
we assume that the functional relationship between the response variable Y with that of
the single predictor Z is linear.
Example 1 (Simple Linear Model) Let Z be a single predictor random variable and let
Y be the corresponding response. Assume the model Y = β0 + β1 Z + , i.e., assume that
the class of functions F in (1.5) is the class of linear functions of one predictor. Let the
sample data be Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}, then the least-squares method
argmin
β0 ,β1
n
X
(Yi − β0 − β1 Zi )2 ,
i=1
uses simple calculus minimization techniques to yield
βb0 = Y − βb1 Z
βb1 =
5
1
n
Pn
i=1
1
n
Pn
Zi Yi −Z Y
i=1
Zi2 −Z
2
,
P
where Z = n1 ni=1 Zi , and Y =
results yield the best fit line
1
n
Pn
i=1
Yi denote the averages of Zi and Yi . These
yb = βb0 + βb1 z .
In figure 1.1 two cases were displayed, case (a) in which the best fit line is a good fit and
case (b) where it is a poor fit. We can see based on the figures that model assumptions are
crucial to a proper parametric fit.
The most common way of analyzing the performance of the least-squares estimator in (1.5)
as an estimator of the true regression function is by way of the L2 risk and as such consider
the following result, which establishes an upper bound on the L2 risk of the least-squares
estimator.
Lemma 1.1 (Györfi et al. (2002)) Let Fn = Fn (Dn ) be a class of functions f : Rd → R
depending on the data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}. If m
b LS is as in (1.5) then
Z
n
1 X
2
2
2
|Yi − f (Zi )| − E |Y − f (Z)| |m
b LS (z) − m(z)| µ(dz) ≤ 2 sup f ∈Fn n i=1
Z
+ inf
|m(z) − f (z)|2 µ(dz) .
f ∈Fn
Remark. For the proof and an explanation why the second term converges to 0, see
Lemma 10.1 Györfi et al. (2002, p. 161).
Therefore to establish the convergence of the L2 risk in lemma 1.1 to 0, it remains to show
that the empirical L2 risk
n
1X
(Yi − f (Zi ))2 ,
n i=1
converges uniformly over F, to the actual L2 risk
E |Y − f (Z)|2 ,
i.e., we need to establish
n
1 X
a.s.
(Yi − f (Zi ))2 − E |Y − f (Z)|2 → 0 .
sup f ∈F n i=1
6
(1.6)
Figure 1.2: This figure is an adaption of Figure 9.1 of Györfi et al. (2002). The dashed lines
represent a ± translation of f˜ in red. The figure portrays the supremum norm distance
between the function f and one of the class members f˜, being contained within .
To achieve the uniform convergence in (1.6), we must formally define the idea of totally
bounded classes of functions.
1.3.2
Supremum norm covers and total boundedness
As stated before parametric regression methods assume partial knowledge of the functional
relationship between the response variable Y , to the vector of covariates Z. In other words,
the regression function is assumed to belong to some class of functions F. To establish
the convergence in (1.6), we assume that our class of functions F is totally bounded with
respect to some norm. First, consider the case in which our class of functions is totally
bounded with respect to the supremum norm, as defined below.
Definition 1.1 (Totally Bounded with Respsect to the supremum Norm) We say that a
class of functions F : Rd → R is totally bounded with respect to the supremum norm if
for every > 0, there is a subclass of functions F = {f1 , · · · , fN() } such that for every
f ∈ F there exists a f˜ ∈ F , satisfying
kf − f˜k∞ := sup |f (z) − f˜(z)| < .
z∈Rd
We say that the class F is an -cover of F. The cardinality of the smallest -cover is
denoted by N∞ (, F).
7
To establish the convergence in (1.6), let |Y | ≤ M < ∞ and suppose f belongs to some
known class of functions F. Put G to be the class of functions described as
G = {gf (z, y) = (y − f (z))2 |f ∈ F, gf : Rd → [0, B] , with B = 4M 2 } ,
then we have the following theorem
Theorem 1.1 (see Györfi et al. (2002, Lemma 9.1)) Let G : Rd → [0, M ] be a class of
functions bounded with respect to the supremum norm, then for every > 0 and n large
enough
(
)
n
1 X
2n2
P sup g(Zi , Yi ) − E [g(Z, Y )] > ≤ 2N∞ (/3, G)e− 9B2 . (1.7)
g∈G n
i=1
Remark. Let g, g̃ ∈ G and observe that
kg − g̃k∞ = sup |g(Z, Y ) − g̃(Z, Y )|
Z,Y
2
2
˜
= sup (Y − f (Z)) − (Y − f (Z)) Z,Y
≤ sup (Y − f (Z)) − (Y − f˜(Z)) × (Y − f (Z)) + (Y − f˜(Z))
Z,Y
˜
≤ 4M sup f (Z) − f (Z) ,
Z∈Rd
it then follows that if F/4M = {f1 , · · · , fn } is a minimal /4M -cover of F with respect
to the supremum norm, the class of functions G = {gf1 , · · · , gfN } will be an -cover of G.
Therefore, the covering numbers for F and G satisfy N∞ (, G) ≤ N∞ (/4M, F), hence
2n2
(1.7) is also bounded by 2N∞ (/12M, F)e− 9B2 . As a result, applying the Borel-Cantelli
lemma yields the convergence in (1.6).
Many important classes of functions are in fact totally bounded with respect to the supremum norm; the following are some examples:
Example 2 Consider the exponential class of functions
n
o
F = fθ (x) = e−θ|x| , θ ∈ [0, 1], |x| < 1
8
This class is totally bounded: for every > 0, we have N∞ (, F ) ≤ 2 + b1/(2)c. We
note that this is polynomial in 1/. To see this, observe that
sup e−θ|x| − e−θ̃|x| ≤ θ − θ̃ ,
|x|≤1
where the above follows since e−θ|x| is continuously differentiable and |x| ≤ 1. This
means the sub class of functions needed to cover F depends on the parameter θ. If we let
θi = 2 · i, then taking the set Θ = {θ0 , · · · , θi , · · · , θb 1 c , 1}, we have for any θ there is
2
a θ̃ ∈ Θ such that |θ − θ̃| ≤ . Thus, F is totally bounded with respect toP
the supremum
norm. Similar bounds hold for the general case when f (x) = exp{−( di=1 θi x2i )1/2 }
provided that θi ∈ [0, 1] and |xi | < 1, i = 1, · · · , d. (See Devroye et al.(1996, Ch.28)).
Example 3 (Differentiable functions) Let k1 , · · · , ks ≥ 0 be non-negative integers and put
k = k1 + · · · + ks . Also, for any ψ : Rs → R, define D(k) ψ(u) = ∂ k ψ(u)/∂uk11 · · · ∂uks s .
Consider the class of functions with bounded partial derivatives of order r:
(
)
X
(k)
d
1 Ψ = ψ : [0, 1] → R sup D ψ(u) ≤ C < ∞ .
u
k≤r
Then, for every > 0, log N∞ (, Ψ) ≤ M −α , where α = d/r and M ≡ M (d, r); this is
due to Kolmogorov and Tikhomirov (1959).
Example 4 (Lipschitz bounded convex functions) Let F be the class of all convex functions f : C → [0, 1], where C ⊂ Rd is compact and convex. If f satisfies |f (z1 ) − f (z2 )| ≤
Lkz1 − z2 k for all z1 , z2 ∈ C, where L > 0 is finite, then log N (, F ) ≤ M −d/2 for all
> 0, where the constant M depends on d and L only; see van der Vaart and Wellner
(1996).
1.3.3
Empirical Lp covers
Restricting a class of functions to be bounded with respect to the supremum norm may be
too restrictive, since the covers required can be exceptionally large. If the supremum norm
covers are too large, then we may not obtain convergence in (1.7). An alternative, is to use
smaller Lp norm covers.
Definition 1.2 (Totally Bounded with Respsect to the Lp Norm) We say a class of functions F : Rd → R is totally bounded with respect to the Lp norm if for every > 0, there
9
is a subclass F = {f1 , · · · , fN() } such that for every f ∈ F there exists a f˜ ∈ F such
that
Z p1
p
< .
(1.8)
kf − f˜kp =
f (z) − f˜(z) µ(dz)
The cardinality of the smallest -cover is denoted by Np (, F).
However, these Lp covers depend on the unknown distribution of the random variable Z.
An alternative is to construct data dependent covers. These covers are obtained by replacing the probability measure µ in (1.8) with the empirical measure and they are defined as
follows.
Definition 1.3 (Totally Bounded with Respsect to the Empirical Lp Norm) We say a class
of functions F : Rd → R is totally bounded with respect to the empirical Lp norm if for
every > 0, there is a subclass F = {f1 , · · · , fN() } such that for every f ∈ F there
exists a f˜ ∈ F , for fixed points {(z1 , y1 ), · · · , (zn , yn )} such that
(
n
p
1 X f (zi ) − f˜(zi )
n i=1
) p1
< .
(1.9)
The cardinality of the smallest -cover for the empirical Lp norm is given by Np (, F, Dn ).
Observe that the covering number Np (, F, Dn ) relies on the data Dn , hence the covering
number is a random variable. As a result, we will be concerned with the expected value
of Np (, F, Dn ). Classes of functions exist that are not totally bounded with respect to
the supremum norm, but are in fact bounded with respect to the empirical Lp norm. To
appreciate this, consider the following example.
Example 5 (Non-Lipschitz bounded convex functions (Guntuboyina and Sen 2013)) Let
F be the class of all convex functions f : C → [0, 1], where C ⊂ Rd is compact and
convex. Consider the functions given by
fj (t) := max(0, 1 − 2j t) , t ∈ [0, 1] , j > 1 ,
10
the functions fj (t) belong to F but are not Lipschitz, observe that these functions are not
totally bounded with respect to the supremum norm
kfj − fk k∞ ≥ |fj (2−k ) − fk (2−k )| ≥ 1 − 2j−k ≥ 1/2 , for all j < k .
However, it had been shown (see Guntuboyina and Sen 2013) that the class of all convex
functions can be totally bounded with respect to the Lp norm, even if the functions are not
Lipschitz. They show that log N (, F , Lp ) ≤ M −d/2 , where M is a positive constant that
is independent of .
To derive the convergence result of (1.6), first suppose that |Y | ≤ M < ∞ and let f
belong to a given (known) class of functions F. Consider the class of functions G defined
as G = {gf (z, y) = (y − f (z))2 |f ∈ F, gf : Rd → [0, B] , B = 4M 2 }, we have the
following result from the empirical process theory.
Theorem 1.2 (see Pollard (1984)) Let G : Rd → [0, B] be a class of functions bounded
with respect to the L1 empirical norm, then for every > 0 and n large enough
n
)
(
1 X
2n2
g(Zi , Yi ) − E [g(Z, Y )] > ≤ 8 E [N1 (/8, G, Dn )] e− 128B2 . (1.10)
P sup f ∈F n
i=1
Remark. To achieve the convergence in (1.6), first let g, g † ∈ G and observe
n
1 X g(Zi , Yi ) − g † (Zi , Yi )
n i=1
n
1 X (Yi − f (Zi ))2 − (Yi − f † (Zi ))2 =
n i=1
≤
n
1 X (Yi − f (Zi )) − (Yi − f † (Zi )) × (Yi − f (Zi )) + (Yi − f † (Zi ))
n i=1
n
1 X ≤ 4M
f (Zi ) − f † (Zi ) .
n i=1
Therefore if F/4M = {f1 , · · · , fN } is a minimal /4M -cover of F with respect to the L1
empirical norm, then the class of function G = {gf1 , · · · , gfN } will be an -cover of G.
11
Therefore, (1.10) can be bounded with respect to the covering number for F i.e.,
2n2
2n2
8E[N1 (/8, G, Dn )] e− 128B2 ≤ 8E[N1 (/32M, F, Dn )] e− 128B2 .
Thus, Theorem 1.2 combined with an application of the Borel Cantelli lemma, yield the
almost sure convergence in (1.6), provided that
log(E [N1 (/32M, F, Dn )])/n → 0 ,
1.4
as n → ∞ .
Nonparametric regression
When there is little or no knowledge about the underlying relationships between the covariates and response variable then parametric methods will be problematic, since any
model assumptions need to be accurate to produce reasonable estimates of the regression function. An alternative to the parametric methodology is nonparametric methods.
Nonparametric methods perform without any assumptions on the functional relationship.
Nonparametric methods use a type of local averaging of the response variables Yi . More
specifically, the Yi ’s are weighted according to how ”close” each corresponding Zi is to
the z of interest and then these weighted Yi ’s are added to produce the predicted response.
The corresponding nonparametric regression estimator is as follows
mn (z) =
n
X
Wn,i (z)Yi ,
(1.11)
i=1
where the weights Wn,i are assigned a real number value determined by the ”closeness”
of z to Zi . The way in which these weights Wn,i are chosen determines the type of nonparametric regression estimator to be used. In this section we give an overview of three
standard nonparametric based regression estimators.
1.4.1
Partitioning Regression Estimator
The first nonparametric based estimator to be discussed is the partitioning regression estimator. The partitioning estimator or regressogram as referred to by Tukey in 1961. The
partitioning regression estimator starts by partitioning the covariate space Rd into a collection of nonoverlapping cells denoted as {An,1 , An,2 , · · · }. The regression estimate of
z is then determined by averaging the Yi ’s corresponding to the Zi ’s in the cell which z
resides. For, example in figure 1.3 the y’s corresponding to the points in the circle would
be averaged, since they are the only ones in the cell containing the new observation x.
12
.
.
.
.
..
.
.
.
.
.
.
.
.
.
.
.
.
z
.
.
.
.
Figure 1.3: Arbitrary partition of R2 .
Formally, let An [z] denote the cell containing z, then the regression estimate is
Pn
i=1 Yi I{Zi ∈ An [z]}
mn (z) = P
,
n
i=1 I{Zi ∈ An [z]}
(1.12)
where I{·} denotes the indicator function. The indicator function is one if Zi belongs to
the region containing z and 0 otherwise.
P
In this case the weight function of (1.11) is chosen to be I{Zi ∈ An [z]} ÷ ni=1 I{Zi ∈
An [z]}. A computationally efficient way to choose the partitions is to use cubic and rectangular partitions which partition the space into cells of equal dimension. However, this
method can result in some cells containing a large fraction of the data, while other cells
contain only a couple points. An alternative to equally spaced cells is to use statistically
equivalent blocks. The blocks are constructed to satisfy the criterion that each cell should
contain an equal number of points. The almost sure convergence of (1.12) to the regression
function is shown below. For a proof, refer to Györfi et al (2002).
Theorem 1.3 Let mn (z) be defined as in (1.12). If for each sphere S centered at the origin
lim
max D(An,j ) = 0 ,
n→∞ j:An,j ∩S6=∅
13
where D denotes the diameter of the cell, and
|{j : An,j ∩ S 6= ∅}|
= 0.
n→∞
n
lim
then
a.s.
E[|mn (Z) − m(Z)| |Dn ] → 0 as n → ∞ ,
a.s.
provided that |Y | ≤ M < ∞.
1.4.2
k-Nearest Neighbors(k-NN)
An alternative to the partitioning method or regressogram is the k-nearest neighbors (kNN) estimator. Instead of partitioning the space of the covariates, the k-NN estimator sorts the data based on how ”far” each Zi is from z and then proceeds to average
the Yi ’s of the k closest. Let the data used for estimating the regression function be
Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )} which is an i.i.d. sample from the distribution of (Z, Y ).
Then denote D0n = {(Z(1) , Y(1) ), · · · , (Z(k) , Y(k) ), · · · , (Z(n) , Y(n) )} as the reordered data
set based on how far each Zi is from z, here distance is measured using some distance
function (for example Euclidean distance). Then, the corresponding k-NN estimator is
defined as
mn (z) =
k
X
1
i=1
k
Y(i) .
(1.13)
The expression in (1.13) is simply the average of the Yi ’s corresponding to the k closest
Zi ’s to z, in this case the weight function of (1.11) is chosen to be
1
for the k closest Zi to z
k
Wn,i (z) =
.
0
otherwise
The observant reader may recognize that issues may arise when there are Zi ’s of equal distance from z. If one is willing to assume that these events happen with probability 0 (this is
true for the case when Z has a density) then it can be shown that mn in (1.13) converges to
the regression function almost surely in the L1 sense, provided that Y is bounded (Devroye
and Györfi (1985)). If the distribution of Z is not absolutely continuous, then tie breaking
methods must be employed to derive the almost sure convergence of the estimator. There
have been a variety of methods proposed for tie breaking. Stone (1977) proposed a pseudo
k-NN method which during the event of a tie would average the responses for the values of
the tie, Stone showed the weak convergence in probability of this pseudo k-NN estimator.
14
An alternative to Stone’s method is the method of randomization in which a new uniform
random variable U independent of Z is introduced producing the random vector (Z, U ).
We then artificially grow the data producing
D0n = {(Z1 , U1 , Y1 ), · · · , (Zn , Un , Yn )} ,
where Ui ’s are i.i.d. Uniform(0,1) random variables. Next, reorder the data based on the
distance the new observation (z, u) is from (Zi , Ui ), for each i = 1, · · · , n, i.e., if no tie
exists for the distance between Zi and z then reorder normally, otherwise if a tie exists
between say Zi and Zj then compare the distance between Ui to u and Uj to u to break the
tie. This method results in the reordered data set
D00n = {(Z(1) , Y(1) ), · · · , (Z(n) , Y(n) )} .
The estimator now averages the first k Yi ’s in the reordered data set to produce the estimate for the response. The almost sure convergence of the k-NN estimator under the
randomization tie breaking rule is provided by the following result.
Theorem 1.4 (Devroye et al. (1994)). Let mn (z) be the k-NN regression estimator in
(1.13). If |Y | ≤ M < ∞, if the distribution of (Z, Y ) is not absolutely continuous apply
the randomization rule, then for every > 0, and n large enough
Z
P
|mn (z) − m(z)|µ(dz) > ≤ e−an ,
where a is a positive constant that does not depend on n.
Z
Remark. Furthermore, we obtain
a.s.
|mn (z) − m(z)|µ(dz) → 0, by the Borel-Cantelli
lemma, provided that k → ∞ and k/n → 0 as n → ∞. Therefore, we have that mn (z)
converges almost surely to m(z) in the L1 sense. Further results hold, specifically the
above theorem can be extended to the general Lp statistics by the fact that Y is bounded
(Devroye et al. (1994)).
1.4.3
Kernel Regression
Another local averaging method is the Nadaraya-Watson kernel regression estimator (Nadaraya
(1964), Watson (1964)), which assigns weights that decrease the ”farther” Zi is from z,
15
K(t) =
1
2
I{|t| ≤ 1}
K(t) =
K(t) = (1−|t|) I{|t| ≤ 1}
(b)
(a)
1 2
√1 e− 2 t
2π
(c)
Figure 1.4: Three commonly used single variable kernels (a) Uniform kernel (b) Triangular
Kernel (c) Gaussian Kernel
and is defined as follows
Pn
mn (z) =
z−Zi
hn
Yi K
.
Pn
z−Zi
K
i=1
hn
i=1
(1.14)
The kernel function K : Rd → R is a smooth function of the distance between z and Zi .
The choice of kernel is at the discretion of the statistician but is typically chosen to be a
probability density function. Three common choices of kernels for the case of a univariate
response are displayed in Figure (1.4). The term hn > 0 in (1.14), called the smoothing
parameter of the kernel, typically ranges between 0 and 1. The smoothing parameter controls the smoothness of the kernel regression estimate. The smoothing parameter depends
on the size of the sample n, as n gets larger hn gets smaller but at a rate that does not
decrease too fast. One typically estimates the smoothing parameter from the data at hand.
Various modes of convergence of In (p) defined in (1.4) to zero have been established in
the literature; see, for example, Devroye and Wagner (1980), and Spiegelman and Sacks
P
(1980), whose results yield In (1) −→ 0, as n → ∞. The quantity In (1) also plays an
important role in statistical classification; see for example Devroye et al (1996, Sec. 6.2),
Devroye and Wagner (1980), and Krzyz̀ak and Pawlak (1984). For the almost sure convera.s.
gence of In (1) to 0 (i.e., In (1) −→ 0) see, for example, Devroye (1981), and Devroye and
Krzyz̀ak (1989). In this last cited work by Devroye and Krzyźak, they develop a number
of equivalent results under the assumption that |Y | ≤ M < ∞ and the kernel is regular.
16
Definition A nonnegative kernel K is said to be regular
if there are positive constants
R
b > 0 and r > 0 for which K(z) ≥ b I{z ∈ S0,r } and
sup K(y)dz < ∞, where S0,r
y∈z+S0,r
is the ball of radius r centered at the origin.
Many of the kernels used in practice satisfy the above definition. A widely used regular
kernel is the Gaussian kernel
d
1 T
z
K(z) = (2π)− 2 e− 2 z
.
The results of Devroye and Krzyz̀ak (1989) are of particular importance for later chapters
of this thesis. In chapters 3 and 4 we shall produce similar results to the following theorem,
for the case of incomplete data.
Theorem 1.5 (Devroye and Krzyz̀ak (1989)). Assume that K is a regular kernel. Let
mn (z) be the kernel regression estimate in (1.14), and let 0/0 be defined as 0. Then for
every distribution of (Z, Y ), with |Y | ≤ M < ∞ and for every > 0 and n large enough
we have
Z
P
|mn (z) − m(z)|µ(dz) > ≤ e−cn ,
where c = min2 {2 /[128M 2 (1 + c1 )] , /[32M (1 + c1 )]}.
Z
Remark. This result provides
a.s.
|mn (z)−m(z)|µ(dz) → 0, by the Borel-Cantelli lemma,
provided that limn→∞ hn = 0, limn→∞ nhd = ∞. Hence, the almost sure convergence of
mn (z) to m(z) is established. For an in-depth treatment of kernel regression and kernel
density estimation see the Monograph by Wand and Jones (1995).
1.5
Comparison between parametric least-squares and kernel regression estimators
In this section a simulation is presented to compare kernel regression estimators with parametric least-squares regression estimators. A sample Dn = {(X1 , Y1 ), · · · , (X75 , Y75 )} is
constructed from Yi = e−2|Xi |+5 + i , where Xi and i , i = 1, · · · , n are standard normal random variables. The kernel regression estimate is fitted using a Gaussian kernel
for fixed smoothing parameters h = 0.5, 0.3, 0.2, 0.1. Four choices of parametric models
are chosen to demonstrate the importance of correct model specification. Each regression
estimator is tested on an independent sample of 1000 points, the results are plotted in
Figure 1.5 and the corresponding errors of each regression fit are provided in Table 1.1.
The errors in the table are constructed by comparing the fitted value to the actual Yi for
17
(a)
(b)
(c)
Figure 1.5
2
a) Y = β0 + β1 x + β2 x
kernel h = 0.5
b) Y = β0 + β1 x + β2 x2 + β3 x3 + β4 x4
kernel h = 0.3
c) Y = β0 + β1 e−|x|
kernel h = 0.2
d) Y = eβ1 |x|+β2
kernel h = 0.1
(d)
L1 error
L2 error
45.11667 3681.28500
13.29512 289.98340
22.54348 776.74680
5.83773
81.51567
14.21205 302.08270
4.69530
48.69061
0.06784
0.01039
1.71803
7.16343
max error
161.22510
58.77908
110.51710
44.82134
31.90593
38.57000
0.28397
17.75663
Table 1.1: Empirical L1 , L2 , and Max Errors
i = 1, · · · , 1000. The error is measured using the following methods. The standard L1
18
error (or least absolute deviations) is given by
n
1 X Yi − Ybi ,
n i=1
(1.15)
the empirical L2 risk
n
2
1 X
b
Yi − Yi ,
n i=1
(1.16)
max Yi − Ybi .
(1.17)
and the max error is
1≤i≤1000
In Figure 1.5 (a), we see a poor fit for the case of both estimators. The parametric model
of Y = β0 + β1 x + β2 x2 is a poor choice considering the actual model is nonlinear.
For the kernel regression estimate the smoothing parameter of h = 0.5 is too smooth
for a sample size of n = 75. In Figure 1.5 (b), improvement is seen from the kernel
regression estimator, the smoothing parameter choice of h = 0.3 is closer to the correct
choice. The choice of the parametric fourth degree polynomial is better as we would
expect, but inadequate, as seen in the image and by the error rates in the provided table.
In figure 1.5 (c), there is improvement by both methods. Considering that the parametric
fit Y = β0 + β1 e−|x| is a linear model, the model performs surprisingly well in trying to
approximate the nonlinear model. However, the linear model appears to have some trouble
approximating the curvature of the relationship. Finally, in figure 1.5 (d) we see that the
parametric fit performs great. The correct nonlinear model was specified as Y = eβ1 |x|+β2
and as such resulted in an excellent approximation, which can be seen in the errors and
by visual inspection. The kernel regression method with smoothing parameter h = .1
visually appears unstable, but the error rate dropped.
This simulation demonstrates the heavy dependence of parametric regression on the correct specification of the model. When a model is incorrectly specified, the error will be
large. Visual inspection of the plots allows for easy identification of model inconsistencies.
However, in practice there is no such access to plots, due to the fact that there is usually a
large number of predictors. The statistician must be confident in the validity of any model
assumptions or risk producing incorrect inferences. If the statistician is unable to rely on
model assumptions, it is prudent to use nonparametric methods. Nonparametric methods
19
are not hindered by correct model specifications. However, the kernel regression method
relies on the correct choice of the smoothing parameter h. If the smoothing parameter
is too large (relative to the sample size) then the estimator will be too smooth, while too
small of an h leads to over-fitting. The dependence of the kernel regression method on the
smoothing parameter can be seen in the simulation. Additional difficulties arise for both
estimators when the data contains missing values, this is the focus of the next chapter.
20
2 Missing data
The analysis of the regression function was of primary interest in the previous chapter.
Both parametric and nonparametric estimators were explored. In particular, the nonparmetric kernel regression estimator is of special importance to chapters 3 and 4 of the thesis.
Each of the estimators used data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )} to construct an estimate
of the regression function. In practice, statistical researchers encounter data with missing
values. In particular, a subset of the covariate vector may be missing, due, for instance to
failure of a survey participant to respond to questions deemed too personal. For instance
a poll of subjects may include questions on income, sexual preference, and prior drug use.
Some of those polled may deem these particular questions too personal to answer. The
response variable can also be missing, a situation common in experimental design, where
the covariates are controlled by the researcher. The presence of missing data poses additional problems for the statistician to overcome. The standard regression estimators are no
longer available to the statistician. New methods need to be considered, and knowledge of
the behavior of the missing data is required to produce reliable regression estimators.
This chapter is concerned with providing a brief presentation of the problem of missing
data, see Little and Rubin (2002) for an in-depth accounting of the problem of missing
data. First, the different patterns in which data may be unavailable are described. Second,
the uncertainty of observing or not observing the ith observation is modeled as a Bernoulli
random variable. Different assumptions on the conditional dependencies of the probability
associated with this random variable called the selection probability or missing probability
mechanism are presented. Lastly, three common procedures to produce regression estimators in the presence of incomplete data are examined.
2.1
Missing patterns
There exist a variety of patterns for which data may be unobservable. We shall briefly
discuss five of these patterns. See Little and Rubin (2002) for a more complete treatment
of missing data patterns. Figure 2.1 gives a visual representation of these five patterns.
In Figure 2.1 any of the variables Y1 , · · · , Y5 can represent either the covariates or the
response variable.
The first missing pattern is the univariate nonresponse pattern. This pattern represents
the case in which several variables are completely observable, while one variable may
have missing values. The variable with missing values can be either part of the vector
of covariates or the response variable. However, in regression analysis this pattern is
21
Figure 2.1: This figure is an adaption of Figure 1.1 of Little and Rubin (2002). Each
rectangle represents the available sample size for the given variable.
more representative of the case of missing response. This missing pattern is common in
experimental design scenarios, where the covariate vector is controlled by the researcher.
Little and Rubin (2002) provide an example in which the yield of a particular crop under
different input values of the covariates is of interest. The crop yield may be incomplete
due to the failure of the seed to germinate.
The multivariate two pattern problem is the missing pattern in which there are two distinct
subsets of vectors. The first subset is always available and the second subset has missing
values. The subset with missing values will have the same observation missing for each
variable. This is common in survey sampling, in which individuals in the sample fail to
report on some of the predictor variables. Another scenario that results in this pattern is
when extreme cost of measurement exist for some of the variables. For example, consider a car manufacturer who wishes to measure the safety and durability of their vehicles.
Some of the measurements can be cheap, like stopping time and tire durability. However,
measurements under accident scenarios can result in the complete destruction of the vehicle. These measurements are expensive and it would not be in the interest of the company
to perform repeated measurements.
22
When there is a monotonic decrease in the availability of observations, we denote this
pattern as the monotone pattern. This pattern is common in longitudinal studies, in which
measurements are conducted over a long period of time. The first measurement is always
complete, since the availability at the start of the experiment determines the initial sample
size. When studies progress over a long period of time, unforeseeable circumstances lead
to a decrease in the responding units.
The file matching pattern is a pattern in which two subsets of the variables are never observed at the same time. This behavior can be observed in figure 2.1 (e), in this case Y2 and
Y3 are never observable together. This pattern can result in unique problems, these problems result from the inability to estimate the relationship between the variables which are
never observed together. For example, in quantum mechanics the Heisenberg’s uncertainty
principle states that a particles position and momentum can not both be simultaneously
known, this behavior could lead to the file matching pattern. This specialized example
portrays a situation in which the pattern can occur as a direct result of the experimental
procedure being performed. The merging of two large government databases is a more
common situation that may result in this type of pattern. The reason the merginging may
result in this pattern is that the databases being merged may share some common variables,
but each database has a set of variables that are specific to the department responsible for
that particular database. For more on the file matching pattern and other missing patterns
one may refer to Little and Rubin (2002).
To further understand the behavior of missing data, methods need to be employed to deal
with the uncertainty of a vector being observable. This uncertainty is explored in the
next section. Probability models are developed to represent the probability that a vector is
observed.
2.2
Selection probability
When dealing with incomplete data, the uncertainty of observing a variable must be modeled with some probability mechanism. The underlying missing probability mechanism,
called the selection probability, represents the probability that a variable will be available.
To simplify the presentation of the selection probability, consider the following situation.
Let (Z, Y ) be an Rd+s × R random vector, with Z = (X, V), X ∈ Rd , V ∈ Rs , and
d, s ≥ 1. Suppose the subset V of the covariate vector Z has the possibility of being
unobservable. Introduce the new random variable ∆, where ∆ = 1 if V is observable and
∆ = 0 otherwise. The selection probability can then be defined as follows,
23
Definition 2.1 (Missing Probability Mechanism)
P ∆ = 1Z = z, Y = y = P ∆ = 1X = X, V = v, Y = y .
(2.1)
The conditional probability in (2.1) defines the probability that ∆ = 1 (V is observable)
given (X, V, Y ). When the conditional probability depends on all variables including the
possibly missing V it is called missing not at random (MNAR). When the right hand side
of (2.1) equals a constant our missing probability mechanism is called Missing Completely
at Random (MCAR), i.e.,
Definition 2.2 (Missing Completely at Random (MCAR))
P ∆ = 1Z = z, Y = y = P {∆ = 1} = p .
(2.2)
The MCAR assumption may be reasonable if the data is missing due to simple corruption
in the data gathering or storage of the sample. This situation is denoted as ignorable,
since proceeding if there is no missing data will yield consistent and unbiased estimates.
However, this situation is unlikely and if assumed will often lead to biased estimates.
Instead, a more common assumption is that of missing at random (MAR). The Missing
at Random assumption is a compromise between the overly restrictive MNAR and highly
unlikely MCAR assumptions. We define the MAR mechanism as follows
Definition 2.3 (Missing at Random (MAR))
P ∆ = 1Z = z, Y = y = P ∆ = 1X = x, Y = y .
(2.3)
Here, (2.3) assumes that the probability that V is missing depends only on the always
observable vector (X, Y ).
2.3
Regression in the presence of missing values
Recall, that our main goal has been to estimate the regression function m(z) = E[Y |Z =
z], when the data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )} may contain missing values. A variety
of techniques have been developed to deal with the problem of missing values among
the data. Here we briefly introduce three common methods of estimating the regression
function when exposed to incomplete data and analyze the relative merits of each method.
Complete case analysis
24
The first and most obvious method is to use only the cases with no missing values to
estimate the regression function. This method is called complete case analysis. In complete case analysis any observation that has a missing value is ignored in the process of
estimating the regression function. In complete case analysis the data used in estimating
the regresson function is no longer Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}, instead the effec0
tive data
Pnis Dn = {(Z1 , Y1 ), · · · , (Zm , Ym )}. In this case the sample thsize is reduced to
m = i=1 ∆i , with ∆i = 1 if there is no missing values among the i observation and
∆i = 0 otherwise. This method is only recommended if the number of missing cases is
small relative to the sample size. If the number of missing cases is large, using complete
case analysis will result in an increase in the variance of the estimator, which may lead
to a drastic loss ofPefficiency. Another issue of complete case methods is that the reduced
sample size m = ni=1 ∆i is a random variable. This randomness leads to difficulties. The
majority of regression estimation methods assume that the sample size is a fixed quantity.
Furthermore, if the selection probability is not MCAR the regression estimator is likely to
be biased.
Imputation methods
The reduction in sample size in complete case analysis is an undesirable property. Imputation methods attempt to reconstruct the sample by imputing or ”substituting” the missing
values. To fill in a missing value in a case, an estimate of this value is constructed from
the available data, this estimate is then inserted in place of the missing value. To be more
precise, suppose we have data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )} and that a subvector Vi of
Zi = (Xi , Vi ), could be missing. The imputation method would then produce the new
sample Dn = {(X1 , Vb1 , Y1 ), · · · , (Xn , Vbn , Yn )}, where Vbi is either Vi if Vi is observable,
or an estimate, if Vi was missing. Two common methods used in the area of survey sampling are hot deck and cold deck imputation. In hot deck imputation, the value of a similar
available unit in the sample is used in place of the missing value. Cold deck imputation
uses a similar unit from an alternative sample (perhaps a previous realization) in place of
the missing value of the current sample. Another obvious possibility is to replace the missing values with a mean from the observed quantities of the same variable, this is referred
to as mean imputation. Mean imputation is not recommended, this method can result in
inaccurate summary statistics of the sample, in particular it underestimates the variance
and tends to shift the sample correlation between variables toward zero. Another common
method is to impute the missing values by performing a regression of the other available
variables on the observed variables corresponding to the missing value. This regression
estimate is then used to predict the missing value. The method of regression imputation
conditions on the observed variables which helps reduce bias due to nonresponse.
25
The above methods of imputation are referred to as single imputation, since one single
value is estimated to replace the missing value. Single imputation methods have the disadvantage of treating the imputed value as a known quantity, thus removing the uncertainty
associated with the missing value. An alternative to single imputation is to use multiple
imputation. Multiple imputation replaces the missing values with several estimates resulting in multiple complete data sets. Inferences on each data set can be combined to achieve
one overall inference that properly portrays the uncertainty associated with the presence
of missing values.
Horvitz-Thompson Inverse Weights
When the data is not MCAR the complete case analysis can result in biased estimates
of the regression function. The complete case regression model can be bias-corrected by
by using Horvitz-Thompson-type inverse weights (Horvitz and Thompson (1952)). This
method works by weighting the complete cases by the inverse of the missing data probability mechanism. As a result of weighting by the inverse missing probabilities values that
were observed but had a high probability of being unavailable will be given more weight,
thus correcting for the bias of the complete case regression estimate. However, one typically does not have knowledge of the missing probability mechanism and as a result the
method must use some estimator of these probabilities. This method has previously been
used by some authors to estimate the mean of a distribution with missing data; see, for
example, Hirano et al. (2003). Robins, Rotnitzky and Zhao (1994) developed a class of locally and globally adaptive semi-parametric estimators using inverse probability weighted
estimating equations for the case of covariates missing at random. For a comparison between imputation methods and Horvitz-Thompson-type inverse weights see Vansteelandt
(2010). In the following chapters we utilize the method of Horvitz-Thompson-type inverse
weights to construct kernel regression estimators for incomplete data.
26
3 Kernel regression estimation with incomplete covariates
In the previous chapter we introduced the problem of regression estimation when the data
used in the estimation is incomplete. In this chapter we study nonparametric regression, for
the specific case of missing covariates. Until recently, there has been little work in the area
of nonparametric regression with missing covariates. Mojirsheibani (2007) considered a
multiple imputation method to estimate the regression function when part of the covariate
vector could be missing at random, this is similar to the work of Cheng and Chu (1996)
in that the regression estimate imputes a specific function of a missing value (and not
the missing value itself). Efromovich (2011) constructed an adaptive orthogonal series
estimator and derived the minimax rate of the MISE when the regression function belongs
to a k-fold Sobolev class. Wand et al. (2012) developed variational Bayes algorithms for
penalized spline nonparametric regression when the single covaraite variable is susceptible
to missingness.
Our contribution will be to construct kernel regression estimators from data in which a
subset of the covariate vector may be missing at random. We propose kernel regression
estimators that use Horvitz-Thompson-type inverse weights, in which the selection probabilities are estimated utilizing kernel regression or least-squares methods. Exponential
upper bounds are derived for the Lp norm of each estimator, these results yield almost sure
convergence of the Lp statistic to 0 by an application of the Borel-Canteli lemma. As an
immediate application of these results, the problem of nonparametric classification with
missing covariates will be studied. Now, let us formally define the problem.
Let (Z, Y ) be an Rd × R-valued random vector and let
m(Z) = E [Y |Z = z]
be the regression function. Given the data Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}, where the
(Zi , Yi )’s are independently and identically distributed (iid) random vectors with the same
distribution as that of (Z, Y ), the regression function m(z) can be estimated by the NadarayaWatson kernel estimate (Nadaraya (1964), Watson (1964))
Pn
z−Zi
Y
K
i
i=1
hn
,
(3.1)
mn (z) = P
n
z−Zi
K
i=1
hn
which was defined in (1.14). Here 0 < h ≡ h(n) is a smoothing parameter depending on n
27
and the kernel K is an absolutely integrable function. The case in which the data was completely observable was already explored in Chapter 1, where the almost sure convergence
of mn (·) to m(·) was established. The result of Devroye and Krzyz̀ak (1989) in Theorem 1.5 will be of particular importance to the results of this chapter. We are interested
in extending these results to the case where a subset of the covariate vector Zi may be
unavailable from Dn . Clearly, the estimator (3.1) is no longer available. In the following
sections we present our proposed estimators.
3.1
Main Results
Let (Z, Y ) be an Rd+s × R-valued random vector and Z = (X0 , V0 )0 , with X ∈ Rd , V ∈
Rs , and, d, s ≥ 1. Our goal is to estimate the regression function m(z) = E[Y |Z = z]
when V may be missing at random. Under the missing at random assumption the selection
probability is defined as follows.
P (∆ = 1|Z = z, Y = y) = P (∆ = 1|X = x, Y = y) =: η ∗ (x, y) .
(3.2)
We first consider the unrealistic case in which the selection probability η ∗ is known.
In this case we estimate the regression function m(z) by constructing a modified kernel regression estimator from the i.i.d. sample Dn = {(Z1 , Y1 , ∆1 ), · · · , (Zn , Yn , ∆n )}
= {(X1 , V1 , Y1 , ∆1 ), · · · , (Xn , Vn , Yn , ∆n )}, with ∆i = 1 if Vi is observable and 0 otherwise, and is defined as follows
Pn
z−Zi
∆ i Yi
i=1 η ∗ (Xi ,Yi ) K( hn )
.
∗
(3.3)
m
b η (z) = P
n
∆i
z−Zi
K
∗
i=1 η (Xi ,Yi )
hn
Observe that the estimator in (3.3) works by weighting the complete cases by the inverse of
the selection probabilities; this is very much in the spirit of the classical Horvitz-Thompson
estimator (Horvitz and Thompson (1952)). As for the usefulness of m
b η∗ (z) as an estimator
of m(z), observe that (3.3) can be rewritten as follows
Pn
z−Zi
z−Zi
K(
)/
i=1 K( hn )
i=1
hn
P
,
m
b η∗ (z) = P
n
n
z−Zi
∆i
z−Zi
K
/
K(
)
i=1 η ∗ (Xi ,Yi )
i=1
hn
hn
Pn
∆i Y i
η ∗ (Xi ,Yi )
which is a ratio of kernel estimators for the following ratio of two conditional expectations
28
E
h
E
h
∆Y
η ∗ (X,Y
|Z
)
∆
|Z
η ∗ (X,Y )
i
a.s.
i
E
h
E
h
=
Y
η ∗ (X,Y
E [∆|Z, Y ] |Z
)
1
E
η ∗ (X,Y )
[∆|Z, Y ] |Z
i
i=
E [Y |Z]
= m(Z) .
E [1|Z]
Therefore when the missing probability mechanism η ∗ is known, (3.3) can be seen as a
kernel regression estimator of the regression function E[Y |Z] = m(Z) . To explore the
accuracy of our estimator in (3.3) we examine the convergence of the Lp statistic to 0. We
shall assume in what follows that the selected kernel K is regular as defined in (1.4.3),
which we recall is
Definition A nonnegative kernel K is said to be regular
if there are positive constants
R
b > 0 and r > 0 for which K(z) ≥ b I{z ∈ S0,r } and
sup K(y)dz < ∞, where S0,r
y∈z+S0,r
is the ball of radius r centered at the origin.
Many of the kernels used in practice are regular kernels, in fact the popular Gaussian
kernel is a regular kernel. The regular kernel is the type of kernel used by Devroye and
Krzyz̀ak (1989). For more on regular kernels the reader is referred to Györfi et al (2002).
Before proceeding we need to state the following condition
Condition A1.
inf
x∈Rd ,y∈R
P (∆ = 1|X = x, Y = y) := η0 > 0, where η0 can be arbitrarily
small.
Condition A1 is necessary so that V is observed with a nonzero probability. This is a
rather standard assumption in the literature on missing data; see, for example, Wang and
Qin (2010) and Cheng and Chu (1996). The following theorem employs the result of the
main theorem of Devroye and Krzyz̀ak (1989).
Theorem 3.1 Let m
b η∗ (z) be the kernel regression estimator defined in (3.3), where K is a
regular kernel, and the condition A1 holds. If |Y | ≤ M < ∞ and hn → 0, nhd+s → ∞
as n → ∞, then for every > 0 and n large enough
Z
p
P
|m
b η∗ (z) − m(z)| µ(dz) > ≤ 8e−na1
where a1 ≡ a1 () = min2 (2 η02p /[22p+7 M 2p (1 + η0 )2p−2 (1 + c1 )], η0p /[2p+5 M p (1 +
η0 )p−1 (1 + c1 )]), with c1 the positive constant of Lemma B.1.
29
Remark. Theorem 3.1 in conjunction with the Borel-Cantelli lemma yields E[|m
b η∗ (Z) −
a.s.
p
m(Z)| |Dn ] → 0, as n → ∞.
PROOF OF THEOREM 3.1
Let m̃η∗ be the kernel regression estmator of m(z) and mη∗ be the kernel regression estia.s.
mator of Y = 1, defined as
Pn
Pn
∆ i Yi
z−Zi
∆i
z−Zi
K
K
i=1 η ∗ (Xi ,Yi )
i=1 η ∗ (Xi ,Yi )
hn
hn
m̃η∗ (z) =
mη∗ (z) =
. (3.4)
Pn
P
n
z−Zi
z−Zi
K
K
i=1
i=1
hn
hn
Observing that |m̃η∗ (z)/mη∗ (z)| ≤ M , we have the following
m̃η∗ (z) m(z) ≤ M |mη∗ (z) − 1| + |m̃η∗ (z) − m(z)|(3.5)
−
.
|m
b η∗ (z) − m(z)| ≤ mη∗ (z)
1 30
As a result of (3.5), we have that for every > 0, and n large enough
Z
p
P
|m
b η∗ (z) − m(z)| µ(dz) > Z
p
≤P
|M |mη∗ (z) − 1| + |m̃η∗ (z) − m(z)|| µ(dz) > Z
p
p
p−1
p−1
p
≤P
2 M |mη∗ (z) − 1| + 2 |m̃η∗ (z) − m(z)| µ(dz) > Z
p−1
p−1
p
≤P
2 M (|mη∗ (z)| + 1) |mη∗ (z) − 1| µ(dz) >
2
Z
p−1
p−1
+P
2 (|m̃η∗ (z)| + |m(z)|) |m̃η∗ (z) − m(z)| µ(dz) >
2
(Z
)
p−1
1
≤P
2p−1 M p
+1
|mη∗ (z) − 1| µ(dz) >
η0
2
)
(Z
p−1
M
+P
2p−1
+M
|m̃η∗ (z) − m(z)| µ(dz) >
η0
2
Z
Z
0
0
|mη∗ (z) − 1| µ(dz) > + P
=P
|m̃η∗ (z) − m(z)| µ(dz) > M (where, 0 = η0p−1 /[2p M p (1 + η0 )p−1 ])
≤ 8e−na1 , (by (1.5), in view of the result of Devroye and Krzyz̀ak (1989)) .
Here, a1 ≡ a1 () = min2 (2 η02p /[22p+7 M 2p (1+η0 )2p−2 (1+c1 )], η0p /[2p+5 M p (1+η0 )p−1 (1+
c1 )]), with c1 the positive constant in Lemma B.1.
The kernel estimator defined in (3.3) requires the knowledge of the missing probability
mechanism η ∗ (X, Y ) = E[∆|X, Y ] (an unrealistic case). If η ∗ is unknown, then it has
to be replaced by some estimator in (3.3). We consider two different estimators of η ∗ , a
kernel regression estimator and a least-squares regression type estimator.
31
3.1.1
A kernel-based estimator of the selection probability
We first consider replacing η ∗ (Xi , Yi ) in (3.3) by one of the two following estimators
Pn
Xi −Xj
j=16=i ∆j I {Yi = Yj } H
λn
,
ηb(Xi , Yi ) =
(3.6)
Pn
Xi −Xj
I
{Y
=
Y
}
H
i
j
j=16=i
λn
Pn
ηbc (Xi , Yi ) =
Ui −Uj
λn
∆j H
, with Ui = (Z0i , Yi )0 ,
Pn
Ui −Uj
j=16=i H
λn
j=16=i
(3.7)
where H : Rd → R+ in (3.6) and H : Rd+1 → R+ in (3.7), where H is a kernel with
smoothing parameter λn , satisfying λn → 0 as n → ∞, with the convention that 0/0 = 0.
The estimator in (3.6) is used if Y is a discrete random variable, otherwise one estimates
the selection probability with (3.7). We derive an exponential bound for the more complicated case of discrete Y i.e., when the selection probability is estimated by the estimator
described in (3.6). The continuous case follows with simpler arguments.
Now, suppose that Z ∈ Rd and Y ∈ Y, where Y is a countable subset of R, then our
modified kernel-type regression estimator of m(z) = E[Y |Z = z] is given by
Pn
∆i Y i
z−Zi
i=1 ηb(Xi ,Yi ) K( hn )
,
m
b ηb(z) = P
(3.8)
n
∆i
z−Zi
K
i=1 ηb(Xi ,Yi )
hn
where ηb is as in (3.6). To study (3.8) we first state some standard regularity conditions.
R
R
Condition A2. The kernel H in (3.6) satisfies Rd H(u)du = 1 and Rd |ui |H(u)du < ∞,
for i = 1, · · · , d, where u = (u1 , · · · , ud )0 . Furthermore, the smoothing parameter λn
satisfies λn → 0 and nλdn → ∞, as n → ∞
Condition A3.PThe random vector X has a compactly supported probability density function, f (x) = y∈Y fy (x)P (Y = y), which is bounded away from zero on its support,
where fy (x) is the conditional density of X given Y = y. Additionally, both f and its
first-order partial derivatives are uniformly bounded on its support.
32
Condition A4. The partial derivatives ∂x∂ i η ∗ (x, y), exist for i = 1, · · · , d and are bounded
uniformly, in x, on the compact support of f .
Condition A2 is not restrictive since the choice of the kernel H is at our discretion. Condition A3 is usually imposed in nonparametric regression in order to avoid having unstable
estimates (in the tail of the p.d.f of X). Condition A4 is technical and has already been
used in the literature; see, for example, Cheng and Chu (1996).
Theorem 3.2 Let m
b ηb(z) be defined as in (3.8), where K is a regular kernel. Suppose that
|Y | ≤ M < ∞. If hn → 0, nhd+s → ∞, as n → ∞, then under conditions A1, A2, A3,
and A4, then for every > 0 and every p ∈ [1, ∞), and n large enough,
Z
d
d
p
P
|m
b ηb(z) − m(z)| µ(dz) > ≤ 8e−na2 + 2e−na3 + 8ne−(n−1)λn a4 + 8ne−(n−1)λn a5 ,
where
2
a2 ≡ a2 () = min
η0p
2 η02p
,
24p+7 M 2p (1 + η0 )2p−2 (1 + c1 ) 22p+5 M p (1 + η0 )p−1 (1 + c1 )
,
2 η02p+2 ψ02
2 η02p
,
a
≡
a
()
=
,
4
4
24p+8 32p c21 M 2p
24p+12 32p−2 c21 M 2p kHk∞ [kf k∞ + ψ0 /12]
η02 ψ02
= 7
,
2 kHk∞ [kf k∞ + ψ0 /12]
a3 =
a5
with ψ0 given by 0 < ψ0 := f0 · inf P (Y = y) and c1 the positive constant in Lemma B.1.
y∈Y
Remark. In view of Theorem 3.2 and the Borel-Cantelli lemma, if log n/(nλdn ) → 0 as
a.s.
n → ∞, then E[|m
b ηb(z) − m(z)|p |Dn ] → 0.
PROOF OF THEOREM 3.2
To prove Theorem 3.2, we first need to define a couple terms, let
Pn
Pn
∆ i Yi
z−Zi
∆i
z−Zi
K
K
i=1 ηb(Xi ,Yi )
i=1 ηb(Xi ,Yi )
hn
hn
m̃ηb(z) =
, and mηb(z) =
.(3.9)
Pn
Pn
z−Zi
z−Zi
K
K
i=1
i=1
hn
hn
33
Then, upon observing that |m̃ηb(z)/mηb(z)| ≤ M , one obtains
m̃ηb(z) m(z) m̃ηb(z) ≤ −
|m
b ηb(z) − m(z)| = mηb(z) |mηb(z) − 1| + |m̃ηb(z) − m(z)|
mηb(z)
1 ≤ M |mηb(z) − 1| + |m̃ηb(z) − m(z)| .
Therefore, with m̃ηb(z) and mηb(z) as in (3.9), and m̃η∗ (z) and mη∗ (z) as in (3.4), we have
for every > 0
Z
p
P
|m
b ηb(z) − m(z)| µ(dz) > Z
p
p−1
p
≤P 2 M
|mηb(z) − 1| µ(dz) >
2
Z
p
p−1
+P 2
|m̃ηb(z) − m(z)| µ(dz) >
2
Z
p
2p−2
p
≤P 2
M
|mηb(z) − mη∗ (z)| µ(dz) >
4
Z
+ P 22p−2 M p |mη∗ (z) − 1|p µ(dz) >
4
Z
p
2p−2
+P 2
|m̃ηb(z) − m̃η∗ (z)| µ(dz) >
4
Z
p
2p−2
+P 2
|m̃η∗ (z) − m(z)| µ(dz) >
4
(
)
p−1 Z
1
1
≤ P 22p−2 M p n
+ |mηb(z) − mη∗ (z)| µ(dz) >
∧i=1 ηb(Xi , Yi ) η0 4
(
)
p−1 Z
2p−2
p 1
+P 2
M + 1
|mη∗ (z) − 1| µ(dz) >
η0
4
(
)
p−1 Z
M
M
+ P 22p−2 n
+ |m̃ηb(z) − m̃η∗ (z)| µ(dz) >
∧i=1 ηb(Xi , Yi ) η0 4
(
)
p−1 Z
2p−2 M
+P 2
|m̃η∗ (z) − m(z)| µ(dz) >
η0 + M 4
:= Pn,1 + Pn,2 + Pn,3 + Pn,4 , (say).
34
In view of Theorem 3.1, it follows that
Pn,2 + Pn,4 ≤ 8e−na2 ,
(3.10)
where
a2 ≡ a2 () = min
2
2 η02p
η0p
,
24p+7 M 2p (1 + η0 )2p−2 (1 + c1 ) 22p+5 M p (1 + η0 )p−1 (1 + c1 )
To deal with the term Pn,1 , first observe that
|mηb(z) − mη∗ (z)|
P
P
n
z−Zi
∆i
z−Zi ∆i
n
K
K
i=1 ηb(Xi ,Yi )
i=1 η ∗ (Xi ,Yi )
hn
hn
−
= Pn
P
n
z−Zi
z−Zi
i=1 K
i=1 K
hn
hn
P
z−Zi n ∆
1
1
−
K
i
∗
i=1
ηb(Xi ,Yi )
η (Xi ,Yi )
hn
= Pn
z−Z
i
i=1 K
hn
P
z−Zi n ∆
1
1
i=1 i ηb(Xi ,Yi ) − η∗ (Xi ,Yi ) K hn i
h ≤ z−Z
nE K hn
n
X
1
z − Zi
1
+
∆i
−
K
ηb(Xi , Yi ) η ∗ (Xi , Yi )
hn
i=1


1
1


−
i h × P
n
z−Zi
nE K z−Z
i=1 K
h
h
n
:= Sn,1 (z) + Sn,2 (z), (say).
35
n
.
R
Sn,1 (z)µ(dz) as follows
z−Zi
1
Z
Z Pn ∗ 1
−
∆
K
i
i=1 η (Xi ,Yi )
ηb(Xi ,Yi )
hn
h i
Sn,1 (z)µ(dz) ≤
µ(dz)
nE K z−Z
hn


!
Z
n K z−w
X
hn
1
1
1
i µ(dz)
h ≤ sup
−
η ∗ (Xi , Yi ) ηb(Xi , Yi ) z−Z
n
w
E K hn
i=1
!
n
1 X 1
1
,
≤ c1
−
(3.11)
n i=1 η ∗ (Xi , Yi ) ηb(Xi , Yi ) However, we can bound
where the last line follows from Lemma B.1 in the appendix, and the fact
R that |∆i | ≤
1 , ∀i = 1, · · · , n. Here c1 is the positive constant of Lemma B.1. To bound Sn,2 (z)µ(dz),
we have
Z
1
1
−
Sn,2 (z)µ(dz) ≤ max ∗
1≤i≤n η (Xi , Yi )
ηb(Xi , Yi ) Z Pn K z−Zi
i=1
hn
i − 1 µ(dz) . (3.12)
h × nE K z−Z
h
n
36
Therefore, we have
(
Pn,1
)
p−1 Z
1
1
≤ P 22p−2 M p + n
(Sn,1 (z) + Sn,2 (z)) µ(dz) >
η0 ∧i=1 ηb(Xi , Yi ) 4
("
#
p−1 X
n ∗
1
1
η
(X
,
Y
)
−
η
b
(X
,
Y
)
c
i
i
i
i 1
≤ P
22p−2 M p + n
>
η0 ∧i=1 ηb(Xi , Yi ) n i=1 η ∗ (Xi , Yi )b
η (Xi , Yi ) 8
)
n h
\
η0 i
∩
ηb(Xi , Yi ) ≥
2
i=1
("
p−1
1
1
1
2p−2
p 1
max ∗
−
+P
2
M + n
1≤i≤n η (Xi , Yi )
η0 ∧i=1 ηb(Xi , Yi )
ηb(Xi , Yi ) 

Z Pn K z−Zi
n
h
i
\
i=1
hn
η0 
µ(dz) >  ∩
h
i
η
b
(X
,
Y
)
≥
−
1
×
i
i
8
2 
nE K z−Z
i=1
hn
(n
)
[
η0
+P
ηb(Xi , Yi ) <
2
i=1
)
( n
η0p+1
1X
∗
|b
η (Xi , Yi ) − η (Xi , Yi )| > 2p+2 p−1 p
(3.13)
≤ P
n i=1
2
3 M c1

 P
i
p
Z ni=1 K z−Z

hn
η0
i − 1 µ(dz) > 2p+1 p p
h (3.14)
+P
 nE K z−Z
2
3M 
hn
(n
)
[
η0
+P
ηb(Xi , Yi ) <
.
(3.15)
2
i=1
37
The bound for the term Pn,3 follows in a similar fashion as in Pn,1 , we first note that
|m̃ηb(z) − m̃η∗ (z)|
P
z−Zi n ∆Y
1
1
K
−
i=1 i i ηb(Xi ,Yi ) η∗ (Xi ,Yi )
hn
= Pn
z−Z
i
i=1 K
hn
P
z−Zi n ∆Y
1
1
i=1 i i ηb(Xi ,Yi ) − η∗ (Xi ,Yi ) K hn h i
≤ z−Z
nE K hn
n
X
1
1
z − Zi
+
∆i Yi
− ∗
K
η
b
(X
,
Y
)
η
(X
,
Y
)
hn
i
i
i
i
i=1


1
1


−
h i × P
n
z−Zi
nE K z−Z
i=1 K
hn
hn
0
0
:= Sn,1
(z) + Sn,2
(z) (say).
R
R 0
(z)µ(dz) is as follows
Similar to that of Sn,1 (z)µ(dz), the bound for Sn,1
Z
0
Sn,1
(z)µ(dz) ≤ M · c1
!
n X
1
1
1
.
−
n i=1 ηb(Xi , Yi ) η ∗ (Xi , Yi ) The above line follows as a result of Lemma B.1 in the appendix,
and the fact that |∆i Yi | ≤
R 0
(z)µ(dz), we have
M , ∀i = 1, · · · , n, (c1 as in Lemma B.1). For the term Sn,2
Z
Z
1
1
0
Sn,2 (z)µ(dz) ≤ M · max − ∗
1≤i≤n η
b(Xi , Yi ) η (Xi , Yi ) 38
P
n K z−Zi
i=1
hn
i − 1 µ(dz) .
h nE K z−Z
hn
Therefore, Pn,3 can be bounded as follows
(
)
p−1 Z
M
M
0
0
Pn,3 ≤ P 22p−2 + n
Sn,1
(z) + Sn,2
(z) µ(dz) >
η0
∧i=1 ηb(Xi , Yi )
4
("
#
p−1
n
∗
X
M
η
(X
,
Y
)
−
η
b
(X
,
Y
)
M
c
i
i
i
i 1
≤ P
22p−2 + n
>
∗
η0
∧i=1 ηb(Xi , Yi )
n i=1 η (Xi , Yi )b
η (Xi , Yi ) 8
)
n h
\
η0 i
∩
ηb(Xi , Yi ) ≥
2
i=1
("
p−1
1
M
1
2p−2 M
+P
2
+
max
−
η0
∧ni=1 ηb(Xi , Yi ) 1≤i≤n η ∗ (Xi , Yi ) ηb(Xi , Yi ) 

Z Pn K z−Zi
n h
i
\
i=1
hn

η0 
h i − 1 µ(dz) >
×
∩
ηb(Xi , Yi ) ≥
8
2 
nE K z−Z
i=1
hn
(n
)
[
η0
+P
ηb(Xi , Yi ) <
2
( i=1n
)
p+1
1X
η
0
≤ P
(3.16)
|b
η (Xi , Yi ) − η ∗ (Xi , Yi )| > 2p+2 p−1
n i=1
2
3 M p c1
 P

i
p
Z ni=1 K z−Z

hn
η0
µ(dz) >
h
i
−
1
+P
(3.17)
 nE K z−Z
22p+1 3p M p 
hn
(n
)
[
η0
+P
ηb(Xi , Yi ) <
.
(3.18)
2
i=1
39
Therefore, in view of (3.13), (3.14), (3.15), (3.16), (3.17), and (3.18), we find
n
X
η0p+1
∗
Pn,1 + Pn,3 ≤ 2
P |b
η (Xi , Yi ) − η (Xi , Yi )| > 2p+2 p−1 p
2
3 M c1
i=1
 P

i
p
Z ni=1 K z−Z

hn
η0
µ(dz) >
i
h
−
1
+2P
 nE K z−Z
22p+1 3p M p 
hn
+2
n
X
i=1
n
η0 o
P ηb(Xi , Yi ) <
2
:= Tn,1 + Tn,2 + Tn,3 , (say).
(3.19)
a.s.
First note that upon taking Yi = 1 and m(z) = 1 in Lemma B.2 for i = 1, · · · , n, we find
Tn,2 ≤ 2e−na3 ,
(3.20)
where a3 ≡ a3 () = (2 η02p )/(24p+8 32p c21 M 2p ). To deal with the term Tn,1 , define the
following terms
Φ (Xi , Yi ) = η ∗ (Xi , Yi )f (Xi )P (Y = Yi |Yi )
n
X
X i − Xj
−1 −d
b
Φ (Xi , Yi ) = (n − 1) λn
∆j I{Yi = Yj }H
λn
j=16=i
Ψ (Xi , Yi ) = f (Xi )P (Y = Yi |Yi )
n
X
X i − Xj
−1 −d
b
Ψ (Xi , Yi ) = (n − 1) λn
I{Yi = Yj }H
.
λ
n
j=16=i
Notice that
b (Xi , Yi )
Φ
Φ (Xi , Yi )
= η ∗ (Xi , Yi ) ,
= ηb(Xi , Yi ) and
b (Xi , Yi )
Ψ (Xi , Yi )
Ψ
Φ(Xi ,Yi )
as well as that | Ψ(X
| ≤ 1. Where λn and H are as in (3.6). Using the above definitions,
b
,Y )
b
i
i
40
observe that
Φ
b (Xi , Yi ) Φ (Xi , Yi ) |b
η (Xi , Yi ) − η ∗ (Xi , Yi )| = −
b (Xi , Yi ) Ψ (Xi , Yi ) Ψ
b
Ψ (Xi , Yi ) − Ψ (Xi , Yi )
≤
−
Ψ (Xi , Yi )
b
Φ (Xi , Yi ) − Φ (Xi , Yi )
Ψ (Xi , Yi )
.
Now, let 0 < f0 := inf f (x) (see condition A3), in view of this we have
x∈Rd
Ψ(Xi , Yi ) = f (Xi )P (Y = Yi |Yi ) ≥ f0 inf P (Y = y) =: ψ0 > 0 .
y∈Y
(3.21)
Therefore,
Tn,1 ≤ 2
n
X
n
o
b
P Ψ (Xi , Yi ) − Ψ (Xi , Yi ) > t
i=1
n
X
+2
o
n
b
(Xi , Yi ) − Φ (Xi , Yi ) > t ,
P Φ
i=1
where t =
η0p+1 ψ0
3p−1 M p c1
22p+3
> 0, with ψ0 as in (3.21). Now, put
Γj (Xi , Yi ) =
Xi − X j
∆j I{Yi = Yj }H
λn
#)
"
Xi − Xj ,
−E ∆j I{Yi = Yj }H
Xi , Yi
λn
λ−d
n
41
(3.22)
(3.23)
for i = 1, · · · , n, j = 1 · · · , n, j 6= i, and observe that
n
o
b
P Φ (Xi , Yi ) − Φ (Xi , Yi ) > t
b
b
= P Φ (Xi , Yi ) ± E Φ (Xi , Yi ) Xi , Yi − Φ (Xi , Yi ) > t
t
b
b
≤ P Φ (Xi , Yi ) − E Φ (Xi , Yi ) Xi , Yi + > t
2
(for large enough n, by Lemma B.3 in the appendix)
" (
)#
t b (Xi , Yi ) − E Φ
b (Xi , Yi ) Xi , Yi > Xi , Yi
= E P Φ
2
)#
" (
n
t
X
−1 ,
= E P (n − 1) Γj (Xi , Yi ) > Xi , Yi
2
j=16=i
But, conditional on (Xi , Yi ), the terms Γj (Xi , Yi ) , j = 1, · · · , n , j 6= i, are independent
−d
zero-mean random variables, bounded by −λ−d
n kHk∞ and +λn kHk∞ . Also, observe
2
−d
that Var[Γj (Xi , Yi )|Xi , Yi ] = E[Γj (Xi , Yi )|Xi , Yi ] ≤ λn kHk∞ kf k∞ . Therefore by an
application of Bernstein’s Inequality (Bernstein 1946), we have
(
)
n
X
t
P (n − 1)−1 Γj (Xi , Yi ) > Xi , Yi
2
j=16=i
(
)
2
−(n − 1) 2t
≤ 2 exp
1 −d
t
2 λ−d
n kHk∞ kf k∞ + 3 λn kHk∞ 2
−(n − 1)λdn η02p+2 ψ02 2
≤ 2 exp
,
24p+12 32p−2 c21 M 2p kHk∞ [kf k∞ + ψ0 /12]
for n large enough. The last line follows by the fact that in bounding P {|b
η (Xi , Yi ) −
η02p+1
},
3p−1 c1 M p
3p+2
p−1
p
we only need to consider ∈ (0, 2 η32p+1c1 M ), because
0
b i , Yi ) is a special case of Φ(X
b i , Yi ),
|b
η (Xi , Yi ) − η ∗ (Xi , Yi )| ≤ 1. Similarly, since Ψ(X
with ∆j = 1 for j = 1, · · · , n (compare (3.22) with (3.23)), one finds
η ∗ (Xi , Yi )| >
23p+2
2p+2 2 2
ψ0 −(n−1)λd
n
o
n η0
b
2p kHk∞ [kf k∞ +ψ /12]
24p+12 32p−2 c2
M
0
1
,
P Ψ (Xi , Yi ) − Ψ (Xi , Yi ) > t ≤ 2e
42
for n large enough. Combining the above two bounds, we obtain
d
Tn,1 ≤ 8ne−(n−1)λn a4 ,
where a4 ≡ a4 () =
24p+12
2 η02p+2 ψ02
2
2p−2
3
c1 M 2p kHk∞ [kf k∞ +ψ0 /12]
(3.24)
.
Finally, to deal with the remaining term Tn,3 , notice that
n
o
n
η0
η0 o
= P ηb(Xi , Yi ) − η ∗ (Xi , Yi ) <
− η ∗ (Xi , Yi )
P ηb(Xi , Yi ) <
2
2 −η
0
≤ P ηb(Xi , Yi ) − η ∗ (Xi , Yi ) <
2
n
η0 o
∗
≤ P |b
η (Xi , Yi ) − η (Xi , Yi )| ≥
.
2
Therefore, employing the arguments that lead to the derivation of the bound on Tn,1 , we
have
d
Tn,3 ≤ 8ne−(n−1)λn a5 ,
(3.25)
η2 ψ2
0
, which does not depend on or n. Therefore, in view of
where a5 = 27 kHk∞ [kf0 k∞
+ψ0 /12]
(3.19), (3.20), (3.24), and (3.25), one finds
−
Pn,1 + Pn,3 ≤ 2e
2p
n2 η0
2p
24p+8 32p c2
1M
−
+ 8ne
−
+ 8ne
2p+2 2
2 η0
ψ0
2p kHk∞ [kf k∞ +ψ /12]
24p+12 32p−2 c2
M
0
1
2 ψ2
η0
0
27 kHk∞ [kf k∞ +ψ0 /12]
.
This completes the proof of Theorem 3.2.
3.1.2
A least-squares based estimator of the selection probability
Suppose that it is known in advance that the regression function belongs to a given (known)
class of functions. In this case the least-squares estimator is an alternative to the kernel
estimator of η ∗ (x, y). This works as follows. Let η ∗ belong to a known class of functions
M of the form η : Rd × R → [η0 , 1], where η0 = inf P (∆ = 1|X = x, Y = y) > 0, as
x,y
43
described in assumption A1. The least-squares estimator of the function η ∗ is
n
1X
ηbLS (Xi , Yi ) = argmin
(∆i − η(Xi , Yi ))2 .
η∈M n
i=1
(3.26)
Upon replacing η ∗ (Xi , Yi ) in (3.3) with the least-squares estimator in (3.26), we find the
revised regression estimator
Pn
z−Zi
∆ i Yi
K
i=1 ηbLS (Xi ,Yi )
hn
.
m
b ηbLS (z) = P
(3.27)
n
∆i
z−Zi
K
i=1 ηbLS (Xi ,Yi )
hn
To study the performance of m
b ηbLS (z), we employ results from the empirical process theory
(see, for example, van der Vaart and Wellner (1996) p.83, Pollard (1984), p.25, or Györfi,
et al. (2002) p.134). We say M is totally bounded with respect to the L1 empirical norm
if for every > 0, there exists a subclass of functions M = {M1 , · · · , MN } such
that for every η ∈ M there existsPa η † ∈ M with the property
that for the fixed points
n 1
†
(x1 , y1 ), · · · , (xn , yn ) we have n i=1 η(xi , yi ) − η (xi , yi ) < . The sub class M is
called an -cover of M. The cardinality of the smallest such cover is called the -covering
number of M and is denoted by N1 (, M, Dn ).
Theorem 3.3 Let m
b ηbLS (z) be as in (3.27), where ηbLS is the least-squares estimator of η ∗
described in (3.26), with η ∗ a member of a known class of functions M, where M is
totally bounded with respect to the empirical L1 norm. Also, suppose that condition A1
holds and let K in (3.27) be a regular kernel with smoothing parameter hn . Then, provided
that hn → 0, nhd+s
→ ∞, as n → ∞, and |Y | ≤ M < ∞, we have for every > 0 and
n
every p ∈ [1, ∞), and n large enough
Z
p
P
|m
b ηbLS (z) − m(z)| µ(dz) > ≤ 8e−nb1 + 2e−nb2 + 16 E [N1 (b3 , M, Dn )] e−nb4
+16 E [N1 (b5 , M, Dn )] e−nb6 .
Here, the constants b1 , · · · , b6 (note no attempt was made to find the optimal constants)
44
are defined as
2
b1 ≡ b1 () = min
2 η02p
η0p
,
24p+7 M 2p (1 + η0 )2p−2 (1 + c1 ) 22p+5 M p (1 + η0 )p−1 (1 + c1 )
,
η0p+1
2 η02p+2
2 η02p
,
b
≡
b
()
=
,
b
≡
b
()
=
,
3
3
4
4
26p+8 M 2p c21
23p+3 M p c1
26p+7 M 2p c21
2 η02p+2
4 η04p+4
b5 ≡ b5 () = 6p+4 2p 2 , b6 ≡ b6 () = 12p+6 4p 4 .
2
M c1
2
M c1
b2 ≡ b2 () =
Remark. Combining an application of the Borel-Cantelli lemma with the conclusion of
a.s
Theorem 3.3, yields E[|m
b ηbLS (z) − m(z)|p |Dn ] → 0, provided that
log(E [N1 (b3 ∧ b5 , M, Dn )])/n → 0
as n → ∞ .
PROOF OF THEOREM 3.3
First note that, as in the proof of Theorem 2, for every > 0,
Pn
m̃ηbLS (z) =
∆ i Yi
i=1 ηbLS (Xi ,Yi ) K
Pn
i=1 K
z−Zi
hn
z−Zi
hn
Pn
, and mηbLS (z) =
∆i
i=1 ηbLS (Xi ,Yi ) K
Pn
i=1 K
z−Zi
hn
z−Zi
hn
.(3.28)
Then, upon observing that |m̃ηbLS (z)/mηbLS (z)| ≤ M , one obtains
m̃ηbLS (z) m(z) ≤ M |mηb (z) − 1| + |m̃ηb (z) − m(z)| .
−
|m
b ηbLS (z) − m(z)| = LS
LS
mηbLS (z)
1 45
Therefore, with m̃ηbLS (z) and mηbLS (z) as in (3.28), and m̃η∗ (z) and mη∗ (z) as in (3.4), we
have for every > 0
Z
p
P
|m
b ηbLS (z) − m(z)| µ(dz) > Z
p
p−1
p
≤P 2 M
|mηbLS (z) − 1| µ(dz) >
2
Z
p
p−1
+P 2
|m̃ηbLS (z) − m(z)| µ(dz) >
2
Z
p
2p−2
p
≤P 2
M
|mηbLS (z) − mη∗ (z)| µ(dz) >
4
Z
+ P 22p−2 M p |mη∗ (z) − 1|p µ(dz) >
4
Z
p
2p−2
+P 2
|m̃ηbLS (z) − m̃η∗ (z)| µ(dz) >
4
Z
+ P 22p−2 |m̃η∗ (z) − m(z)|p µ(dz) >
4
)
(
p−1 Z
1
1
|mηbLS (z) − mη∗ (z)| µ(dz) >
≤ P 22p−2 M p + η0 η0
4
(
)
p−1 Z
2p−2
p 1
+P 2
M + 1
|mη∗ (z) − 1| µ(dz) >
η0
4
(
)
p−1 Z
M
M
+ P 22p−2 + |m̃ηbLS (z) − m̃η∗ (z)| µ(dz) >
η0
η0
4
(
)
p−1 Z
2p−2 M
+P 2
|m̃η∗ (z) − m(z)| µ(dz) >
η0 + M 4
:= Qn,1 + Qn,2 + Qn,3 + Qn,4 , (say).
As a result of Theorem 3.1, we have
Qn,2 + Qn,4 ≤ 8e−nb1 ,
46
(3.29)
where
b1 ≡ b1 () = min
2
η0p
2 η02p
,
24p+7 M 2p (1 + η0 )2p−2 (1 + c1 ) 22p+5 M p (1 + η0 )p−1 (1 + c1 )
.
To bound Qn,1 , first we note that
|mηbLS (z) − mη∗ (z)|
P
z−Zi n ∆
1
1
−
K
i=1 i ηbLS (Xi ,Yi ) η∗ (Xi ,Yi )
hn
h i
≤ nE K z−Z
hn
n
X
1
z − Zi
1
∆i
+
K
−
ηbLS (Xi , Yi ) η ∗ (Xi , Yi )
hn
i=1


1
1
−
h i 
× P
n
z−Zi
nE K z−Z
i=1 K
hn
hn
:= πn,1 (z) + πn,2 (z), (say) .
But, using the arguments that lead to (3.11), we have
Z
πn,1 (z)µ(dz) ≤ c1
!
n 1
1 X 1
.
−
n i=1 ηbLS (Xi , Yi ) η ∗ (Xi , Yi ) R
The term πn,2 (z)µ(dz) can be bounded similar to that of (3.12),
Z
1
1
− ∗
πn,2 (z)µ(dz) ≤ max 1≤i≤n η
bLS (Xi , Yi ) η (Xi , Yi ) Z Pn K z−Zi
i=1
hn
h i − 1 µ(dz) .
× nE K z−Z
h
n
47
(3.30)
(3.31)
Now, put 1 = η0p−1 /[23p−1 M p ], then combining (3.30) and (3.31), yields
Z
Qn,1 ≤ P
(πn,1 (z) + πn,2 (z)) µ(dz) > 1
(
)
n c1 X ηbLS (Xi , Yi ) − η ∗ (Xi , Yi ) 1
≤ P
>
n i=1 η ∗ (Xi , Yi )b
ηLS (Xi , Yi ) 2


Z Pn K z−Zi

i=1
hn
1
1 
1
i − 1 µ(dz) >
h +P max ∗
−
1≤i≤n η (Xi , Yi ) ηbLS (Xi , Yi ) nE K z−Z
2
hn
)
( n
2
η
1X
1 0
|b
ηLS (Xi , Yi ) − η ∗ (Xi , Yi )| >
(3.32)
≤ P
n i=1
c1
 P

i
Z ni=1 K z−Z
hn
1 η0 
i
h
(3.33)
+P
− 1 µ(dz) >
 nE K z−Z
4 
hn
To bound Qn,3 we use the arguments for the bound on Qn,1 , first observing that
|m̃ηbLS (z) − m̃η∗ (z)|
P
z−Zi n ∆Y
1
1
i=1 i i ηbLS (Xi ,Yi ) − η∗ (Xi ,Yi ) K hn i
h ≤ z−Z
nE K hn
n
X
1
1
z − Zi
+
∆i Yi
− ∗
K
η
b
η
(X
,
Y
)
hn
LS (Xi , Yi )
i
i
i=1


1
1


−
h i × P
n
z−Zi
nE K z−Z
i=1 K
h
h
n
:=
0
πn,1
(z)
+
0
πn,2
(z),
n
(say) .
But, once again the arguments that lead to (3.11), yield
Z
0
πn,1
(z)µ(dz) ≤ M c1
!
n 1 X 1
1
.
− ∗
n i=1 ηbLS (Xi , Yi ) η (Xi , Yi ) 48
(3.34)
0
πn,2
(z)µ(dz) follows as before in (3.12)
Z
1
1
0
πn,2 (z)µ(dz) ≤ M max − ∗
1≤i≤n η
bLS (Xi , Yi ) η (Xi , Yi ) Z Pn K z−Zi
i=1
hn
h i − 1 µ(dz) .
× nE K z−Z
h
The bound for the term
R
(3.35)
n
Therefore, with 1 = η0p−1 /[23p−1 M p ], the bound for Qn,3 follows similarly to that of Qn,1
Z
0
0
Qn,3 ≤ P
πn,1 (z) + πn,2 (z) µ(dz) > M 1
( n
)
2
1X
η
1
0
≤ P
|b
ηLS (Xi , Yi ) − η ∗ (Xi , Yi )| >
(3.36)
n i=1
c1
 P

i
Z ni=1 K z−Z

hn
µ(dz) > 1 η0 .
i
h
−
1
(3.37)
+P
 nE K z−Z
4 
hn
Thus, by (3.32), (3.33), (3.36), and (3.37), it follows that
( n
)
2
X
1
1 η0
Qn,1 + Qn,3 ≤ 2P
|b
ηLS (Xi , Yi ) − η ∗ (Xi , Yi )| >
n i=1
c1

 P
i
Z ni=1 K z−Z
hn
1 η0 
i − 1 µ(dz) >
h +2P
 nE K z−Z
4 
h
n
:= Un,1 + Un,2 , (say).
(3.38)
a.s.
Observe that, upon taking Yi = 1, and m(z) = 1 in lemma B.2 for i = 1, · · · , n, one
obtains
Un,2 ≤ 2e−nb2 ,
where b2 ≡ b2 () =
2 η02p
6p+8
2
M 2p c21
.
49
(3.39)
To deal with the term Un,1 , observe that
( n
1 X
Un,1 ≤ 2P |b
ηLS (Xi , Yi ) − η ∗ (Xi , Yi )|
n
i=1
1 η02
−E |b
ηLS (X, Y ) − η ∗ (X, Y )| Dn >
2c1
2
1 η0
+2P E |b
ηLS (X, Y ) − η ∗ (X, Y )| Dn >
.
2c1
(3.40)
(3.41)
η2
To handle the term (3.40), first put 2 = 214 c01 > 0 and let the class of functions G be defined
as G = {gη (x, y) = |η(x, y) − η ∗ (x, y)| | η ∈ M}. Now, observe that for any g, g † ∈ G
one has
n
n
1 X 1X
†
g(Xi , Yi ) − g (Xi , Yi ) =
||η(Xi , Yi ) − η ∗ (Xi , Yi )|
n i=1
n i=1
− η † (Xi , Yi ) − η ∗ (Xi , Yi )
n
1 X η(Xi , Yi ) − η † (Xi , Yi ) .
≤
n i=1
Therefore, it follows that if M2 = {η1 , · · · , ηN2 } is a minimal 2 -cover of M with
respect to the empirical L1 norm, the class of functions G2 = {gη1 , · · · , gηN2 } will be an
2 -cover of G. Futhermore, the 2 -covering numbers for G and M satisfy the following
N1 (2 , G, Dn ≤ N1 (2 , M, Dn ). Upon, employing standard results from the empirical
process theory (see Pollard (1984), p.25, or Theorem 9.1 Györfi et al (2002, p. 136 ), the
upper bound for (3.40) follows
n
1 X
sup |η(Xi , Yi ) − η ∗ (Xi , Yi )|
n
η∈M
i=1
(
(3.40) ≤ 2P
1 η02
−E |η(X, Y ) − η (X, Y )|| >
2c1
≤ 16 E[N1 (b3 , M, Dn )] e−nb4 ,
∗
where b3 ≡ b3 () =
η0p+1
23p+3 M p c1
and b4 ≡ b4 () =
50
2 η02p+2
.
26p+7 M 2p c21
(3.42)
To deal with (3.41), note that by the Cauchy Schwartz inequality we have
q
1 η02
2
(3.41) ≤ 2P
E |b
ηLS (X, Y ) − η ∗ (X, Y )| Dn >
2c1
2 4
1 η0
2
∗
≤ 2P E |b
ηLS (X, Y ) − η (X, Y )| Dn > 2 .
4c1
Since
i
h
E |b
ηLS (X, Y ) − ∆|2 Dn =
inf E |η(X, Y ) − ∆|2
η∈M
i
h
+ E |b
ηLS (X, Y ) − η ∗ (X, Y )|2 Dn
we find
i
i
h
h
ηLS (X, Y ) − ∆|2 Dn
E |b
ηLS (X, Y ) − η ∗ (X, Y )|2 Dn = E |b
− inf E |η(X, Y ) − ∆|2 ,
η∈M
51
(3.43)
Now, observe that
i
h
E |b
ηLS (X, Y ) − ∆|2 Dn − inf E |η(X, Y ) − ∆|2
η∈M
(
n
i 1X
h
2
= sup E |b
ηLS (X, Y ) − ∆| Dn −
|b
ηLS (Xi , Yi ) − ∆i |2
n
η∈M
i=1
n
n
1X
1X
|b
ηLS (Xi , Yi ) − ∆i |2 −
|η(Xi , Yi ) − ∆i |2
+
n i=1
n i=1
)
n
1X
2
2
+
|η(Xi , Yi ) − ∆i | − E |η(X, Y ) − ∆|
n i=1
n
i 1X
h
≤ E |b
ηLS (X, Y ) − ∆|2 Dn −
|b
ηLS (Xi , Yi ) − ∆i |2
n i=1
( n
)
1X
+ sup
|η(Xi , Yi ) − ∆i |2 − E |η(X, Y ) − ∆|2
n
η∈M
i=1
n
n
1X
1X
|b
ηLS (Xi , Yi ) − ∆i |2 −
|η(Xi , Yi ) − ∆i |2 ≤ 0 , ∀η ∈ M)
(by the fact that,
n i=1
n i=1
)
( n
1X
|η(Xi , Yi ) − ∆i |2 − E |η(X, Y ) − ∆|2
.
(3.44)
≤ 2 sup
n i=1
η∈M
To bound (3.44), let F = {fη (x, y) = |η(x, y)−∆|2 | η ∈ M} and put 3 = (21 η04 )/(25 c21 ),
then for any f, f †† ∈ F, observe that
n
n
††
2 1 X 1 X 2
††
f (Xi , Yi ) − f (Xi , Yi ) =
|η(Xi , Yi ) − ∆i | − η (Xi , Yi ) − ∆i n i=1
n i=1
n
1 X (η(Xi , Yi ) − ∆i ) − η †† (Xi , Yi ) − ∆i n i=1
× (η(Xi , Yi ) − ∆i ) + η †† (Xi , Yi ) − ∆i n
1 X ≤2
η(Xi , Yi ) − η †† (Xi , Yi ) .
n i=1
≤
In other words, if M3 /2 = {η1 , · · · , ηN3 /2 } is a minimal 3 /2-cover of M with respect to
the empirical L1 norm, then F3 = {fη1 , · · · , fηN3 } is an 3 -cover of F. Additionally one
52
has that N1 (3 , F, Dn ) ≤ N1 (3 /2, M, Dn ), thus under standard results from empirical
process theory (see, for example, Pollard (1984), p.25, or Theorem 9.1 Györfi et al (2002,
p. 136 ), one has the following bound for (3.43)
)
(
n
1 X
2 η 4
(3.43) ≤ 2P sup |η(Xi , Yi ) − ∆i |2 − E |η(X, Y ) − ∆|2 > 15 02
2 c1
η∈M n
i=1
≤ 16 E [N1 (b5 , M, (Xi , Yi )ni=1 )] e−nb6
2 η 2p+2
where b5 ≡ b5 () = 26p+40M 2p c2 and b6 ≡ b6 () =
1
(3.39), (3.42), and (3.45), we have
−
Qn,1 + Qn,3 ≤ 2e
4 η04p+4
12p+6
2
M 4p c41
(3.45)
. Therefore, in view of (3.38),
2p
n2 η0
26p+8 M 2p c2
1
+16 E N1
η0p+1
, M, Dn
23p+3 M p c1
+16 E N1
2 η02p+2
, M, Dn
26p+4 M 2p c21
−
e
2p+2
n2 η0
6p+7
2
M 2p c2
1
−
e
4p+4
n4 η0
12p+6
2
M 4p c4
1
.
This completes the proof of Theorem 3.3.
3.2
An application to classification with missing covariates
In this section we consider an application of our previous results to the problem of statistical classification. Let (Z, Y ) be an Rd+s × {1, · · · , N }-valued random vector, where Y is
the class label and is to be predicted from the explanatory variables Z = (Z1 , · · · , Zd+s ).
In classification one searches for a function φ : Rd+s → {1, · · · , N } such that the missclassification error probability L(φ) = P {φ(Z) 6= Y } is as small as possible. Let
πk (z) = P {Y = k|Z = z} , z ∈ Rd+s , 1 ≤ k ≤ N ,
then the best classifier (i.e., the one that minimizes L(φ)) is called the Bayes classifier and
is given by (see, for example, Devroye and Györfi (1985 pp. 253-254))
πB (z) = argmax πk (z) .
1≤k≤N
53
(3.46)
We note that the Bayes classifier satisfies
max πk (z) = πφB (z) (z) .
1≤k≤N
In practice, the Bayes classifier is unavailable, because the distribution of (Z, Y ) is usually
unknown. Instead, one only has access to a random sample Dn = {(Z1 , Y1 ) · · · , (Zn , Yn )}
from the distribution (Z, Y ). Now, let φbn (Z) be any sample-based classifier to predict Y
from the data Dn and Z. The error of missclassification of this estimator, conditioned on
the data, is given by
b
b
Ln (φn ) = P φn (Z) 6= Y Dn .
Then, it can be shown (see, for example, Devroye and Györfi (1985, pp. 254)) that
0 ≤ Ln (φbn ) − L(φB ) ≤
N Z
X
|b
πk (z) − πk (z)| µ(dz).
(3.47)
k=1
a.s.
Therefore, to show Ln (φbn ) − L(φB ) −→ 0 , it is sufficient to show that E[|b
πk (Z) −
a.s.
πk (Z)||Dn ] −→ 0, as n → ∞, for k = 1, · · · , N . Here, we consider the problem of
classification with missing covaraites, i.e., the case where some subsets of the Zi ’s may be
unavailable. For some relevant results along these lines see Mojirsheibani (2012). Define
Pn ∆i I{Yi =k} z−Zi i=1 η̆(Xi ,Yi ) K
hn
k = 1, · · · , N ,
π
bk (z) = P
(3.48)
n
z−Zi
∆i
i=1 η̆(Xi ,Yi ) K
hn
where η̆ is any estimator of η, and define the classifier
φbn (z) = argmax π
bk (z) .
(3.49)
1≤k≤N
The following result summarizes the performance of φbn .
Theorem 3.4 Let φbn be the classifier defined in (3.49) and (3.48).
(i) If η̆ in (3.48) is chosen to be the kernel estimator ηb defined in (3.6) then, under the
a.s.
conditions of Theorem 3.2, φbn is strongly consistent i.e., Ln (φbn ) − L(φB ) → 0, as
n → ∞.
54
(ii) If η̆ in (3.48) is chosen to be the least-squares estimator ηbLS defined in (3.26) then,
a.s.
under the conditions of Theorem 3.3, φbn is strongly consistent i.e., Ln (φbn )−L(φB ) →
0, as n → ∞.
Proof of Theorem 3.4
(i) Let η̆ be the kernel estimator ηb defined in (3.6) then by (3.47) one has for every > 0,
and n large enough
)
( N Z
X
.
P {Ln (φbn ) − L(φB ) > } ≤ P
|b
πk (z) − πk (z)| µ(dz) >
N
i=1
The proof now follows from an application of Theorem 3.2 and the Borel Cantelli
lemma.
(ii) The proof of part (ii) is similar and will not be given.
3.3
Kernel regression simulation with missing covariates.
In this section we study the performance of the estimators m
b η∗ , m
b ηbc , and m
b ηbLS . The performance of each estimator will be compared with the complete case estimator of chapter 2
(denoted as m
b Co ). To explore the performance of these estimators, consider the following
simulation. A sample of size 100 is taken from the linear model Yi = .5Vi + .9Xi + i ,
where Vi , Xi , and i are independent and identically distributed normal random variables
with means µ1 = 5, µ2 = 9, µ3 = 0 and standard deviations σ1 = 1, σ2 = 6, σ3 = 1
respectively, for i = 1, · · · , 100. In the simulation Vi is sometimes missing with missing
probability mechanism
P (∆i = 1|Xi , Yi ) =
ea|Xi |+b|Yi |
,
1 + ea|Xi |+b|Yi |
(3.50)
here ∆i = 1 if Vi is available and ∆i = 0 otherwise for i = 1, · · · , n, a and b are
chosen to produce approximately 25% or 75% missingness among the Vi ’s. The leastsquares estimator for the selection probability in the regression estimator m
b ηbLS is correctly
specified and the coefficients are estimated using the nls method in the R programing language. For the following smoothing parameters h = .2, .4, .6, and .8 we perform a series
of fifty Monte Carlo runs and test the empirical L1 and L2 errors (defined in (1.15) and
55
h
m
b Co
m
b η∗
m
b ηbc
m
b ηbLS
0.2
L1
6.439 0.450
6.426 0.449
6.180 0.407
6.409 0.471
L2
66.228 8.432
66.041 8.416
61.689 7.404
65.771 8.767
Approximately 25% missing
0.4
0.6
L1
L2
L1
L2
5.123 0.369 43.144 5.790 4.640 0.360 34.740 5.036
5.111 0.364 42.925 5.717 4.643 0.356 34.775 4.965
4.874 0.359 38.808 5.724 4.575 0.358 33.547 4.745
5.113 0.368 42.925 5.734 4.648 0.356 34.792 4.974
0.8
L1
4.511 0.341
4.530 0.336
4.508 0.335
4.540 0.335
L2
32.506 4.498
32.783 4.461
32.384 4.322
32.906 4.519
Table 3.1: Comparsion of empirical L1 and L2 errors of the estimators when Vi is missing
at random with missing mechanism as in (3.50). The missing mechanism has parameters
a = −2 and b = .1.95, the choice of parameters results in approximately 25% of the Vi ’s
missing.
h
m
b Co
m
b η∗
m
b ηbc
m
b ηbLS
0.2
L1
7.491 0.725
7.399 0.706
6.915 0.653
7.391 0.701
L2
85.428 13.452
83.779 13.071
75.163 11.928
83.623 12.984
Approximately 75% missing
0.4
0.6
L1
L2
L1
L2
5.672 0.417 52.704 7.001 4.83 0.366 38.086 5.352
5.600 0.403 51.453 6.767 4.798 0.363 37.536 5.186
5.202 0.373 44.349 6.066 4.619 0.373 34.376 4.986
5.594 0.402 51.353 6.767 4.794 0.365 37.495 5.218
0.8
L1
4.581 0.357
4.595 0.358
4.524 0.371
4.594 0.361
L2
33.799 4.761
33.931 4.768
32.810 4.832
33.901 4.788
Table 3.2: Comparsion of empirical L1 and L2 errors of the estimators when Vi is missing
at random with missing mechanism as in (3.50). The missing mechanism has parameters
a = −1 and b = .7, the choice of parameters results in approximately 75% of the Vi ’s
missing.
(1.16)) for each estimator on an independent sample of 500 complete observations. The
smoothing parameter for the kernel regression estimate in (3.7) is selected by using the
cross-validation method of Racine and Li (2004) in the R package titled “np” (Racine and
Hayfield (2008)). This method utilizes a conjugate gradient search pattern and returns a
multi-dimensional bandwidth. Conversion of the vector of bandwidth into a single constant is done by taking the geometric mean of the vector. In tables 3.1 and 3.2 one may
observe that the kernel regression estimator m
b ηbc defined in section 3.1.1 out performs each
estimator in every situation. A non-intuitive observation is that the estimator that assumes
knowledge of the missing probability mechanism i.e., m
b η∗ performs considerably worse
than the estimator m
b ηbc , which used an estimate of the missing probability mechanism.
This behavior is observed in the literature, see for example Robins and Rotnitzky (1995).
In almost all cases the complete case estimator performs worse than each of our estimators.
56
4 Kernel regression estimation with incomplete response
This chapter explores constructing kernel regression estimators in which the response variable may be missing at random. The results in this chapter mirror that of the case of missing covariates in Chapter 3. We derive exponential upper bounds for the Lp norms of our
proposed estimators, yielding almost sure convergence of the Lp statistic to 0. At the conclusion of this chapter we once again apply the results to nonparametric classification. We
construct kernel classifiers from partially labeled data.
4.1
Main results
We propose kernel estimators of an unknown regression function m(z) = E [Y |Z = z]
in the presence of missing response. Our estimator weighs the complete cases, i.e., the
observed Yi ’s, by the inverse of the missing data probabilities. First consider the simple,
but unrealistic, case where the missing probability mechanism (also called the selection
probability)
π ∗ (z) = P {∆ = 1|Z = z} = E[∆|Z = z]
(4.1)
is completely known, where ∆ = 0 if Y is missing (and ∆ = 1, otherwise), and define the
revised kernel estimator
n
X
z − Zi
∆i Yi
K
π ∗ (Zi )
h
i=1
.
(4.2)
m
b π∗ (z) =
n
X
z − Zi
K
h
i=1
As for the usefulness of m
b π∗ (z) as an estimator of m(z), first observe that (4.2) can be
viewed as the kernel estimator of E[ π∆Y
∗ (Z) |Z = z]. Furthermore, when the MAR assumption (2.3) holds, one finds
∆Y ∆Y a.s.
E ∗
=
E E ∗
Z
Z, Y Z
π (Z)
π (Z)
Y
=
E ∗
E [∆|Z, Y ] Z
π (Z)
Y
by (2.3)
∗
=
=E ∗
π (Z)Z = E [Y |Z] := m(Z).
π (Z)
57
In other words, (4.2) can be viewed as a kernel regression estimator of the regression
function E[Y |Z = z] whenever the function π ∗ in (4.1) is completely known. How good
is the estimator m
b π∗ (z)? To answer this question, and in what follows, we shall assume that
the selected kernel K is regular: as described earlier in (1.4.3) and we state the following
condition.
Condition B1. inf P {∆ = 1|Z = z} := π0 > 0, where π0 can be arbitrarily small.
z
Condition B1 guarantees, in a sense, that Y will be observed with a nonzero probability
when Z = z, for all z. The following result is an immediate consequence of the main
theorem of Devroye and Krzyz̀ak (1989).
Theorem 4.1 Let m
b π∗ (z) be the kernel estimator defined in (4.2), where K is a regular
kernel, and suppose that condition B1 holds. If h → 0 and nhd → ∞, as n → ∞, then
for any distribution of (Z, Y ) satisfying |Y | ≤ M < ∞, and for every > 0 and n large
enough
Z
|m
b π∗ (z) − m(z)| µ(dz) > P
2
where a ≡ a() = min
the kernel K only.
π02 2
128M 2 (1+c1 )
,
π0 32M (1+c1 )
≤ e−an ,
. Here c1 is a constant that depends on
R
Remarks. The above result continues to hold for |m
b π∗ (z) − m(z)|p µ(dz), for all 1 ≤
p < ∞, with a() replaced by a(˜) where ˜ = /(M (π0−1 + 1))p−1 . To appreciate this,
simply note that for all z ∈ Rd and p ∈ [1, ∞)
|m
b π∗ (z) − m(z)|p ≤ (|m
b π∗ (z) | + |m(z)|)p−1 |m
b π∗ (z) − m(z)|
p−1
M
+M
|m
b π∗ (z) − m(z)|.
≤
π0
Therefore, under the conditions of Theorem 4.1, and in view of the Borel-Cantelli lemma,
a.s.
one finds E [|m
b π∗ (Z) − m(Z)|p |Dn ] −→ 0, as n → ∞.
Of course the estimator m
b π∗ defined by (4.2) is useful only if the missing probability mechanism π ∗ (z) = P {∆ = 1|Z = z} is known. If this is not the case, π ∗ (z) must be replaced
by some estimator in (4.2). In what follows, we consider kernel as well as least-squares
58
regression type estimators of π ∗ (z).
4.1.1
A kernel-based estimator of the selection probability
Our first estimator of π ∗ (Zi ) = E(∆i |Zi ) is defined by
n
X
Zi − Zj
∆j H
an
j=1,6=i
π
b(Zi ) =
n
X
Zi − Zj
H
an
j=1,6=i
(4.3)
with the convention 0/0 = 0, where H : Rd → R+ is the kernel used with the smoothing
parameter an (→ 0, as n → ∞). The kernel regression estimator in (4.3) is a weighted
average of the ∆j ’s, this provides an estimate for the probability that the ith observation
of the response will be unavailable. Our first proposed kernel-type regression estimator of
m(z) = E [Y |Z = z] is given by
n
X
z − Zi
∆i Yi
K
π
b
(Z
)
h
i
.
m
b πb (z) = i=1 n
X
z − Zi
K
h
i=1
(4.4)
To investigate the performance of the regression estimator (4.4) we first state the following
additional conditions.
R
R
Condition B2. The kernel H in (4.3) satisfies Rd H(u)du = 1 and Rd |ui |H(u)du < ∞,
for i = 1, · · · , d, where u = (u1 , · · · , ud )0 . Furthermore, the smoothing parameter an
satisfies an → 0 and nadn → ∞, as n → ∞.
Condition B3. The random vector Z has a compactly supported probability density function, f (z), where f is bounded away from zero on its compact support. Additionally, both
f and its first-order partial derivatives are uniformly bounded.
Condition B4. The partial derivatives ∂z∂ i E[∆|Z = z], exist for i = 1, · · · , d and are
bounded uniformly, in z, on the compact support of f .
59
Condition B2 is not restrictive since the choice of the kernel H is at our discretion. Condition B3 is usually imposed in nonparametric regression in order to avoid having unstable
estimates (in the tail of the p.d.f of Z). Condition B4 is technical and has already been
used in the literature; see, for example, Cheng and Chu (1996).
Theorem 4.2 Let m
b πb (z) be as in (4.4), where K is a regular kernel. Suppose that |Y | ≤
M < ∞. If h → 0 and nhd → ∞, as n → ∞, then under conditions B1, B2, B3, and B4,
for every > 0 and every p ∈ [1, ∞), and n large enough,
Z
d 2
2
d
p
P
|m
b πb (z) − m(z)| µ(dz) > ≤ 4e−bn + 4ne−C1 nan + e−C2 n + 4ne−C3 nan ,
where b ≡ b() = min2
n
π02 2
9
2
2 M (1+c1 )[2(M π0−1 +M )]2p−2
,
π0 26 M (1+c1 )[2(M π0−1 +M )]p−1
o
, and
π04 f02
,
22p (3M πo−1 )2p−2 (32M c1 )2 kHk∞ (2kf k∞ + f0 /12)
π02
=
,
[2(3M π0−1 )]2p−2 (64M 2 c1 )2
π02 f02
.
=
256kHk∞ (2kf k∞ + π0 f0 /24)
C1 =
C2
C3
Remark. We note that as an immediate consequence of Theorem 4.2, if log n/(nadn ) → 0,
a.s.
as n → ∞, then in view of the Borel-Cantelli lemma, one has E [|m
b πb (Z) − m(Z)|p |Dn ] −→
0.
PROOF OF THEOREM 4.2
60
First note that
Z
Z
p
p−1
p
P
|m
b πb (z) − m(z)| µ(dz) > ≤ P
2 |m
b πb (z) − m
b π∗ (z)| µ(dz) >
2
Z
p−1
p
+P
2 |m
b π∗ (z) − m(z)| µ(dz) >
2
:= Pn,1 + Pn,2 , (say) .
(4.5)
But, by Theorem 4.1 (and the remarks that follow Theorem 4.1),
Z
p−1
−1
Pn,2 ≤ P
2(M π0 + M )
|m
b π∗ (z) − m(z)| >
≤ 4e−bn ,
2
for n large enough, with
π02 2
π0 2
b ≡ b() = min
,
.
29 M 2 (1 + c1 )[2(M π0−1 + M )]2p−2 26 M (1 + c1 )[2(M π0−1 + M )]p−1
To deal with the term Pn,1 , first note that
P z−Zi n
1
1
−
∆
Y
K
i i
i=1 πb(Zi ) π∗ (Zi )
h
|m
b πb (z) − m
b π∗ (z)| = Pn
z−Zi
K
i=1
h
P z−Zi n
1
1
−
∆
Y
K
i i
i=1 πb(Zi ) π∗ (Zi )
h
≤ z−Z
nE K h
!
n X
1
1
z
−
Z
i
+
− ∗
∆i Yi K
π
b
(Z
)
π
(Z
)
h
i
i
i=1
!
1
1
−
×
P
n
z−Zi
nE K z−Z
i=1 K
h
h
=: Un,1 (z) + Un,2 (z) , (say).
61
(4.6)
Therefore, by the definition of m
b πb (z) in (4.4) and the fact that |Y | ≤ M ,
p
|m
b πb (z) − m
b π∗ (z)| ≤
M
M
+
n
∧i=1 π
b(Zi ) π0
p−1
|Un,1 (z)| + |Un,2 (z)| .
(4.7)
But
Pn 1
z−Zi
1
Z Z
−
|∆
Y
|
K
i
i
∗
i=1 π
b(Zi )
π (Zi )
h
µ(dz)
Un,1 (z)µ(dz) ≤
z−Z
nE[K h ]
Z
n
z−Zi
1
K
1X
1
h =
−
µ(dz)
|∆i Yi | n i=1
π
b(Zi ) π ∗ (Zi ) E[K z−Z
]
h
Z
n
X
K z−u
1
1
1
h ≤ sup
µ(dz)
×
−
|∆
Y
|
i
i
π
∗ (Z ) n
b
(Z
)
π
E[K z−Z
]
u
i
i
h
i=1
n
M c1 X 1
1 ≤
− ∗
,
(4.8)
b(Zi ) π (Zi ) n i=1 π
where the last line follows from Lemma B.1 and the fact that |∆i Yi | ≤ M , for all i =
1, · · · , n; here c1 is a finite positive constant depending on the kernel K only. Now, for
62
every > 0,
(Z
)
p−1 M
M
P
2p−1
+
Un,1 (z) µ(dz) >
∧ni=1 π
b(Zi ) π0
4
)
(
p−1
n X
1
M
M
1
1
>
+
×
−
≤ P 2p−1
∧ni=1 π
b(Zi ) π0
n i=1 π
b(Zi ) π ∗ (Zi ) 4M c1
("
#
p−1
n ∗
X
π
b
(Z
)
−
π
(Z
)
M
M
1
i
i ≤ P
2p−1
+
>
×
∧ni=1 π
b(Zi ) π0
n i=1 π
b(Zi )π ∗ (Zi ) 4M c1
n h
\
π0 i
∩
π
b(Zi ) ≥
2
i=1
(
)
)
π0 i
π
b(Zi ) <
+P
2
i=1
(
)
p−1 n
X
3M
2
≤
P 2p−1
b(Zi ) − π ∗ (Zi ) >
π
2
π
π
4M c1
0
0
i=1
n
n
X
π0 o
+
P π
b(Zi ) <
.
2
i=1
n h
[
As for the second term in (4.6), observe that since |∆i Yi | ≤ M , for all i = 1, · · · , n, one
finds
Pn
z−Zi
K
1
1
h
i=1
−
−
1
Un,2 (z) ≤ M max .
1≤i≤n π
b(Zi ) π(Zi ) nE[K z−Z
]
h
63
Now, for every > 0,
(Z
)
p−1 M
M
P
2p−1
+
Un,2 (z)µ(dz) >
∧ni=1 π
b(Zi ) π0
4
# n
)
("
p−1 Z h
i
\
M
M
π
0
+
∩
π
b(Zi ) ≥
≤ P
2p−1
Un,2 (z)µ(dz) >
∧ni=1 π
b(Zi ) π0
4
2
i=1
(n
)
i
[h
π0
+P
π
b(Zi ) <
2
P
)
( i=1
p−1 Z
n K z−Zi 2M
3M
i=1
h − 1 µ(dz) >
≤ P 2p−1
π0
π0 nE K z−Z
4
h
n
n
X
π0 o
+
P π
b(Zi ) <
.
2
i=1
Therefore, in view of (4.7), the term Pn,1 in (4.5) can be upper-bounded as follows
Pn,1
≤
n
X
π02 P
8M c1 [2(3M π0−1 )]p−1
i=1
(Z Pn
)
z−Z
π0 i=1 K( h i )
− 1 µ(dz) >
+P
nE[K( z−Z
8M [2(3M π0−1 )]p−1
)]
h
n
n
X
π0 o
+2
P π
b(Zi ) <
2
i=1
π
b(Zi ) − π ∗ (Zi ) >
:= ∆n,1 + ∆n,2 + ∆n,3 , (say).
First observe that by Lemma B.2,
∆n,2 ≤ exp
−nπ02 2
(64M 2 c1 )2 [2(3M π0−1 )]2p−2
,
for n large enough; (this is a special case of Lemma B.2 with Yi = 1 for i = 1, · · · , n, and
m(z) = 1). To bound the term ∆n,1 we start by defining
ψ(Z) = π ∗ (Z) · f (Z)
64
(4.9)
b i) =
ψ(Z
n
X
1
Zi − Zj
∆j H
(n − 1)adn j=1,6=i
an
n
X
1
Zi − Zj
H
(n − 1)adn j=1,6=i
an
b
b
where an and the kernel H are as in (4.3). Since ψ(Zi )/f (Zi ) ≤ 1, one finds
fb(Zi ) =
(4.10)
(4.11)
ψ(Z
−ψ(Z
b i ) ψ(Zi ) b i ) fb(Zi ) − f (Zi ) ψ(Z
b i ) − ψ(Zi ) |b
π (Zi ) − π ∗ (Zi )| = +
−
·
= fb(Zi )
fb(Zi )
f (Zi ) f (Zi )
f (Zi )
1 b
1 b
≤
f (Zi ) − f (Zi ) +
ψ(Zi ) − ψ(Zi )
f (Zi )
f (Zi )
Therefore,
∆n,1
n
X
1 b
π02 ≤
P
ψ(Zi ) − ψ(Zi ) >
f
(Z
)
16M c1 [2(3M π0−1 )]p−1
i
i=1
n
X
π02 1 b
+
P
.
f (Zi ) − f (Zi ) >
f (Zi )
16M c1 [2(3M π0−1 )]p−1
i=1
65
Now, observe that
 
b i ) − ψ(Zi )
 ψ(Z

2
π0 P
>

f (Zi )
16M c1 [2(3M π0−1 )]p−1 
i
i
h
h
b
b
b
≤ P ψ(Z
)
−
E
ψ(Z
)
Z
+
E
ψ(Z
)
Z
−
ψ(Z
)
i
i i
i i
i >
π02 f0 16M c1 [2(3M π0−1 )]p−1
(where f0 = inf f (z) > 0, by condition B3)
z
i
h
π02 f0 π02 f0 b
b
>
≤ P ψ(Zi ) − E ψ(Zi )Zi +
32M c1 [2(3M π0−1 )]p−1
16M c1 [2(3M π0−1 )]p−1
(for n large enough, by Lemma B.4)
)#
" (
i
h
2
π
f
b
0
0
b i )Zi >
= E P ψ(Zi ) − E ψ(Z
−1 p−1 Zi
32M c1 [2(3M π0 )]
4 2 2
−(n−1)ad
n π0 f0 2 f /(96M c [2(3M π −1 )]p−1 ) [2(3M π −1 )]2p−2
2(32M c1 )2 kHk∞ 2kf k∞ +a2d
π
n 0 0
1
0
0
[
≤ 2e
(by Lemma B.5)
≤ 2e
]
4 2 2
−(n−1)ad
n π0 f 0 −1 )]2p−2
2(32M c1 )2 kHk∞ [2kf k∞ +f0 /12][2(3M π0
,
for n large enough, where we have also used the fact that in bounding P {|b
π (Zi )−π ∗ (Zi )| >
−1
π02 /(8M c1 [2(3M π0 )]p−1 )} one only needs to consider 0 < < 8M c1 [2(3M π0−1 )]p−1 /π02
b with ∆j = 1 for
(because |b
π (Zi ) − π ∗ (Zi )| ≤ 1). Similarly, since fb is a special case of ψ,
all j’s (compare (4.10) and (4.11)), one finds

 
 fb(Zi ) − f (Zi )
2
π0 P
>

f (Zi )
16M c1 [2(3M π0−1 )]p−1 
−(n − 1)adn π04 f02 2
,
≤ 2 exp
2(32M c1 )2 kHk∞ [2kf k∞ + f0 /12] [2(3M π0−1 ]2p−2
for n large enough. Putting all the above together, we find
d
2
∆n,1 ≤ 4ne−(n−1)an C10 ,
66
where
C10 = π04 f02 / 2(32M c1 )2 kHk∞ [2kf k∞ + f0 /12] [2(3M π0−1 )]2p−2
Finally, since P {b
π (Zi ) < π0 /2} ≤ P {|b
π (Zi ) − π ∗ (Zi )| > π0 /2}, the arguments that lead
to the derivation of the bound on ∆n,1 yield
n
X
n
π0 o
P |b
π (Zi ) − π ∗ (Zi )| >
2
i=1
−(n − 1)adn π02 f02
≤ 4n exp
128kHk∞ [2kf k∞ + a2d
n π0 f0 /24]
∆n,3 ≤ 2
d
≤ 4n e−(n−1)an C11 , (for n large enough),
where C11 = π02 f02 /(128kHk∞ [2kf k∞ + π0 f0 /24]) does not depend on n. This completes
the proof of Theorem 4.2.
4.1.2
Least-squares-based estimator of the selection probability
As in chapter 3 if we have prior knowledge about the behavior of the missing probability
mechanism we can use least-squares methods. As a result, our second estimator of π ∗ (z)
is based on a least-squares approach and works as follows. Suppose that π ∗ belongs to a
known class of functions P of the form π : Rd → [π0 , 1], where π0 = inf z π(z) > 0 is as
before (see assumption B1) . Then the least-squares estimator of the function π ∗ is
n
π
bLS = argmin
π∈P
1X
(∆i − π(Zi ))2 .
n i=1
(4.12)
Replacing π ∗ by π
bLS in (4.2), one obtains
n
X
∆i Yi
z − Zi
K
π
b
h
LS (Zi )
m
b πbLS (z) = i=1 n
.
X z − Zi K
h
i=1
(4.13)
To study the performance of the regression function estimator in (4.13), we assume our
class of functions is bounded with respect to the supnorm. Recall from chapter 1 that a
class of functions P is said to be totally bounded with respect to the supnorm if for every
67
> 0, there is a subclass P = {π1 , · · · , πN() } such that for every π ∈ P, there is a
π̃ ∈ P satisfying kπ − π̃k∞ < . The class P is called an -cover of P. The cardinality
of the smallest such -cover is denoted by N∞ (, P).
Theorem 4.3 Let m
b πbLS (z) be as in (4.13), where π
bLS is the least-squares estimator of π ∗ ∈
P, and where P is totally bounded with respect to the supnorm. Let K be a regular kernel
in (4.13) and suppose that condition B1 holds. If h → 0 and nhd → ∞, as n → ∞, then
for any distribution of (Z, Y ) satisfying |Y | ≤ M < ∞, and for every > 0 and every
p ∈ [1, ∞), and n large enough
Z
2
P
|m
b πbLS (z) − m(z)|µ(dz) > ≤ 4e−bn + 2N∞ (C4 , P) e−C5 n
4
2
+ 2N∞ C6 2 , P e−C7 n + e−C8 n
where b is as in Theorem 4.2, and
π02
,
32M c1 [2(M π0−1 +M )]p−1
4
π
C6 = 1024M 2 c [2(M0π−1 +M )]2p−2
1
0
π2
C8 = 1024M 4 c [2(M0π−1 +M )]2p−2 ,
1
0
C4 =
C5 =
C7 =
π04
−1
2p−2
1 [2(M π0 +M )]
π08
215 M 4 c1 [2(M π0−1 +M )]4p−4
128M 2 c
where c1 is a constant that depends on the kernel K only.
Theorem 4.3 in conjunction with the Borel-Cantelli lemma yields
a.s.
E [|m
b πbLS (Z) − m(Z)|p |Dn ] −→ 0,
as n → ∞.
Remark. The upper bound found in Theorem 4.3 is meaningful when the class P is totally bounded with respect to the supnorm. See Section 1.3.2 for examples of classes of
functions totally bounded with respect to the supnorm.
PROOF OF THEOREM 4.3
First note that for the least-squares estimator π
bLS we have |m
b πbLS (z)| ≤
68
M
bLS (Zi )
∧n
i=1 π
≤
M
.
π0
Therefore, employing the arguments used in the proof of Theorem 4.2, we have
Z
p
P
|m
b πbLS (z) − m(z)| µ(dz) > ≤ qn,1 + qn,2 ,
where
Z
qn,1 = P
[2(M π0−1
b π∗ (z) − m(z)µ(dz) >
≤ 4e−bn , (for large n),
m
2
p−1 + M )]
where b is as in Theorem 4.2, and where
Z
−1
p−1 qn,2 = P
.
[2(M π0 + M )] m
b πbLS (z) − m
b π∗ (z)µ(dz) >
2
(4.14)
To upper-bound qn,2 first note that
|m
b πbLS (z) − m
b π∗ (z)| ≤
Pn 1
−
i=1 π
bLS (Zi )
1
|∆i Yi | K
z−Z
π ∗ (Zi )
nE K
z−Zi
h
h
n X
1
1
−
(∆i Yi )
+
π
bLS (Zi ) π ∗ (Zi )
i=1


z−Zi
z−Zi
K h
K h

− P
×
z−Z
n
z−Zi
nE K h
i=1 K
h
:= |Sn,1 (z)| + |Sn,2 (z)| , (say).
Furthermore, as in (4.8),
Z
n M c1 X 1
1 |Sn,1 (z)| µ(dz) ≤
−
,
n i=1 π
bLS (Zi ) π ∗ (Zi ) 69
(4.15)
which yields, for every t > 0,
Z
P
|Sn,1 (z)| µ(dz) > t
( n )
X
1
1 t
1
≤P
− ∗
>
n
π
bLS (Zi ) π (Zi )
M c1
( i=1
)
n
1
1 X t
|b
= P
πLS (Zi ) − π ∗ (Zi )| >
∗
n
π
bLS (Zi )π (Zi )
M c1
)
( i=1
n
π02 t
1X
∗
|b
πLS (Zi ) − π (Zi )| >
≤ P
n i=1
M c1
(since B1 yields inf π
bLS (z) ≥ inf π ∗ (z) = π0 > 0, and the definition of class P)
z
z
( n
)
1 X
i
h
2
t
π
π
≤P bLS (Z) − π ∗ (Z)Dn > 0
bLS (Zi ) − π ∗ (Zi ) − E π
(4.16)
n
2M c1
i=1
h
i
π02 t
∗
bLS (Z) − π (Z) Dn >
+P E π
.
(4.17)
2M c1
To deal with the term (4.16), put 0 = π02 t/[8M c1 ] and let P0 be a minimal 0 -cover of
P. In other words, for every π ∈ P, there is a π̃ ∈ P0 such that kπ − π̃k∞ ≤ 0 . Let
g(z) = |π(z) − π ∗ (z)| and g̃(z) = |π̃(z) − π ∗ (z)| and note that
70
n
1 X
π(Zi ) − π ∗ (Zi ) − E π(Z) − π ∗ (Z)
n
i=1
n
n X
X
1
∗
∗
≤ π(Zi ) − π (Zi ) −
π̃(Zi ) − π (Zi )
n i=1
i=1
+ E π̃(Z) − π ∗ (Z) − E π(Z) − π ∗ (Z)
n 1 X
+
π̃(Zi ) − π ∗ (Zi ) − E π̃(Z) − π ∗ (Z)
n
i=1
n
1 X
≤ 2 ||g − g̃k∞ + π̃(Zi ) − π ∗ (Zi ) − E π̃(Z) − π ∗ (Z)
n
i=1
n 1 X
π02 t
≤
+
π̃(Zi ) − π ∗ (Zi ) − E π̃(Z) − π ∗ (Z) ,
4M c1 n
i=1
where the last line follows from the fact that
|g(z) − g̃(z)| = ||π(z) − π ∗ (z)| − |π̃(z) − π ∗ (z)|| ≤ |π(z) − π̃(z)| ≤ kπ − π̃k∞ ≤
π02 t
.
8M c1
Therefore with 0 = π02 t/[8M c1 ],
(
)
n 1 X
2
π t
(4.16) ≤ P sup π(Zi ) − π ∗ (Zi ) − E π(Z) − π ∗ (Z) > 0
2M c1
π∈P n i=1
(
)
n 1 X
2
2
π t
π t
≤ P sup π(Zi ) − π ∗ (Zi ) − E π(Z) − π ∗ (Z) + 0 > 0
4M c1
2M c1
π∈P0 n i=1
π2t
0
≤ N∞
,P
8M c1
(
)
n 1X
2
π
t
× sup P π(Zi ) − π ∗ (Zi ) − E π(Z) − π ∗ (Z) > 0
4M c1
n
π∈P0
2
i=1 4 2 π0 t
−π0 t n
≤ 2 N∞
, P · exp
,
(4.18)
8M c1
8M 2 c1
71
where the exponential bound in (4.18) follows from an application of Hoeffding’s (1963)
inequality. Next, to bound (4.17), we first note that
)
(s 2 2
π
t
(4.17) ≤ P
E π
bLS (Z) − π ∗ (Z) Dn > 0
2M c1
(by Cauchy-Schwartz inequality)
2 π04 t2
∗
= P E π
bLS (Z) − π (Z) Dn >
4M 2 c1
4 2
2 t
π
0
≤ P sup νn (π) − E π(Z) − ∆ >
8M 2 c1
π∈P
n
X
−1
(π(Zi ) − ∆i )2 ).
(by Lemma B.5, where νn (π) = n
(4.19)
i=1
Now, put
00 = π04 t2 /(64M 2 c1 )
and let P00 be a minimal 00 -cover of P with respect to the supnorm. Therefore, given
π ∈ P, there is a π̃ ∈ P00 such that kπ − π̃k∞ < 00 . Since
|π(Z) − ∆|2 − |π̃(Z) − ∆|2 ≤ |(π(Z) − ∆) + (π̃(Z) − ∆)|
× |(π(Z) − ∆) − (π̃(Z) − ∆)|
≤ 2 |π(Z) − π̃(Z)| ≤ 2kπ − π̃k∞ ≤ 200 ,
one finds
n
n
X
X
1
2
2
2
ν
(π)
−
E
π(Z)
−
∆
≤
π(Z
)
−
∆
π̃(Z
)
−
∆
−
n
i
i
i
i n i=1
i=1
2
2 + E π(Z) − ∆ − E π̃(Z) − ∆ n
1 X
2
2 π̃(Zi ) − ∆i − E π̃(Z) − ∆ +
n
i=1
n
1 X
2
2
00
≤ 4 + π̃(Zi ) − ∆i − E π̃(Z) − ∆ .
n
i=1
72
Therefore, with 00 = π04 t2 /(64M 2 c1 ),
(
)
n
1 X
4 2
π
t
0
π̃(Zi ) − ∆i 2 − E π̃(Z) − ∆2 >
(4.19) ≤ P
sup 16M 2 c1
π∈P00 n
4 2 i=1 π0 t
≤ N∞
,P
64M 2 c1
)
( n
1 X
4 2
t
π
2
2
0
π̃(Zi ) − ∆i − E π̃(Z) − ∆ >
× sup P 16M 2 c1
n
π∈P00
i=1
42
π0 t
−π08 t4 n
≤ 2N∞
, P · exp
,
(4.20)
64M 2 c1
128M 4 c1
where the last line follows by an application of Hoeffding’s inequality.
To deal with the term |Sn,2 (z)| in (4.15), first note that
Pn
i=1 K
1
1
− ∗
|Sn,2 (z)| ≤ M · max 1≤i≤n π
bLS (Zi ) π (Zi ) nE K
P
n
i
M i=1 K z−Z
h ≤
−
1
.
π0 nE K z−Z
h
z−Zi
h z−Z
h
− 1
Now, in view of Lemma 3 with Y = 1 (and thus m(z) = 1, for all z), we find that for
every constant t > 0,
Z
Z
∗
tπ
0
mn (z) − 1µ(dz) >
P
|Sn,2 (z)| µ(dz) > t
≤ P
M
2 2
−nπ0 t
≤ exp
(4.21)
64M 4 c1
Pn
z−Zi
z−Z
for n large enough, where m∗n (z) =
i=1 K( h )/(nE[K( h )]) is a special case of
m∗n (z) of Lemma B.2 with Y, Y1 , · · · , Yn = 1, (i.e., here m∗n (z) is the kernel regression
estimator of m(z) = 1).
73
Therefore, in view of (4.14), (4.15), (4.18), (4.20), and (4.21), we have
Z
−1
p−1 qn,2 ≤ P
[2(M π0 + M )] Sn,1 (z)µ(dz) >
4
Z
−1
p−1 +P
[2(M π0 + M )] Sn,2 (z)µ(dz) >
4
4 n2
π0
2
−
π0 −1
2p−2
2
≤ 2 N∞
, P e 128M c1 [2(M π0 +M )]
−1
p−1
32M c1 [2(M π0 + M )]
8 n4
π0
−
π04 2
−1
4p−4
15
4
, P e 2 M c1 [2(M π0 +M )]
+ 2 N∞
−1
2
2p−2
1024M c1 [2(M π0 + M )]
+e
−
2 n2
π0
−1 +M )]2p−2
1024M 4 c1 [2(M π0
.
This completes the proof of Theorem 4.3.
4.2
Applications to classification with partially labeled data
Here we consider an application of the results developed in the previous sections to the
problem of statistical classification. Let (Z, Y ) be an Rd × {1, · · · , M }-valued random
pair. In classification one has to predict the integer-valued label Y based on the covariate
vector Z. More formally, one seeks to find a function (a classifier) φ : Rd → {1, · · · , M }
for which the probability of missclassification error (incorrect prediction), i.e., P {φ(Z) 6=
Y } is as small as possible. Let
pk (z) = P Y = k Z = z , z ∈ Rd , 1 ≤ k ≤ M .
It is straightforward to show that the best classifier (i.e., the one with the lowest probability
of error) is given by
φB (z) = argmax pk (z) ,
(4.22)
1≤k≤M
i.e, the best classifier φB satisfies
max pk (z) = pφB (z) (z);
1≤k≤M
see, for example, Devroye and Györfi (1985, pp. 253-254). Since φB is almost always un74
known, one uses the data to construct estimates of φB . Let Dn = {(Z1 , Y1 ), · · · , (Zn , Yn )}
be a random sample (the data) from the distribution of (Z, Y ), where each (Zi , Yi ) is fully
observable. Let φbn be any sample-based classifier. In other words, φbn (Z) is the predicted
value of Y , based on Dn and Z. Let
o
n
b
b
Ln (φn ) = P φn (Z) 6= Y Dn ,
be the conditional probability of error of the sample-based classifier φbn . Then φbn is said
a.s.
to be strongly consistent if Ln (φbn ) −→ L (φB ) := P {φB (Z) 6= Y }, as n → ∞.
For k = 1, · · · , M , let pbk (z) be any sample-based estimators of pk (z) = P {Y = k|Z = z}
and define the classification rule φbn by
φbn (z) = argmax pbk (z) .
1≤k≤M
In other words, φbn satisfies
max pbk (z) = pbφbn (z) (z) .
1≤k≤M
Then, it can be shown (see, for example, Devroye and Györfi (1985, pp. 254)) that
0 ≤ Ln (φbn ) − L(φB ) ≤
M Z
X
|b
pk (z) − pk (z)| µ(dz).
(4.23)
k=1
a.s.
Therefore, to show Ln (φbn ) − L(φB ) −→ 0 , it is sufficient to show that E[|b
pk (Z) −
a.s.
pk (Z)||Dn ] −→ 0, as n → ∞, for k = 1, · · · , M . Now, consider the case of a partially
labeled data, where some of the Yi ’s (the labels) may be unavailable in Dn . Let
Pn ∆i I{Yi =k}
z−Zi
K
i=1
π̆(Z )
h
, k = 1, · · · , M,
(4.24)
pbk (z) =
Pn i z−Zi i=1 K
h
where π̆ is taken to be either π
b in (4.3) or π
bLS as in (4.12), and where ∆i = 1 if Yi is
available (∆i = 0 if Yi is missing). With pbk as in (4.24), define the classifier
φbn (z) = argmax pbk (z) .
1≤k≤M
75
(4.25)
How good is φbn in (4.25) as an estimator of the optimal classifier φB in (4.22)? The answer
is given by the following result.
Theorem 4.4 Let φbn be the classifier defined via (4.25) and (4.24).
(i) If π̆ in (4.24) is chosen to be the kernel estimator π
b defined in (4.3) then, under the
a.s.
b
conditions of Theorem 4.2, φn is strongly consistent i.e., Ln (φbn ) − L(φB ) −→ 0.
(ii) If π̆ in (4.24) is chosen to be the least-squares estimator π
bLS defined in (4.12) then,
b
under the conditions of Theorem 4.3, φn is strongly consistent.
PROOF OF THEOREM 4.4
(i) Let π̆ be the kernel regression estimator π
b defined in (4.3) and observe that by (4.23),
for every > 0,
Z
M
n
o
X
b
.
P Ln (φn ) − L(φB ) > ≤
P
|b
pk (z) − pk (z)| µ(dz) >
M
k=1
The proof now follows from an application of Theorem 4.2 in conjunction with the BorelCantelli lemma.
(ii) The proof is similar to that of part (i).
76
Conclusion-Future work
In chapter 3, kernel regression estimators were constructed for the situation in which a
subset of the covariate vector may be missing at random. In this case we are implicitly
assuming that the missing pattern follows the multivariate two pattern problem discussed
in Chapter 2. In practice the data can consist of multiple missing patterns, not just multivariate two pattern. To handle the case of multiple missing patterns, modifications to the
estimators of chapter 3 are necessary.
The primary focus of the thesis has been on kernel regression estimators in the presence of
incomplete data. The partition regression estimator and k-NN estimator of chapter 1 could
potentially yield similar results as that of the kernel regression estimators constructed in
chapters 3 and 4. These estimators in chapters 3 and 4 also dealt with the problem of
missing data by using inverse weighting techniques. One weakness of this method is that
it is still a function of only the complete cases, resulting in a smaller random sample size.
Constructing kernel regression estimators that combine the methods of imputation and
inverse weights may result in estimators that perform more efficiently.
77
References
[1] Bernstein, S. (1946). The Theory of Probabilities. Gastehizdat Publishing House, Moscow.
[2] Cheng, P.E. and Chu, C.K. (1996). Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica, 6, 63-78.
[3] Davis, J. and Sorensen, J.R. (2013). Using base rates and correlational data to supplement clinical risk assessments. Journal of the American Academy of Psychiatry and the Law, 41(3), 391-400.
[4] Devroye, L. (1981). On the almost everywhere convergence of nonparametric regression function estimates. The Annals of Statistics, 9, 1310-1319.
[5] Devroye, L. and Györfi, L. (1985). Nonparametric Density Estimation: The L1 View. Wiley, New York.
[6] Devroye, L., Györfi, L., Krzyżak, A., and Lugosi, G. (1994). On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, 1371-1385.
[7] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York.
[8] Devroye, L. and Krzyżak, A. (1989). An equivalence theorem for L1 convergence of the kernel regression estimate. Journal of Statistical Planning and Inference, 23, 71-82.
[9] Devroye, L. and Wagner, T. (1980). On the L1 convergence of kernel estimators of regression functions with applications in discrimination. Z. Wahrsch. Verw. Gebiete, 51, 15-25.
[10] Efromovich, S. (2011). Nonparametric regression with predictors missing at random. Journal of the American Statistical Association, 106, 306-319.
[11] Guntuboyina, A. and Sen, B. (2013). Covering numbers for convex functions. IEEE Transactions on Information Theory, 59(4), 1957-1965.
[12] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer-Verlag, New York.
[13] Hirano, K.I. and Ridder, G. (2003). Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161-1189.
[14] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.
[15] Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663-685.
[16] Kohler, M., Krzyżak, A., and Walk, H. (2003). Strong consistency of automatic kernel regression estimates. Ann. Inst. Statist. Math., 55, 287-308.
[17] Kolmogorov, A.N. and Tikhomirov, V.M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Mat. Nauk, 14, 3-86.
[18] Krzyżak, A. and Pawlak, M. (1984). Distribution-free consistency of a nonparametric kernel regression estimate and classification. IEEE Trans. Inform. Theory, 30, 78-81.
[19] Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Wiley, New York.
[20] Mojirsheibani, M. (2007). Nonparametric curve estimation with missing data: a general empirical process approach. Journal of Statistical Planning and Inference, 137, 2733-2758.
[21] Mojirsheibani, M. (2012). Some results on classifier selection with missing covariates. Metrika, 75, 521-539.
[22] Nadaraya, E.A. (1964). On estimating regression. Theory Probab. Appl., 9, 141-142.
[23] Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.
[24] Racine, J. and Li, Q. (2004). Nonparametric estimation of regression functions with both categorical and continuous data. Journal of Econometrics, 119(1), 99-130.
[25] Racine, J. and Hayfield, T. (2008). Nonparametric econometrics: the np package. Journal of Statistical Software, 27(5).
[26] Robins, J.M., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846-866.
[27] Rotnitzky, A. and Robins, J.M. (1995). Semiparametric regression estimation in the presence of dependent censoring. Biometrika, 82(4), 805-820.
[28] Spiegelman, C. and Sacks, J. (1980). Consistent window estimation in nonparametric regression. Ann. Statist., 8, 240-246.
[29] Stone, C.J. (1977). Consistent nonparametric regression (with discussion). Ann. Statist., 5, 595-645.
[30] Tukey, J.W. (1961). Curves as parameters, and touch estimation. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, 681-694. Univ. California Press, Berkeley, Calif.
[31] van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes. Springer-Verlag, New York.
[32] Vansteelandt, S., Carpenter, J., and Kenward, M. (2010). Analysis of incomplete data using inverse probability weighting and doubly robust estimators. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 6, 37-48.
[33] Walk, H. (2002b). Almost sure convergence properties of Nadaraya-Watson regression estimates. In Modeling Uncertainty, 201-223. Internat. Ser. Oper. Res. Management Sci., 46, Kluwer Acad. Publ., Boston, MA.
[34] Wand, M.P., Faes, C., and Ormerod, J. (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106(495), 959-971.
[35] Wand, M.P. and Jones, M.C. (1995). Kernel Smoothing. Chapman & Hall/CRC, Boca Raton.
[36] Wang, Q. and Qin, Y. (2010). Empirical likelihood confidence bands for distribution functions with missing responses. Journal of Statistical Planning and Inference, 140, 2778-2789.
[37] Watson, G.S. (1964). Smooth regression analysis. Sankhyā, Ser. A, 26, 359-372.
A Probability Review
This section provides a brief review of some of the important probabilistic definitions used
in establishing convergence of the regression estimators.
Definition A.1 (Convergence in Probability) Let $X_n$ be a sequence of random variables. We say that $X_n$ converges in probability to $X$ if, for every $\epsilon > 0$,
$$
P\{ |X_n - X| > \epsilon \} \to 0 \quad \text{as } n \to \infty.
$$
The notation $X_n \stackrel{p}{\longrightarrow} X$ denotes "$X_n$ converges to $X$ in probability".
A stronger mode of convergence is almost sure convergence; this is the mode of convergence used in all of our results.
Definition A.2 (Almost Sure Convergence) The random variable $X_n$ is said to converge almost surely (with probability 1) to $X$ if
$$
P\Big\{ \lim_{n \to \infty} X_n = X \Big\} = 1.
$$
The notation $X_n \stackrel{a.s.}{\longrightarrow} X$ denotes "$X_n$ converges to $X$ almost surely".
This definition is rarely used directly to establish almost sure convergence. A more convenient route to the almost sure convergence of an estimator is provided by the Borel-Cantelli lemma.
Lemma A.1 (Borel-Cantelli) Let $\{A_n : n \ge 1\}$ be a sequence of events in a probability space. If
$$
\sum_{n=1}^{\infty} P(A_n) < \infty\,,
$$
then
$$
P\left( \bigcap_{k=1}^{\infty} \bigcup_{n=k}^{\infty} A_n \right) = 0\,,
$$
i.e., the probability that the events $A_n$ occur infinitely often is 0.
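To illustrate how Lemma A.1 is combined with exponential inequalities in this thesis (the argument below is schematic and not tied to any particular estimator), suppose an estimator $W_n$ of a quantity $W$ satisfies a bound of the form
$$
P\{ |W_n - W| > \epsilon \} \;\le\; c_1 \exp\left( -c_2\, n\, \epsilon^2 \right) \quad \text{for all $n$ large enough},
$$
for some constants $c_1, c_2 > 0$ and each fixed $\epsilon > 0$. Then $\sum_{n=1}^{\infty} P\{ |W_n - W| > \epsilon \} < \infty$, so Lemma A.1 applied to the events $A_n = \{ |W_n - W| > \epsilon \}$ shows that, with probability 1, $|W_n - W| > \epsilon$ occurs for at most finitely many $n$. Letting $\epsilon$ run over a countable sequence decreasing to zero gives $W_n \stackrel{a.s.}{\longrightarrow} W$. This is the route taken with the exponential bounds of Chapters 3 and 4 (see, e.g., the proof of Theorem 4.4).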
The following inequalities were used to establish exponential upper bounds for our regression estimators in Chapters 3 and 4; see Györfi et al. (2002, p. 594).
Theorem A.1 (Hoeffding (1963)) Let $X_1, \dots, X_n$ be independent random variables such that $|X_i| \le M$ with probability 1. Let $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$ denote their average. Then, for any $\epsilon > 0$,
$$
P\big\{ \big| \bar X - E[\bar X] \big| > \epsilon \big\} \;\le\; 2\exp\left( \frac{-n\epsilon^2}{2M^2} \right).
$$
Theorem A.2 (Bernstein (1946)) Let $X_1, \dots, X_n$ be independent random variables with $|X_i| \le M$ almost surely, for all $i = 1, \cdots, n$, and $E[X_i^2] < \infty$. Put $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \mathrm{var}(X_i)$. Then, for any $\epsilon > 0$,
$$
P\left\{ \left| \frac{1}{n}\sum_{i=1}^{n} \big( X_i - E[X_i] \big) \right| > \epsilon \right\} \;\le\; 2\exp\left( \frac{-n\epsilon^2}{2\sigma^2 + 2M\epsilon/3} \right).
$$
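As a quick numerical sanity check of Theorem A.1, the following short simulation compares the empirical tail frequency with the stated bound. The setup is an assumption made only for illustration: the $X_i$ are uniform on $[-M, M]$ (so $|X_i| \le M$ holds with probability 1), and the particular values of $n$, $\epsilon$, and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup for illustration: X_i uniform on [-M, M], so |X_i| <= M with
# probability 1 and E[Xbar] = 0.  None of these choices come from the thesis.
M, n, eps, reps = 1.0, 200, 0.2, 20_000

X = rng.uniform(-M, M, size=(reps, n))
tail = np.mean(np.abs(X.mean(axis=1)) > eps)            # Monte Carlo estimate of P{|Xbar - E Xbar| > eps}
bound = 2.0 * np.exp(-n * eps ** 2 / (2.0 * M ** 2))    # right-hand side of Theorem A.1

print(f"empirical tail probability ~ {tail:.5f}")
print(f"Hoeffding bound            = {bound:.5f}")
# The empirical frequency should stay below the bound (up to Monte Carlo error).
```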
B Technical Lemmas
A number of important lemmas were used to establish the results of Chapters 3 and 4. These lemmas are collected in this section.
Lemma B.1 Let $K$ be a regular kernel, and let $\mu$ be any probability measure on the Borel sets of $\mathbb{R}^{d+s}$. Then there exists a finite positive constant $c_1$, depending only on the kernel $K$, such that for all $h_n > 0$,
$$
\sup_{u \in \mathbb{R}^{d+s}} \int \frac{K\!\left(\frac{z-u}{h_n}\right)}{E\!\left[ K\!\left(\frac{z-Z}{h_n}\right) \right]}\, \mu(dz) \;\le\; c_1 .
$$
PROOF OF LEMMA B.1
This lemma and its proof are given in Devroye and Krzyżak (1989; Lemma 1). Also see Devroye and Wagner (1980) as well as Spiegelman and Sacks (1980).
Lemma B.2 Let $K$ be a regular kernel, and put
$$
m_n^*(z) \;=\; \sum_{i=1}^{n} Y_i\, K\!\left(\frac{z - Z_i}{h_n}\right) \Big/ \left( n\, E\!\left[ K\!\left(\frac{z - Z}{h_n}\right) \right] \right),
$$
where the $(Z_i, Y_i)$'s are i.i.d. $\mathbb{R}^{d+s} \times \mathbb{R}$-valued random vectors. Then, if $|Y| \le M < \infty$, we have for every $\epsilon > 0$ and $n$ large enough,
$$
P\left\{ \int \big| m_n^*(z) - m(z) \big|\, \mu(dz) > \epsilon \right\} \;\le\; \exp\left( \frac{-n\epsilon^2}{64 M^2 c_1^2} \right).
$$
Here, $\mu$ is the probability measure of $Z$, $m(z) = E[Y \mid Z = z]$, and $c_1$ is as in Lemma B.1.
PROOF OF LEMMA B.2
For a proof of this result see, for example, Györfi et al. (2002; Lemma 23.9).
Lemma B.3 Let $\phi(X_i, Y_i) = \eta^*(X_i, Y_i)\, f(X_i)\, P(Y = Y_i \mid Y_i)$, where $\eta^*(X_i, Y_i)$ is as in (3.2), and put $\hat\phi(X_i, Y_i) = \lambda_n^{-d} n^{-1} \sum_{j=1}^{n} \Delta_j\, I\{Y_j = Y_i\}\, H\!\left((X_i - X_j)/\lambda_n\right)$, where $\lambda_n$ and the kernel $H$ are as in (3.6). Suppose that conditions B2, B3, and B4 hold. Then
$$
\Big| E\big[ \hat\phi(X_i, Y_i) \,\big|\, X_i, Y_i \big] - \phi(X_i, Y_i) \Big| \;\le\; c_2\, \lambda_n \quad a.s.,
$$
where $c_2$ is a positive constant not depending on $n$.
The proof is similar to that of Lemma 3 of Mojirsheibani (2012) and goes as follows.
PROOF OF LEMMA B.3
First note that
$$
\begin{aligned}
E\big[ \hat\phi(X_i, Y_i) \,\big|\, X_i, Y_i \big] - \phi(X_i, Y_i)
&= E\left[ \lambda_n^{-d} n^{-1} \sum_{j=1}^{n} \Delta_j\, I\{Y_j = Y_i\}\, H\!\left(\frac{X_i - X_j}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right] - \phi(X_i, Y_i) \\
&\stackrel{a.s.}{=} \lambda_n^{-d}\, E\left[ I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) E\big[\Delta_1 \,\big|\, X_1, Y_1, X_i, Y_i\big] \,\Bigg|\, X_i, Y_i \right] - \phi(X_i, Y_i) \\
&= \lambda_n^{-d}\, E\left[ I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \eta^*(X_1, Y_1) \,\Bigg|\, X_i, Y_i \right] - \eta^*(X_i, Y_i) f(X_i) P(Y = Y_i \mid Y_i) \\
&\qquad \text{(since $\Delta_1$ is independent of $(X_i, Y_i)$)} \\
&= \lambda_n^{-d} \left\{ E\left[ \eta^*(X_1, Y_1)\, I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right]
 \pm E\left[ \eta^*(X_i, Y_i)\, I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right] \right\} \\
&\qquad - \eta^*(X_i, Y_i) f(X_i) P(Y = Y_i \mid Y_i) \\
&= E\left[ \lambda_n^{-d} \big( \eta^*(X_1, Y_1) - \eta^*(X_i, Y_i) \big) I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right] \\
&\qquad + E\left[ \eta^*(X_i, Y_i) \left\{ \lambda_n^{-d}\, I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) - f(X_i) P(Y = Y_i \mid Y_i) \right\} \,\Bigg|\, X_i, Y_i \right] \\
&:= \Delta_{n,i}(1) + \Delta_{n,i}(2) \quad \text{(say)}.
\end{aligned}
$$
Utilizing a one-term Taylor expansion, we can bound $\Delta_{n,i}(1)$ as follows:
$$
|\Delta_{n,i}(1)| \;\le\; \lambda_n^{-d}\, E\left[ \sum_{k=1}^{d} \left| \frac{\partial \eta^*(X^\dagger, Y_i)}{\partial X_k} \right| \big| X_{i,k} - X_{1,k} \big|\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right],
$$
where $X_{1,k}$ and $X_{i,k}$ are the $k$th components of $X_1$ and $X_i$, respectively, and $X^\dagger$ is on the interior of the line segment joining $X_1$ and $X_i$. Therefore,
$$
\begin{aligned}
|\Delta_{n,i}(1)| &\le \alpha_1\, E\left[ \sum_{k=1}^{d} \big| X_{i,k} - X_{1,k} \big|\, \lambda_n^{-d} H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right]
\qquad \left( \text{where } \alpha_1 = \max_{1 \le k \le d}\, \sup_{v \in \mathbb{R}^d,\, y \in \mathcal{Y}} \left| \frac{\partial \eta^*(x, y)}{\partial x_k}\Big|_{x=v} \right| \right) \\
&= \alpha_1 \sum_{k=1}^{d} \int_{\mathbb{R}^d} \big| X_{i,k} - x_k \big|\, \lambda_n^{-d} H\!\left(\frac{X_i - x}{\lambda_n}\right) f(x)\, dx \\
&\le \alpha_1 \|f\|_\infty \sum_{k=1}^{d} \int_{\mathbb{R}^d} \lambda_n |u_k|\, H(u)\, du \quad \text{(by condition A3)} \\
&\le \alpha \lambda_n \quad \text{(by condition A2)}.
\end{aligned}
$$
To bound the term $\Delta_{n,i}(2)$, first note that
$$
\Delta_{n,i}(2) \;=\; \eta^*(X_i, Y_i)\, E\left[ \lambda_n^{-d}\, I\{Y_1 = Y_i\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) - f(X_i) P(Y = Y_i \mid Y_i) \,\Bigg|\, X_i, Y_i \right].
$$
Now, let $f_y(x)$ be the conditional pdf of $X$, given $Y = y$, and observe that
$$
E\left[ \lambda_n^{-d}\, I\{Y_i = Y_1\}\, H\!\left(\frac{X_i - X_1}{\lambda_n}\right) \,\Bigg|\, X_i, Y_i \right]
= \sum_{y \in \mathcal{Y}} P(Y = y)\, I\{Y_i = y\} \int_{\mathbb{R}^d} \lambda_n^{-d} H\!\left(\frac{X_i - x}{\lambda_n}\right) f_y(x)\, dx .
$$
Therefore, utilizing condition A2 and the fact that
$$
P(Y = Y_i \mid Y_i) = \sum_{y \in \mathcal{Y}} P(Y = y)\, I\{Y_i = y\}\,,
$$
one finds
$$
\Delta_{n,i}(2) \;=\; \eta^*(X_i, Y_i) \sum_{y \in \mathcal{Y}} P(Y = y)\, I\{Y_i = y\} \int_{\mathbb{R}^d} \lambda_n^{-d} H\!\left(\frac{X_i - x}{\lambda_n}\right) \big( f_y(x) - f_y(X_i) \big)\, dx .
$$
Performing a one-term Taylor expansion and using the fact that $|\eta^*(X_i, Y_i)| \le 1$ and $I\{Y_i = y\} \le 1$ for every $y \in \mathcal{Y}$ yields
$$
\begin{aligned}
|\Delta_{n,i}(2)| &\le \sum_{y \in \mathcal{Y}} P(Y = y) \left| \int_{\mathbb{R}^d} H(u) \big( f_y(X_i - \lambda_n u) - f_y(X_i) \big)\, du \right| \\
&\le \lambda_n \sum_{y \in \mathcal{Y}} P(Y = y) \sum_{k=1}^{d} \sup_{v \in \mathbb{R}^d} \left| \frac{\partial f_y(x)}{\partial x_k}\Big|_{x=v} \right| \int_{\mathbb{R}^d} |u_k|\, H(u)\, du \\
&\le d\, \lambda_n \max_{1 \le k \le d} \sup_{v \in \mathbb{R}^d} \left| \frac{\partial f(x)}{\partial x_k}\Big|_{x=v} \right| \int_{\mathbb{R}^d} |u_k|\, H(u)\, du \\
&\le \beta\, \lambda_n \quad \text{(by conditions A2, A3, and A4)}.
\end{aligned}
$$
This completes the proof of Lemma B.3. $\Box$
The following lemma is similar to Lemma B.3, with slight modifications to deal with the simpler case of a missing response instead of missing covariates. It is used in the proof of Theorem 4.2.
Lemma B.4 Let $\psi(Z) = \pi^*(Z)\, f(Z)$, where $f$ is the probability density of $Z$, and $\pi^*(z) = P\{\Delta = 1 \mid Z = z\}$. Put $\hat\psi(Z) = n^{-1} a_n^{-d} \sum_{j=1}^{n} \Delta_j\, H\!\left((Z_j - Z)/a_n\right)$, where $a_n$ and the kernel $H$ are as in (4.3). Suppose that conditions B2, B3, and B4 hold. Then
$$
\Big| E\big[ \hat\psi(Z) \,\big|\, Z \big] - \psi(Z) \Big| \;\le\; \kappa\, a_n \quad a.s.,
$$
where $\kappa$ is a positive constant not depending on $n$.
PROOF OF LEMMA B.4
The proof is similar to that of Lemma 4.2 of Mojirsheibani (2007) and goes as follows. First,
$$
\begin{aligned}
E\big[ \hat\psi(Z) \,\big|\, Z \big] - \psi(Z)
&= a_n^{-d}\, E\big[ \Delta_1 H\!\left((Z_1 - Z)/a_n\right) \,\big|\, Z \big] - \pi^*(Z) f(Z) \\
&\stackrel{a.s.}{=} a_n^{-d}\, E\Big[ H\!\left((Z_1 - Z)/a_n\right) E\big[\Delta_1 \,\big|\, Z, Z_1\big] \,\Big|\, Z \Big] - \pi^*(Z) f(Z). \qquad \text{(B.1)}
\end{aligned}
$$
But $E[\Delta_1 \mid Z, Z_1] = E[\Delta_1 \mid Z_1] = \pi^*(Z_1)$, because $Z$ is independent of $(\Delta_1, Z_1)$. Therefore
$$
\begin{aligned}
\text{(B.1)} &= a_n^{-d}\, E\Big[ \big( \pi^*(Z_1) - \pi^*(Z) \big) H\!\left((Z_1 - Z)/a_n\right) \,\Big|\, Z \Big]
 + E\Big[ \pi^*(Z) \Big( a_n^{-d} H\!\left((Z_1 - Z)/a_n\right) - f(Z) \Big) \,\Big|\, Z \Big] \\
&:= T_{n,1}(Z) + T_{n,2}(Z)\,, \quad \text{(say)}.
\end{aligned}
$$
A one-term Taylor expansion gives
$$
T_{n,1}(Z) = a_n^{-d}\, E\left[ \left( \sum_{i=1}^{d} \frac{\partial \pi^*(Z^\dagger)}{\partial Z_i} (Z_{1,i} - Z_i) \right) H\!\left((Z_1 - Z)/a_n\right) \,\Bigg|\, Z \right],
$$
where $Z_i$ and $Z_{1,i}$ are the $i$th components of $Z$ and $Z_1$, respectively, and $Z^\dagger$ is a point on the interior of the line segment joining $Z$ and $Z_1$. Therefore,
$$
\begin{aligned}
|T_{n,1}(Z)| &\le C_{14} \sum_{i=1}^{d} E\Big[ |Z_{1,i} - Z_i|\, a_n^{-d} H\!\left((Z_1 - Z)/a_n\right) \,\Big|\, Z \Big] \quad \text{(by condition B4)} \\
&\qquad \left( \text{where } C_{14} = \max_{1 \le i \le d}\, \sup_{z} \big| \partial \pi^*(z)/\partial z_i \big| \right) \\
&= C_{14} \sum_{i=1}^{d} \int |z_i - Z_i|\, a_n^{-d} H\!\left((z - Z)/a_n\right) f(z)\, dz \\
&\le C_{14} \|f\|_\infty \sum_{i=1}^{d} \int a_n |v_i|\, H(v)\, dv \;=:\; C_{15}\, a_n \quad \text{(by conditions B2 and B4)},
\end{aligned}
$$
where we have made the substitution $v = (z - Z)/a_n$, so that $v_i = (z_i - Z_i)/a_n$, $i = 1, \cdots, d$. Next, to bound $T_{n,2}(Z)$, first note that
$$
T_{n,2}(Z) \;=\; \pi^*(Z) \int a_n^{-d} H\!\left((z - Z)/a_n\right) \big[ f(z) - f(Z) \big]\, dz
\;=\; \pi^*(Z) \int \big[ f(Z + a_n y) - f(Z) \big] H(y)\, dy .
$$
Finally, a one-term Taylor expansion gives
$$
|T_{n,2}(Z)| \;\le\; a_n \cdot d\, \|f'\|_\infty \left[ \sum_{i=1}^{d} \int |z_i|\, H(z)\, dz \right] \quad \text{(by condition B2)}.
$$
This completes the proof of Lemma B.4.
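To give a concrete sense of the bias bound in Lemma B.4, the following small numerical sketch evaluates $E[\hat\psi(z_0)]$ by numerical integration for several bandwidths. The one-dimensional Gaussian design, the logistic selection probability, the Gaussian kernel, the point $z_0$, and the bandwidth values are all assumptions of the sketch, not quantities taken from the thesis.

```python
import numpy as np

# Assumed one-dimensional setup, for illustration only: Z ~ N(0,1) with density f,
# pi*(z) = 1/(1 + exp(-z)), and H the standard normal density kernel.
f = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
H = f                                        # same Gaussian form used as the kernel
pi_star = lambda z: 1.0 / (1.0 + np.exp(-z))
psi = lambda z: pi_star(z) * f(z)            # psi(z) = pi*(z) f(z)

z0 = 0.5
u = np.linspace(-8.0, 8.0, 16001)            # integration grid
du = u[1] - u[0]

for a_n in (0.4, 0.2, 0.1, 0.05):
    # E[psi_hat(z0)] = integral of pi*(z0 + a_n u) f(z0 + a_n u) H(u) du
    expected = np.sum(psi(z0 + a_n * u) * H(u)) * du
    bias = abs(expected - psi(z0))
    print(f"a_n = {a_n:.2f}   |E[psi_hat] - psi| = {bias:.6f}   ratio to a_n = {bias / a_n:.6f}")
# The ratio |bias| / a_n stays bounded as a_n shrinks, in line with the bound of Lemma B.4.
```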
The following lemma is used in the proof of Theorem 4.2.
Lemma B.5 Let $\hat\psi$ be as in Lemma B.4. Then, for every $t > 0$, one has
$$
P\Big\{ \big| \hat\psi(Z) - E[\hat\psi(Z) \mid Z] \big| > t \,\Big|\, Z = z \Big\} \;\le\; 2\exp\left( \frac{-n a_n^d t^2}{2\|H\|_\infty \big[ 2\|f\|_\infty + t/3 \big]} \right).
$$
PROOF OF LEMMA B.5
Let
$$
\Gamma_j(Z) \;=\; a_n^{-d} \left[ \Delta_j H\!\left(\frac{Z - Z_j}{a_n}\right) - E\left\{ \Delta_j H\!\left(\frac{Z - Z_j}{a_n}\right) \,\Big|\, Z \right\} \right]
$$
and note that, conditional on $Z$, the terms $\Gamma_j(Z)$ are independent, zero-mean random variables, bounded between $-a_n^{-d}\|H\|_\infty$ and $+a_n^{-d}\|H\|_\infty$. Also, $\mathrm{var}\big(\Gamma_j(Z) \mid Z\big) = E\big[\Gamma_j^2(Z) \mid Z\big] \le 2 a_n^{-d} \|H\|_\infty \|f\|_\infty$. Therefore, by Bernstein's (1946) inequality,
$$
P\Big\{ \big| \hat\psi(Z) - E[\hat\psi(Z) \mid Z] \big| > t \,\Big|\, Z = z \Big\}
= P\left\{ \left| \frac{1}{n} \sum_{j=1}^{n} \Gamma_j(Z) \right| > t \,\Bigg|\, Z = z \right\}
\;\le\; 2\exp\left( \frac{-n t^2}{2\big[ 2 a_n^{-d} \|H\|_\infty \|f\|_\infty + (a_n^{-d} \|H\|_\infty t)/3 \big]} \right),
$$
which does not depend on $z$.
Lemma B.6 Let $\hat\pi_{LS}$ be as in (4.12) and suppose that $\pi^* \in \mathcal{P}$, where $\pi^*(z) = E[\Delta \mid Z = z]$. Then
$$
E\Big[ \big| \hat\pi_{LS}(Z) - \pi^*(Z) \big|^2 \,\Big|\, D_n \Big] \;\le\; 2 \sup_{\pi \in \mathcal{P}} \Big| \nu_n(\pi) - E\big|\pi(Z) - \Delta\big|^2 \Big|,
$$
where
$$
\nu_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} \big| \pi(Z_i) - \Delta_i \big|^2 .
$$
PROOF OF LEMMA B.6
Using the elementary decomposition
$$
E\Big[ \big| \hat\pi_{LS}(Z) - \Delta \big|^2 \,\Big|\, D_n \Big] = E\Big[ \big| \pi^*(Z) - \Delta \big|^2 \Big] + E\Big[ \big| \hat\pi_{LS}(Z) - \pi^*(Z) \big|^2 \,\Big|\, D_n \Big],
$$
one has
$$
\begin{aligned}
E\Big[ \big| \hat\pi_{LS}(Z) - \pi^*(Z) \big|^2 \,\Big|\, D_n \Big]
&= \left\{ E\Big[ \big| \hat\pi_{LS}(Z) - \Delta \big|^2 \,\Big|\, D_n \Big] - \inf_{\pi \in \mathcal{P}} E\big| \pi(Z) - \Delta \big|^2 \right\}
 + \left\{ \inf_{\pi \in \mathcal{P}} E\big| \pi(Z) - \Delta \big|^2 - E\big| \pi^*(Z) - \Delta \big|^2 \right\} \\
&=: T_{n,1} + T_{n,2}\,.
\end{aligned}
$$
But the term $T_{n,2} = 0$, because $\pi^* \in \mathcal{P}$. As for the term $T_{n,1}$, for each $\pi \in \mathcal{P}$, let
$$
\nu_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} \big| \pi(Z_i) - \Delta_i \big|^2
$$
and observe that
$$
\begin{aligned}
T_{n,1} &= \sup_{\pi \in \mathcal{P}} \Big\{ E\Big[ \big| \hat\pi_{LS}(Z) - \Delta \big|^2 \,\Big|\, D_n \Big] - \nu_n(\pi) + \nu_n(\pi) - \nu_n(\hat\pi_{LS}) + \nu_n(\hat\pi_{LS}) - E\big| \pi(Z) - \Delta \big|^2 \Big\} \\
&\le \Big\{ - \nu_n(\hat\pi_{LS}) + E\Big[ \big| \hat\pi_{LS}(Z) - \Delta \big|^2 \,\Big|\, D_n \Big] \Big\} + \sup_{\pi \in \mathcal{P}} \Big| \nu_n(\pi) - E\big| \pi(Z) - \Delta \big|^2 \Big| \\
&\qquad \text{(because } \nu_n(\hat\pi_{LS}) - \nu_n(\pi) \le 0 \text{ for all } \pi \in \mathcal{P} \text{)} \\
&\le 2 \sup_{\pi \in \mathcal{P}} \Big| \nu_n(\pi) - E\big| \pi(Z) - \Delta \big|^2 \Big|.
\end{aligned}
$$
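To make the quantities appearing in Lemma B.6 concrete, the sketch below computes the empirical least-squares criterion $\nu_n(\pi)$ and selects a minimizer over a candidate class. The one-dimensional setup, the logistic form of the candidates, and the finite grid of parameter values are illustrative assumptions only; the actual class $\mathcal{P}$ and the estimator $\hat\pi_{LS}$ are those defined around (4.12).

```python
import numpy as np

def nu_n(pi, Z, delta):
    """Empirical criterion nu_n(pi) = (1/n) * sum_i |pi(Z_i) - Delta_i|^2 from Lemma B.6."""
    return np.mean((pi(Z) - delta) ** 2)

def least_squares_pi(candidates, Z, delta):
    """Return the candidate minimizing nu_n, mirroring the least-squares estimator pi_hat_LS."""
    scores = [nu_n(pi, Z, delta) for pi in candidates]
    best = int(np.argmin(scores))
    return candidates[best], scores[best]

# Toy illustration (assumed setup): one-dimensional Z and a small grid of logistic
# candidates pi_theta(z) = 1 / (1 + exp(-theta * z)).
rng = np.random.default_rng(1)
Z = rng.normal(size=500)
true_pi = lambda z: 1.0 / (1.0 + np.exp(-1.5 * z))
delta = rng.binomial(1, true_pi(Z))                      # missingness indicators Delta_i

thetas = np.linspace(0.5, 3.0, 26)
candidates = [(lambda z, t=t: 1.0 / (1.0 + np.exp(-t * z))) for t in thetas]
pi_hat_LS, score = least_squares_pi(candidates, Z, delta)
print(f"selected nu_n value: {score:.4f}")
```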