Functional Additive Models
Hans-Georg Müller
Department of Statistics, University of California
One Shields Avenue, Davis, CA 95616, USA
Fang Yao
Department of Statistics, University of Toronto
100 St. George Street, Toronto, ON M5S 3G3, Canada
July 2008
Author’s Footnote:
Hans-Georg Müller is Professor, Department of Statistics, University of California,
Davis, One Shields Avenue, Davis, CA 95616, U.S.A, email: [email protected].
Fang Yao is Associate Professor, Department of Statistics, University of Toronto, 100
St. George Street, Toronto, ON M5S 3G3, Canada, email: [email protected].
We wish to thank two referees and an Associate Editor for helpful remarks that led to
an improved version of the paper. This research was supported in part by grants from
the National Science Council (NSC95-2118M001-013MY3), the National Science Foundation (DMS-03-54448 and DMS-05-05537), and the Natural Sciences and Engineering
Research Council of Canada.
ABSTRACT
In commonly used functional regression models, the regression of a scalar or functional response on the functional predictor is assumed to be linear. This means the
response is a linear function of the functional principal component scores of the predictor process. We relax the linearity assumption and propose to replace it by an additive
structure. This leads to a more widely applicable and much more flexible framework
for functional regression models. The proposed functional additive regression models
are suitable for both scalar and functional responses. The regularization needed for
effective estimation of the regression parameter function is implemented through a projection on the eigenbasis of the covariance operator of the functional components in the
model. The utilization of functional principal components in an additive rather than
linear way leads to substantial broadening of the scope of functional regression models
and emerges as a natural approach, as the uncorrelatedness of the functional principal
components is shown to lead to a straightforward implementation of the functional additive model, just based on a sequence of one-dimensional smoothing steps and without
need for backfitting. This facilitates the theoretical analysis, and we establish asymptotic consistency of the estimates of the components of the functional additive model.
The empirical performance of the proposed modeling framework and estimation methods is illustrated through simulation studies and in applications to gene expression
time course data.
KEY WORDS: Asymptotics, Additive Model, Functional Data Analysis, Functional
Regression, Linear Model, Principal Components, Smoothing, Stochastic Processes.
1. INTRODUCTION
A characterizing feature of functional regression models is that either predictor or
response or both are functions, where the functional components are typically assumed
to be realizations of a stochastic process. The functional linear model is the commonly
adopted functional regression model. It has been introduced in its most general form,
where both predictor and response are functions, by Ramsay and Dalzell (1991). The
case of a functional predictor and scalar response has been the focus of most research to date on the functional linear model, often in the somewhat artificial situation where the functional data are assumed to be observed without noise and on a very dense and regular grid. For this case, Cardot et al. (2003) provided consistency results and
introduced a testing procedure. Theory for the case of fixed design and functional
response was developed in Cuevas et al. (2002). For a summary of some of these
developments, we refer to Ramsay and Silverman (2005).
Several extensions of the basic linear functional regression models have been proposed, often motivated by established analogous extensions of the classical multivariate
regression models towards more general regression models. These include generalized
functional linear models (James, 2002; Escabias et al., 2004; Cardot and Sarda, 2005;
Müller and Stadtmüller, 2005), modifications of functional regression for longitudinal,
i.e., sparse, irregular and noisy data (Yao et al., 2005b), varying-coefficient functional
models (Malfait and Ramsay, 2003; Fan and Zhang, 2000; Fan et al., 2003), wavelet-based functional models (Morris et al., 2003) and multiple-index models (James and
Silverman, 2005). Recent asymptotic studies of estimation in functional linear regression models with scalar response and fully observed predictor trajectories include Cai
and Hall (2006) and Hall and Horowitz (2007).
It is well known that the estimation of the regression parameter function is an
inverse problem and therefore requires regularization. Two main approaches have
emerged for the implementation of the regularization step. The first approach is projection onto a finite number of elements of a functional basis, which can be either fixed
in advance such as the wavelet or Fourier basis, or may be chosen data-adaptively such
as the eigenbasis of the auto-covariance operator of the predictor processes X. If the
eigenbasis is used, conveying the advantage of the most parsimonious representation,
this approach is based on an initial functional principal component analysis (see, e.g.,
Rice and Silverman, 1991). No matter which basis is chosen, effective regularization
is then obtained by suitably truncating the number of included basis functions. The
second approach to regularization is based on penalized likelihood or penalized least
squares, and has been implemented for example via splines or ridge regression (Hall
and Horowitz, 2007). In this paper we adopt the first approach and express functional
regression models in terms of the functional principal components of the predictor,
and if applicable, also of the response processes. Since the stochastic part of square
integrable stochastic processes can always be equivalently represented by the countable sequence of their functional principal component scores and eigenfunctions, all
functional regression models have such a representation, irrespective of their structure. In the previously studied functional linear regression models, the regression of
the scalar or functional response on the functional predictors is a linear function of
the predictor functional principal component scores, and estimation, inference, asymptotics and extensions of these basic functional linear models are studied within this
linear framework.
The purpose of this paper is an analysis of an alternative additive functional regression model. Additive models are attractive as they provide effective dimension reduction and great flexibility in modeling (Hastie and Tibshirani, 1990). While extensions of
linear models to single and multiple index models are in place for functional regression, the extension to additive models has proven elusive and this is due to a major
challenge: A direct extension, analogous to multivariate data analysis, faces the difficulty that the equivalent to individual predictors in vector regression is the continuum
of function values over the entire time domain. Therefore, the predictor set is not
countable, and an additive model in these predictors would require uncountably many
additive components. We overcome this difficulty by taking as predictors the countable
set of functional principal component scores, which can be truncated at an increasing
sequence of finitely many predictors, while representing the entire predictor function
adequately.
For the case of a scalar response, the combination of functional principal component scores with an additive model has been demonstrated in concurrent work with an applied emphasis, using readily available software tools (Foutz and Jank, 2008; Liu and Müller, 2008; Sood et al., 2008). As we show, a key to
both analysis and implementation of this combination is the uncorrelatedness of the
functional predictor scores. The usual implementation of additive models, which must
take into account dependencies between predictors, requires backfitting or similarly
complex schemes. Because the functional predictor scores are uncorrelated, the additive fitting step can be greatly simplified and requires no more than one-dimensional
smoothing steps, separately applied to each predictor score. This has important consequences: Very fast and simple implementation, and simplification which makes it
possible to study asymptotic properties of the resulting model, especially in the Gaussian case. The combination of functional principal component score predictors and
additive models therefore emerges as a particularly natural and flexible data-adaptive
nonparametric framework for functional regression models. We refer to this approach
as the Functional Additive Model (FAM).
The paper is organized as follows: Section 2 contains background from functional
linear regression. The proposed Functional Additive Models are introduced in Section
3, and issues of fitting these models are the theme of Section 4. Asymptotic consistency
properties are presented in Section 5, while Section 6 is devoted to a report on simulation results, followed by a description of an application to regression for gene expression
time courses in the Drosophila life cycle in Section 7 and concluding remarks in Section
8. Details and assumptions can be found in a separate Appendix, and auxiliary results
and proofs in a Supplement.
2. FUNCTIONAL LINEAR MODELS AND
EIGENREPRESENTATIONS
We consider regression models where the predictor is a smooth square integrable random function X(·) defined on a domain S, and the response is either a scalar or a random function Y(·) on a domain T, with mean functions EX(s) = µX(s) and EY(t) = µY(t), and covariance functions cov(X(s1), X(s2)) = GX(s1, s2) and cov(Y(t1), Y(t2)) = GY(t1, t2), for s, s1, s2 ∈ S and t, t1, t2 ∈ T, respectively. We denote centered predictor processes by X^c(s) = X(s) − µX(s). The established linear functional regression models
with scalar resp. functional response are (Ramsay and Silverman, 2005)
E(Y | X) = µY + ∫_S β(s) X^c(s) ds,   (1)

E{Y(t) | X} = µY(t) + ∫_S β(s, t) X^c(s) ds,   (2)
where the regression parameter functions β are assumed to be smooth and square
integrable. To estimate these functions and thus identify the functional linear model,
basis representations such as the eigendecomposition of the functional components
in models (1) and (2) are a convenient way to implement the necessary regularization
(Yao et al., 2005b). Taking advantage of the equivalence between process and countable
sequence of functional principal component (FPC) scores, we represent predictor and
response processes X and Y in terms of these scores. An implementation of these
representations is available through the PACE package which can be found under
“programs” at http://www.stat.ucdavis.edu/~mueller/.
In our sampling model we assume as in Yao et al. (2003) that the functional trajectories are not completely observed but rather are measured on a grid and measurements
are contaminated with measurement error. This is a more realistic scenario than the
common assumption in functional data analysis that entire trajectories are observed.
For our theoretical analysis we assume that the grid of measurements is dense. Empirically, as evidenced by simulations and in applications, the proposed methods also
work well for the case of sparse and irregularly observed longitudinal data. Denote
by Uij (respectively, Vil ) the noisy observations made of the random trajectories Xi
(respectively, Yi ) at times sij (resp., til ), where (Xi , Yi ), i = 1, . . . , n, corresponds to
an i.i.d. sample of processes (X, Y ) (the scalar response case is always included in
these considerations). The available observations are contaminated with measurement
errors ǫij (respectively, εil), 1 ≤ j ≤ ni, 1 ≤ l ≤ mi, 1 ≤ i ≤ n, where ni and mi are the numbers of observations from Xi and Yi. The errors are assumed to be i.i.d. with zero means, Eǫij = 0 and E(ǫij²) = σX², respectively Eεil = 0 and E(εil²) = σY², and to be independent of the FPC scores ξik = ∫ (Xi(s) − µX(s)) φk(s) ds, respectively ζim = ∫ (Yi(t) − µY(t)) ψm(t) dt, where φk, ψm are the eigenfunctions of processes X and Y, defined in the Appendix. We note that the FPC scores satisfy Eξik = 0, E(ξik ξik′) = 0 for k ≠ k′ and E(ξik²) = λk, respectively Eζim = 0, E(ζim ζim′) = 0 for m ≠ m′ and E(ζim²) = ρm.
Invoking the population least squares property of conditional expectation, and using
the fact that the predictors are uncorrelated, leads to an extension of the representation
β1 = cov(X, Y )/var(X) of the slope parameter in the simple linear regression model
E(Y |X) = β0 +β1 X to the functional case. By solving a “functional population normal
equation” (He et al., 2000, 2003), one obtains for scalar resp. functional responses

β(s) = Σ_{k=1}^∞ {E(ξk Y)/E(ξk²)} φk(s),    β(s, t) = Σ_{k=1}^∞ Σ_{m=1}^∞ {E(ξk ζm)/E(ξk²)} φk(s) ψm(t).   (3)
Plugging (3) into (1), (2) and observing (23) then leads to

E(Y | X) = µY + Σ_{k=1}^∞ bk ξk,  with bk = E(ξk Y)/E(ξk²),   (4)

E(ζm | X) = ∫ E(Y(t) − µY(t) | X) ψm(t) dt = Σ_{k=1}^∞ bkm ξk,  with bkm = E(ξk ζm)/E(ξk²),   (5)

for scalar responses Y, respectively, for each FPC score ζm of the response process.
In practice, when fitting a functional regression model and estimating the regression
parameter function β, one needs to regularize by truncating these expansions at a finite
number of components K and M.
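Under such a truncation, representation (4) reduces to a finite set of simple regressions through the origin. The following is a minimal numpy sketch of this idea, assuming densely observed, noise-free trajectories on a common grid (the paper's PACE machinery is what actually handles noise, sparsity and the choice of K); all function and variable names are illustrative, not part of any package.

```python
import numpy as np

def fpc_scores(X, grid, K):
    """Discretized functional PCA on a common dense grid: estimate the mean
    function, eigenvalues, eigenfunctions and FPC scores via Riemann sums."""
    d = grid[1] - grid[0]                     # grid spacing
    mu = X.mean(axis=0)                       # estimated mean function
    Xc = X - mu                               # centered trajectories
    G = (Xc.T @ Xc) / len(X)                  # discretized covariance surface
    lam, phi = np.linalg.eigh(G)              # eigh returns ascending order
    lam = lam[::-1] * d                       # rescale to operator eigenvalues
    phi = phi[:, ::-1] / np.sqrt(d)           # rescale to unit L2 norm
    xi = Xc @ phi[:, :K] * d                  # scores xi_k = <X^c, phi_k>
    return mu, lam[:K], phi[:, :K], xi

def linear_fam_coeffs(xi, Y):
    """Slopes b_k = E(xi_k Y) / E(xi_k^2) of representation (4)."""
    Yc = Y - Y.mean()
    return (xi * Yc[:, None]).mean(axis=0) / (xi ** 2).mean(axis=0)

# toy check: a response that is exactly linear in the first two scores
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 10.0, 101)
phi1 = -np.cos(np.pi * grid / 10.0) / np.sqrt(5.0)
phi2 = np.sin(np.pi * grid / 10.0) / np.sqrt(5.0)
s1 = rng.normal(0.0, 2.0, 500)
s2 = rng.normal(0.0, 1.0, 500)
X = s1[:, None] * phi1 + s2[:, None] * phi2
mu, lam, phi, xi = fpc_scores(X, grid, K=2)
b = linear_fam_coeffs(xi, 2.0 * s1 + 0.5 * s2)   # true slopes 2, 0.5 (up to sign)
```

The same projection, applied to response trajectories, yields the estimated response scores needed in the functional response case; signs of estimated eigenfunctions, and hence of scores and slopes, are only determined up to ±1.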
3. FUNCTIONAL ADDITIVE MODELING
Suppose for the moment that the true FPC scores ξik for predictor processes are known.
Then the functional linear models (1) and (2) are reduced to a standard linear model
with infinitely many of these FPC scores as predictors, as demonstrated in eq. (4) and
(5). Moreover, the linear structure in the predictor scores and the uncorrelatedness of
the FPC scores then imply that
E(Y − µY | ξk) = bk ξk,   E(ζm | ξk) = bkm ξk,   (6)
for scalar, respectively, functional response models. Accordingly, the best predictor
is the linear predictor, and the regressions on predictor scores are lines through the
origin.
This observation provides a key motivation for the extension of the functional linear
model to the functional additive model (FAM). It seems natural to generalize the
well-known extension of (generalized) linear to (generalized) additive models (Hastie
and Tibshirani, 1990) to the functional case by replacing the linear terms bk ξk resp.
bkm ξk in (6) and (4), (5) by more general relationships. We thus generalize the linear
relationship with ξk to arbitrary functional relations fk (ξk ), where functions fk (·),
k = 1, 2, . . ., respectively, functions fkm (·), k, m = 1, 2, . . ., are assumed to be smooth;
beyond smoothness, nothing more is needed. This substitution transforms functional
linear models (1), (2) to functional additive models with the underlying FPC scores ξk
as predictors,
E(Y | X) = µY + Σ_{k=1}^∞ fk(ξk),   (7)

E(Y(t) | X) = µY(t) + Σ_{k=1}^∞ Σ_{m=1}^∞ fkm(ξk) ψm(t),   (8)
for scalar and functional response cases, respectively. To ensure identifiability, we
require furthermore that
Efk(ξk) = 0, k = 1, 2, . . . ,  resp.,  Efkm(ξk) = 0, k = 1, 2, . . . , m = 1, 2, . . . .   (9)
In this model, the linear relationship between the response Y and the predictor
FPC scores ξk is replaced by an additive relation, which gives rise to a far more flexible and essentially nonparametric model, while avoiding the curse of dimension, which
for infinite-dimensional functional data is insurmountable if no structure is imposed.
Beyond additivity, a second key assumption we make from now on is that the predictor
FPC scores ξk are independent. Since these scores are always uncorrelated, this assumption is for example satisfied for the case where predictor processes are Gaussian.
Then the basic functional additive model assumptions (7), (8) imply
E(Y − µY | ξk) = E{E(Y − µY | X) | ξk} = E{Σ_{j=1}^∞ fj(ξj) | ξk} = fk(ξk),   (10)

and for the functional response case analogously

E(ζm | ξk) = E{E(ζm | X) | ξk} = E{Σ_{j=1}^∞ fjm(ξj) | ξk} = fkm(ξk).   (11)
The relations fk (ξk ) = E(Y − µY |ξk ), fkm (ξk ) = E(ζm |ξk ), for all k, m = 1, 2, . . . ,
are a straightforward generalization of the decomposition of functional linear regression into simple linear regressions against the predictor FPC scores as in (6). They are
key for surprisingly simple implementations of the functional additive model. While
complex iterative procedures are required to fit a regular additive model (backfitting
and variants, see, e.g., Hastie and Tibshirani, 1990; Mammen and Park, 2005), representations (10), (11) motivate a straightforward estimation scheme to recover the
component functions fk , respectively fkm , by a series of one-dimensional smoothing
steps. This not only leads to fast and easily diagnosable procedures for the underlying
infinite-dimensional data, but also facilitates asymptotic analysis. The high degree
of flexibility and the simplicity of model fitting makes FAM an especially attractive
alternative to the special case of the standard functional linear models (1), (2).
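That no backfitting is needed can be checked numerically: when the scores are independent and the component functions are centered as in (9), the marginal regression of Y on ξk alone already equals fk(ξk). Below is a toy sketch using binned conditional means as a crude stand-in for the one-dimensional smoother; all settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
xi1 = rng.normal(0.0, 2.0, n)        # independent scores, lambda_1 = 4
xi2 = rng.normal(0.0, 1.0, n)        # lambda_2 = 1

def f1(x): return x ** 2 - 4.0       # centered: E f1(xi1) = 0
def f2(x): return x ** 2 - 1.0       # centered: E f2(xi2) = 0

Y = f1(xi1) + f2(xi2) + rng.normal(0.0, 0.3, n)

# marginal smoother for f1: conditional mean of Y within bins of xi1 alone,
# ignoring xi2 entirely -- no backfitting
edges = np.linspace(-3.0, 3.0, 13)
idx = np.digitize(xi1, edges)
centers, est = [], []
for b in range(1, len(edges)):
    m = idx == b
    if m.sum() > 50:
        centers.append(xi1[m].mean())
        est.append(Y[m].mean())
centers, est = np.array(centers), np.array(est)
err = np.max(np.abs(est - f1(centers)))   # deviation from the true f1
```

Despite never adjusting for ξ2, the binned conditional means track f1 closely, which is exactly the content of (10); with correlated predictors this one-pass scheme would be biased.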
4. FITTING OF FUNCTIONAL ADDITIVE MODELS
We begin with an overview of the estimation procedures. In a first step, smooth
estimates of the mean and covariance functions for the predictor processes are obtained
by scatterplot smoothing. This is followed by a functional principal component (FPC)
analysis, which yields estimates φ̂k for the eigenfunctions, λ̂k for the eigenvalues, and ξˆik
for the FPC scores of individual predictor trajectories; some additional details are given
in the Appendix. The estimation steps are implemented with the Principal Analysis
by Conditional Expectation (PACE) approach, also regarding the choice of the number
of included eigenfunctions K through pseudo-AIC (Yao et al., 2005a), available in the
PACE package. This was shown to work for densely sampled trajectories and in the
Gaussian case in addition to the case of sparse and irregular measurements; it also has
been demonstrated to be fairly robust against violations of the Gaussian assumption.
Once these preliminary estimates are in hand, it is straightforward to obtain estimates fˆk , fˆkm of the smooth component functions fk , respectively, fkm . We implement
all smoothing steps with local polynomial fitting; other smoothing techniques can be
used equally well. For the case of scalar responses, we estimate the functions fk by
fitting a local linear regression to the data {ξˆik , Yi }i=1,...,n , where ξˆik is obtained by (26):
Minimizing

Σ_{i=1}^n K1((ξ̂ik − x)/hk) {Yi − β0 − β1(x − ξ̂ik)}²   (12)

with respect to β0 and β1 leads to f̂k(x) = β̂0(x) − Ȳ, where hk is the bandwidth
used for this smoothing step, and K1 is a symmetric probability density that serves as
kernel function. For the functional response case, the functions fmk are analogously
estimated by passing a local linear smoother through the data {ξˆik , ζ̂im }i=1,...,n that are
obtained by (26), i.e., minimizing

Σ_{i=1}^n K1((ξ̂ik − x)/hmk) {ζ̂im − β0 − β1(x − ξ̂ik)}²   (13)

with respect to β0 and β1, leading to f̂mk(x) = β̂0(x), where hmk is the bandwidth.
Then the fitted version of the functional additive model (7) with scalar response is
Ê(Y | X) = Ȳ + Σ_{k=1}^K f̂k(ξk).   (14)
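The local linear criterion (12) amounts to a weighted least squares problem at each evaluation point x. A sketch of a direct implementation, using a Gaussian kernel and a fixed bandwidth for simplicity (the paper selects hk by cross-validation, and K1 need only be a symmetric density); names are illustrative:

```python
import numpy as np

def local_linear(x0, xs, ys, h):
    """Minimize sum_i K((xs_i - x0)/h) * (ys_i - b0 - b1*(x0 - xs_i))^2
    over (b0, b1); the fitted intercept b0 estimates E(Y | score = x0)."""
    w = np.exp(-0.5 * ((xs - x0) / h) ** 2)          # Gaussian kernel weights
    A = np.stack([np.ones_like(xs), x0 - xs], axis=1)
    sw = np.sqrt(w)
    b0, b1 = np.linalg.lstsq(A * sw[:, None], ys * sw, rcond=None)[0]
    return b0

def fhat_k(x0, xi_k, Y, h):
    """Component estimate as in (12): fhat_k(x) = b0(x) - Ybar."""
    return local_linear(x0, xi_k, Y, h) - Y.mean()

# toy check with a known centered component function f_k(x) = x^2 - 4
rng = np.random.default_rng(2)
xi = rng.normal(0.0, 2.0, 2000)
Y = (xi ** 2 - 4.0) + rng.normal(0.0, 0.3, 2000)
est = fhat_k(1.0, xi, Y, h=0.3)        # true value f_k(1) = -3
```

Applying fhat_k to each of the K retained score columns, with bandwidths chosen by cross-validation, yields exactly the fitted model (14).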
To quantify the strength of the regression relationship, we use a global measure
similar to the coefficient of determination R2 in standard linear regression (Draper and
Smith, 1998), with population and sample versions
R² = 1 − Σ_{i=1}^n {Yi − E(Yi | Xi)}² / Σ_{i=1}^n (Yi − µY)²,   R̂² = 1 − Σ_{i=1}^n {Yi − Ê(Yi | Xi)}² / Σ_{i=1}^n (Yi − Ȳ)²,   (15)

where E(Yi | Xi) and Ê(Yi | Xi) for the ith subject are as in (7) and (14).
Analogously, the fitted FAM for functional responses, based on (8), is
Ê{Y(t) | X} = µ̂Y(t) + Σ_{m=1}^M Σ_{k=1}^K f̂mk(ξk) ψ̂m(t),   t ∈ T,   (16)

and the strength of the regression for this case can be measured by

R² = 1 − Σ_{i=1}^n ∫_T [Yi(t) − E{Yi(t) | Xi}]² dt / Σ_{i=1}^n ∫_T {Yi(t) − µY(t)}² dt,

R̂² = 1 − Σ_{i=1}^n Σ_{l=2}^{mi} [Vil − Ê{Yi(til) | Xi}]² (til − ti,l−1) / Σ_{i=1}^n Σ_{l=2}^{mi} {Vil − µ̂Y(til)}² (til − ti,l−1),   (17)

for population and sample versions, where E{Yi(t) | Xi} and Ê{Yi(t) | Xi} for the ith subject are as in (8) and (16), and we assume a dense grid of measurements til for each subject.
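Once fitted values from (14) are available, the sample version R̂² of (15) is a one-line computation; a small sketch, with illustrative names:

```python
import numpy as np

def r2_sample(Y, Y_hat):
    """Sample version of the quasi-R^2 in (15): 1 - SSE/SST, where the
    fitted values Y_hat come from the fitted FAM (14)."""
    Y = np.asarray(Y, dtype=float)
    resid = Y - np.asarray(Y_hat, dtype=float)
    return 1.0 - np.sum(resid ** 2) / np.sum((Y - Y.mean()) ** 2)

Y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_sample(Y, Y)                    # no residual error
mean_only = r2_sample(Y, np.full(4, 2.5))    # predicting Ybar for every subject
```

As in ordinary regression, a perfect fit gives 1 and the mean-only fit gives 0; the functional version (17) replaces the sums of squares by time-integrated squared deviations.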
5. THEORETICAL RESULTS
Establishing relevant asymptotic results requires studying the relationship between the
true and the estimated FPC scores ξik and ξˆik , ζim and ζ̂im , k = 1, . . . , K, m = 1, . . . , M,
since the estimates of the FAM component functions fk , respectively, fkm need to
be based on the estimated scores. Starting with known convergence results for the
estimated population components such as mean function, eigenfunction and eigenvalue
estimates in model (24) or (25) (see Yao et al., 2005a; Hall and Hosseini-Nasab, 2006),
a key step in the mathematical analysis is to establish exact upper bounds of |ξˆik − ξik |
and |ζ̂im − ζim |, where such bounds are i.i.d. in terms of i or do not depend on i,
i = 1, . . . , n (see Supplement for details). The convergence properties of the estimated
additive model components fk in (7) or fmk in (8) will follow from those upper bounds,
as these estimates are obtained by applying a nonparametric smoothing method to
{ξˆik , Yi } or {ξˆik , ζ̂im } for i = 1, . . . , n.
Asymptotic results are obtained for n → ∞, and also the number of included
components K, M needs to satisfy K = K(n) → ∞, M = M(n) → ∞, for a genuinely functional (infinite-dimensional) approach. The results concern consistency of
estimates (12), (13) and of predicted responses (14), (16), obtained for new predictor
processes. Details about the regularity assumptions can be found in the Appendix.
Theorem 1. Under assumptions (A1.1)-(A5), (C1.1), (C1.2), (C2.1) and (C2.3) (see Appendix), in the scalar response case, for all k ≥ 1 for which λj, j ≤ k, are eigenvalues of multiplicity 1,

f̂k(x) − fk(x) → 0 in probability,   (18)

for estimates (12). Under the additional assumptions (B1.1)-(B4) and (C2.2), in the functional response case, for all k, m for which λj, j ≤ k, and ρl, l ≤ m, are eigenvalues of multiplicity 1,

f̂km(x) − fkm(x) → 0 in probability,   (19)

for estimates (13).
Additional results on the rates of convergence of θ̃k(x) = |f̂k(x) − fk(x)| and ϑ̃mk(x) = |f̂mk(x) − fmk(x)| can be found in the Supplement, eq. (41). Next, we consider consistency of the predictions obtained by applying FAM.
Theorem 2. Under (A1.1)-(A4), (A6), (A7), (C1.1), (C1.2), (C2.1) and (C2.3), for the scalar response case,

Ê(Y | X) − E(Y | X) → 0 in probability,   (20)

where Ê(Y | X) = Ȳ + Σ_{k=1}^K f̂k(ξk), with f̂k as in (12). Under the additional assumptions (B1.1)-(B4), (B5), (B6) and (C2.2), it holds for the functional response case that for all t ∈ T,

Ê{Y(t) | X} − E{Y(t) | X} → 0 in probability,   (21)

where Ê{Y(t) | X} = µ̂Y(t) + Σ_{k=1}^K Σ_{m=1}^M f̂mk(ξk) ψ̂m(t), with f̂mk(ξk) as in (13).

Again, additional results on the rates of convergence of θn* = |Ê(Y | X) − E(Y | X)| and ϑn* = |Ê{Y(t) | X} − E{Y(t) | X}| are in eq. (42) of the Supplement.
6. SIMULATION STUDIES
Simulation studies were conducted to illustrate the empirical performance of the Functional Additive Models (FAM) (7) with scalar and also (8) with functional response.
For both cases, we generated 200 simulation runs, each consisting of a sample of n = 100
predictor trajectories Xi, with mean function µX(s) = s + sin(s), 0 ≤ s ≤ 10, and a covariance function derived from two eigenfunctions, φ1(s) = −cos(πs/10)/√5 and φ2(s) = sin(πs/10)/√5, 0 ≤ s ≤ 10. The corresponding eigenvalues were chosen as λ1 = 4, λ2 = 1 and λk = 0, k ≥ 3, and the measurement errors in (24) as ǫij ~ i.i.d. N(0, 0.5²). We consider two different underlying distributions of the predictor FPC scores: (i) ξik ~ N(0, λk), (ii) ξik = √λk (Zik − 4)/2, where Zik ~ i.i.d. Gamma(4, 1), which is a right-skewed distribution, k = 1, 2.
As for the design of number and spacing of the measurement locations at which
predictor trajectories were sampled, we considered both dense and also sparse designs,
in order to check the robustness of our methods against violations of the dense design
assumption. For the dense case, each predictor trajectory was sampled at locations that
were uniformly distributed over the domain [0, 10], where the number of measurements
was chosen separately and randomly for each predictor trajectory, by selecting a number
from {30, . . . , 40} with equal probability. For the more challenging sparse case, the
number of measurements was chosen from {3, . . . , 6} with equal probability. For each
simulation run, we generated 100 new predictor trajectories Xi∗ with measurements Uij∗, taken at the same time points as the measurements Uij of the 100 observed predictor trajectories, and 100 associated response variables, respectively response functions Yi∗.
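The predictor-generating mechanism just described can be sketched in a few lines; a minimal version for the dense design with Gaussian scores (case (i)), with all variable names illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
lam = np.array([4.0, 1.0])                       # lambda_1 = 4, lambda_2 = 1

def mu_X(s):  return s + np.sin(s)               # mean function on [0, 10]
def phi1(s):  return -np.cos(np.pi * s / 10.0) / np.sqrt(5.0)
def phi2(s):  return np.sin(np.pi * s / 10.0) / np.sqrt(5.0)

trajectories = []
for i in range(n):
    n_i = int(rng.integers(30, 41))              # dense design: 30..40 points
    s = np.sort(rng.uniform(0.0, 10.0, n_i))     # uniform measurement locations
    xi = rng.normal(0.0, np.sqrt(lam))           # Gaussian scores, case (i)
    U = (mu_X(s) + xi[0] * phi1(s) + xi[1] * phi2(s)
         + rng.normal(0.0, 0.5, n_i))            # additive measurement error
    trajectories.append((s, U))
```

For the sparse design, only the draw of n_i changes (3 to 6 points per trajectory); the skewed score scenario (ii) replaces the Gaussian draw by the shifted, rescaled Gamma variables.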
For the scalar response case, we generated responses Yi = Σ_{k=1}^2 fk(ξik) + εi, where fk are the true component functions that relate the FPC scores ξik of the predictor trajectories Xi to the responses Yi, with errors εi ~ i.i.d. N(0, 0.1), so that µY = 0 and σY² = 0.1. We compared the performance of fitting FAM (7) and the functional linear
regression model (1) under two situations: (a) the true functions fk(x) = x² − λk for k = 1, 2 were nonlinear; and (b) the true functions f1(x) = 2x and f2(x) = x/2 were linear. As we demonstrated in (6), the functions fk become lines through the origin for the case of the functional linear regression model. The representation (6) immediately suggests the estimates f̂k(x) = n⁻¹ Σ_{i=1}^n (Yi − Ȳ)(ξ̂ik − ξ̄·k) λ̂k⁻¹ x, where Ȳ = n⁻¹ Σ_{i=1}^n Yi, ξ̄·k = n⁻¹ Σ_{i=1}^n ξ̂ik, λ̂k is the estimate of the kth eigenvalue of the predictor process X, and the ξ̂ik are obtained by the PACE method (26).
For both scalar and functional response cases, FAM was implemented as described
in Section 4, including choice of the number of model components for the predictor processes by AIC, and local polynomial smoothing applied to estimate the functions fk ,
with one-point-leave-out cross-validation for automatic choice of the smoothing bandwidths. The functional linear model was fitted as described above, and we compared
the quality of the prediction of responses for the new subjects.
For the functional response case, we also compared the predictive performance of
FAM with that of functional linear regression. For the latter, we estimated the (in this case linear) component functions fkm by f̂km(x) = n⁻¹ Σ_{i=1}^n (ζ̂im − ζ̄·m)(ξ̂ik − ξ̄·k) λ̂k⁻¹ x. The simulation settings were the same as those for the scalar response case, and in addition functional response trajectories were generated as Yi(t) = µY(t) + ζi1 ψ1(t), where µY(t) = t + sin(t), ψ1(t) = −cos(πt/10)/√5, 0 ≤ t ≤ 10, and the ζi1 were the only non-zero FPC scores for the responses. To generate the scores ζi1, as before we considered nonlinear and linear scenarios for the component functions fkm, k = 1, 2, m = 1: (a) the true functions fk1(x) = x² − λk were nonlinear, k = 1, 2, i.e., ζi1 = Σ_{k=1}^2 (ξik² − λk); and (b) the true functions f11(x) = 2x, f21(x) = x/2 were linear, i.e., ζi1 = 2ξi1 + ξi2/2. The measurement errors εil in (25) were generated i.i.d. from N(0, 0.1). The designs for sampling the response trajectories were chosen in the same way as those for sampling the predictor trajectories, with both sparse and dense cases included.
To evaluate the prediction of new responses from future subjects, we generated 100
new predictor and response trajectories Xi∗ and Yi∗ , respectively, with measurements
Uij∗ and Vil∗ taken at the same time points as Uij and Vil , respectively. The quality of
the predicted responses was measured in terms of the relative prediction errors (RPE),

RPEi = (Yi∗ − Ŷi∗)² / (Yi∗)²,   RPEi,f = ∫_T (Yi∗(t) − Ŷi∗(t))² dt / ∫_T Yi∗(t)² dt,   (22)

for scalar and functional response cases, respectively.
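The two error measures in (22) are straightforward to compute; a sketch assuming the functional response is evaluated on a common grid (all names illustrative):

```python
import numpy as np

def _trapz(f, t):
    # trapezoidal rule, kept explicit for portability across numpy versions
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t)))

def rpe_scalar(y_true, y_pred):
    """RPE for a scalar response, first formula in (22)."""
    return (y_true - y_pred) ** 2 / y_true ** 2

def rpe_functional(t, y_true, y_pred):
    """RPE for a functional response: integrated squared prediction error
    relative to the integrated squared response trajectory."""
    return _trapz((y_true - y_pred) ** 2, t) / _trapz(y_true ** 2, t)

t = np.linspace(0.0, 10.0, 201)
y = 2.0 + np.sin(t)                       # a toy response trajectory
r = rpe_functional(t, y, y + 0.1)         # prediction off by a constant 0.1
```

Summaries such as the median and quartiles of these per-subject errors over the 100 new subjects give the tabulated comparisons between FAM and the functional linear model.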
The results for the relative prediction errors when the predictor FPC scores are
normal are shown in Table 1 and suggest that FAM leads to similar prediction errors
in the sparse and somewhat larger prediction errors in the dense case, as compared to
the functional linear approach when the true functions fk or fkm are linear, while FAM
improves upon functional linear regression when the underlying component functions
are nonlinear. This holds equally for scalar and functional responses.
Similar results emerge for the case of right skewed distributions (Table 2). Again
the median losses when using FAM for the case of an underlying linear model are
small in the sparse case, while they are now more noticeable in the dense case. The
improvements obtained when using FAM for the nonlinear case for both dense and
sparse designs are found to persist for the situation with skewed distributions. We
note that the performance of FAM appears to be less stable as compared to the linear
model in the tails of the error distribution, as evidenced by occasional relatively large
values in the 75th percentiles of relative prediction errors. Our conclusion is that, in
situations where the signal is not too weak, the losses when using FAM in the linear
case are relatively small, while FAM performs better than the linear model in situations
when the underlying regression relationship is nonlinear.
7. APPLICATION TO GENE EXPRESSION TIME COURSE DATA
We apply our methods to gene expression profile data where both predictors and responses are functional. Arbeitman et al. (2002) obtained developmental gene expression profiles over the entire lifespan of Drosophila, and we apply functional regression
to study the relation of the expression profiles for different developmental periods.
The genes in the biologically identified group of n = 21 “transiently expressed zygotic
genes” show early peaks in expression during the embryonal phase and are active in the
cellularization phase of the embryo. For this well-defined gene group we study how the
expression in the pupa or metamorphosis phase depends on that in the embryo phase.
The data consist of 31 measurements during the embryo phase, i.e., the predictor process, and 18 measurements during the pupa phase, i.e., the response process. For one
of the genes, data were only available for the embryo phase, and the data for this
gene therefore were only used to carry out the FPC analysis for predictor processes.
The linearly interpolated gene expression trajectories for both predictor and response
processes, as well as their mean functions, are shown in Figure 1, confirming an early
peak in the predictor trajectories and displaying quite a bit of variation between genes.
The AIC method selected three components for predictor processes and four for
response processes. The corresponding eigenfunctions are displayed in Figure 2; see
legend for fraction of variation explained by the corresponding eigenvalues. Of interest
are the pairwise scatterplots of all pairings of response FPC scores ζ̂i1 , ζ̂i2 , ζ̂i3 , ζ̂i4 versus
the predictor FPC scores ξˆi1 , ξˆi2 , ξˆi3, as shown in Figure 3. Judging from these scatterplots, there exist clear relationships between response and predictor scores; while
several of these appear close to linear, for others a linear fit is not good, pointing to the
presence of nonlinear relationships. To interpret these relations, one needs to take the
shape of both predictor and response eigenfunctions into account. For example, the
negative relationship between ζ̂i1 and ξˆi1 implies that sharper initial peaks and lower
late embryonal gene expression is coupled with an overall lower pupa phase expression,
in the sense that the contribution of the first eigenfunction in the response is reduced.
Interpretation of FAM can be aided by a “principal response plot”, displayed in
Figure 4, where one varies one predictor FPC score, while keeping the others fixed at
level 0. This can be visualized by looking at a set of predictor functions moving in a
certain direction “away” from the mean function of the predictor process, where the
direction depends on the chosen eigenfunction; each of these predictor functions, when
plugged into FAM, then generates a corresponding response function. The set of these
response functions is then jointly visualized with the set of predictor functions in a way
that the corresponding predictor/response pairs can be easily identified. This device
can be employed for each predictor score. Since these act independently from each
other on the response, displaying a series of such plots for all relevant predictor scores
then provides a graphical representation of FAM.
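The construction of a principal response plot can be sketched numerically. Everything below is synthetic: the mean function mu_Y, the eigenfunctions psi and the additive component functions f_hat are illustrative stand-ins for the estimates μ̂Y, ψ̂m and f̂km that an actual FAM fit would produce; they are not taken from the paper.

```python
import numpy as np

# Sketch of the "principal response plot" computation: vary one predictor FPC
# score, keep the others fixed at 0, and map through the (hypothetical) fitted
# additive components into the response expansion.
t = np.linspace(0.0, 1.0, 101)                  # response time grid
mu_Y = np.sin(2 * np.pi * t)                    # hypothetical response mean
psi = np.vstack([np.sqrt(2) * np.cos((m + 1) * np.pi * t) for m in range(2)])

def f_hat(k, m, x):
    # Hypothetical additive component f_km(x): effect of the k-th predictor
    # score on the m-th response score (stands in for a smoothed fit).
    return (0.5 ** (k + m)) * np.tanh(x)

def principal_response(k_vary, alpha, K=3, M=2):
    # Predicted response curve when score k_vary is set to alpha, others to 0.
    curve = mu_Y.copy()
    for m in range(M):
        effect = sum(f_hat(k, m, alpha if k == k_vary else 0.0)
                     for k in range(K))
        curve += effect * psi[m]
    return curve

curves = {a: principal_response(k_vary=0, alpha=a) for a in (-1.0, 0.0, 1.0)}
```

Plotting the three curves against t, alongside the perturbed predictor curves μ̂X ± φ̂k, reproduces one column of such a display.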
The left panels of Figure 4 indicate that shifts in the levels on the right-hand side of the peak of the predictor curves are associated with broad shifts up or down in responses, the middle panels that the size of peak expression in predictors is associated with amplitude shifts in the right half of response curves, and the right panels that combined time and amplitude shifts of predictor peaks are associated with strong amplitude shifts in responses in a nonlinear way. The principal response plots thus characterize the response changes induced by the modes of variation of the predictors.
For proper interpretation, it is helpful to note that the actual sample variation of the
predictor scores in the direction of each eigenfunction, as depicted in the top panels
of Figure 4, depends on the size of the respective eigenvalue, which corresponds to
the variance of the FPC scores. This variation accordingly is much smaller for the
second or third eigenfunction, as compared to the first eigenfunction, and therefore the
system’s response function changes, as depicted in the lower panels, will be realized on
increasingly smaller scales for real functional data, as the order of the eigenfunction
increases.
The increased bias when fitting functional linear regression to these data is evident
in the observed leave-one-subject-out relative prediction errors, i.e., the cross-validated
observed version of (22), denoted by RPE(−i),f ; mean and percentiles are listed in
Table 3. Median and mean of the errors are larger for the functional linear model in
comparison with FAM. This finding is in line with the increase in functional R2 (17)
that is obtained for FAM as compared to the functional linear regression model.
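The leave-one-subject-out comparison can be sketched as below. The exact form of the relative prediction error (22) is not reproduced in this excerpt, so the squared L2 prediction error normalized by the squared L2 norm of the observed curve is used as a stand-in, and a trivial "predict the training mean" model plays the role of the fitted regression.

```python
import numpy as np

# Leave-one-subject-out relative prediction errors for a functional response
# observed on a common grid (normalization is an assumption, see lead-in).
def rpe_loo(Y, fit_and_predict):
    """Y: (n, T) observed response curves; fit_and_predict(train_idx, i)
    returns the prediction for subject i from a model fit without subject i."""
    n = Y.shape[0]
    errs = []
    for i in range(n):
        train = [j for j in range(n) if j != i]
        y_hat = fit_and_predict(train, i)
        errs.append(np.sum((Y[i] - y_hat) ** 2) / np.sum(Y[i] ** 2))
    return np.array(errs)

# Toy check with a model that predicts the training-sample mean curve.
rng = np.random.default_rng(0)
Y = 1.0 + 0.1 * rng.standard_normal((20, 50))
mean_model = lambda train, i: Y[train].mean(axis=0)
errs = rpe_loo(Y, mean_model)
```

Comparing the median of such errors across two candidate models mirrors the FAM-versus-linear comparison reported in Table 3.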
8. CONCLUDING REMARKS
The Functional Additive Model (FAM) strikes a fine balance between greatly enhancing
flexibility, as compared to the functional linear model, and preventing the curse of
dimension incurred by a fully nonparametric approach, due to its sensible structural
constraints. This is analogous to the situation in ordinary multiple regression, where
additive models do not suffer from the curse of dimension in the way unstructured
nonparametric models do. As a reviewer has pointed out, one could also imagine
other additive regression models geared towards specific regression relations, where the
predictors correspond to suitably chosen functionals of the predictor processes other
than the functional principal component scores.
As we have demonstrated, specific advantages of using functional principal component scores in an additive regression model are that (1) such a model is a strict generalization of the functional linear model, which emerges as a special case; (2) asymptotic
consistency results can still be derived under truly nonparametric (smoothness) assumptions for all component functions, including mean and covariance functions; (3)
implementing the resulting additive model requires not more than applying simple
smoothing steps. The necessary smoothing parameters for the component functions
can be easily selected in a data-adaptive way. FAM does not impose a largely increased computational burden over the established functional linear regression model,
and therefore the added flexibility comes at low additional computational cost. Judging from our simulations, the loss in efficiency against the functional linear regression
model when the underlying functional regression is truly linear is quite limited. On
the other hand, the increased flexibility of FAM can lead to substantial gains.
APPENDIX
A.1 Eigenrepresentations and Estimating Functional Principal Component Scores
The eigenfunctions φk, k = 1, 2, ..., for the representation of predictor processes X are the solutions of the equations (A_GX f)(s) = λ f(s) on the space of square integrable functions f ∈ L²(S) under constraints, where the auto-covariance operator (A_GX f)(t) = ∫ f(s) GX(s,t) ds is a compact linear integral operator of Hilbert-Schmidt type (Ash and Gardner, 1975). The constraints correspond to orthonormality of the eigenfunctions, i.e., ∫ φj(s)φk(s) ds = δjk, where δjk = 1 if j = k and δjk = 0 if j ≠ k. The eigenfunctions φj are ordered according to the size of the corresponding eigenvalues, λ1 ≥ λ2 ≥ .... Analogously, eigenfunctions and eigenvalues of the response process Y are denoted by ψm and ρm. We assume that all functions {μX, φj}, {μY, ψk} are smooth (twice continuously differentiable).
One then has well-known representations GX(s1, s2) = Σ_k λk φk(s1)φk(s2) and GY(t1, t2) = Σ_m ρm ψm(t1)ψm(t2) of the covariance functions of X and Y, as well as the Karhunen-Loève expansions for processes X and Y,

X(s) = μX(s) + Σ_{j=1}^∞ ξj φj(s),    Y(t) = μY(t) + Σ_{k=1}^∞ ζk ψk(t).    (23)
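The covariance representations above are easy to verify numerically by simulating truncated Karhunen-Loève expansions and comparing the sample covariance with Σk λk φk(s1)φk(s2). The Fourier basis, eigenvalues and mean below are illustrative choices, not taken from the paper.

```python
import numpy as np

# Simulate X(s) = mu_X(s) + sum_j xi_j phi_j(s) with independent scores,
# E(xi_j) = 0 and var(xi_j) = lambda_j, then check the covariance identity.
rng = np.random.default_rng(1)
s = np.linspace(0.0, 1.0, 60)
lam = np.array([4.0, 2.0, 1.0])                 # lambda_1 >= lambda_2 >= ...
phi = np.vstack([np.sqrt(2) * np.sin((j + 1) * np.pi * s) for j in range(3)])
mu_X = 2 * s

n = 8000
xi = rng.standard_normal((n, 3)) * np.sqrt(lam) # FPC scores with var lambda_j
X = mu_X + xi @ phi                             # (n, 60) sample paths

# Model covariance sum_k lambda_k phi_k(s1) phi_k(s2) vs. empirical covariance.
G_model = (lam[:, None, None] * phi[:, :, None] * phi[:, None, :]).sum(axis=0)
G_sample = np.cov(X, rowvar=False)
rel_err = np.linalg.norm(G_sample - G_model) / np.linalg.norm(G_model)
```

The relative Frobenius error shrinks at the usual root-n rate; the Riemann sums of φj φk over the grid also confirm orthonormality.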
Since {φk, k = 1, 2, ...} and {ψm, m = 1, 2, ...} form orthonormal bases of the respective spaces of square integrable functions, it follows that the regression parameter functions can also be represented in these bases, i.e., there exist coefficients βk, βkm such that β(s) = Σ_{k=1}^∞ βk φk(s), resp., β(s, t) = Σ_{k=1}^∞ Σ_{m=1}^∞ βkm φk(s)ψm(t).
According to (23), we may represent the measurements for predictor trajectories in (1) and both predictor and response trajectories in (2) as

Uij = Xi(sij) + ǫij = μX(sij) + Σ_{k=1}^∞ ξik φk(sij) + ǫij,   sij ∈ S, 1 ≤ i ≤ n, 1 ≤ j ≤ ni,    (24)

Vil = Yi(til) + εil = μY(til) + Σ_{m=1}^∞ ζim ψm(til) + εil,   til ∈ T, 1 ≤ i ≤ n, 1 ≤ l ≤ mi.    (25)
More specifically, writing Ui = (Ui1, ..., Uini)^T, μXi = (μX(si1), ..., μX(sini))^T, and φik = (φk(si1), ..., φk(sini))^T, the best linear prediction for ξik is λk φik^T ΣUi^{-1}(Ui − μXi), where the (j, l) entry of the ni × ni matrix ΣUi is (ΣUi)j,l = GX(sij, sil) + σX² δjl, with δjl = 1 if j = l, and δjl = 0 if j ≠ l. Then the estimates for the scores ξik are obtained by substituting estimates for μXi, λk, φik and ΣUi (see Supplement), obtained from the entire data ensemble, leading to

ξ̂ik = λ̂k φ̂ik^T Σ̂Ui^{-1}(Ui − μ̂Xi),    (26)

where the (j, l) element of Σ̂Ui is (Σ̂Ui)j,l = ĜX(sij, sil) + σ̂X² δjl. We note that it follows from results in Müller (2005) that as designs become dense, these best linear (PACE) estimates ξ̂ik, ζ̂im (26) converge to those obtained by the more traditional integration-based estimates

ξ̂ik^I = Σ_{j=2}^{ni} (Uij − μ̂X(sij)) φ̂k(sij)(sij − si,j−1),    ζ̂im^I = Σ_{l=2}^{mi} (Vil − μ̂Y(til)) ψ̂m(til)(til − ti,l−1),    (27)

which are motivated by the definition of the FPC scores as inner products, ξik = ∫ {Xi(s) − μX(s)}φk(s) ds, ζim = ∫ {Yi(t) − μY(t)}ψm(t) dt. Therefore, the PACE estimates and the estimates based on integral approximations can be considered equivalent in the dense design case that we consider here.
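A minimal numerical sketch of the integration-based estimate in (27), on a dense design with a noiseless trajectory and with the true mean and eigenfunction standing in for their smoothed estimates μ̂X and φ̂k:

```python
import numpy as np

# Riemann-sum estimate of the first FPC score, checked against the truth.
s = np.sort(np.random.default_rng(2).uniform(0.0, 1.0, 400))  # dense design
mu_X = np.zeros_like(s)
phi1 = np.sqrt(2) * np.sin(np.pi * s)
xi_true = 1.7                                   # the score to be recovered
U = mu_X + xi_true * phi1                       # noiseless dense trajectory

# xi_hat = sum_{j>=2} (U_j - mu_hat(s_j)) * phi_hat(s_j) * (s_j - s_{j-1})
xi_hat = np.sum((U[1:] - mu_X[1:]) * phi1[1:] * np.diff(s))
```

As the maximal spacing Δ*X shrinks, the Riemann sum converges to the inner product ∫{X(s) − μX(s)}φ1(s) ds = ξ1.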
A.2 Estimation Procedures

To obtain the FPC scores for predictor and response processes (in case of a functional response), we adopt the PACE (Principal Analysis by Conditional Expectation) methodology (Yao et al., 2005a). Estimating the predictor mean function μX by local linear scatterplot smoothers, one minimizes

Σ_{i=1}^n Σ_{j=1}^{ni} K1((sij − s)/bX) {Uij − β0X − β1X(s − sij)}²    (28)
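The minimization in (28) is a small weighted least squares problem at each output point s. The sketch below uses a Gaussian kernel for K1 (an assumption; (28) only requires a smooth symmetric density) and solves the two-parameter normal equations directly.

```python
import numpy as np

# Local linear scatterplot smoother: at each s, minimize the kernel-weighted
# least squares criterion in (beta0, beta1) and return beta0_hat = mu_hat(s).
def local_linear(s_obs, u_obs, s_out, bw):
    fit = np.empty_like(s_out, dtype=float)
    for idx, s0 in enumerate(s_out):
        d = s_obs - s0
        w = np.exp(-0.5 * (d / bw) ** 2)           # kernel weights
        X = np.column_stack([np.ones_like(d), d])  # design for beta0 + beta1*d
        WX = w[:, None] * X
        beta = np.linalg.solve(X.T @ WX, WX.T @ u_obs)
        fit[idx] = beta[0]                         # local intercept
    return fit

rng = np.random.default_rng(3)
s_obs = rng.uniform(0, 1, 500)
u_obs = 1.0 + 2.0 * s_obs + 0.05 * rng.standard_normal(500)
s_out = np.linspace(0.1, 0.9, 9)
mu_hat = local_linear(s_obs, u_obs, s_out, bw=0.1)
```

A useful sanity check is that a local linear fit reproduces an exactly linear mean function up to the noise level, regardless of the bandwidth.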
with respect to β0X, β1X, to obtain μ̂X(s) = β̂0X(s). The kernel K1 is assumed to be a smooth symmetric density function and bX is a bandwidth. Estimating the covariance function GX(s1, s2) of predictor processes X, define GXi(sij1, sij2) = (Uij1 − μ̂X(sij1))(Uij2 − μ̂X(sij2)), and the local linear surface smoother through minimizing

Σ_{i=1}^n Σ_{1≤j1≠j2≤ni} K2((sij1 − s1)/hX, (sij2 − s2)/hX) {GXi(sij1, sij2) − f(βX, (s1, s2), (sij1, sij2))}²,    (29)

where f(βX, (s1, s2), (sij1, sij2)) = β0X + β11X(s1 − sij1) + β12X(s2 − sij2), with respect to βX = (β0X, β11X, β12X)^T, yielding ĜX(s1, s2) = β̂0X(s1, s2). Here, the kernel K2 is a two-dimensional smooth density with zero mean and finite covariances and hX is a bandwidth. An essential feature is the omission of the diagonal elements j1 = j2, which are contaminated with the measurement errors.
For the estimation of the variance of the measurement error σX², we fit a local quadratic component orthogonal to the diagonal of GX, and a local linear component in the direction of the diagonal. Denote the diagonal of the resulting surface estimate by G̃X(s), and a local linear smoother focusing on diagonal values {GX(s, s) + σX²} by V̂X(s), using bandwidth h*X. Let a0 = inf{s: s ∈ S}, b0 = sup{s: s ∈ S}, |S| = b0 − a0, S1 = [a0 + |S|/4, b0 − |S|/4]. The estimate of σX² is

σ̂X² = 2 ∫_{S1} {V̂X(s) − G̃X(s)} ds / |S|,    (30)

if σ̂X² > 0, and σ̂X² = 0 otherwise. Estimates of eigenvalues/eigenfunctions {λk, φk}k≥1 are obtained as numerical solutions {λ̂k, φ̂k}k≥1 of suitably discretized eigenequations,

∫_S ĜX(s1, s2) φ̂k(s2) ds2 = λ̂k φ̂k(s1),    (31)

with orthonormality constraints on {φ̂k}k≥1. These estimates are unique up to a sign change, and a projection on the space of positive definite covariance surfaces is obtained by simply omitting components with non-positive eigenvalues in the final representation. Analogous procedures are applied to response processes Y.
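The discretized eigenequations (31) amount to a symmetric matrix eigenproblem once the integral is replaced by a Riemann sum on an equispaced grid: (Ĝ · Δs) v = λ v, with eigenvectors rescaled so that Σj φ̂k(sj)² Δs = 1. The covariance surface below is built from a known basis so the recovered eigenelements can be checked; it is an illustration, not the paper's implementation.

```python
import numpy as np

# Solve the discretized eigenequation for a covariance surface with known
# eigenstructure, then undo the sign ambiguity and compare with the truth.
s = np.linspace(0, 1, 201)
ds = s[1] - s[0]
phi_true = np.vstack([np.sqrt(2) * np.sin((k + 1) * np.pi * s) for k in range(2)])
lam_true = np.array([3.0, 1.0])
G = (lam_true[:, None, None] * phi_true[:, :, None] * phi_true[:, None, :]).sum(axis=0)

evals, evecs = np.linalg.eigh(G * ds)                 # symmetric eigenproblem
order = np.argsort(evals)[::-1]
lam_hat = evals[order][:2]
phi_hat = (evecs[:, order][:, :2] / np.sqrt(ds)).T    # L2-normalized on the grid
signs = np.sign(np.sum(phi_hat * phi_true, axis=1))   # estimates unique up to sign
phi_hat = phi_hat * signs[:, None]
```

On this sine basis the discrete orthogonality is exact, so the eigenvalues and eigenfunctions are recovered essentially to machine precision.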
A.3 Assumptions

We compile the necessary assumptions for establishing theoretical properties. Recall that bX = bX(n), hX = hX(n), h*X = h*X(n), bY = bY(n), hY = hY(n), h*Y = h*Y(n) are the bandwidths for estimating μ̂X and μ̂Y in (28), ĜX and ĜY in (29), and V̂X and V̂Y in (30). As the number of subjects n → ∞, we require:

(A1.1) bX → 0, h*X → 0, nbX^4 → ∞, nh*X^4 → ∞, nbX^6 < ∞, and nh*X^6 < ∞.

(A1.2) hX → 0, nhX^6 → ∞, and nhX^8 < ∞.

For FAM with functional responses, analogous requirements are

(B1.1) bY → 0, h*Y → 0, nbY^4 → ∞, nh*Y^4 → ∞, nbY^6 < ∞, and nh*Y^6 < ∞.

(B1.2) hY → 0, nhY^6 → ∞, and nhY^8 < ∞.

The time points {sij}i=1,...,n; j=1,...,ni here are considered deterministic. Write the sorted time points across all subjects as a0 ≤ s(1) ≤ ... ≤ s(Nn) ≤ b0, and ΔX = max{s(k) − s(k−1): k = 1, ..., Nn + 1}, where Nn = Σ_{i=1}^n ni, S = [a0, b0], s(0) = a0, and s(Nn+1) = b0. For the ith subject, suppose that the time points sij have been ordered non-decreasingly. Let ΔiX = max{sij − si,j−1: j = 1, ..., ni + 1}, Δ*X = max{ΔiX: i = 1, ..., n}, where si0 = a0 and si,ni+1 = b0, and n̄x = n^{-1} Σ_{i=1}^n ni. To obtain uniform consistency, we require both the pooled data across all subjects and also the data from each subject to be dense in the time domain S. Assume that

(A2.1) ΔX = O(min{n^{-1/2} bX^{-1}, n^{-1/4} hX^{-1}, n^{-1/2} h*X^{-1}}).

(A2.2) n̄x → ∞, max{ni: i = 1, ..., n} ≤ C n̄x for some C > 0, and Δ*X = O(1/n̄x).

For the FAM with response process Y and observations {til, Vil}, l = 1, ..., mi, i = 1, ..., n, we analogously define the quantities ΔY, ΔiY, Δ*Y, m̄y and assume that

(B2.1) ΔY = O(min{n^{-1/2} bY^{-1}, n^{-1/4} hY^{-1}, n^{-1/2} h*Y^{-1}}).

(B2.2) m̄y → ∞, max{mi: i = 1, ..., n} ≤ C m̄y for some C > 0, and Δ*Y = O(1/m̄y).
Denote by Ui(s) ~ i.i.d. U(s) the distributions that generate Uij for the ith subject at s = sij. Analogously, let Vi(t) ~ i.i.d. V(t) denote the distributions that yield Vil for the ith subject at til. Assume that the fourth moments of U(s) and V(t) are uniformly bounded for all s ∈ S and t ∈ T, respectively; i.e.,

(A3) sup_{s∈S} E[U⁴(s)] < ∞.

(B3) sup_{t∈T} E[V⁴(t)] < ∞.
Denoting the Fourier transforms of kernels K1 and K2 by κ1(t) = ∫ e^{-iut} K1(u) du and κ2(t, s) = ∫∫ e^{-(iut+ivs)} K2(u, v) du dv, we require

(C1.1) κ1(t) is absolutely integrable, i.e., ∫ |κ1(t)| dt < ∞, and K1 is Lipschitz continuous on its compact support.

(C1.2) κ2(t, s) is absolutely integrable, i.e., ∫∫ |κ2(t, s)| dt ds < ∞.

In the sequel, let gu1(u; s), gu2(u1, u2; s1, s2), gv1(v; t) and gv2(v1, v2; t1, t2) denote the density functions of U(s), (U(s1), U(s2)), V(t), and (V(t1), V(t2)), respectively, and pk and qm the densities of ξk and ζm. It is assumed that these density functions satisfy the following regularity conditions:

(C2.1) (d²/ds²)gu1(u; s) exists and is uniformly continuous on ℜ × S; (d²/(ds1^ℓ1 ds2^ℓ2))gu2(u1, u2; s1, s2) exists and is uniformly continuous on ℜ² × S², for ℓ1 + ℓ2 = 2, 0 ≤ ℓ1, ℓ2 ≤ 2.

(C2.2) (d²/dt²)gv1(v; t) exists and is uniformly continuous on ℜ × T; (d²/(dt1^ℓ1 dt2^ℓ2))gv2(v1, v2; t1, t2) exists and is uniformly continuous on ℜ² × T², for ℓ1 + ℓ2 = 2, 0 ≤ ℓ1, ℓ2 ≤ 2.

(C2.3) The second derivative pk^(2)(x) exists and is continuous on ℜ.

Let ‖f‖∞ = sup_{t∈A} |f(t)| for an arbitrary function f with support A, and ‖g‖ = (∫_A g²(t) dt)^{1/2} for any g ∈ L²(A). The following assumptions are needed for Theorem 1.
(A4) E(‖X′‖²∞) < ∞, E(‖X′²‖²∞) = o(n̄x) and E(ξk⁴) < ∞ for any fixed k.

(B4) E(‖Y′‖²∞) < ∞, E(‖Y′²‖²∞) = o(m̄y) and E(ζm⁴) < ∞ for any fixed m.
(A5) n^{-1} hX^{-4} hk^{-2} → 0, n^{-1} bX^{-2} hk^{-2} → 0 and n̄x^{-1} hk^{-2} → 0.
For brevity, denote “Σ_{k=1}^K” by “Σ_k”, “Σ_{m=1}^M” by “Σ_m”, “Σ_{k=1}^K Σ_{m=1}^M” by “Σ_{k,m}”, “max_{k=1,...,K}” by “max_k” and “max_{m=1,...,M}” by “max_m” in the following assumptions (A6) and (B5), which are needed for Theorem 2 and are assumed to hold for all fixed ξk, K ≤ K0,

(A6) Σ_k pk(ξk) hk^{-1} = o{min(n^{1/2} bX, n̄x^{1/2})},  Σ_k pk(ξk) πk^x hk^{-1} = o(n^{1/2} hX²),  max_k ‖φk φk′‖∞ = O(n̄x),  Σ_k E(ξk⁴) < ∞.
(B5) For any fixed ξk, t ∈ T, K ≤ K0, M ≤ M0,

Σ_{k,m} pk(ξk)|ψm(t)| hmk^{-1} = o{min(n^{1/2} bX, n̄x^{1/2})},  Σ_{k,m} pk(ξk)|ψm(t)| = o{min(n^{1/2} bY, m̄y^{1/2})},
Σ_{k,m} pk(ξk) πk^x |ψm(t)| hmk^{-1} = o(n^{1/2} hX²),  Σ_{k,m} pk(ξk) πm^y |ψm(t)| = o(n^{1/2} hY²),
Σ_{k,m} πm^y |fmk(ξk)| = o(n^{1/2} hY²),  max_k ‖φk φk′‖∞ = O(n̄x),  max_m ‖ψm ψm′‖∞ = O(m̄y),  Σ_m E(ζm⁴) < ∞.
To guarantee the consistency of predictions, we also require that for fixed ξk and t ∈ T,

(A7) Σ_{k=1}^K [ |fk″(ξk)| hk² + n^{-1/2} {var(Y|ξk)}^{1/2} pk^{-1/2}(ξk) hk^{-1/2} ] → 0.

(B6) Σ_{k=1}^K Σ_{m=1}^M [ |fmk″(ξk) ψm(t)| hmk² + n^{-1/2} {var(ζm|ξk)}^{1/2} pk^{-1/2}(ξk) |ψm(t)| hmk^{-1/2} ] → 0.
References
Ash, R. B., and Gardner, M. F. (1975), Topics in Stochastic Processes, New York: Academic Press.
Billingsley, P. (1995), Probability and measure, 2nd edn, New York: John Wiley.
Cai, T., and Hall, P. (2006), “Prediction in functional linear regression,” The Annals
of Statistics, 34, 2159–2179.
Cardot, H., Ferraty, F., Mas, A., and Sarda, P. (2003), “Testing hypotheses in the
functional linear model,” Scandinavian Journal of Statistics, 30, 241–255.
Cardot, H., and Sarda, P. (2005), “Estimation in generalized linear models for functional data via penalized likelihood,” Journal of Multivariate Analysis, 92, 24–41.
Cuevas, A., Febrero, M., and Fraiman, R. (2002), “Linear functional regression: the
case of fixed design and functional response,” Canadian Journal of Statistics, 30, 285–
300.
Draper, N. R., and Smith, H. (1998), Applied Regression Analysis, New York: Wiley.
Escabias, M., Aguilera, A. M., and Valderrama, M. J. (2004), “Principal component
estimation of functional logistic regression: discussion of two different approaches,”
Journal of Nonparametric Statistics, 16, 365 – 384.
Fan, J. Q., Yao, Q. W., and Cai, Z. W. (2003), “Adaptive varying-coefficient linear
models,” Journal of the Royal Statistical Society, Series B, 65, 57 – 80.
Fan, J., and Zhang, J. T. (2000), “Two-step Estimation of Functional Linear Models
with Applications to Longitudinal Data,” Journal of the Royal Statistical Society,
Series B, 62, 303–322.
Foutz, N., and Jank, W. (2008), “The wisdom of crowds: pre-release forecasting via
functional shape analysis of the online virtual stock market,” manuscript.
Hall, P., and Horowitz, J. L. (2007), “Methodology and convergence rates for functional
linear regression,” The Annals of Statistics, 35, 70–91.
Hall, P., and Hosseini-Nasab, M. (2006), “On properties of functional principal components analysis,” Journal of the Royal Statistical Society, Series B, 68, 109–126.
Hastie, T. J., and Tibshirani, R. J. (1990), Generalized Additive Models, London: Chapman and Hall.
He, G., Müller, H. G., and Wang, J. L. (2000), “Extending correlation and regression from multivariate to functional data,” in Asymptotics in Statistics and Probability, ed. M. L. Puri, VSP International Science Publishers, pp. 301–315.
He, G., Müller, H. G., and Wang, J. L. (2003), “Functional canonical analysis for
square integrable stochastic processes,” Journal of Multivariate Analysis, 85, 54 –
77.
James, G. (2002), “Generalized linear models with functional predictors,” Journal of
the Royal Statistical Society, Series B, 64, 411–432.
James, G., and Silverman, B. (2005), “Functional adaptive model estimation,” Journal
of the American Statistical Association, 100, 565–576.
Liu, B., and Müller, H. G. (2008), “Functional data analysis for sparse auction data,” in Statistical Methods in eCommerce Research, eds. W. Jank and G. Shmueli, New York: Wiley.
Malfait, N., and Ramsay, J. O. (2003), “The historical functional linear model,” Canadian Journal of Statistics, 31, 115–128.
Mammen, E., and Park, B. U. (2005), “Bandwidth selection for smooth backfitting in
additive models,” The Annals of Statistics, 33, 1260–1294.
Morris, J., Vannucci, M., Brown, P., and Carroll, R. (2003), “Wavelet-based nonparametric modeling of hierarchical functions in colon carcinogenesis (with discussion),”
Journal of the American Statistical Association, 98, 573–594.
Müller, H. G. (2005), “Functional modelling and classification of longitudinal data,”
Scandinavian Journal of Statistics, 32, 223–240.
Müller, H. G., and Stadtmüller, U. (2005), “Generalized functional linear models,” The
Annals of Statistics, 33, 774–805.
Ramsay, J. O., and Dalzell, C. J. (1991), “Some tools for functional data analysis,”
Journal of the Royal Statistical Society, Series B, 53, 539–572.
Ramsay, J. O., and Silverman, B. W. (2005), Functional Data Analysis, 2nd edn, New
York: Springer.
Rice, J., and Silverman, B. W. (1991), “Estimating the mean and covariance structure
nonparametrically when the data are curves,” Journal of the Royal Statistical Society,
Series B, 53, 233–243.
Sood, A., James, G., and Tellis, G. (2008), “Functional regression: a new model for
predicting market penetration of new products,” Marketing Science, to appear.
Yao, F., and Lee, T. C. M. (2006), “Penalized spline models for functional principal
component analysis,” Journal of the Royal Statistical Society, Series B, 68, 3–25.
Yao, F., Müller, H. G., Clifford, A. J., Dueker, S. R., Follett, J. R., Lin, Y., Buchholz, B. A., and Vogel, J. S. (2003), “Shrinkage estimation for functional principal component scores with application to the population kinetics of plasma folate,” Biometrics, 59, 676–685.
Yao, F., Müller, H. G., and Wang, J. L. (2005a), “Functional data analysis for sparse
longitudinal data,” Journal of the American Statistical Association, 100, 577–590.
Yao, F., Müller, H. G., and Wang, J. L. (2005b), “Functional linear regression analysis
for longitudinal data,” The Annals of Statistics, 33, 2873–2903.
Table 1: Simulation results for the comparison of predictions obtained by the Functional Additive Model (FAM) and functional linear regression (LIN), for models with
scalar response (see (7) for the FAM version, (1) for the linear version) and with functional response ((8) for the FAM version, (2) for the linear version), for both dense and
sparse designs. The true component functions are linear (LF) and nonlinear (NLF),
and the true FPC scores of the predictor process are generated from normal distributions, as described in Section 6. Simulations were based on 400 Monte Carlo runs with
n = 100 predictors and responses per sample. Shown in the table are the Monte Carlo
estimates of the 25th, 50th and 75th percentiles of the relative prediction error, RPE
(22).
                                  True = NLF               True = LF
Design   Response     Model    25th    50th    75th    25th    50th    75th
Dense    Scalar       FAM     .0683   .2983   1.820   .0075   .0431   .3024
                      LIN     .6458   1.068   1.870   .0102   .0335   .1362
         Functional   FAM     .0025   .0086   .0279   .0005   .0012   .0031
                      LIN     .0109   .0363   .0705   .0004   .0009   .0019
Sparse   Scalar       FAM     .1124   .4437   2.884   .0176   .1066   .7149
                      LIN     .6614   1.071   1.822   .0270   .1133   .5378
         Functional   FAM     .0066   .0156   .0432   .0023   .0042   .0090
                      LIN     .0138   .0389   .0737   .0023   .0044   .0094
Table 2: Simulation results obtained for the same settings as those in Table 1, except
that the true FPC scores of the predictor process are generated from right skewed
distributions related to Gamma(4,1), as described in Section 6.
                                  True = NLF               True = LF
Design   Response     Model    25th    50th    75th    25th    50th    75th
Dense    Scalar       FAM     .0163   .0885   1.035   .0014   .0066   .0410
                      LIN     .0396   .1827   3.383   .0008   .0030   .0135
         Functional   FAM     .0028   .0101   .0375   .0005   .0013   .0034
                      LIN     .0126   .0417   .0819   .0003   .0005   .0010
Sparse   Scalar       FAM     .0448   .2775   3.558   .0291   .1511   1.031
                      LIN     .1867   .7179   3.642   .0305   .1423   .9316
         Functional   FAM     .0072   .0146   .0485   .0023   .0041   .0090
                      LIN     .0156   .0422   .0803   .0021   .0042   .0091
Table 3: Functional R² (17), 25th, 50th and 75th percentiles and mean of the cross-validated observed relative prediction errors, RPE(−i),f (22), comparing FAM and functional linear regression models for zygotic data.

        25th    50th    75th    Mean    R²
FAM    .0506   .0776   .1662   .1301   0.19
LIN    .0479   .0891   .1727   .1374   0.16
Figure 1: Observed (thin curves) gene expression levels and smoothed estimates of
the mean functions (thick curves) for embryo phase (left) and pupa phase (right), for
zygotic data.
Figure 2: Left: Estimates of the first 3 eigenfunctions for predictor trajectories from
the embryo phase (predictor process), K = 3 components selected by AIC, accounting
for 80.69% (first component, solid), 13.65% (second component, dashed) and 2.71%
(third component, dash-dot) of the total variation. Right: Estimates of the first 4
eigenfunctions of the pupa (metamorphosis) phase (response process), M = 4 components selected by AIC, accounting for 80.47% (first component, solid), 8.89% (second
component, dashed), 4.2% (third component, dash-dot) and 1.55% (fourth component,
dotted) of the total variation.
Figure 3: Scatterplots (dots), local polynomial (solid) and linear (dashed) estimates for
the regressions of estimated FPC scores of the pupa phase (responses, y-axis) versus
those for the embryo phase (predictors, x-axis). The FPC scores of the embryo phase
expressions are arranged from left to right (ξk , k = 1, 2, 3) and the FPC scores of the
pupa phase expressions are arranged from top to bottom (ζm , m = 1, 2, 3, 4), for zygotic
data.
Figure 4: Top panels: Predictor trajectories μ̂X(s) + α φ̂k(s), for α = 0 (solid), α = 1 (dashed) and α = −1 (+++), for k = 1 (left panels), k = 2 (middle panels) and k = 3 (right panels). Bottom panels: Corresponding response trajectories μ̂Y(t) + Σ_{m=1}^M {f̂km(α) + Σ_{ℓ≠k} f̂ℓm(0)} ψ̂m(t), for zygotic data.
Supplement for “Functional Additive Models”
1. NOTATIONS AND AUXILIARY RESULTS
Covariance operators are denoted by GX, ĜX, generated by the kernels GX, ĜX; i.e., GX(f) = ∫_S GX(s,t)f(s) ds, ĜX(f) = ∫_S ĜX(s,t)f(s) ds for any f ∈ L²(S). Define

DX = ∫_{S²} {ĜX(s,t) − GX(s,t)}² ds dt,    K0 = inf{j ≥ 1: λj − λj+1 ≤ 2√DX} − 1,
δk^x = min_{1≤j≤k}(λj − λj+1),    πk^x = 1/λk + 1/δk^x.    (32)
Let K = K(n) denote the number of leading eigenfunctions included to approximate X as sample size n varies; i.e., X̂i(s) = μ̂X(s) + Σ_{k=1}^K ξ̂ik φ̂k(s). Analogously, define the quantities GY, ĜY, DY, δm^y, πm^y, M0 and M for the process Y, for the case of functional responses. The following lemma gives the weak uniform convergence rates for the estimators of the FPCs, setting the stage for the subsequent developments. The proof is in Section 2.
Lemma 1 Under (A1.1)-(A3), (C1.1), (C1.2) and (C2.1),

sup_{s∈S} |μ̂X(s) − μX(s)| = Op(1/√(n bX)),    sup_{s1,s2∈S} |ĜX(s1,s2) − GX(s1,s2)| = Op(1/(√n hX²)),    (33)

and as a consequence, σ̂X² − σX² = Op(n^{-1/2} hX^{-2} + n^{-1/2} h*X^{-1}). Considering eigenvalues λk of multiplicity one, φ̂k can be chosen such that

P(sup_{1≤k≤K0} |λ̂k − λk| ≤ √DX) = 1,    sup_{s∈S} |φ̂k(s) − φk(s)| = Op(πk^x/(√n hX²)),    k = 1, ..., K0,    (34)

where DX, πk^x and K0 are defined in (32).

Analogously, under (B1.1)-(B3), (C1.1), (C1.2) and (C2.2),

sup_{t∈T} |μ̂Y(t) − μY(t)| = Op(1/√(n bY)),    sup_{t1,t2∈T} |ĜY(t1,t2) − GY(t1,t2)| = Op(1/(√n hY²)),    (35)

and as a consequence, σ̂Y² − σY² = Op(n^{-1/2} hY^{-2} + n^{-1/2} h*Y^{-1}). Considering eigenvalues ρm of multiplicity one, ψ̂m can be chosen such that

P(sup_{1≤m≤M0} |ρ̂m − ρm| ≤ √DY) = 1,    sup_{t∈T} |ψ̂m(t) − ψm(t)| = Op(πm^y/(√n hY²)),    m = 1, ..., M0,    (36)

where DY, πm^y and M0 are defined analogously to (32) for process Y.
Recall that ‖f‖∞ = sup_{t∈A} |f(t)| for an arbitrary function f with support A, and ‖g‖ = (∫_A g²(t) dt)^{1/2} for any g ∈ L²(A), and define

θik^(1) = c1‖Xi‖ + c2‖Xi Xi′‖∞ Δ*X + c3,    Zk^(1) = sup_{s∈S} |φ̂k(s) − φk(s)|,
θik^(2) = 1 + ‖φk φk′‖∞ Δ*X,    Zk^(2) = sup_{s∈S} |μ̂X(s) − μX(s)|,
θik^(3) = c4‖Xi‖∞ + c5‖Xi′‖∞ + c6,    Zk^(3) = ‖φk′‖∞ Δ*X,    (37)
θik^(4) = |Σ_{j=2}^{ni} ǫij φk(sij)(sij − si,j−1)|,    Zk^(4) ≡ 1,
θik^(5) = Σ_{j=2}^{ni} |ǫij|(sij − si,j−1),    Zk^(5) ≡ Zk^(1),

for some positive constants c1, ..., c6 that do not depend on i or k. Similarly, define corresponding quantities for the process Y as follows,

ϑim^(1) = d1‖Yi‖ + d2‖Yi Yi′‖∞ Δ*Y + d3,    Qm^(1) = sup_{t∈T} |ψ̂m(t) − ψm(t)|,
ϑim^(2) = 1 + ‖ψm ψm′‖∞ Δ*Y,    Qm^(2) = sup_{t∈T} |μ̂Y(t) − μY(t)|,
ϑim^(3) = d4‖Yi‖∞ + d5‖Yi′‖∞ + d6,    Qm^(3) = ‖ψm′‖∞ Δ*Y,    (38)
ϑim^(4) = |Σ_{l=2}^{mi} εil ψm(til)(til − ti,l−1)|,    Qm^(4) ≡ 1,
ϑim^(5) = Σ_{l=2}^{mi} |εil|(til − ti,l−1),    Qm^(5) ≡ Qm^(1),

for some positive constants d1, ..., d6 that do not depend on i or m. We note that the subscripts are mainly for notational convenience and do not necessarily reflect dependence on these indices. Note that in (37), θik^(1), θik^(3), θik^(5) in fact do not depend on k and θik^(2) does not depend on i, while Zk^(3) is deterministic and Zk^(4) is a constant. More importantly, we emphasize that the θik^(ℓ) are i.i.d. over i (ℓ = 1, 3, 4, 5) or free of i (ℓ = 2), and that the Zk^(ℓ) do not depend on i.

The next lemma is critical for the subsequent developments, providing exact upper bounds for the estimation errors |ξ̂ik^I − ξik| and |ζ̂im^I − ζim|, for the FPC estimates ξ̂ik^I, ζ̂im^I (27).

Lemma 2 For θik^(ℓ), Zk^(ℓ), ϑim^(ℓ) and Qm^(ℓ) as defined in (37) and (38),

|ξ̂ik^I − ξik| ≤ Σ_{ℓ=1}^5 θik^(ℓ) Zk^(ℓ),    |ζ̂im^I − ζim| ≤ Σ_{ℓ=1}^5 ϑim^(ℓ) Qm^(ℓ).    (39)

The proof is in Section 2. In the sequel we suppress the superscript I in the FPC estimates ξ̂ik^I and ζ̂im^I.
Recall that the sequences of bandwidths hk and hmk are employed to obtain the estimates f̂k and f̂mk for the regression functions fk and fmk, and that the density of ξk is denoted by pk. Define

θk(x) = √pk(x) { πk^x/(√n hX²) + 1/√(n bX) + Δ*X },
ϑmk(x) = √pk(x) { πm^y/(√n hY²) + 1/√(n bY) + Δ*Y }.    (40)
The weak convergence rates θ̃k and ϑ̃mk of the regression function estimators f̂k(x) and f̂mk(x) (see Theorem 1) are as follows,

θ̃k(x) = θk(x)/hk + (1/2)|fk″(x)| hk² + √( var(Y|x) ‖K1‖² / (pk(x) n hk) ),
ϑ̃mk(x) = θk(x)/hmk + ϑmk(x) + (1/2)|fmk″(x)| hmk² + √( var(ζm|x) ‖K1‖² / (pk(x) n hmk) ).    (41)
Considering the predictions Ê(Y|X) for the scalar response case and Ê{Y(t)|X} for the functional response case, the numbers of eigenfunctions K and M used for approximating the infinite dimensional processes X and Y generally tend to infinity as the sample size n increases. We require K ≤ K0 and M ≤ M0 in (A6). Since it follows from (33) that K0 → ∞, as long as all eigenvalues λj are of multiplicity 1, and analogously for M0, this is not a strong restriction. Denote the set of positive integers by N and Nk = {1, ..., k}. Convergence rates θn* and ϑn* for the predictions (20) and (21) are as follows,

θn* = Σ_{k=1}^K { θk(ξk)/hk + (1/2)|fk″(ξk)| hk² + √( var(Y|ξk) ‖K1‖² / (pk(ξk) n hk) ) } + Σ_{k≥K+1} fk(ξk),    (42)

ϑn* = Σ_{k=1}^K Σ_{m=1}^M { (θk(ξk)/hmk + ϑmk(ξk)) |ψm(t)| + (1/2)|fmk″(ξk)| |ψm(t)| hmk² + |ψm(t)| √( var(ζm|ξk) ‖K1‖² / (pk(ξk) n hk) ) + πm^y |fmk(ξk)| / (√n hY²) } + Σ_{(k,m)∈N²\(NK×NM)} fmk(ξk) ψm(t),

where θk and ϑmk are defined in (40), and we note that ϑn* depends on t.
2. PROOFS
Proof of Lemma 1. It is sufficient to show (33) and (34). The weak convergence results (33) for μ̂X and ĜX(s1, s2) have been derived in Lemma 2 of Yao and Lee (2006). Theorem 1 of Hall and Hosseini-Nasab (2006) implies the first equation of (34) for the estimated eigenvalues λ̂k. Assuming λk > 0 without loss of generality, we have

|λ̂k φ̂k(s) − λk φk(s)| = | ∫_S ĜX(t,s)φ̂k(t) dt − ∫_S GX(t,s)φk(t) dt |
  ≤ ∫_S |ĜX(t,s) − GX(t,s)| · |φ̂k(t)| dt + ∫_S |GX(t,s)| · |φ̂k(t) − φk(t)| dt
  ≤ √( ∫_S (ĜX(t,s) − GX(t,s))² dt ) + √( ∫_S GX²(t,s) dt ) ‖φ̂k − φk‖,

and |λ̂k φ̂k(s)/λk − φk(s)| = Op{√DX (1/λk + 1/δk^x)} uniformly in s ∈ S for 1 ≤ k ≤ K0, where K0 is defined in (32). The second equation of (34) follows immediately.

Proof of Lemma 2. Let
η̂ik =
η̃ik =
τ̂ik =
ni
X
j=2
ni
X
j=2
ni
X
j=2
{Xi (sij ) − µ̂(sij )}φ̂k (sij )(sij − si,j−1),
{Xi (sij ) − µ(sij )}φk (sij )(sij − si,j−1),
ǫij φ̂k (sij )(sij − si,j−1 ),
4
τ̃ik =
ni
X
j=2
ǫij φk (sij )(sij − si,j−1).
Noting ξˆik = η̂ik + τ̂ik , one finds
|ξˆik − ξik | ≤ {|η̂ik − η̃ik | + |η̃ik − ξik | + |τ̂ik |}.
(43)
Without loss of generality, assume ‖φ_k‖_∞ ≥ 1, ‖φ′_k‖_∞ ≥ 1, ‖X_i‖_∞ ≥ 1 and ‖X′_i‖_∞ ≥ 1. For θ^{(ℓ)}_ik and Z^{(ℓ)}_k in (37), ℓ = 1, . . . , 5, the first term on the r.h.s. of (43) is bounded by
\[
\begin{aligned}
&\sum_{j=2}^{n_i} \big[\, |X_i(s_{ij}) - \hat\mu(s_{ij})| \cdot |\hat\phi_k(s_{ij}) - \phi_k(s_{ij})| + |\hat\mu(s_{ij}) - \mu(s_{ij})| \cdot |\phi_k(s_{ij})| \,\big](s_{ij} - s_{i,j-1}) \\
&\le \Big\{ \sum_{j=1}^{n_i} \big[\,|X_i(s_{ij})| + |\mu(s_{ij})| + 1\,\big]^2 (s_{ij} - s_{i,j-1}) \Big\}^{1/2} \Big\{ \sum_{j=2}^{n_i} \big[\hat\phi_k(s_{ij}) - \phi_k(s_{ij})\big]^2 (s_{ij} - s_{i,j-1}) \Big\}^{1/2} \\
&\quad + \Big\{ \sum_{j=1}^{n_i} \big[\hat\mu(s_{ij}) - \mu(s_{ij})\big]^2 (s_{ij} - s_{i,j-1}) \Big\}^{1/2} \Big\{ \sum_{j=2}^{n_i} \phi_k^2(s_{ij})(s_{ij} - s_{i,j-1}) \Big\}^{1/2} \\
&\le \theta_{ik}^{(1)} Z_k^{(1)} + \theta_{ik}^{(2)} Z_k^{(2)}.
\end{aligned}
\]
The second term on the r.h.s. of (43) has the upper bound
\[
|\tilde\eta_{ik} - \xi_{ik}| \le \|(X_i + \mu)'\phi_k + (X_i + \mu)\phi_k'\|_\infty\, \Delta_X^* \le \theta_{ik}^{(3)} Z_k^{(3)}.
\]
From the above, the third term on the r.h.s. of (43) is bounded by (θ^{(4)}_ik Z^{(4)}_k + θ^{(5)}_ik Z^{(5)}_k).
Proof of Theorem 1. For simplicity, denote "Σ^n_{i=1}" by "Σ_i", and let w_i = K_1{(x − ξ_ik)/h_k}/(nh_k), ŵ_i = K_1{(x − ξ̂_ik)/h_k}/(nh_k), and write θ_k = θ_k(x). From (12), the local linear estimator f̂_k(x) of the regression function f_k(x) can be written explicitly as
\[
\hat f_k(x) = \frac{\sum_i \hat w_i Y_i}{\sum_i \hat w_i} - \frac{\sum_i \hat w_i (\hat\xi_{ik} - x)}{\sum_i \hat w_i}\, \hat f_k'(x), \tag{44}
\]
where
\[
\hat f_k'(x) = \frac{\sum_i \hat w_i (\hat\xi_{ik} - x) Y_i - \big\{ \sum_i \hat w_i (\hat\xi_{ik} - x) \sum_i \hat w_i Y_i \big\} \big/ \sum_i \hat w_i}{\sum_i \hat w_i (\hat\xi_{ik} - x)^2 - \big\{ \sum_i \hat w_i (\hat\xi_{ik} - x) \big\}^2 \big/ \sum_i \hat w_i}. \tag{45}
\]
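Formulas (44)–(45) are the intercept/slope form of the standard local linear smoother. A minimal sketch translating them directly into code, assuming an Epanechnikov kernel for K_1 (the function name and simulated data are hypothetical illustrations, not part of the paper):

```python
import numpy as np

def local_linear(x0, xi, y, h):
    """Local linear estimate f_hat(x0), written in the form of (44)-(45)."""
    K1 = lambda u: 0.75 * np.maximum(1.0 - u * u, 0.0)  # Epanechnikov kernel
    n = len(xi)
    w = K1((x0 - xi) / h) / (n * h)                     # weights w_i
    d = xi - x0                                         # (xi_ik - x)
    sw, swd, swd2 = w.sum(), (w * d).sum(), (w * d * d).sum()
    swy, swdy = (w * y).sum(), (w * d * y).sum()
    # slope estimate f'(x0), cf. (45)
    slope = (swdy - swd * swy / sw) / (swd2 - swd ** 2 / sw)
    # intercept, cf. (44)
    return swy / sw - slope * swd / sw

rng = np.random.default_rng(1)
xi = rng.normal(size=4000)                    # predictor scores
y = np.sin(xi) + 0.1 * rng.normal(size=4000)  # f_k(x) = sin(x) plus noise
print(local_linear(0.5, xi, y, h=0.3))        # close to sin(0.5) ≈ 0.479
```

The same routine gives f̂_k when fed the estimated scores ξ̂_ik and the hypothetical estimator f̃_k below when fed the true scores ξ_ik.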
Let f̃_k(x) be a hypothetical estimator, obtained by substituting the true values w_i and ξ_ik for ŵ_i, ξ̂_ik in (44) and (45). To evaluate |f̂_k(x) − f̃_k(x)|, one has to quantify the orders of the differences
\[
D_1 = \sum_i (\hat w_i - w_i), \qquad
D_2 = \sum_i (\hat w_i - w_i) Y_i, \qquad
D_3 = \sum_i (\hat w_i \hat\xi_{ik} - w_i \xi_{ik}), \qquad
D_4 = \sum_i (\hat w_i \hat\xi_{ik}^2 - w_i \xi_{ik}^2).
\]
Considering D_1, without loss of generality, assume the compact support of K_1 is [−1, 1]. Since K_1 is Lipschitz continuous on its support,
\[
D_1 \le \frac{c}{n h_k^2} \sum_i |\hat\xi_{ik} - \xi_{ik}| \big\{ I(|x - \xi_{ik}| \le h_k) + I(|x - \hat\xi_{ik}| \le h_k) \big\}, \tag{46}
\]
for some c > 0, where I(·) is an indicator function. Lemma 2 implies for the first term on the r.h.s. of (46)
\[
\frac{1}{n h_k^2} \sum_i |\hat\xi_{ik} - \xi_{ik}|\, I(|x - \xi_{ik}| \le h_k) \le \sum_{\ell=1}^{5} Z_k^{(\ell)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(\ell)}\, I(|x - \xi_{ik}| \le h_k).
\]
Applying the central limit theorem for a random number of summands (Billingsley, 1995, page 380), and observing Σ_i I(|x − ξ_ik| ≤ h_k)/(nh_k) → 2p_k(x) in probability, one finds
\[
\frac{1}{n h_k} \sum_i \theta_{ik}^{(\ell)}\, I(|x - \xi_{ik}| \le h_k) \stackrel{p}{\longrightarrow} 2 p_k(x)\, E(\theta_{ik}^{(\ell)}), \tag{47}
\]
provided that E(θ^{(ℓ)}_ik) < ∞ for ℓ = 1, . . . , 5. Note that Eθ^{(1)}_ik < ∞ and Eθ^{(3)}_ik < ∞ by (A4), while Eθ^{(4)}_ik ≤ 2σ_X ∆*_X and Eθ^{(5)}_ik ≤ √|S| σ_X by the Cauchy–Schwarz inequality. Then
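The in-probability limit invoked here, Σ_i I(|x − ξ_ik| ≤ h_k)/(nh_k) → 2p_k(x), is easy to confirm by simulation. A small sketch, assuming for illustration standard normal scores so that p_k is the N(0, 1) density (the specific values are hypothetical):

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n, h, x = 200_000, 0.05, 0.7
xi = rng.normal(size=n)                       # scores with density p_k = N(0, 1)

# sum_i I(|x - xi_ik| <= h_k) / (n h_k)
frac = np.sum(np.abs(x - xi) <= h) / (n * h)

# limiting value 2 p_k(x)
limit = 2.0 * math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
print(frac, limit)  # the two values agree to within sampling error
```

The factor 2 appears because the window [x − h, x + h] has length 2h, so the count divided by nh converges to 2p_k(x) rather than p_k(x).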
\[
\begin{aligned}
Z_k^{(1)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(1)}\, I(|x - \xi_{ik}| \le h_k) &= O_p\Big\{ \frac{\pi_{kx}}{\sqrt{n}\, h_X^2\, h_k}\, p_k(x) \Big\}, \\
Z_k^{(2)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(2)}\, I(|x - \xi_{ik}| \le h_k) &= O_p\Big\{ \frac{1}{\sqrt{n}\, b_X\, h_k}\, p_k(x) \Big\}, \\
Z_k^{(3)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(3)}\, I(|x - \xi_{ik}| \le h_k) &= O_p\Big\{ \frac{\|\phi_k\|_\infty\, \Delta_X^*}{h_k}\, p_k(x) \Big\}, \\
Z_k^{(4)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(4)}\, I(|x - \xi_{ik}| \le h_k) &= O_p\Big\{ \frac{\sqrt{\Delta_X^*}}{h_k}\, p_k(x) \Big\}, \\
Z_k^{(5)}\, \frac{1}{n h_k^2} \sum_i \theta_{ik}^{(5)}\, I(|x - \xi_{ik}| \le h_k) &= O_p\Big\{ \frac{\pi_{kx}}{\sqrt{n}\, h_X^2\, h_k}\, p_k(x) \Big\}. 
\end{aligned} \tag{48}
\]
We now obtain (nh²_k)^{−1} Σ_i |ξ̂_ik − ξ_ik| I(|x − ξ_ik| ≤ h_k) = O_p(θ_k h_k^{−1}). The asymptotic rate of the second term can be derived analogously, observing
\[
\frac{1}{n h_k} \sum_i I(|x - \hat\xi_{ik}| \le h_k) \le \frac{1}{n h_k} \sum_i \Big\{ I(|x - \xi_{ik}| \le 2h_k) + I\Big( \sum_{\ell=1}^{5} \theta_{ik}^{(\ell)} Z_k^{(\ell)} > h_k \Big) \Big\} \stackrel{p}{\longrightarrow} 4 p_k(x),
\]
leading to (nh²_k)^{−1} Σ_i |ξ̂_ik − ξ_ik| I(|x − ξ̂_ik| ≤ h_k) = O_p(θ_k h_k^{−1}). Then D_1 = O_p(θ_k h_k^{−1}) follows.
Analogously, one shows D_2 = O_p(θ_k h_k^{−1}), applying the Cauchy–Schwarz inequality for θ^{(ℓ)}_ik, ℓ = 1, 3, and observing the independence between Y_i and θ^{(ℓ)}_ik for ℓ = 2, 4, 5, given the moment condition (A4). For D_3, observe
\[
D_3 = \sum_i \big\{ (\hat w_i - w_i)\xi_{ik} + (\hat w_i - w_i)(\hat\xi_{ik} - \xi_{ik}) + w_i(\hat\xi_{ik} - \xi_{ik}) \big\} \equiv D_{31} + D_{32} + D_{33}.
\]
Then D_31 = O_p(θ_k h_k^{−1}), analogously to D_1. It is easy to see that D_32 = o_p(D_31). Since D_33 ≤ c Σ^5_{ℓ=1} Z^{(ℓ)}_k (nh_k)^{−1} Σ_i θ^{(ℓ)}_ik I(|x − ξ_ik| ≤ h_k) for some c > 0, one also has D_33 = o_p(D_31). This results in D_3 = O_p(θ_k h_k^{−1}). Observing |ξ̂²_ik − ξ²_ik| ≤ 2|ξ̂_ik − ξ_ik| · |ξ_ik| + (ξ̂_ik − ξ_ik)², one can show D_4 = O_p(θ_k h_k^{−1}), using similar arguments as for D_3 and Eξ⁴_ik < ∞ from (A4). Combining the results for D_ℓ, ℓ = 1, . . . , 4, and applying Slutsky's Theorem leads to |f̂_k(x) − f̃_k(x)| = O_p(θ_k h_k^{−1}). Using (A5), and applying standard asymptotic results for the hypothetical local linear smoother f̃_k(x), completes the proof of (18).
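The plug-in argument just completed (compare the feasible smoother built from estimated scores against the hypothetical one built from true scores) can be illustrated numerically. A sketch assuming small score perturbations and a Nadaraya–Watson smoother as a simplified stand-in for the local linear step (all names and values are hypothetical illustrations):

```python
import numpy as np

def nw(x0, xi, y, h):
    # Nadaraya-Watson smoother, a simplified stand-in for (44)-(45)
    w = np.maximum(1.0 - ((x0 - xi) / h) ** 2, 0.0)
    return np.sum(w * y) / np.sum(w)

rng = np.random.default_rng(3)
n, h = 5000, 0.3
xi = rng.normal(size=n)                       # true scores xi_ik
y = np.sin(xi) + 0.1 * rng.normal(size=n)     # responses with f_k(x) = sin(x)
xi_hat = xi + rng.normal(scale=0.01, size=n)  # estimated scores, small error

f_tilde = nw(0.0, xi, y, h)                   # hypothetical (oracle) smoother
f_hat = nw(0.0, xi_hat, y, h)                 # feasible plug-in smoother
print(abs(f_hat - f_tilde))                   # small when score errors are small
```

As the proof shows, the gap |f̂_k(x) − f̃_k(x)| is driven by the score estimation error and is asymptotically negligible relative to the smoothing error of f̃_k itself.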
To derive (19), one additionally only needs to consider Σ_i (ŵ_i ζ̂_im − w_i ζ_im) = Σ_i {(ŵ_i − w_i)ζ_im + (ŵ_i − w_i)(ζ̂_im − ζ_im) + w_i(ζ̂_im − ζ_im)}, where the third term yields an extra term of order O_p(ϑ_mk), by observing
\[
\Big| \sum_i w_i (\hat\zeta_{im} - \zeta_{im}) \Big| \le \sum_{\ell=1}^{5} Q_m^{(\ell)} \sum_i w_i\, \vartheta_{im}^{(\ell)} \le \sum_{\ell=1}^{5} Q_m^{(\ell)}\, \frac{1}{n h_{mk}} \sum_i \vartheta_{im}^{(\ell)}\, I(|x - \xi_{ik}| \le h_{mk}).
\]
Similar arguments as above complete this derivation.
Proof of Theorem 2. Using (A7), the derivation of θ*_n in (42) is straightforward, following the above arguments. To obtain (21), note that
\[
\begin{aligned}
\big| \hat E\{Y(t) \mid X\} - E\{Y(t) \mid X\} \big|
&\le \sum_{k=1}^{K} \sum_{m=1}^{M} |\hat f_{mk}(\xi_k)\hat\psi_m(t) - f_{mk}(\xi_k)\psi_m(t)| + \Big| \sum_{(k,m) \in \mathbb{N}^2 \setminus \mathbb{N}_K \times \mathbb{N}_M} f_{mk}(\xi_k)\psi_m(t) \Big| \\
&\le \sum_{k=1}^{K} \sum_{m=1}^{M} \big[ |\hat f_{mk}(\xi_k) - f_{mk}(\xi_k)| \{ |\psi_m(t)| + |\hat\psi_m(t) - \psi_m(t)| \} + |f_{mk}(\xi_k)| \cdot |\hat\psi_m(t) - \psi_m(t)| \big] \\
&\quad + \Big| \sum_{(k,m) \in \mathbb{N}^2 \setminus \mathbb{N}_K \times \mathbb{N}_M} f_{mk}(\xi_k)\psi_m(t) \Big|.
\end{aligned}
\]
This implies the convergence rate ϑ*_n in (42).