Statistics & Probability Letters 45 (1999) 11 – 22
www.elsevier.nl/locate/stapro
Functional linear model
Hervé Cardot a, Frédéric Ferraty b, Pascal Sarda b,∗
a Unité Biométrie et Intelligence Artificielle, INRA, Toulouse, BP 27, 31326 Castanet-Tolosan Cedex, France
b Laboratoire de Statistique et Probabilités, UMR CNRS C5583, Université Paul Sabatier, 118, route de Narbonne, 31062 Toulouse Cedex, France
∗ Corresponding author. E-mail address: [email protected] (P. Sarda).
Received June 1998; received in revised form December 1998
Abstract
In this paper, we study a regression model in which explanatory variables are sampling points of a continuous-time process. We propose an estimator of regression by means of a Functional Principal Component Analysis analogous to the one introduced by Bosq [(1991) NATO ASI Series, pp. 509–529] in the case of Hilbertian AR processes. Both convergence in probability and almost sure convergence of this estimator are stated. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Functional linear model; Functional data analysis; Hilbert spaces; Convergence
1. Introduction
Classical regression models, such as generalized linear models, may be inadequate in some statistical studies: this is the case when the explanatory variables are digitized points of a curve. Examples can be found in different fields of application such as chemometrics (Frank and Friedman, 1993), linguistics (Hastie et al., 1995) and many other areas (see Hastie and Mallows, 1993; Ramsay and Silverman, 1997, among others).
In this context, Frank and Friedman (1993) describe and compare different estimation procedures – Partial Least Squares, Ridge Regression and Principal Component Regression – which take into account both the number of explanatory variables (which may exceed the sample size) and the high correlation between these variables. On the other hand, several authors (see below) have developed models which describe the "functional" nature of the explanatory variables.
Formally, the above situation can be described through the following functional linear model. Let Y be a real random variable (r.r.v.) and $X = (X(t),\ t \in [0,1])$ a continuous-time process defined on the same probability space
$(\Omega, \mathcal{A}, P)$. Assuming that $E\big(\int_0^1 X^2(t)\,dt\big) < \infty$, the dependence between X and Y is expressed as
$$Y = \int_0^1 \alpha(t)\,X(t)\,dt + \varepsilon, \qquad (1)$$
where $\alpha$ is a square integrable function defined on $[0,1]$ and $\varepsilon$ is an r.r.v. independent of X, with zero mean and variance equal to $\sigma^2$.
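In practice X is only observed on a finite grid, and the integral in (1) is approximated by a quadrature rule. A minimal numerical illustration of this discretized reading of (1) (the grid size, the choice of $\alpha$, the noise level and all names below are ours, purely for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)
p = 100                                  # number of sampling points of the curve
t = np.linspace(0.0, 1.0, p)             # equispaced grid on [0, 1]

alpha = np.sin(np.pi * t / 2)            # some square integrable coefficient function
inc = rng.normal(scale=np.sqrt(1.0 / p), size=p)
inc[0] = 0.0
X = np.cumsum(inc)                       # a discretized Brownian-type sample path
eps = rng.normal(scale=0.1)              # centred error, independent of X

# Model (1): Y = int_0^1 alpha(t) X(t) dt + eps, the integral replaced by a quadrature rule
Y = np.trapz(alpha * X, t) + eps
```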
Hastie and Mallows (1993) introduce an estimator of the function $\alpha$ based on the minimization of a cubic spline criterion, while Marx and Eilers (1996) use a smooth basis of B-splines and then introduce a difference penalty in the log-likelihood in order to derive a P-spline estimator of $\alpha$.
Alternatively, model (1) can be generalized to the case where X is a random variable valued in a real separable Hilbert space H, and the relation between X and Y can then be written as
$$Y = \Psi(X) + \varepsilon, \qquad (2)$$
where $\Psi$ is an element of $H'$, and $H'$ is the space of $\mathbb{R}$-valued continuous linear operators defined on H.
Following ideas from Bosq (1991) in the case of ARH processes, we propose in Section 2 below an estimator of the operator $\Psi$. This estimator is based on the spectral analysis of the empirical second moment operator of X, which is then inverted in the space spanned by the $k_n$ eigenvectors associated with the $k_n$ greatest eigenvalues. The main results, namely convergence in probability and almost sure convergence of this estimator, are stated in Section 3. Computational aspects of the method are discussed in Section 4 through a simulation study. A sketch of the proofs is given in Section 5 (detailed proofs may be found in Cardot et al., 1998).
2. Definition of the estimator
The inner product and norm in H are denoted by $\langle\cdot,\cdot\rangle_H$ and $\|\cdot\|_H$, respectively, and the usual norm $\|\cdot\|_{H'}$ in $H'$ is defined as
$$\forall T \in H', \qquad \|T\|_{H'} = \sup_{\|x\|_H = 1} |Tx|,$$
and satisfies
$$\forall T \in H', \qquad \|T\|_{H'} = \Big(\sum_{i\in\mathbb{N}} (T e_i)^2\Big)^{1/2},$$
where $(e_i)_{i\in\mathbb{N}}$ is an orthonormal basis of H.
Assuming that the Hilbertian variable X satisfies
$$E[\|X\|_H^2] = \int \|X(\omega)\|_H^2\,dP(\omega) < +\infty,$$
we define (cf. Grenander, 1963), from Riesz's Theorem, the second moment operator $\Gamma$ of X by
$$\Gamma(x) = E(X \otimes_H X\,(x)) = E(\langle X, x\rangle_H\, X), \qquad \forall x \in H.$$
The operator $\Gamma$ is nuclear (and therefore is a Hilbert–Schmidt operator), self-adjoint and positive. Let us define $\Delta$ as the cross second moment operator between X and Y,
$$\Delta(x) = E(X \otimes_{H'} Y\,(x)) = E(\langle X, x\rangle_H\, Y), \qquad \forall x \in H.$$
It is easy to see that, by relation (2), we have
$$\Delta = \Psi\,\Gamma. \qquad (3)$$
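For completeness, the computation behind (3) is the following one-liner, using (2), the independence of $\varepsilon$ and X with $E(\varepsilon) = 0$, and the linearity and continuity of $\Psi$:
$$\Delta(x) = E\big(\langle X, x\rangle_H\, Y\big) = E\big(\langle X, x\rangle_H\, \Psi(X)\big) + E\big(\langle X, x\rangle_H\big)\,E(\varepsilon) = \Psi\big(E(\langle X, x\rangle_H\, X)\big) = \Psi(\Gamma(x)), \qquad x \in H.$$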
In general, the inverse of $\Gamma$ does not exist and, even when it does, since $\Gamma$ is nuclear its inverse is not bounded when H is infinite dimensional. Direct estimation of $\Gamma^{-1}$ is therefore problematic. Nevertheless, in order to estimate $\Psi$, one can think of projecting the observations onto a finite-dimensional subspace of H (with dimension depending on n).
Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be a sample from $(X, Y)$. Empirical versions of the operators $\Gamma$ and $\Delta$ are defined by
$$\Gamma_n = \frac{1}{n}\sum_{i=1}^{n} X_i \otimes_H X_i, \qquad (4)$$
$$\Delta_n = \frac{1}{n}\sum_{i=1}^{n} X_i \otimes_{H'} Y_i. \qquad (5)$$
Let us denote by $\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_n \ge 0 = \hat\lambda_{n+1} = \cdots$ the eigenvalues of $\Gamma_n$ and by $\hat V_1, \hat V_2, \ldots$ orthonormal eigenvectors associated with these eigenvalues. Let $(k_n)_{n\in\mathbb{N}^*}$ be a sequence of positive integers such that $\lim_{n\to\infty} k_n = +\infty$ with $k_n \le n$, and let $\hat H_{k_n}$ be the space spanned by $\{\hat V_j,\ j = 1, \ldots, k_n\}$. In H, let $\hat\Pi_{k_n}$ be the orthogonal projection onto this subspace,
$$\hat\Pi_{k_n} = \sum_{j=1}^{k_n} \hat V_j \otimes_H \hat V_j.$$
If we suppose $\hat\lambda_{k_n} > 0$, we define an estimator $\hat\Psi_{k_n}$ of $\Psi$ as
$$\hat\Psi_{k_n} = \Delta_n\, \hat\Pi_{k_n}\, \big(\hat\Pi_{k_n} \Gamma_n \hat\Pi_{k_n}\big)^{-1}. \qquad (6)$$
Remark. Projecting the observations onto the space $\hat H_{k_n}$ spanned by eigenvectors of $\Gamma_n$ leads us to an "optimal" linear representation of the $X_i$ with respect to the explained variance (see Dauxois et al., 1982).
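When the curves are all observed on the same grid of p equispaced points of $[0,1]$, the construction (4)–(6) reduces to a few lines of linear algebra: $\Gamma_n$ becomes a $p\times p$ matrix, $\Delta_n$ a vector of length p, and $\hat\Psi_{k_n}$ is represented by a function $\hat\alpha_{k_n}$ on the grid. The sketch below (Python; the function name and the quadrature choice are ours, and we assume $\hat\lambda_{k_n} > 0$ as above) is one possible implementation of this recipe, not the authors' code:
```python
import numpy as np

def flm_fpca_estimate(X, Y, t, k):
    """FPCA-based estimate of the coefficient function in model (1).

    X : (n, p) array of curves sampled on the common equispaced grid t
    Y : (n,) array of responses
    t : (p,) grid on [0, 1]
    k : dimension k_n (number of eigenelements kept)
    Returns alpha_hat evaluated on the grid t.
    """
    n, p = X.shape
    dt = t[1] - t[0]                           # quadrature weight of the Riemann sum

    gamma_n = (X.T @ X) / n                    # discretized kernel of Gamma_n, cf. (4)
    delta_n = (X.T @ Y) / n                    # discretized version of Delta_n, cf. (5)

    # Eigenelements of Gamma_n: eigendecomposition of the quadrature-weighted kernel
    evals, evecs = np.linalg.eigh(gamma_n * dt)
    order = np.argsort(evals)[::-1]            # lambda_hat_1 >= lambda_hat_2 >= ...
    lam = evals[order][:k]
    V = evecs[:, order][:, :k] / np.sqrt(dt)   # eigenfunctions V_hat_j, L2-normalised

    # Estimator (6): invert Gamma_n on the span of V_hat_1, ..., V_hat_k, then apply Delta_n
    scores = (V.T @ delta_n) * dt              # <Delta_n, V_hat_j>
    return V @ (scores / lam)
```
Equivalently, $\hat\alpha_{k_n} = \sum_{j\le k_n} \hat\lambda_j^{-1}\, \Delta_n(\hat V_j)\, \hat V_j$, which is what (6) yields once restricted to $\hat H_{k_n}$.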
3. Main results
In order to state the main results of the paper, let us introduce the following condition:
$$(\mathrm{H}_0)\qquad \hat\lambda_1 > \hat\lambda_2 > \cdots > \hat\lambda_{k_n} > 0 \quad \text{a.s.},$$
which ensures almost surely that $\hat\Pi_{k_n} \Gamma_n \hat\Pi_{k_n}$ is regular and that its eigenvectors are identifiable. Let us denote by $(\lambda_j)_{j\in\mathbb{N}^*}$ the decreasing sequence of eigenvalues of $\Gamma$ and define
$$a_1 = \frac{2\sqrt{2}}{\lambda_1 - \lambda_2}, \qquad a_j = \frac{2\sqrt{2}}{\min(\lambda_{j-1} - \lambda_j,\ \lambda_j - \lambda_{j+1})} \quad \text{if } j \neq 1.$$
Theorem 3.1. Suppose that $(\mathrm{H}_0)$ and the following hypotheses are satisfied:
$$(\mathrm{H}_1)\qquad \lambda_1 > \lambda_2 > \cdots > 0;$$
$$(\mathrm{H}_2)\qquad E\|X\|_H^4 < +\infty;$$
$$(\mathrm{H}_3)\qquad \lim_{n\to+\infty} n\lambda_{k_n}^4 = +\infty \quad\text{and}\quad \lim_{n\to+\infty} \frac{n\lambda_{k_n}^2}{\big(\sum_{j=1}^{k_n} a_j\big)^2} = +\infty.$$
Then
$$\|\hat\Psi_{k_n} - \Psi\|_{H'} \xrightarrow[n\to+\infty]{} 0 \quad \text{in probability}.$$
Theorem 3.2. Suppose that $(\mathrm{H}_0)$, $(\mathrm{H}_1)$ and the following hypotheses are satisfied:
$$(\mathrm{H}_4)\qquad \|X\|_H \le c_1 \quad \text{a.s.};$$
$$(\mathrm{H}_5)\qquad |\varepsilon| \le c_2 \quad \text{a.s.};$$
$$(\mathrm{H}_6)\qquad \lim_{n\to+\infty} \frac{n\lambda_{k_n}^4}{\log n} = +\infty \quad\text{and}\quad \lim_{n\to+\infty} \frac{n\lambda_{k_n}^2}{\big(\sum_{j=1}^{k_n} a_j\big)^2 \log n} = +\infty.$$
Then
$$\|\hat\Psi_{k_n} - \Psi\|_{H'} \xrightarrow[n\to+\infty]{} 0 \quad \text{a.s.}$$
Remark. If $k_n = o(\log n)$ and
$$\lambda_j = a\,r^j \ \text{with } a > 0,\ r \in\, ]0,1[, \qquad\text{or}\qquad \lambda_j = a\,j^{-\gamma} \ \text{with } a > 0,\ \gamma > 1,$$
then $(\mathrm{H}_3)$ and $(\mathrm{H}_6)$ are satisfied. In other words, if the eigenvalues of $\Gamma$ decrease geometrically or polynomially, the hypotheses of Theorems 3.1 and 3.2 hold as long as the sequence $k_n$ tends to infinity slowly enough. The same kind of hypotheses is introduced in Bosq (1991) and in Cardot (1998), where they allow rates of convergence to be obtained.
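As a rough order check of this remark in the polynomial case (an added verification, with constants absorbed into the symbol $\asymp$): for $\lambda_j = a\,j^{-\gamma}$ with $\gamma > 1$,
$$\lambda_j - \lambda_{j+1} \asymp j^{-\gamma-1}, \qquad a_j \asymp j^{\gamma+1}, \qquad \sum_{j=1}^{k_n} a_j \asymp k_n^{\gamma+2},$$
so that
$$n\lambda_{k_n}^4 \asymp \frac{n}{k_n^{4\gamma}} \qquad\text{and}\qquad \frac{n\lambda_{k_n}^2}{\big(\sum_{j=1}^{k_n} a_j\big)^2} \asymp \frac{n}{k_n^{4\gamma+4}}.$$
When $k_n = o(\log n)$, $k_n^c$ grows slower than any power of n for every fixed c, so both ratios tend to infinity even after division by $\log n$; this gives $(\mathrm{H}_3)$ and $(\mathrm{H}_6)$. The geometric case is handled in the same way.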
4. A simulation study
We have simulated samples $(X_i, Y_i)$, $i = 1, \ldots, n$, from model (1), in which $X(t)$ is a Brownian motion defined on $[0,1]$ and $\varepsilon$ is normal with mean 0 and variance $0.2\,\mathrm{var}(\Psi(X))$. The Hilbert space H is $L^2[0,1]$ and the eigenelements of the covariance operator of X are known to be (see Ash and Gardner, 1975)
$$\lambda_j = \frac{1}{(j - 0.5)^2 \pi^2}, \qquad V_j(t) = \sqrt{2}\,\sin\{(j - 0.5)\pi t\}, \qquad t \in [0,1],\ j = 1, 2, \ldots$$
In that case, assumptions $(\mathrm{H}_3)$ (respectively $(\mathrm{H}_6)$) on the sequence of eigenvalues in Theorem 3.1 (respectively Theorem 3.2) are fulfilled provided that the dimension $k_n$ tends to infinity slowly enough, i.e. satisfies the constraint $k_n = o(\log n)$ (see the remark at the end of Section 3).
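These eigenelements also give a direct way of simulating the Brownian sample paths, through a truncated Karhunen–Loève expansion; a small sketch (the grid size p and truncation level J below are arbitrary illustrative choices, not values from the paper):
```python
import numpy as np

rng = np.random.default_rng(1)
p, J = 100, 200                                            # grid size and truncation level
t = np.linspace(0.0, 1.0, p)

j = np.arange(1, J + 1)
lam = 1.0 / ((j - 0.5) ** 2 * np.pi ** 2)                  # eigenvalues lambda_j
V = np.sqrt(2.0) * np.sin(np.outer(t, (j - 0.5) * np.pi))  # eigenfunctions V_j(t), shape (p, J)

# X(t) = sum_j sqrt(lambda_j) xi_j V_j(t) with xi_j i.i.d. N(0, 1)
xi = rng.standard_normal(J)
X = V @ (np.sqrt(lam) * xi)                                # one approximate Brownian path on t
```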
We made simulations for two different functions $\alpha$:
• $\alpha_1(t) = \sin(\pi t/2) + 0.5\,\sin(3\pi t/2) + 0.25\,\sin(5\pi t/2)$;
• $\alpha_2(t) = \sin(4\pi t)$.
In the first case, the function $\alpha_1$ is a linear combination of eigenfunctions associated with the three greatest eigenvalues of $\Gamma$, so that the best dimension $k_n$ should be 3.
Table 1
Quadratic error for the estimate of $\alpha_1$

k_n     n = 50    n = 200   n = 1000
 2       3.33      6.58      2.99
 3       5.46      3.93      1.76
 4      11.27      3.93      1.79
 5      10.5       3.92      1.96
 6      74.3       3.97      3.72

Table 2
Quadratic error for the estimate of $\alpha_2$

k_n     n = 50    n = 200   n = 1000
 4      81.8      40.8      15.76
 5       9.9      18.14      1.9
 6      13.6       6.02      1.02
 7      15         5.58      0.92
 8      19.2       4.38      0.49
 9      51.6       7.64      0.54
10      53        11.62      1.71
We tried several sample sizes in each case: n = 50, 200, 1000. To deal practically with the Brownian random functions $X_i(t)$, their sample paths were discretized by 100 equispaced points in $[0,1]$.
The aim of our study is to look at the best dimension $k_n$ for the estimation procedure, and so we have considered the following error criterion:
$$R(\hat\alpha_{k_n}) = \frac{\int_0^1 \big(\alpha(t) - \hat\alpha_{k_n}(t)\big)^2\,dt}{\int_0^1 \alpha^2(t)\,dt}.$$
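The whole experiment fits in a short script. The sketch below (self-contained Python, for $\alpha_1$ and n = 200; the seed, and hence the printed numbers, are arbitrary and will not reproduce Tables 1 and 2, whose reporting scale we have not tried to match) simply combines the discretized Brownian paths, model (1) and the FPCA estimator (6):
```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 100
t = np.linspace(0.0, 1.0, p)
dt = t[1] - t[0]

alpha1 = np.sin(np.pi * t / 2) + 0.5 * np.sin(3 * np.pi * t / 2) + 0.25 * np.sin(5 * np.pi * t / 2)

inc = rng.normal(scale=np.sqrt(dt), size=(n, p))              # Brownian increments
inc[:, 0] = 0.0
X = np.cumsum(inc, axis=1)                                    # sample paths on the 100-point grid

signal = (X @ alpha1) * dt                                    # int alpha_1(t) X_i(t) dt by Riemann sums
eps = rng.normal(scale=np.sqrt(0.2 * signal.var()), size=n)   # var(eps) = 0.2 var(Psi(X))
Y = signal + eps

def quadratic_error(alpha, alpha_hat):
    """Relative quadratic error R(alpha_hat) used in Tables 1 and 2."""
    return np.trapz((alpha - alpha_hat) ** 2, t) / np.trapz(alpha ** 2, t)

for k in (2, 3, 4, 5, 6):
    # FPCA estimator (6): eigenelements of Gamma_n, inverted on the first k of them
    evals, evecs = np.linalg.eigh((X.T @ X) / n * dt)
    lam, V = evals[::-1][:k], evecs[:, ::-1][:, :k] / np.sqrt(dt)
    alpha_hat = V @ ((V.T @ ((X.T @ Y) / n)) * dt / lam)
    print(k, quadratic_error(alpha1, alpha_hat))
```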
Table 1 (resp. Table 2) gives the quadratic errors of the estimators of $\alpha_1$ (resp. $\alpha_2$) for each sample size and different dimensions $k_n$. In each case, one can notice that $R(\hat\alpha_{k_n})$ looks like a convex function of the dimension $k_n$, and that too large a $k_n$ gives bad estimates of $\alpha$ by increasing the variance of the estimate. It also appears that, for the first example, the best dimension selected by the estimation procedure is reasonably close to the theoretical "optimal" dimension. This last point illustrates the good behaviour of our estimator. In a real-life study this quadratic error criterion cannot be computed; on the other hand, it is clear from the above simulation that the quality of the estimator depends considerably on the choice of the dimension $k_n$. A data-driven selection method such as penalized cross-validation may be used in that case (see Vieu, 1995, who uses such a criterion for the order choice in nonlinear autoregressive models).
We have drawn the estimates $\hat\alpha_{k_n}$ of the function $\alpha_1$. Fig. 1 shows the good performance of the estimation procedure for a reasonably large sample size. For smaller sample sizes the estimator shows rough features, even if the general form of the function is recovered. We think that this aspect of the estimator could be corrected by the introduction of a preliminary smoothing procedure such as in Besse and Cardot (1996). We will investigate this topic in a further study.
5. Proof of the theorems
Let $(V_j)_{j\in\mathbb{N}^*}$ be a sequence of orthonormal eigenvectors associated with $(\lambda_j)_{j\in\mathbb{N}^*}$ and let us define in $H'$ the operator $\Psi_{k_n}$, the "theoretical" version of $\hat\Psi_{k_n}$,
$$\Psi_{k_n} = \Delta\,\Pi_{k_n}\,\big(\Pi_{k_n} \Gamma \Pi_{k_n}\big)^{-1}, \qquad (7)$$
where $\Pi_{k_n}$ is the orthogonal projection onto the space $H_{k_n}$ spanned by $V_1, \ldots, V_{k_n}$. First of all, let us remark that
$$\|\Psi - \hat\Psi_{k_n}\|_{H'} \le \|\Psi - \Psi_{k_n}\|_{H'} + \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'}. \qquad (8)$$
For the first term on the right-hand side of inequality (8) we have
$$\|\Psi - \Psi_{k_n}\|_{H'}^2 = \sum_{j=1}^{\infty} \big|(\Psi - \Psi_{k_n})(V_j')\big|^2 = \sum_{j > k_n} \big|\Psi(V_j')\big|^2,$$
where
$$V_j' = \big(\mathrm{sign}\,\langle \hat V_j, V_j\rangle_H\big)\, V_j, \qquad j \ge 1.$$
Since $\Psi \in H'$, we get
$$\|\Psi - \Psi_{k_n}\|_{H'} \xrightarrow[n\to+\infty]{} 0. \qquad (9)$$
We derive in the following lemma an upper bound for $\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'}$. Bosq (1991) proves the analogue of this lemma for an ARH(1) process. Let $\mathcal{H}$ be the space of Hilbert–Schmidt operators defined on H. We consider on $\mathcal{H}$ the usual Hilbert–Schmidt norm, defined as
$$\|U\|_{\mathcal{H}} = \Big(\sum_{i\in\mathbb{N}} \|U e_i\|_H^2\Big)^{1/2},$$
and the uniform norm, defined as
$$\|U\|_{\infty} = \sup_{\|x\|_H = 1} \|U x\|_H \ \big({\le}\ \|U\|_{\mathcal{H}}\big).$$
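In finite dimension these are just the spectral and Frobenius norms of the representing matrix, which makes the inequality between them easy to check numerically (a throwaway illustration, not part of the argument):
```python
import numpy as np

rng = np.random.default_rng(3)
U = rng.standard_normal((6, 6))        # finite-dimensional stand-in for an operator U

hs_norm = np.linalg.norm(U, "fro")     # Hilbert-Schmidt norm: (sum_i ||U e_i||^2)^(1/2)
op_norm = np.linalg.norm(U, 2)         # uniform norm: sup over the unit ball of ||U x||
assert op_norm <= hs_norm + 1e-12
```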
Lemma 5.1. We have
$$\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \le \eta_n\, \|\Gamma - \Gamma_n\|_{\infty} + \frac{1}{\hat\lambda_{k_n}}\, \|\Delta - \Delta_n\|_{H'},$$
where
$$\eta_n = \|\Delta\|_{H'} \left\{ \frac{1}{\lambda_{k_n}\hat\lambda_{k_n}} + 2\Big(\frac{1}{\lambda_{k_n}} + \frac{1}{\hat\lambda_{k_n}}\Big) \sum_{j=1}^{k_n} a_j \right\}.$$
Proof. We first define the following operator $\tilde\Gamma_{k_n}$ on H:
$$\tilde\Gamma_{k_n} = \sum_{j=1}^{k_n} \lambda_j\, \hat V_j \otimes_H \hat V_j.$$
We have
$$\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \le \|\Delta\,\Pi_{k_n}\|_{H'}\, \big\|\big(\Pi_{k_n}\Gamma\Pi_{k_n}\big)^{-1} - \tilde\Gamma_{k_n}^{-1}\big\|_{\infty} + \big\|\Delta\,\Pi_{k_n}\,\tilde\Gamma_{k_n}^{-1} - \hat\Psi_{k_n}\big\|_{H'}. \qquad (10)$$
Since $\|\Pi_{k_n}\|_{\infty} = 1$, and then using Lemma 3.1 in Bosq (1991), we have
$$\|\Delta\,\Pi_{k_n}\|_{H'}\, \big\|\big(\Pi_{k_n}\Gamma\Pi_{k_n}\big)^{-1} - \tilde\Gamma_{k_n}^{-1}\big\|_{\infty} \le \frac{2\|\Delta\|_{H'}}{\lambda_{k_n}} \sum_{j=1}^{k_n} \|V_j' - \hat V_j\|_H \le \frac{2\|\Delta\|_{H'}}{\lambda_{k_n}}\, \|\Gamma - \Gamma_n\|_{\infty} \sum_{j=1}^{k_n} a_j. \qquad (11)$$
The second term in (10) gives us
$$\big\|\Delta\,\Pi_{k_n}\,\tilde\Gamma_{k_n}^{-1} - \hat\Psi_{k_n}\big\|_{H'} \le \|\Delta\,\Pi_{k_n}\|_{H'}\, \big\|\tilde\Gamma_{k_n}^{-1} - \big(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n}\big)^{-1}\big\|_{\infty} + \big\|\Delta_n\,\hat\Pi_{k_n} - \Delta\,\Pi_{k_n}\big\|_{H'}\, \big\|\big(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n}\big)^{-1}\big\|_{\infty}.$$
Then, by Lemma 3.1 in Bosq (1991), and since the functions $\hat V_j$ are orthonormal and $\|\hat\Pi_{k_n}\|_{\infty} = 1$, we find
$$\Big(\big\|\tilde\Gamma_{k_n}^{-1} - \big(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n}\big)^{-1}\big\|_{\infty}\Big)^2 = \sup_{\|x\|_H=1} \Big\| \sum_{j=1}^{k_n} \Big(\frac{1}{\lambda_j} - \frac{1}{\hat\lambda_j}\Big)\, \hat V_j \otimes_H \hat V_j\,(x) \Big\|_H^2 \le \sup_{\|x\|_H=1} \sum_{j=1}^{k_n} \frac{(\lambda_j - \hat\lambda_j)^2}{(\lambda_j\hat\lambda_j)^2}\, \big\|\hat V_j \otimes_H \hat V_j\,(x)\big\|_H^2 \le \frac{\|\Gamma - \Gamma_n\|_{\infty}^2}{(\lambda_{k_n}\hat\lambda_{k_n})^2}. \qquad (12)$$
It is easy to see that
$$\big\|\big(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n}\big)^{-1}\big\|_{\infty} = \hat\lambda_{k_n}^{-1}. \qquad (13)$$
Finally, with the same arguments as above, we have
$$\big\|\Delta_n\,\hat\Pi_{k_n} - \Delta\,\Pi_{k_n}\big\|_{H'} \le 2\|\Delta\|_{H'}\, \|\Gamma - \Gamma_n\|_{\infty} \sum_{j=1}^{k_n} a_j + \|\Delta - \Delta_n\|_{H'}. \qquad (14)$$
Using (11)–(14) in (10) gives us the lemma.
5.1. Convergence in probability
The following lemma gives us the mean square convergence for the operators $\Gamma_n$ and $\Delta_n$.

Lemma 5.2. If X satisfies $(\mathrm{H}_2)$, then
$$E\|\Gamma - \Gamma_n\|_{\infty}^2 \le \frac{E\|X\|_H^4}{n}, \qquad (15)$$
$$E\|\Delta - \Delta_n\|_{H'}^2 \le \frac{\|\Psi\|_{H'}^2\, E\|X\|_H^4}{n} + \frac{\sigma^2\, E\|X\|_H^2}{n}. \qquad (16)$$
Proof. The proof of (15) is similar to that of its analogue in the real case. Since $E\Delta_n = \Delta$, we have
$$E\|\Delta - \Delta_n\|_{H'}^2 = E\|\Delta_n\|_{H'}^2 - \|\Delta\|_{H'}^2, \qquad (17)$$
and with the independence of the $X_i$'s we get
$$E\|\Delta_n\|_{H'}^2 = \frac{1}{n}\sum_{j\in\mathbb{N}} E\big(\langle X, e_j\rangle_H\, Y\big)^2 + \|\Delta\|_{H'}^2 - \frac{1}{n}\|\Delta\|_{H'}^2. \qquad (18)$$
Now, we can write
$$\frac{1}{n}\sum_{j\in\mathbb{N}} E\big(\langle X, e_j\rangle_H\, Y\big)^2 \le \frac{\|\Psi\|_{H'}^2\, E\|X\|_H^4 + \sigma^2\, E\|X\|_H^2}{n}. \qquad (19)$$
The result is a consequence of (18) and (19).
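A quick Monte Carlo sanity check of the 1/n rate in (15), using the Brownian example of Section 4 and the (squared, quadrature-weighted) Frobenius norm of the discretized kernels as a crude stand-in for $\|\Gamma - \Gamma_n\|_{\infty}^2$ (grid size, sample sizes and number of replications below are arbitrary):
```python
import numpy as np

rng = np.random.default_rng(4)
p = 50
t = np.linspace(0.0, 1.0, p)
dt = t[1] - t[0]
gamma = np.minimum.outer(t, t)                   # Brownian covariance kernel: min(s, t)

for n in (50, 200, 1000):
    errs = []
    for _ in range(200):                         # Monte Carlo replications
        inc = rng.normal(scale=np.sqrt(dt), size=(n, p))
        inc[:, 0] = 0.0
        X = np.cumsum(inc, axis=1)
        gamma_n = (X.T @ X) / n
        errs.append((np.linalg.norm(gamma_n - gamma, "fro") * dt) ** 2)
    print(n, np.mean(errs))                      # should decrease roughly like 1/n
```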
Let us now consider the following event:
$$E_n = \Big\{ \frac{\lambda_{k_n}}{2} < \hat\lambda_{k_n} < \frac{3\lambda_{k_n}}{2} \Big\}. \qquad (20)$$
On $E_n$ we have, with Lemma 5.1,
$$\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \le \eta_n\, \|\Delta\|_{H'}\, \|\Gamma - \Gamma_n\|_{\infty} + \frac{2}{\lambda_{k_n}}\, \|\Delta - \Delta_n\|_{H'},$$
where now
$$\eta_n = \frac{2}{\lambda_{k_n}^2} + \frac{6}{\lambda_{k_n}} \sum_{j=1}^{k_n} a_j$$
(on $E_n$ one has $1/\hat\lambda_{k_n} < 2/\lambda_{k_n}$, so this $\eta_n$, multiplied by $\|\Delta\|_{H'}$, bounds the $\eta_n$ of Lemma 5.1). It follows that
$$P\big(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon,\ E_n\big) \le P\Big( \|\Gamma - \Gamma_n\|_{\infty} > \frac{\epsilon}{2\eta_n \|\Delta\|_{H'}} \Big) + P\Big( \|\Delta - \Delta_n\|_{H'} > \frac{\epsilon\,\lambda_{k_n}}{4} \Big). \qquad (21)$$
Then, we have
$$P\big(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon,\ E_n\big) \le \frac{4\,\eta_n^2\, \|\Delta\|_{H'}^2}{\epsilon^2}\, E\|\Gamma - \Gamma_n\|_{\infty}^2 + \frac{16}{\lambda_{k_n}^2\, \epsilon^2}\, E\|\Delta - \Delta_n\|_{H'}^2. \qquad (22)$$
Otherwise, we get
$$P(\bar E_n) \le P\Big( \|\Gamma - \Gamma_n\|_{\infty} > \frac{\lambda_{k_n}}{2} \Big) \le \frac{4}{\lambda_{k_n}^2}\, E\|\Gamma - \Gamma_n\|_{\infty}^2. \qquad (23)$$
We get, with (22), (23) and Lemma 5.2,
$$P\big(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon\big) \le \frac{4\,\|\Delta\|_{H'}^2\, E\|X\|_H^4\, \eta_n^2}{\epsilon^2\, n} + \big(\|\Psi\|_{H'}^2\, E\|X\|_H^4 + \sigma^2\, E\|X\|_H^2\big)\, \frac{16}{\epsilon^2\, n\,\lambda_{k_n}^2} + 4\, E\|X\|_H^4\, \frac{1}{n\,\lambda_{k_n}^2}. \qquad (24)$$
It suffices to use (9), (24), $(\mathrm{H}_2)$ and $(\mathrm{H}_3)$ to get the proof of Theorem 3.1.
5.2. Almost sure convergence
(1) We give in the following lemma bounds for $P(\|\Gamma_n - \Gamma\|_{\infty} > \epsilon)$ and $P(\|\Delta_n - \Delta\|_{H'} > \epsilon)$.

Lemma 5.3. Under $(\mathrm{H}_4)$ and $(\mathrm{H}_5)$ we have
$$\text{(a)}\qquad P\big(\|\Gamma_n - \Gamma\|_{\infty} > \epsilon\big) \le 2\exp\Big( -\frac{\epsilon^2 n}{2\,c_3\,(c_3 + \epsilon c_4)} \Big),$$
$$\text{(b)}\qquad P\big(\|\Delta_n - \Delta\|_{H'} > \epsilon\big) \le 2\exp\Big( -\frac{\epsilon^2 n}{2\,c_1 c_2\,(c_1 c_2 + \epsilon c_4)} \Big),$$
where $c_1$ and $c_2$ are defined in $(\mathrm{H}_4)$ and $(\mathrm{H}_5)$ and where $c_3$ and $c_4$ are positive constants.

Proof. The lemma is a consequence of the corollary of Yurinski (1976, p. 491). Let us define, for $1 \le i \le n$,
$$Z_i = X_i \otimes_H X_i - \Gamma.$$
It is obvious that $E(Z_i) = 0$. Hypothesis $(\mathrm{H}_4)$ implies that
$$\|Z_i\|_{\mathcal H} \le \|X_i\|_H^2 + E\big(\|X_i\|_H^2\big) \le 2 c_1^2 \quad \text{a.s.}$$
This last inequality implies, with $c_3 = 2 c_1^2$, that
$$E\big(\|Z_i\|_{\mathcal H}^m\big) \le c_3^m \le \frac{m!}{2}\, b_i^2\, c_3^{m-2}, \qquad \forall m \ge 2,$$
with $b_i = c_3$. Let $B_n^2 = \sum_{i=1}^n b_i^2 = n c_3^2$; applying now Yurinski's result to $(Z_i)_{i=1,\ldots,n}$, we get part (a) of the lemma, since for $\epsilon > 0$
$$P\big(\|\Gamma_n - \Gamma\|_{\infty} > \epsilon\big) \le P\big(\|\Gamma_n - \Gamma\|_{\mathcal H} > \epsilon\big) = P\Big( \Big\|\sum_{i=1}^n Z_i\Big\|_{\mathcal H} > \frac{\epsilon\sqrt{n}}{c_3}\, B_n \Big) \le 2\exp\Bigg( -\frac{\epsilon^2 n}{2 c_3^2\big(1 + \frac{\epsilon c_4}{c_3}\big)} \Bigg).$$
Part (b) can be shown in the same way, using Yurinski's corollary for the sequence $(U_i)_{i=1,\ldots,n}$ defined as
$$U_i = X_i \otimes_{H'} \varepsilon_i, \qquad i = 1, \ldots, n.$$
(2) Lemma 5.3 allows us to write
$$P\Big( \|\Gamma_n - \Gamma\|_{\infty} > \frac{\epsilon}{2\eta_n\|\Delta\|_{H'}} \Big) \le 2\exp\Bigg( -\frac{\epsilon^2}{8\|\Delta\|_{H'}^2\, c_3\Big(c_3 + \frac{\epsilon c_4}{2\eta_n\|\Delta\|_{H'}}\Big)}\cdot \frac{n}{\eta_n^2} \Bigg).$$
Then, we get
$$P\Big( \|\Gamma_n - \Gamma\|_{\infty} > \frac{\epsilon}{2\eta_n\|\Delta\|_{H'}} \Big) \le 2\exp\Big( -A\,\frac{n}{\eta_n^2} \Big), \qquad (25)$$
where A is a positive constant independent of n, since
$$\forall n \in \mathbb{N}^*, \qquad \eta_n > \frac{1}{\lambda_1^2}.$$
Analogously, we have
$$P\Big( \|\Delta_n - \Delta\|_{H'} > \frac{\epsilon\,\lambda_{k_n}}{4} \Big) \le 2\exp\big( -B\, n\,\lambda_{k_n}^2 \big), \qquad (26)$$
where B is a positive constant independent of n. Moreover, with $E_n$ defined as in (20), we have
$$P(\bar E_n) \le 2\exp\Big( -\frac{n\,\lambda_{k_n}^2}{4\,c_3\,(2 c_3 + c_4\,\lambda_{k_n})} \Big).$$
This implies that
$$P(\bar E_n) \le 2\exp\big( -C\, n\,\lambda_{k_n}^2 \big), \qquad (27)$$
where C is a positive constant independent of n. Finally, with (25)–(27) and using decomposition (21), we obtain the following inequality:
$$P\big(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon\big) \le 2\Big[ \underbrace{\exp\Big(-A\,\frac{n}{\eta_n^2}\Big)}_{u_n} + \underbrace{\exp\big(-B\,n\,\lambda_{k_n}^2\big)}_{v_n} + \underbrace{\exp\big(-C\,n\,\lambda_{k_n}^2\big)}_{w_n} \Big].$$
It suffices now to show that $u_n$, $v_n$ and $w_n$ are the general terms of convergent series. Let us remark that
$$-\frac{\log u_n}{\log n} = A\, \frac{n}{\eta_n^2\, \log n}.$$
Now,
$$\frac{\eta_n^2\, \log n}{n} = 4\,\frac{\log n}{n\,\lambda_{k_n}^4} + 36\, \frac{\big(\sum_{j=1}^{k_n} a_j\big)^2 \log n}{n\,\lambda_{k_n}^2} + 24\, \frac{\big(\sum_{j=1}^{k_n} a_j\big) \log n}{n\,\lambda_{k_n}^3},$$
and with $(\mathrm{H}_6)$ we get
$$\lim_{n\to+\infty} \frac{\log n}{n\,\lambda_{k_n}^4} = 0 \qquad\text{and}\qquad \lim_{n\to+\infty} \frac{\big(\sum_{j=1}^{k_n} a_j\big)^2 \log n}{n\,\lambda_{k_n}^2} = 0,$$
which implies that
$$\lim_{n\to+\infty} \frac{\big(\sum_{j=1}^{k_n} a_j\big) \log n}{n\,\lambda_{k_n}^3} = 0,$$
and also the convergence of the series $\sum u_n$, since
$$\lim_{n\to+\infty} -\frac{\log u_n}{\log n} = +\infty.$$
The result for the sequences $v_n$ and $w_n$ is obtained in the same way. We then get
$$\sum_{n\in\mathbb{N}^*} P\big(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon\big) < +\infty,$$
which gives us, with the Borel–Cantelli Lemma,
$$\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \xrightarrow[n\to+\infty]{} 0 \quad \text{a.s.} \qquad (28)$$
The proof of Theorem 3.2 is complete with (8), (9) and (28).
References
Ash, R.B., Gardner, M.F., 1975. Topics in Stochastic Processes. Academic Press, New York.
Besse, P., Cardot, H., 1996. Approximation spline de la prévision d'un processus fonctionnel autorégressif d'ordre 1. Revue Canadienne de Statistique/Canad. J. Statist. 24, 467–487.
Bosq, D., 1991. Modelization, non-parametric estimation and prediction for continuous time processes. In: Roussas, G. (Ed.), Nonparametric Functional Estimation and Related Topics, NATO ASI Series, pp. 509–529.
Cardot, H., 1998. Convergence du lissage spline de la prévision des processus autorégressifs fonctionnels. C.R. Acad. Sci. Paris, Sér. I, t. 326, 755–758.
Cardot, H., Ferraty, F., Sarda, P., 1998. Modèle linéaire fonctionnel. Publ. Lab. Statist. Probab. 04:98, Toulouse, France.
Dauxois, J., Pousse, A., Romain, Y., 1982. Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal. 12, 136–154.
Frank, I.E., Friedman, J.H., 1993. A statistical view of some chemometrics regression tools. Technometrics 35, 109–148.
Grenander, U., 1963. Probabilities on Algebraic Structures. Almqvist & Wiksell, Stockholm.
Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized discriminant analysis. Ann. Statist. 23, 73–102.
Hastie, T., Mallows, C., 1993. A discussion of "A Statistical View of Some Chemometrics Regression Tools" by I.E. Frank and J.H. Friedman. Technometrics 35, 140–143.
Marx, B.D., Eilers, P.H., 1996. Generalized linear regression on sampled signals with penalized likelihood. In: Forcina, A., Marchetti, G.M., Hatzinger, R., Galmacci, G. (Eds.), Statistical Modelling, Proceedings of the Eleventh International Workshop on Statistical Modelling, Orvieto.
Ramsay, J.O., Silverman, B.W., 1997. Functional Data Analysis. Springer, Berlin.
Vieu, P., 1995. Order choice in nonlinear autoregressive models. Statistics 26, 307–328.
Yurinski, V.V., 1976. Exponential inequalities for sums of random vectors. J. Multivariate Anal. 6, 473–499.