Statistics & Probability Letters 45 (1999) 11-22
www.elsevier.nl/locate/stapro

Functional linear model

Hervé Cardot (a), Frédéric Ferraty (b), Pascal Sarda (b,*)

(a) Unité Biométrie et Intelligence Artificielle, INRA Toulouse, BP 27, 31326 Castanet-Tolosan Cedex, France
(b) Laboratoire de Statistique et Probabilités, UMR CNRS C5583, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex, France

Received June 1998; received in revised form December 1998

* Corresponding author. E-mail address: [email protected] (P. Sarda).

Abstract

In this paper we study a regression model in which the explanatory variables are sampled points of a continuous-time process. We propose an estimator of the regression by means of a functional principal component analysis, analogous to the one introduced by Bosq [(1991) NATO ASI Series, pp. 509-529] in the case of Hilbertian AR processes. Both convergence in probability and almost sure convergence of this estimator are established. (c) 1999 Elsevier Science B.V. All rights reserved.

Keywords: Functional linear model; Functional data analysis; Hilbert spaces; Convergence

1. Introduction

Classical regression models, such as generalized linear models, may be inadequate in some statistical studies: this is the case when the explanatory variables are digitized points of a curve. Examples can be found in different fields of application such as chemometrics (Frank and Friedman, 1993), linguistics (Hastie et al., 1995) and many other areas (see Hastie and Mallows, 1993; Ramsay and Silverman, 1997, among others). In this context, Frank and Friedman (1993) describe and compare different estimation procedures (Partial Least Squares, Ridge Regression and Principal Component Regression) which take into account both the number of explanatory variables (which may exceed the sample size) and the high correlation between these variables. On the other hand, several authors (see below) have developed models which describe the "functional" nature of the explanatory variables.

Formally, the above situation can be described through the following functional linear model. Let $Y$ be a real random variable (r.r.v.) and $X = (X(t),\ t \in [0,1])$ a continuous-time process defined on the same probability space $(\Omega, \mathcal{A}, P)$. Assuming that $E(\int_0^1 X^2(t)\,dt) < \infty$, the dependence between $X$ and $Y$ is expressed as

    Y = \int_0^1 \alpha(t) X(t)\,dt + \varepsilon,    (1)

where $\alpha$ is a square integrable function defined on $[0,1]$ and $\varepsilon$ is an r.r.v. independent of $X$ with zero mean and variance $\sigma^2$. Hastie and Mallows (1993) introduce an estimator of the function $\alpha$ based on the minimization of a cubic spline criterion, and Marx and Eilers (1996) use a smooth basis of B-splines together with a difference penalty on the log-likelihood in order to derive a P-spline estimator of $\alpha$. Alternatively, model (1) can be generalized to the case where $X$ is a random variable valued in a real separable Hilbert space $H$; the relation between $X$ and $Y$ is then written as

    Y = \Psi(X) + \varepsilon,    (2)

where $\Psi$ is an element of $H'$, the space of real-valued continuous linear operators defined on $H$. Following ideas of Bosq (1991) in the case of ARH processes, we propose in Section 2 below an estimator of the operator $\Psi$.
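As a concrete illustration of model (1) (not part of the original paper), data can be simulated by discretizing $X$ on a grid of $[0,1]$ and replacing the integral by a Riemann sum. In the sketch below, $X$ is a standard Brownian motion; the coefficient function alpha and the noise level are illustrative choices only, and all names are ours.

```python
import numpy as np

def simulate_flm(n, m=100, noise_ratio=0.2, seed=None):
    """Simulate n observations from Y = int_0^1 alpha(t) X(t) dt + eps,
    with X a standard Brownian motion observed on m equispaced points."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, m)
    # Brownian motion: cumulative sum of independent Gaussian increments.
    X = np.cumsum(rng.normal(scale=np.sqrt(1.0 / m), size=(n, m)), axis=1)
    alpha = np.sin(np.pi * t / 2)             # illustrative coefficient function
    signal = (X * alpha).sum(axis=1) / m      # Riemann approximation of the integral
    eps = rng.normal(scale=np.sqrt(noise_ratio * signal.var()), size=n)
    return t, X, signal + eps
```

The noise variance is taken proportional to the variance of the signal, mimicking the choice made in the simulation study of Section 4.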
The proposed estimator is based on the spectral analysis of the empirical second moment operator of $X$, which is inverted on the space spanned by the $k_n$ eigenvectors associated with the $k_n$ greatest eigenvalues. The main results, namely convergence in probability and almost sure convergence of this estimator, are stated in Section 3. Computational aspects of the method are discussed in Section 4 through a simulation study. A sketch of the proofs is given in Section 5 (detailed proofs may be found in Cardot et al., 1998).

2. Definition of the estimator

The inner product and norm in $H$ are denoted by $\langle \cdot,\cdot \rangle_H$ and $\|\cdot\|_H$, respectively. The usual norm $\|\cdot\|_{H'}$ in $H'$ is defined as

    \|T\|_{H'} = \sup_{\|x\|_H = 1} |Tx|,  \quad \forall T \in H',

and satisfies

    \|T\|_{H'} = \Big( \sum_{i \in \mathbb{N}} (T e_i)^2 \Big)^{1/2},  \quad \forall T \in H',

where $(e_i)_{i \in \mathbb{N}}$ is an orthonormal basis of $H$. Assuming that the Hilbertian variable $X$ satisfies

    E\|X\|_H^2 = \int_\Omega \|X(\omega)\|_H^2 \, dP(\omega) < +\infty,

we define (cf. Grenander, 1963), from Riesz's theorem, the second moment operator $\Gamma$ of $X$ by

    \Gamma(x) = E(X \otimes_H X\,(x)) = E(\langle X, x \rangle_H X),  \quad \forall x \in H.

The operator $\Gamma$ is nuclear (and therefore a Hilbert-Schmidt operator), self-adjoint and positive. Let us define the cross second moment operator $\Delta$ between $X$ and $Y$ by

    \Delta(x) = E(X \otimes_{H'} Y\,(x)) = E(\langle X, x \rangle_H Y),  \quad \forall x \in H.

By relation (2) it is easy to see that

    \Delta = \Psi \Gamma,    (3)

since, $\varepsilon$ being centered and independent of $X$, $\Delta(x) = E(\langle X, x\rangle_H \Psi(X)) = \Psi(E(\langle X, x\rangle_H X)) = \Psi(\Gamma x)$ for all $x \in H$.

In general the inverse of $\Gamma$ does not exist and, even if it does, since $\Gamma$ is nuclear its inverse is not bounded when $H$ is infinite dimensional. Direct estimation of $\Gamma^{-1}$ is then problematic. Nevertheless, in order to estimate $\Psi$, one can think of projecting the observations onto a finite-dimensional subspace of $H$ (with dimension depending on $n$).

Let $(X_i, Y_i)$, $i = 1, \dots, n$, be a sample from $(X, Y)$. Empirical versions of the operators $\Gamma$ and $\Delta$ are defined by

    \Gamma_n = \frac{1}{n} \sum_{i=1}^n X_i \otimes_H X_i,    (4)

    \Delta_n = \frac{1}{n} \sum_{i=1}^n X_i \otimes_{H'} Y_i.    (5)

Let us denote by $\hat\lambda_1 \geq \hat\lambda_2 \geq \cdots \geq \hat\lambda_n \geq 0 = \hat\lambda_{n+1} = \cdots$ the eigenvalues of $\Gamma_n$ and by $\hat V_1, \hat V_2, \dots$ orthonormal eigenvectors associated with these eigenvalues. Let $(k_n)_{n \in \mathbb{N}^*}$ be a sequence of positive integers such that $\lim_{n\to\infty} k_n = +\infty$ with $k_n \leq n$, and let $\hat H_{k_n}$ be the space spanned by $\{\hat V_j,\ j = 1, \dots, k_n\}$. Let $\hat\Pi_{k_n}$ be the orthogonal projection of $H$ onto this subspace,

    \hat\Pi_{k_n} = \sum_{j=1}^{k_n} \hat V_j \otimes_H \hat V_j.

If $\hat\lambda_{k_n} > 0$, we define an estimator $\hat\Psi_{k_n}$ of $\Psi$ as

    \hat\Psi_{k_n} = \Delta_n \hat\Pi_{k_n} (\hat\Pi_{k_n} \Gamma_n \hat\Pi_{k_n})^{-1}.    (6)

Remark. Projecting the observations onto the space $\hat H_{k_n}$ spanned by eigenvectors of $\Gamma_n$ leads to an "optimal" linear representation of the $X_i$ with respect to the explained variance (see Dauxois et al., 1982).

3. Main results

In order to state the main results of the paper, let us introduce the following condition:

    (H0)  \hat\lambda_1 > \hat\lambda_2 > \cdots > \hat\lambda_{k_n} > 0  \ \text{a.s.},

which ensures almost surely that $\hat\Pi_{k_n} \Gamma_n \hat\Pi_{k_n}$ is regular and that its eigenvectors are identifiable. Let $(\lambda_j)_{j \in \mathbb{N}^*}$ denote the decreasing sequence of eigenvalues of $\Gamma$ and define

    a_1 = \frac{2\sqrt{2}}{\lambda_1 - \lambda_2},  \qquad  a_j = \frac{2\sqrt{2}}{\min(\lambda_{j-1} - \lambda_j,\ \lambda_j - \lambda_{j+1})}  \ \text{if } j \neq 1.

Theorem 3.1. Suppose that (H0) and the following hypotheses are satisfied:

    (H1)  \lambda_1 > \lambda_2 > \cdots > 0;

    (H2)  E\|X\|_H^4 < +\infty;

    (H3)  \lim_{n\to+\infty} n\lambda_{k_n}^4 = +\infty  \quad\text{and}\quad  \lim_{n\to+\infty} \frac{n\lambda_{k_n}^2}{(\sum_{j=1}^{k_n} a_j)^2} = +\infty.

Then

    \|\hat\Psi_{k_n} - \Psi\|_{H'} \to 0  \ \text{in probability as } n \to +\infty.

Theorem 3.2. Suppose that (H0), (H1) and the following hypotheses are satisfied:

    (H4)  \|X\|_H \leq c_1  \ \text{a.s.};

    (H5)  |\varepsilon| \leq c_2  \ \text{a.s.};

    (H6)  \lim_{n\to+\infty} \frac{n\lambda_{k_n}^4}{\log n} = +\infty  \quad\text{and}\quad  \lim_{n\to+\infty} \frac{n\lambda_{k_n}^2}{(\sum_{j=1}^{k_n} a_j)^2 \log n} = +\infty.

Then

    \|\hat\Psi_{k_n} - \Psi\|_{H'} \to 0  \ \text{a.s. as } n \to +\infty.

Remark. If $k_n = o(\log n)$ and $\lambda_j = a r^j$ with $a > 0$, $r \in\, ]0,1[$, or $\lambda_j = a j^{-\tau}$ with $a > 0$, $\tau > 1$, then (H3) and (H6) are satisfied. In other words, if the eigenvalues of $\Gamma$ decrease at a geometric (exponential) or polynomial rate, the conclusions of Theorems 3.1 and 3.2 hold provided the sequence $k_n$ tends to infinity slowly enough. The same kind of hypotheses is also introduced in Bosq (1991) and Cardot (1998), where it allows rates of convergence to be obtained.
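For curves observed on a finite grid, as in the simulation study of the next section, estimator (6) amounts to a principal component regression on the leading empirical eigenfunctions of $\Gamma_n$. The following is a minimal sketch of that computation, not code from the paper; the rectangle quadrature rule, the variable names and the NumPy conventions are our own assumptions.

```python
import numpy as np

def fpca_regression(X, Y, k):
    """Discretized version of estimator (6): principal component regression
    on the first k empirical eigenfunctions of the second moment operator.
    X: (n, m) array of curves on an equispaced grid of [0, 1]; Y: (n,) responses.
    Returns the estimated coefficient function on the same grid."""
    n, m = X.shape
    h = 1.0 / m                                # quadrature weight: <u, v> ~ h * u @ v
    G = (h / n) * X.T @ X                      # matrix of Gamma_n acting on grid values
    lam, U = np.linalg.eigh(G)                 # eigenvalues in ascending order
    lam, U = lam[::-1][:k], U[:, ::-1][:, :k]  # keep the k largest
    V = U / np.sqrt(h)                         # rescale to L2-orthonormal eigenfunctions
    d = X.T @ Y / n                            # Riesz representer of Delta_n
    scores = h * V.T @ d                       # Delta_n(V_j) = <d, V_j>
    return V @ (scores / lam)                  # alpha_hat = sum_j Delta_n(V_j)/lam_j * V_j
```

For a new curve x sampled on the same grid, the predicted response is then the discretized inner product h * x @ alpha_hat; positivity of the k-th empirical eigenvalue, as required in (H0), is assumed.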
4. A simulation study

We simulated samples $(X_i, Y_i)$, $i = 1, \dots, n$, from model (1) in which $X(t)$ is a Brownian motion defined on $[0,1]$ and $\varepsilon$ is normal with mean 0 and variance $0.2\,\mathrm{var}(\Psi(X))$. The Hilbert space $H$ is $L^2[0,1]$ and the eigenelements of the covariance operator of $X$ are known to be (see Ash and Gardner, 1975)

    \lambda_j = \frac{1}{\pi^2 (j - 0.5)^2},  \qquad  V_j(t) = \sqrt{2}\,\sin\{\pi (j - 0.5) t\},  \quad t \in [0,1],\ j = 1, 2, \dots

In that case, assumption (H3) (respectively (H6)) on the sequence of eigenvalues in Theorem 3.1 (respectively Theorem 3.2) is fulfilled provided the dimension $k_n$ tends slowly enough to infinity, i.e. under the constraint $k_n = o(\log n)$ (see the remark at the end of Section 3). We ran simulations for two different functions $\alpha$:

    \alpha_1(t) = \sin(\pi t/2) + 0.5 \sin(3\pi t/2) + 0.25 \sin(5\pi t/2),

    \alpha_2(t) = \sin(4\pi t).

In the first case, the function $\alpha_1$ is a linear combination of the eigenfunctions associated with the three greatest eigenvalues of $\Gamma$, so that the best dimension $k_n$ should be 3. We tried several sample sizes in each case: $n = 50, 200, 1000$. To handle the Brownian random functions $X_i(t)$ in practice, their sample paths were discretized at 100 equispaced points in $[0,1]$. The aim of our study is to look for the best dimension $k_n$ for the estimation procedure, so we considered the following error criterion:

    R(\hat\alpha_{k_n}) = \frac{\int_0^1 (\alpha(t) - \hat\alpha_{k_n}(t))^2 \, dt}{\int_0^1 \alpha^2(t) \, dt}.

Table 1 (resp. Table 2) gives the quadratic errors of the estimators of $\alpha_1$ (resp. $\alpha_2$) for each sample size and different dimensions $k_n$.

Table 1. Quadratic error for the estimate of $\alpha_1$

    k_n    n = 50    n = 200    n = 1000
     2      3.33      6.58       2.99
     3      5.46      3.93       1.76
     4     11.27      3.93       1.79
     5     10.5       3.92       1.96
     6     74.3       3.97       3.72

Table 2. Quadratic error for the estimate of $\alpha_2$

    k_n    n = 50    n = 200    n = 1000
     4     81.8      40.8       15.76
     5      9.9      18.14       1.9
     6     13.6       6.02       1.02
     7     15         5.58       0.92
     8     19.2       4.38       0.49
     9     51.6       7.64       0.54
    10     53        11.62       1.71

In each case, one can notice that $R(\hat\alpha_{k_n})$ behaves like a convex function of the dimension $k_n$: too large a value of $k_n$ gives bad estimates of $\alpha$ because it increases the variance of the estimate. It also appears that, for the first example, the best dimension selected by the estimation procedure is reasonably close to the theoretical "optimal" dimension. This last point illustrates the good behaviour of our estimator. In a real-life study this quadratic error criterion cannot be computed and, on the other hand, it is clear from the above simulation that the quality of the estimator depends considerably on the choice of the dimension $k_n$. A data-driven selection method such as penalized cross-validation may be used in that case (see Vieu, 1995, who uses such a criterion for the order choice in nonlinear autoregressive models); one possible form of such a selection rule is sketched below.
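As one possible data-driven choice of $k_n$ (an illustration of the cross-validation idea mentioned above, not the authors' procedure), the dimension can be selected by minimizing a cross-validated prediction error, reusing the fpca_regression sketch shown before this section; names and defaults are ours.

```python
import numpy as np

def select_dimension(X, Y, candidates, n_folds=5, seed=None):
    """Pick the dimension k minimizing a K-fold cross-validated prediction error.
    X: (n, m) curves on an equispaced grid of [0, 1]; Y: (n,) responses."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    h = 1.0 / m
    folds = np.array_split(rng.permutation(n), n_folds)
    cv_error = {}
    for k in candidates:
        errors = []
        for test in folds:
            train = np.setdiff1d(np.arange(n), test)
            alpha_hat = fpca_regression(X[train], Y[train], k)
            pred = h * X[test] @ alpha_hat     # discretized <X_i, alpha_hat>
            errors.append(np.mean((Y[test] - pred) ** 2))
        cv_error[k] = np.mean(errors)
    return min(cv_error, key=cv_error.get), cv_error
```

The paper itself points to penalized criteria (Vieu, 1995); adding to cv_error[k] a penalty increasing with k would be a natural variant of this sketch.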
We have drawn the estimates $\hat\alpha_{k_n}$ of the function $\alpha_1$. Fig. 1 shows the good performance of the estimation procedure for reasonably large sample sizes. For smaller sample sizes the estimator shows rough features, even if the general shape of the function is recovered. We think that this aspect of the estimator could be corrected by introducing a preliminary smoothing procedure such as in Besse and Cardot (1996). We will investigate this topic in a further study.

5. Proof of theorems

Let $(V_j)_{j \in \mathbb{N}^*}$ be a sequence of orthonormal eigenvectors associated with $(\lambda_j)_{j \in \mathbb{N}^*}$ and let us define the operator $\Psi_{k_n} \in H'$, the "theoretical" version of $\hat\Psi_{k_n}$, by

    \Psi_{k_n} = \Delta \Pi_{k_n} (\Pi_{k_n} \Gamma \Pi_{k_n})^{-1},    (7)

where $\Pi_{k_n}$ is the orthogonal projection onto the space $H_{k_n}$ spanned by $V_1, \dots, V_{k_n}$. First of all, let us remark that

    \|\Psi - \hat\Psi_{k_n}\|_{H'} \leq \|\Psi - \Psi_{k_n}\|_{H'} + \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'}.    (8)

For the first term on the right-hand side of inequality (8) we have

    \|\Psi - \Psi_{k_n}\|_{H'}^2 = \sum_{j=1}^{\infty} |(\Psi - \Psi_{k_n})(V_j')|^2 = \sum_{j > k_n} |\Psi(V_j')|^2,

where $V_j' = (\mathrm{sign}\langle \hat V_j, V_j \rangle_H)\, V_j$, $j \geq 1$. Since $\Psi \in H'$, we get

    \|\Psi - \Psi_{k_n}\|_{H'} \to 0  \ \text{as } n \to +\infty.    (9)

We derive in the following lemma an upper bound for $\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'}$; Bosq (1991) proves the analogue of this lemma for an ARH(1) process. Let $\mathcal{H}$ be the space of Hilbert-Schmidt operators defined on $H$. We consider on $\mathcal{H}$ the usual Hilbert-Schmidt norm

    \|U\|_{\mathcal{H}} = \Big( \sum_{i \in \mathbb{N}} \|U e_i\|_H^2 \Big)^{1/2},

or the uniform norm

    \|U\|_\infty = \sup_{\|x\|_H = 1} \|Ux\|_H \ (\leq \|U\|_{\mathcal{H}}).

Lemma 5.1.

    \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \leq \eta_n \|\Gamma - \Gamma_n\|_\infty + \frac{1}{\hat\lambda_{k_n}} \|\Delta - \Delta_n\|_{H'},

where

    \eta_n = \|\Psi\|_{H'} \Big[ \frac{1}{\lambda_{k_n} \hat\lambda_{k_n}} + 2\Big( \frac{1}{\lambda_{k_n}} + \frac{1}{\hat\lambda_{k_n}} \Big) \sum_{j=1}^{k_n} a_j \Big].

Proof. We first define the following operator $\tilde\Gamma_{k_n}$ in $H$:

    \tilde\Gamma_{k_n} = \sum_{j=1}^{k_n} \lambda_j \hat V_j \otimes_H \hat V_j.

We have

    \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \leq \|\Psi_{k_n}\|_{H'} \|(\Pi_{k_n}\Gamma\Pi_{k_n})^{-1} - \tilde\Gamma_{k_n}^{-1}\|_\infty + \|\Psi_{k_n}\tilde\Gamma_{k_n}^{-1} - \hat\Psi_{k_n}\|_{H'}.    (10)

Since $\|\Pi_{k_n}\|_\infty = 1$, and then using Lemma 3.1 in Bosq (1991), we have

    \|\Psi_{k_n}\|_{H'} \|(\Pi_{k_n}\Gamma\Pi_{k_n})^{-1} - \tilde\Gamma_{k_n}^{-1}\|_\infty \leq \frac{2\|\Psi\|_{H'}}{\lambda_{k_n}} \sum_{j=1}^{k_n} \|V_j' - \hat V_j\|_H \leq \frac{2\|\Psi\|_{H'}}{\lambda_{k_n}} \|\Gamma - \Gamma_n\|_\infty \sum_{j=1}^{k_n} a_j.    (11)

The second term in (10) gives

    \|\Psi_{k_n}\tilde\Gamma_{k_n}^{-1} - \hat\Psi_{k_n}\|_{H'} \leq \|\Psi_{k_n}\|_{H'} \|\tilde\Gamma_{k_n}^{-1} - (\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n})^{-1}\|_\infty + \|\Delta_n\hat\Pi_{k_n} - \Psi_{k_n}\|_{H'} \|(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n})^{-1}\|_\infty.

Then, by Lemma 3.1 in Bosq (1991) and since the functions $\hat V_j$ are orthonormal and $\|\hat\Pi_{k_n}\|_\infty = 1$, we find

    \big( \|\tilde\Gamma_{k_n}^{-1} - (\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n})^{-1}\|_\infty \big)^2
      = \sup_{\|x\|_H=1} \Big\| \sum_{j=1}^{k_n} \Big( \frac{1}{\lambda_j} - \frac{1}{\hat\lambda_j} \Big) \hat V_j \otimes_H \hat V_j (x) \Big\|_H^2
      \leq \sup_{\|x\|_H=1} \sum_{j=1}^{k_n} \frac{(\lambda_j - \hat\lambda_j)^2}{(\lambda_j \hat\lambda_j)^2} \|\hat V_j \otimes_H \hat V_j (x)\|_H^2
      \leq \frac{\|\Gamma - \Gamma_n\|_\infty^2}{(\lambda_{k_n}\hat\lambda_{k_n})^2}.    (12)

It is easy to see that

    \|(\hat\Pi_{k_n}\Gamma_n\hat\Pi_{k_n})^{-1}\|_\infty = \frac{1}{\hat\lambda_{k_n}}.    (13)

Finally, with the same arguments as above, we have

    \|\Delta_n\hat\Pi_{k_n} - \Psi_{k_n}\|_{H'} \leq 2\|\Psi\|_{H'} \|\Gamma - \Gamma_n\|_\infty \sum_{j=1}^{k_n} a_j + \|\Delta - \Delta_n\|_{H'}.    (14)

Using (11)-(14) in (10) gives the lemma.
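The proof above invokes Lemma 3.1 of Bosq (1991) in (11), (12) and (14) without restating it. As a reading aid, and paraphrasing rather than quoting that reference, the perturbation bounds used here are of the form

    \sup_{j \geq 1} |\hat\lambda_j - \lambda_j| \leq \|\Gamma_n - \Gamma\|_\infty,
    \qquad
    \|\hat V_j - V_j'\|_H \leq a_j \|\Gamma_n - \Gamma\|_\infty, \quad 1 \leq j \leq k_n,

with $V_j' = (\mathrm{sign}\langle \hat V_j, V_j\rangle_H)\, V_j$ as above; the sign adjustment is needed because eigenvectors are only identified up to sign. See Bosq (1991) for the exact statement and its proof.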
5.1. Convergence in probability

The following lemma gives the mean square convergence of the operators $\Gamma_n$ and $\Delta_n$.

Lemma 5.2. If $X$ satisfies (H2), then

    E\|\Gamma - \Gamma_n\|_\infty^2 \leq \frac{E\|X\|_H^4}{n},    (15)

    E\|\Delta - \Delta_n\|_{H'}^2 \leq \frac{\|\Psi\|_{H'}^2 E\|X\|_H^4 + \sigma^2 E\|X\|_H^2}{n}.    (16)

Proof. The proof of (15) is similar to that of its analogue in the real case. Since $E\Delta_n = \Delta$, we have

    E\|\Delta - \Delta_n\|_{H'}^2 = E\|\Delta_n\|_{H'}^2 - \|\Delta\|_{H'}^2,    (17)

and, by the independence of the $X_i$'s,

    E\|\Delta_n\|_{H'}^2 = \frac{1}{n} \sum_{j \in \mathbb{N}} E(\langle X, e_j \rangle_H Y)^2 + \|\Delta\|_{H'}^2 - \frac{1}{n}\|\Delta\|_{H'}^2.    (18)

Now, we can write

    \frac{1}{n} \sum_{j \in \mathbb{N}} E(\langle X, e_j \rangle_H Y)^2 \leq \frac{\|\Psi\|_{H'}^2 E\|X\|_H^4 + \sigma^2 E\|X\|_H^2}{n}.    (19)

The result is a consequence of (18) and (19).

Let us now consider the following event:

    E_n = \Big\{ \frac{\lambda_{k_n}}{2} < \hat\lambda_{k_n} < \frac{3\lambda_{k_n}}{2} \Big\}.    (20)

On $E_n$, Lemma 5.1 gives

    \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \leq \nu_n \|\Psi\|_{H'} \|\Gamma - \Gamma_n\|_\infty + \frac{2}{\lambda_{k_n}} \|\Delta - \Delta_n\|_{H'},
    \quad\text{where}\quad
    \nu_n = \frac{2}{\lambda_{k_n}^2} + \frac{6}{\lambda_{k_n}} \sum_{j=1}^{k_n} a_j.

It follows that

    P(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon,\ E_n) \leq P\Big( \|\Gamma - \Gamma_n\|_\infty > \frac{\epsilon}{2\nu_n \|\Psi\|_{H'}} \Big) + P\Big( \|\Delta - \Delta_n\|_{H'} > \frac{\epsilon \lambda_{k_n}}{4} \Big).    (21)

Then, we have

    P(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon,\ E_n) \leq \frac{4\nu_n^2 \|\Psi\|_{H'}^2}{\epsilon^2} E\|\Gamma - \Gamma_n\|_\infty^2 + \frac{16}{\lambda_{k_n}^2 \epsilon^2} E\|\Delta - \Delta_n\|_{H'}^2.    (22)

Otherwise, we get

    P(\bar E_n) \leq P\Big( \|\Gamma - \Gamma_n\|_\infty > \frac{\lambda_{k_n}}{2} \Big) \leq \frac{4}{\lambda_{k_n}^2} E\|\Gamma - \Gamma_n\|_\infty^2.    (23)

With (22), (23) and Lemma 5.2 we get

    P(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon)
      \leq \frac{4\|\Psi\|_{H'}^2 E\|X\|_H^4}{\epsilon^2} \cdot \frac{\nu_n^2}{n}
      + \frac{16}{\epsilon^2} \big( \|\Psi\|_{H'}^2 E\|X\|_H^4 + \sigma^2 E\|X\|_H^2 \big) \cdot \frac{1}{n\lambda_{k_n}^2}
      + 4 E\|X\|_H^4 \cdot \frac{1}{n\lambda_{k_n}^2}.    (24)

It suffices to use (9), (24), (H2) and (H3) to complete the proof of Theorem 3.1.

5.2. Almost sure convergence

(1) We give in the following lemma bounds for $P(\|\Gamma_n - \Gamma\|_\infty > \epsilon)$ and $P(\|\Delta_n - \Delta\|_{H'} > \epsilon)$.

Lemma 5.3. Under (H4) and (H5) we have

    (a)  P(\|\Gamma_n - \Gamma\|_\infty > \epsilon) \leq 2 \exp\Big( -\frac{\epsilon^2 n}{2c_3(c_3 + c_4\epsilon)} \Big),

    (b)  P(\|\Delta_n - \Delta\|_{H'} > \epsilon) \leq 2 \exp\Big( -\frac{\epsilon^2 n}{2c_1 c_2 (c_1 c_2 + c_4 \epsilon)} \Big),

where $c_1$ and $c_2$ are defined in (H4) and (H5) and where $c_3$ and $c_4$ are positive constants.

Proof. The lemma is a consequence of the corollary of Yurinski (1976, p. 491). Let us define, for $1 \leq i \leq n$,

    Z_i = X_i \otimes_H X_i - \Gamma.

It is obvious that $E(Z_i) = 0$. Hypothesis (H4) implies that

    \|Z_i\|_{\mathcal{H}} \leq \|X_i\|_H^2 + E(\|X_i\|_H^2) \leq 2c_1^2  \ \text{a.s.}

This last inequality implies, with $c_3 = 2c_1^2$, that

    E(\|Z_i\|_{\mathcal{H}}^m) \leq c_3^m \leq \frac{m!}{2}\, c_3^{m-2} b_i^2,  \quad \forall m \geq 2,

with $b_i = c_3$. Let $B_n^2 = \sum_{i=1}^n b_i^2 = n c_3^2$; applying now Yurinski's result to $(Z_i)_{i=1,\dots,n}$, we get part (a) of the lemma since, for $\epsilon > 0$,

    P(\|\Gamma_n - \Gamma\|_\infty > \epsilon) \leq P(\|\Gamma_n - \Gamma\|_{\mathcal{H}} > \epsilon)
      = P\Big( \Big\| \sum_{i=1}^n Z_i \Big\|_{\mathcal{H}} > \frac{\epsilon \sqrt{n}}{c_3} B_n \Big)
      \leq 2 \exp\Big( -\frac{\epsilon^2 n}{2c_3^2 (1 + \epsilon c_4/c_3)} \Big).

Part (b) can be shown in the same way, using Yurinski's corollary for the sequence $(U_i)_{i=1,\dots,n}$ defined as $U_i = X_i \otimes_{H'} \varepsilon_i$, $i = 1, \dots, n$.

(2) Lemma 5.3 allows us to write

    P\Big( \|\Gamma_n - \Gamma\|_\infty > \frac{\epsilon}{2\nu_n \|\Psi\|_{H'}} \Big) \leq 2 \exp\Big( -\frac{\epsilon^2 n}{8\|\Psi\|_{H'}^2 \nu_n^2\, c_3 (c_3 + c_4 \epsilon/(2\nu_n\|\Psi\|_{H'}))} \Big).

Then, we get

    P\Big( \|\Gamma_n - \Gamma\|_\infty > \frac{\epsilon}{2\nu_n \|\Psi\|_{H'}} \Big) \leq 2 \exp\Big( -A \frac{\epsilon^2 n}{\nu_n^2} \Big),    (25)

where $A$ is a positive constant independent of $n$, since the sequence $(\nu_n)$ is bounded below by a positive constant. Analogously, we have

    P\Big( \|\Delta_n - \Delta\|_{H'} > \frac{\epsilon \lambda_{k_n}}{4} \Big) \leq 2 \exp( -B \epsilon^2 n \lambda_{k_n}^2 ),    (26)

where $B$ is a positive constant independent of $n$. Moreover, with $E_n$ defined as in (20), we have

    P(\bar E_n) \leq 2 \exp\Big( -\frac{n\lambda_{k_n}^2}{4c_3(2c_3 + c_4\lambda_{k_n})} \Big),

which implies that

    P(\bar E_n) \leq 2 \exp( -C n \lambda_{k_n}^2 ),    (27)

where $C$ is a positive constant independent of $n$. Finally, with (25)-(27) and using decomposition (21), we obtain the following inequality:

    P(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon)
      \leq 2 \Big[ \underbrace{\exp\Big(-A\frac{\epsilon^2 n}{\nu_n^2}\Big)}_{u_n}
      + \underbrace{\exp(-B\epsilon^2 n\lambda_{k_n}^2)}_{v_n}
      + \underbrace{\exp(-Cn\lambda_{k_n}^2)}_{w_n} \Big].

It suffices now to show that $u_n$, $v_n$ and $w_n$ are general terms of convergent series. Let us remark that

    -\frac{\log u_n}{\log n} = A\epsilon^2\, \frac{n}{\nu_n^2 \log n}.

Now,

    \frac{\nu_n^2 \log n}{n} = \frac{4 \log n}{n\lambda_{k_n}^4} + \frac{36 (\sum_{j=1}^{k_n} a_j)^2 \log n}{n\lambda_{k_n}^2} + \frac{24 (\sum_{j=1}^{k_n} a_j) \log n}{n\lambda_{k_n}^3},

and with (H6) we get

    \lim_{n\to+\infty} \frac{\log n}{n\lambda_{k_n}^4} = 0  \quad\text{and}\quad  \lim_{n\to+\infty} \frac{(\sum_{j=1}^{k_n} a_j)^2 \log n}{n\lambda_{k_n}^2} = 0,

which implies that

    \lim_{n\to+\infty} \frac{(\sum_{j=1}^{k_n} a_j) \log n}{n\lambda_{k_n}^3} = 0,

and hence the convergence of the series $\sum u_n$, since

    \lim_{n\to+\infty} -\frac{\log u_n}{\log n} = +\infty.

The results for the sequences $v_n$ and $w_n$ are obtained in the same way. We then get

    \sum_{n \in \mathbb{N}^*} P(\|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} > \epsilon) < +\infty,

which gives, by the Borel-Cantelli lemma,

    \|\Psi_{k_n} - \hat\Psi_{k_n}\|_{H'} \to 0  \ \text{a.s.}    (28)

The proof of Theorem 3.2 is complete with (8), (9) and (28).

References

Ash, R.B., Gardner, M.F., 1975. Topics in Stochastic Processes. Academic Press, New York.
Besse, P., Cardot, H., 1996. Approximation spline de la prévision d'un processus fonctionnel autorégressif d'ordre 1. Revue Canadienne de Statistique / Canad. J. Statist. 24, 467-487.
Bosq, D., 1991. Modelization, non-parametric estimation and prediction for continuous time processes. In: Roussas, G. (Ed.), Nonparametric Functional Estimation and Related Topics, NATO ASI Series, pp. 509-529.
Cardot, H., 1998. Convergence du lissage spline de la prévision des processus autorégressifs fonctionnels. C.R. Acad. Sci. Paris, Sér. I, t. 326, 755-758.
Cardot, H., Ferraty, F., Sarda, P., 1998. Modèle linéaire fonctionnel. Publ. Lab. Statist. Probab. 04:98, Toulouse, France.
Dauxois, J., Pousse, A., Romain, Y., 1982. Asymptotic theory for the principal component analysis of a random vector function: some applications to statistical inference. J. Multivariate Anal. 12, 136-154.
Frank, I.E., Friedman, J.H., 1993. A statistical view of some chemometrics regression tools. Technometrics 35, 109-148.
Grenander, U., 1963. Probabilities on Algebraic Structures. Almqvist & Wiksell, Stockholm.
Hastie, T., Buja, A., Tibshirani, R., 1995. Penalized discriminant analysis. Ann. Statist. 23, 73-102.
Hastie, T., Mallows, C., 1993. A discussion of "A Statistical View of Some Chemometrics Regression Tools" by I.E. Frank and J.H. Friedman. Technometrics 35, 140-143.
Marx, B.D., Eilers, P.H., 1996. Generalized linear regression on sampled signals with penalized likelihood. In: Forcina, A., Marchetti, G.M., Hatzinger, R., Galmacci, G. (Eds.), Statistical Modelling, Proceedings of the Eleventh International Workshop on Statistical Modelling, Orvieto.
Ramsay, J.O., Silverman, B.W., 1997. Functional Data Analysis. Springer, Berlin.
Vieu, P., 1995. Order choice in nonlinear autoregressive models. Statistics 26, 307-328.
Yurinski, V.V., 1976. Exponential inequalities for sums of random vectors. J. Multivariate Anal. 6, 473-499.