Proceedings of the International Congress of Mathematicians
Berkeley, California, USA, 1986
A Nonparametric Framework for Statistical Modelling
CHARLES J. STONE
1. Introduction. Much of mathematical statistics deals with inference concerning the unknowns in a stochastic model for a random phenomenon. In
the parametric approach the unknowns are a specific finite number of real parameters. In the nonparametric approach they are functions, perhaps subject
to smoothness or other regularity conditions. These functions can be approximated by means of a flexible finite-dimensional function space. To some extent,
this reduces the nonparametric approach to the parametric approach. But the
asymptotic theory is different when the error of approximation is taken into account and the dimension of the approximating function space is allowed to tend
to infinity with the sample size. We will illustrate this theory by means of three
examples: density estimation, logistic regression, and additive logistic regression.
2. Density estimation. Let $Y$ be a $d$-dimensional random variable taking on values in a known compact subset $C$ of $\mathbf{R}^d$. It is assumed that the distribution of $Y$ has a density function $f$ which is continuous and positive on $C$. By definition $\int_C f = 1$. Set $g = \log f$. Let $S_n$ denote a $p_n$-dimensional vector space of functions on $C$ having basis $B_{nj}$, $1 \le j \le p_n$. It is assumed that $\sum_j B_{nj} = 1$ on $C$ and that no nontrivial linear combination of $B_{nj}$, $1 \le j \le p_n$, is almost everywhere equal to zero on $C$. Given $\theta \in \Theta_n$, the space of $p_n$-dimensional vectors, set
\[ C_n(\theta) = \log \int_C \exp\Bigl(\sum_j \theta_j B_{nj}\Bigr) \]
and
\[ f_n(\,\cdot\,;\theta) = \exp\Bigl(\sum_j \theta_j B_{nj} - C_n(\theta)\Bigr). \]
Then $\int_C f_n(\,\cdot\,;\theta) = 1$ for $\theta \in \Theta_n$. Observe that $f_n(\,\cdot\,;\theta)$, $\theta \in \Theta_n$, is an exponential family in canonical form. Let $\Theta_{n0}$ denote the $(p_n - 1)$-dimensional space consisting of those $\theta \in \Theta_n$ the sum of whose elements is zero.
This research was supported in part by National Science Foundation Grant DMS-8600409.
Let $\theta_n$ denote the unique value of $\theta \in \Theta_{n0}$ that maximizes the expected log-likelihood function $\Lambda_n(\,\cdot\,)$, defined by
\[ \Lambda_n(\theta) = E\Bigl[\sum_j \theta_j B_{nj}(Y) - C_n(\theta)\Bigr] = \sum_j \theta_j \int_C B_{nj} f - C_n(\theta), \qquad \theta \in \Theta_{n0}. \]
Consider the loglinear density approximation $f_n = f_n(\,\cdot\,;\theta_n)$ to $f$.
Let $Y_1, \ldots, Y_n$ be independent random variables each having density $f$. The log-likelihood function for the parametric model is given by
\[ \ell_n(\theta) = \sum_i \Bigl[\sum_j \theta_j B_{nj}(Y_i) - C_n(\theta)\Bigr], \qquad \theta \in \Theta_{n0}. \]
The maximum likelihood estimator (MLE) $\hat\theta_n$ is the value of $\theta \in \Theta_{n0}$ that maximizes $\ell_n(\,\cdot\,)$. Since $\ell_n(\,\cdot\,)$ is a strictly concave function on $\Theta_{n0}$, the MLE is unique if it exists. The corresponding estimate $\hat f_n = f_n(\,\cdot\,;\hat\theta_n)$ of $f$ is called a loglinear density estimate, since $\log \hat f_n \in S_n$.
Let $\|\ \|_2$ and $\|\ \|_\infty$ denote the usual $L_2$ and $L_\infty$ norms of functions on $C$, and let $\|\ \|_{\infty,A}$ denote the $L_\infty$ norm of functions on $A$. It is assumed that $p_n \to \infty$ as $n \to \infty$, that $p_n = o(n^{1/2-\varepsilon})$ for some $\varepsilon > 0$, and that $\lim_{n\to\infty} \inf_{s \in S_n} \|s - h\|_\infty = 0$ if $h$ is continuous on $C$. Let $\Pi_n h$ denote the orthogonal projection of $h$ onto $S_n$ with respect to $L_2(C)$. It is assumed that there is a positive constant $M$ such that $\|\Pi_n h\|_\infty \le M\|h\|_\infty$ for $n \ge 1$ and $h \in L_\infty(C)$; $|B_{nj}| \le M$ on $C$ for $n \ge 1$ and $1 \le j \le p_n$; for $1 \le j \le p_n$, $B_{nj} = 0$ outside a set $C_{nj}$ having diameter at most $M p_n^{-1/d}$ and $C_{nj} \cap C_{nk}$ is nonempty for at most $M$ values of $k$; and
\[ M^{-1}\theta_j^2 \le \Bigl\|\sum_k \theta_k B_{nk}\Bigr\|_{\infty, C_{nj}}^2 \le M p_n \int_{C_{nj}} \Bigl(\sum_k \theta_k B_{nk}\Bigr)^2 \]
for $n \ge 1$, $\theta \in \Theta_n$, and $1 \le j \le p_n$.
These properties can be satisfied with $C_{nj}$, $1 \le j \le p_n$, a partition of $C$ and $B_{nj}$ the indicator function of $C_{nj}$. Here $\hat f_n$ is the corresponding histogram density estimate. They can also be satisfied with $d = 1$, $S_n$ a space of splines, and $B_{nj}$, $1 \le j \le p_n$, a basis consisting of B-splines; see de Boor [1, 2] or Stone [7, 8, 9] for details. Presumably, they can also be satisfied with tensor product spaces of splines and with spaces of the type that arise in the use of the finite element method (see arguments in Descloux [3] and de Boor [1]).
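To make the construction concrete, here is a minimal numerical sketch (not from the paper) of the histogram special case just described: $C = [0,1]$, $B_{nj}$ the indicator functions of an equal-length partition, and the MLE obtained by maximizing the concave log-likelihood over $\Theta_{n0}$. All variable names and the simulated data are illustrative assumptions.

```python
# A minimal numerical sketch (not from the paper) of the histogram special case on
# C = [0, 1]: indicator basis functions B_{nj}, theta restricted to Theta_{n0}, and the
# MLE found by maximizing the concave log-likelihood. All names are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 2000, 10                                # sample size and dimension p_n of S_n
y = rng.beta(2.0, 2.0, size=n)                 # a density that is positive on [0, 1]
counts = np.histogram(y, bins=p, range=(0.0, 1.0))[0]

def C_n(theta):
    # C_n(theta) = log \int_C exp(sum_j theta_j B_{nj}); each cell C_{nj} has length 1/p
    return np.log(np.exp(theta).sum() / p)

def neg_loglik(free):
    # parametrize Theta_{n0} (coordinates summing to zero) by the first p - 1 entries
    theta = np.append(free, -free.sum())
    return -(counts @ theta - n * C_n(theta))  # minus the log-likelihood l_n(theta)

fit = minimize(neg_loglik, np.zeros(p - 1), method="BFGS")
theta_hat = np.append(fit.x, -fit.x.sum())
f_hat = np.exp(theta_hat - C_n(theta_hat))     # fitted density value on each cell

# For the indicator basis the loglinear MLE reduces to the ordinary histogram estimate:
print(np.abs(f_hat - counts / n * p).max())    # should be numerically negligible
```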
Some asymptotic properties of loglinear density estimation will now be summarized. The proofs follow from arguments in Stone [9]. Set
\[ \delta_n = \inf_{s \in S_n} \|s - g\|_\infty. \]
THEOREM 1. (i) $\|f_n - f\|_\infty = O(\delta_n)$;
(ii) $\hat f_n$ exists except on an event whose probability tends to zero with $n$;
(iii) $\|\hat f_n - f_n\|_2 = O_{\Pr}((n^{-1}p_n)^{1/2})$; and
(iv) $\|\hat f_n - f_n\|_\infty = O_{\Pr}((n^{-1}p_n \log(p_n))^{1/2})$.
Write $\hat f_n - f = (f_n - f) + (\hat f_n - f_n)$. The quantity $f_n - f$ is a "bias" term, while $\hat f_n - f_n$ is a "noise" term whose magnitude is indicated by its asymptotic variance. Under typical smoothness assumptions on $g$, $\delta_n = O(p_n^{-q/d})$ for some positive number $q$ (this holds with $q = m$ if $g$ has a bounded $m$th derivative). Set $\gamma = 1/(2q + d)$ and $r = q\gamma$. Suppose that $\gamma d < 1/2$ or, equivalently, that $q > d/2$. To get the optimal rate of convergence of $\|\hat f_n - f\|_2$ to zero, choose $p_n \sim n^{\gamma d}$. Then $\delta_n^2 \sim n^{-1}p_n \sim n^{-2r}$ and hence
\[ \|\hat f_n - f\|_2 = O_{\Pr}(n^{-r}). \]
To get the optimal rate of convergence of $\|\hat f_n - f\|_\infty$ to zero, choose $p_n \sim (n/\log(n))^{\gamma d}$. Then $\delta_n^2 \sim n^{-1}p_n\log(p_n) \sim (n^{-1}\log(n))^{2r}$ and hence
\[ \|\hat f_n - f\|_\infty = O_{\Pr}((n^{-1}\log(n))^{r}). \]
(See Stone [5] for a precise definition of optimal rate of convergence.)
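As a concrete illustration of this arithmetic (an example, not taken from the paper): if $d = 1$ and $g$ has a bounded second derivative, then $q = 2$, $\gamma = 1/(2q+d) = 1/5$, and $r = q\gamma = 2/5$; choosing $p_n \sim n^{1/5}$ balances the squared bias $\delta_n^2 \sim p_n^{-4} \sim n^{-4/5}$ against the variance term $n^{-1}p_n \sim n^{-4/5}$, giving $\|\hat f_n - f\|_2 = O_{\Pr}(n^{-2/5})$.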
Let $I_n(\,\cdot\,)$ denote the information function based on the random sample of size $n$. Then $I_n(\theta)$ is the Hessian matrix of $nC_n(\,\cdot\,)$ at $\theta$; that is, the $p_n \times p_n$ matrix whose $(j,k)$th element is $n\,\partial^2 C_n(\theta)/\partial\theta_j\,\partial\theta_k$. Let $I_n^{-1}(\theta)$ denote the inverse of $I_n(\theta)$ viewed as a linear transformation of $\Theta_{n0}$. Set $I_n^{-1} = I_n^{-1}(\theta_n)$ and $\hat I_n^{-1} = I_n^{-1}(\hat\theta_n)$. Let $G_n(y), \hat G_n(y) \in \Theta_{n0}$ denote the $p_n$-dimensional vectors having elements
\[ G_{nj}(y) = B_{nj}(y) - \frac{\partial C_n}{\partial \theta_j}(\theta_n) \quad\text{and}\quad \hat G_{nj}(y) = B_{nj}(y) - \frac{\partial C_n}{\partial \theta_j}(\hat\theta_n), \]
respectively. Set
\[ SE(\hat f_n(y)) = f_n(y)\bigl(G_n(y)' I_n^{-1} G_n(y)\bigr)^{1/2} \quad\text{and}\quad \widehat{SE}(\hat f_n(y)) = \hat f_n(y)\bigl(\hat G_n(y)' \hat I_n^{-1} \hat G_n(y)\bigr)^{1/2}. \]
THEOREM 2. Uniformly in $y \in C$,
\[ SE(\hat f_n(y)) \sim (n^{-1}p_n)^{1/2}, \qquad \widehat{SE}(\hat f_n(y))/SE(\hat f_n(y)) = 1 + o_{\Pr}(1), \]
and
\[ \frac{\hat f_n(y) - f_n(y)}{SE(\hat f_n(y))} \to N(0,1). \]
It follows from Theorem 2 that $\hat f_n(y) \pm z_{1-\alpha/2}\,\widehat{SE}(\hat f_n(y))$ is an asymptotic $(1-\alpha)$-level confidence interval for $f_n(y)$; if $\delta_n = o((n^{-1}p_n)^{1/2})$, it is also an asymptotic $(1-\alpha)$-level confidence interval for $f(y)$. Here $\Phi(z_q) = q$, $\Phi$ being the standard normal distribution function.
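As a rough illustration (not from the paper) of how the standard error and interval above can be computed, the following sketch continues the histogram special case; $\hat I_n^{-1}$ on $\Theta_{n0}$ is realized as a pseudo-inverse, and all names and data are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of the Theorem 2 plug-in standard error and the
# Wald-type interval, in the histogram special case on C = [0, 1]; the inverse of
# \hat I_n on Theta_{n0} is taken as a pseudo-inverse. All names are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, p = 5000, 8
y = rng.beta(2.0, 2.0, size=n)
counts = np.histogram(y, bins=p, range=(0.0, 1.0))[0]

pi_hat = counts / n                              # estimated cell probabilities
theta_hat = np.log(p * pi_hat)
theta_hat -= theta_hat.mean()                    # put the MLE in Theta_{n0}
C_hat = np.log(np.exp(theta_hat).sum() / p)      # C_n at the MLE
f_hat = np.exp(theta_hat - C_hat)                # \hat f_n on each cell

# \hat I_n = n * Hessian of C_n at the MLE, inverted on Theta_{n0} via pseudo-inverse
I_hat = n * (np.diag(pi_hat) - np.outer(pi_hat, pi_hat))
I_hat_inv = np.linalg.pinv(I_hat)

j = 3                                            # an arbitrary cell containing y
G_hat = (np.arange(p) == j).astype(float) - pi_hat   # \hat G_n(y) for y in cell j
se_hat = f_hat[j] * np.sqrt(G_hat @ I_hat_inv @ G_hat)

z = norm.ppf(0.975)                              # 95% interval, alpha = 0.05
print("CI for f_n on cell", j, ":", (f_hat[j] - z * se_hat, f_hat[j] + z * se_hat))
# For the histogram this agrees with the binomial-based standard error:
print("binomial SE:", p * np.sqrt(pi_hat[j] * (1 - pi_hat[j]) / n), "plug-in SE:", se_hat)
```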
Let $P$ denote the distribution corresponding to $f$, defined by $P(A) = \int_A f$, and let $P_n$ and $\hat P_n$ be defined similarly in terms of $f_n$ and $\hat f_n$. Let $\mathcal{A}$ denote a class of subsets of $C$. Given distributions $Q_1$ and $Q_2$ on $C$ set
$\|Q_1 - Q_2\|_\infty = \sup_{A \in \mathcal{A}} |Q_1(A) - Q_2(A)|$. Under reasonable conditions on $\mathcal{A}$, $S_n$, and $f$, $\|P_n - P\|_\infty = O(p_n^{-1/d}\delta_n)$ and

It was shown in Stone [9] for the special case of $d = 1$, bases consisting of B-splines, and $\mathcal{A}$ the collection of subintervals of a compact interval $C$, that $\|\hat P_n - P_n\|_\infty = O_{\Pr}(n^{-1/2})$. What is a corresponding result in the more general context of the present paper?
3. Logistic regression. Let $X, Y$ be a pair of random variables such that $X$ ranges over a known compact subset $C$ of $\mathbf{R}^d$ and $Y$ takes on only two values, 0 and 1. It is assumed that the distribution of $X$ is absolutely continuous and that its density is bounded away from zero and infinity on $C$. Let $f$ be the regression function, defined on $C$ by
\[ f(x) = \Pr(Y = 1 \mid X = x). \]
It is assumed that $f$ is continuous and that $0 < f < 1$. Let $g$ denote the corresponding logistic regression function, defined by
\[ g = \mathrm{logit}(f) = \log\bigl(f/(1-f)\bigr), \]
so that $f = \exp(g)/(1 + \exp(g))$.
We can approximate $g$ by a member of a $p_n$-dimensional vector space $S_n$ of functions on $C$. Let $B_{nj}$, $1 \le j \le p_n$, denote a basis of $S_n$. Then the expected log-likelihood function $\Lambda_n(\,\cdot\,)$ is defined by
\[ \Lambda_n(\theta) = E\Bigl[\sum_j \theta_j B_{nj}(X)\,Y - \log\Bigl(1 + \exp\Bigl(\sum_j \theta_j B_{nj}(X)\Bigr)\Bigr)\Bigr], \qquad \theta \in \Theta_n. \]
Let $\theta_n$ be the unique $\theta \in \Theta_n$ that maximizes $\Lambda_n(\theta)$ and set
\[ g_n = \sum_j \theta_{nj} B_{nj} \quad\text{and}\quad f_n = \exp(g_n)/(1 + \exp(g_n)). \]
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent random pairs, each having the same distribution as $(X, Y)$. The log-likelihood function for the parametric model is given by
\[ \ell_n(\theta) = \sum_j \theta_j \sum_i Y_i B_{nj}(X_i) - \sum_i \log\Bigl(1 + \exp\Bigl(\sum_j \theta_j B_{nj}(X_i)\Bigr)\Bigr), \]
which corresponds to an exponential family in canonical form. Let $\hat\theta_n$ denote the MLE of $\theta$ and set $\hat g_n = \sum_j \hat\theta_{nj} B_{nj}$ and $\hat f_n = \exp(\hat g_n)/(1 + \exp(\hat g_n))$. Under appropriate regularity conditions, analogs of Theorems 1 and 2 of §2 should hold.
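A minimal sketch (not from the paper) of such a fit with $d = 1$, using piecewise-linear hat functions as a stand-in for a B-spline basis and a general-purpose optimizer for the concave log-likelihood; the data, basis, and all names are illustrative assumptions.

```python
# A minimal sketch (not from the paper): nonparametric logistic regression on C = [0, 1]
# using a small basis of piecewise-linear "hat" functions (degree-1 B-splines), fit by
# maximum likelihood. All names (p, knots, hat_basis, ...) are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # expit(u) = exp(u) / (1 + exp(u))

rng = np.random.default_rng(2)
n, p = 3000, 6
knots = np.linspace(0.0, 1.0, p)           # basis "centres"; spacing h = 1/(p - 1)

def hat_basis(x):
    # B_{nj}(x) = max(0, 1 - |x - knot_j| / h); these sum to one on [0, 1]
    h = knots[1] - knots[0]
    return np.clip(1.0 - np.abs(x[:, None] - knots[None, :]) / h, 0.0, None)

x = rng.uniform(0.0, 1.0, size=n)
g_true = np.sin(2.0 * np.pi * x)           # a smooth logistic regression function
y = rng.binomial(1, expit(g_true))
B = hat_basis(x)                           # n x p matrix of basis values B_{nj}(X_i)

def neg_loglik(theta):
    eta = B @ theta                        # sum_j theta_j B_{nj}(X_i)
    # l_n(theta) = sum_i [Y_i * eta_i - log(1 + exp(eta_i))]
    return -(y @ eta - np.logaddexp(0.0, eta).sum())

theta_hat = minimize(neg_loglik, np.zeros(p), method="BFGS").x

# The fitted regression function \hat f_n = expit(\hat g_n), evaluated on a grid
grid = np.linspace(0.0, 1.0, 11)
print(np.round(expit(hat_basis(grid) @ theta_hat), 2))
```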
4. Additive logistic regression. Let $C$ be a rectangle, say, $C = [0,1]^d$. It is then useful in practice to assume that $g$ is additive or, more generally, to replace $g$ by its best additive approximation $g^*$; this is defined to be the unique additive function $h$ on $C$ that maximizes the expected log-likelihood
\[ E\bigl[h(X)\,Y - \log\bigl(1 + e^{h(X)}\bigr)\bigr]. \]
Set $f^* = \exp(g^*)/(1 + \exp(g^*))$. If $g$ itself is additive, then $g^* = g$ and $f^* = f$.
To obtain a $p_n$-dimensional space of additive approximations to $g^*$, we consider $p_{nk}$-dimensional vector spaces $S_{nk}$ of functions on $[0,1]$ for $1 \le k \le d$, each containing the constant functions, and let $S_n$ be the collection of all functions of the form $s(x_1, \ldots, x_d) = \sum_k s_k(x_k)$, where $s_k \in S_{nk}$ for $1 \le k \le d$. Then $p_n = 1 + \sum_k (p_{nk} - 1)$. Analogs of Theorem 1 of §2 and its consequences for optimal rates of convergence should hold with $f$ replaced by $f^*$, $g$ replaced by $g^*$, and $d$ replaced by 1; see Stone [7, 8] for what has been rigorously verified to date. An analog of Theorem 2 should also hold if $g$ itself is additive. Otherwise, a more complicated standard error formula would be required, since $\Pr(Y = 1 \mid X = x)$ would not be exactly equal to $f^*(x)$.
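As a concrete illustration (not from the paper) of the additive approximating space, the following sketch assembles per-coordinate hat-function bases into a design matrix with exactly $p_n = 1 + \sum_k (p_{nk} - 1)$ linearly independent columns; the bases and all names are illustrative assumptions.

```python
# A minimal sketch (not from the paper) of the additive approximating space of Section 4
# on C = [0, 1]^d: per-coordinate hat-function bases S_nk, combined so that
# dim(S_n) = 1 + sum_k (p_nk - 1). All names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
d, p_k = 3, 6                                # d coordinates, p_nk = p_k per coordinate
knots = np.linspace(0.0, 1.0, p_k)
h = knots[1] - knots[0]

def hat_basis(u):
    # univariate partition-of-unity basis on [0, 1] (its span contains the constants)
    return np.clip(1.0 - np.abs(u[:, None] - knots[None, :]) / h, 0.0, None)

def additive_design(X):
    # one global intercept plus (p_k - 1) columns per coordinate; one column of each
    # univariate basis is dropped because the basis sums to one (it would duplicate the
    # intercept), giving exactly p_n = 1 + d * (p_k - 1) linearly independent columns
    cols = [np.ones((X.shape[0], 1))]
    cols += [hat_basis(X[:, k])[:, 1:] for k in range(X.shape[1])]
    return np.hstack(cols)

X = rng.uniform(0.0, 1.0, size=(500, d))
B = additive_design(X)
print("p_n =", B.shape[1], "= 1 + d*(p_k - 1) =", 1 + d * (p_k - 1))
print("rank =", np.linalg.matrix_rank(B))    # full column rank: no nontrivial null space
# This design matrix plays the role of (B_{nj}(X_i)) in the Section 3 log-likelihood,
# so the same maximum likelihood fit yields the additive logistic regression estimate.
```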
REFERENCES
1. C. de Boor, A bound on the $L_\infty$-norm of the $L_2$-approximation by splines in terms of a global mesh ratio, Math. Comp. 30 (1976), 765-771.
2. C. de Boor, A practical guide to splines, Springer-Verlag, New York, 1978.
3. J. Descloux, On finite element matrices, SIAM J. Numer. Anal. 9 (1972), 260-265.
4. C. J. Stone, Optimal rates of convergence for nonparametric estimators, Ann. Statist. 8 (1980), 1348-1360.
5. C. J. Stone, Optimal global rates of convergence for nonparametric regression, Ann. Statist. 10 (1982), 1040-1053.
6. C. J. Stone, Optimal uniform rate of convergence for nonparametric estimators of a density function or its derivatives, Recent Advances in Statistics: Papers in Honor of Herman Chernoff on his Sixtieth Birthday (M. H. Rizvi, J. S. Rustagi, and D. Siegmund, editors), Academic Press, New York, 1983, pp. 393-406.
7. C. J. Stone, Additive regression and other nonparametric models, Ann. Statist. 13 (1985), 689-705.
8. C. J. Stone, The dimensionality reduction principle for generalized additive models, Ann. Statist. 14 (1986), 590-606.
9. C. J. Stone, Asymptotic properties of logspline density estimation, Technical Report No. 69, Department of Statistics, Univ. of California, Berkeley, 1986.
UNIVERSITY OF CALIFORNIA, BERKELEY, CALIFORNIA 94720, USA