Estimation of Functions
An interesting problem in statistics, and one that is generally
difficult, is the estimation of a continuous function such as a
probability density function.
The statistical properties of an estimator of a function are more
complicated than statistical properties of an estimator of a single
parameter or even of a countable set of parameters.
We consider the general case of a real scalar-valued function over
real vector-valued arguments (that is, a mapping from IR^d into IR).
One of the most common situations in which these properties
are relevant is in nonparametric probability density estimation.
1
Notation
We may denote a function by a single letter, f, for example, or
by the function notation, f(·) or f(x).
When f(x) denotes a function, x is merely a placeholder.
The notation f(x), however, may also refer to the value of the
function at the point x. The meaning is usually clear from the
context.
Using the common “hat” notation for an estimator, we use f̂ or
f̂(x) to denote the estimator of f or of f(x).
2
More on Notation
The hat notation is also used to denote an estimate, so we must
determine from the context whether f̂ or f̂(x) denotes a random
variable or a realization of a random variable.
The estimate or the estimator of the value of the function at
the point x may also be denoted by f̂(x).
Sometimes, to emphasize that we are estimating the ordinate of
the function rather than evaluating an estimate of the function,
we use the notation \widehat{f(x)}, with the hat extending over
the whole expression f(x).
3
Optimality
The usual optimality properties that we use in developing a theory of estimation of a finite-dimensional parameter must be extended for estimation of a general function.
As we will see, two of the usual desirable properties of point estimators, namely unbiasedness and maximum likelihood, cannot
be attained in general by estimators of functions.
4
Estimation or Approximation?
There are many similarities in estimation of functions and approximation of functions, but we must be aware of the fundamental
differences in the two problems.
Estimation of functions is similar to other estimation problems:
we are given a sample of observations; we make certain assumptions about the probability distribution of the sample; and then
we develop estimators.
The estimators are random variables, and how useful they are
depends on properties of their distribution, such as their expected
values and their variances.
Approximation of functions is an important aspect of numerical analysis. Functions are often approximated to interpolate
functional values between directly computed or known values.
5
General Methods for Estimating
Functions
In the problem of function estimation, we may have observations
on the function at specific points in the domain, or we may have
indirect measurements of the function, such as observations that
relate to a derivative or an integral of the function.
In either case, the problem of function estimation has the competing goals of providing a good fit to the observed data and
predicting values at other points.
In many cases, a smooth estimate satisfies this latter objective. In other cases, however, the unknown function itself is not
smooth.
Functions with different forms may govern the phenomena in different regimes. This presents a very difficult problem in function
estimation, but we won’t go into it.
6
General Methods for Estimating
Functions
There are various approaches to estimating functions.
Maximum likelihood has limited usefulness for estimating functions because in general the likelihood is unbounded.
A practical approach is to assume that the function is of a particular form and estimate the parameters that characterize the
form.
For example, we may assume that the function is exponential,
possibly because of physical properties such as exponential decay. We may then use various estimation criteria, such as least
squares, to estimate the parameter.
7
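As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this idea, assuming an exponential-decay form and least-squares fitting with scipy; the function name, the simulated data, and the parameter values are invented for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(t, a, lam):
    """Assumed parametric form: f(t) = a * exp(-lam * t)."""
    return a * np.exp(-lam * t)

# Simulated observations of the function at specific points (illustration only).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 50)
y = exp_decay(t, 2.0, 1.3) + rng.normal(scale=0.05, size=t.size)

# Least-squares estimates of the parameters that characterize the assumed form.
(a_hat, lam_hat), _ = curve_fit(exp_decay, t, y, p0=(1.0, 1.0))
```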
Mixtures of Functions with Prescribed
Forms
An extension of this approach is to assume that the function is
a mixture of other functions.
The mixture can be formed by different functions over different
domains or by weighted averages of the functions over the whole
domain.
Estimation of the function of interest involves estimation of various parameters as well as the weights.
8
Use of Basis Functions
Another approach to function estimation is to represent the function of interest as a linear combination of basis functions, that
is, to represent the function in a series expansion.
The basis functions are generally chosen to be orthogonal over
the domain of interest, and the observed data are used to estimate the coefficients in the series.
9
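A minimal sketch of this approach (our own illustration, not from the slides): estimate the leading coefficients of an orthogonal Legendre series by least squares from noisy observations of the function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy observations of an unknown function on [-1, 1].
x = np.sort(rng.uniform(-1.0, 1.0, size=200))
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=x.size)

# Legendre polynomials are orthogonal on [-1, 1]; estimate the first few
# coefficients of the series expansion by least squares.
degree = 6
coefs = np.polynomial.legendre.legfit(x, y, degree)
f_hat_values = np.polynomial.legendre.legval(x, coefs)   # estimated function values
```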
Estimation of a Function at a Point
It is often more practical to estimate the function value at a
given point.
(Of course, if we can estimate the function at any given point,
we can effectively have an estimate at all points.)
One way of forming an estimate of a function at a given point
is to take the average at that point of a filtering function that
is evaluated in the vicinity of each data point.
The filtering function is called a kernel, and the result of this
approach is called a kernel estimator.
We must be concerned about the properties of the estimators at
specific points and also about properties over the full domain.
Global properties over the full domain are often defined in terms
of integrals or in terms of suprema or infima.
10
Kernel Methods
One approach to function estimation and approximation is to
use a filter or kernel function to provide local weighting of the
observed data.
This approach ensures that at a given point the observations
close to that point influence the estimate at the point more
strongly than more distant observations.
A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from
a central point.
A kernel has two arguments representing the two points in the
convolution, but we typically use a single argument that represents the distance between the two points.
11
Some Kernels
Some examples of univariate kernel functions are

    uniform:    K_u(t) = 0.5,                    for |t| ≤ 1,

    quadratic:  K_q(t) = 0.75(1 − t²),           for |t| ≤ 1,

    normal:     K_n(t) = (1/√(2π)) e^{−t²/2},    for all t.
The kernels with finite support are defined to be 0 outside that
range. Often, multivariate kernels are formed as products of
these or other univariate kernels.
12
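As a hedged illustration (the function names are ours), these three kernels can be written in Python as:

```python
import numpy as np

def kernel_uniform(t):
    """Uniform kernel: K_u(t) = 0.5 for |t| <= 1, 0 outside."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.5, 0.0)

def kernel_quadratic(t):
    """Quadratic (Epanechnikov) kernel: K_q(t) = 0.75 (1 - t^2) for |t| <= 1."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1, 0.75 * (1.0 - t**2), 0.0)

def kernel_normal(t):
    """Normal kernel: K_n(t) = exp(-t^2 / 2) / sqrt(2 pi), for all t."""
    t = np.asarray(t, dtype=float)
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)
```

Each kernel integrates to 1 over its support, which a quick numerical check (for example, np.trapz of kernel_quadratic on a fine grid over [−1, 1]) confirms.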
Kernel Methods
In kernel methods, the locality of influence is controlled by a
window around the point of interest.
The choice of the size of the window is the most important issue
in the use of kernel methods.
In practice, for a given choice of the size of the window, the
argument of the kernel function is transformed to reflect the
size.
The transformation is accomplished using a positive definite matrix, V , whose determinant measures the volume (size) of the
window.
13
Kernel Methods
To estimate the function f at the point x, we first decompose f
to have a factor that is a probability density function, p,

    f(x) = g(x)p(x).

For a given set of data, x_1, . . . , x_n, and a given scaling
transformation matrix V, the kernel estimator of the function at the
point x is

    f̂(x) = (n|V|)^{−1} ∑_{i=1}^n g(x_i) K(V^{−1}(x − x_i)).    (1)

In the univariate case, the size of the window is just the width h.
The argument of the kernel is transformed to s/h, so the function
that is convolved with the function of interest is K(s/h)/h. The
univariate kernel estimator is

    f̂(x) = (1/(nh)) ∑_{i=1}^n g(x_i) K((x − x_i)/h).
14
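A minimal Python sketch of the univariate kernel estimator above (the function name and defaults are ours; kernel_normal refers to the earlier example, and taking g ≡ 1 gives an ordinary kernel density estimator):

```python
import numpy as np

def kernel_estimate(x, data, h, kernel, g=None):
    """Univariate kernel estimate: f_hat(x) = (1/(n h)) * sum_i g(x_i) K((x - x_i)/h).

    x      : point(s) at which to estimate f
    data   : observed sample x_1, ..., x_n
    h      : window width
    kernel : vectorized univariate kernel function K
    g      : factor with f = g * p; g=None means g == 1 (density estimation)
    """
    data = np.asarray(data, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    n = data.size
    gx = np.ones(n) if g is None else np.asarray([g(xi) for xi in data], dtype=float)
    t = (x[:, None] - data[None, :]) / h          # pairwise scaled distances
    return (gx[None, :] * kernel(t)).sum(axis=1) / (n * h)
```

For example, calling kernel_estimate at 0 with a sample of 200 standard normal observations, h = 0.3, and kernel=kernel_normal gives a kernel density estimate of the standard normal density at 0.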
Pointwise Properties of Function
Estimators
The statistical properties of an estimator of a function at a given
point are analogous to the usual statistical properties of an estimator of a scalar parameter.
The statistical properties involve expectations or other properties
of random variables.
15
Bias
The bias of the estimator of a function value at the point x is

    E(f̂(x)) − f(x).

If this bias is zero, we would say that the estimator is unbiased
at the point x.
If the estimator is unbiased at every point x in the domain of f,
we say that the estimator is pointwise unbiased.
Obviously, in order for f̂(·) to be pointwise unbiased, it must be
defined over the full domain of f.
16
Variance
The variance of the estimator at the point x is

    V(f̂(x)) = E((f̂(x) − E(f̂(x)))²).
Estimators with small variance are generally more desirable, and
an optimal estimator is often taken as the one with smallest
variance among a class of unbiased estimators.
17
Mean Squared Error
The mean squared error, MSE, at the point x is

    MSE(f̂(x)) = E((f̂(x) − f(x))²).    (2)

The mean squared error is the sum of the variance and the square
of the bias:

    MSE(f̂(x)) = E((f̂(x))² − 2f̂(x)f(x) + (f(x))²)
              = V(f̂(x)) + (E(f̂(x)) − f(x))².    (3)
18
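The decomposition (3) can be checked numerically. The following sketch (our own illustration, using an arbitrary shrunken sample mean as a deliberately biased estimator) compares the simulated MSE with the variance plus the squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                      # true value being estimated
n, reps = 25, 100_000

# A deliberately biased estimator: a shrunken sample mean.
samples = rng.normal(theta, 1.0, size=(reps, n))
est = 0.9 * samples.mean(axis=1)

mse      = np.mean((est - theta) ** 2)
variance = np.var(est)
bias_sq  = (np.mean(est) - theta) ** 2
# MSE should equal variance + squared bias, up to Monte Carlo error.
print(mse, variance + bias_sq)
```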
Mean Squared Error
Sometimes, the variance of an unbiased estimator is much greater
than that of an estimator that is only slightly biased, so it is often appropriate to compare the mean squared error of the two
estimators.
In some cases, as we will see, unbiased estimators do not exist,
so rather than seek an unbiased estimator with a small variance,
we seek an estimator with a small MSE.
19
Mean Absolute Error
The mean absolute error, MAE, at the point x is similar to the
MSE:

    MAE(f̂(x)) = E(|f̂(x) − f(x)|).    (4)
It is more difficult to do mathematical analysis of the MAE than
it is for the MSE.
Furthermore, the MAE does not have a simple decomposition
into other meaningful quantities similar to the MSE.
20
Consistency
Consistency of an estimator refers to the convergence of the
expected value of the estimator to what is being estimated as
the sample size increases without bound.
If m is a function (maybe a vector-valued function that is an
elementwise norm), we can define consistency of an estimator
T_n in terms of m if

    E(m(T_n − θ)) → 0.    (5)
21
Rate of Convergence
If convergence does occur, we are interested in the rate of convergence.
We define rate of convergence in terms of a function of n, say
r(n), such that
    E(m(T_n − θ)) = O(r(n)).

A common form of r(n) is n^α, where α < 0.
For example, in the simple case of a univariate population with
a finite mean µ and finite second moment, use of the sample
mean x̄ as the estimator T_n, and use of m(z) = z², we have

    E(m(x̄ − µ)) = E((x̄ − µ)²)
                = MSE(x̄)
                = O(n^{−1}).
22
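A quick simulation (our own sketch) illustrating that MSE(x̄) is O(n^{−1}): multiplying the simulated MSE by n gives roughly the same value for each sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
reps = 50_000

# MSE of the sample mean shrinks like 1/n, so n * MSE stays roughly constant.
for n in (10, 100, 1000):
    xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
    mse = np.mean(xbar ** 2)        # true mean is 0
    print(n, mse, n * mse)           # n * mse is approximately sigma^2 = 1
```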
Pointwise Consistency
In the estimation of a function, we say that the estimator f̂ of
the function f is pointwise consistent if

    E(f̂(x)) → f(x)    (6)

for every x in the domain of f.
If the convergence in expression (6) is in probability, for example,
we say that the estimator is weakly pointwise consistent.
We can also define other kinds of pointwise consistency in function estimation along the lines of other types of consistency.
23
Global Properties of Estimators of
Functions
Often, we are interested in some measure of the statistical properties of an estimator of a function over the full domain of the
function. The obvious way of defining statistical properties of an
estimator of a function is to integrate the pointwise properties.
Statistical properties of a function, such as the bias of the function, are often defined in terms of a norm of the function.
For comparing f̂(x) and f(x), the Lp norm of the error is

    ( ∫_D |f̂(x) − f(x)|^p dx )^{1/p},    (7)

where D is the domain of f. The integral may not exist, of
course. Clearly, the estimator f̂ must also be defined over the
same domain.
24
Convergence Norms
Three useful measures are the L1 norm, also called the integrated
absolute error, or IAE,

    IAE(f̂) = ∫_D |f̂(x) − f(x)| dx,    (8)

the square of the L2 norm, also called the integrated squared
error, or ISE,

    ISE(f̂) = ∫_D (f̂(x) − f(x))² dx,    (9)

and the L∞ norm, the sup absolute error, or SAE,

    SAE(f̂) = sup |f̂(x) − f(x)|.    (10)
25
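On a grid of points covering D, these three measures can be approximated numerically; a minimal sketch (the function name is ours) follows.

```python
import numpy as np

def error_norms(f_hat, f, grid):
    """Approximate IAE, ISE, and SAE of an estimate on a finite grid.

    f_hat, f : vectorized callables for the estimate and the true function
    grid     : points covering the domain D
    """
    diff = f_hat(grid) - f(grid)
    iae = np.trapz(np.abs(diff), grid)     # L1: integrated absolute error
    ise = np.trapz(diff ** 2, grid)        # squared L2: integrated squared error
    sae = np.max(np.abs(diff))             # L-infinity: sup absolute error
    return iae, ise, sae
```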
Convergence Norms
The L1 measure is invariant under monotone transformations of
the coordinate axes, but the measure based on the L2 norm is
not.
The L∞ norm, or SAE, is the most often used measure in general function approximation. In statistical applications, this measure applied to two cumulative distribution functions is the Kolmogorov distance.
The measure is not so useful in comparing densities and is not
often used in density estimation.
26
Convergence Measures
Other measures of the difference in f̂ and f over the full range
of x are the Kullback-Leibler measure,

    ∫_D f̂(x) log( f̂(x) / f(x) ) dx,

and the Hellinger distance,

    ( ∫_D ( f̂^{1/p}(x) − f^{1/p}(x) )^p dx )^{1/p}.
27
Integrated Bias and Variance
We now want to develop global concepts of bias and variance
for estimators of functions.
Bias and variance are statistical properties that involve expectations of random variables.
The obvious global measures of bias and variance are just the
pointwise measures integrated over the domain.
(In the case of the bias, of course, we must integrate the absolute
value, otherwise points of negative bias could cancel out points
of positive bias.)
28
Integrated Bias
Because we are interested in the bias over the domain of the
function, we define the integrated absolute bias as

    IAB(f̂) = ∫_D |E(f̂(x)) − f(x)| dx    (11)

and the integrated squared bias as

    ISB(f̂) = ∫_D (E(f̂(x)) − f(x))² dx.    (12)
If the estimator is unbiased, both the integrated absolute bias
and integrated squared bias are 0.
This, of course, would mean that the estimator is pointwise
unbiased almost everywhere.
Although it is not uncommon to have unbiased estimators of
scalar parameters or even of vector parameters with a countable
number of elements, it is not likely that an estimator of a function
could be unbiased at almost all points in a dense domain.
29
Integrated Variance
The integrated variance is defined in a similar manner:
    IV(f̂) = ∫_D V(f̂(x)) dx
           = ∫_D E((f̂(x) − E(f̂(x)))²) dx.    (13)
30
Integrated Mean Squared Error
As we suggested before, global unbiasedness is generally not to
be expected.
An important measure for comparing estimators of functions is,
therefore, based on the mean squared error.
The integrated mean squared error is

    IMSE(f̂) = ∫_D E((f̂(x) − f(x))²) dx
            = IV(f̂) + ISB(f̂).    (14)
(Compare equations (2) and (3) on slide 18.)
31
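As a hedged illustration (not from the slides, and assuming scipy is available), the IMSE of a Gaussian-kernel density estimator can be approximated by simulation: average the pointwise squared error over many samples, integrate over a grid, and compare with the IV + ISB split from the same simulations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
grid = np.linspace(-4.0, 4.0, 401)
true_density = norm.pdf(grid)
n, h, reps = 100, 0.4, 500

estimates = np.empty((reps, grid.size))
for r in range(reps):
    data = rng.normal(size=n)
    # Gaussian kernel density estimate evaluated on the grid.
    estimates[r] = norm.pdf((grid[:, None] - data[None, :]) / h).sum(axis=1) / (n * h)

pointwise_mse = np.mean((estimates - true_density) ** 2, axis=0)
imse = np.trapz(pointwise_mse, grid)
isb  = np.trapz((estimates.mean(axis=0) - true_density) ** 2, grid)
iv   = np.trapz(estimates.var(axis=0), grid)
print(imse, iv + isb)   # the two should agree up to Monte Carlo error
```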
Integrated Mean Squared Error
If the expectation integration can be interchanged with the outer
integration in the expression above, we have
    IMSE(f̂) = E( ∫_D (f̂(x) − f(x))² dx )
             = MISE(f̂),
the mean integrated squared error.
We will assume that this interchange leaves the integrals unchanged, so we will use MISE and IMSE interchangeably.
32
Integrated Mean Absolute Error
Similarly, for the integrated mean absolute error, we have
    IMAE(f̂) = ∫_D E(|f̂(x) − f(x)|) dx
             = E( ∫_D |f̂(x) − f(x)| dx )
             = MIAE(f̂),
the mean integrated absolute error.
33
Mean SAE
The mean sup absolute error, or MSAE, is
    MSAE(f̂) = ∫_D E(sup |f̂(x) − f(x)|) dx.    (15)
This measure is not very useful unless the variation in the function f is relatively small. For example, if f is a density function, f̂ can be a “good” estimator, yet the MSAE may be quite
large. On the other hand, if f is a cumulative distribution function (monotonically ranging from 0 to 1), the MSAE may be a
good measure of how well the estimator performs. As mentioned
earlier, the SAE is the Kolmogorov distance. The Kolmogorov
distance (and, hence, the SAE and the MSAE) does poorly in
measuring differences in the tails of the distribution.
34
Large-Sample Properties
The pointwise consistency properties are extended to the full
function in the obvious way.
Consistency of the function estimator is defined in terms of
    ∫_D E(m(f̂(x) − f(x))) dx → 0.

The estimator of the function is said to be mean square consistent or L2 consistent if the MISE converges to 0; that is,

    ∫_D E((f̂(x) − f(x))²) dx → 0.
If the convergence is weak, that is, if it is convergence in probability, we say that the function estimator is weakly consistent;
if the convergence is strong, that is, if it is convergence almost
surely or with probability 1, we say the function estimator is
strongly consistent.
35
Large-Sample Properties
The estimator of the function is said to be L1 consistent if the
mean integrated absolute error (MIAE) converges to 0; that is,
    ∫_D E(|f̂(x) − f(x)|) dx → 0.
As with the other kinds of consistency, the nature of the convergence in the definition may be expressed in the qualifiers “weak”
or “strong”.
As we have mentioned above, the integrated absolute error is invariant under monotone transformations of the coordinate axes,
but the L2 measures are not.
As with most work in L1, however, derivation of various properties of IAE or MIAE is more difficult than for analogous properties
with respect to L2 criteria.
36
Large-Sample Properties
If the MISE converges to 0, we are interested in the rate of
convergence. To determine this, we seek an expression of MISE
as a function of n. We do this by a Taylor series expansion.
In general, if θ̂ is an estimator of θ, the Taylor series for ISE(θ̂),
equation (9), about the true value is

    ISE(θ̂) = ∑_{k=0}^∞ (1/k!) (θ̂ − θ)^k ISE^{k′}(θ),    (16)

where ISE^{k′}(θ) represents the kth derivative of ISE evaluated at
θ.
Taking the expectation in equation (16) yields the MISE. The
limit of the MISE as n → ∞ is the asymptotic mean integrated
squared error, AMISE.
One of the most important properties of an estimator is the
order of the AMISE.
37
Large-Sample Properties
In the case of an unbiased estimator, the first two terms in the
Taylor series expansion are zero, and the AMISE is
    V(θ̂) ISE″(θ)
to terms of second order.
38
Other Global Properties of Estimators
of Functions: Roughness
There are often other properties that we would like an estimator
of a function to possess.
We may want the estimator to weight given functions in some
particular way.
For example, if we know how the function to be estimated, f,
weights a given function r, we may require that the estimate f̂
weight the function r in the same way; that is,

    ∫_D r(x)f̂(x) dx = ∫_D r(x)f(x) dx.
We may want to restrict the minimum and maximum values of
the estimator. For example, because many functions of interest
are nonnegative, we may want to require that the estimator be
nonnegative.
39
Other Global Properties of Estimators
of Functions: Roughness
We may want to restrict the variation in the function.
This can be thought of as the “roughness” of the function.
A reasonable measure of the variation is

    ∫_D ( f(x) − ∫_D f(x) dx )² dx.

If the integral ∫_D f(x) dx is constrained to be some constant (such
as 1 in the case that f(x) is a probability density), then the
variation can be measured by the square of the L2 norm,

    S(f) = ∫_D (f(x))² dx.    (17)
40
Other Global Properties of Estimators
of Functions: Roughness
We may want to restrict the derivatives of the estimator or the
smoothness of the estimator.
Another intuitive measure of the roughness of a twice-differentiable
and integrable univariate function f is the integral of the square
of the second derivative:
    R(f) = ∫_D (f″(x))² dx.    (18)

Often, in function estimation, we may seek an estimator f̂ such
that its roughness (by some definition) is small.
41
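For a function known only on an equally spaced grid, both measures can be approximated by finite differences; a minimal sketch (function names ours) follows.

```python
import numpy as np

def roughness_S(fx, x):
    """Approximate S(f) = integral of f(x)^2 over the grid x (equation 17)."""
    return np.trapz(fx ** 2, x)

def roughness_R(fx, x):
    """Approximate R(f) = integral of (f''(x))^2 by second differences (equation 18)."""
    dx = x[1] - x[0]                      # assumes an equally spaced grid
    f2 = np.diff(fx, n=2) / dx ** 2        # second-difference approximation of f''
    return np.trapz(f2 ** 2, x[1:-1])
```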
Nonparametric Probability Density
Estimation
Estimation of a probability density function is similar to the estimation of any function, and the properties of the function estimators that we have discussed are relevant for density function
estimators.
A density function p(y) is characterized by two properties:
• it is nonnegative everywhere;
• it integrates to 1 (with the appropriate definition of “integrate”).
42
Nonparametric Probability Density
Estimation
We consider several nonparametric estimators of a density; that
is, estimators of a general nonnegative function that integrates
to 1 and for which we make no assumptions about a functional
form other than, perhaps, smoothness.
It seems reasonable that we require the density estimate to have
the characteristic properties of a density:
• p̂(y) ≥ 0 for all y;
• ∫_{IR^d} p̂(y) dy = 1.
43
Bona Fide Density Estimator
A probability density estimator that is nonnegative and integrates
to 1 is called a bona fide estimator.
Rosenblatt has shown that no unbiased bona fide estimator can
exist for all continuous p.
Rather than requiring an unbiased estimator that cannot be a
bona fide estimator, we generally seek a bona fide estimator with
small mean squared error or a sequence of bona fide estimators
p̂_n that are asymptotically unbiased; that is,

    E_p(p̂_n(y)) → p(y)

for all y ∈ IR^d as n → ∞.
44
The Likelihood Function
Suppose that we have a random sample, y_1, . . . , y_n, from a
population with density p.
Treating the density p as a variable, we write the likelihood
functional as

    L(p; y_1, . . . , y_n) = ∏_{i=1}^n p(y_i).
The maximum likelihood method of estimation obviously cannot
be used directly because this functional is unbounded in p.
We may, however, seek an estimator that maximizes some modification of the likelihood.
45
Modified Maximum Likelihood
Estimation
There are two reasonable ways to approach this problem.
One is to restrict the domain of the optimization problem. This
is called restricted maximum likelihood.
The other is to regularize the estimator by adding a penalty
term to the functional to be optimized. This is called penalized
maximum likelihood.
46
Restricted Maximum Likelihood
Estimation
We may seek to maximize the likelihood functional subject to
the constraint that p be a bona fide density.
If we put no further restrictions on the function p, however,
infinite Dirac spikes at each observation give an unbounded likelihood, so a maximum likelihood estimator cannot exist, subject
only to the restriction to the bona fide class.
An additional restriction that p be Lebesgue-integrable over some
domain D (that is, p ∈ L1(D)) does not resolve the problem because we can construct sequences of finite spikes at each observation that grow without bound.
We therefore must restrict the class further.
47
Restricted Maximum Likelihood
Estimation
Consider a finite dimensional class, such as the class of step
functions that are bona fide density estimators. We assume that
the sizes of the regions over which the step function is constant
are greater than 0.
For a step function with m regions having constant values,
c_1, . . . , c_m, the likelihood is

    L(c_1, . . . , c_m; y_1, . . . , y_n) = ∏_{i=1}^n p(y_i)
                                         = ∏_{k=1}^m c_k^{n_k},    (19)

where n_k is the number of data points in the kth region.
48
continued ...
For the step function to be a bona fide estimator, all c_k must be
nonnegative and finite. A maximum therefore exists in the class
of step functions that are bona fide estimators.
If v_k is the measure of the volume of the kth region (that is, v_k
is the length of an interval in the univariate case, the area in the
bivariate case, and so on), we have

    ∑_{k=1}^m c_k v_k = 1.

We incorporate this constraint together with equation (19) to
form the Lagrangian,

    L(c_1, . . . , c_m) + λ( 1 − ∑_{k=1}^m c_k v_k ).
49
continued ...
Differentiating the Lagrangian function and setting the derivative
to zero, we have at the maximum point c_k = c*_k, for any λ,

    ∂L/∂c_k = λ v_k.

Using the derivative of L from equation (19), we get

    n_k L = λ c*_k v_k.

Summing both sides of this equation over k, we have

    n L = λ,

and then substituting, we have

    n_k L = n L c*_k v_k.
50
continued ...
Therefore, the maximum of the likelihood occurs at

    c*_k = n_k / (n v_k).

The restricted maximum likelihood estimator is therefore

    p̂(y) = n_k / (n v_k),   for y ∈ region k,    (20)
         = 0,               otherwise.
51
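Equation (20) is the familiar histogram density estimator. A minimal univariate Python sketch (the function name is ours; the bin edges defining the regions are supplied by the user):

```python
import numpy as np

def histogram_density(y, data, edges):
    """Restricted MLE over step functions: p_hat(y) = n_k / (n * v_k) in region k, 0 otherwise."""
    data = np.asarray(data, dtype=float)
    edges = np.asarray(edges, dtype=float)
    n = data.size
    counts, _ = np.histogram(data, bins=edges)   # n_k for each region
    widths = np.diff(edges)                      # v_k for each region
    heights = counts / (n * widths)              # c*_k = n_k / (n v_k)
    k = np.searchsorted(edges, y, side="right") - 1
    inside = (k >= 0) & (k < heights.size)
    return np.where(inside, heights[np.clip(k, 0, heights.size - 1)], 0.0)
```

For example, histogram_density(0.3, sample, np.linspace(-3, 3, 13)) evaluates the step-function estimate at 0.3 for a sample on twelve equal-width regions.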
Restricted Maximum Likelihood
Estimation
Instead of restricting the density estimate to step functions, we
could consider other classes of functions, such as piecewise linear
functions.
We may also seek other properties, such as smoothness, for the
estimated density.
One way of achieving other desirable properties for the estimator
is to use a penalizing function to modify the function to be
optimized.
52
continued ...
Instead of the likelihood function, we may use a penalized likelihood function of the form

    L_p(p; y_1, . . . , y_n) = ∏_{i=1}^n p(y_i) e^{−T(p)},

where T(p) is a transform that measures some property that we
would like to minimize.
For example, to achieve smoothness, we may use the transform
R(p) of equation (18) in the penalizing factor.
To choose a function p̂ to maximize L_p(p) we would have to use
some finite series approximation to T(p̂).