Estimation of Functions

An interesting problem in statistics, and one that is generally difficult, is the estimation of a continuous function such as a probability density function. The statistical properties of an estimator of a function are more complicated than the statistical properties of an estimator of a single parameter or even of a countable set of parameters. We consider the general case of a real scalar-valued function over real vector-valued arguments (that is, a mapping from $\mathbb{R}^d$ into $\mathbb{R}$). One of the most common situations in which these properties are relevant is in nonparametric probability density estimation.

Notation

We may denote a function by a single letter, $f$, for example, or by the function notation, $f(\cdot)$ or $f(x)$. When $f(x)$ denotes a function, $x$ is merely a placeholder. The notation $f(x)$, however, may also refer to the value of the function at the point $x$. The meaning is usually clear from the context. Using the common "hat" notation for an estimator, we use $\hat{f}$ or $\hat{f}(x)$ to denote the estimator of $f$ or of $f(x)$.

More on Notation

The hat notation is also used to denote an estimate, so we must determine from the context whether $\hat{f}$ or $\hat{f}(x)$ denotes a random variable or a realization of a random variable. The estimate or the estimator of the value of the function at the point $x$ may also be denoted by $\hat{f}(x)$. Sometimes, to emphasize that we are estimating the ordinate of the function rather than evaluating an estimate of the function, we use the notation $\widehat{f(x)}$.

Optimality

The usual optimality properties that we use in developing a theory of estimation of a finite-dimensional parameter must be extended for estimation of a general function. As we will see, two of the usual desirable properties of point estimators, namely unbiasedness and maximum likelihood, cannot be attained in general by estimators of functions.

Estimation or Approximation?

There are many similarities between estimation of functions and approximation of functions, but we must be aware of the fundamental differences between the two problems. Estimation of functions is similar to other estimation problems: we are given a sample of observations; we make certain assumptions about the probability distribution of the sample; and then we develop estimators. The estimators are random variables, and how useful they are depends on properties of their distribution, such as their expected values and their variances. Approximation of functions is an important aspect of numerical analysis. Functions are often approximated to interpolate functional values between directly computed or known values.

General Methods for Estimating Functions

In the problem of function estimation, we may have observations on the function at specific points in the domain, or we may have indirect measurements of the function, such as observations that relate to a derivative or an integral of the function. In either case, the problem of function estimation has the competing goals of providing a good fit to the observed data and predicting values at other points. In many cases, a smooth estimate satisfies this latter objective. In other cases, however, the unknown function itself is not smooth. Functions with different forms may govern the phenomena in different regimes. This presents a very difficult problem in function estimation, but we won't go into it.

General Methods for Estimating Functions

There are various approaches to estimating functions.
Maximum likelihood has limited usefulness for estimating functions because in general the likelihood is unbounded. A practical approach is to assume that the function is of a particular form and to estimate the parameters that characterize that form. For example, we may assume that the function is exponential, possibly because of physical properties such as exponential decay. We may then use various estimation criteria, such as least squares, to estimate the parameter.

Mixtures of Functions with Prescribed Forms

An extension of this approach is to assume that the function is a mixture of other functions. The mixture can be formed by different functions over different domains or by weighted averages of the functions over the whole domain. Estimation of the function of interest involves estimation of various parameters as well as the weights.

Use of Basis Functions

Another approach to function estimation is to represent the function of interest as a linear combination of basis functions, that is, to represent the function in a series expansion. The basis functions are generally chosen to be orthogonal over the domain of interest, and the observed data are used to estimate the coefficients in the series.

Estimation of a Function at a Point

It is often more practical to estimate the function value at a given point. (Of course, if we can estimate the function at any given point, we can effectively have an estimate at all points.) One way of forming an estimate of a function at a given point is to take the average at that point of a filtering function that is evaluated in the vicinity of each data point. The filtering function is called a kernel, and the result of this approach is called a kernel estimator. We must be concerned about the properties of the estimators at specific points and also about properties over the full domain. Global properties over the full domain are often defined in terms of integrals or in terms of suprema or infima.

Kernel Methods

One approach to function estimation and approximation is to use a filter or kernel function to provide local weighting of the observed data. This approach ensures that at a given point the observations close to that point influence the estimate at the point more strongly than more distant observations. A standard method in this approach is to convolve the observations with a unimodal function that decreases rapidly away from a central point. A kernel has two arguments representing the two points in the convolution, but we typically use a single argument that represents the distance between the two points.

Some Kernels

Some examples of univariate kernel functions are

uniform: $K_u(t) = 0.5$, for $|t| \le 1$,
quadratic: $K_q(t) = 0.75(1 - t^2)$, for $|t| \le 1$,
normal: $K_n(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, for all $t$.

The kernels with finite support are defined to be 0 outside that range. Often, multivariate kernels are formed as products of these or other univariate kernels.
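As a concrete illustration, the three univariate kernels above can be written down directly. The following Python sketch (the function names are illustrative, not taken from any particular library) also checks numerically that each kernel integrates to approximately 1.

```python
import numpy as np

# A minimal sketch of the three univariate kernels listed above; the function
# names are illustrative, not taken from any particular library.

def uniform_kernel(t):
    """K_u(t) = 0.5 for |t| <= 1, and 0 outside that range."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.5, 0.0)

def quadratic_kernel(t):
    """K_q(t) = 0.75 (1 - t^2) for |t| <= 1, and 0 outside that range."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 1.0, 0.75 * (1.0 - t**2), 0.0)

def normal_kernel(t):
    """K_n(t) = exp(-t^2 / 2) / sqrt(2*pi), for all t."""
    t = np.asarray(t, dtype=float)
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

# Quick numerical check that each kernel integrates to (approximately) 1.
grid, dx = np.linspace(-6.0, 6.0, 12001, retstep=True)
for K in (uniform_kernel, quadratic_kernel, normal_kernel):
    print(K.__name__, float(np.sum(K(grid)) * dx))
```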
Kernel Methods

In kernel methods, the locality of influence is controlled by a window around the point of interest. The choice of the size of the window is the most important issue in the use of kernel methods. In practice, for a given choice of the size of the window, the argument of the kernel function is transformed to reflect the size. The transformation is accomplished using a positive definite matrix, $V$, whose determinant measures the volume (size) of the window.

Kernel Methods

To estimate the function $f$ at the point $x$, we first decompose $f$ to have a factor that is a probability density function, $p$,
$$f(x) = g(x)p(x).$$
For a given set of data, $x_1, \dots, x_n$, and a given scaling transformation matrix $V$, the kernel estimator of the function at the point $x$ is
$$\hat{f}(x) = (n|V|)^{-1} \sum_{i=1}^{n} g(x_i)\, K\!\left(V^{-1}(x - x_i)\right). \qquad (1)$$
In the univariate case, the size of the window is just the width $h$. The argument of the kernel is transformed to $s/h$, so the function that is convolved with the function of interest is $K(s/h)/h$. The univariate kernel estimator is
$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} g(x_i)\, K\!\left(\frac{x - x_i}{h}\right).$$

Pointwise Properties of Function Estimators

The statistical properties of an estimator of a function at a given point are analogous to the usual statistical properties of an estimator of a scalar parameter. The statistical properties involve expectations or other properties of random variables.

Bias

The bias of the estimator of a function value at the point $x$ is
$$\mathrm{E}(\hat{f}(x)) - f(x).$$
If this bias is zero, we would say that the estimator is unbiased at the point $x$. If the estimator is unbiased at every point $x$ in the domain of $f$, we say that the estimator is pointwise unbiased. Obviously, in order for $\hat{f}(\cdot)$ to be pointwise unbiased, it must be defined over the full domain of $f$.

Variance

The variance of the estimator at the point $x$ is
$$\mathrm{V}(\hat{f}(x)) = \mathrm{E}\!\left((\hat{f}(x) - \mathrm{E}(\hat{f}(x)))^2\right).$$
Estimators with small variance are generally more desirable, and an optimal estimator is often taken as the one with smallest variance among a class of unbiased estimators.

Mean Squared Error

The mean squared error, MSE, at the point $x$ is
$$\mathrm{MSE}(\hat{f}(x)) = \mathrm{E}((\hat{f}(x) - f(x))^2). \qquad (2)$$
The mean squared error is the sum of the variance and the square of the bias:
$$\mathrm{MSE}(\hat{f}(x)) = \mathrm{E}((\hat{f}(x))^2 - 2\hat{f}(x)f(x) + (f(x))^2) = \mathrm{V}(\hat{f}(x)) + (\mathrm{E}(\hat{f}(x)) - f(x))^2. \qquad (3)$$

Mean Squared Error

Sometimes, the variance of an unbiased estimator is much greater than that of an estimator that is only slightly biased, so it is often appropriate to compare the mean squared errors of the two estimators. In some cases, as we will see, unbiased estimators do not exist, so rather than seek an unbiased estimator with a small variance, we seek an estimator with a small MSE.

Mean Absolute Error

The mean absolute error, MAE, at the point $x$ is similar to the MSE:
$$\mathrm{MAE}(\hat{f}(x)) = \mathrm{E}(|\hat{f}(x) - f(x)|). \qquad (4)$$
It is more difficult to do mathematical analysis of the MAE than it is for the MSE. Furthermore, the MAE does not have a simple decomposition into other meaningful quantities similar to that of the MSE.

Consistency

Consistency of an estimator refers to the convergence of the expected value of the estimator to what is being estimated as the sample size increases without bound. If $m$ is a function (maybe a vector-valued function that is an elementwise norm), we can define consistency of an estimator $T_n$ in terms of $m$ if
$$\mathrm{E}(m(T_n - \theta)) \to 0. \qquad (5)$$

Rate of Convergence

If convergence does occur, we are interested in the rate of convergence. We define the rate of convergence in terms of a function of $n$, say $r(n)$, such that
$$\mathrm{E}(m(T_n - \theta)) = O(r(n)).$$
A common form of $r(n)$ is $n^{\alpha}$, where $\alpha < 0$. For example, in the simple case of a univariate population with a finite mean $\mu$ and finite second moment, using the sample mean $\bar{x}$ as the estimator $T_n$ and using $m(z) = z^2$, we have
$$\mathrm{E}(m(\bar{x} - \mu)) = \mathrm{E}((\bar{x} - \mu)^2) = \mathrm{MSE}(\bar{x}) = O(n^{-1}).$$
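To tie these pieces together, here is a minimal Python sketch of the univariate kernel estimator introduced above (the density-estimation case $g \equiv 1$), together with Monte Carlo approximations of its pointwise bias, variance, and MSE at one point; the bandwidth, sample size, and names are illustrative assumptions, not prescriptions.

```python
import numpy as np

# A minimal sketch: the univariate kernel estimator (density case, g = 1)
# evaluated at a single point, and Monte Carlo approximations of its pointwise
# bias, variance, and MSE there.  Bandwidth, sample size, and names are
# illustrative assumptions.

def normal_kernel(t):
    return np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)

def kernel_estimate(x, data, h, kernel=normal_kernel):
    """f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h), the density case g = 1."""
    data = np.asarray(data, dtype=float)
    return float(np.sum(kernel((x - data) / h)) / (data.size * h))

x0, h, n, reps = 0.0, 0.3, 500, 2000
true_value = normal_kernel(x0)                   # f(x0) for a standard normal f
rng = np.random.default_rng(0)
estimates = np.array([kernel_estimate(x0, rng.standard_normal(n), h)
                      for _ in range(reps)])

bias = estimates.mean() - true_value             # E(f_hat(x0)) - f(x0)
variance = estimates.var()                       # V(f_hat(x0))
mse = np.mean((estimates - true_value)**2)       # approximately variance + bias**2
print(bias, variance, mse)
```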
Pointwise Consistency

In the estimation of a function, we say that the estimator $\hat{f}$ of the function $f$ is pointwise consistent if
$$\mathrm{E}(\hat{f}(x)) \to f(x) \qquad (6)$$
for every $x$ in the domain of $f$. If the convergence in expression (6) is in probability, for example, we say that the estimator is weakly pointwise consistent. We can also define other kinds of pointwise consistency in function estimation along the lines of other types of consistency.

Global Properties of Estimators of Functions

Often, we are interested in some measure of the statistical properties of an estimator of a function over the full domain of the function. The obvious way of defining statistical properties of an estimator of a function is to integrate the pointwise properties. Statistical properties of a function, such as the bias of the function, are often defined in terms of a norm of the function. For comparing $\hat{f}(x)$ and $f(x)$, the $L_p$ norm of the error is
$$\left(\int_D |\hat{f}(x) - f(x)|^p \, dx\right)^{1/p}, \qquad (7)$$
where $D$ is the domain of $f$. The integral may not exist, of course. Clearly, the estimator $\hat{f}$ must also be defined over the same domain.

Convergence Norms

Three useful measures are the $L_1$ norm, also called the integrated absolute error, or IAE,
$$\mathrm{IAE}(\hat{f}) = \int_D |\hat{f}(x) - f(x)| \, dx, \qquad (8)$$
the square of the $L_2$ norm, also called the integrated squared error, or ISE,
$$\mathrm{ISE}(\hat{f}) = \int_D (\hat{f}(x) - f(x))^2 \, dx, \qquad (9)$$
and the $L_\infty$ norm, the sup absolute error, or SAE,
$$\mathrm{SAE}(\hat{f}) = \sup |\hat{f}(x) - f(x)|. \qquad (10)$$

Convergence Norms

The $L_1$ measure is invariant under monotone transformations of the coordinate axes, but the measure based on the $L_2$ norm is not. The $L_\infty$ norm, or SAE, is the most often used measure in general function approximation. In statistical applications, this measure applied to two cumulative distribution functions is the Kolmogorov distance. The measure is not so useful in comparing densities and is not often used in density estimation.

Convergence Measures

Other measures of the difference between $\hat{f}$ and $f$ over the full range of $x$ are the Kullback-Leibler measure,
$$\int_D \hat{f}(x) \log\!\left(\frac{\hat{f}(x)}{f(x)}\right) dx,$$
and the Hellinger distance,
$$\left(\int_D \left(\hat{f}^{1/p}(x) - f^{1/p}(x)\right)^p dx\right)^{1/p}.$$

Integrated Bias and Variance

We now want to develop global concepts of bias and variance for estimators of functions. Bias and variance are statistical properties that involve expectations of random variables. The obvious global measures of bias and variance are just the pointwise measures integrated over the domain. (In the case of the bias, of course, we must integrate the absolute value; otherwise, points of negative bias could cancel out points of positive bias.)

Integrated Bias

Because we are interested in the bias over the domain of the function, we define the integrated absolute bias as
$$\mathrm{IAB}(\hat{f}) = \int_D |\mathrm{E}(\hat{f}(x)) - f(x)| \, dx \qquad (11)$$
and the integrated squared bias as
$$\mathrm{ISB}(\hat{f}) = \int_D (\mathrm{E}(\hat{f}(x)) - f(x))^2 \, dx. \qquad (12)$$
If the estimator is unbiased, both the integrated absolute bias and the integrated squared bias are 0. This, of course, would mean that the estimator is pointwise unbiased almost everywhere. Although it is not uncommon to have unbiased estimators of scalar parameters or even of vector parameters with a countable number of elements, it is not likely that an estimator of a function could be unbiased at almost all points in a dense domain.

Integrated Variance

The integrated variance is defined in a similar manner:
$$\mathrm{IV}(\hat{f}) = \int_D \mathrm{V}(\hat{f}(x)) \, dx = \int_D \mathrm{E}\!\left((\hat{f}(x) - \mathrm{E}(\hat{f}(x)))^2\right) dx. \qquad (13)$$
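In practice, these global error measures are usually approximated by numerical integration on a grid. The following Python sketch (the helper names and the two densities compared are illustrative assumptions) computes the IAE, the ISE, and a grid-based SAE for two functions defined on a common domain.

```python
import numpy as np

# A hedged sketch of the global error measures above (IAE, ISE, SAE), computed
# by trapezoidal integration on a grid; the names and densities are illustrative.

def trapezoid(y, x):
    """Simple trapezoidal rule for values y on the grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def global_errors(f_hat_vals, f_vals, grid):
    """Return (IAE, ISE, SAE) from function values on a common grid."""
    diff = f_hat_vals - f_vals
    iae = trapezoid(np.abs(diff), grid)     # L1 norm of the error
    ise = trapezoid(diff**2, grid)          # square of the L2 norm
    sae = float(np.max(np.abs(diff)))       # sup absolute error (on the grid)
    return iae, ise, sae

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Example: compare a slightly misspecified normal density with the true one.
grid = np.linspace(-6.0, 6.0, 2001)
print(global_errors(normal_pdf(grid, 0.1, 1.1), normal_pdf(grid, 0.0, 1.0), grid))
```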
Integrated Mean Squared Error

As we suggested before, global unbiasedness is generally not to be expected. An important measure for comparing estimators of functions is, therefore, based on the mean squared error. The integrated mean squared error is
$$\mathrm{IMSE}(\hat{f}) = \int_D \mathrm{E}((\hat{f}(x) - f(x))^2) \, dx = \mathrm{IV}(\hat{f}) + \mathrm{ISB}(\hat{f}). \qquad (14)$$
(Compare equations (2) and (3) above.)

Integrated Mean Squared Error

If the expectation integration can be interchanged with the outer integration in the expression above, we have
$$\mathrm{IMSE}(\hat{f}) = \mathrm{E}\!\left(\int_D (\hat{f}(x) - f(x))^2 \, dx\right) = \mathrm{MISE}(\hat{f}),$$
the mean integrated squared error. We will assume that this interchange leaves the integrals unchanged, so we will use MISE and IMSE interchangeably.

Integrated Mean Absolute Error

Similarly, for the integrated mean absolute error, we have
$$\mathrm{IMAE}(\hat{f}) = \int_D \mathrm{E}(|\hat{f}(x) - f(x)|) \, dx = \mathrm{E}\!\left(\int_D |\hat{f}(x) - f(x)| \, dx\right) = \mathrm{MIAE}(\hat{f}),$$
the mean integrated absolute error.

Mean SAE

The mean sup absolute error, or MSAE, is
$$\mathrm{MSAE}(\hat{f}) = \int_D \mathrm{E}(\sup |\hat{f}(x) - f(x)|) \, dx. \qquad (15)$$
This measure is not very useful unless the variation in the function $f$ is relatively small. For example, if $f$ is a density function, $\hat{f}$ can be a "good" estimator, yet the MSAE may be quite large. On the other hand, if $f$ is a cumulative distribution function (monotonically ranging from 0 to 1), the MSAE may be a good measure of how well the estimator performs. As mentioned earlier, the SAE is the Kolmogorov distance. The Kolmogorov distance (and, hence, the SAE and the MSAE) does poorly in measuring differences in the tails of the distribution.

Large-Sample Properties

The pointwise consistency properties are extended to the full function in the obvious way. Consistency of the function estimator is defined in terms of
$$\int_D \mathrm{E}(m(\hat{f}(x) - f(x))) \, dx \to 0.$$
The estimator of the function is said to be mean square consistent or $L_2$ consistent if the MISE converges to 0; that is,
$$\int_D \mathrm{E}((\hat{f}(x) - f(x))^2) \, dx \to 0.$$
If the convergence is weak, that is, if it is convergence in probability, we say that the function estimator is weakly consistent; if the convergence is strong, that is, if it is convergence almost surely or with probability 1, we say the function estimator is strongly consistent.

Large-Sample Properties

The estimator of the function is said to be $L_1$ consistent if the mean integrated absolute error (MIAE) converges to 0; that is,
$$\int_D \mathrm{E}(|\hat{f}(x) - f(x)|) \, dx \to 0.$$
As with the other kinds of consistency, the nature of the convergence in the definition may be expressed in the qualifiers "weak" or "strong". As we have mentioned above, the integrated absolute error is invariant under monotone transformations of the coordinate axes, but the $L_2$ measures are not. As with most work in $L_1$, however, derivation of various properties of IAE or MIAE is more difficult than for analogous properties with respect to $L_2$ criteria.

Large-Sample Properties

If the MISE converges to 0, we are interested in the rate of convergence. To determine this, we seek an expression of MISE as a function of $n$. We do this by a Taylor series expansion. In general, if $\hat{\theta}$ is an estimator of $\theta$, the Taylor series for $\mathrm{ISE}(\hat{\theta})$, equation (9), about the true value $\theta_0$ is
$$\mathrm{ISE}(\hat{\theta}) = \sum_{k=0}^{\infty} \frac{1}{k!} (\hat{\theta} - \theta_0)^k \, \mathrm{ISE}^{(k)}(\theta_0), \qquad (16)$$
where $\mathrm{ISE}^{(k)}(\theta_0)$ represents the $k$th derivative of ISE evaluated at $\theta_0$. Taking the expectation in equation (16) yields the MISE. The limit of the MISE as $n \to \infty$ is the asymptotic mean integrated squared error, AMISE. One of the most important properties of an estimator is the order of the AMISE.
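The MISE itself can be approximated by Monte Carlo: average the ISE over repeated samples. The sketch below does this for a kernel density estimator of a standard normal density and shows the average ISE shrinking as $n$ grows; the sample sizes, the bandwidth rule, and the names are illustrative assumptions, not part of the development above.

```python
import numpy as np

# A rough Monte Carlo sketch of the MISE of a kernel density estimator for a
# standard normal target: average the ISE over repeated samples and watch it
# shrink as n grows.  Sample sizes, bandwidths, and names are illustrative.

def normal_pdf(x):
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

def kde_on_grid(grid, data, h):
    """Kernel density estimate (normal kernel) evaluated on a grid."""
    u = (grid[:, None] - data[None, :]) / h
    return normal_pdf(u).sum(axis=1) / (data.size * h)

def mise_estimate(n, h, reps=200, seed=1):
    rng = np.random.default_rng(seed)
    grid, dx = np.linspace(-5.0, 5.0, 401, retstep=True)
    true_pdf = normal_pdf(grid)
    ises = []
    for _ in range(reps):
        data = rng.standard_normal(n)
        ises.append(np.sum((kde_on_grid(grid, data, h) - true_pdf)**2) * dx)
    return float(np.mean(ises))              # mean ISE approximates the MISE

for n in (100, 400, 1600):
    print(n, mise_estimate(n, h=1.06 * n**(-0.2)))
```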
Large-Sample Properties

In the case of an unbiased estimator, the first two terms in the Taylor series expansion are zero, and the AMISE is
$$\mathrm{V}(\hat{\theta}) \, \mathrm{ISE}''(\theta_0)$$
to terms of second order.

Other Global Properties of Estimators of Functions: Roughness

There are often other properties that we would like an estimator of a function to possess. We may want the estimator to weight given functions in some particular way. For example, if we know how the function to be estimated, $f$, weights a given function $r$, we may require that the estimate $\hat{f}$ weight the function $r$ in the same way; that is,
$$\int_D r(x)\hat{f}(x) \, dx = \int_D r(x)f(x) \, dx.$$
We may want to restrict the minimum and maximum values of the estimator. For example, because many functions of interest are nonnegative, we may want to require that the estimator be nonnegative.

Other Global Properties of Estimators of Functions: Roughness

We may want to restrict the variation in the function. This can be thought of as the "roughness" of the function. A reasonable measure of the variation is
$$\int_D \left(f(x) - \int_D f(x) \, dx\right)^2 dx.$$
If the integral $\int_D f(x) \, dx$ is constrained to be some constant (such as 1 in the case that $f(x)$ is a probability density), then the variation can be measured by the square of the $L_2$ norm,
$$S(f) = \int_D (f(x))^2 \, dx. \qquad (17)$$

Other Global Properties of Estimators of Functions: Roughness

We may want to restrict the derivatives of the estimator or the smoothness of the estimator. Another intuitive measure of the roughness of a twice-differentiable and integrable univariate function $f$ is the integral of the square of the second derivative:
$$R(f) = \int_D (f''(x))^2 \, dx. \qquad (18)$$
Often, in function estimation, we may seek an estimator $\hat{f}$ such that its roughness (by some definition) is small.

Nonparametric Probability Density Estimation

Estimation of a probability density function is similar to the estimation of any function, and the properties of the function estimators that we have discussed are relevant for density function estimators. A density function $p(y)$ is characterized by two properties:
• it is nonnegative everywhere;
• it integrates to 1 (with the appropriate definition of "integrate").

Nonparametric Probability Density Estimation

We consider several nonparametric estimators of a density; that is, estimators of a general nonnegative function that integrates to 1 and for which we make no assumptions about a functional form other than, perhaps, smoothness. It seems reasonable that we require the density estimate to have the characteristic properties of a density:
• $\hat{p}(y) \ge 0$ for all $y$;
• $\int_{\mathbb{R}^d} \hat{p}(y) \, dy = 1$.

Bona Fide Density Estimator

A probability density estimator that is nonnegative and integrates to 1 is called a bona fide estimator. Rosenblatt has shown that no unbiased bona fide estimator can exist for all continuous $p$. Rather than requiring an unbiased estimator that cannot be a bona fide estimator, we generally seek a bona fide estimator with small mean squared error or a sequence of bona fide estimators $\hat{p}_n$ that are asymptotically unbiased; that is,
$$\mathrm{E}_p(\hat{p}_n(y)) \to p(y) \ \text{for all } y \in \mathbb{R}^d \text{ as } n \to \infty.$$

The Likelihood Function

Suppose that we have a random sample, $y_1, \dots, y_n$, from a population with density $p$. Treating the density $p$ as a variable, we write the likelihood functional as
$$L(p; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i).$$
The maximum likelihood method of estimation obviously cannot be used directly because this functional is unbounded in $p$. We may, however, seek an estimator that maximizes some modification of the likelihood.
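The quantities that appear in the modified likelihood approaches below, namely the likelihood functional and a roughness penalty such as $R(p)$ of equation (18), are easy to evaluate numerically for a given candidate density. The following Python sketch is a minimal illustration under the assumption that the candidate density is supplied as an ordinary function; all names are illustrative.

```python
import numpy as np

# A minimal sketch, assuming a candidate density given as an ordinary Python
# function: the log of the likelihood functional above and the roughness R(p)
# of equation (18), approximated by finite differences.  Names are illustrative.

def log_likelihood(pdf, sample):
    """log L(p; y_1, ..., y_n) = sum_i log p(y_i)."""
    return float(np.sum(np.log(pdf(sample))))

def roughness(pdf, grid):
    """R(p) = integral of (p''(x))^2 dx, via second differences on a grid."""
    values = pdf(grid)
    dx = grid[1] - grid[0]
    second = np.gradient(np.gradient(values, dx), dx)   # approximate p''
    return float(np.sum(second**2) * dx)

def candidate(x):
    """A smooth candidate density: the standard normal."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(2)
sample = rng.standard_normal(200)
grid = np.linspace(-5.0, 5.0, 2001)
print(log_likelihood(candidate, sample), roughness(candidate, grid))
```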
Modified Maximum Likelihood Estimation

There are two reasonable ways to approach this problem. One is to restrict the domain of the optimization problem. This is called restricted maximum likelihood. The other is to regularize the estimator by adding a penalty term to the functional to be optimized. This is called penalized maximum likelihood.

Restricted Maximum Likelihood Estimation

We may seek to maximize the likelihood functional subject to the constraint that $p$ be a bona fide density. If we put no further restrictions on the function $p$, however, infinite Dirac spikes at each observation give an unbounded likelihood, so a maximum likelihood estimator cannot exist, subject only to the restriction to the bona fide class. An additional restriction that $p$ be Lebesgue-integrable over some domain $D$ (that is, $p \in L_1(D)$) does not resolve the problem, because we can construct sequences of finite spikes at each observation that grow without bound. We therefore must restrict the class further.

Restricted Maximum Likelihood Estimation

Consider a finite-dimensional class, such as the class of step functions that are bona fide density estimators. We assume that the sizes of the regions over which the step function is constant are greater than 0. For a step function with $m$ regions having constant values $c_1, \dots, c_m$, the likelihood is
$$L(c_1, \dots, c_m; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i) = \prod_{k=1}^{m} c_k^{n_k}, \qquad (19)$$
where $n_k$ is the number of data points in the $k$th region.

continued ...

For the step function to be a bona fide estimator, all $c_k$ must be nonnegative and finite. A maximum therefore exists in the class of step functions that are bona fide estimators. If $v_k$ is the measure of the volume of the $k$th region (that is, $v_k$ is the length of an interval in the univariate case, the area in the bivariate case, and so on), we have
$$\sum_{k=1}^{m} c_k v_k = 1.$$
We incorporate this constraint together with equation (19) to form the Lagrangian,
$$L(c_1, \dots, c_m) + \lambda\left(1 - \sum_{k=1}^{m} c_k v_k\right).$$

continued ...

Differentiating the Lagrangian function and setting the derivative to zero, we have at the maximum point $c_k = c_k^*$, for any $\lambda$,
$$\frac{\partial L}{\partial c_k} = \lambda v_k.$$
Using the derivative of $L$ from equation (19), we get
$$n_k L = \lambda c_k^* v_k.$$
Summing both sides of this equation over $k$ (and using $\sum_k n_k = n$ and $\sum_k c_k^* v_k = 1$), we have $nL = \lambda$, and then substituting, we have
$$n_k L = n L c_k^* v_k.$$

continued ...

Therefore, the maximum of the likelihood occurs at
$$c_k^* = \frac{n_k}{n v_k}.$$
The restricted maximum likelihood estimator is therefore
$$\hat{p}(y) = \begin{cases} \dfrac{n_k}{n v_k}, & \text{for } y \in \text{region } k, \\[1ex] 0, & \text{otherwise.} \end{cases} \qquad (20)$$

Restricted Maximum Likelihood Estimation

Instead of restricting the density estimate to step functions, we could consider other classes of functions, such as piecewise linear functions. We may also seek other properties, such as smoothness, for the estimated density. One way of achieving other desirable properties for the estimator is to use a penalizing function to modify the function to be optimized.

continued ...

Instead of the likelihood function, we may use a penalized likelihood function of the form
$$L_p(p; y_1, \dots, y_n) = \prod_{i=1}^{n} p(y_i)\, e^{-T(p)},$$
where $T(p)$ is a transform that measures some property that we would like to minimize. For example, to achieve smoothness, we may use the transform $R(p)$ of equation (18) in the penalizing factor. To choose a function $\hat{p}$ to maximize $L_p(p)$, we would have to use some finite series approximation to $T(\hat{p})$.
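In the univariate case with prespecified regions, the restricted maximum likelihood estimator of equation (20) is just a histogram-type step function. The following Python sketch evaluates it at a few points; the fixed breakpoints and the names are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

# A minimal sketch of the restricted maximum likelihood (step-function) estimator
# of equation (20): p_hat(y) = n_k / (n v_k) on region k and 0 elsewhere.  The
# fixed breakpoints below are an illustrative choice, not part of the derivation.

def step_density_estimate(y, data, breaks):
    """Evaluate the step-function estimator at the points y."""
    data = np.asarray(data, dtype=float)
    y = np.asarray(y, dtype=float)
    n = data.size
    counts, _ = np.histogram(data, bins=breaks)        # n_k for each region
    widths = np.diff(breaks)                           # v_k for each region
    heights = counts / (n * widths)                    # n_k / (n v_k)
    k = np.searchsorted(breaks, y, side="right") - 1   # region index of each y
    inside = (k >= 0) & (k < heights.size)
    return np.where(inside, heights[np.clip(k, 0, heights.size - 1)], 0.0)

rng = np.random.default_rng(3)
sample = rng.standard_normal(500)
breaks = np.linspace(-4.0, 4.0, 17)                    # 16 equal-width regions
print(step_density_estimate([-1.0, 0.0, 1.0, 5.0], sample, breaks))
```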