Curve and Surface Estimation using Dynamic Step Functions*

Juha Heikkinen
Finnish Forest Research Institute
Unioninkatu 40a, FIN-00170 Helsinki, Finland
[email protected]

January 14, 1998 (references updated September 24, 1999)

* This paper has been published as Chapter 14 (pp 255-272) of D. Dey, P. Müller and D. Sinha (editors): Practical Nonparametric and Semiparametric Bayesian Statistics, Springer-Verlag, New York, 1998.

Abstract

This chapter describes a nonparametric Bayesian approach to the estimation of curves and surfaces that act as parameters in statistical models. The approach is based on mixing variable dimensional piecewise constant approximations, whose 'smoothness' is regulated by a Markov random field prior. Random partitions of the domain are defined by Voronoi tessellations of random generating point patterns. Variable dimension Markov chain Monte Carlo methods are proposed for the numerical estimation, and a detailed algorithm is specified for one special case. General applicability of the approach is discussed in the context of density estimation, regression and interpolation problems, and an application to the intensity estimation for a spatial Poisson point process is presented.

1 Introduction

Statistical modelling is often concerned with finding functions that describe some features of a given data set. Perhaps the most common examples are density estimation, where the object is a probability density function, and regression analysis, where the function of interest usually returns the expected value of the response variable corresponding to given value(s) of the predictor(s). This chapter addresses the general problem of estimating such functions nonparametrically.

The method described here was motivated by Arjas and Gasbarra (1994), where a new approach was introduced to nonparametric Bayesian curve estimation. Curves defined on a finite interval were parametrised by the function values at a finite number of points (to be called generating points here). In the prior (and hence also in the posterior) the number and locations of generating points were random, i.e. they varied from one realisation to another. This led to an infinite-dimensional parameter space, whereby the approach can be honestly called nonparametric. Deterministic interpolation was applied between the generating points by the simplest possible means, piecewise constant functions with jumps at the generating points. More sophisticated forms of interpolation can easily be applied in the curve estimation context; a concrete demonstration was provided by Denison, Mallick and Smith (1998), where sequences of piecewise polynomials were used.

Since Bayesian inference is not concerned with selecting a point estimate (here a single curve) from the postulated model class, the precise functional form of its individual members is not as crucial as in the frequentist approach. More important is that the integrals of test functions of interest (e.g. predictive densities or probabilities) w.r.t. the posterior distribution obtained from the approximate model are close to those obtained from the 'true' model (see Arjas and Andreev, 1996). Furthermore, a 'Bayesian point estimate', the posterior mean, does not necessarily belong to the model class. Hence, in the model of Arjas and Gasbarra (1994) pointwise posterior means need not form a piecewise constant curve, since the jump points are variable, and indeed the posterior mean is typically a smooth continuous function. Further discussion on the topic can be found in Section 7, in the papers cited above, and in Arjas (1996).

In Arjas and Gasbarra (1994) a prior distribution for the marked generating points was specified in terms of the corresponding local characteristics. A martingale structure was assumed, which penalises large differences between nearby function values.
The aim was, besides smoothing the oscillations, to have the change points concentrated on the places of most rapid changes. Markov chain Monte Carlo (MCMC) methods were proposed to carry out the computations required for the actual inferences. It is not possible, however, to apply standard MCMC algorithms, such as the Gibbs sampler, when the length of the parameter vector is variable. Arjas and Gasbarra (1994) modified the Gibbs sampler in a rather ad hoc way that is practical only in curve estimation. A general framework for handling variable dimension was provided by the reversible jump MCMC approach of Green (1995).

Two very similar extensions of this 'dynamic step function approach' to surface estimation, using Voronoi tessellations to define the regions of constant function value, were presented in Green (1995) and in Heikkinen and Arjas (1998). The main concern in the example of Green (1995) was in finding change-curves in a surface that is truly discontinuous, and accordingly independence was assumed between step function values in different regions. Heikkinen and Arjas (1998) followed the approach of Arjas and Gasbarra (1994) also in building a smoothing prior across the tiles. Locally dependent Markov random fields, widely used in Bayesian image analysis, were applied for this purpose. A related approach, using triangulations, was taken by Nicholls (1998) in the context of image segmentation. Arjas and Heikkinen (1997) presented the one-dimensional (curve estimation) version of the method of Heikkinen and Arjas (1998), and Heikkinen and Arjas (1999) demonstrated the flexibility of the approach through an application where the model contains three unknown functions to be estimated, non-parametrically and simultaneously.
Based on the experiences from these works, the purpose of this chapter is to suggest a rather simple, but in many cases sufficient, 'prototype algorithm' for nonparametric Bayesian function estimation, and to discuss its possible modifications and extensions. Although the proposed method is theoretically applicable in any dimension, it is likely to be practical mainly in the estimation of uni- and bivariate functions, i.e. curves and surfaces. Both the computational effort and the amount of data required for useful nonparametric inference are usually prohibitive in higher dimensions.

Section 2 introduces the general setup and some common statistical problems, for which the proposed method might be useful. Concrete demonstrations can be found in the papers cited above; a brief description of one of these examples is given in Section 6. Section 3 reviews some concepts from spatial statistics to be used in Section 4, where a basic version of the proposed prior model is defined. MCMC inference based on the corresponding posterior distribution is considered in Section 5, with a detailed algorithm specified in the Appendix. Possible modifications and extensions are discussed in Section 7.

2 Some statistical problems

Suppose that a statistical model is specified for data x by a likelihood expression p(x | θ) containing an unknown parameter function θ : E → R defined on a bounded domain E ⊂ R^d, d ∈ {1, 2, …}. The general aim here is to describe a method for the nonparametric estimation of θ. Although the method is in principle applicable in any dimension d, it has been implemented in practice only for the estimation of curves (d = 1, Arjas and Heikkinen, 1997) and surfaces (d = 2, Heikkinen and Arjas, 1998). The form of data x and likelihood function p may be arbitrary as long as likelihood ratios can be easily evaluated.

In density estimation, for example, θ would be a probability density function on E, x = (x_1, …, x_n) a sample of n realisations x_i ∈ E, and

    p(x | θ) = ∏_{i=1}^{n} θ(x_i).    (1)

Arjas and Heikkinen (1997) and Heikkinen and Arjas (1998) considered the estimation of the intensity function of a non-homogeneous Poisson process (see Section 3.1), a problem essentially analogous to density estimation. An example from Heikkinen and Arjas (1998) is briefly reviewed in Section 6 to demonstrate the method.

Another common example is regression analysis with a d-dimensional predictor, whose values lie within E. Each data item x_i consists of a predictor value s_i ∈ E and a related observation z_i of a real valued response variable. The unknown regression function θ : E → R describes the relationship between the variables; usually θ(s_i) is the expected value of z_i. Given θ, the records z_i, i = 1, …, n, are assumed conditionally independent with conditional density functions p(z_i | θ(s_i)). The likelihood function of this model is defined by

    p(x | θ) = ∏_{i=1}^{n} p(x_i | θ) = ∏_{i=1}^{n} p(z_i | θ(s_i)).    (2)

Using the same notation, the problem of interpolating data x can be formulated as a degenerate case of likelihood (2) with p(z_i | θ(s_i)) having all its mass at θ(s_i). Both cases are demonstrated in Heikkinen and Arjas (1999), where the model contains a regression curve and an interpolated surface.

3 Some spatial statistics

This section reviews some concepts, originating from spatial statistics, that are needed to introduce the suggested prior distribution for parameter functions θ (Section 4).

3.1 Poisson point processes

A point process on a bounded region E ⊂ R^d is a statistical model for point patterns ξ = {ξ_1, …, ξ_K} ⊂ E with (usually) a random number K = K(ξ) of points. For K = 1, 2, …, let Ω_K be the space of all such patterns on E which have K points, and let Ω_0 = {∅}, where ∅ denotes an empty configuration. Formally, a point process may be defined as a random variable taking values in Ω = ∪_{K=0}^{∞} Ω_K.
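For concreteness, one realisation of such a random pattern can be simulated directly; the sketch below anticipates the homogeneous Poisson process specified next, drawing K from a Poisson distribution and then placing K i.i.d. uniform points. The chapter contains no code; Python, the function names, and Knuth's multiplication method for Poisson sampling are our choices, not the author's.

```python
import math
import random

def sample_poisson(mean, rng=random):
    """Poisson sampling by Knuth's multiplication method; adequate for
    the moderate means (intensity times area) arising here."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def homogeneous_pattern(lam, a, b, rng=random):
    """One realisation xi = {xi_1, ..., xi_K} of a homogeneous Poisson
    pattern on E = [0, a] x [0, b]: K ~ Poisson(lam * a * b) and,
    given K, i.i.d. uniform locations on E."""
    K = sample_poisson(lam * a * b, rng)
    return [(rng.uniform(0.0, a), rng.uniform(0.0, b)) for _ in range(K)]

random.seed(42)
pattern = homogeneous_pattern(lam=100.0, a=1.0, b=1.0)
```

A pattern drawn this way serves, for instance, as an initial state or as a prior realisation of the generating points in Section 4.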
The Poisson process on E is specified by an intensity function λ : E → [0, ∞), or by the corresponding intensity measure Λ(A) = ∫_A λ(s) ν(ds), where ν is the Lebesgue measure on R^d. The number of points, K, follows the Poisson distribution with expectation Λ(E), whereas conditional on K, the locations ξ_1, …, ξ_K are distributed on E as i.i.d. random variables with density λ/Λ(E). The distribution P of the Poisson process is the probability measure on Ω defined by

    P(F) = e^{−Λ(E)} [ 1(∅ ∈ F) + ∑_{K=1}^{∞} (1/K!) ∫_{E^K} 1({ξ_1, …, ξ_K} ∈ F) ∏_{k=1}^{K} Λ(dξ_k) ].

A Poisson process with a constant intensity function λ is called homogeneous. The density function p of a Poisson process with distribution P can be defined as a Radon-Nikodym derivative of P w.r.t. the distribution of a homogeneous reference process, yielding

    p(ξ) ∝ e^{−Λ(E)} ∏_{k=1}^{K(ξ)} λ(ξ_k).

For more detailed discussion on the Poisson process and point processes in general, see e.g. Karr (1991), Stoyan, Kendall and Mecke (1995) or chapter 8 of Cressie (1993).

Figure 1: Voronoi tessellation generated by the +'s (left); point × is added to the pattern, its Voronoi tile consists of parts 'conquered' from the neighbours, other tiles are not affected (middle); Delaunay graph of the +'s (right).

3.2 Voronoi tessellations

The Voronoi tessellation of E generated by point pattern ξ is the partition E = ∪_{k=1}^{K} E_k(ξ) into non-overlapping tiles

    E_k(ξ) = { s ∈ E : ‖s − ξ_k‖ ≤ ‖s − ξ_j‖ for all j }.

Thus the interior of E_k(ξ) consists of all points in E which are closer to ξ_k than to any other point of ξ. On the real line, E = [T_1, T_2] ⊂ R, Voronoi tiles are intervals, whose endpoints lie halfway between successive points of ξ:

    E_k(ξ) = [T_1, (ξ_1 + ξ_2)/2]                      for k = 1,
    E_k(ξ) = [(ξ_{k−1} + ξ_k)/2, (ξ_k + ξ_{k+1})/2]    for k = 2, …, K − 1,
    E_k(ξ) = [(ξ_{K−1} + ξ_K)/2, T_2]                  for k = K.

On the plane, d = 2, Voronoi tiles are convex polygons, whose edges lie halfway between two points of ξ and perpendicular to the line segment joining them (see Figure 1, left).

A natural neighbourhood relation ∼_ξ between the points of ξ can be derived from the Voronoi tessellation by letting those pairs be neighbours whose tile boundaries intersect. Hence on the real line successive points are neighbours, and on the plane those whose Voronoi polygons share a common boundary segment. This neighbourhood relation defines the Delaunay graph (Figure 1, right). On the plane, if ξ is a realisation from a point process which is absolutely continuous w.r.t. the Poisson process, then (with probability 1) the Delaunay graph is a triangulation of ξ. For regular patterns, such as lattices, this is not necessarily the case.

In order to avoid unnecessarily complex notation for Voronoi tiles and neighbours, the references to the underlying point pattern ξ will be omitted whenever possible without ambiguity. The shorthand notation j ∼ k will also be applied for ξ_j ∼ ξ_k. The index set {j : j ∼ k} will be denoted by ∂_k = ∂_k(ξ), and its cardinality, i.e. the number of neighbours of point ξ_k, by N_k = N_k(ξ).

A particularly useful property of Voronoi tessellations in the current context is that they can be locally updated after small changes in the generating pattern ξ: when a new point is added to a tessellated pattern (or when one is removed), only the neighbours of the new (or removed) tile need to be modified (see Figure 1, middle). For a theoretical treatment of random Voronoi tessellations and mathematical proofs concerning their properties the reader is referred to Møller (1994). Okabe, Boots and Sugihara (1992) take a more application-oriented approach.

3.3 Locally dependent Markov random fields

Consider then the joint distribution of K random variables η_1, …, η_K.
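As a concrete aside on Section 3.2, the one-dimensional tile formula and the neighbour structure ∂_k translate directly into code. This is an illustrative sketch only; Python and the function names are ours.

```python
def voronoi_tiles_1d(xi, T1, T2):
    """Voronoi tiles of E = [T1, T2] generated by increasing points xi:
    interval endpoints lie halfway between successive generating points."""
    K = len(xi)
    mids = [(xi[k] + xi[k + 1]) / 2.0 for k in range(K - 1)]
    cuts = [T1] + mids + [T2]
    return [(cuts[k], cuts[k + 1]) for k in range(K)]

def neighbours_1d(K):
    """On the line successive points are Delaunay neighbours; returns
    the index sets corresponding to the sets ∂_k (0-based indices)."""
    return [[j for j in (k - 1, k + 1) if 0 <= j < K] for k in range(K)]

tiles = voronoi_tiles_1d([0.2, 0.4, 0.9], 0.0, 1.0)
# tiles is approximately [(0.0, 0.3), (0.3, 0.65), (0.65, 1.0)]
```

Adding or removing a generating point changes only the two adjacent cut points, which is the one-dimensional instance of the local-update property noted above.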
In general, the term Markov random field (MRF) only means that the joint density p(η_1, …, η_K) is specified through its full conditionals p(η_k | η_j, j ≠ k). To define a locally dependent MRF, a symmetric neighbourhood relation ∼ between the sites 1, …, K is specified, and it is assumed that the full conditionals depend only on the neighbouring values:

    p(η_k | η_j, j ≠ k) = p(η_k | η_j, j ∼ k).

Non-obvious consistency conditions are required of the conditional distributions in order to specify a well defined joint distribution by the MRF approach. The general conditions were identified by the famous unpublished Hammersley-Clifford theorem (see Besag, 1974). Sufficient conditions for Gaussian random fields can be found in Besag and Kooperberg (1995), for example.

4 Prototype prior

Each realisation θ of the suggested prior distribution is a piecewise constant function on a partition of E, but the number and the boundaries of the subregions are random. Random partitions are obtained as Voronoi tessellations (Section 3.2) of random generating point patterns ξ. The step function θ can then be parametrised by locations ξ_1, …, ξ_K and corresponding function values η_1, …, η_K, letting θ take value η_k on tile E_k(ξ):

    θ(s) = ∑_{k=1}^{K} η_k 1{s ∈ E_k(ξ)}.

In other words, each realisation is parametrised by a marked point pattern (ξ_1, η_1), …, (ξ_K, η_K). The prior distribution of functions θ is then naturally determined by specifying the joint prior of ξ and η = (η_1, …, η_K). This, in turn, is built via the chain rule decomposition p(ξ, η) = p(ξ) p(η | ξ).

The prior for the generating points ξ is a homogeneous Poisson process (Section 3.1) constrained to have at least two points. It has one parameter, the intensity λ > 0, and its density (for fixed λ) is specified by

    p(ξ) ∝ λ^K 1(K > 1).

The choice of λ determines the average tile size in the individual partitions.
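Evaluating the step function at a point needs no explicit tile construction: by the definition of the Voronoi tiles, θ(s) is simply the mark of the generating point nearest to s, in any dimension. A minimal sketch (Python and all names are ours, not the chapter's):

```python
def theta_value(s, xi, eta):
    """Evaluate the step function: theta(s) = eta_k on tile E_k(xi),
    i.e. the mark of the generating point nearest to s.
    `s` and each xi[k] are coordinate tuples of common dimension d."""
    dist2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    k = min(range(len(xi)), key=lambda j: dist2(s, xi[j]))
    return eta[k]

# Toy marked pattern with two generating points on the unit square.
xi = [(0.2, 0.2), (0.8, 0.8)]
eta = [1.0, 3.0]
v = theta_value((0.1, 0.3), xi, eta)  # 1.0: (0.1, 0.3) is nearest to xi[0]
```

A linear scan suffices for the modest numbers of generating points typical here; a spatial index would only matter for much larger patterns.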
In other respects, the homogeneous Poisson process serves as a non-informative prior for point patterns.

The prior distribution of marks η is developed conditionally on the locations ξ of the generating points, and it contains an assumption of smoothness in the sense that the differences |η_k − η_j| between function values on nearby tiles are expected to be small. A natural means of formulating this assumption is provided by locally dependent Markov random fields (Section 3.3). If the parameter surface may in principle take any real values, then a simple model is specified by assuming Gaussian full conditionals with expected values given by the averages of the neighbouring values:

    E(η_k | η_{−k}; ξ) = (1/N_k) ∑_{j∼k} η_j.    (3)

For such conditional specifications to be consistent, the precision parameters must be set as τ N_k for some τ > 0:

    p(η_k | η_{−k}, ξ, τ) = (τ N_k / 2π)^{1/2} exp{ −(τ N_k / 2) [ η_k − (1/N_k) ∑_{j∼k} η_j ]² }.

The implied joint distribution is a Gaussian pairwise difference prior

    p(η | ξ, τ) ∝ (τ / 2π)^{K/2} |W|^{1/2} exp{ −(τ/2) ∑_{j∼k} (η_k − η_j)² },

where W = W(ξ) is the positive semi-definite K × K matrix with elements W_kk = N_k, W_kj = −1 for j ∼ k, and W_kj = 0 otherwise, and |W| denotes the product of the non-zero eigenvalues of W (Besag and Higdon, 1997). The prior is improper, because the density is invariant under shifts of every coordinate by the same amount, but the posterior is proper in the presence of any informative data.

There are two hyperparameters, λ and τ, in the complete prior distribution p(θ) = p(ξ, η). In the algorithm specified in the Appendix the hyperparameters are assumed to be given. It is not technically difficult, however, to extend the algorithm so that one or both of them are treated as unknown (see Heikkinen and Arjas, 1999). Section 7 contains further discussion on this topic.
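Given the neighbour lists ∂_k, both the unnormalised pairwise difference prior and its Gaussian full conditionals are cheap to evaluate, which is what the MCMC sampler of Section 5 exploits. The following is a minimal sketch (Python, the function names, and the toy neighbourhood are ours):

```python
def log_prior_eta(eta, nbrs, tau):
    """Unnormalised log of the pairwise difference prior:
    -(tau/2) * sum over neighbour pairs {j, k} of (eta_k - eta_j)^2,
    each unordered pair counted once.  `nbrs[k]` lists the indices in ∂_k."""
    s = sum((eta[k] - eta[j]) ** 2
            for k in range(len(eta)) for j in nbrs[k] if j > k)
    return -0.5 * tau * s

def full_conditional(k, eta, nbrs, tau):
    """Gaussian full conditional of eta_k: mean is the neighbour
    average of equation (3), precision is tau * N_k."""
    Nk = len(nbrs[k])
    mean = sum(eta[j] for j in nbrs[k]) / Nk
    return mean, tau * Nk

# Three sites on a line, 0 ~ 1 ~ 2 (0-based ∂_k lists).
nbrs = [[1], [0, 2], [1]]
mean, prec = full_conditional(1, [0.0, 5.0, 2.0], nbrs, tau=2.0)
# mean = 1.0 (average of the two neighbours), prec = 4.0
```

Note that the normalising factor involving |W| is deliberately omitted here; as discussed in Section 5, it is exactly the term that does not factorise and must be approximated in dimension changing moves.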
5 Posterior inference

The general idea of Markov chain Monte Carlo methods in Bayesian inference is to construct a Markov chain that has the posterior distribution of interest as its equilibrium distribution. A realisation obtained by simulating this chain can be regarded as a dependent, approximate sample from the posterior, and various posterior inferences can be drawn from empirical data analysis of this sample. For example, posterior means can be approximated by corresponding averages and posterior probabilities by corresponding proportions in the MCMC sample.

In the simulation of an appropriate Markov chain the reversible jump Markov chain Monte Carlo method of Green (1995) follows the general Metropolis-Hastings recipe (Metropolis et al., 1953; Hastings, 1970) by iterating the two-stage basic update step:

• Given the current state θ (defined by ξ and η in the present case), propose a move of (a randomly or systematically chosen) type m by drawing the proposed new state θ′ (defined by ξ′ and η′) from the proposal distribution corresponding to move type m and conditional on θ.

• With probability α_m(θ′ | θ), chosen so as to maintain detailed balance within each move type, accept θ′ as the new state of the sampler chain; otherwise retain the current state θ.

The novelty in Green (1995) was to provide a general approach for handling pairs θ and θ′ of possibly differing dimension (different numbers of generating points in our case). It should be mentioned here that such an algorithm was developed already in Geyer and Møller (1994) in the context of spatial point processes, and that Tierney (1995) formulated an even more general framework than that of Green.

To define an irreducible prototype sampler for our posterior it is sufficient to consider the following proposal types:

1. Change the function value to η′_k in a randomly chosen tile E_k.

2. Add a new randomly located (marked) generating point (ξ′_k, η′_k), k = K + 1.

3. Remove a randomly chosen generating point (ξ_k, η_k).

A detailed description of the suggested sampler is given in the Appendix. This section concludes with some computational considerations related to the evaluation of the posterior ratio

    p(θ′ | x) / p(θ | x) = [ p(θ′) / p(θ) ] · [ p(x | θ′) / p(x | θ) ],

required in computing the appropriate acceptance probabilities.

For a proposal of type 1 the prior ratio,

    p(θ′) / p(θ) = p(η′_k | η_{−k}, ξ) / p(η_k | η_{−k}, ξ),

is immediately available from the Markov random field specification of the function value prior. Owing to the local dependence these full conditionals are also easy to evaluate. Obviously, this is one reason for the popularity of locally dependent MRF priors in Bayesian image analysis.

Move types 2 and 3 are more problematic, since the matrix W = W(ξ) in p(η | ξ) is different from W(ξ′). The 'determinant' |W| does not factorise, and it would be practically infeasible to evaluate it in each update. Heikkinen and Arjas (1998) suggested a local approximation: replace W by a smaller matrix corresponding to a reduced generating point pattern that consists of ξ_k (or ξ′_k, as appropriate) and of its neighbours. This means acting as if E were the union of the corresponding tiles only. Then the computations stay local, which is an essential requirement in designing a successful Markov chain Monte Carlo sampler in a high-dimensional parameter space.

The data enter into the computation through the likelihood ratio

    p(x | θ′) / p(x | θ),

where θ is the current and θ′ the proposed parameter function. For the models of Section 2 this is easy to evaluate. Consider, for example, likelihood (1) for density estimation, and let A ⊂ E be the set where θ′ differs from θ. For move type 1, A is just E_k, and for move types 2 and 3, A consists of E_k(ξ) (or E_k(ξ′)) and its neighbouring tiles. Then the likelihood ratio

    p(x | θ′) / p(x | θ) = ∏_{i : x_i ∈ A} θ′(x_i) / θ(x_i)

only involves the restriction of x, θ and θ′ to A, i.e.
it can be locally evaluated. In this case, as well as in the other examples of Section 2, the essential reason for easy evaluation of the likelihood ratio is the conditional independence of data items x_i given θ. Usually it is only dependencies (not contained in θ) between the data items that can cause serious problems in evaluating the ratio. A concrete example of such a case is given in Heikkinen and Penttinen (1999), where θ is the pair potential function of the Gibbs point process model for dependent spatial locations.

6 Example in intensity estimation

To illustrate a simple application in a surface estimation context, this section briefly reviews an example from Heikkinen and Arjas (1998), where more details can be found. The top right display of Figure 2 shows a point pattern simulated from a non-homogeneous planar Poisson process (Section 3.1) on a unit square with the intensity function shown on the top left. In the bottom row are a simple kernel density estimate (left) and a posterior mean estimate obtained by the method proposed here (right). The latter seems to exhibit better adaptivity: smoothness in the 'flat' regions, but also abrupt changes of intensity level on the steep slopes of the 'ridge'. Heikkinen and Arjas (1998) also contains some quantitative comparisons indicating that the proposed method indeed restores the original intensity surface better than simple kernel density estimates.

It should be remembered that the output of our algorithm is an (approximate) sample from the entire posterior distribution, and that the posterior mean estimate is only its most basic summary. Various other aspects of the posterior can easily be studied from the MCMC sample. As just one example, Figure 3 shows the marginal posterior densities of the intensity λ(s) at three locations, s = (0.2, 0.7), (0.4, 0.4), (0.8, 0.8).

7 Discussion

The approach described here is not limited to step function approximations.
The prior is specified for the marked generating points, and the interpolation between them can, in principle, be carried out in any other way, as long as it is local. In curve estimation, for example, one could add fixed generating point locations at the ends of the interval E, and apply linear interpolation between the generating points. This change would only affect the likelihood computations. On the plane one could obtain piecewise linear functions with the help of the Delaunay triangulation (see Section 3.2): the marked generating points determine a unique function that is linear within each Delaunay triangle. Figure 1 clearly reveals one problem with this idea: how to handle the boundary areas of E? In Nicholls (1998) this is solved by defining another process of generating points along the boundary. Further work is currently being pursued towards implementing extensions of this kind.

The choice of the pairwise difference prior in Section 4 is convenient for two reasons. First, it has only one parameter, as opposed to the corresponding proper multivariate normal prior used in Heikkinen and Arjas (1998), where three parameters are required. Secondly, the realisations can be re-scaled without changing the prior density value. This is useful in density estimation, and in models where identifiability constraints are needed (see Heikkinen and Arjas, 1999). For positive valued functions, the log-transform can naturally be applied to enable the use of a Gaussian prior.
In image classification, for example, the surface of interest is truly piecewise constant and its values are restricted to a (small) finite set of labels. In that case one could use a discrete valued MRF, such as the Potts model, to encourage segments of more complex shape than the convex Voronoi polygons.

Figure 2: Point pattern (top left) simulated using the intensity surface on top right, kernel estimate (bottom left), and posterior mean estimate by the proposed method (bottom right).

Figure 3: Marginal posterior densities of the intensity values at points s = (0.2, 0.7) (solid line), s = (0.4, 0.4) (dashed line), and s = (0.8, 0.8) (dotted line). The dots on the estimated density curves are horizontally located at the corresponding true intensity values.

To specify the conditional expectations (3) we have experimented with different weighted averages

    ∑_{j∼k} w_{kj} η_j / w_{k+},    where w_{k+} = ∑_{j∼k} w_{kj}.

One choice is to assign the weight as a function of the distance between the corresponding generating points (Heikkinen and Arjas, 1999). With surfaces there are other natural alternatives, such as the length of the common boundary segment (Heikkinen and Arjas, 1998). We have not systematically studied the effects of different weighting schemes, but it seems that they influence the adaptivity to rapid changes (Arjas and Heikkinen, 1997). It should be remembered that in Gaussian schemes the variances 1/(τ w_{k+}) of the full conditionals are also determined up to scale by the choice of the weights. In Heikkinen and Arjas (1999) we treat parameter τ as unknown. We believe, however, that it is generally not a good idea to let everything loose. It seems natural to use λ as a control variable to adjust the degree of smoothing, as its choice also has a considerable direct effect on the computational burden through the typical number of generating points.

The role of the normalising 'determinant' |W| is to ensure that the marginal prior of the generating points is indeed a Poisson distribution.
Using the wrong normalisation has no effect on the conditional prior of the function values, but it may change the prior distribution of the number and locations of generating points. Thus far we have not recognised any undesirable implications of using simple approximations. Intended future work includes a more systematic study of this issue.

The prototype sampler specified in the Appendix has been designed to be simple rather than efficient. Green (1995) suggests that it is a good idea to preserve the integral of θ in the dimension changing moves by appropriately modifying the function values in the neighbourhood of the added or deleted generating point. The performance of the current sampler can be controlled by tuning the sampler parameters. Parameters δ and C control the magnitude of proposed changes and thereby the proportions of accepted proposals (small changes are typically more often accepted). It has been suggested that proportions around 0.4, or even less, should indicate a reasonably well mixing sampler.

Acknowledgments

This chapter is mainly based on joint work with Elja Arjas. Fruitful discussions with Jesper Møller and Peter Green have had a considerable effect on the views expressed here. I am also grateful to the anonymous referee for valuable comments.

Appendix: Details of the posterior sampler

The reversible jump Markov chain Monte Carlo algorithm described here is nearly the simplest possible for sampling from the posterior distribution p(θ | x) derived from the prior p(θ) of Section 4 and an arbitrary likelihood p(x | θ). It is intended to serve as a prototype algorithm that contains the vital components and should do the job, but that can certainly be improved for better mixing and extended to cope with more complicated problems.
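The accept/reject rule at the heart of such a sampler is the ordinary Metropolis-Hastings step: accept the proposal with probability min(1, posterior ratio times proposal ratio). The generic step can be sketched as follows; this is an illustrative sketch only (Python is not used in the chapter, and the toy target and helper names are ours):

```python
import math
import random

def mh_step(theta, propose, log_post_ratio):
    """One Metropolis-Hastings update: draw a proposal and accept it
    with probability min(1, posterior ratio * proposal ratio);
    otherwise keep the current state.
    `propose(theta)` returns (theta', log of q(theta|theta')/q(theta'|theta));
    `log_post_ratio(new, old)` returns log p(new|x) - log p(old|x)."""
    prop, log_q_ratio = propose(theta)
    log_alpha = log_post_ratio(prop, theta) + log_q_ratio
    if random.random() < math.exp(min(0.0, log_alpha)):
        return prop
    return theta

# Toy usage: random-walk Metropolis targeting a standard normal.
random.seed(1)
propose = lambda t: (t + random.uniform(-1.0, 1.0), 0.0)  # symmetric proposal
log_ratio = lambda new, old: 0.5 * (old * old - new * new)
x, draws = 0.0, []
for _ in range(20000):
    x = mh_step(x, propose, log_ratio)
    draws.append(x)
```

In the sampler below the state is the marked point pattern, `propose` corresponds to one of the three move types, and for the dimension changing moves the proposal ratio additionally carries the factors q_m worked out later in this appendix.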
The procedure is started by choosing an arbitrary initial pattern ξ^(0) of at least two generating points located within E, with associated function values η^(0), to define an initial piecewise constant parameter function θ^(0). An iterative sequence θ^(1), θ^(2), … of functions is then produced by proposing changes to the current state and accepting or rejecting them according to a probabilistic rule. Denoting by θ′ the proposed function, acceptance in the tth iteration implies that θ^(t) is chosen to be θ′; in case of rejection the current state is retained: θ^(t) = θ^(t−1).

To specify the details of one iteration, let θ denote the current state, ξ = (ξ_1, …, ξ_K) its generating points, E = {E_1, …, E_K} their Voronoi tessellation and η = (η_1, …, η_K) the associated function values. The proposal is created by applying one of the following moves.

1. Change of one function value: Sample an index k from the uniform distribution on {1, …, K} and the new function value η′_k for the associated tile from the uniform distribution on [η_k − δ, η_k + δ], where δ > 0 is a parameter of the sampler. Let K′ = K, ξ′ = ξ and η′_{−k} = η_{−k}.

2. Birth of a new marked generating point: Sample a new location ξ′_{K′} from the uniform distribution on E and the associated function value η′_{K′} as described below. Let K′ = K + 1, ξ′ = (ξ′_1, …, ξ′_{K′}) with ξ′_{−K′} = ξ, and define η′ analogously.

3. Death of a generating point: Sample an index k from the uniform distribution on {1, …, K}. Re-index ξ and η so that the kth and Kth components are switched. Let K′ = K − 1, ξ′ = (ξ′_1, …, ξ′_{K′}) = ξ_{−K}, and define η′ analogously.

The actual proposal θ′ is the piecewise constant function defined by the pattern ξ′, its tessellation E′ = {E′_1, …, E′_{K′}}, and function values η′.

In each iteration one of the proposal types 1–3 is chosen randomly with probabilities h_K, b_K and d_K, respectively, depending on the current number of generating points. Following Green (1995) they are defined as

$$
b_K = \begin{cases} c & \text{if } K \le \lambda|E| - 1, \\[2pt] c\,\dfrac{\lambda|E|}{K+1} & \text{if } K > \lambda|E| - 1, \end{cases}
\qquad
d_K = \begin{cases} 0 & \text{if } K = 2, \\[2pt] c\,\dfrac{K}{\lambda|E|} & \text{if } 2 < K \le \lambda|E|, \\[2pt] c & \text{if } K > \lambda|E|, \end{cases}
$$

and h_K = 1 − b_K − d_K, where |E| is the area of E, and the sampler parameter c ∈ (0, 1/2) controls the rate at which dimension-changing moves are proposed.

In a type 2 move the function value proposed for the new tile is defined as η′_{K′} = η̃ + ε, where

$$
\tilde\eta = \sum_{k \in \partial_{K'}(\xi')} \frac{|E_k| - |E'_k|}{|E'_{K'}|}\,\eta_k
$$

is a weighted average of the current neighbouring values, and the perturbation ε ∈ ℝ is drawn from the density g(ε) = C e^{Cε} / (1 + e^{Cε})². Here C > 0 is yet another sampler parameter.

A type 1 move is a usual dimension-preserving Metropolis update, and accordingly it is accepted with probability

$$
\alpha(\theta' \mid \theta) = \min\left\{1,\ \frac{p(\theta' \mid x)}{p(\theta \mid x)}\right\}.
$$

After some rather straightforward algebra it is found that

$$
\frac{p(\theta' \mid x)}{p(\theta \mid x)} = \exp\left\{-\tau\,(\eta'_k - \eta_k)\left(N_k\,\frac{\eta'_k + \eta_k}{2} - \sum_{j \sim k} \eta_j\right)\right\} \frac{p(x \mid \theta')}{p(x \mid \theta)}.
$$

To work out the appropriate acceptance probabilities for proposal types 2 and 3, let us now forget what used to be move type 1, and re-label the dimension-changing moves by an index m = 2, 3, …. In a move of type m in the new labelling, either the current number of generating points is K = m and a birth of a new one is proposed, or K = m + 1 and a death is proposed. To ensure that this is a pair of reversible jumps, it was necessary in the death move to give positive proposal probabilities to the deletion of any generating point, and in the birth move to have the proposal density for the location of the new point positive all over E, as well as the proposal density for the new function value positive on the entire real axis (by defining g to be non-zero everywhere).
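As a concrete illustration of the move-selection scheme and the birth proposal above, here is a minimal Python sketch (the function names and the default c = 0.35 ∈ (0, 1/2) are my own choices; `stolen_areas` holds the quantities |E_k| − |E′_k| for the neighbours of the new tile):

```python
import math
import random

def move_probabilities(K, lam_area, c=0.35):
    """Probabilities (h_K, b_K, d_K) of the three move types for the
    current number K of generating points, following the Green (1995)
    scheme quoted in the text; lam_area stands for lambda * |E|."""
    bK = c if K <= lam_area - 1 else c * lam_area / (K + 1)
    if K == 2:
        dK = 0.0                      # never go below two generating points
    elif K <= lam_area:
        dK = c * K / lam_area
    else:
        dK = c
    return 1.0 - bK - dK, bK, dK

def propose_birth_value(neighbour_values, stolen_areas, new_tile_area, C=1.0):
    """Value for the new tile: the area-weighted average eta-tilde of the
    neighbouring values plus a logistic perturbation with density
    g(e) = C e^{Ce} / (1 + e^{Ce})^2, drawn by inverting its CDF."""
    eta_tilde = sum(w * v for w, v in zip(stolen_areas, neighbour_values)) / new_tile_area
    u = random.random()
    eps = math.log(u / (1.0 - u)) / C   # inverse of G(e) = e^{Ce}/(1 + e^{Ce})
    return eta_tilde + eps, eps
```

Sampling ε by inverting the distribution function G(ε) = e^{Cε}/(1 + e^{Cε}) needs no rejection step, and the weights |E_k| − |E′_k| sum to the new tile's area, so η̃ is a genuine average.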
To put it a little more formally, let P(dθ) be the posterior probability of dθ and Q_m(dθ′ | θ) the probability of proposing a move of type m and obtaining a proposal in dθ′, given the current state θ. Then a symmetric measure can be constructed such that the equilibrium joint proposal distribution R_m(dθ, dθ′) = P(dθ) Q_m(dθ′ | θ) has density r_m(θ, θ′) = p(θ | x) q_m(θ′ | θ) with respect to that measure. Here

$$
q_m(\theta' \mid \theta) = \begin{cases} b_K\, g(\eta'_{K'} - \tilde\eta)/|E| & \text{if } K = m \text{ and } K' = m + 1, \\ d_K / K & \text{if } K = m + 1 \text{ and } K' = m, \\ 0 & \text{otherwise}, \end{cases}
$$

and r_m(θ, θ′) is positive if and only if r_m(θ′, θ) is. Hence the dimension matching criterion of Green (1995) is satisfied. Accordingly, a birth proposal from θ with K = m to θ′ with K′ = m + 1 should be accepted with probability

$$
\alpha(\theta' \mid \theta) = \min\left\{1,\ \frac{p(\theta' \mid x)\, q_m(\theta \mid \theta')}{p(\theta \mid x)\, q_m(\theta' \mid \theta)}\right\},
$$

since the Jacobian |∂θ′/∂(θ, ε)| is always equal to 1. The posterior ratio turns out to be

$$
\frac{p(\theta' \mid x)}{p(\theta \mid x)} = \lambda \left(\frac{\tau\, |W(\xi')|}{2\pi\, |W(\xi)|}\right)^{1/2} \exp\left\{-\frac{\tau}{2}\left[\sum_{k \in \partial_{K'}(\xi')} (\eta'_{K'} - \eta_k)^2 - \frac{1}{2} \sum_{k \in \partial_{K'}(\xi')} \sum_{j \in \partial_k(\xi) \setminus \partial_k(\xi')} (\eta_k - \eta_j)^2\right]\right\} \frac{p(x \mid \theta')}{p(x \mid \theta)}.
$$

For the reverse proposal, the death of ξ_K with K = m + 1, the acceptance probability is obtained similarly, with

$$
\frac{p(\theta' \mid x)}{p(\theta \mid x)} = \lambda^{-1} \left(\frac{2\pi\, |W(\xi')|}{\tau\, |W(\xi)|}\right)^{1/2} \exp\left\{-\frac{\tau}{2}\left[\frac{1}{2} \sum_{k \in \partial_K(\xi)} \sum_{j \in \partial_k(\xi') \setminus \partial_k(\xi)} (\eta_k - \eta_j)^2 - \sum_{k \in \partial_K(\xi)} (\eta_K - \eta_k)^2\right]\right\} \frac{p(x \mid \theta')}{p(x \mid \theta)}.
$$

Except for the ratios |W(ξ′)| / |W(ξ)|, all computations required can be handled locally: only the mark of the generating point to be added or deleted, and those of its neighbours, are needed. For the ratio of the ‘determinants’, an approximation similar to that in Heikkinen and Arjas (1997a) is suggested. To be specific, consider a death move where ξ′ = ξ_{−K}. Let ξ̃ = (ξ̃_1, …, ξ̃_{N_K+1}) be the subpattern of ξ in which ξ̃_{N_K+1} = ξ_K and ξ̃_1, …, ξ̃_{N_K} are the neighbouring points ξ_k, k ∈ ∂_K(ξ), indexed arbitrarily.
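The birth acceptance step can then be sketched as follows, with the proposal-density ratio q_m(θ | θ′)/q_m(θ′ | θ) written out explicitly and the Jacobian equal to one. This is an illustration under my own naming: `log_post_ratio` stands for the logarithm of the model-specific posterior ratio above and is assumed to be supplied by the caller.

```python
import math
import random

def g_density(eps, C=1.0):
    """Perturbation density g(e) = C e^{Ce} / (1 + e^{Ce})^2."""
    t = math.exp(C * eps)
    return C * t / (1.0 + t) ** 2

def accept_birth(log_post_ratio, K, bK, dK_next, eps, area, C=1.0):
    """Accept/reject a birth move K -> K + 1.  With the proposal
    densities q_m of the text, the acceptance probability is
        min{1, post_ratio * [d_{K+1}/(K+1)] / [b_K g(eps)/|E|]},
    the Jacobian of (theta, eps) -> theta' being 1."""
    log_q_ratio = math.log(dK_next / (K + 1)) - math.log(bK * g_density(eps, C) / area)
    log_alpha = min(0.0, log_post_ratio + log_q_ratio)
    return math.log(random.random()) < log_alpha
```

Working on the log scale keeps the comparison numerically stable when the likelihood ratio is very large or very small; the death move is accepted with the reciprocal ratio in the same way.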
Then the approximation consists of replacing W(ξ) by the (N_K + 1) × (N_K + 1) matrix W̃ with elements W̃_kk = #{j ∈ ∂_K ∪ {K} : ξ_j ∼_ξ̃ ξ̃_k}, W̃_kj = 1 for ξ̃_j ∼_ξ̃ ξ̃_k, and W̃_kj = 0 otherwise. In other words, W̃ is built in the same way as the original W, only everything outside the union of E_K and its neighbours is neglected. Matrix W(ξ′) is replaced by an N_K × N_K matrix W̃′ built from the pattern ξ̃′ = ξ̃_{−(N_K+1)} in a similar manner, using the neighbourhood relation ∼_ξ̃′.

References

Arjas, E. (1996), Discussion of the paper by Hartigan, in J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds, ‘Bayesian Statistics 5’, Oxford University Press, pp. 221–222.

Arjas, E. and Andreev, A. (1996), A note on histogram approximation in Bayesian density estimation, in J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds, ‘Bayesian Statistics 5’, Oxford University Press, pp. 487–490.

Arjas, E. and Gasbarra, D. (1994), ‘Nonparametric Bayesian inference from right censored survival data, using the Gibbs sampler’, Statistica Sinica 4, 505–524.

Arjas, E. and Heikkinen, J. (1997), ‘An algorithm for nonparametric Bayesian estimation of a Poisson intensity’, Computational Statistics 12, 385–402.

Besag, J. (1974), ‘Spatial interaction and the statistical analysis of lattice systems (with discussion)’, Journal of the Royal Statistical Society, Series B 36, 192–236.

Besag, J. and Higdon, D. (1997), Bayesian inference for agricultural field experiments, Technical report, Department of Statistics, University of Washington.

Besag, J. and Kooperberg, C. (1995), ‘On conditional and intrinsic autoregressions’, Biometrika 82, 733–746.

Cressie, N. A. C. (1993), Statistics for Spatial Data, revised edn, Wiley, New York.

Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998), ‘Automatic Bayesian curve fitting’, Journal of the Royal Statistical Society, Series B 60, 333–350.

Geyer, C. J. and Møller, J. (1994), ‘Simulation procedures and likelihood inference for spatial point processes’, Scandinavian Journal of Statistics 21, 359–373.

Green, P. J. (1995), ‘Reversible jump Markov chain Monte Carlo computation and Bayesian model determination’, Biometrika 82, 711–732.

Hastings, W. K. (1970), ‘Monte Carlo sampling methods using Markov chains and their applications’, Biometrika 57, 97–109.

Heikkinen, J. and Arjas, E. (1998), ‘Nonparametric Bayesian estimation of a spatial Poisson intensity’, Scandinavian Journal of Statistics 25, 435–450.

Heikkinen, J. and Arjas, E. (1999), ‘Modeling a Poisson forest in variable elevations: a nonparametric Bayesian approach’, Biometrics 55, 738–745.

Heikkinen, J. and Penttinen, A. (1999), ‘Bayesian smoothing in the estimation of the pair potential function of Gibbs point processes’, Bernoulli 5, 1119–1136.

Karr, A. F. (1991), Point Processes and their Statistical Inference, 2nd edn, Dekker, New York.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953), ‘Equation of state calculations by fast computing machines’, The Journal of Chemical Physics 21, 1087–1092.

Møller, J. (1994), Lectures on Random Voronoi Tessellations, number 87 in ‘Lecture Notes in Statistics’, Springer-Verlag, New York.

Nicholls, G. (1998), ‘Bayesian image analysis with Markov chain Monte Carlo and colored continuum triangulation models’, Journal of the Royal Statistical Society, Series B 60, 643–659.

Okabe, A., Boots, B. and Sugihara, K. (1992), Spatial Tessellations. Concepts and Applications of Voronoi Diagrams, Wiley, Chichester.

Stoyan, D., Kendall, W. S. and Mecke, J. (1995), Stochastic Geometry and its Applications, 2nd edn, Wiley, New York.

Tierney, L. (1995), A note on Metropolis–Hastings kernels for general state spaces, Technical report, School of Statistics, University of Minnesota.