Recursive partitioning and Bayesian inference on
conditional distributions
Li Ma
June 13, 2012
Abstract
In this work we introduce a Bayesian framework for nonparametric inference on
conditional distributions in the form of a prior called the conditional optional Pólya
tree. The prior is constructed based on a two-stage nested procedure, which in the first
stage recursively partitions the predictor space, and then in the second generates the
conditional distribution on those predictor blocks using a further recursive partitioning
procedure on the response space. This design allows adaptive smoothing on both the
predictor space and the response space. We show that this prior enjoys desirable properties such as large support, posterior conjugacy and weak consistency. Moreover, the
prior can be marginalized analytically producing a closed-form marginal likelihood,
and based on the marginal likelihood, the corresponding posterior can also be computed analytically, allowing direct sampling without Markov Chain Monte Carlo. In
addition, we show that this prior can be considered a nonparametric extension of the
Bayesian classification and regression trees (CART), and therefore many of its theoretical results extend to the Bayesian CART as well. Our prior serves as a general
tool for nonparametric inference on conditional distributions. We illustrate its use in
density estimation, model selection, hypothesis testing, and regression through several
numerical examples.
1 Introduction
Wong and Ma [37] introduced the optional Pólya tree (OPT) distribution as a nonparametric prior for random density functions. The OPT prior extends the standard Pólya tree
distribution [9, 19], which, as a tail-free process, generates distributions through top-down
randomized probability assignment into a fixed recursive partition sequence of the sample
space. The proportion of probability mass that goes into each child set at each level of
partitioning is determined by drawing Beta (or Dirichlet) assignment variables. The OPT
prior introduces randomization into the recursive partitioning procedure by allowing optional
stopping and selective division.
The original purpose for randomizing the partitioning is to allow the prior to generate
absolutely continuous distributions without imposing extraneous conditions on the parameters of the Beta (or Dirichlet) assignment variables, which is required under the Pólya tree
prior [18]. But a surprising observation is that incorporating such randomness into the partitioning procedure does not impair three crucial properties enjoyed by the Pólya tree. The
first is the availability of a closed-form marginal likelihood, which allows problems such as
nonparametric hypothesis testing and regression to be carried out conveniently by marginalizing out the prior [17]. The second is the posterior conjugacy—after observing i.i.d. data,
the corresponding posterior of an OPT is still an OPT, that is, the same random partitioning
and probability assignment procedure with its parameters updated to their posterior values.
The third, which we call “posterior exactness”, is that the parameter values of the posterior
OPT can be computed exactly. (In this case through a sequence of recursive computations.)
These three properties, together with the constructive nature of the OPT, allow one to compute summary statistics such as the posterior mean and mode analytically as well as to
draw samples from this posterior directly, without resorting to Markov Chain Monte Carlo
(MCMC) procedures. This is extremely valuable as the convergence behavior of MCMC
sampling algorithms cannot be guaranteed and is often very difficult to judge for complex
models involving a large (or even infinite) number of parameters.
In addition, randomized partitioning also enables the OPT prior to be used for inferring good divisions of the sample space that best reflect the structure of the underlying
distribution. A posteriori the sample space will be more finely divided where the underlying
distribution has richer structure, i.e. less uniform in shape. This is essentially a “model
selection” feature, particularly important in higher dimensional settings where an even division across the entire sample space is undesirable both computationally and statistically due
to the “curse of dimensionality”. Moreover, the randomized recursive-tree representation
of the posterior opens new doors to efficient computation in high dimensional problems—
algorithms such as k-step look-ahead that are commonly used for approximating recursions
on trees can be applied when exact recursion becomes computationally prohibitive [21].
In this work, we extend the randomized recursive partitioning construction of priors to
a broader class of inferential problems involving conditional distributions. Not only do we
use randomized partitioning to characterize how probability mass is spread over the sample
space as in the OPT, but we use it to specify the relationship between a response vector and
a set of predictor variables. We introduce a new prior, called the conditional optional Pólya
tree, based on a two-stage nested recursive partition and assignment procedure that first
randomly divides the predictor space, and then for each predictor partition block, randomly
divides the response space and generates the conditional distribution using a local OPT
distribution.
Our prior on conditional densities enjoys all of the desirable properties of the OPT—
closed-from marginal likelihood, posterior conjugacy, exactness and consistency. Moreover,
the corresponding posterior, which is the same two-stage nested procedure with updated
parameter values, allows inference on the structure of the underlying conditional distributions
at two levels. First, the posterior divides the predictor space more finely in parts where
the conditional distribution changes most abruptly, shedding light on how the response
distribution depends on the predictors, thereby providing a means for model selection on the
response-predictor relationship. Second, the response space is divided adaptively for varying
predictor values, effectively capturing the local shapes of the conditional density.
It is worth noting that recursive partitioning has been fruitfully applied in earlier works to
capture response-predictor relationships. One of the most notable works is the classification
and regression trees (CART) [2]. In fact, Bayesian versions of CART [3, 5] also utilize
randomized partitioning to achieve model selection. In particular, our two-stage prior bears
much similarity to the Bayesian CART model/prior introduced in [3], which is based on
a constructive procedure that randomly divides the predictor space sequentially and then
generates the conditional distribution from base models within the blocks. In this sense,
our prior can be considered a nonparametric extension of Bayesian CART. Consequently,
several desirable properties that we establish for our prior in fact also hold for the Bayesian
CART, most importantly posterior conjugacy and exactness. This implies that inference
using Bayesian CART can often also be carried out in an exact manner by computing the
posterior in closed-form. Until now, inference using Bayesian CART models has relied
on MCMC algorithms. As noted in [3], among many others, however, the convergence and
mixing of MCMC chains on tree spaces are often poor and hard to evaluate. This makes our
alternative approach to exploring posterior tree spaces particularly valuable.
We end this introduction by placing this current work in the larger context of existing
methods for nonparametric inference on conditional distributions. This topic has been extensively studied from both frequentist and Bayesian perspectives. Many frequentist works are
based on kernel estimation methods [7, 13, 8] and the selection of bandwidth often involves
resampling procedures such as cross-validation [1, 14, 8] and the bootstrap [13]. On the other
hand, existing Bayesian nonparametric methods on modeling covariate dependent random
measures mostly fall into two categories. The first is methods that construct priors for the
joint distribution of the response and the covariates, and then use the induced conditional
distribution for inference. Some examples are [25, 26, 32], which proposed using mixtures
of multivariate normals as the model for joint distributions, along with different priors for
the mixing distribution. The other category is methods that construct conditional distributions directly without specifying the marginal distribution of the predictors. Many of these
methods are based on extending the stick breaking construction for the Dirichlet Process
(DP) [31]. Some notable examples, among others, are proposed in [24, 15, 10, 12, 6, 4, 29].
Three recent works in this category do not utilize the stick breaking procedure. In [34],
the authors propose to use the logistic Gaussian process [20, 33] together with subspace
projection to construct smooth conditional distributions. In [16], the authors incorporate
dependence structure into tail-free processes by generating the conditional tail probabilities
from covariate-dependent logistic Gaussian processes, and propose a mixture of such processes as a way for modeling conditional distributions. Finally, in [35] the authors introduce
the covariate-dependent multivariate Beta process, and use it to generate the conditional
tail probabilities of Pólya trees. Inference using these Bayesian nonparametric priors on
conditional distributions generally relies on MCMC sampling.
Compared to these existing frequentist and Bayesian methods, our approach is unique in
several ways. First, in comparison to the existing Bayesian methods, the conditional optional
Pólya tree is not based on mixtures of Gaussian or other simple distributions. (However,
one may still consider our method a mixture modeling approach, with a model that is a
mixture of “itself” [22].) Neither does it use stick breaking or logistic Gaussian processes
to construct random probability distributions. Second, compared to the existing frequentist
methods, the corresponding posterior partitions the predictor and response spaces according
to the underlying data structure, which provides a principled means for adaptive bandwidth
selection—different parts of the spaces will have different resolution of inference—without the
need for any extra resampling steps. Finally, as mentioned earlier, posterior inference using
our method can be carried out exactly in many problems without the need for MCMC, and
the recursive tree representation of the posterior affords new avenues for efficient computation.
The rest of the paper is organized as follows. In Section 2 we introduce our two-stage prior
and establish its various theoretical properties—namely large support, posterior conjugacy
and posterior exactness. In addition, we make the connection to Bayesian CART and show
that our method is a nonparametric version of the latter. In Section 3 we provide a recipe for
carrying out Bayesian inference using our prior and establish its (weak) posterior consistency.
In Section 4 we discuss practical computational issues in implementing the inference. In
Section 5 we provide five examples to illustrate the work of our method. The first two are
for estimating conditional densities, the third illustrates model selection, the fourth concerns
hypothesis testing of independence, and the last is for semiparametric median regression
with heteroscedastic error. Section 6 concludes with some discussions.
2 Conditional optional Pólya trees
In this section we introduce our proposed prior, which is a constructive two-stage procedure
that randomly generates conditional densities. First we introduce some notations that will
be used throughout this work. Let each observation be a predictor-response pair (X, Y ),
where X denotes the predictor (or covariate) vector and Y the response vector with ΩX
being the predictor space and ΩY the response space. In this work we consider sample spaces
that are either finite spaces, compact Euclidean rectangles, or a product of the two, and ΩX
and ΩY do not have to be of the same type. (See Example 3.) Let µX and µY be the
“natural” measures on ΩX and ΩY . (That is, the counting measure for finite spaces, the
Lebesgue measure for Euclidean rectangles, and the corresponding product measure if the
space is a product of the two.) Let µ = µX × µY be the “natural” product measure on the
joint sample space ΩX × ΩY .
A notion that will be useful in introducing our prior is that of a partition rule [23, Sec. 2].
A partition rule R on a sample space Ω specifies a collection of possible ways to divide any
subset A of Ω into a finite number of smaller sets. For example, for Ω = [0, 1]^k, the unit
rectangle in R^k, the coordinate-wise dyadic split rule allows each rectangular subset A of Ω
whose sides are parallel to the k coordinates to be divided into two halves at the middle
of the range of each coordinate, while other subsets cannot be divided under this rule. A
similar rule can be adopted when Ω is a 2^k contingency table. For simplicity, in this work we
only consider partition rules that allow a finite number of ways for dividing each set. Such
partition rules are said to be finite. (Interested readers can refer to [23, Sec. 2] for a more
detailed treatment of partition rules and to Examples 1 and 2 in [37] for examples of the
coordinate-wise dyadic split rule over Euclidean rectangles and 2^k contingency tables.)
We are now ready to introduce our prior for conditional distributions as a two-stage constructive procedure. In the first stage, the predictor space ΩX is randomly partitioned into
blocks, in the second stage, the conditional distribution of the response given the predictor
values is generated for each of the predictor blocks. This design bears much similarity to
that of the Bayesian CART model proposed in [3]. We will see that our two-stage prior can
be considered a nonparametric generalization of the Bayesian CART. Next we describe each
of the two stages in detail.
Stage I. Covariate partition: Let RX be the partition rule under consideration for the
predictor space ΩX . We randomly partition ΩX according to this rule in the following recursive manner. Starting from A = ΩX , draw a Bernoulli variable with success probability ρ(A)
S(A) ∼ Bernoulli(ρ(A)).
If S(A) = 1, then the partitioning procedure on A terminates and we arrive at a trivial
partition of a single block over A. (Thus S(A) is called the stopping variable, and ρ(A) the
stopping probability.) If instead S(A) = 0, then we randomly select one out of the possible
ways for dividing A under RX and partition A accordingly. More specifically, if there are
N(A) ways to divide A under RX , we randomly draw J(A) from {1, 2, . . . , N(A)} such that
P(J(A) = j) = λj (A) for j = 1, 2, . . . , N(A) with λ1 (A) + λ2 (A) + · · · + λN(A) (A) = 1, and partition A in the
jth way if J(A) = j. (We call λ(A) = (λ1 (A), λ2 (A), . . . , λN(A) (A)) the partition selection
probabilities for A.) Let K^j(A) be the number of child sets that arise from this partition, and
let A^j_1, A^j_2, . . . , A^j_{K^j(A)} denote these children. We then repeat the same partition procedure,
starting from the drawing of a stopping variable, on each of these children.
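For illustration, here is a minimal sketch of the Stage I procedure under the coordinate-wise dyadic split rule on a Euclidean rectangle, assuming self-similar parameters ρ(A) ≡ ρ and λj(A) = 1/N(A), with a depth cap standing in for the terminal conditions discussed in Section 4 (the function name and interface are illustrative only):

```python
import random

def sample_stage1_partition(box, rho=0.5, max_depth=8):
    """Sample one Stage I partition of the predictor space.

    box: list of (lo, hi) intervals, one per coordinate, e.g. [(0, 1), (0, 1)].
    Returns the list of stopped blocks, each again a list of intervals.
    """
    # S(A) ~ Bernoulli(rho): stop on A (also forced once max_depth is exhausted)
    if max_depth == 0 or random.random() < rho:
        return [box]
    # J(A): choose one of the N(A) dyadic splits uniformly (here, one per coordinate)
    j = random.randrange(len(box))
    lo, hi = box[j]
    mid = (lo + hi) / 2.0
    left = box[:j] + [(lo, mid)] + box[j + 1:]
    right = box[:j] + [(mid, hi)] + box[j + 1:]
    # recurse on the two children of A arising from the chosen split
    return (sample_stage1_partition(left, rho, max_depth - 1)
            + sample_stage1_partition(right, rho, max_depth - 1))

blocks = sample_stage1_partition([(0.0, 1.0), (0.0, 1.0)])
```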
The following lemma shows that as long as the stopping probabilities are (uniformly)
away from 0, this random recursive partitioning procedure will eventually terminate almost
everywhere and produce a well-defined partition of ΩX . (All proofs in this paper are provided
in the Supplementary Materials.)
Lemma 1. If there exists a δ > 0 such that the stopping probability ρ(A) > δ for all A ⊂ ΩX
that could arise after a finite number of levels of recursive partition, then with probability 1
the recursive partition procedure on ΩX will stop µX a.e.
Stage II. Generating conditional densities: Next we move onto the second stage of
the procedure to generate the conditional density of the response Y on each of the partition
blocks generated in Stage I. Specifically, for each stopped subset A on ΩX produced in
Stage I, we let the conditional distribution of Y given X = x be the same across all x ∈ A,
and generate this (conditional) distribution on ΩY , denoted q_Y^{0,A}, from a "local" prior.
When the response space ΩY is finite, q_Y^{0,A} is a multinomial distribution, and so a simple
and natural choice of such a local prior is the Dirichlet prior:

q_Y^{0,A} ∼ Dirichlet(α_Y^A)     (2.1)

where α_Y^A represents the pseudo-count hyperparameters of the Dirichlet. In this case, we note
that the two-stage prior essentially reduces to the Bayesian CART proposed by Chipman et
al in [3] for the classification problem.
When ΩY is infinite (or finite but with a large number of elements), one may restrict
q_Y^{0,A} to assume a certain parametric form F(Θ) indexed by some parameters Θ, and adopt
corresponding priors for Θ. For example, when ΩY = R, one may require q_Y^{0,A} to be normal
with some mean µA and variance σ_A^2, and let

µA | σ_A^2 ∼ N(µ0 , σ^2)  and  σ_A^2 ∼ inverse-Gamma(ν/2, νκ/2).     (2.2)

In this case, the two-stage prior again reduces to the Bayesian CART, this time for the
regression problem [3]. Other choices of such parametric models include local regression
models, which, in addition to the Gaussian assumption, impose a linear relationship between
the mean of the response and the predictor values. This will allow the two-stage process to
randomly generate piecewise linear regression models.
The focus of our current work, however, is on the case when no parametric assumptions
are placed on the conditional distribution. To this end, one can draw q_Y^{0,A} from a nonparametric prior. A desirable choice for the local prior, for reasons that we will see in Theorem 3
and the next section, is the optional Pólya tree (OPT) distribution [23, Sec. 2]:

q_Y^{0,A} ∼ OPT(R_Y^A; ρ_Y^A, λ_Y^A, α_Y^A)

where R_Y^A denotes a partition rule on ΩY and ρ_Y^A, λ_Y^A, and α_Y^A are the hyperparameters.
Note that in general we allow the partition rule for these "local" OPTs to depend on A as
indicated in the superscript, but adopting a common partition rule on ΩY —that is, to let
R_Y^A ≡ R_Y for all A—will suffice for most problems. In the rest of the paper, unless stated
otherwise we assume that a common rule R_Y is adopted. This completes the description of
our two-stage procedure. We are now ready to present a formal definition of the prior.
Definition 1. A conditional density that arises from the above two-stage procedure with
the OPT being the local prior for q_Y^{0,A} is said to have a conditional optional Pólya tree (cond-OPT) distribution with partition rule RX , stopping rule ρ, partition selection probabilities
λ, and local priors {OPT(R_Y^A; ρ_Y^A, λ_Y^A, α_Y^A) : for all A ⊂ ΩX that could arise in Stage I}.
Remark: To ensure that this definition is meaningful, one must check that the two-stage
procedure will in fact generate a well-defined conditional distribution with probability 1.
To see this, first note that because the collection of all potential sets A on ΩX that can
arise during Stage I is countable, by Theorem 1 in [37], with probability 1, the two-stage
procedure will generate the conditional density f (y|x) for all x lying in the stopped part of
ΩX , provided that ρ_Y^A is uniformly away from 0. The two-stage generation procedure for the
conditional density of Y can then be completed by letting f (y|x) be the uniform density on
ΩY for the µX -null subset of ΩX on which the recursive partition in Stage I never stops.
We have emphasized that the cond-OPT prior imposes no parametric assumptions on
the conditional distribution. One may wonder just what types of conditional distributions
can be generated from this prior. More specifically, is this prior truly “nonparametric”
in the sense that it covers all possible conditional densities? Our next theorem gives a
positive answer—under mild conditions on the parameters, the cond-OPT will place positive
probability in arbitrarily small L1 neighborhoods of any conditional density. (A definition of
an L1 neighborhood for conditional densities is also implied in the statement of the theorem.)
Theorem 2 (Large support). Suppose q(·|·) is a conditional density function that arises
from a cond-OPT prior whose parameters ρ(A) and λ(A) for all A that could arise during
the recursive partitioning on ΩX are uniformly away from 0 and 1, and also the parameters
of the local OPT distributions on ΩY all satisfy the conditions specified in Theorem 2 of [37].
Moreover, suppose that the underlying partition rules RX and RY both satisfy the following
"fine partition criterion": for all ε > 0, there exists a partition of the corresponding sample space
such that the diameter of each partition block is less than ε. Then for any conditional density
function f (·|·) : ΩY × ΩX → [0, ∞), and any τ > 0,

P( ∫ |q(y|x) − f (y|x)| µ(dx × dy) < τ ) > 0.

Furthermore, let fX (x) be any density function on ΩX w.r.t. µX . Then we have, for all τ > 0,

P( ∫ |q(y|x) − f (y|x)| fX (x) µ(dx × dy) < τ ) > 0.
3 Bayesian inference with cond-OPT
In the previous section we have presented the construction of the cond-OPT prior and
showed that it is fully supported on all conditional densities. Next we investigate how
Bayesian inference on conditional densities can be carried out using this prior. First, we note
that Chipman et al [3] and Denison et al [5] each proposed MCMC algorithms that enable
posterior inference for Bayesian CART. These sampling and stochastic search algorithms can
be applied directly here as the local OPT priors can be marginalized out and so the marginal
likelihood under each partition tree that arises in Stage I of the cond-OPT is available in
closed-form [37, 23]. However, as noted in [3] along with others, due to the multi-modal
nature of tree structured models, the mixing behavior of the MCMC algorithms is often
undesirable. This problem is exacerbated in higher dimensional settings. Chipman et al [3]
suggested using the MCMC algorithm as a tool for searching for good models rather than a
reliable way of sampling from the actual posterior.
The main result of this section is that Bayesian inference under a cond-OPT prior (and
in fact for some specifications of Bayesian CART as well) can be carried out in an exact
manner, in that the corresponding posterior distribution can be computed in closed form
and directly sampled from, without resorting to MCMC algorithms.
First let us investigate what the posterior of a cond-OPT prior is. Suppose we have
observed (x, y) = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} where given the xi ’s, the yi ’s are independent with some density q(y|x). We assume that q(·|·) has a cond-OPT prior denoted by π.
Further, for any A ⊂ ΩX we let
x(A) := {x1 , x2 , . . . , xn } ∩ A and y(A) := {yi : xi ∈ A, i = 1, 2, . . . , n},
and let n(A) denote the number of observations with predictors lying in A, that is n(A) =
|x(A)| = |y(A)|.
For A ⊂ ΩX , we use q(A) to denote the (conditional) likelihood under q(·|·) from the
data with predictors x ∈ A. That is
q(A) := Π_{i: xi ∈ A} q(yi | xi ).
Then by the two-stage construction of the cond-OPT prior, given that A is a set that arises
during the recursive partition procedure on ΩX , we can write q(A) recursively in terms of
S(A), J(A), and q_Y^{0,A} as follows:

q(A) = q^0(A)                              if S(A) = 1,
q(A) = Π_{i=1}^{K^j(A)} q(A^j_i )          if S(A) = 0 and J(A) = j,

where

q^0(A) := Π_{i: xi ∈ A} q_Y^{0,A}(yi ),
the likelihood from the data with x ∈ A if the recursive partitioning procedure stops on A.
Equivalently, we can write

q(A) = S(A) q^0(A) + (1 − S(A)) Π_{i=1}^{K^{J(A)}(A)} q(A_i^{J(A)}).     (3.1)

Integrating out the randomness over both sides of Eq. (3.1), we get

Φ(A) = ρ(A) M(A) + (1 − ρ(A)) Σ_{j=1}^{N(A)} λj (A) Π_i Φ(A^j_i ),     (3.2)

where

Φ(A) := ∫ q(A) π(dq | A arises during the recursive partitioning)

is defined to be the marginal likelihood from data with x ∈ A given that A arises during the
recursive partitioning on ΩX , whereas

M(A) := ∫ q^0(A) π(dq_Y^{0,A})     (3.3)

is the marginal likelihood from the data with x ∈ A if the recursive partitioning procedure
stops on A and the integration is taken over the local prior. In particular, for the cond-OPT
prior, M(A) is the marginal likelihood under OPT(R_Y^A; ρ_Y^A, λ_Y^A, α_Y^A) from the data y(A).
We note that Eqs. (3.1), (3.2) and (3.3) hold for Bayesian CART as well, with M(A) being
the corresponding marginal likelihood of the local normal model or the multinomial model
under the corresponding priors such as those given in (2.1) and (2.2).
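For instance, under the Dirichlet-multinomial local model (2.1), M(A) takes the standard closed form Γ(Σk αk)/Γ(Σk αk + n(A)) · Πk Γ(αk + nk)/Γ(αk), where nk is the number of responses in y(A) falling in category k. A minimal sketch of this computation (the helper name and interface are illustrative only):

```python
from collections import Counter
from math import lgamma

def log_marginal_dirichlet_multinomial(y_in_A, alpha):
    """log M(A) for the Dirichlet-multinomial local model in (2.1).

    y_in_A: the responses y(A) whose predictors fall in block A (categorical values).
    alpha : dict mapping each category to its Dirichlet pseudo-count.
    """
    counts = Counter(y_in_A)
    n = sum(counts.values())
    a0 = sum(alpha.values())
    log_m = lgamma(a0) - lgamma(a0 + n)
    for k, a_k in alpha.items():
        log_m += lgamma(a_k + counts.get(k, 0)) - lgamma(a_k)
    return log_m
```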
Eq. (3.2) provides a recursive recipe for calculating Φ(A) for all A. (Of course, to carry
out the calculation the recursion must eventually terminate. We defer the discussions on
terminal conditions for this recursion to the next section.) But why do we want to compute
these Φ(A) terms? It turns out that the posterior of a cond-OPT prior can be expressed in
closed form in terms of these Φ(A) terms. More specifically, the next theorem shows that the
cond-OPT is conjugate—if a conditional distribution has a cond-OPT prior then after data
are observed it will have a cond-OPT posterior, whose parameter values can be expressed
explicitly in terms of the Φ(A)’s and the prior parameters.
Theorem 3 (Conjugacy). After observing {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} where given the
xi ’s, the yi ’s are independent with density q(y|x), which has a cond-OPT prior, the posterior
of q(·|·) is again a cond-OPT (with the same partition rules on ΩX and ΩY ). Moreover, for
each A ⊂ ΩX that could arise during the recursive partitioning, the posterior parameters are
given as follows.
1. Stopping probability:

   ρ(A|x, y) = ρ(A)M(A)/Φ(A).

2. Selection probabilities:

   λj (A|x, y) = λj (A) (1 − ρ(A)) Π_{i=1}^{K^j(A)} Φ(A^j_i ) / ( Φ(A) − ρ(A)M(A) ).

3. The local posterior for the conditional distribution of Y given X on a newly stopped
   set A:

   q_Y^{0,A} | x, y ∼ OPT(R_Y^A; ρ̃_Y^A, λ̃_Y^A, α̃_Y^A)

   where ρ̃_Y^A, λ̃_Y^A, and α̃_Y^A represent the corresponding posterior parameters for the local
   OPT after updating using the observed values for the response y(A).
This theorem shows that a posteriori our knowledge about the underlying conditional
distribution of Y given X can again be represented by the same two-stage procedure that
randomly partitions the predictor space and then generates the response distribution accordingly on each of the predictor blocks, except that now the parameters that characterize this
two-stage procedure have been updated to reflect the information contained in the data.
Moreover the theorem also provides a recipe for computing these posterior parameters based
on the Φ(A) terms. Given this exact posterior, Bayesian inference can then proceed—
samples can be drawn from the posterior cond-OPT directly and summary statistics can be
calculated—without resorting to any MCMC procedure. Note that as long as we can compute the “local” marginal likelihood M (A) terms, we can use the recursion formula Eq. (3.2)
to calculate all of the Φ(A) terms, and the theorem follows accordingly. Therefore, this way
of carrying out exact inference is applicable to Bayesian CART as well so long as the local
prior for q_Y^{0,A} is so chosen that M(A) can be evaluated. For example, the commonly adopted
Dirichlet-multinomial and normal-inverse-gamma conjugate models as we see in (2.1) and
(2.2) satisfy this condition.
In the next section, we provide more details on how to implement such inference in
practice, and discuss computational strategies that can substantially improve the scalability
of the method. Before we move on however, we present our last theoretical result about the
cond-OPT prior—its (weak) posterior consistency, which assures the statistician that the
posterior cond-OPT distribution will “converge” in some sense to the truth as the amount
of data increases. In order to present this result, we need a notion of neighborhoods for
conditional densities. We adopt the notion discussed in [28] and [27], by which a (weak)
neighborhood of a conditional density function is defined in terms of a (weak) neighborhood
of the corresponding joint density. More specifically, for a conditional density function
f0 (·|·) : ΩY × ΩX → [0, ∞), weak neighborhoods with respect to a marginal density f_X^0 (·)
on ΩX are collections of conditional densities of the form

U = { f (·|·) : | ∫ gi f (·|·) f_X^0 dµ − ∫ gi f0 (·|·) f_X^0 dµ | < ε_i , i = 1, 2, . . . , l }
where the gi ’s are bounded continuous functions on ΩX × ΩY .
Theorem 4 (Weak consistency). Let (x1 , y1 ), (x2 , y2 ), . . . be independent identically distributed vectors from a probability distribution F on ΩX × ΩY , with density dF/dµ =
f (x, y) = f (y|x)fX (x). Suppose the conditional density f (·|·) is generated from a cond-OPT prior for which the conditions given in Theorem 2 all hold. In addition, assume that
the conditional density function f (·|·) and the joint density f (·, ·) are bounded. Then for any
weak neighborhood of f (·|·) w.r.t fX , U , we have
π(U |(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )) −→ 1
with F^∞ probability 1, where π(·|(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )) denotes the posterior conditional optional Pólya tree distribution for f (·|·).
4 Practical implementation
In the previous sections, we have laid the theoretical base for Bayesian inference using the
cond-OPT prior. In particular, Theorem 3 gives a recipe for finding the exact posterior
distribution. Once the posterior is found, Bayesian inference can proceed in the usual manner
through calculating summary statistics of the posterior and/or sampling from the posterior.
In this section we address some practical issues in computing the posterior and implementing
the inference. For simplicity, from now on we shall refer to a set A ⊂ ΩX that can arise
during the (Stage I) recursive partitioning procedure as a “node”.
4.1 Terminal conditions
A prerequisite for applying Theorem 3 is the availability of the Φ(A) terms, which can be
determined recursively through Eq. (3.2). Of course, to carry out the computation of Φ(A)
one must specify terminal conditions on Eq. (3.2). More specifically, on what kind of A’s
shall we stop the recursive computation? We shall call such sets terminal nodes.
There are three kinds of nodes for which the value of Φ(A) is available directly according
to theory, and thus recursion can terminate on them. They are (1) nodes that contain no
data point, (2) nodes that cannot be further divided, and (3) nodes that contain one data
point. For a node A with no data point, Φ(A) = 1. For a node A that cannot be further
divided, we must have ρ(A) = 1 and so Φ(A) = M (A). For a node A with exactly one data
point, one can show that when the prior parameters are “symmetric” and “self-similar”,
Φ(A) = M (A). Here symmetry and self-similarity for the parameters mean that for all A we
have (i) ρ(A) ≡ ρ, (ii) λj (A) = 1/N(A), and (iii) ρ_Y^A ≡ ρ_Y , λ_Y^A ≡ λ_Y , and α_Y^A ≡ α_Y . We
note that when useful prior knowledge about the structure of the underlying distribution is
not available or when one is unwilling to assume particular structure over the distribution,
it is desirable to specify the prior parameters in a symmetric and self-similar way. We will
adopt such a specification in all of our numerical examples. (The derivations of these terminal
conditions are very similar to those given in Examples 3 and 4 in [37] and Examples 1 and
2 in [23]. Thus we do not repeat the details here.)
Note that with these three types of “theoretical” terminal nodes, in principle the recursion
will eventually terminate if one divides the predictor space deep enough. In practice, however,
it is unnecessary—and infeasible in high dimensional problems—to take the recursion all the
way to the “bottom”. Instead, one can adopt early termination by imposing a technical
limit, such as a minimum size of the nodes, to compute the recursion over. Here the “size”
of a node can be the measure of the node µ(A), which is typically inversely related to the
number of levels down the recursion, or it can be defined in terms of the number of data
points within the set, e.g. 1% of the data points. Nodes that are smaller than the chosen size
threshold are forced to stop, which is equivalent to setting ρ(A) = 1 and thus Φ(A) = M (A)
for these nodes.
With the terminal nodes, either “theoretical” or “technical”, specified as above, one can
then compute the Φ(A) terms through recursion formula (3.2), and compute the posterior
according to Theorem 3. Putting all the pieces together, we can summarize the procedure
to carry out Bayesian inference with the cond-OPT prior as a four-step recipe:
I. For each terminal node A, compute M (A) and Φ(A).
II. For each non-terminal node A (those that are ancestors of the terminal nodes), compute
M (A) and use Eq. (3.2) to recursively compute Φ(A).
III. Given the values of M (A) and Φ(A), apply Theorem 3 to get the parameter values of
the posterior cond-OPT distribution.
IV. Sample from the posterior by direct simulation of the random two-stage procedure,
and/or compute summary statistics of the posterior such as the mean, the mode, and
the Bayes Factor.
For the last step, direct simulation from the posterior is straightforward, but we have not
discussed how to compute summary statistics. This is problem-specific and will be illustrated
in our numerical examples in Section 5.
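The following is a minimal sketch of steps I–III, assuming symmetric, self-similar parameters (ρ(A) ≡ ρ, λj(A) = 1/N(A)) and a minimum node size as the technical terminal condition; the split_fn and log_m_fn interfaces are illustrative placeholders for a partition rule on ΩX and a local marginal likelihood such as the OPT or Dirichlet-multinomial model:

```python
import math

def log_phi(A, data, rho, split_fn, log_m_fn, min_points=2):
    """log Φ(A) via Eq. (3.2), assuming ρ(A) ≡ rho and λ_j(A) = 1/N(A).

    A        : an opaque representation of a predictor block.
    data     : the observations (x_i, y_i) with x_i in A.
    split_fn : split_fn(A, data) -> list of candidate splits; each split is a list
               of (child_block, child_data) pairs; an empty list means A is indivisible.
    log_m_fn : log_m_fn(data) -> log M(A), the local marginal likelihood.
    """
    if len(data) == 0:
        return 0.0                              # empty node: Φ(A) = 1
    splits = split_fn(A, data)
    if not splits or len(data) <= min_points:
        return log_m_fn(data)                   # terminal node: Φ(A) = M(A)
    terms = [math.log(rho) + log_m_fn(data)]    # stopping term ρ M(A)
    log_lambda = -math.log(len(splits))         # λ_j(A) = 1/N(A)
    for split in splits:                        # one divide term per candidate split
        term = math.log(1.0 - rho) + log_lambda
        for child, child_data in split:
            term += log_phi(child, child_data, rho, split_fn, log_m_fn, min_points)
        terms.append(term)
    m = max(terms)                              # log-sum-exp over stop + divide terms
    return m + math.log(sum(math.exp(t - m) for t in terms))

def posterior_stop_prob(A, data, rho, split_fn, log_m_fn, min_points=2):
    """ρ(A | x, y) = ρ(A) M(A) / Φ(A) for a non-terminal node (Theorem 3, item 1)."""
    return math.exp(math.log(rho) + log_m_fn(data)
                    - log_phi(A, data, rho, split_fn, log_m_fn, min_points))
```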
4.2 Computational strategies for high-dimensional problems
An essential step in finding the posterior cond-OPT is the recursive computation of the Φ(A)
terms. Recursion on tree structures is exponential in nature, and therefore even with the
most efficient implementation, brute-force computation can quickly become prohibitive both
in computing time and in memory when the sample space is more than just a few dimensions.
For example, if one sets the technical terminal nodes to be those lying 10 levels down the
tree, then one cannot afford to compute the posterior for problems involving more than 5 to
6 continuous predictors or about 15 to 20 binary ones on a desktop computer with 16 GB
of memory. This is not satisfactory for modern high-dimensional situations where data sets
involving hundreds to thousands of predictors are common.
Fortunately, the tree-structured construction of the cond-OPT opens new doors to approximating the posterior in high dimensional settings. In particular, if we lower the number
of levels the recursions are computed over from 10 to, say, 3, then the number of predictors that can be handled increases remarkably to the order of hundreds or even thousands
with parallelization. Consequently, strategies such as k-step look-ahead that are commonly
adopted for approximating recursive computations on trees can be adopted to substantially
improve the scalability of the method. In this proof-of-principle work we do not delve into
much detail about such algorithms. Interested readers may refer to [21, Sec. 3] for two
such algorithms that can directly be applied to the inference under cond-OPT. For example,
adopting a 3-step look-ahead strategy we are able to scale our method up to problems involving about 100 predictors on a desktop using about 10 GB of memory and a single core.
Much of our current effort is in massive parallelization of such algorithms to further improve
the scalability.
5 Examples
In this section we provide five examples to illustrate inference using the cond-OPT prior. The
first two illustrate the estimation of conditional distributions, the third is on model selection,
the fourth is for hypothesis testing (in particular testing independence), and the last is for
linear median regression with heterogeneous non-Gaussian error. In these examples, the
partition rule used on both ΩX and ΩY is always the coordinate-wise dyadic split rule. That
is, during the recursive partitioning procedure, each set A is allowed to be partitioned into
two equal pieces at the mid-point of each of its divisible dimensions. We adopt the same prior
specification across all the examples: the prior stopping probability on each non-terminal
node is always set to 0.5, the prior partition selection probability is always evenly spread
over the possible ways to partition each set, and the probability assignment pseudo-counts
for the local OPTs are all set to 0.5.
Example 1 (Estimating conditional density with clear covariate boundaries). In this example we simulate (X, Y ) pairs according to the following distributions.
X ∼ Beta(2, 2)
Y |X < 0.25 ∼ Beta(30, 20)
Y |0.25 ≤ X ≤ 0.5 ∼ Beta(10, 30)
Y |X > 0.5 ∼ Beta(0.5, 0.5).
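This data-generating mechanism can be reproduced with a few lines (a sketch using numpy; the function name is illustrative):

```python
import numpy as np

def simulate_example1(n, seed=0):
    """Simulate n (X, Y) pairs from the mechanism above."""
    rng = np.random.default_rng(seed)
    x = rng.beta(2, 2, size=n)
    y = np.empty(n)
    low, mid, high = x < 0.25, (x >= 0.25) & (x <= 0.5), x > 0.5
    y[low] = rng.beta(30, 20, size=low.sum())
    y[mid] = rng.beta(10, 30, size=mid.sum())
    y[high] = rng.beta(0.5, 0.5, size=high.sum())
    return x, y

x, y = simulate_example1(500)
```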
We generate data sets of three different sample sizes, n = 100, n = 500, and n = 2, 500, and
place the cond-OPT prior on the distribution of Y given X. Following the four-step recipe
given in the previous section, we can compute the posterior cond-OPT and sample from it.
A representative sample from the posterior is the hierarchical maximum a posteriori
(hMAP) [37] model, which, as a summary statistic, can also be computed from the posterior
analytically [37] and is plotted in Figure 1 for the different sample sizes. (Chipman et al [3],
Wong and Ma [37], and Ma [21] discussed reasons why the commonly adopted MAP is not
a very good summary for tree-structured posteriors due to their multi-level nature. See [37,
Sec. 4.2] and the appendix of [21] for why the hMAP is preferred.)
In Figure 1, within each stopped node we plot the corresponding posterior mean estimate
of the conditional distribution of Y . Also plotted in each node is the posterior stopping probability ρ. Even with only 100 data points, the posterior suggests that ΩX should be divided
into three pieces—[0,0.25], [0.25,0.5], and [0.5,1]—within which the conditional distribution
of Y |X is homogeneous across X. Note that the posterior stopping probabilities ρ on those
three intervals are large, in contrast to the near 0 values on the larger sets. Of course, reliably
estimating the actual conditional density function on these sets nonparametrically requires
more than 100 data points. In this example, a sample size of 500 already does a decent job.
[Figure 1 appears here; panels: (a) n = 100, (b) n = 500, (c) n = 2500.]
Figure 1: The hMAP partition tree structures on X and the posterior mean estimate of Y |X
given the tree structure for Example 1. For each node, ρ indicates the posterior stopping
probability for each node and n represents the number of data points in each node. The plot
under each stopped node gives the mean of the posterior local OPT for Y within that node
(solid line) along with the true conditional densities (dashed line).
The previous example favors our method because (1) there are a small number of clear
boundaries of change for the underlying conditional distribution—namely 0.25 and 0.5, and
(2) those boundaries lie on the potential partition points of the dyadic split rule. In the next
example, we examine the case in which the conditional distribution changes smoothly across
a continuous X without any boundary of abrupt change.
Example 2 (Estimating conditional density with no clear covariate boundaries). In this
example we generate (X, Y ) from a bivariate normal distribution:

(X, Y )′ ∼ BN( (0.6, 0.4)′ , [ 0.1^2  0.005 ; 0.005  0.1^2 ] ),

that is, with mean vector (0.6, 0.4)′, variances 0.1^2, and covariance 0.005.
We generate data sets of size n = 2, 000, and apply the cond-OPT prior on the distribution
of Y given X as we did in the previous example. Again we compute the posterior cond-OPT
following our four-step recipe. The hMAP tree structure and the posterior mean estimate
of the conditional densities on the stopped nodes are presented in Figure 2. Because the
underlying covariate space ΩX is unbounded, for simplicity we use the empirically observed
range of X as ΩX , which happens to be ΩX = [0.24, 0.92] for our simulated example. (Other
ways to handle this situation include transforming X to have a compact support. We do
not discuss that in detail here but will adopt that strategy in Example 5.) As illustrated
in the figure, the posterior divides ΩX into pieces within each of which the conditional
density of Y is treated approximately as fixed over different values of X. As the sample size
increases, the posterior cond-OPT will partition ΩX into finer blocks, reflecting the fact that
the conditional density changes smoothly across ΩX . One interesting observation is that
the stopped nodes in Figure 2 have very large (close to 1) posterior stopping probability.
This may seem surprising as the underlying conditional distribution is not the same for any
neighboring values of X. The large posterior stopping probabilities indicate that on those
sets, where the sample size is not large, the gain in achieving better estimate of the common
features of the conditional distribution for nearby X values outweighs the loss in ignoring
the difference among them.

Figure 2: The hMAP tree structure on ΩX and the posterior mean estimate of Y |X within the stopped sets for Example 2. The plot under each stopped node gives the mean of the posterior local OPT for Y within that node (solid line) along with the true conditional densities at the center value of the stopped predictor intervals (dashed line).
Example 3 (Model selection over binary covariates). This example illustrates how one can
use cond-OPT to carry out model selection. Consider the case in which X = (X1 , X2 , . . . , X30 ) ∈
{0, 1}^30 forms a Markov chain:

X1 ∼ Bernoulli(0.5)  and  P(Xi = Xi−1 | Xi−1 ) = 0.7

for i = 2, 3, . . . , 30. Suppose the conditional distribution of a continuous response Y is

Y ∼ Beta(1, 6)      if (X5 , X20 , X30 ) = (1, 0, 1)
Y ∼ Beta(12, 16)    if (X5 , X20 ) = (0, 1)
Y ∼ Beta(3, 4)      otherwise.
In other words, three predictors X5 , X20 and X30 impact the response in an interactive
manner. Our interest is in recovering this underlying interactive structure (i.e. the “model”).
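A sketch of this data-generating mechanism (indices are 1-based in the formulas above and 0-based in the code; the function name is illustrative):

```python
import numpy as np

def simulate_example3(n, p=30, seed=0):
    """Binary Markov-chain predictors plus a response driven by X5, X20, X30."""
    rng = np.random.default_rng(seed)
    x = np.empty((n, p), dtype=int)
    x[:, 0] = rng.integers(0, 2, size=n)                # X1 ~ Bernoulli(0.5)
    for i in range(1, p):
        stay = rng.random(n) < 0.7                      # P(X_i = X_{i-1}) = 0.7
        x[:, i] = np.where(stay, x[:, i - 1], 1 - x[:, i - 1])
    y = rng.beta(3, 4, size=n)                          # "otherwise" case
    case1 = (x[:, 4] == 1) & (x[:, 19] == 0) & (x[:, 29] == 1)
    case2 = (x[:, 4] == 0) & (x[:, 19] == 1)            # disjoint from case1 (X5 differs)
    y[case1] = rng.beta(1, 6, size=case1.sum())
    y[case2] = rng.beta(12, 16, size=case2.sum())
    return x, y

x, y = simulate_example3(500)
```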
To illustrate, we simulate 500 data points from this scenario and place a cond-OPT prior
on Y |X which is supported on recursive partitions up to four levels deep. This is achieved
by setting ρ(A) = 1 for A that arises after four steps of partitioning, and it allows us to
search for models involving up to four-way interactions. We again carry out the four-step
recipe to get the posterior and calculate the hMAP. The hMAP tree structure along with the
predictive density for Y |X within each stopped set is presented in Figure 3. The posterior
concentrates on partitions involving X5 , X20 and X30 out of the 30 variables. While the
predictive conditional density for Y |X is very rough given the limited number of data points
in the stopped sets, the posterior recovers the exact interactive structure of the predictors
with little uncertainty.

Figure 3: The hMAP tree structure on ΩX and the posterior mean estimate of Y |X in each of the stopped sets for Example 3. The bold arrows indicate the "true model"—predictor combinations that correspond to "non-null" Y |X distributions.
In addition, we sample from the posterior and use the proportion of times each predictor
appears in the sampled models to estimate the posterior marginal inclusion probabilities.
Our estimates based on 1,000 draws from the posterior are presented in Figure 4(b). Note
that the sample size 500 is so large that the posterior marginal inclusion probabilities for the
three relevant predictors are all close to 1 while those for the other predictors are close to 0.
We carry out the same simulation with a reduced sample size of 200, and plot the estimated
posterior marginal inclusion probabilities in Figure 4(a). We see that with a sample size of
200, one can already use the posterior to reliably recover the relevant predictors.
For problems where the underlying model involves high-order interactions and/or a large
number of predictors, strategies such as k-step look-ahead can be employed. Interested
readers may refer to [21, Sec. 3] for some examples. In [21] the author also introduces a
recursive partitioning based prior for the variable selection problem in regression.
Example 4 (Test of independence). In this example, we illustrate an application of the condOPT prior for hypothesis testing. In particular, we use it to test the independence between
X and Y . To begin, note that ρ(A|x, y) in Theorem 3 gives the posterior probability for
the conditional distribution of Y to be constant over all values of X in A, or in other words,
for Y to be independent of X on A. Hence, one can consider ρ(ΩX |x, y) as a statistic that
measures the strength of dependence between the observed variables. A permutation null
distribution of this statistic can be constructed by randomly pairing the observed x and
y values, and based on this, p-values can be computed for testing the null hypothesis of
independence.
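A sketch of this permutation procedure (stat_fn stands for a routine returning ρ(ΩX |x, y), e.g. computed via Theorem 3; the add-one correction is a common convention, not prescribed here):

```python
import random

def permutation_pvalue(x, y, stat_fn, n_perm=1000, seed=0):
    """Permutation test of independence based on the posterior stopping probability.

    stat_fn(x, y) should return ρ(Ω_X | x, y); smaller values indicate stronger
    evidence of dependence, so "more extreme" means smaller.
    """
    random.seed(seed)
    observed = stat_fn(x, y)
    y_perm = list(y)
    more_extreme = 0
    for _ in range(n_perm):
        random.shuffle(y_perm)                 # break the pairing between x and y
        if stat_fn(x, y_perm) <= observed:
            more_extreme += 1
    return (more_extreme + 1) / (n_perm + 1)
```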
[Figure 4 appears here; panels: (a) n = 200, (b) n = 500.]
Figure 4: Estimated posterior marginal inclusion probabilities for the 30 predictors in Example 3. The estimates are computed over 1,000 draws from the corresponding posteriors.
To illustrate, we simulate X = (X1 , X2 , . . . , X10 ) for a sample of size 400 under the same
Markov Chain model as in the previous example, and simulate a response variable Y as
follows:

Y ∼ Beta(4, 4)      if (X1 , X2 , X5 ) = (1, 1, 0)
Y ∼ Beta(0.5, 0.5)  if (X5 , X8 , X10 ) = (1, 0, 0)
Y ∼ Unif[0, 1]      otherwise.
In particular, Y is dependent on X but there is no mean or median shift in the conditional
distribution of Y over different values of X. Figure 5 gives the histogram of ρ(ΩX |x, y)
for 1,000 permuted samples where the vertical dashed line indicates the ρ(ΩX |x, y) for the
original simulated data, which equals 0.0384. For this particular simulation, 7 out of the
1,000 permuted samples produced a more extreme test statistic.
Figure 5: Histogram of ρ(ΩX |x, y) for 1,000 permuted samples. The vertical dashed line
indicates the value of ρ(ΩX |x, y) for the original data.
Remark I: Note that by symmetry one can place a cond-OPT prior on the conditional distribution of X given Y as well and that will produce a corresponding posterior stopping
probability ρ(ΩY |y, x). One can thus alternatively use min{ρ(ΩX |x, y), ρ(ΩY |y, x)} as the
test statistic for independence.
Remark II: Testing using the posterior stopping probability ρ(ΩX |x, y) is equivalent to using
the Bayes factor (BF). To see this, note that the BF for testing independence under the cond-OPT can be written as

BF_{Y|X} = [ Σ_{j=1}^{N(A)} λj (A) Π_i Φ(A^j_i ) ] / M(A)

with A = ΩX , where the numerator is the marginal conditional likelihood of Y given X
if the conditional distribution of Y is not constant over X (i.e. ΩX is divided) and the
denominator is that if the conditional distribution of Y is the same for all X (i.e. ΩX is
undivided). By Eq. (3.2) and Theorem 3, this BF is equal to

( ρ(ΩX ) / (1 − ρ(ΩX )) ) ( 1/ρ(ΩX |x, y) − 1 ),
which is in a one-to-one correspondence to ρ(ΩX |x, y) given the prior parameters.
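A short verification of this equivalence, writing A = ΩX and combining Eq. (3.2) with item 1 of Theorem 3:

```latex
\begin{aligned}
\frac{\rho(A)}{1-\rho(A)}\Bigl(\frac{1}{\rho(A\mid x,y)}-1\Bigr)
  &= \frac{\rho(A)}{1-\rho(A)}\cdot\frac{\Phi(A)-\rho(A)M(A)}{\rho(A)M(A)}
     \qquad\text{since } \rho(A\mid x,y)=\rho(A)M(A)/\Phi(A)\\
  &= \frac{(1-\rho(A))\sum_{j=1}^{N(A)}\lambda_j(A)\prod_i\Phi(A_i^j)}{(1-\rho(A))\,M(A)}
     \qquad\text{by Eq.\ (3.2)}\\
  &= \frac{\sum_{j=1}^{N(A)}\lambda_j(A)\prod_i\Phi(A_i^j)}{M(A)}
   \;=\;\mathrm{BF}_{Y\mid X}.
\end{aligned}
```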
Example 5 (Linear median regression with heteroscedastic errors). So far we have used the
cond-OPT prior to make inference on conditional distributions in a nonparametric fashion.
In this example we illustrate how to use the cond-OPT as a prior for the error distribution
in a parametric regression model. This allows the error distribution to be heteroscedastic,
that is, to change across different covariate values.
We draw 100 (x, y) pairs according to the following scenario.
Xi ∼ Beta(2, 2)

ε_i | Xi = x ∼ 0.5 × Beta(100, 200) + 0.5 × U([−1, 0])       if x ≤ 0.5
ε_i | Xi = x ∼ 0.5 × |N(0, (1 + x)^2 )| + 0.5 × U([−1, 0])   if x > 0.5

Yi = −2 + 3.5 Xi + ε_i .
Figure 6 shows a scatterplot of the data.
Figure 6: Scatterplot of 100 samples of (x, y) pairs.

Now pretending not to know the data generating mechanism, we try to estimate the
regression coefficients a and b from a simple linear regression model
Yi = a + b Xi + ε_i .
Because we do not know whether the error distribution is bounded, we take a transformation
of the error ε_i ↦ ε̃_i = 2 F_Cauchy,5 (ε_i ) − 1 to map it onto a bounded support, where F_Cauchy,5
denotes the CDF of a Cauchy distribution with scale parameter 5. We place the cond-OPT
prior with ΩX = [0, 1] and ΩY = [−1, 1] on the conditional distribution for the transformed
error ε̃_i | Xi with the median set to 0 for identifiability. (This can be achieved by requiring
the local OPTs for q_Y^{0,A} to have 0 as the partition point at the top level, and that at the first
level the process always assigns 1/2 probability to each of the two children, i.e. α_1^A (ΩY ) =
α_2^A (ΩY ) ↑ ∞. This is similar to what Walker and Mallick did for the Pólya tree [36].)
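Concretely, the error transformation amounts to a couple of lines, using F_Cauchy,5(t) = 1/2 + arctan(t/5)/π (the helper names are illustrative):

```python
import math

def cauchy_cdf(t, scale=5.0):
    """CDF of a Cauchy distribution centered at 0 with the given scale."""
    return 0.5 + math.atan(t / scale) / math.pi

def transform_error(eps):
    """Map an unbounded error onto (-1, 1) via 2 F_Cauchy,5(eps) - 1."""
    return 2.0 * cauchy_cdf(eps) - 1.0
```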
As we have seen in Section 2, an important feature of cond-OPT is that the prior can be
marginalized out—the marginal likelihood of the data, conditional on the predictors, can be
computed exactly through the recursion formula Eq. (3.2). In the current example, given the
values of a and b, the marginal conditional likelihood of the data is just Φ_{ε̃|X}(Ω_ε̃ ), which can
be computed using {(xi , 2 F_Cauchy,5 (yi − a − b xi ) − 1)} and Eq. (3.2). Therefore, if π(a, b) denotes
the prior density of a and b, and π(a, b|X, Y ) the posterior density, then by Bayes' Theorem

π(a, b|X, Y ) ∝ Φ_{ε̃|X}(Ω_ε̃ ) π(a, b).
Hence, MCMC procedures can be used to draw from this posterior. (Note that here we
are not drawing samples from a posterior cond-OPT, which does not require MCMC.) The
marginalization over the prior allows us to carry out MCMC on a low dimensional parameter
space. Figure 7 shows the samples for a and b from a run of the Metropolis algorithm.

Figure 7: Posterior samples from a typical Metropolis run, excluding 5,000 burn-in iterations.
For completeness, we provide the technical specifications of this sample run. We have
placed independent N(0, 25) priors on a and b. The proposal distribution for a adopts the
following mixture-of-normals form: 0.3 × N(0, 1) + 0.3 × N(0, 0.1^2 ) + 0.4 × N(0, 0.01^2 ). Similarly,
the proposal distribution for b, which is independent of that for a, is 0.3 × N(0, 0.2^2 ) + 0.3 ×
N(0, 0.02^2 ) + 0.4 × N(0, 0.002^2 ). The mixture-of-normals proposal is admittedly an inefficient
way to explore the posterior at multiple resolutions—it is adopted here for simplicity.
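A generic random-walk Metropolis sketch of this scheme, where log_post is assumed to return log π(a, b) + log Φ_{ε̃|X}(Ω_ε̃ ) evaluated at the transformed errors, and the single-normal proposal scales are a simplification of the mixture proposals above:

```python
import math
import random

def metropolis(log_post, theta0, prop_sd, n_iter=10000, seed=0):
    """Random-walk Metropolis over the regression coefficients theta = (a, b)."""
    random.seed(seed)
    theta = list(theta0)
    lp = log_post(theta)
    samples = []
    for _ in range(n_iter):
        proposal = [t + random.gauss(0.0, s) for t, s in zip(theta, prop_sd)]
        lp_prop = log_post(proposal)
        # accept with probability min(1, exp(lp_prop - lp))
        if random.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = proposal, lp_prop
        samples.append(tuple(theta))
    return samples

# usage sketch: samples = metropolis(log_post, theta0=(0.0, 0.0), prop_sd=(0.1, 0.02))
```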
6 Discussion
In this work we have introduced a Bayesian nonparametric prior on the space of conditional density functions. This prior, which we call the conditional optional Pólya tree, is
constructed based on a two-stage procedure that first divides the predictor space ΩX and
then generates the conditional distribution of the response through local OPT processes on
ΩY . We have established several important theoretical properties of this prior, namely large
support, posterior conjugacy, exactness and weak posterior consistency, and have provided
the recipe for Bayesian inference using this prior.
The construction of this prior does not depend on the marginal distribution of X. One
particular implication is that one can transform X before applying the prior on Y |X without
invalidating the posterior inference. (Note that transforming X is equivalent to choosing
a different partition rule on ΩX .) In certain situations it is desirable to perform such a
transformation on X. For example, if the data points are very unevenly spread over ΩX ,
then some parts of the space may contain a very small number of data points. There
the posterior is mostly dominated by the prior specification and does not provide much
information about the underlying conditional distribution. One way to mitigate this problem
is to transform X so that the data are more evenly distributed over ΩX , thereby achieving
an effectively “minimax” estimate of the conditional density. When ΩX is one-dimensional,
for example, this can be achieved by a rank transformation on X. Another situation in
which a transformation of X may be useful is when the dimensionality of X is very high.
In this case a dimensionality reduction transformation can be applied on X before carrying
out the inference. Of course, in doing so one often loses the ability to interpret the posterior
conditional distribution of Y directly in terms of the original predictors.
In some problems, one is not so interested in inference on a particular conditional distribution as in the relation between multiple conditional distributions. For example, one may
be interested in learning whether Y1 and Y2 are conditionally independent given X. For
another instance, one may want to compare the conditional distribution of Y1 |X1 versus
that of Y2 |X2 where Y1 and Y2 lie on a common space ΩY and similarly X1 and X2 lie
on the same space ΩX . (In fact X1 and X2 can be the same random vector.) The current
framework can be extended to address these inferential problems. We introduce in a companion paper such an extension with an emphasis on hypothesis testing along with further
development in using the framework for Bayesian model averaging in the context of testing
conditional association in case-control studies.
Appendix: Proofs
Proof of Lemma 1. The proof of this lemma is very similar to that of Theorem 1 in [37].
Let T_1^k be the part of ΩX that has not been stopped after k levels of recursive partitioning.
The random partition of ΩX after k levels of recursive partitioning can be thought of as
being generated in two steps. First suppose there is no stopping on any set and let J^{*(k)}
be the collection of partition selection variables J generated in the first k levels of recursive
partitioning. Let A^k(J^{*(k)}) be the collection of sets A that arise in the first k levels of non-stopping
recursive partitioning, which is determined by J^{*(k)}. Then we generate the stopping
variables S(A) for each A ∈ A^k(J^{*(k)}) successively for k = 1, 2, . . ., and once a set is stopped,
let all its descendants be stopped as well. Now for each A ∈ A^k(J^{*(k)}), let I^k(A) be the
indicator for A's stopping status after k levels of recursive partitioning, with I^k(A) = 1 if A
is not stopped and I^k(A) = 0 otherwise.

E( µX (T_1^k ) | J^{*(k)} ) = E( Σ_{A ∈ A^k(J^{*(k)})} µX (A) I^k(A) | J^{*(k)} )
                           = Σ_{A ∈ A^k(J^{*(k)})} µX (A) E( I^k(A) | J^{*(k)} )
                           ≤ µX (ΩX )(1 − δ)^k .

Hence E(µX (T_1^k )) ≤ µX (ΩX )(1 − δ)^k . By Markov's inequality and the Borel–Cantelli lemma, we
have µX (T_1^k ) ↓ 0 with probability 1.
Proof of Theorem 2. We prove only the second result as the first follows by choosing fX (x) ≡
1/µX (ΩX ). Also, we consider only the case when ΩX and ΩY are both compact Euclidean
rectangles, because the cases when at least one of the two spaces is finite follow as simpler
special cases. For x ∈ ΩX and y ∈ ΩY , let f (x, y) := fX (x)f (y|x) denote the joint density.
First we assume that the joint density f (x, y) is uniformly continuous. In this case it is
bounded on ΩX × ΩY . We let M := sup f (x, y) and
δ() :=
|f (x1 , y1 ) − f (x2 , y2 )|.
sup
|x1 −x2 |+|y1 −y2 |<
By uniform continuity, we have δ() ↓ 0 as ↓ 0. In addition, we define
\[
\delta_X(\epsilon) := \sup_{|x_1-x_2|<\epsilon} |f_X(x_1) - f_X(x_2)|
\le \sup_{|x_1-x_2|<\epsilon} \int |f(x_1,y) - f(x_2,y)|\,\mu_Y(dy)
\le \delta(\epsilon)\,\mu_Y(\Omega_Y).
\]
Note that in particular the continuity of f (x, y) implies the continuity of fX (x).
Let σ > 0 be any positive constant. Choose a positive constant ε(σ) such that δ_X(ε(σ)) ≤ δ(ε(σ))µ_Y(Ω_Y) < min(σ/2, σ³/2). Because all the parameters in the cond-OPT are uniformly bounded away from 0 and 1, there is positive probability that Ω_X will be partitioned into Ω_X = ∪_{i=1}^K B_i, where the diameter of each B_i is less than ε(σ) and the partition stops on each of the B_i's. (The existence of such a partition follows from the fine partition criterion.) Let A_i = B_i ∩ {x : f_X(x) ≥ σ}, P(X ∈ A_i) = ∫_{A_i} f_X(x) µ_X(dx), and f_i(y) := ∫_{A_i} f(x, y) µ_X(dx)/µ_X(A_i) if µ_X(A_i) > 0, and 0 otherwise. Let I ⊂ {1, 2, . . . , K} be the set of indices i such that µ_X(A_i) > 0. Then
\[
\begin{aligned}
\int &|q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy)\\
&\le \int_{f_X(x)<\sigma} |q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy)\\
&\quad+ \sum_{i\in I}\int_{A_i\times\Omega_Y} \Bigl| q(y|x) - f_i(y)\cdot\frac{\mu_X(A_i)}{P(X\in A_i)} \Bigr|\, f_X(x)\,\mu(dx\times dy)\\
&\quad+ \sum_{i\in I}\int_{A_i\times\Omega_Y} f_i(y)\,\Bigl| \frac{\mu_X(A_i)}{P(X\in A_i)} - \frac{1}{f_X(x)} \Bigr|\, f_X(x)\,\mu(dx\times dy)\\
&\quad+ \sum_{i\in I}\int_{A_i\times\Omega_Y} \bigl| f_i(y) - f(x,y) \bigr|\,\mu(dx\times dy).
\end{aligned}
\]
Let us consider each of the four terms on the right hand side in turn. First,
\[
\int_{f_X(x)<\sigma} |q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy) \le 2\sigma\mu_X(\Omega_X).
\]
Note that for each i ∈ I, fi (y)µX (Ai )/P (X ∈ Ai ) is a density function in y. Therefore by
the large support property of the OPT prior (Theorem 2 in [37]), with positive probability,
\[
\int_{\Omega_Y} \Bigl| q_Y^{0,B_i}(y) - f_i(y)\cdot\frac{\mu_X(A_i)}{P(X\in A_i)} \Bigr|\,\mu_Y(dy) < \sigma,
\]
and so
\[
\int_{A_i\times\Omega_Y} \Bigl| q(y|x) - f_i(y)\cdot\frac{\mu_X(A_i)}{P(X\in A_i)} \Bigr|\, f_X(x)\,\mu(dx\times dy) < \sigma P(X\in A_i)
\]
for all i ∈ I. Also, for any x ∈ A_i, by the choice of ε(σ),
\[
\Bigl| \frac{\mu_X(A_i)}{P(X\in A_i)} - \frac{1}{f_X(x)} \Bigr|
\le \frac{\delta_X(\epsilon(\sigma))}{\sigma\bigl(\sigma - \delta_X(\epsilon(\sigma))\bigr)}
\le \frac{\sigma^3/2}{\sigma^2/2} = \sigma.
\]
Thus
\[
\int_{A_i\times\Omega_Y} f_i(y)\,\Bigl| \frac{\mu_X(A_i)}{P(X\in A_i)} - \frac{1}{f_X(x)} \Bigr|\, f_X(x)\,\mu(dx\times dy) \le \sigma M \mu_Y(\Omega_Y)\, P(X\in A_i).
\]
Finally, again by the choice of ε(σ), |f_i(y) − f(x, y)| ≤ δ(ε(σ)) < σ, and so
\[
\int_{A_i\times\Omega_Y} \bigl| f_i(y) - f(x,y) \bigr|\,\mu(dx\times dy) < \sigma\mu_Y(\Omega_Y)\mu_X(A_i).
\]
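On the positive-probability event constructed above, summing these bounds over i ∈ I (a step left implicit here, using Σ_{i∈I} P(X ∈ A_i) ≤ 1 and Σ_{i∈I} µ_X(A_i) ≤ µ_X(Ω_X)) gives
\[
\int |q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy)
\le \sigma\bigl( 2\mu_X(\Omega_X) + 1 + M\mu_Y(\Omega_Y) + \mu_Y(\Omega_Y)\mu_X(\Omega_X) \bigr).
\]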
Therefore for any τ > 0, by choosing a small enough σ, we can have
\[
\int |q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy) < \tau
\]
with positive probability. This completes the proof of the theorem for continuous f (x, y).
Now we can approximate any density function f(x, y) arbitrarily closely in L1 distance by a continuous one f̃(x, y). The theorem still holds because
\[
\begin{aligned}
\int |q(y|x) - f(y|x)|\, f_X(x)\,\mu(dx\times dy)
&\le \int q(y|x)\,|f_X(x) - \tilde{f}_X(x)|\,\mu(dx\times dy)\\
&\quad+ \int |q(y|x) - \tilde{f}(y|x)|\,\tilde{f}_X(x)\,\mu(dx\times dy)
+ \int |\tilde{f}(x,y) - f(x,y)|\,\mu(dx\times dy)\\
&\le \int |q(y|x) - \tilde{f}(y|x)|\,\tilde{f}_X(x)\,\mu(dx\times dy)
+ 2\int |\tilde{f}(x,y) - f(x,y)|\,\mu(dx\times dy),
\end{aligned}
\]
where f̃_X(x) and f̃(y|x) denote the corresponding marginal and conditional density functions of f̃(x, y).
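The factor 2 in the last bound arises from absorbing the first term of the intermediate bound, a step that is implicit above: since q(·|x) integrates to one in y,
\[
\int q(y|x)\,|f_X(x)-\tilde{f}_X(x)|\,\mu(dx\times dy)
= \int |f_X(x)-\tilde{f}_X(x)|\,\mu_X(dx)
\le \int |\tilde{f}(x,y)-f(x,y)|\,\mu(dx\times dy).
\]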
Proof of Theorem 3. Given that a set A is reached during the random partitioning steps on
ΩX , Φ(A) is the marginal conditional likelihood of
{Y (A) = y(A)} given {X(A) = x(A)}.
The first term on the right hand side of Eq. (3.2), ρ(A)M (A), is the marginal conditional
likelihood of
{Stop partitioning on A, Y (A) = y(A)} given {X(A) = x(A)}.
Each summand in the second term, (1 − ρ(A))λ_j(A) ∏_i Φ(A^j_i), is the marginal conditional likelihood of
{Partition A in the jth way, Y(A) = y(A)} given {X(A) = x(A)}.
Thus the conjugacy of the prior and the posterior updates for ρ, λ_j and OPT(R_Y^A; ρ_Y^A, λ_Y^A, α_Y^A) follow from Bayes' theorem and the posterior conjugacy of the standard optional Pólya tree prior (Theorem 3 in [37]).
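To make the Bayes' theorem step explicit (this is only a restatement of the updates just described, written in the notation of Eq. (3.2), not an additional result): in that notation Eq. (3.2) reads
\[
\Phi(A) = \rho(A)\,M(A) + \bigl(1-\rho(A)\bigr)\sum_j \lambda_j(A)\prod_i \Phi(A^j_i),
\]
and dividing each term by Φ(A) gives the posterior probabilities of the corresponding events, for example
\[
\rho\bigl(A \mid x(A), y(A)\bigr) = \frac{\rho(A)\,M(A)}{\Phi(A)},
\qquad
\lambda_j\bigl(A \mid x(A), y(A),\ A\ \text{not stopped}\bigr)
= \frac{\lambda_j(A)\prod_i \Phi(A^j_i)}{\sum_{j'} \lambda_{j'}(A)\prod_i \Phi(A^{j'}_i)}.
\]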
Proof of Theorem 4. By Theorem 2.1 in [27], which follows directly from Schwartz’s theorem
(see [30] and [11, Theorem 4.4.2]), we just need to prove that the prior places positive
probability mass in arbitrarily small Kullback-Leibler (K-L) neighborhoods of f (·|·) w.r.t
fX . Here a K-L neighborhood w.r.t fX is defined to be the collection of conditional densities
\[
K_\epsilon(f) = \Bigl\{ h(\cdot|\cdot) : \int f(y|x)\,\log\frac{f(y|x)}{h(y|x)}\, f_X(x)\,\mu(dx\times dy) < \epsilon \Bigr\}
\]
for some ε > 0.
To prove this, we just need to show that any conditional density that satisfies the conditions given in the theorem can be approximated arbitrarily well in K-L distance by a
piecewise constant conditional density of the sort that arises from the cond-OPT procedure. We first assume that f(·|·) is continuous. Following the proof of Theorem 2, let δ(ε) denote the modulus of continuity of f(·|·). Let Ω_X = ∪_{i=1}^K A_i be a reachable partition of Ω_X such that the diameter of each partition block A_i is less than ε. Next, for each A_i, let Ω_Y = ∪_{j=1}^N B_{ij} be a partition of Ω_Y allowed under OPT(R_Y^{A_i}; ρ_Y^{A_i}, λ_Y^{A_i}, α_Y^{A_i}) such that the diameter of each B_{ij} is also less than ε. Let
\[
g_{ij} = \sup_{x\in A_i,\; y\in B_{ij}} f(y|x)
\qquad\text{and}\qquad
g_i(y) = \sum_j g_{ij}\, I_{B_{ij}}(y).
\]
Let G_i = ∫_{A_i×Ω_Y} g_i(y) f_X(x) µ(dx × dy). Then
\[
0 \le \sum_i G_i - 1
= \sum_i \int_{A_i\times\Omega_Y} \bigl(g_i(y) - f(y|x)\bigr)\, f_X(x)\, d\mu
\le \delta(2\epsilon)\,\mu_Y(\Omega_Y),
\]
and so Σ_i G_i ≤ 1 + δ(2ε)µ_Y(Ω_Y). Now let
\[
g(y|x) = \sum_i \biggl( g_i(y) \Big/ \int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y}) \biggr) I_{A_i}(x),
\]
which is a step function that can arise from the cond-OPT prior. Then
\[
\begin{aligned}
0 &\le \int f(y|x)\,\log\bigl(f(y|x)/g(y|x)\bigr)\, f_X(x)\, d\mu\\
&= \sum_i\Bigl\{ \int_{A_i\times\Omega_Y} f(y|x)\,\log\bigl(f(y|x)/g_i(y)\bigr)\, f_X(x)\, d\mu
+ \int_{A_i\times\Omega_Y} f(y|x)\,\log\Bigl(\int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y})\Bigr) f_X(x)\, d\mu \Bigr\}\\
&\le \sum_i \log\Bigl(\int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y})\Bigr)\, P(X\in A_i)
\le \log\Bigl(\sum_i \int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y})\, P(X\in A_i)\Bigr)
= \log\Bigl(\sum_i G_i\Bigr) \le \delta(2\epsilon)\,\mu_Y(\Omega_Y),
\end{aligned}
\]
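Spelling out the final inequality, which is implicit above: since log u ≤ u − 1 for u > 0 and Σ_i G_i ≤ 1 + δ(2ε)µ_Y(Ω_Y) as shown earlier,
\[
\log\Bigl(\sum_i G_i\Bigr) \le \sum_i G_i - 1 \le \delta(2\epsilon)\,\mu_Y(\Omega_Y).
\]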
which can be made arbitrarily close to 0 by choosing a small enough ε. Now if f(·|·) is not continuous, then for any ε′ > 0 there exists a compact set E ⊂ Ω_X × Ω_Y such that f(·|·) is uniformly continuous on E and µ(E^c) < ε′. Now let
\[
g_{ij} = \Bigl( \sup_{(x,y)\in E\cap(A_i\times B_{ij})} f(y|x) + \delta(\epsilon/2) \Bigr) \vee \epsilon''
\]
for some constant ε″ > 0, while keeping the definitions of g_i, G_i and g(y|x) in terms of g_{ij} unchanged. Let M be a finite upper bound of f(·|·) and f(·,·). We have
\[
\sum_i G_i - 1
= \sum_i \int_{E\cap(A_i\times\Omega_Y)} \bigl(g_i(y) - f(y|x)\bigr)\, f_X(x)\, d\mu
+ \sum_i \int_{E^c\cap(A_i\times\Omega_Y)} \bigl(g_i(y) - f(y|x)\bigr)\, f_X(x)\, d\mu.
\]
Thus,
\[
\sum_i G_i - 1 \ge \delta(\epsilon/2)\,\mu_Y(\Omega_Y) - (2M + \epsilon'')\,M\,\mu_Y(\Omega_Y)\,\epsilon',
\]
which is positive for a small enough ε′. At the same time,
\[
\sum_i G_i - 1 \le \bigl(\delta(2\epsilon) + \epsilon''\bigr)\,\mu_Y(\Omega_Y) + (2M + \epsilon'')\,M\,\mu_Y(\Omega_Y)\,\epsilon',
\]
which can be made arbitrarily small by taking ε, ε′, and ε″ all ↓ 0.
Now
\[
\begin{aligned}
0 &\le \int f(y|x)\,\log\bigl(f(y|x)/g(y|x)\bigr)\, f_X(x)\, d\mu\\
&= \sum_i \int_{A_i\times\Omega_Y} f(y|x)\,\log\bigl(f(y|x)/g_i(y)\bigr)\, f_X(x)\, d\mu
+ \sum_i \int_{A_i\times\Omega_Y} f(y|x)\,\log\Bigl(\int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y})\Bigr) f_X(x)\, d\mu\\
&= \sum_i \int_{E\cap(A_i\times\Omega_Y)} f(y|x)\,\log\bigl(f(y|x)/g_i(y)\bigr)\, f_X(x)\, d\mu
+ \sum_i \int_{E^c\cap(A_i\times\Omega_Y)} f(y|x)\,\log\bigl(f(y|x)/g_i(y)\bigr)\, f_X(x)\, d\mu\\
&\qquad+ \sum_i \int_{A_i\times\Omega_Y} f(y|x)\,\log\Bigl(\int_{\Omega_Y} g_i(\tilde{y})\,\mu_Y(d\tilde{y})\Bigr) f_X(x)\, d\mu\\
&\le 0 + M\epsilon'\,\log(M/\epsilon'') + \log\Bigl(\sum_i G_i\Bigr)\\
&\le M\epsilon'\,\log(M/\epsilon'') + \bigl(\delta(2\epsilon) + \epsilon''\bigr)\,\mu_Y(\Omega_Y) + (2M + \epsilon'')\,M\,\mu_Y(\Omega_Y)\,\epsilon'.
\end{aligned}
\]
The right-hand side ↓ 0 if we take ε ↓ 0 and ε′ = ε″ ↓ 0. This completes the proof.
References
[1] Bashtannyk, D. M. and Hyndman, R. J. (2001). Bandwidth selection for kernel
conditional density estimation. Computational Statistics & Data Analysis 36, 279–298.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth Statistics/Probability Series. Wadsworth Advanced Books and Software, Belmont, CA. MR726392 (86b:62101)
[3] Chipman, H. A., George, E. I., and McCulloch, R. E. (1998). Bayesian CART
model search. Journal of the American Statistical Association 93, 443, 935–948.
[4] Chung, Y. and Dunson, D. B. (2009). Nonparametric Bayes conditional distribution
modeling with variable selection. Journal of the American Statistical Association 104,
1646–1660.
[5] Denison, D. G. T., Mallick, B. K., and Smith, A. F. M. (1998). A Bayesian
CART algorithm. Biometrika 85, 2, 363–377.
[6] Dunson, D. B. and Park, J.-H. (2008).
Kernel stick-breaking processes.
Biometrika 95, 307–323.
[7] Fan, J., Yao, Q., and Tong, H. (1996). Estimation of conditional densities and
sensitivity measures in nonlinear dynamical systems. Biometrika 83, 189–206.
[8] Fan, J. and Yim, T. H. (2004). A crossvalidation method for estimating conditional
densities. Biometrika 91, 4 (Dec.), 819–834. http://dx.doi.org/10.1093/biomet/91.4.819.
[9] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann.
Statist. 1, 209–230. MR0350949 (50 #3441)
[10] Gelfand, A. E., Kottas, A., and MacEachern, S. N. (2005). Bayesian nonparametric spatial modeling with Dirichlet process mixing. Journal of the American Statistical
Association 100, 1021–1035.
[11] Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer
Series in Statistics. Springer-Verlag, New York. MR1992245 (2004g:62004)
[12] Griffin, J. and Steel, M. (2006). Order-based dependent Dirichlet processes. Journal of the American Statistical Association 101, 179–194.
[13] Hall, P., Wolff, R. C., and Yao, Q. (1999). Methods for estimating a conditional
distribution function. Journal of the American Statistical Association 94, 445, 154–163.
http://eprints.qut.edu.au/5939/.
[14] Hyndman, R. J. and Yao, Q. (2002). Nonparametric estimation and symmetry tests
for conditional density functions. Journal of Nonparametric Statistics 14, 259–278.
[15] Iorio, M. D., Rosner, P., and MacEachern, S. N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association 99,
205–215.
[16] Jara, A. and Hanson, T. E. (2011). A class of mixtures of dependent tail-free
processes. Biometrika 98, 3, 553–566.
[17] Jara, A., Hanson, T. E., and Lesaffre, E. (2009). Robustifying generalized linear mixed models using a new class of mixtures of multivariate Pólya trees. Journal of Computational and Graphical Statistics 18, 4, 838–860. http://amstat.tandfonline.com/doi/abs/10.1198/jcgs.2009.07062.
[18] Kraft, C. H. (1964). A class of distribution function processes which have derivatives.
J. Appl. Probability 1, 385–388. MR0171296 (30 #1527)
[19] Lavine, M. (1992). Some aspects of Pólya tree distributions for statistical modelling.
Ann. Statist. 20, 3, 1222–1235. MR1186248 (93k:62006b)
[20] Lenk, P. J. (1988). The logistic normal distribution for Bayesian, nonparametric,
predictive densities. Journal of the American Statistical Association 83, 402, 509–516.
http://dx.doi.org/10.2307/2288870.
[21] Ma, L. (2012a). Bayesian recursive variable selection. Discussion paper 2012-04, Duke
University Department of Statistical Science.
[22] Ma, L. (2012b). Recursive mixture modeling and nonparametric testing of association in case-control studies. Discussion paper 2012-07, Duke University Department of
Statistical Science.
[23] Ma, L. and Wong, W. H. (2011). Coupling optional Pólya trees and the two sample
problem. Journal of the American Statistical Association 106, 496, 1553–1565.
[24] MacEachern, S. (1999). Dependent Dirichlet processes. In Proceedings of the section
on Bayesian Statistical Science.
[25] Müller, P., Erkanli, A., and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83, 1 (Mar.), 67–79. http://dx.doi.org/10.1093/biomet/83.1.67.
[26] Norets, A. (2011). Bayesian modeling of joint and conditional distributions. Tech.
rep., Princeton University.
[27] Norets, A. and Pelenis, J. (2011). Posterior consistency in conditional density
estimation by covariate dependent mixtures. Tech. rep., Princeton University.
[28] Pati, D., Dunson, D., and Tokdar, S. (2011). Posterior consistency in conditional
distribution estimation. Tech. rep., Duke University Department of Statistical Science.
[29] Rodríguez, A. and Dunson, D. B. (2011). Nonparametric Bayesian models through
probit stick-breaking processes. Bayesian Analysis 6, 1, 145–178.
[30] Schwartz, L. (1965). On Bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw.
Gebiete 4, 10–26. MR0184378 (32 #1851)
[31] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica
Sinica 4, 639–650.
[32] Taddy, M. A. and Kottas, A. (2010). A Bayesian nonparametric approach to
inference for quantile regression. Journal of Business & Economic Statistics 28, 3, 357–
369. http://econpapers.repec.org/RePEc:bes:jnlbes:v:28:i:3:y:2010:p:357-369.
[33] Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian
process priors in density estimation. Journal of Statistical Planning and Inference 137, 1
(Jan.), 34–42. http://dx.doi.org/10.1016/j.jspi.2005.09.005.
[34] Tokdar, S. T., Zhu, Y. M., and Ghosh, J. K. (2010). Bayesian density regression
with logistic Gaussian process and subspace projection. Bayesian Analysis 5, 2, 319–344.
[35] Trippa, L., Müller, P., and Johnson, W. (2011). The multivariate beta process and an extension of the Pólya tree model. Biometrika 98, 1, 17–34. http://ideas.repec.org/a/oup/biomet/v98y2011i1p17-34.html.
[36] Walker, S. G. and Mallick, B. K. (1997). Hierarchical generalized linear models and frailty models with Bayesian nonparametric mixing. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 4, 845–860. http://dx.doi.org/10.1111/1467-9868.00101.
[37] Wong, W. H. and Ma, L. (2010). Optional Pólya tree and Bayesian inference. Annals
of Statistics 38, 3, 1433–1459. http://projecteuclid.org/euclid.aos/1268056622.