IMS Collections
Beyond Parametrics in Interdisciplinary Research: Festschrift in Honor of Professor Pranab K. Sen
Vol. 1 (2008) 105–115
© Institute of Mathematical Statistics, 2008
DOI: 10.1214/193940307000000077

Posterior consistency of Dirichlet mixtures of beta densities in estimating positive false discovery rates

Subhashis Ghosal1,∗, Anindya Roy2,† and Yongqiang Tang3

North Carolina State University, University of Maryland – Baltimore County and SUNY Downstate Medical Center

Abstract: In recent years, multiple hypothesis testing has come to the forefront of statistical research, ostensibly in relation to applications in genomics and some other emerging fields. The false discovery rate (FDR) and its variants provide very important notions of errors in this context comparable to the role of error probabilities in classical testing problems. Accurate estimation of positive FDR (pFDR), a variant of the FDR, is essential in assessing and controlling this measure. In a recent paper, the authors proposed a model-based nonparametric Bayesian method of estimation of the pFDR function. In particular, the density of p-values was modeled as a mixture of decreasing beta densities and an appropriate Dirichlet process was considered as a prior on the mixing measure. The resulting procedure was shown to work well in simulations. In this paper, we provide some theoretical results in support of the beta mixture model for the density of p-values, and show that, under appropriate conditions, the resulting posterior is consistent as the number of hypotheses grows to infinity.

1. Introduction

Consider the problem of testing m null hypotheses $H_{0,1}, \ldots, H_{0,m}$ simultaneously, where m is a large number. This type of multiple hypothesis testing problem has received a lot of attention in recent years, primarily due to advanced data collection techniques in genomics, microarray analysis, proteomics, fMRI and some other fields.
∗ Supported in part by NSF Grant DMS-03-49111.
† Supported in part by NIH Grant 1R01GM075298-01.
1 Department of Statistics, North Carolina State University, 2501 Founders Drive, Raleigh, NC 27695, USA, e-mail: [email protected]
2 Department of Mathematics and Statistics, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA, e-mail: [email protected]
3 SUNY Downstate Medical Center, 450 Clarkson Avenue, Brooklyn, NY 11203, USA, e-mail: yongqiang [email protected]
AMS 2000 subject classifications: Primary 62G05, 62G20; secondary 62G10.
Keywords and phrases: Dirichlet process, Dirichlet mixture, multiple testing, positive false discovery rate, posterior consistency.

The analog of type I error probability in multiple testing problems is given by the family-wise error rate, which is defined as the probability of making at least one false rejection. Such a measure is too stringent when m is even moderately large and will block many genuine discoveries (i.e., rejection of a false null hypothesis). In a pioneering paper, Benjamini and Hochberg [2] introduced the concept of the false discovery rate (FDR), the expected value of the ratio of the number of false rejections to the total number of rejections, and described a procedure to control it. Mathematically, the FDR at a nominal level γ is given by $E(V/\max(R, 1)) = E(V/R \mid R > 0)\,P(R > 0)$, where R = R(γ) stands for the number of hypotheses rejected at nominal level γ and V = V(γ) is the number of false rejections among these. Storey [11, 12] argued that the positive false discovery rate (pFDR) at nominal level γ, defined as $E(V/R \mid R > 0)$, is a more relevant measure to control. Storey's approach consists of estimating the pFDR function at each γ and choosing a γ so that the estimated pFDR function is within a given limit, α. Storey [11, 12] showed that under a certain natural setup, the resulting procedure controls pFDR by α.
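The identity between the FDR and the pFDR stated above can be checked by simulation. The following Python sketch is only an illustration: the function name `simulate_fdr`, the parameter values, and the choice of a beta(a, 1) distribution for the p-values of false nulls are assumptions made for the example, not part of the procedures of [2] or [11, 12]. Because V = 0 whenever R = 0, the two sides of the identity agree on the simulated averages up to rounding.

```python
import random


def simulate_fdr(m=20, pi0=0.8, a=0.2, gamma=0.05, reps=2000, seed=0):
    """Monte Carlo estimates of FDR, pFDR and P(R > 0) at level gamma.

    Assumed toy model: each null is true with probability pi0; p-values are
    U[0, 1] under the null and beta(a, 1) under the alternative.
    """
    rng = random.Random(seed)
    ratios = []    # V / max(R, 1), one entry per replication
    positive = []  # V / R, only for replications with R > 0
    for _ in range(reps):
        v = r = 0
        for _ in range(m):
            is_null = rng.random() < pi0
            u = rng.random()
            # U ** (1 / a) has c.d.f. x ** a, i.e. it is beta(a, 1).
            p = u if is_null else u ** (1.0 / a)
            if p <= gamma:
                r += 1
                v += int(is_null)
        ratios.append(v / max(r, 1))
        if r > 0:
            positive.append(v / r)
    fdr = sum(ratios) / reps
    pfdr = sum(positive) / len(positive)
    prob_r_positive = len(positive) / reps
    return fdr, pfdr, prob_r_positive


fdr, pfdr, prob_r_positive = simulate_fdr()
print(fdr, pfdr * prob_r_positive)
```

Since P(R > 0) ≤ 1, the estimated pFDR is never smaller than the estimated FDR; the two are close when rejections occur in nearly every replication.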
Some other related measures have also been considered in the literature; see Benjamini and Hochberg [2], Efron and Tibshirani [3], Tsai et al. [14] and Sarkar [10].

In order to estimate the pFDR function, Storey [11] considered a mixture model setup, where each null hypothesis has a fixed probability, π, of being true. Thus, the number of true null hypotheses, $m_0$, is taken to be a random variable distributed as binomial(m, π). If the null hypothesis is true, then it is assumed that the p-value associated with the corresponding test statistic is uniformly distributed. The p-value when the alternative is true and has a fixed value θ follows a distribution H = H(·|θ). It is somewhat unnatural to assume that the alternative value remains fixed when the hypotheses themselves are appearing randomly. A more natural assumption would be that, given that the null hypothesis is false, the alternative is chosen randomly according to a distribution μ. Then, marginally, the conclusion that the p-value under the alternative is distributed as H remains unaffected, where now H stands for the mixture $\int H(\cdot \mid \theta)\,d\mu(\theta)$.

Under this setup, Storey [11] showed that the pFDR at nominal level γ is given by the expression $\pi\gamma/[\pi\gamma + (1-\pi)H(\gamma)]$. To estimate the pFDR, it then suffices to estimate π, since the denominator can be estimated essentially by the empirical proportion R/m. Actually, Storey [11] used a slightly different estimator to take into account the problem of zeros in finite samples. Estimation of π is more delicate. Storey [11] assumed that for some appropriate threshold value λ, all p-values over λ are associated with true null hypotheses. Equating the observed proportion of p-values exceeding λ with its expectation $(1-\lambda)\pi$ under this assumption, and choosing λ appropriately, an estimate of π, and hence of the pFDR, can be obtained. Although Storey [11] did not make any explicit assumption about H, implicitly it was assumed that H is concentrated near zero.
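The pFDR expression and the threshold-based estimate of π described above can be sketched as follows. This is a simplified plug-in illustration under stated assumptions, not Storey's exact finite-sample estimator: the function names `pfdr`, `estimate_pi` and `plug_in_pfdr` are hypothetical, the estimate of π uses the fact that, if every p-value above λ comes from a true null, the expected fraction of p-values above λ is π(1 − λ), and the guard `max(r, 1)` is a crude stand-in for the correction for zeros mentioned in the text.

```python
def pfdr(pi, gamma, h_gamma):
    """pFDR at level gamma: pi*gamma / (pi*gamma + (1 - pi)*H(gamma))."""
    return pi * gamma / (pi * gamma + (1.0 - pi) * h_gamma)


def estimate_pi(pvalues, lam):
    """Estimate pi by #{p > lam} / (m * (1 - lam)), capped at 1."""
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))


def plug_in_pfdr(pvalues, gamma, lam):
    """Plug-in pFDR estimate: pi_hat * gamma divided by the proportion R/m."""
    m = len(pvalues)
    r = sum(1 for p in pvalues if p <= gamma)
    return min(1.0, estimate_pi(pvalues, lam) * gamma / (max(r, 1) / m))


# Hypothetical p-values, used only to exercise the functions.
example = [0.001, 0.004, 0.01, 0.03, 0.2, 0.35, 0.6, 0.7, 0.8, 0.95]
print(estimate_pi(example, lam=0.5), plug_in_pfdr(example, gamma=0.05, lam=0.5))
```

The estimate of π is biased upward whenever some p-values above λ come from false nulls, which is the issue taken up next in the text.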
The implicit assumption that H is concentrated near zero is what leads to the conclusion that almost every p-value over level λ must arise from null hypotheses. While this is reasonable, it introduces some bias in the analysis because, although relatively rare, p-values bigger than λ can occur under alternatives as well.

The density of p-values under alternatives usually has more features than is assumed above. These important features may be exploited to construct a more refined estimator of pFDR. For instance, the density of p-values under an alternative value is often decreasing, dropping from an infinite height at 0 to a very low or no height at 1, and the derivative of the density approaches zero near the point 1. These densities resemble beta(a, b) densities $\mathrm{be}(x; a, b) = x^{a-1}(1-x)^{b-1}/B(a, b)$ with a < 1 and b ≥ 1, or their mixtures, where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the beta function. A reasonable model may be proposed for this type of density, and based on the model it may be possible to estimate the pFDR function more accurately.

Tang et al. [13] modeled the p-value density under the alternative as a mixture of beta densities and thereby incorporated some of the salient features of the p-value density directly into the model. They followed a Bayesian approach by putting a Dirichlet process prior on the mixing distribution of the beta parameters. The resulting posterior is amenable to Markov chain Monte Carlo methods of computation. Tang et al. [13] showed by simulation that the resulting procedure gives more stable and accurate estimates of the pFDR function.

In this paper, we theoretically study the appropriateness of the model assumptions made in Tang et al. [13] and investigate the support of the Dirichlet mixture of beta prior. Our results provide important theoretical justification for the setup assumed in Tang et al. [13].
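To make the shape of the beta(a, b) densities with a < 1 and b ≥ 1 concrete, the following Python sketch evaluates be(x; a, b) and a finite mixture of such densities on a grid and checks that the mixture is decreasing. The function names and the particular weights and parameters are assumptions chosen for illustration; they are not taken from Tang et al. [13].

```python
import math


def beta_pdf(x, a, b):
    """be(x; a, b) = x**(a-1) * (1-x)**(b-1) / B(a, b) for 0 < x < 1."""
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_density = (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x) - log_beta_fn
    return math.exp(log_density)


def beta_mixture_pdf(x, components):
    """Finite mixture; components is a list of (weight, a, b), weights sum to 1."""
    return sum(w * beta_pdf(x, a, b) for w, a, b in components)


# Hypothetical mixture with a < 1 and b >= 1 in every component.
components = [(0.6, 0.3, 2.0), (0.4, 0.7, 5.0)]
grid = [i / 100 for i in range(1, 100)]
values = [beta_mixture_pdf(x, components) for x in grid]
is_decreasing = all(u > v for u, v in zip(values, values[1:]))
print(is_decreasing, values[0], values[-1])
```

Each component is decreasing because the derivative of its log-density, (a − 1)/x − (b − 1)/(1 − x), is negative on (0, 1) when a < 1 and b ≥ 1, and a mixture of decreasing densities is decreasing.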
Under certain conditions, we show that the posterior distribution of the pFDR function is consistent as the number of hypotheses tends to infinity.

2. Mixture model framework

2.1. Basic setup

Suppose we have observed the values of the test statistics for testing m null hypotheses $H_{0,i}$, i = 1, . . . , m, against appropriate alternatives. Let $X_1, \ldots, X_m$ stand for the p-values for the respective m tests. We assume that the tests are based on independent data, so that $X_1, \ldots, X_m$ are independent. We also assume that there is a random mechanism which independently determines whether the $H_{0,i}$'s are true or false, with probability π and 1 − π respectively. Let $H_i = I(H_{0,i} \text{ is true})$ be the indicator that the ith null hypothesis is true. Of course, the $H_i$'s are unobserved.

The distribution of $X_i$ under $H_{0,i}$ can be assumed to be the uniform distribution on [0, 1]. This happens whenever the test statistic is a continuous random variable and the null hypothesis is simple, or in situations like the t-test or F-test, where the null hypothesis has been reduced to a simple one by considerations of similarity or invariance. Under more general situations, the property can still be expected to be approximately true if, for instance, a conditional predictive p-value or a partial predictive p-value (Bayarri and Berger [1]) is used; see Robins et al. [9] for details. If the null and alternative hypotheses are one-sided and the underlying distribution has the monotone likelihood ratio (MLR) property, then the power function is increasing in the parameter, and, as a result, the null distribution of the p-value is stochastically larger than the uniform. Many estimation procedures remain valid in a conservative sense when the actual null distribution is replaced by the uniform. It is easy to show that Storey's estimators have this property. The Bayesian estimator of Tang et al. [13] also enjoys the same property; see Tang et al. [13] for discussion.
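The sampling mechanism of this setup can be sketched in Python as follows. The one-sided z-test with unit variance, the alternative mean `theta`, and the function name `simulate_pvalues` are assumptions made for the illustration; the setup above allows any test whose null p-value is uniform.

```python
import random
from statistics import NormalDist


def simulate_pvalues(m, pi, theta, seed=0):
    """Draw (H_i, X_i), i = 1..m, from the two-component setup.

    H_i = True with probability pi (null true). The test statistic is
    N(0, 1) under the null and N(theta, 1) under the alternative, and the
    one-sided p-value is X_i = 1 - Phi(T_i).
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    is_null, pvalues = [], []
    for _ in range(m):
        null = rng.random() < pi
        t = rng.gauss(0.0 if null else theta, 1.0)
        is_null.append(null)
        pvalues.append(1.0 - std_normal.cdf(t))
    return is_null, pvalues


is_null, pvalues = simulate_pvalues(m=20000, pi=0.8, theta=2.0)
null_p = [p for h, p in zip(is_null, pvalues) if h]
alt_p = [p for h, p in zip(is_null, pvalues) if not h]
# Null p-values average about 1/2; alternative p-values concentrate near 0.
print(sum(null_p) / len(null_p), sum(alt_p) / len(alt_p))
```

In practice only the p-values are observed; the indicators are returned here solely so that the two components can be inspected separately.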
Henceforth we shall assume that the null distribution of p-values is U[0, 1]. Let f(x) stand for the density of the p-value under an alternative distribution. The following result shows that under a natural condition, f(x) is decreasing.

Proposition 1. Suppose that the p-value is computed using a statistic, T, whose density, $g_\theta$, has the MLR property. Then the p-value density f(x) is decreasing.

Proof. Let $\theta_0$ stand for the value of the parameter under the null hypothesis and $\theta_1$ stand for the value under the alternative. Let $T_{\mathrm{obs}}$ stand for the observed value of T. Denote the cumulative distribution function (c.d.f.) of $g_\theta$ by $G_\theta$. Then the distribution function of the p-value under $\theta_1$ is

$$F_{\theta_1}(x) = P_{\theta_1}\bigl(P_{\theta_0}(T > T_{\mathrm{obs}}) \le x\bigr) = 1 - G_{\theta_1}\bigl(G_{\theta_0}^{-1}(1 - x)\bigr).$$

Hence the p-value density is given by

$$f_{\theta_1}(x) = \frac{g_{\theta_1}\bigl(G_{\theta_0}^{-1}(1 - x)\bigr)}{g_{\theta_0}\bigl(G_{\theta_0}^{-1}(1 - x)\bigr)} = \frac{g_{\theta_1}(z)}{g_{\theta_0}(z)}, \qquad (2.1)$$

where $z = G_{\theta_0}^{-1}(1 - x)$. By the MLR property, the expression in (2.1) is increasing in z, equivalently decreasing in x.

For standard two-sided tests like the z-test or t-test, the density of the p-value under the alternative is also decreasing. Under certain assumptions which are satisfied generally, the following result shows a two-sided analog of the previous proposition.

Proposition 2. Suppose that the p-value is computed using a statistic T whose density $g_\theta$ is symmetric under the null hypothesis $H_0 : \theta = \theta_0$. Further suppose that for the symmetrized density $\tilde g_\theta(z) = (g_\theta(z) + g_\theta(-z))/2$, the ratio $\tilde g_\theta(z)/g_{\theta_0}(z)$ is increasing in z. Then the p-value density f(x) is decreasing.

Proof. With notations as in the last proof, the distribution function of the p-value under $\theta_1$ is

$$F_{\theta_1}(x) = P_{\theta_1}\bigl(2P_{\theta_0}(T > |T_{\mathrm{obs}}|) \le x\bigr) = 1 - G_{\theta_1}\bigl(G_{\theta_0}^{-1}(1 - x/2)\bigr) + G_{\theta_1}\bigl(-G_{\theta_0}^{-1}(1 - x/2)\bigr).$$
The p-value density can be seen to be given by

$$f(x) = f_{\theta_1}(x) = \frac{g_{\theta_1}\bigl(G_{\theta_0}^{-1}(1 - x/2)\bigr)}{2 g_{\theta_0}\bigl(G_{\theta_0}^{-1}(1 - x/2)\bigr)} + \frac{g_{\theta_1}\bigl(-G_{\theta_0}^{-1}(1 - x/2)\bigr)}{2 g_{\theta_0}\bigl(-G_{\theta_0}^{-1}(1 - x/2)\bigr)} = \frac{\tilde g_{\theta_1}(z)}{g_{\theta_0}(z)}, \qquad (2.2)$$

where now $z = G_{\theta_0}^{-1}(1 - x/2)$ and the last equality uses the symmetry of $g_{\theta_0}$. Since z decreases as x increases, (2.2) is decreasing in x by the given assumption.

The p-value density for a one-sided hypothesis generally decays to zero as x tends to 1. Let L stand for the lower limit of the value of the test statistic, which is often −∞. Assume that as z → L, we have that $g_{\theta_1}(z)/g_{\theta_0}(z) \to 0$. Then it follows from (2.1) that f(x) → 0 as x → 1, since $z = G_{\theta_0}^{-1}(1 - x) \to L$ as x → 1. For a two-sided hypothesis, $\tilde g_{\theta_1}(z)/g_{\theta_0}(z)$ will not generally go to 0 as z → L, and hence the minimum value of the p-value density will be a (small) positive number. For instance, for the two-sided normal location model, the minimum value is $e^{-n\theta^2/2}$, where n is the sample size on which the test is based.

2.2. Identifiability and continuity properties

If a c.d.f. F on [0,1] can be written as F(x) = πx + (1 − π)H(x), where H(·) is another c.d.f. on [0,1], then the representation is generally not unique, so that π and H are not separately identifiable. The components π and H can be identified by imposing the additional condition that H cannot be represented as a mixture with another uniform component, which, for the case when H has a continuous density h, translates into h(1) = 0. Define the map π(F) from the space of continuous c.d.f. on [0,1] to [0,1] as the maximum possible value of π in the mixture representation F(x) = πx + (1 − π)H(x). As in all mixture problems, H is not defined when π(F) is one, that is, when F is the uniform distribution on [0,1]. When F physically stands for the p-value distribution, π(F) is an upper bound for the proportion of true null hypotheses and therefore π(F)γ/F(γ) is an upper bound for the actual pFDR.
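A small numerical example may help to see why π(F) is only an upper bound. Suppose, as an assumption made purely for illustration, that the alternative p-value distribution is beta(a, 1) with a < 1, so that h(1) = a > 0 and part of H can be reassigned to the uniform component. When F has a continuous density f, the function F(x) − πx must stay nondecreasing for (F(x) − πx)/(1 − π) to be a c.d.f., so the largest admissible π is the minimum of f; here that minimum is f(1) = π + (1 − π)a. The Python sketch below (with hypothetical function names) recovers this value on a grid and compares the resulting bound with the actual pFDR.

```python
def mixture_cdf(x, pi, a):
    """F(x) = pi*x + (1 - pi)*x**a, a uniform and beta(a, 1) mixture."""
    return pi * x + (1.0 - pi) * x ** a


def mixture_pdf(x, pi, a):
    """f(x) = pi + (1 - pi)*a*x**(a - 1)."""
    return pi + (1.0 - pi) * a * x ** (a - 1.0)


def largest_uniform_weight(pi, a, steps=1000):
    """Grid approximation of pi(F) as the minimum of the density on (0, 1]."""
    grid = [i / steps for i in range(1, steps + 1)]
    return min(mixture_pdf(x, pi, a) for x in grid)


pi, a, gamma = 0.8, 0.25, 0.05
pi_max = largest_uniform_weight(pi, a)                 # pi + (1 - pi)*a
actual_pfdr = pi * gamma / mixture_cdf(gamma, pi, a)   # uses the true pi
upper_bound = pi_max * gamma / mixture_cdf(gamma, pi, a)
print(pi_max, actual_pfdr, upper_bound)
```

With these values π(F) exceeds π by (1 − π)a, and the bound π(F)γ/F(γ) exceeds the actual pFDR by the same relative factor π(F)/π.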
Thus choosing π = π(F) is appropriate in a conservative sense: in order to control pFDR, it suffices to control the auxiliary quantity pFDR(F; γ), defined as π(F)γ/F(γ). Let $\mathcal{F}$ stand for the class of all F representable as F(x) = πx + (1 − π)H(x) for π ∈ [0, 1]. The following proposition shows an important upper-semicontinuity property of the map π(F). Let $\to_w$ stand for weak convergence of probability distributions.

Fig 1. Plots of p-value density for t-test with 3 d.f. (a) p-value density of one-sided t-test. (b) p-value density of two-sided t-test.

Proposition 3. The class $\mathcal{F}$ is weakly closed and the map F → π(F) on $\mathcal{F}$ is upper-semicontinuous, that is, if $F_n \to_w F$, then $\limsup_{n\to\infty} \pi(F_n) \le \pi(F)$. Further, for any γ, $\limsup_{n\to\infty} \mathrm{pFDR}(F_n; \gamma) \le \mathrm{pFDR}(F; \gamma)$.

Proof. Let $F_n \in \mathcal{F}$ and $F_n \to_w F$. Because $\pi_n = \pi(F_n)$ is a bounded sequence and $H_n$ in the representation $F_n(x) = \pi_n x + (1 - \pi_n)H_n(x)$ is tight, we may assume that both are convergent along a subsequence, to $\pi^*$ and $H^*$, respectively. Then $F(x) = \pi^* x + (1 - \pi^*)H^*(x)$, and hence $F \in \mathcal{F}$.

Write $\bar F = 1 - F$. Observe that for any $F \in \mathcal{F}$, $\bar F(\lambda) \ge \pi(F)(1 - \lambda)$ for all 0 ≤ λ ≤ 1 and that $\pi(F) = \inf\{\bar F(\lambda)/(1 - \lambda) : 0 < \lambda \le 1\}$. The infimum is attained because, by our choice, π(F) is the largest π in the mixture representation. Now for any fixed $\lambda_0$ which is a continuity point of F, we have that

$$\limsup_{n\to\infty} \pi(F_n) = \limsup_{n\to\infty} \inf_{\lambda} \frac{\bar F_n(\lambda)}{1 - \lambda} \le \lim_{n\to\infty} \frac{\bar F_n(\lambda_0)}{1 - \lambda_0} = \frac{\bar F(\lambda_0)}{1 - \lambda_0}.$$

Since $\lambda_0$ is arbitrary and the set of continuity points of F is dense in [0,1], the first assertion follows. The last relation clearly follows from the expression for pFDR.

Under additional restrictions, identifiability of the components π and H and continuity of π(F) may be established. For example, the following class of c.d.f. F allows π and H to be identified from F. Assume that the p-value distribution H under the alternative belongs to $\mathcal{D}$, the class of c.d.f.
on [0,1] which admit a density h, with h(1) = 0. Let $\mathcal{F}_{\mathcal{D}}$ denote the class of all c.d.f. on [0,1] of the form F(x) = πx + (1 − π)H(x) for π ∈ (0, 1) and $H \in \mathcal{D}$. Let $f_{\pi,h} = \pi + (1 - \pi)h$ be the corresponding mixture density.

Proposition 4. If $f_{\pi,h} = f_{\pi^*,h^*}$, then $\pi = \pi^*$ and $h = h^*$.

Proof. $f_{\pi,h} = f_{\pi^*,h^*}$ implies $\pi + (1 - \pi)h(x) = \pi^* + (1 - \pi^*)h^*(x)$ for all x. Putting x = 1 and using the fact that $h(1) = h^*(1) = 0$, we have $\pi = \pi^*$. This now implies $h = h^*$ or $H = H^*$.

To study consistency, we need to show that π and h can be continuously solved from f. However, the class $\mathcal{F}_{\mathcal{D}}$ is not weakly closed. We need to impose a restriction on the class of alternative densities so that the tail at 1 remains thin even in the weak limit. Let $\mathcal{B}$ denote a class of c.d.f. on [0,1] that is weakly closed and such that for all $H \in \mathcal{B}$ we have $\lim_{y\to 0} y^{-1}\bar H(1 - y) = 0$. The interval (1 − y, 1] is open in [0, 1]. Hence, by the portmanteau theorem, $H_n \to_w H$ implies that $\bar H(1 - y) \le \liminf_{n\to\infty} \bar H_n(1 - y)$. Thus for the weak limit H of a sequence $H_n \in \mathcal{B}$ to be in $\mathcal{B}$, one needs to be able to interchange the order of the limits with respect to y and n. For instance, if $\mathcal{B} = \{H : \bar H(1 - x) \le \psi(x) \text{ for all } x < \delta\}$, where δ > 0 is a fixed number and ψ is a fixed function which satisfies ψ(x) = o(x) as x → 0 (like $Cx^{1+\epsilon}$), then the class $\mathcal{B}$ satisfies the requirement.

Let $\mathcal{F}_{\mathcal{B}}$ denote the class of c.d.f. on [0,1] representable as F(x) = πx + (1 − π)H(x) for π ∈ (0, 1) and $H \in \mathcal{B}$. Note that $\mathcal{F}_{\mathcal{B}}$ need not be a subset of $\mathcal{F}_{\mathcal{D}}$, as the c.d.f. in $\mathcal{B}$ need not have a density.

Proposition 5. Identifiability in Proposition 4 holds if $F \in \mathcal{F}_{\mathcal{B}}$.

Proof. If $\pi x + (1 - \pi)H(x) = \pi^* x + (1 - \pi^*)H^*(x)$ for all x, then $\pi(1 - x) + (1 - \pi)\bar H(x) = \pi^*(1 - x) + (1 - \pi^*)\bar H^*(x)$. Dividing both sides by 1 − x and letting x → 1, we obtain $\pi = \pi^*$ and hence $H = H^*$.

Proposition 6. The map $(\pi, H) \to F_{\pi,H}$ is a homeomorphism from $(0, 1) \times \mathcal{B}$ to $\mathcal{F}_{\mathcal{B}}$, where $\mathcal{B}$ and $\mathcal{F}_{\mathcal{B}}$ are equipped with the weak topology.

Proof.
(Forward side) If $\pi_n \to \pi$ and $H_n \to_w H$, then $H_n(x) \to H(x)$ at all continuity points x, giving $\pi_n x + (1 - \pi_n)H_n(x) \to \pi x + (1 - \pi)H(x)$.

(Reverse side) Let $F_{\pi_n,H_n} \to_w F_{\pi,H}$. We need to show that $\pi_n \to \pi$ and $H_n \to_w H$. Fix any subsequence $\{n'\}$. It is enough to extract a further subsequence $\{n''\}$ along which $\pi_{n''} \to \pi$ and $H_{n''} \to_w H$. Because $\pi_{n'}$ is bounded and $H_{n'}$ is tight, we can extract a further subsequence $\{n''\}$ such that $\pi_{n''} \to \pi^*$ and $H_{n''} \to_w H^*$ for some $\pi^*$ and $H^*$. By the closedness of $\mathcal{B}$ under the weak topology, $H^* \in \mathcal{B}$ (note that (1 − x, 1] is an open subset of [0, 1]). By the forward side, $F_{\pi_{n''},H_{n''}} \to_w F_{\pi^*,H^*}$. Thus $F_{\pi^*,H^*} = F_{\pi,H}$. By identifiability in the class $\mathcal{F}_{\mathcal{B}}$, $\pi^* = \pi$ and $H^* = H$, and hence $\pi_n \to \pi$ and $H_n \to_w H$. This completes the proof.

2.3. Mixtures of beta densities

The shape of p-value densities under alternatives has similarities with the beta density $\mathrm{be}(x; a, b) = x^{a-1}(1 - x)^{b-1}/B(a, b)$, 0 < x < 1, for a < 1 and b ≥ 1. Indeed, for the exponential model $\lambda e^{-\lambda z}$, z > 0, with parameter λ and hypotheses $H_0 : \lambda = \lambda_0$ against $H : \lambda > \lambda_0$, it follows from elementary calculations that the p-value density is exactly beta(a, 1) for some a < 1. Mixtures of beta(a, b) with a < 1 and b ≥ 1 make up a considerably large class still preserving the shape of the p-value density, and hence can be considered as a model for p-value densities under the alternative. The following result shows that many similar-shaped densities can be pointwise represented as a mixture of beta(a, 1), a much narrower class. Recall that a function ϕ on [0, ∞) is called completely monotone if it has derivatives $\phi^{(n)}$ of all orders and $(-1)^n \phi^{(n)}(z) \ge 0$ for all z ≥ 0 and n = 1, 2, . . . .

Proposition 7. If a density h(x) on (0, 1) with c.d.f. H can be represented as $h(x) = \int_0^1 a x^{a-1}\,dG(a)$ for all 0 < x < 1, then $H(e^{-y})$ is a completely monotone function of y on [0, ∞).
Conversely, if h(x) is decreasing and $H(e^{-y})$ is completely monotone, then $h(x) = \int_0^\infty a x^{a-1}\,dG(a)$ for some probability measure G on (0, ∞) with $\int a^2\,dG(a) \le \int a\,dG(a)$.

Proof. If h(x) is a mixture of be(a, 1), we have that

$$H(x) = \int_0^1 x^a\,dG(a) = \int_0^1 e^{-a \log x^{-1}}\,dG(a).$$

Thus H(x) is the Laplace transform of G at the point $\log x^{-1}$. Put $y = \log x^{-1}$ so that $x = e^{-y}$ and $H(e^{-y}) = \int_0^1 e^{-ay}\,dG(a)$, the Laplace transform of the probability measure G. Hence it is completely monotone by Theorem 1 of Section XIII.4 of Feller (1971) [4]. To prove the converse, applying the same theorem and using the fact that H(1) = 1, we obtain the representation $H(e^{-y}) = \int_0^\infty e^{-ay}\,dG(a)$ for some probability measure G on (0, ∞). Thus $H(x) = \int_0^\infty x^a\,dG(a)$, and so $h(x) = \int_0^\infty a x^{a-1}\,dG(a)$. Now, as h is decreasing, $0 \ge h'(x) = \int a(a - 1)x^{a-2}\,dG(a)$. The result now follows by letting x → 1.

Observe that $\int a^2\,dG(a) \le \int a\,dG(a)$ holds if G is concentrated on (0, 1], but it is not necessary.

Remark 1. By a similar argument, if a density h(x) on (0, 1) can be represented as $h(x) = \int_1^\infty b(1 - x)^{b-1}\,dG(b)$ for all 0 < x < 1, then the function $\bar H(1 - e^{-y})$ is completely monotone as a function of y, where $\bar H(x) = 1 - H(x)$. Conversely, if $\bar H(1 - e^{-y})$ is completely monotone in y and h(x) is decreasing, then $h(x) = \int_0^\infty b(1 - x)^{b-1}\,dG(b)$ for some probability measure G on (0, ∞) with $\int b\,dG(b) \le \int b^2\,dG(b)$.

Proposition 8. Let $\mathcal{H}_1$ stand for the class of decreasing densities h such that $H(e^{-y})$ is completely monotone and $\mathcal{H}_2$ stand for the class of decreasing densities h such that $\bar H(1 - e^{-y})$ is completely monotone. A density h(x) on (0, 1) can be represented as a mixture of be(a, b) if h(x) is a convex combination of densities of the form $c\,h_1(x)h_2(x)$ where $h_1 \in \mathcal{H}_1$ and $h_2 \in \mathcal{H}_2$.

Proof. Clearly it suffices to assume that $h(x) = c\,h_1(x)h_2(x)$, where $h_1(x) = \int_0^\infty a x^{a-1}\,dG_1(a)$, $h_2(x) = \int_0^\infty b(1 - x)^{b-1}\,dG_2(b)$ and

$$c^{-1} = \int_0^\infty\!\!\int_0^\infty ab\,B(a, b)\,dG_1(a)\,dG_2(b).$$

Now defining $dG(a, b) = c\,ab\,B(a, b)\,dG_1(a)\,dG_2(b)$, we may write $h(x) = \int \mathrm{be}(x; a, b)\,dG(a, b)$. The total mass of G is given by

$$\int_0^\infty\!\!\int_0^\infty c\,ab\,B(a, b)\,dG_1(a)\,dG_2(b) = c\,c^{-1} = 1,$$

so that G is also a probability measure. This completes the proof.

2.4. Dirichlet mixture prior

Tang et al. [13] proposed a Dirichlet process prior (Ferguson [5]) for the mixing distribution G. The parameters of a Dirichlet process $\mathrm{DP}(G_0, \tau)$ are the center measure $G_0 = E(G)$ and the precision parameter τ > 0. The center measure $G_0$ is the subjective guess about G, while τ controls the concentration of $\mathrm{DP}(G_0, \tau)$ around $G_0$. The equivalent hierarchical representation in terms of latent variables $(a_i, b_i)$,

$$X_i \mid a_i, b_i \sim \pi + (1 - \pi)\,\mathrm{be}(x_i \mid a_i, b_i), \qquad (a_i, b_i) \mid G \overset{\text{i.i.d.}}{\sim} G, \qquad G \sim \mathrm{DP}(G_0, \tau),$$

is extremely useful in developing the relevant MCMC algorithms for the computation of the posterior. Tang et al. [13] used the reparameterization $a = \exp(-|L_a|)$ and $b = \exp(|L_b|)$, and specified $G_0(a, b) = N(L_a \mid 0, \sigma_a^2)\,N(L_b \mid 0, \sigma_b^2)$. Actually, any base measure with full support on (0, 1) × (1, ∞) will lead to a Dirichlet process with large support.

3. Asymptotic properties of posterior

Consider a prior Π for H and independently a prior μ for π with full support on [0, 1]. Let the true values of π and h be, respectively, $\pi_0$ and $h_0$, where $0 < \pi_0 < 1$.

Theorem 1 (General consistency). If $h_0$ belongs to the $L_1$-support of Π in the sense that $\Pi(\|h - h_0\|_1 < \epsilon) > 0$ for all ε > 0, then for every ε > 0,

$$\Pr\bigl(\sup\{|F(x) - F_0(x)| : 0 \le x \le 1\} < \epsilon \mid X_1, \ldots, X_m\bigr) \to 1 \quad \text{a.s.}$$

Proof. For any sequence $F_n$ such that $F_n(x) \to F_0(x)$ for all x, continuity of $F_0$ and Polya's theorem imply that $\sup_x |F_n(x) - F_0(x)| \to 0$. Thus given ε > 0, we can find a weak neighborhood $\mathcal{W}$ of $F_0$ such that $F \in \mathcal{W}$ implies $\sup_x |F(x) - F_0(x)| < \epsilon$. Thus it suffices to prove that for any weak neighborhood $\mathcal{W}$ of $F_0$, $\Pr\{|\pi - \pi_0| < \epsilon,\ F \in \mathcal{W} \mid X_1, \ldots, X_m\} \to 1$ a.s. as m → ∞.
By Schwartz's theorem for weak consistency (see Theorem 4.4.2 of Ghosh and Ramamoorthi [8]), it suffices to show that for every ε > 0,

$$(\mu \times \Pi)\Bigl\{(\pi, h) : \int f_{\pi_0,h_0} \log\frac{f_{\pi_0,h_0}}{f_{\pi,h}} < \epsilon\Bigr\} > 0.$$

Now $f_{\pi,h} \ge \pi$, so $f_{\pi_0,h_0}/f_{\pi,h} \le \pi^{-1} f_{\pi_0,h_0}$, which is integrable, and the integral is bounded by a constant when π lies in a neighborhood of $\pi_0$. So by Lemma 7 of Ghosal and van der Vaart [7] or Theorem 5 of Wong and Shen [15],

$$\int f_{\pi_0,h_0} \log\frac{f_{\pi_0,h_0}}{f_{\pi,h}} \le A\,d_H^2(f_{\pi_0,h_0}, f_{\pi,h}) \log_+ \frac{1}{d_H^2(f_{\pi_0,h_0}, f_{\pi,h})},$$

where $d_H$ stands for the Hellinger distance. Also, as $d_H^2(f, g) \le \|f - g\|_1$, it suffices to show that $L_1$-neighborhoods of $f_{\pi_0,h_0}$ get positive probabilities under μ × Π. Now,

$$\int_0^1 \bigl|[\pi + (1 - \pi)h(x)] - [\pi_0 + (1 - \pi_0)h_0(x)]\bigr|\,dx \le |\pi - \pi_0| + \int_0^1 |(1 - \pi) - (1 - \pi_0)|\,h(x)\,dx + (1 - \pi_0)\int_0^1 |h(x) - h_0(x)|\,dx \le 2|\pi - \pi_0| + \|h - h_0\|_1.$$

Since μ gives positive probabilities to neighborhoods of $\pi_0$ and Π gives positive probabilities to $L_1$-neighborhoods of $h_0$, the condition of prior positivity holds.

In view of Proposition 3, the following "upper semi-consistency" (a form of one-sided consistency) may be concluded.

Corollary 1. Under the conditions of Theorem 1, we have that for any ε > 0, $\Pr(\pi < \pi_0 + \epsilon \mid X_1, \ldots, X_m) \to 1$ a.s. and that the posterior mean $\hat\pi_m$ satisfies $\limsup_{m\to\infty} \hat\pi_m \le \pi_0$ a.s.

Unfortunately, the above corollary has limited significance since typically one would not like to underestimate the true π (and the pFDR) while overestimation is less serious. In order to ensure that the convergence takes place, we need to enforce an additional restriction on the support of the prior to ensure continuity of π(F) with respect to the weak topology on the restricted space.

Corollary 2. Assume that Π is supported in $\mathcal{B} \cap \mathcal{D}$ and that $h_0$ belongs to the $L_1$-support of Π. Then for any ε > 0, $\Pr(|\pi - \pi_0| < \epsilon \mid X_1, \ldots, X_m) \to 1$ a.s. and $\hat\pi_m \to \pi_0$ a.s. Further, for any 0 < α < 1 and ε > 0,

$$\Pr\Bigl(\Bigl|\frac{\pi\alpha}{F(\alpha)} - \frac{\pi_0\alpha}{F_0(\alpha)}\Bigr| < \epsilon \Bigm| X_1, \ldots, X_m\Bigr) \to 1 \quad \text{a.s.},$$

and the above convergence is uniform for α lying in compact subsets of (0, 1].

Proof. The proof of the first assertion follows from Theorem 1 and Proposition 6. The second assertion follows from the first because $\pi_n \to \pi_0$ and $F_n(\alpha) \to F_0(\alpha)$ imply that $\pi_n\alpha/F_n(\alpha) \to \pi_0\alpha/F_0(\alpha)$ whenever $0 < F_0(\alpha) < 1$, and this holds whenever 0 < α < 1. In fact, the convergence is uniform over compact subsets of (0, 1], because $F_0(\alpha)$ remains uniformly bounded below there.

Now we consider a concrete prior obtained from a Dirichlet mixture of betas. Let $h(x) = \int \mathrm{be}(x; a, b)\,dG(a, b)$, where $G \sim \mathrm{DP}(G_0, \tau)$ and $G_0$ is a probability measure on (0, 1) × (1 + ε, ∞) with full support. The lower bound b ≥ 1 + ε ensures that

$$\bar H(1 - x) = \int_{1-x}^{1} \frac{1}{B(a, b)}\,y^{a-1}(1 - y)^{b-1}\,dy \le \int_{1-x}^{1} b(1 - y)^{b-1}\,dy = x^b \le x^{1+\epsilon},$$

since be(a, b) is stochastically dominated by be(1, b) (by the MLR property of the beta distribution) and taking mixtures preserves bounds for the probability of a given set. This ensures that any H in the support of the prior lies in $\mathcal{B}$. This leads to the following consistency result for a Dirichlet mixture of beta prior.

Theorem 2 (Full $L_1$-support of beta mixture prior). For any true $h_0 \in \mathcal{B} \cap \mathcal{D}$ lying in the $L_1$-closure of the above beta mixtures, consistency of pFDR holds for the Dirichlet mixture of beta prior if the center measure $G_0$ has support [0, 1] × [1 + ε, ∞).

Proof. First let $h_0(x) = h_{Q_0}(x) = \int \mathrm{be}(x; a, b)\,dQ_0(a, b)$. Given ε > 0, find η > 0 and M < ∞ such that $Q_0\{a < \eta \text{ or } b > M\} < \epsilon$. Let $Q_0^*$ be $Q_0$ restricted and re-normalized to [η, 1] × [1, M]. Then by Lemma A.3 of Ghosal and van der Vaart (2001) [6], it follows that $\|h_{Q_0} - h_{Q_0^*}\|_1 < 2\epsilon$. Thus it suffices to assume that $Q_0$ is supported over [η, 1] × [1 + ε, M] for some η > 0 and M < ∞. Now if $Q_n$ is a sequence converging weakly to $Q_0$, we may also assume that $Q_n\{a < \eta \text{ or } b > M\} < \epsilon$ for all n, so that $\|h_{Q_n} - h_{Q_n^*}\|_1 < 2\epsilon$ and $Q_n^*$ converges weakly to $Q_0$.
For any 0 < x < 1, the beta kernel is a bounded continuous function on [η, 1) × (1, M], and hence $h_{Q_n^*}(x) \to h_{Q_0}(x)$. Scheffe's theorem then implies that $\|h_{Q_n^*} - h_{Q_0}\|_1 \to 0$. Thus, given any ε > 0, if Q lies in a sufficiently small weak neighborhood of $Q_0$, then $\|h_Q - h_{Q_0}\|_1 < \epsilon$. As the center measure $G_0$ has support [0, 1] × [1 + ε, ∞), the corresponding Dirichlet process has full weak support. Thus $h_0$ belongs to the $L_1$-support of the prior, and hence consistency holds by Corollary 2.

Now more generally, if $h_0$ can be approximated by beta mixtures in the $L_1$-sense, then $h_0$ also lies in the $L_1$-support, as the support is a closed set. Hence consistency is obtained.

Remark 2. Proposition 8 gives a sufficient condition for $h_0$ to be in the $L_1$-closure of beta mixtures.

Remark 3. By Fubini's theorem, the result continues to hold even if τ is given a prior and $G_0$ contains hyperparameters.

4. Conclusion

A mixture of beta densities be(a, b) with a < 1 and b > 1 forms a rich class of densities with shapes like a reflected J. It is shown that, under various natural scenarios, such densities are appropriate for modeling the density of p-values arising from alternative hypotheses. We have also shown that if, for a c.d.f. H, $H(e^{-y})$ is a completely monotone function of y, then the corresponding density h is representable exactly as a mixture of the above mentioned beta densities. The mixture model is especially useful for Bayesian inference, where priors can be induced upon the mixture densities through a Dirichlet process prior on the mixing distribution. When hypotheses are randomly assigned as null or alternative with a specific probability, the p-value distribution is a mixture of a uniform component and a mixture of beta densities of the type mentioned above.
By applying the general theory of posterior consistency for density estimation, we have shown that the posterior distribution for estimating the density of p-values is consistent at the true density if it is of the given form and the prior on the mixing distribution has every distribution in its weak support. Under some further conditions which essentially separate mixtures of beta densities from the uniform, it follows that posterior consistency for density estimation leads to consistency in estimating positive false discovery rates for multiple hypotheses testing. This property gives asymptotic justification of a recently proposed Bayesian method of estimating positive false discovery rates by the same set of authors.

References

[1] Bayarri, M. J. and Berger, J. O. (2000). p-values for composite null models. J. Amer. Statist. Assoc. 95 1127–1142.
[2] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289–300.
[3] Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genetic Epidemiology 23 70–86.
[4] Feller, W. (1971). An Introduction to Probability Theory and Its Applications. II. Wiley, New York.
[5] Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1 209–230.
[6] Ghosal, S. and van der Vaart, A. W. (2001). Entropies and rates of convergence for maximum likelihood and Bayes estimation for mixtures of normal densities. Ann. Statist. 29 1233–1263.
[7] Ghosal, S. and van der Vaart, A. W. (2007). Posterior convergence rates of Dirichlet mixtures at smooth densities. Ann. Statist. 35 697–723.
[8] Ghosh, J. K. and Ramamoorthi, R. V. (2003). Bayesian Nonparametrics. Springer, New York.
[9] Robins, J. M., van der Vaart, A. W. and Ventura, V. (2000). Asymptotic distribution of p-values in composite null models. J. Amer. Statist. Assoc. 95 1143–1167.
[10] Sarkar, S. K. (2002). Some results on false discovery rate in multiple testing procedures. Ann. Statist. 30 239–257.
[11] Storey, J. D. (2002). A direct approach to false discovery rates. J. Roy. Statist. Soc. Ser. B 64 479–498.
[12] Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Ann. Statist. 31 2013–2035.
[13] Tang, Y., Ghosal, S. and Roy, A. (2007). Nonparametric Bayesian estimation of positive false discovery rates. Biometrics 63 1126–1134.
[14] Tsai, C., Hsueh, H. and Chen, J. (2003). Estimation of false discovery rates in multiple testing: Application to gene microarray data. Biometrics 59 1071–1081.
[15] Wong, W. H. and Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieved MLEs. Ann. Statist. 23 339–362.