Improving and extending the testing of distributions for shape-restricted properties Eldar Fischer∗ Oded Lachish † Yadu Vasudev ‡ March 6, 2017 Abstract Distribution testing deals with what information can be deduced about an unknown distribution over {1, . . . , n}, where the algorithm is only allowed to obtain a relatively small number of independent samples from the distribution. In the extended conditional sampling model, the algorithm is also allowed to obtain samples from the restriction of the original distribution on subsets of {1, . . . , n}. In 2015, Canonne, Diakonikolas, Gouleakis and Rubinfeld unified several previous results, and showed that for any property of distributions satisfying a “decomposability” criterion, there exists an algorithm (in the basic model) that can distinguish with high probability distributions satisfying the property from distributions that are far from it in the variation distance. We present here a more efficient yet simpler algorithm for the basic model, as well as very efficient algorithms for the conditional model, which until now was not investigated under the umbrella of decomposable properties. Additionally, we provide an algorithm for the conditional model that handles a much larger class of properties. Our core mechanism is a way of efficiently producing an interval-partition of {1, . . . , n} that satisfies a “fine-grain” quality. We show that with such a partition at hand we can directly move forward with testing individual intervals, instead of first searching for the “correct” partition of {1, . . . , n}. 1 Introduction 1.1 Historical background In most computational problems that arise from modeling real-world situations, we are required to analyze large amounts of data to decide if it has a fixed property. The amount of data involved is usually too large for reading it in its entirety, both with respect to time and storage. In such situations, it is natural to ask for algorithms that can sample points from the data and obtain a significant estimate for the property of interest. The area of property testing addresses this issue by studying algorithms that look at a small part of the data and then decide if the object that generated the data has the property or is far (according to some metric) from having the property. There has been a long line of research, especially in statistics, where the underlying object from which we obtain the data is modeled as a probability distribution. Here the algorithm is only ∗ Faculty of Computer Science, Israel Institute of Technology (Technion), Haifa, Israel. [email protected] Birkbeck, University of London, London, UK. [email protected] ‡ Faculty of Computer Science, Israel Institute of Technology (Technion), Haifa, Israel. [email protected] † 1 allowed to ask for independent samples from the distribution, and has to base its decision on them. If the support of the underlying probability distribution is large, it is not practical to approximate the entire distribution. Thus, it is natural to study this problem in the context of property testing. The specific sub-area of property testing that is dedicated to the study of distributions is called distribution testing. There, the input is a probability distribution (in this paper the domain is the set {1, 2, . . . , n}) and the objective is to distinguish whether the distribution has a certain property, such as uniformity or monotonicity, or is far in `1 distance from it. See [Can15] for a survey about the realm of distribution testing. 
Testing properties of distributions was studied by Batu et al in [BFR+ 00], where they gave a sublinear query algorithm for testing closeness of distributions supported over the set {1, 2, . . . , n}. They extended the idea of collision counting, which was implicitly used for uniformity testing in the work of Goldreich and Ron ([GR00]). Consequently, various properties of probability distributions were studied, like testing identity with a known distribution ([BFF+ 01, VV14, ADK15, DK16]), testing independence of a distribution over a product space ([BFF+ 01, ADK15]), and testing k-wise independence ([AAK+ 07]). In recent years, distribution testing has been extended beyond the classical model. A new model called the conditional sampling model was introduced. It first appeared independently in [CRS15] and [CFGM13]. In the conditional sampling model, the algorithm queries the input distribution µ with a set S ⊆ {1, 2, . . . , n}, and gets an index sampled according to µ conditioned on the set S. Notice that if S = {1, 2, . . . , n}, then this is exactly like in the standard model. The conditional sampling model allows adaptive querying of µ, since we can choose the set S based on the indexes sampled until now. Chakraborty et al ([CFGM13]) and Canonne et al ([CRS15]) showed that testing uniformity can be done with a number of queries not depending on n (the latter presenting an optimal test), and investigated the testing of other properties of distributions. In [CFGM13], it is also shown that uniformity can be tested with poly(log n) conditional samples by a non-adaptive algorithm. In this work, we study the distribution testing problem in the standard (unconditional) sampling model, as well as in the conditional model. A line of work which is central to our paper, is the testing of distributions for structure. The objective is to test whether a given distribution has some structural properties like being monotone ([BKR04]), being a k-histogram ([ILR12, DK16]), or being log-concave ([ADK15]). Canonne et al ([CDGR15]) unified these results to show that if a property of distributions has certain structural characteristics, then membership in the property can be tested efficiently using samples from the distribution. More precisely, they introduced the notion of L-decomposable distributions as a way to unify various algorithms for testing distributions for structure. Informally, an L-decomposable distribution µ supported over {1, 2, . . . , n} is one that has an interval partition I of {1, 2, . . . , n} of size bounded by L, such that for every interval I, either the weight of µ on it is small or the distribution over the interval is close to uniform. A property C of distributions is L-decomposable if every distribution µ ∈ C is L-decomposable (L is allowed to depend on n). This generalizes various properties of distributions like being monotone, unimodal, log-concave etc. In this setting, their result for a set of distributions C supported over {1, 2, . . . , n} translates to the following: if every distribution µ from C is L-decomposable, then there is an efficient algorithm for testing whether a given distribution belongs to the property C. To achieve their results, Canonne et al ([CDGR15]) show that if a distribution µ supported over [n] is L-decomposable, then it is O(L log n)-decomposable where the intervals are of the form [j2i + 1, (j + 1)2i ]. 
This presents a natural approach of computing the interval partition in a 2 recursive manner, by bisecting an interval if it has a large probability weight and is not close to uniform. Once they get an interval partition, they learn the “flattening” of the distribution over this partition, and check if this distribution is close to the property C. The term “flattening” refers to the distribution resulting from making µ conditioned on any interval of the partition to be uniform. When applied to a partition corresponding to a decomposition of the distribution, the learned flattening is also close to the original distribution. Because of this, in the case where there is a promise that µ is L-decomposable, the above can be viewed as a learning algorithm, where they obtain an explicit distribution that is close to µ. Without the promise it can be viewed as an agnostic learning algorithm. For further elaboration of this connection see [Dia16]. 1.2 Results and techniques In this paper, we extend the body of knowledge about testing L-decomposable properties. We improve upon the previously known bound on the sample complexity, and give much better bounds when conditional samples are allowed. Additionally, for the conditional model, we provide a test for a broader family of properties, that we call atlas-characterizable properties. Our approach differs from that of [CDGR15] in the manner in which we compute the interval partition. We show that a partition where most intervals that are not singletons have small probability weight is sufficient to learn the distribution µ, even though it is not the original Ldecomposition of µ. We show that if a distribution µ is L-decomposable, then the “flattening” of µ with respect to a this partition is close to µ. It turns out that such a partition can be obtained in “one shot” without resorting to a recursive search procedure. We obtain a partition as above using a method of partition pulling that we develop here. Informally, a pulled partition is obtained by sampling indexes from µ, and taking the partition induced by the samples in the following way: each sampled index is a singleton interval, and the rest of the partition is composed of the maximal intervals between sampled indexes. Apart from the obvious simplicity of this procedure, it also has the advantage of providing a partition with a significantly smaller number of intervals, linear in L for a fixed , and with no dependency on n unless L itself depends on it. This makes our algorithm more efficient in query complexity than the one of [CDGR15] in the unconditional sampling model, and leads to a dramatically small sampling complexity in the (adaptive) conditional model. Another feature of the partition pulling method is that it provides a partition with small weight intervals also when the distribution is not L-decomposable. This allows us to use the partition in a different manner later on, in the algorithm for testing atlas characterizable properties using conditional samples. The main common ground between our approach for L-decomposable properties and that of [CDGR15] is the method of testing by implicit learning, as defined formally in [DLM+ 07] (see [Ser10]). In particular, the results also provide a means to learn a distribution close to µ if µ satisfies the tested property. We also provide a test under the conditional query model for the extended class of atlas characterizable properties that we define below, which generalizes both decomposable properties and symmetric properties. 
A learning algorithm for this class is not provided; only an “atlas” of the input distribution, rather than the distribution itself, is learned. Our result for unconditional testing (Theorem 7.4) gives a √n·L/poly(ε) query algorithm in the standard (unconditional) sampling model for testing an L-decomposable property of distributions. Our method of finding a good partition for µ using pulled partitions, that we explained above, avoids the log n factor present in Theorem 3.3 of [CDGR15]. We also avoid the additional O(L²) additive term present there. The same method enables us to extend our results to the conditional query model, which we present for both adaptive and non-adaptive algorithms. Table 1 summarizes our results and compares them with known lower bounds (see the note below the table).

Table 1: Summary of our results

L-decomposable, unconditional — result (testing and learning): √n·L/poly(ε); known lower bound: Ω(√n/ε²) for L = 1 [Pan08]
L-decomposable, adaptive conditional — result (testing and learning): L/poly(ε); known lower bound: Ω(L) for some fixed ε [CFGM13]
L-decomposable, non-adaptive conditional — result (testing and learning): L·poly(log n, 1/ε); known lower bound: Ω(log n) for L = 1 and some fixed ε [ACK15]
k-characterized by atlases, adaptive conditional — result (testing): k·poly(log n, 1/ε); known lower bound: Ω(√(log log n)) for k = 1 and some fixed ε [CFGM13]

Note: The lower bounds for unconditional and non-adaptive conditional testing of L-decomposable properties with L = 1 are exactly the lower bounds for uniformity testing; the lower bound for adaptive conditional testing follows easily from the proved existence of properties that have no sub-linear complexity adaptive conditional tests; finally, the lower bound for properties k-characterized by atlases with k = 1 is just a bound for a symmetric property constructed there. About the last one, we conjecture that there exist properties with much higher lower bounds.

2 Preliminaries

We denote the set {1, . . . , n} by [n]. We study the problem of testing properties of probability distributions supported over [n], when given samples from the distribution. For two distributions µ and χ, we say that µ is ε-far from χ if they are far in the ℓ1 norm, that is, d(µ, χ) = Σ_{i∈[n]} |µ(i) − χ(i)| > ε. For a property of distributions C, we say that µ is ε-far from C if for all χ ∈ C, d(µ, χ) > ε. Outside the ℓ1 norm between distributions, we also use the ℓ∞ norm, ‖µ − χ‖∞ = max_{i∈[n]} |µ(i) − χ(i)|, and the following measure for uniformity.

Definition 2.1. For a distribution µ over a domain I, we define the bias of µ to be bias(µ) = max_{i∈I} µ(i)/min_{i∈I} µ(i) − 1.

The following observation is easy and will be used implicitly throughout.

Observation 2.2. For any two distributions µ and χ over a domain I of size m, d(µ, χ) ≤ m‖µ − χ‖∞. Also, ‖µ − U_I‖∞ ≤ (1/m)·bias(µ), where U_I denotes the uniform distribution over I.

Proof. Follows from the definitions.

We study the problem both in the standard model, where the algorithm is given indexes sampled from the distribution, and in the model of conditional samples. The conditional model was first studied in the independent works of Chakraborty et al ([CFGM13]) and Canonne et al ([CRS15]). We first give the definition of a conditional oracle for a distribution µ.

Definition 2.3 (Conditional oracle). A conditional oracle to a distribution µ supported over [n] is a black-box that takes as input a set A ⊆ [n], samples a point i ∈ A with probability µ(i)/Σ_{j∈A} µ(j), and returns i. If µ(j) = 0 for all j ∈ A, then it chooses i ∈ A uniformly at random.

Remark. The behaviour of the conditional oracle on sets A with µ(A) = 0 is as per the model of Chakraborty et al [CFGM13].
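To make Definition 2.3 and the remark above concrete, here is a minimal Python sketch of a conditional oracle for a distribution given explicitly as a probability vector. The class name and interface are ours and purely illustrative (a tester only interacts with the oracle through its samples and never sees µ directly); the zero-weight case follows the convention of Chakraborty et al [CFGM13] just described.

```python
import random

class ConditionalOracle:
    """Simulates conditional sampling access to a distribution mu over [n].

    mu is a list of n probabilities summing to 1; the 0-based indices of the
    list stand in for the paper's domain {1, ..., n}.  Illustration only:
    a real tester never reads mu directly, it only calls sample().
    """

    def __init__(self, mu):
        self.mu = mu

    def sample(self, A=None):
        """Return an element of A drawn with probability mu(i) / mu(A).

        With A = None this is an ordinary unconditional sample (A = [n]).
        If mu(A) = 0, an element of A is returned uniformly at random,
        following the model of Chakraborty et al [CFGM13].
        """
        if A is None:
            A = range(len(self.mu))
        A = list(A)
        weights = [self.mu[i] for i in A]
        if sum(weights) == 0:
            return random.choice(A)
        return random.choices(A, weights=weights, k=1)[0]
```

An adaptive tester (Definition 2.4 below) may choose each new query set based on the elements returned so far, while a non-adaptive tester (Definition 2.5 below) must fix all of its query sets before seeing any samples.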
However, upper bounds in this model also hold in the model of Canonne et al [CRS15], and most lower bounds can be easily converted to it. Now we define conditional distribution testing algorithms. We will define and analyze both adaptive and non-adaptive conditional testing algorithms. Definition 2.4. An adaptive conditional distribution testing algorithm for a property of distributions C, with parameters , δ > 0, and n ∈ N, with query complexity q(, δ, n), is a randomized algorithm with access to a conditional oracle of a distribution µ with the following properties: • For each i ∈ [q], at the ith phase, the algorithm generates a set Ai ⊆ [n], based on j1 , j2 , · · · , ji−1 and its internal coin tosses, and calls the conditional oracle with Ai to receive an element ji , drawn independently of j1 , j2 , · · · , ji−1 . • Based on the received elements j1 , j2 , · · · , jq and its internal coin tosses, the algorithm accepts or rejects the distribution µ. If µ ∈ C, then the algorithm accepts with probability at least 1 − δ, and if µ is -far from C, then the algorithm rejects with probability at least 1 − δ. Definition 2.5. A non-adaptive conditional distribution testing algorithm for a property of distributions C, with parameters , δ > 0, and n ∈ N, with query complexity q(, δ, n), is a randomized algorithm with access to a conditional oracle of a distribution µ with the following properties: • The algorithm chooses sets A1 , . . . , Aq (not necessarily distinct) based on its internal coin tosses, and then queries the conditional oracle to respectively obtain j1 , . . . , jq . • Based on the received elements j1 , . . . , jq and its internal coin tosses, the algorithm accepts or rejects the distribution µ. If µ ∈ C, then the algorithm accepts with probability at least 1 − δ, and if µ is -far from C, then the algorithm rejects with probability at least 1 − δ. 2.1 Large deviation bounds The following large deviation bounds will be used in the analysis of our algorithms through the rest of the paper. Lemma 2.6 (Chernoff P bounds). Let X1 , X2 , . . . , Xm be independent random variables taking values in {0, 1}. Let X = i∈[m] Xi . Then for any δ ∈ (0, 1], we have the following bounds. 2 −δ E[X] 1. Pr[X > (1 + δ)E[X]] ≤ exp . 3 2. Pr[X < (1 − δ)E[X]] ≤ exp −δ 2 E[X] 2 . 5 When δ ≥ 1, we have Pr[X ≥ (1 + δ)E[X]] < exp( −δE[X] ). 3 Lemma 2.7 (Hoeffding bounds [Hoe63]). Let X1 , X2 , P . . . , Xm be independent random variables 1 such that 0 ≤ Xi ≤ 1 for i ∈ [m], and let X = m X . Then Pr |X − E[X]| > ≤ i i∈[m] 2 exp −2m2 2.2 Basic distribution procedures The following is a folklore result about learning any distribution supported over [n], that we prove here for completeness. Lemma 2.8 (Folklore). Let µ be a distribution supported over [n]. Using n+log(2/δ) unconditional 22 samples from µ, we can obtain an explicit distribution µ0 supported on [n] such that, with probability at least 1 − δ, d(µ, µ0 ) ≤ . Proof. Take m = n+log(2/δ) samples from µ, and for each i ∈ [n], let mi be the number of times i 22 0 was sampled. Define µ (i) = mi /m. Now, we show that maxS⊆[n] |µ(S) − µ0 (S)| ≤ /2. The lemma follows from this since the `1 distance is equal to twice this amount. For any set S ⊆ [n], let X1 , X2 , . . . , Xm be random variables such that Xj = 1 if the j th sample 1 P was in S, and otherwise Xj = 0. Let X = m j∈[m] Xj . Then, X = µ0 (S) and E[X] = µ(S). 2 By Lemma 2.7, Pr[|X − E[X]| > /2] ≤ 2e−2m . Substituting for m, we get that Pr[|µ0 (S) − µ(S)| > /2] ≤ 2e−n−log(2/δ) . 
Taking a union bound over all sets, with probability at least 1 − δ, |µ0 (S) − µ(S)| ≤ /2 for every S ⊆ [n]. Therefore, d(µ, µ0 ) ≤ . We also have the following simple lemma about learning a distribution in `∞ distance. Lemma 2.9. Let µ be a distribution supported over [n]. Using log(2n/δ) unconditional samples from 22 0 µ, we can obtain an explicit distribution µ supported on [n] such that, with probability at least 1−δ, kµ − µ0 k∞ ≤ . Proof. Take m = log(2n/δ) samples from µ, and for each i ∈ [n], let mi be the number of times i 22 was sampled. For each i ∈ [n], define µ0 (i) = mi /m. th sample For an index i ∈ [n], let X1 , X2 , . . . , X m be random variables such that Xj = 1 if the j 1 P 0 is i, and otherwise Xj = 0. Let X = m j∈[m] Xj . Then X = µ (i) and E[X] = µ(i). By Lemma 2 2.7, Pr[|X − E[X]| > ] ≤ 2e−2m . Substituting for m, we get that Pr[|µ(i) − µ0 (i)| > ] ≤ δ/n. By a union bound over [n], with probability at least 1 − δ, |µ(i) − µ0 (i)| ≤ for every i ∈ [n]. 3 Fine partitions and how to pull them We define the notion of η-fine partitions of a distribution µ supported over [n], which are central to all our algorithms. Definition 3.1 (η-fine interval partition). Given a distribution µ over [n], an η-fine interval partition of µ is an interval-partition I = (I1 , I2 , . . . , Ir ) of [n] such that for all j ∈ [r], µ(Ij ) ≤ η, excepting the case |Ij | = 1. The length |I| of an interval partition I is the number of intervals in it. 6 Algorithm 1: Pulling an η-fine partition Input: Distribution µ supported over [n], parameters η > 0 (fineness) and δ > 0 (error probability) 3 3 1 Take m = η log ηδ unconditional samples from µ 2 3 4 5 6 7 Arrange the indices sampled in increasing order i1 < i2 < · · · < ir without repetition and set i0 = 0 for each j ∈ [r] do if ij > ij−1 + 1 then add the interval {ij−1 + 1, . . . , ij − 1} to I Add the singleton interval {ij } to I if ir < n then add the interval {ir + 1, . . . , n} to I return I The following Algorithm 1 is the pulling mechanism. The idea is to take independent unconditional samples from µ, make them into singleton intervals in our interval-partition I, and then take the intervals between these samples as the remaining intervals in I. Lemma 3.2. Let µ be a distribution that is supported over [n], and η, δ > 0, and suppose that these are fed to Algorithm 1. Then, with probability at least 1− δ, the set of intervals I returned 1 1 by Algorithm 1 is an η-fine interval partition of µ of length O η log ηδ . Proof. Let I the set of intervals returned by Algorithm 1. The guarantee on the length of I follows from the number of samples taken in Step 1, noting that |I| ≤ 2r − 1 = O(m). Let J be a maximal set of pairwise disjoint minimal intervals I in [n], such that µ(I) ≥ η/3 for every interval I ∈ J . Note that every i for which µ(i) ≥ η/3 necessarily appears as a singleton interval {i} ∈ J . Also clearly |J | ≤ 3/η. We shall first show that if an interval I 0 is such that µ(I 0 ) ≥ η, then it fully contains some interval I ∈ J . Then, we shall show that, with probability at least 1 − δ, the samples taken in Step 1 include an index from every interval I ∈ J . By Steps 2 to 6 of the algorithm and the above, this implies the statement of the lemma. Let I 0 be an interval such that µ(I 0 ) ≥ η, and assume on the contrary that it contains no interval from J . Clearly it may intersect without containing at most two intervals Il , Ir ∈ J . 
Also, µ(I 0 ∩ Il ) < η/3 because otherwise we could have replaced Il with I 0 ∩ Il in J , and the same holds for µ(I 0 ∩ Ir ). But this means that µ(I \ (Il ∪ Ir )) > η/3, and so we could have added I \ (Il ∪ Ir ) to J , again a contradiction. Let I ∈ J . The probability that an index from I is not sampled is at most (1−η/3)3 log(3/ηδ)/η ≤ δη/3. By a union bound over all I ∈ J , with probability at least 1 − δ the samples taken in Step 1 include an index from every interval in J . The following is a definition of a variation of a fine partition, where we allow some intervals of small total weight to violate the original requirements. Definition 3.3 ((η, γ)-fine partitions). Given a distributionPµ over [n], an (η, γ)-fine interval partition is an interval partition I = (I1 , I2 , . . . , Ir ) such that I∈HI µ(I) ≤ γ, where HI is the set of violating intervals {I ∈ I : µ(I) > η, |I| > 1}. 7 In our applications, γ will be larger than η by a factor of L, which would allow us through the following Algorithm 2 to avoid having additional log L factors in our expressions for the unconditional and the adaptive tests. Algorithm 2: Pulling an (η, γ)-fine partition Input: Distribution µ supported over [n], parameters η, γ > 0 (fineness) and δ > 0 (error probability) 3 5 1 Take m = η log γδ unconditional samples from µ 2 3 Perform Step 2 through Step 6 of Algorithm 1. return I Lemma 3.4. Let µ be a distribution that is supported over [n], and γ, η, δ > 0, and suppose that these are fed to Algorithm 2. Then, with probability at least 1 − δ, the set ofintervals I returned 1 1 by Algorithm 2 is an (η, γ)-fine interval partition of µ of length O η log γδ . Proof. Let I the set of intervals returned by Algorithm 2. The guarantee on the length of I follows from the number of samples taken in Step 1. As in the proof of Lemma 3.2, Let J be a maximal set of pairwise disjoint minimal intervals I in [n], such that µ(I) ≥ η/3Sfor every interval I ∈ J . Here, also define the set J 0 to be the set of maximal intervals in [n] \ I∈J I. Note now that J ∪ J 0 is an interval partition of [n]. Note also that between every two consecutive intervals of J 0 lies an interval of J . Finally, since J is maximal, all intervals in J 0 are of weights less than η/3. Recalling the definition H = {I ∈ I : µ(I) > η, |I| > 1}, as in the proof of Lemma 1 every I 0 ∈ H must fully contain an interval from J from which no point was sampled. Moreover, I 0 may not fully contain intervals from J from which any points were sampled. Note furthermore that the weight of such an I 0 is not more than 5 times the total weight of the intervals in J that it fully contains. To see this, recall that the at most two intervals from J that intersect I 0 without containing it have intersections of weight not more than η/3. Also, there may be the intervals of J 0 intersecting I 0 , each of weight at most η/3. However, because there is an interval J between any two consecutive intervals of J 0 , the number of intervals from J 0 intersecting I 0 is at most 1 more than the number of intervals of J fully contained in I 0 . Thus the number of intersecting intervals from J ∪ J 0 is not more than 5 times the number of fully contained intervals from J , and together with their weight bounds we get the bound on µ(I 0 ). Let I ∈ J . The probability that an index from I is not sampled is at most (1−η/3)3 log(5/γδ)/η ≤ δγ/5. 
By applying the Markov bound over all I ∈ J (along with their weights), with probability at least 1 − δ the samples taken in Step 1 include an index from every interval P in J but at most a subset of them of total weight at most γ/5. By the above this means that I∈H µ(I) ≤ γ. 4 Handling decomposable distributions The notion of L-decomposable distributions was defined and studied in [CDGR15]. They showed that a large class of properties, such as monotonicity and log-concavity, are L-decomposable. We now formally define L-decomposable distributions and properties, as given in [CDGR15]. 8 Definition 4.1 ((γ, L)-decomposable distributions [CDGR15]). For an integer L, a distribution µ supported over [n] is (γ, L)-decomposable, if there exists an interval partition I = (I1 , I2 , . . . , I` ) of [n], where ` ≤ L, such that for all j ∈ [`], at least one of the following holds. 1. µ(Ij ) ≤ γ L. 2. maxi∈Ij µ(i) ≤ (1 + γ) mini∈Ij µ(i). The second condition in the definition of a (γ, L)-decomposable distribution is identical to saying that bias(µ Ij ) ≤ γ. An L-decomposable property is now defined in terms of all its members being decomposable distributions. Definition 4.2 (L-decomposable properties, [CDGR15]). For a function L : (0, 1] × N → N, we say that a property of distributions C is L-decomposable, if for every γ > 0, and µ ∈ C supported over [n], µ is (γ, L(γ, n))-decomposable. Recall that part of the algorithm for learning such distributions is finding (through pulling) what we referred to as a fine partition. Such a partition may still have intervals where the conditional distribution over them is far from uniform. However, we shall show that for L-decomposable distributions, the total weight of such “bad” intervals is not very high. The next lemma shows that every fine partition of an (γ, L)-decomposable distribution has only a small weight concentrated on “non-uniform” intervals and thus it will be sufficient to deal with the “uniform” intervals. Lemma 4.3. Let µ be a distribution supported over [n] which is (γ, L)-decomposable. For every γ/L-fine interval partition I 0 = (I10 , I20 , . . . , Ir0 ) of µ, the following holds. X µ(Ij0 ) ≤ 2γ. j∈[r]:bias(µI 0 )>γ j Proof. Let I = (I1 , I2 , . . . , I` ) be the L-decomposition of µ, where ` ≤ L. Let I 0 = (I10 , I20 , . . . , Ir0 ) be an interval partition of [n] such that for all j ∈ [r], µ(Ij0 ) ≤ γ/L or |Ij0 | = 1. Any interval Ij0 for which bias(µ Ij0 ) > γ, is either completely inside an interval Ik such that µ(Ik ) ≤ γ/L, or intersects more than one interval (and in particular |Ij0 | > 1). There are at most L − 1 intervals in I 0 that intersect more than one interval in I. The sum of the weights of all such intervals is at most γ. For any interval Ik of I such that µ(Ik ) ≤ γ/L, the sum of the weights of intervals from I 0 that lie completely inside Ik is at most γ/L. Thus, the total weight of all such intervals is bounded by γ. Therefore, the sum of the weights of intervals Ij0 such that bias(µ Ij0 ) > γ is at most 2γ. In order to get better bounds, we will use the counterpart of this lemma for the more general (two-parameter) notion of a fine partition. Lemma 4.4. Let µ be a distribution supported over [n] which is (γ, L)-decomposable. For every (γ/L, γ)-fine interval partition I 0 = (I10 , I20 , . . . , Ir0 ) of µ, the following holds. X µ(Ij0 ) ≤ 3γ. j∈[r]:bias(µI 0 )>γ j 9 Proof. Let I = (I1 , I2 , . . . , I` ) be the L-decomposition of µ, where ` ≤ L. Let I 0 = (I10 , I20 , . . . 
, Ir0 ) be an interval partition of [n] such that for a set HI of total weight at most γ, for all Ij0 ∈ I \ HI , µ(Ij0 ) ≤ γ/L or |Ij0 | = 1. Exactly as in the proof of Lemma 4.3, the total weight of intervals Ij0 ∈ I \ HI for which bias(µ Ij0 ) > γ is at most 2γ. In the worst case, all intervals in HI are also such that bias(µ Ij0 ) > γ, adding at most γ to the total weight of such intervals. As previously mentioned, we are not learning the actual distribution but a “flattening” thereof. We next formally define the flattening of a distribution µ with respect to an interval partition I. Afterwards we shall describe its advantages and how it can be learned. Definition 4.5. Given a distribution µ supported over [n] and a partition I = (I1 , I2 , . . . , I` ), of [n] to intervals, the flattening of µ with respect to I is a distribution µI , supported over [n], such that for i ∈ Ij , µI (i) = µ(Ij )/|Ij |. The following lemma shows that the flattening of any distribution µ, with respect to any interval partition that has only small weight on intervals far from uniform, is close to µ. Lemma 4.6. Let µ be aPdistribution supported on [n], and let I = (I1 , I2 , . . . , Ir ) be an interval partition of µ such that j∈[r]:d(µI ,UI )≥γ µ(Ij ) ≤ η. Then d(µ, µI ) ≤ γ + 2η. j j Proof. We split the sum d(µ, µI ) into parts, one for Ij such that d(µ Ij , UIj ) ≤ γ, and one for the remaining intervals. For Ij s such that d(µ Ij , UIj ) ≤ γ, we have X X 1 µ(i) − µ(Ij ) = µ(I ) = µ(Ij )d(µ Ij , UIj ) ≤ γµ(Ij ). (1) µ (i) − j Ij |Ij | |Ij | i∈Ij i∈Ij For Ij such that d(µ Ij , UIj ) > γ, we have X X 1 µ(i) − µ(Ij ) = µ(Ij ) µ Ij (i) − ≤ 2µ(Ij ) |Ij | |Ij | i∈Ij (2) i∈Ij We know that the sum of µ(Ij ) over all Ij such that d(µ Ij , UIj ) ≥ γ is at most η. Using Equations 2 and 1, and summing up over all the sets Ij ∈ I, the lemma follows. The good thing about a flattening (for an interval partition of small length) is that it can be efficiently learned. For this we first make a technical definition and note some trivial observations: Definition 4.7 (coarsening). Given µ and I, where |I| = `, we define the coarsening of µ according to I the distribution µ̂I over [`] as by µ̂I (j) = µ(Ij ) for all j ∈ [`]. Observation 4.8. Given a distribution µ̂I over [`], define µI over [n] by µ(i) = µ̂I (ji )/|Iji |, where ji is the index satisfying i ∈ Iji . This is a distribution, and for any two distributions µ̂I and χ̂I we have d(µI , χI ) = d(µ̂I , χ̂I ). Moreover, if µ̂I is a coarsening of a distribution µ over [n], then µI is the respective flattening of µ. Proof. All of this follows immediately from the definitions. 10 The following lemma shows how learning can be achieved. We will ultimately use this in conjunction with Lemma 4.6 as a means to learn a whole distribution through its flattening. Lemma 4.9. Given a distribution µ supported over [n] and an interval partition I = (I1 , I2 , . . . , I` ), using 2(`+log(2/δ)) , we can obtain an explicit distribution µ0I , supported over [n], such that, with 2 probability at least 1 − δ, d(µI , µ0I ) ≤ . Proof. First, note that an unconditional sample from µ̂I can be simulated using one unconditional sample from µ. To obtain it, take the index i sampled from µ, and set j to be the index for which i ∈ Ij . Using Lemma 2.8, we can now obtain a distribution µ̂0I , supported over [`], such that with probability at least 1 − δ, d(µ̂I , µ̂0I ) ≤ . To finish, we construct and output µ0I as per Observation 4.8. 
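Before moving on, the following Python sketch, under our own naming and with none of the lemmas' sample bounds reproduced, shows how the pieces of Sections 3 and 4 fit together: Steps 2–6 of Algorithm 1 turn a set of sampled indices into a pulled interval partition, and the flattening of Definition 4.5 is then learned through the empirical coarsening, as in Observation 4.8 and Lemma 4.9. The explicit sample lists are assumptions made for illustration only.

```python
def pull_fine_partition(sampled_indices, n):
    """Steps 2-6 of Algorithm 1: each sampled index becomes a singleton
    interval, and the maximal gaps between consecutive sampled indices
    become the remaining intervals.  Intervals are inclusive (start, end)
    pairs over the domain {1, ..., n}."""
    partition = []
    prev = 0
    for i in sorted(set(sampled_indices)):
        if i > prev + 1:
            partition.append((prev + 1, i - 1))  # interval strictly between samples
        partition.append((i, i))                 # singleton for the sampled index
        prev = i
    if prev < n:
        partition.append((prev + 1, n))          # tail interval after the last sample
    return partition

def learned_flattening(fresh_samples, partition, n):
    """Empirical coarsening (Definition 4.7) followed by flattening
    (Definition 4.5 / Observation 4.8): estimate each interval's weight
    from the samples and spread it uniformly inside the interval."""
    m = len(fresh_samples)
    flattening = [0.0] * (n + 1)                 # position 0 unused; 1-based domain
    for (a, b) in partition:
        weight = sum(1 for s in fresh_samples if a <= s <= b) / m
        for i in range(a, b + 1):
            flattening[i] = weight / (b - a + 1)
    return flattening[1:]
```

With enough samples in each stage, Lemma 3.4 makes the pulled partition fine, Lemma 4.4 bounds the weight of its high-bias intervals when µ is decomposable, and Lemmas 4.6 and 4.9 together bound the ℓ1 distance between µ and the output of learned_flattening.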
5 Weakly tolerant interval uniformity tests To unify our treatment of learning and testing with respect to L-decomposable properties to all three models (unconditional, adaptive-condition and non-adaptive-conditional), we first define what it means to test a distribution µ for uniformity over an interval I ⊆ [n]. The following definition is technical in nature, but it is what we need to be used as a building block for our learning and testing algorithms. Definition 5.1 (weakly tolerant interval tester). A weakly tolerant interval tester is an algorithm T that takes as input a distribution µ over [n], an interval I ⊆ [n], a maximum size parameter m, a minimum weight parameter γ, an approximation parameter and an error parameter δ, and satisfies the following. 1. If |I| ≤ m, µ(I) ≥ γ, and bias(µ I ) ≤ /100, then the algorithm accepts with probability at least 1 − δ. 2. If |I| ≤ m, µ(I) ≥ γ, and d(µ I , UI ) > , then the algorithm rejects with probability at least 1 − δ. In all other cases, the algorithm may accept or reject with arbitrary probability. For our purposes we will use three weakly tolerant interval testers, one for each model. First, a tester for uniformity which uses unconditional samples, a version of which has already appeared implicitly in [GR00]. We state below the tester with the best dependence on n and . We first state it in its original form, where I is the whole of [n], implying that m = n and γ = 1, and δ = 1/3. Lemma 5.2 ([Pan08]). For the input (µ, [n], n, 1, , 1/3), there is a weakly tolerant interval tester √ using O( n/2 ) unconditional samples from µ. The needed adaptation to our purpose is straightforward. Lemma 5.3. For the input (µ, I, m, γ, , δ), there is a weakly tolerant interval tester which uses √ O( m log(1/δ)/γ2 ) unconditional samples from µ. 11 Proof. To adapt the tester of Lemma 5.2 to the general m and γ, we just take samples according to µ and keep from them those samples lie in I. This simulates samples from µ I , over which √ we employ the original tester. This gives a tester using O( m/γ2 ) unconditional samples and providing an error parameter of, say, δ = 2/5 (the extra error is due to the probability of not getting enough samples from I even when µ(I) ≥ γ). To move to a general δ, we repeat this O(1/δ) times and take the majority vote. Next, a tester that uses adaptive conditional samples. For this we use the following tester from [CRS15] (see also [CFGM13]). Its original statement does not have the weakly tolerance (acceptance for small bias) guarantee, but it is easy to see that the proof there works for the stronger assertion. This time we skip the question of how to adapt the original algorithm from I = [n] and δ = 2/3 to the general parameters here. This is since γ does not matter (due to using adaptive conditional samples), the query complexity is independent of the domain size to begin with, and the move to a general δ > 0 is by standard amplification. Lemma 5.4 ([CRS15], see also [CFGM13]). For the input (µ, I, m, γ, , δ), there is a weakly tolerant interval tester that adaptively takes log(1/δ)poly(log(1/))/2 conditional samples from µ. Finally, a tester that uses non-adaptive conditional samples. For this to work, it is also very important that the queries do not depend on I as well (but only on n and γ). We just state here the lemma, the algorithm itself is presented and analyzed in Section 8. Lemma 5.5. 
For the input (µ, I, m, γ, , δ), there is a weakly tolerant interval tester that nonadaptively takes poly(log n, 1/) log(1/δ)/γ conditional samples from µ, in a manner independent of the interval I. 6 Assessing an interval partition Through either Lemma 3.2 or Lemma 3.4 we know how to construct a fine partition, and then through either Lemma 4.3 or Lemma 4.4 respectively we know that if µ is decomposable, then most of the weight is concentrated on intervals with a small bias. However, eventually we would like a test that works for decomposable and non-decomposable distributions alike. For this we need a way to asses an interval partition as to whether it is indeed suitable for learning a distribution. This is done through a weighted sampling of intervals, for which we employ a weakly tolerant tester, The following is the formal description, given as Algorithm 3. Algorithm 3: Assessing a partition Input: A distribution µ supported over [n], parameters c, r, an interval partition I satisfying |I| ≤ r, parameters , δ > 0, a weakly tolerant interval uniformity tester T taking input values (µ, I, m, γ, , δ). 1 for s = 20 log(1/δ)/ times do 2 Take an unconditional sample from µ and let I ∈ I be the interval that contains it 3 Use the tester T with input values (µ, I, n/c, /r, , δ/2s) 4 if test rejects then add I to B 5 if |B| > 4s then reject else accept 12 To analyze it, first, for a fine interval partition, we bound the total weight of intervals where the weakly tolerant tester is not guaranteed a small error probability; recall that T as used in Step 3 guarantees a correct output only for an interval I satisfying µ(I) ≥ /r and |I| ≤ n/r. Observation 6.1. SDefine NI = {I ∈ I : |I| > n/r or µ(I) < /r}. If I is (η, γ)-fine, where cη + γ ≤ , then µ( I∈NI I) ≤ 2. Proof. Intervals in NI must fall into at least one of the following categories. • Intervals in HI , whose total weight is bounded by γ by the definition of a fine partition. • Intervals whose weight is less than /r. Since there are at most r such intervals (since |I| ≤ r) their total weight is bounded by . • Intervals whose size is more than n/c and are not in HI . Every such interval is of weight bounded by η (by the definition of a fine partition) and clearly there are no more than c of those, giving a total weight of cη. Summing these up concludes the proof. The following “completeness” lemma states that the typical case for a fine partition of a decomposable distribution, i.e. the case where most intervals exhibit a small bias, is correctly detected. Lemma 6.2.SSuppose that I is (η, γ)-fine, where cη + γ ≤ . Define GI = {i : I : bias(µ I ) ≤ /100}. If µ( I∈GI ) ≥ 1 − , then Algorithm 3 accepts with probability at least 1 − δ. Proof. Note by Observation 6.1 that the total weight of GI \ NI is at least 1 − 3. By the Chernoff bound of Lemma 2.6, with probability at least 1 − δ/2 all but at most 4s of the intervals drawn in Step 2 fall into this set. Finally, note that if I as drawn in Step 2 belongs to this set, then with with probability at least 1 − δ/2s the invocation of T in Step 3 will accept it, so by a union bound with probability at least 1 − δ/2 all sampled intervals from this set will be accepted. All events occur together and make the algorithm accept with probability at least 1 − δ, concluding the proof. The following “soundness” lemma states that if too much weight is concentrated on intervals where µ is far from uniform in the `1 distance, then the algorithm rejects. 
Later we will show that this is the only situation where µ cannot be easily learned through its flattening according to I. Lemma S 6.3. Suppose that I is (η, γ)-fine, where cη + γ ≤ . Define FI = {i : I : d(µ I , UI ) > }. If µ( I∈FI ) ≥ 7, then Algorithm 3 rejects with probability at least 1 − δ. Proof. Note by Observation 6.1 that the total weight of FI \ NI is at least 5. By the Chernoff bound of Lemma 2.6, with probability at least 1 − δ/2 at least 4s of the intervals drawn in Step 2 fall into this set. Finally, note that if I as drawn in Step 2 belongs to this set, then with with probability at least 1 − δ/2s the invocation of T in Step 3 will reject it, so by a union bound with probability at least 1 − δ/2 all sampled intervals from this set will be rejected. All events occur together and make the algorithm reject with probability at least 1 − δ, concluding the proof. 13 Finally, we present the query complexity of the algorithm. It is presented as generally quadratic in log(1/δ), but this can be made linear easily by first using the algorithm with δ = 1/3, and then repeating it O(1/δ) times and taking the majority vote. When we use this lemma later on, both r and c will be linear in the decomposability parameter L for a fixed , and δ will be a fixed constant. Lemma 6.4. Algorithm 3 requires O(q log(1/δ)/) many samples, where q = q(n/c, /r, , δ/2s) is the number of samples that the invocation of T in Step 3 requires. In particular, Algorithm 3 can be implemented either as an unconditional sampling algorithm p taking r n/c log2 (1/δ)/poly() many samples, an adaptive conditional sampling algorithm taking r log2 (1/δ)/poly() many samples, or a non-adaptive conditional sampling algorithm taking r log2 (1/δ)poly(log n, 1/) many samples. Proof. A single (unconditional) sample is taken each time Step 2 is reached, and all other samples are taken by the invocation of T in Step 3. This makes the total number of samples to be s(q + 1) = O(q log(1/δ)/). The bound for each individual sampling model follows by plugging in Lemma 5.3, Lemma 5.4 and Lemma 5.5 respectively. For the last one it is important that the tester makes its queries completely independently of I, as otherwise the algorithm would not have been non-adaptive. 7 Learning and testing decomposable distributions and properties Here we finally put things together to produce a learning algorithm for L-decomposable distribution. This algorithm is not only guaranteed to learn with high probability a distribution that is decomposable, but is also guaranteed with high probability to not produce a wrong output for any distribution (though it may plainly reject a distribution that is not decomposable). This is presented in Algorithm 4. We present it with a fixed error probability 2/3 because this is what we use later on, but it is not hard to move to a general δ. Algorithm 4: Learning an L-decomposable distribution Input: Distribution µ supported over [n], parameters L (decomposability), > 0 (accuracy), a weakly tolerant interval uniformity tester T taking input values (µ, I, m, γ, , δ) 1 Use Algorithm 2 with input values (µ, /2000L, /2000, 1/9) to obtain a partition I with |I| ≤ r = 105 L log(1/)/ 2 Use Algorithm 3 with input values (µ, L, r, I, /20, 1/9, T) 3 if Algorithm 3 rejected then reject 4 Use Lemma 4.9 with values (µ, I, /10, 1/9) to obtain µ0I 5 return µ0I First we show completeness, that the algorithm will be successful for decomposable distributions. Lemma 7.1. 
If µ is (/2000, L)-decomposable, then with probability at least 2/3 Algorithm 4 produces a distribution µ0 so that d(µ, µ0 ) ≤ . Proof. By Lemma 3.4, withPprobability at least 8/9 the partition I is (/2000L, /2000)-fine, which means by Lemma 4.4 that j∈[r]:bias(µ 0 )>/2000 µ(Ij0 ) ≤ 3/2000. When this occurs, by Lemma 6.2 I j with probability at least 8/9 Algorithm 3 will accept and so the algorithm will move past Step 3. 14 In this situation, in particular by Lemma 4.6 we have that d(µI , µ) ≤ 15/20 (in fact this can be bounded much smaller here), and with probability at least 8/9 (by Lemma 4.9) Step 4 provides a distribution that is /10-close to µI and hence -close to µ. Next we show soundness, that the algorithm will with high probability not mislead about the distribution, whether it is decomposable or not. Lemma 7.2. For any µ, the probability that Algorithm 4 produces (without rejecting) a distribution µ0 for which d(µ, µ0 ) > is bounded by δ. Proof. Consider the interval partition I. By Lemma 3.4, with P probability at least 8/9 it is (/2000L, /2000)-fine. When this happens, if I is such that j:d(µI ,UI ) µ(Ij ) > 7/20, then j j by Lemma 6.3 with probability at least 8/9 the algorithm will reject in Step 3, and we are done (recall that here a rejection is an allowable P outcome). On the other hand, if I is such that j:d(µI ,UI ) µ(Ij ) ≤ 7/20, then by Lemma 4.6 we have that j j d(µI , µ) ≤ 15/20, and with probability at least 8/9 (by Lemma 4.9) Step 4 provides a distribution that is /10-close to µI and hence -close to µ, which is also an allowable outcome. And finally, we plug in the sample complexity bounds. Lemma 7.3. Algorithm 4 requires O(L log(1/)/ + q/ + L log(1/)/3 ) many samples, where the value q = q(n/L, 2 /105 L log(1/), /20, 2000/) is a bound on the number of samples that each invocation of T inside Algorithm 3 requires. In particular, Algorithm 4 can be implemented either as an unconditional sampling algorithm √ taking nL/poly() many samples, an adaptive conditional sampling algorithm taking L/poly() many samples, or a non-adaptive conditional sampling algorithm taking Lpoly(log n, 1/) many samples. Proof. The three summands in the general expression follow respectively from the sample complexity calculations of Lemma 3.4 for Step 1, Lemma 6.4 for Step 2, and Lemma 4.9 for Step 4 respectively. Also note that all samples outside Step 2 are unconditional. The bound for each individual sampling model follows from the respective bound stated in Lemma 6.4. Let us now summarize the above as a theorem. Theorem 7.4. Algorithm 4 is capable of learning an (/2000, L)-decomposable distribution, giving with probability at least 2/3 a distribution that is epsilon-close to it, such that for no distribution will it give as output a distribution -far from it with probability more than 1/3.√ It can be implemented either as an unconditional sampling algorithm taking nL/poly() many samples, an adaptive conditional sampling algorithm taking L/poly() many samples, or a nonadaptive conditional sampling algorithm taking Lpoly(log n, 1/) many samples. Proof. This follows from Lemmas 7.1, 7.2 and 7.3 respectively. Let us now move to the immediate application of the above for testing decomposable properties. The algorithm achieving this is summarized as Algorithm 5 15 Algorithm 5: Testing L-decomposable properties. 
Input: Distribution µ supported over [n], function L : (0, 1] × N → N (decomposability), parameter > 0 (accuracy), an L-decomposable property C of distributions, a weakly tolerant interval uniformity tester T taking input values (µ, I, m, γ, , δ). 1 Use Algorithm 4 with input values (µ, L(/4000, n), /2, T) to obtain µ0 2 if Algorithm 4 accepted and µ0 is /2-close to C then accept else reject Theorem 7.5. Algorithm 5 is a test (with error probability 1/3) for the L-decomposable property C. For L = L(/4000, n), It can be implemented either as an unconditional sampling algorithm taking √ nL/poly() many samples, an adaptive conditional sampling algorithm taking L/poly() many samples, or a non-adaptive conditional sampling algorithm taking Lpoly(log n, 1/) many samples. Proof. The number and the nature of the samples are determined fully by the application of Algorithm 4 in Step 1, and are thus the same as in Theorem 7.4. Also by this theorem, for a distribution µ ∈ C, with probability at least 2/3 an /2-close distribution µ0 will be produced, and so it will be accepted in Step 2. Finally, if µ is -far from C, then with probability at least 2/3 Step 1 will either produce a rejection, or again produce µ0 that is /2-close to µ. In the latter case, µ0 will be /2-far from C by the triangle inequality, and so Step 2 will reject in either case. 8 A weakly tolerant tester for the non-adaptive conditional model Given a distribution µ, supported over [n], and an interval I ⊆ [n] such that µ(I) ≥ γ, we give a tester that uses non-adaptive conditional queries to µ to distinguish between the cases bias(µ I ) ≤ /100 and d(µ I , UI ) > , using ideas from [CFGM13]. A formal description of the test is given as Algorithm 6. It is formulated here with error probability δ = 1/3. Lemma 5.5 is obtained from this the usual way, by repeating the algorithm O(1/δ) times and taking the majority vote. We first make the observation that makes Algorithm 6 suitable for a non-adaptive setting. Observation 8.1. Algorithm 6 can be implemented using only non-adaptive conditional queries to the distribution µ, that are chosen independently of I. Proof. First, note that the algorithm samples elements from µ at three places. Initially, it samples unconditionally from µ in Step 1, and then it performs conditional samples from the sets Uk in Steps 10 and 13. In Steps 10 and 13, the samples are conditioned on sets Uk , where k depends on I. However, observe that we can sample from all sets Uk , for all 0 ≤ k ≤ log n, at the beginning, and then use just the samples from the appropriate Uk at Steps 10 and 13. This only increases the bound on the number of samples by a factor of log n. Thus we have only non-adaptive queries, all of which are made at the start of the algorithm, independently of I. The following lemma is used in Step 6 of our algorithm. Lemma 8.2. Let µ be a distribution supported over [n] and I ⊆ [n] be an interval such that µ(I) ≥ γ. Using t = 4(|I|+log(2/δ)) unconditional queries to µ, we can construct a distribution µ0 2 γ over I such that, with probability at least 1 − δ, d(µ I , µ0 ) ≤ (in other cases µ0 may be arbitrary). 16 Algorithm 6: Non-adaptive weakly tolerant uniformity tester Input: Distribution µ supported over [n], interval I ⊆ [n], weight bound γ, accuracy > 0. 4(log10 n+3) 1 Sample t = elements from µ. 2 γ 2 for k ∈ {0, . . . , log n} do 3 Set pk = 2−k . 4 Choose a set Uk ⊆ [n], where each i ∈ [n] is in Uk with probability pk , independently of other elements in [n]. 
7 if |I| ≤ log10 n then Use Lemma 8.2, using the t unconditional samples from µ, to construct a distribution µ0 if d(µ0 , UI ) ≤ /2 then accept, else reject. 8 else 5 6 9 10 11 12 for Uk such that k ≤ log 14 15 16 17 18 |I| 2 log8 n , and |I ∩ Uk | ≥ log8 n do Sample log3 n elements from µ Uk . if the same element from I ∩ Uk has been sampled twice then reject. 2 3 n log(3 log n) 2 γ Choose an index k such that 16 13 log8 n ≤ |I|pk < 4 3 log8 n. Sample mk = C log elements from µ Uk , for a large constant C. if |I ∩ Uk | > 2|I|pk or the number of samples in I ∩ Uk is less than γmk /40 then reject. else Use Lemma 2.9 with the samples received from I ∩ Uk , to construct µ0 , supported on I ∩ Uk , such that kµ0 − µ I∩Uk k∞ ≤ 80|I∩U with probability at least 9/10. k| if kµ0 − UI∩Uk k∞ ≤ 3 80|I∩Uk | then accept, else reject. 17 Proof. Take t = 4(|I|+log(2/δ)) unconditional samples. Let tI be the number of samples that belong 2 γ to I. Then, E[tI ] = tµ(I) ≥ tγ. Therefore, by Hoeffding bounds, with probability at least 1 − exp(−tµ(I)/4), tI ≥ tµ(I)/2 ≥ tγ/2. The tI samples are distributed according to µ I . By the choice of t, with probability at least 1 − δ/2, tI ≥ 2(|I| + log(2/δ))/2 . Therefore, by Lemma 2.8, we can obtain a distribution µ0 , supported over I, such that with probability at least 1 − δ, d(µ I , µ0 ) ≤ . If we did not obtain sufficiently many samples (either because µ(I) < γ or due to a low probability event) then we just output an arbitrary distribution supported on I. Lemma 8.3 (Completeness). If µ(I) ≥ γ and bias(µ I ) ≤ /100, then Algorithm 6 accepts with probability at least 2/3. Proof. First note that if |I| ≤ log10 n, then we use Lemma 8.2 to test the distance of µ I to uniform with probability at least 9/10 in Step 7. For the remaining part of the proof, we will assume that |I| > log10 n. For a set Uk chosen by the algorithm, and any i ∈ I ∩ Uk , the probability that it is sampled 2 3 µ(i) twice in Step 11 is at most log2 n µ(U . Since µ(Uk ) ≥ µ(I ∩ Uk ), the probability of sampling k) 2 3 µ(i) . By Observation 2.2 bias(µ I ) ≤ /100 implies twice in Step 11 is at most log2 n µ(I∩U k) kµ I −UI k∞ ≤ 100|I| , so we have µ(I) µ(I) 1− ≤ µ(i) ≤ 1+ . |I| 100 |I| 100 From Equation 3 we get the following for all Uk . |I ∩ Uk |µ(I) |I ∩ Uk |µ(I) 1− ≤ µ(I ∩ Uk ) ≤ 1+ . |I| 100 |I| 100 (3) (4) Therefore, the probability that the algorithm samples the same element in I ∩ Uk at Step 11 twice is bounded as follows. 3 X log3 n µ(i) 2 log n maxi∈I∩Uk µ(i)2 ≤ |I ∩ Uk | 2 µ(I ∩ Uk ) 2 µ(I ∩ Uk )2 i∈I∩Uk 3 1 log n 1 + /100 2 ≤ |I ∩ Uk | 2 1 − /100 Since |I ∩ Uk | ≥ log8 n for the k chosen in Step 9, we can bound the sum as follows. X log3 n µ(i) 2 1 + /100 2 1 ≤ . 2 µ(I ∩ Uk ) log2 n 1 − /100 i∈I∩Uk Therefore, with probability at least 1 − o(1), the algorithm does not reject at Step 11. To show that the algorithm accepts with probability at least 2/3 in Step 18, we proceed as follows. Combining Equations 3 and 4, we get the following. 1 1 − /100 1 1 + /100 ≤ µ I∩Uk (i) ≤ |I ∩ Uk | 1 + /100 |I ∩ Uk | 1 − /100 18 From this it follows that kµ I∩Uk −UI∩Uk k∞ ≤ 40|I∩U . k| We now argue that in this case, the test does not reject at Step 14, for the k chosen in Step 12. Observe that E[µ(I ∩ Uk )] ≥ pk γ. Also, the expected size of the set I ∩ Uk is pk |I|. Since the k chosen in Step 12 is such that |I|pk ≥ 23 log8 n, with probability at least 1 − exp(−O(log8 n)), pk |I|/2 ≤ |I ∩ Uk | ≤ 2pk |I| (and in particular Step 14 does not reject). 
Therefore from Equation 4, we get that, with probability at least 1 − exp(−O(log8 n)), µ(I ∩ Uk ) ≥ pk γ/3. Since E[µ(Uk )] = pk , we can conclude using Markov’s inequality that, with probability at least 9/10, µ(Uk ) ≤ 10pk . The expected number of samples from I∩Uk among the mk samples used in Step 17 is mk µ(I∩Uk )/µ(Uk ). Therefore, with probability at least 9/10, the expected number of samples from I ∩ Uk among the mk samples is at least mk γ/30. Therefore, with probability, at least 9/10 − o(1), at least mk γ/40 elements of I ∩ Uk are sampled, and the tester does not reject at Step 14. The indexes that are sampled in Step 13 that lie in I ∩ Uk are distributed according to µ I∩Uk and we know that |I ∩ Uk | ≤ 2|I|pk ≤ 83 log8 n. Therefore, with probability at least 9/10, we get a distribution µ0 such that kµ0 − µ I∩Uk k∞ ≤ 80|I∩U in Step 17. k| Therefore, the test correctly accepts in Step 18 for the k chosen in Step 12. Now we prove the soundness of the tester mentioned above. First we state a lemma from Chakraborty et al [CFGM13]. Lemma 8.4 ([CFGM13], adapted for intervals). Let µ be a distribution, and I ⊆ [n] be an interval such that d(µ I , UI ) ≥ . Then the following two conditions hold. n o 1. There exists a set B1 = i ∈ I | µ I (i) < 1+/3 such that |B1 | ≥ |I|/2. |I| n o log |I| 2. There exists an index j ∈ 3, . . . , log(1+/3) , and a set Bj of cardinality at least such that (1+/3)j−1 |I| ≤ µ I (i) < (1+/3)j |I| 2 |I| , 96(1+/3)j log |I| for all i ∈ Bj . Now we analyze the case where d(µ I , UI ) > . Lemma 8.5 (Soundness). Let µ be a distribution supported on [n], and let I ⊆ [n] be an interval such that µ(I) ≥ γ. If d(µ I , UI ) ≥ , then Algorithm 6 rejects with probability at least 2/3. Proof. Observe that when |I| ≤ log10 n, the algorithm rejects with probability at least 9/10 in Step 7. For the remainder of the proof, we will assume that |I| > log10 n. We analyze two cases according to the value of j given by Lemma 8.4. 1. Suppose that j > 2 is such that |Bj | ≥ 2 |I| , 96(1+/3)j log |I| and (1 + /3)j ≤ log6 n. The expected number of elements from this set that is chosen in Uk is at least of k made in Step 12, we have |I|pk ≥ 2 3 2 |I|pk . 96(1+/3)j log |I| For the choice 8 log n. The probability that no index from Bj is chosen 8 2 j n |I|/(1+) log |I| which is at most (1 − 2 log . Since (1 + /3)j ≤ log6 n, 3|I| ) 2 log n this is at most exp − 144 . Therefore, with probability 1 − o(1), at least one element i is chosen from Bj . in Uk is (1 − pk )|Bj | Since |B1 | ≥ |I|/2, the probability that no element from B1 is chosen in Uk is at most (1 − pk )|I|/2 . Substituting for pk , we can conclude that, with probability 1 − o(1), at least one element i0 is chosen from the set B1 . 19 Now, µ I (i) ≥ (1 + /3)µ I (i0 ). Hence, µ I∩Uk (i) ≥ (1 + /3)µ I∩Uk (i0 ). This implies that kµ I∩Uk −UI∩Uk k∞ ≥ 20|I∩U . The algorithm will reject with high probability in Step k| 18, unless it has already rejected in Step 14. 2 |I| 2. Now, we consider the case where j > 2 is such that |Bj | ≥ 96(1+/3) , and (1 + j log |I| n o |I| /3)j > log6 n. Let k = max 0, blog 4(1+/3) j log2 n c . Then, for this value of k, pk ≥ o n o n 2 j 4(1+/3)j log2 n log n . Also, for this value of k, p ≤ min 1, . With probamin 1, 2(1+/3) k |I| |I| bility at least 1 − exp(−O(log5 n)), |Uk ∩ I| ≥ log8 n, for this value of k. Furthermore, the probability that Bj ∩ Uk is empty is (1 − pk )|Bj | . Substituting the values of |Bj | and pk , we get that Pr[Bj ∩ Uk = ∅] ≤ exp(−2 log n/48). 
Therefore, with probability at least 1 − exp(−2 log n/48), Uk contains an element of Bj . Let i ∈ Bj ∩ Uk . Since i ∈ Bj , from Lemma 8.4 we know that µ I (i) = µ(i) µ(I) ≥ (1+)j−1 . |I| j−1 µ(I) From this bound, we get that µ Uk (i) ≥ (1+) |I|µ(Uk ) . The expected value of µ(Uk ) is pk . By Markov’s inequality, with probability at least 9/10, µ(Uk ) ≤ 10pk . Therefore, µ Uk (i) ≥ (1+)j−1 γ γ 10|I|pk ≥ 40(1+/3) log2 n . The probability that i is sampled at most once in this case is at log3 (n)−1 γ . Therefore, with probability at least 1 − o(1), i is most log3 (n) 1 − 40(1+/3) 2 log n sampled at least twice and the tester rejects at Step 11. Proof of Lemma 5.5. Given the input values (µ, I, m, γ, , δ), we iterate Algorithm 6 O(1/δ) independent times with input values (µ, I, γ, ) (we may ignore m here), and take the majority vote. The sample complexity is evident from the description of the algorithm. If indeed µ(I) ≥ γ then Lemma 8.3 and Lemma 8.5 provide that every round gives the correct answer with probability at least 2/3, making the majority vote correct with probability at least 1 − δ. 9 Introducing properties characterized by atlases In this section, we give a testing algorithm for properties characterized by atlases, which we formally define next. We will show in the next subsection that distributions that are L-decomposable are, in particular, characterized by atlases. First we start with the definition of an inventory. Definition 9.1 (inventory). Given an interval I = [a, b] ⊆ [n] and a real-valued function ν : [a, b] → [0, 1], the inventory of ν over [a, b] is the multiset M corresponding to (ν(a), . . . , ν(b)). That is, we keep count of the function values over the interval including repetitions, but ignore their order. In particular, for a distribution µ = (p1 , . . . , pn ) over [n], the inventory of µ over [a, b] is the multiset M corresponding to (pa , . . . , pb ). Definition 9.2 (atlas). Given a distribution µ over [n], and an interval partition I = (I1 , . . . , Ik ) of [n], the atlas A of µ over I is the ordered pair (I, M), where M is the sequence of multisets (M1 , . . . , Mk ) so that Mj is the inventory of µ over Ij for every j ∈ [k]. In this setting, we also say that µ conforms to A. 20 We note that there can be many distributions over [n] whose atlas is the same. We will also denote by an atlas A any ordered pair (I, M) where I is an interval partition of [n] and M is a sequence of multisets of the same length, so that the total sum of all members of all multisets is 1. It is a simple observation that for every such A there exists at least one distribution that conforms to it. The length of an atlas |A| is defined as the shared length of its interval partition and sequence of multisets. We now define what it means for a property to be characterized by atlases, and state our main theorem concerning such properties. Definition 9.3. For a function k : R+ × N → N, we say that a property of distributions C is k-characterized by atlases if for every n ∈ N and every > 0 we have a set A of atlases of lengths bounded by k(, n), so that every distribution µ over [n] satisfying C conforms to some A ∈ A, while on the other hand no distribution µ that conforms to any A ∈ A is -far from satisfying C. Theorem 9.4. If C is a property of distributions that is k-characterized by atlases, then for any > 0 there is an adaptive conditional testing algorithm for C with query complexity k(/5, n) · poly(log n, 1/) (and error probability bound 1/3). 
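As a concrete illustration of Definitions 9.1 and 9.2, the following Python sketch computes the atlas of an explicitly given distribution with respect to an interval partition and checks conformance. The helper names are ours, the exact floating-point comparison is only for this toy setting, and the testing algorithm of Theorem 9.4 of course only ever has sampling access to the distribution.

```python
from collections import Counter

def inventory(mu, interval):
    """Inventory of mu over [a, b] (Definition 9.1): the multiset of
    probability values on the interval, ignoring their order.  mu is a
    0-indexed list standing in for a distribution over {1, ..., n}."""
    a, b = interval
    return Counter(mu[a - 1:b])

def atlas(mu, partition):
    """Atlas of mu over the interval partition (Definition 9.2): the
    partition together with the sequence of per-interval inventories."""
    return (tuple(partition), tuple(inventory(mu, I) for I in partition))

def conforms(mu, A):
    """True iff mu conforms to the atlas A, i.e. its inventory over every
    interval of A's partition equals the corresponding multiset of A.
    (Exact equality of floats is acceptable only for this illustration.)"""
    partition, inventories = A
    return all(inventory(mu, I) == M for I, M in zip(partition, inventories))
```

In particular, permuting the probability values of µ inside each interval of I leaves the atlas unchanged; this is the invariance under I-preserving permutations (Definition 10.2) that Lemma 10.3 below exploits.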
9.1 Applications and examples
We first show that L-decomposable properties are in particular characterized by atlases.
Lemma 9.5. If C is a property of distributions that is L-decomposable, then C is k-characterized by atlases, where k(ε, n) = L(ε/3, n).
Proof. Every distribution µ ∈ C that is supported over [n] defines an atlas in conjunction with the interval partition of the L-decomposition of µ for L = L(γ, n). Let A be the set of all such atlases. We will show that A, whose atlases have lengths bounded by L(γ, n), characterizes C with error parameter 3γ; setting γ = ε/3 then yields the lemma.
Let µ ∈ C. Since µ is L-decomposable, µ conforms to the atlas given by the L-decomposition, and it is in A as defined above. Now suppose that µ conforms to an atlas A ∈ A, where I = (I_1, . . . , I_ℓ) is the sequence of intervals. By the construction of A, there exists a distribution χ ∈ C that conforms with A. Now, for each j ∈ [ℓ] such that µ(I_j) ≤ γ/L, we have (noting that χ(I_j) = µ(I_j))
Σ_{i∈I_j} |µ(i) − χ(i)| ≤ Σ_{i∈I_j} µ(i) + Σ_{i∈I_j} χ(i) ≤ 2µ(I_j) ≤ 2γ/ℓ.   (5)
Noting that µ and χ have the same maximum and minimum over I_j (as they have the same inventory), for each j ∈ [ℓ] and i ∈ I_j, we know that |µ(i) − χ(i)| ≤ max_{i∈I_j} µ(i) − min_{i∈I_j} µ(i). Therefore, for all j ∈ [ℓ] such that max_{i∈I_j} µ(i) ≤ (1 + γ) min_{i∈I_j} µ(i), we have |µ(i) − χ(i)| ≤ γ min_{i∈I_j} µ(i). Therefore,
Σ_{i∈I_j} |µ(i) − χ(i)| ≤ |I_j| γ min_{i∈I_j} µ(i) ≤ γ µ(I_j).   (6)
Finally, recall that since A came from an L-decomposition of χ, all intervals are covered by the above cases. Summing up Equations 5 and 6 over all j ∈ {1, 2, . . . , ℓ}, we obtain d(µ, χ) ≤ 3γ.
Note that atlases also characterize properties that do not have a shape restriction. The following is a simple observation.
Observation 9.6. If C is a property of distributions that is symmetric over [n], then C is 1-characterized by atlases.
It was shown in Chakraborty et al [CFGM13] that such properties are efficiently testable with conditional queries, so Theorem 9.4 in particular generalizes this result. Also, the notion of characterization by atlases provides a natural model for tolerant testing, as we will see in the next subsection.
10 Atlas characterizations and tolerant testing
We now show that for all properties of distributions that are characterized by atlases, there are efficient tolerant testers as well. In [CDGR15], it was shown that for a large class of distribution properties that have "semi-agnostic" learners, there are efficient tolerant testers. In this subsection, we show that when the algorithm is given conditional query access, there are efficient tolerant testers for the larger class of properties that are characterized by atlases, including decomposable properties that otherwise do not lend themselves to tolerant testing. The mechanism presented here will also be used in the proof of Theorem 9.4 itself.
First, we give a definition of tolerant testing. We note that the definition extends naturally to algorithms that make conditional queries to a distribution.
Definition 10.1. Let C be any property of probability distributions. An (η, ε)-tolerant tester for C with query complexity q and error probability δ is an algorithm that samples q elements x_1, . . . , x_q from a distribution µ, accepts with probability at least 1 − δ if d(µ, C) ≤ η, and rejects with probability at least 1 − δ if d(µ, C) ≥ η + ε.
In [CDGR15], they show that for every α > 0 there is an ε > 0 that depends on α, such that there is an (ε, α − ε)-tolerant tester for certain shape-restricted properties.
On the other hand, tolerant testing using unconditional queries for other properties, such as the (1-decomposable) property of being uniform, requires Ω(n/log n) many samples ([VV10]). We prove that, in the presence of conditional query access, there is an (η, ε)-tolerant tester for every η, ε > 0 such that η + ε < 1, for all properties of probability distributions that are characterized by atlases.
We first present a definition and prove an easy lemma that will be useful later on.
Definition 10.2. Given a partition I = (I_1, . . . , I_k) of [n], we say that a permutation σ : [n] → [n] is I-preserving if for every 1 ≤ j ≤ k we have σ(I_j) = I_j.
Lemma 10.3. Let χ and χ′ be two distributions, supported on [n], both of which conform to an atlas A = (I, M). If A′ = (I, M′) is another atlas with the same interval partition as A, such that χ is ε-close to conforming to A′, then χ′ is also ε-close to conforming to A′.
Proof. It is an easy observation that there exists an I-preserving permutation σ that moves χ to χ′. Now let µ be the distribution that conforms to A′ such that d(µ, χ) ≤ ε, and let µ′ be the distribution that results from having σ operate on µ. It is not hard to see that µ′ conforms to A′ (which has the same interval partition as A), and that it is ε-close to χ′.
For a property C of distributions that is k-characterized by atlases, let C_η be the property of distributions of being η-close to C. The following lemma states that C_η is also k-characterized by atlases. This lemma will also be important for us outside the context of tolerant testing per se.
Lemma 10.4. Let C be a property of probability distributions that is k(ε, n)-characterized by atlases. For any η > 0, let C_η be the set of all probability distributions µ such that d(µ, C) ≤ η. Then there is a set A_η of atlases, of lengths at most k(ε, n), such that every µ ∈ C_η conforms to at least one atlas in A_η, and every distribution that conforms to an atlas in A_η is (η + ε)-close to C and ε-close to C_η.
Proof. Since C is k-characterized by atlases, there is a set of atlases A of lengths at most k(ε, n) such that for each µ ∈ C there is an atlas A ∈ A to which it conforms, and any χ that conforms to an atlas A ∈ A is ε-close to C. Now, let A_η be obtained by taking each atlas A ∈ A, and adding all atlases, with the same interval partition, corresponding to distributions that are η-close to conforming to A. First, note that the new atlases that are added have the same interval partitions as atlases in A, and hence have the same length bound k(ε, n).
To complete the proof of the lemma, we need to prove that every µ ∈ C_η conforms to some atlas in A_η, and that no distribution that conforms to an atlas in A_η is (η + ε)-far from C.
Take any µ ∈ C_η. There exists some distribution µ′ ∈ C such that d(µ, µ′) ≤ η. Since C is k-characterized by atlases, there is some atlas A ∈ A such that µ′ conforms with A. Also, observe that µ is η-close to conforming to A through µ′. Therefore, there is an atlas A′ with the same interval partition as A that was added to A_η, which is the atlas corresponding to the distribution µ. Hence, there is an atlas in A_η to which µ conforms.
Conversely, let χ be a distribution that conforms with an atlas A′ ∈ A_η. From the construction of A_η, we know that there is an atlas A ∈ A with the same interval partition as A′, and there is a distribution χ′ that conforms to A′ and is η-close to conforming to A. Therefore, by Lemma 10.3, χ is also η-close to conforming to A. Let µ′ be the distribution conforming to A such that d(χ, µ′) ≤ η.
Since µ′ conforms to an atlas A ∈ A, d(µ′, C) ≤ ε. Therefore, by the triangle inequality, d(χ, C) ≤ η + ε. This also implies that d(χ, C_η) ≤ ε, by considering χ̃ = (ηχ + εµ̃)/(ε + η) where µ̃ is the distribution in C that is (ε + η)-close to χ. Note that χ̃ is η-close to µ̃ and ε-close to χ.
Using Lemma 10.4 we get the following corollary of Theorem 9.4 about tolerant testing of distributions characterized by atlases.
Corollary 10.5. Let C be a property of distributions that is k-characterized by atlases. For every η, ε > 0 such that η + ε < 1, there is an (η, ε)-tolerant tester for C that takes k(ε/5, n) · poly(log n, 1/ε) conditional samples and succeeds with probability at least 2/3.
11 Some useful lemmas about atlases and characterizations
We start with a definition and a lemma, providing an alternative equivalent definition of properties k-characterized by atlases.
Definition 11.1 (permutation-resistant distributions). For a function k : R⁺ × N → N, a property C of probability distributions is k-piecewise permutation resistant if for every n ∈ N, every ε > 0, and every distribution µ over [n] in C, there exists a partition I of [n] into up to k(ε, n) intervals, so that every I-preserving permutation of [n] transforms µ into a distribution that is ε-close to a distribution in C.
Lemma 11.2. For k : R⁺ × N → N, a property C of probability distributions over [n] is k-piecewise permutation resistant if and only if it is k-characterized by atlases.
Proof. If C is k-piecewise permutation resistant, then for each distribution µ ∈ C there exists an interval partition I of [n] such that every I-preserving permutation of [n] transforms µ into a distribution that is ε-close to C. Each distribution µ thus gives an atlas over I, and the collection of these atlases for all µ ∈ C characterizes the property C. Therefore, C is k-characterized by atlases.
Conversely, let C be a property of distributions that is k-characterized by atlases and let A be the set of atlases. For each µ ∈ C, let A_µ be the atlas in A that characterizes µ and let I_µ be the interval partition corresponding to this atlas. Now, every I_µ-preserving permutation σ of µ gives a distribution µ_σ that has the same atlas A_µ. Since C is k-characterized by atlases, µ_σ is ε-close to C. Therefore, C is k-piecewise permutation resistant as well.
We now prove the following lemma about ε/k-fine partitions of distributions characterized by atlases, having a similar flavor as Lemma 4.3 for L-decomposable properties. Since we cannot avert a poly(log n) dependency anyway, for simplicity we use the 1-parameter variant of fine partitions.
Lemma 11.3. Let C be a property of distributions that is k(ε, n)-characterized by atlases through A. For any µ ∈ C, any ε/k-fine interval partition I′ of µ, and the corresponding atlas A′ = (I′, M′) for µ (not necessarily in A), any distribution µ′ that conforms to A′ is 3ε-close to C.
Proof. Let A = (I, M) be the atlas from A to which µ conforms, and let I′ = (I′_1, I′_2, . . . , I′_r) be an ε/k-fine interval partition of µ. Let P ⊆ I′ be the set of intervals that intersect more than one interval in I. Since I′ is ε/k-fine, and the length of A is at most k, µ(∪_{I′_j ∈ P} I′_j) ≤ ε (note that P cannot contain singletons). Also, since µ′ conforms to A′, we have µ′(∪_{I′_j ∈ P} I′_j) ≤ ε.
Let µ̃ be a distribution supported over [n] obtained as follows: For each interval I′_j ∈ P, µ̃(i) = µ(i) for every i ∈ I′_j. For each interval I′_j ∈ I′ \ P, µ̃(i) = µ′(i) for every i ∈ I′_j.
Note that the inventories of µ̃ and µ are identical over any I′_j in I′ \ P. From this it follows that µ̃ also conforms to A, and in particular µ̃ is a distribution. To see this, for any I_j in I, partition it into its intersections with the members of I′ \ P contained in it, and all the rest. For the former we use that µ and µ′ have the same inventories, and for the latter we specified that µ̃ has the same values as µ.
Since µ′ and µ̃ are identical at all points except those in P, we have d(µ′, µ̃) ≤ 2ε. Furthermore, d(µ̃, C) ≤ ε since µ̃ conforms to A ∈ A. Therefore, by the triangle inequality, d(µ′, C) ≤ 3ε.
The main idea of our test for a property of distributions k-characterized by atlases starts with a γ/k-fine partition I obtained through Algorithm 1. We then show how to compute an atlas A with this interval partition such that there is a distribution µ_I that conforms to A and is close to µ. We use the ε-trimming sampler from [CFGM13] to obtain such an atlas corresponding to I. To test if µ is in C, we show that it is sufficient to check whether there is some distribution conforming to A that is close to a distribution in C.
12 An adaptive test for properties characterized by atlases
Our main technical lemma, which we state here and prove in Section 13, is the following possibility of "learning" an atlas of an unknown distribution for an interval partition I, under the conditional sampling model.
Lemma 12.1. Given a distribution µ supported over [n] and a partition I = (I_1, I_2, . . . , I_r), using r · poly(log n, 1/ε, log(1/δ)) conditional samples from µ we can construct, with probability at least 1 − δ, an atlas for some distribution µ_I that is ε-close to µ.
First, we show how this implies Theorem 9.4. To prove it, we give as Algorithm 7 a formal description of the test.
Algorithm 7: Adaptive conditional tester for properties k-characterized by atlases
Input: A distribution µ supported over [n], a function k : (0, 1] × N → N, an accuracy parameter ε > 0, and a property C of distributions that is k-characterized by the set of atlases A
1 Use Algorithm 1 with input values (µ, ε/5k(ε/5, n), 1/6) to obtain a partition I′ with |I′| ≤ 20k(ε/5, n) log(n) log(1/ε)/ε
2 Use Lemma 12.1 with accuracy parameter ε/5 and error parameter 1/6 to obtain an atlas A_{I′} corresponding to I′
3 if there exists χ ∈ C that is ε/5-close to conforming to A_{I′} then accept else reject
Lemma 12.2 (completeness). Let C be a property of distributions that is k-characterized by atlases, and let µ be any distribution supported over [n]. If µ ∈ C, then with probability at least 2/3 Algorithm 7 accepts.
Proof. In Step 2, with probability at least 5/6 > 2/3, we get an atlas A_{I′} such that there is a distribution µ_{I′} conforming to it that is ε/5-close to µ. Step 3 then accepts on account of χ = µ.
Lemma 12.3 (soundness). Let C be a property of distributions that is k-characterized by atlases, and let µ be any distribution supported over [n]. If d(µ, C) > ε, then with probability at least 2/3 Algorithm 7 rejects.
Proof. With probability at least 2/3, for k = k(ε/5, n) we get an ε/5k-fine partition I′ in Step 1, as well as an atlas A_{I′} in Step 2 such that there is a distribution µ_{I′} conforming to it that is ε/5-close to µ. Suppose that the algorithm accepted in Step 3 on account of χ ∈ C. Then there is a χ′ that is ε/5-close to χ and conforms to A_{I′}. By Lemma 10.4, the property of being ε/5-close to C is itself k-characterized by atlases. Let A_{ε/5} be the collection of atlases characterizing it. Using Lemma 11.3 with χ′ and A_{ε/5}, we know that χ′ is 3ε/5-close to some distribution in C_{ε/5}, which in turn is ε/5-close to C. Since χ′ is also ε/5-close to µ, we obtain that µ is ε-close to C by the triangle inequality, contradicting d(µ, C) > ε.
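For orientation, the control flow of Algorithm 7 can also be written schematically as code. In the sketch below the helper arguments fine_partition, learn_atlas and exists_close_conforming are hypothetical stand-ins for Algorithm 1, the construction of Lemma 12.1, and the purely offline check of Step 3, respectively; it illustrates the flow of the tester under these assumptions and is not an implementation.

```python
def atlas_tester(mu_oracle, n, k, eps, property_C,
                 fine_partition, learn_atlas, exists_close_conforming):
    """Schematic rendering of Algorithm 7 (a sketch, not an implementation).

    mu_oracle               -- conditional-sample access to the unknown distribution
    k                       -- the function k(eps, n) bounding atlas lengths for C
    property_C              -- a description of C, used only in the offline check
    fine_partition          -- stand-in for Algorithm 1 (builds a fine interval partition)
    learn_atlas             -- stand-in for the construction of Lemma 12.1
    exists_close_conforming -- stand-in for the purely offline check of Step 3
    """
    k_val = k(eps / 5, n)

    # Step 1: obtain an (eps/5k)-fine interval partition of [n].
    partition = fine_partition(mu_oracle, n, eps / (5 * k_val), error=1 / 6)

    # Step 2: learn an atlas of a distribution that is (eps/5)-close to mu.
    learned = learn_atlas(mu_oracle, partition, accuracy=eps / 5, error=1 / 6)

    # Step 3: no further sampling; accept iff some chi in C is (eps/5)-close
    # to conforming to the learned atlas.
    return "accept" if exists_close_conforming(property_C, learned, eps / 5) else "reject"
```

Note that only Steps 1 and 2 consume samples; Step 3 is a computation over the learned atlas and the (known) property C.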
Proof of Theorem 9.4. Given a distribution µ supported on [n] and a property C of distributions that is k-characterized by atlases, we use Algorithm 7. The correctness follows from Lemmas 12.2 and 12.3. The number of samples made in Step 1 is clearly dominated by the number of samples in Step 2, which is k(ε/5, n) · poly(log n, 1/ε).
13 Constructing an atlas for a distribution
Before we prove Lemma 12.1, we will define the notion of value-distances and prove lemmas that will be useful for the proof of the theorem.
Definition 13.1 (value-distance). Given two multisets A, B of real numbers, both of the same size (e.g. two inventories over an interval [a, b]), the value-distance between them is the minimum ℓ1 distance between a vector that conforms to A and a vector that conforms to B.
The following observation gives a simple method to calculate the value-distance between two multisets A and B.
Observation 13.2. The value-distance between A and B is equal to the ℓ1 distance between the two vectors conforming to the respective sortings of the two multisets.
Proof. Given two vectors v and w corresponding to A and B achieving the value-distance, first assume that the smallest value of A is not larger than the smallest value of B. Assume without loss of generality (by permuting both v and w) that v_1 is of the smallest value among those of A. It is not hard to see that in this case one can make w_1 be of the smallest value among those of B without increasing the distance (by swapping the two values), and from here one can proceed by induction over |A| = |B|.
We now prove two lemmas that will be useful for the proof of Lemma 12.1.
Lemma 13.3. Let A and B be two multisets of the same size, both with members whose values range in {0, α_1, . . . , α_r}. Let m_j be the number of appearances of α_j in A, and n_j the corresponding number in B. If m_j ≤ n_j for every 1 ≤ j ≤ r, then the value-distance between A and B is bounded by Σ_{j=1}^r (n_j − m_j)α_j.
Proof. Let v_A = (a_1, . . . , a_l) and v_B = (b_1, . . . , b_l) be two vectors such that a_1 = · · · = a_{m_1} = α_1, a_{m_1+1} = · · · = a_{n_1} = 0 and b_1 = · · · = b_{n_1} = α_1, and similarly for j ∈ {1, . . . , r − 1}, a_{n_j+1} = · · · = a_{n_j+m_{j+1}} = α_{j+1}, a_{n_j+m_{j+1}+1} = · · · = a_{n_j+n_{j+1}} = 0 and b_{n_j+1} = · · · = b_{n_j+n_{j+1}} = α_{j+1}. For k > Σ_{j=1}^r n_j, we set a_k = b_k = 0. The vectors v_A and v_B conform to the multisets A and B respectively, and the ℓ1 distance between the two vectors is Σ_{j=1}^r (n_j − m_j)α_j, so the lemma follows.
Lemma 13.4. Let µ be a probability distribution over {1, . . . , n}, and let µ̃ be a vector of size n where each entry is a real number in the interval [0, 1], such that Σ_{i∈[n]} |µ(i) − µ̃(i)| ≤ ε. Let µ̂ be a probability distribution over {1, . . . , n} defined as µ̂(i) = µ̃(i)/Σ_{i∈[n]} µ̃(i) for all i ∈ [n]. Then Σ_{i∈[n]} |µ(i) − µ̂(i)| ≤ 5ε.
Proof. We have |Σ_{i∈[n]} µ(i) − Σ_{i∈[n]} µ̃(i)| ≤ ε. Therefore, 1 − ε ≤ Σ_{i∈[n]} µ̃(i) ≤ 1 + ε. If Σ_{i∈[n]} µ̃(i) < 1, then µ̂(i) ≤ µ̃(i)/(1 − ε) and µ̂(i) ≥ µ̃(i). Therefore, µ̂(i) ≤ (1 + 2ε)µ̃(i) and hence 0 ≤ µ̂(i) − µ̃(i) ≤ 2εµ̃(i). If Σ_{i∈[n]} µ̃(i) ≥ 1, then µ̂(i) ≥ µ̃(i)/(1 + ε) ≥ (1 − ε)µ̃(i) and µ̂(i) ≤ µ̃(i). Therefore 0 ≤ µ̃(i) − µ̂(i) ≤ εµ̃(i). Thus |µ̃(i) − µ̂(i)| ≤ 2εµ̃(i) in all cases, for all i. Now, Σ_{i∈[n]} |µ(i) − µ̂(i)| ≤ Σ_{i∈[n]} |µ(i) − µ̃(i)| + Σ_{i∈[n]} |µ̃(i) − µ̂(i)|. Since Σ_{i∈[n]} |µ̃(i) − µ̂(i)| ≤ 2ε Σ_{i∈[n]} µ̃(i) ≤ 2ε(1 + ε), we get that Σ_{i∈[n]} |µ(i) − µ̂(i)| ≤ 5ε.
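The following small sketch (with names of our own choosing) makes Definition 13.1, Observation 13.2 and the renormalization step of Lemma 13.4 concrete: value-distance is computed by sorting, and a near-distribution vector is rescaled to a true distribution.

```python
def value_distance(A, B):
    """Value-distance of two equal-size multisets of reals (Definition 13.1),
    computed as in Observation 13.2: sort both and take the l1 distance."""
    assert len(A) == len(B)
    return sum(abs(a - b) for a, b in zip(sorted(A), sorted(B)))

def renormalize(mu_tilde):
    """Rescale a nonnegative vector of total mass close to 1 into an actual
    probability distribution, as in Lemma 13.4."""
    total = sum(mu_tilde)
    return [x / total for x in mu_tilde]

print(value_distance([0.1, 0.4], [0.2, 0.2]))  # |0.1-0.2| + |0.4-0.2| = 0.3 (up to rounding)
print(renormalize([0.05, 0.2, 0.7]))           # entries now sum to 1
```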
Now we recall the definition of an ε-trimming sampler from [CFGM13].
Definition 13.5 (ε-trimming sampler). An ε-trimming sampler with s samples for a distribution µ supported over [n] is an algorithm that has conditional query access to the distribution µ and returns s pairs of values (r, µ̄(r)) (for r = 0, µ̄(r) is not output) from a distribution µ̄ supported on {0} ∪ [n], such that Σ_{i∈[n]} |µ(i) − µ̄(i)| ≤ 4ε, and each r is independently drawn from µ̄. Furthermore, there is a set P of poly(log n, 1/ε) real numbers such that for all i either µ̄(i) = 0 or µ̄(i) ∈ P.
The existence of an ε-trimming sampler with a small set of values was proved in [CFGM13]. Let us formally state this lemma.
Lemma 13.6 ([CFGM13]). Given conditional query access to a distribution µ supported on [n], there is an ε-trimming sampler that makes 32s · ε⁻⁴ · log⁵ n · log(sδ⁻¹ log n) many conditional queries to µ, and returns, with probability at least 1 − δ, a sequence of s samples from a distribution µ̄, where P = { ε(1 + ε)^{i−1}/n | 1 ≤ i ≤ k − 1 } for k = log n log(ε⁻¹)/log²(1 + ε).
13.1 Proving the main lemma
The proof of Lemma 12.1 depends on the following technical lemma.
Lemma 13.7. Let µ be a distribution supported over [n], and let P = (P_0, P_1, P_2, . . . , P_r) be a partition of [n] into r + 1 subsets with the following properties.
1. For each P_k ∈ P, µ(i) = p_k for every i ∈ P_k, where p_1, . . . , p_r (but not p_0) are known.
2. Given an i ∈ [n] sampled from µ, we can find the k such that i ∈ P_k.
Using s = (6r/ε²) log(r/δ) samples from µ, with probability at least 1 − δ, we can find m_1, . . . , m_r such that m_k ≤ |P_k| for all k ∈ [r], and Σ_{k∈[r]} p_k(|P_k| − m_k) ≤ 4ε.
Proof. Take s samples from µ. For each k ∈ [r], let s_k be the number of samples that fall in P_k; each sample falls in P_k with probability p_k|P_k|, so E[s_k] = s p_k |P_k|. If p_k|P_k| ≥ 1/r, then by Chernoff bounds we know that
Pr[(1 − ε)E[s_k] ≤ s_k ≤ (1 + ε)E[s_k]] ≥ 1 − 2e^{−ε² E[s_k]/3}.
By the choice of s, with probability at least 1 − δ/r, (1 − ε)p_k|P_k| ≤ s_k/s ≤ (1 + ε)p_k|P_k|. On the other hand, if p_k|P_k| < 1/r, then by Chernoff bounds we have
Pr[p_k|P_k| − ε/r ≤ s_k/s ≤ p_k|P_k| + ε/r] = Pr[(1 − ε/(r p_k|P_k|)) s p_k|P_k| ≤ s_k ≤ (1 + ε/(r p_k|P_k|)) s p_k|P_k|] ≥ 1 − 2 exp(−ε² s/(3r² p_k|P_k|)) ≥ 1 − 2e^{−ε² s/(3r)}.
By the choice of s, with probability at least 1 − δ/r, p_k|P_k| − ε/r ≤ s_k/s ≤ p_k|P_k| + ε/r.
With probability at least 1 − δ, we get an estimate α_k = s_k/s for every k ∈ [r] such that α_k ≤ max{p_k|P_k| + ε/r, (1 + ε)p_k|P_k|} and α_k ≥ min{p_k|P_k| − ε/r, (1 − ε)p_k|P_k|}. From now on we assume that α_k satisfies the above bounds, and define α′_k = min{α_k − ε/r, α_k/(1 + ε)}. Notice that α′_k ≤ p_k|P_k|. Furthermore, α′_k ≥ min{p_k|P_k| − 2ε/r, (1 − 2ε)p_k|P_k|}. Set m_k = ⌈α′_k/p_k⌉. Since α′_k/p_k ≤ |P_k| ∈ Z, we have m_k ≤ |P_k| for all k.
Now, Σ_{k∈[r]} p_k(|P_k| − m_k) ≤ Σ_{k∈[r]} (p_k|P_k| − α′_k). Under the above assumption on the value of α_k, for every k such that p_k|P_k| < 1/r we have α′_k ≥ p_k|P_k| − 2ε/r, hence this difference is at most 2ε/r. For every k such that p_k|P_k| ≥ 1/r, α′_k ≥ (1 − 2ε)p_k|P_k|; for any such k, the difference p_k|P_k| − α′_k is at most 2εp_k|P_k|. Therefore, Σ_{k∈[r]} p_k(|P_k| − m_k) is at most 4ε.
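The estimator used in the proof of Lemma 13.7 is simple enough to spell out as code. The sketch below (our own names) assumes, as in the lemma, that every sample comes labeled with the index k of the part containing it; samples landing in P_0 are counted but not estimated, and the clamping at zero is a code-level guard that plays no role in the analysis.

```python
import math
from collections import Counter

def estimate_part_sizes(labeled_samples, p, eps):
    """Estimate m_k <= |P_k| from labeled samples, as in the proof of Lemma 13.7.

    labeled_samples -- list of part indices, one per sample (0 for the part P_0
                       whose common value p_0 is unknown; those are ignored here)
    p               -- dict mapping each k in {1,...,r} to the known value p_k
    eps             -- the accuracy parameter of the lemma
    """
    s = len(labeled_samples)
    r = len(p)
    counts = Counter(labeled_samples)          # counts[k] is s_k
    m = {}
    for k, pk in p.items():
        alpha = counts[k] / s                  # empirical estimate of p_k * |P_k|
        # Shrink the estimate so that it sits below p_k * |P_k| with high probability.
        alpha_prime = min(alpha - eps / r, alpha / (1 + eps))
        alpha_prime = max(alpha_prime, 0.0)    # guard, not needed in the analysis
        m[k] = math.ceil(alpha_prime / pk)     # m_k = ceil(alpha'_k / p_k)
    return m
```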
Proof of Lemma 12.1. Given a distribution µ and an interval partition I = (I_1, I_2, . . . , I_r), let µ̄ be the distribution presented by the ε/8-trimming sampler in Lemma 13.6. Let I_{j,k} ⊆ [n] be the set of indexes i such that i ∈ I_j and µ̄(i) = ε(1 + ε/8)^{k−1}/(8n). Thus, each interval I_j in I is now split into subsets I_{j,0}, I_{j,1}, I_{j,2}, . . . , I_{j,ℓ}, where ℓ ≤ log n log(8/ε)/log²(1 + ε/8) and I_{j,0} is the set of indexes in I_j such that µ̄(i) = 0. Using Lemma 13.7 (applied to the samples provided by Lemma 13.6), with s = r · poly(log n, 1/ε, log(1/δ)) samples from the distribution µ̄, we can estimate, with probability at least 1 − δ, values m_{j,k} such that m_{j,k} ≤ |I_{j,k}| for all k > 0, and the following holds.
Σ_{j,k: k>0} (ε(1 + ε/8)^{k−1}/(8n)) (|I_{j,k}| − m_{j,k}) ≤ ε/10.
For every j, let M_{I_j} be the inventory provided by m_{j,1}, . . . , m_{j,ℓ} and m_{j,0} = |I_j| − Σ_{k∈[ℓ]} m_{j,k}. Thus, we have an inventory sequence M̃_I = (M_{I_1}, M_{I_2}, . . . , M_{I_r}) that is ε/10-close in value-distance to the corresponding atlas of µ̄ (where we need to add the interval {0} to the partition to cover its entire support). Corresponding to M̃_I, there is a vector µ̃ that is ε/10-close to µ̄. Using Lemma 13.4, we have a distribution µ̂ that is ε/2-close to µ̄. Since the [n] portion of µ̄ is ε/2-close to µ, by the triangle inequality, µ̂ is ε-close to µ. Thus A = (I, M_I), where M_I is obtained by multiplying all members of M̃_I by the same factor used to produce µ̂ from µ̃, is an atlas for a distribution that is ε-close to µ. We need s · poly(log n, log s, 1/ε, log(1/δ)) conditional samples from µ to get s samples from µ̄. Therefore, we require r · poly(log n, 1/ε, log(1/δ)) conditional samples to construct the atlas A_I.
References
[AAK+07] Noga Alon, Alexandr Andoni, Tali Kaufman, Kevin Matulef, Ronitt Rubinfeld, and Ning Xie, Testing k-wise and almost k-wise independence, Proceedings of the Thirty-ninth Annual ACM Symposium on Theory of Computing (New York, NY, USA), STOC '07, ACM, 2007, pp. 496–505.
[ACK15] Jayadev Acharya, Clément L. Canonne, and Gautam Kamath, A chasm between identity and equivalence testing with conditional queries, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2015, August 24-26, 2015, Princeton, NJ, USA (Naveen Garg, Klaus Jansen, Anup Rao, and José D. P. Rolim, eds.), LIPIcs, vol. 40, Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2015, pp. 449–466.
[ADK15] Jayadev Acharya, Constantinos Daskalakis, and Gautam Kamath, Optimal testing for properties of distributions, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada (Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, eds.), 2015, pp. 3591–3599.
[BFF+01] Tugkan Batu, Lance Fortnow, Eldar Fischer, Ravi Kumar, Ronitt Rubinfeld, and Patrick White, Testing random variables for independence and identity, 42nd Annual Symposium on Foundations of Computer Science, FOCS 2001, 14-17 October 2001, Las Vegas, Nevada, USA, 2001, pp. 442–451.
[BFR+00] Tugkan Batu, Lance Fortnow, Ronitt Rubinfeld, Warren D. Smith, and Patrick White, Testing that distributions are close, 41st Annual Symposium on Foundations of Computer Science, FOCS 2000, 12-14 November 2000, Redondo Beach, California, USA, IEEE Computer Society, 2000, pp. 259–269.
[BKR04] Tugkan Batu, Ravi Kumar, and Ronitt Rubinfeld, Sublinear algorithms for testing monotone and unimodal distributions, Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, June 13-16, 2004 (László Babai, ed.), ACM, 2004, pp. 381–390.
[Can15] Clément L. Canonne, A survey on distribution testing: Your data is big. But is it blue?, Electronic Colloquium on Computational Complexity (ECCC) 22 (2015), 63.
[CDGR15] Clément L.
Canonne, Ilias Diakonikolas, Themis Gouleakis, and Ronitt Rubinfeld, Testing shape restrictions of discrete distributions, CoRR abs/1507.03558 (2015).
[CFGM13] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah, On the power of conditional samples in distribution testing, Innovations in Theoretical Computer Science, ITCS '13, Berkeley, CA, USA, January 9-12, 2013, 2013, pp. 561–580.
[CRS15] Clément L. Canonne, Dana Ron, and Rocco A. Servedio, Testing probability distributions using conditional samples, SIAM J. Comput. 44 (2015), no. 3, 540–616.
[Dia16] Ilias Diakonikolas, Learning structured distributions, Handbook of Big Data (2016), 267.
[DK16] Ilias Diakonikolas and Daniel M. Kane, A new approach for testing properties of discrete distributions, CoRR abs/1601.05557 (2016).
[DLM+07] Ilias Diakonikolas, Homin K. Lee, Kevin Matulef, Krzysztof Onak, Ronitt Rubinfeld, Rocco A. Servedio, and Andrew Wan, Testing for concise representations, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings, 2007, pp. 549–558.
[GR00] Oded Goldreich and Dana Ron, On testing expansion in bounded-degree graphs, Electronic Colloquium on Computational Complexity (ECCC) 7 (2000), no. 20.
[Hoe63] Wassily Hoeffding, Probability inequalities for sums of bounded random variables, J. Amer. Statist. Assoc. 58 (1963), 13–30. MR 0144363
[ILR12] Piotr Indyk, Reut Levi, and Ronitt Rubinfeld, Approximating and testing k-histogram distributions in sub-linear time, Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, 2012, pp. 15–22.
[Pan08] Liam Paninski, A coincidence-based test for uniformity given very sparsely sampled discrete data, IEEE Transactions on Information Theory 54 (2008), no. 10, 4750–4755.
[Ser10] Rocco A. Servedio, Testing by implicit learning: A brief survey, Property Testing - Current Research and Surveys [outgrowth of a workshop at the Institute for Computer Science (ITCS) at Tsinghua University, January 2010], 2010, pp. 197–210.
[VV10] Gregory Valiant and Paul Valiant, A CLT and tight lower bounds for estimating entropy, Electronic Colloquium on Computational Complexity (ECCC) 17 (2010), 179.
[VV14] Gregory Valiant and Paul Valiant, An automatic inequality prover and instance optimal identity testing, 55th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2014, Philadelphia, PA, USA, October 18-21, 2014, 2014, pp. 51–60.