Low-quality dimension reduction and high-dimensional Approximate Nearest Neighbor

Ioannis Psarros

Submitted to the Department of Mathematics, Graduate Program in Logic, Algorithms and Computation, at the University of Athens, February 2015.
© University of Athens 2015. All rights reserved.

Abstract

The approximate nearest neighbor problem (ANN) in Euclidean settings is a fundamental question, which has been addressed by two main approaches. Data-dependent space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality. On the other hand, locality sensitive hashing has polynomial dependence in the dimension, sublinear query time with an exponent inversely proportional to the error factor ε, and subquadratic space requirement.

We generalize the Johnson-Lindenstrauss lemma to define "low-quality" mappings to a Euclidean space of significantly lower dimension, such that they satisfy a requirement weaker than approximately preserving all distances or even preserving the nearest neighbor. This mapping guarantees, with arbitrarily high probability, that an approximate nearest neighbor lies among the k approximate nearest neighbors in the projected space. This leads to a randomized tree-based data structure that avoids the curse of dimensionality for (1+ε)-ANN. Our algorithm, given n points in dimension d, achieves space usage in O(dn), preprocessing time in O(dn log n), and query time in O(d n^ρ log n), where ρ is proportional to 1 − 1/ln ln n, for fixed ε ∈ (0, 1). It employs a data structure, such as BBD-trees, that efficiently finds k approximate nearest neighbors. The dimension reduction is larger if one assumes that pointsets possess some structure, namely bounded expansion rate.

Thesis Supervisor: Ioannis Z. Emiris
Title: Professor

Acknowledgments

First of all I would like to thank my parents, as well as the rest of my family, for their continued support. I would also like to thank my advisor Ioannis Z. Emiris for his advice and all the stimulating conversations we have had. Furthermore, special thanks to all members of the Erga lab for patiently answering my questions and creating more.

Contents

1 Introduction
  1.1 Existing work
  1.2 Our contribution
2 Randomized Embeddings and Approximate Nearest Neighbor Search
  2.1 Low Quality Randomized Embeddings
  2.2 Approximate Nearest Neighbor Search
3 Exploiting hidden structure
  3.1 Bounded Expansion Rate
  3.2 Doubling dimension
4 Experiments
5 Open Questions

List of Figures

1-1 Two approximate nearest neighbors.
3-1 Bad case.
4-1 Our approach against E2LSH: query time.
4-2 Our approach against E2LSH: memory usage.

List of Tables

1.1 Comparison with other results.
Chapter 1

Introduction

Nearest neighbor searching is a fundamental computational problem. Let X be a set of n points in R^d and let d(p, p') be the (Euclidean) distance between any two points p and p'. The problem consists in reporting, given a query point q, a point p ∈ X such that d(p, q) ≤ d(p', q) for all p' ∈ X; p is then said to be a "nearest neighbor" of q. For this purpose, we preprocess X into a data structure. However, an exact solution to high-dimensional nearest neighbor search, in sublinear time, requires prohibitively heavy resources. Thus, many techniques focus on the less demanding task of computing the approximate nearest neighbor (ANN). Given a parameter ε ∈ (0, 1), a (1+ε)-approximate nearest neighbor to a query q is a point p in X such that d(q, p) ≤ (1+ε) · d(q, p') for all p' ∈ X. Hence, under approximation, the answer can be any point whose distance from q is at most (1+ε) times larger than the distance between q and its nearest neighbor. There may naturally be more than one approximate nearest neighbor, and the nearest neighbor itself is also an approximate nearest neighbor, as depicted in Figure 1-1.

Figure 1-1: Two approximate nearest neighbors.

1.1 Existing work

As mentioned above, an exact solution to high-dimensional nearest neighbor search, in sublinear time, requires heavy resources. One notable solution to the problem [Mei93] shows that nearest neighbor queries can be answered in O(d^5 log n) time, using O(n^{d+δ}) space, for arbitrary δ > 0.

One class of methods for ε-ANN may be called data-dependent, since the decisions taken for partitioning the space are affected by the given data points. In [AMN+98] they introduced the Balanced Box-Decomposition (BBD) trees. The BBD-trees data structure achieves query time O(c log n) with c ≤ d/2 · ⌈1 + 6d/ε⌉^d, using space in O(dn) and preprocessing time in O(dn log n). BBD-trees can be used to retrieve the k ≥ 1 approximate nearest neighbors at an extra cost of O(d log n) per neighbor. BBD-trees have also proved to be very practical and have been implemented in the software library ANN.

Another data structure is the Approximate Voronoi Diagram (AVD). AVDs are shown to establish a tradeoff between the space complexity of the data structure and the query time it supports [AMM09]. With a tradeoff parameter γ, 2 ≤ γ ≤ 1/ε, the query time is O(log(nγ) + 1/(εγ)^{(d−1)/2}) and the space is O(nγ^{d−1} log(1/ε)). They are implemented on a hierarchical quadtree-based subdivision of space into cells, each storing a number of representative points, such that for any query point lying in the cell, at least one of the representatives is an approximate nearest neighbor. Further improvements to the space-time tradeoffs for ANN are obtained in [AdFM11].

One might directly apply the celebrated Johnson-Lindenstrauss lemma and map the points to O(log n / ε²) dimensions with distortion equal to 1+ε in order to improve space requirements. In particular, AVDs combined with the Johnson-Lindenstrauss lemma require n^{O(log(1/ε)/ε²)} space, which is prohibitive if ε ≪ 1, and query time polynomial in log n, d and 1/ε. On the other hand, BBD-trees combined with the JL lemma require O(dn) space but query time superlinear in n. Notice that we relate the approximation error with the distortion for simplicity. Our approach (Theorem 10) requires O(dn) space and has query time sublinear in n and polynomial in d.

In high dimensional spaces, data-dependent data structures are affected by the curse of dimensionality.
This means that, when the dimension increases, either the query time or the required space increases exponentially. It is known [HIN12] that the (1+ε)-ANN problem reduces to the (1+ε, r)-Approximate Near Neighbor problem with a roughly logarithmic increase in storage requirement and query time. A data structure which solves the (1+ε, r)-ANN problem answers queries of the following form: for a query q, if there exists a point at distance ≤ r, return a point at distance ≤ (1+ε)r. The (1+ε, r)-ANN problem is already hard when the dimension is high.

An important method conceived for the (1+ε, r)-ANN problem for high dimensional data is locality sensitive hashing (LSH). LSH induces a data-independent space partition and is dynamic, since it supports insertions and deletions. It relies on the existence of locality sensitive hash functions, which are more likely to map similar objects to the same bucket. The existence of such functions depends on the metric space. In general, LSH requires roughly O(dn^{1+ρ}) space and O(dn^ρ) query time for some parameter ρ ∈ (0, 1). In [AI08] they show that in the Euclidean case one can have ρ ≤ 1/(1+ε)², which matches the lower bound for hashing algorithms proved in [OWZ11]. Lately, it was shown that it is possible to overcome this limitation with an appropriate change in the scheme, which achieves ρ ≤ 7/(8(1+ε)²) + O(1/(1+ε)³) + o(1) [AINR14]. One approach with better space requirement is achieved in [And09]. In particular, they present a data structure for the (1+ε, r)-ANN problem achieving near-linear space and dn^{O(1/(1+ε)²)} query time. A different approach [Pan06] achieves near-linear space but query time proportional to O(dn^{2/(1+ε)}). For comparison, in Theorem 10 we show that it is possible to use O(dn) space, with query time roughly O(dn^ρ), where ρ < 1 is now higher than the one appearing in LSH.

Exploiting the structure of the input is an important way to improve the complexity of nearest neighbor search. In particular, a significant amount of work has been done for pointsets with low doubling dimension. In [HPM05], they provide an algorithm for ANN with expected preprocessing time O(2^{dim(X)} n log n), space O(2^{dim(X)} n) and query time O(2^{dim(X)} log n + ε^{−O(dim(X))}) for any finite metric space X of doubling dimension dim(X). In [IN07] they provide randomized embeddings that preserve the nearest neighbor with constant probability, for points lying on low doubling dimension manifolds in Euclidean settings. Naturally, such an approach can be easily combined with any known data structure for ANN. In [DF08] they present random projection trees, which adapt to pointsets of low doubling dimension. Like kd-trees, every split partitions the pointset into subsets of roughly equal cardinality; in fact, instead of splitting at the median, they add a small amount of "jitter". Unlike kd-trees, the space is split with respect to a random direction, not necessarily parallel to the coordinate axes. Classic kd-trees also adapt to the doubling dimension of randomly rotated data [Vem12]. However, for both techniques, no related theoretical arguments about the efficiency of ε-ANN search were given.

In [KR02], they introduce a different notion of intrinsic dimension for an arbitrary metric space, namely its expansion rate c; it is formally defined in Section 3.1.
The doubling dimension is a more general notion of intrinsic dimension, in the sense that when a finite metric space has bounded expansion rate it also has bounded doubling dimension, but the converse does not hold [GKL03]. Several efficient solutions are known for metrics with bounded expansion rate, including for the problem of exact nearest neighbor. In [KL04], they present a data structure which requires c^{O(1)} n space and answers queries in c^{O(1)} ln n time. Cover Trees [BKL06] require O(n) space and each query costs O(c^{12} log n) time for exact nearest neighbors. In Theorem 13, we provide a data structure for the ε-ANN problem with linear space and O((C^{1/ε³} + log n) d log n / ε²) query time, where C depends on c. The result concerns pointsets in the d-dimensional Euclidean space.

1.2 Our contribution

Tree-based space partitioning techniques perform well when the dimension is relatively low, but are affected by the curse of dimensionality. To that end, randomized methods like Locality Sensitive Hashing are more efficient when the dimension is high. One may also apply the Johnson-Lindenstrauss Lemma to improve upon standard space partitioning techniques, but the properties guaranteed are stronger than what is required for efficient approximate nearest neighbor search.

We define a "low-quality" mapping to a Euclidean space of dimension O(log(n/k) / ε²), such that an approximate nearest neighbor lies among the k approximate nearest neighbors in the projected space. This leads to our main Theorem 10, which offers a new randomized algorithm for approximate nearest neighbor search with the following complexity. Given n points in R^d, the data structure, which is based on Balanced Box-Decomposition (BBD) trees, requires O(dn) space and reports a (1+ε)²-approximate nearest neighbor in time O(dn^ρ log n), where the function ρ < 1 is proportional to 1 − 1/ln ln n for fixed ε ∈ (0, 1) and shall be specified in Section 2.2. The total preprocessing time is O(dn log n). For each query q ∈ R^d, the preprocessing phase succeeds with probability > 1 − δ for any constant δ ∈ (0, 1). The low-quality embedding is extended to pointsets with bounded expansion rate c (see Section 3.1 for exact definitions). The pointset is now mapped to a Euclidean space of dimension roughly O(log c / ε²) for large enough k. One part of this work also appears in [AEP15]. A comparison of our result with previous work can be viewed in Table 1.1.

                      Space                          Query
  BBD-trees           O(dn)                          O((d/ε)^d log n)
  AVD                 Õ(n/ε^d)                       O(log n)
  BBD-trees + JL      O(dn)                          ω(n)
  AVD + JL            n^{O(log(1/ε)/ε²)}             O(log n)
  LSH                 Õ(dn^{1+1/(1+ε)²})             Õ(dn^{1/(1+ε)²})
  multi-probe LSH     Õ(dn)                          Õ(dn^{2/(1+ε)})
  Our approach        Õ(dn)                          Õ(dn^{1−ε²/(C log log n)})

Table 1.1: Comparison with other results.

Chapter 2

Randomized Embeddings and Approximate Nearest Neighbor Search

2.1 Low Quality Randomized Embeddings

This chapter examines standard dimensionality reduction techniques and extends them to approximate embeddings optimized to our setting. In the following, we denote by ‖·‖ the Euclidean norm and by |·| the cardinality of a set. Let us start with the classic Johnson and Lindenstrauss lemma:

Proposition 1 ([JL84]). For any set X ⊂ R^d and ε ∈ (0, 1), there exists a distribution over linear mappings f : R^d → R^{d'}, where d' = O(log |X| / ε²), such that for any p, q ∈ X,

  (1 − ε)‖p − q‖² ≤ ‖f(p) − f(q)‖² ≤ (1 + ε)‖p − q‖².

In the initial proof [JL84], they show that this can be achieved by orthogonally projecting the pointset on a random linear subspace of dimension d'.
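As a minimal, self-contained illustration of such a randomized linear map, the sketch below samples the Gaussian-matrix variant discussed next; the NumPy code, the constant in the target dimension, and the function name jl_map are our own choices for illustration, not part of the cited works.

```python
import numpy as np

def jl_map(d, d_prime, rng=None):
    """Sample a random linear map R^d -> R^{d'} in the spirit of Proposition 1.

    Each entry of A is an independent N(0, 1) variable; dividing by sqrt(d')
    makes squared norms correct in expectation.
    """
    rng = np.random.default_rng() if rng is None else rng
    A = rng.standard_normal((d_prime, d))
    return lambda x: (A @ x) / np.sqrt(d_prime)

# Toy check of the distortion guarantee on one pair of random points.
if __name__ == "__main__":
    n, d, eps = 500, 1000, 0.25
    d_prime = int(np.ceil(8 * np.log(n) / eps ** 2))  # the constant 8 is illustrative
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    f = jl_map(d, d_prime, rng)
    ratio = np.sum((f(X[0]) - f(X[1])) ** 2) / np.sum((X[0] - X[1]) ** 2)
    print(f"squared-distance ratio for one pair: {ratio:.3f} (target: within 1 ± {eps})")
```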
In [DG02], they provide a proof based on elementary probabilistic techniques. In [IM98], they prove that it suffices to apply a gaussian matrix πΊ on the pointset. πΊ is a π × πβ² matrix with each of its entries independent random variables given by the standard normal distribution π (0, 1). Instead of a gaussian 19 matrix, we can apply a matrix whose entries are independent random variables with uniformly distributed values in {β1, 1} [Ach03]. However, it has been realized that this notion of randomized embedding is somewhat stronger than what is required for approximate nearest neighbor searching. The following definition has been introduced in [IN07] and focuses only on the distortion of the nearest neighbor. Definition 2. Let (π, ππ ), (π, ππ ) be metric spaces and π β π . A distribution over mappings π : π β π is a nearest-neighbor preserving embedding with distortion π· β₯ 1 and probability of correctness π β [0, 1] if, βπ β₯ 1 and βπ β π , with probability at least π , when π₯ β π is such that π (π₯) is an π-ANN of π (π) in π (π), then π₯ is a (π· · (1 + π))-approximate nearest neighbor of π in π. While in the ANN problem we search one point which is approximately nearest, in the π approximate nearest neighbors problem (πANNs) we seek an approximation of the π nearest points, in the following sense. Let π be a set of π points in Rπ , let π β Rπ and 1 β€ π β€ π. The problem consists in reporting a sequence π = {π1 , · · · , ππ } of π distinct points such that the π-th point is an (1 + π) approximation to the π-th nearest neighbor of π. Furthermore, the following assumption is satisfied by the search routine of tree-based data structures such as BBD-trees. Assumption 3. Let π β² β π be the set of points visited by the (1 + π)-πANNs search such that π = {π1 , · · · , ππ } β π β² is the set of points which are the π nearest to the query point π among the points in π β² . We assume that βπ₯ β π β π β² , π(π₯, π) > π(ππ , π)/(1 + π). Assuming the existence of a data structure which solves π-πANNs, we can weaken Definition 2 as follows. Definition 4. Let (π, ππ ), (π, ππ ) be metric spaces and π β π . A distribution over mappings π : π β π is a locality preserving embedding with distortion π· β₯ 1, probability of correctness π β [0, 1] and locality parameter π, if βπ β₯ 1 and βπ β π , with probability at least π , when 20 π = {π (π1 ), · · · , π (ππ )} is a solution to (1 + π)-πANNs for π, under Assumption 3 then there exists π (π₯) β π such that π₯ is a π· · π approximate nearest neighbor of π in π. According to this definition we can reduce the problem of (1 + π)-ANN in dimension π to the problem of computing π approximate nearest neighbors in dimension πβ² < π. We use the Johnson-Lindenstrauss dimensionality reduction technique. β² Lemma 5. [DG02] There exists a distribution over linear maps π΄ : Rπ β Rπ s.t., for any π β Rπ with βπβ = 1: β if π½ < 1 then Pr[βπ΄πβ2 β€ π½ 2 · πβ² ] π β€ ππ₯π( π2 (1 β π½ 2 + 2 ln π½), β² β if π½ > 1 then Pr[βπ΄πβ2 β₯ π½ 2 · πβ² ] π β€ ππ₯π( π2 (1 β π½ 2 + 2 ln π½). β² We prove the following lemma which will be useful. Lemma 6. For all π β N, π β (0, 1), the following holds: (1 + π/2) (1 + π/2)2 β 2 ln π β 1 > 0.05(π + 1)π2 . π 2 (2 (1 + π)) 2 (1 + π) Proof. For π = 0, it can be checked that if π β (0, 1) then, (1+π/2)2 (1+π)2 β 2 ln 1+π/2 β 1 > 0.05π2 . 1+π This is our induction basis. Let π β₯ 0 be such that the induction hypothesis holds, namely (1+π/2)2 (2π (1+π))2 β 2 ln (1+π/2) β 1 > 0.05(π + 1)π2 . 
Then, to prove the induction step 2π (1+π) (1 + π/2) 3 (1 + π/2)2 1 (1 + π/2)2 2 β 2 ln + 2 ln 2 β 1 > 0.05(π + 1)π β + 2 ln 2 > 4 (2π (1 + π))2 2π (1 + π) 4 (2π (1 + π))2 > 0.05(π + 1)π2 β 3 + 2 ln 2 > 0.05(π + 2)π2 , 4 since π β (0, 1). A simple calculation shows the following. Lemma 7. For all π₯ > 0, it holds: (1 + π₯)2 1+π₯ β 2 ln( ) β 1 < (1 + π₯)2 β 2 ln(1 + π₯) β 1. 2 (1 + 2π₯) 1 + 2π₯ 21 (2.1) Theorem 8. Under the notation of Definition 4, there exists a randomized mapping π : Rπ β β² π /π2 ), distortion π· = 1 + π and probability of Rπ which satisfies Definition 4 for πβ² = π(log πΏπ success 1 β πΏ, for any constant πΏ β (0, 1). Proof. Let π be a set of π points in Rπ and consider map β² π : Rπ β Rπ : π£ β¦β βοΈ π/πβ² · π΄ π£, where π΄ is a matrix as in Definition 4. Wlog the query point π lies in the origin and its nearest neighbor π’ lies at distance 1 from π. We denote by π β₯ 1 the approximation ratio guaranteed by the assumed data structure. That is the assumed data structure solves the π-πANNs problem. For each point π₯, πΏπ₯ = βπ΄π¦β2 where π¦ = π₯/βπ₯β. Let π be the random variable whose value indicates the number of βbadβ candidates, that is π = | {π₯ β π : βπ₯ β πβ > πΎ β§ πΏπ₯ β€ π½ 2 πβ² · } |, πΎ2 π where we define π½ = π(1 + π/2), πΎ = π(1 + π). Hence, by Lemma 5, πβ² π½2 π½ E[π ] β€ π · ππ₯π( (1 β 2 + 2 ln )). 2 πΎ πΎ By Markovβs inequality, Pr[π β₯ π] β€ E[π ] πβ² π½2 π½ =β Pr[π β₯ π] β€ π · ππ₯π( (1 β 2 + 2 ln ))/π. π 2 πΎ πΎ The event of failure is defined as the disjunction of two events: [ π β₯ π ] β¨ [ πΏπ’ β₯ (π½/π)2 πβ² ], π and its probability is at most equal to πβ² Pr[π β₯ π] + ππ₯π( (1 β (π½/π)2 + 2 ln(π½/π))), 2 22 by applying again Lemma 5. Now, we bound these two terms. For the first we choose πβ² such that πβ² β₯ 2 ln 2π πΏπ π½2 πΎ2 β 1 β 2 ln π½πΎ . (2.2) Therefore, β² ππ₯π( π2 (1 β π½2 πΎ2 + 2 ln π½πΎ )) π β€ πΏ πΏ =β Pr[π β₯ π] β€ . 2π 2 (2.3) Notice that π β€ π and due to expression (2.1), we obtain (π½/πΎ)2 β 2 ln(π½/πΎ) β 1 < (π½/π)2 β 2 ln(π½/π) β 1. Hence, inequality (2.2) implies inequality (2.4): πβ² β₯ 2 ln 2πΏ . (π½/π)2 β 1 β 2 ln(π½/π) (2.4) Therefore, πβ² πΏ ππ₯π( (1 β (π½/π)2 + 2 ln(π½/π))) β€ . 2 2 (2.5) Inequalities (2.3), (2.5) imply that the total probability of failure is at most πΏ. Using Lemma 6 for π = 0, we obtain, that there exists πβ² such that πβ² = π(log π 2 /π ) πΏπ and with probability of success at least 1 β πΏ, these two events occur: β βπ (π) β π (π’)β β€ (1 + 2π )βπ’ β πβ. β |{π β π|βπ β πβ > π(1 + π)βπ’ β πβ =β βπ (π) β π (π)β β€ π(1 + π/2)βπ’ β πβ}| < π. Now consider the case when the random experiment succeeds and let π = {π (π1 ), ..., π (ππ )} a solution of the π-πANNs problem in the projected space, given by a data-structure which satisfies Assumption 3. We have that βπ (π₯) β π (π) β π β² , βπ (π₯) β π (π)β > βπ (ππ ) β π (π)β/π where π β² is the set of all points visited by the search routine. Now, if π (π’) β π then π contains the projection of the nearest neighbor. If π (π’) β / π then 23 if π (π’) β / π β² we have the following: βπ (π’) β π (π)β > βπ (ππ ) β π (π)β/π =β βπ (ππ ) β π (π)β < π(1 + π/2)βπ’ β πβ, which means that there exists at least one point π (π* ) β π s.t. βπ β π* β β€ π(1 + π)βπ’ β πβ. Finally, if π (π’) β / π but π (π’) β π β² then βπ (ππ ) β π (π)β β€ βπ (π’) β π (π)β =β βπ (ππ ) β π (π)β β€ (1 + π/2)βπ’ β πβ, which means that there exists at least one point π (π* ) β π s.t. βπ β π* β β€ π(1 + π)βπ’ β πβ. Hence, π satisfies Definition 4 for π· = 1 + π. 
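To make the reduction of this section concrete before describing the data structure, the following sketch implements the pipeline behind Theorem 8: project the pointset with a scaled Gaussian matrix to d' = O(log(n/(δk))/ε²) dimensions, retrieve k nearest candidates of the projected query, and return the best of those candidates measured in the original space. The NumPy code, the constant inside d', and the brute-force search standing in for a BBD-tree are our own simplifications, not the thesis implementation.

```python
import numpy as np

def low_quality_embedding(X, eps, k, delta=0.1, rng=None):
    """Project the rows of X (an n x d array) to d' = O(log(n/(delta*k)) / eps^2) dimensions.

    The constant 4 below is illustrative; Theorem 8 only fixes the order of magnitude of d'.
    Returns the projection matrix A and the projected pointset.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    d_prime = max(1, int(np.ceil(4 * np.log(n / (delta * k)) / eps ** 2)))
    A = rng.standard_normal((d_prime, d)) / np.sqrt(d_prime)
    return A, X @ A.T

def ann_query(q, X, A, PX, k):
    """Report an approximate nearest neighbor of q: take k candidates in the projected
    space, then keep the candidate closest to q in the original space.

    Brute force stands in for the k-ANN structure (e.g. a BBD-tree) assumed in the text.
    """
    pq = A @ q
    cand = np.argsort(np.linalg.norm(PX - pq, axis=1))[:k]
    best = cand[np.argmin(np.linalg.norm(X[cand] - q, axis=1))]
    return best  # index into X of the reported point
```

In the data structure of the next section, the brute-force candidate search is replaced by a BBD-tree built over the projected points, and k is chosen of the order n^ρ.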
2.2 Approximate Nearest Neighbor Search This section combines tree-based data structures which solve (1 + π)-πANNs with the results above, in order to obtain an efficient randomized data structure which solves (1 + π)-ANN. BBD-trees [AMN+98] require π(ππ) space, and allow computing π points, which are (1+π)-approximate nearest neighbors, within time π((β1+6 ππ βπ +π)π log π). The preprocessing time is π(ππ log π). Notice, that BBD-trees satisfy the Assumption 3. The algorithm for the (1 + π)-πANNs search, visits cells in increasing order with respect to their distance from the query point π. If the current cell lies in distance more than ππ /π where ππ is the current distance to the πth nearest neighbor, the search terminates. We apply the random projection for distortion π· = 1 + π, thus relating approximation error to the allowed distortion; this is not required but simplifies the analysis. Moreover, π = ππ ; the formula for π < 1 is determined below. Our analysis then focuses β² β² on the asymptotic behaviour of the term π(β1 + 6 ππ βπ + π). Lemma 9. With the above notation, there exists π > 0 s.t., for fixed π β (0, 1), it holds that β² β² β1 + 6 ππ βπ + π = π(ππ ), where π β€ 1 β π2 /Λ π(π2 + ln(max( 1π , ln π))) < 1 for some appropriate 24 constant πΛ > 1. Proof. Recall that πβ² β€ β² πΛ π2 ln ππ for some appropriate constant πΛ > 0. The constant πΏ is hidden β² β² β² in πΛ. Since ( ππ )π is a decreasing function of π, we need to choose π s.t. ( ππ )π = Ξ(π). Let β² β² β² β² π = ππ . Obviously β1 + 6 ππ βπ β€ (πβ² ππ )π , for some appropriate constant πβ² β (1, 7). Then, by substituting πβ² , π we have: π Λπβ² (1βπ) ln π π Λ(1βπ) πβ² β² ) π3 . (πβ² )π = π π2 ln( π We assume π β (0, 1) is a fixed constant. Hence, it is reasonable to assume that (2.6) 1 π < π. We consider two cases when comparing ln π to π: β 1 π β€ ln π. Substituting π = 1 β π2 2Λ π(π2 +ln(πβ² ln π)) into equation ( 2.6), the exponent of π is bounded as follows: πΛπβ² (1 β π) ln π πΛ(1 β π) ln( )= π2 π3 = β 1 π 2Λ π(π2 1 πΛ · [ln(πβ² ln π) + ln β ln (2π2 + 2 ln(πβ² ln π))] < π. β² + ln(π ln π)) π > ln π. Substituting π = 1 β π2 2Λ π(π2 +ln πβ² ) π into equation( 2.6), the exponent of π is bounded as follows: πΛπβ² (1 β π) ln π πΛ(1 β π) ln( )= π2 π3 = πΛ πβ² πβ² 2 β ln (2π + 2 ln )] < π. · [ln ln π + ln β² π π 2Λ π(π2 + ln ππ ) log π Notice that for both cases πβ² = π( π2 +log ). log π Combining Theorem 8 with Lemma 9 yields the following main theorem. Theorem 10 (Main). Given π points in Rπ , there exists a randomized data structure which requires π(ππ) space and reports an (1 + π)2 -approximate nearest neighbor in time 1 π(πππ log π), where π β€ 1 β π2 /Λ π(π2 + ln(max( , ln π))) π 25 for some appropriate constant πΛ > 1. The preprocessing time is π(ππ log π). For each query π β Rπ , the preprocessing phase succeeds with any constant probability. Proof. The space required to store the dataset is π(ππ). The space used by BBD-trees is π(πβ² π) where πβ² is defined in Lemma 9. We also need π(ππβ² ) space for the matrix π΄ as specified in Theorem 8. Hence, since πβ² < π and πβ² < π, the total space usage is bounded above by π(ππ). The preprocessing consists of building the BBD-tree which costs π(πβ² π log π) time and sampling π΄. Notice that we can sample a πβ² -dimensional random subspace in time π(ππβ²2 ) as follows. First, we sample in time π(ππβ² ), a π × πβ² matrix where its elements are independent random variables with the standard normal distribution π (0, 1). 
Then, we orthonormalize using Gram-Schmidt in time π(ππβ²2 ). Since πβ² = π(log π), the total preprocessing time is bounded by π(ππ log π). For each query we use π΄ to project the point in time π(ππβ² ). Next, we compute its ππ approximate nearest neighbors in time π(πβ² ππ log π) and we check its neighbors with their real coordinates in time π(πππ ). Hence, each query costs π(π log π + πβ² ππ log π + πππ ) = π(πππ log π) because πβ² = π(log π), πβ² = π(π). Thus, the query time is dominated by the time required for π-πANNs search and the time to check the returned sequence of π approximate nearest neighbors. To be more precise, the probability of success, which is the probability that the random projection succeeds according to Theorem. 8, is greater than 1 β πΏ, for any constant πΏ β (0, 1). Notice that the preprocessing time for BBD-trees has no dependence on π. 26 Chapter 3 Exploiting hidden structure 3.1 Bounded Expansion Rate This section models the structure that the data points may have so as to obtain more precise bounds. The bound on the dimension obtained in Theorem 8 is quite pessimistic. We expect that, in practice, the space dimension needed in order to have a sufficiently good projection is less than what Theorem 8 guarantees. Intuitively, we do not expect to have instances where all points in π, which are not approximate nearest neighbors of π, lie at distance almost equal to (1 + π)π(π, π) as in Figure 3-1. To this end, we consider the case of pointsets with bounded expansion rate. 27 Figure 3-1: Bad case. Definition 11. Let π a metric space and π β π a finite pointset and let π΅π (π) β π denote the points of π lying in the closed ball centered at π with radius π. We say that π has (π, π)-expansion rate if and only if, βπ β π and π > 0, |π΅π (π)| β₯ π =β |π΅π (2π)| β€ π · |π΅π (π)|. Theorem 12. Under the notation introduced in the previous definitions, there exists a β² randomized mapping π : Rπ β Rπ which satisfies Definition 4 for dimension πβ² = π( π log(π+ πΏπ ) ), 2 π distortion π· = 1 + π and probability of success 1 β πΏ, for any constant πΏ β (0, 1), for pointsets with (π, π)-expansion rate. Proof. We proceed in the same spirit as in the proof of Theorem 8, and using the notation from that proof. Let π0 be the distance to the πβth nearest neighbor, excluding neighbors at distance β€ 1 + π. For π > 0, let ππ = 2 · ππβ1 and set πβ1 = 1 + π. Clearly, E[π ] β€ β βοΈ π=0 β€ β βοΈ π=0 πβ² (1 + π/2)2 1 + π/2 |π΅π (ππ )| · ππ₯π( (1 β + 2 ln )) 2 2 ππβ1 ππβ1 πβ² (1 + π/2)2 1 + π/2 ππ π · ππ₯π( (1 β 2π + 2 ln π )). 2 2 2 (1 + π) 2 (1 + π) 28 Now, using Lemma 6, E[π ] β€ β βοΈ πβ² ππ π · ππ₯π(β 0.05(π + 1)π2 ), 2 π=0 and for πβ² β₯ 40 · ln(π + E[π ] β€ π · β βοΈ π=0 2π )/π2 , ππΏ β β βοΈ 1 1 π+1 1 π βοΈ 1 ππΏ π+1 π π+1 π+1 π ·( =π· π ·( ) ·( = · ( = . 2π ) 2π ) 2π ) π π π=0 1 + πππΏ 2 π + ππΏ 1 + πππΏ π=0 π Finally, Pr[π β₯ π] β€ πΏ E[π ] β€ . π 2 Employing Theorem 12 we obtain a result analogous to Theorem 10 which is weaker than those in [KL04, BKL06] but underlines the fact that our scheme shall be sensitive to structure in the input data, for real world assumptions. Theorem 13. Given π points in Rπ with (ln π, π)-expansion rate, there exists a randomized data structure which requires π(ππ) space and reports an (1+π)2 -approximate nearest neighbor 3 in time π((πΆ 1/π + log π)πlog π/π2 ), for some constant πΆ depending on π. The preprocessing time is π(ππ log π). For each query π β Rπ , the preprocessing phase succeeds with any constant probability. β² 1 β² ln π Proof. 
Set π = ln π. Then πβ² = π( lnπ2π ) and ( ππ )π = π(π π2 ln[ π3 ] ). Now the query time is 1 ln π π((π π2 ln[ π3 ] + ln π)π ln π ln π 3 ln π) = π((πΆ 1/π + log π)π 2 ), 2 π π 3 )/π2 for some constant πΆ such that πln(ln π/π 3.2 3 = π(πΆ 1/π ). Doubling dimension In this section, we generalize our idea for pointsets with bounded doubling dimension. 29 Definition 14. The doubling dimension of a metric space π is the smallest positive integer ππππ(π ) such that every set π with diameter π·π can be covered by 2ππππ(π ) (the doubling constant) sets of diameter π·π /2. Now, let π β Rπ s.t. |π| = π and π has doubling constant ππ . Now let ππ β π with log diameter 2ππ . Then we need ππ 8ππ π tiny balls ππ β π of diameter π/4 in order to cover ππ . However under the approximate nearest neighbor setting, it is impossible to bound the number of points per tiny ball which is needed in order to generalize our idea. To that end, we focus on the approximate near neighbor problem. Recall that in the (π, π )-ANN problem we build a data structure which answers queries of the following form: given a query point π, if there exists a point in π in distance β€ π from π then return a point in distance β€ ππ , where π > 1. We extend this definition in order to define the (π, π )-πANNs problem in which a data structure answers queries of the following form: given a query point π, if there exist π points in π in distance β€ π from π then return π points in distance β€ ππ , where π > 1. The (π, π )-ANN problem is already hard in high dimensions. We can assume that π = 1, since we can scale π. The idea is that we first compute π β² β π which satisfies the following two properties: β βπ, π β π β² βπ β πβ > π/8, β βπ β πβπ β π β² s.t. βπ β πβ β€ π/8. The obvious naive algorithm computes π β² in π(π2 ) time. β π =β β for all π₯ β π β π : β π β² β π β² βͺ {π₯} β π β π βͺ {π₯} β for all π¦ β π β π then * if βπ₯ β π¦β < π/8 then π β π βͺ {π¦} 30 log Then, for π β² we know that each ππ β π β² contains β€ ππ 8ππ π points. β² Theorem 15. The (π, π )-ANN problem in Rπ reduces to the (π, π )-πANNs problem in Rπ , where πβ² = π( 2 log(1/π) log ππ +log(1+ ππΏ ) ), 2 π by a randomized algorithm which succeeds with probability at least 1 β πΏ. Preprocessing costs an additional of π(π2 ) time and the delay in query time is proportional to π. Proof. Once again we proceed in the same spirit as in the proof of Theorem 8. Let ππ = 2π+1 (1 + π) for π β₯ β1 and let π΅π (π) β π denote the points of π lying in the closed ball centered at π with radius π. E[π ] β€ β βοΈ π=0 β€ β βοΈ πβ² (1 + π/2)2 1 + π/2 |π΅π (ππ )| · ππ₯π( (1 β + 2 ln )) 2 2 ππβ1 ππβ1 log ππ 8ππ π π=0 (1 + π/2)2 1 + π/2 πβ² + 2 ln π )). · ππ₯π( (1 β 2π 2 2 2 (1 + π) 2 (1 + π) Now, using Lemma 6, E[π ] β€ β βοΈ log( ππ 2π+4 (1+π) ) π π=0 = β βοΈ π=0 (π+4+log( ππ (1+π) ) π πβ² · ππ₯π(β 0.05(π + 1)π2 ), 2 πβ² · ππ₯π(β 0.05(π + 1)π2 ), 2 1 and for πβ² β₯ 40 · ((3 + log( (1+π) ) ln ππ β ln( 1+2/ππΏ ))/π2 , π E[π ] β€ ππΏ 2 and Pr[π β₯ π] β€ E[π ] πΏ β€ . π 2 Now if βπ β π s.t. βπ β πβ β€ 1 then βπβ² β π β² s.t. βπβ² β πβ β€ 1 + π/8. Hence, if πΌ = 1 + π/8 31 ln 2 πΏ and π½ = 1 + π/2, it suffices πβ² β₯ 2 (π½/πΌ)2 β1β2 ln(π½/πΌ) πΏ Pr[βπ (πβ² ) β π (π)β > (1 + π/2)βπβ² β πβ] β€ . 2 Hence, it suffices πβ² = π( 2 log(1/π) log ππ +log(1+ ππΏ ) ). 2 π Notice that if π = 1/πΏ then πβ² = π( log(1/π)π2log ππ ). This bound improves upon [IN07] by a factor of log 1/πΏ which is useful in case we need πΏ βͺ 1 even if the query time increases by a factor of 1/πΏ. 
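Before turning to the experiments, here is a concrete rendering of the quadratic-time preprocessing step described above: the greedy construction of the subset S' in which any two points are more than r/8 apart, while every input point has a representative within distance r/8. The NumPy phrasing and the function name are our own; this is only a sketch of the pseudocode of Section 3.2.

```python
import numpy as np

def separated_subset(X, r):
    """Greedy O(n^2) construction of the subset S' of Section 3.2 (our rendering).

    Guarantees: (i) any two returned points are more than r/8 apart, and
    (ii) every point of X is within distance r/8 of some returned point.
    """
    X = np.asarray(X, dtype=float)
    covered = np.zeros(len(X), dtype=bool)
    chosen = []
    for i in range(len(X)):
        if covered[i]:
            continue
        chosen.append(i)
        # mark every point within r/8 of the new center, including the center itself
        covered |= np.linalg.norm(X - X[i], axis=1) <= r / 8.0
    return X[chosen]
```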
Chapter 4

Experiments

Our method was implemented and tested in [AEP15]; credit for the implementation goes to Evangelos Anagnostopoulos. The following chapter summarizes our experimental results.

We generated our own synthetic datasets and query points to verify our results. First of all, as in [DI04], we followed the "planted nearest neighbor model" for our datasets. This model guarantees for each query point q the existence of a few approximate nearest neighbors, while keeping all other points sufficiently far from q. The benefit of this approach is that it represents a typical ANN search scenario, where for each point there exist only a handful of approximate nearest neighbors. In contrast, in a uniformly generated dataset, all the points tend to be equidistant to each other in high dimensions, which is quite unrealistic. More precisely, in our scenario each query q has a nearest neighbor at distance R and the rest of the points lie at distance > (1+ε)R. Moreover, a significant amount of points lie close to the boundary of the ball centered at q with radius (1+ε)R.

Figure 4-1: Our approach against E2LSH: query time (seconds versus number of points, for ε = 0.1, 0.2, 0.5 and d = 200, 500).

Figure 4-2: Our approach against E2LSH: memory usage (KB versus number of points, for ε = 0.1, 0.2, 0.5 and d = 200, 500).

We built a BBD-tree data structure on the projected space using the ANN library with the default settings. Next, we measured the average time needed for each query q to find its k-ANNs, for k = √n, using the BBD-tree data structure, and then to select the first point at distance ≤ R out of the k in the original space. We compare these times to the average times reported by E2LSH range queries, given the range parameter R, when used from its default script for probability of success 0.95. The script first performs an estimation of the best parameters for the dataset and then builds its data structure using these parameters. We required from the two approaches an accuracy > 0.90, which in our case means that in at least 90 out of the 100 queries they would manage to find the approximate nearest neighbor.

It is clear from Figure 4-1 that E2LSH is faster than our approach by a factor of 3. However, in Figure 4-2, where we present the memory usage comparison between the two approaches, it is obvious that E2LSH requires more space. Figure 4-2 also validates the linear space dependency of our embedding method.

Chapter 5

Open Questions

In terms of practical efficiency, it is obvious that checking the real distance to the neighbors while performing a k-ANNs search in the reduced space is more efficient in practice than naively scanning the returned sequence of k approximate nearest neighbors and looking for the best in the initial space. Moreover, we do not exploit the fact that BBD-trees return a sequence, and not simply a set, of neighbors. Our embedding possibly has further applications.
One possible application is the problem of computing the π-th approximate nearest neighbor. The problem may reduce to computing all neighbors between the π-th and the π-th nearest neighbors in a space of significantly smaller dimension for some appropriate π < π < π. Other possible applications include computing the approximate minimum spanning tree or the closest pair of points. It is also worth mentioning that our ideas may apply in other norms. For example it is known that there exist JL-type random projections from π2 to π1 norm. However solving the ANN problem in π1 appears to be at least as difficult as solving it in π2 . 37 38 Bibliography [Ach03] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. In J. Computer and System Sciences, pages 671β687, 2003. [AdFM11] Sunil Arya, Guilherme D. da Fonseca, and David M. Mount. Approximate polytope membership queries. In Proc. ACM Symposium on Theory of Computing, pages 579β586, 2011. [AEP15] E. Anagnostopoulos, I. Z. Emiris and I. Psarros, Low-quality dimension reduction and high-dimensional Approximate Nearest Neighbor. In arXiv:1412.1683, to appear in SoCG 2015. [And09] Alexandr Andoni. Nearest Neighbor Search: the Old, the New, and the Impossible. PhD thesis, Massachusetts Institute of Technology, 2009. [AI05] Alexandr Andoni and Piotr Indyk. E2 LSH 0.1 User Manual, 2005, Implementation of LSH: E2LSH, http://www.mit.edu/ andoni/LSH [AI08] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Commun. ACM, pages 117β122, 2008. [AINR14] Alexandr Andoni, Piotr Indyk, H.L. Nguy En and Ilya Razenshteyn, Beyond Locality-Sensitive Hashing. In Proc. Symp. of Discrete Algorithms (SODA), 2014. [AMM09] Sunil Arya, Theocharis Malamatos, and David M. Mount. Space-time tradeoffs for approximate nearest neighbor searching. In J. ACM, pages 1:1β1:54, 2009. [AMN+98] S. Arya, D.M. Mount, N.S. Netanyahu, R. Silverman and A.Y. Wu. An Optimal Algorithm for Approximate Nearest Neighbor Searching Fixed Dimensions. In J. ACM, 1998. [BKL06] Alina Beygelzimer, Sham Kakade and John Langford, Cover Trees for Nearest Neighbor, In Proc. of the 23rd Inter. Conf. on Machine Learning, pages 97β104, 2006. [DF08] Sanjoy Dasgupta and Yoav Freund, Random projection trees and low dimensional manifolds, In Proc. ACM Symposium on Theory of Computing, pages 537β546. ACM, 2008. 39 [DG02] Sanjoy Dasgupta and Anupam Gupta, An Elementary Proof of a Theorem of Johnson and Lindenstrauss, In Random Struct. Algorithms, 2003. [DI04] Mayur Datar, Nicole Immorlica, Piotr Indyk and Vahab S. Mirrokni Locality-sensitive hashing scheme based on p-stable distributions, In Proc. of the 20th annual Symp. on Computational Geometry, pages 253β262. [GKL03] Anupam Gupta and Robert Krauthgamer and James R. Lee, Bounded Geometries, Fractals, and Low-Distortion Embeddings, In Proc. 44th Annual IEEE Symp. Foundations of Computer Science, pages 534β, 2003. [HIN12] Sariel Har-Peled and Piotr Indyk and Rajeev Motwani, Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality, Theory of Computing, 2012. [HPM05] Sariel Har-Peled and Manor Mendel. Fast construction of nets in low dimensional metrics, and their applications. In Proc. Symp. on Computational Geometry, pages 150β158. 2005. [IM98] Piotr Indyk and Rajeev Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality, In Proc. 30th ACM Symp. on Theory of Computing, pages 604β613, 1998. 
[IN07] Piotr Indyk and Assaf Naor. Nearest-neighbor-preserving embeddings. In ACM Trans. Algorithms, 3(3), 2007. [JL84] William B. Johnson and Joram Lindenstrauss, Extensions of Lipschitz mappings into a Hilbert space, In Contemp. Math 26 (1984), 189β206. [KL04] Robert Krauthgamer and James R. Lee, Navigating Nets: Simple Algorithms for Proximity Search, In Proc. 15th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 798β807, 2004. [KR02] David R. Karger and Matthias Ruhl, Finding Nearest Neighbors in Growth-restricted Metrics, In Proc. 34th annual ACM Symp. on Theory of Computing, pages 741β750, 2002. [Mei93] Stefan Meiser. Point location in arrangements of hyperplanes. Information and Computation, pages 286β303, 1993. [ML09] Marius Muja and David G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. In Proc. Intern. Conf. Computer Vision Theory and Applications, pages 331β340, 2009. [Mou10] David M. Mount. ANN Programming Manual, 2010, Implementation of ANN: http://www.cs.umd.edu/ mount/ANN/ 40 [OWZ11] Ryan OβDonnell, Yi Wu, and Yuan Zhou. Optimal lower bounds for locality sensitive hashing (except when q is tiny). In Proc. ICS, pages 275β283, 2011. [Pan06] , Rina Panigrahy, Entropy Based Nearest Neighbor Search in High Dimensions, Proc. 17th Annual ACM-SIAM Symp. Discrete Algorithms, SODAβ06. [Vem12] Santosh Vempala. Randomly-oriented k-d Trees Adapt to Intrinsic Dimension, In Proc. Foundations of Software Technology and Theoretical Computer Science, pages 48β57, 2012. 41