Similarity Join in Metric Spaces using eD-Index

Vlastislav Dohnal (1), Claudio Gennaro (2), and Pavel Zezula (1)

(1) Masaryk University Brno, Czech Republic, {xdohnal, zezula}@fi.muni.cz
(2) ISTI-CNR Pisa, Italy, {gennaro}@isti.pi.cnr.it

Abstract. Similarity join in distance spaces constrained by the metric postulates is the necessary complement of the better-known similarity range and nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents their application to large data collections. We present the eD-Index, an extension of the D-Index, and study its application to two algorithms for similarity self joins, the range query join and the overloading join. Though these approaches cannot eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.

1 Introduction

Contrary to the traditional database approach, the Information Retrieval community has always considered search results as a ranked list of objects. Given a query, some objects are more relevant to the query specification than others, and users are typically interested in the most relevant objects, that is, the objects with the highest ranks. This search paradigm has recently been generalized into a model in which a set of objects can only be pair-wise compared through a distance measure satisfying the metric space properties [1]. For illustration, consider text, the most common data type used in information retrieval. Since text is typically represented as a character string, pairs of strings can be compared and the exact match decided. However, the longer the strings are, the less significant the exact match is: text strings can contain errors of any kind, and even correct strings may have small differences. According to [8], text typically contains about 2% typing and spelling errors.
This gives a motivation for search allowing errors, or approximate search, which requires a definition of the concept of similarity, as well as a specification of algorithms to evaluate it. Though the way objects are compared is very important to guarantee search effectiveness, indexing structures are needed to achieve efficiency when searching large data collections. Extensive research in this area, see [1], has produced a large number of index structures which support two similarity search conditions, the range query and the k-nearest neighbor query. Given a reference (query) object, range queries retrieve objects with distances not larger than a user-defined threshold, while k-nearest neighbor queries provide the k objects with the shortest distances to the reference.

In order to complete the set of similarity search operations, similarity joins are needed. For example, consider a document collection of books and a collection of compact disk documents. A possible search request can require finding all pairs of books and compact disks which have similar titles. But similarity joins are not only useful for text. Given a collection of time series of stocks, a relevant query can be: report all pairs of stocks that are within distance µ from each other. Though the similarity join has always been considered a basic similarity search operation, there are only a few indexing techniques for it, most of them concentrating on vector spaces. In this paper, we consider the problem from a much broader perspective and assume distance measures that are metric functions. Such a view extends the range of possible data types to the multimedia dimension, which is typical for modern information retrieval systems.

The development of Internet services often requires an integration of heterogeneous sources of data. Such sources are typically unstructured, whereas the intended services often require structured data.
Once again, the main challenge is to provide consistent and error-free data, which implies data cleaning, typically implemented by some sort of similarity join. In order to perform such tasks, similarity rules are specified to decide whether specific pieces of data may actually refer to the same thing or not. A similar approach can also be applied to copy detection. However, when the database is large, data cleaning can take a long time, so the processing time (or the performance) is the most critical factor, and it can only be reduced by means of convenient similarity search indexes.

The problem of approximate string processing has recently been studied in [4] in the context of data cleaning, that is, removing inconsistencies and errors from large data sets such as those occurring in data warehouses. A technique for building approximate string join capabilities on top of commercial databases has been proposed in [6]. The core idea of these approaches is to transform the difficult problem of approximate string matching into other search problems for which more efficient solutions exist.

In this article, we extend the existing metric index structure, D-Index [2], and compare two algorithms for similarity join built on top of this extended structure. In Section 2, we define principles of the similarity join search in metric spaces and describe the extension of the D-Index. Performance evaluation of the proposed algorithms is reported in Section 3.

2 Similarity Join

A convenient way to assess similarity between two objects is to apply metric functions to decide the closeness of objects as a distance, which can be seen as a measure of the objects' dissimilarity. A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d: a non-negative (d(x, y) >= 0, with d(x, y) = 0 iff x = y) and symmetric (d(x, y) = d(y, x)) function which satisfies the triangle inequality (d(x, y) <= d(x, z) + d(z, y), for all x, y, z in D).
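The metric postulates above can be spot-checked on a finite sample of objects. The following is a minimal illustrative sketch, not part of the paper's method; the helper name is our own:

```python
def is_metric_on_sample(d, points):
    """Spot-check the metric postulates on a finite sample of objects:
    non-negativity, identity of indiscernibles, symmetry, triangle
    inequality. Only a sanity check, not a proof over the whole domain."""
    for x in points:
        for y in points:
            if d(x, y) < 0:                      # non-negativity
                return False
            if (d(x, y) == 0) != (x == y):       # d(x, y) = 0 iff x = y
                return False
            if d(x, y) != d(y, x):               # symmetry
                return False
            for z in points:
                if d(x, y) > d(x, z) + d(z, y):  # triangle inequality
                    return False
    return True
```

For example, the absolute difference on numbers passes the check, while the asymmetric function `lambda a, b: a - b` fails it.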
In general, the problem of indexing in metric spaces can be defined as follows: given a set X ⊆ D in the metric space M, preprocess or structure the elements of X so that similarity queries can be answered efficiently. Without any loss of generality, we assume that the maximum distance never exceeds d+. For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves all elements within distance r of q, that is, the set {x ∈ X | d(q, x) <= r}. A k-nearest neighbors query retrieves the k closest elements to q, that is, a set R ⊆ X such that |R| = k and for all x ∈ R, y ∈ X − R: d(q, x) <= d(q, y).

Fig. 1. The bps split function (a) and the combination of two bps functions (b).

2.1 Similarity Join: Problem Definition

The similarity join is a search primitive which combines objects of two subsets of D into one set such that a similarity condition is satisfied. The similarity condition between two objects is defined according to the metric distance d. Formally, the similarity join X ⋈ Y between two finite sets X = {x1, ..., xN} and Y = {y1, ..., yM} (X ⊆ D and Y ⊆ D) is defined as the set of pairs

  X ⋈ Y = {(xi, yj) | d(xi, yj) <= µ},

where the threshold µ is a real number such that 0 <= µ <= d+. If the sets X and Y coincide, we talk about the similarity self join.

2.2 eD-Index

The eD-Index is an extension of the D-Index [5, 2] structure. In the following, we provide a brief overview of the D-Index, then present the eD-Index and show the differences from the original D-Index structure.

D-Index: an access structure for similarity search. It is a multi-level metric structure, consisting of search-separable buckets at each level.
The structure supports easy insertion and bounded search costs because at most one bucket needs to be accessed at each level for range queries up to a predefined value of search radius ρ. At the same time, the applied pivot-based strategy significantly reduces the number of distance computations in accessed buckets. More details on the D-Index can be found in [5], and the full specification, as well as performance evaluations, are available in [2].

The partitioning principles of the D-Index are based on a multiple definition of a mapping function, called the ρ-split function. Figure 1a shows a possible implementation of a ρ-split function, called the ball partitioning split (bps), originally proposed in [10]. This function uses one reference object xv and the medium distance dm to partition a data set into three subsets. The result of the following bps function gives a unique identification of the set to which the object x belongs:

  bps(x) = 0   if d(x, xv) <= dm − ρ
  bps(x) = 1   if d(x, xv) > dm + ρ
  bps(x) = −   otherwise

The subset of objects characterized by the symbol '−' is called the exclusion set, while the subsets of objects characterized by the symbols 0 and 1 are the separable sets, because any range query with radius not larger than ρ cannot find qualifying objects in both subsets. More separable sets can be obtained as a combination of bps functions, where the resulting exclusion set is the union of the exclusion sets of the original split functions. Furthermore, the new separable sets are obtained as the intersection of all possible pairs of the separable sets of the original functions. Figure 1b illustrates this idea for the case of two split functions. The separable sets and the exclusion set form the separable buckets and the exclusion bucket of one level of the D-Index structure, respectively.
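The bps split function above translates directly into code. A minimal sketch, assuming the distance function d and the parameters xv, dm, and ρ are supplied by the caller:

```python
def bps(x, x_v, d_m, rho, d):
    """Ball partitioning split: assign x to separable set 0 or 1, or to
    the exclusion set (represented here by the string '-')."""
    dist = d(x, x_v)
    if dist <= d_m - rho:
        return 0      # inner separable set
    if dist > d_m + rho:
        return 1      # outer separable set
    return '-'        # exclusion set of width 2*rho around d_m
```

With the absolute difference as distance, reference object 10, dm = 5, and ρ = 1, the object 9 (distance 1) falls into set 0, the object 20 (distance 10) into set 1, and the object 15 (distance 5) into the exclusion set.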
Naturally, the more separable buckets we have, the larger the exclusion bucket is. For a large exclusion bucket, the D-Index allows an additional level of splitting by applying a new set of split functions to the exclusion bucket of the previous level. The exclusion bucket of the last level forms the exclusion bucket of the whole structure. The ρ-split functions of individual levels should be different, but they must use the same ρ. Moreover, by using a different number of split functions (generally decreasing with the level), the D-Index structure can have a different number of buckets at individual levels. In order to deal with overflow problems and growing files, buckets are implemented as elastic buckets and consist of the necessary number of fixed-size blocks (pages), the basic disk access units.

Due to the mathematical properties of the split functions, precisely defined in [2], range queries up to radius ρ are solved by accessing at most one bucket per level, plus the exclusion bucket of the whole structure. This can intuitively be comprehended from the fact that an arbitrary object belonging to a separable bucket is at distance at least 2ρ from any object of another separable bucket of the same level. With additional computational effort, the D-Index executes range queries of radii greater than ρ. The D-Index also supports the nearest neighbor(s) queries.

eD-Index: an access structure for similarity self join. The idea behind the eD-Index is to modify the ρ-split function so that the exclusion set and the separable sets overlap within a distance ε. Figure 2 depicts the modified ρ-split function. The objects which belong to both a separable set and the exclusion set are replicated. This principle, called the exclusion set overloading, ensures that for every qualifying pair (x, y) with d(x, y) <= µ <= ε there exists a bucket where the pair occurs. As explained later, a special algorithm is used to efficiently find these buckets and avoid access to duplicates.
In this way, the eD-Index speeds up the evaluation of similarity self joins.

Fig. 2. The modified bps split function: (a) original ρ-split function; (b) modified ρ-split function.

2.3 Similarity Self Join Algorithm with eD-Index

The outline of the similarity self join algorithm is as follows: execute the join query independently on every separable bucket of every level of the eD-Index and additionally on the exclusion bucket of the whole structure. This behavior is correct due to the exclusion set overloading principle: every object of a separable set which can form a qualifying pair with an object of the exclusion set is copied to the exclusion set. The partial results are concatenated and form the final answer.

The similarity self join algorithm which processes sub-queries in individual buckets is based on the sliding window algorithm. The idea of this algorithm is straightforward, see Figure 3. All objects of a bucket are ordered with respect to a pivot p, which is the reference object of a ρ-split function used by the eD-Index, and we define a sliding window of objects [olo, oup]. The window always satisfies the constraint d(p, oup) − d(p, olo) <= µ, i.e., the window's width is at most µ. Algorithm 2.1 starts with the window [o1, o2] and successively moves the upper bound of the window up by one object, while the lower bound is increased to preserve the window's width <= µ. The algorithm terminates when the last object on is reached. All pairs (oj, oup), such that lo <= j < up, are collected in each window [olo, oup] and, at the same time, the applied pivot-based strategy significantly reduces the number of pairs which have to be checked. Finally, all qualifying pairs are reported.

Fig. 3. The sliding window algorithm.
Algorithm 2.1 Sliding Window
  lo = 1
  for up = 2 to n
    # move the lower boundary up to preserve window's width <= µ
    increment lo while d(oup, p) − d(olo, p) > µ
    for j = lo to up − 1            # for all objects in the window
      if PivotCheck() = FALSE then  # apply the pivot-based strategy
        compute d(oj, oup)
        if d(oj, oup) <= µ then
          add pair (oj, oup) to result
        end if
      end if
    end for
  end for

Fig. 4. Example of pivot behavior.

The eD-Index structure stores distances between stored objects and reference objects of ρ-split functions. These distances are computed when objects are inserted into the structure, and they are utilized by the pivot-based strategy. Figure 4 illustrates the basic principle of this strategy: the object x is one object of an examined pair and pi is the reference object, called a pivot. Provided that the distance between any object and pi is known, the gray area represents the region of objects y that cannot form a qualifying pair with x. This can be decided without actually computing the distance between x and y. By the triangle inequalities d(pi, y) + d(x, y) >= d(pi, x) and d(pi, x) + d(pi, y) >= d(x, y), we have

  |d(pi, x) − d(pi, y)| <= d(x, y) <= d(pi, x) + d(pi, y),

where d(pi, x) and d(pi, y) are pre-computed. Obviously, by using more pivots, we can improve the probability of excluding an object y without actually computing its distance to x. Note that we use all reference objects of ρ-split functions as pivots.

The application of the exclusion set overloading principle raises two important issues. The first concerns duplicate pairs in the result of a join query, caused by the copies of objects which are reinserted into the exclusion set. The described algorithm avoids this behavior by coloring object duplicates.
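Algorithm 2.1 and the pivot-based lower bound can be sketched in Python. This is an illustrative simplification over a plain list: it sorts by a single pivot, whereas the eD-Index pre-computes and stores distances to all ρ-split reference objects at insertion time:

```python
def sliding_window_join(bucket, p, d, mu):
    """Sliding-window self join inside one bucket (sketch of Alg. 2.1).
    Objects are sorted by their distance to the pivot p; those pivot
    distances are computed once and reused as a pre-filter."""
    objs = sorted(bucket, key=lambda o: d(p, o))
    dist = [d(p, o) for o in objs]   # pre-computed pivot distances
    result = []
    lo = 0
    for up in range(1, len(objs)):
        # move the lower boundary up to keep the window width <= mu
        while dist[up] - dist[lo] > mu:
            lo += 1
        for j in range(lo, up):
            # single-pivot lower bound |d(p,x) - d(p,y)| <= d(x,y);
            # redundant with one pivot (the window already guarantees
            # it), but the eD-Index applies it with all pivots
            if dist[up] - dist[j] <= mu and d(objs[j], objs[up]) <= mu:
                result.append((objs[j], objs[up]))
    return result
```

For instance, with pivot 0, the absolute difference as distance, and µ = 1, the bucket {6, 1, 5, 2} yields the qualifying pairs (1, 2) and (5, 6).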
Precisely, each level of the eD-Index has its unique color, and every duplicate of an object carries the colors of all the preceding levels where the replicated object is stored. For example, suppose an object is replicated and stored at levels 1, 3, and 6. The object at level 1 has no color because it is not a duplicate. The object at level 3 has the color of level 1 because it has already been stored at level 1. Similarly, the object at level 6 receives the colors of levels 1 and 3, since it is stored at those preceding levels. Before the algorithm examines a pair, it decides whether the objects of the pair share any color. If they have at least one color in common, the pair is eliminated: sharing a color means that both objects are stored at the same level, so the pair is already checked in a bucket of that level.

The second issue limits the value of the parameter ρ, namely 2ρ >= ε. If ε > 2ρ, some qualifying pairs are not examined by the algorithm. In detail, a pair is missed if one of its objects is from one separable set while the other is from another separable set. Such pairs cannot be found by this algorithm because the exclusion set overloading principle does not duplicate objects among separable sets: the separable sets are then not far enough apart to rule out qualifying pairs between them.

3 Performance Evaluation

In order to demonstrate the suitability of the eD-Index for the problem of similarity self join, we have compared several different approaches to the join operation. The naive algorithm strictly follows the definition of the similarity join and computes the Cartesian product between two sets to decide the pairs of objects that must be checked against the threshold µ. For the similarity self join, this algorithm has time complexity O(N^2), where N = |X|. A more efficient implementation, called nested loops, uses the symmetry of metric distance functions to prune some pairs.
The time complexity is O(N·(N − 1)/2). More sophisticated methods use pre-filtering strategies to discard dissimilar pairs without actually computing distances between them. A representative of these algorithms is the range query join algorithm applied on the eD-Index. Specifically, we assume a data set X ⊆ D organized by the eD-Index with ε = 0 (i.e., without overloading exclusion buckets) and apply the following search strategy: for every o ∈ X, perform range_query(o, µ). Finally, the last compared method is the overloading join algorithm described in Section 2.3.

We have conducted experiments on two real application environments. The first data set consisted of sentences of a Czech language corpus compared by the edit distance measure, the so-called Levenshtein distance [9]. The most frequent distance was around 100, and the longest distance was 500, equal to the length of the longest sentence. The second data set was composed of 45-dimensional vectors of color features extracted from images. Vectors were compared by the quadratic form distance measure. The distance distribution of this data set was practically normal, with the most frequent distance equal to 4,100 and the maximum distance equal to 8,100.

In all experiments, we have compared three different techniques for the similarity self join: the nested loops (NL) algorithm, the range query join (RJ) algorithm applied on the eD-Index, and the overloading join (OJ) algorithm, again applied on the eD-Index.

Join-cost ratio. The objective of this group of tests was to study the relationship between the query size (threshold, radius, or selectivity) and the search costs measured in terms of distance computations. The experiments were conducted on both data sets, each consisting of 11,169 objects.
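The RJ strategy described above, one range query per object, can be sketched as follows. This is a toy version over an unindexed list, not the eD-Index itself; the `i < j` ordering keeps each pair from being reported twice (once per direction of the range query):

```python
def rj_self_join(X, d, mu):
    """Range query join (RJ) strategy sketch: issue range_query(o, mu)
    for every object o and report each qualifying pair once."""
    def range_query(q, r):
        # stand-in for the indexed range query of the eD-Index
        return [j for j, y in enumerate(X) if d(q, y) <= r]

    result = []
    for i, o in enumerate(X):
        for j in range_query(o, mu):
            if i < j:                      # report each pair once
                result.append((X[i], X[j]))
    return result
```

On the set {1, 2, 5} with the absolute difference and µ = 1, this reports the single pair (1, 2).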
The eD-Index structure used on the text data set had 9 levels and 39 buckets. The structure for the vector data collection had 11 levels and 21 buckets. Both structures were fixed for all experiments. We have tested several query radii, up to µ = 28 for the text set and up to µ = 1,800 for the vector data. The similarity self join operation retrieved about 900,000 text pairs for µ = 28 and 1,000,000 pairs of vectors for µ = 1,800, which is far more than is usually of interest.

Fig. 5. Join queries on the text data set.

Fig. 6. Join queries on vectors.

Figure 5 shows the results of experiments for the text data set. As expected, the number of distance computations performed by RJ and OJ increases quite fast with growing µ. However, the RJ and OJ algorithms are still more than 4 times faster than the NL algorithm for µ = 28. The cost of the OJ algorithm nearly reaches that of the RJ algorithm for large µ > 20. Nevertheless, OJ is more than twice as fast as RJ for small values of µ <= 4, which are used in the data cleaning area.

Figure 6 demonstrates the results for the vector data collection. The number of distance computations executed by RJ and OJ has a similar trend as for the text data set: it grows quite fast and nearly reaches the cost of the NL algorithm. However, the OJ algorithm performed even better relative to RJ than on the text data. In particular, OJ is 15 times and 9 times more efficient than RJ for µ = 50 and µ = 100, respectively. The results for OJ are presented only for radii µ <= 600. This limitation is caused by the distance distribution of the vector data set. Specifically, we would have to choose values of ε and ρ at least equal to 1,800 and 900, respectively, for join queries with µ = 1,800.
This implies that more than 80% of the whole data set would be duplicated at each level of the eD-Index structure; the exclusion bucket of the whole structure would then contain practically all objects of the data set, which is indivisible in this respect. However, this behavior does not apply to small values of µ and ε, where only a small portion of the data sets is duplicated.

Scalability. Scalability is probably the most important issue to investigate, considering the web-based dimension of data. In the elementary case, it is necessary to study what happens to the performance of the algorithms when the size of a data set grows. We have experimentally investigated the behavior of the eD-Index on the text data set with sizes from 50,000 to 250,000 objects (sentences). We have mainly concentrated on small queries (µ = 1, 2, 3), which are typical for the data cleaning area. Figure 7 and Figure 8 report the speed-up s of the RJ and OJ algorithms, respectively. The speed-up is defined as

  s = N·(N − 1) / (2·n),

where N is the number of objects stored in the eD-Index and n is the number of distance evaluations needed by the examined algorithm. In fact, the speed-up states how many times the examined algorithm is faster than the NL algorithm.

Fig. 7. RJ algorithm scalability.

Fig. 8. OJ algorithm scalability.

The results indicate that both RJ and OJ have practically constant speed-up as the size of the data set increases significantly. The exceptions are the values for µ = 1, where RJ slightly deteriorates while OJ improves its performance. Nevertheless, OJ performs at least twice as fast as the RJ algorithm. In summary, the figures demonstrate that the speed-up is very high and constant for different values of µ with respect to the data set size.
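The NL baseline and the speed-up measure above can be sketched as:

```python
def nl_self_join(X, d, mu):
    """Nested loops (NL) baseline: the symmetry of d halves the naive
    cost, evaluating exactly N*(N-1)/2 distances."""
    return [(X[i], X[j])
            for i in range(len(X))
            for j in range(i + 1, len(X))
            if d(X[i], X[j]) <= mu]

def speedup(N, n):
    """Speed-up s = N*(N-1) / (2*n): how many times fewer distance
    evaluations an algorithm needs than the NL baseline."""
    return N * (N - 1) / (2 * n)
```

For example, an algorithm that needs n = 1,000,000 distance evaluations on N = 50,000 objects has a speed-up of about 1,250 over NL.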
This implies that the similarity self join with the eD-Index, specifically the overloading join algorithm, is also suitable for large and growing data sets.

4 Conclusions

Similarity search is an important concept in information retrieval. However, the computational costs of similarity (dissimilarity or distance) functions are typically high; consider the edit distance with its quadratic computational complexity. We have observed by experiments that a sequential similarity range search on 50,000 sentences takes about 16 seconds. But performing the nested loops similarity self join algorithm on the same data would take 25,000 times longer, which is about 4 days and 15 hours. In order to reduce the computational costs, indexing techniques must be applied.

Though many research results on indexing techniques to support similarity range and nearest neighbor queries have been published, there are only a few recent studies on indexing of similarity joins. In this article, we have analyzed several implementation strategies for the similarity join operation. We have applied the eD-Index, a metric index structure, to implement two similarity join algorithms, and we have performed numerous experiments to analyze their search properties and their suitability for the similarity join implementation. In general, we can conclude that the proposed overloading join algorithm outperforms the range query join algorithm. In [3], the authors claim that the range query join algorithm applied on the D-Index structure is never worse in performance than the specialized techniques [4, 6]. The eD-Index is extremely efficient for small query radii, where practically on-line response times are guaranteed. An important feature is that the eD-Index scales up well to processing large files, and experiments reveal linear scale-up for similarity join queries.
We have conducted some of our experiments on vectors, and a deeper evaluation has been performed on sentences. However, it is easy to imagine that text units of different granularity, such as individual words or paragraphs with words as string symbols, can also be handled by analogy. The main advantage of the eD-Index, however, is that it can also perform similar operations on other metric data. As suggested in [7], where the problem of similarity join on XML structures is investigated, metric indexes can be applied for approximate matching of tree structures. We consider this challenge our future research direction.

References

1. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin: Searching in Metric Spaces. ACM Computing Surveys, 33(3):273-321, 2001.
2. V. Dohnal, C. Gennaro, P. Savino, and P. Zezula: D-Index: Distance Searching Index for Metric Data Sets. To appear in ACM Multimedia Tools and Applications, 21(1), September 2003.
3. V. Dohnal, C. Gennaro, and P. Zezula: A Metric Index for Approximate Text Management. Proceedings of the IASTED International Conference on Information Systems and Databases (ISDB 2002), Tokyo, Japan, 2002, pp. 37-42.
4. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.A. Saita: Declarative Data Cleaning: Language, Model, and Algorithms. Proceedings of the 27th VLDB Conference, Rome, Italy, 2001, pp. 371-380.
5. C. Gennaro, P. Savino, and P. Zezula: Similarity Search in Metric Databases through Hashing. Proceedings of ACM Multimedia 2001 Workshops, October 2001, Ottawa, Canada, pp. 1-5.
6. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava: Approximate String Joins in a Database (Almost) for Free. Proceedings of the 27th VLDB Conference, Rome, Italy, 2001, pp. 491-500.
7. S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu: Approximate XML Joins. Proceedings of ACM SIGMOD 2002, Madison, Wisconsin, June 3-6, 2002.
8. K. Kukich: Techniques for Automatically Correcting Words in Text. ACM Computing Surveys, 24(4):377-439, 1992.
9. G. Navarro: A Guided Tour to Approximate String Matching. ACM Computing Surveys, 33(1):31-88, 2001.
10. P. N. Yianilos: Excluded Middle Vantage Point Forests for Nearest Neighbor Search. Tech. rep., NEC Research Institute, 1999. Presented at the Sixth DIMACS Implementation Challenge: Nearest Neighbor Searches workshop, January 15, 1999.