Similarity Join in Metric Spaces using eD-Index
Vlastislav Dohnal¹, Claudio Gennaro², and Pavel Zezula¹

¹ Masaryk University, Brno, Czech Republic
{xdohnal, zezula}@fi.muni.cz
² ISTI-CNR, Pisa, Italy
{gennaro}@isti.pi.cnr.it
Abstract. Similarity join in distance spaces constrained by the metric postulates is a necessary complement of the more famous similarity range and nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents their application to large data collections. We present the eD-Index, an extension of the D-Index, and we study an application of the eD-Index to implement two algorithms for similarity self joins, i.e. the range query join and the overloading join. Though these approaches are not able to eliminate the intrinsic quadratic complexity of similarity joins, significant performance improvements are confirmed by experiments.
1 Introduction
Contrary to the traditional database approach, the Information Retrieval community has always considered search results as a ranked list of objects. Given a query, some objects are more relevant to the query specification than others, and users are typically interested in the most relevant objects, that is, the objects with the highest ranks. This
search paradigm has recently been generalized into a model in which a set of objects
can only be pair-wise compared through a distance measure satisfying the metric space
properties [1].
For illustration, consider text data, the most common data type used in information retrieval. Since text is typically represented as a character string, pairs of strings can be compared and the exact match decided. However, the longer the strings are, the less significant the exact match is: the text strings can contain errors of any kind, and even the correct strings may have small differences. According to [8], text typically contains about 2% typing and spelling errors. This motivates search that allows errors, or approximate search, which requires a definition of the concept of similarity, as well as a specification of algorithms to evaluate it.
Though the way objects are compared is very important to guarantee search effectiveness, indexing structures are needed to achieve efficiency when searching large data collections. Extensive research in this area, see [1], has produced a large number of index structures which support two similarity search conditions, the range query and the k-nearest neighbor query. Given a reference (query) object, the range query retrieves objects with distances not larger than a user-defined threshold, while the k-nearest neighbors query provides the k objects with the shortest distances to the reference.
In order to complete the set of similarity search operations, similarity joins are needed. For example, consider a document collection of books and a collection of compact disk documents. A possible search request is to find all pairs of books and compact disks which have similar titles. But similarity joins are not only useful for text. Given a collection of time series of stocks, a relevant query can be: report all pairs of stocks that are within distance µ of each other. Though the similarity join has always been considered a basic similarity search operation, there are only a few indexing techniques, most of them concentrating on vector spaces. In this paper, we consider the problem from a much broader perspective and assume distance measures that are metric functions. Such a view extends the range of possible data types to the multimedia dimension, which is typical for modern information retrieval systems.
The development of Internet services often requires an integration of heterogeneous sources of data. Such sources are typically unstructured, whereas the intended services often require structured data. Once again, the main challenge is to provide consistent and error-free data, which calls for data cleaning, typically implemented by a sort of similarity join. In order to perform such tasks, similarity rules are specified to decide whether specific pieces of data actually refer to the same thing or not. A similar approach can also be applied to copy detection. However, when the database is large, data cleaning can take a long time, so the processing time (or the performance) is the most critical factor, and it can only be reduced by means of convenient similarity search indexes.
The problem of approximate string processing has recently been studied in [4] in the context of data cleaning, that is, removing inconsistencies and errors from large data sets such as those occurring in data warehouses. A technique for building approximate string join capabilities on top of commercial databases has been proposed in [6]. The core idea of these approaches is to transform the difficult problem of approximate string matching into other search problems for which more efficient solutions exist.
In this article, we extend an existing metric index structure, the D-Index [2], and compare two algorithms for similarity joins built on top of this extended structure. In Section 2, we define the principles of similarity join search in metric spaces and describe the extension of the D-Index. A performance evaluation of the proposed algorithms is reported in Section 3.
2 Similarity Join
A convenient way to assess the similarity between two objects is to apply a metric function that quantifies the closeness of objects as a distance, which can be seen as a measure of the objects' dissimilarity. A metric space M = (D, d) is defined by a domain of objects (elements, points) D and a total (distance) function d – a non-negative (d(x, y) ≥ 0 with d(x, y) = 0 iff x = y) and symmetric (d(x, y) = d(y, x)) function which satisfies the triangle inequality (d(x, y) ≤ d(x, z) + d(z, y), ∀x, y, z ∈ D).
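As a concrete example of such a metric, the sketch below gives a textbook dynamic-programming implementation of the edit (Levenshtein) distance, the measure later used for the text experiments in Section 3. It is an illustration of the metric postulates, not the authors' implementation.

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance: the minimum number of character insertions,
        deletions and substitutions needed to transform string a into string b."""
        prev = list(range(len(b) + 1))     # distances from the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]                     # distance to the empty prefix of b
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,               # deletion
                                curr[j - 1] + 1,           # insertion
                                prev[j - 1] + (ca != cb)   # substitution (0 if equal)
                                ))
            prev = curr
        return prev[len(b)]

This function is non-negative, symmetric, zero only for identical strings, and satisfies the triangle inequality, so it fits the definition above.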
In general, the problem of indexing in metric spaces can be defined as follows:
given a set X ⊆ D in the metric space M, preprocess or structure the elements of X so
that similarity queries can be answered efficiently. Without any loss of generality, we
assume that the maximum distance never exceeds the distance d+.

Fig. 1. The bps split function (a) and the combination of two bps functions (b).

For a query object q ∈ D, two fundamental similarity queries can be defined. A range query retrieves
all elements within distance r of q, that is, the set {x ∈ X | d(q, x) ≤ r}. A k-nearest neighbors query retrieves the k closest elements to q, that is, a set R ⊆ X such that |R| = k and ∀x ∈ R, y ∈ X − R, d(q, x) ≤ d(q, y).
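To make the two query types concrete, the following sketch evaluates them by a plain linear scan over X (the helper names are ours); index structures such as the D-Index answer the same queries while avoiding most of these distance computations.

    import heapq

    def range_query(X, d, q, r):
        """All elements of X within distance r of the query object q."""
        return [x for x in X if d(q, x) <= r]

    def knn_query(X, d, q, k):
        """The k elements of X with the shortest distances to q."""
        return heapq.nsmallest(k, X, key=lambda x: d(q, x))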
2.1 Similarity Join: Problem Definition
The similarity join is a search primitive which combines objects of two subsets of D
into one set such that a similarity condition is satisfied. The similarity condition between
two objects is defined according to the metric distance d. Formally, the similarity join X ⋈ Y between two finite sets X = {x1, ..., xN} and Y = {y1, ..., yM} (X ⊆ D and Y ⊆ D) is defined as the set of pairs X ⋈ Y = {(xi, yj) | d(xi, yj) ≤ µ}, where the threshold µ is a real number such that 0 ≤ µ ≤ d+. If the sets X and Y coincide,
we talk about the similarity self join.
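A direct, brute-force transcription of this definition (a sketch only, corresponding to the naive and nested loops algorithms discussed in Section 3) can be written as follows.

    def similarity_join(X, Y, d, mu):
        """All pairs (x, y), x from X and y from Y, with d(x, y) <= mu."""
        return [(x, y) for x in X for y in Y if d(x, y) <= mu]

    def similarity_self_join(X, d, mu):
        """Self join variant: symmetry of d lets us check each unordered pair once."""
        return [(X[i], X[j])
                for i in range(len(X))
                for j in range(i + 1, len(X))
                if d(X[i], X[j]) <= mu]

Both variants are quadratic in the collection size, which is exactly the cost the eD-Index aims to reduce.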
2.2 eD-Index
The eD-Index is an extension of the D-Index [5, 2] structure. In the following, we provide a brief overview of the D-Index; we then present the eD-Index and show the differences from the original D-Index structure.
D-Index: an access structure for similarity search. It is a multi-level metric structure,
consisting of search-separable buckets at each level. The structure supports easy insertion and bounded search costs because at most one bucket needs to be accessed at each
level for range queries up to a predefined value of search radius ρ. At the same time, the
applied pivot-based strategy significantly reduces the number of distance computations
in accessed buckets. More details can be found in [5], and the full specification, as well as performance evaluations, is available in [2].
The partitioning principles of the D-Index are based on multiple definitions of a mapping function, called the ρ-split function. Figure 1a shows a possible implementation of a ρ-split function, called the ball partitioning split (bps), originally proposed in [10]. This function uses one reference object xv and the medium distance dm to partition a data set into three subsets. The result of the following bps function gives a unique
identification of the set to which the object x belongs:
    bps(x) = 0   if d(x, xv) ≤ dm − ρ
             1   if d(x, xv) > dm + ρ
             −   otherwise
The subset of objects characterized by the symbol '−' is called the exclusion set, while the subsets of objects characterized by the symbols 0 and 1 are the separable sets, because any range query with radius not larger than ρ cannot find qualifying objects in both of these subsets.
More separable sets can be obtained as a combination of bps functions, where the
resulting exclusion set is the union of the exclusion sets of the original split functions.
Furthermore, the new separable sets are obtained as the intersections of all possible pairs of the separable sets of the original functions. Figure 1b gives an illustration of this idea for
the case of two split functions. The separable sets and the exclusion set form the separable buckets and the exclusion bucket of one level of the D-index structure, respectively.
Naturally, the more separable buckets we have, the larger the exclusion bucket is.
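The following sketch (illustrative only; the parameter layout is ours) implements one bps function and the combination of several bps functions into a bucket identifier, mirroring Figure 1: an object falls into a separable bucket only if every split function assigns it 0 or 1, otherwise it belongs to the exclusion set.

    def bps(x, xv, dm, rho, d):
        """Ball-partitioning split: returns 0, 1, or '-' (exclusion) for object x."""
        dist = d(x, xv)
        if dist <= dm - rho:
            return 0
        if dist > dm + rho:
            return 1
        return '-'

    def combined_split(x, pivots, d):
        """Combine several bps functions; pivots is a list of (xv, dm, rho) tuples.

        Returns the separable bucket index encoded by the bps bits, or '-' if any
        single function puts x into its exclusion set."""
        code = 0
        for xv, dm, rho in pivots:
            b = bps(x, xv, dm, rho, d)
            if b == '-':
                return '-'          # union of the exclusion sets
            code = (code << 1) | b  # intersection of the separable sets
        return code

With two split functions, the codes 0 to 3 correspond to the four separable sets of Figure 1b.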
When the exclusion bucket is large, the D-Index allows an additional level of splitting by applying a new set of split functions to the exclusion bucket of the previous level. The exclusion bucket of the last level forms the exclusion bucket of the whole structure. The ρ-split functions of individual levels should be different, but they must use the same ρ. Moreover, by using a different number of split functions (generally decreasing with the level), the D-Index structure can have a different number of buckets at individual levels.
In order to deal with overflow problems and growing files, buckets are implemented as
elastic buckets and consist of the necessary number of fixed-size blocks (pages) – basic
disk access units.
Due to the mathematical properties of the split functions, precisely defined in [2],
the range queries up to radius ρ are solved by accessing at most one bucket per level,
plus the exclusion bucket of the whole structure. Intuitively, this follows from the fact that an arbitrary object belonging to a separable bucket is at distance at least 2ρ from any object of any other separable bucket of the same level. With additional computational effort, the D-Index also executes range queries of radii greater than ρ, and it supports the nearest neighbor(s) queries as well.
eD-Index: an access structure for Similarity Self Join. The idea behind the eD-Index is to modify the ρ-split function so that the exclusion set and the separable sets overlap within a distance ε. Figure 2 depicts the modified ρ-split function. The objects which belong to both a separable set and the exclusion set are replicated. This principle, called the exclusion set overloading, ensures that for every qualifying pair (x, y) with d(x, y) ≤ µ ≤ ε there always exists a bucket in which both members of the pair occur. As explained later, a special algorithm is used to efficiently find these buckets and avoid accessing duplicates. In this way, the eD-Index speeds up the evaluation of similarity self joins.
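A minimal sketch of the modified split, under our reading of Figure 2 (parameter names are ours): an object keeps its separable-set membership, and it is additionally flagged for replication into the exclusion set when it lies within ε of the exclusion zone.

    def bps_overloaded(x, xv, dm, rho, eps, d):
        """Modified bps split of the eD-Index (sketch).

        Returns (symbol, replicate): symbol is 0, 1, or '-' as in the original bps
        function; replicate is True when the object must also be copied into the
        exclusion set because it lies within eps of the exclusion zone."""
        dist = d(x, xv)
        if dist <= dm - rho:
            return 0, dist > dm - rho - eps   # close to the left exclusion boundary
        if dist > dm + rho:
            return 1, dist <= dm + rho + eps  # close to the right exclusion boundary
        return '-', False                     # inside the exclusion set proper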
2.3 Similarity Self Join Algorithm with eD-Index
The outline of the similarity self join algorithm is as follows: execute the join query independently on every separable bucket of every level of the eD-Index and additionally on the exclusion bucket of the whole structure. This behavior is correct due to the exclusion set overloading principle – every object of a separable set which can make a qualifying pair with an object of the exclusion set is copied to the exclusion set. The partial results are concatenated and form the final answer.

Fig. 2. The modified bps split function: (a) original ρ-split function; (b) modified ρ-split function.
The similarity self join algorithm which processes sub-queries in individual buckets is based on the sliding window algorithm. The idea of this algorithm is straightforward, see Figure 3. All objects of a bucket are ordered with respect to a pivot p, which is the reference object of a ρ-split function used by the eD-Index, and we define a sliding window of objects [olo, oup]. The window always satisfies the constraint d(p, oup) − d(p, olo) ≤ µ, i.e. the window's width is at most µ. Algorithm 21 starts with the window [o1, o2] and successively moves the upper bound of the window up by one object, while the lower bound is increased to preserve the window's width ≤ µ. The algorithm terminates when the last object on is reached. All pairs (oj, oup), such that lo ≤ j < up, are collected in each window [olo, oup] and, at the same time, the applied pivot-based strategy significantly reduces the number of pairs which have to be checked. Finally, all qualifying pairs are reported.

Fig. 3. The Sliding Window algorithm.
Algorithm 21 Sliding Window
  lo = 1
  for up = 2 to n
    # move the lower boundary up to preserve window's width ≤ µ
    increment lo while d(oup, p) − d(olo, p) > µ
    # for all objects in the window
    for j = lo to up − 1
      # apply the pivot-based strategy
      if PivotCheck() = FALSE then
        compute d(oj, oup)
        if d(oj, oup) ≤ µ then
          add pair (oj, oup) to result
        end if
      end if
    end for
  end for

Fig. 4. Example of pivots behavior.
The eD-Index structure stores distances between stored objects and reference objects of ρ-split functions. These distances are computed when objects are inserted into
the structure and they are utilized by the pivot-based strategy. Figure 4 illustrates the
basic principle of this strategy: the object x is one object of an examined pair and pi is
the reference object, called pivot. Provided that the distance between any object and pi
is known, the gray area represents the region of objects y that do not form a qualifying
pair with x. This assertion can easily be decided without actually computing the distance between x and y. By using the triangle inequalities d(pi , y) + d(x, y) ≥ d(pi , x)
and d(pi , x) + d(pi , y) ≥ d(x, y) we have that |d(pi , x) − d(pi , y)| ≤ d(x, y) ≤
d(pi , y) + d(pi , x), where d(pi , x) and d(pi , y) are pre-computed. It is obvious that,
by using more pivots, we can improve the probability of excluding an object y without
actually computing its distance to x. Note that we use all reference objects of ρ-split
functions as pivots.
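Putting Algorithm 21 and the pivot-based filter together, a possible Python rendering is sketched below. It assumes that objects are hashable (e.g. strings) and that the pre-computed pivot distances are supplied as a mapping pivot_dists[o] listing d(pi, o) for all pivots in a fixed order; these names and the interface are ours, not the authors'.

    def sliding_window_join(bucket, p, d, mu, pivot_dists):
        """Similarity self join within one bucket (sketch of Algorithm 21)."""
        result = []
        objs = sorted(bucket, key=lambda o: d(p, o))   # order objects by distance to p
        dist_to_p = [d(p, o) for o in objs]
        lo = 0
        for up in range(1, len(objs)):
            # move the lower boundary up to preserve the window's width <= mu
            while dist_to_p[up] - dist_to_p[lo] > mu:
                lo += 1
            for j in range(lo, up):
                o_j, o_up = objs[j], objs[up]
                # pivot check: |d(pi, x) - d(pi, y)| is a lower bound on d(x, y)
                if any(abs(a - b) > mu
                       for a, b in zip(pivot_dists[o_j], pivot_dists[o_up])):
                    continue                           # excluded without computing d
                if d(o_j, o_up) <= mu:
                    result.append((o_j, o_up))
        return result

In the eD-Index itself, d(p, o) is one of the pre-computed pivot distances, so ordering the bucket need not incur extra distance computations.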
The application of the exclusion set overloading principle raises two important issues. The former concerns the problem of duplicate pairs in the result of a join query, caused by the copies of objects which are reinserted into the exclusion set. The described algorithm avoids this behavior by coloring objects' duplicates. Precisely, each level of the eD-Index has its unique color, and every duplicate of an object carries the colors of all the preceding levels where the replicated object is stored. For example, suppose an object is replicated and stored at levels 1, 3, and 6. The copy at level 1 has no color because it is not a duplicate. The copy at level 3 has the color of level 1 because the object has already been stored at level 1. Similarly, the copy at level 6 receives the colors of levels 1 and 3, since it is stored at those preceding levels. Before the algorithm
examines a pair, it decides whether the objects of the pair share any color. If they have at least one color in common, the pair is eliminated. Two objects sharing a color means that both are stored at the same level and are therefore checked in a bucket of that level.
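The color test itself amounts to a set intersection; a minimal sketch, with the representation (one set of level numbers per stored copy) chosen by us:

    def shares_color(colors_x, colors_y):
        """True if the two copies are both stored at some common preceding level,
        i.e. the pair is (or will be) examined in a bucket of that level."""
        return bool(colors_x & colors_y)

    # An object replicated at levels 1, 3 and 6 carries, at level 6, the colors {1, 3}.
    print(shares_color({1, 3}, {1}))     # True  -> the pair is skipped here
    print(shares_color({1, 3}, set()))   # False -> the pair must be checked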
The latter issue limits the value of the parameter ρ, that is, 2ρ ≥ ε. If ε > 2ρ, some qualifying pairs are not examined by the algorithm. In detail, a pair is missed if one object is from one separable set while the other object of the pair is from another separable set. Such pairs cannot be found by this algorithm because the exclusion set overloading principle does not duplicate objects among separable sets. Consequently, the separable sets are not separated widely enough to avoid missing some qualifying pairs.
3 Performance Evaluation
In order to demonstrate the suitability of the eD-Index for the problem of similarity self join, we have compared several different approaches to the join operation. The naive algorithm strictly follows the definition of the similarity join and computes the Cartesian product between the two sets to decide which pairs of objects must be checked against the threshold µ. Considering the similarity self join, this algorithm has time complexity O(N²), where N = |X|. A more efficient implementation, called the nested loops, uses the symmetry of metric distance functions to prune some pairs. Its time complexity is O(N·(N − 1)/2). More sophisticated methods use pre-filtering strategies to discard dissimilar pairs without actually computing distances between them.
A representative of these algorithms is the range query join algorithm applied on the
eD-Index. Specifically, we assume a data set X ⊆ D organized by the eD-Index with ε = 0 (i.e., without overloading exclusion buckets) and apply the following search strategy: for each o ∈ X, perform range query(o, µ). Finally, the last compared method is
the overloading join algorithm which is described in Section 2.3.
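As an illustration, the range query join strategy reduces to one range query per stored object. The sketch below assumes a hypothetical index object exposing a range_query(q, r) method; it is not the eD-Index API.

    def range_query_join(index, X, mu):
        """Range query join (RJ): perform range_query(o, mu) for every o in X
        and report each qualifying pair once."""
        result = []
        seen = set()
        for o in X:
            for match in index.range_query(o, mu):
                if match is o:
                    continue                            # skip the trivial pair (o, o)
                key = frozenset((id(o), id(match)))
                if key not in seen:                     # symmetric pairs appear twice
                    seen.add(key)
                    result.append((o, match))
        return result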
We have conducted experiments on two real application environments. The first data set consisted of sentences of a Czech language corpus compared by the edit distance measure, the so-called Levenshtein distance [9]. The most frequent distance was around 100 and the maximum distance was 500, equal to the length of the longest sentence. The second data set was composed of 45-dimensional vectors of color features extracted from images. Vectors were compared by the quadratic form distance measure. The distance distribution of this data set was practically normal, with the most frequent distance equal to 4,100 and the maximum distance equal to 8,100.
In all experiments, we have compared three different techniques for the problem of the similarity self join: the nested loops (NL) algorithm, the range query join (RJ) algorithm applied on the eD-Index, and the overloading join (OJ) algorithm, again applied
on the eD-Index.
Join-cost ratio. The objective of this group of tests was to study the relationship between the query size (threshold, radius, or selectivity) and the search costs measured in terms of distance computations. The experiments were conducted on both data sets, each consisting of 11,169 objects. The eD-Index structure used on the text data set had 9 levels and 39 buckets; the structure for the vector data collection had 11 levels and 21 buckets. Both structures were fixed for all experiments.

Fig. 5. Join queries on the text data set.
Fig. 6. Join queries on vectors.
We have tested several query radii up to µ = 28 for the text set and up to µ = 1800 for the vector data. The similarity self join operation retrieved about 900,000 text pairs for µ = 28 and 1,000,000 pairs of vectors for µ = 1800, which is far more than is of practical interest. Figure 5 shows the results of the experiments for the text data set. As expected, the number of distance computations performed by RJ and OJ increases quite fast with growing µ. However, the RJ and OJ algorithms are still more than 4 times faster than the NL algorithm for µ = 28. The cost of the OJ algorithm nearly reaches that of the RJ algorithm for large µ > 20. Nevertheless, OJ is more than twice as fast as RJ for small values of µ ≤ 4, which are typical in the data cleaning area. Figure 6 demonstrates the results for
the vector data collection. The number of distance computations executed by RJ and OJ has a similar trend as for the text data set – it grows quite fast and nearly reaches the cost of the NL algorithm. However, the OJ algorithm performed even better relative to RJ than in the results for the text data. In particular, OJ is 15 times and 9 times more efficient than RJ for µ = 50 and µ = 100, respectively. The results for OJ are presented only for radii µ ≤ 600. This limitation is caused by the distance distribution of the vector data set. Specifically, we would have to choose values of ε and ρ at least equal to 1800 and 900, respectively, for join queries with µ = 1800. This implies that more than 80% of the whole data set is duplicated at each level of the eD-Index structure, which means that the exclusion bucket of the whole structure contains practically all objects of the data set; thus, the data set is indivisible in this respect. However, this behavior does not apply to small values of µ and ε, where only a small portion of the data set is duplicated.
Scalability. Scalability is probably the most important issue to investigate considering the web-based dimension of data. In the elementary case, it is necessary to study what happens to the performance of the algorithms when the size of a data set grows. We
have experimentally investigated the behavior of the eD-Index on the text data set with
sizes from 50,000 to 250,000 objects (sentences).
Fig. 7. RJ algorithm scalability.
Fig. 8. OJ algorithm scalability.
We have mainly concentrated on small queries, which are typical for the data cleaning area. Figure 7 and Figure 8 report the speed-up (s) of the RJ and OJ algorithms, respectively. The speed-up is defined as follows:

s = N·(N − 1) / (2·n),

where N is the number of objects stored in the eD-Index and n is the number of distance evaluations needed by the examined algorithm. In other words, the speed-up states how many times the examined algorithm is faster than the NL algorithm. The results indicate that both RJ and OJ have a practically constant speedup even when the size of the data set increases significantly. The exception is µ = 1, where RJ slightly deteriorates while OJ improves its performance. Nevertheless, OJ performs at least twice as fast as the RJ algorithm.
In summary, the figures demonstrate that the speedup is very high and constant for
different values of µ with respect to the data set size. This implies that the similarity
self join with the eD-Index, specifically the overloading join algorithm, is also suitable
for large and growing data sets.
4 Conclusions
Similarity search is an important concept in information retrieval. However, the computational costs of similarity (dissimilarity or distance) functions are typically high – consider the edit distance, which has quadratic computational complexity. We have observed experimentally that a sequential similarity range search on 50,000 sentences takes about 16 seconds. But performing the nested loops similarity self join algorithm on the same data would take 25,000 times longer, which is about 4 days and 15 hours. In
order to reduce the computational costs, indexing techniques must be applied.
Though a lot of research results on indexing techniques to support similarity range and nearest neighbors queries have been published, there are only a few recent
studies on indexing of similarity joins. In this article, we have analyzed several implementation strategies for the similarity join operation. We have applied the eD-Index, a metric index structure, to implement two similarity join algorithms, and we have performed numerous experiments to analyze their search properties and their suitability for implementing similarity joins.
In general, we can conclude that the proposed overloading join algorithm outperforms the range query join algorithm. In [3], the authors claim that the range query join algorithm applied on the D-Index structure never performs worse than the specialized techniques [4, 6]. The eD-Index is extremely efficient for small query radii, where practically on-line response times are guaranteed. An important feature is that the eD-Index scales up well to processing large files, and the experiments reveal linear scale-up for similarity join queries.
We have conducted some of our experiments on vectors, and a deeper evaluation has been performed on sentences. However, it is easy to imagine that text units of different granularity, such as individual words or paragraphs with words as string symbols, can also be handled by analogy. Moreover, the main advantage of the eD-Index is that it can also perform similar operations on other metric data. As suggested in [7], where the problem of similarity joins on XML structures is investigated, metric indexes can be applied to the approximate matching of tree structures. We consider this challenge our future research direction.
References
1. E. Chavez, G. Navarro, R. Baeza-Yates, and J. Marroquin: Searching in Metric Spaces. ACM
Computing Surveys, 33(3):273-321, 2001.
2. V. Dohnal, C. Gennaro, P. Savino, P. Zezula: D-Index: Distance Searching Index for Metric
Data Sets. To appear in ACM Multimedia Tools and Applications, 21(1), September 2003.
3. V. Dohnal, C. Gennaro, P. Zezula: A Metric Index for Approximate Text Management. Proceedings of IASTED International Conference on Information Systems and Databases (ISDB
2002), Tokyo, Japan, 2002, pp. 37-42.
4. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.A. Saita: Declarative Data Cleaning:
Language, Model, and Algorithms. Proceedings of the 27th VLDB Conference, Rome, Italy,
2001, pp. 371-380.
5. C. Gennaro, P. Savino, and P. Zezula: Similarity Search in Metric Databases through Hashing.
Proceedings of ACM Multimedia 2001 Workshops, October 2001, Ottawa, Canada, pp. 1-5.
6. L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava:
Approximate String Joins in a Database (Almost) for Free. Proceedings of the 27th VLDB
Conference, Rome, Italy, 2001, pp. 491-500.
7. S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu: Approximate XML Joins. Proceedings of ACM SIGMOD 2002, Madison, Wisconsin, June 3-6, 2002.
8. K. Kukich: Techniques for automatically correcting words in text. ACM Computing Surveys,
1992, 24(4):377-439.
9. G. Navarro: A guided tour to approximate string matching. ACM Computing Surveys, 2001,
33(1):31-88.
10. P. N. Yianilos: Excluded Middle Vantage Point Forests for Nearest Neighbor Search. Tech.
rep., NEC Research Institute, 1999, Presented at Sixth DIMACS Implementation Challenge:
Nearest Neighbor Searches workshop, January 15, 1999.