The Power of Prefix Search, ADS 2007, Bertinoro

Efficient Search in Very Large Text
Collections, Databases, and Ontologies
DFG Priority Programme “Algorithm Engineering”
Kickoff Meeting in Karlsruhe, December 2 – 3, 2007
Holger Bast
Max-Planck-Institut für Informatik
Saarbrücken, Germany
General theme of this project

Search engines
– large variety of challenging algorithmic problems with
high practical relevance
– algorithm engineering is absolutely essential

Focus on scalability
– terabytes of data, hundreds of millions of documents
– query times in a fraction of a second

Focus on advanced queries
– beyond Google-style keyword search
– but still as efficient in time and space
Fancy Searches, yet Fast
(efficiency is often a secondary issue in DB, AI, CL, or ML research)
Problems encountered in this project

Indexing: fast queries, succinct index, fast construction
– Index structures for advanced queries (beyond keyword search)
– How to build them fast


Learning from text: scalable, yet effective
– large-scale spelling correction
algorythm → algorithm
– large-scale synonymy detection
web ≈ internet
– large-scale entity annotation
Einstein → the physicist? the physical unit? the musicologist?
“Basic Toolbox” (for search)
– fast intersection of (sorted) sequences
– efficient (de)compression
possible synergies with
Peter Sanders’ project
I will give a few glimpses in the following
Prefix Completion

Fundamental search problem
– definition on next slide
– many notoriously difficult search problems can be reduced
to it
– for example, faceted search:
for, say, an article by Peter Sanders that appeared in WEA 2007, add
author:Peter Sanders   Doc. 17
venue:WEA              Doc. 17
year:2007              Doc. 17
Prefix Completion — Problem Definition

Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)

[Figure: a collection of documents (e.g., D13, D17, D88), each drawn as a box of word ids]

Query
– given a sorted list of doc ids: D13 D17 D88 …
– and a range of word ids: C D E F G

Answer
– all matching word-in-doc pairs
– with scores
– and positions

doc   word  score  position
D13   E     0.5    5
D88   E     0.2    7
D88   G     0.7    1
…     …     …      …
Prefix Completion — via the Inverted Index

For example, algor* eng*
given the documents: D13, D17, D88, … (ids of hits for algor*)
and the word range: C D E F G (ids for eng*)

Iterate over all words from the given range
C (engage)        D8, D23, D291, ...
D (engel)         D24, D36, D165, ...
E (engine)        D13, D24, D88, ...
F (engines)       D56, D129, D251, ...
G (engineering)   D3, D15, D88, ...

Intersect each list with the given one and merge the results
D13 E
D88 E
D88 G
…

running time |D| ∙ |W| + log |W| ∙ merge volume
Prefix Completion — Status Quo & Problems

The inverted index
– highly compressible
– perfect locality of access (T operations → T / block size IOs)
– but quadratic worst-case complexity
Note: time for 100 disk seeks = time for reading 200 MB of compressed data

AutoTree [Bast, Weber, Mortensen, SPIRE’06]
– output-sensitive (query time linear in size of output)
– but poor locality of access (heavy use of bit rank operations)
Note: 99% correlation with actual running times

The half-inverted index [Bast, Weber, SIGIR’06]
– highly compressible + perfect locality of access
– query time linear in the number of docs, with small constant
Note: perfect prediction of time & space consumption

Major open problem: output-sensitive and IO-efficient
Error-Tolerant Search

With prefix search available, reduces to the following
– Problem: Given a set of distinct words (lexicon), find all
clusters of words that are spelling variants of each other
algorithm
algorytm

alogrithm
logarithm
logaythm
machine
maschine
mahcine
Challenges
– find appropriate measure of distance between words
– algorithm that scales in theory as well as in practice
possible synergies with
Ernst Mayr’s project
Master thesis of Marjan Celikik (talk on Wednesday)
Semantic Search — Problems

Problem 1: how to index
– previous engines built on top of a DBMS (Data Base Management System), e.g., Oracle
– DBMSs are hard to control (opposite of algorithm engineering)
– ongoing work: reduction to prefix search and join

Problem 2: integrate an ontology
– relate words / phrases in text to entities from ontology
– no time for deep parsing, reasoning etc.
– learn from neighboring words
– numerous algorithmic and engineering problems to make it
scale to something like Wikipedia (> 10,000,000,000 words)
Semantic Search — Entity Recognition

Recognize entities by looking at neighboring words

Quantum inequalities: Einstein's theory of General Relativity amounts to a description …
→ Albert Einstein, the physicist
is a: physicist, mathematician, vegetarian, person, entity, …
born in: 1879

Violin Sonata No. 5: …, according to Einstein's Mozart: His Character, His Work.
→ Alfred Einstein, the musicologist
is a: musicologist, scholar, intellectual, person, entity, …
born in: 1880
Software

Enhance our prototype
– improve source code, documentation, …
– integrate our results into the system

Make available to others
– public demonstrators
– as a platform for experimentation
– as a fancy search engine construction toolkit
Thank you!
General theme of this project

Project title
Efficient Search in Very Large Text Collections, Databases, and
Ontologies

In short
Fancy searches, yet fast
– advanced search, yet highly scalable
– quality is an issue
– but must not sacrifice performance
(as often happens in AI, CL, ML)

General
“Search engines are a fascinating, multi-faceted field of
research giving rise to a multitude of challenging algorithmic
problems with a strong algorithm engineering component and
of high practical relevance.”
Overview

[just for myself not for the talk]
An Index for prefix search
– inverted index + our + open problem + top-k

Building such an index
– INV = sorting, HYB = semi-sorting

Error-tolerant search
– reduce to spelling variants clustering, define problem

Semantic Search
– point out entity annotation problem
Prefix Search

Show demo
– first explain prefix search
– then how to use it for faceted search
– use DBLP + show dblp.mpi-inf.mpg.de

Explain inverted index
– show for example prefix query
– point out IO-efficiency
– point out compressibility
– but quadratic worst-case complexity
Problems encountered in this project

Indexing: fast queries, succinct index, fast construction
– Index structures for advanced queries (beyond keyword search)
– How to build them fast


Learning from text: scalable, yet effective
– large-scale spelling correction
algorythm → algorithm
– large-scale synonymy detection
web ≈ internet
– large-scale entity annotation
Einstein → the physicist? the physical unit? the musicologist?
Fundamental problems
– fast intersection of (sorted) sequences
– efficient (de)compression
I will explain each of these in detail in the following
just kidding
Problems encountered in this project

Indexing: fast queries, succinct index, fast construction
– Index structures for advanced queries (beyond keyword search)
Example: prefix search
– How to build them fast

Learning from text: scalable, yet effective
– large-scale spelling correction
algorythm → algorithm (Demo + problem definition)
– large-scale synonymy detection
web ≈ internet
– large-scale entity annotation
Einstein → the physicist? the physical unit? the musicologist? (Demo)

Fundamental problems
– fast intersection of (sorted) sequences
– efficient (de)compression
I will give you a glimpse of some of these in the following
Overview

Part 1
– Definition of our prefix search problem
– Applications
– Demos of our search engine

Part 2
– Problem definition again
– One way to solve it
– Another way to solve it
– Your way to solve it
Part 1
Definition, Applications, Demos
Problem Definition — Formal

Context-Sensitive Prefix Search

Preprocess
– a given collection of text documents such that queries of
the following kind can be processed efficiently

Given
– an arbitrary set of documents D
– and a range of words W

Compute
– all word-in-document pairs (w, d)
such that w ∈ W, d ∈ D, and w occurs in d
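As a reference point, the definition above can be written down as a brute-force procedure with no preprocessing at all. This is only a sketch: the toy collection, the function name `prefix_search`, and the representation of documents as word lists are assumptions made for illustration and do not come from the talk.

```python
from typing import Dict, List, Set, Tuple


def prefix_search(collection: Dict[int, List[str]],
                  docs: Set[int],
                  word_lo: str,
                  word_hi: str) -> List[Tuple[str, int]]:
    """Return all (word, doc id) pairs with doc id in docs and
    word_lo <= word <= word_hi, where the word occurs in that document."""
    pairs = set()
    for d in docs:
        for w in collection.get(d, []):
            if word_lo <= w <= word_hi:
                pairs.add((w, d))
    return sorted(pairs)


# Hypothetical toy collection: doc id -> words of that document.
collection = {
    13: ["engine", "search"],
    17: ["prefix", "search"],
    88: ["engine", "engineering", "index"],
}

# The prefix eng* corresponds to the word range ["eng", "eng\uffff"].
result = prefix_search(collection, {13, 17, 88}, "eng", "eng\uffff")
```

Every index structure discussed later has to return the same pairs as this scan, only faster.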
Problem Definition — Visual

Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)

[Figure: a collection of documents (e.g., D13, D17, D88), each drawn as a box of word ids]

Query
– given a sorted list of doc ids: D13 D17 D88 …
– and a range of word ids: C D E F G

Answer
– all matching word-in-doc pairs
– with scores
– and positions

doc   word  score  position
D13   E     0.5    5
D88   E     0.2    7
D88   G     0.7    1
…     …     …      …
Application 1: Autocompletion

After each keystroke
– display completions of the last query word that lead to the
best hits, together with the best such hits
– e.g., for the query google amp display amphitheatre and
the corresponding hits
Application 2: Error Correction

As before, but also …
– … display spelling variants of completions that would lead
to a hit
– e.g., for the query probabilistic algorithm also consider a
document containing probalistic aigorithm

Implementation
– if, say, aigorithm occurs as a misspelling of algorithm,
then for every occurrence of aigorithm in the index
aigorithm              Doc. 17
also add
algorithm::aigorithm   Doc. 17
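The index augmentation described above could be sketched as follows. The dictionary-based index, the function name `add_spelling_variants`, and the misspelling table are hypothetical; only the `correct::misspelled` naming scheme is taken from the slide.

```python
from collections import defaultdict


def add_spelling_variants(index, misspellings):
    """Copy the postings of each misspelled word to an artificial word
    of the form 'correct::misspelled' (scheme from the slide)."""
    augmented = defaultdict(list, {w: list(p) for w, p in index.items()})
    for wrong, correct in misspellings.items():
        if wrong in index:
            augmented[f"{correct}::{wrong}"].extend(index[wrong])
    return dict(augmented)


# Hypothetical toy index: word -> sorted doc ids.
index = {"aigorithm": [17], "algorithm": [3, 88]}
augmented = add_spelling_variants(index, {"aigorithm": "algorithm"})

# A prefix query for "algorithm" now also completes to the variant.
matches = sorted(w for w in augmented if w.startswith("algorithm"))
```

The point of the scheme is that no new query machinery is needed: the misspelled occurrence becomes reachable through an ordinary prefix query for the correct word.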
Application 3: Query Expansion

As before, but also …
– … display words related to completions that would lead to
a hit
– e.g., for the query russia metal also consider documents
containing russia aluminium

Implementation
– for, say, every occurrence of aluminium in the index
aluminium        Doc. 17
also add (once for every occurrence)
s:67:aluminium   Doc. 17
and (only once for the whole collection)
s:aluminium:67   Doc. 00
Application 4: Faceted Search

As before, but also …
– … along with the completions and hits, display a
breakdown of the result set by various categories
– e.g., for the query algorithm show (prominent) authors of
articles containing these words

Implementation
– for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add
author:Thomas_Hofmann   Doc. 17
venue:NIPS              Doc. 17
year:2004               Doc. 17
– also add
thomas:author:Thomas_Hofmann    Doc. 17
hofmann:author:Thomas_Hofmann   Doc. 17
etc.
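A small sketch of how the artificial facet words above might be generated for one article. The function name `facet_words` and the way name parts are split are assumptions; the word formats follow the slide.

```python
def facet_words(author, venue, year):
    """Artificial index words for one article, using the naming
    scheme shown on the slide."""
    author_id = author.replace(" ", "_")
    words = [f"author:{author_id}", f"venue:{venue}", f"year:{year}"]
    # One extra word per name part, so that typing a prefix of any
    # part of the name completes to the author facet.
    words += [f"{part.lower()}:author:{author_id}" for part in author.split()]
    return words


# All facet words of the article are added for the same document (Doc. 17).
postings = {w: [17] for w in facet_words("Thomas Hofmann", "NIPS", 2004)}

# Typing "hof" is a prefix query that finds the author facet.
hof_matches = [w for w in postings if w.startswith("hof")]
```

With these entries in place, a breakdown by author for a query is just the prefix query `author:*` restricted to the documents matching that query.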
Application 5: Semantic Search

As before, but also …
– … display “semantic” completions
– e.g., for the query beatles musician display instances of the
class musician that occur together with the word beatles

Implementation
– cannot simply duplicate index entries of an entity for each
category it belongs to, e.g. John Lennon is a
singer, songwriter, person, human being, organism,
guitarist, pacifist, vegetarian, entertainer, musician, …
– tricky combination of completions and joins → SIGIR’07
and still more applications …
Part 2
Solutions and Open Problem
Solution 1: Inverted Index

For example, probab* alg*
given the documents: D13, D17, D88, … (ids of hits for probab*)
and the word range: C D E F G (ids for alg*)

Iterate over all words from the given range
C (algae)       D8, D23, D291, ...
D (algarve)     D24, D36, D165, ...
E (algebra)     D13, D24, D88, ...
F (algol)       D56, D129, D251, ...
G (algorithm)   D3, D15, D88, ...

Intersect each list with the given one and merge the results
D13 E
D88 E
D88 G
…

running time |D| ∙ |W| + log |W| ∙ merge volume
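The procedure on this slide can be sketched in a few lines. The posting lists below are the visible parts of the lists on the slide (the "..." tails are left out), and the function names are hypothetical. The per-word intersections account for the |D| ∙ |W| term, and the heap-based merge of |W| sorted lists for the log |W| ∙ merge volume term.

```python
import heapq


def intersect(a, b):
    """Intersect two sorted lists of doc ids with a linear two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def prefix_completion_inv(postings, docs, words):
    """Intersect the list of every word in the range with docs,
    then merge the per-word results by doc id."""
    per_word = [[(d, w) for d in intersect(postings[w], docs)] for w in words]
    return list(heapq.merge(*per_word))


# Visible prefixes of the posting lists from the slide.
postings = {
    "algae": [8, 23, 291],
    "algarve": [24, 36, 165],
    "algebra": [13, 24, 88],
    "algol": [56, 129, 251],
    "algorithm": [3, 15, 88],
}
docs = [13, 17, 88]  # ids of hits for probab*
result = prefix_completion_inv(postings, docs, sorted(postings))
```

On this input the result consists of the pairs (13, algebra), (88, algebra), (88, algorithm), which corresponds to the answer D13 E, D88 E, D88 G shown on the slide.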
A General Idea

Precompute inverted lists for ranges of words
list for A-D
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:    D A C A B A C A D A A  B  C  A  C  A
Note
– each prefix corresponds to a word range
– ideally precompute list for each possible prefix
– too much space
– but lots of redundancy
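Two of the notes above can be made concrete with a short sketch: how a prefix maps to a contiguous range in the sorted vocabulary, and why one precomputed list per prefix takes too much space. The function names are hypothetical, and the sentinel character assumes that no word contains U+10FFFF.

```python
from bisect import bisect_left


def word_range(vocab, prefix):
    """Return (lo, hi) such that vocab[lo:hi] are exactly the words
    of the sorted vocabulary that start with prefix."""
    lo = bisect_left(vocab, prefix)
    hi = bisect_left(vocab, prefix + "\U0010ffff")
    return lo, hi


def volume_all_prefixes(postings):
    """Total number of postings if a separate list were stored for every
    prefix: a word of length k has k non-empty prefixes, so each of its
    postings would be stored k times."""
    return sum(len(word) * len(docs) for word, docs in postings.items())


# Toy data (hypothetical): word -> sorted doc ids.
postings = {
    "algae": [8, 23, 291],
    "algarve": [24, 36, 165],
    "algebra": [13, 24, 88],
    "algol": [56, 129, 251],
    "algorithm": [3, 15, 88],
}
vocab = sorted(postings)
plain_volume = sum(len(docs) for docs in postings.values())
blown_up_volume = volume_all_prefixes(postings)
```

In this toy example the plain index has 15 postings, while lists for all prefixes would hold 99, most of them copies of each other.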
Solution 2: AutoTree

SPIRE’06 / JIR’07
Trick 1: Relative bit vectors
– the i-th bit of the root node corresponds to the i-th doc
– the i-th bit of any other node corresponds to the i-th set bit
of its parent node
aachen-zyskowski   1111111111111…   (marked bit corresponds to doc 5)
maakeb-zyskowski   1001000111101…   (marked bit corresponds to doc 5)
maakeb-stream      1001110…         (marked bit corresponds to doc 10)
Solution 2: AutoTree

SPIRE’06 / JIR’07
Trick 2: Push up the words
– For each node, by each set bit, store the leftmost word
of that doc that is not already stored by a parent node

Example: D = 5, 7, 10 and W = max*
1 1 1 1 1 1 1 1 1 1 …   D = 5, 7, 10
1 0 0 0 1 0 0 1 1 1 …   D = 5, 10 (→ 2, 5); report: maximum
1 0 0 1 1 …             D = 5; report: Ø → STOP
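The relative bit vectors from Trick 1 and the descent in the example above rely on two primitives, rank and select. The following is a naive sketch that works on bit strings; the function names are hypothetical, and a real implementation would use constant-time rank structures rather than scanning. Only the visible bits of the vectors on the slides are used.

```python
def rank(bits, i):
    """Number of set bits among the first i bits (positions are 1-based)."""
    return bits[:i].count("1")


def select(bits, k):
    """1-based position of the k-th set bit."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == "1":
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k set bits")


def descend(bits, positions):
    """Keep the positions whose bit is set in this node and translate
    them to positions in the bit vector of the child node."""
    return [rank(bits, p) for p in positions if bits[p - 1] == "1"]


def to_doc_id(ancestors, i):
    """Map position i of a node back to a doc id, given the bit vectors
    of its ancestors ordered from the root down to its parent."""
    for bits in reversed(ancestors):
        i = select(bits, i)
    return i


# Example from the Trick 2 slide: docs 5, 7, 10 at a node with these bits.
child_positions = descend("1000100111", [5, 7, 10])

# Example from the Trick 1 slide: the 5th bit of maakeb-stream.
doc = to_doc_id(["1111111111111", "1001000111101"], 5)
```

The first call reproduces the "D = 5, 10 (→ 2, 5)" step: doc 7 is dropped because its bit is 0, and docs 5 and 10 become positions 2 and 5. The second call yields 10, consistent with the slide's note that a bit of maakeb-stream corresponds to doc 10. The heavy use of such rank operations is what the later summary refers to as poor locality of access.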
Solution 2: AutoTree

SPIRE’06 / JIR’07
Trick 3: divide into blocks
– and build a tree over each block as shown before

Theorem:
– query processing time O(|D| + |output|)
(99% correlation with actual running times)
– uses no more space than an inverted index

AutoTree Summary:
+ output-sensitive
– not IO-efficient (heavy use of bit-rank operations)
– compression not optimal
Parenthesis

Despite its quadratic worst-case complexity, the
inverted index is hard to beat in practice
– very simple code
data
– lists are highly compressible
– perfect locality of access

Number of operations is a deceptive measure
– 100 disk seeks take about half a second
– in that time can read 200 MB of contiguous data
(if stored compressed)
– main memory: 100 non-local accesses ≈ 10 KB data block
Solution 3: HYB

SIGIR’06 / IR’07
Flat division of word range into blocks

list for A-D
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:    D A C A B A C A D A A  B  C  A  C  A

list for E-J
doc ids:  2 2 3 3 4 4 7 7 8 8 9 9 11
words:    E F G J H I I E F G H J I

list for K-N
doc ids:  1 1 2 3 4 5 6 6 6 8 9 9 9 10 10
words:    L N M N N K L M N M K L M K  L
Solution 3: HYB

SIGIR’06 / IR’07
Flat division of word range into blocks
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:    D A C A B A C A D A A  B  C  A  C  A

Replace doc ids by gaps and words by frequency ranks:
gaps:   +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
ranks:  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st

Encode both gaps and ranks such that x → log2 x bits
+0 → 0      1st (A) → 0
+1 → 10     2nd (C) → 10
+2 → 110    3rd (D) → 111
            4th (B) → 110

An actual block of HYB
10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
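The block above can be recomputed step by step. In this sketch the doc ids, words, and the two small code tables are copied from the slide; breaking frequency ties by first occurrence is an assumption that happens to reproduce the ranks shown there (D is 3rd, B is 4th).

```python
from collections import Counter

doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words = list("DACABACADAABCACA")

# Gaps between consecutive doc ids (the first gap is relative to 0).
gaps = [d - prev for prev, d in zip([0] + doc_ids[:-1], doc_ids)]

# Frequency rank of each word within the block (1 = most frequent).
# most_common() keeps first-seen order among equally frequent words.
by_frequency = [w for w, _ in Counter(words).most_common()]
ranks = [by_frequency.index(w) + 1 for w in words]

# Toy prefix-free code tables as given on the slide.
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}

encoded_gaps = " ".join(GAP_CODE[g] for g in gaps)
encoded_ranks = " ".join(RANK_CODE[r] for r in ranks)
```

Small gaps and frequent words get the shortest codes, which is why a block compresses well even though it stores a word id next to every doc id.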
Solution 3: HYB

SIGIR’06 / IR’07
Flat division of word range into blocks
doc ids:  1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
words:    D A C A B A C A D A A  B  C  A  C  A

Theorem:
– Let n = number of documents, m = number of words
– If blocks are chosen of equal volume ~ n
– Then query time ~ n and empirical entropy H_HYB ~ (1 + ε) ∙ H_INV
(experimental results match perfectly)

HYB Summary:
+ IO-efficient (mere scans of data)
+ very good compression
– not output-sensitive
Conclusion

Context-sensitive prefix search
– core mechanism of the CompleteSearch engine
– simple enough to allow efficient realization
– powerful enough to support many advanced search features

Open problems
– solution which is both output-sensitive and IO-efficient
– implement the whole thing using MapReduce
– support yet more features
–…
Thank you!
Processing the query “beatles musician”

Document “Gitanes”
… legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …

Document “John Lennon”
position 0: entity:john_lennon
position 1: relation:is_a
position 2: class:musician
position 2: class:singer
…

Two prefix queries
– beatles entity:* → entity:john_lennon, entity:1964, entity:liverpool, etc.
– entity:* . relation:is_a . class:musician → entity:wolfgang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.

One join
– entity:john_lennon, etc.
Processing the query “beatles musician”
Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
– prefix search efficient only for up to ≈ 1% (explanation follows)

Solution: frontier classes
– classes at “appropriate” level in the hierarchy
– e.g.: artist, believer, worker, vegetable, animal, …
Processing the query “beatles musician”

Document “Gitanes”
… legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …

Document “John Lennon”
position 0: artist:john_lennon
position 0: believer:john_lennon
position 1: relation:is_a
position 2: class:musician
…

First figure out: musician → artist (easy)

Two prefix queries
– beatles artist:* → artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
– artist:* . relation:is_a . class:musician → artist:wolfgang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.

One join
– artist:john_lennon, etc.
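The "two prefix queries, one join" scheme can be illustrated with a toy sketch. Everything here is a simplification: documents are plain word lists, positions and the proximity operator "." are ignored, and the function name `completions` as well as the Bach document are hypothetical.

```python
def completions(docs, required, prefix):
    """Words starting with prefix that occur in documents
    containing all of the required words."""
    found = set()
    for words in docs.values():
        if all(r in words for r in required):
            found.update(w for w in words if w.startswith(prefix))
    return found


# Toy collection: one text document and two ontology documents.
docs = {
    "Gitanes": ["legend", "john", "lennon", "artist:john_lennon",
                "believer:john_lennon", "beatles", "smoked"],
    "John Lennon": ["artist:john_lennon", "believer:john_lennon",
                    "relation:is_a", "class:musician"],
    "Johann Sebastian Bach": ["artist:johann_sebastian_bach",
                              "relation:is_a", "class:musician"],
}

# Prefix query 1: beatles artist:*
in_text = completions(docs, ["beatles"], "artist:")
# Prefix query 2: artist:* . relation:is_a . class:musician
musicians = completions(docs, ["relation:is_a", "class:musician"], "artist:")
# Join: entities that appear in both result lists.
answer = sorted(in_text & musicians)
```

Restricting both queries to the frontier class `artist:` instead of `entity:*` is what keeps the intermediate lists small enough for prefix search to stay efficient.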
INV vs. HYB — Space Consumption
Theorem: The empirical entropy of INV is
Σ ni ∙ (1/ln 2 + log2(n/ni))
Theorem: The empirical entropy of HYB with block size ε∙n is
Σ ni ∙ ((1+ε)/ln 2 + log2(n/ni))
ni = number of documents containing i-th word, n = number of documents
            HOMEOPATHY        WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions    with positions     no positions
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB
Nice match of theory and practice
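The two entropy formulas from the theorems can be evaluated directly. This is a sketch with made-up document frequencies; the function names and the toy numbers are assumptions, and the results are in bits.

```python
import math


def entropy_inv(doc_freqs, n):
    """Empirical entropy of INV: sum of n_i * (1/ln 2 + log2(n / n_i))."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(doc_freqs, n, eps):
    """Empirical entropy of HYB with block size eps * n:
    sum of n_i * ((1 + eps)/ln 2 + log2(n / n_i))."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni))
               for ni in doc_freqs)


# Hypothetical toy input: n_i for four words in a collection of n = 16 docs.
doc_freqs = [8, 4, 2, 2]
n = 16
h_inv = entropy_inv(doc_freqs, n)
h_hyb = entropy_hyb(doc_freqs, n, eps=0.1)
overhead = h_hyb - h_inv
```

The difference between the two formulas is ε ∙ Σ ni / ln 2, so the extra space of HYB grows with the total number of postings and with the relative block size ε.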
INV vs. HYB — Query Time

Experiment: type ordinary queries from left to right
db , dbl , dblp , dblp un , dblp uni , dblp univ , dblp unive , ...
        HOMEOPATHY            WIKIPEDIA             TREC .GOV
        44,015 docs           2,866,503 docs        25,204,013 docs
        263,817 words         6,700,119 words       25,263,176 words
        5,732 real queries    100 random queries    50 TREC queries
        with proximity        with proximity        no proximity
INV     avg: 0.03 secs        avg: 0.17 secs        avg: 0.58 secs
        max: 0.38 secs        max: 2.27 secs        max: 16.83 secs
HYB     avg: 0.003 secs       avg: 0.05 secs        avg: 0.11 secs
        max: 0.06 secs        max: 0.49 secs        max: 0.86 secs
HYB beats INV by an order of magnitude
Engineering

With HYB, every query is essentially one block scan
– perfect locality of access, no sorting or merging, etc.
– balanced ratio of read, decompression, processing, etc.

read   decomp.   intersect   rank   history
21%    18%       11%         15%    35%

Careful implementation in C++
– Experiment: sum over array of 10 million 4-byte integers
(on a Linux PC with an approx. 2 GB/sec memory bandwidth)

C++     1800 MB/sec
Java    300 MB/sec
MySQL   16 MB/sec
Perl    2 MB/sec
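For readers who want to repeat this kind of measurement, here is a rough sketch of the experiment in Python. It is not the benchmark used in the talk: the function name is hypothetical, the timing covers only the summation, and the result depends heavily on the machine and interpreter.

```python
import time
from array import array


def scan_throughput(n):
    """Sum n machine integers once and return (sum, MB per second)."""
    # 'i' is a C int, which is 4 bytes on common platforms;
    # data.itemsize reports the actual size.
    data = array("i", range(n))
    start = time.perf_counter()
    total = sum(data)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, n * data.itemsize / 1e6 / elapsed
```

Calling `scan_throughput(10_000_000)` matches the array size from the slide; comparing the reported rate with the memory bandwidth of the machine shows how far a given language is from a plain scan.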
System Design — High Level View
Compute Server: C++
Web Server: PHP
User Client: JavaScript
Debugging such an application is hell!