Efficient Search in Very Large Text Collections, Databases, and Ontologies
DFG Priority Programme "Algorithm Engineering", Kickoff Meeting in Karlsruhe, December 2–3, 2007
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany

General theme of this project
Search engines
– a large variety of challenging algorithmic problems with high practical relevance
– algorithm engineering is absolutely essential
Focus on scalability
– terabytes of data, hundreds of millions of documents
– query times in a fraction of a second
Focus on advanced queries
– beyond Google-style keyword search
– but still as efficient in time and space
Fancy searches, yet fast
– efficiency is often a secondary issue in DB, AI, CL, or ML research

Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
– index structures for advanced queries (beyond keyword search)
– how to build them fast
Learning from text: scalable, yet effective
– large-scale spelling correction (algorythm → algorithm)
– large-scale synonymy detection (web ≈ internet)
– large-scale entity annotation (Einstein: the physicist? the physical unit? the musicologist?)
"Basic toolbox" (for search)
– fast intersection of (sorted) sequences
– efficient (de)compression
– possible synergies with Peter Sanders' project
I will give a few glimpses in the following

Prefix Completion
Fundamental search problem
– definition on the next slide
– many notoriously difficult search problems can be reduced to it
– for example, faceted search: for, say, an article by Peter Sanders that appeared in WEA 2007, add the artificial words author:Peter_Sanders, venue:WEA, and year:2007 to Doc. 17
Prefix Completion — Problem Definition
Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)
[figure: documents shown as columns of word ids]
Query
– given a sorted list of doc ids (D13, D17, D88, …)
– and a range of word ids (C D E F G)
Answer
– all matching word-in-doc pairs (D13 E, D88 E, D88 G, …)
– with scores (0.5, 0.2, 0.7, …)
– and positions (5, 7, 1, …)

Prefix Completion — via the Inverted Index
For example, for the query algor* eng*, given the documents D13, D17, D88, … (ids of hits for algor*) and the word range C D E F G (ids for eng*)
Iterate over all words from the given range
C (engage)       D8, D23, D291, …
D (engel)        D24, D36, D165, …
E (engine)       D13, D24, D88, …
F (engines)      D56, D129, D251, …
G (engineering)  D3, D15, D88, …
Intersect each list with the given one and merge the results: D13 E, D88 E, D88 G, …
Running time: |D| ∙ |W| + log |W| ∙ merge volume

Prefix Completion — Status Quo & Problems
The inverted index
– highly compressible
– perfect locality of access (T operations → T / block size IOs); note: the time for 100 disk seeks equals the time for reading 200 MB of compressed data
– but quadratic worst-case complexity
AutoTree [Bast, Weber, Mortensen, SPIRE'06]
– output-sensitive (query time linear in the size of the output), with 99% correlation with actual running times
– but poor locality of access (heavy use of bit-rank operations)
The half-inverted index [Bast, Weber, SIGIR'06]
– highly compressible + perfect locality of access, with perfect prediction of time and space consumption
– query time linear in the number of docs, with a small constant
Major open problem: output-sensitive and IO-efficient

Error-Tolerant Search
With prefix search available, reduces to the following
– Problem: given a set of distinct words (lexicon), find all clusters of words that are spelling variants of each other, e.g., {algorithm, algorytm, alogrithm}, {logarithm, logaythm}, {machine, maschine, mahcine}
Challenges
– find an appropriate measure of distance between words
– an algorithm that scales in theory as well as in practice
– possible synergies with Ernst Mayr's project
Master's thesis of Marjan Celikik (talk on Wednesday)

Semantic Search — Problems
Problem 1: how to index
– previous engines were built on top of a Data Base Management System (e.g., Oracle)
– DBMSs are hard to control (the opposite of algorithm engineering)
– ongoing work: reduction to prefix search and join
Problem 2: integrate an ontology
– relate words / phrases in the text to entities from the ontology
– no time for deep parsing, reasoning, etc.
– learn from neighboring words
– numerous algorithmic and engineering problems to make it scale to something like Wikipedia (> 10,000,000,000 words)

Semantic Search — Entity Recognition
Recognize entities by looking at neighboring words
– "Quantum inequalities Einstein's theory of General Relativity amounts to a description …" → Albert Einstein, the physicist (is a: physicist, mathematician, vegetarian, person, entity, …; born in: 1879)
– "Violin Sonata No. 5 …, according to Einstein's Mozart: His Character, His Work." → Alfred Einstein, the musicologist (is a: musicologist, scholar, intellectual, person, entity, …; born in: 1880)

Software
Enhance our prototype
– improve source code, documentation, …
– integrate our results into the system
Make it available to others
– public demonstrators
– as a platform for experimentation
– as a fancy search-engine construction toolkit
Thank you!

General theme of this project
Project title: Efficient Search in Very Large Text Collections, Databases, and Ontologies
In short: fancy searches, yet fast
– advanced search, yet highly scalable
– quality is an issue, but must not sacrifice performance (as often happens in AI, CL, ML)
General: "Search engines are a fascinating, multi-faceted field of research giving rise to a multitude of challenging algorithmic problems with a strong algorithm engineering component and of high practical relevance."

Overview [just for myself, not for the talk]
An index for prefix search
– inverted index + ours + open problem + top-k
Building such an index
– INV = sorting, HYB = semi-sorting
Error-tolerant search
– reduce to spelling-variants clustering, define the problem
Semantic search
– point out the entity annotation problem

Prefix Search
Show demo
– first explain prefix search
– then how to use it for faceted search
– use DBLP + show dblp.mpi-inf.mpg.de
Explain the inverted index
– show an example prefix query
– point out IO-efficiency
– point out compressibility
– but quadratic worst-case complexity

Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
– index structures for advanced queries (beyond keyword search); example: prefix search
– how to build them fast
Learning from text: scalable, yet effective
– large-scale spelling correction (algorythm → algorithm); demo + problem definition
– large-scale synonymy detection (web ≈ internet)
– large-scale entity annotation (Einstein: the physicist? the physical unit? the musicologist?); demo
Fundamental problems
– fast intersection of (sorted) sequences
– efficient (de)compression
I will explain each of these in detail in the following … just kidding: I will give you a glimpse of some of these.

Overview
Part 1
– definition of our prefix search problem
– applications
– demos of our search engine
Part 2
– problem definition again
– one way to solve it
– another way to solve it
– your way to solve it

Part 1: Definition, Applications, Demos

Problem Definition — Formal
Context-sensitive prefix search: preprocess a given collection of text documents such that queries of the following kind can be processed efficiently
Given
– an arbitrary set of documents D
– and a range of words W
Compute
– all word-in-document pairs (w, d) such that w ∈ W and d ∈ D

Problem Definition — Visual
Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)
[figure: documents shown as columns of word ids]
Query
– given a sorted list of doc ids (D13, D17, D88, …)
– and a range of word ids (C D E F G)
Answer
– all matching word-in-doc pairs (D13 E, D88 E, D88 G, …)
– with scores (0.5, 0.2, 0.7, …)
– and positions (5, 7, 1, …)

Application 1: Autocompletion
After each keystroke
– display the completions of the last query word that lead to the best hits, together with the best such hits
– e.g., for the query google amp, display amphitheatre and the corresponding hits

Application 2: Error Correction
As before, but also …
– … display spelling variants of completions that would lead to a hit
– e.g., for the query probabilistic algorithm, also consider a document containing probalistic aigorithm
Implementation
– if, say, aigorithm occurs as a misspelling of algorithm, then for every index entry (aigorithm, Doc. 17) also add (algorithm::aigorithm, Doc. 17)

Application 3: Query Expansion
As before, but also …
– … display words related to completions that would lead to a hit
– e.g., for the query russia metal, also consider documents containing russia aluminium
Implementation
– for, say, every occurrence of aluminium in the index (aluminium, Doc. 17), also add (once for every occurrence) (s:67:aluminium, Doc. 17) and (once for the whole collection) (s:aluminium:67, Doc. 00)

Application 4: Faceted Search
As before, but also …
– … along with the completions and hits, display a breakdown of the result set by various categories
– e.g., for the query algorithm, show (prominent) authors of articles containing these words
Implementation
– for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add the artificial words author:Thomas_Hofmann, venue:NIPS, and year:2004 to Doc. 17
– also add thomas:author:Thomas_Hofmann, hofmann:author:Thomas_Hofmann, etc. to Doc. 17

Application 5: Semantic Search
As before, but also …
– … display "semantic" completions
– e.g., for the query beatles musician, display instances of the class musician that occur together with the word beatles
Implementation
– cannot simply duplicate the index entries of an entity for each category it belongs to, e.g.
John Lennon is a singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
– tricky combination of completions and joins [SIGIR'07]
– and still more applications …

Part 2: Solutions and Open Problem

Solution 1: Inverted Index
For example, for the query probab* alg*, given the documents D13, D17, D88, … (ids of hits for probab*) and the word range C D E F G (ids for alg*)
Iterate over all words from the given range
C (algae)      D8, D23, D291, …
D (algarve)    D24, D36, D165, …
E (algebra)    D13, D24, D88, …
F (algol)      D56, D129, D251, …
G (algorithm)  D3, D15, D88, …
Intersect each list with the given one and merge the results: D13 E, D88 E, D88 G, …
Running time: |D| ∙ |W| + log |W| ∙ merge volume

A General Idea
Precompute inverted lists for ranges of words
[figure: the precomputed list for the word range A–D, doc ids annotated with their words]
Note
– each prefix corresponds to a word range
– ideally, precompute the list for each possible prefix
– too much space, but lots of redundancy

Solution 2: AutoTree [SPIRE'06 / JIR'07]
Trick 1: relative bit vectors
– the i-th bit of the root node corresponds to the i-th doc
– the i-th bit of any other node corresponds to the i-th set bit of its parent node
[figure: tree of word ranges aachen–zyskowski, maakeb–zyskowski, maakeb–stream with their bit vectors]
Trick 2: push up the words
– for each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[figure: example query D = 5, 7, 10 with W = max*; the node for D = 5, 10 reports "maximum", the next node reports nothing and the search stops]
Trick 3: divide into blocks
– and build a tree over each block as shown before
Theorem
– query processing time O(|D| + |output|), with 99% correlation with actual running times
– uses no more space than an inverted index
AutoTree summary
+ output-sensitive
– not IO-efficient (heavy use of bit-rank operations)
– compression not optimal

Parenthesis
Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
– very simple code and data
– lists are highly compressible
– perfect locality of access
The number of operations is a deceptive measure
– 100 disk seeks take about half a second; in that time one can read 200 MB of contiguous data (if stored compressed)
– main memory: 100 non-local accesses ≈ one 10 KB data block

Solution 3: HYB [SIGIR'06 / IR'07]
Flat division of the word range into blocks
[figure: the lists for the word ranges A–D, E–J, and K–N]
Replace doc ids by gaps and words by frequency ranks, and encode both gaps and ranks such that x takes about log2 x bits
– gaps: +0 → 0, +1 → 10, +2 → 110; ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
[figure: an actual block of HYB as a bit sequence]
Theorem
– let n = number of documents, m = number of words
– if blocks are chosen of equal volume ~ n, then the query time is ~ n and the empirical entropy satisfies H_HYB ~ (1 + ε) ∙ H_INV
– experimental results match perfectly
HYB summary
+ IO-efficient (mere scans of data)
+ very good compression
– not output-sensitive
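The gap-plus-code transform above can be sketched in a few lines. This is not the HYB implementation, only an illustration of the idea: doc ids become gaps, and each gap is written with the simple prefix-free code from the slide's example (+0 → 0, +1 → 10, +2 → 110), which is a unary code; a production index would use a code whose length grows like log2 x.

```cpp
#include <string>
#include <vector>

// Illustrative sketch (not the HYB code): turn a sorted doc-id list into
// gaps and encode each gap as 'gap' one-bits followed by a zero-bit,
// matching the example codes +0 -> "0", +1 -> "10", +2 -> "110".
std::string encodeGapsUnary(const std::vector<int>& docIds) {
  std::string bits;
  int prev = 0;
  for (int id : docIds) {
    int gap = id - prev;    // repeated doc ids give a gap of 0
    bits.append(gap, '1');  // 'gap' ones ...
    bits.push_back('0');    // ... terminated by a zero
    prev = id;
  }
  return bits;
}

// Inverse transform: read unary gaps and accumulate them back to doc ids.
std::vector<int> decodeGapsUnary(const std::string& bits) {
  std::vector<int> docIds;
  int prev = 0, gap = 0;
  for (char b : bits) {
    if (b == '1') {
      ++gap;
    } else {
      prev += gap;
      docIds.push_back(prev);
      gap = 0;
    }
  }
  return docIds;
}
```

Decoding is a single left-to-right scan of the block, which is exactly the access pattern that makes HYB IO-efficient.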
Conclusion
Context-sensitive prefix search
– the core mechanism of the CompleteSearch engine
– simple enough to allow an efficient realization
– powerful enough to support many advanced search features
Open problems
– a solution which is both output-sensitive and IO-efficient
– implement the whole thing using MapReduce
– support yet more features
– …
Thank you!

Processing the query "beatles musician"
Example sentence: "… legend says that John Lennon of the Beatles smoked Gitanes to deepen his voice …"
– at the position of "John Lennon", add the artificial word entity:john_lennon
– the ontology contains facts like entity:john_lennon relation:is_a class:musician, class:singer, …
– the query translates to entity:* . relation:is_a . class:musician
– two prefix queries: beatles entity:* (yielding entity:john_lennon, entity:1964, entity:liverpool, etc.) and the instances of class musician (entity:wolfang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.)
– one join: entity:john_lennon etc.
Problem: entity:* has a huge number of occurrences
– ≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
– prefix search is efficient only for up to ≈ 1% (explanation follows)
Solution: frontier classes
– classes at an "appropriate" level in the hierarchy
– e.g.: artist, believer, worker, vegetable, animal, …
– at the position of "John Lennon", instead add artist:john_lennon, believer:john_lennon, …
– the query becomes artist:* . relation:is_a . class:musician: two prefix queries (beatles artist:*, yielding artist:john_lennon, artist:graham_greene, artist:pete_best, etc., and the instances of class musician) and one join: artist:john_lennon etc.
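The "two prefix queries + one join" step above boils down to intersecting two sorted lists of artificial words: the entities occurring near "beatles" and the instances of class musician. A minimal sketch (the function name is illustrative, not from CompleteSearch):

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Join the results of the two prefix queries: both inputs must be sorted.
// Keeps exactly the entities that appear in both lists, i.e. the
// musicians that occur together with "beatles".
std::vector<std::string> joinSorted(const std::vector<std::string>& a,
                                    const std::vector<std::string>& b) {
  std::vector<std::string> out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
  return out;
}
```

With the slide's example lists, joining {entity:1964, entity:john_lennon, entity:liverpool} with the instances of class musician leaves only entity:john_lennon.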
First figure out: musician → artist (easy).

INV vs. HYB — Space Consumption
Theorem: the empirical entropy of INV is Σ n_i ∙ (1/ln 2 + log2(n/n_i))
Theorem: the empirical entropy of HYB with block size ε∙n is Σ n_i ∙ ((1+ε)/ln 2 + log2(n/n_i))
where n_i = number of documents containing the i-th word and n = number of documents

            HOMEOPATHY        WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions    with positions     no positions
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB
Nice match of theory and practice

INV vs. HYB — Query Time
Experiment: type ordinary queries from left to right
– db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, …

            HOMEOPATHY                WIKIPEDIA                 TREC .GOV
            5,732 real queries        100 random queries        50 TREC queries
            with proximity            with proximity            no proximity
INV         avg 0.03 s, max 0.38 s    avg 0.17 s, max 2.27 s    avg 0.58 s, max 16.83 s
HYB         avg .003 s, max 0.06 s    avg 0.05 s, max 0.49 s    avg 0.11 s, max 0.86 s
HYB beats INV by an order of magnitude

Engineering
With HYB, every query is essentially one block scan
– perfect locality of access, no sorting or merging, etc.
– balanced ratio of read, decompression, processing, etc. (read 21%, decompression 18%, intersect 11%, rank 15%, history 35%)
Careful implementation in C++
– experiment: sum over an array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
  C++     1800 MB/sec
  Java     300 MB/sec
  MySQL     16 MB/sec
  Perl       2 MB/sec

System Design — High-Level View
Compute server (C++) ↔ web server (PHP) ↔ user client (JavaScript)
Debugging such an application is hell!
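The bandwidth experiment above is easy to reproduce. The sketch below (an illustration, not the original benchmark code; the function name is made up) sums 10 million 4-byte integers and reports the effective scan rate; the absolute numbers will of course differ from machine to machine.

```cpp
#include <chrono>
#include <cstdint>
#include <numeric>
#include <vector>

// Sum a vector of 4-byte integers and return the effective scan rate in
// MB/sec; the sum is written to *sumOut so the compiler cannot drop the
// scan. A 64-bit accumulator avoids overflow for 10 million values.
double scanRateMBPerSec(const std::vector<int32_t>& data, int64_t* sumOut) {
  auto start = std::chrono::steady_clock::now();
  int64_t sum = std::accumulate(data.begin(), data.end(), int64_t{0});
  auto stop = std::chrono::steady_clock::now();
  *sumOut = sum;
  double secs = std::chrono::duration<double>(stop - start).count();
  double megabytes = data.size() * sizeof(int32_t) / 1e6;
  return megabytes / secs;
}
```

Comparing this rate against the machine's raw memory bandwidth shows how close a careful C++ scan gets to the hardware limit, which is the point the slide is making.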