Implementation Issues
Arjen P. de Vries
Centrum voor Wiskunde en Informatica, Amsterdam
[email protected]

Overview
• Indexing collections
• Some text statistics
• File organization
  – Inverted files
  – Signature files
• Search engines
• Indexing XML

Supporting the Search Process
[Figure: the search process — source selection, query formulation, search, ranked list, selection, document examination, and document delivery, with loops for query reformulation, relevance feedback, and source reselection. A second view adds the indexing side: acquisition of the document collection, indexing, and the resulting document index used by the IR system.]

Some Questions for Today
• How long will it take to find a document?
  – Is there any work we can do in advance? If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?

Text Statistics

Frequent words on the WWW
  65002930 the        30998255 from       22292805 about
  62789720 a          30755410 e          22265579 com
  60857930 to         30080013 you        22107392 information
  57248022 of         29669506 be         21647927 will
  54078359 and        29417504 that       21368265 can
  52928506 in         28542378 not        21367950 more
  50686940 s          28162417 an         21102223 has
  49986064 for        28110383 as         20621335 no
  45999001 on         28076530 home       19898015 other
  42205245 this       27650474 it         19689603 one
  41203451 is         27572533 i          19613061 c
  39779377 by         24548796 have       19394862 d
  35439894 with       24420453 if         19279458 m
  35284151 or         24376758 new        19199145 was
  34446866 at         24171603 t          19075253 copyright
  33528897 all        23951805 your       18636563 us
  31583607 are        23875218 page
(see http://elib.cs.berkeley.edu/docfreq/docfreq.html)

Plotting Word Frequency by Rank
• Main idea: count how many times each token occurs, summed over all of the texts in the collection
• Now order these tokens according to how often they occur (highest to lowest)
• A token’s position in this ordering is called its rank

Zipf Distribution
• The product of the frequency of a word (f) and its rank (r) is approximately constant
  – Rank = order of words by frequency of occurrence
  – f ≈ C × 1/r, with C ≈ N/10 for a collection of N word occurrences
• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times, and so on

[Figure: the Zipf distribution plotted on a linear scale and on a logarithmic scale; illustration by Jacob Nielsen]

What has a Zipf Distribution?
• Words in a text collection — virtually any use of natural language
• Library book checkout patterns
• Incoming web page requests (Nielsen)
• Outgoing web page requests (Cunha & Crovella)
• Document size on the web (Cunha & Crovella)

Related Distributions/Laws
• For more examples, see Robert A. Fairthorne, “Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction,” Journal of Documentation, 25(4), 319-341
• Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity in market prices and communication errors; etc.

Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called “stop words” in IR
  – Usually correspond to the linguistic notion of “closed-class” words
    • English examples: to, from, on, and, the, ...
    • Grammatical classes that don’t take on new members.
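To make the rank-frequency relation concrete, here is a minimal sketch (not from the slides) that ranks tokens by frequency and prints rank × frequency, which under Zipf’s law should hover around the constant C:

```python
# Minimal sketch: estimate the Zipf constant C for a small corpus by
# ranking tokens by frequency and printing rank * frequency.
from collections import Counter

def zipf_table(texts):
    counts = Counter(token for text in texts for token in text.lower().split())
    ranked = counts.most_common()          # highest frequency first
    for rank, (token, freq) in enumerate(ranked[:10], start=1):
        # Under Zipf's law, rank * freq should be roughly constant (~ C).
        print(f"{rank:4d}  {token:12s}  f={freq:6d}  r*f={rank * freq}")

zipf_table(["the cat sat on the mat", "the dog ate the cat", "the mat"])
```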
• There are always a large number of tokens that occur only once (and these can have unexpected consequences for some IR algorithms)
• Medium-frequency words are the most descriptive

Word Frequency vs. Resolving Power
[Figure: resolving power peaks at medium-frequency words (from Van Rijsbergen 79). The most frequent words are not the most descriptive.]

File Organization

File Structures for IR
• Lexicographical indices — indices that are sorted
  – inverted files
  – Patricia (PAT) trees (suffix trees and arrays)
• Indices based on hashing
  – signature files

Document Vectors
• Each document is represented as a vector of term counts (blank means 0 occurrences):

  ID  nova  galaxy  heat  h'wood  film  role  diet  fur
  A    10     5       3
  B     5    10
  C                        10      8     7
  D                         9     10     5
  E                                            10    10
  F                   9                        10
  G           5              7     9
  H           6      10                   2     7     8
  I                          7     5            1     3

• Reading examples: “nova” occurs 10 times in text A, “galaxy” occurs 5 times in text A, “heat” occurs 3 times in text A; “Hollywood” occurs 7 times in text I, “film” occurs 5 times, “diet” occurs 1 time, and “fur” occurs 3 times in text I.

Inverted Index
• The primary data structure for text indexes
• Main idea: “invert” documents into a big index
• Basic steps:
  – Make a “dictionary” of all the tokens in the collection
  – For each token, list all the docs it occurs in
  – Do a few things to reduce redundancy in the data structure

Inverted Indexes
• An inverted file is a document-term matrix “inverted” so that rows become columns and columns become rows:

  docs   t1  t2  t3          terms  D1  D2  D3  D4  D5  D6  D7 ...
  D1      1   0   1          t1      1   1   0   1   1   1   0
  D2      1   0   0          t2      0   0   1   0   1   1   1
  D3      0   1   1          t3      1   0   1   0   1   0   0
  D4      1   0   0
  D5      1   1   1
  D6      1   1   0
  D7      0   1   0
  D8      0   1   0
  D9      0   0   1
  D10     0   1   1

How Inverted Files Are Created
• Documents are parsed to extract tokens, and each token is saved with its document ID (a sketch of this step appears after the next list):
  – Doc 1: “Now is the time for all good men to come to the aid of their country”
  – Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”
  – Result, in parsing order: (now,1) (is,1) (the,1) (time,1) (for,1) (all,1) (good,1) (men,1) (to,1) (come,1) (to,1) (the,1) (aid,1) (of,1) (their,1) (country,1) (it,2) (was,2) (a,2) (dark,2) (and,2) (stormy,2) (night,2) (in,2) (the,2) (country,2) (manor,2) (the,2) (time,2) (was,2) (past,2) (midnight,2)

What are the tokens?
• Stemming?
• Case folding?
• Thesauri and/or soundex?
• Spelling correction?
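A minimal sketch of the parsing step just described, using the slides’ two example documents; tokenization here is naive lowercasing and whitespace splitting, with no stemming or stopping:

```python
# A sketch of the first step of inversion: parse each document and save
# every token with its document ID.
docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. "
       "The time was past midnight",
}

pairs = []                                   # (term, doc_id), in parsing order
for doc_id, text in docs.items():
    for token in text.lower().replace(".", " ").split():
        pairs.append((token, doc_id))        # no stemming/stopping in this sketch

print(pairs[:5])   # [('now', 1), ('is', 1), ('the', 1), ('time', 1), ('for', 1)]
```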
How Inverted Files are Created
• After all documents have been parsed, the inverted file is sorted alphabetically (term, doc#):
  (a,2) (aid,1) (all,1) (and,2) (come,1) (country,1) (country,2) (dark,2) (for,1) (good,1) (in,2) (is,1) (it,2) (manor,2) (men,1) (midnight,2) (night,2) (now,1) (of,1) (past,2) (stormy,2) (the,1) (the,1) (the,2) (the,2) (their,1) (time,1) (time,2) (to,1) (to,1) (was,2) (was,2)
• Multiple term entries for a single document are merged, and within-document term frequency information is compiled (term, doc#, freq):
  (a,2,1) (aid,1,1) (all,1,1) (and,2,1) (come,1,1) (country,1,1) (country,2,1) (dark,2,1) (for,1,1) (good,1,1) (in,2,1) (is,1,1) (it,2,1) (manor,2,1) (men,1,1) (midnight,2,1) (night,2,1) (now,1,1) (of,1,1) (past,2,1) (stormy,2,1) (the,1,2) (the,2,2) (their,1,1) (time,1,1) (time,2,1) (to,1,2) (was,2,2)
• Finally, the file is split into a dictionary and a postings file:

  Dictionary (term, N docs, tot freq)   Postings (doc#, freq)
  a         1  1                        (2,1)
  aid       1  1                        (1,1)
  all       1  1                        (1,1)
  and       1  1                        (2,1)
  come      1  1                        (1,1)
  country   2  2                        (1,1) (2,1)
  dark      1  1                        (2,1)
  for       1  1                        (1,1)
  good      1  1                        (1,1)
  in        1  1                        (2,1)
  is        1  1                        (1,1)
  it        1  1                        (2,1)
  manor     1  1                        (2,1)
  men       1  1                        (1,1)
  midnight  1  1                        (2,1)
  night     1  1                        (2,1)
  now       1  1                        (1,1)
  of        1  1                        (1,1)
  past      1  1                        (2,1)
  stormy    1  1                        (2,1)
  the       2  4                        (1,2) (2,2)
  their     1  1                        (1,1)
  time      2  2                        (1,1) (2,1)
  to        1  2                        (1,2)
  was       1  2                        (2,2)

Searching
Three general steps:
• Vocabulary search
  – Identify the words and patterns in the query
  – Search for them in the vocabulary
• Retrieval of occurrences
  – Retrieve the lists of occurrences of all the words
• Manipulation of occurrences
  – Solve phrases, proximity, or Boolean operations
  – Find the exact word positions when block addressing is used

How Inverted Files are Used
• Boolean query on “time” AND “dark”:
  – 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file
  – 1 doc with “dark” in the dictionary -> ID 2 from the postings file
  – Therefore, only doc 2 satisfies the query.

Query optimization
• Consider a query that is an AND of t terms.
• The idea: for each of the t terms, get its term-doc incidence list from the postings, then AND the lists together.
• Process the terms in order of increasing frequency: start with the smallest set, then keep cutting it down further.
• Keep tf in the dictionary!

Processing for ranking systems
  For each query term
    for each document in the term’s inverted list
      augment the document’s similarity coefficient
  For each document
    finish the calculation of the similarity coefficient
  Sort the similarity coefficients
  Retrieve and present the documents
A sketch of this loop follows below.
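A small sketch of the term-at-a-time loop above, over a toy postings map from the two-document example; the weights anticipate the vector-space slides that follow (w_t = 1 + log_e(N/f_t), accumulating log_e(1 + tf) × w_t, without document-length normalization):

```python
# A sketch of term-at-a-time ranking over a map term -> [(doc_id, tf), ...].
import math
from collections import defaultdict

postings = {"time": [(1, 1), (2, 1)], "dark": [(2, 1)]}
N = 2                                        # number of documents

def rank(query_terms):
    scores = defaultdict(float)              # one accumulator per document
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue
        w_t = 1 + math.log(N / len(plist))   # rarer terms weigh more
        for doc_id, tf in plist:             # augment similarity coefficient
            scores[doc_id] += math.log(1 + tf) * w_t
    # finish: sort accumulators by similarity, highest first
    return sorted(scores.items(), key=lambda x: -x[1])

print(rank(["time", "dark"]))                # doc 2 ranks first
```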
Processing Vector Space Queries (I)
  Set A <- {}                 (A is the set of accumulators)
  For each query term t in Q:
    Stem t
    Search the lexicon; record f_t and the address of I_t
    Set w_t <- 1 + log_e(N / f_t)

Processing Vector Space Queries (II)
    Read the inverted file entry I_t
    For each (d, f_{d,t}) pair in I_t:
      If A_d not in A, then set A_d <- 0 and set A <- A + {A_d}
      Set A_d <- A_d + log_e(1 + f_{d,t}) * w_t
  For each A_d in A:
    Set A_d <- A_d / W_d      (where W_d is the weight of document d)
  A_d is now proportional to cosine(Q, D_d)

How Large is the Dictionary?
• Heaps’ law: the vocabulary grows as O(n^β), with β ≈ 0.4–0.6
[Figure: number of unique terms (up to ~2.0e6) plotted against the number of documents]

Term Storage in Dictionary
• Concatenate the term strings and store a pointer to only every kth term; term lengths then need to be stored too (1 extra byte):
  ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo….
• With k = 4: save 9 bytes on 3 omitted pointers, lose 4 bytes on term lengths.
[Figure: dictionary layout with columns freq., postings ptr., and term ptr.]

Searching the Dictionary
• Sorted arrays
  – store the list of keywords in a sorted array
  – use a standard binary search
  – advantage: easy to implement
  – disadvantage: updating the index is expensive
• Hashing structures
• B-trees, tries, Pat(ricia) trees, …
• Combinations of these structures

The lexicon as a Hash Table
• Store terms in a hash table:
  – Map the M terms onto a set of m integer values, 0 <= h(t_i) <= m-1 (m << M)
  – A simple hash function for strings: h(t) = (Σ_{i=1..|t|} t_i · w_i) mod m, summing over the characters t_i of t with weights w_i
  – Retrieval time: O(|t|)
  – Cons:
    • collisions vs. space
    • cannot handle wildcards

Tries
• A structure constructed from strings
• The name comes from the word retrieval
• Used for string search in normal text
• Each edge represents a character of the string, including the terminator $
• Leaves contain the keys

TRIE - Non-compact
• Each character is stored on one edge of the tree
• Consider 5 character strings: BIG, BIGGER, BILL, GOOD, GONG
[Figure: non-compact trie over BIG$, BIGGER$, BILL$, GOOD$, GONG$, one character per edge]

TRIE - Properties
• An internal node can have from 1 to d children, where d is the size of the alphabet
• A path from the root of T to an internal node i corresponds to an i-character prefix of a string S
• The height of the tree is the length of the longest string
• If there are S unique strings, T has S external nodes
• Looking up a string of length M is O(M)

Compact Trie
• Remove chains leading to leaves
[Figure: compact trie for the same five strings, with unary chains to leaves collapsed]

PATRICIA (PAT tree)
• Practical Algorithm To Retrieve Information Coded In Alphanumeric — introduced by D.R. Morrison in October 1968
• Collapse all unary nodes
[Figure: PATRICIA tree for the same five strings, with all unary nodes collapsed]

PATRICIA - Comments
• Number of nodes proportional to the number of strings
• Strings on edges become variable length
• Use an auxiliary data structure to actually store the strings
• The trie itself only stores triplets of numbers indicating where in the auxiliary data structure to look

Partially specified query (*)
• Using n-grams
  – E.g., decompose labor into bigrams, marking the begin/end of the word with a special char ($): labor -> $l-la-ab-bo-or-r$
• Answers to lab*r are obtained via $l and la and ab and r$

Use of bigrams
  id  Term           Bigram  Term ids
  1   Abhor          $a      1
  2   Bear           $b      2
  3   Laaber         $l      3,4,5,6,7
  4   Labor          $s      8
  5   Laborator      aa      3
  6   Labour         ab      1,3,4,5,6,7,8
  7   Lavacaber      bo      4,5,6
  8   Slab           la      3,4,5,6,7,8
                     or      1,4,5
                     ou      6
                     ra      5
                     ry      5
                     r$      1,2,3,4,5,6,7
                     sl      8

Problems with bi-grams
• The query $l and la and ab and r$ might produce false matches:
  – In the example, it will also match laaber and lavacaber (from the TREC collection)
  – Why: all the bigrams are present, but not consecutively
• What about labrador?
  – A right answer, but not what was intended
  – Wildcards with n-grams are risky!
• False matches can be checked with an additional exact pattern matcher, as in the sketch below.
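A sketch of bigram-based wildcard lookup over the example lexicon, including the exact-pattern post-filter; note how laaber and lavacaber survive the bigram intersection but are removed by the filter:

```python
# A sketch of wildcard lookup via bigrams, with the post-filtering step
# the slides call for (an exact pattern match to weed out false matches).
import re

terms = ["abhor", "bear", "laaber", "labor", "laborator",
         "labour", "lavacaber", "slab"]

def bigrams(term):
    padded = "$" + term + "$"                # mark begin/end of word
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

index = {}                                   # bigram -> set of terms
for t in terms:
    for bg in bigrams(t):
        index.setdefault(bg, set()).add(t)

def wildcard(pattern):                       # e.g. "lab*r"
    needed = set()
    for piece in ("$" + pattern + "$").split("*"):
        needed |= {piece[i:i + 2] for i in range(len(piece) - 1)}
    candidates = set(terms)
    for bg in needed:
        candidates &= index.get(bg, set())
    # post-filter: bigram containment is necessary, not sufficient
    rx = re.compile("^" + pattern.replace("*", ".*") + "$")
    return sorted(c for c in candidates if rx.match(c))

print(wildcard("lab*r"))   # ['labor', 'laborator', 'labour'] -- no laaber
```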
Can we support wildcards with a TRIE?
• It is easy to answer queries with wildcard tails
  – E.g., lab*, labor*, …
• What about lab*r or *bour?
• Use a rotated lexicon
  – Store all rotations: $labor, labor$, abor$l, bor$la, or$lab, r$labo
• To search for lab*r
  – rotate until the wildcard is at the end: r$lab*, then
  – search for strings that have this prefix

Rotated Lexicon
  id  Term         Rotated      Add    Rotated      Add    Rotated      Add
  1   Abhor        $abhor       (1,0)  aaber$l      (3,2)  r$abho       (1,5)
  2   Bear         $bear        (2,0)  abhor$       (1,1)  r$bea        (2,4)
  3   Laaber       $laaber      (3,0)  aber$la      (3,3)  r$laabe      (3,6)
  4   Labor        $labor       (4,0)  abor$l       (4,2)  r$labo       (4,5)
  5   Laborator    $laborator   (5,0)  aborator$l   (5,2)  r$laborato   (5,9)
  6   Labour       $labour      (6,0)  abour$l      (6,2)  r$labou      (6,6)
  7   Lavacaber    $lavacaber   (7,0)  aber$lavac   (7,6)  r$lavacabe   (7,9)
  8   Slab         $slab        (8,0)  ab$sl        (8,3)  slab$        (8,1)

What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
  – Example: Doc 3 (t17, t36), Doc 13 (t3, t45)

How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents, if an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!

Index size as a fraction of the text (stop words not indexed / all words indexed):

  Indexing                Small collection  Medium collection  Large collection
                          (1 MB)            (200 MB)           (2 GB)
  Addressing words        45% / 73%         36% / 64%          35% / 63%
  Addressing documents    19% / 26%         18% / 32%          26% / 47%
  Addressing 64k blocks   27% / 41%         18% / 32%          5% / 9%
  Addressing 256 blocks   18% / 25%         1.7% / 2.4%        0.5% / 0.7%

  – Word addressing: full inversion (all words, exact positions, 4-byte pointers)
  – Document addressing: document size 10 KB; 1, 2, or 3 bytes per pointer, depending on text size
  – Block addressing: 2 or 1 byte(s) per pointer, independent of the text size

Inverted File Compression
• Key idea: each postings list can be stored as an ascending sequence of integers
• Example: if t occurs in 8 documents, list the postings in ascending doc id
  – <8; 3, 5, 20, 21, 23, 76, 77, 78>
• Storing the gaps instead changes this to <8; 3, 2, 15, 1, 2, 53, 1, 1>
• More formally, if t appears in n_t docs:
  – <n_t; d_1, d_2, …, d_{n_t}>, where d_k < d_{k+1}
  – The postings can be equivalently stored as an initial position followed by a list of d-gaps: d_{k+1} - d_k
• No information loss, but … where does the saving come from?
  – The largest d-gap in the second representation is potentially the same as the largest document id in the first
  – If there are N documents and a fixed binary encoding is used, both methods require log N bits per stored pointer…

Intuition
• Frequent words have d-gaps significantly smaller than rare words
• Variable-length representations are therefore used, in which small values are more likely and are coded more economically than large ones
• Various methods exist, based on the probability distribution of the gap values; one common scheme is sketched below.
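A sketch of d-gap storage combined with one concrete variable-length code (variable-byte); the slides leave the actual code unspecified, so v-byte here is an illustrative choice:

```python
# A sketch of d-gap compression with variable-byte coding: small gaps get
# short (here: one-byte) codes, instead of a fixed 4 bytes per doc id.
def vbyte_encode(gaps):
    out = bytearray()
    for g in gaps:
        while g >= 128:
            out.append(g & 0x7F)             # low 7 bits
            g >>= 7
        out.append(g | 0x80)                 # high bit marks the last byte
    return bytes(out)

def to_gaps(doc_ids):                        # ascending ids -> d-gaps
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

postings = [3, 5, 20, 21, 23, 76, 77, 78]    # the slide's example list
gaps = to_gaps(postings)                     # [3, 2, 15, 1, 2, 53, 1, 1]
print(gaps, len(vbyte_encode(gaps)), "bytes")  # 8 bytes, vs. 32 uncompressed
```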
Bitmaps
• For every term in the lexicon, a bitvector is stored, with each bit corresponding to a document: set to 1 if the term appears in the document, 0 otherwise
• Thus, if the term “pot” appears in documents 2 and 5, the corresponding bits are set:
  – pot -> 010010
• Bitmaps are efficient for Boolean queries
  – Answers to a query are obtained by simply combining bitvectors with Boolean operators

Example of bitmaps
  term      bitvector
  cold      100100
  days      001001
  hot       100100
  in        010010
  it        010010
  like      000110
  nine      001001
  old       001001
  pease     110000
  porridge  110000
  pot       010010
  some      000110
  the       010010

  Query: “some” AND “pot”
  Answer: 000110 AND 010010 yields 000010 — d5 is the unique answer

Signature Files

What Are Signature Files?
• (Not the “signatures” option in your e-mail program, which appends a person’s or business’s contact information to the end of a message — those are also known as “sig. files”, and a free opportunity to advertise.)

Signature Files
• Signature files were popular in the past because they are implicitly compressed (and take less space than uncompressed inverted files)
• A signature file is a probabilistic method for indexing text: each document has an associated signature or descriptor, a string of bits that captures the content of the document
• Not a new idea…

Word signatures
• Assume that words have 16-bit descriptors, constructed by taking 3 hash functions generating values between 1…16 and setting the corresponding bits in the 16-bit descriptor
• Example: if the 3 hash values for cold are 3, 5, 16, then
  – cold -> 1000 0000 0010 0100
• It is very unlikely, but not impossible, that the signatures of 2 words are identical

Example for a given lexicon
  term      hash string
  cold      1000 0000 0010 0100
  days      0010 0100 0000 1000
  hot       0000 1010 0000 0000
  in        0000 1001 0010 0000
  it        0000 1000 1000 0010
  like      0100 0010 0000 0001
  nine      0010 1000 0000 0100
  old       1000 1000 0100 0000
  pease     0100 0100 0010 0000
  porridge  0100 0100 0010 0000
  pot       0000 0010 0110 0000
  some      0100 0100 0000 0001
  the       1010 1000 0000 0000

Document signatures
• The signatures of all words in a given document are superimposed: they are ORed to form the document signature
• Example, document 6: “nine days old”
  nine  0010 1000 0000 0100
  days  0010 0100 0000 1000
  old   1000 1000 0100 0000
  d6    1010 1100 0100 1100
• To test whether a word is in a document, calculate its signature. If the corresponding bits are set in the document signature, there is a high probability that the document contains the word (sketched below).

False hits
• Reduce false hits by increasing the number of bits set for each term and increasing the length of the signature
• Yet one still needs to fetch the document at query time and scan/parse/stem it to check that the term really occurs: this costs time
• Signatures are effective when queries are long (lower probabilities of false hits)

Example of false hits
  Doc  Signature             Text
  1    1100 1111 0010 0101   Pease porridge hot, pease porridge cold.
  2    1110 1111 0110 0001   Pease porridge in the pot.
  3    1010 1100 0100 1100   Nine days old.
  4    1100 1110 1010 0111   Some like it hot, some like it cold.
  5    1110 1111 1110 0011   Some like it in the pot.
  6    1010 1100 0100 1100   Nine days old.

  Query term                 Signature matches   Real matches?
  cold  1000 0000 0010 0100  1 and 4             1 and 4
  old   1000 1000 0100 0000  2, 3, 5 and 6       3 and 6

Merits of Signature Files
• Faster than full text scanning
  – 1 or 2 orders of magnitude faster
  – But… still linear in the collection size
• Modest space overhead
  – 10-15% vs. 50-300% (inversion)
• Insertions can be handled easily
  – append only
  – no reorganization or rewriting
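A sketch of superimposed coding under illustrative assumptions (16-bit signatures, 3 hash functions derived from MD5; the slides’ actual hash functions are not given):

```python
# A sketch of superimposed coding: 16-bit word signatures from 3 hash
# functions, OR-ed together into a document signature.
import hashlib

BITS, K = 16, 3                              # signature width, hashes per word

def word_signature(word):
    sig = 0
    for i in range(K):                       # K independent hash functions
        h = hashlib.md5(f"{i}:{word}".encode()).digest()
        sig |= 1 << (h[0] % BITS)
    return sig

def doc_signature(words):
    sig = 0
    for w in words:                          # superimpose (bitwise OR)
        sig |= word_signature(w)
    return sig

d6 = doc_signature(["nine", "days", "old"])
query = word_signature("old")
# All query bits set in the doc signature => candidate match (maybe a false
# hit); the document must still be fetched and checked.
print((d6 & query) == query)                 # True
```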
Comparison of indexing methods
• Bitmaps consume an order of magnitude more storage than inverted files and signature files
• Signature files require unnecessary accesses to the main text because of false matches
• Signature files’ main advantage: no in-memory lexicon
• Compressed inverted files are the state-of-the-art index structure used by most search engines

Search Engines

Searches Per Day
[Figure: searches per day at the major engines in 2000 and in 2001; information from searchenginewatch.com]

Challenges for Web Searching
• Distributed data
• Volatile data/“freshness”: 40% of the web changes every month
• Exponential growth
• Unstructured and redundant data: 30% of web pages are near-duplicates
• Unedited data
• Multiple formats
• Commercial biases
• Hidden data
(From the description of the FAST search engine by Knut Risvik, http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm)

Standard Web Search Engine Architecture
[Figure: crawlers fetch the web; documents are checked for duplicates and stored; DocIds feed the creation of an inverted index on the search engine servers; user queries are run against the inverted index and the results are shown to the user. A more detailed architecture appears in Brin & Page 98, which only covers the preprocessing in detail, not the query serving.]

Indexes for Web Search Engines
• Inverted indexes are still used, even though the web is so huge
• Some systems partition the indexes across different machines
  – Each machine handles a different part of the data
• Other systems duplicate the data across many machines
  – Queries are distributed among the machines
• Most do a combination of these
• Example (FAST): the data for the pages is partitioned across machines, and each partition is additionally allocated multiple machines to handle the queries
  – Each row can handle 120 queries per second
  – Each column can handle 7M pages
  – To handle more queries, add another row

Querying: Cascading Allocation of CPUs
• A variation on this that produces a cost savings:
  – Put high-quality/common pages on many machines
  – Put lower-quality/less common pages on fewer machines
  – A query goes to the high-quality machines first
  – If no hits are found there, it goes on to the other machines

Google
• Google maintains the world’s largest Linux cluster (10,000 servers)
• These are partitioned between index servers and page servers
  – Index servers resolve the queries (massively parallel processing)
  – Page servers deliver the results of the queries
• Over 3 billion web pages are indexed and served by Google

Web Page Ranking
• Varies by search engine
  – Pretty messy in many cases
  – Details usually proprietary and fluctuating
• Combines subsets of:
  – Term frequencies
  – Term proximities
  – Term position (title, top of page, etc.)
  – Term characteristics (boldface, capitalized, etc.)
  – Link analysis information
  – Category information
  – Popularity information

Ranking: Link Analysis
• Assumptions:
  – If the pages pointing to this page are good, then this is also a good page
  – The words on the links pointing to this page are useful indicators of what this page is about
  – References: Page et al. 98, Kleinberg 98
• Why does this work?
  – The official Toyota site will be linked to by lots of other official (or high-quality) sites
  – The best Toyota fan-club site probably also has many links pointing to it
  – Less high-quality sites do not have as many high-quality sites linking to them
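The slides cite Page et al. 98 without giving the algorithm; as one concrete instance of link analysis, here is a minimal PageRank-style power iteration over a toy graph (the site names and parameters are made up for illustration):

```python
# A minimal PageRank-style power iteration: a page's rank is the damped sum
# of rank shares passed along by the pages that link to it.
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs) if outs else 0.0
            for q in outs:                   # p passes rank to pages it links to
                new[q] += damping * share
        rank = new
    return rank

links = {"toyota.example": ["fanclub"], "fanclub": ["toyota.example"],
         "spam": ["toyota.example"]}         # nobody links to "spam"
print(sorted(pagerank(links).items(), key=lambda x: -x[1])[0])
```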
Web Crawling
• How do the web search engines get all of the items they index?
• Main idea:
  – Start with known sites
  – Record information for these sites
  – Follow the links from each site
  – Record information found at new sites
  – Repeat

Web Crawlers
• More precisely:
  – Put a set of known sites on a queue
  – Repeat the following until the queue is empty:
    • Take the first page off of the queue
    • If this page has not yet been processed:
      – Record the information found on this page (positions of words, links going out, etc.)
      – Add each link on the current page to the queue
      – Record that this page has been processed
  – Depth-first/breadth-first/…

Sites Are Complex Graphs, Not Just Trees
[Figure: six sites, each with several pages; links run both within and between sites, forming a graph with cycles rather than a tree]

Web Crawling Issues
• Keep-out signs
  – A file called robots.txt tells the crawler which directories are off limits
• Freshness
  – Figure out which pages change often, and recrawl these often
• Duplicates, virtual hosts, etc.
  – Convert page contents with a hash function
  – Compare new pages to the hash table
• Lots of problems
  – Server unavailable
  – Incorrect html
  – Missing links
  – Infinite loops
• Web crawling is difficult to do robustly!

XML, IR, … databases?

What’s XML?
• W3C Standard since 1998
  – Subset of SGML (ISO Standard Generalized Markup Language)
• Data-description markup language
  – whereas HTML is a text-rendering markup language
• De facto format for data exchange on the Internet
  – Electronic commerce
  – Business-to-business (B2B) communication

XML Data Model Highlights
• Tagged elements describe the semantics of the data
  – Easier to parse for a machine and for a human
• An element may have attributes
• An element can contain nested sub-elements
• Sub-elements may themselves be tagged elements or character data
• Usually considered a tree structure … but really a graph!

An XML Document
<?xml version="1.0"?>
<!DOCTYPE sigmodRecord SYSTEM "sigmodRecord.dtd">
<sigmodRecord>
  <issue>
    <volume>1</volume>
    <number>1</number>
    <articles>
      <article>
        <title>XML Research Issues</title>
        <initPage>1</initPage>
        <endPage>5</endPage>
        <authors>
          <author AuthorPosition="00">Tom Hanks</author>
        </authors>
      </article>
    </articles>
  </issue>
</sigmodRecord>

XPATH Example
• List the titles of articles in which the author is “Tom Hanks”:
  //article[//author="Tom Hanks"]/title
• Find the titles of articles authored by “Tom Hanks” in volume 1:
  //issue[volume="1"]/articles/article[//author="Tom Hanks"]/title

XQuery: Example
• List the titles of the articles authored by “Tom Hanks”:
  for $b IN document("sigmodRecord.xml")//article
  where $b//author = "Tom Hanks"
  return <title>$b/title.text()</title>
• Query result:
  <title>XML Research Issues</title>
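A sketch of running the first XPath query, assuming the lxml library is available (any XPath 1.0 engine would do); note that inside a predicate the slide’s //author is written here as .//author so that it means “a descendant of this article”:

```python
# A sketch of the Tom Hanks title query against the sigmodRecord example.
from lxml import etree

doc = etree.fromstring("""
<sigmodRecord><issue><volume>1</volume><number>1</number><articles>
  <article><title>XML Research Issues</title><initPage>1</initPage>
    <endPage>5</endPage>
    <authors><author AuthorPosition="00">Tom Hanks</author></authors>
  </article>
</articles></issue></sigmodRecord>""")

titles = doc.xpath('//article[.//author="Tom Hanks"]/title/text()')
print(titles)                                # ['XML Research Issues']
```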
Data- vs. Document-centric
• Data-centric
  – Highly structured
  – Usage stems from the exchange of data from databases
  – Example:
    <site>
      <item ID="I001">
        <name>Chair</name>
        <description>This chair is in good condition ...</description>
      </item>
      <item ID="I002">
        <name>Table</name>
        <description>...</description>
      </item>
      ...
    </site>
• Document-centric
  – Semi-structured
  – Embedded tags
  – Full-text search important
  – Usage stems from the exchange of formatted text
  – Example:
    <memo>
      <author>John Doe</author>
      <title>...</title>
      <body>
        This memo is meant for all persons responsible for
        <list bullets=1>
          <item>either <em>customers</em> abroad,
          <item>or <em>suppliers</em> abroad.
        </list>
        ...
      </body>
    </memo>

Classes of XML Documents
• Structured
  – “Un-normalized” relational data
  – Ex: product catalogs, inventory data, medical records, network messages, logs, stock quotes
• Mixed
  – Structured data embedded in large text fragments
  – Ex: on-line manuals, transcripts, tax forms

Queries on Mixed Data
• Full-text search operators
  – Ex: Find all <bill>s where "striking" & "amended" are within 6 intervening words
• Queries on structure & text
  – Ex: Return the <text> element containing both "exemption" & "social security", and the preceding & following <text> elements
• Queries that span (ignore) structure
  – Ex: Return the <bill> that contains “referred to the Committee on Financial Services”

XML Indexing
[Figure: an example document tree with elements A, B, C, D, E and text nodes T1–T4, its schema, and the corresponding node index and word index. The node index records, per element occurrence, an ID with its start position (S), end position (E), and parent (P); the word index records the position of each text word.]

Text Region Algebras
• Example operations on the node index: contains(A, nodeindex) and contained-by(D, nodeindex)
[Figure: the example tree with the regions selected by contains(A, nodeindex) and by contained-by(D, nodeindex) highlighted]

Using Region Algebras
• roots := contains and not contained-by
• leafs := contained-by and not contains
• inner := diff(diff(nodes, leafs), roots)

XML Region Operators - 1
• Containment
  – contains and contained-by
• Length
  – length (including markup) and text-length (no markup)
• Other possibilities include…
  – Range or distance — enabling references
  – Overlaps
  – (Minimal) bounding region (direct parent)

XML Region Operators - 2
• Transitive closure
  – Containment: a select on the word index with a known lower and upper bound
• Regeneration of the XML document
  – Containment on the node index
  – Containment on the word index
  – Union of the node and word subsets

Inverted Lists for XML Documents
• Example document, with token positions superscripted:

  1<section> 2<title> 3Information 4Retrieval 5Using 6RDBMS 7</title>
    8<section> 9<title> 10Beyond 11Simple 12Translation 13</title>
      14<section> 15<title> 16Extension 17of 18IR 19Features 20</title>
      21</section>
    22</section>
  23</section>

• Element index (docno, begin:end, level):
  <section>  (1, 1:23, 0)  (1, 8:22, 1)  (1, 14:21, 2)
  <title>    (1, 2:7, 1)   (1, 9:13, 2)  (1, 15:20, 3)
• Text index (docno, wordno, level):
  “information”  (1, 3, 2)
  “retrieval”    (1, 4, 2)
• These lists support containment, direct containment, tight containment, and proximity.

Inverted Lists to Relational Tables
• The ELEMENTS table stores the element inverted lists, and the TEXTS table the word inverted lists:

  ELEMENTS: term     docno  begin  end  level
            section  1      1      23   0
            section  1      8      22   1
            section  1      14     21   2
            …

  TEXTS:    term         docno  wordno  level
            information  1      3       2
            …
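A sketch of answering E // 'T' (containment) directly on the example inverted lists, before translating to SQL:

```python
# Containment: element entries are (docno, begin, end, level) and text
# entries are (docno, wordno, level), as in the <section> example above.
elements = {"section": [(1, 1, 23, 0), (1, 8, 22, 1), (1, 14, 21, 2)],
            "title":   [(1, 2, 7, 1), (1, 9, 13, 2), (1, 15, 20, 3)]}
texts = {"information": [(1, 3, 2)], "retrieval": [(1, 4, 2)]}

def contains(elem, word):
    """All elem occurrences whose (begin, end) region spans the word."""
    hits = []
    for docno, begin, end, _level in elements.get(elem, []):
        for tdoc, wordno, _tlevel in texts.get(word, []):
            if docno == tdoc and begin < wordno < end:
                hits.append((docno, begin, end))
    return hits

print(contains("title", "information"))      # [(1, 2, 7)] -- the first <title>
```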
Inverted List Operations to SQL Translation

-- E // 'T' (containment)
select *
from   ELEMENTS e, TEXTS t
where  e.term = 'E' and t.term = 'T'
and    e.docno = t.docno
and    e.begin < t.wordno and t.wordno < e.end

-- E / 'T' (direct containment)
select *
from   ELEMENTS e, TEXTS t
where  e.term = 'E' and t.term = 'T'
and    e.docno = t.docno
and    e.begin < t.wordno and t.wordno < e.end
and    e.level = t.level - 1

-- E = 'T' (tight containment)
select *
from   ELEMENTS e, TEXTS t
where  e.term = 'E' and t.term = 'T'
and    e.docno = t.docno
and    t.wordno = e.begin + 1
and    e.end = t.wordno + 1

-- distance("T1", "T2") <= n (proximity)
select *
from   TEXTS t1, TEXTS t2
where  t1.term = 'T1' and t2.term = 'T2'
and    t1.docno = t2.docno
and    t2.wordno > t1.wordno
and    t2.wordno <= t1.wordno + n

Experimental Setup: Workload
• Data sets:

  Data set                       Shakespeare  DBLP       Synthetic
  Size of documents              8 MB         53 MB      207 MB
  Inverted index size            11 MB        78 MB      285 MB
  Relational table size          15 MB        121 MB     566 MB
  Number of distinct elements    22           598        715
  Number of distinct text words  22,825       250,657    100,000
  Total number of elements       179,726      1,595,010  4,999,500
  Total number of text words     474,057      3,655,148  19,952,000

• 13 micro-benchmark queries:
  – Of the form: ‘elem’ contains ‘word’
  – Vary ‘elem’ and ‘word’ over different frequencies, to check the sensitivity of performance w.r.t. selectivity

Number of Term Occurrences
  Query  term1 frequency  term2 frequency  result rows
  QS1    90               277              2
  QS2    107,833          277              36
  QS3    107,833          3,231            1,543
  QS4    107,833          1                1
  QD1    654              55               13
  QD2    4,188            712              672
  QD3    287,513          6,363            6,315
  QD4    287,513          3                3
  QG1    50               1,000            809
  QG2    134,900          55,142           1,470
  QG3    701,000          165,424          21,936
  QG4    50               82,712           12
  QG5    701,000          17               4

DB2/Inv. List Engine Performance Ratios
• We want to know why the RDBMS:
  – sometimes performs much better
  – usually performs much worse
• Two significant factors: the join algorithm and cache utilization

Why Does RDBMS Sometimes Perform Better?
• For E // 'T' (the containment query above), the inverted list engine always merges the full ‘E’ and ‘T’ lists, while the RDBMS can use an index nested-loop join: when one term is very rare, probing the index on ELEMENTS (or TEXTS) a few times costs less I/O than reading both complete lists, at the price of more CPU per probe.

MPMGJN vs. Standard Merge Join
• The multi-predicate merge join (MPMGJN) of the inverted list engine merges on all three predicates: d1 = d2, b <= w, w <= e
• The standard merge join of the RDBMS merges on one predicate, d1 = d2, and applies b <= w and w <= e only as additional filtering predicates — so it compares every (element, word) row pair within a document

Number of Row Pairs Compared
  Query  d1=d2, b<=w<=e  d1=d2
  QS1    5               1,653
  QS2    7,131           984,948
  QS3    89,716          10,175,904
  QS4    2,366           3,475
  QD1    503             555
  QD2    4,723           1,315,662
  QD3    263,458         14,082,080
  QD4    1,766           4,950
  QG1    1,000           1,000
  QG2    103,994         148,773,116
  QG3    610,816         2,319,244,480
  QG4    12              82,712
  QG5    56,084          238,340

MPMGJN vs. Index Nested-Loop Join
• Which one wins depends on:
  – the number of index key comparisons vs. record comparisons
  – the cost of index key comparisons vs. record comparisons

RDBMS vs. Special-purpose?
• Recap:
  – RDBMSs have great technology (e.g., B+-trees, optimization, storage)
  – but they are not well-tuned for containment queries using inverted lists
• Open question:
  – use an RDBMS or a special-purpose engine to process containment queries using inverted lists? A sketch of the MPMGJN idea follows below.
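To make the comparison concrete, a sketch of the MPMGJN idea on the small example lists from the figures (docno 5); both inputs are sorted, and the element cursor only advances past regions that can no longer contain any later word:

```python
# A sketch of a multi-predicate merge join for containment: merge on
# (docno, position) and skip regions that end before the current word.
def mpmgjn(elem_list, text_list):
    """elem_list: sorted (docno, begin, end); text_list: sorted (docno, wordno)."""
    results, ei = [], 0
    for docno, wordno in text_list:
        # regions ending before this word cannot contain any later word either
        while ei < len(elem_list) and \
                (elem_list[ei][0], elem_list[ei][2]) < (docno, wordno):
            ei += 1
        for j in range(ei, len(elem_list)):   # scan the remaining candidates
            d, b, e = elem_list[j]
            if (d, b) > (docno, wordno):      # regions now start after the word
                break
            if d == docno and b < wordno < e:
                results.append(((d, b, e), wordno))
    return results

elems = [(5, 7, 20), (5, 14, 19), (5, 21, 28), (5, 22, 27), (5, 29, 31), (5, 32, 40)]
words = [(5, 2), (5, 23), (5, 24), (5, 33), (5, 37), (5, 42)]
print(mpmgjn(elems, words))   # far fewer comparisons than all d1=d2 pairs
```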
Thanks
• Ron Larson
• Marti Hearst
• Doug Oard
• Philip Resnik
• Sriram Raghavan
• Johan List
• Maurice van Keulen
• David Carmel
• Aya Soffer
• …