Google, Web Crawling,
and Distributed Synchronization
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 1, 2008
Administrivia
 Homework due TONIGHT
 Please email me a list of project partners by
THURSDAY
Google Architecture [Brin/Page 98]
 Focus was on scalability to the size of the Web
 First to really exploit Link Analysis
 Started as an academic project @ Stanford; became a startup
 Est. 450K commodity servers in 2006!!!
Google Document Repository
“BigFile” distributed filesystem (GFS)
 Support for 2^64 bytes across multiple drives, filesystems
 Manages its own file descriptors, resources
Repository
 Contains every HTML page (this is the cached page entry), compressed in zlib (faster than bzip)
 Two indices:
 DocID → documents
 URL checksum → DocID
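A rough sketch of the repository and its two indices (the checksum function is an assumption; the paper doesn't name one, so CRC32 here is purely illustrative):

```python
import zlib

class Repository:
    """Toy document repository: zlib-compressed pages plus the two indices."""
    def __init__(self):
        self.docs = {}          # DocID -> compressed HTML (the cached-page entry)
        self.by_checksum = {}   # URL checksum -> DocID

    def add(self, docid, url, html):
        self.docs[docid] = zlib.compress(html.encode())
        self.by_checksum[zlib.crc32(url.encode())] = docid

    def lookup_url(self, url):
        """URL checksum -> DocID -> decompressed cached page."""
        docid = self.by_checksum[zlib.crc32(url.encode())]
        return docid, zlib.decompress(self.docs[docid]).decode()
```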
Lexicon
 The list of searchable words
 (Presumably, today it’s used to
suggest alternative words as well)
 The “root” of the inverted index
 As of 1998, 14 million “words”
 Kept in memory (was 256MB)
 Two parts:
 Hash table of pointers to words and
the “barrels” (partitions) they fall into
 List of words (null-separated)
 Much larger today – may still fit
in memory??
Indices – Inverted and “Forward”
 Inverted index divided into
“barrels” (partitions by range)
 Indexed by the lexicon; for each
DocID, consists of a Hit List of
entries in the document
 Forward index uses the same
barrels
 Used to find multi-word queries
with words in same barrel
 Indexed by DocID, then a list of
WordIDs in this barrel and this
document, then Hit Lists
corresponding to the WordIDs
 Two barrels: short (anchor and
title); full (all text)
[Diagram: the lexicon (293 MB) holds WordID/ndocs entries pointing into the
inverted barrels (41 GB), each a list of (DocID, nhits, hit list) records;
the forward barrels (43 GB total) store, per DocID, a NULL-terminated list
of (WordID, nhits, hit list) entries.]
 Note: barrels are now called “shards”
original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm
Hit Lists (Not Mafia-Related)
 Used in inverted and forward indices
 Goal was to minimize the size – the bulk of data is in
hit entries
 For 1998 version, made it down to 2 bytes per hit (though
that’s likely climbed since then):
Plain:  cap: 1 | font: 3 | position: 12
vs.
Fancy:  cap: 1 | font: 7 | type: 4 | position: 8
Anchor (position special-cased to): cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
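The 2-byte plain-hit layout above can be sketched with bit packing (field widths from the layout; the helper names are ours, not Google's):

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 16 bits: cap:1 | font:3 | position:12."""
    assert 0 <= font < 8 and 0 <= position < 4096
    return ((cap & 1) << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    """Recover the (cap, font, position) fields from a packed hit."""
    return (hit >> 15) & 1, (hit >> 12) & 7, hit & 0xFFF
```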
Google’s Distributed Crawler (in 1998)
 Single URL Server – the coordinator
 A queue that farms out URLs to crawler nodes
 Implemented in Python!
 Crawlers have 300 open connections apiece
 Each needs own DNS cache – DNS lookup is major
bottleneck
 Based on asynchronous I/O (which is now supported in
Java)
 Many caveats in building a “friendly” crawler (we’ll
talk about some momentarily)
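The per-crawler DNS cache mentioned above can be sketched as follows (the injectable resolver is just for illustration; a real crawler would wrap its own resolver this way):

```python
import socket

class DNSCache:
    """Per-crawler DNS cache: resolve each hostname once, reuse thereafter,
    so DNS lookup stops being the bottleneck on 300 open connections."""
    def __init__(self, resolver=socket.gethostbyname):
        self._resolve = resolver
        self._cache = {}

    def lookup(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]
```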
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
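Steps 3-4 — scanning sorted doclists until a document matches every term — amount to a multi-way merge. A sketch (not Google's code, just the technique):

```python
def intersect_doclists(doclists):
    """Return DocIDs present in every sorted doclist, advancing the lagging
    list toward the current maximum (the 'scan until all match' step)."""
    if not doclists:
        return []
    iters = [iter(dl) for dl in doclists]
    current = []
    for it in iters:
        d = next(it, None)
        if d is None:
            return []          # an empty doclist means no match is possible
        current.append(d)
    matches = []
    while True:
        target = max(current)
        for i, it in enumerate(iters):
            while current[i] < target:
                d = next(it, None)
                if d is None:
                    return matches   # one list exhausted: done
                current[i] = d
        if all(c == current[0] for c in current):
            matches.append(current[0])
            d = next(iters[0], None)  # move past the match
            if d is None:
                return matches
            current[0] = d
```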
Ranking in Google
Considers many types of information:
 Position, font size, capitalization
 Anchor text
 PageRank
 Count of occurrences (basically, TF) in a way that tapers off
 (Not clear if they do IDF?)
Multi-word queries consider proximity also
Google’s Resources
In 1998:
 24M web pages
 About 55GB data w/o repository
 About 110GB with repository
 Lexicon 293MB
Worked quite well with low-end PC
By 2006, > 25 billion pages, 400M queries/day:
 Don’t attempt to include all barrels on every machine!
 e.g., 5+TB repository on special servers separate from index servers
 Many special-purpose indexing services (e.g., images)
 Much greater distribution of data (2008, ~450K PCs?), huge net BW
 Advertising needs to be tied in (100,000 advertisers claimed)
Web Crawling
 You’ve already built a basic crawler
 First, some politeness criteria
 Then we’ll look at the Mercator distributed crawler
as an example
Basic Crawler (Spider/Robot) Process
 Start with some initial page P0
 Collect all URLs from P0 and add to the crawler queue
 Consider <base href> tag, anchor links, optionally image links, CSS,
DTDs, scripts
 Considerations:
 What order to traverse (polite to do BFS – why?)
 How deep to traverse
 What to ignore (coverage)
 How to escape “spider traps” and avoid cycles
 How often to crawl
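The breadth-first traversal with cycle avoidance and a depth cap can be sketched as follows (`get_links` stands in for fetching a page and extracting its URLs):

```python
from collections import deque

def bfs_crawl(start, get_links, max_depth=2):
    """BFS crawl: the visited set breaks cycles; the depth cap bounds
    how deep we traverse."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue             # don't expand links beyond the depth cap
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```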
Essential Crawler Etiquette
 Robot exclusion protocols
 First, ignore pages with:
<META NAME="ROBOTS" CONTENT="NOINDEX">
 Second, look for robots.txt at root of web server
 See http://www.robotstxt.org/wc/robots.html
 To exclude all robots from a server:
User-agent: *
Disallow: /
 To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
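Python's standard `urllib.robotparser` applies exactly these exclusion rules; for example:

```python
from urllib.robotparser import RobotFileParser

# The two-directory exclusion from the slide above.
robots_txt = """\
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# /news/ is off-limits to BobsCrawler; everything else is allowed.
assert not rp.can_fetch("BobsCrawler", "http://example.com/news/today.html")
assert rp.can_fetch("BobsCrawler", "http://example.com/about.html")
```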
Mercator
 A scheme for building a distributed web crawler
 Expands a “URL frontier”
 Avoids re-crawling same URLs
 Also considers whether a document has been seen
before
 Every document has signature/checksum info computed as
it’s crawled
Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
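The steps above can be sketched as a loop (a toy version: `fetch` and `extract_links` are injected, and MD5 stands in for Mercator's actual fingerprint scheme):

```python
import hashlib
from collections import deque

def crawl(seeds, fetch, extract_links):
    """Mercator-style loop: skip URLs already on the frontier and documents
    whose content fingerprint was seen before (e.g. the same page reached
    through an aliased hostname)."""
    frontier = deque(seeds)
    url_seen = set(seeds)
    fingerprints = set()
    fetched = []
    while frontier:
        url = frontier.popleft()         # 1. dequeue frontier URL
        body = fetch(url)                # 2. fetch document
        fp = hashlib.md5(body.encode()).hexdigest()
        if fp in fingerprints:           # 4. content seen before: skip
            continue
        fingerprints.add(fp)
        fetched.append(url)
        for link in extract_links(url):  # 5./6. extract and filter links
            if link not in url_seen:     # 7. URL-repeat check
                url_seen.add(link)
                frontier.append(link)    # 8. enqueue URL
    return fetched
```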
Mercator’s Polite Frontier Queues
 Tries to go beyond breadth-first approach – want to
have only one crawler thread per server
 Distributed URL frontier queue:
 One subqueue per worker thread
 The worker thread is determined by hashing the
hostname of the URL
 Thus, only one outstanding request per web server
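Hashing the hostname to pick the subqueue might look like this (SHA-1 as the hash is an assumption; any stable hash of the hostname works):

```python
import hashlib
from urllib.parse import urlparse

def assign_queue(url, num_workers):
    """Route a URL to a worker subqueue by hashing its hostname, so every
    request to a given server comes from the same worker thread."""
    host = urlparse(url).hostname
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers
```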
Mercator’s HTTP Fetcher
 First, needs to ensure robots.txt is followed
 Caches the contents of robots.txt for various web sites as
it crawls them
 Designed to be extensible to other protocols
 Had to write their own HTTP requester in Java – the Java version of the time didn’t support timeouts
 Today, can use setSoTimeout()
 Can also use Java non-blocking I/O if you wish:
 http://www.owlmountain.com/tutorials/NonBlockingIo.htm
 But they use multiple threads and synchronous I/O
Other Caveats
 Infinitely long URL names (good way to get a buffer
overflow!)
 Aliased host names
 Alternative paths to the same host
 Can catch most of these with signatures of
document data (e.g., MD5)
 Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
 May need to have a way for human to override certain
URL paths – see Section 5 of paper
Mercator Document Statistics
PAGE TYPE     PERCENT
text/html     69.2%
image/gif     17.9%
image/jpeg     8.1%
text/plain     1.5%
pdf            0.9%
audio          0.4%
zip            0.4%
postscript     0.3%
other          1.4%
[Histogram of document sizes (60M pages) omitted]
Further Considerations
 May want to prioritize certain pages as being most
worth crawling
 Focused crawling tries to prioritize based on relevance
 May need to refresh certain pages more often
Web Search Summarized
 Two important factors:
 Indexing and ranking scheme that allows most relevant
documents to be prioritized highest
 Crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
 We’ll be using Pastry to distribute the work of
crawling and to distribute the data (what Google
calls “barrels”)
Synchronization
The issue:
 What do we do when decisions need to be made in a
distributed system?
 e.g., Who decides what action should be taken? Whose
conflicting option wins? Who gets killed in a deadlock? Etc.
Some options:
 Central decision point
 Distributed decision points with locking/scheduling
(“pessimistic”)
 Distributed decision points with arbitration (“optimistic”)
 Transactions (to come)
Multi-process Decision Arbitration:
Mutual Exclusion (i.e., Locking)
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
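A minimal sketch of the coordinator's grant/queue/release logic (message passing replaced by method calls for brevity):

```python
from collections import deque

class Coordinator:
    """Central lock manager: grant if free, queue the request otherwise,
    and pass the grant on at release (the deferred reply to process 2)."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()
        self.grants = []              # log of OK messages sent

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grants.append(pid)   # immediate OK
        else:
            self.waiting.append(pid)  # no reply yet

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.grants.append(self.holder)  # deferred OK
```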
Distributed Locking: Still Based on a
Central Arbitrator
a) Two processes want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
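The arbitration rule above — lowest (timestamp, pid) wins — is the heart of the receive handler in this Ricart-Agrawala-style scheme; a sketch:

```python
def should_reply_ok(my_state, my_timestamp, req_timestamp, req_pid, my_pid):
    """Reply OK unless we are inside the critical region, or we also want
    it and our (timestamp, pid) pair is lower (i.e., we win the tie)."""
    if my_state == "HELD":
        return False                      # defer until we leave
    if my_state == "WANTED":
        return (req_timestamp, req_pid) < (my_timestamp, my_pid)
    return True                           # RELEASED: not interested, OK
```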
Eliminating the Request/Response:
A Token Ring Algorithm
a) An unordered group of processes on a network.
b) A logical ring constructed in software.
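Token circulation can be simulated in a few lines (`wants_entry` stands in for each process's local decision to enter the critical region):

```python
def token_ring_entries(ring, wants_entry, rounds=1):
    """Circulate the token around the logical ring; a process may enter
    the critical region only while it holds the token, then passes it on."""
    entries = []
    for _ in range(rounds):
        for pid in ring:           # token moves pid -> next in ring order
            if wants_entry(pid):
                entries.append(pid)
    return entries
```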
Comparison of Costs
Algorithm     Messages per entry/exit   Delay before entry (in message times)   Problems
Centralized   3                         2                                       Coordinator crash
Distributed   2(n–1)                    2(n–1)                                  Crash of any process
Token ring    1 to ∞                    0 to n–1                                Lost token, process crash
Electing a Leader: The Bully Algorithm
Is Won By Highest ID
Suppose the old leader dies…
 Process 4 needs a decision; it runs an election
 Processes 5 and 6 respond, telling 4 to stop
 Now 5 and 6 each hold an election
Totally-Ordered Multicasting:
Arbitrates via Priority
 Without a total order, replicas can apply the same updates in different orders, leaving a replicated database in an inconsistent state.
Another Approach: Time
 We can choose the decision that happened first
 … But how do we know what happened first?