Google, Web Crawling,
and Distributed Synchronization
Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 – Internet and Web Systems
April 1, 2008
Administrivia
 Homework due TONIGHT
 Please email me a list of project partners by
THURSDAY
Google Architecture [Brin/Page 98]
 Focus was on scalability to the size of the Web
 First to really exploit Link Analysis
 Started as an academic project @ Stanford; became a startup
 Est. 450K commodity servers in 2006!!!
Google Document Repository
“BigFile” distributed filesystem (GFS)
 Support for 2^64 bytes across multiple drives, filesystems
 Manages its own file descriptors, resources
Repository
 Contains every HTML page (this is the cached page entry), compressed in zlib (faster than bzip)
 Two indices:
 DocID → documents
 URL checksum → DocID
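A rough sketch of the repository and its two indices (the checksum function is an assumption; the paper doesn't name one, so CRC32 here is purely illustrative):

```python
import zlib

class Repository:
    """Toy document repository: zlib-compressed pages plus the two indices."""
    def __init__(self):
        self.docs = {}          # DocID -> compressed HTML (the cached-page entry)
        self.by_checksum = {}   # URL checksum -> DocID

    def add(self, docid, url, html):
        self.docs[docid] = zlib.compress(html.encode())
        self.by_checksum[zlib.crc32(url.encode())] = docid

    def lookup_url(self, url):
        """URL checksum -> DocID -> decompressed cached page."""
        docid = self.by_checksum[zlib.crc32(url.encode())]
        return docid, zlib.decompress(self.docs[docid]).decode()
```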
Lexicon
 The list of searchable words
 (Presumably, today it’s used to
suggest alternative words as well)
 The “root” of the inverted index
 As of 1998, 14 million “words”
 Kept in memory (was 256MB)
 Two parts:
 Hash table of pointers to words and
the “barrels” (partitions) they fall into
 List of words (null-separated)
 Much larger today – may still fit
in memory??
Indices – Inverted and “Forward”
 Inverted index divided into
“barrels” (partitions by range)
 Indexed by the lexicon; for each
DocID, consists of a Hit List of
entries in the document
 Forward index uses the same
barrels
 Used to find multi-word queries
with words in same barrel
 Indexed by DocID, then a list of
WordIDs in this barrel and this
document, then Hit Lists
corresponding to the WordIDs
 Two barrels: short (anchor and
title); full (all text)
[Diagram: the lexicon (293 MB) holds WordID/ndocs entries pointing into the
inverted barrels (41 GB), each a list of (DocID, nhits, hit list) records;
the forward barrels (43 GB total) store, per DocID, a NULL-terminated list
of (WordID, nhits, hit list) entries.]
 Note: barrels are now called “shards”
original tables from http://www.cs.huji.ac.il/~sdbi/2000/google/index.htm
Hit Lists (Not Mafia-Related)
 Used in inverted and forward indices
 Goal was to minimize the size – the bulk of data is in
hit entries
 For 1998 version, made it down to 2 bytes per hit (though
that’s likely climbed since then):
Plain:  cap: 1 | font: 3 | position: 12
vs.
Fancy:  cap: 1 | font: 7 | type: 4 | position: 8
Anchor (position special-cased to): cap: 1 | font: 7 | type: 4 | hash: 4 | pos: 4
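The 2-byte plain-hit layout above can be sketched with bit packing (field widths from the layout; the helper names are ours, not Google's):

```python
def pack_plain_hit(cap, font, position):
    """Pack a plain hit into 16 bits: cap:1 | font:3 | position:12."""
    assert 0 <= font < 8 and 0 <= position < 4096
    return ((cap & 1) << 15) | (font << 12) | position

def unpack_plain_hit(hit):
    """Recover the (cap, font, position) fields from a packed hit."""
    return (hit >> 15) & 1, (hit >> 12) & 7, hit & 0xFFF
```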
Google’s Distributed Crawler (in 1998)
 Single URL Server – the coordinator
 A queue that farms out URLs to crawler nodes
 Implemented in Python!
 Crawlers have 300 open connections apiece
 Each needs own DNS cache – DNS lookup is major
bottleneck
 Based on asynchronous I/O (which is now supported in
Java)
 Many caveats in building a “friendly” crawler (we’ll
talk about some momentarily)
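The per-crawler DNS cache mentioned above can be sketched as follows (the injectable resolver is just for illustration; a real crawler would wrap its own resolver this way):

```python
import socket

class DNSCache:
    """Per-crawler DNS cache: resolve each hostname once, reuse thereafter,
    so DNS lookup stops being the bottleneck on 300 open connections."""
    def __init__(self, resolver=socket.gethostbyname):
        self._resolve = resolver
        self._cache = {}

    def lookup(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolve(host)
        return self._cache[host]
```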
Google’s Search Algorithm
1. Parse the query
2. Convert words into wordIDs
3. Seek to start of doclist in the short barrel for every word
4. Scan through the doclists until there is a document that matches all of the search terms
5. Compute the rank of that document
6. If we’re at the end of the short barrels, start at the doclists of the full barrel, unless we have enough
7. If not at the end of any doclist, go to step 4
8. Sort the documents by rank; return the top K
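Steps 3-4 — scanning sorted doclists until a document matches every term — amount to a multi-way merge. A sketch (not Google's code, just the technique):

```python
def intersect_doclists(doclists):
    """Return DocIDs present in every sorted doclist, advancing the lagging
    list toward the current maximum (the 'scan until all match' step)."""
    if not doclists:
        return []
    iters = [iter(dl) for dl in doclists]
    current = []
    for it in iters:
        d = next(it, None)
        if d is None:
            return []          # an empty doclist means no match is possible
        current.append(d)
    matches = []
    while True:
        target = max(current)
        for i, it in enumerate(iters):
            while current[i] < target:
                d = next(it, None)
                if d is None:
                    return matches   # one list exhausted: done
                current[i] = d
        if all(c == current[0] for c in current):
            matches.append(current[0])
            d = next(iters[0], None)  # move past the match
            if d is None:
                return matches
            current[0] = d
```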
Ranking in Google
Considers many types of information:
 Position, font size, capitalization
 Anchor text
 PageRank
 Count of occurrences (basically, TF) in a way that tapers off
 (Not clear if they do IDF?)
Multi-word queries consider proximity also
Google’s Resources
In 1998:
 24M web pages
 About 55GB data w/o repository
 About 110GB with repository
 Lexicon 293MB
Worked quite well with low-end PC
By 2006, > 25 billion pages, 400M queries/day:
 Don’t attempt to include all barrels on every machine!
 e.g., 5+TB repository on special servers separate from index servers
 Many special-purpose indexing services (e.g., images)
 Much greater distribution of data (2008, ~450K PCs?), huge net BW
 Advertising needs to be tied in (100,000 advertisers claimed)
Web Crawling
 You’ve already built a basic crawler
 First, some politeness criteria
 Then we’ll look at the Mercator distributed crawler
as an example
Basic Crawler (Spider/Robot) Process
 Start with some initial page P0
 Collect all URLs from P0 and add to the crawler queue
 Consider <base href> tag, anchor links, optionally image links, CSS,
DTDs, scripts
 Considerations:
 What order to traverse (polite to do BFS – why?)
 How deep to traverse
 What to ignore (coverage)
 How to escape “spider traps” and avoid cycles
 How often to crawl
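The breadth-first traversal with cycle avoidance and a depth cap can be sketched as follows (`get_links` stands in for fetching a page and extracting its URLs):

```python
from collections import deque

def bfs_crawl(start, get_links, max_depth=2):
    """BFS crawl: the visited set breaks cycles; the depth cap bounds
    how deep we traverse."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue             # don't expand links beyond the depth cap
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order
```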
Essential Crawler Etiquette
 Robot exclusion protocols
 First, ignore pages with:
<META NAME="ROBOTS" CONTENT="NOINDEX">
 Second, look for robots.txt at root of web server
 See http://www.robotstxt.org/wc/robots.html
 To exclude all robots from a server:
User-agent: *
Disallow: /
 To exclude one robot from two directories:
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
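Python's standard `urllib.robotparser` applies exactly these exclusion rules; for example:

```python
from urllib.robotparser import RobotFileParser

# The two-directory exclusion from the slide above.
robots_txt = """\
User-agent: BobsCrawler
Disallow: /news/
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# /news/ is off-limits to BobsCrawler; everything else is allowed.
assert not rp.can_fetch("BobsCrawler", "http://example.com/news/today.html")
assert rp.can_fetch("BobsCrawler", "http://example.com/about.html")
```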
Mercator
 A scheme for building a distributed web crawler
 Expands a “URL frontier”
 Avoids re-crawling same URLs
 Also considers whether a document has been seen
before
 Every document has signature/checksum info computed as
it’s crawled
Mercator Web Crawler
1. Dequeue frontier URL
2. Fetch document
3. Record into RewindInputStream (RIS)
4. Check against fingerprints to verify it’s new
5. Extract hyperlinks
6. Filter unwanted links
7. Check if URL repeated (compare its hash)
8. Enqueue URL
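The steps above can be sketched as a loop (a toy version: `fetch` and `extract_links` are injected, and MD5 stands in for Mercator's actual fingerprint scheme):

```python
import hashlib
from collections import deque

def crawl(seeds, fetch, extract_links):
    """Mercator-style loop: skip URLs already on the frontier and documents
    whose content fingerprint was seen before (e.g. the same page reached
    through an aliased hostname)."""
    frontier = deque(seeds)
    url_seen = set(seeds)
    fingerprints = set()
    fetched = []
    while frontier:
        url = frontier.popleft()         # 1. dequeue frontier URL
        body = fetch(url)                # 2. fetch document
        fp = hashlib.md5(body.encode()).hexdigest()
        if fp in fingerprints:           # 4. content seen before: skip
            continue
        fingerprints.add(fp)
        fetched.append(url)
        for link in extract_links(url):  # 5./6. extract and filter links
            if link not in url_seen:     # 7. URL-repeat check
                url_seen.add(link)
                frontier.append(link)    # 8. enqueue URL
    return fetched
```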
Mercator’s Polite Frontier Queues
 Tries to go beyond breadth-first approach – want to
have only one crawler thread per server
 Distributed URL frontier queue:
 One subqueue per worker thread
 The worker thread is determined by hashing the
hostname of the URL
 Thus, only one outstanding request per web server
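Hashing the hostname to pick the subqueue might look like this (SHA-1 as the hash is an assumption; any stable hash of the hostname works):

```python
import hashlib
from urllib.parse import urlparse

def assign_queue(url, num_workers):
    """Route a URL to a worker subqueue by hashing its hostname, so every
    request to a given server comes from the same worker thread."""
    host = urlparse(url).hostname
    digest = hashlib.sha1(host.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_workers
```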
Mercator’s HTTP Fetcher
 First, needs to ensure robots.txt is followed
 Caches the contents of robots.txt for various web sites as
it crawls them
 Designed to be extensible to other protocols
 Had to write their own HTTP requester in Java – the Java version of the time didn’t support timeouts
 Today, can use setSoTimeout()
 Can also use Java non-blocking I/O if you wish:
 http://www.owlmountain.com/tutorials/NonBlockingIo.htm
 But they use multiple threads and synchronous I/O
Other Caveats
 Infinitely long URL names (good way to get a buffer
overflow!)
 Aliased host names
 Alternative paths to the same host
 Can catch most of these with signatures of
document data (e.g., MD5)
 Crawler traps (e.g., CGI scripts that link to
themselves using a different name)
 May need to have a way for human to override certain
URL paths – see Section 5 of paper
Mercator Document Statistics
PAGE TYPE     PERCENT
text/html     69.2%
image/gif     17.9%
image/jpeg     8.1%
text/plain     1.5%
pdf            0.9%
audio          0.4%
zip            0.4%
postscript     0.3%
other          1.4%
[Histogram of document sizes (60M pages) omitted]
Further Considerations
 May want to prioritize certain pages as being most
worth crawling
 Focused crawling tries to prioritize based on relevance
 May need to refresh certain pages more often
Web Search Summarized
 Two important factors:
 Indexing and ranking scheme that allows most relevant
documents to be prioritized highest
 Crawler that manages to (1) be well-mannered, (2) avoid traps, and (3) scale
 We’ll be using Pastry to distribute the work of
crawling and to distribute the data (what Google
calls “barrels”)
Synchronization
The issue:
 What do we do when decisions need to be made in a
distributed system?
 e.g., Who decides what action should be taken? Whose
conflicting option wins? Who gets killed in a deadlock? Etc.
Some options:
 Central decision point
 Distributed decision points with locking/scheduling
(“pessimistic”)
 Distributed decision points with arbitration (“optimistic”)
 Transactions (to come)
Multi-process Decision Arbitration:
Mutual Exclusion (i.e., Locking)
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
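A minimal sketch of the coordinator's grant/queue/release logic (message passing replaced by method calls for brevity):

```python
from collections import deque

class Coordinator:
    """Central lock manager: grant if free, queue the request otherwise,
    and pass the grant on at release (the deferred reply to process 2)."""
    def __init__(self):
        self.holder = None
        self.waiting = deque()
        self.grants = []              # log of OK messages sent

    def request(self, pid):
        if self.holder is None:
            self.holder = pid
            self.grants.append(pid)   # immediate OK
        else:
            self.waiting.append(pid)  # no reply yet

    def release(self, pid):
        assert pid == self.holder
        self.holder = self.waiting.popleft() if self.waiting else None
        if self.holder is not None:
            self.grants.append(self.holder)  # deferred OK
```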
Distributed Locking: Still Based on a
Central Arbitrator
a) Two processes want to enter the same critical region at the same moment.
b) Process 0 has the lowest timestamp, so it wins.
c) When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
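The arbitration rule above — lowest (timestamp, pid) wins — is the heart of the receive handler in this Ricart-Agrawala-style scheme; a sketch:

```python
def should_reply_ok(my_state, my_timestamp, req_timestamp, req_pid, my_pid):
    """Reply OK unless we are inside the critical region, or we also want
    it and our (timestamp, pid) pair is lower (i.e., we win the tie)."""
    if my_state == "HELD":
        return False                      # defer until we leave
    if my_state == "WANTED":
        return (req_timestamp, req_pid) < (my_timestamp, my_pid)
    return True                           # RELEASED: not interested, OK
```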
Eliminating the Request/Response:
A Token Ring Algorithm
a) An unordered group of processes on a network.
b) A logical ring constructed in software.
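Token circulation can be simulated in a few lines (`wants_entry` stands in for each process's local decision to enter the critical region):

```python
def token_ring_entries(ring, wants_entry, rounds=1):
    """Circulate the token around the logical ring; a process may enter
    the critical region only while it holds the token, then passes it on."""
    entries = []
    for _ in range(rounds):
        for pid in ring:           # token moves pid -> next in ring order
            if wants_entry(pid):
                entries.append(pid)
    return entries
```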
Comparison of Costs
Algorithm     Messages per entry/exit   Delay before entry (in message times)   Problems
Centralized   3                         2                                       Coordinator crash
Distributed   2(n–1)                    2(n–1)                                  Crash of any process
Token ring    1 to ∞                    0 to n–1                                Lost token, process crash
Electing a Leader: The Bully Algorithm
Is Won By Highest ID
Suppose the old leader dies…
 Process 4 needs a decision; it runs an election
 Processes 5 and 6 respond, telling 4 to stop
 Now 5 and 6 each hold an election
Totally-Ordered Multicasting:
Arbitrates via Priority
 Without a total order, replicas can apply the same updates in different orders, leaving a replicated database in an inconsistent state.
Another Approach: Time
 We can choose the decision that happened first
 … But how do we know what happened first?