The Anatomy of a Large-Scale
Hypertextual Web Search Engine
Sergey Brin, Lawrence Page
Presented By: Paolo Lim
April 10, 2007
CS 331 - Data Mining
1
AKA: The Original Google Paper
Larry Page and Sergey Brin
2
Presentation Outline
Design goals of Google search engine
Link Analysis and other features
System architecture and major structures
Crawling, indexing, and searching the web
Performance and results
Conclusions
Final exam questions
3
Linear Algebra Background
• PageRank involves knowledge of:
  - Matrix addition/multiplication
  - Eigenvectors and eigenvalues
  - Power iteration
  - Dot product
• Not discussed in detail in presentation
• For reference:
  - http://cs.wellesley.edu/~cs249B/math/Linear%20Algebra/CS298LinAlgpart1.pdf
  - http://www.cse.buffalo.edu/~hungngo/classes/2005/Expanders/notes/LA-intro.pdf
4
Google Design Goals
• Scaling with the web’s growth
• Improved search quality
  - Number of documents is increasing rapidly, but users’ ability to look at documents lags
  - Lots of “junk” results, little relevance
• Academic search engine research
  - Development and understanding in the academic realm
  - A system that a reasonable number of people can actually use
  - Support novel research activities on large-scale web data by other researchers and students
5
Link Analysis Basics
PageRank Algorithm
A Top 10 IEEE ICDM data mining algorithm
Large basis for ranking system (discussed later)
Tries to incorporate ideas from academic
community (publishing and citations)
Anchor Text Analysis
<a href=http://www.com> ANCHOR TEXT </a>
6
Intuition: Why Links, Anyway?
Links represent citations
Quantity of links to a website makes the
website more popular
Quality of links to a website also helps in
computing rank
Link structure largely unused before Larry
Page proposed it to thesis advisor
7
Naïve PageRank
Each link’s vote is proportional to the
importance of its source page
If page P with importance I has N outlinks,
then each link gets I / N votes
Simple recursive formulation, where p1 … pn are the pages linking to A:
PR(A) = PR(p1)/C(p1) + … + PR(pn)/C(pn)
PR(X) = PageRank of page X
C(X) = number of links going out of page X
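As a rough illustration of the recursive formulation, the sketch below applies one round of voting to a small made-up graph. The graph, the page names A, B, C, and the function name `naive_update` are assumptions chosen for the example, not part of the original slides.

```python
from fractions import Fraction

def naive_update(out_links, rank):
    """Apply PR(A) = sum of PR(p) / C(p) over pages p linking to A, once."""
    new_rank = {page: Fraction(0) for page in out_links}
    for page, targets in out_links.items():
        if not targets:
            continue  # a page with no outlinks casts no votes in this naive model
        vote = rank[page] / len(targets)  # PR(p) / C(p)
        for target in targets:
            new_rank[target] += vote
    return new_rank

# Hypothetical graph: A links to B and C, B links to C, C links to A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {page: Fraction(1, 3) for page in out_links}
updated = naive_update(out_links, rank)
print(updated)
```

Because every page here has at least one outlink, the total rank is preserved by the update.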
8
Naïve PageRank Model
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
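The web in 1839
[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon; Amazon (a) links to Yahoo and to M’soft; M’soft (m) links to Amazon. Each page splits its rank evenly across its outlinks: y/2, y/2, a/2, a/2, m.]
Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2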
9
Solving the flow equations
3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo scale factor
Additional constraint forces uniqueness
y+a+m = 1
y = 2/5, a = 2/5, m = 1/5
Gaussian elimination method works for
small examples, but we need a better
method for large graphs
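A minimal sketch of the small-example approach: the three flow equations are linearly dependent, so one of them (here m = a/2) is swapped for the constraint y + a + m = 1, and the resulting system is solved by Gauss-Jordan elimination with exact fractions. The helper `solve` is written for this illustration only.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination using exact rational arithmetic."""
    n = len(A)
    # Augmented matrix [A | b] with Fraction entries.
    aug = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row to 1, then clear this column in all other rows.
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

half = Fraction(1, 2)
A = [
    [half, -half, 0],   # y - y/2 - a/2 = 0
    [-half, 1, -1],     # a - y/2 - m = 0
    [1, 1, 1],          # y + a + m = 1 (replaces the redundant third flow equation)
]
b = [0, 0, 1]
solution = solve(A, b)
print(solution)  # [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
```

Elimination costs on the order of N^3 operations for N pages, which is why it is only practical for small examples.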
10
Matrix formulation
• Matrix M has one row and one column for each web page
• Suppose page j has n outlinks
  - If j → i, then Mij = 1/n
  - Else Mij = 0
• M is a column stochastic matrix
  - Columns sum to 1
• Suppose r is a vector with one entry per web page
  - ri is the importance score of page i
  - Call it the rank vector
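The definition above can be sketched as a small function that builds M from an adjacency list. The function name `link_matrix` and the dictionary layout are assumptions made for this example; the graph is the Yahoo/Amazon/M’soft example from the earlier slide, and each page is assumed to link to a given target at most once.

```python
from fractions import Fraction

def link_matrix(pages, out_links):
    """Return M as a list of rows, with M[i][j] = 1/n when page j (n outlinks) links to page i."""
    index = {page: k for k, page in enumerate(pages)}
    M = [[Fraction(0)] * len(pages) for _ in pages]
    for source, targets in out_links.items():
        j = index[source]  # column = source page
        for target in targets:
            M[index[target]][j] = Fraction(1, len(targets))  # row = destination page
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = link_matrix(pages, out_links)

# Column stochastic: every column should sum to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```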
11
Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
Suppose page j links to 3 pages, including i
[Figure: the product r = Mr. Column j of M has the entry 1/3 in row i, so page j contributes rj/3 to ri.]
12
Eigenvector formulation
The flow equations can be written
r = Mr
So the rank vector is an eigenvector of the
stochastic web matrix
In fact, its first or principal eigenvector, with
corresponding eigenvalue 1
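As a quick check of the eigenvector claim, the sketch below multiplies the link matrix of the three-page Yahoo/Amazon/M’soft example by the solution of its flow equations, y = 2/5, a = 2/5, m = 1/5, and confirms that the vector is unchanged. The helper `matvec` is written for this illustration.

```python
from fractions import Fraction as F

def matvec(M, r):
    """Multiply matrix M (list of rows) by vector r."""
    return [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]

# Column j holds the outlink shares of page j; order of pages is y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]
r = [F(2, 5), F(2, 5), F(1, 5)]

# r = Mr means r is an eigenvector of M with eigenvalue 1.
is_eigenvector = matvec(M, r) == r
print(is_eigenvector)
```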
13
Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
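[Figure: the three-page web graph of Yahoo (y), Amazon (a), and M’soft (m).]
Link matrix M (column = source page, row = destination page):
     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0
r = Mr:
\[
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
=
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}
\]
Equivalently:
y = y/2 + a/2
a = y/2 + m
m = a/2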
14
Power Iteration
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: $r^{(0)} = [1, \ldots, 1]^T$
Iterate: $r^{(k+1)} = M r^{(k)}$
Stop when $\lVert r^{(k+1)} - r^{(k)} \rVert_1 < \varepsilon$
$\lVert x \rVert_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
Can use any other vector norm e.g., Euclidean
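The scheme can be sketched in a few lines of plain Python. The function name `power_iteration`, the default tolerance, and the step limit are assumptions for this example; the matrix is the three-page Yahoo/Amazon/M’soft link matrix.

```python
def power_iteration(M, epsilon=1e-10, max_steps=1000):
    """Iterate r <- M r from the all-ones vector until the L1 change drops below epsilon."""
    n = len(M)
    r = [1.0] * n
    for _ in range(max_steps):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(a - b) for a, b in zip(r_next, r)) < epsilon:
            return r_next
        r = r_next
    return r  # step limit reached without meeting the tolerance

M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print(r)  # approaches [1.2, 1.2, 0.6], i.e. (6/5, 6/5, 3/5)
```

Since the starting vector sums to N = 3 and M is column stochastic, every iterate also sums to 3; dividing by N gives the normalized solution (2/5, 2/5, 1/5).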
15
Power Iteration Example
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
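[Figure: the three-page web graph of Yahoo (y), Amazon (a), and M’soft (m).]
Link matrix M:
     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0
Successive iterates of r = (y, a, m), starting from (1, 1, 1):
\[
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\;
\begin{pmatrix} 1 \\ 3/2 \\ 1/2 \end{pmatrix},\;
\begin{pmatrix} 5/4 \\ 1 \\ 3/4 \end{pmatrix},\;
\begin{pmatrix} 9/8 \\ 11/8 \\ 1/2 \end{pmatrix},\;
\ldots,\;
\begin{pmatrix} 6/5 \\ 6/5 \\ 3/5 \end{pmatrix}
\]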
16
Random Surfer
• Imagine a random web surfer
  - At any time t, surfer is on some page P
  - At time t+1, the surfer follows an outlink from P uniformly at random
  - Ends up on some page Q linked from P
  - Process repeats indefinitely
• Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  - p(t) is a probability distribution on pages
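The random surfer can be simulated directly. The sketch below walks the three-page Yahoo/Amazon/M’soft graph and records how often each page is visited; the function name `simulate_surfer`, the step count, and the fixed seed are assumptions for this example. On this graph the visit frequencies should come out close to the rank vector (2/5, 2/5, 1/5).

```python
import random

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow a uniformly random outlink at each step; return the fraction of visits per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
frequencies = simulate_surfer(out_links, start="y", steps=200_000)
print(frequencies)
```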
17
The stationary distribution
• Where is the surfer at time t+1?
  - Follows a link uniformly at random
  - p(t+1) = Mp(t)
• Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
  - Then p(t) is called a stationary distribution for the random walk
• Our rank vector r satisfies r = Mr
  - So it is a stationary distribution for the random surfer
18
Spider traps
A group of pages is a spider trap if there
are no links from within the group to
outside the group
Random surfer gets trapped
Spider traps violate the conditions needed
for the random walk theorem
19
Microsoft becomes a spider trap
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
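[Figure: the three-page web graph, with M’soft (m) now linking only to itself.]
Link matrix M:
     y    a    m
y   1/2  1/2   0
a   1/2   0    0
m    0   1/2   1
Successive iterates of r = (y, a, m), starting from (1, 1, 1):
\[
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\;
\begin{pmatrix} 1 \\ 1/2 \\ 3/2 \end{pmatrix},\;
\begin{pmatrix} 3/4 \\ 1/2 \\ 7/4 \end{pmatrix},\;
\begin{pmatrix} 5/8 \\ 3/8 \\ 2 \end{pmatrix},\;
\ldots,\;
\begin{pmatrix} 0 \\ 0 \\ 3 \end{pmatrix}
\]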
20
Random teleports
The Google solution for spider traps
At each time step, the random surfer has
two options:
With probability β, follow a link at random
With probability 1-β, jump to some page
uniformly at random
Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within
a few time steps
21
Matrix formulation
Suppose there are N pages
Consider a page j, with set of outlinks O(j)
We have Mij = 1/|O(j)| when j → i and Mij = 0
otherwise
The random teleport is equivalent to
adding a teleport link from j to every other page with
probability (1-β)/N
reducing the probability of following each outlink from
1/|O(j)| to β/|O(j)|
Equivalent: tax each page a fraction (1-β) of its score
and redistribute evenly
22
Page Rank
Construct the NxN matrix A as follows
Aij = βMij + (1-β)/N
Verify that A is a stochastic matrix
The page rank vector r is the principal
eigenvector of this matrix
satisfying r = Ar
Equivalently, r is the stationary distribution
of the random walk with teleports
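A small sketch of this construction, writing β for the probability of following a link, so that Aij = βMij + (1-β)/N. It uses the spider-trap version of the three-page graph (M’soft linking only to itself) with β = 0.8, kept as the exact fraction 4/5. The function name `google_matrix` is an assumption for this example.

```python
from fractions import Fraction as F

def google_matrix(M, beta):
    """Return A with A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

# Spider-trap example: M'soft (third column) links only to itself.
M_trap = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(1)],
]
A = google_matrix(M_trap, F(4, 5))

# A stays column stochastic: each column still sums to 1.
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

# The vector (7/11, 5/11, 21/11), which sums to N = 3, satisfies r = Ar.
r = [F(7, 11), F(5, 11), F(21, 11)]
Ar = [sum(a * x for a, x in zip(row, r)) for row in A]
print(column_sums, Ar == r)
```

With teleports, M’soft still collects the largest share of rank, but Yahoo and Amazon no longer decay to zero.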
23
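Previous example with β = 0.8
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
[Figure: the three-page web graph of Yahoo (y), Amazon (a), and M’soft (m), with M’soft as a spider trap.]
\[
A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
  + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}
  = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}
\]
Successive iterates of r = (y, a, m), starting from (1, 1, 1):
\[
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\;
\begin{pmatrix} 1.00 \\ 0.60 \\ 1.40 \end{pmatrix},\;
\begin{pmatrix} 0.84 \\ 0.60 \\ 1.56 \end{pmatrix},\;
\begin{pmatrix} 0.776 \\ 0.536 \\ 1.688 \end{pmatrix},\;
\ldots,\;
\begin{pmatrix} 7/11 \\ 5/11 \\ 21/11 \end{pmatrix}
\]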
24
Dead ends
Pages with no outlinks are “dead ends” for
the random surfer
Nowhere to go on next step
25
Microsoft becomes a dead end
(from http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf)
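[Figure: the three-page web graph of Yahoo (y), Amazon (a), and M’soft (m), with M’soft now having no outlinks.]
\[
A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 0 \end{pmatrix}
  + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}
  = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 1/15 \end{pmatrix}
\]
Nonstochastic! (The m column sums to 3/15 = 1/5, not 1.)
Successive iterates of r = (y, a, m), starting from (1, 1, 1):
\[
\begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix},\;
\begin{pmatrix} 1 \\ 0.6 \\ 0.6 \end{pmatrix},\;
\begin{pmatrix} 0.787 \\ 0.547 \\ 0.387 \end{pmatrix},\;
\begin{pmatrix} 0.648 \\ 0.430 \\ 0.333 \end{pmatrix},\;
\ldots,\;
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
\]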
26
Dealing with dead-ends
• Teleport
  - Follow random teleport links with probability 1.0 from dead-ends
  - Adjust matrix accordingly
• Prune and propagate
  - Preprocess the graph to eliminate dead-ends
  - Might require multiple passes
  - Compute page rank on reduced graph
  - Approximate values for dead ends by propagating values from reduced graph
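The teleport option can be sketched as a matrix adjustment: every all-zero column (a dead end) is replaced by a uniform column, so the surfer leaves a dead end by teleporting with probability 1.0. The function name `fix_dead_ends` is an assumption for this example, and the prune-and-propagate option is not shown.

```python
from fractions import Fraction as F

def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column becomes a uniform 1/N column."""
    n = len(M)
    fixed = [row[:] for row in M]
    for j in range(n):
        if all(M[i][j] == 0 for i in range(n)):  # page j has no outlinks
            for i in range(n):
                fixed[i][j] = F(1, n)
    return fixed

# Dead-end example: M'soft (third column) has no outlinks.
M_dead = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(0)],
    [F(0),    F(1, 2), F(0)],
]
M_fixed = fix_dead_ends(M_dead)
column_sums = [sum(M_fixed[i][j] for i in range(3)) for j in range(3)]
print(column_sums)  # every column sums to 1 again
```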
27
Anchor Text
Can be more accurate description of target
site than target site’s text itself
Can point at non-HTTP or non-text
Images
Videos
Databases
Possible for non-crawled pages to be
returned in the process
28
Other Features
List of occurrences of a particular word in
a particular document (Hit List)
Location information and proximity
Keeps track of visual presentation details:
Font size of words
Capitalization
Bold/Italic/Underlined/etc.
Full raw HTML of all pages is available in
repository
29
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
Implemented in C and C++ on Solaris and Linux
30
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
[Figure: Google architecture diagram, with callouts:]
Crawlers: Multiple crawlers run in parallel. Each crawler keeps its own DNS lookup cache and ~300 connections open at once.
URL Server: Keeps track of URLs that have been crawled and that still need to be crawled.
Store Server: Compresses and stores web pages.
Anchors: Stores each link and the text surrounding the link.
URL Resolver: Converts relative URLs into absolute URLs.
Indexer: Uncompresses and parses documents. Stores link information in the anchors file.
Repository: Contains the full HTML of every web page. Each document is prefixed by docID, length, and URL.
31
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
[Figure: Google architecture diagram, with callouts:]
URL Resolver: Maps absolute URLs into docIDs stored in the Doc Index. Stores anchor text in “barrels”. Generates a database of links (pairs of docIDs).
Indexer: Parses documents and distributes hit lists into “barrels”.
Barrels: Partially sorted forward indexes, sorted by docID. Each barrel stores hit lists for a given range of wordIDs.
Lexicon: In-memory hash table that maps words to wordIDs. Contains a pointer to the doclist in the barrel that the wordID falls into.
Sorter: Creates the inverted index, whereby the document list containing docIDs and hit lists can be retrieved given a wordID.
Doc Index: DocID-keyed index where each entry includes information such as a pointer to the document in the repository, checksum, statistics, status, etc. Also contains URL info if the document has been crawled; if not, it just contains the URL.
32
Google Architecture
(from http://www.ics.uci.edu/~scott/google.htm)
[Figure: Google architecture diagram, with callouts:]
Barrels: Two kinds of barrels. Short barrels contain hit lists that include title or anchor hits. Long (full) barrels hold all hit lists.
Searcher: Uses the new lexicon keyed by wordID, the inverted doc index keyed by docID, and PageRanks to answer queries.
DumpLexicon: The list of wordIDs produced by the Sorter and the lexicon created by the Indexer are used to create the new lexicon used by the searcher. The lexicon stores ~14 million words.
33
Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.
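The steps above can be sketched in simplified form. Everything in this example is an assumption made for illustration: the lexicon and barrels are plain dictionaries (wordID to {docID: hit count}), the ranking function is a stand-in that sums hit counts, and the scan always visits both barrel types instead of stopping early as a real system might.

```python
def evaluate_query(query, lexicon, short_barrel, full_barrel, rank_fn, k=10):
    """Return up to k docIDs that contain every query word, best rank first."""
    # Steps 1-2: parse the query and convert words into wordIDs.
    words = query.lower().split()
    if not words or any(w not in lexicon for w in words):
        return []  # an unknown word cannot match any document
    word_ids = [lexicon[w] for w in words]

    matched = {}
    # Steps 3-7: scan the short barrels first, then the full barrels.
    for barrel in (short_barrel, full_barrel):
        doclists = [barrel.get(w, {}) for w in word_ids]
        # A document matches when it appears in the doclist of every word.
        common = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in common:
            if doc_id not in matched:
                hits = [doclist[doc_id] for doclist in doclists]
                matched[doc_id] = rank_fn(doc_id, hits)  # step 5
    # Step 8: sort the matched documents by rank and return the top k.
    return sorted(matched, key=lambda d: matched[d], reverse=True)[:k]

# Hypothetical toy data.
lexicon = {"web": 1, "search": 2}
short_barrel = {1: {10: 4}, 2: {10: 3, 11: 1}}              # title/anchor hits only
full_barrel = {1: {10: 5, 11: 3, 12: 1}, 2: {10: 4, 11: 2}}  # all hits

def rank_fn(doc_id, hits):
    return sum(hits)  # stand-in for the real ranking function

results = evaluate_query("web search", lexicon, short_barrel, full_barrel, rank_fn)
print(results)  # [10, 11]
```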
34
Single Word Query Ranking
• Hitlist is retrieved for single word
• Each hit can be one of several types: title, anchor, URL, large font, small font, etc.
• Each hit type is assigned its own weight
• Type-weights make up vector of weights
• Number of hits of each type is counted to form count-weight vector
• Dot product of type-weight and count-weight vectors is used to compute IR score
• IR score is combined with PageRank to compute final rank
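A minimal sketch of the single-word scoring described above. The weight values, the cap used to turn counts into count-weights, and the linear blend with PageRank are all hypothetical choices made for this example; the slides do not give the actual numbers or the combining function.

```python
# Hypothetical type-weights: one weight per hit type.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "large_font": 2.0, "plain": 1.0}

def count_weight(count, cap=8):
    """Assumed taper: grows linearly with the count, then stops growing at cap."""
    return float(min(count, cap))

def ir_score(hit_counts):
    """Dot product of the type-weight vector and the count-weight vector."""
    return sum(TYPE_WEIGHTS[hit_type] * count_weight(count)
               for hit_type, count in hit_counts.items())

def final_rank(ir, pagerank, alpha=0.5):
    """Hypothetical blend of IR score and PageRank."""
    return alpha * ir + (1 - alpha) * pagerank

# One title hit, two anchor hits, twenty plain-text hits (capped at 8).
score = ir_score({"title": 1, "anchor": 2, "plain": 20})
print(score)                   # 8*1 + 6*2 + 1*8 = 28.0
print(final_rank(score, 2.0))  # 0.5*28.0 + 0.5*2.0 = 15.0
```

The cap keeps a page that repeats a word many times from outscoring a page with a single title or anchor hit.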
35
Multi-word Query Ranking
• Similar to single-word ranking, except now must analyze proximity of words in a document
• Hits occurring closer together are weighted higher than those farther apart
• Each proximity relation is classified into 1 of 10 bins ranging from a “phrase match” to “not even close”
• Each type and proximity pair has a type-prox weight
• Counts converted into count-weights
• Take dot product of count-weights and type-prox weights to compute the IR score
36
Scalability
Cluster architecture combined with
Moore’s Law make for high scalability. At
time of writing:
~ 24 million documents indexed in one week
~518 million hyperlinks indexed
Four crawlers collected 100 documents/sec
37
Key Optimization Techniques
• Each crawler maintains its own DNS lookup cache
• Use flex to generate lexical analyzer with own stack for parsing documents
• Parallelization of indexing phase
• In-memory lexicon
• Compression of repository
• Compact encoding of hit lists for space saving
• Indexer is optimized so it is just faster than the crawler, so that crawling is the bottleneck
• Document index is updated in bulk
• Critical data structures placed on local disk
• Overall architecture designed to avoid disk seeks wherever possible
38
Storage Requirements
(from http://www.ics.uci.edu/~scott/google.htm)
At the time of publication, Google had the following
statistical breakdown for storage requirements:
39
Conclusions
Search is far from perfect
Topic/Domain-specific PageRank
Machine translation in search
Non-hypertext search
Business potential
Brin and Page worth around $15 billion each…
at 32 years old!
If you have a better idea than how Google does
search, please remember me when you’re
hiring software engineers!
40
Possible Exam Questions
• Given a web/link graph, formulate a Naïve PageRank link matrix and do a few steps of power iteration.
  - Slides 14 – 16
• What are spider traps and dead ends, and how does Google deal with these?
  - Spider Trap: Slides 19 – 21
  - Dead End: Slides 25 – 27
• Explain the difference between single and multiple word search query evaluation.
  - Slides 35 – 36
41
References
• Brin, Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.
• Brin, Page, Motwani, Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
• http://www.stanford.edu/class/cs345a/lectureslides/PageRank.pdf
• www.cs.duke.edu/~junyang/courses/cps296.12002-spring/lectures/02-web-search.pdf
• http://www.ics.uci.edu/~scott/google.htm
42
Thank you!
43