* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ppt
Survey
Document related concepts
Algorithm characterizations wikipedia , lookup
Post-quantum cryptography wikipedia , lookup
Selection algorithm wikipedia , lookup
Binary search algorithm wikipedia , lookup
Non-negative matrix factorization wikipedia , lookup
Computational complexity theory wikipedia , lookup
Sorting algorithm wikipedia , lookup
Factorization of polynomials over finite fields wikipedia , lookup
Theoretical computer science wikipedia , lookup
Pattern recognition wikipedia , lookup
Fast Fourier transform wikipedia , lookup
Transcript
HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms Winter Semester 2004/2005 22 Nov 2004 6th Lecture Christian Schindelhauer [email protected] 1 HEINZ NIXDORF INSTITUTE Chapter III University of Paderborn Algorithms and Complexity Christian Schindelhauer Chapter III Searching the Web 22 Nov 2004 Search Algorithms, WS 2004/05 2 HEINZ NIXDORF INSTITUTE Searching the Web University of Paderborn Algorithms and Complexity Christian Schindelhauer Introduction The Anatomy of a Search Engine Google’s Pagerank algorithm – The Simple Algorithm – Periodicity and convergence Kleinberg’s HITS algorithm – The algorithm – Convergence The Structure of the Web – Pareto distributions – Search in Pareto-distributed graphs Search Algorithms, WS 2004/05 3 Overview Search Engines HEINZ NIXDORF INSTITUTE http://www.searchengineshowdown.com/ University of Paderborn Algorithms and Complexity Christian Schindelhauer (March 2002) Number of documents Search Engine Showdown Claim Estimate (millions) (millions) Google WiseNut AllTheWeb 968 579 580 1,500 1,500 507 Northern Light AltaVista 417 397 358 500 Hotbot 332 500 MSN Search 292 500 Search Algorithms, WS 2004/05 4 Overview Search Engines HEINZ NIXDORF INSTITUTE http://www.searchengineshowdown.com/ University of Paderborn Algorithms and Complexity Christian Schindelhauer (Dez. 2002) Number of documents Search Engine Showdown Claim Estimate (millions) (millions) Google 3,033 3,083 AlltheWeb AltaVista 2,106 1,689 2,116 1,000 WiseNut 1,453 1,500 Hotbot 1,147 3,000 MSN Search 1,018 3,000 Teoma NLResearch Gigablast 1,015 733 275 500 125 150 Search Algorithms, WS 2004/05 5 HEINZ NIXDORF INSTITUTE Problems of Searching the Web University of Paderborn Algorithms and Complexity Christian Schindelhauer Currently (Nov 2004) more than 8 billion = 8.000 millions web-pages – 10.000 words cover more than 95% of each text – much more web-pages than words – Users hardly ever look through more than 40 results The problem is not to find a pattern, but to find the most important pages Problems: – Important pages do not contain the search pattern • www.porsche.com does not contain sports car or even car • www.google.com does not contain web search engine • www.airbus.com does not contain airplane – Certain pages have nearly every word (dictionary) – Names are misleading • http://www.whitehouse.org/ is not the web-site of the white house • www.theonion.com is not about vegetables – Certain pattern can be found everywhere, e.g. page, web, windows, ... Search Algorithms, WS 2004/05 6 HEINZ NIXDORF INSTITUTE How to rank Web-pages University of Paderborn Algorithms and Complexity Christian Schindelhauer The main problem about searching the web is to rank the importance Links are very helpful: – Humans are usually introduced on purpose – The context of the links gives some clues about the meaning of the web-page – Pages where many people point to are of probably very important – Most search rely on links Other approach: Ontology of words – Compare the combination of words with the search word – Good for comparing text – Difficult if single word patterns are given Search Algorithms, WS 2004/05 7 HEINZ NIXDORF INSTITUTE The Anatomy of a Web Search Engine University of Paderborn Algorithms and Complexity Christian Schindelhauer “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page, Computer Networks and ISDN Systems, Vol. 30, 1-6, p. 107-117, 1998 Design of the prototype – Stanford University 1998 Key components: – Web Crawler – Indexer – Pagerank – Searcher Zur Anzeige wird der QuickTime™ Dekompressor „TIFF (Unkomprimiert)“ benötigt. Main difference between Google and other search engines (in 1998) – The Pagerank mechanism Search Algorithms, WS 2004/05 8 HEINZ NIXDORF INSTITUTE Simplified PageRank-Algorithmus University of Paderborn Algorithms and Complexity Christian Schindelhauer Simplified PageRank-Algorithmus – Rank of a wep-page R(u) [0,1] – Important pages hand their rank down to the pages they link to. – c is a normalisation factor such that ||R(u)||1= 1, i.e. • the sum of all page ranks add to 1 – Predecessor nodes Bu – sucessor nodes Fu Search Algorithms, WS 2004/05 9 The Simplifed Pagerank Algorithm and an example HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms, WS 2004/05 10 HEINZ NIXDORF INSTITUTE Matrix representaion University of Paderborn Algorithms and Complexity Christian Schindelhauer R cMR, where R is a vector (R(1),R(2),… R(n)) and M denotes the following n n – Matrix Search Algorithms, WS 2004/05 11 HEINZ NIXDORF INSTITUTE The Simplified Pagerank Algorithm University of Paderborn Algorithms and Complexity Christian Schindelhauer Does it converge? If it converges, does it converge to a single result? Is the result reasonable? Search Algorithms, WS 2004/05 12 HEINZ NIXDORF INSTITUTE The Eigenvector and Eigenvalue of the Matrix University of Paderborn Algorithms and Complexity Christian Schindelhauer For vector x and n n-matrix and a number λ: – If M x = λ x then x is called the eigenvector and λ the eigen-value Every n n-matrix M has at most n eigenvalues Compute the eigenvalues by eigen-decomposition Mx=λx (M - I λ) x = 0, where I is the identity matrix – This equality has only non-trivial solutions if Det(M - I λ) = 0 – This leads to a polynomial equation of degree n, which has always n solutions λ1, λ2, ..., λn • (Fundamental theorem of algebra) – Solving the linear equations (M - I λi) x = 0 lead to the eigenvectors The eigenvektor of the matrix is a fix point of the recursion of the simplified pagerank algorithm Search Algorithms, WS 2004/05 13 HEINZ NIXDORF INSTITUTE Stochastic Matrices University of Paderborn Algorithms and Complexity Christian Schindelhauer Consider n discrete states and a sequence of random variable X1, X2, ... over this set of states The sequence X1, X2, ... is a Markov chain if A stochastic matrix M is the transition matrix for a finite Markov chain, also called a Markov matrix: – Elements of the matrix M must be real numbers of [0, 1]. – The sum of all column in M is 1 Observation for the matrix M of the simpl. pagerank algorithm – M is stochastic if all nodes have at least one outgoing link Search Algorithms, WS 2004/05 14 HEINZ NIXDORF INSTITUTE The Random Surfer University of Paderborn Algorithms and Complexity Christian Schindelhauer Consider the following algorithm – Start in a random web-page according to a probability distribution – Repeat the following for t rounds • If no link is on this page, exit and produce no output • Uniformly and randomly choose a link of the web-page • Follow that link and go to this web-page – Output the web-page Lemma The probability that a web-page i is output by the random surfer after t rounds started with probability distribution x1, .., xn is described by the i-th entry of the output of the simplified Pagerank-algorithm iterated for t rounds without normalization. Proof follows applying the definition of Markov chains Search Algorithms, WS 2004/05 15 HEINZ NIXDORF INSTITUTE Eigenvalues of Stochastic Matrices University of Paderborn Algorithms and Complexity Christian Schindelhauer Notations – Die L1-Norm of a vector x is defined as – x0, if for all i: xi 0 – x0, if for all i: xi 0 Lemma For every stochastic matrix M and every vector x we have • || M x ||1 || x ||1 • || M x ||1 = || x ||1, if x0 or x0 Eigenvalues of M |i| 1 Theorem For every stochastic matrix M there is an eigenvector x with eigenvalue 1 such that x 0 and ||x||1 = 1 Search Algorithms, WS 2004/05 16 HEINZ NIXDORF INSTITUTE The problem of periodicity - Example University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms, WS 2004/05 17 HEINZ NIXDORF INSTITUTE Periodicity - Example 2 University of Paderborn Algorithms and Complexity Christian Schindelhauer Search Algorithms, WS 2004/05 18 HEINZ NIXDORF INSTITUTE Periodic Matrices University of Paderborn Algorithms and Complexity Christian Schindelhauer Definition – A square matrix M such that the matrix power Mk=M for k a positive integer is called a periodic matrix. – If k is the least such integer, then the matrix is said to have period k. – If k = 1, then M2 = M and M is called idempotent. Fact – For non-periodic matrices there are vectors x, such that limk Mk x does not converge. Definition – The directed graph G=(V,E) of a n x n-matrix consistis of the node set V={1,..., n} and has edges • E = {(i,j) | Mij 0} – A path is a sequence of edges (u1,u2),(u2,u3),(u3,u4),..,(ut,ut+1) of a graph – A graph cycle is a path where the start node is the end node – A strongly connected subgraph S is a maximum sub-graph such that every graph cycle starting and ending in a node of S is contained in S. Search Algorithms, WS 2004/05 19 Necessary and Sufficient Conditions for Periodicity HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Theorem (necessary condition) – If the stochastic matrix M is periodic with period t2, then for the graph G of M there exists a strongly connected subgraph S of at least two nodes such that every directed graph cycle within S has a length of the form i t for natural number i. Theorem (sufficient condition) – Let the graph consist of one strongly connected subgraph and – let L1,L2, ..., Lm be the lengths all directed graph cycles of maximal length n – Then M is non-periodic if and only if gcd(L1,L2, ..., Lm) = 1 Notation: – gcd(L1,L2, ..., Lm) = greatest common divisor of numbers L1,L2, ..., Lm Corollary – If the graph is strongly connected and there exists a graph cycly of length 1 (i.e. a loop), then M is non-periodic. Search Algorithms, WS 2004/05 20 HEINZ NIXDORF INSTITUTE Disadvantages of the Simplified PagerankAlgorithm University of Paderborn Algorithms and Complexity Christian Schindelhauer The Web-graph has sinks, i.e. pages without links M is not a stochastic matrix The Web-graph is periodic Convergence is uncertain The Web-graph is not strongly connected Several convergence vectors possible Rank-sinks – Strongly connected subgraphs absorb all weight of the predecessors – All predecessors pointing to a web-page loose their weight. Search Algorithms, WS 2004/05 21 HEINZ NIXDORF INSTITUTE The (non-simplified) Pagerank-Algorithm University of Paderborn Algorithms and Complexity Christian Schindelhauer Add to a sink links to all web-pages Uniformly and randomly choose a web-page – With some probability q < 1 perform a step of the simplified Pagerank algorithm – With probability 1-q start with the first step (and choose a random web-page) Note M ist stochastic Search Algorithms, WS 2004/05 22 HEINZ NIXDORF INSTITUTE Properties of the Pagerank-Algorithm University of Paderborn Algorithms and Complexity Christian Schindelhauer Graph der Matrix is strongly connected There are graph cycles of length 1 Theorem In non-periodic matrices of strongly connected graphs the Markov-chain converges to a unique eigenvector with eigenvalue 1. PageRank converges to this unique eigenvector Search Algorithms, WS 2004/05 23 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Christian Schindelhauer Thanks for your attention End of 6th lecture Next lecture: Mo 29 Nov 2004, 11.15 am, FU 116 Next exercise class: Mo 22 Nov 2004, 1.15 pm, F0.530 or We 24 Nov 2004, 1.00 pm, E2.316 24