Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Web Search – Summer Term 2006 VI. Web Search Ranking (cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University The Evolution of Search Engines 1st generation: Use only "on page", text data - Word frequency, language 1995-1997 (AltaVista, Excite, Lycos, etc.) 2nd gen.: Use off-page, web-specific data - Link (or connectivity) analysis - Click-through data (what results people click on) PageRank [2], introduced by Brin - Anchor-text (how people refer to a page) and Page, used by Google From 1998 (made popular by Google but HITS [3], introduced by Kleinberg everyone now) (used by Teoma?) TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002 Link-based ranking: HITS Motivation (compare PageRank): Broad-topic queries: deliver (too) large set of relevant results Therefore: Ranking based on the authority of a web page (cf. PageRank: quality / importance) Link: Interpreted as a conferral of authority Goal: Find pages with high authority (balance between relevance and popularity) Link-based ranking: HITS (cont.) Basic idea: Consider sub-graph of the web graph that contains as much relevant pages as possible Analyze the graph's link structure to find: Authorities = the most authoritative or definitive subset of relevant pages (for ranking) Hubs = Pages pointing to many related authorities (for their identification) Authorities and Hubs - Example Example: Query “search engine” www.google.com HUBS www.teoma.com AUTHORITIES searchenginewatch.com dir.yahoo.com/ Computers_and_Internet/ Internet/World_Wide_Web/ Searching_the_Web/ Search_Engines_and_Directories/ www.altavista.com www.alltheweb.com Authorities and Hubs - Basic idea Approach: - Generate a query-dependent sub-graph - Recursively calculate hubs and authorities Assume S is the set of pages in this subgraph, then S should be - rather small - contain lots of relevant pages - contain the most important authorities Basic idea to generate such a sub-graph: - Get initial root set based on any IR criteria - Include the local neighborhood of this set Authorities and Hubs - Base set GIVEN: - QUERY Q - TEXT-BASED SEARCH ENGINE SE - CONSTANTS T AND D (NAT. NUMBERS) - SET R(Q) OF THE FIRST T RESULTS OF SE GIVEN Q ALGORITHM TO CALCULATE SUBGRAPH S(Q) S(Q) := R(Q) FOR EACH PAGE P IN R(Q) T+(P) := SET OF PAGES LINKED BY P T-(P) := SET OF PAGES LINKING TO P ADD ALL PAGES FROM T+(P) TO S(Q) IF |T-(P)| < D THEN ADD ALL PAGES FROM T-(P) TO S(Q) ELSE ADD RANDOM SUBSET OF T-(P) TO S(Q) Query-dependent base set - Comments Why only use a sub-graph? - Advantage of query dependence - Reduces processing time (online calculation!) Why not just take the root set? - Appearance of query terms does not necessarily represent relevance (or authority) - Larger network is needed for link analysis In original work: Heuristics for special cases - Remove intrinsic links, i.e. links from the same domain (navigational links, etc.) - Consider only a certain number of links from one domain to a page p (to avoid spamming) Calculating Hubs and Authorities Obviously, there exists a mutual reinforcing relationship between Hubs and Authorities: - A good Hub links to many good Authorities - A good Authority is linked by many Hubs Hence, use an iterative algorithm to estimate a Hub and Authority value, respectively Authorities: I-Operation Hubs: O-Operation Calculating Hubs and Authorities Authorities: I-Operation Hubs: O-Operation q1 q2 q1 PAGE p q3 PAGE p q2 q3 Calculating Hubs and Authorities GIVEN: - SUB-GRAPH G WITH N PAGES (FROM BASE SET S(Q)) - CONSTANT NUMBER K ALGORITHM TO CALCULATE HUBS AND AUTHOR. X0 := (1, 1, ..., 1) Y0 := (1, 1, ..., 1) FOR i = 1, ..., K CALCULATE NEW WEIGHTS Xi BY APPLYING THE I-OPERATION TO Xi-1, Yi-1 CALCULATE NEW WEIGHTS Yi BY APPLYING THE O-OPERATION TO Xi, Yi-1 NORMALIZE Xi AND Yi Calculating Hubs and Authorities Convergence: see lit. Basic idea: PageRank vs. HITS PageRank HITS + - Hard to spam - Computes quality signal for all pages - Easy to compute, realtime execution is hard - Query specific - Works on small graphs + - - Local graph structure - Non-trivial to compute can be manufactured - Not query specific - Provides a signal only - Does not work on when there is direct small graphs connectivity (e.g. home pages) Proven to be effective for Well suited for supervised general purpose ranking directory construction TUTORIAL ON SEARCH FROM THE WEB TO THE ENTERPRISE, SIGIR 2002 Commercial search engines using HITS (Maybe?) Teoma, now search.ask.com "Teomas underlying technology is an extension of the HITS algorithm …", C. Sherman, April 2002, http://dc.internet.com/news/article.php/1002061 (Not online anymore) References - HITS [1] S. BRIN, L. PAGE: "THE ANATOMY OF A LARGE-SCALE HYPERTEXTUAL WEB SEARCH ENGINE", WWW 1998 [2] JON KLEINBERG: "AUTHORITATIVE SOURCES IN A HYPERLINKED ENVIRONMENT", JOURNAL OF THE ACM, VOL. 46, NO. 5, SEPTEMBER 1999 General Web Search Engine Architecture CLIENT QUERIES QUERY ENGINE WWW RESULTS PAGE REPOSITORY RANKING CRAWLER(S) COLLECTION ANALYSIS MOD. INDEXER MODULE CRAWL CONTROL INDEXES UTILITY STRUCTURE TEXT USAGE FEEDBACK (CF. [1] FIG. 1)