Web Search Engines and Web Data Mining
Basic techniques, architectures, tools, and current trends
CATT Short Course - July 12, 2002
Instructor: Prof. Torsten Suel
Department of Computer and Information Science
Polytechnic University
[email protected]
http://cis.poly.edu/suel/

This material was developed for a one-day short course on search engines and search technology taught at Polytechnic University in Brooklyn on June 11, 2004. The course was organized by the Center for Advanced Technology in Telecommunication (CATT) at Polytechnic University; see http://catt.poly.edu for more information. An electronic version of these slides (except for handwritten additions) is available at http://cis.poly.edu/suel/webshort/, which also contains a detailed list of additional pointers and bibliographic references.
© 2004 Torsten Suel ([email protected])

Goal of this course:
• learn how search engines work
• learn about other types of web search tools and applications
• learn how to build and deploy such tools
  - basic information retrieval techniques
  - what software tools to use
  - system architectures and performance

Target Audience:
• technically oriented people interested in how it works
• developers who may need to build or deploy tools

Overview:
• Part I - Introduction:
  - Motivation (why web search tools?)
  - Examples of search tools
  - How the web works
  - Basic search engine structure
  - Introduction to Information Retrieval
  (coffee break: 11:00-11:15)
• Part II - Search Engine Basics
  - Search engine architecture
  - Web crawling basics: following links, robot exclusion, ...
  - Storage
  - Text indexing
  - Querying and term-based ranking
  - Basic link-based ranking
  (lunch: 12:30-1:30)
• Part III - Applications, Systems, and Tools
  - Types of search tools
  - Available software tools
  - Search and DBMSs
  - Application scenarios:
    * major search engine
    * focused data collection and analysis
    * browsing/search assistants
    * site and enterprise search
    * geographic web search
  - Example: CiteSeer system
  - Example: Internet Archive
  - Using search engines
  - Search engine optimization and manipulation
  (break: 2:45-3:00)
• Part IV - Advanced Techniques
  - High-performance crawling
  - Recrawling and focused crawling
  - Link-based ranking (Pagerank, HITS)
  - Structural analysis of the web graph
  - Optimizing query execution
  - Parallel search engines and scaling
  - Meta search engines
  - Document clustering and duplicate detection

Not Covered:
• semi-structured data and XML
• web-accessible databases
  - crawling the hidden web
  - efficient query processing on remote data sources
  - wrapper construction
• extracting relational data from the web
• shopping bots
• image and multimedia search
• peer-to-peer search technologies
• advanced IR: categorization, clustering, ...
• natural language processing (NLP)

1 - Introduction and Motivation: What is the Web?

What is the web? (another view)
• pages containing (fairly unstructured) text
• images, audio, etc. embedded in pages
• structure defined using HTML (Hypertext Markup Language)
• hyperlinks between pages!
• over 10 billion pages
• over 150 billion hyperlinks
→ a giant graph!

How is the web organized?
[figure: several web servers (hosts), e.g. www.cnn.com, www.poly.edu, www.irs.gov, each holding a set of linked pages]
• pages reside on servers
• related pages form sites
• local versus global links
• logical vs. physical structure

How do we find pages on the web?
• more than 10 billion pages
• more than 150 billion hyperlinks
• plus images, movies, ..., and database content
→ we need specialized tools for finding pages and information
2 - Overview of web search tools

• Major search engines (Google, FAST, AltaVista, Teoma, WiseNut)
• Web directories (Yahoo, Open Directory Project)
• Specialized search engines (CiteSeer, Achoo, FindLaw)
• Local search engines (for one site or domain)
• Meta search engines (Dogpile, Mamma, Search.com, Vivisimo)
• Personal search assistants (Alexa, Google Toolbar)
• Comparison shopping (mySimon, PriceWatch, DealTime)
• Image search (Ditto, Visoo (gone), major engines)
• Natural language questions (AskJeeves?)
• Deep Web/database search (CompletePlanet/BrightPlanet)

Major search engines:

Basic structure of a search engine:
[figure: a crawler fetches pages onto disks; indexing builds an index; a query such as "computer" is answered by looking up the index]
Ranking:
• return best pages first
• term- vs. link-based approaches

Challenges for search engines:
• coverage (need to cover a large part of the web)
  → need to crawl and store massive data sets
• good ranking (in the case of broad queries)
  → smart information retrieval techniques
• freshness (need to update content)
  → frequent recrawling of content
• user load (up to 10000 queries/sec - Google)
  → many queries on massive data
• manipulation (sites want to be listed first)
  → naïve techniques will be exploited quickly

Web directories: (Yahoo, Open Directory Project)
[topic hierarchy figure: everything → sports (baseball, hockey, soccer), politics (foreign, domestic), business, health, ...]
Challenges:
• designing the topic hierarchy
• automatic classification: "what is this page about?"
• Yahoo and Open Directory mostly human-based

Specialized search engines: (Achoo, FindLaw)
• be the best on one particular topic
• use domain-specific knowledge
• limited resources → do not crawl the entire web!
• focused crawling techniques or meta search

Meta search engines: (Dogpile, Vivisimo, Search.com, Mamma)
• use other search engines to answer queries
• ask the right specialized search engine, or
• combine results from several large engines
• may need to be "familiar" with thousands of engines

Personal search assistants: (Alexa, Google Toolbar)
• embedded into the browser
• can suggest "related pages"
• search by "highlighting text" → can use context
• may exploit individual browsing behavior
• may collect and aggregate browsing information → privacy issues
• architectures:
  - on top of a crawler-based search engine (Alexa, Google), or
  - based on meta search (MIT PowerScout), or
  - based on limited crawls by client or proxy (MIT Letizia, Stanford PowerBrowser)

Perspective:
[figure: web search draws on systems, machine learning, information retrieval, AI, natural language processing, library & information science, algorithms, and databases]

Example #1: Link-based ranking techniques
• Pagerank (Brin & Page / Google)
  "significance of a page depends on the significance of those referencing it"
• HITS (Kleinberg / IBM)
  "Hubs and Authorities"
(a sketch of the Pagerank iteration appears after Example #3 below)

Example #2: Crawling 100 million pages
• crawler architecture
• networking requirements
• data structures: size and robustness
• crawling etiquette
• concerns for webmasters

Example #3: Analysis of the web graph
• What does the web look like? (diameter, connectivity, in-degree)
• Why are there so many bipartite cliques? (IBM) (and why do we care?) (2,3-clique)
• How do you compute with a 500 million node graph?
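To make the Pagerank idea from Example #1 concrete, here is a minimal sketch of the power-iteration computation on a tiny hand-made link graph. The toy graph, the damping factor of 0.85, the iteration count, and the function name are illustrative assumptions, not material from the course slides.

```python
# Minimal Pagerank sketch: repeatedly distribute each page's score along its out-links.
# The tiny graph, damping factor, and iteration count are illustrative choices.

def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in graph.items():
            if outlinks:                             # distribute p's rank to its link targets
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    new_rank[q] += share
            else:                                    # dangling page: spread its rank evenly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items(), key=lambda x: -x[1]):
        print(page, round(score, 3))
```

A real engine runs this iteration over hundreds of millions of nodes with I/O-efficient, distributed matrix-vector products rather than in-memory dictionaries; that is exactly the scaling question raised in Example #3.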
Example #4: Finding near-duplicates on the web
• given 1 billion pages, find duplicates (15 TB)
• more difficult: find very similar pages (clustering)
• find mirror sites and replicated collections
• process the collection to account for duplicates

3 - How the web works (more details)

Fetching "www.cnn.com/world/index.html":
[figure: the desktop (with browser) asks the web server www.cnn.com "give me the file /world/index.html"; the server answers "here is the file: ..."]

Three main ingredients:
• Naming: URL (uniform resource locator)
  (used to identify and locate objects)
• Communication: HTTP (hypertext transfer protocol)
  (used to request and transfer objects)
• Rendering: HTML (hypertext markup language)
  (used to define how an object should be presented to the user)

Client-server paradigm:
• the client (browser) uses HTTP to ask the server (web server) for an object identified by a URI, and renders this object according to rules defined by HTML

Domain Name Service:
[figure: the desktop (or crawler) asks its local DNS server "where is www.cnn.com located?"; the local DNS server contacts the root DNS server ("where is www.poly.edu located? answer: 123.238.24.10") and the DNS server for cnn.com, then returns the answer]

Names, addresses, hosts, and sites:
• one machine can have several host names and IP addresses
• one host name may correspond to several machines
• one host can have several "sites" (what is a site?)
• one "site" can reside on several hosts
• issues: detecting duplicates, crawling, local vs. global links

HTTP:
Request (desktop or crawler → web server):
  GET /world/index.html HTTP/1.0
  User-Agent: Mozilla/3.0 (Windows 95/NT)
  Host: www.cnn.com
  From: ...
  Referer: ...
  If-Modified-Since: ...
Response (web server → desktop or crawler):
  HTTP/1.0 200 OK
  Server: Netscape-Communications/1.1
  Date: Tuesday, 8-Feb-99 01:22:04 GMT
  Last-modified: Thursday, 3-Feb-99 10:44:11 GMT
  Content-length: 5462
  Content-type: text/html
  <the html file>

HTML:

HTTP & HTML issues:
• "dynamic" URLs:
  http://www.google.com/search?q=brooklyn
  http://www.amazon.com/exec/obidos/ASIN/1558605703/qid%3D9…
  http://cis.poly.edu/search/search.cgi
• the result file can be computed by the server in an arbitrary manner!
• persistent connections in HTTP/1.1
• MIME types and extensions
• frames
• redirects
• javascript/java/JEB/flash/activeX ????????

4 - Basic Search Engine Structure:
[figure: a crawler fetches pages onto disks; indexing builds the index; a query such as "computer" is answered by looking up the index]

Crawling:
• fetches pages from the web
• starts at a set of "seed pages"
• parses fetched pages for hyperlinks
• then follows those links (e.g., BFS)
• variations:
  - recrawling
  - focused crawling
  - random walks

Indexing:
• parse & build lexicon & build index ("inverted index")
[figure: the inverted index maps each term in the lexicon (aardvark, ..., arm, armada, armadillo, armani, ..., zebra) to a sorted list of document IDs, e.g. armadillo → 678, 2134, 3970, ...]
• the index is very large → I/O-efficient techniques needed

Querying:
• Boolean queries: (zebra AND armadillo) OR armani
  → compute unions/intersections of the terms' inverted lists
• Ranked queries: zebra, armadillo
  → give scores to all documents in the union of the lists
(a small sketch of indexing and Boolean query evaluation follows below)
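As a concrete illustration of the indexing and querying slides above, here is a minimal sketch that builds an in-memory inverted index over a few toy documents and intersects two sorted postings lists for an AND query. The toy documents and function names are illustrative assumptions; a real engine builds its index with the I/O-efficient external-memory techniques noted above.

```python
# Minimal inverted-index sketch: term -> sorted list of document IDs,
# plus merge-style intersection of two postings lists for an AND query.

def build_inverted_index(docs):
    """docs: dict mapping doc ID -> text."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()                      # keep postings sorted by document ID
    return index

def intersect(list_a, list_b):
    """Intersection of two sorted postings lists (evaluates term_a AND term_b)."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result

if __name__ == "__main__":
    docs = {1: "the zebra saw an armadillo",
            2: "armani suits",
            3: "zebra and armadillo crossing"}
    index = build_inverted_index(docs)
    print(index["zebra"], index["armadillo"])
    print(intersect(index["zebra"], index["armadillo"]))   # zebra AND armadillo -> [1, 3]
```

The merge-style intersection only works because postings are kept sorted by document ID; that is what keeps multi-term AND queries cheap even when the lists are long.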
5 - Introduction to Information Retrieval
(Information Retrieval from Memex to Google)
• (a) Introduction: What is IR?
• (b) Historical development of IR (plus Information Science and Library Science)
• (c) Fundamentals: indexing, search, data acquisition
• (d) Mathematical techniques: vector space model, ranking and classification

(a) Introduction:
Information Retrieval (IR): "IR is concerned with the representation, storage, organization of, and access to information items"
• focus on automatic processing (indexing, clustering, search) of unstructured data (text, images, audio, ...)
• subfield of Computer Science, but with roots in Library Science, Information Science, and Linguistics
• in this course, we focus on text data
• applications:
  - searching in a library catalog
  - categorizing a collection of medical articles by area
  - web search engines

(a) Introduction: Historical Perspective:
• WW2-era computers were built for number crunching: ballistics computations, code breaking
• since the earliest days, also used to "organize information"
  - Memex (Vannevar Bush, 1945)
• today, this is the main application!
  - store and organize data and documents
  - model organizations and work processes
  - computer → organizer
• ... however, no clean separation

(a) Introduction: Structured vs. Unstructured Data:
• IR: lesser-known cousin of the field of Databases
• Databases: focus on structured data
• IR: unstructured data: "documents"
  - scientific articles, novels, poems, jokes, web pages
• information retrieval vs. data retrieval
• IR focused on the human user (?)
• challenges: semistructured data, closing the gap

(a) Introduction: Recall and Precision:
• Recall: the fraction of the relevant documents (R) that is successfully retrieved:
  Recall = |Ra| / |R|
• Precision: the fraction of the retrieved documents (A) that are relevant:
  Precision = |Ra| / |A|
[Venn diagram: the collection contains the set R of relevant documents and the answer set A; their intersection Ra is the set of relevant documents in the answer set]
• fundamental trade-off
• policy ramifications
(a small numeric sketch follows at the end of this subsection)

(a) Introduction: Text IR Operations:
• Indexing: create a full-text or keyword index
• Querying/ranking: find documents (most) relevant to a query
• Clustering: group documents into sets of similar ones
• Categorization: assign documents to a given set of categories
• Citation analysis: find frequently cited or influential papers
• (Extraction/tagging: identify names of people, organizations)
• (Summarization: automatically create a summary)
• (Translation: translate text between languages)
Note: not the same as NLP, string processing

(a) Introduction: Text IR Applications:
• Digital libraries and library catalog systems
  - full-text indexing and search of electronic documents
  - keyword search for other documents
  - categorization of medical or news articles
  - browsing by categories and relationships
• Web search engines
  - very large amounts of data, data acquisition problem
  - many short documents, very informal, manipulation
• National security and competitive intelligence
  - web surveillance (extremist sites, competitors)
  - filtering and querying intercepted communications
  - searching internal documents (files from branch offices)
  - analyzing phone call graphs, server and search engine logs

(a) Introduction: Online Analytical Processing (OLAP):
• scenario: an analyst posing questions (queries) to a system
• preferably based on a powerful query language (e.g., SQL)
• goal: to discover properties of the data
• for text/IR: need a query language (VQL?)
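To pin down the recall and precision definitions from earlier in this subsection, here is a tiny numeric sketch over made-up sets of document IDs; all the numbers are illustrative assumptions.

```python
# Tiny recall/precision sketch over made-up document-ID sets.

relevant = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}    # R: relevant documents in the collection
answer = {3, 4, 5, 11, 12}                     # A: documents returned by the system
relevant_retrieved = relevant & answer          # Ra: relevant documents in the answer set

recall = len(relevant_retrieved) / len(relevant)      # |Ra| / |R| = 3/10
precision = len(relevant_retrieved) / len(answer)     # |Ra| / |A| = 3/5

print(f"recall={recall:.2f}, precision={precision:.2f}")   # recall=0.30, precision=0.60
```

Returning more documents typically raises recall but lowers precision, which is the fundamental trade-off mentioned on the slide.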
Data Mining:
• how to automatically find interesting patterns and rules in data
• or how to find exceptions to rules (outliers)
• no clear boundary between data mining and OLAP
• relevant to structured and unstructured data
• example: association rules - products, words, links, books

(a) Introduction: IR versus NLP:
(Natural Language Processing or Computational Linguistics)
• IR does not analyze grammar or local structure (document as a set or bag of words)
• NLP analyzes sentence structure and grammar (shallow/deep parsing)
• IR: simple statistical methods, good for search & categorization
• NLP: good for automatic translation, summarization, extraction
• IR is largely language-independent
• NLP uses knowledge about the language (WordNet, thesaurus)
• NLP: rooted in linguistics; grammar-based vs. statistical NLP
• web search: so far, NLP has not proven that critical yet

(a) Introduction: IR versus databases:
• IR is based on statistical techniques
• IR is "fuzzy", with many different techniques
• databases: very precisely defined semantics
• databases: "pounding reality into table form"

Machine Learning: (sometimes considered part of AI)
• symbolic/logic vs. statistical approaches
• set of general statistical techniques for learning from past data (how to profile correctly)
• widely used in IR and increasingly in NLP

(b) Historical Development: Before 1945:
• Babylonians, Greeks, Romans, etc.
• indexing and creation of concordances
  - "algorithms for full-text indexing"!
  - e.g., Dobson's 1940-65 Byron concordance: the "last handmade one"
• Library of Congress and Dewey classification
• Documentalism
• bibliometric and informetric distributions:
  - Bradford, Lotka, Zipf, Pareto, Yule (1920s-40s)
• citation analysis and social network analysis
• microfilm rapid selectors:
  - E. Goldberg 1931
  - V. Bush 1939

(b) Historical Development: Zipf and Bibliometric/Informetric Distributions:
• distributions observed in many fields of science
• some things occur much more frequently than others:
  "some words are used much more often than others"
  "some authors are much more often cited than others"
  "some people have much more money than others"
  "some animal species are more common than others"
• follows a particular class of distributions: f(i) ~ i^(-z)
• large z means large skew
• heavy-tailed: "some have a lot more, but most wealth is held by the many"

(b) Historical Development: Citation and Social Network Analysis:
• "who has written the most influential papers?"
• "who is the most influential person in a social network?"
• maybe the person who knows the largest number of people?
• or someone who knows a few influential people very well?
→ graph-theoretic approaches to analyzing social networks and citation graphs
• national security applications:
  - funding, communication, coordination
  - telephone call graphs

(b) Historical Development: Memex:
• Vannevar Bush (1890-1974)
• "As We May Think", Atlantic Monthly, 1945 (mostly written in 1939)
• device for storing and retrieving information
• storage based on microfilm
• users can store all their experiences and knowledge and retrieve the information later
• trails (hyperlinks): users can connect different pieces of information
• surprising earlier history ...

(b) Historical Development: Rapid Microfilm Selectors:
• microfilm for storage of large amounts of data
• storage density in MB per square inch (1925)
• but how can we search such massive data?
• idea:
  - add index terms to microfilm boundaries
  - build rapid selection machines that can scan the index
• Rapid Microfilm Selectors:
  - use light and photo cells to find matches in index terms
  - hundreds of index cards per second
Source: M. Buckland, UC Berkeley; see http://www.sims.berkeley.edu/~buckland/goldbush.html

(b) Historical Development: Emanuel Goldberg:
• Moscow 1881 - Tel Aviv 1970
• director at Zeiss Ikon (Dresden)
• multiple inventions in photo technologies
• inventor of the rapid microfilm selector

Documentalism:
• predecessor of library and information sciences
• goal: "organize and share knowledge to make it widely available"
• The Bridge: organization to design a "world brain"
• Otlet, Ostfeld, Schuermeyer, ...
• also related: H.G. Wells
Source: M. Buckland, UC Berkeley; see http://www.sims.berkeley.edu/~buckland/goldbush.html and links in page

"We should have a complex of associated machines which would achieve the following operations simultaneously or sequentially: 1. Conversion of sound into text; 2. Copying that text as many times as is useful; 3. Setting up documents in such a way that each datum has its own identity and its relationships with all the others in the group and to which it can be re-united as needed; 4. Assignment of a classification code to each datum; [division of the document into parts, one for each datum, and] rearrangement of the parts of the document to correspond with the classification codes; 5. Automatic classification and storage of these documents; 6. Automatic retrieval of these documents for consultation and for delivery either for inspection or to a machine for making additional notes; 7. Mechanized manipulation at will of all the recorded data in order to derive new combinations of facts, new relationships between ideas, new operations using symbols." (Otlet, 1934)

"But what a revolution for information retrieval and especially for libraries television can bring! Perhaps one day we shall see our reading rooms deserted and in their place a room without people in which books requested by telephone are displayed, which the users read in their homes using television equipment." (Schuermeyer, 1936)

(b) Historical Development (after Bush):
• early work by H.-P. Luhn (KWIC index, SDI, abstraction)
• hypertext (Nelson, Engelbart, 1960s)
  - links between different documents and sections
  - Xanadu
  - hypertext community (HT conferences)
• vector space model and ranking methods
  - Salton et al. (Cornell), 1960s
  - cosine measure, SMART system
• automatic text classification
  - manually generated classifiers (since the 1960s)
  - machine learning approach (1980s - now)
• WWW
  - "the Web", Berners-Lee, ~1992
  - earlier: gopher, archie, WAIS
  - 1994: Mosaic browser, breakthrough in size
  - 1994/1995: first crawler-based search engines

(c) Fundamentals of IR Systems: Text Index:
• a data structure for supporting IR queries
• most popular form: inverted index structure
• like the index in a book
[figure: documents stored on disks are indexed into an inverted index mapping each term (aalborg, ..., arm, armada, armadillo, armani, ..., zz) to a sorted list of document IDs, e.g. armadillo → 678, 2134, 3970, ...]

(c) Fundamentals of IR Systems: Boolean querying: (see the Managing Gigabytes book)
• Boolean queries: (zebra AND armadillo) OR armani
  → unions/intersections of the terms' inverted lists
[figure: the query terms' lists are looked up in the same inverted index and combined]
(c) Fundamentals of IR Systems: Ranking:
• "return best documents first"

(c) Fundamentals of IR Systems: Text classification:
[topic hierarchy figure: everything → sports (baseball, hockey, soccer), politics (foreign, domestic), business, health, ...]
Challenges:
• designing the topic hierarchy
• automatic classification: "what is this page about?"
• Yahoo and Open Directory mostly human-based

(c) Fundamentals of IR Systems: Basic structure of a very simple IR system:
[figure: a stored document collection on disks is indexed; a user interface sends queries such as "computer" to the index, which is looked up to produce results]

(c) Fundamentals of IR Systems: Simple IR system:
• constructs and maintains an inverted index on the documents
• supports Boolean and ranked queries
• may use automatic or manual classification
• may support other clustering techniques
• may support advanced browsing operations
• "searching vs. browsing"
• often well-structured collections (news, medical articles)
• queries with multiple keywords (up to hundreds)

(c) Fundamentals of IR Systems: Basic structure of a search engine:
[figure: a crawler fetches web pages onto disks; indexing builds the index; a query such as "brooklyn" is answered by looking up the index]

(c) Fundamentals of IR Systems: Differences to traditional IR systems:
• data acquisition is important (crawling the web)
• collections are much larger (3 billion pages = 50 TB)
• documents are of very mixed quality and types
• queries are VERY short (less than 3 words on average)
• traditional statistical techniques do not work as well
• but there are additional sources of information:
  - hyperlink structure
  - usage data / logs
• search engine manipulation!!!

(d) Mathematical Techniques: Vector-Space Model:
• each document D is represented as a set of words
• a query Q is also just a set of words
• let L be the set of all words in the collection, |L| = m
• D and Q correspond to m-dimensional vectors
  - if a word does not occur in D (resp. Q), the corresponding element is set to 0
  - otherwise, the element is positive
• the score of D with respect to query Q is the dot product D * Q
• return the documents with the k highest scores

(d) Mathematical Techniques: Vector-Space Model Example:
• put a 1 into the vector for each word that occurs
• L = {a, alice, bob, book, like, reads}, m = 6
  doc1: "Bob reads a book"    D1 = (1, 0, 1, 1, 0, 1)
  doc2: "Alice likes Bob"     D2 = (0, 1, 1, 0, 1, 0)
  doc3: "book"                D3 = (0, 0, 0, 1, 0, 0)
  query: "bob, book"          Q  = (0, 0, 1, 1, 0, 0)
  D1 * Q = 2, D2 * Q = 1, D3 * Q = 1
• very primitive ranking function: "how many words in common?"
• smarter functions: assign appropriate weights to the document vectors
• vector-matrix multiplication to score all documents

(d) Mathematical Techniques: Vector-Space Model for Ranking:
• higher score for more occurrences of a word
• higher score for rare words
• lower score for long documents
• example: the "cosine measure" (and many others)
• f_d,t = number of occurrences of term t in document d
• f_t = total number of occurrences of t in the collection

(d) Mathematical Techniques: Vector-Space Model (cont.):
• vast amount of vector-space work in IR (see Witten/Moffat/Bell and Baeza-Yates/Ribeiro-Neto for an introduction & pointers)
• many different ranking functions
• additional factors in ranking (mainly for the web):
  - higher weight if a word appears in the title, in a large font, in bold face
  - search engines: higher score if the word appears in the URL or in anchortext
  - distance between terms in the text (near, or far away?)
  - user feedback or browsing behavior?
  - hyperlink structure
• execution: "compute the score of all documents containing at least one query term, by scanning the inverted lists"
(a small ranking sketch follows below)
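Here is a minimal sketch of term-based scoring by scanning inverted lists, as in the execution strategy quoted above. It uses a simple TF × IDF-style weight with length normalization as one plausible instance of the "smarter weighting" the slides mention; the exact cosine formula from the original slides is not reproduced, and the weighting, toy documents, and function names are illustrative assumptions.

```python
# Minimal term-based ranking sketch: score documents by scanning the query terms'
# inverted lists and accumulating TF*IDF-style weights (one common weighting choice;
# not necessarily the exact formula used on the original slides).
import math
from collections import defaultdict

docs = {1: "bob reads a book",
        2: "alice likes bob",
        3: "book"}

# Build an inverted index with within-document term frequencies f_{d,t}.
index = defaultdict(dict)        # term -> {doc_id: f_{d,t}}
doc_len = {}                     # document lengths, used for length normalization
for doc_id, text in docs.items():
    words = text.split()
    doc_len[doc_id] = len(words)
    for w in words:
        index[w][doc_id] = index[w].get(doc_id, 0) + 1

def rank(query, k=3):
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))            # rare terms weigh more
        for doc_id, f_dt in postings.items():                 # scan the inverted list
            scores[doc_id] += (f_dt * idf) / doc_len[doc_id]  # longer docs score lower
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

print(rank("bob book"))
# prints doc 3 first, then 1, then 2: the one-word page "book" benefits most
# from length normalization, unlike the unweighted dot product in the slide example
```

In a real engine the lists hold (document ID, f_{d,t}) pairs on disk, and the accumulator array over candidate documents is the main memory cost of query processing.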
(d) Mathematical Techniques: Text classification:
[topic hierarchy figure: everything → sports (baseball, hockey, soccer), politics (foreign, domestic), business, health, ...]
• given a topic hierarchy, assign pages automatically to topics
• learning-based approaches
• document generation models
• naïve Bayes, Support Vector Machines (SVM), many more ...

(d) Mathematical Techniques: Naïve Bayes:
• document generation model: e.g., baseball vs. football
  - a person decides to create a web page
  - with probability 0.2 she creates a page about football, with probability 0.8 a page about baseball
  - she then fills the page with words according to some probability distribution that depends on the chosen topic
  - say, if the topic is baseball, the word "pitcher" is more likely to occur
  - now, under this model, given a page, what is the likelihood that this page was intended to be about baseball?
• compare to testing for a rare disease:
  - only one out of a million people has the disease
  - but one out of 10000 tests results in a false positive
  - given a person that tests positive, what is the likelihood of them being sick?
  - in this example, a person that tests positive is most likely healthy
    (roughly 100 false positives for every true positive)

(d) Mathematical Techniques: Naïve Bayes: (cont.)
• Bayes' rule:
  Pr[a & b] = Pr[a | b] * Pr[b] = Pr[b | a] * Pr[a]
  → Pr[a | b] = Pr[b | a] * Pr[a] / Pr[b]
• let b be the event that we get a particular page p
• let a be the event that the page is about baseball
• so, given page p, we would like to know Pr[a | b]
• to do this, we need to know Pr[b | a], Pr[a], and Pr[b]
• suppose we are given 1000 labeled pages, that is, representative pages that have been marked as football or baseball (by a human expert)
• estimate Pr[a] by counting the number of football and baseball pages
• estimate Pr[b | a] and Pr[b] by estimating topic-dependent word distributions: if 1.2% of all words in baseball pages are "pitcher", then assume the word "pitcher" is generated with probability 0.012 in each step
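To tie the pieces together, here is a minimal naïve Bayes sketch in the spirit of the slides: it estimates Pr[topic] and per-topic word probabilities from a handful of labeled "pages" and then classifies a new page with Bayes' rule. The toy training data, the add-one smoothing, and the use of log probabilities are illustrative assumptions, not part of the course material.

```python
# Minimal naive Bayes text classifier sketch: estimate Pr[topic] and Pr[word | topic]
# from labeled examples, then pick the topic maximizing Pr[topic] * prod Pr[word | topic].
# Add-one smoothing and log probabilities are standard tricks assumed here for robustness.
import math
from collections import Counter, defaultdict

training = [
    ("baseball", "the pitcher threw a fastball to the batter"),
    ("baseball", "home run in the ninth inning"),
    ("football", "the quarterback threw a touchdown pass"),
    ("football", "field goal in the fourth quarter"),
]

topic_counts = Counter()               # pages per topic -> estimates Pr[topic]
word_counts = defaultdict(Counter)     # per-topic word frequencies -> Pr[word | topic]
vocab = set()
for topic, text in training:
    topic_counts[topic] += 1
    for w in text.split():
        word_counts[topic][w] += 1
        vocab.add(w)

def classify(text):
    best_topic, best_logp = None, float("-inf")
    for topic, n_pages in topic_counts.items():
        logp = math.log(n_pages / sum(topic_counts.values()))      # log Pr[topic]
        total = sum(word_counts[topic].values())
        for w in text.split():
            # add-one smoothing so unseen words do not zero out the product
            p_w = (word_counts[topic][w] + 1) / (total + len(vocab))
            logp += math.log(p_w)                                   # log Pr[word | topic]
        if logp > best_logp:
            best_topic, best_logp = topic, logp
    return best_topic

print(classify("the pitcher threw a wild pitch"))   # expected output: baseball
```

With only two topics and a handful of labeled pages this is toy-sized, but the same counting-and-multiplying structure carries over directly to the 1000 labeled pages and large vocabularies described on the slide.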