Searching the Web

The Web – why is it important?
– "Free", ubiquitous information resource
– Broad coverage of topics and perspectives
– Becoming the dominant information collection
– Growth and jobs

Web access methods
– Search (e.g. Google)
– Directories (e.g. Yahoo!)
– Other …

Web Characteristics
Distributed data
– 80 million web sites (hostnames responding) in April 2006
– 40 million active web sites (don't redirect, …)
High volatility
– Servers come and go …
Large volume
– One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)
– "Dark Web" – content that is not indexed and not crawlable; estimated at 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)

Web Characteristics (continued)
Unstructured data
– Lots of duplicated content (30% estimate)
– Semantic duplication much higher
Quality of data
– No required editorial process
– Many typos and misspellings (impacts IR)
Heterogeneous data
– Different media
– Different languages
These characteristics are not going to change

Web Content Types
[Chart of Web content types; source: How Much Information? 2003]

Search Engine Architecture
"Lots and lots of computers"
[Architecture diagram relating Users, Interface, Query Engine, Index, Indexer, Crawler, and the Web; a second version labels the components with the corresponding textbook chapters (2, 4, & 5; 6 & 7; 8; 10) and Evaluation (Chapter 3)]

Hubs and Authorities
Hubs
– Have lots of links to other pages
Authorities
– Have lots of links that point to them
Can use feedback to rank hubs and authorities
– Better hubs have links to good authorities
– Better authorities have links from good hubs

Crawling the Web
Creating a Web crawler (Web spider) – the simplest technique (sketched in code after these crawling slides):
– Start with a set of URLs
– Extract the URLs pointed to by the original set
– Continue using either breadth-first or depth-first traversal
Works for one crawler, but hard to coordinate for many crawlers
– Partition the Web by domain name, IP address, or some other technique
– Each crawler has its own set but shares a to-do list

Crawling the Web (continued)
Need to recrawl
– Indexed content is always out of date
– Sites come and go, and sometimes disappear for periods of time only to reappear
Order of URLs traversed makes a difference
– Breadth-first matches the hierarchic organization of content
– Depth-first gets to deeper content faster
– Proceeding to "better" pages first can also help (e.g. good hubs and good authorities)
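A minimal sketch of the simple breadth-first technique described above. The helper functions fetch_page(url) and extract_links(html) are hypothetical placeholders (the slides do not specify how pages are retrieved or parsed); the to-do list and visited set correspond to the bookkeeping discussed in the crawling slides.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch_page, extract_links, max_pages=100):
    """Breadth-first crawl starting from a set of seed URLs.

    fetch_page(url) -> HTML string or None, and extract_links(html) -> list
    of href values (possibly relative), are assumed helpers supplied by the
    caller; they are not defined in the original slides.
    """
    to_do = deque(seed_urls)      # to-do list of URLs to visit (FIFO => breadth-first)
    visited = set(seed_urls)      # URLs already seen, so each is queued only once
    pages = {}
    while to_do and len(pages) < max_pages:
        url = to_do.popleft()
        html = fetch_page(url)
        if html is None:          # broken link or timeout: skip it for now
            continue
        pages[url] = html
        for href in extract_links(html):
            absolute = urljoin(url, href)   # resolve relative vs. absolute links
            if absolute not in visited:
                visited.add(absolute)
                to_do.append(absolute)
    return pages
```

Popping from the right end of the deque instead (to_do.pop()) would turn the same loop into a depth-first crawl, which, as noted above, reaches deeper content faster.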
Server and Author Control of Crawling
Avoid crawling sites that do not want to be crawled
– Legal issue
– Robot exclusion protocol (server-level control)
  • A file that indicates which portions of a web site should not be visited by crawlers
  • http://.../robots.txt
  • (a robots.txt check is sketched in code after these crawling slides)
– Robots META tag (author-level control)
  • Used to indicate whether a file (page) should be indexed or analyzed for links
  • Few crawlers implement this
  • <meta name="robots" content="noindex, nofollow">
  • http://www.robotstxt.org/wc/exclusion.html

Example robots.txt Files
Google
  User-agent: *
  Allow: /searchhistory/
  Disallow: /search
  Disallow: /groups
  Disallow: /images
  Disallow: /catalogs
  Disallow: /catalogues
  …
CSDL
  User-agent: *
  Disallow: /FLORA/arch_priv/
  Disallow: /FLORA/private/
TAMU Library
  User-agent: *
  Disallow: /portal/site/chinaarchive/template.REGISTER/
  Disallow: /portal/site/Library/template.REGISTER/
  …
New York Times
  User-agent: *
  Disallow: /pages/college/
  …
  Allow: /pages/
  Allow: /2003/
  …
  User-agent: Mediapartners-Google*
  Disallow:

Crawling Goals
The crawling technique may depend on the goal. Types of crawling goals:
– Create a large, broad index
– Create a focused topic- or domain-specific index
  • Target topic-relevant sites
  • Index preset terms
– Create a subset of content to model characteristics of (part of) the Web
  • Need to survey appropriately
  • Cannot use simple depth-first or breadth-first traversal
– Create an up-to-date index
  • Use estimated change frequencies

Crawling Challenges
Identifying and keeping track of links
– Which to visit
– Which have been visited
Issues
– Relative vs. absolute link descriptions
– Alternate server names
– Dynamically generated pages
– Server-side scripting
– Links buried in scripts

Crawling Architecture
Crawler components
– Worker threads – attempt to retrieve the data for a URL
– DNS resolver – resolves domain names into IP addresses
– Protocol modules – download content using the appropriate protocol
– Link extractor – finds and normalizes URLs
– URL filter – determines which URLs to add to the to-do list
– URL to-do agent – keeps the list of URLs to visit

Crawling Issues
Avoid overloading servers
– A brute-force approach can become a denial-of-service attack
– Weak politeness guarantee: only one thread allowed to contact a given server
– Stronger politeness guarantee: maintain queues for each server that feed URLs into the to-do list based on priority and load factors
Broken links, timeouts
– How many times to retry?
– How long to wait?
– How to recognize crawler traps (server-side programs that generate "infinite" links)?
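As a companion to the robot exclusion protocol slide, here is a minimal sketch of a robots.txt check using Python's standard urllib.robotparser module; the crawler name "ExampleBot" and the URLs are illustrative placeholders, not taken from the slides.

```python
import urllib.robotparser

# Download and parse the site's robots.txt (server-level control).
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch particular URLs before visiting them.
for url in ["http://www.example.com/", "http://www.example.com/search"]:
    if rp.can_fetch("ExampleBot", url):
        print("allowed:   ", url)
    else:
        print("disallowed:", url)
```

A polite crawler would combine this check with the per-server queues described under Crawling Issues, so that disallowed or overloaded servers are simply skipped. Note that this covers only the server-level robots.txt; honoring the author-level robots META tag requires inspecting each fetched page.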
Web Tasks
Precision is the key
– Goal: the first 10–100 results should satisfy the user
– Requires ranking that matches the user's need
– Recall is not important
  • Completeness of the index is not important
  • Comprehensive crawling is not important

Browsing
Web directories
– Human-organized taxonomies of Web sites
– Cover a small portion (less than 1%) of Web pages
  • Remember that recall (completeness) is not important
  • Directories point to logical web sites rather than individual pages
– Directory search returns both categories and sites
– People generally browse rather than search once they identify categories of interest

Metasearch
Search a number of search engines
Advantages
– Do not build their own crawler and index
– Cover more of the Web than any of their component search engines
Difficulties
– Need to translate the query into each engine's query language
– Need to merge results into a meaningful ranking

Metasearch II
Merging results
– Voting scheme based on the component search engines (a small voting-based merge is sketched at the end of this section)
  • No model of the component ranking schemes needed
– Model-based merging
  • Needs an understanding of the relative rankings, potentially by query type
Why metasearch engines are not widely used for the Web
– Bias towards coverage (i.e. recall), which is not important for most Web queries
– Merging results is largely ad hoc, so individual search engines tend to do better
Big application: the Dark Web

Using Structure in Search
Languages to search content and structure
– Query languages over labeled graphs
  • PHIQL: used in the Microplis and PHIDIAS hypertext systems
  • Web-oriented: W3QL, WebSQL, WebLog, WQL

Using Structure in Search (continued)
Other uses of structure in search
– Relevant pages have neighbors that also tend to be relevant
– Search approaches that collect (and filter) the neighbors of returned pages

Web Query Characteristics
Few terms and operators
– Average of 2.35 terms per query
  • 25% of queries have a single term
– Average of 0.41 operators per query
Queries get repeated
– Average of 3.97 instances of each query
– This is very uneven (e.g. "Britney Spears" vs. "Frank Shipman")
Query sessions are short
– Average of 2.02 queries per session
– Average of 1.39 pages of results examined
Data from a 1998 study – how different would these numbers be today?
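To make the voting scheme for merging metasearch results concrete, here is a minimal Borda-count sketch; the scoring depth and the example result lists are illustrative assumptions, not data from the slides.

```python
from collections import defaultdict

def borda_merge(rankings, depth=10):
    """Merge ranked result lists from several engines by simple Borda voting.

    rankings is a list of ranked URL lists, one per component engine.
    A URL at position i in a list contributes (depth - i) points; URLs
    missing from an engine's list get no points from that engine.
    """
    scores = defaultdict(int)
    for ranked in rankings:
        for i, url in enumerate(ranked[:depth]):
            scores[url] += depth - i
    # Highest total score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical top results from three component engines for one query.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u1", "u4"]
engine_c = ["u3", "u2", "u5"]
print(borda_merge([engine_a, engine_b, engine_c]))   # ['u2', 'u1', 'u3', 'u4', 'u5']
```

Because it uses only the returned rankings, this merge needs no model of the component engines' ranking schemes; model-based merging would instead weight engines according to an understanding of their relative rankings, potentially by query type.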