WIRED Week 6
• Syllabus Review
• Readings Overview
• Search Engine Optimization
• Assignment Overview & Scheduling
• Projects and/or Papers Discussion

Web Search Engines
• Independent of IR model
• Distributed index and servers
  - Crawler
  - Query server
  - Indexer
• Crawlers and Spiders
  - Centralized control, Coordinated, Refresh, Filtering
  - Not the main problem
• Queries
  - Interface, processing, results
• Indexing
  - Data normalization, load balancing, data sharing

Harvesting
• Not just Web data
  - Caching, Duplication, Normalization
• Armies of crawlers
• Filtering collected data
• Gatherers
  - Collects and extracts on various schedules
  - Works with several brokers
• Brokers
  - Indexes and interfaces to queries
  - Works with other Brokers and Gatherers
• Topical Agents?

Web Crawling Issues
• Follow chains of URLs to gather more URLs
• Extract index (content) from each page
• Lather-Rinse-Repeat
• Update crawler to-do list
• Associate frequency of crawls
• Breadth or Depth first?
• Endless looping
• Duplicate pages/sites
• Changed page (or not really?)
• Dynamically generated pages
• Intranet pages
• Markup language getting in the way
• NOROBOTS
• What should a crawler get?

Indexing the Web
• Inverted File Index
  - Sorted words with pointers to location(s) & page(s)
  - Pointers are the focus (inversion)
• What about pages and sites?
  - Massive redundancy on well-organized sites
    • Navigation
    • Topics
    • Content
• “State of the art indexing techniques” = 30% of text (not page) size (p. 383)
• How can you tune an index for massively changing documents?

Ranking
• Boolean and Vector models mostly used
  - Why?
  - Works from the index, not the text
• Which ranking methods are best?
  - Datasets
  - Syntaxes
  - Users & Testing

Ranking Methods
• TF-IDF
  - Simple, smaller data sets
• Boolean Spread
  - Degrees of match
    • Within a document
    • Set of documents
    • Links between documents (meta docs?)
• Vector Spread
  - Standard cosine between query and index (to document)
  - Links with answer or pointing to answer
• Most Cited

Is Web ranking different?
• Links are the difference that makes the difference
  - Internal links on a page
  - Internal links on a site
  - Relationships between sites
  - Link freshness
• Kleinberg’s HITS method (1998)
  - Hypertext Induced Topic Search
  - Number of pages that point to (processed) query
  - Authorities (relevant content by links)
  - Hubs (links to varied authorities)

Problems with Hubs & Authorities
• Are more links always better?
• What about pages without many outgoing links?
• How do you count multiple links from within one page to another?
• Do automatically generated sites/pages have an advantage?
  - CMS systems may have linking “fingerprints”
  - Metadata
• How varied are the link weights?
  - Simple counts
  - Modified by other IR measures

Anatomy of a Large-Scale Web Search Engine
• Initial Google Design
• PageRank
  - PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  - “A model of user behavior”
    • probability of a random surfer visiting a page is its PageRank +
    • a damping factor (boredom)
  - Pages point to a page
  - Highly ranked pages point to a page
  - Anchor text is mined (the label for the link)
  - Proximity included

Anatomy 2
• Repository of page content
• Document index
  - Forward (sorted)
  - Inverted (sorter)
• Lexicon of words & pointers
• Hit Lists of word occurrence(s)
• Crawlers
• Ranking
• Feedback of selection (~)

Popularity?
• Do you always want the most popular information source?
  - Talk Radio
  - New York Times Bestseller List
  - “Lincoln’s Doctor’s Dog”
  - “The C.S.I. Diet and Cookbook”
• Trend or Fad?
• Blogs, Editorials and Propaganda vs. “Facts”?
• Result Diversity
• Death of the Mid-List

Metasearch Issues
• One place for everything?
• First or Last place to look?
• Better or different interface?
• Combined, sorted results would be best
  - How to sort?
  - Sorting for different types of queries
• Syntax Errors
• State Information (monitoring)
• Copyright issues (robots)
• User, content and interface mismatches/challenges

Web Searching Metaphors
• How do people visualize the Web?
• Is Browsing better?
• Do we need new metaphors for using the Web?
  - Searching
  - Browsing
  - What else?

Search Engine Optimization
• Found by spiders and submissions
  - More links to and from site
  - Registration on major directories
  - Links to and from major directories
• Real contact information helps prove validity
  - META tag
  - Header and footer of home page
  - About Us or Contact Us pages
  - Location/Map page

Good Design is SEO
• Basic interface
• Well-structured links
  - Comprehensive Site Navigation
  - Updated and accurate links
• Easy to find (via the Web or on the site itself)
• Clear labels
  - TITLEs
  - Headings
  - Term consistency
  - Link consistency
• Small sizes to download quickly

Web Search Tests
• Perform searches with targeted keywords
• Compare and contrast top results with your potential site
  - Similar terms
  - Links (external and internal)
  - Popularity (sites that link to the site)
• Use Data to
  - Build a keyword list
  - Build an introductory text
    • Blurbs
    • Description (2 sentences max)
• Any page found via a Web search engine should offer a search of the site itself
• Regularly monitor Search with your terms

Internal Search
• Robots.txt
• Log and analyze search results
  - Measure success and failure
  - Tune for click-through productivity
  - Keep list of terms
  - Match terms to pages
    • Add terms
    • Script terms to certain pages
  - Provide list (links) of most recent search terms
  - Provide list (links) of most popular search terms

Page Design
• Use CSS
  - <style type="text/css">
  - Keep content in pages, not CSS templates
• Put JavaScript, etc. in external files
  - <script language="JavaScript" src="scripts/myscript.js" type="text/javascript"></script>
  - <noscript> tag too for alternate content
• Continually verify external links
• ALT tags & Accessibility Compliance
• Index link on Splash page (if needed)
• Exact consistency on internal links (ending “/”s)
• <noframes>
• Redirects
  - <META HTTP-EQUIV="refresh" CONTENT="0; URL=http://www.newsite.com/index.html">

Search and MIME types
• Flash now supports internal text
• PDF files
  - Add comments and authorship info
  - Modify existing PDFs
    • Check Document Properties > Fonts: a PDF that lists fonts can be indexed (it is not a group of graphics files)
  - Provide text abstract or summary of PDF
• PPT, use text if possible
• Java interfaces prove difficult
• Dynamic pages should have key(word) static elements
• FORMs not always completely indexed

Track your Tracking
• Keep list of sites submitted to
  - When, Who, Email address, exact URL submitted
  - Suggested categories, Current site description
  - Terms and Conditions
• Keep list of “goal” keywords
• Keep list of sites where you check keywords
  - Keywords
  - Dates
  - Successes/Failures

Assignment Overview & Scheduling
• Leading WIRED Topic Discussions
  - # in class = # of weeks left?
• Web Information Retrieval System Evaluation & Presentation
  - 5 page written evaluation of a Web IR System
  - technology overview (how it works)
  - a brief history of the development of this type of system (why it works better)
  - intended uses for the system (who, when, why)
  - (your) examples or case studies of the system in use and its overall effectiveness

Projects and/or Papers Overview
• How can (Web) IR be better?
  - Better IR models
  - Better User Interfaces
• More to find vs. easier to find
• Scriptable applications
• New interfaces for applications
• New datasets for applications
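
Worked sketch: PageRank
The PageRank formula from the Anatomy slide, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), can be tried out on a toy link graph. What follows is only a minimal sketch under stated assumptions, not the original Google implementation: the function name `pagerank`, the three-page graph `toy_links`, the damping value d = 0.85, and the fixed number of iterations are all chosen here for illustration. It also assumes every page has at least one outgoing link and lists each target once.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links: dict mapping each page to the list of pages it links to
           (hypothetical toy data; each target is assumed to appear once).
    """
    # Every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # starting guess for every page

    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Sum PR(T) / C(T) over every page T that links to this page,
            # where C(T) is the number of outgoing links on T.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ends up ranked highest because both A and B point to it, and A comes second because it receives all of C's rank. This matches the slide's point that links from highly ranked pages count for more than links alone.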