Web Searching Basics
Dr. Dania Bilal
IS 530
Fall 2009

How the Web Came About?
• First, we had the Internet with text-based files and indexes to find information in these files
  – Static, no graphics or multimedia
  – No point and click using a mouse
  – No GUI (Graphical User Interface)
  – Menu-driven; subject categories for topics were hierarchical in nature
• Tim Berners-Lee
  – Created HTTP (Hypertext Transfer Protocol) in the late 1980s
  – Links various files and documents (text, sound, images, videos, etc.) available on various Internet host servers in a seamless way
• Beginning of the World Wide Web (WWW)
• WWW is part of the Internet
• Graphical Web browsers were developed for navigating through Web content
• Mosaic
  – First widely used graphical Web browser
  – Appeared in 1993
  – Revolutionized access to information
  – Made the Web much easier to use
• Other browsers appeared

Searching the Web
• Search engines (general and subject-driven)
• Directories
• Meta-search engines
• Meta-directories

Search Engines
• Engines are computer programs designed for searching the Web
• Components
  – Crawlers or spiders
  – Database
  – Search engine software
  – Search algorithms

Crawlers or Spiders
• Traverse the Web, visiting web pages that are not blocked
• Read the pages visited
• Follow links from pages to additional pages
• Return frequently to the pages for updates

Database Component
• Stores copies of the web pages the crawlers or spiders visited
• Database is organized based on a preset scheme
• Fields in each document or web page are identified (e.g., URL, page title, header or section title, metadata described by the author of a page)
  – Result: pages are indexed

Search Engine Software
• Program that sorts through the pages stored in the database
• Takes a user query entered in a search engine
• Matches the words in the query to the web pages stored in the database, alongside the search criteria in the query
  – Matches each word and accounts for the operators appearing in the query (+, -, " ")
• The + sign is assumed when no operators are used
• Matching is performed by algorithms (computational rules)
• Relevance of what was matched is calculated using sophisticated algorithms
• Relevance ranking of pages returned to a user is based on rules used by the engine company

Search Engine Relevance Ranking
• Some criteria
  – Word frequency
  – Location of a word in the web page or document (page title, page URL, first heading, second heading, first sentence in a heading, etc.)
  – Number of links to a page by other pages
  – Number of clicks on a page when it appears in the results of a search
  – Meta-tags (metadata)

Basic Search Strategy
• Identify the information need
• Extract basic concepts from the information need (broad ideas)
• Choose possible keywords or terms related to the concepts
  – Think of broader, narrower, or related terms
• Determine the search logic and techniques most suitable for formulating a search using the keywords or terms
  – Boolean? Proximity? Combination of both? Nesting?
• Select an appropriate engine, directory, meta-engine, or meta-directory based on the topic
• Explore the features of the engine or directory if you're unfamiliar with them
  – Visit the Advanced Search options, Help file, Search Tips, as applicable
• Conduct the search
• Examine the first page of returned results and visit the top five or more
  – The search engine ranks results by its matching and ranking criteria, not by the context of the topic searched (system relevance)
• Identify the pages or documents that are the most relevant to your topic
  – User relevance judgment (also called pertinence)
• Use the most relevant document or page and explore the keywords, headings, phrases, etc. that you can use to find additional relevant pages or documents
  – "Seed" document or "pearl growing"
  – Follow the "Cited by" links, as applicable, to find additional documents relevant to the topic
• Revise your search if needed
• Try your search in another engine, specialized engine, meta-engine, directory, etc.

The Question of Quality
• Criteria for evaluating information quality
  – Source domain (.com, .edu, .gov, etc.)
  – Authority
  – Purpose or motivation
  – Quality of writing
  – Balanced views
  – Currency of information
  – Sources cited
• Accuracy
• Factual information (check against two or more authoritative sources)
• Use additional sources for evaluating the quality of information on the Internet
  – http://www.virtualchase.com/quality
  – http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html

The Invisible Web
• Search engines don't index all web pages
• Reasons:
  – Information stored in databases that require subscription
  – Pages or websites that are password-protected
  – Pages that are not linked to other pages
  – Pages that are blocked to spiders or crawlers

Search Logic: Boolean Operators
[Image not reproduced. Source: Google Images]

Boolean and Search Engines
• AND: +
• OR
• NOT: -

Phrase Searching
• Proximity searching
• Quotation marks (" ") are used in search engines
• Provides more precise results
• Limits the results to the words that are close to each other

Demos
• Google Features
  – Basic
  – Advanced
  – I'm Feeling Lucky
  – Google Directory
  – About Google
  – More (from the menu option)
  – Show options/Hide options (from the results page)

Google Advanced Searching
• Video on YouTube: http://www.youtube.com/watch?v=tk6vZiGiaiQ

Yahoo Demo
• Basic
• Advanced
• Directory
• Yahoo Answers
• Ask Earl
• Other features
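
Illustration: Matching a Query with +, -, and " "
The matching step described under Search Engine Software, together with the word-frequency criterion from Relevance Ranking, can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the PAGES data, the example.org URLs, and the functions parse_query, matches, and search are hypothetical names made up for this illustration, and real engines rely on much larger indexes and far more elaborate ranking rules.

```python
import re

# Hypothetical mini "database" of crawled pages: URL -> page text.
PAGES = {
    "http://example.org/a": "Web search engines use crawlers to index web pages",
    "http://example.org/b": "Library catalogs index books, not web pages",
    "http://example.org/c": "Crawlers follow links between pages on the Web",
}

def parse_query(query):
    """Split a query into required terms, excluded terms, and quoted phrases."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]*"', " ", query)  # drop the quoted parts
    required, excluded = [], []
    for token in rest.split():
        term = token.lstrip("+-").lower()
        if not term:
            continue  # ignore a stray "+" or "-"
        if token.startswith("-"):
            excluded.append(term)   # -word means NOT
        else:
            required.append(term)   # +word, or a bare word (+ is assumed)
    return required, excluded, phrases

def matches(text, required, excluded, phrases):
    """True if the page satisfies every operator in the query."""
    lowered = text.lower()
    words = set(re.findall(r"\w+", lowered))
    return (all(t in words for t in required)
            and not any(t in words for t in excluded)
            # Simplification: a phrase is matched as a plain substring.
            and all(p in lowered for p in phrases))

def search(query, pages=PAGES):
    """Return matching URLs, most frequent required terms first."""
    required, excluded, phrases = parse_query(query)
    hits = []
    for url, text in pages.items():
        if matches(text, required, excluded, phrases):
            words = re.findall(r"\w+", text.lower())
            score = sum(words.count(t) for t in required)  # word frequency
            hits.append((score, url))
    # Sort by descending score, then by URL so ties are stable.
    return [url for score, url in sorted(hits, key=lambda h: (-h[0], h[1]))]

if __name__ == "__main__":
    print(search("web pages"))
    print(search("web -crawlers"))
    print(search('"web pages" +index'))
```

In this sketch, a bare word is handled exactly like +word, which mirrors the slide's note that the + sign is assumed when no operators are used. Ranking here uses word frequency alone; the other criteria listed (word location, links from other pages, clicks, meta-tags) are left out to keep the example short.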