Web Searching
Basics
Dr. Dania Bilal
IS 530
Fall 2009
How the Web Came About?
• First, we had the Internet with text-based
files and indexes to find information in
these files
– Static, no graphics or multimedia
– No point and click using a mouse
– No GUI (Graphical User Interface)
– Menu-driven and subject categories for topics
were hierarchical in nature
How the Web Came About?
• Tim Berners-Lee
– Proposed the Web in 1989 and created HTTP shortly after
– Hypertext Transfer Protocol
– Links various files and documents (text,
sound, images, videos, etc.) available on
various Internet host servers in a seamless
way
• Beginning of the World Wide Web (WWW)
• WWW is part of the Internet
How the Web Came About?
• Graphical Web browsers were developed
for navigating through Web content
• Mosaic
– First widely used graphical Web browser
– Appeared in 1993
– Revolutionized access to information
– Made the Web much easier to use
• Other browsers appeared
Searching the Web
• Search engines (general and subject-driven)
• Directories
• Meta-search engines
• Meta-directories
Search Engines
• Engines are computer programs designed
for searching the Web
• Components
– Crawlers or spiders
– Database
– Search engine software
– Search algorithms
Crawlers or Spiders
• Traverse the Web and visit web pages that
are not blocked
• Read the pages visited
• Follow links from pages to additional
pages
• Return frequently to the pages for updates
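The crawler behavior above can be sketched in Python as a breadth-first traversal. This is only an illustration, not how any particular engine works: the `PAGES` dictionary is a hypothetical stand-in for the live Web, and the `BLOCKED` set stands in for pages that disallow crawlers.

```python
from collections import deque

# Hypothetical miniature "Web": each URL maps to (page text, outgoing links).
PAGES = {
    "a.example/home": ("welcome to the home page", ["a.example/about", "b.example/news"]),
    "a.example/about": ("about this site", ["a.example/home"]),
    "b.example/news": ("news about web searching", ["b.example/private"]),
    "b.example/private": ("members only", []),
}
BLOCKED = {"b.example/private"}  # assumed to be off-limits to crawlers


def crawl(start_url):
    """Visit every reachable, unblocked page once and return the text read."""
    visited = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited or url in BLOCKED or url not in PAGES:
            continue
        text, links = PAGES[url]
        visited[url] = text   # "read" the page
        queue.extend(links)   # follow its links to additional pages
    return visited


collected = crawl("a.example/home")
print(sorted(collected))
```

A real crawler would also schedule return visits to pick up changes; that step is left out here to keep the sketch short.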
Database Component
• Stores copies of the web pages the
crawlers or spiders visited
• Database is organized based on a preset
scheme
• Fields in each document or webpage are
identified (e.g., URL, page title, header or
section title, metadata supplied by the author
of a page), and the pages are then indexed
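One common way to organize such a database is an inverted index keyed by field and word. The sketch below assumes that scheme; the `documents` records and the choice of fields are hypothetical, and real engines use far more elaborate structures.

```python
from collections import defaultdict

# Hypothetical stored copies of crawled pages, already split into fields.
documents = {
    1: {"url": "a.example/cats", "title": "Cat care basics", "body": "feeding and grooming a cat"},
    2: {"url": "b.example/dogs", "title": "Dog training", "body": "basics of training a dog"},
}


def build_index(docs):
    """Map each (field, word) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field in ("title", "body"):
            for word in fields[field].lower().split():
                index[(field, word)].add(doc_id)
    return index


index = build_index(documents)
print(sorted(index[("title", "basics")]))  # documents with "basics" in the title
print(sorted(index[("body", "basics")]))   # documents with "basics" in the body
```

Keeping the field in the key lets the engine later treat a match in the title differently from a match in the body.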
Search Engine Software
• Program that sorts through the pages stored in
the database
• Takes a user query entered in a search engine
• Matches the words in the query to the web
pages stored in the database alongside the
search criteria in the query
– Matches each word and accounts for the operators
appearing in the query (+, -, “ ”)
• The + sign is assumed when no operators are used
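The matching step can be illustrated with a small Python sketch. It assumes a simplified query syntax in which `+` or no sign marks a required term, `-` marks an excluded term, and straight double quotes mark a phrase. The function names and the sample `pages` are hypothetical.

```python
import re


def parse_query(query):
    """Split a query into required and excluded terms; quoted text stays one phrase."""
    required, excluded = [], []
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]+)"|(\S+))', query):
        term = (phrase or word).lower()
        # A term with no operator is treated like a "+" term.
        (excluded if sign == "-" else required).append(term)
    return required, excluded


def contains(text, term):
    """True if the word or phrase occurs in the text on whole-word boundaries."""
    return re.search(r"\b" + re.escape(term) + r"\b", text.lower()) is not None


def matches(text, query):
    """True if the text has every required term and none of the excluded ones."""
    required, excluded = parse_query(query)
    return (all(contains(text, t) for t in required)
            and not any(contains(text, t) for t in excluded))


pages = {
    "p1": "A web search engine indexes pages",
    "p2": "Yahoo is a web search engine and directory",
    "p3": "An engine for web search",
}
query = 'web "search engine" -yahoo'
hits = [name for name, text in pages.items() if matches(text, query)]
print(hits)
```

Here `p2` is dropped because it contains the excluded word, and `p3` is dropped because its words do not form the quoted phrase.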
Search Engine Software
• Matching is performed by algorithms
(computational rules)
• Relevance of what was matched is
calculated using sophisticated algorithms
• Relevance ranking of pages returned to a
user is based on rules set by the
search engine company
Search Engine Relevance Ranking
• Some criteria
– Word frequency
– Location of a word in the web page or
document
• (page title, page URL, page first heading, 2nd
heading, first sentence in a heading, etc.)
– Number of links to a page by other pages
– Number of clicks on a page when it appears in
search results
– Meta-tags (metadata)
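To show how several of these criteria might combine, here is a rough scoring sketch in Python. The weights are invented for illustration, since engines do not publish their ranking rules, and the sample pages are hypothetical. Click counts and meta-tags are left out for brevity.

```python
# Assumed weights; actual engines keep their ranking rules proprietary.
TITLE_WEIGHT = 5.0
URL_WEIGHT = 3.0
BODY_WEIGHT = 1.0
LINK_WEIGHT = 0.5


def score(page, query_words):
    """Combine word frequency, word location, and inbound links into one number."""
    total = 0.0
    for word in query_words:
        total += TITLE_WEIGHT * page["title"].lower().split().count(word)
        total += URL_WEIGHT * (1 if word in page["url"].lower() else 0)
        total += BODY_WEIGHT * page["body"].lower().split().count(word)
    total += LINK_WEIGHT * page["inbound_links"]
    return total


pages = [
    {"url": "c.example/blog", "title": "My blog",
     "body": "jazz jazz jazz jazz", "inbound_links": 0},
    {"url": "b.example/music", "title": "Music styles",
     "body": "rock pop and jazz", "inbound_links": 10},
    {"url": "a.example/jazz", "title": "Jazz history",
     "body": "jazz began in new orleans and jazz spread", "inbound_links": 2},
]

ranked = sorted(pages, key=lambda p: score(p, ["jazz"]), reverse=True)
for page in ranked:
    print(page["url"], score(page, ["jazz"]))
```

With these weights, a page that merely repeats the word in its body ranks below pages that use it in the title or URL or that attract more links.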
Basic Search Strategy
• Identify the information need
• Extract basic concepts from the information need (broad
ideas)
• Choose possible keywords or terms related to the
concepts
– Think of broader, narrower, or related terms
• Determine the search logic and techniques most suitable
for formulating a search using the keywords or terms
– Boolean? Proximity? Combination of both? Nesting?
• Select an appropriate engine, directory, meta-engine, or meta-directory based on the topic
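The concept-to-query steps can be sketched in Python as follows. The information need, concepts, and terms are hypothetical examples, and the output uses generic Boolean syntax with nesting rather than the syntax of any specific engine.

```python
# Hypothetical concepts and candidate terms for the information need
# "effects of social media on teenagers".
concepts = {
    "social media": ["social media", "social networking"],
    "teenagers": ["teenagers", "adolescents", "teens"],
}


def quote(term):
    """Wrap multi-word terms in quotation marks so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_query(concept_terms):
    """OR the terms within each concept, then AND the concepts together."""
    groups = ["(" + " OR ".join(quote(t) for t in terms) + ")"
              for terms in concept_terms.values()]
    return " AND ".join(groups)


query = build_query(concepts)
print(query)
```

ORing related terms broadens each concept, while ANDing the concepts narrows the results to pages that cover all of them.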
Basic Search Strategy
• Explore the features of the engine or directory if you’re unfamiliar with them
– Visit the Advanced Search options, Help file, Search Tips, as applicable
• Conduct the search
• Examine the first page of returned results and visit the top five or more
– The search engine ranks results based on its matching and ranking criteria, not on
the context of the search topic
• System relevance
• Identify the pages or documents that are the most relevant to your topic
– User relevance judgment (also called pertinence)
• Use the most relevant document or page and explore the keywords,
headings, phrases, etc. that you can use to find additional relevant pages or
documents
– “Seed” document or “pearl growing”
– Follow the Cited by links, as applicable, to find additional documents relevant to the
topic
• Revise your search if needed
• Try your search in another engine, specialized engine, meta-engine,
directory, etc.
The Question of Quality
• Criteria for evaluating information quality
– Source domain (.com, .edu, .gov, etc.)
– Authority
– Purpose or motivation
– Quality of writing
– Balanced views
– Currency of information
– Sources cited
The Question of Quality
• Accuracy
• Factual information (check against two or
more authoritative sources)
• Use additional sources for evaluating the
quality of information on the Internet
– http://www.virtualchase.com/quality
– http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
The Invisible Web
• Search engines don’t index all web pages
• Reasons:
– Information stored in databases that require
subscription
– Pages or websites that are password-protected
– Pages that no other pages link to
– Pages that are blocked to spiders or crawlers
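One widely used way to block crawlers is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to check whether a crawler may fetch a page. The robots.txt content, the `ExampleBot` name, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler checks each URL before visiting it.
public_ok = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
private_ok = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
print(public_ok, private_ok)
```

Pages under the disallowed path are skipped by compliant crawlers, so they never reach the engine's database.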
Search Logic: Boolean Operators
[Image omitted. Source: Google Images]
Boolean and Search Engines
• AND (+)
• OR
• NOT (-)
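Boolean operators correspond to set operations on the pages that contain each word. The Python sketch below shows this with a hypothetical `postings` table that maps each word to the ids of the pages containing it.

```python
# Hypothetical postings: which page ids contain each word.
postings = {
    "cats": {1, 2, 3},
    "dogs": {2, 3, 4},
    "training": {3, 4, 5},
}

both = postings["cats"] & postings["dogs"]       # cats AND dogs (+cats +dogs)
either = postings["cats"] | postings["dogs"]     # cats OR dogs
only_cats = postings["cats"] - postings["dogs"]  # cats NOT dogs (cats -dogs)

# Nesting: (cats OR dogs) AND training
nested = (postings["cats"] | postings["dogs"]) & postings["training"]

print(sorted(both), sorted(either), sorted(only_cats), sorted(nested))
```

AND narrows a result set, OR broadens it, and NOT removes pages from it.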
Phrase Searching
• A form of proximity searching
• Quotation marks (“ ”) are used in search engines
• Provides more precise results
• Limits the results to pages in which the words are
close to each other
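The difference between an exact phrase match and a looser proximity match can be sketched in Python. The function names, the `max_distance` parameter, and the sample text are hypothetical, and engines differ in how they define "close".

```python
def positions(words, term):
    """Return every index at which the term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]


def phrase_match(text, phrase):
    """True if the words of the phrase appear consecutively and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))


def near(text, first, second, max_distance):
    """True if the two words occur within max_distance words of each other."""
    words = text.lower().split()
    return any(abs(i - j) <= max_distance
               for i in positions(words, first)
               for j in positions(words, second))


text = "information retrieval on the web differs from retrieval of library information"
print(phrase_match(text, "information retrieval"))
print(phrase_match(text, "web retrieval"))
print(near(text, "web", "retrieval", 3))
print(near(text, "web", "library", 2))
```

A phrase search is the strictest case of proximity: the words must be adjacent and in the given order.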
Demos
• Google Features
– Basic
– Advanced
– I’m feeling lucky
– Google Directory
– About Google
– More (from the menu option)
– Show options/Hide options (from the results
page)
Google Advanced Searching
• Video on YouTube
http://www.youtube.com/watch?v=tk6vZiGiaiQ
Yahoo Demo
• Basic
• Advanced
• Directory
• Yahoo Answers
• Ask Earl
• Other features