Download What is a Search Engine? - UMass Lowell Computer Science

What is a Search Engine?
• Two functions:
– Query processing
user interface, what you see on the web
– Data collection
“robots” or “spiders” that walk the web
➢ Large data centers
Powerful computers, massive storage
➢ Proprietary technology
Expression Parsing
• parse – to break a sentence down, giving
the form and function of each part
(Webster’s Dictionary)
• Search engines parse your search
expression:
– spaces are separators
– quotes force words to be joined to make
phrases
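The two parsing rules above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine is implemented; the function name `parse_expression` is made up for the example, and it assumes straight double quotes.

```python
import re

# A token is either a double-quoted phrase or a run of non-space characters.
TOKEN = re.compile(r'"([^"]+)"|(\S+)')

def parse_expression(expression):
    """Split a search expression into single words and quoted phrases."""
    tokens = []
    for phrase, word in TOKEN.findall(expression):
        # Exactly one of the two groups is non-empty for each match.
        tokens.append(phrase if phrase else word)
    return tokens

print(parse_expression('"calico cat" geriatric'))
```

The quoted words come back as one token, while the unquoted word is split off at the space.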
Search Expressions
• Single word: cat
Returns every page that contains that word
• Phrase: “calico cat”
Returns every page that contains that phrase
• Add a word: “calico cat” +geriatric
Returns pages with phrase and added term
• Subtract a word: “calico cat” -kitten
Returns pages with phrase but not added term
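As a rough sketch of how the four expressions behave, the Python below runs them against three made-up pages. The page names, page text, and the helpers `parse`, `matches`, and `search` are all assumptions for illustration; real engines work from an index rather than scanning text.

```python
import re

# Optional +/- sign, then a quoted phrase or a single word.
TOKEN = re.compile(r'([+-]?)(?:"([^"]+)"|(\S+))')

def parse(expression):
    """Return (required, excluded) lists of lowercase words or phrases."""
    required, excluded = [], []
    for sign, phrase, word in TOKEN.findall(expression):
        term = (phrase or word).lower()
        (excluded if sign == "-" else required).append(term)
    return required, excluded

def matches(page_text, expression):
    """True if the page has every required term and no excluded term."""
    # Normalize to space-separated words so whole words and phrases compare cleanly.
    text = " " + " ".join(re.findall(r"\w+", page_text.lower())) + " "
    required, excluded = parse(expression)
    has_all = all(f" {term} " in text for term in required)
    has_excluded = any(f" {term} " in text for term in excluded)
    return has_all and not has_excluded

def search(pages, expression):
    return sorted(name for name, text in pages.items() if matches(text, expression))

pages = {
    "a": "My calico cat is a geriatric lady.",
    "b": "Adopt a calico cat or kitten today!",
    "c": "The cat sat on the mat.",
}
print(search(pages, 'cat'))
print(search(pages, '"calico cat"'))
print(search(pages, '"calico cat" +geriatric'))
print(search(pages, '"calico cat" -kitten'))
```

Each added or subtracted term shrinks the result list, which is the point of the slide.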
Search Phrase Construction
Run on Google:
• Single word: dishwasher
• Multi word: dishwasher +history
• Better multi word: dishwasher +history +appliance
• Exclude: add the term: -whirlpool
• Phrase: “dishwasher history”
Global Search Engine Usage
1. Google (46.5%)
2. Yahoo (20.6%)
3. MSN Search (7.8%)
4. Altavista (6.4%)
5. Terra Lycos (4.6%)
6. Ixquick (2.4%)
7. AOL Search (1.6%)
Meta Search Engines
• Submit your query to multiple search
engines
– dogpile.com
– ixquick.com
• Downside:
The 7 most used word phrases in search engines
on the web are:
1. 2 word phrases 32.58%
2. 3 word phrases 25.61%
3. 1 word phrases 19.02%
4. 4 word phrases 12.83%
5. 5 word phrases 5.64%
6. 6 word phrases 2.32%
7. 7 word phrases 0.98%
Directory Oriented
Search Engines
• These engines are similar to editorial
services
– they determine the best sites on the Web and
include them in categorized listings
– unwieldy results
– some have both a directory of categorized
sites and a search engine for searching both
the listing and the Internet
Directory Services
• Yahoo!
– Commercial sites pay for inclusion
– Search results favor paying sites
• ODP – open directory project
– Listing is free, subject to editorial approval
– odp.org
– dmoz.org
o Google, Lycos and others also have directories.
Basic Principle #2:
Use phrase searching whenever possible. Almost all the
portals and search engines can do phrase searching, that is,
searching for the words entered adjacent to each other and
exactly in the order submitted. Most use double quotes to
identify a phrase:
"this is a phrase"
Examples:
To narrow results for a search on apples
add more words: apples strawberries
Or use a phrase search: “apple pie recipe”
Use the - to exclude: apples strawberries -kiwi
Combining Words when Searching
To get more precise results, add more words to the search.
With billions of Web pages indexed, adding more words
(intelligently) helps to narrow the search results to a better match.
• Use the most unique words first.
• Use a minus - to exclude terms.
• Whenever possible, try a phrase search first.
1. Search bozeman, then john bozeman, and
then "john bozeman" on Google. Which finds
the least? Does john bozeman find both
words or either term?
2. Search the phrase productivity quality and
outcomes on AltaVista
Portals, Directories, & Search Engines
Portals: Offer search, directory, and many other general services such as
email, free home page building, news, popular topics, etc.
Well-known portals include Yahoo!, AOL, MSN, and Lycos.
Directory Function: A subject directory includes selected Web sites
(more often than pages) and classifies them into hierarchical subject
categories. Most portals have one and some specialized directories are
available by themselves. They do not index every word on every page
included.
Search Engines: Indexing the words on every page in their database, a
search engine covers Web pages and can include more than 2 billion pages.
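"Indexing the words on every page" is commonly done with an inverted index: a table from each word to the pages that contain it. The sketch below is a toy version under that assumption; the page names and text are invented, and `build_index` is a hypothetical helper.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page names that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index

pages = {
    "home.html": "Apple pie recipe with fresh apples",
    "fruit.html": "Apples and strawberries in season",
}
index = build_index(pages)
print(sorted(index["apples"]))        # pages that mention "apples"
print(sorted(index["strawberries"]))  # pages that mention "strawberries"
```

A directory, by contrast, would store only the site and its category, not every word.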
Primary Search Engines
• While popular sites are well covered by almost all of the portals,
directories, and search engines, individual pages and lesser
known sites are not.
• None are comprehensive, and there is not always overlap
between the search engines.
• Sometimes, it is more effective to switch to another search
engine rather than to stick just to one.
• The search engines have many aspects in common, but they
also each differ in important ways.
Use the portals, directories, and search engines for finding the Web site
of organizations when you can't guess the URL
Some reasons for using specific search engines
Google: One of the largest and includes PDF and other file types.
Also includes cached copies of the page as it appeared when
indexed. Also has News, Usenet (Groups), and Image databases.
AltaVista: One with the most powerful search capabilities. Has
relatively unique features such as NEAR, truncation, full Boolean
searching, and extensive language limits. Includes PDFs. Also has
limited machine translation capabilities.
MSN Search: Powerful search features, a very fresh database, and
the default search engine in Internet Explorer.
Teoma: A newer and smaller search engine that specializes in
identifying communities on the Web and metasites.
-- merged with Ask (formerly known as Ask Jeeves)
AND
Finds documents containing all of the specified words or
phrases. Peanut AND butter finds documents with both the
word peanut and the word butter.
OR
Finds documents containing at least one of the specified words
or phrases. Peanut OR butter finds documents containing either
peanut or butter. The found documents could contain both items,
but not necessarily.
AND NOT
Excludes documents containing the specified word or phrase.
Peanut AND NOT butter finds documents with peanut but not
containing butter. NOT must be used with another operator, like
AND. AltaVista does not accept 'peanut NOT butter'; instead,
specify peanut AND NOT butter.
NEAR
Finds documents containing both specified words or phrases
within 10 words of each other. Peanut NEAR butter would find
documents with peanut butter, but probably not any other kind of
butter.
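The four operators map naturally onto set operations. The sketch below is a toy model, not AltaVista's implementation: the three documents are invented, and NEAR is assumed to mean that the word positions differ by at most 10.

```python
import re

def words(text):
    return re.findall(r"\w+", text.lower())

def near(text, a, b, distance=10):
    """True if words a and b occur within `distance` words of each other."""
    tokens = words(text)
    pos_a = [i for i, w in enumerate(tokens) if w == a]
    pos_b = [i for i, w in enumerate(tokens) if w == b]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)

docs = {
    1: "Peanut butter sandwich",
    2: "Roasted peanut snack",
    3: "Butter the pan before baking",
}

def containing(word):
    """Set of document ids whose text contains the word."""
    return {doc_id for doc_id, text in docs.items() if word in words(text)}

peanut, butter = containing("peanut"), containing("butter")
print(peanut & butter)   # peanut AND butter
print(peanut | butter)   # peanut OR butter
print(peanut - butter)   # peanut AND NOT butter
print({d for d, t in docs.items() if near(t, "peanut", "butter")})  # peanut NEAR butter
```

AND is set intersection, OR is union, and AND NOT is set difference.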
Wild Cards
*
The asterisk is a wildcard; any letters can take the place of
the asterisk. Bass* would find documents with bass,
basset and bassinet.
You must type at least three letters before the *.
You can also place the * in the middle of a word. This is
useful when you're unsure about spelling.
Colo*r would find documents that contain color and
colour.
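One way to picture wildcard matching is to translate the pattern into a regular expression and test it against a word list. This is a sketch under assumptions: the vocabulary is invented, `wildcard_to_regex` and `expand` are hypothetical helpers, and the asterisk is taken to stand for zero or more letters.

```python
import re

def wildcard_to_regex(pattern, min_prefix=3):
    """Translate a pattern such as bass* or colo*r into a compiled regex."""
    star = pattern.find("*")
    if 0 <= star < min_prefix:
        raise ValueError("type at least three letters before the *")
    # Escape the literal pieces and let each * stand for any run of letters.
    parts = [re.escape(piece) for piece in pattern.split("*")]
    return re.compile("[a-z]*".join(parts), re.IGNORECASE)

def expand(pattern, vocabulary):
    """Return the vocabulary words that the wildcard pattern matches."""
    regex = wildcard_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]

vocabulary = ["bass", "basset", "bassinet", "bat", "color", "colour", "colander"]
print(expand("bass*", vocabulary))
print(expand("colo*r", vocabulary))
```

The three-letter rule keeps a pattern like ba* from expanding to a huge part of the vocabulary.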
( )
Use parentheses to group complex Boolean phrases. For
example, (peanut AND butter) AND (jelly OR jam) finds
documents with the words 'peanut butter and jelly' or
'peanut butter and jam' or both.
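To see why the grouping matters, the sketch below evaluates the example with Python sets. The four documents are made up, and the "ungrouped" line follows Python's own operator precedence (& before |), which is only an assumption about how an engine might read the expression without parentheses.

```python
# Each hypothetical document is reduced to the set of words it contains.
docs = {
    1: {"peanut", "butter", "jelly"},
    2: {"peanut", "butter", "jam"},
    3: {"peanut", "jelly"},
    4: {"butter", "jam"},
}

def containing(word):
    return {doc_id for doc_id, doc_words in docs.items() if word in doc_words}

# (peanut AND butter) AND (jelly OR jam)
grouped = (containing("peanut") & containing("butter")) & (
    containing("jelly") | containing("jam")
)

# Without parentheses, Python reads this as (peanut AND butter AND jelly) OR jam.
ungrouped = (
    containing("peanut") & containing("butter") & containing("jelly")
    | containing("jam")
)

print(sorted(grouped))
print(sorted(ungrouped))
```

The ungrouped form lets in document 4, which has jam but no peanut.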
Advanced Searching Tips
link:URLtext
Finds pages with a link to a page with the specified
URL text. Use link:www.myway.com to find all
pages linking to myway.com.
text:text
Finds pages that contain the specified text in any
part of the page other than an image tag, link, or
URL. The search text:graduation would find all
pages with the term graduation in them.
title:text
Finds pages that contain the specified word or
phrase in the page title (which appears in the title
bar of most browsers). The search title:sunset
would find pages with sunset in the title.
url:text
Finds pages with a specific word or phrase in the
URL. Use url:garden to find all pages on all
servers that have the word garden anywhere in the
host name, path, or filename.
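These field searches can be pictured as a lookup in one part of a page record instead of the whole page. The sketch below is illustrative only: the record layout (`url`, `title`, `text`, `links`), the sample pages, and the helpers `parse_field_query` and `matches` are assumptions, and matching is a plain substring test.

```python
FIELDS = {"link", "text", "title", "url"}

def parse_field_query(query):
    """Split 'field:value'; anything without a known field is a text search."""
    field, sep, value = query.partition(":")
    if not sep or field not in FIELDS:
        return "text", query
    return field, value

def matches(page, query):
    field, value = parse_field_query(query)
    value = value.lower()
    if field == "link":
        # link: looks at the URLs this page links to, not at its own URL.
        return any(value in target.lower() for target in page["links"])
    return value in page[field].lower()

pages = [
    {
        "url": "http://example.com/garden/tips.html",
        "title": "Sunset over the garden",
        "text": "Photos from graduation day",
        "links": ["http://www.myway.com/"],
    },
    {
        "url": "http://example.org/news.html",
        "title": "Daily news",
        "text": "Nothing about flowers",
        "links": [],
    },
]

for query in ["title:sunset", "url:garden", "text:graduation", "link:www.myway.com"]:
    print(query, [p["url"] for p in pages if matches(p, query)])
```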
Some of the advanced search features available include the following.
Search Engine      Boolean                             Proximity       Truncation  Limits
Google             -, OR                               "Phrase"        No          Title, domain, filetype, PDF
AlltheWeb / Lycos  +, - [and, or, andnot, ( ) in adv]  "Phrase"        No          Title, PDF, domain, content type
AltaVista          AND, OR, AND NOT, ( ), +, -         "Phrase", Near  Yes *       Title, date, domain
Teoma              -, OR                               "Phrase"        No          Title, domain, site
MSN Search         AND, OR, NOT                        "Phrase"        No          Title, domain, content type
• Use "phrase searching" whenever possible
• Use link search to find who links to a specific page
• Use the advanced search forms, or at least look to
see what they offer. They often include the ability to limit a
search to title, a specific domain, or do a link search.
• Add more words to focus results
• Try a title search for subject focus
Search engines won’t find…
• New sites
• Dynamic web pages
• Sites that aren’t seen by crawler programs
– Crawlers follow links.
– If no sites link to your site, a crawler won’t find it.
– Example: The web site for this course!
• Sites explicitly excluded by page author
– meta tags embedded in page, visible to crawler, but
not visible to user viewing the page with a browser.
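The "crawlers follow links" point can be sketched as a breadth-first walk over a link graph. Everything here is hypothetical: the page names, the `links` dictionary, and the `noindex` set that stands in for pages whose author excluded them with a meta tag.

```python
from collections import deque

def crawl(links, seeds, noindex=()):
    """Walk outgoing links breadth-first and return the pages that get indexed."""
    seen = set(seeds)
    queue = deque(seeds)
    indexed = []
    while queue:
        page = queue.popleft()
        if page not in noindex:
            indexed.append(page)  # author did not ask to be excluded
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed

links = {
    "portal": ["news", "shop"],
    "news": ["portal", "private"],
    "shop": [],
    "private": [],
    "course-site": ["portal"],  # links out, but no page links to it
}
print(crawl(links, ["portal"], noindex={"private"}))
```

The crawler never reaches course-site because no page links to it, and it visits private but leaves it out of the index.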
The Deep Web
• Information on the web, but not visible to
spider or crawler programs
• Databases or non-text files
• invisible-web.net
• Data that is in a database.
– Example: when you search for books at Amazon, it
reads a database then builds a page for you.
– Statistics sites.
• Some binary file formats. Not all search engines
can read .pdf files.
– Also multi-media files, like sound clips, images.
Beyond Search Engines
What's Not Included
• Search Engines attempt to find and index as
many sites as possible, but note what is
not included.
• Search engines include billions of pages in their
databases, but none of the Search
Engines come close to indexing the entire
Web, much less the entire Internet.
Beyond Search Engines
Here is a list of some of what is missing / not available:
– The content in sites requiring a log in
• i.e. username & PW
– CGI output such as data requested by a form
– Intranets – pages / sites not linked from anywhere else
– Commercial resources with domain limitations
– Content of Adobe PDF and formatted files
• (some are now indexed)
– Non-Web resources: Email lists, chat, IM, books, etc.
– Very current information: News, press releases
– Multimedia file content: Words in pictures, sound files,
video files
Selected Specialized Search Engines
• News: Daypop, AlltheWeb News, Google News
• Opinions: Google Groups, Epinions
• http://www.searchenginecolossus.com/
– Search Engine Colossus offers you links to search engines
and directories from 195 countries and 46 territories around
the world! Conduct extensive web searches!
– Make your own website submissions!
– Locate your new favorite search engines!
– Search the web using your choice of language!
• http://www.search.com/
– subject collection of search engines
From : http://www.searchengineshowdown.com/phone/
On-Site Search Engines
• Local search engine to find data on a site
• Becoming a necessity for a large site
– Microsoft has one
– UMass Lowell has one
• Virus Bulletin has a local installation of
Google
News Search
• Search engines usually don’t include news
in web results.
• Most news on the web is not permanent,
at least in free form
• News changes frequently, spiders need to
follow different rules
• Google, AltaVista, Lycos, etc. have
separate news search services
Specialty Search Engines
• Topic-specific Search Engines
• Medical, Legal, etc.
• More relevant results, less noise
➢ Example:
www.eliyon.com - collects information on
business people (15 million)
• What is the best search engine?
– The answer changes often!
– It depends on what you want to find.
AltaVista
• Overture to acquire AltaVista from CMGI
– DEC filed $50M IPO in 1996
– Compaq sold it to CMGI for $2.2B in 1999
– Overture buys AltaVista for $140M in 2003
• AltaVista has never been profitable!
Inktomi
• Inktomi provides search technology to
Microsoft, AOL, et al.
• Yahoo! to acquire Inktomi
– Inktomi stock price in June 1998 - $9.00
– … in March 2000 - $250.00
– … in February 2003 - $1.62
• Inktomi was profitable briefly in 2001.
Yahoo!
• Yahoo! to buy Overture for $1.63B, pending
shareholder vote on October 7.
• Overture has an ongoing agreement with MSN
to supply context-specific ads for search result
pages.
• Yahoo! used Google for search; recently moved
to their own proprietary search engine.
The Business of Search
• Google recently became a public company.
– It is acquiring companies and many very talented
leaders
• Yahoo! has recently acquired several search
companies.
• Microsoft has made the decision to deploy their
search technology
➢ Leadership has been shifting and the industry
has been consolidating.
Search Engine Revenue?
• Paid inclusion in directories.
• Sponsored links.
• Paid inclusion and/or priority placement in
search results.
• License search technology to other sites.
• Advertising.
• Add-on services.