What is a Search Engine?
• Two functions:
  – Query processing: the user interface, what you see on the web
  – Data collection: "robots" or "spiders" that walk the web
• Large data centers: powerful computers, massive storage
• Proprietary technology

Expression Parsing
• parse – to break a sentence down, giving the form and function of each part (Webster's Dictionary)
• Search engines parse your search expression:
  – spaces are separators
  – quotes force words to be joined to make phrases

Search Expressions
• Single word: cat
  Returns every page that contains that word
• Phrase: "calico cat"
  Returns every page that contains that phrase
• Add a word: "calico cat" +geriatric
  Returns pages with the phrase and the added term
• Subtract a word: "calico cat" -kitten
  Returns pages with the phrase but not the subtracted term

Search Phrase Construction
Run on Google:
• Single word: dishwasher
• Multi word: dishwasher +history
• Better multi word: dishwasher +history +appliance
• Exclude: add the term: -whirlpool
• Phrase: "dishwasher history"

Global Search Engine Usage
1. Google (46.5%)
2. Yahoo (20.6%)
3. MSN Search (7.8%)
4. Altavista (6.4%)
5. Terra Lycos (4.6%)
6. Ixquick (2.4%)
7. AOL Search (1.6%)

Meta Search Engines
• Submit your query to multiple search engines
  – dogpile.com
  – ixquick.com
• Downside: unwieldy results

The 7 most used phrase lengths in search engines on the web are:
1. 2 word phrases 32.58%
2. 3 word phrases 25.61%
3. 1 word phrases 19.02%
4. 4 word phrases 12.83%
5. 5 word phrases 5.64%
6. 6 word phrases 2.32%
7. 7 word phrases 0.98%

Directory Oriented Search Engines
• These engines are similar to editorial services
  – they determine the best sites on the Web and include them in categorized listings
  – some have both a directory of categorized sites and a search engine for searching both the listing and the Internet

Directory Services
• Yahoo!
  – Commercial sites pay for inclusion
  – Search results favor paying sites
• ODP – open directory project
  – Listing is free, subject to editorial approval
  – odp.org
  – dmoz.org
• Google, Lycos and others also have directories.

Basic Principle #2: Use phrase searching whenever possible.
Almost all the portals and search engines can do phrase searching: searching for the words entered adjacent to each other and exactly in the order submitted. Most use double quotes to identify a phrase: "this is a phrase"
Examples:
• To narrow results for a search on apples, add more words: apples strawberries
• Or use a phrase search: "apple pie recipe"
• Use the - to exclude: apples strawberries -kiwi

Combining Words when Searching
To get more precise results, add more words to the search. With billions of Web pages indexed, adding more words (intelligently) helps to narrow the search results to a better match.
• Use the most distinctive words first.
• Use a minus - to exclude terms.
• Whenever possible, try a phrase search first.
1. Search bozeman, then john bozeman, and then "john bozeman" on Google. Which finds the least? Does john bozeman find both words or either term?
2. Search the phrase productivity quality and outcomes on AltaVista.

Portals, Directories, & Search Engines
Portals: Offer search, directory, and many other general services such as email, free home page building, news, popular topics, etc. Well-known portals include Yahoo!, AOL, MSN, and Lycos.
Directory Function: A subject directory includes selected Web sites (more often than pages) and classifies them into hierarchical subject categories.
Most portals have one, and some specialized directories are available by themselves. They do not index every word on every page included.
Search Engines: Indexing the words on every page in their database, a search engine covers Web pages and can include more than 2 billion.

Primary Search Engines
• While popular sites are well covered by most all of the portals, directories, and search engines, individual pages and lesser known sites are not.
• None are comprehensive, and there is not always overlap between the search engines.
• Sometimes it is more effective to switch to another search engine rather than to stick to just one.
• The search engines have many aspects in common, but they also each differ in important ways.
Use the portals, directories, and search engines for finding the Web site of organizations when you can't guess the URL.

Some reasons for using specific search engines
Google: One of the largest and includes PDF and other file types. Also includes cached copies of the page as it appeared when indexed. Also has News, Usenet (Groups), and Image databases.
AltaVista: One with the most powerful search capabilities. Has relatively unique features such as NEAR, truncation, full Boolean searching, and extensive language limits. Includes PDFs. Also has limited machine translation capabilities.
MSN Search: Powerful search features, a very fresh database, and the default search engine in Internet Explorer.
Teoma: A newer and smaller search engine that specializes in identifying communities on the Web and metasites. Merged with Ask (formerly known as Ask Jeeves).

AND – Finds documents containing all of the specified words or phrases. Peanut AND butter finds documents with both the word peanut and the word butter.
OR – Finds documents containing at least one of the specified words or phrases. Peanut OR butter finds documents containing either peanut or butter.
The found documents could contain both items, but not necessarily.
AND NOT – Excludes documents containing the specified word or phrase. Peanut AND NOT butter finds documents with peanut but not containing butter. NOT must be used with another operator, like AND. AltaVista does not accept 'peanut NOT butter'; instead, specify peanut AND NOT butter.
NEAR – Finds documents containing both specified words or phrases within 10 words of each other. Peanut NEAR butter would find documents with peanut butter, but probably not any other kind of butter.

Wild Cards
* – The asterisk is a wildcard; any letters can take the place of the asterisk. Bass* would find documents with bass, basset and bassinet. You must type at least three letters before the *. You can also place the * in the middle of a word. This is useful when you're unsure about spelling. Colo*r would find documents that contain color and colour.
( ) – Use parentheses to group complex Boolean phrases. For example, (peanut AND butter) AND (jelly OR jam) finds documents with the words 'peanut butter and jelly' or 'peanut butter and jam' or both.

Advanced Searching Tips
link:URLtext – Finds pages with a link to a page with the specified URL text. Use link:www.myway.com to find all pages linking to myway.com.
text:text – Finds pages that contain the specified text in any part of the page other than an image tag, link, or part of the URL. The search text:graduation would find all pages with the term graduation in them.
title:text – Finds pages that contain the specified word or phrase in the page title (which appears in the title bar of most browsers). The search title:sunset would find pages with sunset in the title.
url:text – Finds pages with a specific word or phrase in the URL. Use url:garden to find all pages on all servers that have the word garden anywhere in the host name, path, or filename.

Some of the advanced search features available include the following.
Search Engine      Boolean                              Proximity        Truncation  Limits
Google             -, OR                                "Phrase"         No          Title, domain, filetype, PDF
AlltheWeb / Lycos  +, - [and, or, andnot, ( ) in adv]   "Phrase"         No          Title, PDF, domain, content type
AltaVista          AND, OR, AND NOT, ( ), +, -          "Phrase", NEAR   Yes (*)     Title, date, domain
Teoma              -, OR                                "Phrase"         No          Title, domain, site
MSN Search         AND, OR, NOT                         "Phrase"         No          Title, domain, content type

• Use "phrase searching" whenever possible
• Use link search to find who links to a specific page
• Use the advanced search forms, or at least look to see what they offer. They often include the ability to limit a search to title, a specific domain, or do a link search.
• Add more words to focus results
• Try a title search for subject focus

Search engines won't find…
• New sites
• Dynamic web pages
• Sites that aren't seen by crawler programs
  – Crawlers follow links.
  – If no sites link to your site, a crawler won't find it.
  – Example: The web site for this course!
• Sites explicitly excluded by page author
  – meta tags embedded in page, visible to crawler, but not visible to user viewing the page with a browser.

The Deep Web
• Information on the web, but not visible to spider or crawler programs
• invisible-web.net
• Databases or non-text files
• Data that is in a database.
  – Example: when you search for books at Amazon, it reads a database then builds a page for you.
  – Statistics sites.
• Some binary file formats. Not all search engines can read .pdf files.
  – Also multi-media files, like sound clips, images.

Beyond Search Engines
What's Not Included
• Search Engines attempt to find and index as many sites as possible, but note what is not included.
• Search engines include billions of pages in their databases, but none of the Search Engines come close to indexing the entire Web, much less the entire Internet.
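The point made above, that crawlers follow links and so never find a site nobody links to, can be sketched in a few lines of Python. This is only an illustration: the `crawl` function, the page names, and the in-memory link graph are all hypothetical, and a real crawler fetches pages over the network and is far more involved.

```python
from collections import deque


def crawl(links, seeds):
    """Return the set of pages reachable from the seed pages by following links.

    `links` maps each page to the pages it links to. It is a made-up,
    in-memory stand-in for the web; no network access happens here.
    """
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found


# Hypothetical link graph: no page links to "course-site".
links = {
    "portal": ["news", "directory"],
    "directory": ["university"],
    "university": ["library"],
    "course-site": ["university"],  # links out, but has no inbound links
}

indexed = crawl(links, seeds=["portal"])
print(sorted(indexed))
print("course-site" in indexed)  # False: the crawler never reaches it
```

Starting the crawl from "course-site" itself would find it, which is roughly what submitting a site to a search engine accomplishes in this simplified picture.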
Beyond Search Engines
Here is a list of some of what is missing / not available:
– The content in sites requiring a log in
  • i.e. username & PW
– CGI output such as data requested by a form
– Intranets – pages / sites not linked from anywhere else
– Commercial resources with domain limitations
– Content of Adobe PDF and formatted files
  • (some are now indexed)
– Non-Web resources: Email lists, chat, IM, books, etc.
– Very current information: News, press releases
– Multimedia file content: Words in pictures, sound files, video files

Selected Specialized Search Engines
• News: Daypop, AlltheWeb News, Google News
• Opinions: Google Groups, Epinions
• http://www.searchenginecolossus.com/
  – Search Engine Colossus offers you links to search engines and directories from 195 countries and 46 territories around the world! Conduct extensive web searches!
  – Make your own website submissions!
  – Locate your new favorite search engines!
  – Search the web using your choice of language!
• http://www.search.com/
  – subject collection of search engines
From: http://www.searchengineshowdown.com/phone/

On-Site Search Engines
• Local search engine to find data on a site
• Becoming a necessity for a large site
  – Microsoft has one
  – UMass Lowell has one
• Virus Bulletin has a local installation of Google

News Search
• Search engines usually don't include news in web results.
• Most news on the web is not permanent, at least in free form
• News changes frequently, spiders need to follow different rules
• Google, AltaVista, Lycos, etc. have separate news search services

What is the best search engine?
– The answer changes often!
– It depends on what you want to find.

Specialty Search Engines
• Topic-specific Search Engines
• Medical, Legal, etc.
• More relevant results, less noise
• Example: www.eliyon.com - collects information on business people (15 million)

AltaVista
• Overture to acquire AltaVista from CMGI
  – DEC filed $50M IPO in 1996
  – Compaq sold it to CMGI for $2.2B in 1999
  – Overture buys AltaVista for $140M in 2003
• AltaVista has never been profitable!

Inktomi
• Inktomi provides search technology to Microsoft, AOL, et al.
• Yahoo! to acquire Inktomi
  – Inktomi stock price in June 1998 - $9.00
  – … in March 2000 - $250.00
  – … in February 2003 - $1.62
• Inktomi was profitable briefly in 2001.

Yahoo!
• Yahoo! to buy Overture for $1.63B, pending shareholder vote on October 7.
• Overture has an ongoing agreement with MSN to supply context-specific ads for search result pages.
• Yahoo! used Google for search; recently moved to their own proprietary search engine.

The Business of Search
• Google recently became a public company.
  – Are acquiring companies and many very talented leaders
• Yahoo! has recently acquired several search companies.
• Microsoft has made the decision to deploy their search technology
• Leadership has been shifting and the industry has been consolidating.

Search Engine Revenue?
• Paid inclusion in directories.
• Sponsored links.
• Paid inclusion and/or priority placement in search results.
• License search technology to other sites.
• Advertising.
• Add-on services.