Search Engines: What Are They?

Four Components
- A database of references to webpages
- An indexing robot that crawls the WWW
- An interface
  - Enables users to submit queries
  - Displays results
- Information retrieval system
- Each is unique, but they are mostly the same

Database
- Where the user's query is matched
- Contains only essential parts of pages
- Only includes pages that were indexed
- Search engines are always out of date

Web Crawler
- A robot that follows links
- Records data it finds
  - Words in the webpage
  - Metadata
  - ALT attributes in IMG tags
- Robot Exclusion Protocol

Search Engine Interfaces
- Gathers input from users
- Presents results from the IR system
  - Often in ranked order

Search Engine Interfaces: Input
- User requirements
  - Search expression, search limits
- Presentation style
  - Presentation format, search type

Search Engine Interfaces: Output
- Results
- Descriptions
- Clusters

Search Term Matching
- Trying to find a match in the database
- Two main methods
  - Keyword searching
    - Matching single terms, computing cosine
  - Concept-based searching
    - Examining clusters of words
    - Attempt to determine meaning of query and find records related to that meaning

Basic IR Features
- Boolean operators
  - AND, OR, NOT, grouping
- Extended operators
  - NEAR, ADJACENT, (")
- Stop word deletion
- Stemming
- Searching in fields (e.g. host)

Ranked Output
- Most SEs produce ranked lists by applying simple rules:
  - Early words are more important
  - Title is very important
  - Frequency of occurrence matters for some
  - Infrequent words matter more
  - Modification date
- Google is different:
  - PageRank™ method based on popularity
  - Links as money

Googlebombing
- Google spoofed from the lecture list first hit from 1992
- Official GoogleBlog explanation

What about the Invisible Web?
- Also known as the Deep Web
- Documents that are on the WWW but not indexed by search engines
  - Some are available only by submitting forms
  - Some are not generally accessible (in subnets)
  - Some are not in (X)HTML format

The Invisible Web Isn't So Invisible Anymore…
- More search engines parse non-(X)HTML now than before
- Because of awareness of the problem, companies are making more content available using
  - Stable URLs
  - Robot-friendly sitemaps
- But much content is still not indexed

But, there's still plenty of important yet invisible docs
- How to find them?
  - Many of them are in databases
    - Use database tools from the U.'s library
    - Especially for research articles
  - No one search engine covers everything
    - Use multiple search engines or a metacrawler
    - dogpile is the most famous

Search Engines: A Summary of Practical Advice

How To Succeed With SEs
- As a surfer:
  - If you don't know what you are looking for
    - Use multiple SEs, or a meta-crawler
    - Search within results
  - If you don't know what you are looking for
    - Use multiple SEs, or a meta-crawler
    - Use Boolean expressions or search within results
    - Consider specialized engines

How To Succeed With SEs
- As a creator:
  - HTML level
    - Always use ALT attributes with <IMG>, etc.
    - Avoid frames
  - Make it easier to index
    - Don't expect SEs to find your pages
    - Make links between your pages
  - Use metadata
    - Informal: <meta name="description" …>
    - Formal: Dublin Core and others
  - Increase your pages' popularity
    - Don't use systematic reciprocal linking: rings, exchanges, lists
    - The PageRank™ a page passes along each link is inversely proportional to its outdegree

How To Succeed With SEs
- As a creator (cont.)
  - For surfers:
    - Use <meta name="description" …>
    - Don't expect surfers to start at top of your hierarchy
    - Don't rely on a hierarchy
    - Include a context map near the top of each page
    - Don't use frames
    - Think through dynamic content implications
- Stickiness… is for another day
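
Example: keyword searching with cosine
The "Search Term Matching" slide mentions matching single terms and computing a cosine. A minimal sketch of that idea follows, assuming raw term counts as vector weights and a tiny hypothetical stop-word list. The names STOP_WORDS, PAGES, tokenize, cosine, and rank are illustrative and do not come from the slides; real engines weight terms in more elaborate ways.

```python
import math
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "in", "to", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def cosine(a, b):
    """Cosine of the angle between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, pages):
    """Return (score, page_id) pairs, best match first, skipping zero scores."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), pid)
              for pid, text in pages.items()]
    return sorted([s for s in scored if s[0] > 0], reverse=True)


# Made-up mini "database" of indexed pages.
PAGES = {
    "p1": "the web crawler follows links",
    "p2": "search engine ranking of web pages",
    "p3": "cooking pasta at home",
}

if __name__ == "__main__":
    for score, pid in rank("web crawler", PAGES):
        print(pid, round(score, 3))
```

With this toy data, the page that contains both query terms ranks above the page that contains only one, and the page that shares no terms is left out.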
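
Example: links as money
The "Ranked Output" slide describes PageRank™ as popularity with links as money, and the creator advice notes the role of outdegree. The sketch below is a simplified power-iteration version of that idea, not Google's production method. The damping value 0.85, the iteration count, and the LINKS graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline amount.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page splits its "money" evenly over its outgoing links,
                # so each link carries rank[p] / outdegree.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank


# Made-up link graph: "hub" splits its vote three ways, "fan" gives all to "a".
LINKS = {
    "hub": ["a", "b", "c"],
    "fan": ["a"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}

if __name__ == "__main__":
    for page, score in sorted(pagerank(LINKS).items()):
        print(page, round(score, 4))
```

In this graph, "a" ends up ahead of "b" and "c" because it receives the undivided vote from "fan" in addition to one third of the vote from "hub".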
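
Example: Robot Exclusion Protocol
The "Web Crawler" slide lists the Robot Exclusion Protocol. As a rough illustration of how a polite crawler might consult it, the sketch below uses Python's standard urllib.robotparser module on an in-memory robots.txt. The rules in ROBOTS_TXT, the user agent "ExampleBot", the example.com URLs, and the helper name allowed are all made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/index.html",
                "http://example.com/private/page.html"):
        print(url, allowed(ROBOTS_TXT, "ExampleBot", url))
```

Pages under a disallowed path are skipped by compliant crawlers, which is one reason a document can be on the WWW yet absent from a search engine's database.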
