Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
search engines fdm 20c introduction to digital media lecture 06.06.2007 warren sack / film & digital media department / university of california, santa cruz outline • what’s the difference between a web directory and a search engine? – ontologies – clusters • what does a search engine do? – crawling – indexing – retrieving • searching for difference rather than popularity – an explanation of the search engine experiment what do search engines do? • crawl the web • index the pages found • retrieve pages from the index in response to user queries crawling/spidering the web depth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html breadth-first crawl source: http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/SearchAnimations.html indexing web pages found by the crawler 1. tokenization: break the document into words, punctuation, tags, and uris 2. stop words: remove any that are too frequently used to be informational (these are called stop words) 3. stemming: find the stem of each word 4. inverted index: create an index linking each word found to one or more documents tokenization • T’was was a dark and stormy night in the country manor. ==> • T single-quote was a dark and stormy night in the country manor period stop words • T single-quote was a dark and stormy night in the country manor period ==> • T single-quote was a period dark and stormy night in the country manor stemming • dark stormy night country manor ==> • dark storm night country manor y document vectors • documents (i.e., web pages) are stored as vectors or lists; e.g., country dark manor night storm • note that syntactic structure is lost inverted index • dictionary: make a file listing all of the words found in all of the web pages crawled • postings: make a file listing all of the each web page in which each of the words was found • encoding: remove any redundancies found in the dictionary and postings files so that they are as small as possible inverted index / dictionary • • • • create a document vector for each web page combine all of the vectors together remove any duplicate words alphabetize the resulting list inverted index / dictionary / example • webpage 1: T’was a dark and stormy night in the country manor. ==> country dark manor night storm • webpage 2: Now is the time for all good men to come to the aid of their country. ==> aid all come country good men time • dictionary: aid all come country dark good manor men night storm time inverted index / postings / example • postings: aid (2) all (2) come (2) country (1&2) dark (1) good (2) manor (1) men (2) night (1) storm (1) time (2) • the postings file just annotates the dictionary file with the documents in which the words appear retrieval • the postings file can support boolean queries • for example, searching for webpages that contain the word “country” returns 1&2. • for example, searching for the webpages that contain the words “country” and “manor” returns webpage 1 better retrieval • how should the webpages retrieved be ordered? • e.g., if webpage 1 and webpage 2 both contain the word “country” should 1 belisted before or after 2? social networks and centrality Diane is central; Jane is not. See www.orgnet.com/sna.html google’s page rank algorithm • basic idea: if a webpage has a lot of other webpages that link to it, then it is central and probably an important page. • If page A has pages T1...Tn which link to it (i.e., are citations) and C(A) is defined as the number of links going out of page A, then PageRank (PR) of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) – where the parameter d is a damping factor which can be set between 0 and 1 and is usually set to 0.85. the difference is in the details • The Boolean approach to information retrieval is the most widely used. Links to documents are returned only if they contain exactly the same words as your query. Even if a site is related in some way, it would never be retrieved unless it contains the user's search string. In order to get better results from this, most search engines finesse things a little. The results from the inverted file are modified by factors such as: where the search term appears in the item - whether the string of characters appears in specially tagged areas such as titles or head, or whether it appears early in the body text of the html document; how close the search words are together, or whether they form an 'exact phrase'; the frequency of occurrence of search terms in the inverted file; whether the search terms appear as keywords in the metatags in the html file's head; how frequently the terms appear in relation to the document's length (greater frequency indicating a probable greater relevance). These techniques gradually move over into semantic analysis. For instance selection by the frequency of occurrence of words in general in which very uncommon words get heavier weight. Matthew Fuller