Introduction to Data Structures
Vamshi Ambati

Overview
  Java you need for the project
  Search engines and data structures
  THIS
    Code structure
  On the data structure front
    Dictionaries (dictionary structures)
    Java Collections
    Linked list
    Queue

[c] Vamshi Ambati

Java you will need for the Project
  Core
    Programming + I/O and files
  OOPS
    Inheritance
    Packages
    Encapsulation
  Java API
    Collections

What is a Search Engine?
  A sophisticated tool for finding information on the web
  An index for the World Wide Web
  Analogous to the index in a textbook
  Just imagine a world without search engines!

Why Index in the first place?
  Which list is easier to search?
    sow fox pig eel yak hen ant cat dog hog
    ant cat dog eel fox hen hog pig sow yak
  A sorted list always helps
    It permits binary search
    About log2(n) probes into the list
    log2(1 billion) ~ 30

How search engines work
  Search engines maintain data about web sites in their databases.
  They use programs (often referred to as "spiders" or "robots") to collect information.
  The information is then indexed by the search engine.
  The engine lets users look for words, or combinations of words, found in the index.

Inverted Files
  [Figure: the sample text below shown against a position ruler (FILE POS 1, 10, 20, 30, 36).]
  Sample text: "A file is a list of words and this file contains words at various positions. Each entry of the word is associated with a position."
  INVERTED FILE
    a          (1, 4, 24…)
    entry      (17…)
    file       (2, 10)
    contains   (11…)
    position   (25…)
    positions  (15…)
    word       (20…)
    words      (6, 12…)
    …

Inverted Files for Multiple Documents
  [Figure: a LEXICON table (columns WORD, NDOCS, PTR) with entries jezebel, jezer, jezerit, jeziah, jeziel, jezliah, jezoar, jezrahliah, jezreel; each PTR points into a WORD INDEX table (columns DOCID, OCCUR, POS 1, POS 2, ...).]
  “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 ...
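The inverted files described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name InvertedIndexSketch, its method names, and the tokenizing rule (lowercase, split on non-letters, word positions counted from 1) are hypothetical and are not part of the THIS framework. It also leans on java.util maps for brevity, whereas the project expects the dictionary structures to be hand-coded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: names are illustrative, not taken from the THIS framework.
class InvertedIndexSketch {
    // word -> (document id -> 1-based word positions inside that document)
    private final Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();

    // Split the text on non-letter characters and record each word's position.
    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split() yields a leading empty token if text starts with a non-letter
            }
            position++;
            index.computeIfAbsent(token, w -> new TreeMap<>())
                 .computeIfAbsent(docId, d -> new ArrayList<>())
                 .add(position);
        }
    }

    // Number of documents containing the word (the NDOCS column of a lexicon).
    int documentCount(String word) {
        Map<Integer, List<Integer>> postings = index.get(word.toLowerCase());
        return postings == null ? 0 : postings.size();
    }

    // Positions of the word within one document; an empty list if it does not occur.
    List<Integer> positions(String word, int docId) {
        Map<Integer, List<Integer>> postings = index.get(word.toLowerCase());
        if (postings == null || !postings.containsKey(docId)) {
            return new ArrayList<>();
        }
        return postings.get(docId);
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.addDocument(1, "A file is a list of words");
        idx.addDocument(2, "Each word has a position in the file");
        System.out.println(idx.positions("a", 1));      // prints [1, 4]
        System.out.println(idx.documentCount("file"));  // prints 2
    }
}
```

The outer map plays the role of the lexicon and each inner map is a posting list, so the number of occurrences of a word in a document is simply the length of its position list.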
  A comprehensive form of Inverted Index
  SOURCE: http://www.searchtools.com/slides/bestsearch/bls-24.html

THIS
  Search engine for the website http://www.hinduonnet.com/
    Website of the newspaper The Hindu
  Not for the entire web
    Results are confined to only one web site

Index Structure for our Project (THIS)
  [Figure: a list of index words (…, India, …, ManMohan, …, Cricket, …, Bollywo…, Sharukh, …, Sachin, …), each pointing to a list of "URL :: count" entries such as:]
    http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
    http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
    ..
    http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
    ….
    http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
    …
    http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
    http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
    ..

Search Engines

Search Engine Differences
  Coverage (What part of the web do they really cover?)
  Crawling algorithms
    Frequency of crawl
    Depth of visits
      http://www.msitprogram.net/ : depth 0
      http://www.msitprogram.net/admissions.html/ : depth 1
  Indexing policies
    Data structures
    Representation
  Search interfaces
  Ranking

Search Engine
  Crawl
  Index
  Search

[Diagram: search engine architecture. Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage. Labelled operations: crawl, parse, addUrls, getNextUrl, addPage, store, retrieve, makePage, Sort by Rank.]

Where do our data structures and algorithms fit in?
[Diagram: the same architecture (TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage) annotated with the data structures and algorithms used: Queue, Priority Queue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort.]

Code Structure (THIS)
  [Diagram: class diagram with three kinds of arrows (Inheritance, Uses, Calls). Names shown: Spider, SearchDriver, CrawlerDriver, Crawl, WebSpider, Query, Index, addPage, Restore, Parse, Queue, Save, PageLexer, Indexer, HttpTokenizer, DictionaryDriver, URLTextReader, Index, PageElement, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary, PageImg, PageHref, PageWord.]

Dictionary Structures (Lexicon)
  A Dictionary is an unordered container that holds key-element pairs
  An Ordered Dictionary keeps its elements in sorted order
  Keys are unique, but the values can be anything

Dictionary ADT
  size(): Return the number of items in D. Output: Integer
  isEmpty(): Test whether D is empty. Output: Boolean
  findElement(k): If D contains an item with key == k, then return the element of that item, else return NO_SUCH_KEY. Output: Object (element)
  findAllElements(k): Output: iterator of elements with key k
  insertItem(k,e): Insert an item with element e and key k into D.
  removeElement(k): Remove an item with key == k and return it. If there is no such element, return NO_SUCH_KEY. Output: Object
  removeAllElements(k): Remove from D the items with key == k. Output: iterator of elements
  keys(): Return the keys stored in D. Output: iterator of keys (objects)
  elements(): Return the elements stored in D. Output: iterator of elements (objects)
  Also see the Java Standard API for Dictionary:
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/Dictionary.html

Dictionary ADT in THIS Project
  size(): Return the number of items in D. Output: Integer
  isEmpty(): Test whether D is empty. Output: Boolean
  getValue(k): If D contains an item with key == k, then return the element of that item, else return NULL. Output: Object
  getKeys(): Return all the keys of the elements stored in D. Output: String array (ideally it should be a Vector!)
  insertItem(k,e): Insert an item with element e and key k into D.
  remove(k): Remove the item with key k from D.
  We have customized the Dictionary a bit, as we will be inserting only elements of the type <String,Object>!

Java Collections
  java.util.* (a quite helpful library)
  Has implementations of most data structures
  They make life really easy
  You may not use the built-in data structures unless specified (e.g. Task1 Tasklet-A)
  Use them for non-data-structure purposes
    Collections, e.g. Arrays, Vectors, Iterators, Lists, Sets, etc.
  You will definitely be using "Iterator" at least, as you will be dealing with many objects at a time!
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/Iterator.html
  See: http://java.sun.com/docs/books/tutorial/collections/

Other Data Structures
  Queue
  LinkedList
  Beware! There are no pointers in Java
    However, there are "references"
    Learn more about references in Java
  Do not use the java.util package for data structures or sorting algorithms!
    You are expected to code them

Summary
  Learn data structures by implementing THIS
  A mini version of a real search engine
  The framework is provided
  More details in the next video

THANK YOU
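
As a starting point for the project's customized Dictionary ADT (size, isEmpty, getValue, getKeys, insertItem, remove), here is a minimal linked-list sketch that avoids java.util and uses plain object references where C would use pointers. The class name ListDictionarySketch and the Node class are hypothetical, the real DictionaryInterface in the framework may declare different signatures, and replacing the value when a key already exists is an assumption drawn from the "keys are unique" rule.

```java
// Hypothetical sketch: class and field names are illustrative, not the THIS framework's.
class ListDictionarySketch {
    // One node of a singly linked list; "next" is an object reference, not a pointer.
    private static final class Node {
        final String key;
        Object value;
        Node next;

        Node(String key, Object value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private Node head; // first node, or null when the dictionary is empty
    private int size;

    int size() {
        return size;
    }

    boolean isEmpty() {
        return size == 0;
    }

    // Return the element stored under key, or null if the key is absent.
    Object getValue(String key) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }

    // Return all keys as a String array, most recently inserted first.
    String[] getKeys() {
        String[] keys = new String[size];
        int i = 0;
        for (Node n = head; n != null; n = n.next) {
            keys[i++] = n.key;
        }
        return keys;
    }

    // Insert a key-element pair; assumption: an existing key has its value replaced.
    void insertItem(String key, Object value) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(key)) {
                n.value = value;
                return;
            }
        }
        head = new Node(key, value, head);
        size++;
    }

    // Unlink the node holding key, if any, by re-pointing the previous reference.
    void remove(String key) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) {
                    head = n.next;
                } else {
                    prev.next = n.next;
                }
                size--;
                return;
            }
        }
    }

    public static void main(String[] args) {
        ListDictionarySketch d = new ListDictionarySketch();
        d.insertItem("india", 23);
        d.insertItem("cricket", 3);
        System.out.println(d.size());            // prints 2
        System.out.println(d.getValue("india")); // prints 23
    }
}
```

Every operation except size() and isEmpty() walks the list, so lookups cost time proportional to the number of keys; that is the trade-off a tree or hash dictionary is meant to improve on.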