Introduction to Data Structures
Vamshi Ambati
Overview
- Java you need for the Project
- Search Engine and Data Structures
- THIS Code Structure
- On the Data Structure front
  - Dictionaries (Dictionary Structures)
  - Java Collections
  - Linked List
  - Queue
[c] Vamshi Ambati
Java you will need for the Project
- Core Programming + I/O and Files
- OOPS
  - Inheritance
  - Packages
  - Encapsulation
- Java API
  - Collections
What is a Search Engine?
- A sophisticated tool for finding information on the web
- An index for the World Wide Web
  - Analogous to the index of a textbook
- Just imagine a world without search engines!
Why Index in the first place?
Which list is easier to search?
- sow fox pig eel yak hen ant cat dog hog
- ant cat dog eel fox hen hog pig sow yak
A sorted list always helps
- It permits binary search: about log2(n) probes into the list
- log2(1 billion) ≈ 30
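The probe count above can be checked with a small sketch. This is only an illustration, assuming a sorted String array; the class and method names (BinarySearchDemo, indexOf, maxProbes) are made up for this example and are not part of the project code.

```java
// Minimal sketch: binary search over a sorted array, counting probes.
class BinarySearchDemo {
    /** Number of probes made by the most recent call to indexOf. */
    static int lastProbes;

    /** Returns the index of target in the sorted array, or -1 if it is absent. */
    static int indexOf(String[] sorted, String target) {
        int lo = 0;
        int hi = sorted.length - 1;
        lastProbes = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            lastProbes++;
            int cmp = target.compareTo(sorted[mid]);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                hi = mid - 1;   // target sorts before the middle element
            } else {
                lo = mid + 1;   // target sorts after the middle element
            }
        }
        return -1;
    }

    /** Worst-case number of probes for a list of n items: floor(log2(n)) + 1. */
    static int maxProbes(long n) {
        int probes = 0;
        while (n > 0) {
            n /= 2;
            probes++;
        }
        return probes;
    }

    public static void main(String[] args) {
        String[] animals = {"ant", "cat", "dog", "eel", "fox", "hen", "hog", "pig", "sow", "yak"};
        int index = indexOf(animals, "pig");
        System.out.println("pig is at index " + index + " after " + lastProbes + " probes");
        System.out.println("worst case for 10 items: " + maxProbes(10));
        System.out.println("worst case for 1 billion items: " + maxProbes(1_000_000_000L));
    }
}
```

On the sorted animal list, "pig" is found in 2 probes, and no lookup needs more than 4. For a billion items the worst case is 30 probes.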
How search engines work
- A search engine maintains data about web sites in its database.
- It uses programs (often referred to as "spiders" or "robots") to collect information.
- The search engine then indexes the collected information.
- Users can then look up words, or combinations of words, found in the index.

Inverted Files
FILE (shown with a POS ruler marked 1, 10, 20, 30, 36):
A file is a list of words and this file contains words at various positions. Each entry of the word is associated with a position.
INVERTED FILE (each word with the word positions at which it occurs):
a (1, 4, 24, …)
entry (17, …)
file (2, 10)
contains (11, …)
position (25, …)
positions (15, …)
word (20, …)
words (7, 12, …)
.
.
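A rough sketch of how such an inverted file could be built from a piece of text. It assumes positions are 1-based word positions and that words are lower-cased runs of letters; the class name InvertedFileDemo is made up for this example. It uses java.util collections purely for brevity, which the project itself does not allow for its data structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: build an inverted file (word -> positions) for one text.
class InvertedFileDemo {
    /** Maps each lower-cased word to the 1-based word positions where it occurs. */
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> index = new TreeMap<>(); // keeps words sorted
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            position++;
            index.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
        }
        return index;
    }

    public static void main(String[] args) {
        String text = "A file is a list of words and this file contains words at "
                + "various positions. Each entry of the word is associated with a position.";
        for (Map.Entry<String, List<Integer>> entry : invert(text).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

For the sentence on the slide, "file" maps to positions 2 and 10, and "a" maps to 1, 4 and 24.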
Inverted Files for Multiple Documents
[Figure: a LEXICON table (columns WORD, NDOCS, PTR) whose entries point into a WORD INDEX of postings (columns DOCID, OCCUR, POS 1, POS 2, ...). Lexicon words shown: jezebel (NDOCS 20), jezer (3), jezerit (1), jeziah (1), jeziel (1), jezliah (1), jezoar (1), jezrahliah (1), jezreel.]
“jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . .
A comprehensive form of Inverted Index
SOURCE: http://www.searchtools.com/slides/bestsearch/bls-24.html
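One possible way to model a lexicon with per-document postings is sketched below. The field names mirror the figure's columns (NDOCS, DOCID, OCCUR, POS), but the class MultiDocIndexDemo, its methods, and the sample sentences are assumptions made for this example. It also assumes each document is added in a single call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch: lexicon (word -> postings), one posting per document.
class MultiDocIndexDemo {
    /** One posting: a document id plus the word positions inside that document. */
    static class Posting {
        final int docId;
        final List<Integer> positions = new ArrayList<>();

        Posting(int docId) {
            this.docId = docId;
        }

        int occurrences() {
            return positions.size();
        }
    }

    private final Map<String, List<Posting>> lexicon = new TreeMap<>();

    /** Indexes one document; assumes each docId is added exactly once. */
    void addDocument(int docId, String text) {
        int position = 0;
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            position++;
            List<Posting> postings = lexicon.computeIfAbsent(token, k -> new ArrayList<>());
            // The posting for the document being added, if present, is always the last one.
            if (postings.isEmpty() || postings.get(postings.size() - 1).docId != docId) {
                postings.add(new Posting(docId));
            }
            postings.get(postings.size() - 1).positions.add(position);
        }
    }

    /** NDOCS: number of documents that contain the (lower-case) word. */
    int ndocs(String word) {
        List<Posting> postings = lexicon.get(word);
        return postings == null ? 0 : postings.size();
    }

    /** OCCUR: how often the word occurs in one document; 0 if it does not. */
    int occurrences(String word, int docId) {
        List<Posting> postings = lexicon.get(word);
        if (postings == null) {
            return 0;
        }
        for (Posting p : postings) {
            if (p.docId == docId) {
                return p.occurrences();
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        MultiDocIndexDemo index = new MultiDocIndexDemo();
        index.addDocument(34, "Jezebel spoke and Jezebel left"); // made-up text
        index.addDocument(44, "Then Jezebel returned");          // made-up text
        System.out.println("NDOCS(jezebel) = " + index.ndocs("jezebel"));
        System.out.println("OCCUR(jezebel, 34) = " + index.occurrences("jezebel", 34));
    }
}
```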
THIS
- Search engine for the website http://www.hinduonnet.com/
  - Website of the newspaper The Hindu
- Not for the entire web
- Results are confined to only one web site

Index Structure for our Project (THIS)
[Figure: each indexed word (India, ManMohan, Cricket, Bollywo…, Sharukh, Sachin, ….) points to a list of "URL :: number" entries.]
Entries shown:
http://www.hindu.com/2004/10/09/stories/2004100904051900.htm :: 23
http://www.hindu.com/2004/10/09/stories/2004100910970300.htm :: 3
..
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2002102700140200.htm&date=2002/10/27/&prd=mag :: 7
….
http://www.hinduonnet.com/thehindu/thscrip/print.pl?file=2004091500081100.htm&date=2004/09/15/&prd=bl :: 4
…
http://www.hinduonnet.com/thehindu/gallery/0166/016606.htm :: 2
http://www.hinduonnet.com/thehindu/gallery/0048/004807.htm :: 1
..
Search Engines
Search Engine Differences
- Coverage (What part of the web do they really cover?)
- Crawling algorithms
  - Frequency of crawl
  - Depth of visits
    - http://www.msitprogram.net/ : Depth 0
    - http://www.msitprogram.net/admissions.html/ : Depth 1
- Indexing policies
  - Data Structures
  - Representation
- Search interfaces
- Ranking

Search Engine
Crawl -> Index -> Search
[Figure: architecture of the search engine. Components: TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage. Labelled operations: crawl, parse, addUrls, getNextUrl, addPage, store, retrieve, Sort by Rank, makePage.]
Where do our data structures and algorithms live?
[Figure: the search engine architecture (TheWeb, Spider, Parser, URLList, Indexer, Index, Query, ResultSet, FinalResult, ResultPage) annotated with the data structures and algorithms used: Queue, Priority Queue, Hashtable, BinaryTree, LinkedList, MergeSort & InsertionSort.]
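As one illustration of the "Sort by Rank" step, here is a hand-coded merge sort that orders results from highest to lowest rank. The Result class, its fields, and the sample values are hypothetical; the project's actual result type may look quite different.

```java
// Minimal sketch: hand-written merge sort of search results by descending rank.
class RankSortDemo {
    /** A search hit; url and rank are illustrative fields only. */
    static class Result {
        final String url;
        final int rank;

        Result(String url, int rank) {
            this.url = url;
            this.rank = rank;
        }
    }

    /** Sorts the array in place, highest rank first. */
    static void mergeSortByRank(Result[] a) {
        if (a.length < 2) {
            return; // zero or one element is already sorted
        }
        int mid = a.length / 2;
        Result[] left = new Result[mid];
        Result[] right = new Result[a.length - mid];
        System.arraycopy(a, 0, left, 0, mid);
        System.arraycopy(a, mid, right, 0, a.length - mid);
        mergeSortByRank(left);
        mergeSortByRank(right);

        int i = 0;
        int j = 0;
        int k = 0;
        while (i < left.length && j < right.length) {
            // ">=" keeps equally ranked results in their original order (stable).
            a[k++] = (left[i].rank >= right[j].rank) ? left[i++] : right[j++];
        }
        while (i < left.length) {
            a[k++] = left[i++];
        }
        while (j < right.length) {
            a[k++] = right[j++];
        }
    }

    public static void main(String[] args) {
        Result[] results = {
            new Result("page-a", 3), new Result("page-b", 23), new Result("page-c", 7)
        };
        mergeSortByRank(results);
        for (Result r : results) {
            System.out.println(r.url + " :: " + r.rank);
        }
    }
}
```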
Code Structure (THIS)
[Figure: class diagram. Legend (relationship types): Inheritance, Uses, Calls. Classes: SearchDriver, CrawlerDriver, DictionaryDriver, Spider, WebSpider, Queue, PageLexer, HttpTokenizer, URLTextReader, PageElement, PageImg, PageHref, PageWord, Indexer, Index, DictionaryInterface, ListDictionary, TreeDictionary, HashDictionary. Labelled operations: Crawl, Query, Index, addPage, Parse, Save, Restore.]
Dictionary Structures (Lexicon)
- A Dictionary is an unordered container that holds key-element pairs
- An Ordered Dictionary keeps its elements in sorted order
- Keys are unique, but the values can be anything
Dictionary ADT
- size(): Return the number of items in D.
  Output: Integer
- isEmpty(): Test whether D is empty.
  Output: Boolean
- elements(): Return the elements stored in D.
  Output: iterator of elements (objects)
- keys(): Return the keys stored in D.
  Output: iterator of keys (objects)
- findElement(k): If D contains an item with key == k, then return the element of that item, else return NO_SUCH_KEY.
  Output: Object (element)
- findAllElements(k):
  Output: iterator of elements with key k
- insertItem(k,e): Insert an item with element e and key k into D.
- removeElement(k): Remove an item with key == k and return it. If there is no such element, return NO_SUCH_KEY.
  Output: Object
- removeAllElements(k): Remove from D the items with key == k.
  Output: iterator of elements
Also see the Java Standard API for Dictionary
http://java.sun.com/j2se/1.4.2/docs/api/java/util/Dictionary.html
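For comparison, the sketch below exercises the standard java.util.Dictionary API through its Hashtable subclass. The comments map each call to the nearest ADT operation; that mapping is our reading, not something the API states. Note that get and remove return null rather than NO_SUCH_KEY, and that the Javadoc marks Dictionary as obsolete in favour of Map. The class name and sample entries are made up.

```java
import java.util.Dictionary;
import java.util.Enumeration;
import java.util.Hashtable;

// Minimal sketch: the standard Dictionary API, via its Hashtable subclass.
class StandardDictionaryDemo {
    static Dictionary<String, Integer> build() {
        Dictionary<String, Integer> d = new Hashtable<>();
        d.put("india", 23);   // roughly insertItem(k, e)
        d.put("cricket", 7);
        return d;
    }

    public static void main(String[] args) {
        Dictionary<String, Integer> d = build();
        System.out.println("size: " + d.size());          // size()
        System.out.println("empty: " + d.isEmpty());      // isEmpty()
        System.out.println("india: " + d.get("india"));   // roughly findElement(k)
        System.out.println("missing: " + d.get("none"));  // null, not NO_SUCH_KEY

        Enumeration<String> keys = d.keys();               // keys()
        while (keys.hasMoreElements()) {
            System.out.println("key: " + keys.nextElement());
        }

        Integer removed = d.remove("cricket");             // roughly removeElement(k)
        System.out.println("removed: " + removed + ", size now " + d.size());
    }
}
```

Because keys are unique in java.util.Dictionary, it has no counterpart to findAllElements(k) or removeAllElements(k).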
Dictionary ADT in THIS Project
- size(): Return the number of items in D.
  Output: Integer
- isEmpty(): Test whether D is empty.
  Output: Boolean
- getKeys(): Return all the keys of the elements stored in D.
  Output: String array (Ideally it should be Vector!!)
- getValue(k): If D contains an item with key == k, then return the element of that item, else return NULL.
  Output: Object
- insertItem(k,e): Insert an item with element e and key k into D.
- remove(k): Remove an item with key k from D.
We have customized the Dictionary a bit, as we will be inserting only elements of the type <String,Object>!!
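The operations above could be written down roughly as follows. The method names come from the slide, and DictionaryInterface and ListDictionary are names that appear in the code structure diagram, but the exact signatures, the overwrite-on-duplicate-key behaviour, and the linked-list layout are assumptions made for this sketch, not the framework's actual code.

```java
/** Sketch of the dictionary contract; signatures are assumed from the slide. */
interface DictionaryInterface {
    int size();
    boolean isEmpty();
    String[] getKeys();
    Object getValue(String key);
    void insertItem(String key, Object element);
    void remove(String key);
}

/** Unordered dictionary backed by a hand-written singly linked list. */
class ListDictionary implements DictionaryInterface {
    private static class Node {
        final String key;
        Object element;
        Node next;

        Node(String key, Object element, Node next) {
            this.key = key;
            this.element = element;
            this.next = next;
        }
    }

    private Node head;
    private int size;

    public int size() { return size; }

    public boolean isEmpty() { return size == 0; }

    public String[] getKeys() {
        String[] keys = new String[size];
        int i = 0;
        for (Node n = head; n != null; n = n.next) {
            keys[i++] = n.key;
        }
        return keys;
    }

    public Object getValue(String key) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(key)) return n.element;
        }
        return null; // no item with this key
    }

    public void insertItem(String key, Object element) {
        for (Node n = head; n != null; n = n.next) {
            if (n.key.equals(key)) {   // assumption: keys are unique, so overwrite
                n.element = element;
                return;
            }
        }
        head = new Node(key, element, head);
        size++;
    }

    public void remove(String key) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.key.equals(key)) {
                if (prev == null) head = n.next; else prev.next = n.next;
                size--;
                return;
            }
        }
    }

    public static void main(String[] args) {
        DictionaryInterface d = new ListDictionary();
        d.insertItem("india", 23);
        d.insertItem("cricket", 7);
        System.out.println(d.size() + " keys, india -> " + d.getValue("india"));
    }
}
```

Every operation except size() and isEmpty() walks the list, so lookups take time proportional to the number of keys. The TreeDictionary and HashDictionary named in the diagram presumably exist to improve on that.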
Java Collections
- java.util.* (a quite helpful library)
  - Has implementations for most of the data structures
  - They make life really easy
  - You cannot use the built-in data structures unless specified (e.g., Task1 Tasklet-A)
- Use them for non-data-structure purposes (Collections)
  - E.g., Arrays, Vectors, Iterators, Lists, Sets, etc.
  - You will definitely be using "Iterator" at least, as you will be dealing with many objects at a time!
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/Iterator.html
- See: http://java.sun.com/docs/books/tutorial/collections/
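A short illustration of walking a collection with an Iterator, including removal during iteration. The IteratorDemo class and its helper methods are made up for this example. It uses current Java syntax with generics, which the 1.4.2 documentation linked above predates.

```java
import java.util.Iterator;
import java.util.Vector;

// Minimal sketch: visiting and filtering many objects with an Iterator.
class IteratorDemo {
    /** Removes every word shorter than minLength, using Iterator.remove(). */
    static void removeShortWords(Vector<String> words, int minLength) {
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            if (it.next().length() < minLength) {
                it.remove(); // safe way to delete while iterating
            }
        }
    }

    /** Joins all words with single spaces. */
    static String join(Vector<String> words) {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = words.iterator();
        while (it.hasNext()) {
            sb.append(it.next());
            if (it.hasNext()) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Vector<String> words = new Vector<>();
        words.add("a");
        words.add("search");
        words.add("of");
        words.add("engine");
        removeShortWords(words, 3);
        System.out.println(join(words));
    }
}
```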
Other Data structures
- Queue
- LinkedList
  - Beware! There are no pointers in Java
  - However, there are "references"
  - Learn more about references in Java
- Do not use the java.util package for data structures or sorting algorithms! You are expected to code them yourself
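To show how references stand in for pointers, here is a small queue built from linked nodes without java.util. It stores Strings (URLs waiting to be crawled, say); the class name LinkedQueue and its methods are assumptions for this sketch and need not match the Queue class in the framework.

```java
// Minimal sketch: a FIFO queue of Strings built from linked nodes.
class LinkedQueue {
    private static class Node {
        final String value;
        Node next; // reference to the following node, or null at the tail

        Node(String value) {
            this.value = value;
        }
    }

    private Node head; // oldest element, removed first
    private Node tail; // newest element, added last
    private int size;

    /** Adds a value at the tail of the queue. */
    void enqueue(String value) {
        Node node = new Node(value);
        if (tail == null) {
            head = node;      // queue was empty
        } else {
            tail.next = node; // link the old tail to the new node
        }
        tail = node;
        size++;
    }

    /** Removes and returns the oldest value, or null if the queue is empty. */
    String dequeue() {
        if (head == null) {
            return null;
        }
        String value = head.value;
        head = head.next;
        if (head == null) {
            tail = null;      // queue became empty
        }
        size--;
        return value;
    }

    boolean isEmpty() {
        return size == 0;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        LinkedQueue urls = new LinkedQueue();
        urls.enqueue("http://www.msitprogram.net/");
        urls.enqueue("http://www.msitprogram.net/admissions.html/");
        while (!urls.isEmpty()) {
            System.out.println(urls.dequeue());
        }
    }
}
```

Assigning head = head.next simply drops the only reference to the old head node, and the garbage collector reclaims it. There is no explicit free step as there would be with pointers in C.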
Summary
- Learn data structures by implementing THIS
- A mini version of a real search engine
- The framework is provided
- More details in the next video
THANK YOU