CS352
Project 2: Creating Word Frequencies from HTML pages using BST [1]
Fall 2009
First Draft
Due date: 10th November 11:59pm
TestScanner file and write-up of your understanding due: 18th October, 11:59pm
CONTENTS
- Project Description
- Problem Description
- Getting Started
- C++ Classes
- Deliverables
COURSE PROJECT DESCRIPTION
This assignment is the first part in a series of related assignments about the World Wide
Web. Our ultimate goal is to build a Web browser with a search engine for a limited
portion of the Web. Your search engine will have some of the features of common search
engines such as Yahoo, Lycos, or Google. By the end of the semester you will have
implemented a Web browser with a fairly sophisticated search engine. Your Web
browser will have the following capabilities:
- It will display a Web page given a URL.
- It will display connectivity information of local Web pages.
- It will answer questions about the connectivity of local Web pages.
- Its search engine will search for good matching Web pages given a query string
  and display the resulting URLs in order of best match to worst (we will redefine
  a "good match" as we add more features to the search engine).
- The Web browser will automatically display the best matching URL result of a
  search.
PROBLEM DESCRIPTION
The first step to building a search engine is to be able to analyze web pages based on their
content. This will eventually allow us to rate how relevant a particular page is for a given
user request. The content of a page is obviously correlated with the words it contains. For
this assignment you will use a binary search tree to help you calculate and store the
frequencies of words found in a given web page. Then you will print out all words (in
alphabetical order) that occur at least the minimum frequency number of times, along with
each word's frequency count. For example:
WORD          FREQUENCY
----          ---------
artificial    5
cat           3
dog           7
...
[1] Based on an assignment from Swarthmore College
GETTING STARTED
Begin by writing a test class for the Scanner class and experimenting with it. The
Scanner class is written to scan a file, and that is how you will use it in this
assignment. Its constructor requires a filename and opens the given file; each call to
the method getNextWord() then returns the next token that it finds. For our purposes a
word/token is a string of contiguous alphabetic and numeric characters that ends at the
next white space (special characters count as white space here) or at the end of file.
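To make the token rule concrete, here is a minimal sketch of a tokenizer that follows it, reading the rule as treating special characters as delimiters. This is not the provided Scanner class: the function names nextWord and tokenize are hypothetical, and the real getNextWord() may behave differently, which is exactly what your ScannerTest should reveal.

```cpp
#include <cctype>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the provided Scanner (not its real API).
// Reads the next run of letters and digits from `in`, skipping white space
// and special characters. Returns an empty string at end of input.
std::string nextWord(std::istream& in) {
    std::string word;
    char c;
    while (in.get(c)) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            word += c;
        } else if (!word.empty()) {
            break;  // a delimiter ends the current token
        }
    }
    return word;
}

// Collects every token in `text`, in order of appearance.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    for (std::string w = nextWord(in); !w.empty(); w = nextWord(in)) {
        words.push_back(w);
    }
    return words;
}
```

Under this reading, HTML tag names such as "b" or "html" come back as ordinary tokens, which is one reason an ignore list is useful.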
Your program will take two or three command line arguments:
Command:> WordFrequencyTree input_file_name min_freq_num [ignore_file_name]
Note: Do not create a menu option. You must take inputs from the command line.
The first argument is an input file that your program will process using the Scanner
class. You should use .html input files; you can test your WordFrequencyTree by parsing
these files. The second argument, min_freq_num, determines which elements of your
WordFrequencyTree your program prints out (only words with a frequency count >=
min_freq_num). The optional third argument is an ignore list file containing words that
should be ignored in the input file. If an ignore list is given on the command line, your
program should process it first and create a data structure for storing the list of
tokens to ignore. Next, your program should process the input file. Add every token in
the input file that is not on the ignore list to a binary search tree sorted
alphabetically. Only unique words should be added to the tree. When a duplicate word is
encountered, just increment the frequency count of the word that is already in the tree.
After the file has been read, record the final height of the tree. Based on this height,
give the worst case search time for a node at this height. Then traverse the tree and
print out, in alphabetical order, all the words in the WordFrequencyTree that occur at
least min_freq_num times. With each word you should list its frequency count.
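The processing steps above (skip ignored tokens, insert or count, measure the height, traverse in order) can be sketched as follows. This is only an illustration under stated assumptions: WordNode, addWord, height, collect and buildTree are hypothetical names, not the provided LinkedBTNode or LinkedBinarySearchTree classes, and the ignore list is held in a std::set, which is one possible choice of data structure.

```cpp
#include <algorithm>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type; the provided LinkedBTNode will differ.
struct WordNode {
    std::string word;
    int count = 1;
    std::unique_ptr<WordNode> left, right;
    explicit WordNode(std::string w) : word(std::move(w)) {}
};

// Inserts `word` if it is absent, otherwise increments its count.
void addWord(std::unique_ptr<WordNode>& root, const std::string& word) {
    std::unique_ptr<WordNode>* cur = &root;
    while (*cur) {
        WordNode& node = **cur;
        if (word == node.word) { ++node.count; return; }
        cur = (word < node.word) ? &node.left : &node.right;
    }
    cur->reset(new WordNode(word));
}

// Height counted in edges (an assumed convention): -1 for an empty tree,
// 0 for a single node. A search ending at depth h compares against h + 1 nodes.
int height(const WordNode* node) {
    if (!node) return -1;
    return 1 + std::max(height(node->left.get()), height(node->right.get()));
}

// In-order traversal: appends (word, count) pairs with count >= minFreq,
// which yields them in alphabetical order.
void collect(const WordNode* node, int minFreq,
             std::vector<std::pair<std::string, int>>& out) {
    if (!node) return;
    collect(node->left.get(), minFreq, out);
    if (node->count >= minFreq) out.emplace_back(node->word, node->count);
    collect(node->right.get(), minFreq, out);
}

// Builds the tree from tokens, skipping any that appear in `ignore`.
std::unique_ptr<WordNode> buildTree(const std::vector<std::string>& tokens,
                                    const std::set<std::string>& ignore) {
    std::unique_ptr<WordNode> root;
    for (const std::string& t : tokens) {
        if (ignore.count(t) == 0) addWord(root, t);
    }
    return root;
}
```

Note that std::string comparison is case-sensitive, so "Dog" and "dog" would be stored as different words here; whether to normalize case is a design decision the assignment leaves to you.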
The following are examples of what your command line should look like and what your
program should do for different calls:
# program will add Nodes to the BST for each token
# in the index.html input file that is not on the html_ignore_list
# and will print out in alphabetical order all words that occur
# at least 3 times in the tree
#
% WebSearch.exe index.html 3 html_ignore_list
#
# program will add Nodes to the BST for each token
# in the index.html input file and will print out in alphabetical
# order all words that occur at least 4 times in the tree
#
% WebSearch.exe index.html 4
#
# program will print a usage message to stdout and exit
# (ex) usage: inputfile min_freq_num
#         or: inputfile min_freq_num ignorelist
#
% WebSearch.exe index.html
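The argument handling shown in these examples might be sketched as below. The Options struct and the parseArgs and usage functions are hypothetical names introduced for illustration, and rejecting a negative or non-numeric min_freq_num is an assumption that the assignment does not state.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical container for the parsed command line.
struct Options {
    std::string inputFile;
    int minFreq = 0;
    std::string ignoreFile;  // empty when no ignore list was given
    bool valid = false;      // false means: print the usage message and exit
};

// Parses the arguments that follow the program name.
Options parseArgs(const std::vector<std::string>& args) {
    Options opts;
    if (args.size() < 2 || args.size() > 3) return opts;

    // min_freq_num must be a whole, non-negative number (an assumption).
    char* end = nullptr;
    long n = std::strtol(args[1].c_str(), &end, 10);
    if (end == args[1].c_str() || *end != '\0' || n < 0) return opts;

    opts.inputFile = args[0];
    opts.minFreq = static_cast<int>(n);
    if (args.size() == 3) opts.ignoreFile = args[2];
    opts.valid = true;
    return opts;
}

// Usage text modeled on the example above.
std::string usage() {
    return "usage: inputfile min_freq_num\n"
           "   or: inputfile min_freq_num ignorelist\n";
}
```

In main, the vector could be built with std::vector<std::string> args(argv + 1, argv + argc); when parseArgs returns an invalid result, print usage() to stdout and exit.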
CLASSES TO IMPLEMENT
- Complete the implementation of LinkedBinarySearchTree.
- Create a WordFrequencyTree class that implements the BinarySearchTree
  interface. This class will have a LinkedBinarySearchTree data member (it may
  need additional data members too), and its implementations of the
  BinarySearchTree interface methods simply invoke the corresponding methods on
  its LinkedBinarySearchTree data member. In addition, it should have at least
  three constructors: a default constructor, a constructor that takes an input file
  argument, and a constructor that takes an input file and an ignore file argument.
  The latter two constructors will process the file(s) and create the initial word
  frequency tree as specified above.
- Part of this assignment includes developing a good ignore list file for html files.
  As you test your program on different .html input files, develop a list of words
  that should be ignored in html files, and use this list as the optional third
  command line argument to your program. You will submit your html_ignore_list
  file as part of your homework solution.
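The WordFrequencyTree design described above (a wrapper that forwards interface calls to a tree data member and offers three constructors) might be outlined as follows. Everything here is a stand-in: WordTreeInterface and StandInTree are hypothetical replacements for the provided BinarySearchTree interface and LinkedBinarySearchTree class, whose real method names are not given in this handout. The stand-in tree is backed by std::map, and files are read as white-space-separated words rather than through the Scanner, purely to keep the sketch short.

```cpp
#include <cstddef>
#include <fstream>
#include <map>
#include <set>
#include <string>

// Hypothetical interface; the real BinarySearchTree header will differ.
class WordTreeInterface {
public:
    virtual ~WordTreeInterface() = default;
    virtual void add(const std::string& word) = 0;
    virtual int frequency(const std::string& word) const = 0;
    virtual std::size_t size() const = 0;
};

// Stand-in for LinkedBinarySearchTree, backed by std::map for brevity.
class StandInTree : public WordTreeInterface {
public:
    void add(const std::string& word) override { ++counts_[word]; }
    int frequency(const std::string& word) const override {
        std::map<std::string, int>::const_iterator it = counts_.find(word);
        return it == counts_.end() ? 0 : it->second;
    }
    std::size_t size() const override { return counts_.size(); }
private:
    std::map<std::string, int> counts_;
};

// Wrapper: each interface method forwards to the tree data member.
class WordFrequencyTree : public WordTreeInterface {
public:
    WordFrequencyTree() = default;
    explicit WordFrequencyTree(const std::string& inputFile) { load(inputFile); }
    WordFrequencyTree(const std::string& inputFile, const std::string& ignoreFile) {
        std::ifstream in(ignoreFile);
        for (std::string w; in >> w;) ignore_.insert(w);  // ignore list first
        load(inputFile);
    }

    void add(const std::string& word) override { insert(word); }
    int frequency(const std::string& word) const override { return tree_.frequency(word); }
    std::size_t size() const override { return tree_.size(); }

private:
    // Adds a word unless it is on the ignore list.
    void insert(const std::string& word) {
        if (ignore_.count(word) == 0) tree_.add(word);
    }
    // Reads white-space-separated words; the real class would use the Scanner.
    void load(const std::string& inputFile) {
        std::ifstream in(inputFile);
        for (std::string w; in >> w;) insert(w);
    }

    StandInTree tree_;
    std::set<std::string> ignore_;
};
```

The point of the composition is that WordFrequencyTree adds file processing and the ignore list while leaving all tree mechanics to the data member.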
C++ CLASSES
The following classes are available to you:
- Scanner class (.h & .cpp)
- Node interface (pure abstract) (.h)
- BinarySearchTree interface (pure abstract) (.h)
- LinkedBTNode (.h & .cpp)
- LinkedBinarySearchTree (.h & .cpp)
- TraversalIterator (.h & .cpp)
DELIVERABLES on 18th October
What do you understand about the project? Answer the following questions:
- In short, what is expected of you in this project?
- What classes/functions need to be implemented?
- What will be the data members of the WordFrequencyTree class?
- What is your understanding of the ignore list? How will you create, store, and
  use it as part of the project?
Write a ScannerTest.cpp to test the working of the Scanner class (input a .html file and
see what getNextWord() returns).
Submit the Word file and ScannerTest.cpp.
DELIVERABLES on 10th November
- Modified code, mainly LinkedBinarySearchTree.cpp, WordFrequencyTree.cpp, the main
  file, and output files (copy the output to a file).
- Your html input file and your ignore list file.
- Your project report.