CS352 Project 2: Creating Word Frequencies from HTML Pages Using a BST [1]
Fall 2009

First Draft Due date: 10th November, 11:59pm
TestScanner file and write-up of your understanding due: 18th October, 11:59pm

CONTENTS

- Project Description
- Problem Description
- Getting Started
- C++ Classes
- Deliverables

COURSE PROJECT DESCRIPTION

This assignment is the first part in a series of related assignments about the World Wide Web. Our ultimate goal is to build a Web browser with a search engine for a limited portion of the Web. Your search engine will have some of the features of common search engines such as Yahoo, Lycos, or Google. By the end of the semester you will have implemented a Web browser with a fairly sophisticated search engine. Your Web browser will have the following capabilities:

- It will display a Web page given a URL.
- It will display connectivity information of local Web pages.
- It will answer questions about the connectivity of local Web pages.
- Its search engine will search for good matching Web pages given a query string and display the resulting URLs in order of best match to worst (we will re-define a "good match" as we add more features to the search engine).
- The Web browser will automatically display the best matching URL result of a search.

PROBLEM DESCRIPTION

The first step in building a search engine is to be able to analyze web pages based on their content. This will eventually allow us to rate how relevant a particular page is for a given user request. The content of a page is correlated with the words it contains. For this assignment you will use a binary search tree to calculate and store the frequencies of words found in a given web page. Then you will print out, in alphabetical order, all words that occur at least the minimum frequency number of times, together with each word's frequency count:

    WORD          FREQUENCY
    ----          ---------
    artificial    5
    cat           3
    dog           7
    ...

[1] Based on an assignment from Swarthmore College.

GETTING STARTED

Begin by writing a test class for the Scanner class and experimenting with it. The Scanner class is written to scan a file, and that is how you will use it in this assignment. The constructor for the Scanner class requires a filename. It opens the given file, and each call to the method getNextWord() returns the next token that it finds. For our purposes a word/token is a string of contiguous alphabetic and numeric characters that ends when white space (all special characters included) or the end of file is encountered.

Your program will take two or three command line arguments:

    Command:> WordFrequencyTree input_file_name min_freq_num [ignore_file_name]

Note: Do not create a menu option. You must take inputs from the command line.

The first argument is an input file that your program will process using the Scanner class. You should use .html input files. You can test your WordFrequencyTree by parsing these files.

The second argument, min_freq_num, determines which elements of your WordFrequencyTree your program prints out (only words with a frequency count >= min_freq_num).

The optional third argument is an ignore list file containing words that should be ignored in the input file. If there is an ignore list command line argument, your program should process the ignore list first and create a data structure for storing the list of tokens to ignore.

Next, your program should process the input file. Add each token in the input file that is not on the ignore list to a binary search tree sorted alphabetically. Only unique words should be added to the tree. When a duplicate word is encountered, just update the frequency count of the word that is already in the tree. After the file has been read, record the final height of the tree. Based on this height, give the worst case search time for a node at this height.
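The tree-building steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the course's LinkedBinarySearchTree: the names FreqNode, addWord, height, and collectAtLeast are hypothetical, and the sketch assumes height is counted in edges (an empty tree has height -1, a single node has height 0), so a search that ends at the deepest node makes height + 1 comparisons.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; the provided LinkedBTNode may differ.
struct FreqNode {
    std::string word;
    int count;  // how many times the word has been seen
    std::unique_ptr<FreqNode> left;
    std::unique_ptr<FreqNode> right;
    explicit FreqNode(const std::string& w) : word(w), count(1) {}
};

// Add a word to the tree, or increment its count if it is already present.
void addWord(std::unique_ptr<FreqNode>& root, const std::string& word) {
    assert(!word.empty());  // assumption: the scanner never yields empty tokens
    std::unique_ptr<FreqNode>* slot = &root;
    while (*slot) {
        FreqNode& node = **slot;
        if (word == node.word) {
            ++node.count;  // duplicate: update the frequency only
            return;
        }
        slot = (word < node.word) ? &node.left : &node.right;
    }
    slot->reset(new FreqNode(word));  // unique word: new leaf
}

// Height in edges: -1 for an empty tree, 0 for a single node.
int height(const FreqNode* node) {
    if (node == nullptr) return -1;
    return 1 + std::max(height(node->left.get()), height(node->right.get()));
}

// In-order traversal: appends (word, count) pairs in alphabetical order,
// keeping only words whose count is at least minFreq.
void collectAtLeast(const FreqNode* node, int minFreq,
                    std::vector<std::pair<std::string, int>>& out) {
    if (node == nullptr) return;
    collectAtLeast(node->left.get(), minFreq, out);
    if (node->count >= minFreq) out.emplace_back(node->word, node->count);
    collectAtLeast(node->right.get(), minFreq, out);
}
```

String comparison here is case-sensitive, so "Cat" and "cat" would be separate nodes; whether to fold case before inserting is a design decision the handout leaves open.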
Then traverse the tree and print out, in alphabetical order, all the words in the WordFrequencyTree that occur at least min_freq_num times. With each word you should list its frequency count.

The following examples show what your command line should look like and what your program should do for different calls:

    # program will add Nodes to the BST for each token
    # in the index.html input file that is not on the html_ignore_list
    # and will print out in alphabetical order all words that occur
    # at least 3 times in the tree
    #
    % WebSearch.exe index.html 3 html_ignore_list

    #
    # program will add Nodes to the BST for each token
    # in the index.html input file and will print out in alphabetical
    # order all words that occur at least 4 times in the tree
    #
    % WebSearch.exe index.html 4

    #
    # program will print a usage message to stdout and exit
    # (ex) usage: inputfile min_freq_num
    #      or:    inputfile min_freq_num ignorelist
    #
    % WebSearch.exe index.html

CLASSES TO IMPLEMENT

Complete the implementation of LinkedBinarySearchTree.

Create a WordFrequencyTree class that implements the BinarySearchTree interface. This class will have a LinkedBinarySearchTree data member (it may need additional data members too), and its implementations of the BinarySearchTree interface methods just invoke the corresponding methods on its LinkedBinarySearchTree data member. In addition, it should have at least three constructors: a default constructor, a constructor that takes an input file argument, and a constructor that takes an input file and an ignore file argument. The latter two constructors will process the file(s) and create the initial word frequency tree as specified above.

Part of this assignment is developing a good ignore list file for html files. As you test your program on different .html input files, develop a list of words that should be ignored in html files, and use this list as the optional third command line argument to your program.
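The command-line handling and ignore list described above might be sketched as follows. The names Options, parseArgs, and loadIgnoreList are hypothetical, and the sketch assumes the ignore file holds whitespace-separated words and that min_freq_num must be a non-negative integer; in the real program you may prefer to read the ignore file through the Scanner class so that both files are tokenized the same way.

```cpp
#include <cassert>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical holder for the parsed command line.
struct Options {
    std::string inputFile;
    int minFreq;
    std::string ignoreFile;  // empty when no ignore list was given
    Options() : minFreq(0) {}
};

// Returns true and fills `out` when there are two or three arguments after
// the program name and min_freq_num is a non-negative integer.
// On false, the caller should print the usage message and exit.
bool parseArgs(int argc, const char* const argv[], Options& out) {
    assert(argv != nullptr);  // a null argv is a programming error, not user error
    if (argc != 3 && argc != 4) return false;

    std::istringstream num(argv[2]);
    int n = 0;
    char extra;
    // Reject non-numeric input ("x3"), negatives, and trailing junk ("3x").
    if (!(num >> n) || n < 0 || (num >> extra)) return false;

    out.inputFile = argv[1];
    out.minFreq = n;
    out.ignoreFile = (argc == 4) ? argv[3] : "";
    return true;
}

// Reads whitespace-separated words into a set; duplicates collapse.
// Assumption: matching against the ignore list is case-sensitive.
std::set<std::string> loadIgnoreList(std::istream& in) {
    std::set<std::string> words;
    std::string w;
    while (in >> w) words.insert(w);
    return words;
}
```

In main, a false return from parseArgs would lead to printing the usage message to stdout and exiting. Otherwise, when ignoreFile is non-empty, the file can be opened with std::ifstream and passed to loadIgnoreList before the input file is scanned. A token is then skipped whenever the set's count() for it is non-zero.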
You will submit your html_ignore_list file as part of your homework solution.

C++ CLASSES

The following classes are available to you:

- Scanner class (.h & .cpp)
- Node interface (pure abstract) (.h)
- BinarySearchTree interface (pure abstract) (.h)
- LinkedBTNode (.h & .cpp)
- LinkedBinarySearchTree (.h & .cpp)
- TraversalIterator (.h & .cpp)

DELIVERABLES on 18th October

What do you understand about the project? Answer the following questions:

- In short, what is expected of you in the project?
- Which classes/functions need to be implemented?
- What will be the data members of the WordFrequencyTree.cpp class?
- What is your understanding of the ignore list? How will you create, store, and use it as part of the project?

Write a ScannerTest.cpp to test the working of the Scanner class (input a .html file and see what getNextWord() returns).

Submit a Word file and ScannerTest.cpp.

DELIVERABLES on 10th November

- Modified code, mainly LinkedBinarySearchTree.cpp, WordFrequencyTree.cpp, and the main file
- Output files (copy the output to the file)
- Your html input file and your ignore list file
- Your project report

Reference: instructions for adding your exe file to the path (embedded Microsoft Word document).