Download Proposal - salsahpc - Indiana University Bloomington
Frequent Word Combinations Mining and Indexing
Hemanth Gokavarapu
Santhosh Kumar Saminathan
School of Informatics and Computing
Indiana University Bloomington
{hemagoka, sasamina}@indiana.edu
Abstract
Google's autocomplete algorithm suggests searches similar to the words you are typing, and it relies on frequent word combinations. This inspired us to learn and implement a similar technique: mining and indexing frequent word combinations. This document first describes the concept behind the project and then provides the methods and implementation details of the project. We chose HBase as our distributed, open-source database, which provides Bigtable-like functionality on top of Hadoop.
Key Words: alphabetically, sorted, excluding, words, Apriori, Mining.
1. Project Goal
Finding the frequency of word combinations is considered one of the major problems in the cloud and distributed computing field, as existing solutions only find the frequency of a single word. We cannot find the individual frequencies of the words in a combination and combine them to arrive at the result. For example, for the combination of words 'cloud computing', we cannot count the frequencies of 'cloud' and 'computing' separately and compute the result, because those counts also include combinations such as 'distribute cloud', 'computing field', 'method of computing', etc.
Our project focuses on finding the frequency of word combinations without the stated error. We are going to use the concepts of data mining and the Apriori algorithm to implement this project.
2. Survey
In this project we are going to survey various topics before implementing the project. The survey topics include the Apriori algorithm, HBase, and MapReduce. In addition to the survey of these topics, we are also going to do a detailed analysis of the design trade-offs in choosing HBase, Hadoop, or Twister for this project.
3. Approach
This project uses the concept of data mining. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It also analyzes the relationships and patterns in stored transaction data based on open-ended user queries.
4. Architecture Design
The Apriori algorithm is used in the project to find the frequency of combinations of words. It is an influential algorithm for mining frequent itemsets for Boolean association rules. These association rules form a widely applied data mining approach; they are derived from frequent itemsets through a level-wise search that uses the frequent-itemset property.
The Apriori algorithm generates candidate itemsets, which are refined at each iteration; the loop ends when the candidate set becomes empty. The algorithm uses the large-itemset property and is easily parallelized.
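As a minimal sketch of this level-wise loop (not the project's actual implementation; `apriori_words` is a hypothetical name, and each document is treated here as an unordered set of words):

```python
from itertools import combinations

def apriori_words(documents, min_support):
    """Level-wise Apriori over word sets.

    documents: list of sets of words; min_support: minimum number of
    documents an itemset must appear in to be counted as frequent.
    """
    # Level 1: frequent single words.
    counts = {}
    for doc in documents:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    current = {frozenset([w]) for w, c in counts.items() if c >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune any candidate with an infrequent (k-1)-subset
        # (the Apriori / downward-closure property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count support and keep the frequent candidates; the loop ends
        # when this set becomes empty.
        current = {c for c in candidates
                   if sum(1 for doc in documents if c <= doc) >= min_support}
        frequent |= current
        k += 1
    return frequent
```

On a toy corpus of three documents, two of which contain both 'cloud' and 'computing', a min_support of 2 keeps the pair {'cloud', 'computing'} as a frequent combination while the singleton counting alone could not distinguish it from 'distribute cloud' or 'computing field'.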
HBase is a non-relational, distributed Hadoop database modeled after Google's Bigtable. The HBase internal architecture is shown below.
HBase basically handles two kinds of file types: one is used for the write-ahead log and the other for the actual data storage. The HRegionServers primarily handle these files, but in certain scenarios even the HMaster will have to perform low-level file operations. You may also notice in the diagram that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed File System (HDFS).
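Because HBase keeps row keys in sorted order, a scan starting at a given row can retrieve all phrases sharing a prefix, which is exactly the access pattern an autocomplete index needs. A small in-memory sketch of that idea (plain Python standing in for an HBase table; `build_index` and `prefix_scan` are hypothetical names, not HBase API calls):

```python
from bisect import bisect_left

def build_index(phrase_counts):
    """Index of phrase -> frequency; keys are kept sorted so a prefix
    scan (like an HBase Scan with a start row) is cheap."""
    keys = sorted(phrase_counts)
    return keys, phrase_counts

def prefix_scan(index, prefix, limit=5):
    """Return up to `limit` (phrase, count) pairs whose key starts with
    `prefix`, most frequent first -- the autocomplete-style lookup."""
    keys, counts = index
    start = bisect_left(keys, prefix)  # first key >= prefix
    hits = []
    for k in keys[start:]:
        if not k.startswith(prefix):
            break  # sorted order: no later key can match the prefix
        hits.append((k, counts[k]))
    hits.sort(key=lambda kv: -kv[1])
    return hits[:limit]
```

The same scheme carries over to HBase by using each phrase as a row key and storing its frequency in a column, so one scan bounded by the typed prefix returns the candidate completions.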
5. Timeline
We are going to follow the timeline mentioned below.
1 week – Talking to the experts at FutureGrid.
1 week – Survey of HBase, the Apriori algorithm, and other design problems.
3 weeks – Implementation of the algorithm.
2 weeks – Testing the code, evaluation, and getting the results.
6. Validation Methods
There are many validation methods that could be applied to this project. We are going to follow the basic approach of feeding in large data inputs and checking the results against the expected output.
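One concrete form of this check (a sketch with a hypothetical function name): count every word pair by brute force on a small input and compare the resulting set of frequent pairs against the mining pipeline's output.

```python
from itertools import combinations

def frequent_pairs_reference(documents, min_support):
    """Reference implementation for validation: count every word pair
    directly and keep those meeting min_support. Feasible only on small
    inputs, which is exactly what a validation run uses."""
    counts = {}
    for doc in documents:
        # Sort so ('cloud', 'computing') and ('computing', 'cloud')
        # count as the same unordered pair.
        for pair in combinations(sorted(doc), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}
```

The distributed implementation passes the check when, for the same small corpus and support threshold, it produces exactly this set of pairs.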