Download Proposal - salsahpc - Indiana University Bloomington
Frequent Word Combinations Mining and Indexing
Hemanth Gokavarapu
Santhosh Kumar Saminathan
School of Informatics and Computing
Indiana University Bloomington
{hemagoka, sasamina}@indiana.edu
Abstract
Google's autocomplete algorithm suggests searches similar to the words you are typing, and it relies on frequent word combinations. This inspired us to learn and implement a similar technique: mining and indexing frequent word combinations. This document first describes the concept behind the project and then provides the methods and implementation details of the project. We chose HBase as our distributed, open-source database, which provides Bigtable-like functionality on top of Hadoop.
Key Words: alphabetically, sorted, excluding, words, Apriori, Mining.
1. Project Goal
Finding the frequency of word combinations is considered one of the major problems in the cloud and distributed computing field, as existing solutions only find the frequency of a single word. We cannot find the individual frequencies of the words in a combination and combine them to arrive at the result. For example, for the combination of words 'cloud computing', we cannot count the frequencies of 'cloud' and 'computing' separately and compute the result, because those counts also include combinations such as 'distribute cloud', 'computing field', 'method of computing', etc.
Our project focuses on finding the frequency of word combinations without the stated error. We are going to use the concepts of data mining and the Apriori algorithm to implement this project.
2. Survey
In this project we are going to survey various topics before implementing the project. The survey topics include the Apriori algorithm, HBase, and MapReduce. In addition to the survey of these topics, we are also going to do a detailed analysis of the design trade-offs in choosing HBase, Hadoop, or Twister for this project.
3. Approach
This project uses the concept of data mining. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information. It also analyzes the relationships and patterns in stored transaction data based on open-ended user queries.
4. Architecture Design
The Apriori algorithm is used in the project to find the frequency of combinations of words. It is an influential algorithm for mining frequent itemsets for Boolean association rules. These association rules form a widely applied data mining approach; they are derived from frequent itemsets through a level-wise search that uses the frequent-itemset property.
The Apriori algorithm generates candidate itemsets, which are refined at each iteration; the loop ends when the candidate set becomes empty. The algorithm uses the large-itemset property and is easily parallelized.
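As a minimal sketch of this level-wise loop (not the project's actual implementation; `apriori_words` is a hypothetical name, and each document is treated here as an unordered set of words):

```python
from itertools import combinations

def apriori_words(documents, min_support):
    """Level-wise Apriori over word sets.

    documents: list of sets of words; min_support: minimum number of
    documents an itemset must appear in to be counted as frequent.
    """
    # Level 1: frequent single words.
    counts = {}
    for doc in documents:
        for w in doc:
            counts[w] = counts.get(w, 0) + 1
    current = {frozenset([w]) for w, c in counts.items() if c >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Prune any candidate with an infrequent (k-1)-subset
        # (the Apriori / downward-closure property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        # Count support and keep the frequent candidates; the loop ends
        # when this set becomes empty.
        current = {c for c in candidates
                   if sum(1 for doc in documents if c <= doc) >= min_support}
        frequent |= current
        k += 1
    return frequent
```

On a toy corpus of three documents, two of which contain both 'cloud' and 'computing', a min_support of 2 keeps the pair {'cloud', 'computing'} as a frequent combination while the singleton counting alone could not distinguish it from 'distribute cloud' or 'computing field'.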
HBase is a non-relational, distributed Hadoop database modeled after Google's Bigtable. The HBase internal architecture is shown below.
HBase basically handles two kinds of file types: one is used for the write-ahead log and the other for the actual data storage. The HRegionServers primarily handle these files, but in certain scenarios even the HMaster will have to perform low-level file operations. You may also notice in the diagram that the actual files are in fact divided up into smaller blocks when stored within the Hadoop Distributed File System (HDFS).
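Because HBase keeps row keys in sorted order, a scan starting at a given row can retrieve all phrases sharing a prefix, which is exactly the access pattern an autocomplete index needs. A small in-memory sketch of that idea (plain Python standing in for an HBase table; `build_index` and `prefix_scan` are hypothetical names, not HBase API calls):

```python
from bisect import bisect_left

def build_index(phrase_counts):
    """Index of phrase -> frequency; keys are kept sorted so a prefix
    scan (like an HBase Scan with a start row) is cheap."""
    keys = sorted(phrase_counts)
    return keys, phrase_counts

def prefix_scan(index, prefix, limit=5):
    """Return up to `limit` (phrase, count) pairs whose key starts with
    `prefix`, most frequent first -- the autocomplete-style lookup."""
    keys, counts = index
    start = bisect_left(keys, prefix)  # first key >= prefix
    hits = []
    for k in keys[start:]:
        if not k.startswith(prefix):
            break  # sorted order: no later key can match the prefix
        hits.append((k, counts[k]))
    hits.sort(key=lambda kv: -kv[1])
    return hits[:limit]
```

The same scheme carries over to HBase by using each phrase as a row key and storing its frequency in a column, so one scan bounded by the typed prefix returns the candidate completions.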
5. Timeline
We are going to follow the timeline mentioned below.
1 week – Talking to the experts at FutureGrid.
1 week – Survey of HBase, the Apriori algorithm, and other design problems.
3 weeks – Implementation of the algorithm.
2 weeks – Testing the code, evaluation, and getting the results.
6. Validation Methods
There are many validation methods that could be applied to this project. We are going to follow the basic approach of feeding in large data inputs and checking the results against the expected output.
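One concrete form of this check (a sketch with a hypothetical function name): count every word pair by brute force on a small input and compare the resulting set of frequent pairs against the mining pipeline's output.

```python
from itertools import combinations

def frequent_pairs_reference(documents, min_support):
    """Reference implementation for validation: count every word pair
    directly and keep those meeting min_support. Feasible only on small
    inputs, which is exactly what a validation run uses."""
    counts = {}
    for doc in documents:
        # Sort so ('cloud', 'computing') and ('computing', 'cloud')
        # count as the same unordered pair.
        for pair in combinations(sorted(doc), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return {p for p, c in counts.items() if c >= min_support}
```

The distributed implementation passes the check when, for the same small corpus and support threshold, it produces exactly this set of pairs.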