Frequent Word Combinations Mining and Indexing on HBase
Hemanth Gokavarapu
Santhosh Kumar Saminathan
Introduction
- Many projects use HBase to store large amounts of data for distributed computation.
- Processing this data becomes a challenge for programmers.
- Frequent terms help us in many ways in the field of machine learning.
- E.g., frequently purchased items, frequently asked questions, etc.
Problem
- These projects build indexes over the data they store in HBase.
- Using these indexes, we can easily find the frequency of a single word.
- It is hard to find the frequency of a combination of words.
- For example: "cloud computing".
- Searching for these words separately may lead to results like "scientific computing" or "cloud platform".
Objective
- This project focuses on finding the frequency of a combination of words.
- We use data mining concepts and the Apriori algorithm.
- We will be using MapReduce and HBase for this project.
Survey Topics
- Apriori Algorithm
- HBase
- MapReduce
Data Mining
What is Data Mining?
- The process of analyzing data from different perspectives.
- Summarizing data into useful information.
Data Mining
How does Data Mining work?
- Data mining analyzes relationships and patterns in stored transaction data, based on open-ended user queries.
What technological infrastructure is needed?
Two critical technological drivers answer this question:
- Size of the database
- Query complexity
Apriori Algorithm
- The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
- Association rules are a widely applied data mining approach.
- Association rules are derived from frequent itemsets.
- It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must itself be frequent. A runnable sketch follows the worked example below.
Algorithm Flow
(figure: flow chart of the level-wise Apriori search; image not recoverable from the text)
Apriori Algorithm & Problem Description
Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset | Support
{Shoes} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Shoes, Jacket} | 50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are (the confidence of X → Y is support({X, Y}) / support({X})):
Shoes → Jacket: Support = 50%, Confidence = 50% / 75% ≈ 66%
Jacket → Shoes: Support = 50%, Confidence = 50% / 50% = 100%
Apriori Algorithm Example
Minimum support = 50% (2 of 4 transactions)

Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D to count the candidates C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Keep those meeting minimum support, giving L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

Generate candidates C2 from L1 and scan D to count them:
itemset | sup.
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

Keep those meeting minimum support, giving L2:
itemset | sup.
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

Generate candidates C3 from L2: the only candidate whose 2-subsets are all in L2 is {2 3 5}. Scan D, giving L3:
itemset | sup.
{2 3 5} | 2
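To make the level-wise search concrete, here is a minimal single-machine sketch of Apriori in Java that reproduces the example above (minimum support = 2 of 4 transactions). It is an illustration only, not the project's code; the project distributes these steps over MapReduce and HBase.

import java.util.*;
import java.util.stream.Collectors;

public class AprioriExample {

    // One scan of the database: count how many transactions contain each candidate.
    static Map<Set<Integer>, Integer> count(Set<Set<Integer>> candidates,
                                            List<Set<Integer>> db) {
        Map<Set<Integer>, Integer> sup = new HashMap<>();
        for (Set<Integer> t : db)
            for (Set<Integer> c : candidates)
                if (t.containsAll(c)) sup.merge(c, 1, Integer::sum);
        return sup;
    }

    // Join step: union pairs of frequent (k-1)-itemsets into k-itemset candidates,
    // then prune any candidate with an infrequent (k-1)-subset (the Apriori property).
    static Set<Set<Integer>> generate(List<Set<Integer>> prev, int k) {
        Set<Set<Integer>> candidates = new HashSet<>();
        for (Set<Integer> a : prev)
            for (Set<Integer> b : prev) {
                Set<Integer> u = new TreeSet<>(a);
                u.addAll(b);
                if (u.size() == k) candidates.add(u);
            }
        candidates.removeIf(c -> c.stream().anyMatch(item -> {
            Set<Integer> sub = new TreeSet<>(c);
            sub.remove(item);
            return !prev.contains(sub);
        }));
        return candidates;
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(            // database D from the slide
            Set.of(1, 3, 4), Set.of(2, 3, 5),
            Set.of(1, 2, 3, 5), Set.of(2, 5));
        int minSup = 2;                             // 50% of 4 transactions

        // C1: every item that occurs anywhere is a 1-itemset candidate.
        Set<Set<Integer>> candidates = db.stream()
            .flatMap(Set::stream).map(i -> Set.<Integer>of(i))
            .collect(Collectors.toSet());

        for (int k = 1; !candidates.isEmpty(); k++) {
            Map<Set<Integer>, Integer> sup = count(candidates, db);  // scan D
            List<Set<Integer>> lk = sup.entrySet().stream()
                .filter(e -> e.getValue() >= minSup)
                .map(Map.Entry::getKey).collect(Collectors.toList());
            if (lk.isEmpty()) break;
            System.out.println("L" + k + ": " + lk);  // prints L1, L2, L3 as above
            candidates = generate(lk, k + 1);         // candidates Ck+1
        }
    }
}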
Apriori Advantages & Disadvantages
Advantages:
- Uses the large itemset property
- Easily parallelized
- Easy to implement
Disadvantages:
- Assumes the transaction database is memory-resident
- Requires many database scans
HBase
What is HBase?
- The Hadoop database
- Non-relational
- An open-source, distributed, versioned, column-oriented store
- Modeled after Google Bigtable
- Runs on top of HDFS (the Hadoop Distributed File System); a short client-API sketch follows
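As a quick illustration of the store model, here is a minimal sketch that writes and reads a frequency count through the HBase Java client API. The table name "wordpairs", column family "f", and qualifier "count" are hypothetical, not the project's actual schema.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FrequencyStore {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("wordpairs"))) {

            // Row key is the word combination; column family "f" holds the count.
            Put put = new Put(Bytes.toBytes("cloud computing"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(42L));
            table.put(put);

            // Read the count back for the same combination.
            Result r = table.get(new Get(Bytes.toBytes("cloud computing")));
            long n = Bytes.toLong(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
            System.out.println("cloud computing -> " + n);
        }
    }
}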
MapReduce
- A framework for processing highly distributable problems across huge datasets, using a large number of nodes (a cluster).
- Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured). The canonical word-count example is sketched below.
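The standard illustration of the model is word count, sketched here with the Hadoop Java API. This is the textbook example, not the project's code; it also shows why single-word frequencies are easy: one map and one reduce phase suffice.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().toLowerCase().split("\\W+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for one word arrive together; sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}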
MapReduce
Mappers and Reducers
- Mappers:
  FrequentItemsMap: finds the combinations and assigns the key value for each combination (a sketch of this pair follows below)
  CandidateGenMap
  AssociationRuleMap
- Reducers:
  FrequentItemsReduce
  CandidateGenReduce
  AssociationRuleReduce
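The slides name these classes but not their bodies. Below is a sketch of what the first pair, FrequentItemsMap and FrequentItemsReduce, might look like for 2-word combinations: the mapper emits each sorted word pair as the key, and the reducer sums the counts and applies a minimum-support threshold. The bodies and the threshold are assumptions, not the authors' code.

import java.io.IOException;
import java.util.Arrays;
import java.util.TreeSet;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FrequentItems {

    // FrequentItemsMap (sketch): emit each distinct 2-word combination in a line,
    // with the two words sorted so that the pair itself serves as the key;
    // "computing cloud" and "cloud computing" then count as the same itemset.
    public static class FrequentItemsMap
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            TreeSet<String> words = new TreeSet<>(
                Arrays.asList(line.toString().toLowerCase().split("\\W+")));
            words.remove("");                              // drop empty tokens
            String[] w = words.toArray(new String[0]);
            for (int i = 0; i < w.length; i++)
                for (int j = i + 1; j < w.length; j++)
                    ctx.write(new Text(w[i] + " " + w[j]), ONE);  // key = sorted pair
        }
    }

    // FrequentItemsReduce (sketch): sum the counts for each pair and keep
    // only the pairs that meet an assumed minimum-support threshold.
    public static class FrequentItemsReduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MIN_SUPPORT = 2;          // assumed threshold
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (sum >= MIN_SUPPORT) ctx.write(pair, new IntWritable(sum));
        }
    }
}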
Flow Chart
(figure: project flow chart with a yes/no decision branch; image not recoverable from the text)
Schedule
- 1 week: talking to the experts at FutureGrid
- 1 week: survey of HBase and the Apriori algorithm
- 4 weeks: implementing the Apriori algorithm
- 2 weeks: testing the code and getting the results
Results
Conclusion
- Execution takes more time on a single node.
- As the number of mappers increases, performance improves.
- When the data is very large, single-node execution takes much longer and behaves erratically.
Screenshot
Known Issues
- When the frequency is very low for a large data set, the reducer takes more time.
- E.g., a text paragraph in which words are not repeated often.
Future Work
- The analysis can be done with Twister and other platforms.
- The algorithm can be extended to other applications that use machine learning techniques.
References
- http://en.wikipedia.org/wiki/Text_mining
- http://en.wikipedia.org/wiki/Apriori_algorithm
- http://hbase.apache.org/book/book.html
- http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
- http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx
- http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
Questions?