Frequent Word
Combinations Mining and
Indexing on HBase
Hemanth Gokavarapu
Santhosh Kumar Saminathan
Introduction
- Many projects use HBase to store large amounts of data for distributed computation.
- Processing this data becomes a challenge for programmers.
- Frequent terms help in many ways in the field of machine learning.
- E.g., frequently purchased items, frequently asked questions, etc.
Problem
- These projects on HBase create indexes on multiple data items.
- Using these indexes, we can easily find the frequency of a single word.
- It is hard to find the frequency of a combination of words, for example "cloud computing".
- Searching for these words separately may return results like "scientific computing" or "cloud platform".
Objective
- This project focuses on finding the frequency of a combination of words.
- We use data mining concepts and the Apriori algorithm.
- We will use MapReduce and HBase for this project.
Survey Topics
- Apriori algorithm
- HBase
- MapReduce
Data Mining
What is Data Mining?
- The process of analyzing data from different perspectives
- Summarizing data into useful information
Data Mining
How does Data Mining work?
- Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries.
What technological infrastructure is needed?
Two critical technological drivers answer this question:
- Size of the database
- Query complexity
Apriori Algorithm
- The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
- Association rules are a widely applied data mining approach.
- Association rules are derived from frequent itemsets.
- It uses a level-wise search based on the frequent-itemset property: every subset of a frequent itemset must itself be frequent.
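The level-wise search can be sketched in plain Python (a simplified, single-machine illustration, not the project's HBase/MapReduce implementation; all names are our own):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining; returns {itemset: count}."""
    n = len(transactions)
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Generate k-item candidates whose (k-1)-subsets are all frequent
        # (the Apriori pruning step), then count them with one pass over D.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c / n >= min_support}
        result.update(frequent)
        k += 1
    return result
```

Running it on the four-transaction example used later in these slides, `apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 0.5)` yields {2 3 5} as the only frequent 3-itemset, with a count of 2.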
Algorithm Flow
Apriori Algorithm & Problem Description
Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset | Support
{Shoes} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Shoes, Jacket} | 50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes -> Jacket (Support = 50%, Confidence = 66%)
Jacket -> Shoes (Support = 50%, Confidence = 100%)
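These support and confidence figures can be checked with a few lines of Python (a minimal sketch; the basket data mirrors the table above):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

baskets = [
    {"Shoes", "Shirt", "Jacket"},   # transaction 1
    {"Shoes", "Jacket"},            # transaction 2
    {"Shoes", "Jeans"},             # transaction 3
    {"Shirt", "Sweatshirt"},        # transaction 4
]
# support({Shoes, Jacket}) = 2/4 = 50%
# confidence(Shoes -> Jacket) = 50% / 75% = 66%
# confidence(Jacket -> Shoes) = 50% / 50% = 100%
```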
Apriori Algorithm Example
Minimum support = 50% (2 of 4 transactions)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D -> C1 (candidate 1-itemsets with support counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidate 2-itemsets), counted by scanning D:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets):
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidate 3-itemsets): {2 3 5}

Scan D -> L3 (frequent 3-itemsets):
{2 3 5}: 2
Apriori Advantages & Disadvantages
ADVANTAGES:
- Uses the large-itemset property
- Easily parallelized
- Easy to implement
DISADVANTAGES:
- Assumes the transaction database is memory-resident
- Requires many database scans
HBase
What is HBase?
- The Hadoop database
- Non-relational
- An open-source, distributed, versioned, column-oriented store
- Modeled after Google Bigtable
- Runs on top of HDFS (the Hadoop Distributed File System)
Map Reduce
- A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).
- Processing occurs on data stored either in the filesystem (unstructured) or in a database (structured).
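The map and reduce phases for counting word combinations can be imitated in plain Python (a local sketch of the idea, not Hadoop code; function and variable names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def pair_map(line):
    """Map phase: emit ((w1, w2), 1) for every word pair in one document."""
    words = sorted(set(line.lower().split()))
    return [((a, b), 1) for a, b in combinations(words, 2)]

def pair_reduce(pairs):
    """Reduce phase: sum the counts for each pair key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

docs = ["cloud computing on hbase", "scientific computing", "cloud computing"]
mapped = [kv for d in docs for kv in pair_map(d)]   # map over all documents
counts = pair_reduce(mapped)                        # shuffle + reduce
# ("cloud", "computing") appears in two documents
```

In a real Hadoop job, the framework performs the shuffle between the two phases, grouping all values for a key before the reducer sees them.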
Map Reduce
Mapper and Reducer
- Mappers
  - FrequentItemsMap: finds the combinations and assigns a key value to each combination
  - CandidateGenMap
  - AssociationRuleMap
- Reducers
  - FrequentItemsReduce
  - CandidateGenReduce
  - AssociationRuleReduce
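As an illustration of how the first stage could split into a mapper and a reducer, here is a hypothetical single-machine sketch (the stage names above come from the slides; these functions are our own approximation, not the project's code):

```python
from collections import Counter
from itertools import combinations

def frequent_items_map(transaction, k=1):
    """Mapper sketch: emit (sorted k-itemset, 1) for one transaction."""
    return [(tuple(sorted(c)), 1) for c in combinations(set(transaction), k)]

def frequent_items_reduce(emitted, min_count):
    """Reducer sketch: sum counts per itemset, keep the frequent ones."""
    totals = Counter()
    for itemset, one in emitted:
        totals[itemset] += one
    return {s: n for s, n in totals.items() if n >= min_count}
```

In the real job these functions would be Hadoop Mapper and Reducer classes reading transactions from, and writing frequent itemsets back to, HBase tables.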
Flow Chart
Schedule
- 1 week: talking to the experts at FutureGrid
- 1 week: survey of HBase and the Apriori algorithm
- 4 weeks: implementing the Apriori algorithm
- 2 weeks: testing the code and gathering results
Results
Conclusion
- Execution takes more time on a single node.
- As the number of mappers increases, performance improves.
- When the data is very large, single-node execution takes more time and behaves erratically.
Screenshot
Known Issues
- When the frequency is very low for a large data set, the reducer takes more time.
- E.g., a text paragraph in which the words are not repeated often.
Future Work
- The analysis can be done with Twister and other platforms.
- The algorithm can be extended to other applications that use machine learning techniques.
References
- http://en.wikipedia.org/wiki/Text_mining
- http://en.wikipedia.org/wiki/Apriori_algorithm
- http://hbase.apache.org/book/book.html
- http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
- http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx
Questions?