Frequent Word Combinations
Mining and Indexing on HBase
HEMANTH GOKAVARAPU
SANTHOSH KUMAR SAMINATHAN
Introduction
Many projects on HBase create indexes over multiple data sets
It is easy to find the frequency of a single word
It is hard to find the frequency of a combination of words
For example: "cloud computing"
Objective
This project focuses on finding the frequency of a combination of words
We use data mining concepts and the Apriori algorithm for this project
We will use MapReduce and HBase for this project
Survey Topics
Apriori Algorithm
HBase
MapReduce
Data Mining
What is Data Mining?
The process of analyzing data from different perspectives and summarizing it into useful information.
Data Mining
How does data mining work?
Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries
What technological infrastructure is needed?
Two critical technological drivers answer this question:
Size of the database
Query complexity
Apriori Algorithm
Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.
Association rules are a widely applied data mining approach.
Association rules are derived from frequent itemsets.
Apriori uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset is also frequent.
Algorithm Flow
Apriori Algorithm & Problem Description
Transaction ID    Items Bought
1                 Shoes, Shirt, Jacket
2                 Shoes, Jacket
3                 Shoes, Jeans
4                 Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset    Support
{Shoes}             75%
{Shirt}             50%
{Jacket}            50%
{Shoes, Jacket}     50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes → Jacket: Support = 50%, Confidence = 66%
Jacket → Shoes: Support = 50%, Confidence = 100%
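
The support and confidence figures in this example can be checked with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are illustrative and not part of the project, and the transactions are simply held in an in-memory dictionary.

```python
# Transactions from the example table (transaction ID -> items bought).
transactions = {
    1: {"Shoes", "Shirt", "Jacket"},
    2: {"Shoes", "Jacket"},
    3: {"Shoes", "Jeans"},
    4: {"Shirt", "Sweatshirt"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both) / support(antecedent)

print(support({"Shoes", "Jacket"}))       # 0.5
print(confidence({"Shoes"}, {"Jacket"}))  # 2/3, shown as 66% on the slide
print(confidence({"Jacket"}, {"Shoes"}))  # 1.0
```

Shoes appears in three of the four transactions, so the rule Shoes → Jacket has confidence 50% / 75%, while Jacket appears in only two, both of which also contain Shoes.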
Apriori Algorithm Example
Min support = 50%

Database D
TID    Items
100    1 3 4
200    2 3 5
300    1 2 3 5
400    2 5

C1 (after scanning D)
itemset    sup.
{1}        2
{2}        3
{3}        3
{4}        1
{5}        3

L1
itemset    sup.
{1}        2
{2}        3
{3}        3
{5}        3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

C2 (after scanning D)
itemset    sup
{1 2}      1
{1 3}      2
{1 5}      1
{2 3}      2
{2 5}      3
{3 5}      2

L2
itemset    sup
{1 3}      2
{2 3}      2
{2 5}      3
{3 5}      2

C3 (generated from L2)
itemset
{2 3 5}

L3 (after scanning D)
itemset    sup
{2 3 5}    2
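
The level-wise search over database D can be sketched in Python as follows. This is a minimal in-memory illustration, not the project's implementation: the function names `count_support` and `apriori` are hypothetical, and the whole database is assumed to fit in a dictionary.

```python
from itertools import combinations

# Database D from the example (TID -> items).
database = {
    100: {1, 3, 4},
    200: {2, 3, 5},
    300: {1, 2, 3, 5},
    400: {2, 5},
}

def count_support(candidates, db):
    """One scan of D: count how many transactions contain each candidate."""
    return {c: sum(1 for items in db.values() if c <= items) for c in candidates}

def apriori(db, min_support=0.5):
    """Return [L1, L2, ...], each a dict mapping a frequent itemset to its count."""
    min_count = min_support * len(db)
    # C1: every single item that appears in the database.
    candidates = {frozenset([i]) for items in db.values() for i in items}
    levels = []
    k = 1
    while candidates:
        counts = count_support(candidates, db)
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that contain exactly k items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return levels

levels = apriori(database)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {tuple(sorted(s)): n for s, n in level.items()})
```

With a minimum support of 50% (two of the four transactions), the sketch produces three levels, and the only frequent 3-itemset is {2 3 5} with a count of 2, matching L3 in the example.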
Apriori Advantages & Disadvantages
ADVANTAGES:
Uses the large (frequent) itemset property
Easily Parallelized
Easy to Implement
DISADVANTAGES:
Assumes transaction database is memory resident
Requires many database scans
HBase
What is HBase?
The Hadoop database
Non-relational
Open-source, distributed, versioned, column-oriented store model
Modeled after Google's Bigtable
Runs on top of HDFS (Hadoop Distributed File System)
HBase Architecture
MapReduce
A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).
Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
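
The map, shuffle and reduce stages can be illustrated with a single-process Python sketch of a word count. This only mimics the programming model in memory; it does not use the Hadoop API, and the function names and sample records are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)

def run_job(records):
    """Run map, shuffle and reduce in sequence over an in-memory dict of records."""
    pairs = [kv for doc_id, text in records.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())

# Hypothetical sample input: document ID -> text.
records = {
    "doc1": "cloud computing on HBase",
    "doc2": "cloud storage and cloud computing",
}
result = run_job(records)
print(result["cloud"])      # 3
print(result["computing"])  # 2
```

On a real cluster the map and reduce functions run on many nodes in parallel and the framework performs the grouping; here everything runs in one process so the data flow is easy to follow.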
MapReduce
How Combination Works (cont.)
The approach is similar to the frequent itemset mining problem
But only adjacent words are mined
The idea is that if a phrase (a combination of words) is frequent, then its sub-phrases are also frequent.
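
A minimal Python sketch of this idea follows. It counts phrases of adjacent words level by level and counts a longer phrase only when both of its shorter sub-phrases (the phrase without its last word and without its first word) were frequent at the previous level. The function name `frequent_phrases`, the absolute threshold `min_count`, the `max_len` limit and the sample sentences are all assumptions made for illustration, not the project's design.

```python
from collections import Counter

def frequent_phrases(sentences, min_count=2, max_len=3):
    """Return {phrase tuple: count} for adjacent-word phrases seen at least min_count times."""
    tokenized = [s.lower().split() for s in sentences]
    frequent = {}
    previous = None  # frequent phrases of length n - 1
    for n in range(1, max_len + 1):
        counts = Counter()
        for words in tokenized:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Apriori-style pruning: skip the phrase unless its prefix and
                # suffix of length n - 1 were both frequent at the previous level.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                counts[gram] += 1
        previous = {g: c for g, c in counts.items() if c >= min_count}
        if not previous:
            break
        frequent.update(previous)
    return frequent

# Hypothetical sample input.
sentences = [
    "cloud computing is popular",
    "we study cloud computing",
    "cloud storage is cheap",
]
result = frequent_phrases(sentences)
print(result[("cloud", "computing")])  # 2
```

Because only adjacent words are combined, each level needs a single pass over the text, and the pruning check keeps the number of counted phrases small as the phrase length grows.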
Schedule
1 week – Talking to the experts at FutureGrid
1 week – Survey of HBase and the Apriori algorithm
4 weeks – Kick-starting the implementation of the Apriori algorithm
2 weeks – Testing the code and getting the results
References
http://en.wikipedia.org/wiki/Text_mining
http://en.wikipedia.org/wiki/Apriori_algorithm
http://hbase.apache.org/book/book.html
Questions?