Frequent Word Combinations
Mining and Indexing on HBase
HEMANTH GOKAVARAPU
SANTHOSH KUMAR SAMINATHAN
Introduction
• Many projects on HBase create indexes over multiple data sets
• Finding the frequency of a single word is easy
• Finding the frequency of a combination of words is hard
• For example: "cloud computing"
Objective
• This project focuses on finding the frequency of a combination of words
• We use data mining concepts and the Apriori algorithm for this project
• We will use MapReduce and HBase for this project
Survey Topics
• Apriori Algorithm
• HBase
• MapReduce
Data Mining
What is Data Mining?
• The process of analyzing data from different perspectives
• Summarizing that data into useful information
Data Mining
How does Data Mining work?
• Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries
What technological infrastructure is needed?
Two critical technological drivers answer this question:
• Size of the database
• Query complexity
Apriori Algorithm
• The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Association rules are a widely applied data mining approach.
• Association rules are derived from frequent itemsets.
• It uses a level-wise search based on the frequent itemset property (every subset of a frequent itemset must also be frequent).
Algorithm Flow
Apriori Algorithm & Problem Description
Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes → Jacket   Support=50%, Confidence=66%
Jacket → Shoes   Support=50%, Confidence=100%
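The support and confidence figures above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the helper names `support` and `confidence` are chosen here for illustration.

```python
# Transactions from the table above, keyed by transaction ID.
transactions = {
    1: {"Shoes", "Shirt", "Jacket"},
    2: {"Shoes", "Jacket"},
    3: {"Shoes", "Jeans"},
    4: {"Shirt", "Sweatshirt"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

print(support({"Shoes", "Jacket"}))       # 0.5
print(confidence({"Shoes"}, {"Jacket"}))  # 2/3, shown as 66% on the slide
print(confidence({"Jacket"}, {"Shoes"}))  # 1.0
```

Confidence is asymmetric: every transaction with a jacket also has shoes, but only two of the three transactions with shoes have a jacket.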
Apriori Algorithm Example
Min support = 50%

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (generated from L2)
itemset
{2 3 5}

Scan D → L3
itemset   sup
{2 3 5}   2
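The level-wise search in this example can be sketched in Python as follows. This is an illustrative in-memory version under simple assumptions (transactions are Python sets, minimum support is a fraction); the function name `apriori` and its structure are choices made here, not the project's implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: count} for every itemset meeting min_support (a fraction)."""
    min_count = min_support * len(transactions)
    # C1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Scan D: count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

# Database D from the example above.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, 0.5)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps C3 down to {2 3 5}: the joins {1 2 3} and {1 3 5} are discarded because {1 2} and {1 5} are not in L2.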
Apriori Advantages & Disadvantages
• ADVANTAGES:
Uses the large itemset property
Easily parallelized
Easy to implement
• DISADVANTAGES:
Assumes the transaction database is memory resident
Requires many database scans
HBase
What is HBase?
• A Hadoop database
• Non-relational
• Open-source, distributed, versioned, column-oriented store model
• Modeled after Google Bigtable
• Runs on top of HDFS (Hadoop Distributed File System)
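To make "versioned, column-oriented" concrete, here is a toy in-memory model of that kind of table in Python. It is only a sketch of the data model and is not the HBase API; the class `ToyTable`, the row key, and the column name `freq:count` are all hypothetical.

```python
class ToyTable:
    """In-memory stand-in for a versioned, column-oriented table (not the HBase API)."""

    def __init__(self):
        # row key -> "family:qualifier" -> {timestamp: value}
        self._rows = {}

    def put(self, row, column, value, timestamp):
        """Store one version of a cell."""
        self._rows.setdefault(row, {}).setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        eligible = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(eligible)] if eligible else None

table = ToyTable()
table.put("cloud computing", "freq:count", 3, timestamp=1)
table.put("cloud computing", "freq:count", 5, timestamp=2)
print(table.get("cloud computing", "freq:count"))               # 5
print(table.get("cloud computing", "freq:count", timestamp=1))  # 3
```

Each cell is addressed by row key, column, and timestamp, so older values remain readable after an update.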
HBase Architecture
MapReduce
• A framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).
• Processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).
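The map and reduce phases can be imitated in plain Python to show the shape of the computation. The sketch below runs in a single process and does not use Hadoop; the function names and the two input lines are made up for illustration. It counts adjacent word pairs, the kind of key this project is interested in.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit ((word, next_word), 1) for each adjacent pair in a line."""
    words = line.lower().split()
    for left, right in zip(words, words[1:]):
        yield (left, right), 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: sum the counts for one key."""
    return key, sum(values)

lines = ["cloud computing on HBase", "cloud computing with MapReduce"]  # made-up input
emitted = [kv for line in lines for kv in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())
print(counts[("cloud", "computing")])  # 2
```

On a real cluster the mappers and reducers would run on different nodes, with the framework handling the grouping step.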
How Combination works cont.
• The approach is similar to the frequent itemset mining problem
• But only adjacent words are mined
• The idea is that if a phrase (a combination of words) is frequent, then its sub-phrases are also frequent.
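One way to apply this idea to adjacent words is a level-wise search over phrase lengths, sketched below in Python. It is an in-memory illustration under simple assumptions (whitespace tokenization, an absolute minimum count), not the project's MapReduce implementation; the function name `frequent_phrases` and the sample sentences are hypothetical.

```python
from collections import Counter

def frequent_phrases(sentences, min_count):
    """Return {phrase tuple: count} for runs of adjacent words seen at least min_count times."""
    tokenized = [s.lower().split() for s in sentences]
    frequent = {}
    previous = None  # frequent phrases of length n - 1
    n = 1
    while True:
        counts = Counter()
        for words in tokenized:
            for i in range(len(words) - n + 1):
                phrase = tuple(words[i:i + n])
                # Prune: a phrase can only be frequent if both of its
                # (n-1)-word sub-phrases were frequent at the previous level.
                if previous is None or (phrase[:-1] in previous and phrase[1:] in previous):
                    counts[phrase] += 1
        level = {p: c for p, c in counts.items() if c >= min_count}
        if not level:
            return frequent
        frequent.update(level)
        previous = level
        n += 1

sentences = [
    "cloud computing is popular",
    "we study cloud computing",
    "cloud storage is cheap",
]
result = frequent_phrases(sentences, min_count=2)
print(result[("cloud", "computing")])  # 2
```

Because only adjacent words count, each level needs just one pass over the text, and a longer phrase is counted only when its shorter parts already passed the threshold.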
Schedule
• 1 week: talking to the experts at FutureGrid
• 1 week: survey of HBase and the Apriori algorithm
• 4 weeks: starting the implementation of the Apriori algorithm
• 2 weeks: testing the code and collecting the results
Questions?