Frequent Word Combinations Mining and Indexing on HBase
Hemanth Gokavarapu
Santhosh Kumar Saminathan
Introduction
- Many projects use HBase to store large amounts of data for distributed computation.
- Processing this data becomes a challenge for programmers.
- Frequent terms help us in many ways in the field of machine learning.
- E.g., frequently purchased items, frequently asked questions, etc.
Problem
- These projects build indexes over the data they store in HBase.
- Using these indexes, we can easily find the frequency of a single word.
- It is hard to find the frequency of a combination of words.
- For example: "cloud computing".
- Searching for these words separately may lead to results like "scientific computing" or "cloud platform".
Objective
- This project focuses on finding the frequency of a combination of words.
- We use data mining concepts and the Apriori algorithm.
- We will be using MapReduce and HBase for this project.
Survey Topics
- Apriori Algorithm
- HBase
- MapReduce
Data Mining
What is Data Mining?
- The process of analyzing data from different perspectives.
- Summarizing data into useful information.
Data Mining
How does Data Mining work?
- Data mining analyzes relationships and patterns in stored transaction data, based on open-ended user queries.
What technological infrastructure is needed?
Two critical technological drivers answer this question:
- Size of the database
- Query complexity
Apriori Algorithm
- The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
- Association rules are a widely applied data mining approach.
- Association rules are derived from frequent itemsets.
- It uses a level-wise search based on the frequent itemset property: every subset of a frequent itemset must itself be frequent. A runnable sketch follows the worked example below.
Algorithm Flow
(figure: flow chart of the level-wise Apriori search; image not recoverable from the text)
Apriori Algorithm & Problem Description
Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset | Support
{Shoes} | 75%
{Shirt} | 50%
{Jacket} | 50%
{Shoes, Jacket} | 50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are (the confidence of X → Y is support({X, Y}) / support({X})):
Shoes → Jacket: Support = 50%, Confidence = 50% / 75% ≈ 66%
Jacket → Shoes: Support = 50%, Confidence = 50% / 50% = 100%
Apriori Algorithm Example
Minimum support = 50% (2 of 4 transactions)

Database D
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D to count the candidates C1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

Keep those meeting minimum support, giving L1:
itemset | sup.
{1} | 2
{2} | 3
{3} | 3
{5} | 3

Generate candidates C2 from L1 and scan D to count them:
itemset | sup.
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

Keep those meeting minimum support, giving L2:
itemset | sup.
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

Generate candidates C3 from L2: the only candidate whose 2-subsets are all in L2 is {2 3 5}. Scan D, giving L3:
itemset | sup.
{2 3 5} | 2
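To make the level-wise search concrete, here is a minimal single-machine sketch of Apriori in Java that reproduces the example above (minimum support = 2 of 4 transactions). It is an illustration only, not the project's code; the project distributes these steps over MapReduce and HBase.

import java.util.*;
import java.util.stream.Collectors;

public class AprioriExample {

    // One scan of the database: count how many transactions contain each candidate.
    static Map<Set<Integer>, Integer> count(Set<Set<Integer>> candidates,
                                            List<Set<Integer>> db) {
        Map<Set<Integer>, Integer> sup = new HashMap<>();
        for (Set<Integer> t : db)
            for (Set<Integer> c : candidates)
                if (t.containsAll(c)) sup.merge(c, 1, Integer::sum);
        return sup;
    }

    // Join step: union pairs of frequent (k-1)-itemsets into k-itemset candidates,
    // then prune any candidate with an infrequent (k-1)-subset (the Apriori property).
    static Set<Set<Integer>> generate(List<Set<Integer>> prev, int k) {
        Set<Set<Integer>> candidates = new HashSet<>();
        for (Set<Integer> a : prev)
            for (Set<Integer> b : prev) {
                Set<Integer> u = new TreeSet<>(a);
                u.addAll(b);
                if (u.size() == k) candidates.add(u);
            }
        candidates.removeIf(c -> c.stream().anyMatch(item -> {
            Set<Integer> sub = new TreeSet<>(c);
            sub.remove(item);
            return !prev.contains(sub);
        }));
        return candidates;
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(            // database D from the slide
            Set.of(1, 3, 4), Set.of(2, 3, 5),
            Set.of(1, 2, 3, 5), Set.of(2, 5));
        int minSup = 2;                             // 50% of 4 transactions

        // C1: every item that occurs anywhere is a 1-itemset candidate.
        Set<Set<Integer>> candidates = db.stream()
            .flatMap(Set::stream).map(i -> Set.<Integer>of(i))
            .collect(Collectors.toSet());

        for (int k = 1; !candidates.isEmpty(); k++) {
            Map<Set<Integer>, Integer> sup = count(candidates, db);  // scan D
            List<Set<Integer>> lk = sup.entrySet().stream()
                .filter(e -> e.getValue() >= minSup)
                .map(Map.Entry::getKey).collect(Collectors.toList());
            if (lk.isEmpty()) break;
            System.out.println("L" + k + ": " + lk);  // prints L1, L2, L3 as above
            candidates = generate(lk, k + 1);         // candidates Ck+1
        }
    }
}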
Apriori Advantages & Disadvantages
Advantages:
- Uses the large itemset property
- Easily parallelized
- Easy to implement
Disadvantages:
- Assumes the transaction database is memory-resident
- Requires many database scans
HBase
What is HBase?
- The Hadoop database
- Non-relational
- An open-source, distributed, versioned, column-oriented store
- Modeled after Google Bigtable
- Runs on top of HDFS (the Hadoop Distributed File System); a short client-API sketch follows
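As a quick illustration of the store model, here is a minimal sketch that writes and reads a frequency count through the HBase Java client API. The table name "wordpairs", column family "f", and qualifier "count" are hypothetical, not the project's actual schema.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FrequencyStore {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("wordpairs"))) {

            // Row key is the word combination; column family "f" holds the count.
            Put put = new Put(Bytes.toBytes("cloud computing"));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(42L));
            table.put(put);

            // Read the count back for the same combination.
            Result r = table.get(new Get(Bytes.toBytes("cloud computing")));
            long n = Bytes.toLong(r.getValue(Bytes.toBytes("f"), Bytes.toBytes("count")));
            System.out.println("cloud computing -> " + n);
        }
    }
}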
MapReduce
- A framework for processing highly distributable problems across huge datasets, using a large number of nodes (a cluster).
- Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured). The canonical word-count example is sketched below.
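The standard illustration of the model is word count, sketched here with the Hadoop Java API. This is the textbook example, not the project's code; it also shows why single-word frequencies are easy: one map and one reduce phase suffice.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String tok : value.toString().toLowerCase().split("\\W+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for one word arrive together; sum them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}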
MapReduce
Mappers and Reducers
- Mappers:
  FrequentItemsMap: finds the combinations and assigns the key value for each combination (a sketch of this pair follows below)
  CandidateGenMap
  AssociationRuleMap
- Reducers:
  FrequentItemsReduce
  CandidateGenReduce
  AssociationRuleReduce
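The slides name these classes but not their bodies. Below is a sketch of what the first pair, FrequentItemsMap and FrequentItemsReduce, might look like for 2-word combinations: the mapper emits each sorted word pair as the key, and the reducer sums the counts and applies a minimum-support threshold. The bodies and the threshold are assumptions, not the authors' code.

import java.io.IOException;
import java.util.Arrays;
import java.util.TreeSet;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FrequentItems {

    // FrequentItemsMap (sketch): emit each distinct 2-word combination in a line,
    // with the two words sorted so that the pair itself serves as the key;
    // "computing cloud" and "cloud computing" then count as the same itemset.
    public static class FrequentItemsMap
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            TreeSet<String> words = new TreeSet<>(
                Arrays.asList(line.toString().toLowerCase().split("\\W+")));
            words.remove("");                              // drop empty tokens
            String[] w = words.toArray(new String[0]);
            for (int i = 0; i < w.length; i++)
                for (int j = i + 1; j < w.length; j++)
                    ctx.write(new Text(w[i] + " " + w[j]), ONE);  // key = sorted pair
        }
    }

    // FrequentItemsReduce (sketch): sum the counts for each pair and keep
    // only the pairs that meet an assumed minimum-support threshold.
    public static class FrequentItemsReduce
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private static final int MIN_SUPPORT = 2;          // assumed threshold
        @Override
        protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            if (sum >= MIN_SUPPORT) ctx.write(pair, new IntWritable(sum));
        }
    }
}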
Flow Chart
(figure: project flow chart with a yes/no decision branch; image not recoverable from the text)
Schedule
- 1 week: talking to the experts at FutureGrid
- 1 week: survey of HBase and the Apriori algorithm
- 4 weeks: implementing the Apriori algorithm
- 2 weeks: testing the code and getting the results
Results
Conclusion
- Execution takes more time on a single node.
- As the number of mappers increases, performance improves.
- When the data is very large, single-node execution takes much longer and behaves erratically.
Screenshot
Known Issues
- When the frequency is very low for a large data set, the reducer takes more time.
- E.g., a text paragraph in which words are not repeated often.
Future Work
- The analysis can be done with Twister and other platforms.
- The algorithm can be extended to other applications that use machine learning techniques.
References
- http://en.wikipedia.org/wiki/Text_mining
- http://en.wikipedia.org/wiki/Apriori_algorithm
- http://hbase.apache.org/book/book.html
- http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/itemset_apriori.html
- http://www.codeproject.com/KB/recipes/AprioriAlgorithm.aspx
- http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf
Questions?