Download proposal_Hemanth_Gokavarapu

Frequent Word Combinations Mining and Indexing on HBase
Hemanth Gokavarapu, Santhosh Kumar Saminathan

Introduction
• Many projects on HBase create indexes on multiple data sets
• We can easily find the frequency of a single word
• It is hard to find the frequency of a combination of words
• For example: "cloud computing"

Objective
• This project focuses on finding the frequency of a combination of words
• We use data mining concepts and the Apriori algorithm for this project
• We will be using Map-Reduce and HBase for this project

Survey Topics
• Apriori Algorithm
• HBase
• Map-Reduce

Data Mining
What is data mining?
• The process of analyzing data from different perspectives
• Summarizing data into useful information

Data Mining
How does data mining work?
• Data mining analyzes relationships and patterns in stored transaction data based on open-ended user queries
What technological infrastructure is needed? Two critical technological drivers answer this question:
• Size of the database
• Query complexity

Apriori Algorithm
• Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.
• Association rules are a widely applied data mining approach.
• Association rules are derived from frequent itemsets.
• It uses a level-wise search based on the frequent itemset property.

Algorithm Flow

Apriori Algorithm & Problem Description

Transaction ID   Items Bought
1                Shoes, Shirt, Jacket
2                Shoes, Jacket
3                Shoes, Jeans
4                Shirt, Sweatshirt

If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset that satisfies the minimum support.

Frequent Itemset   Support
{Shoes}            75%
{Shirt}            50%
{Jacket}           50%
{Shoes, Jacket}    50%

If the minimum confidence is 50%, then the only two rules generated from this 2-itemset that have confidence greater than 50% are:
Shoes → Jacket   Support = 50%, Confidence = 66%
Jacket → Shoes   Support = 50%, Confidence = 100%

Apriori Algorithm Example
Min support = 50%

Database D
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

C2 (candidates generated from L1)
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C2 with support counts
itemset   sup.
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2
itemset   sup.
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C3 (candidates generated from L2)
itemset
{2 3 5}

Scan D → L3
itemset   sup.
{2 3 5}   2

Apriori Advantages & Disadvantages
• ADVANTAGES:
  Uses the large itemset property
  Easily parallelized
  Easy to implement
• DISADVANTAGES:
  Assumes the transaction database is memory resident
  Requires many database scans

HBase
What is HBase?
• A Hadoop database
• Non-relational
• Open-source, distributed, versioned, column-oriented store model
• Designed after Google Bigtable
• Runs on top of HDFS (Hadoop Distributed File System)

HBase Architecture

Map Reduce
• Framework for processing highly distributable problems across huge datasets using a large number of nodes (a cluster).
• Processing occurs on data stored either in a filesystem (unstructured) or in a database (structured).

Map Reduce

How Combination Works (cont.)
• The approach is similar to the frequent itemset mining problem
• But only adjacent words are to be mined
• The idea is that if a phrase (a combination of words) is frequent, then its subsets are also frequent.

Schedule
• 1 week: talking to the experts at FutureGrid
• 1 week: survey of HBase and the Apriori algorithm
• 4 weeks: start implementing the Apriori algorithm
• 2 weeks: testing the code and collecting the results

References
• http://en.wikipedia.org/wiki/Text_mining
• http://en.wikipedia.org/wiki/Apriori_algorithm
• http://hbase.apache.org/book/book.html

Questions?
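The two Apriori examples in the slides (the shopping transactions and Database D, both with a minimum support of 50%) can be reproduced with a short in-memory sketch. This is only an illustration of the level-wise search, not the project's Map-Reduce implementation; the function names `apriori` and `confidence` and the variable names are assumptions introduced here.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    min_support is a fraction of the number of transactions (0.5 = 50%).
    """
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # C1: every single item is a candidate.
    candidates = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once per level to count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # L_k
        frequent.update(level)

        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent = sup(A and B) / sup(A)."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return frequent[a | b] / frequent[a]


# Database D from the slides.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq_D = apriori(D, 0.5)

# Shopping transactions from the slides.
shop = [
    {"Shoes", "Shirt", "Jacket"},
    {"Shoes", "Jacket"},
    {"Shoes", "Jeans"},
    {"Shirt", "Sweatshirt"},
]
freq_shop = apriori(shop, 0.5)

print(freq_D[frozenset({2, 3, 5})])                    # 2
print(confidence(freq_shop, {"Jacket"}, {"Shoes"}))    # 1.0
```

The sketch scans the whole transaction list once per level, which mirrors the "requires many database scans" disadvantage listed in the slides.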
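The adjacent-word idea from the "How Combination Works" slide can be sketched in the same level-wise style: a k-word phrase is counted only if both of its (k-1)-word sub-phrases were already frequent. This is a single-machine illustration under several assumptions that are not in the slides: the function name `frequent_phrases`, whitespace tokenization, lowercasing, counting total occurrences rather than documents, and the `max_len` cap are all hypothetical choices.

```python
from collections import Counter


def frequent_phrases(documents, min_count, max_len=3):
    """Return {phrase tuple: occurrence count} for frequent adjacent-word phrases."""
    docs = [d.lower().split() for d in documents]
    frequent = {}
    prev = {}  # frequent phrases of length k - 1
    for k in range(1, max_len + 1):
        counts = Counter()
        for words in docs:
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Apriori pruning: skip phrases whose sub-phrases are not frequent.
                if k > 1 and (gram[:-1] not in prev or gram[1:] not in prev):
                    continue
                counts[gram] += 1
        prev = {g: n for g, n in counts.items() if n >= min_count}
        if not prev:
            break  # no frequent phrase of length k, so none can exist at k + 1
        frequent.update(prev)
    return frequent


docs = [
    "cloud computing on hbase",
    "cloud computing with map reduce",
    "hbase and cloud computing",
]
phrases = frequent_phrases(docs, min_count=2)
print(phrases[("cloud", "computing")])  # 3
```

In a Map-Reduce setting, the inner counting loop would roughly correspond to the map phase and the threshold filter to the reduce phase, with the resulting counts stored in HBase; that mapping is a suggestion, not something the slides specify.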
