CS525: Big Data Analytics
Machine Learning on Hadoop
Fall 2013
Elke A. Rundensteiner

Analytics?
• Machine learning, data mining & statistics tools
• Analyze/mine/summarize large datasets
• Extract knowledge from past or streaming data
• Predict trends in future data

ML Today
• Internet search clustering
• Social network analysis
• Taxonomy transformations
• Market analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering
• Fraud detection

Tools & Algorithms
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…

Common Use Cases

Make It Industry Strength: Big Data
-- Efficient in managing big data
-- Does not analyze or mine data

-- Efficient in analyzing/mining data
-- Do not scale

Some Projects
• Apache Mahout
  • Open-source package on Hadoop for data mining and machine learning
• Revolution R (R-Hadoop or Radoop)
  • Extensions to R package to run on Hadoop

Apache Mahout
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why?
• Many Open Source ML libraries either:
  • Lack Community
  • Lack Documentation
  • Lack Scalability
  • Or are research-oriented only

Support Machine Learning
[Figure: Mahout component stack. Applications; Examples; Genetic; Freq. Pattern Mining; Classification; Clustering; Recommenders; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]

But Must Scale & Perform
• Be as fast as possible
• Scale to as much data as possible

But Must Scale & Perform
• Be as fast as possible given intrinsic algorithm!
• What is expressible as map-reduce jobs?
• Work in progress...
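To illustrate what "expressible as map-reduce jobs" can mean, here is a minimal sketch in plain Python that counts how often pairs of items occur together, a building block for recommenders and frequent pattern mining. It only simulates the map, shuffle, and reduce phases in a single process; it is not the Hadoop API and not Mahout's implementation, and the names `map_phase`, `shuffle`, `reduce_phase`, and `count_pairs` as well as the sample transactions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Map: emit (item_pair, 1) for every unordered pair in one transaction."""
    items = sorted(set(transaction))  # dedupe and sort so each pair has one canonical key
    for pair in combinations(items, 2):
        yield pair, 1


def shuffle(mapped):
    """Shuffle: group all emitted values by key (done by the framework on a real cluster)."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the partial counts collected for one key."""
    return key, sum(values)


def count_pairs(transactions):
    """Run the three phases sequentially over an in-memory list of transactions."""
    mapped = (kv for t in transactions for kv in map_phase(t))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data, for illustration only.
transactions = [
    ["milk", "bread", "cheese"],
    ["milk", "bread"],
    ["bread", "cheese"],
]
pair_counts = count_pairs(transactions)
```

Because each map call looks at a single transaction and each reduce call looks at a single key, both phases can be run in parallel on separate machines, which is the property that makes an algorithm a good fit for Hadoop.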
C1: Collaborative Filtering

C2: Clustering
• Group similar objects together
• K-Means, Fuzzy K-Means, Density-Based, …
• Different distance measures
  • Manhattan, Euclidean, …

C3: Classification

FPM: Frequent Pattern Mining
• Find the frequent itemsets
• <milk, bread, cheese> are sold frequently together
• Very common in market analysis, access pattern analysis, etc.

Matrices and Statistics
• Math libraries
  • Vectors, matrices, etc.
• Noise reduction
• Similarity Functions

Apache Mahout
• http://mahout.apache.org/
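As a concrete companion to the clustering slide (C2), the following is a minimal single-machine sketch of K-Means with a pluggable distance measure. It is plain Python written for illustration, not Mahout code; the function names, the sample points, and the choice of initial centroids are all assumptions. Note that the update step always uses the mean, which is the standard choice for Euclidean distance; with Manhattan distance a median-based update (k-medians) would be the more principled variant.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, centroids, distance=euclidean, iterations=10):
    """Alternate assignment and update steps until centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda i: distance(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignment


# Hypothetical sample data: two visibly separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
labels_l1 = kmeans(points, [points[0], points[3]], distance=manhattan)[1]
```

On this small data set both distance measures produce the same grouping; on less cleanly separated data the choice of measure can change which cluster a point falls into.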