CS525: Big Data Analytics
Machine Learning on Hadoop
Fall 2013
Elke A. Rundensteiner
Analytics?
• Machine learning, data mining & statistics tools
• Analyze/mine/summarize large datasets
• Extract knowledge from past or streaming data
• Predict trends in future data
ML Today
• Internet search clustering
• Social network analysis
• Taxonomy transformations
• Market analytics
• Recommendation systems
• Log analysis & event filtering
• SPAM filtering
• Fraud detection
Tools & Algorithms
• Collaborative Filtering
• Clustering Techniques
• Classification Algorithms
• Association Rules
• Frequent Pattern Mining
• Statistical libraries (Regression, SVM, …)
• Others…
Common Use Cases
Make It Industry Strength: Big Data
• Efficient in managing big data, but no analysis or mining of data
• Efficient in analyzing/mining data, but not scalable
Some Projects
• Apache Mahout
• Open-source package on Hadoop for data mining and machine learning
• Revolution R (R-Hadoop or Radoop)
• Extensions to the R package to run on Hadoop
Apache Mahout
• Apache Software Foundation project
• Create scalable machine learning libraries
• Why?
• Many open-source ML libraries either:
• Lack community
• Lack documentation
• Lack scalability
• Or are research-oriented only
Support Machine Learning
[Figure: Mahout component diagram. Labels: Applications; Examples; Genetic; Frequent Pattern Mining; Classification; Clustering; Recommenders; Utilities (Lucene/Vectorizer); Math (Vectors/Matrices/SVD); Collections (primitives); Apache Hadoop]
But Must Scale & Perform
• Be as fast as possible, given the intrinsic algorithm
• Scale to as much data as possible
• What is expressible as map-reduce jobs?
• Work in progress...
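To make the "expressible as map-reduce" question concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce phases in plain Python. It does not use Hadoop or Mahout; the `map_reduce` helper, the `(user, item, rating)` record layout, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # stands in for the shuffle phase
    return {key: reducer(key, values) for key, values in groups.items()}

def rating_mapper(record):
    # Hypothetical record layout: (user, item, rating)
    _user, item, rating = record
    yield item, rating

def mean_reducer(_key, values):
    # Average of all ratings emitted for one item
    return sum(values) / len(values)

ratings = [("u1", "milk", 4.0), ("u2", "milk", 2.0), ("u1", "bread", 5.0)]
item_means = map_reduce(ratings, rating_mapper, mean_reducer)
print(item_means)
```

A computation fits this pattern when each record can be processed independently in the map step and the per-key results can be combined in the reduce step. Iterative algorithms typically need one such job per iteration.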
C1: Collaborative Filtering
C2: Clustering
• Group similar objects together
• K-Means, Fuzzy K-Means, Density-Based, …
• Different distance measures
• Manhattan, Euclidean, …
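As a rough illustration of how the distance measure plugs into clustering, the sketch below runs a basic K-Means loop with an interchangeable distance function. It is a single-machine toy in plain Python, not Mahout's implementation; the function names, the sample points, and the initial centroids are assumptions.

```python
import math

def manhattan(a, b):
    # Sum of absolute coordinate differences
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, centroids, distance, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)  # empty clusters keep their centroid
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    # If the iteration limit is hit, clusters reflect the assignment before the last update
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, clusters = k_means(points, [(0, 0), (10, 10)], euclidean)
print(centroids)
```

Passing `manhattan` instead of `euclidean` changes only how "nearest" is decided; the update step still uses the mean of each cluster.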
C3: Classification
FPM: Frequent Pattern Mining
• Find the frequent itemsets
• <milk, bread, cheese> are sold frequently together
• Very common in market analysis, access pattern analysis, etc.
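The idea of a frequent itemset can be sketched with a brute-force count over a handful of baskets. This is only an illustration in plain Python: it enumerates every small combination, which does not scale, and it is not an optimized algorithm such as FP-Growth. The `frequent_itemsets` function, the `min_support` threshold, and the sample baskets are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Count every itemset up to max_size and keep those seen in at least min_support baskets."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # canonical order so each itemset has one key
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

baskets = [
    ["milk", "bread", "cheese"],
    ["milk", "bread", "cheese", "eggs"],
    ["milk", "bread"],
    ["eggs", "tea"],
]
frequent = frequent_itemsets(baskets, min_support=2)
print(frequent[("bread", "cheese", "milk")])  # number of baskets containing all three
```

With these baskets, milk, bread, and cheese appear together in two of the four transactions, so the triple passes a support threshold of 2, while eggs and tea together do not.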
Matrices and Statistics
• Math libraries
• Vectors, matrices, etc.
• Noise reduction
• Similarity Functions
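Two commonly used similarity functions, cosine similarity on numeric vectors and Jaccard similarity on sets, can be sketched as follows. The slides do not say which functions are meant, so this choice is an assumption, as are the function names and the zero-vector and empty-set conventions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)

print(cosine_similarity([1, 2], [2, 4]))
print(jaccard_similarity({"milk", "bread"}, {"milk", "cheese"}))
```

Cosine similarity ignores vector length and compares direction only, which suits rating or term-count vectors. Jaccard similarity suits data where only presence or absence matters, such as the items in a basket.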
Apache Mahout
• http://mahout.apache.org/