Introducing Apache Mahout
Scalable Machine Learning for All!
Grant Ingersoll
Lucid Imagination
Overview
• What is Machine Learning?
• Mahout
Definition
• “Machine Learning is programming
computers to optimize a performance
criterion using example data or past
experience”
– Introduction to Machine Learning by
E. Alpaydin
• Subset of Artificial Intelligence
– Many other fields: comp sci., biology,
math, psychology, etc.
Types
• Supervised
– Using labeled training data, create a
function that predicts the output of
unseen inputs
• Unsupervised
– Using unlabeled data, create a function
that predicts output (e.g., a grouping)
• Semi-Supervised
– Uses labeled and unlabeled data
Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
– People still can help
Clustering
• Unsupervised
• Find Natural Groupings
– Documents
– Search Results
– People
– Genetic traits in groups
– Many, many more uses
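To make "find natural groupings" concrete, here is a minimal sketch of k-means, one of the clustering algorithms this deck lists for Mahout. It is plain single-machine Python, not Mahout's implementation; the function names (`kmeans`, `squared_distance`, `mean`) and the default parameters are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(component) / n for component in zip(*points))


def kmeans(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return centroids, clusters
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups come only from distances between the points.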
Example: Clustering
Google News
Collaborative Filtering
• Unsupervised
• Recommend people and products
– User-User
• User likes X, you might too
– Item-Item
• People who bought X also bought Y
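The item-item idea ("people who bought X also bought Y") can be sketched by counting how often items share a basket. This is a deliberately simple co-occurrence count in plain Python, not the Taste algorithm; the function name `also_bought` and the basket data format are assumptions for illustration.

```python
from collections import defaultdict


def also_bought(baskets, item, top_n=3):
    """Rank the items that most often appear in the same basket as `item`."""
    counts = defaultdict(int)
    for basket in baskets:
        unique = set(basket)  # ignore repeated items within one basket
        if item in unique:
            for other in unique - {item}:
                counts[other] += 1
    # Highest co-occurrence first; ties are broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]
```

For the baskets `[["X", "Y"], ["X", "Y", "Z"], ["X", "W"], ["Y", "Z"]]`, `also_bought(baskets, "X")` ranks `Y` first because it shares two baskets with `X`.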
Example: Collab Filtering
Amazon.com
Classification/Categorization
• Many, many types
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy
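Spam filtering, the first task listed above, is a common illustration for naive Bayes, which this deck later names among Mahout's classifiers. The sketch below is a small multinomial naive Bayes with add-one smoothing in plain Python. It is not Mahout's code; the names `train` and `classify` and the whitespace tokenization are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build a model from (label, text) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: how common this label was in training.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

This is supervised learning: the labeled training pairs are what let the function predict a label for unseen text.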
Example: NER
NER?
Excerpt from Yahoo News
Example: Categorization
Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking
Other
• Image Analysis
• Robotics
• Games
• Higher level natural language
processing
• Many, many others
What is Apache Mahout?
• A Mahout is an elephant
trainer/driver/keeper, hence…
Hadoop (and other distributed techniques) + Machine Learning = Mahout
What?
• Hadoop brings:
– Map/Reduce API
– HDFS
– In other words, scalability and fault tolerance
• Mahout brings:
– Library of machine learning algorithms
– Examples
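The Map/Reduce model that Hadoop brings can be sketched on a single machine with the classic word count. The code below only imitates the three phases (map, shuffle, reduce) in plain Python. It does not use the Hadoop API, and the function names are assumptions for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Each `map_phase` call looks at one document and each `reduce_phase` call looks at one key, so a framework can spread those calls across many machines. That independence is where the scalability comes from.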
Why Mahout?
• Many Open Source ML libraries either:
– Lack Community
– Lack Documentation and Examples
– Lack Scalability
– Lack the Apache License ;-)
– Or are research-oriented
Why Mahout?
• Intelligent Apps are the Present and
Future
• Thus, Mahout’s Goal is:
– Scalable Machine Learning with Apache
License
Current Status
• What’s in it:
– Simple Matrix/Vector library
– Taste Collaborative Filtering
– Clustering
• Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet
– Classifiers
• Naïve Bayes
• Complementary NB
– Evolutionary
• Integration with Watchmaker for fitness function
How?
• Examples
– Taste
– Clustering
– Classification
– Evolutionary
Taste: Movie Recommendations
• Given ratings by users of movies,
recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo
Taste Demo
• http://localhost:8080/mahout-tastewebapp/RecommenderServlet?userID=12&debug=true
• http://localhost:8080/mahout-tastewebapp/RecommenderServlet?userID=43&debug=true
Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
• Each clustering impl. has an example
Job for running in
<MAHOUT_HOME>/examples
– org.apache.mahout.clustering.syntheticcontrol.*
• Outputs clusters…
Classification: NB and CNB Examples
• 20 Newsgroups
– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample
Evolutionary
• Traveling Salesman
– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
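To give a feel for the evolutionary approach on the Traveling Salesman problem, here is a minimal sketch in plain Python. It is not the Mahout/Watchmaker example linked above. It assumes a mutation-only scheme that keeps the better half of each generation, and all names and default parameters are chosen for illustration.

```python
import math
import random


def tour_length(tour, cities):
    """Fitness function: total length of the closed tour (lower is better)."""
    return sum(
        math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def mutate(tour, rng):
    """Return a copy of the tour with one randomly chosen segment reversed."""
    i, j = sorted(rng.sample(range(len(tour)), 2))
    return tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]


def evolve(cities, population_size=30, generations=200, seed=0):
    """Evolve a population of tours and return the shortest one found."""
    rng = random.Random(seed)
    base = list(range(len(cities)))
    # Initial population: random orderings of the city indices.
    population = [rng.sample(base, len(base)) for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the shorter half of the tours unchanged.
        population.sort(key=lambda t: tour_length(t, cities))
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated survivors.
        children = [
            mutate(rng.choice(survivors), rng)
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return min(population, key=lambda t: tour_length(t, cities))
```

Because the survivors are carried over unchanged, the best tour never gets worse from one generation to the next. Evaluating `tour_length` for each candidate is independent work, which is the part a distributed setup can parallelize.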
What’s Next?
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• Text Clustering
• Association Rules (MAHOUT-108)
• Logistic Regression
• Solr Integration (SOLR-769)
• GSoC (Google Summer of Code)
When, Who
• When? Now!
– Mahout is growing
• Who? You!
– We want programmers who:
• Are comfortable with math
• Like to work on hard problems
– We want others to:
• Kick the tires
Where?
• http://lucene.apache.org/mahout
– Hadoop - http://hadoop.apache.org
• http://cwiki.apache.org/MAHOUT
• mahout-{user|dev}@lucene.apache.org
– http://www.lucidimagination.com/search/p:mahout
Resources
• “Programming Collective Intelligence”
by Segaran
• “Data Mining - Practical Machine
Learning Tools and Techniques” by
Witten and Frank
• “Taming Text” by Ingersoll and
Morton