Introducing Apache Mahout
Scalable Machine Learning for All!
Grant Ingersoll
Lucid Imagination

Overview
• What is Machine Learning?
• Mahout

Definition
• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”
  – Intro. To Machine Learning by E. Alpaydin
• Subset of Artificial Intelligence
  – Many other fields: comp sci., biology, math, psychology, etc.

Types
• Supervised
  – Using labeled training data, create function that predicts output of unseen inputs
• Unsupervised
  – Using unlabeled data, create function that predicts output
• Semi-Supervised
  – Uses labeled and unlabeled data

Characterizations
• Lots of Data
• Identifiable Features in that Data
• Too big/costly for people to handle
  – People still can help

Clustering
• Unsupervised
• Find Natural Groupings
  – Documents
  – Search Results
  – People
  – Genetic traits in groups
  – Many, many more uses

Example: Clustering
[Figure: Google News]

Collaborative Filtering
• Unsupervised
• Recommend people and products
  – User-User
    • User likes X, you might too
  – Item-Item
    • People who bought X also bought Y

Example: Collab Filtering
[Figure: Amazon.com]

Classification/Categorization
• Many, many types
• Spam Filtering
• Named Entity Recognition
• Phrase Identification
• Sentiment Analysis
• Classification into a Taxonomy

Example: NER
[Figure: excerpt from Yahoo News, labeled “NER?”]

Example: Categorization

Info. Retrieval
• Learning Ranking Functions
• Learning Spelling Corrections
• User Click Analysis and Tracking

Other
• Image Analysis
• Robotics
• Games
• Higher level natural language processing
• Many, many others

What is Apache Mahout?
• A Mahout is an elephant trainer/driver/keeper, hence…
[Figure: Hadoop (and other distributed techniques) + Machine Learning = Mahout]

What?
• Hadoop brings:
  – Map/Reduce API
  – HDFS
  – In other words, scalability and fault tolerance
• Mahout brings:
  – Library of machine learning algorithms
  – Examples

Why Mahout?
• Many Open Source ML libraries either:
  – Lack Community
  – Lack Documentation and Examples
  – Lack Scalability
  – Lack the Apache License ;-)
  – Or are research-oriented

Why Mahout?
• Intelligent Apps are the Present and Future
• Thus, Mahout’s Goal is:
  – Scalable Machine Learning with Apache License

Current Status
• What’s in it:
  – Simple Matrix/Vector library
  – Taste Collaborative Filtering
  – Clustering
    • Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet
  – Classifiers
    • Naïve Bayes
    • Complementary NB
  – Evolutionary
    • Integration with Watchmaker for fitness function

How?
• Examples
  – Taste
  – Clustering
  – Classification
  – Evolutionary

Taste: Movie Recommendations
• Given ratings by users of movies, recommend other movies
• http://lucene.apache.org/mahout/taste.html#demo

Taste Demo
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true
• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true

Clustering: Synthetic Control Data
• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series
• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples
  – o.a.mahout.clustering.syntheticcontrol.*
• Outputs clusters…

Classification: NB and CNB Examples
• 20 Newsgroups
  – http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups
• Wikipedia
  – http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Evolutionary
• Traveling Salesman
  – http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman
• Class Discovery
  – http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

What’s Next?
• More Examples
• Winnow/Perceptron (MAHOUT-85)
• Text Clustering
• Association Rules (MAHOUT-108)
• Logistic Regression
• Solr Integration (SOLR-769)
• GSOC

When, Who
• When? Now!
  – Mahout is growing
• Who? You!
  – We want programmers who:
    • Are comfortable with math
    • Like to work on hard problems
  – We want others to:
    • Kick the tires

Where?
• http://lucene.apache.org/mahout
  – Hadoop - http://hadoop.apache.org
• http://cwiki.apache.org/MAHOUT
• mahout-{user|dev}@lucene.apache.org
  – http://www.lucidimagination.com/search/p:mahout

Resources
• “Programming Collective Intelligence” by Segaran
• “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank
• “Taming Text” by Ingersoll and Morton
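
As an illustration of the item-item idea from the Collaborative Filtering slide ("People who bought X also bought Y"), here is a minimal sketch in plain Python. It does not use Mahout or Taste. The function names `co_occurrence` and `also_bought` and the sample baskets are hypothetical, and counting raw co-occurrences is only one simple way to score how related two items are.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """For every item, count how many baskets it shares with each other item."""
    counts = {}
    for basket in baskets:
        # Deduplicate and sort so each unordered pair is visited once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def also_bought(counts, item, n=3):
    """Return up to n items most often bought with `item` (ties broken by name)."""
    ranked = sorted(counts.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]


# Hypothetical purchase histories, one set of items per user (assumed data).
baskets = [
    {"book", "lamp", "pen"},
    {"book", "pen"},
    {"book", "lamp"},
    {"pen", "ink"},
]

counts = co_occurrence(baskets)
print(also_bought(counts, "book"))
```

With these sample baskets, "lamp" and "pen" each co-occur with "book" twice, so both are returned for "book". A scalable version would distribute the pair counting (for example as a Map/Reduce job), which is the kind of workload the deck positions Mahout for.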