Big Data Mining and Machine Learning
A. Taylan Cemgil
24.12.2012, ITO Istanbul
http://www.cmpe.boun.edu.tr/pilab

• Machine Learning
• Use Cases
• Supervised Learning
• Classification
• Unsupervised Learning
• Clustering
• Dimensionality Reduction
• Probabilistic Approach to Machine Learning
• Probability Theory
• Graphical Models, Probabilistic Expert Systems
• Time Series
• Matrix and Tensor Factorization
• Sensor Fusion
• Scaling up Machine Learning
• Architectures
• References

ML for Big Data, Cemgil, 24.12.2012

Collection of computational methods to …
• Detect hidden patterns in data
• Create useful predictions about unseen data
• Make decisions under uncertainty
• Transform raw data into useful knowledge

Mathematics and Statistics
• Optimization
• Numerical Linear Algebra
• Probability Theory

Computer Science
• Databases
• Parallel Processing
• Artificial Intelligence
• Information Retrieval
• Graphics/Visualization

Electrical Engineering
• Pattern Recognition
• Signal Processing
• Detection/Estimation
• Information Theory
• Data Compression

Facets of the same problem
Differences in emphasis/terminology
Historical evolution of the fields
Data Mining: Database Systems, Data Structures
Statistics: Probability Theory, Mathematics
Machine Learning: Artificial Intelligence, Pattern Recognition

Thinking about old methods with a new mindset … and inventing new ones
Curse/Blessing of Dimensionality
Infrastructure is cheaper
Cloud Computing
Sensor Networks (“new kind of data”)
Speed (“real time”)

Emphasis on System Integration
Reached Critical Mass/Mature Technology

“Data explosion is bigger than Moore's law”
Computers get faster and cheaper every year, but the amount of data that needs to be processed grows even faster.
[Figure: growth of DATA vs. CPU]

Value    American/Turkish (short scale)    European (long scale)
10^3     Thousand                          Thousand
10^6     Million                           Million
10^9     Billion                           Milliard
10^12    Trillion                          Billion
10^15    Quadrillion                       Billiard
10^18    Quintillion                       Trillion
…        1000 × 1000^n                     1000000^n

Unit              Decimal    Binary
kilobyte (kB)     10^3       2^10
megabyte (MB)     10^6       2^20
gigabyte (GB)     10^9       2^30
terabyte (TB)     10^12      2^40
petabyte (PB)     10^15      2^50
exabyte (EB)      10^18      2^60
zettabyte (ZB)    10^21      2^70
yottabyte (YB)    10^24      2^80

1 TB = 1 000 000 000 000 bytes = 1 trillion bytes
1 PB = 1 000 000 000 000 000 bytes = 1 quadrillion bytes

CERN: the Large Hadron Collider produces about 15 petabytes of data per year (15 000 × 1 TB).
Google processes about 24 petabytes of data per day (24 000 × 1 TB).

Facebook's Hadoop Distributed File System (HDFS) is reported to hold about 100 PB (100 000 × 1 TB).
Global Internet traffic per month in 2011 is estimated at about 27 500 PB (source: Cisco) (27 500 000 × 1 TB).

“We are drowning in data and starving for knowledge” (J. Naisbitt, quoted in Machine Learning: A Probabilistic Perspective, K. P. Murphy)

Product Recommendation
Market Basket Analysis
Event/Activity/Behavior Analysis
Campaign Management and Optimization
Supply-Chain Management and Analytics
Market and Consumer Segmentation

Netflix: 18K movies × 500K users, 99% sparse

Network Monitoring and Performance Optimization
Pricing Optimization
Customer Churn Management
Call Detail Record (CDR) Analysis
(Mobile) User Behavior Analysis
Cybersecurity, Detection and Prevention of DDoS Attacks
Infrastructure Planning

Fraud Detection/Risk Estimation
High-Speed Trading
Anomaly/Changepoint Detection

Clickstream Segmentation and Analysis
Ad Targeting/Selection, Forecasting and Optimization
Click Fraud Detection/Prevention
Social Graph Analysis
Customer Segmentation
Newsgroup/Blog/Social Media Opinion Tracking

[Figure: Community Detection (source: matlab exchange)]

Ad Personalization: match ads with users
Key income generator for Google, Yahoo

Urban Traffic Management
Energy Grid Management/Optimization, Power Generation Management
Environment Monitoring

Diagnosis and Medical Expert Systems
Health Insurance Fraud Detection
Patient Care Quality and Program Analysis
Drug Discovery
Remote Monitoring

X(gene, sample, time)

Pragmatic view
Small Data: naïve algorithms are feasible
Medium Data: feasibly processed on one machine
Big Data: does not fit on one machine
Complex relational data
Analysis of pairwise/higher-order interactions between entities

Classification
Feature 1 (x_1)    Feature 2 (x_2)    Feature 3 (x_3)    Feature 4 (x_4)    Class (c)
5.1                4.3                2.1                0.3                0
5.7                3.5                3.2                0.8                0
3.4                5.2                0.4                0.6                1

c ≈ f(w_1 x_1 + w_2 x_2 + ⋯ + w_N x_N)

Ad Prediction on a Cluster of 1000 Machines
What is the probability that a given ad will be clicked, given some context?
(A Reliable Effective Terascale Linear Learning System, Agarwal et al. 2012)
Features = 16 M
Number of Examples = 17 Billion
3 TB Entries
1000 Machines

1. On each node, use online learning independently to find a parameter vector.
2. Use AllReduce to average the weights.
3. On each node, compute the sum of the gradient for each example.
4. Use AllReduce to add the gradients at each node.
5. Use L-BFGS to update the weight vector, then go to step 3.

Clustering
Dimensionality Reduction
Visualization

[Figure: Terms-Documents]

Probability Theory
“Probability theory is nothing but common sense reduced to calculation” (P. Laplace)
Graphical Models, Probabilistic Expert Systems
Time Series
Example: Network flow classification

[Figure: Graphical Model Through Time]

Mobile 3G usage patterns; monitor applications without Deep Packet Inspection (DPI)
8 hours of capture, anonymised, without payload
1 TB
Joint work of Kurt, Mungan, Saygun with Ericsson/Avea, FP7 Mevico

[Videos: VIDEO, VIDEO2]

Tracking

Matrix with missing entries:

1    2    1.5
?    4    3
3    6    ?
4    8    6.1

The same matrix with a column factor (1, 2, 3, 4) and a row factor (1, 2, 1.5):

     1    2    1.5
1    1    2    1.5
2    ?    4    3
3    3    6    ?
4    4    8    6.1

Missing entries filled in as products of the factors:

     1    2    1.5
1    1    2    1.5
2    2    4    3
3    3    6    4.5
4    4    8    6.1

[Slide from ICML 2011 tutorial, Langford et al.]

A. Gray, Analyzing Massive Datasets, Skytree, ML Company
Data Scientist: The Sexiest Job of the 21st Century (HBR)
Agarwal et al., A Reliable Effective Terascale Linear Learning System

Data is not Knowledge
More Data is not more Knowledge
ML for Big Data requires a new mindset for algorithm design
Big Data is not only about entities but also about their relations and interactions
Many applications; ML provides viable solutions
New CS education: need more Maths, Physics and Social Science majors
Big Data = Big Potential

Ground Truth Labelling
Difficult but a must
Cheaters abound
Validation of labellers + qualification test
Amazon Mechanical Turk
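
The matrix completion example from the factorization slides (a 4 × 3 matrix with two unknown entries, filled in as 2 and 4.5) can be reproduced with a small rank-1 factorization. What follows is only a minimal sketch, assuming alternating least squares over the observed entries; the slides do not say which algorithm was used, and the function name `rank1_complete` and the iteration count are illustrative choices, not taken from the talk.

```python
# Observed entries of the 4 x 3 example matrix; None marks a missing entry.
X = [
    [1.0, 2.0, 1.5],
    [None, 4.0, 3.0],
    [3.0, 6.0, None],
    [4.0, 8.0, 6.1],
]


def rank1_complete(X, n_iter=100):
    """Fit X[i][j] ~ u[i] * v[j] on the observed entries only (hypothetical helper)."""
    n_rows, n_cols = len(X), len(X[0])
    u = [1.0] * n_rows  # column factor, one value per row of X
    v = [1.0] * n_cols  # row factor, one value per column of X
    for _ in range(n_iter):
        # Least-squares update of each u[i] with v held fixed.
        for i in range(n_rows):
            obs = [j for j in range(n_cols) if X[i][j] is not None]
            u[i] = sum(X[i][j] * v[j] for j in obs) / sum(v[j] ** 2 for j in obs)
        # Least-squares update of each v[j] with u held fixed.
        for j in range(n_cols):
            obs = [i for i in range(n_rows) if X[i][j] is not None]
            v[j] = sum(X[i][j] * u[i] for i in obs) / sum(u[i] ** 2 for i in obs)
    # The completed matrix is the outer product of the two factors.
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


completed = rank1_complete(X)
# Estimates for the two missing entries (row 2, column 1 and row 3, column 3).
print(round(completed[1][0], 2), round(completed[2][2], 2))
```

Because the observed entry 6.1 is not exactly rank-1 (4 × 1.5 = 6), the estimates should land near, but not exactly on, the values 2 and 4.5 shown on the slide.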