Intel HPC Developer Convention, Salt Lake City 2016, Machine Learning Track
Data Analytics, Machine Learning and HPC in Today's Changing Application Environment
Franz J. Király

An overview of (practical) data analytics
[Diagram: following the scientific method, data and scientific questions are translated into statistical questions and methods; exploration and quantitative modelling (descriptive/explanatory, predictive/inferential) pass through scientific and statistical validation and yield knowledge; statistical programming (R, Python) supports the whole process.]

Data analytics and data science in a broader context
- Raw data -> clean data: a lot of problems and subtleties arise already at these stages; often, most of the manpower in a "data" project needs to go here before reliable data analytics can be attempted.
- Data analytics: statistics, modelling, data mining, machine learning.
- Knowledge: relevant findings and the underlying arguments need to be explained well and properly.

Big Data? What "Big Data" may mean in practice
[Chart: strategies that stop working in reasonable time, plotted by number of features (around 100 to 10,000) and number of data samples (around 1,000 to 10,000,000,000). With growing scale, the strategies that stop working range from manual exploratory data analysis, through kernel methods and OLS, random forests and L1/LASSO (around the same order), and other super-linear algorithms, up to linear algorithms, including simply reading in all the data. Solution strategies: feature extraction and feature selection along the feature axis; large-scale strategies for super-linear algorithms, on-line models, sub-sampling and distributed computing along the sample axis.]

Large-scale motifs in data science: where high-performance computing is helpful/impactful
- "Big models": the "classic", beloved by everyone. Not necessarily a lot of data, but computationally intensive models. Classical example: finite elements and other numerical models. New, fashionable example: large neural networks, aka "deep learning". Common HPC motif: divide and conquer over parts of the model, e.g. neurons/nodes.
- "Big data": what it says, a lot of data (circa 1 million samples or more). The computational challenge arises from processing all of the data. Example: a histogram or a linear regression over huge amounts of data. Common HPC motif: divide and conquer over the training/fitting of the model, e.g. batchwise/epoch-wise fitting.
- Model validation and model selection: this talk's focus. It answers the question: which model is best for your data? This is demanding even for simple models and small amounts of data. Example: is deep learning better than logistic regression, or better than guessing?

Meta-modelling: stylized case studies
- Customer: a hospital specializing in the treatment of patients with a certain disease. Patients with this disease are at risk of experiencing an adverse event (e.g. death). Scientific question: predict the event risk from patient characteristics. Data set: complete clinical records of 1,000 patients, including the event if it occurred. (A code sketch of this case follows after the list.)
- Customer: a retailer who wants to accurately model the behaviour of customers. Customers can buy (or not buy) any of a number of products, or churn. Scientific question: predict future customer behaviour given past behaviour. Data set: complete customer and purchase records of 100,000 customers.
- Customer: a manufacturer who wishes to find the best parameter settings for machines. The parameters influence the amount and quality of the product (or whether the machine breaks). Scientific question: find the parameter settings which optimize the above. Data set: outcomes for 10,000 parameter settings on those machines.
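To make the first case study concrete, the following is a minimal sketch, in scikit-learn, of how the hospital problem could be phrased as a supervised classification task. The synthetic data and the choice of logistic regression are illustrative assumptions, not part of the original case description.

```python
# Hospital case study as supervised classification: predict adverse-event risk.
# Synthetic data stands in for the 1,000 clinical records; in practice X would
# hold the patient characteristics and y the 0/1 adverse-event indicator.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=12, weights=[0.8, 0.2],
                           random_state=0)

# hold out a quarter of the patients so the risk model is judged on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                    # model fitting ("learning")
risk = model.predict_proba(X_test)[:, 1]       # predicted event risk per patient
print("out-of-sample AUC:", round(roc_auc_score(y_test, risk), 3))
```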
Of interest: model interpretability; how accurate the predictions are expected to be; whether the algorithm/model is (easily) deployable in the "real world".
Not of interest: which algorithm/strategy, out of many, exactly solves the task.

Model validation and model selection
Data-centric and data-dependent modelling is a scientific necessity, implied by the scientific method and by the following:
1. There is no model that is good for all data. (Otherwise the concept of a model would be unnecessary.)
2. For given data, there is no a-priori reason to believe that a certain type of model will be the best one. (Any such belief is not empirically justified, hence pseudoscientific.)
3. No model can be trusted unless its validity has been verified by a model-independent argument. (Otherwise the justification of its validity is circular, hence faulty.)
Machine learning provides algorithms and theory for meta-modelling, as well as powerful algorithms motivated by meta-modelling optimality.

Machine Learning and Meta-Modelling in a Nutshell
Leitmotifs of machine learning, from the intersection of engineering, statistics and computer science:
- Engineering and statistics idea: statistical models are objects in their own right, "learning machines" that act as modelling strategies.
- Engineering and computer science idea: any abstract, possibly non-explicit algorithm can be a modelling strategy/learning machine ("computational learning").
- Computer science and statistics idea: the future performance of an algorithm/learning machine can (and should) be estimated ("model validation", "model selection").

Problem types in machine learning
- Supervised learning: some data is labelled by an expert/oracle. Task: predict the label from the covariates. The statistical models are usually discriminative. Examples: regression, classification.
- Unsupervised learning: the training data is not pre-labelled. Task: find "structure" or "patterns" in the data. The statistical models are usually generative. Examples: clustering, dimension reduction.

Advanced learning tasks
Complications in the labelling:
- Semi-supervised learning: some training data are labelled, some are not.
- Reinforcement learning: data are not directly labelled; there is only an indirect gain/loss signal.
- Anomaly detection: all or most data are "positive examples"; the task is to flag "test negatives".
Complications through correlated data and/or time:
- On-line learning: the data is revealed over time, and models need to update.
- Forecasting: each data point has a time stamp; the task is to predict the temporal future.
- Transfer learning: the data comes in dissimilar batches; train and test may be distinct.

What is a learning machine?
An algorithm that solves, for example, the tasks above. Illustration for a supervised learning machine: the observations in the "training data" enter the model fitting step ("learning"), which, controlled by model tuning parameters, produces a fitted model; the fitted model then turns new data into predictions, e.g. to base decisions on. Examples: generalized linear models, linear regression, support vector machines, neural networks (= "deep learning"), random forests, gradient boosting, ...

Example: linear regression
Linear regression follows the same pattern: the observations in the training data are used to fit the model ("learning"), and the fitted model produces predictions on new data. A tuning question: fit an intercept or not? Model validation: does the model make sense? (A code sketch follows below.)
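As a concrete version of the linear regression example, here is a small sketch of the fit/predict pattern, with "fit an intercept or not?" treated as a tuning parameter whose effect is checked on held-out data. The synthetic data is an assumption made purely for illustration.

```python
# Linear regression as a learning machine: fit on training data, predict on
# new data, and check whether fitting an intercept makes the model better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(scale=1.0, size=200)  # true model has an intercept

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for fit_intercept in (True, False):
    model = LinearRegression(fit_intercept=fit_intercept)
    model.fit(X_train, y_train)            # model fitting ("learning")
    pred = model.predict(X_test)           # prediction on new data
    mse = mean_squared_error(y_test, pred)
    print(f"fit_intercept={fit_intercept}: test MSE = {mse:.2f}")
```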
Model validation: testing out-of-sample
[Diagram: the learning machine (the prediction strategy, e.g. regression, a GLM, or more advanced methods) is trained "in-sample" on the training data; the learnt model then predicts "out-of-sample" on the held-out test data, and the predictions are compared and quantified against the test labels, "the truth", e.g. when evaluating a regression model.]
Predictive models need to be validated on unseen data. The only (general) way to test the goodness of prediction is to actually observe prediction, which means that the part of the data used for testing must not have been seen by the algorithm before. (Note: this includes the cases where the machine is linear regression, deep learning, etc.)

"Re-sampling"
[Diagram: all the data is split repeatedly into training data 1/2/3 and test data 1/2/3; predictors 1, 2 and 3 are each fitted on every training set and evaluated on the corresponding test set, the errors from splits 1, 2, 3 are aggregated, and the aggregates are compared.]
Multiple algorithms are compared on multiple data splits/sub-datasets. This is the state-of-the-art principle in model validation, model comparison and meta-modelling.

Types of re-sampling
- k-fold cross-validation (often k = 5). How the training/test splits are obtained: 1. divide the data into k (almost) equal parts; 2. obtain k train/test splits by letting each part be the test data exactly once, with the rest of the data as the training set. Pros/cons: a good compromise between runtime and accuracy when k is small compared to the data size.
- Leave-one-out = [number of data points]-fold cross-validation. Very accurate, but with a high run-time.
- Repeated sub-sampling (parameters: training/test size, number of repetitions). How the splits are obtained: 1. draw a random sub-sample of training and test data of the specified sizes; 2. repeat step 1 the desired number of times. Pros/cons: can be arbitrarily quick, but also arbitrarily inaccurate, depending on the parameter choice (the train/test sets need not cover all the data); can be combined with k-fold.

Quantitative model comparison
A "benchmarking experiment" results in a table like the following (the models in the rows stand for whichever strategies are being compared):

  model   RMSE         MAE
  A       15.3 ± 1.4   12.3 ± 1.7
  B        9.5 ± 0.7    7.3 ± 0.9
  C       13.6 ± 0.9   11.4 ± 0.8
  D       20.1 ± 1.2   18.1 ± 1.1

Confidence regions (or paired tests) are used to compare models to each other: A is better than B, B is better than A, or A and B are equally good. An uninformed model (a stupid model or a random guess) needs to be included, otherwise the statement "is better than an uninformed guess" cannot be made; a "useful model" is one that is (significantly) better than the uninformed baseline. (A benchmarking sketch follows below.)

Meta-model: automated parameter tuning
Re-sampling is used to determine the best parameter setting; for validation, new unseen data needs to be used. [Diagram: the training data is used for model tuning (train, tune, test on inner splits); the model with the best parameters is then fitted to all of the training data and evaluated, by predicting and quantifying, on the "real" test data.] Multi-fold schemes are nested, "splits within splits": the whole training data is re-sampled again into inner training and test data, the candidate parameter settings (parameters 1, 2, 3, ...) are compared by their goodness on these inner splits, and the best parameters are selected. Important caveat: the "inner" training/test splits need to be part of any "outer" training set, otherwise the validation is not out-of-sample. The tuning meta-method brings "new" tuning parameters of its own, namely which measure of predictive goodness to use and which inner re-sampling scheme; methods are usually less sensitive to these. (A nested cross-validation sketch follows below.)
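The benchmarking sketch referenced above: a minimal version of such an experiment in scikit-learn, comparing a few learning machines on the same k-fold splits and including an uninformed baseline so that "better than guessing" can actually be claimed. The synthetic dataset and the particular models are illustrative assumptions.

```python
# A small benchmarking experiment: same 5-fold splits for every model,
# RMSE reported as mean ± standard deviation over the folds, with an
# uninformed baseline (predict the mean) included for comparison.
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)      # k = 5, as is common

models = {
    "uninformed baseline": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(random_state=0),
}
for name, model in models.items():
    scores = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_root_mean_squared_error")
    print(f"{name:20s} RMSE {scores.mean():6.2f} ± {scores.std():.2f}")
```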
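The nested cross-validation sketch: the inner re-sampling determines the best parameter setting (here via GridSearchCV), while the outer re-sampling validates the tuned machine on data that the tuning never saw, which is exactly the caveat above. The support vector machine and its parameter grid are illustrative assumptions.

```python
# "Splits within splits": inner CV tunes C, outer CV validates the tuned model.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)   # tuning splits
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)   # validation splits

# the tuning wrapper is itself a learning machine: fit() runs the inner CV
tuned_svm = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)

# the outer folds only ever evaluate on data the inner tuning did not see
outer_scores = cross_val_score(tuned_svm, X, y, cv=outer_cv)
print(f"nested CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```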
Meta-strategies in ML
- "Model tuning": a model with tuning parameters; the best tuning parameters are determined using a data-driven tuning algorithm.
- "Ensemble learning": a number of (possibly "weak") models A, B, C, D are combined into a "strong" ensemble model.

Object dependencies in the ML workflow
One interesting dataset ("small data") is re-sampled into multiple train/test splits, on each of which the strategies are compared, and most of the strategies are themselves parameter-tuned by the same re-sampling principle. Typical numbers: N = 100 to 100,000 data points; 5 to 10 outer splits; M = 5 to 20 strategies compared; 3 to 5 nested splits; 10 to 10,000 parameter combinations; 10 to 1,000 base learners (ensembles add further nesting). Runtime = 10 x 10 x 5 x 1,000 (x 100) x one run on N samples, where one run is usually O(N²) or O(N³).

Machine learning toolboxes
[Table: an incomplete list of influential toolboxes (among them scikit-learn, caret, and a Java-based toolbox with 3rd-party Python wrappers), compared by language, modularity of the API, GUI, coverage of common models, model tuning and meta-methods, and model validation and comparison; the individual cells are not recoverable from the slide export.] scikit-learn is perhaps the most widely used ML toolbox.

The object-oriented ML toolbox API, as found in the R/mlr or scikit-learn packages
Leading principles: encapsulation and modularization. A "learning machine" object (e.g. linear regression) has a modular, object-oriented structure: fit(traindata), predict(testdata), plus metadata and model info. The abstraction treats models as objects with a unified API:
- Learning machines; public interface: fitting, predicting, setting parameters; in R/mlr: Learner; in sklearn: estimator.
- Re-sampling schemes; public interface: sample, apply and get results; in R/mlr: ResampleDesc; in sklearn: the splitter classes in model_selection.
- Evaluation metrics; public interface: compute from results, tabulate; in R/mlr: Measure; in sklearn: the classes in metrics.
- Meta-modelling (tuning, ensembling, pipelining); public interface: wrapping machines by strategy; in R/mlr: various wrappers; in sklearn: various wrappers, fused classes, Pipeline.
- Learning task; public interface: benchmark, list strategies/measures; in R/mlr: Task; in sklearn: implicit, not encapsulated.
(A sketch of the wrapping idea in scikit-learn follows below.)

HPC for benchmarking/validation today
scikit-learn parallelises via joblib, mlr via parallelMap. All the data ("small data", N = 100 to 100,000 data points) is re-sampled as above, and the work is distributed to clusters/cores at one selected level of the nesting, the levels being mutually exclusive: (1) the 5 to 10 outer splits, (2) the M = 5 to 20 strategies, (3) the 3 to 5 nested splits, or (4) the 10 to 10,000 parameter combinations; below these sit the 10 to 1,000 base learners. In addition there are algorithm-specific HPC interfaces, e.g. for deep learning.
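Continuing the point about today's toolbox-level HPC support, the following is a sketch of how scikit-learn exposes it: the independent fits inside a grid search or cross-validation are handed to joblib via the n_jobs argument, which distributes one level of the nesting over the available cores. The dataset and parameter grid are illustrative assumptions.

```python
# Parallel benchmarking/tuning via joblib: n_jobs=-1 uses all available cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# the (parameter combination, inner fold) fits are independent of each other
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    cv=3,
    n_jobs=-1,            # distribute these fits over the cores via joblib
)

# the outer validation folds are likewise independent and could be distributed
scores = cross_val_score(search, X, y, cv=5)
print(f"outer-CV accuracy: {scores.mean():.3f}")
```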
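The wrapping sketch referenced in the abstraction list above: in scikit-learn terms, a Pipeline fuses preprocessing and a learner into a single learning machine, and GridSearchCV wraps that machine into a self-tuning machine; both expose the same fit/predict interface as any plain estimator. The data, the scaler and the parameter grid are illustrative assumptions.

```python
# "Wrapping machines by strategy": Pipeline and GridSearchCV are themselves
# learning machines with the usual fit/predict interface.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

machine = Pipeline([                       # fused learning machine
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
tuned_machine = GridSearchCV(              # wrapper: machine + tuning strategy
    machine, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)

# the wrapped object is used exactly like an unwrapped one
tuned_machine.fit(X_train, y_train)
print(f"held-out accuracy: {tuned_machine.score(X_test, y_test):.3f}")
```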
HPC support tomorrow?
[Diagram: a possible layered architecture for ML benchmarking. Layer 1: re-samples, algorithms and parameters over the data (e.g. in Hadoop). Layer 2: a scheduler for algorithms and meta-algorithms, working on a data/task pipeline with a full graph of dependencies. Layer 3: optimized primitives such as linear systems, convex optimization and stochastic gradient descent, e.g. MKL, CUDA, BLAS. Layer 4: the hardware API, e.g. distributed, multi-core, multi-type/heterogeneous hardware. Image sources: Continuum Analytics; Intel Math Kernel Library.] Combining (?) MapReduce, DAAL, dask and joblib, possibly on top of TBB?

Challenges in ML APIs and HPC
Surprisingly few resources have been invested in ML toolboxes; the most advanced toolboxes are currently open-source and academic. Features that would be desirable to the practitioner, but are not available without mid-scale software development:
- Integration of (a) data management, (b) exploration and (c) modelling. Especially challenging: integration in large-scale scenarios, e.g. MapReduce for divide and conquer over data, model parts, and models.
- Full HPC integration at a granular level for distributed ML benchmarking: making full use of parallelism across the nesting and of computational redundancies, with a complete HPC architecture for the whole model benchmarking workflow.
- Non-standard modelling tasks and structured data (including time series): data heterogeneity, multiple datasets, time series, spatial features, images, etc.; forecasting, on-line learning, anomaly detection, change point detection. Meta-modelling and re-sampling for these tasks is an order of magnitude more costly. (A re-sampling sketch for the forecasting case follows after this list.)
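The forecasting re-sampling sketch referenced above: for time-stamped data, ordinary k-fold splits would leak the future into the training set, so an expanding-window scheme such as scikit-learn's TimeSeriesSplit is used instead. The synthetic series and the plain linear trend model are illustrative assumptions.

```python
# Re-sampling that respects time order: train on the past, test on the future.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
t = np.arange(300)
y = 0.05 * t + np.sin(t / 10.0) + rng.normal(scale=0.2, size=t.size)
X = t.reshape(-1, 1)                     # time index as the only feature

errors = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # past only
    pred = model.predict(X[test_idx])                            # temporal future
    errors.append(mean_absolute_error(y[test_idx], pred))
print("MAE per split:", np.round(errors, 3))
```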