Intel HPC Developer Convention Salt Lake City 2016
Machine Learning Track
Data Analytics, Machine Learning and HPC in today's changing application environment
Franz J. Király
An overview of (practical) data analytics
[Workflow diagram linking: DATA; scientific questions and statistical questions; methods and statistical programming (R, python); exploration and quantitative modelling (descriptive/explanatory, predictive/inferential); scientific and statistical validation; Knowledge; the scientific method.]
Data analytics and data science in a broader context
[Pipeline diagram: Raw data -> Clean data -> Data analytics (statistics, modelling, data mining, machine learning) -> Knowledge.]
There are many problems and subtleties already at the raw-data and cleaning stages; often, most of the manpower in a „data“ project needs to go there first, before one can attempt reliable data analytics.
At the other end, the relevant findings and the underlying arguments need to be explained well and properly.
Big Data?
What „Big Data“ may mean in practice
[Chart: number of features vs. number of data samples, with axis marks at 100, 1,000 and 10,000 (features) and at 10,000,000 and 10,000,000,000 (samples). Strategies that stop working in reasonable time as size grows: manual exploratory data analysis; kernel methods, OLS, random forests, L1/LASSO (around the same order); super-linear algorithms; and eventually linear algorithms, including simply reading in all the data. Solution strategies: feature extraction and feature selection (for many features); large-scale strategies for super-linear algorithms and on-line models; distributed computing and sub-sampling (for many samples).]
Large-scale motifs in data science
= where high-performance computing is helpful/impactful
„Big models“
= the „classic“, beloved by everyone
Not necessarily a lot of data, but computationally intensive models
Classical example: finite elements and other numerical models
New fancy example: large neural networks aka „deep learning“
Common HPC motif: divide/conquer in parts-of-model, e.g. neurons/nodes
„Big data“
= what it says, a lot of data (ca 1 million samples or more)
Computational challenge arises from processing all of the data
Example: histogram or linear regression with huge amounts of data
Common HPC motif: divide/conquer training/fitting of model, e.g. batchwise/epoch fitting
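A minimal sketch of the batchwise-fitting motif, assuming scikit-learn's SGDRegressor and synthetic streamed batches (both are illustrative choices, not part of the talk):

# Batchwise ("out-of-core") fitting sketch: the model is updated one batch at a time,
# so the full dataset never has to be held in memory. Data and batch sizes are made up.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
model = SGDRegressor()
n_batches, batch_size, n_features = 100, 10_000, 20
true_coef = rng.normal(size=n_features)

for _ in range(n_batches):
    # in a real big-data setting, each batch would be streamed from disk or a distributed store
    X = rng.normal(size=(batch_size, n_features))
    y = X @ true_coef + 0.1 * rng.normal(size=batch_size)
    model.partial_fit(X, y)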
Model validation and model selection
= this talk‘s focus
Answers the question: which model is best for your data?
Demanding even for simple models and small amounts of data!
Example: is deep learning better than logistic regression, or guessing?
Meta-modelling: stylized case studies
Customer: Hospital specializing in treatment of patients with a certain disease.
Patients with this disease are at-risk to experience an adverse event (e.g. death)
Scientific question: depending on patient characteristics, predict the event risk.
Data set: complete clinical records of 1,000 patients, including the event if it occurred
Customer: Retailer who wants to accurately model behaviour of customers.
Customers can buy (or not buy) any of a number of products, or churn.
Scientific question: predict future customer behaviour given past behaviour
Data set: complete customer and purchase records of 100,000 customers
Customer: Manufacturer wishes to find best parameter setting for machines.
Parameters influence amount/quality of product (or whether machine breaks)
Scientific question: find the parameter settings which optimize the above
Data set: outcomes for 10,000 parameter settings on those machines
Of interest: model interpretability; how accurate the predictions are expected to be
whether the algorithm/model is (easily) deployable in the „real world“
Not of interest: which algorithm/strategy, out of many, exactly solves the task
Model validation and model selection
= data-centric and data-dependent modelling
a scientific necessity implied by the scientific method and the following:
1. There is no model that is good for all data.
(otherwise the concept of a model would be unnecessary)
2. For given data, there is no a-priori reason to believe
that a certain type of model will be the best one.
(any such belief is not empirically justified hence pseudoscientific)
3. No model can be trusted unless its validity has
been verified by a model-independent argument.
(otherwise the justification of validity is circular hence faulty)
Machine learning provides algorithms & theory for meta-modelling
and powerful algorithms motivated by meta-modelling optimality.
Machine Learning and Meta-Modelling in a Nutshell
Leitmotifs of Machine Learning
from the intersection of engineering, statistics and computer science
Engineering & statistics idea: statistical models are objects in their own right; a modelling strategy becomes a „learning machine“.
Engineering & computer science idea: any abstract algorithm can be a modelling strategy/learning machine: „computational learning“.
Computer science & statistics idea: the future performance of an algorithm/learning machine (possibly non-explicit) can, and should, be estimated: „model validation“, „model selection“.
Problem types in Machine Learning
Supervised learning: some data are labelled by an expert/oracle.
Task: predict the label from covariates.
Statistical models are usually discriminative.
Examples: regression, classification.
Unsupervised learning: the training data are not pre-labelled.
Task: find „structure“ or „patterns“ in the data.
Statistical models are usually generative.
Examples: clustering, dimension reduction.
Advanced learning tasks
Complications in the labelling
Semi-supervised learning
some training data are labelled, some are not
Reinforcement learning
data are not directly labelled, only indirect gain/loss
Anomaly detection
all or most data are „positive examples“, the task is to flag „test negatives“
Complications through correlated data and/or time
On-line learning
the data is revealed with time, models need to update
Forecasting
each data point has a time stamp, predict the temporal future
Transfer learning
the data comes in dissimilar batches, train and test may be distinct
What is a Learning Machine?
… an algorithm that solves, e.g., the previous tasks.
Illustration: supervised learning machine.
[Diagram: „training data“ (observations) enter model fitting (“learning”), controlled by model tuning parameters; the fitted model is applied to new data and produces predictions, e.g. to base decisions on.]
Examples: generalized linear model, linear regression, support vector machine,
neural networks (= „deep learning“), random forests, gradient boosting, …
Example: Linear Regression
[Diagram: the same workflow with linear regression as the learning machine: „training data“ (observations) are used for model fitting (“learning”); the fitted model produces predictions on new data. Model tuning parameter: fit intercept or not?]
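As a minimal sketch of the slide above, assuming scikit-learn and made-up toy numbers (fit_intercept plays the role of the tuning parameter):

# Linear regression as a learning machine: fit on training data, predict on new data.
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # "training data": observations
y_train = np.array([2.1, 3.9, 6.2, 8.1])
X_new = np.array([[5.0], [6.0]])                   # new data

machine = LinearRegression(fit_intercept=True)     # tuning parameter: fit intercept or not?
machine.fit(X_train, y_train)                      # model fitting ("learning")
y_pred = machine.predict(X_new)                    # predictions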
Model validation: does the model make sense?
[Diagram: the data are split into an „in-sample“ part („training data“) and an „out-of-sample“ „hold-out“ part. The learning machine (prediction strategy, e.g. regression, GLM, or more advanced methods) is fitted on the training data; the learnt model predicts on the „test data“, and the predictions are compared with the „test labels“ („the truth“) to compare & quantify goodness of prediction, e.g. by evaluating the regression model.]
Predictive models need to be validated on unseen data!
The only (general) way to test goodness of prediction is actually observing prediction!
This means that the part of the data used for testing must not have been seen by the algorithm before!
(note: this includes the case where machine = linear regression, deep learning, etc)
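A minimal hold-out sketch in scikit-learn terms (the synthetic data and the 25% split are assumptions for illustration):

# Hold-out validation: the test part is never seen during fitting; prediction goodness
# is quantified out-of-sample. Synthetic data, illustrative only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
machine = LinearRegression().fit(X_train, y_train)      # in-sample fitting
y_pred = machine.predict(X_test)                        # out-of-sample predictions
rmse = mean_squared_error(y_test, y_pred) ** 0.5        # compare & quantify against "the truth"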
„Re-sampling“:
[Diagram: all data are re-sampled into several train/test splits (training data / test data 1, 2, 3). On each split, the predictors (Predictor 1, Predictor 2, Predictor 3) are trained on the training part and applied to the test part; the resulting errors 1, 2, 3 are aggregated for comparison.]
Multiple algorithms are compared on multiple data splits/sub-datasets
State-of-art principle in model validation, model comparison and meta-modelling
type of re-sampling | how to obtain training/test splits | pros/cons

k-fold cross-validation (often: k = 5)
1. divide the data into k (almost) equal parts
2. obtain k train/test splits: each part is the test data exactly once, the rest of the data is the training set
pros/cons: good compromise between runtime and accuracy when k is small compared to the data size

leave-one-out (= [number of data points]-fold c.v.)
pros/cons: very accurate, high run-time

repeated sub-sampling (parameters: training/test size, # of repetitions)
1. obtain a random sub-sample of training/test data of specified sizes
2. repeat 1. the desired number of times
pros/cons: can be arbitrarily quick, can be arbitrarily inaccurate, depending on parameter choice (train/test need not cover all data); can be combined with k-fold
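The three schemes map onto splitter classes in scikit-learn; a minimal sketch with illustrative parameter values:

# The re-sampling schemes above as scikit-learn splitters (parameter values illustrative).
from sklearn.model_selection import KFold, LeaveOneOut, ShuffleSplit

kfold = KFold(n_splits=5, shuffle=True, random_state=0)                # k-fold CV, often k = 5
loo = LeaveOneOut()                                                    # [number of data points]-fold CV
subsample = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)   # repeated sub-sampling

# each splitter yields train/test index pairs, e.g.:
# for train_idx, test_idx in kfold.split(X): ...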
Quantitative model comparison
a „benchmarking experiment“ results in a table like this
model | RMSE       | MAE
?     | 15.3 ± 1.4 | 12.3 ± 1.7
?     |  9.5 ± 0.7 |  7.3 ± 0.9
?     | 13.6 ± 0.9 | 11.4 ± 0.8
?     | 20.1 ± 1.2 | 18.1 ± 1.1
Confidence regions (or paired tests) to compare models to each other:
A is better than B / B is better than A / A and B are equally good
Uninformed model (stupid model/random guess) needs to be included
otherwise a statement „is better than an uninformed guess“ cannot be made.
„useful model“ = (significantly) better than uninformed baseline
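A minimal benchmarking sketch along these lines, assuming scikit-learn; the models, the synthetic data and the 5-fold scheme are illustrative choices, and the uninformed baseline is a DummyRegressor predicting the mean:

# Benchmarking sketch: several models, one uninformed baseline, RMSE/MAE with spread.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X[:, 0] ** 2 + X[:, 1] + 0.3 * rng.normal(size=300)

models = {"linear regression": LinearRegression(),
          "random forest": RandomForestRegressor(random_state=0),
          "uninformed baseline": DummyRegressor(strategy="mean")}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    res = cross_validate(model, X, y, cv=cv,
                         scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error"))
    rmse = -res["test_neg_root_mean_squared_error"]
    mae = -res["test_neg_mean_absolute_error"]
    print(f"{name}: RMSE {rmse.mean():.1f} ± {rmse.std():.1f}, MAE {mae.mean():.1f} ± {mae.std():.1f}")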
Meta-model: automated parameter tuning
Re-sampling is used to determine [best parameter setting]
For validation, new unseen data needs to be used:
[Diagram: all data are split into training data and „real“ test data. Within the training data, tuning train / tuning test splits are used to evaluate candidate parameter settings (Parameters 1, 2, 3) on re-sampled training data, yielding goodness estimates like those in the table above; the model with the best parameters is then fit to all of the training data and validated by predicting & quantifying on the held-out test data. Multi-fold schemes are nested: „splits within splits“ of the whole training data.]
Important caveat: the „inner“ training/test splits
need to be part of any „outer“ training set
otherwise validation is not out-of-sample!
The measure of predictive goodness and the inner re-sampling scheme are themselves „new“ tuning parameters of the meta-method; methods are usually less sensitive to these.
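A minimal sketch of the nested scheme in scikit-learn terms: GridSearchCV performs the inner tuning splits inside each outer training set, cross_validate the outer validation splits (the grid, data and split counts are assumptions):

# Nested re-sampling: inner splits (tuning) live inside each outer training set,
# so the outer evaluation stays out-of-sample. Grid and data are illustrative.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=200)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # tuning train / tuning test
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # training data / "real" test

tuned_model = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=inner_cv)
outer_res = cross_validate(tuned_model, X, y, cv=outer_cv, scoring="neg_root_mean_squared_error")
rmse = -outer_res["test_score"]
print(f"nested-CV RMSE: {rmse.mean():.2f} ± {rmse.std():.2f}")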
Meta-Strategies in ML
„Model tuning“: a model with tuning parameters; the best tuning parameters are determined using a data-driven tuning algorithm.
„Ensemble learning“: a number of (possibly „weak“) models (A, B, C, D) are combined into a „strong“ ensemble model.
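Both meta-strategies are again learning machines that wrap other learning machines; a minimal scikit-learn sketch with illustrative base learners:

# Meta-strategies as wrappers around learning machines (illustrative choices).
from sklearn.ensemble import BaggingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# "Model tuning": wraps a model plus a data-driven tuning algorithm
tuned = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5)

# "Ensemble learning": combines several (possibly weak) models into a stronger one
bagged_trees = BaggingRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=100)
committee = VotingRegressor([("lin", LinearRegression()), ("bag", bagged_trees)])

# all of these expose the same fit/predict interface as any single learning machine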
Object dependencies in the ML workflow
[Diagram: one interesting dataset (all data, „small data“) is re-sampled into multiple train/test splits (1, 2, …, M), on each of which the strategies are compared, most of which are parameter-tuned by the same principle.]
„Typical numbers“: N = 100-100,000 data points; 5-10 outer splits; M = 5-20; 3-5 nested splits; 10-10,000 parameter combinations; 10-1,000 base learners; ensembles add further nesting.
Runtime = 10 × 10 × 5 × 1,000 (× 100) × one run on N samples (usually O(N²) or O(N³)).
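A back-of-the-envelope count of the model fits implied by these factors (reading them as outer splits × strategies × nested splits × parameter combinations, with an optional ensemble factor; this mapping is one reading of the slide):

# Rough count of single model fits implied by the nesting; factor interpretation is one reading.
outer_splits = 10
strategies = 10
nested_splits = 5
parameter_combinations = 1_000
ensemble_members = 100          # optional extra factor

fits = outer_splits * strategies * nested_splits * parameter_combinations
print(fits)                     # 500,000 fits, each one run on N samples, usually O(N^2) or O(N^3)
print(fits * ensemble_members)  # 50,000,000 fits with the ensemble factor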
Machine Learning Toolboxes
An incomplete list of influential toolboxes
scikit-learn is perhaps the most widely used ML toolbox
[Comparison table, columns: Language; Modular API (e.g., methods); GUI; Common models; Model tuning, meta-methods; Model validation and comparison. The toolboxes compared include packages in python (scikit-learn), R (e.g. caret), Java (with 3rd-party wrappers) and a multi-interface toolbox; individual cells read „mostly kernels“, „some“, „few, mostly classifiers“, „few“ and „not entirely“.]
The object-oriented ML Toolbox API
as found in the R/mlr or scikit-learn packages
Leading principles: encapsulation, modularization
A „learning machine“ object (e.g. linear regression) has a modular, object-oriented structure: fit(traindata), predict(testdata), plus metadata & model info.
Abstraction models objects with a unified API:
concept abstracted | public interface | in R/mlr | in sklearn
Learning machines | fitting, predicting, set parameters | Learner | estimator
Re-sampling schemes | sample, apply & get results | ResampleDesc | splitter classes in model_selection
Evaluation metrics | compute from results, tabulate | Measure | metrics classes in metrics
Meta-modelling (tuning, ensembling, pipelining) | wrapping machines by strategy | various wrappers | various wrappers, fused classes, Pipeline
Learning task | benchmark, list strategies/measures | Task | implicit, not encapsulated
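A minimal sketch of these abstractions on the scikit-learn side (the pipeline and splitter choices are illustrative):

# The table's concepts in scikit-learn terms: estimator, splitter, metric, Pipeline.
from sklearn.linear_model import LinearRegression      # learning machine = estimator
from sklearn.metrics import mean_absolute_error        # evaluation metric
from sklearn.model_selection import KFold              # re-sampling scheme = splitter class
from sklearn.pipeline import Pipeline                  # meta-modelling: pipelining
from sklearn.preprocessing import StandardScaler

machine = Pipeline([("scale", StandardScaler()),
                    ("regress", LinearRegression())])  # composite machine, same fit/predict API
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # the learning task itself stays implicit:
# for train_idx, test_idx in cv.split(X):              # plain (X, y) arrays are passed around
#     machine.fit(X[train_idx], y[train_idx])
#     mean_absolute_error(y[test_idx], machine.predict(X[test_idx]))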
HPC for benchmarking/validation today
Scikit-learn: joblib
mlr: parallelMap
[Diagram: the same nested workflow as before: all data („typical number of“ N = 100-100,000 data points, „small data“) re-sampled into train/test splits 1, 2, …, M, with 5-10 outer splits, 3-5 nested splits, 10-10,000 parameter combinations, M = 5-20, and 10-1,000 base learners. At one selected level (one of 1-4), the work is distributed to clusters/cores.]
Plus algorithm-specific HPC interfaces, e.g. deep learning (mutually exclusive)
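In scikit-learn the joblib-based distribution is exposed through the n_jobs parameter; a minimal sketch that parallelizes at the parameter-tuning level (data, grid and split counts are illustrative):

# Parallelism at the tuning and re-sampling levels via joblib's n_jobs, as exposed by scikit-learn.
# Which level to parallelize (outer splits, strategies, parameters, ...) is a design choice.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 20))
y = X[:, 0] + 0.5 * rng.normal(size=1_000)

grid = GridSearchCV(RandomForestRegressor(random_state=0),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
                    cv=3, n_jobs=-1)                    # parameter combinations run in parallel
res = cross_validate(grid, X, y, cv=5, n_jobs=1)        # outer splits could be parallelized instead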
HPC support tomorrow?
[Layered architecture diagram:
Layer 1: DATA (e.g. Hadoop); re-samples, algorithms, parameters 1, 2, …, M, …
Layer 2: scheduler for algorithms and meta-algorithms; data/task pipeline, full graph of dependencies; combining (?) MapReduce, DAAL, dask, joblib -> TBB?
Layer 3: optimized primitives: linear systems, convex optimization, stoch. gradient descent; e.g. MKL, CUDA, BLAS
Layer 4: hardware API, e.g. distributed, multi-core, multi-type/heterogeneous
(image sources: continuum analytics, Intel math kernel library)]
Challenges in ML APIs and HPC
Surprisingly few resources have been invested in ML toolboxes
Most advanced toolboxes are currently open-source & academic
Features that would be desirable to the practitioner
but not available without mid-scale software development:
Integration of (a) data management, (b) exploration and (c) modelling
especially challenging: integration in large-scale scenarios
e.g. MapReduce for divide/conquer over data, model parts, and models
Full HPC integration on granular level for distributed ML benchmarking
making full use of parallelism for nesting and computational redundancies
complete HPC architecture for whole model benchmarking workflow
Non-standard modelling tasks, structured data (incl time series)
data heterogeneity, multiple datasets, time series, spatial features, images, etc.
forecasting, on-line learning, anomaly detection, change point detection
meta-modelling and re-sampling for these is an order of magnitude more costly