Advanced Methodologies for
Predictive Data-Analytic Modeling
Vladimir Cherkassky
Electrical & Computer Engineering
University of Minnesota – Twin Cities
Presented at Chicago Chapter ASA, May 6, 2016
1
Part 1: Motivation & Background
• Background
- Big Data and Scientific Discovery
- Philosophical Connections
- Modeling Complex Systems
• Two Data-Analytic Methodologies
• Basics of VC-theory
• Summary
2
Growth of (biological) Data
from http://www.dna.affrc.go.jp/growth/D-daily.html
3
Practical and Societal Implications
• Personalized medicine
• Genetic Testing: already available at $300-1K
4
Typical Applications
• Genomics
• Medical imaging (e.g., sMRI, fMRI)
• Financial
• Process Control
• Marketing ……
• Sparse High-Dimensional Data
number of samples n << d number of features
• Complex systems
underlying first-principle mechanism is unknown
• Ill-posed nature of such problems
→ only approximate non-deterministic models
5
What is Big Data?
• Traditional IT infrastructure
Data storage, access, connectivity etc.
• Making sense of / acting on this data
Data → Knowledge → Decision making
always predictive by nature
• Focus of my presentation
- Methodological aspects of data-analytic
knowledge discovery
6
Scientific Discovery
• Combines ideas/models and facts/data
• First-principle knowledge:
hypothesis → experiment → theory
~ deterministic, causal, intelligible models
• Modern data-driven discovery:
s/w program + DATA → knowledge
~ statistical, complex systems
• Two different philosophies
7
History of Scientific Knowledge
• Ancient Greece:
Logic + deductive reasoning
• Middle Ages: Deductive (scholasticism)
• Renaissance, Enlightenment:
(1) First-Principles (Laws of Nature)
(2) Experimental science (empirical data)
Combining (1) + (2) → problem of induction
• Digital Age: the problem of induction attains
practical importance in many fields
8
Induction and Predictive Learning
Induction: aka inductive step, standard inductive inference
Deduction: aka Prediction
9
Problem of Induction in Philosophy
• Francis Bacon: advocated empirical (inductive)
knowledge vs scholastic knowledge
• David Hume: What right do we have to
assume that the future will be like the past?
• Philosophy of Science tries to resolve this
dilemma/contradiction between deterministic
logic and uncertain nature of empirical data.
• Digital Age: growth of empirical data, and
this dilemma becomes important in practice.
10
Cultural and Psychological Aspects
• All men by nature desire knowledge
• Man has an intense desire for assured
knowledge
• Assured Knowledge ~ belief in
- religion (much of human history)
- reason (causal determinism)
- science / pseudoscience
- data-analytic models (~ Big Data)
- genetic risk factors …
11
Knowledge Discovery in Digital Age
• Most information in the form of digital data
• Can we get assured knowledge from data?
• Big Data ~ technological nirvana
data + connectivity → more knowledge
Wired Magazine, 16/07: We can stop looking for (scientific)
models. We can analyze the data without hypotheses
about what it might show. We can throw the numbers into
the biggest computing clusters the world has ever seen and
let statistical algorithms find patterns where science cannot.
12
• More examples …
Duke biologists discovered an unusual link between
the popular singer (Lady Gaga) and a new species of fern, i.e.
- bisexual reproductive stage of the ferns;
- the team found the sequence GAGA when analyzing the
fern’s DNA base pairs
13
Real Data Mining: Kepler’s Laws
• How do planets move among the stars?
- Ptolemaic system (geocentric)
- Copernican system (heliocentric)
• Tycho Brahe (16th century)
- measured positions of the planets in the sky
- use experimental data to support one’s
view (hypothesis)
• Johannes Kepler
- used volumes of Tycho’s data to discover
three remarkably simple laws
14
Kepler’s Laws vs. ‘Lady Gaga’ knowledge
• Both search for assured knowledge
• Kepler’s Laws
- well-defined hypothesis stated a priori
- prediction capability
- human intelligence
• Lady Gaga knowledge
- no hypothesis stated a priori
- no prediction capability
- computer intelligence (software program)
- popular appeal (to widest audience)
15
Lessons from Natural Sciences
• Prediction capability
Prediction is hard. Especially about the future.
• Empirical validation/repeatable events
• Limitations (of scientific knowledge)
• Important to ask the right question
- Science starts from problems, and not from
observations (K. Popper)
- What we observe is not nature itself, but nature
exposed to our method of questioning
(W. Heisenberg)
16
Limitations of Scientific Method
When the number of factors coming
into play in a phenomenological
complex is too large, scientific
method in most cases fails us.
We are going to be shifting the mix
of our tools as we try to land the
ship in a smooth way onto the
aircraft carrier.
Recall: the Ancient Greeks scorned ‘predictability’
17
Important Differences
Albert Einstein:
• It might appear that there are no methodological
differences between astronomy and economics: scientists
in both fields attempt to discover general laws for a group
of phenomena. But in reality such differences do exist.
• The discovery of general laws in economics is difficult
because observed economic phenomena are often affected
by many factors that are very hard to evaluate separately.
• The experience which has accumulated during the civilized
period of human history has been largely influenced by
causes which are not economic in nature.
18
Flexible Data Modeling Approaches
• Late 1980’s: Artificial Neural Networks
• Mid 1990’s: Data Mining
• Late 1990’s: Support Vector Machines
• Mid 2000’s: Deep Learning (reincarnated NNs)
• Early 2010’s: Big Data
NOTE 1: no clear boundary between science and marketing
NOTE 2: fragmentation and ‘soft’ plagiarism
19
Methodologies for Data Modeling
• The field of Pattern Recognition is concerned with the
automatic discovery of regularities in data.
• Data Mining is the process of automatically discovering
useful information in large data repositories.
• This book (on Statistical Learning) is about learning
from data.
• The field of Machine Learning is concerned with the
question of how to construct computer programs that
automatically improve with experience.
• Artificial Neural Networks perform useful computations
through the process of learning.

(1) focus on algorithms/ computational procedures
(2) all fields estimate useful models from data, i.e. extract
knowledge from data (the same as in classical statistics)
Real Issues: what is ‘useful’? What is ‘knowledge’?
20
What is ‘a good model’?
• All models are mental constructs that
(hopefully) relate to real world
• Two goals of data-analytic modeling:
- explanation (of past/ available data)
- prediction (of future data)
• All good models make non-trivial predictions
• Good data-driven models can predict well,
so the goal is to estimate predictive models
aka generalization, inductive inference
→ Importance of methodology/assumptions
21
The Role of Statistics
• Dilemma: Mathematical or natural science?
• Traditionally, heavy emphasis on
parametric modeling and math proofs
• Conservative attitude: slow acceptance of
modern computational approaches
• Under-appreciation of predictive modeling
William Edwards Deming: The only useful function of
a statistician is to make predictions, and thus to
provide a basis for action.
22
BROADER QUESTIONS
• Can we trust models derived from data?
• What is scientific knowledge?
• First-principle knowledge vs empirical
knowledge vs beliefs
• Understanding uncertainty and risk
• Historical view: how explosive growth of
data-driven knowledge changes human
perception of uncertainty
23
Scientific Understanding of Uncertainty
• Very recent: most probability theory and
statistics developed in the past 100 years.
Most applications in the last 50-60 years
• Dominant approach in classical science is
causal determinism, i.e. the goal is to
estimate the true model (or cause)
~ system identification
• Classical statistics: the goal is to estimate
probabilistic model underlying the data, i.e.
system identification
24
Scientific Understanding (cont’d)
• Albert Einstein:
The scientist is possessed by the sense of
universal causality. The future, to him, is every
whit as necessary and determined as the past.
• Albert Einstein:
God does not play dice
• Stephen Hawking:
God not only plays dice. He sometimes throws
the dice where they cannot be seen
25
Modeling Complex Systems
• First-principle scientific knowledge:
- deterministic
- simple models (~ few main concepts)
• This knowledge has been used to design
complex systems: computers, airplanes etc.
• It has not been successful for modeling and
understanding complex systems:
- weather prediction/ climate modeling
- human brain
- stock market etc.
26
Modeling Complex Systems
• A. Einstein:
When the number of factors coming into play in a
phenomenological complex is too large, scientific
method in most cases fails us. One need only think
of the weather, in which case prediction even for a
few days is impossible… Occurrences in this domain
are beyond the reach of exact prediction because
of the variety of factors in operation, not because of
any lack of order in nature.
27
How to Model Complex Systems ?
• Conjecture 1
first-principle / system identification
approach cannot be used
• Conjecture 2
system imitation approach, i.e. modeling
certain aspects of a system, may be used
→ statistical models
Examples: stock trading, medical diagnosis
28
Three Types of Knowledge
• Growing role of empirical knowledge
• Classical philosophy of science
differentiates only between (first-principle)
science and beliefs (demarcation problem)
• Importance of demarcation between empirical
knowledge and beliefs in modern applications
29
Beliefs vs Scientific Theories
Men have lower life expectancy than women
• Because they choose to do so
• Because they make more money (on
average) and experience higher stress
managing it
• Because they engage in risky activities
• Because …..
Demarcation problem in philosophy
30
Popper’s Demarcation Principle
• First-principle scientific theories vs. beliefs
or metaphysical theories
• Risky prediction, testability, falsifiability
Karl Popper: Every true
(inductive) theory prohibits
certain events or occurrences,
i.e. it should be falsifiable
31
Popper’s conditions
for scientific hypothesis
- Should be testable
- Should be falsifiable
Example 1: Efficient Market Hypothesis(EMH)
The prices of securities reflect all known
information that impacts their value
Example 2: We do not see our noses,
because they all live on the Moon
32
Observations, Reality and Mind
Philosophy is concerned with the relationship between
- Reality (Nature)
- Sensory Perceptions
- Mental Constructs (interpretations of reality)
Three Philosophical Schools
• REALISM:
- objective physical reality perceived via senses
- mental constructs reflect objective reality
• IDEALISM:
- primary role belongs to ideas (mental constructs)
- physical reality is a by-product of Mind
• INSTRUMENTALISM:
- the goal of science is to produce useful theories
Which one should be adopted (by scientists + engineers)?
33
Three Philosophical Schools
• Realism (materialism)
• Idealism
• Instrumentalism
34
Application Example:
predicting gender of face images
• Training data: labeled face images
[Figure: example face images labeled Male and Female]
35
Predicting Gender of Face Images
• Input ~ 16x16 pixel image
• Model ~ indicator function f(x) separating
256-dimensional pixel space in two halves
• Model should predict new images well
• Difficult machine learning problem, but easy
for human recognition
36
Two Philosophical Views (Vapnik, 2006)
• System Identification (~ Realism)
- estimate probabilistic model (of true class
densities) from available data
- this view is adopted in classical statistics
• System Imitation (~ Instrumentalism)
- need only to predict well i.e. imitate
specific aspect of unknown system;
- multiplicity of good models;
- can they be interpreted and/or trusted?
37
OUTLINE
• Background
• Two Data-Analytic Methodologies
- inductive inference step
- two approaches to statistical inference
- advantages of predictive approach
- Example: market timing of mutual funds
• Basics of VC-theory
• Summary
38
Statistical vs Predictive Modeling
[Diagram with boxes: EMPIRICAL DATA; KNOWLEDGE, ASSUMPTIONS; STATISTICAL INFERENCE; PROBABILISTIC MODELING; PREDICTIVE APPROACH]
39
Inductive Inference Step
• Inductive inference step:
Data → model ~ ‘uncertain inference’
• Is it possible to make uncertain inferences
mathematically rigorous? (Fisher 1935)
• Many types of ‘uncertain inferences’
- hypothesis testing
- maximum likelihood
- risk minimization ….
→ each comes with its own methodology/assumptions
40
Two Data-Analytic Methodologies
• Many existing data-analytic methods but
lack of methodological assumptions
• Two theoretical developments
- classical statistics ~ mid 20th century
- Vapnik-Chervonenkis (VC) theory ~ 1970’s
• Two related technological advances
- applied statistics (R. Fisher)
- machine learning, neural nets, data mining etc.
41
Statistical vs Predictive Approach
• Binary Classification problem
estimate decision boundary from training data $(\mathbf{x}_i, y_i)$
Assuming class distributions P(x,y) were known:
[Figure: plot in (x1,x2) space; axes x1 and x2]
42
Classical Statistical Approach: Realism
(1) parametric form of unknown distribution P(x,y) is known
(2) estimate parameters of P(x,y) from the training data
(3) construct decision boundary using estimated distribution
and given misclassification costs
Modeling assumption:
Unknown P(x,y) can be accurately estimated from
available data
[Figure: estimated boundary in (x1,x2) space; axes x1 and x2]
43
Critique of Statistical Approach (Leo Breiman)
• The Belief that a statistician can invent a
reasonably good parametric class of models
for a complex mechanism devised by nature
• Then parameters are estimated and
conclusions are drawn
• But conclusions are about
- the model’s mechanism
- not about nature’s mechanism
• Many modern data-analytic sciences
(economics, life sciences) have similar flaws
44
Predictive Approach: Instrumentalism
(1) parametric form of decision boundary f(x,w) is given
(2) explain available data via fitting f(x,w), or minimization of
some loss function (e.g., squared error)
(3) the function f(x,w*) providing smallest fitting error is then
used for prediction
Modeling assumptions
- Need to specify f(x,w) and loss function a priori.
- No need to estimate P(x,y)
[Figure: estimated boundary in (x1,x2) space; axes x1 and x2]
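A minimal sketch contrasting the two approaches on synthetic data. Everything here is an assumption made for illustration (two Gaussian classes, equal priors and costs, a linear boundary, NumPy as the tool); it is not the data or the code behind the slides. The statistical route estimates class means and a pooled covariance and derives the boundary from them; the predictive route fits a linear f(x,w) directly under squared loss.

```python
import numpy as np

rng = np.random.default_rng(1)                 # fixed seed, illustrative only
n = 100                                        # samples per class (assumed)
X0 = rng.normal([0.0, 0.0], 1.0, size=(n, 2))  # class 0, synthetic
X1 = rng.normal([3.0, 3.0], 1.0, size=(n, 2))  # class 1, synthetic
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

# Statistical approach: estimate the parameters of P(x, y) first,
# then derive the decision boundary from the estimated densities.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
cov = (np.cov(X0.T) + np.cov(X1.T)) / 2.0      # pooled covariance estimate
w_stat = np.linalg.solve(cov, m1 - m0)
b_stat = -0.5 * w_stat @ (m0 + m1)             # equal priors and costs assumed
pred_stat = (X @ w_stat + b_stat > 0).astype(float)

# Predictive approach: fit f(x, w) directly by minimizing squared loss
# against +1/-1 targets; P(x, y) is never estimated.
A = np.c_[X, np.ones(2 * n)]                   # columns: x1, x2, bias
w_pred, *_ = np.linalg.lstsq(A, 2 * y - 1, rcond=None)
pred_fit = (A @ w_pred > 0).astype(float)

err_stat = float(np.mean(pred_stat != y))      # training error, statistical
err_fit = float(np.mean(pred_fit != y))        # training error, predictive
print(err_stat, err_fit)
```

On easy low-dimensional data like this both routes give similar boundaries; the slides' argument concerns settings where P(x,y) cannot be estimated reliably.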
45
Classification with High-Dimensional Data
• Digit recognition 5 vs 8:
each example ~ 16 x 16 pixel image
→ 256-dimensional vector x
Medical Interpretation
- Each pixel ~ genetic marker
- Each patient (sample) described by 256 genetic markers
- Two classes ~ presence/ absence of a disease
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of decision boundary in 256-dim.
space is possible, using just a few hundred samples
46
Common Modeling Assumptions
• Future is similar to Past
- training and test data from the same distribution
- i.i.d. training data
- large test set
• Prediction accuracy ~ given loss function
- misclassification costs (classification problems)
- squared loss (regression problems)
- etc.
• Proper formalization ~ type of learning problem
e.g., classification is used in many applications
47
Importance of Complexity Control
Regression estimation for known parameterization
• Ten training samples
$y = x^2 + N(0, \sigma^2)$, where $\sigma^2 = 0.25$
• Fitting linear and 2nd-order polynomial:
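The experiment can be sketched as follows. The input range [0, 1], the random seed, and the use of NumPy's `polyfit` are assumptions not stated on the slide; the target function and noise variance follow the slide. The sketch reports the training error of each fit and its error against the noise-free target, so the effect of model complexity on prediction can be compared.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
n = 10                                  # ten training samples
x = rng.uniform(0.0, 1.0, n)            # input range [0, 1] is an assumption
y = x**2 + rng.normal(0.0, 0.5, n)      # sigma^2 = 0.25, so sigma = 0.5

def fit_and_score(degree):
    """Least-squares polynomial fit; returns (coefficients, training MSE)."""
    coeffs = np.polyfit(x, y, degree)
    mse = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    return coeffs, mse

x_test = np.linspace(0.0, 1.0, 1001)    # dense grid for prediction error

def prediction_mse(degree):
    """Mean squared error against the noise-free target x^2."""
    coeffs, _ = fit_and_score(degree)
    return float(np.mean((np.polyval(coeffs, x_test) - x_test**2) ** 2))

_, mse_linear = fit_and_score(1)
_, mse_quadratic = fit_and_score(2)
print("training MSE:  ", mse_linear, mse_quadratic)
print("prediction MSE:", prediction_mse(1), prediction_mse(2))
```

The quadratic fit never has larger training error than the linear one, because the models are nested; which of the two predicts better depends on the particular noisy sample.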
48
Statistical vs Predictive: issues
Predictive approach
- estimates certain properties of unknown P(x,y)
that are useful for predicting y
- has solid theoretical foundations (VC-theory)
- successfully used in many apps
BUT its methodology + concepts are different from
classical statistics:
- formalization of the learning problem (~ requires
understanding of application domain)
- a priori specification of a loss function
- interpretation of predictive models may be hard
- multiplicity of models estimated from the same data
49
Predictive Methodology (VC-theory)
• Method of questioning is
- the learning problem setting (inductive step)
- driven by application requirements
• Standard inductive learning commonly used
(may not be the best choice)
• Good generalization depends on two factors
- (small) training error
- small VC-dimension ~ large ‘falsifiability’
50
Timing of International Funds
• International mutual funds
- priced at 4 pm EST (New York time)
- reflect price of foreign securities traded at
European/ Asian markets
- Foreign markets close earlier than US market
→ Possibility of inefficient pricing.
Market timing exploits this inefficiency.
• Scandals in the mutual fund industry ~2002
• Solution adopted: restrictions on trading
51
Binary Classification Setting
• TWIEX ~ American Century Int’l Growth
• Input indicators (for trading) ~ today
- SP 500 index (daily % change) ~ x1
- Euro-to-dollar exchange rate (% change) ~ x2
• Output: TWIEX NAV (% change) ~ y next day
Trading rule: D(x) = 0 ~ Sell, D(x) = 1 ~ Buy
• Model parameterization (fixed):
- linear $g(\mathbf{x}, \mathbf{w}) = w_1 x_1 + w_2 x_2 + w_0$
- quadratic $g(\mathbf{x}, \mathbf{w}) = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + w_0$
• Decision rule (estimated from training data):
$D(\mathbf{x}) = \mathrm{Ind}(g(\mathbf{x}, \mathbf{w}^*))$ → Buy/Sell decision (+1 / 0)
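The two parameterizations and the decision rule can be written down directly. In this sketch the indicator is taken as Ind(g) = 1 when g > 0 (an assumption about the threshold), and the weights and input values are hypothetical, chosen only to show the mechanics.

```python
def g_linear(x, w):
    """Linear model; w = (w0, w1, w2), x = (x1, x2)."""
    x1, x2 = x
    w0, w1, w2 = w
    return w1 * x1 + w2 * x2 + w0

def g_quadratic(x, w):
    """Quadratic model; w = (w0, w1, w2, w3, w4, w5), x = (x1, x2)."""
    x1, x2 = x
    w0, w1, w2, w3, w4, w5 = w
    return w1 * x1 + w2 * x2 + w3 * x1**2 + w4 * x2**2 + w5 * x1 * x2 + w0

def decide(g, x, w):
    """D(x) = Ind(g(x, w)): 1 means Buy, 0 means Sell (threshold at 0 assumed)."""
    return 1 if g(x, w) > 0 else 0

w_demo = (0.0, 1.0, 1.0)    # hypothetical weights, not estimated from data
today = (0.4, -0.1)         # made-up (SP 500 % change, EUR/USD % change)
print(decide(g_linear, today, w_demo))
```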
52
Methodological Assumptions
• When can a trained model predict well?
(1) Future/test data is similar to training data
e.g., use 2004 period for training, and 2005 for testing
(2) Estimated model is ‘simple’ and provides good
performance during training period
i.e., the trading strategy is consistently better than buy-and-hold during training period
• Loss function (to measure performance):
$L(\mathbf{x}, y) = D(\mathbf{x})\, y$ where $D(\mathbf{x}) = \mathrm{Ind}(g(\mathbf{x}, \mathbf{w}^*))$
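A small sketch of how this performance measure accumulates over a period. It assumes a simple (non-compounded) running sum of D(x)·y in percent; the daily changes and decisions are made-up numbers, not the TWIEX data.

```python
def cumulative_gain(decisions, next_day_changes):
    """Running sum of D(x) * y in percent (simple sum, no compounding)."""
    total, curve = 0.0, []
    for d, y in zip(decisions, next_day_changes):
        total += d * y          # gain is y when invested (D=1), 0 otherwise
        curve.append(total)
    return curve

y = [0.5, -1.0, 0.25, -0.5, 1.0]    # made-up next-day NAV changes (%)
D = [1, 0, 1, 0, 1]                 # made-up Buy (1) / Sell (0) decisions

trading = cumulative_gain(D, y)
buy_and_hold = cumulative_gain([1] * len(y), y)   # always invested
print(trading[-1], buy_and_hold[-1])
```

Buy-and-hold corresponds to D(x) = 1 on every day, which is why it serves as the natural baseline for the trading strategy.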
53
Empirical Results: 2004 -2005 data
Linear model
[Figure, panel "Training data 2004": SP500 (%) vs EURUSD (%)]
[Figure, panel "Training period 2004": Cumulative Gain/Loss (%) vs Days, Trading vs Buy and Hold]
→ can expect good performance with test data
54
Empirical Results: 2004 -2005 data
Linear model
[Figure, panel "Test data 2005": SP500 (%) vs EURUSD (%)]
[Figure, panel "Test period 2005": Cumulative Gain/Loss (%) vs Days, Trading vs Buy and Hold]
confirmed good prediction performance
55
Empirical Results: 2004 -2005 data
Quadratic model
[Figure, panel "Training data 2004": SP500 (%) vs EURUSD (%)]
[Figure, panel "Training period 2004": Cumulative Gain/Loss (%) vs Days, Trading vs Buy and Hold]
→ can expect good performance with test data
56
Empirical Results: 2004 -2005 data
Quadratic model
[Figure, panel "Test data 2005": SP500 (%) vs EURUSD (%)]
[Figure, panel "Test period 2005": Cumulative Gain/Loss (%) vs Days, Trading vs Buy and Hold]
confirmed good test performance
57
Interpretation vs Prediction
• Two good trading strategies estimated from
2004 training data
[Figure: two panels, each plotting SP500 (%) vs EURUSD (%)]
• Both models predict well for test period 2005
• Which one is ‘true’?
58
DISCUSSION
• Can this trading strategy be used now?
- NO, this market timing strategy has been
ineffective since about 2008. The reason is
changing statistical characteristics of the market
- YES, it can be used occasionally.
• Hypocrisy of the mutual fund industry
Story 1: markets are very efficient, so individual
investors cannot trade successfully and outperform
the market indices (such as SP500)
Story 2: market timing is harmful for mutual funds,
so such abusive trading activity should be banned
Story 3: restrictions also apply to domestic funds
59
Interpretation of Predictive Models
• Humans cannot provide interpretation
even if they can make good prediction
Each input ~ 28 x 28 pixel image → 784-dimensional input x
• Interpretation of black-box models
Not unique/ subjective
Depends on chosen parameterization (method)
60
Interpretation of SVM models
How to interpret high-dimensional models?
(say, SVM model)
Strategy 1: dimensionality reduction/feature selection
→ prediction accuracy usually suffers
Strategy 2: approximate SVM model via a set of
rules (using rule induction, decision tree etc.)
→ does not scale well for high-dim. models
61
OUTLINE
• Background
• Two Data-Analytic Methodologies:
- Classical statistics vs predictive learning
• Basics of VC-theory
- History and Overview
- Inductive problem setting
- Conditions for consistency of ERM
- Generalization Bounds and SRM
• Summary
62
History and Overview
• SLT aka VC-theory (Vapnik-Chervonenkis)
• Theory for estimating dependencies from finite
samples (predictive learning setting)
• Based on the risk minimization approach
• All main results originally developed in 1970’s
for classification (pattern recognition) – why?
but remained largely unknown
• Recent renewed interest due to practical
success of Support Vector Machines (SVM)
63
History and Overview(cont’d)
MAIN CONCEPTUAL CONTRIBUTIONS
• Distinction between the problem setting,
inductive principle and learning algorithms
• Direct approach to estimation with finite data
(KID principle)
• Math analysis of Empirical Risk Minimization
• Two factors responsible for generalization:
- empirical risk (fitting error)
- complexity(capacity) of approximating functions
64
Inductive Learning: problem setting
• The learning machine observes samples (x, y), and
returns an estimated response $\hat{y} = f(\mathbf{x}, w)$
• Two modes of inference: identification vs imitation
• Risk $= \int \mathrm{Loss}(y, f(\mathbf{x}, w))\, dP(\mathbf{x}, y) \to \min$
65
The Problem of Inductive Learning
• Given: finite training samples Z={(xi, yi),i=1,2,…n}
choose from a given set of functions f(x, w) the
one that approximates best the true output. (in the
sense of risk minimization)
Concepts and Terminology
• approximating functions f(x, w)
• (non-negative) loss function L(f(x, w),y)
• expected risk functional R(Z,w)
Goal: find the function f(x, wo) minimizing R(Z,w)
when the joint distribution P(x,y) is unknown.
66
Empirical Risk Minimization
• ERM principle in model-based learning
– Model parameterization: f(x, w)
– Loss function: L(f(x, w),y)
– Estimate risk from data: $R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i, w), y_i)$
– Choose w* that minimizes Remp
• Statistical Learning Theory developed
from the theoretical analysis of ERM
principle under finite sample settings
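The ERM steps listed above can be sketched in a few lines. The one-parameter threshold model, the 0/1 loss, the candidate grid, and the six samples are all hypothetical choices made for illustration; only the structure (parameterization, loss, empirical risk, minimization over w) follows the slide.

```python
def f(x, w):
    """Hypothetical one-parameter model: threshold indicator."""
    return 1 if x > w else 0

def loss(prediction, y):
    """0/1 loss for classification."""
    return 0 if prediction == y else 1

def empirical_risk(w, samples):
    """R_emp(w): average loss over the training samples."""
    return sum(loss(f(x, w), y) for x, y in samples) / len(samples)

# Made-up training samples (x_i, y_i); not separable by any threshold.
samples = [(0.15, 0), (0.35, 0), (0.45, 1), (0.55, 0), (0.75, 1), (0.95, 1)]

# ERM over a coarse grid of candidate parameters (grid is an assumption).
candidates = [i / 10 for i in range(11)]
w_star = min(candidates, key=lambda w: empirical_risk(w, samples))
print(w_star, empirical_risk(w_star, samples))
```

The minimizer is not unique here: several thresholds reach the same empirical risk, and `min` returns the first one it encounters.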
67
Consistency/Convergence of ERM
• Empirical Risk known but Expected Risk unknown
• Asymptotic consistency requirement:
under what (general) conditions models providing
min Empirical Risk will also provide min Prediction
Risk, when the number of samples grows large?
68
Consistency of ERM
• Necessary & sufficient condition: The set of
possible models f(x, w) has limited ability to fit
(explain) finite number of samples
~ VC-dimension of this set of functions is finite
• Generalization (prediction) is possible only if a
set of models has limited complexity (VC-dim)
• VC-dimension
- measures the ability (of a set of functions) to fit
or ‘explain’ available finite data.
- similar to DoF for linear parameterization, but
different for nonlinear
69
• VC-dimension of a set of indicator functions:
- Shattering: if n samples can be separated by a set of
indicator functions in all $2^n$ possible ways, then these
samples can be shattered by this set of functions.
- A set of functions has VC-dimension h if there exist h
samples that can be shattered by this set of functions, but
there do not exist h+1 samples that can be shattered.
• Example: VC-dimension of linear indicator functions (d=2)
[Figure: two panels with sample points (*) in the (Z1, Z2) plane]
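The shattering definition can be checked by brute force for small point sets in the plane. This sketch is an approximation under stated assumptions: separability is tested by scanning 360 candidate directions rather than solving exactly, and the two point sets are hypothetical examples. It shows that one particular set of three points is shattered by linear indicator functions and that one particular set of four points is not; it does not prove that no four points can be shattered.

```python
import math
from itertools import product

def separable(points, labels, n_angles=360):
    """Approximate test: does some direction w put all class-1
    projections strictly above all class-0 projections?"""
    pos = [p for p, l in zip(points, labels) if l == 1]
    neg = [p for p, l in zip(points, labels) if l == 0]
    if not pos or not neg:
        return True                      # one class empty: trivially separable
    for k in range(n_angles):
        a = 2 * math.pi * k / n_angles
        w = (math.cos(a), math.sin(a))
        proj = lambda p: w[0] * p[0] + w[1] * p[1]
        if min(map(proj, pos)) > max(map(proj, neg)) + 1e-9:
            return True
    return False

def shattered(points):
    """True if every one of the 2^n labelings is linearly separable."""
    return all(separable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

three = [(0, 0), (1, 0), (0, 1)]             # three points, not collinear
square = [(0, 0), (1, 0), (0, 1), (1, 1)]    # corners of a square
print(shattered(three), shattered(square))
```

The square fails on the labeling that puts opposite corners in the same class, which no straight line can separate.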
70
• VC-dimension of a set of linear hyperplanes
is h=d+1
• VC-dimension of linear slab or delta-margin
hyperplanes is controlled by the width (delta)
71
• Example: VC-dimension of a linear combination of fixed
basis functions (i.e. polynomials, Fourier expansion etc.)
Assuming that basis functions are linearly independent,
the VC-dim equals the number of basis functions.
• Counter-example: single parameter but infinite VC-dimension.
72
Generalization Bounds
• Bounds for learning machines (implementing
ERM) evaluate the difference between (unknown) risk and
known empirical risk, as a function of sample size n and
the properties of the loss functions (approximating functions).
• Classification: the following bound holds with
probability of $1 - \eta$ for all approximating functions
$R(\omega) \le R_{emp}(\omega) + \Phi(R_{emp}(\omega), n/h, \ln\eta / n)$
where $\Phi$ is called the confidence interval
• Regression: the following bound holds with probability
of $1 - \eta$ for all approximating functions
$R(\omega) \le R_{emp}(\omega) / (1 - c\sqrt{\varepsilon})_+$
73
Structural Risk Minimization
• Analysis of generalization bounds
$R(\omega) \le R_{emp}(\omega) + \Phi(R_{emp}(\omega), n/h, \ln\eta / n)$
suggests that when n/h is large, the term $\Phi$ is small
→ $R(\omega) \sim R_{emp}(\omega)$
This leads to parametric modeling approach (ERM)
• When n/h is not large (say, less than 20), both terms in the
right-hand side of VC-bound need to be minimized
→ make the VC-dimension a controlling variable
• SRM = formal mechanism for controlling model complexity
Set of loss functions has nested structure $S_k = \{L(z, \omega), \omega \in \Omega_k\}$
$S_1 \subset S_2 \subset \dots \subset S_k \subset \dots$ such that $h_1 \le h_2 \le \dots \le h_k \le \dots$
→ structure ~ complexity ordering
74
Model Complexity Control via SRM
• An upper bound on the true risk and the empirical risk, as
a function of VC-dimension h (for fixed sample size n)
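A minimal sketch of complexity control over a nested structure of polynomial models. Several things here are assumptions rather than content of the slides: the specific penalization factor (a practical form of the VC regression bound reported in the VC literature), the synthetic data, the range of degrees, and the use of NumPy. The VC-dimension of each element is taken as the number of polynomial coefficients, in line with the earlier statement about linear combinations of basis functions.

```python
import numpy as np

def penalization(h, n):
    """Assumed practical VC factor:
    1 / (1 - sqrt(p - p*ln(p) + ln(n) / (2n)))_+ with p = h / n."""
    p = h / n
    inner = 1.0 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
    return 1.0 / inner if inner > 0 else np.inf

rng = np.random.default_rng(0)           # fixed seed, illustrative only
n = 30                                   # sample size (assumed)
x = rng.uniform(0.0, 1.0, n)
y = x**2 + rng.normal(0.0, 0.5, n)       # synthetic noisy target

bounds = {}
for degree in range(0, 6):               # nested structure S_1, S_2, ...
    h = degree + 1                       # VC-dim = number of basis functions
    coeffs = np.polyfit(x, y, degree)
    r_emp = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    bounds[degree] = r_emp * penalization(h, n)   # estimated bound on risk

best_degree = min(bounds, key=bounds.get)
print(best_degree, bounds[best_degree])
```

Empirical risk can only decrease as the degree grows, while the penalization factor increases with h/n; the selected model is the one that minimizes their product.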
75
VC Approach to Predictive Learning
Goals of Predictive Learning
- explain (or fit) available training data
- predict well future (yet unobserved) data
(similar to biological learning)
Main Practical Result of VC-theory:
If a model explains well past data AND
is simple, then it can generalize (predict)
76
VC Approach to High-Dimensional Data
Strategy for modeling high-dimensional data:
Find a model f(x) that explains past data AND
has low VC-dimension, even when dim. is large
SVM approach
Large margin =
Low VC-dimension
~ easy to falsify
77
OUTLINE
• Background
• Two Data-Analytic Methodologies:
- Classical statistics vs predictive learning
• Basics of VC-theory
• Summary
78
SUMMARY (A)
• Predictive Data-Analytic Modeling:
usually on the boundary between trivial and impossible
• Asking the right question ~ problem setting
- depends on modeler’s creativity/ intelligence
- requires application domain knowledge
- cannot be formalized
• Modeling Assumptions (not just algorithm)
• Interpretation of black-box models
- very difficult (requires domain knowledge)
- multiplicity of ‘good’ models
79
SUMMARY (B)
• Common misconception:
data-driven models are intrinsically objective
• Explanation bias (favors simplicity+causality)
- for psychological + cultural reasons
• Confirmation bias (only positive findings),
encouraged by funding agencies
• When all these human biases are
incorporated into data-analytic modeling:
- many ‘interesting’ discoveries
- little objective value
- no real predictive value
80
References
• V. Vapnik, Estimation of Dependencies Based on Empirical
Data. Empirical Inference Science: Afterword of 2006, Springer
• L. Breiman, Statistical Modeling: the Two Cultures, Statistical
Science, vol. 16(3), pp. 199-231, 2001
• A. Einstein, Ideas and Opinions, Bonanza Books, NY 1954
• V. Cherkassky and F. Mulier, Learning from Data, second
edition, Wiley, 2007
• V. Cherkassky and S. Dhar, Market timing of international
mutual funds: a decade after the scandal, Proc. CIFEr 2012
81