From Big Data to Little Knowledge
Vladimir Cherkassky
University of Minnesota
Presented at CodeFreeze, Jan 16, 2014
Electrical and Computer Engineering
1
Motivation: What is Big Data?
• Traditional IT infrastructure
Data storage, access, connectivity etc.
• Making sense / acting on this data
Data → Knowledge → Decision making
always predictive by nature
• Objectives of my talk
- Hype vs. Reality
- Methodological aspects of data-analytic knowledge discovery
2
Scientific Discovery
• Combines ideas/models and facts/data
• First-principle knowledge:
hypothesis → experiment → theory
~ deterministic, causal, intelligible models
• Modern data-driven discovery:
s/w program + DATA → knowledge
~ statistical, complex systems
• Two different philosophies
3
History of Scientific Knowledge
• Ancient Greece: Logic + deductive reasoning
• Middle Ages: Deductive (scholasticism)
• Renaissance, Enlightenment:
(1) First-Principles (Laws of Nature)
(2) Experimental science (empirical data)
Combining (1) + (2) → problem of induction
• Digital Age: the problem of induction attains practical importance in many fields
4
Induction and Predictive Learning
Induction:
aka inductive step,
generalization etc.
Deduction:
aka Prediction
5
Problem of Induction in Philosophy
• Francis Bacon: advocated empirical knowledge (inductive) vs scholastic
• David Hume: What right do we have to assume that the future will be like the past?
• Philosophy of Science tries to resolve this dilemma/contradiction between deterministic logic and uncertain nature of empirical data.
• Digital Age: growth of empirical data, and this dilemma becomes important in practice.
6
Cultural and Psychological Aspects
• All men by nature desire knowledge
• Man has an intense desire for assured knowledge
• Assured Knowledge ~ belief in
- religion (much of human history)
- reason (causal determinism)
- science / pseudoscience
- data-analytic models (~ Big Data)
- genetic risk factors …
7
Gods, Prophets and Shamans
8
Uncertainty and Risk in Science
• Math, Logic and Science are about certainty ~ deterministic rules
• Probability and empirical data involve uncertainty ~ inferior knowledge
Causal Determinism dominates modern science
• True Scientific knowledge consists of deterministic Laws of Nature
• There is a single (true, causal) model that explains a natural phenomenon
9
Knowledge Discovery in Digital Age
• Most information in the form of digital data
• Can we get assured knowledge from data?
• Big Data ~ technological nirvana
data + connectivity → more knowledge
Wired Magazine, 16/07: We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
10
REALITY
• Many studies have questionable value
- statistical correlation vs causation
• Some border on nonsense
- US scientists at SUNY discovered Adultery Gene !!!
(based on a sample of 181 volunteers interviewed about their sex life)
• Economic forecasting, i.e. ‘predicting’
- unemployment rate, monthly job gain/loss...
11
More examples …
• Duke biologists discovered an unusual link between the popular singer and a new species of fern, i.e.
- bisexual reproductive stage of the ferns;
- the team found the sequence GAGA when analyzing the fern’s DNA base pairs
12
Real Data Mining: Kepler’s Laws
• How do planets move among the stars?
- Ptolemaic system (geocentric)
- Copernican system (heliocentric)
• Tycho Brahe (16th century)
- measured positions of the planets in the sky
- used experimental data to support one’s view (hypothesis)
• Johannes Kepler
- used volumes of Tycho’s data to discover
three remarkably simple laws
13
Kepler’s Laws
(1) The orbit is an ellipse with the sun at one of its foci
(2) The line joining a planet to the sun sweeps equal areas during equal time intervals
(3) The ratio P^2/D^3 is constant, where P is the orbit period and D is the orbit size.
NO computers, statistics, machine learning or Big Data
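The third law is easy to check numerically. The sketch below is only an illustration: the orbital values are commonly tabulated approximations (semi-major axis in astronomical units, period in Earth years) and are not taken from the talk, and the names `PLANETS` and `kepler_ratio` are hypothetical. In these units the ratio P^2/D^3 should come out close to 1 for every planet.

```python
# Approximate, commonly tabulated orbital values (an assumption, not from the talk):
# planet -> (orbit size D in astronomical units, orbit period P in Earth years)
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
}


def kepler_ratio(d_au: float, p_years: float) -> float:
    """Return the ratio P^2 / D^3 from Kepler's third law for one planet."""
    return p_years ** 2 / d_au ** 3


# Compute the ratio for every planet; Kepler's law says it is the same constant.
ratios = {name: kepler_ratio(d, p) for name, (d, p) in PLANETS.items()}

for name, ratio in ratios.items():
    print(f"{name:8s} P^2/D^3 = {ratio:.3f}")
```

With these units the constant is 1 by construction for Earth, so the check is whether the other planets agree to within rounding of the tabulated values.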
14
Kepler’s Laws vs. ‘Lady Gaga’ knowledge
• Both search for assured knowledge
• Kepler’s Laws
- well-defined hypothesis stated a priori
- prediction capability
- human intelligence
• Lady Gaga knowledge
- no hypothesis stated a priori
- no prediction capability
- computer intelligence (software program)
- popular appeal (to widest audience)
15
Lessons from Natural Sciences
• Prediction capability
Prediction is hard. Especially about the future.
• Empirical validation/repeatable events
• Limitations (of scientific knowledge)
• Important to ask the right question
- Science starts from problems, and not from observations (K. Popper)
- What we observe is not nature itself, but nature exposed to our method of questioning (W. Heisenberg)
16
Limitations of Scientific Method
When the number of factors coming
into play in a phenomenological
complex is too large, scientific
method in most cases fails us.
We are going to be shifting the mix
of our tools as we try to land the
ship in a smooth way onto the
aircraft carrier.
Recall: the Ancient Greeks scorned ‘predictability’
17
Important Differences
Albert Einstein:
• It might appear that there are no methodological differences between astronomy and economics: scientists in both fields attempt to discover general laws for a group of phenomena. But in reality such differences do exist.
• The discovery of general laws in economics is difficult because observed economic phenomena are often affected by many factors that are very hard to evaluate separately.
• The experience which has accumulated during the civilized period of human history has been largely influenced by causes which are not economic in nature.
18
Prediction in Social Systems
The Bitcoin saga
Illusion of predictability:
19
Methodological Aspects of
Data-Driven Knowledge Discovery
• Empirical Knowledge vs. First-Principles
• Method of Questioning:
- Two Data-Analytic Methodologies
- Statistical Modeling Assumptions
• Example: Market Timing of Mutual Funds
• Interpretation of Predictive Models
20
Three Types of Knowledge
• Growing importance of empirical knowledge
• Demarcation problems:
- first-principles vs empirical vs beliefs
• Assured knowledge ~ interpretable
- first-principle ~ small number of concepts
- empirical knowledge ???
21
Empirical Knowledge
• Can it be obtained from data alone?
• How is it different from ‘beliefs’?
• Role of a priori knowledge vs. data?
• What is ‘the method of questioning’?
These methodological/philosophical issues need to be properly addressed
22
Induction and Predictive Learning
Induction:
aka inductive step,
generalization etc.
Deduction:
aka Prediction
23
Inductive Inference Step
• Inductive inference step:
Data → model ~ ‘uncertain inference’
• Is it possible to make uncertain inferences mathematically rigorous? (Fisher 1935)
• Many types of ‘uncertain inferences’
- hypothesis testing
- maximum likelihood
- risk minimization ….
→ each comes with its own methodology/assumptions
24
Two Data-Analytic Methodologies
• Many existing data-analytic methods but lack of methodological assumptions
• Two theoretical developments
- classical statistics ~ mid 20-th century
- Vapnik-Chervonenkis (VC) theory ~ 1970’s
• Two related technological advances
- applied statistics (R. Fisher)
- machine learning, neural nets, data mining etc.
25
Binary Classification Problem
Given: training data (x,y) ~ i.i.d. samples from
unknown distribution P(x,y)
Estimate: a model or function f(x) that:
- explains this data
- can predict future data
Classification problem:
→ Learning ~ function estimation
26
Classical Statistics
Goal of data modeling / Asking the right question
- estimate unknown distribution P(x,y)
• Classical statistics approach (R. Fisher)
- specify a parametric model for P(x,y)
- estimate its parameters from training data
Observed_Data ~ Model + Noise
more data → better (more accurate) model
• Assumed parametric form of P(x,y) is based on first-principle knowledge, so it is true.
27
Critique of Statistical Approach (Leo Breiman)
• The belief that a statistician can invent a reasonably good parametric class of models for a complex mechanism devised by nature
• Then parameters are estimated and conclusions are drawn
• But conclusions are about
- the model’s mechanism
- not about nature’s mechanism
• Many modern data-analytic sciences (economics, life sciences) have similar flaws
28
Risk Minimization Approach
Goal of data modeling / Asking the right question
~ estimate a model that will predict well
• Predictive Approach:
estimate only properties of P(x, y) that are useful for predicting y
Note: no need to estimate P(x, y)
• Requires specification of:
- a set of possible models f(x,w)
- loss function to measure prediction performance
- proper formalization of the learning problem
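In the usual risk-minimization notation, the three ingredients above combine as follows. This is a standard textbook formulation rather than a formula shown on the slide; the symbols R, R_emp, L and n are notational assumptions, while f(x, w) and P(x, y) are as above.

```latex
% Expected (prediction) risk of a model f(x, w) under a loss function L
R(w) = \int L\bigl(y, f(x, w)\bigr)\, dP(x, y)

% Empirical risk over n i.i.d. training samples (x_i, y_i), i = 1, ..., n
R_{\mathrm{emp}}(w) = \frac{1}{n} \sum_{i=1}^{n} L\bigl(y_i, f(x_i, w)\bigr)
```

Since P(x, y) is unknown, R(w) cannot be computed directly; a learning method works with R_emp(w) and needs additional control of model complexity for the result to generalize.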
29
Standard Modeling Assumptions
• Future is similar to Past
- training and test data from the same distribution
- i.i.d. training data
- large test set
• Prediction accuracy ~ given loss function
- misclassification costs (classification problems)
- squared loss (regression problems)
- etc.
• Proper formalization (for an application)
e.g., classification is used in many applications
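The two loss functions named above can be written down in a few lines. This is a minimal sketch under stated assumptions: the function names, the 0/1 class coding and the default unit costs are illustrative choices, not something specified in the talk.

```python
from typing import Callable, Sequence


def misclassification_loss(y_true: int, y_pred: int,
                           cost_fp: float = 1.0, cost_fn: float = 1.0) -> float:
    """Cost of one classification decision, with classes coded as 0 and 1.

    cost_fp is charged for predicting 1 when the truth is 0,
    cost_fn for predicting 0 when the truth is 1 (both hypothetical defaults).
    """
    if y_true == y_pred:
        return 0.0
    return cost_fp if y_pred == 1 else cost_fn


def squared_loss(y_true: float, y_pred: float) -> float:
    """Squared error of one regression prediction."""
    return (y_true - y_pred) ** 2


def empirical_risk(loss: Callable[[float, float], float],
                   ys: Sequence[float], preds: Sequence[float]) -> float:
    """Average loss over a data set (training or test)."""
    if len(ys) != len(preds) or len(ys) == 0:
        raise ValueError("ys and preds must be non-empty and of equal length")
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)


# Toy usage: two of four labels are wrong; one regression error of size 2.
clf_risk = empirical_risk(misclassification_loss, [0, 1, 1, 0], [0, 1, 0, 1])
reg_risk = empirical_risk(squared_loss, [1.0, 2.0], [1.0, 4.0])
```

The same `empirical_risk` helper applies to both training and test data; under the "future is similar to past" assumption, its value on a large test set estimates prediction accuracy.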
30
Predictive Methodology (VC-theory)
• Method of questioning is
- the learning problem setting (inductive step)
- driven by application requirements
• Standard inductive learning commonly used (may not be the best choice)
• Good generalization depends on two factors
- (small) training error
- small VC-dimension ~ large ‘falsifiability’
31
Timing of International Funds
• International mutual funds
- priced at 4 pm EST (New York time)
- reflect price of foreign securities traded at European/Asian markets
- Foreign markets close earlier than US market
→ Possibility of inefficient pricing.
Market timing exploits this inefficiency.
• Scandals in the mutual fund industry ~2002
• Solution adopted: restrictions on trading
32
Binary Classification Setting
• TWIEX ~ American Century Int’l Growth
• Input indicators (for trading) ~ today
- SP 500 index (daily % change) ~ x_1
- Euro-to-dollar exchange rate (% change) ~ x_2
• Output: TWIEX NAV (% change) ~ y next day
• Trading rule: D(x) = 0 ~ Sell, D(x) = 1 ~ Buy
Model parameterization (fixed):
- linear: g(x, w) = w_1 x_1 + w_2 x_2 + w_0
- quadratic: g(x, w) = w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2 + w_0
• Decision rule (estimated from training data):
D(x) = Ind(g(x, w^*)) → Buy/Sell decision (+1 / 0)
33
Methodological Assumptions
• When can a trained model predict well?
(1) Future/test data is similar to training data
i.e., use the 2004 period for training, and 2005 for testing
(2) Estimated model is ‘simple’ and provides good performance during the training period
i.e., the trading strategy is consistently better than buy-and-hold during the training period
• Loss function (to measure performance):
L(x, y) = D(x) y, where D(x) = Ind(g(x, w^*))
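The decision rule and the performance measure D(x) y can be sketched in code. Everything specific here is an assumption made for illustration: the indicator is taken to fire when g > 0, daily percentage changes are simply summed (no compounding), and the weights and data are toy values, not the fitted model or the TWIEX data from the talk.

```python
from typing import Sequence, Tuple

Inputs = Tuple[float, float]  # (x1: SP500 daily % change, x2: EUR/USD % change)


def g_linear(x: Inputs, w: Sequence[float]) -> float:
    """Linear model g(x, w) = w1*x1 + w2*x2 + w0, with w given as (w0, w1, w2)."""
    w0, w1, w2 = w
    return w1 * x[0] + w2 * x[1] + w0


def decision(x: Inputs, w: Sequence[float]) -> int:
    """D(x) = Ind(g(x, w)): 1 (Buy) if g > 0, else 0 (Sell). Threshold 0 is assumed."""
    return 1 if g_linear(x, w) > 0 else 0


def cumulative_gain(xs: Sequence[Inputs], ys: Sequence[float],
                    w: Sequence[float]) -> float:
    """Sum of D(x_i) * y_i: the next-day % change counts only when invested."""
    return sum(decision(x, w) * y for x, y in zip(xs, ys))


def buy_and_hold(ys: Sequence[float]) -> float:
    """Baseline: always invested, so every next-day % change counts."""
    return sum(ys)


# Toy data (hypothetical): today's indicators and the fund's next-day % change.
xs = [(0.5, 0.2), (-0.8, -0.1), (0.3, -0.4), (-0.2, 0.6)]
ys = [0.4, -0.6, 0.1, -0.3]
w_star = (0.0, 1.0, 0.0)  # hypothetical weights: buy whenever the SP500 rose today

trading_result = cumulative_gain(xs, ys, w_star)
baseline_result = buy_and_hold(ys)
```

Comparing `trading_result` with `baseline_result` over the training period corresponds to assumption (2): the strategy should beat buy-and-hold on the training data before it is trusted on the test period.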
34
Empirical Results: 2004-2005 data
Linear model
[Figure: panel "Training data 2004", axes SP500 (%) and EURUSD (%); panel "Training period 2004", Cumulative Gain/Loss (%) vs. Days, curves: Trading, Buy and Hold]
→ can expect good performance with test data
35
Empirical Results: 2004-2005 data
Linear model
[Figure: panel "Test data 2005", axes SP500 (%) and EURUSD (%); panel "Test period 2005", Cumulative Gain/Loss (%) vs. Days, curves: Trading, Buy and Hold]
confirmed good prediction performance
36
Empirical Results: 2004-2005 data
Quadratic model
[Figure: panel "Training data 2004", axes SP500 (%) and EURUSD (%); panel "Training period 2004", Cumulative Gain/Loss (%) vs. Days, curves: Trading, Buy and Hold]
→ can expect good performance with test data
37
Empirical Results: 2004-2005 data
Quadratic model
[Figure: panel "Test data 2005", axes SP500 (%) and EURUSD (%); panel "Test period 2005", Cumulative Gain/Loss (%) vs. Days, curves: Trading, Buy and Hold]
confirmed good test performance
38
Interpretation vs Prediction
• Two good trading strategies estimated from 2004 training data
[Figure: two panels, each with axes SP500 (%) and EURUSD (%)]
• Both models predict well for test period 2005
• Which one is ‘true’?
39
DISCUSSION
• Can this trading strategy be used now?
- NO, this market timing strategy has been ineffective since about 2008. The reason is the changing statistical characteristics of the market
- YES, it can be used occasionally.
• Hypocrisy of the mutual fund industry
Story 1: markets are very efficient, so individual investors cannot trade successfully and outperform the market indices (such as SP500)
Story 2: market timing is harmful for mutual funds, so such abusive trading activity should be banned
Story 3: restrictions also apply to domestic funds
40
Interpretation of Predictive Models
• Humans cannot provide interpretation even if they can make good prediction
Each input ~ 28 x 28 pixel image → 784-dimensional input x
• Interpretation of black-box models
Not unique / subjective
Depends on chosen parameterization (method)
41
Classification with High-Dimensional Data
• Digit recognition 5 vs 8:
each example ~ 28 x 28 pixel image → 784-dimensional vector x
Medical Interpretation
- Each pixel ~ genetic marker
- Each patient (sample) described by 784 genetic markers
- Two classes ~ presence/absence of a disease
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of decision boundary in 784-dim. space is possible from just a few hundred samples, e.g., using Support Vector Machine (SVM) classifiers
42
Interpretation of SVM models
How to interpret high-dimensional models?
(say, SVM model)
Strategy 1: dimensionality reduction/feature selection
→ prediction accuracy usually suffers
Strategy 2: approximate SVM model via a set of rules (using rule induction, decision tree etc.)
→ does not scale well for high-dim. models
43
Dimensionality Reduction
(1) Reduce dimensionality (small # features)
(a) 10 top ranked pixels using Fisher’s criterion
(b) extract 3 principal components (via PCA)
(2) Estimate RBF SVM model
→ Generalization performance degrades:

Method        Test Error (%)   Training Error (%)
SVM           1.08             0
FISHER+SVM    7.28             4.93
PCA+SVM       6.22             6.18
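Step (1a) ranks individual pixels by Fisher's criterion before the SVM is trained. The talk does not give the formula used, so the sketch below assumes one common form of the score for a single feature, (mean_0 - mean_1)^2 / (var_0 + var_1); the function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance
from typing import List, Sequence


def fisher_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Fisher score of one feature: squared mean gap over summed class variances."""
    gap = (mean(a) - mean(b)) ** 2
    spread = pvariance(a) + pvariance(b)
    if spread == 0:
        # Constant within both classes: perfectly separating or uninformative.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def top_k_features(class0: Sequence[Sequence[float]],
                   class1: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k features with the largest Fisher score."""
    n_features = len(class0[0])
    scores = [
        fisher_score([row[j] for row in class0], [row[j] for row in class1])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: 3 samples per class, 3 features; only feature 0 separates the classes.
class0 = [[0.0, 5.0, 1.0], [0.2, 4.0, 1.1], [0.1, 6.0, 0.9]]
class1 = [[1.0, 5.5, 1.0], [1.2, 4.5, 1.2], [1.1, 5.0, 0.8]]
selected = top_k_features(class0, class1, k=1)
```

Because each feature is scored on its own, a ranking like this ignores interactions between pixels, which is one plausible reason the reduced models in the table lose accuracy relative to the SVM trained on all 784 inputs.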
44
Rule Induction (via ALBA method)
(1) Estimate SVM model using all 784 pixels
(2) Interpret this SVM model via ALBA method
(Active Learning Based Algorithm: Martens et al 2009)
→ Generalization performance degrades:

METHODS              RBF             Polynomial (d=3)
                     SVM     ALBA    SVM     ALBA
Training Error (%)   0       0       0       0
Test Error (%)       1.23    6.48    1.98    8.47
45
SUMMARY (A)
• Predictive Data-Analytic Modeling:
usually on the boundary between trivial and impossible
• Asking the right question ~ problem setting
- depends on modeler’s creativity/intelligence
- requires application domain knowledge
- cannot be formalized
• Modeling Assumptions (not just algorithm)
• Interpretation of black-box models
- very difficult (requires domain knowledge)
- multiplicity of ‘good’ models
46
SUMMARY (B)
• Common misconception:
data-driven models are intrinsically objective
• Explanation bias (favors simplicity + causality)
psychological + cultural reasons
• Cognitive bias (favors only positive findings)
• When all these human biases are incorporated into data-analytic modeling:
- many ‘interesting’ discoveries
- little objective value
- no real predictive value
47
SUMMARY (C)
• Predictive learning methodology is useful for safeguarding against these biases
• It clearly differentiates between
(1) The learning problem setting
~ creation of the human mind; an intelligent speculation that cannot be logically justified or derived from data
(2) The learning algorithm/software
~ particular implementation of (1)
(3) Predictive data-analytic model
~ estimated from data; provides the only objective evaluation of the original intelligent speculation (1).
Model (3) makes sense only in the context of (1).
48
V. Cherkassky
Predictive Learning, 2013
www.VCtextbook.com
Available on Amazon.com
This book presents a very good introduction to machine learning for
undergraduate students and practitioners. It differs from other textbooks in
its original coverage of the philosophical aspects of inference and their
relationship to machine learning theory. This will allow readers to develop a
better understanding of generalization problems and learning algorithms.
- V. Vapnik, Columbia University
49
References
• V. Vapnik, Estimation of Dependences Based on Empirical Data. Empirical Inference Science: Afterword of 2006, Springer, 2006
• L. Breiman, Statistical Modeling: the Two Cultures, Statistical
Science, vol. 16(3), pp. 199-231, 2001
• A. Einstein, Ideas and Opinions, Bonanza Books, NY 1954
• V. Cherkassky and F. Mulier, Learning from Data, second
edition, Wiley, 2007
• V. Cherkassky and S. Dhar, Market timing of international
mutual funds: a decade after the scandal, Proc. CIFEr 2012
50