A Survey of Machine Learning Techniques
Arshia Cont
Musical Representations Team, IRCAM.
[email protected]
ATIAM 2012-13
http://repmus.ircam.fr/atiam_ml_2012

Artificial Intelligence (AI)
• What is Artificial Intelligence? by John McCarthy.
  http://www-formal.stanford.edu/jmc/whatisai/
  “After WWII, a number of people independently started to work on intelligent machines. The English
  mathematician Alan Turing may have been the first. He gave a lecture on it in 1947. He also may have
  been the first to decide that AI was best researched by programming computers rather than by building
  machines. By the late 1950s, there were many researchers on AI, and most of them were basing their
  work on programming computers.”
• Towards the complexity of real-world structures
  • Ant-colony example:
    “The complex behavior of the ant colony is due to the complexity of its environment and not to the
    complexity of the ants themselves. This suggests the adaptive behavior of learning and representation
    and the path the science of the artificial should take.”
    (H.A. Simon, The Sciences of the Artificial, MIT Press, 1969)

Machine Learning and AI
• The methods presented today are not domain-specific, but for every problem you
  start with a design, collect related data and then define the learning problem. We
  will not get into design today... .
• Keep in mind that:
  • AI is an empirical science!
    • See “The Sciences of the Artificial” by H.A. Simon, MIT Press, 1969
  • DO NOT apply algorithms blindly to your data/problem set!
    • The MATLAB Toolbox syndrome: examine the hypotheses and limitations of each approach before hitting enter!
  • Do not forget your own intelligence!

Machine Learning and AI
• Our goal today is to introduce some well-known and well-established approaches in AI and Machine Learning
  • Applied: a set of well-defined approaches, each within its limits, that can be applied to a
    problem set
  • Classification / Pattern Recognition / Sequential Reasoning / Induction / Parameter
    Estimation etc.

Pattern Recognition
• Machine Learning deals with sub-problems in engineering and
  sciences rather than the global “intelligence” issue!
• Pattern recognition in action. Examples:
  • Instrument Classification
  • Music genre classification
  • Audio to Score Alignment (score following)
  • Automatic Improvisation
  • Gesture Recognition
  • Music Structure Discovery
  • Concatenative Synthesis (unit selection)
  • Artist Recovery

Pattern Recognition
• Pattern recognition design cycle
• Examples:
  • Instrument Classification
  • Audio to Score Alignment (score following)
  • Music genre classification
  • Automatic Improvisation
  • Gesture Recognition
  • Music Structure Discovery
  • Concatenative Synthesis (unit selection)
  • Room Acoustics (parameter adaptation)
  • Physical Modeling (model learning/adaptation)
  • etc!

Machine Learning
• Is an empirical science
• Is extremely diverse
• Has a profound theoretical background
• Should keep you honest (and not the contrary!)
• Provides tools and reasoning for the design process of a given problem
• Course objective:
  • To get familiar with Machine Learning tools and reasoning, and prepare you for
    attacking real-world problems in Music Processing

Sample Example (I)
• Communication theory: a bit is transmitted by a source, corrupted by noise, and received
  by a decoder:
  Y → channel → X
• Q: what should the optimal decoder do to recover Y?

Sample Example (I)
• This is a typical Classification problem
• X is usually referred to as the observation and is a random variable.
• In most problems, the real state of the world (y) is not observable to us! So we try to
  infer it from the observation.
• Question: What should an optimal decoder do to recover Y from X?

Sample Example (I)
• Bayes solution: the Bayes decision rule is usually highly intuitive
  • in the communications example: threshold on 0.5
• But let's make life more difficult!

Sample Example (I)
• Simple Solution 1 (a generative design):
  • Assume that a family of parametric models can generate X given Y (the inverse
    problem) and learn it, e.g.
    P_{X|Y}(x|0) = G(x, 0, σ),   P_{X|Y}(x|1) = G(x, 1, σ),   P_Y(0) = P_Y(1) = 1/2
  • or, graphically: two Gaussian class-conditional densities centered on 0 and 1.

Sample Example (I)
• Simple Solution 2 (a discriminative design):
  • Define a decision function g(x) that predicts the state of the world (y).
  • Try to find an optimal boundary (defined as g(x)) that can best separate the two.
  • Define the decision function as + or - distance from this boundary.
  • I am thus assuming a family of functions g(x) that discriminate the X classes.

Sample Example (I)
• We just saw two different philosophies to solve our simple problem:
  • Generative Design
  • Discriminative Design

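To make the two design philosophies concrete, here is a minimal NumPy sketch of the 1-D bit example above (my own illustration, not from the slides): the generative decoder models P_{X|Y}(x|y) with the two Gaussians and compares likelihoods, while the discriminative decoder only fits a scalar boundary from labelled samples. The noise level and the helper names such as `gauss_pdf` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.4                      # assumed noise level (illustrative)

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density G(x, mu, sigma)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Simulate the channel: a bit Y in {0, 1} plus zero-mean Gaussian noise.
y = rng.integers(0, 2, size=1000)
x = y + rng.normal(0.0, sigma, size=y.shape)

# Generative design: model P(x|y) for each class and compare likelihoods
# (equal priors, so the prior terms cancel).
gen_decision = (gauss_pdf(x, 1.0, sigma) > gauss_pdf(x, 0.0, sigma)).astype(int)

# Discriminative design: fit a boundary g(x) = x - T directly from the data,
# here simply the midpoint between the two empirical class means.
T = 0.5 * (x[y == 0].mean() + x[y == 1].mean())
disc_decision = (x > T).astype(int)

print("generative error rate    :", np.mean(gen_decision != y))
print("discriminative error rate:", np.mean(disc_decision != y))
print("fitted threshold T =", T)   # close to the intuitive 0.5
```
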
Sample Example (I)
• In the real world things are not as simple
  • Consider the following 2-dimensional problem: not hard to see the problem!
  1. To what extent does our solution generalize to new data?
     • The central aim of designing a classifier is to correctly classify novel input!
  2. How do we know when we have collected an adequately large and
     representative set of examples for training?
  3. How can we decide model complexity versus performance?

Sample Example (II)
• This is a typical Regression problem
• Polynomial Curve Fitting

Sample Example (II)
• Polynomial Curve Fitting
  • Sum-of-Squares Error Function
  • 0th, 1st, 3rd and 9th order polynomial fits
  • Over-fitting
    • Root-Mean-Square (RMS) Error
  • Over-fitting and regularization
    • Effect of data set size (9th order polynomial)
    • Regularization: penalize large coefficient values
    • 9th order polynomial with regularization

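These slides show the curve-fitting figures only; below is a minimal NumPy sketch (my own, not from the course) of the same idea: fit polynomials of increasing order by minimizing the sum-of-squares error, optionally adding a quadratic penalty on the coefficients. The sin(2πx) toy data, the noise level and the regularization weight are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data in the spirit of the curve-fitting example: a smooth curve plus noise.
n = 10
x_train = np.linspace(0.0, 1.0, n)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.3, n)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.3, 100)

def design_matrix(x, order):
    """Columns 1, x, x^2, ..., x^order."""
    return np.vander(x, order + 1, increasing=True)

def fit_poly(x, t, order, lam=0.0):
    """Minimize the sum-of-squares error plus lam * ||w||^2 (lam = 0: plain least squares)."""
    Phi = design_matrix(x, order)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(order + 1), Phi.T @ t)

def rms_error(w, x, t):
    """Root-Mean-Square error of the fitted polynomial on (x, t)."""
    pred = design_matrix(x, len(w) - 1) @ w
    return np.sqrt(np.mean((pred - t) ** 2))

for order in (0, 1, 3, 9):
    w = fit_poly(x_train, t_train, order)
    print(f"order {order}: train RMS = {rms_error(w, x_train, t_train):.3f}, "
          f"test RMS = {rms_error(w, x_test, t_test):.3f}")

# With order 9 and only 10 points the fit can follow the noise (over-fitting):
# training error is near zero while test error grows; a small penalty on large
# coefficients counteracts this.
w_reg = fit_poly(x_train, t_train, 9, lam=np.exp(-18))
print("order 9 with regularization: test RMS =", round(rms_error(w_reg, x_test, t_test), 3))
```
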
Sample Example (III)
• I want my computer to learn the “style” of Bach and to
  generate new Bach music that has not happened before.
  • Harder to imagine..
  • But we'll soon get there!

Important Questions
• Given that we have learned what we want...
  • If my g(x) can predict well on the data I have, will it also predict well on other
    sources of X I have not seen before?
  • OR: to what extent does the knowledge that has been learned apply to the whole
    world outside? OR: how does my learning generalize? (Generalization)
• Does having more data necessarily mean I learn better?
• Does having more complex models necessarily improve learning?
  • No! Over-fitting... .
  • No! Regularization... .

Machine Learning Families
• Three Types of Learning
  Imagine an organism or machine which experiences a series of sensory inputs:
  x1, x2, x3, x4, . . .
  • Supervised learning: The machine is also given desired outputs y1, y2, . . ., and its goal is
    to learn to produce the correct output given a new input.
  • Unsupervised learning: The goal of the machine is to build a model of x that can be
    used for reasoning, decision making, predicting things, communicating etc.
  • Reinforcement learning: The machine can also produce actions a1, a2, . . . which affect
    the state of the world, and receives rewards (or punishments) r1, r2, . . .. Its goal is to learn
    to act in a way that maximises rewards in the long term.

Machine Learning Families
• Goals of Supervised Learning. Supervised Learning Families:
  • Classification: The desired outputs yi are discrete class labels.
    The goal is to classify new inputs correctly (i.e. to generalize).
  • Regression: The desired outputs yi are continuous valued.
    The goal is to predict the output accurately for new inputs.

Machine Learning Models
• 1. Generative Learning
  • When we start with the hypothesis that a family of parametric models can
    generate X given Y
  • The notion of a prior model!
  • At the core of Bayesian learning.. subject of ongoing and historical philosophical
    debates.
  • Pros:
    • We can incorporate our own belief and knowledge into the model and
      eventually test and refine it.
    • In most cases simplifies the mathematical structure of the problem.
    • Guaranteed solutions exist in many situations!
  • Cons:
    • Tautology?!
    • Curse of Dimensionality

Machine Learning Models
• 2. Discriminative Learning
  • When we do not assume a model over the data, but assume a form for how the classes
    are separated from each other and fit it to discriminate the classes... .
  • Neural Networks, Kernel methods, Support Vector Machines etc.
  • Pros:
    • No curse of dimensionality (in most cases)
    • Good when you cannot formally describe the hidden generative process.
  • Cons:
    • Prior knowledge for discriminant factors is hard to imagine/justify.. .
    • Less appealing for applications where generation is also important... .
    • For complicated problems, they “seem” less intuitive than Generative
      methods... .

Probability Theory
• A probabilistic model of the data can be used to
  • Make inferences about missing inputs
  • Generate predictions/fantasies!
  • Make decisions with minimized expected loss
  • Communicate the data in an efficient way
• Statistical modeling is equivalent to other views of learning
  • Information theoretic: finding compact representations of the data
  • Physics: minimizing the free energy of a corresponding mechanical system
• If not, what else?
  • Knowledge engineering approach vs. empirical induction approach
  • Domain of Probabilities vs. Domain of Possibilities (fuzzy logic)
  • Logic AI ...

Probability Theory
• Joint Probability, Marginal Probability, Conditional Probability
• Sum Rule:  p(X) = Σ_Y p(X, Y)
• Product Rule:  p(X, Y) = p(Y|X) p(X)

Probability Theory
• Rules of Probability: the Sum Rule and the Product Rule (above)
• Bayes' Theorem:
  p(Y|X) = p(X|Y) p(Y) / p(X)
  • posterior ∝ likelihood × prior
• Independence:
  • Random variables X and Y are independent if  p(X|Y) = p(X)

Probability Theory
• Probability Densities
• Expectations
  • Conditional Expectation (discrete)
  • Approximate Expectation (discrete and continuous)

Probability Theory
• Variances and Covariances
• The Gaussian Distribution
  • Gaussian Mean and Variance

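The formulas on these probability slides were figures and did not survive extraction; for reference, the standard definitions they point to are restated below (my restatement, not copied from the slides):

```latex
\mathbb{E}[f] = \sum_x p(x)\, f(x) \;\;\text{(discrete)} \quad\text{or}\quad
\mathbb{E}[f] = \int p(x)\, f(x)\, dx \;\;\text{(continuous)},
\qquad
\mathbb{E}_y[f \mid x] = \sum_y p(y \mid x)\, f(y)

\operatorname{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big],
\qquad
\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big]

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\mathbb{E}[x] = \mu, \quad \operatorname{var}[x] = \sigma^2
```
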
Generative vs. Discriminative Example
• In summary:
  P_{X|Y}(x|0) = G(x, 0, σ),   P_{X|Y}(x|1) = G(x, 1, σ),   P_Y(0) = P_Y(1) = 1/2
• Generative approach:
  • Model P_{X|Y}(x|y) and P_Y(y)
  • Use Bayes' theorem to obtain the posterior
• Discriminative approach:
  • Model the decision function / posterior directly

Basic Rules of Probability
• Probabilities are non-negative: P(x) ≥ 0 ∀x.
• Probabilities normalise: Σ_x P(x) = 1 for discrete distributions and ∫ p(x) dx = 1 for
  probability densities.
• The joint probability of x and y is: P(x, y).
• The marginal probability of x is: P(x) = Σ_y P(x, y).
• The conditional probability of x given y is: P(x|y) = P(x, y)/P(y)
• Bayes Rule:
  P(x, y) = P(x)P(y|x) = P(y)P(x|y)   ⇒   P(y|x) = P(x|y)P(y) / P(x)
• Warning: I will not be obsessively careful in my use of p and P for probability density and probability
  distribution. Should be obvious from context.

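As a quick numerical sanity check of these rules (my own illustration, not from the notes), the sketch below builds a small joint table P(x, y) and verifies the sum rule, product rule and Bayes rule on it; the joint-table values are made up:

```python
import numpy as np

# A small joint distribution P(x, y) over x in {0,1,2} and y in {0,1}.
P_xy = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.05, 0.25]])
assert np.isclose(P_xy.sum(), 1.0)          # probabilities normalise

# Sum rule: P(x) = sum_y P(x, y)   and   P(y) = sum_x P(x, y)
P_x = P_xy.sum(axis=1)
P_y = P_xy.sum(axis=0)

# Product rule: P(x, y) = P(x|y) P(y)
P_x_given_y = P_xy / P_y                    # column j holds P(x | y=j)
assert np.allclose(P_x_given_y * P_y, P_xy)

# Bayes rule: P(y|x) = P(x|y) P(y) / P(x)
P_y_given_x = (P_x_given_y * P_y) / P_x[:, None]
assert np.allclose(P_y_given_x.sum(axis=1), 1.0)
print("P(x) =", P_x)
print("P(y|x) =\n", P_y_given_x)
```
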
Bayesian Decision Theory

Bayesian Decision Theory
• A framework for computing optimal decisions on problems
  involving uncertainty (probabilities)
• Basic concepts:
  • World: has states or classes, drawn from a random variable Y
    • Instrument classification:  Y ∈ {violin, piano, trumpet, drums, ...}
    • Audio to Score Alignment:  Y ∈ {note1, chord2, note3, trill4, ...}
  • Observer: measures observations (features), drawn from a random process X
    • Instrument classification:  X = MFCC features ∈ R^n

Basics of Bayesian Decision Theory
• Question: How to choose the best class given the data?
• Bayesian (MAP) classifier:
  • Minimize the probability of error by choosing the maximum a posteriori (MAP) class:
    ω̂ = argmax_{ω_i} Pr(ω_i | x)
  • Intuitively right: choose the most probable class in light of the observation.
• But we don't know Pr(ω_i | x); we do know Pr(x | ω_i).
• Given models for each distribution p(x | ω_i), apply Bayes' rule:
  the search for  ω̂ = argmax_{ω_i} Pr(ω_i | x)
  becomes  ω̂ = argmax_{ω_i} [ p(x | ω_i) Pr(ω_i) / Σ_j p(x | ω_j) Pr(ω_j) ]
  but the denominator = p(x) is the same over all ω_i,
  hence  ω̂ = argmax_{ω_i} p(x | ω_i) Pr(ω_i)
  (after Dan Ellis, Musical Content, 1: Pattern Recognition, 2003)

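A minimal sketch of the MAP rule above in NumPy (my own illustration): three hypothetical classes with 1-D Gaussian class-conditional densities p(x|ω_i) and unequal priors Pr(ω_i); the classifier returns argmax_i p(x|ω_i) Pr(ω_i). The means, variances and priors are made-up numbers, not course material.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density G(x, mu, sigma)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Hypothetical class models p(x|w_i) = G(x, mu_i, sigma_i) and priors Pr(w_i).
means  = np.array([0.0, 2.0, 5.0])
sigmas = np.array([1.0, 0.5, 1.5])
priors = np.array([0.5, 0.3, 0.2])

def map_classify(x):
    """Return argmax_i p(x|w_i) Pr(w_i); the evidence p(x) is the same for every i."""
    x = np.atleast_1d(x)[:, None]                   # shape (n_samples, 1)
    scores = gauss_pdf(x, means, sigmas) * priors   # shape (n_samples, n_classes)
    return scores.argmax(axis=1)

print(map_classify([-0.3, 1.8, 4.0]))   # prints [0 1 2] for these illustrative models
```
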
Basics of Bayesian Decision Theory
• Let's now go back to our sample example (I):
  • a bit is transmitted by a source, corrupted by noise, and received by a decoder
    Y → channel → X
  • Q: what should the optimal decoder do to recover Y?
  • Intuitively, the decision rule can be a threshold T:
    Ŷ = 0 if x < T,   Ŷ = 1 if x ≥ T
• We need:
  • Class probabilities:
    • in the absence of any other information let's say  P_Y(0) = P_Y(1) = 1/2
  • Class-conditional densities:
    • Noise results from thermal processes: a lot of independent events that add up
    • By the central limit theorem, it is reasonable to assume that the noise is Gaussian
    • Denote a Gaussian random variable of mean µ and variance σ² by  X ~ N(µ, σ²)

Basics of Bayesian Decision Theory
• Example: the Gaussian probability density function is
  P_X(x) = G(x, µ, σ) = (1 / √(2πσ²)) exp( -(x - µ)² / (2σ²) )
• Since the noise is Gaussian, and assuming it is just added to the signal, we have
  X = Y + ε,   ε ~ N(0, σ²)
  • in both cases, X corresponds to a constant (Y) plus zero-mean Gaussian noise
  • this simply adds Y to the mean of the Gaussian
• In summary:
  P_{X|Y}(x|0) = G(x, 0, σ),   P_{X|Y}(x|1) = G(x, 1, σ),   P_Y(0) = P_Y(1) = 1/2
  • or, graphically, two Gaussians of variance σ² centered at 0 and 1.

Basics of Bayesian Decision Theory
• To compute the Bayesian Decision Rule (BDR, or MAP) we use log probabilities here:
  i*(x) = arg max_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
  • terms which are constant can be dropped
  • hence, if the priors are equal, we have:
    i*(x) = arg max_i log P_{X|Y}(x|i)
• Graphically, this MAP solution is intuitive:
  • we pick the class that “best explains” (gives higher probability to) the observation
  • in this case we can solve visually: pick 0 where the class-0 density is higher, pick 1 where
    the class-1 density is higher
  • but the mathematical solution is equally simple

Basics of Bayesian Decision Theory
• BDR: now let's consider the more general case
  P_{X|Y}(x|0) = G(x, µ0, σ),   P_{X|Y}(x|1) = G(x, µ1, σ)
• for which
  i*(x) = arg max_i log P_{X|Y}(x|i)
        = arg max_i log [ (1/√(2πσ²)) exp( -(x - µ_i)² / (2σ²) ) ]
        = arg max_i [ -½ log(2πσ²) - (x - µ_i)² / (2σ²) ]
        = arg min_i (x - µ_i)² / (2σ²)
• or
  i* = arg min_i (x - µ_i)²
     = arg min_i (x² - 2xµ_i + µ_i²)
     = arg min_i (-2xµ_i + µ_i²)
• the optimal decision is, therefore:
  • pick 0 if  -2xµ0 + µ0² < -2xµ1 + µ1²,  i.e.  2x(µ1 - µ0) < µ1² - µ0²
  • or, pick 0 if  x < (µ1 + µ0) / 2

Basics of Bayesian Decision Theory
• BDR: for a problem with Gaussian classes, equal variances
  and equal class probabilities:
  • the optimal decision boundary is the threshold at the mid-point between the two means
  • or, graphically: pick 0 below the midpoint of µ0 and µ1, pick 1 above it
  • now we know that the intuitive threshold is actually optimal, and
    in which sense it is optimal (minimum probability of error)
  • the Bayesian solution keeps us honest: it forces us to make all our assumptions explicit
  • assumptions we have made:
    • uniform class probabilities:  P_Y(0) = P_Y(1) = 1/2
    • Gaussianity:  P_{X|Y}(x|i) = G(x, µ_i, σ_i)
    • the variance is the same under the two states:  σ_i = σ, ∀i
    • the noise is additive:  X = Y + ε
  • even for a trivial problem, we have made lots of assumptions

Basics of Bayesian Decision Theory
• “Yeah! So what?” What is the point of going through all the math?
  • Bayesian Decision Theory keeps you honest!
  • It also forces us to make our assumptions explicit!
• In practice we never have one variable but a vector of observations >>
  Multivariate Gaussians
• Priors are not uniform.

Gaussian Classifiers
• The Gaussian classifier: let's derive the Bayesian Decision Rule (BDR) for the case of
  multivariate Gaussian classes:
  P_{X|Y}(x|i) = (1 / √((2π)^d |Σ_i|)) exp( -½ (x - µ_i)^T Σ_i^{-1} (x - µ_i) )
• the BDR
  i*(x) = arg max_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
  becomes
  i*(x) = arg max_i [ -½ (x - µ_i)^T Σ_i^{-1} (x - µ_i) - ½ log((2π)^d |Σ_i|) + log P_Y(i) ]
• this can be written as
  i*(x) = arg min_i [ d_i(x, µ_i) + α_i ]
  with
  d_i(x, y) = (x - y)^T Σ_i^{-1} (x - y)          (the Mahalanobis distance)
  α_i = log((2π)^d |Σ_i|) - 2 log P_Y(i)
• the optimal rule is to assign x to the closest class:
  • “closest” is measured with the Mahalanobis distance d_i(x, y)
  • to which the constant α_i is added to account for the class prior
• the decision boundary (discriminant) is the set where  P_{Y|X}(1|x) = 0.5

Gaussian Classifiers
• In detail, for classes sharing a common covariance Σ:
  i*(x) = arg min_i [ (x - µ_i)^T Σ^{-1} (x - µ_i) - 2 log P_Y(i) ]
        = arg min_i [ x^T Σ^{-1} x - 2 µ_i^T Σ^{-1} x + µ_i^T Σ^{-1} µ_i - 2 log P_Y(i) ]
        = arg min_i [ -2 µ_i^T Σ^{-1} x + µ_i^T Σ^{-1} µ_i - 2 log P_Y(i) ]      (x^T Σ^{-1} x is common to all classes)
        = arg max_i [ µ_i^T Σ^{-1} x - ½ µ_i^T Σ^{-1} µ_i + log P_Y(i) ]
        = arg max_i [ w_i^T x + w_{i0} ]

Gaussian Classifiers
• The Gaussian classifier, in summary:
  i*(x) = arg max_i g_i(x)
  with
  g_i(x) = w_i^T x + w_{i0}
  w_i = Σ^{-1} µ_i
  w_{i0} = -½ µ_i^T Σ^{-1} µ_i + log P_Y(i)
  • the BDR is a linear function, or a linear discriminant
  • the discriminant (decision boundary) is the set where  P_{Y|X}(1|x) = 0.5

Group Homework 0
• Find the geometric equation for the hyperplane separating the two classes
  for the linear discriminant of the Gaussian classifier.
  • Hint: this is the set such that  P_{Y|X}(1|x) = 0.5, i.e.  g_0(x) = g_1(x).

The role of prior
• The prior can offset the “threshold” value (in our simple example)
• Geometric interpretation: the boundary hyper-plane in 1, 2 and 3D,
  for various prior configurations

The role of covariance
• So far, our covariance matrices were simple! If they are
  different, the nice hyper-plane for a 2-class problem
  becomes hyper-quadratic.
• Geometric interpretation in 2 and 3D.

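A small NumPy sketch of the Gaussian classifier above (my own, with made-up parameters): it assigns x to the class minimizing the Mahalanobis distance d_i(x, µ_i) plus the constant α_i = log((2π)^d |Σ_i|) - 2 log P_Y(i). With a shared Σ this realizes the linear discriminant; with class-specific Σ_i the boundary becomes quadratic, and changing the priors shifts it.

```python
import numpy as np

def gaussian_classifier(x, means, covs, priors):
    """BDR for Gaussian classes: argmin_i [ d_i(x, mu_i) + alpha_i ]."""
    x = np.asarray(x, dtype=float)
    d = x.shape[-1]
    scores = []
    for mu, Sigma, p in zip(means, covs, priors):
        diff = x - mu
        mahalanobis = diff @ np.linalg.inv(Sigma) @ diff
        alpha = np.log((2 * np.pi) ** d * np.linalg.det(Sigma)) - 2 * np.log(p)
        scores.append(mahalanobis + alpha)
    return int(np.argmin(scores))

# Two hypothetical 2-D classes (illustrative numbers only).
means  = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
shared = [np.eye(2), np.eye(2)]              # same covariance -> linear boundary
mixed  = [np.eye(2), np.diag([4.0, 0.25])]   # different covariances -> quadratic boundary
priors = [0.5, 0.5]

x = np.array([0.9, 0.9])
print("equal priors, shared covariance   :", gaussian_classifier(x, means, shared, priors))
print("equal priors, distinct covariances:", gaussian_classifier(x, means, mixed, priors))
# A larger prior on class 1 enlarges its decision region and can flip the decision.
print("priors (0.1, 0.9), shared         :", gaussian_classifier(x, means, shared, [0.1, 0.9]))
```
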
Bayesian Decision Theory
• Advantages
  • BDR is optimal and cannot be beaten!
  • Bayes keeps you honest
  • Models reflect a causal interpretation of the problem, or how we think!
  • Natural decomposition into “what we knew already” (prior) and “what the data tells
    us” (observations)
  • No need for heuristics to combine these two sources of information
  • BDR is intuitive
• Problems
  • BDR is optimal ONLY if the models are correct!

Maximum Likelihood Estimation

Bayesian Decision Theory
• WHAT??? We do have an optimal (and geometric) solution:
  i*(x) = arg max_i [ µ_i^T Σ^{-1} x - ½ µ_i^T Σ^{-1} µ_i + log P_Y(i) ]
        = arg max_i [ w_i^T x + w_{i0} ]
• but we do not know the values of the parameters µ, Σ, P_Y(i)
  • We have to estimate these values!
  • We can estimate them from a training set
  • example: use the average value as an estimate for the mean!

Maximum Likelihood
• This seems pretty serious: how should I get these probabilities then?
• We rely on the maximum likelihood (ML) principle.
• ML has three main steps:
  1. Choose a parametric model for all probabilities (as a function of unknown
     parameters)
  2. Assemble a training data-set
  3. Solve for the parameters that maximize the probability of the data-set!

Maximum Likelihood
• Step 1) we choose a parametric model for all probabilities
  • to make this clear, we denote the vector of parameters by Θ and
    the class-conditional distributions by
    P_{X|Y}(x | i; Θ)
  • note that this means that Θ is NOT a random variable (otherwise
    it would have to show up as a subscript)
  • it is simply a parameter, and the probabilities are a function of this
    parameter

Maximum Likelihood
• Step 2) we assemble a collection of datasets
  • D(i) = {x1(i), ..., xn(i)}, a set of examples drawn independently from
    class i
  • each sample D(i) is considered independently
  • the parameter Θi is estimated only from the sample D(i)
  • we simply have to repeat the procedure for all classes
  • so, from now on we omit the class variable

Maximum Likelihood
• Step 3) we select the parameters of class i to be the ones that
  maximize the probability of the data from that class
  Θi* = arg max_Θ P_{X|Y}(D(i) | i; Θ)
      = arg max_Θ log P_{X|Y}(D(i) | i; Θ)
  or, omitting the class,
  Θ* = arg max_Θ P_X(D; Θ)
     = arg max_Θ log P_X(D; Θ)
• the function P_X(D; Θ) is called the likelihood of the
  parameter Θ with respect to the data, or simply the likelihood function
  • like before, it does not really make any difference to maximize
    probabilities or their logs

Maximum Likelihood
• Given a sample, to obtain the ML estimate we need to solve
  Θ* = arg max_Θ P_X(D; Θ)
• If Θ is a scalar, this is high-school calculus:
  • we have a maximum when the first derivative is zero and the second derivative is
    negative
• We quickly review the higher-dimensional tools for this aim... .

Maximum Likelihood
• The gradient: in higher dimensions, the generalization of the derivative is the gradient.
  The gradient of a function f(x) at z is:
  ∇f(z) = ( ∂f/∂x_0 (z), ..., ∂f/∂x_{n-1} (z) )^T
• It has a nice geometric interpretation:
  • it points in the direction of maximum growth of the function
  • it is perpendicular to the contour where the function is constant

Maximum Likelihood
• The Hessian: the extension of the 2nd-order derivative is the Hessian matrix
  ∇²f(x) = [ ∂²f / (∂x_i ∂x_j) ],   i, j = 0, ..., n-1
  i.e. the n×n matrix whose (i, j) entry is the second partial derivative ∂²f/∂x_i∂x_j.
• In an ML setup we have a maximum when the Hessian is negative definite, i.e.
  x^T ∇²f(x) x ≤ 0  for all x.

Maximum Likelihood
• In summary:
  1. Choose a parametric model for the probabilities:  P_X(x; Θ)
  2. Assemble a data-set D = {X1, ..., Xn} of independently drawn examples
  3. Select the parameters that maximize the probability of the data:
     Θ* = arg max_Θ P_X(D; Θ)
        = arg max_Θ log P_X(D; Θ)
• The solutions are the parameters such that
  ∇_Θ P_X(D; Θ) = 0    and    θ^T ∇²_Θ P_X(D; Θ*) θ ≤ 0,  ∀θ ∈ R^n

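As a concrete instance of these steps (my own sketch, not from the slides): choose a Gaussian model P_X(x; Θ) with Θ = (µ, σ²), draw a data set, and maximize the log-likelihood. For the Gaussian the maximizers are the sample mean and the (biased) sample variance; the code also checks numerically that the gradient of the log-likelihood is approximately zero at the estimate.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1.5, scale=0.7, size=500)   # samples from an assumed "true" model

def log_likelihood(mu, var, x):
    """log P_X(D; mu, var) for i.i.d. Gaussian samples."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

# Closed-form ML estimates for the Gaussian: sample mean and (biased) sample variance.
mu_ml = data.mean()
var_ml = np.mean((data - mu_ml) ** 2)
print("ML estimates: mu =", round(mu_ml, 3), " var =", round(var_ml, 3))

# Numerical check: the gradient of the log-likelihood vanishes at (mu_ml, var_ml).
eps = 1e-5
d_mu = (log_likelihood(mu_ml + eps, var_ml, data) -
        log_likelihood(mu_ml - eps, var_ml, data)) / (2 * eps)
d_var = (log_likelihood(mu_ml, var_ml + eps, data) -
         log_likelihood(mu_ml, var_ml - eps, data)) / (2 * eps)
print("gradient at the ML solution:", round(d_mu, 6), round(d_var, 6))
```
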
ML + BDR
• Going back to our simple classification problem... .
• We can combine ML and the Bayesian Decision Rule to make things safer, and
  pick the desired class i if:
  i*(x) = arg max_i P_{X|Y}(x | i; θ_i*) P_Y(i)
  where  θ_i* = arg max_θ P_{X|Y}(D | i; θ)
• In summary, for the example:
  P_{X|Y}(x|0) = G(x, 0, σ),   P_{X|Y}(x|1) = G(x, 1, σ),   P_Y(0) = P_Y(1) = 1/2

Sample Example II: Polynomial Regression (GROUP I, for next class)
• Two random variables X and Y
• A dataset of examples  D = {(X1, Y1), ..., (Xn, Yn)}
• A parametric model of the form
  y = f(x; Θ) + ε   where  ε ~ N(0, σ²)
• Concretely,  f(x; Θ) = Σ_{i=0}^{K} θ_i x^i
• where the data is distributed as  P_{Z|X}(D | x; Θ) = G(z, f(x; Θ), σ²)
• Show that  Θ* = (Φ^T Φ)^{-1} Φ^T y,  where Φ is the n × (K+1) design matrix whose
  i-th row is  (1, x_i, x_i², ..., x_i^K).

Estimators
• We now know how to produce estimators using Maximum Likelihood... .
• How do we evaluate an estimator? Using bias and variance.
• Bias: a measure of how close the expected value of the estimator is to the true value
  • If θ̂ = f(X1, ..., Xn) then  Bias(θ̂) = E_{X1,...,Xn}[ f(X1, ..., Xn) - θ ]
  • An estimator that has bias will usually not converge to the perfect estimate,
    no matter how large the data-set is!
• Variance: given a good bias, how many sample points do we need?
  • Var(θ̂) = E_{X1,...,Xn}[ ( f(X1, ..., Xn) - E_{X1,...,Xn}[f(X1, ..., Xn)] )² ]
  • Variance usually decreases with more training examples... .

Estimators
• Example: the ML estimator for the mean of a Gaussian, X_i ~ N(µ, σ²), is  µ̂ = (1/n) Σ_i X_i
  Bias(µ̂) = E_{X1,...,Xn}[µ̂ - µ]
           = E_{X1,...,Xn}[µ̂] - µ
           = (1/n) Σ_i E_{X_i}[X_i] - µ
           = µ - µ = 0
• The estimator is thus unbiased.

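A quick Monte Carlo sketch of these definitions (illustrative, not from the slides): repeatedly draw data sets of size n from N(µ, σ²), compute the sample-mean estimator on each, and estimate its bias and variance empirically. The bias stays near zero (unbiased) while the variance shrinks roughly like σ²/n as n grows. The chosen µ, σ and trial counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, sigma = 2.0, 1.0
n_trials = 20000

for n in (5, 50, 500):
    # One sample-mean estimate per simulated data set of size n.
    estimates = rng.normal(mu_true, sigma, size=(n_trials, n)).mean(axis=1)
    bias = estimates.mean() - mu_true
    variance = estimates.var()
    print(f"n = {n:3d}: bias ~ {bias:+.4f}, variance ~ {variance:.4f} "
          f"(sigma^2/n = {sigma**2 / n:.4f})")
```
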
Bayesian Parameter Estimation

Bayesian Parameter Estimation
• Bayesian parameter estimation is an alternative framework for
  parameter estimation
• There is a fundamental difference between Bayesian and ML
  methods!
  • The long debate between frequentists vs Bayesians
• To understand this, we need to distinguish between two components:
  • The definition of probability (intact)
  • The assessment of probability (differs)
• We need to review these fundamentals!

Probability Measures
• This does not change between the frequentist and Bayesian
  philosophies
• A probability measure P(A) satisfies three axioms:
  • P(A) ≥ 0 for all events A
  • P(universal event) = 1
  • if A ∩ B = ∅ then P(A + B) = P(A) + P(B)

Frequentist vs Bayesian
• The difference is in interpretation!
• Frequentist view:
  • Probabilities are relative frequencies
  • Makes sense when we have a lot of observations (no bias)
  • Problems:
    • In most cases we do not have a large number of observations!
    • In most cases probabilities are not objective!
    • This is not usually how people behave.
• Bayesian view:
  • Probabilities are subjective (not equal to relative counts)
  • Probabilities are degrees of belief on the outcome of the experiment

Bayesian Parameter Estimation
• Basic concept:
  • Difference with ML: Θ is a random variable.
  • Training set D = {X1, ..., Xn} of examples drawn independently
  • Probability density for observations given the parameter:  P_{X|Θ}(x|θ)
  • Prior distribution for parameter configurations:  P_Θ(θ), which encodes our prior belief on Θ
  • Goal: compute the posterior distribution  P_{Θ|X}(θ|D)

Bayes vs. ML
• Optimal estimate:
  • under ML there is one “best” estimate
  • under Bayes there is no “best” estimate
  • It makes no sense under Bayes to talk about a “best” estimate
• Predictions:
  • We do not really care about the parameters themselves! Only in the fact that they
    build models... .
  • Models can be used to make predictions
  • Unlike ML, Bayes uses ALL the information in the training set to make predictions

Bayes vs ML
• Let's consider the BDR under the “0-1” loss and an
  independent sample D = {x1, ..., xn}
• ML-BDR: pick i if
  i*(x) = arg max_i P_{X|Y}(x | i; θ_i*) P_Y(i)
  where  θ_i* = arg max_θ P_{X|Y}(D | i; θ)
  • two steps: i) find θ*;  ii) plug it into the BDR
  • all information not captured by θ* is lost, not used at decision time
  • note that we know that information is lost
    • e.g. we can't even know how good of an estimate θ* is
    • unless we run multiple experiments and measure bias/variance

Bayes vs ML
• Bayesian BDR
  • under the Bayesian framework, everything is conditioned on the
    training data
  • denote by T = {X1, ..., Xn} the set of random variables
    from which the training sample D = {x1, ..., xn} is drawn
• B-BDR: pick i if
  i*(x) = arg max_i P_{X|Y,T}(x | i, D_i) P_Y(i)
  • the decision is conditioned on the entire training set

Bayesian BDR
• To compute the conditional probabilities, we use the
  marginalization equation
  P_{X|Y,T}(x | i, D_i) = ∫ P_{X|Θ,Y,T}(x | θ, i, D_i) P_{Θ|Y,T}(θ | i, D_i) dθ
• note 1: when the parameter value is known, x no longer
  depends on T, e.g.  X|θ ~ N(µ, σ²)
  • we can therefore simplify the equation above into
    P_{X|Y,T}(x | i, D_i) = ∫ P_{X|Θ,Y}(x | θ, i) P_{Θ|Y,T}(θ | i, D_i) dθ
• note 2: once again this can be done in two steps (per class)
  • i) find P_{Θ|T}(θ | D_i)
  • ii) compute P_{X|Y,T}(x | i, D_i) and plug it into the BDR
• no training information is lost

Bayesian BDR
• In summary, pick i if
  i*(x) = arg max_i P_{X|Y,T}(x | i, D_i) P_Y(i)
  where
  P_{X|Y,T}(x | i, D_i) = ∫ P_{X|Θ,Y}(x | θ, i) P_{Θ|Y,T}(θ | i, D_i) dθ
  • as before, the bottom equation is repeated for each class
  • hence, we can drop the dependence on the class
    and consider the more general problem of estimating
    P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ

Predictive Distribution
• The distribution
  P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ
  is known as the predictive distribution. It allows us
  to predict the value of x given ALL the information in the training set.
• Bayes vs. ML:
  • ML picks one model, Bayes averages over all models
  • ML is a special case of Bayes when we are very confident about the model
  • In other words ML ~ Bayes when
    • the prior is narrow
    • the sample space is quite large
    • intuition: given a lot of training data, there is little uncertainty
  • Bayes regularizes the ML estimate!

MAP approximation
• This sounds good, so why use ML at all?
• The main problem with Bayes is that the integral
  P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ
  can be quite nasty
• In practice one is frequently forced to use approximations
• One possibility is to do something similar to ML, i.e. pick only one model
  • this can be made to account for the prior by picking the model that has the
    largest posterior probability given the training data
    θ_MAP = arg max_θ P_{Θ|T}(θ | D)

MAP approximation
• This can usually be computed, since
  θ_MAP = arg max_θ P_{Θ|T}(θ | D)
        = arg max_θ P_{T|Θ}(D | θ) P_Θ(θ)
• In this case
  P_{X|T}(x | D) ≈ ∫ P_{X|Θ}(x | θ) δ(θ - θ_MAP) dθ = P_{X|Θ}(x | θ_MAP)
  which corresponds to approximating the posterior P_{Θ|T}(θ | D) by a delta
  function centered at its maximum θ_MAP.

MAP approximation
• The BDR becomes: pick i if
  i*(x) = arg max_i P_{X|Y}(x | i; θ_i^MAP) P_Y(i)
  where  θ_i^MAP = arg max_θ P_{T|Y,Θ}(D | i, θ) P_{Θ|Y}(θ | i)
• When compared to ML, this has the advantage of still
  accounting for the prior (although only approximately)

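To make the difference tangible, here is a small sketch (my own, with made-up numbers) for a 1-D Gaussian likelihood with unknown mean θ and a Gaussian prior: the predictive distribution integrates P_{X|Θ}(x|θ) over the posterior on a grid, while the MAP approximation plugs in the single value θ_MAP. With only two observations the predictive density is visibly wider than the plug-in one.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

sigma = 1.0                    # known observation noise (assumption)
theta0, sigma0 = 0.0, 2.0      # Gaussian prior on the unknown mean theta
D = np.array([1.8, 2.4])       # a tiny, made-up training set

# Posterior over theta on a grid: P(theta|D) is proportional to prior times likelihood.
theta = np.linspace(-6.0, 8.0, 2001)
dtheta = theta[1] - theta[0]
post = gauss_pdf(D[:, None], theta, sigma).prod(axis=0) * gauss_pdf(theta, theta0, sigma0)
post /= post.sum() * dtheta

theta_map = theta[np.argmax(post)]

# Predictive distribution P(x|D): integrate P(x|theta) against the posterior (grid sum),
# versus the MAP plug-in approximation P(x|theta_MAP).
x = np.linspace(-4.0, 8.0, 1201)
dx = x[1] - x[0]
predictive = (gauss_pdf(x[:, None], theta, sigma) * post).sum(axis=1) * dtheta
map_plugin = gauss_pdf(x, theta_map, sigma)

def std_of(density, grid, step):
    mean = (grid * density).sum() * step
    return np.sqrt(((grid - mean) ** 2 * density).sum() * step)

print("theta_MAP ~", round(theta_map, 3))
print("std of predictive density:", round(std_of(predictive, x, dx), 3))   # wider than sigma
print("std of MAP plug-in       :", round(std_of(map_plugin, x, dx), 3))   # roughly sigma
```
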
MAP vs ML
• ML-BDR: pick i if
  i*(x) = arg max_i P_{X|Y}(x | i; θ_i*) P_Y(i)
  where  θ_i* = arg max_θ P_{X|Y}(D | i; θ)
• Bayes MAP-BDR: pick i if
  i*(x) = arg max_i P_{X|Y}(x | i; θ_i^MAP) P_Y(i)
  where  θ_i^MAP = arg max_θ P_{T|Y,Θ}(D | i, θ) P_{Θ|Y}(θ | i)
• The difference is non-negligible only when the dataset is small
• There are better alternative approximations

Bayesian Learning
• Summary: apply the basic rules of probability to learning from data.
  • Data set: D = {x1, . . . , xn}
  • Models: m, m′ etc.    Model parameters: θ
  • Prior probabilities on models: P(m), P(m′) etc.
  • Prior probabilities on model parameters: e.g. P(θ|m)
  • Model of data given parameters: P(x|θ, m)
• If the data are independently and identically distributed then:
  P(D|θ, m) = Π_{i=1}^{n} P(x_i|θ, m)
• Posterior probability of model parameters:
  P(θ|D, m) = P(D|θ, m) P(θ|m) / P(D|m)
• Posterior probability of models:
  P(m|D) = P(m) P(D|m) / P(D)

Example
• Communications problem:
  Y → channel (atmosphere) → X → receiver
• two states:
  • Y = 0: transmit signal s = -µ0
  • Y = 1: transmit signal s = µ0
• noise model:
  X = s + ε,   ε ~ N(0, σ²)

Example
• the BDR is:
  • pick “0” if  x < (-µ0 + µ0)/2 = 0
• this is optimal and everything works wonderfully, but
  • one day we get a phone call: the receiver is generating a lot of errors!
  • there is a calibration mode:
    • the rover can send a test sequence
    • but it is expensive, it can only send a few bits
    • if everything is normal, the received means should be µ0 and -µ0

Example
• Action: ask the system to transmit a few 1s and measure X
  • compute the ML estimate of the mean of X:
    µ̂ = (1/n) Σ_i X_i
• Result: the estimate is different from µ0
• We need to combine two forms of information:
  • our prior is that  µ ~ N(µ0, σ0²)
  • our “data driven” estimate is that  X ~ N(µ̂, σ²)

Bayesian solution
• Gaussian likelihood (the observations):
  P_{T|µ}(D | µ) = G(D, µ, σ²),   with σ² known
• Gaussian prior (what we know):
  P_µ(µ) = G(µ, µ0, σ0²),   where µ0, σ0² are known hyper-parameters
• We need to compute the posterior distribution for µ:
  P_{µ|T}(µ | D) = P_{T|µ}(D | µ) P_µ(µ) / P_T(D)

Bayesian solution
• The posterior distribution is
  P_{µ|T}(µ | D) = G(µ, µ_n, σ_n²)
  with
  µ_n = ( σ0² Σ_i x_i + µ0 σ² ) / ( σ² + n σ0² )
  σ_n² = ( σ0² σ² ) / ( σ² + n σ0² )

Bayesian solution
• For free, Bayes also gives us
  • the weighting constants:
    µ_n = [ n σ0² / (σ² + n σ0²) ] µ̂_ML + [ σ² / (σ² + n σ0²) ] µ0
  • a measure of the uncertainty of our estimate:
    1/σ_n² = n/σ² + 1/σ0²
    • note that 1/σ² is a measure of precision
    • this should be read as  P_Bayes = P_ML + P_prior
    • Bayesian precision is greater than both that of ML and that of the prior
  • this is intuitive

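A direct transcription of these formulas into NumPy (a sketch; the values of µ0, σ0, σ and the simulated calibration bits are made up): it computes µ_n and σ_n² and shows how the posterior mean interpolates between the prior mean and the ML estimate, with the weight of the data growing with n.

```python
import numpy as np

rng = np.random.default_rng(5)

mu0, sigma0 = 1.0, 0.5     # prior: what we expect the received mean to be
sigma = 0.3                # known noise standard deviation
mu_true = 0.8              # the (unknown) actual received mean in this simulation

for n in (1, 4, 16, 64):
    x = rng.normal(mu_true, sigma, size=n)   # n calibration bits

    mu_ml = x.mean()

    # Posterior parameters for the Gaussian prior / Gaussian likelihood pair.
    mu_n = (sigma0**2 * x.sum() + mu0 * sigma**2) / (sigma**2 + n * sigma0**2)
    var_n = (sigma0**2 * sigma**2) / (sigma**2 + n * sigma0**2)

    # Equivalent form: mu_n = alpha_n * mu_ML + (1 - alpha_n) * mu0.
    alpha_n = n * sigma0**2 / (sigma**2 + n * sigma0**2)
    assert np.isclose(mu_n, alpha_n * mu_ml + (1 - alpha_n) * mu0)

    print(f"n = {n:2d}: mu_ML = {mu_ml:.3f}, mu_n = {mu_n:.3f}, "
          f"sigma_n = {np.sqrt(var_n):.3f}, weight on the data = {alpha_n:.2f}")
```
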
Observations
• 1) note that the precision increases with n, and the variance goes to zero:
  1/σ_n² = 1/σ0² + n/σ²
  • we are guaranteed that in the limit of infinite data we have
    convergence to a single estimate
  • this always happens for Bayesian solutions
• 2) for large n the likelihood term dominates the prior term:
  µ_n = α_n µ̂ + (1 - α_n) µ0,   α_n ∈ [0,1],   α_n → 1 as n → ∞,   α_n → 0 as n → 0
  • as n → ∞ the solution is equivalent to that of ML
  • for small n, the prior dominates

Observations
• 3) for a given n,
  α_n = n σ0² / (σ² + n σ0²)
  • if σ0² >> σ², i.e. we really don't know what µ is a priori,
    then µ_n = µ_ML
  • on the other hand, if σ0² << σ², i.e. we are very certain a priori,
    then µ_n = µ0
• In summary:
  • the Bayesian estimate combines the prior beliefs with the evidence
    provided by the data
    P_{µ|T}(µ | D) ∝ Π_i P_{X|µ}(x_i | µ) P_µ(µ)
  • in a very intuitive manner

Conjugate priors
• Note that
  • the prior  P_µ(µ) = G(µ, µ0, σ0²)  is Gaussian
  • the posterior  P_{µ|T}(µ | D) = G(µ, µ_n, σ_n²)  is also Gaussian
• Whenever this is the case (the posterior is in the same family as the
  prior) we say that
  • P_µ(µ) is a conjugate prior for the likelihood P_{X|µ}(x | µ)
  • the posterior P_{µ|T}(µ | D) is the reproducing density
• HW: a number of likelihoods have conjugate priors:
  Likelihood             Conjugate prior
  Bernoulli              Beta
  Poisson                Gamma
  Exponential            Gamma
  Normal (known σ²)      Normal

Group Homework 2
Histogram Problem
• Imagine a random variable X such that  P_X(k) = π_k,  k ∈ {1, . . . , N}
• Suppose we draw n independent observations from X and form a random
  vector C = (C1, · · · , CN)^T where Ck is the number of times the
  observed value is k
• C is then a histogram and has a multinomial distribution:
  P_{C1,...,CN}(c1, . . . , cN) = [ n! / Π_{k=1}^{N} c_k! ] Π_{j=1}^{N} π_j^{c_j}
• Note that π = (π1, . . . , πN) are probabilities and thus:  π_i ≥ 0,  Σ_i π_i = 1
1. Derive the ML estimate for the parameters π_k, k ∈ {1, ..., N}
   • hint: if you know about Lagrange multipliers, use them! Otherwise, keep in
     mind that minimizing a function f(a, b) subject to the constraint a + b = 1 is
     equivalent to minimizing f(a, 1 - a).

Group Homework 2
Histogram Problem
2. Derive the MAP solution using Dirichlet priors:
   • One possible prior model over the π_k is the Dirichlet Distribution:
     P_{Π1,...,ΠN}(π1, . . . , πN) = [ Γ(Σ_{j=1}^{N} u_j) / Π_{j=1}^{N} Γ(u_j) ] Π_{j=1}^{N} π_j^{u_j - 1}
     where u is the set of hyper-parameters (prior parameters to solve) and
     Γ(x) = ∫_0^∞ e^{-t} t^{x-1} dt
     is the Gamma function.
   • You should show that the posterior is equal to:
     P_{Π|C}(π | c) = [ Γ(Σ_{j=1}^{N} (c_j + u_j)) / Π_{k=1}^{N} Γ(c_k + u_k) ] Π_{j=1}^{N} π_j^{c_j + u_j - 1}
3. Compare the MAP estimator with that of ML in part (1). What is the role of
   this prior compared to ML?

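Separately from the homework, the first row of the conjugate-priors table above (Bernoulli likelihood, Beta prior) can be checked numerically in a few lines (my own sketch, with an assumed Beta(2, 2) prior and simulated coin flips): the grid posterior computed from Bayes' rule matches the closed-form reproducing density Beta(α + heads, β + tails).

```python
import numpy as np
from math import lgamma

def beta_pdf(p, a, b):
    """Beta(a, b) density, using log-gamma for numerical stability."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return np.exp(log_norm + (a - 1) * np.log(p) + (b - 1) * np.log(1 - p))

rng = np.random.default_rng(6)
a0, b0 = 2.0, 2.0                      # Beta prior hyper-parameters (assumed)
x = rng.integers(0, 2, size=30)        # simulated Bernoulli observations
heads, tails = x.sum(), len(x) - x.sum()

# Grid posterior from Bayes' rule: prior times likelihood, then normalise.
p = np.linspace(1e-4, 1 - 1e-4, 5001)
dp = p[1] - p[0]
post = beta_pdf(p, a0, b0) * p**heads * (1 - p)**tails
post /= post.sum() * dp

# Conjugacy: the posterior should be Beta(a0 + heads, b0 + tails).
closed_form = beta_pdf(p, a0 + heads, b0 + tails)
print("max |grid posterior - Beta(a0+h, b0+t)| =", np.abs(post - closed_form).max())
```
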