Machine Learning
Srihari
Discrete Probability Distributions
Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e., p(x=1|µ) = µ
• Therefore p(x=0|µ) = 1−µ
• The probability distribution has the form Bern(x|µ) = µ^x (1−µ)^{1−x}
• Mean: E[x] = µ
• Variance: var[x] = µ(1−µ)
• The likelihood of N observations independently drawn from p(x|µ) is
  p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1−µ)^{1−x_n}
• The log-likelihood is
  ln p(D|µ) = Σ_{n=1}^{N} ln p(x_n|µ) = Σ_{n=1}^{N} { x_n ln µ + (1−x_n) ln(1−µ) }
• The maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) w.r.t. µ to zero, is
  µ_ML = (1/N) Σ_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N
(Jacob Bernoulli, 1654-1705)
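As a quick check, the estimator µ_ML = m/N can be sketched in a few lines of Python (the data set and function name here are illustrative, not from the slides):

```python
# Minimal sketch of the Bernoulli maximum-likelihood estimate mu_ML = m/N.
# Data and helper names are illustrative.

def bernoulli_mle(xs):
    """mu_ML = (1/N) * sum(x_n): the fraction of observations with x = 1."""
    return sum(xs) / len(xs)

data = [1, 0, 1, 1, 0, 1, 0, 1]   # N = 8 flips, m = 5 heads
mu_ml = bernoulli_mle(data)        # m / N = 5 / 8 = 0.625
```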
Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1
• Proportional to Bern(x|µ); obtained by adding up all the ways of obtaining m heads:
  Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}
• Binomial coefficient: (N choose m) = N! / (m!(N−m)!)
• Mean and variance:
  E[m] = Σ_{m=0}^{N} m Bin(m|N,µ) = Nµ
  var[m] = Nµ(1−µ)
[Figure: histogram of the Binomial distribution for N=10 and µ=0.25]
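A minimal sketch of evaluating Bin(m|N,µ) with only the Python standard library (function names are illustrative); it also checks numerically that the probabilities sum to 1 and that the mean is Nµ:

```python
# Illustrative sketch: the binomial pmf Bin(m | N, mu) from first principles.
from math import comb

def binom_pmf(m, N, mu):
    """(N choose m) * mu^m * (1 - mu)^(N - m)"""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]
total = sum(pmf)                                      # should be 1
mean = sum(m * p for m, p in enumerate(pmf))          # E[m] = N*mu = 2.5
```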
Beta Distribution
• Beta(µ|a,b) = Γ(a+b) / (Γ(a)Γ(b)) µ^{a−1} (1−µ)^{b−1}
• where the Gamma function is defined as Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
  E[µ] = a / (a+b)
  var[µ] = ab / ((a+b)^2 (a+b+1))
[Figure: the Beta distribution as a function of µ for hyperparameter settings (a,b) = (0.1,0.1), (1,1), (2,3), (8,4)]
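The density and its mean can be sketched with the standard library alone (helper names are illustrative); the mean E[µ] = a/(a+b) is verified here by midpoint-rule integration rather than in closed form:

```python
# Illustrative sketch: the Beta density, with a numerical check of E[mu] = a/(a+b).
from math import gamma

def beta_pdf(mu, a, b):
    """Gamma(a+b)/(Gamma(a)*Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)"""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

a, b, n = 2.0, 3.0, 10000
# Midpoint rule for the integral of mu * Beta(mu|a,b) over [0,1]
mean = sum(((i + 0.5) / n) * beta_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
# mean is close to a/(a+b) = 0.4
```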
Bayesian Inference with Beta
• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1−µ)^{1−x}
• If the prior distribution of µ is chosen proportional to powers of µ and (1−µ), the posterior has the same functional form as the prior
  – this property is called conjugacy
• The Beta distribution has a form suitable for the prior p(µ)
Bayesian Inference with Beta (continued)
• The posterior, obtained by multiplying the Beta prior by the binomial likelihood, is
  p(µ|m,l,a,b) ∝ µ^{m+a−1} (1−µ)^{l+b−1}
  – where m is the number of heads and l = N−m is the number of tails
• It is another Beta distribution:
  p(µ|m,l,a,b) = Γ(m+a+l+b) / (Γ(m+a)Γ(l+b)) µ^{m+a−1} (1−µ)^{l+b−1}
  – effectively increases the value of a by m and of b by l
  – as the number of observations increases, the distribution becomes more peaked
[Figure: one step of sequential updating. Prior p(µ) with a=2, b=2; likelihood p(x=1|µ) = µ^1(1−µ)^0 for a single observation N=m=1 with x=1; posterior p(µ|x=1) with a=3, b=2]
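The conjugate update is just parameter arithmetic, which can be sketched as follows (function name is illustrative); the example reproduces the slide's single-observation step:

```python
# Illustrative sketch of the conjugate update: a Beta(a, b) prior combined with
# m heads and l tails yields a Beta(a + m, b + l) posterior.

def beta_binomial_update(a, b, m, l):
    return a + m, b + l

# The slides' example: prior a=2, b=2; one observation x=1 (m=1, l=0)
a_post, b_post = beta_binomial_update(2, 2, 1, 0)   # posterior a=3, b=2
```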
Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D
  – from the sum and product rules:
  p(x=1|D) = ∫_0^1 p(x=1, µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
  p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• The maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – on average, uncertainty (variance) decreases as more data are observed
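The predictive probability and its large-N behaviour can be sketched directly (names and counts are illustrative):

```python
# Illustrative sketch: p(x=1|D) = (m+a)/(m+a+l+b), approaching the MLE m/N
# as the number of observations grows.

def predictive_prob(m, l, a, b):
    return (m + a) / (m + a + l + b)

p_small = predictive_prob(m=3, l=1, a=2, b=2)        # 5/8, vs MLE 3/4
p_large = predictive_prob(m=3000, l=1000, a=2, b=2)  # close to MLE 3/4
```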
Summary of Binary Distributions
• The distribution of a single binary variable is represented by the Bernoulli
• The Binomial is related to the Bernoulli
  – it expresses the distribution of the number of occurrences of 1 (or of 0) in N trials
• The Beta distribution is a conjugate prior for the Bernoulli
  – both have the same functional form
Sample Matlab Code: Probability Distributions
• Binomial distribution:
  – Probability density function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random number generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers drawn from a binomial distribution with parameters N and P.
• Beta distribution:
  – Probability density function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random number generator: R = betarnd(A,B) returns a matrix of random numbers drawn from the beta distribution with parameters A and B.
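For readers without Matlab, rough Python analogues can be sketched with the standard library alone; `random.betavariate` is a real stdlib sampler, while the binomial pdf below is hand-rolled and named after its Matlab counterpart purely for illustration:

```python
# Illustrative stdlib-only analogues of the Matlab calls above.
import random
from math import comb

def binopdf(x, n, p):
    """Analogue of Matlab's binopdf(X,N,P) for a single value x."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

random.seed(0)
r_beta = random.betavariate(2.0, 3.0)   # one draw, analogous to betarnd(A,B)
```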
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of the Binomial
• Binomial: tossing a coin
  – expresses the probability of a number of successes in N trials
  – e.g., the probability of 3 rainy days in 10 days
  [Figure: histogram of the Binomial for N=10 and µ=0.25]
• Multinomial: throwing a die
  – the probability of a given frequency for each value
  – e.g., the probability of 3 specific letters in a string of length N
• Probability calculator: http://stattrek.com/Tables/Multinomial.aspx
Generalization of the Bernoulli
• Bernoulli: x is 0 or 1, with Bern(x|µ) = µ^x (1−µ)^{1−x}
• Now consider a discrete variable that takes one of K values (instead of 2)
• Represent it using a 1-of-K scheme
  – represent x as a K-dimensional vector
  – e.g., for K=6, if x takes the value 3 we represent it as x = (0,0,1,0,0,0)^T
  – such vectors satisfy Σ_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, the distribution of x is given by
  p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – the generalized Bernoulli
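The 1-of-K encoding and the generalized Bernoulli probability can be sketched as follows (function names are illustrative); the example reproduces the slide's K=6 vector:

```python
# Illustrative sketch: 1-of-K encoding and p(x|mu) = prod_k mu_k^{x_k}.

def one_of_k(value, K):
    """Encode a value in {1..K} as a K-dimensional binary indicator vector."""
    return [1 if k == value else 0 for k in range(1, K + 1)]

def gen_bernoulli(x, mu):
    """The product picks out mu_k for the single active component."""
    p = 1.0
    for xk, mk in zip(x, mu):
        p *= mk ** xk
    return p

x = one_of_k(3, 6)                       # (0,0,1,0,0,0), as on the slide
mu = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]      # must sum to 1
```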
MLE of Generalized Bernoulli Parameters
• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as (x_{n1},..,x_{nK})
• The likelihood function has the form
  p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_{nk}} = ∏_{k=1}^{K} µ_k^{Σ_n x_{nk}} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = Σ_n x_{nk} is the number of observations with x_k=1
• The maximum likelihood solution, obtained by setting the derivative of the log-likelihood w.r.t. µ_k to zero (subject to the constraint Σ_k µ_k = 1), is
  µ_k^{ML} = m_k / N
  which is the fraction of the N observations for which x_k=1
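The solution µ_k^{ML} = m_k/N amounts to counting, as this sketch shows (data and function name are illustrative):

```python
# Illustrative sketch: mu_k^ML = m_k / N from 1-of-K encoded observations.

def categorical_mle(X):
    N = len(X)
    K = len(X[0])
    m = [sum(x[k] for x in X) for k in range(K)]   # counts m_k
    return [mk / N for mk in m]

X = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]   # N=4 observations, K=3
mu_ml = categorical_mle(X)                          # [0.25, 0.5, 0.25]
```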
Generalized Binomial Distribution
• The multinomial distribution (over a K-state variable) is
  Mult(m_1, m_2,.., m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}, with Σ_k µ_k = 1
  – where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,.., m_K
• It is given by
  (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
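The normalization coefficient and the full pmf can be sketched with the standard library (function names are illustrative):

```python
# Illustrative sketch: the multinomial coefficient and Mult(m_1..m_K | mu, N).
from math import factorial

def multinomial_coeff(ms):
    """N! / (m_1! m_2! .. m_K!), with N = sum of the m_k."""
    c = factorial(sum(ms))
    for m in ms:
        c //= factorial(m)
    return c

def multinomial_pmf(ms, mu):
    p = float(multinomial_coeff(ms))
    for m, mk in zip(ms, mu):
        p *= mk ** m
    return p

# Partitioning N=4 trials into counts (2,1,1) over K=3 outcomes:
coeff = multinomial_coeff([2, 1, 1])    # 4!/(2!1!1!) = 12
```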
Dirichlet Distribution
• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
  p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k−1}, where 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1
• Normalized form of the Dirichlet distribution:
  Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{K} α_k
(Lejeune Dirichlet, 1805-1859)
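Evaluating the normalized density needs only the Gamma function (function name is illustrative); with all α_k = 1 the Dirichlet is uniform over the simplex, so for K=3 the density is Γ(3) = 2 everywhere, which gives a handy sanity check:

```python
# Illustrative sketch: evaluating Dir(mu | alpha) on a point of the simplex.
from math import gamma

def dirichlet_pdf(mu, alpha):
    """Gamma(a0) / prod_k Gamma(a_k) * prod_k mu_k^(a_k - 1), a0 = sum a_k"""
    norm = gamma(sum(alpha))
    for a in alpha:
        norm /= gamma(a)
    dens = norm
    for m, a in zip(mu, alpha):
        dens *= m ** (a - 1)
    return dens

d = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])   # uniform case: 2.0
```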
Dirichlet over 3 Variables
• Due to the summation constraint Σ_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
• For K=3:
  Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_3)) ∏_{k=1}^{3} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{3} α_k
[Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k=0.1, α_k=1, and α_k=10]
Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood:
  p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
• This has the form of a Dirichlet distribution:
  p(µ|D,α) = Dir(µ|α+m) = Γ(α_0+N) / (Γ(α_1+m_1)..Γ(α_K+m_K)) ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
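As with the Beta-Bernoulli case, the conjugate update is plain addition of counts to the prior parameters (function name and counts are illustrative):

```python
# Illustrative sketch of the conjugate update: a Dir(alpha) prior plus the
# observed counts m yields a Dir(alpha + m) posterior.

def dirichlet_update(alpha, m):
    return [a + mk for a, mk in zip(alpha, m)]

alpha_post = dirichlet_update([1, 1, 1], [5, 2, 3])   # posterior Dir(6, 3, 4)
```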
Summary of Discrete Distributions
• Bernoulli (2 states): Bern(x|µ) = µ^x (1−µ)^{1−x}
  – Binomial: Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}, where (N choose m) = N! / (m!(N−m)!)
• Generalized Bernoulli (K states): p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – Multinomial: Mult(m_1, m_2,.., m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – of the Binomial: Beta(µ|a,b) = Γ(a+b) / (Γ(a)Γ(b)) µ^{a−1} (1−µ)^{b−1}
  – of the Multinomial: Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{K} α_k
Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Wishart, Student's-t, Gamma, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships
• Discrete, binary:
  – Bernoulli: a single binary variable
  – Binomial: N samples of a Bernoulli; the Bernoulli is the N=1 case
  – Beta (conjugate prior): a continuous variable in [0,1]
• Discrete, multi-valued:
  – Multinomial: one of K values, i.e., a K-dimensional binary vector; the Binomial is the K=2 case
  – Dirichlet (conjugate prior): K random variables in [0,1]
• Continuous:
  – Gaussian: the large-N limit of the Multinomial
  – Student's-t: a generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
  – Gamma: conjugate prior of the univariate Gaussian precision
  – Exponential: a special case of the Gamma
  – Wishart: conjugate prior of the multivariate Gaussian precision matrix
  – Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
  – Uniform
• Angular: Von Mises