Machine Learning
Discrete Probability Distributions
Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution

•  Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
•  Probability of x=1 is denoted by parameter µ, i.e., p(x=1|µ) = µ
•  Therefore p(x=0|µ) = 1−µ
•  Probability distribution has the form Bern(x|µ) = µ^x (1−µ)^{1−x}
•  Mean is shown to be E[x] = µ
•  Variance is var[x] = µ(1−µ)
•  Likelihood of N observations independently drawn from p(x|µ) is
   p(D|µ) = ∏_{n=1}^N p(x_n|µ) = ∏_{n=1}^N µ^{x_n} (1−µ)^{1−x_n}
•  Log-likelihood is
   ln p(D|µ) = ∑_{n=1}^N ln p(x_n|µ) = ∑_{n=1}^N { x_n ln µ + (1−x_n) ln(1−µ) }
•  Maximum likelihood estimator
   –  Obtained by setting the derivative of ln p(D|µ) with respect to µ equal to zero:
      µ_ML = (1/N) ∑_{n=1}^N x_n
•  If the number of observations with x=1 is m, then µ_ML = m/N

[Portrait: Jacob Bernoulli, 1654-1705]
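A minimal Matlab sketch of this estimator (the true µ and sample size are illustrative values, not from the slides): draw N Bernoulli observations and recover µ_ML as the sample mean.

% Bernoulli MLE: mu_ML is the fraction of ones in the data
N  = 1000;                    % number of observations (illustrative)
mu = 0.3;                     % assumed true parameter for the demo
x  = rand(N,1) < mu;          % N Bernoulli(mu) draws as a 0/1 vector
mu_ML = mean(x);              % equals sum(x)/N, i.e. m/N
fprintf('true mu = %.2f, mu_ML = %.3f\n', mu, mu_ML);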
Binomial Distribution

•  Related to the Bernoulli distribution
•  Expresses the distribution of m
   –  The number of observations for which x=1
•  It is proportional to Bern(x|µ)
•  Add up all ways of obtaining m heads:
   Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}
•  Binomial coefficients:
   (N choose m) = N! / (m! (N−m)!)
•  Mean and variance are
   E[m] = ∑_{m=0}^N m Bin(m|N,µ) = Nµ
   var[m] = Nµ(1−µ)

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
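A short Matlab sketch reproducing the figure's histogram for N=10, µ=0.25, assuming the Statistics and Machine Learning Toolbox (binopdf) is available:

% Binomial pmf for N=10, mu=0.25
N  = 10;  mu = 0.25;
m  = 0:N;                      % all possible counts of heads
y  = binopdf(m, N, mu);        % Bin(m|N,mu) at each m
bar(m, y);                     % histogram-style plot of the pmf
xlabel('m'); ylabel('Bin(m|N,\mu)');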
Beta Distribution

•  Beta distribution
   Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
•  Where the Gamma function is defined as
   Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du
•  a and b are hyperparameters that control the distribution of parameter µ
•  Mean and variance
   E[µ] = a / (a+b)
   var[µ] = ab / ((a+b)^2 (a+b+1))

[Figure: Beta distribution as a function of µ for hyperparameter settings
a=0.1, b=0.1; a=1, b=1; a=2, b=3; a=8, b=4]
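A Matlab sketch plotting the four panels of the figure, again assuming the Statistics and Machine Learning Toolbox (betapdf):

% Beta densities for the four (a,b) settings in the figure
mu = linspace(0.001, 0.999, 200);   % avoid endpoint singularities for a,b < 1
ab = [0.1 0.1; 1 1; 2 3; 8 4];      % hyperparameter settings
for i = 1:4
    subplot(2,2,i);
    plot(mu, betapdf(mu, ab(i,1), ab(i,2)));
    title(sprintf('a=%.1f, b=%.1f', ab(i,1), ab(i,2)));
    xlabel('\mu');
end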
Bayesian Inference with Beta

•  MLE of µ in the Bernoulli is the fraction of observations with x=1
   –  Severely over-fitted for small data sets
•  Likelihood function takes products of factors of the form µ^x (1−µ)^{1−x}
•  If the prior distribution of µ is chosen to be proportional to powers of µ and 1−µ, the posterior will have the same functional form as the prior
   –  Called conjugacy
•  Beta has a form suitable for the prior distribution p(µ)
Bayesian Inference with Beta

•  Posterior obtained by multiplying the beta prior with the binomial likelihood:
   p(µ|m,l,a,b) ∝ µ^{m+a−1} (1−µ)^{l+b−1}
   –  where l = N−m is the number of tails
   –  m is the number of heads
•  It is another beta distribution:
   p(µ|m,l,a,b) = [Γ(m+a+l+b) / (Γ(m+a)Γ(l+b))] µ^{m+a−1} (1−µ)^{l+b−1}
   –  Effectively increases the value of a by m and b by l
   –  As the number of observations increases, the distribution becomes more peaked

[Figure: illustration of one step in the process: prior p(µ) with a=2, b=2;
likelihood for N=m=1 with x=1, p(x=1|µ) = µ^1(1−µ)^0; posterior p(µ|x=1) with a=3, b=2]
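A minimal Matlab sketch of this single update step, with the parameter values taken from the figure (betapdf as above):

% One Bayesian update: Beta(2,2) prior, observe a single head (x=1)
a = 2; b = 2;                       % prior hyperparameters
m = 1; l = 0;                       % one head, zero tails (N=1)
mu = linspace(0.001, 0.999, 200);
plot(mu, betapdf(mu, a, b), '--', ...     % prior p(mu)
     mu, betapdf(mu, a+m, b+l), '-');     % posterior Beta(3,2)
legend('prior p(\mu)', 'posterior p(\mu|x=1)');
xlabel('\mu');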
Predicting Next Trial Outcome

•  Need the predictive distribution of x given observed D
   –  From the sum and product rules
   p(x=1|D) = ∫_0^1 p(x=1, µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
•  Expected value of the posterior distribution can be shown to be
   p(x=1|D) = (m+a) / (m+a+l+b)
   –  Which is the fraction of observations (both fictitious and real) that correspond to x=1
•  Maximum likelihood and Bayesian results agree in the limit of infinite observations
   –  On average, uncertainty (variance) decreases with observed data
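The "can be shown" step is just the mean of the beta posterior; a short derivation using the standard beta integral and Γ(x+1) = xΓ(x):

E[µ|D] = \int_0^1 \mu \,\mathrm{Beta}(\mu \mid m+a,\ l+b)\, d\mu
       = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)} \int_0^1 \mu^{m+a} (1-\mu)^{l+b-1}\, d\mu
       = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)} \cdot \frac{\Gamma(m+a+1)\,\Gamma(l+b)}{\Gamma(m+a+l+b+1)}
       = \frac{m+a}{m+a+l+b}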
Summary of Binary Distributions

•  A single binary variable's distribution is represented by the Bernoulli
•  Binomial is related to the Bernoulli
   –  Expresses the distribution of the number of observations with x=1 in N trials
•  Beta distribution is a conjugate prior for the Bernoulli
   –  Both have the same functional form
Sample Matlab Code: Probability Distributions

•  Binomial distribution:
   –  Probability density function: Y = binopdf(X,N,P)
      returns the binomial probability density function with parameters N and P at the values in X.
   –  Random number generator: R = binornd(N,P,MM,NN)
      returns an MM-by-NN matrix of random numbers drawn from a binomial distribution with parameters N and P.
•  Beta distribution:
   –  Probability density function: Y = betapdf(X,A,B)
      returns the beta probability density function with parameters A and B at the values in X.
   –  Random number generator: R = betarnd(A,B)
      returns a matrix of random numbers drawn from the beta distribution with parameters A and B.
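A quick usage sketch combining these calls; the numeric values are illustrative, not from the slides:

% Evaluate and sample the two distributions described above
y1 = binopdf(3, 10, 0.25);       % P(m=3) for Bin(m|N=10, mu=0.25)
R1 = binornd(10, 0.25, 5, 5);    % 5-by-5 matrix of Bin(10, 0.25) draws
y2 = betapdf(0.5, 2, 3);         % Beta density at mu=0.5 with a=2, b=3
R2 = betarnd(2, 3, 5, 5);        % 5-by-5 matrix of Beta(2,3) draws
disp([y1 y2]);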
Multinomial Variables
Generalized Bernoulli and
Dirichlet
Generalization of Binomial

•  Binomial
   –  Tossing a coin
   –  Expresses the probability of the number of successes in N trials
      •  e.g., probability of 3 rainy days in 10 days
•  Multinomial
   –  Throwing a die
   –  Probability of a given frequency for each value
      •  e.g., probability of 3 specific letters in a string of N
•  Probability calculator
   –  http://stattrek.com/Tables/Multinomial.aspx

[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
Generalization of Bernoulli

•  Bernoulli distribution: x is 0 or 1
   Bern(x|µ) = µ^x (1−µ)^{1−x}
•  Discrete variable that takes one of K values (instead of 2)
•  Represent as a 1-of-K scheme
   –  Represent x as a K-dimensional vector
   –  If x=3 then we represent it as x = (0,0,1,0,0,0)^T
   –  Such vectors satisfy ∑_{k=1}^K x_k = 1
•  If the probability of x_k=1 is denoted µ_k, then the distribution of x is given by
   p(x|µ) = ∏_{k=1}^K µ_k^{x_k}   where µ = (µ_1,.., µ_K)^T
   –  Generalized Bernoulli
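A Matlab sketch of the 1-of-K representation and a single generalized-Bernoulli draw; mnrnd (Statistics and Machine Learning Toolbox) with one trial returns exactly such a one-hot vector. The probability vector is an illustrative choice:

% 1-of-K encoding and one draw from the generalized Bernoulli
mu = [0.1 0.2 0.4 0.1 0.1 0.1];   % K=6 probabilities, sum to 1
x  = mnrnd(1, mu);                 % one trial: K-dim 0/1 vector with a single 1
p  = prod(mu .^ x);                % p(x|mu) = prod_k mu_k^{x_k}
fprintf('drew category %d with probability %.2f\n', find(x), p);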
MLE of Generalized Bernoulli Parameters

•  Data set D of N independent observations x_1,.., x_N
   –  where the nth observation is written as [x_n1,.., x_nK]
•  Likelihood function has the form
   p(D|µ) = ∏_{n=1}^N ∏_{k=1}^K µ_k^{x_nk} = ∏_{k=1}^K µ_k^{(∑_n x_nk)} = ∏_{k=1}^K µ_k^{m_k}
   –  where m_k = ∑_n x_nk is the number of observations with x_k=1
•  Maximum likelihood solution (obtained by maximizing the log-likelihood, with a Lagrange multiplier enforcing the constraint ∑_k µ_k = 1) is
   µ_k^ML = m_k / N
   which is the fraction of the N observations for which x_k=1
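The constrained maximization behind that solution, sketched as the standard Lagrange-multiplier step:

\frac{\partial}{\partial \mu_k}\Big( \sum_k m_k \ln \mu_k + \lambda \big( \textstyle\sum_k \mu_k - 1 \big) \Big)
  = \frac{m_k}{\mu_k} + \lambda = 0
  \;\Rightarrow\; \mu_k = -m_k/\lambda ;
\qquad \sum_k \mu_k = 1 \;\Rightarrow\; \lambda = -N
  \;\Rightarrow\; \mu_k^{ML} = m_k/N .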
Generalized Binomial Distribution

•  Multinomial distribution (with a K-state variable)
   Mult(m_1, m_2,.., m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^K µ_k^{m_k}   with ∑_k µ_k = 1
   –  Where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,.., m_K
•  Given by
   (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
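A Matlab check of one multinomial probability against the formula above, assuming the Statistics and Machine Learning Toolbox (mnpdf); the counts and probabilities are illustrative:

% P of counts m = [3 5 2] in N=10 trials with mu = [0.25 0.5 0.25]
m  = [3 5 2];  mu = [0.25 0.5 0.25];
p1 = mnpdf(m, mu);                                    % toolbox value
p2 = factorial(10)/prod(factorial(m)) * prod(mu.^m);  % formula above
fprintf('mnpdf: %.5f, formula: %.5f\n', p1, p2);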
Dirichlet Distribution

•  Family of prior distributions for the parameters µ_k of the multinomial distribution
•  By inspection of the multinomial, the form of the conjugate prior is
   p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k−1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
•  Normalized form of the Dirichlet distribution
   Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^K µ_k^{α_k−1}   where α_0 = ∑_{k=1}^K α_k

[Portrait: Lejeune Dirichlet, 1805-1859]
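The toolbox has no built-in Dirichlet sampler, but the standard gamma-normalization trick gives one; a minimal sketch with illustrative concentration parameters (gamrnd is from the Statistics and Machine Learning Toolbox):

% Sample from Dir(mu|alpha) by normalizing independent Gamma draws
alpha = [2 3 4];                 % illustrative concentration parameters
g  = gamrnd(alpha, 1);           % g_k ~ Gamma(alpha_k, 1), independent
mu = g / sum(g);                 % normalized draw lies on the simplex
disp(mu);                        % nonnegative components, sum to 1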
Dirichlet over 3 Variables

•  Due to the summation constraint ∑_k µ_k = 1
   –  The distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
   –  For K=3:
      Dir(µ|α) = [Γ(α_0) / (Γ(α_1)Γ(α_2)Γ(α_3))] ∏_{k=1}^3 µ_k^{α_k−1}   where α_0 = ∑_{k=1}^3 α_k

[Figure: plots of the Dirichlet distribution over the simplex for parameter settings
αk=0.1, αk=1, αk=10]
Dirichlet Posterior Distribution

•  Multiplying the prior by the likelihood:
   p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^K µ_k^{α_k+m_k−1}
•  Which has the form of the Dirichlet distribution:
   p(µ|D,α) = Dir(µ|α+m)
            = [Γ(α_0+N) / (Γ(α_1+m_1)..Γ(α_K+m_K))] ∏_{k=1}^K µ_k^{α_k+m_k−1}
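A minimal Matlab sketch of this conjugate update; the prior and counts are illustrative:

% Dirichlet-multinomial conjugate update: posterior = Dir(alpha + m)
alpha = [1 1 1];                          % prior concentration (uniform prior)
m     = [3 5 2];                          % observed counts, N = 10
alpha_post = alpha + m;                   % posterior parameters alpha_k + m_k
mu_mean = alpha_post / sum(alpha_post);   % posterior mean of mu
disp(mu_mean);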
Summary of Discrete Distributions

•  Bernoulli (2 states): Bern(x|µ) = µ^x (1−µ)^{1−x}
   –  Binomial:
      Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}   where (N choose m) = N! / (m! (N−m)!)
•  Generalized Bernoulli (K states):
   p(x|µ) = ∏_{k=1}^K µ_k^{x_k}   where µ = (µ_1,.., µ_K)^T
   –  Multinomial:
      Mult(m_1 m_2 .. m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^K µ_k^{m_k}
•  Conjugate priors:
   –  For the binomial, the beta:
      Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a−1} (1−µ)^{b−1}
   –  For the multinomial, the Dirichlet:
      Dir(µ|α) = [Γ(α_0) / (Γ(α_1)...Γ(α_K))] ∏_{k=1}^K µ_k^{α_k−1}   where α_0 = ∑_{k=1}^K α_k
Distributions: Landscape

[Diagram: landscape of distributions.
Discrete-binary: Bernoulli, Binomial, Beta.
Discrete-multivalued: Multinomial, Dirichlet.
Continuous: Gaussian, Student's-t, Gamma, Wishart, Exponential, Uniform.
Angular: Von Mises.]
Distributions: Relationships

[Diagram: relationships among distributions]
•  Bernoulli: single binary variable; the N=1 case of the binomial
•  Binomial: N samples of a Bernoulli; its conjugate prior is the beta
•  Beta: continuous variable between [0,1]
•  Multinomial: one of K values = K-dimensional binary vector; K=2 gives the binary case; approaches a Gaussian for large N; its conjugate prior is the Dirichlet
•  Dirichlet: K random variables between [0,1]
•  Student's-t: generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
•  Gamma: conjugate prior of the univariate Gaussian precision
•  Exponential: special case of the Gamma
•  Wishart: conjugate prior of the multivariate Gaussian precision matrix
•  Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
•  Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
•  Von Mises: angular
•  Uniform