Machine Learning
Srihari
Discrete Probability Distributions
Sargur N. Srihari
Binary Variables
Bernoulli, Binomial and Beta
Bernoulli Distribution
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e., p(x=1|µ) = µ
• Therefore p(x=0|µ) = 1−µ
• The probability distribution has the form Bern(x|µ) = µ^x (1−µ)^{1−x}
• Mean: E[x] = µ
• Variance: var[x] = µ(1−µ)
• The likelihood of N observations independently drawn from p(x|µ) is
  p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1−µ)^{1−x_n}
• The log-likelihood is
  ln p(D|µ) = Σ_{n=1}^{N} ln p(x_n|µ) = Σ_{n=1}^{N} { x_n ln µ + (1−x_n) ln(1−µ) }
• The maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) w.r.t. µ to zero, is
  µ_ML = (1/N) Σ_{n=1}^{N} x_n
• If the number of observations with x=1 is m, then µ_ML = m/N
(Jacob Bernoulli, 1654-1705)
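As a quick check, the estimator µ_ML = m/N can be sketched in a few lines of Python (the data set and function name here are illustrative, not from the slides):

```python
# Minimal sketch of the Bernoulli maximum-likelihood estimate mu_ML = m/N.
# Data and helper names are illustrative.

def bernoulli_mle(xs):
    """mu_ML = (1/N) * sum(x_n): the fraction of observations with x = 1."""
    return sum(xs) / len(xs)

data = [1, 0, 1, 1, 0, 1, 0, 1]   # N = 8 flips, m = 5 heads
mu_ml = bernoulli_mle(data)        # m / N = 5 / 8 = 0.625
```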
Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1
• Proportional to Bern(x|µ); obtained by adding up all the ways of obtaining m heads:
  Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}
• Binomial coefficient: (N choose m) = N! / (m!(N−m)!)
• Mean and variance:
  E[m] = Σ_{m=0}^{N} m Bin(m|N,µ) = Nµ
  var[m] = Nµ(1−µ)
[Figure: histogram of the Binomial distribution for N=10 and µ=0.25]
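A minimal sketch of evaluating Bin(m|N,µ) with only the Python standard library (function names are illustrative); it also checks numerically that the probabilities sum to 1 and that the mean is Nµ:

```python
# Illustrative sketch: the binomial pmf Bin(m | N, mu) from first principles.
from math import comb

def binom_pmf(m, N, mu):
    """(N choose m) * mu^m * (1 - mu)^(N - m)"""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25
pmf = [binom_pmf(m, N, mu) for m in range(N + 1)]
total = sum(pmf)                                      # should be 1
mean = sum(m * p for m, p in enumerate(pmf))          # E[m] = N*mu = 2.5
```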
Beta Distribution
• Beta(µ|a,b) = Γ(a+b) / (Γ(a)Γ(b)) µ^{a−1} (1−µ)^{b−1}
• where the Gamma function is defined as Γ(x) = ∫_0^∞ u^{x−1} e^{−u} du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
  E[µ] = a / (a+b)
  var[µ] = ab / ((a+b)^2 (a+b+1))
[Figure: the Beta distribution as a function of µ for hyperparameter settings (a,b) = (0.1,0.1), (1,1), (2,3), (8,4)]
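The density and its mean can be sketched with the standard library alone (helper names are illustrative); the mean E[µ] = a/(a+b) is verified here by midpoint-rule integration rather than in closed form:

```python
# Illustrative sketch: the Beta density, with a numerical check of E[mu] = a/(a+b).
from math import gamma

def beta_pdf(mu, a, b):
    """Gamma(a+b)/(Gamma(a)*Gamma(b)) * mu^(a-1) * (1-mu)^(b-1)"""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

a, b, n = 2.0, 3.0, 10000
# Midpoint rule for the integral of mu * Beta(mu|a,b) over [0,1]
mean = sum(((i + 0.5) / n) * beta_pdf((i + 0.5) / n, a, b) for i in range(n)) / n
# mean is close to a/(a+b) = 0.4
```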
Bayesian Inference with Beta
• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1−µ)^{1−x}
• If the prior distribution of µ is chosen proportional to powers of µ and (1−µ), the posterior has the same functional form as the prior
  – this property is called conjugacy
• The Beta distribution has a form suitable for the prior p(µ)
Bayesian Inference with Beta (continued)
• The posterior, obtained by multiplying the Beta prior by the binomial likelihood, is
  p(µ|m,l,a,b) ∝ µ^{m+a−1} (1−µ)^{l+b−1}
  – where m is the number of heads and l = N−m is the number of tails
• It is another Beta distribution:
  p(µ|m,l,a,b) = Γ(m+a+l+b) / (Γ(m+a)Γ(l+b)) µ^{m+a−1} (1−µ)^{l+b−1}
  – effectively increases the value of a by m and of b by l
  – as the number of observations increases, the distribution becomes more peaked
[Figure: one step of sequential updating. Prior p(µ) with a=2, b=2; likelihood p(x=1|µ) = µ^1(1−µ)^0 for a single observation N=m=1 with x=1; posterior p(µ|x=1) with a=3, b=2]
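The conjugate update is just parameter arithmetic, which can be sketched as follows (function name is illustrative); the example reproduces the slide's single-observation step:

```python
# Illustrative sketch of the conjugate update: a Beta(a, b) prior combined with
# m heads and l tails yields a Beta(a + m, b + l) posterior.

def beta_binomial_update(a, b, m, l):
    return a + m, b + l

# The slides' example: prior a=2, b=2; one observation x=1 (m=1, l=0)
a_post, b_post = beta_binomial_update(2, 2, 1, 0)   # posterior a=3, b=2
```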
Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D
  – from the sum and product rules:
  p(x=1|D) = ∫_0^1 p(x=1, µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
  p(x=1|D) = (m+a) / (m+a+l+b)
  – which is the fraction of observations (both fictitious and real) that correspond to x=1
• The maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – on average, uncertainty (variance) decreases as more data are observed
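The predictive probability and its large-N behaviour can be sketched directly (names and counts are illustrative):

```python
# Illustrative sketch: p(x=1|D) = (m+a)/(m+a+l+b), approaching the MLE m/N
# as the number of observations grows.

def predictive_prob(m, l, a, b):
    return (m + a) / (m + a + l + b)

p_small = predictive_prob(m=3, l=1, a=2, b=2)        # 5/8, vs MLE 3/4
p_large = predictive_prob(m=3000, l=1000, a=2, b=2)  # close to MLE 3/4
```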
Summary of Binary Distributions
• The distribution of a single binary variable is represented by the Bernoulli
• The Binomial is related to the Bernoulli
  – it expresses the distribution of the number of occurrences of 1 (or of 0) in N trials
• The Beta distribution is a conjugate prior for the Bernoulli
  – both have the same functional form
Sample Matlab Code: Probability Distributions
• Binomial distribution:
  – Probability density function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X.
  – Random number generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers drawn from a binomial distribution with parameters N and P.
• Beta distribution:
  – Probability density function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X.
  – Random number generator: R = betarnd(A,B) returns a matrix of random numbers drawn from the beta distribution with parameters A and B.
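For readers without Matlab, rough Python analogues can be sketched with the standard library alone; `random.betavariate` is a real stdlib sampler, while the binomial pdf below is hand-rolled and named after its Matlab counterpart purely for illustration:

```python
# Illustrative stdlib-only analogues of the Matlab calls above.
import random
from math import comb

def binopdf(x, n, p):
    """Analogue of Matlab's binopdf(X,N,P) for a single value x."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

random.seed(0)
r_beta = random.betavariate(2.0, 3.0)   # one draw, analogous to betarnd(A,B)
```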
Multinomial Variables
Generalized Bernoulli and Dirichlet
Generalization of the Binomial
• Binomial: tossing a coin
  – expresses the probability of a number of successes in N trials
  – e.g., the probability of 3 rainy days in 10 days
  [Figure: histogram of the Binomial for N=10 and µ=0.25]
• Multinomial: throwing a die
  – the probability of a given frequency for each value
  – e.g., the probability of 3 specific letters in a string of length N
• Probability calculator: http://stattrek.com/Tables/Multinomial.aspx
Generalization of the Bernoulli
• Bernoulli: x is 0 or 1, with Bern(x|µ) = µ^x (1−µ)^{1−x}
• Now consider a discrete variable that takes one of K values (instead of 2)
• Represent it using a 1-of-K scheme
  – represent x as a K-dimensional vector
  – e.g., for K=6, if x takes the value 3 we represent it as x = (0,0,1,0,0,0)^T
  – such vectors satisfy Σ_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, the distribution of x is given by
  p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – the generalized Bernoulli
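The 1-of-K encoding and the generalized Bernoulli probability can be sketched as follows (function names are illustrative); the example reproduces the slide's K=6 vector:

```python
# Illustrative sketch: 1-of-K encoding and p(x|mu) = prod_k mu_k^{x_k}.

def one_of_k(value, K):
    """Encode a value in {1..K} as a K-dimensional binary indicator vector."""
    return [1 if k == value else 0 for k in range(1, K + 1)]

def gen_bernoulli(x, mu):
    """The product picks out mu_k for the single active component."""
    p = 1.0
    for xk, mk in zip(x, mu):
        p *= mk ** xk
    return p

x = one_of_k(3, 6)                       # (0,0,1,0,0,0), as on the slide
mu = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]      # must sum to 1
```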
MLE of Generalized Bernoulli Parameters
• Data set D of N independent observations x_1,..,x_N
  – where the nth observation is written as (x_{n1},..,x_{nK})
• The likelihood function has the form
  p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_{nk}} = ∏_{k=1}^{K} µ_k^{Σ_n x_{nk}} = ∏_{k=1}^{K} µ_k^{m_k}
  – where m_k = Σ_n x_{nk} is the number of observations with x_k=1
• The maximum likelihood solution, obtained by setting the derivative of the log-likelihood w.r.t. µ_k to zero (subject to the constraint Σ_k µ_k = 1), is
  µ_k^{ML} = m_k / N
  which is the fraction of the N observations for which x_k=1
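The solution µ_k^{ML} = m_k/N amounts to counting, as this sketch shows (data and function name are illustrative):

```python
# Illustrative sketch: mu_k^ML = m_k / N from 1-of-K encoded observations.

def categorical_mle(X):
    N = len(X)
    K = len(X[0])
    m = [sum(x[k] for x in X) for k in range(K)]   # counts m_k
    return [mk / N for mk in m]

X = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]   # N=4 observations, K=3
mu_ml = categorical_mle(X)                          # [0.25, 0.5, 0.25]
```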
Generalized Binomial Distribution
• The multinomial distribution (over a K-state variable) is
  Mult(m_1, m_2,.., m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}, with Σ_k µ_k = 1
  – where the normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,.., m_K
• It is given by
  (N choose m_1 m_2 .. m_K) = N! / (m_1! m_2! .. m_K!)
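The normalization coefficient and the full pmf can be sketched with the standard library (function names are illustrative):

```python
# Illustrative sketch: the multinomial coefficient and Mult(m_1..m_K | mu, N).
from math import factorial

def multinomial_coeff(ms):
    """N! / (m_1! m_2! .. m_K!), with N = sum of the m_k."""
    c = factorial(sum(ms))
    for m in ms:
        c //= factorial(m)
    return c

def multinomial_pmf(ms, mu):
    p = float(multinomial_coeff(ms))
    for m, mk in zip(ms, mu):
        p *= mk ** m
    return p

# Partitioning N=4 trials into counts (2,1,1) over K=3 outcomes:
coeff = multinomial_coeff([2, 1, 1])    # 4!/(2!1!1!) = 12
```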
Dirichlet Distribution
• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the form of the conjugate prior is
  p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k−1}, where 0 ≤ µ_k ≤ 1 and Σ_k µ_k = 1
• Normalized form of the Dirichlet distribution:
  Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{K} α_k
(Lejeune Dirichlet, 1805-1859)
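Evaluating the normalized density needs only the Gamma function (function name is illustrative); with all α_k = 1 the Dirichlet is uniform over the simplex, so for K=3 the density is Γ(3) = 2 everywhere, which gives a handy sanity check:

```python
# Illustrative sketch: evaluating Dir(mu | alpha) on a point of the simplex.
from math import gamma

def dirichlet_pdf(mu, alpha):
    """Gamma(a0) / prod_k Gamma(a_k) * prod_k mu_k^(a_k - 1), a0 = sum a_k"""
    norm = gamma(sum(alpha))
    for a in alpha:
        norm /= gamma(a)
    dens = norm
    for m, a in zip(mu, alpha):
        dens *= m ** (a - 1)
    return dens

d = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])   # uniform case: 2.0
```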
Dirichlet over 3 Variables
• Due to the summation constraint Σ_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K−1
• For K=3:
  Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_3)) ∏_{k=1}^{3} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{3} α_k
[Figure: plots of the Dirichlet distribution over the simplex for parameter settings α_k=0.1, α_k=1, and α_k=10]
Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood:
  p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
• This has the form of a Dirichlet distribution:
  p(µ|D,α) = Dir(µ|α+m) = Γ(α_0+N) / (Γ(α_1+m_1)..Γ(α_K+m_K)) ∏_{k=1}^{K} µ_k^{α_k+m_k−1}
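As with the Beta-Bernoulli case, the conjugate update is plain addition of counts to the prior parameters (function name and counts are illustrative):

```python
# Illustrative sketch of the conjugate update: a Dir(alpha) prior plus the
# observed counts m yields a Dir(alpha + m) posterior.

def dirichlet_update(alpha, m):
    return [a + mk for a, mk in zip(alpha, m)]

alpha_post = dirichlet_update([1, 1, 1], [5, 2, 3])   # posterior Dir(6, 3, 4)
```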
Summary of Discrete Distributions
• Bernoulli (2 states): Bern(x|µ) = µ^x (1−µ)^{1−x}
  – Binomial: Bin(m|N,µ) = (N choose m) µ^m (1−µ)^{N−m}, where (N choose m) = N! / (m!(N−m)!)
• Generalized Bernoulli (K states): p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}, where µ = (µ_1,..,µ_K)^T
  – Multinomial: Mult(m_1, m_2,.., m_K | µ, N) = (N choose m_1 m_2 .. m_K) ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – of the Binomial: Beta(µ|a,b) = Γ(a+b) / (Γ(a)Γ(b)) µ^{a−1} (1−µ)^{b−1}
  – of the Multinomial: Dir(µ|α) = Γ(α_0) / (Γ(α_1)...Γ(α_K)) ∏_{k=1}^{K} µ_k^{α_k−1}, where α_0 = Σ_{k=1}^{K} α_k
Distributions: Landscape
• Discrete, binary: Bernoulli, Binomial, Beta
• Discrete, multivalued: Multinomial, Dirichlet
• Continuous: Gaussian, Wishart, Student's-t, Gamma, Exponential, Uniform
• Angular: Von Mises
Distributions: Relationships
• Discrete, binary:
  – Bernoulli: a single binary variable
  – Binomial: N samples of a Bernoulli; the Bernoulli is the N=1 case
  – Beta (conjugate prior): a continuous variable in [0,1]
• Discrete, multi-valued:
  – Multinomial: one of K values, i.e., a K-dimensional binary vector; the Binomial is the K=2 case
  – Dirichlet (conjugate prior): K random variables in [0,1]
• Continuous:
  – Gaussian: the large-N limit of the Multinomial
  – Student's-t: a generalization of the Gaussian robust to outliers; an infinite mixture of Gaussians
  – Gamma: conjugate prior of the univariate Gaussian precision
  – Exponential: a special case of the Gamma
  – Wishart: conjugate prior of the multivariate Gaussian precision matrix
  – Gaussian-Gamma: conjugate prior of a univariate Gaussian with unknown mean and precision
  – Gaussian-Wishart: conjugate prior of a multivariate Gaussian with unknown mean and precision matrix
  – Uniform
• Angular: Von Mises