Machine Learning (Srihari)
Discrete Probability Distributions
Sargur N. Srihari

Binary Variables: Bernoulli, Binomial and Beta

Bernoulli Distribution (Jacob Bernoulli, 1654-1705)
• Expresses the distribution of a single binary-valued random variable x ∈ {0,1}
• The probability of x=1 is denoted by the parameter µ, i.e., p(x=1|µ) = µ; therefore p(x=0|µ) = 1-µ
• The probability distribution has the form
    Bern(x|µ) = µ^x (1-µ)^{1-x}
• The mean can be shown to be E[x] = µ, and the variance is var[x] = µ(1-µ)
• The likelihood of N observations x_1,…,x_N drawn independently from p(x|µ) is
    p(D|µ) = ∏_{n=1}^{N} p(x_n|µ) = ∏_{n=1}^{N} µ^{x_n} (1-µ)^{1-x_n}
• The log-likelihood is
    ln p(D|µ) = ∑_{n=1}^{N} ln p(x_n|µ) = ∑_{n=1}^{N} { x_n ln µ + (1-x_n) ln(1-µ) }
• The maximum likelihood estimator, obtained by setting the derivative of ln p(D|µ) with respect to µ equal to zero, is
    µ_ML = (1/N) ∑_{n=1}^{N} x_n
  If the number of observations with x=1 is m, then µ_ML = m/N

Binomial Distribution
• Related to the Bernoulli distribution
• Expresses the distribution of m, the number of observations for which x=1 in N trials
• Proportional to Bern(x|µ), summed over all the ways of obtaining m heads:
    Bin(m|N,µ) = C(N,m) µ^m (1-µ)^{N-m}
  where the binomial coefficient is C(N,m) = N! / (m!(N-m)!)
• Mean and variance are
    E[m] = ∑_{m=0}^{N} m Bin(m|N,µ) = Nµ
    var[m] = Nµ(1-µ)
[Figure: histogram of the binomial distribution for N=10 and µ=0.25]
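The estimator µ_ML = m/N and the binomial moments above can be checked numerically. The sketch below uses Python rather than the Matlab shown later in the deck; the seed, sample size, and µ = 0.25 are assumed example values, not from the slides.

```python
import math
import random

# Binomial pmf Bin(m|N,µ) = C(N,m) µ^m (1-µ)^(N-m)
def binom_pmf(m, N, mu):
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

random.seed(0)
mu_true = 0.25
xs = [1 if random.random() < mu_true else 0 for _ in range(10000)]
mu_ml = sum(xs) / len(xs)               # µ_ML = m/N, the fraction of ones

N = 10
pmf = [binom_pmf(m, N, mu_true) for m in range(N + 1)]
mean_m = sum(m * p for m, p in zip(range(N + 1), pmf))
var_m = sum((m - mean_m) ** 2 * p for m, p in zip(range(N + 1), pmf))
# mean_m matches Nµ = 2.5 and var_m matches Nµ(1-µ) = 1.875 up to float error
```

The empirical µ_ML approaches µ = 0.25 as the sample grows, which is the maximum-likelihood result stated above.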
Beta Distribution
• The beta distribution is
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a-1} (1-µ)^{b-1}
  where the Gamma function is defined as
    Γ(x) = ∫_0^∞ u^{x-1} e^{-u} du
• a and b are hyperparameters that control the distribution of the parameter µ
• Mean and variance:
    E[µ] = a / (a+b)
    var[µ] = ab / [(a+b)^2 (a+b+1)]
[Figure: the beta distribution as a function of µ for the hyperparameter settings (a,b) = (0.1,0.1), (1,1), (2,3) and (8,4)]

Bayesian Inference with Beta
• The MLE of µ in the Bernoulli is the fraction of observations with x=1
  – Severely over-fitted for small data sets
• The likelihood function is a product of factors of the form µ^x (1-µ)^{1-x}
• If the prior distribution of µ is chosen to be proportional to powers of µ and (1-µ), the posterior has the same functional form as the prior; this property is called conjugacy
• The beta distribution has the right form to serve as a prior distribution p(µ)

• The posterior, obtained by multiplying the beta prior by the binomial likelihood, satisfies
    p(µ|m,l,a,b) ∝ µ^{m+a-1} (1-µ)^{l+b-1}
  where m is the number of heads and l = N-m is the number of tails
• It is another beta distribution:
    p(µ|m,l,a,b) = [Γ(m+a+l+b) / (Γ(m+a)Γ(l+b))] µ^{m+a-1} (1-µ)^{l+b-1}
  – The observations effectively increase the value of a by m and the value of b by l
  – As the number of observations increases, the distribution becomes more sharply peaked
[Figure: one step of sequential inference. Prior p(µ) with a=2, b=2; a single observation x=1 (N=m=1) with likelihood p(x=1|µ) = µ^1 (1-µ)^0; posterior p(µ|x=1) with a=3, b=2]

Predicting the Next Trial Outcome
• We need the predictive distribution of x given the observed data D; from the sum and product rules,
    p(x=1|D) = ∫_0^1 p(x=1, µ|D) dµ = ∫_0^1 p(x=1|µ) p(µ|D) dµ = ∫_0^1 µ p(µ|D) dµ = E[µ|D]
• The expected value of the posterior distribution can be shown to be
    p(x=1|D) = (m+a) / (m+a+l+b)
  which is the fraction of observations (both fictitious and real) that correspond to x=1
• Maximum likelihood and Bayesian results agree in the limit of infinitely many observations
  – On average, uncertainty (variance) decreases as more data are observed

Summary of Binary Distributions
• The distribution of a single binary variable is represented by the Bernoulli
• The binomial is related to the Bernoulli
  – It expresses the distribution of the number of occurrences of either 1 or 0 in N trials
• The beta distribution is a conjugate prior for the Bernoulli
  – Prior and posterior have the same functional form

Sample Matlab Code for Probability Distributions
• Binomial distribution
  – Probability density function: Y = binopdf(X,N,P) returns the binomial probability density function with parameters N and P at the values in X
  – Random number generator: R = binornd(N,P,MM,NN) returns an MM-by-NN matrix of random numbers drawn from a binomial distribution with parameters N and P
• Beta distribution
  – Probability density function: Y = betapdf(X,A,B) returns the beta probability density function with parameters A and B at the values in X
  – Random number generator: R = betarnd(A,B) returns a matrix of random numbers drawn from the beta distribution with parameters A and B
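The conjugate update and the predictive probability can also be sketched in Python (as a counterpart to the Matlab calls above). The numbers mirror the slide's sequential-inference example: prior a=2, b=2 and a single observation x=1, so the posterior is Beta(µ|3,2). The sketch integrates µ against the posterior numerically and compares with the closed form (m+a)/(m+a+l+b).

```python
import math

# Beta density Beta(µ|a,b) = Γ(a+b)/(Γ(a)Γ(b)) µ^(a-1) (1-µ)^(b-1)
def beta_pdf(mu, a, b):
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * mu**(a - 1) * (1 - mu)**(b - 1)

a, b = 2.0, 2.0          # prior hyperparameters (slide example)
m, l = 1, 0              # one head, zero tails

# Predictive p(x=1|D) = E[µ|D], by midpoint integration of the posterior Beta(a+m, b+l)
n = 100000
pred = sum(mu * beta_pdf(mu, a + m, b + l)
           for mu in ((i + 0.5) / n for i in range(n))) / n

closed_form = (m + a) / (m + a + l + b)   # = 3/5
# pred agrees with closed_form up to the integration error
```

With no data (m=l=0) the predictive reduces to the prior mean a/(a+b), consistent with the beta-mean formula above.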
Multinomial Variables: Generalized Bernoulli and Dirichlet

Generalization of Binomial
• Binomial: tossing a coin
  – Expresses the probability of the number of successes in N trials, e.g., the probability of 3 rainy days in 10 days
• Multinomial: throwing a die
  – Probability of a given frequency for each value, e.g., the probability of 3 specific letters in a string of length N
• Probability calculator: http://stattrek.com/Tables/Multinomial.aspx

Generalization of Bernoulli
• The Bernoulli distribution Bern(x|µ) = µ^x (1-µ)^{1-x} applies when x is 0 or 1
• Consider a discrete variable that takes one of K values (instead of 2)
• Represent it in a 1-of-K scheme
  – Represent x as a K-dimensional vector with one element equal to 1 and the rest 0
  – If K=6 and x takes its third value, it is represented as x = (0,0,1,0,0,0)^T
  – Such vectors satisfy ∑_{k=1}^{K} x_k = 1
• If the probability of x_k=1 is denoted µ_k, the distribution of x (the generalized Bernoulli) is
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,…,µ_K)^T

MLE of Generalized Bernoulli Parameters
• Data set D of N independent observations x_1,…,x_N, where the nth observation is written (x_{n1},…,x_{nK})
• The likelihood function has the form
    p(D|µ) = ∏_{n=1}^{N} ∏_{k=1}^{K} µ_k^{x_{nk}} = ∏_{k=1}^{K} µ_k^{∑_n x_{nk}} = ∏_{k=1}^{K} µ_k^{m_k}
  where m_k = ∑_n x_{nk} is the number of observations with x_k=1
• The maximum likelihood solution, obtained by setting the derivative of the log-likelihood with respect to µ_k to zero subject to the constraint ∑_k µ_k = 1, is
    µ_k^{ML} = m_k / N
  which is the fraction of the N observations for which x_k=1

Generalized Binomial (Multinomial) Distribution
• The multinomial distribution over a K-state variable is
    Mult(m_1,m_2,…,m_K | µ, N) = C(N; m_1,m_2,…,m_K) ∏_{k=1}^{K} µ_k^{m_k},   with ∑_k µ_k = 1
• The normalization coefficient is the number of ways of partitioning N objects into K groups of sizes m_1, m_2,…, m_K:
    C(N; m_1,m_2,…,m_K) = N! / (m_1! m_2! … m_K!)
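A minimal Python sketch of the 1-of-K counts and the quantities above; the fair-die probabilities, seed, and sample size are assumed for illustration.

```python
import math
import random
from collections import Counter

random.seed(1)
K, N = 6, 6000
mu_true = [1 / 6] * K                         # a fair die
draws = random.choices(range(K), weights=mu_true, k=N)

cnt = Counter(draws)
m = [cnt[k] for k in range(K)]                # m_k = number of observations of value k

mu_ml = [mk / N for mk in m]                  # MLE: µ_k = m_k / N
# Multinomial coefficient N!/(m_1!...m_K!) for the observed counts
coef = math.factorial(N) // math.prod(math.factorial(mk) for mk in m)
# sum(m) == N, and each mu_ml[k] is close to 1/6
```

Each draw corresponds to a one-hot vector with x_k=1 for the face shown; summing those vectors over the data gives exactly the counts m_k used here.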
Dirichlet Distribution (Lejeune Dirichlet, 1805-1859)
• A family of prior distributions for the parameters µ_k of the multinomial distribution
• By inspection of the multinomial, the conjugate prior has the form
    p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k-1}   where 0 ≤ µ_k ≤ 1 and ∑_k µ_k = 1
• The normalized form of the Dirichlet distribution is
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)…Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k-1}   where α_0 = ∑_{k=1}^{K} α_k

Dirichlet over 3 Variables
• Because of the summation constraint ∑_k µ_k = 1, the distribution over the space of {µ_k} is confined to a simplex of dimensionality K-1
• For K=3:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)…Γ(α_3))] ∏_{k=1}^{3} µ_k^{α_k-1}   where α_0 = ∑_{k=1}^{3} α_k
[Figure: plots of the Dirichlet distribution over the simplex for the settings α_k = 0.1, α_k = 1 and α_k = 10]

Dirichlet Posterior Distribution
• Multiplying the prior by the likelihood:
    p(µ|D,α) ∝ p(D|µ) p(µ|α) ∝ ∏_{k=1}^{K} µ_k^{α_k+m_k-1}
• This has the form of a Dirichlet distribution:
    p(µ|D,α) = Dir(µ|α+m) = [Γ(α_0+N) / (Γ(α_1+m_1)…Γ(α_K+m_K))] ∏_{k=1}^{K} µ_k^{α_k+m_k-1}
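The update α_k → α_k + m_k can be illustrated numerically. This sketch uses assumed example values for the prior and counts; it draws Dirichlet samples by normalizing independent Gamma draws (a standard construction, not stated in the slides) and checks the standard posterior-mean formula E[µ_k|D] = (α_k + m_k)/(α_0 + N).

```python
import random

random.seed(2)
alpha = [2.0, 2.0, 2.0]                  # assumed prior hyperparameters
m = [5, 3, 2]                            # assumed observed counts, N = 10
post = [a + mk for a, mk in zip(alpha, m)]   # posterior: Dir(µ | α + m)

def dirichlet_sample(a):
    # Normalize independent Gamma(a_k, 1) draws to get a point on the simplex
    g = [random.gammavariate(ak, 1.0) for ak in a]
    s = sum(g)
    return [gi / s for gi in g]

S = 20000
sums = [0.0] * 3
for _ in range(S):
    mu = dirichlet_sample(post)
    for k in range(3):
        sums[k] += mu[k]
mc_mean = [s / S for s in sums]
exact_mean = [p / sum(post) for p in post]   # (7/16, 5/16, 4/16)
# mc_mean approximates exact_mean componentwise
```

Every sample lies on the simplex ∑_k µ_k = 1, which is the constraint that confines the Dirichlet to K-1 dimensions.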
Summary of Discrete Distributions
• Bernoulli (2 states):
    Bern(x|µ) = µ^x (1-µ)^{1-x}
  – Binomial:
    Bin(m|N,µ) = C(N,m) µ^m (1-µ)^{N-m},   C(N,m) = N! / (m!(N-m)!)
• Generalized Bernoulli (K states):
    p(x|µ) = ∏_{k=1}^{K} µ_k^{x_k}   where µ = (µ_1,…,µ_K)^T
  – Multinomial:
    Mult(m_1,…,m_K|µ,N) = [N! / (m_1!…m_K!)] ∏_{k=1}^{K} µ_k^{m_k}
• Conjugate priors:
  – For the binomial, the beta:
    Beta(µ|a,b) = [Γ(a+b) / (Γ(a)Γ(b))] µ^{a-1} (1-µ)^{b-1}
  – For the multinomial, the Dirichlet:
    Dir(µ|α) = [Γ(α_0) / (Γ(α_1)…Γ(α_K))] ∏_{k=1}^{K} µ_k^{α_k-1},   α_0 = ∑_k α_k

Distributions: Landscape
[Diagram: discrete-binary (Bernoulli, Binomial, Beta); discrete-multivalued (Multinomial, Dirichlet); continuous (Gaussian, Student's-t, Gamma, Wishart, Exponential, Uniform); angular (Von Mises)]

Distributions: Relationships
[Diagram, summarized:]
• Bernoulli: a single binary variable; the binomial with N=1; the multinomial with K=2
• Binomial: N samples of a Bernoulli; its conjugate prior is the beta, a continuous variable on [0,1]
• Multinomial: one of K values, represented as a K-dimensional binary vector; its conjugate prior is the Dirichlet, K random variables on [0,1]; approaches the Gaussian for large N
• Student's-t: a generalization of the Gaussian that is robust to outliers; an infinite mixture of Gaussians
• Gamma: conjugate prior of the univariate Gaussian precision; the exponential is a special case of the Gamma
• Wishart: conjugate prior of the multivariate Gaussian precision matrix
• Gaussian-Gamma: conjugate prior of the univariate Gaussian with unknown mean and precision
• Gaussian-Wishart: conjugate prior of the multivariate Gaussian with unknown mean and precision matrix
• Von Mises: an angular distribution; Uniform
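Two of the reductions in the relationships diagram can be verified directly. A small Python sketch (the parameter values are assumed for illustration): the binomial with N=1 equals the Bernoulli, and a two-state multinomial equals the binomial.

```python
import math

def binom_pmf(m, N, mu):
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

def mult_pmf(ms, mus):
    N = sum(ms)
    coef = math.factorial(N) // math.prod(math.factorial(mk) for mk in ms)
    return coef * math.prod(mu**mk for mu, mk in zip(mus, ms))

mu = 0.3
# Bin(m|1,µ) equals Bern(x|µ) = µ^x (1-µ)^(1-x) for x in {0,1}
bern = [1 - mu, mu]
bin1 = [binom_pmf(0, 1, mu), binom_pmf(1, 1, mu)]

# Mult(m, N-m | (µ, 1-µ), N) equals Bin(m|N,µ)
N, m = 10, 4
two_state = mult_pmf([m, N - m], [mu, 1 - mu])
binomial = binom_pmf(m, N, mu)
# bern matches bin1, and two_state matches binomial, up to float rounding
```

The same pattern extends upward: setting K=2 in the Dirichlet recovers the beta, with α_1, α_2 playing the roles of a and b.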