Bayesian Inference
Ekaterina Lomakina
TNU seminar: Bayesian inference, 1 March 2013

Outline
• Probability distributions
• Maximum likelihood estimation
• Maximum a posteriori estimation
• Conjugate priors
• Conceptualizing models as collections of priors
• Noninformative priors
• Empirical Bayes

Probability distribution
• Density estimation – modelling the distribution p(x) of a random variable x given a finite set of observations x1, …, xN.
• Nonparametric approaches: histogram, kernel density estimation, nearest-neighbour methods.
• Parametric approaches: Gaussian distribution, Beta distribution, …

The Exponential Family
• Gaussian distribution, Binomial distribution, Beta distribution, etc.

Gaussian distribution
• The central limit theorem (CLT) states that, under certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and variance, is approximately normally distributed.
• [Figure: bean machine, by Sir Francis Galton]

Maximum likelihood estimation
• The frequentist approach to estimating the parameters of a distribution from a set of observations is to maximize the likelihood.
• Assuming the data are i.i.d., we can equivalently maximize the log-likelihood (a monotonic transformation).
• MLE for the Gaussian mean: μ_ML = (1/N) Σn xn – the simple average.

Maximum a posteriori estimation
• The Bayesian approach to estimating the parameters of a distribution from a set of observations is to maximize the posterior distribution.
• This allows prior information to be taken into account.
• MAP for the Gaussian mean: the posterior mean is a weighted average of the prior mean and the maximum likelihood estimate, with the weight on the data growing with the number of observations (see the code sketch below).

Conjugate prior
• In general, for a given probability distribution p(x|η), we can seek a prior p(η) that is conjugate to the likelihood function, so that the posterior distribution has the same functional form as the prior.
• For any member of the exponential family, there exists a conjugate prior that can be written in a standard form.
• Important conjugate pairs include: Binomial – Beta, Multinomial – Dirichlet, Gaussian – Gaussian (for the mean), Gaussian – Gamma (for the precision), Exponential – Gamma.

MLE for the Binomial distribution
• The Binomial distribution models the probability of m "heads" out of N tosses.
• Its only parameter, μ, encodes the probability of a single event ("head").
• The maximum likelihood estimate is μ_ML = m / N.

MAP for the Binomial distribution
• The conjugate prior for this distribution is the Beta distribution, Beta(μ | a, b).
• The posterior is then Beta(μ | m + a, l + b), where l = N − m, simply the number of "tails" (see the code sketch below).

Models as collections of priors – 1
• Take a simple regression model.
• Add a prior on the weights.
• And get Bayesian linear regression!

Models as collections of priors – 2
• Take again a simple regression model, where yn is some function of xn.
• Add a prior on the function itself.
• And get Gaussian processes!

Models as collections of priors – 3
• Take a model where xn is discrete and unknown.
• Add a prior on the states xn, assuming they are temporally smooth.
• And get a Hidden Markov Model!
• [Figure: chain-structured graphical model with hidden states x1, x2, …, xn−1, xn, xn+1 and observations t1, t2, …, tn−1, tn, tn+1]

Noninformative priors
• Sometimes we have no strong prior belief but still want to apply Bayesian inference. Then we need noninformative priors.
• If our parameter λ is a discrete variable with K states, we can simply set each prior probability to 1/K.
• For continuous variables, however, it is not so clear.
• One example is a noninformative prior over μ for the Gaussian distribution: a Gaussian prior N(μ | μ0, σ0²) in the limit σ0² → ∞.
• In this limit the effect of the prior on the posterior over μ vanishes.
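The "weighted average" on the MAP-for-Gaussian slide can be made concrete with a minimal Python sketch (not part of the original slides; the data, the noise variance sigma2, and the prior settings mu0, sigma02 below are illustrative assumptions). With known noise variance and a conjugate Gaussian prior on the mean, the posterior mean interpolates between the prior mean and the maximum likelihood estimate.

```python
import numpy as np

def gaussian_map_mean(x, sigma2, mu0, sigma02):
    """Posterior over the Gaussian mean with known noise variance sigma2
    and a conjugate Gaussian prior N(mu0, sigma02).
    Returns (posterior_mean, posterior_variance)."""
    n = len(x)
    mu_ml = np.mean(x)                       # MLE: the simple average
    # Posterior precision adds the prior precision and n data precisions;
    # the posterior mean is a weighted average of mu0 and mu_ml.
    post_var = 1.0 / (1.0 / sigma02 + n / sigma2)
    post_mean = post_var * (mu0 / sigma02 + n * mu_ml / sigma2)
    return post_mean, post_var

# Illustrative data (assumed, not from the slides)
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=20)

post_mean, post_var = gaussian_map_mean(x, sigma2=1.0, mu0=0.0, sigma02=1.0)
print(f"MLE (simple average): {np.mean(x):.3f}")
print(f"MAP / posterior mean (weighted average): {post_mean:.3f}")
```

With more observations the weight on the data term grows and the MAP estimate approaches the MLE, which is the behaviour the slide describes.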
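Likewise, a minimal sketch of the Beta–Binomial conjugate update from the MLE/MAP-for-Binomial slides, assuming illustrative counts m, N and Beta hyperparameters a, b (none of these values come from the slides): the Beta(a, b) prior is updated to Beta(m + a, l + b), so the hyperparameters act as pseudo-counts.

```python
from scipy import stats

def beta_binomial_update(m, N, a, b):
    """Conjugate update for the Binomial parameter mu with a Beta(a, b) prior.
    Returns the MLE, the MAP estimate, and the posterior Beta distribution."""
    l = N - m                          # number of "tails"
    mu_mle = m / N                     # maximum likelihood estimate
    # Posterior is Beta(m + a, l + b); its mode is the MAP estimate
    # (defined when m + a > 1 and l + b > 1).
    mu_map = (m + a - 1) / (N + a + b - 2)
    posterior = stats.beta(m + a, l + b)
    return mu_mle, mu_map, posterior

# Illustrative counts and hyperparameters (assumed)
mu_mle, mu_map, posterior = beta_binomial_update(m=7, N=10, a=2, b=2)
print(f"MLE: {mu_mle:.3f}")            # 0.700
print(f"MAP: {mu_map:.3f}")            # (7+2-1)/(10+2+2-2) = 0.667
print(f"Posterior mean: {posterior.mean():.3f}")
```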
Empirical Bayes
• But what if we still want to use some prior information, yet want to learn it from the data instead of fixing it in advance?
• Imagine the following model.
• [Figure: graphical model with hyperparameter λ, parameters θs, and observations xn, with plates over n = 1…N and s = 1…S]
• We cannot do full Bayesian inference, but we can approximate it by finding the single best value λ* that maximizes p(X | λ).

Empirical Bayes
• We can estimate the result by the following iterative procedure (EM algorithm):
• Initialize λ*.
• E-step: compute p(θ | X, λ*) given the current fixed λ*.
• M-step: update λ* to maximize the expected complete-data log-likelihood under p(θ | X, λ*).
• This illustrates the other name for empirical Bayes – maximum marginal likelihood (a small numerical sketch follows at the end).
• This is not a fully Bayesian treatment, but it offers a useful compromise between the Bayesian and frequentist approaches.

Thank you for your attention!
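As an appendix, a minimal sketch of the empirical Bayes / maximum marginal likelihood idea from the last slides, in a simple conjugate case: each group mean θs is drawn from N(λ, τ²), the observations in group s are drawn from N(θs, σ²), and the hyperparameter λ is learned from the data. The concrete likelihood, the variable names, and the synthetic data are illustrative assumptions, not taken from the slides; the loop mirrors the E-step / M-step description above.

```python
import numpy as np

def empirical_bayes_em(x, sigma2, tau2, n_iter=50):
    """EM-style empirical Bayes for a simple hierarchy:
        theta_s ~ N(lam, tau2),   x[s, n] ~ N(theta_s, sigma2),
    with sigma2 and tau2 known and the hyperparameter lam learned
    by maximizing the marginal likelihood p(X | lam)."""
    S, N = x.shape
    xbar = x.mean(axis=1)                  # per-group sample means
    lam = 0.0                              # initialize lam*
    for _ in range(n_iter):
        # E-step: posterior over each theta_s given the current lam
        post_var = 1.0 / (1.0 / tau2 + N / sigma2)
        post_mean = post_var * (lam / tau2 + N * xbar / sigma2)
        # M-step: re-estimate lam from the expected complete-data log-likelihood
        lam = post_mean.mean()
    return lam, post_mean

# Illustrative synthetic data (assumed): 5 groups of 20 observations
rng = np.random.default_rng(1)
true_lam, tau2, sigma2 = 3.0, 0.5, 1.0
theta = rng.normal(true_lam, np.sqrt(tau2), size=5)
x = rng.normal(theta[:, None], np.sqrt(sigma2), size=(5, 20))

lam_star, theta_post = empirical_bayes_em(x, sigma2=sigma2, tau2=tau2)
print(f"Empirical Bayes estimate lam*: {lam_star:.3f} (true value {true_lam})")
```

In this fully conjugate case the maximizer of p(X | λ) is also available in closed form; the iteration is kept only to mirror the E-step / M-step procedure on the slide.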