Statistical Learning (From Data to Distributions)

Reminders
• HW5 deadline extended to Friday

Agenda
• Learning a probability distribution from data
• Maximum likelihood estimation (MLE)
• Maximum a posteriori (MAP) estimation
• Expectation Maximization (EM)

Motivation
• An agent has made observations (data)
• Now it must make sense of them (hypotheses)
– Hypotheses alone may be important (e.g., in basic science)
– For inference (e.g., forecasting)
– To take sensible actions (decision making)
• A basic component of economics, the social and hard sciences, engineering, …

Candy Example
• Candy comes in 2 flavors, cherry and lime, with identical wrappers
• The manufacturer makes 5 (indistinguishable) kinds of bags:
– h1: C 100%, L 0%
– h2: C 75%, L 25%
– h3: C 50%, L 50%
– h4: C 25%, L 75%
– h5: C 0%, L 100%
• Suppose we draw several candies and every one is lime
• Which bag are we holding? What flavor will we draw next?

Machine Learning vs. Statistics
• Machine learning ≈ automated statistics
• This lecture:
– Bayesian learning, the more “traditional” statistics (R&N 20.1-3)
– Learning Bayes nets

Bayesian Learning
• Main idea: consider the probability of each hypothesis, given the data
• Data d: the sequence of flavors drawn so far
• Hypotheses: the five bags h1,…,h5 above; we want P(h_i|d)

Using Bayes’ Rule
• P(h_i|d) = α P(d|h_i) P(h_i) is the posterior
– (Recall, 1/α = Σ_i P(d|h_i) P(h_i))
• P(d|h_i) is the likelihood
• P(h_i) is the hypothesis prior

Computing the Posterior
• Assume draws are independent
• Let (P(h1),…,P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
• d = {10 limes in a row}
• Likelihoods: P(d|h1) = 0, P(d|h2) = 0.25^10, P(d|h3) = 0.5^10, P(d|h4) = 0.75^10, P(d|h5) = 1^10
• Weighted by the prior: P(d|h1)P(h1) = 0, P(d|h2)P(h2) ≈ 1.9e-7, P(d|h3)P(h3) ≈ 3.9e-4, P(d|h4)P(h4) ≈ 0.0113, P(d|h5)P(h5) = 0.1
• Sum = 1/α ≈ 0.1117
• Posterior: P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90
– (A code sketch reproducing these numbers appears at the end of this section.)
• [Plot: posterior probability of each hypothesis as a function of the number of draws]

Predicting the Next Draw
• P(X|d) = Σ_i P(X|h_i,d) P(h_i|d) = Σ_i P(X|h_i) P(h_i|d)
– (X is independent of d given h_i: d and X are both children of H)
• Probability that the next candy drawn is a lime: P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1
• Combined with the posterior above: P(X|d) ≈ 0.975

Other Properties of Bayesian Estimation
• Any learning technique trades off between goodness of fit and hypothesis complexity
• The prior can penalize complex hypotheses
– There are many more complex hypotheses than simple ones
– Ockham’s razor

Hypothesis Spaces Are Often Intractable
• Suppose a hypothesis is a joint probability table over n boolean state variables
– 2^n entries => the hypothesis space is [0,1]^(2^n)
– 2^(2^n) deterministic hypotheses
– 6 boolean variables => 2^64 ≈ 2 × 10^19 deterministic hypotheses
• Summing over hypotheses is expensive!

Some Common Simplifications
• Maximum a posteriori (MAP) estimation
– h_MAP = argmax_{h_i} P(h_i|d)
– P(X|d) ≈ P(X|h_MAP)
• Maximum likelihood (ML) estimation
– h_ML = argmax_{h_i} P(d|h_i)
– P(X|d) ≈ P(X|h_ML)
• Both approximate the true Bayesian prediction as the amount of data grows large

Maximum a Posteriori
• h_MAP = argmax_{h_i} P(h_i|d), and P(X|d) ≈ P(X|h_MAP)
• [Plot: P(X|h_MAP) vs. P(X|d) as limes are drawn; h_MAP moves from h3 to h4 to h5]
• For large amounts of data, P(incorrect hypothesis | d) → 0
• For small sample sizes, MAP predictions are “overconfident”

Maximum Likelihood
• h_ML = argmax_{h_i} P(d|h_i), and P(X|d) ≈ P(X|h_ML)
• [Plot: P(X|h_ML) vs. P(X|d); h_ML is undefined with no data, then jumps to h5]
• h_ML = h_MAP under a uniform prior
• The relevance of the prior diminishes as more data arrive
• Preferred by some statisticians
– Are priors “cheating”?
– What is a prior anyway?
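To make the candy numbers concrete, here is a minimal Python sketch of the “Computing the Posterior”, “Predicting the Next Draw”, and MAP/ML slides above. The flavor probabilities and prior come from the slides; the variable names are my own:

```python
# Bayesian learning for the candy example: posterior over five bag
# hypotheses after observing 10 limes, then the predictive distribution.
lime_prob = [0.0, 0.25, 0.5, 0.75, 1.0]   # P(lime | h_i) for h1..h5
prior     = [0.1, 0.2, 0.4, 0.2, 0.1]     # hypothesis prior from the slides

# Likelihood of d = 10 i.i.d. lime draws under each hypothesis.
likelihood = [p ** 10 for p in lime_prob]

# Posterior: P(h_i | d) = alpha * P(d | h_i) * P(h_i).
joint = [lik * pr for lik, pr in zip(likelihood, prior)]
alpha_inv = sum(joint)                      # 1/alpha ~ 0.1117
posterior = [j / alpha_inv for j in joint]  # ~ [0, 0.0, 0.0035, 0.101, 0.896]

# Full Bayesian prediction: P(lime next | d) = sum_i P(lime|h_i) P(h_i|d).
p_next_lime = sum(p * post for p, post in zip(lime_prob, posterior))

# MAP and ML simplifications commit to a single hypothesis.
h_map = max(range(5), key=lambda i: posterior[i])   # h5
h_ml  = max(range(5), key=lambda i: likelihood[i])  # h5

print(round(p_next_lime, 3))              # ~0.973 (the slide's 0.975 rounds the posterior first)
print(lime_prob[h_map], lime_prob[h_ml])  # 1.0 1.0 -- both predict lime with certainty
```

Note how the committed MAP/ML predictions (lime with probability 1) overshoot the full Bayesian prediction of about 0.97, which is the “overconfidence” the MAP slide warns about.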
Advantages of MAP and MLE over Bayesian Estimation
• They involve an optimization rather than a large summation
– Local search techniques apply
• For some types of distributions, there are closed-form solutions that are easily computed

Learning Coin Flips (Bernoulli Distribution)
• Let the unknown fraction of cherries be θ
• Suppose draws are independent and identically distributed (i.i.d.)
• Observe that c out of N draws are cherries

Maximum Likelihood
• Likelihood of the data d = {d1,…,dN} given θ:
– P(d|θ) = ∏_j P(d_j|θ) = θ^c (1−θ)^(N−c)
– (by the i.i.d. assumption: gather the c cherries together, then the N−c limes)
• Same as maximizing the log likelihood:
– L(d|θ) = log P(d|θ) = c log θ + (N−c) log(1−θ)
• max_θ L(d|θ) => dL/dθ = 0 => 0 = c/θ − (N−c)/(1−θ) => θ = c/N
• (A code sketch appears at the end of this section.)

Maximum Likelihood for Bayes Nets
• For any BN, the ML parameters of any CPT are given by the fractions of observed values in the data (sketched in code below)
• Example: Earthquake/Burglar/Alarm network, N = 1000 samples
– E occurs in 500 samples => P(E) = 0.5
– B occurs in 200 samples => P(B) = 0.2
– Alarm counts by parent setting:
  E  B  P(A|E,B)
  T  T  19/20   = 0.95
  F  T  188/200 = 0.94
  T  F  170/500 = 0.34
  F  F  1/380   ≈ 0.003

Maximum Likelihood for Gaussian Models
• Observe a continuous variable x1,…,xN
• Fit a Gaussian with mean μ and standard deviation σ
– Standard procedure: write the log likelihood L = N(C − log σ) − Σ_j (x_j − μ)² / (2σ²), where C is a constant
– Set the derivatives to zero
• Results:
– μ = (1/N) Σ_j x_j (the sample mean)
– σ² = (1/N) Σ_j (x_j − μ)² (the sample variance)

Maximum Likelihood for Conditional Linear Gaussians
• Y is a child of X; the data are pairs (x_j, y_j)
• X is Gaussian; Y is a linear Gaussian function of X
– Y | X=x ~ N(ax + b, σ²)
• The ML estimates of a and b are given by least-squares regression, and σ by the standard error of the residuals (see the sketch below)

Back to Coin Flips
• What about Bayesian or MAP learning?
• Motivation:
– I pick a coin out of my pocket
– 1 flip turns up heads
– What’s the MLE? θ = 1/1 = 1, i.e., a coin that always lands heads, which is surely overconfident

Back to Coin Flips
• We need a prior distribution P(θ): for each θ, the probability that I believe in θ
• P(θ|d) ∝ P(d|θ) P(θ) = θ^c (1−θ)^(N−c) P(θ)

MAP Estimate
• Could maximize θ^c (1−θ)^(N−c) P(θ) using some optimization procedure
• It turns out that for some families of P(θ), the MAP estimate is easy to compute (conjugate priors)
– For the Bernoulli likelihood, the conjugate family is the Beta distributions

Beta Distribution
• Beta_{a,b}(θ) = α θ^(a−1) (1−θ)^(b−1)
– a, b are hyperparameters
– α is a normalization constant
– Mean at a/(a+b)

Posterior with a Beta Prior
• Posterior ∝ θ^c (1−θ)^(N−c) Beta_{a,b}(θ) = α θ^(c+a−1) (1−θ)^(N−c+b−1)
• The posterior is also a Beta distribution, Beta_{c+a, N−c+b}!
– See heads, increment a; see tails, increment b
– The prior specifies a “virtual count” of a heads and b tails
• MAP estimate (the posterior mode): θ = (c+a−1)/(N+a+b−2); the posterior mean is (c+a)/(N+a+b)
• (Both estimates appear in the coin-flip sketch below.)

Does This Work in General?
• Only specific likelihoods have conjugate priors of this kind
– Bernoulli, Poisson, geometric, Gaussian, exponential, …
• Otherwise, MAP estimation needs an (often expensive) numerical optimization

How to Deal with Missing Observations?
• A very difficult statistical problem in general
• E.g., surveys:
– Did the respondent leave political affiliation blank at random?
– Or do independents do this more often than someone with a strong affiliation?
• The problem is better behaved if a variable is completely hidden

Expectation Maximization for Gaussian Mixture Models
• Clustering: the data are generated by k Gaussian distributions
• Each datapoint has a label saying which Gaussian it belongs to, but the label is a hidden variable
• E step: compute the probability that each datapoint belongs to each Gaussian
• M step: compute the ML estimate of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
• (An EM loop for this model is sketched below.)
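Returning to the coin-flip thread above: a minimal sketch of the Bernoulli ML estimate θ = c/N and the Beta-prior posterior update. The counts and hyperparameters here are made up for illustration:

```python
# Bernoulli MLE and Beta-prior estimates for coin flips / candy draws.
# Illustrative data: c cherries out of N draws.
c, N = 7, 10

# Maximum likelihood: theta = c/N, from setting
# d/dtheta [c log(theta) + (N-c) log(1-theta)] = 0.
theta_ml = c / N

# Beta(a, b) prior = "virtual counts" of a heads and b tails.
a, b = 3, 3

# Conjugacy: the posterior is Beta(c + a, N - c + b) -- just add counts.
post_a, post_b = c + a, N - c + b

theta_mean = post_a / (post_a + post_b)            # posterior mean
theta_map  = (post_a - 1) / (post_a + post_b - 2)  # posterior mode (MAP)

print(theta_ml, theta_mean, theta_map)  # 0.7  0.625  ~0.643
```

The prior pulls both Bayesian estimates toward 1/2; with one observed head and no prior (a = b = 1), the MLE of 1.0 reappears, which is exactly the overconfidence the motivation slide points at.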
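The “Maximum Likelihood for Bayes Nets” slide says ML CPT entries are just observed fractions. Here is a small counting sketch for P(A|E,B), using a tiny made-up dataset rather than the lecture’s 1000 samples:

```python
# ML estimation of a CPT by counting, for complete data.
# Each sample is a dict of boolean values for E, B, A.
from collections import Counter

def estimate_cpt(samples):
    """Return ML estimates of P(A=true | E, B) for each observed parent setting."""
    hits, totals = Counter(), Counter()
    for s in samples:
        key = (s["E"], s["B"])
        totals[key] += 1
        hits[key] += s["A"]  # True counts as 1
    # Fraction of A=true among samples matching each (E, B) combination.
    return {key: hits[key] / totals[key] for key in totals}

# Tiny illustrative dataset.
samples = [
    {"E": True,  "B": True,  "A": True},
    {"E": True,  "B": False, "A": False},
    {"E": True,  "B": False, "A": True},
    {"E": False, "B": False, "A": False},
]
print(estimate_cpt(samples))  # {(T,T): 1.0, (T,F): 0.5, (F,F): 0.0}
```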
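For the two Gaussian slides above, a sketch of the closed-form fits, assuming NumPy is available; np.polyfit carries out the least-squares step for the conditional linear Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- ML fit of a Gaussian: sample mean and sample variance. ---
x = rng.normal(loc=2.0, scale=1.5, size=1000)
mu = x.mean()                  # (1/N) sum_j x_j
var = ((x - mu) ** 2).mean()   # (1/N) sum_j (x_j - mu)^2  (no N-1 correction in the ML estimate)

# --- ML fit of a conditional linear Gaussian Y | X=x ~ N(ax + b, sigma^2). ---
y = 3.0 * x - 1.0 + rng.normal(scale=0.5, size=x.shape)
a, b = np.polyfit(x, y, deg=1)            # least-squares slope and intercept
residuals = y - (a * x + b)
sigma = np.sqrt((residuals ** 2).mean())  # ML sigma from the residuals

print(mu, var)      # ~2.0, ~2.25
print(a, b, sigma)  # ~3.0, ~-1.0, ~0.5
```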
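Finally, a minimal EM loop for a two-component 1-D Gaussian mixture, matching the E and M steps just described; the data, initialization, and iteration count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative data: two 1-D Gaussian clusters whose labels are hidden.
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

# Initial guesses for mixture weights, means, and variances.
w   = np.array([0.5, 0.5])
mu  = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E step: responsibility r[i, k] = P(point i belongs to Gaussian k | params).
    p = w * gauss(x[:, None], mu, var)      # shape (N, 2)
    r = p / p.sum(axis=1, keepdims=True)

    # M step: ML estimate of each Gaussian, weighted by the responsibilities.
    nk  = r.sum(axis=0)
    w   = nk / len(x)
    mu  = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(w, mu, var)  # ~[0.6, 0.4], ~[-2, 3], ~[1, 1]
```

The M step is exactly the Gaussian ML fit from the earlier slide, except that each sample contributes fractionally according to the E-step responsibilities.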
Learning HMMs
• Want to find the transition and observation probabilities
• Data: many observation sequences {O_{1:t}^(j) for 1 ≤ j ≤ N}
• Problem: we don’t observe the X’s!
• [Diagram: unrolled HMM X0 → X1 → X2 → X3 with observations O1, O2, O3]
• Assume a stationary Markov chain over discrete states x1,…,xm
• Parameters:
– Initial-state probabilities λ_i = P(X0 = x_i)
– Transition probabilities π_ij = P(X_{t+1} = x_j | X_t = x_i)
– Observation probabilities φ_i = P(O | X_t = x_i)
• [Diagram: 3-state HMM with states x1, x2, x3, transitions such as π13 and π31, and observations O]

Expectation Maximization
• Collect the parameters into θ = (λ1,…,λm, π11, π12,…,πmm, φ1,…,φm) and initialize θ^(0) randomly
• E step: compute E[P(Z = z | θ^(k), O)], where Z ranges over all possible hidden state sequences (x1 x2 x3 x2, x2 x1 x1 x2, x2 x1 x3 x2, …)
– Result: a probability distribution over the hidden state at each time t
– This is the hard part…
• M step: compute θ^(k+1) as the ML estimate of the transition/observation distributions, given those expectations

E-Step on HMMs
• Computing the expectations can be done by:
– Sampling
– Running the forward/backward algorithm on the unrolled HMM (R&N p. 546)
• The latter gives the classic Baum-Welch algorithm (a minimal sketch appears at the end of these notes)
• Note that EM can still get stuck in local optima or even saddle points

Next Time
• Machine learning
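As promised above, a minimal Baum-Welch sketch: the EM loop with a forward/backward E step. The arrays lam, Pi, and Phi stand for λ, π, and φ; the binary observation sequences are random and purely illustrative, and no numerical scaling is used, so this is only suitable for short sequences:

```python
import numpy as np

rng = np.random.default_rng(2)

def forward_backward(obs, lam, Pi, Phi):
    """Smoothed state marginals (gamma) and pairwise marginals (xi) for one sequence."""
    T, m = len(obs), len(lam)
    alpha = np.zeros((T, m)); beta = np.ones((T, m))
    alpha[0] = lam * Phi[:, obs[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ Pi) * Phi[:, obs[t]]
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = Pi @ (Phi[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)   # P(X_t | O)
    xi = np.zeros((T - 1, m, m))                # P(X_t, X_{t+1} | O)
    for t in range(T - 1):
        x = alpha[t][:, None] * Pi * (Phi[:, obs[t + 1]] * beta[t + 1])[None, :]
        xi[t] = x / x.sum()
    return gamma, xi

def baum_welch(seqs, m, n_obs, iters=50):
    # Random normalized initialization of lambda, Pi (transitions), Phi (emissions).
    lam = np.ones(m) / m
    Pi  = rng.random((m, m));     Pi  /= Pi.sum(axis=1, keepdims=True)
    Phi = rng.random((m, n_obs)); Phi /= Phi.sum(axis=1, keepdims=True)
    for _ in range(iters):
        lam_n = np.zeros(m); Pi_n = np.zeros((m, m)); Phi_n = np.zeros((m, n_obs))
        for obs in seqs:                        # E step: accumulate expected counts
            gamma, xi = forward_backward(obs, lam, Pi, Phi)
            lam_n += gamma[0]
            Pi_n  += xi.sum(axis=0)
            for t, o in enumerate(obs):
                Phi_n[:, o] += gamma[t]
        # M step: normalize the expected counts into probabilities.
        lam = lam_n / lam_n.sum()
        Pi  = Pi_n  / Pi_n.sum(axis=1, keepdims=True)
        Phi = Phi_n / Phi_n.sum(axis=1, keepdims=True)
    return lam, Pi, Phi

# Illustrative run: 20 random binary observation sequences, 2 hidden states.
seqs = [rng.integers(0, 2, size=15) for _ in range(20)]
lam, Pi, Phi = baum_welch(seqs, m=2, n_obs=2)
print(lam, Pi, Phi, sep="\n")
```

The M step mirrors the counting estimates used for Bayes nets earlier, but with expected counts from the E step in place of observed ones; different random initializations can converge to different local optima, as the final slide warns.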