Statistical learning
Chapter 20, Sections 1–4

Outline
♦ Bayesian learning
♦ Maximum a posteriori and maximum likelihood learning
♦ Bayes net learning
  – ML parameter learning with complete data
  – linear regression
♦ Expectation-Maximization (EM) algorithm
♦ Instance-based learning

Full Bayesian learning
View learning as Bayesian updating of a probability distribution over the hypothesis space.
H is the hypothesis variable, with values h1, h2, . . . and prior P(H).
The jth observation dj gives the outcome of the random variable Dj; the training data are d = d1, . . . , dN.
Given the data so far, each hypothesis has a posterior probability:
  P(hi | d) = α P(d | hi) P(hi)
where P(d | hi) is called the likelihood.
Predictions use a likelihood-weighted average over the hypotheses:
  P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)
No need to pick one best-guess hypothesis!

Example
Suppose there are five kinds of bags of candies:
  10% are h1: 100% cherry candies
  20% are h2: 75% cherry candies + 25% lime candies
  40% are h3: 50% cherry candies + 50% lime candies
  20% are h4: 25% cherry candies + 75% lime candies
  10% are h5: 100% lime candies
Then we observe candies drawn from some bag:
What kind of bag is it? What flavour will the next candy be?

Posterior probability of hypotheses
[Figure: posterior probability P(hi | d) of each hypothesis h1–h5 as a function of the number of samples in d (0–10)]

Prediction probability
[Figure: P(next candy is lime | d) as a function of the number of samples in d (0–10)]

MAP approximation
Summing over the hypothesis space is often intractable
(e.g., there are 2^(2^6) = 18,446,744,073,709,551,616 Boolean functions of 6 attributes).
Maximum a posteriori (MAP) learning: choose hMAP maximizing P(hi | d),
i.e., maximize P(d | hi) P(hi), or equivalently log P(d | hi) + log P(hi).
The log terms can be viewed as (the negative of) the bits needed to encode the data given the hypothesis plus the bits needed to encode the hypothesis.
This is the basic idea of minimum description length (MDL) learning.
For deterministic hypotheses, P(d | hi) is 1 if consistent, 0 otherwise
⇒ MAP = simplest consistent hypothesis (cf. science)

ML approximation
For large data sets, the prior becomes irrelevant.
Maximum likelihood (ML) learning: choose hML maximizing P(d | hi),
i.e., simply get the best fit to the data; identical to MAP for a uniform prior
(which is reasonable if all hypotheses are of the same complexity).
ML is the “standard” (non-Bayesian) statistical learning method.

ML parameter learning in Bayes nets
Bag from a new manufacturer; fraction θ of cherry candies?
Any θ is possible: continuum of hypotheses hθ.
θ is a parameter for this simple (binomial) family of models.
[Bayes net: single node Flavor, with P(F = cherry) = θ]
Suppose we unwrap N candies, c cherries and ℓ = N − c limes.
These are i.i.d. (independent, identically distributed) observations, so
  P(d | hθ) = Π_{j=1..N} P(dj | hθ) = θ^c (1 − θ)^ℓ
Maximize this w.r.t. θ, which is easier for the log-likelihood:
  L(d | hθ) = log P(d | hθ) = Σ_{j=1..N} log P(dj | hθ) = c log θ + ℓ log(1 − θ)
  dL(d | hθ)/dθ = c/θ − ℓ/(1 − θ) = 0  ⇒  θ = c/(c + ℓ) = c/N
Seems sensible, but causes problems with 0 counts!
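To make the candy-bag calculation concrete, here is a minimal sketch (not part of the original slides) that recomputes the posterior and prediction curves plotted above, assuming for illustration that every observed candy is lime; all names are illustrative only.

```python
# Minimal sketch (not from the slides): full Bayesian learning on the
# candy-bag example.  The all-lime observation sequence is an assumption
# chosen for illustration.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]    # P(h1) .. P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]  # P(lime | hi) for h1 .. h5

def posterior(n_limes):
    """P(hi | d) after observing n_limes lime candies, via Bayes' rule."""
    unnormalized = [p * q ** n_limes for p, q in zip(priors, p_lime)]
    alpha = 1.0 / sum(unnormalized)   # normalizing constant
    return [alpha * u for u in unnormalized]

def p_next_lime(n_limes):
    """P(next candy is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(q * p for q, p in zip(p_lime, posterior(n_limes)))

for n in range(11):
    print(n, [round(p, 3) for p in posterior(n)], round(p_next_lime(n), 3))
```

Under the all-lime assumption the posterior mass shifts toward h5 and the predicted probability of lime rises from 0.5 toward 1, which is the qualitative shape of the two plots.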
Multiple parameters
Red/green wrapper depends probabilistically on flavor:
[Bayes net: Flavor → Wrapper, with P(F = cherry) = θ and P(W = red | F) = θ1 for F = cherry, θ2 for F = lime]
Likelihood for, e.g., a cherry candy in a green wrapper:
  P(F = cherry, W = green | hθ,θ1,θ2)
    = P(F = cherry | hθ,θ1,θ2) P(W = green | F = cherry, hθ,θ1,θ2)
    = θ (1 − θ1)
N candies, rc red-wrapped cherry candies, etc.:
  P(d | hθ,θ1,θ2) = θ^c (1 − θ)^ℓ · θ1^rc (1 − θ1)^gc · θ2^rℓ (1 − θ2)^gℓ
  L = [c log θ + ℓ log(1 − θ)] + [rc log θ1 + gc log(1 − θ1)] + [rℓ log θ2 + gℓ log(1 − θ2)]

Multiple parameters contd.
Derivatives of L contain only the relevant parameter:
  ∂L/∂θ  = c/θ − ℓ/(1 − θ) = 0      ⇒  θ  = c/(c + ℓ)
  ∂L/∂θ1 = rc/θ1 − gc/(1 − θ1) = 0  ⇒  θ1 = rc/(rc + gc)
  ∂L/∂θ2 = rℓ/θ2 − gℓ/(1 − θ2) = 0  ⇒  θ2 = rℓ/(rℓ + gℓ)
With complete data, parameters can be learned separately.

Example: linear Gaussian model
[Figures: P(y | x) for the linear Gaussian model, and data points with the ML straight-line fit]
Maximizing P(y | x) = (1/√(2πσ²)) e^(−(y − (θ1x + θ2))² / (2σ²)) w.r.t. θ1, θ2
  = minimizing E = Σ_{j=1..N} (yj − (θ1xj + θ2))²
That is, minimizing the sum of squared errors gives the ML solution for a linear fit, assuming Gaussian noise of fixed variance.

Expectation Maximization (EM)
When to use:
• Data is only partially observable
• Unsupervised clustering (target value unobservable)
• Supervised learning (some instance attributes unobservable)
Some uses:
• Training Bayesian belief networks
• Unsupervised clustering (AUTOCLASS)
• Learning Hidden Markov Models

Generating Data from Mixture of k Gaussians
[Figure: probability density p(x) of a mixture of Gaussians over x]
Each instance x is generated by
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means
Given:
• Instances from X generated by a mixture of k Gaussian distributions
• Unknown means ⟨µ1, . . . , µk⟩ of the k Gaussians
• We don’t know which instance xi was generated by which Gaussian
Determine:
• Maximum likelihood estimates of ⟨µ1, . . . , µk⟩
Think of the full description of each instance as yi = ⟨xi, zi1, zi2⟩, where
• zij is 1 if xi was generated by the jth Gaussian
• xi is observable
• zij is unobservable

EM for Estimating k Means contd.
EM Algorithm: Pick a random initial h = ⟨µ1, µ2⟩, then iterate:
E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = ⟨µ1, µ2⟩ holds:
  E[zij] = p(x = xi | µ = µj) / Σ_{n=1..2} p(x = xi | µ = µn)
         = e^(−(xi − µj)² / (2σ²)) / Σ_{n=1..2} e^(−(xi − µn)² / (2σ²))
M step: Calculate a new maximum likelihood hypothesis h′ = ⟨µ′1, µ′2⟩, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated above. Replace h = ⟨µ1, µ2⟩ by h′ = ⟨µ′1, µ′2⟩:
  µj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]

EM Algorithm
Converges to a local maximum likelihood h and provides estimates of the hidden variables zij.
In fact, it finds a local maximum of E[ln P(Y | h)], where
• Y is the complete data (observable plus unobservable variables)
• the expected value is taken over the possible values of the unobserved variables in Y
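As a concrete companion to the two-means updates above, here is a minimal sketch (not part of the original slides). The synthetic data, the fixed σ, and the initialization are assumptions made for illustration; the E and M steps follow the formulas on the preceding slide.

```python
# Minimal sketch (not from the slides): EM for the means of a mixture of
# two 1-D Gaussians with known, equal variance sigma^2.
import math
import random

random.seed(0)
sigma = 1.0
# synthetic data: half the points around -2, half around +3 (an assumption)
data = ([random.gauss(-2.0, sigma) for _ in range(100)] +
        [random.gauss(3.0, sigma) for _ in range(100)])

mu = [min(data), max(data)]   # simple deterministic initial h = <mu1, mu2>

def density(x, m):
    """Unnormalized Gaussian density; the constant cancels in the E step."""
    return math.exp(-(x - m) ** 2 / (2 * sigma ** 2))

for _ in range(50):
    # E step: E[z_ij] = p(x_i | mu_j) / sum_n p(x_i | mu_n)
    E = []
    for x in data:
        w = [density(x, mu[0]), density(x, mu[1])]
        E.append([wj / sum(w) for wj in w])
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = [sum(E[i][j] * data[i] for i in range(len(data))) /
          sum(E[i][j] for i in range(len(data)))
          for j in range(2)]

print([round(m, 2) for m in mu])   # should approach the true means -2 and 3
```

Each iteration first soft-assigns every point to the two Gaussians (E step) and then re-estimates each mean as the weighted average of the points (M step), which is exactly the µj update above.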
General EM Problem
Given:
• Observed data X = {x1, . . . , xm}
• Unobserved data Z = {z1, . . . , zm}
• Parameterized probability distribution P(Y | h), where
  – Y = {y1, . . . , ym} is the full data, with yi = xi ∪ zi
  – h are the parameters
Determine: h that (locally) maximizes E[ln P(Y | h)]
Many uses:
• Training Bayesian belief networks
• Unsupervised clustering (e.g., k means)
• Hidden Markov Models

General EM Method
Define a function Q(h′ | h) that evaluates the expected log-likelihood of the full data Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
  Q(h′ | h) ← E[ln P(Y | h′) | h, X]
EM Algorithm:
Estimation (E) step: Calculate Q(h′ | h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
  Q(h′ | h) ← E[ln P(Y | h′) | h, X]
Maximization (M) step: Replace hypothesis h by the hypothesis h′ that maximizes this Q function:
  h ← argmax_{h′} Q(h′ | h)

Instance-Based Learning
Key idea: just store all training examples ⟨xi, f(xi)⟩.
Nearest neighbor:
• Given a query instance xq, first locate the nearest training example xn, then estimate f̂(xq) ← f(xn)
k-Nearest neighbor:
• Given xq, take a vote among its k nearest neighbors (if the target function is discrete-valued)
• take the mean of the f values of its k nearest neighbors (if it is real-valued):
  f̂(xq) ← (Σ_{i=1..k} f(xi)) / k

When To Consider Nearest Neighbor
• Instances map to points in ℝⁿ
• Fewer than 20 attributes per instance
• Lots of training data
Advantages:
• Training is very fast
• Can learn complex target functions
• Doesn’t lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes
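Below is a minimal sketch (not part of the original slides) of the nearest-neighbor rules above; the Euclidean distance metric and the toy training set are assumptions chosen for illustration.

```python
# Minimal sketch (not from the slides): k-nearest-neighbor prediction
# for points in R^n, with either a majority vote or a mean over f values.
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points in R^n (an assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x_q, k=3, discrete=True):
    """train is a list of (x_i, f(x_i)) pairs; x_q is the query instance."""
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], x_q))[:k]
    values = [f for _, f in nearest]
    if discrete:
        return Counter(values).most_common(1)[0][0]   # vote among k neighbors
    return sum(values) / k                            # mean of k f values

# usage: classify a query point against a toy 2-D training set
train = [((0, 0), 'a'), ((0, 1), 'a'), ((1, 0), 'a'), ((5, 5), 'b'), ((6, 5), 'b')]
print(knn_predict(train, (0.5, 0.5), k=3))            # -> 'a'
```

Training here is just storing the examples; all the work happens at query time, which is why nearest neighbor is fast to train but slow to query, as noted above.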