CS340: Machine Learning — Review for midterm
Kevin Murphy

Machine learning: a probabilistic approach
• We want to make models of data so we can find patterns and predict the future.
• We will use probability theory to represent our uncertainty about what the right model is; this will induce uncertainty in our predictions.
• We will update our beliefs about the model as we acquire data (this is called "learning"); as we get more data, the uncertainty in our predictions will decrease.

Outline
• Probability basics
• Probability density functions
• Parameter estimation
• Prediction
• Multivariate probability models
• Conditional probability models
• Information theory

Probability basics
• Chain (product) rule: p(A, B) = p(A) p(B|A)
• Marginalization (sum) rule: p(B) = Σ_a p(A = a, B)
• Hence we derive Bayes' rule:
      p(A|B) = p(A, B) / p(B) = p(A) p(B|A) / p(B)

PMF
• A probability mass function (PMF) is defined on a discrete random variable X ∈ {1, . . . , K} and satisfies
      Σ_{x=1}^{K} P(X = x) = 1,   0 ≤ P(X = x) ≤ 1
• This is just a histogram!

PDF
• A probability density function (PDF) is defined on a continuous random variable X ∈ IR (or IR+, or [0, 1], etc.) and satisfies
      ∫ p(X = x) dx = 1,   0 ≤ p(X = x)
• Note that it is possible that p(X = x) > 1, since the density gets multiplied by dx:
      P(x ≤ X ≤ x + dx) ≈ p(x) dx
• Statistics books often write fX(x) for PDFs and use P() for PMFs.

Moments of a distribution
• Mean (first moment):
      µ = E[X] = Σ_x x p(x)
• Variance (second central moment):
      σ² = Var[X] = Σ_x (x − µ)² p(x) = E[X²] − µ²

CDFs
The cumulative distribution function is defined as
      F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du
[Figures: the standard Normal pdf and the corresponding Gaussian cdf]

Quantiles of a distribution
      P(a ≤ X ≤ b) = FX(b) − FX(a)
e.g., if Z ∼ N(0, 1), then FZ(z) = Φ(z), so
      p(−1.96 ≤ Z ≤ 1.96) = Φ(1.96) − Φ(−1.96) = 0.975 − 0.025 = 0.95
Hence if X ∼ N(µ, σ),
      p(−1.96σ < X − µ < 1.96σ) = 1 − 2 × 0.025 = 0.95
so the interval µ ± 1.96σ contains 0.95 of the probability mass.
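The 95% interval above is easy to check numerically. A minimal sketch (not from the slides; the helper name normal_cdf is mine) using the standard-library error function, since Φ(x) = ½(1 + erf(x/√2)):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of N(mu, sigma), via the error function:
    Phi(x) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# P(-1.96 <= Z <= 1.96) for Z ~ N(0, 1): should be about 0.95
mass = normal_cdf(1.96) - normal_cdf(-1.96)
print(round(mass, 3))  # → 0.95
```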
Bernoulli distribution
• X ∈ {0, 1}, p(X = 1) = θ, p(X = 0) = 1 − θ
• X ∼ Be(θ)
• E[X] = θ

Multinomial distribution
• X ∈ {1, . . . , K}, p(X = k) = θk, X ∼ Mu(θ)
• 0 ≤ θk ≤ 1, Σ_{k=1}^{K} θk = 1
• E[Xi] = θi

(Univariate) Gaussian distribution
• X ∈ IR, X ∼ N(µ, σ), µ ∈ IR, σ ∈ IR+
      N(x|µ, σ) := (1/√(2πσ²)) exp(−(x − µ)²/(2σ²))
• E[X] = µ

Beta distribution
• X ∈ [0, 1], X ∼ Beta(α1, α0), 0 ≤ α0, α1
      Beta(X|α1, α0) ∝ X^{α1−1} (1 − X)^{α0−1}
• E[X] = α1/(α1 + α0)
[Figures: Beta densities for (a, b) = (0.1, 0.1), (1, 1), (2, 3), (8, 4)]

Dirichlet distribution
• X ∈ [0, 1]^K, X ∼ Dir(α1, . . . , αK), 0 ≤ αk
      Dir(X|α1, . . . , αK) ∝ Π_k Xk^{αk−1}
• E[Xi] = αi/α, where α = Σ_k αk
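As a quick numerical check of the Beta mean formula E[X] = α1/(α1 + α0), one can integrate the unnormalized density on a midpoint grid. A sketch (the function name is illustrative, not from the slides):

```python
def beta_mean_numeric(a1, a0, steps=100000):
    """E[X] under Beta(a1, a0), by integrating the unnormalized
    density x^(a1-1) * (1-x)^(a0-1) on a midpoint grid."""
    h = 1.0 / steps
    norm = mean = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        w = x ** (a1 - 1) * (1 - x) ** (a0 - 1)
        norm += w          # approximates the normalizing constant
        mean += x * w      # approximates the (unnormalized) first moment
    return mean / norm

print(beta_mean_numeric(8, 4))  # close to 8 / (8 + 4) = 2/3
```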
Parameter estimation: MLE
• Likelihood of iid data D = {xn}, n = 1 : N:
      L(θ) = p(D|θ) = Π_n p(xn|θ)
• Log-likelihood:
      ℓ(θ) = log p(D|θ) = Σ_n log p(xn|θ)
• MLE = maximum likelihood estimate:
      θ̂ = argmax_θ L(θ) = argmax_θ ℓ(θ), found by solving ∂ℓ/∂θi = 0

MLE for Bernoullis
      p(Xn|θ) = θ^{Xn} (1 − θ)^{1−Xn}
      ℓ(θ) = Σ_n [Xn log θ + (1 − Xn) log(1 − θ)] = N1 log θ + N0 log(1 − θ)
where N1 = Σ_n I(Xn = 1). Setting the derivative to zero,
      dℓ/dθ = N1/θ − N0/(1 − θ) = 0
      θ̂ = N1/(N1 + N0)

MLE for Multinomials
      p(Xn|θ) = Π_k θk^{I(Xn=k)}
      ℓ(θ) = Σ_n Σ_k I(Xn = k) log θk = Σ_k Nk log θk
Add a Lagrange multiplier for the constraint Σ_k θk = 1:
      ℓ̃ = Σ_k Nk log θk + λ(1 − Σ_k θk)
      ∂ℓ̃/∂θk = Nk/θk − λ = 0
Solving, λ = Σ_k Nk = N, so
      θ̂k = Nk/N

Parameter estimation: Bayesian
      p(θ|D) ∝ p(D|θ) p(θ) = [Π_n p(xn|θ)] p(θ)
We say p(θ) is conjugate to p(D|θ) if p(θ|D) is in the same family as p(θ). This allows recursive (sequential) updating.
The MLE is a point estimate; Bayes computes the full posterior, a natural model of uncertainty (e.g., posterior variance, entropy, credible intervals, etc.).

Parameter estimation: MAP
For many models, as |D| → ∞,
      p(θ|D) → δ(θ − θ̂MAP) → δ(θ − θ̂MLE)
The MAP (maximum a posteriori) estimate is a posterior mode:
      θ̂MAP = argmax_θ p(D|θ) p(θ)
It is also called a penalized likelihood estimate, since
      θ̂MAP = argmax_θ log p(D|θ) − λ c(θ)
where c(θ) = − log p(θ) is a penalty or regularizer, and λ is the strength of the regularizer.

Bayes estimate for Bernoulli
      Xn ∼ Be(θ), θ ∼ Beta(α1, α0)
      p(θ|D) ∝ [Π_n θ^{Xn} (1 − θ)^{1−Xn}] θ^{α1−1} (1 − θ)^{α0−1}
             = θ^{N1} (1 − θ)^{N0} θ^{α1−1} (1 − θ)^{α0−1}
             ∝ Beta(α1 + N1, α0 + N0)

Bayes estimate for Multinomial
      Xn ∼ Mu(θ), θ ∼ Dir(α1, . . . , αK)
      p(θ|D) ∝ [Π_n Π_k θk^{I(Xn=k)}] Π_k θk^{αk−1} = Π_k θk^{Nk} θk^{αk−1}
             ∝ Dir(α1 + N1, . . . , αK + NK)
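The Bernoulli MLE and the conjugate Beta update are each a one-liner over counts. A minimal sketch (helper names are mine, not from the slides):

```python
def bernoulli_mle(xs):
    """theta_hat = N1 / (N1 + N0), from a list of 0/1 observations."""
    n1 = sum(xs)
    return n1 / len(xs)

def beta_posterior(xs, a1=1.0, a0=1.0):
    """Conjugate update: Beta(a1, a0) prior -> Beta(a1 + N1, a0 + N0)."""
    n1 = sum(xs)
    n0 = len(xs) - n1
    return a1 + n1, a0 + n0

data = [1, 0, 1, 1, 0, 1]          # N1 = 4, N0 = 2
print(round(bernoulli_mle(data), 3))  # → 0.667
print(beta_posterior(data, 1, 1))     # → (5.0, 3.0)
```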
Predicting the future
The posterior predictive density is obtained by Bayesian model averaging:
      p(X|D) = ∫ p(X|θ) p(θ|D) dθ
Plug-in principle: if p(θ|D) ≈ δ(θ − θ̂), then p(X|D) ≈ p(X|θ̂).

Predictive for Bernoullis
      p(θ|D) = Beta(α1 + N1, α0 + N0) := Beta(α1′, α0′)
      p(X = 1|D) = ∫ p(X = 1|θ) p(θ|D) dθ = ∫ θ p(θ|D) dθ = E[θ|D] = α1′/(α1′ + α0′)

Cross validation
We can use CV to find the model with the best predictive performance.

Multivariate probability models
Independent features (bag of words model)
• Each data case Xn is now a vector of variables Xni, i = 1 : p, n = 1 : N.
• e.g., Xni ∈ {1, . . . , K} is the i'th word in the n'th sentence
• e.g., Xni ∈ IR is the i'th attribute in the n'th feature vector
• We need to define p(Xn|θ) for feature vectors of potentially variable size:
      p(Xn|θ) = Π_i p(Xni|θi)
• e.g., Xni ∈ {0, 1}, product of Bernoullis:
      p(Xn|θ) = Π_i θi^{Xni} (1 − θi)^{1−Xni}
• e.g., Xni ∈ {1, . . . , K}, product of multinomials:
      p(Xn|θ) = Π_i Π_k θik^{I(Xni=k)}

MLE for bag of words model
We can estimate each θi separately, e.g., for a product of multinomials:
      p(D|θ) = Π_n Π_i Π_k θik^{I(Xni=k)} = Π_i Π_k θik^{Nik},  where Nik := Σ_n I(Xni = k)
      ℓ(θi) = Σ_k Nik log θik
      θ̂ik = Nik/Ni,  where Ni = Σ_k Nik

Bayes estimate for bag of words model
      p(θ|D) ∝ Π_i p(θi) [Π_n p(Xni|θi)]
We can update each θi separately, e.g., for a product of multinomials:
      p(θ|D) ∝ Π_i Dir(θi|αi1, . . . , αiK) [Π_k θik^{Nik}]
             ∝ Π_i Dir(θi|αi1 + Ni1, . . . , αiK + NiK)
Factored prior × factored likelihood = factored posterior.
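The Bernoulli posterior predictive reduces to the posterior mean of θ, so it is easy to compute directly. A sketch (function name is mine; with the uniform Beta(1, 1) prior this is Laplace's rule of succession):

```python
def predictive_bernoulli(n1, n0, a1=1.0, a0=1.0):
    """p(X = 1 | D) = (a1 + N1) / (a1 + N1 + a0 + N0),
    the posterior mean of theta under a Beta(a1, a0) prior."""
    return (a1 + n1) / (a1 + n1 + a0 + n0)

# 3 heads, 1 tail, uniform Beta(1, 1) prior: (1 + 3) / (2 + 4) = 2/3
print(round(predictive_bernoulli(3, 1), 3))  # → 0.667
```

Note that unlike the plug-in estimate N1/(N1 + N0), the predictive never assigns probability 0 or 1 after finitely many observations.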
(Discrete time) Markov model
      p(Xn|θ) = p(Xn1|π) Π_{t=2}^{p} p(Xnt|Xn,t−1, A)
For a discrete state space, Xnt ∈ {1, . . . , K},
      πi = p(X1 = i),   Aij = p(Xt = j|Xt−1 = i)
      p(Xn|θ) = Π_i πi^{I(Xn1=i)} Π_t Π_i Π_j Aij^{I(Xnt=j, Xn,t−1=i)} = Π_i πi^{Ni1} Π_i Π_j Aij^{Nij}
where Ni1 := I(Xn1 = i) and Nij := Σ_t I(Xn,t = j, Xn,t−1 = i).

MLE for Markov model
      p(D|θ) = Π_n Π_i πi^{I(Xn1=i)} Π_t Π_i Π_j Aij^{I(Xnt=j, Xn,t−1=i)} = Π_i πi^{Ni1} Π_i Π_j Aij^{Nij}
where now Ni1 := Σ_n I(Xn1 = i) and Nij := Σ_n Σ_t I(Xn,t = j, Xn,t−1 = i). Hence
      π̂i = Ni1/N,   Âij = Nij / Σ_{j′} Nij′

Stationary distribution for a Markov chain
πi is the fraction of time we spend in state i, given by the principal eigenvector:
      Aᵀπ = π   (balance flow across a cut set)
For a two-state chain with transition probabilities α (1→2) and β (2→1), balance gives π1 α = π2 β; since π1 + π2 = 1,
      π1 = β/(α + β),   π2 = α/(α + β)

Conditional density estimation
• Goal: learn p(y|x), where x is the input and y is the output.
• Generative model:
      p(y|x) ∝ p(x|y) p(y)
• Discriminative model, e.g., logistic regression:
      p(y|x) = σ(wᵀx)

Naive Bayes = conditional bag of words model
A generative model p(y|x) ∝ p(x|y) p(y) in which the features of x are conditionally independent given y:
      p(y|x) ∝ p(y) Π_{i=1}^{p} p(xi|y)
e.g., for multinomials,
      p(y|x) ∝ Π_c πc^{I(y=c)} Π_i Π_k θcik^{I(xi=k, y=c)}

MLE for Naive Bayes
We estimate the parameters for each feature separately, partitioning the data based on the label y:
      p(D|θ, π) = [Π_c πc^{Nc}] [Π_c Π_i Π_k θcik^{Ncik}]
      Ncik = Σ_n I(xni = k, yn = c)
      θ̂cik = Ncik / Σ_{k′} Ncik′,   π̂c = Nc/N

Conditional Markov model
      p(x|y) = p(x1|y) Π_t p(xt|xt−1, y)
Fit a separate Markov model for every class of y. See homework 5.
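The Markov-model MLE is just count-and-normalize, and the stationary distribution can be found by power iteration. A sketch (helper names are mine; the code uses the row-stochastic convention Aij = p(j|i), so it iterates the left-eigenvector equation πᵀA = πᵀ):

```python
def markov_mle(seqs, K):
    """MLE for a K-state Markov chain: count initial states and
    transitions, then normalize (pi_hat = Ni1/N, A_hat rows sum to 1)."""
    pi = [0.0] * K
    A = [[0.0] * K for _ in range(K)]
    for s in seqs:
        pi[s[0]] += 1
        for a, b in zip(s, s[1:]):
            A[a][b] += 1
    pi = [c / len(seqs) for c in pi]
    A = [[c / sum(row) for c in row] if sum(row) else
         [1.0 / K] * K for row in A]   # unvisited state: uniform row
    return pi, A

def stationary(A, iters=200):
    """Power iteration for the stationary distribution pi^T A = pi^T."""
    K = len(A)
    pi = [1.0 / K] * K
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(K)) for j in range(K)]
    return pi

# Two-state chain with alpha = p(1->2) = 0.2, beta = p(2->1) = 0.1:
# stationary distribution should be [beta, alpha] / (alpha + beta) = [1/3, 2/3]
A = [[0.8, 0.2], [0.1, 0.9]]
print([round(p, 4) for p in stationary(A)])  # → [0.3333, 0.6667]
```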
Summary
• Probability basics: Bayes' rule, pdfs, pmfs
• Probability density functions: Bernoulli, Multinomial; Gaussian, Beta, Dirichlet
• Parameter estimation: MLE, Bayesian
• Prediction: plug-in, Bayesian, cross validation
• Multivariate probability models: bag of words, Markov models
• Conditional probability models: conditional BOW (Naive Bayes), conditional Markov
• Information theory

Entropy
• A measure of uncertainty.
• Definition for a PMF pk, k = 1 : K:
      H(p) = − Σ_k pk log2 pk
• e.g., the binary entropy function, p1 = θ, p0 = 1 − θ:
      H(θ) = −[p(X = 1) log2 p(X = 1) + p(X = 0) log2 p(X = 0)]
           = −[θ log2 θ + (1 − θ) log2(1 − θ)]

Joint entropy
      H(X, Y) = − Σ_{x,y} p(x, y) log p(x, y)
If X ⊥ Y, then H(X, Y) = H(X) + H(Y). Pf:
      H(X, Y) = − Σ_{x,y} p(x)p(y) log p(x)p(y)
              = − Σ_{x,y} p(x)p(y) log p(x) − Σ_{x,y} p(x)p(y) log p(y) = H(X) + H(Y)
In general,
      H(X, Y) ≤ H(X) + H(Y)
Pf:
      I(X, Y) = H(X) + H(Y) − H(X, Y) ≥ 0

Conditional entropy
      H(Y|X) := Σ_x p(x) H(Y|X = x) = H(X, Y) − H(X)
In hw 5, you showed that H(Y|X) ≤ H(Y): on average, conditioning cannot increase entropy. Pf:
      H(Y|X) = H(X, Y) − H(X) ≤ H(X) + H(Y) − H(X) = H(Y)
But for a particular value, ∃y. H(X|y) > H(X).

Mutual information
In hw 5, you showed that
      I(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y|X)
Substituting H(Y|X) = H(X, Y) − H(X) yields
      I(X, Y) = H(X) + H(Y) − H(X, Y)
I(X, Y) = 0 if X ⊥ Y. Pf:
      I(X; Y) = Σ_y Σ_x p(x)p(y) log [p(x)p(y) / (p(x)p(y))] = Σ_{x,y} p(x)p(y) log 1 = 0
I(X, Y) ≥ 0. Pf:
      I(X; Y) = Σ_y Σ_x p(x, y) log [p(x, y) / (p(x)p(y))] = KL(p(x, y)||p(x)p(y)) ≥ 0

Relative entropy
      D(p||q) := Σ_k pk log(pk/qk) = Σ_k pk log pk − Σ_k pk log qk = − Σ_k pk log qk − H(p)
Hence it measures the extra number of bits needed to encode data if we use q instead of p.
D(p||q) = 0 if p = q. Pf: trivial.
D(p||q) ≥ 0. Pf: use Jensen's inequality.
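The identity I(X, Y) = H(X) + H(Y) − H(X, Y) can be verified on small PMF tables. A sketch (helper names are mine, not from the slides):

```python
import math

def entropy(p):
    """H(p) = -sum_k p_k log2 p_k, in bits (with 0 log 0 = 0)."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def mutual_information(joint):
    """I(X, Y) = H(X) + H(Y) - H(X, Y), from a joint PMF table
    joint[x][y]; the marginals are row and column sums."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    hxy = entropy([p for row in joint for p in row])
    return entropy(px) + entropy(py) - hxy

# Independent fair bits: I = 0
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # → 0.0
# Perfectly correlated bits: I = 1 bit
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # → 1.0
```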
MDL principle
Minimum description length principle: encoding an event X = x which occurs with probability p(x) takes − log p(x) bits. The total cost of data D and model H is
      L(D, H) = − log p(D|H) − log p(H)
where − log p(D|H) is the #bits for the data and − log p(H) is the #bits for the model.
The MDL principle says: pick the model with the shortest total description,
      H* = argmin_H L(D, H)
This is equivalent to the MAP hypothesis:
      H* = argmax_H log p(D|H) + log p(H)
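A toy illustration of the MDL trade-off for Bernoulli data (the data counts, parameter values, and model-cost bits below are made up for illustration):

```python
import math

def description_length(n1, n0, theta, prior_bits):
    """L(D, H) = -log2 p(D|H) + model cost in bits, for Bernoulli
    data with N1 ones and N0 zeros under parameter theta."""
    nll = -(n1 * math.log2(theta) + n0 * math.log2(1 - theta))
    return nll + prior_bits

# 18 ones, 2 zeros: compare a fair coin (cheap model, 1 bit to specify)
# against a biased coin theta = 0.8 (costlier model, 5 bits to specify).
fair = description_length(18, 2, 0.5, prior_bits=1)
biased = description_length(18, 2, 0.8, prior_bits=5)
print(fair > biased)  # the biased model wins: shorter total code length
```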