CpSc 810: Machine Learning
Bayesian Learning

Copyright Notice
Most slides in this presentation are adapted from the textbook slides and various other sources. The copyright belongs to the original authors. Thanks!

Two Roles for Bayesian Methods
Bayesian decision theory came long before version spaces, decision tree learning, and neural networks. It was studied in the field of statistical theory and, more specifically, in the field of pattern recognition.
Provides practical learning algorithms:
- Naive Bayes learning
- Bayesian belief network learning
- Combines prior knowledge (prior probabilities) with observed data
- Requires prior probabilities
Provides a useful conceptual framework:
- Provides a "gold standard" for evaluating other learning algorithms
- Additional insight into Occam's razor

Bayes Theorem
Goal: to determine the most probable hypothesis, given the data D plus any initial knowledge about the prior probabilities of the various hypotheses in H.
- Prior probability of h, P(h): reflects any background knowledge we have about the chance that h is a correct hypothesis (before having observed the data).
- Prior probability of D, P(D): reflects the probability that training data D will be observed given no knowledge about which hypothesis h holds.
- Conditional probability of the observation D, P(D|h): denotes the probability of observing data D given some world in which hypothesis h holds.

Bayes Theorem (Cont'd)
- Posterior probability of h, P(h|D): represents the probability that h holds given the observed training data D. It reflects our confidence that h holds after we have seen the training data D, and it is the quantity that machine learning researchers are interested in.
Bayes theorem allows us to compute P(h|D):
  P(h|D) = P(D|h) P(h) / P(D)

Choosing Hypotheses: Maximum A Posteriori (MAP)
Goal: to find the most probable hypothesis h from a set of candidate hypotheses H given the observed data D.
MAP hypothesis:
  hMAP = argmax_{h∈H} P(h|D)
       = argmax_{h∈H} P(D|h) P(h) / P(D)
       = argmax_{h∈H} P(D|h) P(h)
If every hypothesis in H is equally probable a priori, we only need to consider the likelihood of the data D given h, P(D|h). Then hMAP becomes the maximum likelihood hypothesis:
  hML = argmax_{h∈H} P(D|h)

Bayes Theorem: An Example
Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
  P(cancer) = 0.008      P(¬cancer) = 0.992
  P(+|cancer) = 0.98     P(−|cancer) = 0.02
  P(+|¬cancer) = 0.03    P(−|¬cancer) = 0.97
For a positive test result:
  P(+|cancer) P(cancer) = 0.98 × 0.008 = 0.0078
  P(+|¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.0298
So hMAP = ¬cancer, even though the test came back positive.

Basic Formulas for Probabilities
- Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A1, …, An are mutually exclusive with Σi P(Ai) = 1, then P(B) = Σi P(B|Ai) P(Ai)

Brute Force MAP Hypothesis Learner
1. For each hypothesis h in H, calculate the posterior probability
     P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis hMAP with the highest posterior probability:
     hMAP = argmax_{h∈H} P(h|D)
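The brute-force MAP learner is easy to sketch in code. Below is a minimal Python sketch (not from the slides) that applies it to the cancer-test example: the two hypotheses are cancer and ¬cancer, the observed data D is a positive test result, and the variable names are purely illustrative.

```python
# Minimal sketch of the brute-force MAP learner, applied to the
# cancer-test example from the slides (names are illustrative).

priors = {"cancer": 0.008, "no_cancer": 0.992}          # P(h)
likelihood_pos = {"cancer": 0.98, "no_cancer": 0.03}    # P(+ | h)

# P(D) via the theorem of total probability, D = positive test result
p_data = sum(likelihood_pos[h] * priors[h] for h in priors)

# Posterior P(h | D) = P(D | h) P(h) / P(D) for every hypothesis
posteriors = {h: likelihood_pos[h] * priors[h] / p_data for h in priors}

h_map = max(posteriors, key=posteriors.get)
print(posteriors)   # roughly {'cancer': 0.21, 'no_cancer': 0.79}
print(h_map)        # 'no_cancer': the MAP hypothesis despite the + test
```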
Relation to Concept Learning
Consider our usual concept learning task:
- Instance space X, hypothesis space H, training examples D
- Consider the Find-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})
Does Find-S output a MAP hypothesis?

Relation to Concept Learning
Assume:
- a fixed set of instances <x1, …, xm>
- D is the set of classifications D = <c(x1), …, c(xm)>
Choose P(D|h):
- P(D|h) = 1 if h is consistent with D
- P(D|h) = 0 if h is inconsistent with D
Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H.
P(D) can be derived from the theorem of total probability:
  P(D) = Σ_{hi∈H} P(D|hi) P(hi)
       = Σ_{hi∈VS_{H,D}} 1 · (1/|H|) + Σ_{hi∉VS_{H,D}} 0 · (1/|H|)
       = |VS_{H,D}| / |H|

Relation to Concept Learning
If h is inconsistent with D:
  P(h|D) = P(D|h) P(h) / P(D) = 0 · P(h) / P(D) = 0
If h is consistent with D:
  P(h|D) = P(D|h) P(h) / P(D) = 1 · (1/|H|) / P(D) = (1/|H|) · (|H| / |VS_{H,D}|) = 1 / |VS_{H,D}|
So every consistent hypothesis is a MAP hypothesis.

Evolution of Posterior Probabilities for Consistent Learners
(Figure: the distributions P(h), P(h|D1), and P(h|D1,D2) over the hypotheses; as training data accumulates, the posterior probability mass concentrates on the hypotheses that remain consistent.)

Inductive Bias: A Second View

Maximum Likelihood and Least-Squared Error
Consider any real-valued target function f and training examples <xi, di>, where the di are noisy training values:
  di = f(xi) + ei
and ei is a random noise variable drawn independently for each xi from a Gaussian distribution with mean 0. Then the maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
  hML = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²

Maximum Likelihood and Least-Squared Error
Assume the training examples are mutually independent given h:
  hML = argmax_{h∈H} p(D|h)
      = argmax_{h∈H} Π_{i=1}^{m} p(di|h)
      = argmax_{h∈H} Π_{i=1}^{m} (1/√(2πσ²)) e^{−(di − h(xi))² / (2σ²)}
Maximize the natural log of this instead:
  hML = argmax_{h∈H} Σ_{i=1}^{m} [ ln(1/√(2πσ²)) − (di − h(xi))² / (2σ²) ]
      = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))² / (2σ²)
      = argmax_{h∈H} Σ_{i=1}^{m} −(di − h(xi))²
      = argmin_{h∈H} Σ_{i=1}^{m} (di − h(xi))²
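To see this equivalence numerically, here is a small Python sketch (not part of the slides; the linear hypothesis space, noise level, and grid of candidate weights are assumptions chosen for illustration) that scores a grid of hypotheses h_w(x) = w·x both by Gaussian log-likelihood and by sum of squared errors; the same hypothesis wins under both criteria.

```python
import math
import random

# Assumed toy setup: target f(x) = 2x with Gaussian noise,
# hypothesis space H = {h_w(x) = w*x} for a grid of w values.
random.seed(0)
sigma = 0.5
xs = [i / 10 for i in range(1, 21)]
ds = [2.0 * x + random.gauss(0.0, sigma) for x in xs]   # d_i = f(x_i) + e_i

def log_likelihood(w):
    # sum_i ln[ (1/sqrt(2*pi*sigma^2)) * exp(-(d_i - h(x_i))^2 / (2*sigma^2)) ]
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (d - w * x) ** 2 / (2 * sigma ** 2) for x, d in zip(xs, ds))

def squared_error(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

ws = [1.5 + 0.01 * k for k in range(101)]     # candidate hypotheses
w_ml  = max(ws, key=log_likelihood)           # maximum likelihood choice
w_lse = min(ws, key=squared_error)            # least-squared-error choice
print(w_ml, w_lse)                            # both criteria select the same w
```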
Maximum Likelihood Hypotheses for Predicting Probability
Consider predicting survival probability from patient data:
- Training examples <xi, di>, where di is 1 or 0
- We want to train a neural network to output a probability given xi (not a 0 or 1)
In this case:
  P(di | h, xi) = h(xi)^{di} (1 − h(xi))^{1−di}
Then:
  P(D|h) = Π_{i=1}^{m} h(xi)^{di} (1 − h(xi))^{1−di} P(xi)
  hML = argmax_{h∈H} Π_{i=1}^{m} h(xi)^{di} (1 − h(xi))^{1−di} P(xi)
      = argmax_{h∈H} Π_{i=1}^{m} h(xi)^{di} (1 − h(xi))^{1−di}
      = argmax_{h∈H} Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]

Gradient Search to Maximize Likelihood in a Neural Net
Denote by G(h,D) the quantity
  G(h,D) = Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ]
The negative of G(h,D) is called the cross entropy. Then
  ∂G(h,D)/∂wjk = Σ_{i=1}^{m} [ ∂G(h,D)/∂h(xi) ] · [ ∂h(xi)/∂wjk ]
               = Σ_{i=1}^{m} [ ∂( di ln h(xi) + (1 − di) ln(1 − h(xi)) ) / ∂h(xi) ] · ∂h(xi)/∂wjk
               = Σ_{i=1}^{m} [ (di − h(xi)) / ( h(xi)(1 − h(xi)) ) ] · ∂h(xi)/∂wjk
Assume our neural network is constructed from a single layer of sigmoid units; then
  ∂h(xi)/∂wjk = σ′(xi) xijk = h(xi)(1 − h(xi)) xijk
Thus:
  ∂G(h,D)/∂wjk = Σ_{i=1}^{m} (di − h(xi)) xijk

Gradient Search to Maximize Likelihood in a Neural Net
Weight update rule for a sigmoid unit:
  wjk ← wjk + Δwjk,  where  Δwjk = η Σ_{i=1}^{m} (di − h(xi)) xijk
Compare to the backpropagation rule that minimizes squared error:
  Δwjk = η Σ_{i=1}^{m} h(xi)(1 − h(xi)) (di − h(xi)) xijk
The rule that minimizes cross entropy seeks the maximum likelihood hypothesis under the assumption that the observed Boolean value is a probabilistic function of the input.

Minimum Description Length Principle
Occam's razor: prefer the shortest hypothesis.
Rewrite hMAP:
  hMAP = argmax_{h∈H} P(D|h) P(h)
       = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]
       = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]
Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p uses −log2 p bits.

Minimum Description Length Principle
So interpret:
- −log2 P(h) is the description length L_C1(h) of h under the optimal coding C1 for the hypothesis space H;
- −log2 P(D|h) is the description length L_C2(D|h) of the data D given hypothesis h under its optimal coding C2.
Prefer the hypothesis that minimizes length(h) + length(misclassifications).
Minimum Description Length principle (MDL): prefer the hypothesis h that minimizes
  hMDL = argmin_{h∈H} [ L_C1(h) + L_C2(D|h) ]
where L_C(x) is the description length of x under encoding C. hMAP equals hMDL under optimal encodings.

Minimum Description Length Principle
Example: H = decision trees, D = training data labels.
- L_C1(h) is the number of bits needed to describe tree h
- L_C2(D|h) is the number of bits needed to describe D given h
- Note that L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions.
Hence hMDL trades off tree size against training errors.

Most Probable Classification of New Instances
So far we have sought the most probable hypothesis given the data D (i.e., hMAP). Given a new instance x, what is its most probable classification? hMAP(x) is not necessarily the most probable classification. Consider three possible hypotheses:
  P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
Given a new instance x:
  h1(x) = +, h2(x) = −, h3(x) = −
What is the most probable classification of x?

Bayes Optimal Classifier
One great advantage of Bayesian decision theory is that it gives us a lower bound on the classification error that can be obtained for a given problem.
Bayes optimal classification: the most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities:
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)
where V is the set of all the values a classification can take and vj is one possible such classification.
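A minimal Python sketch (not from the slides) of this weighted vote, using the three-hypothesis example above, where the posteriors are 0.4, 0.3, 0.3 and h1 predicts + while h2 and h3 predict −:

```python
# Minimal sketch of the Bayes optimal classifier for the three-hypothesis
# example in these slides: P(h1|D)=0.4, P(h2|D)=0.3, P(h3|D)=0.3,
# h1(x)=+, h2(x)=-, h3(x)=-.

posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h_i | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}    # h_i(x)
values_V = ["+", "-"]                              # V

def bayes_optimal(values_V, posteriors, predictions):
    # argmax over v of sum_i P(v | h_i) P(h_i | D); here P(v | h_i) is
    # 1 if h_i predicts v and 0 otherwise (deterministic hypotheses).
    score = {v: sum(p for h, p in posteriors.items() if predictions[h] == v)
             for v in values_V}
    return max(score, key=score.get), score

label, score = bayes_optimal(values_V, posteriors, predictions)
print(score)   # {'+': 0.4, '-': 0.6}
print(label)   # '-' even though the single MAP hypothesis h1 predicts '+'
```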
Bayes Optimal Classifier
Properties of the Bayes optimal classifier:
- No other classification method using the same hypothesis space and the same prior knowledge can outperform this classifier on average!
- The Bayes optimal classifier may not be in H!
Example:
  P(h1|D) = 0.4, P(−|h1) = 0, P(+|h1) = 1
  P(h2|D) = 0.3, P(−|h2) = 1, P(+|h2) = 0
  P(h3|D) = 0.3, P(−|h3) = 1, P(+|h3) = 0
Therefore:
  Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
  Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
and
  argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −

Gibbs Classifier
The Bayes optimal classifier provides the best result, but it can be expensive if there are many hypotheses.
Gibbs algorithm:
1. Choose one hypothesis at random, according to P(h|D)
2. Use this hypothesis to classify the new instance
Surprising fact: assuming target concepts are drawn at random from H according to the priors on H,
  E[error_Gibbs] ≤ 2 E[error_BayesOptimal]

Naïve Bayes Classifier
Let each instance x of a training set D be described by a conjunction of n attribute values <a1, a2, …, an> and let f(x), the target function, be such that f(x) ∈ V, a finite set.
Bayesian approach:
  vMAP = argmax_{vj∈V} P(vj | a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
       = argmax_{vj∈V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayesian approach: we assume that the attribute values are conditionally independent given the target value, so that
  P(a1, a2, …, an | vj) = Πi P(ai | vj)
(and not too large a data set is required).
Naïve Bayes classifier:
  vNB = argmax_{vj∈V} P(vj) Πi P(ai | vj)

Naïve Bayes: Example
Consider the weather data, where we have to classify the instance
  <Outlook = sunny, Temp = cool, Hum = high, Wind = strong>
The task is to predict the value (yes or no) of the concept PlayTennis. We apply the naïve Bayes rule:
  vMAP = argmax_{vj∈{yes,no}} P(vj) Πi P(ai | vj)
       = argmax_{vj∈{yes,no}} P(vj) P(Outlook=sunny | vj) P(Temp=cool | vj) P(Hum=high | vj) P(Wind=strong | vj)

Naïve Bayes: Example
(Class: P = PlayTennis yes, N = PlayTennis no)
  Outlook    Temperature  Humidity  Windy  Class
  sunny      hot          high      false  N
  sunny      hot          high      true   N
  overcast   hot          high      false  P
  rain       mild         high      false  P
  rain       cool         normal    false  P
  rain       cool         normal    true   N
  overcast   cool         normal    true   P
  sunny      mild         high      false  N
  sunny      cool         normal    false  P
  rain       mild         normal    false  P
  sunny      mild         normal    true   P
  overcast   mild         high      true   P
  overcast   hot          normal    false  P
  rain       mild         high      true   N

  P(yes) = 9/14                  P(no) = 5/14
Outlook: P(sunny|yes) = 2/9      P(sunny|no) = 3/5
         P(overcast|yes) = 4/9   P(overcast|no) = 0
         P(rain|yes) = 3/9       P(rain|no) = 2/5
Temp:    P(hot|yes) = 2/9        P(hot|no) = 2/5
         P(mild|yes) = 4/9       P(mild|no) = 2/5
         P(cool|yes) = 3/9       P(cool|no) = 1/5
Hum:     P(high|yes) = 3/9       P(high|no) = 4/5
         P(normal|yes) = 6/9     P(normal|no) = 1/5
Windy:   P(true|yes) = 3/9       P(true|no) = 3/5
         P(false|yes) = 6/9      P(false|no) = 2/5

Naïve Bayes: Example
Classify the instance <Outlook = sunny, Temp = cool, Hum = high, Wind = strong> (Wind = strong corresponds to Windy = true in the table):
  P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) ≈ 0.0053
  P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) ≈ 0.0206
Thus the naïve Bayes classifier assigns the value no to PlayTennis.

Naïve Bayes: Estimating Probabilities
To estimate the probability of an attribute value A = v for a given class C we can use:
- Relative frequency: nc/n, where nc is the number of training instances that belong to class C and have value v for attribute A, and n is the number of training instances of class C;
- m-estimate: (nc + mp)/(n + m), where nc and n are as above, p is a prior estimate of the probability being estimated, and m is the weight given to p (the equivalent sample size).
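To make the computation concrete, here is a small Python sketch (not from the slides) that estimates the relative-frequency probabilities from the 14-example weather table above and classifies the given instance; it should reproduce the 0.0053 and 0.0206 scores.

```python
from collections import Counter, defaultdict

# Weather data from the table above: (Outlook, Temp, Humidity, Windy) -> Class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

class_counts = Counter(row[-1] for row in data)        # n for each class
attr_counts = defaultdict(Counter)                     # n_c for (attribute position, value, class)
for *attrs, cls in data:
    for j, value in enumerate(attrs):
        attr_counts[(j, value)][cls] += 1

def naive_bayes_score(instance, cls):
    # P(v_j) * prod_i P(a_i | v_j), using relative-frequency estimates
    score = class_counts[cls] / len(data)
    for j, value in enumerate(instance):
        score *= attr_counts[(j, value)][cls] / class_counts[cls]
    return score

x = ("sunny", "cool", "high", "true")      # Wind=strong corresponds to Windy=true
scores = {cls: naive_bayes_score(x, cls) for cls in class_counts}
print(scores)                              # {'N': ~0.0206, 'P': ~0.0053}
print(max(scores, key=scores.get))         # 'N', i.e. PlayTennis = no
```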
Learning to Classify Text
Simplifications:
- each document is represented by a vector of words;
- the words are considered as numerical attributes wk;
- the values of the word attributes wk are the frequencies with which the words occur in the text.
To estimate the probability P(wk | v) we use:
  P(wk | v) = (nk + 1) / (n + |Vocabulary|)
where n is the total number of word positions in all the documents (instances) whose target value is v, nk is the number of times word wk is found in these n word positions, and |Vocabulary| is the total number of distinct words found in the training data.

Naïve Bayes: Subtleties
The conditional independence assumption
  P(a1, a2, …, an | vj) = Πi P(ai | vj)
is often violated, but naïve Bayes works surprisingly well anyway. Note that we do not need the estimated posteriors P(vj|x) to be correct; we need only that
  argmax_{vj∈V} P(vj) Πi P(ai | vj) = argmax_{vj∈V} P(vj) P(a1, …, an | vj)

Bayesian Belief Networks
The Bayes optimal classifier is often too costly to apply. The naïve Bayes classifier uses the conditional independence assumption to defray these costs; however, in many cases such an assumption is overly restrictive. Bayesian belief networks provide an intermediate approach that allows stating conditional independence assumptions that apply to subsets of the variables. They allow combining prior knowledge about dependencies among variables with observed training data.

Conditional Independence
We say that X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z, i.e.,
  ∀(xi, yj, zk)  P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
or, more compactly, P(X|Y,Z) = P(X|Z).
This definition can be extended to sets of variables as well: we say that the set of variables X1…Xl is conditionally independent of the set of variables Y1…Ym given the set of variables Z1…Zn if
  P(X1…Xl | Y1…Ym, Z1…Zn) = P(X1…Xl | Z1…Zn)

Bayesian Belief Networks
A network represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its non-descendants, given its immediate predecessors.
- The network is a directed acyclic graph.

Bayesian Belief Networks
The network represents the joint probability distribution over all variables, e.g. P(Storm, BusTourGroup, …, ForestFire). In general,
  P(y1, …, yn) = Πi P(yi | Parents(Yi))
where Parents(Yi) denotes the immediate predecessors of Yi in the graph. So the joint distribution is fully defined by the graph plus the conditional probabilities P(yi | Parents(Yi)).

Inference in Bayesian Networks
A Bayesian network can be used to compute the probability distribution for any subset of network variables given the values or distributions of any subset of the remaining variables. Unfortunately, exact inference of probabilities for an arbitrary Bayesian network is known to be NP-hard. In practice, inference can succeed in many cases:
- Exact inference methods work well for some network structures.
- Monte Carlo methods "simulate" the network randomly to calculate approximate solutions.

Learning Bayesian Belief Networks
- The network structure is given in advance and all the variables are fully observable in the training examples. ==> Trivial case: just estimate the conditional probabilities.
- The network structure is given in advance but only some of the variables are observable in the training data. ==> Similar to learning the weights for the hidden units of a neural net: gradient ascent procedure.
- The network structure is not known in advance. ==> Use a heuristic search or constraint-based technique to search through potential structures.
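As a concrete illustration of the factorization above, the Python sketch below stores P(yi | Parents(Yi)) tables for the Storm/BusTourGroup/Campfire/ForestFire variables mentioned in these slides, computes the joint probability of a full assignment as the product of table entries, and does brute-force inference by enumeration. The network structure and all CPT numbers here are hypothetical, invented only for illustration; they are not the textbook's network.

```python
# Hypothetical mini-network over variables named in these slides.
# Structure and CPT numbers are made up for illustration only.
import itertools

parents = {
    "Storm": [],
    "BusTourGroup": [],
    "Campfire": ["Storm", "BusTourGroup"],
    "ForestFire": ["Storm", "Campfire"],
}

# cpt[var][parent_values] = P(var = True | Parents = parent_values)
cpt = {
    "Storm": {(): 0.2},
    "BusTourGroup": {(): 0.5},
    "Campfire": {(True, True): 0.1, (True, False): 0.05,
                 (False, True): 0.6, (False, False): 0.3},
    "ForestFire": {(True, True): 0.4, (True, False): 0.2,
                   (False, True): 0.1, (False, False): 0.01},
}

def joint(assignment):
    # P(y1, ..., yn) = prod_i P(y_i | Parents(Y_i))
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

print(joint({"Storm": True, "BusTourGroup": False,
             "Campfire": False, "ForestFire": False}))

# Brute-force conditional inference by enumeration (exponential in the
# number of variables, fine for tiny networks): P(query | evidence)
vars_ = list(parents)
def prob(query, evidence):
    num = den = 0.0
    for values in itertools.product([True, False], repeat=len(vars_)):
        a = dict(zip(vars_, values))
        if all(a[k] == v for k, v in evidence.items()):
            p = joint(a)
            den += p
            if all(a[k] == v for k, v in query.items()):
                num += p
    return num / den

print(prob({"ForestFire": True}, {"Storm": True}))
```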
Gradient Ascent for Bayes Nets
Let wijk denote one entry in the conditional probability table for variable Yi in the network:
  wijk = P(Yi = yij | Parents(Yi) = the list uik of parent values)
E.g., if Yi = Campfire, then uik might be <Storm = T, BusTourGroup = F>.
Perform gradient ascent by repeatedly:
1. Updating all wijk using the training data D:
     wijk ← wijk + η Σ_{d∈D} Ph(yij, uik | d) / wijk
2. Renormalizing the wijk to ensure Σj wijk = 1 and 0 ≤ wijk ≤ 1.

More on Learning Bayes Nets
The EM algorithm can also be used; repeatedly:
- Calculate the probabilities of the unobserved variables, assuming the current hypothesis h
- Calculate new wijk to maximize E[ln P(D|h)], where D now includes both the observed and the unobserved variables
When the structure is unknown, algorithms use greedy search to add/subtract edges and nodes; this is an active research topic.

Summary: Bayesian Belief Networks
- Combine prior knowledge with observed data
- The impact of prior knowledge is to lower the sample complexity

Expectation Maximization (EM)
When to use:
- Data is only partially observable
- Unsupervised clustering (target value unobservable)
- Supervised learning (some instance attributes unobservable)
Some uses:
- Training Bayesian belief networks
- Unsupervised clustering
- Learning hidden Markov models

Generating Data from a Mixture of k Gaussians
Each instance x is generated by:
1. Choosing one of the k Gaussians with uniform probability
2. Generating an instance at random according to that Gaussian

EM for Estimating k Means
Given:
- Instances from X generated by a mixture of k Gaussian distributions
- Unknown means <μ1, μ2, …, μk> of the k Gaussians
- We do not know which instance xi was generated by which Gaussian
Determine: the maximum likelihood estimates of <μ1, μ2, …, μk>.
Think of the full description of each instance as yi = <xi, zi1, …, zik>, where
- zij is 1 if xi was generated by the jth Gaussian
- xi is observable, zij is unobservable

EM for Estimating k Means
EM algorithm: pick a random initial h = <μ1, μ2, …, μk>, then iterate:
E step: Calculate the expected value E[zij] of each hidden variable zij, assuming that the current hypothesis h = <μ1, μ2, …, μk> holds:
  E[zij] = p(x = xi | μ = μj) / Σ_{n=1}^{k} p(x = xi | μ = μn)
         = e^{−(xi − μj)² / (2σ²)} / Σ_{n=1}^{k} e^{−(xi − μn)² / (2σ²)}

EM for Estimating k Means
M step: Calculate a new maximum likelihood hypothesis h′ = <μ1′, μ2′, …, μk′>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step:
  μj ← Σ_{i=1}^{m} E[zij] xi / Σ_{i=1}^{m} E[zij]
Then replace the hypothesis h = <μ1, μ2, …, μk> by the new hypothesis h′ = <μ1′, μ2′, …, μk′> and iterate.

EM Algorithm
- Converges to a local maximum likelihood hypothesis h and provides estimates of the hidden variables zij
- In fact, it converges to a local maximum of E[ln P(Y|h)]
  - Y is the complete data (observable plus unobservable variables)
  - The expected value is taken over the possible values of the unobserved variables in Y

General EM Problem
Given:
- Observed data X = {x1, x2, …, xm}
- Unobserved data Z = {z1, z2, …, zm}
- A parameterized probability distribution P(Y|h), where Y = {y1, y2, …, ym} is the full data, yi = xi ∪ zi, and h are the parameters
Determine: h that locally maximizes E[ln P(Y|h)].

General EM Method
Define the likelihood function Q(h′|h), which calculates Y = X ∪ Z using the observed X and the current parameters h to estimate Z:
  Q(h′|h) ← E[ln P(Y|h′) | h, X]
EM algorithm:
- Estimation (E) step: calculate Q(h′|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
    Q(h′|h) ← E[ln P(Y|h′) | h, X]
- Maximization (M) step: replace hypothesis h by the hypothesis h′ that maximizes this Q function:
    h ← argmax_{h′} Q(h′|h)
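To make the two steps concrete, here is a short Python sketch (not from the slides; the toy data and the choice of a known, shared variance are assumptions) of EM for estimating the means of a mixture of k one-dimensional Gaussians: the E step computes E[zij] exactly as above, and the M step re-estimates each μj as the E[zij]-weighted mean of the instances.

```python
import math
import random

def em_k_means(xs, k, sigma=1.0, iters=50):
    """EM for the means of a mixture of k 1-D Gaussians (equal mixing
    weights, known shared variance sigma^2), following the E/M steps above."""
    mus = random.sample(xs, k)                  # random initial h = <mu_1, ..., mu_k>
    for _ in range(iters):
        # E step: E[z_ij] = exp(-(x_i - mu_j)^2 / (2 sigma^2)) / sum_n exp(...)
        resp = []
        for x in xs:
            w = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            s = sum(w)
            resp.append([wj / s for wj in w])
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) /
               sum(r[j] for r in resp) for j in range(k)]
    return mus

# Toy data (assumed): two Gaussians centered near -4 and +3 with sigma = 1
random.seed(1)
xs = [random.gauss(-4, 1) for _ in range(100)] + \
     [random.gauss(3, 1) for _ in range(100)]
print(sorted(em_k_means(xs, k=2)))   # expected to converge near [-4, 3]
```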