Bayesian Learning
Computer Science Department
CS 9633 Machine Learning
Bayesian Learning
• Probabilistic approach to inference
• Assumption
– Quantities of interest are governed by probability distributions
– Optimal decisions can be made by reasoning
about probabilities and observations
• Provides quantitative approach to weighing
how evidence supports alternative
hypotheses
Why is Bayesian Learning
Important?
• Some Bayesian approaches (such as naive Bayes) are very practical learning methods and are competitive with other approaches
• Provides a useful perspective for
understanding many learning algorithms
that do not explicitly manipulate
probabilities
Important Features
• Model is incrementally updated with training
examples
• Prior knowledge can be combined with observed data
to determine the final probability of the hypothesis
– Asserting prior probability of candidate hypotheses
– Asserting a probability distribution over observations for
each hypothesis
• Can accommodate methods that make probabilistic
predictions
• New instances can be classified by combining
predictions of multiple hypotheses
• Can provide a gold standard for evaluating
hypotheses
Practical Problems
• Typically require initial knowledge of many probabilities, which can be estimated from:
– Background knowledge
– Previously available data
– Assumptions about the form of the distributions
• Significant computational cost of determining the Bayes optimal hypothesis
– Linear in the number of hypotheses in the general case
– Significantly lower in certain special cases
Bayes Theorem
• Goal: learn the “best” hypothesis
• Assumption in Bayes learning: the “best”
hypothesis is the most probable hypothesis
• Bayes theorem allows computation of most
probable hypothesis based on
– Prior probability of hypothesis
– Probability of observing certain data given the
hypothesis
– Observed data itself
Notation
P(h)     Prior probability of hypothesis h
P(D)     Prior probability of the data D
P(D|h)   Probability of observing D given h (the likelihood of the data given h)
P(h|D)   Probability that h holds given the data (the posterior probability of h)
Bayes Theorem
• Based on definitions
of P(D|h) and P(h|D)
P(h|D) = P(D|h) P(h) / P(D)
Maximum A Posteriori
Hypothesis
• Many learning algorithms try to identify
the most probable hypothesis h ∈ H
given observations D
• This is the maximum a posteriori
hypothesis (MAP hypothesis)
Identifying the MAP Hypothesis
using Bayes Theorem
hMAP ≡ argmax_{h ∈ H} P(h|D)
     = argmax_{h ∈ H} P(D|h) P(h) / P(D)
     = argmax_{h ∈ H} P(D|h) P(h)
Equally Probable
Hypotheses
When all hypotheses are equally probable a priori (P(hi) = P(hj) for all i and j):
hMAP = argmax_{h ∈ H} P(D|h) P(h) = argmax_{h ∈ H} P(D|h)
Any hypothesis that maximizes P(D|h) is a maximum likelihood (ML) hypothesis:
hML ≡ argmax_{h ∈ H} P(D|h)
Bayes Theorem and
Concept Learning
• Concept Learning Task
H: hypothesis space
X: instance space
Target concept c: X → {0,1}
Brute-Force MAP Learning
Algorithm
• For each hypothesis h in H, calculate the
posterior probability
P(h|D) = P(D|h) P(h) / P(D)
• Output the hypothesis with the highest
posterior probability
hMAP = argmax_{h ∈ H} P(h|D)
To Apply Brute Force MAP
Learning
• Specify P(h)
• Specify P(D|h)
An Example
• Assume
– Training data D is noise free (di = c(xi))
– The target concept is contained in H
– We have no a priori reason to believe one
hypothesis is more likely than any other
P(h) = 1 / |H|   for all h ∈ H
Probability of Data Given
Hypothesis
P(D|h) = 1  if di = h(xi) for all di in D
         0  otherwise
Apply the algorithm
• Step 1 (two cases):
P(h|D) = P(D|h) P(h) / P(D)
– Case 1 (h is inconsistent with D):
P(h|D) = (0 · P(h)) / P(D) = 0
– Case 2 (h is consistent with D):
P(h|D) = (1 · 1/|H|) / P(D) = (1 · 1/|H|) / (|VS_H,D| / |H|) = 1 / |VS_H,D|
where VS_H,D is the version space of H with respect to D (the subset of hypotheses in H consistent with D).
Step 2
• Every consistent hypothesis has posterior probability 1 / |VS_H,D|
• Every inconsistent hypothesis has posterior probability 0
MAP Hypotheses and Consistent Learners
• Under the assumptions above (noise-free data, uniform prior), every learner that outputs a hypothesis consistent with the training data outputs a MAP hypothesis, for example:
• FIND-S (finds the maximally specific consistent hypothesis)
• Candidate-Elimination (finds all consistent hypotheses)
Maximum Likelihood and
Least-Squared Error Learning
• New problem: learning a continuous-valued target function
• We will show that, under certain assumptions, any learning algorithm that minimizes the squared error between its hypothesis predictions and the training data will output a maximum likelihood hypothesis.
Problem Setting
• Learner L
• Instance space X
• Hypothesis space H, where each h: X → R
• The task of L is to learn an unknown target function f: X → R
• We have m training examples
• The target value of each example is corrupted by random noise drawn from a Normal distribution
Work Through Derivation
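A compact sketch of the derivation, under the stated assumptions (each observed target value is di = f(xi) + ei, with ei drawn independently from a zero-mean Normal distribution with variance σ²):

hML = argmax_{h ∈ H} p(D|h)
    = argmax_{h ∈ H} ∏_{i=1..m} (1/√(2πσ²)) exp(−(di − h(xi))² / (2σ²))
    = argmax_{h ∈ H} Σ_{i=1..m} −(di − h(xi))² / (2σ²)      (take ln, drop constants)
    = argmin_{h ∈ H} Σ_{i=1..m} (di − h(xi))²

So, under these assumptions, the maximum likelihood hypothesis is the one that minimizes the sum of squared errors over the training data.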
Why Normal Distribution for
Noise?
• It's easy to work with
• Good approximation of many physical
processes
• Important point: we are only dealing
with noise in the target function—not the
attribute values.
Bayes Optimal Classifier
• Two Questions:
– What is the most probable hypothesis
given the training data?
» Find MAP hypothesis
– What is the most probable classification
given the training data?
Example
• Three hypotheses:
P(h1|D) = 0.35
P(h2|D) = 0.45
P(h3|D) = 0.20
• New instance x
h1 predicts negative
h2 predicts positive
h3 predicts negative
• What is the predicted class using hMAP?
• What is the predicted class using all hypotheses?
Bayes Optimal Classification
• The most probable classification of a new instance is
obtained by combining the predictions of all
hypotheses, weighted by their posterior probabilities.
• Suppose set of values for classification is from set V
(each possible value is vj)
• Probability that vj is the correct classification for new
instance is:
P(vj | D) = Σ_{hi ∈ H} P(vj | hi) P(hi | D)
• Pick the vj with the max probability as the predicted
class
Bayes Optimal Classifier
argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj | hi) P(hi | D)
Apply this to the previous example:
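Worked out with the numbers from the previous example (writing the two class values as + and −):

P(+|D) = P(+|h1)·P(h1|D) + P(+|h2)·P(h2|D) + P(+|h3)·P(h3|D) = 0·0.35 + 1·0.45 + 0·0.20 = 0.45
P(−|D) = 1·0.35 + 0·0.45 + 1·0.20 = 0.55

So the Bayes optimal classification is negative, even though hMAP = h2 predicts positive.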
Bayes Optimal Classification
• Gives the optimal error-minimizing solution to
prediction and classification problems.
• Requires probability of exact combination of
evidence
• All classification methods can be viewed as
approximations of Bayes rule with varying
assumptions about conditional probabilities
– Assume they come from some distribution
– Assume conditional independence
– Assume underlying model of specific format
(linear combination of evidence, decision tree)
Simplifications of Bayes
Rule
• Given observations of attribute values a1, a2, …, an, compute the most probable target value vMAP
vMAP = argmax_{vj ∈ V} P(vj | a1, a2, …, an)
• Use Bayes theorem to rewrite this as:
vMAP = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj) / P(a1, a2, …, an)
     = argmax_{vj ∈ V} P(a1, a2, …, an | vj) P(vj)
Naïve Bayes
• The most usual simplification of Bayes Rule is to assume
conditional independence of the observations
– Because it is approximately true
– Because it is computationally convenient
• Assume the probability of observing the conjunction a1,
a2, …an is the product of the probabilities of the individual
attributes
vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
• Learning consists of estimating probabilities
Simple Example
• Two classes C1 and C2.
• Two features
– a1: Male, Female
– a2: Blue eyes, Brown eyes
• Instance (Male with blue eyes) What is the class?
Probability       C1     C2
P(Ci)             0.4    0.6
P(Male|Ci)        0.1    0.2
P(BlueEyes|Ci)    0.3    0.2
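A quick check of this example against the naive Bayes rule above; the short Python sketch below just multiplies the numbers from the table (the variable names are illustrative):

```python
# Naive Bayes score for each class: P(Ci) * P(Male|Ci) * P(BlueEyes|Ci)
priors = {"C1": 0.4, "C2": 0.6}
p_male = {"C1": 0.1, "C2": 0.2}
p_blue = {"C1": 0.3, "C2": 0.2}

scores = {c: priors[c] * p_male[c] * p_blue[c] for c in priors}
print(scores)                       # {'C1': 0.012, 'C2': 0.024}
print(max(scores, key=scores.get))  # C2 is the naive Bayes prediction
```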
Estimating Probabilities
(Classifying Executables)
• Two Classes (Malicious, Benign)
• Features
– a1: GUI present (yes/no)
– a2: Deletes files (yes/no)
– a3: Allocates memory (yes/no)
– a4: Length (<1K, 1-10K, >10K)
Instance   a1    a2    a3    a4    Class
1          Yes   No    No    Yes   B
2          Yes   No    No    No    B
3          No    Yes   Yes   No    M
4          No    No    Yes   Yes   M
5          Yes   No    No    Yes   B
6          Yes   No    No    No    M
7          Yes   Yes   Yes   No    M
8          Yes   Yes   No    Yes   M
9          No    No    No    Yes   B
10         No    No    Yes   No    M
Classify the Following
Instance
• <Yes, No, Yes, Yes>
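A sketch of the full calculation in Python, estimating every probability by its relative frequency nc/n in the ten training instances above (the data structures and names are illustrative):

```python
# Training data from the table: (a1, a2, a3, a4) -> class
data = [
    (("Yes", "No",  "No",  "Yes"), "B"),
    (("Yes", "No",  "No",  "No"),  "B"),
    (("No",  "Yes", "Yes", "No"),  "M"),
    (("No",  "No",  "Yes", "Yes"), "M"),
    (("Yes", "No",  "No",  "Yes"), "B"),
    (("Yes", "No",  "No",  "No"),  "M"),
    (("Yes", "Yes", "Yes", "No"),  "M"),
    (("Yes", "Yes", "No",  "Yes"), "M"),
    (("No",  "No",  "No",  "Yes"), "B"),
    (("No",  "No",  "Yes", "No"),  "M"),
]
query = ("Yes", "No", "Yes", "Yes")

def score(cls):
    """P(cls) times the product of P(ai = query[i] | cls), each estimated as nc/n."""
    rows = [x for x, c in data if c == cls]
    result = len(rows) / len(data)            # prior P(cls)
    for i, value in enumerate(query):
        result *= sum(1 for x in rows if x[i] == value) / len(rows)
    return result

print({c: score(c) for c in ("B", "M")})
# {'B': 0.0, 'M': 0.0333...}  -> predict M.  B collapses to zero because
# P(a3=Yes|B) = 0/4, which is exactly the problem addressed on the next slides.
```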
Estimating Probabilities
• To estimate P(C|D)
• Let n be the number of training examples
labeled D
• Let nc be the number labeled D that are also
labeled C
• P(C|D) was estimated as nc/n
• Problems
– This is a biased underestimate of the probability
– When the term is 0, it dominates all others
Use m-estimate of
probability
(nc + m·p) / (n + m)
• p is a prior estimate of the probability we are trying to determine (we often assume attribute values are equally probable)
• m is a constant called the equivalent sample size; it can be viewed as augmenting the n actual observations with m virtual samples distributed according to p
Repeat Estimates
• Use equal priors for attribute values
• Use m value of 1
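Repeating the calculation with the m-estimate, reusing data and query from the sketch above (m = 1, and p = 1/k for an attribute with k possible values):

```python
n_values = (2, 2, 2, 3)   # yes/no for a1-a3, three length bins for a4

def m_estimate_score(cls, m=1):
    rows = [x for x, c in data if c == cls]
    result = len(rows) / len(data)            # prior P(cls)
    for i, value in enumerate(query):
        nc = sum(1 for x in rows if x[i] == value)
        p = 1 / n_values[i]                   # equal prior over the attribute's values
        result *= (nc + m * p) / (len(rows) + m)
    return result

print({c: m_estimate_score(c) for c in ("B", "M")})
# B is no longer exactly zero, though M still has the higher score here.
```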
Bayesian Belief Networks
• Naïve Bayes is based on the assumption of conditional independence
• Bayesian belief networks provide a tractable method for specifying dependencies among variables
Terminology
• A Bayesian Belief Network describes the probability distribution
over a set of random variables Y1, Y2, …Yn
• Each variable Yi can take on the set of values V(Yi)
• The joint space of the set of variables Y is the cross product V(Y1) × V(Y2) × … × V(Yn)
• Each item in the joint space corresponds to one possible
assignment of values to the tuple of variables <Y1, …Yn>
• Joint probability distribution: specifies the probabilities of the
items in the joint space
• A Bayesian Network provides a way to describe the joint
probability distribution in a compact manner.
Conditional Independence
• Let X, Y, and Z be three discrete-valued
random variables.
• We say that X is conditionally independent of
Y given Z if the probability distribution
governing X is independent of the value of Y
given a value for Z
(∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
which is abbreviated as P(X | Y, Z) = P(X | Z)
Bayesian Belief Network
– A set of random variables makes up the nodes of
the network
– A set of directed links or arrows connects pairs of
nodes. The intuitive meaning of an arrow from X to
Y is that X has a direct influence on Y.
– Each node has a conditional probability table that
quantifies the effects that the parents have on the
node. The parents of a node are all those nodes
that have arrows pointing to it.
– The graph has no directed cycles (it is a DAG)
Example (from Judea
Pearl)
You have a new burglar alarm installed at home. It is
fairly reliable at detecting a burglary, but also responds
on occasion to minor earthquakes. You also have two
neighbors, John and Mary, who have promised to call
you at work when they hear the alarm. John always
calls when he hears the alarm, but sometimes confuses
the telephone ringing with the alarm and calls then, too.
Mary, on the other hand, likes rather loud music and
sometimes misses the alarm altogether. Given the
evidence of who has or has not called, we would like to
estimate the probability of a burglary.
Step 1
• Determine what the propositional
(random) variables should be
• Determine causal (or another type of
influence) relationships and develop the
topology of the network
Topology of Belief
Network
[Network topology: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls.]
Step 2
• Specify a conditional probability table or CPT for
each node.
• Each row in the table contains the conditional
probability of each node value for a conditioning
case (possible combinations of values for parent
nodes).
• In the example, the possible values for each
node are true/false.
• The sum of the probabilities over the values of a node, given a particular conditioning case, is 1.
Example:
CPT for Alarm Node
Burglary   Earthquake   P(Alarm=True)   P(Alarm=False)
True       True         0.950           0.050
True       False        0.940           0.060
False      True         0.290           0.710
False      False        0.001           0.999
Complete Belief Network
Burglary → Alarm ← Earthquake;  Alarm → JohnCalls;  Alarm → MaryCalls

P(B) = 0.001      P(E) = 0.002

B      E      P(A|B,E)
T      T      0.95
T      F      0.94
F      T      0.29
F      F      0.001

A      P(J|A)
T      0.90
F      0.05

A      P(M|A)
T      0.70
F      0.01
Semantics of Belief
Networks
• View 1: A belief network is a
representation of the joint probability
distribution (“joint”) of a domain.
• The joint completely specifies an
agent’s probability assignments to all
propositions in the domain (both simple
and complex.)
Network as
representation of joint
• A generic entry in the joint probability
distribution is the probability of a conjunction
of particular assignments to each variable,
such as:
P(x1, ..., xn) = ∏_{i=1}^{n} P(xi | Parents(Xi))
Each entry in the joint is represented by the
product of appropriate elements of the CPTs
in the belief network.
Example Calculation
Calculate the probability of the event that
the alarm has sounded but neither a
burglary nor an earthquake has occurred,
and both John and Mary call.
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.00063
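A small sketch of this factorization in code, using the CPT values from the complete network above (the dictionary layout and helper function are illustrative assumptions):

```python
# CPTs from the slides: each table stores P(node = True | parent values)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls=True | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls=True | Alarm)

def cond(p_true, value):
    """Turn P(X=True | parents) into P(X=value | parents)."""
    return p_true if value else 1 - p_true

def joint(b, e, a, j, m):
    """P(B=b, E=e, A=a, J=j, M=m) as a product of CPT entries."""
    return (P_B[b] * P_E[e] * cond(P_A[(b, e)], a)
            * cond(P_J[a], j) * cond(P_M[a], m))

print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.00063, as above
```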
Semantics
• View 2: Encoding of a collection of
conditional independence statements.
– JohnCalls is conditionally independent of
other variables in the network given the
value of Alarm
• This view is useful for understanding
inference procedures for the networks.
Inference Methods for
Bayesian Networks
• We may want to infer the value of some target
variable (Burglary) given observed values for other
variables.
• What we generally want is the probability distribution over the possible values of the target variable
• Inference straightforward if all other values in network
known
• More general case, if we know a subset of the values
of variables, we can infer a probability distribution
over other variables.
• Exact inference in Bayesian networks is NP-hard in general
• But approximate inference methods work well in practice
Learning Bayesian Belief
Networks
• Focus of a great deal of research
• Several situations of varying complexity
– Network structure may be given or not
– All variables may be observable or you may have
some variables that cannot be observed
• If the network structure is known and all variables can be observed, the CPTs can be estimated from the training data just as the conditional probabilities were for Naïve Bayes
Gradient Ascent Training of
Bayesian Networks
• Method developed by Russell
• Maximizes P(D|h) by following the
gradient of
ln P(D|h)
• Let wijk be a single CPT entry: the probability that variable Yi takes on the value yij given that its immediate parents Ui take on the values given by uik
Illustration
[Diagram: parent node Ui = uik → child node Yi = yij, with CPT entry wijk = P(Yi = yij | Ui = uik).]
Result
∂ ln P(D|h) / ∂wijk = Σ_{d ∈ D} P(Yi = yij, Ui = uik | d) / wijk
Example
[Figure: the Burglary → Alarm ← Earthquake network, with Alarm → JohnCalls and Alarm → MaryCalls.]
To compute the CPT entries P(A|B,E) by gradient ascent, we would need P(A, B, E | d) for each training example d.
EM Algorithm
• The EM algorithm is a general purpose
algorithm that is used in many settings
including
– Unsupervised learning
– Learning CPT’s for Bayesian networks
– Learning Hidden Markov models
• Two-step algorithm for learning hidden
variables
Two Step Process
• For a specific problem we have three quantities:
– X: the observed data for the instances
– Z: the unobserved data for the instances (this is usually what we are trying to learn)
– Y: the full data (X together with Z)
• General approach
– Determine initial hypothesis for values for Z
– Step 1: Estimation
» Compute a function Q(h’|h) using current hypothesis h and the
observed data X to estimate the probability distribution over Y.
– Step 2: Maximization
» Revise hypothesis h with h’ that maximizes the Q function
K-means algorithm
Assume the data comes from a mixture of two Gaussian distributions whose means (μ1, μ2) are unknown.
[Figure: the resulting density P(x) plotted against x.]
Generation of data
• Select one of the normal distributions at
random
• Generate a single random instance xi using
this distribution
p(x) = (1 / √(2πσ²)) · e^( −(1/2)·((x − μ)/σ)² )
E[X] = μ
Example
Select initial values for the hypothesis h = <μ1, μ2>
[Figure: the data points with the initial mean estimates μ1 and μ2.]
E-step: Compute the probability that each datum xi was generated by each component
h = <μ1, μ2>
[Figure: the data with the current mean estimates μ1 and μ2.]
M-step: Replace hypothesis h
with h’ that maximizes Q
h′ = <μ1′, μ2′>
[Figure: the data with the updated mean estimates μ1′ and μ2′.]
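A minimal sketch of this two-Gaussian EM loop in Python, assuming a known shared σ and equal mixing weights (the data generation and initialization are illustrative, not from the slides):

```python
import math
import random

def em_two_gaussians(xs, sigma=1.0, iters=50):
    mu = [min(xs), max(xs)]          # crude initial hypothesis h = <mu1, mu2>
    for _ in range(iters):
        # E-step: P(component j | x_i) for every point, with equal mixing weights
        resp = []
        for x in xs:
            w = [math.exp(-((x - m) ** 2) / (2 * sigma ** 2)) for m in mu]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: replace h with the h' that maximizes Q (responsibility-weighted means)
        for j in range(2):
            weight = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / weight
    return mu

random.seed(0)
xs = [random.gauss(-2, 1) for _ in range(100)] + [random.gauss(3, 1) for _ in range(100)]
print(em_two_gaussians(xs))          # should land near the true means, -2 and 3
```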