Lecture 21: Graphical Models
Machine Learning, CUNY Graduate Center

Today
• Graphical Models
  – Representing conditional dependence graphically

Graphical Models and Conditional Independence
• Graphical models concern probabilities more generally, but they are used in classification and clustering.
• Both Linear Regression and Logistic Regression use probabilistic models.
• Graphical models allow us to structure and visualize probabilistic models and the relationships between variables.

(Joint) Probability Tables
• Represent multinomial joint probabilities between K variables as K-dimensional tables.
• Assuming D binary variables, how big is this table? ($2^D$ entries.)
• What if we had multinomials with M entries? ($M^D$ entries.)

Probability Models
• What if the variables are independent?
• If x and y are independent: $p(x, y) = p(x)\,p(y)$
• The original distribution can be factored.
• How big is this table, if each variable is binary? (D tables of 2 entries each, so $2D$ entries in total.)

Conditional Independence
• Independence assumptions are convenient (Naïve Bayes), but rarely true.
• More often some groups of variables are dependent, but others are independent.
• Still others are conditionally independent.

Conditional Independence
• Two variables x and z are conditionally independent given y if $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$.
• E.g. y = flu?, x = achiness?, z = headache?

Factorization of a joint
• Assume a conditional independence relationship holds.
• How do you factorize the joint?

Factorization of a joint
• What if there is no conditional independence?
• How do you factorize the joint? (Only the chain rule applies: $p(x_0, \dots, x_{M-1}) = \prod_{i=0}^{M-1} p(x_i \mid x_0, \dots, x_{i-1})$.)

Structure of Graphical Models
• Graphical models allow us to represent dependence relationships between variables visually.
  – Graphical models are directed acyclic graphs (DAGs).
  – Nodes: random variables
  – Edges: dependence relationships
  – No edge: no direct dependence between the variables
  – Direction of the edge: indicates a parent-child relationship
  – Parent: source, trigger
  – Child: destination, response

Example Graphical Models
[Figure: two example graphs over nodes x and y]
• Parents of a node i are denoted $\pi_i$.
• Factorization of the joint in a graphical model:
  $p(x_0, \dots, x_{M-1}) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i)$

Basic Graphical Models
• Independent variables
  [Figure: nodes x, y, z with no edges]
• Observations
  [Figure: nodes x, y, z with an observed node shaded]
• When we observe a variable (fix its value from data), we color the node grey.
• Observing a variable allows us to condition on it, e.g. $p(x, z \mid y)$.
• Given an observation, we can generate pdfs for the other variables.

Example Graphical Models
• X = cloudy?
• Y = raining?
• Z = wet ground?
• Markov chain: x → y → z

Example Graphical Models
• Markov chain: x → y → z
• Are x and z conditionally independent given y?

Example Graphical Models
• Markov chain: x → y → z
• Yes: $p(x, y, z) = p(x)\,p(y \mid x)\,p(z \mid y)$, so $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$.

One Trigger Two Responses
• X = achiness?
• Y = flu?
• Z = fever?
• Graph: x ← y → z

Example Graphical Models
• Graph: x ← y → z
• Are x and z conditionally independent given y?

Example Graphical Models
• Graph: x ← y → z
• Yes: $p(x, y, z) = p(y)\,p(x \mid y)\,p(z \mid y)$, so $p(x, z \mid y) = p(x \mid y)\,p(z \mid y)$.

Two Triggers One Response
• X = rain?
• Y = wet sidewalk?
• Z = spilled coffee?
• Graph: x → y ← z

Example Graphical Models
• Graph: x → y ← z
• Are x and z conditionally independent given y?

Example Graphical Models
• Graph: x → y ← z
• No: $p(x, y, z) = p(x)\,p(z)\,p(y \mid x, z)$. Here x and z are marginally independent, but in general they become dependent once y is observed.

Factorization
[Figure: directed graph over x0, x1, x2, x3, x4, x5]

How large are the probability tables?

Model Parameters as Nodes
• Treating model parameters as random variables, we can include them in a graphical model.
• Multivariate Bernoulli
  [Figure: µ0 → x0, µ1 → x1, µ2 → x2]

Model Parameters as Nodes
• Treating model parameters as random variables, we can include them in a graphical model.
• Multinomial
  [Figure: a single µ with children x0, x1, x2]

Naïve Bayes Classification
[Figure: y with children x0, x1, x2]
• Observed variables $x_i$ are independent given the class variable y.
• The distribution can be optimized using maximum likelihood on each variable separately.
• Can easily combine various types of distributions.

Graphical Models
• Graphical representation of dependency relationships
• Directed Acyclic Graphs
• Nodes as random variables
• Edges define dependency relations
• What can we do with graphical models?
  – Learn parameters to fit data
  – Understand independence relationships between variables
  – Perform inference (marginals and conditionals)
  – Compute likelihoods for classification

Plate Notation
[Figure: y with children x0, x1, …, xn, redrawn as y → xi inside a plate labeled n]
• To indicate a repeated variable, draw a plate around it.

Completely Observed Graphical Models
• Suppose we have observations for every node.

Flu  Fever  Sinus  Ache  Swell  Head
Y    L      Y      Y     Y      N
N    M      N      N     N      N
Y    H      N      N     Y      Y
Y    M      Y      N     N      Y

• In the simplest (least general) graph, assume each variable is independent. Train 6 separate models.
[Figure: six unconnected nodes Fl, Fe, Si, Ac, Sw, He]
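As a minimal sketch of what "train 6 separate models" could look like for the independent graph, the code below estimates one marginal table per column by counting. The four records are the ones on the slide; the names `COLUMNS`, `DATA`, and `independent_model` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Column names and the four fully observed records from the slide.
COLUMNS = ["Flu", "Fever", "Sinus", "Ache", "Swell", "Head"]
DATA = [
    ("Y", "L", "Y", "Y", "Y", "N"),
    ("N", "M", "N", "N", "N", "N"),
    ("Y", "H", "N", "N", "Y", "Y"),
    ("Y", "M", "Y", "N", "N", "Y"),
]


def independent_model(rows, columns):
    """Fit one marginal table per column (hypothetical helper).

    Each table maps a value to its relative frequency, which is the
    maximum likelihood estimate when every variable is independent.
    """
    n = len(rows)
    tables = {}
    for j, name in enumerate(columns):
        counts = Counter(row[j] for row in rows)
        tables[name] = {value: count / n for value, count in counts.items()}
    return tables


marginals = independent_model(DATA, COLUMNS)
for name in COLUMNS:
    print(name, marginals[name])
```

Under this assumption the joint probability of a record is just the product of its six marginal entries, which is why six small tables suffice.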
Completely Observed Graphical Models
• Observations for every node.
• Second simplest graph (the most general): assume complete dependence, i.e. no independence. Build a 6-dimensional table of counts and divide by the total count.
[Figure: nodes Fl, Fe, Si, Ac, Sw, He]

Maximum Likelihood
[Figure: directed graph over x0, x1, x2, x3, x4, x5]
• Each node has a conditional probability table $\theta_i$.
• Given the tables, we can construct the pdf:
  $p(x \mid \theta) = \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i)$
• We have M variables in x, and N data points, X.
• Use maximum (log) likelihood to find the best settings of $\theta$:

$$
\begin{aligned}
\theta^{*} &= \arg\max_{\theta} \ln p(X \mid \theta) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \ln \prod_{i=0}^{M-1} p(x_{in} \mid \pi_{in}, \theta_i) \\
&= \arg\max_{\theta} \sum_{n=0}^{N-1} \sum_{i=0}^{M-1} \ln p(x_{in} \mid \pi_{in}, \theta_i)
\end{aligned}
$$

Maximum Likelihood CPTs: Count Functions
• First, Kronecker's delta function:

$$
\delta(x_n, x_m) =
\begin{cases}
1 & \text{if } x_n = x_m \\
0 & \text{otherwise}
\end{cases}
$$

• Counts: the number of times something appears in the data.

$$
m(x_i) = \sum_{n=0}^{N-1} \delta(x_i, x_{in}), \qquad
m(X) = \sum_{n=0}^{N-1} \delta(X, X_n)
$$

$$
N = \sum_{x_1} m(x_1) = \sum_{x_1} \sum_{x_2} m(x_1, x_2) = \sum_{x_1} \sum_{x_2} \sum_{x_3} m(x_1, x_2, x_3) = \dots
$$

Maximum Likelihood CPTs

$$
\begin{aligned}
l(\theta) &= \sum_{n=0}^{N-1} \ln p(X_n \mid \theta) \\
&= \sum_{n=0}^{N-1} \sum_{X} \delta(X, X_n) \ln p(X \mid \theta) \\
&= \sum_{X} m(X) \ln p(X \mid \theta) \\
&= \sum_{X} m(X) \ln \prod_{i=0}^{M-1} p(x_i \mid \pi_i, \theta_i) \\
&= \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} \sum_{X \setminus x_i \setminus \pi_i} m(X) \ln p(x_i \mid \pi_i, \theta_i) \\
&= \sum_{i=0}^{M-1} \sum_{x_i, \pi_i} m(x_i, \pi_i) \ln p(x_i \mid \pi_i, \theta_i)
\end{aligned}
$$

• Define a function: $\theta(x_i, \pi_i) = p(x_i \mid \pi_i, \theta_i)$
• Constraint: $\sum_{x_i} \theta(x_i, \pi_i) = 1$

Maximum Likelihood
• Use Lagrange multipliers to maximize $l(\theta)$ subject to the constraint. The result is the normalized count $\hat{\theta}(x_i, \pi_i) = m(x_i, \pi_i) / m(\pi_i)$, where $m(\pi_i) = \sum_{x_i} m(x_i, \pi_i)$.

Maximum A Posteriori Training
• Bayesians would never do that: the $\theta$s need a prior.

Conditional Dependence Test
• We can check conditional independence in a graphical model.
  – "Is achiness (x3) independent of the flu (x0) given fever (x1)?"
  – "Is achiness (x3) independent of sinus infections (x2) given fever (x1)?"

D-Separation and Bayes Ball
[Figure: directed graph over x0, x1, x2, x3, x4, x5]
• Intuition: nodes are separated, or blocked, by sets of nodes.
  – E.g. nodes x1 and x2 "block" the path from x0 to x5, so $x_0 \perp\!\!\!\perp x_5 \mid x_1, x_2$: x0 is conditionally independent of x5 given x1 and x2.

Bayes Ball Algorithm
• To test whether $x_a \perp\!\!\!\perp x_b \mid x_c$:
  – Shade the nodes in $x_c$.
  – Place a "ball" at each node in $x_a$.
  – Bounce the balls around the graph according to the rules.
  – If no balls reach $x_b$, then $x_a$ and $x_b$ are conditionally independent given $x_c$.

Ten rules of Bayes Ball Theorem
[Figure: the ten Bayes Ball rules]

Bayes Ball Example
• $x_0 \perp\!\!\!\perp x_4 \mid x_2$?
[Figure: directed graph over x0, x1, x2, x3, x4, x5]

Bayes Ball Example
• $x_0 \perp\!\!\!\perp x_5 \mid x_1, x_2$?
[Figure: directed graph over x0, x1, x2, x3, x4, x5]

Undirected Graphs
• What if we allow undirected graphs?
• What do they correspond to?
• Not cause/effect or trigger/response, but general dependence.
• Example: image pixels, where each pixel is a Bernoulli variable.
  – $P(x_{11}, \dots, x_{1M}, \dots, x_{M1}, \dots, x_{MM})$
  – Bright pixels have bright neighbors.
• No parents, just probabilities.
• Grid models are called Markov Random Fields.

Undirected Graphs
[Figure: undirected graph over nodes A, B, C, D]
• Undirected separability is easy.
• To check conditional independence of A and B given C, check whether B is reachable from A in the graph without going through nodes in C. If it is not, A and B are conditionally independent given C.

Next Time
• Inference in Graphical Models
  – Belief Propagation
  – Junction Tree Algorithm
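The reachability test from the Undirected Graphs slide can be sketched with a breadth-first search that refuses to step onto observed nodes. This is a minimal illustration, not the lecture's code: the function name `is_separated` and the edges of `GRAPH` are assumptions, since the slide's figure shows the nodes A, B, C, D but its edges did not survive.

```python
from collections import deque


def is_separated(adjacency, a, b, given):
    """Return True if every path from a to b passes through a node in `given`.

    `adjacency` maps each node to the set of its neighbours in an
    undirected graph (hypothetical representation chosen for this sketch).
    """
    blocked = set(given)
    seen = {a}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if node == b:
            return False  # b was reached without crossing an observed node
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen and neighbour not in blocked:
                seen.add(neighbour)
                queue.append(neighbour)
    return True


# Hypothetical edges: A-C, A-D, D-C, C-B. Every route from A to B uses C.
GRAPH = {
    "A": {"C", "D"},
    "B": {"C"},
    "C": {"A", "B", "D"},
    "D": {"A", "C"},
}

print(is_separated(GRAPH, "A", "B", {"C"}))  # observing C blocks all paths
print(is_separated(GRAPH, "A", "B", {"D"}))  # the path A-C-B stays open
```

In this assumed graph, observing C separates A from B, while observing only D does not, because the direct route through C remains available.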