Conditional Independence and Factorization

Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]

Graphical Models: Efficient Representation

◮ Consider a set of (discrete) random variables {X_1, ..., X_n}, where each X_i takes one of r different values x_i.
◮ A direct representation of p(x_1, ..., x_n) requires an n-dimensional table of r^n values, one for each of the r^n possible configurations of X_1, ..., X_n.
◮ Graphical models provide an efficient means of representing joint probability distributions over many random variables in situations where each random variable is conditionally independent of all but a handful of the others.
◮ A graphical model can be thought of as a probabilistic database: a machine that can answer "queries" regarding the values of sets of random variables. We build up the database in pieces, using probability theory to ensure that the pieces have a consistent overall interpretation. Probability theory also justifies the inferential machinery that allows the pieces to be put together "on the fly" to answer queries.

The chain rule gives

p(x_1, x_2, x_3, x_4) = p(x_4 \mid x_3, x_2, x_1) p(x_3 \mid x_2, x_1) p(x_2 \mid x_1) p(x_1).

[Figure: the fully connected DAG over X_1, ..., X_4 implied by the chain rule, next to the Markov-chain graph X_1 → X_2 → X_3 → X_4.]

Outline: Directed Graphical Models

◮ Directed graphs and joint probabilities
◮ Conditional independence and d-separation
◮ Three canonical graphs
◮ Bayes ball algorithm
◮ Characterization of directed graphical models

Notation

◮ Given a set of random variables {X_1, ..., X_n}, let x_i denote a realization of the random variable X_i.
◮ The probability mass function p(x_1, ..., x_n) is defined as P(X_1 = x_1, ..., X_n = x_n).
◮ We use X to stand for {X_1, ..., X_n} and x to stand for {x_1, ..., x_n}.
◮ X_A is shorthand for the subset of variables indexed by A; for example, X_A = {X_1, X_2} if A = {1, 2}.

Directed Graphs

A directed graph is a pair G = (V, E), where V is a set of nodes (vertices) and E is a set of oriented edges. We assume that G is acyclic.

◮ Nodes
  ◮ Nodes are associated with random variables (a one-to-one mapping from nodes to random variables).
  ◮ π_i denotes the set of parents of node i.
◮ Edges
  ◮ Edges represent conditional dependence.

Conditional Independence

◮ X_A and X_B are independent, written X_A ⊥ X_B, if

  p(x_A, x_B) = p(x_A) p(x_B).

◮ X_A and X_C are conditionally independent given X_B, written X_A ⊥ X_C | X_B, if

  p(x_A, x_C \mid x_B) = p(x_A \mid x_B) p(x_C \mid x_B), or equivalently p(x_A \mid x_B, x_C) = p(x_A \mid x_B),

  for all x_B such that p(x_B) > 0.

An Example of a DAG

[Figure: a DAG over X_1, ..., X_6 with edges X_1 → X_2, X_1 → X_3, X_2 → X_4, X_3 → X_5, X_2 → X_6, X_5 → X_6.]

p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)

Factorization of Joint Probability Distributions

Use the locality defined by the parent-child relationship to construct an economical representation of joint probability distributions. Associate a function f_i(x_i, x_{π_i}) with each node i ∈ V, where each f_i has the properties of a conditional probability distribution (nonnegativity and summing to one over x_i). Given a set of functions {f_i(x_i, x_{π_i}) : i ∈ V} for V = {1, 2, ..., n}, we define a joint probability distribution as

p(x_1, x_2, \dots, x_n) \triangleq \prod_{i=1}^{n} f_i(x_i, x_{\pi_i}).

Since the f_i(x_i, x_{π_i}) are conditional probabilities, we write them as p(x_i | x_{π_i}):

p(x_1, x_2, \dots, x_n) \triangleq \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}).
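To make the factorization concrete, here is a minimal Python sketch (not from the slides) that assembles the joint for the six-node example DAG as a product of local conditional probability tables. The CPT numbers are made up for illustration; the check at the end confirms that the product of CPTs sums to one over all 2^6 binary configurations.

```python
from itertools import product

# Factorized joint for the slides' six-node DAG,
# p(x1..x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5),
# with binary variables.  The CPT values below are hypothetical.
p1   = {0: 0.6, 1: 0.4}                                        # p(x1)
p21  = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}    # p(x2 | x1), keyed by (x1, x2)
p31  = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.1}    # p(x3 | x1)
p42  = {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}    # p(x4 | x2)
p53  = {(0, 0): 0.3, (0, 1): 0.7, (1, 0): 0.6, (1, 1): 0.4}    # p(x5 | x3)
p625 = {key: 0.5 for key in product([0, 1], repeat=3)}         # p(x6 | x2, x5), uniform here

def joint(x1, x2, x3, x4, x5, x6):
    """Joint probability as a product of local conditionals (one factor per node)."""
    return (p1[x1] * p21[(x1, x2)] * p31[(x1, x3)] *
            p42[(x2, x4)] * p53[(x3, x5)] * p625[(x2, x5, x6)])

# Sanity check: the factors are proper CPTs, so the product sums to one over 2^6 states.
total = sum(joint(*xs) for xs in product([0, 1], repeat=6))
print(round(total, 10))  # 1.0

# Storage: 2 + 4*4 + 8 = 26 numbers, versus 2^6 = 64 for the full joint table;
# the gap grows quickly with the number of variables.
```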
An Example of a DAG: Revisited

[Figure: the same six-node DAG.]

p(x_1, x_2, x_3, x_4, x_5, x_6) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)

Economical Representation?

Consider n discrete random variables {X_1, ..., X_n}, with each variable X_i ranging over r values.

◮ Naive approach: an n-dimensional table of size r^n is needed.
◮ PGM: for each node X_i, an (m_i + 1)-dimensional table of size r^{m_i + 1} is needed, where m_i is the number of parents of node X_i.

Conditional Independence in DAGs

Two different factorizations of a probability distribution:

◮ The chain rule gives

  p(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}).

◮ A DAG gives

  p(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}).

Comparing these expressions, we may interpret the variables missing from the local conditional probability functions as corresponding to missing edges (conditional independencies) in the underlying graph.

Basic Conditional Independence Statements

An ordering I of the nodes in a graph G is said to be topological if the nodes in π_i appear before i in I for every node i ∈ V. (I = {1, 2, 3, 4, 5, 6} is a topological ordering in our example.) Let ν_i denote the set of all nodes that appear earlier than i in I, excluding π_i (for example, ν_5 = {1, 2, 4}).

Given a topological ordering I, the set of basic conditional independence statements is

{X_i ⊥ X_{ν_i} | X_{π_i}}, for i ∈ V.

These statements can be verified by algebraic calculation. Example: X_4 ⊥ {X_1, X_3} | X_2:

p(x_4 \mid x_1, x_2, x_3) = \frac{p(x_1, x_2, x_3, x_4)}{p(x_1, x_2, x_3)} = \frac{p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2)}{p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1)} = p(x_4 \mid x_2).

Markov Chain

[Figure: the chain X → Y → Z, with Y shaded.]

◮ X ⊥ Z | Y:

  p(z \mid x, y) = \frac{p(x, y, z)}{p(x, y)} = \frac{p(x) p(y \mid x) p(z \mid y)}{p(x) p(y \mid x)} = p(z \mid y).

◮ There are no other conditional independencies.

Asserted conditional independencies always hold for the family of distributions associated with a given graph. Non-asserted conditional independencies sometimes fail to hold, but sometimes do hold.

Hidden Cause

[Figure: a common cause Y with children X and Z, with Y shaded.]

◮ X ⊥ Z | Y:

  p(x, z \mid y) = \frac{p(y) p(x \mid y) p(z \mid y)}{p(y)} = p(x \mid y) p(z \mid y).

◮ We do not necessarily assume that X and Z are (marginally) dependent.

Explaining Away

[Figure: the v-structure X → Y ← Z, shown with and without Y shaded.]

◮ X ⊥ Z, since p(x, z) = \sum_y p(x) p(z) p(y \mid x, z) = p(x) p(z).
◮ X ⊥ Z | Y does not hold in general: conditioning on the common child Y couples X and Z.

Bayes Ball Algorithm

We wish to decide whether a given conditional independence statement, X_A ⊥ X_B | X_C, is true for a directed graph G.

◮ A reachability algorithm; a d-separation test.
◮ Shade the nodes in X_C. Place balls at each node in X_A, let them bounce around according to a set of rules, and then ask whether any of the balls reach any of the nodes in X_B.

Bayes Ball Algorithm - Rules 1-5

[Figures: the ball-passing rules for the three canonical structures (chain, hidden cause, explaining away), with the middle node shaded and unshaded.]

Examples (for the six-node example DAG; see the code sketch following the characterization below):

◮ Example 1: X_4 ⊥ {X_1, X_3} | X_2 is true.
◮ Example 2: X_1 ⊥ X_6 | {X_2, X_3} is true.
◮ Example 3: X_2 ⊥ X_3 | {X_1, X_6} is not true.

Characterization of Directed Graphical Models

A graphical model is associated with a family of probability distributions, and this family can be characterized in two equivalent ways.

◮ D_1 = { p(x) : p(x) = \prod_{i=1}^{n} p(x_i \mid x_{\pi_i}) }.
◮ D_2 = { p(x) : X_i ⊥ X_{ν_i} | X_{π_i} }: the family of probability distributions associated with G that includes all p(x_V) satisfying every conditional independence statement associated with G.
◮ D_1 = D_2 (details will be discussed later).
◮ This provides a strong and important link between graph theory and probability theory.
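The Bayes ball / d-separation test can be made concrete with a small sketch. The following Python code is an illustration rather than the slides' own algorithm: it enumerates paths in the example DAG and applies the standard blocking rules for chains, forks, and colliders; the three queries reproduce Examples 1-3 above.

```python
# A small d-separation (Bayes ball) sketch for the six-node example DAG:
# edges 1->2, 1->3, 2->4, 3->5, 2->6, 5->6.
edges = {(1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (5, 6)}
nodes = {1, 2, 3, 4, 5, 6}

def descendants(v):
    """Nodes reachable from v by following directed edges."""
    out, stack = set(), [v]
    while stack:
        u = stack.pop()
        for (p, c) in edges:
            if p == u and c not in out:
                out.add(c)
                stack.append(c)
    return out

def simple_paths(a, b):
    """All simple paths from a to b in the undirected skeleton."""
    adj = {n: set() for n in nodes}
    for (p, c) in edges:
        adj[p].add(c)
        adj[c].add(p)
    paths, stack = [], [[a]]
    while stack:
        path = stack.pop()
        if path[-1] == b:
            paths.append(path)
            continue
        for nxt in adj[path[-1]]:
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def blocked(path, Z):
    """A path is blocked if some internal node blocks it under the d-separation rules."""
    for i in range(1, len(path) - 1):
        prev, v, nxt = path[i - 1], path[i], path[i + 1]
        collider = (prev, v) in edges and (nxt, v) in edges   # both arrows point into v
        if collider:
            if v not in Z and not (descendants(v) & Z):
                return True   # unobserved collider (and descendants) blocks the path
        elif v in Z:
            return True       # observed chain/fork node blocks the path
    return False

def d_separated(A, B, Z):
    return all(blocked(p, Z) for a in A for b in B for p in simple_paths(a, b))

print(d_separated({4}, {1, 3}, {2}))    # True:  X4 is independent of {X1, X3} given X2
print(d_separated({1}, {6}, {2, 3}))    # True:  X1 is independent of X6 given {X2, X3}
print(d_separated({2}, {3}, {1, 6}))    # False: conditioning on X6 couples X2 and X3
```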
Markov Blanket

Markov blanket (a term coined by J. Pearl): V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V. In a DAG, the Markov blanket of node i is composed of its parents, its children, and its children's parents. The Markov blanket identifies all the variables that shield the node from the rest of the network; in other words, the Markov blanket of a node is the only knowledge needed to predict the behaviour of that node.

Outline: Undirected Graphical Models

◮ Markov networks
◮ Clique potentials
◮ Characterization of undirected graphical models

Markov Networks (Undirected Graphical Models)

[Figure: a small undirected graph over X_1, ..., X_5.]

Examples of Markov networks: Boltzmann machines, Markov random fields.

Semantics: every node is conditionally independent of its non-neighbors, given its neighbors.

Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V.
Markov boundary: a minimal Markov blanket.

Boltzmann Machines

A Boltzmann machine is a Markov network over a vector of binary variables, x_i ∈ {0, 1}, where some variables may be hidden (x_i^H) and some may be observed or visible (x_i^V). Learning for Boltzmann machines involves adjusting the weights W so that the generative model

p(x \mid W) = \frac{1}{Z} \exp\left\{ \frac{1}{2} x^\top W x \right\}

is well matched to a set of examples {x(t)}, t = 1, ..., N. The learning is done via EM!

Conditional Independence

For undirected graphs, the conditional independence semantics is the more intuitive one:

◮ If every path from a node in X_A to a node in X_C includes at least one node in X_B, then X_A ⊥ X_C | X_B.

[Figure: node sets X_A and X_C separated by X_B, illustrating X_A ⊥ X_C | X_B.]

Comparative Semantics

[Figures: small directed and undirected graphs over X, Y, Z (and W) illustrating the statements below.]

◮ One graph asserts X ⊥ Y but not X ⊥ Y | Z (the directed v-structure).
◮ Another asserts X ⊥ Y | Z but not X ⊥ Y (the undirected chain X - Z - Y).
◮ A four-node undirected graph over {W, X, Y, Z} asserts X ⊥ Y | {W, Z} and W ⊥ Z | {X, Y}.

Directed and undirected graphs thus capture different families of conditional independence statements.

Clique Potentials

Clique: a fully connected subgraph (usually maximal).

[Figure: the undirected example graph with its cliques.]

Denote by x_{C_i} the set of variables in clique C_i. For each clique C_i, we assign a nonnegative function (a potential function) ψ_{C_i}(x_{C_i}), which measures compatibility (agreement, constraint, or energy):

p(x) = \frac{1}{Z} \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i}),

where Z = \sum_x \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i}) is the normalization constant.

Example

[Figure: the example graph with a 2 x 2 potential table attached to each edge.]

Each pairwise potential takes the value 1.5 when the two binary variables agree and 0.2 when they disagree:

              x_j = 0   x_j = 1
  x_i = 0       1.5       0.2
  x_i = 1       0.2       1.5

Potential Functions: Parameterization

The potential functions must be nonnegative, so we parameterize them as

\psi_C(x_C) = \exp\{ -H_C(x_C) \}.

The joint probability for undirected models can then be written as

p(x) = \frac{1}{Z} \exp\left\{ -\sum_{C_i \in \mathcal{C}} H_{C_i}(x_{C_i}) \right\}.

The sum in this expression is generally referred to as the "energy",

H(x) \triangleq \sum_{C_i \in \mathcal{C}} H_{C_i}(x_{C_i}),

so the joint probability can be represented as a Boltzmann distribution:

p(x) = \frac{1}{Z} \exp\{ -H(x) \}.

Potential Functions, again!

The factorization of the joint probability distribution in a Markov network is

p(x) = \frac{1}{Z} \prod_{C_i \in \mathcal{C}} \psi_{C_i}(x_{C_i}).

If all potential functions are strictly positive, then we can write

p(x) = \exp\left\{ \sum_{C_i \in \mathcal{C}} \log \psi_{C_i}(x_{C_i}) - \log Z \right\},

where the terms \log \psi_{C_i}(x_{C_i}) play the role of \theta_i T_i(x) and \log Z plays the role of A(\theta). In some cases we have

p(x) = \exp\left\{ \sum_i \theta_i T_i(x) - A(\theta) \right\},

which is known as the exponential family. In statistical physics, -\log Z is referred to as the free energy.
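As a small illustration of the clique-potential factorization and the normalization constant Z (a sketch, not code from the slides), the Python snippet below uses pairwise "agreement" potentials in the spirit of the example table (1.5 for agreement, 0.2 for disagreement) on an assumed edge set and computes p(x) by brute-force enumeration of the partition function.

```python
from itertools import product

# Pairwise Markov network with the example table's "agreement" potentials:
# psi(a, b) = 1.5 if a == b else 0.2.  The edge set below is an assumption
# made for illustration (an undirected version of the six-node example).
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (5, 6)]
nodes = [1, 2, 3, 4, 5, 6]

def psi(a, b):
    return 1.5 if a == b else 0.2

def unnormalized(x):
    """Product of edge potentials for a full assignment x[node] -> {0, 1}."""
    value = 1.0
    for (i, j) in edges:
        value *= psi(x[i], x[j])
    return value

# Partition function Z by brute force over all 2^6 assignments.
Z = sum(unnormalized(dict(zip(nodes, bits)))
        for bits in product([0, 1], repeat=len(nodes)))

def p(x):
    """Normalized joint p(x) = (1/Z) * prod_C psi_C(x_C)."""
    return unnormalized(x) / Z

# The all-agree configurations receive the largest probability mass.
print(p({n: 0 for n in nodes}), p({n: 1 for n in nodes}))
```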
Markov Random Fields

Let X = {X_1, ..., X_n} be a family of random variables defined on the set S, in which each random variable X_i takes a value in L.

Definition. The family X is called a random field.

Definition. X is said to be a Markov random field (MRF) with respect to a neighborhood system N if and only if the following two conditions are satisfied:
1. Positivity: P(x) > 0 for all x ∈ X.
2. Markovianity: P(x_i | x_{S∖i}) = P(x_i | x_{N_i}).

Gibbs Random Fields

Definition. A set of random variables X is said to be a Gibbs random field (GRF) if and only if its configurations obey a Gibbs distribution of the form

p(x) = \frac{1}{Z} \exp\left\{ -\frac{1}{T} E(x) \right\},

where E(x) = \sum_C \psi_C(x). We often consider cliques of size up to 2:

E(x) \approx \sum_{i \in S} \psi_1(x_i) + \sum_{i \in S} \sum_{j \in N_i} \psi_2(x_i, x_j).

Theorem (Hammersley-Clifford). X is an MRF on S with respect to N if and only if X is a GRF on S with respect to N.

Characterization of Undirected Graphical Models

◮ U_1: the family of probability distributions obtained by ranging over all possible choices of positive potential functions on the maximal cliques of the graph,

  p(x) = \frac{1}{Z} \exp\{ -H(x) \}.

◮ U_2: the family of probability distributions defined via the conditional independence assertions (X_A ⊥ X_B | X_C) associated with G.

◮ U_1 ≡ U_2 by the Hammersley-Clifford theorem (proof, p. 85).
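Markovianity is what makes Gibbs sampling in an MRF local: the full conditional of each variable depends only on its neighborhood. The sketch below is an illustration under assumed Ising-style pairwise energies (not material from the slides); it samples a small binary pairwise MRF at temperature T by repeatedly resampling each variable from P(x_i | x_{N_i}).

```python
import math
import random

# A minimal Gibbs sampler for a binary pairwise MRF.  The graph and the
# Ising-style pairwise energy below are illustrative assumptions.
edges = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (5, 6)]
nodes = [1, 2, 3, 4, 5, 6]
neighbors = {n: [j for (i, j) in edges if i == n] + [i for (i, j) in edges if j == n]
             for n in nodes}

def pair_energy(a, b):
    # Lower energy (higher probability) when neighbors agree.
    return -1.0 if a == b else 1.0

def prob_one(x, i, T=1.0):
    """P(x_i = 1 | x_{N_i}): by Markovianity it depends only on the neighbors of i."""
    e1 = sum(pair_energy(1, x[j]) for j in neighbors[i])
    e0 = sum(pair_energy(0, x[j]) for j in neighbors[i])
    return 1.0 / (1.0 + math.exp((e1 - e0) / T))

def gibbs(n_sweeps=2000, T=1.0, seed=0):
    random.seed(seed)
    x = {n: random.randint(0, 1) for n in nodes}
    samples = []
    for _ in range(n_sweeps):
        for i in nodes:
            x[i] = 1 if random.random() < prob_one(x, i, T) else 0
        samples.append(dict(x))
    return samples

samples = gibbs()
# With agreement-favouring energies, neighboring variables agree most of the time.
print(sum(s[2] == s[6] for s in samples) / len(samples))
```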