Conditional Independence and Factorization
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]
Graphical Models: Efficient Representation
◮ Consider a set of (discrete) random variables {X1, ..., Xn}, where each Xi takes on one of r possible values xi.
◮ A direct representation of p(x1, ..., xn) requires an n-dimensional table of r^n values, one for each of the r^n possible joint assignments of X1, ..., Xn.
◮ Graphical models provide an efficient means of representing joint probability distributions over many random variables in situations where each random variable is conditionally independent of all but a handful of the n random variables.
◮ A graphical model can be thought of as a probabilistic database, a machine that can answer "queries" regarding the values of sets of random variables. We build up the database in pieces, using probability theory to ensure that the pieces have a consistent overall interpretation. Probability theory also justifies the inferential machinery that allows the pieces to be put together "on the fly" to answer queries.
The chain rule gives
p(x1, x2, x3, x4) = p(x4|x3, x2, x1) p(x3|x2, x1) p(x2|x1) p(x1).

[Figure: two directed graphical models over X1, X2, X3, X4: the fully connected GM implied by the chain rule, and the GM of a Markov chain.]
Outline: Directed Graphical Models
◮ Directed graphs and joint probabilities
◮ Conditional independence and d-separation
   ◮ Three canonical graphs
   ◮ Bayes ball algorithm
◮ Characterization of directed graphical models
Notations
◮ Given a set of random variables {X1, ..., Xn}, let xi represent the realization of random variable Xi.
◮ The probability mass function p(x1, ..., xn) is defined as P(X1 = x1, ..., Xn = xn).
◮ We use X to stand for {X1, ..., Xn} and x to stand for {x1, ..., xn}.
◮ For an index set A, XA is shorthand for {Xi : i ∈ A}; for example, XA = {X1, X2} if A = {1, 2}.
Directed Graphs
A directed graph is a pair G = (V, E), where V is a set of nodes (vertices) and E is a set of oriented edges. We assume that G is acyclic.
◮ Nodes
   ◮ Each node is associated with a random variable (a one-to-one mapping from nodes to random variables).
   ◮ πi denotes the set of parents of node i.
◮ Edges
   ◮ Edges represent conditional dependence.
Conditional Independence
◮ XA and XB are independent, written XA ⊥ XB, if
   p(xA, xB) = p(xA) p(xB).
◮ XA and XC are conditionally independent given XB, written XA ⊥ XC | XB, if
   p(xA, xC|xB) = p(xA|xB) p(xC|xB),
   or, equivalently,
   p(xA|xB, xC) = p(xA|xB),
   for all xB such that p(xB) > 0.
An Example of DAG
[Figure: a DAG over X1, ..., X6 with edges X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6.]

p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2, x5)
Factorization of Joint Probability Distributions
Use the locality defined by the parent-child relationship to construct an economical representation of joint probability distributions.
Associate with each node i ∈ V a function fi(xi, xπi) that has the properties of a conditional probability distribution: nonnegativity and summing to one over xi.
Given a set of functions {fi(xi, xπi) : i ∈ V} for V = {1, 2, ..., n}, we define a joint probability distribution as follows:
   p(x1, x2, ..., xn) ≜ ∏_{i=1}^n fi(xi, xπi).
Given that the fi(xi, xπi) are conditional probabilities, we write them as p(xi | xπi):
   p(x1, x2, ..., xn) ≜ ∏_{i=1}^n p(xi | xπi).
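To make the factorization concrete, here is a minimal sketch (not from the lecture) that evaluates the DAG-factored joint for the six-node example; the CPT numbers are made up for illustration.

# A minimal sketch: evaluating p(x1, ..., x6) = prod_i p(xi | x_parents(i)) for
# the six-node example DAG. The CPT values below are made up for illustration.
parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}

# cpt[i][parent_values] maps xi -> p(xi | x_parents); all variables are binary.
cpt = {
    1: {(): {0: 0.6, 1: 0.4}},
    2: {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.2, 1: 0.8}},
    3: {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.9, 1: 0.1}},
    4: {(0,): {0: 0.8, 1: 0.2}, (1,): {0: 0.3, 1: 0.7}},
    5: {(0,): {0: 0.6, 1: 0.4}, (1,): {0: 0.1, 1: 0.9}},
    6: {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.4, 1: 0.6},
        (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {0: 0.2, 1: 0.8}},
}

def joint(x):
    """x maps node index -> value; returns the product of local conditionals."""
    p = 1.0
    for i, pa in parents.items():
        p *= cpt[i][tuple(x[j] for j in pa)][x[i]]
    return p

print(joint({1: 0, 2: 1, 3: 0, 4: 1, 5: 1, 6: 0}))   # 0.6*0.3*0.5*0.7*0.4*0.2 = 0.00504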
An Example of DAG: Revisited
[Figure: the same DAG over X1, ..., X6, with edges X1 → X2, X1 → X3, X2 → X4, X3 → X5, X2 → X6, X5 → X6.]

p(x1, x2, x3, x4, x5, x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2, x5)
Economical Representation?
Consider n discrete random variables {X1, ..., Xn}, with each variable Xi ranging over r values.
◮ Naive approach
   ◮ Needs an n-dimensional table of size r^n.
◮ PGM
   ◮ Needs, for each node Xi, an (mi + 1)-dimensional table of size r^(mi+1), where mi is the number of parents of node Xi.
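As a quick worked count (my own, not on the slides) for the six-node example with binary variables (r = 2), where the nodes have m = (0, 1, 1, 1, 1, 2) parents:

% Naive joint table versus the sum of the local conditional tables (r = 2, n = 6).
r^{n} = 2^{6} = 64
\qquad \text{versus} \qquad
\sum_{i=1}^{6} r^{\,m_i + 1} = 2 + 4 + 4 + 4 + 4 + 8 = 26.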
Conditional Independence in DAG
Two different factorizations of a probability distribution:
◮ The chain rule leads to
   p(x1, x2, ..., xn) = ∏_{i=1}^n p(xi | x1, ..., x_{i-1}).
◮ The DAG leads to
   p(x1, x2, ..., xn) = ∏_{i=1}^n p(xi | xπi).
Comparing these expressions, we may interpret missing variables in the local conditional probability functions as missing edges (conditional independence) in the underlying graph.
Basic Conditional Independence Statements
An ordering I of the nodes in a graph G is said to be topological if, for every node i ∈ V, the nodes in πi appear before i in I. (I = (1, 2, 3, 4, 5, 6) is a topological ordering in our example.)
Let νi denote the set of all nodes that appear earlier than i in I, excluding πi (for example, ν5 = {1, 2, 4}).
Given a topological ordering I, the set of basic conditional independence statements is
   {Xi ⊥ Xνi | Xπi},  for i ∈ V.
These statements can be verified by algebraic calculation.
Example: X4 ⊥ {X1, X3} | X2:
   p(x4|x1, x2, x3) = p(x1, x2, x3, x4) / p(x1, x2, x3)
                    = [p(x1) p(x2|x1) p(x3|x1) p(x4|x2)] / [p(x1) p(x2|x1) p(x3|x1)]
                    = p(x4|x2).
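The same statement can also be checked numerically. Here is a minimal sketch (not from the lecture) that draws random binary CPTs for the example DAG and compares p(x4 | x1, x2, x3) with p(x4 | x2) by brute-force marginalization:

import itertools, random

random.seed(0)
parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}

# Random binary CPTs: cpt[i][parent_values] = p(Xi = 1 | parent_values).
cpt = {i: {pa: random.random() for pa in itertools.product((0, 1), repeat=len(ps))}
       for i, ps in parents.items()}

def joint(x):                       # x is a tuple (x1, ..., x6)
    p = 1.0
    for i, ps in parents.items():
        p1 = cpt[i][tuple(x[j - 1] for j in ps)]
        p *= p1 if x[i - 1] == 1 else 1.0 - p1
    return p

def prob(fixed):                    # sums the joint over all unfixed variables
    return sum(joint(x) for x in itertools.product((0, 1), repeat=6)
               if all(x[i - 1] == v for i, v in fixed.items()))

lhs = prob({1: 0, 2: 1, 3: 0, 4: 1}) / prob({1: 0, 2: 1, 3: 0})   # p(x4=1 | x1=0, x2=1, x3=0)
rhs = prob({2: 1, 4: 1}) / prob({2: 1})                           # p(x4=1 | x2=1)
print(abs(lhs - rhs) < 1e-9)        # True, whatever the random CPTs are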
Markov Chain
[Figure: the Markov chain X → Y → Z.]
◮ X ⊥ Z | Y:
   p(z|x, y) = p(x, y, z) / p(x, y) = [p(x) p(y|x) p(z|y)] / [p(x) p(y|x)] = p(z|y).
◮ There are no other conditional independencies.
   Asserted conditional independencies always hold for the family of distributions associated with a given graph.
   Non-asserted conditional independencies sometimes fail to hold but sometimes do hold.
Hidden Cause
[Figure: the hidden-cause graph X ← Y → Z.]
◮ X ⊥ Z | Y:
   p(x, z|y) = [p(y) p(x|y) p(z|y)] / p(y) = p(x|y) p(z|y).
◮ We do not necessarily assume that X and Z are (marginally) dependent.
Explaining-Away
[Figure: the explaining-away (v-structure) graph X → Y ← Z.]
◮ X ⊥ Z:
   p(y|x, z) = p(x) p(z) p(y|x, z) / p(x, z),  where  p(x, z) = Σ_y p(x) p(z) p(y|x, z) = p(x) p(z).
◮ X ⊥ Z | Y does not hold in general.
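A small numeric illustration of explaining away (my own example, not from the slides): X and Z are independent fair coin flips and Y is their logical OR, so X and Z are marginally independent but become dependent once Y is observed.

from itertools import product

p_x = {0: 0.5, 1: 0.5}
p_z = {0: 0.5, 1: 0.5}

def joint(x, y, z):
    # p(x) p(z) p(y | x, z), with a deterministic OR for p(y | x, z)
    return p_x[x] * p_z[z] * (1.0 if y == (x or z) else 0.0)

# Marginal independence: p(x=1, z=1) equals p(x=1) p(z=1).
p_xz = sum(joint(1, y, 1) for y in (0, 1))
print(p_xz, p_x[1] * p_z[1])                       # 0.25 0.25

# Conditional dependence given Y = 1.
p_y1 = sum(joint(x, 1, z) for x, z in product((0, 1), repeat=2))
p_x1_given_y1 = sum(joint(1, 1, z) for z in (0, 1)) / p_y1
p_x1_given_y1_z1 = joint(1, 1, 1) / sum(joint(x, 1, 1) for x in (0, 1))
print(p_x1_given_y1, p_x1_given_y1_z1)             # 2/3 versus 1/2: not equal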
Bayes Ball Algorithm
We wish to decide whether a given conditional independence statement, XA ⊥ XB | XC, is true for a directed graph G.
◮ A reachability algorithm
◮ A d-separation test
◮ Shade the nodes in XC. Place balls at each node in XA, let them bounce around according to some rules, and then ask whether any of the balls reaches any of the nodes in XB. (A reachability sketch is given below.)
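Here is a minimal reachability-style d-separation check in the spirit of the Bayes ball idea; it follows the standard "reachable" procedure (as in Koller and Friedman) rather than the exact rules on the next slides, and it is not the lecture's own code. The parents dictionary encodes the example DAG, and the three queries match Examples 1 to 3 below.

from collections import defaultdict

def d_separated(parents, A, B, C):
    """True iff X_A is d-separated from X_B given X_C in the DAG `parents`."""
    children = defaultdict(set)
    for i, ps in parents.items():
        for p in ps:
            children[p].add(i)

    # Nodes that are in C or have a descendant in C (these activate v-structures).
    anc_of_C, stack = set(), list(C)
    while stack:
        y = stack.pop()
        if y not in anc_of_C:
            anc_of_C.add(y)
            stack.extend(parents[y])

    # Search over (node, direction-of-arrival) states; 'up' means the ball
    # arrived from a child, 'down' means it arrived from a parent.
    visited, reachable = set(), set()
    frontier = [(a, 'up') for a in A]
    while frontier:
        y, d = frontier.pop()
        if (y, d) in visited:
            continue
        visited.add((y, d))
        if y not in C:
            reachable.add(y)
        if d == 'up' and y not in C:
            frontier += [(p, 'up') for p in parents[y]]
            frontier += [(c, 'down') for c in children[y]]
        elif d == 'down':
            if y not in C:
                frontier += [(c, 'down') for c in children[y]]
            if y in anc_of_C:                      # v-structure made active
                frontier += [(p, 'up') for p in parents[y]]
    return reachable.isdisjoint(B)

parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}
print(d_separated(parents, {4}, {1, 3}, {2}))      # True  (Example 1)
print(d_separated(parents, {1}, {6}, {2, 3}))      # True  (Example 2)
print(d_separated(parents, {2}, {3}, {1, 6}))      # False (Example 3)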
Bayes Ball Algorithm - Rules 1,2
[Figure: Bayes ball rules 1 and 2, illustrated on three-node graphs over X, Y, and Z.]
Bayes Ball Algorithm - Rules 3,4,5
[Figure: Bayes ball rules 3, 4, and 5, illustrated on graphs over X, Y, and Z.]
Example 1
[Figure: the example DAG with the conditioning node X2 shaded.]
X4 ⊥ {X1, X3} | X2 is true.
Example 2
[Figure: the example DAG with the conditioning nodes X2 and X3 shaded.]
X1 ⊥ X6 | {X2, X3} is true.
Example 3
[Figure: the example DAG with the conditioning nodes X1 and X6 shaded.]
X2 ⊥ X3 | {X1, X6} is not true.
Characterization of Directed Graphical Models
A graphical model is associated with a family of probability distributions, and this family can be characterized in two equivalent ways.
◮ D1 = { p(x) : p(x) = ∏_{i=1}^n p(xi | xπi) }.
◮ D2 = { p(x) : Xi ⊥ Xνi | Xπi for all i ∈ V }. (The family of probability distributions associated with G that includes all p(x) satisfying every basic conditional independence statement associated with G.)
◮ D1 = D2. (Details will be discussed later.)
◮ This provides a strong and important link between graph theory and probability theory.
Markov Blanket
Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V.
In a DAG, the Markov blanket of node i is composed of its parents, its children, and its children's parents.
The Markov blanket identifies all the variables that shield the node from the rest of the network. This means that the Markov blanket of a node is the only knowledge needed to predict the behaviour of that node. (The term was coined by J. Pearl.)
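A minimal sketch (not from the lecture) of this construction for DAGs, computing parents ∪ children ∪ children's parents on the six-node example:

def markov_blanket(parents, i):
    """Markov blanket of node i in a DAG: parents, children, and co-parents."""
    children = [c for c, ps in parents.items() if i in ps]
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[i]) | set(children) | co_parents) - {i}

parents = {1: (), 2: (1,), 3: (1,), 4: (2,), 5: (3,), 6: (2, 5)}
print(markov_blanket(parents, 2))   # {1, 4, 5, 6}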
Outline: Undirected Graphical Models
◮ Markov networks
◮ Clique potentials
◮ Characterization of undirected graphical models
Markov Networks (Undirected Graphical Models)
[Figure: a Markov network over X1, ..., X5.]
Examples of Markov networks: Boltzmann machines, Markov random fields.
Semantics: every node is conditionally independent of its non-neighbors, given its neighbors.
Markov blanket: V is a Markov blanket for X iff X ⊥ Y | V for all Y ∉ V.
Markov boundary: the minimal Markov blanket.
Boltzmann Machines
A Markov network over a vector of binary variables, xi ∈ {0, 1}, where some variables may be hidden (x_i^H) and some may be observed (visible, x_i^V).
Learning for Boltzmann machines involves adjusting the weights W such that the generative model
   p(x|W) = (1/Z) exp( (1/2) xᵀWx )
is well matched to a set of examples {x(t)}, t = 1, ..., N.
The learning is done via EM!
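For intuition, here is a brute-force sketch (my own, not from the slides) that evaluates this model for a small, fully visible case by enumerating all binary states; the weight matrix is an arbitrary symmetric matrix with zero diagonal, chosen only for illustration.

import itertools
import math

W = [[0.0, 1.0, -0.5],
     [1.0, 0.0, 0.8],
     [-0.5, 0.8, 0.0]]

def score(x):
    """0.5 * x^T W x for a binary tuple x."""
    n = len(x)
    return 0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n))

states = list(itertools.product((0, 1), repeat=len(W)))
Z = sum(math.exp(score(x)) for x in states)        # the partition function

def p(x):
    return math.exp(score(x)) / Z

print(p((1, 1, 0)), sum(p(x) for x in states))     # a probability, and 1.0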
Conditional Independence
For undirected graphs, the conditional independence semantics is the more intuitive one.
• If every path from a node in XA to a node in XC includes at least one node in XB, then XA ⊥ XC | XB.

[Figure: node sets XA, XB, and XC, with XB separating XA from XC, so XA ⊥ XC | XB.]
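This separation test is just graph reachability after deleting the conditioning set. A minimal sketch (not from the lecture), shown here on a small chain for illustration:

from collections import deque

def separated(adj, A, B, C):
    """True iff every path from A to C passes through B, i.e. X_A ⊥ X_C | X_B."""
    seen = set(A) - set(B)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in C:
            return False
        for v in adj[u]:
            if v not in B and v not in seen:
                seen.add(v)
                queue.append(v)
    return True

# Example: the chain 1 - 2 - 3 - 4; conditioning on {2} separates {1} from {3, 4}.
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(separated(adj, {1}, {2}, {3, 4}))   # True
print(separated(adj, {1}, set(), {4}))    # False: an unblocked path remains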
Comparative Semantics
[Figure: comparative semantics of directed and undirected graphs. Three-node directed graphs over X, Y, Z illustrate cases such as X ⊥ Y | Z with not X ⊥ Y, and X ⊥ Y with not X ⊥ Y | Z (the v-structure); an undirected graph over W, X, Y, Z illustrates X ⊥ Y | {W, Z} and W ⊥ Z | {X, Y}.]
Clique Potentials
[Figure: a Markov network over X1, ..., X5.]
Clique: a fully connected subgraph (usually taken to be maximal).
Denote by xCi the set of variables in clique Ci.
For each clique Ci, we assign a nonnegative function (potential function) ψCi(xCi), which measures compatibility (agreement, constraint, or energy).
   p(x) = (1/Z) ∏_{Ci∈C} ψCi(xCi),
where Z = Σ_x ∏_{Ci∈C} ψCi(xCi) is the normalization constant.
Example
[Figure: the six-node example graph viewed as a Markov network, with a pairwise potential table attached to each edge; every table assigns the value 1.5 when its two variables agree and 0.2 when they disagree.]
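A brute-force sketch (not from the slides) of such a pairwise model, computing the normalization Z by enumeration; it assumes the edge set of the undirected version of the six-node example graph, which may not match the slide exactly.

import itertools

edges = [(1, 2), (1, 3), (2, 4), (3, 5), (2, 6), (5, 6)]   # assumed edge set

def psi(a, b):
    return 1.5 if a == b else 0.2      # agree / disagree potential

def unnormalized(x):                   # x is a tuple (x1, ..., x6)
    val = 1.0
    for i, j in edges:
        val *= psi(x[i - 1], x[j - 1])
    return val

states = list(itertools.product((0, 1), repeat=6))
Z = sum(unnormalized(x) for x in states)
p = {x: unnormalized(x) / Z for x in states}

print(Z)
print(p[(0, 0, 0, 0, 0, 0)])           # an all-agree configuration: one of the two modes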
Potential Functions: Parameterization
The potential functions must be nonnegative, so we parameterize them as
   ψC(xC) = exp{−HC(xC)}.
Therefore, the joint probability for undirected models can be written as
   p(x) = (1/Z) exp{ −Σ_{Ci∈C} HCi(xCi) }.
The sum in this expression is generally referred to as the "energy":
   H(x) ≜ Σ_{Ci∈C} HCi(xCi).
Finally, the joint probability can be represented as a Boltzmann distribution:
   p(x) = (1/Z) exp{−H(x)}.
Potential Functions, again!
The factorization of the joint probability distribution in a Markov network is given by
   p(x) = (1/Z) ∏_{Ci∈C} ψCi(xCi).
If all potential functions are strictly positive, then we have
   p(x) = exp{ Σ_{Ci∈C} log ψCi(xCi) − log Z }.
In some cases this takes the form
   p(x) = exp{ Σ_i θi Ti(x) − A(θ) },
which is known as the exponential family, with natural parameters θi, sufficient statistics Ti(x), and log-partition function A(θ) = log Z. In statistical physics, −log Z is referred to as the free energy.
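As a worked illustration (my own, tying this to the earlier Boltzmann machine slide, and assuming W is symmetric with zero diagonal so the quadratic form reduces to a sum over pairs):

% The Boltzmann machine written in exponential-family form.
p(x \mid W) = \frac{1}{Z}\exp\Big(\tfrac{1}{2}\, x^{\top} W x\Big)
            = \exp\Big\{ \sum_{i<j} \underbrace{W_{ij}}_{\theta_{ij}}\,
              \underbrace{x_i x_j}_{T_{ij}(x)} - \underbrace{\log Z}_{A(\theta)} \Big\}.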
Markov Random Fields
Let X = {X1, ..., Xn} be a family of random variables defined on the set S, in which each random variable Xi takes a value in L.
Definition
The family X is called a random field.
Definition
X is said to be a Markov random field (MRF) with respect to a neighborhood system N if and only if the following two conditions are satisfied:
1. Positivity: P(x) > 0 for all x ∈ X.
2. Markovianity: P(xi | xS\i) = P(xi | xNi).
Gibbs Random Fields
Definition
A set of random variables X is said to be a Gibbs random field (GRF) if and only if its configurations obey the Gibbs distribution, which has the form
   p(x) = (1/Z) exp{ −(1/T) E(x) },
where E(x) = Σ_C ψC(x).
We often consider cliques of size up to 2:
   E(x) ≈ Σ_{i∈S} ψ1(xi) + Σ_{i∈S} Σ_{j∈Ni} ψ2(xi, xj).
Theorem (Hammersley-Clifford)
X is an MRF on S with respect to N if and only if X is a GRF on S with respect to N.
Characterization of Undirected Graphical Models
◮ U1: the family of probability distributions obtained by ranging over all possible choices of positive potential functions on the maximal cliques of the graph,
   p(x) = (1/Z) exp{ −H(x) }.
◮ U2: the family of probability distributions defined via the conditional independence assertions (XA ⊥ XB | XC) associated with G.
◮ U1 ≡ U2, by the Hammersley-Clifford theorem (proof, p85).