Graphical Models: An Introduction
Lise Getoor
Computer Science Dept
University of Maryland
http://www.cs.umd.edu/~getoor
Reading List for Next Lecture
• Learning Probabilistic Relational Models, L. Getoor, N. Friedman, D. Koller, A. Pfeffer. Invited contribution to the book Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.
http://www.cs.umd.edu/~getoor/Publications/lprm-ch.ps
http://www.cs.umd.edu/class/spring2005/cmsc828g/Readings/lprm-ch.pdf
• Probabilistic Models for Relational Data, David Heckerman, Christopher Meek and Daphne Koller.
http://www.cs.umd.edu/projects/srl2004/Papers/heckerman.pdf
ftp://ftp.research.microsoft.com/pub/tr/TR-2004-30.pdf
Graphical Models
• e.g. Bayesian networks, Bayes nets, Belief nets,
Markov networks, HMMs, Dynamic Bayes nets, etc.
• Themes:
– representation
– reasoning
– learning
• Materials based on upcoming book by Nir Friedman
and Daphne Koller.
• Slides based on material from Nir Friedman.
Probability Distributions
• Let X1,…,Xp be discrete random variables
• Let P be a joint distribution over X1,…,Xp
• If the variables are binary, then we need O(2^p) parameters to describe P
Can we do better?
• Key idea: use properties of independence
Independent Random Variables
• Two variables X and Y are independent if
– P(X = x|Y = y) = P(X = x) for all values x, y
– That is, learning the values of Y does not change prediction
of X
• If X and Y are independent then
– P(X,Y) = P(X|Y)P(Y) = P(X)P(Y)
• In general, if X1,…,Xp are independent, then
– P(X1,…,Xp) = P(X1)...P(Xp)
– Requires only O(p) parameters
Conditional Independence
• Unfortunately, most random variables of interest are not independent of each other
• A more suitable notion is that of conditional
independence
• Two variables X and Y are conditionally
independent given Z if
– P(X = x|Y = y,Z=z) = P(X = x|Z=z) for all values x,y,z
– That is, learning the values of Y does not change prediction of X
once we know the value of Z
– notation: I ( X , Y | Z )
Example: Naïve Bayesian Model
• A common model in early diagnosis:
– Symptoms are conditionally independent given the disease
(or fault)
• Thus, if
– X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and
– H denotes the hypothesis about the patient's health
• then P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H)
• This naïve Bayesian model allows a compact representation
– It does embody strong independence assumptions
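To make the factorization concrete, here is a minimal sketch of naïve Bayes diagnosis in Python. The symptom names and all probability values are made up for illustration; only the factored form P(H) ∏ P(Xi|H) comes from the slide.

```python
# A minimal sketch of the naive Bayes factorization P(H, X1..Xp) = P(H) * prod_i P(Xi|H).
# All probability numbers here are made up for illustration.

p_h = {"disease": 0.01, "healthy": 0.99}          # prior over the hypothesis H

p_x_given_h = {                                   # P(Xi = present | H) per symptom
    "headache":   {"disease": 0.9, "healthy": 0.1},
    "high_fever": {"disease": 0.8, "healthy": 0.05},
}

def posterior(symptoms):
    """Compute P(H | x1..xp) by normalizing P(H) * prod_i P(xi | H)."""
    joint = {}
    for h, prior in p_h.items():
        score = prior
        for name, present in symptoms.items():
            p = p_x_given_h[name][h]
            score *= p if present else (1.0 - p)
        joint[h] = score
    z = sum(joint.values())                       # P(x1..xp), the evidence
    return {h: v / z for h, v in joint.items()}

print(posterior({"headache": True, "high_fever": True}))
```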
Graphical Models
• A graph is a language for representing independencies
– Directed Acyclic Graph -> Bayesian Network
– Undirected Graph -> Markov Network
DAGS: Markov Assumption
• We now make this independence assumption more precise for directed acyclic graphs (DAGs)
• Each random variable X is independent of its non-descendants, given its parents Pa(X)
• Formally, I(X; NonDesc(X) | Pa(X))
[Figure: a DAG fragment showing an ancestor, the parents Y1 and Y2 of a node X, a non-descendant of X, and a descendant of X]
Markov Assumption Example
[Figure: the Earthquake network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
• In this example:
– I(E; B)
– I(B; {E, R})
– I(R; {A, B, C} | E)
– I(A; R | B, E)
– I(C; {B, E, R} | A)
I-Maps
• A DAG G is an I-Map of a distribution P if all the Markov assumptions implied by G are satisfied by P
(Assuming G and P both use the same set of random variables)
[Figure: example two-node graphs over X and Y — one with an arc X → Y, one with no arc]
Factorization
• Given that G is an I-Map of P, can we simplify the
representation of P?
• Example: a graph with two nodes X and Y and no arc between them
• Since I(X,Y), we have that P(X|Y) = P(X)
• Applying the chain rule
P(X,Y) = P(X|Y) P(Y) = P(X) P(Y)
• Thus, we have a simpler representation of P(X,Y)
Factorization Theorem
Thm: if G is an I-Map of P, then
P(X1,…,Xp) = ∏_i P(Xi | Pa(Xi))
Proof:
• By the chain rule: P(X1,…,Xp) = ∏_i P(Xi | X1,…,Xi-1)
• wlog. X1,…,Xp is an ordering consistent with G
• From the assumption: Pa(Xi) ⊆ {X1,…,Xi-1} and {X1,…,Xi-1} ⊆ Pa(Xi) ∪ NonDesc(Xi)
• Since G is an I-Map, I(Xi; NonDesc(Xi) | Pa(Xi))
• Hence, I(Xi; {X1,…,Xi-1} − Pa(Xi) | Pa(Xi))
• We conclude P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi))
Factorization Example
[Figure: the Earthquake network — Earthquake → Radio, Earthquake → Alarm ← Burglary, Alarm → Call]
P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E)
versus
P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
Consequences
• We can write P in terms of "local" conditional probabilities
• If G is sparse, that is, |Pa(Xi)| ≤ k, then:
– each conditional probability can be specified compactly
– e.g. for binary variables, these require O(2^k) params
– the representation of P is compact: linear in the number of variables
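As a quick sanity check on these counts, the following sketch compares the two representations for illustrative sizes p = 30 and k = 3 (the numbers are arbitrary):

```python
# Comparing the cost of a full joint table with the factored representation,
# for binary variables: the full joint needs 2^p - 1 free parameters; a network
# where each node has at most k parents needs at most p * 2^k.
p, k = 30, 3
full_joint = 2**p - 1
factored   = p * 2**k          # one CPT row (2^k entries) per variable
print(full_joint, factored)    # 1073741823 vs 240
```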
DAGS: Summary
• The Markov independencies of a DAG G
– I(Xi; NonDesc(Xi) | Pai)
• G is an I-Map of a distribution P
– if P satisfies the Markov independencies implied by G
• If G is an I-Map of P, then
P(X1,…,Xn) = ∏_i P(Xi | Pai)
Conditional Independencies
• Let Markov(G) be the set of Markov independencies implied by G
• The factorization theorem shows:
G is an I-Map of P ⟹ P(X1,…,Xn) = ∏_i P(Xi | Pai)
• We can also show the opposite:
Thm: P(X1,…,Xn) = ∏_i P(Xi | Pai) ⟹ G is an I-Map of P
Implied Independencies
• Does a graph G imply additional independencies as a consequence of Markov(G)?
• We can define a logic of independence statements
• Some axioms:
– I(X; Y | Z) ⟹ I(Y; X | Z)
– I(X; {Y1, Y2} | Z) ⟹ I(X; Y1 | Z)
d-separation
• A procedure d-sep(X; Y | Z, G) that given a DAG G,
and sets X, Y, and Z returns either yes or no
• Goal:
d-sep(X; Y | Z, G) = yes iff I(X;Y|Z) follows from Markov(G)
Paths
• Intuition: dependency must "flow" along paths in the graph
• A path is a sequence of neighboring variables
Examples (in the Earthquake network):
• R ← E → A ← B
• C ← A ← E → R
Paths
• We want to know when a path is
– active -- creates a dependency between its end nodes
– blocked -- cannot create a dependency between its end nodes
• We want to classify the situations in which paths are active.
Path Blockage
Three cases, illustrated on the Earthquake network:
• Common cause (R ← E → A):
– Blocked when E is in the evidence
– Active when E is not in the evidence
• Intermediate cause (E → A → C):
– Blocked when A is in the evidence
– Active when A is not in the evidence
• Common effect (B → A ← E, a "v-structure"):
– Blocked when neither A nor any of its descendants is in the evidence
– Active when A or one of its descendants is in the evidence
Path Blockage -- General Case
A path is active, given evidence Z, if
• Whenever we have a configuration A → B ← C on the path, B or one of its descendants is in Z
• No other node on the path is in Z
A path is blocked, given evidence Z, if it is not active.
Example
[Figure: the Earthquake network — E → R, E → A ← B, A → C]
– d-sep(R,B) = yes: the only path R ← E → A ← B is blocked at the unobserved v-structure A
– d-sep(R,B|A) = no: observing A activates the v-structure, so the path becomes active
– d-sep(R,B|E,A) = yes: observing E blocks the common-cause segment R ← E → A, regardless of A
d-Separation
• X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z
• Checking d-separation can be done efficiently (linear time in the number of edges)
– Bottom-up phase: mark all nodes whose descendants are in Z
– X to Y phase: traverse (BFS) all edges on paths from X to Y and check if they are blocked
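The linear-time procedure above needs some bookkeeping; the sketch below instead checks d-separation by brute force, enumerating all simple paths and applying the blockage rules. The graph encoding (a dict mapping each node to its parent list) is my own convention, and the approach is exponential in general, so it is only suitable for small teaching examples like the Earthquake network.

```python
# Brute-force d-separation: enumerate all simple undirected paths from X to Y
# and test each against the path-blockage rules above.

def descendants(dag, node):
    """All strict descendants of `node` in the DAG (children, grandchildren, ...)."""
    children = {v: set() for v in dag}
    for v, parents in dag.items():
        for p in parents:
            children[p].add(v)
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def d_separated(dag, x, y, z):
    """True iff every undirected path between x and y is blocked given z."""
    z = set(z)
    nbrs = {v: set(dag[v]) for v in dag}          # undirected adjacency
    for v, parents in dag.items():
        for p in parents:
            nbrs[p].add(v)

    def path_active(path):
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            collider = a in dag[b] and c in dag[b]       # a -> b <- c
            if collider:
                if b not in z and not (descendants(dag, b) & z):
                    return False                         # blocked v-structure
            elif b in z:
                return False                             # blocked chain/fork
        return True

    def dfs(path):
        last = path[-1]
        if last == y:
            return path_active(path)
        return any(dfs(path + [n]) for n in nbrs[last] if n not in path)

    return not dfs([x])

# The Earthquake network: E -> A <- B, E -> R, A -> C
dag = {"E": [], "B": [], "R": ["E"], "A": ["E", "B"], "C": ["A"]}
print(d_separated(dag, "R", "B", set()))        # True
print(d_separated(dag, "R", "B", {"A"}))        # False
print(d_separated(dag, "R", "B", {"E", "A"}))   # True
```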
Soundness
Thm:
• If
– G is an I-Map of P
– d-sep( X; Y | Z, G ) = yes
• then
– P satisfies I( X; Y | Z )
Informally,
• Any independence reported by d-separation is satisfied by the underlying distribution
Completeness
Thm:
• If d-sep( X; Y | Z, G ) = no
• then there is a distribution P such that
– G is an I-Map of P
– P does not satisfy I( X; Y | Z )
Informally,
• Any independence not reported by d-separation might
be violated by the underlying distribution
• We cannot determine this by examining the graph
structure alone
I-Maps revisited
• The fact that G is an I-Map of P might not be that useful
• For example, complete DAGs
– A DAG G is complete if we cannot add an arc without creating a cycle
[Figure: two complete DAGs over X1, X2, X3, X4]
• These DAGs do not imply any independencies
• Thus, they are I-Maps of any distribution
Minimal I-Maps
A DAG G is a minimal I-Map of P if
• G is an I-Map of P
• If G′ ⊂ G, then G′ is not an I-Map of P
That is, removing any arc from G introduces (conditional) independencies that do not hold in P
Minimal I-Map Example
• If the DAG shown over X1, X2, X3, X4 is a minimal I-Map of P,
• then none of the graphs obtained from it by removing a single arc is an I-Map of P.
[Figure: the minimal I-Map and four variants, each with one arc removed]
Constructing minimal I-Maps
The factorization theorem suggests an algorithm (sketched below):
• Fix an ordering X1,…,Xn
• For each i, select Pai to be a minimal subset of {X1,…,Xi-1} such that I(Xi; {X1,…,Xi-1} − Pai | Pai)
• Clearly, the resulting graph is a minimal I-Map.
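A sketch of this construction, assuming access to an independence oracle for P (something the slides presuppose but do not specify); the greedy pruning order is an arbitrary choice, so different orders may return different minimal parent sets:

```python
# Minimal I-Map construction under a fixed ordering, given an oracle
# indep(X, Ys, Zs) answering I(X ; Ys | Zs) in the target distribution P.

def build_minimal_imap(order, indep):
    """Return {Xi: parent set} for a minimal I-Map under the given ordering."""
    parents = {}
    for i, x in enumerate(order):
        preds = set(order[:i])
        pa = set(preds)                      # start with all predecessors
        # Greedily drop parents while the Markov condition still holds
        for cand in sorted(pa):
            trial = pa - {cand}
            if indep(x, preds - trial, trial):
                pa = trial
        parents[x] = pa
    return parents
```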
Non-uniqueness of minimal I-Map
• Unfortunately, there may be several minimal I-Maps for the same distribution
– Applying the I-Map construction procedure with different orders can lead to different structures
[Figure: the original Earthquake I-Map next to the denser minimal I-Map obtained with the order C, R, A, E, B]
Choosing Ordering & Causality
• The choice of order can have a drastic impact on the complexity of the minimal I-Map
• Heuristic argument: construct the I-Map using a causal ordering among the variables
• Justification?
– It is often reasonable to assume that graphs of causal influence should satisfy the Markov properties.
P-Maps
• A DAG G is a P-Map (perfect map) of a distribution P if
– I(X; Y | Z) if and only if d-sep(X; Y | Z, G) = yes
Notes:
• A P-Map captures all the independencies in the distribution
• P-Maps are unique, up to DAG equivalence
• Unfortunately, some distributions do not have a P-Map
Bayesian Networks
• A Bayesian network specifies a probability distribution via two components:
– A DAG G
– A collection of conditional probability distributions P(Xi|Pai)
• The joint distribution P is defined by the factorization
P(X1,…,Xn) = ∏_i P(Xi | Pai)
• Additional requirement: G is a minimal I-Map of P
DAGs and BNs
• DAGs as a representation of conditional
independencies:
– Markov independencies of a DAG
– Tight correspondence between Markov(G) and the
factorization defined by G
– d-separation, a sound & complete procedure for computing
the consequences of the independencies
– Notion of minimal I-Map
– P-Maps
• This theory is the basis for defining Bayesian
networks
Undirected Graphs: Markov Networks
• Alternative representation of conditional
independencies
• Let U be an undirected graph
• Let Ni be the set of neighbors of Xi
• Define Markov(U) to be the set of independencies
I( Xi ; {X1,…,Xn} - Ni - {Xi } | Ni )
• U is an I-Map of P if P satisfies Markov(U)
Example
[Figure: an undirected square A – B – C – D – A]
This graph implies that
• I(A; C | B, D)
• I(B; D | A, C)
• Note: this example does not have a directed P-Map
Markov Network Factorization
Thm: if
• P is strictly positive, that is, P(x1,…,xn) > 0 for all assignments
then
• U is an I-Map of P
if and only if
• there is a factorization
P(X1,…,Xn) = (1/Z) ∏_i f_i(C_i)
where C1,…,Ck are the maximal cliques in U and Z is a normalizing constant
Alternative form:
P(X1,…,Xn) = (1/Z) e^(Σ_i g_i(C_i))
Relationship between Directed & Undirected Models
[Figure: chain graphs subsume both directed graphs and undirected graphs as special cases]
CPDs
• So far, we focused on how to represent independencies using DAGs
• The "other" component of a Bayesian network is the specification of the conditional probability distributions (CPDs)
• Here, we'll just discuss the simplest representation of CPDs
Tabular CPDs
• When the variables of interest are all discrete, the common representation is a table
• For example, P(C|A,B) can be represented by:

A  B | P(C = 0 | A, B) | P(C = 1 | A, B)
0  0 | 0.25 | 0.75
0  1 | 0.50 | 0.50
1  0 | 0.12 | 0.88
1  1 | 0.33 | 0.67
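One natural machine representation of such a table is an array with one axis per variable; a sketch (the specific layout is just one convention):

```python
import numpy as np

# The table above as a numpy array: axis 0 is A, axis 1 is B, axis 2 is C.
# Each slice cpd[a, b, :] is the distribution P(C | A=a, B=b) and must sum to 1.
cpd = np.array([[[0.25, 0.75],
                 [0.50, 0.50]],
                [[0.12, 0.88],
                 [0.33, 0.67]]])

assert np.allclose(cpd.sum(axis=-1), 1.0)
print(cpd[1, 0, 1])   # P(C = 1 | A = 1, B = 0) = 0.88
```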
Tabular CPDs
Pros:
• Very flexible, can capture any CPD of discrete
variables
• Can be easily stored and manipulated
Cons:
• Representation size grows exponentially with the number of parents!
• Unwieldy to assess probabilities for more than a few parents
Continuous CPDs
• When X is a continuous variable, we need to represent the density of X given any value of its parents
– Gaussian
– Conditional Gaussian
CPDs: Summary
• Many choices for representing CPDs
• Any “statistical” model of conditional distribution can
be used
– e.g., any regression model
• Representing structure in CPDs can have implications
on independencies among variables
Inference in Bayesian
Networks
Inference
• We now have compact representations of probability distributions:
– Bayesian networks
– Markov networks
• A network describes a unique probability distribution P
• How do we answer queries about P?
• Inference is the name for the process of computing answers to such queries
Queries: Likelihood
• There are many types of queries we might ask
• Most of these involve evidence
– Evidence e is an assignment of values to a set E of variables in the domain
– Without loss of generality, E = { Xk+1, …, Xn }
• Simplest query: compute the probability of the evidence
P(e) = Σ_{x1} ⋯ Σ_{xk} P(x1,…,xk, e)
• This is often referred to as computing the likelihood of the evidence
Queries: A posteriori belief
• Often we are interested in the conditional probability of a variable given the evidence
P(X | e) = P(X, e) / P(e)
• This is the a posteriori belief in X, given evidence e
• A related task is computing the term P(X, e)
– i.e., the likelihood of e and X = x for each value of X
– we can recover the a posteriori belief by
P(X = x | e) = P(X = x, e) / Σ_{x'} P(X = x', e)
A posteriori belief
This query is useful in many cases:
• Prediction: what is the probability of an outcome given the starting condition?
– The target is a descendant of the evidence
• Diagnosis: what is the probability of a disease/fault given the symptoms?
– The target is an ancestor of the evidence
• Note: the direction of the arcs between variables does not restrict the direction of the queries
– Probabilistic inference can combine evidence from all parts of the network
Queries: MAP
• In this query we want to find the maximum a
posteriori assignment for some variable of interest
(say X1,…,Xl )
• That is, x1,…,xl maximize the probability
P(x1,…,xl | e)
• Note that this is equivalent to maximizing
P(x1,…,xl, e)
Queries: MAP
We can use MAP for:
• Classification
– find most likely label, given the evidence
• Explanation
– What is the most likely scenario, given the evidence
Queries: MAP
Cautionary note:
• The MAP depends on the set of variables
• Example:

x  y | P(x,y)
0  0 | 0.35
0  1 | 0.05
1  0 | 0.30
1  1 | 0.30

– The MAP of X alone is x = 1, since P(x=1) = 0.6 > P(x=0) = 0.4
– But the MAP of (X, Y) is (0, 0), which assigns x = 0
Complexity of Inference
Thm:
Computing P(X = x) in a Bayesian network is NP-hard
Not surprising, since we can simulate Boolean gates.
Hardness
• Hardness does not mean we cannot solve inference
– It implies that we cannot find a general procedure that
works efficiently for all networks
– For particular families of networks, we can have provably
efficient procedures
Approaches to inference
• Exact inference
– Inference in Simple Chains
– Variable elimination
– Clustering / join tree algorithms
• Approximate inference
– Stochastic simulation / sampling methods
– Markov chain Monte Carlo methods
– Mean field theory
Inference in Simple Chains
X1 → X2
How do we compute P(X2)?
P(x2) = Σ_{x1} P(x1, x2) = Σ_{x1} P(x1) P(x2 | x1)
Inference in Simple Chains (cont.)
X1 → X2 → X3
How do we compute P(X3)?
P(x3) = Σ_{x2} P(x2, x3) = Σ_{x2} P(x2) P(x3 | x2)
• and we already know how to compute P(X2):
P(x2) = Σ_{x1} P(x1, x2) = Σ_{x1} P(x1) P(x2 | x1)
Inference in Simple Chains (cont.)
X1 → X2 → X3 → ... → Xn
How do we compute P(Xn)?
• Compute P(X1), P(X2), P(X3), …
• We compute each term by using the previous one:
P(x_{i+1}) = Σ_{xi} P(xi) P(x_{i+1} | xi)
Complexity:
• Each step costs O(|Val(Xi)| · |Val(Xi+1)|) operations
• Compare to naïve evaluation, which requires summing over the joint values of n−1 variables
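The forward pass is a few lines with numpy; the chain length, cardinalities, and CPT values below are made up, and each matrix-vector product is exactly one O(|Val(Xi)| · |Val(Xi+1)|) step:

```python
import numpy as np

# Forward pass for P(Xn) in a chain X1 -> X2 -> ... -> Xn:
# repeatedly apply P(x_{i+1}) = sum_{x_i} P(x_i) P(x_{i+1} | x_i).

p_x1 = np.array([0.6, 0.4])                 # P(X1)
transitions = [
    np.array([[0.9, 0.1],                   # P(X2 | X1); row = value of X1
              [0.2, 0.8]]),
    np.array([[0.7, 0.3],                   # P(X3 | X2)
              [0.5, 0.5]]),
]

p = p_x1
for t in transitions:
    p = p @ t                               # one forward step
print(p)                                    # P(X3)
```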
Inference in Simple Chains (cont.)
X1 → X2
• Suppose that we observe the value X2 = x2
• How do we compute P(X1|x2)?
– Recall that it suffices to compute P(X1, x2):
P(x1, x2) = P(x2 | x1) P(x1)
Inference in Simple Chains (cont.)
X1 → X2 → X3
• Suppose that we observe the value X3 = x3
• How do we compute P(X1, x3)?
P(x1, x3) = P(x1) P(x3 | x1)
• How do we compute P(x3|x1)?
P(x3 | x1) = Σ_{x2} P(x2, x3 | x1) = Σ_{x2} P(x2 | x1) P(x3 | x1, x2) = Σ_{x2} P(x2 | x1) P(x3 | x2)
Inference in Simple Chains (cont.)
X1 → X2 → X3 → ... → Xn
• Suppose that we observe the value Xn = xn
• How do we compute P(X1, xn)?
P(x1, xn) = P(x1) P(xn | x1)
• We compute P(xn|xn-1), P(xn|xn-2), … iteratively:
P(xn | xi) = Σ_{x_{i+1}} P(x_{i+1}, xn | xi) = Σ_{x_{i+1}} P(x_{i+1} | xi) P(xn | x_{i+1})
Inference in Simple Chains (cont.)
X1 → X2 → ... → Xk → ... → Xn
• Suppose that we observe the value Xn = xn
• We want to find P(Xk | xn)
• How do we compute P(Xk, xn)?
P(xk, xn) = P(xk) P(xn | xk)
• We compute P(Xk) by forward iterations
• We compute P(xn | Xk) by backward iterations
Elimination in Chains
• We now try to understand the simple chain example using first-order principles
A → B → C → D → E
• Using the definition of probability, we have
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
Elimination in Chains
A → B → C → D → E
• By the chain decomposition, we get
P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
     = Σ_d Σ_c Σ_b Σ_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
Elimination in Chains
A → B → C → D → E
• Rearranging terms ...
P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b|a) P(c|b) P(d|c) P(e|d)
     = Σ_d Σ_c Σ_b P(c|b) P(d|c) P(e|d) Σ_a P(a) P(b|a)
Elimination in Chains
(A eliminated) B → C → D → E
• Now we can perform the innermost summation
P(e) = Σ_d Σ_c Σ_b P(c|b) P(d|c) P(e|d) Σ_a P(a) P(b|a)
     = Σ_d Σ_c Σ_b P(c|b) P(d|c) P(e|d) p(b)
• This summation is exactly the first step in the forward iteration we described before
Elimination in Chains
(A and B eliminated) C → D → E
• Rearranging and then summing again, we get
P(e) = Σ_d Σ_c Σ_b P(c|b) P(d|c) P(e|d) p(b)
     = Σ_d Σ_c P(d|c) P(e|d) Σ_b P(c|b) p(b)
     = Σ_d Σ_c P(d|c) P(e|d) p(c)
Variable Elimination
General idea:
• Write the query in the form
P(Xn, e) = Σ_{xk} ⋯ Σ_{x3} Σ_{x2} ∏_i P(xi | pai)
• Iteratively
– Move all irrelevant terms outside of the innermost sum
– Perform the innermost sum, getting a new term
– Insert the new term into the product
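A compact sketch of this procedure as a generic factor engine (the factor encoding is my own convention; a real implementation would use arrays rather than dicts):

```python
from itertools import product

# Sum-product variable elimination. A factor is (vars, table) where table maps
# a tuple of values (one per var, in order) to a number. Teaching sketch only.

def multiply(f, g, domains):
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + [v for v in gvars if v not in fvars]
    out = {}
    for vals in product(*(domains[v] for v in out_vars)):
        asg = dict(zip(out_vars, vals))
        out[vals] = (ftab[tuple(asg[v] for v in fvars)] *
                     gtab[tuple(asg[v] for v in gvars)])
    return (out_vars, out)

def sum_out(f, var, domains):
    fvars, ftab = f
    out_vars = [v for v in fvars if v != var]
    out = {}
    for vals, p in ftab.items():
        key = tuple(v for v, name in zip(vals, fvars) if name != var)
        out[key] = out.get(key, 0.0) + p
    return (out_vars, out)

def eliminate(factors, order, domains):
    for var in order:
        related = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not related:
            continue
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f, domains)
        factors = rest + [sum_out(prod, var, domains)]
    return factors

# Usage on the two-node chain X1 -> X2, eliminating X1 to get P(X2):
domains = {"X1": [0, 1], "X2": [0, 1]}
fx1 = (["X1"], {(0,): 0.6, (1,): 0.4})
fx2 = (["X1", "X2"], {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([fx1, fx2], ["X1"], domains))   # single factor over X2: 0.62, 0.38
```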
A More Complex Example
• The "Asia" network:
[Figure: Visit to Asia (V) → Tuberculosis (T); Smoking (S) → Lung Cancer (L) and Bronchitis (B); T, L → Abnormality in Chest (A); A → X-Ray (X); A, B → Dyspnea (D)]
• We want to compute P(d)
• Need to eliminate: v, s, x, t, l, a, b
Initial factors:
P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
Eliminate: v
• Compute: f_v(t) = Σ_v P(v) P(t|v)
• Remaining factors: f_v(t) P(s) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b)
• Note: f_v(t) = P(t). In general, the result of elimination is not necessarily a probability term.
• Still to eliminate: s, x, t, l, a, b
Eliminate: s
• Compute: f_s(b,l) = Σ_s P(s) P(b|s) P(l|s)
• Remaining factors: f_v(t) f_s(b,l) P(a|t,l) P(x|a) P(d|a,b)
• Summing on s results in a factor with two arguments, f_s(b,l). In general, the result of elimination may be a function of several variables.
• Still to eliminate: x, t, l, a, b
Eliminate: x
• Compute: f_x(a) = Σ_x P(x|a)
• Remaining factors: f_v(t) f_s(b,l) f_x(a) P(a|t,l) P(d|a,b)
• Note: f_x(a) = 1 for all values of a!
• Still to eliminate: t, l, a, b
Eliminate: t
• Compute: f_t(a,l) = Σ_t f_v(t) P(a|t,l)
• Remaining factors: f_s(b,l) f_x(a) f_t(a,l) P(d|a,b)
• Still to eliminate: l, a, b
Eliminate: l
• Compute: f_l(a,b) = Σ_l f_s(b,l) f_t(a,l)
• Remaining factors: f_l(a,b) f_x(a) P(d|a,b)
• Still to eliminate: a, b
Eliminate: a, then b
• Compute: f_a(b,d) = Σ_a f_l(a,b) f_x(a) P(d|a,b)
• Then: f_b(d) = Σ_b f_a(b,d)
• The final factor f_b(d) is exactly P(d)
Variable Elimination
• We now understand variable elimination as a
sequence of rewriting operations
• Actual computation is done in elimination step
• Exactly the same computation procedure applies to
Markov networks
• Computation depends on order of elimination
Complexity of variable elimination
• Suppose in one elimination step we compute
f_x(y1,…,yk) = Σ_x f'_x(x, y1,…,yk)
where
f'_x(x, y1,…,yk) = ∏_{i=1..m} f_i(x, y_{i,1},…,y_{i,l_i})
This requires
• m · |Val(X)| · ∏_i |Val(Yi)| multiplications
– For each value of x, y1,…,yk, we do m multiplications
• |Val(X)| · ∏_i |Val(Yi)| additions
– For each value of y1,…,yk, we do |Val(X)| additions
Complexity is exponential in the number of variables in the intermediate factor!
Understanding Variable Elimination
• We want to select “good” elimination orderings that
reduce complexity
• We start by attempting to understand variable
elimination via the graph we are working with
• This will reduce the problem of finding a good ordering to a graph-theoretic operation that is well understood
Undirected graph representation
• At each stage of the procedure, we have an algebraic term that we need to evaluate
• In general, this term is of the form
P(x1,…,xk) = Σ_{y1} ⋯ Σ_{yn} ∏_i f_i(Z_i)
where the Z_i are sets of variables
• We now plot a graph where there is an undirected edge X—Y if X and Y are arguments of some factor
– that is, if X and Y are in some Z_i
• Note: this is the Markov network that describes the probability distribution on the variables we have not yet eliminated
Chordal Graphs
• elimination ordering ⟹ undirected chordal graph
[Figure: the Asia network and the chordal graph induced by an elimination ordering]
Graph:
• Maximal cliques are factors in elimination
• Factors in elimination are cliques in the graph
• Complexity is exponential in the size of the largest clique in the graph
Induced Width
• The size of the largest clique in the induced graph is
thus an indicator for the complexity of variable
elimination
• This quantity is called the induced width of a graph
according to the specified ordering
• Finding a good ordering for a graph is equivalent to
finding the minimal induced width of the graph
General Networks
• From graph theory:
Thm:
• Finding an ordering that minimizes the induced width
is NP-Hard
However,
• There are reasonable heuristics for finding "relatively" good orderings
• There are provable approximations to the best induced width
• If the graph has a small induced width, there are algorithms that find it in polynomial time
Elimination on Trees
• Formally, for any tree, there is an elimination ordering with induced width = 1
Thm:
• Inference on trees is linear in the number of variables
PolyTrees
• A polytree is a network where there is at most one path from one variable to another
Thm:
• Inference in a polytree is linear in the representation size of the network
– This assumes a tabular CPT representation
[Figure: a polytree over A, B, C, D, E, F, G, H]
Learning Bayesian Networks
Learning Bayesian networks
[Figure: Data + prior information feed an Inducer, which outputs the network E → A ← B, E → R, A → C together with its CPTs, e.g. for the alarm variable:]

E  B  | P(a | E,B)  P(¬a | E,B)
e  b  | .9   .1
e  ¬b | .7   .3
¬e b  | .8   .2
¬e ¬b | .99  .01
Known Structure -- Complete Data
Input: complete instances of <E, B, A>:
<Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
[Figure: the Inducer takes the data and the fixed structure E → A ← B and fills in the unknown entries of the CPT P(A | E,B)]
• The network structure is specified in advance
– The Inducer needs to estimate the parameters
• The data does not contain missing values
Unknown Structure -- Complete Data
Input: complete instances of <E, B, A>:
<Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, <N,Y,Y>
[Figure: the Inducer takes the data and must produce both the graph and its CPTs]
• The network structure is not specified
– The Inducer needs to select the arcs and estimate the parameters
• The data does not contain missing values
Known Structure -- Incomplete Data
Input: instances of <E, B, A> with missing values:
<Y,N,N>, <Y,?,Y>, <N,N,Y>, <N,Y,?>, …, <?,Y,Y>
[Figure: the Inducer takes the partially observed data and the fixed structure E → A ← B and estimates the CPT P(A | E,B)]
• The network structure is specified
• The data contains missing values
– We consider assignments to the missing values
Known Structure / Complete Data
• Given a network structure G
– And a choice of parametric family for P(Xi|Pai)
• Learn parameters for the network
Goal:
• Construct a network that is "closest" to the probability distribution that generated the data
Learning Parameters for a Bayesian Network
[Figure: the network E → A ← B, A → C]
• Training data has the form:
D = { <E[1], B[1], A[1], C[1]>, …, <E[M], B[M], A[M], C[M]> }
Learning Parameters for a Bayesian Network
• Since we assume i.i.d. samples, the likelihood function is
L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
Learning Parameters for a Bayesian Network
• By the definition of the network, we get
L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
         = ∏_m P(E[m] : θ) P(B[m] : θ) P(A[m] | B[m], E[m] : θ) P(C[m] | A[m] : θ)
Learning Parameters for a Bayesian Network
• Rewriting terms, we get
L(θ : D) = ∏_m P(E[m], B[m], A[m], C[m] : θ)
         = [∏_m P(E[m] : θ)] · [∏_m P(B[m] : θ)] · [∏_m P(A[m] | B[m], E[m] : θ)] · [∏_m P(C[m] | A[m] : θ)]
General Bayesian Networks
Generalizing for any Bayesian network:
L(θ : D) = ∏_m P(x1[m],…,xn[m] : θ)          (i.i.d. samples)
         = ∏_m ∏_i P(xi[m] | Pai[m] : θi)     (network factorization)
         = ∏_i ∏_m P(xi[m] | Pai[m] : θi)
         = ∏_i Li(θi : D)
• The likelihood decomposes according to the structure of the network.
General Bayesian Networks (Cont.)
Decomposition ⟹ independent estimation problems
If the parameters for each family are not related, then they can be estimated independently of each other.
From Binomial to Multinomial
• For example, suppose X can take the values 1,2,…,K
• We want to learn the parameters θ1, θ2, …, θK
Sufficient statistics:
• N1, N2, …, NK - the number of times each outcome is observed
Likelihood function:
L(θ : D) = ∏_{k=1..K} θk^{Nk}
MLE:
θ̂k = Nk / Σ_ℓ Nℓ
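In code, the MLE is just normalized counts; the sample below is made up:

```python
from collections import Counter

# MLE for a single multinomial variable: theta_k = N_k / sum_l N_l.
data = [1, 2, 2, 3, 2, 1, 2, 3, 2, 2]
counts = Counter(data)                       # sufficient statistics N_k
n = len(data)
theta_hat = {k: counts[k] / n for k in sorted(counts)}
print(theta_hat)                             # {1: 0.2, 2: 0.6, 3: 0.2}
```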
Likelihood for Multinomial Networks
• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:
Li(θi : D) = ∏_m P(xi[m] | Pai[m] : θi)
           = ∏_{pai} ∏_{m : Pai[m] = pai} P(xi[m] | pai : θi)
           = ∏_{pai} ∏_{xi} P(xi | pai : θi)^{N(xi, pai)}
           = ∏_{pai} ∏_{xi} θ_{xi|pai}^{N(xi, pai)}
Likelihood for Multinomial Networks
• When we assume that P(Xi | Pai) is multinomial, we get further decomposition:
Li(θi : D) = ∏_{pai} ∏_{xi} θ_{xi|pai}^{N(xi, pai)}
• For each value pai of the parents of Xi we get an independent multinomial problem
• The MLE is
θ̂_{xi|pai} = N(xi, pai) / N(pai)
Bayesian Approach: Dirichlet Priors
• Recall that the likelihood function is
L(θ : D) = ∏_{k=1..K} θk^{Nk}
• A Dirichlet prior with hyperparameters α1,…,αK is defined as
P(θ) ∝ ∏_{k=1..K} θk^{αk − 1}    for legal θ1,…,θK
• Then the posterior has the same form, with hyperparameters α1+N1,…,αK+NK:
P(θ | D) ∝ P(θ) P(D | θ) = ∏_k θk^{αk − 1} ∏_k θk^{Nk} = ∏_k θk^{αk + Nk − 1}
Dirichlet Priors (cont.)
• We can compute the prediction on a new event in closed form:
• If P(θ) is Dirichlet with hyperparameters α1,…,αK, then
P(X[1] = k) = ∫ θk P(θ) dθ = αk / Σ_ℓ αℓ
• Since the posterior is also Dirichlet, we get
P(X[M+1] = k | D) = ∫ θk P(θ | D) dθ = (αk + Nk) / Σ_ℓ (αℓ + Nℓ)
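The closed-form prediction is equally direct; a uniform prior and made-up counts, for illustration:

```python
# Dirichlet posterior prediction in closed form:
# P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l).
alphas = {1: 1.0, 2: 1.0, 3: 1.0}            # a uniform prior, for illustration
counts = {1: 2, 2: 6, 3: 2}                  # N_k from the data
total = sum(alphas[k] + counts[k] for k in alphas)
pred = {k: (alphas[k] + counts[k]) / total for k in alphas}
print(pred)                                  # {1: 3/13, 2: 7/13, 3: 3/13}
```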
Prior Knowledge
• The hyperparameters α1,…,αK can be thought of as "imaginary" counts from our prior experience
• Equivalent sample size = α1 + … + αK
• The larger the equivalent sample size, the more confident we are in our prior
Bayesian Prediction (cont.)
• Given these observations, we can compute the posterior for each multinomial θ_{Xi | pai} independently
– The posterior is Dirichlet with parameters α(Xi=1|pai)+N(Xi=1|pai), …, α(Xi=k|pai)+N(Xi=k|pai)
• The predictive distribution is then represented by the parameters
θ̃_{xi|pai} = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))
Learning Parameters: Summary
• Estimation relies on sufficient statistics
– For multinomials these are of the form N(xi, pai)
– Parameter estimation:
MLE: θ̂_{xi|pai} = N(xi, pai) / N(pai)
Bayesian (Dirichlet): θ̃_{xi|pai} = (α(xi, pai) + N(xi, pai)) / (α(pai) + N(pai))
• Bayesian methods also require a choice of priors
• Both MLE and Bayesian estimates are asymptotically equivalent and consistent
• Both can be implemented in an on-line manner by accumulating sufficient statistics (see the sketch below)
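A sketch of such on-line accumulation for a single family (the class name and interface are illustrative, not from the slides):

```python
from collections import Counter

# Both estimators depend on the data only through the counts N(xi, pai),
# so learning can stream one sample at a time.
class FamilyCounts:
    def __init__(self):
        self.joint = Counter()               # N(xi, pai)
        self.parent = Counter()              # N(pai)

    def update(self, xi, pai):
        self.joint[(xi, pai)] += 1
        self.parent[pai] += 1

    def mle(self, xi, pai):
        return self.joint[(xi, pai)] / self.parent[pai]

    def bayes(self, xi, pai, alpha_joint, alpha_parent):
        return ((alpha_joint + self.joint[(xi, pai)]) /
                (alpha_parent + self.parent[pai]))

fc = FamilyCounts()
for xi, pai in [(1, (0, 0)), (0, (0, 0)), (1, (0, 0))]:
    fc.update(xi, pai)
print(fc.mle(1, (0, 0)))                     # 2/3
```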
Learning Structure from Complete Data
Why Struggle for Accurate Structure?
[Figure: the network Earthquake → Alarm Set ← Burglary, Alarm Set → Sound, flanked by a version missing an arc and a version with an extra arc]
• Missing an arc:
– Cannot be compensated for by accurate fitting of parameters
– Also misses the causality and domain structure
• Adding an arc:
– Increases the number of parameters to be fitted
– Makes wrong assumptions about causality and domain structure
Approaches to Learning Structure
• Constraint based
– Perform tests of conditional independence
– Search for a network that is consistent with the observed dependencies and independencies
• Pros & Cons
+ Intuitive, follows closely the construction of BNs
+ Separates structure learning from the form of the independence tests
− Sensitive to errors in individual tests
Approaches to Learning Structure
• Score based
– Define a score that evaluates how well the (in)dependencies in a structure match the observations
– Search for a structure that maximizes the score
• Pros & Cons
+ Statistically motivated
+ Can make compromises
+ Takes the structure of conditional probabilities into account
− Computationally hard
Likelihood Score for Structures
First cut approach:
– Use the likelihood function
• Recall, the likelihood score for a network structure and parameters is
L(G, θG : D) = ∏_m P(x1[m],…,xn[m] : G, θG) = ∏_m ∏_i P(xi[m] | Pai^G[m] : G, θG,i)
• Since we know how to maximize the parameters, from now on we assume
L(G : D) = max_{θG} L(G, θG : D)
Avoiding Overfitting
“Classic” issue in learning.
Approaches:
• Restricting the hypotheses space
– Limits the overfitting capability of the learner
– Example: restrict # of parents or # of parameters
• Minimum description length
– Description length measures complexity
– Prefer models that compactly describe the training data
• Bayesian methods
– Average over all possible parameter values
– Use prior knowledge
Bayesian Inference
• Bayesian reasoning---compute the expectation over the unknown structure G:
P(x[M+1] | D) = Σ_G P(x[M+1] | D, G) P(G | D)
• Assumption: the Gs are mutually exclusive and exhaustive
• We know how to compute P(x[M+1] | G, D)
– Same as prediction with fixed structure
• How do we compute P(G | D)?
Posterior Score
Using Bayes rule:
P(G | D) = P(D | G) P(G) / P(D)
where
• P(D | G) is the marginal likelihood
• P(G) is the prior over structures
• P(D) is the probability of the data; it is the same for all structures G and can be ignored when comparing structures
Marginal Likelihood
• By introduction of variables, we have that
P(D | G) = ∫ P(D | G, θ) P(θ | G) dθ
where P(D | G, θ) is the likelihood and P(θ | G) is the prior over parameters
• This integral measures sensitivity to the choice of parameters
Marginal Likelihood for General Networks
The marginal likelihood has the form:
P(D | G) = ∏_i ∏_{pai^G} [ Γ(α(pai^G)) / Γ(α(pai^G) + N(pai^G)) ] · ∏_{xi} [ Γ(α(xi, pai^G) + N(xi, pai^G)) / Γ(α(xi, pai^G)) ]
• The inner term is the Dirichlet marginal likelihood for the sequence of values of Xi observed when Xi's parents have a particular value
where
• N(..) are the counts from the data
• α(..) are the hyperparameters for each family given G
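In practice this is evaluated in log space with the log-gamma function; a sketch for a single family and parent configuration, using scipy's gammaln:

```python
import numpy as np
from scipy.special import gammaln

# Dirichlet marginal likelihood for one family and one parent configuration,
# in log space to avoid overflow. `alphas` and `counts` hold the hyperparameters
# alpha(xi, pai) and the data counts N(xi, pai) over the values of Xi.
def log_family_marginal(alphas, counts):
    alphas, counts = np.asarray(alphas, float), np.asarray(counts, float)
    return (gammaln(alphas.sum()) - gammaln(alphas.sum() + counts.sum())
            + np.sum(gammaln(alphas + counts) - gammaln(alphas)))

print(log_family_marginal([1.0, 1.0], [3, 7]))   # log-marginal of this family's data
```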
Priors
• We need prior counts α(..) for each network structure G
• This can be a formidable task
– There are exponentially many structures…
BDe Score
Possible solution: the BDe prior
• Represent the prior using two elements M0, B0
– M0 - an equivalent sample size
– B0 - a network representing the prior probability of events
Intuition: M0 prior examples distributed by B0
• Set α(xi, pai^G) = M0 · P(xi, pai^G | B0)
– Note that pai^G are not necessarily the same as the parents of Xi in B0
– Compute P(xi, pai^G | B0) using standard inference procedures
• Such priors have desirable theoretical properties
– Equivalent networks are assigned the same score
Bayesian Score: Asymptotic Behavior
Theorem: If the prior P(θ | G) is "well-behaved", then
log P(D | G) = ℓ(G : D) − (log M / 2) dim(G) + O(1)
Asymptotic Behavior: Consequences
log P(D | G) = ℓ(G : D) − (log M / 2) dim(G) + O(1)
• The Bayesian score is consistent
– As M → ∞, the "true" structure G* maximizes the score (almost surely)
– For sufficiently large M, the maximal scoring structures are equivalent to G*
• Observed data eventually overrides prior information
– Assuming that the prior assigns positive probability to all cases
Asymptotic Behavior
Score(G : D) = ℓ(G : D) − (log M / 2) dim(G)
• This score can also be justified by the Minimum Description Length (MDL) principle
• This equation explicitly shows the tradeoff between
– Fitness to data --- the likelihood term
– Penalty for complexity --- the regularization term
Scores -- Summary
• Likelihood, MDL, and (log) BDe scores all have the form
Score(G : D) = Σ_i Score(Xi | Pai^G : N(Xi, Pai^G))
• BDe requires assessing a prior network. It can naturally incorporate prior knowledge and previous experience
• BDe is consistent and asymptotically equivalent (up to a constant) to MDL
• All are score-equivalent
– G equivalent to G' ⟹ Score(G) = Score(G')
Optimization Problem
Input:
– Training data
– Scoring function (including priors, if needed)
– Set of possible structures
• Including prior knowledge about structure
Output:
– A network (or networks) that maximize the score
Key Property:
– Decomposability: the score of a network is a sum of terms.
Heuristic Search
We address the problem by using heuristic search
• Define a search space:
– nodes are possible structures
– edges denote adjacency of structures
• Traverse this space looking for high-scoring structures
Search techniques:
– Greedy hill-climbing
– Best-first search
– Simulated annealing
– ...
Heuristic Search (cont.)
• Typical operations on a structure: add an arc, delete an arc, or reverse an arc
[Figure: a four-node network over S, C, E, D and the neighbors reached by one arc addition, deletion, or reversal]
Exploiting Decomposability in Local Search
[Figure: neighboring structures over S, C, E, D, each differing by a single arc]
• Caching: to update the score after a local change, we only need to re-score the families that were changed in the last move
Greedy Hill-Climbing
• Simplest heuristic local search (see the sketch below)
– Start with a given network
• empty network
• best tree
• a random network
– At each iteration
• Evaluate all possible changes
• Apply the change that leads to the best improvement in score
• Reiterate
– Stop when no modification improves the score
• Each step requires evaluating approximately n new changes
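A skeleton of the loop just described; the score function and the neighbor generator (all legal one-edge changes) are left abstract, since the slides do not fix them:

```python
# Greedy hill-climbing over structures. `neighbors(g)` should yield all graphs
# reachable by adding, deleting, or reversing one arc (keeping the graph acyclic);
# `score(g)` is any decomposable structure score.

def greedy_hill_climb(g0, neighbors, score):
    """Climb from g0 until no single-edge change improves the score."""
    current, current_score = g0, score(g0)
    while True:
        best, best_score = None, current_score
        for g in neighbors(current):          # all one-edge changes
            s = score(g)                      # decomposability: in practice,
            if s > best_score:                # only re-score changed families
                best, best_score = g, s
        if best is None:                      # local maximum or plateau
            return current
        current, current_score = best, best_score
```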
Greedy Hill-Climbing: Possible Pitfalls
• Greedy hill-climbing can get stuck in:
– Local maxima:
• All one-edge changes reduce the score
– Plateaus:
• Some one-edge changes leave the score unchanged
• Happens because equivalent networks receive the same score and are neighbors in the search space
• Both occur during structure search
• Standard heuristics can escape both
– Random restarts
– TABU search
Search: Summary
• Discrete optimization problem
• In general, NP-Hard
– Need to resort to heuristic search
– In practice, search is relatively fast (~100 vars in ~10 min):
• Decomposability
• Sufficient statistics
• In some cases, we can reduce the search problem to an
easy optimization problem
– Example: learning trees
Graphical Models Intro Summary
• Representations
– Graphs are a cool way to put constraints on distributions, so that you can say lots of stuff without even looking at the numbers!
• Inference
– GMs let you compute all kinds of different probabilities efficiently
• Learning
– You can even learn them auto-magically!