Conditional Probability
Distributions
Eran Segal
Weizmann Institute
Last Time
Local Markov assumptions – basic BN independencies
d-separation – all independencies via graph structure
G is an I-Map of P if and only if P factorizes over G
I-equivalence – graphs with identical independencies
Minimal I-Map
All distributions have I-Maps (sometimes more than one)
Minimal I-Map does not capture all independencies in P
Perfect Map – not every distribution P has one
PDAGs
Compact representation of I-equivalence graphs
Algorithm for finding PDAGs
CPDs
Thus far we ignored the representation of CPDs
Today we will cover the range of CPD representations
Discrete
Continuous
Sparse
Deterministic
Linear
Table CPDs
Entry for each joint assignment of X and Pa(X)
For each $pa_X$: $\sum_{x \in Val(X)} P(x \mid pa_X) = 1$
Most general representation
Represents every discrete CPD
[Figure: two-node network I → S]
Limitations
Cannot model continuous RVs
Number of parameters exponential in |Pa(X)|
Cannot model large in-degree dependencies
Ignores structure within the CPD
P(I):
  i0     i1
  0.7    0.3

P(S|I):
  I      s0     s1
  i0     0.95   0.05
  i1     0.2    0.8
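The two tables above can be written down directly as nested dictionaries. The sketch below is only an illustration: the helper names (`is_valid_table_cpd`, `joint`, `num_table_entries`) are made up for this example. It checks the row-sum constraint, applies the chain rule for the network I → S, and counts table entries to show the exponential growth in |Pa(X)|.

```python
# Table CPDs copied from the slide: P(I) and P(S | I).
p_i = {"i0": 0.7, "i1": 0.3}
p_s_given_i = {
    "i0": {"s0": 0.95, "s1": 0.05},
    "i1": {"s0": 0.2, "s1": 0.8},
}

def is_valid_table_cpd(cpd, tol=1e-9):
    """Check that every row (one per parent assignment) sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in cpd.values())

def joint(i, s):
    """P(I=i, S=s) by the chain rule for the network I -> S."""
    return p_i[i] * p_s_given_i[i][s]

def num_table_entries(child_card, parent_cards):
    """Entries in a full table CPD: |Val(X)| times the product of |Val(Pa_i)|."""
    n = child_card
    for card in parent_cards:
        n *= card
    return n

# Marginal P(S = s1), obtained by summing the joint over I.
marginal_s1 = sum(joint(i, "s1") for i in p_i)
```

With ten binary parents, `num_table_entries(2, [2] * 10)` already gives 2048 entries for a binary child, which is the limitation that motivates structured CPDs.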
Structured CPDs
Key idea: reduce parameters by modeling P(X|PaX)
without explicitly modeling all entries of the joint
Lose expressive power (cannot represent every CPD)
Deterministic CPDs
There is a function $f: Val(Pa_X) \to Val(X)$ such that
$P(x \mid pa_X) = \begin{cases} 1 & x = f(pa_X) \\ 0 & \text{otherwise} \end{cases}$
Examples
OR, AND, NAND functions
Z = Y+X (continuous variables)
Deterministic CPDs
Replace spurious dependencies with deterministic CPDs
Need to make sure that deterministic CPD is compactly stored
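[Figure: two networks. In the first, T1 and T2 are direct parents of S. In the second, T1 and T2 are parents of a deterministic node T, and T is the only parent of S.]

P(S | T1, T2):
  T1   T2   s0     s1
  t0   t0   0.95   0.05
  t0   t1   0.2    0.8
  t1   t0   0.2    0.8
  t1   t1   0.2    0.8

P(S | T):
  T    s0     s1
  t0   0.95   0.05
  t1   0.2    0.8

P(T | T1, T2), deterministic:
  T1   T2   t0   t1
  t0   t0   1    0
  t0   t1   0    1
  t1   t0   0    1
  t1   t1   0    1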
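A small check can show that introducing the deterministic node T does not change the distribution of S given T1 and T2. The sketch below assumes T is the OR of T1 and T2, as the table for P(T | T1, T2) indicates, and stores that CPD as a function rather than a table. The names `f_or` and `p_s_given_t1_t2` are illustrative only.

```python
from itertools import product

def f_or(t1, t2):
    """Deterministic CPD for T, stored compactly as a function: T = T1 OR T2."""
    return 1 if (t1 == 1 or t2 == 1) else 0

# P(S | T) from the slide, indexed as p_s_given_t[t][s].
p_s_given_t = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}

def p_s_given_t1_t2(s, t1, t2):
    """Sum out T: P(s | t1, t2) = sum_t P(t | t1, t2) * P(s | t)."""
    total = 0.0
    for t in (0, 1):
        p_t = 1.0 if t == f_or(t1, t2) else 0.0  # probability 1 on f(t1, t2)
        total += p_t * p_s_given_t[t][s]
    return total

# Rebuild the four-row table P(S | T1, T2) from the two smaller CPDs.
collapsed = {
    (t1, t2): (p_s_given_t1_t2(0, t1, t2), p_s_given_t1_t2(1, t1, t2))
    for t1, t2 in product((0, 1), repeat=2)
}
```

The rebuilt table has only two distinct rows, which is the repeated structure that the full table for P(S | T1, T2) stores four times.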
Deterministic CPDs
Induce additional conditional independencies
Example: T is any deterministic function of T1,T2
Ind(S1;S2 | T1,T2)
[Figure: network over T1, T2, T, S1, S2]
Deterministic CPDs
Induce additional conditional independencies
Example: C is an XOR deterministic function of A,B
Ind(D;E | B,C)
[Figure: network over A, B, C, D, E]
Deterministic CPDs
Induce additional conditional independencies
Example: T is an OR deterministic function of T1,T2
Ind(S1;S2 | T1=t1)
[Figure: network over T1, T2, T, S1, S2]
Context specific independencies
Context Specific Independencies
Let X,Y,Z be pairwise disjoint RV sets
Let C be a set of variables and $c \in Val(C)$
X and Y are contextually independent given Z and c, denoted
$(X \perp_c Y \mid Z, c)$, if $P(X \mid Y, Z, c) = P(X \mid Z, c)$ whenever $P(Y, Z, c) > 0$
Tree CPDs
[Figure: D with parents A, B, C]

Table CPD P(D | A, B, C) (8 parameters):
  A    B    C    d0    d1
  a0   b0   c0   0.2   0.8
  a0   b0   c1   0.2   0.8
  a0   b1   c0   0.2   0.8
  a0   b1   c1   0.2   0.8
  a1   b0   c0   0.9   0.1
  a1   b0   c1   0.7   0.3
  a1   b1   c0   0.4   0.6
  a1   b1   c1   0.4   0.6

Tree CPD (4 parameters):
  A = a0: (0.2, 0.8)
  A = a1:
    B = b0:
      C = c0: (0.9, 0.1)
      C = c1: (0.7, 0.3)
    B = b1: (0.4, 0.6)
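One possible way to store the tree CPD above is as nested tuples, with a lookup that walks from the root to a leaf. This is a sketch under that assumed encoding (the `"leaf"`/`"split"` tags and the function names are invented here). It expands the tree back into the full eight-row table to confirm both describe the same CPD.

```python
from itertools import product

# Tree CPD for P(D | A, B, C) as nested tuples:
#   leaf     -> ("leaf", (P(d0), P(d1)))
#   internal -> ("split", variable_name, {value: subtree})
tree = ("split", "A", {
    0: ("leaf", (0.2, 0.8)),
    1: ("split", "B", {
        0: ("split", "C", {
            0: ("leaf", (0.9, 0.1)),
            1: ("leaf", (0.7, 0.3)),
        }),
        1: ("leaf", (0.4, 0.6)),
    }),
})

def lookup(node, assignment):
    """Walk from the root to the leaf selected by the parent assignment."""
    while node[0] == "split":
        _, var, children = node
        node = children[assignment[var]]
    return node[1]

def count_leaves(node):
    """Number of leaf distributions, i.e. free parameters for a binary child."""
    if node[0] == "leaf":
        return 1
    return sum(count_leaves(child) for child in node[2].values())

# Expand the tree into the full table: one row per joint parent assignment.
table = {
    (a, b, c): lookup(tree, {"A": a, "B": b, "C": c})
    for a, b, c in product((0, 1), repeat=3)
}
```

Here `len(table)` is 8 while `count_leaves(tree)` is 4, matching the parameter counts on the slide.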
Gene Regulation: Simple Example
[Figure: a regulated gene in three states (State 1, State 2, State 3), controlled by an activator and a repressor; expression of the regulators and of the regulated gene is measured by DNA microarray]
Regulation Tree
Segal et al., Nature Genetics ’03
[Figure: regulation program drawn as a tree that tests activator expression ("Activator?") and repressor expression ("Repressor?") with true/false branches; module genes are shown under State 1, State 2, State 3]
Respiration Module
Segal et al., Nature Genetics ’03
Module genes known targets of predicted regulators?
[Figure: predicted regulators, regulation program, and module genes]
Hap4+Msn4 known to regulate module genes
Rule CPDs
A rule r is a pair (c;p) where c is an assignment to a
subset of variables C and $p \in [0,1]$. Let Scope[r]=C
A rule-based CPD P(X|Pa(X)) is a set of rules R s.t.
For each rule $r \in R$: $Scope[r] \subseteq \{X\} \cup Pa(X)$
For each assignment (x,u) to $\{X\} \cup Pa(X)$ we have exactly
one rule $(c;p) \in R$ such that c is compatible with (x,u).
Then, we have P(X=x | Pa(X)=u) = p
Rule CPDs
Example
Let X be a variable with Pa(X) = {A,B,C}
r1: (a1, b1, x0; 0.1)
r2: (a0, c1, x0; 0.2)
r3: (b0, c0, x0; 0.3)
r4: (a1, b0, c1, x0; 0.4)
r5: (a0, b1, c0; 0.5)
r6: (a1, b1, x1; 0.9)
r7: (a0, c1, x1; 0.8)
r8: (b0, c0, x1; 0.7)
r9: (a1, b0, c1, x1; 0.6)
Note: each assignment maps to exactly one rule
Rules cannot always be represented compactly within
tree CPDs
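The "exactly one compatible rule" condition can be checked mechanically. The sketch below encodes rules r1 to r9 from the example as (context, p) pairs; the dictionary encoding and the names `compatible` and `rule_cpd` are assumptions made for illustration. Note that r5 does not mention X, so it matches both x0 and x1 and assigns 0.5 to each.

```python
from itertools import product

# Rules r1..r9 from the slide as (context, p); a context maps variable -> value.
rules = [
    ({"A": 1, "B": 1, "X": 0}, 0.1),
    ({"A": 0, "C": 1, "X": 0}, 0.2),
    ({"B": 0, "C": 0, "X": 0}, 0.3),
    ({"A": 1, "B": 0, "C": 1, "X": 0}, 0.4),
    ({"A": 0, "B": 1, "C": 0}, 0.5),
    ({"A": 1, "B": 1, "X": 1}, 0.9),
    ({"A": 0, "C": 1, "X": 1}, 0.8),
    ({"B": 0, "C": 0, "X": 1}, 0.7),
    ({"A": 1, "B": 0, "C": 1, "X": 1}, 0.6),
]

def compatible(context, assignment):
    """A rule context is compatible if it agrees on every variable it mentions."""
    return all(assignment[var] == val for var, val in context.items())

def rule_cpd(x, a, b, c):
    """P(X=x | A=a, B=b, C=c), read off the single compatible rule."""
    assignment = {"X": x, "A": a, "B": b, "C": c}
    matches = [p for ctx, p in rules if compatible(ctx, assignment)]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one rule, found {len(matches)}")
    return matches[0]

# For every parent assignment the two probabilities should sum to 1.
row_sums = {
    (a, b, c): rule_cpd(0, a, b, c) + rule_cpd(1, a, b, c)
    for a, b, c in product((0, 1), repeat=3)
}
```

If a rule set were ambiguous or incomplete, `rule_cpd` would raise instead of silently picking a rule, which makes the well-formedness condition explicit.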
Tree CPDs and Rule CPDs
Can represent every discrete function
Can be easily learned and dealt with in inference
But, some functions are not represented compactly
XOR in tree CPDs: cannot split in one step on a0,b1 and a1,b0
Alternative representations exist
Complex logical rules
Context Specific Independencies
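A=a1 ⇒ Ind(D,C | A=a1)
A=a0 ⇒ Ind(D,B | A=a0)
Reasoning by cases implies that Ind(B,C | A,D)

Tree CPD for D:
  A = a0:
    C = c0: (0.9, 0.1)
    C = c1: (0.7, 0.3)
  A = a1:
    B = b0: (0.2, 0.8)
    B = b1: (0.4, 0.6)

[Figure: network over A, B, C, D, shown in general and in the contexts A=a1 and A=a0]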
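The two context-specific independencies can be verified by checking that, within a fixed context for A, the distribution of D does not change with the ignored parent. This sketch assumes the tree reads as follows: in context a0 the tree splits on C with leaves (0.9, 0.1) and (0.7, 0.3), and in context a1 it splits on B with leaves (0.2, 0.8) and (0.4, 0.6). The function names are illustrative.

```python
def p_d(a, b, c):
    """Return (P(d0), P(d1)) given A=a, B=b, C=c, following the tree CPD."""
    if a == 0:
        # Context A=a0: the tree splits only on C.
        return (0.9, 0.1) if c == 0 else (0.7, 0.3)
    # Context A=a1: the tree splits only on B.
    return (0.2, 0.8) if b == 0 else (0.4, 0.6)

def independent_in_context(a, ignored):
    """True if P(D | a, b, c) does not vary with the `ignored` parent ("B" or "C")."""
    for kept in (0, 1):
        if ignored == "C":
            rows = {p_d(a, kept, c) for c in (0, 1)}
        else:
            rows = {p_d(a, b, kept) for b in (0, 1)}
        if len(rows) != 1:
            return False
    return True
```

`independent_in_context(1, "C")` and `independent_in_context(0, "B")` both hold, while swapping the ignored parent in either context does not.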
Independence of Causal Influence
Causes: X1,…Xn
Effect: Y
[Figure: X1, X2, ..., Xn are parents of Y]
General case: Y has a complex dependency on X1,…Xn
Common case
Each Xi influences Y separately
Influence of X1,…Xn is combined to an overall influence on Y
Example 1: Noisy OR
Two independent effects X1, X2
Y=y1 cannot happen unless one of X1, X2 occurs
$P(Y=y^0 \mid X_1=x_1^1, X_2=x_2^1) = P(Y=y^0 \mid X_1=x_1^1, X_2=x_2^0)\, P(Y=y^0 \mid X_1=x_1^0, X_2=x_2^1)$
[Figure: X1 and X2 are parents of Y]

P(Y | X1, X2):
  X1     X2     y0     y1
  x1^0   x2^0   1      0
  x1^0   x2^1   0.2    0.8
  x1^1   x2^0   0.1    0.9
  x1^1   x2^1   0.02   0.98
Noisy OR: Elaborate Representation
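[Figure: X1 → X'1 and X2 → X'2; X'1 and X'2 are parents of Y]

Noisy CPD 1, P(X'1 | X1), noise parameter $\lambda_{X_1} = 0.9$:
  X1     x'1^0   x'1^1
  x1^0   1       0
  x1^1   0.1     0.9

Noisy CPD 2, P(X'2 | X2), noise parameter $\lambda_{X_2} = 0.8$:
  X2     x'2^0   x'2^1
  x2^0   1       0
  x2^1   0.2     0.8

Deterministic OR, P(Y | X'1, X'2):
  X'1     X'2     y0   y1
  x'1^0   x'2^0   1    0
  x'1^0   x'2^1   0    1
  x'1^1   x'2^0   0    1
  x'1^1   x'2^1   0    1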
Noisy OR: Elaborate Representation
Decomposition results in the same distribution
$$
\begin{aligned}
P(Y=y^0 \mid X_1=x_1^1, X_2=x_2^1)
&= \sum_{X'_1, X'_2} P(Y=y^0, X'_1, X'_2 \mid X_1=x_1^1, X_2=x_2^1) \\
&= \sum_{X'_1, X'_2} P(X'_1 \mid X_1=x_1^1, X_2=x_2^1)\, P(X'_2 \mid X'_1, X_1=x_1^1, X_2=x_2^1)\, P(Y=y^0 \mid X'_1, X'_2, X_1=x_1^1, X_2=x_2^1) \\
&= \sum_{X'_1, X'_2} P(X'_1 \mid X_1=x_1^1)\, P(X'_2 \mid X_2=x_2^1)\, P(Y=y^0 \mid X'_1, X'_2) \\
&= 0.1 \cdot 0.2 \cdot 1 + 0.1 \cdot 0.8 \cdot 0 + 0.9 \cdot 0.2 \cdot 0 + 0.9 \cdot 0.8 \cdot 0 \\
&= 0.02
\end{aligned}
$$
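The same sum can be carried out by brute-force enumeration over the hidden variables X'1 and X'2. The sketch below uses the noise parameters 0.9 and 0.8 from the noisy CPDs and a deterministic OR for Y; the function names are illustrative and not from the slides.

```python
from itertools import product

# Noise parameters: P(X'_i = 1 | X_i = 1). When X_i = 0, X'_i is 0 with certainty.
noise = {1: 0.9, 2: 0.8}

def p_xprime(i, xp, x):
    """P(X'_i = xp | X_i = x) for the noisy CPD of parent i."""
    p_on = noise[i] if x == 1 else 0.0
    return p_on if xp == 1 else 1.0 - p_on

def p_y(y, xp1, xp2):
    """Deterministic OR: P(Y = y | X'_1 = xp1, X'_2 = xp2)."""
    return 1.0 if y == int(xp1 or xp2) else 0.0

def noisy_or(y, x1, x2):
    """P(Y = y | X_1 = x1, X_2 = x2), summing out the hidden X'_1 and X'_2."""
    return sum(
        p_xprime(1, xp1, x1) * p_xprime(2, xp2, x2) * p_y(y, xp1, xp2)
        for xp1, xp2 in product((0, 1), repeat=2)
    )
```

Evaluating `noisy_or(0, 1, 1)` reproduces the 0.02 of the derivation up to floating-point rounding, and the other parent assignments reproduce the rows of the noisy OR table.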
Noisy OR: General Case
Y is a binary variable with k binary parents X1,...,Xk
CPD P(Y | X1,...,Xk) is a noisy OR if there are k+1
noise parameters $\lambda_0, \lambda_1, \ldots, \lambda_k$ such that
$P(Y=y^0 \mid X_1, \ldots, X_k) = (1-\lambda_0) \prod_{i: X_i = x_i^1} (1-\lambda_i)$
$P(Y=y^1 \mid X_1, \ldots, X_k) = 1 - (1-\lambda_0) \prod_{i: X_i = x_i^1} (1-\lambda_i)$
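A minimal sketch of the general noisy OR follows, assuming the standard form in which P(Y=y0) is (1 - λ0) times the product of (1 - λi) over the parents that are on. The function name and the `leak` argument (standing for λ0) are illustrative choices.

```python
def noisy_or_p_y1(x, lambdas, leak=0.0):
    """P(Y = y1 | x) for a noisy OR.

    x       -- tuple of 0/1 parent values
    lambdas -- noise parameter lambda_i for each parent
    leak    -- lambda_0, the probability that Y is on with all parents off
    """
    p_y0 = 1.0 - leak
    for xi, lam in zip(x, lambdas):
        if xi == 1:
            p_y0 *= 1.0 - lam  # each active parent independently fails to turn Y on
    return 1.0 - p_y0
```

With `lambdas=(0.9, 0.8)` and no leak this reproduces the two-parent table from the earlier example, using k+1 parameters where a full table over k binary parents would need 2^k rows.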
Noisy OR Independencies
[Figure: X1, X2, ..., Xn → X'1, X'2, ..., X'n → Y]
$\forall i \neq j$: $Ind(X_i \perp X_j \mid Y=y^0)$
Generalized Linear Models
Model is a soft version of a linear threshold function
Example: logistic function
Binary variables X1,...,Xk,Y
$P(Y=y^1 \mid X_1, \ldots, X_k) = \dfrac{1}{1 + \exp\left(-\left(w_0 + \sum_{i=1}^{k} w_i \mathbf{1}(X_i=1)\right)\right)}$
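As a rough illustration, the logistic CPD can be computed as a sigmoid of a weighted count of active parents. The sketch assumes the usual sign convention, P(Y=y1) = 1 / (1 + exp(-z)) with z = w0 + Σ wi·1(Xi=1); the function name and the weights in the usage note are made up.

```python
import math

def logistic_cpd(x, w0, w):
    """P(Y = y1 | x) for binary parents x with bias w0 and weights w."""
    # Only parents with X_i = 1 contribute their weight to the linear term.
    z = w0 + sum(wi for xi, wi in zip(x, w) if xi == 1)
    return 1.0 / (1.0 + math.exp(-z))
```

For example, `logistic_cpd((0, 0), 0.0, (1.0, 2.0))` is 0.5, and turning on a parent with a positive weight raises the probability. Like noisy OR, the CPD needs only k+1 numbers.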
General Formulation
Let Y be a random variable with parents X1,...Xn
The CPD P(Y | X1,...Xn) exhibits independence of
causal influence (ICI) if it can be described by
[Figure: X1, X2, ..., Xn → Z1, Z2, ..., Zn → Z → Y]
The CPD P(Z | Z1,...Zn) is deterministic
Logistic
  Zi = wi 1(Xi=1)
  Z = Σ Zi
  Y = logit(Z)
Noisy OR
  Zi has noise model
  Z is an OR function
  Y is the identity CPD
General Formulation
Key advantage: O(n) parameters
As stated, not all that useful as any complex CPD can
be represented through a complex deterministic CPD
Continuous Variables
One solution: Discretize
Often requires too many value states
Loses domain structure
Other solution: use continuous function for P(X|Pa(X))
Can combine continuous and discrete variables, resulting in
hybrid networks
Inference and learning may become more difficult
Gaussian Density Functions
Among the most common continuous representations
Univariate case:
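$P(X) \sim N(\mu, \sigma^2)$ if
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\dfrac{1}{2}\left(\dfrac{x-\mu}{\sigma}\right)^2\right)$
[Figure: plot of a Gaussian density, x from -4 to 4, density from 0 to 0.4]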
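The univariate density is short enough to write out directly. This is a plain standard-library sketch; the function name `gaussian_pdf` is chosen here for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x; sigma is the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)
```

For the standard normal the peak value `gaussian_pdf(0.0, 0.0, 1.0)` is about 0.399, which agrees with the top of the density axis in the plot.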
Gaussian Density Functions
A multivariate Gaussian distribution over X1,...Xn has
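Mean vector $\mu$
n×n positive definite covariance matrix $\Sigma$
positive definite: $\forall x \in \mathbb{R}^n, x \neq 0: x^T \Sigma x > 0$
Joint density function:
$p(x) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\dfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$
$\mu_i = E[X_i]$
$\Sigma_{ii} = Var[X_i]$
$\Sigma_{ij} = Cov[X_i, X_j] = E[X_i X_j] - E[X_i]E[X_j]$ ($i \neq j$)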
Gaussian Density Functions
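Marginal distributions are easy to compute
$P(X, Y) = N\left(\begin{pmatrix}\mu_X \\ \mu_Y\end{pmatrix}; \begin{pmatrix}\Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY}\end{pmatrix}\right)$
$P(X) = N(\mu_X; \Sigma_{XX})$
Independencies can be determined from parameters
If X=X1,...Xn have a joint normal distribution $N(\mu; \Sigma)$ then
$Ind(X_i \perp X_j)$ iff $\Sigma_{ij} = 0$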
Does not hold in general for non-Gaussian distributions
Linear Gaussian CPDs
Y is a continuous variable with parents X1,...Xn
Y has a linear Gaussian model if it can be described
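using parameters $\beta_0, \ldots, \beta_n$ and $\sigma^2$ such that
$P(Y \mid x_1, \ldots, x_n) = N(\beta_0 + \beta_1 x_1 + \ldots + \beta_n x_n; \sigma^2)$
Vector notation: $P(Y \mid \mathbf{x}) = N(\beta_0 + \boldsymbol{\beta}^T \mathbf{x}; \sigma^2)$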
Pros
Simple
Captures many interesting dependencies
Cons
Fixed variance (variance cannot depend on parents' values)
Linear Gaussian Bayesian Network
A linear Gaussian Bayesian network is a Bayesian
network where
All variables are continuous
All of the CPDs are linear Gaussians
Key result: linear Gaussian models are equivalent to
multivariate Gaussian density functions
Equivalence Theorem
Y is a linear Gaussian of its parents X1,...Xn:
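$P(Y \mid \mathbf{x}) = N(\beta_0 + \boldsymbol{\beta}^T \mathbf{x}; \sigma^2)$
Assume that X1,...Xn are jointly Gaussian with $N(\boldsymbol{\mu}; \Sigma)$
Then:
The marginal distribution of Y is Gaussian with $N(\mu_Y; \sigma_Y^2)$
$\mu_Y = \beta_0 + \boldsymbol{\beta}^T \boldsymbol{\mu}$
$\sigma_Y^2 = \sigma^2 + \boldsymbol{\beta}^T \Sigma \boldsymbol{\beta}$
The joint distribution over {X,Y} is Gaussian where
$Cov[X_i; Y] = \sum_{j=1}^{n} \beta_j \Sigma_{ij}$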
Linear Gaussian BNs define a joint Gaussian distribution
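The theorem gives a recipe for building the joint Gaussian one variable at a time: the mean of Y is β0 + βᵀμ, its variance is σ² + βᵀΣβ, and Cov[Xi; Y] = Σj βj Σij. The sketch below applies that recipe with plain Python lists. The function name and the three-variable chain in the usage lines are illustrative numbers, not taken from the slides.

```python
def add_linear_gaussian_child(mu, cov, beta0, beta, sigma2):
    """Extend a joint Gaussian N(mu; cov) over X with a linear Gaussian child Y.

    Y | x ~ N(beta0 + beta^T x; sigma2). Returns (mean, covariance) of (X, Y).
    """
    n = len(mu)
    mu_y = beta0 + sum(beta[j] * mu[j] for j in range(n))
    # Cov[X_i; Y] = sum_j beta_j * Sigma_ij
    cov_xy = [sum(beta[j] * cov[i][j] for j in range(n)) for i in range(n)]
    # Var[Y] = sigma^2 + beta^T Sigma beta
    var_y = sigma2 + sum(beta[i] * cov_xy[i] for i in range(n))
    new_mu = list(mu) + [mu_y]
    new_cov = [list(cov[i]) + [cov_xy[i]] for i in range(n)] + [cov_xy + [var_y]]
    return new_mu, new_cov

# Hypothetical chain X1 -> X2 -> X3:
#   X1 ~ N(1; 4), X2 | x1 ~ N(-3.5 + 0.5 x1; 4), X3 | x2 ~ N(1 - x2; 3)
mu, cov = [1.0], [[4.0]]
mu, cov = add_linear_gaussian_child(mu, cov, -3.5, [0.5], 4.0)
mu, cov = add_linear_gaussian_child(mu, cov, 1.0, [0.0, -1.0], 3.0)
```

Although X3 has no edge from X1, the resulting covariance matrix has a nonzero entry for the pair (X1, X3), so the joint Gaussian form can be dense even when the network is sparse.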
Converse Equivalence Theorem
If {X,Y} have a joint Gaussian distribution then
$P(Y \mid \mathbf{X}) = N(\beta_0 + \boldsymbol{\beta}^T \mathbf{X}; \sigma^2)$
Implications of equivalence
Joint distribution has compact representation: $O(n^2)$
We can easily transform back and forth between Gaussian
distributions and linear Gaussian Bayesian networks
Representations may differ in parameters
Example: network over X1, X2, ..., Xn
Gaussian distribution has full covariance matrix
Linear Gaussian
Hybrid Models
Models of continuous and discrete variables
Conditional Linear Gaussians
Continuous variables with discrete parents
Discrete variables with continuous parents
Y continuous variable
X = {X1,...,Xn} continuous parents
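U = {U1,...,Um} discrete parents
$\forall \mathbf{u} \in Val(\mathbf{U}): P(Y \mid \mathbf{u}, \mathbf{x}) = N\left(a_{\mathbf{u},0} + \sum_{i=1}^{n} a_{\mathbf{u},i} x_i; \sigma_{\mathbf{u}}^2\right)$
A conditional linear Gaussian (CLG) Bayesian network is one where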
Discrete variables have only discrete parents
Continuous variables have only CLG CPDs
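A CLG CPD can be pictured as a lookup table of linear Gaussians, one per assignment to the discrete parents. The sketch below assumes a single discrete parent with two values and two continuous parents. All parameter values and names are hypothetical, chosen only to illustrate the structure.

```python
# Hypothetical CLG parameters, one entry per discrete assignment u:
#   (a_{u,0}, (a_{u,1}, a_{u,2}), sigma_u^2)
clg_params = {
    "u0": (1.0, (0.5, -1.0), 2.0),
    "u1": (-2.0, (0.0, 3.0), 0.5),
}

def clg_mean_and_variance(u, x):
    """Mean and variance of Y given discrete assignment u and continuous values x."""
    a0, a, sigma2 = clg_params[u]
    mean = a0 + sum(ai * xi for ai, xi in zip(a, x))
    return mean, sigma2
```

Unlike a plain linear Gaussian, both the weights and the variance may differ between `"u0"` and `"u1"`, although within one discrete context the variance still does not depend on x.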
Hybrid Models
Continuous parents for discrete children
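Threshold models
$P(Y=y^1 \mid x) = \begin{cases} 0.9 & x \le 10 \\ 0.05 & \text{otherwise} \end{cases}$
Linear sigmoid
$P(Y=y^1 \mid x_1, \ldots, x_k) = \dfrac{1}{1 + \exp\left(-\left(w_0 + \sum_{i=1}^{k} w_i x_i\right)\right)}$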
Summary: CPD Models
Deterministic functions
Context specific dependencies
Independence of causal influence
Noisy OR
Logistic function
CPDs capture additional domain structure