Learning Bayesian Networks from Data

Nir Friedman (Hebrew U.)
Daphne Koller (Stanford)

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Bayesian Networks
Compact representation of probability distributions via conditional independence

[Figure: the "Alarm" family network: Earthquake and Burglary point to Alarm, Earthquake points to Radio, Alarm points to Call]

Qualitative part: directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence

Quantitative part: set of conditional probability distributions, e.g. P(A | E,B):

  E   B    P(a | E,B)   P(¬a | E,B)
  e   b      0.9           0.1
  e  ¬b      0.2           0.8
 ¬e   b      0.9           0.1
 ¬e  ¬b      0.01          0.99

Together they define a unique distribution in factored form:

P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)

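The factored form above can be evaluated directly as a product of local factors. Below is a minimal Python sketch of that computation; only the P(A | E,B) table comes from the slide, while P(E), P(B), P(R | E), and P(C | A) are hypothetical placeholder numbers added just to make the example runnable.

```python
# Joint probability of the Alarm family as a product of its five local factors:
#   P(B,E,A,C,R) = P(B) P(E) P(A | B,E) P(R | E) P(C | A)

P_E_true = 0.01                      # assumed P(Earthquake = true)
P_B_true = 0.02                      # assumed P(Burglary = true)
P_A_true = {(True, True): 0.9,       # P(Alarm = true | E, B), from the slide's CPT
            (True, False): 0.2,
            (False, True): 0.9,
            (False, False): 0.01}
P_R_true = {True: 0.8, False: 0.05}  # assumed P(Radio = true | E)
P_C_true = {True: 0.7, False: 0.02}  # assumed P(Call = true | A)

def bern(p_true, value):
    """P(X = value) for a binary variable with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, c, r):
    """One entry of the joint distribution, as a product of the five factors."""
    return (bern(P_B_true, b) * bern(P_E_true, e)
            * bern(P_A_true[(e, b)], a)
            * bern(P_R_true[e], r) * bern(P_C_true[a], c))

print(joint(b=True, e=False, a=True, c=True, r=False))
```
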
Example: "ICU Alarm" network
Domain: monitoring intensive-care patients
- 37 variables
- 509 parameters ... instead of 2^54

[Figure: the full ICU Alarm network, with nodes including MINVOLSET, PULMEMBOLUS, PAP, KINKEDTUBE, INTUBATION, SHUNT, VENTMACH, VENTLUNG, DISCONNECT, VENITUBE, PRESS, MINOVL, ANAPHYLAXIS, SAO2, TPR, HYPOVOLEMIA, LVEDVOLUME, CVP, PCWP, LVFAILURE, STROEVOLUME, FIO2, VENTALV, PVSAT, ARTCO2, EXPCO2, INSUFFANESTH, CATECHOL, HISTORY, ERRBLOWOUTPUT, CO, HR, HREKG, ERRCAUTER, HRSAT, HRBP, BP]

Inference
- Posterior probabilities: probability of any event given any evidence
- Most likely explanation: scenario that explains the evidence
- Rational decision making: maximize expected utility; value of information
- Effect of intervention

[Figure: the Earthquake, Burglary, Alarm, Radio, Call network]

Why learning?
Knowledge acquisition bottleneck
- Knowledge acquisition is an expensive process
- Often we don't have an expert
Data is cheap
- Amount of available information growing rapidly
- Learning allows us to construct models from raw data

Why Learn Bayesian Networks?
- Conditional independencies & graphical language capture structure of many real-world distributions
- Graph structure provides much insight into the domain
  - Allows "knowledge discovery"
- Learned model can be used for many tasks
- Supports all the features of probabilistic learning
  - Model selection criteria
  - Dealing with missing data & hidden variables

Learning Bayesian networks

Data + prior information -> Learner -> Bayesian network (structure + CPTs)

[Figure: the learner outputs the E, B, A, R, C network together with its CPTs, e.g. P(a | E,B) = 0.9, 0.7, 0.8, 0.99 for the four parent configurations]

Known Structure, Complete Data

  E, B, A
  <Y,N,N>
  <Y,N,Y>
  <N,N,Y>
  <N,Y,Y>
    .
    .
  <N,Y,Y>

[Figure: the learner is given the fixed E -> A <- B structure with unknown CPT entries ("?") and outputs the same structure with estimated CPTs, e.g. P(a | E,B) = 0.9, 0.7, 0.8, 0.99 for the four parent configurations]

- Network structure is specified
  - Inducer needs to estimate parameters
- Data does not contain missing values

Unknown Structure, Complete Data

  E, B, A
  <Y,N,N>
  <Y,N,Y>
  <N,N,Y>
  <N,Y,Y>
    .
    .
  <N,Y,Y>

[Figure: the learner is given only the data and the CPT skeleton ("?") and outputs both the structure and the estimated CPTs]

- Network structure is not specified
  - Inducer needs to select arcs & estimate parameters
- Data does not contain missing values

Known Structure, Incomplete Data

  E, B, A
  <Y,N,N>
  <Y,?,Y>
  <N,N,Y>
  <N,Y,?>
    .
    .
  <?,Y,Y>

[Figure: the learner is given the fixed E -> A <- B structure with unknown CPT entries ("?") and data containing missing values, and outputs the estimated CPTs]

- Network structure is specified
- Data contains missing values
  - Need to consider assignments to missing values

Unknown Structure, Incomplete Data

  E, B, A
  <Y,N,N>
  <Y,?,Y>
  <N,N,Y>
  <N,Y,?>
    .
    .
  <?,Y,Y>

[Figure: the learner is given only the data, with missing values, and outputs both the structure and the estimated CPTs]

- Network structure is not specified
- Data contains missing values
  - Need to consider assignments to missing values

Overview
- Introduction
- Parameter Estimation
  - Likelihood function
  - Bayesian estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Learning Parameters
Training data has the form:

      | E[1]  B[1]  A[1]  C[1] |
  D = |  ...   ...   ...   ... |
      | E[M]  B[M]  A[M]  C[M] |

[Figure: the E, B, A, C network]

Likelihood Function
Assume i.i.d. samples. The likelihood function is

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)

[Figure: the E, B, A, C network]

Likelihood Function
By the definition of the network, we get

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
         = Π_m P(E[m] : Θ) · P(B[m] : Θ) · P(A[m] | B[m], E[m] : Θ) · P(C[m] | A[m] : Θ)

Likelihood Function
Rewriting terms, we get

L(Θ : D) = Π_m P(E[m], B[m], A[m], C[m] : Θ)
         = [Π_m P(E[m] : Θ)] · [Π_m P(B[m] : Θ)] · [Π_m P(A[m] | B[m], E[m] : Θ)] · [Π_m P(C[m] | A[m] : Θ)]

General Bayesian Networks
Generalizing to any Bayesian network:

L(Θ : D) = Π_m P(x_1[m], ..., x_n[m] : Θ)
         = Π_i Π_m P(x_i[m] | Pa_i[m] : Θ_i)
         = Π_i L_i(Θ_i : D)

Decomposition => independent estimation problems

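Because the likelihood decomposes over families, the log-likelihood is a sum of independent per-family terms. The sketch below illustrates the decomposition on a tiny hypothetical dataset with hand-picked CPT values; the structure of the computation (one term per family) is the point, not the numbers.

```python
import math

# Structure: E and B are roots, A has parents (E, B).  All values are 0/1.
parents = {"E": [], "B": [], "A": ["E", "B"]}

# Hypothetical CPTs: (value, parent-configuration) -> probability
theta = {
    "E": {(1, ()): 0.1, (0, ()): 0.9},
    "B": {(1, ()): 0.2, (0, ()): 0.8},
    "A": {(1, (1, 1)): 0.9,  (0, (1, 1)): 0.1,
          (1, (1, 0)): 0.2,  (0, (1, 0)): 0.8,
          (1, (0, 1)): 0.9,  (0, (0, 1)): 0.1,
          (1, (0, 0)): 0.01, (0, (0, 0)): 0.99},
}

data = [{"E": 1, "B": 0, "A": 0},
        {"E": 1, "B": 0, "A": 1},
        {"E": 0, "B": 0, "A": 1},
        {"E": 0, "B": 1, "A": 1}]

def family_loglik(var):
    """log L_i(theta_i : D): contribution of a single family."""
    return sum(math.log(theta[var][(m[var], tuple(m[p] for p in parents[var]))])
               for m in data)

# Decomposition: total log-likelihood = sum of independent family terms
print(sum(family_loglik(v) for v in parents))
```
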
Likelihood Function: Multinomials

L(θ : D) = P(D | θ) = Π_m P(x[m] | θ)

The likelihood for the sequence H, T, T, H, H is

L(θ : D) = θ · (1 - θ) · (1 - θ) · θ · θ

[Figure: the likelihood L(θ : D) plotted as a function of θ on [0, 1]]

General case:

L(Θ : D) = Π_{k=1..K} θ_k^{N_k}

where N_k is the count of the kth outcome in D and θ_k is the probability of the kth outcome.

Bayesian Inference
Represent uncertainty about parameters using a probability distribution over parameters and data.

Learning using Bayes rule:

P(θ | x[1], ..., x[M]) = P(x[1], ..., x[M] | θ) P(θ) / P(x[1], ..., x[M])

(posterior = likelihood × prior / probability of data)

Bayesian Inference
Represent the Bayesian distribution as a Bayes net:

[Figure: θ with children X[1], X[2], ..., X[m] (the observed data)]

- The values of X are independent given θ
- P(x[m] | θ) = θ
- Bayesian prediction is inference in this network

Bayesian Nets & Bayesian Prediction

[Figure: plate notation with parameters θ_X and θ_Y|X outside the plate and X[m] -> Y[m] inside the plate over m; unrolled, the observed data X[1..M], Y[1..M] and the query X[M+1], Y[M+1]]

- Priors for each parameter group are independent
- Data instances are independent given the unknown parameters

Bayesian Nets & Bayesian Prediction

[Figure: the same network restricted to the observed data X[1..M], Y[1..M] and the parameters θ_X, θ_Y|X]

We can also "read" from the network:
- Complete data => posteriors on parameters are independent
- Can compute the posterior over each parameter separately!

Learning Parameters: Summary

Estimation relies on sufficient statistics
- For multinomials: counts N(x_i, pa_i)

Parameter estimation:

  MLE:                   θ_{x_i | pa_i} = N(x_i, pa_i) / N(pa_i)
  Bayesian (Dirichlet):  θ_{x_i | pa_i} = (α(x_i, pa_i) + N(x_i, pa_i)) / (α(pa_i) + N(pa_i))

- Both are asymptotically equivalent and consistent
- Both can be implemented in an on-line manner by accumulating sufficient statistics

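Both estimators are simple functions of the sufficient statistics. The sketch below computes them for one family; the counts and the symmetric hyperparameter `alpha` are hypothetical.

```python
from collections import Counter

# Sufficient statistics N(x, pa) for a binary variable A with parents (E, B):
# key = (value of A, parent configuration), value = count in the data
N = Counter({(1, (1, 1)): 18, (0, (1, 1)): 2,
             (1, (0, 0)): 1,  (0, (0, 0)): 99})

alpha = 1.0   # hypothetical Dirichlet hyperparameter per CPT entry
K = 2         # number of values A can take

def n_pa(pa):
    """N(pa): total count for a parent configuration."""
    return sum(c for (v, p), c in N.items() if p == pa)

def mle(x, pa):
    """MLE: N(x, pa) / N(pa)"""
    return N[(x, pa)] / n_pa(pa)

def bayes(x, pa):
    """Bayesian (Dirichlet): (alpha(x, pa) + N(x, pa)) / (alpha(pa) + N(pa))"""
    return (alpha + N[(x, pa)]) / (K * alpha + n_pa(pa))

print(mle(1, (1, 1)), bayes(1, (1, 1)))   # 0.9  vs 19/22 = 0.8636...
print(mle(1, (0, 0)), bayes(1, (0, 0)))   # 0.01 vs 2/102 = 0.0196...
```
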
Overview
- Introduction
- Parameter Estimation
- Model Selection
  - Scoring function
  - Structure search
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Why Struggle for Accurate Structure?

[Figure: the true Earthquake / Alarm Set / Burglary / Sound network, flanked by a version missing an arc and a version with an extra arc]

Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure

Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure

Score-based Learning
Define a scoring function that evaluates how well a structure matches the data:

  E, B, A
  <Y,N,N>
  <Y,Y,Y>
  <N,N,Y>
  <N,Y,Y>
    .
    .
  <N,Y,Y>

[Figure: several candidate structures over E, B, A, each assigned a score]

Search for a structure that maximizes the score.

Likelihood Score for Structure

ℓ(G : D) = log L(G : D) = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ]

where I(X_i ; Pa_i^G) is the mutual information between X_i and its parents.

- Larger dependence of X_i on Pa_i => higher score
- Adding arcs always helps: I(X ; Y) ≤ I(X ; {Y, Z})
  - Max score attained by a fully connected network
  - Overfitting: a bad idea...

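The likelihood score can be computed directly from empirical entropies and mutual informations. The following sketch does that for a hypothetical three-variable dataset and one candidate structure; both are made up for illustration.

```python
import math
from collections import Counter

data = [{"E": 1, "B": 0, "A": 1}, {"E": 0, "B": 0, "A": 0},
        {"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0},
        {"E": 1, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0}]
parents = {"E": [], "B": [], "A": ["E", "B"]}   # candidate structure G
M = len(data)

def empirical(vars_):
    """Empirical distribution over a list of variables."""
    counts = Counter(tuple(m[v] for v in vars_) for m in data)
    return [c / M for c in counts.values()]

def entropy(vars_):
    return -sum(p * math.log(p) for p in empirical(vars_))

def mutual_information(x, pa):
    """I(X ; Pa) = H(X) + H(Pa) - H(X, Pa); zero if there are no parents."""
    if not pa:
        return 0.0
    return entropy([x]) + entropy(pa) - entropy([x] + pa)

# Likelihood score:  M * sum_i [ I(X_i ; Pa_i^G) - H(X_i) ]
score = M * sum(mutual_information(x, pa) - entropy([x])
                for x, pa in parents.items())
print(score)   # equals the maximized log-likelihood of G on this dataset
```
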
Bayesian Score

Likelihood score: L(G : D) = P(D | G, θ̂_G), where θ̂_G are the maximum-likelihood parameters.

Bayesian approach: deal with uncertainty by assigning probability to all possibilities.

P(D | G) = ∫ P(D | G, Θ) P(Θ | G) dΘ     (marginal likelihood = ∫ likelihood × prior over parameters)

P(G | D) = P(D | G) P(G) / P(D)

Heuristic Search
Define a search space:
- search states are possible structures
- operators make small changes to structure
Traverse the space looking for high-scoring structures.
Search techniques:
- Greedy hill-climbing
- Best-first search
- Simulated annealing
- ...

Local Search
Start with a given network:
- empty network
- best tree
- a random network
At each iteration:
- Evaluate all possible changes
- Apply the change based on score
Stop when no modification improves the score.

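A skeleton of that loop is sketched below. The `score` function (any decomposable score) and the `is_acyclic` test are assumed to be supplied by the caller; a real implementation would also cache family scores and include edge reversal among the operators.

```python
import itertools

def neighbors(graph, variables):
    """All structures reachable by adding or deleting a single directed edge."""
    for x, y in itertools.permutations(variables, 2):
        edges = set(graph)
        edges.symmetric_difference_update({(x, y)})   # toggle edge x -> y
        yield frozenset(edges)

def hill_climb(variables, score, is_acyclic, start=frozenset()):
    """Greedy hill-climbing: start from `start` (e.g. the empty network) and
    apply the best single-edge change until no modification improves the score."""
    current, current_score = start, score(start)
    while True:
        best, best_score = None, current_score
        for cand in neighbors(current, variables):
            if not is_acyclic(cand):
                continue
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
        if best is None:               # local optimum reached: stop
            return current, current_score
        current, current_score = best, best_score
```
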
Heuristic Search
Typical operations: add an edge, delete an edge, reverse an edge.

[Figure: the S, C, E, D example network under each local operation]

To update the score after a local change, only re-score the families that changed. For example, when an edge addition changes D's parents from {E} to {C, E}:

Δscore = S({C,E} -> D) - S({E} -> D)

Learning in Practice: Alarm domain

[Figure: KL divergence to the true distribution vs. number of samples (0 to 5000), comparing "structure known, fit parameters" against "learn both structure & parameters"]

Local Search: Possible Pitfalls
Local search can get stuck in:
- Local maxima: all one-edge changes reduce the score
- Plateaux: some one-edge changes leave the score unchanged
Standard heuristics can escape both:
- Random restarts
- TABU search
- Simulated annealing

Improved Search: Weight Annealing
Standard annealing process:
- Take bad steps with probability ∝ exp(Δscore / t)
- Probability increases with temperature
Weight annealing:
- Take uphill steps relative to a perturbed score
- Perturbation increases with temperature

[Figure: Score(G | D) as a function of G, together with its perturbed versions]

Perturbing the Score
Perturb the score by reweighting instances:
- Each weight is sampled from a distribution with mean = 1 and variance ∝ temperature
- Instances are still sampled from the "original" distribution
- ... but the perturbation changes the emphasis
Benefit: allows global moves in the search space.

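One way to realize this reweighting is sketched below: per-instance weights drawn from a Gamma distribution with mean 1 and variance equal to the temperature, which are then accumulated as weighted sufficient statistics. The Gamma parameterization is an assumption chosen for illustration, not necessarily the distribution used in the original work.

```python
import random

def sample_weights(num_instances, temperature, rng=random):
    """One weight per instance; mean 1, variance proportional to temperature."""
    if temperature <= 0:
        return [1.0] * num_instances
    shape, scale = 1.0 / temperature, temperature   # mean = 1, variance = temperature
    return [rng.gammavariate(shape, scale) for _ in range(num_instances)]

def weighted_counts(data, weights):
    """Weighted sufficient statistics: each instance contributes its weight."""
    counts = {}
    for row, w in zip(data, weights):
        key = tuple(sorted(row.items()))
        counts[key] = counts.get(key, 0.0) + w
    return counts

weights = sample_weights(num_instances=100, temperature=0.5)
print(sum(weights) / len(weights))   # close to 1 on average
```
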
Weight Annealing: ICU Alarm network
Cumulative performance of 100 runs of annealed structure search.

[Figure: comparison of the true structure with learned parameters, greedy hill-climbing, and annealed search]

Structure Search: Summary
- Discrete optimization problem
- In some cases the optimization problem is easy
  - Example: learning trees
- In general, NP-hard
  - Need to resort to heuristic search
  - In practice, search is relatively fast (~100 vars in ~2-5 min):
    - Decomposability
    - Sufficient statistics
  - Adding randomness to the search is critical

Overview
- Introduction
- Parameter Estimation
- Model Selection
- Structure Discovery
- Incomplete Data
- Learning from Structured Data

Structure Discovery
Task: discover structural properties
- Is there a direct connection between X and Y?
- Does X separate two "subsystems"?
- Does X causally affect Y?
Example: scientific data mining
- Disease properties and symptoms
- Interactions between the expression of genes

Discovering Structure

[Figure: posterior P(G | D) over structures, peaked at a single high-scoring E, R, B, A, C network]

Current practice: model selection
- Pick a single high-scoring model
- Use that model to infer domain structure

Discovering Structure

[Figure: posterior P(G | D) spread over several comparably scoring E, R, B, A, C networks]

Problem:
- Small sample size => many high-scoring models
- An answer based on one model is often useless
- We want features common to many models

Bayesian Approach
The posterior distribution over structures lets us estimate the probability of features:
- Edge X -> Y
- Path X -> ... -> Y
- ...

P(f | D) = Σ_G f(G) P(G | D)

where f(G) is an indicator function for the feature f (e.g., X -> Y) and P(G | D) is given by the Bayesian score of G.

MCMC over Networks
Cannot enumerate structures, so sample structures:

P(f(G) | D) ≈ (1/n) Σ_{i=1..n} f(G_i)

MCMC sampling:
- Define a Markov chain over BNs
- Run the chain to get samples from the posterior P(G | D)

Possible pitfalls:
- Huge (superexponential) number of networks
- Time for the chain to converge to the posterior is unknown
- Islands of high posterior, connected by low bridges

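A minimal Metropolis-Hastings sketch over structures is shown below. It assumes a log-space Bayesian `score` (so that P(G | D) ∝ exp(score(G))) and a symmetric `random_neighbor` proposal, both supplied by the caller, and treats each sampled structure as a set of directed edges so that edge features can be averaged afterwards.

```python
import math
import random

def mcmc_structures(score, random_neighbor, init, n_samples, burn_in=1000):
    """Metropolis-Hastings chain whose stationary distribution is P(G | D)."""
    samples = []
    current, current_score = init, score(init)
    for step in range(burn_in + n_samples):
        proposal = random_neighbor(current)            # e.g. toggle one edge
        proposal_score = score(proposal)
        # accept with probability min(1, exp(score' - score))
        if random.random() < math.exp(min(0.0, proposal_score - current_score)):
            current, current_score = proposal, proposal_score
        if step >= burn_in:
            samples.append(current)
    return samples

def edge_probability(samples, edge):
    """P(f | D) ~ (1/n) sum_i f(G_i) for the feature 'edge is present in G'."""
    return sum(edge in g for g in samples) / len(samples)
```
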
ICU Alarm BN: No Mixing
500 instances:

[Figure: score of the current sample vs. MCMC iteration for two runs, one started from the empty network and one from a greedy solution; the scores stay separated, roughly between -9400 and -8400]

The runs clearly do not mix.

Effects of Non-Mixing
Two MCMC runs over the same 500 instances; probability estimates for edges from the two runs.

[Figure: scatter plots of edge-probability estimates against the true BN, for a run started at the true BN and a run from a random start]

Probability estimates are highly variable, non-robust.

Fixed Ordering
Suppose that we know the ordering of the variables
- say, X1 > X2 > X3 > X4 > ... > Xn: parents of Xi must be in X1, ..., Xi-1
- limit the number of parents per node to k
=> about 2^(k·n·log n) networks are consistent with the ordering

Intuition: the order decouples the choice of parents
- The choice of Pa(X7) does not restrict the choice of Pa(X12)

Upshot: can compute efficiently in closed form (≺ denotes the ordering)
- Likelihood P(D | ≺)
- Feature probability P(f | D, ≺)

Our Approach: Sample Orderings
We can write

P(f | D) = Σ_≺ P(f | ≺, D) P(≺ | D)

Sample orderings and approximate

P(f | D) ≈ (1/n) Σ_{i=1..n} P(f | ≺_i, D)

MCMC sampling:
- Define a Markov chain over orderings
- Run the chain to get samples from the posterior P(≺ | D)

Mixing with MCMC-Orderings
4
runs on ICU-Alarm with 500 instances
fewer iterations than MCMC-Nets
approximately same amount of computation
-8400
Score of
cuurent sample
score
-8405
-8410
-8415
-8420
-8425
-8430
-8435
-8440
-8445
random
greedy
-8450
0
10000
20000
30000
40000
50000
60000
iteration
MCMC
Iteration
Process appears to be mixing!
49
Mixing of MCMC runs
Two MCMC runs over the same instances; probability estimates for edges.

[Figure: scatter plots of edge-probability estimates from the two runs, for 500 and 1000 instances]

Probability estimates are very robust.

Application to Gene Array Analysis

Chips and Features
[Figure: microarray chips and the features extracted from them]

Application to Gene Array Analysis
- See www.cs.tau.ac.il/~nin/BioInfo04
- Bayesian inference: http://www.cs.huji.ac.il/~nir/Papers/FLNP1Full.pdf

Application: Gene expression
Input: measurements of gene expression under different conditions
- Thousands of genes
- Hundreds of experiments
Output: models of gene interaction
- Uncover pathways

Map of Feature Confidence
Yeast data [Hughes et al 2000]
- 600 genes
- 300 experiments

[Figure: map of feature confidence over the yeast data]

"Mating response" Substructure

[Figure: sub-network over genes including KAR4, SST2, TEC1, NDJ1, KSS1, YLR343W, YLR334C, MFA1, STE6, FUS1, PRM1, AGA1, AGA2, TOM6, FIG1, FUS3, YEL059W]

- Automatically constructed sub-network of high-confidence edges
- Almost exact reconstruction of the yeast mating pathway

Summary of the course
- Bayesian learning: the idea of treating model parameters as variables with a prior distribution
- PAC learning: assigning confidence and accepted error to the learning problem and analyzing polynomial learning time
- Boosting and bagging: using the data to estimate multiple models and fuse them

Summary of the course
- Hidden Markov Models: the Markov property is widely assumed; hidden Markov models are very powerful and easily estimated
- Model selection and validation: crucial for any type of modeling!
- (Artificial) neural networks: the brain performs computations radically differently from modern computers, and often much better; we need to learn how
  - ANNs are a powerful modeling tool (BP, RBF)

Summary of the course
- Evolutionary learning: a very different learning rule; has its merits
- VC dimensionality: a powerful theoretical tool for characterizing solvable problems, but difficult to use in practice
- Support Vector Machines: clean theory, different from classical statistics in that it looks for simple estimators in high dimension rather than reducing dimension
- Bayesian networks: a compact way to represent conditional dependencies between variables

Final project

Probabilistic Relational Models
Key ideas:
- Universals: probabilistic patterns hold for all objects in a class
- Locality: represent direct probabilistic dependencies
  - Links give us potential interactions!

[Figure: PRM schema with Professor (Teaching-Ability), Student (Intelligence), Course (Difficulty), and Reg (Grade, Satisfaction); a bar chart shows the grade distribution (A/B/C) for the combinations easy,weak / easy,smart / hard,weak / hard,smart]

PRM Semantics
Instantiated PRM => BN
- variables: attributes of all objects
- dependencies: determined by links & PRM

[Figure: the instantiated network for Prof. Jones and Prof. Smith (Teaching-ability), courses Geo101 and CS101 (Difficulty), and students Forrest Gump and Jane Doe (Intelligence), with Grade and Satisfaction nodes for each registration sharing the CPD θ_{Grade | Intelligence, Difficulty}]

The Web of Influence
- Objects are all correlated
- Need to perform inference over the entire model
- For large databases, use approximate inference: loopy belief propagation

[Figure: inferred posteriors over easy/hard, weak/smart, and grades in the ground network for CS101 and Geo101]

PRM Learning: Complete Data
- Introduce a prior over parameters
- Update the prior with sufficient statistics, e.g.
  Count(Reg.Grade=A, Reg.Course.Diff=lo, Reg.Student.Intel=hi)
- The entire database is a single "instance"
- Parameters are used many times in the instance

[Figure: the fully observed instantiation: Prof. Jones (Low) and Prof. Smith (High) teaching ability, course difficulties for Geo101 and CS101, student intelligences, and a grade/satisfaction pair for each registration, all tied to the shared CPD θ_{Grade | Intelligence, Difficulty}]

PRM Learning: Incomplete Data
- Use expected sufficient statistics
- But everything is correlated:
  - the E-step uses (approximate) inference over the entire model

[Figure: the same instantiation with many attribute values missing ("???")]

Example: Binomial Data
Prior: uniform for θ in [0, 1]
=> P(θ | D) ∝ the likelihood L(θ : D):

P(θ | x[1], ..., x[M]) ∝ P(x[1], ..., x[M] | θ) · P(θ)

For (N_H, N_T) = (4, 1):
- MLE for P(X = H) is 4/5 = 0.8
- Bayesian prediction is
  P(x[M+1] = H | D) = ∫ θ · P(θ | D) dθ = 5/7 ≈ 0.7142

[Figure: the posterior P(θ | D) plotted over θ in [0, 1]]

Dirichlet Priors
Recall that the likelihood function is

L(Θ : D) = Π_{k=1..K} θ_k^{N_k}

A Dirichlet prior with hyperparameters α_1, ..., α_K has the form

P(Θ) ∝ Π_{k=1..K} θ_k^{α_k - 1}

Then the posterior has the same form, with hyperparameters α_1 + N_1, ..., α_K + N_K:

P(Θ | D) ∝ P(Θ) P(D | Θ) ∝ Π_k θ_k^{α_k - 1} · Π_k θ_k^{N_k} = Π_k θ_k^{α_k + N_k - 1}

Dirichlet Priors - Example

[Figure: Dirichlet(α_heads, α_tails) densities over P(heads) on [0, 1] for Dirichlet(0.5,0.5), Dirichlet(1,1), Dirichlet(2,2), and Dirichlet(5,5)]

Dirichlet Priors (cont.)
If P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K, then

P(X[1] = k) = ∫ θ_k P(Θ) dΘ = α_k / Σ_ℓ α_ℓ

Since the posterior is also Dirichlet, we get

P(X[M+1] = k | D) = ∫ θ_k P(Θ | D) dΘ = (α_k + N_k) / Σ_ℓ (α_ℓ + N_ℓ)

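The predictive rule is a one-liner. The sketch below implements it and, as a usage check, reproduces the earlier binomial slide: a uniform Dirichlet(1,1) prior with counts (N_H, N_T) = (4,1) gives P(H) = 5/7.

```python
def dirichlet_predictive(alpha, counts):
    """P(X[M+1] = k | D) = (alpha_k + N_k) / sum_l (alpha_l + N_l)."""
    total = sum(a + n for a, n in zip(alpha, counts))
    return [(a + n) / total for a, n in zip(alpha, counts)]

# Uniform prior Dirichlet(1, 1), data with (N_H, N_T) = (4, 1):
print(dirichlet_predictive(alpha=[1, 1], counts=[4, 1]))   # [0.714..., 0.285...]
```
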
Learning Parameters: Case Study

[Figure: KL divergence to the true distribution vs. number of instances (0 to 5000) sampled from the ICU Alarm network, comparing MLE against Bayesian estimation with prior strengths M' = 5, 20, 50, where M' is the strength of the prior]

Marginal Likelihood: Multinomials
Fortunately, in many cases the integral has a closed form:
- P(Θ) is Dirichlet with hyperparameters α_1, ..., α_K
- D is a dataset with sufficient statistics N_1, ..., N_K

Then

P(D) = [ Γ(Σ_k α_k) / Γ(Σ_k (α_k + N_k)) ] · Π_k [ Γ(α_k + N_k) / Γ(α_k) ]

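This closed form is easy to transcribe in log space; the sketch below uses scipy's gammaln for numerical stability. The usage line reuses the earlier binomial example (uniform prior, counts 4 and 1) and is otherwise arbitrary.

```python
from scipy.special import gammaln

def log_marginal_likelihood(alpha, counts):
    """log P(D) = log G(sum a) - log G(sum a + sum N)
                  + sum_k [ log G(a_k + N_k) - log G(a_k) ]   (G = Gamma function)"""
    a_sum, n_sum = sum(alpha), sum(counts)
    return (gammaln(a_sum) - gammaln(a_sum + n_sum)
            + sum(gammaln(a + n) - gammaln(a) for a, n in zip(alpha, counts)))

print(log_marginal_likelihood(alpha=[1.0, 1.0], counts=[4, 1]))   # log(1/30)
```
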
Marginal Likelihood: Bayesian Networks
The network structure determines the form of the marginal likelihood. Data:

  m:  1  2  3  4  5  6  7
  X:  H  T  T  H  T  H  H
  Y:  H  T  H  H  T  T  H

Network 1 (X and Y disconnected): two Dirichlet marginal likelihoods
- an integral over θ_X
- an integral over θ_Y

Marginal Likelihood: Bayesian Networks
The network structure determines the form of the marginal likelihood. Same data:

  m:  1  2  3  4  5  6  7
  X:  H  T  T  H  T  H  H
  Y:  H  T  H  H  T  T  H

Network 2 (X -> Y): three Dirichlet marginal likelihoods
- an integral over θ_X
- an integral over θ_{Y|X=H}
- an integral over θ_{Y|X=T}

Marginal Likelihood for Networks
The marginal likelihood is a product of Dirichlet marginal likelihoods, one for each multinomial P(X_i | pa_i):

P(D | G) = Π_i Π_{pa_i^G} [ Γ(α(pa_i^G)) / Γ(α(pa_i^G) + N(pa_i^G)) ] · Π_{x_i} [ Γ(α(x_i, pa_i^G) + N(x_i, pa_i^G)) / Γ(α(x_i, pa_i^G)) ]

- N(..) are counts from the data
- α(..) are hyperparameters for each family given G

Bayesian Score: Asymptotic Behavior

log P(D | G) = ℓ(G : D) - (log M / 2) · dim(G) + O(1)
             = M Σ_i [ I(X_i ; Pa_i^G) - H(X_i) ] - (log M / 2) · dim(G) + O(1)

The first term fits the dependencies in the empirical distribution; the second is a complexity penalty.

- As M (the amount of data) grows:
  - increasing pressure to fit dependencies in the distribution
  - the complexity term avoids fitting noise
- Asymptotic equivalence to the MDL score
- The Bayesian score is consistent: the observed data eventually overrides the prior

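The asymptotic form is essentially the BIC/MDL score, which is a one-line function once the maximized log-likelihood and the number of free parameters are available. Both inputs below are assumed to come from elsewhere, and the usage numbers are made up.

```python
import math

def bic_score(loglik, dim_g, num_instances):
    """log P(D | G) ~ loglik(G : D) - (log M / 2) * dim(G):
    a fit term minus a complexity penalty that grows with log M."""
    return loglik - (math.log(num_instances) / 2.0) * dim_g

# Hypothetical numbers: 509 free parameters, 5000 instances,
# maximized log-likelihood of -47000.
print(bic_score(loglik=-47000.0, dim_g=509, num_instances=5000))
```
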
Structure Search as Optimization
Input:
- Training data
- Scoring function
- Set of possible structures
Output:
- A network that maximizes the score

Key computational property, decomposability:
score(G) = Σ score(family of X in G)

Tree-Structured Networks
Trees: at most one parent per variable

Why trees?
- Elegant math: we can solve the optimization problem
- Sparse parameterization: avoids overfitting

[Figure: a tree-structured version of the ICU Alarm network]

Learning Trees
Let p(i) denote the parent of X_i. We can write the Bayesian score as

Score(G : D) = Σ_i Score(X_i : Pa_i)
             = Σ_i [ Score(X_i : X_{p(i)}) - Score(X_i) ] + Σ_i Score(X_i)

where the first sum is the improvement over the "empty" network and the second is the score of the "empty" network.

=> Score = sum of edge scores + constant

Learning Trees
- Set w(j -> i) = Score(X_j -> X_i) - Score(X_i)
- Find the tree (or forest) with maximal weight
  - Standard max spanning tree algorithm: O(n² log n)

Theorem: This procedure finds the tree with the maximum score.

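A sketch of the spanning-tree step (Kruskal's algorithm with union-find) is given below. The edge-weight function is assumed to be symmetric, e.g. w(j, i) = Score(X_j -> X_i) - Score(X_i); for the likelihood score these weights reduce to mutual information, which is the classic Chow-Liu algorithm.

```python
def max_spanning_tree(variables, weight):
    """variables: list of variable names; weight(a, b): symmetric edge weight.
    Returns a maximum-weight forest as a set of undirected edges (a, b)."""
    parent = {v: v for v in variables}

    def find(v):                       # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = sorted(((weight(a, b), a, b)
                    for i, a in enumerate(variables)
                    for b in variables[i + 1:]), reverse=True)
    tree = set()
    for w, a, b in edges:
        if w <= 0:                     # keep only beneficial edges (may yield a forest)
            break
        ra, rb = find(a), find(b)
        if ra != rb:                   # adding the edge keeps the graph acyclic
            parent[ra] = rb
            tree.add((a, b))
    return tree
```
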
Beyond Trees
When we consider more complex networks, the problem is not as easy:
- Suppose we allow at most two parents per node
- A greedy algorithm is no longer guaranteed to find the optimal network
- In fact, no efficient algorithm exists

Theorem: Finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1.