The Exploration of Greedy Hill-climbing Search in
Markov Equivalence Class Space
Huijuan Xu¹, Hua Yu¹, Juyun Wang², Jinke Jiang¹
¹ College of Engineering, Graduate University of Chinese Academy of Sciences, Beijing, China
² College of Science, Communication University of China, Beijing, China
Abstract - Greedy Hill-climbing search in the Markov Equivalence Class space (E-space) can overcome the drawback of falling into a local maximum in the Directed Acyclic Graph space (DAG space) caused by the score equivalence property of the Bayesian scoring function. One representative algorithm is the Greedy Equivalence Search algorithm (GES algorithm), which is inclusion optimal in the limit of large sample size but not parameter optimal. In fact, the GES algorithm does not comply with the Inclusion Boundary Condition, which guarantees attaining the highest score, whereas the unrestricted form of the GES algorithm (UGES algorithm) complies with the Inclusion Boundary Condition approximately. However, greedy Hill-climbing search in both the DAG space and the E-space is time-consuming. Confining the search with a constraint-based method is a good remedy for this drawback. This paper conducts experiments that compare the greedy Hill-climbing search algorithm in DAG space (GS algorithm), the GES algorithm, and the UGES algorithm, both without and with the restriction to the parents and children sets, and finds that GS/GES/UGES with the restriction achieve improvements in time-efficiency and structural difference, with a small reduction in the Bayesian score.
Keywords: Inclusion Boundary Condition, GES
algorithm, UGES algorithm, MMHC algorithm, local
discovery algorithms
1 Introduction
In the field of Bayesian networks (BN), BN structure learning is a research hotspot. There are two main categories of BN structure learning algorithms, namely the constraint-based method and the search-and-score method. The constraint-based method[1][2] first uses conditional independence (CI) tests to determine the skeleton of the BN structure, and then directs the skeleton using directional rules. The CI test has several forms, such as Pearson's χ² test, the G² likelihood ratio test, and mutual information.
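As an illustration of this kind of test, the following is a minimal sketch in Python of the G² likelihood ratio test of X independent of Y given a conditioning set, assuming discrete data in a pandas DataFrame; `g2_ci_test` is a hypothetical helper written for this paper's exposition, not a function of the packages used in our experiments (which run in Matlab).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def g2_ci_test(data: pd.DataFrame, x: str, y: str, cond: list) -> float:
    """Return the p-value of the G^2 likelihood ratio test of X independent of Y given cond."""
    g2, dof = 0.0, 0
    # Test independence of x and y separately inside every configuration of the conditioning set.
    strata = [(None, data)] if not cond else data.groupby(cond)
    for _, block in strata:
        table = pd.crosstab(block[x], block[y]).to_numpy(dtype=float)
        if table.size == 0:
            continue
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
        observed = table > 0                       # avoid log(0) for empty cells
        g2 += 2.0 * np.sum(table[observed] * np.log(table[observed] / expected[observed]))
        dof += (table.shape[0] - 1) * (table.shape[1] - 1)
    return 1.0 if dof == 0 else float(chi2.sf(g2, dof))
```

A small p-value is taken as evidence of conditional dependence between X and Y given the conditioning set.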
The search-and-score method includes two typical algorithms, namely the K2 algorithm[3] and the greedy Hill-climbing algorithm. The K2 algorithm finds the parents of each node among the nodes that precede it in an initial node ordering, and builds the Bayesian network structure gradually. The initial node ordering can be determined by the method of paper [4]: build the maximum weight spanning tree (MWST) using mutual information, and then use the topological order of the oriented MWST as the initial node ordering.
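A sketch of this initialisation, assuming the `networkx` library and discrete data in a pandas DataFrame (the function name `mwst_order` and the choice of root are ours, purely for illustration):

```python
import itertools

import networkx as nx
import numpy as np
import pandas as pd


def mwst_order(data: pd.DataFrame, root: str) -> list:
    """Topological order of the MWST built from pairwise mutual information."""
    g = nx.Graph()
    g.add_nodes_from(data.columns)
    for a, b in itertools.combinations(data.columns, 2):
        joint = pd.crosstab(data[a], data[b], normalize=True).to_numpy()
        px = joint.sum(axis=1, keepdims=True)          # marginal of a
        py = joint.sum(axis=0, keepdims=True)          # marginal of b
        nonzero = joint > 0
        mi = float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (px @ py)[nonzero])))
        g.add_edge(a, b, weight=mi)
    tree = nx.maximum_spanning_tree(g, weight="weight")
    oriented = nx.bfs_tree(tree, root)                 # orient all edges away from the root
    return list(nx.topological_sort(oriented))
```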
The greedy Hill-climbing algorithm in the DAG space (GS algorithm) takes an initial graph, defines a neighborhood, computes a score for every graph in this neighborhood, and chooses the graph that maximizes the score for the next iteration, until the score no longer improves between two consecutive iterations. The neighborhood is defined as the set of graphs that differ from the current graph by a single edge insertion, reversal, or deletion. The initial graph can be chosen as the oriented MWST. The GS algorithm has several drawbacks. First, by the score equivalence property of the Bayesian scoring function, the directed acyclic graphs in the same equivalence class have the same score, so if two adjacent iterations stay within the same equivalence class, the GS algorithm falls into a local maximum. Second, if ties among graphs with the highest score in the neighborhood are broken by random selection, the learnt BN structures fluctuate across multiple runs of the GS algorithm. Third, the GS algorithm becomes increasingly time-consuming as the number of variables grows.
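The loop itself can be sketched as follows (a schematic Python fragment; `score` and `neighbors` stand for a decomposable scoring function such as BDeu and a generator of all acyclic graphs one edge edit away; both names are placeholders, not taken from the toolboxes used in our experiments):

```python
def greedy_hill_climb(initial_dag, neighbors, score):
    """Greedy search in DAG space: move to the best-scoring neighbour until no move improves."""
    current, current_score = initial_dag, score(initial_dag)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(current):     # one edge inserted, deleted or reversed
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:                         # local maximum reached
            return current
        current, current_score = best, best_score
```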
The greedy Hill-climbing search in the Markov Equivalence Class space overcomes the drawback of falling into a local maximum caused by the score equivalence property of the Bayesian scoring function, and reduces the variability of the learnt BN structures. A state-of-the-art algorithm for greedy Hill-climbing search in the E-space is the GES algorithm[5]. The GES algorithm starts from the empty graph, first adds edges until the score reaches a local maximum, then deletes edges until the score reaches another local maximum, and finally returns that equivalence class as the solution. Paper [6] proves that in the limit of large sample size, the GES algorithm identifies an inclusion-optimal equivalence class of DAG models. However, GES may fail to identify a parameter-optimal model even in the limit of large sample size; that is, the identified model need not have the fewest parameters among all the BNs that include the distribution. Then, by the consistency property of the Bayesian scoring function, the score of the equivalence class finally identified by the GES algorithm is not necessarily the highest.
Paper [7] defines the Inclusion Boundary Condition based on the Inclusion Order provided by Meek's conjecture, and paper [8] proves that a hill-climbing algorithm using a penalized score function and a traversal operator satisfying the Inclusion Boundary Condition always finds the faithful model with the highest score. In fact, the GES algorithm does not comply with the Inclusion Boundary Condition and therefore has no guarantee of attaining the highest score. The UGES algorithm considers both edge addition and edge deletion at each step and complies with the Inclusion Boundary Condition approximately. It is an unrestricted form of the GES algorithm and may attain a higher score than the GES algorithm. Both the GES algorithm and the UGES algorithm are time-consuming.
Regarding the time-consumption drawback, some algorithms restrict the maximum number of parents allowed for every node in the network, like the Sparse Candidate algorithm[9][10]. But this is not a sound solution: first, it is hard to set the value of the maximum number of parents, and second, this parameter imposes a uniform constraint on every node in the network. If we instead use the constraint-based method to find the set of parents and children of each node, and restrict the greedy search to the parents and children set of each node, neither problem arises. The Max-Min Hill-climbing algorithm (MMHC algorithm)[11] is one such BN structure learning algorithm. It first uses the Max-Min Parents and Children algorithm (MMPC algorithm)[12] to find the set of parents and children of each node, and then applies the GS algorithm within the parents and children set of each node. It has been shown that the MMHC algorithm yields computational savings and performs better than many existing algorithms.
This paper conducts experiments that compare the GES algorithm and the UGES algorithm in terms of Bayesian score, Structural Hamming Distance (SHD)[11], and number of calls to the scoring function, with the GS algorithm as a reference. In addition, we use the parents and children sets learnt by the MMPC algorithm to restrict the GES algorithm and the UGES algorithm in order to improve their time-efficiency, and we conduct experiments that compare the performance of the GES algorithm and the UGES algorithm under this restriction, with the MMHC algorithm (MMPC+GS) as a reference. We use data sampled from the ALARM network[13]. The experiments are implemented in the Matlab environment with the Bayes Net Toolbox[14], the BNT Structure Learning package[15], and the Causal Explorer package[16].
This paper is structured as follows. Section 2 analyzes the GES algorithm and the UGES algorithm using the E-space, the scoring function, and the Inclusion Boundary Condition. Section 3 introduces the local discovery algorithm MMPC and its combination with the GES algorithm and the UGES algorithm. Experimental results and analysis are shown in Section 4. Finally, Section 5 presents some conclusions.
2 The greedy Hill-climbing search in the Markov Equivalence Class space
2.1. The Markov Equivalence Class space and the scoring function
The Markov Equivalence Class space (E-space) is
divided into Markov equivalence classes. The DAGs in
the same Markov equivalence class are equivalent.
Theorem 1 Two DAGs are equivalent if and only
if they have the same skeletons and the same
v-structures[17].
The skeleton of any DAG is the undirected graph that results from ignoring the directionality of every edge. A v-structure in a DAG G is an ordered triple of nodes (X, Y, Z) such that (1) G contains the edges X → Y and Z → Y, and (2) X and Z are not adjacent in G.
The equivalence classes of DAGs are represented with completed partially directed acyclic graphs (CPDAGs). A CPDAG is a graph that contains both directed and undirected edges. The directed edges represent the compelled edges and the undirected edges represent the reversible edges. A compelled edge is an edge that has the same orientation in every member of the equivalence class; a reversible edge is an edge that is not compelled.
In a CPDAG or PDAG, a pair of nodes are neighbors if they are connected by an undirected edge, and they are adjacent if they are connected by either an undirected edge or a directed edge.
The DAGs in the same equivalence class receive the same score if the scoring function has the score equivalence property. In general, a scoring function may have three properties:
Property 1 The scoring function is decomposable if it can be written as a sum of measures, each of which is a function only of one node and its parents.
Property 2 The scoring function is score equivalent if the DAGs in the same equivalence class receive the same score.
Property 3 The scoring function is consistent if, in the limit of large sample size, the following conditions hold: (1) if H contains the distribution and G does not, then the score of H is larger than the score of G; (2) if H and G both contain the distribution, and H contains fewer parameters than G, then the score of H is larger than the score of G.
The Bayesian Dirichlet (BD) scoring function is one of the commonly used scoring functions. The BDe scoring function is the BD scoring function that satisfies the score equivalence property, and the BDeu scoring function is the BDe scoring function whose parameter prior has a uniform mean. The form of the BDeu scoring function is as follows:
$$\log p(G, D \mid \xi) = \log p(G \mid \xi) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})} \right] \qquad (1)$$

where $N'_{ijk} = N'/(r_i q_i)$, $q_i$ denotes the number of configurations of the parent set $\pi(x_i)$, $r_i$ denotes the number of states of variable $x_i$, $N_{ijk}$ is the number of records in dataset $D$ for which $x_i = k$ and $\pi(x_i)$ is in the $j$th configuration, $N_{ij} = \sum_k N_{ijk}$, and $N'_{ij} = \sum_k N'_{ijk}$. $\Gamma(\cdot)$ is the Gamma function, which satisfies $\Gamma(y+1) = y\,\Gamma(y)$ and $\Gamma(1) = 1$.
The components $N'_{ijk}$ and $p(G \mid \xi)$ specify the prior knowledge. In the BDeu scoring function, the parameter prior $N'_{ijk} = N'/(r_i q_i)$ corresponds to a uniform joint distribution, where $N'$ is the equivalent sample size, usually set to ten in experiments. $p(G \mid \xi)$ is the network structure prior; its assessment is discussed in detail in paper [18], and we set the network structure prior $p(G \mid \xi)$ to one in this paper. From Eq. (1), we can see that the BDeu scoring function is decomposable. Besides, paper [18] shows that the BDeu scoring function is score equivalent, and paper [5] shows that it is consistent. So the BDeu scoring function satisfies the three properties of scoring functions mentioned above.
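Because the BDeu score is decomposable, it can be evaluated one family (a node together with its parent set) at a time. The following Python sketch computes the local term of Eq. (1) from a q_i × r_i matrix of counts N_ijk, with N' = 10 as in our experiments; it is an illustration of the formula, not the implementation found in the BNT packages.

```python
import numpy as np
from scipy.special import gammaln   # log of the Gamma function, avoids numerical overflow


def bdeu_local_score(counts: np.ndarray, ess: float = 10.0) -> float:
    """Local BDeu term for one node: counts[j, k] = N_ijk, ess = N' (equivalent sample size)."""
    q_i, r_i = counts.shape
    n_prime_ijk = ess / (r_i * q_i)      # uniform prior: N'_ijk = N' / (r_i * q_i)
    n_prime_ij = ess / q_i               # N'_ij = sum_k N'_ijk
    n_ij = counts.sum(axis=1)            # N_ij = sum_k N_ijk
    score = np.sum(gammaln(n_prime_ij) - gammaln(n_prime_ij + n_ij))
    score += np.sum(gammaln(n_prime_ijk + counts) - gammaln(n_prime_ijk))
    return float(score)
```

Summing this local term over all nodes, and adding log p(G | ξ) = 0 for the uniform structure prior used here, gives the full score of a DAG.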
2.2. Greedy Equivalence Search algorithm
The Greedy Equivalence Search algorithm (GES algorithm) starts from the empty graph, first adds
edges until the scoring value reaches a local maximum,
then deletes edges until the scoring value reaches
another local maximum, and finally returns that
equivalence class as the solution. The GES algorithm
searches through the E-space and the equivalence
classes in the E-space are represented with CPDAGs.
The paper [5] proves Meek’s conjecture, and shows that
the local maximum reached in the first phase of the
algorithm contains the generative distribution and the
final equivalence class that results from the GES
algorithm is asymptotically a perfect map of the
generative distribution.
There are two kinds of operators used to construct the neighborhood in the GES algorithm, namely the Insert operator in the first phase and the Delete operator in the second phase. In order to compute the scores of the PDAGs in the neighborhood after applying the operators, each PDAG needs to be converted to a DAG, and paper [19] gives a simple implementation of the PDAG-To-DAG algorithm. Besides, the GES algorithm needs the DAG-To-CPDAG algorithm[20] to convert the finally chosen DAG back to a CPDAG so that the greedy search can continue. The two kinds of operators are defined as follows.
Definition 1 Insert(X, Y, T)[5]
For non-adjacent nodes X and Y in the CPDAG P^c, and for any subset T of the neighbors of Y that are not adjacent to X, the Insert(X, Y, T) operator modifies P^c by (1) inserting the directed edge X → Y, and (2) for each T in the subset, directing the previously undirected edge between T and Y as T → Y.
Definition 2 Delete(X, Y, H)[5]
For adjacent nodes X and Y in the CPDAG P^c connected either as X − Y or X → Y, and for any subset H of the neighbors of Y that are adjacent to X, the Delete(X, Y, H) operator modifies P^c by deleting the edge between X and Y, and for each H in the subset, (1) directing the previously undirected edge between Y and H as Y → H and (2) directing any previously undirected edge between X and H as X → H.
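The two operators can be sketched on a simple data structure as follows (Python, with the CPDAG stored as a dict mapping each node to the set of nodes it points to, an undirected edge being stored in both directions; this encoding is ours for illustration only, and the validity conditions discussed below are assumed to have been checked beforehand):

```python
from copy import deepcopy


def apply_insert(cpdag: dict, x, y, t_subset) -> dict:
    """Insert(X, Y, T): add X -> Y and direct every previously undirected T - Y as T -> Y."""
    p = deepcopy(cpdag)
    p[x].add(y)                  # (1) insert the directed edge X -> Y
    for t in t_subset:           # (2) keep T -> Y by dropping the Y -> T half of the undirected edge
        p[y].discard(t)
    return p


def apply_delete(cpdag: dict, x, y, h_subset) -> dict:
    """Delete(X, Y, H): remove the X - Y (or X -> Y) edge and direct edges towards each H."""
    p = deepcopy(cpdag)
    p[x].discard(y)
    p[y].discard(x)              # the edge between X and Y is gone
    for h in h_subset:
        p[h].discard(y)          # (1) Y - H becomes Y -> H
        if x in p[h] and h in p[x]:
            p[h].discard(x)      # (2) a previously undirected X - H becomes X -> H
    return p
```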
These two operators need to satisfy validity conditions so that the PDAG P^c resulting from applying an operator admits a consistent extension. If a DAG G has the same skeleton and the same set of v-structures as a PDAG P, and if every directed edge in P has the same orientation in G, we say that G is a consistent extension of P. If there is at least one consistent extension of a PDAG P, we say that P admits a consistent extension. If the PDAG P^c resulting from applying an operator admits a consistent extension, it is converted to a DAG by the PDAG-To-DAG algorithm; otherwise, the PDAG-To-DAG algorithm reports an error during the conversion of P^c.
The validity conditions all involve the problem of judging whether a node set is a clique. A clique in a DAG or a PDAG is a set of nodes in which every pair of nodes is adjacent, and directing the edges within a clique does not create a new v-structure. There are two methods for judging cliques. One first finds all the maximal cliques of the graph and then checks whether the node set is a subset of any maximal clique found in the first step. However, the general problem of finding optimal triangulations of undirected graphs, which arises when finding all the maximal cliques of the graph, is NP-hard[21], so heuristic algorithms[22][23] have been developed, and they are time-consuming. The other method of judging cliques is the direct form used in the PDAG-To-DAG algorithm, which is more time-efficient: for every vertex y adjacent to x with (x, y) undirected, if y is adjacent to all the other vertices adjacent to x, then x and all the vertices adjacent to x form a clique. This paper takes the second method of judging cliques.
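In either case, the core operation is a pairwise adjacency test over a node set; a minimal Python sketch is given below (the `adjacent` predicate, which should report adjacency by a directed or undirected edge, is assumed to be supplied by the graph representation):

```python
def is_clique(nodes, adjacent) -> bool:
    """True if every pair of distinct nodes in `nodes` is adjacent in the (P)DAG."""
    nodes = list(nodes)
    return all(adjacent(nodes[i], nodes[j])
               for i in range(len(nodes))
               for j in range(i + 1, len(nodes)))
```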
From profiling the GES algorithm, we find that it is time-consuming, and most of the running time is spent in computing the scores of the neighborhood and in the conversion from PDAG to DAG. Since the size of the neighborhood is exponential in the number of adjacencies of a node, paper [25] proposes replacing the exhaustive search by a greedy search that is linear, to improve the time-efficiency of the GES algorithm. This paper instead intends to improve the time-efficiency of the GES algorithm by finding the parents and children set of each node with the constraint-based method and confining GES within the parents and children set of each node.
Paper [6] proves that in the limit of large sample size, the GES algorithm identifies an inclusion-optimal equivalence class of DAG models, but GES may not identify a parameter-optimal model. The parameter-optimal model is the model that includes the distribution with the fewest parameters, so the parameters of the final model identified by the GES algorithm may not be the fewest among all the BNs that include the distribution. Then, by the consistency property of the BDeu scoring function, the score of the equivalence class finally identified by the GES algorithm is not necessarily the highest. This paper investigates the conditions for recovering the right structure, and intends to combine these conditions with the GES algorithm to improve its score.
2.3. Inclusion Boundary Condition
The Inclusion Boundary Condition is one of the conditions that guarantee that the hill-climbing algorithm attains the highest score. It is based on the theory of the Inclusion Order.
A graphical Markov model (GMM) M(G) is the family of probability distributions that are Markov over G. A probability distribution P is Markov over a graph G if and only if every CI restriction encoded in G is satisfied by P. The intuition behind the Inclusion Order is that one GMM M(G) precedes another GMM M(G') if and only if all the CI restrictions encoded in G are also encoded in G'. Meek provided an Inclusion Order in Meek's conjecture, and Chickering proved the conjecture.
Conjecture 1 Meek's conjecture[24]
Let D(G) and D(G') be two Bayesian networks determined by two DAGs G and G'. The conditional independence model induced by D(G) is included in the one induced by D(G'), i.e. $I(D(G)) \subseteq I(D(G'))$, if and only if there exists a sequence of DAGs $L_1, \ldots, L_n$ such that $G = L_1$, $G' = L_n$, and the DAG $L_{i+1}$ is obtained from $L_i$ by applying either the operation of covered arc reversal or the operation of arc removal, for $i = 1, \ldots, n-1$.
Paper [7] defines the Inclusion Boundary IB(G); intuitively, the Inclusion Boundary of a given GMM M(G) consists of those GMMs M(G_i) whose induced sets of CI restrictions immediately follow or precede that of M(G) under the Inclusion Order. Based on the definition of the Inclusion Boundary, paper [7] gives the definition of the Inclusion Boundary Condition, and paper [8] gives the theorem that proves the correctness of the hill-climbing algorithm under the faithfulness and unbounded-data assumptions.
Definition 3 Inclusion Boundary Condition[7]
A learning algorithm for GMMs satisfies the Inclusion Boundary Condition if, for every GMM determined by a graph G, the traversal operator creates a neighborhood N(G) such that N(G) ⊇ IB(G).
Theorem 2 The hill-climbing algorithm using the scoring function and a traversal operator satisfying the Inclusion Boundary Condition always finds the faithful model with the highest score[8].
Based on the Inclusion Order defined by Meek’s
conjecture, the neighborhood created by the operators of
one arc addition or one arc removal in the E-space,
namely the ENR neighborhood, satisfies the Inclusion
Boundary Condition. The neighborhood created by the operators of one arc addition, one arc removal, or one arc reversal in the DAG space, which is the neighborhood formed in the GS algorithm, does not satisfy the Inclusion Boundary Condition.
The paper [8] gives an approximation
neighborhood of the ENR neighborhood in the DAG
space. But when we use the CPDAG to represent the
equivalence class in the E-space, we can implement the
operators of producing the ENR neighborhood directly
on the CPDAG. However, in the GES algorithm, the
neighborhoods formed both in the first phase and in the
second phase do not contain the Inclusion Boundary
defined by Meek’s conjecture. So the GES algorithm
does not satisfy the Inclusion Boundary Condition, the
score of the BN structure identified by the GES
algorithm is not the highest, and the final score of the
GES algorithm still has room for improvement. This is
consistent with the conclusion that the BN structure
identified by the GES algorithm may not be parameter-optimal.
2.4. Unrestricted GES algorithm
The GES algorithm uses the Insert(X, Y, T) operator to produce the neighborhoods in the first phase and the Delete(X, Y, H) operator to produce the neighborhoods in the second phase. The unrestricted GES algorithm (UGES algorithm) uses both the Insert(X, Y, T) operator and the Delete(X, Y, H) operator to produce the neighborhoods in each iteration. The
neighborhood formed in the UGES algorithm
asymptotically satisfies the Inclusion Boundary
Condition. Therefore, we can predict that the score of
the BN structure identified by the UGES algorithm is
higher than the score of the BN structure identified by
the GES algorithm. But the size of the neighborhood in
the UGES algorithm is larger than that in the GES
algorithm, so the UGES algorithm may be more
time-consuming than the GES algorithm. In this paper,
we conduct experiments to compare the UGES
algorithm with the GES algorithm from both the aspect
of algorithm time-efficiency and the aspect of structure
identification quality.
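Schematically, one UGES iteration scores the Insert and Delete neighborhoods together and keeps the best move (Python sketch; `insert_neighbors` and `delete_neighbors` are assumed to yield only operator applications that pass the validity conditions and have already been converted to scoreable structures):

```python
def uges_step(current, insert_neighbors, delete_neighbors, score):
    """One UGES iteration: consider Insert and Delete moves at the same time."""
    best, best_score = None, score(current)
    for candidate in list(insert_neighbors(current)) + list(delete_neighbors(current)):
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score      # best is None once no move improves the score
```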
3 Algorithm Improvement
Both the GES algorithm and the UGES algorithm are time-consuming. Most of the running time is spent in the conversion from PDAG to DAG and in computing the scoring function, and this time is related to the size of the neighborhood formed in the GES algorithm and the UGES algorithm. We can find the parents and children set of each node using the constraint-based method, and confine the GES algorithm and the UGES algorithm within the parents and children set of each node, to reduce the size of the neighborhood and thus mitigate the time-consumption of the GES algorithm and the UGES algorithm.
3.1. Max-Min Parents and Children algorithm
The Max-Min Parents and Children algorithm (MMPC algorithm) is the first local learning algorithm for discovering the parents and children sets of nodes. The MMPC algorithm discovers the parents and children set with a two-phase scheme. In phase I, the forward phase, variables sequentially enter a candidate parents and children set according to a heuristic function. In phase II, the backward phase, all false positives that entered in the first phase are removed. In the end, the candidate parents and children set is returned as the parents and children set.
In the backward phase, the MMPC algorithm relies on CI tests to remove the false positives; the CI test used in the backward phase is the G² likelihood ratio test. In the forward phase, the MMPC algorithm needs to measure the strength of association between a pair of variables, and it uses the negative p-value returned by the G² likelihood ratio test as the measure of association.
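The two phases can be sketched as follows for a single target node (Python; `assoc` is the strength-of-association measure, e.g. the negative p-value of the G² test, and `independent` is the corresponding CI decision; both are assumed to be supplied, and the exhaustive subset enumeration is kept only for clarity, not efficiency):

```python
from itertools import chain, combinations


def all_subsets(nodes):
    """Every subset of `nodes`, including the empty set."""
    return chain.from_iterable(combinations(nodes, r) for r in range(len(nodes) + 1))


def mmpc(target, variables, assoc, independent):
    """Schematic MMPC: return the estimated parents-and-children set of `target`."""
    cpc = []                                     # candidate parents and children set
    # Phase I (forward): admit the variable with the maximum minimum association.
    while True:
        remaining = [v for v in variables if v != target and v not in cpc]
        if not remaining:
            break
        best_var, best_min = None, float("-inf")
        for v in remaining:
            min_assoc = min(assoc(v, target, list(s)) for s in all_subsets(cpc))
            if min_assoc > best_min:
                best_var, best_min = v, min_assoc
        if any(independent(best_var, target, list(s)) for s in all_subsets(cpc)):
            break                                # even the best candidate is separable: stop
        cpc.append(best_var)
    # Phase II (backward): drop false positives separable by some subset of the other members.
    return [v for v in cpc
            if not any(independent(v, target, list(s))
                       for s in all_subsets([u for u in cpc if u != v]))]
```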
3.2. The combination of the GES/UGES algorithm with the MMPC algorithm
There are already papers describing the combination of the MMPC algorithm with the GS algorithm, namely the Max-Min Hill-climbing algorithm (MMHC algorithm), and the experiments show that the MMHC algorithm is a promising algorithm that outperforms the other algorithms compared, such as the GS algorithm, the GES algorithm, and the BNPC algorithm[2], in terms of running time, number of statistical calls, Bayesian score, and Structural Hamming Distance (SHD).
This paper combines the GES algorithm with the MMPC algorithm (MMPC-GES algorithm) and the UGES algorithm with the MMPC algorithm (MMPC-UGES algorithm). The specific implementation of the combination of the GES/UGES algorithm with the MMPC algorithm is that when the GES/UGES algorithm uses the Insert(X, Y, T) operator to produce the neighborhood, X must be in the parents and children set of Y. We conduct experiments that compare the MMPC-GES algorithm and the MMPC-UGES algorithm with the GS/GES/UGES algorithms without the restriction of the parents and children sets, as well as with the MMHC algorithm, and draw some useful conclusions.
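A sketch of this restriction (Python; `pc_sets` is assumed to map each node to the parents-and-children set returned by MMPC):

```python
def restricted_insert_pairs(nodes, pc_sets):
    """Yield only the ordered pairs (X, Y) that the restricted Insert operator may connect:
    X must belong to the parents-and-children set of Y learnt by MMPC."""
    for y in nodes:
        for x in pc_sets[y]:
            if x != y:
                yield x, y
```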
4 Experimental results and analysis
This paper uses datasets sampled from the ALARM network[13] to conduct comparison experiments. The ALARM network models a medical diagnostic system for patient monitoring. It contains 37 variables and 46 edges; each variable has two to four possible values, the maximum in-degree is four, the maximum out-degree is five, and the sizes of the parents and children sets range from one to six. We randomly sampled 10 training datasets for each of the three sample sizes 500, 1000, and 5000. Each reported statistic is the average over the 10 runs of an algorithm on the 10 different datasets of a given sample size.
This paper chooses three metrics to measure the
quality of structure identification and the time-efficiency
of the algorithms, namely Bayesian score, Structural
Hamming Distance (SHD), and number of calls to the
scoring function.
We use the BDeu scoring function to calculate the score, with an equivalent sample size of ten and a network structure prior of one; a higher BDeu score is better. SHD is the number of missing edges, extra edges, and reversed edges (including edges that are undirected in one graph and directed in the other) between two PDAGs. SHD does not penalize structural differences that cannot be statistically distinguished, which is why it is defined on PDAGs instead of DAGs; a smaller SHD is better.
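For concreteness, a small Python sketch of this count, assuming each PDAG is given as a dict that maps an unordered node pair frozenset({u, v}) to its edge mark, either the tuple (u, v) for the directed edge u -> v or the string "-" for an undirected edge (the representation is ours, for illustration only):

```python
def shd(pdag_a: dict, pdag_b: dict, nodes) -> int:
    """Structural Hamming Distance between two PDAGs over the same node set."""
    nodes = list(nodes)
    distance = 0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            pair = frozenset({u, v})
            # Any disagreement counts once: missing, extra, reversed, or directed vs undirected.
            if pdag_a.get(pair) != pdag_b.get(pair):
                distance += 1
    return distance
```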
Since the running time depends on the computer configuration, the CPU load, and so on, this paper does not use it to measure the time-efficiency of the algorithms; instead, it uses the number of calls to the scoring function during the greedy search, which is proportional to the running time.
4.1. BDeu score results
We randomly sample one testing dataset containing 5000 cases for each training dataset, and the BDeu score is the average over the 10 runs of an algorithm on the 10 different testing datasets of a given sample size. Besides, we calculate the score of the true ALARM network in the same way. The average BDeu scores for the three sample sizes are given in Table 1.
From Table 1, we can make several useful observations. Firstly, as the sample size increases, the BDeu score of the identified BN becomes higher. Secondly, the GES algorithm and the UGES algorithm in the E-space achieve higher BDeu scores than the GS algorithm in the DAG space, since the greedy search in the E-space mitigates the drawback of falling into a local maximum caused by the score equivalence property of the BDeu scoring function in the DAG space. Thirdly, the UGES algorithm achieves a slightly higher BDeu score than the GES algorithm, since the UGES algorithm satisfies the Inclusion Boundary Condition asymptotically while the GES algorithm does not. Fourthly, when the GS/GES/UGES algorithms are restricted by the parents and children sets produced by the MMPC algorithm, the MMPC-GES algorithm performs the worst in BDeu score among the three algorithms MMHC, MMPC-GES, and MMPC-UGES; the performance of MMPC-GES may be reduced greatly by its strict two-phase search strategy. The MMPC-UGES algorithm in the E-space still outperforms the MMHC algorithm in the DAG space in BDeu score.
Fifthly, overall, GS/GES/UGES with the restriction of MMPC achieve lower BDeu scores than GS/GES/UGES without the restriction; that is, the BDeu score of GS/GES/UGES is reduced by the restriction of the parents and children sets. In particular, the BDeu score of the MMHC algorithm is lower than that of the GS algorithm, and the BDeu score of the GS algorithm does not exceed that of the true ALARM network, so there is no sign of overfitting. We should keep in mind that the parents and children sets identified by the MMPC algorithm may not be exactly the parents and children sets of the true network and depend on the quality of the dataset.
Table 1: Average BDeu Score Results.

Sample size   True ALARM   GS          GES         UGES        MMHC        MMPC-GES    MMPC-UGES
500           -47892.85    -48742.44   -48425.9    -48422.26   -49538.36   -50381.35   -49002.71
1000          -47953.66    -48490.64   -48338.42   -48318.08   -49486.32   -50159.25   -48816.94
5000          -47536.78    -47941.26   -47865.43   -47835.32   -48994.44   -49008.54   -48765.58

Table 2: Average SHD Results. In the format A(B, C, D), A is the SHD, B is the number of extra edges, C is the number of missing edges, and D is the number of reversal edges.

Sample size   GS                     GES                    UGES                   MMHC                   MMPC-GES               MMPC-UGES
500           71.3(46.3, 3.4, 21.6)  59.2(43.9, 2.6, 12.7)  60.0(44.8, 2.5, 12.7)  30.8(7.6, 3.7, 19.5)   30.3(3.0, 8.5, 18.8)   29.6(8.2, 3.8, 17.6)
1000          64.6(39.2, 1.8, 23.6)  56.7(39.3, 1.7, 15.7)  55.3(38.7, 1.7, 14.9)  27.8(5.7, 2.3, 19.7)   24.5(1.5, 6.4, 16.6)   24.3(5.7, 2.3, 16.3)
5000          57.4(27.4, 0.9, 29.1)  43.7(23.4, 1.1, 19.2)  44.0(24.0, 0.8, 19.2)  23.4(4.3, 0.3, 18.8)   20.8(0.7, 0.1, 20.0)   25.9(5.1, 0.0, 20.8)

Table 3: Average Number of Calls to BDeu Scoring Function Results.

Sample size   GS         GES        UGES       MMHC     MMPC-GES   MMPC-UGES
500           105265.9   111524.5   118216.1   6073.1   1662       4379.4
1000          101082.2   114311.9   120510.4   5709.7   1527.7     3909.7
5000          94832.8    108326.8   118639.2   6117.1   2013.6     4322.4
4.2. SHD results
This paper calculates the SHD as the average over the 10 runs of an algorithm on the 10 different training datasets of a given sample size, as well as the three components of SHD, namely the number of extra edges, the number of missing edges, and the number of reversal edges (including edges that are undirected in one graph and directed in the other). The average SHD results for the three sample sizes are given in Table 2.
From Table 2, we can find that as the sample size increases, the SHD of the identified BN becomes smaller, as do the number of extra edges and the number of missing edges, but the number of reversal edges does not show this trend. The GES algorithm and the UGES algorithm perform better than the GS algorithm in SHD, but the gap between the GES algorithm and the UGES algorithm in SHD is not obvious. Besides, the gaps among MMHC, MMPC-GES, and MMPC-UGES in SHD are not obvious either.
Overall, the SHD of GS/GES/UGES with the restriction of MMPC is lower than that of GS/GES/UGES without the restriction; that is, the SHD of GS/GES/UGES, and especially the number of extra edges, is improved by the restriction of the parents and children sets.
4.3. Number of calls to BDeu scoring function results
We use the number of calls to the BDeu scoring function to measure the time-efficiency of the algorithms. Note that each call to the BDeu scoring function in the E-space corresponds to one conversion from PDAG to DAG, and the time cost of one conversion is nearly the same as that of one call to the BDeu scoring function. The average numbers of calls to the BDeu scoring function for the three sample sizes are given in Table 3.
From Table 3, we can find that the GES/UGES algorithms make about ten percent more calls to the BDeu scoring function than the GS algorithm, and the GES/UGES algorithms in the E-space need extra
conversions between PDAG and DAG, so we can expect the running time of GES/UGES to be much longer than that of GS. The UGES algorithm makes about seven percent more calls to the BDeu scoring function than the GES algorithm, so we can expect the running time of UGES to be slightly longer than that of GES. Besides, the number of calls to the BDeu scoring function in the MMPC-UGES algorithm is more than twice that in the MMPC-GES algorithm, so the running time of MMPC-UGES is expected to be about twice that of MMPC-GES. The MMHC algorithm makes about 40% more calls to the BDeu scoring function than the MMPC-UGES algorithm; considering the extra time cost of the conversions between PDAG and DAG in MMPC-UGES, we can expect the running times of MMHC and MMPC-UGES to be almost the same.
Overall, the number of calls to the BDeu scoring function in GS/GES/UGES with the restriction of MMPC is lower than that in GS/GES/UGES without the restriction; that is, the amount of computation in GS/GES/UGES is reduced by the restriction of the parents and children sets. The number of calls to the BDeu scoring function is reduced the most in the MMPC-GES algorithm, which also accounts for the worst BDeu-score performance of MMPC-GES compared with MMHC and MMPC-UGES.
5 Conclusions
In this paper, we use the Inclusion Boundary Condition to analyze the GES algorithm and the GS algorithm. We point out that the unrestricted form of the GES algorithm, which uses both the Insert operator and the Delete operator to produce the neighborhood, asymptotically satisfies the Inclusion Boundary Condition. In addition, the combinations of the GES/UGES algorithms with the local discovery algorithm MMPC, namely the MMPC-GES algorithm and the MMPC-UGES algorithm, are proposed. We compare GS/GES/UGES/MMHC/MMPC-GES/MMPC-UGES on the datasets sampled from the ALARM network. The experiments show that MMHC/MMPC-GES/MMPC-UGES make fewer calls to the BDeu scoring function and achieve better SHD than GS/GES/UGES without the restriction of the parents and children sets, while the BDeu score is reduced by the restriction of the parents and children sets. Considering the large improvement in time-efficiency and SHD, MMHC/MMPC-GES/MMPC-UGES are still compelling. Among the three algorithms MMHC/MMPC-GES/MMPC-UGES, MMPC-GES performs the worst in BDeu score, and MMPC-UGES performs the best.
Finally, the combination of the constraint-based method, the Bayesian search-and-score method, and the Markov Equivalence Class space is promising. The problem of confining the greedy search can also be seen as the problem of directing the identified skeleton. Constraint-based methods usually first identify the skeleton and then direct it using direction rules. In some of our initial experiments with constraint-based algorithms, such as the BNPC algorithm, the skeleton identification is good, but the direction of the skeleton is not. If we use the search-and-score method to direct the skeleton identified by the constraint-based method, it will improve the final BN structure identified by the constraint-based method.
Acknowledgments: This work is supported by
National Basic Research Program of China (973
program) with Grant No. 2011CB706900, National
Natural Science Foundation of China (Grant No.
70971128), Beijing Natural Science Foundation (Grant
No. 9102022) and the President Fund of GUCAS (Grant
No. O95101HY00).
6 References and Notes
[1] P. Spirtes, C. N. Glymour, and R. Scheines. “Causation,
Prediction, and Search”. The MIT Press, 2000.
[2] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. “Learning Bayesian Networks from Data: An Information-theory Based Approach”; Artificial Intelligence, Vol. 137, Issue 1, 43-90, 2002.
[3] G. F. Cooper and E. Herskovits. “A Bayesian Method for the Induction of Probabilistic Networks from Data”; Machine Learning, Vol. 9, Issue 4, 309-347, 1992.
[4] H. Xu, H. Yu, and J. Wang. “Poison Identification Based
on Bayesian Method in Biochemical Terrorism Attacks”;
Advanced Science Letters, Vol. 5, 1-5, 2012.
[5] D. M. Chickering. “Optimal Structure Identification with
Greedy Search” ; Journal of Machine Learning Research, Vol.
3, 507-550, 2002.
[6] D. M. Chickering and C. Meek. “Finding Optimal
Bayesian Networks”. Proceedings of the Eighteenth
Conference on Uncertainty in Artificial Intelligence, Pages
94-102, 2002.
[7] T. Kocka, R. Bouckaert, and M. Studeny. “On the
Inclusion Problem”. Technical Report, Academy of Sciences
of the Czech Republic, 2001.
[8] R. Castelo and T. Kočka. “Towards an Inclusion Driven Learning of Bayesian Networks”. Technical Report CS-2002-05, 2002.
[9] N. Friedman, I. Nachman, and D. Peer. “Learning
Bayesian Network Structure from Massive Datasets: The
Sparse Candidate Algorithm”. Proceedings of the Fifteenth
Conference on Uncertainty in Artificial Intelligence, 1999.
[10] N. Friedman, M. Linial, I. Nachman, and D. Peer.
“Using Bayesian Networks to Analyze Expression Data”;
Journal of Computational Biology, Vol. 7, Issue 3, 601–620,
2000.
[11] I. Tsamardinos, L. Brown, and C. Aliferis. “The Max-Min Hill-climbing Bayesian Network Structure Learning Algorithm”; Machine Learning, Vol. 65, Issue 1, 31-78, 2006.
[12] I. Tsamardinos, C. F. Aliferis, and A. Statnikov. “Time
and Sample Efficient Discovery of Markov Blankets and
Direct Causal Relations”. Technical Report DSL-03-02,
Vanderbilt University, 2003.
[13] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F.
Cooper. “The ALARM Monitoring System: A Case Study
with Two Probabilistic Inference Techniques for Belief
Networks”. Proceedings of the Second European Conference
on Artificial Intelligence in Medicine, Pages 247-256, London,
1989.
[14] K. P. Murphy. The Bayes Net Toolbox for Matlab. http://code.google.com/p/bnt/
[15] P. Leray. The BNT Structure Learning Package. http://bnt.insa-rouen.fr/index.html
[16] C. Aliferis, I. Tsamardinos, A. Statnikov, and L. E.
Brown. “Causal Explorer: A Causal Probabilistic Network
Learning Toolkit for Biomedical Discovery”. METMBS,
371–376, 2003.
[17] T. Verma and J. Pearl. “Equivalence and Synthesis of
Causal Models”. Proceedings of the Sixth Conference on
Uncertainty in Artificial Intelligence, Pages 220–227, 1991.
[18] D. Heckerman, D. Geiger, and D. Chickering. “Learning Bayesian Networks: The Combination of Knowledge and Statistical Data”; Machine Learning, Vol. 20, Issue 3, 197-243, 1995.
[19] D. Dor and M. Tarsi. “A Simple Algorithm to Construct
a Consistent Extension of a Partially Oriented Graph”.
Technical Report R-185, 1992.
[20] D. M. Chickering. “Learning Equivalence Classes of
Bayesian-Network Structures”; Journal of Machine Learning
Research, Vol. 2, 445-498, 2002.
[21] M. Yannakakis. “Computing the Minimum Fill-in is
NP-complete”; SIAM Journal on Algebraic and Discrete
Methods, Vol. 2, Issue 1, 77–79, 1981.
[22] U. Kjaerulff. “Triangulation of Graphs - Algorithms Giving Small Total State Space”. Technical Report R-90-09, 1990.
[23] C. Huang and A. Darwiche. “Inference in Belief
Networks: A Procedural Guide”; International Journal of
Approximate Reasoning, Vol. 15, Issue 3, 225-263, 1996.
[24] C. Meek. “Graphical Models, Selecting Causal and
Statistical Models”. PhD Thesis, Carnegie Mellon University,
1997.
[25] J. I. Alonso-Barba, L. Ossa, J. A. Gámez, and J. M.
Puerta. “Scaling up the Greedy Equivalence Search Algorithm
by Constraining the Search Space of Equivalence Classes”.
Proceedings of the 11th European Conference on Symbolic
and Quantitative Approaches to Reasoning with Uncertainty,
Pages 194-205, 2011.