The Exploration of Greedy Hill-climbing Search in Markov Equivalence Class Space

Huijuan Xu¹, Hua Yu¹, Juyun Wang², Jinke Jiang¹
¹ College of Engineering, Graduate University of Chinese Academy of Sciences, Beijing, China
² College of Science, Communication University of China, Beijing, China

Abstract - Greedy hill-climbing search in the Markov Equivalence Class space (E-space) can overcome the local maxima that arise in the Directed Acyclic Graph space (DAG space) from the score-equivalence property of Bayesian scoring functions. One representative algorithm is the Greedy Equivalence Search algorithm (GES algorithm), which is inclusion optimal in the limit of large sample sizes but not parameter optimal. In fact, the GES algorithm does not comply with the Inclusion Boundary Condition, which guarantees attaining the highest score, whereas the unrestricted form of the GES algorithm (UGES algorithm) complies with the Inclusion Boundary Condition approximately. However, greedy hill-climbing search, both in the DAG space and in the E-space, is time-consuming. Confining the search using a constraint-based method is a good remedy for this drawback. This paper conducts experiments comparing the greedy hill-climbing search algorithm in the DAG space (GS algorithm), the GES algorithm, and the UGES algorithm, both with and without the restriction to the parents-and-children sets, and finds that GS/GES/UGES with the restriction achieve improvements in time-efficiency and structural difference, with a small reduction in the Bayesian score.

Keywords: Inclusion Boundary Condition, GES algorithm, UGES algorithm, MMHC algorithm, local discovery algorithms

1 Introduction

In the field of Bayesian networks (BNs), BN structure learning is a research hotspot. There are two main categories of BN structure learning algorithms, namely the constraint-based method and the search-and-score method. The constraint-based method[1][2] first uses conditional independence (CI) tests to determine the skeleton of the BN structure, and then directs the skeleton using orientation rules. The CI test has several forms, such as Pearson's chi-squared test, the G2 likelihood ratio test, and mutual information.

The search-and-score method includes two typical algorithms, namely the K2 algorithm[3] and the greedy hill-climbing algorithm. The K2 algorithm finds parent nodes for each node from among the nodes that precede it in an initial node ordering, and builds the Bayesian network structure incrementally. The initial node ordering can be determined by the method described in paper [4]: build the maximum weight spanning tree (MWST) using mutual information, and then use the topological order of the oriented MWST as the initial node ordering. The greedy hill-climbing algorithm in the DAG space (GS algorithm) takes an initial graph, defines a neighborhood, computes a score for every graph in this neighborhood, and chooses the one that maximizes the score for the next iteration, until the scoring function no longer improves between two consecutive iterations. The neighborhood is defined as the set of graphs that differ from the current graph by a single edge insertion, reversal, or deletion. The initial graph can be chosen as the oriented MWST.
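To make the search loop concrete, the following is a minimal Python sketch of greedy hill-climbing in the DAG space (our illustration, not the Matlab/BNT implementation used in this paper); `score` and `neighbors_dag` are hypothetical placeholders for a decomposable scoring function and the one-edge-change neighborhood generator.

```python
# Minimal sketch of greedy hill-climbing in DAG space (GS), assuming a
# scoring function `score(dag, data)` and a generator `neighbors_dag(dag)`
# that yields every acyclic graph differing from `dag` by one edge
# insertion, one reversal, or one deletion.

def greedy_search(initial_dag, data, score, neighbors_dag):
    current = initial_dag
    current_score = score(current, data)
    while True:
        # Score every graph in the one-edge-change neighborhood.
        best, best_score = None, current_score
        for candidate in neighbors_dag(current):
            s = score(candidate, data)
            if s > best_score:
                best, best_score = candidate, s
        if best is None:  # no neighbor improves the score: local maximum
            return current
        current, current_score = best, best_score
```

When several neighbors tie for the best score, this sketch keeps the first one found; as noted below, breaking such ties randomly makes the learnt structure fluctuate across runs.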
The GS algorithm has several drawbacks. Firstly, by the score-equivalence property of Bayesian scoring functions, the directed acyclic graphs in the same equivalence class have the same score, so if two adjacent iterations stay within the same equivalence class, the GS algorithm falls into a local maximum. Secondly, if we break ties by random selection when more than one graph in the neighborhood attains the highest score, the learnt BN structures fluctuate considerably across multiple runs of the GS algorithm. Thirdly, the GS algorithm becomes increasingly time-consuming as the number of variables grows.

Greedy hill-climbing search in the Markov Equivalence Class space can overcome the local maxima caused by the score-equivalence property of Bayesian scoring functions, and can reduce the variability of the learnt BN structures. One state-of-the-art algorithm for greedy hill-climbing search in the E-space is the GES algorithm[5]. The GES algorithm starts from the empty graph, first adds edges until the score reaches a local maximum, then deletes edges until the score reaches another local maximum, and finally returns that equivalence class as the solution. Paper [6] proves that, in the limit of large sample sizes, the GES algorithm identifies an inclusion-optimal equivalence class of DAG models. However, GES may fail to identify a parameter-optimal model even in this limit; that is, the parameters of the finally identified model need not be the fewest among all BNs that include the distribution. By the consistency property of Bayesian scoring functions, the score of the equivalence class finally identified by the GES algorithm is therefore not the highest. Paper [7] defines the Inclusion Boundary Condition based on the inclusion order provided by Meek's conjecture, and paper [8] proves that a hill-climbing algorithm using a penalized score function and a traversal operator satisfying the Inclusion Boundary Condition always finds the faithful model with the highest score. In fact, the GES algorithm does not comply with the Inclusion Boundary Condition and thus has no guarantee of attaining the highest score. The UGES algorithm considers both edge addition and edge deletion at each step, and complies with the Inclusion Boundary Condition approximately. It is an unrestricted form of the GES algorithm and may attain a higher score than GES.

Both the GES algorithm and the UGES algorithm are time-consuming. Some algorithms address this drawback by restricting the maximum number of parents allowed for every node in the network, like the Sparse Candidate algorithm[9][10], but this is not a sound solution: firstly, it is hard to set the value of the maximum number of parents, and secondly, this parameter imposes a uniform constraint on every node in the network. If instead we use the constraint-based method to find the set of parents and children of each node, and restrict the greedy search within the parents-and-children set of each node, neither problem arises. The Max-Min Hill-climbing algorithm (MMHC algorithm)[11] is one such BN structure learning algorithm: it first uses the Max-Min Parents and Children algorithm (MMPC algorithm)[12] to find the parents-and-children set of each node, and then applies the GS algorithm within those sets.
It has been shown that the MMHC algorithm yields computational savings and performs better than many existing algorithms. This paper conducts experiments comparing the GES algorithm and the UGES algorithm in terms of Bayesian score, Structural Hamming Distance (SHD)[11], and number of calls to the scoring function, with the GS algorithm as a reference. Besides, we use the parents-and-children sets learnt by the MMPC algorithm to restrict the GES and UGES algorithms and thereby improve their time-efficiency, and we conduct experiments comparing the performance of the GES and UGES algorithms under this restriction, with the MMHC algorithm (MMPC+GS) as a reference. We use data sampled from the ALARM network[13] and implement the experiments in the Matlab environment with the Bayes Net Toolbox[14], the BNT Structure Learning package[15], and the Causal Explorer package[16].

This paper is structured as follows. Section 2 analyzes the GES algorithm and the UGES algorithm using the E-space, the scoring function, and the Inclusion Boundary Condition. Section 3 introduces the local discovery algorithm MMPC and its combination with the GES and UGES algorithms. Experimental results and analysis are shown in Section 4. Finally, Section 5 presents conclusions.

2 The greedy Hill-climbing search in the Markov Equivalence Class space

2.1. The Markov Equivalence Class space and the scoring function

The Markov Equivalence Class space (E-space) is partitioned into Markov equivalence classes; the DAGs in the same Markov equivalence class are equivalent.

Theorem 1 Two DAGs are equivalent if and only if they have the same skeletons and the same v-structures[17].

The skeleton of a DAG is the undirected graph that results from ignoring the directionality of every edge. A v-structure in a DAG G is an ordered triple of nodes (X, Y, Z) such that (1) G contains the edges X → Y and Z → Y, and (2) X and Z are not adjacent in G.

Equivalence classes of DAGs are represented with completed partially directed acyclic graphs (CPDAGs). A CPDAG contains both directed and undirected edges: the directed edges represent compelled edges and the undirected edges represent reversible edges. A compelled edge has the same orientation in every member of the equivalence class; a reversible edge is an edge that is not compelled. In a CPDAG or PDAG, a pair of nodes are neighbors if they are connected by an undirected edge, and they are adjacent if they are connected by either an undirected edge or a directed edge.
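Theorem 1 gives a purely mechanical equivalence test. The following minimal sketch illustrates it, assuming each DAG is given as a set of (parent, child) pairs; the representation and function names are ours, chosen for illustration.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples (X, Y, Z) with X -> Y <- Z and X, Z non-adjacent."""
    parents = {}
    for x, y in edges:
        parents.setdefault(y, set()).add(x)
    skel = skeleton(edges)
    vs = set()
    for y, pa in parents.items():
        # sorted() gives a canonical (x, z) order, assuming comparable node labels
        for x, z in combinations(sorted(pa), 2):
            if frozenset((x, z)) not in skel:  # collider with non-adjacent parents
                vs.add((x, y, z))
    return vs

def markov_equivalent(g1, g2):
    """Theorem 1: equivalent iff same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, markov_equivalent({("A", "B")}, {("B", "A")}) is True, since a single edge is reversible, while A → B ← C and A → B → C share a skeleton but are not equivalent, because only the former contains a v-structure.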
The DAGs in the same equivalence class obtain the same score if the scoring function has the score-equivalence property. In general, a scoring function may have three properties:

Property 1 The scoring function is decomposable if it can be written as a sum of measures, each of which is a function of only one node and its parents.

Property 2 The scoring function is score equivalent if the DAGs in the same equivalence class obtain the same score.

Property 3 The scoring function is consistent if, in the limit of large sample sizes, the following conditions hold: (1) if H contains the distribution and G does not, then the score of H is larger than the score of G; (2) if H and G both contain the distribution and H contains fewer parameters than G, then the score of H is larger than the score of G.

The Bayesian Dirichlet (BD) scoring function is one of the commonly used scoring functions. The BDe scoring function is the BD scoring function that satisfies the score-equivalence property, and the BDeu scoring function is the BDe scoring function whose parameter prior has uniform mean. The BDeu scoring function has the form

$$\log p(G, D \mid \xi) = \log p(G \mid \xi) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left[ \log \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})} \right] \qquad (1)$$

where N'_ijk = N'/(r_i q_i), q_i denotes the number of configurations of the parent set Π(x_i), r_i denotes the number of states of variable x_i, N_ijk is the number of records in dataset D for which x_i = k and Π(x_i) is in the j-th configuration, and N_ij = Σ_k N_ijk. Γ(·) is the Gamma function, which satisfies Γ(y+1) = yΓ(y) and Γ(1) = 1.

The components N'_ijk and p(G | ξ) specify the prior knowledge. In the BDeu scoring function, the parameter prior N'_ijk = N'/(r_i q_i) corresponds to a uniform joint distribution, where N' is the equivalent sample size, usually set to ten in experiments. p(G | ξ) is the network structure prior; its assessment is discussed in detail in paper [18], and we set it to one in this paper. From Eq. (1) we can see that the BDeu scoring function is decomposable. Besides, paper [18] shows that the BDeu scoring function is score equivalent, and paper [5] shows that it is consistent, so the BDeu scoring function satisfies all three properties of the scoring function mentioned above.
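Because Eq. (1) is decomposable, it is computed in practice as a sum of local terms, one per node. Below is a minimal sketch of one local term, assuming complete discrete data in a NumPy integer array; the function and variable names are ours, not from the paper or the BNT toolbox.

```python
import numpy as np
from scipy.special import gammaln

def bdeu_local_score(data, i, parents, arities, ess=10.0):
    """Local BDeu term of Eq. (1) for node i with the given parent set.

    data:    (N, n) integer array, states coded 0..r-1 per column
    arities: list with the number of states r of each variable
    ess:     equivalent sample size N' (set to ten in this paper's experiments)
    """
    r_i = arities[i]
    q_i = int(np.prod([arities[p] for p in parents])) if parents else 1
    a_ijk = ess / (r_i * q_i)   # N'_ijk: uniform Dirichlet parameter prior
    a_ij = ess / q_i            # N'_ij = sum_k N'_ijk

    # Map each record's parent configuration to an index j in 0..q_i-1.
    j_index = np.zeros(len(data), dtype=int)
    for p in parents:
        j_index = j_index * arities[p] + data[:, p]

    # N_ijk counts via a 2-D histogram over (j, k); n_ij sums out k.
    counts = np.zeros((q_i, r_i))
    np.add.at(counts, (j_index, data[:, i]), 1)
    n_ij = counts.sum(axis=1)

    return (np.sum(gammaln(a_ij) - gammaln(a_ij + n_ij))
            + np.sum(gammaln(a_ijk + counts) - gammaln(a_ijk)))
```

The full network score is then the sum of bdeu_local_score over all nodes plus the log of the structure prior, which this paper sets to one (hence zero in log space).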
2.2. Greedy Equivalent Search algorithm

The Greedy Equivalent Search algorithm (GES algorithm) starts from the empty graph, first adds edges until the score reaches a local maximum, then deletes edges until the score reaches another local maximum, and finally returns that equivalence class as the solution. The GES algorithm searches through the E-space, where equivalence classes are represented with CPDAGs. Paper [5] proves Meek's conjecture and shows that the local maximum reached in the first phase of the algorithm contains the generative distribution, and that the final equivalence class returned by the GES algorithm is asymptotically a perfect map of the generative distribution.

Two kinds of operators are used to construct the neighborhood in the GES algorithm, namely the Insert operator in the first phase and the Delete operator in the second phase. To compute the scores of the PDAGs in the neighborhood after applying the operators, each PDAG has to be converted to a DAG; paper [19] gives a simple implementation of the algorithm PDAG-To-DAG. Besides, the GES algorithm needs the algorithm DAG-To-CPDAG[20] to convert the finally chosen DAG back to a CPDAG so that the greedy search can continue. The two operators are defined as follows.

Definition 1 Insert(X, Y, T)[5] For non-adjacent nodes X and Y in the CPDAG P^c, and for any subset T of the neighbors of Y that are not adjacent to X, the Insert(X, Y, T) operator modifies P^c by (1) inserting the directed edge X → Y, and (2) for each node T in the subset, directing the previously undirected edge between T and Y as T → Y.

Definition 2 Delete(X, Y, H)[5] For adjacent nodes X and Y in the CPDAG P^c connected either as X − Y or X → Y, and for any subset H of the neighbors of Y that are adjacent to X, the Delete(X, Y, H) operator modifies P^c by deleting the edge between X and Y, and, for each node H in the subset, (1) directing the previously undirected edge between Y and H as Y → H and (2) directing any previously undirected edge between X and H as X → H.

These two operators must satisfy validity conditions, so that the PDAG P^c resulting from applying them admits a consistent extension. If a DAG G has the same skeleton and the same set of v-structures as a PDAG P, and every directed edge in P has the same orientation in G, we say that G is a consistent extension of P. If there is at least one consistent extension of a PDAG P, we say that P admits a consistent extension. If the PDAG P^c resulting from applying the operators admits a consistent extension, it is converted to a DAG by the algorithm PDAG-To-DAG; otherwise, PDAG-To-DAG reports an error during the conversion.

The validity conditions all involve the problem of judging cliques. A clique in a DAG or a PDAG is a set of nodes in which every pair of nodes is adjacent, and directing the edges of a clique does not create new v-structures. There are two methods for judging cliques. The first is to find all maximal cliques of the graph and then check whether the node set is a subset of any maximal clique found. However, the underlying problem of finding optimal triangulations of undirected graphs, which arises when enumerating all maximal cliques, is NP-hard[21], so heuristic algorithms[22][23] are used, which are time-consuming. The second method is the direct test used in the algorithm PDAG-To-DAG, which is more time-efficient: for every vertex y adjacent to x with the edge (x, y) undirected, if y is adjacent to all the other vertices that are adjacent to x, then x and all the vertices adjacent to x form a clique (a sketch of this test is given at the end of this subsection). This paper uses the second method.

Profiling the GES algorithm shows that it is time-consuming, with most of the running time spent calculating the scores of the neighborhood and converting PDAGs to DAGs. Since the size of the neighborhood is exponential in the number of adjacencies of a node, paper [25] proposes replacing the exhaustive search with a greedy search, which is linear, to improve the time-efficiency of the GES algorithm. This paper instead improves the time-efficiency of the GES algorithm by finding the parents-and-children set of each node with a constraint-based method and confining GES within those sets.

Paper [6] proves that, in the limit of large sample sizes, the GES algorithm identifies an inclusion-optimal equivalence class of DAG models, but GES may not identify a parameter-optimal model, i.e. the model that includes the distribution with the fewest parameters. So the parameters of the final model identified by the GES algorithm may not be the fewest among all BNs that include the distribution, and, by the consistency property of the BDeu scoring function, the score of the equivalence class finally identified by the GES algorithm is not the highest. This paper investigates the conditions for recovering the right structure and combines these conditions with the GES algorithm to improve its score.
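The direct clique test promised above is short enough to state in full; this is our hedged Python rendering of the sentence describing it, assuming adjacency and undirected-neighbor sets are available as dicts, which is an illustrative representation rather than the paper's implementation.

```python
def undirected_neighbors_form_clique(x, adj, undirected):
    """Direct clique test used in PDAG-To-DAG: for every vertex y adjacent
    to x with the edge (x, y) undirected, y must be adjacent to all the
    other vertices adjacent to x; then x and the vertices adjacent to x
    form a clique, so their edges can be directed without creating new
    v-structures.
    adj[v]        -- set of vertices adjacent to v (directed or undirected edge)
    undirected[v] -- set of vertices joined to v by an undirected edge"""
    return all(adj[y] >= (adj[x] - {y}) for y in undirected[x])
```

Each check touches only the local adjacency sets, which is why this test is far cheaper than enumerating all maximal cliques.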
2.3. Inclusion Boundary Condition

The Inclusion Boundary Condition is one of the conditions that guarantee that a hill-climbing algorithm attains the highest score. It is based on the theory of the inclusion order.

A graphical Markov model (GMM) M(G) is the family of probability distributions that are Markov over G. A probability distribution P is Markov over a graph G if and only if every CI restriction encoded in G is satisfied by P. The intuition behind the inclusion order is that one GMM M(G) precedes another GMM M(G') if and only if all the CI restrictions encoded in G are also encoded in G'. Meek provided an inclusion order in Meek's conjecture, and Chickering proved the conjecture.

Conjecture 1 (Meek's conjecture)[24] Let D(G) and D(G') be two Bayesian networks determined by two DAGs G and G'. The conditional independence model induced by D(G) is included in the one induced by D(G'), i.e. I(G) ⊆ I(G'), if and only if there exists a sequence of DAGs L_1, ..., L_n such that G = L_1, G' = L_n, and L_{i+1} is obtained from L_i by applying either the operation of covered arc reversal or the operation of arc removal, for i = 1, ..., n−1.

Paper [7] defines the Inclusion Boundary IB(G); intuitively, the Inclusion Boundary of a given GMM M(G) consists of those GMMs M(G_i) whose sets of CI restrictions I(G_i) immediately follow or precede I(G) under the inclusion order. Based on this definition, paper [7] gives the definition of the Inclusion Boundary Condition, and paper [8] gives the theorem that establishes the correctness of the hill-climbing algorithm under the faithfulness and unbounded-data assumptions.

Definition 3 (Inclusion Boundary Condition)[7] A learning algorithm for GMMs satisfies the Inclusion Boundary Condition if, for every GMM determined by a graph G, the traversal operator creates a neighborhood N(G) such that IB(G) ⊆ N(G).

Theorem 2 The hill-climbing algorithm using the scoring function and a traversal operator satisfying the Inclusion Boundary Condition always finds the faithful model which has the highest score[8].

Based on the inclusion order defined by Meek's conjecture, the neighborhood created by the operators of one arc addition or one arc removal in the E-space, namely the ENR neighborhood, satisfies the Inclusion Boundary Condition. The neighborhood created by the operators of one arc addition, one arc removal, or one arc reversal in the DAG space, which is the neighborhood formed in the GS algorithm, does not satisfy the Inclusion Boundary Condition. Paper [8] gives an approximation of the ENR neighborhood in the DAG space; when equivalence classes in the E-space are represented by CPDAGs, the operators producing the ENR neighborhood can be implemented directly on the CPDAG. However, in the GES algorithm, the neighborhoods formed in both the first and the second phase do not contain the Inclusion Boundary defined by Meek's conjecture. So the GES algorithm does not satisfy the Inclusion Boundary Condition, the score of the BN structure it identifies is not the highest, and its final score still has room for improvement. This is consistent with the conclusion that the BN structure identified by the GES algorithm may not be parameter-optimal.
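The inclusion order underlying Meek's conjecture is generated by covered arc reversals and arc removals, and testing whether an arc is covered reduces to a single parent-set comparison. A minimal sketch, assuming a DAG stored as a dict mapping each node to its parent set (our representation):

```python
def is_covered(x, y, parents):
    """An arc x -> y is covered when Pa(y) = Pa(x) ∪ {x}; reversing a
    covered arc stays inside the same Markov equivalence class, which is
    why covered arc reversal is a legal step in Meek's traversal."""
    return parents[y] == parents[x] | {x}
```

For the chain A → B, parents = {"A": set(), "B": {"A"}}, so is_covered("A", "B", parents) is True, and reversing the arc yields the equivalent chain B → A.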
2.4. Unrestricted GES algorithm

The GES algorithm uses the Insert(X, Y, T) operator to produce the neighborhoods in the first phase and the Delete(X, Y, H) operator to produce the neighborhoods in the second phase. The unrestricted GES algorithm (UGES algorithm) uses both the Insert(X, Y, T) operator and the Delete(X, Y, H) operator to produce the neighborhoods in every iteration. The neighborhood formed in the UGES algorithm asymptotically satisfies the Inclusion Boundary Condition, so we can predict that the score of the BN structure identified by the UGES algorithm is higher than that identified by the GES algorithm. But the neighborhood in the UGES algorithm is larger than that in the GES algorithm, so the UGES algorithm may be more time-consuming. In this paper, we conduct experiments comparing the UGES algorithm with the GES algorithm in terms of both time-efficiency and structure identification quality.

3 Algorithm Improvement

Both the GES algorithm and the UGES algorithm are time-consuming. Most of the running time is spent converting PDAGs to DAGs and computing the scoring function, and this time is related to the size of the neighborhood formed in each algorithm. We can find the parents-and-children set of each node using the constraint-based method, and confine the GES and UGES algorithms within those sets, thereby reducing the size of the neighborhood and mitigating the time-consuming drawback of both algorithms.

3.1. Max-Min Parents and Children algorithm

The Max-Min Parents and Children algorithm (MMPC algorithm) is the first local learning algorithm for discovering the parents-and-children sets of nodes. The MMPC algorithm discovers the parents-and-children set using a two-phase scheme. In phase I, the forward phase, variables sequentially enter a candidate parents-and-children set according to a heuristic function. In phase II, the backward phase, all false positives that entered in the first phase are removed; the remaining candidate set is the parents-and-children set. In the backward phase, the MMPC algorithm relies on CI tests, specifically the G2 likelihood ratio test, to remove false positives. In the forward phase, the MMPC algorithm measures the strength of association between a pair of variables using the negative p-value returned by the G2 likelihood ratio test.

3.2. The combination of the GES/UGES algorithm with the MMPC algorithm

The combination of the MMPC algorithm with the GS algorithm, namely the Max-Min Hill-climbing algorithm (MMHC algorithm), has already been described in the literature, and the experiments there show that MMHC is a promising algorithm that outperforms comparison algorithms such as the GS algorithm, the GES algorithm, and the BNPC algorithm[2] in running time, number of statistical calls, Bayesian score, and Structural Hamming Distance (SHD). This paper combines the GES algorithm with the MMPC algorithm (MMPC-GES algorithm) and the UGES algorithm with the MMPC algorithm (MMPC-UGES algorithm). The specific implementation of the combination is that when the GES/UGES algorithm uses the Insert(X, Y, T) operator to produce the neighborhood, X must be in the parents-and-children set of Y.
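This restriction can be read as a filter on the candidate (X, Y) pairs fed to the Insert operator. The sketch below shows only this filtering step, not the full operator with its subsets T and validity tests; the names and the dict-of-sets representation are our assumptions.

```python
def restricted_insert_pairs(nodes, adjacent, pc):
    """Candidate (X, Y) pairs for Insert(X, Y, T) when GES/UGES is confined
    by MMPC output: X and Y must be non-adjacent in the current CPDAG, and
    X must lie in pc[Y], the parents-and-children set MMPC found for Y."""
    for y in nodes:
        for x in pc[y]:                      # MMPC restriction: X in PC(Y)
            if x != y and x not in adjacent[y]:
                yield (x, y)
```

Since the parents-and-children sets of the ALARM network contain at most six nodes, this shrinks the pool of Insert candidates dramatically compared with scanning all 37 variables as candidates for X.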
We conduct experiments comparing the MMPC-GES algorithm and the MMPC-UGES algorithm with the GS/GES/UGES algorithms without the restriction of the parents-and-children sets, as well as with the MMHC algorithm, and draw some useful conclusions.

4 Experimental results and analysis

This paper uses datasets sampled from the ALARM network[13] for the comparison experiments. The ALARM network models a medical diagnostic system for patient monitoring. It contains 37 variables and 46 edges; each variable has two to four possible values, the maximum indegree is four, the maximum outdegree is five, and the sizes of the parents-and-children sets range from one to six. We randomly sampled 10 training datasets for each of the three sample sizes 500, 1000, and 5000. Each reported statistic is the average over the 10 runs of an algorithm on the 10 different datasets of a given sample size.

This paper uses three metrics to measure the quality of structure identification and the time-efficiency of the algorithms, namely Bayesian score, Structural Hamming Distance (SHD), and number of calls to the scoring function. We use the BDeu scoring function, with an equivalent sample size of ten and a network structure prior of one; higher BDeu scores are better. The SHD is the number of missing edges, extra edges, and wrongly oriented edges, including edges that are undirected in one graph and directed in the other, between two PDAGs. SHD does not penalize structural differences that cannot be statistically distinguished, which is why it is defined on PDAGs instead of DAGs; lower SHD is better. Since the running time depends on the computer configuration, CPU load, and so on, this paper does not use running time as a metric, and instead measures time-efficiency by the number of calls to the scoring function during the greedy search, which is proportional to the running time.

4.1. BDeu score results

We randomly sample one testing dataset of 5000 cases for each training dataset, and report the BDeu score as the average over the 10 runs of an algorithm on the 10 testing datasets of a given sample size. We calculate the score of the true ALARM network in the same way. The average BDeu scores for the three sample sizes are given in Table 1.

From Table 1, we can observe several regularities. Firstly, as the sample size increases, the BDeu score of the identified BN becomes higher. Secondly, the GES and UGES algorithms in the E-space achieve higher BDeu scores than the GS algorithm in the DAG space, since the greedy search in the E-space avoids the local maxima caused by the score-equivalence property of the BDeu scoring function in the DAG space. Thirdly, the UGES algorithm achieves a slightly higher BDeu score than the GES algorithm, since the UGES algorithm satisfies the Inclusion Boundary Condition asymptotically while the GES algorithm does not. Fourthly, when the GS/GES/UGES algorithms are restricted by the parents-and-children sets produced by the MMPC algorithm, the MMPC-GES algorithm performs the worst in BDeu score among the three algorithms MMHC, MMPC-GES, and MMPC-UGES, perhaps because its strict two-phase search strategy is hampered most by the restriction. The MMPC-UGES algorithm in the E-space still outperforms the MMHC algorithm in the DAG space in BDeu score.
Fifthly, overall, GS/GES/UGES with the MMPC restriction achieve lower BDeu scores than GS/GES/UGES without the restriction; that is, the BDeu score of GS/GES/UGES is reduced by the restriction to the parents-and-children sets. In particular, the BDeu score of the MMHC algorithm is lower than that of the GS algorithm, and the BDeu score of the GS algorithm does not exceed that of the true ALARM network, so there is no sign of overfitting. We should keep in mind that the parents-and-children sets identified by the MMPC algorithm may not be the exact parents-and-children sets of the true network, and their quality depends on the dataset.

Table 1: Average BDeu Score Results.

Sample size  True Alarm  GS         GES        UGES       MMHC       MMPC-GES   MMPC-UGES
500          -47892.85   -48742.44  -48425.9   -48422.26  -49538.36  -50381.35  -49002.71
1000         -47953.66   -48490.64  -48338.42  -48318.08  -49486.32  -50159.25  -48816.94
5000         -47536.78   -47941.26  -47865.43  -47835.32  -48994.44  -49008.54  -48765.58

Table 2: Average SHD Results. In the format A(B, C, D), A is the SHD, B is the number of extra edges, C is the number of missing edges, and D is the number of reversed edges.

Sample size  GS                     GES                    UGES                   MMHC                  MMPC-GES              MMPC-UGES
500          71.3(46.3, 3.4, 21.6)  59.2(43.9, 2.6, 12.7)  60.0(44.8, 2.5, 12.7)  30.8(7.6, 3.7, 19.5)  30.3(3.0, 8.5, 18.8)  29.6(8.2, 3.8, 17.6)
1000         64.6(39.2, 1.8, 23.6)  56.7(39.3, 1.7, 15.7)  55.3(38.7, 1.7, 14.9)  27.8(5.7, 2.3, 19.7)  24.5(1.5, 6.4, 16.6)  24.3(5.7, 2.3, 16.3)
5000         57.4(27.4, 0.9, 29.1)  43.7(23.4, 1.1, 19.2)  44.0(24.0, 0.8, 19.2)  23.4(4.3, 0.3, 18.8)  20.8(0.7, 0.1, 20.0)  25.9(5.1, 0.0, 20.8)

Table 3: Average Number of Calls to BDeu Scoring Function Results.

Sample size  GS        GES       UGES      MMHC    MMPC-GES  MMPC-UGES
500          105265.9  111524.5  118216.1  6073.1  1662      4379.4
1000         101082.2  114311.9  120510.4  5709.7  1527.7    3909.7
5000         94832.8   108326.8  118639.2  6117.1  2013.6    4322.4

4.2. SHD results

We calculated the SHD as the average over the 10 runs of an algorithm on the 10 training datasets of a given sample size, along with its three components: the number of extra edges, the number of missing edges, and the number of reversed edges, including edges that are undirected in one graph and directed in the other. The average SHD results for the three sample sizes are given in Table 2.

From Table 2, we can see that as the sample size increases, the SHD of the identified BN becomes smaller, as do the numbers of extra and missing edges, while the number of reversed edges does not show this trend. The GES and UGES algorithms perform better than the GS algorithm in SHD, but the gap between the GES and UGES algorithms is not obvious, nor are the gaps among MMHC, MMPC-GES, and MMPC-UGES. Overall, the SHD of GS/GES/UGES with the MMPC restriction is lower than that of GS/GES/UGES without the restriction; that is, the SHD of GS/GES/UGES, and especially the number of extra edges, is improved by the restriction to the parents-and-children sets.

4.3. Number of calls to BDeu scoring function results

We use the number of calls to the BDeu scoring function to measure the time-efficiency of the algorithms. Note that each call to the BDeu scoring function in the E-space entails one conversion from PDAG to DAG, and the time cost of one conversion is nearly the same as that of one call to the BDeu scoring function. The average numbers of calls to the BDeu scoring function for the three sample sizes are given in Table 3.
From Table 3, we can see that the GES/UGES algorithms make about ten percent more calls to the BDeu scoring function than the GS algorithm; moreover, the GES/UGES algorithms in the E-space need extra conversions between PDAG and DAG, so we can expect the running time of GES/UGES to be much longer than that of GS. The UGES algorithm makes about seven percent more calls than the GES algorithm, so the running time of UGES is expected to be slightly longer than that of GES. Besides, the number of calls in the MMPC-UGES algorithm is more than twice that in the MMPC-GES algorithm, so the running time of MMPC-UGES is expected to be about twice that of MMPC-GES. The MMHC algorithm makes about 40% more calls than the MMPC-UGES algorithm; considering the extra cost of PDAG/DAG conversions in MMPC-UGES, the running times of MMHC and MMPC-UGES should be almost the same. Overall, the number of calls to the BDeu scoring function in GS/GES/UGES with the MMPC restriction is lower than in GS/GES/UGES without the restriction; that is, the amount of computation is reduced by the restriction to the parents-and-children sets. The number of calls in the MMPC-GES algorithm is reduced the most, which also explains its worst BDeu score among MMHC, MMPC-GES, and MMPC-UGES.

5 Conclusions

In this paper, we use the Inclusion Boundary Condition to analyze the GES algorithm and the GS algorithm. We point out that the unrestricted form of the GES algorithm, which applies both the Insert operator and the Delete operator to produce the neighborhood, asymptotically satisfies the Inclusion Boundary Condition. Furthermore, we propose combinations of the GES/UGES algorithms with the local discovery algorithm MMPC, namely the MMPC-GES algorithm and the MMPC-UGES algorithm. We compare GS/GES/UGES/MMHC/MMPC-GES/MMPC-UGES on datasets sampled from the ALARM network. The experiments show that MMHC/MMPC-GES/MMPC-UGES require fewer calls to the BDeu scoring function and achieve better SHD than GS/GES/UGES without the restriction of the parents-and-children sets, while the BDeu score is reduced by the restriction. Considering the large improvements in time-efficiency and SHD, MMHC/MMPC-GES/MMPC-UGES are still compelling. Among the three restricted algorithms, MMPC-GES performs the worst and MMPC-UGES the best in BDeu score.

Finally, combining the constraint-based method, the Bayesian search-and-score method, and the Markov Equivalence Class space is promising. The problem of confining the greedy search can also be seen as the problem of directing an identified skeleton. Many constraint-based methods first identify the skeleton and then direct it using orientation rules. In some of our initial experiments with constraint-based algorithms, the skeleton identification was good, but the orientation of the skeleton was not, as with the BNPC algorithm. If we use the search-and-score method to direct the skeleton identified by the constraint-based method, the final BN structure is improved.

Acknowledgments: This work is supported by the National Basic Research Program of China (973 Program) with Grant No.
2011CB706900, the National Natural Science Foundation of China (Grant No. 70971128), the Beijing Natural Science Foundation (Grant No. 9102022), and the President Fund of GUCAS (Grant No. O95101HY00).

6 References and Notes

[1] P. Spirtes, C. N. Glymour, and R. Scheines. "Causation, Prediction, and Search". The MIT Press, 2000.
[2] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu. "Learning Bayesian Networks from Data: An Information-theory Based Approach"; Artificial Intelligence, Vol. 137, Issue 1, 43-90, 2002.
[3] G. F. Cooper and E. Herskovits. "A Bayesian Method for the Induction of Probabilistic Networks from Data"; Machine Learning, Vol. 9, Issue 4, 309-347, 1992.
[4] H. Xu, H. Yu, and J. Wang. "Poison Identification Based on Bayesian Method in Biochemical Terrorism Attacks"; Advanced Science Letters, Vol. 5, 1-5, 2012.
[5] D. M. Chickering. "Optimal Structure Identification with Greedy Search"; Journal of Machine Learning Research, Vol. 3, 507-554, 2002.
[6] D. M. Chickering and C. Meek. "Finding Optimal Bayesian Networks". Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, Pages 94-102, 2002.
[7] T. Kocka, R. Bouckaert, and M. Studeny. "On the Inclusion Problem". Technical Report, Academy of Sciences of the Czech Republic, 2001.
[8] R. Castelo and T. Kočka. "Towards an Inclusion Driven Learning of Bayesian Networks". Technical Report CS-2002-05, 2002.
[9] N. Friedman, I. Nachman, and D. Pe'er. "Learning Bayesian Network Structure from Massive Datasets: The Sparse Candidate Algorithm". Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
[10] N. Friedman, M. Linial, I. Nachman, and D. Pe'er. "Using Bayesian Networks to Analyze Expression Data"; Journal of Computational Biology, Vol. 7, Issue 3, 601-620, 2000.
[11] I. Tsamardinos, L. Brown, and C. Aliferis. "The Max-Min Hill-climbing Bayesian Network Structure Learning Algorithm"; Machine Learning, Vol. 65, Issue 1, 31-78, 2006.
[12] I. Tsamardinos, C. F. Aliferis, and A. Statnikov. "Time and Sample Efficient Discovery of Markov Blankets and Direct Causal Relations". Technical Report DSL-03-02, Vanderbilt University, 2003.
[13] I. A. Beinlich, H. J. Suermondt, R. M. Chavez, and G. F. Cooper. "The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks". Proceedings of the Second European Conference on Artificial Intelligence in Medicine, Pages 247-256, London, 1989.
[14] K. P. Murphy. The Bayes Net Toolbox for Matlab. http://code.google.com/p/bnt/
[15] P. Leray. The BNT Structure Learning Package. http://bnt.insa-rouen.fr/index.html
[16] C. Aliferis, I. Tsamardinos, A. Statnikov, and L. E. Brown. "Causal Explorer: A Causal Probabilistic Network Learning Toolkit for Biomedical Discovery". METMBS, 371-376, 2003.
[17] T. Verma and J. Pearl. "Equivalence and Synthesis of Causal Models". Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, Pages 220-227, 1991.
[18] D. Heckerman, D. Geiger, and D. Chickering. "Learning Bayesian Networks: The Combination of Knowledge and Statistical Data"; Machine Learning, Vol. 20, Issue 3, 197-243, 1995.
[19] D. Dor and M. Tarsi. "A Simple Algorithm to Construct a Consistent Extension of a Partially Oriented Graph". Technical Report R-185, 1992.
[20] D. M. Chickering. "Learning Equivalence Classes of Bayesian-Network Structures"; Journal of Machine Learning Research, Vol. 2, 445-498, 2002.
[21] M. Yannakakis. "Computing the Minimum Fill-in is NP-complete"; SIAM Journal on Algebraic and Discrete Methods, Vol. 2, Issue 1, 77-79, 1981.
[22] U. Kjærulff. "Triangulation of Graphs - Algorithms Giving Small Total State Space". Technical Report R-90-09, 1990.
[23] C. Huang and A. Darwiche. "Inference in Belief Networks: A Procedural Guide"; International Journal of Approximate Reasoning, Vol. 15, Issue 3, 225-263, 1996.
[24] C. Meek. "Graphical Models: Selecting Causal and Statistical Models". PhD Thesis, Carnegie Mellon University, 1997.
[25] J. I. Alonso-Barba, L. de la Ossa, J. A. Gámez, and J. M. Puerta. "Scaling up the Greedy Equivalence Search Algorithm by Constraining the Search Space of Equivalence Classes". Proceedings of the 11th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Pages 194-205, 2011.