Performance Analysis of an Acyclic Genetic Approach to Learn Bayesian Network Structure (Student Paper)

Pankaj B. Gupta(1) and Vicki H. Allan(2)

(1) Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA, [email protected]
(2) Computer Science Department, Utah State University, Logan, UT 84322, [email protected]

Abstract. We introduce a new genetic algorithm approach for learning a Bayesian network structure from data. Our method is capable of learning over all node orderings and structures. Our encoding scheme is inherently acyclic and is capable of performing crossover on chromosomes with different node orders. We present an analysis of this approach using different Bayesian networks such as ASIA and ALARM. Results suggest that the method is effective. The tests we perform include varying the population size of the genetic algorithm, restricting the maximum number of parents a node can have, and learning with a fixed node order.

Keywords: Structure Learning, Bayesian Networks, Genetic Algorithms.

1 Introduction

Bayesian networks are probabilistic networks capable of representing causal relationships [6]. They are directed acyclic graphs with nodes representing the variables of a problem and edges representing the causal relations between the variables. Each node has a conditional probability distribution in which the parents of the node are the conditioning variables. Learning the structure and learning the probabilities of the nodes are treated as two separate problems, of which the former is considered much more challenging. In this research, we consider the problem of learning the structure of a Bayesian network.

Genetic algorithms are evolutionary algorithms for solving problems with a large solution space [14]. A possible solution of the problem is termed a chromosome. An initial population of chromosomes is generated randomly, and each chromosome is evaluated according to a fitness function.
Two chromosomes are chosen from the population at random and crossed over to produce two daughter chromosomes. The population is then truncated to its original size by discarding the worst chromosomes. This process is repeated until a quality chromosome is generated.

In the past few years, Bayesian networks have become important in the field of artificial intelligence. Learning the structure of a Bayesian network from data is a challenging problem, known to be NP-hard [2, 7]. Since the solution space is large, genetic algorithms are a natural fit. Previous attempts at using genetic algorithms to learn Bayesian network structure either assume a pre-defined topological ordering of the nodes or incur the overhead of dealing with cyclic networks. We present a genetic algorithm technique to learn Bayesian network structure that does not assume any initial node ordering and inherently preserves the acyclic nature of the graph over crossover and mutation [5]. This results in an effective method which derives the structure without throwing away important information. We present a detailed analysis of our method.

2 Previous Work

The approach of using genetic algorithms to learn Bayesian network structure was introduced by Larranaga et al. [9]. Their tests show that it is a challenging area of research [10]. Their method generates illegal cyclic Bayesian networks; to handle the illegal networks, either a pre-existing node order is assumed or cycle-causing arcs are deleted. Myers et al. use genetic algorithms to learn the structure of Bayesian networks from incomplete data [12]. They handle the cyclic Bayesian networks generated in the process by assigning them a low score. Guo et al. use a genetic algorithm to tune the node ordering of a Bayesian network [4]; the node ordering serves as input to the K2 algorithm [3]. Our approach is different because the acyclic property is inherent to our encoding scheme.
We do not assume any pre-defined node ordering and do not have to deal with cyclic networks. In addition, we do not lose information during the learning process by deleting arcs required to keep the network acyclic. Our crossover function is a closed operation.

3 Scoring Metric

Given a data set D and a learned Bayesian network B_S, P(B_S | D) represents the posterior probability that B_S is learned from D. Using Bayes' theorem it can be written as:

    P(B_S | D) = \frac{P(D | B_S) P(B_S)}{P(D)}

P(D | B_S) represents the probability that data D is generated from B_S and is known as the marginal likelihood function. P(B_S) can be assumed to be constant if no prior information about the structure is available. P(D) is a normalizing constant and is independent of the network. Thus, P(D | B_S) can be used as a scoring metric.

Let n be the number of nodes of a Bayesian network. Assume that node i has s_i different states, \pi_i denotes the parent set of i, and p_i denotes the number of different states (configurations) of the parent set \pi_i. Cooper et al. [3] showed that the following function, known as the Bayesian Dirichlet metric, can be used as the marginal likelihood function:

    P(D | B_S) = \prod_{i=1}^{n} \prod_{j=1}^{p_i} \frac{\Gamma(N'_{ij})}{\Gamma(N_{ij} + N'_{ij})} \prod_{k=1}^{s_i} \frac{\Gamma(N_{ijk} + N'_{ijk})}{\Gamma(N'_{ijk})}

N_{ijk} is the number of instances in the data set such that node i is in its k-th state and its parent set \pi_i is in its j-th state, and N_{ij} = \sum_k N_{ijk}. N'_{ijk} is a hyper-parameter that may be set to 1 if no prior information about the structure of the Bayesian network is available, with N'_{ij} = \sum_k N'_{ijk}.

Since there are many multiplication operations in the Bayesian Dirichlet function, the logarithm of the Bayesian Dirichlet metric (LogBD) is often used as a scoring metric. All the research described in section 2 uses LogBD as the scoring metric, and we do as well. The LogBD of a Bayesian network is the sum of the LogBD values of each node of the network, and the LogBD of a node depends only on its parent set.
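To make the per-node decomposition concrete, here is a minimal sketch of a single node's LogBD score under the uniform prior N'_ijk = 1 (so N'_ij = s_i). The data representation and all names are our own illustration, not the paper's code:

```python
import math
from collections import Counter

def log_bd_node(data, node, parents, states):
    """LogBD score of one node, with hyper-parameters N'_ijk = 1.

    data    : list of dicts mapping variable name -> observed state
    node    : name of the node being scored
    parents : tuple of parent names (the set pi_i)
    states  : dict mapping variable name -> list of its possible states
    """
    s_i = len(states[node])  # s_i: number of states of the node
    # N_ijk: rows where the parents take config j and the node takes state k
    n_ijk = Counter((tuple(r[p] for p in parents), r[node]) for r in data)
    # N_ij: rows where the parents take config j, summed over k
    n_ij = Counter(tuple(r[p] for p in parents) for r in data)
    score = 0.0
    for j, nij in n_ij.items():
        # log Gamma(N'_ij) - log Gamma(N_ij + N'_ij), with N'_ij = s_i
        score += math.lgamma(s_i) - math.lgamma(nij + s_i)
        for k in states[node]:
            # log Gamma(N_ijk + 1) - log Gamma(1), and log Gamma(1) = 0
            score += math.lgamma(n_ijk[(j, k)] + 1)
    return score
```

Because the score of a node depends only on its parent set, results of such a function can be cached per (node, parent set) pair, which is exactly the saving noted above.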
If a node has the same parents in two different Bayesian networks, the LogBD score of that node is the same in both networks. This proves beneficial, as we do not recalculate the score of a node with an unchanged parent set multiple times during the learning process.

4 Our Algorithm

4.1 The Encoding Scheme

Our chromosome defines both a Bayesian network and a topological order on the nodes of the Bayesian network [5]. Associating a node order with a chromosome has two advantages: first, by crossing over chromosomes with different node orders, our algorithm learns the optimal node ordering; second, the acyclic nature is preserved over operations like crossover and mutation.

Consider a Bayesian network of size n. The corresponding encoded chromosome is a sequence of n genes, each corresponding to a node of the Bayesian network. Thus, the sequence of n genes also represents the node order associated with the chromosome. For any node, the nodes preceding it in the topological order are its candidate parents. Each gene has a set of booleans describing which of the candidate parents of its node are the actual parents.

4.2 Crossover Function

Our O(n^2) crossover function ensures that the daughter chromosomes produced are acyclic. The crossover function can be thought of as a two-step process: producing node orders and assigning parents. Let the two parent chromosomes be A and B, and the two daughter chromosomes be X and Y.

Step 1: Producing Node Orders. Based on a random crossover point, the node order of A is split into two parts, a1 and a2. The node order of B is divided into two parts b1 and b2, such that b1 has the same nodes as a1 and b2 has the same nodes as a2, each preserving the relative order within B. a1 is concatenated with b2 to generate the node order for X, and b1 is concatenated with a2 to generate the node order for Y.
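Step 1 can be sketched as follows (a minimal illustration in which node orders are plain lists; the function name is ours):

```python
import random

def split_node_orders(order_a, order_b, cut=None):
    """Step 1 of the crossover: derive the daughters' node orders.

    order_a, order_b : permutations of the same node set (parents A and B)
    cut              : crossover point; chosen at random if not given
    """
    if cut is None:
        cut = random.randrange(1, len(order_a))
    a1, a2 = order_a[:cut], order_a[cut:]  # split A at the crossover point
    front = set(a1)
    # b1/b2 hold the same nodes as a1/a2, but in the order they appear in B
    b1 = [n for n in order_b if n in front]
    b2 = [n for n in order_b if n not in front]
    return a1 + b2, b1 + a2                # node orders for X and Y
```

For example, crossing orders ABCD and DCBA at cut point 2 yields the daughter orders ABDC and BACD.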
This step enables our learning algorithm to cross over chromosomes with different node orders to learn the optimal one, and allows the algorithm to span the entire set of node orders.

Step 2: Producing Parent Information. In producing parent information, we generate two different sets of parents for X and Y. It is decided randomly which set of daughter chromosomes (method M1 or M2) is actually produced. Arcs in the parent chromosomes are transferred to the daughters, either directly or with reversed direction. For X, the parent relations among the nodes of a1 (or b2) are directly transferred from A (or B). Similarly for Y, the parent relations among the nodes of b1 (or a2) are directly transferred from B (or A). If method M1 is used, the parent relation between a node of a1 and a node of b2 is transferred from A to X, whereas the relation between a node of b1 and a node of a2 is transferred from B to Y. In this case, the direction of a relation transferred from B to Y may be reversed, as the daughter's node order differs from that of the parent. In method M2, the roles of A and B are interchanged.

For every crossover, it is important to choose one of the two methods at random. If only one method were used for every crossover, our learning algorithm would not span the entire solution space.

It should be noted that each parent arc of the parent chromosomes is transferred to exactly one of the two daughter chromosomes; the total number of edges in the two parents equals that in the two daughters. In a daughter, the nodes originating from the same parent preserve the dependence relations of that parent. This ensures that our crossover function preserves the goodness of the chromosomes, which proves particularly beneficial when one parent chromosome has captured the dependence among a few nodes and the other parent has captured the dependence among the remaining nodes.
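Assuming a chromosome is represented as a node order plus a set of (parent, child) arcs, method M1 of step 2 can be sketched as below (function and variable names are ours). Arcs whose direction disagrees with a daughter's node order are reversed rather than deleted, following the paper's approach:

```python
def crossover_edges_m1(a1, order_x, order_y, edges_a, edges_b):
    """Step 2, method M1: distribute the parents' arcs to the daughters.

    a1               : nodes taken from the front of parent A's order
    order_x, order_y : daughter node orders a1+b2 and b1+a2 (from step 1)
    edges_a, edges_b : sets of (parent, child) arcs of chromosomes A and B
    Every arc of A and B is transferred to exactly one daughter.
    """
    front = set(a1)
    edges_x, edges_y = set(), set()
    for u, v in edges_a:
        if u in front and v in front:
            edges_x.add((u, v))       # arcs within a1 go to X, from A
        elif u not in front and v not in front:
            edges_y.add((u, v))       # arcs within a2 go to Y, from A
        else:
            edges_x.add((u, v))       # cross arcs of A go to X under M1
    for u, v in edges_b:
        if u in front and v in front:
            edges_y.add((u, v))       # arcs within b1 go to Y, from B
        elif u not in front and v not in front:
            edges_x.add((u, v))       # arcs within b2 go to X, from B
        else:
            edges_y.add((u, v))       # cross arcs of B go to Y under M1

    def orient(order, edges):
        # reverse any arc that violates the daughter's topological order
        pos = {n: i for i, n in enumerate(order)}
        return {(u, v) if pos[u] < pos[v] else (v, u) for u, v in edges}

    return orient(order_x, edges_x), orient(order_y, edges_y)
```

Method M2 is the same sketch with the roles of A and B interchanged.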
Cycle formation has been a problem in previous attempts at structure learning using genetic algorithms. Since we have a node order associated with our daughter chromosomes, we prevent cycles by reversing the direction of cycle-forming edges. By doing so, we ensure that we do not lose any parent information by deleting edges. Steck uses the concept of edge reversal in his learning algorithm [13]. Though reversing an edge between two nodes may not represent the exact relationship between them as before, it does preserve some dependence relation between the two.

The other possible approach is to delete the cycle-forming edges; Larranaga [9] uses this approach. To investigate which of the two approaches is better, we modify our method to delete the cycle-forming edges instead of reversing them. We use a sample data set of 1000 entries to learn the structure of the ASIA [11] network using our method and the modified method. Figure 1 shows the percentage of times the optimal ASIA network was learned. From the figure, it is evident that reversing an edge is a better approach than deleting it.

Fig. 1. Comparison of two approaches to prevent cycle formation while learning the structure of the ASIA network. In one approach, potentially dangerous edges are reversed, whereas in the other they are deleted.

4.3 Mutations

Mutations are an integral part of genetic algorithms. Our encoding scheme ensures that the acyclic nature of the chromosomes is preserved over mutations. For each node and each piece of parent information in a chromosome, a random number between 0 and 1 is generated; if the number is less than the mutation probability, the node or parent information is mutated. A node is mutated by swapping it with another randomly chosen node of the chromosome. A parent bit is mutated by toggling its value.
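The mutation step can be sketched as follows (an in-place sketch over our illustrative representation, in which `parent_bits[i]` holds one boolean per candidate parent, i.e. per node preceding position i):

```python
import random

def mutate(order, parent_bits, p_mut=0.05, rng=random):
    """Mutate a chromosome in place.

    Acyclicity is preserved: only the order permutation and the positional
    parent booleans change, so parents always precede their children.
    """
    n = len(order)
    for i in range(n):
        if rng.random() < p_mut:                 # node mutation:
            j = rng.randrange(n)                 # swap with a random node
            order[i], order[j] = order[j], order[i]
        for k in range(len(parent_bits[i])):
            if rng.random() < p_mut:             # parent mutation:
                parent_bits[i][k] = not parent_bits[i][k]   # toggle a bit
```

Both mutations keep the chromosome well-formed: the order stays a permutation and each gene keeps one boolean per candidate parent.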
5 Enriching Initial Population

We use the term spanning population to mean that, through crossover, the initial population is capable of producing every chromosome in the solution space. A spanning population is important for the success of a genetic algorithm. We introduce two functions: InvertOrder and ToggleParents. InvertOrder(A) generates a chromosome whose node order is the reverse of that of the input chromosome A; the sets of booleans (representing the parent information) in the genes of A and InvertOrder(A) are identical. InvertOrder is required for our algorithm to be able to iterate over all node orders in the solution space. ToggleParents(A) toggles the parent information of each gene to produce a new chromosome. ToggleParents is important to produce every possible parent configuration for a given node order.

We generate our initial population as follows. Randomly generated chromosomes are added to the population. To make the population rich, for every random chromosome A, we also add InvertOrder(A), ToggleParents(A), and ToggleParents(InvertOrder(A)). This ensures that, using our crossover function, every chromosome in the solution space is derivable from our initial population.

6 Results

6.1 Data Generation

We use the Logic Sampling method to generate a data set [8]. Given a Bayesian network and a set of conditional probability tables, a state value for each root node is first generated using a random state generator weighted by the probabilities of the different states. Once the states of all root nodes are determined, the same operation is performed for the nodes whose parents' states have been determined. The process is repeated until a state has been instantiated for each node in the network, which yields one data entry. The whole process is repeated to generate a data set of the desired size.

6.2 Testing Procedure

1. We choose a Bayesian network (network structure + probability tables).
2. The Logic Sampling method is used to generate data.
3. The genetic algorithm approach is used to learn the structure of the Bayesian network. We generate an enriched population of size p as described in section 5. In each iteration of the genetic algorithm, we produce p daughter chromosomes. Based on the Bayesian Dirichlet score function, the worst p chromosomes (those with the lowest scores) among the union of daughters and parents are discarded.
4. The best network generated is compared with the optimal network. The term optimal network denotes the original network, unless a better network is learned over the span of all the tests performed.

We test our algorithm on Bayesian networks of different sizes. To discuss our results, we use the following Bayesian networks: ASIA [11], DualASIA, DualASIA+, SubALARM and ALARM [1]. ASIA and ALARM are commonly used for Bayesian analysis. ASIA is a small network with 8 nodes and 8 edges that calculates the probability of a patient having diseases such as tuberculosis, lung cancer and bronchitis. ALARM, a large network with 37 nodes and 46 edges, is a prototype that models potential anesthesia problems in an operating room. To be able to analyze our algorithm on intermediate-size Bayesian networks, we create DualASIA, DualASIA+ and SubALARM. DualASIA, a 16 node and 16 edge network, consists of two identical ASIA networks. DualASIA+ is similar to DualASIA, but has 4 additional edges connecting the two ASIA networks. Though it may seem that ASIA, DualASIA and DualASIA+ would produce similar results, their behavior is actually different; they represent different Bayesian networks. SubALARM is a 24 node subgraph of the ALARM network. Using our algorithm, we are able to learn the optimal structure of all networks we test.

6.3 Test: Effect of Varying Population Size

Fig. 2. The effect of varying population size on the quality of structures learned.
The best and average scores learned are plotted in the graph. Beside each data entry in the graph, we show the Hamming Distance (HD) / Reverse Edge Count (RevCt) values.

Initially we test our algorithm by varying the population size of the genetic algorithm. We present the effects of varying the population size on the quality of Bayesian networks learned in the case of the DualASIA network. We create a data set of 1000 entries. The optimal network for the data set is learned by performing the different tests described in this section; it has a better score (-7728.02) than the original network (-7739.68) and differs from the original network in 2 edges.

Figure 2 shows the effect of varying the population size on the quality of the networks learned. The average is calculated by repeating the test 50 times for each population size. A mutation probability of 0.05 is used and the stopping condition is set at 250 iterations. Beside each data entry in the graph, we present the Hamming Distance and the Reverse Edge Count between the learned network and the optimal network. The Hamming Distance (HD) between two networks is the number of edges that are not common to both networks; HD does not include edges that differ only in direction, whose count is the Reverse Edge Count (RevCt).

In figure 2, we observe that for low population sizes the difference between the average score and the best score learned is large. As the population size increases, the average score and the best score converge to the optimal score value; increasing the population size further does not produce any more improvement. This indicates that as the population size increases, the probability of learning a good Bayesian network approaches 1. Similarly, as the population size increases, HD and the average HD converge to the optimal value of 0. RevCt tends to improve as the population size increases, but the improvement is not consistent.
This is because reversing an edge does not always have an adverse effect on the relationships encoded by a Bayesian network.

6.4 Test: Random Population vs. Enriched Population

As mentioned in section 5, the initial population is enriched by adding three variants of each random chromosome present in the initial population. To demonstrate the positive effects of enriching a population, we generate a random population of size p1 and perform the learning process with it. The population is then enriched, and the learning process is repeated. We compare the results obtained from the two learning processes. It should be noted that the size of the enriched population (4*p1) is 4 times the size of the random population (p1). To compare the results on a fair basis, in each iteration of the latter learning process we generate only p1 (instead of 4*p1) daughters. Mutations are not used for this test, as they help the learning algorithm span the whole solution space and would thereby reduce the negative effects of not enriching a population. It is observed that better networks are learned with an enriched population.

Table 1. The results of learning the DualASIA+ network using a Random Population vs. an Enriched Population (data set size = 1000 entries, iterations = 500, optimal network score = -6929.96).

                          Random Population   Enriched Population
                          p = 25              p = 25 (random) + 75 (variants)
    Best Network Score    -7009.84            -6929.96
    HD / RevCt            10 / 3              0 / 0
    Ave. Score            -7110.37            -6964.22
    Ave. HD / Ave. RevCt  15.95 / 6.5         3.8 / 2.75

Table 1 shows the results for the DualASIA+ network in this test. Average values are calculated by repeating the experiment 50 times. We are able to learn the optimal DualASIA+ when the enriched population is used, whereas with the random population the best network learned has an HD value of 10. Considering that DualASIA+ is a network with 20 edges, an HD value of 10 is high. Even the average values in the case of the random population are worse.
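The HD and RevCt measures reported in these tests can be sketched as follows (a minimal illustration over arc sets; the names are ours):

```python
def hd_and_revct(learned, optimal):
    """Hamming Distance and Reverse Edge Count between two networks.

    learned, optimal : sets of (parent, child) arcs
    HD counts edges present in only one undirected skeleton; edges present
    in both skeletons but with opposite direction count toward RevCt.
    """
    skeleton = lambda arcs: {frozenset(e) for e in arcs}
    hd = len(skeleton(learned) ^ skeleton(optimal))   # symmetric difference
    revct = sum(1 for u, v in learned
                if (v, u) in optimal and (u, v) not in optimal)
    return hd, revct
```

An HD of 0 and RevCt of 0 therefore means the learned and optimal networks are identical.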
In the previous test, the size of the enriched population is much larger than that of the random population. In this test, we compare the two processes using populations of the same size. Figure 3 shows the results of this test for the ASIA network. It is observed that enriching the population yields better results for low population sizes; for larger populations, both processes yield similar results. We hypothesize the reason as follows. The process of enriching a population provides four different parent configurations for each node of a random chromosome, and starting with those four configurations, a node is theoretically able to generate any possible parent configuration. In the case of a large population, each node already has a sufficient number of random parent configurations, so enriching the population does not help any further.

Fig. 3. Comparing ASIA networks learned when Random Populations and Enriched Populations of the same sizes are used.

For larger networks, such as ALARM, we do not see much improvement in the learning process even with low population sizes. The reason is that, with low population sizes on larger networks, the learning algorithm does not come anywhere near an optimal solution; multiple executions of the learning process generate inconsistent results that are difficult to compare. From this test, we learn that enriching the population does not produce any adverse effects. In a few cases it helps improve the learning process, whereas in others the initial population is already rich enough that enriching it has little effect.

6.5 Test: Using both Methods (M1 and M2) vs. Using One Method

As explained in section 4.2, in crossover one of the two methods (M1 and M2) is randomly chosen to generate daughter chromosomes. It is imperative that both methods be used in the learning process.
It can be mathematically proven that the learning algorithm requires both methods to be able to span the whole solution space [5]. To demonstrate this experimentally, we run our learning algorithm with only method M1 and compare the results with those generated by using both methods, keeping all remaining parameters constant. Mutations are not used for this particular test, as they help the learning algorithm span the whole solution space and would thereby reduce the negative effects of using only method M1. It is observed that the quality of the learned Bayesian networks deteriorates when only method M1 is used. Table 2 shows the results of this experiment for the SubALARM network. It can be seen that when both methods are used, the best networks learned have scores very close to that of the original network, whereas when only method M1 is used, the networks learned have a much lower score and a much higher HD.

Table 2. The effect of learning SubALARM using only method M1 vs. using both methods. The experiment is repeated 10 times. p is the population size and i is the number of iterations.

                  Data Size 1000 (Original BN Score: -6628)   Data Size 10000 (Score: -63739)
                  p=400, i=350        p=1200, i=500           p=1200, i=500
    Method        M1 Only   Both      M1 Only   Both          M1 Only   Both
    Best Score    -6772     -6637     -6699     -6620         -63922    -63746
    HD            47        14        31        7             28        2

6.6 Test: Restricting Maximum Number of Parents

Many researchers restrict the maximum number of parents that a node can have. To understand how our algorithm performs with this restriction, we perform the following test. We compare the results of 4 different situations: No Reduction (NR), Random Reduction (RR), Exact Smart Reduction (ESR) and Atmost Smart Reduction (ASR). In NR there is no restriction on the number of parents a node can have (this is our original algorithm), whereas in all the other methods each node can have a maximum of q parents.
Crossover and mutation operations may produce networks with nodes that have more than q parents. For such nodes, the parents in excess of q are deleted. In RR, the parents to be deleted are chosen at random. In ESR, the Dirichlet score is calculated for each q-parent subset of the original parent set; the subset with the highest score is chosen as the new parent set of the node and the remaining parent relations are deleted. In ASR, the best subset of size q or less is chosen.

Fig. 4. Comparison between the NR, RR, ESR and ASR methods for the ASIA network with the maximum parent restriction set to 4. Each experiment is repeated 100 times for various population sizes.

For each of the four methods, we present the results of learning the ASIA network with the maximum number of parents restricted to 4 (figure 4) and to 2 (figure 5). It is interesting to note that in figure 4 the results from all the methods are similar. Since ASIA is an 8 node network, any node can have a maximum of 7 parents, and the ASIA network itself has nodes with a maximum of 2 parents. Even after restricting the parents to a maximum of 4, each node has a sufficient number of parent configurations to be able to learn an optimal parent configuration. Thus, all methods yield similar results.

Figure 5 is even more interesting. We see that the quality of learning with ESR and ASR is reduced, whereas RR maintains its learning quality. We hypothesize the reason as follows. In ESR and ASR, we try to find the best two-element subset of the parents of a node. Thus, the genetic algorithm tends to converge before a sufficient number of parent configurations is generated by crossover and mutation operations. This approach prohibits a node from having parent configurations that are not very good in themselves, but that may produce good results when crossed over with other configurations. In ESR and ASR, the learning process tends to greedily reach a local maximum.
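The ESR/ASR reduction step can be sketched as follows (a minimal illustration; `score` stands for any per-node scoring function such as LogBD, and all names are ours):

```python
from itertools import combinations

def smart_reduce(node, parents, q, score, atmost=False):
    """ESR/ASR: replace an oversized parent set by its best-scoring subset.

    score(node, subset) : per-node scoring function, e.g. LogBD
    atmost=False        : ESR, consider subsets of exactly q parents
    atmost=True         : ASR, consider subsets of q or fewer parents
    """
    if len(parents) <= q:
        return set(parents)          # no reduction needed
    sizes = range(q + 1) if atmost else [q]
    subsets = (frozenset(s) for r in sizes
               for s in combinations(sorted(parents), r))
    return set(max(subsets, key=lambda s: score(node, s)))
```

RR, by contrast, would simply draw a random size-q subset, which is cheaper and, as the results below show, avoids the premature convergence of the score-driven variants.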
Thus, the quality of ESR and ASR is reduced when compared to the other methods. The learning quality of RR is similar to that of NR, and RR outperforms NR for low population sizes. This is because NR searches a much larger solution space than RR; with low population sizes, the probability that NR is unable to produce an optimal structure is higher. As the population size increases, NR produces results similar to those of RR.

Fig. 5. Comparison between the NR, RR, ESR and ASR methods for the ASIA network with the maximum parent restriction set to 2. Each experiment is repeated 100 times.

6.7 Test: Using a Fixed Node Order

Learning Bayesian networks with a fixed node order is common in previous research. Not only does it reduce the solution space by a large factor, it also removes the problem of dealing with cyclic networks formed by crossover and mutation operations. Our encoding scheme is inherently acyclic, but using a fixed node order does reduce the solution space of our learning algorithm. In our algorithm, crossing over chromosomes with identical node orders produces daughters with the same node order. Thus, it is easy for us to test our algorithm with a pre-defined node order. We present the results for the ALARM network in this situation.

In the previous tests, we are able to learn the optimal solution for all networks except ALARM. For this test, we create three different data sets of sizes 1000, 5000 and 10000. In this test, we are able to learn an optimal ALARM network for each data set. Figure 6 shows a comparison between the ALARM networks learned with and without a fixed node order.

Fig. 6. Learning the ALARM network using our original algorithm vs. using a Fixed Node Order (FNO). Beside each data point, we show the score / HD value. The maximum number of parents is limited to 4 using the RR method. A data set of size 1000 is used. The score of the original ALARM network for this data set is -10003.
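Putting the pieces together, the truncation-selection loop of section 6.2 (breed p daughters per iteration, keep the best p of the parent/daughter union) can be sketched generically as follows; the `fitness`, `crossover`, and `mutate` callables are placeholders for the operators described in this paper, not its exact implementation:

```python
import random

def ga_loop(population, fitness, crossover, mutate, iterations):
    """Generic truncation-selection genetic algorithm loop.

    Each iteration breeds p daughters from random parent pairs, mutates
    them, and keeps the best p chromosomes of parents + daughters.
    """
    p = len(population)
    for _ in range(iterations):
        daughters = []
        while len(daughters) < p:
            a, b = random.sample(population, 2)   # pick two parents at random
            daughters.extend(crossover(a, b))     # each crossover yields two
        daughters = [mutate(d) for d in daughters]
        population = sorted(population + daughters,
                            key=fitness, reverse=True)[:p]  # truncate
    return max(population, key=fitness)
```

The same skeleton applies whether the chromosomes are toy values or the order-plus-parent-bits encoding of section 4.1.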
7 Conclusions

We have presented a new technique for using genetic algorithms to learn Bayesian network structure. Our encoding scheme is inherently acyclic, and this property is easily preserved during the learning process. Unlike some previous attempts, our algorithm does not have the overhead of dealing with cycles formed by crossover and mutation operations; we do not have to remove random edges, and thereby lose information, to prevent the generation of cycles in the Bayesian networks. For the structure learning process, we do not have to assume any pre-defined variable ordering. Our algorithm learns the optimal node ordering and the optimal network simultaneously.

Results show that our algorithm is able to learn optimal structures of various Bayesian networks such as ASIA and ALARM. As the population size increases, the probability that our learning algorithm produces an optimal network approaches 1. Enriching the initial population improves the quality of the learning algorithm. However, if random and enriched populations of the same size are compared, we see an improvement only for small networks and small population sizes. Using both methods M1 and M2 is imperative to our learning algorithm: if only one of the two methods is used, the learning quality deteriorates, as the algorithm is unable to span the whole solution space. Restricting the number of parents to a maximum value reduces the solution space of the learning algorithm; the Random Reduction (RR) method produces the best results in all situations. Our algorithm can benefit if a pre-existing node order is available: by using a pre-defined node order, we are able to improve performance, as shown in learning the optimal ALARM network.

References

1. I. A. Beinlich, H. J. Suermondt, R. M. Chavez and G. F. Cooper (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks.
Proceedings of the Second European Conference on Artificial Intelligence in Medicine.
2. D. M. Chickering, D. Geiger, and D. Heckerman (1994). Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research.
3. G. F. Cooper and E. Herskovits (1992). A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, 9(4):309-347.
4. H. Guo, B. B. Perry, J. A. Stilson, W. H. Hsu (2002). A Genetic Algorithm for Tuning Variable Orderings in Bayesian Network Structure Learning. Student Abstract, AAAI-2002.
5. P. B. Gupta, V. H. Allan (2003). The Acyclic Bayesian Net Generator. 1st Indian International Conference on Artificial Intelligence, IICAI-03.
6. D. Heckerman (1995). A Tutorial on Learning Bayesian Networks. Technical Report MSR-TR-95-06, Microsoft Research.
7. D. Heckerman, D. Geiger, and D. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243.
8. M. Henrion (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Uncertainty in Artificial Intelligence 2, pages 149-163. Elsevier Science Publishing Company, New York, NY.
9. P. Larranaga, M. Poza, Y. Yurramendi, R. H. Murga, C. M. H. Kuijpers (1996a). Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance Analysis of Control Parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):912-926.
10. P. Larranaga, R. Murga, M. Poza, C. Kuijpers (1996b). Structure Learning of Bayesian Networks by Hybrid Genetic Algorithms. In D. Fisher and H. J. Lenz (eds.), Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes in Statistics 112, Springer-Verlag, New York, NY, 165-174.
11. S. L. Lauritzen and D. J. Spiegelhalter (1988). Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society, 50(2):157-224.
12. J. W. Myers, K. B. Laskey, K. A. DeJong (1999). Learning Bayesian Networks from Incomplete Data using Evolutionary Algorithms. Proceedings of the Genetic and Evolutionary Computation Conference.
13. H. Steck (2000). On the Use of Skeletons when Learning in Bayesian Networks. Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI-2000.
14. D. Whitley (1993). A Genetic Algorithm Tutorial. Technical Report CS-93-103, Department of Computer Science, Colorado State University.