Performance Analysis of an Acyclic Genetic
approach to Learn Bayesian Network Structure
(Student Paper)
Pankaj B. Gupta¹ and Vicki H. Allan²
¹ Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA, [email protected]
² Computer Science Department, Utah State University, Logan, UT 84322, [email protected]
Abstract. We introduce a new genetic algorithm approach for learning a Bayesian network structure from data. Our method is capable of
learning over all node orderings and structures. Our encoding scheme is
inherently acyclic and is capable of performing crossover on chromosomes
with different node orders. We present an analysis of this approach using
different Bayesian networks such as ASIA and ALARM. Results suggest that the method is effective. The tests we perform include varying
the population size of the genetic algorithms, restricting the maximum
number of parents a node can have, and learning with a fixed node order.
Keywords: Structure Learning, Bayesian Networks, Genetic Algorithms.
1 Introduction
Bayesian networks are probabilistic networks capable of representing causal relationships [6]. They are directed acyclic graphs with nodes representing the
variables of a problem and edges representing the causal relations between the
variables. Each node has a conditional probability distribution in which the node's parents are the conditioning variables. Learning the structure and learning the probabilities of the nodes are treated as two separate problems, of which the former is considered much more challenging. In this research, we consider the
problem of learning the structure of a Bayesian network.
Genetic algorithms are evolutionary algorithms for solving problems with a
large solution space [14]. A possible solution of the problem is termed a chromosome. An initial population of chromosomes is generated randomly. Each
chromosome is evaluated according to a fitness function. Two chromosomes are
chosen from the population at random, and they are crossed over to produce two
daughter chromosomes. The population is then truncated to its original size by discarding the worst-quality chromosomes. This process is repeated
until a quality chromosome is generated.
In the past few years, Bayesian networks have become important in the field
of artificial intelligence. Learning the structure of a Bayesian network from data
is a challenging problem, known to be NP-Hard [2, 7]. Since the solution space
is large, genetic algorithms are considered. Previous attempts at using genetic
algorithms to learn Bayesian network structure either assume a pre-defined topological ordering of nodes, or have the overhead of dealing with cyclic networks.
We present a genetic algorithms technique to learn Bayesian network structure
that does not assume any initial node ordering and inherently preserves the
acyclic nature of the graph over crossover and mutation [5]. This results in an
effective method which derives the structure without throwing away important
information. We present a detailed analysis of our method.
2 Previous Work
The approach of using genetic algorithms to learn Bayesian network structure was introduced by Larranaga et al. [9]. Their tests show that it is a challenging area of research [10]. Their method generates illegal cyclic Bayesian networks. To handle illegal networks, either a pre-existing node order is assumed or cycle-causing arcs are deleted. Myers et al. use genetic algorithms to learn the structure of Bayesian networks from incomplete data [12]. They handle cyclic Bayesian networks generated in the process by assigning them a low score. Guo et al. use a genetic algorithm to tune the node ordering of a Bayesian network [4]. The node ordering serves as input to the K2 algorithm [3].
Our approach is different because the acyclic property is inherent to our encoding scheme. We do not assume any pre-defined node ordering and do not have
to deal with cyclic networks. In addition, we do not lose information during the learning process by deleting arcs to keep the network acyclic.
Our crossover function is a closed operation.
3 Scoring Metric
Given a data set D and a learned Bayesian network BS, P(BS | D) represents the posterior probability that BS is learned from D. Using the Bayes theorem, it can be represented as:

P(BS | D) = P(D | BS) P(BS) / P(D)
P(D | BS) represents the probability that data D is generated from BS and is known as the marginal likelihood function. P(BS) can be assumed to be constant if no prior information about the structure is available. P(D) is a normalizing constant and is independent of the network. Thus, P(D | BS) can be used as a scoring metric.
Let n be the number of nodes of a Bayesian network. Assume that node i has s_i different states, π_i denotes the parent set of i, and p_i denotes the number of different states of the parent set π_i. Cooper et al. [3] showed that the following function, known as the Bayesian Dirichlet metric, can be used as the marginal likelihood function:

P(D | BS) = ∏_{i=1}^{n} ∏_{j=1}^{p_i} [ Γ(N'_ij) / Γ(N'_ij + N_ij) · ∏_{k=1}^{s_i} Γ(N'_ijk + N_ijk) / Γ(N'_ijk) ]

N_ijk is the number of instances in the data set such that node i is in its k-th state and its parent set (π_i) is in its j-th state; N_ij = Σ_{k=1}^{s_i} N_ijk, and N'_ij is defined analogously from the N'_ijk. N'_ijk is a hyper-parameter that may be set to 1, if no prior information about the structure of a Bayesian network is available.
Since there are a lot of multiplication operations in the Bayesian Dirichlet
function, the logarithm of Bayesian Dirichlet (LogBD) is often used as a scoring
metric. All the research described in section 2 uses LogBD as the scoring
We also use the LogBD as our scoring metric. LogBD of a Bayesian network is the
sum of the LogBD values for each node of the network. LogBD of a node depends
only on its parent set. If a node has the same parents in two different Bayesian
networks, the LogBD score of that node is the same in both the networks. This
proves beneficial as we do not recalculate the score of a node (if it has the same
parent set) multiple times during the learning process.
4 Our Algorithm
4.1 The Encoding Scheme
Our chromosome defines both a Bayesian network and a topological order on
nodes of the Bayesian network [5]. Associating a node order with a chromosome
has the following advantages: First, by crossing over chromosomes of different
node orders, our algorithm learns the optimal node ordering, and second, the
acyclic nature is preserved over operations like crossover and mutation.
Consider a Bayesian network of size n. The corresponding encoded chromosome is a sequence of n genes. Each gene corresponds to a node of the Bayesian
network. Thus, the sequence of n genes also represents the node order associated
with the chromosome. For any node, the nodes preceding it in the topological
order are its candidate parents. Each gene corresponding to a node has a set of
booleans describing which of the candidate parents of the node are the actual
parents.
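The encoding scheme above can be sketched in a few lines. The names and the list-of-booleans representation are our assumptions; the key property is that gene i carries one boolean per preceding node, so decoding can only produce forward edges.

```python
import random

def random_chromosome(nodes):
    """Encode a random acyclic network: a node order plus, for each gene,
    one boolean per preceding node marking it as a parent or not
    (a sketch of the section 4.1 scheme)."""
    order = random.sample(nodes, len(nodes))
    parent_bits = [[random.random() < 0.5 for _ in range(i)]
                   for i in range(len(order))]   # gene i has i candidate parents
    return order, parent_bits

def to_edges(chrom):
    """Decode into (parent, child) edges; every edge points forward in the
    node order, so the decoded graph is acyclic by construction."""
    order, parent_bits = chrom
    return [(order[j], order[i])
            for i, bits in enumerate(parent_bits)
            for j, on in enumerate(bits) if on]
```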
4.2 Crossover Function
Our O(n2 ) crossover function ensures that the daughter chromosomes produced
are acyclic in nature. The crossover function can be thought of as a two step
process: producing node orders and assigning parents. Let the two parent chromosomes be A and B, and the two daughter chromosomes be X and Y .
Step 1: Producing Node Orders
Based on a random crossover point, the node order of A is split into two parts,
a1 and a2. The node order for chromosome B is divided into two parts b1 and b2,
such that b1 has the same nodes as in a1 (but the order within B is preserved),
and b2 has the same nodes as in a2 (but the order within B is preserved). a1 is
concatenated with b2 to generate the node order for X, and b1 is concatenated
with a2 to generate the node order for Y .
This step enables our learning algorithm to cross over chromosomes with different node orders to learn the optimal one. It allows our learning algorithm to span the entire set of node orders.
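Step 1 can be sketched directly from the description above (the function name is ours):

```python
def crossover_orders(order_a, order_b, point):
    """Step 1 of the crossover (section 4.2): split A's order at a random
    point into a1, a2; partition B's order into the same two node sets,
    preserving B's internal order; then recombine."""
    a1, a2 = order_a[:point], order_a[point:]
    in_a1 = set(a1)
    b1 = [n for n in order_b if n in in_a1]       # a1's nodes, in B's order
    b2 = [n for n in order_b if n not in in_a1]   # a2's nodes, in B's order
    return a1 + b2, b1 + a2                       # node orders for X and Y
```

For example, splitting A = [1, 2, 3, 4] at point 2 against B = [4, 3, 2, 1] gives X the order [1, 2, 4, 3] and Y the order [2, 1, 3, 4].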
Step 2: Producing Parent Information
In producing parent information, we generate two different sets of parents for X
and Y . It is decided randomly which set of daughter chromosomes (methods M1
or M2) is actually produced.
Arcs in the parent nodes are transferred to the daughters, either directly or by
reversing the order. For X, the parent relation between the nodes of a1 (or b2) are
directly transferred from A (or B). Similarly for Y , the parent relations between
the nodes of b1 (or a2) are directly transferred from B (or A). If method M1 is used, the parent relation between a node of a1 and a node of b2 is transferred from A to X, whereas the relation between a node of b1 and a node of a2 is transferred from B to Y. In this case, the direction of a relation transferred from B to Y may be reversed, as the daughters' node orders differ from those of the parents. In method M2, the roles of A and B are interchanged. For every crossover operation, it is important to choose one of the two methods at random. If only one method is used for every crossover, our learning algorithm does not span the entire solution space.
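Under an edge-set representation, step 2 for daughter X under method M1 might be sketched as follows. This is our reading of the text, not the authors' code; note that a1 and b2 together contain all the nodes, and that edges running against X's order are reversed rather than deleted, following the cycle-prevention rule described below.

```python
def daughter_parents_m1(edges_a, edges_b, a1, b2):
    """Step 2 (method M1) for daughter X, whose node order is a1 + b2.
    edges_a / edges_b: sets of (parent, child) pairs from chromosomes A, B.
    Relations within a1 and between a1's and b2's node sets come from A;
    relations within b2's node set come from B."""
    in_b2 = set(b2)
    pos = {n: i for i, n in enumerate(a1 + b2)}
    edges = {(p, c) for p, c in edges_a
             if not (p in in_b2 and c in in_b2)}   # within a1, plus cross pairs
    edges |= {(p, c) for p, c in edges_b
              if p in in_b2 and c in in_b2}        # within b2's node set
    # flip any edge that would violate the daughter's topological order
    return {(p, c) if pos[p] < pos[c] else (c, p) for p, c in edges}
```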
It should be noted that parent arcs between nodes of the parent chromosomes are transferred to exactly one of the two daughter chromosomes. The sum
of edges in both the parents is the same as that of the daughter chromosomes. In
a daughter, the nodes originating from the same parent preserve the dependence
according to that parent. This ensures that our crossover function preserves the
goodness of the chromosomes. This proves particularly beneficial if one of the parent chromosomes has captured the dependence among a few nodes and the other parent has captured the dependence among the remaining nodes. Cycle formation has been a problem in previous attempts at structure learning using genetic algorithms. Since we have a node order associated with our daughter chromosomes, we prevent cycles by flipping the direction of any cycle-causing edge.
By doing so, we ensure that we do not lose any parent information by deleting
any edges. Steck uses the concept of edge reversal in his learning algorithm [13].
Though reversing an edge between two nodes may not represent the exact relationship between them as before, it does preserve some dependence relation
between the two. The other possible approach is to delete the cycle forming edges.
Larranaga [9] uses this approach. To investigate which of the two approaches is
better, we modify our method to delete the cycle forming edges instead of reversing them. We use a sample data set of 1000 entries to learn the structure of
ASIA[11] network using our method and the modified method. Figure 1 shows
the percentage of times the optimal ASIA network was learned. From the figure,
it is evident that reversing an edge is a better approach than deleting it.
Fig. 1. Comparison of two approaches to prevent cycle formation while learning the
structure of ASIA network. In one approach, potentially dangerous edges are reversed,
whereas in the other approach they are deleted.
4.3 Mutations
Mutations are an integral part of the genetic algorithms. Our encoding scheme
ensures that the acyclic nature of the chromosomes is preserved over mutations.
For each node and each parent information of our chromosome, a random number
between 0 and 1 is generated. If the number is less than the mutation probability,
the node or parent information is mutated. A node is mutated by swapping it
with some another randomly chosen node of the chromosome. A parent bit is
mutated by toggling its value.
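The mutation rule above can be sketched as follows (a sketch under the encoding assumptions of section 4.1; the representation is ours). Because the parent bits always refer to preceding positions in the order, the chromosome stays acyclic under both kinds of mutation.

```python
import random

def mutate(chrom, p_mut):
    """Mutate a chromosome in place: each gene's node may be swapped with
    another randomly chosen node, and each parent bit may be toggled,
    each with probability p_mut."""
    order, parent_bits = chrom
    for i in range(len(order)):
        if random.random() < p_mut:               # node mutation: swap
            j = random.randrange(len(order))
            order[i], order[j] = order[j], order[i]
        for b in range(len(parent_bits[i])):
            if random.random() < p_mut:           # parent mutation: toggle
                parent_bits[i][b] = not parent_bits[i][b]
    return chrom
```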
5 Enriching Initial Population
We use the term spanning population to mean that through crossover the initial
population is capable of producing every chromosome in the solution space. A
spanning population is important for the success of a genetic algorithm.
We introduce two functions: InvertOrder and ToggleParents. InvertOrder(A) (represented as Â) generates a chromosome with a node order that is the reverse of that of the input chromosome A. The sets of booleans (representing the parent information) in the genes of A and Â are identical. InvertOrder is required for our algorithm to be able to iterate over all node orders in the solution space.
ToggleParents(A) (represented as Ā) toggles the parent information of each gene to produce a new chromosome. ToggleParents is important to produce every possible parent configuration for a given node order.
We generate our initial population as follows: randomly generated chromosomes are added to the population. To make the population rich, for every random chromosome A, we add to the population Â, Ā, and the chromosome obtained by applying both functions to A. This ensures that, using our crossover function, every chromosome in the solution space is derivable from our initial population.
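The enrichment procedure can be sketched as follows. The representation (a node order plus per-gene boolean lists) and the function names are our assumptions; in particular, our InvertOrder keeps the per-position boolean sets unchanged, which is our reading of the text.

```python
import random

def invert_order(chrom):
    """InvertOrder: reverse the node order; the genes' boolean sets are
    kept identical."""
    order, bits = chrom
    return order[::-1], [list(b) for b in bits]

def toggle_parents(chrom):
    """ToggleParents: flip every parent boolean of every gene."""
    order, bits = chrom
    return list(order), [[not v for v in b] for b in bits]

def enriched_population(nodes, n_random):
    """For every random chromosome A, also add its three variants, so the
    enriched population is four times the random one (cf. section 6.4)."""
    pop = []
    for _ in range(n_random):
        order = random.sample(nodes, len(nodes))
        bits = [[random.random() < 0.5 for _ in range(i)]
                for i in range(len(order))]
        a = (order, bits)
        pop += [a, invert_order(a), toggle_parents(a),
                toggle_parents(invert_order(a))]
    return pop
```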
6 Results
6.1 Data Generation
We use the Logic Sampling method to generate a data set [8]. Given a Bayesian
network and a set of conditional probability tables, initially a state value for
each root node is generated using a random state generator weighted by the
probabilities of different states. Once the states of all root nodes are determined,
the same operation is performed for the nodes whose parent states have been
determined. The process is repeated until a state has been instantiated for each
node in the network. This yields one data entry for the data set. The above
process is repeated to generate a data set of the desired size.
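The Logic Sampling procedure above can be sketched as follows (a sketch; the function name and the CPT representation, mapping a tuple of parent states to a list of state probabilities, are our assumptions):

```python
import random

def logic_sample(parents, cpt, n):
    """Forward (logic) sampling: instantiate a node once all of its parents
    have states, drawing its state weighted by the conditional
    probabilities. parents: dict node -> list of parent nodes;
    cpt[node]: dict mapping a tuple of parent states to state probabilities."""
    data = []
    for _ in range(n):
        entry, pending = {}, set(parents)
        while pending:
            # visit nodes whose parents are all instantiated (root nodes first)
            for v in sorted(pending):
                if all(p in entry for p in parents[v]):
                    probs = cpt[v][tuple(entry[p] for p in parents[v])]
                    entry[v] = random.choices(range(len(probs)),
                                              weights=probs)[0]
                    pending.discard(v)
        data.append(entry)   # one complete data entry
    return data
```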
6.2 Testing Procedure
1. We choose a Bayesian network (network structure + probability tables).
2. The Logic Sampling method is used to generate data.
3. The Genetic Algorithm approach is used to learn the structure of the Bayesian
network. We generate an enriched population of size p as described in section 5. In each iteration of the genetic algorithm, we produce p daughter
chromosomes. Based on the Bayesian Dirichlet score function, the worst
p chromosomes (chromosomes with the lowest score) among the union of
daughters and parents are discarded.
4. The best network generated is compared with the optimal network. The
term optimal network denotes the original network, unless a better network
is learned over the span of all the tests performed.
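The generational loop in step 3 can be sketched as follows. The `score` and `crossover` arguments are assumed callables (e.g. the LogBD metric and the section 4.2 operator); mutation is omitted for brevity, and the names are ours.

```python
import random

def learn_structure(population, score, crossover, n_iter):
    """Each iteration produces p daughters via crossover; the union of
    parents and daughters is then truncated back to p by discarding the
    lowest-scoring chromosomes."""
    p = len(population)
    for _ in range(n_iter):
        daughters = []
        while len(daughters) < p:
            i, j = random.sample(range(p), 2)    # pick two parents at random
            daughters.extend(crossover(population[i], population[j]))
        pool = population + daughters[:p]
        pool.sort(key=score, reverse=True)       # keep the p best
        population = pool[:p]
    return max(population, key=score)
```

Because the parents survive into the selection pool, the best score in the population never decreases across iterations.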
We test our algorithm on Bayesian networks of different sizes. To discuss our
results, we use the following Bayesian networks: ASIA [11], DualASIA, DualASIA+, SubALARM and ALARM [1]. ASIA and ALARM networks are commonly
used for Bayesian analysis. ASIA is a small network with 8 nodes and 8 edges
that calculates the probability of a patient having diseases like tuberculosis, lung
cancer and bronchitis. ALARM, a large network with 37 nodes and 46 edges,
is a prototype that models potential anesthesia problems in an operating room.
To be able to analyze our algorithm for intermediate size Bayesian networks, we
create DualASIA, DualASIA+ and SubALARM. DualASIA, a 16 node and 16
edge network, consists of two identical ASIA networks. DualASIA+ is similar
to DualASIA, but has 4 additional edges connecting the two ASIA networks.
Though it may seem that ASIA, DualASIA and DualASIA+ will produce similar results, their behavior is actually different. All of them represent different
Bayesian networks. SubALARM is a 24 node subgraph of the ALARM network.
Using our algorithm, we are able to learn the optimal structure of all networks
we test.
6.3 Test: Effect of Varying Population Size
Fig. 2. The effect of varying population size on the quality of structures learned. The
best score and the average scores learned are plotted in the graph. Beside each data
entry in the graph, we show the Hamming Distance (HD) / Reverse Edge Count (RevCt)
values.
Initially we test our algorithm by varying the population size of the genetic
algorithms. We present the effects of varying the population size on the quality
of Bayesian networks learned in the case of DualASIA network. We create a data
set of 1000 entries. The optimal network for the data set is learned by performing
different tests described in this section. The optimal network has a better score
(-7728.02) than the original network (-7739.68) and differs from the original
network in 2 edges. Figure 2 shows the effect of varying the population size on
the quality of networks learned. The average is calculated by repeating the test
50 times for each population size. A mutation probability of 0.05 is used and the
stopping condition is set at 250 iterations. Beside each data entry in the graph,
we present the Hamming Distance and the Reverse Edge Count between the
learned network and the optimal network. The Hamming Distance (HD) between two networks is the number of edges that are not common to both networks. HD does not include edges that differ only in direction; the count of such edges is the Reverse Edge Count (RevCt). In figure 2, we
observe that for low population sizes the difference between the average score and
the best score learned is large. As the population size increases, the average score
and the best score converge to the optimal score value. Increasing the population
size further does not produce any more improvements. This indicates that as the
population size increases, the probability of learning a good Bayesian network
approaches 1. Similarly, as the population size increases, HD and
the average HD converge to the optimal value 0. RevCt tends to improve as the
population size increases but the improvement is not consistent. This is because
reversing an edge does not always have an adverse effect on the relationships
encoded by a Bayesian network.
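The HD and RevCt metrics can be sketched as follows, given two networks as sets of directed edges (a sketch of our reading of the definitions; the function name is ours):

```python
def hd_revct(edges_a, edges_b):
    """Hamming Distance and Reverse Edge Count between two networks given
    as sets of (parent, child) edges. RevCt counts edges present in both
    networks but with opposite direction; HD counts edges present in only
    one network, excluding those reversed pairs."""
    a, b = set(edges_a), set(edges_b)
    reversed_pairs = {(p, c) for p, c in a if (c, p) in b}
    hd = sum(1 for e in a ^ b                 # edges in exactly one network
             if e not in reversed_pairs
             and (e[1], e[0]) not in reversed_pairs)
    return hd, len(reversed_pairs)
```

For example, comparing {(1,2), (2,3)} with {(2,1), (2,3), (3,4)} gives HD = 1 (the edge (3,4)) and RevCt = 1 (the reversed pair between nodes 1 and 2).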
6.4 Test: Random Population vs. Enriched Population
As mentioned in section 5, the initial population is enriched by adding three variants of each random chromosome present in the initial population. To demonstrate the positive effects of enriching a population, we generate a random population of size p1 , and perform the learning process with the random population.
The population is then enriched, and the learning process is repeated. We compare the results obtained from the two learning processes. It should be noted
that the size of the enriched population (4 ∗ p1 ) is 4 times the size of the random population(p1 ). To compare the results on a fair basis, in each iteration of
the latter learning process, we generate only p1 (instead of 4 ∗ p1 ) daughters.
Mutations are not used for this test as they help the learning algorithm span
the whole solution space, thereby reducing the negative effects of not enriching
a population. It is observed that with an enriched population better networks
are learned.
Table 1. The results of learning DualASIA+ network using a Random Population vs. an Enriched Population. Data set size = 1000 entries, iterations = 500. Optimal Network Score = -6929.96.

                        Random Population   Enriched Population
                        p = 25              p = 25 (random) + 75 (variants)
Best Network Score      -7009.84            -6929.96
HD / RevCt              10 / 3              0 / 0
Ave. Score              -7110.37            -6964.22
Ave. HD / Ave. RevCt    15.95 / 6.5         3.8 / 2.75
Table 1 shows the results of the DualAsia+ network for this test. Average
values are calculated by repeating the experiment 50 times. We are able to
learn the optimal DualASIA+ when the enriched population is used, whereas if the random population is used the best network learned has an HD value of 10.
Considering that DualASIA+ is a network with 20 edges, an HD value of 10 is
high. Even the average values in the case of the random population are worse.
In the previous test, the size of the enriched population is much larger than
that of the random population. In this test, we compare the two processes,
mentioned above, using same sized populations. Figure 3 shows the results of
this test for the ASIA network. It is observed that enriching the population
yields better results for low population sizes. However, for larger populations,
both processes yield similar results. We hypothesize the reason is that the process
of enriching a population provides four different parent configurations for each
node of a random chromosome. Starting with those four configurations, a node
is theoretically able to generate any possible parent configuration. In the case of
a large population, each node already has a sufficient number of random parent
configurations and so enriching the population does not help any further.
Fig. 3. Comparing ASIA networks learned when Random Populations and Enriched
Populations of the same sizes are used.
For larger networks, such as ALARM, we do not see much improvement in the learning process even with low population sizes. The reason is that, with low population sizes on larger networks, the learning algorithm does not come anywhere near an optimal solution. Thus, multiple executions of the learning process generate inconsistent results and are difficult to compare. From this test, we learn that enriching the population does not produce any adverse effects. In a few cases, it helps improve the learning process,
whereas in other cases, the initial population is rich enough that enriching it
does not have a great effect.
6.5 Test: Using both Methods (M1 and M2) vs. Using One Method
As explained in section 4.2, in crossover, one of the two methods (M1 and M2)
is randomly chosen to generate daughter chromosomes. It is imperative that
both methods be used in the learning process. It can be mathematically proven
that the learning algorithm requires both methods to be able to span the whole
solution space [5].
To demonstrate this experimentally, we run our learning algorithm with only
method M1 and compare the results with those generated by using both the
methods. All the remaining parameters are kept constant. Mutations are not
used for this particular test as they help the learning algorithm span the whole
solution space, thereby reducing the negative effects of using only method M1. It
is observed that the quality of the learned Bayesian networks deteriorates when
only method M1 is used. Table 2 shows the results of this experiment for the
SubALARM network. It can be seen that, when both methods are used, the best
networks learned have scores very close to that of the original network, whereas
when only method M1 is used, the networks learned have a much lower score
and a much higher HD.
Table 2. The effect of learning SubALARM using only method M1 vs. using both methods. The experiment is repeated 10 times. p is the population size and i is the number of iterations.

            Data Size 1000 (Original BN Score: -6628)   Data Size 10000 (Original BN Score: -63739)
            p=400, i=350        p=1200, i=500           p=1200, i=500
Method      M1 Only   Both      M1 Only   Both          M1 Only   Both
Best Score  -6772     -6637     -6699     -6620         -63922    -63746
HD          47        14        31        7             28        2
6.6 Test: Restricting Maximum Number of Parents
Many researchers use a restriction on the maximum number of parents that a
node can have. To understand how our algorithm performs with this restriction,
we perform the following test. We compare the results of 4 different situations
namely : No Reduction (NR), Random Reduction (RR), Exact Smart Reduction
(ESR) and Atmost Smart Reduction (ASR). In NR there is no restriction on the
number of parents that a node can have (this is our original algorithm), whereas
in all the other methods, each node can have a maximum of q parents. Crossover and mutation operations may produce networks with nodes that have more than q parents. For such nodes, parents in excess of q are deleted. In
RR the parents to be deleted are chosen at random. In the case of ESR, the
Dirichlet score is calculated for each q parent subset of the original parent set.
The subset with the highest score is chosen as the new parent set of the node
and the remaining parent relations are deleted. In the case of ASR, the best
subset of size q or less is chosen.
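The three reduction methods can be sketched in one function. The `score` argument is an assumed callable mapping a parent subset to its Dirichlet score; the function name and representation are ours.

```python
import random
from itertools import combinations

def reduce_parents(parent_set, q, method, score=None):
    """Trim a node's parent set to at most q parents. RR drops random
    parents; ESR keeps the best-scoring subset of size exactly q;
    ASR keeps the best-scoring subset of size q or less."""
    parent_set = list(parent_set)
    if len(parent_set) <= q:
        return set(parent_set)           # no reduction needed
    if method == "RR":
        return set(random.sample(parent_set, q))
    sizes = [q] if method == "ESR" else range(q + 1)   # ASR: sizes 0..q
    candidates = [set(c) for r in sizes
                  for c in combinations(parent_set, r)]
    return max(candidates, key=score)
```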
Fig. 4. Comparison between NR, RR, ESR and ASR methods for ASIA network with a
maximum parent restriction set to 4. Each experiment is repeated 100 times for various
population sizes.
For each of the four methods, we present the results of learning the ASIA
network with the maximum number of parents restricted to 4 (figure 4) and 2
(figure 5). It is interesting to note that in figure 4 the results from all the methods
are similar. Since ASIA is an 8 node network, any node can have a maximum of 7
parents. The ASIA network has nodes with a maximum of 2 parents. Even after
restricting the parents to a maximum of 4, each node has a sufficient number of
parent configurations to be able to learn an optimal parent configuration. Thus,
all methods yield similar results.
Figure 5 is even more interesting. We see that the quality of learning in ESR
and ASR has reduced, whereas RR maintains its learning quality. We hypothesize
the reason as follows. In the case of ESR and ASR, we try to find the best two
element subset of the parents of a node. Thus, the genetic algorithm tends to
converge before a sufficient number of parent configurations is generated by
crossover and mutation operations. This approach prohibits a node from having
such parent configurations that are themselves not very good, but may produce
good results when crossed over with other configurations. In ESR and ASR, the learning process tends to greedily reach a local maximum. Thus, the quality of ESR and ASR is reduced compared to the other methods.
The learning quality of RR is similar to that of NR, but outperforms NR for
low population sizes. This is because NR performs the search in a much larger
solution space as compared to RR. With low population sizes the probability that
NR is unable to produce an optimal structure is higher. But as the population
size increases, NR produces results that are similar to that of RR.
Fig. 5. Comparison between NR, RR, ESR and ASR methods for ASIA network with
a maximum parent restriction set to 2. Each experiment is repeated 100 times.
6.7 Test: Using a Fixed Node Order
Learning Bayesian networks with a fixed node order is common in previous research. Not only does it reduce the solution space by a large factor, it also
removes the problem of dealing with cyclic networks formed by crossover and
mutation operations. Our encoding scheme is inherently acyclic, but using a
fixed node order does reduce the solution space of our learning algorithm. In
our algorithm, crossing over chromosomes with identical node orders produces daughters with the same node order. Thus, it is easy for us to test our algorithm
with a pre-defined node order. We present the results for the ALARM network in
this situation. In the previous tests, we are able to learn the optimal solution for all other networks except ALARM. For this test, we create three different data sets of sizes 1000, 5000 and 10000 respectively. In this test, we are able to learn an optimal ALARM network for each data set. Figure 6 shows a comparison between the learned ALARM networks with and without a fixed node order.
Fig. 6. Learning ALARM network using our original algorithm vs. using a Fixed Node Order (FNO). Beside each data point, we show the score / HD value. The maximum number of parents is limited to 4 using the RR method. A data set of size 1000 is used. The score of the original ALARM network for this data set is -10003.
7 Conclusions
We have presented a new technique of using genetic algorithms to learn Bayesian
network structure. Our encoding scheme is inherently acyclic, a property that is easily preserved during the learning process. Unlike some previous attempts,
our algorithm does not have the overhead of dealing with cycles formed by
crossover and mutation operations. We do not have to remove edges, and thereby lose information, to prevent the generation of cycles in the Bayesian networks. For the structure learning process, we do not have to assume any pre-defined variable ordering. Our algorithm learns the optimal node ordering and
the optimal network simultaneously.
Results show that our algorithm is able to learn optimal structures of various
Bayesian networks such as ASIA and ALARM. As the population size increases, the probability that our learning algorithm produces an optimal network approaches 1. Enriching the initial population improves the quality of the learning algorithm. However, if a random population and an enriched population of the same size are compared, we see an improvement only for small networks and when a small population size is used. Using both methods M1 and M2 is imperative to our learning algorithm. If only one of the two methods is used, the learning quality of our algorithm deteriorates as the algorithm is unable to span the whole solution space. Restricting the number of parents to a maximum value reduces the solution space of the learning algorithm. The Random Reduction method
(RR) produces the best results in all the situations. Our algorithm can benefit if
a pre-existing node order is available. By using a pre-defined node order, we are
able to improve performance as shown in learning the optimal ALARM network.
References
1. I. A. Beinlich, H. J. Suermondt, R. M. Chavez and G. F. Cooper (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. Proceedings of the Second European Conference on Artificial Intelligence in Medicine.
2. D. M. Chickering, D. Geiger, and D. Heckerman (1994). Learning Bayesian networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research.
3. G. F. Cooper and E. Herskovits (1992). A Bayesian Method for the Induction of
Probabilistic Networks from Data. Machine Learning, 9(4):309-347.
4. H. Guo, B. B. Perry, J. A. Stilson, W. H. Hsu (2002). A Genetic Algorithm for Tuning Variable Orderings in Bayesian Network Structure Learning. Student Abstract,
AAAI-2002.
5. P. B. Gupta, V. H. Allan (2003). The Acyclic Bayesian Net Generator. 1st Indian
International Conference on Artificial Intelligence, IICAI-03.
6. D. Heckerman (1995). A tutorial on Learning Bayesian Networks. Technical Report,
MSR-TR-95-06. Microsoft Research.
7. D. Heckerman, D. Geiger, and D. Chickering (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197-243.
8. M. Henrion (1988). Propagating uncertainty in Bayesian networks by probabilistic logic sampling. Uncertainty in Artificial Intelligence 2, pages 149-163, New York, NY: Elsevier Science Publishing Company, Inc.
9. P. Larranaga, M. Poza, Y. Yurramendi, R. H. Murga, C. M. H. Kuijpers (1996a).
Structure Learning of Bayesian Networks by Genetic Algorithms: A Performance
Analysis of Control Parameters. IEEE Journal on Pattern Analysis and Machine
Intelligence, 18(9):912-926.
10. P. Larranaga, R. Murga, M. Poza, C. Kuijpers (1996b). Structure Learning of
Bayesian Networks by Hybrid Genetic Algorithms. In D. Fisher and H. J. Lenz.
(eds.), Learning from Data: Artificial Intelligence and Statistics V, Lecture Notes
in Statistics 112, New York, NY:Springer-Verlag:165-174.
11. S. L. Lauritzen and D. J. Spiegelhalter (1988). Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society, 50(2):157-224.
12. J. W. Myers, K. B. Laskey, K. A. DeJong (1999). Learning Bayesian Networks
from Incomplete Data using Evolutionary Algorithms. Proceedings of the Genetic
and Evolutionary Computation Conference.
13. H. Steck (2000). On the Use of Skeletons when Learning in Bayesian Networks. Sixteenth Conference on Uncertainty in Artificial Intelligence, UAI-2000.
14. D. Whitley (1993). A genetic algorithm tutorial. Technical Report CS-93-103, Department of Computer Science, Colorado State University.