CIS664 KD&DM
G.F. Cooper, E. Herskovits
A Bayesian Method for the
Induction of Probabilistic Networks
from Data
Presented by Uroš Midić
Mar 21, 2007
Introduction
Bayesian Belief Network
Learning Network Parameters
Learning Network Structure
Probability of Network Structure Given a Dataset
Finding the Most Probable Network Structure
K2 algorithm
Experimental result
Pros and cons
Introduction
Events A and B are independent if
P(A∩B)=P(A)P(B)
or in the case that P(A)>0 and P(B)>0,
P(A|B) = P(A) and P(B|A) = P(B)
Introduction
Discrete-valued random variables X and Y are
independent if
for any a, b, the events X=a and Y=b are
independent, i.e.
P(X=a∩Y=b)=P(X=a)P(Y=b)
Or
P(X=a|Y=b) = P(X=a)
Introduction
Let X, Y, and Z be three discrete-valued random
variables. X is conditionally independent of Y
given Z if for any a, b, c,
P(X=a | Y=b,Z=c)=P(X=a|Z=c)
This can be extended to sets of variables, e.g. we
can say that X1…Xl is conditionally independent of
Y1…Ym given Z1…Zn, if
P(X1…Xl | Y1…Ym, Z1…Zn) = P(X1…Xl | Z1…Zn)
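As a quick numerical illustration (not from the slides), the sketch below checks this definition directly on a small joint distribution over three variables; the joint-table layout, the function name, and the toy numbers are assumptions made only for the example.

```python
# Sketch: test whether X is conditionally independent of Y given Z by
# comparing P(X=a | Y=b, Z=c) with P(X=a | Z=c) for all a, b, c.
from itertools import product

def cond_independent(joint, eps=1e-12):
    """joint: dict (a, b, c) -> P(X=a, Y=b, Z=c)."""
    xs = {a for a, _, _ in joint}
    ys = {b for _, b, _ in joint}
    zs = {c for _, _, c in joint}
    for a, b, c in product(xs, ys, zs):
        p_bc = sum(joint.get((x, b, c), 0.0) for x in xs)        # P(Y=b, Z=c)
        p_c = sum(joint.get((x, y, c), 0.0) for x in xs for y in ys)  # P(Z=c)
        if p_bc > eps and p_c > eps:
            p_a_given_bc = joint.get((a, b, c), 0.0) / p_bc
            p_a_given_c = sum(joint.get((a, y, c), 0.0) for y in ys) / p_c
            if abs(p_a_given_bc - p_a_given_c) > 1e-9:
                return False
    return True

# Made-up joint where X and Y each depend only on Z, so X ⊥ Y | Z holds.
joint = {}
for z, pz, px1, py1 in [(0, 0.5, 0.2, 0.5), (1, 0.5, 0.7, 0.9)]:
    for x, px in [(1, px1), (0, 1 - px1)]:
        for y, py in [(1, py1), (0, 1 - py1)]:
            joint[(x, y, z)] = pz * px * py
print(cond_independent(joint))   # True
```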
Introduction
Let X = {X1, …, Xn} be a set of discrete-valued
variables, where each variable Xi has a defined set
of possible values V(Xi).
The joint space of the set of variables X is defined
as V(X1) × V(X2) × … × V(Xn).
The probability distribution over the joint space
specifies the probability for each of the possible
variable bindings for the tuple (X1, …, Xn), and is
called the joint probability distribution.
Bayesian Belief Networks
A Bayesian Belief Network represents the joint
probability distribution for a set of variables.
It specifies a set of conditional independence
assumptions (in form of a Directed Acyclic Graph)
and sets of local conditional probabilities (in form
of tables).
Each variable is represented by a node in the
graph. The network arcs represent the assertion
that each variable is conditionally independent of
its non-descendants, given its immediate
predecessors (its parents).
Bayesian Belief Networks
Example:
[Figure: belief network over Storm, Lightning, Thunder, BusTourGroup, Campfire, and ForestFire, with a conditional probability table attached to the Campfire node]
Conditional probability table for Campfire, given Storm (S) and BusTourGroup (B):
        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8
Bayesian Belief Networks
(Same network and Campfire table as on the previous slide.)
Example:
P(Campfire = True | Storm = True, BusTourGroup = True) = 0.4
Bayesian Belief Networks
For any assignment of values (x1,…,xn) to the
variables (X1, …, Xn), we can compute the joint
probability
P(x1,…,xn) = ∏i=1..nP(xi|Parents(Xi))
The values P(xi|Parents(Xi)) come directly from
the tables associated with respective Xi.
We can therefore recover any probability of the
form P(Y|Z), where Y and Z are subsets of
X = {X1, …, Xn}.
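To make the factorization concrete, here is a minimal Python sketch (not from the slides) that evaluates ∏i P(xi|Parents(Xi)) from per-node tables. The Campfire table is taken from the example slide; the priors for Storm and BusTourGroup, the restriction to three of the six nodes, and all names are placeholders for illustration only.

```python
# Sketch: joint probability via the factorization P(x1,...,xn) = prod_i P(xi | Parents(Xi)).

def joint_probability(assignment, parents, cpt):
    """assignment: dict node -> value; parents: dict node -> tuple of parents;
    cpt: dict node -> {(value, parent_values): probability}."""
    p = 1.0
    for node, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[node])
        p *= cpt[node][(value, parent_values)]
    return p

parents = {"Storm": (), "BusTourGroup": (), "Campfire": ("Storm", "BusTourGroup")}
cpt = {
    "Storm":        {(True, ()): 0.3, (False, ()): 0.7},   # placeholder prior
    "BusTourGroup": {(True, ()): 0.5, (False, ()): 0.5},   # placeholder prior
    "Campfire":     {(True,  (True, True)):   0.4, (False, (True, True)):   0.6,
                     (True,  (True, False)):  0.1, (False, (True, False)):  0.9,
                     (True,  (False, True)):  0.8, (False, (False, True)):  0.2,
                     (True,  (False, False)): 0.2, (False, (False, False)): 0.8},
}
print(joint_probability(
    {"Storm": True, "BusTourGroup": True, "Campfire": True}, parents, cpt))
# 0.3 * 0.5 * 0.4 = 0.06
```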
Learning Network Parameters
We are given a network structure.
We are given a dataset, such that for each
training example all variables have assigned
values, i.e. we always get a full assignment
(x1,…,xn) to the variables (X1, …, Xn).
Then learning the conditional probability tables is
straightforward (a minimal sketch follows below):
1. Initialize the counters in all the tables to 0.
2. For each training example, increase all the appropriate counters in the tables.
3. Convert the counts into probabilities.
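A possible rendering of those three counting steps in Python (not from the slides); the function name, data layout, and the toy data are assumptions.

```python
# Sketch: learn CPTs from a fully observed dataset by counting.
from collections import defaultdict

def learn_cpts(data, parents):
    """data: list of dicts mapping each variable to its value (no missing values);
    parents: dict node -> tuple of parent nodes.
    Returns cpt: node -> {(value, parent_values): probability}."""
    counts = defaultdict(lambda: defaultdict(int))   # step 1: counters start at 0
    for case in data:                                # step 2: count each case
        for node, value in case.items():
            parent_values = tuple(case[pa] for pa in parents[node])
            counts[node][(value, parent_values)] += 1
    cpt = {}
    for node, table in counts.items():               # step 3: counts -> probabilities
        totals = defaultdict(int)
        for (value, pv), c in table.items():
            totals[pv] += c
        cpt[node] = {key: c / totals[key[1]] for key, c in table.items()}
    return cpt

# Toy usage (hypothetical data):
data = [{"Storm": True, "Campfire": True},
        {"Storm": True, "Campfire": False},
        {"Storm": False, "Campfire": False}]
print(learn_cpts(data, {"Storm": (), "Campfire": ("Storm",)}))
```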
Learning Network Parameters
We are given a network structure but the dataset
has missing values.
In this case, learning the conditional probability
tables is not straightforward, and usually
involves a gradient-ascent procedure to
maximize P(D|h) – probability of the dataset
given the model h.
Note that D (dataset) is fixed and we search for
the best h that maximizes P(D|h).
Learning Network Structure
The assumption for learning the network
parameters is that we know (or assume) a
network structure.
In simple cases, the network structure is
constructed by a domain expert.
In most cases a domain expert is not available,
or the dataset is so complicated that even a
domain expert is powerless.
Example: Gene-expression microarray datasets
have > 10K variables.
Probability of N.S. Given a Dataset
We are given a dataset and two networks. Which
of the networks is a better fit for this dataset?
[Table: 10 cases with values (+/−) for variables x1, x2, x3; some entries were lost in extraction]
[Figure: two candidate network structures, S1 and S2, over x1, x2, x3; the arcs were shown as diagrams on the original slide]
P(BS1|D) is much greater
than P(BS2|D). S1 was used
to generate the dataset.
Probability of N.S. Given a Dataset
We are given a dataset and two networks.
P(BS1|D) / P(BS2|D) = [P(BS1,D) / P(D)] / [P(BS2,D) / P(D)] = P(BS1,D) / P(BS2,D)
How to calculate P(BS,D)?
Probability of N.S. Given a Dataset
Assumptions made in the paper:
1. The database variables are discrete.
2. Cases occur independently, given a belief-network model.
3. There are no cases that have variables with missing values.
4. Prior probabilities (before observing the data) for conditional probability assignments are uniform.
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
Xi has a set of parents πi.
Xi has ri possible assignments (vi1, …, vi,ri).
There are qi unique instantiations (wi1, wi2, …, wi,qi) of πi in D.
Nijk is the number of cases in D in which variable xi has value vik and
πi is instantiated as wij.
Let Nij = ∑k=1..ri Nijk
With the four assumptions we get a formula:
P(BS, D) = P(BS) ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] ∏k=1..ri Nijk!
Probability of N.S. Given a Dataset
How to calculate P(BS,D)?
P(BS, D) = P(BS) ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] ∏k=1..ri Nijk!
Surprisingly, after some indexing and constant bounding we get that
the time complexity for this formula is O(mn) – linear in the
number of variables and number of cases.
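A minimal sketch (not the authors' code) of how this formula could be evaluated from a dataset. It computes log P(BS, D) − log P(BS), i.e. the double product over i and j, using lgamma so the large factorials do not overflow; the function name and data layout are assumptions.

```python
# Sketch: log of the Cooper-Herskovits score, prod_i prod_j (r_i-1)!/(N_ij+r_i-1)! prod_k N_ijk!
from math import lgamma
from collections import defaultdict

def log_score(data, parents, values):
    """data: list of dicts (cases); parents: node -> tuple of parents;
    values: node -> list of possible values (so r_i = len(values[node]))."""
    total = 0.0
    for node in values:
        r_i = len(values[node])
        # N_ijk: counts per parent instantiation j and value k
        n_ijk = defaultdict(lambda: defaultdict(int))
        for case in data:
            j = tuple(case[pa] for pa in parents[node])
            n_ijk[j][case[node]] += 1
        for counts in n_ijk.values():
            n_ij = sum(counts.values())
            # log[ (r_i - 1)! / (N_ij + r_i - 1)! ] = lgamma(r_i) - lgamma(N_ij + r_i)
            total += lgamma(r_i) - lgamma(n_ij + r_i)
            # log prod_k N_ijk!
            total += sum(lgamma(c + 1) for c in counts.values())
    return total
```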
Probability of N.S. Given a Dataset
We now have an efficient way to calculate P(BS,D).
We could calculate all
P(BSi|D) = P(BSi,D)/(∑P(BS,D))
and find the optimal one, but the set of possible BSi grows
exponentially with n.
However, if we had a set Y of structures for which
∑BS∊Y P(BS,D) ≈ P(D),
and Y is small enough, then we could efficiently
approximate all P(BS|D) for BS∊Y.
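Given log P(BS, D) values for the structures in such a small candidate set Y (for instance from the scoring sketch above), the relative posteriors follow by normalization. A quick sketch, assuming the structure prior is the same for every structure in Y so that it cancels:

```python
# Sketch: turn log P(BS, D) scores into approximate posteriors P(BS | D) over Y.
from math import exp

def approximate_posteriors(log_scores):
    """log_scores: dict structure_name -> log P(BS, D)."""
    m = max(log_scores.values())                      # subtract max for numerical stability
    unnorm = {s: exp(v - m) for s, v in log_scores.items()}
    z = sum(unnorm.values())
    return {s: u / z for s, u in unnorm.items()}
```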
K2 algorithm
We start with:
P(BS, D) = P(BS) ∏i=1..n ∏j=1..qi [ (ri − 1)! / (Nij + ri − 1)! ] ∏k=1..ri Nijk!
Treating the structure prior P(BS) as a constant c and maximizing over structures, the maximization decomposes by node:
max P(BS, D) = c ∏i=1..n max_πi [ ∏j=1..qi (ri − 1)! / (Nij + ri − 1)! ∏k=1..ri Nijk! ]
The time complexity is O(m n² r 2ⁿ). However, if we
assume that a node can have at most u
parents, then the complexity is O(m u n r T(n,u)),
where T(n,u) = ∑k=0..u choose(n,k)
K2 algorithm
Let πi be the set of parents of xi in BS; we denote
this parent assignment as πi → xi. Then we can rewrite P(BS) as
P(BS) = ∏i=1..n P(πi → xi), and we get the formula:
max P(BS, D) = ∏i=1..n max_πi [ P(πi → xi) ∏j=1..qi (ri − 1)! / (Nij + ri − 1)! ∏k=1..ri Nijk! ]
If we assume an ordering of the variables, such
that xj cannot be a parent of xi if i < j (a node's
parents must precede it in the ordering), then the
number of possible πi-s for each xi is smaller,
but the overall complexity is still exponential.
K2 algorithm
K2 is a heuristic algorithm. It takes as input a set of n nodes, an
ordering on the nodes, an upper bound on the number of parents a
node may have, and a database D containing m cases.
It starts with the assumption that a node has no parents, and then
tries to incrementally add parents whose addition most increases
the probability of the resulting structure.
When the addition of no single parent can increase the probability,
the algorithm continues with another node.
The nodes are processed following the given ordering
(starting with x1, which under the ordering cannot have
any parents).
The time complexity is a polynomial, O(m u² n² r), which in the worst
case, when u = n, is O(m n⁴ r).
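Below is a compact sketch of the greedy search the slide describes; it is not the authors' implementation, and the function names, data layout, and toy usage are assumptions. The per-node score g is the node's term of the P(BS, D) formula, computed in log space.

```python
# Sketch of the K2 greedy structure search.
from math import lgamma
from collections import defaultdict

def g(node, parent_set, data, values):
    """log of the node's term: prod_j (r-1)!/(N_ij+r-1)! prod_k N_ijk!"""
    r = len(values[node])
    n_ijk = defaultdict(lambda: defaultdict(int))
    for case in data:
        j = tuple(case[p] for p in parent_set)
        n_ijk[j][case[node]] += 1
    total = 0.0
    for counts in n_ijk.values():
        n_ij = sum(counts.values())
        total += lgamma(r) - lgamma(n_ij + r)
        total += sum(lgamma(c + 1) for c in counts.values())
    return total

def k2(order, u, data, values):
    """order: node ordering (a node's parents must precede it); u: parent limit."""
    parents = {}
    for pos, node in enumerate(order):
        parents[node] = []
        best = g(node, parents[node], data, values)
        improved = True
        while improved and len(parents[node]) < u:
            improved = False
            candidates = [z for z in order[:pos] if z not in parents[node]]
            if not candidates:
                break
            scores = {z: g(node, parents[node] + [z], data, values)
                      for z in candidates}
            best_z = max(scores, key=scores.get)
            if scores[best_z] > best:            # add the parent that helps most
                best, improved = scores[best_z], True
                parents[node].append(best_z)
    return parents

# Hypothetical usage on a tiny +/- case table (values made up for illustration):
data = [{"x1": "+", "x2": "+", "x3": "+"},
        {"x1": "-", "x2": "-", "x3": "-"},
        {"x1": "+", "x2": "+", "x3": "+"}]
values = {v: ["+", "-"] for v in ["x1", "x2", "x3"]}
print(k2(["x1", "x2", "x3"], 2, data, values))
```

Because each node only considers parents that precede it in the ordering and stops as soon as no single addition improves its score, the search stays polynomial, at the cost of possibly missing the globally optimal structure.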
Experimental result
Using a predefined network structure with 37 nodes
and the associated conditional probabilities (the ALARM
network, provided by experts in the domain of medicine,
which describes potential problems with anaesthesia in the
operating room), the authors generated a database with
10000 cases.
They ran the algorithm with the generated database and
an ordering of the nodes that was consistent with the
original structure.
The algorithm almost completely reconstructed the
original network. It missed one original arc, and
added one arc that was not present in the original
network.
Pros and cons
+ Exact algorithms have exponential complexity; this
heuristic algorithm has polynomial complexity.
+ The preliminary results are promising.
+ The algorithm can be extended to cover databases
with missing values. However, this extension is
exponential in the number of missing values.
– The algorithm still requires an ordering of the
nodes/variables.
References
G. Cooper and E. Herskovits, “A Bayesian Method for
the Induction of Probabilistic Networks from Data”,
Machine Learning 9 (1992), pp. 309–347.
Tom Mitchell, Machine Learning, McGraw-Hill, 1997.