Probabilistic Knowledge and Probabilistic Common Knowledge
Paul Krasucki¹, Rohit Parikh², and Gilbert Ndjatou³
Abstract: In this paper we develop a theory of probabilistic common knowledge and
probabilistic knowledge in a group of individuals whose knowledge partitions are not wholly
independent.
1 Introduction
Our purpose in this paper is to extend conventional information theory and to address the
issue of measuring the amount of knowledge that n individuals have in common. Suppose,
for example, that two individuals have partitions which correspond closely; then we would
expect that they share a great deal. However, the conventional definition of mutual knowledge may give us the conclusion that there is no fact which is mutually known, or even
known to one as being known to another.
This is unfortunate because [CM] and [HM] both give us arguments that seem to show
that common knowledge (mutual knowledge if two individuals are involved) is both difficult to attain and necessary for certain tasks. If, however, we can show that probabilistic
knowledge is both easier to attain and a suitable substitute in many situations, then we
have made progress. See [Pa2] for a description of situations where partial knowledge is
adequate for communication.
To this end, we shall develop a theory of probabilistic common knowledge which turns
out to have surprising and fruitful connections both with traditional information theory and
with Markov chains. To be sure, these theories have their own areas of intended application.
Nonetheless, it will turn out that our mathematical theory has many points in common with
these two theories.
The standard Logics of Knowledge tend to use Kripke models with S5 accessibility
relations, one for each knower. One can easily study instead the partitions corresponding
to these accessibility relations and we shall do this. We also assume that the space W of
possible worlds has a probability measure µ given with it.
1. Department of Computer Science, Rutgers-Camden
2. Department of Computer Science, CUNY Graduate Center, 33 West 42nd Street, New York, NY 10036. email: [email protected].
3. Department of Computer Science, College of Staten Island, CUNY and CUNY Graduate Center
In Figure I below, Ann has partition A = {A1 , A2 } and Bob has partition B = {B1 , B2 }
so that each of the sets Ai , Bj has probability .5 and the intersections Ai ∩Bj have probability
.45 when i = j and .05 otherwise.
[Figure I: the space of possible worlds drawn as a square. A vertical line divides A1 from A2 and a slanted line divides B1 from B2; the two regions where Ai and Bi overlap carry probability .45 each, and the two remaining slivers carry probability .05 each.]
Since the meet of the partitions is trivial, there is no common knowledge in the usual
sense of [Au], [HM]. In fact there is no nontrivial proposition p such that Ann knows that
Bob knows p. It is clear, however, that Ann and Bob have nearly the same information
and if the partitions are themselves common knowledge, then Ann and Bob will be able to
guess, with high probability, what the other knows. We would like then to say that Ann and
Bob have probabilistic common knowledge, but how much? One purpose of this paper is to
answer this question and to prove properties of our definition that show why the answer is
plausible.
A closely related question is that of measuring indirect probabilistic knowledge. For
example, we would expect that what Ann knows about Bob’s knowledge is less than or
equal to what Bob himself knows, and what Ann knows of Bob’s knowledge of Carol is in
turn less than or equal to the amount of knowledge that Bob has about Carol’s knowledge.
We would expect in the limit that what Ann knows about what Bob knows about what
Ann knows... about what Bob knows will approach whatever ordinary common knowledge
they have.
It turns out that to tackle these questions successfully, we need a third notion. This
is the notion of the amount of information acquired when one’s probabilities change as a
result of new information (which does not invalidate old information). Suppose for example
that I am told that a certain fruit is a peach. I may then assign a probability of .45 to the
proposition that it is sweet. If I learn then that it just came off a tree, then I will expect
that it was probably picked for shipping and the probability may drop to .2, but if I learn
again that it fell off the tree, then it will rise to .9. In each case I am getting information,
consistent with previous information and causing me to revise my probabilities, but how much information am I getting?
2 Preliminaries
We start by giving some definitions, some old, some apparently new. If a space has 2^n points, all equally likely, then the amount of information gained by knowing the identity of a specific point x is n bits. If one only knows a set X in which x falls, then the information
gained is less, in fact equal to I(X) = − log(µ(X)) where µ(X) is the probability4 of X. If
P = P1 , ..., Pk is a partition of the whole space W , then the expected information when one
discovers the identity of the Pi which contains x, is
H(\mathcal{P}) = \sum_{i=1}^{k} \mu(P_i) I(P_i) = \sum_{i=1}^{k} -\mu(P_i) \log(\mu(P_i))
These definitions so far are standard in the literature [Sh], [Ab], [Dr]. We now introduce
a notion which is apparently new.
Suppose I have a partition P = {P1 , ..., Pk } whose a priori probabilities are y1 , ..., yk ,
but some information that I receive causes me to change them to u1 , ..., uk . How much
information have I received?
Definition 1:

IG(\vec{u}, \vec{y}) = \sum_{i=1}^{k} u_i (\log u_i - \log y_i) = \sum_{i=1}^{k} u_i \log\left(\frac{u_i}{y_i}\right)
Here IG stands for “information gain”.
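As a concrete rendering of this definition, here is a minimal Python sketch (not part of the paper; the function name and representation of distributions as lists are ours). It uses base-2 logarithms and the convention 0 log 0 = 0 from footnote 4, and it assumes y_i > 0 wherever u_i > 0.

```python
# Minimal sketch (not from the paper) of the information-gain function IG(u, y).
# u is the revised distribution, y the a priori one; logs are base 2,
# terms with u_i = 0 contribute 0, and we assume y_i > 0 wherever u_i > 0.
from math import log2

def IG(u, y):
    """Information gain, in bits, of distribution u over distribution y."""
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

# No change of probabilities means no gain; any genuine revision gains something.
assert IG([0.5, 0.5], [0.5, 0.5]) == 0
assert IG([0.9, 0.1], [0.5, 0.5]) > 0
```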
Clearly this definition needs some justification. We will first provide an intuitive explanation, and then prove some properties of this notion IG which will make it more plausible
that it is the right one.
(a) Suppose that the space had 2^n points, and the distribution of probabilities that we had was the flat distribution. Then the set P_i has 2^n · y_i points⁵. After we receive our information, the points are no longer equally likely, and each point in P_i has probability u_i / |P_i| = u_i / (y_i 2^n).
Thus the expected information of the partition of the 2^n singleton sets is

-\sum_{i=1}^{k} (y_i \cdot 2^n) \frac{u_i}{y_i 2^n} \log\left(\frac{u_i}{y_i 2^n}\right)

which comes out to

\alpha = n - \sum_{i=1}^{k} u_i (\log u_i - \log y_i)

Since the flat distribution had expected information n, we have gained information equal to

n - \alpha = n - \left(n - \sum_{i=1}^{k} u_i (\log u_i - \log y_i)\right) = \sum_{i=1}^{k} u_i (\log u_i - \log y_i) = \sum_{i=1}^{k} u_i \log\left(\frac{u_i}{y_i}\right)

(Footnote 4: We will use the letter µ for both absolute and relative probabilities, to save the letter p for other uses. All logs will be to base 2, and since x log(x) → 0 as x → 0 we will take x log(x) to be 0 when x is 0.)
(Footnote 5: There is a tacit assumption here that the y_i are of the form k/2^n. But note that numbers of this form are dense in the unit interval, and if we assume that the function IG is continuous, then it is sufficient to consider numbers of this form.)
(b) In information theory, we have a notion of the information that two partitions P and Q share, also called their mutual information, and usually denoted by I(P; Q):

I(\mathcal{P}; \mathcal{Q}) = \sum_{i,j} \mu(P_i \cap Q_j) \log\left(\frac{\mu(P_i \cap Q_j)}{\mu(P_i) \cdot \mu(Q_j)}\right)
We will recalculate this quantity using the function IG. If Ann has partition P, then with probability µ(P_i) she knows that P_i is true. In that case, she will revise her probabilities of Bob's partition from µ(~Q) to µ(~Q|P_i), and her information gain about Bob's partition is IG(µ(~Q|P_i), µ(~Q)). Summing over all the P_i we get

\sum_{i} \mu(P_i) \cdot IG(\mu(\vec{Q}|P_i), \mu(\vec{Q})) = \sum_{i} \mu(P_i) \left(\sum_{j} \mu(Q_j|P_i) \log\left(\frac{\mu(Q_j|P_i)}{\mu(Q_j)}\right)\right)

and an easy calculation shows that this is the same as

I(\mathcal{P}; \mathcal{Q}) = \sum_{i,j} \mu(P_i \cap Q_j) \log\left(\frac{\mu(P_i \cap Q_j)}{\mu(P_i) \cdot \mu(Q_j)}\right)
Since the calculation through IG gives the same result as the usual formula, this gives
additional support to the claim that our formula for the information gain is the right one.
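As a numerical illustration (a check of our own, not part of the paper), the two routes to I(P; Q) can be compared directly on the Ann/Bob joint distribution from the introduction; both give about .53 bits.

```python
# A small check (not from the paper) that the IG-based computation of I(P;Q)
# agrees with the standard mutual-information formula, on the Ann/Bob example.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

joint = [[0.45, 0.05],            # joint[i][j] = mu(P_i ∩ Q_j)
         [0.05, 0.45]]
muP = [sum(row) for row in joint]                 # mu(P_i)
muQ = [sum(col) for col in zip(*joint)]           # mu(Q_j)

# Standard formula.
I_standard = sum(joint[i][j] * log2(joint[i][j] / (muP[i] * muQ[j]))
                 for i in range(2) for j in range(2) if joint[i][j] > 0)

# Via IG: with probability mu(P_i), Ann revises mu(Q) to mu(Q | P_i).
I_via_IG = sum(muP[i] * IG([joint[i][j] / muP[i] for j in range(2)], muQ)
               for i in range(2))

assert abs(I_standard - I_via_IG) < 1e-12
print(round(I_standard, 2))       # approximately 0.53
```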
3 Properties of information gain
Theorem 1: (a) IG(~u, ~v) ≥ 0 and IG(~u, ~v) = 0 iff ~u = ~v.
(b1) If ~p = µ(~P) and if there is a set X such that u_i = µ(P_i|X) for all i, then

IG(\vec{u}, \vec{p}) \le -\log(\mu(X))

Thus the information received, by way of a change of probabilities, is less than or equal to the information I(X) contained in X.
(b2) Equality obtains in (b1) above iff for all i, either µ(P_i ∩ X) = µ(P_i), or else µ(P_i ∩ X) = 0. Thus if all nonempty sets involved have non-zero measure, every P_i is either a subset of X or disjoint from it.
Proof: (a) It is straightforward to show using elementary calculus that log x < (x − 1) log e except when x = 1, when the two are equal. (Here e is the number whose natural log is 1, so log e = log₂ e = 1/ln 2; the line y = (x − 1) log e is tangent to the curve y = log x at (1, 0), and lies above it.) Replacing x by 1/x we get log x > (1 − 1/x) log e, again with equality only at x = 1. This yields

IG(\vec{u}, \vec{v}) = \sum_{i} u_i \log\frac{u_i}{v_i} \ge \left(\sum_{i} u_i \left(1 - \frac{v_i}{u_i}\right)\right) \log e = \left(\sum_{i} u_i - \sum_{i} v_i\right) \log e = 0

with equality holding iff, for all i, either u_i/v_i = 1 or u_i = 0. However, the case u_i = 0 with v_i > 0 cannot arise: in either case u_i ≤ v_i, and since \sum_i u_i = \sum_i v_i = 1, we must have u_i = v_i for all i.
(b1) Let u_i = µ(P_i|X), ~p = (µ(P_1), ..., µ(P_k)).

IG(\vec{u}, \vec{p}) = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i|X)}{\mu(P_i)} = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)\mu(X)} = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)} - \sum_{i=1}^{k} \mu(P_i|X) \log\mu(X) = \alpha + I(X)

where \alpha = \sum_{i=1}^{k} \mu(P_i|X) \log\frac{\mu(P_i \cap X)}{\mu(P_i)} \le 0, since \frac{\mu(P_i \cap X)}{\mu(P_i)} \le 1 for all i and \sum_{i=1}^{k} \mu(P_i|X) = 1.
(b2) α = 0 only if, for all i, µ(P_i|X) = 0 or µ(P_i ∩ X) = µ(P_i), i.e. either P_i ∩ X = ∅ or P_i ⊆ X (X is a union of the P_i's). □
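For concreteness, here is a small numerical illustration of (b1) and (b2) (our own; the four-cell partition and the event X are invented examples, not the paper's): conditioning on X never gains more than I(X) = −log µ(X), and the bound is attained when X is a union of cells of the partition.

```python
# Illustration (not from the paper) of Theorem 1 (b1)/(b2):
# IG(mu(P|X), mu(P)) <= -log2(mu(X)), with equality when X is a union of cells of P.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

muP = [0.4, 0.3, 0.2, 0.1]              # an arbitrary partition P1..P4
muPX = [0.10, 0.15, 0.05, 0.0]          # mu(Pi ∩ X) for some event X, mu(X) = 0.3
muX = sum(muPX)
cond = [m / muX for m in muPX]          # mu(Pi | X)
assert IG(cond, muP) <= -log2(muX) + 1e-12

# Equality case: X = P1 ∪ P2, so each mu(Pi ∩ X) is either mu(Pi) or 0.
muPX2 = [0.4, 0.3, 0.0, 0.0]
muX2 = sum(muPX2)
cond2 = [m / muX2 for m in muPX2]
assert abs(IG(cond2, muP) - (-log2(muX2))) < 1e-12
```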
If we learn that one of the sets we had initially considered possible (its probability was greater than zero) can be excluded, then our information gain is least if the probability of the excluded piece is distributed over all the other elements of the partition, proportionately to their initial probabilities. The gain is greatest when the probability of the excluded piece is shifted to a single element of the partition, and this element was initially one of the least likely elements.
Theorem 2: Let ~v = (v_1, ..., v_{k−1}, v_k), ~u = (u_1, ..., u_{k−1}, u_k), where u_k = 0, u_i = v_i + a_i v_k for i = 1, ..., k − 1, a_i ≥ 0, \sum_{i=1}^{k-1} a_i = 1, and v_k > 0. Then:
(a) IG(~u, ~v) is minimum when a_i/v_i = c is the same for all i = 1, ..., k − 1, and c = \frac{1}{1-v_k}; moreover, this minimum value is just −log(1 − v_k).
(b) IG(~u, ~v) is maximum when a_i = 1 for some i such that v_i = \min_{j=1,...,k-1}(v_j) and the other a_j are 0.
Proof: (a) Let ~a = (a_1, ..., a_{k−2}, a_{k−1}). Since \sum_{i=1}^{k-1} a_i = 1 we have a_{k-1} = 1 - \sum_{i=1}^{k-2} a_i.
So we need only look at f : [0, 1]^{k−2} → R, defined by:

f(\vec{a}) = IG(\vec{u}, \vec{v}) = \sum_{i=1}^{k-2} (v_i + a_i v_k) \log\frac{v_i + a_i v_k}{v_i} + \left(v_{k-1} + v_k\left(1 - \sum_{j=1}^{k-2} a_j\right)\right) \log\frac{v_{k-1} + v_k\left(1 - \sum_{j=1}^{k-2} a_j\right)}{v_{k-1}}
To find the extrema of f in [0, 1]^{k−2}, consider the partial derivatives

\frac{\partial f}{\partial a_i} = v_k \left(\log\frac{v_i + a_i v_k}{v_i} - \log\frac{v_{k-1} + v_k\left(1 - \sum_{j=1}^{k-2} a_j\right)}{v_{k-1}}\right) \log e

Now \frac{\partial f}{\partial a_i} = 0 iff \frac{v_i + a_i v_k}{v_i} = \frac{v_{k-1} + v_k\left(1 - \sum_{j=1}^{k-2} a_j\right)}{v_{k-1}}. Recall that a_{k-1} = 1 - \sum_{i=1}^{k-2} a_i. Then we have, for all i, \frac{a_i}{v_i} = \frac{a_{k-1}}{v_{k-1}}, or a_i = c v_i where c is a constant and i ranges over 1, ..., k − 1. If we add these equations and use the fact that \sum_{i=1}^{k-1} a_i = 1 and the fact that \sum_{i=1}^{k-1} v_i = 1 - v_k, we get c = \frac{1}{1-v_k}. Now \frac{\partial f}{\partial a_i} is an increasing function of a_i, so it is > 0 iff a_i > \frac{v_i}{1-v_k} and it is < 0 iff a_i < \frac{v_i}{1-v_k}. Thus f has a minimum when a_i = \frac{v_i}{1-v_k} for all i. The fact that this minimum value is −log(1 − v_k) is easily calculated by substitution. Note that this quantity is exactly equal to I(X) where X is the complement of the set P_k whose probability was v_k. Thus we have an exact correspondence with parts (b1) and (b2) of the previous theorem.
(b) To get the maximum, note that since the first derivatives \frac{\partial f}{\partial a_i} are always increasing, and the second derivatives are all positive, the maxima can only occur at the vertices of [0, 1]^{k−1}. (If they occurred elsewhere, we could increase the value by moving in some direction.) Now the values of f at the points p_j = (0, ..., 0, 1, 0, ..., 0) (that is, a_i = δ(i, j)) are IG(\vec{u}, \vec{v}) = g(v_j), where g(x) = (x + v_k) \log\frac{x + v_k}{x}. But g is a decreasing function of x, so IG(~u, ~v) is maximum when a_j = 1 for some j such that v_j is minimal. □
Example 1: Suppose for example that a partition {P_1, P_2, P_3, P_4} is such that all the P_i have probabilities equal to .25. If we now receive the information that P_4 is impossible, then we will have gained information approximately equal to IG((.33, .33, .33, 0), (.25, .25, .25, .25)) ≈ 3 · (.33) log(.33/.25) ≈ log(4/3) ≈ .42. Similarly if we discover instead that it is P_3 which is impossible. If, however, we only discover that the total probability of the set P_3 ∪ P_4 has decreased to .33, then our information gain is only IG((.33, .33, .17, .17), (.25, .25, .25, .25)) ≈ .08, which is much less. And this makes sense, since knowing that the set P_3 ∪ P_4 has gone down in weight tells us less than knowing that half of it is no longer to be considered, and moreover which half.
If we discover that P_4 is impossible and all the cases that we had thought to be in P_4 are in fact in P_1, then the information gain is IG((.50, .25, .25, 0), (.25, .25, .25, .25)) = \frac{1}{2} \log 2, which is .5 and more than our information gain in the two previous cases.
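The figures in Example 1 are easy to reproduce; the check below is ours, not the paper's. Note that the first value, log(4/3) ≈ .415, is exactly the minimum −log(1 − v₄) promised by Theorem 2(a), since in that case the excluded probability is spread proportionately over the remaining cells.

```python
# Reproducing Example 1 (a check of ours). The prior is flat on {P1,...,P4}.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

prior = [0.25, 0.25, 0.25, 0.25]
third = 1 / 3
print(round(IG([third, third, third, 0.0], prior), 2))   # ~0.42 = log2(4/3) = -log2(1 - 0.25)
print(round(IG([0.33, 0.33, 0.17, 0.17], prior), 2))     # ~0.08
print(round(IG([0.50, 0.25, 0.25, 0.0], prior), 2))      # 0.5 = (1/2) log2(2)
```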
Example 2: As the following example shows, IG doesn't satisfy the triangle inequality. I.e., if we revise our probabilities from ~y to ~u and then again to ~v, our total gain can be less than revising it straight from ~y to ~v. This may perhaps explain why we do not notice gradual changes, but are struck by the cumulative effect of all of them.
Take ~v = (0.1, 0.9), ~u = (0.25, 0.75), ~y = (0.5, 0.5). Then IG(~v, ~u) + IG(~u, ~y) ≈ 0.10 + 0.19 = 0.29, while IG(~v, ~y) ≈ 0.53. Also IG(~y, ~v) ≈ 0.74, so that IG is not symmetric.
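A direct computation of the four quantities (ours, not the paper's) confirms both the failure of the triangle inequality and the asymmetry.

```python
# Checking Example 2 (our computation): IG is not symmetric, and the two-step
# revision y -> u -> v gains less in total than the direct revision y -> v.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

v, u, y = (0.1, 0.9), (0.25, 0.75), (0.5, 0.5)
print(round(IG(v, u), 2), round(IG(u, y), 2))   # ~0.10 and ~0.19, sum ~0.29
print(round(IG(v, y), 2))                       # ~0.53, the direct revision
print(round(IG(y, v), 2))                       # ~0.74, so IG is not symmetric
```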
Another way to see that this failure of the triangle inequality is reasonable is to notice
that we could have gained information by first relativising to a set X, and then to another
set Y , gaining information ≤ − log(µ(X)) and − log(µ(Y )) respectively. However, to get
the cumulative information gain, we might need to relativise to X ∩ Y whose probability
might be much less than µ(X)µ(Y ).
We have defined the mutual knowledge I(P; Q) of two partitions P, Q. If we denote their join as P + Q, then the quantity usually denoted in the literature as H(P, Q) is merely H(P + Q). The connection between mutual information and entropy is well known [Ab]:

H(P + Q) = H(P) + H(Q) − I(P; Q)

Moreover, the equivocation H(P|Q) of P with respect to Q is defined as H(P|Q) = H(P) − I(P; Q). If i and j are agents with respective partitions P_i and P_j, then inf(ij) will be just I(P_i; P_j).
The equivocations are non-negative, and I is symmetric, and so we have:
I(P; Q) ≤ min(H(P), H(Q))
Thus what Ann knows about Bob’s knowledge is always less than what Bob knows and
what Ann herself knows.
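These relations are easy to check numerically on the Ann/Bob example from the introduction; the sketch below (not from the paper) verifies H(P + Q) = H(P) + H(Q) − I(P; Q) and I(P; Q) ≤ min(H(P), H(Q)).

```python
# Numerical check (not from the paper) of the entropy identities above,
# on the Ann/Bob joint distribution of the introduction.
from math import log2

def H(dist):
    """Entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

joint = [[0.45, 0.05], [0.05, 0.45]]                 # mu(P_i ∩ Q_j)
muP = [sum(row) for row in joint]
muQ = [sum(col) for col in zip(*joint)]
H_join = H([p for row in joint for p in row])        # H(P + Q), entropy of the join
I = sum(joint[i][j] * log2(joint[i][j] / (muP[i] * muQ[j]))
        for i in range(2) for j in range(2) if joint[i][j] > 0)

assert abs(H_join - (H(muP) + H(muQ) - I)) < 1e-12   # H(P+Q) = H(P) + H(Q) - I(P;Q)
assert I <= min(H(muP), H(muQ)) + 1e-12              # equivocations are non-negative
print(round(I, 2), round(H(muP) - I, 2))             # ~0.53 bits shared, H(P|Q) ~0.47
```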
We want now to generalise these notions to more than two people, for which we will
need a notion from the theory of Markov chains, namely stochastic matrices. We start by
making a connection between boolean matrices and the usual notion of knowledge.
4 Common knowledge and Boolean matrices
We start by reviewing some notions from ordinary knowledge theory, [Au], [HM], [PK].
Definition 2 : Suppose that {1,...,k} are individuals and i has knowledge partition Pi .
If w ∈ W then i knows E at w iff P_i(w) ⊆ E, where P_i(w) is the element of the partition P_i containing w. K_i(E) = {w | i knows E at w}. Note that K_i(E) is always a subset of E.
Write w ≈_i w′ if w and w′ are in the same element of the partition P_i (iff P_i(w) = P_i(w′)). Then i knows E at w iff for all w′, w ≈_i w′ → w′ ∈ E.
Also, it follows that i knows that j knows E at w iff w ∈ K_i(K_j(E)) iff \bigcup_{l \le n} \{P_j^l \mid P_j^l \cap P_i(w) \neq \emptyset\} \subseteq E, i.e. {w′ | ∃v such that w ≈_i v ≈_j w′} ⊆ E.
Definition 3: An event E is common knowledge between a group of individuals i_1, ..., i_m at w iff (∀j_1, ..., j_k ∈ {i_1, ..., i_m}) (w ≈_{j_1} w_1, ..., w_{k−1} ≈_{j_k} w′) → (w′ ∈ E), iff for all X ∈ {K_1, ..., K_n}^∗, w ∈ X(E).
We now analyse knowledge and common knowledge using boolean transition matrices7 :
Definition 4: The boolean transition matrix B_ij of ij is defined by letting B_ij(k, l) = 1 if P_i^k ∩ P_j^l ≠ ∅, and 0 otherwise.
We can extend this definition to a string of individuals x = i1 ...ik :
Definition 5: The boolean transition matrix B_x for a string x = i_1 ... i_k is B_x = B_{i_1 i_2} ⊗ B_{i_2 i_3} ⊗ ... ⊗ B_{i_{k−1} i_k}, where ⊗ is defined as normalised matrix multiplication: if (B × B′)(k, l) > 0 then (B ⊗ B′)(k, l) is set to 1, otherwise it is 0. We can also define ⊗ as: (B ⊗ B′)(k, l) = \bigvee_{m=1}^{n} (B(k, m) ∧ B′(m, l)).
We say that there is no non-trivial common knowledge iff the only event that is common
knowledge at any w is the whole space W .
Fact 1: There is no non-trivial common knowledge iff for every string x including all individuals, \lim_{n\to\infty} B_x^n = 1, where 1 is the matrix filled with 1's only.
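A minimal sketch (ours, not the paper's) of the normalised boolean product ⊗ from Definition 5, applied to the Ann/Bob partitions of the introduction: every A_i meets every B_j, so B_ij is already the all-1's matrix, every product B_x stays all 1's, and by Fact 1 there is no nontrivial common knowledge.

```python
# Sketch (not from the paper) of Definition 5's normalised boolean product and Fact 1.
def bool_mult(B, C):
    """(B ⊗ C)(k, l) = OR over m of (B(k, m) AND C(m, l))."""
    return [[1 if any(B[k][m] and C[m][l] for m in range(len(C))) else 0
             for l in range(len(C[0]))] for k in range(len(B))]

# Ann/Bob example: every A_i intersects every B_j, so B_ij is the all-ones matrix,
# and so is every boolean product B_x; by Fact 1 there is no nontrivial common knowledge.
B_ij = [[1, 1], [1, 1]]
B_iji = bool_mult(B_ij, B_ij)        # B_ij ⊗ B_ji, with B_ji = B_ij here
assert B_iji == [[1, 1], [1, 1]]
```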
We now consider the case of stochastic matrices.
5 Information via a string of agents
When we consider boolean transition matrices, we may lose some information. If we know the probabilities of all the elements of the σ-field generated by the join of the partitions P_i, the boolean transition matrix B_ij is created by putting a 1 in position (k, l) iff µ(P_j^l | P_i^k) > 0, and 0 otherwise. We keep more of the information by having µ(P_j^l | P_i^k) in position (k, l). We denote this matrix by M_ij and we call it the transition matrix from i to j.
(Footnote 7: The subscripts to the matrices will denote the knowers, and the row and column will be presented explicitly as arguments. Thus B_ij(k, l) is the entry in the kth row and lth column of the matrix B_ij.)
Definition 6: For every i, j, the ij-transition matrix M_ij is defined by: M_ij(a, b) = µ(P_j^b | P_i^a).
For all i, M_ii is the unit matrix of dimension equal to the size of partition P_i.
Definition 7: If x is a string of elements of {1, ..., k} (x ∈ {1, ..., k}^∗, x = x_1 ... x_n), then M_x = M_{x_1 x_2} × ... × M_{x_{n−1} x_n} is the transition matrix for x.
We now define inf(ixj), where x is a sequence of agents. inf(ixj) will be the information
that i has about j via x. If e.g. i = 3, x = 1, j = 2, we should interpret inf(ixj) as the
amount of information 3 has about 1’s knowledge of 2.
Example 3: In our example in the introduction, if i were Ann and j were Bob, then we would get

M_{ij} = \begin{pmatrix} .9 & .1 \\ .1 & .9 \end{pmatrix}

The matrix M_{ji} equals the matrix M_{ij} and the matrix M_{iji} is

M_{iji} = \begin{pmatrix} .82 & .18 \\ .18 & .82 \end{pmatrix}

Thus it turns out that each of Ann and Bob has .53 bits of knowledge about the other and Ann has .32 bits of knowledge about Bob's knowledge of her.
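The matrices and the two figures quoted here can be recomputed from the joint distribution of the introduction; the check below is ours, not the paper's, and uses the expected information gain of the rows, which is exactly the quantity inf(ixj) made precise in Definition 8 below.

```python
# Recomputing Example 3 (a check of ours): the transition matrices and the
# .53 / .32 bits, via the expected information gain of the rows (Definition 8 below).
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

joint = [[0.45, 0.05], [0.05, 0.45]]                                   # mu(A_a ∩ B_b)
muA = [sum(row) for row in joint]
muB = [sum(col) for col in zip(*joint)]
M_ij = [[joint[a][b] / muA[a] for b in range(2)] for a in range(2)]    # mu(B_b | A_a)
M_ji = [[joint[a][b] / muB[b] for a in range(2)] for b in range(2)]    # mu(A_a | B_b)
M_iji = matmul(M_ij, M_ji)

print(M_ij)    # rows approximately (.9, .1) and (.1, .9), up to floating-point noise
print(M_iji)   # rows approximately (.82, .18) and (.18, .82)
inf_ij  = sum(muA[a] * IG(M_ij[a],  muB) for a in range(2))
inf_iji = sum(muA[a] * IG(M_iji[a], muA) for a in range(2))
print(round(inf_ij, 2), round(inf_iji, 2))    # ~0.53 and ~0.32 bits
```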
Definition 8: Let \vec{m}_l = (m_{l1}, ..., m_{lk}) be the lth row vector of the transition matrix M_{ixj} (so m_{lt} = µ(P_j^t |_x P_i^l), where µ(P_j^t |_x P_i^l) is the probability that a point in P_i^l will end up in P_j^t after a random move within P_i^l followed by a sequence of random moves respectively within the elements of those P_{x_r} which form x). Then:

inf(ixj) = \sum_{l=1}^{k} \mu(P_i^l) \, IG(\vec{m}_l, \mu(\vec{P}_j))

where IG(\vec{m}_l, \mu(\vec{P}_j)) is the information gain of the distribution \vec{m}_l over the distribution \mu(\vec{P}_j).
The intuitive idea is that the a priori probabilities of j's partition are \mu(\vec{P}_j). However, if w is in P_i^l, the lth set in i's partition, then these probabilities will be revised according to the lth row of the matrix M_{ixj} and the information gain will be IG(\vec{m}_l, \mu(\vec{P}_j)). The expected information gain for i about j via x is then obtained by multiplying by the µ(P_i^l)'s and summing over all l.
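The definition translates directly into code. The sketch below is ours, not the paper's; representing partitions as cell labels over a finite set of worlds is an implementation choice. It computes inf for an arbitrary string of agents and reproduces inf(ij) ≈ .53 and inf(iji) ≈ .32 for Ann and Bob.

```python
# Sketch (not from the paper) of Definition 8: inf for a string of agents,
# computed from partitions given as cell labels over a finite set of worlds.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def marginal(mu, labels, k):
    return [sum(p for p, c in zip(mu, labels) if c == j) for j in range(k)]

def transition(mu, lab_a, lab_b, k):
    """Matrix whose entry (m, l) is mu(P_b^l | P_a^m)."""
    mass_a = marginal(mu, lab_a, k)
    return [[sum(p for p, ca, cb in zip(mu, lab_a, lab_b) if ca == m and cb == l) / mass_a[m]
             for l in range(k)] for m in range(k)]

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def inf_string(mu, partitions, string, k):
    """inf(x_1 ... x_n): expected IG of the rows of M_x over the prior of the last agent."""
    M = None
    for a, b in zip(string, string[1:]):
        step = transition(mu, partitions[a], partitions[b], k)
        M = step if M is None else matmul(M, step)
    first = marginal(mu, partitions[string[0]], k)
    last = marginal(mu, partitions[string[-1]], k)
    return sum(first[m] * IG(M[m], last) for m in range(k))

# Ann/Bob example: four worlds, one per intersection A_i ∩ B_j.
mu = [0.45, 0.05, 0.05, 0.45]
ann = [0, 0, 1, 1]       # which cell of Ann's partition each world lies in
bob = [0, 1, 0, 1]       # which cell of Bob's partition each world lies in
P = {'i': ann, 'j': bob}
print(round(inf_string(mu, P, 'ij', 2), 2))    # ~0.53
print(round(inf_string(mu, P, 'iji', 2), 2))   # ~0.32
```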
Example 4: Consider M_{iji}. For convenience we'll denote elements P_i^m by A_m and elements P_j^m by B_m (so that the A's are elements of i's partition, and the B's are elements of j's partition). Therefore M_{iji} = M_{ij} × M_{ji} where:

M_{ij} = \begin{pmatrix} \mu(B_1|A_1) & \cdots & \mu(B_k|A_1) \\ \mu(B_1|A_2) & \cdots & \mu(B_k|A_2) \\ \vdots & \ddots & \vdots \\ \mu(B_1|A_k) & \cdots & \mu(B_k|A_k) \end{pmatrix}

M_{ji} = \begin{pmatrix} \mu(A_1|B_1) & \cdots & \mu(A_k|B_1) \\ \mu(A_1|B_2) & \cdots & \mu(A_k|B_2) \\ \vdots & \ddots & \vdots \\ \mu(A_1|B_k) & \cdots & \mu(A_k|B_k) \end{pmatrix}

M_{iji} is the matrix of probabilities µ(A_l |_j A_m) for l, m = 1, ..., k, where µ(A_l |_j A_m) is the probability that a point in A_m will end up in A_l after a random move within A_m followed by a random move within some B_s.

M_{iji} = \begin{pmatrix} \mu(A_1|_j A_1) & \mu(A_1|_j A_2) & \cdots & \mu(A_1|_j A_k) \\ \mu(A_2|_j A_1) & \mu(A_2|_j A_2) & \cdots & \mu(A_2|_j A_k) \\ \vdots & \vdots & \ddots & \vdots \\ \mu(A_k|_j A_1) & \mu(A_k|_j A_2) & \cdots & \mu(A_k|_j A_k) \end{pmatrix}
Note that for x = λ, where λ is the empty string, inf(ij) = I(P_i; P_j), as in the standard definition:

inf(ij) = \sum_{l=1}^{k} \mu(P_i^l) \, IG(\mu(\vec{P}_j|P_i^l), \mu(\vec{P}_j)) = \sum_{l=1}^{k} \mu(P_i^l) \sum_{t=1}^{k} \mu(P_j^t|P_i^l) \log\frac{\mu(P_j^t|P_i^l)}{\mu(P_j^t)} = \sum_{l,t=1}^{k} \mu(P_j^t \cap P_i^l) \log\frac{\mu(P_j^t \cap P_i^l)}{\mu(P_j^t)\mu(P_i^l)}

6 Properties of transition matrices
The results in this section are either from the theory of Markov chains, or easily derived
from these.
Definition 9 : A matrix M is stochastic if all elements of M are reals in [0,1] and the
sum of every row is 1.
Fact 2 : For every x, the matrix Mx is stochastic.
Definition 10: A matrix M is regular if there is m such that M^m(k, l) > 0 for all (k, l).
The following fact establishes a connection between regular stochastic matrices and
common knowledge:
Fact 3 : Matrix Mixi is regular iff there is no common knowledge between i and
individuals from x.
Fact 4: For every regular stochastic matrix M, there is a matrix M′ such that

\lim_{n\to\infty} M^n = M'

M′ is stochastic, and all the rows in M′ are the same. Moreover the rate of convergence is exponential: for a given column r, let d_n(r) be the difference between the maximum and the minimum entry of M^n in that column. Then there is ε < 1 such that for all columns r and all sufficiently large n, d_n(r) ≤ ε^n.
By combining the last two facts we get the following corollary:
Fact 5: If there is no common knowledge between i and the individuals in x, then

\lim_{n\to\infty} (M_{ixi})^n = M

where M is stochastic, and all rows in M are equal to the vector ~u_i of probabilities of the sets in the partition P_i.
A matrix with all rows equal represents the situation that all information is lost and all
that is known is the a priori probabilities.
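For the Ann/Bob matrix M_iji the convergence in Fact 5 is easy to watch numerically; the sketch below is ours, not the paper's. The rows of (M_iji)^n tend to the prior (.5, .5) at the geometric rate 0.64 = .82 − .18, and the expected information gain of the rows tends to 0, anticipating Theorem 3 of the next section.

```python
# Watching Fact 5 numerically (sketch, not from the paper): powers of the regular
# matrix M_iji for Ann and Bob have rows converging to the prior (.5, .5), and the
# expected information gain carried by the rows shrinks to 0 exponentially fast.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

M_iji = [[0.82, 0.18], [0.18, 0.82]]
prior = [0.5, 0.5]
M = [[1.0, 0.0], [0.0, 1.0]]                  # (M_iji)^0
for n in range(1, 11):
    M = matmul(M, M_iji)                      # M is now (M_iji)^n
    if n in (1, 2, 5, 10):
        gain = sum(prior[l] * IG(M[l], prior) for l in range(2))
        print(n, [round(x, 4) for x in M[0]], round(gain, 6))
# Each extra power shrinks the distance of the rows from (.5, .5) by the factor
# 0.64 = 0.82 - 0.18, so the information gain decays roughly like 0.64^(2n).
```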
Fact 6: If L, S are stochastic matrices and all the rows of L are equal, then S × L = L, and L × S = L′, where all rows in L′ are equal (though they may be different from those of L).
Fact 7: For any stochastic matrix S and regular matrix M_{ixi}:

S \times \lim_{n\to\infty} (M_{ixi})^n = M'

where M' = \lim_{n\to\infty} (M_{ixi})^n.
Definition 11: For a given partition P_i and string x = x_1 x_2 ... x_k we can define a relation ≈_x between the partitions P_i and P_j. P_i^m ≈_x P_j^n iff for w ∈ P_i^m and w′ ∈ P_j^n, there are v_1, ..., v_{k−1} such that v_1 ∈ P_i^m, v_k ∈ P_j^n and w ≈_{x_1} v_1 ≈_{x_2} ... v_{k−1} ≈_{x_k} w′.
Definition 12 : ≈∗x is the transitive closure of ≈x . It is an equivalence relation.
Fact 8: Assume that x contains all j. Then the relation ≈^∗_x does not depend on the particular x and we may drop the x. P_i^m ≈^∗ P_j^n iff P_i^m and P_j^n are subsets of the same element of P^−, where P^− is the meet of the partitions of all the individuals.
Observation: We can permute the elements of the partition P_i so that the elements of the same equivalence class of ≈^∗ have consecutive numbers, and then M_{ixi} looks as follows:

M_{ixi} = \begin{pmatrix} M_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M_r \end{pmatrix}

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈^∗. All submatrices M_l are square and regular.
Note that if there is no common knowledge then ≈∗ has a single equivalence class.
Since we can always renumber the elements of the partitions so that the transition matrix
is in the form described above, we will assume from now on that the transition matrix is
always given in such a form.
Fact 9: If x contains all j then

\lim_{n\to\infty} (M_{ixi})^n = M

where M is stochastic, the submatrices M_l of M are regular (in fact positive) and all the rows within every submatrix M_l are the same.
7 Properties of inf(ixj)
Theorem 3: If there is no common knowledge and x includes all the individuals, then

\lim_{n\to\infty} inf(i(jxj)^n) = 0

Proof: The matrix M = \lim_{n\to\infty} (M_{jxj})^n has all rows positive and equal. Let \vec{m} be a row vector of M. Then \lim_{n\to\infty} inf(i(jxj)^n) = IG(\vec{m}, \mu(\vec{P}_j)). Since the limiting vector \vec{m} is equal to the distribution \mu(\vec{P}_j), we get: \lim_{n\to\infty} inf(i(jxj)^n) = IG(\mu(\vec{P}_j), \mu(\vec{P}_j)) = 0. □
The last theorem can be easily generalised to the following:
Fact 10: If there is no common knowledge among the individuals in x, and i, j occur in x, then as n → ∞, inf(ix^n j) goes to zero.
8 Probabilistic common knowledge
Common knowledge is very rare. But, even if there is no common knowledge in the system,
we often have probabilistic common knowledge.
Definition 13 : Individuals {1, ..., n} have probabilistic common knowledge if
∀x ∈ {1, ..., n}∗ inf(x) > 0
We note that there is no probabilistic common knowledge in the system iff there is some
string x such that for some i, Mxi is a matrix with all rows equal and Mxi (·, t) = µ(Pit ) for
all t.
Theorem 4 : If there is common knowledge in the system then there is probabilistic
common knowledge, and
∀x ∈ {1, ..., n}∗ inf(x) ≥ H(P − )
Proof: We know from Fact 9 that

M_{ixi} = \begin{pmatrix} M_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & M_r \end{pmatrix}

where M_l for l ≤ r is the matrix corresponding to the transitions within one equivalence class of ≈^∗_x, and all submatrices M_l are square and regular. Here r is the number of elements of the partition P^−. Suppose that the probabilities of the sets in the partition P_i are u_1, ..., u_k and that the probabilities of the partition P^− are w_1, ..., w_r. Each w_j is going to be the sum of those u_l where the lth set in the partition P_i is a subset of the jth set in the partition P^−. Let \vec{m}_l be the lth row of the matrix M_{ixi}. Then inf(ixi) is \sum_{l=1}^{k} u_l \, IG(\vec{m}_l, \vec{u}). The row \vec{m}_l consists of zeroes, except in places corresponding to subsets of the appropriate element P_j^− of P^−. Then, by Theorem 2, part (a): IG(\vec{m}_l, \vec{u}) \ge \log\frac{1}{1-(1-w_j)} = -\log w_j. This quantity may repeat, since several elements of P_i may be contained in P_j^−. When we add up all the multipliers u_l that occur with \log w_j, these multipliers also add up to w_j. Thus we get

inf(ixi) \ge \sum_{j=1}^{r} -w_j \log(w_j) = H(P^-)

□
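Here is a small check of the bound (ours, not the paper's; the three-cell partitions and their probabilities are an invented example with a nontrivial meet). In this example inf(iji) comes out exactly equal to H(P−) = 1 bit, so the inequality is attained.

```python
# A small check (not from the paper) of Theorem 4 on an example *with* common
# knowledge: Ann's partition {A1, A2, A3} and Bob's {B1, B2, B3} both refine the
# meet P- = {A1 ∪ A2, A3} = {B1, B2 ∪ B3}, whose blocks have probability .5 each.
from math import log2

def IG(u, y):
    return sum(ui * log2(ui / yi) for ui, yi in zip(u, y) if ui > 0)

def matmul(A, B):
    return [[sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

muA = [0.3, 0.2, 0.5]                                  # probabilities of A1, A2, A3 (invented)
M_ij = [[1, 0, 0], [1, 0, 0], [0, 0.6, 0.4]]           # mu(B_b | A_a): respects the blocks
M_ji = [[0.6, 0.4, 0], [0, 0, 1], [0, 0, 1]]           # mu(A_a | B_b)
M_iji = matmul(M_ij, M_ji)                             # block-diagonal, as in Fact 9

inf_iji = sum(muA[a] * IG(M_iji[a], muA) for a in range(3))
H_meet = -(0.5 * log2(0.5)) * 2                        # H(P-) = 1 bit
assert inf_iji >= H_meet - 1e-12                       # Theorem 4 (here with equality)
print(round(inf_iji, 2), H_meet)                       # 1.0 1.0
```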
We can also show:
Theorem 5: If x contains i, j and there is common knowledge between i, j and all the components of x, then the limiting information always exists and \lim_{n\to\infty} inf(i(jxj)^n) = H(P^-).
We postpone the proof to the full paper.
References
[Ab] Abramson, N., Information Theory and Coding, McGraw-Hill, 1963.
[AH] Abadi, M. and Halpern, J., "Decidability and expressiveness for first-order logics of probability", Proc. of the 30th Annual Conference on Foundations of Computer Science, 1989, pp. 148-153.
[Au] Aumann, R., "Agreeing to Disagree", Annals of Statistics, 4, 1976, pp. 1236-1239.
[Ba] Bacchus, F., "On Probability Distributions over Possible Worlds", Proceedings of the 4th Workshop on Uncertainty in AI, 1988, pp. 15-21.
[CM] Clark, H. H. and Marshall, C. R., "Definite Reference and Mutual Knowledge", in Elements of Discourse Understanding, ed. Joshi, Webber and Sag, Cambridge U. Press, 1981.
[Dr] Dretske, F., Knowledge and the Flow of Information, MIT Press, 1981.
[Ha] Halpern, J., "An analysis of first-order logics of probability", Proc. of the 11th International Joint Conference on Artificial Intelligence (IJCAI 89), 1989, pp. 1375-1381.
[HM] Halpern, J. and Moses, Y., "Knowledge and Common Knowledge in a Distributed Environment", Proc. 3rd ACM Conf. on Principles of Distributed Computing, 1984, pp. 50-61.
[KS] Kemeny, J. and Snell, L., Finite Markov Chains, Van Nostrand, 1960.
[Pa] Parikh, R., "Levels of Knowledge in Distributed Computing", Proc. IEEE Symposium on Logic in Computer Science, 1986, pp. 322-331.
[Pa2] Parikh, R., "A Utility Based Approach to Vague Predicates", to appear.
[PK] Parikh, R. and Krasucki, P., "Levels of Knowledge in Distributed Computing", Research report, Brooklyn College, CUNY, 1986. Revised version of [Pa] above.
[Sh] Shannon, C., "Mathematical Theory of Communication", Bell System Technical Journal, 28, 1948. (Reprinted in: Shannon and Weaver, A Mathematical Theory of Communication, University of Illinois Press, 1964.)