Notes for Dr. Vargas's guest lecture of September 9, 2003

CSCE 822
Data Mining & Knowledge Discovery
Information as a Measure [1]
Several books were written in the late 1950s to formally define the
concept of information as a measure of surprise (or the lack of it), or as
the uncertainty of outcomes. These books were inspired by earlier
work by Shannon and Wiener, who independently arrived at the
same expression for average information.
Let X be a random variable associated with a sample space Ω having n
mutually exclusive events, such that
{E} = [e_1, e_2, e_3, …, e_n]
with probabilities
{P} = [p_1, p_2, p_3, …, p_n]
so that
\bigcup_{k=1}^{n} E_k = \Omega \qquad\text{and}\qquad \sum_{k=1}^{n} p_k = 1
Let E(X) be some function such that, if experiments are conducted
many times, the average of X will approach E(X). Shannon and
Wiener suggested the expression below to quantify the average
uncertainty (or chaos, or disorder, or entropy) associated with a
complete sample space Ω:
H(X) = -\sum_{i=1}^{n} p_i \log p_i
[1] A measure is a rather precise definition (involving such things as σ-algebras), which makes it
difficult to understand for non-mathematicians such as myself. All we need to know here is that this
and other definitions form the basis for much of what is known as Mathematical Analysis. For example,
every definition of an integral is based on a particular measure. The study of measures and their
application to integration is known as Measure Theory.
For each event e_k there is a value, or quantity, x_k, such that
x_k = -\log P\{e_k\} = -\log p_k
The term -\log(p_k) is called the amount of self-information
associated with the event e_k. The unit of information, called a bit
(binary unit), is equivalent to the amount of information associated
with selecting one event from a set of two equally likely events.
The average amount of information, called Entropy, is defined for
a complete sample space Ω of events E_k as:
H(X) = \sum_{k=1}^{n} p_k\, I(E_k) = -\sum_{k=1}^{n} p_k \log p_k
If we have a fair coin, such that p(H) = p(T) = 1/2, then
H(E_k) = H(E_1 \cup E_2) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = -\log\tfrac{1}{2} = 1 \text{ bit}
Note that I(E_1) = I(E_2) = -log(1/2) = 1 bit. Extending this example,
if we have a sample space Ω with 2^N equally probable events E_k
(k = 1, 2, …, 2^N), then
I(E_k) = -\log p_k = -\log(2^{-N}) = N \text{ bits}
Example:
E_a = [A_1, A_2],  P = [1/256, 255/256]   =>   H(E_a) ≈ 0.0369 bit
E_b = [B_1, B_2],  P = [1/2, 1/2]         =>   H(E_b) = 1 bit
suggesting that it is easier to guess the value of A_k than to guess
the value of B_k.
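A minimal Python sketch of these two definitions (the helper names self_information and entropy are mine, not from the notes) reproduces both values:

import math

def self_information(p, base=2):
    # Self-information -log(p) of an event with probability p, in bits.
    return -math.log(p, base)

def entropy(probs, base=2):
    # Shannon entropy H = -sum p_k log p_k; zero-probability terms contribute 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(self_information(0.5))          # 1.0 bit for one side of a fair coin
print(entropy([0.5, 0.5]))            # 1.0 bit
print(entropy([1/256, 255/256]))      # ~0.0369 bit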
The measure H complies with the following axioms:
1. Continuity: If the probabilities of events change, the entropy
associated with the system changes accordingly.
2. Symmetry: H is invariant to the order of events, i.e.,
H(p_1, p_2, …, p_n) = H(p_2, p_1, …, p_n).
3. Extremal Value: The value of H is largest when all events
are equally likely, because it is most uncertain which event
could occur (a small numeric check follows this list).
4. Additivity: H_2 = H_1 + p_m H_m when the m-th event is a
composition of other events.
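A quick numeric illustration of the symmetry and extremal-value axioms (a sketch only; the probability vectors are made up for illustration):

import math

def entropy(probs, base=2):
    # Shannon entropy H = -sum p_k log p_k; zero-probability terms contribute 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Extremal value: for a fixed number of events, the uniform distribution maximizes H.
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (the maximum for n = 4)
print(entropy([0.70, 0.10, 0.10, 0.10]))   # ~1.357 bits
print(entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.242 bits

# Symmetry: H does not depend on the order of the probabilities.
print(math.isclose(entropy([0.1, 0.2, 0.7]), entropy([0.7, 0.1, 0.2])))   # True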
The Shannon-Wiener formulation for Entropy gained popularity
due to its simplicity and its axiomatic properties. To illustrate,
consider an earlier definition of information, due to R. A. Fisher,
who essentially defined it as an average second moment in a
sample distribution with density f(x) and mean m:
I = \int_{-\infty}^{\infty} \left[\frac{\partial \ln f(x)}{\partial m}\right]^2 f(x)\,dx
Thus, for example, expressing the normal distribution
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left[-\frac{(x-m)^2}{2\sigma^2}\right]
in logarithmic form, differentiating with respect to the mean, and taking
the integral:
\ln f(x) = -\tfrac{1}{2}\ln 2\pi - \ln\sigma - \tfrac{1}{2}\left(\frac{x-m}{\sigma}\right)^2
\frac{\partial \ln f(x)}{\partial m} = \frac{x-m}{\sigma^2}
I = \int_{-\infty}^{\infty} \left(\frac{x-m}{\sigma^2}\right)^2 \frac{1}{\sigma\sqrt{2\pi}}\,\exp\!\left[-\frac{(x-m)^2}{2\sigma^2}\right] dx = \frac{1}{\sigma^2}
The Shannon-Wiener expression for information can be
generalized for 2-dimensional probability schemes and, by
induction, to any n-dimensional probability scheme. Let Ω1 and
Ω2 be two discrete sample spaces with sets {E} = [E_1, E_2, …, E_n] and
{F} = [F_1, F_2, …, F_m].
We can have three complete sets of probability schemes:
P{E} = [P{Ek}]
P{F} = [P{Fj}]
P{EF}= [P{EkFj}]
The joint probability matrix is given by:
[P\{X,Y\}] = \begin{bmatrix}
p\{1,1\} & p\{1,2\} & \cdots & p\{1,m\} \\
p\{2,1\} & p\{2,2\} & \cdots & p\{2,m\} \\
\vdots   & \vdots   & \ddots & \vdots   \\
p\{n,1\} & p\{n,2\} & \cdots & p\{n,m\}
\end{bmatrix}
We can obtain the marginal probabilities for each variable, as in:
P\{x_k\} = \sum_{j=1}^{m} p\{x_k, y_j\}
that is, summing the k-th row of [P{X,Y}] yields the marginal p(x_k):
\begin{bmatrix}
p\{1,1\} & p\{1,2\} & \cdots & p\{1,m\} \\
p\{2,1\} & p\{2,2\} & \cdots & p\{2,m\} \\
\vdots   & \vdots   & \ddots & \vdots   \\
p\{n,1\} & p\{n,2\} & \cdots & p\{n,m\}
\end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} p(x_1) \\ p(x_2) \\ \vdots \\ p(x_n) \end{bmatrix}
\text{(row sums)}
More generally,
P\{x_k\} = \sum_{\{x,y\}\setminus y} p(x,y) = \sum_{j=1}^{m} p\{x_k, y_j\}
P\{x_1\} = P\{E_1\} = P\{E_1F_1 \cup E_1F_2 \cup \dots \cup E_1F_m\} = p\{1,1\} + p\{1,2\} + \dots + p\{1,m\}
and
P\{y_j\} = \sum_{\{x,y\}\setminus x} p(x,y) = \sum_{k=1}^{n} p\{x_k, y_j\}
From the matrix we can also compute the total and marginal
entropies, H(X), H(Y), and H(X,Y):
H(X,Y) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{k,j\}\,\log p\{k,j\}
H(X) = -\sum_{k=1}^{n}\left[\sum_{j=1}^{m} p\{k,j\}\right]\log\left[\sum_{j=1}^{m} p\{k,j\}\right] = -\sum_{k=1}^{n} p\{x_k\}\,\log p\{x_k\}
H(Y) = -\sum_{j=1}^{m}\left[\sum_{k=1}^{n} p\{k,j\}\right]\log\left[\sum_{k=1}^{n} p\{k,j\}\right] = -\sum_{j=1}^{m} p\{y_j\}\,\log p\{y_j\}
Note that to obtain H(X) and H(Y) we must find the corresponding p(x_k) and
p(y_j) first. To better understand the calculations involved in
H(X,Y) versus H(X) and H(Y), let m = n = 3. Then
H(X,Y) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{k,j\}\,\log p\{k,j\}
H(X,Y) = -\,p(1,1)\log p(1,1) - p(1,2)\log p(1,2) - p(1,3)\log p(1,3)
         \;-\,p(2,1)\log p(2,1) - p(2,2)\log p(2,2) - p(2,3)\log p(2,3)
         \;-\,p(3,1)\log p(3,1) - p(3,2)\log p(3,2) - p(3,3)\log p(3,3)
while
H(X) = -\sum_{k=1}^{n}\left[\sum_{j=1}^{m} p\{k,j\}\right]\log\left[\sum_{j=1}^{m} p\{k,j\}\right] = -\sum_{k=1}^{n} p\{x_k\}\,\log p\{x_k\}
H(X) = -[p(1,1)+p(1,2)+p(1,3)]\,\log[p(1,1)+p(1,2)+p(1,3)]
       -[p(2,1)+p(2,2)+p(2,3)]\,\log[p(2,1)+p(2,2)+p(2,3)]
       -[p(3,1)+p(3,2)+p(3,3)]\,\log[p(3,1)+p(3,2)+p(3,3)]
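As a sketch of these calculations in code (the 3x3 joint table below is made up for illustration; the helper names are mine):

import numpy as np

def joint_entropy(P):
    # H(X,Y) = -sum_{k,j} p{k,j} log2 p{k,j}, skipping zero entries.
    p = P[P > 0]
    return -np.sum(p * np.log2(p))

def marginal_entropy(P, axis):
    # Entropy of the marginal obtained by summing the joint table over `axis`.
    m = P.sum(axis=axis)
    m = m[m > 0]
    return -np.sum(m * np.log2(m))

# A made-up 3x3 joint probability table (rows: x_k, columns: y_j), summing to 1.
P = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.30, 0.05],
              [0.05, 0.05, 0.30]])

H_XY = joint_entropy(P)
H_X  = marginal_entropy(P, axis=1)   # sum over y (columns) gives p(x_k)
H_Y  = marginal_entropy(P, axis=0)   # sum over x (rows) gives p(y_j)
print(H_XY, H_X, H_Y)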
We can also compute the conditional entropies. Due to the
addition theorem of probability, the union of the E_k covers the whole
sample space:
P\left\{\bigcup_{k=1}^{n} E_k\right\} = 1
Therefore, marginalizing over E_k,
F_j = \bigcup_{k=1}^{n} E_k F_j
From Bayes' theorem, p(x,y) = p(x|y) p(y) = p(y|x) p(x); therefore:
P\{X = x_k \mid Y = y_j\} = \frac{P\{X = x_k \cap Y = y_j\}}{P\{Y = y_j\}}
p\{x_k \mid y_j\} = \frac{p\{k, j\}}{p\{y_j\}}
where p\{y_j\} is the j-th marginal; and so
H(X \mid Y) = -\sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k, y_j\}\,\log p\{x_k \mid y_j\}
H(Y \mid X) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{x_k, y_j\}\,\log p\{y_j \mid x_k\}
Note again that the two equations imply that you have to compute
the marginals first. From Bayes' theorem, p(x,y) = p(x|y) p(y) = p(y|x) p(x),
we can write:
H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
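This identity follows in one step from p(x_k, y_j) = p(x_k|y_j) p(y_j) (a short derivation filling in the step the notes skip):

H(X,Y) = -\sum_{k}\sum_{j} p\{x_k,y_j\}\,\log\bigl[p\{x_k \mid y_j\}\,p\{y_j\}\bigr]
       = -\sum_{k}\sum_{j} p\{x_k,y_j\}\,\log p\{x_k \mid y_j\} \;-\; \sum_{j}\Bigl[\sum_{k} p\{x_k,y_j\}\Bigr]\log p\{y_j\}
       = H(X \mid Y) + H(Y)

and, by the symmetric factorization p(x,y) = p(y|x) p(x), H(X,Y) = H(Y|X) + H(X).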
Example: Two “honest” dice, X and Y, are thrown. Compute
H(X,Y), H(X), H(Y), H(X|Y) and H(Y|X). The joint probability table is:
Y\X      1      2      3      4      5      6   | e(x)
 1     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
 2     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
 3     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
 4     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
 5     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
 6     1/36   1/36   1/36   1/36   1/36   1/36  |  1/6
f(y)    1/6    1/6    1/6    1/6    1/6    1/6  |   1
The entropies can be calculated from the table:
H(X,Y) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij}\,\log\tfrac{1}{36} = -\log\tfrac{1}{36} \approx 5.17 \text{ bits}
H(X) = H(Y) = -\sum_{i=1}^{6} P_i\,\log\tfrac{1}{6} = -\log\tfrac{1}{6} \approx 2.58 \text{ bits}
H(X \mid Y) = H(Y \mid X) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij}\,\log\tfrac{1}{6} \approx 2.58 \text{ bits}
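The dice numbers can be checked with a few lines of NumPy (a sketch; the variable names are mine):

import numpy as np

P = np.full((6, 6), 1/36)             # joint table for two fair dice

H_XY = -np.sum(P * np.log2(P))        # ~5.17 bits
px = P.sum(axis=1)                    # marginal of X: all entries 1/6
py = P.sum(axis=0)                    # marginal of Y: all entries 1/6
H_X = -np.sum(px * np.log2(px))       # ~2.58 bits
H_Y = -np.sum(py * np.log2(py))       # ~2.58 bits
H_X_given_Y = H_XY - H_Y              # chain rule; ~2.58 bits, since the dice are independent
print(round(H_XY, 2), round(H_X, 2), round(H_Y, 2), round(H_X_given_Y, 2))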
A Measure of Mutual Information
We would like to formulate a measure for the mutual information
between two symbols, (x_i, y_j). Solomon Kullback wrote a book in 1958
on the study of logarithmic measures of information and
their application to the testing of statistical hypotheses, such as
determining whether two independent random samples were drawn
from the same population, or whether the samples are conditionally
independent, etc.
Let H_i (i = 1, 2) be the hypothesis that X is from a population with a
probability measure μ_i. Applying Bayes' theorem:
P(H_i \mid x) = \frac{P(H_i)\, f_i(x)}{P(H_1)\, f_1(x) + P(H_2)\, f_2(x)} \qquad \text{for } i = 1, 2
Expanding P(H_i|x) for i = 1, 2, solving for f_1 and f_2, and simplifying:
\frac{f_1(x)}{f_2(x)} = \frac{P(H_1 \mid x)}{P(H_2 \mid x)} \cdot \frac{P(H_2)}{P(H_1)}
Taking the log we obtain
\log\frac{f_1(x)}{f_2(x)} = \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)} - \log\frac{P(H_1)}{P(H_2)}
The right side of the equation is a measure of the difference
between the odds in favor of H1 after the observation X=x and
before the observation. Kullback defined this expression as the
“information in X=x for discriminating in favor of H1 against H2.”
The mean information is the integral of this expression, which is
written as
I(1:2) = \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_1 \;-\; \log\frac{P(H_1)}{P(H_2)}
Generalizing to k-dimensional Euclidean spaces (here of two
dimensions) with elements {X, Y}, the mutual information
between X and Y is given by
I(X:Y) = \iint f(x,y)\,\log\frac{f(x,y)}{g(x)\,h(y)}\,dx\,dy
We can think of the pair {X,Y} as the signals that a transmitter X
sends to a receiver Y. At the transmitter, p(xi) conveys the priors
for each signal being sent, while at the receiver, p(xi|yj) is the
probability that xi was sent given that yj was received. Therefore
the gain in information has to involve the ratio of the final and
initial ignorance, or p(xi|yj) / p(xi).
Let
{X} = [x_1, x_2, …, x_n]  with  \sum_{i=1}^{N} x_i = N_1
and
{Y} = [y_1, y_2, …, y_m]  with  \sum_{j=1}^{M} y_j = N_2
We can re-write the mutual information I(X:Y) for the discrete case
as:
I(X:Y) = \sum_i \sum_j p(x_i, y_j)\,\log\frac{p(x_i, y_j)}{P(x_i)\,P(y_j)}
Using p(x,y) = p(x|y) p(y) = p(y|x) p(x), we can also write I(X:Y) as
I(X:Y) = \sum_i \sum_j p(x_i, y_j)\,\log\frac{p(x_i \mid y_j)}{P(x_i)}
We can also write I(X:Y) as expressions involving entropy:
I(X:Y) = H(X) + H(Y) – H(X,Y)
I(X:Y) = H(X) – H(X|Y)
I(Y:X) = H(Y) – H(Y|X)
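These identities follow directly from the discrete definition by splitting the logarithm of the ratio and using the marginals (a short derivation filling in the intermediate step):

I(X:Y) = \sum_i\sum_j p(x_i,y_j)\,\log\frac{p(x_i,y_j)}{P(x_i)\,P(y_j)}
       = \sum_i\sum_j p(x_i,y_j)\log p(x_i,y_j) \;-\; \sum_i\Bigl[\sum_j p(x_i,y_j)\Bigr]\log P(x_i) \;-\; \sum_j\Bigl[\sum_i p(x_i,y_j)\Bigr]\log P(y_j)
       = -H(X,Y) + H(X) + H(Y)

Substituting H(X,Y) = H(X|Y) + H(Y) then gives I(X:Y) = H(X) - H(X|Y), and symmetrically I(Y:X) = H(Y) - H(Y|X).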
Example: Compute I(X:Y) for a transmitter with an alphabet of 5
signals, [x1, x2, x3, x4, x5] and a receiver with 4 signals [y1, y2, y3,
y4].
The Joint Probability Table (JPT) and a system graph are:
        y1     y2     y3     y4
 x1    0.25   0      0      0
 x2    0.10   0.30   0      0
 x3    0      0.05   0.10   0
 x4    0      0      0.05   0.10
 x5    0      0      0.05   0

[System graph: each transmitter signal x_i is linked to the receiver signals y_j for which
p(x_i, y_j) > 0, i.e., x1-y1, x2-y1, x2-y2, x3-y2, x3-y3, x4-y3, x4-y4, x5-y3.]
f(x1) = 0.25
g(y1) = 0.25 + 0.10 = 0.35
f(x2) = 0.10 + 0.30 = 0.40
g(y2) = 0.30 + 0.05 = 0.35
f(x3) = 0.05 + 0.10 = 0.15
g(y3) = 0.10 + 0.05 + 0.05 = 0.20
f(x4) = 0.05 + 0.10 = 0.15
g(y4) = 0.10
f(x5) = 0.05
p(x1|y1) = p(x1,y1)/g(y1) = .25/.35 = 5/7
p(y1|x1) = p(x1,y1)/f(x1)=.25/.25=1.0
p(x2|y2) = .3/.35=6/7
p(y2|x2)=p(x2,y2)/f(x2)=.3/.4 = .75
p(x3|y3) = 0.5
p(y3|x3) = 2/3
p(x4|y4) = 1.0
p(y4|x4) = 2/3
p(x2|y1) = 2/7
p(y1|x2) = 1/4
p(x3|y2) = 1/7
p(y2|x3) = 1/3
p(x4|y3) = ¼
p(y3|x4) = 1/3
p(x5|y3) = 0.05/0.20 = 1/4
p(y3|x5) = 0.05/0.05 = 1.0
H(X,Y) = -\sum_x\sum_y p(x,y)\,\log p(x,y)
       = -.25\log .25 - .1\log .1 - .3\log .3 - .05\log .05 - .1\log .1 - .05\log .05 - .1\log .1 - .05\log .05
H(X,Y) ≈ 2.665 bits
etc.
Note: \log_2(N) = \log_{10}(N) / 0.3010
Likewise, the calculations for H(X), H(Y), H(X|Y) and H(Y|X) can
be performed. Given these, we can assess whether X and Y are
independent variables.
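The remaining quantities can be computed directly from the JPT; a minimal NumPy sketch (the helper name H is mine):

import numpy as np

# Joint probability table from the example: rows x1..x5, columns y1..y4.
P = np.array([[0.25, 0.00, 0.00, 0.00],
              [0.10, 0.30, 0.00, 0.00],
              [0.00, 0.05, 0.10, 0.00],
              [0.00, 0.00, 0.05, 0.10],
              [0.00, 0.00, 0.05, 0.00]])

def H(p):
    # Entropy in bits of an array of probabilities; zero entries are skipped.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

f = P.sum(axis=1)            # f(x_i), the transmitter marginals
g = P.sum(axis=0)            # g(y_j), the receiver marginals
H_XY, H_X, H_Y = H(P), H(f), H(g)
print(H_XY)                  # ~2.666 bits, matching the hand calculation above
print(H_XY - H_Y)            # H(X|Y)
print(H_XY - H_X)            # H(Y|X)
print(H_X + H_Y - H_XY)      # I(X:Y)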
Another interesting question is where probabilities come from
and how we can use them to create a Bayesian network. To
answer these questions, let's consider the two sets below, S and C,
which were sampled from a database related to the famous "Chest
Clinic" example. The variables S and C represent instantiations
of Smoking (Y/N) and Cancer (Y/N).
s={111011010011010110000010101101001011101000011101
100000011101100000011100000010010111100010101000110
0}
c={000000010000000000000000100000000000000000000000
000000000100100000000000000000000000000000000000000
0}
The joint probability table for the sample is approximately:

        C0     C1
 S0    0.55   0.00
 S1    0.41   0.04

P(S,C) sums to 0.55 + 0.41 + 0.0 + 0.04 = 1.0
H(S,C) = -0.41 log(0.41) - 0.04 log(0.04) - 0.55 log(0.55) ≈ 1.188
H(S) = -\sum_S \left[\sum_C p(s,c)\right] \log p(S) = -0.55 log(0.55) - 0.41 log(0.45) - 0.04 log(0.45) = 0.99277
H(C) = -\sum_C \left[\sum_S p(s,c)\right] \log p(C) = -0.55 log(0.96) - 0.41 log(0.96) - 0.0 - 0.04 log(0.04) = 0.243244
H(S \mid C) = -\sum_S\sum_C p(s,c)\,\log p(s \mid c) = -\sum_S\sum_C p(s,c)\,\log\!\left(\frac{p(s,c)}{p(C)}\right)
H(S|C) = -0.55 log(0.55/0.96) - 0.41 log(0.41/0.96) - 0.0 - 0.04 log(0.04/0.04) = 0.9453
H(C \mid S) = -\sum_C\sum_S p(s,c)\,\log p(c \mid s) = -\sum_C\sum_S p(s,c)\,\log\!\left(\frac{p(s,c)}{p(S)}\right)
H(C|S) = -0.55 log(0.55/0.55) - 0.41 log(0.41/0.45) - 0.04 log(0.04/0.45) = 0.19467
H(S,C) = H(S) + H(C|S) = H(C) + H(S|C)
0.99277 + 0.19467 = 1.18744  and  0.24324 + 0.9453 = 1.18854, both ≈ 1.188
I(S:C) = H(S) + H(C) - H(S,C) = 0.99277 + 0.2432 - 1.188 ≈ 0.048
I(S:C) = H(S) - H(S|C) = 0.99277 - 0.9453 ≈ 0.048
I(C:S) = H(C) + H(S) - H(S,C) = 0.2432 + 0.99277 - 1.188 ≈ 0.048
I(C:S) = H(C) - H(C|S) = 0.24324 - 0.19467 ≈ 0.048
Note that we could have also calculated I(C:S) by inverting the
order of i and j in the summations.
All we want is to assess
conditional dependency. The fact that
H(S|C) = 0.9453 >> H(C|S)= 0.19467
indicates that there is less uncertainty (surprise) regarding C
when S is known, and therefore a Bayesian network involving the
two variables carries more information when this relationship is
represented as
S → C
Therefore the conditional probability table associated with the edge
should be:

        C0      C1
 S0    55/55    0/55
 S1    41/45    4/45

or, equivalently,

        C0      C1
 S0    1.0     0.0
 S1    0.911   0.089
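All of the numbers above (the entropies, the mutual information, and the conditional probability table) can be reproduced from the joint table with a short script; a sketch (counting the 0/1 samples s and c directly would give the same joint table):

import numpy as np

# Joint probability table for (S, C) from the sample: rows S0/S1, columns C0/C1.
P = np.array([[0.55, 0.00],
              [0.41, 0.04]])

def H(p):
    # Entropy in bits; zero-probability cells are skipped.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_SC = H(P)                        # joint entropy H(S,C), ~1.188
H_S  = H(P.sum(axis=1))            # P(S) = [0.55, 0.45], ~0.993
H_C  = H(P.sum(axis=0))            # P(C) = [0.96, 0.04], ~0.243
print(H_SC - H_C, H_SC - H_S)      # H(S|C) ~0.945 and H(C|S) ~0.195, by the chain rule
print(H_S + H_C - H_SC)            # I(S:C) ~0.048 bits

# Conditional probability table P(C|S): divide each row by its marginal P(S).
print(P / P.sum(axis=1, keepdims=True))   # [[1.0, 0.0], [0.911..., 0.088...]]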
And the next question is: if we have a database with N variables,
should we compute the mutual information for each of the N(N-1)/2
pairs? For N = 5 variables, we only need 4+3+2+1 = 10 calculations
of mutual information. However, when the number of variables is
much larger, for example N = 10^3, we would need to find ways to
reduce the number of computations.
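As a sketch of the brute-force approach (the random data matrix and the helper name mutual_information are illustrative, not from the notes):

import numpy as np
from itertools import combinations

def mutual_information(a, b):
    # I(A:B) in bits for two discrete samples of equal length.
    I = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                I += p_ab * np.log2(p_ab / (p_a * p_b))
    return I

# Illustrative data: 100 cases of N = 5 binary variables.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(100, 5))

# One mutual-information value per pair: N(N-1)/2 = 10 pairs for N = 5.
for i, j in combinations(range(data.shape[1]), 2):
    print(i, j, round(mutual_information(data[:, i], data[:, j]), 4))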
Kullback also defined the divergence, J(1,2). The mean information
per observation from μ2 for discriminating in favor of H2 against H1 is
I(2:1) = \int f_2(x)\,\log\frac{f_2(x)}{f_1(x)}\,d\lambda
and
-I(2:1) = \int f_2(x)\,\log\frac{f_1(x)}{f_2(x)}\,d\lambda
J(1,2) = I(1:2) + I(2:1) = \int \bigl(f_1(x) - f_2(x)\bigr)\,\log\frac{f_1(x)}{f_2(x)}\,d\lambda
       = \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_1 \;-\; \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_2
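For discrete distributions these integrals reduce to sums; a minimal sketch (the two example distributions are made up):

import numpy as np

def kl(p, q):
    # I(p:q) = sum_x p(x) log2( p(x)/q(x) ); assumes q(x) > 0 wherever p(x) > 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

f1 = np.array([0.5, 0.3, 0.2])
f2 = np.array([0.2, 0.3, 0.5])

I_12 = kl(f1, f2)        # I(1:2)
I_21 = kl(f2, f1)        # I(2:1)
J = I_12 + I_21          # Kullback's divergence J(1,2); symmetric and non-negative
print(I_12, I_21, J)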
J(1,2) is a measure of the divergence between H1 and H2, i.e., a
measure of how difficult it is to discriminate between them.
Kullback studied the properties of these measures (additivity,
convexity, invariance, sufficiency, minimum discrimination
information, and others).