CSCE 822 Data Mining & Knowledge Discovery
Juan E. Vargas, University of South Carolina

Information as a Measure¹

Several books were written in the late 1950s to formally define the concept of information as a measure of surprise (or the lack of it), or as the uncertainty of outcomes. These books were inspired by earlier work by Shannon and Wiener, who independently arrived at the same expression for average information.

Let X be a random variable associated with a sample space having n mutually exclusive events, E = {e_1, e_2, ..., e_n} with probabilities P = {p_1, p_2, ..., p_n}, such that

    \bigcup_{k=1}^{n} E_k = \Omega    and    \sum_{k=1}^{n} p_k = 1

Let E(X) be a function such that, if the experiment is conducted many times, the average of X approaches E(X). Shannon and Wiener suggested the expression below to quantify the average uncertainty (or chaos, or disorder, or entropy) associated with a complete sample space:

    H(X) = -\sum_{i=1}^{n} p_i \log p_i

¹ A measure is a rather precise definition (involving such things as sigma-algebras), which makes it difficult to understand for non-mathematicians such as myself. All we need to know here is that this and other definitions form the basis for much of what is known as Mathematical Analysis. For example, every definition of an integral is based on a particular measure. The study of measures and their application to integration is known as Measure Theory.

For each event e_k there is a value, or quantity, x_k, such that

    x_k = -\log P\{e_k\} = -\log p_k

The term -log(p_k) is called the amount of self-information associated with the event e_k. The unit of information, called a bit (binary unit), is the amount of information associated with selecting one of two equally probable events.

The average amount of information, called entropy, is defined over the sample space of events E_k as:

    H(X) = \sum_{k=1}^{n} p_k I(E_k) = -\sum_{k=1}^{n} p_k \log p_k

If we have a fair coin, such that p(H) = p(T) = 1/2, then

    H(X) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 = 1 bit

Note that I(E_1) = I(E_2) = -log(1/2) = 1 bit. Extending this example, if we have a sample space with 2^N equally probable events E_k (k = 1, 2, ..., 2^N), then

    I(E_k) = -\log p_k = -\log(2^{-N}) = N bits

Example:

    E_a = [A_1, A_2], P = [1/256, 255/256]  =>  H(E_a) = 0.0369 bit
    E_b = [B_1, B_2], P = [1/2, 1/2]        =>  H(E_b) = 1 bit

suggesting that it is easier to guess the value of A_k than to guess the value of B_k.

The measure H complies with the following axioms:
1. Continuity: if the probabilities of the events change, the entropy associated with the system changes accordingly.
2. Symmetry: H is invariant to the order of the events, i.e., H(p_1, p_2, ..., p_n) = H(p_2, p_1, ..., p_n).
3. Extremal value: H is largest when all events are equally likely, because it is then most uncertain which event will occur.
4. Additivity: H_2 = H_1 + p_m H_m when the m-th event is itself a composition of other events.

The Shannon-Wiener formulation of entropy gained popularity due to its simplicity and its axiomatic properties.
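The coin and [1/256, 255/256] examples above can be checked numerically. The short Python sketch below is only an illustration; the helper names `self_information` and `entropy` are ours, not part of any course library.

```python
import math

def self_information(p):
    """Self-information -log2(p) of an event with probability p, in bits."""
    return -math.log2(p)

def entropy(probs):
    """Shannon entropy H = -sum p*log2(p), in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Fair coin: each outcome carries 1 bit of self-information, and H = 1 bit.
print(self_information(0.5))        # 1.0
print(entropy([0.5, 0.5]))          # 1.0

# Skewed space E_a = [A1, A2] with P = [1/256, 255/256]: H ~= 0.0369 bit,
# confirming that A is much easier to guess than the fair-coin value B.
print(entropy([1/256, 255/256]))    # ~0.0369
```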
To appreciate this simplicity, consider an earlier definition of information, due to R. A. Fisher, who essentially defined it as an average second moment of a sample distribution with density f(x) and mean m:

    I = \int \left[ \frac{\partial \ln f(x)}{\partial m} \right]^2 f(x)\,dx

Thus, for example, expressing the normal distribution

    f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[ -\frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2 \right]

in logarithmic form, differentiating with respect to the mean, and taking the integral:

    \ln f(x) = -\ln(\sigma\sqrt{2\pi}) - \frac{1}{2}\left(\frac{x-m}{\sigma}\right)^2

    \frac{\partial \ln f(x)}{\partial m} = \frac{x-m}{\sigma^2}

    I = \int \left(\frac{x-m}{\sigma^2}\right)^2 \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-m)^2}{2\sigma^2}\right] dx = \frac{1}{\sigma^2}

The Shannon-Wiener expression for information can be generalized to 2-dimensional probability schemes and, by induction, to any n-dimensional probability scheme. Let \Omega_1 and \Omega_2 be two discrete sample spaces with event sets {E} = [E_1, E_2, ..., E_n] and {F} = [F_1, F_2, ..., F_m]. We can form three complete probability schemes:

    P{E} = [P{E_k}]      P{F} = [P{F_j}]      P{EF} = [P{E_k F_j}]

The joint probability matrix is:

    [P\{X,Y\}] =
    \begin{bmatrix}
    p\{1,1\} & p\{1,2\} & \cdots & p\{1,m\} \\
    p\{2,1\} & p\{2,2\} & \cdots & p\{2,m\} \\
    \vdots   & \vdots   & \ddots & \vdots   \\
    p\{n,1\} & p\{n,2\} & \cdots & p\{n,m\}
    \end{bmatrix}

We can obtain the marginal probabilities of each variable by summing across the rows or columns of the matrix:

    P\{x_k\} = \sum_{j=1}^{m} p\{x_k, y_j\}

so that, for example,

    P\{x_1\} = P\{E_1\} = P\{E_1F_1 \cup E_1F_2 \cup \dots \cup E_1F_m\} = p\{1,1\} + p\{1,2\} + \dots + p\{1,m\}

and

    P\{y_j\} = \sum_{k=1}^{n} p\{x_k, y_j\}

From the matrix we can also compute the joint and marginal entropies H(X,Y), H(X), H(Y):

    H(X,Y) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{k,j\} \log p\{k,j\}

    H(X) = -\sum_{k=1}^{n}\left[\sum_{j=1}^{m} p\{k,j\}\right] \log\left[\sum_{j=1}^{m} p\{k,j\}\right] = -\sum_{k=1}^{n} p\{x_k\} \log p\{x_k\}

    H(Y) = -\sum_{j=1}^{m}\left[\sum_{k=1}^{n} p\{k,j\}\right] \log\left[\sum_{k=1}^{n} p\{k,j\}\right] = -\sum_{j=1}^{m} p\{y_j\} \log p\{y_j\}

Note that to obtain H(X) and H(Y) we must find the corresponding marginals p(x_k) and p(y_j) first. To better understand the calculations involved in H(X,Y) versus H(X) and H(Y), let m = n = 3. Then

    H(X,Y) = -p(1,1)\log p(1,1) - p(1,2)\log p(1,2) - p(1,3)\log p(1,3)
             -p(2,1)\log p(2,1) - p(2,2)\log p(2,2) - p(2,3)\log p(2,3)
             -p(3,1)\log p(3,1) - p(3,2)\log p(3,2) - p(3,3)\log p(3,3)

while

    H(X) = -[p(1,1)+p(1,2)+p(1,3)] \log[p(1,1)+p(1,2)+p(1,3)]
           -[p(2,1)+p(2,2)+p(2,3)] \log[p(2,1)+p(2,2)+p(2,3)]
           -[p(3,1)+p(3,2)+p(3,3)] \log[p(3,1)+p(3,2)+p(3,3)]

We can also compute the conditional entropies. By the addition theorem of probability, the union of the E_k is the whole space,

    \bigcup_{k=1}^{n} E_k = \Omega

so marginalizing over the E_k gives

    F_j = \bigcup_{k=1}^{n} E_k F_j

From Bayes theorem, p(x,y) = p(x|y) p(y) = p(y|x) p(x); therefore

    P\{X = x_k \mid Y = y_j\} = p\{x_k \mid y_j\} = \frac{P\{X = x_k \cap Y = y_j\}}{P\{Y = y_j\}} = \frac{p\{k,j\}}{p\{y_j\}}

where p{y_j} is the j-th marginal, and so

    H(X \mid Y) = -\sum_{j=1}^{m}\sum_{k=1}^{n} p\{x_k, y_j\} \log p\{x_k \mid y_j\}

    H(Y \mid X) = -\sum_{k=1}^{n}\sum_{j=1}^{m} p\{x_k, y_j\} \log p\{y_j \mid x_k\}

Note again that these two equations require computing the marginals first. From Bayes theorem we can also write the chain rule:

    H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
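As a companion to these formulas, the following Python sketch computes H(X,Y), H(X), H(Y), H(X|Y) and H(Y|X) from a joint probability table and checks the chain rule. The 3x3 table and the helper names are ours, used only for illustration.

```python
import numpy as np

# Hypothetical 3x3 joint probability table: rows index X (k), columns index Y (j).
P = np.array([[0.10, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.40]])

def H(probs):
    """Entropy in bits: -sum p log2 p over the nonzero entries of a probability array."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_XY = H(P)                  # joint entropy H(X,Y)
H_X = H(P.sum(axis=1))       # H(X) from the row marginals p(x_k)
H_Y = H(P.sum(axis=0))       # H(Y) from the column marginals p(y_j)

# Conditional entropies directly from p(x|y) = p(x,y)/p(y) and p(y|x) = p(x,y)/p(x)
# (every entry of this particular table is positive, so no zero handling is needed).
p_x = P.sum(axis=1)
p_y = P.sum(axis=0)
H_X_given_Y = -np.sum(P * np.log2(P / p_y[np.newaxis, :]))
H_Y_given_X = -np.sum(P * np.log2(P / p_x[:, np.newaxis]))

# Chain rule: H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
print(np.isclose(H_XY, H_X_given_Y + H_Y))   # True
print(np.isclose(H_XY, H_Y_given_X + H_X))   # True
```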
Example: Two "honest" dice, X and Y, are thrown. Compute H(X,Y), H(X), H(Y), H(X|Y) and H(Y|X). The joint probability table is:

    Y\X    1      2      3      4      5      6    | f(y)
    1     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    2     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    3     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    4     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    5     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    6     1/36   1/36   1/36   1/36   1/36   1/36  | 1/6
    e(x)  1/6    1/6    1/6    1/6    1/6    1/6   | 1

The entropies can be calculated from the table:

    H(X,Y) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij} \log_2 P_{ij} = -\log_2\tfrac{1}{36} = 5.17 bits

    H(X) = H(Y) = -\sum_{i=1}^{6} P_i \log_2 P_i = -\log_2\tfrac{1}{6} = 2.58 bits

    H(X|Y) = H(Y|X) = -\sum_{i=1}^{6}\sum_{j=1}^{6} P_{ij} \log_2\tfrac{1}{6} = 2.58 bits

A Measure of Mutual Information

We would like to formulate a measure of the mutual information between two symbols (x_i, y_j). Solomon Kullback wrote a book in 1958 on logarithmic measures of information and their application to the testing of statistical hypotheses, such as determining whether two independent random samples were drawn from the same population, whether two samples are conditionally independent, and so on.

Let H_i (i = 1, 2) be the hypothesis that X comes from the population with probability density f_i(x). Applying Bayes theorem:

    P(H_i \mid x) = \frac{P(H_i) f_i(x)}{P(H_1) f_1(x) + P(H_2) f_2(x)},    i = 1, 2

Expanding P(H_i|x) for i = 1, 2, solving for f_1 and f_2, and simplifying:

    \frac{f_1(x)}{f_2(x)} = \frac{P(H_1 \mid x)\,P(H_2)}{P(H_2 \mid x)\,P(H_1)}

Taking the logarithm we obtain

    \log\frac{f_1(x)}{f_2(x)} = \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)} - \log\frac{P(H_1)}{P(H_2)}

The right-hand side is a measure of the difference between the odds in favor of H_1 after the observation X = x and before the observation. Kullback defined this expression as "the information in X = x for discriminating in favor of H_1 against H_2." The mean information is the integral of this expression over the first population:

    I(1:2) = \int f_1(x) \log\frac{f_1(x)}{f_2(x)}\,dx = \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_1 - \log\frac{P(H_1)}{P(H_2)}

Generalizing to a Euclidean space of two dimensions with elements {X, Y}, the mutual information between X and Y is given by

    I(X:Y) = \int\!\!\int f(x,y) \log\frac{f(x,y)}{g(x)\,h(y)}\,dx\,dy

We can think of the pair {X, Y} as the signals that a transmitter X sends to a receiver Y. At the transmitter, p(x_i) conveys the prior probability of each signal being sent, while at the receiver, p(x_i|y_j) is the probability that x_i was sent given that y_j was received. Therefore the gain in information must involve the ratio of the final and initial ignorance, p(x_i|y_j) / p(x_i).

Let {X} = [x_1, x_2, ..., x_n] and {Y} = [y_1, y_2, ..., y_m]. We can write the mutual information I(X:Y) for the discrete case as:

    I(X:Y) = \sum_{i}\sum_{j} p(x_i, y_j) \log\frac{p(x_i, y_j)}{P(x_i)\,P(y_j)}

Using p(x,y) = p(x|y) p(y) = p(y|x) p(x), we can also write I(X:Y) as

    I(X:Y) = \sum_{i}\sum_{j} p(x_i, y_j) \log\frac{p(x_i \mid y_j)}{P(x_i)}

We can also express I(X:Y) in terms of entropies:

    I(X:Y) = H(X) + H(Y) - H(X,Y)
    I(X:Y) = H(X) - H(X|Y)
    I(Y:X) = H(Y) - H(Y|X)
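The entropy identities above suggest a direct way to compute I(X:Y) from a joint probability table. The Python sketch below (the function names are ours) applies the discrete formula to the two-dice table from the previous example, where independence makes the mutual information zero.

```python
import numpy as np

def entropy(p):
    """-sum p log2 p over the nonzero probabilities, in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(P):
    """I(X:Y) = sum_ij p(i,j) log2[ p(i,j) / (p(i) p(j)) ] for a joint table P (rows = X)."""
    px = P.sum(axis=1)                       # marginal of X
    py = P.sum(axis=0)                       # marginal of Y
    mask = P > 0
    return np.sum(P[mask] * np.log2(P[mask] / np.outer(px, py)[mask]))

# Two honest dice: every cell of the 6x6 joint table is 1/36.
dice = np.full((6, 6), 1.0 / 36.0)
print(mutual_information(dice))                                               # ~0.0 (independent)
print(entropy(dice.sum(axis=1)) + entropy(dice.sum(axis=0)) - entropy(dice))  # same via H(X)+H(Y)-H(X,Y)
```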
Example: Compute I(X:Y) for a transmitter with an alphabet of 5 signals, [x1, x2, x3, x4, x5], and a receiver with 4 signals, [y1, y2, y3, y4]. The Joint Probability Table (JPT) is:

           y1     y2     y3     y4
    x1    0.25   0      0      0
    x2    0.10   0.30   0      0
    x3    0      0.05   0.10   0
    x4    0      0      0.05   0.10
    x5    0      0      0.05   0

(The accompanying system graph simply draws an edge from each x_i to every y_j with a nonzero joint probability.)

The marginals are:

    f(x1) = 0.25                    g(y1) = 0.25 + 0.10 = 0.35
    f(x2) = 0.10 + 0.30 = 0.40      g(y2) = 0.30 + 0.05 = 0.35
    f(x3) = 0.05 + 0.10 = 0.15      g(y3) = 0.10 + 0.05 + 0.05 = 0.20
    f(x4) = 0.05 + 0.10 = 0.15      g(y4) = 0.10
    f(x5) = 0.05

The conditional probabilities follow from the JPT and the marginals:

    p(x1|y1) = p(x1,y1)/g(y1) = 0.25/0.35 = 5/7    p(y1|x1) = p(x1,y1)/f(x1) = 0.25/0.25 = 1.0
    p(x2|y2) = 0.30/0.35 = 6/7                     p(y2|x2) = 0.30/0.40 = 0.75
    p(x3|y3) = 0.10/0.20 = 0.5                     p(y3|x3) = 0.10/0.15 = 2/3
    p(x4|y4) = 0.10/0.10 = 1.0                     p(y4|x4) = 0.10/0.15 = 2/3
    p(x2|y1) = 0.10/0.35 = 2/7                     p(y1|x2) = 0.10/0.40 = 1/4
    p(x3|y2) = 0.05/0.35 = 1/7                     p(y2|x3) = 0.05/0.15 = 1/3
    p(x4|y3) = 0.05/0.20 = 1/4                     p(y3|x4) = 0.05/0.15 = 1/3
    p(x5|y3) = 0.05/0.20 = 1/4                     p(y3|x5) = 0.05/0.05 = 1.0

The joint entropy is

    H(X,Y) = -\sum_x \sum_y p(x,y) \log p(x,y)
           = -0.25\log 0.25 - 0.10\log 0.10 - 0.30\log 0.30 - 0.05\log 0.05 - 0.10\log 0.10
             - 0.05\log 0.05 - 0.10\log 0.10 - 0.05\log 0.05
           = 2.665 bits

(Note: \log_2(N) = \log_{10}(N) / 0.3010.) Likewise, the calculations for H(X), H(Y), H(X|Y) and H(Y|X) can be performed. Given these, we can assess whether X and Y are independent variables.

Another interesting question is where probabilities come from and how we can use them to create a Bayesian network. To answer these questions, consider the two sets below, S and C, which were sampled from a database related to the famous "Chest Clinic" example. The variables S and C represent instantiations of Smoking (Y/N) and Cancer (Y/N):

s = {1110110100110101100000101011010010111010000111011000000111011000000111000000100101111000101010001100}
c = {0000000100000000000000001000000000000000000000000000000001001000000000000000000000000000000000000000}

The joint probability table estimated from the sample is approximately:

           C0     C1
    S0    0.55   0.00
    S1    0.41   0.04

    \sum_{s,c} P(S,C) = 0.55 + 0.41 + 0.00 + 0.04 = 1.0

    H(S,C) = -0.55\log 0.55 - 0.41\log 0.41 - 0.04\log 0.04 = 1.1875

    H(S) = -\sum_S \sum_C p(s,c)\log p(s) = -0.55\log 0.55 - 0.41\log 0.45 - 0.04\log 0.45 = 0.9928

    H(C) = -\sum_C \sum_S p(s,c)\log p(c) = -0.55\log 0.96 - 0.41\log 0.96 - 0.04\log 0.04 = 0.2423

    H(S|C) = -\sum_S \sum_C p(s,c)\log p(s|c) = -\sum_S \sum_C p(s,c)\log\frac{p(s,c)}{p(c)}
           = -0.55\log\frac{0.55}{0.96} - 0.41\log\frac{0.41}{0.96} - 0.04\log\frac{0.04}{0.04} = 0.9452

    H(C|S) = -\sum_C \sum_S p(s,c)\log p(c|s) = -\sum_C \sum_S p(s,c)\log\frac{p(s,c)}{p(s)}
           = -0.55\log\frac{0.55}{0.55} - 0.41\log\frac{0.41}{0.45} - 0.04\log\frac{0.04}{0.45} = 0.1947

As a check, H(S,C) = H(S) + H(C|S) = H(C) + H(S|C):

    0.9928 + 0.1947 = 1.1875        0.2423 + 0.9452 = 1.1875

The mutual information is

    I(S:C) = H(S) + H(C) - H(S,C) = 0.9928 + 0.2423 - 1.1875 = 0.048
    I(S:C) = H(S) - H(S|C) = 0.9928 - 0.9452 = 0.048
    I(C:S) = H(C) - H(C|S) = 0.2423 - 0.1947 = 0.048

Note that we could also have calculated I(C:S) by inverting the order of i and j in the summations. What we want is to assess conditional dependency. The fact that H(S|C) = 0.9452 >> H(C|S) = 0.1947 indicates that there is less uncertainty (surprise) regarding C when S is known, and therefore a Bayesian network involving the two variables carries more information when the relationship is represented as

    S -> C

Therefore the conditional probability table associated with the edge should be:

            C0       C1
    S0    55/55     0/55
    S1    41/45     4/45

or, equivalently,

            C0       C1
    S0    1.0      0.0
    S1    0.911    0.089

The next question is: if we have a database with N variables, should we compute the mutual information for each of the N(N-1)/2 pairs? For N = 5 variables we only need 4 + 3 + 2 + 1 = 10 calculations of mutual information; however, when the number of variables is much larger, say N = 10^3, we need to find ways to reduce the number of computations.
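The Smoking/Cancer calculation above can be reproduced programmatically. The sketch below (helper names are ours) starts from the rounded joint probability table quoted in the text; the `joint_table` helper shows how such a table could be estimated from the paired 0/1 strings s and c given above.

```python
import numpy as np

def joint_table(s, c):
    """Estimate the 2x2 joint probability table P(S, C) from two paired 0/1 strings."""
    P = np.zeros((2, 2))
    for si, ci in zip(s, c):
        P[int(si), int(ci)] += 1
    return P / len(s)

def entropy(p):
    """-sum p log2 p over the nonzero probabilities, in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# P = joint_table(s, c)  # using the strings from the text gives roughly the table below
P = np.array([[0.55, 0.00],     # rows: S0, S1; columns: C0, C1
              [0.41, 0.04]])

H_SC = entropy(P)               # ~1.1875
H_S = entropy(P.sum(axis=1))    # ~0.9928
H_C = entropy(P.sum(axis=0))    # ~0.2423
I_SC = H_S + H_C - H_SC         # ~0.048

# Conditional probability table P(C|S): normalize each row of the joint table.
CPT = P / P.sum(axis=1, keepdims=True)   # [[1.0, 0.0], [0.911, 0.089]]
print(H_SC, H_S, H_C, I_SC)
print(CPT)
```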
Kullback also defined I(2:1), the mean information per observation from the second population for discriminating in favor of H_2 against H_1,

    I(2:1) = \int f_2(x) \log\frac{f_2(x)}{f_1(x)}\,dx

and the divergence J(1,2) as the sum of the two directed measures:

    J(1,2) = I(1:2) + I(2:1) = \int \left( f_1(x) - f_2(x) \right) \log\frac{f_1(x)}{f_2(x)}\,dx
           = \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_1 - \int \log\frac{P(H_1 \mid x)}{P(H_2 \mid x)}\,d\mu_2

J(1,2) is a measure of the divergence between H_1 and H_2, i.e., a measure of how difficult it is to discriminate between them. Kullback studied the properties of these measures (additivity, convexity, invariance, sufficiency, minimum discrimination information, and others).
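For discrete distributions, I(1:2), I(2:1) and J(1,2) reduce to sums and are easy to compute. The following Python sketch (function names and the two example distributions are ours, chosen only for illustration) evaluates both directed measures and their sum.

```python
import numpy as np

def directed_divergence(f1, f2):
    """I(1:2) = sum_x f1(x) log2[ f1(x) / f2(x) ]; assumes f2 > 0 wherever f1 > 0."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    mask = f1 > 0
    return np.sum(f1[mask] * np.log2(f1[mask] / f2[mask]))

def divergence_J(f1, f2):
    """Kullback's symmetric divergence J(1,2) = I(1:2) + I(2:1)."""
    return directed_divergence(f1, f2) + directed_divergence(f2, f1)

# Two hypothetical discrete distributions over the same three outcomes.
f1 = [0.5, 0.3, 0.2]
f2 = [0.2, 0.3, 0.5]
print(directed_divergence(f1, f2))   # I(1:2)
print(directed_divergence(f2, f1))   # I(2:1) -- not equal to I(1:2) in general
print(divergence_J(f1, f2))          # J(1,2)
```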