Stochastic processes

A stochastic (or random) process {X_i} is an indexed sequence of random variables. The dependencies among the random variables can be arbitrary. The process is characterized by the joint probability mass functions

$$p(x_1, x_2, \dots, x_n) = \Pr\{(X_1, X_2, \dots, X_n) = (x_1, x_2, \dots, x_n)\}, \quad (x_1, x_2, \dots, x_n) \in \mathcal{X}^n$$

A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index, i.e.

$$\Pr\{X_1 = x_1, \dots, X_n = x_n\} = \Pr\{X_{1+l} = x_1, \dots, X_{n+l} = x_n\}$$

for every n, every shift l, and all x_1, ..., x_n ∈ $\mathcal{X}$.

Markov chains

A discrete stochastic process X_1, X_2, ... is said to be a Markov chain or a Markov process if, for n = 1, 2, ...,

$$\Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n, \dots, X_1 = x_1\} = \Pr\{X_{n+1} = x_{n+1} \mid X_n = x_n\}$$

for all x_1, ..., x_n, x_{n+1} ∈ $\mathcal{X}$. For a Markov chain we can write the joint probability mass function as

$$p(x_1, x_2, \dots, x_n) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_n \mid x_{n-1})$$

A Markov chain is said to be time invariant if the conditional probability p(x_{n+1}|x_n) does not depend on n, i.e.

$$\Pr\{X_{n+1} = b \mid X_n = a\} = \Pr\{X_2 = b \mid X_1 = a\} \quad \text{for all } a, b \in \mathcal{X} \text{ and } n = 1, 2, \dots$$

Markov chains, cont.

If {X_i} is a Markov chain, X_n is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [P_ij], where P_ij = Pr{X_{n+1} = j | X_n = i} (assuming that $\mathcal{X}$ = {1, 2, ..., m}).

If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the greatest common divisor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.

If the probability mass function at time n is p(x_n), the probability mass function at time n + 1 is

$$p(x_{n+1}) = \sum_{x_n} p(x_n) P_{x_n x_{n+1}}$$
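This distribution update is easy to sketch in code. The following minimal Python example iterates the update p(x_{n+1}) = Σ_{x_n} p(x_n) P_{x_n x_{n+1}}; the two-state transition matrix is a hypothetical illustration, not one from the text:

```python
# A hypothetical two-state transition matrix (illustration only, not from the text).
# Rows index the current state i, columns the next state j; each row sums to 1.
P = [[0.7, 0.3],
     [0.4, 0.6]]

def step(p, P):
    """One application of the update p(x_{n+1}) = sum_{x_n} p(x_n) P_{x_n x_{n+1}}."""
    m = len(P)
    return [sum(p[i] * P[i][j] for i in range(m)) for j in range(m)]

# This chain is irreducible and aperiodic, so repeating the update from any
# starting distribution converges to a unique limiting distribution,
# which for this matrix is (4/7, 3/7).
p = [1.0, 0.0]
for _ in range(200):
    p = step(p, P)
print(p)
```

Solving the fixed-point condition p = step(p, P) exactly gives (4/7, 3/7) for this matrix; iterating the update just mirrors the time evolution described above.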
A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. The name comes from the fact that if the initial state is drawn according to a stationary distribution, the Markov chain forms a stationary process. If a finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of X_n tends to the stationary distribution as n → ∞.

Entropy rate

The entropy rate (or just entropy) of a stochastic process {X_i} is defined by

$$H(\mathcal{X}) = \lim_{n\to\infty} \frac{1}{n} H(X_1, X_2, \dots, X_n)$$

when the limit exists. It can also be defined as

$$H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \dots, X_1)$$

when the limit exists. For stationary processes, both limits exist and are equal:

$$H(\mathcal{X}) = H'(\mathcal{X})$$

For a stationary Markov chain, we have

$$H(\mathcal{X}) = H'(\mathcal{X}) = \lim_{n\to\infty} H(X_n \mid X_{n-1}, X_{n-2}, \dots, X_1) = \lim_{n\to\infty} H(X_n \mid X_{n-1}) = H(X_2 \mid X_1)$$

Entropy rate, cont.

For a Markov chain, the stationary distribution μ is the solution to the equation system

$$\mu_j = \sum_i \mu_i P_{ij}, \quad j = 1, \dots, m$$

Given a stationary Markov chain {X_i} with stationary distribution μ and transition matrix P, let X_1 ∼ μ. The entropy rate of the Markov chain is then given by

$$H(\mathcal{X}) = -\sum_{i,j} \mu_i P_{ij} \log P_{ij} = \sum_i \mu_i \Big( -\sum_j P_{ij} \log P_{ij} \Big)$$

i.e. a weighted sum of the entropies for each state.

Functions of Markov chains

Let X_1, X_2, ... be a stationary Markov chain and let Y_i = φ(X_i) be a process where each term is a function of the corresponding state in the Markov chain. What is the entropy rate $H(\mathcal{Y})$? If Y_i also forms a Markov chain this is easy to calculate, but this is not generally the case. We can, however, find upper and lower bounds.

Since the Markov chain is stationary, so is Y_1, Y_2, .... Thus the entropy rate $H(\mathcal{Y})$ is well defined. We might try to compute H(Y_n | Y_{n-1}, ..., Y_1) for each n to find the entropy rate, but the convergence might be arbitrarily slow. We already know that H(Y_n | Y_{n-1}, ..., Y_1) is an upper bound on $H(\mathcal{Y})$, since it converges monotonically from above to the entropy rate. For a lower bound, we can use H(Y_n | Y_{n-1}, ..., Y_1, X_1).

Functions of Markov chains, cont.

We have

$$H(Y_n \mid Y_{n-1}, \dots, Y_2, X_1) \le H(\mathcal{Y})$$

Proof: For k = 1, 2, ... we have

$$\begin{aligned}
H(Y_n \mid Y_{n-1}, \dots, Y_2, X_1) &= H(Y_n \mid Y_{n-1}, \dots, Y_2, Y_1, X_1) \\
&= H(Y_n \mid Y_{n-1}, \dots, Y_2, Y_1, X_1, X_0, \dots, X_{-k}) \\
&= H(Y_n \mid Y_{n-1}, \dots, Y_2, Y_1, X_1, X_0, \dots, X_{-k}, Y_0, \dots, Y_{-k}) \\
&\le H(Y_n \mid Y_{n-1}, \dots, Y_2, Y_1, Y_0, \dots, Y_{-k}) \\
&= H(Y_{n+k+1} \mid Y_{n+k}, \dots, Y_2, Y_1)
\end{aligned}$$

Here the first equality holds because Y_1 is a function of X_1, the second follows from the Markov property, the third because Y_0, ..., Y_{-k} are functions of X_0, ..., X_{-k}, the inequality because conditioning cannot increase entropy, and the last equality from stationarity. Since this is true for all k, it is true in the limit too. Thus

$$H(Y_n \mid Y_{n-1}, \dots, Y_1, X_1) \le \lim_{k\to\infty} H(Y_{n+k+1} \mid Y_{n+k}, \dots, Y_2, Y_1) = H(\mathcal{Y})$$

Functions of Markov chains, cont.

The interval between the upper and lower bounds tends to 0, i.e.

$$H(Y_n \mid Y_{n-1}, \dots, Y_1) - H(Y_n \mid Y_{n-1}, \dots, Y_1, X_1) \to 0$$

Proof: The difference can be rewritten as

$$H(Y_n \mid Y_{n-1}, \dots, Y_1) - H(Y_n \mid Y_{n-1}, \dots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \dots, Y_1)$$

By the properties of mutual information,

$$I(X_1; Y_1, Y_2, \dots, Y_n) \le H(X_1)$$

Since I(X_1; Y_1, Y_2, ..., Y_n) increases with n, the limit exists and

$$\lim_{n\to\infty} I(X_1; Y_1, Y_2, \dots, Y_n) \le H(X_1)$$

Functions of Markov chains, cont.

The chain rule for mutual information gives us

$$H(X_1) \ge \lim_{n\to\infty} I(X_1; Y_1, Y_2, \dots, Y_n) = \lim_{n\to\infty} \sum_{i=1}^{n} I(X_1; Y_i \mid Y_{i-1}, \dots, Y_1) = \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \dots, Y_1)$$

Since this infinite sum is finite and all terms are non-negative, the terms must tend to 0, which means that

$$\lim_{n\to\infty} I(X_1; Y_n \mid Y_{n-1}, \dots, Y_1) = 0$$

Functions of Markov chains, cont.

We have thus shown that

$$H(Y_n \mid Y_{n-1}, \dots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \dots, Y_1)$$

and that

$$\lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \dots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n\to\infty} H(Y_n \mid Y_{n-1}, \dots, Y_1)$$

This result can also be generalized to the case when Y_i is a random function of X_i, called a hidden Markov model.

Higher order Markov sources

A Markov chain is a process where the dependency of the outcome at time n only reaches back one step in time:

$$p(x_n \mid x_{n-1}, x_{n-2}, x_{n-3}, \dots) = p(x_n \mid x_{n-1})$$

In some applications, particularly in data compression, we might want source models where the dependency reaches further back in time. A Markov source of order k is a random process where the following holds:

$$p(x_n \mid x_{n-1}, x_{n-2}, x_{n-3}, \dots) = p(x_n \mid x_{n-1}, \dots, x_{n-k})$$

A Markov source of order 0 is a memoryless process. A Markov source of order 1 is a Markov chain.

Given a stationary Markov source X_n of order k, form a new source S_n = (X_{n-k+1}, X_{n-k+2}, ..., X_n). This new process is a Markov chain, and its entropy rate is

$$H(S_n \mid S_{n-1}) = H(X_n \mid X_{n-1}, \dots, X_{n-k})$$

Differential entropy

A continuous random variable X has the probability density function f(x). Let the support set S be the set of x where f(x) > 0. The differential entropy h(X) of the variable is defined as

$$h(X) = -\int_S f(x) \log f(x)\, dx$$

Unlike the entropy of a discrete variable, the differential entropy can be both positive and negative.

Some common distributions

Normal (Gaussian) distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-m)^2}{2\sigma^2}}, \quad h(X) = \frac{1}{2}\log 2\pi e \sigma^2$$

Laplace distribution:

$$f(x) = \frac{1}{\sqrt{2}\,\sigma}\, e^{-\frac{\sqrt{2}\,|x-m|}{\sigma}}, \quad h(X) = \frac{1}{2}\log 2e^2\sigma^2$$

Uniform distribution:

$$f(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}, \quad h(X) = \log(b-a) = \frac{1}{2}\log 12\sigma^2$$

AEP for continuous variables

Let X_1, X_2, ..., X_n be a sequence of i.i.d. random variables drawn according to the density f(x). Then

$$-\frac{1}{n} \log f(X_1, X_2, \dots, X_n) \to E[-\log f(X)] = h(X) \quad \text{in probability}$$

For ε > 0 and any n we define the typical set $A_\varepsilon^{(n)}$ with respect to f(x) as

$$A_\varepsilon^{(n)} = \Big\{ (x_1, \dots, x_n) \in S^n : \Big| -\frac{1}{n} \log f(x_1, \dots, x_n) - h(X) \Big| \le \varepsilon \Big\}$$

The volume Vol(A) of any set A is defined as

$$\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \cdots dx_n$$

AEP, cont.

The typical set $A_\varepsilon^{(n)}$ has the following properties:

1. Pr{A_ε^{(n)}} > 1 − ε for sufficiently large n
2. Vol(A_ε^{(n)}) ≤ 2^{n(h(X)+ε)} for all n
3. Vol(A_ε^{(n)}) ≥ (1 − ε) 2^{n(h(X)−ε)} for sufficiently large n

This indicates that the typical set contains almost all probability and has a volume of approximately 2^{nh(X)}.

Quantization

Suppose we do uniform quantization of a continuous random variable X, i.e. we divide the range of X into bins of length ∆. Assuming that f(x) is continuous in each bin, there exists a value x_i within each bin such that

$$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx$$

Consider the quantized variable X^∆ defined by X^∆ = x_i if i∆ ≤ X < (i + 1)∆. The probability p(x_i) = p_i that X^∆ = x_i is

$$p_i = \int_{i\Delta}^{(i+1)\Delta} f(x)\, dx = f(x_i)\Delta$$

Quantization, cont.

The entropy of the quantized variable is

$$\begin{aligned}
H(X^\Delta) &= -\sum_i p_i \log p_i = -\sum_i \Delta f(x_i) \log(\Delta f(x_i)) \\
&= -\sum_i \Delta f(x_i) \log f(x_i) - \sum_i \Delta f(x_i) \log \Delta \\
&\approx -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx - \log \Delta \\
&= h(X) - \log \Delta
\end{aligned}$$

Differential entropy, cont.

Let X and Y be two random variables with joint density function f(x, y) and conditional density functions f(x|y) and f(y|x). The joint differential entropy is defined as

$$h(X, Y) = -\int f(x, y) \log f(x, y)\, dx\, dy$$

The conditional differential entropy is defined as

$$h(X \mid Y) = -\int f(x, y) \log f(x \mid y)\, dx\, dy$$

We have

$$h(X, Y) = h(X) + h(Y \mid X) = h(Y) + h(X \mid Y)$$

which can be generalized to the chain rule

$$h(X_1, X_2, \dots, X_n) = h(X_1) + h(X_2 \mid X_1) + \dots + h(X_n \mid X_1, X_2, \dots, X_{n-1})$$

Relative entropy

The relative entropy (Kullback–Leibler distance) between two densities f and g is defined by

$$D(f \| g) = \int f \log \frac{f}{g}$$

The relative entropy is finite only if the support set of f is contained in the support set of g.
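The quantization relation H(X^∆) ≈ h(X) − log ∆ can be checked numerically. A minimal Python sketch, working in bits; the unit-variance Gaussian and the bin width ∆ = 1/64 are illustrative choices, not from the text:

```python
import math

def gauss_cdf(x, sigma=1.0):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def quantized_entropy(delta, sigma=1.0, span=10.0):
    """Entropy in bits of X^Delta for X ~ N(0, sigma^2), bins [i*delta, (i+1)*delta)."""
    H = 0.0
    i = int(-span / delta)
    while i * delta < span:
        # p_i is the exact probability mass captured by bin i.
        p = gauss_cdf((i + 1) * delta, sigma) - gauss_cdf(i * delta, sigma)
        if p > 0.0:
            H -= p * math.log2(p)
        i += 1
    return H

delta = 1.0 / 64.0
h = 0.5 * math.log2(2.0 * math.pi * math.e)  # h(X) for sigma = 1, about 2.05 bits
# The two printed values should nearly coincide: H(X^Delta) vs h(X) - log2(delta).
print(quantized_entropy(delta), h - math.log2(delta))
```

Shrinking ∆ further makes the agreement tighter, matching the ∆ → 0 approximation in the derivation above.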
Mutual information

The mutual information between X and Y is defined as

$$I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy$$

which gives

$$I(X; Y) = h(X) - h(X \mid Y) = h(Y) - h(Y \mid X) = h(X) + h(Y) - h(X, Y)$$

and

$$I(X; Y) = D(f(x, y) \| f(x) f(y))$$

Given two uniformly quantized versions X^∆ and Y^∆ of X and Y,

$$I(X^\Delta; Y^\Delta) = H(X^\Delta) - H(X^\Delta \mid Y^\Delta) \approx h(X) - \log \Delta - (h(X \mid Y) - \log \Delta) = I(X; Y)$$

Properties

- D(f‖g) ≥ 0 with equality iff f and g are equal almost everywhere.
- I(X; Y) ≥ 0 with equality iff X and Y are independent.
- h(X|Y) ≤ h(X) with equality iff X and Y are independent.
- h(X + c) = h(X)
- h(aX) = h(X) + log |a|
- h(AX) = h(X) + log |det(A)|

Differential entropy, cont.

The Gaussian distribution is the distribution that maximizes the differential entropy for a given covariance matrix. That is, in the one-dimensional case, the differential entropy of a variable X with variance σ² satisfies the inequality

$$h(X) \le \frac{1}{2} \log 2\pi e \sigma^2$$

with equality iff X is Gaussian. In the general case, the differential entropy of an n-dimensional random vector X with covariance matrix K satisfies the inequality

$$h(\mathbf{X}) \le \frac{1}{2} \log (2\pi e)^n |K|$$

with equality iff X is a multivariate Gaussian vector.
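The closed-form entropies listed under "Some common distributions" make the maximum-entropy property, and the scaling property h(aX) = h(X) + log |a|, easy to check numerically. A small Python sketch, working in bits; the specific variance and scale factor are arbitrary illustrative choices:

```python
import math

def h_gauss(var):
    """h(X) = (1/2) log2(2*pi*e*sigma^2) for a Gaussian, in bits."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * var)

def h_laplace(var):
    """h(X) = (1/2) log2(2*e^2*sigma^2) for a Laplace density, in bits."""
    return 0.5 * math.log2(2.0 * math.e ** 2 * var)

def h_uniform(var):
    """h(X) = log2(b - a) = (1/2) log2(12*sigma^2) for a uniform density, in bits."""
    return 0.5 * math.log2(12.0 * var)

var = 1.0
# At equal variance the Gaussian has the largest differential entropy:
assert h_gauss(var) > h_laplace(var) > h_uniform(var)

# Scaling: h(aX) = h(X) + log2|a|; scaling X by a multiplies the variance by a^2.
a = 3.0
assert abs(h_gauss(a * a * var) - (h_gauss(var) + math.log2(a))) < 1e-12
```

The same comparison holds at any variance, since all three entropies shift by the same amount under scaling.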