Stochastic processes
A stochastic (or random) process {Xi } is an indexed sequence of random
variables. The dependencies among the random variables can be arbitrary.
The process is characterized by the joint probability mass functions

p(x1, x2, ..., xn) = Pr{(X1, X2, ..., Xn) = (x1, x2, ..., xn)},   (x1, ..., xn) ∈ X^n
A stochastic process is said to be stationary if the joint distribution of
any subset of the sequence of random variables is invariant with respect
to shifts in the time index, i.e.
Pr {X1 = x1 , . . . , Xn = xn } = Pr {X1+l = x1 , . . . , Xn+l = xn }
for every n and shift l and for all x1 , . . . , xn ∈ X .
Markov chains
A discrete stochastic process X1 , X2 , . . . is said to be a Markov chain or a
Markov process if for n = 1, 2, . . .
Pr {Xn+1 = xn+1 |Xn = xn , . . . , X1 = x1 } = Pr {Xn+1 = xn+1 |Xn = xn }
for all x1 , . . . , xn , xn+1 ∈ X .
For a Markov chain we can write the joint probability mass function as
p(x1 , x2 , . . . , xn ) = p(x1 )p(x2 |x1 ) · · · p(xn |xn−1 )
A Markov chain is said to be time invariant if the conditional probability
p(xn+1 |xn) does not depend on n, i.e.
Pr {Xn+1 = b|Xn = a} = Pr {X2 = b|X1 = a}
for all a, b ∈ X and n = 1, 2, . . ..
Markov chains, cont.
If {Xi } is a Markov chain, Xn is called the state at time n. A
time-invariant Markov chain is characterized by its initial state and a
probability transition matrix P = [Pij ], where Pij = Pr {Xn+1 = j|Xn = i}
(assuming that X = {1, 2, . . . , m}).
If it is possible to go with positive probability from any state of the
Markov chain to any other state in a finite number of steps, the Markov
chain is said to be irreducible. If the greatest common divisor of the
lengths of all paths from a state to itself is 1, the Markov chain is
said to be aperiodic.
If the probability mass function at time n is p(xn ), the probability mass
function at time n + 1 is
p(xn+1) = Σ_{xn} p(xn) P_{xn,xn+1}
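This update is a single vector–matrix product. A minimal sketch in Python; the two-state transition matrix is a made-up example, not from the text:

```python
import numpy as np

# Hypothetical two-state chain: Pij = Pr{X_{n+1} = j | X_n = i}.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

p_n = np.array([1.0, 0.0])   # distribution of X_n (start surely in state 0)
p_next = p_n @ P             # p(x_{n+1}) = sum over x_n of p(x_n) P_{x_n, x_{n+1}}
print(p_next)                # [0.9 0.1]
```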
Markov chains, cont.
A distribution on the states such that the distribution at time n + 1 is the
same as the distribution at time n is called a stationary distribution. The
stationary distribution is so called because if the initial state of the
Markov chain is drawn according to a stationary distribution, the Markov
chain forms a stationary process.
If a finite-state Markov chain is irreducible and aperiodic, the stationary
distribution is unique, and from any starting distribution, the distribution
of Xn tends to the stationary distribution as n → ∞.
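Numerically, a stationary distribution can be found as a left eigenvector of the transition matrix for eigenvalue 1. A sketch with a hypothetical two-state matrix:

```python
import numpy as np

# The stationary distribution mu solves mu = mu P, i.e. mu is a left
# eigenvector of P with eigenvalue 1, normalized to sum to 1.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()           # normalize so the probabilities sum to 1
print(mu)                    # [0.8 0.2]
```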
Entropy rate
The entropy rate (or just entropy) of a stochastic process {Xi } is defined
by
H(X) = lim_{n→∞} (1/n) H(X1, X2, ..., Xn)
when the limit exists. It can also be defined as
H′(X) = lim_{n→∞} H(Xn | Xn−1, Xn−2, ..., X1)
when the limit exists.
For stationary processes, both limits exist and are equal
H(X) = H′(X)
For a stationary Markov chain, we have
H(X) = H′(X) = lim_{n→∞} H(Xn | Xn−1, Xn−2, ..., X1)
             = lim_{n→∞} H(Xn | Xn−1) = H(X2 | X1)
Entropy rate, cont.
For a Markov chain, the stationary distribution µ is the solution to the
equation system
µj = Σ_i µi Pij ,   j = 1, ..., m
Let {Xi} be a stationary Markov chain with stationary distribution µ and
transition matrix P, and let X1 ∼ µ. The entropy rate of the Markov chain is
then given by
H(X) = −Σ_{i,j} µi Pij log Pij = Σ_i µi (−Σ_j Pij log Pij)

i.e. a weighted sum of the entropies for each state.
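This weighted sum is straightforward to evaluate. A sketch, using a hypothetical two-state chain and its stationary distribution:

```python
import numpy as np

# Entropy rate of a hypothetical two-state stationary chain, in bits:
# H(X) = sum_i mu_i * (entropy of row i of P).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
mu = np.array([0.8, 0.2])    # stationary distribution of this P

row_entropies = -np.sum(np.where(P > 0, P * np.log2(P), 0.0), axis=1)
H_rate = float(mu @ row_entropies)
print(H_rate)                # ≈ 0.569 bits per symbol
```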
Functions of Markov chains
Let X1 , X2 , . . . be a stationary Markov chain and let Yi = φ(Xi ) be a
process where each term is a function of the corresponding state in the
Markov chain. What is the entropy rate H(Y)? If Yi also forms a Markov
chain this is easy to calculate, but this is not generally the case. We can
however find upper and lower bounds.
Since the Markov chain is stationary, so is Y1 , Y2 , . . .. Thus, the entropy
rate H(Y) is well defined. We might try to compute H(Yn | Yn−1, ..., Y1)
for each n to find the entropy rate, but the convergence might be
arbitrarily slow.
We already know that H(Yn | Yn−1, ..., Y1) is an upper bound on H(Y),
since it decreases monotonically to the entropy rate. For a lower bound,
we can use H(Yn |Yn−1 , . . . , Y1 , X1 ).
Functions of Markov chains, cont.
We have
H(Yn |Yn−1 , . . . , Y2 , X1 ) ≤ H(Y)
Proof: For k = 1, 2, . . . we have
H(Yn | Yn−1, ..., Y2, X1)
= H(Yn | Yn−1, ..., Y2, Y1, X1)                          (Y1 = φ(X1) is determined by X1)
= H(Yn | Yn−1, ..., Y2, Y1, X1, X0, ..., X−k)            (Markov property)
= H(Yn | Yn−1, ..., Y2, Y1, X1, X0, ..., X−k, Y0, ..., Y−k)
≤ H(Yn | Yn−1, ..., Y2, Y1, Y0, ..., Y−k)                (conditioning reduces entropy)
= H(Yn+k+1 | Yn+k, ..., Y2, Y1)                          (stationarity)
Since this is true for all k, it is also true in the limit. Thus

H(Yn | Yn−1, ..., Y1, X1) ≤ lim_{k→∞} H(Yn+k+1 | Yn+k, ..., Y2, Y1) = H(Y)
Functions of Markov chains, cont.
The gap between the upper and lower bounds tends to 0, i.e.
H(Yn |Yn−1 , . . . , Y1 ) − H(Yn |Yn−1 , . . . , Y1 , X1 ) → 0
Proof: The difference can be rewritten as
H(Yn |Yn−1 , . . . , Y1 ) − H(Yn |Yn−1 , . . . , Y1 , X1 ) = I (X1 ; Yn |Yn−1 , . . . , Y1 )
By the properties of mutual information
I (X1 ; Y1 , Y2 , . . . , Yn ) ≤ H(X1 )
Since I (X1 ; Y1 , Y2 , . . . , Yn ) increases with n, the limit exists and
lim_{n→∞} I(X1; Y1, Y2, ..., Yn) ≤ H(X1)
Functions of Markov chains, cont.
The chain rule for mutual information gives us
H(X1) ≥ lim_{n→∞} I(X1; Y1, Y2, ..., Yn)
      = lim_{n→∞} Σ_{i=1}^{n} I(X1; Yi | Yi−1, ..., Y1)
      = Σ_{i=1}^{∞} I(X1; Yi | Yi−1, ..., Y1)
Since this infinite sum is finite and all terms are non-negative, the terms
must tend to 0, which means that
lim_{n→∞} I(X1; Yn | Yn−1, ..., Y1) = 0
Functions of Markov chains, cont.
We have thus shown that
H(Yn |Yn−1 , . . . , Y1 , X1 ) ≤ H(Y) ≤ H(Yn |Yn−1 , . . . , Y1 )
and that
lim_{n→∞} H(Yn | Yn−1, ..., Y1, X1) = H(Y) = lim_{n→∞} H(Yn | Yn−1, ..., Y1)
This result can also be generalized to the case when Yi is a random
function of Xi , called a hidden Markov model.
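For a small chain, both bounds can be computed by brute-force enumeration of all state sequences. The sketch below is illustrative only: the chain P, the merging function φ, and the length n = 6 are made-up choices, not from the text.

```python
import itertools
import numpy as np

# Hypothetical 3-state stationary chain; phi merges states 1 and 2, so
# Y_i = phi(X_i) is binary.  P is doubly stochastic, so mu is uniform.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
phi = [0, 1, 1]
mu = np.array([1, 1, 1]) / 3

def H(dist):
    """Entropy (in bits) of a dict mapping outcomes to probabilities."""
    p = np.array([v for v in dist.values() if v > 0])
    return float(-np.sum(p * np.log2(p)))

n = 6
py, pxy = {}, {}                       # p(y1..yn) and p(x1, y2..yn)
for xs in itertools.product(range(3), repeat=n):
    p = mu[xs[0]]
    for a, b in zip(xs, xs[1:]):
        p *= P[a, b]
    ys = tuple(phi[x] for x in xs)
    py[ys] = py.get(ys, 0.0) + p
    key = (xs[0],) + ys[1:]            # conditioning on X1 makes Y1 redundant
    pxy[key] = pxy.get(key, 0.0) + p

def drop_last(d):                      # marginalize out the final Y symbol
    out = {}
    for k, v in d.items():
        out[k[:-1]] = out.get(k[:-1], 0.0) + v
    return out

upper = H(py) - H(drop_last(py))       # H(Yn | Yn-1, ..., Y1)
lower = H(pxy) - H(drop_last(pxy))     # H(Yn | Yn-1, ..., Y1, X1)
print(lower, upper)                    # lower ≤ H(Y) ≤ upper
```

Increasing n tightens both bounds around H(Y), at exponential cost in the enumeration.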
Higher order Markov sources
A Markov chain is a process where the dependency of an outcome at
time n only reaches back one step in time
p(xn |xn−1 , xn−2 , xn−3 , . . .) = p(xn |xn−1 )
In some applications, particularly in data compression, we might want to
have source models where the dependency reaches back longer in time. A
Markov source of order k is a random process where the following holds
p(xn |xn−1 , xn−2 , xn−3 , . . .) = p(xn |xn−1 , . . . , xn−k )
A Markov source of order 0 is a memoryless process. A Markov source of
order 1 is a Markov chain. Given a stationary Markov source Xn of order
k, form a new source Sn = (Xn−k+1, Xn−k+2, ..., Xn). This new process
is a Markov chain, and the entropy rate is
H(Sn |Sn−1 ) = H(Xn |Xn−1 , . . . , Xn−k )
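A sketch of this state-tupling construction for a hypothetical binary source of order k = 2 (the conditional probabilities are made-up values):

```python
import itertools

# Hypothetical binary Markov source of order k = 2:
# cond[(x_{n-2}, x_{n-1})] gives Pr{X_n = 1 | X_{n-1}, X_{n-2}}.
cond = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.6, (1, 1): 0.9}

# State S_n = (X_{n-1}, X_n); emitting X_{n+1} = c moves (a, b) -> (b, c).
T = {}
for (a, b) in itertools.product([0, 1], repeat=2):
    p1 = cond[(a, b)]
    T[(a, b)] = {(b, 0): 1.0 - p1, (b, 1): p1}

# Each row sums to 1, so S_n is an ordinary (first-order) Markov chain.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in T.values())
```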
Differential entropy
Let X be a continuous random variable with probability density function
f(x), and let the support set S be the set of x where f(x) > 0. The
differential entropy h(X) of the variable is defined as

h(X) = −∫_S f(x) log f(x) dx
Unlike the entropy for a discrete variable, the differential entropy can be
both positive and negative.
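For instance, a uniform density on [0, 1/2] has f(x) = 2 on its support, giving h(X) = log(1/2) = −1 bit:

```python
import numpy as np

# X uniform on [0, 1/2]: f(x) = 2 on the support, so
# h(X) = -∫ 2 log2(2) dx over [0, 1/2] = log2(1/2) = -1 bit.
a, b = 0.0, 0.5
h = np.log2(b - a)
print(h)                     # -1.0: differential entropy can be negative
```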
Some common distributions
Normal distribution (Gaussian distribution)

f(x) = (1/(√(2π)σ)) e^(−(x−m)²/(2σ²)),   h(X) = (1/2) log 2πeσ²

Laplace distribution

f(x) = (1/(√2·σ)) e^(−√2|x−m|/σ),   h(X) = (1/2) log 2e²σ²

Uniform distribution

f(x) = 1/(b−a) for a ≤ x ≤ b, 0 otherwise,   h(X) = log(b−a) = (1/2) log 12σ²
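These closed forms can be checked against the defining integral. A sketch approximating −∫ f log f dx on a grid for the Gaussian case (σ and the grid range are arbitrary illustrative choices):

```python
import numpy as np

# Check h(X) = (1/2) log2(2*pi*e*sigma^2) against a grid approximation of
# -∫ f log2 f dx.  The tails beyond 12 sigma are negligible.
sigma, m = 1.5, 0.0
x = np.linspace(m - 12 * sigma, m + 12 * sigma, 200_001)
f = np.exp(-(x - m) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
dx = x[1] - x[0]
h_num = -np.sum(f * np.log2(f)) * dx
h_formula = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)
print(h_num, h_formula)      # both ≈ 2.63 bits
```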
AEP for continuous variables
Let X1 , X2 , . . . , Xn be a sequence of i.i.d. random variables drawn
according to the density f (x). Then
−(1/n) log f(X1, X2, ..., Xn) → E[−log f(X)] = h(X)   in probability

For ε > 0 and any n we define the typical set A_ε^(n) with respect to f(x) as
follows

A_ε^(n) = {(x1, ..., xn) ∈ S^n : |−(1/n) log f(x1, ..., xn) − h(X)| ≤ ε}

The volume Vol(A) of any set A is defined as

Vol(A) = ∫_A dx1 dx2 ... dxn
AEP, cont.
The typical set A_ε^(n) has the following properties:
1. Pr{A_ε^(n)} > 1 − ε for sufficiently large n
2. Vol(A_ε^(n)) ≤ 2^(n(h(X)+ε)) for all n
3. Vol(A_ε^(n)) ≥ (1 − ε) 2^(n(h(X)−ε)) for sufficiently large n
This indicates that the typical set contains almost all probability and has
a volume of approximately 2^(nh(X)).
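The convergence −(1/n) log f(X1, ..., Xn) → h(X) can be observed by simulation. A sketch for i.i.d. Gaussian samples (sample size and seed are arbitrary illustrative choices):

```python
import numpy as np

# AEP check: for i.i.d. Gaussian Xi, -(1/n) log2 f(X1,...,Xn) should be
# close to h(X) for large n.
rng = np.random.default_rng(0)
sigma, n = 1.0, 200_000
X = rng.normal(0.0, sigma, size=n)

# log2 f(x) = -log2(sqrt(2 pi) sigma) - (x^2 / (2 sigma^2)) * log2(e)
log2_f = -np.log2(np.sqrt(2 * np.pi) * sigma) - (X ** 2 / (2 * sigma ** 2)) * np.log2(np.e)
empirical = -np.mean(log2_f)                        # -(1/n) log2 f(X1,...,Xn)
h = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)    # ≈ 2.047 bits
print(empirical, h)
```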
Quantization
Suppose we do uniform quantization of a continuous random variable X,
i.e. we divide the range of X into bins of length ∆. Assuming that f(x) is
continuous in each bin, there exists a value xi within each bin such that

f(xi)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx

Consider the quantized variable X^∆ defined by

X^∆ = xi   if i∆ ≤ X < (i + 1)∆

The probability p(xi) = pi that X^∆ = xi is

pi = ∫_{i∆}^{(i+1)∆} f(x) dx = f(xi)∆
Quantization, cont.
The entropy of the quantized variable is
H(X^∆) = −Σ_i pi log pi
       = −Σ_i ∆f(xi) log(∆f(xi))
       = −Σ_i ∆f(xi) log f(xi) − Σ_i ∆f(xi) log ∆
       ≈ −∫_{−∞}^{∞} f(x) log f(x) dx − log ∆
       = h(X) − log ∆
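The approximation H(X^∆) ≈ h(X) − log ∆ can be checked by simulation. A sketch with a Gaussian source and a hypothetical bin width ∆ = 0.05 (sample size and seed are arbitrary choices):

```python
import numpy as np

# Empirical check of H(X^∆) ≈ h(X) - log2(∆) for a Gaussian source.
rng = np.random.default_rng(1)
sigma, delta = 1.0, 0.05
X = rng.normal(0.0, sigma, size=500_000)

bins = np.floor(X / delta).astype(int)        # X^∆: index of the bin containing X
_, counts = np.unique(bins, return_counts=True)
p = counts / counts.sum()
H_delta = -np.sum(p * np.log2(p))             # entropy of the quantized variable

h = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)
print(H_delta, h - np.log2(delta))            # both ≈ 6.37 bits
```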
Differential entropy, cont.
Let X and Y be two random variables with joint density function f(x, y) and
conditional density functions f(x|y) and f(y|x). The joint differential
entropy is defined as

h(X, Y) = −∫ f(x, y) log f(x, y) dx dy

The conditional differential entropy is defined as

h(X|Y) = −∫ f(x, y) log f(x|y) dx dy
We have
h(X , Y ) = h(X ) + h(Y |X ) = h(Y ) + h(X |Y )
which can be generalized to (chain rule)
h(X1 , X2 , . . . , Xn ) = h(X1 ) + h(X2 |X1 ) + . . . + h(Xn |X1 , X2 , . . . , Xn−1 )
Relative entropy
The relative entropy (Kullback-Leibler distance) between two densities f
and g is defined by
D(f||g) = ∫ f log(f/g)
The relative entropy is finite only if the support set of f is contained in
the support set for g .
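For two Gaussian densities, D(f||g) has a known closed form, which a grid approximation of the defining integral reproduces. The sketch works in nats (natural logarithm); the parameters are arbitrary illustrative choices:

```python
import numpy as np

# D(f||g) for two Gaussian densities (in nats), grid integral vs. the
# closed form  log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2.
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 2.0
x = np.linspace(-20.0, 20.0, 400_001)
f = np.exp(-(x - m1) ** 2 / (2 * s1 ** 2)) / (np.sqrt(2 * np.pi) * s1)
g = np.exp(-(x - m2) ** 2 / (2 * s2 ** 2)) / (np.sqrt(2 * np.pi) * s2)
D_num = np.sum(f * np.log(f / g)) * (x[1] - x[0])

D_closed = np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5
print(D_num, D_closed)       # both ≈ 0.443 nats, and D ≥ 0
```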
Mutual information
The mutual information between X and Y is defined as
I(X; Y) = ∫ f(x, y) log [f(x, y)/(f(x)f(y))] dx dy
which gives
I (X ; Y ) = h(X ) − h(X |Y ) = h(Y ) − h(Y |X ) = h(X ) + h(Y ) − h(X , Y )
and
I (X ; Y ) = D(f (x, y )||f (x)f (y ))
Given two uniformly quantized versions of X and Y,

I(X^∆; Y^∆) = H(X^∆) − H(X^∆|Y^∆)
            ≈ h(X) − log ∆ − (h(X|Y) − log ∆)
            = I(X; Y)
Properties
D(f ||g ) ≥ 0
with equality iff f and g are equal almost everywhere.
I (X ; Y ) ≥ 0
with equality iff X and Y are independent.
h(X |Y ) ≤ h(X )
with equality iff X and Y are independent.
h(X + c) = h(X )
h(aX ) = h(X ) + log |a|
h(AX) = h(X) + log | det(A)|
Differential entropy, cont.
The Gaussian distribution is the distribution that maximizes the
differential entropy for a given covariance matrix. That is, in the
one-dimensional case the differential entropy of a variable X with
variance σ² satisfies the inequality

h(X) ≤ (1/2) log 2πeσ²

with equality iff X is Gaussian.
For the general case, the differential entropy of a random n-dimensional
vector X with covariance matrix K satisfies the inequality

h(X) ≤ (1/2) log((2πe)^n |K|)

with equality iff X is a multivariate Gaussian vector.
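In the one-dimensional case this can be checked directly against the closed-form entropies given earlier: for a common variance σ², the Gaussian entropy exceeds the Laplace and uniform entropies.

```python
import numpy as np

# Closed-form differential entropies (in bits) for a common variance sigma^2;
# the Gaussian comes out largest, as the maximum-entropy result predicts.
sigma = 1.0
h_gauss = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)   # ≈ 2.047
h_laplace = 0.5 * np.log2(2 * np.e ** 2 * sigma ** 2)    # ≈ 1.943
h_uniform = 0.5 * np.log2(12 * sigma ** 2)               # ≈ 1.792
print(h_gauss, h_laplace, h_uniform)
```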