The Capacity-Cost Function of a Hard-Constrained
Channel
Abstract:
In this paper we consider hard-constrained costly channels. Using the thermodynamic formalism, we prove that the capacity-cost function of such a channel is strictly convex, together with some other related properties. We also obtain estimates for the variations of the cost of the sequences generated by optimum codes.
0.1 Introduction
A hard-constrained channel transmits certain sequences without errors and rejects the others. If a cost is assigned to each sequence, we have the hard-constrained costly channel. This channel has been studied by Khayrallah and Neuhoff in [KN], where they show how to calculate its capacity-cost function, assuming that the channel is finite-state. They also show methods for constructing optimum codes.
In the present paper we shall prove that the capacity-cost function of a hard-constrained channel is strictly convex, together with some other related properties. We consider the general case, without requiring the channel to be finite-state, and we admit very general cost functions. At the end of the paper, we also prove some estimates concerning the variations of the cost of sequences generated by optimum codes. The techniques we use to prove these facts come from the thermodynamic formalism.
The hard-constrained channel is used to model magnetic recording. In this case, the constraints are due to physical limitations of the storage system. It can also be used in Shannon's telegraph channel, but in this case the constraints are dictated by the transmission scheme used. Sometimes it is interesting to add a cost for the usage of an undesirable sequence. The idea is to accept some of these sequences, but not too many of them. This is done by limiting the mean cost of the sequences. In [KN], some examples of practical applications of these channels are presented.
The capacity-cost function C(ρ) is the maximum code rate which is possible with average cost at most ρ. This function is only meaningful in a certain interval [ρmin , ρmax ], whose endpoints are characterized by
ρmin = sup{ρ|C(ρ) = 0}
ρmax = inf{ρ|C(ρ) = Cmax }
The function C(ρ) has some well-known properties. It is continuous, strictly increasing and convex ∩ in the interval [ρmin , ρmax ] (see appendix). In this paper, we shall prove that C(ρ) is in fact a strictly convex ∩, analytic function of ρ in the interval (ρmin , ρmax ]. Moreover, we shall also show that dC/dρ(ρmin ) = +∞ and dC/dρ(ρmax ) = 0.
The techniques used in this paper also allow us to show some facts concerning the cost of the sequences generated by optimum codes. One of them is that the cost asymptotically satisfies a normal law. We will also show how the capacity-cost function can be used to estimate the asymptotic probability that the average cost differs from its limit ρ.
In section 2, we define the capacity-cost function more precisely. In sections 3 and 4 we prove the results about C(ρ) mentioned above. These sections are the core of the article. In section 5, the convergence to the normal law is studied, and in section 6 we show the relation between the capacity-cost function and large deviations of the cost.
0.2 The Capacity-Cost Function
In this section, we shall define the capacity-cost function of a hard-constrained channel and state some of its properties. Let Σ denote the space of sequences
in d symbols,
Σ = {x = (x0 , x1 , x2 , ...)|xi ∈ {1, ..., d}}
and S the shift,
S(x0 , x1 , x2 , ...) = (x1 , x2 , ...)
A noiseless constrained channel is a subset Π of Σ invariant under S. We shall assume that Π is irreducible and aperiodic ([A],[KN]). This is a natural hypothesis, since if it does not hold we can divide Π into several channels with these properties.
The cost is a Hölder-continuous function k : Π → R+ ([B]). The average cost kn is given by
kn(x) = (1/n) Σ_{j=0}^{n−1} k(S^j(x))
We say that k is homologous to a constant if, for each x ∈ Π,
lim_{n→∞} kn(x) = K
where the limit K is independent of x ∈ Π.
Given ρ ≥ 0 , the capacity C(ρ) is defined by
C(ρ) = max_{p∈M(Π)} {H(p) | ∫Π k dp ≤ ρ}
where M(Π) denotes the set of shift invariant probability measures with support in Π.
Let
ρmin = sup{ρ|C(ρ) = 0}
and
ρmax = min{ρ|C(ρ) = Cmax }
where Cmax = max_{p∈M(Π)} {H(p)}. Evidently C(ρ) = 0, if ρ < ρmin and
C(ρ) = Cmax , if ρ ≥ ρmax . We observe also that if k is homologous to a
constant, then ρmin = ρmax .
In [E], it is proved that the capacity-cost function C(ρ) is continuous,
strictly increasing and convex ∩ in the interval [ρmin , ρmax ].
Example 1:
Take Π to be the full shift Σ in 2 symbols and let k : Σ → R+ be given by k(x0 , x1 , ...) = 0 if x0 = 1, and k(x0 , x1 , ...) = 1 if x0 = 2. This channel is memoryless, and so we can restrict the maximization to i.i.d. sources to find the capacity-cost function [E]. We have then
C(ρ) = max_{p∈M(Π)} {H(p) | ∫ k dp ≤ ρ, p i.i.d.}
Every i.i.d. source is characterized by a number in the interval [0, 1]
which represents P{x0 = 2}. We shall denote this number also by p, hoping that this will cause no confusion to the reader. It is clear that ∫ k dp = p.
Denote by h(p) the entropy function given by
h(p) = −p log p − (1 − p) log(1 − p)
Then
C(ρ) = max_{p∈[0,1]} {h(p) | p ≤ ρ}
So we have ρmin = 0, ρmax = 1/2 and, for ρ ∈ [0, 1/2], C(ρ) = h(ρ).
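For the reader who wishes to check this numerically, the following is a minimal sketch (in Python; the helper names are ours and the logarithms are natural):

```python
import math

def h(p):
    # Binary entropy in nats; h(0) = h(1) = 0 by continuity.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def C(rho):
    # Capacity-cost function of example 1: C(rho) = h(min(rho, 1/2)),
    # since h is increasing on [0, 1/2] and C(rho) = Cmax for rho >= 1/2.
    return h(min(rho, 0.5))

for rho in [0.0, 0.1, 0.25, 0.5, 0.8]:
    print(f"C({rho}) = {C(rho):.4f} nats")
```

At ρ = 1/2 this gives C = log 2, the capacity of the unconstrained binary channel.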
0.3 The Topological Pressure
Given µ ∈ R, the topological pressure P(µ) is given by
P(µ) = sup_{p∈M(Π)} {H(p) + µ ∫Π k dp}
This function is very important in the thermodynamic formalism and we shall use it here to derive our results. Ruelle's theorem states some important facts about it:
(1) P is an analytic function of µ.
(2) There exists a unique probability measure p∗ = p∗(µ) in Π such that
P(µ) = H(p∗) + µ ∫Π k dp∗
(3) The derivative of P is given by
dP/dµ(µ) = ∫Π k dp∗
(4) The second derivative of P is given by
d²P/dµ²(µ) = σ²(µ)
where
σ²(µ) = lim_{n→∞} (1/n) ∫Π (Σ_{j=0}^{n−1} k(S^j(x)) − n ∫Π k dp∗)² dp∗
Moreover, if k is not homologous to a constant, then σ 2 (µ) > 0, for any
µ ∈ R. In this case the topological pressure P is a strictly convex ∪ function
of µ.
Example 1 (cont.):
Returning to example 1, we have that
P(µ) = sup_{p∈[0,1]} {h(p) + µp}
In order to find the maximum of the above expression we look for p satisfying
dh/dp(p) = −µ
Since
dh/dp(p) = log((1 − p)/p)
we obtain
p∗ = 1/(1 + e^{−µ})
Therefore
P(µ) = log(1 + e^µ)
We also obtain expressions for the derivatives,
dP/dµ = e^µ/(1 + e^µ) = 1/(1 + e^{−µ})
and
d²P/dµ² = e^µ/(1 + e^µ)²
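These closed forms can be checked against items (3) and (4) of Ruelle's theorem by finite differences; a short sketch (the function names and step size are ours) compares dP/dµ with p∗ and d²P/dµ² with the expression above:

```python
import math

def P(mu):
    # Pressure of example 1: P(mu) = log(1 + e^mu)
    return math.log(1.0 + math.exp(mu))

def p_star(mu):
    # Equilibrium probability of the costly symbol: p* = 1/(1 + e^{-mu})
    return 1.0 / (1.0 + math.exp(-mu))

eps = 1e-4
for mu in [-2.0, -0.5, 0.0, 1.0]:
    dP = (P(mu + eps) - P(mu - eps)) / (2 * eps)            # item (3)
    d2P = (P(mu + eps) - 2 * P(mu) + P(mu - eps)) / eps**2  # item (4)
    var = math.exp(mu) / (1.0 + math.exp(mu)) ** 2          # closed form
    print(f"mu={mu:5.2f}  dP={dP:.6f}  p*={p_star(mu):.6f}  "
          f"d2P={d2P:.6f}  e^mu/(1+e^mu)^2={var:.6f}")
```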
In the next lemma, we show that for µ ≤ 0, the probability measure that leads
to the pressure also leads to the capacity.
Lemma 1:
Given µ ≤ 0, take ρ = ∫Π k dp∗. Then
C(ρ) = H(p∗)
Proof:
If p ∈ M(Π), p ≠ p∗, with ∫Π k dp ≤ ρ, then, by the uniqueness in Ruelle's theorem,
H(p) + µ ∫Π k dp < H(p∗) + µ ∫Π k dp∗
Since µ ≤ 0 and ∫Π k dp ≤ ρ = ∫Π k dp∗, we have µ ∫Π k dp ≥ µ ∫Π k dp∗, and hence
H(p) < H(p∗)
Therefore
H(p∗) = sup_p {H(p) | ∫Π k dp ≤ ρ} = C(ρ)
Example 1 (cont.):
In example 1, given µ ≤ 0, take p∗ = 1/(1 + e^{−µ}). Then ρ = p∗ and, by the above lemma,
C(ρ) = h(ρ)
This is in accordance with our previous calculation of C(ρ), since for µ ≤ 0, ρ ∈ (0, 1/2].
Remark:
For each µ, we can obtain the value of P(µ) as the logarithm of the dominant eigenvalue λ = λ(µ) of a positive linear operator L on the space of continuous functions on Π. The operator L is called the Ruelle-Perron-Frobenius operator. The adjoint operator La, defined on the space of measures on Π, also has λ(µ) as its dominant eigenvalue. If we denote by ν the eigenvector of La associated to λ, normalized to be a probability, and by g the eigenvector of L associated to λ, normalized by ∫ g dν = 1, then the probability p∗ is equal to gν. For more details, see [PP].
In the case of a finite-state channel, the operator L reduces to a positive
matrix B (see [KN]).
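As an illustration, the following sketch (ours, not from [KN]) computes the finite-state data of the remark by power iteration on a primitive matrix B: the dominant eigenvalue λ, the left eigenvector ν normalized to be a probability, the right eigenvector g normalized so that Σ_i g_i ν_i = 1, and the vector p∗ = gν:

```python
import numpy as np

def rpf_data(B, n_iter=1000):
    # Power iteration for the dominant eigenvalue and the left/right
    # eigenvectors of a primitive nonnegative matrix B.
    n = B.shape[0]
    g = np.ones(n)            # converges to the right eigenvector of B
    nu = np.ones(n)           # converges to the left eigenvector of B
    for _ in range(n_iter):
        g = B @ g
        nu = B.T @ nu
        g /= np.linalg.norm(g)
        nu /= np.linalg.norm(nu)
    lam = (nu @ (B @ g)) / (nu @ g)  # Rayleigh quotient gives lambda
    nu /= nu.sum()                   # nu normalized to be a probability
    g /= g @ nu                      # g normalized so that sum_i g_i nu_i = 1
    return lam, g * nu               # p*_i = g_i nu_i, a probability vector
```

Here P(µ) = log λ, and p∗ sums to 1 by construction.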
Example 2:
Consider the costly finite-state channel defined by the graph of figure 1. It is a (1, 3) hard channel that also admits sequences of 1's, but with a unit cost for each adjacent pair of 1's (see [KN]). In this example,

B =
⎡ e^µ  1  1  1 ⎤
⎢ 1    0  0  0 ⎥
⎢ 0    1  0  0 ⎥
⎣ 0    0  1  0 ⎦
We can compute P(µ) as the logarithm of the dominant eigenvalue λ of this matrix. The probability vector p∗ is obtained as a product of the normalized left and right eigenvectors associated with λ. The corresponding value of ρ is given by ∫Π k dp∗, which in this case reduces to the first component of p∗. For example, for µ = −1.0885 one obtains ρ = 0.09 and C(ρ) = 0.8181 (see [KN]).
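The following sketch reproduces this computation with numpy. Two conventions deserve a warning: with the pressure as defined in this paper one takes µ ≤ 0 (here µ = −1.0885; [KN] appears to use the opposite sign), and the cited value C(ρ) = 0.8181 appears to be in bits, so we divide by log 2:

```python
import numpy as np

def B(mu):
    # Weight matrix of example 2; e^mu sits on the costly 1 -> 1 edge.
    return np.array([[np.exp(mu), 1, 1, 1],
                     [1, 0, 0, 0],
                     [0, 1, 0, 0],
                     [0, 0, 1, 0]], dtype=float)

def pressure(mu):
    # P(mu) = log of the dominant eigenvalue of B(mu)
    return np.log(max(np.linalg.eigvals(B(mu)).real))

def rho_and_C(mu, eps=1e-6):
    rho = (pressure(mu + eps) - pressure(mu - eps)) / (2 * eps)  # rho = dP/dmu
    return rho, pressure(mu) - mu * rho                          # C = P - mu*rho

rho, C = rho_and_C(-1.0885)
print(rho, C / np.log(2))   # approximately 0.09 and 0.8181 bits
```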
0.4 The Legendre Transform
Assume that k is not homologous to a constant. Let
ρ− = lim_{µ→−∞} dP/dµ(µ)
ρ0 = dP/dµ(0)
ρ+ = lim_{µ→+∞} dP/dµ(µ)
Denote by L(P)(ρ) the Legendre transform of P(µ), which is defined by
L(P)(ρ) = inf_µ {P(µ) − µρ}
for ρ ∈ (ρ− , ρ+ ). Since P is strictly convex, L(P) is well defined and there
exists a unique µ∗ ∈ R such that
L(P)(ρ) = P(µ∗) − µ∗ρ
and
dP/dµ(µ∗) = ρ
Lemma 2:
Take ρ ∈ (ρ− , ρ0 ) . Then L(P)(ρ) = C(ρ) .
Proof:
We have
L(P)(ρ) = P(µ∗) − µ∗ρ
By Ruelle's theorem, there exists a unique p∗ such that
P(µ∗) = H(p∗) + µ∗ ∫ k dp∗
and
dP/dµ(µ∗) = ∫ k dp∗
Since dP/dµ(µ∗) = ρ, the last equation gives ∫ k dp∗ = ρ, and therefore
L(P)(ρ) = H(p∗)
Moreover, ρ < ρ0 implies µ∗ < 0, so lemma 1 applies and
L(P)(ρ) = C(ρ)
Lemma 3:
We have that ρ− = ρmin and ρ0 = ρmax .
Proof:
Since C(ρ) > 0 for ρ > ρ−, we have ρmin ≤ ρ−. And by the convexity of C and lim_{ρ→ρ−} dC/dρ(ρ) = +∞, we conclude that ρ− ≤ ρmin. Hence ρ− = ρmin.
At µ = 0, C(ρ0) = P(0) = Cmax and therefore ρ0 ≥ ρmax. If µ < 0, then ρ < ρ0 and dC/dρ(ρ) > 0. This implies that ρ0 ≤ ρmax. Therefore ρ0 = ρmax.
Example 1 (cont.):
In example 1 we have ρmin = ρ− = 0 and ρmax = ρ0 = 1/2. Also, ρ+ = 1.
Example 2 (cont.):
Letting µ → −∞, we obtain ρmin = 0 and C(0) = log(1.4656). And taking µ = 0, we obtain ρmax = 0.2938 and Cmax = log(1.9276).
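These endpoint values can be checked directly from the Perron eigenvalues of the two limiting matrices; a sketch, with numpy:

```python
import numpy as np

# mu -> -infty kills the costly edge (e^mu -> 0); mu = 0 weights every
# edge by 1. The logarithms of the Perron roots give C(0) and Cmax.
B_min = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float)
B_max = np.array([[1, 1, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], float)
for name, M in [("C(0)", B_min), ("Cmax", B_max)]:
    lam = max(np.linalg.eigvals(M).real)
    print(f"{name} = log({lam:.4f})")   # log(1.4656) and log(1.9276)
```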
Theorem:
Given ρ ∈ (ρmin , ρmax ], take µ = µ(ρ) such that dP/dµ(µ) = ρ. Then µ is an analytic function of ρ and the following formulas are valid:
C(ρ) = P(µ) − ρµ
dC/dρ(ρ) = −µ
d²C/dρ²(ρ) = −(d²P/dµ²(µ))^{−1}
Proof:
Given ρ ∈ (ρmin , ρmax ], let µ = µ(ρ) be defined implicitly by the formula
dP/dµ(µ) = ρ
Since d²P/dµ²(µ) > 0, the implicit function theorem (in its analytic version) shows that µ(ρ) is analytic, and
dµ/dρ(ρ) = (d²P/dµ²(µ))^{−1}
And since
C(ρ) = P(µ(ρ)) − ρµ(ρ)
we have, using dP/dµ(µ) = ρ,
dC/dρ(ρ) = dP/dµ(µ) dµ/dρ(ρ) − µ(ρ) − ρ dµ/dρ(ρ) = −µ(ρ)
Therefore
d²C/dρ²(ρ) = −dµ/dρ(ρ) = −(d²P/dµ²(µ))^{−1}
Corollary:
The capacity-cost function C(ρ) is analytic in the interval ( ρmin , ρmax ]
and strictly convex ∩ in the interval [ρmin , ρmax ]. Moreover,
dC/dρ(ρmax ) = 0
and
lim_{ρ↘ρmin} dC/dρ(ρ) = +∞
Proof:
Since C(ρ) = P(µ(ρ)) − ρµ(ρ), with µ(ρ) analytic, C(ρ) is also analytic. And by the formula
d²C/dρ²(ρ) = −(d²P/dµ²(µ))^{−1}
we conclude that d²C/dρ²(ρ) < 0. Therefore C(ρ) is strictly convex ∩ in the interval (ρmin , ρmax ]. By the continuity of C(ρ) in [ρmin , ρmax ], we conclude that in fact C(ρ) is strictly convex ∩ in the interval [ρmin , ρmax ].
Finally, the equation dC/dρ(ρ) = −µ(ρ) implies that dC/dρ(ρmax ) = 0 and lim_{ρ↘ρmin} dC/dρ(ρ) = +∞.
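In example 1 all of these formulas are explicit and can be verified directly: from dP/dµ = 1/(1 + e^{−µ}) = ρ we get µ(ρ) = log(ρ/(1 − ρ)), and the theorem predicts dC/dρ = −µ(ρ) = log((1 − ρ)/ρ), which is exactly h′(ρ). A short numerical check (helper names ours):

```python
import math

def h(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def mu_of_rho(rho):
    # Inverse of dP/dmu = 1/(1 + e^{-mu}) = rho in example 1
    return math.log(rho / (1.0 - rho))

eps = 1e-6
for rho in [0.05, 0.2, 0.4, 0.5]:
    dC = (h(rho + eps) - h(rho - eps)) / (2 * eps)  # numerical dC/drho
    print(f"rho={rho}:  dC/drho={dC:.6f}  -mu(rho)={-mu_of_rho(rho):.6f}")
```

At ρ = 1/2 = ρmax both columns vanish, as the corollary asserts.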
0.5 Central Limit Theorem
If we assume that the source generates symbols independently and with equal probability, and we use an optimal code, then the corresponding sequences in Π will be distributed according to p∗. The cost k has mean ρ and an asymptotic variance given by ([PP])
σ²(µ) = lim_{n→∞} (1/n) ∫Π (Σ_{j=0}^{n−1} k(S^j(x)) − nρ)² dp∗
This formula can be simplified to ([L])
σ²(µ) = ∫Π (k(x) − ρ)² dp∗ + 2 Σ_{i=1}^{∞} ∫Π (k(x) − ρ)(k(S^i(x)) − ρ) dp∗
In this case, the central limit theorem is valid [PP]. This means that
√n (kn(x) − ρ) = (1/√n) Σ_{j=0}^{n−1} (k(S^j(x)) − ρ)
converges in distribution to the normal law N(0, σ²).
Example 1 (cont.):
In example 1, it is clear that, for any i ≥ 1, k(S^i(x)) is independent of k(x). Therefore
∫Π (k(x) − ρ)(k(S^i(x)) − ρ) dp∗ = 0
And
∫Π (k(x) − ρ)² dp∗ = p(1 − p)² + (1 − p)(0 − p)²
Hence
σ²(µ) = p(1 − p)
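A Monte Carlo check of the central limit theorem for example 1 (a sketch; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, trials = 0.3, 2000, 5000            # p = p*(mu), so rho = p
costs = rng.random((trials, n)) < p        # i.i.d. symbols; cost 1 w.p. p
z = np.sqrt(n) * (costs.mean(axis=1) - p)  # sqrt(n) (k_n - rho)
print(z.std(), np.sqrt(p * (1 - p)))       # empirical vs predicted sigma
```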
Example 2 (cont.):
Compute the value of σ²(µ) for the value of µ considered above.
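Since σ²(µ) = d²P/dµ² by Ruelle's theorem, this value can be obtained numerically from the pressure of example 2; a minimal sketch, reusing the matrix above and again taking µ = −1.0885 in the sign convention of this paper:

```python
import numpy as np

def pressure(mu):
    B = np.array([[np.exp(mu), 1, 1, 1],
                  [1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    return np.log(max(np.linalg.eigvals(B).real))

def sigma2(mu, eps=1e-4):
    # sigma^2(mu) = d^2 P / d mu^2, by a central second difference
    return (pressure(mu + eps) - 2 * pressure(mu) + pressure(mu - eps)) / eps**2

print(sigma2(-1.0885))
```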
0.6 Large Deviations
As in the last section, assume that the source generates symbols independently and with equal probability. If we consider an optimum code, the corresponding sequences in Π will be distributed according to p∗. We know that
lim_{n→∞} (1/n) Σ_{j=0}^{n−1} k(S^j(x)) = ∫Π k dp∗ = ρ∗
In this section, we will study the probability of deviations of (1/n) Σ_{j=0}^{n−1} k(S^j(x)) from its limit ρ∗. In the thermodynamic formalism, this is called large deviations.
Let
p∗n(A) = p∗((1/n) Σ_{j=0}^{n−1} k(S^j(x)) ∈ A)
Then, by the above formula,
lim_{n→∞} p∗n(A) = 1 if ρ∗ ∈ A
lim_{n→∞} p∗n(A) = 0 if ρ∗ ∉ A
In the theory of large deviations, the following fact is true [L]:
lim_{n→∞} (1/n) log p∗n(A) = sup_{ρ∈A} I(ρ)
where
I(ρ) = L{P(µ + µ∗) − P(µ∗)}(ρ)
Lemma 4:
For µ∗ ≤ 0, let ρ∗ = dP/dµ(µ∗). Then
I(ρ) = C(ρ) − C(ρ∗) − (ρ − ρ∗) dC/dρ(ρ∗)
Proof:
I(ρ) = inf_µ {P(µ + µ∗) − P(µ∗) − µρ}
     = inf_µ {P(µ + µ∗) − (µ + µ∗)ρ} − P(µ∗) + µ∗ρ
     = L{P}(ρ) − {P(µ∗) − µ∗ρ∗} + (ρ − ρ∗)µ∗
     = L{P}(ρ) − L{P}(ρ∗) + (ρ − ρ∗)µ∗
By lemma 2, L{P}(ρ) = C(ρ) and µ∗ = −dC/dρ(ρ∗). Therefore
I(ρ) = C(ρ) − C(ρ∗) − (ρ − ρ∗) dC/dρ(ρ∗)
Remarks:
(1) Observe that if ρ∗ ∈ A, then
lim_{n→∞} (1/n) log p∗n(A) = 0
(2) Observe that the greater the value of σ²(µ), the less convex ∩ the function C(ρ) is. This implies that the function I(ρ) is less negative, and hence the probability of large deviations of k increases.