On Generalized Measures of Information with
Maximum and Minimum Entropy Prescriptions
A Thesis
Submitted For the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Ambedkar Dukkipati
Computer Science and Automation
Indian Institute of Science
Bangalore – 560 012
March 2006
Abstract
Kullback-Leibler relative-entropy or KL-entropy of P with respect to R, defined as
$\int_X \ln \frac{dP}{dR} \, dP,$
where P and R are probability measures on a measurable space (X, M), plays a basic role in the
definitions of classical information measures. It overcomes a shortcoming of Shannon entropy, whose discrete-case definition cannot be extended naturally to the nondiscrete case. Further,
entropy and other classical information measures can be expressed in terms of KL-entropy and
hence properties of their measure-theoretic analogs will follow from those of measure-theoretic
KL-entropy. An important theorem in this respect is the Gelfand-Yaglom-Perez (GYP) Theorem
which equips KL-entropy with a fundamental definition and can be stated as: measure-theoretic
KL-entropy equals the supremum of KL-entropies over all measurable partitions of X. In this
thesis we provide the measure-theoretic formulations for ‘generalized’ information measures, and
state and prove the corresponding GYP-theorem – the ‘generalizations’ being in the sense of Rényi
and nonextensive, both of which are explained below.
Kolmogorov-Nagumo average or quasilinear mean of a vector x = (x_1, . . ., x_n) with respect to a pmf p = (p_1, . . ., p_n) is defined as $\langle x \rangle_\psi = \psi^{-1}\left( \sum_{k=1}^{n} p_k \psi(x_k) \right)$, where ψ is an arbitrary
continuous and strictly monotone function. Replacing linear averaging in Shannon entropy with
Kolmogorov-Nagumo averages (KN-averages) and further imposing the additivity constraint – a characteristic property of the underlying information associated with a single event, which is logarithmic – leads to the definition of α-entropy or Rényi entropy. This is the first well-known formal generalization of Shannon entropy. Using this recipe of Rényi's generalization, one can prepare
only two information measures: Shannon and Rényi entropy. Indeed, using this formalism Rényi
characterized these additive entropies in terms of axioms of KN-averages. On the other hand, if
one generalizes the information of a single event in the definition of Shannon entropy, by replacing the logarithm with the so-called q-logarithm, which is defined as $\ln_q x = \frac{x^{1-q}-1}{1-q}$, one gets
what is known as Tsallis entropy. Tsallis entropy is also a generalization of Shannon entropy
but it does not satisfy the additivity property. Instead, it satisfies pseudo-additivity of the form
x ⊕q y = x + y + (1 − q)xy, and hence it is also known as nonextensive entropy. One can apply
Rényi’s recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with
KN-averages and thereby imposing the constraint of pseudo-additivity. A natural question that
arises is what are the various pseudo-additive information measures that can be prepared with this
recipe? We prove that Tsallis entropy is the only one. Here, we mention that one of the important characteristics of this generalized entropy is that while canonical distributions resulting from
‘maximization’ of Shannon entropy are exponential in nature, in the Tsallis case they result in
power-law distributions.
The concept of maximum entropy (ME), originally from physics, has been promoted to a general principle of inference primarily by the works of Jaynes and (later on) Kullback. This connects
information theory and statistical mechanics via the principle that the states of thermodynamic equilibrium are states of maximum entropy, and further connects to statistical inference via the rule: select the probability distribution that maximizes the entropy. The two fundamental principles related to the concept of maximum entropy are Jaynes' maximum entropy principle, which involves maximizing Shannon entropy, and the Kullback minimum entropy principle, which involves minimizing relative-entropy, with respect to appropriate moment constraints.
Though relative-entropy is not a metric, in cases involving distributions resulting from relative-entropy minimization, one can bring forth certain geometrical formulations. These are reminiscent of squared Euclidean distance and satisfy an analogue of Pythagoras' theorem. This property
is referred to as Pythagoras’ theorem of relative-entropy minimization or triangle equality and
plays a fundamental role in geometrical approaches to statistical estimation theory like information geometry. In this thesis we state and prove the equivalent of Pythagoras’ theorem in the
nonextensive formalism. For this purpose we study relative-entropy minimization in detail and
present some results.
Finally, we demonstrate the use of power-law distributions, resulting from ME-prescriptions
of Tsallis entropy, in evolutionary algorithms. This work is motivated by the recently proposed
generalized simulated annealing algorithm based on Tsallis statistics.
To sum up, in light of their well-known axiomatic and operational justifications, this thesis
establishes some results pertaining to the mathematical significance of generalized measures of
information. We believe that these results represent an important contribution towards the ongoing
research on understanding the phenomena of information.
To
Bhirava Swamy and Bharati who infected me with a disease called Life
and to
all my Mathematics teachers who taught me how to extract sweetness from it.
. . . lie down in a garden and extract from the disease,
especially if it’s not a real one, as much sweetness as
possible.
There’s a lot of sweetness in it.

FRANZ KAFKA, IN A LETTER TO MILENA
Acknowledgements
No one deserves more thanks for the success of this work than my advisers Prof. M. Narasimha
Murty and Dr. Shalabh Bhatnagar. I wholeheartedly thank them for their guidance.
I thank Prof. Narasimha Murty for his continued support throughout my graduate student
years. I always looked upon him for advice – academic or non-academic. He has always been
a very patient critique of my research approach and results; without his trust and guidance this
thesis would not have been possible. I feel that I am more disciplined, simple and punctual after
working under his guidance.
The opportunity to watch Dr. Shalabh Bhatnagar in action (particularly during discussions)
has fashioned my way of thought in problem solving. He has been a valuable adviser, and I hope
my three and a half years of working with him have left me with at least a few of his qualities.
I am thankful to the Chairman, Department of CSA for all the support.
I am privileged to learn mathematics from the great teachers: Prof. Vittal Rao, Prof. Adi Murty
and Prof. A. V. Gopala Krishna. I thank them for imbibing in me the rigour of mathematics.
Special thanks are due to Prof. M. A. L. Thathachar for having taught me.
I thank Dr. Christophe Vignat for his criticisms and encouraging advice on my papers.
I wish to thank CSA staff Ms. Lalitha, Ms. Meenakshi and Mr. George for being of very great
help in administrative works. I am thankful to all my labmates: Dr. Vishwanath, Asharaf, Shahid,
Rahul, Dr. Vijaya, for their help. I also thank my institute friends Arjun, Raghav, Ranjna.
I will never forget the time I spent with Asit, Aneesh, Gunti, Ravi. Special thanks to my music
companions, Raghav, Hari, Kripa, Manas, Niki. Thanks to all IISc Hockey club members and my
running mates, Sai, Aneesh, Sunder. I thank Dr. Sai Jagan Mohan for correcting my drafts.
Special thanks are due to Vinita who corrected many of my drafts of papers, this thesis, all the
way from DC and WI. Thanks to Vinita, Moski and Madhulatha for their care.
I am forever indebted to my sister Kalyani for her prayers. My special thanks are due to my
sister Sasi and her husband and to my brother Karunakar and his wife. Thanks to my cousin
Chinni for her special care. The three great new women in my life: my nieces Sanjana (3 years),
Naomika (2 years), Bhavana (3 months) who will always be dear to me. I reserve my special love
for my nephew (new born).
I am indebted to my father for keeping his promise that he will continue to guide me even
though he had to go to unreachable places. I owe everything to my mother for taking care of every
need of mine. I dedicate this thesis to my parents and to my teachers.
Contents

Abstract
Acknowledgements
Notations

1 Prolegomenon
   1.1 Summary of Results
   1.2 Essentials
      1.2.1 What is Entropy?
      1.2.2 Why to maximize entropy?
   1.3 A reader’s guide to the thesis

2 KN-averages and Entropies: Rényi’s Recipe
   2.1 Classical Information Measures
      2.1.1 Shannon Entropy
      2.1.2 Kullback-Leibler Relative-Entropy
   2.2 Rényi’s Generalizations
      2.2.1 Hartley Function and Shannon Entropy
      2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means
      2.2.3 Rényi Entropy
   2.3 Nonextensive Generalizations
      2.3.1 Tsallis Entropy
      2.3.2 q-Deformed Algebra
   2.4 Uniqueness of Tsallis Entropy under Rényi’s Recipe
   2.5 A Characterization Theorem for Nonextensive Entropies

3 Measures and Entropies: Gelfand-Yaglom-Perez Theorem
   3.1 Measure-Theoretic Definitions of Classical Information Measures
      3.1.1 Discrete to Continuous
      3.1.2 Classical Information Measures
      3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy
   3.2 Measure-Theoretic Definitions of Generalized Information Measures
   3.3 Maximum Entropy and Canonical Distributions
   3.4 ME-prescription for Tsallis Entropy
      3.4.1 Tsallis Maximum Entropy Distribution
      3.4.2 The Case of Normalized q-expectation values
   3.5 Measure-Theoretic Definitions Revisited
      3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies
      3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy
   3.6 Gelfand-Yaglom-Perez Theorem in the General Case

4 Geometry and Entropies: Pythagoras’ Theorem
   4.1 Relative-Entropy Minimization in the Classical Case
      4.1.1 Canonical Minimum Entropy Distribution
      4.1.2 Pythagoras’ Theorem
   4.2 Tsallis Relative-Entropy Minimization
      4.2.1 Generalized Minimum Relative-Entropy Distribution
      4.2.2 q-Product Representation for Tsallis Minimum Entropy Distribution
      4.2.3 Properties
      4.2.4 The Case of Normalized q-Expectations
   4.3 Nonextensive Pythagoras’ Theorem
      4.3.1 Pythagoras’ Theorem Restated
      4.3.2 The Case of q-Expectations
      4.3.3 In the Case of Normalized q-Expectations

5 Power-laws and Entropies: Generalization of Boltzmann Selection
   5.1 EAs based on Boltzmann Distribution
   5.2 EA based on Power-law Distributions
   5.3 Simulation Results

6 Conclusions
   6.1 Contributions of the Dissertation
   6.2 Future Directions
   6.3 Concluding Thought

Bibliography
Notations

R – The set (field) of real numbers
R+ – The interval [0, ∞)
Z+ – The set of positive integers
2^X – Power set of the set X
#E – Cardinality of a set E
χ_E : X → {0, 1} – Characteristic function of a set E ⊆ X
(X, M) – Measurable space, where X is a nonempty set and M is a σ-algebra
a.e. – Almost everywhere
⟨X⟩ – Expectation of random variable X
EX – Expectation of random variable X
⟨X⟩_ψ – KN-average: expectation of random variable X with respect to a function ψ
⟨X⟩_q – q-expectation of random variable X
⟨⟨X⟩⟩_q – Normalized q-expectation of random variable X
ν ≪ µ – Measure ν is absolutely continuous w.r.t. measure µ
S – Shannon entropy functional
S_q – Tsallis entropy functional
S_α – Rényi entropy functional
Z – Partition function of maximum entropy distributions
Ẑ – Partition function of the minimum relative-entropy distribution
1 Prolegomenon
Abstract
This chapter serves as an introduction to the thesis. The purpose is to motivate the
discussion on generalized information measures and their maximum entropy prescriptions by introducing, in broad brush-strokes, a picture of information theory and its relation to statistical mechanics and statistics. It also contains a road-map of the thesis, which should serve as a reader’s guide.
Given our obsession to quantify – to put it formally, to find a way of assigning a real number to (that is, to measure) any phenomenon that we come across – it is natural to ask the following question: how would one measure ‘information’? The question was asked at the very beginning of this age of information sciences and technology, and a satisfactory answer was given. The theory of information was born . . . a ‘bandwagon’ . . . as Shannon (1956) himself called it.
“A key feature of Shannon’s information theory is the discovery that the colloquial
term information can often be given a mathematical meaning as a numerically measurable quantity, on the basis of a probabilistic model, in such a way that the solution of
many important problems of information storage and transmission can be formulated
in terms of this measure of the amount of information. This information measure has
a very concrete operational interpretation: roughly, it equals the minimum number of
binary digits needed, on the average, to encode the message in question. The coding
theorems of information theory provide such overwhelming evidence for the adequateness of Shannon’s information measure that to look for essentially different measures
of information might appear to make no sense at all. Moreover, it has been shown
by several authors, starting with Shannon (1948), that the measure of the amount of
information is uniquely determined by some rather natural postulates. Still, all the
evidence that Shannon’s information measure is the only possible one, is valid only
within the restricted scope of coding problems considered by Shannon. As Rényi
pointed out in his fundamental paper (Rényi, 1961) on generalized information measure, in other sorts of problems other quantities may serve just as well or even better
as measures of information. This should be indicated either by their operational significance (pragmatic approach) or by a set of natural postulates characterizing them
(axiomatic approach) or, preferably, by both.”
The above passage is quoted from a critical survey on information measures by
Csiszár (1974), which summarizes the significance of information measures and scope
of generalizing them. Now we shall see the details.
Information Measures and Generalizations
The central tenet of Shannon’s information theory is the construction of a measure of
“amount of information” inherent in a probability distribution. This construction is in
the form of a functional that returns a real number, which is interpreted as the amount of information of a probability distribution; hence the functional is known as an information measure. The underlying concept in this construction is that the amount of information is complementary to the amount of uncertainty, and the measure happens to be logarithmic.
The logarithmic form of information measure dates back to Hartley (1928), who
introduced the practical measure of information as the logarithm of the number of possible symbol sequences, where the events are considered to be equally
probable. It was Shannon (1948), and independently Wiener (1948), who introduced
a measure of information of a general finite probability distribution p with point masses p_1, . . ., p_n as
$S(p) = -\sum_{k=1}^{n} p_k \ln p_k.$
Owing to its similarity as a mathematical expression to Boltzmann entropy in thermodynamics, the term ‘entropy’ was adopted in the information sciences and is used synonymously with information measure. Shannon demonstrated enough nice properties of his entropy measure for it to be called a measure of information itself. One important property of Shannon entropy is additivity, i.e., for two independent distributions, the
entropy of the joint distribution is the sum of the entropies of the two distributions.
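To make this additivity concrete, here is a minimal Python sketch (the two example distributions are arbitrary, chosen purely for illustration) comparing the entropy of the joint distribution of two independent random variables with the sum of the individual entropies.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

# two arbitrary example distributions (for illustration only)
p = np.array([0.5, 0.3, 0.2])
r = np.array([0.6, 0.4])

# joint pmf of two independent random variables: outer product
joint = np.outer(p, r).ravel()

print(shannon_entropy(p) + shannon_entropy(r))  # sum of entropies
print(shannon_entropy(joint))                   # entropy of the joint
# the two printed values agree, illustrating additivity
```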
Today, information theory is considered to be a very fundamental field which intersects with physics (statistical mechanics), mathematics (probability theory), electrical
engineering (communication theory) and computer science (Kolmogorov complexity)
etc. (cf. Fig. 1.1, pp. 2, Cover & Thomas, 1991).
Now, let us examine an alternate interpretation of the Shannon entropy functional
that is important to study its mathematical properties and its generalizations. Let X be the underlying random variable, which takes values x_1, . . ., x_n; we use the notation p(x_k) = p_k, k = 1, . . ., n. Then, Shannon entropy can be written as the expectation of a function of X as follows. Define a function H which assigns to each value x_k that X takes the value − ln p(x_k) = − ln p_k, for k = 1, . . ., n. The quantity − ln p_k is known as the information associated with the single event x_k with probability p_k, also known as Hartley information (Aczél & Daróczy, 1975). From this one can infer that the Shannon entropy expression is an average of Hartley information. The interpretation of Shannon entropy as an average of the information associated with a single event is central to Rényi’s generalization.
Rényi entropies were introduced into mathematics by Alfred Rényi (1960). The
original motivation was strictly formal. The basic idea behind Rényi’s generalization
is that any putative candidate for an entropy should be a mean, and thereby he used a well-known idea in mathematics: the linear mean, though most widely used, is not the only possible way of averaging; one can define a mean with respect to an arbitrary function. Here one should be aware that, to define a ‘meaningful’ generalized mean, one has to restrict the choice of functions to continuous and monotone
functions (Hardy, Littlewood, & Pólya, 1934).
Following the above idea, once we replace the linear mean with generalized means,
we have a set of information measures each corresponding to a continuous and monotone function. Can we call every such entity an information measure? Rényi (1960)
postulated that an information measure should satisfy the additivity property, which Shannon entropy itself does. The important consequence of this constraint is that it restricts the choice of function in a generalized mean to linear and exponential functions: if we choose a linear function, we get back Shannon entropy; if we choose an exponential function, we obtain the well-known and much-studied generalization of Shannon entropy
$S_\alpha(p) = \frac{1}{1-\alpha} \ln \sum_{k=1}^{n} p_k^\alpha,$
where α is a parameter corresponding to an exponential function, which specifies the
generalized mean and is known as the entropic index. Rényi called them entropies of order α (α ≠ 1, α > 0); they include Shannon’s entropy in a limiting sense, namely,
in the limit α → 1, α-entropy retrieves Shannon entropy. For this reason, Shannon’s
entropy may be called entropy of order 1.
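This limiting behaviour can be checked numerically; the illustrative Python sketch below (with an arbitrary example pmf) evaluates the Rényi entropy for several values of α and compares values of α near 1 with the Shannon entropy.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

p = np.array([0.5, 0.3, 0.2])       # arbitrary example pmf
for alpha in (0.5, 0.999, 1.001, 2.0):
    print(alpha, renyi_entropy(p, alpha))
print('Shannon:', shannon_entropy(p))
# values for alpha close to 1 approach the Shannon entropy
```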
Rényi studied extensively these generalized entropy functionals in his various papers; one can refer to his book on probability theory (Rényi, 1970, Chapter 9) for a
summary of results.
While Rényi entropy is considered to be the first formal generalization of Shannon entropy, Havrda and Charvát (1967) observed that for operational purposes it seems more natural to consider the simpler expression $\sum_{k=1}^{n} p_k^\alpha$ as an information measure instead of Rényi entropy (up to a constant factor). Characteristics of this information measure were studied by Daróczy (1970) and Forte and Ng (1973), and it is shown that this quantity permits simpler postulational characterizations (for a summary of the discussion see (Csiszár, 1974)).
While generalized information measures, after Rényi’s work, continued to be of
interest to many mathematicians, it was in 1988 that they came to attention in Physics
when Tsallis reinvented the above mentioned Havrda and Charvát entropy (up to a
constant factor), and specified it in the form (Tsallis, 1988)
$S_q(p) = \frac{1 - \sum_k p_k^q}{q-1}.$
Though this expression looks somewhat similar to the Rényi entropy and retrieves
Shannon entropy in the limit q → 1, Tsallis entropy has the remarkable, albeit not
yet understood, property that in the case of independent experiments, it is not additive. Hence, statistical formalism based on Tsallis entropy is also termed nonextensive
statistics.
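The pseudo-additivity can be verified numerically; the sketch below (arbitrary example pmfs and an arbitrary entropic index q = 1.5, chosen only for illustration) checks that, for an independent joint distribution, S_q equals S_q(A) + S_q(B) + (1 − q) S_q(A) S_q(B).

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_k p_k^q) / (q - 1), for q != 1."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

q = 1.5                           # arbitrary entropic index for illustration
p = np.array([0.5, 0.3, 0.2])     # arbitrary example pmfs
r = np.array([0.6, 0.4])
joint = np.outer(p, r).ravel()    # joint pmf of independent variables

lhs = tsallis_entropy(joint, q)
rhs = (tsallis_entropy(p, q) + tsallis_entropy(r, q)
       + (1 - q) * tsallis_entropy(p, q) * tsallis_entropy(r, q))
print(lhs, rhs)   # the two agree: pseudo-additivity instead of additivity
```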
Next, we discuss what information measures have to do with statistics.
Information Theory and Statistics
Probabilities are unobservable quantities in the sense that one cannot determine their values for a random experiment simply by inspecting whether
the events do, in fact, occur or not. Assessing the probability of the occurrence of some
event or of the truth of some hypothesis is the important question one runs up against
in any application of probability theory to the problems of science or practical life.
Although the mathematical formalism of probability theory serves as a powerful tool
when analyzing such problems, it cannot, by itself, answer this question. Indeed, the
formalism is silent on this issue, since its goal is just to provide theorems valid for
all probability assignments allowed by its axioms. Hence, recourse is necessary to
an additional rule which tells us in which case one ought to assign which values to
probabilities.
In 1957, Jaynes proposed a rule to assign numerical values to probabilities in circumstances where certain partial information is available. Jaynes showed, in particular, how this rule, when applied to statistical mechanics, leads to the usual canonical
distributions in an extremely simple fashion. The concept he used was ‘maximum
entropy’.
With his maximum entropy principle, Jaynes re-derived Gibbs-Boltzmann statistical mechanics à la information theory in his two papers (Jaynes, 1957a, 1957b). This
principle states that the states of thermodynamic equilibrium are states of maximum
entropy. Formally, let p1 , . . . , pn be the probabilities that a particle in a system has
energies E1 , . . . , En respectively, then well known Gibbs-Boltzmann distribution
e−βEk
Z
pk =
k = 1, . . . , n,
P
can be deduced from maximizing the Shannon entropy functional − nk=1 pk ln pk
P
with respect to the constraint of known expected energy nk=1 pk Ek = U along with
P
the normalizing constraint nk=1 pk = 1. Z is called the partition function and can be
specified as
Z=
n
X
e−βEk .
k=1
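As a small numerical illustration (the energy levels and inverse temperature below are hypothetical, not taken from the thesis), the following Python sketch evaluates the partition function and the resulting Gibbs-Boltzmann distribution.

```python
import numpy as np

# hypothetical energy levels and inverse temperature, for illustration only
E = np.array([0.0, 1.0, 2.0, 3.0])
beta = 0.7

Z = np.sum(np.exp(-beta * E))        # partition function
p = np.exp(-beta * E) / Z            # Gibbs-Boltzmann distribution

print(p, p.sum())                    # a valid pmf (sums to one)
print(np.sum(p * E))                 # the expected energy U fixed by beta
```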
Though use of maximum entropy has its historical roots in physics (e.g., Elsasser,
1937) and economics (e.g., Davis, 1941), later on, Jaynes showed that a general method
of statistical inference could be built upon this rule, which subsumes the techniques of
statistical mechanics as a mere special case. The principle of maximum entropy states
that, of all the distributions p that satisfy the constraints, one should choose the distribution with the largest entropy. In the above formulation of the Gibbs-Boltzmann distribution
one can view the mean energy constraint and normalizing constraints as the only available information. Also, this principle is a natural extension of Laplace’s famous principle of insufficient reason, which postulates that the uniform distribution is the most
satisfactory representation of our knowledge when we know nothing about the random
variate except that each probability is nonnegative and the sum of the probabilities is
unity; it is easy to show that Shannon entropy is maximum for uniform distribution.
The maximum entropy principle is used in many fields, ranging from physics (for
example, Bose-Einstein and Fermi-Dirac statistics can be viewed as though they are derived from the maximum entropy principle) and chemistry to image reconstruction and stock market analysis, and, more recently, machine learning.
While Jaynes was developing his maximum entropy principle for statistical inference problems, a more general principle was proposed by Kullback (1959, pp. 37), which is known as the minimum entropy principle. This principle comes into the picture in problems where inductive inference is used to update from a prior probability distribution to a posterior distribution whenever new information becomes available. This
principle states that, given a prior distribution r, of all the distributions p that satisfy the constraints, one should choose the distribution with the least Kullback-Leibler
relative-entropy
$I(p\|r) = \sum_{k=1}^{n} p_k \ln \frac{p_k}{r_k}.$
Minimizing relative-entropy is equivalent to maximizing entropy when the prior is a
uniform distribution. This principle laid the foundations for an information-theoretic approach to statistics (Kullback, 1959) and plays an important role in certain geometrical approaches to statistical inference (Amari, 1985).
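This equivalence can be seen numerically; the illustrative sketch below (with an arbitrary example pmf) verifies the identity I(p‖u) = ln n − S(p) for the uniform prior u, so minimizing the relative-entropy to a uniform prior is the same as maximizing the Shannon entropy.

```python
import numpy as np

def kl_divergence(p, r):
    """I(p || r) = sum_k p_k ln(p_k / r_k), with the convention 0 ln 0 = 0."""
    p, r = np.asarray(p, float), np.asarray(r, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / r[nz]))

def shannon_entropy(p):
    p = np.asarray(p, float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

p = np.array([0.5, 0.3, 0.2])      # arbitrary example pmf
u = np.full(3, 1.0 / 3.0)          # uniform prior

print(kl_divergence(p, u), np.log(3) - shannon_entropy(p))   # identical values
```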
The maximum entropy principle together with the minimum entropy principle is referred to as the ME-principle, and the inference based on these principles is collectively known as ME-methods. Papers by Shore and Johnson (1980) and by Tikochinsky, Tishby, and Levine (1984) paved the way for strong theoretical justification for using ME-methods in inference problems. A more general view of ME fundamentals is reported by Harremoës and Topsøe (2001).
Before we move on we briefly explain the relation between ME and inference
methods using the well-known Bayes’ theorem. The choice between these two updating methods is dictated by the nature of the information being processed. When we
want to update our beliefs about the value of certain quantities θ on the basis of information about the observed values of other quantities x - the data - we must use Bayes’
theorem. If the prior beliefs are given by p(θ), the updated or posterior distribution is
p(θ|x) ∝ p(θ)p(x|θ). Being a consequence of the product rule for probabilities, the
Bayesian method of updating is limited to situations where it makes sense to define
the joint probability of x and θ. The ME-method, on the other hand, is designed for
updating from a prior probability distribution to a posterior distribution when the information to be processed is testable information, i.e., it takes the form of constraints on
the family of acceptable posterior distributions. In general, it makes no sense to process testable information using Bayes’ theorem, and conversely, neither does it make
sense to process data using ME. However, in those special cases when the same piece
of information can be both interpreted as data and as constraint then both methods can
be used and they agree. For more details on ME and Bayes’ approach one can refer to
(Caticha & Preuss, 2004; Grendár jr & Grendár, 2001).
An excellent review of ME-principle and consistency arguments can be found in
the papers by Uffink (1995, 1996) and by Skilling (1984). This subject is dealt with in
applications in the book of Kapur and Kesavan (1997).
Power-law Distributions
Despite the great success of the standard ME-principle, it is a well-known fact that there are many relevant probability distributions in nature which are not easily derivable from the Jaynes-Shannon prescription: power-law distributions constitute an interesting example. If one sticks to the standard logarithmic entropy, ‘awkward constraints’ are needed in order to obtain power-law type distributions (Tsallis et al., 1995). Does Jaynes’ ME-principle suggest in a natural way the possibility of incorporating alternative entropy functionals into the variational principle? It seems that if one replaces Shannon entropy with its generalization, ME-prescriptions ‘naturally’ result in power-law distributions.
Power-law distributions can be obtained by optimizing Tsallis entropy under appropriate constraints. The distribution thus obtained is termed the q-exponential distribution. The associated q-exponential function of x is $e_q(x) = [1 + (1-q)x]_{+}^{\frac{1}{1-q}}$, with the notation [a]_+ = max{0, a}, and it converges to the ordinary exponential function in the limit q → 1. Hence the formalism of Tsallis offers continuity between the Boltzmann-Gibbs distribution and power-law distributions, governed by the nonextensive parameter q. The Boltzmann-Gibbs distribution is a special case of the power-law distribution of the Tsallis prescription; as we set q → 1, we recover the exponential.
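A minimal Python sketch of the q-exponential (the test point and q values are chosen arbitrarily) illustrates the convergence to the ordinary exponential as q → 1.

```python
import numpy as np

def q_exponential(x, q):
    """q-exponential e_q(x) = [1 + (1-q)x]_+^(1/(1-q)); ordinary exp as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return np.exp(x)
    base = np.maximum(0.0, 1.0 + (1.0 - q) * x)
    return base ** (1.0 / (1.0 - q))

x = -2.0                       # arbitrary test point
for q in (0.5, 0.9, 0.999, 1.5):
    print(q, q_exponential(x, q))
print('exp:', np.exp(x))       # values for q near 1 approach exp(x)
# for q > 1, e_q(-|x|) decays as a power law in |x| rather than exponentially
```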
Here, we take up an important real-world example where the significance of power-law distributions can be demonstrated.
The importance of power-law distributions in the domain of computer science was first precipitated in 1999 in the study of the connectedness of the World Wide Web (WWW). Using a Web crawler, Barabási and Albert (1999) mapped the connectedness of the Web. To their surprise, the Web did not have an even distribution of connectivity (so-called “random connectivity”). Instead, a few network nodes (called “hubs”) were far more connected than other nodes. In general, they found that the probability p(k) that a node in the network connects with k other nodes was, in a given network, proportional to k^{-γ}, where the degree exponent γ is not universal and depends on the details of the network structure. A pictorial depiction of random networks and scale-free networks is given in Figure 1.1.
Here we wish to point out that, using the q-exponential function, p(k) can be rewritten as p(k) = e_q(−k/κ), where q = 1 + 1/γ and κ = (q − 1)k_0. This implies that the Barabási-Albert solution optimizes the Tsallis entropy (Abe & Suzuki, 2004).
One more interesting example is the distribution of scientific articles in journals
(Naranan, 1970). If the journals are divided into groups, each containing the same number of articles on a given subject, then the numbers of journals in the succeeding groups form a geometrical progression.

[Figure 1.1: Structure of Random and Scale-Free Networks]
The Tsallis nonextensive formalism has been applied to analyze various phenomena which exhibit power-laws, for example stock markets (Queirós et al., 2005), citations of scientific papers (Tsallis & de Albuquerque, 2000), the scale-free network of earthquakes (Abe & Suzuki, 2004), models of network packet traffic (Karmeshu & Sharma, 2006), etc. To a great extent, the success of Tsallis’ proposal is attributed to the ubiquity of power-law distributions in nature.
Information Measures on Continuum
Until now we have considered information measures in the discrete case, where the
number of configurations is finite. Is it possible to extend the definitions of information
measures to non-discrete cases, or to even more general cases? For example can we
write Shannon entropy in the continuous case, naively, as
$S(p) = -\int p(x) \ln p(x) \, dx$
for a probability density p(x)? It turns out that in the above continuous case, the entropy
functional poses a formidable problem if one interprets it as an information measure.
Information measures extended to abstract spaces are important not only for mathematical reasons; the resultant generality and rigor could also prove important for eventual applications. Even in communication problems, discrete memoryless sources and
channels are not always adequate models for real-world signal sources or communication and storage media. Metric spaces of functions, vectors and sequences as well as
random fields naturally arise as models of source and channel outcomes (Cover, Gacs,
& Gray, 1989). The by-products of general rigorous definitions have the potential for
proving useful new properties, for providing insight into their behavior and for finding
formulas for computing such measures for specific processes.
Immediately after Shannon published his ideas, the problem of extending the definitions of information measures to abstract spaces was addressed by well-known mathematicians of the time, Kolmogorov (1956, 1957) (for an excellent review on Kolmogorov’s contributions to information theory see (Cover et al., 1989)), Dobrushin
(1959), Gelfand (1956, 1959), Kullback (Kullback, 1959), Pinsker (1960a, 1960b),
Yaglom (1956, 1959), Perez (1959), Rényi (1960), Kallianpur (1960), etc.
We now examine why extending the Shannon entropy to the non-discrete case is
a nontrivial problem. Firstly, probability densities mostly carry a physical dimension
(say, probability per length), which gives the entropy functional the unit of ‘ln cm’, which seems somewhat odd. Also, in contrast to its discrete-case counterpart, this expression is not invariant under a reparametrization of the domain, e.g., by a change of unit. Further, S may now become negative, and it is not bounded from above or below, so that new problems of definition appear, cf. (Hardy et al., 1934, pp. 126).
These problems are clarified if one considers how to construct an entropy for a
continuous probability distribution starting from the discrete case. A natural approach
is to consider the limit of the finite discrete entropies corresponding to a sequence of
finite partitions of an interval (on which entropy is defined) whose norms tend to zero.
Unfortunately, this approach does not work, because this limit is infinite for all continuous probability distributions. Such divergence is also obtained–and explained–if
one adopts the well-known interpretation of the Shannon entropy as the least expected
number of yes/no questions needed to identify the value of x, since in general it takes
an infinite number of such questions to identify a point in the continuum (of course,
this interpretation supposes that the logarithm in entropy functional has base 2).
To overcome the problems posed by the definition of entropy functional in continuum, the solution suggested was to consider the expression in discrete case (cf.
Gelfand et al., 1956; Kolmogorov, 1957; Kullback, 1959)
$S(p|\mu) = -\sum_{k=1}^{n} p(x_k) \ln \frac{p(x_k)}{\mu(x_k)},$
where µ(x_k) are positive weights determined by some ‘background measure’ µ. Note that the above entropy functional S(p|µ) is the negative of the Kullback-Leibler relative-entropy or KL-entropy when we consider that the µ(x_k) are positive and sum to one. Now, one can show that this entropy functional, which is defined in terms of KL-entropy, has a natural extension to the continuous case (Topsøe, 2001, Theorem 5.2). This is because, if one now partitions the real line into increasingly finer
subsets, the probabilities corresponding to p and the background weights corresponding to µ are both split simultaneously and the logarithm of their ratio will generally
not diverge.
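This behaviour is easy to see numerically. The illustrative sketch below (using the example density p(x) = 2x on [0, 1], an assumption made only for this demonstration) discretizes the density with finer and finer partitions: the plain discrete entropy grows without bound, while the KL-entropy with respect to a uniform background measure converges (and S(p|µ) is simply its negative).

```python
import numpy as np

density = lambda x: 2.0 * x            # example density p(x) = 2x on [0, 1]

for bins in (10, 100, 1000, 10000):
    edges = np.linspace(0.0, 1.0, bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = density(centers) / bins        # cell probabilities (midpoint rule)
    p = p / p.sum()                    # renormalize the discretization error away
    mu = np.full(bins, 1.0 / bins)     # uniform background weights
    discrete_entropy = -np.sum(p * np.log(p))
    kl_to_background = np.sum(p * np.log(p / mu))
    print(bins, discrete_entropy, kl_to_background)
# discrete_entropy keeps growing like ln(bins); kl_to_background approaches
# the continuum value  integral of 2x ln(2x) dx = ln 2 - 1/2 (about 0.193)
```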
This is how KL-entropy plays an important role in the definitions of information measures extended to the continuum. Based on the above ideas one can define information measures on a measure space (X, M, µ); µ is exactly the same as that appearing in the above definition of the entropy functional S(p|µ) in the discrete case. The entropy functionals in both the discrete and continuous cases can be retrieved by appropriately choosing the reference measure µ. Such a definition of information measures on measure spaces can be used in ME-prescriptions, which are consistent with the prescriptions obtained when their discrete counterparts are used.
One can find the continuum and measure-theoretic aspects of entropy functionals
in the information theory text of Guiaşu (1977). A concise and very good discussion
on ME-prescriptions of continuous entropy functionals can be found in (Uffink, 1995).
What is this thesis about?
One can see from the above discussions that the two generalizations of Shannon entropy, Rényi and Tsallis, originated or developed from different fields. Though Rényi’s
generalization originated in information theory, it has been studied in statistical mechanics (e.g., Bashkirov, 2004) and statistics (e.g., Morales et al., 2004). Similarly,
the Tsallis generalization was mainly studied in statistical mechanics when it was proposed, but now the Shannon-Khinchin axioms have been extended to Tsallis entropy (Suyari, 2004a) and it has been applied to statistical inference problems (e.g., Tsallis, 1998). This
elicits no surprise because from the above discussion one can see that information
theory is naturally connected to statistical mechanics and statistics.
The study of the mathematical properties and applications of generalized information measures and, further, new formulations of the maximum entropy principle
based on these generalized information measures constitute a currently growing field
of research. It is in this line of inquiry that this thesis presents some results pertaining to mathematical properties of generalized information measures and their ME-prescriptions, including the results related to measure-theoretic formulations of the
same.
Finally, note that Rényi and Tsallis generalizations can be ‘naturally’ applied to
Kullback-Leibler relative entropy to define generalized relative-entropy measures, which
are extensively studied in the literature. Indeed, the major results that we present in
this thesis are related to these generalized relative-entropies.
1.1 Summary of Results
Here we give a brief summary of the main results presented in this thesis. Broadly, results presented in this thesis can be divided into those related to information measures
and their ME-prescriptions.
Generalized Means, Rényi’s Recipe and Information Measures
One can view Rényi’s formalism as a tool, which can be used to generalize information measures and thereby characterize them using axioms of Kolmogorov-Nagumo
averages (KN-averages). For example, one can apply Rényi’s recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with KN-averages and
thereby impose the constraint of pseudo-additivity. A natural question that arises is: what
are the various pseudo-additive information measures that one can prepare with this
recipe? In this thesis we prove that only Tsallis entropy is possible in this case, using
which we characterize Tsallis entropy based on axioms of KN-averages.
Generalized Information Measures in Abstract Spaces
Owing to the probabilistic settings for information theory, it is natural that more general definitions of information measures can be given on measure spaces. In this thesis
we develop measure-theoretic formulations for generalized information measures and
present some related results.
One can give measure-theoretic definitions for Rényi and Tsallis entropies along
similar lines as Shannon entropy. One can also show that, as is the case with Shannon
entropy, these measure-theoretic definitions are not natural extensions of their discrete
analogs. In this context we present two results: (i) we prove that, as in the case of
classical ‘relative-entropy’, generalized relative-entropies, whether Rényi or Tsallis,
can be extended naturally to the measure-theoretic case, and (ii) we show that ME-prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case.
Another important result that we present in this thesis is the Gelfand-Yaglom-Perez
(GYP) theorem for Rényi relative-entropy, which can be easily extended to Tsallis
relative-entropy. GYP-theorem for Kullback-Leibler relative-entropy is a fundamental
theorem which plays an important role in extending discrete case definitions of various
classical information measures to the measure-theoretic case. It also provides a means
to compute relative-entropy and study its behavior.
Tsallis Relative-Entropy Minimization
Unlike the generalized entropy measures, ME of generalized relative-entropies is not
much addressed in the literature. In this thesis we study Tsallis relative-entropy minimization in detail.
We study the properties of Tsallis relative-entropy minimization and present some
differences with the classical case. In the representation of such a minimum relative-entropy distribution, we highlight the use of the q-product, an operator that has been
recently introduced to derive the mathematical structure behind Tsallis statistics.
Nonextensive Pythagoras’ Theorem
It is a common practice in mathematics to employ geometric ideas in order to obtain
additional insights or new methods even in problems which do not involve geometry
intrinsically. Maximum and minimum entropy methods are no exception.
Kullback-Leibler relative-entropy, in cases involving distributions resulting from
relative-entropy minimization, has a celebrated property reminiscent of squared Euclidean distance: it satisfies an analog of Pythagoras’ theorem. Hence, this property is referred to as Pythagoras’ theorem of relative-entropy minimization or triangle
equality, and plays a fundamental role in geometrical approaches to statistical estimation theory like information geometry. We state and prove the equivalent of Pythagoras’ theorem in the nonextensive case.
Power-law Distributions in EAs
Recently, power-law distributions have been used in a generalized simulated annealing algorithm, which is claimed to perform better than classical simulated annealing. In this thesis we demonstrate the use of power-law distributions in evolutionary algorithms (EAs). The proposed algorithm uses the Tsallis generalized canonical distribution, which is a one-parameter generalization of the Boltzmann distribution, to weigh the configurations in the selection
mechanism. We provide some simulation results in this regard.
1.2 Essentials
This section details some heuristic explanations for the logarithmic nature of Hartley and Shannon entropies. We also discuss some notations and why the concept of
“maximum entropy” is important.
1.2.1 What is Entropy?
The logarithmic nature of Hartley and Shannon information measures, and their additivity properties can be explained by heuristic arguments. Here we give one such
explanation (Rényi, 1960).
To characterize an element of a set of size n we need log_2 n units of information, where a unit is a bit. The important feature of the logarithmic information measure is its additivity: if a set E is a disjoint union of m n-element sets E_1, . . ., E_m, then we can specify an element of this mn-element set E in two steps: first we need log_2 m bits of information to describe which of the sets E_1, . . ., E_m, say E_k, contains the element, and we need log_2 n further bits of information to tell which element of this set E_k is the considered one. The information needed to characterize an element of E is the ‘sum’ of the two partial informations. Indeed, log_2 nm = log_2 n + log_2 m.
The next step is due to Shannon (1948). He pointed out that Hartley’s formula is valid only if the elements of E are equiprobable; if their probabilities are not equal, the situation changes and we arrive at the formula (2.15). If all the probabilities are equal to 1/n, Shannon’s formula (2.15) reduces to Hartley’s formula: S(p) = log_2 n.
Shannon’s formula has the following heuristic motivation. Let E be the disjoint union of the sets E_1, . . ., E_n having N_1, . . ., N_n elements respectively ($\sum_{k=1}^{n} N_k = N$). Let us suppose that we are interested only in knowing the subset E_k to which a given element of E belongs. Suppose that the elements of E are equiprobable. The information characterizing an element of E consists of two parts: the first specifies the subset E_k containing this particular element and the second locates it within E_k. The amount of the second piece of information is log_2 N_k (by Hartley’s formula), thus it depends on the index k. To specify an element of E we need log_2 N bits of information, and as we have seen it is composed of the information specifying E_k – its amount will be denoted by H_k – and of the information within E_k. According to the principle of additivity, we have log_2 N = H_k + log_2 N_k, or H_k = log_2(N/N_k). It is plausible to define the information needed to identify the subset E_k to which the considered element belongs as the weighted average of the informations H_k, where the weights are the probabilities that the element belongs to the E_k’s. Thus,
$S = \sum_{k=1}^{n} \frac{N_k}{N} H_k,$
from which we obtain the Shannon entropy expression using the above interpretation of H_k = log_2(N/N_k) and the notation p_k = N_k/N.
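The identity can be checked with a toy calculation (the subset sizes N_k below are hypothetical, chosen only for illustration):

```python
import math

N_k = [5, 3, 2]                 # hypothetical counts of the subsets; N = 10
N = sum(N_k)

weighted_avg = sum(n / N * math.log2(N / n) for n in N_k)   # sum_k (N_k/N) H_k
shannon_bits = -sum(n / N * math.log2(n / N) for n in N_k)  # -sum_k p_k log2 p_k

print(weighted_avg, shannon_bits)   # identical, since p_k = N_k / N
```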
Now we note one more important idea behind the Shannon entropy. We frequently
come across Shannon entropy being treated as both a measure of uncertainty and of
information. How is this rendered possible?
If X is the underlying random variable, then S(p) is also written as S(X) though it
does not depend on the actual values of X. With this, one can say that S(X) quantifies
how much information we gain, on average, when we learn the value of X. An
alternative view is that the entropy of X measures the amount of uncertainty about
X before we learn its value. These two views are complementary; we can either view
entropy as a measure of our uncertainty before we learn the value of X, or as a measure
of how much information we have gained after we learn the value of X.
Following this, one can see that Shannon entropy for the most ‘certain’ distribution (0, . . ., 1, . . ., 0) returns the value 0, and for the most ‘uncertain’ distribution (1/n, . . ., 1/n) returns the value ln n. Further, one can show the inequality
0 ≤ S(p) ≤ ln n ,
for any probability distribution p. The inequality S(p) ≥ 0 is easy to verify. Let us
prove that for any probability distribution p = (p_1, . . ., p_n) we have
$S(p) = S(p_1, \ldots, p_n) \le S\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \ln n. \quad (1.1)$
Here, we shall see the proof. I One way of showing this property is by using the
Jensen inequality for real-valued continuous functions. Let f(x) be a real-valued continuous concave function defined on the interval [a, b]. Then for any x_1, . . ., x_n ∈ [a, b] and any set of non-negative real numbers λ_1, . . ., λ_n such that $\sum_{k=1}^{n} \lambda_k = 1$, we have
$\sum_{k=1}^{n} \lambda_k f(x_k) \le f\left(\sum_{k=1}^{n} \lambda_k x_k\right). \quad (1.2)$
For convex functions the reverse inequality is true. Setting a = 0, b = 1, x_k = p_k, λ_k = 1/n and f(x) = −x ln x, we obtain
$-\frac{1}{n}\sum_{k=1}^{n} p_k \ln p_k \le -\left(\frac{1}{n}\sum_{k=1}^{n} p_k\right) \ln\left(\frac{1}{n}\sum_{k=1}^{n} p_k\right),$
and hence the result.
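A quick numerical check of the bound (random pmfs drawn from a Dirichlet distribution, an arbitrary choice made only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
bound = np.log(n)

# sample random pmfs and check the bound 0 <= S(p) <= ln n numerically
for _ in range(5):
    p = rng.dirichlet(np.ones(n))              # a random probability vector
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
    print(entropy <= bound + 1e-12, entropy, bound)

uniform = np.full(n, 1.0 / n)
print(-np.sum(uniform * np.log(uniform)), bound)   # equality at the uniform pmf
```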
Alternatively, one can use Lagrange’s method to maximize entropy subject to the normalization condition of a probability distribution, $\sum_{k=1}^{n} p_k = 1$. In this case the Lagrangian is
$L \equiv -\sum_{k=1}^{n} p_k \ln p_k - \lambda \left( \sum_{k=1}^{n} p_k - 1 \right).$
Differentiating with respect to p_1, . . ., p_n, we get
$-(1 + \ln p_k) - \lambda = 0, \quad k = 1, \ldots, n,$
which gives
$p_1 = p_2 = \ldots = p_n = \frac{1}{n}. \quad (1.3)$
The Hessian matrix of the Lagrangian is the diagonal matrix
$\mathrm{diag}\left(-\frac{1}{p_1}, \ldots, -\frac{1}{p_n}\right),$
which is always negative definite, so that the values from (1.3) determine a maximum
value, which, because of the concavity property, is also the global maximum value.
Hence the result. J
1.2.2 Why to maximize entropy?
Consider a random variable X. Let the possible values X takes be x_1, . . ., x_n, that possibly represent the outcomes of an experiment, states of a physical system, or just labels of various propositions. The probability with which the event x_k is selected is denoted by p_k, for k = 1, . . ., n. Our problem is to assign probabilities p_1, . . ., p_n.
Laplace’s principle of insufficient reason is the simplest rule that can be used when
we do not have any information about a random experiment. It states that whenever we
have no reason to believe that one case rather than any other is realized, or, as is also
put, in case all values of X are judged to be ‘equally possible’, then their probabilities
are equal, i.e.,
$p_k = \frac{1}{n}, \quad k = 1, \ldots, n.$
We can restate the principle as: the uniform distribution is the most satisfactory representation of our knowledge when we know nothing about the random variate except that each probability is nonnegative and the sum of the probabilities is unity. This rule, of course, refers to the meaning of the concept of probability, and is therefore subject to debate and controversy. We will not discuss this here; one can refer to (Uffink,
1995) for a list of objections to this principle reported in the literature.
Now having the Shannon entropy as a measure of uncertainty (information), can
we generalize the principle of insufficient reason and say that with the available information, we can always choose the distribution which maximizes the Shannon entropy?
This is what is known as Jaynes’ maximum entropy principle, which states that, of all the probability distributions that satisfy the given constraints, one should choose the distribution which maximizes Shannon entropy. That is, if our state of knowledge is appropriately represented by a set of expectation values, then the “best”, least unbiased probability distribution is the one that (i) reflects just what we know, without “inventing” unavailable pieces of knowledge, and, additionally, (ii) maximizes ignorance: the truth, all
the truth, nothing but the truth. This is the rationale behind the maximum entropy
principle.
Now we shall examine this principle in detail. Let us assume that some information
about the random variable X is given which can be modeled as a constraint on the set
of all possible probability distributions. It is assumed that this constraint exhaustively
specifies all relevant information about X. The principle of maximum entropy is then
the prescription to choose that probability distribution p for which the Shannon entropy
is maximal under the given constraint.
Here we take a simple and often studied type of constraint, i.e., the case where the expectation of X is given. Say we have the constraint
$\sum_{k=1}^{n} x_k p_k = U,$
where U is the expectation of X. Now, to maximize Shannon entropy with respect to the above constraint, together with the normalizing constraint $\sum_{k=1}^{n} p_k = 1$, the Lagrangian can be written as
$L \equiv -\sum_{k=1}^{n} p_k \ln p_k - \lambda \left( \sum_{k=1}^{n} p_k - 1 \right) - \beta \left( \sum_{k=1}^{n} x_k p_k - U \right).$
Setting the derivatives of the Lagrangian with respect to p_1, . . ., p_n equal to zero, we get
$\ln p_k = -(1 + \lambda) - \beta x_k.$
The Lagrange parameter λ can be specified by the normalizing constraint. Finally,
the maximum entropy distribution can be written as
$p_k = \frac{e^{-\beta x_k}}{\sum_{k=1}^{n} e^{-\beta x_k}},$
where the parameter β is determined by the expectation constraint.
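To make the determination of β concrete, the following illustrative Python sketch (hypothetical outcomes x_k and a hypothetical prescribed expectation U, not taken from the thesis) solves the expectation constraint numerically by bisection.

```python
import numpy as np

def maxent_distribution(x, beta):
    """Maximum entropy pmf with p_k proportional to exp(-beta * x_k)."""
    w = np.exp(-beta * x)
    return w / w.sum()

def solve_beta(x, U, lo=-50.0, hi=50.0, tol=1e-10):
    """Find beta with sum_k p_k x_k = U by bisection.

    Assumes min(x) < U < max(x); the expectation decreases as beta increases.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if maxent_distribution(x, mid) @ x > U:
            lo = mid          # expectation too large, so increase beta
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

x = np.arange(1, 7, dtype=float)   # hypothetical outcomes, e.g. faces of a die
U = 4.5                            # hypothetical prescribed expectation
beta = solve_beta(x, U)
p = maxent_distribution(x, beta)
print(beta, p, p @ x)              # p @ x reproduces U
```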
Note that one can extend this method to more than one constraint specified with
respect to some arbitrary functions; for details see (Kapur & Kesavan, 1997).
The maximum entropy principle subsumes the principle of insufficient reason. Indeed, in the absence of reasons, i.e., in the case where none or only trivial constraints
are imposed on the probability distribution, its entropy S(p) is maximal when all probabilities are equal. However, as a generalization of the principle of insufficient reason, the maximum entropy principle inherits all the objections associated with its infamous predecessor. Interestingly, it does cope with some of these objections; for details see (Uffink,
1995).
Note that calculating the Lagrange parameters in maximum entropy methods is
a non-trivial task and the same holds for calculating maximum entropy distributions.
Various techniques to calculate maximum entropy distributions can be found in (Agmon et al., 1979; Mead & Papanicolaou, 1984; Ormoneit & White, 1999; Wu, 2003).
Maximum entropy principle can be used for a wide variety of problems. The book
by Kapur and Kesavan (1997) gives an excellent account of maximum entropy methods
with emphasis on various applications.
1.3 A reader’s guide to the thesis
Notation and Delimiters
The commonly used notation in the thesis is given in the beginning of the chapters.
When we write down the proofs of some results which are not specified in the
Theorem/Lemma environment, we denote the beginning and ending of proofs by I
and J respectively. Otherwise, the ends of proofs that are part of the above environments are identified by the usual end-of-proof mark. Some additional explanations within the results are included in the footnotes.
To avoid proliferation of symbols we use the same notation for different concepts
if this does not cause ambiguity; the correspondence should be clear from the context. For example, whether it is a maximum entropy distribution or a minimum relative-entropy distribution, we use the same symbols for the Lagrange multipliers.
Roadmap
Apart from this chapter this thesis contains five other chapters. We now briefly outline
a summary of each chapter.
In Chapter 2, we present a brief introduction of generalized information measures
and their properties. We discuss how generalized means play a role in the information
measures and present a result related to generalized means and Tsallis generalization.
In Chapter 3, we discuss various aspects of information measures defined on
measure spaces. We present measure-theoretic definitions for generalized information
measures and present important results.
In Chapter 4, we discuss the geometrical aspects of relative-entropy minimization
and present an important result for Tsallis relative-entropy minimization.
In Chapter 5, we apply power-law distributions to the selection mechanism in evolutionary algorithms and test them by simulations.
Finally, in Chapter 6, we summarize the contributions of this thesis, and discuss
possible future directions.
2 KN-averages and Entropies: Rényi’s Recipe
Abstract
This chapter builds the background for this thesis and introduces Rényi and Tsallis
(nonextensive) generalizations of classical information measures. It also presents
a significant result on the relation between Kolmogorov-Nagumo averages and the nonextensive generalization, which can also be found in (Dukkipati, Murty, & Bhatnagar,
2006b).
In recent years, interest in generalized information measures has increased dramatically after the introduction of nonextensive entropy in physics by Tsallis (1988) (first defined by Havrda and Charvát (1967)), and these measures have been studied extensively in information theory and statistics. One can get this nonextensive entropy or Tsallis entropy by generalizing the information of a single event in the definition of Shannon entropy, where the logarithm is replaced with the q-logarithm (defined as $\ln_q x = \frac{x^{1-q}-1}{1-q}$). The term
‘nonextensive’ is used because it does not satisfy the additivity property – a characteristic property of Shannon entropy – instead, it satisfies pseudo-additivity of the form
x ⊕q y = x + y + (1 − q)xy.
Indeed, the starting point of the theory of generalized measures of information
is due to Rényi (1960, 1961), who introduced α-entropy or Rényi entropy, the first
formal generalization of Shannon entropy. Replacing linear averaging in Shannon
entropy, which can be interpreted as an average of information of a single event, with
Kolmogorov-Nagumo averages (KN-averages) of the form $\langle x \rangle_\psi = \psi^{-1}\left(\sum_k p_k \psi(x_k)\right)$, where ψ is an arbitrary continuous and strictly monotone function, and further imposing the additivity constraint – a characteristic property of the underlying information of a
single event – leads to Rényi entropy. Using this recipe of Rényi, one can prepare only
two information measures: Shannon and Rényi entropy. By means of this formalism,
Rényi characterized these additive entropies in terms of axioms of KN-averages.
One can view Rényi’s formalism as a tool, which can be used to generalize information measures and thereby characterize them using axioms of KN-averages. For
example, one can apply Rényi’s recipe in the nonextensive case by replacing the linear
19
averages in Tsallis entropy with KN-averages and thereby imposing the constraint of
pseudo-additivity. A natural question that arises is what are the pseudo-additive information measures that one can prepare with this recipe? We prove that Tsallis entropy is
the only possible measure in this case, which allows us to characterize Tsallis entropy
using axioms of KN-averages.
As one can see from the above discussion, Hartley information measure (Hartley,
1928) of a single stochastic event plays a fundamental role in the Rényi and Tsallis
generalizations. Generalization of Rényi involves the generalization of linear average
in Shannon entropy, where as, in the case of Tsallis, it is the generalization of the
Hartley function; while Rényi’s is considered to be the additive generalization, Tsallis is non-additive. These generalizations can be extended to Kullback-Leibler (KL)
relative-entropy too; indeed, many results presented in this thesis are related to generalized relative entropies.
First, we discuss the important properties of classical information measures, Shannon and KL, in § 2.1. We discuss Rényi’s generalization in § 2.2, where we discuss
the Hartley function and properties of quasilinear means. Nonextensive generalization
of Shannon entropy and relative-entropy is presented in detail in § 2.3. Results on the
uniqueness of Tsallis entropy under Rényi’s recipe and characterization of nonextensive information measures are presented in § 2.4 and § 2.5 respectively.
2.1 Classical Information Measures
In this section, we discuss the properties of two important classical information measures, Shannon entropy and Kullback-Leibler relative-entropy. We present the definitions in the discrete case; the same for the measure-theoretic case are presented in
the Chapter 3, where we discuss the maximum entropy prescriptions of information
measures.
We start with a brief note on the notation used in this chapter. Let X be a discrete
random variable (r.v) defined on some probability space, which takes only n values
and n < ∞. We denote the set of all such random variables by X. We use the symbol
Y to denote a different set of random variables, say, those that take only m values
and m 6= n, m < ∞. Corresponding to the n-tuple (x 1 , . . . , xn ) of values which X
takes, the probability mass function (pmf) of X is denoted by p = (p 1 , . . . pn ), where
P
pk ≥ 0, k = 1, . . . n and nk=1 pk = 1. Expectation of the r.v X is denoted by EX or
hXi; we use both the notations, interchangeably.
20
2.1.1 Shannon Entropy
Shannon entropy, a logarithmic measure of information of an r.v X ∈ X denoted by
S(X), reads as (Shannon, 1948)
S(X) = −
n
X
pk ln pk .
(2.1)
k=1
The convention that 0 ln 0 = 0 is followed, which can be justified by the fact that
limx→0 x ln x = 0. This formula was discovered independently by (Wiener, 1948),
hence, it is also known as Shannon-Wiener entropy.
Note that the entropy functional (2.1) is determined completely by the pmf p of r.v
X, and does not depend on the actual values that X takes. Hence, entropy functional
is often denoted as a function of pmf alone as S(p) or S(p 1 , . . . , pn ); we use all these
notations, interchangeably, depending on the context. The logarithmic function in (2.1)
can be taken with respect to an arbitrary base greater than unity. In this thesis, we
always use the base e unless otherwise mentioned.
Shannon entropy of the Bernoulli variate is known as Shannon entropy function
which is defined as follows. Let X be a Bernoulli variate with pmf (p, 1 − p) where
0 < p < 1. Shannon entropy of X or Shannon entropy function is defined as
s(p) = S(p, 1 − p) = −p ln p − (1 − p) ln(1 − p) ,
p ∈ [0, 1] .
(2.2)
s(p) attains its maximum value for p = 12 . Later, in this chapter we use this function
to compare Shannon entropy functional with generalized information measures, Rényi
and Tsallis, graphically.
Also, Shannon entropy function is of basic importance as Shannon entropy can be
expressed through it as follows:
p3
+ (p1 + p2 + p3 )s
p1 + p 2 + p 3
pn
+ . . . + (p1 + . . . + pn )s
p1 + . . . + p n
n
X
pk
=
.
(2.3)
(p1 + . . . + pk )s
p1 + . . . + p k
p2
S(p1 , . . . , pn ) = (p1 + p2 )s
p1 + p 2
k=2
We have already discussed some of the basic properties of Shannon entropy in
Chapter 1; here we state some properties formally. For a detailed list of properties
see (Aczél & Daróczy, 1975; Guiaşu, 1977; Cover & Thomas, 1991; Topsøe, 2001).
21
S(p) ≥ 0, for any pmf p = (p1 , . . . , pn ) and assumes minimum value, S(p) = 0,
for a degenerate distribution, i.e., p(x 0 ) = 1 for some x0 ∈ X, and p(x) = 0, ∀x ∈ X,
x 6= x0 . If p is not degenerate then S(p) is strictly positive. For any probability
distribution p = (p1 , . . . , pn ) we have
1
1
,...,
S(p) = S(p1 , . . . , pn ) ≤ S
= ln n .
n
n
(2.4)
An important property of entropy functional S(p) is that it is a concave function of
p. This is a very useful property since a local maximum is also the global maximum
for a concave function that is subject to linear constraints.
Finally, the characteristic property of Shannon entropy can be stated as follows.
Let X ∈ X and Y ∈ Y be two random variables which are independent. Then we
have,
S(X × Y ) = S(X) + S(Y ) ,
(2.5)
where X × Y denotes joint r.v of X and Y . When X and Y are not necessarily
independent, then1
S(X × Y ) ≤ S(X) + S(Y ) ,
(2.6)
i.e., the entropy of the joint experiment is less than or equal to the sum of the uncertainties of the two experiments. This is called the subadditivity property.
Many sets of axioms for Shannon entropy have been proposed. Shannon (1948)
has originally given a characterization theorem of the entropy introduced by him. A
more general and exact one is due to Hinčin (1953), generalized by Faddeev (1986).
The most intuitive and compact axioms are given by Khinchin (1956), which are
known as the Shannon-Khinchin axioms. Faddeev’s axioms can be obtained as a special case of Shannon-Khinchin axioms cf. (Guiaşu, 1977, pp. 9, 63).
Here we list the Shannon-Khinchin axioms. Consider the sequence of functions
S(1), S(p1 , p2 ), . . . , S(p1 , . . . pn ), . . ., where, for every n, the function S(p 1 , . . . , pn )
is defined on the set
(
P=
(p1 , . . . , pn ) | pi ≥ 0,
1
n
X
pi = 1
i=1
)
.
This follows from the fact that S(X × Y ) = S(X) + S(Y |X), and conditional entropy S(Y |X) ≤
S(Y ), where
n
m
p(xi , yj ) ln p(yj |xi ) .
S(Y |X) = −
i=1 j=1
22
Consider the following axioms:
[SK1] continuity: For any n, the function S(p 1 , . . . , pn ) is continuous and symmetric
with respect to all its arguments,
[SK2] expandability: For every n, we have
S(p1 , . . . , pn , 0) = S(p1 , . . . , pn ) ,
[SK3] maximality: For every n, we have the inequality
1
1
,...,
,
S(p1 , . . . , pn ) ≤ S
n
n
[SK4] Shannon additivity: If
pij ≥ 0, pi =
mi
X
j=1
pij ∀i = 1, . . . , n, ∀j = 1, . . . , mi ,
(2.7)
then the following equality holds:
S(p11 , . . . , pnmn ) = S(p1 , . . . , pn ) +
n
X
i=1
pimi
pi1
,...,
pi S
pi
pi
.
(2.8)
Khinchin uniqueness theorem states that if the functional S : P → R satisfies the
axioms [SK1]-[SK4] then S is uniquely determined by
S(p1 , . . . , pn ) = −c
n
X
pk ln pk ,
k=1
where c is any positive constant. Proof of this uniqueness theorem for Shannon entropy
can be found in (Khinchin, 1956) or in (Guiaşu, 1977, Theorem 1.1, pp. 9).
2.1.2 Kullback-Leibler Relative-Entropy
Kullback and Leibler (1951) introduced relative-entropy or information divergence,
which measures the distance between two distributions of a random variable. This information measure is also known as KL-entropy, cross-entropy, I-divergence, directed
divergence, etc. (We use KL-entropy and relative-entropy interchangeably in this thesis.) KL-entropy of X ∈ X with pmf p with respect to Y ∈ X with pmf r is denoted
by I(XkY ) and is defined as
I(pkr) = I(XkY ) =
n
X
k=1
pk ln
pk
,
rk
23
(2.9)
where one would assume that whenever r k = 0, the corresponding pk = 0 and 0 ln 00 =
0. Following Rényi (1961), if p and r are pmfs of the same r.v X, the relative-entropy
is sometimes synonymously referred to as the information gain about X achieved if p
can be used instead of r. KL-entropy as a distance measure on the space of all pmfs
of X is not a metric, since it is not symmetric, i.e., I(pkr) 6= I(rkp), and it does not
satisfy the triangle inequality.
KL-entropy is an important concept in information theory, since other informationtheoretic quantities including entropy and mutual information may be formulated as
special cases. For continuous distributions in particular, it overcomes the difficulties
with continuous version of entropy (known as differential entropy); its definition in
nondiscrete cases is a natural extension of the discrete case. These aspects constitute
the major discussion of Chapter 3 of this thesis.
Among the properties of KL-entropy, the property that I(pkr) ≥ 0 and I(pkr) = 0
if and only if p = r is fundamental in the theory of information measures, and is known
as the Gibbs inequality or divergence inequality (Cover & Thomas, 1991, pp. 26). This
property follows from Jensen’s inequality.
I(pkr) is a convex function of both p and r. Further, it is a convex in the pair
(p, r), i.e., if (p1 , r1 ) and (p2 , q2 ) are two pairs of pmfs, then (Cover & Thomas, 1991,
pp. 30)
I(λp1 + (1 − λ)p2 kλr1 + (1 − λ)r2 ) ≤ λI(p1 kr1 ) + (1 − λ)I(p2 kr2 ) .
(2.10)
Similar to Shannon entropy, KL-entropy is additive too in the following sense. Let
X1 , X2 ∈ X and Y1 , Y2 ∈ Y be such that X1 and Y1 are independent, and X2 and Y2
are independent, respectively, then
I(X1 × Y1 kX2 × Y2 ) = I(X1 kX2 ) + I(Y1 kY2 ) ,
(2.11)
which is the additivity property2 of KL-entropy.
Finally, KL-entropy (2.9) and Shannon entropy (2.1) are related by
I(pkr) = −S(p) −
n
X
pk ln rk .
(2.12)
k=1
2
Additivity property of KL-entropy can alternatively be stated as follows. Let X and Y be two
independent random variables. Let p(x, y) and r(x, y) be two possible joint pmfs of X and Y . Then we
have
I(p(x, y)kr(x, y)) = I(p(x)kr(x)) + I(p(y)kr(y)) .
24
One has to note that the above relation between KL and Shannon entropies differs in
the nondiscrete cases, which we discuss in detail in Chapter 3.
2.2 Rényi’s Generalizations
Two important concepts that are essential for the derivation of Rényi entropy are Hartley information measure and generalized averages known as Kolmogorov-Nagumo
averages. Hartley information measure quantifies the information associated with a
single event and brings forth the operational significance of the Shannon entropy – the
average of Hartley information is viewed as the Shannon entropy. Rényi used generalized averages KN, in the averaging of Hartley information to derive his generalized
entropy. Before we summarize the information theory procedure leading to Rényi entropy, we discuss these concepts in detail.
A conceptual discussion on significance of Hartley information in the definition
of Shannon entropy can be found in (Rényi, 1960) and more formal discussion can
be found in (Aczél & Daróczy, 1975, Chapter 0). Concepts related to generalized
averages can be found in the book on inequalities (Hardy et al., 1934, Chapter 3).
2.2.1 Hartley Function and Shannon Entropy
The motivation to quantify information in terms of logarithmic functions goes back
to Hartley (1928), who first used a logarithmic function to define uncertainty associated with a finite set. This is known as Hartley information measure. The Hartley
information measure of a finite set A with n elements is defined as H(A) = log b n. If
the base of the logarithm is 2, then the uncertainty is measured in bits, and in the case
of natural logarithm, the unit is nats. As we mentioned earlier, in this thesis, we use
only natural logarithm as a convention.
Hartley information measure resembles the measure of disorder in thermodynamics, first provided by Boltzmann principle (known as Boltzmann entropy), and is given
by
S = K ln W ,
(2.13)
where K is the thermodynamic unit of measurement of entropy and is known as the
Boltzmann constant and W , called the degree of disorder or statistical weight, is the
total number of microscopic states compatible with the macroscopic state of the system.
25
One can give a more general definition of Hartley information measure described
above as follows. Define a function H : {x 1 , . . . , xn } → R of the values taken by r.v
X ∈ X with corresponding p.m.f p = (p1 , . . . pn ) as (Aczél & Daróczy, 1975)
H(xk ) = ln
1
, ∀k = 1, . . . n.
pk
(2.14)
H is also known as information content or entropy of a single event (Aczél & Daróczy,
1975) and plays an important role in all classical measures of information. It can be
interpreted either as a measure of how unexpected the given event is, or as measure
of the information yielded by the event; and it has been called surprise by Watanabe
(1969), and unexpectedness by Barlow (1990).
Hartley function satisfies: (i) H is nonnegative: H(x k ) ≥ 0 (ii) H is additive:
H(xi , xj ) = H(xi ) + H(xj ), where H(xi , xj ) = ln pi1pj (iii) H is normalized:
H(xk ) = 1, whenever pk =
satisfied for pk =
1
2 ).
1
e
(in the case of logarithm with base 2, the same is
These properties are both necessary and sufficient (Aczél &
Daróczy, 1975, Theorem 0.2.5).
Now, Shannon entropy (2.1) can be written as expectation of Hartley function as
S(X) = hHi =
n
X
pk Hk ,
(2.15)
k=1
where Hk = H(xk ), ∀k = 1, . . . n, with the understanding that hHi = hH(X)i.
The characteristic additive property of Shannon entropy (2.5) now follows as a consequence of the additivity property of Hartley function.
There are two postulates involved in defining Shannon entropy as expectation of
Hartley function. One is the additivity of information which is the characteristic property of Hartley function, and the other is that if different amounts of information occur
with different probabilities, the total information will be the average of the individual
informations weighted by the probabilities of their occurrences. One can justify these
postulates by heuristic arguments based on probabilistic considerations, which can be
advanced to establish the logarithmic nature of Hartley and Shannon information measures (see § 1.2.1).
Expressing or defining Shannon entropy as an expectation of Hartley function, not
only provides an intuitive idea of Shannon entropy as a measure of information but it is
also useful in derivation of its properties. Further, as we are going to see in detail, this
provides a unified way to discuss the Rényi’s and Tsallis generalizations of Shannon
entropy.
Now we move on to a discussion on generalized averages.
26
2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means
In the general theory of means, the quasilinear mean of a random variable X ∈ X is
defined as3
Eψ X = hXiψ = ψ
−1
n
X
pk ψ (xk )
k=1
!
,
(2.16)
where ψ is continuous and strictly monotonic (increasing or decreasing) and hence
has an inverse ψ −1 , which satisfies the same conditions. In the context of generalized means, ψ is referred to as Kolmogorov-Nagumo function (KN-function). In
particular, if ψ is linear, then (2.16) reduces to the expression of linear averaging,
P
EX = hXi = nk=1 pk xk . Also, the mean hXiψ takes the form of weighted arith1
P
Q
metic mean ( nk=1 pk xak ) a when ψ(x) = xa , a > 0 and geometric mean nk=1 xpkk if
ψ(x) = ln x.
In order to justify (2.16) as a so called mean we need the following theorem.
T HEOREM 2.1
If ψ is continuous and strictly monotone in a ≤ x ≤ b, a ≤ x k ≤ b, k = 1, . . . n,
P
pk > 0 and nk=1 pk = 1, then ∃ unique x0 ∈ (a, b) such that
ψ(x0 ) =
n
X
pk ψ(xk ) ,
(2.17)
k=1
and x0 is greater than some and less than others of the x k unless all xk are zero.
The implication of Theorem 2.1 is that the mean h . i ψ is determined when the
function ψ is given. One may ask whether the converse is true: if hXi ψ1 = hXiψ2 for
all X ∈ X, is ψ1 necessarily the same function as ψ2 ? Before answering this question,
we shall give the following definition.
D EFINITION 2.1
Continuous and strictly monotone functions ψ 1 and ψ2 are said to be KN-equivalent
if hXiψ1 = hXiψ2 for all X ∈ X.
3
Kolmogorov (1930) and Nagumo (1930) first characterized the quasilinear mean for a vector
1
(x1 , . . . , xn ) as hxiψ = ψ −1 n
k=1 n ψ(xk ) where ψ is a continuous and strictly monotone function.
de Finetti (1931) extended their result to the case of simple (finite) probability distributions. The version
of the quasilinear mean representation theorem referred to in § 2.5 is due to Hardy et al. (1934), which
followed closely the approach of de Finetti. Aczél (1948) proved a characterization of the quasilinear
mean using functional equations. Ben-Tal (1977) showed that quasilinear means are ordinary arithmetic
means under suitably defined addition and scalar multiplication operations. Norries (1976) did a survey
of quasilinear means and its more restrictive forms in Statistics, and a more recent survey of generalized means can be found in (Ostasiewicz & Ostasiewicz, 2000). Applications of quasilinear means can
be found in economics (e.g., Epstein & Zin, 1989) and decision theory (e.g., Kreps & Porteus, 1978)).
Recently Czachor and Naudts (2002) studied generalized thermostatistics based on quasilinear means.
27
Note that when we compare two means, it is to be understood that the underlying probabilities are same. Now, the following theorem characterizes KN-equivalent functions.
T HEOREM 2.2
In order that two continuous and strictly monotone functions ψ 1 and ψ2 are KNequivalent, it is necessary and sufficient that
ψ1 = αψ2 + β ,
(2.18)
where α and β are constants and α 6= 0.
A simple consequence of the above theorem is that if ψ is a KN-function then we
have hXiψ = hXi−ψ . Hence, without loss of generality, one can assume that ψ is
an increasing function. The following theorem states the important property of KN-
averages, which characterizes additivity of quasilinear means cf. (Hardy et al., 1934,
Theorem 84).
T HEOREM 2.3
Let ψ be a KN-function and c be a real constant then hX + ci ψ = hXiψ + c i.e.,
ψ −1
n
X
pk ψ (xk + c)
k=1
!
= ψ −1
n
X
k=1
pk ψ (xk )
!
+c
if and only if ψ is either linear or exponential.
Proofs of Theorems 2.1, 2.2 and 2.3 can be found in the book on inequalities
by Hardy et al. (1934).
Rényi (1960) employed these generalized averages in the definition of Shannon
entropy to generalize the same.
2.2.3 Rényi Entropy
In the definition of Shannon entropy (2.15), if the standard mean of Hartley function
H is replaced with the quasilinear mean (2.16), one can obtain a generalized measure
of information of r.v X with respect to a KN-function ψ as
!
!
n
n
X
X
1
pk ψ ln
pk ψ (Hk ) ,
Sψ (X) = ψ −1
= ψ −1
pk
k=1
(2.19)
k=1
where ψ is a KN-function. We refer to (2.19) as quasilinear entropy with respect to
the KN-function ψ. A natural question that arises is what is the possible mathematical
form of KN-function ψ, or in other words, what is the most general class of functions
ψ which will still provide a measure of information compatible with the additivity
28
property (postulate)? The answer is that insisting on additivity allows by Theorem 2.3
only for two classes of ψ’s – linear and exponential functions. We formulate these
arguments formally as follows.
If we impose the constraint of additivity on S ψ , i.e., for any X, Y ∈ X
Sψ (X × Y ) = Sψ (X) + Sψ (Y ) ,
(2.20)
then ψ should satisfy (Rényi, 1960)
hX + ciψ = hXiψ + c ,
(2.21)
for any random variable X ∈ X and a constant c.
Rényi employed this formalism to define a one-parameter family of measures of
information as follows:
n
X
1
pαk
ln
Sα (X) =
1−α
k=1
!
,
(2.22)
where the KN-function ψ is chosen in (2.19) as ψ(x) = e (1−α)x whose choice is motivated by Theorem 2.3. If we choose ψ as a linear function in quasilinear entropy (2.19),
what we get is Shannon entropy. The right side of (2.22) makes sense 4 as a measure
of information whenever α 6= 1 and α > 0 cf. (Rényi, 1960).
Rényi entropy is a one-parameter generalization of Shannon entropy in the sense
that Sα (p) → S(p) as α → 1. Hence, Rényi entropy is referred to as entropy of order
α, whereas Shannon entropy is referred to as entropy of order 1. The Rényi entropy
can also be seen as an interpolation formula connecting the Shannon (α = 1) and
Hartley (α = 0) entropies.
Among the basic properties of Rényi entropy, S α is positive. This follows from
P
Jensen’s inequality which gives nk=1 pαk ≤ 1 in the case α > 1, and while in the case
P
0 < α < 1 it gives nk=1 pαk ≥ 1; in both cases we have Sα (p) ≥ 0.
Sα is strictly concave with respect to p for 0 < α ≤ 1. For α > 1, Rényi
entropy is neither pure convex nor pure concave. This is a simple consequence of
the fact that both ln x and xα (α < 1) are concave functions, while x α is convex for
α > 1 (see (Ben-Bassat & Raviv, 1978) for proofs and a detailed discussion).
4
For negative α, however, Sα (p) has disadvantageous properties; namely, it will tend to infinity if
any pk tends to 0. This means that it is too sensitive to small probabilities. (This property could also
formulated in the following way: if we add a new event of probability 0 to a probability distribution,
what does not change the probability distribution, Sα (p) becomes infinity.) The case α = 0 must also be
excluded because it yields an expression not depending on the probability distribution p = (p1 , . . . , pn ).
29
A notable property of Sα (p) is that it is a monotonically decreasing function of α
for any pmf p. This can be verified as follows. I We can calculate the derivative of
Sα (p) with respect to α as
n
X
dSα (p)
1
=
dα
(1 − α)
pαk
Pn
α
j=1 pj
k=1
1
=
(1 − α)2
(
n
X
k=1
pαk
Pn
α
j=1 pj
!
!
ln pk +
ln p1−α
k
− ln
1
ln
(1 − α)2
n
X
k=1
n
X
k=1
pαk
Pn
One should note here that the vector of positive real numbers
pαk
α
j=1 pj
!
p1−α
k
)
.
(2.23)
pα
1
n
j=1
pα
j
,...,
pα
n
n
j=1
pα
j
represents a pmf. (Indeed, distributions of this form are known as escort distribu-
tions (Abe, 2003) and plays an important role in ME-prescriptions of Tsallis entropy. We discuss these aspects in Chapter 3.) Denoting the mean of a vector x =
(x1 , . . . , xn ) with respect to this pmf, i.e. escort distribution of p, by hhxii α we can
write (2.23) in an elegant form, which further gives the results as
1
dSα (p)
1−α
1−α
=
hhln p
iiα − ln hhp
iiα ≤ 0 .
dα
(1 − α)2
(2.24)
The inequality in (2.24) is due to Jensen’s inequality. J Important consequences of the
fact that Sα is a monotone decreasing function of α are the following two inequalities
S1 (p) < Sα (p) < ln n ,
0 < α < 1,
(2.25a)
Sα (p) < S1 (p) < ln n ,
α > 1,
(2.25b)
where S1 (p) = limα→1 Sα (p) is the Shannon entropy.
From the derivation of Rényi entropy it is obvious that it is additive, i.e.,
Sα (X × Y ) = Sα (X) + Sα (Y ) ,
(2.26)
where X ∈ X and Y ∈ Y are two independent r.v.
Most of the other known properties of Rényi entropy and its characterizations are
summarized by Aczél and Daróczy (1975, Chapter 5) and Jizba and Arimitsu (2004b).
Properties related to convexity and bounds of Rényi entropy can be found in (BenBassat & Raviv, 1978).
30
Similar to the Shannon entropy function (2.2) one can define the entropy function
in the case of Rényi as
sα (p) =
1
ln pα + (1 − p)α ,
1−α
p ∈ [0, 1],
(2.27)
which is the Rényi entropy of a Bernoulli random variable.
Figure 2.1 shows the plot of Shannon entropy function (2.2) compared to Rényi
entropy function (2.27) for various values of entropic index α.
0.7
0.6
α=0.8
α=1.2
α=1.5
α
s(p) & s ( p)
0.5
0.4
0.3
0.2
Shannon
Renyi
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
p
Figure 2.1: Shannon and Renyi Entropy Functions
Rényi entropy does have a reasonable operational significance even if not one comparable with that of Shannon entropy cf. (Csiszár, 1974). As regards the axiomatic approach, Rényi (1961) did suggest a set of postulates characterizing his entropies but it
P
involved the rather artificial procedure of considering incomplete pdfs ( nk=1 pk ≤ 1 )
as well. This shortcoming has been eliminated by Daróczy (1970). Recently, a slightly
different set of axioms is given by (Jizba & Arimitsu, 2004b).
Despite its formal origin, Rényi entropy proved important in a variety of practical
applications in coding theory (Campbell, 1965; Aczél & Daróczy, 1975; Lavenda,
1998), statistical inference (Arimitsu & Arimitsu, 2000, 2001), quantum mechanics (Maassen & Uffink, 1988), chaotic dynamics systems (Halsey, Jensen, Kadanoff,
Procaccia, & Shraiman, 1986) etc. Rényi entropy is also used in neural networks
(Kamimura, 1998). Thermodynamic properties of systems with multi-fractal structures have been studied by extending the notion of Gibbs-Shannon entropy into a more
31
general framework - Rényi entropy (Jizba & Arimitsu, 2004a).
Entropy of order 2 i.e., Rényi entropy for α = 2,
S2 (p) = − ln
n
X
p2k
(2.28)
k=1
is known as Rényi quadratic entropy. R‘{enyi quadratic entropy is mostly used in
a contex of kernel based estimators, since it allows an explicit computation of the
estimated density. This measure has also been applied to clustering problems under
the name of information theoretic clustering (Gokcay & Principe, 2002). Maximum
entropy formulations of Rényi quadratic entropy are studied to compute conditional
probabilities, with applications to image retrieval and language modeling in the PhD
thesis of Zitnick (2003).
Along similar lines of generalization of entropy, Rényi (1960) defined a one parameter generalization of Kullback-Leibler relative-entropy as
n
Iα (pkr) =
X pα
1
k
ln
α−1
α−1
rk
(2.29)
k=1
for pmfs p and r. Properties of this generalized relative-entropy can be found in (Rényi,
1970, Chapter 9).
We conclude this section with the note that though it is considered that the first
formal generalized measure of information is due to Rényi, the idea of considering
some generalized measure did not start with Rényi. Bhattacharyya (1943, 1946) and
Jeffreys (1948) dealt with the quantity
n
X
√
I1/2 (pkr) = −2
pk rk = I1/2 (rkp)
(2.30)
k=1
as a measure of difference between the distributions p and r, which is nothing but
Rényi relative-entropy (2.29) with α = 21 . Before Rényi, Schützenberger (1954) mentioned the expression Sα and Kullback (1959) too dealt with the quantities I α . (One
can refer (Rényi, 1960) for a discussion on the context in which Kullback considered
these generalized entropies.)
Apart from Rényi and Tsallis generalizations, there are various generalizations
of Shannon entropy reported in literature. Reviews of these generalizations can be
found in Kapur (1994) and Arndt (2001). The characterizations of various information
measures are studied in (Ebanks, Sahoo, & Sander, 1998). Since poorly motivated
generalizations have also been published during Rényi’s time, Rényi emphasized the
need of operational as well as postulational justification in order to call an algebraic
32
expression an information quantity. In this respect, Rényi’s review paper (Rényi, 1965)
is particularly instructive.
Now we discuss the important, non-additive generalization of Shannon entropy.
2.3 Nonextensive Generalizations
Although, first introduced by Havrda and Charvát (1967) in the context of cybernetics
theory and later studied by Daróczy (1970), it was Tsallis (1988) who exploited its
nonextensive features and placed it in a physical setting. Hence it is also known as
Harvda-Charvat-Daróczy-Tsallis entropy. (Throughout this paper we refer to this as
Tsallis or nonextensive entropy.)
2.3.1 Tsallis Entropy
Tsallis entropy of an r.v X ∈ X with p.m.f p = (p 1 , . . . pn ) is defined as
P
1 − nk=1 pqk
,
Sq (X) =
q−1
(2.31)
where q > 0 is called the nonextensive index.
Tsallis entropy too, like Rényi entropy, is a one-parameter generalization of Shannon entropy in the sense that
lim Sq (p) = −
q→1
n
X
pk ln pk = S1 (p) ,
(2.32)
k=1
since in the limit q → 1, we have pkq−1 = e(q−1) ln pk ∼ 1 + (q − 1) ln pk or by the
L’Hospital rule.
Tsallis entropy retains many important properties of Shannon entropy except for
the additivity property. Here we briefly discuss some of these properties. The arguments which provide the positivity of Rényi entropy are also applicable for Tsallis
entropy and hence Sq (p) ≥ 0 for any pmf p. Sq equals zero in the case of certainty
and attains its extremum for a uniform distribution.
The fact that Tsallis entropy attains maximum for uniform distribution can be
shown as follows. I We extremize the Tsallis entropy under the normalizing conP
straint nk=1 pk = 1. By introducing the Lagrange multiplier λ, we set
!!
P
n
X
1 − nk=1 pqk
∂
q
0=
−λ
pk − 1
=−
pq−1 − λ .
∂pk
q−1
q−1 k
k=1
33
It follows that
λ(1 − q)
pk =
q
1
q−1
.
Since this is independent of k, imposition of the normalizing constraint immediately
yields pk = n1 . J
Tsallis entropy is concave for all q > 0 (convex for q < 0). I This follows
immediately from the Hessian matrix
∂2
∂pi ∂pj
Sq (p) − λ
n
X
k=1
pk − 1
!!
= −qpiq−2 δij ,
which is clearly negative definite for q > 0 (positive definite for q < 0). J One can
recall that Rényi entropy (2.22) is concave only for 0 < α < 1.
Also, one can prove that for two pmfs p and r, and for real number 0 ≤ λ ≤ 1 we
have
Sq (λp + (1 − λ)r) ≥ λSq (p) + (1 − λ)Sq (r) ,
which results from Jensen’s inequality and concavity of
(2.33)
xq
1−q .
What separates out Tsallis entropy from Shannon and Rényi entropies is that it is
not additive. The entropy index q in (2.31) characterizes the degree of nonextensivity
reflected in the pseudo-additivity property
Sq (X ×Y ) = Sq (X)⊕q Sq (Y ) = Sq (X)+Sq (Y )+(1−q)Sq (X)Sq (Y ) ,(2.34)
where X, Y ∈ X are two independent random variables.
In the nonextensive case, Tsallis entropy function can written as
sq (p) =
1 1 − xq − (1 − x)q
q−1
(2.35)
Figure 2.2 shows the plots of Shannon entropy function (2.2) and Tsallis entropy function (2.35) for various values of entropic index a.
It is worth mentioning here that the derivation of Tsallis entropy using the Lorentz
addition by Amblard and Vignat (2005) gives insights into the boundedness of Tsallis
entropy. In this thesis we will not go into these details.
The first set of axioms for Tsallis entropy is given by dos Santos (1997), which
were later improved by Abe (2000). The most concise set of axioms are given by Suyari (2004a), which are known as Generalized Shannon-Khinchin axioms. A simpli34
0.9
0.8
s(p) & sq(p)
0.7
q=0.8
q=1.2
0.6
q=1.5
0.5
0.4
0.3
0.2
Shannon
Tsallis
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
p
Figure 2.2: Shannon and Tsallis Entropy Functions
fied proof of this uniqueness theorem for Tsallis entropy is given by (Furuichi, 2005).
In these axioms, Shannon additivity (2.8) is generalized to
Sq (p11 , . . . , pnmn ) = Sq (p1 , . . . , pn ) +
n
X
i=1
pqi Sq
pimi
pi1
,...,
pi
pi
,
(2.36)
under the same conditions (2.7); remaining axioms are the same as in Shannon-Khinchin
axioms.
Now we turn our attention to the nonextensive generalization of relative-entropy.
The definition of Kullback-Leibler relative-entropy (2.9) and the nonextensive entropic
functional (2.31) naturally lead to the generalization (Tsallis, 1998)
Iq (pkr) =
n
X
pk
pk
rk
k=1
q−1
q−1
−1
,
(2.37)
which is called as Tsallis relative-entropy. The limit q → 1 recovers the relativeentropy in the classical case. One can also generalize Gibbs inequality as (Tsallis,
1998) )
Iq (pkr) ≥ 0 if q > 0
= 0 if q = 0
≤ 0 if q < 0 .
(2.38)
35
For q 6= 0, the equalities hold if and only if p = r. (2.38) can be verified as follows.
I Consider the function f (x) =
1−x1−q
1−q .
We have f 00 (x) > 0 for q > 0 and hence it
is convex. By Jensen’s inequality we obtain

1−q 

r
n
X  1 − pkk
1 

pk 
Iq (pkr) =
1−
≥
1−q
1−q
k=1
=0 .
n
X
k=1
!1−q 
rk

pk
pk
(2.39)
For q < 0 we have f 00 (x) < 0 and hence we have the reverse inequality by Jensen’s
inequality for concave functions. J
Further, for q > 0, Iq (pkr) is a convex function of p and r, and for q < 0 it
is concave, which can be proved using Jensen’s inequality cf. (Borland, Plastino, &
Tsallis, 1998).
Tsallis relative-entropy satisfies the pseudo-additivity property of the form (Furuichi et al., 2004)
Iq (X1 × Y1 kX2 × Y2 ) = Iq (X1 kX2 ) + Iq (Y1 kY2 )
+(q − 1)Iq (X1 kX2 )Iq (Y1 kY2 ) ,
(2.40)
where X1 , X2 ∈ X and Y1 , Y2 ∈ Y are such that X1 and Y1 are independent, and X2
and Y2 are independent respectively. The limit q → 1 in (2.40) retrieves (2.11), the ad-
ditivity property of Kullback-Leibler relative-entropy. One should note the difference
between the pseudo-additivities of Tsallis entropy (2.34) and Tsallis relative-entropy
(2.40).
Further properties of Tsallis relative-entropy have been discussed in (Tsallis, 1998;
Borland et al., 1998; Furuichi et al., 2004). Characterization of Tsallis relative-entropy,
by generalizing Hobson’s uniqueness theorem (Hobson, 1969) of relative-entropy, is
presented in (Furuichi, 2005).
2.3.2 q-Deformed Algebra
The mathematical basis for Tsallis statistics comes from the q-deformed expressions
for the logarithm (q-logarithm) and the exponential function (q-exponential) which
were first defined in (Tsallis, 1994), in the context of nonextensive thermostatistics.
The q-logarithm is defined as
lnq x =
x1−q − 1
(x > 0, q ∈ R) ,
1−q
36
(2.41)
and the q-exponential is defined as
(
1
[1 + (1 − q)x] 1−q if 1 + (1 − q)x ≥ 0
x
eq =
0
otherwise.
(2.42)
We have limq→1 lnq x = ln x and limq→1 exq = ex . These two functions are related by
ln x
eq q = x .
(2.43)
The q-logarithm satisfies pseudo-additivity of the form
lnq (xy) = lnq x + lnq y + (1 − q) lnq x lnq y ,
(2.44)
while, the q-exponential satisfies
exq eyq = e(x+y+(1−q)xy)
.
q
(2.45)
One important property of the q-logarithm is (Furuichi, 2006)
x
lnq
= y q−1 (lnq x − lnq y) .
y
(2.46)
These properties of q-logarithm and q-exponential functions, (2.44) and (2.45),
motivate the definition of q-addition as
x ⊕q y = x + y + (1 − q)xy ,
(2.47)
which we have already mentioned in the context of pseudo-additivity of Tsallis entropy
(2.34). The q-addition is commutative i.e., x ⊕ q y = y ⊕q x, and associative i.e.,
x ⊕q (y ⊕q z) = (x ⊕q y) ⊕q z. But it is not distributive with respect to the usual
multiplication, i.e., a(x ⊕q y) 6= (ax ⊕q ay). Similar to the definition of q-addition,
the q-difference is defined as
x q y =
1
x−y
, y 6=
.
1 + (1 − q)y
q−1
(2.48)
Further properties of these q-deformed functions can be found in (Yamano, 2002).
In this framework a new multiplication operation called q-product has been defined, which plays an important role in the compact representation of distributions
resulting from Tsallis relative-entropy minimization (Dukkipati, Murty, & Bhatnagar,
2005b). These aspects are discussed in Chapter 4.
Now, using these q-deformed functions, Tsallis entropy (2.31) can be represented
as
Sq (p) = −
n
X
pqk lnq pk ,
(2.49)
k=1
37
and Tsallis relative-entropy (2.37) as
Iq (pkr) = −
n
X
k=1
pk lnq
rk
.
pk
(2.50)
These representations are very important for deriving many results related to nonextensive generalizations as we are going to consider in the later chapters.
2.4 Uniqueness of Tsallis Entropy under Rényi’s Recipe
Though the derivation of Tsallis entropy proposed in 1988 is slightly different, one
can understand this generalization using the q-logarithm function, where one would
first generalize logarithm in the Hartley information with the q-logarithm and define
e : {x1 , . . . , xn } → R of r.v X as (Tsallis, 1999)
the q-Hartley function H
e k = H(x
e k ) = lnq 1 ,
H
pk
k = 1, . . . n .
(2.51)
Now, Tsallis entropy (2.31) can be defined as the expectation of the q-Hartley function
e as5
H
D E
e .
Sq (X) = H
(2.52)
Note that the characteristic pseudo-additivity property of Tsallis entropy (2.34) is a
consequence of the pseudo-additivity of the q-logarithm (2.44).
Before we present the main results, we briefly discuss the context of quasilinear
means, where there is a relation between Tsallis and Rényi entropy. By using the
definition of the q-logarithm (2.41), the q-Hartley function can be written as
where
e k = lnq 1 = φq (Hk ) ,
H
pk
φq (x) =
e(1−q)x − 1
= lnq (ex ) .
1−q
(2.53)
Note that the function φq is KN-equivalent to e(1−q)x (by Theorem 2.2), the KNfunction used in Rényi entropy. Hence Tsallis entropy is related to Rényi entropies
as
SqT = φq (SqR ) ,
(2.54)
5
There are alternative definitions of nonextensive information content in the Tsallis formalism. One
of them is the expression − lnq pk used by Yamano (2001) and characterized by Suyari (2002) (note
that − lnq pk 6= lnq p1k ). Using this definition one has to use alternate expectation, called q-expectation,
to define Tsallis entropy. We discuss q-expectation values in Chapter 3. Regarding the definition of
nonextensive information content, we use Tsallis (1999) definition (2.51) in this thesis.
38
0.9
Renyi
Tsallis
q<1
0.8
0.6
0.5
0.4
R
T
Sq (p) & Sq (p)
0.7
0.3
0.2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
p
(a) Entropic Index q = 0.8
0.7
Renyi
Tsallis
q>1
0.6
0.4
0.3
R
T
Sq (p) & Sq (p)
0.5
0.2
0.1
0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
p
(b) Entropic Index q = 1.2
Figure 2.3: Comparison of Rényi and Tsallis Entropy Functions
where SqT and SqR denote the Tsallis and Rényi entropies respectively with a real number q as a parameter. (2.54) implies that Tsallis and Rényi entropies are monotonic
functions of each other and, as a result, both must be maximized by the same probability distribution. In this thesis, we consider only ME-prescriptions related to nonextensive entropies. Discussion on ME of Rényi entropy can be found in (Bashkirov, 2004;
Johnson & Vignat, 2005; Costa, Hero, & Vignat, 2002).
Comparisons of Rényi entropy function (2.27) with Tsallis entropy function (2.35)
are shown graphically in Figure 2.3 for two cases of entropic index, corresponding to
0 < q < 1 and q > 1 respectively. Now a natural question that arises is whether
one could generalize Tsallis entropy using Rényi’s recipe, i.e. by replacing the linear
average in (2.52) by KN-averages and imposing the condition of pseudo-additivity. It
is equivalent to determining the KN-function ψ for which the so called q-quasilinear
39
entropy defined as
" n
#
D E
X
e
ek
Seψ (X) = H
pk ψ H
= ψ −1
,
ψ
(2.55)
k=1
e k = H(x
e k ), ∀k = 1, . . . n, satisfies the pseudo-additivity property.
where H
First, we present the following result which characterizes the pseudo-additivity of
quasilinear means.
T HEOREM 2.4
Let X, Y ∈ X be two independent random variables. Let ψ be any KN-function.
Then
hX ⊕q Y iψ = hXiψ ⊕q hY iψ
(2.56)
if and only if ψ is linear.
Proof
Let p and r be the p.m.fs of random variables X, Y ∈ X respectively. The proof of
sufficiency is simple and follows from
hX ⊕q Y iψ = hX ⊕q Y i =
=
=
n
n X
X
i=1 j=1
n
n X
X
i=1 j=1
n
X
pi rj (xi ⊕q yj )
pi rj (xi + yj + (1 − q)xi yj )
pi xi +
i=1
n
X
j=1
rj yj + (1 − q)
n
X
i=1
pi xi
n
X
rj yj .
j=1
To prove the converse, we need to determine all forms of ψ which satisfy


n X
n
X
ψ −1 
pi rj ψ (xi ⊕q yj )
i=1 j=1
= ψ −1
n
X
pi ψ (xi )
i=1
!


n
X
⊕q ψ −1 
rj ψ (yj ) .
(2.57)
j=1
Since (2.57) must hold for arbitrary p.m.fs p, r and for arbitrary numbers x 1 , . . . , xn
and y1 , . . . , yn , one can choose yj = c for all j. Then (2.57) yields
ψ −1
n
X
i=1
pk ψ (xi ⊕q c)
!
= ψ −1
n
X
i=1
40
pk ψ (xi )
!
⊕q c .
(2.58)
That is, ψ should satisfy
hX ⊕q ciψ = hXiψ ⊕q c ,
(2.59)
for any X ∈ X and any constant c. This can be rearranged as
h(1 + (1 − q)c)X + ciψ = (1 + (1 − q)c)hXi ψ + c
by using the definition of ⊕q . Since q is independent of other quantities, ψ should
satisfy an equation of the form
hdX + ciψ = dhXiψ + c ,
(2.60)
where d 6= 0 (by writing d = (1 + (1 − q)c)). Finally ψ must satisfy
hX + ciψ = hXiψ + c
(2.61)
hdXiψ = dhXiψ ,
(2.62)
and
for any X ∈ X and any constants d, c. From Theorem 2.3, condition (2.61) is satisfied
only when ψ is linear or exponential.
To complete the theorem, we have to show that KN-averages do not satisfy condition (2.62) when ψ is exponential. For a particular choice of ψ(x) = e (1−α)x , assume
that
hdXiψ = dhXiψ ,
(2.63)
where
hdXiψ1
n
X
1
=
pk e(1−α)dxk
ln
1−α
dhXiψ1
n
X
d
=
ln
pk e(1−α)xk
1−α
k=1
and
k=1
!
!
,
.
Now define a KN-function ψ 0 as ψ 0 (x) = e(1−α)dx , for which
!
n
X
1
pk e(1−α)dxk
.
hXiψ0 =
ln
d(1 − α)
k=1
Condition (2.63) implies
hXiψ = hXiψ0 ,
and by Theorem 2.2, ψ and ψ 0 are KN-equivalent, which gives a contradiction.
41
One can observe that the above proof avoids solving functional equations as in the
case of the proof of Theorem 2.3 (e.g., Aczél & Daróczy, 1975). Instead, it makes
use of Theorem 2.3 itself and other basic properties of KN-averages. The following
corollary is an immediate consequence of Theorem 2.4.
C OROLLARY 2.1
The q-quasilinear entropy Seψ (defined as in (2.55)) with respect to a KN-function
ψ satisfies pseudo-additivity if and only if Seψ is Tsallis entropy.
Proof
Let X, Y ∈ X be two independent random variables and let p, r be their corresponding
pmfs. By the pseudo-additivity constraint, ψ should satisfy
Seψ (X × Y ) = Seψ (X) ⊕q Seψ (Y ) .
(2.64)
From the property of q-logarithm that lnq xy = lnq x ⊕q lnq y, we need

ψ −1 
n
n X
X
pi rj ψ lnq
i=1 j=1
= ψ −1
n
X
i=1

1 
pi rj
pi ψ lnq
1
pi
!


n
X
1 
.
rj ψ lnq
⊕q ψ −1 
rj
Equivalently, we need


n X
n
X
e p ⊕q H
er 
ψ −1 
pi rj ψ H
j
i
i=1 j=1
= ψ −1
n
X
i=1
ep
pi ψ H
i
!
(2.65)
j=1

⊕q ψ −1 
n
X
j=1

e jr  ,
rj ψ H
e p and H
e r represent the q-Hartley functions corresponding to probability diswhere H
tributions p and r respectively. That is, ψ should satisfy
e p ⊕q H
e r i = hH
e p i ⊕q hH
e ri .
hH
ψ
ψ
ψ
Also from Theorem 2.4, ψ is linear and hence Seψ is Tsallis.
Corollary 2.1 shows that using Rényi’s recipe in the nonextensive case one can
prepare only Tsallis entropy, while in the classical case there are two possibilities.
Figure 2.4 summarizes the Rényi’s recipe for Shannon and Tsallis information measures.
42
Hartley Information
q−Hartley Information
KN−average
KN−average
Quasilinear Entropy
q−Quasilinear Entropy
additivity
Shannon Entropy
pseudo−additivity
’
Renyi
Entropy
Tsallis Entropy
Figure 2.4: Rényi’s Recipe for Additive and Pseudo-additive Information Measures
2.5 A Characterization Theorem for Nonextensive Entropies
The significance of Rényi’s formalism to generalize Shannon entropy is a characterization of the set of all additive information measures in terms of axioms of quasilinear
means (Rényi, 1960). By the result, Theorem 2.4, that we presented in this chapter, one can extend this characterization to pseudo-additive (nonextensive) information
measures. We emphasize here that, for such a characterization one would assume
that entropy is the expectation of a function of underlying r.v. In the classical case, the
function is Hartley function, while in the nonextensive case it is the q-Hartley function.
Since characterization of quasilinear means is given in terms of cumulative distribution of a random variable as in (Hardy et al., 1934), we use the following definitions
and notation.
Let F : R → R denote the cumulative distribution function of the random variable
X ∈ X. Corresponding to a KN-function ψ : R → R, the generalized mean of F
(equivalently, generalized mean of X) can be written as
Z
−1
Eψ (F ) = Eψ (X) = hXiψ = ψ
ψ dF
,
(2.66)
which is the continuous analogue to (2.16), and is axiomized by Kolmogorov, Nagumo,
de Finetti, c.f (Hardy et al., 1934, Theorem 215) as follows.
43
T HEOREM 2.5
Let FI be the set of all cumulative distribution functions defined on some interval
I of the real line R. A functional κ : F I → R satisfies the following axioms:
[KN1] κ(δx ) = x, where δx ∈ FI denotes the step function at x (Consistency with
certainty) ,
[KN2] F, G ∈ FI , if F ≤ G then κ(F ) ≤ κ(G); the equality holds if and only if
F = G (Monotonicity) and,
[KN3] F, G ∈ FI , if κ(F ) = κ(G) then κ(βF + (1 − β)H) = κ(βG + (1 − β)H),
for any H ∈ FI (Quasilinearity)
if and only if there is a continuous strictly monotone function ψ such that
Z
κ(F ) = ψ −1
ψ
dF
.
Proof of the above characterization can be found in (Hardy et al., loc. cit.). Modified axioms for the quasilinear mean can be found in (Chew, 1983; Fishburn, 1986;
Ostasiewicz & Ostasiewicz, 2000). Using this characterization of the quasilinear mean,
Rényi gave the following characterization for additive information measures.
T HEOREM 2.6
Let X ∈ X be a random variable. An information measure defined as a (gener-
alized) mean κ of Hartley function of X is either Shannon or Rényi if and only
if
1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5
and,
2. If X1 , X2 ∈ X are two random variables which are independent, then
κ(X1 + X2 ) = κ(X1 ) + κ(X2 ) .
Further, if κ satisfies κ(Y ) + κ(−Y ) = 0 for any Y ∈ X then κ is necessarily
Shannon entropy.
The proof of above theorem is straight forward by using Theorem (2.3); for details
see (Rényi, 1960).
Now we give the following characterization theorem for nonextensive entropies.
44
T HEOREM 2.7
Let X ∈ X be a random variable. An information measure defined as a (general-
ized) mean κ of q-Hartley function of X is Tsallis entropy if and only if
1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5
and,
2. If X1 , X2 ∈ X are two random variables which are independent, then
κ(X1 ⊕q X2 ) = κ(X1 ) ⊕q κ(X2 ) .
The above theorem is a direct consequence of Theorems 2.4 and 2.5. This characterization of Tsallis entropy only replaces the additivity constraint in the characterization of Shannon entropy given by Rényi (1960) with pseudo-additivity, which further
does not make use of the postulate κ(X) + κ(−X) = 0. (This postulate is needed
to distinguish Shannon entropy from Rényi entropy). This is possible because Tsallis
entropy is unique by means of KN-averages and under pseudo-additivity.
From the relation between Rényi and Tsallis information measures (2.54), possibly, generalized averages play a role – though not very well understood till now –
in describing the operational significance of Tsallis entropy. Here, one should mention the work of Czachor and Naudts (2002), who studied the KN-average based MEprescriptions of generalized information measures (constraints with respect to which
one would maximize entropy are defined in terms of quasilinear means). In this regard,
results presented in this chapter have mathematical significance in the sense that they
further the relation between nonextensive entropic measures and generalized averages.
45
3
Measures and Entropies:
Gelfand-Yaglom-Perez Theorem
Abstract
R
The measure-theoretic KL-entropy defined as X ln dP
dR dP , where P and R are
probability measures on a measurable space (X, M), plays a basic role in the definitions of classical information measures. A fundamental theorem in this respect is
the Gelfand-Yaglom-Perez Theorem (Pinsker, 1960b, Theorem 2.4.2) which equips
measure-theoretic KL-entropy with a fundamental definition and can be stated as,
Z
m
ln
X
X
dP
P (Ek )
dP = sup
P (Ek ) ln
,
dR
R(Ek )
k=1
where supremum is taken over all the measurable partitions {Ek }m
k=1 . In this chapter, we state and prove the GYP-theorem for Rényi relative-entropy of order greater
than one. Consequently, the result can be easily extended to Tsallis relative-entropy.
Prior to this, we develop measure-theoretic definitions of generalized information
measures and discuss the maximum entropy prescriptions. Some of the results presented in this chapter can also be found in (Dukkipati, Bhatnagar, & Murty, 2006b,
2006a).
Shannon’s measure of information was developed essentially for the case when the
random variable takes a finite number of values. However in the literature, one often
encounters an extension of Shannon entropy in the discrete case (2.1) to the case of a
one-dimensional random variable with density function p in the form (e.g., Shannon
& Weaver, 1949; Ash, 1965)
S(p) = −
Z
+∞
p(x) ln p(x) dx .
−∞
This entropy in the continuous case as a pure-mathematical formula (assuming convergence of the integral and absolute continuity of the density p with respect to Lebesgue
measure) resembles Shannon entropy in the discrete case, but cannot be used as a
measure of information for the following reasons. First, it is not a natural extension of
Shannon entropy in the discrete case, since it is not the limit of the sequence of finite
discrete entropies corresponding to pmfs which approximate the pdf p. Second, it is
not strictly positive.
46
Inspite of these short comings, one can still use the continuous entropy functional
in conjunction with the principle of maximum entropy where one wants to find a probability density function that has greater uncertainty than any other distribution satisfying
a set of given constraints. Thus, one is interested in the use of continuous measure as
a measure of relative and not absolute uncertainty. This is where one can relate maximization of Shannon entropy to the minimization of Kullback-Leibler relative-entropy
cf. (Kapur & Kesavan, 1997, pp. 55). On the other hand, it is well known that the
continuous version of KL-entropy defined for two probability density functions p and
r,
I(pkr) =
Z
+∞
p(x) ln
−∞
p(x)
dx ,
r(x)
is indeed a natural generalization of the same in the discrete case.
Indeed, during the early stages of development of information theory, the important
paper by Gelfand, Kolmogorov, and Yaglom (1956) called attention to the case of
defining entropy functional on an arbitrary measure space (X, M, µ). In this case,
Shannon entropy of a probability density function p : X → R + can be written as,
Z
S(p) = −
p(x) ln p(x) dµ(x) .
X
One can see from the above definition that the concept of “the entropy of a pdf” is a
misnomer as there is always another measure µ in the background. In the discrete case
considered by Shannon, µ is the cardinality measure 1 (Shannon & Weaver, 1949, pp.
19); in the continuous case considered by both Shannon and Wiener, µ is the Lebesgue
measure cf. (Shannon & Weaver, 1949, pp. 54) and (Wiener, 1948, pp. 61, 62). All
entropies are defined with respect to some measure µ, as Shannon and Wiener both
emphasized in (Shannon & Weaver, 1949, pp.57, 58) and (Wiener, 1948, pp.61, 62)
respectively.
This case was studied independently by Kallianpur (1960) and Pinsker (1960b),
and perhaps others were guided by the earlier work of Kullback and Leibler (1951),
where one would define entropy in terms of Kullback-Leibler relative-entropy. In
this respect, the Gelfand-Yaglom-Perez theorem (GYP-theorem) (Gelfand & Yaglom,
1959; Perez, 1959; Dobrushin, 1959) plays an important role as it equips measuretheoretic KL-entropy with a fundamental definition. The main contribution of this
chapter is to prove GYP-theorem for Rényi relative-entropy of order α > 1, which can
be extended to Tsallis relative-entropy.
1
Counting or cardinality measure µ on a measurable space (X, = 2X , is defined as µ(E) = #E, ∀E ∈ .
47
), where X is a finite set and
Before proving GYP-theorem for Rényi relative-entropy, we study the measuretheoretic definitions of generalized information measures in detail, and discuss the
corresponding ME-prescriptions. We show that as in the case of relative-entropy,
the measure-theoretic definitions of generalized relative-entropies, Rényi and Tsallis,
are natural extensions of their respective discrete definition. We also show that MEprescriptions of measure-theoretic Tsallis entropy are consistent with that of discrete
case, which is true for measure-theoretic Shannon-entropy.
We review the measure-theoretic formalisms for classical information measures in
§ 3.1 and extend these definitions to generalized information measures in § 3.2. In
§ 3.3 we present the ME-prescription for Shannon entropy followed by prescriptions
for Tsallis entropy in § 3.4. We revisit measure-theoretic definitions of generalized
entropy functionals in § 3.5 and present some results. Finally, Gelfand-Yaglom-Perez
theorem in the general case is presented in § 3.6.
3.1 Measure Theoretic Definitions of Classical Information Measures
In this section, we study the non-discrete definitions of entropy and KL-entropy and
present the formal definitions on the measure spaces. Rigorous studies of the Shannon
and KL entropy functionals in measure spaces can be found in the papers by Ochs
(1976) and Masani (1992a, 1992b). Basic measure-theoretic aspects of classical information measures can be found in books on information theory by Pinsker (1960b),
Guiaşu (1977) and Gray (1990). For more details on development of mathematical
information theory one can refer to excellent survey by Kotz (1966). This survey is
perhaps the best available English-language guide to the Eastern European information
theory literature for the period 1956-1966. One can also refer to (Cover et al., 1989)
for a review on Kolmogorov’s contributions to mathematical information theory.
A note on the notation. To avoid proliferation of symbols we use the same notation for the information measures in the discrete and non-discrete cases; the correspondence should be clear from the context. For example, we use S(p) to denote the
entropy of a pdf p in the measure-theoretic setting too. Whenever we have to compare these quantities in different cases we use the symbols appropriately, which will
be specified in the sequel.
3.1.1 Discrete to Continuous
Let p : [a, b] → R+ be a probability density function, where [a, b] ⊂ R. That is, p
48
satisfies
p(x) ≥ 0, ∀x ∈ [a, b] and
Z
b
p(x) dx = 1 .
a
In trying to define entropy in the continuous case, the expression of Shannon entropy
in the discrete case (2.1) was automatically extended to continuous case by replacing
the sum in the discrete case with the corresponding integral. We obtain, in this way,
Boltzmann’s H-function (also known as differential entropy in information theory),
S(p) = −
Z
b
p(x) ln p(x) dx .
(3.1)
a
The “continuous entropy” given by (3.1) is not a natural extension of definition in
discrete case in the sense that, it is not the limit of the finite discrete entropies corresponding to a sequence of finer partitions of the interval [a, b] whose norms tend
to zero. We can show this by a counter example. I Consider a uniform probability
distribution on the interval [a, b], having the probability density function
p(x) =
1
,
b−a
x ∈ [a, b] .
The continuous entropy (3.1), in this case will be
S(p) = ln(b − a) .
On the other hand, let us consider a finite partition of the interval [a, b] which is composed of n equal subintervals, and let us attach to this partition the finite discrete
uniform probability distribution whose corresponding entropy will be, of course,
Sn (p) = ln n .
Obviously, if n tends to infinity, the discrete entropy S n (p) will tend to infinity too,
and not to ln(b − a); therefore S(p) is not the limit of S n (p), when n tends to infinity.
J Further, one can observe that ln(b − a) is negative when b − a < 1.
Thus, strictly speaking, continuous entropy (3.1) cannot represent a measure of
uncertainty since uncertainty should in general be positive. We are able to prove the
“nice” properties only for the discrete entropy, therefore, it qualifies as a “good” measure of information (or uncertainty) supplied by a random experiment 2 . We cannot
2
One importent property that Shannon entropy exhibits in the continuous case is the entropy power
inequality, which can be stated as follows. Let X and Y are continuous independent random variables
with entropies S(X) and S(Y ) then we have e2S(X+Y ) ≥ e2S(X) + e2S(Y ) with equality if and only if
X and Y are Gaussian variables or one of them is determenistic. The entropy power inequality is derived
by Shannon (1948). Only few and partial versions of it have been proved in the discrete case.
49
extend the so called nice properties to the “continuous entropy” because it is not the
limit of a suitably defined sequence of discrete entropies.
Also, in physical applications, the coordinate x in (3.1) represents an abscissa, a distance from a fixed reference point. This distance x has the dimensions of length. Since the density function p(x) specifies the probability of an event of the type [c, d) ⊂ [a, b] as ∫_c^d p(x) dx, and probabilities are dimensionless, one has to assign the dimensions (length)^{−1} to p(x). Now for 0 ≤ z < 1, one has the series expansion
− ln(1 − z) = z + z²/2 + z³/3 + · · · .   (3.2)
It is thus necessary that the argument of the logarithmic function in (3.1) be dimensionless. Hence the formula (3.1) is seen to be dimensionally incorrect, since the argument of the logarithm on its right-hand side has the dimensions of a probability density (Smith, 2001). Although Shannon (1948) used the formula (3.1), he did note its lack of invariance with respect to changes in the coordinate system.
In the context of the maximum entropy principle, Jaynes (1968) addressed this problem and suggested the formula
S′(p) = − ∫_a^b p(x) ln (p(x)/m(x)) dx ,   (3.3)
in the place of (3.1), where m(x) is a prior function. Note that when m(x) is also a probability density function, (3.3) is nothing but the negative of the relative-entropy of p with respect to m. However, if we choose m(x) = c, a constant (e.g., Zellner & Highfield, 1988), we get
S′(p) = S(p) + ln c ,
where S(p) refers to the continuous entropy (3.1). Thus, maximization of S′(p) is equivalent to maximization of S(p). Further discussion on the estimation of probability density functions by the maximum entropy method can be found in (Lazo & Rathie, 1978; Zellner & Highfield, 1988; Ryu, 1993).
Prior to that, Kullback and Leibler (1951) too suggested that in the measure-theoretic definition of entropy, instead of examining the entropy corresponding only to the given measure, we have to compare the entropy inside a whole class of measures.
3.1.2 Classical Information Measures
Let (X, M, µ) be a measure space, where µ need not be a probability measure unless otherwise specified. The symbols P, R will denote probability measures on the measurable space (X, M), and p, r denote M-measurable functions on X. An M-measurable function p : X → R+ is said to be a probability density function (pdf) if ∫_X p(x) dµ(x) = 1, or ∫_X p dµ = 1 (henceforth, the argument x will be omitted in the integrals if this does not cause ambiguity).
In this general setting, Shannon entropy S(p) of pdf p is defined as follows (Athreya,
1994).
DEFINITION 3.1
Let (X, M, µ) be a measure space and the M-measurable function p : X → R+ be a pdf. Then, the Shannon entropy of p is defined as
S(p) = − ∫_X p ln p dµ ,   (3.4)
provided the integral on the right exists.
The entropy functional S(p) defined in (3.4) can be referred to as the entropy of the probability measure P that is induced by p, which is defined according to
P(E) = ∫_E p(x) dµ(x) , ∀E ∈ M .   (3.5)
This reference is consistent³ because the probability measure P can be identified µ-a.e. by the pdf p.
Further, the definition of the probability measure P in (3.5) allows us to write the entropy functional (3.4) as
S(p) = − ∫_X (dP/dµ) ln (dP/dµ) dµ ,   (3.6)
since (3.5) implies⁴ P ≪ µ, and the pdf p is the Radon-Nikodym derivative of P with respect to µ.
Now we proceed to the definition of the Kullback-Leibler relative-entropy or KL-entropy for probability measures.
³ Let p and r be two pdfs and let P and R be the corresponding induced measures on the measurable space (X, M) such that P and R are identical, i.e., ∫_E p dµ = ∫_E r dµ, ∀E ∈ M. Then we have p = r, µ-a.e., and hence −∫_X p ln p dµ = −∫_X r ln r dµ.
⁴ If a nonnegative measurable function f induces a measure ν on the measurable space (X, M) with respect to a measure µ, defined as ν(E) = ∫_E f dµ, ∀E ∈ M, then ν ≪ µ. The converse of this result is given by the Radon-Nikodym theorem (Kantorovitz, 2003, pp. 36, Theorem 1.40(b)).
DEFINITION 3.2
Let (X, M) be a measurable space and let P and R be two probability measures on (X, M). The Kullback-Leibler relative-entropy or KL-entropy of P relative to R is defined as
I(P‖R) = ∫_X ln (dP/dR) dP   if P ≪ R ,   and I(P‖R) = +∞ otherwise.   (3.7)
The divergence inequality, I(P‖R) ≥ 0 with I(P‖R) = 0 if and only if P = R, can be shown in this case too. The KL-entropy (3.7) can also be written as
I(P‖R) = ∫_X (dP/dR) ln (dP/dR) dR .   (3.8)
Let the σ-finite measure µ on (X, M) be such that P ≪ R ≪ µ. Since µ is σ-finite, by the Radon-Nikodym theorem there exist non-negative M-measurable functions p : X → R+ and r : X → R+, unique µ-a.e., such that
P(E) = ∫_E p dµ , ∀E ∈ M ,   (3.9a)
and
R(E) = ∫_E r dµ , ∀E ∈ M .   (3.9b)
The pdfs p and r in (3.9a) and (3.9b) (they are indeed pdfs) are the Radon-Nikodym derivatives of the probability measures P and R with respect to µ, respectively, i.e., p = dP/dµ and r = dR/dµ. Now one can define the relative-entropy of a pdf p with respect to r as follows⁵.
DEFINITION 3.3
Let (X, M, µ) be a measure space and let the M-measurable functions p, r : X → R+ be two pdfs. The KL-entropy of p relative to r is defined as
I(p‖r) = ∫_X p(x) ln (p(x)/r(x)) dµ(x) ,   (3.10)
provided the integral on the right exists.
As we have mentioned earlier, the KL-entropy (3.10) exists if the two densities are absolutely continuous with respect to one another. On the real line, the same definition can be written with respect to the Lebesgue measure as
I(p‖r) = ∫ p(x) ln (p(x)/r(x)) dx ,
which exists if the densities p(x) and r(x) share the same support. Here, and in the sequel, we use the conventions
ln 0 = −∞ ,   ln (a/0) = +∞ for any a ∈ R ,   0 · (±∞) = 0 .   (3.11)
⁵ This follows from the chain rule for the Radon-Nikodym derivative: dP/dR =_{a.e.} (dP/dµ)(dR/dµ)^{−1}.
Now we turn to the definition of the entropy functional on a measure space. The entropy functional in (3.6) is defined for a probability measure that is induced by a pdf. By the Radon-Nikodym theorem, one can define Shannon entropy for any arbitrary µ-continuous probability measure as follows.
DEFINITION 3.4
Let (X, M, µ) be a σ-finite measure space. The entropy of any µ-continuous probability measure P (P ≪ µ) is defined as
S(P) = − ∫_X ln (dP/dµ) dP .   (3.12)
The entropy functional (3.12) is known as the Baron-Jauch entropy or generalized Boltzmann-Gibbs-Shannon entropy (Wehrl, 1991). Properties of the entropy of a probability measure in Definition 3.4 are studied in detail by Ochs (1976). In the literature, one can find notation of the form S(P|µ) for the entropy functional in (3.12), viz., the entropy of a probability measure, to stress the role of the measure µ (e.g., Ochs, 1976; Athreya, 1994). Since all the information measures we define are with respect to the measure µ on (X, M), we omit µ in the entropy functional notation.
By assuming that µ is a probability measure in Definition 3.4, one can relate Shannon entropy with Kullback-Leibler entropy as
S(P) = −I(P‖µ) .   (3.13)
Note that when µ is not a probability measure, the divergence inequality I(P‖µ) ≥ 0 need not be satisfied.
A note on the σ-finiteness of measure µ in the definition of entropy functional. In
the definition of entropy functional we assumed that µ is a σ-finite measure. This condition was used by Ochs (1976), Csiszár (1969) and Rosenblatt-Roth (1964) to tailor
the measure-theoretic definitions. For all practical purposes and for most applications,
this assumption is satisfied (see (Ochs, 1976) for a discussion on the physical interpretation of measurable space (X, M) with σ-finite measure µ for an entropy measure
of the form (3.12), and of the relaxation of the σ-finiteness condition). More general definitions of entropy functionals, obtained by relaxing the σ-finiteness condition, are studied by Masani (1992a, 1992b).
3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy
First, let us consider the discrete case of (X, M, µ), where X = {x_1, . . . , x_n} and M = 2^X is the power set of X. Let P and µ be any probability measures on (X, M). Then µ and P can be specified as follows:
µ : µ_k = µ({x_k}) ≥ 0, k = 1, . . . , n,   ∑_{k=1}^{n} µ_k = 1 ,   (3.14a)
and
P : P_k = P({x_k}) ≥ 0, k = 1, . . . , n,   ∑_{k=1}^{n} P_k = 1 .   (3.14b)
The probability measure P is absolutely continuous with respect to the probability measure µ if, whenever µ_k = 0 for some k ∈ {1, . . . , n}, then P_k = 0 as well. The corresponding Radon-Nikodym derivative of P with respect to µ is given by
(dP/dµ)(x_k) = P_k/µ_k , k = 1, . . . , n .
The measure-theoretic entropy S(P) (3.12), in this case, can be written as
S(P) = − ∑_{k=1}^{n} P_k ln (P_k/µ_k) = ∑_{k=1}^{n} P_k ln µ_k − ∑_{k=1}^{n} P_k ln P_k .
If we take the reference probability measure µ as the uniform probability distribution on the set X, i.e., µ_k = 1/n, k = 1, . . . , n, we obtain
S(P) = S_n(P) − ln n ,   (3.15)
where S_n(P) denotes the Shannon entropy (2.1) of the pmf P = (P_1, . . . , P_n) and S(P) denotes the measure-theoretic entropy (3.12) reduced to the discrete case, with the probability measures µ and P specified as in (3.14a) and (3.14b), respectively.
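A quick numerical check of (3.15) is possible; the following sketch (an illustration of ours, with an arbitrary pmf) compares the discrete Shannon entropy with the measure-theoretic entropy computed relative to the uniform reference measure:

```python
import numpy as np

P = np.array([0.5, 0.25, 0.125, 0.125])    # an arbitrary pmf
n = len(P)
mu = np.full(n, 1.0 / n)                    # uniform reference probability measure

S_n = -np.sum(P * np.log(P))                # discrete Shannon entropy S_n(P)
S_mu = -np.sum(P * np.log(P / mu))          # measure-theoretic entropy S(P) = -I(P||mu)

print(S_mu, S_n - np.log(n))                # the two agree, as in (3.15)
```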
Now, let us consider the continuous case of (X, M, µ), where X = [a, b] ⊂ R and M is the σ-algebra of Lebesgue measurable subsets of [a, b]. In this case µ and P can be specified as follows:
µ : µ(x) ≥ 0, x ∈ [a, b], such that µ(E) = ∫_E µ(x) dx, ∀E ∈ M, and ∫_a^b µ(x) dx = 1 ,   (3.16a)
and
P : P(x) ≥ 0, x ∈ [a, b], such that P(E) = ∫_E P(x) dx, ∀E ∈ M, and ∫_a^b P(x) dx = 1 .   (3.16b)
Note the abuse of notation in the above specification of the probability measures µ and P, where we have used the same symbols for both the measures and their pdfs; this is in order to keep the notation consistent with the discrete case analysis given above. The probability measure P is absolutely continuous with respect to the probability measure µ if µ(x) = 0 on a set of positive Lebesgue measure implies that P(x) = 0 on the same set. The Radon-Nikodym derivative of the probability measure P with respect to the probability measure µ is then
(dP/dµ)(x) = P(x)/µ(x) .
We emphasize that this relation can only be understood with the above (abuse of) notation in mind. The measure-theoretic entropy S(P) in this case can be written as
S(P) = − ∫_a^b P(x) ln (P(x)/µ(x)) dx .
If we take the reference probability measure µ as the uniform distribution, i.e., µ(x) = 1/(b − a), x ∈ [a, b], then we obtain
S(P) = S_[a,b](P) − ln(b − a) ,   (3.17)
where S_[a,b](P) denotes the Shannon entropy (3.1) of the pdf P(x) and S(P) denotes the measure-theoretic entropy (3.12) reduced to the continuous case, with the probability measures µ and P specified as in (3.16a) and (3.16b), respectively.
Hence, one can conclude that the measure-theoretic entropy S(P), defined for a probability measure P on the measure space (X, M, µ), is equal to both the discrete and the continuous Shannon entropy up to an additive constant, when the reference measure µ is chosen as a uniform probability distribution. On the other hand, one can see that the measure-theoretic KL-entropy, in the discrete and continuous cases, reduces to its discrete and continuous definitions, respectively.
Further, from (3.13) and (3.15), we can write Shannon entropy in terms of Kullback-Leibler relative-entropy as
S_n(P) = ln n − I(P‖µ) .   (3.18)
Thus, Shannon entropy appears as being (up to an additive constant) the variation of information when we pass from the initial uniform probability distribution to the new probability distribution given by P_k ≥ 0, ∑_{k=1}^{n} P_k = 1, as any such probability distribution is obviously absolutely continuous with respect to the uniform discrete probability distribution. Similarly, from (3.13) and (3.17), the relation between Shannon entropy and relative-entropy in the continuous case can be obtained, and we can write the Boltzmann H-function in terms of relative-entropy as
S_[a,b](p) = ln(b − a) − I(P‖µ) .   (3.19)
Therefore, the continuous entropy or Boltzmann H-function S(p) may be interpreted
as being (up to an additive constant) the variation of information when we pass from
the initial uniform probability distribution on the interval [a, b] to the new probability
measure defined by the probability distribution function p(x) (any such probability
measure is absolutely continuous with respect to the uniform probability distribution
on the interval [a, b]).
From the above discussion one can see that KL-entropy equips one with a unified interpretation of both discrete and continuous entropy: Shannon entropy in the continuous case, as well as in the discrete case, can be interpreted as the variation of information when we pass from the initial uniform distribution to the corresponding probability measure.
Also, since the measure-theoretic entropy is equal to the discrete and continuous entropies up to an additive constant, the ME-prescriptions of measure-theoretic Shannon entropy are consistent with both the discrete and continuous cases.
3.2 Measure-Theoretic Definitions of Generalized Information Measures
In this section we extend the measure-theoretic definitions to generalized information
measures discussed in Chapter 2. We begin with a brief note on the notation and
assumptions used.
We define all the information measures on the measurable space (X, M). The
default reference measure is µ unless otherwise stated. For simplicity in exposition,
we will not distinguish between functions differing on a µ-null set only; nevertheless,
we can work with equations between M-measurable functions on X if they are stated
as being valid only µ-almost everywhere (µ-a.e or a.e). Further we assume that all
the quantities of interest exist and also assume, implicitly, the σ-finiteness of µ and
µ-continuity of probability measures whenever required. Since these assumptions repeatedly occur in various definitions and formulations, these will not be mentioned
in the sequel. With these assumptions we do not distinguish between an information
measure of a pdf p and that of the corresponding probability measure P. Hence, when we give definitions of information measures for pdfs, we also use the corresponding definitions for probability measures, wherever convenient or required, with the understanding that P(E) = ∫_E p dµ, and the converse holding as a result of the Radon-Nikodym theorem, with p = dP/dµ. In both cases we have P ≪ µ.
With these notations we move on to the measure-theoretic definitions of generalized information measures. First we consider the Rényi generalizations. The measure-theoretic definition of Rényi entropy is as follows.
DEFINITION 3.5
The Rényi entropy of a pdf p : X → R+ on a measure space (X, M, µ) is defined as
S_α(p) = (1/(1 − α)) ln ∫_X p(x)^α dµ(x) ,   (3.20)
provided the integral on the right exists and α ∈ R, α > 0.
The same can also be defined for any µ-continuous probability measure P as
S_α(P) = (1/(1 − α)) ln ∫_X (dP/dµ)^{α−1} dP .   (3.21)
On the other hand, Rényi relative-entropy can be defined as follows.
DEFINITION 3.6
Let p, r : X → R+ be two pdfs on a measure space (X, M, µ). The Rényi relative-entropy of p relative to r is defined as
I_α(p‖r) = (1/(α − 1)) ln ∫_X p(x)^α / r(x)^{α−1} dµ(x) ,   (3.22)
provided the integral on the right exists and α ∈ R, α > 0.
The same can be written in terms of probability measures as
I_α(P‖R) = (1/(α − 1)) ln ∫_X (dP/dR)^{α−1} dP = (1/(α − 1)) ln ∫_X (dP/dR)^α dR ,   (3.23)
whenever P ≪ R; I_α(P‖R) = +∞ otherwise. Further, if we assume that µ in (3.21) is a probability measure, then
S_α(P) = −I_α(P‖µ) .   (3.24)
The Tsallis entropy in the measure-theoretic setting can be defined as follows.
DEFINITION 3.7
The Tsallis entropy of a pdf p on (X, M, µ) is defined as
S_q(p) = ∫_X p(x) ln_q (1/p(x)) dµ(x) = (1 − ∫_X p(x)^q dµ(x)) / (q − 1) ,   (3.25)
provided the integral on the right exists and q ∈ R, q > 0.
The q-logarithm ln_q is defined as in (2.41). The same can be defined for a µ-continuous probability measure P, and can be written as
S_q(P) = ∫_X ln_q (dP/dµ)^{−1} dP .   (3.26)
The definition of Tsallis relative-entropy is given below.
DEFINITION 3.8
Let (X, M, µ) be a measure space and let p, r : X → R+ be two probability density functions. The Tsallis relative-entropy of p relative to r is defined as
I_q(p‖r) = − ∫_X p(x) ln_q (r(x)/p(x)) dµ(x) = (∫_X p(x)^q / r(x)^{q−1} dµ(x) − 1) / (q − 1) ,   (3.27)
provided the integral on the right exists and q ∈ R, q > 0.
The same can be written for two probability measures P and R as
I_q(P‖R) = − ∫_X ln_q (dP/dR)^{−1} dP ,   (3.28)
whenever P ≪ R; I_q(P‖R) = +∞ otherwise. If µ in (3.26) is a probability measure, then
S_q(P) = −I_q(P‖µ) .   (3.29)
We shall revisit these measure-theoretic definitions in § 3.5.
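For reference, the discrete forms of the measures defined above can be summarized in code. The following is a minimal sketch of ours (with counting measure as µ and the convention α, q ≠ 1); the pmfs p and r are arbitrary illustrative choices:

```python
import numpy as np

def shannon(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi(p, alpha):
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)          # cf. (3.20)

def tsallis(p, q):
    return (1.0 - np.sum(p ** q)) / (q - 1.0)                   # cf. (3.25)

def kl(p, r):
    return np.sum(p * np.log(p / r))                            # cf. (3.10)

def renyi_rel(p, r, alpha):
    return np.log(np.sum(p ** alpha / r ** (alpha - 1.0))) / (alpha - 1.0)   # cf. (3.22)

def tsallis_rel(p, r, q):
    return (np.sum(p ** q / r ** (q - 1.0)) - 1.0) / (q - 1.0)  # cf. (3.27)

p = np.array([0.6, 0.3, 0.1])
r = np.array([1/3, 1/3, 1/3])
print(shannon(p), renyi(p, 2.0), tsallis(p, 2.0))
print(kl(p, r), renyi_rel(p, r, 2.0), tsallis_rel(p, r, 2.0))
```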
3.3 Maximum Entropy and Canonical Distributions
For all the ME-prescriptions of classical information measures we consider the set of constraints of the form
∫_X u_m dP = ∫_X u_m(x) p(x) dµ(x) = ⟨u_m⟩ , m = 1, . . . , M ,   (3.30)
with respect to M-measurable functions u_m : X → R, m = 1, . . . , M, whose expectation values ⟨u_m⟩, m = 1, . . . , M, are (assumed to be) a priori known, along with the normalizing constraint ∫_X dP = 1. (From now on we assume that any set of constraints on probability distributions implicitly includes this constraint, which will therefore not be mentioned in the sequel.)
To maximize the entropy (3.4) with respect to the constraints (3.30), the solution is calculated via the Lagrangian
L(x, λ, β) = − ∫_X ln (dP/dµ)(x) dP(x) − λ (∫_X dP(x) − 1) − ∑_{m=1}^{M} β_m (∫_X u_m(x) dP(x) − ⟨u_m⟩) ,   (3.31)
where λ and β_m, m = 1, . . . , M, are Lagrange parameters (we use the notation β = (β_1, . . . , β_M)). The solution is given by
ln (dP/dµ)(x) + λ + ∑_{m=1}^{M} β_m u_m(x) = 0 .
The solution can be calculated as
dP(x) = exp(− ln Z(β) − ∑_{m=1}^{M} β_m u_m(x)) dµ(x)   (3.32)
or
p(x) = (dP/dµ)(x) = e^{−∑_{m=1}^{M} β_m u_m(x)} / Z(β) ,   (3.33)
where the partition function Z(β) is written as
Z(β) = ∫_X exp(−∑_{m=1}^{M} β_m u_m(x)) dµ(x) .   (3.34)
The Lagrange parameters β_m, m = 1, . . . , M, are specified by the set of constraints (3.30).
The maximum entropy, denoted by S, can be calculated as
S = ln Z + ∑_{m=1}^{M} β_m ⟨u_m⟩ .   (3.35)
The Lagrange parameters β_m, m = 1, . . . , M, are calculated by searching for the unique solution (if it exists) of the following system of nonlinear equations:
(∂/∂β_m) ln Z(β) = −⟨u_m⟩ , m = 1, . . . , M .   (3.36)
We also have
∂S/∂⟨u_m⟩ = β_m , m = 1, . . . , M .   (3.37)
Equations (3.36) and (3.37) are referred to as the thermodynamic equations.
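To make the role of (3.36) concrete, the following sketch (our own illustration, not from the thesis) solves a classical maximum entropy problem on the finite set X = {1, . . . , 6} with the single constraint function u(x) = x and the hypothetical prescribed mean ⟨u⟩ = 4.5, using a scalar root-finder for the thermodynamic equation:

```python
import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)          # X = {1,...,6}, mu = counting measure
u = x.astype(float)          # single constraint function u(x) = x
u_mean = 4.5                 # prescribed expectation <u> (hypothetical value)

def log_Z(beta):
    return np.log(np.sum(np.exp(-beta * u)))

def residual(beta):
    # d/dbeta ln Z(beta) + <u>, which should vanish at the solution, cf. (3.36)
    p = np.exp(-beta * u)
    p /= p.sum()
    return -np.sum(u * p) + u_mean

beta = brentq(residual, -5.0, 5.0)            # bracket chosen by inspection
p = np.exp(-beta * u); p /= p.sum()           # maximum entropy distribution (3.33)
S = log_Z(beta) + beta * u_mean               # maximum entropy, cf. (3.35)
print(beta, p)
print(S, -np.sum(p * np.log(p)))              # S agrees with -sum p ln p
```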
3.4 ME-prescription for Tsallis Entropy
As we mentioned earlier, the great success of Tsallis entropy is attributed to the power-law distributions that result from the ME-prescriptions of Tsallis entropy. But there are subtleties involved in the choice of constraints for the ME-prescriptions of these entropy functionals. The issue of what kind of constraints one should use in the ME-prescriptions is still a subject of major discussion in the nonextensive formalism (Ferri et al., 2005; Abe & Bagci, 2005; Wada & Scarfone, 2005).
In the nonextensive formalism, maximum entropy distributions are derived with respect to constraints that are different from (3.30), since constraints of the form (3.30) lead to serious mathematical difficulties, for instance unwanted divergences; cf. (Tsallis, 1988). To handle these difficulties, constraints of the form
∫_X u_m(x) p(x)^q dµ(x) = ⟨u_m⟩_q , m = 1, . . . , M   (3.38)
were proposed by Curado and Tsallis (1991). Averages of the form ⟨u_m⟩_q are referred to as q-expectations.
3.4.1 Tsallis Maximum Entropy Distribution
To calculate the maximum Tsallis entropy distribution with respect to the constraints (3.38), the Lagrangian can be written as
L(x, λ, β) = ∫_X ln_q (1/p(x)) dP(x) − λ (∫_X dP(x) − 1) − ∑_{m=1}^{M} β_m (∫_X p(x)^{q−1} u_m(x) dP(x) − ⟨u_m⟩_q) .   (3.39)
The solution is given by
ln_q (1/p(x)) − λ − ∑_{m=1}^{M} β_m u_m(x) p(x)^{q−1} = 0 .   (3.40)
By the definition of the q-logarithm (2.41), (3.40) can be rearranged as
p(x) = [1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} / (λ(1 − q) + 1)^{1/(1−q)} .   (3.41)
The denominator in (3.41) can be calculated using the normalizing constraint ∫_X dP = 1. Finally, the Tsallis maximum entropy distribution can be written as
p(x) = [1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} / Z_q ,   (3.42)
where the partition function is
Z_q = ∫_X [1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} dµ(x) .   (3.43)
The Tsallis maximum entropy distribution (3.42) can be expressed in terms of the q-exponential function (2.42) as
p(x) = e_q^{−∑_{m=1}^{M} β_m u_m(x)} / Z_q .   (3.44)
Note that in order to guarantee that the pdf p in (3.42) is a non-negative real number for any x ∈ X, it is necessary to supplement it with an appropriate prescription for treating negative values of the quantity [1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]. That is, we need a prescription for the value of p(x) when
[1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)] < 0 .   (3.45)
The simplest possible prescription, and the one usually adopted, is to set p(x) = 0 whenever the inequality (3.45) holds (Tsallis, 1988; Curado & Tsallis, 1991). This rule is known as the Tsallis cut-off condition. Simple extensions of the Tsallis cut-off condition are proposed in (Teweldeberhan et al., 2005) by defining an alternate q-exponential function. In this thesis, we consider only the usual Tsallis cut-off condition mentioned above. Note that by expressing the Tsallis maximum entropy distribution (3.42) in terms of the q-exponential function, as in (3.44), we have assumed the Tsallis cut-off condition implicitly. In summary, when we refer to the Tsallis maximum entropy distribution we mean the following:
p(x) = e_q^{−∑_{m=1}^{M} β_m u_m(x)} / Z_q   if [1 − (1 − q) ∑_{m=1}^{M} β_m u_m(x)] > 0 ,   and p(x) = 0 otherwise.   (3.46)
The maximum Tsallis entropy can be calculated as (Curado & Tsallis, 1991)
S_q = ln_q Z_q + ∑_{m=1}^{M} β_m ⟨u_m⟩_q .   (3.47)
The corresponding thermodynamic equations are as follows (Curado & Tsallis, 1991):
(∂/∂β_m) ln_q Z_q = −⟨u_m⟩_q , m = 1, . . . , M ,   (3.48)
∂S_q/∂⟨u_m⟩_q = β_m , m = 1, . . . , M .   (3.49)
It may be interesting to compare these equations with their classical counterparts, (3.36) and (3.37), to see the consistency of the generalizations.
Here we mention that some important mathematical properties of the nonextensive maximum entropy distribution (3.42) for q = 1/2 have been studied and reported by Rebollo-Neira (2001), with applications to data subset selection. One can refer to (Vignat, Hero, & Costa, 2004) for a study of Tsallis maximum entropy distributions in the multivariate case.
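The role of the cut-off condition in (3.46) can be visualized with a small numerical sketch (ours, not from the thesis), using the single constraint function u(x) = x on a grid and hypothetical values of q and β; for q < 1 the bracket in (3.45) changes sign and the density is set to zero beyond that point, while for q > 1 one obtains power-law tails instead:

```python
import numpy as np

def q_exp(z, q):
    # q-exponential e_q^z with the Tsallis cut-off: 0 wherever 1 + (1-q) z <= 0
    base = 1.0 + (1.0 - q) * z
    out = np.zeros_like(base)
    mask = base > 0.0
    out[mask] = base[mask] ** (1.0 / (1.0 - q))
    return out

q, beta = 0.5, 0.8                    # hypothetical entropic index and Lagrange parameter
x = np.linspace(0.0, 10.0, 2001)      # grid on [0, 10], with u(x) = x
dx = x[1] - x[0]

unnorm = q_exp(-beta * x, q)          # numerator of (3.46); vanishes for x >= 1/((1-q)*beta)
Zq = np.sum(unnorm) * dx              # partition function (3.43), by simple quadrature
p = unnorm / Zq

print(np.sum(p) * dx)                 # ~ 1 (normalization)
print(x[unnorm == 0.0][0])            # first cut-off point, here 1/((1-q)*beta) = 2.5
```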
3.4.2 The Case of Normalized q-expectation values
Constraints of the form (3.38) had been used for some time in the nonextensive ME-prescriptions, but because of problems in justifying them on physical grounds (for example, the q-expectation of a constant need not be a constant, and hence these are not expectations in the true sense), constraints of the following form were proposed in (Tsallis, Mendes, & Plastino, 1998):
∫_X u_m(x) p(x)^q dµ(x) / ∫_X p(x)^q dµ(x) = ⟨⟨u_m⟩⟩_q , m = 1, . . . , M .   (3.50)
Here ⟨⟨u_m⟩⟩_q can be considered as the expectation of u_m with respect to the modified probability measure P_(q) (it is indeed a probability measure) defined as
P_(q)(E) = (∫_X p(x)^q dµ(x))^{−1} ∫_E p(x)^q dµ(x) , ∀E ∈ M .   (3.51)
The modified probability measure P_(q) is known as the escort probability measure (Tsallis et al., 1998).
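The escort measure (3.51) and the normalized q-expectation (3.50) are straightforward to compute for a discrete pmf. The following sketch (our own, with an arbitrary pmf) also illustrates the physical-grounds remark above: unlike the plain q-expectation, the normalized q-expectation of a constant observable equals that constant:

```python
import numpy as np

def escort(p, q):
    w = p ** q
    return w / w.sum()                          # escort probability measure (3.51)

def q_expectation(u, p, q):
    return np.sum(u * p ** q)                   # <u>_q as in (3.38)

def normalized_q_expectation(u, p, q):
    return np.sum(u * escort(p, q))             # <<u>>_q as in (3.50)

p = np.array([0.5, 0.3, 0.2])
q = 2.0
u_const = np.ones_like(p)                       # constant observable u(x) = 1

print(q_expectation(u_const, p, q))             # equals sum p_k^q, not 1 in general
print(normalized_q_expectation(u_const, p, q))  # exactly 1
```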
Now, the variational principle for Tsallis entropy maximization with respect to the constraints (3.50) can be written as
L(x, λ, β) = ∫_X ln_q (1/p(x)) dP(x) − λ (∫_X dP(x) − 1) − ∑_{m=1}^{M} β_m^{(q)} ∫_X p(x)^{q−1} (u_m(x) − ⟨⟨u_m⟩⟩_q) dP(x) ,   (3.52)
where the parameters β_m^{(q)} can be defined in terms of the true Lagrange parameters β_m as
β_m^{(q)} = β_m / ∫_X p(x)^q dµ(x) , m = 1, . . . , M .   (3.53)
The maximum entropy distribution in this case turns out to be
p(x) = [1 − (1 − q) ∑_{m=1}^{M} β_m (u_m(x) − ⟨⟨u_m⟩⟩_q) / ∫_X p(x)^q dµ(x)]^{1/(1−q)} / Z̄_q .   (3.54)
This can be written using q-exponential functions as
p(x) = (1/Z̄_q) exp_q(−∑_{m=1}^{M} β_m (u_m(x) − ⟨⟨u_m⟩⟩_q) / ∫_X p(x)^q dµ(x)) ,   (3.55)
where
Z̄_q = ∫_X exp_q(−∑_{m=1}^{M} β_m (u_m(x) − ⟨⟨u_m⟩⟩_q) / ∫_X p(x)^q dµ(x)) dµ(x) .   (3.56)
The maximum Tsallis entropy S_q in this case satisfies
S_q = ln_q Z̄_q ,   (3.57)
while the corresponding thermodynamic equations are
(∂/∂β_m) ln_q Z′_q = −⟨⟨u_m⟩⟩_q , m = 1, . . . , M ,   (3.58)
∂S_q/∂⟨⟨u_m⟩⟩_q = β_m , m = 1, . . . , M ,   (3.59)
where
ln_q Z′_q = ln_q Z̄_q − ∑_{m=1}^{M} β_m ⟨⟨u_m⟩⟩_q .   (3.60)
3.5 Measure-Theoretic Definitions Revisited
It is well known that, unlike Shannon entropy, Kullback-Leibler relative-entropy in the discrete case can be extended naturally to the measure-theoretic case by a simple limiting process; cf. (Topsøe, 2001, Theorem 5.2). In this section, we show that this fact is true for generalized relative-entropies too. Rényi relative-entropy on the continuous-valued space R and its equivalence with the discrete case is studied by Rényi (1960) and Jizba and Arimitsu (2004b). Here, we present the result in the measure-theoretic case and conclude that the measure-theoretic definitions of both Tsallis and Rényi relative-entropies are equivalent to their respective discrete definitions.
We also present a result pertaining to ME of measure-theoretic Tsallis entropy. We prove that ME of Tsallis entropy in the measure-theoretic case is consistent with the discrete case.
3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies
Here we show that generalized relative-entropies in the discrete case can be naturally
extended to measure-theoretic case, in the sense that measure-theoretic definitions can
be defined as limits of sequences of finite discrete entropies of pmfs which approximate
the pdfs involved. We refer to any such sequence of pmfs as “approximating sequence
of pmfs of a pdf”. To formalize these aspects we need the following lemma.
LEMMA 3.1
Let p be a pdf defined on a measure space (X, M, µ). Then there exists a sequence of simple functions {f_n} (the approximating sequence of simple functions of p) such that lim_{n→∞} f_n = p, and each f_n can be written as
f_n(x) = (1/µ(E_{n,k})) ∫_{E_{n,k}} p dµ , ∀x ∈ E_{n,k} , k = 1, . . . , m(n) ,   (3.61)
where {E_{n,k}}_{k=1}^{m(n)} is the measurable partition of X corresponding to f_n (the notation m(n) indicates that m varies with n). Further, each f_n satisfies
∫_X f_n dµ = 1 .   (3.62)
Proof
Define a sequence of simple functions {f_n} as
f_n(x) = (1/µ(p^{−1}([k/2^n, (k+1)/2^n)))) ∫_{p^{−1}([k/2^n, (k+1)/2^n))} p dµ ,   if k/2^n ≤ p(x) < (k+1)/2^n , k = 0, 1, . . . , n2^n − 1 ,
f_n(x) = (1/µ(p^{−1}([n, ∞)))) ∫_{p^{−1}([n, ∞))} p dµ ,   if n ≤ p(x) .   (3.63)
Each f_n is indeed a simple function and can be written as
f_n = ∑_{k=0}^{n2^n − 1} ((1/µ(E_{n,k})) ∫_{E_{n,k}} p dµ) χ_{E_{n,k}} + ((1/µ(F_n)) ∫_{F_n} p dµ) χ_{F_n} ,   (3.64)
where E_{n,k} = p^{−1}([k/2^n, (k+1)/2^n)), k = 0, . . . , n2^n − 1, and F_n = p^{−1}([n, ∞)). Also, for any measurable set E ∈ M, χ_E : X → {0, 1} denotes its indicator or characteristic function. Note that {E_{n,0}, . . . , E_{n,n2^n−1}, F_n} is indeed a measurable partition of X, for any n. Since ∫_E p dµ < ∞ for any E ∈ M, we have ∫_{E_{n,k}} p dµ = 0 whenever µ(E_{n,k}) = 0, for k = 0, . . . , n2^n − 1. Similarly, ∫_{F_n} p dµ = 0 whenever µ(F_n) = 0.
Now we show that lim_{n→∞} f_n = p, point-wise. Since p is a pdf, we have p(x) < ∞. Then there exists n ∈ Z+ such that p(x) ≤ n. Also there exists k ∈ Z+, 0 ≤ k ≤ n2^n − 1, such that k/2^n ≤ p(x) < (k+1)/2^n and k/2^n ≤ f_n(x) < (k+1)/2^n. This implies 0 ≤ |p − f_n| < 1/2^n, as required.
(Note that this lemma holds true even if p is not a pdf. This follows from: if p(x) = ∞ for some x ∈ X, then x ∈ F_n for all n, and therefore f_n(x) ≥ n for all n; hence lim_{n→∞} f_n(x) = ∞ = p(x).)
Finally, we have
∫_X f_n dµ = ∑_k [(1/µ(E_{n,k})) ∫_{E_{n,k}} p dµ] µ(E_{n,k}) + [(1/µ(F_n)) ∫_{F_n} p dµ] µ(F_n) = ∑_k ∫_{E_{n,k}} p dµ + ∫_{F_n} p dµ = ∫_X p dµ = 1 .
The above construction of a sequence of simple functions which approximate a measurable function is similar to the approximation theorem (e.g., Kantorovitz, 2003, pp. 6, Theorem 1.8(b)) in the theory of integration. But the approximation in Lemma 3.1 can be seen as a mean-value approximation, whereas in the above case it is a lower approximation. Further, unlike in the case of lower approximation, the sequence of simple functions which approximates p in Lemma 3.1 is neither monotone nor satisfies f_n ≤ p.
Now one can define a sequence of pmfs {p̃_n} corresponding to the sequence of simple functions constructed in Lemma 3.1, denoted by p̃_n = (p̃_{n,1}, . . . , p̃_{n,m(n)}), as
p̃_{n,k} = µ(E_{n,k}) f_n χ_{E_{n,k}}(x) = ∫_{E_{n,k}} p dµ , k = 1, . . . , m(n) ,   (3.65)
for any n. Note that in (3.65) the function f_n χ_{E_{n,k}} is a constant function by the construction (Lemma 3.1) of f_n. We have
∑_{k=1}^{m(n)} p̃_{n,k} = ∑_{k=1}^{m(n)} ∫_{E_{n,k}} p dµ = ∫_X p dµ = 1 ,   (3.66)
and hence p̃_n is indeed a pmf. We call {p̃_n} the approximating sequence of pmfs of the pdf p.
Now we present our main theorem, where we assume that p and r are bounded. The assumption of boundedness of p and r simplifies the proof; however, the result can be extended to the unbounded case. (See (Rényi, 1959) for an analysis of Shannon entropy and relative-entropy on R in the unbounded case.)
THEOREM 3.1
Let p and r be pdfs which are bounded and defined on a measure space (X, M, µ). Let p̃_n and r̃_n be the approximating sequences of pmfs of p and r, respectively. Let I_α denote the Rényi relative-entropy as in (3.22) and I_q denote the Tsallis relative-entropy as in (3.27). Then
lim_{n→∞} I_α(p̃_n‖r̃_n) = I_α(p‖r)   (3.67)
and
lim_{n→∞} I_q(p̃_n‖r̃_n) = I_q(p‖r) ,   (3.68)
respectively.
Proof
It is enough to prove the result for either Tsallis or Rényi, since each of them is a monotone and continuous function of the other. Hence we write down the proof for the case of Rényi, and we use the entropic index α in the proof.
Corresponding to the pdf p, let {f_n} be the approximating sequence of simple functions such that lim_{n→∞} f_n = p as in Lemma 3.1. Let {g_n} be the approximating sequence of simple functions for r such that lim_{n→∞} g_n = r. Corresponding to the simple functions f_n and g_n there exists a common measurable partition⁶ {E_{n,1}, . . . , E_{n,m(n)}} such that f_n and g_n can be written as
f_n(x) = ∑_{k=1}^{m(n)} a_{n,k} χ_{E_{n,k}}(x) , a_{n,k} ∈ R+ , ∀k = 1, . . . , m(n) ,   (3.69a)
g_n(x) = ∑_{k=1}^{m(n)} b_{n,k} χ_{E_{n,k}}(x) , b_{n,k} ∈ R+ , ∀k = 1, . . . , m(n) ,   (3.69b)
where χ_{E_{n,k}} is the characteristic function of E_{n,k}, for k = 1, . . . , m(n). By (3.69a) and (3.69b), the approximating sequences of pmfs p̃_n = (p̃_{n,1}, . . . , p̃_{n,m(n)}) and r̃_n = (r̃_{n,1}, . . . , r̃_{n,m(n)}) can be written as (see (3.65))
p̃_{n,k} = a_{n,k} µ(E_{n,k}) , k = 1, . . . , m(n) ,   (3.70a)
r̃_{n,k} = b_{n,k} µ(E_{n,k}) , k = 1, . . . , m(n) .   (3.70b)
⁶ Let ϕ and φ be two simple functions defined on (X, M). Let {E_1, . . . , E_n} and {F_1, . . . , F_m} be the measurable partitions corresponding to ϕ and φ, respectively. Then the collection defined as {E_i ∩ F_j | i = 1, . . . , n, j = 1, . . . , m} is a common measurable partition for ϕ and φ.
Now the Rényi relative-entropy for p̃_n and r̃_n can be written as
I_α(p̃_n‖r̃_n) = (1/(α − 1)) ln ∑_{k=1}^{m(n)} (a_{n,k}^α / b_{n,k}^{α−1}) µ(E_{n,k}) .   (3.71)
To prove lim_{n→∞} I_α(p̃_n‖r̃_n) = I_α(p‖r) it is enough to show that
lim_{n→∞} (1/(α − 1)) ln ∫_X f_n(x)^α / g_n(x)^{α−1} dµ(x) = (1/(α − 1)) ln ∫_X p(x)^α / r(x)^{α−1} dµ(x) ,   (3.72)
since we have⁷
∫_X f_n(x)^α / g_n(x)^{α−1} dµ(x) = ∑_{k=1}^{m(n)} (a_{n,k}^α / b_{n,k}^{α−1}) µ(E_{n,k}) .   (3.73)
Further, it is enough to prove that
lim_{n→∞} ∫_X h_n(x)^α g_n(x) dµ(x) = ∫_X p(x)^α / r(x)^{α−1} dµ(x) ,   (3.74)
where h_n is defined as h_n(x) = f_n(x)/g_n(x).
⁷ Note that the simple functions (f_n)^α and (g_n)^{α−1} can be written as (f_n)^α(x) = ∑_{k=1}^{m(n)} a_{n,k}^α χ_{E_{n,k}}(x) and (g_n)^{α−1}(x) = ∑_{k=1}^{m(n)} b_{n,k}^{α−1} χ_{E_{n,k}}(x). Further, (f_n^α / g_n^{α−1})(x) = ∑_{k=1}^{m(n)} (a_{n,k}^α / b_{n,k}^{α−1}) χ_{E_{n,k}}(x).
Case 1: 0 < α < 1
In this case, the Lebesgue dominated convergence theorem (Rudin, 1964, pp. 26, Theorem 1.34) gives
lim_{n→∞} ∫_X f_n^α / g_n^{α−1} dµ = ∫_X p^α / r^{α−1} dµ ,   (3.75)
and hence (3.68) follows.
Case 2: α > 1
We have h_n^α g_n → p^α / r^{α−1} a.e. By Fatou's lemma (Rudin, 1964, pp. 23, Theorem 1.28), we obtain
lim inf_{n→∞} ∫_X h_n(x)^α g_n(x) dµ(x) ≥ ∫_X p(x)^α / r(x)^{α−1} dµ(x) .   (3.76)
From the construction of f_n and g_n (Lemma 3.1) we have
h_n(x) g_n(x) = (1/µ(E_{n,i})) ∫_{E_{n,i}} (p(x)/r(x)) r(x) dµ(x) , ∀x ∈ E_{n,i} .   (3.77)
By Jensen's inequality we get
h_n(x)^α g_n(x) ≤ (1/µ(E_{n,i})) ∫_{E_{n,i}} p(x)^α / r(x)^{α−1} dµ(x) , ∀x ∈ E_{n,i} .   (3.78)
By (3.69a) and (3.69b) we can write (3.78) as
(a_{n,i}^α / b_{n,i}^{α−1}) µ(E_{n,i}) ≤ ∫_{E_{n,i}} p(x)^α / r(x)^{α−1} dµ(x) , ∀i = 1, . . . , m(n) .   (3.79)
By summing over i on both sides of (3.79) we get
∑_{i=1}^{m(n)} (a_{n,i}^α / b_{n,i}^{α−1}) µ(E_{n,i}) ≤ ∑_{i=1}^{m(n)} ∫_{E_{n,i}} p(x)^α / r(x)^{α−1} dµ(x) .   (3.80)
Now (3.80) is nothing but
∫_X h_n(x)^α g_n(x) dµ(x) ≤ ∫_X p(x)^α / r(x)^{α−1} dµ(x) , ∀n ,
and hence
sup_{i≥n} ∫_X h_i(x)^α g_i(x) dµ(x) ≤ ∫_X p(x)^α / r(x)^{α−1} dµ(x) , ∀n .
Finally we have
lim sup_{n→∞} ∫_X h_n(x)^α g_n(x) dµ(x) ≤ ∫_X p(x)^α / r(x)^{α−1} dµ(x) .   (3.81)
From (3.76) and (3.81) we get
lim_{n→∞} ∫_X f_n(x)^α / g_n(x)^{α−1} dµ(x) = ∫_X p(x)^α / r(x)^{α−1} dµ(x) ,   (3.82)
and hence (3.68) follows.
3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy
Despite the shortcoming of Shannon entropy that it cannot be extended naturally to the non-discrete case, we have observed that Shannon entropy in a measure-theoretic framework can be used in ME-prescriptions consistently with the discrete case. One can easily see that the generalized information measures of Rényi and Tsallis too cannot be extended naturally to the measure-theoretic case, i.e., the measure-theoretic definitions are not equivalent to their corresponding discrete cases in the sense that they cannot be defined as limits of sequences of finite discrete entropies corresponding to pmfs defined on measurable partitions which approximate the pdf; one can use the same counterexample we discussed in § 3.1.1. In this section, we show that the ME-prescriptions in the measure-theoretic case are consistent with the discrete case.
Proceeding as in the case of measure-theoretic entropy in § 3.1.3, by specifying the probability measures µ and P in the discrete case as in (3.14a) and (3.14b) respectively, the measure-theoretic Tsallis entropy S_q(P) (3.26) can be reduced to
S_q(P) = ∑_{k=1}^{n} P_k ln_q (µ_k/P_k) .   (3.83)
By (2.46) we get
S_q(P) = ∑_{k=1}^{n} P_k^q [ln_q µ_k − ln_q P_k] .   (3.84)
When µ is the uniform distribution, i.e., µ_k = 1/n, ∀k = 1, . . . , n, we get
S_q(P) = S_q^n(P) − n^{q−1} (ln_q n) ∑_{k=1}^{n} P_k^q ,   (3.85)
where S_q^n(P) denotes the Tsallis entropy of the pmf P = (P_1, . . . , P_n) (2.31), and S_q(P) denotes the measure-theoretic Tsallis entropy (3.26) reduced to the discrete case, with the probability measures µ and P specified as in (3.14a) and (3.14b), respectively.
Now we show that the quantity ∑_{k=1}^{n} P_k^q is constant in the maximization of S_q(P) with respect to the set of constraints (3.50).
The claim is that
∫_X p(x)^q dµ(x) = (Z̄_q)^{1−q} ,   (3.86)
which holds for the Tsallis maximum entropy distribution (3.54) in general. This can be shown as follows. From the maximum entropy distribution (3.54), we have
p(x)^{1−q} = [1 − (1 − q) ∑_{m=1}^{M} β_m (u_m(x) − ⟨⟨u_m⟩⟩_q) (∫_X p(x)^q dµ(x))^{−1}] / (Z̄_q)^{1−q} ,
which can be rearranged as
(Z̄_q)^{1−q} p(x) = [1 − (1 − q) ∑_{m=1}^{M} β_m (u_m(x) − ⟨⟨u_m⟩⟩_q) / ∫_X p(x)^q dµ(x)] p(x)^q .
By integrating both sides of the above equation and using (3.50), we get (3.86).
Now, (3.86) can be written in its discrete form as
∑_{k=1}^{n} P_k^q / µ_k^{q−1} = (Z̄_q)^{1−q} .   (3.87)
When µ is the uniform distribution we get
∑_{k=1}^{n} P_k^q = n^{1−q} (Z̄_q)^{1−q} ,   (3.88)
which is a constant.
Hence, by (3.85) and (3.88), one can conclude that, with respect to a particular instance of ME, the measure-theoretic Tsallis entropy S_q(P) defined for a probability measure P on the measure space (X, M, µ) is equal to the discrete Tsallis entropy up to an additive constant, when the reference measure µ is chosen as a uniform probability distribution. Thereby, one can further conclude that, with respect to a particular instance of ME, measure-theoretic Tsallis entropy is consistent with its discrete definition.
The same result can be shown in the case of q-expectation values too.
3.6 Gelfand-Yaglom-Perez Theorem in the General Case
The measure-theoretic definition of KL-entropy plays a basic role in the definitions
of classical information measures. Entropy, mutual information and conditional forms
of entropy can be expressed in terms of KL-entropy and hence properties of their
measure-theoretic analogs will follow from those of measure-theoretic KL-entropy
(Gray, 1990). These measure-theoretic definitions are key to extending the ergodic
theorems of information theory to non-discrete cases. A fundamental theorem in this
respect is the Gelfand-Yaglom-Perez (GYP) Theorem (Pinsker, 1960b, Theorem 2.4.2)
which states that measure-theoretic relative-entropy equals the supremum of relative-entropies over all measurable partitions. In this section we prove the GYP-theorem for
Rényi relative-entropy of order greater than one.
Before we proceed to the definitions and present the notion of relative-entropy on a measurable partition, we recall our notation and introduce new symbols. Let (X, M) be a measurable space and let Π denote the set of all measurable partitions of X. We denote a measurable partition π ∈ Π as π = {E_k}_{k=1}^{m}, i.e., ∪_{k=1}^{m} E_k = X and E_i ∩ E_j = ∅, i ≠ j, i, j = 1, . . . , m. We denote the set of all simple functions on (X, M) by L_0^+, and the set of all nonnegative M-measurable functions by L^+. The set of all µ-integrable functions, where µ is a measure defined on (X, M), is denoted by L^1(µ). The Rényi relative-entropy I_α(P‖R) refers to (3.23), which can be written as
I_α(P‖R) = (1/(α − 1)) ln ∫_X ϕ^α dR ,   (3.89)
where ϕ ∈ L^1(R) is defined as ϕ = dP/dR.
Let P and R be two probability measures on (X, M) such that P ≪ R. The relative-entropy of a partition π ∈ Π for P with respect to R is defined as
I_{P‖R}(π) = ∑_{k=1}^{m} P(E_k) ln (P(E_k)/R(E_k)) .   (3.90)
The GYP-theorem states that
I(P‖R) = sup_{π∈Π} I_{P‖R}(π) ,   (3.91)
where the measure-theoretic KL-entropy I(P‖R) is defined as in Definition 3.2. When P is not absolutely continuous with respect to R, the GYP-theorem assigns I(P‖R) = +∞. The proof of the GYP-theorem given by Dobrushin (1959) can be found in (Pinsker, 1960b, pp. 23, Theorem 2.4.2) or in (Gray, 1990, pp. 92, Lemma 5.2.3).
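The supremum over partitions is approached under refinement, and the same behaviour holds for the Rényi quantity treated below. The following sketch (ours, with Lebesgue measure on [0, 1], a hypothetical density for P and the uniform density for R) evaluates the partition sums of the form appearing in (3.96) on dyadic partitions and compares them with the integral value of (3.22):

```python
import numpy as np

alpha = 3.0                              # entropic index > 1, as in Theorem 3.2
p = lambda x: 2.0 * x                    # density of P w.r.t. Lebesgue measure on [0, 1]
r = lambda x: np.full_like(x, 1.0)       # density of R (uniform) on [0, 1]

def masses(f, n, m=400):
    # P(E_k) (resp. R(E_k)) for the n equal dyadic cells E_k of [0, 1], by midpoint rule
    xs = np.linspace(0.0, 1.0, n * m, endpoint=False) + 0.5 / (n * m)
    return f(xs).reshape(n, m).mean(axis=1) / n

def renyi_partition(Pk, Rk, a):
    return np.log(np.sum(Pk ** a / Rk ** (a - 1.0))) / (a - 1.0)    # cf. (3.93)

xs = np.linspace(0.0, 1.0, 400000, endpoint=False) + 0.5 / 400000
I_alpha = np.log(np.mean(p(xs) ** alpha / r(xs) ** (alpha - 1.0))) / (alpha - 1.0)

for n in (2, 4, 16, 256):
    print(n, renyi_partition(masses(p, n), masses(r, n), alpha))    # approaches I_alpha from below
print("integral value (3.22):", I_alpha)
```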
Before we state and prove the GYP-theorem for Rényi relative-entropy of order
α > 1, we state the following lemma.
LEMMA 3.2
Let P and R be probability measures on the measurable space (X, M) such that P ≪ R, and let ϕ = dP/dR. Then for any E ∈ M and α > 1 we have
P(E)^α / R(E)^{α−1} ≤ ∫_E ϕ^α dR .   (3.92)
Proof
Since P(E) = ∫_E ϕ dR, ∀E ∈ M, by Hölder's inequality we have
∫_E ϕ dR ≤ (∫_E ϕ^α dR)^{1/α} (∫_E dR)^{1−1/α} .
That is,
P(E)^α ≤ R(E)^{α(1−1/α)} ∫_E ϕ^α dR = R(E)^{α−1} ∫_E ϕ^α dR ,
and hence (3.92) follows.
We now present our main result in a special case as follows.
LEMMA 3.3
Let P and R be two probability measures such that P ≪ R, and let ϕ = dP/dR ∈ L_0^+. Then for any 0 < α < ∞ we have
I_α(P‖R) = (1/(α − 1)) ln ∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1} ,   (3.93)
where {E_k}_{k=1}^{m} ∈ Π is the measurable partition corresponding to ϕ.
Proof
The simple function ϕ ∈ L_0^+ can be written as ϕ(x) = ∑_{k=1}^{m} a_k χ_{E_k}(x), ∀x ∈ X, where a_k ∈ R, k = 1, . . . , m. Now we have P(E_k) = ∫_{E_k} ϕ dR = a_k R(E_k), and hence
a_k = P(E_k)/R(E_k) , ∀k = 1, . . . , m .   (3.94)
We also have ϕ^α(x) = ∑_{k=1}^{m} a_k^α χ_{E_k}(x), ∀x ∈ X, and hence
∫_X ϕ^α dR = ∑_{k=1}^{m} a_k^α R(E_k) .   (3.95)
Now, from (3.89), (3.94) and (3.95) one obtains (3.93).
Note that the right-hand side of (3.93) represents the Rényi relative-entropy of the partition {E_k}_{k=1}^{m} ∈ Π. Now we state and prove the GYP-theorem for Rényi relative-entropy.
THEOREM 3.2
Let (X, M) be a measurable space and let Π denote the set of all measurable partitions of X. Let P and R be two probability measures. Then for any α > 1 we have
I_α(P‖R) = sup_{{E_k}_{k=1}^{m} ∈ Π} (1/(α − 1)) ln ∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1}   if P ≪ R ,   and I_α(P‖R) = +∞ otherwise.   (3.96)
Proof
If P is not absolutely continuous with respect to R, there exists E ∈ M such that P(E) > 0 and R(E) = 0. Since {E, X − E} ∈ Π, I_α(P‖R) = +∞.
Now, we assume that P ≪ R. It is clear that it is enough to prove that
∫_X ϕ^α dR = sup_{{E_k}_{k=1}^{m} ∈ Π} ∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1} ,   (3.97)
where ϕ = dP/dR. From Lemma 3.2, for any measurable partition {E_k}_{k=1}^{m} ∈ Π, we have
∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1} ≤ ∑_{k=1}^{m} ∫_{E_k} ϕ^α dR = ∫_X ϕ^α dR ,
and hence
sup_{{E_k}_{k=1}^{m} ∈ Π} ∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1} ≤ ∫_X ϕ^α dR .   (3.98)
Now we shall obtain the reverse inequality to prove (3.97). Thus, we now show that
sup_{{E_k}_{k=1}^{m} ∈ Π} ∑_{k=1}^{m} P(E_k)^α / R(E_k)^{α−1} ≥ ∫_X ϕ^α dR .   (3.99)
Note that corresponding to any ϕ ∈ L^+ there exists a sequence of simple functions {ϕ_n}, ϕ_n ∈ L_0^+, that satisfies
0 ≤ ϕ_1 ≤ ϕ_2 ≤ . . . ≤ ϕ   (3.100)
such that lim_{n→∞} ϕ_n = ϕ (Kantorovitz, 2003, Theorem 1.8(2)). {ϕ_n} induces a sequence of measures {P_n} on (X, M) defined by
P_n(E) = ∫_E ϕ_n(x) dR(x) , ∀E ∈ M .   (3.101)
We have ∫_E ϕ_n dR ≤ ∫_E ϕ dR < ∞, ∀E ∈ M, and hence P_n ≪ R, ∀n. From the Lebesgue bounded convergence theorem, we have
lim_{n→∞} P_n(E) = P(E) , ∀E ∈ M .   (3.102)
Now, ϕ_n^α ∈ L_0^+, ϕ_n^α ≤ ϕ_{n+1}^α ≤ ϕ^α, 1 ≤ n < ∞, and lim_{n→∞} ϕ_n^α = ϕ^α for any α > 0. Hence from the Lebesgue monotone convergence theorem we have
lim_{n→∞} ∫_X ϕ_n^α dR = ∫_X ϕ^α dR .   (3.103)
We now claim that (3.103) implies
∫_X ϕ^α dR = sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } .   (3.104)
This can be verified as follows. Denote φ_n = ϕ_n^α. We have φ_n ∈ L_0^+, 0 ≤ φ_n ≤ ϕ^α, ∀n, φ_n ↑ ϕ^α, and (as shown above)
lim_{n→∞} ∫_X φ_n dR = ∫_X ϕ^α dR .   (3.105)
Now for any φ ∈ L_0^+ such that 0 ≤ φ ≤ ϕ^α we have ∫_X φ dR ≤ ∫_X ϕ^α dR, and hence
sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } ≤ ∫_X ϕ^α dR .   (3.106)
Now we show the reverse inequality of (3.106). If ∫_X ϕ^α dR < +∞, then, from (3.105), given any ε > 0 one can find 0 ≤ n_0 < ∞ such that
∫_X ϕ^α dR < ∫_X φ_{n_0} dR + ε ,
and hence
∫_X ϕ^α dR < sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } + ε .   (3.107)
Since (3.107) is true for any ε > 0, we can write
∫_X ϕ^α dR ≤ sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } .   (3.108)
Now let us verify (3.108) in the case of ∫_X ϕ^α dR = +∞. In this case, for any N > 0 one can choose n_0 such that ∫_X φ_{n_0} dR > N, and hence
∫_X ϕ^α dR > N   (∵ 0 ≤ φ_{n_0} ≤ ϕ^α)   (3.109)
and
sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } > N .   (3.110)
Since (3.109) and (3.110) are true for any N > 0, we have
∫_X ϕ^α dR = sup { ∫_X φ dR | 0 ≤ φ ≤ ϕ^α , φ ∈ L_0^+ } = +∞ ,
and hence (3.108) is verified in the case of ∫_X ϕ^α dR = +∞. Now (3.106) and (3.108) verify the claim that (3.103) implies (3.104). Finally, (3.104) together with Lemma 3.3 proves (3.97) and hence the theorem.
Now from the fact that Rényi and Tsallis relative-entropies ((3.23) and (3.28) respectively) are monotone and continuous functions of each other, the GYP-theorem
presented in the case of Rényi is valid for the Tsallis case too, whenever q > 1.
However, the GYP-theorem is yet to be stated for the case when entropic index
0 < α < 1 ( 0 < q < 1 in the case of Tsallis). Work on this problem is ongoing.
4
Geometry and Entropies:
Pythagoras’ Theorem
Abstract
Kullback-Leibler relative-entropy, in cases involving distributions resulting from
relative-entropy minimization, has a celebrated property reminiscent of squared Euclidean distance: it satisfies an analogue of Pythagoras' theorem. Hence, this property is referred to as the Pythagoras' theorem of relative-entropy minimization or the triangle equality, and it plays a fundamental role in geometrical approaches to statistical estimation theory such as information geometry. We state and prove the equivalent
of Pythagoras’ theorem in the generalized nonextensive formalism as the main result
of this chapter. Before presenting this result we study the Tsallis relative-entropy
minimization and present some differences with the classical case. This work can
also be found in (Dukkipati et al., 2005b; Dukkipati, Murty, & Bhatnagar, 2006a).
Apart from being a fundamental measure of information, Kullback-Leibler relative-entropy or KL-entropy plays the role of a 'measure of distance' between two probability distributions in statistics. Since it is not a metric, at first glance it might seem that the geometrical interpretations usually provided by metric distance measures might not be possible at all with KL-entropy playing the role of a distance measure on a space of probability distributions. But it is a pleasant surprise that it is possible to formulate certain geometric propositions for probability distributions, with the relative-entropy playing the role of squared Euclidean distance. Some of these geometrical interpretations cannot be derived from the properties of KL-entropy alone, but from the properties of "KL-entropy minimization"; in other words, these geometrical formulations are possible only when probability distributions resulting from ME-prescriptions of KL-entropy are involved.
As demonstrated by Kullback (1959), minimization problems of relative-entropy with respect to a set of moment constraints find their importance in the well-known Kullback minimum entropy principle and thereby play a basic role in the information-theoretic approach to statistics (Good, 1963; Ireland & Kullback, 1968). They frequently occur elsewhere also, e.g., in the theory of large deviations (Sanov, 1957), and in statistical physics, as maximization of entropy (Jaynes, 1957a, 1957b).
Kullback’s minimum entropy principle can be considered as a general method of
inference about an unknown probability distribution when there exists a prior estimate
of the distribution and new information in the form of constraints on expected values (Shore, 1981b). Formally, one can state this principle as: given a prior distribution
r, of all the probability distributions that satisfy the given moment constraints, one
should choose the posterior p with the least relative-entropy. The prior distribution
r can be a reference distribution (uniform, Gaussian, Lorentzian or Boltzmann etc.)
or a prior estimate of p. The principle of Jaynes maximum entropy is a special case
of minimization of relative-entropy under appropriate conditions (Shore & Johnson,
1980).
Many properties of relative-entropy minimization just reflect well-known properties of relative-entropy but there are surprising differences as well. For example,
relative-entropy does not generally satisfy a triangle relation involving three arbitrary
probability distributions. But in certain important cases involving distributions that result from relative-entropy minimization, relative-entropy satisfies a theorem comparable to Pythagoras' theorem; cf. (Csiszár, 1975) and (Čencov, 1982, § 11). In this geometrical interpretation, relative-entropy plays the role of squared distance, and minimization of relative-entropy appears as the analogue of projection onto a subspace in Euclidean geometry. This property is also known as the triangle equality (Shore, 1981b).
The main aim of this chapter is to study the possible generalization of Pythagoras’
theorem to the nonextensive case. Before we take up this problem, we present the
properties of Tsallis relative-entropy minimization and present some differences with
the classical case. In the representation of such a minimum entropy distribution, we
highlight the use of the q-product (q-deformed version of multiplication), an operator
that has been introduced recently to derive the mathematical structure behind Tsallis statistics. In particular, the q-product representation of the Tsallis minimum relative-entropy distribution will be useful for the derivation of the equivalent of the triangle equality for Tsallis relative-entropy.
Before we conclude this introduction on geometrical ideas of relative-entropy minimization, we make a note on other geometric approaches that will not be considered in this thesis. One approach is that of Rao (1945), where one looks at the set of probability distributions on a sample space as a differential manifold and introduces a Riemannian geometry on this manifold. This approach was pioneered by Čencov (1982) and Amari (1985), who have shown the existence of a particular Riemannian geometry which is useful in understanding some questions of statistical inference. This Riemannian geometry turns out to have some interesting connections with information theory and, as shown by Campbell (1985), with minimum relative-entropy. In this approach too, the above-mentioned Pythagoras' theorem plays an important role (Amari & Nagaoka, 2000, pp. 72).
The other idea involves the use of Hausdorff dimension (Billingsley, 1960, 1965)
to understand why minimizing relative-entropy should provide useful results. This
approach was begun by Eggleston (1952) for a special case of maximum entropy and
was developed by Campbell (1992). For an excellent review on various geometrical
aspects associated with minimum relative-entropy one can refer to (Campbell, 2003).
The chapter is organized as follows. We present the necessary
background in § 4.1, where we discuss properties of relative-entropy minimization in
the classical case. In § 4.2, we present the ME prescriptions of Tsallis relative-entropy
and discuss its differences with the classical case. Finally, the derivation of Pythagoras’
theorem in the nonextensive case is presented in § 4.3.
Regarding the notation, we use the same notation as in Chapter 3, and we write all
our mathematical formulations on the measure space (X, M, µ). All the assumptions
we made in Chapter 3 (see § 3.2) are valid here too. Also, though results presented in
this chapter do not involve major measure theoretic concepts, we write all the integrals
with respect to the measure µ, as a convention; these integrals can be replaced by
summations in the discrete case or Lebesgue integrals in the continuous case.
4.1 Relative-Entropy Minimization in the Classical Case
Kullback's minimum entropy principle can be stated formally as follows. Given a prior distribution r with a finite set of moment constraints of the form
∫_X u_m(x) p(x) dµ(x) = ⟨u_m⟩ , m = 1, . . . , M ,   (4.1)
one should choose the posterior p which minimizes the relative-entropy
I(p‖r) = ∫_X p(x) ln (p(x)/r(x)) dµ(x) .   (4.2)
In (4.1), ⟨u_m⟩, m = 1, . . . , M, are the known expectation values of the M-measurable functions u_m : X → R, m = 1, . . . , M, respectively.
With reference to (4.2) we clarify here that, though we mainly use expressions of
relative-entropy defined for pdfs in this chapter, we use expressions in terms of corresponding probability measures as well. For example, when we write the Lagrangian
for relative-entropy minimization below, we use the definition of relative-entropy (3.7)
for probability measures P and R, corresponding to pdfs p and r respectively (refer to
Definitions 3.2 and 3.3). This correspondence between probability measures P and R
with pdfs p and r, respectively, will not be described again in the sequel.
4.1.1 Canonical Minimum Entropy Distribution
To minimize the relative-entropy (4.2) with respect to the constraints (4.1), the Lagrangian turns out to be
L(x, λ, β) = ∫_X ln (dP/dR)(x) dP(x) + λ (∫_X dP(x) − 1) + ∑_{m=1}^{M} β_m (∫_X u_m(x) dP(x) − ⟨u_m⟩) ,   (4.3)
where λ and β_m, m = 1, . . . , M, are Lagrange multipliers. The solution is given by
ln (dP/dR)(x) + λ + ∑_{m=1}^{M} β_m u_m(x) = 0 ,
and the solution can be written in the form
(dP/dR)(x) = e^{−∑_{m=1}^{M} β_m u_m(x)} / ∫_X e^{−∑_{m=1}^{M} β_m u_m(x)} dR .   (4.4)
Finally, from (4.4) the posterior distribution p(x) = dP/dµ given by Kullback's minimum entropy principle can be written in terms of the prior r(x) = dR/dµ as
p(x) = r(x) e^{−∑_{m=1}^{M} β_m u_m(x)} / Ẑ ,   (4.5)
where
Ẑ = ∫_X r(x) e^{−∑_{m=1}^{M} β_m u_m(x)} dµ(x)   (4.6)
is the partition function.
Relative-entropy minimization has been applied to many problems in statistics (Kullback, 1959) and statistical mechanics (Hobson, 1971). Other applications include pattern recognition (Shore & Gray, 1982), spectral analysis (Shore, 1981a), speech coding (Markel & Gray, 1976), and estimation of prior distributions for Bayesian inference (Caticha & Preuss, 2004). For a list of references on applications of relative-entropy minimization see (Shore & Johnson, 1980) and a recent paper (Cherney & Maslov, 2004).
Properties of relative-entropy minimization have been studied extensively and presented by Shore (1981b). Here we briefly mention a few.
The principle of maximum entropy is equivalent to relative-entropy minimization in the special case of discrete spaces and uniform priors, in the sense that, when the prior is a uniform distribution with finite support W (over E ⊂ X), the minimum entropy distribution turns out to be
p(x) = e^{−∑_{m=1}^{M} β_m u_m(x)} / ∫_E e^{−∑_{m=1}^{M} β_m u_m(x)} dµ(x) ,   (4.7)
which is, in fact, a maximum entropy distribution (3.33) of Shannon entropy with respect to the constraints (4.1).
The important relations of relative-entropy minimization are as follows. The minimum relative-entropy, I, can be calculated as
I = − ln Ẑ − ∑_{m=1}^{M} β_m ⟨u_m⟩ ,   (4.8)
while the thermodynamic equations are
(∂/∂β_m) ln Ẑ = −⟨u_m⟩ , m = 1, . . . , M ,   (4.9)
and
∂I/∂⟨u_m⟩ = −β_m , m = 1, . . . , M .   (4.10)
4.1.2 Pythagoras’ Theorem
The statement of Pythagoras’ theorem of relative-entropy minimization can be formulated as follows (Csiszár, 1975).
THEOREM 4.1
Let r be the prior and let p be the probability distribution that minimizes the relative-entropy subject to a set of constraints
∫_X u_m(x) p(x) dµ(x) = ⟨u_m⟩ , m = 1, . . . , M ,   (4.11)
with respect to M-measurable functions u_m : X → R, m = 1, . . . , M, whose expectation values ⟨u_m⟩, m = 1, . . . , M, are (assumed to be) a priori known. Let l be any other distribution satisfying the same constraints (4.11). Then we have the triangle equality
I(l‖r) = I(l‖p) + I(p‖r) .   (4.12)
Proof
We have
I(l‖r) = ∫_X l(x) ln (l(x)/r(x)) dµ(x)
      = ∫_X l(x) ln (l(x)/p(x)) dµ(x) + ∫_X l(x) ln (p(x)/r(x)) dµ(x)
      = I(l‖p) + ∫_X l(x) ln (p(x)/r(x)) dµ(x) .   (4.13)
From the minimum entropy distribution (4.5) we have
ln (p(x)/r(x)) = − ∑_{m=1}^{M} β_m u_m(x) − ln Ẑ .   (4.14)
By substituting (4.14) in (4.13) we get
I(l‖r) = I(l‖p) + ∫_X l(x) { − ∑_{m=1}^{M} β_m u_m(x) − ln Ẑ } dµ(x)
      = I(l‖p) − ∑_{m=1}^{M} β_m ∫_X l(x) u_m(x) dµ(x) − ln Ẑ
      = I(l‖p) − ∑_{m=1}^{M} β_m ⟨u_m⟩ − ln Ẑ   (by hypothesis)
      = I(l‖p) + I(p‖r) .   (by (4.8))
A simple consequence of the above theorem is that
I(l‖r) ≥ I(p‖r) ,   (4.15)
since I(l‖p) ≥ 0 for every pair of pdfs, with equality if and only if l = p. A pictorial depiction of the triangle equality (4.12) is shown in Figure 4.1.
Figure 4.1: Triangle Equality of Relative-Entropy Minimization.
Detailed discussions on the importance of Pythagoras’ theorem of relative-entropy
minimization can be found in (Shore, 1981b) and (Amari & Nagaoka, 2000, pp. 72).
For a study of relative-entropy minimization without the use of Lagrange multiplier
technique and corresponding geometrical aspects, one can refer to (Csiszár, 1975).
Triangle equality of relative-entropy minimization not only plays a fundamental
role in geometrical approaches of statistical estimation theory ( Čencov, 1982) and
information geometry (Amari, 1985, 2001) but is also important for applications in
which relative-entropy minimization is used for purposes of pattern classification and
cluster analysis (Shore & Gray, 1982).
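The triangle equality (4.12) is easy to verify numerically on a small discrete example. In the sketch below (ours, with a hypothetical prior pmf and a single moment constraint), p is obtained by exponential tilting of the prior as in (4.5), l is another pmf matching the same constraint, and the two sides of (4.12) are compared:

```python
import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7).astype(float)                       # X = {1,...,6}, u(x) = x
r = np.array([0.30, 0.25, 0.20, 0.10, 0.10, 0.05])      # hypothetical prior pmf
u_mean = 3.0                                            # prescribed expectation <u>

def tilt(w0, beta):
    w = w0 * np.exp(-beta * x)
    return w / w.sum()

def match_mean(w0):
    beta = brentq(lambda b: np.dot(x, tilt(w0, b)) - u_mean, -20.0, 20.0)
    return tilt(w0, beta)

def kl(a, b):
    return np.sum(a * np.log(a / b))

p = match_mean(r)                                       # minimum relative-entropy posterior (4.5)
l = match_mean(np.array([0.05, 0.15, 0.15, 0.25, 0.10, 0.30]))   # another pmf with <u> = 3

print(kl(l, r))                     # I(l||r)
print(kl(l, p) + kl(p, r))          # I(l||p) + I(p||r): equal, as in (4.12)
```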
4.2 Tsallis Relative-Entropy Minimization
Unlike the generalized entropy measures, ME of generalized relative-entropies is not
much addressed in the literature. Here, one has to mention the work of Borland et al.
(1998), where they give the minimum relative-entropy distribution of Tsallis relativeentropy with respect to the constraints in terms of q-expectation values.
In this section, we study several aspects of Tsallis relative-entropy minimization.
First we derive the minimum entropy distribution in the case of q-expectation values
(3.38) and then in the case of normalized q-expectation values (3.50). We propose
an elegant representation of these distributions by using the q-deformed binary operator called the q-product ⊗_q. This operator was defined by Borges (2004) along similar lines as the q-addition ⊕_q and the q-subtraction ⊖_q that we discussed in § 2.3.2. Since the q-product plays an important role in the nonextensive formalism, we include a detailed discussion
on the q-product in this section. Finally, we study properties of Tsallis relative-entropy
minimization and its differences with the classical case.
4.2.1 Generalized Minimum Relative-Entropy Distribution
To minimize the Tsallis relative-entropy
I_q(p‖r) = − ∫_X p(x) ln_q (r(x)/p(x)) dµ(x)   (4.16)
with respect to the set of constraints specified in terms of q-expectation values
∫_X u_m(x) p(x)^q dµ(x) = ⟨u_m⟩_q , m = 1, . . . , M ,   (4.17)
the concomitant variational principle is given as follows. Define
L(x, λ, β) = ∫_X ln_q (r(x)/p(x)) dP(x) − λ (∫_X dP(x) − 1) − ∑_{m=1}^{M} β_m (∫_X p(x)^{q−1} u_m(x) dP(x) − ⟨u_m⟩_q) ,   (4.18)
where λ and β_m, m = 1, . . . , M, are Lagrange multipliers. Now set
dL/dP = 0 .   (4.19)
The solution is given by
ln_q (r(x)/p(x)) − λ − p(x)^{q−1} ∑_{m=1}^{M} β_m u_m(x) = 0 ,
which can be rearranged, by using the definition of the q-logarithm ln_q x = (x^{1−q} − 1)/(1 − q), as
p(x) = [r(x)^{1−q} − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} / (λ(1 − q) + 1)^{1/(1−q)} .
Specifying the Lagrange parameter λ via the normalization ∫_X p(x) dµ(x) = 1, one can write the Tsallis minimum relative-entropy distribution as (Borland et al., 1998)
p(x) = [r(x)^{1−q} − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} / Ẑ_q ,   (4.20)
where the partition function is given by
Ẑ_q = ∫_X [r(x)^{1−q} − (1 − q) ∑_{m=1}^{M} β_m u_m(x)]^{1/(1−q)} dµ(x) .   (4.21)
The values of the Lagrange parameters β_m, m = 1, . . . , M, are determined using the constraints (4.17).
4.2.2 q-Product Representation for Tsallis Minimum Entropy Distribution
Note that the generalized relative-entropy distribution (4.20) is not of the form of its
classical counterpart (4.5) even if we replace the exponential with the q-exponential.
But one can express (4.20) in a form similar to the classical case by invoking qdeformed binary operation called q-product.
In the framework of q-deformed functions and operators discussed in Chapter 2, a new multiplication, called the q-product, defined as
\[
x \otimes_q y \equiv
\begin{cases}
\left[x^{1-q} + y^{1-q} - 1\right]^{\frac{1}{1-q}} & \text{if } x, y > 0 \text{ and } x^{1-q} + y^{1-q} - 1 > 0,\\[4pt]
0 & \text{otherwise},
\end{cases}
\tag{4.22}
\]
was first introduced in (Nivanen et al., 2003) and explicitly defined in (Borges, 2004) so as to satisfy the following equations:
\[
\ln_q (x \otimes_q y) = \ln_q x + \ln_q y\,,
\tag{4.23}
\]
\[
e_q^{x} \otimes_q e_q^{y} = e_q^{x+y}\,.
\tag{4.24}
\]
The q-product recovers the usual product in the limit q → 1, i.e., $\lim_{q\to 1}(x\otimes_q y) = xy$. The fundamental properties of the q-product ⊗_q are almost the same as those of the usual product, but the distributive law does not hold in general, i.e.,
\[
a\,(x \otimes_q y) \neq ax \otimes_q y \qquad (a, x, y \in \mathbb{R})\,.
\]
Further properties of the q-product can be found in (Nivanen et al., 2003; Borges, 2004).
One can check the mathematical validity of the q-product by recalling the expression for the exponential function $e^x$:
\[
e^{x} = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^{n}.
\tag{4.25}
\]
Replacing the power on the right side of (4.25) by the n-fold q-product
\[
x^{\otimes_q n} = \underbrace{x \otimes_q \cdots \otimes_q x}_{n \text{ times}}\,,
\tag{4.26}
\]
one can verify that (Suyari, 2004b)
\[
e_q^{x} = \lim_{n\to\infty}\left(1 + \frac{x}{n}\right)^{\otimes_q n}.
\tag{4.27}
\]
Further mathematical significance of the q-product is demonstrated in (Suyari & Tsukada, 2005), which uncovers the mathematical structure of statistics based on the Tsallis formalism: the law of error, the q-Stirling formula, the q-multinomial coefficient and experimental evidence for a q-central limit theorem.
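The defining identities (4.23) and (4.24) are easy to check numerically; the following is a minimal sketch (with illustrative values of q, x and y, not taken from the text) of the q-logarithm, q-exponential and q-product.

    import numpy as np

    def ln_q(x, q):
        # q-logarithm: (x^(1-q) - 1)/(1-q); reduces to ln(x) as q -> 1
        return np.log(x) if q == 1 else (x**(1 - q) - 1) / (1 - q)

    def exp_q(x, q):
        # q-exponential, the inverse of ln_q where it is defined
        return np.exp(x) if q == 1 else np.maximum(1 + (1 - q) * x, 0.0)**(1 / (1 - q))

    def q_product(x, y, q):
        # q-product as in (4.22)
        base = x**(1 - q) + y**(1 - q) - 1
        return base**(1 / (1 - q)) if (x > 0 and y > 0 and base > 0) else 0.0

    q, x, y = 0.7, 2.0, 3.0
    print(ln_q(q_product(x, y, q), q), ln_q(x, q) + ln_q(y, q))      # identity (4.23)
    print(q_product(exp_q(x, q), exp_q(y, q), q), exp_q(x + y, q))   # identity (4.24)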
Now, one can verify the non-trivial fact that the Tsallis minimum entropy distribution (4.20) can be expressed as (Dukkipati, Murty, & Bhatnagar, 2005b)
\[
p(x) = \frac{r(x) \otimes_q e_q^{-\sum_{m=1}^{M}\beta_m u_m(x)}}{\widehat{Z}_q}\,,
\tag{4.28}
\]
where
\[
\widehat{Z}_q = \int_X r(x) \otimes_q e_q^{-\sum_{m=1}^{M}\beta_m u_m(x)}\, d\mu(x)\,.
\tag{4.29}
\]
Later in this chapter we see that this representation is useful in establishing properties of Tsallis relative-entropy minimization and the corresponding thermodynamic equations.
It is important to note that the distribution in (4.20) could be a (local/global) minimum only if q > 0 and the Tsallis cut-off condition (3.46) specified by the Tsallis maximum entropy distribution is extended to the relative-entropy case, i.e., p(x) = 0 whenever $r(x)^{1-q} - (1-q)\sum_{m=1}^{M}\beta_m u_m(x) < 0$. The latter condition is also required for the q-product representation of the generalized minimum entropy distribution.
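Reusing the helper functions from the sketch above (and again with assumed illustrative values), one can confirm pointwise that the q-product form (4.28) reproduces the bracketed form (4.20):

    # assumes ln_q, exp_q and q_product from the previous sketch
    q, beta1 = 0.8, 0.05
    r_x, u_x = 0.3, 2.0                               # one point of r(x) and of u_1(x)
    bracket_form = (r_x**(1 - q) - (1 - q) * beta1 * u_x)**(1 / (1 - q))   # numerator of (4.20)
    q_prod_form = q_product(r_x, exp_q(-beta1 * u_x, q), q)                # numerator of (4.28)
    print(bracket_form, q_prod_form)                  # agree up to floating point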
4.2.3 Properties
As we mentioned earlier, in the classical case, that is when q = 1, relative-entropy minimization with the uniform distribution as prior is equivalent to entropy maximization. In the nonextensive framework this is no longer true. Let r be the uniform distribution over E ⊂ X, where W denotes the size of its support, so that r(x) = 1/W on E. Then, by (4.20), one can verify that the probability distribution which minimizes the Tsallis relative-entropy is
\[
p(x) = \frac{\left[\frac{1}{W^{1-q}} - (1-q)\sum_{m=1}^{M}\beta_m u_m(x)\right]^{\frac{1}{1-q}}}
{\int_E \left[\frac{1}{W^{1-q}} - (1-q)\sum_{m=1}^{M}\beta_m u_m(x)\right]^{\frac{1}{1-q}} d\mu(x)}\,,
\]
which can be written as
\[
p(x) = \frac{e_q^{-W^{q-1}\ln_q W - \sum_{m=1}^{M}\beta_m u_m(x)}}
{\int_E e_q^{-W^{q-1}\ln_q W - \sum_{m=1}^{M}\beta_m u_m(x)}\, d\mu(x)}
\tag{4.30}
\]
or
\[
p(x) = \frac{e_q^{-W^{1-q}\sum_{m=1}^{M}\beta_m u_m(x)}}
{\int_E e_q^{-W^{1-q}\sum_{m=1}^{M}\beta_m u_m(x)}\, d\mu(x)}\,.
\tag{4.31}
\]
By comparing (4.30) or (4.31) with the Tsallis maximum entropy distribution (3.44), one can conclude (formally one can verify this by the thermodynamic equations of Tsallis entropy (3.37)) that minimizing relative-entropy is not equivalent¹ to maximizing entropy when the prior is a uniform distribution. The key observation here is that W appears in (4.31), unlike in (3.44).

¹For fixed q-expected values ⟨u_m⟩_q, the two distributions (4.31) and (3.44) are equal, but the values of the corresponding Lagrange multipliers are different when q ≠ 1 (while in the classical case they remain the same). Further, (4.31) offers the relation between the Lagrange parameters in these two cases. Let β_m^{(S)}, m = 1, . . . , M be the Lagrange parameters corresponding to the generalized maximum entropy distribution, while β_m^{(I)}, m = 1, . . . , M correspond to the generalized minimum entropy distribution with uniform prior. Then we have the relation β_m^{(S)} = W^{1-q} β_m^{(I)}, m = 1, . . . , M.
In this case, one can calculate the minimum relative-entropy I_q as
\[
I_q = -\ln_q \widehat{Z}_q - \sum_{m=1}^{M}\beta_m \langle u_m\rangle_q\,.
\tag{4.32}
\]
To demonstrate the usefulness of the q-product representation of the generalized minimum entropy distribution, we present a verification of (4.32). By using the property (4.24) of the q-product, the Tsallis minimum relative-entropy distribution (4.28) can be written as
\[
p(x)\,\widehat{Z}_q = e_q^{-\sum_{m=1}^{M}\beta_m u_m(x) + \ln_q r(x)}\,.
\]
By taking the q-logarithm on both sides, we get
\[
\ln_q p(x) + \ln_q\widehat{Z}_q + (1-q)\ln_q p(x)\,\ln_q\widehat{Z}_q = -\sum_{m=1}^{M}\beta_m u_m(x) + \ln_q r(x)\,.
\]
By the property of the q-logarithm $\ln_q\frac{x}{y} = y^{q-1}\left(\ln_q x - \ln_q y\right)$, we have
\[
\ln_q\frac{r(x)}{p(x)} = p(x)^{q-1}\left\{\ln_q\widehat{Z}_q + (1-q)\ln_q p(x)\,\ln_q\widehat{Z}_q + \sum_{m=1}^{M}\beta_m u_m(x)\right\}.
\tag{4.33}
\]
By substituting (4.33) in the Tsallis relative-entropy (4.16), we get
\[
I_q = -\int_X p(x)^q\left\{\ln_q\widehat{Z}_q + (1-q)\ln_q p(x)\,\ln_q\widehat{Z}_q + \sum_{m=1}^{M}\beta_m u_m(x)\right\} d\mu(x)\,.
\]
By (4.17) and expanding $\ln_q p(x)$, one can write $I_q$ in its final form as in (4.32), which completes the verification.
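As a further, purely numerical sanity check (not from the original text), the identity (4.32) can be confirmed on the discrete example sketched in § 4.2.1, reusing tsallis_min_rel_entropy from that sketch and ln_q from the q-deformed helpers of § 4.2.2:

    q = 0.8
    r = np.array([0.4, 0.3, 0.2, 0.1])
    u = np.array([[1.0, 2.0, 3.0, 4.0]])
    beta = np.array([0.05])
    p, Zq_hat = tsallis_min_rel_entropy(r, u, beta, q)
    I_direct = -(p * ln_q(r / p, q)).sum()            # definition (4.16), discrete form
    u1_q = (u[0] * p**q).sum()                        # <u_1>_q under p
    I_formula = -ln_q(Zq_hat, q) - beta[0] * u1_q     # right-hand side of (4.32)
    print(I_direct, I_formula)                        # agree up to floating point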
It is easy to verify the following thermodynamic equations for the minimum Tsallis relative-entropy:
\[
\frac{\partial}{\partial\beta_m}\ln_q\widehat{Z}_q = -\langle u_m\rangle_q\,, \qquad m = 1,\ldots,M,
\tag{4.34}
\]
\[
\frac{\partial I_q}{\partial\langle u_m\rangle_q} = -\beta_m\,, \qquad m = 1,\ldots,M,
\tag{4.35}
\]
which generalize the thermodynamic equations in the classical case.
4.2.4 The Case of Normalized q-Expectations
In this section we discuss Tsallis relative-entropy minimization with respect to constraints in the form of normalized q-expectations
\[
\frac{\int_X u_m(x)\,p(x)^q\,d\mu(x)}{\int_X p(x)^q\,d\mu(x)} = \langle\langle u_m\rangle\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.36}
\]
The variational principle for Tsallis relative-entropy minimization in this case is as below. Let
\[
L(x,\lambda,\beta) = \int_X \ln_q\frac{r(x)}{p(x)}\,dP(x) - \lambda\left(\int_X dP(x) - 1\right)
- \sum_{m=1}^{M}\beta_m^{(q)}\int_X \left(p(x)^{q-1}u_m(x) - \langle\langle u_m\rangle\rangle_q\right) dP(x)\,,
\tag{4.37}
\]
where the parameters β_m^{(q)} can be defined in terms of the true Lagrange parameters β_m as
\[
\beta_m^{(q)} = \frac{\beta_m}{\int_X p(x)^q\,d\mu(x)}\,, \qquad m = 1,\ldots,M.
\tag{4.38}
\]
This gives the minimum entropy distribution as
\[
p(x) = \frac{1}{\widehat{Z}_q}\left[r(x)^{1-q} - (1-q)\,\frac{\sum_{m=1}^{M}\beta_m\left(u_m(x) - \langle\langle u_m\rangle\rangle_q\right)}{\int_X p(x)^q\,d\mu(x)}\right]^{\frac{1}{1-q}},
\tag{4.39}
\]
where
\[
\widehat{Z}_q = \int_X\left[r(x)^{1-q} - (1-q)\,\frac{\sum_{m=1}^{M}\beta_m\left(u_m(x) - \langle\langle u_m\rangle\rangle_q\right)}{\int_X p(x)^q\,d\mu(x)}\right]^{\frac{1}{1-q}} d\mu(x)\,.
\]
Now, the minimum entropy distribution (4.39) can be expressed using the q-product (4.22) as
\[
p(x) = \frac{1}{\widehat{Z}_q}\; r(x) \otimes_q \exp_q\!\left(-\,\frac{\sum_{m=1}^{M}\beta_m\left(u_m(x) - \langle\langle u_m\rangle\rangle_q\right)}{\int_X p(x)^q\,d\mu(x)}\right).
\tag{4.40}
\]
The minimum Tsallis relative-entropy I_q in this case satisfies
\[
I_q = -\ln_q \widehat{Z}_q\,,
\tag{4.41}
\]
while one can derive the following thermodynamic equations:
\[
\frac{\partial}{\partial\beta_m}\ln_q\widetilde{Z}_q = -\langle\langle u_m\rangle\rangle_q\,, \qquad m = 1,\ldots,M,
\tag{4.42}
\]
\[
\frac{\partial I_q}{\partial\langle\langle u_m\rangle\rangle_q} = -\beta_m\,, \qquad m = 1,\ldots,M,
\tag{4.43}
\]
where
\[
\ln_q\widetilde{Z}_q = \ln_q\widehat{Z}_q - \sum_{m=1}^{M}\beta_m\langle\langle u_m\rangle\rangle_q\,.
\tag{4.44}
\]
4.3 Nonextensive Pythagoras’ Theorem
With the above study of Tsallis relative-entropy minimization in place, in this section we present our main result: Pythagoras' theorem, or the triangle equality (Theorem 4.1), generalized to the nonextensive case. To motivate this result, we first discuss the significance of the triangle equality in the classical case and restate Theorem 4.1, which is essential for the derivation of the triangle equality in the nonextensive framework.
4.3.1 Pythagoras’ Theorem Restated
The significance of the triangle equality is evident in the following scenario. Let r be the prior estimate of the unknown probability distribution l, about which the information in the form of constraints
\[
\int_X u_m(x)\,l(x)\,d\mu(x) = \langle u_m\rangle\,, \qquad m = 1,\ldots,M
\tag{4.45}
\]
is available with respect to the fixed functions u_m, m = 1, . . . , M. The problem is to choose a posterior estimate p that is, in some sense, the best estimate of l given the available information, i.e., the prior r and the information in the form of expected values (4.45). Kullback's minimum entropy principle provides a general solution to this inference problem and gives the estimate (4.5) when we minimize the relative-entropy I(p‖r) with respect to the constraints
\[
\int_X u_m(x)\,p(x)\,d\mu(x) = \langle u_m\rangle\,, \qquad m = 1,\ldots,M.
\tag{4.46}
\]
This estimate of the posterior p by Kullback's minimum entropy principle also satisfies the relation (Theorem 4.1)
\[
I(l\|r) = I(l\|p) + I(p\|r)\,,
\tag{4.47}
\]
from which one can draw the following conclusions. By (4.15), the minimum relative-entropy posterior estimate of l is not only logically consistent, but also closer to l, in the relative-entropy sense, than is the prior r. Moreover, the difference I(l‖r) − I(l‖p) is exactly the relative-entropy I(p‖r) between the posterior and the prior. Hence, I(p‖r) can be interpreted as the amount of information provided by the constraints that is not inherent in r.
Additional justification for using the minimum relative-entropy estimate of p with respect to the constraints (4.46) is provided by the following expected-value matching property (Shore, 1981b). To explain this concept we restate the above estimation problem as follows.
For fixed functions u_m, m = 1, . . . , M, let the actual unknown distribution l satisfy
\[
\int_X u_m(x)\,l(x)\,d\mu(x) = \langle w_m\rangle\,, \qquad m = 1,\ldots,M,
\tag{4.48}
\]
where ⟨w_m⟩, m = 1, . . . , M are expected values of l, the only information available about l apart from the prior r. To apply the minimum entropy principle to obtain the posterior estimate p of l, one has to determine the constraints for p with respect to which we minimize I(p‖r). Equivalently, by assuming that p satisfies constraints of the form (4.46), one has to determine the expected values ⟨u_m⟩, m = 1, . . . , M.
Now, as ⟨u_m⟩, m = 1, . . . , M vary, one can show that I(l‖p) attains its minimum value when
\[
\langle u_m\rangle = \langle w_m\rangle\,, \qquad m = 1,\ldots,M.
\tag{4.49}
\]
The proof is as follows (Shore, 1981b). Proceeding as in the proof of Theorem 4.1, we have
\[
I(l\|p) = I(l\|r) + \sum_{m=1}^{M}\beta_m\int_X l(x)\,u_m(x)\,d\mu(x) + \ln\widehat{Z}
= I(l\|r) + \sum_{m=1}^{M}\beta_m\langle w_m\rangle + \ln\widehat{Z} \qquad \text{(by (4.48))}.
\tag{4.50}
\]
Since a variation of I(l‖p) with respect to ⟨u_m⟩ results in a variation of I(l‖p) with respect to β_m for any m = 1, . . . , M, to find the minimum of I(l‖p) one can solve
\[
\frac{\partial}{\partial\beta_m}\, I(l\|p) = 0\,, \qquad m = 1,\ldots,M,
\]
which gives the solution (4.49).
This expectation-matching property states that, for a distribution p of the form (4.5), I(l‖p) is smallest when the expected values of p match those of l. In particular, p not only minimizes I(p‖r) but also minimizes I(l‖p).
We now restate Theorem 4.1, which summarizes the above discussion.
THEOREM 4.2
Let r be the prior distribution, and let p be the probability distribution that minimizes the relative-entropy subject to a set of constraints
\[
\int_X u_m(x)\,p(x)\,d\mu(x) = \langle u_m\rangle\,, \qquad m = 1,\ldots,M.
\tag{4.51}
\]
Let l be any other distribution satisfying the constraints
\[
\int_X u_m(x)\,l(x)\,d\mu(x) = \langle w_m\rangle\,, \qquad m = 1,\ldots,M.
\tag{4.52}
\]
Then
1. $I_1(l\|p)$ is minimum only if (expectation matching property)
\[
\langle u_m\rangle = \langle w_m\rangle\,, \qquad m = 1,\ldots,M.
\tag{4.53}
\]
2. When (4.53) holds, we have
\[
I(l\|r) = I(l\|p) + I(p\|r)\,.
\tag{4.54}
\]
By the above interpretation of the triangle equality and analogy with the comparable situation in Euclidean geometry, it is natural to call p, as defined by (4.5), the projection of r on the plane described by (4.52). Csiszár (1975) introduced a generalization of this notion to define the projection of r on any convex set E of probability distributions. If p ∈ E satisfies
\[
I(p\|r) = \min_{s\in E} I(s\|r)\,,
\tag{4.55}
\]
then p is called the projection of r on E. Csiszár (1975) develops a number of results about these projections for both finite- and infinite-dimensional spaces. In this thesis, we do not consider this general approach.
4.3.2 The Case of q-Expectations
From the above discussion, it is clear that to derive the triangle equality of Tsallis relative-entropy minimization, one should first deduce the equivalent of the expectation-matching property in the nonextensive case.
We state and prove below the Pythagoras theorem in the nonextensive framework (Dukkipati, Murty, & Bhatnagar, 2006a).
THEOREM 4.3
Let r be the prior distribution, and let p be the probability distribution that minimizes the Tsallis relative-entropy subject to a set of constraints
\[
\int_X u_m(x)\,p(x)^q\,d\mu(x) = \langle u_m\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.56}
\]
Let l be any other distribution satisfying the constraints
\[
\int_X u_m(x)\,l(x)^q\,d\mu(x) = \langle w_m\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.57}
\]
Then
1. $I_q(l\|p)$ is minimum only if
\[
\langle u_m\rangle_q = \frac{\langle w_m\rangle_q}{1 - (1-q)\,I_q(l\|p)}\,, \qquad m = 1,\ldots,M.
\tag{4.58}
\]
2. Under (4.58), we have
\[
I_q(l\|r) = I_q(l\|p) + I_q(p\|r) + (q-1)\,I_q(l\|p)\,I_q(p\|r)\,.
\tag{4.59}
\]
Proof
First we deduce the equivalent of the expectation-matching property in the nonextensive case. That is, we would like to find the values of ⟨u_m⟩_q for which I_q(l‖p) is minimum. We write down the following useful relations before proceeding to the derivation.
We can write the generalized minimum entropy distribution (4.28) as
\[
p(x) = \frac{e_q^{\ln_q r(x)} \otimes_q e_q^{-\sum_{m=1}^{M}\beta_m u_m(x)}}{\widehat{Z}_q}
= \frac{e_q^{-\sum_{m=1}^{M}\beta_m u_m(x) + \ln_q r(x)}}{\widehat{Z}_q}\,,
\tag{4.60}
\]
by using the relations $e_q^{\ln_q x} = x$ and $e_q^{x}\otimes_q e_q^{y} = e_q^{x+y}$. Further, by using
\[
\ln_q(xy) = \ln_q x + \ln_q y + (1-q)\ln_q x\,\ln_q y
\]
we can write (4.60) as
\[
\ln_q p(x) + \ln_q\widehat{Z}_q + (1-q)\ln_q p(x)\,\ln_q\widehat{Z}_q = -\sum_{m=1}^{M}\beta_m u_m(x) + \ln_q r(x)\,.
\tag{4.61}
\]
By the property of the q-logarithm
\[
\ln_q\frac{x}{y} = y^{q-1}\left(\ln_q x - \ln_q y\right),
\tag{4.62}
\]
and by the q-logarithmic representation of Tsallis entropy,
\[
S_q = -\int_X p(x)^q \ln_q p(x)\,d\mu(x)\,,
\]
one can verify that
\[
I_q(p\|r) = -\int_X p(x)^q \ln_q r(x)\,d\mu(x) - S_q(p)\,.
\tag{4.63}
\]
With these relations in hand we proceed with the derivation. Consider
\[
I_q(l\|p) = -\int_X l(x)\,\ln_q\frac{p(x)}{l(x)}\,d\mu(x)\,.
\]
By (4.62) we have
\[
I_q(l\|p) = -\int_X l(x)^q\left[\ln_q p(x) - \ln_q l(x)\right] d\mu(x)
= I_q(l\|r) - \int_X l(x)^q\left[\ln_q p(x) - \ln_q r(x)\right] d\mu(x)\,.
\tag{4.64}
\]
From (4.61), we get
\[
I_q(l\|p) = I_q(l\|r) + \int_X l(x)^q\left[\sum_{m=1}^{M}\beta_m u_m(x)\right] d\mu(x)
+ \ln_q\widehat{Z}_q\int_X l(x)^q\,d\mu(x)
+ (1-q)\ln_q\widehat{Z}_q\int_X l(x)^q\,\ln_q p(x)\,d\mu(x)\,.
\tag{4.65}
\]
By using (4.57) and (4.63),
\[
I_q(l\|p) = I_q(l\|r) + \sum_{m=1}^{M}\beta_m\langle w_m\rangle_q
+ \ln_q\widehat{Z}_q\int_X l(x)^q\,d\mu(x)
+ (1-q)\ln_q\widehat{Z}_q\left[-I_q(l\|p) - S_q(l)\right],
\tag{4.66}
\]
and by the expression of Tsallis entropy $S_q(l) = \frac{1}{q-1}\left(1 - \int_X l(x)^q\,d\mu(x)\right)$, we have
\[
I_q(l\|p) = I_q(l\|r) + \sum_{m=1}^{M}\beta_m\langle w_m\rangle_q + \ln_q\widehat{Z}_q - (1-q)\ln_q\widehat{Z}_q\,I_q(l\|p)\,.
\tag{4.67}
\]
Since the multipliers β_m, m = 1, . . . , M are functions of the expected values ⟨u_m⟩_q, variations in the expected values are equivalent to variations in the multipliers. Hence, to find the minimum of I_q(l‖p), we solve
\[
\frac{\partial}{\partial\beta_m}\, I_q(l\|p) = 0\,.
\tag{4.68}
\]
By using the thermodynamic equation (4.34), the solution of (4.68) provides us with the expectation-matching property in the nonextensive case:
\[
\langle u_m\rangle_q = \frac{\langle w_m\rangle_q}{1 - (1-q)\,I_q(l\|p)}\,, \qquad m = 1,\ldots,M.
\tag{4.69}
\]
In the limit q → 1 the above equation gives ⟨u_m⟩₁ = ⟨w_m⟩₁, which is the expectation-matching property in the classical case.
Now, to derive the triangle equality for Tsallis relative-entropy minimization, we substitute the expression for ⟨w_m⟩_q, given by (4.69), into (4.67). After some algebra one arrives at (4.59).
Note that the limit q → 1 in (4.59) gives the triangle equality in the classical case (4.54). Two important cases arise out of (4.59):
\[
I_q(l\|r) \le I_q(l\|p) + I_q(p\|r) \quad \text{when } 0 < q \le 1\,,
\tag{4.70}
\]
\[
I_q(l\|r) \ge I_q(l\|p) + I_q(p\|r) \quad \text{when } q > 1\,.
\tag{4.71}
\]
We refer to Theorem 4.3 as the nonextensive Pythagoras' theorem and to (4.59) as the nonextensive triangle equality, whose pseudo-additivity is consistent with the pseudo-additivity of Tsallis relative-entropy (compare (2.40) and (2.11)); hence it is a natural generalization of the triangle equality in the classical case.
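The key intermediate relation (4.67), which holds for an arbitrary l, can also be checked numerically. The sketch below (a verification aid, not from the original text) reuses the quantities p, r, u, beta, Zq_hat and the helper ln_q from the discrete sketches of § 4.2, together with an arbitrary assumed pmf l on the same support.

    # numeric check of relation (4.67) for an arbitrary l
    l = np.array([0.1, 0.2, 0.3, 0.4])                # any pmf on the same support

    def I_q(a, b, q):                                 # discrete Tsallis relative-entropy (4.16)
        return -(a * ln_q(b / a, q)).sum()

    lhs = I_q(l, p, q)
    w1_q = (u[0] * l**q).sum()                        # <w_1>_q under l, as in (4.57)
    rhs = (I_q(l, r, q) + beta[0] * w1_q + ln_q(Zq_hat, q)
           - (1 - q) * ln_q(Zq_hat, q) * lhs)
    print(lhs, rhs)                                   # agree, as asserted by (4.67)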
4.3.3 In the Case of Normalized q-Expectations
In the case of normalized q-expectations too, the Tsallis relative-entropy satisfies a nonextensive triangle equality, with conditions modified from those of the q-expectation case.
THEOREM 4.4
Let r be the prior distribution, and let p be the probability distribution that minimizes the Tsallis relative-entropy subject to the set of constraints
\[
\frac{\int_X u_m(x)\,p(x)^q\,d\mu(x)}{\int_X p(x)^q\,d\mu(x)} = \langle\langle u_m\rangle\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.72}
\]
Let l be any other distribution satisfying the constraints
\[
\frac{\int_X u_m(x)\,l(x)^q\,d\mu(x)}{\int_X l(x)^q\,d\mu(x)} = \langle\langle w_m\rangle\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.73}
\]
Then we have
\[
I_q(l\|r) = I_q(l\|p) + I_q(p\|r) + (q-1)\,I_q(l\|p)\,I_q(p\|r)\,,
\tag{4.74}
\]
provided
\[
\langle\langle u_m\rangle\rangle_q = \langle\langle w_m\rangle\rangle_q\,, \qquad m = 1,\ldots,M.
\tag{4.75}
\]
Proof
From the Tsallis minimum entropy distribution p in the case of normalized q-expected values (4.40), we have
\[
\ln_q r(x) - \ln_q p(x) = \ln_q\widehat{Z}_q + (1-q)\ln_q p(x)\,\ln_q\widehat{Z}_q
+ \frac{\sum_{m=1}^{M}\beta_m\left(u_m(x) - \langle\langle u_m\rangle\rangle_q\right)}{\int_X p(x)^q\,d\mu(x)}\,.
\tag{4.76}
\]
Proceeding as in the proof of Theorem 4.3, we have
\[
I_q(l\|p) = I_q(l\|r) - \int_X l(x)^q\left[\ln_q p(x) - \ln_q r(x)\right] d\mu(x)\,.
\tag{4.77}
\]
From (4.76), we obtain
\[
I_q(l\|p) = I_q(l\|r) + \ln_q\widehat{Z}_q\int_X l(x)^q\,d\mu(x)
+ (1-q)\ln_q\widehat{Z}_q\int_X l(x)^q\,\ln_q p(x)\,d\mu(x)
+ \frac{1}{\int_X p(x)^q\,d\mu(x)}\sum_{m=1}^{M}\beta_m\int_X l(x)^q\left(u_m(x) - \langle\langle u_m\rangle\rangle_q\right) d\mu(x)\,.
\tag{4.78}
\]
By (4.73) the same can be written as
\[
I_q(l\|p) = I_q(l\|r) + \ln_q\widehat{Z}_q\int_X l(x)^q\,d\mu(x)
+ (1-q)\ln_q\widehat{Z}_q\int_X l(x)^q\,\ln_q p(x)\,d\mu(x)
+ \frac{\int_X l(x)^q\,d\mu(x)}{\int_X p(x)^q\,d\mu(x)}\sum_{m=1}^{M}\beta_m\left(\langle\langle w_m\rangle\rangle_q - \langle\langle u_m\rangle\rangle_q\right).
\tag{4.79}
\]
By using the relations
\[
\int_X l(x)^q\,\ln_q p(x)\,d\mu(x) = -I_q(l\|p) - S_q(l)
\]
and
\[
\int_X l(x)^q\,d\mu(x) = (1-q)\,S_q(l) + 1\,,
\]
(4.79) can be written as
\[
I_q(l\|p) = I_q(l\|r) + \ln_q\widehat{Z}_q - (1-q)\ln_q\widehat{Z}_q\,I_q(l\|p)
+ \frac{\int_X l(x)^q\,d\mu(x)}{\int_X p(x)^q\,d\mu(x)}\sum_{m=1}^{M}\beta_m\left(\langle\langle w_m\rangle\rangle_q - \langle\langle u_m\rangle\rangle_q\right).
\tag{4.80}
\]
Finally, using (4.41) and (4.75), we obtain the nonextensive triangle equality (4.74).
Note that in this case the minimum of I_q(l‖p) is not guaranteed. Also, the condition (4.75) for the nonextensive triangle equality here is the same as the expectation-value matching property in the classical case.
Finally, the nonextensive Pythagoras' theorem is yet another remarkable and consistent generalization exhibited by the Tsallis formalism.
5
Power-laws and Entropies:
Generalization of Boltzmann Selection
Abstract
The great success of the Tsallis formalism is largely due to the power-law distributions that result from ME-prescriptions of its entropy functional. In this chapter we provide an experimental demonstration of the use of power-law distributions in evolutionary algorithms by generalizing Boltzmann selection to the Tsallis case. The proposed algorithm uses the Tsallis canonical distribution, instead of the Gibbs-Boltzmann distribution, to weigh the configurations for 'selection'. This work is motivated by the recently proposed generalized simulated annealing algorithm based on Tsallis statistics. The results in this chapter can also be found in (Dukkipati, Murty, & Bhatnagar, 2005a).
The central step of an enormous variety of problems (in physics, chemistry, statistics, engineering, economics) is the minimization of an appropriate energy or cost function. (For example, the energy function in the traveling salesman problem is the length of the path.) If the cost function is convex, any gradient descent method easily solves the problem. But if the cost function is nonconvex, the solution requires more sophisticated methods, since a gradient descent procedure could easily trap the system in a local minimum. Consequently, various algorithmic strategies have been developed over the years for making this important problem increasingly tractable. Among the various methods developed to solve hard optimization problems, the most popular ones are simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and evolutionary algorithms (Bounds, 1987).
Evolutionary computation comprises techniques for obtaining near-optimal solutions of hard optimization problems in physics (e.g., Sutton, Hunter, & Jan, 1994) and engineering (Holland, 1975). These methods are based largely on ideas from biological evolution and are similar to simulated annealing, except that, instead of exploring the search space with a single point at each instant, they deal with a population – a multi-subset of the search space – in order to avoid getting trapped in local optima during the optimization process. Though evolutionary algorithms are not traditionally analyzed in the Monte Carlo framework, a few researchers (e.g., Cercueil & Francois, 2001; Cerf, 1996a, 1996b) have analyzed these algorithms in this framework.
A typical evolutionary algorithm is a two-step process: selection and variation. Selection consists of replicating individuals in the population based on probabilities (selection probabilities) assigned to them on the basis of a "fitness" measure defined by the objective function. A stochastic perturbation of individuals while replicating is called variation.
Selection is a central concept in evolutionary algorithms. There are several selection mechanisms, among which Boltzmann selection has an important place because of the deep connection between the behavior of complex systems in thermal equilibrium at finite temperature and multivariate optimization (Nulton & Salamon, 1988). In these systems, each configuration is weighted by its Gibbs-Boltzmann probability factor $e^{-E/T}$, where E is the energy of the configuration and T is the temperature. Finding the low-temperature state of a system when the energy can be computed amounts to solving an optimization problem. This connection has been used to devise the simulated annealing algorithm (Kirkpatrick et al., 1983). Similarly, in the selection process of evolutionary algorithms, where one would select "better" configurations, one can use the same technique to weigh the individuals, i.e., the Gibbs-Boltzmann factor. This is called Boltzmann selection, which amounts to defining selection probabilities in the form of the Boltzmann canonical distribution.
Classical simulated annealing, as proposed by Kirkpatrick et al. (1983), extended the well-known procedure of Metropolis et al. (1953) for equilibrium Gibbs-Boltzmann statistics: a new configuration is accepted with probability
\[
p = \min\left\{1,\, e^{-\beta\,\Delta E}\right\},
\tag{5.1}
\]
where β = 1/T is the inverse temperature parameter and ΔE is the change in the energy. The annealing consists in decreasing the temperature gradually. Geman and Geman (1984) showed that if the temperature decreases as the inverse logarithm of time, the system will end in a global minimum.
On the other hand, in the generalized simulated annealing procedure proposed by Tsallis and Stariolo (1996), the acceptance probability is generalized to
\[
p = \min\left\{1,\, \left[1 - (1-q)\,\beta\,\Delta E\right]^{\frac{1}{1-q}}\right\},
\tag{5.2}
\]
for some q. The term $[1-(1-q)\beta\Delta E]^{\frac{1}{1-q}}$ is due to the Tsallis distribution in Tsallis statistics (see § 3.4), and q → 1 in (5.2) retrieves the acceptance probability in the classical case. This method has been shown to be faster than both classical simulated annealing and fast simulated annealing (Stariolo & Tsallis, 1995; Tsallis, 1988). It has been used successfully in many applications (Yu & Mo, 2003; Moret et al., 1998; Penna, 1995; Andricioaei & Straub, 1996, 1997).
The above-described use of power-law distributions in simulated annealing motivates us to incorporate the Tsallis canonical probability distribution for selection in evolutionary algorithms and to test its utility.
Before we present the proposed algorithm and simulation results, we also give an information-theoretic justification of the Boltzmann distribution in the selection mechanism (Dukkipati et al., 2005a). In fact, in evolutionary algorithms Boltzmann selection is usually viewed just as an exponential scaling of proportionate selection (de la Maza & Tidor, 1993) (where selection probabilities of configurations are inversely proportional to their energies (Holland, 1975)). We show that by using the Boltzmann distribution in the selection mechanism one implicitly satisfies Kullback's minimum relative-entropy principle.
5.1 EAs based on Boltzmann Distribution
Let Ω be the search space, i.e., the space of all configurations of an optimization problem. Let E : Ω → R⁺ be the objective function – following statistical mechanics terminology (Nulton & Salamon, 1988; Prügel-Bennett & Shapiro, 1994) we refer to this function as energy (in evolutionary computation terminology it is called the fitness function) – where the objective is to find a configuration with lowest energy. $P_t = \{\omega_k\}_{k=1}^{N_t}$ denotes a population, which is a multi-subset of Ω. Here we assume that the size of the population at any time is finite but need not be constant.
In the first step, the initial population P₀ is chosen with random configurations. At each time step t, the population undergoes the following procedure:
\[
P_t \xrightarrow{\ \text{selection}\ } P_t' \xrightarrow{\ \text{variation}\ } P_{t+1}\,.
\]
Variation is nothing but stochastically perturbing the individuals in the population.
Various methods in evolutionary algorithms follow different approaches. For example
in genetic algorithms, where configurations are represented as binary strings, operators
such as mutation and crossover are used; for details see (Holland, 1975).
Selection is the mechanism whereby "good" configurations are replicated based on their selection probabilities (Back, 1994). For a population $P_t = \{\omega_k\}_{k=1}^{N_t}$ with corresponding energy values $\{E_k\}_{k=1}^{N_t}$, the selection probabilities are defined as
\[
p_t(\omega_k) = \mathrm{Prob}(\omega_k \in P_t' \mid \omega_k \in P_t)\,, \qquad \forall k = 1,\ldots,N_t\,,
\]
and $\{p_t(\omega_k)\}_{k=1}^{N_t}$ satisfies the condition $\sum_{k=1}^{N_t} p_t(\omega_k) = 1$.
[Figure 5.1: Structure of evolutionary algorithms – a flowchart: initialize population; evaluate "fitness"; apply selection; randomly vary individuals; repeat until the stop criterion is met.]
The general structure of evolutionary algorithms is shown in Figure 5.1; for further details refer to (Fogel, 1994; Back, Hammel, & Schwefel, 1997).
According to Boltzmann selection, selection probabilities are defined as
\[
p_t(\omega_k) = \frac{e^{-\beta E_k}}{\sum_{j=1}^{N_t} e^{-\beta E_j}}\,,
\tag{5.3}
\]
where β is the inverse temperature at time t. The strength of selection is controlled by the parameter β: a higher value of β (low temperature) gives stronger selection, and a lower value of β gives weaker selection (Back, 1994).
Boltzmann selection gives faster convergence, but without a good annealing schedule for β, it might lead to premature convergence. This problem is well known from
simulated annealing (Aarts & Korst, 1989), but not very well studied in evolutionary
algorithms. This problem is addressed in (Mahnig & Mühlenbein, 2001; Dukkipati,
Murty, & Bhatnagar, 2004) where annealing schedules for evolutionary algorithms
based on Boltzmann selection have been proposed.
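A minimal sketch of the Boltzmann selection probabilities (5.3) follows; the energies and β values are illustrative, and subtracting the minimum energy is only a standard numerical-stability device, not part of the definition.

    import numpy as np

    def boltzmann_selection_probs(energies, beta):
        # Boltzmann selection (5.3): p_k proportional to exp(-beta * E_k)
        e = np.asarray(energies, dtype=float)
        w = np.exp(-beta * (e - e.min()))             # shift for numerical stability
        return w / w.sum()

    energies = [3.2, 1.1, 4.0, 1.3]                   # illustrative population energies
    print(boltzmann_selection_probs(energies, beta=0.5))   # weak selection
    print(boltzmann_selection_probs(energies, beta=5.0))   # strong selection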
Now, we derive the selection equation, similar to the one derived in (Dukkipati et al., 2004), which characterizes Boltzmann selection from first principles. Given a population $P_t = \{\omega_k\}_{k=1}^{N_t}$, the simplest probability distribution on Ω which represents P_t is
\[
\xi_t(\omega) = \frac{\nu_t(\omega)}{N_t}\,, \qquad \forall \omega \in \Omega\,,
\tag{5.4}
\]
where the function $\nu_t : \Omega \to \mathbb{Z}^{+}\cup\{0\}$ counts the number of occurrences of each configuration ω ∈ Ω in the population P_t. Formally, ν_t can be defined as
\[
\nu_t(\omega) = \sum_{k=1}^{N_t}\delta(\omega,\omega_k)\,, \qquad \forall \omega\in\Omega\,,
\tag{5.5}
\]
where $\delta : \Omega\times\Omega \to \{0,1\}$ is defined by δ(ω₁, ω₂) = 1 if ω₁ = ω₂ and δ(ω₁, ω₂) = 0 otherwise.
The mechanism of selection involves assigning selection probabilities to the configurations in P_t as in (5.3) and sampling configurations based on these probabilities to generate the population P_{t+1}. That is, the selection probability distribution assigns zero probability to configurations which are not present in the population. Now, from the fact that the population is a multi-subset of Ω, we can write the selection probability distribution with respect to the population P_t as
\[
p(\omega) =
\begin{cases}
\dfrac{\nu_t(\omega)\,e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\nu_t(\omega)\,e^{-\beta E(\omega)}} & \text{if } \omega \in P_t\,,\\[8pt]
0 & \text{otherwise.}
\end{cases}
\tag{5.6}
\]
One can estimate the frequencies of configurations after selection, ν_{t+1}, as
\[
\nu_{t+1}(\omega) = \nu_t(\omega)\,\frac{e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\nu_t(\omega)\,e^{-\beta E(\omega)}}\,N_{t+1}\,,
\tag{5.7}
\]
where N_{t+1} is the population size after selection. Further, the probability distribution which represents the population P_{t+1} can be estimated as
\[
\xi_{t+1}(\omega) = \frac{\nu_{t+1}(\omega)}{N_{t+1}}
= \nu_t(\omega)\,\frac{e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\nu_t(\omega)\,e^{-\beta E(\omega)}}
= \xi_t(\omega)\,N_t\,\frac{e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\xi_t(\omega)\,N_t\,e^{-\beta E(\omega)}}\,.
\]
Finally, we can write the selection equation as
\[
\xi_{t+1}(\omega) = \frac{\xi_t(\omega)\,e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\xi_t(\omega)\,e^{-\beta E(\omega)}}\,.
\tag{5.8}
\]
One can observe that (5.8) resembles the minimum relative-entropy distribution that we derived in § 4.1.1 (see (4.5)). This motivates one to investigate the possible connection of Boltzmann selection with Kullback's relative-entropy principle.
Given the distribution ξ_t, which represents the population P_t, we would like to estimate the distribution ξ_{t+1} that represents the population P_{t+1}. In this context one can view ξ_t as a prior estimate of ξ_{t+1}. The available constraints for ξ_{t+1} are
\[
\sum_{\omega\in\Omega}\xi_{t+1}(\omega) = 1\,,
\tag{5.9a}
\]
\[
\sum_{\omega\in\Omega}\xi_{t+1}(\omega)\,E(\omega) = \langle E\rangle_{t+1}\,,
\tag{5.9b}
\]
where ⟨E⟩_{t+1} is the expected value of the function E with respect to ξ_{t+1}. At this stage let us assume that ⟨E⟩_{t+1} is a given quantity; this will be explained later.
In this setup, Kullback's minimum relative-entropy principle gives the estimate for ξ_{t+1}. That is, one should choose ξ_{t+1} in such a way that it minimizes the relative-entropy
\[
I(\xi_{t+1}\|\xi_t) = \sum_{\omega\in\Omega}\xi_{t+1}(\omega)\,\ln\frac{\xi_{t+1}(\omega)}{\xi_t(\omega)}
\tag{5.10}
\]
with respect to the constraints (5.9a) and (5.9b). The corresponding Lagrangian can be written as
\[
L \equiv -I(\xi_{t+1}\|\xi_t) - (\lambda-1)\left(\sum_{\omega\in\Omega}\xi_{t+1}(\omega) - 1\right)
- \beta\left(\sum_{\omega\in\Omega}E(\omega)\,\xi_{t+1}(\omega) - \langle E\rangle_{t+1}\right),
\]
where λ and β are Lagrange parameters, and
\[
\frac{\partial L}{\partial\xi_{t+1}(\omega)} = 0 \;\Longrightarrow\; \xi_{t+1}(\omega) = e^{\ln\xi_t(\omega) - \lambda - \beta E(\omega)}\,.
\]
By (5.9a) we get
\[
\xi_{t+1}(\omega) = \frac{e^{\ln\xi_t(\omega) - \beta E(\omega)}}{\sum_{\omega}e^{\ln\xi_t(\omega) - \beta E(\omega)}}
= \frac{\xi_t(\omega)\,e^{-\beta E(\omega)}}{\sum_{\omega\in P_t}\xi_t(\omega)\,e^{-\beta E(\omega)}}\,,
\tag{5.11}
\]
which is the selection equation (5.8) that we derived from the Boltzmann selection mechanism. The Lagrange multiplier β is the inverse temperature parameter in Boltzmann selection.
The above justification is incomplete without explaining the relevance of the constraint (5.9b) in this context. Note that the inverse temperature parameter β in (5.11) is determined using constraint (5.9b). Thus we have
\[
\frac{\sum_{\omega\in\Omega}E(\omega)\,\xi_t(\omega)\,e^{-\beta E(\omega)}}{\sum_{\omega\in\Omega}\xi_t(\omega)\,e^{-\beta E(\omega)}} = \langle E\rangle_{t+1}\,.
\tag{5.12}
\]
Now it is evident that by specifying β in the annealing schedule of Boltzmann selection, we predetermine ⟨E⟩_{t+1}, the mean of the function E with respect to the population P_{t+1}, according to which the configurations for P_{t+1} are sampled.
With this information-theoretic justification of Boltzmann selection, we proceed to its generalization to the Tsallis case.
5.2 EA based on Power-law Distributions
We propose a new selection scheme for evolutionary algorithms based on the Tsallis generalized canonical distribution, which results from maximum entropy prescriptions of Tsallis entropy discussed in § 3.4, as follows. For a population $P_t = \{\omega_k\}_{k=1}^{N_t}$ with corresponding energies $\{E_k\}_{k=1}^{N_t}$, we define the selection probabilities as
\[
p_t(\omega_k) = \frac{\left[1 - (1-q)\,\beta_t E_k\right]^{\frac{1}{1-q}}}{\sum_{j=1}^{N_t}\left[1 - (1-q)\,\beta_t E_j\right]^{\frac{1}{1-q}}}\,, \qquad \forall k = 1,\ldots,N_t\,,
\tag{5.13}
\]
where $\{\beta_t : t = 1, 2, \ldots\}$ is an annealing schedule. We refer to the selection scheme based on the Tsallis distribution as Tsallis selection, and to the evolutionary algorithm with Tsallis selection as the generalized evolutionary algorithm.
In this algorithm, we use the Cauchy annealing schedule proposed in (Dukkipati et al., 2004). This schedule chooses β_t as a non-decreasing Cauchy sequence for faster convergence. One such sequence is
\[
\beta_t = \beta_0\sum_{i=1}^{t}\frac{1}{i^{\alpha}}\,, \qquad t = 1, 2, \ldots,
\tag{5.14}
\]
where β₀ is any constant and α > 1. The utility of this annealing schedule has been demonstrated using simulations in (Dukkipati et al., 2004). Similar to the practice in generalized simulated annealing (Andricioaei & Straub, 1997), in our algorithm q tends towards 1 as the temperature decreases during annealing.
The generalized evolutionary algorithm based on Tsallis statistics is given in Figure 5.2.
Algorithm 1 Generalized Evolutionary Algorithm
  P(0) ← initialize with configurations drawn randomly from the search space
  Initialize β and q
  for t = 1 to T do
    for all ω ∈ P(t) do                        (Selection)
      Calculate p(ω) = [1 − (1 − q) β E(ω)]^{1/(1−q)} / Z_q
      Copy ω into P'(t) with probability p(ω), with replacement
    end for
    for all ω ∈ P'(t) do                       (Variation)
      Perform variation with a specified probability
    end for
    Update β according to the annealing schedule
    Update q according to its schedule
    P(t+1) ← P'(t)
  end for

Figure 5.2: Generalized evolutionary algorithm based on Tsallis statistics to optimize the energy function E(ω).
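A minimal sketch of the Tsallis selection probabilities (5.13), together with the Cauchy annealing schedule (5.14), is given below; the energies are illustrative and assumed non-negative, as is E : Ω → R⁺ above.

    import numpy as np

    def tsallis_selection_probs(energies, beta, q):
        # Tsallis selection (5.13): p_k proportional to [1 - (1-q)*beta*E_k]^(1/(1-q));
        # the clip implements the cut-off for q < 1, and q -> 1 recovers (5.3)
        e = np.asarray(energies, dtype=float)
        if q == 1.0:
            w = np.exp(-beta * (e - e.min()))
        else:
            w = np.clip(1 - (1 - q) * beta * e, 0.0, None)**(1 / (1 - q))
        return w / w.sum()

    def cauchy_beta(t, beta0=200.0, alpha=1.01):
        # Cauchy annealing schedule (5.14): beta_t = beta0 * sum_{i=1}^t 1/i^alpha
        i = np.arange(1, t + 1)
        return beta0 * np.sum(1.0 / i**alpha)

    energies = [3.2, 1.1, 4.0, 1.3]               # illustrative population energies
    print(tsallis_selection_probs(energies, beta=cauchy_beta(1), q=1.5))
    print(tsallis_selection_probs(energies, beta=cauchy_beta(50), q=1.01))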
5.3 Simulation Results
We discuss the simulations conducted to study the generalized evolutionary algorithm based on Tsallis statistics proposed in this chapter. We compare the performance of evolutionary algorithms with three selection mechanisms, viz., proportionate selection (where selection probabilities of configurations are inversely proportional to their energies (Holland, 1975)), Boltzmann selection and Tsallis selection. For comparison purposes we study multivariate function optimization in the framework of genetic algorithms. Specifically, we use the following benchmark test functions (Mühlenbein & Schlierkamp-Voosen, 1993), where the aim is to find the configuration with the lowest functional value (implementation sketches of these functions follow the list):
• Ackley's function:
\[
E_1(\vec{x}) = -20\exp\!\left(-0.2\sqrt{\tfrac{1}{l}\sum_{i=1}^{l}x_i^2}\right) - \exp\!\left(\tfrac{1}{l}\sum_{i=1}^{l}\cos(2\pi x_i)\right) + 20 + e\,,
\]
where −30 ≤ x_i ≤ 30;
• Rastrigin's function:
\[
E_2(\vec{x}) = lA + \sum_{i=1}^{l}\left(x_i^2 - A\cos(2\pi x_i)\right),
\]
where A = 10 and −5.12 ≤ x_i ≤ 5.12;
• Griewangk's function:
\[
E_3(\vec{x}) = \sum_{i=1}^{l}\frac{x_i^2}{4000} - \prod_{i=1}^{l}\cos\!\left(\frac{x_i}{\sqrt{i}}\right) + 1\,,
\]
where −600 ≤ x_i ≤ 600.
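Minimal sketches of the three benchmark functions (each attaining its global minimum of 0 at x = 0) are as follows; they are straightforward transcriptions of the formulas above.

    import numpy as np

    def ackley(x):
        x = np.asarray(x, dtype=float); l = x.size
        return (-20 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / l))
                - np.exp(np.sum(np.cos(2 * np.pi * x)) / l) + 20 + np.e)

    def rastrigin(x, A=10.0):
        x = np.asarray(x, dtype=float)
        return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

    def griewangk(x):
        x = np.asarray(x, dtype=float)
        i = np.arange(1, x.size + 1)
        return np.sum(x**2 / 4000.0) - np.prod(np.cos(x / np.sqrt(i))) + 1

    print(ackley(np.zeros(15)), rastrigin(np.zeros(15)), griewangk(np.zeros(15)))  # ~0, 0, 0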
Parameters for the algorithms were set so as to compare their performance under identical conditions. Each x_i is encoded with 5 bits and l = 15, i.e., the search space is of size 2^{75}. The population size is n = 350. For all the experiments, the probability of uniform crossover is 0.8 and the probability of mutation is below 0.1. We limited each algorithm to 100 iterations and report plots of the behavior of the process averaged over 20 runs.
As mentioned earlier, for Boltzmann selection we used the Cauchy annealing schedule (see (5.14)), in which we set β₀ = 200 and α = 1.01. For Tsallis selection too, we used the same annealing schedule with identical parameters. In our preliminary simulations, q was kept constant and tested with various values. We then adopted a strategy from generalized simulated annealing where one chooses an initial value q₀ and decreases it linearly to the value 1. This schedule of q gave better performance than keeping it constant. We report results for various values of q₀.
From various simulations, we observed that when the problem size is small (for example, smaller values of l) all the selection mechanisms perform equally well; Boltzmann selection becomes effective as the problem size increases. For Tsallis selection, we performed simulations with various values of q₀. Figure 5.3 shows the performance for the Ackley function for q₀ = 3, 2, 1.5 and 1.01, respectively, from which one can see that the choice of q₀, which varies with the problem at hand, is very important for the evolutionary algorithm with Tsallis selection.

[Figure 5.3: Performance of the evolutionary algorithm with Tsallis selection for various values of q₀ (q₀ = 3, 2, 1.5, 1.01) on the Ackley test function; best fitness vs. generations.]
[Figure 5.4: Ackley, q₀ = 1.5 – best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]
[Figure 5.5: Rastrigin, q₀ = 2 – best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]
[Figure 5.6: Griewangk, q₀ = 1.01 – best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]

Figures 5.4, 5.5 and 5.6 show comparisons of the evolutionary algorithms based on Tsallis selection, Boltzmann selection and proportionate selection, respectively, for the different test functions. We report only the best behavior over the various values of q₀. From these simulation results, we conclude that the evolutionary algorithm based on the Tsallis canonical distribution, with an appropriate value of q₀, outperforms those based on Boltzmann and proportionate selection.
6
Conclusions
Abstract
In this concluding chapter we summarize the results of the Dissertation, with an emphasis on the novel contributions, and point to new problems suggested by this research.
Information theory based on the Shannon entropy functional has found applications that cut across a myriad of fields because of its established mathematical significance, i.e., its beautiful mathematical properties. Shannon (1956) too emphasized that "the hard core of information theory is, essentially, a branch of mathematics" and that "a thorough understanding of the mathematical foundation . . . is surely a prerequisite to other applications." Given that "the hard core of information theory is a branch of mathematics," one could expect formal generalizations of information measures to take place, just as would be the case for any other mathematical concept.
At the outset of this Dissertation we noted from (Rényi, 1960; Csiszár, 1974) that generalizations of information measures should be indicated by their operational significance (the pragmatic approach) and by a set of natural postulates characterizing them (the axiomatic approach). In the literature, ranging from mathematics to physics and from information theory to machine learning, one can find various operational and axiomatic justifications of the generalized information measures. In this thesis, we investigated some properties of generalized information measures and of their maximum and minimum entropy prescriptions pertaining to their mathematical significance.
6.1 Contributions of the Dissertation
In this section we briefly summarize the contributions of this thesis including some
problems suggested by this work.
Rényi’s recipe for nonextensive information measures
Passing an information measure through Rényi's formalism – the procedure followed by Rényi to generalize Shannon entropy – allows one to study the possible generalizations and characterizations of information measures in terms of axioms of quasilinear means. In Chapter 2, we studied this technique for nonextensive entropy and showed that Tsallis entropy is unique under Rényi's recipe. Assuming that any putative candidate for an entropy should be a mean (Rényi, 1961), and in light of attempts to study ME-prescriptions of information measures where constraints are specified using KN-averages (e.g., Czachor & Naudts, 2002), the results presented in this thesis further the relation between entropy functionals and generalized means.
Measure-theoretic formulations
In Chapter 3, we extended the discrete-case definitions of generalized information measures to the measure-theoretic case. We showed that, as in the case of Kullback-Leibler relative-entropy, generalized relative-entropies, whether Rényi or Tsallis, can be naturally extended from the discrete case to the measure-theoretic case, in the sense that the measure-theoretic definitions can be derived as limits of sequences of finite discrete entropies of pmfs which approximate the pdfs involved. We also showed that ME prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case, which is also true for measure-theoretic Shannon entropy.
GYP-theorem
The Gelfand-Yaglom-Perez theorem for KL-entropy not only equips it with a fundamental definition but also provides a means to compute KL-entropy and study its behavior. We stated and proved the GYP-theorem for generalized relative-entropies of order α > 1 (q > 1 in the Tsallis case) in Chapter 3. However, results for the case 0 < α < 1 are yet to be obtained.
q-product representation of Tsallis minimum entropy distribution
Tsallis relative-entropy minimization in both cases, q-expectations and normalized q-expectations, has been studied, and some significant differences with the classical case are presented in Chapter 4. We showed that, unlike in the classical case, minimizing Tsallis relative-entropy is not equivalent to maximizing entropy when the prior is a uniform distribution. Our use of the q-product in the representation of Tsallis minimum entropy distributions not only provides an elegant representation but also simplifies the calculations in the study of its properties and in deriving the expressions for the minimum relative-entropy and the corresponding thermodynamic equations.
The detailed study of Tsallis relative-entropy minimization in the case of normalized q-expected values and the computation of the corresponding minimum relative-entropy distribution (where one has to address the self-referential nature of the probabilities), based on the Tsallis et al. (1998) and Martínez et al. (2000) formalisms for Tsallis entropy maximization, is currently under investigation. Considering the various fields to which Tsallis generalized statistics has been applied, studies of applications of Tsallis relative-entropy minimization to various inference problems are of particular relevance.
Nonextensive Pythagoras’ theorem
Pythagoras' theorem for relative-entropy plays an important role in geometrical approaches to statistical estimation theory such as information geometry. In Chapter 4 we proved Pythagoras' theorem in the nonextensive case, i.e., for Tsallis relative-entropy minimization. In our opinion, this result is yet another remarkable and consistent generalization exhibited by the Tsallis formalism.
Use of power-law distributions in EAs
Inspired by the generalization of simulated annealing reported by Tsallis and Stariolo (1996), in Chapter 5 we proposed a generalized evolutionary algorithm based on Tsallis statistics. The algorithm uses the Tsallis canonical probability distribution instead of the Boltzmann distribution. Since these distributions are maximum entropy distributions, we presented an information-theoretic justification for using Boltzmann selection in evolutionary algorithms – prior to this, Boltzmann selection was viewed only as a special case of proportionate selection with exponential scaling. This should encourage the use of information-theoretic methods in evolutionary computation.
We tested our algorithm on some benchmark test functions. We found that, with an appropriate choice of the nonextensive index q, evolutionary algorithms based on Tsallis statistics outperform those based on the Gibbs-Boltzmann distribution. We believe the Tsallis canonical distribution is a powerful technique for selection in evolutionary algorithms.
6.2 Future Directions
There are two fundamental spaces in machine learning. The first space X consists of data points and the second space Θ consists of possible learning models. In statistical learning, Θ is usually a space of statistical models, {p(x; θ) : θ ∈ Θ} in the generative case or {p(y|x; θ) : θ ∈ Θ} in the discriminative case. Learning algorithms select a model θ ∈ Θ based on the training examples $\{x_k\}_{k=1}^{n} \subset X$ or $\{(x_k, y_k)\}_{k=1}^{n} \subset X\times Y$, depending on whether the generative or the discriminative case is considered.
Applying differential geometry – the mathematical theory of geometry on smooth, locally Euclidean spaces – to the space of probability distributions, and thus to statistical models, is the fundamental technique of information geometry. Information plays two roles in it: Kullback-Leibler relative-entropy features as a measure of divergence, and Fisher information takes the role of curvature.
The ME-principle enters information geometry for two reasons. One is the Pythagoras' theorem of relative-entropy minimization. The other is due to the work of Amari (2001), who showed that ME distributions are exactly the ones with minimal interaction between their variables – these are close to independence. This result plays an important role in geometric approaches to machine learning. Now, equipped with the nonextensive Pythagoras' theorem in the generalized Tsallis case, it is interesting to ask what geometry results when we use generalized information measures, and what role the entropic index plays in that geometry.
Another open problem concerning generalized information measures is the kind of constraints one should use in ME-prescriptions. At present, ME-prescriptions for Tsallis entropy come in three flavors, corresponding to the kind of constraints used to derive the canonical distribution: conventional expectations (Tsallis, 1988), q-expectation values (Curado-Tsallis, 1991), and normalized q-expectation values (Tsallis-Mendes-Plastino, 1998). The question of which constraints to use remains open and has so far been addressed only in the context of thermodynamics.
Boghosian (1996) suggested that the entropy functional and the constraints one uses should be considered as axioms, meaning that their validity is to be decided solely by the conclusions to which they lead and ultimately by comparison with experiment. A practical study of this question in problems of estimating probability distributions by ME of Tsallis entropy might throw some light on it.
Moving on to another problem, we have noted that Tsallis entropy can be written as a Kolmogorov-Nagumo function of Rényi entropy. We have also seen that the same function is KN-equivalent to the function used in the generalized averaging of Hartley information to derive Rényi entropy. This suggests that generalized averages may play a role in describing the operational significance of Tsallis entropy, an explanation of which still eludes us.
Finally, though the Rényi information measure offers a very natural – and perhaps conceptually the cleanest – setting for the generalization of entropy, and while the Tsallis generalization too can be put in a somewhat formal setting with q-generalizations of functions, we still do not fully understand the relevance – operational, axiomatic or mathematical – of the entropic indices α in Rényi entropy and q in Tsallis entropy. This is easily the most challenging problem before us.
6.3 Concluding Thought
Mathematical formalism plays an important role not only in physical theories but also in theories of information phenomena; undisputed examples are the Shannon theory of information and the Kolmogorov theory of complexity. One can make further advances in these theories by, as Dirac (1939, 1963) suggested for the advancement of theoretical physics, employing all the resources of pure mathematics in an attempt to perfect and generalize the existing mathematical formalism.
While operational and axiomatic justifications lay the foundations, the study of the "mathematical significance" of these generalized concepts forms the pillars on which one can develop the generalized theory. The ultimate fruits of this labour include a better understanding of the phenomena in question, better solutions for related practical problems – perhaps what Wigner (1960) called the unreasonable effectiveness of mathematics – and, finally, its own beauty.
Bibliography
Aarts, E., & Korst, J. (1989).
Simulated Annealing and Boltzmann Machines–A
Stochastic Approach to Combinatorial Optimization and Neural Computing.
Wiley, New York.
Abe, S., & Suzuki, N. (2004). Scale-free network of earthquakes. Europhysics Letters,
65(4), 581–586.
Abe, S. (2000). Axioms and uniqueness theorem for Tsallis entropy. Physics Letters
A, 271, 74–79.
Abe, S. (2003). Geometry of escort distributions. Physical Review E, 68, 031101.
Abe, S., & Bagci, G. B. (2005). Necessity of q-expectation value in nonextensive
statistical mechanics. Physical Review E, 71, 016139.
Aczél, J. (1948). On mean values. Bull. Amer. Math. Soc., 54, 392–400.
Aczél, J., & Daróczy, Z. (1975). On Measures of Information and Their Characterization. Academic Press, New York.
Agmon, N., Alhassid, Y., & Levine, R. D. (1979). An algorithm for finding the distribution of maximal entropy. Journal of Computational Physics, 30, 250–258.
Amari, S. (2001). Information geometry on hierarchy of probability distributions.
IEEE Transactions on Information Theory, 47, 1701–1711.
Amari, S. (1985). Differential-Geometric Methods in Statistics, Vol. 28 of Lecture
Notes in Statistics. Springer-Verlag, Heidelberg.
Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry, Vol. 191 of
Translations of Mathematical Monographs. Oxford University Press, Oxford.
Amblard, P.-O., & Vignat, C. (2005). A note on bounded entropies. arXiv:condmat/0509733.
Andricioaei, I., & Straub, J. E. (1996). Generalized simulated annealing algorithms using Tsallis statistics: Application to conformational optimization of a tetrapeptide. Physical Review E, 53(4), 3055–3058.
Andricioaei, I., & Straub, J. E. (1997). On Monte Carlo and molecular dynamics methods inspired by Tsallis statistics: Methodology, optimization, and application to
atomic clusters. J. Chem. Phys., 107(21), 9117–9124.
Arimitsu, T., & Arimitsu, N. (2000). Tsallis statistics and fully developed turbulence.
J. Phys. A: Math. Gen., 33(27), L235.
Arimitsu, T., & Arimitsu, N. (2001). Analysis of turbulence by statistics based on
generalized entropies. Physica A, 295, 177–194.
Arndt, C. (2001). Information Measures: Information and its Description in Science
and Engineering. Springer, Berlin.
Ash, R. B. (1965). Information Theory. Interscience, New York.
Athreya, K. B. (1994). Entropy maximization. IMA preprint series 1231, Institute for
Mathematics and its Applications, University of Minnesota, Minneapolis.
Back, T. (1994). Selective pressure in evolutionary algorithms: A characterization of
selection mechanisms. In Proceedings of the First IEEE Conference on Evolutionary Computation, pp. 57–62 Piscataway, NJ. IEEE Press.
Back, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1), 3–17.
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks.
Science, 286, 509–512.
Barlow, H. (1990). Conditions for versatile learning, Helmholtz’s unconscious inference and the test of perception. Vision Research, 30, 1561–1572.
Bashkirov, A. G. (2004). Maximum Rényi entropy principle for systems with powerlaw hamiltonians. Physical Review Letters, 93, 130601.
Ben-Bassat, M., & Raviv, J. (1978). Rényi’s entropy and the probability of error. IEEE
Transactions on Information Theory, IT-24(3), 324–331.
Ben-Tal, A. (1977). On generalized means and generalized convex functions. Journal
of Optimization: Theory and Application, 21, 1–13.
Bhattacharyya, A. (1943). On a measure on divergence between two statistical populations defined by their probability distributions. Bull. Calcutta. Math. Soc., 35,
99–109.
Bhattacharyya, A. (1946). On some analogues of the amount of information and their
use in statistical estimation. Sankhya, 8, 1–14.
Billingsley, P. (1960). Hausdorff dimension in probability theory. Illinois Journal of
Mathematics, 4, 187–209.
Billingsley, P. (1965). Ergodic Theory and Information. John Wiley & Songs, Toronto.
Boghosian, B. M. (1996).
Thermodynamic description of the relaxation of two-
dimensional turbulence using Tsallis statistics. Physical Review E, 53, 4754.
Borges, E. P. (2004). A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A, 340, 95–101.
Borland, L., Plastino, A. R., & Tsallis, C. (1998). Information gain within nonextensive thermostatistics. Journal of Mathematical Physics, 39(12), 6490–6501.
Bounds, D. G. (1987). New optimization methods from physics and biology. Nature,
329, 215.
Campbell, L. L. (1965). A coding theorem and Rényi’s entropy. Information and
Control, 8, 423–429.
Campbell, L. L. (1985). The relation between information theory and the differential
geometry approach to statistics. Information Sciences, 35(3), 195–210.
Campbell, L. L. (1992). Minimum relative entropy and Hausdorff dimension. Internat.
J. Math. & Stat. Sci., 1, 35–46.
Campbell, L. L. (2003). Geometric ideas in minimum cross-entropy. In Karmeshu
(Ed.), Entropy Measures, Maximum Entropy Principle and Emerging Applications, pp. 103–114. Springer-Verlag, Berlin Heidelberg.
Caticha, A., & Preuss, R. (2004). Maximum entropy and Bayesian data analysis:
Entropic prior distributions. Physical Review E, 70, 046127.
Čencov, N. N. (1982). Statistical Decision Rules and Optimal Inference, Vol. 53 of
Translations of Mathematical Monographs. Amer. Math. Soc., Providence RI.
Cercueil, A., & Francois, O. (2001). Monte Carlo simulation and population-based optimization. In Proceedings of the 2001 Congress on Evolutionary Computation
(CEC2001), pp. 191–198. IEEE Press.
Cerf, R. (1996a). The dynamics of mutation-selection algorithms with large population
sizes. Ann. Inst. H. Poincaré, 32, 455–508.
Cerf, R. (1996b). A new genetic algorithm. Ann. Appl. Probab., 6, 778–817.
Cherney, A. S., & Maslov, V. P. (2004). On minimization and maximization of entropy
in various disciplines. SIAM journal of Theory of Probability and Its Applications, 48(3), 447–464.
Chew, S. H. (1983). A generalization of the quasilinear mean with applications to
the measurement of income inequality and decision theory resolving the allais
paradox. Econometrica, 51(4), 1065–1092.
Costa, J. A., Hero, A. O., & Vignat, C. (2002). A characterization of the multivariate
distributions maximizing Rényi entropy. In Proceedings of IEEE International
Symposium on Information Theory(ISIT), pp. 263–263. IEEE Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley, New
York.
Cover, T. M., Gacs, P., & Gray, R. M. (1989). Kolmogorov’s contributions to information theory and algorithmic complexity. The Annals of Probability, 17(3),
840–865.
Csiszár, I. (1969). On generalized entropy. Studia Sci. Math. Hungar., 4, 401–419.
Csiszár, I. (1974). Information measures: A critical survey. In Information Theory, Statistical Decision Functions and Random Processes, Vol. B, pp. 73–86.
Academia Praha, Prague.
Csiszár, I. (1975). I-divergence of probability distributions and minimization problems. Ann. Prob., 3(1), 146–158.
Curado, E. M. F., & Tsallis, C. (1991). Generalized statistical mechanics: Connections
with thermodynamics. J. Phys. A: Math. Gen., 24, 69–72.
Czachor, M., & Naudts, J. (2002). Thermostatistics based on Kolmogorov-Nagumo
averages: Unifying framework for extensive and nonextensive generalizations.
Physics Letters A, 298, 369–374.
Daróczy, Z. (1970). Generalized information functions. Information and Control, 16,
36–51.
Davis, H. (1941). The Theory of Econometrics. Principia Press, Bloomington, IN.
de Finetti, B. (1931). Sul concetto di media. Giornale di Istituto Italiano dei Attuarii,
2, 369–396.
de la Maza, M., & Tidor, B. (1993). An analysis of selection procedures with particular
attention paid to proportional and Boltzmann selection. In Forrest, S. (Ed.),
Proceedings of the Fifth International Conference on Genetic Algorithms, pp.
124–131 San Mateo, CA. Morgan Kaufmann Publishers.
Dirac, P. A. M. (1939). The relation between mathematics and physics. Proceedings
of the Royal Society of Edinburgh, 59, 122–129.
Dirac, P. A. M. (1963). The evolution of the physicist’s picture of nature. Scientific
American, 208, 45–53.
Dobrushin, R. L. (1959). General formulations of Shannon’s basic theorems of the
theory of information. Usp. Mat. Nauk., 14(6), 3–104.
dos Santos, R. J. V. (1997). Generalization of Shannon’s theorem for Tsallis entropy.
Journal of Mathematical Physics, 38, 4104–4107.
Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006a). Gelfand-Yaglom-Perez theorem for generalized relative entropies. arXiv:math-ph/0601035.
Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006b). On measure theoretic definitions of generalized information measures and maximum entropy prescriptions.
arXiv:cs.IT/0601080. (Submitted to Physica A).
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2004). Cauchy annealing schedule: An annealing schedule for Boltzmann selection scheme in evolutionary
algorithms. In Proceedings of the IEEE Congress on Evolutionary Computation(CEC), Vol. 1, pp. 55–62. IEEE Press.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005a). Information theoretic justification of Boltzmann selection and its generalization to Tsallis case. In Proceedings of the IEEE Congress on Evolutionary Computation(CEC), Vol. 2, pp.
1667–1674. IEEE Press.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005b). Properties of Kullback-Leibler
cross-entropy minimization in nonextensive framework. In Proceedings of IEEE
International Symposium on Information Theory(ISIT), pp. 2374–2378. IEEE
Press.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006a). Nonextensive triangle equality
and other properties of Tsallis relative-entropy minimization. Physica A, 361,
124–138.
Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006b). Uniqueness of nonextensive
entropy under rényi’s recipe. arXiv:cs.IT/05511078.
Ebanks, B., Sahoo, P., & Sander, W. (1998). Characterizations of Information Measures. World Scientific, Singapore.
Eggleston, H. G. (1952). Sets of fractional dimension which occur in some problems
of number theory. Proc. London Math. Soc., 54(2), 42–93.
Elsasser, W. M. (1937). On quantum measurements and the role of the uncertainty
relations in statistical mechanics. Physical Review, 52, 987–999.
Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion and the temporal behavior of consumption and asset returns: A theoretical framework. Econometrica,
57, 937–970.
Faddeev, D. K. (1986). On the concept of entropy of a finite probabilistic scheme
(Russian). Uspehi Mat. Nauk (N.S), 11, 227–231.
Ferri, G. L., Martı́nez, S., & Plastino, A. (2005). The role of constraints in Tsallis’
nonextensive treatment revisited. Physica A, 347, 205–220.
Fishburn, P. C. (1986). Implicit mean value and certainty equivalence. Econometrica,
54(5), 1197–1206.
Fogel, D. B. (1994). An introduction to simulated evolutionary optimization. IEEE
Transactions on Neural Networks, 5(1), 3–14.
Forte, B., & Ng, C. T. (1973). On a characterization of the entropies of type β. Utilitas
Math., 4, 193–205.
Furuichi, S. (2005). On uniqueness theorem for Tsallis entropy and Tsallis relative
entropy. IEEE Transactions on Information Theory, 51(10), 3638–3645.
Furuichi, S. (2006). Information theoretical properties of Tsallis entropies. Journal of
Mathematical Physics, 47, 023302.
Furuichi, S., Yanagi, K., & Kuriyama, K. (2004). Fundamental properties of Tsallis
relative entropy. Journal of Mathematical Physics, 45, 4868–4877.
Gelfand, I. M., Kolmogorov, A. N., & Yaglom, A. M. (1956). On the general definition
of the amount of information. Dokl. Akad. Nauk USSR, 111(4), 745–748. (In
Russian).
Gelfand, I. M., & Yaglom, A. M. (1959). Calculation of the amount of information
about a random function contained in another such function. Usp. Mat. Nauk,
12(1), 3–52. (English translation in American Mathematical Society Translations, Providence, R.I. Series 2, vol. 12).
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the
Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6(6),
721–741.
Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 158–171.
Good, I. J. (1963). Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann. Math. Statist., 34, 911–934.
Gray, R. M. (1990). Entropy and Information Theory. Springer-Verlag, New York.
Grendár jr, M., & Grendár, M. (2001). Maximum entropy: Clearing up mysteries.
Entropy, 3(2), 58–63.
Guiaşu, S. (1977). Information Theory with Applications. McGraw-Hill, Great Britain.
Halsey, T. C., Jensen, M. H., Kadanoff, L. P., Procaccia, I., & Shraiman, B. I. (1986).
Fractal measures and their singularities: The characterization of strange sets.
Physical Review A, 33, 1141–1151.
Hardy, G. H., Littlewood, J. E., & Pólya, G. (1934). Inequalities. Cambridge.
Harremoës, P., & Topsøe, F. (2001). Maximum entropy fundamentals. Entropy, 3,
191–226.
Hartley, R. V. L. (1928). Transmission of information. Bell System Technical Journal,
7, 535.
Havrda, J., & Charvát, F. (1967). Quantification method of classification process:
Concept of structural α-entropy. Kybernetika, 3, 30–35.
Hinčin, A. (1953). The concept of entropy in the theory of probability (Russian).
Uspehi Mat. Nauk, 8(3), 3–28. (English transl.: In Mathematical Foundations
of Information Theory, pp. 1-28. Dover, New York, 1957).
Hobson, A. (1969). A new theorem of information theory. J. Stat. Phys., 1, 383–391.
Hobson, A. (1971). Concepts in Statistical Mechanics. Gordon and Breach, New York.
Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. The University of
Michigan Press, Ann Arbor, MI.
Ireland, C., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55, 179–188.
Jaynes, E. T. (1957a). Information theory and statistical mechanics I. Physical Review,
106(4), 620–630.
Jaynes, E. T. (1957b). Information theory and statistical mechanics II. Physical Review,
108(4), 171–190.
Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and
Cybernetics, SSC-4(3), 227–241.
Jeffreys, H. (1948). Theory of Probability (2nd Edition). Oxford Clarendon Press.
Jizba, P., & Arimitsu, T. (2004a). Observability of Rényi’s entropy. Physical Review
E, 69, 026128.
Jizba, P., & Arimitsu, T. (2004b). The world according to Rényi: thermodynamics of
fractal systems. Annals of Physics, 312, 17–59.
Johnson, O., & Vignat, C. (2005). Some results concerning maximum Rényi entropy
distributions. math.PR/0507400.
Johnson, R., & Shore, J. (1983). Comments on and correction to ‘Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy’. IEEE Transactions on Information Theory,
29(6), 942–943.
Kallianpur, G. (1960). On the amount of information contained in a σ-field. In Olkin,
I., & Ghurye, S. G. (Eds.), Essays in Honor of Harold Hotelling, pp. 265–273.
Stanford Univ. Press, Stanford.
Kamimura, R. (1998). Minimizing α-information for generalization and interpretation.
Algorithmica, 22(1/2), 173–197.
Kantorovitz, S. (2003). Introduction to Modern Analysis. Oxford, New York.
Kapur, J. N. (1994). Measures of Information and their Applications. Wiley, New
York.
Kapur, J. N., & Kesavan, H. K. (1997). Entropy Optimization Principles with Applications. Academic Press.
Karmeshu, & Sharma, S. (2006). Queue length distribution of network packet traffic:
Tsallis entropy maximization with fractional moments. IEEE Communications
Letters, 10(1), 34–36.
Khinchin, A. I. (1956). Mathematical Foundations of Information Theory. Dover, New
York.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated
annealing. Science, 220(4598), 671–680.
Kolmogorov, A. N. (1930). Sur la notion de la moyenne. Atti della R. Accademia
Nazionale dei Lincei, 12, 388–391.
Kolmogorov, A. N. (1957). Theorie der Nachrichtenübermittlung. In Grell, H. (Ed.),
Arbeiten zur Informationstheorie, Vol. 1. Deutscher Verlag der Wissenschaften,
Berlin.
Kotz, S. (1966). Recent results in information theory. Journal of Applied Probability,
3(1), 1–93.
Kreps, D. M., & Porteus, E. L. (1978). Temporal resolution of uncertainty and dynamic
choice theory. Econometrica, 46, 185–200.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math.
Stat., 22, 79–86.
Lavenda, B. H. (1998). The analogy between coding theory and multifractals. Journal
of Physics A: Math. Gen., 31, 5651–5660.
Lazo, A. C. G. V., & Rathie, P. N. (1978). On the entropy of continuous probability
distributions. IEEE Transactions on Information Theory, IT-24(1), 120–122.
Maassen, H., & Uffink, J. B. M. (1988). Generalized entropic uncertainty relations.
Physical Review Letters, 60, 1103–1106.
Mahnig, T., & Mühlenbein, H. (2001). A new adaptive Boltzmann selection schedule
SDS. In Proceedings of the Congress on Evolutionary Computation (CEC’2001),
pp. 183–190. IEEE Press.
Markel, J. D., & Gray, A. H. (1976). Linear Prediction of Speech. Springer-Verlag,
New York.
Martı́nez, S., Nicolás, F., Pennini, F., & Plastino, A. (2000). Tsallis’ entropy maximization procedure revisited. Physica A, 286, 489–502.
Masani, P. R. (1992a). The measure-theoretic aspects of entropy, Part 1. Journal of
Computational and Applied Mathematics, 40, 215–232.
Masani, P. R. (1992b). The measure-theoretic aspects of entropy, Part 2. Journal of
Computational and Applied Mathematics, 44, 245–260.
Mead, L. R., & Papanicolaou, N. (1984). Maximum entropy in the problem of moments. Journal of Mathematical Physics, 25(8), 2404–2417.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculation by fast computing machines. Journal of Chemical
Physics, 21, 1087–1092.
Morales, D., Pardo, L., Pardo, M. C., & Vajda, I. (2004). Rényi statistics for testing
composite hypotheses in general exponential models. Journal of Theoretical
and Applied Statistics, 38(2), 133–147.
Moret, M. A., Pascutti, P. G., Bisch, P. M., & Mundim, K. C. (1998). Stochastic molecular optimization using generalized simulated annealing. J. Comp. Chemistry,
19, 647.
Mühlenbein, H., & Schlierkamp-Voosen, D. (1993). Predictive models for the breeder
genetic algorithm. Evolutionary Computation, 1(1), 25–49.
Nagumo, M. (1930). Über eine Klasse von Mittelwerten. Japanese Journal of Mathematics, 7, 71–79.
Naranan, S. (1970). Bradford’s law of bibliography of science: an interpretation.
Nature, 227, 631.
Nivanen, L., Méhauté, A. L., & Wang, Q. A. (2003). Generalized algebra within a
nonextensive statistics. Rep. Math. Phys., 52, 437–444.
Norris, N. (1976). General means and statistical theory. The American Statistician,
30, 1–12.
Nulton, J. D., & Salamon, P. (1988). Statistical mechanics of combinatorial optimization. Physical Review A, 37(4), 1351–1356.
Ochs, W. (1976). Basic properties of the generalized Boltzmann-Gibbs-Shannon entropy. Reports on Mathematical Physics, 9, 135–155.
Ormoneit, D., & White, H. (1999). An efficient algorithm to compute maximum entropy densities. Econometric Reviews, 18(2), 127–140.
Ostasiewicz, S., & Ostasiewicz, W. (2000). Means and their applications. Annals of
Operations Research, 97, 337–355.
Penna, T. J. P. (1995). Traveling salesman problem and Tsallis statistics. Physical
Review E, 51, R1.
Perez, A. (1959). Information theory with abstract alphabets. Theory of Probability
and its Applications, 4(1).
Pinsker, M. S. (1960a). Dynamical systems with completely positive or zero entropy.
Soviet Math. Dokl., 1, 937.
Pinsker, M. S. (1960b). Information and Information Stability of Random Variables
and Process. Holden-Day, San Francisco, CA. (English ed., 1964, translated
and edited by Amiel Feinstein).
Prügel-Bennett, A., & Shapiro, J. (1994). Analysis of genetic algorithms using statistical mechanics. Physical Review Letters, 72(9), 1305–1309.
Queirós, S. M. D., Anteneodo, C., & Tsallis, C. (2005). Power-law distributions in
economics: a nonextensive statistical approach. In Abbott, D., Bouchaud, J.-P.,
Gabaix, X., & McCauley, J. L. (Eds.), Noise and Fluctuations in Econophysics
and Finance, pp. 151–164. SPIE, Bellingham, WA.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical
parameters. Bull. Calcutta Math. Soc., 37, 81–91.
Rebollo-Neira, L. (2001). Nonextensive maximum-entropy-based formalism for data
subset selection. Physical Review E, 65, 011113.
Rényi, A. (1959). On the dimension and entropy of probability distributions. Acta
Math. Acad. Sci. Hung., 10, 193–215. (reprinted in (Turán, 1976), pp. 320-342).
Rényi, A. (1960). Some fundamental questions of information theory. MTA III. Oszt.
Közl., 10, 251–282. (reprinted in (Turán, 1976), pp. 526-552).
Rényi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability,
pp. 547–561, Berkeley-Los Angeles. University of California Press. (reprinted in (Turán, 1976), pp. 565-580).
Rényi, A. (1965). On the foundations of information theory. Rev. Inst. Internat. Stat.,
33, 1–14. (reprinted in (Turán, 1976), pp. 304-317).
Rényi, A. (1970). Probability Theory. North-Holland, Amsterdam.
Rosenblatt-Roth, M. (1964). The concept of entropy in probability theory and its
applications in the theory of information transmission through communication
channels. Theory Probab. Appl., 9(2), 212–235.
Rudin, W. (1964). Real and Complex Analysis. McGraw-Hill. (International edition,
1987).
Ryu, H. K. (1993). Maximum entropy estimation of density and regression functions.
Journal of Econometrics, 56, 397–440.
Sanov, I. N. (1957). On the probability of large deviations of random variables. Mat.
Sbornik, 42, 11–44. (in Russian).
Schützenberger, M. B. (1954). Contribution aux applications statistiques de la théorie
de l’information. Publ. l’Institut Statist. de l’Université de Paris, 3, 3–117.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379.
Shannon, C. E. (1956). The bandwagon (edtl.). IEEE Transactions on Information
Theory, 2, 3–3.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication.
University of Illinois Press, Urbana, Illinois.
Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions
on Information Theory, IT-26(1), 26–37. (See (Johnson & Shore, 1983) for
comments and corrections.).
Shore, J. E. (1981a). Minimum cross-entropy spectral analysis. IEEE Transactions on
Acoustics Speech and Signal processing, ASSP-29, 230–237.
Shore, J. E. (1981b). Properties of cross-entropy minimization. IEEE Transactions on
Information Theory, IT-27(4), 472–482.
Shore, J. E., & Gray, R. M. (1982). Minimum cross-entropy pattern classification and
cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(1), 11–18.
Skilling, J. (1984). The maximum entropy method. Nature, 309, 748.
Smith, J. D. H. (2001). Some observations on the concepts of information theoretic
entropy and randomness. Entropy, 3, 1–11.
Stariolo, D. A., & Tsallis, C. (1995). Optimization by simulated annealing: Recent
progress. In Stauffer, D. (Ed.), Annual Reviews of Computational Physics, Vol. 2,
p. 343. World Scientific, Singapore.
Sutton, P., Hunter, D. L., & Jan, N. (1994). The ground state energy of the ±J spin
glass from the genetic algorithm. Journal de Physique I France, 4, 1281–1285.
Suyari, H. (2002). Nonextensive entropies derived from form invariance of pseudoadditivity. Physical Review E, 65, 066118.
Suyari, H. (2004a). Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. IEEE Transactions on Information Theory, 50(8), 1783–1787.
Suyari, H. (2004b). q-Stirling’s formula in Tsallis statistics. cond-mat/0401541.
Suyari, H., & Tsukada, M. (2005). Law of error in Tsallis statistics. IEEE Transactions
on Information Theory, 51(2), 753–757.
Teweldeberhan, A. M., Plastino, A. R., & Miller, H. G. (2005). On the cut-off prescriptions associated with power-law generalized thermostatistics. Physics Letters A,
343, 71–78.
Tikochinsky, Y., Tishby, N. Z., & Levine, R. D. (1984). Consistent inference of probabilities for reproducible experiments. Physical Review Letters, 52, 1357–1360.
Topsøe, F. (2001). Basic concepts, identities and inequalities - the toolkit of information theory. Entropy, 3, 162–190.
Tsallis, C. (1988). Possible generalization of Boltzmann Gibbs statistics. J. Stat. Phys.,
52, 479.
Tsallis, C. (1994). What are the numbers that experiments provide? Quimica Nova,
17, 468.
Tsallis, C., & de Albuquerque, M. P. (2000). Are citations of scientific papers a case
of nonextensivity? Eur. Phys. J. B, 13, 777–780.
Tsallis, C. (1998). Generalized entropy-based criterion for consistent testing. Physical
Review E, 58, 1442–1445.
Tsallis, C. (1999). Nonextensive statistics: Theoretical, experimental and computational evidences and connections. Brazilian Journal of Physics, 29, 1.
Tsallis, C., Levy, S. V. F., Souza, A. M. C., & Maynard, R. (1995). Statistical-mechanical foundation of the ubiquity of Lévy distributions in nature. Physical
Review Letters, 75, 3589–3593.
Tsallis, C., Mendes, R. S., & Plastino, A. R. (1998). The role of constraints within
generalized nonextensive statistics. Physica A, 261, 534–554.
Tsallis, C., & Stariolo, D. A. (1996). Generalized simulated annealing. Physica A,
233, 345–406.
Turán, P. (Ed.). (1976). Selected Papers of Alfréd Rényi. Akademia Kiado, Budapest.
Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency
requirement? Studies in History and Philosophy of Modern Physics, 26, 223–
261.
Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in
History and Philosophy of Modern Physics, 27, 47–79.
Vignat, C., Hero, A. O., & Costa, J. A. (2004). About closedness by convolution of
the Tsallis maximizers. Physica A, 340, 147–152.
Wada, T., & Scarfone, A. M. (2005). Connections between Tsallis’ formalism employing the standard linear average energy and ones employing the normalized
q-average energy. Physics Letters A, 335, 351–362.
Watanabe, S. (1969). Knowing and Guessing. Wiley.
Wehrl, A. (1991). The many facets of entropy. Reports on Mathematical Physics, 30,
119–129.
Wiener, N. (1948). Cybernetics. Wiley, New York.
Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural
sciences. Communications on Pure and Applied Mathematics, 13, 1–14.
Wu, X. (2003). Calculation of maximum entropy densities with application to income
distribution. Journal of Econometrics, 115, 347–354.
Yamano, T. (2001). Information theory based on nonadditive information content.
Physical Review E, 63, 046105.
Yamano, T. (2002). Some properties of q-logarithm and q-exponential functions in
Tsallis statistics. Physica A, 305, 486–496.
Yu, Z. X., & Mo, D. (2003). Generalized simulated annealing algorithm applied in the
ellipsometric inversion problem. Thin Solid Films, 425, 108.
Zellner, A., & Highfield, R. A. (1988). Calculation of maximum entropy distributions
and approximation of marginal posterior distributions. Journal of Econometrics,
37, 195–209.
Zitnick, C. (2003). Computing Conditional Probabilities in Large Domains by Maximizing Renyi’s Quadratic Entropy. Ph.D. thesis, Robotics Institute, Carnegie
Mellon University.