On Generalized Measures of Information with Maximum and Minimum Entropy Prescriptions

A Thesis Submitted For the Degree of Doctor of Philosophy in the Faculty of Engineering

by Ambedkar Dukkipati

Computer Science and Automation
Indian Institute of Science
Bangalore – 560 012

March 2006

Abstract

The Kullback-Leibler relative-entropy or KL-entropy of P with respect to R, defined as
$$ \int_X \ln \frac{dP}{dR} \, dP , $$
where P and R are probability measures on a measurable space (X, M), plays a basic role in the definitions of classical information measures. It overcomes a shortcoming of Shannon entropy, whose discrete case definition cannot be extended to the nondiscrete case naturally. Further, entropy and other classical information measures can be expressed in terms of KL-entropy, and hence properties of their measure-theoretic analogs will follow from those of measure-theoretic KL-entropy. An important theorem in this respect is the Gelfand-Yaglom-Perez (GYP) theorem, which equips KL-entropy with a fundamental definition and can be stated as: measure-theoretic KL-entropy equals the supremum of KL-entropies over all measurable partitions of X. In this thesis we provide the measure-theoretic formulations for 'generalized' information measures, and state and prove the corresponding GYP-theorem – the 'generalizations' being in the sense of Rényi and nonextensive, both of which are explained below.

The Kolmogorov-Nagumo average or quasilinear mean of a vector $x = (x_1, \ldots, x_n)$ with respect to a pmf $p = (p_1, \ldots, p_n)$ is defined as $\langle x \rangle_\psi = \psi^{-1}\!\left(\sum_{k=1}^{n} p_k \psi(x_k)\right)$, where $\psi$ is an arbitrary continuous and strictly monotone function. Replacing linear averaging in Shannon entropy with Kolmogorov-Nagumo averages (KN-averages) and further imposing the additivity constraint – a characteristic property of the underlying information associated with a single event, which is logarithmic – leads to the definition of α-entropy or Rényi entropy. This is the first formal well-known generalization of Shannon entropy. Using this recipe of Rényi's generalization, one can prepare only two information measures: Shannon and Rényi entropy. Indeed, using this formalism Rényi characterized these additive entropies in terms of axioms of KN-averages. On the other hand, if one generalizes the information of a single event in the definition of Shannon entropy by replacing the logarithm with the so-called q-logarithm, defined as $\ln_q x = \frac{x^{1-q} - 1}{1-q}$, one gets what is known as Tsallis entropy. Tsallis entropy is also a generalization of Shannon entropy, but it does not satisfy the additivity property. Instead, it satisfies pseudo-additivity of the form $x \oplus_q y = x + y + (1-q)xy$, and hence it is also known as nonextensive entropy. One can apply Rényi's recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with KN-averages and thereby imposing the constraint of pseudo-additivity. A natural question that arises is: what are the various pseudo-additive information measures that can be prepared with this recipe? We prove that Tsallis entropy is the only one. Here, we mention that one of the important characteristics of this generalized entropy is that while canonical distributions resulting from 'maximization' of Shannon entropy are exponential in nature, in the Tsallis case they result in power-law distributions.

The concept of maximum entropy (ME), originally from physics, has been promoted to a general principle of inference primarily by the works of Jaynes and (later on) Kullback.
This connects information theory and statistical mechanics via the principle that the states of thermodynamic equilibrium are states of maximum entropy, and further connects to statistical inference via the rule: select the probability distribution that maximizes the entropy. The two fundamental principles related to the concept of maximum entropy are Jaynes' maximum entropy principle, which involves maximizing Shannon entropy, and the Kullback minimum entropy principle, which involves minimizing relative-entropy, with respect to appropriate moment constraints. Though relative-entropy is not a metric, in cases involving distributions resulting from relative-entropy minimization, one can bring forth certain geometrical formulations. These are reminiscent of squared Euclidean distance and satisfy an analogue of Pythagoras' theorem. This property is referred to as the Pythagoras' theorem of relative-entropy minimization or triangle equality, and plays a fundamental role in geometrical approaches to statistical estimation theory like information geometry. In this thesis we state and prove the equivalent of Pythagoras' theorem in the nonextensive formalism. For this purpose we study relative-entropy minimization in detail and present some results.

Finally, we demonstrate the use of power-law distributions, resulting from ME-prescriptions of Tsallis entropy, in evolutionary algorithms. This work is motivated by the recently proposed generalized simulated annealing algorithm based on Tsallis statistics.

To sum up, in light of their well-known axiomatic and operational justifications, this thesis establishes some results pertaining to the mathematical significance of generalized measures of information. We believe that these results represent an important contribution towards the ongoing research on understanding the phenomenon of information.

To Bhirava Swamy and Bharati, who infected me with a disease called Life, and to all my Mathematics teachers, who taught me how to extract sweetness from it.

. . . lie down in a garden and extract from the disease, especially if it's not a real one, as much sweetness as possible. There's a lot of sweetness in it.
Franz Kafka, in a letter to Milena

Acknowledgements

No one deserves more thanks for the success of this work than my advisers Prof. M. Narasimha Murty and Dr. Shalabh Bhatnagar. I wholeheartedly thank them for their guidance. I thank Prof. Narasimha Murty for his continued support throughout my graduate student years. I always looked to him for advice – academic or non-academic. He has always been a very patient critic of my research approach and results; without his trust and guidance this thesis would not have been possible. I feel that I am more disciplined, simple and punctual after working under his guidance. The opportunity to watch Dr. Shalabh Bhatnagar in action (particularly during discussions) has fashioned my way of thought in problem solving. He has been a valuable adviser, and I hope my three and a half years of working with him have left me with at least a few of his qualities. I am thankful to the Chairman, Department of CSA, for all the support. I am privileged to have learnt mathematics from great teachers: Prof. Vittal Rao, Prof. Adi Murty and Prof. A. V. Gopala Krishna. I thank them for imbibing in me the rigour of mathematics. Special thanks are due to Prof. M. A. L. Thathachar for having taught me. I thank Dr. Christophe Vignat for his criticisms and encouraging advice on my papers. I wish to thank the CSA staff Ms. Lalitha, Ms.
Meenakshi and Mr. George for their great help with administrative work. I am thankful to all my labmates: Dr. Vishwanath, Asharaf, Shahid, Rahul and Dr. Vijaya, for their help. I also thank my institute friends Arjun, Raghav and Ranjna. I will never forget the time I spent with Asit, Aneesh, Gunti and Ravi. Special thanks to my music companions Raghav, Hari, Kripa, Manas and Niki. Thanks to all IISc Hockey Club members and my running mates Sai, Aneesh and Sunder. I thank Dr. Sai Jagan Mohan for correcting my drafts. Special thanks are due to Vinita, who corrected many of my drafts of papers and of this thesis, all the way from DC and WI. Thanks to Vinita, Moski and Madhulatha for their care. I am forever indebted to my sister Kalyani for her prayers. My special thanks are due to my sister Sasi and her husband and to my brother Karunakar and his wife. Thanks to my cousin Chinni for her special care. The three great new women in my life, my nieces Sanjana (3 years), Naomika (2 years) and Bhavana (3 months), will always be dear to me. I reserve my special love for my newborn nephew. I am indebted to my father for keeping his promise that he will continue to guide me even though he had to go to unreachable places. I owe everything to my mother for taking care of every need of mine. I dedicate this thesis to my parents and to my teachers.

Contents

Abstract
Acknowledgements
Notations
1 Prolegomenon
  1.1 Summary of Results
  1.2 Essentials
    1.2.1 What is Entropy?
    1.2.2 Why to maximize entropy?
  1.3 A reader's guide to the thesis
2 KN-averages and Entropies: Rényi's Recipe
  2.1 Classical Information Measures
    2.1.1 Shannon Entropy
    2.1.2 Kullback-Leibler Relative-Entropy
  2.2 Rényi's Generalizations
    2.2.1 Hartley Function and Shannon Entropy
    2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means
    2.2.3 Rényi Entropy
  2.3 Nonextensive Generalizations
    2.3.1 Tsallis Entropy
    2.3.2 q-Deformed Algebra
  2.4 Uniqueness of Tsallis Entropy under Rényi's Recipe
  2.5 A Characterization Theorem for Nonextensive Entropies
3 Measures and Entropies: Gelfand-Yaglom-Perez Theorem
  3.1 Measure Theoretic Definitions of Classical Information Measures
    3.1.1 Discrete to Continuous
    3.1.2 Classical Information Measures
    3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy
  3.2 Measure-Theoretic Definitions of Generalized Information Measures
  3.3 Maximum Entropy and Canonical Distributions
  3.4 ME-prescription for Tsallis Entropy
    3.4.1 Tsallis Maximum Entropy Distribution
    3.4.2 The Case of Normalized q-expectation values
  3.5 Measure-Theoretic Definitions Revisited
    3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies
    3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy
  3.6 Gelfand-Yaglom-Perez Theorem in the General Case
4 Geometry and Entropies: Pythagoras' Theorem
  4.1 Relative-Entropy Minimization in the Classical Case
    4.1.1 Canonical Minimum Entropy Distribution
    4.1.2 Pythagoras' Theorem
  4.2 Tsallis Relative-Entropy Minimization
    4.2.1 Generalized Minimum Relative-Entropy Distribution
    4.2.2 q-Product Representation for Tsallis Minimum Entropy Distribution
    4.2.3 Properties
    4.2.4 The Case of Normalized q-Expectations
  4.3 Nonextensive Pythagoras' Theorem
    4.3.1 Pythagoras' Theorem Restated
    4.3.2 The Case of q-Expectations
    4.3.3 In the Case of Normalized q-Expectations
5 Power-laws and Entropies: Generalization of Boltzmann Selection
  5.1 EAs based on Boltzmann Distribution
  5.2 EA based on Power-law Distributions
  5.3 Simulation Results
6 Conclusions
  6.1 Contributions of the Dissertation
  6.2 Future Directions
  6.3 Concluding Thought
Bibliography

Notations

R – The set (field) of real numbers
R+ – [0, ∞)
Z+ – The set of positive integers
2^X – Power set of the set X
#E – Cardinality of a set E
χ_E : X → {0, 1} – Characteristic function of a set E ⊆ X
(X, M) – Measurable space, where X is a nonempty set and M is a σ-algebra
a.e. – Almost everywhere
⟨X⟩ – Expectation of random variable X
EX – Expectation of random variable X
⟨X⟩_ψ – KN-average: expectation of random variable X with respect to a function ψ
⟨X⟩_q – q-expectation of random variable X
⟨⟨X⟩⟩_q – Normalized q-expectation of random variable X
ν ≪ µ – Measure ν is absolutely continuous w.r.t. measure µ
S – Shannon entropy functional
S_q – Tsallis entropy functional
S_α – Rényi entropy functional
Z – Partition function of maximum entropy distributions
Ẑ – Partition function of the minimum relative-entropy distribution

1 Prolegomenon

Abstract: This chapter serves as an introduction to the thesis. The purpose is to motivate the discussion on generalized information measures and their maximum entropy prescriptions by introducing, in broad brush-strokes, a picture of information theory and its relation with statistical mechanics and statistics. It also has a road-map of the thesis, which should serve as a reader's guide.

Having an obsession to quantify – to put it formally, to find a way of assigning a real number to (to measure) any phenomenon that we come across – it is natural to ask the following question: how would one measure 'information'? The question was asked at the beginning of this age of information sciences and technology itself, and a satisfactory answer was given. The theory of information was born . . . a 'bandwagon' . . . as Shannon (1956) himself called it.

"A key feature of Shannon's information theory is the discovery that the colloquial term information can often be given a mathematical meaning as a numerically measurable quantity, on the basis of a probabilistic model, in such a way that the solution of many important problems of information storage and transmission can be formulated in terms of this measure of the amount of information. This information measure has a very concrete operational interpretation: roughly, it equals the minimum number of binary digits needed, on the average, to encode the message in question. The coding theorems of information theory provide such overwhelming evidence for the adequateness of Shannon's information measure that to look for essentially different measures of information might appear to make no sense at all. Moreover, it has been shown by several authors, starting with Shannon (1948), that the measure of the amount of information is uniquely determined by some rather natural postulates. Still, all the evidence that Shannon's information measure is the only possible one, is valid only within the restricted scope of coding problems considered by Shannon. As Rényi pointed out in his fundamental paper (Rényi, 1961) on generalized information measures, in other sorts of problems other quantities may serve just as well or even better as measures of information. This should be indicated either by their operational significance (pragmatic approach) or by a set of natural postulates characterizing them (axiomatic approach) or, preferably, by both."

The above passage is quoted from a critical survey on information measures by Csiszár (1974), which summarizes the significance of information measures and the scope for generalizing them. Now we shall see the details.

Information Measures and Generalizations

The central tenet of Shannon's information theory is the construction of a measure of the "amount of information" inherent in a probability distribution. This construction is in the form of a functional that returns a real number which is to be regarded as the amount of information of a probability distribution, and hence the functional is known as an information measure. The underlying concept in this construction is that it complements the amount of information with the amount of uncertainty, and it happens to be logarithmic.
The logarithmic form of the information measure dates back to Hartley (1928), who introduced the practical measure of information as the logarithm of the number of possible symbol sequences, where the events are considered to be equally probable. It was Shannon (1948), and independently Wiener (1948), who introduced a measure of information of a general finite probability distribution p with point masses p1, . . . , pn as
$$ S(p) = -\sum_{k=1}^{n} p_k \ln p_k . $$
Owing to its similarity as a mathematical expression to Boltzmann entropy in thermodynamics, the term 'entropy' has been adopted in the information sciences and is used synonymously with information measure. Shannon demonstrated many nice properties of his entropy measure for it to be called a measure of information in its own right. One important property of Shannon entropy is additivity, i.e., for two independent distributions, the entropy of the joint distribution is the sum of the entropies of the two distributions. Today, information theory is considered to be a very fundamental field which intersects with physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory), computer science (Kolmogorov complexity), etc. (cf. Fig. 1.1, pp. 2, Cover & Thomas, 1991).

Now, let us examine an alternate interpretation of the Shannon entropy functional that is important for studying its mathematical properties and its generalizations. Let X be the underlying random variable, which takes values x1, . . . , xn; we use the notation p(xk) = pk, k = 1, . . . , n. Then, Shannon entropy can be written as the expectation of a function of X as follows. Define a function H which assigns to each value xk that X takes the value −ln p(xk) = −ln pk, for k = 1, . . . , n. The quantity −ln pk is known as the information associated with the single event xk with probability pk, also known as Hartley information (Aczél & Daróczy, 1975). From this, what one can infer is that the Shannon entropy expression is an average of Hartley information. This interpretation of Shannon entropy, as an average of the information associated with a single event, is central to Rényi's generalization.

Rényi entropies were introduced into mathematics by Alfréd Rényi (1960). The original motivation was strictly formal. The basic idea behind Rényi's generalization is that any putative candidate for an entropy should be a mean; thereby he uses a well-known idea in mathematics that the linear mean, though most widely used, is not the only possible way of averaging; one can define the mean with respect to an arbitrary function. Here one should be aware that, to define a 'meaningful' generalized mean, one has to restrict the choice of functions to continuous and monotone functions (Hardy, Littlewood, & Pólya, 1934). Following the above idea, once we replace the linear mean with generalized means, we have a set of information measures, each corresponding to a continuous and monotone function. Can we call every such entity an information measure? Rényi (1960) postulated that an information measure should satisfy the additivity property, which Shannon entropy itself does.
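Before continuing with the consequences of this additivity constraint, the earlier reading of Shannon entropy as a linear average of Hartley information can be made concrete with a minimal numerical sketch (the pmf here is an arbitrary illustrative choice, not taken from the thesis):

```python
import math

# An arbitrary pmf p = (p_1, ..., p_n) used purely for illustration.
p = [0.5, 0.25, 0.125, 0.125]

# Hartley information of the single event x_k: -ln p_k.
hartley = [-math.log(pk) for pk in p]

# Shannon entropy as the linear (probability-weighted) average of the
# Hartley informations of the individual events.
shannon = sum(pk * hk for pk, hk in zip(p, hartley))

# Direct evaluation of S(p) = -sum_k p_k ln p_k for comparison.
direct = -sum(pk * math.log(pk) for pk in p if pk > 0)

print(shannon, direct)  # both print the same value (about 1.213 nats)
```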
The important consequence of this constraint is that it restricts the choice of function in a generalized mean to linear and exponential functions: if we choose a linear function, we get back the Shannon entropy; if we choose an exponential function, we have the well-known and much-studied generalization of Shannon entropy
$$ S_\alpha(p) = \frac{1}{1-\alpha} \ln \sum_{k=1}^{n} p_k^{\alpha} , $$
where α is a parameter corresponding to an exponential function, which specifies the generalized mean and is known as the entropic index. Rényi has called them entropies of order α (α ≠ 1, α > 0); they include Shannon's entropy in a limiting sense, namely, in the limit α → 1, the α-entropy retrieves Shannon entropy. For this reason, Shannon's entropy may be called the entropy of order 1. Rényi studied these generalized entropy functionals extensively in his various papers; one can refer to his book on probability theory (Rényi, 1970, Chapter 9) for a summary of results.

While Rényi entropy is considered to be the first formal generalization of Shannon entropy, Havrda and Charvát (1967) observed that for operational purposes, it seems more natural to consider the simpler expression $\sum_{k=1}^{n} p_k^{\alpha}$ as an information measure instead of Rényi entropy (up to a constant factor). Characteristics of this information measure were studied by Daróczy (1970) and Forte and Ng (1973), and it was shown that this quantity permits simpler postulational characterizations (for a summary of the discussion see (Csiszár, 1974)). While generalized information measures, after Rényi's work, continued to be of interest to many mathematicians, it was in 1988 that they came to attention in physics when Tsallis reinvented the above-mentioned Havrda and Charvát entropy (up to a constant factor), and specified it in the form (Tsallis, 1988)
$$ S_q(p) = \frac{1 - \sum_{k} p_k^{q}}{q-1} . $$
Though this expression looks somewhat similar to the Rényi entropy and retrieves Shannon entropy in the limit q → 1, Tsallis entropy has the remarkable, albeit not yet fully understood, property that in the case of independent experiments, it is not additive. Hence, the statistical formalism based on Tsallis entropy is also termed nonextensive statistics. Next, we discuss what information measures have to do with statistics.

Information Theory and Statistics

Probabilities are unobservable quantities in the sense that one cannot determine their values for a random experiment simply by an inspection of whether the events do, in fact, occur or not. Assessing the probability of the occurrence of some event or of the truth of some hypothesis is the important question one runs up against in any application of probability theory to the problems of science or practical life. Although the mathematical formalism of probability theory serves as a powerful tool when analyzing such problems, it cannot, by itself, answer this question. Indeed, the formalism is silent on this issue, since its goal is just to provide theorems valid for all probability assignments allowed by its axioms. Hence, recourse is necessary to an additional rule which tells us in which case one ought to assign which values to probabilities. In 1957, Jaynes proposed a rule to assign numerical values to probabilities in circumstances where certain partial information is available. Jaynes showed, in particular, how this rule, when applied to statistical mechanics, leads to the usual canonical distributions in an extremely simple fashion. The concept he used was 'maximum entropy'.
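Before turning to maximum entropy, here is a short sketch (with an arbitrary illustrative pmf, not taken from the thesis) that evaluates the Shannon, Rényi and Tsallis entropy functionals defined above and checks numerically that the two generalized entropies approach the Shannon entropy as their index tends to 1:

```python
import math

def shannon(p):
    """S(p) = -sum_k p_k ln p_k."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

def renyi(p, alpha):
    """Rényi entropy of order alpha (alpha > 0, alpha != 1)."""
    return math.log(sum(pk ** alpha for pk in p)) / (1.0 - alpha)

def tsallis(p, q):
    """Tsallis (nonextensive) entropy with entropic index q (q != 1)."""
    return (1.0 - sum(pk ** q for pk in p)) / (q - 1.0)

p = [0.5, 0.25, 0.125, 0.125]          # arbitrary pmf for illustration
for index in (0.5, 0.99, 1.01, 2.0):
    print(index, renyi(p, index), tsallis(p, index))
print("Shannon:", shannon(p))
# For indices close to 1 both generalized entropies are close to S(p);
# away from 1 they differ from S(p) and from each other.
```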
With his maximum entropy principle, Jaynes re-derived Gibbs-Boltzmann statistical mechanics à la information theory in his two papers (Jaynes, 1957a, 1957b). This principle states that the states of thermodynamic equilibrium are states of maximum entropy. Formally, let p1, . . . , pn be the probabilities that a particle in a system has energies E1, . . . , En respectively; then the well-known Gibbs-Boltzmann distribution
$$ p_k = \frac{e^{-\beta E_k}}{Z} , \qquad k = 1, \ldots, n, $$
can be deduced by maximizing the Shannon entropy functional $-\sum_{k=1}^{n} p_k \ln p_k$ with respect to the constraint of known expected energy $\sum_{k=1}^{n} p_k E_k = U$ along with the normalizing constraint $\sum_{k=1}^{n} p_k = 1$. Z is called the partition function and can be specified as
$$ Z = \sum_{k=1}^{n} e^{-\beta E_k} . $$
Though the use of maximum entropy has its historical roots in physics (e.g., Elsasser, 1937) and economics (e.g., Davis, 1941), later on, Jaynes showed that a general method of statistical inference could be built upon this rule, which subsumes the techniques of statistical mechanics as a mere special case. The principle of maximum entropy states that, of all the distributions p that satisfy the constraints, one should choose the distribution with the largest entropy. In the above formulation of the Gibbs-Boltzmann distribution one can view the mean energy constraint and the normalizing constraint as the only available information. Also, this principle is a natural extension of Laplace's famous principle of insufficient reason, which postulates that the uniform distribution is the most satisfactory representation of our knowledge when we know nothing about the random variate except that each probability is nonnegative and the sum of the probabilities is unity; it is easy to show that Shannon entropy is maximum for the uniform distribution. The maximum entropy principle is used in many fields, ranging from physics (for example, Bose-Einstein and Fermi-Dirac statistics can be cast as though they are derived from the maximum entropy principle) and chemistry to image reconstruction and stock market analysis, and recently machine learning.

While Jaynes was developing his maximum entropy principle for statistical inference problems, a more general principle was proposed by Kullback (1959, pp. 37), which is known as the minimum entropy principle. This principle comes into the picture in problems where inductive inference is to update from a prior probability distribution to a posterior distribution whenever new information becomes available. This principle states that, given a prior distribution r, of all the distributions p that satisfy the constraints, one should choose the distribution with the least Kullback-Leibler relative-entropy
$$ I(p \| r) = \sum_{k=1}^{n} p_k \ln \frac{p_k}{r_k} . $$
Minimizing relative-entropy is equivalent to maximizing entropy when the prior is a uniform distribution. This principle laid the foundations for an information-theoretic approach to statistics (Kullback, 1959) and plays an important role in certain geometrical approaches to statistical inference (Amari, 1985). The maximum entropy principle together with the minimum entropy principle is referred to as the ME-principle, and the inference methods based on these principles are collectively known as ME-methods. Papers by Shore and Johnson (1980) and by Tikochinsky, Tishby, and Levine (1984) paved the way for strong theoretical justification for using ME-methods in inference problems. A more general view of ME fundamentals is reported by Harremoës and Topsøe (2001).
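A minimal sketch of these two prescriptions, under illustrative choices of the energy levels and of β (neither taken from the thesis): it builds the Gibbs-Boltzmann distribution p_k = e^{-βE_k}/Z obtained above and evaluates the Kullback-Leibler relative-entropy of that distribution with respect to itself and with respect to a uniform prior.

```python
import math

def boltzmann(energies, beta):
    """Gibbs-Boltzmann distribution p_k = exp(-beta*E_k)/Z, i.e. the
    maximizer of Shannon entropy under a mean-energy constraint."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)                      # partition function
    return [w / Z for w in weights]

def kl(p, r):
    """Kullback-Leibler relative-entropy I(p||r) = sum_k p_k ln(p_k/r_k)."""
    return sum(pk * math.log(pk / rk) for pk, rk in zip(p, r) if pk > 0)

energies = [0.0, 1.0, 2.0, 3.0]           # illustrative energy levels
beta = 0.7                                # illustrative inverse temperature
p = boltzmann(energies, beta)

print(p, sum(p))                          # a valid pmf (sums to 1)
print(kl(p, p))                           # 0: relative-entropy of p w.r.t. itself
uniform = [1.0 / len(p)] * len(p)
print(kl(p, uniform))                     # > 0 whenever p differs from the prior
```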
Before we move on, we briefly explain the relation between ME and inference methods using the well-known Bayes' theorem. The choice between these two updating methods is dictated by the nature of the information being processed. When we want to update our beliefs about the value of certain quantities θ on the basis of information about the observed values of other quantities x – the data – we must use Bayes' theorem. If the prior beliefs are given by p(θ), the updated or posterior distribution is p(θ|x) ∝ p(θ)p(x|θ). Being a consequence of the product rule for probabilities, the Bayesian method of updating is limited to situations where it makes sense to define the joint probability of x and θ. The ME-method, on the other hand, is designed for updating from a prior probability distribution to a posterior distribution when the information to be processed is testable information, i.e., it takes the form of constraints on the family of acceptable posterior distributions. In general, it makes no sense to process testable information using Bayes' theorem, and conversely, neither does it make sense to process data using ME. However, in those special cases when the same piece of information can be interpreted both as data and as a constraint, then both methods can be used and they agree. For more details on the ME and Bayes' approaches one can refer to (Caticha & Preuss, 2004; Grendár jr & Grendár, 2001). An excellent review of the ME-principle and consistency arguments can be found in the papers by Uffink (1995, 1996) and by Skilling (1984). This subject is dealt with, along with applications, in the book of Kapur and Kesavan (1997).

Power-law Distributions

Despite the great success of the standard ME-principle, it is a well-known fact that there are many relevant probability distributions in nature which are not easily derivable from the Jaynes-Shannon prescription: power-law distributions constitute an interesting example. If one sticks to the standard logarithmic entropy, 'awkward constraints' are needed in order to obtain power-law type distributions (Tsallis et al., 1995). Does Jaynes' ME-principle suggest in a natural way the possibility of incorporating alternative entropy functionals into the variational principle? It seems that if one replaces Shannon entropy with its generalization, ME-prescriptions 'naturally' result in power-law distributions.

Power-law distributions can be obtained by optimizing Tsallis entropy under appropriate constraints. The distribution thus obtained is termed the q-exponential distribution. The associated q-exponential function of x is $e_q(x) = [1 + (1-q)x]_{+}^{\frac{1}{1-q}}$, with the notation $[a]_+ = \max\{0, a\}$, and it converges to the ordinary exponential function in the limit q → 1. Hence the formalism of Tsallis offers continuity between the Boltzmann-Gibbs distribution and power-law distributions, governed by the nonextensive parameter q. The Boltzmann-Gibbs distribution is a special case of the power-law distribution of the Tsallis prescription: as we set q → 1, we recover the exponential.

Here, we take up an important real-world example where the significance of power-law distributions can be demonstrated. The importance of power-law distributions in the domain of computer science was first precipitated in 1999 in the study of the connectedness of the World Wide Web (WWW). Using a Web crawler, Barabási and Albert (1999) mapped the connectedness of the Web. To their surprise, the web did not have an even distribution of connectivity (so-called "random connectivity").
Instead, a very few network nodes (called "hubs") were far more connected than other nodes. In general, they found that the probability p(k) that a node in the network connects with k other nodes was, in a given network, proportional to $k^{-\gamma}$, where the degree exponent γ is not universal and depends on the details of the network structure. A pictorial depiction of random networks and scale-free networks is given in Figure 1.1.

[Figure 1.1: Structure of Random and Scale-Free Networks]

Here we wish to point out that, using the q-exponential function, p(k) can be rewritten as $p(k) = e_q(-k/\kappa)$, where $q = 1 + \frac{1}{\gamma}$ and $\kappa = (q-1)k_0$. This implies that the Barabási-Albert solution optimizes the Tsallis entropy (Abe & Suzuki, 2004). One more interesting example is the distribution of scientific articles in journals (Naranan, 1970). If the journals are divided into groups, each containing the same number of articles on a given subject, then the numbers of journals in the succeeding groups form a geometric progression. The Tsallis nonextensive formalism has been applied to analyze various phenomena which exhibit power-laws, for example stock markets (Queirós et al., 2005), citations of scientific papers (Tsallis & de Albuquerque, 2000), the scale-free network of earthquakes (Abe & Suzuki, 2004), models of network packet traffic (Karmeshu & Sharma, 2006), etc. To a great extent, the success of the Tsallis proposal is attributed to the ubiquity of power-law distributions in nature.

Information Measures on Continuum

Until now we have considered information measures in the discrete case, where the number of configurations is finite. Is it possible to extend the definitions of information measures to non-discrete cases, or to even more general cases? For example, can we write Shannon entropy in the continuous case, naively, as
$$ S(p) = -\int p(x) \ln p(x) \, dx $$
for a probability density p(x)? It turns out that in the continuous case, this entropy functional poses a formidable problem if one interprets it as an information measure.

Information measures extended to abstract spaces are important not only for mathematical reasons; the resultant generality and rigor could also prove important for eventual applications. Even in communication problems, discrete memoryless sources and channels are not always adequate models for real-world signal sources or communication and storage media. Metric spaces of functions, vectors and sequences as well as random fields naturally arise as models of source and channel outcomes (Cover, Gacs, & Gray, 1989). The by-products of general rigorous definitions have the potential for proving useful new properties, for providing insight into their behavior and for finding formulas for computing such measures for specific processes. Immediately after Shannon published his ideas, the problem of extending the definitions of information measures to abstract spaces was addressed by well-known mathematicians of the time: Kolmogorov (1956, 1957) (for an excellent review of Kolmogorov's contributions to information theory see (Cover et al., 1989)), Dobrushin (1959), Gelfand (1956, 1959), Kullback (Kullback, 1959), Pinsker (1960a, 1960b), Yaglom (1956, 1959), Perez (1959), Rényi (1960), Kallianpur (1960), etc.

We now examine why extending the Shannon entropy to the non-discrete case is a nontrivial problem. Firstly, probability densities mostly carry a physical dimension (say, probability per length), which gives the entropy functional the unit of 'ln cm', which seems somewhat odd.
Also, in contrast to its discrete case counterpart, this expression is not invariant under a reparametrization of the domain, e.g. by a change of unit. Further, S may now become negative, and is bounded neither from above nor from below, so that new problems of definition appear, cf. (Hardy et al., 1934, pp. 126). These problems are clarified if one considers how to construct an entropy for a continuous probability distribution starting from the discrete case. A natural approach is to consider the limit of the finite discrete entropies corresponding to a sequence of finite partitions of an interval (on which entropy is defined) whose norms tend to zero. Unfortunately, this approach does not work, because this limit is infinite for all continuous probability distributions. Such divergence is also obtained – and explained – if one adopts the well-known interpretation of the Shannon entropy as the least expected number of yes/no questions needed to identify the value of x, since in general it takes an infinite number of such questions to identify a point in the continuum (of course, this interpretation supposes that the logarithm in the entropy functional has base 2).

To overcome the problems posed by the definition of the entropy functional in the continuum, the solution suggested was to consider the discrete case expression (cf. Gelfand et al., 1956; Kolmogorov, 1957; Kullback, 1959)
$$ S(p|\mu) = -\sum_{k=1}^{n} p(x_k) \ln \frac{p(x_k)}{\mu(x_k)} , $$
where the µ(xk) are positive weights determined by some 'background measure' µ. Note that the above entropy functional S(p|µ) is the negative of the Kullback-Leibler relative-entropy or KL-entropy when we consider that the µ(xk) are positive and sum to one. Now, one can show that the present entropy functional, which is defined in terms of KL-entropy, does have a natural extension to the continuous case (Topsøe, 2001, Theorem 5.2). This is because, if one now partitions the real line into increasingly finer subsets, the probabilities corresponding to p and the background weights corresponding to µ are both split simultaneously, and the logarithm of their ratio will generally not diverge. This is how KL-entropy plays an important role in the definitions of information measures extended to the continuum.

Based on the above ideas one can extend the information measures to a measure space (X, M, µ); µ is exactly the same as that which appeared in the above definition of the entropy functional S(p|µ) in the discrete case. The entropy functionals in both the discrete and continuous cases can be retrieved by appropriately choosing the reference measure µ. Such a definition of information measures on measure spaces can be used in ME-prescriptions, which are consistent with the prescriptions obtained when their discrete counterparts are used. One can find the continuum and measure-theoretic aspects of entropy functionals in the information theory text of Guiaşu (1977). A concise and very good discussion of ME-prescriptions for continuous entropy functionals can be found in (Uffink, 1995).

What is this thesis about?

One can see from the above discussions that the two generalizations of Shannon entropy, Rényi and Tsallis, originated or developed from different fields. Though Rényi's generalization originated in information theory, it has been studied in statistical mechanics (e.g., Bashkirov, 2004) and statistics (e.g., Morales et al., 2004).
Similarly, the Tsallis generalization was mainly studied in statistical mechanics when it was proposed, but now the Shannon-Khinchin axioms have been extended to Tsallis entropy (Suyari, 2004a) and it has been applied to statistical inference problems (e.g., Tsallis, 1998). This elicits no surprise, because from the above discussion one can see that information theory is naturally connected to statistical mechanics and statistics. The study of the mathematical properties and applications of generalized information measures and, further, new formulations of the maximum entropy principle based on these generalized information measures constitute a currently growing field of research. It is in this line of inquiry that this thesis presents some results pertaining to mathematical properties of generalized information measures and their ME-prescriptions, including results related to measure-theoretic formulations of the same. Finally, note that the Rényi and Tsallis generalizations can be 'naturally' applied to Kullback-Leibler relative-entropy to define generalized relative-entropy measures, which are extensively studied in the literature. Indeed, the major results that we present in this thesis are related to these generalized relative-entropies.

1.1 Summary of Results

Here we give a brief summary of the main results presented in this thesis. Broadly, the results presented in this thesis can be divided into those related to information measures and those related to their ME-prescriptions.

Generalized Means, Rényi's Recipe and Information Measures

One can view Rényi's formalism as a tool which can be used to generalize information measures and thereby characterize them using axioms of Kolmogorov-Nagumo averages (KN-averages). For example, one can apply Rényi's recipe in the nonextensive case by replacing the linear averaging in Tsallis entropy with KN-averages and thereby impose the constraint of pseudo-additivity. A natural question that arises is: what are the various pseudo-additive information measures that one can prepare with this recipe? In this thesis we prove that only Tsallis entropy is possible in this case, using which we characterize Tsallis entropy based on axioms of KN-averages.

Generalized Information Measures in Abstract Spaces

Owing to the probabilistic setting of information theory, it is natural that more general definitions of information measures can be given on measure spaces. In this thesis we develop measure-theoretic formulations for generalized information measures and present some related results. One can give measure-theoretic definitions for Rényi and Tsallis entropies along similar lines as Shannon entropy. One can also show that, as is the case with Shannon entropy, these measure-theoretic definitions are not natural extensions of their discrete analogs. In this context we present two results: (i) we prove that, as in the case of classical 'relative-entropy', generalized relative-entropies, whether Rényi or Tsallis, can be extended naturally to the measure-theoretic case, and (ii) we show that ME-prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case. Another important result that we present in this thesis is the Gelfand-Yaglom-Perez (GYP) theorem for Rényi relative-entropy, which can be easily extended to Tsallis relative-entropy. The GYP-theorem for Kullback-Leibler relative-entropy is a fundamental theorem which plays an important role in extending discrete case definitions of various classical information measures to the measure-theoretic case.
It also provides a means to compute relative-entropy and study its behavior.

Tsallis Relative-Entropy Minimization

Unlike the generalized entropy measures, ME of generalized relative-entropies is not much addressed in the literature. In this thesis we study Tsallis relative-entropy minimization in detail. We study the properties of Tsallis relative-entropy minimization and present some differences with the classical case. In the representation of such a minimum relative-entropy distribution, we highlight the use of the q-product, an operator that has been recently introduced to derive the mathematical structure behind Tsallis statistics.

Nonextensive Pythagoras' Theorem

It is a common practice in mathematics to employ geometric ideas in order to obtain additional insights or new methods even in problems which do not involve geometry intrinsically. Maximum and minimum entropy methods are no exception. Kullback-Leibler relative-entropy, in cases involving distributions resulting from relative-entropy minimization, has a celebrated property reminiscent of squared Euclidean distance: it satisfies an analog of Pythagoras' theorem. Hence, this property is referred to as the Pythagoras' theorem of relative-entropy minimization or triangle equality, and it plays a fundamental role in geometrical approaches to statistical estimation theory like information geometry. We state and prove the equivalent of Pythagoras' theorem in the nonextensive case.

Power-law Distributions in EAs

Recently, power-law distributions have been used in simulated annealing, which is claimed to perform better than classical simulated annealing. In this thesis we demonstrate the use of power-law distributions in evolutionary algorithms (EAs). The proposed algorithm uses the Tsallis generalized canonical distribution, which is a one-parameter generalization of the Boltzmann distribution, to weigh the configurations in the selection mechanism. We provide some simulation results in this regard.

1.2 Essentials

This section details some heuristic explanations for the logarithmic nature of the Hartley and Shannon entropies. We also discuss some notation and why the concept of "maximum entropy" is important.

1.2.1 What is Entropy?

The logarithmic nature of the Hartley and Shannon information measures, and their additivity properties, can be explained by heuristic arguments. Here we give one such explanation (Rényi, 1960). To characterize an element of a set of size n we need log2 n units of information, where a unit is a bit. The important feature of the logarithmic information measure is its additivity: if a set E is a disjoint union of m n-element sets E1, . . . , Em, then we can specify an element of this mn-element set E in two steps: first we need log2 m bits of information to describe which of the sets E1, . . . , Em, say Ek, contains the element, and then we need log2 n further bits of information to tell which element of this set Ek is the considered one. The information needed to characterize an element of E is the 'sum' of the two partial informations. Indeed, log2 nm = log2 n + log2 m.

The next step is due to Shannon (1948). He pointed out that Hartley's formula is valid only if the elements of E are equiprobable; if their probabilities are not equal, the situation changes and we arrive at the formula (2.15). If all the probabilities are equal to 1/n, Shannon's formula (2.15) reduces to Hartley's formula: S(p) = log2 n. Shannon's formula has the following heuristic motivation. Let E be the disjoint union of the sets
E1, . . . , En, having N1, . . . , Nn elements respectively ($\sum_{k=1}^{n} N_k = N$). Let us suppose that we are interested only in knowing the subset Ek to which a given element of E belongs. Suppose that the elements of E are equiprobable. The information characterizing an element of E consists of two parts: the first specifies the subset Ek containing this particular element and the second locates it within Ek. The amount of the second piece of information is log2 Nk (by Hartley's formula); thus it depends on the index k. To specify an element of E we need log2 N bits of information and, as we have seen, it is composed of the information specifying Ek – its amount will be denoted by Hk – and of the information within Ek. According to the principle of additivity, we have log2 N = Hk + log2 Nk, or Hk = log2 (N/Nk). It is plausible to define the information needed to identify the subset Ek to which the considered element belongs as the weighted average of the informations Hk, where the weights are the probabilities that the element belongs to the Ek's. Thus,
$$ S = \sum_{k=1}^{n} \frac{N_k}{N} H_k , $$
from which we obtain the Shannon entropy expression using the above interpretation $H_k = \log_2 \frac{N}{N_k}$ and the notation $p_k = \frac{N_k}{N}$.

Now we note one more important idea behind the Shannon entropy. We frequently come across Shannon entropy being treated as both a measure of uncertainty and a measure of information. How is this rendered possible? If X is the underlying random variable, then S(p) is also written as S(X), though it does not depend on the actual values of X. With this, one can say that S(X) quantifies how much information we gain, on average, when we learn the value of X. An alternative view is that the entropy of X measures the amount of uncertainty about X before we learn its value. These two views are complementary; we can either view entropy as a measure of our uncertainty before we learn the value of X, or as a measure of how much information we have gained after we learn the value of X. Following this, one can see that Shannon entropy for the most 'certain' distribution (0, . . . , 1, . . . , 0) returns the value 0, and for the most 'uncertain' distribution (1/n, . . . , 1/n) returns the value ln n. Further, one can show the inequality 0 ≤ S(p) ≤ ln n for any probability distribution p. The inequality S(p) ≥ 0 is easy to verify. Let us prove that for any probability distribution p = (p1, . . . , pn) we have
$$ S(p) = S(p_1, \ldots, p_n) \le S\!\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) = \ln n . \qquad (1.1) $$
Here, we shall see the proof. I One way of showing this property is by using the Jensen inequality for real-valued continuous functions. Let f(x) be a real-valued continuous concave function defined on the interval [a, b]. Then for any x1, . . . , xn ∈ [a, b] and any set of non-negative real numbers λ1, . . . , λn such that $\sum_{k=1}^{n} \lambda_k = 1$, we have
$$ \sum_{k=1}^{n} \lambda_k f(x_k) \le f\!\left(\sum_{k=1}^{n} \lambda_k x_k\right) . \qquad (1.2) $$
For convex functions the reverse inequality is true. Setting a = 0, b = 1, xk = pk, λk = 1/n and f(x) = −x ln x, we obtain
$$ -\frac{1}{n}\sum_{k=1}^{n} p_k \ln p_k \le -\left(\sum_{k=1}^{n} \frac{1}{n} p_k\right) \ln\!\left(\sum_{k=1}^{n} \frac{1}{n} p_k\right) ,$$
− n1 (1.3) which is always negative definite, so that the values from (1.3) determine a maximum value, which, because of the concavity property, is also the global maximum value. Hence the result. J 1.2.2 Why to maximize entropy? Consider a random variable X. Let the possible values X takes be x 1 , . . . , xn that possibly represent the outcomes of an experiment, states of a physical system, or just labels of various propositions. The probability with which the event x k is selected is denoted by pk , for k = 1, . . . , n. Our problem is to assign probabilities p 1 , . . . , pn . Laplace’s principle of insufficient reason is the simplest rule that can be used when we do not have any information about a random experiment. It states that whenever we have no reason to believe that one case rather than any other is realized, or, as is also put, in case all values of X are judged to be ‘equally possible’, then their probabilities are equal, i.e pk = 1 , n k = 1, . . . n. 15 We can restate the principle as, the uniform distribution is the most satisfactory representation of our knowledge when we know nothing about the random variate except that each probability is nonnegative and the sum of the probabilities is unity. This rule, of course, refers to the meaning of the concept of probability , and is therefore subject to debate and controversy. We will not discuss this here, one can refer to (Uffink, 1995) for a list of objections to this principle reported in the literature. Now having the Shannon entropy as a measure of uncertainty (information), can we generalize the principle of insufficient reason and say that with the available information, we can always choose the distribution which maximizes the Shannon entropy? This is what is known as the Jaynes’ maximum entropy principle which states that of all the probability distributions that satisfy given constraints, choose the distribution which maximizes Shannon entropy. That is if our state of knowledge is appropriately represented by a set of expectation values, then the “best”, least unbiased probability distribution is the one that (i) reflects just what we know, without “inventing” unavailable pieces of knowledge, and, additionally, (ii) maximize ignorance: the truth, all the truth, nothing but the truth. This is the rationale behind the maximum entropy principle. Now we shall examine this principle in detail. Let us assume that some information about the random variable X is given which can be modeled as a constraint on the set of all possible probability distributions. It is assumed that this constraint exhaustively specifies all relevant information about X. The principle of maximum entropy is then the prescription to choose that probability distribution p for which the Shannon entropy is maximal under the given constraint. Here we take simple and often studied type of constraints, i.e. the case where expectation of X is given. Say we have the constraint n X xk pk = U , k=1 where U is the expectation of X. Now to maximize Shannon entropy with respect P to the above constraint, together with the normalizing constraint nk=1 pk = 1, the Lagrangian can be written as L≡− n X k=1 pk ln pk − λ n X k=1 pk − 1 ! −β n X k=1 xk pk − U ! . Setting the derivatives of the Lagrangian with respect to p 1 , . . . , pn equal to zero, we get ln pk = −λ − βxk 16 The Lagrange parameter λ can be specified by the normalizing constraint. 
Finally, the maximum entropy distribution can be written as
$$ p_k = \frac{e^{-\beta x_k}}{\sum_{j=1}^{n} e^{-\beta x_j}} , $$
where the parameter β is determined by the expectation constraint. Note that one can extend this method to more than one constraint specified with respect to some arbitrary functions; for details see (Kapur & Kesavan, 1997).

The maximum entropy principle subsumes the principle of insufficient reason. Indeed, in the absence of reasons, i.e., in the case where none or only trivial constraints are imposed on the probability distribution, its entropy S(p) is maximal when all probabilities are equal. As a generalization of the principle of insufficient reason, however, the maximum entropy principle inherits all the objections associated with its infamous predecessor. Interestingly, it does cope with some of the objections; for details see (Uffink, 1995). Note that calculating the Lagrange parameters in maximum entropy methods is a non-trivial task, and the same holds for calculating maximum entropy distributions. Various techniques to calculate maximum entropy distributions can be found in (Agmon et al., 1979; Mead & Papanicolaou, 1984; Ormoneit & White, 1999; Wu, 2003). The maximum entropy principle can be used for a wide variety of problems. The book by Kapur and Kesavan (1997) gives an excellent account of maximum entropy methods with emphasis on various applications.

1.3 A reader's guide to the thesis

Notation and Delimiters

The commonly used notation in the thesis is given at the beginning of the chapters. When we write down the proofs of some results which are not specified in the Theorem/Lemma environment, we denote the beginning and ending of proofs by I and J respectively. Otherwise, the ends of proofs that are part of the above environments are identified by an end-of-proof mark. Some additional explanations within the results are included in the footnotes. To avoid proliferation of symbols we use the same notation for different concepts if this does not cause ambiguity; the correspondence should be clear from the context. For example, whether it is a maximum entropy distribution or a minimum relative-entropy distribution, we use the same symbols for the Lagrange multipliers.

Roadmap

Apart from this chapter, this thesis contains five other chapters. We now briefly outline a summary of each chapter. In Chapter 2, we present a brief introduction to generalized information measures and their properties. We discuss how generalized means play a role in information measures and present a result relating generalized means and the Tsallis generalization. In Chapter 3, we discuss various aspects of information measures defined on measure spaces. We present measure-theoretic definitions for generalized information measures and present important results. In Chapter 4, we discuss the geometrical aspects of relative-entropy minimization and present an important result for Tsallis relative-entropy minimization. In Chapter 5, we apply power-law distributions to the selection mechanism in evolutionary algorithms and test their novelty by simulations. Finally, in Chapter 6, we summarize the contributions of this thesis, and discuss possible future directions.

2 KN-averages and Entropies: Rényi's Recipe

Abstract: This chapter builds the background for this thesis and introduces the Rényi and Tsallis (nonextensive) generalizations of classical information measures. It also presents a significant result on the relation between Kolmogorov-Nagumo averages and the nonextensive generalization, which can also be found in (Dukkipati, Murty, & Bhatnagar, 2006b).
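As a concrete preview of the q-deformed operations that appear throughout this chapter, the following sketch (with illustrative values of q, x and y only) numerically checks the standard pseudo-additivity identity of the q-logarithm defined just below, ln_q(xy) = ln_q(x) ⊕_q ln_q(y), and its reduction to the ordinary logarithm as q → 1:

```python
import math

def q_log(x, q):
    """q-logarithm: ln_q(x) = (x**(1-q) - 1)/(1-q); tends to ln(x) as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def pseudo_add(a, b, q):
    """Pseudo-addition: a (+)_q b = a + b + (1-q)*a*b."""
    return a + b + (1.0 - q) * a * b

q, x, y = 0.7, 2.0, 3.5              # illustrative values
lhs = q_log(x * y, q)
rhs = pseudo_add(q_log(x, q), q_log(y, q), q)
print(lhs, rhs)                       # agree: ln_q(xy) = ln_q(x) (+)_q ln_q(y)
print(q_log(x, 1.0), math.log(x))     # q -> 1 recovers the ordinary logarithm
```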
In recent years, interest in generalized information measures has increased dramatically after the introduction of nonextensive entropy in physics by Tsallis (1988) (first defined by Havrda and Charvát (1967)), and they have been studied extensively in information theory and statistics. One can get this nonextensive entropy or Tsallis entropy by generalizing the information of a single event in the definition of Shannon entropy, where the logarithm is replaced with the q-logarithm (defined as $\ln_q x = \frac{x^{1-q} - 1}{1-q}$). The term 'nonextensive' is used because it does not satisfy the additivity property – a characteristic property of Shannon entropy – instead, it satisfies pseudo-additivity of the form $x \oplus_q y = x + y + (1-q)xy$.

Indeed, the starting point of the theory of generalized measures of information is due to Rényi (1960, 1961), who introduced α-entropy or Rényi entropy, the first formal generalization of Shannon entropy. Replacing linear averaging in Shannon entropy, which can be interpreted as an average of the information of a single event, with Kolmogorov-Nagumo averages (KN-averages) of the form $\langle x \rangle_\psi = \psi^{-1}\!\left(\sum_{k} p_k \psi(x_k)\right)$, where ψ is an arbitrary continuous and strictly monotone function, and further imposing the additivity constraint – a characteristic property of the underlying information of a single event – leads to Rényi entropy. Using this recipe of Rényi, one can prepare only two information measures: Shannon and Rényi entropy. By means of this formalism, Rényi characterized these additive entropies in terms of axioms of KN-averages.

One can view Rényi's formalism as a tool which can be used to generalize information measures and thereby characterize them using axioms of KN-averages. For example, one can apply Rényi's recipe in the nonextensive case by replacing the linear averages in Tsallis entropy with KN-averages and thereby imposing the constraint of pseudo-additivity. A natural question that arises is: what are the pseudo-additive information measures that one can prepare with this recipe? We prove that Tsallis entropy is the only possible measure in this case, which allows us to characterize Tsallis entropy using axioms of KN-averages.

As one can see from the above discussion, the Hartley information measure (Hartley, 1928) of a single stochastic event plays a fundamental role in the Rényi and Tsallis generalizations. Rényi's generalization involves the generalization of the linear average in Shannon entropy, whereas, in the case of Tsallis, it is the generalization of the Hartley function; while Rényi's is considered to be the additive generalization, Tsallis' is non-additive. These generalizations can be extended to Kullback-Leibler (KL) relative-entropy too; indeed, many results presented in this thesis are related to generalized relative-entropies.

First, we discuss the important properties of the classical information measures, Shannon and KL, in § 2.1. We discuss Rényi's generalization in § 2.2, where we discuss the Hartley function and the properties of quasilinear means. The nonextensive generalization of Shannon entropy and relative-entropy is presented in detail in § 2.3. Results on the uniqueness of Tsallis entropy under Rényi's recipe and the characterization of nonextensive information measures are presented in § 2.4 and § 2.5 respectively.

2.1 Classical Information Measures

In this section, we discuss the properties of two important classical information measures, Shannon entropy and Kullback-Leibler relative-entropy.
We present the definitions in the discrete case; the same for the measure-theoretic case are presented in the Chapter 3, where we discuss the maximum entropy prescriptions of information measures. We start with a brief note on the notation used in this chapter. Let X be a discrete random variable (r.v) defined on some probability space, which takes only n values and n < ∞. We denote the set of all such random variables by X. We use the symbol Y to denote a different set of random variables, say, those that take only m values and m 6= n, m < ∞. Corresponding to the n-tuple (x 1 , . . . , xn ) of values which X takes, the probability mass function (pmf) of X is denoted by p = (p 1 , . . . pn ), where P pk ≥ 0, k = 1, . . . n and nk=1 pk = 1. Expectation of the r.v X is denoted by EX or hXi; we use both the notations, interchangeably. 20 2.1.1 Shannon Entropy Shannon entropy, a logarithmic measure of information of an r.v X ∈ X denoted by S(X), reads as (Shannon, 1948) S(X) = − n X pk ln pk . (2.1) k=1 The convention that 0 ln 0 = 0 is followed, which can be justified by the fact that limx→0 x ln x = 0. This formula was discovered independently by (Wiener, 1948), hence, it is also known as Shannon-Wiener entropy. Note that the entropy functional (2.1) is determined completely by the pmf p of r.v X, and does not depend on the actual values that X takes. Hence, entropy functional is often denoted as a function of pmf alone as S(p) or S(p 1 , . . . , pn ); we use all these notations, interchangeably, depending on the context. The logarithmic function in (2.1) can be taken with respect to an arbitrary base greater than unity. In this thesis, we always use the base e unless otherwise mentioned. Shannon entropy of the Bernoulli variate is known as Shannon entropy function which is defined as follows. Let X be a Bernoulli variate with pmf (p, 1 − p) where 0 < p < 1. Shannon entropy of X or Shannon entropy function is defined as s(p) = S(p, 1 − p) = −p ln p − (1 − p) ln(1 − p) , p ∈ [0, 1] . (2.2) s(p) attains its maximum value for p = 12 . Later, in this chapter we use this function to compare Shannon entropy functional with generalized information measures, Rényi and Tsallis, graphically. Also, Shannon entropy function is of basic importance as Shannon entropy can be expressed through it as follows: p3 + (p1 + p2 + p3 )s p1 + p 2 + p 3 pn + . . . + (p1 + . . . + pn )s p1 + . . . + p n n X pk = . (2.3) (p1 + . . . + pk )s p1 + . . . + p k p2 S(p1 , . . . , pn ) = (p1 + p2 )s p1 + p 2 k=2 We have already discussed some of the basic properties of Shannon entropy in Chapter 1; here we state some properties formally. For a detailed list of properties see (Aczél & Daróczy, 1975; Guiaşu, 1977; Cover & Thomas, 1991; Topsøe, 2001). 21 S(p) ≥ 0, for any pmf p = (p1 , . . . , pn ) and assumes minimum value, S(p) = 0, for a degenerate distribution, i.e., p(x 0 ) = 1 for some x0 ∈ X, and p(x) = 0, ∀x ∈ X, x 6= x0 . If p is not degenerate then S(p) is strictly positive. For any probability distribution p = (p1 , . . . , pn ) we have 1 1 ,..., S(p) = S(p1 , . . . , pn ) ≤ S = ln n . n n (2.4) An important property of entropy functional S(p) is that it is a concave function of p. This is a very useful property since a local maximum is also the global maximum for a concave function that is subject to linear constraints. Finally, the characteristic property of Shannon entropy can be stated as follows. Let X ∈ X and Y ∈ Y be two random variables which are independent. 
Then we have, S(X × Y ) = S(X) + S(Y ) , (2.5) where X × Y denotes joint r.v of X and Y . When X and Y are not necessarily independent, then1 S(X × Y ) ≤ S(X) + S(Y ) , (2.6) i.e., the entropy of the joint experiment is less than or equal to the sum of the uncertainties of the two experiments. This is called the subadditivity property. Many sets of axioms for Shannon entropy have been proposed. Shannon (1948) has originally given a characterization theorem of the entropy introduced by him. A more general and exact one is due to Hinčin (1953), generalized by Faddeev (1986). The most intuitive and compact axioms are given by Khinchin (1956), which are known as the Shannon-Khinchin axioms. Faddeev’s axioms can be obtained as a special case of Shannon-Khinchin axioms cf. (Guiaşu, 1977, pp. 9, 63). Here we list the Shannon-Khinchin axioms. Consider the sequence of functions S(1), S(p1 , p2 ), . . . , S(p1 , . . . pn ), . . ., where, for every n, the function S(p 1 , . . . , pn ) is defined on the set ( P= (p1 , . . . , pn ) | pi ≥ 0, 1 n X pi = 1 i=1 ) . This follows from the fact that S(X × Y ) = S(X) + S(Y |X), and conditional entropy S(Y |X) ≤ S(Y ), where n m p(xi , yj ) ln p(yj |xi ) . S(Y |X) = − i=1 j=1 22 Consider the following axioms: [SK1] continuity: For any n, the function S(p 1 , . . . , pn ) is continuous and symmetric with respect to all its arguments, [SK2] expandability: For every n, we have S(p1 , . . . , pn , 0) = S(p1 , . . . , pn ) , [SK3] maximality: For every n, we have the inequality 1 1 ,..., , S(p1 , . . . , pn ) ≤ S n n [SK4] Shannon additivity: If pij ≥ 0, pi = mi X j=1 pij ∀i = 1, . . . , n, ∀j = 1, . . . , mi , (2.7) then the following equality holds: S(p11 , . . . , pnmn ) = S(p1 , . . . , pn ) + n X i=1 pimi pi1 ,..., pi S pi pi . (2.8) Khinchin uniqueness theorem states that if the functional S : P → R satisfies the axioms [SK1]-[SK4] then S is uniquely determined by S(p1 , . . . , pn ) = −c n X pk ln pk , k=1 where c is any positive constant. Proof of this uniqueness theorem for Shannon entropy can be found in (Khinchin, 1956) or in (Guiaşu, 1977, Theorem 1.1, pp. 9). 2.1.2 Kullback-Leibler Relative-Entropy Kullback and Leibler (1951) introduced relative-entropy or information divergence, which measures the distance between two distributions of a random variable. This information measure is also known as KL-entropy, cross-entropy, I-divergence, directed divergence, etc. (We use KL-entropy and relative-entropy interchangeably in this thesis.) KL-entropy of X ∈ X with pmf p with respect to Y ∈ X with pmf r is denoted by I(XkY ) and is defined as I(pkr) = I(XkY ) = n X k=1 pk ln pk , rk 23 (2.9) where one would assume that whenever r k = 0, the corresponding pk = 0 and 0 ln 00 = 0. Following Rényi (1961), if p and r are pmfs of the same r.v X, the relative-entropy is sometimes synonymously referred to as the information gain about X achieved if p can be used instead of r. KL-entropy as a distance measure on the space of all pmfs of X is not a metric, since it is not symmetric, i.e., I(pkr) 6= I(rkp), and it does not satisfy the triangle inequality. KL-entropy is an important concept in information theory, since other informationtheoretic quantities including entropy and mutual information may be formulated as special cases. For continuous distributions in particular, it overcomes the difficulties with continuous version of entropy (known as differential entropy); its definition in nondiscrete cases is a natural extension of the discrete case. 
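To make these discrete definitions concrete, the following is a minimal numerical sketch in Python (assuming NumPy is available; the helper names are ours and not part of the thesis) that computes the Shannon entropy (2.1) and the KL-entropy (2.9) for small pmfs and exhibits the asymmetry I(p‖r) ≠ I(r‖p) noted above.

```python
import numpy as np

def shannon_entropy(p):
    """Discrete Shannon entropy (2.1) in nats, with the convention 0 ln 0 = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_entropy(p, r):
    """Discrete KL-entropy (2.9); assumes p_k = 0 whenever r_k = 0."""
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / r[nz]))

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.4, 0.4, 0.2])

print(shannon_entropy(p))                   # entropy of p in nats
print(kl_entropy(p, r), kl_entropy(r, p))   # the two values differ: I is not symmetric
```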
These aspects constitute the major discussion of Chapter 3 of this thesis. Among the properties of KL-entropy, the property that I(pkr) ≥ 0 and I(pkr) = 0 if and only if p = r is fundamental in the theory of information measures, and is known as the Gibbs inequality or divergence inequality (Cover & Thomas, 1991, pp. 26). This property follows from Jensen’s inequality. I(pkr) is a convex function of both p and r. Further, it is a convex in the pair (p, r), i.e., if (p1 , r1 ) and (p2 , q2 ) are two pairs of pmfs, then (Cover & Thomas, 1991, pp. 30) I(λp1 + (1 − λ)p2 kλr1 + (1 − λ)r2 ) ≤ λI(p1 kr1 ) + (1 − λ)I(p2 kr2 ) . (2.10) Similar to Shannon entropy, KL-entropy is additive too in the following sense. Let X1 , X2 ∈ X and Y1 , Y2 ∈ Y be such that X1 and Y1 are independent, and X2 and Y2 are independent, respectively, then I(X1 × Y1 kX2 × Y2 ) = I(X1 kX2 ) + I(Y1 kY2 ) , (2.11) which is the additivity property2 of KL-entropy. Finally, KL-entropy (2.9) and Shannon entropy (2.1) are related by I(pkr) = −S(p) − n X pk ln rk . (2.12) k=1 2 Additivity property of KL-entropy can alternatively be stated as follows. Let X and Y be two independent random variables. Let p(x, y) and r(x, y) be two possible joint pmfs of X and Y . Then we have I(p(x, y)kr(x, y)) = I(p(x)kr(x)) + I(p(y)kr(y)) . 24 One has to note that the above relation between KL and Shannon entropies differs in the nondiscrete cases, which we discuss in detail in Chapter 3. 2.2 Rényi’s Generalizations Two important concepts that are essential for the derivation of Rényi entropy are Hartley information measure and generalized averages known as Kolmogorov-Nagumo averages. Hartley information measure quantifies the information associated with a single event and brings forth the operational significance of the Shannon entropy – the average of Hartley information is viewed as the Shannon entropy. Rényi used generalized averages KN, in the averaging of Hartley information to derive his generalized entropy. Before we summarize the information theory procedure leading to Rényi entropy, we discuss these concepts in detail. A conceptual discussion on significance of Hartley information in the definition of Shannon entropy can be found in (Rényi, 1960) and more formal discussion can be found in (Aczél & Daróczy, 1975, Chapter 0). Concepts related to generalized averages can be found in the book on inequalities (Hardy et al., 1934, Chapter 3). 2.2.1 Hartley Function and Shannon Entropy The motivation to quantify information in terms of logarithmic functions goes back to Hartley (1928), who first used a logarithmic function to define uncertainty associated with a finite set. This is known as Hartley information measure. The Hartley information measure of a finite set A with n elements is defined as H(A) = log b n. If the base of the logarithm is 2, then the uncertainty is measured in bits, and in the case of natural logarithm, the unit is nats. As we mentioned earlier, in this thesis, we use only natural logarithm as a convention. Hartley information measure resembles the measure of disorder in thermodynamics, first provided by Boltzmann principle (known as Boltzmann entropy), and is given by S = K ln W , (2.13) where K is the thermodynamic unit of measurement of entropy and is known as the Boltzmann constant and W , called the degree of disorder or statistical weight, is the total number of microscopic states compatible with the macroscopic state of the system. 
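As a quick numerical check of the relations just stated, the following Python sketch (again assuming NumPy; the helpers are ours) verifies the identity (2.12) and the additivity property (2.11) on product pmfs of independent pairs.

```python
import numpy as np

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl_entropy(p, r):
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / r[nz]))

p = np.array([0.5, 0.3, 0.2])
r = np.array([0.25, 0.25, 0.5])

# Relation (2.12): I(p||r) = -S(p) - sum_k p_k ln r_k
print(np.isclose(kl_entropy(p, r), -shannon_entropy(p) - np.sum(p * np.log(r))))  # True

# Additivity (2.11): KL-entropy of product pmfs of independent pairs
p2, r2 = np.array([0.7, 0.3]), np.array([0.6, 0.4])
joint_p, joint_r = np.outer(p, p2).ravel(), np.outer(r, r2).ravel()
print(np.isclose(kl_entropy(joint_p, joint_r),
                 kl_entropy(p, r) + kl_entropy(p2, r2)))                          # True
```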
25 One can give a more general definition of Hartley information measure described above as follows. Define a function H : {x 1 , . . . , xn } → R of the values taken by r.v X ∈ X with corresponding p.m.f p = (p1 , . . . pn ) as (Aczél & Daróczy, 1975) H(xk ) = ln 1 , ∀k = 1, . . . n. pk (2.14) H is also known as information content or entropy of a single event (Aczél & Daróczy, 1975) and plays an important role in all classical measures of information. It can be interpreted either as a measure of how unexpected the given event is, or as measure of the information yielded by the event; and it has been called surprise by Watanabe (1969), and unexpectedness by Barlow (1990). Hartley function satisfies: (i) H is nonnegative: H(x k ) ≥ 0 (ii) H is additive: H(xi , xj ) = H(xi ) + H(xj ), where H(xi , xj ) = ln pi1pj (iii) H is normalized: H(xk ) = 1, whenever pk = satisfied for pk = 1 2 ). 1 e (in the case of logarithm with base 2, the same is These properties are both necessary and sufficient (Aczél & Daróczy, 1975, Theorem 0.2.5). Now, Shannon entropy (2.1) can be written as expectation of Hartley function as S(X) = hHi = n X pk Hk , (2.15) k=1 where Hk = H(xk ), ∀k = 1, . . . n, with the understanding that hHi = hH(X)i. The characteristic additive property of Shannon entropy (2.5) now follows as a consequence of the additivity property of Hartley function. There are two postulates involved in defining Shannon entropy as expectation of Hartley function. One is the additivity of information which is the characteristic property of Hartley function, and the other is that if different amounts of information occur with different probabilities, the total information will be the average of the individual informations weighted by the probabilities of their occurrences. One can justify these postulates by heuristic arguments based on probabilistic considerations, which can be advanced to establish the logarithmic nature of Hartley and Shannon information measures (see § 1.2.1). Expressing or defining Shannon entropy as an expectation of Hartley function, not only provides an intuitive idea of Shannon entropy as a measure of information but it is also useful in derivation of its properties. Further, as we are going to see in detail, this provides a unified way to discuss the Rényi’s and Tsallis generalizations of Shannon entropy. Now we move on to a discussion on generalized averages. 26 2.2.2 Kolmogorov-Nagumo Averages or Quasilinear Means In the general theory of means, the quasilinear mean of a random variable X ∈ X is defined as3 Eψ X = hXiψ = ψ −1 n X pk ψ (xk ) k=1 ! , (2.16) where ψ is continuous and strictly monotonic (increasing or decreasing) and hence has an inverse ψ −1 , which satisfies the same conditions. In the context of generalized means, ψ is referred to as Kolmogorov-Nagumo function (KN-function). In particular, if ψ is linear, then (2.16) reduces to the expression of linear averaging, P EX = hXi = nk=1 pk xk . Also, the mean hXiψ takes the form of weighted arith1 P Q metic mean ( nk=1 pk xak ) a when ψ(x) = xa , a > 0 and geometric mean nk=1 xpkk if ψ(x) = ln x. In order to justify (2.16) as a so called mean we need the following theorem. T HEOREM 2.1 If ψ is continuous and strictly monotone in a ≤ x ≤ b, a ≤ x k ≤ b, k = 1, . . . n, P pk > 0 and nk=1 pk = 1, then ∃ unique x0 ∈ (a, b) such that ψ(x0 ) = n X pk ψ(xk ) , (2.17) k=1 and x0 is greater than some and less than others of the x k unless all xk are zero. The implication of Theorem 2.1 is that the mean h . 
i ψ is determined when the function ψ is given. One may ask whether the converse is true: if hXi ψ1 = hXiψ2 for all X ∈ X, is ψ1 necessarily the same function as ψ2 ? Before answering this question, we shall give the following definition. D EFINITION 2.1 Continuous and strictly monotone functions ψ 1 and ψ2 are said to be KN-equivalent if hXiψ1 = hXiψ2 for all X ∈ X. 3 Kolmogorov (1930) and Nagumo (1930) first characterized the quasilinear mean for a vector 1 (x1 , . . . , xn ) as hxiψ = ψ −1 n k=1 n ψ(xk ) where ψ is a continuous and strictly monotone function. de Finetti (1931) extended their result to the case of simple (finite) probability distributions. The version of the quasilinear mean representation theorem referred to in § 2.5 is due to Hardy et al. (1934), which followed closely the approach of de Finetti. Aczél (1948) proved a characterization of the quasilinear mean using functional equations. Ben-Tal (1977) showed that quasilinear means are ordinary arithmetic means under suitably defined addition and scalar multiplication operations. Norries (1976) did a survey of quasilinear means and its more restrictive forms in Statistics, and a more recent survey of generalized means can be found in (Ostasiewicz & Ostasiewicz, 2000). Applications of quasilinear means can be found in economics (e.g., Epstein & Zin, 1989) and decision theory (e.g., Kreps & Porteus, 1978)). Recently Czachor and Naudts (2002) studied generalized thermostatistics based on quasilinear means. 27 Note that when we compare two means, it is to be understood that the underlying probabilities are same. Now, the following theorem characterizes KN-equivalent functions. T HEOREM 2.2 In order that two continuous and strictly monotone functions ψ 1 and ψ2 are KNequivalent, it is necessary and sufficient that ψ1 = αψ2 + β , (2.18) where α and β are constants and α 6= 0. A simple consequence of the above theorem is that if ψ is a KN-function then we have hXiψ = hXi−ψ . Hence, without loss of generality, one can assume that ψ is an increasing function. The following theorem states the important property of KN- averages, which characterizes additivity of quasilinear means cf. (Hardy et al., 1934, Theorem 84). T HEOREM 2.3 Let ψ be a KN-function and c be a real constant then hX + ci ψ = hXiψ + c i.e., ψ −1 n X pk ψ (xk + c) k=1 ! = ψ −1 n X k=1 pk ψ (xk ) ! +c if and only if ψ is either linear or exponential. Proofs of Theorems 2.1, 2.2 and 2.3 can be found in the book on inequalities by Hardy et al. (1934). Rényi (1960) employed these generalized averages in the definition of Shannon entropy to generalize the same. 2.2.3 Rényi Entropy In the definition of Shannon entropy (2.15), if the standard mean of Hartley function H is replaced with the quasilinear mean (2.16), one can obtain a generalized measure of information of r.v X with respect to a KN-function ψ as ! ! n n X X 1 pk ψ ln pk ψ (Hk ) , Sψ (X) = ψ −1 = ψ −1 pk k=1 (2.19) k=1 where ψ is a KN-function. We refer to (2.19) as quasilinear entropy with respect to the KN-function ψ. A natural question that arises is what is the possible mathematical form of KN-function ψ, or in other words, what is the most general class of functions ψ which will still provide a measure of information compatible with the additivity 28 property (postulate)? The answer is that insisting on additivity allows by Theorem 2.3 only for two classes of ψ’s – linear and exponential functions. We formulate these arguments formally as follows. 
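Before formalizing this, the following minimal Python sketch (assuming NumPy; the function names are ours) illustrates Theorem 2.3 numerically: the KN-average with an exponential KN-function is translation invariant, whereas a cubic KN-function, although continuous and strictly monotone, is not.

```python
import numpy as np

def kn_mean(x, p, psi, psi_inv):
    """Kolmogorov-Nagumo (quasilinear) mean (2.16) of values x under the pmf p."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return psi_inv(np.sum(p * psi(x)))

x = np.array([0.2, 1.0, 2.5])
p = np.array([0.3, 0.5, 0.2])
c, alpha = 0.7, 0.6

# Exponential KN-function: the translation property of Theorem 2.3 holds
exp_psi = lambda t: np.exp((1.0 - alpha) * t)
exp_psi_inv = lambda t: np.log(t) / (1.0 - alpha)
print(np.isclose(kn_mean(x + c, p, exp_psi, exp_psi_inv),
                 kn_mean(x, p, exp_psi, exp_psi_inv) + c))    # True

# A cubic KN-function is continuous and strictly monotone, yet the property fails
cube, cube_inv = lambda t: t ** 3, np.cbrt
print(np.isclose(kn_mean(x + c, p, cube, cube_inv),
                 kn_mean(x, p, cube, cube_inv) + c))          # False
```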
If we impose the constraint of additivity on S ψ , i.e., for any X, Y ∈ X Sψ (X × Y ) = Sψ (X) + Sψ (Y ) , (2.20) then ψ should satisfy (Rényi, 1960) hX + ciψ = hXiψ + c , (2.21) for any random variable X ∈ X and a constant c. Rényi employed this formalism to define a one-parameter family of measures of information as follows: n X 1 pαk ln Sα (X) = 1−α k=1 ! , (2.22) where the KN-function ψ is chosen in (2.19) as ψ(x) = e (1−α)x whose choice is motivated by Theorem 2.3. If we choose ψ as a linear function in quasilinear entropy (2.19), what we get is Shannon entropy. The right side of (2.22) makes sense 4 as a measure of information whenever α 6= 1 and α > 0 cf. (Rényi, 1960). Rényi entropy is a one-parameter generalization of Shannon entropy in the sense that Sα (p) → S(p) as α → 1. Hence, Rényi entropy is referred to as entropy of order α, whereas Shannon entropy is referred to as entropy of order 1. The Rényi entropy can also be seen as an interpolation formula connecting the Shannon (α = 1) and Hartley (α = 0) entropies. Among the basic properties of Rényi entropy, S α is positive. This follows from P Jensen’s inequality which gives nk=1 pαk ≤ 1 in the case α > 1, and while in the case P 0 < α < 1 it gives nk=1 pαk ≥ 1; in both cases we have Sα (p) ≥ 0. Sα is strictly concave with respect to p for 0 < α ≤ 1. For α > 1, Rényi entropy is neither pure convex nor pure concave. This is a simple consequence of the fact that both ln x and xα (α < 1) are concave functions, while x α is convex for α > 1 (see (Ben-Bassat & Raviv, 1978) for proofs and a detailed discussion). 4 For negative α, however, Sα (p) has disadvantageous properties; namely, it will tend to infinity if any pk tends to 0. This means that it is too sensitive to small probabilities. (This property could also formulated in the following way: if we add a new event of probability 0 to a probability distribution, what does not change the probability distribution, Sα (p) becomes infinity.) The case α = 0 must also be excluded because it yields an expression not depending on the probability distribution p = (p1 , . . . , pn ). 29 A notable property of Sα (p) is that it is a monotonically decreasing function of α for any pmf p. This can be verified as follows. I We can calculate the derivative of Sα (p) with respect to α as n X dSα (p) 1 = dα (1 − α) pαk Pn α j=1 pj k=1 1 = (1 − α)2 ( n X k=1 pαk Pn α j=1 pj ! ! ln pk + ln p1−α k − ln 1 ln (1 − α)2 n X k=1 n X k=1 pαk Pn One should note here that the vector of positive real numbers pαk α j=1 pj ! p1−α k ) . (2.23) pα 1 n j=1 pα j ,..., pα n n j=1 pα j represents a pmf. (Indeed, distributions of this form are known as escort distribu- tions (Abe, 2003) and plays an important role in ME-prescriptions of Tsallis entropy. We discuss these aspects in Chapter 3.) Denoting the mean of a vector x = (x1 , . . . , xn ) with respect to this pmf, i.e. escort distribution of p, by hhxii α we can write (2.23) in an elegant form, which further gives the results as 1 dSα (p) 1−α 1−α = hhln p iiα − ln hhp iiα ≤ 0 . dα (1 − α)2 (2.24) The inequality in (2.24) is due to Jensen’s inequality. J Important consequences of the fact that Sα is a monotone decreasing function of α are the following two inequalities S1 (p) < Sα (p) < ln n , 0 < α < 1, (2.25a) Sα (p) < S1 (p) < ln n , α > 1, (2.25b) where S1 (p) = limα→1 Sα (p) is the Shannon entropy. 
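The following minimal Python sketch (assuming NumPy; the helper names are ours) evaluates S_α(p) for several values of α, illustrating that it is non-increasing in α and that it approaches the Shannon entropy as α → 1.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy (2.22) of order alpha (alpha > 0, alpha != 1), in nats."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

p = np.array([0.5, 0.25, 0.15, 0.1])
alphas = [0.2, 0.5, 0.999, 1.001, 2.0, 5.0]
values = [renyi_entropy(p, a) for a in alphas]

# S_alpha(p) is non-increasing in alpha ...
print(all(v1 >= v2 for v1, v2 in zip(values, values[1:])))                  # True
# ... and tends to the Shannon entropy as alpha -> 1
print(np.isclose(renyi_entropy(p, 0.999), shannon_entropy(p), atol=1e-2))   # True
```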
From the derivation of Rényi entropy it is obvious that it is additive, i.e.,
$$ S_\alpha(X \times Y) = S_\alpha(X) + S_\alpha(Y) \;, \qquad (2.26) $$
where X ∈ X and Y ∈ Y are two independent random variables. Most of the other known properties of Rényi entropy and its characterizations are summarized by Aczél and Daróczy (1975, Chapter 5) and Jizba and Arimitsu (2004b). Properties related to convexity and bounds of Rényi entropy can be found in (Ben-Bassat & Raviv, 1978).

Similar to the Shannon entropy function (2.2), one can define the entropy function in the Rényi case as
$$ s_\alpha(p) = \frac{1}{1-\alpha} \ln \left( p^\alpha + (1-p)^\alpha \right) \;, \qquad p \in [0,1] \;, \qquad (2.27) $$
which is the Rényi entropy of a Bernoulli random variable. Figure 2.1 shows the plot of the Shannon entropy function (2.2) compared with the Rényi entropy function (2.27) for various values of the entropic index α.

[Figure 2.1: Shannon and Rényi entropy functions s(p) and s_α(p), plotted for α = 0.8, 1.2, 1.5.]

Rényi entropy does have a reasonable operational significance, even if not one comparable with that of Shannon entropy, cf. (Csiszár, 1974). As regards the axiomatic approach, Rényi (1961) did suggest a set of postulates characterizing his entropies, but it involved the rather artificial procedure of considering incomplete pdfs ($\sum_{k=1}^{n} p_k \leq 1$) as well. This shortcoming has been eliminated by Daróczy (1970). More recently, a slightly different set of axioms has been given by Jizba and Arimitsu (2004b). Despite its formal origin, Rényi entropy has proved important in a variety of practical applications in coding theory (Campbell, 1965; Aczél & Daróczy, 1975; Lavenda, 1998), statistical inference (Arimitsu & Arimitsu, 2000, 2001), quantum mechanics (Maassen & Uffink, 1988), chaotic dynamical systems (Halsey, Jensen, Kadanoff, Procaccia, & Shraiman, 1986), etc. Rényi entropy is also used in neural networks (Kamimura, 1998). Thermodynamic properties of systems with multi-fractal structures have been studied by extending the notion of Gibbs-Shannon entropy into a more general framework, Rényi entropy (Jizba & Arimitsu, 2004a).

Entropy of order 2, i.e., Rényi entropy for α = 2,
$$ S_2(p) = - \ln \sum_{k=1}^{n} p_k^2 \qquad (2.28) $$
is known as Rényi quadratic entropy. Rényi quadratic entropy is mostly used in the context of kernel-based estimators, since it allows an explicit computation of the estimated density. This measure has also been applied to clustering problems under the name of information-theoretic clustering (Gokcay & Principe, 2002). Maximum entropy formulations of Rényi quadratic entropy have been studied to compute conditional probabilities, with applications to image retrieval and language modeling, in the PhD thesis of Zitnick (2003).

Along similar lines of generalization of entropy, Rényi (1960) defined a one-parameter generalization of Kullback-Leibler relative-entropy as
$$ I_\alpha(p \| r) = \frac{1}{\alpha - 1} \ln \sum_{k=1}^{n} \frac{p_k^\alpha}{r_k^{\alpha - 1}} \qquad (2.29) $$
for pmfs p and r. Properties of this generalized relative-entropy can be found in (Rényi, 1970, Chapter 9). We conclude this section with the note that, though the first formal generalized measure of information is considered to be due to Rényi, the idea of considering some generalized measure did not start with Rényi. Bhattacharyya (1943, 1946) and Jeffreys (1948) dealt with the quantity
$$ I_{1/2}(p \| r) = -2 \ln \sum_{k=1}^{n} \sqrt{p_k r_k} = I_{1/2}(r \| p) \qquad (2.30) $$
as a measure of the difference between the distributions p and r, which is nothing but Rényi relative-entropy (2.29) with α = 1/2.
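As a numerical illustration (a Python sketch assuming NumPy; the helpers are ours), one can check that the Rényi relative-entropy (2.29) tends to the KL-entropy as α → 1, and that the choice α = 1/2 yields the symmetric quantity (2.30).

```python
import numpy as np

def renyi_relative_entropy(p, r, alpha):
    """Rényi relative-entropy (2.29) of order alpha (alpha != 1)."""
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    return np.log(np.sum(p ** alpha / r ** (alpha - 1.0))) / (alpha - 1.0)

def kl_entropy(p, r):
    p = np.asarray(p, dtype=float)
    r = np.asarray(r, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / r[nz]))

p = np.array([0.6, 0.3, 0.1])
r = np.array([0.2, 0.5, 0.3])

# alpha -> 1 recovers the KL relative-entropy
print(np.isclose(renyi_relative_entropy(p, r, 1.0001), kl_entropy(p, r), atol=1e-3))  # True

# alpha = 1/2 gives the symmetric Bhattacharyya-type quantity (2.30)
i_half = renyi_relative_entropy(p, r, 0.5)
print(np.isclose(i_half, -2.0 * np.log(np.sum(np.sqrt(p * r)))))                      # True
print(np.isclose(i_half, renyi_relative_entropy(r, p, 0.5)))                          # True
```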
Before Rényi, Schützenberger (1954) mentioned the expression Sα and Kullback (1959) too dealt with the quantities I α . (One can refer (Rényi, 1960) for a discussion on the context in which Kullback considered these generalized entropies.) Apart from Rényi and Tsallis generalizations, there are various generalizations of Shannon entropy reported in literature. Reviews of these generalizations can be found in Kapur (1994) and Arndt (2001). The characterizations of various information measures are studied in (Ebanks, Sahoo, & Sander, 1998). Since poorly motivated generalizations have also been published during Rényi’s time, Rényi emphasized the need of operational as well as postulational justification in order to call an algebraic 32 expression an information quantity. In this respect, Rényi’s review paper (Rényi, 1965) is particularly instructive. Now we discuss the important, non-additive generalization of Shannon entropy. 2.3 Nonextensive Generalizations Although, first introduced by Havrda and Charvát (1967) in the context of cybernetics theory and later studied by Daróczy (1970), it was Tsallis (1988) who exploited its nonextensive features and placed it in a physical setting. Hence it is also known as Harvda-Charvat-Daróczy-Tsallis entropy. (Throughout this paper we refer to this as Tsallis or nonextensive entropy.) 2.3.1 Tsallis Entropy Tsallis entropy of an r.v X ∈ X with p.m.f p = (p 1 , . . . pn ) is defined as P 1 − nk=1 pqk , Sq (X) = q−1 (2.31) where q > 0 is called the nonextensive index. Tsallis entropy too, like Rényi entropy, is a one-parameter generalization of Shannon entropy in the sense that lim Sq (p) = − q→1 n X pk ln pk = S1 (p) , (2.32) k=1 since in the limit q → 1, we have pkq−1 = e(q−1) ln pk ∼ 1 + (q − 1) ln pk or by the L’Hospital rule. Tsallis entropy retains many important properties of Shannon entropy except for the additivity property. Here we briefly discuss some of these properties. The arguments which provide the positivity of Rényi entropy are also applicable for Tsallis entropy and hence Sq (p) ≥ 0 for any pmf p. Sq equals zero in the case of certainty and attains its extremum for a uniform distribution. The fact that Tsallis entropy attains maximum for uniform distribution can be shown as follows. I We extremize the Tsallis entropy under the normalizing conP straint nk=1 pk = 1. By introducing the Lagrange multiplier λ, we set !! P n X 1 − nk=1 pqk ∂ q 0= −λ pk − 1 =− pq−1 − λ . ∂pk q−1 q−1 k k=1 33 It follows that λ(1 − q) pk = q 1 q−1 . Since this is independent of k, imposition of the normalizing constraint immediately yields pk = n1 . J Tsallis entropy is concave for all q > 0 (convex for q < 0). I This follows immediately from the Hessian matrix ∂2 ∂pi ∂pj Sq (p) − λ n X k=1 pk − 1 !! = −qpiq−2 δij , which is clearly negative definite for q > 0 (positive definite for q < 0). J One can recall that Rényi entropy (2.22) is concave only for 0 < α < 1. Also, one can prove that for two pmfs p and r, and for real number 0 ≤ λ ≤ 1 we have Sq (λp + (1 − λ)r) ≥ λSq (p) + (1 − λ)Sq (r) , which results from Jensen’s inequality and concavity of (2.33) xq 1−q . What separates out Tsallis entropy from Shannon and Rényi entropies is that it is not additive. The entropy index q in (2.31) characterizes the degree of nonextensivity reflected in the pseudo-additivity property Sq (X ×Y ) = Sq (X)⊕q Sq (Y ) = Sq (X)+Sq (Y )+(1−q)Sq (X)Sq (Y ) ,(2.34) where X, Y ∈ X are two independent random variables. 
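The following is a minimal Python sketch (assuming NumPy; the helper names are ours) verifying the q → 1 limit (2.32) and the pseudo-additivity property (2.34) on the joint pmf of two independent random variables.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy (2.31) with nonextensive index q (q != 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

q = 1.5
p = np.array([0.5, 0.3, 0.2])
r = np.array([0.7, 0.3])

# The limit q -> 1 recovers Shannon entropy (2.32)
print(np.isclose(tsallis_entropy(p, 1.0001), shannon_entropy(p), atol=1e-3))     # True

# Pseudo-additivity (2.34) on the joint pmf of two independent variables
joint = np.outer(p, r).ravel()
s_p, s_r = tsallis_entropy(p, q), tsallis_entropy(r, q)
print(np.isclose(tsallis_entropy(joint, q), s_p + s_r + (1.0 - q) * s_p * s_r))  # True
```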
In the nonextensive case, the Tsallis entropy function can be written as
$$ s_q(p) = \frac{1 - p^q - (1-p)^q}{q-1} \;, \qquad p \in [0,1] \;. \qquad (2.35) $$
Figure 2.2 shows the plots of the Shannon entropy function (2.2) and the Tsallis entropy function (2.35) for various values of the entropic index q. It is worth mentioning here that the derivation of Tsallis entropy using the Lorentz addition by Amblard and Vignat (2005) gives insights into the boundedness of Tsallis entropy. In this thesis we will not go into these details.

[Figure 2.2: Shannon and Tsallis entropy functions s(p) and s_q(p), plotted for q = 0.8, 1.2, 1.5.]

The first set of axioms for Tsallis entropy was given by dos Santos (1997) and later improved by Abe (2000). The most concise set of axioms is given by Suyari (2004a); these are known as the generalized Shannon-Khinchin axioms. A simplified proof of the corresponding uniqueness theorem for Tsallis entropy is given by Furuichi (2005). In these axioms, Shannon additivity (2.8) is generalized to
$$ S_q(p_{11}, \ldots, p_{nm_n}) = S_q(p_1, \ldots, p_n) + \sum_{i=1}^{n} p_i^q \, S_q\!\left( \frac{p_{i1}}{p_i}, \ldots, \frac{p_{im_i}}{p_i} \right) \;, \qquad (2.36) $$
under the same conditions (2.7); the remaining axioms are the same as in the Shannon-Khinchin axioms.

Now we turn our attention to the nonextensive generalization of relative-entropy. The definition of Kullback-Leibler relative-entropy (2.9) and the nonextensive entropic functional (2.31) naturally lead to the generalization (Tsallis, 1998)
$$ I_q(p \| r) = \sum_{k=1}^{n} p_k \, \frac{\left( p_k / r_k \right)^{q-1} - 1}{q-1} \;, \qquad (2.37) $$
which is called Tsallis relative-entropy. The limit q → 1 recovers the relative-entropy in the classical case. One can also generalize the Gibbs inequality as (Tsallis, 1998)
$$ I_q(p \| r) \;\; \begin{cases} \geq 0 & \text{if } q > 0 \\ = 0 & \text{if } q = 0 \\ \leq 0 & \text{if } q < 0 \;. \end{cases} \qquad (2.38) $$
For q ≠ 0, the equalities hold if and only if p = r. (2.38) can be verified as follows. I Consider the function $f(x) = \frac{1 - x^{1-q}}{1-q}$. We have f''(x) > 0 for q > 0 and hence it is convex. By Jensen's inequality we obtain
$$ I_q(p \| r) = \frac{1}{1-q} \left( 1 - \sum_{k=1}^{n} p_k \left( \frac{r_k}{p_k} \right)^{1-q} \right) \geq \frac{1}{1-q} \left( 1 - \left( \sum_{k=1}^{n} p_k \frac{r_k}{p_k} \right)^{1-q} \right) = 0 \;. \qquad (2.39) $$
For q < 0 we have f''(x) < 0 and hence we have the reverse inequality by Jensen's inequality for concave functions. J

Further, for q > 0, I_q(p‖r) is a convex function of p and r, and for q < 0 it is concave, which can be proved using Jensen's inequality cf. (Borland, Plastino, & Tsallis, 1998). Tsallis relative-entropy satisfies the pseudo-additivity property of the form (Furuichi et al., 2004)
$$ I_q(X_1 \times Y_1 \| X_2 \times Y_2) = I_q(X_1 \| X_2) + I_q(Y_1 \| Y_2) + (q-1) \, I_q(X_1 \| X_2) \, I_q(Y_1 \| Y_2) \;, \qquad (2.40) $$
where X_1, X_2 ∈ X and Y_1, Y_2 ∈ Y are such that X_1 and Y_1 are independent, and X_2 and Y_2 are independent, respectively. The limit q → 1 in (2.40) retrieves (2.11), the additivity property of Kullback-Leibler relative-entropy. One should note the difference between the pseudo-additivities of Tsallis entropy (2.34) and Tsallis relative-entropy (2.40). Further properties of Tsallis relative-entropy have been discussed in (Tsallis, 1998; Borland et al., 1998; Furuichi et al., 2004). A characterization of Tsallis relative-entropy, obtained by generalizing Hobson's uniqueness theorem (Hobson, 1969) for relative-entropy, is presented in (Furuichi, 2005).

2.3.2 q-Deformed Algebra

The mathematical basis for Tsallis statistics comes from the q-deformed expressions for the logarithm (q-logarithm) and the exponential function (q-exponential), which were first defined in (Tsallis, 1994) in the context of nonextensive thermostatistics.
The q-logarithm is defined as lnq x = x1−q − 1 (x > 0, q ∈ R) , 1−q 36 (2.41) and the q-exponential is defined as ( 1 [1 + (1 − q)x] 1−q if 1 + (1 − q)x ≥ 0 x eq = 0 otherwise. (2.42) We have limq→1 lnq x = ln x and limq→1 exq = ex . These two functions are related by ln x eq q = x . (2.43) The q-logarithm satisfies pseudo-additivity of the form lnq (xy) = lnq x + lnq y + (1 − q) lnq x lnq y , (2.44) while, the q-exponential satisfies exq eyq = e(x+y+(1−q)xy) . q (2.45) One important property of the q-logarithm is (Furuichi, 2006) x lnq = y q−1 (lnq x − lnq y) . y (2.46) These properties of q-logarithm and q-exponential functions, (2.44) and (2.45), motivate the definition of q-addition as x ⊕q y = x + y + (1 − q)xy , (2.47) which we have already mentioned in the context of pseudo-additivity of Tsallis entropy (2.34). The q-addition is commutative i.e., x ⊕ q y = y ⊕q x, and associative i.e., x ⊕q (y ⊕q z) = (x ⊕q y) ⊕q z. But it is not distributive with respect to the usual multiplication, i.e., a(x ⊕q y) 6= (ax ⊕q ay). Similar to the definition of q-addition, the q-difference is defined as x q y = 1 x−y , y 6= . 1 + (1 − q)y q−1 (2.48) Further properties of these q-deformed functions can be found in (Yamano, 2002). In this framework a new multiplication operation called q-product has been defined, which plays an important role in the compact representation of distributions resulting from Tsallis relative-entropy minimization (Dukkipati, Murty, & Bhatnagar, 2005b). These aspects are discussed in Chapter 4. Now, using these q-deformed functions, Tsallis entropy (2.31) can be represented as Sq (p) = − n X pqk lnq pk , (2.49) k=1 37 and Tsallis relative-entropy (2.37) as Iq (pkr) = − n X k=1 pk lnq rk . pk (2.50) These representations are very important for deriving many results related to nonextensive generalizations as we are going to consider in the later chapters. 2.4 Uniqueness of Tsallis Entropy under Rényi’s Recipe Though the derivation of Tsallis entropy proposed in 1988 is slightly different, one can understand this generalization using the q-logarithm function, where one would first generalize logarithm in the Hartley information with the q-logarithm and define e : {x1 , . . . , xn } → R of r.v X as (Tsallis, 1999) the q-Hartley function H e k = H(x e k ) = lnq 1 , H pk k = 1, . . . n . (2.51) Now, Tsallis entropy (2.31) can be defined as the expectation of the q-Hartley function e as5 H D E e . Sq (X) = H (2.52) Note that the characteristic pseudo-additivity property of Tsallis entropy (2.34) is a consequence of the pseudo-additivity of the q-logarithm (2.44). Before we present the main results, we briefly discuss the context of quasilinear means, where there is a relation between Tsallis and Rényi entropy. By using the definition of the q-logarithm (2.41), the q-Hartley function can be written as where e k = lnq 1 = φq (Hk ) , H pk φq (x) = e(1−q)x − 1 = lnq (ex ) . 1−q (2.53) Note that the function φq is KN-equivalent to e(1−q)x (by Theorem 2.2), the KNfunction used in Rényi entropy. Hence Tsallis entropy is related to Rényi entropies as SqT = φq (SqR ) , (2.54) 5 There are alternative definitions of nonextensive information content in the Tsallis formalism. One of them is the expression − lnq pk used by Yamano (2001) and characterized by Suyari (2002) (note that − lnq pk 6= lnq p1k ). Using this definition one has to use alternate expectation, called q-expectation, to define Tsallis entropy. We discuss q-expectation values in Chapter 3. 
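A minimal Python sketch (assuming NumPy; the function names are ours) of the q-deformed functions is given below; it checks that the q-exponential inverts the q-logarithm (2.43), that the q-logarithm is pseudo-additive (2.44), and that relation (2.54) between Tsallis and Rényi entropies holds.

```python
import numpy as np

def ln_q(x, q):
    """q-logarithm (2.41); reduces to ln x as q -> 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential (2.42)."""
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base >= 0 else 0.0

q, x, y = 0.7, 2.0, 3.5

# exp_q inverts ln_q (2.43), and ln_q is pseudo-additive (2.44)
print(np.isclose(exp_q(ln_q(x, q), q), x))                                      # True
lhs = ln_q(x * y, q)
rhs = ln_q(x, q) + ln_q(y, q) + (1.0 - q) * ln_q(x, q) * ln_q(y, q)
print(np.isclose(lhs, rhs))                                                     # True

# Relation (2.54): Tsallis entropy is phi_q applied to the Rényi entropy
p = np.array([0.5, 0.3, 0.2])
renyi = np.log(np.sum(p ** q)) / (1.0 - q)
tsallis = (1.0 - np.sum(p ** q)) / (q - 1.0)
phi_q = lambda s: (np.exp((1.0 - q) * s) - 1.0) / (1.0 - q)
print(np.isclose(tsallis, phi_q(renyi)))                                        # True
```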
Regarding the definition of nonextensive information content, we use Tsallis (1999) definition (2.51) in this thesis. 38 0.9 Renyi Tsallis q<1 0.8 0.6 0.5 0.4 R T Sq (p) & Sq (p) 0.7 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 p (a) Entropic Index q = 0.8 0.7 Renyi Tsallis q>1 0.6 0.4 0.3 R T Sq (p) & Sq (p) 0.5 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 p (b) Entropic Index q = 1.2 Figure 2.3: Comparison of Rényi and Tsallis Entropy Functions where SqT and SqR denote the Tsallis and Rényi entropies respectively with a real number q as a parameter. (2.54) implies that Tsallis and Rényi entropies are monotonic functions of each other and, as a result, both must be maximized by the same probability distribution. In this thesis, we consider only ME-prescriptions related to nonextensive entropies. Discussion on ME of Rényi entropy can be found in (Bashkirov, 2004; Johnson & Vignat, 2005; Costa, Hero, & Vignat, 2002). Comparisons of Rényi entropy function (2.27) with Tsallis entropy function (2.35) are shown graphically in Figure 2.3 for two cases of entropic index, corresponding to 0 < q < 1 and q > 1 respectively. Now a natural question that arises is whether one could generalize Tsallis entropy using Rényi’s recipe, i.e. by replacing the linear average in (2.52) by KN-averages and imposing the condition of pseudo-additivity. It is equivalent to determining the KN-function ψ for which the so called q-quasilinear 39 entropy defined as " n # D E X e ek Seψ (X) = H pk ψ H = ψ −1 , ψ (2.55) k=1 e k = H(x e k ), ∀k = 1, . . . n, satisfies the pseudo-additivity property. where H First, we present the following result which characterizes the pseudo-additivity of quasilinear means. T HEOREM 2.4 Let X, Y ∈ X be two independent random variables. Let ψ be any KN-function. Then hX ⊕q Y iψ = hXiψ ⊕q hY iψ (2.56) if and only if ψ is linear. Proof Let p and r be the p.m.fs of random variables X, Y ∈ X respectively. The proof of sufficiency is simple and follows from hX ⊕q Y iψ = hX ⊕q Y i = = = n n X X i=1 j=1 n n X X i=1 j=1 n X pi rj (xi ⊕q yj ) pi rj (xi + yj + (1 − q)xi yj ) pi xi + i=1 n X j=1 rj yj + (1 − q) n X i=1 pi xi n X rj yj . j=1 To prove the converse, we need to determine all forms of ψ which satisfy n X n X ψ −1 pi rj ψ (xi ⊕q yj ) i=1 j=1 = ψ −1 n X pi ψ (xi ) i=1 ! n X ⊕q ψ −1 rj ψ (yj ) . (2.57) j=1 Since (2.57) must hold for arbitrary p.m.fs p, r and for arbitrary numbers x 1 , . . . , xn and y1 , . . . , yn , one can choose yj = c for all j. Then (2.57) yields ψ −1 n X i=1 pk ψ (xi ⊕q c) ! = ψ −1 n X i=1 40 pk ψ (xi ) ! ⊕q c . (2.58) That is, ψ should satisfy hX ⊕q ciψ = hXiψ ⊕q c , (2.59) for any X ∈ X and any constant c. This can be rearranged as h(1 + (1 − q)c)X + ciψ = (1 + (1 − q)c)hXi ψ + c by using the definition of ⊕q . Since q is independent of other quantities, ψ should satisfy an equation of the form hdX + ciψ = dhXiψ + c , (2.60) where d 6= 0 (by writing d = (1 + (1 − q)c)). Finally ψ must satisfy hX + ciψ = hXiψ + c (2.61) hdXiψ = dhXiψ , (2.62) and for any X ∈ X and any constants d, c. From Theorem 2.3, condition (2.61) is satisfied only when ψ is linear or exponential. To complete the theorem, we have to show that KN-averages do not satisfy condition (2.62) when ψ is exponential. For a particular choice of ψ(x) = e (1−α)x , assume that hdXiψ = dhXiψ , (2.63) where hdXiψ1 n X 1 = pk e(1−α)dxk ln 1−α dhXiψ1 n X d = ln pk e(1−α)xk 1−α k=1 and k=1 ! ! , . Now define a KN-function ψ 0 as ψ 0 (x) = e(1−α)dx , for which ! n X 1 pk e(1−α)dxk . 
hXiψ0 = ln d(1 − α) k=1 Condition (2.63) implies hXiψ = hXiψ0 , and by Theorem 2.2, ψ and ψ 0 are KN-equivalent, which gives a contradiction. 41 One can observe that the above proof avoids solving functional equations as in the case of the proof of Theorem 2.3 (e.g., Aczél & Daróczy, 1975). Instead, it makes use of Theorem 2.3 itself and other basic properties of KN-averages. The following corollary is an immediate consequence of Theorem 2.4. C OROLLARY 2.1 The q-quasilinear entropy Seψ (defined as in (2.55)) with respect to a KN-function ψ satisfies pseudo-additivity if and only if Seψ is Tsallis entropy. Proof Let X, Y ∈ X be two independent random variables and let p, r be their corresponding pmfs. By the pseudo-additivity constraint, ψ should satisfy Seψ (X × Y ) = Seψ (X) ⊕q Seψ (Y ) . (2.64) From the property of q-logarithm that lnq xy = lnq x ⊕q lnq y, we need ψ −1 n n X X pi rj ψ lnq i=1 j=1 = ψ −1 n X i=1 1 pi rj pi ψ lnq 1 pi ! n X 1 . rj ψ lnq ⊕q ψ −1 rj Equivalently, we need n X n X e p ⊕q H er ψ −1 pi rj ψ H j i i=1 j=1 = ψ −1 n X i=1 ep pi ψ H i ! (2.65) j=1 ⊕q ψ −1 n X j=1 e jr , rj ψ H e p and H e r represent the q-Hartley functions corresponding to probability diswhere H tributions p and r respectively. That is, ψ should satisfy e p ⊕q H e r i = hH e p i ⊕q hH e ri . hH ψ ψ ψ Also from Theorem 2.4, ψ is linear and hence Seψ is Tsallis. Corollary 2.1 shows that using Rényi’s recipe in the nonextensive case one can prepare only Tsallis entropy, while in the classical case there are two possibilities. Figure 2.4 summarizes the Rényi’s recipe for Shannon and Tsallis information measures. 42 Hartley Information q−Hartley Information KN−average KN−average Quasilinear Entropy q−Quasilinear Entropy additivity Shannon Entropy pseudo−additivity ’ Renyi Entropy Tsallis Entropy Figure 2.4: Rényi’s Recipe for Additive and Pseudo-additive Information Measures 2.5 A Characterization Theorem for Nonextensive Entropies The significance of Rényi’s formalism to generalize Shannon entropy is a characterization of the set of all additive information measures in terms of axioms of quasilinear means (Rényi, 1960). By the result, Theorem 2.4, that we presented in this chapter, one can extend this characterization to pseudo-additive (nonextensive) information measures. We emphasize here that, for such a characterization one would assume that entropy is the expectation of a function of underlying r.v. In the classical case, the function is Hartley function, while in the nonextensive case it is the q-Hartley function. Since characterization of quasilinear means is given in terms of cumulative distribution of a random variable as in (Hardy et al., 1934), we use the following definitions and notation. Let F : R → R denote the cumulative distribution function of the random variable X ∈ X. Corresponding to a KN-function ψ : R → R, the generalized mean of F (equivalently, generalized mean of X) can be written as Z −1 Eψ (F ) = Eψ (X) = hXiψ = ψ ψ dF , (2.66) which is the continuous analogue to (2.16), and is axiomized by Kolmogorov, Nagumo, de Finetti, c.f (Hardy et al., 1934, Theorem 215) as follows. 43 T HEOREM 2.5 Let FI be the set of all cumulative distribution functions defined on some interval I of the real line R. 
A functional κ : F I → R satisfies the following axioms: [KN1] κ(δx ) = x, where δx ∈ FI denotes the step function at x (Consistency with certainty) , [KN2] F, G ∈ FI , if F ≤ G then κ(F ) ≤ κ(G); the equality holds if and only if F = G (Monotonicity) and, [KN3] F, G ∈ FI , if κ(F ) = κ(G) then κ(βF + (1 − β)H) = κ(βG + (1 − β)H), for any H ∈ FI (Quasilinearity) if and only if there is a continuous strictly monotone function ψ such that Z κ(F ) = ψ −1 ψ dF . Proof of the above characterization can be found in (Hardy et al., loc. cit.). Modified axioms for the quasilinear mean can be found in (Chew, 1983; Fishburn, 1986; Ostasiewicz & Ostasiewicz, 2000). Using this characterization of the quasilinear mean, Rényi gave the following characterization for additive information measures. T HEOREM 2.6 Let X ∈ X be a random variable. An information measure defined as a (gener- alized) mean κ of Hartley function of X is either Shannon or Rényi if and only if 1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5 and, 2. If X1 , X2 ∈ X are two random variables which are independent, then κ(X1 + X2 ) = κ(X1 ) + κ(X2 ) . Further, if κ satisfies κ(Y ) + κ(−Y ) = 0 for any Y ∈ X then κ is necessarily Shannon entropy. The proof of above theorem is straight forward by using Theorem (2.3); for details see (Rényi, 1960). Now we give the following characterization theorem for nonextensive entropies. 44 T HEOREM 2.7 Let X ∈ X be a random variable. An information measure defined as a (general- ized) mean κ of q-Hartley function of X is Tsallis entropy if and only if 1. κ satisfies axioms of quasilinear means [KN1]-[KN3] given in Theorem 2.5 and, 2. If X1 , X2 ∈ X are two random variables which are independent, then κ(X1 ⊕q X2 ) = κ(X1 ) ⊕q κ(X2 ) . The above theorem is a direct consequence of Theorems 2.4 and 2.5. This characterization of Tsallis entropy only replaces the additivity constraint in the characterization of Shannon entropy given by Rényi (1960) with pseudo-additivity, which further does not make use of the postulate κ(X) + κ(−X) = 0. (This postulate is needed to distinguish Shannon entropy from Rényi entropy). This is possible because Tsallis entropy is unique by means of KN-averages and under pseudo-additivity. From the relation between Rényi and Tsallis information measures (2.54), possibly, generalized averages play a role – though not very well understood till now – in describing the operational significance of Tsallis entropy. Here, one should mention the work of Czachor and Naudts (2002), who studied the KN-average based MEprescriptions of generalized information measures (constraints with respect to which one would maximize entropy are defined in terms of quasilinear means). In this regard, results presented in this chapter have mathematical significance in the sense that they further the relation between nonextensive entropic measures and generalized averages. 45 3 Measures and Entropies: Gelfand-Yaglom-Perez Theorem Abstract R The measure-theoretic KL-entropy defined as X ln dP dR dP , where P and R are probability measures on a measurable space (X, M), plays a basic role in the definitions of classical information measures. A fundamental theorem in this respect is the Gelfand-Yaglom-Perez Theorem (Pinsker, 1960b, Theorem 2.4.2) which equips measure-theoretic KL-entropy with a fundamental definition and can be stated as, Z m ln X X dP P (Ek ) dP = sup P (Ek ) ln , dR R(Ek ) k=1 where supremum is taken over all the measurable partitions {Ek }m k=1 . 
In this chapter, we state and prove the GYP-theorem for Rényi relative-entropy of order greater than one. Consequently, the result can be easily extended to Tsallis relative-entropy. Prior to this, we develop measure-theoretic definitions of generalized information measures and discuss the maximum entropy prescriptions. Some of the results presented in this chapter can also be found in (Dukkipati, Bhatnagar, & Murty, 2006b, 2006a). Shannon’s measure of information was developed essentially for the case when the random variable takes a finite number of values. However in the literature, one often encounters an extension of Shannon entropy in the discrete case (2.1) to the case of a one-dimensional random variable with density function p in the form (e.g., Shannon & Weaver, 1949; Ash, 1965) S(p) = − Z +∞ p(x) ln p(x) dx . −∞ This entropy in the continuous case as a pure-mathematical formula (assuming convergence of the integral and absolute continuity of the density p with respect to Lebesgue measure) resembles Shannon entropy in the discrete case, but cannot be used as a measure of information for the following reasons. First, it is not a natural extension of Shannon entropy in the discrete case, since it is not the limit of the sequence of finite discrete entropies corresponding to pmfs which approximate the pdf p. Second, it is not strictly positive. 46 Inspite of these short comings, one can still use the continuous entropy functional in conjunction with the principle of maximum entropy where one wants to find a probability density function that has greater uncertainty than any other distribution satisfying a set of given constraints. Thus, one is interested in the use of continuous measure as a measure of relative and not absolute uncertainty. This is where one can relate maximization of Shannon entropy to the minimization of Kullback-Leibler relative-entropy cf. (Kapur & Kesavan, 1997, pp. 55). On the other hand, it is well known that the continuous version of KL-entropy defined for two probability density functions p and r, I(pkr) = Z +∞ p(x) ln −∞ p(x) dx , r(x) is indeed a natural generalization of the same in the discrete case. Indeed, during the early stages of development of information theory, the important paper by Gelfand, Kolmogorov, and Yaglom (1956) called attention to the case of defining entropy functional on an arbitrary measure space (X, M, µ). In this case, Shannon entropy of a probability density function p : X → R + can be written as, Z S(p) = − p(x) ln p(x) dµ(x) . X One can see from the above definition that the concept of “the entropy of a pdf” is a misnomer as there is always another measure µ in the background. In the discrete case considered by Shannon, µ is the cardinality measure 1 (Shannon & Weaver, 1949, pp. 19); in the continuous case considered by both Shannon and Wiener, µ is the Lebesgue measure cf. (Shannon & Weaver, 1949, pp. 54) and (Wiener, 1948, pp. 61, 62). All entropies are defined with respect to some measure µ, as Shannon and Wiener both emphasized in (Shannon & Weaver, 1949, pp.57, 58) and (Wiener, 1948, pp.61, 62) respectively. This case was studied independently by Kallianpur (1960) and Pinsker (1960b), and perhaps others were guided by the earlier work of Kullback and Leibler (1951), where one would define entropy in terms of Kullback-Leibler relative-entropy. 
In this respect, the Gelfand-Yaglom-Perez theorem (GYP-theorem) (Gelfand & Yaglom, 1959; Perez, 1959; Dobrushin, 1959) plays an important role as it equips measuretheoretic KL-entropy with a fundamental definition. The main contribution of this chapter is to prove GYP-theorem for Rényi relative-entropy of order α > 1, which can be extended to Tsallis relative-entropy. 1 Counting or cardinality measure µ on a measurable space (X, = 2X , is defined as µ(E) = #E, ∀E ∈ . 47 ), where X is a finite set and Before proving GYP-theorem for Rényi relative-entropy, we study the measuretheoretic definitions of generalized information measures in detail, and discuss the corresponding ME-prescriptions. We show that as in the case of relative-entropy, the measure-theoretic definitions of generalized relative-entropies, Rényi and Tsallis, are natural extensions of their respective discrete definition. We also show that MEprescriptions of measure-theoretic Tsallis entropy are consistent with that of discrete case, which is true for measure-theoretic Shannon-entropy. We review the measure-theoretic formalisms for classical information measures in § 3.1 and extend these definitions to generalized information measures in § 3.2. In § 3.3 we present the ME-prescription for Shannon entropy followed by prescriptions for Tsallis entropy in § 3.4. We revisit measure-theoretic definitions of generalized entropy functionals in § 3.5 and present some results. Finally, Gelfand-Yaglom-Perez theorem in the general case is presented in § 3.6. 3.1 Measure Theoretic Definitions of Classical Information Measures In this section, we study the non-discrete definitions of entropy and KL-entropy and present the formal definitions on the measure spaces. Rigorous studies of the Shannon and KL entropy functionals in measure spaces can be found in the papers by Ochs (1976) and Masani (1992a, 1992b). Basic measure-theoretic aspects of classical information measures can be found in books on information theory by Pinsker (1960b), Guiaşu (1977) and Gray (1990). For more details on development of mathematical information theory one can refer to excellent survey by Kotz (1966). This survey is perhaps the best available English-language guide to the Eastern European information theory literature for the period 1956-1966. One can also refer to (Cover et al., 1989) for a review on Kolmogorov’s contributions to mathematical information theory. A note on the notation. To avoid proliferation of symbols we use the same notation for the information measures in the discrete and non-discrete cases; the correspondence should be clear from the context. For example, we use S(p) to denote the entropy of a pdf p in the measure-theoretic setting too. Whenever we have to compare these quantities in different cases we use the symbols appropriately, which will be specified in the sequel. 3.1.1 Discrete to Continuous Let p : [a, b] → R+ be a probability density function, where [a, b] ⊂ R. That is, p 48 satisfies p(x) ≥ 0, ∀x ∈ [a, b] and Z b p(x) dx = 1 . a In trying to define entropy in the continuous case, the expression of Shannon entropy in the discrete case (2.1) was automatically extended to continuous case by replacing the sum in the discrete case with the corresponding integral. We obtain, in this way, Boltzmann’s H-function (also known as differential entropy in information theory), S(p) = − Z b p(x) ln p(x) dx . 
(3.1) a The “continuous entropy” given by (3.1) is not a natural extension of definition in discrete case in the sense that, it is not the limit of the finite discrete entropies corresponding to a sequence of finer partitions of the interval [a, b] whose norms tend to zero. We can show this by a counter example. I Consider a uniform probability distribution on the interval [a, b], having the probability density function p(x) = 1 , b−a x ∈ [a, b] . The continuous entropy (3.1), in this case will be S(p) = ln(b − a) . On the other hand, let us consider a finite partition of the interval [a, b] which is composed of n equal subintervals, and let us attach to this partition the finite discrete uniform probability distribution whose corresponding entropy will be, of course, Sn (p) = ln n . Obviously, if n tends to infinity, the discrete entropy S n (p) will tend to infinity too, and not to ln(b − a); therefore S(p) is not the limit of S n (p), when n tends to infinity. J Further, one can observe that ln(b − a) is negative when b − a < 1. Thus, strictly speaking, continuous entropy (3.1) cannot represent a measure of uncertainty since uncertainty should in general be positive. We are able to prove the “nice” properties only for the discrete entropy, therefore, it qualifies as a “good” measure of information (or uncertainty) supplied by a random experiment 2 . We cannot 2 One importent property that Shannon entropy exhibits in the continuous case is the entropy power inequality, which can be stated as follows. Let X and Y are continuous independent random variables with entropies S(X) and S(Y ) then we have e2S(X+Y ) ≥ e2S(X) + e2S(Y ) with equality if and only if X and Y are Gaussian variables or one of them is determenistic. The entropy power inequality is derived by Shannon (1948). Only few and partial versions of it have been proved in the discrete case. 49 extend the so called nice properties to the “continuous entropy” because it is not the limit of a suitably defined sequence of discrete entropies. Also, in physical applications, the coordinate x in (3.1) represents an abscissa, a distance from a fixed reference point. This distance x has the dimensions of length. Since the density function p(x) specifies the probabilities of an event of type [c, d) ⊂ Rd [a, b] as c p(x) dx and probabilities are dimensionless, one has to assign the dimen- sions (length)−1 to p(x). Now for 0 ≤ z < 1, one has the series expansion 1 1 − ln(1 − z) = z + z 2 + z 3 + . . . . 2 3 (3.2) It is thus necessary that the argument of the logarithmic function in (3.1) be dimensionless. Hence the formula (3.1) is then seen to be dimensionally incorrect, since the argument of the logarithm on its right hand side has the dimensions of a probability density (Smith, 2001). Although, Shannon (1948) used the formula (3.1), he did note its lack of invariance with respect to changes in the coordinate system. In the context of maximum entropy principle, Jaynes (1968) addressed this problem and suggested the formula, Z b p(x) 0 p(x) ln S (p) = − dx , m(x) a (3.3) in the place of (3.1), where m(x) is a prior function. Note that when m(x) is also a probability density function, (3.3) is nothing but the relative-entropy. However, if we choose m(x) = c, a constant (e.g., Zellner & Highfield, 1988), we get S 0 (p) = S(p) + ln c , where S(p) refers to the continuous entropy (3.1). Thus, maximization of S 0 (p) is equivalent to maximization of S(p). 
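The uniform-distribution counterexample above is easy to reproduce numerically; the following Python sketch (assuming NumPy) shows that the discrete entropies of finer and finer equal partitions of [a, b] grow like ln n, while the continuous entropy (3.1) stays at ln(b - a), which is moreover negative when b - a < 1.

```python
import numpy as np

# Uniform density on [a, b]: its continuous entropy (3.1) is ln(b - a),
# while the discrete entropies of ever finer equal partitions diverge like ln n.
a, b = 0.0, 0.5
print(np.log(b - a))                          # ln(0.5) < 0: the continuous entropy is negative here

for n in [10, 100, 1000, 10000]:
    probs = np.full(n, 1.0 / n)               # probability mass of each of the n subintervals
    print(n, -np.sum(probs * np.log(probs)))  # equals ln n; grows without bound
```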
Further discussion on estimation of probability density functions by maximum entropy method can be found in (Lazo & Rathie, 1978; Zellner & Highfield, 1988; Ryu, 1993). Prior to that, Kullback and Leibler (1951) too suggested that in the measuretheoretic definition of entropy, instead of examining the entropy corresponding only to the given measure, we have to compare the entropy inside a whole class of measures. 3.1.2 Classical Information Measures Let (X, M, µ) be a measure space, where µ need not be a probability measure unless otherwise specified. Symbols P , R will denote probability measures on measurable 50 space (X, M) and p, r denote M-measurable functions on X. An M-measurable funcR tion p : X → R+ is said to be a probability density function (pdf) if X p(x) dµ(x) = R 1 or X p dµ = 1 (henceforth, the argument x will be omitted in the integrals if this does not cause ambiguity). In this general setting, Shannon entropy S(p) of pdf p is defined as follows (Athreya, 1994). D EFINITION 3.1 Let (X, M, µ) be a measure space and the M-measurable function p : X → R + be a pdf. Then, Shannon entropy of p is defined as Z p ln p dµ , S(p) = − (3.4) X provided the integral on right exists. Entropy functional S(p) defined in (3.4) can be referred to as entropy of the probability measure P that is induced by p, that is defined according to Z p(x) dµ(x) , ∀E ∈ M . P (E) = (3.5) E This reference is consistent3 because the probability measure P can be identified a.e by the pdf p. Further, the definition of the probability measure P in (3.5), allows us to write entropy functional (3.4) as, Z dP dP S(p) = − ln dµ , dµ X dµ (3.6) since (3.5) implies4 P µ, and pdf p is the Radon-Nikodym derivative of P w.r.t µ. Now we proceed to the definition of Kullback-Leibler relative-entropy or KLentropy for probability measures. 3 Say p and r be two pdfs and P and R be the corresponding induced measures on measurable space (X, ) such that P and R are identical, i.e., E p dµ = E r dµ, ∀E ∈ . Then we have p = r, µ a.e, and hence − X p ln p dµ = − X r ln r dµ. 4 If a nonnegative measurable function f induces a measure ν on measurable space (X, ) with respect to a measure µ, defined as ν(E) = E f dµ, ∀E ∈ then ν µ. The converse of this result is given by Radon-Nikodym theorem (Kantorovitz, 2003, pp.36, Theorem 1.40(b)). 51 D EFINITION 3.2 Let (X, M) be a measurable space. Let P and R be two probability measures on (X, M). Kullback-Leibler relative-entropy KL-entropy of P relative to R is defined as Z dP dP ln dR X I(P kR) = +∞ if P R , (3.7) otherwise. The divergence inequality I(P kR) ≥ 0 and I(P kR) = 0 if and only if P = R can be shown in this case too. KL-entropy (3.7) also can be written as Z dP dP ln dR . I(P kR) = dR dR X (3.8) Let the σ-finite measure µ on (X, M) be such that P R µ. Since µ is σ-finite, from Radon-Nikodym theorem, there exist non-negative M-measurable functions p : X → R+ and r : X → R+ unique µ-a.e, such that Z p dµ , ∀E ∈ M , P (E) = (3.9a) E and R(E) = Z E r dµ , ∀E ∈ M . (3.9b) The pdfs p and r in (3.9a) and (3.9b) (they are indeed pdfs) are Radon-Nikodym derivatives of probability measures P and R with respect to µ, respectively, i.e., p = r= D EFINITION 3.3 dR dµ . dP dµ and Now one can define relative-entropy of pdf p w.r.t r as follows 5 . Let (X, M, µ) be a measure space. Let M-measurable functions p, r : X → R + be two pdfs. The KL-entropy of p relative to r is defined as Z p(x) dµ(x) , p(x) ln I(pkr) = r(x) X (3.10) provided the integral on right exists. 
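As a numerical illustration of Definition 3.3, the following Python sketch (assuming NumPy; the density and grid choices are ours) approximates the KL-entropy (3.10) of one Gaussian density relative to another with respect to Lebesgue measure and compares it with the standard closed-form expression for two Gaussians, used here only as a sanity check.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# KL-entropy (3.10) of p relative to r, approximated on a fine grid
# (Lebesgue reference measure on the real line)
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]
mu_p, sigma_p = 0.0, 1.0
mu_r, sigma_r = 1.0, 2.0
p = gaussian_pdf(x, mu_p, sigma_p)
r = gaussian_pdf(x, mu_r, sigma_r)
numeric = np.sum(p * np.log(p / r)) * dx

# Standard closed form for KL between two Gaussians, used only as a sanity check
closed = (np.log(sigma_r / sigma_p)
          + (sigma_p ** 2 + (mu_p - mu_r) ** 2) / (2.0 * sigma_r ** 2) - 0.5)
print(numeric, closed, np.isclose(numeric, closed, atol=1e-4))
```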
As we have mentioned earlier, the KL-entropy (3.10) exists if the two densities are absolutely continuous with respect to one another. On the real line, the same definition can be written with respect to the Lebesgue measure as
$$ I(p\|r) = \int p(x) \ln \frac{p(x)}{r(x)} \, dx , $$
which exists if the densities $p(x)$ and $r(x)$ have the same support. In the sequel we use the conventions
$$ \ln 0 = -\infty , \qquad \ln \frac{a}{0} = +\infty \ \text{for any } a \in \mathbb{R},\ a > 0 , \qquad 0 \cdot (\pm\infty) = 0 . \tag{3.11} $$

⁵ This follows from the chain rule for Radon-Nikodym derivatives: $\frac{dP}{dR} \overset{\text{a.e.}}{=} \frac{dP}{d\mu} \left( \frac{dR}{d\mu} \right)^{-1}$.

Now we turn to the definition of the entropy functional on a measure space. The entropy functional in (3.6) is defined for a probability measure that is induced by a pdf. By the Radon-Nikodym theorem, one can define Shannon entropy for any arbitrary $\mu$-continuous probability measure as follows.

DEFINITION 3.4  Let $(X, \mathcal{M}, \mu)$ be a $\sigma$-finite measure space. The entropy of any $\mu$-continuous probability measure $P$ ($P \ll \mu$) is defined as
$$ S(P) = -\int_X \ln \frac{dP}{d\mu} \, dP . \tag{3.12} $$

The entropy functional (3.12) is known as the Baron-Jauch entropy or generalized Boltzmann-Gibbs-Shannon entropy (Wehrl, 1991). Properties of the entropy of a probability measure in Definition 3.4 are studied in detail by Ochs (1976). In the literature one finds notation of the form $S(P|\mu)$ for the entropy functional in (3.12), i.e., the entropy of a probability measure, to stress the role of the measure $\mu$ (e.g., Ochs, 1976; Athreya, 1994). Since all the information measures we define are with respect to the measure $\mu$ on $(X, \mathcal{M})$, we omit $\mu$ in the entropy functional notation.

By assuming that $\mu$ is a probability measure in Definition 3.4, one can relate Shannon entropy with Kullback-Leibler entropy as
$$ S(P) = -I(P\|\mu) . \tag{3.13} $$
Note that when $\mu$ is not a probability measure, the divergence inequality $I(P\|\mu) \geq 0$ need not be satisfied.

A note on the $\sigma$-finiteness of the measure $\mu$ in the definition of the entropy functional: in the definition we assumed that $\mu$ is a $\sigma$-finite measure. This condition was used by Ochs (1976), Csiszár (1969) and Rosenblatt-Roth (1964) to tailor the measure-theoretic definitions. For all practical purposes and for most applications this assumption is satisfied (see (Ochs, 1976) for a discussion of the physical interpretation of a measurable space $(X, \mathcal{M})$ with $\sigma$-finite measure $\mu$ for an entropy measure of the form (3.12), and of the relaxation of the $\sigma$-finiteness condition). More universal definitions of entropy functionals, obtained by relaxing the $\sigma$-finiteness condition, are studied by Masani (1992a, 1992b).

3.1.3 Interpretation of Discrete and Continuous Entropies in terms of KL-entropy

First, let us consider the discrete case of $(X, \mathcal{M}, \mu)$, where $X = \{x_1, \ldots, x_n\}$ and $\mathcal{M} = 2^X$ is the power set of $X$. Let $P$ and $\mu$ be any probability measures on $(X, \mathcal{M})$. Then $\mu$ and $P$ can be specified as follows:
$$ \mu : \quad \mu_k = \mu(\{x_k\}) \geq 0, \quad k = 1, \ldots, n, \quad \sum_{k=1}^n \mu_k = 1 , \tag{3.14a} $$
and
$$ P : \quad P_k = P(\{x_k\}) \geq 0, \quad k = 1, \ldots, n, \quad \sum_{k=1}^n P_k = 1 . \tag{3.14b} $$
The probability measure $P$ is absolutely continuous with respect to the probability measure $\mu$ if $\mu_k = 0$ for some $k \in \{1, \ldots, n\}$ implies $P_k = 0$ as well. The corresponding Radon-Nikodym derivative of $P$ with respect to $\mu$ is given by
$$ \frac{dP}{d\mu}(x_k) = \frac{P_k}{\mu_k} , \quad k = 1, \ldots, n . $$
The measure-theoretic entropy $S(P)$ (3.12) in this case can be written as
$$ S(P) = -\sum_{k=1}^n P_k \ln \frac{P_k}{\mu_k} = \sum_{k=1}^n P_k \ln \mu_k - \sum_{k=1}^n P_k \ln P_k . $$
If we take the reference probability measure $\mu$ to be the uniform probability distribution on the set $X$, i.e., $\mu_k = \frac{1}{n}$, $k = 1, \ldots, n$, we obtain
$$ S(P) = S_n(P) - \ln n , \tag{3.15} $$
where $S_n(P)$ denotes the Shannon entropy (2.1) of the pmf $P = (P_1, \ldots, P_n)$, and $S(P)$ denotes the measure-theoretic entropy (3.12) reduced to the discrete case, with the probability measures $\mu$ and $P$ specified as in (3.14a) and (3.14b) respectively.

Now let us consider the continuous case of $(X, \mathcal{M}, \mu)$, where $X = [a, b] \subset \mathbb{R}$ and $\mathcal{M}$ is the $\sigma$-algebra of Lebesgue measurable subsets of $[a, b]$. In this case $\mu$ and $P$ can be specified as follows:
$$ \mu : \quad \mu(x) \geq 0, \ x \in [a, b], \ \text{such that} \ \mu(E) = \int_E \mu(x)\,dx, \ \forall E \in \mathcal{M}, \quad \int_a^b \mu(x)\,dx = 1 , \tag{3.16a} $$
and
$$ P : \quad P(x) \geq 0, \ x \in [a, b], \ \text{such that} \ P(E) = \int_E P(x)\,dx, \ \forall E \in \mathcal{M}, \quad \int_a^b P(x)\,dx = 1 . \tag{3.16b} $$
Note the abuse of notation in the above specification of the probability measures $\mu$ and $P$, where we have used the same symbols for both the measures and the pdfs; this is done to keep the notation consistent with the discrete case analysis given above. The probability measure $P$ is absolutely continuous with respect to the probability measure $\mu$ if $\mu(x) = 0$ on a set of positive Lebesgue measure implies that $P(x) = 0$ on the same set. The Radon-Nikodym derivative of $P$ with respect to $\mu$ is
$$ \frac{dP}{d\mu}(x) = \frac{P(x)}{\mu(x)} . $$
We emphasize that this relation can only be understood with the above (abuse of) notation in mind. The measure-theoretic entropy $S(P)$ in this case can be written as
$$ S(P) = -\int_a^b P(x) \ln \frac{P(x)}{\mu(x)} \, dx . $$
If we take the reference probability measure $\mu$ to be the uniform distribution, i.e., $\mu(x) = \frac{1}{b-a}$, $x \in [a, b]$, we obtain
$$ S(P) = S_{[a,b]}(P) - \ln(b - a) , \tag{3.17} $$
where $S_{[a,b]}(P)$ denotes the Shannon entropy (3.1) of the pdf $P(x)$, and $S(P)$ denotes the measure-theoretic entropy (3.12) reduced to the continuous case, with the probability measures $\mu$ and $P$ specified as in (3.16a) and (3.16b) respectively.

Hence, one can conclude that the measure-theoretic entropy $S(P)$, defined for a probability measure $P$ on the measure space $(X, \mathcal{M}, \mu)$, equals the Shannon entropy in both the discrete and the continuous case up to an additive constant, when the reference measure $\mu$ is chosen as a uniform probability distribution. On the other hand, one can see that the measure-theoretic KL-entropy reduces, in the discrete and continuous cases, to its respective discrete and continuous definitions.

Further, from (3.13) and (3.15) we can write Shannon entropy in terms of Kullback-Leibler relative-entropy as
$$ S_n(P) = \ln n - I(P\|\mu) . \tag{3.18} $$
Thus, Shannon entropy appears as being (up to an additive constant) the variation of information when we pass from the initial uniform probability distribution to a new probability distribution given by $P_k \geq 0$, $\sum_{k=1}^n P_k = 1$, as any such probability distribution is obviously absolutely continuous with respect to the uniform discrete probability distribution. Similarly, from (3.13) and (3.17) the relation between Shannon entropy and relative-entropy in the continuous case can be obtained, and we can write the Boltzmann H-function in terms of relative-entropy as
$$ S_{[a,b]}(p) = \ln(b - a) - I(P\|\mu) . \tag{3.19} $$
Therefore, the continuous entropy or Boltzmann H-function $S(p)$ may be interpreted as being (up to an additive constant) the variation of information when we pass from the initial uniform probability distribution on the interval $[a, b]$ to the new probability measure defined by the probability density function $p(x)$ (any such probability measure is absolutely continuous with respect to the uniform probability distribution on the interval $[a, b]$).
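The relation (3.18) is easy to check numerically; the following sketch (not part of the thesis) does so for an arbitrarily chosen pmf.

```python
import numpy as np

# Discrete check of (3.15)/(3.18): for a pmf P and the uniform reference measure
# on n points, the Shannon entropy equals ln(n) minus the KL-entropy I(P||uniform).
P = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])   # an arbitrary pmf
n = len(P)
u = np.full(n, 1.0 / n)

S_n = -np.sum(P * np.log(P))
I_Pu = np.sum(P * np.log(P / u))
print(S_n, np.log(n) - I_Pu)    # identical up to floating-point error
```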
From the above discussion one can see that KL-entropy equips one with a unitary interpretation of both the discrete and the continuous entropy. One can utilize Shannon entropy in the continuous case as well as in the discrete case, both being interpreted as the variation of information when we pass from an initial uniform distribution to the corresponding probability measure. Also, since the measure-theoretic entropy equals the discrete and continuous entropies up to an additive constant, ME-prescriptions of measure-theoretic Shannon entropy are consistent with both the discrete and continuous cases.

3.2 Measure-Theoretic Definitions of Generalized Information Measures

In this section we extend the measure-theoretic definitions to the generalized information measures discussed in Chapter 2. We begin with a brief note on the notation and assumptions used.

We define all the information measures on the measurable space $(X, \mathcal{M})$; the default reference measure is $\mu$ unless otherwise stated. For simplicity in exposition, we will not distinguish between functions differing only on a $\mu$-null set; nevertheless, we can work with equations between $\mathcal{M}$-measurable functions on $X$ if they are stated as being valid only $\mu$-almost everywhere ($\mu$-a.e. or a.e.). Further, we assume that all the quantities of interest exist, and we also assume, implicitly, the $\sigma$-finiteness of $\mu$ and the $\mu$-continuity of probability measures whenever required. Since these assumptions occur repeatedly in various definitions and formulations, they will not be mentioned in the sequel. With these assumptions we do not distinguish between an information measure of a pdf $p$ and that of the corresponding probability measure $P$; hence, when we give definitions of information measures for pdfs, we also use the corresponding definitions for probability measures wherever convenient or required, with the understanding that $P(E) = \int_E p\,d\mu$, the converse holding as a result of the Radon-Nikodym theorem with $p = \frac{dP}{d\mu}$. In both cases we have $P \ll \mu$.

With this notation we move on to the measure-theoretic definitions of generalized information measures. First we consider the Rényi generalizations. The measure-theoretic definition of Rényi entropy is as follows.

DEFINITION 3.5  The Rényi entropy of a pdf $p : X \to \mathbb{R}^+$ on a measure space $(X, \mathcal{M}, \mu)$ is defined as
$$ S_\alpha(p) = \frac{1}{1-\alpha} \ln \int_X p(x)^\alpha \, d\mu(x) , \tag{3.20} $$
provided the integral on the right exists and $\alpha \in \mathbb{R}$, $\alpha > 0$. The same can also be defined for any $\mu$-continuous probability measure $P$ as
$$ S_\alpha(P) = \frac{1}{1-\alpha} \ln \int_X \left( \frac{dP}{d\mu} \right)^{\alpha-1} dP . \tag{3.21} $$

On the other hand, Rényi relative-entropy can be defined as follows.

DEFINITION 3.6  Let $p, r : X \to \mathbb{R}^+$ be two pdfs on a measure space $(X, \mathcal{M}, \mu)$. The Rényi relative-entropy of $p$ relative to $r$ is defined as
$$ I_\alpha(p\|r) = \frac{1}{\alpha-1} \ln \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \tag{3.22} $$
provided the integral on the right exists and $\alpha \in \mathbb{R}$, $\alpha > 0$. The same can be written in terms of probability measures as
$$ I_\alpha(P\|R) = \frac{1}{\alpha-1} \ln \int_X \left( \frac{dP}{dR} \right)^{\alpha-1} dP = \frac{1}{\alpha-1} \ln \int_X \left( \frac{dP}{dR} \right)^{\alpha} dR , \tag{3.23} $$
whenever $P \ll R$; $I_\alpha(P\|R) = +\infty$ otherwise. Further, if we assume that $\mu$ in (3.21) is a probability measure, then
$$ S_\alpha(P) = -I_\alpha(P\|\mu) . \tag{3.24} $$
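The following sketch (not part of the thesis) checks the relation (3.24) in the discrete case, taking $\mu$ to be the uniform pmf on a finite set; the pmf and the value of $\alpha$ are arbitrary choices.

```python
import numpy as np

# Discrete sanity check of (3.24): with the reference measure mu taken to be the
# uniform pmf on n points, the measure-theoretic Renyi entropy of P equals minus
# the Renyi relative-entropy of P with respect to mu.
P = np.array([0.4, 0.3, 0.2, 0.1])        # an arbitrary pmf
n = len(P)
mu = np.full(n, 1.0 / n)
alpha = 2.5

density = P / mu                           # dP/dmu on each atom
S_alpha = np.log(np.sum(density ** alpha * mu)) / (1.0 - alpha)
I_alpha = np.log(np.sum(P ** alpha / mu ** (alpha - 1))) / (alpha - 1.0)
print(S_alpha, -I_alpha)                   # equal up to floating-point error
```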
The Tsallis entropy in the measure-theoretic setting can be defined as follows.

DEFINITION 3.7  The Tsallis entropy of a pdf $p$ on $(X, \mathcal{M}, \mu)$ is defined as
$$ S_q(p) = \int_X p(x) \ln_q \frac{1}{p(x)} \, d\mu(x) = \frac{1 - \int_X p(x)^q \, d\mu(x)}{q - 1} , \tag{3.25} $$
provided the integral on the right exists and $q \in \mathbb{R}$, $q > 0$. The q-logarithm $\ln_q$ is defined as in (2.41). The same can be defined for a $\mu$-continuous probability measure $P$, and can be written as
$$ S_q(P) = \int_X \ln_q \left( \frac{dP}{d\mu} \right)^{-1} dP . \tag{3.26} $$

The definition of Tsallis relative-entropy is given below.

DEFINITION 3.8  Let $(X, \mathcal{M}, \mu)$ be a measure space and let the $\mathcal{M}$-measurable functions $p, r : X \to \mathbb{R}^+$ be two probability density functions. The Tsallis relative-entropy of $p$ relative to $r$ is defined as
$$ I_q(p\|r) = -\int_X p(x) \ln_q \frac{r(x)}{p(x)} \, d\mu(x) = \frac{\int_X \frac{p(x)^q}{r(x)^{q-1}} \, d\mu(x) - 1}{q - 1} , \tag{3.27} $$
provided the integral on the right exists and $q \in \mathbb{R}$, $q > 0$. The same can be written for two probability measures $P$ and $R$ as
$$ I_q(P\|R) = -\int_X \ln_q \left( \frac{dP}{dR} \right)^{-1} dP , \tag{3.28} $$
whenever $P \ll R$; $I_q(P\|R) = +\infty$ otherwise. If $\mu$ in (3.26) is a probability measure, then
$$ S_q(P) = -I_q(P\|\mu) . \tag{3.29} $$

We shall revisit these measure-theoretic definitions in § 3.5.

3.3 Maximum Entropy and Canonical Distributions

For all the ME-prescriptions of classical information measures we consider the set of constraints of the form
$$ \int_X u_m \, dP = \int_X u_m(x) p(x) \, d\mu(x) = \langle u_m \rangle , \quad m = 1, \ldots, M , \tag{3.30} $$
with respect to $\mathcal{M}$-measurable functions $u_m : X \to \mathbb{R}$, $m = 1, \ldots, M$, whose expectation values $\langle u_m \rangle$, $m = 1, \ldots, M$, are (assumed to be) a priori known, along with the normalizing constraint $\int_X dP = 1$. (From now on we assume that any set of constraints on probability distributions implicitly includes this constraint, which will therefore not be mentioned in the sequel.)

To maximize the entropy (3.4) with respect to the constraints (3.30), the solution is calculated via the Lagrangian
$$ L(x, \lambda, \beta) = -\int_X \ln \frac{dP}{d\mu}(x) \, dP(x) - \lambda \left( \int_X dP(x) - 1 \right) - \sum_{m=1}^M \beta_m \left( \int_X u_m(x) \, dP(x) - \langle u_m \rangle \right) , \tag{3.31} $$
where $\lambda$ and $\beta_m$, $m = 1, \ldots, M$, are Lagrange parameters (we use the notation $\beta = (\beta_1, \ldots, \beta_M)$). The solution satisfies
$$ \ln \frac{dP}{d\mu}(x) + \lambda + \sum_{m=1}^M \beta_m u_m(x) = 0 , $$
which gives
$$ dP(x) = \exp\!\left( -\ln Z(\beta) - \sum_{m=1}^M \beta_m u_m(x) \right) d\mu(x) \tag{3.32} $$
or
$$ \frac{dP}{d\mu}(x) = p(x) = \frac{e^{-\sum_{m=1}^M \beta_m u_m(x)}}{Z(\beta)} , \tag{3.33} $$
where the partition function $Z(\beta)$ is written as
$$ Z(\beta) = \int_X \exp\!\left( -\sum_{m=1}^M \beta_m u_m(x) \right) d\mu(x) . \tag{3.34} $$
The Lagrange parameters $\beta_m$, $m = 1, \ldots, M$, are specified by the set of constraints (3.30). The maximum entropy, denoted by $S$, can be calculated as
$$ S = \ln Z + \sum_{m=1}^M \beta_m \langle u_m \rangle . \tag{3.35} $$
The Lagrange parameters $\beta_m$, $m = 1, \ldots, M$, are calculated by searching for the unique solution (if it exists) of the following system of nonlinear equations:
$$ \frac{\partial}{\partial \beta_m} \ln Z(\beta) = -\langle u_m \rangle , \quad m = 1, \ldots, M . \tag{3.36} $$
We also have
$$ \frac{\partial S}{\partial \langle u_m \rangle} = \beta_m , \quad m = 1, \ldots, M . \tag{3.37} $$
Equations (3.36) and (3.37) are referred to as the thermodynamic equations.
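The following sketch (not part of the thesis) illustrates the classical ME-prescription (3.33) and the thermodynamic equation (3.36) on a finite set with a single mean constraint; the set, the constraint function and the target mean are arbitrary illustrative choices, and $\beta$ is found by simple bisection.

```python
import numpy as np

# Maximize Shannon entropy on X = {1,...,6} subject to a prescribed mean <u>.
x = np.arange(1, 7, dtype=float)
u = x
target = 4.5                              # illustrative target value of <u>

def mean_under(beta):
    w = np.exp(-beta * u)
    p = w / w.sum()                       # p(x) = e^{-beta u(x)} / Z(beta)
    return np.sum(p * u)

# Solve the thermodynamic equation d(ln Z)/d(beta) = -<u> by bisection on beta;
# the mean under p is a decreasing function of beta.
lo, hi = -10.0, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_under(mid) > target:
        lo = mid
    else:
        hi = mid
beta = 0.5 * (lo + hi)
w = np.exp(-beta * u)
p = w / w.sum()
print("beta =", beta)
print("maximum entropy distribution:", np.round(p, 4), " mean =", np.sum(p * u))
```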
3.4 ME-prescription for Tsallis Entropy

As mentioned earlier, the great success of Tsallis entropy is attributed to the power-law distributions that result from its ME-prescriptions. But there are subtleties involved in the choice of constraints one uses in the ME-prescriptions of these entropy functionals, and the question of which constraints should be used is still a subject of major discussion in the nonextensive formalism (Ferri et al., 2005; Abe & Bagci, 2005; Wada & Scarfone, 2005).

In the nonextensive formalism, maximum entropy distributions are derived with respect to constraints that differ from (3.30), since constraints of the form (3.30) are inadequate for handling the serious mathematical difficulties that result, for instance unwanted divergences, cf. (Tsallis, 1988). To handle these difficulties, constraints of the form
$$ \int_X u_m(x)\, p(x)^q \, d\mu(x) = \langle u_m \rangle_q , \quad m = 1, \ldots, M , \tag{3.38} $$
were proposed by Curado and Tsallis (1991). Averages of the form $\langle u_m \rangle_q$ are referred to as q-expectations.

3.4.1 Tsallis Maximum Entropy Distribution

To calculate the maximum Tsallis entropy distribution with respect to the constraints (3.38), the Lagrangian can be written as
$$ L(x, \lambda, \beta) = \int_X \ln_q \frac{1}{p(x)} \, dP(x) - \lambda \left( \int_X dP(x) - 1 \right) - \sum_{m=1}^M \beta_m \left( \int_X p(x)^{q-1} u_m(x) \, dP(x) - \langle u_m \rangle_q \right) . \tag{3.39} $$
The solution satisfies
$$ \ln_q \frac{1}{p(x)} - \lambda - \sum_{m=1}^M \beta_m u_m(x)\, p(x)^{q-1} = 0 . \tag{3.40} $$
By the definition of the q-logarithm (2.41), (3.40) can be rearranged as
$$ p(x) = \frac{\left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\left( \lambda(1-q) + 1 \right)^{\frac{1}{1-q}}} . \tag{3.41} $$
The denominator in (3.41) can be calculated using the normalizing constraint $\int_X dP = 1$. Finally, the Tsallis maximum entropy distribution can be written as
$$ p(x) = \frac{\left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{Z_q} , \tag{3.42} $$
where the partition function is
$$ Z_q = \int_X \left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}} d\mu(x) . \tag{3.43} $$
The Tsallis maximum entropy distribution (3.42) can be expressed in terms of the q-exponential function (2.42) as
$$ p(x) = \frac{e_q^{-\sum_{m=1}^M \beta_m u_m(x)}}{Z_q} . \tag{3.44} $$

Note that in order to guarantee that the pdf $p$ in (3.42) is a non-negative real number for every $x \in X$, it is necessary to supplement it with an appropriate prescription for treating negative values of the quantity $\left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]$. That is, we need a prescription for the value of $p(x)$ when
$$ \left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right] < 0 . \tag{3.45} $$
The simplest possible prescription, and the one usually adopted, is to set $p(x) = 0$ whenever the inequality (3.45) holds (Tsallis, 1988; Curado & Tsallis, 1991). This rule is known as the Tsallis cut-off condition. Simple extensions of the Tsallis cut-off condition are proposed in (Teweldeberhan et al., 2005) by defining an alternative q-exponential function. In this thesis we consider only the usual Tsallis cut-off condition mentioned above. Note that by expressing the Tsallis maximum entropy distribution (3.42) in terms of the q-exponential function, as in (3.44), we have assumed the Tsallis cut-off condition implicitly. In summary, when we refer to the Tsallis maximum entropy distribution we mean the following:
$$ p(x) = \begin{cases} \dfrac{e_q^{-\sum_{m=1}^M \beta_m u_m(x)}}{Z_q} & \text{if } \left[ 1 - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right] > 0 , \\[2ex] 0 & \text{otherwise.} \end{cases} \tag{3.46} $$
The maximum Tsallis entropy can be calculated as (Curado & Tsallis, 1991)
$$ S_q = \ln_q Z_q + \sum_{m=1}^M \beta_m \langle u_m \rangle_q . \tag{3.47} $$
The corresponding thermodynamic equations are as follows (Curado & Tsallis, 1991):
$$ \frac{\partial}{\partial \beta_m} \ln_q Z_q = -\langle u_m \rangle_q , \quad m = 1, \ldots, M , \tag{3.48} $$
$$ \frac{\partial S_q}{\partial \langle u_m \rangle_q} = \beta_m , \quad m = 1, \ldots, M . \tag{3.49} $$
It is interesting to compare these equations with their classical counterparts, (3.36) and (3.37), to see the consistency of the generalization. Here we mention that some important mathematical properties of the nonextensive maximum entropy distribution (3.42) for $q = \frac{1}{2}$ have been studied and reported by Rebollo-Neira (2001), with applications to data subset selection. One can refer to (Vignat, Hero, & Costa, 2004) for a study of Tsallis maximum entropy distributions in the multivariate case.
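The following sketch (not part of the thesis) implements the q-exponential together with the cut-off prescription (3.46) and illustrates that it recovers the ordinary exponential as $q \to 1$; the values of $q$ and the grid are arbitrary.

```python
import numpy as np

# The q-exponential with the Tsallis cut-off: e_q^x is set to 0 wherever
# 1 + (1-q)x < 0.  As q -> 1 it recovers the ordinary exponential.
def exp_q(x, q):
    if abs(q - 1.0) < 1e-12:
        return np.exp(x)
    base = np.maximum(1.0 + (1.0 - q) * x, 0.0)   # cut-off: negative bracket -> 0
    return base ** (1.0 / (1.0 - q))

x = np.linspace(-4.0, 1.0, 6)
print(exp_q(x, 0.5))        # entries with 1 + 0.5*x < 0 are cut off to zero
print(exp_q(x, 1.000001))   # practically identical to np.exp(x)
print(np.exp(x))
```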
3.4.2 The Case of Normalized q-expectation values

Constraints of the form (3.38) had been used for some time in the nonextensive ME-prescriptions, but because of problems in justifying them on physical grounds (for example, the q-expectation of a constant need not be that constant, and hence q-expectations are not expectations in the true sense), constraints of the following form were proposed in (Tsallis, Mendes, & Plastino, 1998):
$$ \frac{\int_X u_m(x)\, p(x)^q \, d\mu(x)}{\int_X p(x)^q \, d\mu(x)} = \langle\langle u_m \rangle\rangle_q , \quad m = 1, \ldots, M . \tag{3.50} $$
Here $\langle\langle u_m \rangle\rangle_q$ can be considered as the expectation of $u_m$ with respect to the modified probability measure $P_{(q)}$ (it is indeed a probability measure) defined as
$$ P_{(q)}(E) = \left( \int_X p(x)^q \, d\mu(x) \right)^{-1} \int_E p(x)^q \, d\mu(x) , \quad \forall E \in \mathcal{M} . \tag{3.51} $$
The modified probability measure $P_{(q)}$ is known as the escort probability measure (Tsallis et al., 1998).

Now, the variational principle for Tsallis entropy maximization with respect to the constraints (3.50) can be written as
$$ L(x, \lambda, \beta) = \int_X \ln_q \frac{1}{p(x)} \, dP(x) - \lambda \left( \int_X dP(x) - 1 \right) - \sum_{m=1}^M \beta_m^{(q)} \int_X p(x)^{q-1} \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right) dP(x) , \tag{3.52} $$
where the parameters $\beta_m^{(q)}$ are defined in terms of the true Lagrange parameters $\beta_m$ as
$$ \beta_m^{(q)} = \frac{\beta_m}{\int_X p(x)^q \, d\mu(x)} , \quad m = 1, \ldots, M . \tag{3.53} $$
The maximum entropy distribution in this case turns out to be
$$ p(x) = \frac{1}{Z_q} \left[ 1 - (1-q) \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right]^{\frac{1}{1-q}} . \tag{3.54} $$
This can be written using q-exponential functions as
$$ p(x) = \frac{1}{Z_q} \exp_q \left( - \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right) , \tag{3.55} $$
where
$$ Z_q = \int_X \exp_q \left( - \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right) d\mu(x) . \tag{3.56} $$
The maximum Tsallis entropy $S_q$ in this case satisfies
$$ S_q = \ln_q Z_q , \tag{3.57} $$
while the corresponding thermodynamic equations are
$$ \frac{\partial}{\partial \beta_m} \ln_q \bar{Z}_q = -\langle\langle u_m \rangle\rangle_q , \quad m = 1, \ldots, M , \tag{3.58} $$
$$ \frac{\partial S_q}{\partial \langle\langle u_m \rangle\rangle_q} = \beta_m , \quad m = 1, \ldots, M , \tag{3.59} $$
where
$$ \ln_q \bar{Z}_q = \ln_q Z_q - \sum_{m=1}^M \beta_m \langle\langle u_m \rangle\rangle_q . \tag{3.60} $$

3.5 Measure-Theoretic Definitions Revisited

It is well known that, unlike Shannon entropy, Kullback-Leibler relative-entropy in the discrete case can be extended naturally to the measure-theoretic case by a simple limiting process, cf. (Topsøe, 2001, Theorem 5.2). In this section we show that this fact holds for generalized relative-entropies too. Rényi relative-entropy on the continuous valued space $\mathbb{R}$ and its equivalence with the discrete case is studied by Rényi (1960) and by Jizba and Arimitsu (2004b). Here we present the result in the measure-theoretic case and conclude that the measure-theoretic definitions of both Tsallis and Rényi relative-entropies are equivalent to their respective discrete entities. We also present a result pertaining to ME of measure-theoretic Tsallis entropy: we prove that ME of Tsallis entropy in the measure-theoretic case is consistent with the discrete case.

3.5.1 On Measure-Theoretic Definitions of Generalized Relative-Entropies

Here we show that generalized relative-entropies in the discrete case can be naturally extended to the measure-theoretic case, in the sense that the measure-theoretic definitions can be obtained as limits of sequences of finite discrete entropies of pmfs which approximate the pdfs involved. We refer to any such sequence of pmfs as an "approximating sequence of pmfs of a pdf". To formalize these aspects we need the following lemma.

LEMMA 3.1  Let $p$ be a pdf defined on a measure space $(X, \mathcal{M}, \mu)$. Then there exists a sequence of simple functions $\{f_n\}$ (the approximating sequence of simple functions of $p$) such that $\lim_{n \to \infty} f_n = p$ and each $f_n$ can be written as
$$ f_n(x) = \frac{1}{\mu(E_{n,k})} \int_{E_{n,k}} p \, d\mu , \quad \forall x \in E_{n,k}, \ k = 1, \ldots, m(n) , \tag{3.61} $$
where $\{E_{n,k}\}_{k=1}^{m(n)}$ is the measurable partition of $X$ corresponding to $f_n$ (the notation $m(n)$ indicates that $m$ varies with $n$). Further, each $f_n$ satisfies
$$ \int_X f_n \, d\mu = 1 . \tag{3.62} $$

Proof  Define a sequence of simple functions $\{f_n\}$ as
$$ f_n(x) = \begin{cases} \dfrac{1}{\mu\big(p^{-1}([\frac{k}{2^n}, \frac{k+1}{2^n}))\big)} \displaystyle\int_{p^{-1}([\frac{k}{2^n}, \frac{k+1}{2^n}))} p \, d\mu , & \text{if } \frac{k}{2^n} \leq p(x) < \frac{k+1}{2^n}, \ k = 0, 1, \ldots, n2^n - 1 , \\[2ex] \dfrac{1}{\mu\big(p^{-1}([n, \infty))\big)} \displaystyle\int_{p^{-1}([n, \infty))} p \, d\mu , & \text{if } n \leq p(x) . \end{cases} \tag{3.63} $$
Each $f_n$ is indeed a simple function and can be written as
$$ f_n = \sum_{k=0}^{n2^n - 1} \left( \frac{1}{\mu(E_{n,k})} \int_{E_{n,k}} p \, d\mu \right) \chi_{E_{n,k}} + \left( \frac{1}{\mu(F_n)} \int_{F_n} p \, d\mu \right) \chi_{F_n} , \tag{3.64} $$
where $E_{n,k} = p^{-1}\big( [\frac{k}{2^n}, \frac{k+1}{2^n}) \big)$, $k = 0, \ldots, n2^n - 1$, and $F_n = p^{-1}([n, \infty))$. Also, for any measurable set $E \in \mathcal{M}$, $\chi_E : X \to \{0, 1\}$ denotes its indicator or characteristic function. Note that $\{E_{n,0}, \ldots, E_{n,n2^n - 1}, F_n\}$ is indeed a measurable partition of $X$ for any $n$. Since $\int_E p \, d\mu < \infty$ for any $E \in \mathcal{M}$, we have $\int_{E_{n,k}} p \, d\mu = 0$ whenever $\mu(E_{n,k}) = 0$, for $k = 0, \ldots, n2^n - 1$; similarly, $\int_{F_n} p \, d\mu = 0$ whenever $\mu(F_n) = 0$.

Now we show that $\lim_{n \to \infty} f_n = p$ pointwise. Since $p$ is a pdf, we have $p(x) < \infty$. Then there exists $n \in \mathbb{Z}^+$ such that $p(x) \leq n$. Also, there exists $k \in \mathbb{Z}^+$, $0 \leq k \leq n2^n - 1$, such that $\frac{k}{2^n} \leq p(x) < \frac{k+1}{2^n}$ and $\frac{k}{2^n} \leq f_n(x) \leq \frac{k+1}{2^n}$. This implies $0 \leq |p(x) - f_n(x)| < \frac{1}{2^n}$, as required. (Note that this lemma holds true even if $p$ is not a pdf. This follows since, if $p(x) = \infty$ for some $x \in X$, then $x \in F_n$ for all $n$, and therefore $f_n(x) \geq n$ for all $n$; hence $\lim_{n \to \infty} f_n(x) = \infty = p(x)$.)

Finally, we have
$$ \int_X f_n \, d\mu = \sum_{k} \left( \frac{1}{\mu(E_{n,k})} \int_{E_{n,k}} p \, d\mu \right) \mu(E_{n,k}) + \left( \frac{1}{\mu(F_n)} \int_{F_n} p \, d\mu \right) \mu(F_n) = \sum_{k} \int_{E_{n,k}} p \, d\mu + \int_{F_n} p \, d\mu = \int_X p \, d\mu = 1 . $$

The above construction of a sequence of simple functions which approximates a measurable function is similar to the approximation theorem in the theory of integration (e.g., Kantorovitz, 2003, pp. 6, Theorem 1.8(b)). But the approximation in Lemma 3.1 can be seen as a mean-value approximation, whereas in the latter case it is a lower approximation. Further, unlike in the case of the lower approximation, the sequence of simple functions which approximates $p$ in Lemma 3.1 is neither monotone nor satisfies $f_n \leq p$.

Now one can define a sequence of pmfs $\{\tilde{p}_n\}$ corresponding to the sequence of simple functions constructed in Lemma 3.1, denoted by $\tilde{p}_n = (\tilde{p}_{n,1}, \ldots, \tilde{p}_{n,m(n)})$, as
$$ \tilde{p}_{n,k} = \mu(E_{n,k}) \, f_n \chi_{E_{n,k}}(x) = \int_{E_{n,k}} p \, d\mu , \quad k = 1, \ldots, m(n) , \tag{3.65} $$
for any $n$. Note that in (3.65) the function $f_n \chi_{E_{n,k}}$ is a constant function by the construction of $f_n$ (Lemma 3.1). We have
$$ \sum_{k=1}^{m(n)} \tilde{p}_{n,k} = \sum_{k=1}^{m(n)} \int_{E_{n,k}} p \, d\mu = \int_X p \, d\mu = 1 , \tag{3.66} $$
and hence $\tilde{p}_n$ is indeed a pmf. We call $\{\tilde{p}_n\}$ the approximating sequence of pmfs of the pdf $p$.

Now we present our main theorem, where we assume that $p$ and $r$ are bounded. The assumption of boundedness of $p$ and $r$ simplifies the proof; however, the result can be extended to the unbounded case. (See (Rényi, 1959) for an analysis of Shannon entropy and relative-entropy on $\mathbb{R}$ in the unbounded case.)

THEOREM 3.1  Let $p$ and $r$ be pdfs which are bounded and defined on a measure space $(X, \mathcal{M}, \mu)$. Let $\tilde{p}_n$ and $\tilde{r}_n$ be approximating sequences of pmfs of $p$ and $r$ respectively. Let $I_\alpha$ denote the Rényi relative-entropy as in (3.22) and $I_q$ the Tsallis relative-entropy as in (3.27). Then
$$ \lim_{n \to \infty} I_\alpha(\tilde{p}_n \| \tilde{r}_n) = I_\alpha(p \| r) \tag{3.67} $$
and
$$ \lim_{n \to \infty} I_q(\tilde{p}_n \| \tilde{r}_n) = I_q(p \| r) , \tag{3.68} $$
respectively.

Proof  It is enough to prove the result for either the Tsallis or the Rényi relative-entropy, since each of them is a monotone and continuous function of the other. Hence we write down the proof for the Rényi case and use the entropic index $\alpha$ in the proof.

Corresponding to the pdf $p$, let $\{f_n\}$ be the approximating sequence of simple functions such that $\lim_{n \to \infty} f_n = p$, as in Lemma 3.1. Let $\{g_n\}$ be the approximating sequence of simple functions for $r$ such that $\lim_{n \to \infty} g_n = r$. Corresponding to the simple functions $f_n$ and $g_n$ there exists a common measurable partition⁶ $\{E_{n,1}, \ldots, E_{n,m(n)}\}$ such that $f_n$ and $g_n$ can be written as
$$ f_n(x) = \sum_{k=1}^{m(n)} a_{n,k} \, \chi_{E_{n,k}}(x) , \quad a_{n,k} \in \mathbb{R}^+, \ \forall k = 1, \ldots, m(n) , \tag{3.69a} $$
$$ g_n(x) = \sum_{k=1}^{m(n)} b_{n,k} \, \chi_{E_{n,k}}(x) , \quad b_{n,k} \in \mathbb{R}^+, \ \forall k = 1, \ldots, m(n) , \tag{3.69b} $$
where $\chi_{E_{n,k}}$ is the characteristic function of $E_{n,k}$, for $k = 1, \ldots, m(n)$.

⁶ Let $\varphi$ and $\phi$ be two simple functions defined on $(X, \mathcal{M})$, and let $\{E_1, \ldots, E_n\}$ and $\{F_1, \ldots, F_m\}$ be the measurable partitions corresponding to $\varphi$ and $\phi$ respectively. Then the collection $\{E_i \cap F_j \,|\, i = 1, \ldots, n,\ j = 1, \ldots, m\}$ is a common measurable partition for $\varphi$ and $\phi$.

By (3.69a) and (3.69b), the approximating sequences of pmfs $\tilde{p}_n = (\tilde{p}_{n,1}, \ldots, \tilde{p}_{n,m(n)})$ and $\tilde{r}_n = (\tilde{r}_{n,1}, \ldots, \tilde{r}_{n,m(n)})$ can be written as (see (3.65))
$$ \tilde{p}_{n,k} = a_{n,k} \, \mu(E_{n,k}) , \quad k = 1, \ldots, m(n) , \tag{3.70a} $$
$$ \tilde{r}_{n,k} = b_{n,k} \, \mu(E_{n,k}) , \quad k = 1, \ldots, m(n) . \tag{3.70b} $$
Now the Rényi relative-entropy of $\tilde{p}_n$ and $\tilde{r}_n$ can be written as
$$ I_\alpha(\tilde{p}_n \| \tilde{r}_n) = \frac{1}{\alpha - 1} \ln \sum_{k=1}^{m(n)} \frac{a_{n,k}^\alpha}{b_{n,k}^{\alpha-1}} \, \mu(E_{n,k}) . \tag{3.71} $$
To prove $\lim_{n \to \infty} I_\alpha(\tilde{p}_n \| \tilde{r}_n) = I_\alpha(p \| r)$ it is enough to show that
$$ \lim_{n \to \infty} \frac{1}{\alpha - 1} \ln \int_X \frac{f_n(x)^\alpha}{g_n(x)^{\alpha-1}} \, d\mu(x) = \frac{1}{\alpha - 1} \ln \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \tag{3.72} $$
since we have⁷
$$ \int_X \frac{f_n(x)^\alpha}{g_n(x)^{\alpha-1}} \, d\mu(x) = \sum_{k=1}^{m(n)} \frac{a_{n,k}^\alpha}{b_{n,k}^{\alpha-1}} \, \mu(E_{n,k}) . $$
Further, it is enough to prove that
$$ \lim_{n \to \infty} \int_X h_n(x)^\alpha g_n(x) \, d\mu(x) = \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \tag{3.73} $$
where $h_n$ is defined as
$$ h_n(x) = \frac{f_n(x)}{g_n(x)} . \tag{3.74} $$

⁷ Note that the simple functions $(f_n)^\alpha$ and $(g_n)^{\alpha-1}$ can be written as $(f_n)^\alpha(x) = \sum_{k=1}^{m(n)} a_{n,k}^\alpha \chi_{E_{n,k}}(x)$ and $(g_n)^{\alpha-1}(x) = \sum_{k=1}^{m(n)} b_{n,k}^{\alpha-1} \chi_{E_{n,k}}(x)$. Further, $\frac{f_n^\alpha(x)}{g_n^{\alpha-1}(x)} = \sum_{k=1}^{m(n)} \frac{a_{n,k}^\alpha}{b_{n,k}^{\alpha-1}} \chi_{E_{n,k}}(x)$.

Case 1: $0 < \alpha < 1$.  In this case the Lebesgue dominated convergence theorem (Rudin, 1964, pp. 26, Theorem 1.34) gives
$$ \lim_{n \to \infty} \int_X \frac{f_n^\alpha}{g_n^{\alpha-1}} \, d\mu = \int_X \frac{p^\alpha}{r^{\alpha-1}} \, d\mu , \tag{3.75} $$
and hence (3.67), and consequently (3.68), follows.

Case 2: $\alpha > 1$.  We have $h_n^\alpha g_n \to \frac{p^\alpha}{r^{\alpha-1}}$ a.e. By Fatou's lemma (Rudin, 1964, pp. 23, Theorem 1.28) we obtain
$$ \liminf_{n \to \infty} \int_X h_n(x)^\alpha g_n(x) \, d\mu(x) \geq \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) . \tag{3.76} $$
From the construction of $f_n$ and $g_n$ (Lemma 3.1) we have
$$ h_n(x)\, g_n(x) = \frac{1}{\mu(E_{n,i})} \int_{E_{n,i}} r(x) \frac{p(x)}{r(x)} \, d\mu(x) , \quad \forall x \in E_{n,i} . \tag{3.77} $$
By Jensen's inequality we get
$$ h_n(x)^\alpha g_n(x) \leq \frac{1}{\mu(E_{n,i})} \int_{E_{n,i}} \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \quad \forall x \in E_{n,i} . \tag{3.78} $$
By (3.69a) and (3.69b) we can write (3.78) as
$$ \frac{a_{n,i}^\alpha}{b_{n,i}^{\alpha-1}} \, \mu(E_{n,i}) \leq \int_{E_{n,i}} \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \quad \forall i = 1, \ldots, m(n) . \tag{3.79} $$
Summing over $i$ on both sides of (3.79) we get
$$ \sum_{i=1}^{m(n)} \frac{a_{n,i}^\alpha}{b_{n,i}^{\alpha-1}} \, \mu(E_{n,i}) \leq \sum_{i=1}^{m(n)} \int_{E_{n,i}} \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) . \tag{3.80} $$
Now (3.80) is nothing but
$$ \int_X h_n(x)^\alpha g_n(x) \, d\mu(x) \leq \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \quad \forall n , $$
and hence
$$ \sup_{i \geq n} \int_X h_i(x)^\alpha g_i(x) \, d\mu(x) \leq \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \quad \forall n . $$
Finally we have
$$ \limsup_{n \to \infty} \int_X h_n(x)^\alpha g_n(x) \, d\mu(x) \leq \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) . \tag{3.81} $$
From (3.76) and (3.81) we get
$$ \lim_{n \to \infty} \int_X \frac{f_n(x)^\alpha}{g_n(x)^{\alpha-1}} \, d\mu(x) = \int_X \frac{p(x)^\alpha}{r(x)^{\alpha-1}} \, d\mu(x) , \tag{3.82} $$
and hence (3.67), and consequently (3.68), follows.

3.5.2 On ME of Measure-Theoretic Definition of Tsallis Entropy

Despite the shortcoming of Shannon entropy that it cannot be extended naturally to the non-discrete case, we have observed that Shannon entropy in the measure-theoretic framework can be used in ME-prescriptions consistently with the discrete case. One can easily see that the generalized information measures of Rényi and Tsallis too cannot be extended naturally to the measure-theoretic case, i.e., the measure-theoretic definitions are not equivalent to their corresponding discrete cases in the sense that they cannot be obtained as limits of sequences of finite discrete entropies corresponding to pmfs defined on measurable partitions which approximate the pdf. One can use the same counterexample discussed in § 3.1.1.
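The following sketch (not part of the thesis) repeats that counterexample for the generalized entropies: for the uniform density on $[a, b]$, the discrete Rényi entropies of ever finer equal partitions grow like $\ln n$, and the discrete Tsallis entropies (for $q > 1$) approach $1/(q-1)$; neither tends to the corresponding continuous value.

```python
import numpy as np

# Counterexample of section 3.1.1 applied to the generalized entropies of the
# uniform density on [a, b]: the discrete values on refined partitions do not
# converge to the continuous-case values.
a, b = 0.0, 0.5
alpha = q = 2.0

def renyi(pmf, alpha):
    return np.log(np.sum(pmf ** alpha)) / (1.0 - alpha)

def tsallis(pmf, q):
    return (1.0 - np.sum(pmf ** q)) / (q - 1.0)

print("continuous Renyi  :", np.log(b - a))                        # ln(b - a)
print("continuous Tsallis:", (1 - (b - a) ** (1 - q)) / (q - 1))   # ln_q(b - a)
for n in (10, 100, 1000, 10000):
    pmf = np.full(n, 1.0 / n)
    print(f"n = {n:6d}   discrete Renyi = {renyi(pmf, alpha):7.4f}"
          f"   discrete Tsallis = {tsallis(pmf, q):7.4f}")
```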
In this section we show that the ME-prescriptions in the measure-theoretic case are nevertheless consistent with the discrete case.

Proceeding as in the case of measure-theoretic entropy in § 3.1.3, by specifying the probability measures $\mu$ and $P$ in the discrete case as in (3.14a) and (3.14b) respectively, the measure-theoretic Tsallis entropy $S_q(P)$ (3.26) reduces to
$$ S_q(P) = \sum_{k=1}^n P_k \ln_q \frac{\mu_k}{P_k} . \tag{3.83} $$
By (2.46) we get
$$ S_q(P) = \sum_{k=1}^n P_k^q \left[ \ln_q \mu_k - \ln_q P_k \right] . \tag{3.84} $$
When $\mu$ is the uniform distribution, i.e., $\mu_k = \frac{1}{n}$, $\forall k = 1, \ldots, n$, we get
$$ S_q(P) = S_q^n(P) - n^{q-1} \ln_q n \sum_{k=1}^n P_k^q , \tag{3.85} $$
where $S_q^n(P)$ denotes the Tsallis entropy (2.31) of the pmf $P = (P_1, \ldots, P_n)$, and $S_q(P)$ denotes the measure-theoretic Tsallis entropy (3.26) reduced to the discrete case, with the probability measures $\mu$ and $P$ specified as in (3.14a) and (3.14b) respectively.

Now we show that the quantity $\sum_{k=1}^n P_k^q$ is constant in the maximization of $S_q(P)$ with respect to the set of constraints (3.50). The claim is that
$$ \int_X p(x)^q \, d\mu(x) = (Z_q)^{1-q} , \tag{3.86} $$
which holds for the Tsallis maximum entropy distribution (3.54) in general. This can be shown as follows. From the maximum entropy distribution (3.54) we have
$$ p(x)^{1-q} = \frac{1 - (1-q) \sum_{m=1}^M \beta_m \big( u_m(x) - \langle\langle u_m \rangle\rangle_q \big) \left( \int_X p(x)^q \, d\mu(x) \right)^{-1}}{(Z_q)^{1-q}} , $$
which can be rearranged as
$$ (Z_q)^{1-q} \, p(x) = \left[ 1 - (1-q) \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right] p(x)^q . $$
By integrating both sides of the above equation and using (3.50), we get (3.86). Now, (3.86) can be written in its discrete form as
$$ \sum_{k=1}^n \frac{P_k^q}{\mu_k^{q-1}} = (Z_q)^{1-q} . \tag{3.87} $$
When $\mu$ is the uniform distribution we get
$$ \sum_{k=1}^n P_k^q = n^{1-q} (Z_q)^{1-q} , \tag{3.88} $$
which is a constant. Hence, by (3.85) and (3.88), one can conclude that, with respect to a particular instance of ME, the measure-theoretic Tsallis entropy $S_q(P)$, defined for a probability measure $P$ on the measure space $(X, \mathcal{M}, \mu)$, is equal to the discrete Tsallis entropy up to an additive constant when the reference measure $\mu$ is chosen as a uniform probability distribution. Thereby one can further conclude that, with respect to a particular instance of ME, the measure-theoretic Tsallis entropy is consistent with its discrete definition. The same result can be shown in the case of q-expectation values too.
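The reduction (3.85) can be checked numerically; the following sketch (not part of the thesis) does so for an arbitrary pmf, an arbitrary $q$ and a uniform reference measure.

```python
import numpy as np

# Discrete check of (3.85): with a uniform reference measure mu on n points, the
# measure-theoretic Tsallis entropy sum_k P_k ln_q(mu_k / P_k) differs from the
# discrete Tsallis entropy by the term n^{q-1} ln_q(n) sum_k P_k^q.
def ln_q(x, q):
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

P = np.array([0.4, 0.3, 0.2, 0.1])
n = len(P)
mu = np.full(n, 1.0 / n)
q = 1.8

S_measure = np.sum(P * ln_q(mu / P, q))                 # equation (3.83)
S_discrete = (1.0 - np.sum(P ** q)) / (q - 1.0)         # discrete Tsallis entropy
correction = n ** (q - 1.0) * ln_q(n, q) * np.sum(P ** q)
print(S_measure, S_discrete - correction)               # equal up to rounding
```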
3.6 Gelfand-Yaglom-Perez Theorem in the General Case

The measure-theoretic definition of KL-entropy plays a basic role in the definitions of classical information measures. Entropy, mutual information and conditional forms of entropy can be expressed in terms of KL-entropy, and hence properties of their measure-theoretic analogs follow from those of measure-theoretic KL-entropy (Gray, 1990). These measure-theoretic definitions are key to extending the ergodic theorems of information theory to non-discrete cases. A fundamental theorem in this respect is the Gelfand-Yaglom-Perez (GYP) theorem (Pinsker, 1960b, Theorem 2.4.2), which states that measure-theoretic relative-entropy equals the supremum of relative-entropies over all measurable partitions. In this section we prove the GYP-theorem for Rényi relative-entropy of order greater than one.

Before we proceed to the definitions and present the notion of relative-entropy on a measurable partition, we recall our notation and introduce new symbols. Let $(X, \mathcal{M})$ be a measurable space and let $\Pi$ denote the set of all measurable partitions of $X$. We denote a measurable partition $\pi \in \Pi$ as $\pi = \{E_k\}_{k=1}^m$, i.e., $\cup_{k=1}^m E_k = X$ and $E_i \cap E_j = \emptyset$, $i \neq j$, $i, j = 1, \ldots, m$. We denote the set of all simple functions on $(X, \mathcal{M})$ by $L_0^+$, and the set of all nonnegative $\mathcal{M}$-measurable functions by $L^+$. The set of all $\mu$-integrable functions, where $\mu$ is a measure defined on $(X, \mathcal{M})$, is denoted by $L^1(\mu)$. The Rényi relative-entropy $I_\alpha(P\|R)$ refers to (3.23), which can be written as
$$ I_\alpha(P\|R) = \frac{1}{\alpha - 1} \ln \int_X \varphi^\alpha \, dR , \tag{3.89} $$
where $\varphi \in L^1(R)$ is defined as $\varphi = \frac{dP}{dR}$.

Let $P$ and $R$ be two probability measures on $(X, \mathcal{M})$ such that $P \ll R$. The relative-entropy of a partition $\pi \in \Pi$ for $P$ with respect to $R$ is defined as
$$ I_{P\|R}(\pi) = \sum_{k=1}^m P(E_k) \ln \frac{P(E_k)}{R(E_k)} . \tag{3.90} $$
The GYP-theorem states that
$$ I(P\|R) = \sup_{\pi \in \Pi} I_{P\|R}(\pi) , \tag{3.91} $$
where the measure-theoretic KL-entropy $I(P\|R)$ is defined as in Definition 3.2. When $P$ is not absolutely continuous with respect to $R$, the GYP-theorem assigns $I(P\|R) = +\infty$. The proof of the GYP-theorem given by Dobrushin (1959) can be found in (Pinsker, 1960b, pp. 23, Theorem 2.4.2) or in (Gray, 1990, pp. 92, Lemma 5.2.3). Before we state and prove the GYP-theorem for Rényi relative-entropy of order $\alpha > 1$, we state the following lemma.

LEMMA 3.2  Let $P$ and $R$ be probability measures on the measurable space $(X, \mathcal{M})$ such that $P \ll R$, and let $\varphi = \frac{dP}{dR}$. Then for any $E \in \mathcal{M}$ and $\alpha > 1$ we have
$$ \frac{P(E)^\alpha}{R(E)^{\alpha-1}} \leq \int_E \varphi^\alpha \, dR . \tag{3.92} $$

Proof  Since $P(E) = \int_E \varphi \, dR$, $\forall E \in \mathcal{M}$, by Hölder's inequality we have
$$ \int_E \varphi \, dR \leq \left( \int_E \varphi^\alpha \, dR \right)^{\frac{1}{\alpha}} \left( \int_E dR \right)^{1 - \frac{1}{\alpha}} . $$
That is, $P(E)^\alpha \leq R(E)^{\alpha(1 - \frac{1}{\alpha})} \int_E \varphi^\alpha \, dR$, and hence (3.92) follows.

We now present our main result in a special case as follows.

LEMMA 3.3  Let $P$ and $R$ be two probability measures such that $P \ll R$, and let $\varphi = \frac{dP}{dR} \in L_0^+$. Then for any $0 < \alpha < \infty$ we have
$$ I_\alpha(P\|R) = \frac{1}{\alpha - 1} \ln \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} , \tag{3.93} $$
where $\{E_k\}_{k=1}^m \in \Pi$ is the measurable partition corresponding to $\varphi$.

Proof  The simple function $\varphi \in L_0^+$ can be written as
$$ \varphi(x) = \sum_{k=1}^m a_k \chi_{E_k}(x) , \quad \forall x \in X , \tag{3.94} $$
where $a_k \in \mathbb{R}$, $k = 1, \ldots, m$. Now we have $P(E_k) = \int_{E_k} \varphi \, dR = a_k R(E_k)$, and hence $a_k = \frac{P(E_k)}{R(E_k)}$, $\forall k = 1, \ldots, m$. We also have $\varphi^\alpha(x) = \sum_{k=1}^m a_k^\alpha \chi_{E_k}(x)$, $\forall x \in X$, and hence
$$ \int_X \varphi^\alpha \, dR = \sum_{k=1}^m a_k^\alpha R(E_k) . \tag{3.95} $$
Now, from (3.89), (3.94) and (3.95) one obtains (3.93).

Note that the right hand side of (3.93) represents the Rényi relative-entropy of the partition $\{E_k\}_{k=1}^m \in \Pi$. Now we state and prove the GYP-theorem for Rényi relative-entropy.

THEOREM 3.2  Let $(X, \mathcal{M})$ be a measurable space and let $\Pi$ denote the set of all measurable partitions of $X$. Let $P$ and $R$ be two probability measures. Then for any $\alpha > 1$ we have
$$ I_\alpha(P\|R) = \begin{cases} \displaystyle\sup_{\{E_k\}_{k=1}^m \in \Pi} \frac{1}{\alpha - 1} \ln \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} & \text{if } P \ll R , \\ +\infty & \text{otherwise.} \end{cases} \tag{3.96} $$

Proof  If $P$ is not absolutely continuous with respect to $R$, there exists $E \in \mathcal{M}$ such that $P(E) > 0$ and $R(E) = 0$. Since $\{E, X - E\} \in \Pi$, $I_\alpha(P\|R) = +\infty$.

Now assume that $P \ll R$. It is clear that it is enough to prove that
$$ \int_X \varphi^\alpha \, dR = \sup_{\{E_k\}_{k=1}^m \in \Pi} \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} , \tag{3.97} $$
where $\varphi = \frac{dP}{dR}$. From Lemma 3.2, for any measurable partition $\{E_k\}_{k=1}^m \in \Pi$ we have
$$ \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} \leq \sum_{k=1}^m \int_{E_k} \varphi^\alpha \, dR = \int_X \varphi^\alpha \, dR , $$
and hence
$$ \sup_{\{E_k\}_{k=1}^m \in \Pi} \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} \leq \int_X \varphi^\alpha \, dR . \tag{3.98} $$
Now we obtain the reverse inequality to prove (3.97). That is, we show that
$$ \sup_{\{E_k\}_{k=1}^m \in \Pi} \sum_{k=1}^m \frac{P(E_k)^\alpha}{R(E_k)^{\alpha-1}} \geq \int_X \varphi^\alpha \, dR . \tag{3.99} $$
Note that corresponding to any $\varphi \in L^+$ there exists a sequence of simple functions $\{\varphi_n\}$, $\varphi_n \in L_0^+$, satisfying
$$ 0 \leq \varphi_1 \leq \varphi_2 \leq \ldots \leq \varphi \tag{3.100} $$
such that $\lim_{n \to \infty} \varphi_n = \varphi$ (Kantorovitz, 2003, Theorem 1.8(2)). The sequence $\{\varphi_n\}$ induces a sequence of measures $\{P_n\}$ on $(X, \mathcal{M})$ defined by
$$ P_n(E) = \int_E \varphi_n(x) \, dR(x) , \quad \forall E \in \mathcal{M} . \tag{3.101} $$
We have $\int_E \varphi_n \, dR \leq \int_E \varphi \, dR < \infty$, $\forall E \in \mathcal{M}$, and hence $P_n \ll R$, $\forall n$. From the Lebesgue monotone convergence theorem we have
$$ \lim_{n \to \infty} P_n(E) = P(E) , \quad \forall E \in \mathcal{M} . \tag{3.102} $$
Now, $\varphi_n^\alpha \in L_0^+$, $\varphi_n^\alpha \leq \varphi_{n+1}^\alpha \leq \varphi^\alpha$, $1 \leq n < \infty$, and $\lim_{n \to \infty} \varphi_n^\alpha = \varphi^\alpha$ for any $\alpha > 0$. Hence, again from the Lebesgue monotone convergence theorem, we have
$$ \lim_{n \to \infty} \int_X \varphi_n^\alpha \, dR = \int_X \varphi^\alpha \, dR . \tag{3.103} $$
We now claim that (3.103) implies
$$ \int_X \varphi^\alpha \, dR = \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} . \tag{3.104} $$
This can be verified as follows. Denote $\phi_n = \varphi_n^\alpha$. We have $0 \leq \phi_n \leq \varphi^\alpha$, $\forall n$, $\phi_n \uparrow \varphi^\alpha$, and (as shown above)
$$ \lim_{n \to \infty} \int_X \phi_n \, dR = \int_X \varphi^\alpha \, dR . \tag{3.105} $$
Now, for any $\phi \in L_0^+$ such that $0 \leq \phi \leq \varphi^\alpha$ we have $\int_X \phi \, dR \leq \int_X \varphi^\alpha \, dR$, and hence
$$ \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} \leq \int_X \varphi^\alpha \, dR . \tag{3.106} $$
Now we show the reverse inequality of (3.106). If $\int_X \varphi^\alpha \, dR < +\infty$, then from (3.105), given any $\epsilon > 0$, one can find $0 \leq n_0 < \infty$ such that
$$ \int_X \varphi^\alpha \, dR < \int_X \phi_{n_0} \, dR + \epsilon , $$
and hence
$$ \int_X \varphi^\alpha \, dR < \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} + \epsilon . \tag{3.107} $$
Since (3.107) is true for any $\epsilon > 0$, we can write
$$ \int_X \varphi^\alpha \, dR \leq \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} . \tag{3.108} $$
Now let us verify (3.108) in the case $\int_X \varphi^\alpha \, dR = +\infty$. In this case, for every $N > 0$ one can choose $n_0$ such that $\int_X \phi_{n_0} \, dR > N$, and hence
$$ \int_X \varphi^\alpha \, dR > N \quad (\because\ 0 \leq \phi_{n_0} \leq \varphi^\alpha) \tag{3.109} $$
and
$$ \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} > N . \tag{3.110} $$
Since (3.109) and (3.110) hold for any $N > 0$, we have
$$ \int_X \varphi^\alpha \, dR = \sup \left\{ \int_X \phi \, dR \;\middle|\; 0 \leq \phi \leq \varphi^\alpha ,\ \phi \in L_0^+ \right\} = +\infty , \tag{3.111} $$
and hence (3.108) is verified in the case $\int_X \varphi^\alpha \, dR = +\infty$. Now (3.106) and (3.108) verify the claim that (3.103) implies (3.104). Finally, (3.104) together with Lemma 3.3 proves (3.97), and hence the theorem.

Now, from the fact that the Rényi and Tsallis relative-entropies ((3.23) and (3.28) respectively) are monotone and continuous functions of each other, the GYP-theorem presented in the Rényi case is valid for the Tsallis case too, whenever $q > 1$. However, the GYP-theorem is yet to be established for the case of entropic index $0 < \alpha < 1$ ($0 < q < 1$ in the Tsallis case). Work on this problem is ongoing.
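The following sketch (not part of the thesis) illustrates Theorem 3.2 numerically for $\alpha = 2$ on $X = [0, 1]$, with $R$ the uniform probability measure and $\frac{dP}{dR}(x) = 2x$: the values computed on equal partitions increase towards the measure-theoretic value as the partition is refined. The particular choice of $P$, $R$ and $\alpha$ is arbitrary.

```python
import numpy as np

# GYP illustration: partition sums of the Renyi relative-entropy (alpha > 1)
# approach the integral value from below as the partition of [0,1] is refined.
alpha = 2.0
I_exact = np.log(4.0 / 3.0) / (alpha - 1.0)      # since int_0^1 (2x)^2 dx = 4/3

for n in (2, 4, 16, 64, 256, 1024):
    edges = np.linspace(0.0, 1.0, n + 1)
    P_cells = edges[1:] ** 2 - edges[:-1] ** 2   # P(E_k) for the density 2x
    R_cells = np.diff(edges)                     # R(E_k) = 1/n
    I_part = np.log(np.sum(P_cells ** alpha / R_cells ** (alpha - 1))) / (alpha - 1)
    print(f"n = {n:5d}   partition value = {I_part:.6f}   (exact = {I_exact:.6f})")
```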
4 Geometry and Entropies: Pythagoras' Theorem

Abstract

Kullback-Leibler relative-entropy, in cases involving distributions resulting from relative-entropy minimization, has a celebrated property reminiscent of squared Euclidean distance: it satisfies an analogue of Pythagoras' theorem. Hence this property is referred to as the Pythagoras' theorem of relative-entropy minimization, or the triangle equality, and it plays a fundamental role in geometrical approaches to statistical estimation theory such as information geometry. We state and prove the equivalent of Pythagoras' theorem in the generalized nonextensive formalism as the main result of this chapter. Before presenting this result we study Tsallis relative-entropy minimization and present some differences with the classical case. This work can also be found in (Dukkipati et al., 2005b; Dukkipati, Murty, & Bhatnagar, 2006a).

Apart from being a fundamental measure of information, Kullback-Leibler relative-entropy or KL-entropy plays the role of a "measure of the distance" between two probability distributions in statistics. Since it is not a metric, at first glance it might seem that the geometrical interpretations that metric distance measures usually provide might not be possible at all with KL-entropy playing the role of a distance measure on a space of probability distributions. But it is a pleasant surprise that it is possible to formulate certain geometric propositions for probability distributions, with the relative-entropy playing the role of squared Euclidean distance. Some of these geometrical interpretations cannot be derived from the properties of KL-entropy alone, but rather from the properties of KL-entropy minimization; in other words, these geometrical formulations are possible only when probability distributions resulting from ME-prescriptions of KL-entropy are involved.

As demonstrated by Kullback (1959), minimization problems of relative-entropy with respect to a set of moment constraints find their importance in the well-known Kullback minimum entropy principle, and thereby play a basic role in the information-theoretic approach to statistics (Good, 1963; Ireland & Kullback, 1968). They frequently occur elsewhere as well, e.g., in the theory of large deviations (Sanov, 1957) and, in statistical physics, as maximization of entropy (Jaynes, 1957a, 1957b).

Kullback's minimum entropy principle can be considered as a general method of inference about an unknown probability distribution when there exists a prior estimate of the distribution and new information in the form of constraints on expected values (Shore, 1981b). Formally, one can state this principle as: given a prior distribution r, of all the probability distributions that satisfy the given moment constraints, one should choose the posterior p with the least relative-entropy. The prior distribution r can be a reference distribution (uniform, Gaussian, Lorentzian, Boltzmann, etc.) or a prior estimate of p. The principle of Jaynes maximum entropy is a special case of minimization of relative-entropy under appropriate conditions (Shore & Johnson, 1980).

Many properties of relative-entropy minimization simply reflect well-known properties of relative-entropy, but there are surprising differences as well. For example, relative-entropy does not generally satisfy a triangle relation involving three arbitrary probability distributions. But in certain important cases involving distributions that result from relative-entropy minimization, relative-entropy yields a theorem comparable to Pythagoras' theorem, cf. (Csiszár, 1975) and (Čencov, 1982, § 11). In this geometrical interpretation, relative-entropy plays the role of squared distance, and minimization of relative-entropy appears as the analogue of projection onto a subspace in Euclidean geometry. This property is also known as the triangle equality (Shore, 1981b).

The main aim of this chapter is to study the possible generalization of Pythagoras' theorem to the nonextensive case. Before we take up this problem, we present the properties of Tsallis relative-entropy minimization and point out some differences with the classical case. In the representation of such a minimum entropy distribution, we highlight the use of the q-product (a q-deformed version of multiplication), an operator that has been introduced recently to derive the mathematical structure behind Tsallis statistics. In particular, the q-product representation of the Tsallis minimum relative-entropy distribution will be useful for the derivation of the equivalent of the triangle equality for Tsallis relative-entropy.

Before we conclude this introduction on geometrical ideas of relative-entropy minimization, we make a note on the other geometric approaches that will not be considered in this thesis.
One approach is that of Rao (1945), where one views the set of probability distributions on a sample space as a differential manifold and introduces a Riemannian geometry on this manifold. This approach was pioneered by Čencov (1982) and Amari (1985), who showed the existence of a particular Riemannian geometry which is useful in understanding some questions of statistical inference. This Riemannian geometry turns out to have interesting connections with information theory and, as shown by Campbell (1985), with minimum relative-entropy. In this approach too, the above-mentioned Pythagoras' theorem plays an important role (Amari & Nagaoka, 2000, pp. 72). The other idea involves the use of Hausdorff dimension (Billingsley, 1960, 1965) to understand why minimizing relative-entropy should provide useful results. This approach was begun by Eggleston (1952) for a special case of maximum entropy and was developed by Campbell (1992). For an excellent review of the various geometrical aspects associated with minimum relative-entropy one can refer to (Campbell, 2003).

The structure of the chapter is as follows. We present the necessary background in § 4.1, where we discuss properties of relative-entropy minimization in the classical case. In § 4.2 we present the ME-prescriptions of Tsallis relative-entropy and discuss their differences with the classical case. Finally, the derivation of Pythagoras' theorem in the nonextensive case is presented in § 4.3.

Regarding notation, we use the same notation as in Chapter 3, and we write all our mathematical formulations on the measure space $(X, \mathcal{M}, \mu)$. All the assumptions made in Chapter 3 (see § 3.2) are valid here too. Also, though the results presented in this chapter do not involve major measure-theoretic concepts, we write all integrals with respect to the measure $\mu$ as a convention; these integrals can be replaced by summations in the discrete case or by Lebesgue integrals in the continuous case.

4.1 Relative-Entropy Minimization in the Classical Case

Kullback's minimum entropy principle can be stated formally as follows. Given a prior distribution $r$ and a finite set of moment constraints of the form
$$ \int_X u_m(x)\, p(x) \, d\mu(x) = \langle u_m \rangle , \quad m = 1, \ldots, M , \tag{4.1} $$
one should choose the posterior $p$ which minimizes the relative-entropy
$$ I(p\|r) = \int_X p(x) \ln \frac{p(x)}{r(x)} \, d\mu(x) . \tag{4.2} $$
In (4.1), $\langle u_m \rangle$, $m = 1, \ldots, M$, are the known expectation values of the $\mathcal{M}$-measurable functions $u_m : X \to \mathbb{R}$, $m = 1, \ldots, M$, respectively.

With reference to (4.2) we clarify that, although we mainly use expressions of relative-entropy defined for pdfs in this chapter, we use expressions in terms of the corresponding probability measures as well. For example, when we write the Lagrangian for relative-entropy minimization below, we use the definition of relative-entropy (3.7) for the probability measures $P$ and $R$ corresponding to the pdfs $p$ and $r$ respectively (refer to Definitions 3.2 and 3.3). This correspondence between the probability measures $P$, $R$ and the pdfs $p$, $r$ will not be described again in the sequel.

4.1.1 Canonical Minimum Entropy Distribution

To minimize the relative-entropy (4.2) with respect to the constraints (4.1), the Lagrangian turns out to be
$$ L(x, \lambda, \beta) = \int_X \ln \frac{dP}{dR}(x) \, dP(x) + \lambda \left( \int_X dP(x) - 1 \right) + \sum_{m=1}^M \beta_m \left( \int_X u_m(x) \, dP(x) - \langle u_m \rangle \right) , \tag{4.3} $$
where $\lambda$ and $\beta_m$, $m = 1, \ldots, M$, are Lagrange multipliers.
The solution satisfies
$$ \ln \frac{dP}{dR}(x) + \lambda + \sum_{m=1}^M \beta_m u_m(x) = 0 , $$
and can be written in the form
$$ \frac{dP}{dR}(x) = \frac{e^{-\sum_{m=1}^M \beta_m u_m(x)}}{\int_X e^{-\sum_{m=1}^M \beta_m u_m(x)} \, dR} . \tag{4.4} $$
Finally, from (4.4), the posterior distribution $p(x) = \frac{dP}{d\mu}$ given by Kullback's minimum entropy principle can be written in terms of the prior $r(x) = \frac{dR}{d\mu}$ as
$$ p(x) = \frac{r(x)\, e^{-\sum_{m=1}^M \beta_m u_m(x)}}{\widehat{Z}} , \tag{4.5} $$
where
$$ \widehat{Z} = \int_X r(x)\, e^{-\sum_{m=1}^M \beta_m u_m(x)} \, d\mu(x) \tag{4.6} $$
is the partition function.

Relative-entropy minimization has been applied to many problems in statistics (Kullback, 1959) and statistical mechanics (Hobson, 1971). Other applications include pattern recognition (Shore & Gray, 1982), spectral analysis (Shore, 1981a), speech coding (Markel & Gray, 1976), and the estimation of prior distributions for Bayesian inference (Caticha & Preuss, 2004). For a list of references on applications of relative-entropy minimization see (Shore & Johnson, 1980) and the more recent paper (Cherney & Maslov, 2004).

Properties of relative-entropy minimization have been studied extensively and presented by Shore (1981b); here we briefly mention a few. The principle of maximum entropy is equivalent to relative-entropy minimization in the special case of discrete spaces and uniform priors, in the sense that, when the prior is a uniform distribution with finite support $W$ (over $E \subset X$), the minimum entropy distribution turns out to be
$$ p(x) = \frac{e^{-\sum_{m=1}^M \beta_m u_m(x)}}{\int_E e^{-\sum_{m=1}^M \beta_m u_m(x)} \, d\mu(x)} , \tag{4.7} $$
which is in fact a maximum entropy distribution (3.33) of Shannon entropy with respect to the constraints (4.1).

The important relations for relative-entropy minimization are as follows. The minimum relative-entropy, $I$, can be calculated as
$$ I = -\ln \widehat{Z} - \sum_{m=1}^M \beta_m \langle u_m \rangle , \tag{4.8} $$
while the thermodynamic equations are
$$ \frac{\partial}{\partial \beta_m} \ln \widehat{Z} = -\langle u_m \rangle , \quad m = 1, \ldots, M , \tag{4.9} $$
and
$$ \frac{\partial I}{\partial \langle u_m \rangle} = -\beta_m , \quad m = 1, \ldots, M . \tag{4.10} $$

4.1.2 Pythagoras' Theorem

The statement of the Pythagoras' theorem of relative-entropy minimization can be formulated as follows (Csiszár, 1975).

THEOREM 4.1  Let $r$ be the prior and let $p$ be the probability distribution that minimizes the relative-entropy subject to a set of constraints
$$ \int_X u_m(x)\, p(x) \, d\mu(x) = \langle u_m \rangle , \quad m = 1, \ldots, M , \tag{4.11} $$
with respect to $\mathcal{M}$-measurable functions $u_m : X \to \mathbb{R}$, $m = 1, \ldots, M$, whose expectation values $\langle u_m \rangle$, $m = 1, \ldots, M$, are (assumed to be) a priori known. Let $l$ be any other distribution satisfying the same constraints (4.11). Then we have the triangle equality
$$ I(l\|r) = I(l\|p) + I(p\|r) . \tag{4.12} $$

Proof  We have
$$ I(l\|r) = \int_X l(x) \ln \frac{l(x)}{r(x)} \, d\mu(x) = \int_X l(x) \ln \frac{l(x)}{p(x)} \, d\mu(x) + \int_X l(x) \ln \frac{p(x)}{r(x)} \, d\mu(x) = I(l\|p) + \int_X l(x) \ln \frac{p(x)}{r(x)} \, d\mu(x) . \tag{4.13} $$
From the minimum entropy distribution (4.5) we have
$$ \ln \frac{p(x)}{r(x)} = -\sum_{m=1}^M \beta_m u_m(x) - \ln \widehat{Z} . \tag{4.14} $$
By substituting (4.14) in (4.13) we get
$$ I(l\|r) = I(l\|p) + \int_X l(x) \left\{ -\sum_{m=1}^M \beta_m u_m(x) - \ln \widehat{Z} \right\} d\mu(x) = I(l\|p) - \sum_{m=1}^M \beta_m \int_X l(x) u_m(x) \, d\mu(x) - \ln \widehat{Z} , $$
which, by the hypothesis that $l$ satisfies (4.11), equals $I(l\|p) - \sum_{m=1}^M \beta_m \langle u_m \rangle - \ln \widehat{Z}$, and hence, by (4.8), equals $I(l\|p) + I(p\|r)$.

A simple consequence of the above theorem is that
$$ I(l\|r) \geq I(p\|r) , \tag{4.15} $$
since $I(l\|p) \geq 0$ for every pair of pdfs, with equality if and only if $l = p$. A pictorial depiction of the triangle equality (4.12) is shown in Figure 4.1.

[Figure 4.1: Triangle Equality of Relative-Entropy Minimization]
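The triangle equality (4.12) is easy to verify numerically; the following sketch (not part of the thesis) does so on a finite set with a single constraint, solving for the Lagrange parameter by bisection. All specific choices here (the set, the priors, the constraint function and the target mean) are illustrative assumptions.

```python
import numpy as np

# Numerical check of the triangle equality (4.12) on a finite set, with a single
# constraint u(x) = x and target mean 3.0.
x = np.arange(10, dtype=float)
u = x
target = 3.0

def min_rel_ent(prior):
    """Distribution of the form (4.5), with beta found by bisection on the mean."""
    def mean(beta):
        w = prior * np.exp(-beta * u)
        return np.sum(w * u) / np.sum(w)
    lo, hi = -50.0, 50.0                      # mean(beta) is decreasing in beta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mean(mid) > target else (lo, mid)
    w = prior * np.exp(-0.5 * (lo + hi) * u)
    return w / np.sum(w)

def KL(a, b):
    return np.sum(a * np.log(a / b))

r = np.exp(-0.3 * x); r /= r.sum()            # prior
p = min_rel_ent(r)                            # minimum relative-entropy posterior
l = min_rel_ent(np.exp(-0.05 * (x - 5) ** 2)) # another pmf satisfying the same constraint

print(KL(l, r), KL(l, p) + KL(p, r))          # the two sides of (4.12) coincide
```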
Detailed discussions of the importance of the Pythagoras' theorem of relative-entropy minimization can be found in (Shore, 1981b) and (Amari & Nagaoka, 2000, pp. 72). For a study of relative-entropy minimization without the use of the Lagrange multiplier technique, and of the corresponding geometrical aspects, one can refer to (Csiszár, 1975). The triangle equality of relative-entropy minimization not only plays a fundamental role in geometrical approaches to statistical estimation theory (Čencov, 1982) and information geometry (Amari, 1985, 2001) but is also important for applications in which relative-entropy minimization is used for purposes of pattern classification and cluster analysis (Shore & Gray, 1982).

4.2 Tsallis Relative-Entropy Minimization

Unlike the generalized entropy measures, the ME of generalized relative-entropies has not been much addressed in the literature. Here one has to mention the work of Borland et al. (1998), who give the minimum relative-entropy distribution of Tsallis relative-entropy with respect to constraints in terms of q-expectation values. In this section we study several aspects of Tsallis relative-entropy minimization. First we derive the minimum entropy distribution in the case of q-expectation values (3.38) and then in the case of normalized q-expectation values (3.50). We propose an elegant representation of these distributions using the q-deformed binary operator called the q-product $\otimes_q$. This operator was defined by Borges (2004) along similar lines as the q-addition $\oplus_q$ and q-subtraction $\ominus_q$ that we discussed in § 2.3.2. Since the q-product plays an important role in the nonextensive formalism, we include a detailed discussion of it in this section. Finally, we study the properties of Tsallis relative-entropy minimization and its differences with the classical case.

4.2.1 Generalized Minimum Relative-Entropy Distribution

To minimize the Tsallis relative-entropy
$$ I_q(p\|r) = -\int_X p(x) \ln_q \frac{r(x)}{p(x)} \, d\mu(x) \tag{4.16} $$
with respect to the set of constraints specified in terms of q-expectation values
$$ \int_X u_m(x)\, p(x)^q \, d\mu(x) = \langle u_m \rangle_q , \quad m = 1, \ldots, M , \tag{4.17} $$
the concomitant variational principle is given as follows. Define
$$ L(x, \lambda, \beta) = \int_X \ln_q \frac{r(x)}{p(x)} \, dP(x) - \lambda \left( \int_X dP(x) - 1 \right) - \sum_{m=1}^M \beta_m \left( \int_X p(x)^{q-1} u_m(x) \, dP(x) - \langle u_m \rangle_q \right) , \tag{4.18} $$
where $\lambda$ and $\beta_m$, $m = 1, \ldots, M$, are Lagrange multipliers, and set
$$ \frac{dL}{dP} = 0 . \tag{4.19} $$
The solution satisfies
$$ \ln_q \frac{r(x)}{p(x)} - \lambda - p(x)^{q-1} \sum_{m=1}^M \beta_m u_m(x) = 0 , $$
which can be rearranged, using the definition of the q-logarithm $\ln_q x = \frac{x^{1-q}-1}{1-q}$, as
$$ p(x) = \frac{\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\left( \lambda(1-q) + 1 \right)^{\frac{1}{1-q}}} . $$
Specifying the Lagrange parameter $\lambda$ via the normalization $\int_X p(x)\,d\mu(x) = 1$, one can write the Tsallis minimum relative-entropy distribution as (Borland et al., 1998)
$$ p(x) = \frac{\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\widehat{Z}_q} , \tag{4.20} $$
where the partition function is given by
$$ \widehat{Z}_q = \int_X \left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}} d\mu(x) . \tag{4.21} $$
The values of the Lagrange parameters $\beta_m$, $m = 1, \ldots, M$, are determined using the constraints (4.17).

4.2.2 q-Product Representation for the Tsallis Minimum Entropy Distribution

Note that the generalized minimum relative-entropy distribution (4.20) is not of the form of its classical counterpart (4.5), even if we replace the exponential with the q-exponential. But one can express (4.20) in a form similar to the classical case by invoking the q-deformed binary operation called the q-product. In the framework of the q-deformed functions and operators discussed in Chapter 2, a new multiplication, the q-product, defined as
$$ x \otimes_q y \equiv \begin{cases} \left[ x^{1-q} + y^{1-q} - 1 \right]^{\frac{1}{1-q}} & \text{if } x, y > 0 \ \text{and} \ x^{1-q} + y^{1-q} - 1 > 0 , \\ 0 & \text{otherwise,} \end{cases} \tag{4.22} $$
was first introduced in (Nivanen et al., 2003) and explicitly defined in (Borges, 2004) so as to satisfy the identities
$$ \ln_q (x \otimes_q y) = \ln_q x + \ln_q y , \tag{4.23} $$
$$ e_q^x \otimes_q e_q^y = e_q^{x+y} . \tag{4.24} $$
The q-product recovers the usual product in the limit $q \to 1$, i.e., $\lim_{q \to 1} (x \otimes_q y) = xy$.
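The identities (4.23) and (4.24) can be checked directly; the following sketch (not part of the thesis) does so for arbitrary arguments and an arbitrary $q \in (0, 1)$.

```python
import numpy as np

# Check of the q-product definition (4.22) against the identities (4.23), (4.24).
q = 0.7

def ln_q(x):  return (x ** (1 - q) - 1) / (1 - q)
def exp_q(x): return np.maximum(1 + (1 - q) * x, 0.0) ** (1 / (1 - q))
def q_prod(x, y):
    base = x ** (1 - q) + y ** (1 - q) - 1.0
    return base ** (1 / (1 - q)) if base > 0 else 0.0

x, y = 1.7, 0.9
print(ln_q(q_prod(x, y)), ln_q(x) + ln_q(y))        # identity (4.23)
print(q_prod(exp_q(0.4), exp_q(-0.2)), exp_q(0.2))  # identity (4.24)
```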
The fundamental properties of the q-product $\otimes_q$ are almost the same as those of the usual product, but the distributive law does not hold in general, i.e.,
$$ a\,(x \otimes_q y) \neq ax \otimes_q y , \quad a, x, y \in \mathbb{R} . $$
Further properties of the q-product can be found in (Nivanen et al., 2003; Borges, 2004). One can check the mathematical consistency of the q-product by recalling the expression of the exponential function
$$ e^x = \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^n . \tag{4.25} $$
Replacing the power on the right side of (4.25) by the $n$-fold q-product
$$ x^{\otimes_q n} = \underbrace{x \otimes_q \ldots \otimes_q x}_{n \ \text{times}} , \tag{4.26} $$
one can verify that (Suyari, 2004b)
$$ e_q^x = \lim_{n \to \infty} \left( 1 + \frac{x}{n} \right)^{\otimes_q n} . \tag{4.27} $$
Further mathematical significance of the q-product is demonstrated in (Suyari & Tsukada, 2005) by uncovering the mathematical structure of statistics based on the Tsallis formalism: the law of error, the q-Stirling formula, the q-multinomial coefficient and experimental evidence for a q-central limit theorem.

Now, one can verify the non-trivial fact that the Tsallis minimum entropy distribution (4.20) can be expressed as (Dukkipati, Murty, & Bhatnagar, 2005b)
$$ p(x) = \frac{r(x) \otimes_q e_q^{-\sum_{m=1}^M \beta_m u_m(x)}}{\widehat{Z}_q} , \tag{4.28} $$
where
$$ \widehat{Z}_q = \int_X r(x) \otimes_q e_q^{-\sum_{m=1}^M \beta_m u_m(x)} \, d\mu(x) . \tag{4.29} $$
Later in this chapter we will see that this representation is useful in establishing properties of Tsallis relative-entropy minimization and the corresponding thermodynamic equations.

It is important to note that the distribution in (4.20) can be a (local/global) minimum only if $q > 0$ and the Tsallis cut-off condition (3.46), specified for the Tsallis maximum entropy distribution, is extended to the relative-entropy case, i.e., $p(x) = 0$ whenever $\left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m u_m(x) \right] < 0$. The latter condition is also required for the q-product representation of the generalized minimum entropy distribution.
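The algebraic equivalence of the q-product form (4.28) and the bracket form (4.20) can be checked numerically; in the following sketch (not part of the thesis) the prior, the single constraint function and the Lagrange parameter $\beta$ are fixed by hand, since only the identity of the two unnormalized numerators is being illustrated.

```python
import numpy as np

# The numerator of (4.28), r ⊗_q e_q^{-beta*u}, equals the bracket in (4.20),
# [r^{1-q} - (1-q)*beta*u]^{1/(1-q)}, wherever the bracket is positive.
q, beta = 0.8, 0.6                         # illustrative values
x = np.linspace(0.0, 3.0, 7)
r = np.exp(-x); r /= np.trapz(r, x)        # an arbitrary prior density on the grid
u = x

def exp_q(t):  return np.maximum(1 + (1 - q) * t, 0.0) ** (1 / (1 - q))
def q_prod(a, b):
    return np.maximum(a ** (1 - q) + b ** (1 - q) - 1.0, 0.0) ** (1 / (1 - q))

bracket = np.maximum(r ** (1 - q) - (1 - q) * beta * u, 0.0) ** (1 / (1 - q))
print(q_prod(r, exp_q(-beta * u)))         # numerator of (4.28)
print(bracket)                             # numerator of (4.20): same values
```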
4.2.3 Properties

As we mentioned earlier, in the classical case, that is when $q = 1$, relative-entropy minimization with a uniform distribution as prior is equivalent to entropy maximization. In the nonextensive framework this is not true. Let $r$ be the uniform distribution with finite support $W$ over $E \subset X$. Then, by (4.20), one can verify that the probability distribution which minimizes the Tsallis relative-entropy is
$$ p(x) = \frac{\left[ 1 - (1-q) W^{1-q} \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}}}{\int_E \left[ 1 - (1-q) W^{1-q} \sum_{m=1}^M \beta_m u_m(x) \right]^{\frac{1}{1-q}} d\mu(x)} , $$
which can be written as
$$ p(x) = \frac{e_q^{-W^{q-1} \ln_q W - \sum_{m=1}^M \beta_m u_m(x)}}{\int_E e_q^{-W^{q-1} \ln_q W - \sum_{m=1}^M \beta_m u_m(x)} \, d\mu(x)} \tag{4.30} $$
or
$$ p(x) = \frac{e_q^{-W^{1-q} \sum_{m=1}^M \beta_m u_m(x)}}{\int_E e_q^{-W^{1-q} \sum_{m=1}^M \beta_m u_m(x)} \, d\mu(x)} . \tag{4.31} $$
By comparing (4.30) or (4.31) with the Tsallis maximum entropy distribution (3.44), one can conclude (formally, one can verify this via the thermodynamic equations of Tsallis entropy (3.37)) that minimizing relative-entropy is not equivalent¹ to maximizing entropy when the prior is a uniform distribution. The key observation here is that $W$ appears in (4.31), unlike in (3.44).

¹ For fixed q-expected values $\langle u_m \rangle_q$, the two distributions (4.31) and (3.44) are equal, but the values of the corresponding Lagrange multipliers differ when $q \neq 1$ (while in the classical case they remain the same). Further, (4.31) offers the relation between the Lagrange parameters in these two cases. Let $\beta_m^{(S)}$, $m = 1, \ldots, M$, be the Lagrange parameters corresponding to the generalized maximum entropy distribution, and let $\beta_m^{(I)}$, $m = 1, \ldots, M$, correspond to the generalized minimum entropy distribution with uniform prior. Then we have the relation $\beta_m^{(S)} = W^{1-q} \beta_m^{(I)}$, $m = 1, \ldots, M$.

In this case, one can calculate the minimum relative-entropy $I_q$ as
$$ I_q = -\ln_q \widehat{Z}_q - \sum_{m=1}^M \beta_m \langle u_m \rangle_q . \tag{4.32} $$
To demonstrate the usefulness of the q-product representation of the generalized minimum entropy distribution, we present the verification of (4.32). By using the property (4.24) of q-multiplication, the Tsallis minimum relative-entropy distribution (4.28) can be written as
$$ p(x)\, \widehat{Z}_q = e_q^{-\sum_{m=1}^M \beta_m u_m(x) + \ln_q r(x)} . $$
By taking the q-logarithm on both sides, we get
$$ \ln_q p(x) + \ln_q \widehat{Z}_q + (1-q) \ln_q p(x) \ln_q \widehat{Z}_q = -\sum_{m=1}^M \beta_m u_m(x) + \ln_q r(x) . $$
By the property of the q-logarithm $\ln_q \frac{x}{y} = y^{q-1}(\ln_q x - \ln_q y)$, we have
$$ \ln_q \frac{r(x)}{p(x)} = p(x)^{q-1} \left\{ \ln_q \widehat{Z}_q + (1-q) \ln_q p(x) \ln_q \widehat{Z}_q + \sum_{m=1}^M \beta_m u_m(x) \right\} . \tag{4.33} $$
By substituting (4.33) in the Tsallis relative-entropy (4.16) we get
$$ I_q = -\int_X p(x)^q \left\{ \ln_q \widehat{Z}_q + (1-q) \ln_q p(x) \ln_q \widehat{Z}_q + \sum_{m=1}^M \beta_m u_m(x) \right\} d\mu(x) . $$
By (4.17) and by expanding $\ln_q p(x)$, one can write $I_q$ in its final form as in (4.32).

It is easy to verify the following thermodynamic equations for the minimum Tsallis relative-entropy:
$$ \frac{\partial}{\partial \beta_m} \ln_q \widehat{Z}_q = -\langle u_m \rangle_q , \quad m = 1, \ldots, M , \tag{4.34} $$
$$ \frac{\partial I_q}{\partial \langle u_m \rangle_q} = -\beta_m , \quad m = 1, \ldots, M , \tag{4.35} $$
which generalize the thermodynamic equations of the classical case.

4.2.4 The Case of Normalized q-Expectations

In this section we discuss Tsallis relative-entropy minimization with respect to constraints in the form of normalized q-expectations
$$ \frac{\int_X u_m(x)\, p(x)^q \, d\mu(x)}{\int_X p(x)^q \, d\mu(x)} = \langle\langle u_m \rangle\rangle_q , \quad m = 1, \ldots, M . \tag{4.36} $$
The variational principle for Tsallis relative-entropy minimization in this case is as follows. Let
$$ L(x, \lambda, \beta) = \int_X \ln_q \frac{r(x)}{p(x)} \, dP(x) - \lambda \left( \int_X dP(x) - 1 \right) - \sum_{m=1}^M \beta_m^{(q)} \int_X p(x)^{q-1} \left( u_m(x) - \langle\langle u_m \rangle\rangle_q \right) dP(x) , \tag{4.37} $$
where the parameters $\beta_m^{(q)}$ are defined in terms of the true Lagrange parameters $\beta_m$ as
$$ \beta_m^{(q)} = \frac{\beta_m}{\int_X p(x)^q \, d\mu(x)} , \quad m = 1, \ldots, M . \tag{4.38} $$
This gives the minimum entropy distribution
$$ p(x) = \frac{1}{\widehat{Z}_q} \left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right]^{\frac{1}{1-q}} , \tag{4.39} $$
where
$$ \widehat{Z}_q = \int_X \left[ r(x)^{1-q} - (1-q) \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right]^{\frac{1}{1-q}} d\mu(x) . $$
The minimum entropy distribution (4.39) can be expressed using the q-product (4.22) as
$$ p(x) = \frac{1}{\widehat{Z}_q} \, r(x) \otimes_q \exp_q \left( - \sum_{m=1}^M \beta_m \frac{u_m(x) - \langle\langle u_m \rangle\rangle_q}{\int_X p(x)^q \, d\mu(x)} \right) . \tag{4.40} $$
The minimum Tsallis relative-entropy $I_q$ in this case satisfies
$$ I_q = -\ln_q \widehat{Z}_q , \tag{4.41} $$
while one can derive the following thermodynamic equations:
$$ \frac{\partial}{\partial \beta_m} \ln_q \widehat{\bar{Z}}_q = -\langle\langle u_m \rangle\rangle_q , \quad m = 1, \ldots, M , \tag{4.42} $$
$$ \frac{\partial I_q}{\partial \langle\langle u_m \rangle\rangle_q} = -\beta_m , \quad m = 1, \ldots, M , \tag{4.43} $$
where
$$ \ln_q \widehat{\bar{Z}}_q = \ln_q \widehat{Z}_q - \sum_{m=1}^M \beta_m \langle\langle u_m \rangle\rangle_q . \tag{4.44} $$

4.3 Nonextensive Pythagoras' Theorem

With the above study of Tsallis relative-entropy minimization, in this section we present our main result: Pythagoras' theorem, or the triangle equality (Theorem 4.1), generalized to the nonextensive case. To present this result, we first discuss the significance of the triangle equality in the classical case and restate Theorem 4.1, which is essential for the derivation of the triangle equality in the nonextensive framework.

4.3.1 Pythagoras' Theorem Restated

The significance of the triangle equality is evident in the following scenario. Let $r$ be the prior estimate of an unknown probability distribution $l$, about which information in the form of the constraints
$$ \int_X u_m(x)\, l(x) \, d\mu(x) = \langle u_m \rangle , \quad m = 1, \ldots, M , \tag{4.45} $$
M (4.45) X is available with respect to the fixed functions u m , m = 1, . . . , M . The problem is to choose a posterior estimate p that is in some sense the best estimate of l given by the available information i.e., prior r and the information in the form of expected values (4.45). Kullback’s minimum entropy principle provides a general solution to this inference problem and provides us the estimate (4.5) when we minimize relativeentropy I(pkr) with respect to the constraints Z um (x)p(x) dµ(x) = hum i , m = 1, . . . M . (4.46) X This estimate of posterior p by Kullback’s minimum entropy principle also offers the relation (Theorem 4.1) I(lkr) = I(lkp) + I(pkr) , (4.47) from which one can draw the following conclusions. By (4.15), the minimum relativeentropy posterior estimate of l is not only logically consistent, but also closer to l, in the 87 relative-entropy sense, that is the prior r. Moreover, the difference I(lkr) − I(lkp) is exactly the relative-entropy I(pkr) between the posterior and the prior. Hence, I(pkr) can be interpreted as the amount of information provided by the constraints that is not inherent in r. Additional justification to use minimum relative-entropy estimate of p with respect to the constraints (4.46) is provided by the following expected value matching prop- erty (Shore, 1981b). To explain this concept we restate our above estimation problem as follows. For fixed functions um , m = 1, . . . M , let the actual unknown distribution l satisfy Z um (x)l(x) dµ(x) = hwm i , m = 1, . . . M, (4.48) X where hwm i, m = 1, . . . M are expected values of l, the only information available about l apart from the prior r. To apply minimum entropy principle to estimate poste- rior estimation p of l, one has to determine the constraints for p with respect to which we minimize I(pkr). Equivalently, by assuming that p satisfies the constraints of the form (4.46), one has to determine the expected values hu m i, m = 1, . . . , M . Now, as hum i, m = 1, . . . , M vary, one can show that I q (lkp) has the minimum value when hum i = hwm i , m = 1, . . . M. (4.49) The proof is as follows (Shore, 1981b). I Proceeding as in the proof of Theorem 4.1, we have I(lkp) = I(lkr) + = I(lkr) + Z M X βm m=1 βm hwm i + ln Zb m=1 M X X l(x)um (x) dµ(x) + ln Zb (By (4.48)) (4.50) Since the variation of I(lkp) with respect to hu m i results in the variation of I(lkp) with respect to βm for any m = 1, . . . , M , to find the minimum of I(lkp) one can solve ∂ Iq (lkp) = 0 , m = 1, . . . M , ∂βm which gives the solution as in (4.49). J This property of expectation matching states that, for a distribution p of the form (4.5), I(lkp) is the smallest when the expected values of p match those of l. In particular, p is not only the distribution that minimizes I(pkr) but also minimizes I(lkp). 88 We now restate the Theorem 4.1 which summarizes the above discussion. T HEOREM 4.2 Let r be the prior distribution, and p be the probability distribution that minimizes the relative-entropy subject to a set of constraints Z um (x)p(x) dµ(x) = hum i , m = 1, . . . , M. (4.51) X Let l be any other distribution satisfying the constraints Z um (x)l(x) dµ(x) = hwm i , m = 1, . . . , M. (4.52) X Then 1. I1 (lkp) is minimum only if (expectation matching property) hum i = hwm i , m = 1, . . . M. (4.53) 2. 
When (4.53) holds, we have I(lkr) = I(lkp) + I(pkr) (4.54) By the above interpretation of triangle equality and analogy with the comparable situation in Euclidean geometry, it is natural to call p, as defined by (4.5) as the projection of r on the plane described by (4.52). Csiszár (1975) has introduced a generalization of this notion to define the projection of r on any convex set E of probability distributions. If p ∈ E satisfies the equation I(pkr) = min I(skr) , (4.55) s∈ then p is called the projection of r on E. Csiszár (1975) develops a number of results about these projections for both finite and infinite dimensional spaces. In this thesis, we will not consider this general approach. 4.3.2 The Case of q-Expectations From the above discussion, it is clear that to derive the triangle equality of Tsallis relative-entropy minimization, one should first deduce the equivalent of expectation matching property in the nonextensive case. We state below and prove the Pythagoras theorem in nonextensive framework (Dukkipati, Murty, & Bhatnagar, 2006a). 89 T HEOREM 4.3 Let r be the prior distribution, and p be the probability distribution that minimizes the Tsallis relative-entropy subject to a set of constraints Z um (x)p(x)q dµ(x) = hum iq , m = 1, . . . , M. (4.56) X Let l be any other distribution satisfying the constraints Z um (x)l(x)q dµ(x) = hwm iq , m = 1, . . . , M. (4.57) X Then 1. Iq (lkp) is minimum only if hum iq = hwm iq 1 − (1 − q)Iq (lkp) , m = 1, . . . M. (4.58) 2. Under (4.58), we have Iq (lkr) = Iq (lkp) + Iq (pkr) + (q − 1)Iq (lkp)Iq (pkr) . Proof (4.59) First we deduce the equivalent of expectation matching property in the nonextensive case. That is, we would like to find the values of hu m iq for which Iq (lkp) is minimum. We write the following useful relations before we proceed to the derivation. We can write the generalized minimum entropy distribution (4.28) as ln r(x) p(x) = eq q ⊗q eq − cq Z M m=1 βm um (x) = eq − M m=1 βm um (x)+lnq r(x) cq Z ln x , (4.60) by using the relations eq q = x and exq ⊗q eyq = ex+y . Further by using q lnq (xy) = lnq x + lnq y + (1 − q) lnq x lnq y we can write (4.60) as cq +(1−q) lnq p(x) lnq Z cq = − lnq p(x)+lnq Z M X m=1 By the property of q-logarithm x = y q−1 (lnq x − lnq y) , lnq y and by q-logarithmic representations of Tsallis entropy, Z Sq = − p(x)q lnq p(x) dµ(x) , X 90 βm um (x)+lnq r(x) .(4.61) (4.62) one can verify that Iq (pkr) = − Z X p(x)q lnq r(x) dµ(x) − Sq (p) . (4.63) With these relations in hand we proceed with the derivation. Consider Z p(x) l(x) lnq Iq (lkp) = − dµ(x) . l(x) X By (4.62) we have Z h i l(x)q lnq p(x) − lnq l(x) dµ(x) X Z h i l(x)q lnq p(x) − lnq r(x) dµ(x) . = Iq (lkr) − Iq (lkp) = − (4.64) X From (4.61), we get Iq (lkp) = Iq (lkr)+ Z l(x)q X cq + lnq Z Z " M X # βm um (x) dµ(x) m=1 l(x)q dµ(x) X Z cq l(x)q lnq p(x) dµ(x) . +(1 − q) lnq Z (4.65) X By using (4.57) and (4.63), Iq (lkp) = Iq (lkr) + M X m=1 cq βm hwm iq + lnq Z M X m=1 l(x)q dµ(x) X h i cq − Iq (lkp) − Sq (l) , +(1 − q) lnq Z and by the expression of Tsallis entropy S q (l) = Iq (lkp) = Iq (lkr) + Z 1 q−1 1− R X (4.66) l(x)q dµ(x) , we have cq − (1 − q) lnq Z cq Iq (lkp) . (4.67) βm hwm iq + lnq Z Since the multipliers βm , m = 1, . . . M are functions of the expected values hu m iq , variations in the expected values are equivalent to variations in the multipliers. Hence, to find the minimum of Iq (lkp), we solve ∂ Iq (lkp) = 0 . 
∂βm (4.68) By using thermodynamic equation (4.34), solution of (4.68) provides us with the expectation matching property in the nonextensive case as hum iq = hwm iq 1 − (1 − q)Iq (lkp) , m = 1, . . . M . 91 (4.69) In the limit q → 1 the above equation gives hu m i1 = hwm i1 which is the expectation matching property in the classical case. Now, to derive the triangle equality for Tsallis relative-entropy minimization, we substitute the expression for hwm iq , which is given by (4.69), in (4.67). And after some algebra one can arrive at (4.59). Note that the limit q → 1 in (4.59) gives the triangle equality in the classical case (4.54). The two important cases which arise out of (4.59) are, Iq (lkr) ≤ Iq (lkp) + Iq (pkr) when 0 < q ≤ 1 , (4.70) Iq (lkr) ≥ Iq (lkp) + Iq (pkr) when 1 < q . (4.71) We refer to Theorem 4.3 as nonextensive Pythagoras’ theorem and (4.59) as nonextensive triangle equality, whose pseudo-additivity property is consistent with the pseudo additivity of Tsallis relative-entropy (compare (2.40) and (2.11)), and hence is a natural generalization of triangle equality in the classical case. 4.3.3 In the Case of Normalized q-Expectations In the case of normalized q-expectation too, the Tsallis relative-entropy satisfies nonextensive triangle equality with modified conditions from the case of q-expectation values. T HEOREM 4.4 Let r be the prior distribution, and p be the probability distribution that minimizes the Tsallis relative-entropy subject to the set of constraints R um (x)p(x)q dµ(x) XR = hhum iiq , m = 1, . . . , M. q X p(x) dµ(x) Let l be any other distribution satisfying the constraints R um (x)l(x)q dµ(x) XR = hhwm iiq , m = 1, . . . , M. q X l(x) dµ(x) (4.72) (4.73) Then we have Iq (lkr) = Iq (lkp) + Iq (pkr) + (q − 1)Iq (lkp)Iq (pkr), (4.74) provided hhum iiq = hhwm iiq m = 1, . . . M. 92 (4.75) Proof From Tsallis minimum entropy distribution p in the case of normalized q-expected values (4.40), we have c + (1 − q) ln p(x) ln Z c lnq r(x) − lnq p(x) = lnq Z q q q q PM m=1 βm um (x) − hhum iiq R . + q X p(x) dµ(x) Proceeding as in the proof of Theorem 4.3, we have Z h i l(x)q lnq p(x) − lnq r(x) dµ(x) . Iq (lkp) = Iq (lkr) − (4.76) (4.77) X From (4.76), we obtain Z c Iq (lkp) = Iq (lkr) + lnq Zq l(x)q dµ(x) X Z c +(1 − q) lnq Zq l(x)q lnq p(x) dµ(x) X +R Z M X 1 q βm l(x) um (x) − hhum iiq dµ(x) . q X X p(x) dµ(x) m=1 (4.78) By (4.73) the same can be written as Z c Iq (lkp) = Iq (lkr) + lnq Z l(x)q dµ(x) q X Z c +(1 − q) lnq Zq l(x)q lnq p(x) dµ(x) X R M l(x)q dµ(x) X +RX β hhw ii − hhu ii . m m m q q q X p(x) dµ(x) m=1 (4.79) By using the relations Z l(x)q lnq p(x) dµ(x) = −Iq (lkp) − Sq (l) , X and Z X l(x)q dµ(x) = (1 − q)Sq (l) + 1 , (4.79) can be written as c c − (1 − q) ln Z Iq (lkp) = Iq (lkr) + lnq Z q q q Iq (lkp) R M q X X l(x) dµ(x) R . (4.80) hhw ii − hhu ii β + m m m q q q X p(x) dµ(x) m=1 Finally using (4.41) and (4.75) we have the nonextensive triangle equality (4.74). 93 Note that in this case the minimum of I q (lkp) is not guaranteed. Also the condition (4.75) for nonextensive triangle equality here is the same as the expectation value matching property in the classical case. Finally, nonextensive Pythagoras’ theorem is yet another remarkable and consistent generalization shown by Tsallis formalism. 94 5 Power-laws and Entropies: Generalization of Boltzmann Selection Abstract The great success of Tsallis formalism is due to the resulting power-law distributions from ME-prescriptions of its entropy functional. 
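These power-law weights are built from the q-exponential, and the q-deformed algebra used throughout the previous chapter is easy to check numerically. The following is a minimal, purely illustrative sketch (the function names are ours and q = 1 is treated as the classical limit):

import math

def ln_q(x, q):
    # q-logarithm: ln_q(x) = (x**(1-q) - 1)/(1-q); ordinary log in the limit q -> 1
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    # q-exponential: e_q^x = [1 + (1-q)x]^(1/(1-q)) if the bracket is positive, else 0 (cut-off)
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def q_product(x, y, q):
    # q-product (4.22): [x^(1-q) + y^(1-q) - 1]^(1/(1-q)); assumes the bracket is positive
    if abs(q - 1.0) < 1e-12:
        return x * y
    return (x ** (1.0 - q) + y ** (1.0 - q) - 1.0) ** (1.0 / (1.0 - q))

q = 1.7
# (4.23): ln_q(x q-times y) = ln_q x + ln_q y
print(math.isclose(ln_q(q_product(2.0, 3.0, q), q), ln_q(2.0, q) + ln_q(3.0, q)))
# (4.24): e_q^x q-times e_q^y = e_q^(x+y)   (arguments chosen inside the cut-off region)
print(math.isclose(q_product(exp_q(0.3, q), exp_q(0.5, q), q), exp_q(0.8, q)))
# q -> 1 recovers the ordinary product and exponential
print(math.isclose(q_product(2.0, 3.0, 1.0), 6.0), math.isclose(exp_q(-2.0, 1.0), math.exp(-2.0)))

For q > 1 the cut-off restricts the range of arguments for which (4.24) is meaningful, which is why small arguments are used in the check above.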
In this chapter we provide experimental demonstration of use of the power-law distributions in evolutionary algorithms by generalizing Boltzmann selection to the Tsallis case. The proposed algorithm uses Tsallis canonical distribution to weigh the configurations for ’selection’ instead of Gibbs-Boltzmann distribution. This work is motivated by the recently proposed generalized simulated annealing algorithm based on Tsallis statistics. The results in this chapter can also be found in (Dukkipati, Murty, & Bhatnagar, 2005a). The central step of an enormous variety of problems (in Physics, Chemistry, Statistics, Engineering, Economics) is the minimization of an appropriate energy or cost function. (For example, energy function in the traveling salesman problem is the length of the path.) If the cost function is convex, any gradient descent method easily solves the problem. But if the cost function is nonconvex the solution requires more sophisticated methods, since a gradient decent procedure could easily trap the system in a local minimum. Consequently, various algorithmic strategies have been developed along the years for making this important problem increasingly tractable. Among the various methods developed to solve hard optimization problems, the most popular ones are simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) and evolutionary algorithms (Bounds, 1987). Evolutionary computation comprises of techniques for obtaining near-optimal solutions of hard optimization problems in physics (e.g., Sutton, Hunter, & Jan, 1994) and engineering (Holland, 1975). These methods are based largely on ideas from biological evolution and are similar to simulated annealing, except that, instead of exploring the search space with a single point at each instant, these deal with a population – a multi-subset of search space – in order to avoid getting trapped in local optima during the process of optimization. Though evolutionary algorithms are not analyzed traditionally in the Monte Carlo framework, few researchers (e.g., Cercueil & Francois, 2001; Cerf, 1996a, 1996b) analyzed these algorithms in this framework. 95 A typical evolutionary algorithm is a two step process: selection and variation. Selection comprises replicating an individual in the population based on probabilities (selection probabilities) that are assigned to individuals in the population on the basis of a “fitness” measure defined by the objective function. A stochastic perturbation of individuals while replicating is called variation. Selection is a central concept in evolutionary algorithms. There are several selection mechanisms in evolutionary algorithms, among which Boltzmann selection has an important place because of the deep connection between the behavior of complex systems in thermal equilibrium at finite temperature and multivariate optimization (Nulton & Salamon, 1988). In these systems, each configuration is weighted by its GibbsBoltzmann probability factor e−E/T , where E is the energy of the configuration and T is the temperature. Finding the low-temperature state of a system when the energy can be computed amounts to solving an optimization problem. This connection has been used to devise the simulated annealing algorithm (Kirkpatrick et al., 1983). Similarly for evolutionary algorithms, in the selection process where one would select “better” configurations, one can use the same technique to weigh the individuals i.e., using Gibbs-Boltzmann factor. 
This is called Boltzmann selection, which is nothing but defining selection probabilities in the form of Boltzmann canonical distribution. Classical simulated annealing, as proposed by Kirkpatrick et al. (1983), extended the well-known procedure of Metropolis et al. (1953) for equilibrium Gibbs-Boltzmann statistics: a new configuration is accepted with the probability p = min 1, e−β∆E , where β = 1 T (5.1) is the inverse temperature parameter and ∆E is the change in the energy. The annealing consists in decreasing the temperature gradually. Geman and Geman (1984) showed that if the temperature decreases as the inverse logarithm of time, the system will end in a global minimum. On the other hand, in the generalized simulated annealing procedure proposed by Tsallis and Stariolo (1996) the acceptance probability is generalized to n o 1 , p = min 1, [1 − (1 − q)β∆E] 1−q (5.2) 1 for some q. The term [1 − (1 − q)β∆E] 1−q is due to Tsallis distribution in Tsallis statistics (see § 3.4) and q → 1 in (5.2) retrieves the acceptance probability in the classical case. This method is shown to be faster than both classical simulated annealing and the fast simulated annealing methods (Stariolo & Tsallis, 1995; Tsallis, 96 1988). This algorithm has been used successfully in many applications (Yu & Mo, 2003; Moret et al., 1998; Penna, 1995; Andricioaei & Straub, 1996, 1997). The above described use of power-law distributions in simulated annealing is the motivation for us to incorporate Tsallis canonical probability distribution for selection in evolutionary algorithms and test their novelty. Before we present the proposed algorithm and simulation results, we also present an information theoretic justification of Boltzmann distribution in selection mechanism (Dukkipati et al., 2005a). In fact, in evolutionary algorithms Boltzmann selection is viewed just as an exponential scaling for proportionate selection (de la Maza & Tidor, 1993) (where selection probabilities of configurations are inversely proportional to their energies (Holland, 1975)). We show that by using Boltzmann distribution in the selection mechanism one would implicitly satisfy Kullback minimum relative-entropy principle. 5.1 EAs based on Boltzmann Distribution Let Ω be the search space i.e., space of all configurations of an optimization problem. Let E : Ω → R+ be the objective function – following statistical mechanics terminology (Nulton & Salamon, 1988; Prügel-Bennett & Shapiro, 1994) we refer to this function as energy (in evolutionary computation terminology this is called as fitness function) – where the objective is to find a configuration with lowest energy. t Pt = {ωk }N k=1 denotes a population which is a multi-subset of Ω. Here we assume that the size of population at any time is finite and need not be a constant. In the first step, initial population P 0 is chosen with random configurations. At each time step t, the population undergoes the following procedure. selection Pt −→ Pt0 variation −→ Pt+1 . Variation is nothing but stochastically perturbing the individuals in the population. Various methods in evolutionary algorithms follow different approaches. For example in genetic algorithms, where configurations are represented as binary strings, operators such as mutation and crossover are used; for details see (Holland, 1975). Selection is the mechanism, where “good” configurations are replicated based on t their selection probabilities (Back, 1994). 
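The two acceptance rules (5.1) and (5.2) can be compared directly; the sketch below is illustrative only (the function names and parameter values are ours), with the Tsallis cut-off giving zero acceptance when the bracket in (5.2) is non-positive:

import math

def accept_classical(delta_E, beta):
    # classical simulated annealing acceptance (5.1): p = min(1, exp(-beta * delta_E))
    return min(1.0, math.exp(-beta * delta_E))

def accept_generalized(delta_E, beta, q):
    # generalized (Tsallis) acceptance (5.2): p = min(1, [1 - (1-q) beta delta_E]^(1/(1-q)))
    if abs(q - 1.0) < 1e-12:
        return accept_classical(delta_E, beta)
    base = 1.0 - (1.0 - q) * beta * delta_E
    return min(1.0, base ** (1.0 / (1.0 - q))) if base > 0.0 else 0.0

beta, dE = 2.0, 0.5
for q in (1.0, 1.5, 2.5):
    print(q, accept_generalized(dE, beta, q))
# q -> 1 recovers the classical Metropolis-type rule
print(math.isclose(accept_generalized(dE, beta, 1.0 + 1e-9), accept_classical(dE, beta), rel_tol=1e-6))

In particular, for q > 1 uphill moves are accepted with a power-law tail in beta * delta_E rather than the exponential tail of the classical rule, which is the usual explanation of why the generalized procedure escapes local minima more easily.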
For a population P t = {ωk }N k=1 with the corresponding energy values {Ek }nk=1 , selection probabilities are defined as pt (ωk ) = Prob(ωk ∈ Pt0 |ωk ∈ Pt ) , 97 ∀k = 1 . . . Nt , start Initialize Population Evaluate "Fitness" Apply Selection Randomly Vary Individuals no Stop Criterion yes end Figure 5.1: Structure of evolutionary algorithms t and {pt (ωk )}N k=1 satisfies the condition: P Nt k=1 pt (ωk ) = 1. The general structure of evolutionary algorithms is shown in Figure 5.1; for further details refer to (Fogel, 1994; Back, Hammel, & Schwefel, 1997). According to Boltzmann selection, selection probabilities are defined as e−βEk , pt (ωk ) = PNt −βEj j=1 e (5.3) where β is the inverse temperature at time t. The strength of selection is controlled by the parameter β. A higher value of β (low temperature) gives a stronger selection, and a lower value of β gives a weaker selection (Back, 1994). Boltzmann selection gives faster convergence, but without a good annealing schedule for β, it might lead to premature convergence. This problem is well known from simulated annealing (Aarts & Korst, 1989), but not very well studied in evolutionary algorithms. This problem is addressed in (Mahnig & Mühlenbein, 2001; Dukkipati, Murty, & Bhatnagar, 2004) where annealing schedules for evolutionary algorithms based on Boltzmann selection have been proposed. Now, we derive the selection equation, similar to the one derived in (Dukkipati 98 et al., 2004), which characterizes Boltzmann selection from first principles. Given a t population Pt = {ωk }N k=1 , the simplest probability distribution on Ω which represents Pt is νt (ω) , ∀ω ∈ Ω , Nt ξt (ω) = (5.4) where the function νt : Ω → Z+ ∪ {0} measures the number of occurrences of each configuration ω ∈ Ω in population Pt . Formally νt can be defined as νt (ω) = Nt X k=1 δ(ω, ωk ) , ∀ω ∈ Ω , (5.5) where δ : Ω × Ω → {0, 1} is defined as δ(ω1 , ω2 ) = 1 if ω1 = ω2 , δ(ω1 , ω2 ) = 0 otherwise. The mechanism of selection involves assigning selection probabilities to the configurations in Pt as in (5.3) and sample configurations based on selection probabilities to generate the population Pt+1 . That is, selection probability distribution assigns zero probability to the configurations which are not present in the population. Now from the fact that population is a multi-subset of Ω,we can write selection probability distribution with respect to population P t as, p(ω) = νt (ω)e−βE(ω) −βE(ω) ω∈Pt νt (ω)e , if ω ∈ Pt , 0 (5.6) otherwise. One can estimate the frequencies of configurations after the selection ν t+1 as νt+1 (ω) = νt (ω) P e−βE(ω) N , −βE(ω) t+1 ω∈Pt νt (ω)e (5.7) where Nt+1 is the population size after the selection. Further, the probability distribution which represents the population P t+1 can be estimated as ξt+1 (ω) = νt+1 (ω) e−βE(ω) = νt (ω) P −βE(ω) Nt+1 ω∈Pt νt (ω)e e−βE(ω) . −βE(ω) ω∈Pt ξt (ω)Nt e = ξt (ω)Nt P Finally, we can write the selection equation as ξt+1 (ω) = P ξt (ω)e−βE(ω) . −βE(ω) ω∈Pt ξt (ω)e 99 (5.8) One can observe that (5.8) resembles the minimum relative-entropy distribution that we derived in § 4.1.1 (see 4.5). This motivates one to investigate the possible connection of Boltzmann selection with the Kullback’s relative-entropy principle. Given the distribution ξt , which represents the population P t , we would like to estimate the distribution ξt+1 that represents the population Pt+1 . In this context one can view ξt as a prior estimate of ξt+1 . 
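A small numerical illustration may help here. The sketch below (illustrative only; the names are ours) computes the Boltzmann selection probabilities (5.3) for a toy population, aggregates them per configuration as in (5.6), and checks that the result coincides with the selection equation (5.8):

import math
from collections import Counter

def boltzmann_selection_probs(energies, beta):
    # Boltzmann selection (5.3): p_t(w_k) = exp(-beta E_k) / sum_j exp(-beta E_j)
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

# Toy search space with three configurations and arbitrary illustrative energies
E = {'a': 0.2, 'b': 1.0, 'c': 2.5}
population = ['a', 'a', 'b', 'b', 'b', 'c']   # multi-subset of the search space
beta = 1.5

# Distribution xi_t representing the population, eq. (5.4)
counts = Counter(population)
N = len(population)
xi_t = {w: counts[w] / N for w in E}

# Per-individual selection probabilities (5.3), aggregated per configuration as in (5.6)
probs = boltzmann_selection_probs([E[w] for w in population], beta)
aggregated = Counter()
for w, p in zip(population, probs):
    aggregated[w] += p

# Selection equation (5.8): xi_{t+1}(w) = xi_t(w) exp(-beta E(w)) / sum_w' xi_t(w') exp(-beta E(w'))
Z = sum(xi_t[w] * math.exp(-beta * E[w]) for w in E)
xi_next = {w: xi_t[w] * math.exp(-beta * E[w]) / Z for w in E}

for w in E:
    print(w, round(aggregated[w], 6), round(xi_next[w], 6))   # the two columns agree

The agreement of the two columns is just the step from (5.6) to (5.8) carried out numerically: (5.8) describes the expected, deterministic effect of Boltzmann selection on the population distribution.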
The available constraints for ξt+1 are X ξt+1 (ω) = 1 , (5.9a) X ξt+1 (ω)E(ω) = hEit+1 , (5.9b) w∈Ω w∈Ω where hEit+1 is the expected value of the function E with respect to ξ t+1 . At this stage let us assume that hEi t+1 is a given quantity; this will be explained later. In this set up, Kullback minimum relative-entropy principle gives the estimate for ξt+1 . That is, one should choose ξt+1 in such a way that it minimizes the relativeentropy I(ξt+1 kξt ) = X ξt+1 (ω) ln ω∈Ω ξt+1 (ω) ξt (ω) (5.10) with respect to the constraints (5.9a) and (5.9b). The corresponding Lagrangian can be written as L ≡ −I(ξt+1 kξt )−(λ − 1) −β X ω∈Ω X ω∈Ω ξt+1 (ω) − 1 ! E(ω)ξt+1 (ω) − hEit+1 ! , where λ and β are Lagrange parameters and ∂L = 0 =⇒ ξt+1 (ω) = eln ξt (ω)−λ−βE(ω) . ∂ξt+1 (ω) By (5.9a) we get ξt (ω)e−βE(ω) eln ξt (ω)−βE(ω) , ξt+1 (ω) = P ln ξ (ω)−βE(ω) = P −βE(ω) t ωe ω∈Pt ξt (ω)e (5.11) which is the selection equation (5.8) that we have derived from the Boltzmann selection mechanism. The Lagrange multiplier β is the inverse temperature parameter in Boltzmann selection. 100 The above justification is incomplete without explaining the relevance of the constraint (5.9b) in this context. Note that the inverse temperature parameter β in (5.11) is determined using constraint (5.9b). Thus we have P −βE(ω) ω∈Ω E(ω)ξt (ω)e P = hEit+1 . −βE(ω) ω∈Ω ξt (ω)e (5.12) Now it is evident that by specifying β in the annealing schedule of Boltzmann selec- tion, we predetermine hEi t+1 , which is the mean of the function E with respect to the population Pt+1 , according to which the configurations for P t+1 are sampled. Now with this information theoretic justification of Boltzmann selection we proceed to its generalization to the Tsallis case. 5.2 EA based on Power-law Distributions We propose a new selection scheme for evolutionary algorithms based on Tsallis generalized canonical distribution, that results from maximum entropy prescriptions of t Tsallis entropy discussed in § 3.4 as follows. For a population P (t) = {ω k }N k=1 with t corresponding energies {Ek }N k=1 we define selection probabilities as 1 [1 − (1 − q)βt Ek ] 1−q pt (ωk ) = P Nt 1 1−q j=1 [1 − (1 − q)βt Ej ] , ∀k = 1, . . . Nt , (5.13) where {βt : t = 1, 2, . . .} is an annealing schedule. We refer to the selection scheme based on Tsallis distribution as Tsallis selection and the evolutionary algorithm with Tsallis selection as generalized evolutionary algorithm. In this algorithm, we use the Cauchy annealing schedule that is proposed in (Dukkipati et al., 2004). This annealing schedule chooses β t as a non-decreasing Cauchy sequence for faster convergence. One such sequence is βt = β 0 t X 1 , iα t = 1, 2, . . . , (5.14) i=1 where β0 is any constant and α > 1. The novelty of this annealing schedule has been demonstrated using simulations in (Dukkipati et al., 2004). Similar to the practice in generalized simulated annealing (Andricioaei & Straub, 1997), in our algorithm, q tends towards 1 as temperature decreases during annealing. The generalized evolutionary algorithm based on Tsallis statistics is given in Figure 5.2. 
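The two ingredients of this algorithm, the Tsallis selection probabilities (5.13) and the Cauchy annealing schedule (5.14), can be sketched in code as follows (a purely illustrative sketch; the function names and parameter values are ours, not those of the experiments reported below):

import math

def tsallis_selection_probs(energies, beta, q):
    # Tsallis selection (5.13): p_t(w_k) proportional to [1 - (1-q) beta E_k]^(1/(1-q)),
    # with the Tsallis cut-off (weight 0 when the bracket is non-positive)
    if abs(q - 1.0) < 1e-12:
        weights = [math.exp(-beta * e) for e in energies]
    else:
        weights = []
        for e in energies:
            base = 1.0 - (1.0 - q) * beta * e
            weights.append(base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0)
    Z = sum(weights)
    return [w / Z for w in weights]

def cauchy_beta_schedule(beta0, alpha, T):
    # Cauchy annealing schedule (5.14): beta_t = beta0 * sum_{i=1..t} 1/i^alpha, t = 1, ..., T
    partial, betas = 0.0, []
    for i in range(1, T + 1):
        partial += 1.0 / i ** alpha
        betas.append(beta0 * partial)
    return betas

energies = [0.1, 0.4, 0.9, 2.0]                              # illustrative energy values
betas = cauchy_beta_schedule(beta0=1.0, alpha=1.01, T=5)     # placeholder parameters
for t, beta in enumerate(betas, start=1):
    print(t, [round(p, 3) for p in tsallis_selection_probs(energies, beta, q=1.5)])

With q > 1 and non-negative energies the bracket stays positive, so no configuration is cut off; for q < 1 configurations with sufficiently high energy receive zero selection probability.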
Algorithm 1 Generalized evolutionary algorithm
    P(0) ← initialize with configurations drawn randomly from the search space
    Initialize β and q
    for t = 1 to T do
        for all ω ∈ P(t) do   (Selection)
            Calculate p(ω) = [1 − (1 − q)βE(ω)]^{1/(1−q)} / Z_q
            Copy ω into P′(t) with probability p(ω), with replacement
        end for
        for all ω ∈ P′(t) do   (Variation)
            Perform variation with a specified probability
        end for
        Update β according to the annealing schedule
        Update q according to its schedule
        P(t + 1) ← P′(t)
    end for

Figure 5.2: Generalized evolutionary algorithm based on Tsallis statistics to optimize the energy function E(ω).

5.3 Simulation Results

We discuss the simulations conducted to study the generalized evolutionary algorithm based on Tsallis statistics proposed in this chapter. We compare the performance of evolutionary algorithms with three selection mechanisms, viz., proportionate selection (where selection probabilities of configurations are inversely proportional to their energies (Holland, 1975)), Boltzmann selection and Tsallis selection. For comparison purposes we study multivariate function optimization in the framework of genetic algorithms. Specifically, we use the following benchmark test functions (Mühlenbein & Schlierkamp-Voosen, 1993), where the aim is to find the configuration with the lowest functional value (these functions are also written out in the code sketch later in this section):

• Ackley's function:
  E_1(\vec{x}) = -20 \exp\Big(-0.2\sqrt{\tfrac{1}{l}\sum_{i=1}^{l} x_i^2}\Big) - \exp\Big(\tfrac{1}{l}\sum_{i=1}^{l} \cos(2\pi x_i)\Big) + 20 + e, where −30 ≤ x_i ≤ 30,

• Rastrigin's function:
  E_2(\vec{x}) = lA + \sum_{i=1}^{l}\big(x_i^2 - A\cos(2\pi x_i)\big), where A = 10; −5.12 ≤ x_i ≤ 5.12,

• Griewangk's function:
  E_3(\vec{x}) = \sum_{i=1}^{l} \tfrac{x_i^2}{4000} - \prod_{i=1}^{l} \cos\big(\tfrac{x_i}{\sqrt{i}}\big) + 1, where −600 ≤ x_i ≤ 600.

Parameters for the algorithms were set so as to compare their performance under identical conditions. Each x_i is encoded with 5 bits and l = 15, i.e., the search space is of size 2^{75}. The population size is n = 350. For all the experiments, the probability of uniform crossover is 0.8 and the probability of mutation is below 0.1. We limited each algorithm to 100 iterations and give plots of the behavior of the process averaged over 20 runs. As we mentioned earlier, for Boltzmann selection we have used the Cauchy annealing schedule (see (5.14)), in which we set β_0 = 200 and α = 1.01. For Tsallis selection too, we have used the same annealing schedule as for Boltzmann selection, with identical parameters. In our preliminary simulations, q was kept constant and tested with various values. We then adopted a strategy from generalized simulated annealing in which one chooses an initial value q_0 and decreases q linearly towards 1. This schedule for q gave better performance than keeping it constant. We report results for various values of q_0.

[Figure 5.3: Performance of the evolutionary algorithm with Tsallis selection for various values of q_0 (q_0 = 3, 2, 1.5, 1.01) on the Ackley test function; best fitness vs. generations.]

From various simulations, we observed that when the problem size is small (for example, for smaller values of l) all the selection mechanisms perform equally well. Boltzmann selection is effective when we increase the problem size. For Tsallis selection, we performed simulations with various values of q_0.
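For completeness, the three benchmark functions listed above, read here in their standard forms, can be written out directly; a minimal illustrative sketch with the dimension l left free:

import math

def ackley(x):
    # E_1(x) = -20 exp(-0.2 sqrt((1/l) sum x_i^2)) - exp((1/l) sum cos(2 pi x_i)) + 20 + e
    l = len(x)
    s1 = sum(xi * xi for xi in x) / l
    s2 = sum(math.cos(2.0 * math.pi * xi) for xi in x) / l
    return -20.0 * math.exp(-0.2 * math.sqrt(s1)) - math.exp(s2) + 20.0 + math.e

def rastrigin(x, A=10.0):
    # E_2(x) = l*A + sum (x_i^2 - A cos(2 pi x_i))
    return A * len(x) + sum(xi * xi - A * math.cos(2.0 * math.pi * xi) for xi in x)

def griewangk(x):
    # E_3(x) = sum x_i^2 / 4000 - prod cos(x_i / sqrt(i)) + 1
    s = sum(xi * xi for xi in x) / 4000.0
    p = 1.0
    for i, xi in enumerate(x, start=1):
        p *= math.cos(xi / math.sqrt(i))
    return s - p + 1.0

# All three functions attain their global minimum value 0 at the origin
origin = [0.0] * 15
print(round(ackley(origin), 10), round(rastrigin(origin), 10), round(griewangk(origin), 10))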
Figure 5.3 shows the performance for the Ackley function for q_0 = 3, 2, 1.5 and 1.01, respectively, from which one can see that the choice of q_0 is very important for the evolutionary algorithm with Tsallis selection, and that the best choice varies with the problem at hand. Figures 5.4, 5.5 and 5.6 show the comparisons of evolutionary algorithms based on Tsallis selection, Boltzmann selection and proportionate selection, respectively, for the different functions. We have reported only the best behavior over the various values of q_0. From these simulation results, we conclude that the evolutionary algorithm based on the Tsallis canonical distribution, with an appropriate value of q_0, outperforms those based on Boltzmann and proportionate selection.

[Figure 5.4: Ackley, q_0 = 1.5; best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]
[Figure 5.5: Rastrigin, q_0 = 2; best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]
[Figure 5.6: Griewangk, q_0 = 1.01; best fitness vs. generations for proportionate, Boltzmann and Tsallis selection.]

6 Conclusions

Abstract

In this concluding chapter we summarize the results of the Dissertation, with an emphasis on novelties and on new problems suggested by this research.

Information theory based on the Shannon entropy functional has found applications that cut across a myriad of fields because of its established mathematical significance, i.e., its beautiful mathematical properties. Shannon (1956) too emphasized that "the hard core of information theory is, essentially, a branch of mathematics" and that "a thorough understanding of the mathematical foundation . . . is surely a prerequisite to other applications." Given that "the hard core of information theory is a branch of mathematics," one could expect formal generalizations of information measures to take place, just as would be the case for any other mathematical concept. At the outset of this Dissertation we noted from (Rényi, 1960; Csiszár, 1974) that generalization of information measures should be indicated by their operational significance (the pragmatic approach) and by a set of natural postulates characterizing them (the axiomatic approach). In the literature, ranging from mathematics to physics and from information theory to machine learning, one can find various operational and axiomatic justifications of the generalized information measures. In this thesis, we investigated some properties of generalized information measures and of their maximum and minimum entropy prescriptions pertaining to their mathematical significance.

6.1 Contributions of the Dissertation

In this section we briefly summarize the contributions of this thesis, including some problems suggested by this work.

Rényi's recipe for nonextensive information measures

Passing an information measure through Rényi's formalism – a procedure followed by Rényi to generalize Shannon entropy – allows one to study the possible generalizations and characterizations of an information measure in terms of axioms of quasilinear means. In Chapter 2, we studied this technique for nonextensive entropy and showed that Tsallis entropy is unique under Rényi's recipe.
Assuming that any putative candidate for an entropy should be a mean (Rényi, 1961), and in light of attempts to study ME-prescriptions of information measures, where constraints are specified using KNaverages (e.g., Czachor & Naudts, 2002), the results presented in this thesis further the relation between entropy functionals and generalized means. Measure-theoretic formulations In Chapter 3, we extended the discrete case definitions of generalized information measures to the measure-theoretic case. We showed that as in the case of KullbackLeibler relative-entropy, generalized relative-entropies, whether Rényi or Tsallis, in the discrete case can be naturally extended to measure-theoretic case, in the sense that measure-theoretic definitions can be derived from limits of sequences of finite discrete entropies of pmfs which approximate the pdfs involved. We also showed that ME prescriptions of measure-theoretic Tsallis entropy are consistent with the discrete case, which is also true for measure-theoretic Shannon-entropy. GYP-theorem Gelfand-Yaglom-Perez theorem for KL-entropy not only equips it with a fundamental definition but also provides a means to compute KL-entropy and study its behavior. We stated and proved the GYP-theorem for generalized relative entropies of order α > 1 (q > 1 for the Tsallis case) in Chapter 3. However, results for the case 0 < α < 1, are yet to be obtained. q-product representation of Tsallis minimum entropy distribution Tsallis relative-entropy minimization in both the cases, q-expectations and normalized q-expectations, has been studied and some significant differences with the classical case are presented in Chapter 4. We showed that unlike in the classical case, minimizing Tsallis relative-entropy is not equivalent to maximizing entropy when the prior is a uniform distribution. Our usage of q-product in the representation of Tsallis minimum entropy distributions, not only provides it with an elegant representation but also simplifies the calculations in the study of its properties and in deriving the expressions for minimum relative-entropy and corresponding thermodynamic equations. 107 The detailed study of Tsallis relative-entropy minimization in the case of normalized q-expected values and the computation of corresponding minimum relativeentropy distribution (where one has to address the self-referential nature of the probabilities) based on Tsallis et al. (1998), Martı́nez et al. (2000) formalisms for Tsallis entropy maximization is currently under investigation. Considering the various fields to which Tsallis generalized statistics has been applied, studies of applications of Tsallis relative minimization of various inference problems are of particular relevance. Nonextensive Pythagoras’ theorem Phythagoras’ theorem of relative-entropy plays an important role in geometrical approaches of statistical estimation theory like information geometry. In Chapter 4 we proved Pythagoras’ theorem in the nonextensive case i.e., for Tsallis relative-entropy minimization. In our opinion, this result is yet another remarkable and consistent generalization shown by the Tsallis formalism. Use of power-law distributions in EAs Inspired by the generalization of simulated annealing reported by (Tsallis & Stariolo, 1996), in Chapter 5 we proposed a generalized evolutionary algorithm based on Tsallis statistics. The algorithm uses Tsallis canonical probability distribution instead of Boltzmann distribution. 
Since these distributions are maximum entropy distributions, we presented the information theoretical justification to use Boltzmann selection in evolutionary algorithms – prior to this, Boltzmann selection was viewed only as a special case of proportionate selection with exponential scaling. This should encourage the use of information theoretic methods in evolutionary computation. We tested our algorithm on some bench-mark test functions. We found that with an appropriate choice of nonextensive index (q), evolutionary algorithms based on Tsallis statistics outperform those based on Gibbs-Boltzmann distribution. We believe the Tsallis canonical distribution is a powerful technique for selection in evolutionary algorithms. 6.2 Future Directions There are two fundamental spaces in machine learning. The first space X consists of data points and the second space Θ consists of possible learning models. In statistical 108 learning, Θ is usually a space of statistical models, {p(x; θ) : θ ∈ Θ} in the generative case or {p(y|x; θ) : θ ∈ Θ} in the discriminative case. Learning algorithms select a model θ ∈ Θ based on the training example {x k }nk=1 ⊂ X or {(xk , yk )}nk=1 ⊂ X × Y depending on whether the generative case or the discriminative case are considered. Applying differential geometry, a mathematical theory of geometries, in smooth, locally Euclidean spaces to space of probability distributions and so to statistical models is a fundamental technique in information geometry. Information does however play two roles in it: Kullback-Leibler relative entropy features as a measure of divergence, and Fisher information takes the role of curvature. ME-principle is involved in information geometry due to the following reasons. One is Pythagoras’ theorem of relative-entropy minimization. And the other is due to the work of Amari (2001). Amari showed that ME distributions are exactly the ones with minimal interaction between their variables — these are close to independence. This result plays an important role in geometric approaches to machine learning. Now, equipped with the nonextensive Pythagoras’ theorem in the generalized case of Tsallis, it is interesting to know the resultant geometry when we use generalized information measures and role of entropic index in the geometry. Another open problem in generalized information measures is the kind of constraints one should use for the ME-prescriptions. At present ME-prescriptions for Tsallis come in three flavors. These three flavors correspond to the kind of constraints one would use to derive the canonical distribution. The first is conventional expectation (Tsallis (1988)), second is q-expectation values (Curado-Tsallis (1991)), and the third is normalized q-expectation values (Tsallis-Mendes-Plastino (1998)). The problem of which constraints to use remains an open problem that has so far been addressed only in the context of thermodynamics. Boghosian (1996) suggested that the entropy functional and the constraints one would use should be considered as axioms. By this he suggested that their validity is to be decided solely by the conclusions to which they lead and ultimately by comparison with experiment. A practical study of it in the problems related to estimating probability distributions by using ME of Tsallis entropy might throw some light. Moving on to another problem, we have noted that Tsallis entropy can be written as a Kolmogorov-Nagumo function of Rényi entropy. 
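One convenient way to state this relation, easy to verify numerically, is S_q = [exp((1 − q)R_q) − 1]/(1 − q), where R_q is the Rényi entropy of order q; the following minimal sketch (names ours) checks it for a small pmf:

import math

def renyi_entropy(p, q):
    # Renyi entropy of order q (q != 1): R_q(p) = (1/(1-q)) ln( sum_i p_i^q )
    return math.log(sum(pi ** q for pi in p)) / (1.0 - q)

def tsallis_entropy(p, q):
    # Tsallis entropy: S_q(p) = (1 - sum_i p_i^q) / (q - 1)
    return (1.0 - sum(pi ** q for pi in p)) / (q - 1.0)

p = [0.5, 0.3, 0.15, 0.05]
for q in (0.5, 1.3, 2.0):
    R, S = renyi_entropy(p, q), tsallis_entropy(p, q)
    # one standard form of the relation: S_q = [exp((1-q) R_q) - 1] / (1-q)
    print(q, math.isclose(S, (math.exp((1.0 - q) * R) - 1.0) / (1.0 - q)))

In particular, for fixed q the map from R_q to S_q is strictly increasing, so the two entropies order distributions in the same way.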
We have also seen that the same function is KN-equivalent to the function which is used in the generalized averaging of Hartley information to derive Rényi entropy. This suggests the possibility that generalized averages play a role in describing the operational significance of Tsallis entropy, 109 an explanation for which still eludes us. Finally, though Rényi information measure offers very natural – and perhaps conceptually the cleanest – setting for generalization of entropy, and while generalization of Tsallis entropy too can be put in some what formal setting with q-generalizations of functions – we still are not in the know about the complete relevance, in the sense of operational, axiomatic, mathematical, of entropic indexes α in Rényi and q in Tsallis. This is easily the most challenging problem before us. 6.3 Concluding Thought Mathematical formalism plays an important role not only in physical theories but also in theories of information phenomena; some undisputed examples being the Shannon theory of information and Kolmogorov theory of complexity. One can make advances further in these theories by, as Dirac (1939, 1963) suggested for the advancement of theoretical physics, employing all the resources of pure mathematics in an attempt to perfect and generalize the existing mathematical formalism. While operational and axiomatic justifications lay the foundations, the study of “mathematical significance” of these generalized concepts forms the pillars on which one can develop the generalized theory. The ultimate fruits of this labour include, a better understanding of phenomena in the context, better solutions for related practical problems – perhaps, as Wigner (1960) called unreasonable effectiveness of mathematics – and finally, its own beauty. 110 Bibliography Aarts, E., & Korst, J. (1989). Simulated Annealing and Boltzmann Machines–A Stochastic Approach to Combinatorial Optimization and Neural Computing. Wiley, New York. Abe, S., & Suzuki, N. (2004). Scale-free network of earthquakes. Europhysics Letters, 65(4), 581–586. Abe, S. (2000). Axioms and uniqueness theorem for Tsallis entropy. Physics Letters A, 271, 74–79. Abe, S. (2003). Geometry of escort distributions. Physical Review E, 68, 031101. Abe, S., & Bagci, G. B. (2005). Necessity of q-expectation value in nonextensive statistical mechanics. Physical Review E, 71, 016139. Aczél, J. (1948). On mean values. Bull. Amer. Math. Soc., 54, 392–400. Aczél, J., & Daróczy, Z. (1975). On Measures of Information and Their Characterization. Academic Press, New York. Agmon, N., Alhassid, Y., & Levine, R. D. (1979). An algorithm for finding the distribution of maximal entropy. Journal of Computational Physics, 30, 250–258. Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47, 1701–1711. Amari, S. (1985). Differential-Geometric Methods in Statistics, Vol. 28 of Lecture Notes in Statistics. Springer-Verlag, Heidelberg. Amari, S., & Nagaoka, H. (2000). Methods of Information Geometry, Vol. 191 of Translations of Mathematical Monographs. Oxford University Press, Oxford. Amblard, P.-O., & Vignat, C. (2005). A note on bounded entropies. arXiv:condmat/0509733. Andricioaei, I., & Straub, J. E. (1996). Generalized simulated annealing algorithms using Tsallis statistics: Application to conformational optimization of a tetrapeptide. Physical Review E, 53(4), 3055–3058. 111 Andricioaei, I., & Straub, J. E. (1997). 
On Monte Carlo and molecular dynamics methods inspired by Tsallis statistics: Methodology, optimization, and application to atomic clusters. J. Chem. Phys., 107(21), 9117–9124. Arimitsu, T., & Arimitsu, N. (2000). Tsallis statistics and fully developed turbulence. J. Phys. A: Math. Gen., 33(27), L235. Arimitsu, T., & Arimitsu, N. (2001). Analysis of turbulence by statistics based on generalized entropies. Physica A, 295, 177–194. Arndt, C. (2001). Information Measures: Information and its Description in Science and Engineering. Springer, Berlin. Ash, R. B. (1965). Information Theory. Interscience, New York. Athreya, K. B. (1994). Entropy maximization. IMA preprint series 1231, Institute for Mathematics and its Applications, University of Minnesota, Minneapolis. Back, T. (1994). Selective pressure in evolutionary algorithms: A characterization of selection mechanisms. In Proceedings of the First IEEE Conference on Evolutionary Computation, pp. 57–62 Piscataway, NJ. IEEE Press. Back, T., Hammel, U., & Schwefel, H.-P. (1997). Evolutionary computation: Comments on the history and current state. IEEE Transactions on Evolutionary Computation, 1(1), 3–17. Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512. Barlow, H. (1990). Conditions for versatile learning, Helmholtz’s unconscious inference and the test of perception. Vision Research, 30, 1561–1572. Bashkirov, A. G. (2004). Maximum Rényi entropy principle for systems with powerlaw hamiltonians. Physical Review Letters, 93, 130601. Ben-Bassat, M., & Raviv, J. (1978). Rényi’s entropy and the probability of error. IEEE Transactions on Information Theory, IT-24(3), 324–331. Ben-Tal, A. (1977). On generalized means and generalized convex functions. Journal of Optimization: Theory and Application, 21, 1–13. Bhattacharyya, A. (1943). On a measure on divergence between two statistical populations defined by their probability distributions. Bull. Calcutta. Math. Soc., 35, 99–109. 112 Bhattacharyya, A. (1946). On some analogues of the amount of information and their use in statistical estimation. Sankhya, 8, 1–14. Billingsley, P. (1960). Hausdorff dimension in probability theory. Illinois Journal of Mathematics, 4, 187–209. Billingsley, P. (1965). Ergodic Theory and Information. John Wiley & Songs, Toronto. Boghosian, B. M. (1996). Thermodynamic description of the relaxation of two- dimensional turbulence using Tsallis statistics. Physical Review E, 53, 4754. Borges, E. P. (2004). A possible deformed algebra and calculus inspired in nonextensive thermostatistics. Physica A, 340, 95–101. Borland, L., Plastino, A. R., & Tsallis, C. (1998). Information gain within nonextensive thermostatistics. Journal of Mathematical Physics, 39(12), 6490–6501. Bounds, D. G. (1987). New optimization methods from physics and biology. Nature, 329, 215. Campbell, L. L. (1965). A coding theorem and Rényi’s entropy. Information and Control, 8, 423–429. Campbell, L. L. (1985). The relation between information theory and the differential geometry approach to statistics. Information Sciences, 35(3), 195–210. Campbell, L. L. (1992). Minimum relative entropy and Hausdorff dimension. Internat. J. Math. & Stat. Sci., 1, 35–46. Campbell, L. L. (2003). Geometric ideas in minimum cross-entropy. In Karmeshu (Ed.), Entropy Measures, Maximum Entropy Principle and Emerging Applications, pp. 103–114. Springer-Verlag, Berlin Heidelberg. Caticha, A., & Preuss, R. (2004). Maximum entropy and Bayesian data analysis: Entropic prior distributions. 
Physical Review E, 70, 046127. Čencov, N. N. (1982). Statistical Decision Rules and Optimal Inference, Vol. 53 of Translations of Mathematical Monographs. Amer. Math. Soc., Providence RI. Cercueil, A., & Francois, O. (2001). Monte Carlo simulation and population-based optimization. In Proceedings of the 2001 Congress on Evolutionary Computation (CEC2001), pp. 191–198. IEEE Press. 113 Cerf, R. (1996a). The dynamics of mutation-selection algorithms with large population sizes. Ann. Inst. H. Poincaré, 32, 455–508. Cerf, R. (1996b). A new genetic algorithm. Ann. Appl. Probab., 6, 778–817. Cherney, A. S., & Maslov, V. P. (2004). On minimization and maximization of entropy in various disciplines. SIAM journal of Theory of Probability and Its Applications, 48(3), 447–464. Chew, S. H. (1983). A generalization of the quasilinear mean with applications to the measurement of income inequality and decision theory resolving the allais paradox. Econometrica, 51(4), 1065–1092. Costa, J. A., Hero, A. O., & Vignat, C. (2002). A characterization of the multivariate distributions maximizing Rényi entropy. In Proceedings of IEEE International Symposium on Information Theory(ISIT), pp. 263–263. IEEE Press. Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York. Cover, T. M., Gacs, P., & Gray, R. M. (1989). Kolmogorov’s contributions to information theory and algorithmic complexity. The Annals of Probability, 17(3), 840–865. Csiszár, I. (1969). On generalized entropy. Studia Sci. Math. Hungar., 4, 401–419. Csiszár, I. (1974). Information measures: A critical survey. In Information Theory, Statistical Decision Functions and Random Processes, Vol. B, pp. 73–86. Academia Praha, Prague. Csiszár, I. (1975). I-divergence of probability distributions and minimization problems. Ann. Prob., 3(1), 146–158. Curado, E. M. F., & Tsallis, C. (1991). Generalized statistical mechanics: Connections with thermodynamics. J. Phys. A: Math. Gen., 24, 69–72. Czachor, M., & Naudts, J. (2002). Thermostatistics based on Kolmogorov-Nagumo averages: Unifying framework for extensive and nonextensive generalizations. Physics Letters A, 298, 369–374. Daróczy, Z. (1970). Generalized information functions. Information and Control, 16, 36–51. 114 Davis, H. (1941). The Theory of Econometrics. Principia Press, Bloomington, IN. de Finetti, B. (1931). Sul concetto di media. Giornale di Istituto Italiano dei Attuarii, 2, 369–396. de la Maza, M., & Tidor, B. (1993). An analysis of selection procedures with particular attention paid to proportional and Boltzmann selection. In Forrest, S. (Ed.), Proceedings of the Fifth International Conference on Genetic Algorithms, pp. 124–131 San Mateo, CA. Morgan Kaufmann Publishers. Dirac, P. A. M. (1939). The relation between mathematics and physics. Proceedings of the Royal Society of Edinburgh, 59, 122–129. Dirac, P. A. M. (1963). The evolution of the physicist’s picture of nature. Scientific American, 208, 45–53. Dobrushin, R. L. (1959). General formulations of Shannon’s basic theorems of the theory of information. Usp. Mat. Nauk., 14(6), 3–104. dos Santos, R. J. V. (1997). Generalization of Shannon’s theorem for Tsallis entropy. Journal of Mathematical Physics, 38, 4104–4107. Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006a). Gelfand-Yaglom-Perez theorem for generalized relative entropies. arXiv:math-ph/0601035. Dukkipati, A., Bhatnagar, S., & Murty, M. N. (2006b). On measure theoretic definitions of generalized information measures and maximum entropy prescriptions. 
arXiv:cs.IT/0601080. (Submitted to Physica A). Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2004). Cauchy annealing schedule: An annealing schedule for Boltzmann selection scheme in evolutionary algorithms. In Proceedings of the IEEE Congress on Evolutionary Computation(CEC), Vol. 1, pp. 55–62. IEEE Press. Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005a). Information theoretic justification of Boltzmann selection and its generalization to Tsallis case. In Proceedings of the IEEE Congress on Evolutionary Computation(CEC), Vol. 2, pp. 1667–1674. IEEE Press. Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2005b). Properties of Kullback-Leibler cross-entropy minimization in nonextensive framework. In Proceedings of IEEE International Symposium on Information Theory(ISIT), pp. 2374–2378. IEEE Press. 115 Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006a). Nonextensive triangle equality and other properties of Tsallis relative-entropy minimization. Physica A, 361, 124–138. Dukkipati, A., Murty, M. N., & Bhatnagar, S. (2006b). Uniqueness of nonextensive entropy under rényi’s recipe. arXiv:cs.IT/05511078. Ebanks, B., Sahoo, P., & Sander, W. (1998). Characterizations of Information Measures. World Scientific, Singapore. Eggleston, H. G. (1952). Sets of fractional dimension which occur in some problems of number theory. Proc. London Math. Soc., 54(2), 42–93. Elsasser, W. M. (1937). On quantum measurements and the role of the uncertainty relations in statistical mechanics. Physical Review, 52, 987–999. Epstein, L. G., & Zin, S. E. (1989). Substitution, risk aversion and the temporal behavior of consumption and asset returns: A theoretical framework. Econometrica, 57, 937–970. Faddeev, D. K. (1986). On the concept of entropy of a finite probabilistic scheme (Russian). Uspehi Mat. Nauk (N.S), 11, 227–231. Ferri, G. L., Martı́nez, S., & Plastino, A. (2005). The role of constraints in Tsallis’ nonextensive treatment revisited. Physica A, 347, 205–220. Fishburn, P. C. (1986). Implicit mean value and certainty equivalence. Econometrica, 54(5), 1197–1206. Fogel, D. B. (1994). An introduction to simulated evolutionary optimization. IEEE Transactions on Neural Networks, 5(1), 3–14. Forte, B., & Ng, C. T. (1973). On a characterization of the entropies of type β. Utilitas Math., 4, 193–205. Furuichi, S. (2005). On uniqueness theorem for Tsallis entropy and Tsallis relative entropy. IEEE Transactions on Information Theory, 51(10), 3638–3645. Furuichi, S. (2006). Information theoretical properties of Tsallis entropies. Journal of Mathematical Physics, 47, 023302. Furuichi, S., Yanagi, K., & Kuriyama, K. (2004). Fundamental properties of Tsallis relative entropy. Journal of Mathematical Physics, 45, 4868–4877. 116 Gelfand, I. M., Kolmogorov, N. A., & Yaglom, A. M. (1956). On the general definition of the amount of information. Dokl. Akad. Nauk USSR, 111(4), 745–748. (In Russian). Gelfand, I. M., & Yaglom, A. M. (1959). Calculation of the amount of information about a random function contained in another such function. Usp. Mat. Nauk, 12(1), 3–52. (English translation in American Mathematical Society Translations, Providence, R.I. Series 2, vol. 12). Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., 6(6), 721–741. Gokcay, E., & Principe, J. C. (2002). Information theoretic clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 158–171. Good, I. J. (1963). 
Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Ann. Math. Statist., 34, 911–934. Gray, R. M. (1990). Entropy and Information Theory. Springer-Verlag, New York. Grendár jr, M., & Grendár, M. (2001). Maximum entropy: Clearing up mysteries. Entropy, 3(2), 58–63. Guiaşu, S. (1977). Information Theory with Applications. McGraw-Hill, Great Britain. Halsey, T. C., Jensen, M. H., Kadanoff, L. P., Procaccia, I., & Shraiman, B. I. (1986). Fractal measures and their singularities: The characterization of strange sets. Physical Review A, 33, 1141–1151. Hardy, G. H., Littlewood, J. E., & Pólya, G. (1934). Inequalities. Cambridge. Harremoës, P., & Topsøe, F. (2001). Maximum entropy fundamentals. Entropy, 3, 191–226. Hartley, R. V. L. (1928). Transmission of information. Bell System Technical Journal, 7, 535. Havrda, J., & Charvát, F. (1967). Quantification method of classification process: Concept of structural α-entropy. Kybernetika, 3, 30–35. Hinčin, A. (1953). The concept of entropy in the theory of probability (Russian). Uspehi Mat. Nauk, 8(3), 3–28. (English transl.: In Mathematical Foundations of Information Theory, pp. 1-28. Dover, New York, 1957). 117 Hobson, A. (1969). A new theorem of information theory. J. Stat. Phys., 1, 383–391. Hobson, A. (1971). Concepts in Statistical Mechanics. Gordon and Breach, New York. Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI. Ireland, C., & Kullback, S. (1968). Contingency tables with given marginals. Biometrika, 55, 179–188. Jaynes, E. T. (1957a). Information theory and statistical mechanics i. Physical Review, 106(4), 620–630. Jaynes, E. T. (1957b). Information theory and statistical mechanics ii. Physical Review, 108(4), 171–190. Jaynes, E. T. (1968). Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, sec-4(3), 227–241. Jeffreys, H. (1948). Theory of Probability (2nd Edition). Oxford Clarendon Press. Jizba, P., & Arimitsu, T. (2004a). Observability of Rényi’s entropy. Physical Review E, 69, 026128. Jizba, P., & Arimitsu, T. (2004b). The world according to Rényi: thermodynamics of fractal systems. Annals of Physics, 312, 17–59. Johnson, O., & Vignat, C. (2005). Some results concerning maximum Rényi entropy distributions. math.PR/0507400. Johnson, R., & Shore, J. (1983). Comments on and correction to ’axiomatic derivation of the principle of maximum entropy and the principle of minimum crossentropy’ (jan 80 26-37) (corresp.). IEEE Transactions on Information Theory, 29(6), 942–943. Kallianpur, G. (1960). On the amount of information contained in a σ-field. In Olkin, I., & Ghurye, S. G. (Eds.), Essays in Honor of Harold Hotelling, pp. 265–273. Stanford Univ. Press, Stanford. Kamimura, R. (1998). Minimizing α-information for generalization and interpretation. Algorithmica, 22(1/2), 173–197. Kantorovitz, S. (2003). Introduction to Modern Analysis. Oxford, New York. 118 Kapur, J. N. (1994). Measures of Information and their Applications. Wiley, New York. Kapur, J. N., & Kesavan, H. K. (1997). Entropy Optimization Principles with Applications. Academic Press. Karmeshu, & Sharma, S. (2006). Queue lengh distribution of network packet traffic: Tsallis entropy maximization with fractional moments. IEEE Communications Letters, 10(1), 34–36. Khinchin, A. I. (1956). Mathematical Foundations of Information Theory. Dover, New York. Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. 
Science, 220(4598), 671–680. Kolmogorov, A. N. (1930). Sur la notion de la moyenne. Atti della R. Accademia Nazionale dei Lincei, 12, 388–391. Kolmogorov, A. N. (1957). Theorie der nachrichtenübermittlung. In Grell, H. (Ed.), Arbeiten zur Informationstheorie, Vol. 1. Deutscher Verlag der Wissenschaften, Berlin. Kotz, S. (1966). Recent results in information theory. Journal of Applied Probability, 3(1), 1–93. Kreps, D. M., & Porteus, E. L. (1978). Temporal resolution of uncertainty and dynamic choice theory. Econometrica, 46, 185–200. Kullback, S. (1959). Information Theory and Statistics. Wiley, New York. Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Ann. Math. Stat., 22, 79–86. Lavenda, B. H. (1998). The analogy between coding theory and multifractals. Journal of Physics A: Math. Gen., 31, 5651–5660. Lazo, A. C. G. V., & Rathie, P. N. (1978). On the entropy of continuous probability distributions. IEEE Transactions on Information Theory, IT-24(1), 120–122. Maassen, H., & Uffink, J. B. M. (1988). Generalized entropic uncertainty relations. Physical Review Letters, 60, 1103–1106. 119 Mahnig, T., & Mühlenbein, H. (2001). A new adaptive Boltzmann selection schedule sds. In Proceedings of the Congress on Evolutionary Computation (CEC’2001), pp. 183–190. IEEE Press. Markel, J. D., & Gray, A. H. (1976). Linear Prediction of Speech. Springer-Verlag, New York. Martı́nez, S., Nicolás, F., Pennini, F., & Plastino, A. (2000). Tsallis’ entropy maximization procedure revisited. Physica A, 286, 489–502. Masani, P. R. (1992a). The measure-theoretic aspects of entropy, Part 1. Journal of Computational and Applied Mathematics, 40, 215–232. Masani, P. R. (1992b). The measure-theoretic aspects of entropy, Part 2. Journal of Computational and Applied Mathematics, 44, 245–260. Mead, L. R., & Papanicolaou, N. (1984). Maximum entropy in the problem of moments. Journal of Mathematical Physics, 25(8), 2404–2417. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21, 1087–1092. Morales, D., Pardo, L., Pardo, M. C., & Vajda, I. (2004). Rényi statistics for testing composite hypotheses in general exponential models. Journal of Theoretical and Applied Statistics, 38(2), 133–147. Moret, M. A., Pascutti, P. G., Bisch, P. M., & Mundim, K. C. (1998). Stochastic molecular optimization using generalized simulated annealing. J. Comp. Chemistry, 19, 647. Mühlenbein, H., & Schlierkamp-Voosen, D. (1993). Predictive models for the breeder genetic algorithm. Evolutionary Computation, 1(1), 25–49. Nagumo, M. (1930). Über eine klasse von mittelwerte. Japanese Journal of Mathematics, 7, 71–79. Naranan, S. (1970). Bradford’s law of bibliography of science: an interpretation. Nature, 227, 631. Nivanen, L., Méhauté, A. L., & Wang, Q. A. (2003). Generalized algebra within a nonextensive statistics. Rep. Math. Phys., 52, 437–434. 120 Norries, N. (1976). General means and statistical theory. The American Statistician, 30, 1–12. Nulton, J. D., & Salamon, P. (1988). Statistical mechanics of combinatorial optimization. Physical Review A, 37(4), 1351–1356. Ochs, W. (1976). Basic properties of the generalized Boltzmann-Gibbs-Shannon entropy. Reports on Mathematical Physics, 9, 135–155. Ormoneit, O., & White, H. (1999). An efficient algorithm to compute maximum entropy densities. Econometric Reviews, 18(2), 127–140. Ostasiewicz, S., & Ostasiewicz, W. (2000). Means and their applications. 
Penna, T. J. P. (1995). Traveling salesman problem and Tsallis statistics. Physical Review E, 51, R1.
Perez, A. (1959). Information theory with abstract alphabets. Theory of Probability and its Applications, 4(1).
Pinsker, M. S. (1960a). Dynamical systems with completely positive or zero entropy. Soviet Math. Dokl., 1, 937.
Pinsker, M. S. (1960b). Information and Information Stability of Random Variables and Processes. Holden-Day, San Francisco, CA. (English ed., 1964, translated and edited by Amiel Feinstein).
Prügel-Bennett, A., & Shapiro, J. (1994). Analysis of genetic algorithms using statistical mechanics. Physical Review Letters, 72(9), 1305–1309.
Queirós, S. M. D., Anteneodo, C., & Tsallis, C. (2005). Power-law distributions in economics: a nonextensive statistical approach. In Abbott, D., Bouchaud, J.-P., Gabaix, X., & McCauley, J. L. (Eds.), Noise and Fluctuations in Econophysics and Finance, pp. 151–164. SPIE, Bellingham, WA.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc., 37, 81–91.
Rebollo-Neira, L. (2001). Nonextensive maximum-entropy-based formalism for data subset selection. Physical Review E, 65, 011113.
Rényi, A. (1959). On the dimension and entropy of probability distributions. Acta Math. Acad. Sci. Hung., 10, 193–215. (Reprinted in (Turán, 1976), pp. 320–342).
Rényi, A. (1960). Some fundamental questions of information theory. MTA III. Oszt. Közl., 10, 251–282. (Reprinted in (Turán, 1976), pp. 526–552).
Rényi, A. (1961). On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 547–561, Berkeley-Los Angeles. University of California Press. (Reprinted in (Turán, 1976), pp. 565–580).
Rényi, A. (1965). On the foundations of information theory. Rev. Inst. Internat. Stat., 33, 1–14. (Reprinted in (Turán, 1976), pp. 304–317).
Rényi, A. (1970). Probability Theory. North-Holland, Amsterdam.
Rosenblatt-Roth, M. (1964). The concept of entropy in probability theory and its applications in the theory of information transmission through communication channels. Theory Probab. Appl., 9(2), 212–235.
Rudin, W. (1964). Real and Complex Analysis. McGraw-Hill. (International edition, 1987).
Ryu, H. K. (1993). Maximum entropy estimation of density and regression functions. Journal of Econometrics, 56, 397–440.
Sanov, I. N. (1957). On the probability of large deviations of random variables. Mat. Sbornik, 42, 11–44. (In Russian).
Schützenberger, M. B. (1954). Contribution aux applications statistiques de la théorie de l'information. Publ. l'Institut Statist. de l'Université de Paris, 3, 3–117.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379.
Shannon, C. E. (1956). The bandwagon (Edtl.). IEEE Transactions on Information Theory, 2, 3–3.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication. University of Illinois Press, Urbana, Illinois.
Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Transactions on Information Theory, IT-26(1), 26–37. (See (Johnson & Shore, 1983) for comments and corrections.).
Shore, J. E. (1981a). Minimum cross-entropy spectral analysis. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-29, 230–237.
Shore, J. E. (1981b). Properties of cross-entropy minimization. IEEE Transactions on Information Theory, IT-27(4), 472–482.
Shore, J. E., & Gray, R. M. (1982). Minimum cross-entropy pattern classification and cluster analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 4(1), 11–18.
Skilling, J. (1984). The maximum entropy method. Nature, 309, 748.
Smith, J. D. H. (2001). Some observations on the concepts of information theoretic entropy and randomness. Entropy, 3, 1–11.
Stariolo, D. A., & Tsallis, C. (1995). Optimization by simulated annealing: Recent progress. In Stauffer, D. (Ed.), Annual Reviews of Computational Physics, Vol. 2, p. 343. World Scientific, Singapore.
Sutton, P., Hunter, D. L., & Jan, N. (1994). The ground state energy of the ±J spin glass from the genetic algorithm. Journal de Physique I France, 4, 1281–1285.
Suyari, H. (2002). Nonextensive entropies derived from invariance of pseudoadditivity. Physical Review E, 65, 066118.
Suyari, H. (2004a). Generalization of Shannon-Khinchin axioms to nonextensive systems and the uniqueness theorem for the nonextensive entropy. IEEE Transactions on Information Theory, 50(8), 1783–1787.
Suyari, H. (2004b). q-Stirling's formula in Tsallis statistics. cond-mat/0401541.
Suyari, H., & Tsukada, M. (2005). Law of error in Tsallis statistics. IEEE Transactions on Information Theory, 51(2), 753–757.
Teweldeberhan, A. M., Plastino, A. R., & Miller, H. G. (2005). On the cut-off prescriptions associated with power-law generalized thermostatistics. Physics Letters A, 343, 71–78.
Tikochinsky, Y., Tishby, N. Z., & Levine, R. D. (1984). Consistent inference of probabilities for reproducible experiments. Physical Review Letters, 52, 1357–1360.
Topsøe, F. (2001). Basic concepts, identities and inequalities - the toolkit of information theory. Entropy, 3, 162–190.
Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys., 52, 479.
Tsallis, C. (1994). What are the numbers that experiments provide? Quimica Nova, 17, 468.
Tsallis, C., & de Albuquerque, M. P. (2000). Are citations of scientific papers a case of nonextensivity? Eur. Phys. J. B, 13, 777–780.
Tsallis, C. (1998). Generalized entropy-based criterion for consistent testing. Physical Review E, 58, 1442–1445.
Tsallis, C. (1999). Nonextensive statistics: Theoretical, experimental and computational evidences and connections. Brazilian Journal of Physics, 29, 1.
Tsallis, C., Levy, S. V. F., Souza, A. M. C., & Maynard, R. (1995). Statistical-mechanical foundation of the ubiquity of Lévy distributions in nature. Physical Review Letters, 75, 3589–3593.
Tsallis, C., Mendes, R. S., & Plastino, A. R. (1998). The role of constraints within generalized nonextensive statistics. Physica A, 261, 534–554.
Tsallis, C., & Stariolo, D. A. (1996). Generalized simulated annealing. Physica A, 233, 345–406.
Turán, P. (Ed.). (1976). Selected Papers of Alfréd Rényi. Akademia Kiado, Budapest.
Uffink, J. (1995). Can the maximum entropy principle be explained as a consistency requirement? Studies in History and Philosophy of Modern Physics, 26, 223–261.
Uffink, J. (1996). The constraint rule of the maximum entropy principle. Studies in History and Philosophy of Modern Physics, 27, 47–79.
Vignat, C., Hero, A. O., & Costa, J. A. (2004). About closedness by convolution of the Tsallis maximizers. Physica A, 340, 147–152.
Wada, T., & Scarfone, A. M. (2005). Connections between Tsallis' formalism employing the standard linear average energy and ones employing the normalized q-average energy. Physics Letters A, 335, 351–362.
Watanabe, S. (1969). Knowing and Guessing. Wiley.
Wehrl, A. (1991). The many facets of entropy. Reports on Mathematical Physics, 30, 119–129.
Wiener, N. (1948). Cybernetics. Wiley, New York.
Wigner, E. P. (1960). The unreasonable effectiveness of mathematics in the natural sciences. Communications on Pure and Applied Mathematics, 13, 1–14.
Wu, X. (2003). Calculation of maximum entropy densities with application to income distribution. Journal of Econometrics, 115, 347–354.
Yamano, T. (2001). Information theory based on nonadditive information content. Physical Review E, 63, 046105.
Yamano, T. (2002). Some properties of q-logarithm and q-exponential functions in Tsallis statistics. Physica A, 305, 486–496.
Yu, Z. X., & Mo, D. (2003). Generalized simulated annealing algorithm applied in the ellipsometric inversion problem. Thin Solid Films, 425, 108.
Zellner, A., & Highfield, R. A. (1988). Calculation of maximum entropy distributions and approximation of marginal posterior distributions. Journal of Econometrics, 37, 195–209.
Zitnick, C. (2003). Computing Conditional Probabilities in Large Domains by Maximizing Renyi's Quadratic Entropy. Ph.D. thesis, Robotics Institute, Carnegie Mellon University.