MELBOURNE UNIVERSITY
Department of Mathematics and Statistics

An introduction to information geometry and its applications in statistics

Martin Pike

This thesis was written as part of a Master of Science degree.
May 2012

TABLE OF CONTENTS

INTRODUCTION
1. ELEMENTS OF PROBABILITY
1.1 Probability Distribution Families
1.2 Expectations
2. ELEMENTS OF DIFFERENTIAL GEOMETRY
2.1 Manifolds
2.2 Tangent Spaces
2.3 Rates of Change, Partial Derivatives and Differentials
2.4 Vector Fields
3. INFORMATION GEOMETRY
3.1 Statistical Manifold (Amari & Nagaoka)
3.2 Statistical Manifold (Murray & Rice)
3.3 Sub-Manifolds
3.4 Tangent Spaces of Statistical Manifolds
4. EXPONENTIAL FAMILIES
4.1 Canonical Co-ordinates of Exponential Distributions
4.2 Testing for Exponentiality
5. STRAIGHTNESS
5.1 Connections
5.2 Subspace Connections
5.3 α-Connections
5.4 Geodesics
5.5 Statistical Manifold Geodesics
6. DISTANCES ON STATISTICAL MANIFOLDS
6.1 Riemannian Metrics
6.2 Divergences
6.3 α-Divergences of the Normal Family
7. DUALITY
7.1 Dual Connections
7.2 Parallel Translation
7.3 Canonical Divergence
7.4 Dualistic Structure of Exponential Families
7.5 Differential Duality
8. SURVEY OF APPLICATIONS
8.1 Affine Immersions
8.2 Projections onto Sub-Manifolds
8.3 Geodesics for Statistical Inference
8.4 Geodesics for Statistical Evolution
CONCLUSION
TABLE OF NOTATION

INTRODUCTION

This thesis aims to be an illustrative and approachable introduction to the theory of information geometry. It is largely based on the canonical text on this subject composed by Amari & Nagaoka [1] as well as an alternative text written by Murray & Rice [2]. A source of reference for applications of information geometry is found in the work of Arwini & Dodson [3] and, to a lesser extent, various international research papers. Wherever possible, references are given for the reader should they care to delve further into a particular topic.
In the first and second chapters of this thesis, the basics of probability theory and differential geometry are introduced. At the risk of erring on the side of brevity, only the concepts prerequisite to information geometry are presented. The aim here is both to refresh readers familiar with these two topics and to provide readers unacquainted with either topic with enough of the theory that the main focus of this thesis can be presented. Essentially, this amounts to the definitions of probability distribution functions and families thereof, independence of random variables and expectations from probability theory, and, from differential geometry, the concepts of manifolds, tangent spaces, differentials and vector fields.

Chapter 3 is where the main theory of information geometry is established. Two distinct definitions of statistical manifolds, by Amari & Nagaoka and Murray & Rice respectively, are presented and briefly compared. Following this, sub-manifolds of statistical manifolds are briefly discussed in terms of statistical inference, and a convenient representation of tangent spaces to statistical manifolds based on affine families of functions is developed.

An example of the framework established by information geometry is presented in the fourth chapter, where the important class of exponential probability families is investigated. It is shown how such families, when considered as statistical manifolds, admit a universal form of co-ordinate system, and several geometry-based tests for exponentiality are discussed with examples.

[1] "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
[2] "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
[3] "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.
In the fifth chapter of this thesis more differential geometry concepts are introduced and applied to statistical manifolds via information geometry. Connections, a generalisation of directional derivatives, are investigated, and a special family of connections on statistical manifolds called the α-connections is presented and shown to unify some well-known concepts from differential geometry and probability theory. Lastly, geodesics of manifolds are defined in terms of connections, and their application to statistical manifolds is examined with worked-through examples.

Following on in a similar vein in chapter 6, several distance functions are developed from Riemannian metrics and divergences. Once again, a family of divergences, the α-divergences, is shown to generalise concepts from differential geometry and probability theory. To conclude the chapter, an example based upon the Normal family of probability distributions is worked through to consolidate the terminology established so far.

Chapter 7 develops the last core elements of information geometry introduced in this thesis by giving two contrasting definitions of duality on statistical manifolds, by Amari & Nagaoka and Murray & Rice respectively. Duality is shown to provide further framework for statistical manifolds, such as the so-called parallel translation of tangent vectors. The potential applications of this framework are discussed via an important theorem of Amari & Nagaoka which gives conditions for the minimisation of divergences from fixed points in a statistical manifold to sub-manifolds.

Having now established a large portion of the theory of information geometry, the final chapter is dedicated to the exhibition of several novel applications that have been developed in the literature of the international mathematical community.
It is shown how, under certain conditions, statistical manifolds can be represented as surfaces in real space, and several concepts related to statistical estimation which make use of the distance functions developed earlier are discussed with examples. In conclusion, the merits of information geometry are touched upon, including the potential advantages and difficulties encountered in applications.

CHAPTER 1 - ELEMENTS OF PROBABILITY

Formally, the mathematical framework of probability arises from probability spaces – triples (Ω, ℱ, ℙ), where Ω is an arbitrary though non-empty set (the sample space), ℱ is a sigma algebra on Ω and ℙ: ℱ → [0,1] is a probability measure on ℱ (a measure such that ℙ(Ω) = 1). From this, a random variable taking values in a measurable space (E, ℰ), where E is a non-empty set and ℰ is a sigma algebra on E, is a measurable function Z: Ω → E, i.e. for any A∊ℰ its pre-image under Z satisfies Z⁻¹(A)∊ℱ. The probability of an event A "occurring" is then defined as ℙ(Z⁻¹(A)).

In fact, despite what this somewhat lengthy and abstract definition might suggest, for the most part one simply considers functions of the form p: ℝⁿ → ℝ or p: ℤ → ℝ to define the probability of a random variable taking values in a given set, by integration and summation respectively. These functions are interpreted as expressing the likelihood of observing a random variable in a given set, say A – the greater the values that such a function takes on A, the more likely it is that the random variable will take values in A. Whilst the more detailed definition of probability spaces and random variables allows for a very rich area of study, the simpler viewpoint is taken here because it permits a more function-centric approach to probabilities – less focus on the underlying spaces and more on the functions themselves – and reduces the concepts of measure theory to standard calculus.
This advantage will become more apparent once the ideas of information geometry are introduced. In order not to detract from interest in the more detailed and elegant theory of probability spaces, the curious reader is directed to any of the multitude of introductory texts on the subject, for example – Casella & Berger (2001), "Statistical Inference"; Hogg & Tanis (2006), "Probability and Statistical Inference"; Doob (1954, 1990), "Stochastic Processes"; Grimmett & Stirzaker (1992), "Probability and Random Processes".

1.1 – PROBABILITY DISTRIBUTION FAMILIES

The most basic elements that will be under consideration are probability density functions of random variables. For the sake of simplicity [4], a given random variable's probability density function must satisfy the following properties, depending upon whether the random variable is discrete or continuous:

Discrete probability density function: a function p: ℤ → ℝ satisfying
a) 0 ≤ p(n) ≤ 1 for all n∊ℤ; and
b) ∑_{n∊ℤ} p(n) = 1.

Continuous probability density function: a function p: ℝⁿ → ℝ satisfying
a) p(z) ≥ 0 for all z∊ℝⁿ;
b) for any interval (a,b)⊂ℝ, p⁻¹((a,b)) is open in ℝⁿ; [5] and
c) ∫_{ℝⁿ} p(z) dz = 1.

The probability of a random variable, say Z, with such a distribution function p taking values in a set A is then given by the respective sum or integral, depending upon Z being discrete or continuous:

ℙ(Z ∈ A) = ∑_{n∊A} p(n)   (discrete)
ℙ(Z ∈ A) = ∫_A p(z) dz   (continuous)

When considering sets of the form A = {z∊ℝⁿ | z₁ < a₁, …, zₙ < aₙ} (and analogously for discrete distributions) these functions are called cumulative distributions – F(a₁, …, aₙ) = ℙ(Z∊A). Also note that when A is a discrete set and p is a continuous density, ℙ(Z∊A) = 0.

[4] For an introduction to the theory of information geometry, there is little harm in adopting this less generalised notation and in practice it is the format that is most often used. The reader with background in probability theory will recognise the simplifications made and the reader with no background will only be assisted by using this formulation for now.
[5] This additional requirement essentially ensures the functions are sufficiently well-behaved and, in particular, measurable.
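As a brief numerical aside (the two densities below are arbitrary examples chosen for this sketch, not families used elsewhere in this thesis), the defining properties above can be checked directly:

```python
import math

# Illustrative check of the defining properties of probability densities.
# Both densities below are arbitrary examples chosen for this sketch.

# Discrete: p(n) = 2^(-n) for n >= 1, zero otherwise.
p_discrete = lambda n: 2.0**(-n) if n >= 1 else 0.0
total = sum(p_discrete(n) for n in range(1, 200))  # geometric series, sums to 1
assert all(0 <= p_discrete(n) <= 1 for n in range(-5, 50))
assert abs(total - 1.0) < 1e-12

# Continuous: p(z) = 3 z^2 on (0, 1), zero otherwise.
p_cont = lambda z: 3 * z * z if 0 < z < 1 else 0.0
dz = 1e-5
integral = sum(p_cont((i + 0.5) * dz) for i in range(100000)) * dz  # midpoint rule
assert abs(integral - 1.0) < 1e-6
print("density properties hold for both examples")
```

The midpoint rule is crude but more than adequate here; the point is only that properties (a) and (b)/(c) are concrete, checkable statements about ordinary functions.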
Such functions are parameterised into families as p(z;θ), where the vector θ = (θ¹, …, θᵏ), which lies in a pre-defined set Θ⊂ℝᵏ called the parameter space and is dependent on the family, uniquely specifies a particular distribution from a family of probability distributions. In this sense, both discrete and continuous distributions take the same form if the domain of p is simply labelled the sample space and, if considered in terms of measure theory, can be entirely unified. However, this is beyond the requirements of this thesis and so will remain otherwise untouched. Often, the z, the θ or both will be omitted from notation depending upon the desired focus at the time. For example, elements of a given family may be identified simply as p(θ) when the sample space is of no particular relevance to the topic at hand and the significance rests with the parameters.

Some commonly encountered families include the following [6]:

Binomial distributions, Bi(n,a): for given a∊(0,1) and n∊ℤ,
p(k; n, a) = (n choose k) a^k (1−a)^(n−k) for k ∊ {0, …, n}, and 0 otherwise

Poisson distributions, Pn(λ): for a given λ > 0,
p(k; λ) = e^(−λ) λ^k / k! for k ∊ {0, 1, 2, …}, and 0 otherwise

Uniform distributions, U(a,b): for given a,b∊ℝ with a < b,
p(z; a, b) = 1/(b−a) for z ∊ (a,b), and 0 otherwise

Gamma distributions, Γ(k,θ): for given k > 0 and θ > 0,
p(z; k, θ) = z^(k−1) e^(−z/θ) / (Γ(k) θ^k) for z > 0, and 0 otherwise

Normal distributions, N(μ,σ²): for given μ∊ℝ and σ > 0,
p(z; μ, σ) = (2πσ²)^(−1/2) exp( −(z−μ)² / (2σ²) )

An important convention regarding notation of random variables is that, in general, the random variable itself is written in capital letters whereas elements of the sample space are denoted by lower-case letters.
For example, one may denote a Normal random variable as, say, Z with probability distribution function p. Writing p(z) denotes a particular real value given by p at the point z. However, writing f(Z) (for any function f, not necessarily a probability density function) denotes a random variable – a Normal random variable transformed by the function f.

One final concept regarding random variables that will be called upon later is that of independence. Suppose that U, V and W = (U,V) are random variables with density functions p_U, p_V and p_W respectively. Then the random variables U and V are independent if p_W(u,v) = p_U(u)p_V(v). As the name and this requirement suggest, independent random variables have no relation to each other – knowing anything about one random variable gives no information about the other.

[6] "Probability and Statistical Inference" – R Hogg and E Tanis, 2006.

1.2 – EXPECTATIONS

One further cornerstone of probability theory that will be called upon frequently is that of expectations. Known also as means, averages, expected values and more, this concept is interpreted as expressing a base-point for the "most likely" values that a random variable might take – the values closest to the mean are most likely to occur. For a continuous random variable Z with distribution function p: ℝⁿ → ℝ and a discrete random variable W with distribution function q: ℤ → ℝ, the expectations are defined respectively as the multidimensional integral and the sum

𝔼(Z) = ∫_{ℝⁿ} z·p(z) dz   and   𝔼(W) = ∑_{n∊ℤ} n·q(n)

The interpretation of these values, however, is far from perfect or consistent. For example, the expected value of a random variable need not even be achievable by the random variable. Take the Bi(1,0.5) distribution as defined above – the expected value in this case is 0.5, whereas such random variables only assume the values 0 or 1.
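This first example is easy to simulate (an illustrative sketch; the seed and sample size are arbitrary choices): no single draw from Bi(1,0.5) equals 0.5, yet the sample mean settles there.

```python
import random

# Monte Carlo sketch of the Bi(1, 0.5) example: every draw is 0 or 1, yet the
# sample mean approaches the expected value 0.5. Seed and sample size are
# arbitrary choices made for reproducibility.
random.seed(0)
draws = [1 if random.random() < 0.5 else 0 for _ in range(100000)]
assert set(draws) <= {0, 1}          # no draw ever equals 0.5
mean = sum(draws) / len(draws)
print(abs(mean - 0.5) < 0.01)        # True: the mean is still close to 0.5
```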
Consider also a continuous random variable with distribution function given by

p(z) = ( π √(z − z²) )^(−1)

on the interval (0,1) (shorthand for "p is zero elsewhere"), the so-called Arcsine distribution. As seen in the graph below, most of the probability mass is distributed around the endpoints. However, one can verify that the expected value of such a random variable is 0.5, which is actually the mid-point of the region where, if one were to shift a small interval through (0,1), the random variable is least likely to occur.

[Figure: the Arcsine probability distribution function]

As a final example of the somewhat unpredictable behaviour of these values, consider now a continuous random variable with distribution function given by

p(z) = ( π + πz² )^(−1)

on the real line, the standard Cauchy distribution. Taking the integral of z·p(z) for the expectation only from zero to infinity yields an infinite value, and similarly but with opposite sign when taken from negative infinity to zero. Whilst it is tempting to simply conclude that these positively and negatively infinite values "cancel out" to give an expected value of zero, for technical reasons from measure theory it is said that in this case the expectation does not exist.

Although it follows from the definition of expected values, it is convenient when calculating expectations to note that for constants a,b∊ℝ and any random variable Z which has a finite expected value, 𝔼(aZ+b) = a·𝔼(Z)+b. If Z and W are independent random variables then by definition p_(W,Z)(w,z) = p_W(w)p_Z(z). Therefore the expected value of the random variable WZ is

𝔼(WZ) = ∫_{ℝ^(m+n)} w·z p_(W,Z)(w,z) d(w×z) = ∫_{ℝᵐ} ∫_{ℝⁿ} w·z p_W(w) p_Z(z) dw dz = 𝔼(W)𝔼(Z)

and similarly in the case of discrete random variables via sums.
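The identity 𝔼(WZ) = 𝔼(W)𝔼(Z) for independent W and Z can be observed numerically (a Monte Carlo sketch, not a proof; the particular distributions, seed and sample size are arbitrary choices):

```python
import random

# For independent W ~ N(2, 1) and Z ~ U(0, 1), the sample mean of W*Z should
# approach E(W) * E(Z) = 2 * 0.5 = 1. All parameters here are arbitrary.
random.seed(1)
N = 400000
w = [random.gauss(2.0, 1.0) for _ in range(N)]
z = [random.uniform(0.0, 1.0) for _ in range(N)]
mean_wz = sum(wi * zi for wi, zi in zip(w, z)) / N
print(abs(mean_wz - 1.0) < 0.05)  # True
```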
This equality can be useful since calculating the expected value 𝔼(WZ) from first principles requires knowledge of the density function of the random variable WZ or (W,Z) – neither of which can be constructed from the density functions of W and Z alone unless these two random variables are independent.

CHAPTER 2 - ELEMENTS OF DIFFERENTIAL GEOMETRY

In order to analyse families of probability distributions from a geometric point of view, it is necessary to review the basics of differential geometry. Though closely related to other fields of study concerning topological spaces and geometries, differential geometry distinguishes itself in investigating the local qualities of specific topological spaces – whilst a topologist views a sphere and any smooth deformation of it as equivalent, a geometer sees a great deal of difference between them.

2.1 – MANIFOLDS

The particular class of spaces that are of interest in differential geometry are called manifolds. Essentially, an n-dimensional manifold is characterised by having some form of co-ordinate correspondence to a subset of ℝⁿ around each point in the space. Formally, for a space X to be a manifold, it is required that there exists an atlas of charts {(U_a, θ_a)} (ranging over some index set, say 𝒜) – each U_a is an open set of X and θ_a is a function θ_a: U_a → ℝⁿ – which satisfies the following two conditions:

1. ⋃_{a∊𝒜} U_a = X; and
2. For each a∊𝒜, θ_a: U_a → ℝⁿ is a homeomorphism onto a subset of ℝⁿ.

Since the {U_a}_{a∊𝒜} form an open cover of X, any particular U_a will have some intersection with its neighbours, if any exist (the obvious exceptions being disconnected spaces and atlases that consist of only one chart). If U_b is such an intersecting set, then this gives rise to the idea of transition functions between charts

θ_b ∘ θ_a⁻¹ : θ_a(U_a ∩ U_b) → θ_b(U_a ∩ U_b)

which are maps between subsets of ℝⁿ. If all such transition functions are differentiable as real maps, then X is said to be a differentiable manifold.
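To make the transition-function idea concrete, here is a small sketch (the circle, the two angle charts and the test points are our own illustrative choices, not taken from the thesis): two angle charts on S¹, one measuring the angle of a point p and one measuring the angle of its antipode −p, overlap on the upper semicircle, where the transition function is a smooth shift by π.

```python
import math

# Two angle charts on the unit circle S^1. theta_a is the usual angle in
# (-pi, pi]; theta_b is the angle of the antipodal point -p. On the arc
# 0 < t < pi their transition function theta_b o theta_a^{-1} is the smooth
# affine map t -> t - pi.

def theta_a(p):
    return math.atan2(p[1], p[0])

def theta_a_inv(t):
    return (math.cos(t), math.sin(t))

def theta_b(p):
    return math.atan2(-p[1], -p[0])   # angle of the antipode of p

def transition(t):                    # theta_b composed with theta_a^{-1}
    return theta_b(theta_a_inv(t))

for t in (0.5, 1.0, 2.0):
    assert abs(transition(t) - (t - math.pi)) < 1e-12
print("transition is a smooth shift by -pi on this overlap")
```

The point of the sketch is that the transition function never mentions the circle itself: it is an ordinary real map between subsets of ℝ, and its smoothness is what the definition of a differentiable manifold asks for.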
This local correspondence to ℝⁿ – in some sense the simplest type of manifold, with the identity map acting as the sole chart – about each point in X endows the manifold with a natural local co-ordinate system: for each x∊U_a its local co-ordinate is denoted by θ_a(x) = (θ¹(x), …, θⁿ(x)). The distinction to be made here is that the points of X themselves are nothing more than individual objects inherent to the manifold, whereas the local co-ordinates give some arbitrary method of distinguishing between any two such points and are not unique to the manifold. Indeed, different atlases may give rise to entirely different co-ordinates for a given point of X. This subtlety will be made clearer via a familiar example presently.

One of the more tangible and yet non-trivial manifolds that can be considered as a reference point for the theory of differential geometry is the unit sphere, S². One approach is to simply consider it as a subset of ℝ³, thus inheriting the triple of co-ordinates (x,y,z). The problem with this approach is that this "chart", now just the identity map, is not a homeomorphism onto a subset of ℝ². Indeed, in terms of open sets in ℝ³, every neighbourhood of any point in S² is very much a closed set. Hence, a different approach is required. Consider instead separating S² into the upper and lower hemispheres and projecting each hemisphere onto the unit disc. Whilst these are all closed sets, with a bit of imagination it is not difficult to see how to extend each hemisphere into the other to create an open cover of S², say U₁ and U₂, and similarly extend the image of each U_a to an open disc containing the closed unit disc. This atlas is exactly what is required.

What this example should illustrate is that when considering manifolds in an abstract sense, the points of the manifold and their co-ordinates are entirely different things. In the case of the sphere, for example, the most theoretically consistent way of identifying points is to draw a picture and say "this point here".
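The hemispheric atlas just described can be sketched in a few lines (an illustrative sketch only; the chart names and test point are our own choices): the upper chart projects onto the disc by forgetting z, and its inverse lifts a disc point back onto the sphere.

```python
import math

# Sketch of the two-hemisphere atlas for S^2: each chart projects its
# (extended) hemisphere onto a disc in R^2 by dropping the z co-ordinate.
# Names and the test point are illustrative choices.

def chart_upper(p):
    x, y, z = p
    return (x, y)                # project onto the disc

def chart_upper_inverse(c):
    x, y = c                     # lift a disc point back onto the sphere
    return (x, y, math.sqrt(max(0.0, 1.0 - x * x - y * y)))

p = (0.6, 0.0, 0.8)              # on S^2, since 0.6^2 + 0.8^2 = 1
p_roundtrip = chart_upper_inverse(chart_upper(p))
print(all(abs(a - b) < 1e-12 for a, b in zip(p, p_roundtrip)))  # True
```

Note that the pair (0.6, 0.0) is the local co-ordinate of the point, while the point itself is the triple on the sphere; the lower hemisphere would need its own chart with its own inverse.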
Writing such a point as, say, (x,y,z) has the potential to confuse the correspondence with the local co-ordinate system lying inside ℝ². However, this representation does give a simple way of writing points locally in terms of co-ordinates (since spherical co-ordinates are not a bijective mapping from ℝ³ to ℝ²) – e.g.

(θ¹(x, y, z), θ²(x, y, z)) = ( arccos( z / √(x² + y² + z²) ), arctan(y/x) )

As an aside, one might wonder if in fact a simpler atlas, one consisting of perhaps only one chart, can be constructed for the sphere. The answer to this is no and can be rigorously proven; for example, this fact follows as a corollary to the Borsuk–Ulam theorem [7].

Continuing with this example, one should observe that intuitively the sphere is somehow a sub-manifold of ℝ³. Whilst there are many ways of defining exactly how it is a sub-manifold, a particularly concise method is based upon the Preimage Theorem [8], which states that, under certain conditions, given two manifolds X and Y, if f: X → Y is a smooth map then f⁻¹(y) is a sub-manifold of X for particular y∊Y. Defining f: ℝ³ → ℝ to be the squared Euclidean distance

f(x, y, z) = x² + y² + z²

gives the required preimage S² = f⁻¹(1).

For real space there is an obvious correspondence to lower dimensional real spaces by restricting some co-ordinates to be constant, i.e. ℝᵐ = {(x₁, …, x_m, 0, …, 0) ∊ ℝⁿ} for 0 < m < n. One then says that Y is an m-dimensional sub-manifold of an n-dimensional manifold X (m < n) if there exist co-ordinate charts {(U_a, θ_a)} in X about every point y∊Y such that the points of Y around y are given by θ⁻¹(x₁, …, x_m, 0, …, 0); see Murray & Rice [9]. This assumes, of course, that there is a natural inclusion map from Y to X in the first place, so that one may consider points of Y as points of X in a well-defined manner.

[7] Corollary 2B.7, "Algebraic Topology" – A Hatcher, 2002.
[8] §1.4, "Differential Topology" – V Guillemin and A Pollack, 1974.
It is seen then that, locally, sub-manifolds look like sub-spaces of real space.

2.2 – TANGENT SPACES

Continuing with the example of the sphere as a sub-manifold of ℝ³, notice that at every point s on this surface there exists a unique plane such that every line on this plane which intersects the sphere at s is tangential to the sphere, namely the tangent plane. It is desirable to extend this concept to manifolds at large in a consistent manner. Recall that the definition of a manifold does not require anything further than a topological space – there is no reason to assume that an arbitrary manifold can be as readily pictured in real space as the sphere. The most obvious counter-example, one which will be the focus of this thesis, is that of manifolds of functions. Without any means of visualising such a manifold, the idea of constructing a tangent plane at any point is difficult to approach without a solid definition of what is meant by this concept.

Suppose now that X is a manifold with atlas {(U_a, θ_a)} and let x be an arbitrary point in X sitting in the chart (U₀, θ₀). Let γ: (−ε, ε) → X be a path in X such that γ(0) = x. Then, by considering the corresponding path in real space (where differentiation is well-defined) given by (θ₀ ∘ γ)(t) – perhaps restricting the domain of γ so that the path lies entirely in U₀ – it is possible to consider whether (θ₀ ∘ γ)(t) is a differentiable map about t = 0. In the affirmative case, the vector (θ₀ ∘ γ)′(0) is said to be a representative of the equivalence class of a tangent vector at x, denoted by γ′. This definition arises from the fact that there may be infinitely many paths through x which, when composed with the homeomorphism θ₀, have the same derivative at t = 0. In general, one simply calls γ′ a tangent vector for brevity of notation.

[9] §3.1.3, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
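The definition can be sketched numerically: take a chart, build a path through a point whose chart image moves in a chosen direction v, and differentiate the composed path at t = 0. Here polar co-ordinates on the right half-plane stand in for the chart θ₀ (an arbitrary illustrative choice).

```python
import math

# A tangent vector as the derivative of a chart-composed path. The chart is
# polar co-ordinates (r, phi) on the right half-plane; the base point x0 and
# direction v are arbitrary choices for this sketch.
h = 1e-6

def theta(p):                       # chart: Cartesian -> polar
    x, y = p
    return (math.hypot(x, y), math.atan2(y, x))

def theta_inv(c):                   # inverse chart: polar -> Cartesian
    r, phi = c
    return (r * math.cos(phi), r * math.sin(phi))

x0 = (1.0, 0.5)
v = (0.7, -0.3)

def gamma(t):                       # a path with gamma(0) = x0
    c = theta(x0)
    return theta_inv((c[0] + t * v[0], c[1] + t * v[1]))

# Central finite difference of (theta o gamma) at t = 0 recovers v.
recovered = tuple((theta(gamma(h))[k] - theta(gamma(-h))[k]) / (2 * h)
                  for k in range(2))
print(all(abs(a - b) < 1e-6 for a, b in zip(recovered, v)))  # True
```

The path itself winds through the half-plane, but its tangent vector is read off entirely in the chart, which is the content of the definition above.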
With this definition in mind, the tangent space at the point x is simply the set of all such tangent vectors and is denoted by T_xX. Despite the slightly obtuse construction of T_xX, this space turns out to have a very simple structure, as will now be demonstrated. In particular, T_xX is bijective with the vector space ℝⁿ via the map taking γ′ to (θ₀ ∘ γ)′(0), and so is itself a vector space. Injectivity follows simply from the definition of γ′, since if any two intersecting paths γ₁ and γ₂ satisfy the equality (θ₀ ∘ γ₁)′(0) = (θ₀ ∘ γ₂)′(0) then γ₁′ = γ₂′ represent the same equivalence class. For surjectivity, note that since θ₀ is a homeomorphism, for any vector v∊ℝⁿ

γ(t) ≔ θ₀⁻¹(θ₀(x) + t·v)

defines a path through x, and that

(θ₀ ∘ γ)′(0) = ( (θ₀ ∘ θ₀⁻¹)(θ₀(x) + t·v) )′(0) = ( θ₀(x) + t·v )′(0) = v

Hence any vector in ℝⁿ can be realised by an equivalence class of T_xX and so there is indeed a bijection between the two spaces.

One reasonable question to ask is: if T_xX is a vector space, then what can be used as a basis? The simplest way of defining such a basis is by lifting any basis for ℝⁿ. Specifically, given a basis {v₁, …, v_n} for ℝⁿ, the corresponding basis for T_xX is given by {γ₁′, …, γ_n′} where γ_k(t) ≔ θ₀⁻¹(θ₀(x) + t·v_k) for k = 1, …, n.

In general, two distinct tangent spaces of a given manifold need not be related as vector spaces, since the geometry of the manifold about the respective base points may be entirely different. Locally, however, tangent spaces do correspond to the same real space by virtue of the fact that a chart is homeomorphic on its (open) domain in X. That being said, one often considers the tangent bundle of a manifold given by the union of all tangent spaces – TX ≔ ⋃_{x∊X} T_xX.

2.3 – RATES OF CHANGE, PARTIAL DERIVATIVES AND DIFFERENTIALS

In defining the tangent space of a manifold at a given point, the map θ: U → ℝⁿ was used to define the rate of change of a given path in the manifold.
In a similar manner, it is possible to define the rate of change of a real function, say f: X → ℝ, over a path γ: [0,1] → X at the point γ(c) = x∊X by

(f ∘ γ)′(c) ≔ d/dt f(γ(t)) |_{t=c}

Consider now a given co-ordinate chart θ in terms of its co-ordinates in ℝⁿ by writing θ(x) = (θ¹(x), …, θⁿ(x)). By varying at unit rate the k-th co-ordinate in ℝⁿ only, it is possible to create a path through a given x∊X defined by the lift

γ_k(t) ≔ θ⁻¹(θ(x) + t·e_k) = θ⁻¹(θ¹(x), …, θᵏ(x) + t, …, θⁿ(x))

By taking the rate of change of a function over this path, the partial derivatives of the function are defined to be

∂f/∂θᵏ (x) ≔ (f ∘ γ_k)′(0)

A very useful result that uses this framework is the chain rule. The result, stated here without its necessarily long and technical proof, which can be found in Murray & Rice [10], is that for a real function f and a path γ through x∊X, one may write

(f ∘ γ)′(0) = ∂f/∂θ¹(x) · (θ¹ ∘ γ)′(0) + ⋯ + ∂f/∂θⁿ(x) · (θⁿ ∘ γ)′(0)

When instead considering more generally a map between manifolds, say F: X → Y, it is possible to push forward a path γ: [0,1] → X through γ(s) = x∊X to Y via (F ∘ γ)(t): [0,1] → Y. Then for a given co-ordinate chart φ about F(x)∊Y, the differential of F is

dF: T_xX → T_{F(x)}Y,  γ′ ↦ (φ ∘ F ∘ γ)′(0)

This is a linear map between tangent spaces since, for a scalar λ and tangent vectors γ₁′, γ₂′∊T_xX,

dF(λγ₁′) = d/dt φ(F(γ₁(λt))) |_{t=0} = λ d/dt φ(F(γ₁(t))) |_{t=0} = λ dF(γ₁′)

and, from the bijectivity of T_{F(x)}Y and ℝᵐ, the path F(γ₁ + γ₂)(t) satisfies

(φ ∘ F(γ₁ + γ₂))′(0) = (φ ∘ F(γ₁))′(0) + (φ ∘ F(γ₂))′(0)

which implies that

dF(γ₁′ + γ₂′) = (φ ∘ F(γ₁ + γ₂))′(0) = (φ ∘ F(γ₁))′(0) + (φ ∘ F(γ₂))′(0) = dF(γ₁′) + dF(γ₂′)

Because the co-ordinate functions θ¹, …, θⁿ are just maps from X to ℝ, the chain rule can be reformulated as

[10] §2.2.8, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
df(γ′) = ∂f/∂θ¹(x) · dθ¹(γ′) + ⋯ + ∂f/∂θⁿ(x) · dθⁿ(γ′)

which emphasises more the directional aspect of rates of change as acting on tangent vectors.

2.4 – VECTOR FIELDS

A vector field on a manifold X is defined to be a map V: X → TX which assigns to each point x∊X a unique tangent vector in T_xX. Since each tangent space is equivalent to real space of dimension equal to that of the manifold X, it is generally required that a vector field be smooth when considered as a map V: X → ℝⁿ. This gives rise to the intuition of what vector fields are by imagining, in real space, an arrow drawn at each point.

In defining the rate of change of a function, the paths γ₁, …, γ_n represented the tangent vectors of unit length in the 1st through to n-th co-ordinates, respectively, at a given point in the manifold. By doing this at every point in the manifold, a vector field can be constructed, following the notation of Murray & Rice [11], which is denoted by

∂/∂θ¹(x), …, ∂/∂θⁿ(x)

That is, for a given x∊X each of these n vector fields represents the tangent vector γ_k′ represented by (θ ∘ γ_k)′(0) = (0, …, 1, …, 0) = e_k. This notation arises from the fact that these paths were used to define the partial derivatives of a function – the rates of change of a function over each of the n paths. Note also that for any x∊X,

dθᵏ(∂/∂θᵏ(x)) = 1  and  dθᵏ(∂/∂θʲ(x)) = 0 when j ≠ k.

[11] §2.2.5, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.

CHAPTER 3 – INFORMATION GEOMETRY

Having established the basic concepts from probability and differential geometry that are pre-requisite to information geometry, it is now possible to introduce the idea of a statistical manifold in a formal way. However, there are at least two means to this end, given firstly by Amari & Nagaoka [12] and subsequently by Murray & Rice [13]. The perhaps simpler approach given by Amari & Nagaoka will be introduced before the slightly more abstract definition given by Murray & Rice.
3.1 – STATISTICAL MANIFOLD (AMARI & NAGAOKA VERSION)

Let P = { p(z;θ) | θ∊Θ } be a family of probability distributions as defined in chapter 1, with sample space Ω – a subset of either ℤ or ℝⁿ. Assume that:

1. Θ is an open subset of ℝᵏ;
2. The support of each p(z;θ)∊P, i.e. supp(p) = { z∊Ω | p(z;θ) > 0 }, does not vary with θ;
3. For each z∊supp(p) the map from Θ to ℝ given by θ ↦ p(z;θ) is infinitely differentiable; and
4. The order of integration/summation and differentiation may be interchanged for integrals/sums over the sample space involving any p(z;θ)∊P.

Then P defines a statistical manifold with co-ordinates given by θ = (θ¹, …, θᵏ). Essentially, these conditions ensure that the Fisher information, an important function to be defined and used in the proceeding chapters, exists [14] for all probability distributions p∊P.

For an example of a statistical manifold under this definition, consider the family of Normal distributions with sample space Ω = ℝ:

𝒩 = { p(z; μ, σ) = (2πσ²)^(−1/2) exp( −(z−μ)²/(2σ²) ) | μ∊ℝ, σ > 0 }

Setting θ = (μ, σ), it is seen that the parameter space Θ is the open upper half-plane in ℝ², and hence the first condition is satisfied. Also, since any p(z)∊𝒩 is strictly positive on ℝ, the second condition is easily satisfied. For the third condition, note that for any fixed z∊ℝ each p(z;θ)∊𝒩 is a composition of smooth functions on Θ and so is itself smooth, i.e. infinitely differentiable. In fact, each p(z;θ)∊𝒩 is a composition of smooth functions on ℝ×Θ, and so the fourth condition follows from Leibniz's rule for differentiation under the integral sign. Therefore this rather important family is indeed a statistical manifold.

Conversely, this definition of a statistical manifold does have some limitations.

[12] §2.1, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
[13] §3.2.1, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
[14] §2.3.1, "Theory of Statistics" – M Schervish, 1995.
In particular, the second condition regarding the support of each probability density function prevents many probability families from otherwise being statistical manifolds. Take, for example, the family of Uniform random variables on the real line with density functions

p(z; a, b) = 1/(b−a) for z∊(a,b), and 0 otherwise.

Since the support of p is dictated by its parameters, this family cannot become a statistical manifold under this definition. Being a somewhat common probability family, this is a noteworthy limitation.

3.2 – STATISTICAL MANIFOLD (MURRAY & RICE VERSION)

The approach of Murray & Rice is somewhat more indirect and requires one further piece of terminology to be explained before it can be defined – the log-likelihood map [15], which appears frequently in the theory of statistics. If P is a probability family with sample space Ω, then the log-likelihood map is the map ℓ: P → ℝ^Ω from the family P to the space of measurable functions on Ω, ℝ^Ω, taking p ↦ log(p(z)). Often this notation is shortened to just ℓ = ℓ(p).

A statistical manifold is then defined as a subset of the space of all probability measures on a sample space Ω, which is also a manifold, satisfying the following conditions:

1. The log-likelihood function ℓ(p(z;θ)) is smooth with respect to θ for each z∊Ω; and
2. For each distribution function p(z;θ), the functions (known as the scores)

∂ℓ/∂θ¹(p(z;θ)), …, ∂ℓ/∂θⁿ(p(z;θ))

are linearly independent – i.e. each score cannot be written as a linear combination of the other scores.

[15] §7.2.2, "Statistical Inference" – G Casella and R Berger, 2001.
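Condition 2 can be probed numerically for a concrete family (an illustrative sketch; the Normal family, base point and grid of z values are arbitrary choices). Approximating the scores by finite differences of the log-likelihood and forming their Gram matrix over the grid, a non-zero determinant indicates that neither score is a multiple of the other:

```python
import math

# Numerical sketch of the linear-independence condition for N(mu, sigma^2):
# approximate the scores dl/dmu and dl/dsigma by finite differences of
# log p(z; mu, sigma), then check the 2x2 Gram matrix is non-singular.

def logp(z, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)

mu, sigma, h = 0.0, 1.0, 1e-6
zs = [-3 + 0.1 * i for i in range(61)]       # grid of sample-space points
s_mu = [(logp(z, mu + h, sigma) - logp(z, mu - h, sigma)) / (2 * h) for z in zs]
s_sg = [(logp(z, mu, sigma + h) - logp(z, mu, sigma - h)) / (2 * h) for z in zs]

g11 = sum(a * a for a in s_mu)
g12 = sum(a * b for a, b in zip(s_mu, s_sg))
g22 = sum(b * b for b in s_sg)
det = g11 * g22 - g12 * g12                  # zero iff the scores are proportional
print(det > 1e-6)  # True: the two scores are linearly independent
```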
Note that the chain rule at the end of section 2.3 of this thesis, together with the observation regarding the action of differentials on the vector fields ∂/∂θ¹(p), …, ∂/∂θⁿ(p) at the end of section 2.4, gives the following useful identity for the scores:

\[ d\ell\!\left(\frac{\partial}{\partial\theta^k}\right)(p) = \frac{\partial \ell}{\partial \theta^k}(p) \qquad (k = 1, \ldots, n) \]

Reverting briefly to the theory of differential geometry, a differentiable map f between differentiable manifolds X and Y is said to be an immersion if its differential df: TₓX → T_f(x)Y is injective at each point x in the domain of the map[16]. Immersions are of interest because, whilst being embeddings locally, they need not be injective globally – the canonical example of which is the map sending a circle to a figure-8, which fails to be a manifold due to the crossover point in the middle of the figure-8. It is seen then that this definition of a statistical manifold amounts to requiring that the log-likelihood map is an immersion into the space of all probability functions on the sample space. The only impediment with this interpretation is that an immersion requires the target space to be a manifold also. The space of all probability functions on a given sample space, however, must in general be infinite dimensional – being at least as large as the space of all continuous functions on the sample space that integrate to 1 – and so arises the difficulty with this point of view. There are many definitions of infinite dimensional manifolds, but for the sake of simplicity Murray & Rice opt to ignore these details.

Note that both these definitions of statistical manifolds induce manifolds via the co-ordinate systems θ = (θ¹, …, θᵏ). As before, to verify that these conditions give a reasonable definition of a statistical manifold, consider once more the family of Normal distributions on ℝ, denoted by 𝒩. By its definition, 𝒩 is a subset of the set of all probability distributions on ℝ, and ℝ itself is certainly a manifold.
For any p(z;θ)∊𝒩, its log-likelihood is given by

\[ \ell(p) = -\tfrac{1}{2}\log(2\pi\varsigma^2) - \frac{z^2}{2\varsigma^2} + \frac{z\mu}{\varsigma^2} - \frac{\mu^2}{2\varsigma^2} \]

which, for each fixed z, is a composition of smooth functions of (μ, ς), so 𝒩 passes the first requirement. For any probability density function p(z;θ)∊𝒩, the scores are given by the equations

\[ \frac{\partial \ell}{\partial \mu}(p) = \frac{z-\mu}{\varsigma^2} \qquad\quad \frac{\partial \ell}{\partial \varsigma}(p) = -\frac{1}{\varsigma} + \frac{(z-\mu)^2}{\varsigma^3} \]

which are linearly independent since the former is a polynomial of degree one in z and the latter is a quadratic polynomial in z. Therefore 𝒩 is once again deemed to be a statistical manifold.

3.3 – SUB-MANIFOLDS

A useful observation given by Arwini & Dodson[17] is that certain probability families contain other probability families via some restriction on the larger family’s parameters. For example, consider the Log-Gamma distributions (named so because for a random variable G following the Gamma distribution, the random variable e^(−G) follows the Log-Gamma distribution) with density functions

\[ p(z;\nu,\tau) = \frac{\nu^{\tau}\, z^{\nu-1}\, (-\log z)^{\tau-1}}{\Gamma(\tau)} \]

for z∊(0,1) and ν,τ > 0. Setting ν = τ = 1 gives p(z;1,1) = 1, which is the density function of a Uniform random variable on (0,1). Although a one-point set is a rather uninteresting manifold by any standard, this example does at least highlight the fact that it is possible to consider smaller families as statistical sub-manifolds of larger families.

This viewpoint can be useful in statistical estimation and will be expounded later once more of the information geometry framework has been established. However, in a casual sense, suppose that in the course of estimating a Normal random variable there was reason to believe that the parameter ς was fixed at a given value, say ς₀. By the definition of a sub-manifold in section 2.1, this subset of the Normal family

\[ \mathcal{M} = \{\, p(\mu,\varsigma)\in\mathcal{N} \mid \varsigma = \varsigma_0 \,\} \]

is a sub-manifold of the Normal manifold.

[16] §1.3, “Differential Topology” – V Guillemin and A Pollack, 1974.
[17] §3.6, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
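The claim above that setting ν = τ = 1 in the Log-Gamma family recovers the Uniform density can be confirmed numerically, together with a sanity check that these densities integrate to 1; a minimal sketch (midpoint rule, illustrative helper names):

```python
import math

# Log-Gamma density on (0,1); the gamma function comes from math.gamma.
def log_gamma_pdf(z, nu, tau):
    return nu**tau * z**(nu - 1) * (-math.log(z))**(tau - 1) / math.gamma(tau)

# Setting nu = tau = 1 recovers the Uniform density on (0,1).
vals = [log_gamma_pdf(z, 1.0, 1.0) for z in (0.1, 0.5, 0.9)]
print(vals)  # [1.0, 1.0, 1.0]

# Sanity check at another parameter point: the density integrates to 1
# (midpoint rule on a fine grid over (0,1)).
n = 200000
total = sum(log_gamma_pdf((i + 0.5) / n, 2.0, 3.0) for i in range(n)) / n
assert abs(total - 1.0) < 1e-3
```

The same kind of check applies to any sub-family obtained by restricting parameters: restricting (ν, τ) to a single point here leaves a one-point sub-manifold, the Uniform distribution.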
It is not uncommon that experimental estimation gives rise to natural errors, and so it is possible that the estimate of ς will not be equal to ς₀. As will be seen later, the information geometry framework will give a way to project the estimated random variable lying in 𝒩 onto the sub-manifold ℳ in a sensible manner (i.e. with some justification based upon properties of the manifolds themselves), thus satisfying the prior knowledge of this parameter. Whilst it might be convenient to simply ignore the estimated value of ς, this is surely wasteful if there is some information that can be extracted from its value.

3.4 – TANGENT SPACES OF STATISTICAL MANIFOLDS

A novel approach to defining tangent spaces of a given statistical manifold is given by Murray & Rice[18] by imposing an affine structure onto the space of measures on a given measure space. Whilst a thorough exposition of measure theory deviates too far from the scope of this thesis, the interested reader may like to consult Rudin[19] for a complete explanation of the theory. For immediate purposes, consider integration of a real function, ∫_A f dz. Whilst the dz is often thought of as just a signifier as to which variable is being integrated, it is in fact a measure on the real line. Using this “standard” measure (known formally as the Lebesgue measure), the measure of a simple set of the form (a,b) is just the length of the interval, derived by taking the integral ∫_(a,b) dz; by changing the measure one may arrive at different measures of this set. One way of achieving this is to append a non-negative function to the measure and integrate over the interval to get its measure, ∫_(a,b) f dz.

An affine space consists of a set Z and a vector space V together with an operator ‘+’ such that for any two elements z₁ and z₂ of Z there exists a unique vector v∊V such that z₁ + v = z₂.
Such a structure can be imposed on the space of positive measures on a measure space by letting the (infinite dimensional) vector space be the set of measurable functions and defining addition by dz + f = e^f·dz. This addition is consistent with the vector space structure since

\[ (dz + f) + g = e^{f}\,dz + g = e^{g}\,e^{f}\,dz = e^{f+g}\,dz = dz + (f+g) \]

[18] Example 2.2.8, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
[19] §1.18, “Real and Complex Analysis” – W Rudin, 1986.

To retrieve the measure p·dz from the measure dz one simply translates by the function log(p) – the log-likelihood of p – which is unique by the fact that the exponential map is injective from ℝ to (0,∞). The affine space structure affords the potential to write any path through a measure in this space as a translation by a path through the vector space. That is, the path p(t)·dz through the measure space corresponds uniquely to translation by the path log(p(t)) through the vector space. The tangent vector of the latter is just the t-derivative of log(p(t)) evaluated at zero. Hence, via an argument of dimensionality, when p(t) = p(θ(t)) belongs to a probability family one may identify a basis for the tangent space at p(0)·dz with the tangent vectors corresponding to the paths defined by

\[ \log\!\big[\, p\big(\theta^1(0), \ldots, \theta^k(0) + t, \ldots, \theta^n(0)\big) \big] \qquad (k = 1, \ldots, n) \]

i.e. the scores of the distribution function p.

CHAPTER 4 – EXPONENTIAL FAMILIES

A class of probability distributions that has been extensively studied and shown to exhibit many ‘desirable’ qualities in relation to statistical estimation is that of exponential families. Amongst other properties, these distributions have been shown to admit, in some sense, the ‘best’ statistical estimators of all classes of random variables. In information geometry, too, these families exhibit in many ways some of the simplest geometric properties possible in a manifold.
4.1 – CANONICAL CO-ORDINATES OF EXPONENTIAL DISTRIBUTIONS

The definition of an exponential family is dependent on its probability distribution functions – if they take the form

\[ p(z;\theta) = \exp\!\left( C(z) + \sum_{i=1}^{n} \theta^{i} y_{i}(z) - \varphi(\theta) \right) \]

where φ is a real-valued function on the parameter space and C and each of the yᵢ are real-valued functions on the sample space, then the family is exponential. For example, the Normal family is exponential by setting

\[ y_1(z) = z, \quad y_2(z) = z^2, \quad \theta^1 = \frac{\mu}{\varsigma^2}, \quad \theta^2 = \frac{-1}{2\varsigma^2}, \quad \varphi(\theta^1,\theta^2) = \frac{1}{2}\log\!\left(\frac{-\pi}{\theta^2}\right) - \frac{(\theta^1)^2}{4\theta^2}, \quad C(z) = 0 \]

where θ¹∊ℝ and θ² < 0, since substituting these into the exponential form gives

\[ p(z;\theta) = \exp\!\left( \theta^1 z + \theta^2 z^2 - \frac{1}{2}\log\!\left(\frac{-\pi}{\theta^2}\right) + \frac{(\theta^1)^2}{4\theta^2} \right) = \frac{1}{\sqrt{2\pi\varsigma^2}} \exp\!\left( \frac{\mu z}{\varsigma^2} - \frac{z^2}{2\varsigma^2} - \frac{\mu^2}{2\varsigma^2} \right) \]
\[ = \frac{1}{\sqrt{2\pi\varsigma^2}} \exp\!\left( -\frac{z^2 - 2\mu z + \mu^2}{2\varsigma^2} \right) = \frac{1}{\sqrt{2\pi\varsigma^2}} \exp\!\left( -\frac{(z-\mu)^2}{2\varsigma^2} \right) \]

which is a Normal probability density function. Notice that the co-ordinates here, θ¹ and θ², are quite different from the co-ordinates that were introduced previously – namely, μ and ς. However, the two sets of co-ordinates are in bijection, since μ = −θ¹·(2θ²)⁻¹ and ς = (−2θ²)^(−1/2). For any exponential family, the co-ordinates (θ¹, …, θⁿ) are called the canonical co-ordinates.

An interesting question raised by Murray & Rice[20] is whether it is possible to represent the densities of an exponential family using two different sets of canonical co-ordinates. More precisely, whether one can write

\[ p(z;\theta) = \exp\!\left( C(z) + \sum_{i=1}^{n} \theta^{i} y_{i}(z) - \varphi(\theta) \right) = \exp\!\left( \sum_{i=1}^{n} \eta^{i} z_{i} - \psi(\theta) \right) \]

for suitable co-ordinates η¹, …, ηⁿ and a function ψ: Θ → ℝ on the parameter space.
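Before turning to that question, the change of co-ordinates just derived for the Normal family can be verified numerically: the exponential form and the familiar form give the same density, and the stated inversion formulas recover (μ, ς). A minimal sketch:

```python
import math

def normal_pdf(z, mu, s):
    return math.exp(-(z - mu)**2 / (2*s*s)) / math.sqrt(2*math.pi*s*s)

# Exponential-family form with canonical co-ordinates
# th1 = mu/s^2, th2 = -1/(2 s^2), phi = (1/2)log(-pi/th2) - th1^2/(4 th2).
def exp_family_pdf(z, th1, th2):
    phi = 0.5*math.log(-math.pi/th2) - th1**2 / (4*th2)
    return math.exp(th1*z + th2*z*z - phi)

mu, s = 1.5, 0.7
th1, th2 = mu/s**2, -1.0/(2*s**2)

# The two parametrisations give the same density ...
diffs = [abs(normal_pdf(z, mu, s) - exp_family_pdf(z, th1, th2))
         for z in (-2.0, 0.0, 1.5, 4.0)]
assert max(diffs) < 1e-12

# ... and the change of co-ordinates inverts as stated:
# mu = -th1/(2 th2), s = (-2 th2)**(-1/2).
assert abs(-th1/(2*th2) - mu) < 1e-12
assert abs((-2*th2)**-0.5 - s) < 1e-12
print("canonical co-ordinates verified")
```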
The answer that they supply is that it is indeed possible as long as the partial derivatives

\[ \frac{\partial C}{\partial z_i} \ (i = 1,\ldots,n) \qquad \text{and} \qquad \frac{\partial y_i}{\partial z_j} \ (i,j = 1,\ldots,n) \]

remain constant, in which case the two co-ordinate systems are related by

\[ \eta^{i} = \sum_{j=1}^{n} \frac{\partial y_j}{\partial z_i}\,\theta^{j} + \frac{\partial C}{\partial z_i} \qquad (i = 1,\ldots,n) \]

Assuming these partial derivatives are constant, this is an affine relation, and so it is possible to define the notion of a ‘straight line’ through an exponential family by lifting straight lines in the real-space image of any such canonical co-ordinate system back to the family itself: if θ(t) defines a straight line in the parameter space inside ℝⁿ, then p(z;θ(t)) defines a straight line in the exponential family. Since these canonical co-ordinates are affinely related, the straight lines for one co-ordinate system are straight lines for any other.

4.2 – TESTING FOR EXPONENTIALITY

So far, it is only clear that if the distribution functions of a family may be expressed in a certain way then that family is exponential. It is not so clear, however, how to determine when a family is not exponential. For example, the Logistic family on ℝ, with probability density functions of the form

\[ p(z;\mu,s) = \frac{1}{s}\, e^{-(z-\mu)/s} \left( 1 + e^{-(z-\mu)/s} \right)^{-2} \qquad \mu\in\mathbb{R},\ s > 0 \]

does not seem to take the form of an exponential family due to the inverse-square term, but this observation in itself is not enough to prove that this family is not of the exponential type. Murray & Rice[21] give 4 criteria for determining outright whether a family of distributions is exponential:

1. The family is an affine subspace of the space of all probability measures;
2. For each probability density function in the family, the second partial derivatives of its log-likelihood function, considered as functions on the sample space, lie in the span of its scores and a constant;

[20] §1.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
[21] §1.6, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.

3.
The second fundamental form of the family (defined below) is zero; and
4. A generalisation of Efron’s statistical curvature (defined below) is zero.

In order to avoid any confusion, it is important to clarify that in the second test the scores may be linearly combined with coefficients that are functions of the parameters of the family, since each probability density function is considered separately and so the parameters in this case are just constants.

The first test, as noted by Murray & Rice, is equivalent to the existence of a representation of the probability density functions of a family as per the definition of an exponential family[22]. In the case of the Logistic family, say ℒ, this translates to the existence of real-valued functions φ: ℝ×(0,∞) → ℝ, C: ℝ → ℝ and y₁, y₂: ℝ → ℝ such that for any p(z;μ,s)∊ℒ

\[ p(\mu,s) = \exp\!\left( -\log(s) - \frac{z-\mu}{s} - 2\log\!\left( 1 + e^{-(z-\mu)/s} \right) \right) = \exp\!\big( C(z) + \mu\, y_1(z) + s\, y_2(z) - \varphi(\mu,s) \big) \]

The difficulty with this test, as illustrated by this example, is that it can be hard to prove that no such representation exists. In a casual sense, one may wish to claim that the logarithm term – being neither a function of z or (μ,s) alone, nor a product of a function of z and either μ or s – precludes any such p(z;μ,s)∊ℒ from having an exponential representation. However, this is not rigorous, and in other cases, where the function in question can be factored into the required terms, may lead to misdiagnoses. One should therefore proceed with caution when applying this test to prove that a given family of probability distributions is not exponential.

Continuing on to the second test, the log-likelihood function for any p(μ,s)∊ℒ is

\[ \ell(p(\mu,s)) = -\log(s) - \frac{z-\mu}{s} - 2\log\!\left( 1 + e^{-(z-\mu)/s} \right) \]

and so the scores are

\[ \frac{\partial\ell}{\partial\mu} = \frac{1}{s}\left( 1 - \frac{2}{e^{(z-\mu)/s} + 1} \right) \qquad\quad \frac{\partial\ell}{\partial s} = -\frac{1}{s} + \frac{z-\mu}{s^2}\left( 1 - \frac{2}{e^{(z-\mu)/s} + 1} \right) \]

[22] §1.4, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
To prove that ℒ fails to be an exponential family it suffices to show that the span of the scores and a constant does not include the partial derivatives of the scores with respect to μ and s. To this end, note that the partial derivative

\[ \frac{\partial^2\ell}{\partial\mu^2} = \frac{-2\, e^{(z-\mu)/s}}{s^2 \left( e^{(z-\mu)/s} + 1 \right)^2} \]

is not a linear combination of the scores and a constant, since the denominator introduces a squared term which is not part of either of the scores or any combination of them. Therefore ℒ fails this test of exponentiality.

The second fundamental form, also known as the embedding curvature by Amari & Nagaoka[23], is defined by considering the space of measurable functions on a measure space as the direct product of the tangent space of an augmented statistical manifold and its normal space in terms of the inner product ⟨φ,ψ⟩ₚ = 𝔼ₚ(φ·ψ) (the expectation of φ·ψ with respect to the probability density function p). Using the previous result regarding the second partial derivatives of the log-likelihood functions of an exponential family being in the span of the scores, it is shown that the normal component of these second partial derivatives must in fact be equal to zero, hence giving a computable criterion for exponentiality – the second fundamental form of the probability density function p:

\[ \alpha_{i,j}(p) = \frac{\partial^2\ell}{\partial\theta^i \partial\theta^j} - \sum_{s,t=1}^{n} g^{s,t}(p)\; \mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\theta^i \partial\theta^j}\, \frac{\partial\ell}{\partial\theta^s} \right) \frac{\partial\ell}{\partial\theta^t} - \mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\theta^i \partial\theta^j} \right) \]

where [g^{s,t}] is the inverse of the matrix (known as the Fisher information matrix) defined by

\[ g_{i,j}(p) = \mathbb{E}_p\!\left( \frac{\partial\ell}{\partial\theta^i}\, \frac{\partial\ell}{\partial\theta^j} \right) = -\mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\theta^i \partial\theta^j} \right) \]

If each αᵢ,ⱼ is zero for each probability density function p in the statistical manifold, then the normal component of the derivatives of the scores is zero and so the family is exponential.
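The defining identity g_{i,j} = 𝔼ₚ(∂ℓ/∂θⁱ·∂ℓ/∂θʲ) = −𝔼ₚ(∂²ℓ/∂θⁱ∂θʲ) can be checked numerically for the Normal family, for which the closed form g = diag(1/ς², 2/ς²) in the (μ, ς) co-ordinates is a standard result quoted here purely for comparison. A sketch using grid-based expectations:

```python
import numpy as np

# Fisher information of N(mu, s) in the (mu, s) co-ordinates, computed
# from the definition g_ij = E[dl_i dl_j] via numerical integration.
mu, s = 0.5, 1.3
z = np.linspace(mu - 12*s, mu + 12*s, 400001)
dz = z[1] - z[0]
p = np.exp(-(z - mu)**2 / (2*s*s)) / np.sqrt(2*np.pi*s*s)

dl_mu = (z - mu) / s**2                 # the two scores
dl_s = -1/s + (z - mu)**2 / s**3

def E(f):                               # expectation as a Riemann sum
    return float(np.sum(f * p) * dz)

G = np.array([[E(dl_mu*dl_mu), E(dl_mu*dl_s)],
              [E(dl_s*dl_mu),  E(dl_s*dl_s)]])

# Known closed form for comparison: g = diag(1/s^2, 2/s^2).
expected = np.diag([1/s**2, 2/s**2])
assert np.allclose(G, expected, atol=1e-5)

# The alternative expression -E[d2l] agrees; e.g. d2l/dmu2 = -1/s^2.
assert abs(-E(np.full_like(z, -1/s**2)) - 1/s**2) < 1e-5
print(G.round(6))
```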
Note that the Fisher information matrix exists under the requirements placed on the probability density functions of a statistical manifold according to Amari & Nagaoka, since the existence of the expectation of the squared scores of each probability density function, as per the Fisher information, implies that the expectation of the product of two such scores is also integrable. However, under the conditions of Murray & Rice that the probability density functions of a statistical manifold must satisfy, the Fisher information matrix may not exist, a fact which is noted by these authors[24]. Having already verified that ℒ is not an exponential family, it is of no real benefit to calculate the large number of complicated terms in each second fundamental form.

[23] §1.9, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
[24] §1.5.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.

However, for families where these expressions may simplify due to terms vanishing, this test may prove preferable due to the fact that it requires only the evaluation of each fundamental form and the verification that they are zero.

The last test for exponentiality is given by the following function of the second fundamental forms and the inverse of the Fisher information matrix for a given probability density function p in a statistical manifold:

\[ \gamma(p) = \sum_{i,j,k,l=1}^{n} g^{i,j}(p)\, g^{k,l}(p)\; \mathbb{E}_p\!\big( \alpha_{i,k}(p)\, \alpha_{j,l}(p) \big) \]

If this function is zero on the statistical manifold, then the family is exponential[25]. Once again, given the large number and complexity of the terms involved in this expression, the evaluation of this function on ℒ will be dismissed. Instead, consider the simpler one-parameter family of probability distributions that is the Poisson family on the non-negative integers, with density functions of the form

\[ p(k;\lambda) = \frac{e^{-\lambda} \lambda^{k}}{k!} \qquad \lambda > 0 \]

The log-likelihood, its first and second derivatives, and the Fisher information “matrix” are given by

\[ \ell(p(\lambda)) = k\log(\lambda) - \lambda - \log(k!) \]
\[ \frac{\partial\ell}{\partial\lambda}(\lambda) = \frac{k}{\lambda} - 1 \qquad\quad \frac{\partial^2\ell}{\partial\lambda^2}(\lambda) = \frac{-k}{\lambda^2} \qquad\quad g(p(\lambda)) = -\mathbb{E}_p\!\left( \frac{-k}{\lambda^2} \right) = \frac{1}{\lambda} \]

The Poisson family passes the second test immediately, since

\[ \frac{\partial^2\ell}{\partial\lambda^2} = \frac{-k}{\lambda^2} = \frac{-1}{\lambda}\cdot\frac{\partial\ell}{\partial\lambda} + \frac{-1}{\lambda}\cdot 1 \]

factors into a linear combination of the score and 1. Given that this family has only one parameter, the last two tests are rather easy to apply also. Since 𝔼ₚ(k) = λ and 𝔼ₚ(k²) = λ + λ², one has

\[ \mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\lambda^2}\, \frac{\partial\ell}{\partial\lambda} \right) = \mathbb{E}_p\!\left( \frac{-k}{\lambda^2}\left( \frac{k}{\lambda} - 1 \right) \right) = -\frac{\lambda+\lambda^2}{\lambda^3} + \frac{\lambda}{\lambda^2} = \frac{-1}{\lambda^2} \]

and so the second fundamental form of the Poisson family is

\[ \alpha(p) = \frac{\partial^2\ell}{\partial\lambda^2} - g^{-1}(p)\, \mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\lambda^2}\, \frac{\partial\ell}{\partial\lambda} \right) \frac{\partial\ell}{\partial\lambda} - \mathbb{E}_p\!\left( \frac{\partial^2\ell}{\partial\lambda^2} \right) = \frac{-k}{\lambda^2} - \lambda\cdot\frac{-1}{\lambda^2}\left( \frac{k}{\lambda} - 1 \right) - \frac{-1}{\lambda} = \frac{-k}{\lambda^2} + \frac{k}{\lambda^2} - \frac{1}{\lambda} + \frac{1}{\lambda} = 0 \]

Therefore the Poisson family is exponential. For the last criterion, note that the test function γ is indeed zero, since

\[ \gamma(p) = \big(g^{-1}(p)\big)^2\, \mathbb{E}_p\!\big( \alpha(p)^2 \big) = \lambda^2 \cdot \mathbb{E}_p(0) = 0 \]

Although these tests were very easy to perform for the Poisson family, they need not always be so simple, or even calculable, as in the Logistic family or any other family with a large number of parameters or complex probability density functions. This illustrates the benefit of having a wide range of tests for exponentiality available.

[25] §1.5.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.

CHAPTER 5 – STRAIGHTNESS

In ℝⁿ with Cartesian co-ordinates there is a well-understood concept of straight lines – for example, one way of many to define when a path γ(t): [0,1] → ℝⁿ is straight is if its derivative with respect to t, γ′(t), is constant. However, in the setting of manifolds, where γ(t): [0,1] → X is a path in a manifold X, there is no assumed structure on X that allows one to compute for any x,y∊X the quantity “x − y”, and hence no differentiation by first principles. Furthermore, there is no assumed relation between the tangent spaces T_γ(t)X and T_γ(s)X for t,s∊[0,1], and so asking whether t ↦ γ′(t) is a constant map need not be well-defined. It is apparent, then, that some additional framework is required to define straightness on manifolds.
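The obstruction just described can be made concrete even on the plane: a path whose Cartesian co-ordinates are linear in t has polar co-ordinates that are not, so "constant derivative" is a chart-dependent notion rather than a property of the underlying space. A small numerical sketch:

```python
import math

# A path that is straight in Cartesian co-ordinates: p(t) = (t, 1).
# Its Cartesian co-ordinates are linear in t, but its polar co-ordinates
# r(t), theta(t) are not - "straight" depends on the chosen structure.
def polar(t):
    x, y = t, 1.0
    return math.hypot(x, y), math.atan2(y, x)

h = 1e-3
def second_diff(f, t):                 # central second difference
    return (f(t + h) - 2*f(t) + f(t - h)) / h**2

# The Cartesian components have zero second derivative ...
assert abs(second_diff(lambda t: t, 0.5)) < 1e-6
# ... while both polar components accelerate along the very same path.
assert abs(second_diff(lambda t: polar(t)[0], 0.5)) > 0.1
assert abs(second_diff(lambda t: polar(t)[1], 0.5)) > 0.1
print("straightness is chart-dependent")
```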
5.1 – CONNECTIONS

Since straight lines are related to rates of change of functions on manifolds, it is natural to consider the tangent spaces of a manifold when defining straightness. Given a path through a manifold X, p(t): [0,1] → X, for each value of t∊[0,1] the path defines a tangent vector in T_p(t)X via the map to the equivalence class t ↦ [(θ∘p)′(t)], where θ is a co-ordinate chart on a neighbourhood containing p(t), and so the path gives rise to a vector field along the path, denoted by ṗ(t). If this vector field is constant, i.e. has zero rate of change, then the path p(t) is straight. The problem that arises here is that, in general, for two points x and y in X the corresponding tangent spaces TₓX and TᵧX need not be related, and so there may not be any way of comparing tangent vectors from each space in an obvious way, let alone taking derivatives via limits.

To overcome this problem, one defines an operator on vector fields called a connection[26] or covariant derivative[27], denoted by ∇. In order to make this operator as flexible as possible, Murray & Rice simply list 3 properties that it must satisfy:

1. For any vector field V, ∇V: TₓX → TₓX is a linear function from the tangent space TₓX to itself for every x∊X;
2. For any two vector fields V₁ and V₂, ∇(V₁+V₂) = ∇V₁ + ∇V₂; and
3. For any vector field V and differentiable function f: X → ℝ, ∇(f·V)(γ′) = df(γ′)·V(x) + f(x)·∇V(γ′) for all γ′∊TₓX and all x∊X.

Notice then that if ∇ is such a connection, then expanding a vector field V in terms of the co-ordinate vector fields and functions v¹, …, vⁿ: X → ℝ,

\[ V(x) = v^1(x)\,\frac{\partial}{\partial\theta^1}(x) + \cdots + v^n(x)\,\frac{\partial}{\partial\theta^n}(x) \]

properties 2 and 3 show that for any tangent vector γ′∊TₓX

\[ \nabla V(\gamma') = \sum_{i=1}^{n} \left( dv^i(\gamma')\,\frac{\partial}{\partial\theta^i}(x) + v^i(x)\,\nabla\frac{\partial}{\partial\theta^i}(\gamma') \right) \]

and so the connection is completely specified by its values on the co-ordinate vector fields.

[26] §4.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
[27] §1.6, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
Now, by property 1 it is possible to write, for each i = 1, …, n,

\[ \nabla\frac{\partial}{\partial\theta^i}(\gamma') = A^1_i(\gamma')\,\frac{\partial}{\partial\theta^1}(x) + \cdots + A^n_i(\gamma')\,\frac{\partial}{\partial\theta^n}(x) \]

for some real-valued linear functions A¹ᵢ, …, Aⁿᵢ on the tangent space. The chain rule expands each of these functions as

\[ A^j_i(\gamma') = \Gamma^j_{i,1}\,d\theta^1(\gamma') + \cdots + \Gamma^j_{i,n}\,d\theta^n(\gamma') \]

where the Γʲᵢ,ₖ: X → ℝ are real-valued functions on the manifold called the Christoffel symbols. Then finally this gives the representation

\[ \nabla_{\frac{\partial}{\partial\theta^i}} \frac{\partial}{\partial\theta^j} = \sum_{k=1}^{n} \Gamma^k_{i,j}\,\frac{\partial}{\partial\theta^k} \]

which completely specifies the connection ∇. Although it is not made explicit in this representation, in order to simplify notation, in general the Christoffel symbols vary over the manifold, i.e. Γʲᵢ,ₖ = Γʲᵢ,ₖ(x) ≠ Γʲᵢ,ₖ(y) for x ≠ y.

Notice that so far no mention has been made of precisely how a connection should be defined, just the properties that it should satisfy, even though the aim has been to establish what straight lines are on arbitrary manifolds. Whilst this may seem counter-intuitive, it is intentional and reflects the idea that a given manifold does not have an inherent sense of straightness. Although a path is straight if the connection applied to the vector field it traces out is the zero function, this is entirely dependent on the connection itself. To illustrate this distinction, consider the plane with standard Cartesian co-ordinates. Let p_t = (α_t, β_t) be a path on the plane with vector field

\[ \dot p_t = \alpha'_t\,\frac{\partial}{\partial x}(p_t) + \beta'_t\,\frac{\partial}{\partial y}(p_t) \]

Then the action of a connection can be expressed as

\[ \nabla \dot p_t(\gamma') = d[\alpha'_t](\gamma')\,\frac{\partial}{\partial x}(p_t) + \alpha'_t\,\nabla\frac{\partial}{\partial x}(\gamma') + d[\beta'_t](\gamma')\,\frac{\partial}{\partial y}(p_t) + \beta'_t\,\nabla\frac{\partial}{\partial y}(\gamma') \]

Now if one makes the decision that the Cartesian co-ordinate vector fields are to be constant, so that when acted on by the connection they both vanish, then this formula reduces to the standard Euclidean form

\[ \nabla \dot p_t(\gamma') = d[\alpha'_t](\gamma')\,\frac{\partial}{\partial x} + d[\beta'_t](\gamma')\,\frac{\partial}{\partial y} \]

which agrees with the usual sense of straightness on the plane – i.e.
the path p_t = (α_t, β_t) is straight if its Cartesian co-ordinates, α_t and β_t, are linear, so that d[α′_t] and d[β′_t] are zero. However, if one chooses instead to use polar co-ordinates on the plane (locally, since these co-ordinates do not give an injective description of the plane, amongst other things) and specifies that the polar co-ordinate vector fields are constant, then this formula becomes

\[ \nabla \dot p_t(\gamma') = d[\alpha'_t](\gamma')\,\frac{\partial}{\partial r} + d[\beta'_t](\gamma')\,\frac{\partial}{\partial \theta} \]

and so p_t is now straight only when its radial and angular co-ordinates are linear in t. In particular, the straight lines under this geometry include rays from the origin and rotations about the origin. So, it is seen that the plane does not have any inherent notion of straight lines, and that this concept is to be specified by connections.

5.2 – SUBSPACE CONNECTIONS

Given a manifold X with connection ∇, there is a natural way of attempting to establish a connection on any subspace Y⊂X by treating points in Y as points in X and tangent spaces of Y as subspaces of tangent spaces of X. However, in general there is no guarantee for a given vector field V on Y that the image of ∇V will lie entirely within TᵧY for all y∊Y. For example, consider the unit circle, S¹, as a sub-manifold of the plane under Euclidean geometry. Writing points (locally) in S¹ as just θ∊ℝ, the corresponding point in ℝ² is given by (cos θ, sin θ), and so the co-ordinate vector fields satisfy

\[ \frac{\partial}{\partial\theta} = -\sin\theta\,\frac{\partial}{\partial x} + \cos\theta\,\frac{\partial}{\partial y} \]

Now, property 3 of connections in §5.1 gives the expression

\[ \nabla\frac{\partial}{\partial\theta} = -\cos\theta\,\frac{\partial}{\partial x} - \sin\theta\,\nabla\frac{\partial}{\partial x} - \sin\theta\,\frac{\partial}{\partial y} + \cos\theta\,\nabla\frac{\partial}{\partial y} = -\cos\theta\,\frac{\partial}{\partial x} - \sin\theta\,\frac{\partial}{\partial y} \]

where the values of the connection on the x–y vector fields are zero by the chosen Euclidean geometry of ℝ². It is seen then that the image of a tangent vector in S¹ is always pointing inwards towards the origin, which is certainly not tangential to S¹, and so this connection on the plane fails to produce a connection on S¹.
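This failure can be confirmed numerically: differentiating the Cartesian components of the circle's tangent field (which is all the flat connection of the plane does) produces the radial vector (−cos θ, −sin θ), orthogonal to the circle. A finite-difference sketch:

```python
import math

# Tangent field of the unit circle at angle th, written in Cartesian
# components as in the text: d/dth = (-sin th, cos th).
def tangent(th):
    return (-math.sin(th), math.cos(th))

# Differentiating the tangent field with the flat (Euclidean) connection
# amounts to an ordinary th-derivative of its Cartesian components.
h = 1e-6
def nabla_tangent(th):
    (a1, b1), (a2, b2) = tangent(th - h), tangent(th + h)
    return ((a2 - a1) / (2*h), (b2 - b1) / (2*h))

for th in (0.3, 1.1, 2.9):
    vx, vy = nabla_tangent(th)
    # The result is (-cos th, -sin th): it points at the origin and is
    # orthogonal to the tangent, hence outside the tangent space of S^1.
    assert abs(vx + math.cos(th)) < 1e-6 and abs(vy + math.sin(th)) < 1e-6
    assert abs(vx*tangent(th)[0] + vy*tangent(th)[1]) < 1e-6
print("Euclidean connection pushes circle tangents off the circle")
```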
In the case that a connection ∇ on a manifold X is also a connection when restricted to a sub-manifold Y, the latter is said to be an autoparallel sub-manifold[28] of X. When a sub-manifold Y⊂X is not autoparallel with respect to a connection ∇, it is still possible to use this connection to define a connection on Y. As in Amari & Nagaoka[29], if πᵧ: TᵧX → TᵧY is a linear map for each y∊Y that fixes TᵧY, i.e. a projection from TX onto TY, then the operator ∇^π[V](γ′) = π[∇V(γ′)] defines a connection on Y, since it is a linear map by assumption and, for any vector field V on Y, any differentiable f and each y∊Y,

\[ \nabla^\pi[fV](\gamma') = \pi_y[\nabla(fV)(\gamma')] = \pi_y\big[\,df(\gamma')V(y) + f(y)\nabla V(\gamma')\,\big] = df(\gamma')\,\pi_y[V(y)] + f(y)\,\pi_y[\nabla V(\gamma')] = df(\gamma')V(y) + f(y)\nabla^\pi[V](\gamma') \]

5.3 – α-CONNECTIONS

Having now defined the generalised concept of connections, it is appropriate to mention the family of connections on statistical manifolds defined by Amari & Nagaoka[30] called the α-connections. Let P be a statistical manifold with co-ordinates θ = (θ¹, …, θⁿ); then for each point p∊P and each α∊ℝ, Murray & Rice show that the α-connection is defined by the Christoffel symbols[31]

\[ \Gamma^{j\,(\alpha)}_{i,k}(p) = \sum_{l=1}^{n} g^{j,l}\; \mathbb{E}_p\!\left[ \left( \frac{\partial^2\ell_p}{\partial\theta^i \partial\theta^k} + \frac{1-\alpha}{2}\,\frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^k} \right) \frac{\partial\ell_p}{\partial\theta^l} \right] \]

For a given α∊ℝ, Amari & Nagaoka note that it is possible to define the α-connection in terms of three special instances of these connections. Namely, the following equalities hold

[28] §1.8, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
[29] §1.9, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
[30] §2.3, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
[31] §4.6, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
\[ \nabla^{(\alpha)} = (1-\alpha)\,\nabla^{(0)} + \alpha\,\nabla^{(1)} = \frac{1+\alpha}{2}\,\nabla^{(1)} + \frac{1-\alpha}{2}\,\nabla^{(-1)} \]

To see the significance of the 1-connection, consider an exponential family

\[ p(z;\theta) = \exp\!\left( C(z) + \sum_{i=1}^{n} \theta^i y_i(z) - \varphi(\theta) \right) \]

Then the partial derivatives of its log-likelihood are given by

\[ \frac{\partial\ell_p}{\partial\theta^i} = y_i(z) - \frac{\partial\varphi(\theta)}{\partial\theta^i} \qquad\quad \frac{\partial^2\ell_p}{\partial\theta^i \partial\theta^k} = -\frac{\partial^2\varphi(\theta)}{\partial\theta^i \partial\theta^k} \]

and so the equation defining the Christoffel symbols reduces to

\[ \Gamma^{j\,(1)}_{i,k}(p) = -\sum_{l=1}^{n} g^{j,l}\,\frac{\partial^2\varphi(\theta)}{\partial\theta^i \partial\theta^k}\;\mathbb{E}_p\!\left( \frac{\partial\ell_p}{\partial\theta^l} \right) \]

which is zero since (recalling the assumption made when defining a statistical manifold in §3.1 that the orders of differentiation and integration of the scores may be swapped)

\[ \mathbb{E}_p\!\left( \frac{\partial\ell_p}{\partial\theta^l} \right) = \int_{\mathbb{R}^n} \frac{\partial\ell_p}{\partial\theta^l}(z)\, p(z)\, dz = \int_{\mathbb{R}^n} \frac{1}{p(z)}\frac{\partial p(z)}{\partial\theta^l}\, p(z)\, dz = \frac{\partial}{\partial\theta^l} \int_{\mathbb{R}^n} p(z)\, dz = \frac{\partial}{\partial\theta^l}(1) = 0 \]

Hence, because ∇⁽¹⁾ reduces to

\[ \nabla V(\gamma') = \sum_{i=1}^{n} dv^i(\gamma')\,\frac{\partial}{\partial\theta^i}(p) \]

for any vector field

\[ V(p) = v^1(p)\,\frac{\partial}{\partial\theta^1}(p) + \cdots + v^n(p)\,\frac{\partial}{\partial\theta^n}(p) \]

on an exponential family, such families are said to be flat with respect to the 1-connection.

Conversely, consider now a mixture family on a measure space M given by

\[ p(m;\theta) = \sum_{i=1}^{n} \theta^i p_i(m) \]

where each pᵢ is a probability distribution function on M and the θⁱ are non-negative weights satisfying θ¹ + … + θⁿ = 1. Then the partial derivatives of the log-likelihood functions are given by

\[ \frac{\partial\ell_p}{\partial\theta^i} = \frac{p_i(m)}{p(m;\theta)} \qquad\quad \frac{\partial^2\ell_p}{\partial\theta^i \partial\theta^k} = -\frac{p_i(m)\,p_k(m)}{p(m;\theta)^2} = -\frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^k} \]

Thus the Christoffel symbols for ∇⁽⁻¹⁾ vanish, since

\[ \frac{\partial^2\ell_p}{\partial\theta^i \partial\theta^k} + \frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^k} = -\frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^k} + \frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^k} = 0 \]

Therefore, any mixture family is flat with respect to the (−1)-connection. It will be seen later, when examining inner products on manifolds, that the 0-connection coincides with a special type of connection which preserves inner products under a form of translation on manifolds – a Riemannian connection. Thus the α-connection is seen to give a way of unifying these three concepts into a more general form.
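The mixture-family identity used above, ∂²ℓ/∂θⁱ∂θᵏ = −(∂ℓ/∂θⁱ)(∂ℓ/∂θᵏ), can be checked by finite differences on a two-component mixture of Normal densities; the component means and evaluation point below are arbitrary illustrative choices:

```python
import math

# Finite-difference check of d2l/dth_i dth_k = -(dl/dth_i)(dl/dth_k)
# for a mixture p(m; th) = th1*p1(m) + th2*p2(m) at a fixed sample point.
def phi(m, mu):                         # standard-width Normal component
    return math.exp(-(m - mu)**2 / 2) / math.sqrt(2*math.pi)

def loglik(th, m):
    return math.log(th[0]*phi(m, -1.0) + th[1]*phi(m, 2.0))

m0, th0, h = 0.3, (0.4, 0.6), 1e-4

def d1(i, th):                          # first partial, central difference
    e = [0.0, 0.0]; e[i] = h
    up = (th[0]+e[0], th[1]+e[1]); dn = (th[0]-e[0], th[1]-e[1])
    return (loglik(up, m0) - loglik(dn, m0)) / (2*h)

def d2(i, k):                           # second partial, nested differences
    e = [0.0, 0.0]; e[k] = h
    up = (th0[0]+e[0], th0[1]+e[1]); dn = (th0[0]-e[0], th0[1]-e[1])
    return (d1(i, up) - d1(i, dn)) / (2*h)

max_err = max(abs(d2(i, k) + d1(i, th0)*d1(k, th0))
              for i in (0, 1) for k in (0, 1))
assert max_err < 1e-4
print("second derivatives equal minus the product of scores")
```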
5.4 – GEODESICS

In real space with Cartesian co-ordinates, the shortest path between any two points, say x and y, is always just the line joining the two points: γ(t) = t·x + (1−t)·y. However, in general this is not the case. Although locally all manifolds look like real space, as has been shown it is their geometries that define the idea of straightness. As in Murray & Rice[32], a path γ_t, where t ranges over an interval I⊂ℝ, in a manifold X with connection ∇ is a geodesic if it satisfies the equality

\[ \nabla \gamma_t'(\gamma_t') = 0 \]

for all t∊I, where γ_t′ = [(θ∘γ_t)′(t)] is the vector field traced out by γ_t. The interpretation is that the path traced out by γ_t is straight under the geometry implied by the connection. In order to find such paths in a manifold, Murray & Rice go on to derive the following system of differential equations which the co-ordinates of a geodesic, θⁱ = θⁱ(γ_t), i = 1, …, n, must satisfy:

\[ \ddot\theta^i + \sum_{j,k=1}^{n} \Gamma^i_{j,k}\,\dot\theta^j \dot\theta^k = 0 \]

where the derivatives are taken with respect to t.

[32] §4.9, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.

If the manifold is flat under the connection specified by the Christoffel symbols, i.e. the Γʲᵢ,ₖ are all zero, then this reduces to the standard concept of geodesics – namely, paths whose co-ordinate representations are of the form θⁱ(γ_t) = aⁱ + bⁱ·t for constants (a¹, …, aⁿ) and (b¹, …, bⁿ), i = 1, …, n. In general, however, one must solve these differential equations to determine the geodesics defined by a connection (or show that a path satisfies these equations), which, recalling that the Christoffel symbols are in general functions on the manifold, can be a difficult task. To elucidate this and the theory established so far in this chapter, two examples will now be presented.

5.5 – STATISTICAL MANIFOLD GEODESICS

A bountiful reference for the explicit forms of many of these concepts on statistical manifolds is given by Arwini & Dodson.
One of many families that they consider is that of the Normal distributions on ℝ:

\[ \mathcal{N} = \left\{\, p(z;\mu,\varsigma) = \frac{1}{\sqrt{2\pi\varsigma^2}}\,\exp\!\left(-\frac{(z-\mu)^2}{2\varsigma^2}\right) \;\middle|\; \mu\in\mathbb{R},\ \varsigma>0 \,\right\} \]

They present the non-zero Christoffel symbols for the α-connections of this family without ado as follows[33]:

\[ \Gamma^{1\,(\alpha)}_{1,2} = \Gamma^{1\,(\alpha)}_{2,1} = -\frac{\alpha+1}{\varsigma} \qquad \Gamma^{2\,(\alpha)}_{1,1} = \frac{1-\alpha}{2\varsigma} \qquad \Gamma^{2\,(\alpha)}_{2,2} = -\frac{2\alpha+1}{\varsigma} \]

Using these coefficients, the system of equations defined by Murray & Rice for the geodesics of this geometry is derived here as

\[ \ddot\mu = -\Gamma^{1\,(\alpha)}_{1,2}\,\dot\mu\dot\varsigma - \Gamma^{1\,(\alpha)}_{2,1}\,\dot\varsigma\dot\mu = \frac{2(\alpha+1)}{\varsigma}\,\dot\mu\dot\varsigma \]
\[ \ddot\varsigma = -\left( \Gamma^{2\,(\alpha)}_{1,1}\,\dot\mu^2 + \Gamma^{2\,(\alpha)}_{2,2}\,\dot\varsigma^2 \right) = -\frac{1-\alpha}{2\varsigma}\,\dot\mu^2 + \frac{2\alpha+1}{\varsigma}\,\dot\varsigma^2 \]

It is worth pointing out that setting α = 0 gives the following system of equations

\[ \ddot\mu\,\varsigma = 2\dot\mu\dot\varsigma \qquad\quad \varsigma\ddot\varsigma = -\tfrac{1}{2}\dot\mu^2 + \dot\varsigma^2 \]

which is well understood from the equations for geodesics on the hyperbolic plane given by the Poincaré metric[34]. The simplest solution to these equations is given by the constant function μ = c∊ℝ and the exponential map ς = k·eᵗ for some k∊ℝ, since the derivatives of μ are all zero and ς is unchanged by differentiation. The more interesting solutions are given by μ = −p·tanh(at) and ς = r·sech(bt) for some constants a, b, p, r∊ℝ. To determine the required relations for these constants, note that the derivatives of μ and ς are given by

\[ \dot\mu = ap\,(-1 + \tanh^2(at)) = -ap\,\mathrm{sech}^2(at) \qquad\quad \ddot\mu = 2a^2 p\,\tanh(at)\,(1 - \tanh^2(at)) \]
\[ \dot\varsigma = -br\,\tanh(bt)\,\mathrm{sech}(bt) \qquad\quad \ddot\varsigma = -b^2 r\,\mathrm{sech}^3(bt) + b^2 r\,\tanh^2(bt)\,\mathrm{sech}(bt) \]

Hence, the coefficients must satisfy the equality

\[ \ddot\mu\,\varsigma = 2a^2 pr\,\tanh(at)\big(1 - \tanh^2(at)\big)\,\mathrm{sech}(bt) = 2abpr\,\tanh(bt)\big(1 - \tanh^2(at)\big)\,\mathrm{sech}(bt) = 2\dot\mu\dot\varsigma \]

so that a = b. The coefficients must also satisfy the equality

\[ \varsigma\ddot\varsigma = -b^2 r^2\,\mathrm{sech}^4(bt) + b^2 r^2\,\tanh^2(bt)\,\mathrm{sech}^2(bt) = -\frac{a^2 p^2}{2}\,\mathrm{sech}^4(at) + b^2 r^2\,\tanh^2(bt)\,\mathrm{sech}^2(bt) = -\tfrac{1}{2}\dot\mu^2 + \dot\varsigma^2 \]

which implies p² = 2r².

[33] §3.8.1, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
Thus the geodesic co-ordinate functions are

\[ \mu = -r\sqrt{2}\,\tanh(at) \qquad\quad \varsigma = r\,\mathrm{sech}(at) \]

Note that this reduces to the relation ς² = r² − ½μ² via standard hyperbolic identities.

[34] p.76, “Dictionary of Distances” – E Deza and M Deza, 2006.

As seen in the graph below, although the geodesics for fixed μ correspond to the standard concept of straightness, all other geodesics are actually semi-ellipses, which certainly is an uncommon interpretation of straight lines.

(Figure: elliptical geodesics of the Normal statistical manifold under the 0-connection)

Another family presented by Arwini & Dodson is the family of Gamma distributions on { z | z > 0 } = ℝ₊, given in (non-canonical) co-ordinates as

\[ \digamma = \left\{\, p(z;\gamma,\kappa) = \left(\frac{\kappa}{\gamma}\right)^{\kappa} \frac{z^{\kappa-1}}{\Gamma(\kappa)}\, e^{-z\kappa/\gamma} \;\middle|\; \gamma > 0,\ \kappa > 0 \,\right\} \]

In particular, they show that the system of equations that must be satisfied by geodesics of the 0-connection is

\[ \ddot\gamma = \frac{\dot\gamma^2}{\gamma} - \frac{\dot\gamma\dot\kappa}{\kappa} \qquad\quad \ddot\kappa = \frac{\kappa\,\dot\gamma^2}{2\gamma^2\big(\kappa\psi'(\kappa)-1\big)} - \frac{\big(\kappa^2\psi''(\kappa)+1\big)\,\dot\kappa^2}{2\kappa\big(\kappa\psi'(\kappa)-1\big)} \]

where ψ is the digamma function defined by

\[ \psi(y) = \frac{\Gamma'(y)}{\Gamma(y)} \]

In comparison to the simpler equations established for the Normal family under the 0-connection, the equations for the Gamma family are not readily solvable due to the complicated digamma function. Indeed, Arwini & Dodson present only computer-generated geodesics due to this inherent difficulty. Therefore, it is seen that there is potential for complication in establishing geodesics for a given family under a given connection. This should be kept in mind, for little benefit can be gained if one is unable to implement the theory in practice.

CHAPTER 6 – DISTANCES ON STATISTICAL MANIFOLDS

The parameters θ¹, …, θⁿ of a statistical family take values in a given subset of ℝⁿ, and so it is not unreasonable to propose measuring distances between elements of the family by using, say, the standard Euclidean distance on the parameters.
However, as motivation to investigate other notions of distance, consider the so-called "taxicab metric"[35] on ℝ² given by

\[ d\big((x_1,y_1),(x_2,y_2)\big) = |x_2 - x_1| + |y_2 - y_1| \]

As the name suggests, this metric is supposed to emulate the concept of distance perceived by a vehicle that can only move on a vertical and horizontal grid as defined by the roadways. Under this metric, any route with a total horizontal displacement of |x₂ − x₁| and a total vertical displacement of |y₂ − y₁|, no matter how many horizontal and vertical sections it is composed of, achieves the minimum distance between (x₁,y₁) and (x₂,y₂). Compare this with the Euclidean distance on ℝ², which has a unique shortest path given by the diagonal between (x₁,y₁) and (x₂,y₂). Hence, there are times when one wishes to consider alternative measures to accommodate peculiarities that one may encounter.

6.1 – RIEMANNIAN METRICS

Although initially seeming somewhat roundabout, a common method of constructing a measure of distance on a manifold is via its tangent space. Since this space has a vector space structure by default, one may define an inner product over each point on the base manifold. That is, a function ⟨ · | · ⟩ₓ : TₓX × TₓX → ℝ that satisfies the following rules:

1. For a given x ∊ X and for all a,b ∊ ℝ and u,v,w ∊ TₓX, ⟨ a·u + b·v | w ⟩ₓ = a·⟨ u | w ⟩ₓ + b·⟨ v | w ⟩ₓ ;
2. For a given x ∊ X and for all u,v ∊ TₓX, ⟨ u | v ⟩ₓ = ⟨ v | u ⟩ₓ ; and
3. For a given x ∊ X, if u ∊ TₓX is not the zero vector, then ⟨ u | u ⟩ₓ > 0.

Under these circumstances, this inner product is called a Riemannian metric[36] on the (Riemannian) manifold X.

[35] "Taxicab Geometry: An Adventure in Non-Euclidean Geometry" – E. Krause, 1986.
[36] §1.5, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
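The three rules can be verified mechanically for any concrete candidate. As an illustrative sketch (the matrix A and the random test vectors are arbitrary choices, not from the text), the bilinear form ⟨u|v⟩ = uᵀAv with A symmetric positive-definite satisfies all three:

```python
import random

# Sketch: check the three Riemannian-metric axioms at a single point for the
# candidate inner product <u|v> = u^T A v, where A is symmetric positive-definite.
# A and the random test vectors are illustrative choices only.
A = [[2.0, 0.5],
     [0.5, 1.0]]                 # symmetric, both eigenvalues positive

def inner(u, v):
    return sum(u[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

random.seed(0)
for _ in range(100):
    u, v, w = ([random.uniform(-1, 1) for _ in range(2)] for _ in range(3))
    a, b = random.uniform(-2, 2), random.uniform(-2, 2)
    au_bv = [a * u[i] + b * v[i] for i in range(2)]
    # 1. linearity in the first argument
    assert abs(inner(au_bv, w) - (a * inner(u, w) + b * inner(v, w))) < 1e-12
    # 2. symmetry
    assert abs(inner(u, v) - inner(v, u)) < 1e-12
    # 3. positive-definiteness (the random u is non-zero almost surely)
    assert inner(u, u) > 0
```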
Given a co-ordinate system (θ¹, …, θⁿ) with corresponding basis ∂/∂θ¹(x), …, ∂/∂θⁿ(x) for the tangent space at x ∊ X, a Riemannian metric can be defined in terms of its action on the basis via the coefficients

\[ g_{i,j}(x) = \left\langle \frac{\partial}{\partial\theta^i}(x) \,\middle|\, \frac{\partial}{\partial\theta^j}(x) \right\rangle_x \]

This follows from the representation of any tangent vector u ∊ TₓX as a linear combination of the basis vectors

\[ u = u^1\frac{\partial}{\partial\theta^1}(x) + \dots + u^n\frac{\partial}{\partial\theta^n}(x) \]

combined with the first rule, linearity, for inner products. The inner product of two tangent vectors u,v ∊ TₓX is then given by

\[ \langle u \mid v \rangle_x = \sum_{i,j=1}^{n} g_{i,j}(x)\, u^i v^j \]

Given a path through a manifold, say γ: [0,1] → X, one defines its length using a Riemannian metric as

\[ \mathrm{length}(\gamma) = \int_0^1 \sqrt{\big\langle \gamma'(t) \mid \gamma'(t) \big\rangle_{\gamma(t)}}\; dt \]

where γ′(t) = (θ∘γ)′(t) is the vector field traced out by γ. From this, it is possible to define a distance function[37] on the manifold as

\[ d(x,y) = \inf_{\substack{\gamma:[0,1]\to X \\ \gamma(0)=x,\ \gamma(1)=y}} \mathrm{length}(\gamma) \]

In particular, d(·,·) is a metric and thus has the following properties for all x,y,z ∊ X:

1. d(x,y) ≥ 0, and d(x,y) = 0 if and only if x = y ;
2. d(x,y) = d(y,x) ; and
3. d(x,z) ≤ d(x,y) + d(y,z).

It is seen, as with connections, that notions of distance are not by any means inherent to a given manifold. A vector space may admit any number of inner products, which in turn give rise to different metrics on the manifold. In order to establish a unified framework for statistical manifolds, it is necessary to give some way of specifying a Riemannian metric based upon the family in question itself.

[37] §6.2, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
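The length construction above can be made concrete numerically. A minimal sketch (illustrative only; the upper-half-plane metric g = ς⁻²·I anticipates the hyperbolic geometry of the Normal family discussed later) approximates length(γ) by a Riemann sum, for which the vertical path from (0,1) to (0,e) has length exactly 1:

```python
import math

# Sketch: approximate length(gamma) = integral_0^1 sqrt(g_ij gamma'^i gamma'^j) dt
# by a midpoint Riemann sum, for the upper-half-plane metric g = (1/y^2)*identity.
# The vertical path from (0, 1) to (0, e) then has hyperbolic length exactly 1.
def g(x, y):                           # metric coefficients at the point (x, y)
    return [[1 / y**2, 0.0], [0.0, 1 / y**2]]

def length(gamma, dgamma, n=20000):
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n              # midpoint rule
        p, v = gamma(t), dgamma(t)
        G = g(*p)
        total += math.sqrt(sum(G[i][j] * v[i] * v[j]
                               for i in range(2) for j in range(2))) / n
    return total

gamma  = lambda t: (0.0, math.exp(t))  # vertical path y = e^t
dgamma = lambda t: (0.0, math.exp(t))
assert abs(length(gamma, dgamma) - 1.0) < 1e-6
```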
To this end, Murray & Rice continue by introducing a metric known as the Fisher information metric, a generalisation of the Fisher information[38], defined at a point (more precisely, a density) p in a statistical manifold via the family's expectation at that point of the product of the differentials of the log-likelihood ℓ = log(p), evaluated on the tangent vectors in question:

\[ \langle u \mid v \rangle_p = \mathbb{E}_p\big[ d_p(\ell)(u) \cdot d_p(\ell)(v) \big] \]

Noting that each differential d_p(ℓ) can be expressed in terms of its values on the scores, the matrix for this inner product at any point p on the statistical manifold is given by

\[ g_{i,j}(p) = \left\langle \frac{\partial}{\partial\theta^i}(p) \,\middle|\, \frac{\partial}{\partial\theta^j}(p) \right\rangle_p = \mathbb{E}_p\left[ d_p(\ell)\!\left(\frac{\partial}{\partial\theta^i}(p)\right) \cdot d_p(\ell)\!\left(\frac{\partial}{\partial\theta^j}(p)\right) \right] = \mathbb{E}_p\left[ \frac{\partial\ell}{\partial\theta^i}(p) \cdot \frac{\partial\ell}{\partial\theta^j}(p) \right] \]

where the last equality follows from the definition of the scores. Although it should be clear that this proposed inner product satisfies the first two requirements – linearity and symmetry – because they are passed down from the properties of expectations, it is not clear at first glance that it should necessarily satisfy the third rule – positive-definiteness. However, recalling that the definition of a statistical manifold proposed by Murray & Rice requires that the scores be linearly independent[39], it follows that positive-definiteness is achieved.

As per the discussion in section 4.2 of this thesis, the existence of this Fisher information metric is not guaranteed under the definition of a statistical manifold given by Murray & Rice but is guaranteed under the definition given by Amari & Nagaoka. In the present and subsequent chapters, it is assumed that this metric does exist.

An interesting example given by Murray & Rice of this Riemannian metric is derived from the Normal family's statistical manifold[40]. They show that when parameterised as

\[ \mathcal{N} = \left\{ p(\mu,\varsigma) = \frac{1}{\sqrt{2\pi}\,\varsigma}\exp\left(-\frac{(z-\mu)^2}{2\varsigma^2}\right) \;\middle|\; \mu\in\mathbb{R},\ \varsigma>0 \right\} \]

[38] (5.10), "Theory of Point Estimation" – E. Lehmann and G. Casella, 1998.
[39] §3.2, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
the inner product at the point p(μ,ς) ∊ 𝒩, the probability density function of the random variable Z, is given by

\[ \left\langle \frac{\partial}{\partial\mu}(p) \,\middle|\, \frac{\partial}{\partial\mu}(p) \right\rangle_p = \mathbb{E}_p\left[ \left(\frac{Z-\mu}{\varsigma^2}\right)^{\!2} \right] = \frac{1}{\varsigma^2} \]

\[ \left\langle \frac{\partial}{\partial\mu}(p) \,\middle|\, \frac{\partial}{\partial\varsigma}(p) \right\rangle_p = \mathbb{E}_p\left[ \frac{(Z-\mu)^3}{\varsigma^5} - \frac{Z-\mu}{\varsigma^3} \right] = 0 \]

\[ \left\langle \frac{\partial}{\partial\varsigma}(p) \,\middle|\, \frac{\partial}{\partial\varsigma}(p) \right\rangle_p = \mathbb{E}_p\left[ \left(\frac{(Z-\mu)^2}{\varsigma^3} - \frac{1}{\varsigma}\right)^{\!2} \right] = \frac{2}{\varsigma^2} \]

Although perhaps insignificant in itself, this in fact corresponds to the inner product on the upper half-plane under hyperbolic geometry. In the case of unit variance, ς = 1, this is seen to reduce to standard Euclidean geometry.

6.2 – DIVERGENCES

An alternative and more direct way of defining distances on a manifold X is via a divergence function[41] – a smooth function on the manifold D(·‖·): X×X → ℝ which is non-negative for every (x,y) ∊ X×X and zero only when x = y. In contrast to a metric, a divergence need not be symmetric or satisfy any form of the triangle relation; it has less stringent requirements but as a result can be considered to lose some accuracy in measuring distances. However, divergences still deliver useful information regarding manifolds and, when finding geodesics is out of the question, are often a more tangible alternative.

One of the divergence functions presented by Amari & Nagaoka is the f-divergence, defined on a statistical manifold P for a function f: ℝ → ℝ, which must be convex on ℝ₊ and zero at z = 1, by the integral (or, in the discrete case, the analogous sum) taken over the domain of the family:

\[ D_f(p\|q) = \int p(z)\, f\!\left(\frac{q(z)}{p(z)}\right) dz \]

The requirement that f(1) = 0 is explained by Jensen's inequality, since

\[ D_f(p\|q) \geq f\!\left( \int p(z)\,\frac{q(z)}{p(z)}\, dz \right) = f(1) \]

and so ensures non-negativity of the divergence and vanishing of the divergence only when p = q. Amari & Nagaoka also note the interesting observation that this divergence exhibits a form of convexity.

[40] §6.2.2, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
[41] §3.2, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
Namely, for any constant λ ∊ [0,1] and points p₁,p₂,q₁,q₂ ∊ P in the manifold, the following inequality holds:

\[ D_f\big( \lambda p_1 + (1-\lambda)p_2 \,\big\|\, \lambda q_1 + (1-\lambda)q_2 \big) \leq \lambda\, D_f(p_1\|q_1) + (1-\lambda)\, D_f(p_2\|q_2) \]

Another divergence introduced by Amari & Nagaoka is the α-divergence. As the name suggests, this divergence is related to the α-connection defined earlier, as will be shown in the following chapter of this thesis. For now, define a convex function on ℝ₊, given for some real number α by

\[ f^{(\alpha)}(x) = \begin{cases} \dfrac{4}{1-\alpha^2}\left( 1 - x^{(1+\alpha)/2} \right), & \alpha \neq \pm 1 \\ x\log(x), & \alpha = 1 \\ -\log(x), & \alpha = -1 \end{cases} \]

Then the α-divergence is defined as the f-divergence with respect to this function for a fixed value of α (noting that f^{(α)} vanishes at x = 1 for any value of α). Essentially, this divergence reduces to two cases, dependent on the given value of α. When α ≠ ±1, the divergence of two elements p,q ∊ P is given by the following integral over the sample space Ω of the family P (or the analogous sum, in the case of a discrete family):

\[ D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left( 1 - \int_\Omega p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}\, dx \right) \]

When α = ±1, under the same conditions the formula is given by

\[ D^{(-1)}(p\|q) = D^{(1)}(q\|p) = \int_\Omega p(x)\log\frac{p(x)}{q(x)}\, dx \]

Amari & Nagaoka note that this divergence has several interesting features. For one, the relation between divergences D^{(α)}(p‖q) = D^{(−α)}(q‖p) holds for all real values of α and all p,q ∊ P. Also, it is seen that the 0-divergence corresponds to the square of the so-called Hellinger distance – a metric on probability spaces defined as the square root of

\[ D^{(0)}(p\|q) = 2\int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx \]

Lastly, they note that the ±1-divergences correspond to the oft-employed Kullback-Leibler information[42]. Thus the α-divergence gives a convenient way of relating and generalising these concepts under the more formal framework of differential geometry. It is not immediately apparent, however, whether this serves any practical purpose – an issue which is not directly addressed by Amari & Nagaoka.
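On a finite sample space, the α-divergence is a short computation, and the stated properties can be checked directly. A minimal sketch (p and q are arbitrary test distributions, not from the text) verifying the relation D^{(α)}(p‖q) = D^{(−α)}(q‖p) and the Hellinger form at α = 0:

```python
import math

# Sketch: the alpha-divergence on a finite sample space, for alpha != ±1,
# with numerical checks of D^(alpha)(p||q) = D^(-alpha)(q||p) and of the
# Hellinger form at alpha = 0.  p, q are arbitrary test distributions.
def alpha_div(p, q, alpha):
    s = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2)
            for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * (1 - s)

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
for alpha in [-0.9, -0.5, 0.0, 0.3, 0.7]:
    assert abs(alpha_div(p, q, alpha) - alpha_div(q, p, -alpha)) < 1e-12

# at alpha = 0 the divergence equals 2 * sum (sqrt(p) - sqrt(q))^2
hellinger_sq = 2 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                       for pi, qi in zip(p, q))
assert abs(alpha_div(p, q, 0.0) - hellinger_sq) < 1e-12
```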
6.3 – α-DIVERGENCE OF THE NORMAL FAMILY

To conclude this chapter, the α-divergence (for α ≠ ±1) of the Normal family will be derived directly and shown to exhibit a simple form. Let p,q ∊ 𝒩 be any two elements of the statistical manifold of the Normal family, represented in (μ,ς) co-ordinates as

\[ p = \frac{1}{\sqrt{2\pi}\,\varsigma_1}\exp\left(-\frac{(z-\mu_1)^2}{2\varsigma_1^2}\right), \qquad q = \frac{1}{\sqrt{2\pi}\,\varsigma_2}\exp\left(-\frac{(z-\mu_2)^2}{2\varsigma_2^2}\right) \]

Then by the definition of the α-divergence (α ≠ ±1):

\[ D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left( 1 - \int_{\mathbb{R}} \left( \frac{e^{-(z-\mu_1)^2/2\varsigma_1^2}}{\sqrt{2\pi}\,\varsigma_1} \right)^{\!\frac{1-\alpha}{2}} \left( \frac{e^{-(z-\mu_2)^2/2\varsigma_2^2}}{\sqrt{2\pi}\,\varsigma_2} \right)^{\!\frac{1+\alpha}{2}} dz \right) \]

\[ = \frac{4}{1-\alpha^2}\left( 1 - \frac{\varsigma_1^{\frac{\alpha-1}{2}}}{\varsigma_2^{\frac{\alpha+1}{2}}} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(1-\alpha)(z-\mu_1)^2}{4\varsigma_1^2} - \frac{(1+\alpha)(z-\mu_2)^2}{4\varsigma_2^2} \right) dz \right) \]

\[ = \frac{4}{1-\alpha^2}\left( 1 - \frac{\varsigma_1^{\frac{\alpha-1}{2}}}{\varsigma_2^{\frac{\alpha+1}{2}}} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}} \exp\left( -\left[\frac{1-\alpha}{4\varsigma_1^2} + \frac{1+\alpha}{4\varsigma_2^2}\right] z^2 + \left[\frac{(1-\alpha)\mu_1}{2\varsigma_1^2} + \frac{(1+\alpha)\mu_2}{2\varsigma_2^2}\right] z - \frac{(1-\alpha)\mu_1^2}{4\varsigma_1^2} - \frac{(1+\alpha)\mu_2^2}{4\varsigma_2^2} \right) dz \right) \]

Define the following variables (really, functions of the co-ordinates and α, which are to be considered as constants) for convenience of notation:

\[ a = \frac{\varsigma_1^{\frac{\alpha-1}{2}}}{\varsigma_2^{\frac{\alpha+1}{2}}}, \qquad b = \frac{1-\alpha}{4\varsigma_1^2} + \frac{1+\alpha}{4\varsigma_2^2}, \qquad c = \frac{1}{2b}\left( \frac{(1-\alpha)\mu_1}{2\varsigma_1^2} + \frac{(1+\alpha)\mu_2}{2\varsigma_2^2} \right), \qquad d = \frac{(1-\alpha)\mu_1^2}{4\varsigma_1^2} + \frac{(1+\alpha)\mu_2^2}{4\varsigma_2^2} \]

[42] §1.3, "Information Theory and Statistics" – S. Kullback, 1997.
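The completing-the-square step implicit in the definitions of b, c and d can be confirmed numerically. A minimal sketch (randomised co-ordinates; not part of the derivation itself):

```python
import random

# Sketch: confirm the completing-the-square identity behind b, c, d, i.e. that
#   (1-a)(z-mu1)^2/(4*s1^2) + (1+a)(z-mu2)^2/(4*s2^2) = b*z^2 - 2*b*c*z + d
# for all z, where a denotes alpha.  Co-ordinates are drawn at random.
random.seed(1)
for _ in range(100):
    mu1, mu2 = random.uniform(-3, 3), random.uniform(-3, 3)
    s1, s2 = random.uniform(0.5, 2), random.uniform(0.5, 2)
    alpha = random.uniform(-0.99, 0.99)
    b = (1 - alpha) / (4 * s1**2) + (1 + alpha) / (4 * s2**2)
    c = ((1 - alpha) * mu1 / (2 * s1**2)
         + (1 + alpha) * mu2 / (2 * s2**2)) / (2 * b)
    d = (1 - alpha) * mu1**2 / (4 * s1**2) + (1 + alpha) * mu2**2 / (4 * s2**2)
    z = random.uniform(-5, 5)
    lhs = ((1 - alpha) * (z - mu1)**2 / (4 * s1**2)
           + (1 + alpha) * (z - mu2)**2 / (4 * s2**2))
    assert abs(lhs - (b * z**2 - 2 * b * c * z + d)) < 1e-10
```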
The divergence function then simplifies to

\[ D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left( 1 - a\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\left( -bz^2 + 2bc\,z - d \right) dz \right) \]
\[ = \frac{4}{1-\alpha^2}\left( 1 - a\,e^{bc^2-d}\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\left( -bz^2 + 2bc\,z - bc^2 \right) dz \right) \]
\[ = \frac{4}{1-\alpha^2}\left( 1 - a\,e^{bc^2-d}\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\left( -b(z-c)^2 \right) dz \right) \]

Now, setting w = √(2b)·(z − c), so that √(2b)·dz = dw, gives the following well-known integral (solvable via basic techniques from complex analysis, or from the fact that the integrand represents the density function of a Normal random variable with zero mean and unit variance), which is equal to 1 and thus gives a closed form for the α-divergence of the Normal family:

\[ D^{(\alpha)}(p\|q) = \frac{4}{1-\alpha^2}\left( 1 - \frac{a\,e^{bc^2-d}}{\sqrt{2b}}\int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}}\exp\left( -\frac{w^2}{2} \right) dw \right) = \frac{4}{1-\alpha^2}\left( 1 - \frac{a\,e^{bc^2-d}}{\sqrt{2b}} \right) \]

For α = ±1, the α-divergence is given by the Kullback-Leibler information:

\[ D^{(-1)}(p\|q) = D^{(1)}(q\|p) = \int_{\mathbb{R}} p(x)\log\frac{p(x)}{q(x)}\, dx \]
\[ = \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi}\,\varsigma_1}\exp\left(-\frac{(z-\mu_1)^2}{2\varsigma_1^2}\right) \left[ \log\frac{\varsigma_2}{\varsigma_1} - \frac{(z-\mu_1)^2}{2\varsigma_1^2} + \frac{(z-\mu_2)^2}{2\varsigma_2^2} \right] dz \]
\[ = \log\frac{\varsigma_2}{\varsigma_1} - \frac{1}{2} + \frac{\mathbb{E}_p\big[(Z-\mu_2)^2\big]}{2\varsigma_2^2} = \log\frac{\varsigma_2}{\varsigma_1} + \frac{(\mu_1-\mu_2)^2}{2\varsigma_2^2} + \frac{\varsigma_1^2}{2\varsigma_2^2} - \frac{1}{2} \]

CHAPTER 7 – DUALITY

A concept common to many areas of mathematics is that of duality – generally speaking, a correspondence between related objects that provides an additional layer of interaction between the two. Information geometry is no exception and exhibits varied notions of duality centred about the concept of connections.

7.1 – DUAL CONNECTIONS

Amari & Nagaoka define duality of connections as follows[43]. Given a statistical manifold P with a Riemannian metric ⟨ · | · ⟩ₚ : TₚP × TₚP → ℝ, two (affine) connections ∇ and ∇* are considered to be dual if for any vector fields U, V and W over P the action of the metric at any point p in the statistical manifold P yields

\[ W\langle U \mid V \rangle_p = \big\langle \nabla U(W) \,\big|\, V \big\rangle_p + \big\langle U \,\big|\, \nabla^* V(W) \big\rangle_p \]

At first glance this notation can be somewhat confusing.

[43] §3.1, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
The left-hand side of this relation is interpreted as follows: given a point p ∊ P, the vector field W evaluated at p yields a tangent vector, say

\[ W(p) = w^1(p)\frac{\partial}{\partial\theta^1}(p) + \dots + w^n(p)\frac{\partial}{\partial\theta^n}(p) \]

then, by defining ψ(p) = ⟨ U(p) | V(p) ⟩ₚ to be a function of p alone, it is possible to take the partial derivatives of ψ as dictated by W(p):

\[ w^1(p)\frac{\partial\psi}{\partial\theta^1}(p) + \dots + w^n(p)\frac{\partial\psi}{\partial\theta^n}(p) \;=:\; W\langle U \mid V \rangle_p \]

thus resulting in a real-valued function on the manifold P. The right-hand side is more straightforward, since at any point in the manifold the actions of ∇U and ∇*V on W produce tangent vectors, and so it is just a matter of evaluating the Riemannian metric on the resultant tangent vectors.

Now, if Γʲᵢ,ₖ and Γ*ʲᵢ,ₖ are the Christoffel symbols of two connections and the gᵢ,ⱼ are the coefficients which define the inner product

\[ g_{i,j}(p) = \left\langle \frac{\partial}{\partial\theta^i}(p) \,\middle|\, \frac{\partial}{\partial\theta^j}(p) \right\rangle_p \]

then these connections are dual if the following symbolic relation holds:

\[ \frac{\partial g_{i,j}}{\partial\theta^k}(p) = \sum_{a=1}^{n}\left[ g_{a,j}(p)\,\Gamma^a_{ik}(p) + g_{i,a}(p)\,\Gamma^{a*}_{jk}(p) \right] \]

To see why this relation must hold, consider the most basic form of vector fields – elements of the basis of the tangent bundle for a given co-ordinate system:

\[ U = \frac{\partial}{\partial\theta^i}, \qquad V = \frac{\partial}{\partial\theta^j}, \qquad W = \frac{\partial}{\partial\theta^k} \]

By the definition of the gᵢ,ⱼ and the Christoffel symbols it follows that

\[ W\langle U \mid V \rangle_p = \frac{\partial}{\partial\theta^k}\left\langle \frac{\partial}{\partial\theta^i}(p) \,\middle|\, \frac{\partial}{\partial\theta^j}(p) \right\rangle_p = \frac{\partial g_{i,j}}{\partial\theta^k}(p) \]

\[ \big\langle \nabla U(W) \,\big|\, V \big\rangle_p = \left\langle \nabla\frac{\partial}{\partial\theta^i}\!\left(\frac{\partial}{\partial\theta^k}\right) \,\middle|\, \frac{\partial}{\partial\theta^j} \right\rangle_p = \left\langle \sum_{a=1}^{n}\Gamma^a_{ik}\frac{\partial}{\partial\theta^a} \,\middle|\, \frac{\partial}{\partial\theta^j} \right\rangle_p = \sum_{a=1}^{n}\Gamma^a_{ik}\,g_{a,j}(p) \]

\[ \big\langle U \,\big|\, \nabla^* V(W) \big\rangle_p = \left\langle \frac{\partial}{\partial\theta^i} \,\middle|\, \sum_{a=1}^{n}\Gamma^{a*}_{jk}\frac{\partial}{\partial\theta^a} \right\rangle_p = \sum_{a=1}^{n}\Gamma^{a*}_{jk}\,g_{i,a}(p) \]

and so the desired relation follows from this, the linearity of the inner product, and the decomposition of vector fields in terms of the bases for the tangent bundle.

As an example of dual connections, it turns out that the α-connections defined previously have a very simple dual structure.
Namely, Amari & Nagaoka show that the connections ∇^{(α)} and ∇^{(−α)} respect the dual relation defined above[44] when the metric is the Fisher information metric. Instead of utilising Christoffel symbols directly, Amari & Nagaoka define the α-connections via the functions[45]

\[ \Gamma^{(\alpha)}_{ij,k}(p) = \mathbb{E}_p\left[ \left( \frac{\partial^2\ell_p}{\partial\theta^i\partial\theta^j} + \frac{1-\alpha}{2}\frac{\partial\ell_p}{\partial\theta^i}\frac{\partial\ell_p}{\partial\theta^j} \right)\frac{\partial\ell_p}{\partial\theta^k} \right] \;=:\; \left\langle \nabla^{(\alpha)}\frac{\partial}{\partial\theta^i}\!\left(\frac{\partial}{\partial\theta^j}\right) \,\middle|\, \frac{\partial}{\partial\theta^k} \right\rangle_p \]

and note that they satisfy the equality

\[ \Gamma^{(\alpha)}_{ij,k}(p) = \int \partial_i\partial_j\ell_p^{(\alpha)} \cdot \partial_k\ell_p^{(-\alpha)}\, dx \]

where the functions in the integrand are defined as

\[ \partial_k\ell_p^{(\alpha)}(x) = p(x)^{\frac{1-\alpha}{2}}\,\frac{\partial\ell_p}{\partial\theta^k}(x) \]
\[ \partial_i\partial_j\ell_p^{(\alpha)}(x) = p(x)^{\frac{1-\alpha}{2}}\left( \frac{\partial^2\ell_p}{\partial\theta^i\partial\theta^j}(x) + \frac{1-\alpha}{2}\frac{\partial\ell_p}{\partial\theta^i}(x)\frac{\partial\ell_p}{\partial\theta^j}(x) \right) \]

Since the Fisher information metric is symmetric and has the representation

\[ g_{i,j}(p) = g_{j,i}(p) = \mathbb{E}_p\left[ \frac{\partial\ell}{\partial\theta^i}(p)\frac{\partial\ell}{\partial\theta^j}(p) \right] = \int \partial_i\ell_p^{(\alpha)} \cdot \partial_j\ell_p^{(-\alpha)}\, dx \]

it follows (recalling the assumption in the definition of a statistical manifold by Amari & Nagaoka that one may differentiate with respect to the θᵏ under the integral sign) that

\[ \frac{\partial g_{i,j}}{\partial\theta^k}(p) = \int \left[ \partial_k\partial_i\ell_p^{(\alpha)}\cdot\partial_j\ell_p^{(-\alpha)} + \partial_i\ell_p^{(\alpha)}\cdot\partial_k\partial_j\ell_p^{(-\alpha)} \right] dx = \Gamma^{(\alpha)}_{ki,j}(p) + \Gamma^{(-\alpha)}_{kj,i}(p) \]
\[ = \left\langle \nabla^{(\alpha)}\frac{\partial}{\partial\theta^k}\!\left(\frac{\partial}{\partial\theta^i}\right) \,\middle|\, \frac{\partial}{\partial\theta^j} \right\rangle_p + \left\langle \frac{\partial}{\partial\theta^i} \,\middle|\, \nabla^{(-\alpha)}\frac{\partial}{\partial\theta^k}\!\left(\frac{\partial}{\partial\theta^j}\right) \right\rangle_p \]

and so the conditions for duality are satisfied.

An almost immediate consequence of this duality is that dual connections share the same sense of flatness[46]: if a statistical manifold is flat with respect to ∇ then it is also flat with respect to ∇*. Combined with the previously noted duality of the connections ∇^{(α)} and ∇^{(−α)}, this in turn shows that α-flat families are also (−α)-flat. Another implication of this duality between the α-connections is that ∇^{(0)} is self-dual. Such a connection is referred to as a Riemannian connection[47] or a Levi-Civita connection[48] and is unique for a given Riemannian metric.

[44] Theorem 3.1, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
[45] §2.3, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
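This duality can also be confirmed concretely for the Normal family, using its Fisher metric g = diag(1/ς², 2/ς²) in (μ,ς) co-ordinates and the α-connection Christoffel symbols quoted earlier. The following is a minimal numerical sketch of the coordinate duality relation (a check, not a proof):

```python
# Sketch: verify, for the Normal family with theta = (mu, sigma),
#   d g_ij / d theta^k = sum_a [ g_aj * Gamma^a_ik(alpha) + g_ia * Gamma^a_jk(-alpha) ]
# using the Fisher metric g = diag(1/sigma^2, 2/sigma^2) and the quoted
# alpha-connection symbols (all other symbols vanish).
def gamma(alpha, sigma):
    """Christoffel symbols G[a][i][k] = Gamma^a_{i,k} of the alpha-connection."""
    G = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
    G[0][0][1] = G[0][1][0] = -(alpha + 1) / sigma
    G[1][0][0] = (1 - alpha) / (2 * sigma)
    G[1][1][1] = -(2 * alpha + 1) / sigma
    return G

def check(alpha, sigma, h=1e-6):
    g = lambda s: [[1 / s**2, 0.0], [0.0, 2 / s**2]]
    Gp, Gm = gamma(alpha, sigma), gamma(-alpha, sigma)
    for i in range(2):
        for j in range(2):
            for k in range(2):
                # d g_ij / d theta^k: g depends on sigma (k = 1) only
                dg = ((g(sigma + h)[i][j] - g(sigma - h)[i][j]) / (2 * h)
                      if k == 1 else 0.0)
                rhs = sum(g(sigma)[a][j] * Gp[a][i][k]
                          + g(sigma)[i][a] * Gm[a][j][k] for a in range(2))
                assert abs(dg - rhs) < 1e-6
    return True

assert check(0.5, 0.8) and check(-1.0, 1.7) and check(0.0, 2.5)
```

Note that α = 0 gives a self-dual (Levi-Civita) instance of the same check.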
7.2 – PARALLEL TRANSLATION

Given a vector in ℝⁿ with a base-point, there is a natural method of translating this vector to another point in ℝⁿ such that it remains constant according to the Euclidean connection along the path – translation by the straight line connecting the two points. A generalisation of this is the notion of parallel translation on a manifold. Specifically, suppose that a connection ∇ is defined on a manifold X. Given two points x,y ∊ X and a simple path (one that is injective and has a non-vanishing derivative) γ: [0,1] → X connecting these points, one seeks to construct a vector field Πᵥ for any tangent vector v ∊ T₍γ(0)₎X such that Πᵥ(γ(0)) = v and ∇Πᵥ(γ′(t)) = 0 for all t ∊ [0,1]. The vector field Πᵥ is called the parallel translation of v along γ[49]. What this concept attempts to emulate is the idea of translating (tangent) vectors along a manifold along the "straight lines" defined by the connection ∇. Such a translation gives a tangent vector based at another point in the manifold and an expression dictating how it arrives there.

Although it is simple enough to verify whether or not these conditions are satisfied for a given vector field, it is not so apparent from the definition how to construct such a vector field, or even whether such a vector field exists. However, Murray & Rice show that this amounts only to solving a system of linear first-order differential equations, as follows[50].

[46] §3.3, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
[47] §1.10, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
[48] §6.3, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
Assume that (θ¹, …, θⁿ) is a co-ordinate system on a neighbourhood containing the path γ: [0,1] → X. Using the basis for the tangent space induced by these co-ordinates, one may express the initial vector v ∊ T₍γ(0)₎X and the vector field Πᵥ(γ(t)) as

\[ v = v^1\frac{\partial}{\partial\theta^1}(\gamma(0)) + \dots + v^n\frac{\partial}{\partial\theta^n}(\gamma(0)) \]
\[ \Pi_v(\gamma(t)) = f^1(t)\frac{\partial}{\partial\theta^1}(\gamma(t)) + \dots + f^n(t)\frac{\partial}{\partial\theta^n}(\gamma(t)) \]

where the v¹, …, vⁿ are real-valued constants and the f¹, …, fⁿ are real-valued functions on [0,1]. Now, the third axiom of connections gives the first equality in the expression below for the action of the connection on the vector field Πᵥ, by using the above representation:

\[ \nabla\Pi_v(\gamma'(t)) = \sum_{i=1}^{n}\left[ df^i(\gamma'(t))\frac{\partial}{\partial\theta^i}(\gamma(t)) + f^i(t)\,\nabla\frac{\partial}{\partial\theta^i}(\gamma'(t)) \right] \]
\[ = \sum_{i=1}^{n}\left[ df^i(\gamma'(t))\frac{\partial}{\partial\theta^i}(\gamma(t)) + f^i(t)\sum_{j,k=1}^{n}\Gamma^k_{ij}\,d\theta^j(\gamma'(t))\frac{\partial}{\partial\theta^k}(\gamma(t)) \right] \]
\[ = \sum_{i=1}^{n}\left[ df^i(\gamma'(t)) + \sum_{j,k=1}^{n}\Gamma^i_{kj}\,f^k(t)\,d\theta^j(\gamma'(t)) \right]\frac{\partial}{\partial\theta^i}(\gamma(t)) \]

The second equality here comes from the definition of the Christoffel symbols of a connection and the third from reordering the summations. Therefore, in compact notation for the expressions above, the system of differential equations for the parallel transport becomes

\[ df^i + \sum_{k,j=1}^{n} f^k\,\Gamma^i_{kj}\,d\theta^j = 0, \qquad i = 1, \dots, n \]

with initial conditions given by fⁱ(0) = vⁱ for each i = 1, …, n.

A noteworthy property of parallel translations is in relation to dual connections. If Π is a parallel translation with respect to the connection ∇ and Σ is parallel with respect to the dual connection ∇*, where each translation is defined over the same curve γ, then the definition of duality gives, for any vector field W,

\[ W\langle \Pi \mid \Sigma \rangle_x = \big\langle \nabla\Pi(W) \,\big|\, \Sigma \big\rangle_x + \big\langle \Pi \,\big|\, \nabla^*\Sigma(W) \big\rangle_x = \langle 0 \mid \Sigma \rangle_x + \langle \Pi \mid 0 \rangle_x = 0 \]

In particular, this shows that the rate of change of ⟨ Π | Σ ⟩ₓ along the path γ is zero, and so the inner product must be constant.

[49] §1.6, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
[50] §5.1.1, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
That is to say, inner products are preserved in this situation when subject to a dual pair of parallel translations[51]. If ∇ = ∇* is a Riemannian connection, then this relation implies that for parallel translations over the path γ: [0,1] → X the inner product of any two vectors u,v ∊ T₍γ(0)₎X is constant along γ:

\[ \langle u \mid v \rangle_{\gamma(0)} = \big\langle \Pi_u(\gamma(t)) \,\big|\, \Pi_v(\gamma(t)) \big\rangle_{\gamma(t)}, \qquad t \in [0,1] \]

For an illustrative example of the above discussion, return once more to the Normal family's statistical manifold

\[ \mathcal{N} = \left\{ p(z;\mu,\varsigma) = \frac{1}{\sqrt{2\pi}\,\varsigma}\exp\left(-\frac{(z-\mu)^2}{2\varsigma^2}\right) \;\middle|\; \mu\in\mathbb{R},\ \varsigma>0 \right\} \]

with the non-zero Christoffel symbols for the α-connection:

\[ \Gamma^{1\,(\alpha)}_{1,2} = -\frac{\alpha+1}{\varsigma}, \qquad \Gamma^{2\,(\alpha)}_{1,1} = \frac{1-\alpha}{2\varsigma}, \qquad \Gamma^{2\,(\alpha)}_{2,2} = -\frac{2\alpha+1}{\varsigma} \]

The system of equations for a parallel translation thus reduces to

\[ df^1 + \Gamma^{1\,(\alpha)}_{1,2}\,f^1\,d\theta^2 = df^1 - \frac{\alpha+1}{\varsigma}\,f^1\,d\theta^2 = 0 \]
\[ df^2 + \Gamma^{2\,(\alpha)}_{1,1}\,f^1\,d\theta^1 + \Gamma^{2\,(\alpha)}_{2,2}\,f^2\,d\theta^2 = df^2 + \frac{1-\alpha}{2\varsigma}\,f^1\,d\theta^1 - \frac{2\alpha+1}{\varsigma}\,f^2\,d\theta^2 = 0 \]

Define a simple path from (μ₁,ς₁) to (μ₂,ς₂) in 𝒩 for t ∊ [0,1] by

\[ \theta(\gamma(t)) = (1-t)(\mu_1,\varsigma_1) + t(\mu_2,\varsigma_2) \]

so that the values of the co-ordinate differentials on γ′ are

\[ d\theta^1(\gamma'(t)) = \mu_2 - \mu_1, \qquad d\theta^2(\gamma'(t)) = \varsigma_2 - \varsigma_1 \]

In this case, then, writing ς(t) = (ς₂ − ς₁)t + ς₁ for the second co-ordinate along the path, the differential equation for f¹ is simply

\[ \frac{df^1}{dt} - \frac{\alpha+1}{\varsigma(t)}(\varsigma_2-\varsigma_1)\,f^1 = 0 \]

which is solved in general by

\[ f^1(t) = c_1\,\varsigma(t)^{\alpha+1} \]

From the initial condition that f¹(0) = v¹, the specific solution is thus

\[ f^1(t) = v^1\left(\frac{\varsigma(t)}{\varsigma_1}\right)^{\alpha+1} \]

Equipped with this solution for f¹, the second differential equation becomes

\[ \frac{df^2}{dt} + \frac{(1-\alpha)v^1(\mu_2-\mu_1)}{2\varsigma_1^{\alpha+1}}\,\varsigma(t)^{\alpha} - \frac{2\alpha+1}{\varsigma(t)}(\varsigma_2-\varsigma_1)\,f^2 = 0 \]

which can be solved by computer algebra software to yield the general solution

\[ f^2(t) = \left(\frac{\varsigma(t)}{\varsigma_1}\right)^{2\alpha+1}\left[ c_2 - \frac{(1-\alpha)(\mu_2-\mu_1)v^1\,\varsigma_1^{\alpha}}{2}\int_0^t \varsigma(s)^{-1-\alpha}\,ds \right] \]

where

\[ \int_0^t \varsigma(s)^{-1-\alpha}\,ds = \begin{cases} \dfrac{\varsigma_1^{-\alpha} - \varsigma(t)^{-\alpha}}{\alpha(\varsigma_2-\varsigma_1)}, & \alpha \neq 0 \\[2mm] \dfrac{1}{\varsigma_2-\varsigma_1}\log\dfrac{\varsigma(t)}{\varsigma_1}, & \alpha = 0 \end{cases} \]

The initial condition

[51] Theorem 3.2, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
that f²(0) = v² then just implies that c₂ = v².

7.3 – CANONICAL DIVERGENCE

Although divergence functions are, in general, arbitrary, Amari & Nagaoka show that for a certain class of statistical manifolds there exists a canonical divergence – one defined by the manifold itself, which possesses several worthwhile properties[52]. The condition for the existence of the canonical divergence is that the Riemannian manifold (X, ⟨ · | · ⟩ₓ) be a dually flat space, i.e. flat with respect to dual connections ∇ and ∇*. From this one may assume that there exist co-ordinates θ¹, …, θⁿ and η₁, …, ηₙ for which the connections are flat, that is,

\[ \nabla\frac{\partial}{\partial\theta^i}\!\left(\frac{\partial}{\partial\theta^j}\right) = 0 \quad\text{and}\quad \nabla^*\frac{\partial}{\partial\eta_i}\!\left(\frac{\partial}{\partial\eta_j}\right) = 0 \qquad\text{for all } i,j = 1,\dots,n \]

In fact, it is shown that, given the co-ordinates θ¹, …, θⁿ, by using affine transformations of the co-ordinates η₁, …, ηₙ one may assume that

\[ \left\langle \frac{\partial}{\partial\theta^i}(x) \,\middle|\, \frac{\partial}{\partial\eta_j}(x) \right\rangle_x = \delta_{i,j} \]

for all x ∊ X and for all i,j = 1, …, n, where δᵢ,ⱼ is the Kronecker delta function. For this pair of co-ordinate systems, define two functions ψ,φ: X → ℝ, called potentials, by the systems of partial differential equations

\[ \frac{\partial\psi}{\partial\theta^i}(x) = \eta_i(x), \qquad \frac{\partial\varphi}{\partial\eta_i}(x) = \theta^i(x), \qquad i = 1,\dots,n \]

or equivalently by the relations

\[ \frac{\partial^2\psi}{\partial\theta^i\partial\theta^j}(x) = \left\langle \frac{\partial}{\partial\theta^i} \,\middle|\, \frac{\partial}{\partial\theta^j} \right\rangle_x, \qquad \frac{\partial^2\varphi}{\partial\eta_i\partial\eta_j}(x) = \left\langle \frac{\partial}{\partial\eta_i} \,\middle|\, \frac{\partial}{\partial\eta_j} \right\rangle_x, \qquad i,j = 1,\dots,n \]

Then the canonical divergence for this dually flat manifold is given by

\[ D(x\|y) = \psi(x) + \varphi(y) - \sum_{i=1}^{n}\theta^i(x)\,\eta_i(y) \]

That this function actually possesses the properties required of a divergence follows from the observation given by Amari & Nagaoka that the functions φ and ψ can be expressed as

\[ \psi(x) = \max_{y\in X}\left[ \sum_{i=1}^{n}\theta^i(x)\eta_i(y) - \varphi(y) \right], \qquad \varphi(y) = \max_{x\in X}\left[ \sum_{i=1}^{n}\theta^i(x)\eta_i(y) - \psi(x) \right] \]

and from the fact that the relations

\[ \frac{\partial^2\psi}{\partial\theta^i\partial\theta^j}(x) = \left\langle \frac{\partial}{\partial\theta^i} \,\middle|\, \frac{\partial}{\partial\theta^j} \right\rangle_x \geq 0, \qquad \frac{\partial^2\varphi}{\partial\eta_i\partial\eta_j}(x) = \left\langle \frac{\partial}{\partial\eta_i} \,\middle|\, \frac{\partial}{\partial\eta_j} \right\rangle_x \geq 0 \]

(as matrices) imply that both φ and ψ are convex functions.

[52] §3.4, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
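A concrete dually flat structure can be built from any smooth strictly convex potential. The following sketch (illustrative, not from the text) takes ψ(θ) = Σᵢ exp(θⁱ), whose dual co-ordinates are ηᵢ = ∂ψ/∂θⁱ = exp(θⁱ) and whose dual potential is the Legendre transform φ(η) = Σᵢ(ηᵢ log ηᵢ − ηᵢ), and checks the divergence properties numerically:

```python
import math, random

# Sketch: the canonical divergence of a dually flat space built from the
# convex potential psi(theta) = sum_i exp(theta_i).  The dual co-ordinates
# are eta_i = exp(theta_i) and the dual potential is the Legendre transform
# phi(eta) = sum_i (eta_i * log(eta_i) - eta_i).  Illustrative choice only.
def psi(theta):
    return sum(math.exp(t) for t in theta)

def eta(theta):
    return [math.exp(t) for t in theta]

def phi(e):
    return sum(ei * math.log(ei) - ei for ei in e)

def D(theta_x, theta_y):
    ey = eta(theta_y)
    return psi(theta_x) + phi(ey) - sum(t * e for t, e in zip(theta_x, ey))

random.seed(2)
for _ in range(200):
    x = [random.uniform(-2, 2) for _ in range(3)]
    y = [random.uniform(-2, 2) for _ in range(3)]
    assert D(x, y) >= -1e-12          # non-negativity
    assert abs(D(x, x)) < 1e-12       # vanishes only on the diagonal
```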
As a particular case of a canonical divergence, Amari & Nagaoka note that for a Riemannian connection, where ∇ = ∇* is self-dual and the two co-ordinate systems may be taken to coincide, the potentials are

\[ \psi(x) = \varphi(x) = \frac{1}{2}\sum_{i=1}^{n}\theta^i(x)^2 \]

and so the divergence function

\[ D(x\|y) = \psi(x) + \varphi(y) - \sum_{i=1}^{n}\theta^i(x)\theta^i(y) = \frac{1}{2}\sum_{i=1}^{n}\left( \theta^i(x) - \theta^i(y) \right)^2 \]

corresponds to one half of the square of the standard Euclidean distance.

Perhaps one of the most useful properties of the canonical divergence, given by a theorem[53] of Amari & Nagaoka, is their Pythagorean relation: suppose x,y,z are three points in the dually flat manifold X, that γ₁: [0,1] → X is the geodesic with respect to ∇ connecting x = γ₁(0) and y = γ₁(1), and that γ₂: [0,1] → X is the geodesic with respect to ∇* connecting y = γ₂(0) and z = γ₂(1). If the curves γ₁ and γ₂ are orthogonal at the point y, that is ⟨ γ₁′(1) | γ₂′(0) ⟩ᵧ = 0, then the following triangle relation holds for the canonical divergence:

\[ D(x\|z) = D(x\|y) + D(y\|z) \]

From this theorem they follow with a corollary which may find applications in statistical estimation, when one wishes to assume that the true distributions form a sub-manifold of a dually flat manifold Y. The corollary states that if X ⊂ Y is a ∇*-autoparallel sub-manifold of a dually flat manifold Y (i.e. when restricted to X, ∇* still defines a valid connection), then the canonical divergence D(y‖·): X → ℝ between a given point y ∊ Y and the sub-manifold X is minimised by x ∊ X if and only if the geodesic between x and y is orthogonal at x to every path in X through x (in this case, such a geodesic is said to be orthogonal to X). The point x ∊ X that minimises the canonical divergence as above is said to be the ∇-projection of y onto X. This will be shown later to be of use when projecting onto sub-manifolds in the course of statistical estimation.

[53] Theorem 3.8, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
Essentially, the divergence function is being used to find the closest point in X to an observation lying outside of X in Y, so as to project the observation in a meaningful way onto X[54]. Although theoretically a worthwhile concept, in practice one must realise that this method of divergence minimisation depends on knowledge of several other constructs, which can be complicated to calculate, if not unattainable explicitly. For example, it has been shown that the α-divergences of the Normal statistical manifold can be computed exactly, but that the geodesics on the Normal statistical manifold have a closed form only for α = 0.

7.4 – DUALISTIC STRUCTURE OF EXPONENTIAL FAMILIES

Whilst the previous discussion of dually flat manifolds included some promising results, it also depended on fairly detailed requirements that a manifold must satisfy. For these results to be of any use, then, they must be applicable to at least some non-exotic probability spaces. Fortunately, it can be demonstrated that the results apply to a very large and widely used class of manifolds – that of exponential families[55]. Suppose that such a family consists of probability density functions of the form

\[ p(z;\theta) = \exp\left( C(z) + \sum_{i=1}^{n}\theta^i y_i(z) - \psi(\theta) \right) \]

Then Amari & Nagaoka define (expectation) co-ordinates (η₁, …, ηₙ) on its statistical manifold by

\[ \eta_i\big( p(z;\theta) \big) = \mathbb{E}_\theta\big[ y_i(Z) \big] = \int y_i(z)\,p(z;\theta)\,dz \]

[54] §6.5.1, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
[55] §3.5, "Methods of Information Geometry" – S. Amari and H. Nagaoka, 1993.
and note that the associated potential is just ψ(θ), since

\[ 0 = \int \frac{\partial}{\partial\theta^i} p(z;\theta)\,dz = \int \left( y_i(z) - \frac{\partial\psi(\theta)}{\partial\theta^i} \right)\exp\left( C(z) + \sum_{j=1}^{n}\theta^j y_j(z) - \psi(\theta) \right) dz = \eta_i\big( p(z;\theta) \big) - \frac{\partial\psi(\theta)}{\partial\theta^i} \]

Although little justification is presented by Amari & Nagaoka as to why these are proper co-ordinates, it follows from the fact that the maximum likelihood estimator (MLE) M of an exponential family ℰ is bijective and satisfies the equality

\[ M^{-1}(p) = \left( \frac{\partial\psi(\theta)}{\partial\theta^1}, \dots, \frac{\partial\psi(\theta)}{\partial\theta^n} \right) \]

for any probability density function p ∊ ℰ, and so it is possible to recover the canonical co-ordinates from the expectation co-ordinates in the case of exponential families[56].

It is further noted that the expectation co-ordinates satisfy the requirements of dual co-ordinates to the canonical co-ordinates (θ¹, …, θⁿ), since

\[ g_{i,j}(p) = -\mathbb{E}_p\left[ \frac{\partial^2\ell}{\partial\theta^i\partial\theta^j} \right] = -\mathbb{E}_p\left[ \frac{\partial^2}{\partial\theta^i\partial\theta^j}\left( C(z) + \sum_{k=1}^{n}\theta^k y_k(z) - \psi(\theta) \right) \right] = \frac{\partial^2\psi(\theta)}{\partial\theta^i\partial\theta^j} \]

which Amari & Nagaoka show is equivalent to the requirement that

\[ \left\langle \frac{\partial}{\partial\theta^i} \,\middle|\, \frac{\partial}{\partial\eta_j} \right\rangle = \delta_{i,j} \]

Now, as previously discussed, under canonical co-ordinates any exponential family is flat with respect to the 1-connection, the dual of which is the (−1)-connection. Therefore, the co-ordinate duality of the expectation co-ordinates and the canonical co-ordinates implies that any exponential family is also flat with respect to the (−1)-connection under expectation co-ordinates. Hence, any exponential family is dually flat.

[56] §7.3, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.

7.5 – DIFFERENTIAL DUALITY

An alternative concept of duality is introduced by Murray & Rice, relating tangent vectors and differentials. Since tangent spaces are vector spaces, and differentials are linear functions on tangent spaces, the concept of duality that they propose is similar to the more common concept of vector space duality but is realised through connections[57].
Firstly, it is necessary to establish exactly what a connection on the space of differentials (called the co-tangent bundle and denoted henceforth by Tₓ*X) should be. In line with connections on tangent spaces, Murray & Rice require that such a connection satisfy the following three properties for differentials defined over each point in a manifold:

1. For any differential ωₓ, ∇ωₓ: TₓX → Tₓ*X is a linear function on the tangent space, for every x ∊ X ;
2. For any two differentials ωₓ and νₓ, ∇(ωₓ + νₓ) = ∇ωₓ + ∇νₓ ; and
3. For any differential ωₓ and differentiable function f: X → ℝ, ∇(f·ω)ₓ(γ′) = df(γ′)·ωₓ + f(x)·∇ωₓ(γ′) for all γ′ ∊ TₓX and all x ∊ X.

As with connections on the tangent bundle of a manifold X, given a co-ordinate system (θ¹, …, θⁿ) on X it is possible to expand a differential ωₓ ∊ Tₓ*X as

\[ \omega_x = \omega_1(x)\,d\theta^1 + \dots + \omega_n(x)\,d\theta^n \]

for real-valued functions ω₁, …, ωₙ on X. Thus the action of a connection on ωₓ can be expanded, by the properties of connections on the co-tangent bundle, as

\[ \nabla\omega_x(\gamma') = \sum_{i=1}^{n}\left[ d\omega_i(\gamma')\,d\theta^i(x) + \omega_i(x)\,\nabla d\theta^i(\gamma') \right] \]

Therefore, since the differentials dω₁ through dωₙ are already well-defined, connections over co-tangent bundles are determined by their values on the differentials dθⁱ. Since ∇dθⁱₓ(γ′) is by definition a differential for any tangent vector γ′ ∊ TₓX, and so expandable in terms of the dθʲₓ, writing

\[ \nabla d\theta^i_x(\gamma') = \sum_{j=1}^{n} A^i_j(\gamma')\,d\theta^j_x \]

results in the representation of such connections via the Christoffel symbols

\[ A^i_j(\gamma') = \sum_{k=1}^{n} \Gamma^i_{k,j}\,d\theta^k_x(\gamma') \]

which follows from the linearity of ∇ωₓ on tangent vectors.

[57] §4.7, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
The Christoffel symbols themselves can either be arbitrary constants for each point x ∊ X, perhaps satisfying some regularity conditions if a particular type of connection is desired, or else defined from a given connection by the equality

\[ \Gamma^i_{j,k} = \nabla d\theta^i_x\!\left(\frac{\partial}{\partial\theta^j}\right)\!\left(\frac{\partial}{\partial\theta^k}\right) \]

That is, one evaluates ∇dθⁱₓ on the j-th co-ordinate derivative to form a differential, and then applies said differential to the k-th co-ordinate derivative. To see why this is, let γ′ and ς′ be any two tangent vectors in TₓX, expanded in terms of the co-ordinate basis as

\[ \gamma' = \gamma^1\frac{\partial}{\partial\theta^1} + \dots + \gamma^n\frac{\partial}{\partial\theta^n}, \qquad \varsigma' = \varsigma^1\frac{\partial}{\partial\theta^1} + \dots + \varsigma^n\frac{\partial}{\partial\theta^n} \]

Then, by the bilinearity of each ∇dθⁱₓ on TₓX, the following equalities hold:

\[ \nabla d\theta^i_x(\gamma')(\varsigma') = \sum_{j=1}^{n} A^i_j(\gamma')\,d\theta^j_x(\varsigma') = \sum_{j,k=1}^{n} \Gamma^i_{k,j}\,d\theta^k_x(\gamma')\,d\theta^j_x(\varsigma') = \sum_{a,b=1}^{n}\gamma^a\varsigma^b\,\Gamma^i_{a,b} \]

Expanding ∇dθⁱₓ(γ′)(ς′) directly then yields the equality

\[ \nabla d\theta^i_x(\gamma')(\varsigma') = \sum_{a,b=1}^{n}\gamma^a\varsigma^b\,\nabla d\theta^i_x\!\left(\frac{\partial}{\partial\theta^a}\right)\!\left(\frac{\partial}{\partial\theta^b}\right) = \sum_{a,b=1}^{n}\gamma^a\varsigma^b\,\Gamma^i_{a,b} \]

from which the desired relation follows.

In order to define their concept of duality, Murray & Rice introduce a form of the Leibniz rule for differentials, based upon connections on tangent bundles and co-tangent bundles, having now defined the latter[58]. In particular, they relate a given vector field V over X, a differential ωₓ ∊ Tₓ*X and a tangent vector γ′ ∊ TₓX by the following differential formula (where the over-bar is used to denote the co-tangent bundle connection):

\[ d\big( \omega_x(V(x)) \big)(\gamma') = \overline{\nabla}\omega_x(\gamma')\big( V(x) \big) + \omega_x\big( \nabla V(\gamma') \big) \]

[58] §4.8, "Differential Geometry and Statistics" – M. Murray and J. Rice, 1993.
Since the action of each dθi_x on each basis vector of TxX under a given co-ordinate system is by assumption given by the relation

dθi_x(∂/∂θj (x)) = δ_{i,j}

and so is constant, this Leibniz rule gives rise to the concept of duality of the connections ∇ and ∇̄ presented by Murray & Rice through the following equality:

0 = d( dθi_x(∂/∂θj (x)) )(γ′) = ∇̄dθi_x(γ′)(∂/∂θj (x)) + dθi_x(∇(∂/∂θj)(γ′))

Explicitly, writing Γ̄ for the Christoffel symbols of the co-tangent bundle connection ∇̄ and Γ for those of the tangent bundle connection ∇, and taking γ′ = ∂/∂θk, this equality relates the two by

Γ̄^i_{k,j} = ∇̄dθi_x(∂/∂θk (x))(∂/∂θj (x)) = −dθi_x( ∇(∂/∂θj)(∂/∂θk (x)) ) = −dθi_x( Σ_{a=1}^n Γ^a_{k,j} ∂/∂θa (x) ) = −Γ^i_{k,j}

The motivation for defining duality in this way arises from parallel translations. Supposing that for each tangent vector u∊T_{ρ(0)}X there exists a parallel translation along a path ρ: [0,1] → X, say Π_{ρ(t)}(u), which establishes a bijective correspondence between tangent spaces, then one may define the action of a parallel-translated differential ω_{ρ(0)}∊T_{ρ(0)}*X on a tangent vector v = Π_{ρ(t)}(u)∊T_{ρ(t)}X to be

Π_{ρ(t)}(ω_{ρ(0)})(v) = ω_{ρ(0)}(u)

Effectively, this parallel translation of differentials is defined by taking the action of the differential on the reversely translated tangent vector, which lies in the appropriate tangent space. Note that this does indeed define a parallel translation, analogously to parallel translations of tangent vectors, since the action of the parallel-translated differential over a path γ(t) on its vector field γ′(t) is constant and so its connection vanishes:

Π_{γ(t)}(ω)(γ′(t)) = ω_{γ(0)}(γ′(0))  ⇒  ∇̄(Π_{γ(t)}(ω))(γ′(t)) = 0

By considering parallel translations in terms of the implied differential equations, Murray & Rice claim that this bijectivity between tangent spaces is realised 59.
Moreover, defining the inverse of a parallel translation on the co-tangent bundle as

Π^{-1}_{γ(t)}(ω_{γ(t)}) ≔ { ν ∈ T_{γ(0)}*X | Π_{γ(t)}(ν) = ω_{γ(t)} }

which consists of precisely one element due to bijectivity, they also claim that this parallel translation of differentials defines a connection via the relation

∇̄ω_{γ(0)}(γ′(0)) = d/dt [ Π^{-1}_{γ(t)}(ω_{γ(t)}) ] |_{t=0}

At first glance, this formula may be less than transparent in meaning. Recall that a co-tangent bundle connection produces for every tangent vector γ′∊TxX a differential ∇̄ωx(γ′)∊Tx*X. The right hand side of this relation takes the derivative with respect to t of a differential, say ν(t)∊T_{γ(0)}*X, hence it is necessary to firstly show that this results in another differential ν′(t)∊T_{γ(0)}*X. Write ν(t) = ν1(t)dθ1 + … + νn(t)dθn, then the derivative is

d/dt [ Π^{-1}_{γ(t)}(ω_{γ(t)}) ] |_{t=0} = d/dt [ ν(t) ] |_{t=0} = d/dt [ ν1(t)dθ1 + ⋯ + νn(t)dθn ] |_{t=0} = ν1′(0)dθ1 + ⋯ + νn′(0)dθn

and so defines another differential as required. That the action of this proposed connection is linear with respect to differentials follows simply from the linearity of differentiation. Lastly, given any differentiable function f: X → ℝ, this formula yields

∇̄(f(γ(0))∙ω_{γ(0)})(γ′(0)) = d/dt [ Π^{-1}_{γ(t)}(f(γ(t))∙ω_{γ(t)}) ] |_{t=0}
  = d/dt [ f(γ(t))∙ν(t) ] |_{t=0}
  = d/dt [ f(γ(t))∙ν1(t)dθ1 + ⋯ + f(γ(t))∙νn(t)dθn ] |_{t=0}
  = Σ_{i=1}^n [ df(γ′(0))∙νi(0)dθi + f(γ(0))∙νi′(0)dθi ]
  = df(γ′(0))∙ω_{γ(0)} + f(γ(0))∙∇̄ω_{γ(0)}(γ′(0))

where the last equality follows from the fact that

Π^{-1}_{γ(0)}(ω_{γ(0)}) = { ν ∈ T_{γ(0)}*X | Π_{γ(0)}(ν) = ω_{γ(0)} }

is by definition just ω_{γ(0)} itself, so that ν(0) has the same components as ω_{γ(0)}. Therefore, since all the prerequisites are satisfied, this does indeed define a connection on the co-tangent bundle.

59 §5.1.1, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
Returning to the original issue of motivating the definition of duality of connections on tangent and co-tangent bundles via the Leibniz rule

d(ωx(V(x)))(γ′) = ∇̄ωx(γ′)(V(x)) + ωx(∇V(γ′))

Murray & Rice show that by writing the action of the differential ω_{γ(t)} on the vector field V(γ(t)) as

ω_{γ(t)}(V(γ(t))) = ω_{γ(t)}( Π_{γ(t)} Π^{-1}_{γ(t)} V(γ(t)) ) = Π^{-1}_{γ(t)}(ω_{γ(t)})( Π^{-1}_{γ(t)} V(γ(t)) )

this Leibniz rule now follows naturally 60 (the second last equality below being an application of the product rule):

d(ω_{γ(0)}(V(γ(0))))(γ′(0)) = d/dt [ ω_{γ(t)}(V(γ(t))) ] |_{t=0}
  = d/dt [ Π^{-1}_{γ(t)}(ω_{γ(t)})( Π^{-1}_{γ(t)} V(γ(t)) ) ] |_{t=0}
  = ( d/dt [ Π^{-1}_{γ(t)}(ω_{γ(t)}) ] |_{t=0} )(V(γ(0))) + ω_{γ(0)}( d/dt [ Π^{-1}_{γ(t)} V(γ(t)) ] |_{t=0} )
  = ∇̄ω_{γ(0)}(γ′(0))(V(γ(0))) + ω_{γ(0)}(∇V(γ′(0)))

60 §5.1.1, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.

CHAPTER 8 – SURVEY OF APPLICATIONS

Interest in the theory of information geometry is at least partially motivated by the possibility of gaining a fresh perspective on the well-known topics of probability and statistics. As alluded to previously in this thesis, several applications in the field of statistics have emerged over the years as information geometry has established itself. Primarily, these new methods relate to statistical estimation.

8.1 – AFFINE IMMERSIONS

There exists a famous theorem of Whitney 61 in the field of differential topology which states that all manifolds satisfying a certain condition admit an immersion into real space. Essentially, this theorem gives a way of representing a given manifold in real space which, for manifolds of small enough dimension, permits one to visualise the manifold – much like how one visualises the 2-manifold that is the unit sphere in 3-dimensional real space. Suppose that X and Y are manifolds and that f: X → Y is a map from X to Y. The map f is an immersion if for each point x∊X its derivative dxf: TxX → Tf(x)Y is an injective map. Then Whitney's theorem is stated as follows:

Any smooth manifold X of dimension m > 1 admits a function f: X → ℝ^{2m−1} which is an immersion.
A consequence of this theorem is that any 2-dimensional manifold may be visualised in 3-dimensional real space. Note, however, that there is no guarantee that this map is injective. The Klein bottle is an example of such a manifold – although it admits immersions into 3-space, such maps always produce a self-intersecting surface 62. As manifolds, statistical manifolds are no exception to Whitney's theorem. However, using the framework endowed by information geometry it is possible to improve upon Whitney's result in certain cases, as specified by a proposition of Arwini & Dodson 63.

61 §1.8, "Differential Topology" – V Guillemin and A Pollack, 1974.
62 §4, "A Topological Aperitif" – S Huggett and D Jordan, 2001.
63 §3.4, "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.

Before this can be presented, however, a definition from topology is required. Given a topological space X, if any two points x,y∊X can be joined by a path γ1: [0,1] → X and any two such paths between x and y, say γ1 and γ2, admit a continuous deformation H: [0,1]×[0,1] → X such that H(0,t) = γ1(t), H(1,t) = γ2(t), H(s,0) = γ1(0) and H(s,1) = γ1(1), then the space X is said to be simply connected. For a more in-depth discussion of this topic see, for example, Hatcher 64.

Now, let P be a dually flat statistical manifold with respect to a Riemannian metric 〈 · | · 〉p: TpP×TpP → ℝ and connections ∇ and ∇* which have flat co-ordinates (θ1, … , θn) and (η1, … , ηn), respectively. As in section 7.3 of this thesis, define the potential function φ: P → ℝ by either of the following two equivalent relations:

∂φ/∂ηi (p) = θi(p)    i = 1, … , n
∂²φ/∂ηi∂ηj (p) = 〈 ∂/∂ηi | ∂/∂ηj 〉p    i, j = 1, … , n

Then if P is simply connected the map f: P → ℝ^{n+1} where

f(θ1, … , θn) = (θ1, … , θn, φ(θ))

is an immersion.
As a special case of this proposition, note as in section 7.4 of this thesis that exponential families are dually flat with respect to Amari & Nagaoka's 1-connection and (–1)-connection under the Fisher information metric. Thus if an exponential family P consists of probability density functions of the form

p(z; θ) = exp( C(z) + Σ_{i=1}^n θi yi(z) − φ(θ) )

then the potential function in this case is just φ and so the immersion into ℝ^{n+1} is just f(θ1, … , θn) = (θ1, … , θn, φ(θ)). Some examples of these immersions are presented by Arwini & Dodson for various statistical manifolds under the 1-connection and (–1)-connection and the Fisher information metric. If 𝒩 is the Normal family of probability distributions under the canonical co-ordinates θ = (θ1, θ2) = (μ/ς², –1/(2ς²)) then it takes the form

𝒩 = { p(z; θ) = exp( θ1∙z + θ2∙z² − ½ log(−π/θ2) + (θ1)²/(4θ2) ) | θ1∊ℝ, θ2 < 0 }

64 §1.1, "Algebraic Topology" – A Hatcher, 2002.

Therefore, the potential function is

φ(θ1, θ2) = ½ log(−π/θ2) − (θ1)²/(4θ2)    θ1∊ℝ, θ2 < 0

and so Arwini & Dodson's immersion 65 into ℝ³ is f: ℝ×(–∞, 0) → ℝ³ given by

f(θ1, θ2) = ( θ1, θ2, ½ log(−π/θ2) − (θ1)²/(4θ2) )

An important family of probability distributions that is applied to many modelling situations 66 due to its flexibility is the Beta family, which consists of probability density functions of the form

ℬ = { p(z; α, β) = (1/B(α,β)) z^{α−1} (1 − z)^{β−1} | α, β > 0 }

where the values of z lie in (0,1) and B(α,β) is the beta function

B(α, β) = Γ(α)Γ(β)/Γ(α+β) = ∫₀¹ u^{α−1} (1 − u)^{β−1} du

Despite what the form of the Beta family's distributions may suggest, this is in fact an exponential family with

y1(z) = log(z),  y2(z) = log(1 − z),  θ1 = α − 1,  θ2 = β − 1,  C(z) = 0,  φ(θ) = log B(θ1 + 1, θ2 + 1)

since substituting these into the exponential form gives

exp( C(z) + θ1∙y1(z) + θ2∙y2(z) − φ(θ) ) = exp( (α − 1)log(z) + (β − 1)log(1 − z) − log B(θ1 + 1, θ2 + 1) ) = (1/B(α,β)) z^{α−1} (1 − z)^{β−1}

which is a probability distribution function of the Beta family, as required.
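The change of parameters above is easy to check numerically. The following sketch (the function names are illustrative) evaluates the Beta density both directly and in its exponential-family form, computing log B(θ1+1, θ2+1) via the log-gamma function:

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_pdf_exponential_form(z, theta1, theta2):
    # p(z; theta) = exp(theta1*y1(z) + theta2*y2(z) - phi(theta)) with
    # y1 = log z, y2 = log(1 - z), phi(theta) = log B(theta1 + 1, theta2 + 1)
    phi = log_beta(theta1 + 1, theta2 + 1)
    return math.exp(theta1 * math.log(z) + theta2 * math.log(1 - z) - phi)

def beta_pdf_direct(z, alpha, beta):
    return z**(alpha - 1) * (1 - z)**(beta - 1) / math.exp(log_beta(alpha, beta))
```

With θ1 = α − 1 and θ2 = β − 1 the two evaluations agree, and the exponential form integrates to one over (0,1), confirming that φ is the correct potential.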
Therefore, under the 1-connection and (–1)-connection and the Fisher information metric, the Beta family can be immersed into ℝ³ via the map f: (–1,∞)×(–1,∞) → ℝ³ defined as

f(θ1, θ2) = ( θ1, θ2, log B(θ1 + 1, θ2 + 1) )

Examples of the graphs of both of these immersions are given below.

65 §3.4, "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.
66 "Handbook of Beta distribution and its applications" – A Gupta and S Nadarajah, 2004.

(Immersion of Normal statistical manifold into ℝ³)

(Immersion of Beta statistical manifold into ℝ³)

8.2 – PROJECTIONS ONTO SUB-MANIFOLDS

Statistical inference is the study of estimating probability distributions and related information from data, in some cases assuming that the distributions are from a given family of distribution functions. For example, a commonly estimated quantity is the sample mean – given data points (z1, … , zn) and assuming that each of these observations was independent of the others and all the observations were generated by the same random variable, the sample mean is defined as

μ = (1/n) Σ_{i=1}^n zi

and is used to estimate the mean value of the random variable in question. In some cases, knowledge of this expected value is enough to determine a random variable's probability distribution from a given family, such as the Poisson distribution Pn(λ) (c.f. section 1.1) – a one-parameter family of distributions with expected value equal to λ. For a more detailed discussion of this topic see, amongst others, Casella and Berger 67.

Now, suppose that one wishes to estimate a random variable from a known family of distributions P and that this family is part of a larger family Q. If the estimated distribution ê is not an element of the family P but is a part of the larger family Q, then it is reasonable to try to generate a new estimate from ê that does lie in P.
An example of this concept is the estimation of probability distribution functions from the Normal family 𝒩(μ,ς), classified by its mean μ and standard deviation ς, when there is assumed to be a dependence between these two parameters, say ς = f(μ), or when one of the parameters is assumed to be a predefined constant. If the estimated distribution function does not satisfy these assumptions then one may wish to transform the estimate into a distribution function that does.

One method of achieving this is via minimisation of divergences. Given a statistical manifold Q with a divergence function D: Q×Q → ℝ, for a particular point q∊Q and a sub-manifold P⊂Q the aim is to minimise the function f: P → ℝ where f(p) = D(q,p). Note that there is no inherent requirement that P should be an actual sub-manifold instead of merely a subset of Q – the function f can be minimised regardless of this – but there are advantages to making this assumption, as will now be presented.

67 "Statistical Inference" – G Casella and R Berger, 2001.

Amari & Nagaoka prove the following 68:

Let (Q, 〈 · | · 〉q) be a dually flat n-dimensional Riemannian manifold with respect to the connections ∇ and ∇* and the co-ordinates (θ1, … , θn) and (η1, … , ηn). If D: Q×Q → ℝ is the canonical divergence

D(p‖q) = ψ(p) + φ(q) − Σ_{i=1}^n θi(p)ηi(q)

defined for the potentials ψ: Q → ℝ and φ: Q → ℝ which are obtained from the differential relations

∂ψ/∂θi = ηi    i = 1, … , n
∂φ/∂ηi = θi    i = 1, … , n

then for any point q∊Q and sub-manifold P⊂Q the function f(p) = D(q,p) on P is minimised by the point ê∊P if and only if the geodesic from q to ê with respect to the connection ∇ is orthogonal to P at ê with respect to the inner-product 〈 · | · 〉ê.

Recall that a path γ: [0,1] → Q is a ∇-geodesic when the vector field that it traces out, γ′: [0,1] → T_{γ(t)}Q, satisfies the equality ∇γ′(γ′) = 0.
This path is orthogonal to P if for any path ν: [0,1] → P in P where ν(a) = ê the inner product 〈 γ′(1) | ν′(a) 〉ê of the two vector fields is zero at ê.

To demonstrate this theorem in practice, assume that Q is the statistical manifold of an exponential family under the canonical co-ordinates and 〈 · | · 〉q is the Fisher information metric. Then, as explained in section 5.3 of this thesis, Q is flat with respect to Amari & Nagaoka's 1-connection (implying that the Christoffel symbols below vanish) and so Murray & Rice's system of partial differential equations that any ∇(1)-geodesic γ: [0,1] → Q must satisfy reduces to

γ̈i + Σ_{j,k=1}^n Γ^i_{j,k} γ̇j γ̇k = γ̈i = 0

where the γi are defined to be the co-ordinates of θ(γ(t)) = (γ1(t), … , γn(t)). This implies that the ∇(1)-geodesics must be linear in the sense that γi(t) = ai + bi∙t for constants (a1, b1), … , (an, bn). In terms of an exponential family

p(z; θ) = exp( C(z) + Σ_{i=1}^n θi yi(z) − φ(θ) )

this becomes

γ(t) = exp( C(z) + Σ_{i=1}^n (ai + bi∙t) yi(z) − φ(a1 + b1∙t, … , an + bn∙t) )

68 Theorem 3.10, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.

Fix a point q∊Q, let ν: [0,1] → P be a path in the sub-manifold P, assume that ν(s) = γ(1) for some given s∊[0,1] and that γ(0) = q. This implies that

θ(ν(s)) = θ(γ(1)) = (a1 + b1, … , an + bn) = ( θ1(q) + [θ1(ν(s)) − θ1(q)], … , θn(q) + [θn(ν(s)) − θn(q)] )

and so ai = θi(q) and bi = θi(ν(s)) − θi(q) for each i = 1, … , n. Then the inner product of the two vector fields

ν′(t) = ν̇1(t) ∂/∂θ1 + ⋯ + ν̇n(t) ∂/∂θn
γ′(t) = b1 ∂/∂θ1 + ⋯ + bn ∂/∂θn

at the point ν(s) = γ(1) = p is

〈 γ′(1) | ν′(s) 〉p = Σ_{i,j=1}^n g_{i,j}(p) bi ν̇j(s) = Σ_{i,j=1}^n g_{i,j}(p) [θi(ν(s)) − θi(q)] ν̇j(s)

where the g_{i,j} are the coefficients defined in section 6.1 as

g_{i,j}(p) = 〈 ∂/∂θi (p) | ∂/∂θj (p) 〉p

This expression for 〈 γ′(1) | ν′(s) 〉p must be set equal to zero and solved, but without any further assumptions about the manifolds P and Q and the coefficients g_{i,j} no further simplifications can be made.
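For the Normal family these linear-in-θ geodesics can be written down explicitly. The following sketch (names are illustrative) converts between the (μ,ς) and canonical co-ordinates and interpolates linearly in the latter:

```python
import math

def to_canonical(mu, sigma):
    # theta1 = mu / sigma^2, theta2 = -1 / (2 sigma^2)
    return (mu / sigma**2, -1.0 / (2 * sigma**2))

def from_canonical(t1, t2):
    # invert: sigma^2 = -1 / (2 theta2), mu = theta1 * sigma^2
    sigma2 = -1.0 / (2 * t2)
    return (t1 * sigma2, math.sqrt(sigma2))

def e_geodesic(p0, p1, t):
    """Point at time t on the geodesic that is linear in canonical co-ordinates,
    joining the Normal densities with (mu, sigma) parameters p0 and p1."""
    a = to_canonical(*p0)
    b = to_canonical(*p1)
    g = tuple(ai + (bi - ai) * t for ai, bi in zip(a, b))
    return from_canonical(*g)
```

Note that the path traced out in (μ,ς) co-ordinates is in general curved, even though it is a straight line in the canonical co-ordinates.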
As an example, assume further that Q is the statistical manifold of the Normal family. It was shown in section 6.1 that the coefficients g_{i,j} with respect to the (ρ1, ρ2) = (μ,ς) co-ordinates are

g_{1,1} = 1/ς²    g_{1,2} = g_{2,1} = 0    g_{2,2} = 2/ς²

However, the coefficients needed now are those with respect to the canonical co-ordinates. In order to calculate these coefficients in terms of the canonical co-ordinates one may use the transformation equation noted by Amari & Nagaoka 69 as

g̃_{k,l} = Σ_{i,j=1}^n g_{i,j} (∂ρi/∂θk)(∂ρj/∂θl)

where the left hand coefficients are those with respect to the canonical co-ordinates. Using μ = −θ1/(2θ2) and ς = (−2θ2)^{−1/2}, these coefficients work out to be

g̃_{1,1} = −1/(2θ2)
g̃_{1,2} = g̃_{2,1} = θ1/(2(θ2)²)
g̃_{2,2} = −(θ1)²/(2(θ2)³) + 1/(2(θ2)²)

(Equivalently, since the Fisher information metric of an exponential family in canonical co-ordinates is the Hessian of the potential, g̃_{i,j} = ∂²φ/∂θi∂θj for the potential φ of section 8.1.) And so, writing t1 = θ1(ν(s)), t2 = θ2(ν(s)), q1 = θ1(q) and q2 = θ2(q) for brevity, the inner product in question becomes

〈 γ′(1) | ν′(s) 〉p = Σ_{i,j=1}^2 g̃_{i,j}(p) [θi(ν(s)) − θi(q)] ν̇j(s)
  = [ −(t1 − q1)/(2t2) + t1(t2 − q2)/(2t2²) ] ν̇1(s) + [ t1(t1 − q1)/(2t2²) + ( −t1²/(2t2³) + 1/(2t2²) )(t2 − q2) ] ν̇2(s)
  = [ (q1∙t2 − t1∙q2) / (2t2²) ] ν̇1(s) + [ (t1²∙q2 − t1∙t2∙q1 + t2² − t2∙q2) / (2t2³) ] ν̇2(s)

which still does not have any obvious values for ν̇1(s), ν̇2(s), t1 and t2 in general such that this inner product is zero. If given more information about the sub-manifold P⊂Q, and hence restrictions on the possible values of these terms, then further simplifications may become apparent and lead to a solution.

69 (1.22), "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
For example, if the situation is simplified further by assuming that the sub-manifold P consists of the probability density functions for which one of the parameters, θ1 = ζ∊ℝ or θ2 = ξ<0, is held fixed, then either ν̇1(t) or ν̇2(t), respectively, must be zero for all t∊[0,1] if ν(t) is to lie in P. In the first case the vanishing of the inner product requires the coefficient of ν̇2(s) to vanish, which with θ1(ν(s)) = ζ reads

ζ²∙θ2(q) − ζ∙θ2(ν(s))∙θ1(q) + θ2(ν(s))² − θ2(ν(s))∙θ2(q) = 0

a quadratic in θ2(ν(s)) whose admissible (negative) root is

θ2(ν(s)) = ½ [ ζ∙θ1(q) + θ2(q) − √( (ζ∙θ1(q) + θ2(q))² − 4ζ²∙θ2(q) ) ]

In the second case the coefficient of ν̇1(s) must vanish, giving θ1(q)∙θ2(ν(s)) − θ1(ν(s))∙θ2(q) = 0, that is

θ1(ν(s)) = ξ∙θ1(q)/θ2(q)

Substituting these values into the canonical form of the Normal family results in the following probability density functions from the two Normal sub-manifolds, respectively:

p(z; ζ, θ2(ν(s))) = exp( ζ∙z + θ2(ν(s))∙z² − ½ log(−π/θ2(ν(s))) + ζ²/(4θ2(ν(s))) )
p(z; θ1(ν(s)), ξ) = exp( (ξ∙θ1(q)/θ2(q))∙z + ξ∙z² − ½ log(−π/ξ) + ξ∙θ1(q)²/(4θ2(q)²) )

What the preceding example should illustrate is that in actually implementing this method of projection onto sub-manifolds based upon divergence minimisation there is the potential for complications to arise in all but the simplest cases. Of course, on a case-by-case basis, it may be possible to use various algorithms to solve these equations, but a detailed discussion of this lies outside the scope of this thesis.

It is encouraging to see that there has in fact been some usage of this theorem of Amari & Nagaoka in the international mathematical community for the purposes of statistical inference. For example, Hirose & Komaki have presented a Least Angle Regression algorithm70 based upon these information geometry principles that, under certain assumptions, iteratively produces the desired divergence-minimising point of a given sub-manifold. On the other hand, Zhong, Sun & Zhang have applied this theorem to stochastic control theory and developed an algorithm intended to locate optimal approximations of control functions71.
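The projection onto the sub-manifold with θ2 = ξ held fixed can be checked against a direct minimisation of the canonical divergence of the Normal family, written here as a Bregman-type divergence of the potential φ in canonical co-ordinates. This is a sketch under those assumptions; the function names are illustrative:

```python
import math

def phi(t1, t2):
    # potential of the Normal family in canonical co-ordinates
    return 0.5 * math.log(-math.pi / t2) - t1**2 / (4 * t2)

def eta(t1, t2):
    # dual (expectation) co-ordinates: eta1 = mu, eta2 = mu^2 + sigma^2
    s2 = -1.0 / (2 * t2)
    mu = t1 * s2
    return (mu, mu**2 + s2)

def canonical_divergence(q, p):
    # D(q || p) = phi(theta_q) - phi(theta_p) - sum_i eta_i(p) (theta_i(q) - theta_i(p))
    e = eta(*p)
    return phi(*q) - phi(*p) - sum(ei * (qi - pi) for ei, qi, pi in zip(e, q, p))

def project_onto_fixed_theta2(q, xi):
    # closed-form projection from the orthogonality condition:
    # theta1 = xi * theta1(q) / theta2(q), theta2 = xi
    return (xi * q[0] / q[1], xi)
```

A brute-force grid search of the divergence along the sub-manifold recovers the same point as the closed-form orthogonality condition.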
70 "An extension of Least Angle Regression based on the information geometry of dually flat spaces" – Y Hirose and F Komaki, 2010.
71 "An information geometry algorithm for distribution control" – F Zhong, H Sun and Z Zhang, 2008.

8.3 – GEODESICS FOR STATISTICAL INFERENCE

When estimating probability density functions from supplied datasets, one method of quality assessment is given by hypothesis testing. The basic premise is that for a given set of independent observations assumed to originate from the same random variable, one evaluates the probability of such a dataset occurring under the assumption that the random variable has the proposed probability density function. If this probability falls below a predefined cut-off point, then the proposed probability density function is rejected. For a more thorough treatment of this theory, refer to Hogg and Tanis72, for example.

An alternative to this method may be constructed via the framework afforded by information geometry. Specifically, using geodesics, Riemannian metrics or divergences, as per sections 5 and 6 of this thesis, it is possible to define distance functions on statistical manifolds which may be used to assess the suitability of probability density functions for given datasets. Given such a distance function d( · , · ): P×P → ℝ on a statistical manifold P, if p0 is the predicted probability density function and p1 is the probability density function estimated from a dataset, then one may reject p0 as being the probability density function of the random variable that generated the data if d(p0,p1) exceeds a predefined value.

Additionally, for 2-dimensional statistical manifolds in particular, these methods of distance measurement may give insight into the relation between elements of the manifold in question.
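As a concrete sketch of such a distance-based rejection rule for the Normal family, the 0-divergence derived in section 6.3 can be computed in closed form and compared against a numerical evaluation of the equivalent Hellinger-type integral 2∫(√p − √q)² dz. The function names are illustrative:

```python
import math

def normal_pdf(z, mu, s):
    return math.exp(-(z - mu)**2 / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def zero_divergence(mu1, s1, mu2, s2):
    # closed form D(0)(p(mu1, s1) || p(mu2, s2)) = 4[1 - (a / sqrt(2b)) exp(b c^2 - d)]
    a = 1.0 / math.sqrt(s1 * s2)
    b = 1.0 / (4 * s1**2) + 1.0 / (4 * s2**2)
    c = (mu1 / (2 * s1**2) + mu2 / (2 * s2**2)) / (2 * b)
    d = mu1**2 / (4 * s1**2) + mu2**2 / (4 * s2**2)
    return 4.0 * (1.0 - a / math.sqrt(2 * b) * math.exp(b * c * c - d))
```

A rejection rule then amounts to comparing `zero_divergence(mu0, s0, mu1, s1)` against a predefined threshold; the symmetry of the 0-divergence makes the order of the two densities irrelevant.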
For example, fixing a point p(ξ1, ξ2) in a 2-dimensional statistical manifold P, one may graph the function f(θ1,θ2) = d(p(ξ1, ξ2), p(θ1,θ2)) over the parameter space of P to determine neighbourhoods of p(ξ1, ξ2) – subsets of P of the form

Nc(ξ1, ξ2) = { p(θ1,θ2)∊P | f(θ1,θ2) < c }    for c > 0.

This information may be of use in statistical estimation since, if it is known that such a neighbourhood of the point p(ξ1, ξ2)∊P is in some sense "large", then estimations about this point may be less accurate than if the neighbourhood were "small".

Arwini & Dodson adopt this approach with respect to geodesics of the Gamma statistical manifold Ⅎ under the Fisher information metric and the Levi-Civita connection (Amari & Nagaoka's 0-connection). As noted in section 5.5 of this thesis, however, explicit solutions to the differential equations for such geodesics are unattainable and so they employ upper-bound approximations73.

72 §8, "Probability and Statistical Inference" – R Hogg and E Tanis, 2006.
73 §7.3.1, "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.

To summarise their findings, they show that there exist non-parallel geodesics (with respect to the Fisher information metric) having explicit solutions which may be used to provide an upper bound for the geodesic-based distance between any two points p,q∊Ⅎ.
By finding a point u∊Ⅎ that lies in the intersection of these two geodesics taken through p and q separately, the upper bound is

d(p,q) ≤ d(p,u) + d(u,q)

where d( · , · ) is defined as per section 6.1 of this thesis in terms of the inner product on Ⅎ as

d(p, q) = inf_{γ:[0,1]→Ⅎ, γ(0)=p, γ(1)=q} ∫₀¹ √( 〈 γ′ | γ′ 〉_{γ(t)} ) dt

In section 6.3 of this thesis the α-divergences (α ≠ ±1) for the Normal family under (μ,ς) co-ordinates were derived as

D(α)( p(μ1, ς1) ‖ p(μ2, ς2) ) = (4/(1−α²)) [ 1 − (a/√(2b)) ∙ exp(bc² − d) ]

where

a = ς1^{(α−1)/2} ς2^{−(α+1)/2}
b = (1−α)/(4ς1²) + (1+α)/(4ς2²)
c = (1/(2b)) [ μ1(1−α)/(2ς1²) + μ2(1+α)/(2ς2²) ]
d = μ1²(1−α)/(4ς1²) + μ2²(1+α)/(4ς2²)

This result can now be put to use in defining a distance function on the Normal statistical manifold. Setting α = 0 reduces the expression for the divergence to the symmetric function

D(0)( p(μ1, ς1) ‖ p(μ2, ς2) ) = 4 [ 1 − (a/√(2b)) ∙ exp(bc² − d) ]

where

a = 1/√(ς1ς2)
b = 1/(4ς1²) + 1/(4ς2²)
c = (1/(2b)) [ μ1/(2ς1²) + μ2/(2ς2²) ]
d = μ1²/(4ς1²) + μ2²/(4ς2²)

The graphs below represent the consequent function f: ℝ×(0,∞) → ℝ given by

f(μ, ς) = D(0)( p(1,1) ‖ p(μ, ς) )

which corresponds to the distance defined by the 0-divergence from the point p(1,1) (the Normal probability density function with mean 1 and unit standard deviation) in the Normal statistical manifold.

(0-divergence induced distance from the point p(1,1) in the Normal statistical manifold)

These graphs suggest that, according to this distance function on the Normal statistical manifold, changes in μ produce a greater effect on the distance from a given point than do equivalent changes in ς. Recalling that μ corresponds to the mean value of a Normal random variable and ς corresponds to its spread, this implication seems logical.

8.4 – GEODESICS FOR STATISTICAL EVOLUTION

A novel application of geodesics through statistical manifolds was presented by Arwini & Dodson in relation to the development of statistical processes over time74.
Suppose that one has a time-dependent statistical process represented by the probability density function p(z; θ(t)) that belongs to a given family of probability distributions P, where θ: [0,1] → Θ is a path in the parameter space Θ. If instead of having an explicit expression for the function θ(t) one only knows its values at a finite number of points, then it may be desirable to have a method of interpolation to predict the intermediate values of θ(t). There exist many methods for interpolating general functions – approximation by polynomials, approximation by Fourier series, approximation by splines – but none of these takes advantage of the information with which the family of probability distributions P is endowed as a statistical manifold.

For the sake of simplicity, assume that only the values of θ(0)∊Θ and θ(1)∊Θ are known. The method of interpolation suggested by Arwini & Dodson is then to join these two points via the unique geodesic in P connecting p(z; θ(0)) and p(z; θ(1)). The idea here is that these geodesics represent a 'natural' or 'most efficient' path between any two points in a family of probability distribution functions.

There is, however, an initial complication to this method: as per the definition given in section 5.4 of this thesis, in order to define any such geodesic on a statistical manifold P it is necessary to have a connection ∇ defined on P. Although it is not difficult to do so, the main issue here is that different connections will, in general, define different geodesics. Therefore, there is some question as to how appropriate this method of interpolation is, in the absence of any reasoning for choosing a particular connection. Another potential difficulty is that actually calculating these geodesics may be an unsolvable problem, but it may be possible to circumvent this issue by employing computer-generated approximations.
Arwini & Dodson apply this method to the modelling of voids in space and, ignoring questions of the validity of the connection used, show that the results can be used to gain insight into the stability of states of this system.

74 §6.6, "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.

CONCLUSION

Information geometry is heralded by Amari & Nagaoka as having applications across a wide variety of mathematically based fields, and to a certain extent this is true. In this thesis alone many worthwhile concepts have been discussed and examples of their applications examined. Furthermore, it is evident that a sizeable volume of research has been and continues to be undertaken in this developing field of mathematics.

Whilst in some sense the construction of statistical manifolds based upon the parameters of a family of probability distributions is a fairly evident approach, much of the theory of information geometry is novel. As an example of this, the α-connections and α-divergences that it defines have in general little meaning or significance in differential geometry at large but provide a standardised method of defining geometries on statistical manifolds based upon their respective families of probability distribution functions. These geometries therefore depend upon inherent qualities of the families themselves rather than just providing an arbitrary structure.

On the other hand, there are several key problems which much of the literature regarding information geometry tends to gloss over. For example, as demonstrated in this thesis, in all but the simplest of cases, finding explicit solutions for many functions and objects generated by information geometry can be difficult if not unattainable. Whilst computer-generated approximations may be of assistance in this regard, it does seem to be a noteworthy concern when such a large amount of the theory is somewhat difficult to apply.
Another question that does not appear to be generally addressed by most writers on this topic is that of the validity of the geometries employed by information geometry. Whilst exponential families of probability distributions can be seen to be flat with respect to the ±1-connections, for example, little regard seems to be given by most authors as to why this geometry should be used to represent these families, apart from the convenience it endows. One final but minor point of uncertainty regards the usefulness of the family of α-connections introduced by Amari & Nagaoka. For most applications, the only values of α that are used are 0 and ±1 – all of which correspond to concepts independent of information geometry. Therefore, whilst the α-connections do relate these concepts, the classification does seem at times to be somewhat vacuous.

These perhaps slightly philosophical qualms aside, there does seem to be genuine benefit gained from the theory of information geometry. The distance functions that it brings to probability theory and statistics, for example, permit a perspective that is quite different from the methods of comparison usually found in the realm of statistical estimation. However, further research regarding the justification behind many of the concepts is something that may be beneficial to consider.

The final chapter of this thesis gives a short overview of the potential applications of this theory. As such, much of this section could easily lead to further topics of research. For example, several different notions of distance measures are given – it may prove of interest to investigate the properties of these and determine which, if any, might be better measures. In particular, it may be of merit to compare the distance measure based on the 0-divergence for the Normal statistical manifold to other measures of distance discussed in this thesis.
Another potential avenue of research may consist of comparing these new concepts with the more standard concepts of statistical estimation. Finding some way of rating the efficiency and/or accuracy of both the old and the new would be an interesting result.

In closing, it is hoped that the reader has gained some appreciation of the depth of the field of differential geometry and the potential that lies within the theory of information geometry. If this thesis has served as an accessible introduction to any of these concepts then it has served its purpose.

TABLE OF NOTATION

CHAPTER 1
Random variable – Z: Ω → E
Probability density function – p(z;θ)
Family of probability distributions – { p(z;θ) | θ∊Θ }
Parameter space – Θ
Expected value – 𝔼: P → ℝ^k

CHAPTER 2
Atlas of charts – {(Ua,θa)}a∊𝒜
Co-ordinate system – θa(x) = (θ1(x), … , θn(x))
Tangent space – TxX
Tangent bundle – TX
Rate of change of a real function – (f∘γ)′(0)
Partial derivative – ∂f/∂θk (x)
Differential – dF: TxX → TF(x)Y
Vector field – V: X → TX

CHAPTER 3
Log-likelihood map – ℓ: P → ℝ^Ω
Score – ∂ℓ/∂θi (p(Z; θ))

CHAPTER 4
Exponential family – p(z; θ) = exp( C(z) + Σ_{i=1}^n θi yi(z) − φ(θ) )
Canonical co-ordinates – (θ1, … , θn)
Second fundamental form – αi,j(p)
Fisher information matrix – g_{i,j}(p) = 𝔼p[ ∂ℓ/∂θi ∙ ∂ℓ/∂θj ] = −𝔼p[ ∂²ℓ/∂θi∂θj ]
(Generalised) Efron's statistical curvature – γ(p) = Σ_{i,j,k,l=1}^n g^{i,j}(p) ∙ g^{k,l}(p) ∙ 𝔼p[ αi,k(p) ∙ αj,l(p) ]

CHAPTER 5
Connection/covariant derivative – ∇
Christoffel symbols – Γ^j_{i,k}
Tangent space projection – πx: TxX → TyY
Projective connection – ∇π
α-connection – ∇(α)
Mixture family – p(m; θ) = Σ_{i=1}^n θi pi(m)

CHAPTER 6
Riemannian metric – 〈 · | · 〉x: TxX×TxX → ℝ
Fisher information metric – 〈u | v〉p = 𝔼p[ dpℓ(u) ∙ dpℓ(v) ]
Divergence – D( ∙ ‖ ∙ ): X×X → ℝ
f-divergence – Df(p‖q) = ∫_{ℝ^k} p(z) ∙ f( q(z)/p(z) ) dz
α-divergence – D(α)(p‖q)
Hellinger distance – D(0)(p‖q) = 2 ∫ ( √p(x) − √q(x) )² dx

CHAPTER 7
Dual connection – ∇*
Parallel translation – Xv(x): X → TX
Potential functions – φ, ψ: X → ℝ
Canonical divergence – D(p‖q) = ψ(p) + φ(q) − Σ_{i=1}^n θi(p)ηi(q)
Maximum likelihood estimator – M: Ω → ℝ
Co-tangent bundle – Tx*X
Parallel translation – Π_{ρ(t)}(γ′)