MELBOURNE UNIVERSITY
Department of Mathematics and Statistics
An introduction to information geometry and its
applications in statistics
Martin Pike
This thesis was written as part of a Master of Science degree.
May 2012
TABLE OF CONTENTS
INTRODUCTION
1. ELEMENTS OF PROBABILITY
1.1 Probability Distribution Families
1.2 Expectations
2. ELEMENTS OF DIFFERENTIAL GEOMETRY
2.1 Manifolds
2.2 Tangent Spaces
2.3 Rates of Change, Partial Derivatives and Differentials
2.4 Vector Fields
3. INFORMATION GEOMETRY
3.1 Statistical Manifold (Amari & Nagaoka)
3.2 Statistical Manifold (Murray & Rice)
3.3 Sub-Manifolds
3.4 Tangent Spaces of Statistical Manifolds
4. EXPONENTIAL FAMILIES
4.1 Canonical Co-ordinates of Exponential Distributions
4.2 Testing for exponentiality
5. STRAIGHTNESS
5.1 Connections
5.2 Subspace Connections
5.3 α-Connections
5.4 Geodesics
5.5 Statistical Manifold Geodesics
6. DISTANCES ON STATISTICAL MANIFOLDS
6.1 Riemannian Metrics
6.2 Divergences
6.3 α-Divergences of the Normal Family
7. DUALITY
7.1 Dual Connections
7.2 Parallel Translation
7.3 Canonical Divergence
7.4 Dualistic Structure of Exponential Families
7.5 Differential Duality
8. SURVEY OF APPLICATIONS
8.1 Affine immersions
8.2 Projections onto sub-manifolds
8.3 Geodesics for statistical inference
8.4 Geodesics for statistical evolution
CONCLUSION
TABLE OF NOTATION
INTRODUCTION
This thesis aims to be an illustrative and approachable introduction to the theory of
information geometry. It is largely based on the canonical text on this subject
composed by Amari & Nagaoka1 as well as an alternative text written by Murray &
Rice2. A source of reference for applications of information geometry is found in the
work of Arwini & Dodson3 and, to a lesser extent, in various research papers.
Wherever possible, references are given to the reader should they care to delve
further into a particular topic.
In the first and second chapters of this thesis, the basics of probability theory and
differential geometry are introduced. At the risk of erring on the side of brevity, only the concepts
prerequisite to information geometry are presented. The aim here is to both refresh
readers familiar with these two topics and to provide readers unacquainted with either
topic with a modest degree of the theory so that the main focus of this thesis can be
presented.
Essentially, this amounts to the definitions, from probability theory, of probability
distribution functions and families thereof, independence of random variables and
expectations, and, from differential geometry, of the concepts of manifolds, tangent
spaces, differentials and vector fields.
Chapter 3 is where the main theory of information geometry is established. Two
distinct definitions of statistical manifolds by Amari & Nagaoka and Murray & Rice,
respectively, are presented and briefly compared. Following this, sub-manifolds of
statistical manifolds are briefly discussed in terms of statistical inference and a
convenient representation of tangent spaces to statistical manifolds based on affine
families of functions is developed.
An example of the framework established by information geometry is presented in the
fourth chapter where the important class of exponential probability families is
investigated. It is shown how such families, when considered as statistical manifolds,
admit a universal form of co-ordinate system, and several geometry-based tests for
exponentiality are discussed with examples.
1
“Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
2
“Differential Geometry and Statistics” – M Murray and J Rice, 1993.
3
“Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
In the fifth chapter of this thesis more differential geometry concepts are introduced
and applied to statistical manifolds via information geometry. Connections, a
generalisation of directional derivatives, are investigated and a special family of
connections on statistical manifolds called the α-connections is presented and shown
to unify some well-known concepts from differential geometry and probability theory.
Lastly, geodesics of manifolds are defined in terms of connections and their application
to statistical manifolds examined with worked-through examples.
Following on in a similar vein in chapter 6, several distance functions are developed
from Riemannian metrics and divergences. Once again, a family of divergences, the
α-divergences, is shown to generalise concepts from differential geometry and
probability theory. To conclude the chapter, an example based upon the Normal family
of probability distributions is worked through to solidify the terminology established so
far.
Chapter 7 develops the last core elements of information geometry introduced in this
thesis by giving two contrasting definitions of duality on statistical manifolds by Amari
& Nagaoka and Murray & Rice, respectively. Duality is shown to provide further
framework for statistical manifolds such as the so-called parallel translation of tangent
vectors. The potential applications of this framework are discussed via an important
theorem of Amari & Nagaoka which gives conditions for the minimisation of
divergences from fixed points in a statistical manifold to sub-manifolds.
Having now established a large portion of the theory of information geometry, the
final chapter is dedicated to the exhibition of several novel applications that have been
developed in the literature of the international mathematical community. It is shown
how under certain conditions statistical manifolds can be represented as surfaces in
real space and several concepts related to statistical estimation which make use of the
distance functions developed earlier are discussed with examples.
In conclusion, the merits of information geometry are touched upon including the
potential advantages and difficulties encountered in applications.
CHAPTER 1 - ELEMENTS OF PROBABILITY
Formally, the mathematical framework of probability arises from probability spaces – a
triple, (Ω, ℱ, ℙ), where Ω is an arbitrary though non-empty set (the sample space), ℱ is
a sigma algebra on Ω and ℙ: ℱ→ [0,1] is a probability measure on ℱ (a measure such
that ℙ(Ω)=1). From this, a random variable defined on a measurable space (E, ℰ),
where E is a non-empty set and ℰ is a sigma algebra on E, is a measurable
function Z: Ω → E, i.e. for any A∊ℰ its pre-image under Z satisfies Z⁻¹(A)∊ℱ. The
probability of an event A “occurring” is then defined as ℙ(Z⁻¹(A)).
In fact, despite what this somewhat lengthy and abstract definition might suggest, for
the most part one simply considers functions of the form p: ℝn → ℝ or p: ℤ → ℝ to
define the probability of a random variable taking values in a given set by integration
and summation, respectively. These functions are interpreted as expressing the
likelihood of observing a random variable in a given set, say A – the greater the values
that such a function takes on a set A, the more likely it is that the random variable will
take values in A.
Whilst the more detailed definition of probability spaces and random variables allows
for a very rich area of study, the simpler viewpoint is taken here because it permits a
more function-centric approach to probabilities, with less focus on the underlying
spaces and more on the functions themselves, and because it simplifies the concepts of
measure theory to standard calculus. This advantage will become more apparent once
the ideas of information geometry are introduced.
In order not to detract from interest in the more detailed and elegant theory of
probability spaces, the curious reader is directed to any of the multitude of
introductory texts on the subject, for example – Casella & Berger (2001), “Statistical
Inference”; Hogg and Tanis (2006), “Probability and Statistical Inference”; Doob (1954,
1990), “Stochastic processes”; Grimmett and Stirzaker (1992), “Probability and Random
Processes”.
1.1 – PROBABILITY DISTRIBUTION FAMILIES
The most basic elements that will be under consideration are probability density
functions of random variables. For the sake of simplicity4, a given random variable’s
4
For an introduction to the theory of information geometry, there is little harm in adopting this less-generalised notation and in practice it is the format that is most often used. The reader with background in probability theory will recognise the simplifications made and the reader with no background will only be assisted by using this formulation for now.
probability density function must satisfy the following properties, depending upon
whether the random variable is discrete or continuous:
Discrete probability density function: a function p: ℤ → ℝ satisfying
a) 0 ≤ p(n) ≤ 1 for all n∊ℤ; and
b) ∑n∊ℤ p(n) = 1.
Continuous probability density function: a function p: ℝn → ℝ satisfying
a) p(z) ≥ 0 for all z∊ℝn ;
b) For any interval (a,b)⊂ℝ, p⁻¹((a,b)) is open in ℝn ; 5 and
c) ∫ℝn p(z) dz = 1.
The probability of a random variable, say Z, with such a distribution function p taking
values in a set A is then given by the respective sum or integral, depending upon Z
being discrete or continuous:
ℙ(Z ∈ A) = ∑n∊A p(n)    (discrete)
ℙ(Z ∈ A) = ∫A p(z) dz    (continuous)
When considering sets of the form A = {z∊ℝn | z1 < a1, … , zn < an} (and analogously
for discrete distributions) these functions are called cumulative distributions –
F(a1, … , an) = ℙ(Z∊A). Also note that when A is a discrete set and p is a continuous
density, ℙ(Z∊A) = 0.
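As a quick numerical illustration (not part of the original text), the defining integral for ℙ(Z∊A) can be approximated by quadrature; `prob_in_interval` is a hypothetical helper and the midpoint rule an arbitrary choice:

```python
import math

def prob_in_interval(p, a, b, n=100_000):
    # Midpoint-rule approximation of the integral of the density p over (a, b)
    h = (b - a) / n
    return sum(p(a + (i + 0.5) * h) for i in range(n)) * h

# Standard Normal density; P(Z in (-1.96, 1.96)) should be close to 0.95
p = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
central = prob_in_interval(p, -1.96, 1.96)
near_total = prob_in_interval(p, -8.0, 8.0)  # essentially all of the mass
```

The same helper evaluated over a very wide interval recovers condition c) above, that the density integrates to 1.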
Such functions are parameterised into families as p(z;θ) where the vector
θ=(θ1,…, θk), which lies in a pre-defined set Θ⊂ℝk called the parameter space and is
dependent on the family, specifies uniquely a particular distribution from a family of
probability distributions. In this sense, both discrete and continuous distributions take
the same form if the domain of p is simply labelled the sample space and, if considered
in terms of measure theory, can be entirely unified. However, this is beyond the
requirements of this thesis and so will remain otherwise untouched.
Often, either the z, the θ or both will be omitted from notation based upon the desired
focus at the time. For example, elements of a given family may be identified simply as
p(θ) when the sample space is of no particular relevance to the topic at hand and the
significance rests with the parameters.
5
This additional requirement essentially ensures the functions are sufficiently well-behaved and, in
particular, measurable.
Some commonly encountered families include the following6:
Binomial distributions, Bi(n,a): for a given a∊(0,1) and positive integer n,
p(k; n, a) = (n choose k) a^k (1 − a)^(n−k) , k ∊ {0, … , n}; 0 otherwise
Poisson distributions, Pn(λ): for a given λ > 0,
p(k; λ) = e^(−λ) λ^k / k! , k ∊ {0, 1, 2, …}; 0 otherwise
Uniform distributions, U(a,b): for given a,b∊ℝ with a<b,
p(z; a, b) = 1/(b−a) , z ∊ (a, b); 0 otherwise
Gamma distributions, Γ(k,θ): for given k > 0 and θ > 0,
p(z; k, θ) = z^(k−1) e^(−z/θ) / (Γ(k) θ^k) , z > 0; 0 otherwise
Normal distributions, N(μ,ς²): for given μ∊ℝ and ς > 0,
p(z; μ, ς) = (1/√(2πς²)) exp( −(z−μ)²/(2ς²) )
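The families above can be sketched directly in code; the following is illustrative only (the parameter values are arbitrary) and checks the normalisation condition for three of the families numerically:

```python
import math

# A sketch of some of the densities listed above, with normalisation checks
def binomial_pmf(k, n, a):
    return math.comb(n, k) * a**k * (1 - a)**(n - k) if 0 <= k <= n else 0.0

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k) if k >= 0 else 0.0

def gamma_pdf(z, k, theta):
    return z**(k - 1) * math.exp(-z / theta) / (math.gamma(k) * theta**k) if z > 0 else 0.0

def normal_pdf(z, mu, sigma):
    return math.exp(-(z - mu)**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

binom_total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
poisson_total = sum(poisson_pmf(k, 4.0) for k in range(100))  # tail beyond 100 is negligible
gamma_total = sum(gamma_pdf((i + 0.5) * 1e-3, 3.0, 2.0) for i in range(100_000)) * 1e-3
```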
An important convention regarding notation of random variables is that, in general,
the random variable itself is written in capital letters whereas elements of the sample
space are denoted by lower-case letters. For example, one may denote a Normal
random variable as, say, Z with probability distribution function p(z). If one writes
p(z), this denotes a particular real value given by p at the point z. However, writing
f(Z) (for any function f, not necessarily a probability density function) denotes a
random variable – a Normal random variable transformed by the function f.
One final concept regarding random variables that will be called upon later is that of
independence. Suppose that U, V and W = (U,V) are random variables with density
functions pU, pV and pW respectively. Then the random variables U and V are
independent if pW(u,v) = pU(u)pV(v). As the name and this requirement suggest,
independent random variables have no relation to each other – knowing anything
about one random variable gives no information about the other.
6
“Probability and Statistical Inference” – R Hogg and E Tanis, 2006.
1.2 – EXPECTATIONS
One further cornerstone of probability theory that will be called upon frequently is
that of expectations. Known also as means, averages, expected values and more, this
concept is interpreted as expressing a base-point for the “most likely” values that a
random variable might take – the values closest to the mean are most likely to occur.
For a continuous random variable Z with distribution function p: ℝn → ℝ and discrete
random variable W with distribution function q: ℤ → ℝ, the expectations are defined
below respectively as the multidimensional integral and sum
𝔼(Z) = ∫ℝn z ∙ p(z) dz    and    𝔼(W) = ∑n∊ℤ n ∙ q(n)
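These defining formulas can be evaluated numerically as a sanity check; the parameter choices below (a Poisson(4) variable, whose mean is λ = 4, and an N(2,1) variable, whose mean is μ = 2) are purely illustrative:

```python
import math

# The defining sum and integral for expectations, evaluated numerically
poisson_mean = sum(k * math.exp(-4.0) * 4.0**k / math.factorial(k) for k in range(100))

def normal_mean(mu, s, lo, hi, n=200_000):
    # Midpoint-rule approximation of the integral of z * p(z) over (lo, hi)
    h = (hi - lo) / n
    pdf = lambda z: math.exp(-(z - mu)**2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)
    return sum(z * pdf(z) for z in (lo + (i + 0.5) * h for i in range(n))) * h
```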
The interpretation of these values, however, is far from perfect or consistent. For
example, the expected value of a random variable need not even be achievable by the
random variable. Take the Bi(1,0.5) distribution as defined above – the expected
value in this case is 0.5 whereas such random variables only assume the values 0 or 1.
Consider also a continuous random variable with distribution function given by
p(z) = ( π √(z − z²) )⁻¹ on the interval (0,1) (shorthand for “p is zero elsewhere”), the so
called Arcsine distribution. As seen in the graph below, most of the probability mass is
distributed around the endpoints. However, one can verify that the expected value of
such a random variable is 0.5, which is actually the mid-point of the region where, if
one were to shift a small interval through (0,1), the random variable is least likely to
occur.
(Arcsine probability distribution function)
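The claim that the Arcsine mean is 0.5 can be verified numerically; the sketch below uses the midpoint rule and normalises by the computed mass to absorb the (integrable) endpoint singularities of the density:

```python
import math

# Numeric check that the Arcsine distribution has mean 0.5
p = lambda z: 1.0 / (math.pi * math.sqrt(z - z * z))
n = 100_000
h = 1.0 / n
zs = [(i + 0.5) * h for i in range(n)]
mass = sum(p(z) for z in zs) * h            # should be close to 1
mean = sum(z * p(z) for z in zs) * h / mass
```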
As a final example of the somewhat unpredictable behaviour of these values, consider
now a continuous random variable with distribution function given by
p(z) = (π + πz²)⁻¹ on the real line, the standard Cauchy distribution. Taking the
integral of z·p(z) for the expectation only from zero to infinity yields an infinite value
and similarly but with opposite sign when taken from negative infinity to zero. Whilst it
is tempting to simply conclude that these positively and negatively infinite values
“cancel out” to give an expected value of zero, for technical reasons from measure
theory it is said that in this case the expectation does not exist.
Although following from the definition of expected values, it is convenient when
calculating expectations to note that for constants a,b∊ℝ and any random variable Z
which has a finite expected value, 𝔼(aZ+b) = a·𝔼(Z)+b.
If Z and W are independent random variables then by definition p(W,Z)(w,z) =
pW(w)pZ(z). Therefore the expected value of the random variable WZ is
𝔼(WZ) = ∫ℝm+n w∙z p(W,Z)(w, z) d(w × z)
       = ∫ℝm ∫ℝn w∙z pW(w) pZ(z) dw dz = 𝔼(W)𝔼(Z)
and similarly in the case of discrete random variables via sums.
This equality can be useful since calculating the expected value 𝔼(WZ) by first
principles requires knowledge of the density function of the random variable WZ or
(W,Z) – neither of which can be constructed from the density functions of W and Z
alone unless these two random variables are independent.
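The identity 𝔼(WZ) = 𝔼(W)𝔼(Z) can also be spot-checked by simulation; the setup below, with independent U(0,1) draws and a fixed seed, is an arbitrary illustrative choice:

```python
import random

# Monte Carlo spot-check of E(WZ) = E(W)E(Z) for independent U(0,1) draws
rng = random.Random(0)  # seeded so the run is reproducible
n = 200_000
pairs = [(rng.random(), rng.random()) for _ in range(n)]
mean_w = sum(w for w, _ in pairs) / n
mean_z = sum(z for _, z in pairs) / n
mean_wz = sum(w * z for w, z in pairs) / n
```

For independent U(0,1) variables both sides should be close to 0.25.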
CHAPTER 2 - ELEMENTS OF DIFFERENTIAL GEOMETRY
In order to analyse families of probability distributions from a geometric point-of-view,
it is necessary to review the basics of differential geometry. Though closely related to
other fields of study concerning topological spaces and geometries, differential
geometry distinguishes itself in investigating the local qualities of specific topological
spaces – whilst a topologist views a sphere and any smooth deformations of it as
equivalent, a geometer sees a great deal of difference between the two.
2.1 – MANIFOLDS
The particular class of spaces that are of interest in differential geometry are called
manifolds. Essentially, an n-dimensional manifold is characterised by having some
form of co-ordinate correspondence to a subset of ℝn around each point in the space.
Formally, for a space X to be a manifold, it is required that there exists an atlas of
charts {(Ua,θa)} (ranging over some index set, say 𝒜) – each Ua is an open set of X and
θa is a function θa : Ua → ℝn – which satisfy the following two conditions:
1. ⋃a∊𝒜 Ua = X ; and
2. For each a∊𝒜, θa: Ua → ℝn is a homeomorphism onto a subset of ℝn.
Since the {Ua}a∊𝒜 form an open cover of X, any particular Ua will have some
intersection with its neighbours, if any exist (the obvious exceptions being
disconnected spaces and atlases that consist of only one chart). If Ub is such an
intersecting set, then this gives rise to the idea of transition functions between charts
θb ∘ θa⁻¹ : θa(Ua ∩ Ub) → θb(Ua ∩ Ub)
which are maps between subsets of ℝn. If all such transition functions are
differentiable as real maps, then X is said to be a differentiable manifold.
This local correspondence to ℝn (in some sense the simplest type of manifold, with the
identity map acting as its sole chart) about each point of X endows the manifold with
a natural local co-ordinate system – for each x∊Ua its local co-ordinate is denoted by
θa(x) = (θ1(x),…, θn(x)). The distinction to be made here is that the points of X
themselves are nothing more than just individual objects inherent to the manifold but
the local co-ordinates give some arbitrary method of distinguishing between any two
such points and are not unique to the manifold. Indeed, different atlases may give rise
to entirely different co-ordinates for a given point of X. This subtlety will be made
clearer via a familiar example presently.
One of the more tangible and yet non-trivial manifolds that can be considered as a
reference point for the theory of differential geometry is the unit sphere, S2. One
approach is to simply consider it as a subset of ℝ3, thus inheriting the triple of
co-ordinates (x,y,z). The problem with this approach is that this “chart”, now just the
identity map, is not a homeomorphism onto an open subset of real space. Indeed, in terms of open sets
in ℝ3, no neighbourhood of any point in S2 is open in ℝ3. Hence, a different
approach is required.
Consider instead separating S2 into the upper and lower hemispheres and projecting
each hemisphere onto the unit disc. Whilst these are all closed sets, with a bit of
imagination it is not difficult to see how to extend each hemisphere into the other to
create an open cover of S2, say U1 and U2, and similarly extend the image of each Ua to
an open disc containing the closed unit disc. This atlas is exactly what is required.
What this example should illustrate is that when considering manifolds in an abstract
sense, the points of the manifold and their co-ordinates are entirely different things. In
the case of the sphere, for example, the most theoretically consistent way of
identifying points is to draw a picture and say “this point here”. Writing such a point as,
say, (x,y,z) has the potential to confuse the correspondence with the local co-ordinate
system lying inside ℝ2. However, this representation does give a simple way of
writing points locally in terms of co-ordinates (since spherical co-ordinates are not a
bijective mapping from ℝ3 to ℝ2) – e.g.
(θ1(x,y,z), θ2(x,y,z)) = ( arccos( z / √(x² + y² + z²) ) , arctan(y/x) )
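This chart can be exercised numerically; `chart_inv` below is a hypothetical inverse valid only on the x > 0 part of the unit sphere (the arctan(y/x) formula already assumes as much):

```python
import math

def chart(x, y, z):
    # The (arccos, arctan) chart quoted above; valid away from x = 0 and the poles
    return (math.acos(z / math.sqrt(x * x + y * y + z * z)), math.atan(y / x))

def chart_inv(t1, t2):
    # Inverse on the x > 0 portion of the unit sphere
    return (math.sin(t1) * math.cos(t2), math.sin(t1) * math.sin(t2), math.cos(t1))

pt = (0.6, 0.48, 0.64)          # a point of S2: 0.36 + 0.2304 + 0.4096 = 1
rt = chart_inv(*chart(*pt))     # round trip through the chart
```

The round trip recovers the original point, which is exactly the homeomorphism property a chart must have on its domain.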
As an aside, one might think to wonder if in fact a simpler atlas, one consisting of
perhaps only one chart, can be constructed for the sphere. The answer to this is no and
can be rigorously proven. For example, this fact follows as a corollary to the Borsuk–
Ulam theorem7.
Continuing with this example, one should observe that intuitively the sphere is
somehow a sub-manifold of ℝ3. Whilst there are many ways of defining exactly how it
is a sub-manifold, a particularly concise method is based upon the Preimage Theorem 8
7
Corollary 2B.7, “Algebraic Topology” – A Hatcher, 2002.
8
§1.4, “Differential Topology” – V Guillemin and A Pollack, 1974.
which states that, under certain conditions, given two manifolds X and Y, if f: X → Y is a
smooth map then f⁻¹(y) is a sub-manifold of X for particular y∊Y. Defining f: ℝ3 → ℝ to
be the squared Euclidean distance
f(x, y, z) = x² + y² + z²
gives the required preimage S2 = f⁻¹(1).
For real space there is an obvious correspondence to lower dimensional real spaces by
restricting some co-ordinates to be constant, i.e. ℝm = {(x1, … , xm, 0, … , 0) ∊ ℝn } for
0<m<n. One then says that Y is an m-dimensional sub-manifold of an n-dimensional
manifold X (m<n) if there exist co-ordinate charts {(Ua,θa)} in X about every point
y∊Y such that the points around y are given by θ⁻¹(x1, … , xm, 0, … , 0), see Murray &
Rice9. This assumes, of course, that there is a natural inclusion map from Y to X in the
first place so that one may consider points of Y as points of X in a well-defined manner.
It is seen then that, locally, sub-manifolds look like sub-spaces of real space.
2.2 – TANGENT SPACES
Continuing with the example of the sphere as a sub-manifold of ℝ3, notice that at
every point s on this surface there exists a unique plane such that every line on this
plane which intersects the sphere at s is tangential to the sphere, namely the tangent
plane. It is desirable to extend this concept to manifolds at large in a consistent
manner. Recall that the definition of a manifold does not require anything further than
a topological space – there is no reason to assume that an arbitrary manifold can be so
readily pictured in real space as the sphere. The most obvious counter-example, one
which will be the focus of this thesis, is that of manifolds of functions. Without any
means of visualising such a manifold, the idea of constructing a tangent plane at any
point is difficult to approach without a solid definition of what is meant by this concept.
Suppose now that X is a manifold with atlas {(Ua,θa)} and let x be an arbitrary point in
X sitting in the chart (U0,θ0). Let γ: (−ε, ε) → X be a path in X such that γ(0) = x; then
by considering the corresponding path in real space (where differentiation is
well-defined) given by (θ0 ∘ γ)(t), perhaps restricting the domain of γ(t) so that the
path lies entirely in U0, it is possible to consider whether (θ0 ∘ γ)(t) is a differentiable map about
t = 0. In the affirmative case, the vector (θ0 ∘ γ)′(0) is said to be a representative of
the equivalence class of a tangent vector at x, denoted by γ′. This definition arises from
the fact that there may be infinitely many paths through x which, when composed
with the homeomorphism θ0, have the same derivative at t = 0. In general, one simply
calls γ′ a tangent vector for brevity of notation.
9
§3.1.3, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
With this definition in mind, the tangent space at the point x is simply the set of all
such tangent vectors and is denoted by TxX. Despite the slightly obtuse construction of
TxX, this space turns out to have a very simple structure as will now be demonstrated.
In particular, TxX is bijective with the vector space ℝn via the map taking γ′ to
(θ0 ∘ γ)′(0), and so is itself a vector space.
Injectivity follows simply from the definition of γ′ since if any two intersecting paths γ1
and γ2 satisfy the equality (θ0 ∘ γ1)′(0) = (θ0 ∘ γ2)′(0) then γ1′ = γ2′ represent the
same equivalence class.
For surjectivity note that since θ0 is a homeomorphism, for any vector v∊ℝn,
γ(t) ≔ θ0⁻¹(θ0(x) + t∙v) defines a path through x and that
(θ0 ∘ γ)′(0) = (θ0 ∘ θ0⁻¹(θ0(x) + t∙v))′(0) = (θ0(x) + t∙v)′(0) = v
Hence any vector in ℝn can be realised by an equivalence class of TxX and so there is
indeed a bijection between the two spaces.
One reasonable question to ask is if TxX is a vector space then what can be used as a
basis? The simplest way of defining such a basis is by lifting any basis for ℝn.
Specifically, given a basis {v1, … , vn} for ℝn the corresponding basis for TxX is given by
{γ1′ , … , γn′} where γk ≔ θ0⁻¹(θ0(x) + t∙vk) for k=1, … , n.
In general, two distinct tangent spaces of a given manifold need not be related as
vector spaces since the geometry of the manifold about the respective base points
may be entirely different. Locally, however, tangent spaces do correspond to the same
real space by virtue of the fact that a chart is homeomorphic on its (open) domain in X.
That being said, one often considers the tangent bundle of a manifold given by the
union of all tangent spaces – TX ≔ ⋃x∊X TxX.
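The equivalence-class definition can be made concrete numerically. In the sketch below (an illustrative construction, not from the text), two different paths through the same point of the unit circle differ only at second order in t, so they represent the same tangent vector; the chart used is the angle map atan2, a mild variant of the arctan chart seen earlier:

```python
import math

# Two different paths through (1, 0) on the unit circle. gamma2 differs from
# gamma1 only at second order in t, so both should have the same chart
# velocity (theta o gamma)'(0), i.e. represent the same tangent vector.
theta = lambda x, y: math.atan2(y, x)   # an angle chart near (1, 0)
gamma1 = lambda t: (math.cos(3 * t), math.sin(3 * t))
gamma2 = lambda t: (math.cos(3 * t + t * t), math.sin(3 * t + t * t))

def chart_velocity(gamma, h=1e-6):
    # Central finite difference of (theta o gamma) at t = 0
    return (theta(*gamma(h)) - theta(*gamma(-h))) / (2 * h)

v1, v2 = chart_velocity(gamma1), chart_velocity(gamma2)
```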
2.3 – RATES OF CHANGE, PARTIAL DERIVATIVES AND DIFFERENTIALS
In defining the tangent space of a manifold at a given point, the map θ: U → ℝn was
used to define the rate of change of a given path in the manifold. In a similar manner,
it is possible to define the rate of change of a real function, say f: X → ℝn, over a path
γ: [0,1] → X at the point γ(c)= x∊X where c∊[0,1] by
(f ∘ γ)′(c) ≔ d/dt f(γ(t)) |t=c
Consider now a given co-ordinate chart θ in terms of its co-ordinates in ℝn by writing
θ(x) = (θ1(x), … , θn(x)). By varying at unit rate the kth co-ordinate in ℝn only, it is
possible to create a path through a given x∊X defined by the lift
γk(t) ≔ θ⁻¹(θ(x) + t∙ek) = θ⁻¹(θ1(x), … , θk(x) + t , … , θn(x))
By taking the rate of change of a function over this path, the partial derivatives of the
function are defined to be
∂f/∂θk (x) ≔ (f ∘ γk)′(0)
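The lifted-path definition translates directly into a numerical recipe; the sketch below uses the polar chart (r, φ) on the punctured plane as an illustrative choice of manifold and chart:

```python
import math

# Partial derivatives via lifted paths, for the polar chart on the plane
def chart(x, y):
    return (math.hypot(x, y), math.atan2(y, x))

def chart_inv(r, phi):
    return (r * math.cos(phi), r * math.sin(phi))

def partial(f, x, k, h=1e-6):
    # (f o gamma_k)'(0) where gamma_k(t) = chart_inv(chart(x) + t * e_k)
    c = list(chart(*x))
    def gamma_k(t):
        shifted = list(c)
        shifted[k] += t
        return chart_inv(*shifted)
    return (f(*gamma_k(h)) - f(*gamma_k(-h))) / (2 * h)

f = lambda x, y: x * x + y * y   # in polar co-ordinates f = r^2
x0 = (3.0, 4.0)                  # here r = 5, so df/dr = 10 and df/dphi = 0
```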
A very useful result that uses this framework is the chain rule. The result, stated here
without a necessarily long and technical proof which can be found in Murray & Rice10,
is that for a real function f and a path γ through x∊X, one may write
(f ∘ γ)′(0) = ∂f/∂θ1(x) ∙ (θ1 ∘ γ)′(0) + ⋯ + ∂f/∂θn(x) ∙ (θn ∘ γ)′(0)
When instead considering more generally a map between manifolds, say F: X → Y, it is
possible to push forward a path γ:[0,1] → X through γ(s)= x∊X to Y via
(F ∘ γ)(t): [0,1] → Y. Then for a given co-ordinate chart φ about F(x)∊Y, the
differential of F is
dF: TxX → TF(x)Y,    γ′ ↦ (φ ∘ F ∘ γ)′(0)
This is a linear map between tangent spaces since for a scalar λ and tangent vectors
γ1’, γ2’∊TxX
dF(λγ1′) = d/dt φ(F(γ1(λt))) |t=0 = λ ∙ d/dt φ(F(γ1(t))) |t=0 = λ dF(γ1′)
and from the bijectivity of TF(x)Y and ℝm the path F(γ1+γ2)(t) satisfies
(φ ∘ F(γ1 + γ2))′(0) = (φ ∘ F(γ1))′(0) + (φ ∘ F(γ2))′(0)
which implies that
dF(γ1′ + γ2′) = (φ ∘ F(γ1 + γ2))′(0) = (φ ∘ F(γ1))′(0) + (φ ∘ F(γ2))′(0) = dF(γ1′) + dF(γ2′)
Because the co-ordinate functions θ1, … , θn are just maps from X to ℝ, the chain rule
can be reformulated as
10
§2.2.8, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
df(γ′) = ∂f/∂θ1(x) ∙ dθ1(γ′) + ⋯ + ∂f/∂θn(x) ∙ dθn(γ′)
which emphasises more the directional aspect of rates of change as acting on tangent
vectors.
2.4 – VECTOR FIELDS
A vector field on a manifold X is defined to be a map V: X → TX which defines for each
point x∊X a unique tangent vector in TxX. Since each tangent space is equivalent to
real space of dimension equal to that of the manifold X, it is generally required that a
vector field be smooth by considering it to be a map V: X → ℝn. This gives rise to the
intuition of what vector fields are by imagining in real space an arrow drawn at each
point.
In defining the rate of change of a function, the paths γ1, … , γn represented the
tangent vectors of unit length in the 1st through to nth co-ordinates, respectively, at a
given point in the manifold. By doing this at every point in the manifold, a vector field
can be constructed, following the notation of Murray & Rice11, which is denoted by
∂/∂θ1 (x), … , ∂/∂θn (x)
That is, for a given x∊X each of these n vector fields represents the tangent vector γk′
with (θ ∘ γk)′(0) = (0, … , 1, … , 0) = ek.
This notation arises from the fact that these paths were used to define the partial
derivatives of a function - the rates of change of a function over each of the n paths.
Note also that for any x∊X
dθk( ∂/∂θk (x) ) = 1    and    dθk( ∂/∂θj (x) ) = 0
when j≠k.
11
§2.2.5, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
CHAPTER 3 – INFORMATION GEOMETRY
Having established the basic concepts from probability and differential geometry that
are pre-requisite to information geometry, it is now possible to introduce the idea of a
statistical manifold in a formal way. However, there are at least two means to this end
given firstly by Amari & Nagaoka12 and subsequently by Murray & Rice13. The perhaps
simpler approach given by Amari & Nagaoka will be introduced before the slightly
more abstract definition given by Murray & Rice.
3.1 – STATISTICAL MANIFOLD (AMARI & NAGAOKA VERSION)
Let P = { p(z;θ) | θ∊Θ } be a family of probability distributions as defined in chapter 1
with sample space Ω – a subset of either ℤ or ℝn. Assume that
1. Θ is an open subset of ℝk ;
2. The support of each p(z;θ)∊P, i.e. supp(p) = { z∊Ω | p(z;θ)>0 }, does not
vary with θ ;
3. For each z ∊ supp(p) the map from Θ to ℝ given by θ ↦ p(z;θ) is infinitely
differentiable ; and
4. The order of integration/summation and differentiation may be
interchanged for integrals/sums over the sample space involving any
p(z;θ)∊P.
then P defines a statistical manifold with co-ordinates given by the θ = (θ1,…, θk).
Essentially, these conditions ensure that the Fisher information, an important function
to be defined and used in the following chapters, exists14 for all probability
distributions p∊P.
For an example of a statistical manifold under this definition, consider the family of
Normal distributions with sample space Ω = ℝ:
𝒩 = { p(z; μ, ς) = (1/√(2πς²)) exp( −(z−μ)²/(2ς²) ) | μ ∊ ℝ, ς > 0 }
12
§2.1, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
13
§3.2.1, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
14
§2.3.1, “Theory of Statistics” – M Schervish, 1995.
Setting θ = (μ, ς) it is seen that the parameter space Θ is the open upper half-plane in
ℝ2 and hence the first condition is satisfied. Also, since any p(z)∊𝒩 is strictly positive
on ℝ the second condition is easily satisfied.
For the third condition, note that for any fixed z∊ℝ each p(z;θ)∊𝒩 is a composition of
smooth functions on Θ and so is itself smooth, i.e. infinitely differentiable.
In fact, each p(z;θ)∊𝒩 is a composition of smooth functions on ℝ×Θ and so the
fourth condition follows from Leibniz’s rule for differentiation under the integral sign.
Therefore this rather important family is indeed a statistical manifold.
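The fourth condition has a directly testable consequence: differentiating the identity ∫ p(z;θ) dz = 1 under the integral sign forces the scores ∂ℓ/∂θk to have zero mean. The sketch below spot-checks this numerically for one arbitrary member N(1.5, 0.7) of the family:

```python
import math

# E[dl/dmu] = 0 and E[dl/dsigma] = 0 follow from interchanging d/dtheta with
# the integral; a numeric spot-check for N(1.5, 0.7)
mu, s = 1.5, 0.7
p = lambda z: math.exp(-(z - mu)**2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)
score_mu = lambda z: (z - mu) / (s * s)               # dl/dmu
score_s = lambda z: ((z - mu)**2 - s * s) / s**3      # dl/dsigma

def expect(g, lo=-10.0, hi=13.0, n=200_000):
    # Midpoint-rule approximation of E[g(Z)]
    h = (hi - lo) / n
    return sum(g(z) * p(z) for z in (lo + (i + 0.5) * h for i in range(n))) * h

e_mu, e_s = expect(score_mu), expect(score_s)
```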
However, this definition of a statistical manifold does have some limitations. In
particular, the second condition regarding the support of each probability density
function prevents many probability families from otherwise being statistical manifolds.
Take, for example, the family of Uniform random variables on the real line with density
functions:
p(z; a, b) = 1/(b−a) , z ∊ (a, b); 0 otherwise
Since the support of p is dictated by its parameters, this family cannot become a
statistical manifold under this definition. Being a somewhat common probability family
this is a noteworthy limitation.
3.2 – STATISTICAL MANIFOLD (MURRAY & RICE VERSION)
The approach of Murray & Rice is somewhat more indirect and requires one further
piece of terminology to be explained before it can be defined – the log-likelihood
map15, appearing frequently in the theory of statistics. If P is a probability family with
sample space Ω, then the log-likelihood map is the map ℓ: P → ℝΩ from the family
P to the space of measurable functions on Ω, denoted ℝΩ, taking p ↦ log(p(z)). Often this
notation is shortened to just ℓ = ℓ(p).
A statistical manifold is then defined as a subset of the space of all probability
measures on a sample space Ω, which is also a manifold, satisfying the following
conditions:
1. The log-likelihood function ℓ p(z;θ)) is smooth with respect to θ for each
z∊Ω; and
15 §7.2.2, "Statistical Inference" – G Casella and R Berger, 2001.
2. For each distribution function p(z;θ), the functions (known as the scores)
∂ℓ/∂θ1 (p(z; θ)), … , ∂ℓ/∂θn (p(z; θ))
are linearly independent – i.e. each score cannot be written as a linear
combination of the other scores.
Note that the chain rule at the end of section 2.3 of this thesis and the observation
regarding the action of differentials on the vector fields ∂/∂θ1 (p), … , ∂/∂θn (p) at the end
of section 2.4 of this thesis give the following useful identity for the scores:
dℓ( ∂/∂θk )(p) = ∂ℓ/∂θk (p)    (k = 1, … , n)
Reverting briefly to the theory of differential geometry, a differentiable map f between
differentiable manifolds X and Y is said to be an immersion if its differential
df: TxX → Tf(x)Y is injective at each point x in the domain of the map16. Immersions
are of interest because whilst they are embeddings locally they need not be so
globally – the canonical example of which is the map sending a circle to a figure-8,
whose image fails to be a manifold due to the crossover point in the middle of the figure-8.
It is seen then that this definition of a statistical manifold is similar to requiring that the
log-likelihood function is an immersion into the space of all probability functions on
the sample space. The only impediment with this interpretation is that an immersion
requires the target space to be a manifold also. The space of all probability functions
on a given sample space, however, must be in general infinite dimensional – being at
least as large as the space of all continuous functions on the sample space that
integrate to 1 – and so arises the difficulty with this point of view. There are many
definitions of infinite dimensional manifolds, but Murray & Rice opt to ignore these
details for the sake of simplicity.
Note that both these definitions of statistical manifolds induce manifolds via the co-ordinate systems θ = (θ1, … , θn).
As before, to verify that these conditions give a reasonable definition of a statistical
manifold, consider once more the family of Normal distributions on ℝ, denoted by 𝒩.
By its definition, 𝒩 is a subset of the set of all probability distributions on ℝ and ℝ
itself is definitely a manifold. For any p(z;θ)∊𝒩, its log-likelihood is given by
ℓ(p) = −(1/2)·log(2πς2) − z2/(2ς2) + zμ/ς2 − μ2/(2ς2)

16 §1.3, "Differential Topology" – V Guillemin and A Pollack, 1974.
which is smooth as a function of θ = (μ, ς) for each fixed z, so 𝒩 passes the first requirement.
For any probability density function p(z;θ)∊𝒩, the scores for this element are given by
the equations
∂ℓ/∂μ (p) = (z − μ)/ς2
∂ℓ/∂ς (p) = −1/ς + (z − μ)2/ς3
which are linearly independent since the former is a polynomial of degree one in z and
the latter is a quadratic polynomial in z. Therefore 𝒩 is once again to be deemed a
statistical manifold.
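The verification above lends itself to a quick numerical check. The following sketch (all function names are illustrative choices, not from the thesis) confirms the closed-form scores of the Normal family against central finite differences of the log-likelihood:

```python
import math

def log_likelihood(z, mu, sigma):
    # Log-density of the Normal distribution N(mu, sigma^2) at z.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (z - mu)**2 / (2 * sigma**2)

def scores(z, mu, sigma):
    # Closed-form scores derived above.
    return ((z - mu) / sigma**2,
            -1 / sigma + (z - mu)**2 / sigma**3)

def numerical_scores(z, mu, sigma, h=1e-6):
    # Central finite differences of the log-likelihood in each parameter.
    d_mu = (log_likelihood(z, mu + h, sigma) - log_likelihood(z, mu - h, sigma)) / (2 * h)
    d_sigma = (log_likelihood(z, mu, sigma + h) - log_likelihood(z, mu, sigma - h)) / (2 * h)
    return d_mu, d_sigma

z, mu, sigma = 1.3, 0.5, 2.0
exact = scores(z, mu, sigma)
approx = numerical_scores(z, mu, sigma)
assert all(abs(a - b) < 1e-5 for a, b in zip(exact, approx))
```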
3.3 – SUB-MANIFOLDS
A useful observation given by Arwini & Dodson17 is that certain probability families
contain other probability families via some restriction on the larger family’s
parameters. For example, consider the Log-Gamma distributions (named so because
for a random variable G following the Gamma distribution the random variable
(–log G) follows the Log-Gamma distribution) with density functions
p(z; ν, τ) = ν^τ · z^(ν−1) · (−log z)^(τ−1) / Γ(τ)
for z∊(0,1) and ν,τ > 0. Setting ν = τ = 1 gives p(z; 1, 1) = 1, which is the density function
of a Uniform random variable on (0,1).
Although a one-point set is a rather uninteresting manifold by any standard, this
example does at least highlight the fact that it is possible to consider smaller families
as a statistical sub-manifold of the larger families.
This viewpoint can be useful in statistical estimation and will be expounded later once
more of the information geometry framework has been established. However, in a
casual sense suppose that in the course of estimating a Normal random variable there
was reason to believe that the parameter ς was fixed at a given value, say ς0. By the
definition of a sub-manifold in section 2.1, this subset of the Normal family
ℳ = { p(μ, ς) ∈ 𝒩 : ς = ς0 }
is a sub-manifold of the Normal manifold.
17 §3.6, "Information Geometry: Near Randomness and Near Independence" – K Arwini and C Dodson, 2008.
It is not uncommon that experimental estimation gives rise to natural errors and so it
is possible that the estimate of ς will not be equal to ς0. As will be seen later, the
information geometry framework will give a way to project the estimated random
variable lying in 𝒩 onto the sub-manifold ℳ in a sensible manner (i.e. with some
justification based upon properties of the manifolds themselves) thus satisfying the
prior knowledge of this parameter. Whilst it might be convenient to simply ignore the
estimated value of ς, this is surely wasteful if there is some information that can be
extracted from its value.
3.4 – TANGENT SPACES OF STATISTICAL MANIFOLDS
A novel approach to defining tangent spaces of a given statistical manifold is given by
Murray & Rice18 by imposing an affine structure onto the space of measures on a given
measure space. Whilst a thorough exposition of measure theory deviates too far from
the scope of this thesis, the interested reader may like to consult Rudin19 for a
complete explanation of the theory. For immediate purposes, consider integration of a
real function over a set A ⊆ ℝ, written ∫A f dz. Whilst the dz is often thought of as just a signifier as to which
variable is being integrated, it is in fact a measure on the real line.
Although in this case, using this "standard" measure (known formally as the Lebesgue
measure), the measure of a simple set of the form (a,b) is just the length of the
interval, derived by taking the integral ∫(a,b) dz, by changing the measure one may
arrive at different measures of this set. One way of achieving this is to append a
non-negative function f to the measure and integrate over the interval to get its measure
∫(a,b) f dz.
An affine space consists of a set Z and a vector space V together with a commutative
operator '+' such that for any two elements z1 and z2 of Z there exists a unique vector
v∊V such that z1 + v = z2. Such a structure can be imposed on the space of positive
measures on a measure space by letting the (infinite dimensional) vector space be the
set of measurable functions and defining addition by dz + f = e^f·dz. This is
commutative since
(dz + f) + g = e^f·dz + g = e^g·e^f·dz = e^(f+g)·dz
dz + (f + g) = e^(f+g)·dz
18 Example 2.2.8, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
19 §1.18, "Real and Complex Analysis" – W Rudin, 1986.
To retrieve the measure pdz from the measure dz one simply translates by the
function log(p) – the log-likelihood of p – which is unique by the fact that the
exponential map is injective from ℝ to (0,∞).
The affine space structure affords the potential to write any path through a measure in
this space as the translation by a path through the vector space. That is, the path
through the measure space p(t)dz corresponds uniquely to translation by the path
through the vector space log(p(t)). The tangent vector of the latter is just the
t-derivative of log(p(t)) evaluated at zero.
Hence, via an argument of dimensionality, one may identify a basis for the tangent
space at p(0)dz, where p(t) = p(θ(t)) belongs to a probability family, with the tangent
vectors corresponding to the paths defined by log[p(θ1(0), … , θk(0) + t, … , θn(0))]
for k = 1, … , n – i.e. the scores of the distribution function p.
CHAPTER 4 – EXPONENTIAL FAMILIES
A class of probability distributions that have been extensively studied and shown to
exhibit many ‘desirable’ qualities in relation to statistical estimation is that of
exponential families. Amongst other properties, these distributions have been shown
to admit in some sense the ‘best’ statistical estimators above all other classes of
random variables. In information geometry, too, these families exhibit in many ways
some of the simplest geometric properties possible in a manifold.
4.1 – CANONICAL CO-ORDINATES OF EXPONENTIAL DISTRIBUTIONS
The definition of an exponential family is dependent on its probability distribution
functions – if they take the form
p(z; θ) = exp( C(z) + θ1y1(z) + ⋯ + θnyn(z) − φ(θ) )
where φ is a real-valued function on the parameter space and C and each of the yi are
real-valued functions on the sample space then the family is exponential.
For example, the Normal family is exponential by setting
y1(z) = z,  y2(z) = z2,  θ1 = μ/ς2,  θ2 = −1/(2ς2),
φ(θ1, θ2) = (1/2)·log(−π/θ2) − (θ1)2/(4θ2),  C(z) = 0
where θ1∊ℝ and θ2<0, since substituting these into the exponential form gives
p(z; θ) = exp( θ1z + θ2z2 − (1/2)·log(−π/θ2) + (θ1)2/(4θ2) )
= (1/√(2πς2)) · exp( μz/ς2 − z2/(2ς2) − μ2/(2ς2) )
= (1/(√(2π)·ς)) · exp( −(z2 − 2μz + μ2)/(2ς2) )
= (1/(√(2π)·ς)) · exp( −(z − μ)2/(2ς2) )
which is a Normal probability density function.
Notice that the co-ordinates here, θ1 and θ2, are quite different from the
co-ordinates that were introduced previously – namely, μ and ς. However, the two
sets of co-ordinates are in bijective correspondence since μ = −θ1·(2θ2)−1 and ς = (−2θ2)−1/2. For any
exponential family, the co-ordinates (θ1, … , θn) are called the canonical
co-ordinates.
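The change into canonical co-ordinates is easy to verify numerically. A minimal sketch, using the canonical quantities derived above (identifier names are invented for illustration):

```python
import math

def normal_pdf(z, mu, sigma):
    # The Normal density in the (mu, sigma) co-ordinates.
    return math.exp(-(z - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def exponential_form(z, mu, sigma):
    # The same density via the canonical co-ordinates derived above:
    # C(z) = 0, y1(z) = z, y2(z) = z^2.
    theta1 = mu / sigma**2
    theta2 = -1 / (2 * sigma**2)
    phi = 0.5 * math.log(-math.pi / theta2) - theta1**2 / (4 * theta2)
    return math.exp(theta1 * z + theta2 * z**2 - phi)

for z in (-1.0, 0.0, 2.5):
    assert abs(normal_pdf(z, 1.2, 0.7) - exponential_form(z, 1.2, 0.7)) < 1e-12
```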
An interesting question raised by Murray & Rice20 is whether it is possible to represent
the densities of an exponential family using two different sets of canonical co-ordinates. More precisely, whether one can write
p(z; θ) = exp( C(z) + θ1y1(z) + ⋯ + θnyn(z) − φ(θ) ) = exp( η1z1 + ⋯ + ηnzn − ψ(θ) )
for suitable co-ordinates η1, … , ηn and a function on the parameter space ψ : Θ → ℝ.
The answer that they supply is that it is indeed possible as long as the partial
derivatives
∂C/∂zi  (i = 1, … , n)  and  ∂yi/∂zj  (i, j = 1, … , n)
remain constant, in which case the two co-ordinates are related by
ηi = Σj=1..n (∂yj/∂zi)·θj + ∂C/∂zi    (i = 1, … , n)
Assuming these partial derivatives are constant, this is an affine relation and so it is
possible to define the notion of a ‘straight-line’ through an exponential family by lifting
straight lines in the real space image of any such canonical co-ordinate system back to
the family itself, i.e. if θ(t) defines a straight line in the parameter space inside ℝn,
then p(z; θ(t)) defines a straight line in the exponential family. Since these canonical
co-ordinates are affinely related, the straight lines for one co-ordinate system are
straight lines for any other.
4.2 – TESTING FOR EXPONENTIALITY
So far, it is only clear that if the distribution functions of a family may be expressed in a
certain way then that family is exponential. It is not so clear, however, from this how
to determine when a family is not exponential.
For example, the Logistic family on ℝ with probability density functions of the form
p(z; μ, s) = (1/s)·e^(−(z−μ)/s) · ( 1 + e^(−(z−μ)/s) )^(−2)    (μ ∈ ℝ, s > 0)
does not seem to take the form of an exponential family due to the inverse square
term but this in itself is not enough to thoroughly prove that this family is not of the
exponential type.
Murray & Rice21 give 4 criteria for determining outright whether a family of
distributions is exponential:
1. The family is an affine subspace of the space of all probability measures;
20 §1.2, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
21 §1.6, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
2. For each probability density function in the family, the second partial
derivatives of its log-likelihood function, considered as functions on the
sample space, are in the span of its scores and a constant;
3. The second fundamental form of the family (defined below) is zero; and
4. A generalisation of Efron’s statistical curvature (defined below) is zero.
In order to avoid any confusion, it is important to clarify that in the second test the
scores may be linearly combined with coefficients that are functions of the parameters
of the family since each probability density function is considered separately and so
the parameters in this case are just constants.
The first test, as noted by Murray & Rice, is equivalent to the existence of a
representation for the probability density functions of a family as per the definition of
an exponential family22. In the case of the Logistic family, say ℒ, this translates to the
existence of real-valued functions φ: ℝ× 0,∞) → ℝ, C: ℝ → ℝ and y1,y2: ℝ → ℝ such
that for any p(z;μ,s)∊ℒ
p(μ, s) = exp( −log(s) − (z−μ)/s − 2·log(1 + e^(−(z−μ)/s)) )
= exp( C(z) + μ·y1(z) + s·y2(z) − φ(μ, s) )
The difficulty with this test, as illustrated by this example, is that it can be difficult to
prove that no such representation exists. In a casual sense, one may wish to claim that
the logarithm term – being neither a function of z or (μ, s) alone, nor a product of a
function of z and either μ or s – precludes any such p(z;μ,s)∊ℒ from having an
exponential representation. However, this is not rigorous and in other cases, where
the function in question can be factored into the required terms, may lead to
misdiagnoses. One should therefore proceed with caution when applying this test to
prove that a given family of probability distributions is not exponential.
Continuing onto the second test, the log-likelihood function for any p(μ,s)∊ℒ is
ℓ(p(μ, s)) = −log(s) − (z−μ)/s − 2·log(1 + e^(−(z−μ)/s))
and so the scores are
∂ℓ/∂μ = 1/s − (2/s)·e^(−(z−μ)/s)/(1 + e^(−(z−μ)/s)) = (1/s)·( 1 − 2/(e^((z−μ)/s) + 1) )
∂ℓ/∂s = −1/s + ((z−μ)/s2)·( 1 − 2/(e^((z−μ)/s) + 1) )

22 §1.4, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
To prove that ℒ fails to be an exponential family it suffices to show that the span of the
scores and a constant does not include the partial derivatives of the scores with respect
to μ and s. To this end, note that the partial derivative
∂2ℓ/∂μ2 = −2·e^((z−μ)/s) / ( s2·(e^((z−μ)/s) + 1)2 )
is not a linear combination of the scores and a constant since the denominator
introduces a squared term which is not a part of either of the scores or any
combination of them. Therefore ℒ fails this test of exponentiality.
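The claim that the second derivative lies outside the span of the scores and a constant can be illustrated numerically: fit the best linear combination at three sample points and observe that it fails badly at a fourth. A sketch with μ = 0 and s = 1 fixed for simplicity (function names are assumptions, not from the thesis):

```python
import math

def score_mu(z, mu=0.0, s=1.0):
    return (1 / s) * (1 - 2 / (math.exp((z - mu) / s) + 1))

def score_s(z, mu=0.0, s=1.0):
    return -1 / s + ((z - mu) / s**2) * (1 - 2 / (math.exp((z - mu) / s) + 1))

def d2l_dmu2(z, mu=0.0, s=1.0):
    e = math.exp((z - mu) / s)
    return -2 * e / (s**2 * (e + 1)**2)

def solve3(A, b):
    # Tiny Gauss-Jordan elimination for a 3x3 system.
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

# Fit c0*1 + c1*score_mu + c2*score_s to the second derivative at three points...
pts = [-1.0, 0.5, 2.0]
A = [[1.0, score_mu(z), score_s(z)] for z in pts]
b = [d2l_dmu2(z) for z in pts]
c = solve3(A, b)

# ...then check that the fit fails at a fourth point: the second derivative is
# NOT a linear combination of the scores and a constant.
z4 = 4.0
fitted = c[0] + c[1] * score_mu(z4) + c[2] * score_s(z4)
assert abs(fitted - d2l_dmu2(z4)) > 1e-3
```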
The second fundamental form, also known as the embedding curvature by Amari &
Nagaoka23, is defined by considering the space of measurable functions on a measure
space as the direct product of the tangent space of an augmented statistical manifold
and its normal space in terms of the inner product 〈φ, ψ〉p = 𝔼p(φ∙ψ) (the expectation
of φ∙ψ with respect to the probability density function p).
Using the previous result regarding the second partial derivatives of the
log-likelihood functions of an exponential family being in the span of the scores, it is
shown that the normal component of these second partial derivatives must in fact be
equal to zero hence giving a computable criterion for exponentiality – the second
fundamental form of the probability density function p:
αi,j(p) = ∂2ℓ/∂θi∂θj − Σs,t=1..n gs,t(p)·𝔼p( (∂2ℓ/∂θi∂θj)·(∂ℓ/∂θs) )·(∂ℓ/∂θt) − 𝔼p( ∂2ℓ/∂θi∂θj )
where [gs,t] is the inverse matrix of the matrix (known as the Fisher information
matrix) defined by
gi,j(p) = 𝔼p( (∂ℓ/∂θi)·(∂ℓ/∂θj) ) = −𝔼p( ∂2ℓ/∂θi∂θj )
If each αi,j is zero for each probability density function p in the statistical manifold,
then the normal component of the derivatives of the scores is zero and so the family is
exponential.
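As a numerical aside (the quadrature scheme and names here are illustrative choices, not from the thesis), the Fisher information matrix of the Normal family, which in the (μ, ς) co-ordinates is known to be diag(1/ς2, 2/ς2), can be recovered from the defining expectation by a plain Riemann sum:

```python
import math

def normal_pdf(z, mu, sigma):
    return math.exp(-(z - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def expectation(f, mu, sigma, n=4000):
    # E_p[f] by a midpoint Riemann sum over +/- 10 standard deviations.
    lo = mu - 10 * sigma
    dz = 20 * sigma / n
    return sum(f(lo + (i + 0.5) * dz) * normal_pdf(lo + (i + 0.5) * dz, mu, sigma) * dz
               for i in range(n))

mu, sigma = 0.7, 1.5
s_mu = lambda z: (z - mu) / sigma**2                      # score w.r.t. mu
s_sigma = lambda z: -1 / sigma + (z - mu)**2 / sigma**3   # score w.r.t. sigma

g11 = expectation(lambda z: s_mu(z) * s_mu(z), mu, sigma)
g22 = expectation(lambda z: s_sigma(z) * s_sigma(z), mu, sigma)
g12 = expectation(lambda z: s_mu(z) * s_sigma(z), mu, sigma)

assert abs(g11 - 1 / sigma**2) < 1e-6   # g_1,1 = 1/sigma^2
assert abs(g22 - 2 / sigma**2) < 1e-6   # g_2,2 = 2/sigma^2
assert abs(g12) < 1e-9                  # off-diagonal entries vanish
```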
Note that the Fisher information matrix exists under the requirements placed on the
probability density functions of a statistical manifold by Amari & Nagaoka, since the
existence of the expectations of the squared scores of each probability density function,
as per the Fisher information, implies that the expectation of the product of
two such scores is also integrable. However, under the conditions of Murray & Rice
that the probability density functions of a statistical manifold must satisfy, the Fisher
information matrix may not exist, a fact which is noted by these authors24.
Having already verified that ℒ is not an exponential family, it is of no real benefit to
calculate the large number of complicated terms in each second fundamental form.
23 §1.9, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
24 §1.5.2, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
However, for families where these expressions may simplify due to terms vanishing,
this test may prove preferable due to the fact that it requires only the evaluation of
each fundamental form and the verification that they are zero.
The last test for exponentiality is given by the function of the second fundamental
forms and the inverse of the Fisher information matrix for a given probability density
function p in a statistical manifold:
γ(p) = Σi,j,k,l=1..n gi,j(p)·gk,l(p)·𝔼p( αi,k(p)·αj,l(p) )
If this function is zero on the statistical manifold, then this family is exponential25.
Once again, given the large number and complexity of the terms involved in this
expression, the evaluation of this function on ℒ will be dismissed. Instead, consider the
simpler one parameter family of probability distributions that is the Poisson family on
the non-negative integers with density functions of the form
p(k; λ) = e−λ·λk / k!    (λ > 0)
The log-likelihood, its first and second derivatives and the Fisher information “matrix”
are given by
ℓ(p(λ)) = k·log(λ) − λ − log(k!)
∂ℓ/∂λ (λ) = k/λ − 1
∂2ℓ/∂λ2 (λ) = −k/λ2
g(p(λ)) = −𝔼p( −k/λ2 ) = 1/λ
The Poisson family passes the second test immediately since
∂2ℓ/∂λ2 = −k/λ2 = (−1/λ)·(∂ℓ/∂λ) + (−1/λ)·1
factors into a linear combination of the score and 1.
Given that this family has only one parameter, the last two tests are rather easy to
apply also. The second fundamental form of the Poisson family is
α(p) = ∂2ℓ/∂λ2 − g−1(p)·𝔼p( (∂2ℓ/∂λ2)·(∂ℓ/∂λ) )·(∂ℓ/∂λ) − 𝔼p( ∂2ℓ/∂λ2 )
= −k/λ2 − λ·𝔼p( −k2/λ3 + k/λ2 )·( k/λ − 1 ) + 1/λ
= −k/λ2 − λ·( −1/λ2 )·( k/λ − 1 ) + 1/λ    (using 𝔼p(k2) = λ + λ2 and 𝔼p(k) = λ)
= −k/λ2 + k/λ2 − 1/λ + 1/λ = 0
25 §1.5.2, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
and so the Poisson family is exponential.
For the last criterion, note that the test function γ is indeed zero since
γ(p) = (g−1(p))2·𝔼p( α(p)2 ) = λ2·𝔼p(0) = 0
Although these tests were very easy to perform for the Poisson family, they need not
always be so simple, or even calculable, as in the Logistic family or any other family
with a large number of parameters or complex probability density functions. This
illustrates the benefit of having a wide range of tests for exponentiality available.
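The Poisson computations above can also be replayed numerically; a sketch that truncates the expectations to k < 60 (a negligible tail for λ = 3, an arbitrary choice) and checks that the second fundamental form α(p) vanishes pointwise:

```python
import math

lam = 3.0
K = 60  # truncation; the Poisson(3) mass beyond k = 60 is negligible

def pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

d1 = lambda k: k / lam - 1            # score
d2 = lambda k: -k / lam**2            # second derivative of the log-likelihood

E = lambda f: sum(f(k) * pmf(k) for k in range(K))

g = E(lambda k: d1(k)**2)             # Fisher information, expect 1/lam
assert abs(g - 1 / lam) < 1e-9

cross = E(lambda k: d2(k) * d1(k))    # expect -1/lam^2
mean_d2 = E(d2)                       # expect -1/lam

# Second fundamental form alpha(p), evaluated pointwise in k.
alpha = lambda k: d2(k) - (1 / g) * cross * d1(k) - mean_d2
assert all(abs(alpha(k)) < 1e-9 for k in range(K))
```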
CHAPTER 5 – STRAIGHTNESS
In ℝn with Cartesian co-ordinates there is a well-understood concept of straight lines –
for example, one way of many to define when a path in ℝn, γ(t): [0,1] → ℝn, is straight
is if its derivative with respect to t, γ′(t), is constant. However, in the setting of
manifolds where γ(t): [0,1] → X is a path in a manifold X, there is no assumed structure
on X that allows one to compute for any x,y∊X the quantity "x − y" and hence
differentiation by first principles. Furthermore, there is no assumed relation between
the tangent spaces Tγ(t)X and Tγ(s)X for t,s∊[0,1] and so asking if dγ(t): [0,1] → TX is a
constant map need not be well-defined.
framework is required to define straightness on manifolds.
5.1 – CONNECTIONS
Since straight lines are related to rates of change of functions on manifolds, it is
natural to consider the tangent spaces of a manifold when defining straightness. Given
a path through a manifold X, p(t): [0,1] → X, for each value of t∊[0,1] the path defines
a tangent vector in Tp(t)X via the map to the equivalence class t ↦ [(θ∘p)′(t)], where θ
is a co-ordinate chart on a neighbourhood containing p(t), and so the path gives rise to
a vector field over X denoted by ṗ(t). If this vector field is constant, i.e. has zero rate of
change, then the path p(t) is straight. The problem that arises here is that, in general,
for two points x and y in X the corresponding tangent spaces TxX and TyX need not be
related and so there may not be any way of comparing tangent vectors from each
space in an obvious way, let alone take derivatives via limits.
To overcome this problem, one defines an operator called a connection26 or covariant
derivative27 on vector fields denoted by ∇. In order to make this operator as flexible as
possible, Murray & Rice simply list 3 properties that it must satisfy:
1. For any vector field V, ∇V: TxX → TxX is a linear function from the tangent
space TxX to itself for every x∊X;
2. For any two vector fields V1 and V2, ∇(V1 + V2) = ∇V1 + ∇V2; and
3. For any vector field V and differentiable function f: X → ℝ,
∇(f·V)(γ′) = df(γ′)·V(x) + f(x)·∇V(γ′) for all γ′∊TxX and all x∊X.
Notice then that if ∇ is such a connection then expanding a vector field V in terms of
the co-ordinate vector fields and functions v1, … , vn: X → ℝ
26 §4.2, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
27 §1.6, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
V(x) = v1(x)·∂/∂θ1 (x) + ⋯ + vn(x)·∂/∂θn (x)
then properties 2 and 3 show that for any tangent vector γ’∊TxX
∇V(γ′) = Σi=1..n [ dvi(γ′)·∂/∂θi (x) + vi(x)·∇(∂/∂θi)(γ′) ]
and so the connection is completely specified by its values on the co-ordinate vector
fields.
Now, by property 1 it is possible to write for each i = 1, … , n
∇(∂/∂θi)(γ′) = A1i(γ′)·∂/∂θ1 (x) + ⋯ + Ani(γ′)·∂/∂θn (x)
for some real-valued linear functions A1i , … , Ani on the tangent space. The chain rule
expands each of these functions as
Aji(γ′) = Γji,1·dθ1(γ′) + ⋯ + Γji,n·dθn(γ′)
where the Γji,k: X → ℝ are real-valued functions on the manifold called the Christoffel
symbols.
Then finally this gives the representation
∇(∂/∂θi)( ∂/∂θj ) = Σk=1..n Γki,j·∂/∂θk
which completely specifies the connection ∇. Although suppressed from this
representation in order to simplify notation, in general the Christoffel symbols
vary over the manifold, i.e. Γji,k = Γji,k(x) ≠ Γji,k(y) for x ≠ y.
Notice that so far no mention has been made of precisely how a connection should be
defined, just the properties that it should satisfy, even though the aim has been to
establish what straight lines are on arbitrary manifolds. Whilst this may seem
counter-intuitive, it is intentional and reflects the idea that a given manifold does not
have an arbitrary sense of straightness. Although a path is straight if the connection of
the vector field it traces out is the zero function, this is entirely dependent on the
connection itself.
To illustrate this distinction, consider the plane with standard Cartesian
co-ordinates. Let pt = (αt, βt) be a path on the plane with vector field
ṗt = α′t·∂/∂x (pt) + β′t·∂/∂y (pt)
then the action of a connection can be expressed as
∇ṗt(γ′) = d[α′t](γ′)·∂/∂x (pt) + α′t·∇(∂/∂x)(γ′) + d[β′t](γ′)·∂/∂y (pt) + β′t·∇(∂/∂y)(γ′)
Now if one makes the decision that the Cartesian co-ordinate vector fields are to be
constant, so that when acted on by the connection they both vanish, then this formula
reduces to the standard Euclidean form
∇ṗt(γ′) = d[α′t](γ′)·∂/∂x + d[β′t](γ′)·∂/∂y
which agrees with the usual sense of straightness on the plane – i.e. the path
pt = (αt, βt) is straight if its Cartesian co-ordinates, αt and βt, are linear so that d[α′t]
and d[β′t] are zero.
However, if one chooses instead to use polar co-ordinates on the plane (locally, since
these co-ordinates do not give an injective description of the plane, amongst other
things) and specify that the polar co-ordinate vector fields are constant then this
formula becomes
∇ṗt(γ′) = d[α′t](γ′)·∂/∂r + d[β′t](γ′)·∂/∂θ
and so pt is now straight only when its radial and rotational components change at constant rates.
That is, the straight lines under this geometry are formed from rays from the origin
and rotations about the origin. So, it is seen that the plane does not have any inherent
notions of straight lines and that this concept is to be specified by connections.
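The contrast between the two geometries can also be seen numerically: a path whose polar co-ordinates change at constant rates is not straight in the Euclidean sense. A small sketch (the particular path is an arbitrary choice for illustration):

```python
import math

# A path whose polar co-ordinates change at constant rates:
# r(t) = 1 + t, theta(t) = 0.5 * t -- "straight" under the polar connection.
ts = [i * 0.01 for i in range(101)]
xs = [(1 + t) * math.cos(0.5 * t) for t in ts]

def second_difference(vals, i, h=0.01):
    # Finite-difference estimate of the second t-derivative at index i.
    return (vals[i + 1] - 2 * vals[i] + vals[i - 1]) / h**2

# Its Cartesian co-ordinates are NOT linear in t (non-zero second derivative),
# so it is not straight under the Euclidean connection...
assert abs(second_difference(xs, 50)) > 0.1

# ...whereas a Euclidean straight line has (numerically) zero second differences.
line_xs = [2 * t + 1 for t in ts]
assert abs(second_difference(line_xs, 50)) < 1e-8
```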
5.2 – SUBSPACE CONNECTIONS
Given a manifold X with connection ∇ there is a natural way of establishing a
connection on any subspace Y⊂X by treating points in Y as points in X and tangent
spaces in Y as subspaces of tangent spaces in X. However, in general there is no
guarantee that for a given vector field V on Y that the image of ∇V will lie entirely
within TyY for all y∊Y.
For example, consider the unit circle, S1, as a sub-manifold of the plane under
Euclidean geometry. Writing points (locally) in S1 as just θ∊ℝ, the corresponding point
in ℝ2 is given by (cos θ, sin θ) and so the co-ordinate vector fields satisfy
∂/∂θ = −sin θ·∂/∂x + cos θ·∂/∂y
Now, property 3 of connections in §5.1 gives the expression
∇(∂/∂θ)(∂/∂θ) = −cos θ·∂/∂x − sin θ·∇(∂/∂x)(∂/∂θ) − sin θ·∂/∂y + cos θ·∇(∂/∂y)(∂/∂θ)
= −cos θ·∂/∂x − sin θ·∂/∂y
where the values of the connection on the x-y vector fields are zero by the chosen
Euclidean geometry of ℝ2.
It is seen then that the image of a tangent vector in S1 is always pointing inwards
towards the origin which is certainly not tangential to S1 and so this connection on the
plane fails to produce a connection on S1. In the case that a connection ∇ on a
manifold X is also a connection when restricted to a sub-manifold Y, the latter is said
to be an autoparallel sub-manifold28 of X.
When a sub-manifold Y⊂X is not autoparallel with respect to a connection ∇, it is still
possible to use this connection to define a connection on Y. As in Amari & Nagaoka29,
if πy: TyX → TyY is a linear map defined for each y∊Y that fixes TyY, i.e. a projection
from TX onto TY, then the operator ∇π[V(γ′)] = πy[∇V(γ′)] defines a connection on Y
since it is linear by assumption and for any vector field V on Y, differentiable function
f: Y → ℝ and each y∊Y
∇π[fV(γ′)] = πy[∇(fV)(γ′)]
= πy[ df(γ′)·V(y) + f(y)·∇V(γ′) ]
= df(γ′)·πy[V(y)] + f(y)·πy[∇V(γ′)]
= df(γ′)·V(y) + f(y)·∇π[V(γ′)]
5.3 – α-CONNECTIONS
Having now defined the generalised concept of connections, it is appropriate to
mention the family of connections on statistical manifolds defined by Amari &
Nagaoka30 called the α-connections.
Let P be a statistical manifold with co-ordinates θ = (θ1, … , θn); then for each point
p∊P and some α∊ℝ, Murray & Rice show that the α-connection is defined by the
Christoffel symbols31
Γji,k(α)(p) = Σl=1..n gj,l·𝔼p( ( ∂2ℓp/∂θi∂θk + ((1 − α)/2)·(∂ℓp/∂θi)·(∂ℓp/∂θk) )·(∂ℓp/∂θl) )
For a given α∊ℝ, Amari & Nagaoka note that it is possible to define the
α-connection in terms of three special instances of these connections. Namely, the
following equalities hold
28 §1.8, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
29 §1.9, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
30 §2.3, "Methods of Information Geometry" – S Amari and H Nagaoka, 1993.
31 §4.6, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
∇(α) = (1 − α)·∇(0) + α·∇(1) = ((1 + α)/2)·∇(1) + ((1 − α)/2)·∇(−1)
To see the significance of the 1-connection, consider an exponential family
p(z; θ) = exp( C(z) + θ1y1(z) + ⋯ + θnyn(z) − φ(θ) )
Then the partial derivatives of its log-likelihood are given by
∂ℓp/∂θi = yi(z) − ∂φ/∂θi (θ)
∂2ℓp/∂θi∂θk = −∂2φ/∂θi∂θk (θ)
and so the equation defining the Christoffel symbols reduces to
Γji,k(1)(p) = −Σl=1..n gj,l·( ∂2φ/∂θi∂θk )·𝔼p( ∂ℓp/∂θl )
which is zero since (recalling the assumption when defining a statistical manifold in
§3.1 that orders of differentiation and integration of the scores may be swapped)
𝔼p( ∂ℓp/∂θl ) = ∫ℝn (∂ℓp(z)/∂θl)·p(z) dz
= ∫ℝn (1/p(z))·(∂p(z)/∂θl)·p(z) dz
= ∂/∂θl ∫ℝn p(z) dz
= ∂/∂θl (1) = 0
Hence, because ∇(1) reduces to
∇V(γ′) = Σi=1..n dvi(γ′)·∂/∂θi (p)
for any vector field
V(p) = v1(p)·∂/∂θ1 (p) + ⋯ + vn(p)·∂/∂θn (p)
on an exponential family, such families are said to be flat with respect to the
1-connection.
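The vanishing expectation of the scores used above is easy to confirm numerically for a concrete exponential family; a sketch using the one-parameter exponential distribution p(z; θ) = θ·e^(−θz) on (0, ∞), whose score is 1/θ − z (the truncation point and step count are arbitrary assumptions):

```python
import math

theta = 1.7

def expect_score(n=200000):
    # E_p[d/dtheta log p] by a midpoint Riemann sum; the tail beyond
    # z = 50/theta carries mass ~e^-50 and is ignored.
    hi = 50 / theta
    dz = hi / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * dz
        total += (1 / theta - z) * theta * math.exp(-theta * z) * dz
    return total

assert abs(expect_score()) < 1e-5
```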
Conversely, consider now a mixture family on a measure space M given by
p(m; θ) = θ1·p1(m) + ⋯ + θn·pn(m)
where each pi is a probability distribution function on M and the θi are non-negative
weights satisfying θ1 + … + θn = 1. Then the partial derivatives of the log-likelihood
functions are given by
32
∂ℓp/∂θi = pi(m)/p(m; θ)
∂2ℓp/∂θi∂θk = −pi(m)·pk(m)/p(m; θ)2 = −(∂ℓp/∂θi)·(∂ℓp/∂θk)
Thus the Christoffel symbols for ∇(-1) vanish since
∂2ℓp/∂θi∂θk + (∂ℓp/∂θi)·(∂ℓp/∂θk) = −(∂ℓp/∂θi)·(∂ℓp/∂θk) + (∂ℓp/∂θi)·(∂ℓp/∂θk) = 0
Therefore, any mixture family is flat with respect to the (-1)-connection.
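The identity ∂2ℓp/∂θi∂θk = −(∂ℓp/∂θi)·(∂ℓp/∂θk) behind this flatness can be checked by finite differences for a two-component mixture; a sketch where the components are (an arbitrary assumption) standard Normals centred at 0 and 2:

```python
import math

# Two fixed component densities on the real line.
p1 = lambda m: math.exp(-m**2 / 2) / math.sqrt(2 * math.pi)
p2 = lambda m: math.exp(-(m - 2)**2 / 2) / math.sqrt(2 * math.pi)

def ell(m, t1, t2):
    # Log-likelihood of the mixture t1*p1 + t2*p2.
    return math.log(t1 * p1(m) + t2 * p2(m))

m, t1, t2, h = 0.7, 0.4, 0.6, 1e-4

# Finite-difference first and mixed second partials in the weights.
d1 = (ell(m, t1 + h, t2) - ell(m, t1 - h, t2)) / (2 * h)
d2 = (ell(m, t1, t2 + h) - ell(m, t1, t2 - h)) / (2 * h)
d12 = (ell(m, t1 + h, t2 + h) - ell(m, t1 + h, t2 - h)
       - ell(m, t1 - h, t2 + h) + ell(m, t1 - h, t2 - h)) / (4 * h**2)

# The mixture identity: the second partial equals minus the product of scores.
assert abs(d12 + d1 * d2) < 1e-5
```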
It will be seen later when examining inner products on manifolds that the
0-connection coincides with a special type of connection which preserves inner
products under a form of translation on manifolds – a Riemannian connection. Thus
the α-connection is seen to give a way of unifying these three concepts into a more
general form.
5.4 – GEODESICS
In real space with Cartesian co-ordinates, the shortest path between any two points,
say x and y, is always just the line joining the two points: γ(t) = t·x + (1 − t)·y. However,
in general this is not always the case. Although locally all manifolds look like real space,
as has been shown it is their geometries that define the idea of straightness.
As in Murray & Rice32, a path γt, where t ranges over an interval I⊂ℝ, in a manifold X
with connection ∇ is a geodesic if it satisfies the equality
∇γt′(γt′) = 0
for all t∊I, where γt′ = [(θ∘γt)′(t)] is the vector field traced out by γt. The
interpretation is that the path traced out by γt is straight under the geometry implied
by the connection.
In order to find such paths in a manifold, Murray & Rice go on to derive the following
system of differential equations which the co-ordinates of a geodesic,
θi = θi(γt), i = 1, … , n, must satisfy
θ̈i + Σj,k=1..n Γij,k·θ̇j·θ̇k = 0
where the derivatives are taken with respect to t.
32 §4.9, "Differential Geometry and Statistics" – M Murray and J Rice, 1993.
If the manifold is flat under the connection specified by the Christoffel symbols, i.e. the
Γji,k are all zero, then this reduces to the standard concept of geodesics. Namely, paths
whose co-ordinate representations are of the form θi(γt) = ai + bi·t for constants
(a1, … , an) and (b1, … , bn), i = 1, … , n.
In general, however, one must solve these partial differential equations to determine
the geodesics defined by a connection (or show that a path satisfies these equations)
which, recalling that the Christoffel symbols are functions on the manifold in general,
can be a difficult task. To elucidate this and the established theory so far in this chapter,
two examples will now be presented.
5.5 – STATISTICAL MANIFOLD GEODESICS
A bountiful reference for the explicit forms of many of these concepts on statistical
manifolds is given by Arwini & Dodson. One of many families that they consider is that
of the Normal distributions on ℝ:
𝒩 = { p(z; μ, ς) = (1/(√(2π)·ς))·exp( −(z − μ)2/(2ς2) ) : μ ∊ ℝ, ς > 0 }
They present the non-zero forms of the Christoffel symbols for the α-connections of
this family without ado as follows33:
Γ11,2(α) = Γ12,1(α) = −(α + 1)/ς
Γ21,1(α) = (1 − α)/(2ς)
Γ22,2(α) = −(2α + 1)/ς
Using these coefficients, the system of equations defined by Murray & Rice for the
geodesics of this geometry are derived here as
μ̈ = −Γ11,2(α)·μ̇ς̇ − Γ12,1(α)·ς̇μ̇ = (2(α + 1)/ς)·μ̇ς̇
ς̈ = −( Γ21,1(α)·μ̇2 + Γ22,2(α)·ς̇2 ) = −((1 − α)/(2ς))·μ̇2 + ((2α + 1)/ς)·ς̇2
It is worth pointing out that setting α = 0 gives the following system of equations
μ̈ς = 2μ̇ς̇
ςς̈ = −(1/2)·μ̇2 + ς̇2
33 §3.8.1, "Information Geometry: Near Randomness and Near Independence" – K Arwini & C Dodson, 2008.
which is well-understood from the equations for geodesics on the hyperbolic plane
given by the Poincaré metric34.
The simplest solution to these equations is given by the constant function μ = c∊ℝ and
the exponential map ς = k·e^t for some k∊ℝ, since the derivatives of μ are all zero and
ς is unchanged by differentiation.
The more interesting solutions are given by μ = −p·tanh(at) and ς = r·sech(bt) for
some constants a, b, p, r∊ℝ. To determine the required relations for these constants
note that the derivatives of μ and ς are given by
μ̇ = ap·(−1 + tanh(at)2) = −ap·sech(at)2
μ̈ = 2a2p·tanh(at)·(1 − tanh(at)2)
ς̇ = −br·tanh(bt)·sech(bt)
ς̈ = −b2r·sech(bt)3 + b2r·tanh(bt)2·sech(bt)
Hence, the coefficients must satisfy the equality
μ̈ς = 2a2pr·tanh(at)·(1 − tanh(at)2)·sech(bt)
= 2abpr·tanh(bt)·(1 − tanh(at)2)·sech(bt)
= 2μ̇ς̇
so that a = b. The coefficients must also satisfy the equality
ςς̈ = −b2r2·sech(bt)4 + b2r2·tanh(bt)2·sech(bt)2
= −(a2p2/2)·sech(at)4 + b2r2·tanh(bt)2·sech(bt)2
= −(1/2)·μ̇2 + ς̇2
which implies p2 = 2r2. Thus the geodesic co-ordinate functions are
μ = −r√2·tanh(at)
ς = r·sech(at)
Note that this reduces to the relation ς2 = r2 − (1/2)·μ2 via standard hyperbolic
identities.
34 p.76, “Dictionary of Distances” – E Deza & M Deza, 2006.
As seen in the graph below, although the geodesics for fixed μ correspond to the
standard concept of straightness, all other geodesics are actually semi-ellipses, which
is certainly an uncommon interpretation of straight lines.
(Elliptical geodesics of the Normal statistical manifold under the 0-connection)
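The claimed solution can be checked by direct substitution. The following minimal sketch (with arbitrarily chosen constants a and r, which are assumptions of the illustration, not values from the text) evaluates the residuals of both 0-connection geodesic equations along the curve μ(t) = −√2∙r∙tanh(at), ς(t) = r∙sech(at); the residuals should vanish up to rounding error.

```python
import math

# Sanity check of the claimed alpha = 0 geodesic of the Normal family:
#   mu(t) = -sqrt(2)*r*tanh(a t),  sigma(t) = r*sech(a t)
# against  mu''*sigma = 2*mu'*sigma'  and  sigma''*sigma = -mu'^2/2 + sigma'^2.
a, r = 1.3, 2.0                       # arbitrary illustrative constants
sech = lambda x: 1.0 / math.cosh(x)

def residuals(t):
    th, sh = math.tanh(a * t), sech(a * t)
    mu_d  = -math.sqrt(2) * r * a * sh**2               # mu'
    mu_dd =  2 * math.sqrt(2) * r * a**2 * sh**2 * th   # mu''
    sg    =  r * sh
    sg_d  = -r * a * sh * th                            # sigma'
    sg_dd =  r * a**2 * (sh * th**2 - sh**3)            # sigma''
    return (mu_dd * sg - 2 * mu_d * sg_d,
            sg_dd * sg - (-0.5 * mu_d**2 + sg_d**2))

print(max(abs(x) for t in (-1.0, -0.3, 0.0, 0.7, 1.5) for x in residuals(t)))
```

Both residuals cancel analytically, so the printed maximum should be at the level of floating-point noise.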
Another family presented by Arwini & Dodson is the family of Gamma distributions on
{ z | z > 0 } = ℝ>0, given in (non-canonical) co-ordinates as

Ⅎ = { p(z; γ, κ) = (κ/γ)^κ ( z^(κ−1) / Γ(κ) ) e^(−zκ/γ) | γ > 0, κ > 0 }
In particular, they show that the system of equations that must be satisfied by
geodesics of the 0-connection is

γ″ = (γ′)²/γ − γ′κ′/κ

κ″ = κ(γ′)² / ( 2γ²(κψ′(κ)−1) ) − (κ²ψ″(κ)+1)(κ′)² / ( 2κ(κψ′(κ)−1) )

where ψ is the digamma function defined by

ψ(y) = Γ′(y)/Γ(y)
In comparison to the simpler equations established for the Normal family under the
0-connection, the equations for the Gamma family are not readily solvable due to the
complicated digamma function. Indeed, Arwini & Dodson present only computer
generated geodesics due to this inherent difficulty.
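Such computer-generated geodesics are straightforward to reproduce. The sketch below is an illustration under stated assumptions (it is not Arwini & Dodson's code): it integrates the system above with a hand-rolled Runge–Kutta step, approximating ψ′ and ψ″ by finite differences of math.lgamma, and the starting point and initial velocity are chosen arbitrarily.

```python
import math

# Numerical integration of the Gamma-family geodesic ODEs (0-connection).
# psi'(k) and psi''(k) are approximated by central differences of lgamma,
# which is adequate for an illustrative plot or table.
def psi1(k, h=1e-4):     # psi'(k) = second derivative of log Gamma
    return (math.lgamma(k + h) - 2 * math.lgamma(k) + math.lgamma(k - h)) / h**2

def psi2(k, h=1e-3):     # psi''(k) = third derivative of log Gamma
    return (math.lgamma(k + 2*h) - 2*math.lgamma(k + h)
            + 2*math.lgamma(k - h) - math.lgamma(k - 2*h)) / (2 * h**3)

def rhs(state):
    g, k, gd, kd = state              # (gamma, kappa, gamma', kappa')
    denom = k * psi1(k) - 1.0
    gdd = gd**2 / g - gd * kd / k
    kdd = k * gd**2 / (2 * g**2 * denom) - (k**2 * psi2(k) + 1) * kd**2 / (2 * k * denom)
    return [gd, kd, gdd, kdd]

def rk4_step(s, h):
    k1 = rhs(s); k2 = rhs([x + h/2*d for x, d in zip(s, k1)])
    k3 = rhs([x + h/2*d for x, d in zip(s, k2)]); k4 = rhs([x + h*d for x, d in zip(s, k3)])
    return [x + h/6*(a + 2*b + 2*c + d) for x, a, b, c, d in zip(s, k1, k2, k3, k4)]

state = [1.0, 2.0, 0.5, -0.3]         # arbitrary start (gamma, kappa) = (1, 2)
for _ in range(100):
    state = rk4_step(state, 0.01)
print(state[:2])                      # geodesic endpoint after t = 1
```

The same loop, recording intermediate states, produces the geodesic curves that can only be drawn numerically for this family.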
Therefore, it is seen that there is potential for complication in establishing geodesics
for a given family under a given connection. This should be kept in mind, for little
benefit can be gained if one is unable to implement the theory in practice.
CHAPTER 6 – DISTANCES ON STATISTICAL MANIFOLDS
The parameters θ1, … , θn of a statistical family take values in a given subset of ℝn and
so it is not unreasonable to propose a way to measure distances between elements of
the family by using, say, the standard Euclidean distance on the parameters. However,
as motivation to investigate other ideas of distance, consider the so-called “taxicab
metric”35 on ℝ² given by

d( (x₁,y₁), (x₂,y₂) ) = |x₂ − x₁| + |y₂ − y₁|
As the name suggests, this metric is supposed to emulate the concept of distance
perceived by a vehicle that can only move on a vertical and horizontal grid as defined
by the roadways.
Under this metric, any route that has a total horizontal displacement of |x2-x1| and a
total vertical displacement of |y2-y1|, no matter how many horizontal and vertical
sections this is composed of, will have achieved the minimum distance between
(x1,y1) and (x2,y2). Compare this with the Euclidean distance on ℝ2 which has a
unique shortest path given by the diagonal between (x1,y1) and (x2,y2). Hence, there
are times when one wishes to consider alternative measures to accommodate for
peculiarities that one may encounter.
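A two-line computation makes the contrast concrete; this is an illustrative sketch only.

```python
# Taxicab vs. Euclidean distance on R^2: the two metrics genuinely disagree,
# and the taxicab minimum is achieved by many staircase routes.
def taxicab(p, q):
    return abs(q[0] - p[0]) + abs(q[1] - p[1])

def euclidean(p, q):
    return ((q[0] - p[0])**2 + (q[1] - p[1])**2) ** 0.5

p, q = (0, 0), (3, 4)
print(taxicab(p, q), euclidean(p, q))   # 7 5.0
```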
6.1 – RIEMANNIAN METRICS
Although initially seeming somewhat roundabout, a common method of constructing a
measure of distance on a manifold is via its tangent space. Since this space has a vector
space structure by default, one may define an inner product over each point on the
base manifold. That is, a function 〈 · | · 〉x :TxX×TxX → ℝ that satisfies the following
rules:
1. For a given x∊X and for all a,b∊ℝ and u,v,w∊TxX,
〈 a∙u + b∙v | w 〉x = a∙〈 u | w 〉x + b∙〈 v | w 〉x ;
2. For a given x∊X and for all u,v∊TxX, 〈 u | v 〉x = 〈 v | u 〉x ; and
3. For a given x∊X, if u∊TxX is not the zero vector, then 〈 u | u 〉x > 0.
Under these circumstances, this inner product is called a Riemannian metric36 on the
(Riemannian) manifold X.
35 “Taxicab Geometry: An Adventure in Non-Euclidean Geometry” – E Krause, 1986.
36 §1.5, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
Given a co-ordinate system (θ¹, … , θⁿ) with corresponding basis

∂/∂θ¹(x), … , ∂/∂θⁿ(x)

for the tangent space at x∊X, a Riemannian metric can be defined in terms of its action
on the basis via the coefficients

gᵢ,ⱼ(x) = 〈 ∂/∂θⁱ(x) | ∂/∂θʲ(x) 〉x

This follows from the representation of any tangent vector u∊TxX as a linear combination
of the basis vectors

u = u¹ ∂/∂θ¹(x) + ⋯ + uⁿ ∂/∂θⁿ(x)

combined with the first rule, linearity, for inner products. The inner product of two
tangent vectors u,v∊TxX is then given by

〈 u | v 〉x = Σⁿᵢ,ⱼ₌₁ gᵢ,ⱼ(x) uⁱ vʲ
Given a path through a manifold, say γ: [0,1] → X, one defines its length using a
Riemannian metric as

length(γ) = ∫₀¹ √( 〈 γ′(t) | γ′(t) 〉γ(t) ) dt

where γ′(t) = (θ∘γ)′(t) is the vector field traced out by γ.
From this, it is possible to define a distance function37 on the manifold as

d(x, y) = inf { length(γ) | γ: [0,1] → X , γ(0) = x , γ(1) = y }
In particular, d( , ) is a metric and thus has the following properties for all x,y,z∊X
1. d(x,y) ≥ 0 and d(x,y) = 0 if and only if x = y ;
2. d(x,y) = d(y,x); and
3. d(x,z) ≤ d(x,y) + d(y,z).
It is seen, as with connections, that notions of distance are not by any means inherent
to a given manifold. A vector space may admit any number of inner products, which in
turn give rise to different metrics on the manifold. In order to establish a unified
framework for statistical manifolds, it is necessary to give some way of specifying a
Riemannian metric based upon the family in question itself.

37 §6.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
To this end, Murray & Rice continue by introducing a metric known as the Fisher
information metric, a generalisation of the Fisher Information38, defined over a point
(more precisely, a density) p in a statistical manifold via the family’s expectation at
that point of the differentials of the log-likelihood ℓ = log(p) evaluated on the
tangent vectors in question:

〈 u | v 〉p = 𝔼p[ dp(ℓ)(u) ∙ dp(ℓ)(v) ]
Noting that each differential dp(ℓ) can be expressed in terms of its values on the
scores, the matrix for this inner product at any point p on the statistical manifold is
given by

gᵢ,ⱼ(p) = 〈 ∂/∂θⁱ(p) | ∂/∂θʲ(p) 〉p
       = 𝔼p[ dp(ℓ)( ∂/∂θⁱ(p) ) ∙ dp(ℓ)( ∂/∂θʲ(p) ) ]
       = 𝔼p[ (∂ℓ/∂θⁱ)(p) ∙ (∂ℓ/∂θʲ)(p) ]

where the last equality follows from the definition of the scores.
Although it should be clear that this proposed inner product satisfies the first two
requirements – linearity and symmetry – because they are passed down from the
properties of expectations, it is not clear at first glance that it should necessarily
satisfy the third rule – positive-definiteness. However, recalling that the definition of a
statistical manifold as proposed by Murray & Rice required that the scores be linearly
independent39, it follows that positive-definiteness is achieved.
As per the discussion in section 4.2 of this thesis, the existence of this Fisher
information metric is not guaranteed under the definition of a statistical manifold
given by Murray & Rice but is guaranteed under the definition given by Amari
& Nagaoka. In the present and subsequent chapters, it is assumed that this metric does
exist.
An interesting example given by Murray & Rice of this Riemannian metric is derived
from the Normal family’s statistical manifold40. They show that when parameterised as

𝒩 = { p(μ, ς) = (1/√(2πς²)) exp( −(z−μ)²/(2ς²) ) | μ ∊ ℝ, ς > 0 }

38 (5.10), “Theory of Point Estimation” – E Lehmann and G Casella, 1998.
39 §3.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
40 §6.2.2, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
the inner product at the point p(μ,ς) ∊ 𝒩, the probability density function of the
random variable Z, is defined as

〈 ∂/∂μ(p) | ∂/∂μ(p) 〉p = 𝔼p[ (Z−μ)²/ς⁴ ] = 1/ς²

〈 ∂/∂μ(p) | ∂/∂ς(p) 〉p = 𝔼p[ (Z−μ)³/ς⁵ − (Z−μ)/ς³ ] = 0

〈 ∂/∂ς(p) | ∂/∂ς(p) 〉p = 𝔼p[ ( (Z−μ)²/ς³ − 1/ς )² ] = 2/ς²
Although perhaps insignificant in itself, this in fact corresponds to the inner product
on the upper half plane under hyperbolic geometry. In the case of unit variance, ς = 1,
this is seen to reduce to standard Euclidean geometry.
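The diagonal form of this metric is easy to check empirically. The following minimal sketch (the sample point (μ,ς) = (1,2) is an arbitrary choice for the illustration) estimates the Fisher information metric by a Monte Carlo expectation of products of scores and compares it with the closed form diag(1/ς², 2/ς²) = diag(0.25, 0.5).

```python
import math, random

# Monte Carlo estimate of the Fisher metric of the Normal family at (mu, sigma),
# via E[score_i * score_j] over samples Z ~ N(mu, sigma^2).
mu, sigma = 1.0, 2.0
random.seed(0)

def scores(z):                       # partial derivatives of the log-likelihood
    d_mu = (z - mu) / sigma**2
    d_sigma = -1.0 / sigma + (z - mu)**2 / sigma**3
    return d_mu, d_sigma

N = 200_000
acc = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(N):
    s = scores(random.gauss(mu, sigma))
    for i in range(2):
        for j in range(2):
            acc[i][j] += s[i] * s[j] / N

print(acc)   # close to [[0.25, 0], [0, 0.5]] for sigma = 2
```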
6.2 – DIVERGENCES
An alternative and more direct way of defining distances on a manifold X is via a
divergence function 41 – a smooth function on the manifold D(∙‖∙): X×X→ ℝ which is
non-negative for every (x,y)∊X×X and zero only when x = y.
In contrast to a metric, a divergence need not be symmetric or satisfy any form of the
triangle relation and so has less stringent requirements but as a result can be
considered to lose some accuracy in measuring distances. However, divergences still
deliver useful information regarding manifolds and, when finding geodesics is out of
the question, are often a more tangible alternative.
One of the divergence functions presented by Amari & Nagaoka is the f-divergence,
defined on a statistical manifold P for a function f: ℝ → ℝ, which must be convex on
ℝ>0 and zero at z = 1, by the integral (or, in the discrete case, the analogous sum)
taken over the domain of the family

Df(p‖q) = ∫ p(z) ∙ f( q(z)/p(z) ) dz
The requirement that f(1) = 0 is explained by Jensen’s inequality since

Df(p‖q) ≥ f( ∫ p(z) ∙ ( q(z)/p(z) ) dz ) = f(1)

41 §3.2, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
and so ensures non-negativity of the divergence and vanishing of the divergence only
when p = q.
Amari & Nagaoka also note the interesting observation that this divergence exhibits a
form of convexity. Namely, for any constant λ∊[0,1] and points in the manifold
p1,p2,q1,q2∊P the following inequality holds:
Df( λ∙p₁ + (1−λ)∙p₂ ‖ λ∙q₁ + (1−λ)∙q₂ ) ≤ λ∙Df(p₁‖q₁) + (1−λ)∙Df(p₂‖q₂)
Another divergence introduced by Amari & Nagaoka is the α-divergence. As the name
suggests, this divergence is related to the α-connection defined earlier, as will be shown
in the following chapter of this thesis. For now, define a convex function on ℝ given
for some real number α by

f^(α)(x) = (4/(1−α²)) ( 1 − x^((1+α)/2) ) ,  α ≠ ±1
f^(α)(x) = x∙log(x) ,  α = 1
f^(α)(x) = −log(x) ,  α = −1

Then the α-divergence is defined as the f-divergence with respect to this function
(noting that f^(α) vanishes at x = 1 for any value of α) for a fixed value of α.
Essentially, this divergence is seen to reduce to two cases, depending on the given
value of α. When α ≠ ±1, the divergence of two elements p,q ∊ P is given by the
following integral over the sample space Ω of the family P (or the analogous sum, in
the case of a discrete family)

D^(α)(p‖q) = (4/(1−α²)) ( 1 − ∫Ω p(x)^((1−α)/2) q(x)^((1+α)/2) dx )
When α = ±1, under the same conditions the formula is given by

D^(−1)(p‖q) = D^(1)(q‖p) = ∫Ω p(x) log( p(x)/q(x) ) dx
Amari & Nagaoka note that this divergence has several interesting features. For one,
the relation between divergences D^(α)(p‖q) = D^(−α)(q‖p) holds for all real values of α
and all p,q ∊ P.
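This symmetry relation is easy to test numerically. The sketch below (an illustration with two arbitrarily chosen Normal densities; the trapezoid rule and integration limits are assumptions of the sketch) evaluates the defining integral directly for α and −α with the arguments swapped.

```python
import math

# Direct numerical check of D^(alpha)(p||q) = D^(-alpha)(q||p)
# for two Normal densities, via the trapezoid rule.
def normal_pdf(z, m, s):
    return math.exp(-(z - m)**2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)

def alpha_div(alpha, p, q, lo=-30.0, hi=30.0, n=20001):
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        total += w * p(z)**((1 - alpha) / 2) * q(z)**((1 + alpha) / 2) * h
    return 4.0 / (1 - alpha**2) * (1 - total)

p = lambda z: normal_pdf(z, 0.0, 1.0)
q = lambda z: normal_pdf(z, 1.0, 2.0)
print(alpha_div(0.5, p, q), alpha_div(-0.5, q, p))   # the two values agree
```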
Also, it is seen that the 0-divergence corresponds to the square of the so-called
Hellinger distance – a metric on probability spaces defined as the square root of

D^(0)(p‖q) = 2 ∫ ( √(p(x)) − √(q(x)) )² dx
Lastly, they note that the ±1-divergence corresponds to the oft-employed Kullback–
Leibler information42. Thus the α-divergence gives a convenient way of relating and
generalising these concepts but under the more formal framework of differential
geometry. It is not immediately apparent, however, if this serves any practical purpose
– an issue which is not directly addressed by Amari & Nagaoka.
6.3 – α-DIVERGENCE OF THE NORMAL FAMILY
To conclude this chapter the α-divergence (for α ≠ ±1) of the Normal family will be
derived directly and shown to exhibit a simple form.
Let p,q ∊ 𝒩 be any two elements of the statistical manifold of the Normal family
represented in (μ,ς) co-ordinates as

p = (1/√(2πς₁²)) exp( −(z−μ₁)²/(2ς₁²) )

q = (1/√(2πς₂²)) exp( −(z−μ₂)²/(2ς₂²) )
Then by the definition of the α-divergence (α ≠ ±1)

D^(α)(p‖q)
= (4/(1−α²)) ( 1 − ∫ℝ [ (1/√(2πς₁²)) exp( −(z−μ₁)²/(2ς₁²) ) ]^((1−α)/2) [ (1/√(2πς₂²)) exp( −(z−μ₂)²/(2ς₂²) ) ]^((1+α)/2) dz )

= (4/(1−α²)) ( 1 − ( ς₁^((α−1)/2) / ς₂^((α+1)/2) ) ∫ℝ (1/√(2π)) exp( −(z−μ₁)²(1−α)/(4ς₁²) − (z−μ₂)²(1+α)/(4ς₂²) ) dz )

= (4/(1−α²)) ( 1 − ( ς₁^((α−1)/2) / ς₂^((α+1)/2) ) ∫ℝ (1/√(2π)) exp( −[ (1−α)/(4ς₁²) + (1+α)/(4ς₂²) ]z² + [ μ₁(1−α)/(2ς₁²) + μ₂(1+α)/(2ς₂²) ]z − μ₁²(1−α)/(4ς₁²) − μ₂²(1+α)/(4ς₂²) ) dz )
Define the following variables (really, functions of the co-ordinates and α which are to
be considered as constants) for convenience of notation

a = ς₁^((α−1)/2) / ς₂^((α+1)/2)

b = (1−α)/(4ς₁²) + (1+α)/(4ς₂²)

c = (1/(2b)) ( μ₁(1−α)/(2ς₁²) + μ₂(1+α)/(2ς₂²) )

d = μ₁²(1−α)/(4ς₁²) + μ₂²(1+α)/(4ς₂²)
42 §1.3, “Information theory and statistics” – S Kullback, 1997.
The divergence function then simplifies to

D^(α)(p‖q) = (4/(1−α²)) ( 1 − a ∫ℝ (1/√(2π)) exp( −b∙z² + 2bc∙z − d ) dz )

= (4/(1−α²)) ( 1 − a∙exp(bc² − d) ∫ℝ (1/√(2π)) exp( −b∙z² + 2bc∙z − bc² ) dz )

= (4/(1−α²)) ( 1 − a∙exp(bc² − d) ∫ℝ (1/√(2π)) exp( −b(z − c)² ) dz )
Now, setting w = √(2b)(z − c), so that √(2b)∙dz = dw, gives the following well-known
integral (solvable via standard techniques of analysis, or from the fact that the
integrand represents the density function of a Normal random variable with zero mean
and unit variance), which is equal to 1, thus giving a closed form solution of the
α-divergence of the Normal family:

D^(α)(p‖q) = (4/(1−α²)) ( 1 − a∙exp(bc² − d) (1/√(2b)) ∫ℝ (1/√(2π)) exp(−w²/2) dw )

= (4/(1−α²)) ( 1 − (a/√(2b)) exp(bc² − d) )
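The closed form above can be cross-checked against direct quadrature. The sketch below (an illustration; the parameter values, integration limits and trapezoid rule are assumptions of the sketch) evaluates both expressions for the same pair of Normal densities.

```python
import math

# Closed-form alpha-divergence of the Normal family,
#   D^(alpha)(p||q) = 4/(1-alpha^2) * (1 - a*exp(b*c^2 - d)/sqrt(2b)),
# cross-checked against direct numerical integration.
def alpha_div_closed(alpha, m1, s1, m2, s2):
    a = s1**((alpha - 1) / 2) / s2**((alpha + 1) / 2)
    b = (1 - alpha) / (4 * s1**2) + (1 + alpha) / (4 * s2**2)
    c = (m1 * (1 - alpha) / (2 * s1**2) + m2 * (1 + alpha) / (2 * s2**2)) / (2 * b)
    d = m1**2 * (1 - alpha) / (4 * s1**2) + m2**2 * (1 + alpha) / (4 * s2**2)
    return 4 / (1 - alpha**2) * (1 - a * math.exp(b * c**2 - d) / math.sqrt(2 * b))

def alpha_div_numeric(alpha, m1, s1, m2, s2, lo=-40.0, hi=40.0, n=40001):
    h = (hi - lo) / (n - 1)
    pdf = lambda z, m, s: math.exp(-(z - m)**2 / (2 * s * s)) / math.sqrt(2 * math.pi * s * s)
    tot = sum((0.5 if i in (0, n - 1) else 1.0)
              * pdf(lo + i * h, m1, s1)**((1 - alpha) / 2)
              * pdf(lo + i * h, m2, s2)**((1 + alpha) / 2) * h
              for i in range(n))
    return 4 / (1 - alpha**2) * (1 - tot)

print(alpha_div_closed(0.3, 0.0, 1.0, 1.0, 1.5),
      alpha_div_numeric(0.3, 0.0, 1.0, 1.0, 1.5))
```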
For α = ±1, the α-divergence is defined by the Kullback–Leibler information:

D^(−1)(p‖q) = D^(1)(q‖p) = ∫ℝ p(x) log( p(x)/q(x) ) dx

= ∫ℝ (1/√(2πς₁²)) exp( −(z−μ₁)²/(2ς₁²) ) log[ (ς₂/ς₁) exp( −(z−μ₁)²/(2ς₁²) + (z−μ₂)²/(2ς₂²) ) ] dz

= log(ς₂/ς₁) + ∫ℝ (1/√(2πς₁²)) exp( −(z−μ₁)²/(2ς₁²) ) ( [ z²(ς₁²−ς₂²) + 2z(μ₁ς₂²−μ₂ς₁²) + μ₂²ς₁² − μ₁²ς₂² ] / (2ς₁²ς₂²) ) dz

= log(ς₂/ς₁) + [ (ς₁²−ς₂²)(μ₁²+ς₁²) + 2μ₁(μ₁ς₂²−μ₂ς₁²) + μ₂²ς₁² − μ₁²ς₂² ] / (2ς₁²ς₂²)

= log(ς₂/ς₁) + (μ₁−μ₂)²/(2ς₂²) − 1/2 + ς₁²/(2ς₂²)
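The final expression can likewise be verified by quadrature. The sketch below (parameter values, limits and the trapezoid rule are assumptions of the illustration) compares the closed form above with a direct numerical evaluation of the Kullback–Leibler integral.

```python
import math

# Closed-form KL divergence between two Normal densities, checked numerically.
def kl_closed(m1, s1, m2, s2):
    return math.log(s2 / s1) + ((m1 - m2)**2 + s1**2) / (2 * s2**2) - 0.5

def kl_numeric(m1, s1, m2, s2, lo=-40.0, hi=40.0, n=40001):
    h = (hi - lo) / (n - 1)
    def logpdf(z, m, s):   # work in log space to avoid log(0) at the tails
        return -(z - m)**2 / (2 * s * s) - 0.5 * math.log(2 * math.pi * s * s)
    tot = 0.0
    for i in range(n):
        z = lo + i * h
        w = 0.5 if i in (0, n - 1) else 1.0
        lp, lq = logpdf(z, m1, s1), logpdf(z, m2, s2)
        tot += w * math.exp(lp) * (lp - lq) * h
    return tot

print(kl_closed(0.0, 1.0, 1.0, 2.0), kl_numeric(0.0, 1.0, 1.0, 2.0))
```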
CHAPTER 7 – DUALITY
A concept common to many areas of mathematics is that of duality – generally
speaking, a correspondence between related objects that provides an additional layer
of interaction between the two. Information geometry is no exception and exhibits
varied notions of duality centred about the concept of connections.
7.1 – DUAL CONNECTIONS
Amari & Nagaoka define duality of connections as follows 43. Given a statistical
manifold P with a Riemannian metric 〈 · | · 〉p :TpP×TpP → ℝ, two (affine) connections
∇ and ∇* are considered to be dual if for any vector fields U, V and W over P the action
of the metric at any point p in the statistical manifold P yields
W U V
p
= ∇U W) V
p
+ U ∇ ∗ V W)
p
At first glance this notation can be somewhat confusing. The left hand side of this
relation is interpreted as follows: given a point p ∊ P, the vector field W evaluated at
p yields a tangent vector, say

W(p) = w¹(p) ∂/∂θ¹(p) + ⋯ + wⁿ(p) ∂/∂θⁿ(p)

then by defining 〈 U(p) | V(p) 〉p = ψ(p) to be a function of p alone it is possible to
take the partial derivatives of ψ as dictated by W(p)

w¹(p) ∂ψ/∂θ¹(p) + ⋯ + wⁿ(p) ∂ψ/∂θⁿ(p) ≕ W〈 U | V 〉p

thus resulting in a real-valued function on the manifold P.
The right hand side is more straightforward since at any point in the manifold the
action of ∇U and ∇*V on W produces tangent vectors, and so it is just a matter of
evaluating the Riemannian metric on the resultant tangent vectors.
Now, if Γʲᵢ,ₖ and Γ*ʲᵢ,ₖ are the Christoffel symbols of two connections and gᵢ,ⱼ are the
coefficients which define the inner product

gᵢ,ⱼ(p) = 〈 ∂/∂θⁱ(p) | ∂/∂θʲ(p) 〉p

then these connections are dual if the following symbolic relation holds

∂gᵢ,ⱼ/∂θᵏ (p) = Σⁿₐ₌₁ [ gₐ,ⱼ(p) Γᵃᵢ,ₖ(p) + gᵢ,ₐ(p) Γ*ᵃⱼ,ₖ(p) ]

43 §3.1, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
To see why this relation must hold, consider the most basic form of vector fields –
elements of the basis of the tangent bundle for a given co-ordinate system

U = ∂/∂θⁱ ,  V = ∂/∂θʲ ,  W = ∂/∂θᵏ

By the definition of the gᵢ,ⱼ and the Christoffel symbols it follows that

W〈 U | V 〉p = ∂/∂θᵏ 〈 ∂/∂θⁱ(p) | ∂/∂θʲ(p) 〉p = ∂gᵢ,ⱼ/∂θᵏ (p)

〈 ∇U(W) | V 〉p = 〈 ∇∂/∂θⁱ( ∂/∂θᵏ ) | ∂/∂θʲ 〉p
              = 〈 Σⁿₐ₌₁ Γᵃᵢ,ₖ ∂/∂θᵃ | ∂/∂θʲ 〉p
              = Σⁿₐ₌₁ Γᵃᵢ,ₖ 〈 ∂/∂θᵃ | ∂/∂θʲ 〉p
              = Σⁿₐ₌₁ Γᵃᵢ,ₖ gₐ,ⱼ(p)

〈 U | ∇*V(W) 〉p = 〈 ∂/∂θⁱ | ∇*∂/∂θʲ( ∂/∂θᵏ ) 〉p
               = 〈 ∂/∂θⁱ | Σⁿₐ₌₁ Γ*ᵃⱼ,ₖ ∂/∂θᵃ 〉p
               = Σⁿₐ₌₁ Γ*ᵃⱼ,ₖ 〈 ∂/∂θⁱ | ∂/∂θᵃ 〉p
               = Σⁿₐ₌₁ Γ*ᵃⱼ,ₖ gᵢ,ₐ(p)

and so the desired relation follows from this, the linearity of the inner product and the
decomposition of vector fields in terms of the bases for the tangent bundle.
As an example of dual connections, it turns out that the α-connections defined
previously have a very simple dual structure. Namely, Amari & Nagaoka show that the
connections ∇^(α) and ∇^(−α) respect the dual relation defined above44 when the metric is
the Fisher information metric.
Instead of utilising Christoffel symbols, Amari & Nagaoka define the α-connections via
the functions45

𝔼p[ ( ∂²ℓp/∂θⁱ∂θʲ + ((1−α)/2)(∂ℓp/∂θⁱ)(∂ℓp/∂θʲ) ) ∂ℓp/∂θᵏ ] = Γ^(α)ᵢⱼ,ₖ(p) ≕ 〈 ∇^(α)∂/∂θⁱ( ∂/∂θʲ ) | ∂/∂θᵏ 〉p

and note that they satisfy the equality

Γ^(α)ᵢⱼ,ₖ(p) = ∫ℝⁿ ∂ᵢ∂ⱼℓp^(α) ∙ ∂ₖℓp^(−α) dx

44 Theorem 3.1, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
45 §2.3, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
where the functions in the integrand are defined as

∂ₖℓp^(α)(x) = p(x)^((1−α)/2) ∂ℓp/∂θᵏ(x)

∂ᵢ∂ⱼℓp^(α)(x) = p(x)^((1−α)/2) ( ∂²ℓp/∂θⁱ∂θʲ(x) + ((1−α)/2)(∂ℓp/∂θⁱ(x))(∂ℓp/∂θʲ(x)) )
Since the Fisher information metric is symmetric and has the representation

gᵢ,ⱼ(p) = gⱼ,ᵢ(p) = 𝔼p[ (∂ℓ/∂θⁱ)(p)(∂ℓ/∂θʲ)(p) ] = ∫ℝⁿ ∂ᵢℓp^(α) ∙ ∂ⱼℓp^(−α) dx

it follows (recalling the assumption in the definition of a statistical manifold by Amari
& Nagaoka that one may differentiate with respect to the θᵏ under the integral sign)
that
∂gᵢ,ⱼ/∂θᵏ (p) = ∫ℝⁿ [ ∂ₖ∂ᵢℓp^(α) ∙ ∂ⱼℓp^(−α) + ∂ᵢℓp^(α) ∙ ∂ₖ∂ⱼℓp^(−α) ] dx

= Γ^(α)ₖᵢ,ⱼ(p) + Γ^(−α)ₖⱼ,ᵢ(p)

= 〈 ∇^(α)∂/∂θᵏ( ∂/∂θⁱ ) | ∂/∂θʲ 〉p + 〈 ∂/∂θⁱ | ∇^(−α)∂/∂θᵏ( ∂/∂θʲ ) 〉p

and so the conditions for duality are satisfied.
An almost immediate consequence of this duality is that dual connections share the
same sense of flatness46 – if a statistical manifold is flat with respect to ∇ then it is also
flat with respect to ∇*. Combined with the previously noted duality of the connections
∇^(α) and ∇^(−α), this in turn shows that α-flat families are also (−α)-flat.
Another implication of this duality between the α-connections is that ∇^(0) is self-dual.
Such a connection is referred to as being a Riemannian connection47 or a Levi-Civita
connection48 and is unique for a given Riemannian metric.
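The duality of the ±α-connections can be spot-checked concretely for the Normal family, using the Fisher metric g = diag(1/ς², 2/ς²) and the α-connection Christoffel symbols quoted in chapter 5. The sketch below (the sample point ς = 1.7 and α = 0.4 are arbitrary choices) verifies the symbolic relation of section 7.1 for every index combination.

```python
# Numerical check of  d g_ij / d theta^k = sum_a ( g_aj G^a_ik + g_ia G*^a_jk )
# for the Normal family, with (theta^1, theta^2) = (mu, sigma); the metric
# depends on sigma only, so only k = sigma gives a non-zero derivative.
alpha, s, h = 0.4, 1.7, 1e-6

def g(s):                         # Fisher metric of the Normal family
    return [[1 / s**2, 0.0], [0.0, 2 / s**2]]

def gamma(a, s):                  # non-zero Christoffel symbols G[c][i][k]
    G = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
    G[0][0][1] = G[0][1][0] = -(a + 1) / s
    G[1][0][0] = (1 - a) / (2 * s)
    G[1][1][1] = -(2 * a + 1) / s
    return G

G, Gs = gamma(alpha, s), gamma(-alpha, s)
for i in range(2):
    for j in range(2):
        for k in range(2):
            dg = (g(s + h)[i][j] - g(s - h)[i][j]) / (2 * h) if k == 1 else 0.0
            rhs = sum(g(s)[a][j] * G[a][i][k] + g(s)[i][a] * Gs[a][j][k]
                      for a in range(2))
            assert abs(dg - rhs) < 1e-4, (i, j, k)
print("duality relation holds at sigma =", s)
```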
7.2 – PARALLEL TRANSLATION
Given a vector in ℝⁿ with a base-point there is a natural method of translating this
vector to another point in ℝⁿ such that it remains constant according to the Euclidean
connection along the path – translation by the straight line connecting the two points.
A generalisation of this is the notion of parallel translation on a manifold.

46 §3.3, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
47 §1.10, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
48 §6.3, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
Specifically, suppose that a connection ∇ is defined on a manifold X. Then, given two
points x,y ∊ X and a simple path (one that is injective and has a non-vanishing
derivative) γ: [0,1] → X connecting these points, one seeks a way to construct a vector
field Πv(x) for any tangent vector v ∊ Tγ(0)X such that

Πv(γ(0)) = v  and  ∇Πv(γ′(t)) = 0

for any t ∊ [0,1]. The vector field Πv is called the parallel translation of v along γ49.
What this concept attempts to emulate is the idea of translating (tangent) vectors
across a manifold along “straight lines” as defined by the connection ∇. What such a
translation gives is a tangent vector based at another point in the manifold and an
expression dictating how it arrives there.
Although it is simple enough to verify whether or not these conditions are satisfied for
a given vector field, it is not so apparent from the definition how to construct such a
vector field, or even that such a vector field exists. However, Murray & Rice
show that this amounts only to solving a system of linear first-order differential
equations as follows50.
Assume that (θ¹, … , θⁿ) is a co-ordinate system on a neighbourhood containing the
path γ: [0,1] → X. Then, using the basis for the tangent space induced by these
co-ordinates, one may express the initial vector v ∊ Tγ(0)X and the vector field Πv(γ(t))
as

v = v¹ ∂/∂θ¹(γ(0)) + ⋯ + vⁿ ∂/∂θⁿ(γ(0))

Πv(γ(t)) = f¹(t) ∂/∂θ¹(γ(t)) + ⋯ + fⁿ(t) ∂/∂θⁿ(γ(t))

where the v¹, … , vⁿ are real-valued constants and the f¹, … , fⁿ are real-valued
functions on [0,1].
Now, the third axiom of connections gives the first equality in the expression below for
the action of a connection on the vector field Πv(x) by using the above representation.
49 §1.6, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
50 §5.1.1, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
∇Πv(γ′(t)) = Σⁿᵢ₌₁ [ dfⁱ(t) ∂/∂θⁱ(γ(t)) + fⁱ(t) ∇(∂/∂θⁱ)(γ′(t)) ]

= Σⁿᵢ₌₁ [ dfⁱ(t) ∂/∂θⁱ(γ(t)) + fⁱ(t) Σⁿⱼ,ₖ₌₁ Γᵏᵢ,ⱼ dθʲ(γ′(t)) ∂/∂θᵏ(γ(t)) ]

= Σⁿᵢ₌₁ [ dfⁱ(t) + Σⁿⱼ,ₖ₌₁ fᵏ(t) Γⁱₖ,ⱼ dθʲ(γ′(t)) ] ∂/∂θⁱ(γ(t))
The second equality here comes from the definition of the Christoffel symbols of a
connection and the third from reordering the summations.
Therefore, in compact notation for the expressions above, the system of differential
equations for the parallel transport becomes

dfⁱ + Σⁿₖ,ⱼ₌₁ fᵏ Γⁱₖ,ⱼ dθʲ = 0

for each i = 1, … , n with initial conditions given by

fⁱ(0) = vⁱ

for each i = 1, … , n.
A noteworthy property of parallel translations is in relation to dual connections. If Π(x)
is a parallel translation with respect to the connection ∇ and Σ(x) is parallel with
respect to the dual connection ∇*, where each translation is defined over the same
curve γ, then the definition of duality gives for any vector field W(x)

W〈 Π | Σ 〉x = 〈 ∇Π(W) | Σ 〉x + 〈 Π | ∇*Σ(W) 〉x = 〈 0 | Σ 〉x + 〈 Π | 0 〉x = 0
In particular, this shows that the rate of change of 〈 Π | Σ 〉x along the path γ is zero
and so must be constant. That is to say, inner products are preserved in this situation
when subject to a dual pair of parallel translations51.
If ∇ = ∇* is a Riemannian connection then this relation implies that for parallel
translations over the path γ: [0,1] → X the inner product of any two vectors
u,v ∊ Tγ(0)X is constant over γ:

〈 u | v 〉γ(0) = 〈 Πu(γ(t)) | Πv(γ(t)) 〉γ(t)  for all t ∊ [0,1]
For an illustrative example of the above discussion, return once more to the Normal
family’s statistical manifold

51 Theorem 3.2, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
𝒩 = { p(z; μ, ς) = (1/√(2πς²)) exp( −(z−μ)²/(2ς²) ) | μ ∊ ℝ, ς > 0 }
with the Christoffel symbols for the α-connection that are not zero:

Γ¹₁,₂^(α) = −(α+1)/ς

Γ²₁,₁^(α) = (1−α)/(2ς)

Γ²₂,₂^(α) = −(2α+1)/ς
The system of equations for a parallel translation thus reduces to

df¹ + Γ¹₁,₂^(α) f¹ dθ² = df¹ − ((α+1)/ς) f¹ dθ² = 0

df² + Γ²₁,₁^(α) f¹ dθ¹ + Γ²₂,₂^(α) f² dθ² = df² + ((1−α)/(2ς)) f¹ dθ¹ − ((2α+1)/ς) f² dθ² = 0
Define a simple path from (μ₁,ς₁) to (μ₂,ς₂) in 𝒩 for t ∊ [0,1] by

θ(γ(t)) = (1−t)(μ₁, ς₁) + t(μ₂, ς₂)

so that the values of the co-ordinate differentials on γ′ are

dθ¹(γ′(t)) = μ₂ − μ₁

dθ²(γ′(t)) = ς₂ − ς₁
In this case, then, the differential equation for f¹ is simply

df¹ − ( (α+1)/((1−t)ς₁ + tς₂) ) (ς₂ − ς₁) f¹ = 0

which is solved in general by

f¹(t) = c₁ ( (ς₂−ς₁)t + ς₁ )^(α+1)

From the initial condition that f¹(0) = v¹, the specific solution is thus

f¹(t) = v¹ ( ( (ς₂−ς₁)t + ς₁ ) / ς₁ )^(α+1)
Equipped with this solution for f¹, the second differential equation becomes

df² + ( v¹(1−α)/(2ς₁^(α+1)) ) (μ₂−μ₁) ( (ς₂−ς₁)t + ς₁ )^α − ( (2α+1)/((ς₂−ς₁)t + ς₁) ) (ς₂−ς₁) f² = 0
which can be solved by computer algebra software to yield the general solution

f²(t) = ( ς₁ + (ς₂−ς₁)t )^(2α+1) ( c₂ − ( v¹(1−α)(μ₂−μ₁)/(2ς₁^(α+1)) ) ∫₀ᵗ ( ς₁ + (ς₂−ς₁)s )^(−1−α) ds )

= ( ς₁ + (ς₂−ς₁)t )^(2α+1) ( c₂ + ( v¹(1−α)(μ₂−μ₁)/(2ας₁^(α+1)(ς₂−ς₁)) ) [ (ς₁ + (ς₂−ς₁)t)^(−α) − ς₁^(−α) ] ) ,  α ≠ 0

= ( ς₁ + (ς₂−ς₁)t ) ( c₂ − ( v¹(μ₂−μ₁)/(2ς₁(ς₂−ς₁)) ) log( (ς₁ + (ς₂−ς₁)t)/ς₁ ) ) ,  α = 0

The initial condition that f²(0) = v² then fixes the remaining constant c₂.
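The transport system stated above is also easy to solve numerically, which gives an independent check on the closed-form f¹. The sketch below (endpoints, α and the initial vector are arbitrary choices of the illustration) integrates the two coupled equations with a small Runge–Kutta loop along the straight co-ordinate path.

```python
# Numerical integration of the parallel-transport equations for the Normal
# family along the straight path from (mu1, sigma1) to (mu2, sigma2),
# compared against the closed form  f1(t) = v1*(sigma(t)/sigma1)**(alpha+1).
alpha = 0.5
mu1, s1, mu2, s2 = 0.0, 1.0, 2.0, 3.0
v1, v2 = 1.0, 1.0

def sig(t):
    return (1 - t) * s1 + t * s2

def rhs(t, f1, f2):
    df1 = (alpha + 1) / sig(t) * (s2 - s1) * f1
    df2 = (-(1 - alpha) / (2 * sig(t)) * (mu2 - mu1) * f1
           + (2 * alpha + 1) / sig(t) * (s2 - s1) * f2)
    return df1, df2

f1, f2, t, h = v1, v2, 0.0, 1e-4          # RK4 over t in [0, 1]
for _ in range(10000):
    k1 = rhs(t, f1, f2)
    k2 = rhs(t + h/2, f1 + h/2*k1[0], f2 + h/2*k1[1])
    k3 = rhs(t + h/2, f1 + h/2*k2[0], f2 + h/2*k2[1])
    k4 = rhs(t + h, f1 + h*k3[0], f2 + h*k3[1])
    f1 += h/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
    f2 += h/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
    t += h

f1_closed = v1 * (sig(1.0) / s1) ** (alpha + 1)
print(f1, f1_closed)    # the two values agree closely
```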
7.3 – CANONICAL DIVERGENCE
Although divergence functions are arbitrary, in general, Amari & Nagaoka show that
for a certain class of statistical manifolds there exists a canonical divergence – one
defined by the manifold itself which possesses several worthwhile properties 52.
The conditions for the existence of the canonical divergence are that a Riemannian
manifold (X, 〈 · | · 〉x) be a dually flat space, i.e. flat with respect to dual connections ∇
and ∇*. From this one may assume that there exist co-ordinates θ¹, … , θⁿ and
η₁, … , ηₙ for which the connections are flat, that is

∇∂/∂θⁱ( ∂/∂θʲ ) = 0  for all i, j = 1, … , n

∇*∂/∂ηᵢ( ∂/∂ηⱼ ) = 0  for all i, j = 1, … , n
In fact, it is shown that, given the co-ordinates θ¹, … , θⁿ, by using affine
transformations of the co-ordinates η₁, … , ηₙ one may assume that

〈 ∂/∂θⁱ(x) | ∂/∂ηⱼ(x) 〉x = δᵢ,ⱼ

for all x ∊ X and for all i, j = 1, … , n, where δᵢ,ⱼ is the Kronecker delta function.
For this pair of co-ordinates, define two functions φ,ψ : X → ℝ, called potentials, by
the systems of partial differential equations

∂ψ/∂θⁱ(x) = ηᵢ(x) ,  i = 1, … , n

∂φ/∂ηᵢ(x) = θⁱ(x) ,  i = 1, … , n
or equivalently by the relations

∂²ψ/∂θⁱ∂θʲ(x) = 〈 ∂/∂θⁱ | ∂/∂θʲ 〉x ,  i, j = 1, … , n

∂²φ/∂ηᵢ∂ηⱼ(x) = 〈 ∂/∂ηᵢ | ∂/∂ηⱼ 〉x ,  i, j = 1, … , n

52 §3.4, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
then the canonical divergence for this dually flat manifold is given by

D(x‖y) = ψ(x) + φ(y) − Σⁿᵢ₌₁ θⁱ(x) ηᵢ(y)
That this function actually possesses the properties required of a divergence follows
from the observation given by Amari & Nagaoka that the functions φ and ψ can be
expressed as

ψ(x) = max over y∊X of [ Σⁿᵢ₌₁ θⁱ(x) ηᵢ(y) − φ(y) ]

φ(y) = max over x∊X of [ Σⁿᵢ₌₁ θⁱ(x) ηᵢ(y) − ψ(x) ]
and from the fact that the relations

∂²ψ/∂θⁱ∂θʲ(x) = 〈 ∂/∂θⁱ | ∂/∂θʲ 〉x ≥ 0

∂²φ/∂ηᵢ∂ηⱼ(x) = 〈 ∂/∂ηᵢ | ∂/∂ηⱼ 〉x ≥ 0

imply that both φ and ψ are convex functions.
As a particular case of a canonical divergence, Amari & Nagaoka note that in the case
of a Riemannian connection, where ∇ = ∇* is self-dual and thus the co-ordinate systems
are self-dual, the potentials are

ψ(x) = φ(x) = (1/2) Σⁿᵢ₌₁ θⁱ(x)²

and so the divergence function

D(x‖y) = ψ(x) + φ(y) − Σⁿᵢ₌₁ θⁱ(x)θⁱ(y) = (1/2) Σⁿᵢ₌₁ ( θⁱ(x) − θⁱ(y) )²

corresponds to one half of the square of the standard Euclidean distance.
Perhaps one of the most useful properties of the canonical divergence given by a
theorem53 of Amari & Nagaoka is their Pythagorean relation:
Suppose x,y,z are three points in the dually flat manifold X and that
γ₁: [0,1] → X is the geodesic with respect to ∇ connecting x = γ₁(0) and
y = γ₁(1) and γ₂: [0,1] → X is the geodesic with respect to ∇* connecting
y = γ₂(0) and z = γ₂(1). If the curves γ₁ and γ₂ are orthogonal at the point y,
that is 〈 γ₁′(1) | γ₂′(0) 〉y = 0, then the following triangle relation holds for the
canonical divergence:

D(x‖z) = D(x‖y) + D(y‖z)

53 Theorem 3.8, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
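In the self-dual Euclidean case noted above this relation is just the classical Pythagoras theorem, which the following minimal sketch illustrates (the three points are arbitrary choices satisfying the orthogonality hypothesis).

```python
# Pythagorean relation for the self-dual case: the canonical divergence is
# half the squared Euclidean distance and both geodesic families are lines.
def D(x, y):
    return 0.5 * sum((a - b)**2 for a, b in zip(x, y))

x, y, z = (0.0, 0.0), (3.0, 0.0), (3.0, 4.0)
# gamma1 runs from x to y, gamma2 from y to z; their tangents at y are
# (y - x) and (z - y), which are orthogonal for these points.
dot = sum((a - b) * (c - b) for a, b, c in zip(x, y, z))
print(dot, D(x, z), D(x, y) + D(y, z))   # 0.0 12.5 12.5
```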
From this theorem they derive a corollary which may find applications in
statistical estimation when one wishes to assume that true distributions form a
sub-manifold of the dually flat manifold Y.
The corollary states that if X ⊂ Y is a ∇*-autoparallel sub-manifold of a dually flat
manifold Y (i.e. when restricted to X, ∇* still defines a valid connection) then the
canonical divergence D(y‖∙): X → ℝ between a given point y ∊ Y and the sub-manifold
X is minimised by x ∊ X if and only if the geodesic between x and y is orthogonal at x to
every path in X through x (in this case, such a geodesic is said to be orthogonal to X).
The point x ∊ X that minimises the canonical divergence as above is said to be the
∇-projection of y onto X. This will be shown later to be of use when projecting onto
sub-manifolds in the course of statistical estimations. Essentially, the divergence
function is being used to find the closest point in X to any observation found outside of
X in Y, projecting the observation in a meaningful way onto X54.
Although theoretically a worthwhile concept, in practice one must realise
that this method of divergence minimisation depends on knowledge of several other
constructs which can be complicated to calculate, if not unattainable explicitly. For
example, it has been shown that the α-divergences of the Normal statistical manifold
can be defined exactly but that the geodesics on the Normal statistical manifold have a
closed form only for α = 0.
7.4 – DUALISTIC STRUCTURE OF EXPONENTIAL FAMILIES
Whilst the previous discussion of dually flat manifolds included some promising results,
it also depended on fairly detailed requirements that a manifold must satisfy. For these
results to be of any use, then, they must be applicable to at least some non-exotic
probability spaces. Fortunately, it can be demonstrated that the results apply to a very
large and widely used class of manifolds – that of exponential families55.
Suppose that such a family consists of probability density functions of the form

p(z; θ) = exp( C(z) + Σⁿᵢ₌₁ θⁱyⁱ(z) − ψ(θ) )
then Amari & Nagaoka define (expectation) co-ordinates (η₁, … , ηₙ) on its statistical
manifold by

ηᵢ( p(z; θ) ) = 𝔼θ[ yⁱ(Z) ] = ∫ yⁱ(z) ∙ p(z; θ) dz

54 §6.5.1, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
55 §3.5, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
and note that their potential is just ψ(θ) since

0 = ∫ ∂/∂θⁱ p(z; θ) dz

= ∫ ∂/∂θⁱ exp( C(z) + Σⁿᵢ₌₁ θⁱyⁱ(z) − ψ(θ) ) dz

= ∫ ( yⁱ(z) − ∂ψ(θ)/∂θⁱ ) p(z; θ) dz

= ηᵢ( p(z; θ) ) − ∂ψ(θ)/∂θⁱ
Although little justification is presented by Amari & Nagaoka as to how these are
proper co-ordinates, it follows from the fact that the maximum likelihood estimator
(MLE) M: Ω → ℝⁿ of an exponential family ℰ is bijective and satisfies the equality

M⁻¹(p) = ( ∂ψ(θ)/∂θ¹ , … , ∂ψ(θ)/∂θⁿ )

for any probability density function p ∊ ℰ, and so it is possible to recover the canonical
co-ordinates from the expectation co-ordinates in the case of exponential families56.
It is further noted that the expectation co-ordinates satisfy the requirements of dual
co-ordinates to the canonical co-ordinates (θ¹, … , θⁿ) since

gᵢ,ⱼ(p) = −𝔼p[ ∂²ℓ/∂θⁱ∂θʲ ]

= −𝔼p[ ∂²/∂θⁱ∂θʲ ( C(z) + Σⁿₐ₌₁ θᵃyᵃ(z) − ψ(θ) ) ]

= ∂²ψ(θ)/∂θⁱ∂θʲ

which Amari & Nagaoka show is equivalent to the requirement that

〈 ∂/∂θⁱ | ∂/∂ηⱼ 〉 = δᵢ,ⱼ
Now, as previously discussed, under canonical co-ordinates any exponential family is
flat with respect to the 1–connection, the dual of which is the (-1)–connection.
Therefore, the co-ordinate duality of the expectation co-ordinates and the canonical
co-ordinates implies that any exponential family is also flat with respect to the
(-1)–connection under expectation co-ordinates. Hence, any exponential family is
dually flat.
56 §7.3, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
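The relation ηᵢ = ∂ψ/∂θⁱ can be illustrated with a concrete one-parameter exponential family. The sketch below uses the Poisson family written in exponential form, p(z; θ) = exp( zθ − eᶿ − log(z!) ), so that ψ(θ) = eᶿ; the particular value of θ is an arbitrary choice of the illustration.

```python
import math

# Poisson family in exponential form: psi(theta) = exp(theta), y(z) = z.
# The expectation co-ordinate eta = E[y(Z)] = E[Z] should equal d psi / d theta.
theta = 0.7
psi = lambda t: math.exp(t)

# eta via the defining sum E[Z] = sum z * p(z; theta)  (tail truncated)
eta = sum(z * math.exp(z * theta - psi(theta) - math.lgamma(z + 1))
          for z in range(200))

h = 1e-6
dpsi = (psi(theta + h) - psi(theta - h)) / (2 * h)   # d psi / d theta
print(eta, dpsi)             # both approximately e^0.7
```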
7.5 – DIFFERENTIAL DUALITY
An alternative concept of duality is introduced by Murray & Rice which relates tangent
vectors and differentials. Since tangent spaces are vector spaces and differentials are
linear functions on tangent spaces, the concept of duality that they propose is similar
to the more common concept of vector space duality but is realised through
connections57.
Firstly, it is necessary to establish exactly what a connection on the space of
differentials (called the co-tangent bundle and denoted henceforth by Tx*X) should be.
In line with connections on tangent spaces, Murray & Rice require that such a
connection should satisfy the following three properties for differentials defined over
each point in a manifold:
1. For any differential ωx, ∇ωx: TxX → Tx*X is a linear function on the tangent
space for every x∊X ;
2. For any two differentials ωx and νx, ∇(ωx + νx) = ∇ωx + ∇νx ; and
3. For any differential ωx and differentiable function f: X → ℝ,
∇( f(x)∙ωx )(γ′) = df(γ′)ωx + f(x)∇ωx(γ′) for all γ′∊TxX and all x∊X.
As with connections on the tangent bundle of a manifold X, given a co-ordinate system
(θ¹, … , θⁿ) on X it is possible to expand a differential ωx ∊ Tx*X as
ωx = ω₁(x)dθ¹ + … + ωₙ(x)dθⁿ for real-valued functions ω₁, … , ωₙ on X. Thus the
action of a connection on ωx can be expanded by the properties of connections on the
co-tangent bundle as

∇ωx(γ′) = Σⁿᵢ₌₁ [ dωᵢ(γ′)dθⁱ(x) + ωᵢ(x)∇dθⁱ(γ′) ]
Therefore, since the differentials dω1 through dωn are already well-defined,
connections over co-tangent bundles are defined by their values on the differentials
dθⁱ. Since ∇dθⁱₓ(γ′) is by definition a differential for any tangent vector γ′∊TxX, and so
expandable in terms of the dθⁱₓ, writing
∇dθⁱₓ(γ′) = Σ_{j=1}^{n} Aⁱj(γ′)dθʲₓ
results in the representation of such connections via the Christoffel symbols
Aⁱj(γ′) = Σ_{k=1}^{n} Γⁱj,k dθᵏₓ(γ′)
which follows from the linearity of ∇ωx on tangent vectors.
57 §4.7, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
The Christoffel symbols themselves can either be arbitrary constants for each point
x∊X, perhaps satisfying some regularity conditions if a particular type of connection is
desired, or else defined from a given connection by the equality
Γⁱj,k = ∇dθⁱₓ(∂/∂θʲ)(∂/∂θᵏ)
That is, one evaluates ∇dθix on the jth co-ordinate derivative to form a differential and
then applies said differential to the kth co-ordinate derivative.
To see why this is, let γ’ and ς’ be any two tangent vectors in TxX. Expand these
tangent vectors in terms of the co-ordinate basis as
γ′ = γ¹ ∂/∂θ¹ + ⋯ + γⁿ ∂/∂θⁿ
ς′ = ς¹ ∂/∂θ¹ + ⋯ + ςⁿ ∂/∂θⁿ
then by the bilinearity of each ∇dθⁱₓ on TxX the following equalities hold:
∇dθⁱₓ(γ′)(ς′) = Σ_{j=1}^{n} Aⁱj( γ¹ ∂/∂θ¹ + ⋯ + γⁿ ∂/∂θⁿ ) dθʲₓ( ς¹ ∂/∂θ¹ + ⋯ + ςⁿ ∂/∂θⁿ )
= Σ_{a,b,j=1}^{n} γᵃ ςᵇ Aⁱj(∂/∂θᵃ) dθʲₓ(∂/∂θᵇ)
= Σ_{a,b,j=1}^{n} γᵃ ςᵇ [ Σ_{k=1}^{n} Γⁱj,k dθᵏₓ(∂/∂θᵃ) ] dθʲₓ(∂/∂θᵇ)
= Σ_{a,b=1}^{n} γᵃ ςᵇ Γⁱa,b
Expanding ∇dθⁱₓ(γ′)(ς′) directly then yields the equality
∇dθⁱₓ(γ′)(ς′) = Σ_{a,b=1}^{n} γᵃ ςᵇ ∇dθⁱₓ(∂/∂θᵃ)(∂/∂θᵇ) = Σ_{a,b=1}^{n} γᵃ ςᵇ Γⁱa,b
from which the desired relation follows.
In order to define their concept of duality, Murray & Rice introduce a form of the
Leibniz rule for differentials based upon connections on tangent bundles and
co-tangent bundles, having now defined the latter 58. In particular, they relate a given
vector field V over X, a differential ωx∊Tx*X and a tangent vector γ’∊TxX by the
following differential formula (where the over-bar is used to denote the
co-tangent bundle connection):
d(ωx(V(x)))(γ′) = ∇̄ωx(γ′)(V(x)) + ωx(∇V(γ′))
58 §4.8, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
Since the action of each dθix on each basis vector of TxX under a given co-ordinate
system is by assumption given by the relation
dθⁱₓ(∂/∂θʲ(x)) = δi,j
and so is constant, this Leibniz rule gives rise to the concept of duality of the
connections ∇ and ∇ presented by Murray & Rice through the following equality:
0 = d(dθⁱₓ(∂/∂θʲ(x)))(γ′) = ∇̄dθⁱₓ(γ′)(∂/∂θʲ(x)) + dθⁱₓ(∇(∂/∂θʲ)(γ′))
Explicitly, by the definition of the Christoffel symbols of each connection this equality
relates the two by
Γ̄ⁱj,k = ∇̄dθⁱₓ(∂/∂θʲ(x))(∂/∂θᵏ(x)) = −dθⁱₓ(∇(∂/∂θʲ)(∂/∂θᵏ)(x))
= −dθⁱₓ( Σ_{a=1}^{n} Γᵃj,k ∂/∂θᵃ(x) ) = −Γⁱj,k
The motivation for defining duality in this way arises from parallel translations.
Supposing that for each tangent vector u∊Tρ(0)X there exists a parallel translation
along a path ρ:[0,1] → X, say Πρ(t)(u), which establishes a bijective correspondence
between tangent spaces, then one may define the action of a parallel-translated
differential ωρ(0)∊Tρ(0)*X on a tangent vector v = Πρ(t)(u)∊Tρ(t)X to be
Πρ(t)(ωρ(0))(v) = ωρ(0)(u)
Effectively, this parallel translation of differentials is defined by taking the action of the
differential on the reversely translated tangent vector which now lies in the
appropriate tangent space.
Note that this does indeed define a parallel translation, analogously to parallel
translations of tangent vectors, since the action of the parallel-translated differential
over a path γ(t) on its vector field γ′(t) is constant and so its connection vanishes:
(Πγ(t)(ω))(γ′(t)) = ωγ(0)(γ′(0))  ⇒  ∇̄(Πγ(t)(ω))(γ′(t)) = 0
By considering parallel translations in terms of the implied differential equations,
Murray & Rice claim that this bijectivity between tangent spaces is realised 59.
Moreover, defining the inverse of a parallel translation on the co-tangent bundle as
Πγ(t)⁻¹(ωγ(t)) ≔ { ν ∈ Tγ(0)*X | Πγ(t)(ν) = ωγ(t) }
which consists of precisely one element due to bijectivity, they also claim that this
parallel translation of differentials defines a connection via the relation
∇̄ωγ(0)(γ′(0)) = d/dt[ Πγ(t)⁻¹(ωγ(t)) ]|t=0
At first glance, this formula may be less than transparent in meaning. Recall that a
co-tangent bundle connection produces for every tangent vector γ′∊TxX a differential
∇̄ωx(γ′)∊Tx*X. The right hand side of this relation is taking the derivative with respect
to t of a differential, say ν(t)∊Tx*X, hence it is necessary to firstly show that this results
in another differential in Tx*X.
Write ν(t) = ν1(t)dθ¹ + … + νn(t)dθⁿ, then the derivative is
d/dt[ Πγ(t)⁻¹(ωγ(t)) ]|t=0 = d/dt[ ν(t) ]|t=0
= d/dt[ ν1(t)dθ¹ + ⋯ + νn(t)dθⁿ ]|t=0
= ν1′(0)dθ¹ + ⋯ + νn′(0)dθⁿ
and so defines another differential ν′(0)∊Tx*X as required. That the action of this
proposed connection is linear with respect to differentials follows simply from the
linearity of differentiation.
Lastly, given any differentiable function f:X → ℝ, this formula yields
∇̄(f(γ(0))∙ωγ(0))(γ′(0)) = d/dt[ Πγ(t)⁻¹( f(γ(t))∙ωγ(t) ) ]|t=0
= d/dt[ f(γ(t))∙ν(t) ]|t=0
= d/dt[ f(γ(t))∙ν1(t)dθ¹ + ⋯ + f(γ(t))∙νn(t)dθⁿ ]|t=0
= Σ_{i=1}^{n} [ df(γ′(0))∙νi(0)dθⁱ + f(γ(0))∙νi′(0)dθⁱ ]
= df(γ′(0))∙Πγ(0)⁻¹(ωγ(0)) + f(γ(0))∙∇̄ωγ(0)(γ′(0))
= df(γ′(0))∙ωγ(0) + f(γ(0))∙∇̄ωγ(0)(γ′(0))
59 §5.1.1, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
where the last equality follows from the fact that
Πγ(0)⁻¹(ωγ(0)) = { ν ∈ Tγ(0)*X | Πγ(0)(ν) = ωγ(0) }
is by definition just ωγ(0) itself.
Therefore, since all the prerequisites are satisfied, this does indeed define a
connection on the co-tangent bundle.
Returning to the original issue of motivating the definition of duality of connections on
tangent and co-tangent bundles via the Leibniz rule
d(ωx(V(x)))(γ′) = ∇̄ωx(γ′)(V(x)) + ωx(∇V(γ′))
Murray & Rice show that by writing the action of the differential ωγ(t) on the vector
field V(γ(t)) as
ωγ(t)(V(γ(t))) = ωγ(t)( Πγ(t)Πγ(t)⁻¹V(γ(t)) ) = Πγ(t)⁻¹(ωγ(t))( Πγ(t)⁻¹V(γ(t)) )
then this Leibniz rule now follows naturally60 (the second last equality below being an
application of the product rule):
d(ωγ(0)(V(γ(0))))(γ′) = d/dt[ ωγ(t)(V(γ(t))) ]|t=0
= d/dt[ Πγ(t)⁻¹(ωγ(t))( Πγ(t)⁻¹V(γ(t)) ) ]|t=0
= d/dt[ Πγ(t)⁻¹(ωγ(t)) ]|t=0 ( V(γ(0)) ) + ωγ(0)( d/dt[ Πγ(t)⁻¹V(γ(t)) ]|t=0 )
= ∇̄ωγ(0)(γ′(0))(V(γ(0))) + ωγ(0)(∇V(γ′(0)))
60 §5.1.1, “Differential Geometry and Statistics” – M Murray and J Rice, 1993.
CHAPTER 8 – SURVEY OF APPLICATIONS
Interest in the theory of information geometry is at least partially motivated by the
possibility of gaining a fresh perspective on the well-known topics of probability and
statistics. As alluded to previously in this thesis, several applications in the field of
statistics have emerged over the years as information geometry has established itself.
Primarily, these new methods are in relation to statistical estimation.
8.1 – AFFINE IMMERSIONS
There exists a famous theorem of Whitney 61 in the field of differential topology which
states that all manifolds which satisfy a certain condition admit an immersion into real
space. Essentially, this theorem gives a way of representing a given manifold in real
space which, for manifolds of small enough dimension, permits one to visualise the
manifold – much like how one visualises the 2-manifold that is the unit sphere in
3-dimensional real space.
Suppose that X and Y are manifolds and that f: X → Y is a map from X to Y. The map f
is an immersion if for each point x∊X its derivative dxf: TxX → Tf(x)Y is an injective map.
Then Whitney’s theorem is stated as follows:
Any smooth manifold X of dimension m > 1 admits a function
f: X → ℝ²ᵐ⁻¹ which is an immersion.
A consequence of this theorem is that any 2-dimensional manifold may be
visualised in 3-dimensional real space. Note, however, that there is no
guarantee that this map is injective. The Klein bottle is an example of such a
manifold – although it admits immersions into 3-space, such maps always
produce a self-intersecting surface62.
As manifolds, statistical manifolds are no exception to Whitney’s theorem.
However, using the framework endowed by information geometry it is possible
to improve upon Whitney’s result in certain cases as specified by a proposition
of Arwini & Dodson63.
61 §1.8, “Differential Topology” – V Guillemin and A Pollack, 1974.
62 §4, “A Topological Aperitif” – S Huggett and D Jordan, 2001.
63 §3.4, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
Before this can be presented, however, a definition from topology is required.
Given a topological space X, if any two points x,y∊X can be joined by a path
γ1: [0,1] → X and any two such paths between x and y, say γ1 and γ2, admit a
continuous deformation H:[0,1]×[0,1] → X such that H(0,t) = γ1(t),
H(1,t) = γ2(t), H(s,1) = γ1(1) and H(s,0) = γ1(0) then the space X is said to
be simply connected. For a more in-depth discussion of this topic see, for
example, Hatcher64.
Now, let P be a dually flat statistical manifold with respect to a Riemannian
metric 〈 · | · 〉p : TpP×TpP → ℝ and connections ∇ and ∇* which have flat
co-ordinates (θ1, … , θn) and (η1, … , ηn), respectively. As in section 7.3 of this
thesis, define the potential function φ: P → ℝ by either of the following two
equivalent relations:
∂φ/∂ηi (p) = θi(p)   i = 1, … , n
∂²φ/∂ηi∂ηj (p) = 〈 ∂/∂ηi | ∂/∂ηj 〉p   i, j = 1, … , n
Then if P is simply connected the map f: P → ℝⁿ⁺¹ where
f(θ1, … , θn) = (θ1, … , θn, φ(θ)) is an immersion.
As a special case of this proposition, note that, as in section 7.4 of this thesis,
exponential families are dually flat with respect to Amari & Nagaoka's
1-connection and (–1)-connection under the Fisher information metric. Thus if
an exponential family P consists of probability density functions of the form
p(z; θ) = exp( C(z) + Σ_{i=1}^{n} θi yi(z) − φ(θ) )
then the potential function in this case is just φ and so the immersion into ℝⁿ⁺¹
is just f(θ1, … , θn) = (θ1, … , θn, φ(θ)).
Some examples of these immersions are presented by Arwini & Dodson for
various statistical manifolds under the 1-connection and (–1)-connection and
the Fisher information metric.
If 𝒩 is the Normal family of probability distributions under the canonical
co-ordinates θ = (θ1,θ2) = (μ/ς², −1/(2ς²)) then it takes the form
𝒩 = { p(z; θ) = exp( θ1∙z + θ2∙z² − ½ log(−π/θ2) + (θ1)²/(4θ2) ) | θ1 ∊ ℝ, θ2 < 0 }
64 §1.1, “Algebraic Topology” – A Hatcher, 2002.
Therefore, the potential function is
φ(θ1, θ2) = ½ log(−π/θ2) − (θ1)²/(4θ2)   θ1 ∊ ℝ, θ2 < 0
and so Arwini & Dodson's immersion65 into ℝ³ is f: ℝ×(−∞, 0) → ℝ³ given by
f(θ1, θ2) = ( θ1 , θ2 , ½ log(−π/θ2) − (θ1)²/(4θ2) )
An important family of probability distributions that is applied to many modelling
situations66 due to its flexibility is the Beta family which consists of probability density
functions of the form
ℬ = { p(z; α, β) = (1/B(α,β)) z^(α−1) (1−z)^(β−1) | α, β > 0 }
where the values of z lie in (0,1) and B(α,β) is the beta function
B(α, β) = Γ(α)Γ(β)/Γ(α+β) = ∫₀¹ u^(α−1) (1−u)^(β−1) du
Despite what the form of the Beta family’s distributions may suggest, this is in fact an
exponential family with
y1(z) = log(z),  y2(z) = log(1−z),  θ1 = α−1,  θ2 = β−1,  C(z) = 0,
φ(θ) = log B(θ1+1, θ2+1)
since substituting these into the exponential form gives
exp( C(z) + θ1∙y1(z) + θ2∙y2(z) − φ(θ) )
= exp( (α−1)log(z) + (β−1)log(1−z) − log B(θ1+1, θ2+1) )
= (1/B(α,β)) z^(α−1) (1−z)^(β−1)
which is a probability distribution function of the Beta family, as required.
Therefore, under the 1-connection and (–1)-connection and the Fisher
information metric, the Beta family can be immersed into ℝ3 via the map
f: (−1,∞)×(−1,∞) → ℝ³ defined as
f(θ1, θ2) = ( θ1 , θ2 , log B(θ1+1, θ2+1) )
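This identification can be checked numerically (a sketch; `math.lgamma` supplies log Γ, and the helper names are mine): the canonical exponential form and the direct Beta density agree pointwise.

```python
import math

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Direct Beta density.
def beta_pdf(z, a, b):
    return z**(a - 1) * (1 - z)**(b - 1) / math.exp(log_beta(a, b))

# Canonical exponential form with theta1 = a - 1, theta2 = b - 1, C(z) = 0.
def beta_pdf_canonical(z, t1, t2):
    return math.exp(t1 * math.log(z) + t2 * math.log(1 - z)
                    - log_beta(t1 + 1, t2 + 1))

a, b = 2.5, 4.0
pairs = [(beta_pdf(z, a, b), beta_pdf_canonical(z, a - 1, b - 1))
         for z in (0.1, 0.5, 0.9)]
```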
Examples of the graphs of both of these immersions are given below.
65 §3.4, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
66 “Handbook of Beta distribution and its applications” – A Gupta and S Nadarajah, 2004.
(Immersion of Normal statistical manifold into ℝ3)
(Immersion of Beta statistical manifold into ℝ3)
8.2 – PROJECTIONS ONTO SUB-MANIFOLDS
Statistical inference is the study of estimating probability distributions and
related information from data, in some cases assuming that the distributions
are from a given family of distribution functions. For example, a commonly
estimated quantity is that of the sample mean – given data points (z1, … ,zn)
and assuming that each of these observations was independent of the others
and all the observations were generated by the same random variable, the
sample mean is defined as
μ = (1/n) Σ_{i=1}^{n} zi
and is used to estimate the mean value of the random variable in question. In
some cases, knowledge of this expected value is enough to determine a
random variable's probability distribution from a given family, such as the
Poisson distribution Pn(λ) (c.f. section 1.1) – a one parameter family of
distributions with expected value equal to λ. For a more detailed discussion of
this topic see, amongst others, Casella and Berger67.
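For instance (a minimal sketch), fitting a Poisson model by this route is a one-liner: the fitted λ is simply the sample mean of the observed counts.

```python
# Independent observations assumed to come from one Poisson random variable.
data = [3, 1, 4, 1, 5, 2, 2, 3, 2, 1]

# The sample mean, which for the Poisson family determines the distribution.
sample_mean = sum(data) / len(data)
lam_hat = sample_mean
```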
Now, suppose that one wishes to estimate a random variable from a known
family of distributions P and that this family is part of a larger family Q. If the
estimated distribution ê is not an element of the family P but is a part of the
larger family Q, then it is reasonable to try to generate a new estimate from ê
that does lie in P.
An example of this concept is the estimation of probability distribution
functions from the Normal family 𝒩(μ,ς), classified by its mean μ and standard
deviation ς, when there is assumed to be a dependence between these
two parameters, say ς = f(μ), or when one of the parameters is assumed to be
a predefined constant. If the estimated distribution function does not satisfy
these assumptions then one may wish to transform the estimate into a
distribution function that does.
One method of achieving this is via minimisation of divergences. Given a statistical
manifold Q with a divergence function D: Q×Q → ℝ, for a particular point q∊Q and a
sub-manifold P⊂Q the aim is to minimise the function f: P → ℝ where f(p) = D(q,p).
Note that there is no inherent requirement that P should be an actual sub-manifold
instead of a subset of Q – the function f can be minimised regardless of this – but there
are advantages to making this assumption as will now be presented.
67 “Statistical Inference” – G Casella and R Berger, 2001.
Amari & Nagaoka prove the following68:
Let (Q, 〈 · | · 〉q) be a dually flat n-dimensional Riemannian manifold with respect to
the connections ∇ and ∇* and the co-ordinates (θ1, … , θn) and (η1, … , ηn). If
D: Q×Q → ℝ is the canonical divergence
D(p‖q) = ψ(p) + φ(q) − Σ_{i=1}^{n} θi(p)ηi(q)
defined for the potentials ψ: Q → ℝ and φ: Q → ℝ which are obtained from the
differential relations
∂ψ/∂θi = ηi   i = 1, … , n
∂φ/∂ηi = θi   i = 1, … , n
then for any point q∊Q and sub-manifold P⊂Q the function f(p) = D(q,p) on P is
minimised by the point ê∊P if and only if the geodesic from q to ê with respect to the
connection ∇ is orthogonal to P at ê with respect to the inner-product 〈 · | · 〉ê.
Recall that a path γ: [0,1] → Q is a ∇-geodesic when the vector field that it traces out,
γ′: [0,1] → Tγ(t)Q, satisfies the equality ∇γ′(γ′) = 0. This path is orthogonal to P if for
any path ν: [0,1] → P in P where ν(a) = ê the inner product of the two vector fields
〈 γ′(1) | ν′(a) 〉ê is zero at ê.
To demonstrate this theorem in practice, assume that Q is the statistical manifold of an
exponential family under the canonical co-ordinates and that 〈 · | · 〉q is the Fisher
information metric. Then, as explained in section 5.3 of this thesis, Q is flat with respect
to Amari & Nagaoka's 1–connection (implying that the Christoffel symbols below
vanish) and so Murray & Rice's system of partial differential equations that any
∇(1)-geodesic γ:[0,1] → Q must satisfy reduces to
γ̈i + Σ_{j,k=1}^{n} Γⁱj,k γ̇j γ̇k = γ̈i = 0
where the γi are defined to be the co-ordinates of θ(γ(t)) = (γ1(t), … , γn(t)). This
implies that the ∇(1)-geodesics must be linear in the sense that γi = ai + bi∙t for
constants (a1, b1), … , (an, bn).
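A sketch of this linearity in code (names are mine; the two endpoint values below are arbitrary canonical co-ordinates with θ2 < 0):

```python
# A 1-connection geodesic is linear in canonical co-ordinates:
# gamma_i(t) = a_i + b_i * t, running from theta(0) = start to theta(1) = end.
def e_geodesic(start, end):
    b = [e - s for s, e in zip(start, end)]
    def gamma(t):
        return tuple(s + bi * t for s, bi in zip(start, b))
    return gamma

gamma = e_geodesic((1.0, -0.5), (3.0, -0.125))
midpoint = gamma(0.5)
```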
In terms of an exponential family
p(z; θ) = exp( C(z) + Σ_{i=1}^{n} θi yi(z) − φ(θ) )
this becomes
γ(t) = exp( C(z) + Σ_{i=1}^{n} (ai + bi∙t) yi(z) − φ(a1 + b1∙t , … , an + bn∙t) )
68 Theorem 3.10, “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
Fix a point q∊Q, let ν:[0,1] → P be a path in the sub-manifold P, assume that
ν(s) = γ(1) for some given s∊[0,1] and that γ(0) = q. This implies that
θ(ν(s)) = θ(γ(1)) = ( a1 + b1 , … , an + bn )
= ( θ1(q) + [θ1(ν(s)) − θ1(q)] , … , θn(q) + [θn(ν(s)) − θn(q)] )
and so ai = θi(q) and bi = θi(ν(s)) − θi(q) for each i = 1, … , n. Then the inner product
of the two vector fields
ν′(t) = ν̇1(t) ∂/∂θ1 + ⋯ + ν̇n(t) ∂/∂θn
γ′(t) = b1 ∂/∂θ1 + ⋯ + bn ∂/∂θn
at the point ν(s) = γ(1) = p is
〈 γ′(1) | ν′(s) 〉p = Σ_{i,j=1}^{n} gi,j(p) bi ν̇j(s) = Σ_{i,j=1}^{n} gi,j(p) [θi(ν(s)) − θi(q)] ν̇j(s)
where the gi,j are the coefficients defined in section 6.1 as
gi,j(p) = 〈 ∂/∂θi (p) | ∂/∂θj (p) 〉p
This equality for 〈 γ′(1) | ν′(s) 〉p must be solved for zero but without any further
assumptions about the manifolds P and Q and the coefficients gi,j no further
simplifications can be made.
As an example, assume further that Q is the statistical manifold of the Normal family. It
was shown in section 6.1 that the coefficients gi,j with respect to the (ρ1, ρ2) = (μ,ς)
co-ordinates are
g1,1 = 1/ς²   g1,2 = g2,1 = 0   g2,2 = 2/ς²
However, the coefficients needed now are those with respect to the canonical
co-ordinates.
In order to calculate these coefficients in terms of the canonical co-ordinates one may
use the transformation equation noted by Amari & Nagaoka 69 as
gk,l = Σ_{i,j=1}^{n} gi,j (∂ρi/∂θk)(∂ρj/∂θl)
where the left hand coefficients are those with respect to the canonical co-ordinates.
69 (1.22), “Methods of Information Geometry” – S Amari and H Nagaoka, 1993.
These coefficients work out to be
g1,1 = −1/(2θ2)
g1,2 = g2,1 = θ1/(2(θ2)²)
g2,2 = −((θ1)² + 2)/(2(θ2)³)
and so the inner product in question becomes
〈 γ′(1) | ν′(s) 〉p = Σ_{i,j=1}^{2} gi,j(p) [θi(ν(s)) − θi(q)] ν̇j(s)
= [ −(θ1(ν(s)) − θ1(q))/(2θ2(ν(s))) + θ1(ν(s))(θ2(ν(s)) − θ2(q))/(2(θ2(ν(s)))²) ] ν̇1(s)
+ [ θ1(ν(s))(θ1(ν(s)) − θ1(q))/(2(θ2(ν(s)))²) − ((θ1(ν(s)))² + 2)(θ2(ν(s)) − θ2(q))/(2(θ2(ν(s)))³) ] ν̇2(s)
= [ θ1(q)θ2(ν(s)) − θ1(ν(s))θ2(q) ] / (2(θ2(ν(s)))²) ∙ ν̇1(s)
+ [ (θ1(ν(s)))²θ2(q) − θ1(ν(s))θ2(ν(s))θ1(q) − 2θ2(ν(s)) + 2θ2(q) ] / (2(θ2(ν(s)))³) ∙ ν̇2(s)
which still does not have any obvious values for ν̇1(s), ν̇2(s), θ1(ν(s)) and θ2(ν(s)) in
general such that this inner product is zero. If given more information about the
sub-manifold P⊂Q and hence restrictions on the possible values of these four terms, then
further simplifications may become apparent and lead to a solution.
For example, if the situation is simplified further to assuming that the sub-manifold P
consists of the probability density functions for which one of the parameters, θ1 = ζ∊ℝ
or θ2 = ξ<0, is held fixed, then this implies that either ν̇1(t) or ν̇2(t) must be zero,
respectively, for all t∊[0,1] if ν(t) is to lie in P. In which case the vanishing of the inner
product implies that the minimum divergence occurs at the point p = ν(s) for
θ1(ν(s)) = ζ when
θ2(ν(s)) = θ2(q)(ζ² + 2) / (ζ∙θ1(q) + 2)
and for θ2(ν(s)) = ξ when
θ1(ν(s)) = ξ∙θ1(q) / θ2(q)
which results in the following probability density functions from the two Normal
sub-manifolds, respectively:
p(z; ζ, θ2(ν(s))) = exp( ζ∙z + [θ2(q)(ζ²+2)/(ζ∙θ1(q)+2)]∙z² − ½ log( −π(ζ∙θ1(q)+2)/(θ2(q)(ζ²+2)) ) + ζ²(ζ∙θ1(q)+2)/(4θ2(q)(ζ²+2)) )
p(z; θ1(ν(s)), ξ) = exp( [ξ∙θ1(q)/θ2(q)]∙z + ξ∙z² − ½ log(−π/ξ) + ξ(θ1(q))²/(4(θ2(q))²) )
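As a numerical sanity check on the first of these projections (a sketch; the metric coefficients are the canonical-co-ordinate ones stated above, and the sample point q is an arbitrary choice of mine), the geodesic direction at the projected point is indeed g-orthogonal to the sub-manifold on which θ1 is held fixed:

```python
# Fisher metric coefficients in canonical co-ordinates (theta2 < 0),
# as stated earlier in this section.
def g(t1, t2):
    g11 = -1 / (2 * t2)
    g12 = t1 / (2 * t2**2)
    g22 = -(t1**2 + 2) / (2 * t2**3)
    return [[g11, g12], [g12, g22]]

q = (1.0, -0.5)      # the point being projected
zeta = 2.0           # sub-manifold: theta1 held fixed at zeta

# Projected point, via the closed form above.
p = (zeta, q[1] * (zeta**2 + 2) / (zeta * q[0] + 2))

b = [p[0] - q[0], p[1] - q[1]]   # geodesic direction theta(nu(s)) - theta(q)
v = [0.0, 1.0]                   # tangent direction of {theta1 = zeta}
G = g(*p)
inner = sum(G[i][j] * b[i] * v[j] for i in range(2) for j in range(2))
```

The inner product `inner` vanishes, as the theorem requires.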
What the preceding example should illustrate is that in actually implementing this
method of projection onto sub-manifolds based upon divergence minimisation there is
the potential for complications to arise in all but the simplest cases. Of course, on a
case-by-case basis, it may be possible to use various algorithms to solve these
equations but a detailed discussion of this lies outside the scope of this thesis.
It is encouraging to see that there has in fact been some usage of this theorem of
Amari & Nagaoka in the international mathematical community for the purposes of
statistical inference. For example, Hirose & Komaki have presented a Least Angle
Regression algorithm70 based upon these information geometry principles that under
certain assumptions iteratively produces the desired divergence-minimising point of a
given sub-manifold. On the other hand, Zhong, Sun & Zhang have applied this theorem
to stochastic control theory and developed an algorithm intended to locate optimal
approximations of control functions71.
70 “An extension of Least Angle Regression based on the information geometry of dually flat spaces” – Y Hirose and F Komaki, 2010.
71 “An information geometry algorithm for distribution control” – F Zhong, H Sun and Z Zhang, 2008.
8.3 – GEODESICS FOR STATISTICAL INFERENCE
When estimating probability density functions from supplied datasets, one
method of quality assessment is given by hypothesis testing. The basic premise
is that for a given set of independent observations assumed to originate from
the same random variable, one evaluates the probability of such a dataset
occurring assuming that the random variable has the proposed probability
density function. If this probability falls below a predefined cut-off point, then
the proposed probability density function is rejected. For a more thorough
treatment of this theory, refer to Hogg and Tanis72, for example.
An alternative to this method may be constructed via the framework afforded
by information geometry. Specifically, using either geodesics, Riemannian
metrics or divergences, as per sections 5 and 6 of this thesis, it is possible to
define distance functions on statistical manifolds which may be used to assess
the suitability of probability density functions for given datasets. Given such a
distance function d( · , · ): P×P → ℝ on a statistical manifold P, if p0 is the
predicted probability density function and p1 is the probability density function
estimated from a dataset, then one may reject p0 as being the probability
density function of the random variable that generated the data if d(p0,p1)
exceeds a predefined value.
Additionally, for 2-dimensional statistical manifolds in particular, these methods of
distance measurement may give insight into the relation between elements of the
manifold in question. For example, fixing a point p(ξ1, ξ2) in a 2-dimensional
statistical manifold P, one may graph the function f(θ1,θ2) = d(p(ξ1, ξ2), p(θ1,θ2))
over the parameter space of P to determine neighbourhoods of p(ξ1, ξ2) – subsets of P
of the form Nc(ξ1, ξ2) = { p(θ1,θ2)∊P | f(θ1,θ2) < c } for c > 0. This information may be
of use in statistical estimation since if it is known that such a neighbourhood of the
point p(ξ1, ξ2)∊P is in some sense “large”, then estimations about this point may be
less accurate than if the neighbourhood were “small”.
Arwini & Dodson adopt this approach with respect to geodesics of the Gamma
statistical manifold Ⅎ under the Fisher information metric and the Levi-Civita
connection (Amari & Nagaoka’s 0–connection). As noted in section 5.5 of this thesis,
however, explicit solutions to the differential equations for such geodesics are
unattainable and so they employ upper-bound approximations73.
72 §8, “Probability and Statistical Inference” – R Hogg and E Tanis, 2006.
73 §7.3.1, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
To summarise their findings, they show that there exist non-parallel geodesics (with
respect to the Fisher information metric) having explicit solutions which may be used
to provide an upper bound for the geodesic based distance between any two points
p,q∊Ⅎ. By finding a point u∊Ⅎ that lies in the intersection of these two geodesics taken
through p and q separately, the upper bound is d(p,q) ≤ d(p,u) + d(u,q) where
d( · , · ) is defined as per section 6.1 of this thesis in terms of the inner product on Ⅎ as
d(p, q) = inf_{γ:[0,1]→Ⅎ, γ(0)=p, γ(1)=q} ∫₀¹ √( 〈 γ′ | γ′ 〉_{γ(t)} ) dt
In section 6.3 of this thesis the α-divergences (α ≠ ±1) for the Normal family under
(μ,ς) co-ordinates were derived as
D(α)( p(μ1, ς1) ‖ p(μ2, ς2) ) = 4/(1−α²) ∙ ( 1 − a ∙ exp(bc² − d) )
where
a = ς1^((α−1)/2) / ( ς2^((α+1)/2) √(2b) )
b = (1−α)/(4ς1²) + (1+α)/(4ς2²)
c = (1/(2b)) ∙ [ (μ1/(2ς1²))(1 − α) + (μ2/(2ς2²))(1 + α) ]
d = (μ1²/(4ς1²))(1 − α) + (μ2²/(4ς2²))(1 + α)
This result can now be put to use in defining a distance function on the Normal
statistical manifold.
Setting α = 0 reduces the expression for the divergence to the symmetric function
D(0)( p(μ1, ς1) ‖ p(μ2, ς2) ) = 4( 1 − a ∙ exp(bc² − d) )
where
a = 1/√( 2b ∙ ς1ς2 )
b = 1/(4ς1²) + 1/(4ς2²)
c = (1/(2b)) ∙ [ μ1/(2ς1²) + μ2/(2ς2²) ]
d = μ1²/(4ς1²) + μ2²/(4ς2²)
The graphs below represent the resulting function f: ℝ×(0,∞) → ℝ given by
f(μ, ς) = D(0)( p(1,1) ‖ p(μ, ς) )
which corresponds to the distance defined by the 0-divergence from the point p(1,1)
(the standard Normal probability distribution function) in the Normal statistical
manifold.
(0-divergence induced distance from the point p(1,1) in the Normal statistical manifold)
These graphs suggest that according to this distance function on the Normal statistical
manifold changes in μ produce a greater effect on the distance from a given point than
do equivalent changes in ς. Recalling that μ corresponds to the mean value of a
Normal random variable and ς corresponds to its spread, this implication seems logical.
8.4 – GEODESICS FOR STATISTICAL EVOLUTION
A novel application of geodesics through statistical manifolds was presented by
Arwini & Dodson in relation to the development of statistical processes over
time74. Suppose that one has a time-dependent statistical process represented
by the probability density function p(z; θ(t)) that belongs to a given family of
probability distributions P where θ(t): [0,1] → Θ is a path in the parameter space
Θ. If instead of having an explicit expression for the function θ(t) one only
knows its values for a finite number of points, then it may be desirable to have
a method of interpolation to predict the intermediate values of θ(t).
There exist many methods for interpolating general functions – approximation
by polynomials, approximation by Fourier series, approximation by splines –
but none of these take advantage of the information that the family of
probability distributions P endows as a statistical manifold.
For the sake of simplicity, assume that only the values of θ(0)∊Θ and θ(1)∊Θ
are known. The method of interpolation suggested by Arwini & Dodson is then
to join these two points via the unique geodesic in P connecting
p(z; θ(0)) and p(z; θ(1)). The idea here is that these geodesics represent a
‘natural’ or ‘most efficient’ path between any two points in a family of
probability distribution functions.
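A sketch of this interpolation under one concrete choice of connection, namely the 1-connection, under which geodesics are linear in canonical co-ordinates (this choice is an assumption; as the next paragraph notes, other connections yield other interpolants):

```python
# Interpolate theta(t) between the two known values theta(0) and theta(1)
# along a 1-connection geodesic, which is linear in canonical co-ordinates.
# Choosing the 1-connection is an assumption; other connections give
# different geodesics and hence different interpolants.
def interpolate(theta0, theta1, t):
    return tuple(a + (b - a) * t for a, b in zip(theta0, theta1))

theta_quarter = interpolate((0.0, -1.0), (2.0, -0.2), 0.25)
```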
There is, however, an initial complication to this method: as per the definition
given in section 5.4 of this paper, in order to define any such geodesic on a
statistical manifold P it is necessary to have a connection ∇ defined on P.
Although it is not difficult to do so, the main issue here is that different
connections will, in general, define different geodesics. Therefore, there is
some question as to how appropriate this method of interpolation is, in the
absence of any reasoning for choosing a particular connection.
Another potential difficulty is that actually calculating these geodesics may be an
unsolvable problem but it may be possible to circumvent this issue by employing
computer-generated approximations.
Arwini & Dodson apply this method to the modelling of voids in space and, ignoring
questions of validity of the connection used, show that the results can be used to
gain insight into the stability of states of this system.
74 §6.6, “Information Geometry: Near Randomness and Near Independence” – K Arwini and C Dodson, 2008.
CONCLUSION
Information geometry is heralded by Amari & Nagaoka as having applications across a
wide variety of mathematics-based fields and to a certain extent this is true. In this
thesis alone many worthwhile concepts have been discussed and examples of their
applications examined. Furthermore, it is evident that a sizeable volume of research
has been and continues to be undertaken into this developing field of mathematics.
Whilst in some sense the construction of statistical manifolds based upon the
parameters of a family of probability distributions is a fairly evident approach, much of
the theory of information geometry is novel. As an example of this, the α–connections
and α–divergences that it defines have in general little meaning or significance in
differential geometry at large but provide a standardised method of defining
geometries on statistical manifolds based upon their respective families of probability
distribution functions. These geometries therefore depend upon inherent qualities of
the families themselves rather than just providing an arbitrary structure.
On the other hand, there are several key problems which much of the literature
regarding information geometry tends to gloss over. For example, as demonstrated in
this thesis, in all but the simplest of cases, finding explicit solutions for many functions
and objects generated by information geometry can be difficult if not unattainable.
Whilst computer generated approximations may be of assistance in this regard, it does
seem to be a noteworthy concern when such a large amount of the theory is
somewhat difficult to apply.
Another question that does not appear to be generally addressed by most writers on
this topic is that of the validity of the geometries employed by information geometry.
Whilst exponential families of probability distributions can be seen to be flat with
respect to the ±1–connections, for example, little regard seems to be given by most
authors as to why this geometry should be used to represent these families, apart
from the convenience it endows.
One final but minor point of uncertainty is regarding the usefulness of the family of
α–connections introduced by Amari & Nagaoka. For most applications, the only values
of α that are used are 0 and ±1 – all of which correspond to concepts independent of
information geometry. Therefore, whilst the α–connections do relate these concepts,
the classification does seem at times to be somewhat vacuous.
These perhaps slightly philosophical qualms aside, there does seem to be genuine
benefit gained from the theory of information geometry. The distance functions that it
brings to probability theory and statistics, for example, permit a perspective that is
quite different from the methods of comparison usually found in the realm of statistical
estimation. However, further research regarding the justification behind many of the
concepts is something that may be beneficial to consider.
The final chapter of this thesis gives a short overview of the potential applications of
this theory. As such, much of this section could easily lead to further topics of research.
For example, several different notions of distance measures are given – it may prove of
interest to investigate the properties of these and determine which, if any, might be
better measures. In particular, it may be of merit to compare the distance measure
based on the 0-divergence for the Normal statistical manifold to other measures of
distance discussed in this thesis.
Another potential avenue of research may consist of comparing these new concepts
with the more standard concepts of statistical estimation. Finding some way of rating
the efficiency and/or accuracy of both the old and the new would be an interesting
result.
In closing, it is hoped that the reader has gained some appreciation of the depth of the
field of differential geometry and the potential that lies within the theory of
information geometry. If this thesis has served as an accessible introduction to any of
these concepts then it has served its purpose.
TABLE OF NOTATION
CHAPTER 1
Random variable – Z: Ω → E
Probability density function – p(z;θ)
Family of probability distributions – { p(z;θ) | θ∊Θ }
Parameter space – Θ
Expected value – 𝔼: P → ℝk
CHAPTER 2
Atlas of charts – {(Ua,θa)}a∊𝒜
Co-ordinate system – θa(x) = (θ1(x), … , θn(x))
Tangent space – TxX
Tangent bundle – TX
Rate of change of a real function – (f∘γ)′(0)
Partial derivative – ∂f/∂θk (x)
Differential – dF: TxX → TF(x)Y
Vector field – V: X → TX
CHAPTER 3
Log-likelihood map – ℓ: P → RΩ
Score – ∂ℓ/∂θi (p(Z; θ))
CHAPTER 4
Exponential family – p(z; θ) = exp( C(z) + Σ_{i=1}^{n} θi yi(z) − φ(θ) )
Canonical co-ordinates – (θ1, … , θn)
Second fundamental form – αi,j(p)
Fisher information matrix – gi,j(p) = 𝔼p[ ∂ℓ/∂θi ∙ ∂ℓ/∂θj ] = −𝔼p[ ∂²ℓ/∂θi∂θj ]
(Generalised) Efron’s statistical curvature – γ(p) = Σ_{i,j,k,l=1}^{n} g^{i,j}(p) ∙ g^{k,l}(p) ∙ 𝔼f[ αi,k(p) ∙ αj,l(p) ]
CHAPTER 5
Connection/covariant derivative – ∇
Christoffel symbols – Γʲi,k
Tangent space projection – πx: TxX → TyY
Projective connection – ∇π
α-connection – ∇(α)
Mixture family – p(m; θ) = Σ_{i=1}^{n} θi pi(m)
CHAPTER 6
Riemannian metric – 〈 · | · 〉x :TxX×TxX → ℝ
Fisher information metric – 〈 u | v 〉p = 𝔼p[ dpℓ(u) ∙ dpℓ(v) ]
Divergence – D(∙‖∙): X×X → ℝ
f-divergence – Df(p‖q) = ∫_{ℝk} p(z) ∙ f( q(z)/p(z) ) dz
α-divergence – D(α)(p‖q)
Hellinger distance – D(0)(p‖q) = 2 ∫ ( √p(x) − √q(x) )² dx
CHAPTER 7
Dual connection – ∇*
Parallel translation – Xv(x): X → TX
Potential functions – φ,ψ : X → ℝ
Canonical divergence – D(p‖q) = ψ(p) + φ(q) − Σ_{i=1}^{n} θi(p)ηi(q)
Maximum likelihood estimator – M: Ω → ℝ
Co-tangent bundle – Tx*X
Parallel translation – Πρ(t)(γ′)