Exponential families
Peter D. Hoff
September 26, 2013
Much of this content comes from Lehmann and Casella [1998] section 1.5.
Contents
1 The canonical exponential family
2 Basic results
1 The canonical exponential family
Construction of an exponential family of densities
Exponential families are classes of probability measures constructed from
1. a dominating measure µ, and
2. a statistic t(X).
Let
• (X, A) be a measurable space,
• µ be a measure on A,
• t : X → R^s be a statistic.
For η ∈ R^s, define the measure

    νη(A) = ∫_A e^{η^T t(x)} µ(dx)   ∀A ∈ A,

and let

    A(η) = log νη(X) = log ∫ e^{η^T t(x)} µ(dx).
If A(η) < ∞, we can define a probability measure Pη on (X, A) via its density w.r.t. µ:

    p(x|η) = e^{η^T t(x) − A(η)} ,   x ∈ X,

    Pη(A) = ∫_A p(x|η) µ(dx).
Note that
• Pη(X) = 1 by construction, and so (X, A, Pη) is a probability space.
• Pη is absolutely continuous w.r.t. µ, with Radon–Nikodym density p(x|η).
We can construct such a density for each η ∈ R^s for which ∫ e^{η^T t(x)} µ(dx) is finite.
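As a concrete numerical sketch of this construction (not from the notes: the finite sample space X = {0, ..., 5} with counting measure µ and statistic t(x) = x are assumed purely for illustration):

```python
import math

# Hypothetical finite example: X = {0,...,5}, mu = counting measure, t(x) = x.
X = range(6)

def t(x):
    return float(x)

def A(eta):
    # A(eta) = log of the integral of e^{eta * t(x)} d mu(x); a finite sum here,
    # so A(eta) < infinity for every eta, i.e. H = R.
    return math.log(sum(math.exp(eta * t(x)) for x in X))

def p(x, eta):
    # density w.r.t. counting measure: p(x|eta) = e^{eta*t(x) - A(eta)}
    return math.exp(eta * t(x) - A(eta))

# P_eta(X) = 1 by construction:
eta = 0.7
print(sum(p(x, eta) for x in X))  # 1.0 up to float rounding
```

Because X is finite the sum defining A(η) always converges; for continuous µ, convergence must be checked before p(x|η) is well defined.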
Definition 1 (canonical exponential family). Let
• (X, A, µ) be a measure space,
• t : X → Rs be an s-dimensional statistic that does not satisfy any linear constraints,
• A(η) = log ∫ e^{η^T t(x)} µ(dx).
A collection of densities given by
    {p(x|η) = exp(η^T t(x) − A(η)) : η ∈ H̃} ,  where  H̃ ⊂ H = {η : A(η) < ∞},
is called an s-dimensional exponential family.
Notes:
• The set H = {η : A(η) < ∞} is called the natural parameter space.
• Each density p(x|η) defines a measure Pη ≪ µ via Pη(A) = ∫_A p(x|η) µ(dx).
We say that the measures {Pη : η ∈ H̃} “have a common dominating measure” µ.
Minimal, full and curved exponential families
“Doesn’t satisfy a linear constraint” means
    ∄ a ∈ R^s : a ≠ 0, a^T t(x) = c ∀x ∈ X.
Some authors do not include this “no linear constraints” requirement for the statistic t.
If t does satisfy a linear constraint, the natural parameter space includes distinct points
that correspond to the same density and probability distribution. As a result, the parameter
will be non-identifiable (in the natural parameter space):
Definition 2. A model P = {p(x|η) : η ∈ H} for (X, A) is non-identifiable if there exist
η1, η2 ∈ H : η1 ≠ η2 but P(A|η1) = P(A|η2) ∀A ∈ A.
Exercise: Show that if t satisfies a linear constraint and H is the parameter space, then the
exponential family model is non-identifiable.
Most authors refer to an EFM where t does not satisfy a linear constraint as a minimal
parametrization. Since a non-minimal representation can always be made minimal, and the
recommendation is always to do so, it seems simplest just to require it in the definition.
Definition 3 (full rank). If the parameter space for an exponential family contains an
s-dimensional open set, then it is called full rank.
An exponential family that is not full rank is generally called a curved exponential family,
as typically the parameter space is a curve in Rs of dimension less than s.
Examples
Often an exponential family model is parameterized as
    P = {p(x|θ) = h(x) exp{η(θ)^T t(x) − B(θ)} : θ ∈ Θ}.
This is done
• if the parameter θ is more interpretable than η, or
• so that the dominating measure can be something simple.
Example (normal model):
The univariate normal model on (R, B(R)) can be represented with the class of densities
{p(x|µ, σ^2) : µ ∈ R, σ^2 ∈ R^+} w.r.t. Lebesgue measure, where

    p(x|µ, σ^2) = (2πσ^2)^{−1/2} exp(−(x − µ)^2 /[2σ^2])
                = (2π)^{−1/2} exp(−x^2/(2σ^2) + xµ/σ^2 − µ^2/(2σ^2) − (1/2) log σ^2).
This is the same model as p(x|η) = (2π)^{−1/2} exp(η^T t(x) − A(η)), where

    t(x) = (x, x^2)^T ,   η(µ, σ^2) = (µ/σ^2, −1/(2σ^2))^T ,   A(η) = (µ^2/σ^2 + log σ^2)/2.
To reparameterize back, note that µ = −η1/(2η2) and σ^2 = −1/(2η2).
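A quick numerical check of this reparameterization and of the equality of the two density expressions (the values of µ, σ^2 and x below are arbitrary, chosen only for illustration):

```python
import math

def to_natural(mu, sigma2):
    # (mu, sigma2) -> eta = (mu/sigma2, -1/(2*sigma2))
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def from_natural(eta1, eta2):
    # inverse map: mu = -eta1/(2*eta2), sigma2 = -1/(2*eta2)
    return (-eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2))

def p_meanvar(x, mu, sigma2):
    # N(mu, sigma2) density in its usual form
    return (2.0 * math.pi * sigma2) ** -0.5 * math.exp(-(x - mu) ** 2 / (2.0 * sigma2))

def p_natural(x, eta1, eta2):
    # exponential-family form with t(x) = (x, x^2), A(eta) = (mu^2/sigma2 + log sigma2)/2
    mu, sigma2 = from_natural(eta1, eta2)
    A = (mu ** 2 / sigma2 + math.log(sigma2)) / 2.0
    return (2.0 * math.pi) ** -0.5 * math.exp(eta1 * x + eta2 * x ** 2 - A)

mu, sigma2 = 1.3, 0.8
eta1, eta2 = to_natural(mu, sigma2)
print(from_natural(eta1, eta2))                                              # recovers (1.3, 0.8) up to rounding
print(abs(p_meanvar(0.4, mu, sigma2) - p_natural(0.4, eta1, eta2)) < 1e-12)  # True
```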
What is the natural parameter space?
Does it correspond to (µ, σ^2) ∈ R × R^+?
Recall,

    H = {(η1, η2) : ∫_{−∞}^{∞} e^{η1 x + η2 x^2} dx < ∞}.

Convince yourself that H = R × R^−, which gives (µ, σ^2) ∈ R × R^+.
The exponential family model defined by t(x) = (x, x^2) and H̃ = H is the normal model.
The normal model with (µ, σ^2) ∈ R × R^+ is a two-dimensional full rank exponential family.
Example (a curved normal model):
Consider the normal model having the following mean-variance relationship:

    X ∼ normal(θ, θ^2) ,  θ ∈ R, θ ≠ 0.

Let P = {p(x|µ, σ^2) : µ ∈ R, σ^2 = µ^2}, where p(x|µ, σ^2) are the normal densities given above.
The densities in this model can be written
    p(x|θ) = (2πθ^2)^{−1/2} exp(−(x − θ)^2 /[2θ^2])
           ∝_x exp(−(x^2 − 2θx + θ^2)/[2θ^2])
           = exp(x/θ − x^2/[2θ^2] − 1/2)
           ∝_x exp(x/θ − x^2/[2θ^2])
           ≡ exp(η1 t1(x) + η2 t2(x)).
Since t(x) = (x, x^2) doesn't satisfy a linear constraint, this is a two-dimensional
exponential family.
The natural parameter space corresponding to t(x) is H = R × R^−.
Our reduced parameter space is η(θ) = (1/θ, −1/[2θ^2]).
This is a one-dimensional curve in two-dimensional space.
Draw a picture.
This family is a two-dimensional exponential family (in minimal form).
It is not a full rank exponential family.
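The picture can also be checked numerically: eliminating θ from η(θ) = (1/θ, −1/[2θ^2]) shows the reduced parameter set traces the parabola η2 = −η1^2/2 inside H = R × R^−. A short sketch (the θ values are arbitrary nonzero illustration points):

```python
for theta in (-2.0, -0.5, 0.3, 1.0, 4.0):            # arbitrary nonzero theta values
    eta1, eta2 = 1.0 / theta, -1.0 / (2.0 * theta ** 2)
    assert eta2 < 0                                   # the curve stays inside R x R^-
    assert abs(eta2 - (-(eta1 ** 2) / 2.0)) < 1e-9    # eta2 = -eta1^2 / 2
print("eta(theta) traces the parabola eta2 = -eta1^2/2")
```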
Example (multinomial model):
Let X ∼ multinomial(n, θ), for which
    Θ = {θ ∈ R^p : Σ_j θj = 1}  and  X = {x ∈ {0, 1, . . . , n}^p : Σ_j xj = n}.
The density of Pθ w.r.t. counting measure µ on X is
    p(x|θ) = (n choose x) θ1^{x1} × ··· × θp^{xp},

where (n choose x) = n!/(x1! ··· xp!) is the multinomial coefficient.
We can rewrite this in canonical exponential form as
    p(x|η) = exp(x1 η1 + ··· + xp ηp),
where ηj = log θj and the dominating measure is
    µ̃(x) = (n choose x) × µ(x),
i.e. the multinomial coefficient “has been absorbed into the dominating measure”.
The parameter space for this model is H̃ = {η ∈ R^p : Σ_j e^{ηj} = 1}, which is a
(p − 1)-dimensional curve in R^p.
Is the multinomial model a p-dimensional curved exponential family?
Note that 1^T t(x) = n ∀x ∈ X, so this “family”
• doesn’t satisfy our definition, or if you prefer
• is not in minimal form.
Consider the usual parameterization again, but now express the model in terms of t(x) =
(x1 , . . . , xp−1 ):
    p(x|θ) = (n choose x) θ1^{x1} ··· θ_{p−1}^{x_{p−1}} θp^{n − Σ_{j=1}^{p−1} xj}
           = (n choose x) θp^n ∏_{j=1}^{p−1} (θj/θp)^{xj}
           = (n choose x) exp(η1 x1 + ··· + η_{p−1} x_{p−1} − A(η)),
where ηj = log(θj/θp) and A(η) can be computed as follows:

    θj = θp e^{ηj}
    1 − θp = θp Σ_{j=1}^{p−1} e^{ηj}
    θp = 1 / (1 + Σ_{j=1}^{p−1} e^{ηj})
    A(η) = −n log θp = n log(1 + Σ_{j=1}^{p−1} e^{ηj}).
Thus the multinomial model is a (p − 1)-dimensional exponential family generated by the
statistic t(x) = (x1 , . . . , xp−1 ).
Does Θ correspond to H?
    H = {η ∈ R^{p−1} : Σ_{x∈X} exp{η1 x1 + ··· + η_{p−1} x_{p−1}} < ∞} = R^{p−1}.
This contains a (p − 1)-dimensional rectangle, and so the multinomial model is a full rank
(p − 1)-dimensional exponential family.
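The formula A(η) = n log(1 + Σ e^{ηj}) can be verified by brute force in a small hypothetical case (p = 3, n = 4, with an arbitrary η), summing exp(η^T t(x)) times the multinomial coefficient over the whole sample space:

```python
import math
from itertools import product

n = 4                 # hypothetical small case: p = 3 categories, n = 4 trials
eta = (0.2, -1.1)     # arbitrary eta in R^(p-1)

def multinom_coef(x):
    # multinomial coefficient n! / (x1! x2! x3!)
    c = math.factorial(n)
    for xj in x:
        c //= math.factorial(xj)
    return c

# lhs: sum over the sample space of (n choose x) * exp(eta1*x1 + eta2*x2)
lhs = sum(multinom_coef((x1, x2, n - x1 - x2)) * math.exp(eta[0] * x1 + eta[1] * x2)
          for x1, x2 in product(range(n + 1), repeat=2)
          if x1 + x2 <= n)

# rhs: exp(A(eta)) with A(eta) = n * log(1 + sum_j e^{eta_j})
A = n * math.log(1.0 + sum(math.exp(e) for e in eta))
print(abs(lhs - math.exp(A)) < 1e-9)   # True
```

The check is just the multinomial theorem: the sum collapses to (1 + e^{η1} + e^{η2})^n.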
2 Basic results
Convexity of H:
The largest EFM based on a statistic t(x) is the one based on the natural parameter space:
{p(x|η) : η ∈ H̃} ⊂ {p(x|η) : η ∈ H} since H̃ ⊂ H.
The natural parameter space is usually (but not always) open,
making this “fullest family” also full rank.
It is always the case that H is convex, and that A(η) is convex on H.
Theorem 1. The natural parameter space H for densities of the form p(x|η) = exp(η^T t(x) −
A(η)) is convex, and A(η) is convex on H.
Proof. Recall Hölder’s inequality: for a ∈ [0, 1], b = 1 − a,

    ∫ fg ≤ ( ∫ f^{1/a} )^a ( ∫ g^{1/b} )^b .

Now let η1, η2 ∈ H and apply the inequality:

    e^{A(aη1 + bη2)} = ∫ exp((aη1 + bη2)^T t(x)) µ(dx) = ∫ e^{a η1^T t} e^{b η2^T t} µ(dx)
                     ≤ ( ∫ e^{η1^T t} µ(dx) )^a ( ∫ e^{η2^T t} µ(dx) )^b = e^{a A(η1) + b A(η2)} < ∞,

and so aη1 + bη2 ∈ H, and A(η) is convex.
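A numerical illustration of Theorem 1 in the simplest case (an assumed example, not from the notes): t(x) = x on X = {0, 1} with counting measure gives A(η) = log(1 + e^η), and the convexity inequality can be checked along a segment:

```python
import math

def A(eta):
    # log-partition of the Bernoulli family: A(eta) = log(1 + e^eta)
    return math.log(1.0 + math.exp(eta))

eta1, eta2 = -3.0, 2.0   # arbitrary points in H = R
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    b = 1.0 - a
    # Theorem 1: A(a*eta1 + b*eta2) <= a*A(eta1) + b*A(eta2)
    assert A(a * eta1 + b * eta2) <= a * A(eta1) + b * A(eta2) + 1e-12
print("convexity inequality holds at all checked points")
```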
Continuity, integration and differentiation
The following theorem is useful in a variety of contexts:
Theorem 2 (LC 5.8). For any integrable function f, the expected value function E[f|η],

    E[f|η] = ∫ f(x) exp(η^T t(x) − A(η)) µ(dx),

is, at any η in the interior of H,
1. continuous as a function of η,
2. differentiable w.r.t. η to all orders, and
3. such that its derivatives can be obtained by differentiating under the integral sign.
The first item is used in two key results in estimation and testing:
• In estimation, the theorem implies that risk functions for exponential family models are
continuous. This will help us characterize all admissible estimators for such models.
• In testing, the theorem implies that the power function for any test is continuous. This
will help us characterize unbiased testing procedures.
An important application of the theorem is the calculation of moments of t.
By definition, e^{A(η)} = ∫ e^{ηt} µ(dx). Taking derivatives w.r.t. η gives

    (d/dη) e^{A(η)} = (d/dη) ∫ e^{ηt} µ(dx)
    A′(η) e^{A(η)} = ∫ t e^{ηt} µ(dx)
    A′(η) = ∫ t e^{ηt − A(η)} µ(dx) = E[t(X)|η].
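The identity A′(η) = E[t(X)|η] can be checked numerically on the normal example from earlier in the notes, where t(x) = (x, x^2) and so the gradient of A(η) should equal (µ, µ^2 + σ^2). A finite-difference sketch (the check point (µ, σ^2) and step size h are arbitrary choices):

```python
import math

def A(eta1, eta2):
    # A(eta) for the normal family, via mu = -eta1/(2 eta2), sigma2 = -1/(2 eta2)
    sigma2 = -1.0 / (2.0 * eta2)
    mu = -eta1 / (2.0 * eta2)
    return (mu ** 2 / sigma2 + math.log(sigma2)) / 2.0

mu, sigma2 = 0.7, 1.5                     # arbitrary check point
eta1, eta2 = mu / sigma2, -1.0 / (2.0 * sigma2)
h = 1e-6                                  # finite-difference step (arbitrary small value)
dA1 = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2.0 * h)
dA2 = (A(eta1, eta2 + h) - A(eta1, eta2 - h)) / (2.0 * h)
print(abs(dA1 - mu) < 1e-5)                   # True: dA/deta1 = E[X] = mu
print(abs(dA2 - (mu ** 2 + sigma2)) < 1e-5)   # True: dA/deta2 = E[X^2] = mu^2 + sigma^2
```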
More generally,

Theorem 3 (Barndorff-Nielsen (1978), thm 8.1). Let P = {p(x|η) = exp(η^T t − A(η)) : η ∈ H}
be an exponential family and η ∈ int H. Then

    ∂^{k1 + ··· + ks} / (∂η1^{k1} ··· ∂ηs^{ks}) e^{A(η)} = ∫ t1^{k1}(x) × ··· × ts^{ks}(x) e^{η^T t(x)} µ(dx)   ∀ k1, . . . , ks ≥ 0.
This result helps us with the moment generating function.
Moment generating function:

    M_t(u1, . . . , us) = E[e^{u^T t} | η]
                        = ∫ e^{(η+u)^T t − A(η)} µ(dx)
                        = e^{A(η+u) − A(η)} ∫ e^{(η+u)^T t − A(η+u)} µ(dx)
                        = e^{A(η+u) − A(η)}.
This works as long as η is in the interior of H and u is small enough so that η + u ∈ H.
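The identity M_t(u) = e^{A(η+u) − A(η)} is easy to check in the Bernoulli case (an assumed example: X = {0, 1}, t(x) = x, counting measure), where A(η) = log(1 + e^η) and E[e^{uX}] can be summed directly:

```python
import math

def A(eta):
    # Bernoulli log-partition: A(eta) = log(1 + e^eta)
    return math.log(1.0 + math.exp(eta))

eta, u = 0.4, 0.9                            # arbitrary; eta + u is still in H = R
p = math.exp(eta) / (1.0 + math.exp(eta))    # success probability under eta
direct = (1.0 - p) + p * math.exp(u)         # E[e^{uX}] summed over X = {0, 1}
via_A = math.exp(A(eta + u) - A(eta))        # the exponential-family formula
print(abs(direct - via_A) < 1e-12)           # True
```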
From this, we can use the above theorem to show

    ∂^{k1 + ··· + ks} / (∂u1^{k1} ··· ∂us^{ks}) M_t(u) |_{u=0} = E[t1^{k1} × ··· × ts^{ks} | η].
References
O. Barndorff-Nielsen. Information and exponential families in statistical theory. Wiley Series
in Probability and Mathematical Statistics. John Wiley & Sons, Chichester, 1978.
E. L. Lehmann and George Casella. Theory of point estimation. Springer Texts in Statistics.
Springer-Verlag, New York, second edition, 1998. ISBN 0-387-98502-6.