IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 49, NO. 4, APRIL 2003
Optimal Error Exponents in Hidden Markov Models
Order Estimation
Elisabeth Gassiat and Stéphane Boucheron, Member, IEEE
Abstract—We consider the estimation of the number of hidden
states (the order) of a discrete-time finite-alphabet hidden Markov
model (HMM). The estimators we investigate are related to
code-based order estimators: penalized maximum-likelihood
(ML) estimators and penalized versions of the mixture estimator
introduced by Liu and Narayan. We prove strong consistency of
these estimators without assuming any a priori upper bound on
the order, and with smaller penalties than in previous works. We prove a
version of Stein’s lemma for HMM order estimation and derive
an upper bound on underestimation exponents. Then we prove
that this upper bound can be achieved by the penalized ML
estimator and by the penalized mixture estimator. The proof of
the latter result gets around the elusive nature of the ML in HMM
by resorting to large-deviation techniques for empirical processes.
Finally, we prove that for any consistent HMM order estimator,
for most HMMs, the overestimation exponent is null.
Index Terms—Composite hypothesis testing, error exponents,
generalized likelihood ratio testing, hidden Markov model
(HMM), large deviations, order estimation, Stein’s lemma.
I. INTRODUCTION
A. The Problem
The order estimation problem for hidden Markov models (HMMs) can be described in the following way. Let $\mathcal{Y}$ denote a fixed finite emission alphabet of cardinality $m$. Let $\mathcal{X}$ denote a finite set of hidden states. Assume that a random $\mathcal{Y}$-valued sequence $Y_{1:n}$ has been produced according to the following procedure: an $\mathcal{X}$-valued hidden Markov chain is first sampled for $n$ steps, providing a sequence of hidden states $X_{1:n}$; then, for each $i$ from $1$ to $n$, an observation $Y_i$ is drawn according to a probability distribution $b(\cdot \mid X_i)$ that depends solely on the hidden state $X_i$ (i.e., the $Y_i$'s are independently distributed conditionally on $X_{1:n}$). Henceforth, the distributions $b(\cdot \mid x)$ will be called the emission distributions. The cardinality of the underlying state space is called the order of the HMM. The identification of the order from the sequence of observations $Y_{1:n}$ is called the order estimation problem for HMMs.
When the hidden state space has size $k$, an HMM is completely defined by a transition matrix on the hidden state space, that is, by $k$ probability distributions over $\mathcal{X}$, and by an emission matrix, that is, by $k$ probability distributions over $\mathcal{Y}$. Letting $\mathcal{P}(\mathcal{X})$ (resp., $\mathcal{P}(\mathcal{Y})$) denote the set of probabilities on $\mathcal{X}$ (resp., $\mathcal{Y}$), the parameter space, denoted by $\Theta_k$, may be identified with $\mathcal{P}(\mathcal{X})^k \times \mathcal{P}(\mathcal{Y})^k$. An element $\theta$ of $\Theta_k$, together with an initial distribution $\nu$ on the hidden state space, defines a unique probability distribution on $(\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$, and, taking marginals, on $\mathcal{Y}^{\mathbb{N}}$ (both spaces being provided with cylindrical $\sigma$-algebras). This probability distribution is sometimes called a hidden Markov source (HMS).
Manuscript received September 26, 2001; revised May 14, 2002. This work was supported by the Esprit Working Group RAND-APPX.
E. Gassiat is with the Department of Mathematics, Université Paris-Sud, 91405 Orsay, France.
S. Boucheron is with the CNRS and the Department of Computer Science, Université Paris-Sud, 91405 Orsay, France (e-mail: stephane.boucheron@lri.fr).
Communicated by G. Lugosi, Associate Editor for Nonparametric Estimation, Classification, and Neural Networks.
Digital Object Identifier 10.1109/TIT.2003.809574
When the order of the HMM is known a priori, inference
and compression problems can, in principle, be solved by maximum-likelihood (ML) estimation techniques [1], [4], [14], [15],
[31], [26] or by some related universal coding techniques [11],
[29], [41], [7], [2], [5]. But in many applications where HMMs
are used as a modeling device (see [35] and references therein),
nature does not provide any clear indication about the right order
of the model. According to MacDonald and Zucchini [35], “in the
case of HMM, the problem of model selection (and in particular
the choice of the number of states in the Markov chain component model) has yet to be satisfactorily solved.” The recent tutorial paper by Ephraim and Merhav [18] presents the state of the
art about HMM. The goal of this paper is to report significant
progress in the direction of order estimation.
When dealing with the order estimation problem, a first difficulty is raised by identifiability issues. The distribution of the process of observations $Y$ may be represented by HMMs of different orders. To get rid of this problem, following an established tradition [20], [33], we agree that the order of $Y$ is the smallest integer $k$ such that there exist some $\theta \in \Theta_k$ and some initial distribution $\nu$ on hidden states such that $Y$ is distributed according to $P_{\theta,\nu}$. Note that $\theta$ may not be unique. Anyway, up to a permutation of the hidden states, most of the elements of $\Theta_k$ are identifiable (see [20, Ch. I] for a survey on those issues).
An order estimation procedure is a sequence of estimators $(\hat{k}_n)_{n \ge 1}$ that, given input sequences $Y_{1:n}$ of length $n$, outputs estimates of the true order. Let $\theta^\ast$ denote the generating source and $k^\ast$ its true order. A sequence of estimators is strongly consistent with respect to $\theta^\ast$ and initial distribution $\nu$ if, $P_{\theta^\ast,\nu}$-almost surely, the sequence $(\hat{k}_n)$ converges toward $k^\ast$. If the true order is $k^\ast$, $\hat{k}_n$ is said to overestimate the order if $\hat{k}_n > k^\ast$ and to underestimate the order if $\hat{k}_n < k^\ast$.
The analysis of HMM order estimation has been influenced
by the theory of universal coding and by the theory of composite
hypothesis testing. The first perspective provides a convenient
framework for designing order estimators, while the second provides guidelines for the analysis of their performance.
0018-9448/03$17.00 © 2003 IEEE
B. Code-Based Order Estimators
The order estimation procedures investigated in this paper
have been considered previously in the literature [38], [39], [36],
[20], [48], [28], [33], [21], [2], [10], [9]. Adopting the terminology coined by Kieffer in [28], they fit into the category of
code-based estimation procedures. The pervasive influence of
concepts originating from universal coding theory in this area
should not be a surprise. Recall that, by the Kraft–McMillan inequality [6], a uniquely decodable code on $\mathcal{Y}^n$ defines a (sub)probability on $\mathcal{Y}^n$, and, conversely, for any probability distribution on $\mathcal{Y}^n$, there exists a uniquely decodable code for $\mathcal{Y}^n$ such that codeword lengths match the negative logarithms of the probabilities (up to integer rounding). Henceforth, the probability associated with a code will be called the coding probability, and the negative logarithm of the coding probability represents the ideal codeword length associated with the coding probability.
Code-based order estimation procedures may be sketched in the following way. Assume that for each sample length $n$ and for each $k$, $Q_n^k$ denotes the coding probability associated with a uniquely decodable code on $\mathcal{Y}^n$. The coding probabilities $(Q_n^k)_n$ are assumed to form a sequence of universal coding probabilities for $\Theta_k$: that is, for any source $P_{\theta,\nu}$ with $\theta \in \Theta_k$,
$$\frac{1}{n} \log \frac{p_{\theta,\nu}(Y_{1:n})}{Q_n^k(Y_{1:n})} \longrightarrow 0 \quad P_{\theta,\nu}\text{-almost surely.}$$
Henceforth, $\mathrm{pen}(n,k)$ will denote a nonnegative function of $n$ and $k$. A code-based order estimation procedure is defined as follows:
$$\hat{k}_n = \arg\min_{k \ge 1} \left\{ -\log Q_n^k(Y_{1:n}) + \mathrm{pen}(n,k) \right\}. \qquad (1)$$
If $\mathrm{pen}(n,k) \equiv 0$, the code-based order estimation procedure implements the minimum description length (MDL) procedure introduced and illustrated by Rissanen [38], [39], [2]. Note that the code-based estimators investigated in [36], [20], [48], [28], [33], [21], [2], [10] all use nontrivial penalties to enforce strong consistency or some other property.
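In code form, the selection rule (1) is a one-liner once the code lengths are available. The sketch below is illustrative only: the dictionary of code lengths $-\log Q_n^k(y_{1:n})$ is assumed to be supplied by some coding method (NML or Krichevsky–Trofimov mixtures, discussed in Section II), and the function names and numerical values are hypothetical.

```python
import math

def code_based_order_estimate(n, code_lengths, pen):
    """Selection rule (1): return the order k minimizing the ideal codeword
    length -log Q^k_n(y_{1:n}) plus the penalty pen(n, k).

    code_lengths: dict mapping order k to -log Q^k_n(y_{1:n}), in nats
    pen:          nonnegative function of (n, k)
    """
    return min(code_lengths, key=lambda k: code_lengths[k] + pen(n, k))

# Hypothetical code lengths for a sequence of length n = 100: richer models
# compress the data better, but only marginally beyond order 2.
code_lengths = {1: 150.0, 2: 100.0, 3: 98.0}

mdl = code_based_order_estimate(100, code_lengths, lambda n, k: 0.0)
penalized = code_based_order_estimate(100, code_lengths,
                                      lambda n, k: 5.0 * k * math.log(n))
```

With $\mathrm{pen} \equiv 0$ this is the MDL rule and it returns the best-compressing order, here 3; with the (hypothetical) logarithmic penalty it returns 2, illustrating how the penalty discounts the marginal gain of the larger model.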
The first question raised by code-based estimation procedures is the following: for a given sequence of universal coding probabilities, what kind of penalties warrant strong consistency? In [28], Kieffer describes a strongly consistent order estimator based on Shtarkov normalized ML (NML) codes [41], [2], [5] (see Section II). For a fixed value of $n$, the penalties $\mathrm{pen}(n,k)$ used in [28] grow exponentially fast with respect to $k$. In [33], Liu and Narayan describe a strongly consistent order estimator based on Krichevsky–Trofimov mixtures [11], [29] (see Section II). For a fixed value of $n$, the penalties that are implicit in the presentation of [33] grow like a cubic polynomial with respect to $k$. But strong consistency is only proved under the promise that the true order is less than some a priori upper bound. In [33], a strongly consistent version of the order estimation procedure described in [48] is also presented. Again, consistency is proved under the promise that the true order is less than some a priori upper bound.
C. Error Exponents
As pointed out in [42], under many circumstances, the definition of consistency is too weak to be useful. Here comes the testing theory perspective. As a matter of fact, the order estimation problem is intimately related to composite hypothesis testing problems (see [36], [48], [34], [33]). If we have a consistent sequence of order estimators, we should be able to manufacture a sequence of consistent tests for the following question: “is the true order smaller than $k$?” This paper, as well as
the aforementioned literature [20], [33], [21], [36], [48], [34],
is interested in Bahadur relative efficiency (in the parlance of
asymptotic statistics), or in error exponents (in information-theoretical language) which have proved to be relevant measures of
asymptotic relative efficiency.
Let us describe this framework in more detail. The two hypotheses may be defined as $H_0$: the order is smaller than $k$, and $H_1$: the order is $k$. Note that we face nested composite hypotheses. We want the test to reject $H_0$ when the true order is $k$ and to accept $H_0$ when $H_0$ holds. The region on which the test rejects hypothesis $H_0$ is called the critical region. The power function $\beta_n$ of the test maps sources to the probability of the critical region. If, for all sources $P$ of order smaller than $k$ and for all sufficiently large $n$, $\beta_n(P) \le \alpha$, the test is said to be asymptotically of level $\alpha$. The difficulty consists in finding a test with asymptotic level $\alpha$ such that $\beta_n(P)$ tends toward $1$ for all sources $P$ of true order greater than or equal to $k$.
A sequence of tests with level $\alpha$ is said to achieve error exponent (Bahadur efficiency) $B(P)$ if, for all sources $P$ of true order equal to $k$,
$$\limsup_n \frac{1}{n} \log \left( 1 - \beta_n(P) \right) \le -B(P)$$
and, for all sources $P$ of true order smaller than $k$, $\beta_n(P) \le \alpha$ for $n$ large enough.
As pointed out many times in the literature [36], [48], [33],
[21], [8], [19], [34], this (classical) framework generalizes the
usual Neyman–Pearson setting. Hence, it is natural to wonder
whether some generalized likelihood ratio test is efficient in this
setup (see [46], [19], [34] for more general discussions on this
issue). As a matter of fact, the code-based order estimators discussed above are intimately related to ML techniques, and
investigating the optimality of the error exponents achieved by
those code-based order estimators will turn out to be equivalent to analyzing the optimality of generalized likelihood ratio
testing.
In [33], Liu and Narayan already proved that a consistent code-based order estimator achieves positive underestimation exponents, but they left open the determination of the optimal underestimation exponent. Overestimation exponents may be defined in an analogous manner, but, to the best of our knowledge, no consistent HMM order estimator with a nontrivial overestimation exponent has ever been reported in the literature. However, while this paper was submitted for publication, Khudanpur and Narayan [30] proposed an order estimator for a special class of HMMs which was proved to achieve the best underestimation exponent. Our result applies to general HMMs.
D. Lessons From Related Problems
To understand what is at stake, we may learn from two
simpler settings for the nested composite hypothesis testing
problem: the independently and identically distributed (i.i.d.)
case, and Markov order estimation.
In [8], Csiszár describes Hoeffding’s analysis of the (nested)
composite hypothesis testing problem in the finite alphabet ,
i.i.d. setup. The test decides whether the source belongs to some
closed set of memoryless sources. The Bahadur efficiency is
identified with some information-divergence rate distance. The transparent analysis relies on the availability of relevant sufficient statistics: in the i.i.d. case, the type, or empirical distribution, of the sequence $Y_{1:n}$ coincides with the ML estimator. The type is a vector of dimension $|\mathcal{Y}|$.
Results concerning Markov order estimation are also especially relevant to our concerns. In the Markov order estimation problem, the observations $Y_{1:n}$ are assumed to be distributed as a Markov chain of (unknown) order $r$; that is, for all $i > r$,
$$P\left( Y_{i+1} \mid Y_{1:i} \right) = P\left( Y_{i+1} \mid Y_{i-r+1:i} \right).$$
The Markov order estimation problem consists in determining the value of $r$ from $Y_{1:n}$. In [21], Finesso et al. have identified the optimal underestimation exponents and proved that a
sequence of code-based order estimators achieves those optimal
exponents. Just as in the i.i.d. case, the underestimation exponents could be interpreted as information-divergence rate distances. Finesso et al. could also establish that for any consistent
Markov order estimator, for most Markov sources, the overestimation exponent is null.
Despite their apparent similarity, the Markov order and
the HMM order estimation problems differ significantly. The
Markov order estimation problem is actually very close to the
i.i.d. case. For a given order , the ML estimate of the pafor Markov chains depends on a finite-dimensional
rameter
sufficient statistics: the empirical distribution of transitions of
order . In the HMM setting, no such sufficient statistics for
MLs are known (see, for example, [34, Sec. 4]).
When analyzing the behavior of code-based order estimators
for HMM, we face a difficulty that plagues many traditional statistical problems (see, for example, [13], [42], [43] and references therein). We aim to analyze the behavior of the minimizer
of some empirical criterion (M-estimator) though we have neither a closed-form expression for this minimizer nor any efficient algorithm for computing it. This situation usually prompts
the use of empirical process methods (see again [13], [42], [43],
and references therein). As error exponents are usually analyzed
using large-deviations techniques [12], [46], the determination
of the optimal underestimation exponents for HMM order estimation will rely heavily on large-deviations theorems for likelihood processes.
E. Organization of the Paper
The paper is organized as follows. In the next section, we formally define the aforementioned order estimators and introduce some important notions, including the extended Markov chain device associated with HMMs and Krichevsky–Trofimov mixtures. In Section III, we first derive the consistency of the penalized code-based order estimators (Theorem 2). Then we turn to the main topic of the paper: error exponents. An appropriate version of Stein's lemma (Theorem 3) provides an upper bound on the possible underestimation exponents. Then, the optimality of the penalized code-based order estimators with respect to the underestimation exponent is assessed (Theorem 4). The section is concluded by another version of Stein's lemma concerning overestimation exponents (Theorem 5). This last theorem asserts that, for each consistent order estimator and each order $k$, the set of parameters of HMMs of true order $k$ which have a positive overestimation exponent has Lebesgue measure $0$ in $\Theta_k$. In Section IV, we provide the detailed proof of the optimality of the penalized code-based order estimators with respect to the underestimation exponent.
II. DEFINITIONS AND BACKGROUND
A. General Conventions
Each HMM of order $k$ is defined by a transition kernel $a$, where $a(x' \mid x)$ denotes the probability of transiting from state $x$ to state $x'$, and an emission distribution $b$, where $b(y \mid x)$ denotes the probability of observing $y$ when in state $x$. The set $\Theta_k^{\mathrm{e}}$ denotes the set of HMMs of order $k$ having an ergodic transition kernel $a$. The associated stationary probability on the hidden state space is denoted by $\mu_\theta$, or $\mu$ if there is no ambiguity. And for $\eta > 0$, $\Theta_k^{\eta}$ denotes the set of HMMs of order $k$ such that, for all $x$, $x'$, and $y$, $a(x' \mid x)$ and $b(y \mid x)$ are lower-bounded by $\eta$. (Notice that $\Theta_k^{\eta}$ is not empty only for $\eta$ small enough.)
An HMM $\theta$, together with an initial distribution $\nu$, defines a probability $P_{\theta,\nu}$ on $(\mathcal{X} \times \mathcal{Y})^{\mathbb{N}}$. In the whole text, we assume that $\theta$ defines an ergodic Markov chain on $\mathcal{X}$ (and also on $\mathcal{X} \times \mathcal{Y}$) with unique stationary law $\mu_\theta$ on $\mathcal{X}$ (and also a unique stationary law on $\mathcal{X} \times \mathcal{Y}$).
In the sequel, the log likelihood $\log p_{\theta,\nu}(Y_{1:n})$ of a sequence of observations $Y_{1:n}$ under $P_{\theta,\nu}$ will play a distinguished role. It can be immediately written as
$$\log p_{\theta,\nu}(Y_{1:n}) = \sum_{i=1}^{n} \log p_{\theta,\nu}\left( Y_i \mid Y_{1:i-1} \right). \qquad (2)$$
In the paper, we will also consider vectors of log-likelihoods
$$\left( \log p_{\theta_1,\nu}(Y_{1:n}), \ldots, \log p_{\theta_d,\nu}(Y_{1:n}) \right) \qquad (3)$$
where $\theta_1, \ldots, \theta_d$ are elements of $\Theta_k$.
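The conditional decomposition (2) is exactly what the forward recursion computes, one term $\log p_{\theta,\nu}(Y_i \mid Y_{1:i-1})$ at a time. A minimal sketch (plain lists, states and symbols coded as integers; the function name is our own, and the per-step normalization is what keeps the computation numerically stable):

```python
import math

def hmm_log_likelihood(a, b, nu, obs):
    """Log likelihood log p_{theta,nu}(y_{1:n}) of an HMM via the forward
    recursion, accumulating the conditional terms of (2).

    a[x][x2]: transition probability from state x to state x2
    b[x][y]:  emission probability of symbol y in state x
    nu:       initial distribution on hidden states
    obs:      observed symbols (integers)
    """
    k = len(nu)
    filt = list(nu)          # filt[x] = P(X_i = x | y_{1:i-1})
    loglik = 0.0
    for y in obs:
        # p(y_i | y_{1:i-1}) = sum_x P(X_i = x | past) b(y | x)
        p_y = sum(filt[x] * b[x][y] for x in range(k))
        loglik += math.log(p_y)
        # condition on y_i, then propagate one step through the transition kernel
        post = [filt[x] * b[x][y] / p_y for x in range(k)]
        filt = [sum(post[x] * a[x][x2] for x in range(k)) for x2 in range(k)]
    return loglik
```

On a one-state HMM, every conditional term reduces to the emission probability, so the result is $\sum_i \log b(y_i)$, a quick sanity check.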
Remark 1: When $\nu$ is omitted in the denotation of the probability $P_{\theta}$ or in the log likelihood for a $\theta \in \Theta_k$, the initial distribution is assumed to be the (unique) stationary distribution $\mu_\theta$ of $\theta$. That is,
$$P_{\theta} = P_{\theta,\mu_\theta} \qquad (4)$$
and similarly
$$\log p_{\theta}(Y_{1:n}) = \log p_{\theta,\mu_\theta}(Y_{1:n}). \qquad (5)$$
B. The Likelihood Process
Let $\theta = (a, b)$ and $\theta' = (a', b')$ belong to $\Theta_k^{\eta}$, and let $d(\theta, \theta')$ be defined by
$$d(\theta, \theta') = \max\left( \max_{x, x'} \left| a(x' \mid x) - a'(x' \mid x) \right|,\ \max_{x, y} \left| b(y \mid x) - b'(y \mid x) \right| \right). \qquad (6)$$
Then, according to [33, Lemma 2.6], for all $n$ and all $y_{1:n} \in \mathcal{Y}^n$,
$$\frac{1}{n} \left| \log p_{\theta}(y_{1:n}) - \log p_{\theta'}(y_{1:n}) \right| \le C_\eta\, d(\theta, \theta') \qquad (7)$$
for a constant $C_\eta$ depending only on $\eta$ and $k$.
An immediate consequence of this equicontinuity property and of the Arzelà–Ascoli theorem [12, Theorem C.8] is the following proposition.
Proposition 1: Let $\nu$ denote some initial distribution on $\mathcal{X}$. For any $n$, the set of real functions from $\Theta_k^{\eta}$ to $\mathbb{R}$ indexed by $y_{1:n} \in \mathcal{Y}^n$ and defined by $\theta \mapsto \frac{1}{n} \log p_{\theta,\nu}(y_{1:n})$ is totally bounded with respect to the topology of uniform convergence (that is, the sup-norm topology).
This implies that the sequence of log-likelihood processes $\left( \frac{1}{n} \log p_{\cdot,\nu}(Y_{1:n}) \right)_n$ indexed by $\Theta_k^{\eta}$ is uniformly tight (that is, for each $\epsilon > 0$, there exists a compact set that has probability larger than $1 - \epsilon$ with respect to all the distributions of the processes; see [17] for more information about this notion) and exponentially tight (that is, for each $M > 0$, there exists a compact set that has probability larger than $1 - e^{-nM}$ with respect to all the distributions of the processes; see [12]) with respect to the topology of uniform convergence. Thus, in order to establish either convergence in distribution or a large-deviations principle for the sequence of log-likelihood processes indexed by $\Theta_k^{\eta}$, it is sufficient to check convergence in distribution or large-deviations principles for finite-dimensional distributions.
C. The Extended Chain
As pointed out in the Introduction, we will need to establish limit theorems (large-deviations principles) for log-likelihood processes. The large-deviations literature reports many large-deviations principles for functionals of the empirical measure of Markov chains. In order to derive large-deviations principles for log-likelihood vectors and processes, it will be convenient to represent $\log p_{\theta,\nu}(Y_{1:n})$ (or a vector of such log likelihoods) as an additive functional of an extended Markov chain. This approach has proved successful when establishing the asymptotic normality of ML estimators [31], [14]. The prediction filter $(q_i)_{i \ge 1}$ parameterized by $\theta$ is defined as the conditional probability under $P_{\theta,\nu}$ of the hidden state $X_i$ given the initial distribution $\nu$ and the sequence of past observations $Y_{1:i-1}$.
The extended chain $(X_i, Y_i, q_i)_{i \ge 1}$ has state space $\mathcal{X} \times \mathcal{Y} \times \mathcal{P}(\mathcal{X})$. Note that the law of the extended chain depends on both $\theta$ and $\nu$. Namely, $(X_i, Y_i)$ evolves according to $\theta$, but $q_{i+1}$ is a deterministic function of $q_i$ and $Y_i$ which is parameterized by $\theta$: $q_1 = \nu$, and for all $i \ge 1$ and $x \in \mathcal{X}$,
$$q_{i+1}(x) = \frac{\sum_{x'} q_i(x')\, b(Y_i \mid x')\, a(x \mid x')}{\sum_{x'} q_i(x')\, b(Y_i \mid x')}; \qquad (8)$$
this is summarized by the notation $q_{i+1} = \Phi_{\theta}(q_i, Y_i)$. The transition kernel of the extended Markov chain defined by $\theta$ and $\nu$ thus maps $(x, y, q)$ to $(x', y', \Phi_{\theta}(q, y))$ with probability $a(x' \mid x)\, b(y' \mid x')$.
And as announced, the log likelihood with respect to $\theta$ and initial distribution $\nu$ can now be written as an additive functional of the extended Markov chain [31]:
$$\log p_{\theta,\nu}(Y_{1:n}) = \sum_{i=1}^{n} \log \left\langle q_i, b(Y_i \mid \cdot) \right\rangle \qquad (9)$$
where $\langle \cdot, \cdot \rangle$ stands for the scalar product in $\mathbb{R}^{k}$. Here, $k$ is the number of hidden states of $\theta$.
If $\theta$ is irreducible, the extended Markov chain can be shown to have a unique stationary distribution and to be geometrically mixing [31].
Note that it is possible to consider tuples of parameters $(\theta_1, \ldots, \theta_d)$ instead of a single parameter $\theta$. The multiply extended chain is defined in a similar fashion, letting $q_i^{j}$ denote the $i$th value of the prediction filter associated with parameter $\theta_j$. Overloading notations, we let $P_{(\theta_1,\ldots,\theta_d),\nu}$ denote the Markov chain on $\mathcal{X} \times \mathcal{Y} \times \mathcal{P}(\mathcal{X})^{d}$ induced by $(\theta_1, \ldots, \theta_d)$ and $\nu$.
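The recursion (8) and the additive representation (9) can be checked numerically on tiny examples: the sum of the increments $\log \langle q_i, b(Y_i \mid \cdot) \rangle$ must coincide with the log of the brute-force sum over all hidden paths. A small sketch (our own function names, states and symbols coded as integers):

```python
import math
import itertools

def filter_update(q, y, a, b):
    """One step of the prediction filter recursion (8).  Returns the updated
    filter and the increment <q_i, b(y | .)> appearing in (9)."""
    k = len(q)
    w = [q[xp] * b[xp][y] for xp in range(k)]   # joint weight of (X_i = x', Y_i = y)
    z = sum(w)                                  # = <q_i, b(y | .)>
    q_next = [sum(w[xp] * a[xp][x] for xp in range(k)) / z for x in range(k)]
    return q_next, z

def loglik_additive(a, b, nu, obs):
    """Log likelihood written as the additive functional (9) of the extended chain."""
    q, total = list(nu), 0.0
    for y in obs:
        q, z = filter_update(q, y, a, b)
        total += math.log(z)
    return total

def loglik_bruteforce(a, b, nu, obs):
    """Exhaustive sum over all hidden paths; only feasible for tiny n."""
    k, n = len(nu), len(obs)
    p = 0.0
    for path in itertools.product(range(k), repeat=n):
        w = nu[path[0]] * b[path[0]][obs[0]]
        for i in range(1, n):
            w *= a[path[i - 1]][path[i]] * b[path[i]][obs[i]]
        p += w
    return math.log(p)
```

Agreement of the two computations on small examples is a direct check of (9); the filter version costs $O(nk^2)$ while the brute force costs $O(k^n)$.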
D. Information Rates
Information divergence rate distances show up in the characterization of error exponents. Their definition is based on the notion of relative entropy, or Kullback–Leibler divergence, between two probability distributions $P$ and $Q$, defined as
$$D(P \| Q) = \mathbb{E}_P\left[ \log \frac{\mathrm{d}P}{\mathrm{d}Q} \right]$$
if $P$ is absolutely continuous with respect to $Q$, and infinite otherwise. If now $P$ and $Q$ denote two possible distributions for a process $Y$, let $P^n$ (resp., $Q^n$) denote the distribution of $Y_{1:n}$ under $P$ (resp., $Q$). Then the asymptotic relative entropy rate, or divergence rate, between those two processes is denoted by $\bar{D}(P \| Q)$ and defined as
$$\bar{D}(P \| Q) = \lim_n \frac{1}{n} D\left( P^n \,\|\, Q^n \right) \qquad (10)$$
if the latter limit exists.
The Shannon–Breiman–McMillan theorem [20], [32], [7] warrants that when $P$ and $Q$ define irreducible HMMs, the information divergence rate distance is well defined.
Theorem 1: Let $Y$ denote a discrete-time process on the finite alphabet $\mathcal{Y}$. Let us assume that under law $P$, $Y$ is stationary ergodic, and that under law $Q$, $Y$ is a stationary ergodic HMM. Moreover, let us assume that for all $n$, the $n$th marginal of $P$ is absolutely continuous with respect to the $n$th marginal of $Q$. Then $\bar{D}(P \| Q)$ is well defined and, $P$-almost surely,
$$\frac{1}{n} \log \frac{p^n(Y_{1:n})}{q^n(Y_{1:n})} \longrightarrow \bar{D}(P \| Q).$$
We collect here further results that will allow us to deal with possibly nonirreducible HMMs and that will prove crucial in the analysis of error exponents.
Let us first mention an easy consequence of the exponential forgetting and geometric ergodicity of the extended Markov chain [31], [14].
Lemma 1: Let $\theta$ and $\theta'$ denote two irreducible HMMs. Then, for all initial distributions $\nu$ and $\nu'$ on the hidden state spaces of $\theta$ and $\theta'$, $P_{\theta}$-almost surely
$$\frac{1}{n} \log \frac{p_{\theta,\nu}(Y_{1:n})}{p_{\theta',\nu'}(Y_{1:n})} \longrightarrow \bar{D}\left( P_{\theta} \,\|\, P_{\theta'} \right).$$
In the sequel, we will need bounds on $\bar{D}(P \| P_{\theta'})$, where $P$ is a stationary but possibly nonergodic HMM, while $\theta'$ defines an irreducible HMM. Those bounds can be derived thanks to the following folklore lemma. For the sake of self-reference, a proof of the lemma is included in Appendix II.
Lemma 2: If the distribution $P$ defines a stationary Markov chain, there exists a unique finite decomposition of $P$ into stationary ergodic Markov chains $P_1, \ldots, P_J$, with
$$P = \sum_{j=1}^{J} \alpha_j P_j, \qquad \alpha_j > 0, \qquad \sum_{j=1}^{J} \alpha_j = 1,$$
and the support sets of the stationary ergodic components are disjoint.
If we consider now a stationary process formed by the observations of a stationary nonergodic HMM, by the preceding remark, the distribution of the observations can be represented as a mixture of finitely many stationary ergodic HMMs. If two distinct stationary ergodic Markov chains in the decomposition actually generate the same distribution on $\mathcal{Y}^{\mathbb{N}}$, they are identified in the decomposition of the HMM.
Lemma 3: If $P$ is a stationary HMM with pairwise distinct stationary ergodic components $P_1, \ldots, P_J$ with weights $\alpha_1, \ldots, \alpha_J$, having disjoint supports on $\mathcal{Y}^{\mathbb{N}}$, and if $\theta'$ is an irreducible HMM and $\nu'$ an initial distribution on the hidden state space of $\theta'$, then
$$\bar{D}\left( P \,\|\, P_{\theta',\nu'} \right) = \sum_{j=1}^{J} \alpha_j\, \bar{D}\left( P_j \,\|\, P_{\theta',\nu'} \right).$$
Proof: Relative entropy may be rewritten using the chain rule for relative entropies and the decomposition of $P$; taking $n$ to infinity yields one inequality. On the other hand, the reverse inequality comes from the convexity of the relative entropy with respect to its arguments (see [6] for finite alphabets or [16] in general) and from the fact that, for every initial distribution on the hidden state space of $\theta'$, the distribution obtained after $n$ steps is another distribution on the hidden state space of $\theta'$. It is enough to take $n$ to infinity to complete the proof of the lemma.
Following Kieffer [28], the definition of $\bar{D}$ can be extended to nonirreducible HMMs. Let $P$ denote a stationary HMM, and let $\theta'$ denote a possibly nonirreducible HMM; define
$$\bar{D}\left( P \,\|\, P_{\theta'} \right) = \sup_{n} \inf_{\nu'} \frac{1}{n} D\left( P^n \,\|\, P^n_{\theta',\nu'} \right). \qquad (11)$$
We need to check that this definition is consistent with the previous definition of $\bar{D}$ when $\theta'$ is irreducible. When $\theta'$ is irreducible, this follows from the exponential forgetting property of the prediction filter associated with $\theta'$ (see [15]), which asserts that there exists a constant $\rho < 1$ such that, for all $n$ and all initial distributions, the influence of the initial distribution on the filter decays geometrically. An immediate consequence of the exponential forgetting property is that the choice of the initial distribution modifies $\frac{1}{n} D( P^n \| P^n_{\theta',\nu'} )$ by at most a vanishing term. From this, the consistency follows.
This representation of the information divergence rate as a supremum provides easy proofs of the following lemmas.
Lemma 4: Let $\theta$ denote an irreducible HMM and $\mu_\theta$ the stationary distribution on the hidden state space under $\theta$. Let $\theta'$ denote an irreducible HMM. Then
$$\bar{D}\left( P_{\theta} \,\|\, P_{\theta'} \right) = \sup_n \inf_{\nu'} \frac{1}{n} D\left( P^n_{\theta,\mu_\theta} \,\|\, P^n_{\theta',\nu'} \right).$$
Proof: By Fekete's subadditivity lemma (see, for instance, [12]), the proof boils down to checking that the sequence $n \mapsto \inf_{\nu'} D( P^n_{\theta,\mu_\theta} \| P^n_{\theta',\nu'} )$ is superadditive.
Lemma 5: Let $(\theta_p)_p$ and $(\theta'_p)_p$ denote two sequences of irreducible HMMs, converging toward $\theta$ and $\theta'$. Assume $\theta'$ is irreducible. Then
$$\liminf_p \bar{D}\left( P_{\theta_p} \,\|\, P_{\theta'_p} \right) \ge \bar{D}\left( P_{\theta} \,\|\, P_{\theta'} \right).$$
Proof: Let $n$ denote a fixed integer. First note that, for each $n$, $\inf_{\nu'} D( P^n_{\theta,\mu_\theta} \| P^n_{\theta',\nu'} )$ is attained for some $\nu'$. Indeed, $\nu'$ belongs to a compact set, and $D$ is convex with respect to its arguments. Consider a converging subsequence of minimizers with limit $\nu'$. As the Kullback–Leibler information is lower-semicontinuous with respect to its arguments, we get the desired inequality over that subsequence, for each fixed $n$. When $\theta'$ is not irreducible, applying the same line of reasoning to each ergodic component leads to the desired result. From this, the lemma follows.
The following lemma is a consequence of [28, Propositions 1 and 2].
Lemma 6: Assume that $P$ defines a stationary HMM with true order $k^\ast$. Then, for any $k$ smaller than $k^\ast$,
$$\inf_{\theta \in \Theta_k} \bar{D}\left( P_{\theta} \,\|\, P \right) > 0$$
where the infimum is taken over transition and emission kernels.
The following lemma is a consequence of Lemmas 3–5; it will be helpful in the determination of error exponents.
Lemma 7: Let $P$ denote a stationary ergodic HMM of order $k^\ast$. Then, for any $k$ smaller than $k^\ast$,
$$\lim_{\eta \to 0} \inf_{\theta \in \Theta_k^{\eta}} \bar{D}\left( P_{\theta} \,\|\, P \right) = \inf_{\theta \in \Theta_k} \bar{D}\left( P_{\theta} \,\|\, P \right). \qquad (12)$$
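Lemma 1 has a direct numerical counterpart: simulating a long trajectory under one HMM and evaluating the normalized log-likelihood ratio against another gives a consistent estimate of the divergence rate. The sketch below uses hypothetical parameter values and makes no claim about rates of convergence; the likelihood is computed by the same forward recursion as in (2).

```python
import math
import random

def simulate_hmm(a, b, nu, n, rng):
    """Sample n observations from the HMM (a, b) started from nu."""
    k = len(nu)
    x = rng.choices(range(k), weights=nu)[0]
    obs = []
    for _ in range(n):
        obs.append(rng.choices(range(len(b[x])), weights=b[x])[0])
        x = rng.choices(range(k), weights=a[x])[0]
    return obs

def loglik(a, b, nu, obs):
    """Forward-recursion log likelihood, as in (2)."""
    k = len(nu)
    filt, total = list(nu), 0.0
    for y in obs:
        p_y = sum(filt[x] * b[x][y] for x in range(k))
        total += math.log(p_y)
        post = [filt[x] * b[x][y] / p_y for x in range(k)]
        filt = [sum(post[x] * a[x][x2] for x in range(k)) for x2 in range(k)]
    return total

rng = random.Random(0)
n = 20000
a1, b1 = [[0.8, 0.2], [0.3, 0.7]], [[0.9, 0.1], [0.2, 0.8]]   # true source
a2, b2 = [[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.4, 0.6]]   # competing model
nu = [0.5, 0.5]
obs = simulate_hmm(a1, b1, nu, n, rng)
# normalized log-likelihood ratio: converges to the divergence rate by Lemma 1
d_hat = (loglik(a1, b1, nu, obs) - loglik(a2, b2, nu, obs)) / n
```

By Lemma 1, `d_hat` converges almost surely to $\bar{D}(P_{\theta_1} \| P_{\theta_2})$, which is strictly positive here since the two sources differ.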
E. Code-Based Order Estimators
Two universal coding probabilities have played and still play a distinguished role in HMM order estimation: the NML distribution and the Krichevsky–Trofimov mixture (see, for example, [5], [2] for general overviews of those two coding distributions). The NML probability of string $y_{1:n}$ with respect to model $\Theta_k$ is defined by
$$\mathrm{NML}_n^k\left( y_{1:n} \right) = \frac{ \sup_{\theta \in \Theta_k} p_{\theta}\left( y_{1:n} \right) }{ \sum_{z_{1:n} \in \mathcal{Y}^n} \sup_{\theta \in \Theta_k} p_{\theta}\left( z_{1:n} \right) }. \qquad (13)$$
The stochastic complexity of $y_{1:n}$ with respect to $\Theta_k$ is defined as $-\log \mathrm{NML}_n^k(y_{1:n})$ [2].
Given the known hardness of computing ML estimates in HMMs, Liu and Narayan [33] have explored the potential of other universal coding techniques. Krichevsky–Trofimov mixtures form the natural contenders to NML codes. Those mixtures rely on Dirichlet priors on $\Theta_k$. Assume for a moment that an HMM is defined by the $k \times k$ transition matrix $a$ and the $k \times m$ emission matrix $b$; the Dirichlet density on $\Theta_k$ is defined as
$$w(a, b) \propto \prod_{x, x' \in \mathcal{X}} a(x' \mid x)^{-1/2} \prod_{x \in \mathcal{X}} \prod_{y \in \mathcal{Y}} b(y \mid x)^{-1/2}$$
for stochastic matrices $a$ and $b$, and $0$ elsewhere. The normalizing constant involves the $\Gamma$ function, that is, the function that interpolates the factorial numbers. The Dirichlet prior is actually a product of $2k$ Dirichlet $\frac{1}{2}$-densities. The Dirichlet prior defines the Krichevsky–Trofimov mixture distribution $\mathrm{KT}_n^k$ on $\mathcal{Y}^n$:
$$\mathrm{KT}_n^k\left( A \right) = \int_{\Theta_k} P_{\theta,\nu_0}\left( A \right)\, w(a, b)\, \mathrm{d}a\, \mathrm{d}b$$
for measurable sets $A$, where $\nu_0$ denotes the uniform probability over $\mathcal{X}$.
The NML and Krichevsky–Trofimov mixture distributions enjoy remarkable properties (see [2]).
Lemma 8: For all $n$, for all $y_{1:n} \in \mathcal{Y}^n$,
$$\log \frac{ \sup_{\theta \in \Theta_k} p_{\theta}\left( y_{1:n} \right) }{ \mathrm{NML}_n^k\left( y_{1:n} \right) } \le \gamma(n, k) \quad \text{and} \quad \log \frac{ \sup_{\theta \in \Theta_k} p_{\theta}\left( y_{1:n} \right) }{ \mathrm{KT}_n^k\left( y_{1:n} \right) } \le \gamma(n, k) \qquad (14)$$
where, for $n$ larger than some universal constant, $\gamma(n, k)$ may be chosen as $\frac{k(k + m - 2)}{2} \log n$ up to an additive constant depending on $k$ and $m$. The proof of the first inequality goes back to Shtarkov [41]. The proof of the second inequality for HMMs goes back to [7]. Variants of the result have been used in [20], [33]. A justification of the choice for $\gamma(n, k)$ is given in the Appendix.
Remark 2: Notice that results based on upper bounds on the pointwise redundancy of universal coding are specific to finite alphabets. When the emission alphabet $\mathcal{Y}$ is infinite, the ML ratio is unbounded, and its growth rate is still an open question; see [22]. In [23], results on penalized likelihood are given to obtain a weakly consistent estimate of the order in the infinite case.
The penalized code-based order estimators are defined by plugging either $\mathrm{NML}_n^k$ or $\mathrm{KT}_n^k$ instead of $Q_n^k$ in (1).
Remark 3: Bayesian information criterion (BIC) estimators form a special instance of penalized NML order estimators. BIC estimators choose the order that minimizes
$$- \sup_{\theta \in \Theta_k} \log p_{\theta}\left( y_{1:n} \right) + \frac{k(k + m - 2)}{2} \log n.$$
The criterion slightly but significantly exceeds the NML code length. A BIC estimator may thus be regarded as a penalized NML estimator where the penalty depends only on $n$ and $k$ (see [2] and references therein for more discussion on the asymptotic behavior of the stochastic complexity).
Remark 4: Restricted versions of penalized NML estimators perform minimization in the range $k \le k_0$, where $k_0$ is as before an a priori upper bound on the true order.
Remark 5: The order estimator defined in [33] is not, strictly speaking, a penalized estimator.
III. MAIN RESULTS
A. Consistency
We will first establish that NML and mixture-based order estimators do not overestimate the order eventually almost surely, even if we do not assume any a priori upper bound on the true order $k^\ast$. Strong consistency will be a consequence of Theorem 2 and Theorem 4, which deals with the underestimation probability. This is stated in Corollary 1.
Theorem 2: If, for some positive $\delta$,
$$\mathrm{pen}(n, k) \ge (1 + \delta)\, \gamma(n, k) \qquad (15)$$
for a function $\mathrm{pen}(n, \cdot)$ increasing in $k$ for all $n$, then for any stationary ergodic HMM $P$ of true order $k^\ast$, both the penalized NML estimator and the penalized mixture estimator eventually $P$-almost surely do not overestimate the order of $P$.
The proof of the Theorem is given in the Appendix.
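The BIC rule of Remark 3 is easy to state in code once the maximized log likelihoods are available (for HMMs, computing $\sup_{\theta \in \Theta_k} \log p_{\theta}(y_{1:n})$ is the genuinely hard part; here it is supplied by the caller, and the values below are hypothetical):

```python
import math

def bic_order_estimate(n, m, max_loglik):
    """BIC-style order selection: minimize the negated maximized log
    likelihood plus half the parameter dimension times log n.

    n:          sample length
    m:          emission alphabet size
    max_loglik: dict mapping k to sup_{theta in Theta_k} log p_theta(y_{1:n})
    """
    def crit(k):
        # free parameters: k(k-1) transition + k(m-1) emission = k(k+m-2)
        dim = k * (k - 1) + k * (m - 1)
        return -max_loglik[k] + 0.5 * dim * math.log(n)
    return min(max_loglik, key=crit)

# hypothetical maximized log likelihoods for n = 1000 binary observations
k_hat = bic_order_estimate(1000, 2, {1: -700.0, 2: -650.0, 3: -649.0})
```

Here the order-3 model's one-nat improvement over order 2 does not justify six extra free parameters, so the rule returns 2.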
Remark 6: This theorem should be compared with earlier
consistency claims [28], [33]. In [28, Theorem 2], the consistency of a family of penalized ML order estimators for the HMM
GASSIAT AND BOUCHERON: OPTIMAL ERROR EXPONENTS IN HIDDEN MARKOV MODELS ORDER ESTIMATION
order estimation problem is established, but penalties have to
grow exponentially fast with respect to model order, that is,
971
that are based on code-length minimization (according to the
MDL principle proposed by Rissanen [38], [39]).
B. A Version of Stein’s Lemma
This means that Theorem 2 represents a step forward on the way
toward the identification of the smallest penalty that warrants
consistency.
Motivations for tuning penalties may be summarized in the
following way: the larger the penalties, the smaller the overestimation error; the smaller the penalties, the smaller the underestimation error. Guidelines for the choice of penalties are provided
by the following observation: in order to avoid overestimation,
should be larger than the typical (average) value of
. Determining the smallest
penalties that would warrant strong consistency would be equivalent to getting a thorough understanding of the behavior of the
. Such a presequence
cise description of the likelihood processes has still to be obtained as far as HMMs are concerned.
In a model selection setting (see Bartlett, Boucheron, and Lugosi [3] and references therein), using small penalties tends to
limit the approximation bias while large penalties tend to limit
the estimation bias. As the latter is often stochastically smaller
than the former, small penalties provide an interesting tradeoff.
In a data-compression setting, a strongly consistent order estimator for HMM, when coupled with an arithmetic coder fed by a
universal code for the estimated order defines a universal coder
that achieves asymptotically minimax redundancy per symbol
in the class of all HMM (that is, the redundancy achieved by
such a universal coder is of the same order of magnitude as if
the encoder had been told in advance about the true order of the
source). This is especially relevant in the case of HMM, since, in
that setting we are not aware of the existence of an aggregation
mechanism comparable to the context-tree-weighting algorithm
[44] that allows to implement efficiently a double-mixture coder
for Markov sources of arbitrary order.
The next theorem is an extension of Stein’s lemma to the
HMM order estimation problem. It is motivated in the same
way as [21, Theorem 2], where a similar statement is proved for
the Markov order estimation problem. Theorem 3 aims at determining the best underestimation exponent for a class of order
estimators that ultimately overestimate the order with a probability bounded away from .
Theorem 3: Let  denote a sequence of order estimators such that, for some , for all , for all ,

. Then, for all

Proof: Fix . Let , for all . For  with , define

According to [33, Lemma 2.3] (Csiszár, Shields, Finesso, and Leroux), for

(16)

If  with
Remark 7: In [10], Csiszár and Shields answer a question
raised in [28] and prove the consistency of the BIC criterion
for the Markov order estimation problem and the nonconsistency of the MDL order estimators based on NML and
Krichevsky–Trofimov (KT) codes for this problem. This seems
to identify the minimal penalty that can be used in the Markov
order estimation problem. The analogous questions for HMMs
remain open.
The Markov order estimation problem also provides an opportunity to discuss the following important issue: does the availability of an a priori upper bound  on the true order  influence the range of possible penalties that enable consistent order estimation?
According to a recent result [9], in the Markov order estimation problem, the availability of an a priori upper bound on the true order makes it possible to decrease penalties dramatically while preserving consistent order estimation: an a priori upper bound allows one to prove the consistency of order estimators
where  boils down to equality if  and  have the same support set in ,  follows from ,  follows from ,  is just rewriting, and  follows from the union bound, the assumptions on , the definition of , and (16).

Now optimizing with respect to  and , using (12) and taking  to zero, the theorem follows. (We only get a logarithmic equivalent, since we have no uniform upper bound on .)
C. The ML and Mixture Estimators Achieve Optimal
Underestimation Exponent
We finally get that
Theorem 3 provides a strong incentive to investigate the error
exponent of the order estimators that have been proposed in the
literature, that is, to settle the question raised by Lemma 4.3
from [33].
The change of measure argument that proved effective in the proof of Theorem 3 can now be performed for each HMM , and this terminates the proof of Theorem 5.
Theorem 4: If  is nonnegative and, for each ,  as , then the penalized NML order estimator  and the penalized mixture order estimator  achieve the optimal underestimation exponent

Remark 8: A similar result concerning Markov order estimation is presented in [21, Proposition 1]. In both cases, in the parameter space, the set of processes of order  may be considered as a subset of the set of processes of order  with empty interior.

The proof of this theorem, which is rather technical, is postponed to the next section. Combining Theorem 4 and Lemmas 11, 12, 13, and 6, we get the following consistency result.
Corollary 1: If  for a function  increasing in , for all , and , then the penalized NML and mixture order estimators are almost-surely eventually consistent.
The last theorem shows that no consistent estimator may have
nontrivial overestimation exponent.
Theorem 5: Assume  is a consistent HMM order estimator. Then for all , if  is of true order ,
IV. A LARGE-DEVIATIONS PRINCIPLE (LDP) FOR THE LIKELIHOOD PROCESS IN HMMS
Since in the HMM context the ML does not have a simple expression, in this section we characterize the large deviations of the log-likelihood process indexed by , where . For the sake of completeness, let us recall here the definition of a large-deviations principle for probability measures on a topological space .
Definition 1: A rate function on  is a function  which is lower semicontinuous. It is said to be a good rate function if, in addition, its level sets  are compact. A sequence  of random elements in  is said to satisfy the large-deviations principle (LDP) with rate function  if the sequence  of the corresponding laws on  satisfies the following criteria.
1) Upper bound: For any measurable closed subset  of :

2) Lower bound: For any measurable open subset  of :
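In standard notation (see, e.g., [12]), writing $I$ for the rate function and $\mu_n$ for the corresponding laws (our choice of symbols), the two conditions read:

```latex
% Upper bound: for every measurable closed F \subseteq U,
\limsup_{n\to\infty} \frac{1}{n} \log \mu_n(F) \;\le\; - \inf_{x \in F} I(x).
% Lower bound: for every measurable open G \subseteq U,
\liminf_{n\to\infty} \frac{1}{n} \log \mu_n(G) \;\ge\; - \inf_{x \in G} I(x).
```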
Proof: In order to adapt the argument that led to the proof of Theorem 3, we first need to establish that if  is of true order  and belongs to  for some , then there exists a sequence  of HMMs of true order  such that .

First note that if , then an equivalent representation of the HMM can be found in  (see [33]).
Classical results on HMMs going back to [37] (see [20, Ch. 1] for the relevant material) assert that the subset of  formed by HMMs of true order  is dense in  with respect to the Euclidean topology in . Now viewing  as an element of , it follows that there exists a sequence  of elements of true order  that tends toward  with respect to the Euclidean topology in , and such that , where  is defined by (6).

Thus, using (7), for all

and taking expectation with respect to  and limit with respect to  of
Note that in this text, we will consider many different
spaces .
The following lemma is the cornerstone of our argument. Relying on classical theorems from large-deviations theory [12], [16], it asserts that the empirical measure  defined by the extended chain induced by a tuple of parameters  satisfies an LDP. Here, the topological space  is the set of probability measures on  provided with the topology of weak convergence. Recall that the set of probability distributions on a compact space is itself compact when provided with the topology of weak convergence. As  provided with the product of the discrete topologies and of the topology of weak convergence is compact, our space  is also compact.
Lemma 9: For any  of true order , for any vector of parameters , for any , for any , we have the following.

a) Under , the sequence of empirical measures associated with the extended chain satisfies an LDP with the good rate function

(17)

where . The LDP is satisfied uniformly with respect to initial conditions (initial state, initial values of the prediction filters).

b) Under , the sequence of likelihood vectors
in the definition of  is uniformly bounded away from ; hence, for all  and ,  is continuous with respect to . This is enough to prove the Feller continuity of the extended chain if .
Uniformity with respect to initial conditions in the large deviations upper bound holds because of the compactness of the
state space and of [16, Theorem 8.4.3, part b)].
Let us now prove part b) of Lemma 9. The distribution of  under  and the extended chain are identical. Moreover,  is a vector-valued additive functional of the extended chain. As the ’s are assumed to belong to  for some , the function that maps the empirical measure of the extended chain toward the likelihood vector  is continuous, and the sequence of likelihood vectors satisfies an LDP with the good rate function, for all ,

(18)
c) The infimum in the definition of  is attained by some . Moreover,  defines a hidden Markov chain on , that is, there exists a transition kernel over , a transition matrix on , and an emission matrix on  such that
Remark 9: Note that the notation  is overloaded. This is intentional, to stress the fact that the rate function for the likelihood vector is obtained by contraction from the rate function for the empirical measure.
Remark 10: In (18), the prediction filters are meant to be
distributed according to the stationary distribution of the extended chain defined by and .
In order to prove Lemma 9, we will need to check that the extended chain defined by  and  satisfies the Feller continuity condition (see [16, Condition 8.3.1]).
is continuous with respect to the empirical measure of the extended chain.
As the state space of the extended chain is compact, by the
contraction principle and by [16, Theorem 8.4.3], the sequence
satisfies the LDP with the good convex
rate function described in the statement of the lemma.
Let us now prove part c). Let  be such that . Let  denote a sequence of probability distributions on  such that

(19)

for all , and such that . Thanks to the compactness of the state space, and hence of the set of probability distributions on the state space, such a sequence has an accumulation point  that satisfies conditions (19) and . Then, by [16, Lemma 8.5.1, clause (c)], there exists some transition kernel  such that  is the invariant measure associated with , and
Criterion 1 [Feller Continuity]: The function mapping
toward
is continuous in the topology of weak convergence.
Proof: We first prove part a) of Lemma 9: the extended chain satisfies the Feller continuity condition. As  and  are finite sets, the only thing that has to be checked is the continuity with respect to the prediction filters. Fix  and  in . Let  denote a sequence of vectors of prediction filters converging toward . Let  denote a bounded continuous function from  to . Then

where  is defined component-wise by (8). As  for some , and as  are probabilities over , the denominator
(20)

It remains to prove that the transition kernel  actually defines an HMM on . Note that, -almost everywhere,  has to be finite. Hence, for all , -almost everywhere, we have  if  for some . Hence, not too surprisingly, under , the next value of the prediction filters is determined by the current value of the prediction filters and the observation. Therefore, we may write
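The one-step evolution just described (the next prediction filter is a deterministic function of the current filter and the current observation) can be sketched numerically; the matrices and names below are illustrative, not taken from the paper.

```python
import numpy as np

def filter_update(p, A, B, y):
    """One step of the prediction filter p_t(i) = P(X_t = i | Y_1..Y_{t-1}):
    Bayes update by the emission likelihood of y, then propagation by A."""
    posterior = p * B[:, y]          # P(X_t = i | Y_1..Y_t), unnormalized
    posterior = posterior / posterior.sum()
    return posterior @ A             # P(X_{t+1} = j | Y_1..Y_t)

# Illustrative 2-state, binary-emission HMM.
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.5, 0.5], [0.2, 0.8]])
p = np.array([0.5, 0.5])
p = filter_update(p, A, B, 1)
assert abs(p.sum() - 1.0) < 1e-12    # the filter stays a probability vector
```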
It remains to check that under , conditionally on , the distribution of  is independent of . Let us define an HMM by choosing

Let  denote the transition kernel on  defined by this HMM. This HMM has stationary distribution . Moreover,
where  follows from rewriting and from the fact that  is the stationary distribution of ;  follows from the definition of  and from the nonnegativity of the conditional entropy of  relative to ;  just consists in rearranging summations in the first summand; and  follows from the convexity of relative entropy with respect to its first argument, applied to the first summand, and from the definition of .

Hence, we may choose  to check the last part of the lemma.
Lemma 9 constitutes the basic step in the proof of the following functional LDP.
This is a consequence of the convexity of the relative entropy with respect to its arguments.

Theorem 6: Let  be of true order , let  denote a positive number, and let  denote the th normalized likelihood processes indexed by . Then we have the following.
a) Under , the sequence  satisfies a functional LDP with the good rate function

where  is defined by (18).
b) If  is such that , then there exists an HMM with transition kernel , emission distribution , and stationary probability  for the extended chain, such that for all

where the distribution of the prediction filter  associated with  is the stationary distribution under the extended chain defined by  and . Moreover, for some  in  and a stationary distribution  for the extended chain,

(21)
Notice that  is well defined even if  is not ergodic, thanks to Lemma 3.
The proof of Theorem 6 is obtained from Lemma 9 in two steps, through Lemma 10. Inequality (21) comes from the decrease of the entropy rate when taking marginal distributions of processes.
Lemma 10: For any  of true order , for any , for any , the sequence of likelihood processes indexed by  satisfies the LDP with the good convex rate function

where the space of likelihood processes is endowed with the topology of pointwise convergence.
Moreover, if  is such that , then there exists an HMM with transition kernel , emission distribution , and stationary probability , such that for all

where  is the prediction filter associated with  and
(23)
Assertion (23) holds because of the lower semicontinuity of the
relative entropy.
Finally, note that, thanks to the equicontinuity properties (see (7)) of the sample paths of the likelihood processes indexed by , any sample path satisfying  is continuous [12, Lemma 4.1.5]. The assertions are extended from  to  by continuity.
and
Proof: The space of likelihood processes provided with
the topology of pointwise convergence is the projective limit of
the set of likelihood vectors. The first part of the lemma follows
immediately from Lemma 9 and the Dawson–Gärtner theorem
[12].
Let us now establish the second part of the lemma. Let  denote a real function on  such that . Consider a dense countable subset  of . By Lemma 9, for each  there exists an HMM  on  with stationary probability  such that:

i) for each
We need to prove Theorem 6 since the supremum is not a
continuous function of the sample paths under the topology of
pointwise convergence.
Proof of Theorem 6: In order to prove part a) of the theorem, it is sufficient to observe that, under the topology of uniform convergence, the sequence of distributions of likelihood processes is exponentially tight. By Proposition 1, those distributions actually have compact support. Then part a) of the theorem follows immediately from Lemma 10 by the inverse contraction principle (see [12]).
The proof of part b) of Theorem 6 follows from part b) of
Lemma 10.
The next corollary now follows from Theorem 6 by the contraction principle [12] since the supremum is a continuous functional with respect to the uniform topology.
where the distribution of  is the stationary distribution of the prediction filter for the extended chain defined by  and ;

ii)

Corollary 2: Under the assumptions of Theorem 6, the sequence of normalized ML ratios  satisfies an LDP with the good convex rate function
Now note that  and  define a probability  on the (infinite) product space . The first marginal on  and the second marginal are identical and defined by ; the distribution of the second component conditionally on the first one is defined by  and the Baum equations associated with the filters. Conversely, this probability completely defines  and .
By Tychonov’s theorem, the infinite product space is compact under the product topology. The set of probabilities on that space is also compact. Hence, one may extract a converging subsequence from the sequence of the joint laws . Let  denote the limiting distribution over . Then the two marginals of  have the same distribution, which is an accumulation point of the sequence . Under , for each ,  is determined by  and  through the Baum equation. Moreover,  defines an HMM on ; to check this, it is enough to check that, under ,  is independent from  conditionally on .
And for all
(22)
Moreover, if , the infimum in the above definition is achieved and there exists some , with a stationary distribution  for the extended chain, such that for all  and
We are now equipped to prove Theorem 4.
Proof of Theorem 4: First of all, recall that the LDPs we use are satisfied uniformly with respect to initial conditions, and the limit values in the law of large numbers for the likelihoods do
not depend on the initial distributions, so that we will assume,
without loss of generality, that initial distributions on hidden
states are uniform.
Let  denote either  or . Let us first fix , and let  denote either  or . One has
Moreover, there exists some sample path  of the likelihood processes indexed by  such that

and there exists some (possibly nonergodic)  (which might be of true order smaller than ) with a stationary distribution  for the extended chain such that

and for all

(26)
Let now  be small enough so that

(27)

while  and

Let  denote a (small) positive real. Let us consider  large enough so that
Now, if we could assert that , then we would be in a position to assert

(28)

Unfortunately, we are not able to take such a direct path toward the result, since we are not able to prove that  does belong to . Nevertheless, we may assume that  is irreducible, since otherwise, by Lemma 3, we would be able to assert that there exists some  such that
We will prove that  is close to  in the Kullback–Leibler sense.

Claim 1: There exists  such that

(24)
The inequality follows from
(29)
Let
be such that
(30)
and
Recall that
(see [33, inequality (3.1)]) and since
.
We may take the limsup of the two expressions as  tends toward infinity. Then, by Corollary 2,

In order to achieve our goals, we will relate the limiting value of  and the supremum of  over ;  is nothing but .
Following a line of reasoning described in [33, proof of Inequality (3.1)], we may define a modification  of  that belongs to , by suitably using the maximum entry in each row of the stochastic matrices  and  to compensate for those entries less than . For all , we have

(25)
Now recall that for any  in , by the exponential forgetting property of the prediction filter (see Lemma 1), as  is assumed to be irreducible and hence stationary ergodic,

(31)
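The exponential forgetting property invoked here (see also [31]) can be illustrated numerically: two prediction filters started from very different initial conditions and driven by the same observations merge geometrically fast. The matrices and the tolerance below are illustrative assumptions; the contraction is guaranteed here because the transition matrix has strictly positive entries.

```python
import numpy as np

def filter_step(p, A, B, y):
    """Bayes update by the emission likelihood of y, then propagation by A."""
    q = p * B[:, y]
    return (q / q.sum()) @ A

A = np.array([[0.6, 0.4], [0.3, 0.7]])   # strictly positive: contracting
B = np.array([[0.5, 0.5], [0.2, 0.8]])
p, q = np.array([0.99, 0.01]), np.array([0.01, 0.99])
obs = [0, 1] * 50                         # any observation sequence works here
for y in obs:
    p, q = filter_step(p, A, B, y), filter_step(q, A, B, y)
# After 100 steps the gap between the two filters is negligible.
assert np.abs(p - q).max() < 1e-10
```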
Hence
(32)
Now, by (30)
deviations argument allows one to mimic that line of reasoning:  and  together provide an ersatz of ML estimates, and the proof reduces to checking that the ersatz belongs (asymptotically) to the small set .
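The i.i.d. comparison made in Remark 11 can be made concrete for a binary alphabet: the error exponent inf over Q in the hypothesis class of D(Q || P) is found by direct search over the simplex. The choice of the class and of the true source below is an illustrative assumption.

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy D(Ber(q) || Ber(p)) in nats."""
    def term(a, b):
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

# Hypothesis class Lambda = {Q : Q(1) >= 0.7}, true source P(1) = 0.4.
p = 0.4
grid = [0.7 + 0.001 * i for i in range(301)]          # Q(1) in [0.7, 1.0]
exponent = min(kl_bernoulli(q, p) for q in grid)

# By convexity of D(. || P), the infimum sits on the boundary Q(1) = 0.7.
assert abs(exponent - kl_bernoulli(0.7, p)) < 1e-12
```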
APPENDIX I
PROOF OF LEMMA 8
The proof follows the proof given in [33, Appendix], except that we use the following bounds on the  function, which may be found in [45, p. 253]. Those bounds extend the Robbins–Stirling approximation for the factorial. For all

(34)
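As a sanity check, a Robbins-type refinement of Stirling's formula for the Gamma function can be tested numerically; the exact displayed form of (34) is our assumption, namely that for x > 0, sqrt(2*pi) * x^(x-1/2) * e^(-x) <= Gamma(x) <= sqrt(2*pi) * x^(x-1/2) * e^(-x + 1/(12x)).

```python
import math

def stirling_bounds(x):
    """Robbins-type bounds sandwiching Gamma(x) for x > 0 (assumed form)."""
    base = math.sqrt(2.0 * math.pi) * x ** (x - 0.5) * math.exp(-x)
    return base, base * math.exp(1.0 / (12.0 * x))

# The sandwich holds across small and moderate arguments.
for x in [0.5, 1.0, 2.0, 5.0, 10.0, 50.0]:
    lo, hi = stirling_bounds(x)
    assert lo <= math.gamma(x) <= hi
```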
and by (26) and (31)
we get from (32) and (27)
Now choose , take  toward , and extract a subsequence such that  converges to some  in . Assume first that  is irreducible. One has, on the subsequence,
and by Lemma 5
thus,
By Lemma 6, this entails . Hence, either  is reducible or it belongs to .

Note that by Lemma 5, we also have
The first summand does not exceed ([33], citing Davisson et al. [11])
.
The last quantity can be upper-bounded using (34)
Hence,
(33)
Repeating this reasoning for all , and taking the union bound, terminates the proof of Theorem 4.
Remark 11: It is instructive to compare the proof of Theorem 4 and the proof of analogous statements in the context of multiple-hypothesis testing for i.i.d. samples (see, for example, [8, Theorem III.2]). There, the hypothesis to be tested is whether some distribution  over some finite alphabet  belongs to some subset  of . There, the analog of an underestimation error consists in asserting that  belongs to  while . The associated error exponent has the form . The proof amounts to noticing that the type of the sample which actually realizes the ML has to be in . Here, resorting to large-
For each , the term in the second summand can be upper-bounded by

which can again be upper-bounded using (34)

Proposition 2: Let  denote either  or . Then for all , all , and all

(36)

Proof: Let  denote either  or . Recall that, by Lemma 8,

Then
APPENDIX II
PROOF OF LEMMA 2
A basic result in the theory of finite-state Markov chains asserts that the set of states may be partitioned into a set of transient states and irreducible classes of states, such that two states from different classes do not communicate, while any pair of states from the same class do communicate. The number of irreducible classes corresponds to the multiplicity of the Perron–Frobenius eigenvalue of the Markov chain. For each irreducible class (say of index ), there exists a unique probability  on the states from that class that defines an eigenvector of the Markov chain with associated eigenvalue . Note that the restriction of the Markov chain to the th irreducible class defines an ergodic Markov chain on the th class of states. Moreover, each invariant probability distribution  on the whole state space can be represented in a unique way as a convex combination of the ’s.
Thus, a stationary distribution  on  is completely defined by ; it is represented in a unique way as a finite combination of  into stationary ergodic Markov chains, with  and
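The partition into transient states and communicating classes described above can be computed directly from the support of the transition matrix; a minimal sketch, with an illustrative chain (state 0 transient, states 1 and 2 forming one irreducible class):

```python
import numpy as np

def communicating_classes(A):
    """Partition the states of transition matrix A into communicating
    classes: i and j share a class iff each reaches the other."""
    k = A.shape[0]
    reach = ((A > 0) | np.eye(k, dtype=bool)).astype(int)
    for _ in range(k):                 # transitive closure by squaring
        reach = ((reach @ reach) > 0).astype(int)
    classes, seen = [], set()
    for i in range(k):
        if i not in seen:
            cls = sorted(j for j in range(k) if reach[i, j] and reach[j, i])
            seen.update(cls)
            classes.append(cls)
    return classes

A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.0, 0.6, 0.4]])
assert communicating_classes(A) == [[0], [1, 2]]
```

Each non-transient class carries a unique stationary distribution (a Perron–Frobenius eigenvector of the restricted chain), and every invariant distribution is a convex combination of those.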
where  comes from the definition of the code-based order estimator, and  comes from the application of Lemma 8 to  and from the fact that  for all .
APPENDIX III
PROOF OF THEOREM 2
The proof of Theorem 2 is immediate from Lemmas 11–13.
The proof of the first two lemmas parallels the second part of
the proof of the consistency of the BIC Markov order estimator
[10], or the proof of the conditional consistency of the mixture
estimator described in [33].
Lemma 11: If for each , the sequence of penalties satisfies

(35)

then, for any stationary ergodic HMM of true order , eventually -almost surely, we have both  and

Let us first establish a technical proposition.
is just rewriting, and  follows from the fact that  is a probability distribution.
In [10], the proof of the analog of Lemma 11 uses the intimate relationship between empirical types of order  and ML in model  to reduce the proof of consistency to a strong ratio-typicality theorem. Here, we will take advantage of the properties of the Krichevsky–Trofimov mixture to avoid resorting to type-theoretical arguments. Our proof actually revisits [33, proof of Lemma 4.2] and [20, Ch. 4]. The proof of [33, Lemma 4.2] uses a slightly weaker version of Lemma 8 than the version presented here.
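For reference, the Krichevsky–Trofimov mixture used here admits a simple sequential computation: after t symbols, a symbol whose current count is n_a receives conditional probability (n_a + 1/2)/(t + m/2) over an alphabet of size m [29]. A sketch (exact arithmetic via rationals is our implementation choice):

```python
from fractions import Fraction

def kt_probability(seq, alphabet_size=2):
    """Sequential Krichevsky-Trofimov mixture probability of seq."""
    counts = [0] * alphabet_size
    prob = Fraction(1)
    for t, a in enumerate(seq):
        # (n_a + 1/2) / (t + m/2), written with integer numerator/denominator
        prob *= Fraction(2 * counts[a] + 1, 2 * t + alphabet_size)
        counts[a] += 1
    return prob

# For a binary all-zeros string of length n, KT(0^n) = C(2n, n) / 4^n.
assert kt_probability([0, 0]) == Fraction(6, 16)
assert kt_probability([1]) == Fraction(1, 2)
```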
Proof [Lemma 11]: In order to establish the lemma, by the
first Borel–Cantelli lemma, it is enough to show that
or by the union bound
Let  denote again either  or . Using (36) from Proposition 2, we just need to check that
where the first inequality follows from Lemma 8 and the second
from (15) for big enough .
As
is upper-bounded, it is enough to have
(37)
which holds under the condition of the lemma.
Note that the condition in the lemma is satisfied if  does not depend on  and depends quadratically on .
Lemma 12: If the sequence of penalties satisfies (15), then
for any stationary ergodic HMM of true order , eventually
-almost surely, we have both
and
Proof of Lemma 12: We use again Proposition 2. It is
enough to prove that
(38)
Let  denote an upper bound on  or . Again, by the union bound, for , it is enough to show that

(39)
which follows from (15).
Lemma 13: If the sequence of penalties satisfies (15), then
for any stationary ergodic HMM of true order , eventually
-almost surely, we have
and
Proof: To obtain the desired result, note that if  denotes either  or , for any
REFERENCES
[1] L. E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov chains,” Ann. Math. Statist., vol. 37, pp. 1554–1563, 1966.
[2] A. Barron, J. Rissanen, and B. Yu., “The minimum description length
principle in coding and modeling,” IEEE Trans. Inform. Theory, vol.
44, pp. 2743–2760, Oct. 1998.
[3] P. Bartlett, S. Boucheron, and G. Lugosi, “Model selection and error
estimation,” Machine Learning, vol. 48, pp. 85–113, 2002.
[4] P. Bickel, Y. Ritov, and T. Ryden, “Asymptotic normality of the maximum likelihood estimator for general hidden Markov models,” Ann.
Statist., vol. 26, pp. 1614–1635, 1998.
[5] O. Catoni, “Statistical learning and stochastic optimization,” in Ecole d’été de Probabilités de Saint-Flour (Lecture Notes in Mathematics). Berlin, Germany: Springer-Verlag, 2001, to be published. [Online]. Available: http://felix.proba.jussieu.fr/users/catoni
[6] T. Cover and J. Thomas, Elements of Information Theory. New York:
Wiley, 1991.
[7] I. Csiszár, “Information theoretical methods in statistics,” lecture notes,
University of Maryland, College Park, 1990.
[8] ——, “The method of types,” IEEE Trans. Inform. Theory, vol. 44, pp. 2505–2523, Oct. 1998.
[9] ——, “Large scale typicality of Markov sample paths and consistency
of MDL order estimators,” IEEE Trans. Inform. Theory (Special Issue
on Shannon Theory: Perspective, Trends, and Applications), vol. 48, pp.
1616–1628, June 2002.
[10] I. Csiszár and P. Shields, “The consistency of the BIC Markov order
estimator,” Ann. Statist., vol. 28, pp. 1601–1619, 2000.
[11] L. A. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace, “Efficient universal noiseless source codes,” IEEE Trans. Inform. Theory,
vol. IT-27, pp. 269–279, May 1981.
[12] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications. New York: Springer-Verlag, 1999.
[13] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern
Recognition. Berlin, Germany: Springer-Verlag, 1996.
[14] R. Douc and C. Matias, “Asymptotics of the maximum likelihood estimator for general hidden Markov models,” Bernoulli, vol. 3, pp. 381–420, 2001.
[15] R. Douc, E. Moulines, and T. Ryden, “Asymptotic properties of the
maximum likelihood estimator in autoregressive models with Markov
regime for general hidden Markov models,” preprint, 2001.
[16] P. Dupuis and R. S. Ellis, A Weak Convergence Approach to the Theory
of Large Deviations. New York: Wiley, 1997.
[17] R. Dudley, Real Analysis and Probability. Belmont, CA: Wadsworth,
1989.
[18] Y. Ephraim and N. Merhav, “Hidden Markov processes,” IEEE Trans.
Inform. Theory (Special Issue on Shannon Theory: Perspective, Trends,
and Applications), vol. 48, pp. 1518–1569, June 2002.
[19] M. Feder and N. Merhav, “Universal composite hypothesis testing: A competitive minimax approach,” IEEE Trans. Inform. Theory (Special Issue on Shannon Theory: Perspective, Trends, and Applications), vol. 48, pp. 1504–1517, June 2002.
[20] L. Finesso, “Estimation of the order of a finite Markov chain,” Ph.D.
dissertation, Univ. Maryland, College Park, 1990.
[21] L. Finesso, C. Liu, and P. Narayan, “The optimal error exponent for
Markov order estimation,” IEEE Trans. Inform. Theory, vol. 42, pp.
1488–1497, Sept. 1996.
[22] E. Gassiat and C. Keribin, “The likelihood ratio test for the number of
components in a mixture with Markov regime,” ESAIM Probab. Statist.,
vol. 4, pp. 25–52, 2000.
[23] E. Gassiat, “Likelihood ratio inequalities with applications to various
mixtures,” Ann. Inst. Henri Poincaré, vol. 38, pp. 897–906, 2002.
[24] W. Hoeffding, “Asymptotically optimal tests for multinomial distributions,” Ann. Math. Statist., vol. 36, pp. 369–400, 1965.
[25] H. Ito, S. I. Amari, and K. Kobayashi, “Identifiability of hidden Markov
information sources and their minimum degrees of freedom,” IEEE
Trans. Inform. Theory, vol. 38, pp. 324–333, Mar. 1992.
[26] J. L. Jensen and N. V. Petersen, “Asymptotic normality of the maximum
likelihood estimator in state space models,” Ann. Statist., vol. 27, pp.
514–535, 1999.
[27] B. H. Juang and L. R. Rabiner, “Hidden Markov models for speech
recognition,” Technometrics, vol. 33, pp. 251–272, 1991.
[28] J. C. Kieffer, “Strongly consistent code-based identification and order
estimation for constrained finite state model classes,” IEEE Trans. Inform. Theory, vol. 39, pp. 893–902, May 1993.
[29] R. E. Krichevski and V. N. Trofimov, “The performance of universal
encoding,” IEEE Trans. Inform. Theory, vol. IT-27, pp. 199–207, Mar.
1981.
[30] S. Khudanpur and P. Narayan, “Order estimation for a special class of
hidden Markov sources and binary renewal processes,” IEEE Trans. Inform. Theory (Special Issue on Shannon Theory: Perspective, Trends,
and Applications), vol. 48, pp. 1704–1713, June 2002.
[31] F. Le Gland and L. Mevel, “Exponential forgetting and geometric ergodicity in hidden Markov models,” Math. Contr. Signals Syst., vol. 13, pp.
63–93, 2000.
[32] B. Leroux, “Maximum-likelihood estimation for hidden Markov
models,” Stoch. Processing and Appl., vol. 40, pp. 127–143, 1992.
[33] C. Liu and P. Narayan, “Order estimation and sequential universal data
compression of a hidden Markov source by the method of mixtures,”
IEEE Trans. Inform. Theory, vol. 40, pp. 1167–1180, July 1994.
[34] E. Levitan and N. Merhav, “A competitive Neyman–Pearson approach
to universal hypothesis testing with applications,” IEEE Trans. Inform.
Theory, vol. 48, pp. 2215–2229, Aug. 2002.
[35] I. L. MacDonald and W. Zucchini, Hidden Markov and Other Models
for Discrete-Valued Time Series (Monographs on Statistics and Applied
Probability), 70. London, U.K.: Chapman & Hall, 1997.
[36] N. Merhav, M. Gutman, and J. Ziv, “On the estimation of the order of
a Markov Chain and universal data compression,” IEEE Trans. Inform.
Theory, vol. 35, pp. 1014–1019, Sept. 1989.
[37] T. Petrie, “Probabilistic functions of finite state Markov chains,” Ann.
Math. Statist., vol. 40, pp. 97–115, 1969.
[38] J. Rissanen, “Universal coding, information, prediction and estimation,”
IEEE Trans. Inform. Theory, vol. IT-30, pp. 629–636, July 1984.
[39] ——, Stochastic Complexity in Statistical Inquiry. Singapore: World
Scientific, 1989.
[40] T. Rydén, “Estimating the order of hidden Markov models,” Statistics,
vol. 26, pp. 345–354, 1995.
[41] Y. M. Shtarkov, “Universal sequential coding of single messages,” Probl.
Inform. Transm., vol. 23, pp. 3–17, 1987.
[42] A. van der Vaart, Asymptotic Statistics. Cambridge, U.K.: Cambridge
Univ. Press, 1998.
[43] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[44] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, “The context-tree
weighting method: Basic properties,” IEEE Trans. Inform. Theory, vol.
41, pp. 653–664, May 1995.
[45] E. T. Whittaker and G. N. Watson, A Course of Modern Analysis. Cambridge, U.K.: Cambridge Math. Lib., 1924.
[46] O. Zeitouni and M. Gutman, “On universal hypothesis testing via large
deviations,” IEEE Trans. Inform. Theory, vol. 37, pp. 285–290, Mar.
1991.
[47] O. Zeitouni, J. Ziv, and N. Merhav, “When is the generalized likelihood ratio testing optimal?,” IEEE Trans. Inform. Theory, vol. 38, pp.
1597–1602, Sept. 1992.
[48] J. Ziv and N. Merhav, “Estimating the number of states of a finite-state
source,” IEEE Trans. Inform. Theory, vol. 38, pp. 61–65, Jan. 1992.