February, 1998
LIDS- P 2411
Research Supported By:
Siemens AG, Germany, contract
NSF contract DMI-9625489
Simulation-Based Optimization of Markov Reward Processes 1
Peter Marbach and John N. Tsitsiklis
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, MA 02139
e-mail: [email protected], [email protected]
Abstract: We propose a simulation-based algorithm for optimizing the average reward in a
Markov Reward Process that depends on a set of parameters. As a special case, the method
applies to Markov Decision Processes where optimization takes place within a parametrized
set of policies. The algorithm involves the simulation of a single sample path, and can be
implemented on-line. A convergence result (with probability 1) is provided.
1 This research was supported by a contract with Siemens AG, Munich, Germany, and by contract DMI-9625489 with the National Science Foundation.
1 Introduction
Markov Decision Processes and the associated dynamic programming (DP) methodology
[Ber95a, Put94] provide a general framework for posing and analyzing problems of sequential
decision making under uncertainty. DP methods rely on a suitably defined value function
that has to be computed for every state in the state space. However, many interesting
problems involve very large state spaces ("curse of dimensionality"), which prohibits the
application of DP. In addition, DP assumes the availability of an exact model, in the form
of transition probabilities. In many practical situations, such a model is not available and
one must resort to simulation or experimentation with an actual system. For all of these
reasons, dynamic programming in its pure form may be inapplicable.
The efforts to overcome the aforementioned difficulties involve two main ideas:
1. The use of simulation to estimate quantities of interest, thus avoiding model-based
computations.
2. The use of parametric representations to overcome the curse of dimensionality.
Parametric representations, and the associated algorithms, can be broadly classified into
three main categories.
(a) One can use a parametric representation of the value function. For example, instead
of associating a value V(i) with every state i, one uses a parametric form V(i, r),
where r is a vector of tunable parameters (weights), and V is a so-called approximation
architecture. For example, V(i, r) could be the output of a multilayer perceptron
with weights r, when the input is i. Other representations are possible, e.g., involving
polynomials, linear combinations of feature vectors, state aggregation, etc.
When the main ideas from DP are combined with such parametric representations,
one obtains methods that go under the names of "reinforcement learning" or "neuro-dynamic programming"; see [SB98, BT96] for textbook expositions, as well as the
references therein. Some of the main methods of this type are Sutton's temporal
difference (TD) methods [Sut88], Watkins' Q-learning algorithm [Wat89], and approximate policy iteration [BT96]. The distinguishing characteristic of such methods
is that policy optimization is carried out in an indirect fashion: one tries to obtain a
good approximation of the optimal value function of dynamic programming, and uses
it to construct policies that are close to optimal. The understanding of such methods is still somewhat incomplete: convergence results or performance guarantees are
available only for a few special cases such as state-space aggregation [TV96], optimal
stopping problems [TV97], and an idealized form of policy iteration [BT96]. However,
there have been some notable practical successes (see [SB98, BT96] for an overview),
including the world-class backgammon player by Tesauro [Tes92].
(b) In an alternative approach, which is the one considered in this paper, the tuning of
a parametrized value function is bypassed. Instead, one considers a class of policies
described in terms of a parameter vector θ. Simulation is employed to estimate the
gradient of the performance metric with respect to θ, and the policy is improved
by updating θ in a gradient direction. Methods of this type have been extensively
explored in the IPA (infinitesimal perturbation analysis) literature [CR94, HC91].
Many of these methods are focused on special problem types and do not readily
extend to general Markov Decision Processes.
(c) A third approach, which is a combination of the above two, involves a so-called
"actor-critic" architecture, which includes parametrizations of the policy (actor) and of the
value function (critic) [BSA83]. At present, little is known about the theoretical
properties of such methods (see, however, [Rao96]).
This paper concentrates on methods based on policy parametrization and (approximate)
gradient improvement, in the spirit of item (b) above. It is actually not restricted to Markov
Decision Processes, but it also applies to general Markov Reward Processes that depend on
a parameter vector θ. Our first step is to obtain a method for estimating the gradient of the
performance metric. In this connection, we note the "likelihood ratio" method of [Gly86],
which has this flavor, but does not easily lend itself to on-line updating of the parameter
vector. We find an alternative approach, based on a suitably defined "differential reward
function," to be more convenient. It relies on a gradient formula that has been presented
in different forms in [CC97, CW98, FH94, JSJ95]. We exploit a variant of this formula
and develop a method that updates the parameter vector θ at every renewal time, in an
approximate gradient direction. Furthermore, we show how to construct a truly on-line
method that updates the parameter vector at each time step. In this respect, our work is
closer to the methods described in [CR94] (that reference assumes, however, the availability
of an IPA estimator, with certain guaranteed properties that are absent in our context) and
in [JSJ95] (which, however, does not contain convergence results).
The method that we propose only keeps in memory and updates 2K + 1 numbers,
where K is the dimension of θ. Other than θ itself, this includes a vector similar to the
"eligibility trace" in Sutton's temporal difference methods, and (as in [JSJ95]) an estimate
λ of the average reward under the current value of the parameter vector. If that estimate
were accurate, our method would be a standard stochastic gradient algorithm. However, as
the policy keeps changing, λ is generally a biased estimate of the true average reward, and
the mathematical structure of our method is more complex than that of stochastic gradient
algorithms. For reasons that will become clearer later, convergence cannot be established
using standard approaches (e.g., martingale arguments or the ODE approach), and a more
elaborate proof is necessary.
In summary, the main contributions of this paper are as follows.
1. We introduce a new algorithm for updating the parameters of a Markov Reward Process. The algorithm involves only a single sample path of the system. The parameter
updates can take place either at visits to a certain recurrent state, or at every time
step. We also specialize the method to the case of Markov Decision Processes with
parametrically represented policies.
2. We establish that the gradient (with respect to the parameter vector) of the performance metric converges to zero, with probability 1, which is the strongest possible
result for gradient-related stochastic approximation algorithms.
The remainder of this paper is organized as follows. In Section 2, we introduce our
framework and assumptions, and state some background results, including a formula for the
gradient of the performance metric. In Section 3, we present an algorithm that performs
updates at visits to a certain recurrent state, present our main convergence result, and
provide a heuristic argument. Section 4 deals with variants of the algorithm that perform
updates at every time step. In Section 5, we specialize our methods to the case of Markov
Decision Processes that are optimized within a possibly restricted set of parametrically
represented randomized policies. The lengthy proof of our main results is developed in the
appendices.
2 Markov Reward Processes Depending on a Parameter
In this section, we present our general framework, make a few assumptions, and state some
basic results that will be needed later.
We consider a discrete-time, finite-state Markov chain {i_n} with state space S = {1, ..., N},
whose transition probabilities depend on a parameter vector θ ∈ ℝ^K, and are denoted by

p_ij(θ) = P(i_n = j | i_{n−1} = i; θ).
Whenever the state is equal to i, we receive a one-stage reward, that also depends on θ, and
is denoted by g_i(θ).
For every θ ∈ ℝ^K, let P(θ) be the stochastic matrix with entries p_ij(θ). Let P = {P(θ) |
θ ∈ ℝ^K} be the set of all such matrices, and let P̄ be its closure. Note that every element
of P̄ is also a stochastic matrix and, therefore, defines a Markov chain on the same state
space. We make the following assumptions.
Assumption 1 The Markov chain corresponding to every P ∈ P̄ is aperiodic. Furthermore, there exists a state i* which is recurrent for every such Markov chain.
We will often refer to the times that the state i* is visited as renewal times.
Assumption 2 For every i, j ∈ S, p_ij(θ) and g_i(θ) are bounded, twice continuously differentiable, and have bounded first and second derivatives.
The performance metric that we use to compare different policies is the average reward
criterion λ(θ), defined by

λ(θ) = lim_{T→∞} (1/T) E_θ[ Σ_{k=0}^{T−1} g_{i_k}(θ) ].

Here, i_k is the state at time k, and the notation E_θ[·] indicates that the expectation is taken
with respect to the distribution of the Markov chain with transition probabilities p_ij(θ).
Under Assumption 1, the average reward λ(θ) is well defined for every θ, and does not
depend on the initial state. Furthermore, the balance equations
Σ_{i=1}^{N} π_i(θ) p_ij(θ) = π_j(θ),   j = 1, ..., N − 1,   (1)

Σ_{i=1}^{N} π_i(θ) = 1,   (2)
have a unique solution π(θ) = (π_1(θ), ..., π_N(θ)), with π_i(θ) being the steady-state probability of state i under that particular value of θ, and the average reward is equal to

λ(θ) = Σ_{i=1}^{N} π_i(θ) g_i(θ).   (3)
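The balance equations (1)-(2) and formula (3) are easy to check numerically. The sketch below, using a hypothetical 3-state chain and reward vector (not from the paper), drops one redundant balance equation and appends the normalization constraint, which mirrors the square N × N system discussed next.

```python
import numpy as np

def stationary_distribution(P):
    # Balance equations (1)-(2): keep the first N-1 balance equations and
    # append the normalization constraint, giving a square invertible system.
    N = P.shape[0]
    A = np.vstack([(np.eye(N) - P.T)[:-1, :], np.ones(N)])
    a = np.zeros(N); a[-1] = 1.0
    return np.linalg.solve(A, a)

def average_reward(P, g):
    # Eq. (3): lambda(theta) = sum_i pi_i(theta) g_i(theta)
    return stationary_distribution(P) @ g

# Hypothetical example chain (illustrative only, not from the paper):
P_ex = np.array([[0.0, 0.6, 0.4],
                 [0.5, 0.0, 0.5],
                 [0.4, 0.6, 0.0]])
g_ex = np.array([1.0, 2.0, 0.5])
lam = average_reward(P_ex, g_ex)
```

For this example chain the stationary distribution is (0.3125, 0.375, 0.3125), so the average reward evaluates to 1.21875.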
We observe that the balance equations (1)-(2) are of the form

A(θ) π(θ) = a,

where a is a fixed vector and A(θ) is an N × N matrix. Using the fact that A(θ) depends
smoothly on θ, we have the following result.
Lemma 1 Let Assumptions 1 and 2 hold. Then, π(θ) and λ(θ) are twice differentiable,
and have bounded first and second derivatives.
Proof: The balance equations can be written in the form A(θ)π(θ) = a, where the entries
of A(θ) have bounded second derivatives (Assumption 2). Since the balance equations have
a unique solution, the matrix A(θ) is always invertible and Cramer's rule yields

π(θ) = C(θ) / det(A(θ)),   (4)

where C(θ) is a vector whose entries are polynomial functions of the entries of A(θ). Using
Assumption 2, C(θ) and det(A(θ)) are twice differentiable and have bounded first and
second derivatives.
More generally, suppose that P ∈ P̄, i.e., P is the limit of the stochastic matrices
P(θ_k) along some sequence θ_k. The corresponding balance equations are again of the form
A(P)π = a, where A(P) is a matrix depending on P. Under Assumption 1, these balance
equations have again a unique solution, which implies that |det(A(P))| is positive. Note
that |det(A(P))| is a continuous function of P, and P lies in the set P̄, which is closed
and bounded. It follows that |det(A(P))| is bounded below by a positive constant c. Since
every P(θ) belongs to P̄, it follows that |det(A(θ))| ≥ c > 0, for every θ. This fact, together
with Eq. (4), implies that π(θ) is twice differentiable and has bounded first and second
derivatives. The same property holds true for λ(θ), as can be seen by differentiating twice
the formula (3). □
2.1 The Gradient of λ(θ)
For any θ ∈ ℝ^K and i ∈ S, we define the differential reward v_i(θ) of state i by

v_i(θ) = E_θ[ Σ_{k=0}^{T−1} (g_{i_k}(θ) − λ(θ)) | i_0 = i ],   (5)

where T = min{k > 0 | i_k = i*} is the first future time that state i* is visited. With this
definition, it is well known that v_{i*}(θ) = 0.
The following proposition gives an expression for the gradient of the average reward
λ(θ), with respect to θ. A related expression (in a somewhat different context) was given
in [JSJ95], and a proof can be found in [CC97]. The latter reference does not consider the
case where g_i(θ) depends on θ, but the extension is immediate.
Proposition 1 Let Assumptions 1 and 2 hold. Then,

∇λ(θ) = Σ_{i∈S} π_i(θ) ( ∇g_i(θ) + Σ_{j∈S} ∇p_ij(θ) v_j(θ) ).
Equation (3) for λ(θ) suggests that ∇λ(θ) could involve terms of the form ∇π_i(θ), but
the expression given by Proposition 1 involves no such terms. This property is very helpful
in producing simulation-based estimates of ∇λ(θ).
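Proposition 1 can be verified numerically. The sketch below uses a hypothetical parametrized 3-state chain (not from the paper), computes v(θ) through the standard Poisson-equation characterization of definition (5) (solve (I − P)v = g − λ1 with v fixed to zero at i*), approximates ∇p_ij(θ) by central differences, and compares the formula of Proposition 1 against a finite-difference derivative of λ(θ).

```python
import numpy as np

def P(theta):
    # Hypothetical 3-state chain with a scalar parameter theta (illustrative):
    a = 1.0 / (1.0 + np.exp(-theta))          # smooth, stays in (0, 1)
    return np.array([[0.0, a, 1.0 - a],
                     [0.5, 0.0, 0.5],
                     [1.0 - a, a, 0.0]])

g = np.array([1.0, 2.0, 0.5])                 # rewards independent of theta

def stationary(Pm):
    N = Pm.shape[0]
    A = np.vstack([(np.eye(N) - Pm.T)[:-1, :], np.ones(N)])
    b = np.zeros(N); b[-1] = 1.0
    return np.linalg.solve(A, b)

def differential_reward(Pm, g, lam, i_star=0):
    # Definition (5) is equivalent to solving (I - P) v = g - lam * 1
    # with the normalization v[i_star] = 0 (Poisson-equation form).
    N = len(g)
    M = np.eye(N) - Pm
    keep = [j for j in range(N) if j != i_star]
    v = np.zeros(N)
    v[keep] = np.linalg.lstsq(M[:, keep], g - lam * np.ones(N), rcond=None)[0]
    return v

theta, h = 0.3, 1e-6
pi = stationary(P(theta))
lam = pi @ g
v = differential_reward(P(theta), g, lam)
dP = (P(theta + h) - P(theta - h)) / (2.0 * h)   # entrywise grad of p_ij
grad_formula = pi @ (dP @ v)                     # Proposition 1 (grad g = 0 here)
grad_fd = (stationary(P(theta + h)) @ g - stationary(P(theta - h)) @ g) / (2.0 * h)
```

Since g is constant in θ in this example, the ∇g_i(θ) term of Proposition 1 vanishes and the two gradients agree up to finite-difference error.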
2.2 An idealized gradient algorithm
Given that our goal is to maximize the average reward λ(θ), it is natural to consider
gradient-type methods. If the gradient of λ(θ) could be computed exactly, we would consider a
gradient algorithm of the form

θ_{k+1} = θ_k + γ_k ∇λ(θ_k).

Based on the fact that λ(θ) has bounded second derivatives, and under suitable conditions
on the stepsizes γ_k, it would follow that lim_{k→∞} ∇λ(θ_k) = 0 and that λ(θ_k) converges
[Ber95b].
Alternatively, if we could use simulation to produce an unbiased estimate h_k of ∇λ(θ_k),
we could then employ the approximate gradient iteration

θ_{k+1} = θ_k + γ_k h_k.
The convergence of such a method can be established if we use a diminishing stepsize
sequence and make suitable assumptions on the estimation errors. Unfortunately, it does
not appear possible to produce unbiased estimates of ∇λ(θ) in a manner that is consistent
with on-line implementation based on a single sample path. This difficulty is bypassed by
the method developed in the next section.
3 The Simulation-Based Method
In the previous section, we described an idealized gradient algorithm for tuning the parameter vector θ. In this section, we develop a simulation-based algorithm that replaces the
gradient ∇λ(θ) by an estimate obtained by simulating a single sample path. We will show
that this algorithm retains the convergence properties of the idealized gradient method.
For technical reasons, we make the following assumption on the transition probabilities
p_ij(θ). In Section 5, this assumption is revisited and we argue that it need not be restrictive.
Assumption 3 There exists a positive scalar ε, such that for every i, j ∈ S, we have
either p_ij(θ) = 0 for all θ, or p_ij(θ) ≥ ε for all θ.
3.1 Estimation of ∇λ(θ)
Throughout this subsection, we assume that the parameter vector θ is fixed to some value.
Let {i_n} be a sample path of the corresponding Markov chain, possibly obtained through
simulation. Let t_m be the time of the mth visit to the recurrent state i*. We refer to the
sequence i_{t_m}, i_{t_m+1}, ..., i_{t_{m+1}} as the mth renewal cycle, and we define its length T_m by

T_m = t_{m+1} − t_m.

For a fixed θ, the random variables T_m are independent and identically distributed, and have a
(common) finite mean, denoted by E_θ[T].
Our first step is to rewrite the formula for ∇λ(θ) in the form

∇λ(θ) = Σ_{i∈S} π_i(θ) ( ∇g_i(θ) + Σ_{j∈S_i} p_ij(θ) (∇p_ij(θ) / p_ij(θ)) v_j(θ) ),

where S_i = {j | p_ij(θ) > 0}. Estimating the term π_i(θ)∇g_i(θ) through simulation is
straightforward, assuming that we are able to compute ∇g_i(θ) for any given i and θ. The
other term can be viewed as the expectation of v_j(θ) ∇p_ij(θ)/p_ij(θ), with respect to the
steady-state probability π_i(θ) p_ij(θ) of transitions from i to j. Furthermore, the definition
(5) of v_j(θ) suggests that if t_m < n ≤ t_{m+1} − 1 and i_n = j, we can use

ṽ_{i_n}(θ, λ̃) = Σ_{k=n}^{t_{m+1}−1} (g_{i_k}(θ) − λ̃)   (6)

to estimate v_j(θ), where λ̃ is some estimate of λ(θ). Note that v_{i*}(θ) = 0 and does not need
to be estimated. For this reason, we let

ṽ_{i_n}(θ, λ̃) = 0,   if n = t_m.
By accumulating the above described estimates over a renewal cycle, we are finally led
to an estimate of ∇λ(θ) given by

F_m(θ, λ̃) = Σ_{n=t_m}^{t_{m+1}−1} ( ∇g_{i_n}(θ) + ṽ_{i_n}(θ, λ̃) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ) ).   (7)

Note that the denominator p_{i_{n−1} i_n}(θ) is always positive, since only transitions that have
positive probability will be observed. Also, the random variables F_m(θ, λ̃) are independent
and identically distributed for different values of m, because the transitions during distinct
renewal cycles are independent.
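A single term F_m(θ, λ̃) of Eq. (7) can be computed from one simulated renewal cycle. The sketch below simplifies to a scalar parameter θ, so that the gradients of p_ij and g_i are passed in as plain arrays dP and dg; the chosen matrices are illustrative, not from the paper.

```python
import numpy as np

def simulate_cycle(P, i_star, rng):
    """Simulate one renewal cycle starting at i* and ending at the next visit."""
    path = [i_star]
    while True:
        j = rng.choice(P.shape[0], p=P[path[-1]])
        path.append(int(j))
        if path[-1] == i_star:
            return path                     # path[0] = path[-1] = i*

def F_m(path, P, dP, g, dg, lam_est):
    """Eq. (7) for one cycle (scalar-parameter simplification).

    dP[i, j] and dg[i] stand for the derivatives of p_ij and g_i w.r.t. theta.
    """
    T = len(path) - 1                       # cycle length T_m
    F = sum(dg[path[n]] for n in range(T))  # grad g_{i_n} terms
    for n in range(1, T):                   # n = t_m contributes v-tilde = 0
        v_tilde = sum(g[path[k]] - lam_est for k in range(n, T))   # Eq. (6)
        F += v_tilde * dP[path[n-1], path[n]] / P[path[n-1], path[n]]
    return F

# Tiny hand-checkable example (hypothetical 2-state chain):
P2 = np.array([[0.5, 0.5], [1.0, 0.0]])
dP2 = np.array([[-0.2, 0.2], [0.0, 0.0]])
g2 = np.array([1.0, 2.0])
dg2 = np.array([0.0, 0.0])
F_example = F_m([0, 1, 0], P2, dP2, g2, dg2, 1.5)
rng = np.random.default_rng(0)
cycle = simulate_cycle(P2, 0, rng)
```

For the hand-checkable path [0, 1, 0], only the n = 1 term contributes: ṽ = 2.0 − 1.5 = 0.5 times 0.2/0.5, i.e., F_example = 0.2. Averaging F_m over many cycles approximates E_θ[T]∇λ(θ) when λ̃ is close to λ(θ), as Proposition 2 below makes precise.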
We define f(θ, λ̃) to be the expected value of F_m(θ, λ̃), namely,

f(θ, λ̃) = E_θ[F_m(θ, λ̃)].   (8)

The following proposition confirms that the expectation of F_m(θ, λ̃) is aligned with ∇λ(θ),
to the extent that λ̃ is close to λ(θ).
Proposition 2 We have

f(θ, λ̃) = E_θ[T] ∇λ(θ) + G(θ)(λ(θ) − λ̃),

where

G(θ) = E_θ[ Σ_{n=t_m+1}^{t_{m+1}−1} (t_{m+1} − n) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ) ].   (9)
Proof: Note that for n = t_m + 1, ..., t_{m+1} − 1, we have

ṽ_{i_n}(θ, λ̃) = Σ_{k=n}^{t_{m+1}−1} (g_{i_k}(θ) − λ(θ)) + (t_{m+1} − n)(λ(θ) − λ̃).

Therefore,

F_m(θ, λ̃) = Σ_{n=t_m+1}^{t_{m+1}−1} a_n ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ)
           + Σ_{n=t_m+1}^{t_{m+1}−1} (t_{m+1} − n)(λ(θ) − λ̃) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ)
           + Σ_{n=t_m}^{t_{m+1}−1} ∇g_{i_n}(θ),

where

a_n = Σ_{k=n}^{t_{m+1}−1} (g_{i_k}(θ) − λ(θ)).   (10)
We consider separately the expectations of the three sums above. Using the definition of
G(θ), the expectation of the second sum is equal to G(θ)(λ(θ) − λ̃). We then consider the
third sum. As is well known, the expected sum of rewards over a renewal cycle is equal to
the steady-state expected reward times the expected length of the renewal cycle. Therefore,
the expectation of the third sum is

E_θ[ Σ_{n=t_m}^{t_{m+1}−1} ∇g_{i_n}(θ) ] = E_θ[T] Σ_{i∈S} π_i(θ) ∇g_i(θ).   (11)
We now focus on the expectation of the first sum. For n = t_m + 1, ..., t_{m+1} − 1, let

A_n = (a_n − v_{i_n}(θ)) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ).

Let F_n = {i_0, ..., i_n} stand for the history of the process up to time n. By comparing the
definition (10) of a_n with the definition (5) of v_{i_n}(θ), we obtain

E_θ[a_n | F_n] = v_{i_n}(θ).   (12)

It follows that E_θ[A_n | F_n] = 0.
Let χ_n = 1 if n < t_{m+1}, and χ_n = 0, otherwise. For any n > t_m, we have

E_θ[χ_n A_n | F_{t_m}] = E_θ[ E_θ[χ_n A_n | F_n] | F_{t_m} ] = E_θ[ χ_n E_θ[A_n | F_n] | F_{t_m} ] = 0.

We then have

E_θ[ Σ_{n=t_m+1}^{t_{m+1}−1} A_n | F_{t_m} ] = E_θ[ Σ_{n=t_m+1}^{∞} χ_n A_n | F_{t_m} ] = 0.
(The interchange of the summation and the expectation can be justified by appealing to the
dominated convergence theorem.)
We therefore have

E_θ[ Σ_{n=t_m+1}^{t_{m+1}−1} a_n ∇p_{i_{n−1} i_n}(θ)/p_{i_{n−1} i_n}(θ) ] = E_θ[ Σ_{n=t_m+1}^{t_{m+1}−1} v_{i_n}(θ) ∇p_{i_{n−1} i_n}(θ)/p_{i_{n−1} i_n}(θ) ].

The right-hand side can be viewed as the total reward over a renewal cycle of a Markov reward process, where the reward associated with a transition from i to j is v_j(θ) ∇p_ij(θ)/p_ij(θ).
Recalling that any particular transition has steady-state probability π_i(θ) p_ij(θ) of being
from i to j, we obtain

E_θ[ Σ_{n=t_m+1}^{t_{m+1}−1} a_n ∇p_{i_{n−1} i_n}(θ)/p_{i_{n−1} i_n}(θ) ] = E_θ[T] Σ_{i∈S} Σ_{j∈S_i} π_i(θ) p_ij(θ) v_j(θ) ∇p_ij(θ)/p_ij(θ).   (13)

By combining Eqs. (11) and (13), and comparing with the formula for ∇λ(θ), we see that
the desired result has been proved. □
3.2 An Algorithm that Updates at Visits to the Recurrent State
We now use the approximate gradient direction provided by Proposition 2, and propose a
simulation-based algorithm that performs updates at visits to the recurrent state i*. We use
the variable m to index the times when the recurrent state i* is visited and the corresponding
updates. The form of the algorithm is the following. At the time t_m that state i* is visited
for the mth time, we have available a current vector θ_m and an average reward estimate
λ_m. We then simulate the process according to the transition probabilities p_ij(θ_m) until the
next time t_{m+1} that i* is visited and update according to

θ_{m+1} = θ_m + γ_m F_m(θ_m, λ_m),   (14)

λ_{m+1} = λ_m + γ_m Σ_{n=t_m}^{t_{m+1}−1} (g_{i_n}(θ_m) − λ_m),   (15)
where γ_m is a positive stepsize sequence (cf. Assumption 4 below). To see the rationale
behind Eq. (15), note that the expected update direction for λ is

E_θ[ Σ_{n=t_m}^{t_{m+1}−1} (g_{i_n}(θ) − λ) ] = E_θ[T] (λ(θ) − λ),   (16)

which moves λ closer to λ(θ).
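The updates (14)-(15) can be sketched in a few lines. The sketch below reuses a hypothetical scalar-parameter 3-state chain with θ-independent rewards (so ∇g = 0), approximates ∇p_ij(θ) by central differences, and takes i* = 0; none of these modeling choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def P(theta):                      # hypothetical parametrized chain
    a = 1.0 / (1.0 + np.exp(-theta))
    return np.array([[0.0, a, 1.0 - a],
                     [0.5, 0.0, 0.5],
                     [1.0 - a, a, 0.0]])

def dP(theta, h=1e-6):             # entrywise derivative of P w.r.t. theta
    return (P(theta + h) - P(theta - h)) / (2.0 * h)

g = np.array([1.0, 2.0, 0.5])      # one-stage rewards, independent of theta

theta, lam = 0.0, 0.0
for m in range(1, 2001):
    gamma = 0.1 / m                # satisfies Assumption 4
    Pm, dPm = P(theta), dP(theta)
    path = [0]                     # i* = 0: simulate one renewal cycle
    while True:
        path.append(int(rng.choice(3, p=Pm[path[-1]])))
        if path[-1] == 0:
            break
    T = len(path) - 1
    F = 0.0                        # Eq. (7), with grad g = 0 in this example
    for n in range(1, T):
        v_tilde = sum(g[path[k]] - lam for k in range(n, T))
        F += v_tilde * dPm[path[n-1], path[n]] / Pm[path[n-1], path[n]]
    theta += gamma * F                                        # Eq. (14)
    lam += gamma * sum(g[path[n]] - lam for n in range(T))    # Eq. (15)
```

The two iterates evolve jointly on the single simulated sample path, exactly as in the text: λ tracks the average reward of the current policy while θ moves in the approximate gradient direction.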
Assumption 4 The stepsizes γ_m are nonnegative and satisfy

Σ_{m=1}^{∞} γ_m = ∞,   Σ_{m=1}^{∞} γ_m² < ∞.
Assumption 4 is satisfied, for example, if we let γ_m = 1/m. It can be shown that if θ
is held fixed, but λ keeps being updated according to Eq. (15), then λ converges to λ(θ).
However, if θ is also updated according to Eq. (14), then the estimate λ_m can "lag behind"
λ(θ_m). As a consequence, the expected update direction for θ will not be aligned with the
gradient ∇λ(θ).
An alternative approach that we do not pursue is to use different stepsizes for updating
λ and θ. If the stepsize used to update θ is, in the limit, much smaller than the stepsize
used to update λ, the algorithm exhibits a two-time-scale behavior of the form studied in
[Bor97]. In the limit, λ_m is an increasingly accurate estimate of λ(θ_m), and the algorithm
is effectively a stochastic gradient algorithm. However, such a method would make slower
progress, as far as θ is concerned. Our convergence results indicate that this alternative
approach is not necessary.
We can now state our main result.
Proposition 3 Let Assumptions 1-4 hold, and let {θ_m} be the sequence of parameter vectors generated by the above described algorithm. Then, λ(θ_m) converges and

lim_{m→∞} ∇λ(θ_m) = 0,

with probability 1.
3.3 A Heuristic Argument
In this subsection, we approximate the algorithm by a suitable ODE (as in [Lju77]), and
establish the convergence properties of the ODE. While this argument does not constitute
a proof, it illustrates the rationale behind our convergence result.
We replace the update directions by their expectations under the current value of θ. The
update equations for θ and λ then take the form

θ_{m+1} = θ_m + γ_m f(θ_m, λ_m),

λ_{m+1} = λ_m + γ_m E_{θ_m}[T] (λ(θ_m) − λ_m),

where f(θ, λ) is given by Proposition 2. With an asymptotically vanishing stepsize, and
after rescaling time, this deterministic iteration behaves similarly to the following system of
differential equations:

θ̇_t = ∇λ(θ_t) + (G(θ_t) / E_{θ_t}[T]) (λ(θ_t) − λ_t),   (17)

λ̇_t = λ(θ_t) − λ_t.   (18)
Note that λ_t and λ(θ_t) are bounded functions since the one-stage reward g_i(θ) is finite-valued
and, therefore, bounded. We will now argue that λ_t converges.
We first consider the case where the initial conditions satisfy λ_0 ≤ λ(θ_0). We then claim
that

λ_t ≤ λ(θ_t),   ∀ t ≥ 0.   (19)

Indeed, suppose that at some time t_0 we have λ_{t_0} = λ(θ_{t_0}). If ∇λ(θ_{t_0}) = 0, then we are at
an equilibrium point of the differential equations, and we have λ_t = λ(θ_t) for all subsequent
times. Otherwise, if ∇λ(θ_{t_0}) ≠ 0, then θ̇_{t_0} = ∇λ(θ_{t_0}), and (d/dt) λ(θ_t) at t_0 equals
|∇λ(θ_{t_0})|² > 0. At the same time, we have λ̇_{t_0} = 0, and this implies that λ_t < λ(θ_t) for
t slightly larger than t_0. The validity of the claim (19) follows. As long as λ_t ≤ λ(θ_t), λ_t is
nondecreasing and, since it is bounded, it must converge.
Suppose now that the initial conditions satisfy λ_0 > λ(θ_0). As long as this condition
remains true, λ_t is nonincreasing. There are two possibilities. If this condition remains true
for all times, then λ_t converges. If not, then there exists a time t_0 such that λ_{t_0} = λ(θ_{t_0}),
which takes us back to the previously considered case.
Having concluded that λ_t converges, we can use Eq. (18) to argue that λ(θ_t) must also
converge to the same limit. Then, in the limit, θ_t evolves according to θ̇_t = ∇λ(θ_t), from
which it follows that ∇λ(θ_t) must go to zero.
We now comment on the nature of a rigorous proof. There are two common approaches
for proving the convergence of stochastic approximation methods. One method relies on the
existence of a suitable Lyapunov function and a martingale argument. In our context, λ(θ)
could play such a role. However, as long as λ_m ≠ λ(θ_m), our method cannot be expressed
as a stochastic gradient algorithm, and this approach does not go through. (Furthermore,
it is unclear whether another Lyapunov function would do.) The second proof method, the
so-called ODE approach, shows that the trajectories followed by the algorithm converge to
the trajectories of a corresponding deterministic ODE, e.g., the ODE given by Eqs. (17)-(18).
This line of analysis generally requires the iterates to be bounded functions of time.
In our case, such a boundedness property is not guaranteed to hold. For example, if θ
stands for the weights of a neural network, it is certainly possible that certain "neurons"
asymptotically saturate, and the corresponding weights drift to infinity. We conclude that
we need a line of argument specially tailored to our particular algorithm. In rough terms,
it proceeds along the same lines as the deterministic analysis provided above, except that
we need to ensure that the stochastic terms are not significant.
3.4 Implementation Issues
In this subsection, we indicate an economical way of computing the direction F_m(θ, λ̃) of
update of the vector θ.
Taking into account that ṽ_{i_{t_m}}(θ, λ̃) = 0, Eq. (7) becomes

F_m(θ, λ̃) = Σ_{n=t_m}^{t_{m+1}−1} ∇g_{i_n}(θ) + Σ_{n=t_m+1}^{t_{m+1}−1} ṽ_{i_n}(θ, λ̃) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ)

          = Σ_{n=t_m}^{t_{m+1}−1} ∇g_{i_n}(θ) + Σ_{n=t_m+1}^{t_{m+1}−1} ( Σ_{k=n}^{t_{m+1}−1} (g_{i_k}(θ) − λ̃) ) ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ)

          = ∇g_{i*}(θ) + Σ_{k=t_m+1}^{t_{m+1}−1} ( ∇g_{i_k}(θ) + (g_{i_k}(θ) − λ̃) z_k ),

where the last step interchanges the order of summation, and

z_k = Σ_{n=t_m+1}^{k} ∇p_{i_{n−1} i_n}(θ) / p_{i_{n−1} i_n}(θ),   k = t_m + 1, ..., t_{m+1} − 1,

is a vector (of the same dimension as θ) that becomes available at time k. It can be updated
recursively, with

z_{t_m} = 0,   (20)

and

z_{k+1} = z_k + ∇p_{i_k i_{k+1}}(θ) / p_{i_k i_{k+1}}(θ),   k = t_m, ..., t_{m+1} − 2.   (21)
In order to implement the algorithm on the basis of the above equations, we only need
to maintain in memory 2K + 1 scalars, namely λ̃, and the vectors θ and z. In the next section,
we suggest a variant of the algorithm that updates the parameter vector θ at every time
step, rather than at visits to the recurrent state i*.
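The rearrangement above is a purely algebraic identity, which can be verified directly. The sketch below (scalar parameter, illustrative arrays dP and dg for the derivatives of p_ij and g_i) computes F_m once term by term as in Eq. (7) and once by streaming the trace z of Eqs. (20)-(21), and checks that both agree on several hand-picked cycles.

```python
import numpy as np

def F_m_direct(path, P, dP, g, dg, lam):
    """Eq. (7) evaluated term by term (scalar-parameter illustration)."""
    T = len(path) - 1
    F = sum(dg[path[n]] for n in range(T))
    for n in range(1, T):
        v_tilde = sum(g[path[k]] - lam for k in range(n, T))
        F += v_tilde * dP[path[n-1], path[n]] / P[path[n-1], path[n]]
    return F

def F_m_streaming(path, P, dP, g, dg, lam):
    """The same quantity accumulated on-line with the trace of Eqs. (20)-(21)."""
    T = len(path) - 1
    z = 0.0                          # z_{t_m} = 0, Eq. (20)
    F = dg[path[0]]                  # the grad g_{i*} term
    for k in range(1, T):
        z += dP[path[k-1], path[k]] / P[path[k-1], path[k]]   # Eq. (21)
        F += dg[path[k]] + (g[path[k]] - lam) * z
    return F

# Hypothetical chain; note each row of dP sums to zero, as the derivative
# of a stochastic matrix must.
P3 = np.array([[0.0, 0.6, 0.4], [0.5, 0.0, 0.5], [0.4, 0.6, 0.0]])
dP3 = np.array([[0.0, 0.3, -0.3], [-0.1, 0.0, 0.1], [0.2, -0.2, 0.0]])
g3 = np.array([1.0, 2.0, 0.5])
dg3 = np.array([0.1, 0.0, -0.2])
cycles = [[0, 1, 0], [0, 2, 1, 0], [0, 1, 2, 1, 0]]
pairs = [(F_m_direct(p, P3, dP3, g3, dg3, 1.2),
          F_m_streaming(p, P3, dP3, g3, dg3, 1.2)) for p in cycles]
```

The streaming form needs only the current z in memory, which is what makes the fully on-line variant of the next section possible.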
4 Algorithms that Update at Every Time Step
We now propose a fully on-line algorithm that updates the parameter vector θ at each time
step. Recall that in the preceding subsection, F_m(θ, λ̃) was expressed as a sum of terms,
with one additional term becoming available subsequent to each transition. Accordingly,
we break down the total update into a sum of incremental updates carried out at each time
step.
At a typical time k, the state is i_k, and the values of θ_k, z_k, and λ_k are available from
the previous iteration. We update θ and λ according to

θ_{k+1} = θ_k + γ_k ( ∇g_{i_k}(θ_k) + (g_{i_k}(θ_k) − λ_k) z_k ),

λ_{k+1} = λ_k + γ_k (g_{i_k}(θ_k) − λ_k).

We then simulate a transition to the next state i_{k+1} according to the transition probabilities
p_ij(θ_{k+1}), and finally update z by letting

z_{k+1} = 0,   if i_{k+1} = i*,

z_{k+1} = z_k + ∇p_{i_k i_{k+1}}(θ_k) / p_{i_k i_{k+1}}(θ_k),   otherwise.
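The per-step updates can be sketched as follows, again on a hypothetical scalar-parameter 3-state chain with θ-independent rewards and i* = 0 (all illustrative choices). With the stepsize γ_k = 1/k, the λ update reduces exactly to a running average of the observed rewards, which the sketch exploits as a sanity check; for simplicity the z update evaluates ∇p/p at the freshly updated parameter, a negligible difference for small stepsizes.

```python
import numpy as np

rng = np.random.default_rng(2)

def P(theta):                          # hypothetical parametrized chain
    a = 1.0 / (1.0 + np.exp(-theta))
    return np.array([[0.0, a, 1.0 - a],
                     [0.5, 0.0, 0.5],
                     [1.0 - a, a, 0.0]])

def dP(theta, h=1e-6):                 # entrywise derivative of P w.r.t. theta
    return (P(theta + h) - P(theta - h)) / (2.0 * h)

g = np.array([1.0, 2.0, 0.5])          # rewards independent of theta => grad g = 0

theta, lam, z, i = 0.0, 0.0, 0.0, 0    # i* = 0; z is reset at visits to i*
rewards = []
for k in range(1, 5001):
    gamma = 1.0 / k
    r = g[i]
    rewards.append(r)
    theta += gamma * (r - lam) * z     # theta update (grad g = 0 here)
    lam += gamma * (r - lam)           # running average-reward estimate
    Pk = P(theta)
    j = int(rng.choice(3, p=Pk[i]))    # simulate one transition
    z = 0.0 if j == 0 else z + dP(theta)[i, j] / Pk[i, j]
    i = j
```

Only the scalars θ, λ and the trace z are kept in memory between steps, matching the 2K + 1 storage count of Section 3.4.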
In order to prove convergence of this version of the algorithm, we will use an additional
condition on the stepsizes.

Assumption 5 The stepsizes γ_k are nonincreasing. Furthermore, there exists a positive
integer p and a positive scalar A such that

Σ_{k=n}^{n+t} (γ_n − γ_k) ≤ A t^p γ_n²,   ∀ n, t > 0.
Assumption 5 is satisfied, for example, if we let γ_k = 1/k. With this choice, and if we
initialize λ to zero, it is easily verified that λ_k is equal to the average reward obtained in
the first k transitions.
We have the following convergence result.
Proposition 4 Let Assumptions 1-5 hold, and let {θ_k} be the sequence of parameter vectors
generated by the above described algorithm. Then, λ(θ_k) converges and

lim_{k→∞} ∇λ(θ_k) = 0,

with probability 1.
The algorithm of this section is similar to the algorithm of the preceding one, except
that θ is continually updated in the course of a renewal cycle. Because of the diminishing
stepsize, these incremental updates are asymptotically negligible and the difference between
the two algorithms is inconsequential. Given that the algorithm of the preceding section
converges, Proposition 4 is hardly surprising.
4.1 A Modified On-Line Algorithm
If the length of a typical interval between visits to the recurrent state i* is large, as is the
case in many applications, then the vector z_k may become quite large before it is reset to
zero, resulting in high variance for the updates. For this reason, it may be preferable to
introduce a forgetting factor α ∈ (0, 1) and update z_k according to

z_{k+1} = α z_k + ∇p_{i_k i_{k+1}}(θ) / p_{i_k i_{k+1}}(θ),

without resetting it when i* is visited. This modification, which resembles the algorithm
introduced in [JSJ95], can reduce the variance of a typical update, but introduces an additional bias in the update direction. Given that gradient-type methods are fairly robust with
respect to small biases, this modification may result in improved practical performance.
As in [JSJ95], this modified algorithm can be justified if we define the differential reward
by

v_i(θ) = lim_{N→∞} E_θ[ Σ_{k=0}^{N} (g_{i_k}(θ) − λ(θ)) | i_0 = i ],

instead of Eq. (5), approximate it with

v_i(θ) ≈ E_θ[ Σ_{k=0}^{∞} α^k (g_{i_k}(θ) − λ(θ)) | i_0 = i ]

(which is increasingly accurate as α ↑ 1), use the estimate

ṽ_{i_n}(θ, λ̃) = Σ_{k=n}^{∞} α^{k−n} (g_{i_k}(θ) − λ̃)

instead of Eq. (6), and then argue similarly to Section 3. The analysis of this algorithm will
be reported elsewhere.
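The variance-reduction mechanism is easy to illustrate: whenever the per-step likelihood-ratio terms are bounded by some c, the discounted trace stays bounded by c/(1 − α) regardless of how long the process goes without visiting i*, whereas the resetting trace of Section 3.4 can grow with the cycle length. The score sequence below is arbitrary and purely illustrative.

```python
def update_trace(z, score, alpha):
    """Discounted eligibility trace: z <- alpha * z + score, never reset."""
    return alpha * z + score

# With |score| <= c at every step, |z| <= alpha*|z| + c, so sup |z| <= c/(1-alpha).
alpha, c = 0.9, 1.0
z = 0.0
history = []
for k in range(1000):
    score = c if k % 2 == 0 else -0.5 * c   # arbitrary bounded score values
    z = update_trace(z, score, alpha)
    history.append(z)
```

The price for this boundedness is the bias discussed above: the trace now mixes contributions from past renewal cycles instead of being restarted at each visit to i*.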
5 Markov Decision Processes
In this section, we indicate how to apply our methodology to Markov Decision Processes.
We consider a Markov Decision Process [Ber95a, Put94] with finite state space S =
{1, ..., N} and finite action space U = {1, ..., L}. At any state i, the choice of a control
action u ∈ U determines the probability p_ij(u) that the next state is j. The immediate
reward at each time step is of the form g_i(u), where i and u are the current state and action,
respectively.
A (randomized) policy is defined as a mapping

μ : S → [0, 1]^L,

with components μ_u(i) such that

Σ_{u∈U} μ_u(i) = 1,   ∀ i ∈ S.

Under a policy μ, and whenever the state is equal to i, action u is chosen with probability
μ_u(i), independent of everything else. If for every state i there exists a single u for which
μ_u(i) is positive (and, therefore, unity), we say that we have a pure policy.
For problems involving very large state spaces, it is impossible to even describe an
arbitrary pure policy μ, since this requires a listing of the actions corresponding to every
state. This leads us to consider policies described in terms of a parameter vector θ =
(θ_1, ..., θ_K), whose dimension K is tractably small. We are interested in a method that
performs small incremental updates of the parameter θ. A method of this type can work
only if the policy has a smooth dependence on θ, and this is the main reason why we choose
to work with randomized policies.
We allow θ to be an arbitrary element of ℝ^K. With every θ ∈ ℝ^K, we associate a
randomized policy μ(θ), which at any given state i chooses action u with probability μ_u(i, θ).
Naturally, we require that every μ_u(i, θ) be nonnegative and that Σ_{u∈U} μ_u(i, θ) = 1. Note
that the resulting transition probabilities are given by

p_ij(θ) = Σ_{u∈U} μ_u(i, θ) p_ij(u),   (22)

and the expected reward per stage is given by

g_i(θ) = Σ_{u∈U} μ_u(i, θ) g_i(u).
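Equation (22) and the expected one-stage reward can be assembled mechanically once a policy parametrization is fixed. The sketch below uses a tabular softmax parametrization of μ_u(i, θ), which is an illustrative choice on our part (the paper does not prescribe one); on any bounded parameter range a softmax keeps every action probability strictly positive, in the spirit of Assumption 6 below.

```python
import numpy as np

def softmax_policy(theta_table, i):
    """mu_u(i, theta) for a hypothetical tabular softmax parametrization."""
    e = np.exp(theta_table[i] - theta_table[i].max())   # numerically stable
    return e / e.sum()

def induced_chain(theta_table, P_u, g_u):
    """Eq. (22): p_ij(theta), together with g_i(theta)."""
    N, L = theta_table.shape
    P_theta = np.zeros((N, N))
    g_theta = np.zeros(N)
    for i in range(N):
        mu = softmax_policy(theta_table, i)
        P_theta[i] = sum(mu[u] * P_u[u][i] for u in range(L))
        g_theta[i] = sum(mu[u] * g_u[i, u] for u in range(L))
    return P_theta, g_theta

# Hypothetical 2-state, 2-action example (not from the paper):
P_u = [np.array([[0.9, 0.1], [0.2, 0.8]]),   # transition matrix under action 0
       np.array([[0.1, 0.9], [0.7, 0.3]])]   # transition matrix under action 1
g_u = np.array([[1.0, 0.0], [0.5, 2.0]])     # g_i(u)
theta_table = np.zeros((2, 2))               # theta = 0 gives the uniform policy
P_theta, g_theta = induced_chain(theta_table, P_u, g_u)
```

At θ = 0 the policy mixes the two actions uniformly, so the induced chain is the arithmetic average of the two per-action matrices.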
The objective is to maximize the average reward under policy μ(θ), which is denoted by
λ(θ). This is a special case of the framework of Section 2. We now discuss the various
assumptions introduced in earlier sections.
In order to satisfy Assumption 1, it suffices to assume that there exists a state i* which
is recurrent under every pure policy, a property which is satisfied in many interesting
problems. In order to satisfy Assumption 2, it suffices to assume that the policy has a
smooth dependence on θ; in particular, that μ_u(i, θ) is twice differentiable (in θ) and has
bounded first and second derivatives. Finally, Assumption 3 is implied by the following
condition.
Assumption 6 There exists some ε > 0 such that μ_u(i, θ) ≥ ε for every i, u, and θ.
Assumption 6 can often be imposed without any loss of generality (or loss of performance). Even when it is not automatically true, it may be profitable to enforce it artificially,
because it introduces a minimal amount of "exploration," and ensures that every action will
be tried infinitely often. Indeed, the available experience with simulation-based methods
for Markov decision processes indicates that performance can substantially degrade in the
absence of exploration: a method may converge to a poor set of policies for the simple
reason that the actions corresponding to better policies have not been sufficiently explored.
Since $\sum_{u \in U} \mu_u(i,\theta) = 1$ for every $\theta$, we have $\sum_{u \in U} \nabla\mu_u(i,\theta) = 0$, and

$$\nabla g_i(\theta) = \sum_{u \in U} \nabla\mu_u(i,\theta)\,(g_i(u) - \lambda(\theta)).$$

Furthermore,

$$\sum_{j \in S} \nabla p_{ij}(\theta)\, v_j(\theta) = \sum_{j \in S} \sum_{u \in U} \nabla\mu_u(i,\theta)\, p_{ij}(u)\, v_j(\theta).$$

Using these relations in the formula for $\nabla\lambda(\theta)$ provided by Proposition 1, and after some rearranging, we obtain

$$\nabla\lambda(\theta) = \sum_{i \in S} \sum_{u \in U} \pi_i(\theta)\, q_{i,u}(\theta)\, \nabla\mu_u(i,\theta),$$

where

$$q_{i,u}(\theta) = (g_i(u) - \lambda(\theta)) + \sum_{j \in S} p_{ij}(u)\, v_j(\theta) = E_\theta\!\left[\sum_{k=0}^{T-1} (g_{i_k}(u_k) - \lambda(\theta)) \,\Big|\, i_0 = i,\ u_0 = u\right],$$

and where $i_k$ and $u_k$ are the state and control at time $k$. Thus, $q_{i,u}(\theta)$ is the differential reward if control action $u$ is first applied in state $i$, and policy $\mu(\theta)$ is followed thereafter. It is the same as Watkins' Q-factor [Wat89], suitably modified for the average reward case.
From here on, we can proceed as in Section 3 and obtain an algorithm that updates $\theta$ at the times $t_m$ that state $i^*$ is visited. The form of the algorithm is

$$\theta_{m+1} = \theta_m + \gamma_m F_m(\theta_m, \lambda_m),$$

$$\lambda_{m+1} = \lambda_m + \gamma_m \sum_{n=t_m}^{t_{m+1}-1} (g_{i_n}(u_n) - \lambda_m),$$

where

$$F_m(\theta_m, \lambda_m) = \sum_{n=t_m}^{t_{m+1}-1} \tilde{q}_n\, \frac{\nabla\mu_{u_n}(i_n, \theta_m)}{\mu_{u_n}(i_n, \theta_m)},$$

and

$$\tilde{q}_n = \sum_{k=n}^{t_{m+1}-1} (g_{i_k}(u_k) - \lambda_m).$$

Similar to Section 4, an on-line version of the algorithm is also possible. The convergence results of Sections 3 and 4 remain valid, with only notational changes in the proofs.
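The episodic update above can be sketched as follows. This is a minimal single-sample-path implementation under our own illustrative choices (a hypothetical two-state, two-action MDP with $i^* = 0$, a softmax policy, and an arbitrary stepsize schedule); it mirrors the form of the $\theta$- and $\lambda$-updates at the regeneration times $t_m$, and is not a tuned implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-state, two-action MDP; i* = state 0 is the recurrent state.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.4, 0.6], [0.7, 0.3]]])   # P[u, i, j] = p_ij(u)
g = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # g[i, u] = g_i(u)

def mu(theta):
    """Softmax randomized policy; mu(theta)[i, u] = mu_u(i, theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_log_mu(theta, i, u):
    """Score function of the softmax: equals grad mu_u(i, theta) / mu_u(i, theta)."""
    grad = np.zeros_like(theta)
    grad[i] = -mu(theta)[i]
    grad[i, u] += 1.0
    return grad

def run_cycle(theta, lam, gamma):
    """Simulate one inter-renewal cycle starting at i* and apply the update."""
    i, traj = 0, []
    while True:
        u = rng.choice(2, p=mu(theta)[i])
        traj.append((i, u))
        i = rng.choice(2, p=P[u, i])
        if i == 0:                              # back at i*: the cycle ends
            break
    rewards = np.array([g[s, a] for s, a in traj])
    F = np.zeros_like(theta)
    for n, (s, a) in enumerate(traj):
        q_tilde = np.sum(rewards[n:] - lam)     # sum_{k >= n} (g_{i_k}(u_k) - lambda_m)
        F += q_tilde * grad_log_mu(theta, s, a)
    return theta + gamma * F, lam + gamma * np.sum(rewards - lam)

theta, lam = np.zeros((2, 2)), 0.0
for m in range(1, 2001):
    theta, lam = run_cycle(theta, lam, gamma=1.0 / (10.0 + m))
print(lam)
```

Because the policy is randomized and smooth in $\theta$, the ratio $\nabla\mu_{u_n}(i_n,\theta)/\mu_{u_n}(i_n,\theta)$ reduces to the familiar score function of the softmax, which is what `grad_log_mu` computes.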
6 Conclusions

We have presented a simulation-based method for optimizing a Markov Reward Process whose transition probabilities depend on a parameter vector $\theta$, or a Markov Decision Process in which optimization is restricted to a parametrized set of randomized policies. The method involves
simulation of a single sample path. Updates can be carried out either whenever the recurrent
state i* is visited, or at every time step.
Regarding further research, there is a need for computational experiments in order to delineate the class of practical problems for which this methodology is useful. In addition, further analysis and experimentation are needed for the modified on-line algorithm of Section 4.1, in order to determine whether it has practical advantages. Finally, it may be
possible to extend the results to the case of infinite state spaces, or to weaken some of our
assumptions.
References

[Ber95a] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, MA, 1995.

[Ber95b] D. P. Bertsekas, Nonlinear Programming, Athena Scientific, Belmont, MA, 1995.

[Bor97] V. S. Borkar, "Stochastic Approximation with Two Time Scales," Systems and Control Letters, Vol. 29, pp. 291-294, 1997.

[BSA83] A. Barto, R. Sutton, and C. Anderson, "Neuron-Like Elements that Can Solve Difficult Learning Control Problems," IEEE Trans. on Systems, Man and Cybernetics, Vol. 13, pp. 835-846, 1983.

[BT96] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.

[BT97] D. P. Bertsekas and J. N. Tsitsiklis, "Gradient Convergence in Gradient Methods," Lab. for Info. and Decision Systems Report LIDS-P-2301, Massachusetts Institute of Technology, Cambridge, MA, 1997.

[CC97] X. R. Cao and H. F. Chen, "Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes," IEEE Trans. on Automatic Control, Vol. 42, pp. 1382-1393, 1997.

[CR94] E. K. P. Chong and P. J. Ramadge, "Stochastic Optimization of Regenerative Systems Using Infinitesimal Perturbation Analysis," IEEE Trans. on Automatic Control, Vol. 39, pp. 1400-1410, 1994.

[CW98] X. R. Cao and Y. W. Wan, "Algorithms for Sensitivity Analysis of Markov Decision Systems through Potentials and Perturbation Realization," to appear in IEEE Trans. on Control Systems Technology, 1998.

[Del96] B. Delyon, "General Results on the Convergence of Stochastic Algorithms," IEEE Trans. on Automatic Control, Vol. 41, pp. 1245-1255, 1996.

[FH94] M. C. Fu and J. Hu, "Smoothed Perturbation Analysis Derivative Estimation for Markov Chains," Operations Research Letters, Vol. 15, pp. 241-251, 1994.

[Gly86] P. W. Glynn, "Stochastic Approximation for Monte Carlo Optimization," Proceedings of the 1986 Winter Simulation Conference, pp. 285-289, 1986.

[HC91] Y. C. Ho and X. R. Cao, Perturbation Analysis of Discrete Event Systems, Kluwer Academic Publisher, Boston, MA, 1991.

[JSJ95] T. Jaakkola, S. P. Singh, and M. I. Jordan, "Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems," Advances in Neural Information Processing Systems, Vol. 7, pp. 345-352, Morgan Kaufmann, San Francisco, CA, 1995.

[Lju77] L. Ljung, "Analysis of Recursive Stochastic Algorithms," IEEE Trans. on Automatic Control, Vol. 22, pp. 551-575, 1977.

[Put94] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley, New York, NY, 1994.

[Rao96] K. V. Rao, "Learning Algorithms for Markov Decision Processes," Master's Thesis, Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore, 1996.

[SB98] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.

[Sut88] R. Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, Vol. 3, pp. 9-44, 1988.

[Tes92] G. J. Tesauro, "Practical Issues in Temporal Difference Learning," Machine Learning, Vol. 8, pp. 257-277, 1992.

[TV96] J. N. Tsitsiklis and B. Van Roy, "Feature-Based Methods for Large-Scale Dynamic Programming," Machine Learning, Vol. 22, pp. 59-94, 1996.

[TV97] J. N. Tsitsiklis and B. Van Roy, "Optimal Stopping of Markov Processes: Hilbert Space Theory, Approximation Algorithms, and an Application to Pricing High-Dimensional Financial Derivatives," Lab. for Info. and Decision Systems Report LIDS-P-2389, Massachusetts Institute of Technology, Cambridge, MA, 1997.

[Wat89] C. Watkins, "Learning from Delayed Rewards," Ph.D. Thesis, Cambridge University, UK, 1989.
A Proof of Proposition 3

In this appendix, we prove convergence of the algorithm

$$\theta_{m+1} = \theta_m + \gamma_m F_m(\theta_m, \lambda_m),$$

$$\lambda_{m+1} = \lambda_m + \gamma_m \sum_{n=t_m}^{t_{m+1}-1} (g_{i_n}(\theta_m) - \lambda_m),$$

where

$$F_m(\theta_m, \lambda_m) = \sum_{n=t_m}^{t_{m+1}-1} \left( \nabla g_{i_n}(\theta_m) + \tilde{v}_{i_n}(\theta_m, \lambda_m)\, \frac{\nabla p_{i_{n-1} i_n}(\theta_m)}{p_{i_{n-1} i_n}(\theta_m)} \right),$$

$$\tilde{v}_{i_n}(\theta, \lambda) = \sum_{k=n}^{t_{m+1}-1} (g_{i_k}(\theta) - \lambda), \qquad n = t_m + 1, \ldots, t_{m+1} - 1,$$

and

$$\tilde{v}_{i_{t_m}}(\theta, \lambda) = 0.$$

For notational convenience, we define the augmented parameter vector $r_m = (\theta_m, \lambda_m)$, and write the update equations in the form

$$r_{m+1} = r_m + \gamma_m H_m(r_m), \tag{23}$$

where

$$H_m(r_m) = \begin{bmatrix} F_m(\theta_m, \lambda_m) \\ \sum_{n=t_m}^{t_{m+1}-1} (g_{i_n}(\theta_m) - \lambda_m) \end{bmatrix}.$$

Let

$$\mathcal{F}_m = \{\theta_0, \lambda_0, i_0, i_1, \ldots, i_{t_m}\}$$

stand for the history of the algorithm up to and including time $t_m$. Using Proposition 2 and Eq. (16), we have

$$E[H_m(r_m) \mid \mathcal{F}_m] = h(r_m),$$

where

$$h(r) = \begin{bmatrix} E_\theta[T]\, \nabla\lambda(\theta) + G(\theta)(\lambda(\theta) - \lambda) \\ E_\theta[T]\,(\lambda(\theta) - \lambda) \end{bmatrix}.$$

We then rewrite the algorithm in the form

$$r_{m+1} = r_m + \gamma_m h(r_m) + \epsilon_m, \tag{24}$$

where

$$\epsilon_m = \gamma_m (H_m(r_m) - h(r_m)),$$

and note that

$$E[\epsilon_m \mid \mathcal{F}_m] = 0.$$

The proof rests on the fact that $\epsilon_m$ is "small," in a sense to be made precise, which will then allow us to mimic the heuristic argument of Section 3.3.
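The structure of Eq. (24) — a mean-field term plus a conditionally zero-mean perturbation $\epsilon_m$ — can be illustrated on a scalar toy iteration (our own example, unrelated to the Markov reward setting): with stepsizes satisfying Assumption 4 and noise whose second moment scales as $\gamma_m^2$, as in Lemma 3, the iterates converge to the zero of the mean field.

```python
import numpy as np

rng = np.random.default_rng(1)

def h(r):
    """Mean field of the toy iteration; its unique zero (the target) is r = 3."""
    return 3.0 - r

r = 0.0
for m in range(1, 50001):
    gamma = 1.0 / m                 # Assumption 4: sum gamma_m = inf, sum gamma_m^2 < inf
    eps = gamma * rng.normal()      # zero-mean noise with E[eps^2] of order gamma^2 (Lemma 3)
    r = r + gamma * h(r) + eps
print(r)
```

The accumulated noise $\sum_m \epsilon_m$ converges (a martingale effect), so the iteration tracks the deterministic recursion $r_{m+1} = r_m + \gamma_m h(r_m)$, which is the heuristic argument the appendix makes rigorous.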
A.1 Preliminaries

In this subsection, we establish a few useful bounds and characterize the behavior of $\epsilon_m$.

Lemma 2

(a) For any $i, j$ such that $p_{ij}(\theta)$ is nonzero, the function $\nabla p_{ij}(\theta)/p_{ij}(\theta)$ is bounded and has bounded first derivatives.

(b) There exist constants $C$ and $\rho < 1$ such that

$$P_\theta(T = k) \leq C\rho^k, \qquad \forall\, k, \theta,$$

where the subscript $\theta$ indicates that we are considering the distribution of the inter-renewal time $T_m = t_{m+1} - t_m$ under a particular choice of $\theta$. In particular, $E_\theta[T]$ and $E_\theta[T^2]$ are bounded functions of $\theta$.

(c) The function $G(\theta)$ is well defined and bounded.

(d) The sequence $\lambda_m$ is bounded, with probability 1.

(e) The sequence $h(r_m)$ is bounded, with probability 1.

Proof:

(a) This is true because $p_{ij}(\theta)$ has bounded first and second derivatives (Assumption 2), and $p_{ij}(\theta)$ is bounded below by $\epsilon > 0$ (Assumption 3).

(b) Under Assumptions 1 and 3, and for any state, the probability that $i^*$ is reached in the next $N$ steps is at least $\epsilon^N$ (where $N$ is the number of states), and the result follows.
(c) Note that

$$E_\theta\!\left[\, \Big\| \sum_{n=t_m+1}^{t_{m+1}-1} (t_{m+1} - n)\, \frac{\nabla p_{i_{n-1} i_n}(\theta)}{p_{i_{n-1} i_n}(\theta)} \Big\| \right] \leq C\, E_\theta[T^2],$$

where $C$ is a bound on $\|\nabla p_{ij}(\theta)/p_{ij}(\theta)\|$ (cf. part (a)). The right-hand side is bounded, by the result of part (b). It follows that the expectation defining $G(\theta)$ exists and is a bounded function of $\theta$.
(d) Using Assumption 4 and part (b) of this lemma, we obtain

$$E\!\left[\sum_{m=1}^{\infty} \gamma_m^2 (t_{m+1} - t_m)^2\right] < \infty,$$

which implies that $\gamma_m(t_{m+1} - t_m)$ converges to zero, with probability 1. Note that

$$\lambda_{m+1} \leq (1 - \gamma_m(t_{m+1} - t_m))\lambda_m + \gamma_m(t_{m+1} - t_m)\, C,$$

where $C$ is a bound on $|g_i(\theta)|$. For large enough $m$, we have $\gamma_m(t_{m+1} - t_m) < 1$, and $\lambda_{m+1} \leq \max\{\lambda_m, C\}$, from which it follows that the sequence $\lambda_m$ is bounded above. By a similar argument, the sequence $\lambda_m$ is also bounded below.
(e) Consider the formula that defines $h(r)$. Parts (b) and (c) show that $E_{\theta_m}[T]$ and $G(\theta_m)$ are bounded. Also, $\lambda(\theta_m)$ is bounded since the $g_i(\theta)$ are bounded (Assumption 2). Furthermore, $\nabla\lambda(\theta_m)$ is bounded, by Lemma 1. Using also part (d) of this lemma, the result follows. □
Lemma 3 There exists a constant $C$ (which is random but finite with probability 1) such that

$$E[\|\epsilon_m\|^2 \mid \mathcal{F}_m] \leq C\gamma_m^2, \qquad \forall\, m,$$

and the series $\sum_m \epsilon_m$ converges with probability 1.

Proof: Recall that $g_{i_n}(\theta_m)$ and $\lambda_m$ are bounded with probability 1 (Assumption 2 and Lemma 2(d)). Thus, for $n = t_m, \ldots, t_{m+1} - 1$, we have $|\tilde{v}_{i_n}(\theta_m, \lambda_m)| \leq C(t_{m+1} - t_m)$, for some constant $C$. Using this bound in the definition of $F_m(\theta_m, \lambda_m)$, we see that for almost all sample paths, we have

$$\|F_m(\theta_m, \lambda_m)\| \leq C(t_{m+1} - t_m)^2,$$

for some new constant $C$. Using Lemma 2(b), the conditional variance of $F_m(\theta_m, \lambda_m)$, given $\mathcal{F}_m$, is bounded. Similar arguments also apply to the last component of $H_m(r_m)$. Since $\epsilon_m = \gamma_m(H_m(r_m) - E[H_m(r_m) \mid \mathcal{F}_m])$, the first statement follows.

Fix a positive integer $c$ and consider the sequence

$$w_n^c = \sum_{m=1}^{\min\{M(c),\, n\}} \epsilon_m,$$

where $M(c)$ is the first time $m$ such that $E[\|\epsilon_m\|^2 \mid \mathcal{F}_m] > c\gamma_m^2$. The sequence $w_n^c$ is a martingale with bounded second moment, and therefore converges with probability 1. This is true for every positive integer $c$. For (almost) every sample path, there exists some $c$ such that $M(c) = \infty$. After discarding a countable union of sets of measure zero (for each $c$, the set of sample paths for which $w^c$ does not converge), it follows that for (almost) every sample path, $\sum_m \epsilon_m$ converges. □
We observe the following consequences of Lemma 3. First, $\epsilon_m$ converges to zero with probability 1. Since $\gamma_m$ also converges to zero and the sequence $h(r_m)$ is bounded, we conclude that

$$\lim_{m\to\infty} (\theta_{m+1} - \theta_m) = 0, \qquad \lim_{m\to\infty} (\lambda(\theta_{m+1}) - \lambda(\theta_m)) = 0, \qquad \lim_{m\to\infty} (\lambda_{m+1} - \lambda_m) = 0,$$

with probability 1.
A.2 Convergence of $\lambda_m$ and $\lambda(\theta_m)$

In this subsection, we prove that $\lambda_m$ and $\lambda(\theta_m)$ converge to a common limit. The flow of the proof is similar to the heuristic argument of Section 3.3.

We will be using a few different Lyapunov functions to analyze the behavior of the algorithm in different "regions." The lemma below involves a generic Lyapunov function $\phi$ and characterizes the changes in $\phi(r)$ caused by the updates

$$r_{m+1} = r_m + \gamma_m h(r_m) + \epsilon_m.$$

Let $D_c = \{(\theta, \lambda) \in \mathbb{R}^{K+1} : |\lambda| \leq c\}$. We are interested in Lyapunov functions $\phi$ that are twice differentiable and such that $\phi$, $\nabla\phi$, and $\nabla^2\phi$ are bounded on $D_c$ for every $c$. Let $\Phi$ be the set of all such Lyapunov functions. For any $\phi \in \Phi$, we define

$$\epsilon_m(\phi) = \phi(r_{m+1}) - \phi(r_m) - \gamma_m \nabla\phi(r_m) \cdot h(r_m),$$

where for any two vectors $a$, $b$, we use $a \cdot b$ to denote their inner product.
Lemma 4 If $\phi \in \Phi$, then the series $\sum_m \epsilon_m(\phi)$ converges with probability 1.

Proof: Consider a sample path of the random sequence $\{r_m\}$. Using part (d) of Lemma 2, and after discarding a set of zero probability, there exists some $c$ such that $r_m \in D_c$ for all $m$. We use the Taylor expansion of $\phi(r)$ at $r_m$, and obtain

$$\epsilon_m(\phi) = \phi(r_{m+1}) - \phi(r_m) - \gamma_m \nabla\phi(r_m) \cdot h(r_m) \leq \nabla\phi(r_m) \cdot (r_{m+1} - r_m) + M\|r_{m+1} - r_m\|^2 - \gamma_m \nabla\phi(r_m) \cdot h(r_m) = \nabla\phi(r_m) \cdot \epsilon_m + M\|r_{m+1} - r_m\|^2,$$

where $M$ is a constant related to the bound on the second derivatives of $\phi(\cdot)$ on the set $D_c$. A symmetric argument also yields

$$\nabla\phi(r_m) \cdot \epsilon_m - M\|r_{m+1} - r_m\|^2 \leq \epsilon_m(\phi).$$

Using the boundedness of $\nabla\phi$ on the set $D_c$, the same martingale argument as in the proof of Lemma 3 shows that the series $\sum_m \nabla\phi(r_m) \cdot \epsilon_m$ converges with probability 1. Note that $\|r_{m+1} - r_m\| = \|\gamma_m h(r_m) + \epsilon_m\|$, which yields

$$\|r_{m+1} - r_m\|^2 \leq 2\gamma_m^2 \|h(r_m)\|^2 + 2\|\epsilon_m\|^2.$$

The sequence $h(r_m)$ is bounded (Lemma 2) and $\gamma_m^2$ is summable (Assumption 4). Furthermore, it is an easy consequence of Lemma 3 that $\epsilon_m$ is also square summable. We conclude that $\|r_{m+1} - r_m\|$ is square summable, and the result follows. □

From now on, we will concentrate on a single sample path for which the sequences $\epsilon_m$ and $\epsilon_m(\phi)$ (for the Lyapunov functions to be considered) are summable. Accordingly, we will be omitting the "with probability 1" qualification.
The next lemma shows that if the error $\lambda_m - \lambda(\theta_m)$ in estimating the average reward is positive but small, then it tends to decrease. The proof uses $\lambda - \lambda(\theta)$ as a Lyapunov function.

Lemma 5 Let $L$ be such that $\|G(\theta)\| \leq L$ for all $\theta$, and let

$$\phi(r) = \phi(\theta, \lambda) = \lambda - \lambda(\theta).$$

We have $\phi \in \Phi$. Furthermore, if $0 \leq \lambda - \lambda(\theta) \leq 1/L^2$, then

$$\nabla\phi(r) \cdot h(r) \leq 0.$$

Proof: The fact that $\phi \in \Phi$ is a consequence of Lemma 1. We now have

$$\nabla\phi(r) \cdot h(r) = -(\lambda - \lambda(\theta))E_\theta[T] - \|\nabla\lambda(\theta)\|^2 E_\theta[T] + (\lambda - \lambda(\theta))\,\nabla\lambda(\theta) \cdot G(\theta).$$

Using the inequality $|a \cdot b| \leq \|a\|^2 + \|b\|^2$ to bound the last term, and the fact $E_\theta[T] \geq 1$, we obtain

$$\nabla\phi(r) \cdot h(r) \leq -(\lambda - \lambda(\theta)) + L^2(\lambda - \lambda(\theta))^2,$$

which is nonpositive as long as $0 \leq \lambda - \lambda(\theta) \leq 1/L^2$. □
In the next two lemmas, we establish that if $|\lambda_m - \lambda(\theta_m)|$ remains small during a certain time interval, then $\lambda_m$ cannot decrease by much. We first introduce a Lyapunov function that captures the behavior of the algorithm when $\lambda \approx \lambda(\theta)$.

Lemma 6 As in Lemma 5, let $L$ be such that $\|G(\theta)\| \leq L$. Let also

$$\phi(r) = \phi(\theta, \lambda) = \lambda(\theta) - L^2(\lambda(\theta) - \lambda)^2.$$

We have $\phi \in \Phi$. Furthermore, if $|\lambda(\theta) - \lambda| \leq 1/4L^2$, then

$$\nabla\phi(r) \cdot h(r) \geq 0.$$

Proof: The fact that $\phi \in \Phi$ is a consequence of Lemma 1. We have

$$\nabla_\theta \phi(\theta, \lambda) = \left(1 - 2L^2(\lambda(\theta) - \lambda)\right)\nabla\lambda(\theta),$$

and

$$\frac{\partial\phi}{\partial\lambda}(\theta, \lambda) = 2L^2(\lambda(\theta) - \lambda).$$

Therefore, assuming that $|\lambda(\theta) - \lambda| \leq 1/4L^2$, and using the Schwarz inequality, we obtain

$$\begin{aligned} \nabla\phi(r) \cdot h(r) &= \left(1 - 2L^2(\lambda(\theta) - \lambda)\right)\left(\|\nabla\lambda(\theta)\|^2 E_\theta[T] + (\lambda(\theta) - \lambda)\, G(\theta) \cdot \nabla\lambda(\theta)\right) + 2L^2(\lambda(\theta) - \lambda)^2 E_\theta[T] \\ &\geq \frac{1}{2}\|\nabla\lambda(\theta)\|^2 - \frac{3}{2}|\lambda(\theta) - \lambda|\, L\, \|\nabla\lambda(\theta)\| + 2L^2(\lambda(\theta) - \lambda)^2 \\ &\geq 0. \end{aligned}$$

□
Lemma 7 Consider the same function $\phi$ as in Lemma 6, and the same constant $L$. Let $\alpha$ be some positive scalar smaller than $1/4L^2$. Suppose that for some integers $n$ and $n'$, with $n' > n$, we have

$$|\lambda(\theta_n) - \lambda_n| \leq \alpha, \qquad |\lambda(\theta_{n'}) - \lambda_{n'}| \leq \alpha,$$

and

$$|\lambda(\theta_m) - \lambda_m| \leq \frac{1}{4L^2}, \qquad m = n+1, \ldots, n'-1.$$

Then,

$$\lambda_{n'} \geq \lambda_n - 2\alpha(L^2\alpha + 1) + \sum_{m=n}^{n'-1} \epsilon_m(\phi).$$

Proof: Using Lemma 6, we have

$$\nabla\phi(r_m) \cdot h(r_m) \geq 0, \qquad m = n, \ldots, n'-1.$$

Therefore, for $m = n, \ldots, n'-1$, we have

$$\phi(r_{m+1}) = \phi(r_m) + \gamma_m \nabla\phi(r_m) \cdot h(r_m) + \epsilon_m(\phi) \geq \phi(r_m) + \epsilon_m(\phi),$$

and

$$\phi(r_{n'}) \geq \phi(r_n) + \sum_{m=n}^{n'-1} \epsilon_m(\phi). \tag{25}$$

Note that $|\phi(r_n) - \lambda_n| \leq L^2\alpha^2 + \alpha$, and $|\phi(r_{n'}) - \lambda_{n'}| \leq L^2\alpha^2 + \alpha$. Using these inequalities in Eq. (25), we obtain the desired result. □
Lemma 8 We have $\liminf_{m\to\infty} |\lambda(\theta_m) - \lambda_m| = 0$.

Proof: Suppose that the result is not true, and we will derive a contradiction. Since $\lambda(\theta_{m+1}) - \lambda(\theta_m)$ and $\lambda_{m+1} - \lambda_m$ converge to zero, there exists a scalar $\epsilon > 0$ and an integer $n$ such that either $\lambda(\theta_m) - \lambda_m > \epsilon$, or $\lambda(\theta_m) - \lambda_m < -\epsilon$, for all $m \geq n$. Without loss of generality, let us consider the first possibility.

Recall that the update equation for $\lambda$ is of the form

$$\lambda_{m+1} = \lambda_m + \gamma_m E_{\theta_m}[T](\lambda(\theta_m) - \lambda_m) + \delta_m,$$

where $\delta_m$ is the last component of the vector $\epsilon_m$, which is summable by Lemma 3. Given that $\lambda(\theta_m) - \lambda_m$ stays above $\epsilon$, the sequence $\gamma_m(\lambda(\theta_m) - \lambda_m)$ sums to infinity. As $\delta_m$ is summable, we conclude that $\lambda_m$ converges to infinity, which contradicts the fact that it is bounded. □
The next lemma shows that the condition $\lambda(\theta_m) \geq \lambda_m$ is satisfied, in the limit.

Lemma 9 We have $\liminf_{m\to\infty}(\lambda(\theta_m) - \lambda_m) \geq 0$.

Proof: Suppose the contrary. Then, there exists some $\epsilon > 0$ such that the inequality

$$\lambda_m - \lambda(\theta_m) > \epsilon$$

holds infinitely often. Let $\beta = \min\{\epsilon, 1/L^2\}$, where $L$ is the constant of Lemma 5. Using Lemma 8, we conclude that $\lambda_m - \lambda(\theta_m)$ crosses infinitely often from a value smaller than $\beta/3$ to a value larger than $2\beta/3$. In particular, there exist infinitely many pairs $n, n'$, with $n' > n$, such that

$$0 \leq \lambda_n - \lambda(\theta_n) \leq \frac{1}{3}\beta, \qquad \lambda_{n'} - \lambda(\theta_{n'}) > \frac{2}{3}\beta,$$

and

$$\frac{1}{3}\beta \leq \lambda_m - \lambda(\theta_m) \leq \frac{2}{3}\beta, \qquad m = n+1, \ldots, n'-1.$$

We use the Lyapunov function

$$\phi(r) = \phi(\theta, \lambda) = \lambda - \lambda(\theta),$$

and note that

$$\phi(r_{n'}) > \phi(r_n) + \frac{\beta}{3}. \tag{26}$$

For $m = n, \ldots, n'-1$, we have $0 \leq \lambda_m - \lambda(\theta_m) \leq \beta \leq 1/L^2$. Lemma 5 applies and shows that $\nabla\phi(r_m) \cdot h(r_m) \leq 0$. Therefore,

$$\phi(r_{n'}) = \phi(r_n) + \sum_{m=n}^{n'-1}\left(\gamma_m \nabla\phi(r_m) \cdot h(r_m) + \epsilon_m(\phi)\right) \leq \phi(r_n) + \sum_{m=n}^{n'-1} \epsilon_m(\phi).$$

By Lemma 4, $\sum_m \epsilon_m(\phi)$ converges, which implies that $\sum_{m=n}^{n'-1} \epsilon_m(\phi)$ becomes arbitrarily small as $n$ increases. This contradicts Eq. (26) and completes the proof. □
We now continue with the central step in the proof, which is to show that $\lim_{m\to\infty}(\lambda(\theta_m) - \lambda_m) = 0$. Using Lemma 9, it suffices to show that we cannot have $\limsup_{m\to\infty}(\lambda(\theta_m) - \lambda_m) > 0$. The main idea is the following. Whenever $\lambda(\theta_m)$ becomes significantly larger than $\lambda_m$, then $\lambda_m$ is bound to increase significantly. On the other hand, by Lemma 7, whenever $\lambda(\theta_m)$ is approximately equal to $\lambda_m$, then $\lambda_m$ cannot decrease by much. Since $\lambda_m$ is bounded, this will imply that $\lambda(\theta_m)$ can become significantly larger than $\lambda_m$ only a finite number of times.

Lemma 10 We have $\lim_{m\to\infty}(\lambda(\theta_m) - \lambda_m) = 0$.
Proof: We will assume the contrary and derive a contradiction. By Lemma 9, we have $\liminf_{m\to\infty}(\lambda(\theta_m) - \lambda_m) \geq 0$. So if the desired result is not true, we must have $\limsup_{m\to\infty}(\lambda(\theta_m) - \lambda_m) > 0$, which we will assume to be the case. In particular, there is some $\Delta > 0$ such that $\lambda(\theta_m) - \lambda_m > \Delta$, infinitely often. Without loss of generality, we assume that $\Delta < 1/4L^2$, where $L$ is the constant of Lemmas 5 and 6. Let $\alpha > 0$ be some small constant (with $\alpha < \Delta/2$), to be specified later. Using Lemma 9, we have $\lambda(\theta_m) - \lambda_m > -\alpha$ for all large enough $m$. In addition, by Lemma 8, the condition $|\lambda(\theta_m) - \lambda_m| \leq \alpha$ holds infinitely often. Thus, the algorithm can be broken down into a sequence of cycles, where at the beginning and at the end of each cycle we have $|\lambda(\theta_m) - \lambda_m| \leq \alpha$, while the condition $\lambda(\theta_m) - \lambda_m > \Delta$ holds at some intermediate time in the cycle.

We describe the stages of such a cycle more precisely. A typical cycle starts at some time $N$ with $|\lambda(\theta_N) - \lambda_N| \leq \alpha$. Let $n''$ be the first time after time $N$ that $\lambda(\theta_{n''}) - \lambda_{n''} > \Delta$. Let $n'$ be the last time before $n''$ such that $\lambda(\theta_{n'}) - \lambda_{n'} \leq \Delta/2$. Let also $n$ be the last time before $n'$ such that $\lambda(\theta_n) - \lambda_n \leq \alpha$. Finally, let $n'''$ be the first time after $n''$ such that $|\lambda(\theta_{n'''}) - \lambda_{n'''}| \leq \alpha$. The time $n'''$ is the end of the cycle and marks the beginning of a new cycle.

Recall that the changes in $\theta_m$ and $\lambda_m$ converge to zero. For this reason, by taking $N$ to be large enough, we can assume that $\lambda(\theta_n) - \lambda_n \geq 0$. To summarize our construction, we have $N \leq n < n' < n'' < n'''$, and

$$|\lambda(\theta_N) - \lambda_N| \leq \alpha, \qquad 0 \leq \lambda(\theta_n) - \lambda_n \leq \alpha, \qquad \lambda(\theta_{n''}) - \lambda_{n''} > \Delta,$$

$$|\lambda(\theta_m) - \lambda_m| \leq \Delta, \qquad m = N, \ldots, n''-1,$$

$$\alpha < \lambda(\theta_m) - \lambda_m \leq \Delta, \qquad m = n+1, \ldots, n''-1,$$

$$\frac{\Delta}{2} < \lambda(\theta_m) - \lambda_m \leq \Delta, \qquad m = n'+1, \ldots, n''-1,$$

$$\alpha < \lambda(\theta_m) - \lambda_m, \qquad m = n'', \ldots, n'''-1.$$
Our argument will use the Lyapunov functions

$$\phi(r) = \phi(\theta, \lambda) = \lambda(\theta) - L^2(\lambda(\theta) - \lambda)^2,$$

where $L$ is as in Lemmas 5 and 6, and

$$\psi(r) = \psi(\theta, \lambda) = \lambda - \lambda(\theta).$$

We have

$$\epsilon_m(\phi) = \phi(r_{m+1}) - \phi(r_m) - \gamma_m \nabla\phi(r_m) \cdot h(r_m),$$

and we define $\epsilon_m(\psi)$ by a similar formula. By Lemma 4, the series $\sum_m \epsilon_m(\phi)$ and $\sum_m \epsilon_m(\psi)$ converge. Also, let

$$\delta_m = \lambda_{m+1} - \lambda_m - \gamma_m E_{\theta_m}[T](\lambda(\theta_m) - \lambda_m).$$

We observe that $\delta_m$ is the last component of $\epsilon_m$ and therefore, the series $\sum_m \delta_m$ converges and $\lim_{m\to\infty} \delta_m = 0$. Finally, let $C$ be a constant such that $|\nabla\psi(r_m) \cdot h(r_m)| \leq C$ for all $m$, which exists because $\psi \in \Phi$ and because the sequences $h(r_m)$ and $\lambda_m$ are bounded.

Using the above observations, we see that if the beginning time $N$ of a cycle is chosen large enough, then for any $k, k'$ such that $N \leq k \leq k'$, we have

$$\gamma_k C \leq \frac{\Delta}{32}, \qquad \Big|\sum_{m=k}^{k'} \epsilon_m(\psi)\Big| \leq \frac{\Delta}{32}, \qquad \Big|\sum_{m=k}^{k'} \epsilon_m(\phi)\Big| \leq \frac{\Delta^2}{96C}, \qquad \Big|\sum_{m=k}^{k'} \delta_m\Big| \leq \frac{\Delta^2}{96C}.$$

Finally, we assume that $\alpha$ has been chosen small enough so that

$$2(\alpha + L^2\alpha^2) \leq \frac{\Delta^2}{96C}.$$

Using the fact that $\lambda(\theta_{n'+1}) - \lambda_{n'+1} > \Delta/2$, we have

$$\lambda(\theta_{n'}) - \lambda_{n'} = \lambda(\theta_{n'+1}) - \lambda_{n'+1} + \gamma_{n'} \nabla\psi(r_{n'}) \cdot h(r_{n'}) + \epsilon_{n'}(\psi) \geq \frac{\Delta}{2} - \frac{\Delta}{16}.$$
Furthermore, we have

$$-\Delta > \psi(r_{n''}) = \psi(r_{n'}) + \sum_{m=n'}^{n''-1}\left(\gamma_m \nabla\psi(r_m) \cdot h(r_m) + \epsilon_m(\psi)\right) \geq -\frac{\Delta}{2} - C\sum_{m=n'}^{n''-1}\gamma_m - \frac{\Delta}{32},$$

which implies that

$$\sum_{m=n'}^{n''-1} \gamma_m \geq \frac{\Delta}{2C} - \frac{\Delta}{32C}.$$

Then,

$$\begin{aligned} \lambda_{n'''} &= \lambda_n + \sum_{m=n}^{n'''-1} \gamma_m E_{\theta_m}[T](\lambda(\theta_m) - \lambda_m) + \sum_{m=n}^{n'''-1} \delta_m \\ &\geq \lambda_n + \left(\frac{\Delta}{2C} - \frac{\Delta}{32C}\right)\left(\frac{\Delta}{2} - \frac{\Delta}{16}\right) - \frac{\Delta^2}{96C} \\ &\geq \lambda_n + \frac{\Delta^2}{8C} - \frac{\Delta^2}{96C} \\ &\geq \lambda_n + \frac{\Delta^2}{24C}, \end{aligned}$$

where we used the facts that every term $\lambda(\theta_m) - \lambda_m$ in the first sum is nonnegative, that $\lambda(\theta_m) - \lambda_m \geq \Delta/2 - \Delta/16$ for $m = n', \ldots, n''-1$, and that $E_{\theta_m}[T] \geq 1$.
We have shown so far that $\lambda_m$ has a substantial increase between times $n$ and $n'''$. We now show that $\lambda_m$ can only have a small decrease in the time between $N$ and $n$. Indeed, by Lemma 7, we have

$$\lambda_n \geq \lambda_N - 2\alpha(L^2\alpha + 1) + \sum_{m=N}^{n-1} \epsilon_m(\phi).$$

By combining these two properties, we obtain

$$\lambda_{n'''} \geq \lambda_N - 2(\alpha + L^2\alpha^2) - \frac{\Delta^2}{96C} + \frac{\Delta^2}{24C} \geq \lambda_N + \frac{\Delta^2}{48C}.$$

We have shown that $\lambda_m$ increases by a positive amount during each cycle. Since $\lambda_m$ is bounded above, this proves that there can only be a finite number of cycles, and a contradiction has been obtained. □
Lemma 11 The sequences $\lambda_m$ and $\lambda(\theta_m)$ converge.

Proof: Consider the function $\phi(r) = \lambda(\theta) - L^2(\lambda(\theta) - \lambda)^2$, and the same constant $L$ as in Lemma 6. Let $\alpha$ be a scalar such that $0 < \alpha < 1/4L^2$. By the preceding lemma and by Lemma 4, there exists some $N$ such that if $N \leq n < n'$, we have

$$|\lambda(\theta_n) - \lambda_n| \leq \alpha,$$

and

$$\Big|\sum_{m=n}^{n'-1} \epsilon_m(\phi)\Big| \leq \alpha.$$

Using Lemma 6,

$$\phi(r_{n'}) \geq \phi(r_n) + \sum_{m=n}^{n'-1} \epsilon_m(\phi) \geq \phi(r_n) - \alpha, \qquad N \leq n < n',$$

or

$$\lambda(\theta_{n'}) - L^2(\lambda(\theta_{n'}) - \lambda_{n'})^2 \geq \lambda(\theta_n) - L^2(\lambda(\theta_n) - \lambda_n)^2 - \alpha,$$

which implies

$$\lambda(\theta_{n'}) \geq \lambda(\theta_n) - L^2\alpha^2 - \alpha, \qquad N \leq n < n'.$$

Therefore,

$$\liminf_{n'\to\infty} \lambda(\theta_{n'}) \geq \lambda(\theta_n) - L^2\alpha^2 - \alpha, \qquad N \leq n,$$

and this implies that

$$\liminf_{m\to\infty} \lambda(\theta_m) \geq \limsup_{m\to\infty} \lambda(\theta_m) - L^2\alpha^2 - \alpha.$$

Since $\alpha$ can be chosen arbitrarily small, we have $\liminf_{m\to\infty} \lambda(\theta_m) \geq \limsup_{m\to\infty} \lambda(\theta_m)$, and since the sequence $\lambda(\theta_m)$ is bounded, we conclude that it converges. Using also Lemma 10, it follows that $\lambda_m$ converges as well. □
A.3 Convergence of $\nabla\lambda(\theta_m)$

In the preceding subsection, we have shown that $\lambda(\theta_m)$ and $\lambda_m$ converge to a common limit. It now remains to show that $\nabla\lambda(\theta_m)$ converges to zero.

Since $\lambda(\theta_m) - \lambda_m$ converges to zero, the algorithm is of the form

$$\theta_{m+1} = \theta_m + \gamma_m E_{\theta_m}[T]\left(\nabla\lambda(\theta_m) + e_m\right) + \epsilon_m,$$

where $e_m$ converges to zero and $\epsilon_m$ is a summable sequence. This is a gradient method with errors, similar to the methods considered in [Del96] and [BT97]. However, [Del96] assumes the boundedness of the sequence of iterates, and the results of [BT97] do not include the term $e_m$. Thus, while the situation is very similar to that considered in these references, a separate proof is needed.

We will first show that $\liminf_{m\to\infty} \|\nabla\lambda(\theta_m)\| = 0$. Suppose the contrary. Then, there exists some $\epsilon > 0$ and some $N$ such that $\|\nabla\lambda(\theta_m)\| \geq \epsilon$ for all $m \geq N$. In addition, by taking $N$ large enough, we can also assume that $\|e_m\| \leq \epsilon/2$. Then, it is easily checked that

$$\nabla\lambda(\theta_m) \cdot \left(\nabla\lambda(\theta_m) + e_m\right) \geq \frac{\epsilon^2}{2}.$$

Let $\phi(r) = \lambda(\theta)$. Note that $\phi \in \Phi$. We have

$$\lambda(\theta_{m+1}) = \lambda(\theta_m) + \gamma_m E_{\theta_m}[T]\, \nabla\lambda(\theta_m) \cdot \left(\nabla\lambda(\theta_m) + e_m\right) + \epsilon_m(\phi) \geq \lambda(\theta_m) + \gamma_m \frac{\epsilon^2}{2} + \epsilon_m(\phi). \tag{27}$$

Since $\epsilon_m(\phi)$ is summable (Lemma 4), but $\sum_m \gamma_m = \infty$, we conclude that $\lambda(\theta_m)$ converges to infinity, which is a contradiction.
Next, we show that $\limsup_{m\to\infty} \|\nabla\lambda(\theta_m)\| = 0$. Suppose the contrary. Then, there exists some $\epsilon > 0$ such that $\|\nabla\lambda(\theta_n)\| \geq \epsilon$ for infinitely many indices $n$. For any such $n$, let $n'$ be the first subsequent time that $\|\nabla\lambda(\theta_{n'})\| \leq \epsilon/2$. Then,

$$\begin{aligned} \frac{\epsilon}{2} &\leq \|\nabla\lambda(\theta_n)\| - \|\nabla\lambda(\theta_{n'})\| \\ &\leq \|\nabla\lambda(\theta_n) - \nabla\lambda(\theta_{n'})\| \\ &\leq C\|r_n - r_{n'}\| \\ &\leq C\Big\|\sum_{m=n}^{n'-1} \gamma_m h(r_m) + \sum_{m=n}^{n'-1} \epsilon_m\Big\| \\ &\leq C\sum_{m=n}^{n'-1} \gamma_m \|h(r_m)\| + C\Big\|\sum_{m=n}^{n'-1} \epsilon_m\Big\|, \end{aligned}$$

for some constant $C$, as $\nabla^2\lambda(\theta)$ is bounded (Lemma 1). Recall that $\|h(r_m)\|$ is bounded by some constant $B$. Furthermore, when $n$ is large enough, the summability of the sequence $\epsilon_m$ yields $C\|\sum_{m=n}^{n'-1} \epsilon_m\| \leq \epsilon/4$. This implies that $\sum_{m=n}^{n'-1} \gamma_m \geq \epsilon/4CB$. By an argument very similar to the one that led to Eq. (27), it is easily shown that there exists some $\beta > 0$ such that

$$\lambda(\theta_{n'}) \geq \lambda(\theta_n) + \beta,$$

which contradicts the convergence of the sequence $\lambda(\theta_m)$. □

B Proof of Proposition 4
In this section, we prove the convergence of the on-line method introduced in Section 4, which is described by

$$\theta_{k+1} = \theta_k + \gamma_k\left(\nabla g_{i_k}(\theta_k) + (g_{i_k}(\theta_k) - \lambda_k)\, z_k\right),$$

$$\lambda_{k+1} = \lambda_k + \gamma_k(g_{i_k}(\theta_k) - \lambda_k),$$

$$z_{k+1} = \begin{cases} 0, & \text{if } i_{k+1} = i^*, \\[4pt] z_k + \dfrac{\nabla p_{i_k i_{k+1}}(\theta_k)}{p_{i_k i_{k+1}}(\theta_k)}, & \text{otherwise.} \end{cases}$$

The proof has many common elements with the proof of Proposition 3. For this reason, we will only discuss the differences in the two proofs. In addition, whenever routine arguments are used, we will only provide an outline.
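The on-line recursion displayed above can be sketched directly. The two-state chain below, whose transition probability out of $i^* = 0$ is a sigmoid in a scalar $\theta$, is our own illustrative construction (rewards here do not depend on $\theta$, so the $\nabla g_{i_k}(\theta_k)$ term vanishes); only the shape of the $(\theta_k, \lambda_k, z_k)$ update follows the display.

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def step(i, theta):
    """Two-state chain: p_01(theta) = sigmoid(theta), p_00 = 1 - sigmoid(theta);
    state 1 moves to 0 or stays with probability 1/2 each (independent of theta).
    Returns the next state and d/dtheta log p_ij(theta) for the sampled move."""
    if i == 0:
        p = sigmoid(theta)
        j = 1 if rng.random() < p else 0
        dlogp = (1.0 - p) if j == 1 else -p
    else:
        j = 0 if rng.random() < 0.5 else 1
        dlogp = 0.0
    return j, dlogp

theta, lam, z, i = 0.0, 0.0, 0.0, 0       # start at the recurrent state i* = 0
for k in range(1, 100001):
    gamma = 1.0 / (100.0 + k)
    g_i = 1.0 if i == 1 else 0.0          # g_0 = 0, g_1 = 1, independent of theta
    theta += gamma * (g_i - lam) * z      # grad g_i(theta) = 0 in this example
    lam += gamma * (g_i - lam)
    j, dlogp = step(i, theta)
    z = 0.0 if j == 0 else z + dlogp      # eligibility resets at visits to i*
    i = j
print(theta, lam)
```

Here the average reward equals the stationary probability of state 1, which increases with $\theta$, so the stochastic gradient tends to push $\theta$ upward while $\lambda_k$ tracks the running average reward.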
As in Appendix A, we let $r_k = (\theta_k, \lambda_k)$. Note, however, the different meaning of the index $k$, which is now advanced at each time step, whereas in Appendix A it was advanced whenever the state $i^*$ was visited. We also define an augmented state $x_k = (i_k, z_k)$.

We rewrite the update equations as

$$r_{k+1} = r_k + \gamma_k R(x_k, r_k),$$

where

$$R(x_k, r_k) = \begin{bmatrix} \nabla g_{i_k}(\theta_k) + (g_{i_k}(\theta_k) - \lambda_k)\, z_k \\ g_{i_k}(\theta_k) - \lambda_k \end{bmatrix}. \tag{28}$$

Consider the sequence of states $(i_0, i_1, \ldots)$ visited during the execution of the algorithm. As in Section 3, we let $t_m$ be the $m$th time that the recurrent state $i^*$ is visited. Also, as in Appendix A, we let

$$\mathcal{F}_m = \{\theta_0, \lambda_0, i_0, \ldots, i_{t_m}\}$$

stand for the history of the algorithm up to and including time $t_m$.

The parameter $\theta_k$ keeps changing between visits to state $i^*$, which is a situation somewhat different from that considered in Lemma 2(b). Nevertheless, the same argument applies and shows that for any positive integer $s$, there exists a constant $D_s$ such that

$$E[(t_{m+1} - t_m)^s \mid \mathcal{F}_m] \leq D_s. \tag{29}$$

We have

$$r_{t_{m+1}} = r_{t_m} + \sum_{k=t_m}^{t_{m+1}-1} \gamma_k R(x_k, r_k) = r_{t_m} + \bar{\gamma}_m \bar{h}(r_{t_m}) + \epsilon_m, \tag{30}$$

where $\bar{\gamma}_m$ and $\epsilon_m$ are given by

$$\bar{\gamma}_m = \sum_{k=t_m}^{t_{m+1}-1} \gamma_k, \tag{31}$$

$$\epsilon_m = \sum_{k=t_m}^{t_{m+1}-1} \gamma_k\left(R(x_k, r_k) - \bar{h}(r_{t_m})\right), \tag{32}$$

and $\bar{h}$ is a scaled version of the function $h$ in Appendix A, namely,

$$\bar{h}(r) = \begin{bmatrix} \nabla\lambda(\theta) + \dfrac{G(\theta)}{E_\theta[T]}(\lambda(\theta) - \lambda) \\[6pt] \lambda(\theta) - \lambda \end{bmatrix}.$$

We note the following property of the various stepsize parameters.
Lemma 12

(a) For any positive integer $s$, we have

$$\sum_{m=1}^{\infty} E\left[\gamma_{t_m}^2 (t_{m+1} - t_m)^s\right] < \infty.$$

(b) We have

$$\sum_{m=1}^{\infty} \bar{\gamma}_m = \infty, \qquad \sum_{m=1}^{\infty} \bar{\gamma}_m^2 < \infty,$$

with probability 1.

Proof: (a) From Eq. (29), and because $\gamma_{t_m}$ is $\mathcal{F}_m$-measurable, we have

$$E\left[\gamma_{t_m}^2 (t_{m+1} - t_m)^s\right] = E\left[\gamma_{t_m}^2\, E[(t_{m+1} - t_m)^s \mid \mathcal{F}_m]\right] \leq E[\gamma_{t_m}^2]\, D_s.$$

Hence,

$$\sum_{m=1}^{\infty} E\left[\gamma_{t_m}^2 (t_{m+1} - t_m)^s\right] \leq D_s \sum_{k=1}^{\infty} \gamma_k^2 < \infty,$$

and the result follows.

(b) By Assumption 4, we have

$$\sum_{m=1}^{\infty} \bar{\gamma}_m = \sum_{k=1}^{\infty} \gamma_k = \infty.$$

Furthermore, since the sequence $\gamma_k$ is nonincreasing (Assumption 5), we have

$$\bar{\gamma}_m^2 \leq \gamma_{t_m}^2 (t_{m+1} - t_m)^2.$$

Using part (a) of the lemma, we obtain that $\sum_{m=1}^{\infty} \bar{\gamma}_m^2$ has finite expectation and is therefore finite with probability 1. □
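As a numerical sanity check on Lemma 12(b), the sketch below aggregates the standard stepsizes $\gamma_k = 1/k$ over simulated renewal intervals with geometrically distributed lengths (our own illustrative gap distribution, mimicking the geometric tail of Lemma 2(b)): the aggregated stepsizes $\bar{\gamma}_m$ still sum to infinity while remaining square summable.

```python
import numpy as np

rng = np.random.default_rng(3)

K = 200000
gamma = 1.0 / np.arange(1, K + 1)      # gamma_k = 1/k: sum diverges, sum of squares converges

# Renewal times t_m with i.i.d. geometric inter-renewal gaps (illustrative choice).
gaps = rng.geometric(p=0.3, size=K)
t = np.cumsum(gaps)
starts = np.concatenate(([0], t[t < K]))

# Aggregated stepsizes: bar_gamma_m = sum of gamma_k over the m-th renewal interval.
bar_gamma = np.add.reduceat(gamma, starts)

print(bar_gamma.sum(), (bar_gamma ** 2).sum())
```

The first printed quantity equals the full harmonic sum (the intervals partition $\{1, \ldots, K\}$) and grows without bound as $K$ increases, while the second stays bounded, as Lemma 12(b) asserts.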
Without loss of generality, we assume that $\gamma_k \leq 1$ for all $k$. Then, the update equation for $\lambda_k$ implies that $|\lambda_k| \leq \max\{|\lambda_0|, C\}$, where $C$ is a bound on $|g_i(\theta)|$. Thus, $|\lambda_k|$ is bounded by a deterministic constant, which implies that the magnitude of $\bar{h}(r_k)$ is also bounded by a deterministic constant.
We now observe that Eq. (30) is of the same form as Eq. (24) that was studied in the preceding appendix, except that we now have $r_{t_m}$ in place of $r_m$, $\bar{\gamma}_m$ in place of $\gamma_m$, and $\bar{h}(r_{t_m})$ in place of $h(r_m)$. By Lemma 12(b), the new stepsizes satisfy the same conditions as those imposed by Assumption 4 on the stepsizes $\gamma_m$ of Appendix A. Also, in the next subsection, we show that the series $\sum_m \epsilon_m$ converges. Once these properties are established, the arguments in Appendix A remain valid and show that $\lambda(\theta_{t_m})$ converges, and that $\nabla\lambda(\theta_{t_m})$ converges to zero. Furthermore, we will see in the next subsection that the total change of $\theta_k$ between visits to $i^*$ converges to zero. This implies that $\lambda(\theta_k)$ converges and that $\nabla\lambda(\theta_k)$ converges to zero, and Proposition 4 is established.
B.1 Summability of $\epsilon_m$ and Convergence of the Changes in $\theta_k$

This subsection is devoted to the proof that the series $\sum_m \epsilon_m$ converges, and that the changes of $\theta_k$ between visits to $i^*$ converge to zero.

We introduce some more notation. The evolution of the augmented state $x_k = (i_k, z_k)$ is affected by the fact that $\theta_k$ changes at each time step. Given a time $t_m$ at which $i^*$ is visited, we define a "frozen" augmented state $x_k^F = (i_k^F, z_k^F)$ which evolves the same way as $x_k$ except that $\theta_k$ is held fixed at $\theta_{t_m}$ until the next visit to $i^*$. More precisely, we let $x_{t_m}^F = x_{t_m}$. Then, for $k \geq t_m + 1$, we let $i_k^F$ evolve as a time-homogeneous Markov chain with transition probabilities $p_{ij}(\theta_{t_m})$. We also let $t_{m+1}^F = \min\{k > t_m : i_k^F = i^*\}$ be the first time after $t_m$ that $i_k^F$ is equal to $i^*$, and

$$z_{k+1}^F = z_k^F + \frac{\nabla p_{i_k^F i_{k+1}^F}(\theta_{t_m})}{p_{i_k^F i_{k+1}^F}(\theta_{t_m})}.$$

We start by breaking down $\epsilon_m$ as follows:

$$\epsilon_m = \sum_{k=t_m}^{t_{m+1}-1} \gamma_k\left(R(x_k, r_k) - \bar{h}(r_{t_m})\right) = \epsilon_m^{(1)} + \epsilon_m^{(2)} + \epsilon_m^{(3)} + \epsilon_m^{(4)} + \epsilon_m^{(5)},$$

where

$$\epsilon_m^{(1)} = \sum_{k=t_m}^{t_{m+1}-1} (\gamma_{t_m} - \gamma_k)\, \bar{h}(r_{t_m}),$$

$$\epsilon_m^{(2)} = \gamma_{t_m} \sum_{k=t_m}^{t_{m+1}^F-1} \left[R(x_k^F, r_{t_m}) - \bar{h}(r_{t_m})\right],$$

$$\epsilon_m^{(3)} = \gamma_{t_m}\left(\sum_{k=t_m}^{t_{m+1}-1} \left[R(x_k, r_{t_m}) - \bar{h}(r_{t_m})\right] - \sum_{k=t_m}^{t_{m+1}^F-1} \left[R(x_k^F, r_{t_m}) - \bar{h}(r_{t_m})\right]\right),$$

$$\epsilon_m^{(4)} = \gamma_{t_m} \sum_{k=t_m}^{t_{m+1}-1} \left[R(x_k, r_k) - R(x_k, r_{t_m})\right],$$

$$\epsilon_m^{(5)} = \sum_{k=t_m}^{t_{m+1}-1} (\gamma_k - \gamma_{t_m})\, R(x_k, r_k).$$
We will show that each one of the series $\sum_m \epsilon_m^{(n)}$, $n = 1, \ldots, 5$, converges with probability 1.

We make the following observations. The ratio $\nabla p_{i_k i_{k+1}}(\theta_k)/p_{i_k i_{k+1}}(\theta_k)$ is bounded because of Assumptions 2 and 3. This implies that between the times $t_m$ and $t_{m+1}$ that $i^*$ is visited, the magnitude of $z_k$ is bounded by $C(t_{m+1} - t_m)$ for some constant $C$. Similarly, the magnitude of $z_k^F$ is bounded by $C(t_{m+1}^F - t_m)$. Using the boundedness of $\lambda_k$ and $\bar{h}(r_k)$, together with the update equations for $\theta_k$ and $\lambda_k$, we conclude that there exists a (deterministic) constant $C$ such that for every $m$, we have

$$\|R(x_k, r_k)\| \leq C(t_{m+1} - t_m), \qquad k \in \{t_m, \ldots, t_{m+1} - 1\}, \tag{33}$$

$$\|R(x_k^F, r_{t_m})\| \leq C(t_{m+1}^F - t_m), \qquad k \in \{t_m, \ldots, t_{m+1}^F - 1\}, \tag{34}$$

$$\|r_k - r_{t_m}\| \leq C\gamma_{t_m}(t_{m+1} - t_m)^2, \qquad k \in \{t_m, \ldots, t_{m+1} - 1\}, \tag{35}$$

$$\|R(x_k, r_k) - R(x_k, r_{t_m})\| \leq C\gamma_{t_m}(t_{m+1} - t_m)^3, \qquad k \in \{t_m, \ldots, t_{m+1} - 1\}. \tag{36}$$
Lemma 13 The series $\sum_m \epsilon_m^{(1)}$ converges with probability 1.

Proof: Let $B$ be a bound on $\|\bar{h}(r_k)\|$. Then, using Assumption 5, we have

$$\|\epsilon_m^{(1)}\| \leq B \sum_{k=t_m}^{t_{m+1}-1} (\gamma_{t_m} - \gamma_k) \leq AB\gamma_{t_m}^2(t_{m+1} - t_m)^{p+1}.$$

Then, Lemma 12(a) implies that $\sum_m \|\epsilon_m^{(1)}\|$ has finite expectation, and is therefore finite with probability 1. □
Lemma 14 The series $\sum_m \epsilon_m^{(2)}$ converges with probability 1.

Proof: When the parameters $\theta$ and $\lambda$ are frozen to their values at time $t_m$, the total update $\sum_{k=t_m}^{t_{m+1}^F-1} R(x_k^F, r_{t_m})$ coincides with the update $H_m(r_{t_m})$ of the algorithm studied in Appendix A. Using the discussion in the beginning of that appendix, we have $E[H_m(r_{t_m}) \mid \mathcal{F}_m] = h(r_{t_m})$. Furthermore, observe that

$$E\!\left[\sum_{k=t_m}^{t_{m+1}^F-1} \bar{h}(r_{t_m}) \,\Big|\, \mathcal{F}_m\right] = \bar{h}(r_{t_m})\, E_{\theta_{t_m}}[T] = h(r_{t_m}).$$

Thus, $E[\epsilon_m^{(2)} \mid \mathcal{F}_m] = 0$. Furthermore, using Eq. (34), we have

$$E[\|\epsilon_m^{(2)}\|^2 \mid \mathcal{F}_m] \leq C\gamma_{t_m}^2\, E[(t_{m+1}^F - t_m)^4 \mid \mathcal{F}_m] \leq C'\gamma_{t_m}^2,$$

for some constants $C$ and $C'$. Using Lemma 12(a), we obtain that $\sum_m E[\|\epsilon_m^{(2)}\|^2]$ is finite. Thus, the partial sums of $\sum_m \epsilon_m^{(2)}$ form a martingale with bounded second moment and, therefore, converge with probability 1. □
Lemma 15 The series $\sum_m \epsilon_m^{(3)}$ converges with probability 1.

Proof: The proof is based on a coupling argument. For $k = t_m, \ldots, t_{m+1} - 1$, the two processes $x_k$ and $x_k^F$ can be defined on the same probability space so that

$$P(i_{k+1} \neq i_{k+1}^F \mid i_k = i_k^F) \leq B\|\theta_k - \theta_{t_m}\| \leq B\|r_k - r_{t_m}\| \leq BC\gamma_{t_m}(t_{m+1} - t_m)^2, \tag{37}$$

for some constants $B$ and $C$. We have used here the assumption that the transition probabilities depend smoothly on $\theta$, as well as Eq. (35).

We define $\mathcal{E}_m$ to be the event

$$\mathcal{E}_m = \{x_k \neq x_k^F \text{ for some } k = t_m, \ldots, t_{m+1}\}.$$

Using Eq. (37), we obtain

$$P(\mathcal{E}_m \mid t_m, t_{m+1}) \leq BC \sum_{k=t_m}^{t_{m+1}-1} \gamma_{t_m}(t_{m+1} - t_m)^2 = BC\gamma_{t_m}(t_{m+1} - t_m)^3.$$

Note that if the event $\mathcal{E}_m$ does not occur, then $\epsilon_m^{(3)} = 0$. Thus,

$$E[\|\epsilon_m^{(3)}\| \mid t_m, t_{m+1}] = P(\mathcal{E}_m \mid t_m, t_{m+1})\, E[\|\epsilon_m^{(3)}\| \mid t_m, t_{m+1}, \mathcal{E}_m].$$

Since $\bar{h}(r_k)$ is bounded, and using also the bounds (33)-(34), we have

$$\|\epsilon_m^{(3)}\| \leq \gamma_{t_m} C\left((t_{m+1} - t_m)^2 + (t_{m+1}^F - t_m)^2\right),$$

for some new constant $C$. We conclude that

$$E[\|\epsilon_m^{(3)}\| \mid t_m, t_{m+1}, \mathcal{E}_m] \leq \gamma_{t_m} C\left((t_{m+1} - t_m)^2 + E[(t_{m+1}^F - t_m)^2 \mid t_m, t_{m+1}, \mathcal{E}_m]\right).$$

Now, it is easily verified that

$$E[(t_{m+1}^F - t_m)^2 \mid t_m, t_{m+1}, \mathcal{E}_m] \leq 2E[(t_{m+1}^F - t_{m+1})^2 \mid t_{m+1}, \mathcal{E}_m] + 2(t_{m+1} - t_m)^2 \leq C(t_{m+1} - t_m)^2,$$

for some new constant $C$. By combining these inequalities, we obtain

$$E[\|\epsilon_m^{(3)}\| \mid t_m, t_{m+1}, \mathcal{E}_m] \leq C\gamma_{t_m}(t_{m+1} - t_m)^2,$$

and

$$E[\|\epsilon_m^{(3)}\| \mid t_m, t_{m+1}] \leq BC\gamma_{t_m}^2(t_{m+1} - t_m)^5,$$

for some different constant $C$. Using Lemma 12(a), $\sum_m \|\epsilon_m^{(3)}\|$ has finite expectation, and is therefore finite with probability 1. □
Lemma 16 The series $\sum_m \epsilon_m^{(4)}$ converges with probability 1.

Proof: Using Eq. (36), we have

$$\|\epsilon_m^{(4)}\| \leq \gamma_{t_m} \sum_{k=t_m}^{t_{m+1}-1} C\gamma_{t_m}(t_{m+1} - t_m)^3 = C\gamma_{t_m}^2(t_{m+1} - t_m)^4.$$

Using Lemma 12(a), $\sum_m \|\epsilon_m^{(4)}\|$ has finite expectation, and is therefore finite with probability 1. □
Lemma 17 The series $\sum_m \epsilon_m^{(5)}$ converges with probability 1.

Proof: Using Assumption 5 and the bound (33) on $\|R(x_k, r_k)\|$, we have

$$\|\epsilon_m^{(5)}\| \leq C(t_{m+1} - t_m) \sum_{k=t_m}^{t_{m+1}-1} (\gamma_{t_m} - \gamma_k) \leq AC\gamma_{t_m}^2(t_{m+1} - t_m)^{p+2}.$$

Using Lemma 12(a), $\sum_m \|\epsilon_m^{(5)}\|$ has finite expectation, and is therefore finite with probability 1. □
We close by establishing the statement mentioned at the end of the preceding subsection, namely, that the changes in $r_k$ between visits to the recurrent state $i^*$ tend to zero as time goes to infinity. Indeed, Eq. (35) establishes a bound on $\|r_k - r_{t_m}\|$ for $k = t_m, \ldots, t_{m+1} - 1$, which converges to zero because of Lemma 12(a).