Alleviating tuning sensitivity in Approximate Dynamic Programming
Paul Beuchat, Angelos Georghiou and John Lygeros¹
This research was partially funded by the European Commission under the project Local4Global.
¹All authors are with the Automatic Control Laboratory, ETH Zürich, Switzerland. [email protected]
Abstract— Approximate Dynamic Programming offers benefits for large-scale systems compared to other synthesis and control methodologies. A common technique for approximating the Dynamic Program is to solve the corresponding Linear Program. The major drawback of this approach is that the online performance is very sensitive to the choice of tuning parameters, in particular the state relevance weighting parameter. Our work aims at alleviating this sensitivity. To achieve this, we propose a point-wise maximum of multiple Q-functions for the online policy, and show that this immunizes against tuning errors in the parameter selection process. We formulate the resulting problem as a convex optimization problem and demonstrate the effectiveness of the approach using a stylized portfolio optimization problem. The approach offers a benefit for large-scale systems where the cost of a parameter tuning process is prohibitively high.
I. INTRODUCTION
Stochastic optimal control provides a framework to describe many challenges across the field of engineering.
The objective is to find a policy for decision making that
optimizes the performance of the dynamical system under consideration. Dynamic Programming (DP) provides a
method to solve stochastic optimal control problems, for
which the key is to solve the Bellman equation [1]. Although
a powerful result, computing an exact solution to the Bellman
equation, called the optimal cost-to-go function, is in general
intractable and inevitably leads to the curse of dimensionality
[2]. Approximate Dynamic Programming (ADP) is a term
covering methods that attempt to approximate the solution
of the Bellman equation, [3], [4]. In particular, the Linear
Programming (LP) approach to ADP, first suggested in
1985 [5], introduces a set of parameters that need to be
selected by the practitioner, and strongly affect the quality
of the solution. In this paper, we address this sensitivity
by proposing a systematic way to partially immunize the
solution quality against bad choices of the tuning parameters.
The LP approach to ADP is stated as follows: given a set
of basis functions, find a linear combination of them that
“best” approximates the optimal cost-to-go function, called
the approximate value function. Based on this, the online
policy is the one-step minimization of the approximate value
function, called the approximate greedy policy. In the case
that a set of basis functions and the coefficients of the linear
combination can be found to closely approximate the optimal
cost-to-go function, then the method can be a very powerful
tool. For example, the LP approach enjoyed some notable
success for the applications of playing backgammon [6],
elevator scheduling [7], and stochastic reachability problems
[8]. However, these examples required significant trial and
error tuning in order to find a suitable choice of basis functions and the best linear combination. For other applications
the trial and error work involved in tuning prohibits the use of
this method. Hence, alleviating the tuning effort will expand
the scope of applications for the LP approach.
Despite the rich choice of basis functions, see [9] and [10],
choosing the optimal coefficients of the linear combination
remains a difficult problem. The key parameter used throughout the literature to tune the coefficients is called the state
relevance weighting. This tuning parameter specifies which
regions of the state space are important for approximation.
However, the regions of importance depend on the behaviour
of the system when the approximate greedy policy is played
online, and the policy in turn depends on the choice of the
state relevance weighting. This circular dependence is what makes the tuning parameter difficult to choose in practice.
Different approaches have been suggested for tuning the
state relevance weighting. In [11] the authors use the initial
distribution as the state relevance weighting. Although a
natural choice, it leads to poor online performance if the
system evolves to regions of the state space different from
the initial distribution. The authors of [12] eliminate the state
relevance weighting from the formulation at the expense of
increased complexity to evaluate the online policy. Their approach is a variant of Model Predictive Control and will hence face difficulties similar to those studied in that field [13].
The contributions of this paper are twofold. First, we
propose a policy that allows the practitioner to choose multiple state relevance weightings and the policy automatically
leverages the best performance of each without requiring trial
and error tuning. Second, we provide bounds to guarantee
that our proposed approach will perform at least as well
as any individual choice of the tuning parameter. Finally,
we show through numerical examples that the proposed policy immunizes against poor choices of the state relevance weighting. Our proposed approach extends the point-wise maximum of approximate value functions suggested in [11], and uses the Q-function formulation, see [14], to reduce the computational burden of the online policy.
The structure of this paper is as follows. In Section II,
we present the DP formulation considered and in Section
III, we present our proposed policy using the Value function
formulation of the LP approach to DP. This motivates Section
IV where we use the Q-function formulation to propose
a tractable, point-wise maximum, greedy policy. Section
IV also provides performance guarantees for the computed
solution. In Section V, we demonstrate the performance of
the proposed approach and conclude in Section VI.
Notation: $\mathbb{R}_+$ is the space of non-negative scalars; $\mathbb{Z}_+$ is the space of positive integers; $\mathbb{S}^n$ is the space of $n \times n$ real symmetric matrices; $I_n$ is the $n \times n$ identity matrix; $\mathbf{1}_n$ is the vector of ones of size $n$; $(\cdot)^\top$ is the matrix transpose; given $f : \mathcal{X} \to \mathbb{R}$, the infinity norm is $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$, and the weighted 1-norm is $\|f\|_{1,c} = \int_{\mathcal{X}} |f(x)|\, c(dx)$.
The term intractable is used throughout the paper. We
loosely define intractable to mean that the computational
burden of any existing solution method prohibits finding a
solution in reasonable time.
II. DYNAMIC PROGRAMMING (DP) FORMULATION
This section introduces the problem formulation and states
the DP as the solution to the Bellman equation. We consider
infinite horizon, discounted cost, stochastic optimal control
problems. The system is described by discrete dynamics over
continuous state and action spaces. The state of the system at time $t$ is $x_t \in \mathcal{X} \subseteq \mathbb{R}^{n_x}$. The system state is influenced by the control decisions $u_t \in \mathcal{U} \subseteq \mathbb{R}^{n_u}$, and the stochastic disturbance $\xi_t \in \Xi \subseteq \mathbb{R}^{n_\xi}$. In this setting, the state evolves according to the function $g : \mathcal{X} \times \mathcal{U} \times \Xi \to \mathcal{X}$ as,
$$x_{t+1} \,=\, g(x_t, u_t, \xi_t)\,.$$
At time $t$, the system incurs the stage cost $\gamma^t\, l(x_t, u_t)$, where $\gamma \in [0,1)$ is the discount factor, and the objective is to minimize the infinite sum of the stage costs.
The optimal Value function, $V^* : \mathcal{X} \to \mathbb{R}$, characterizes the solution of this stochastic optimal control problem. It represents the cost-to-go from any state of the system if the optimal control policy is played. The optimal Value function is the solution of the Bellman equation [1],
$$V^*(x) \,=\, \underbrace{\min_{u \in \mathcal{U}} \; \overbrace{\Big\{ l(x,u) + \gamma\, \mathbb{E}\big[ V^*\big( g(x,u,\xi) \big) \big] \Big\}}^{Q^*(x,u)}}_{(T V^*)(x)}\,, \qquad (1)$$
for all $x \in \mathcal{X}$, where $T$ is the Bellman operator, and $Q^* : (\mathcal{X} \times \mathcal{U}) \to \mathbb{R}$ is the optimal Q-function. The Q-function represents the cost of making decision $u$ now and then playing optimally from the next time step forward. The optimal control actions are generated via the Greedy Policy:
$$\pi^*(x) \,=\, \arg\min_{u \in \mathcal{U}} \; l(x,u) + \gamma\, \mathbb{E}\big[ V^*\big( g(x,u,\xi) \big) \big] \,=\, \arg\min_{u \in \mathcal{U}} \; Q^*(x,u)\,. \qquad (2)$$
The Bellman equation, (1), can be equivalently written in terms of $Q^*$ as follows:
$$Q^*(x,u) \,=\, \underbrace{l(x,u) + \gamma\, \mathbb{E}\Big[ \min_{v \in \mathcal{U}} Q^*\big( g(x,u,\xi), v \big) \Big]}_{(F Q^*)(x,u)}\,, \qquad (3)$$
for all $x \in \mathcal{X}$ and $u \in \mathcal{U}$. Equation (3) defines the $F$-operator, the equivalent of $T$ for Q-functions.¹
¹The operators $T$ and $F$ are both monotone and $\gamma$-contractive, see [14].
Solving (1) exactly is only tractable under strong assumptions on the problem structure, namely for unconstrained Linear Quadratic Gaussian problems [15]. In other cases, the popular LP approach to ADP can be used to approximate the solution of (1). This method is presented in the next sections.
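For concreteness, the following is a minimal sketch of how the greedy policy (2) can be evaluated numerically: the expectation is replaced by a Monte Carlo sample average and the minimization is carried out over a finite grid of candidate inputs. The dynamics g, stage cost l, value function V, and input grid are hypothetical placeholders rather than any particular problem instance; the cost of this brute-force search is one manifestation of the tractability issues discussed in the next section.

```python
import numpy as np

# Sample-average evaluation of the greedy policy (2): for each candidate
# input, estimate the expected cost-to-go by Monte Carlo and pick the
# minimizer. g, l, V, and the input grid are hypothetical placeholders.
gamma = 0.95

def g(x, u, xi):                    # placeholder dynamics x_{t+1} = g(x_t, u_t, xi_t)
    return x + u + xi

def l(x, u):                        # placeholder stage cost
    return float(x @ x + u @ u)

def V(x):                           # placeholder (approximate) value function
    return float(x @ x)

def greedy_policy(x, candidate_inputs, n_samples=1000, seed=0):
    """Approximates argmin_u l(x, u) + gamma * E[V(g(x, u, xi))]."""
    rng = np.random.default_rng(seed)
    xis = rng.normal(size=(n_samples, x.size))
    best_u, best_cost = None, np.inf
    for u in candidate_inputs:
        expected_cost_to_go = np.mean([V(g(x, u, xi)) for xi in xis])
        cost = l(x, u) + gamma * expected_cost_to_go
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

x = np.array([1.0, -0.5])
grid = [np.array([a, b]) for a in np.linspace(-1, 1, 5) for b in np.linspace(-1, 1, 5)]
print(greedy_policy(x, grid))
```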
III. VALUE FUNCTION APPROACH TO ADP
This section presents a method to obtain an approximation of $V^*$ through the solution of an LP. This is done by approximating the so-called exact LP, whose solution is $V^*$.
We highlight the sensitivity of approximate solutions to the
tuning parameters introduced, and then propose a policy that
immunizes against this sensitivity.
A. Iterated Bellman Inequality and the Exact LP
Equation (1) is relaxed to the iterated Bellman inequality,
$$V(x) \,\le\, \big(T^M V\big)(x)\,, \qquad \forall x \in \mathcal{X}\,, \qquad (4)$$
for some $M \in \mathbb{Z}_+$, where $T^M$ denotes $M$ applications of the Bellman operator. As shown in [11], any $V$ satisfying (4) will be a point-wise under-estimator of $V^*$ over the set $\mathcal{X}$.
The exact LP associated with (1) is formulated as follows:
$$\begin{aligned}
\max_{V} \quad & \int_{\mathcal{X}} V(x)\, c(dx) \\
\text{s.t.} \quad & V \in \mathcal{F}(\mathcal{X})\,, \\
& V(x) \,\le\, \big(T^M V\big)(x)\,, \qquad \forall x \in \mathcal{X}\,.
\end{aligned} \qquad (5)$$
As shown in [16, Section 6.3], the solutions of (1) and (5) coincide when $\mathcal{F}(\mathcal{X})$ is the function space of real-valued measurable functions on $\mathcal{X}$ with finite weighted 1-norm, and $c(\cdot)$ is any finite measure on $\mathcal{X}$ that assigns positive mass to all open subsets of $\mathcal{X}$. Taking $M = 1$ here and in the subsequent analysis corresponds to the formulation originally proposed in [17]. Although (1) and (5) are equivalent for all $M \in \mathbb{Z}_+$, the benefit is apparent after the approximation is made. As explained in [11], problem (5) with $M > 1$ has a larger feasible region than for $M = 1$.
Solving (5) for $V^*$, and implementing (2), is in general intractable. The difficulties can be categorized as follows:
(D1) $\mathcal{F}(\mathcal{X})$ is an infinite dimensional function space;
(D2) Problem (5) involves an infinite number of constraints;
(D3) The multidimensional integral in the objective of (5);
(D4) The multidimensional integral over the disturbance $\xi$ in the Bellman operator $T$, and in the greedy policy (2);
(D5) For arbitrary $V^* \in \mathcal{F}(\mathcal{X})$, the greedy policy (2) may be intractable.
Thus, methods that exactly solve (5) and (2) will suffer from the curse of dimensionality in at least one of these aspects, see [18, Section 2]. To gain computational tractability, in the following we restrict the function space $\mathcal{F}(\mathcal{X})$ to simultaneously overcome (D1)-(D5).
B. The Approximate LP
As suggested in [5], we restrict the admissible value functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions $\hat{V}^{(i)} : \mathbb{R}^{n_x} \to \mathbb{R}$, we parameterize a restricted function space as,
$$\hat{\mathcal{F}}(\mathcal{X}) \,=\, \Big\{ \hat{V}(\cdot) \;\Big|\; \hat{V}(x) = \textstyle\sum_{i=1}^{K} \alpha_i\, \hat{V}^{(i)}(x) \Big\} \,\subseteq\, \mathcal{F}(\mathcal{X})\,,$$
for some $\alpha_i \in \mathbb{R}$. Hence an element of the set is specified by a set of $\alpha_i$'s. An approximate solution to (5) can be obtained through the solution of the following approximate LP:
$$\begin{aligned}
\max_{\hat{V}} \quad & \int_{\mathcal{X}} \hat{V}(x)\, c(dx) \\
\text{s.t.} \quad & \hat{V} \in \hat{\mathcal{F}}(\mathcal{X})\,, \\
& \hat{V}(x) \,\le\, \big(T^M \hat{V}\big)(x)\,, \qquad \forall x \in \mathcal{X}\,,
\end{aligned} \qquad (6)$$
where the optimization variables are the $\alpha_i$'s in the definition of $\hat{\mathcal{F}}(\mathcal{X})$. The only change from (5) was to replace $\mathcal{F}(\mathcal{X})$ by $\hat{\mathcal{F}}(\mathcal{X})$. The iterated Bellman inequality is not a convex constraint on the optimization variables. As presented in [11, Section 3.4], it can be replaced by a constraint that is convex in the $\alpha_i$'s and implies the iterated Bellman inequality.
Difficulty (D1) has been overcome in problem (6) as $\hat{\mathcal{F}}(\mathcal{X})$ is parameterized by a finite dimensional decision variable. However, difficulties (D2)-(D4) are still present and are overcome by matching the choice of basis functions with the problem instance. The details of choosing the basis functions are omitted and the reader is referred to the following examples for guidance. The space of quadratic functions overcomes (D2)-(D4) for constrained LQG problems, see [11] for the details of the S-lemma procedure used to reformulate (6). For problems with polynomial dynamics, costs, and constraints, see [10], where sums-of-squares techniques and polynomial basis functions are used. In [8], radial basis functions are used to approximate stochastic reachability problems. Piece-wise constant approximate Value functions are used in [19] to address a perimeter surveillance control problem. Sampling based alternatives for overcoming (D2) are suggested in [20] and [21].
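As an illustration of problem (6) once a basis is fixed, the sketch below sets up a constraint-sampled version for M = 1, in the spirit of the sampling alternatives of [20], [21]: the expectation is approximated by a sample average and the Bellman inequality is imposed only at sampled state-input pairs, which makes the problem a finite LP in the coefficients. The dynamics, stage cost, basis functions, sampling distributions, and the coefficient bound used to keep the sampled LP bounded are all illustrative placeholders, not the formulation used later in Section V.

```python
import numpy as np
import cvxpy as cp

# Constraint-sampled sketch of the approximate LP (6) for M = 1, with the
# expectation replaced by a sample average. Since (T V)(x) is a minimum over
# u, the constraint V(x) <= (T V)(x) is equivalent to
#   V(x) <= l(x, u) + gamma * E[V(g(x, u, xi))]   for every u,
# which is imposed here only at sampled (x, u) pairs. All data are placeholders.
rng = np.random.default_rng(0)
gamma, n_x, n_u = 0.95, 2, 2

def basis(x):                              # K = 6 basis functions
    return np.array([1.0, x[0], x[1], x[0] ** 2, x[1] ** 2, x[0] * x[1]])

def g(x, u, xi):                           # placeholder dynamics
    return 0.9 * x + u + xi

def l(x, u):                               # placeholder stage cost
    return float(x @ x + 0.1 * u @ u)

alpha = cp.Variable(6)                     # V_hat(x) = basis(x) @ alpha
xs = rng.normal(size=(100, n_x))           # sampled states (where c(.) puts mass)
us = rng.uniform(-1.0, 1.0, size=(10, n_u))
xis = rng.normal(scale=0.1, size=(50, n_x))

constraints = [cp.norm(alpha, "inf") <= 1e3]   # keep the sampled LP bounded
for x in xs:
    for u in us:
        e_basis_next = np.mean([basis(g(x, u, xi)) for xi in xis], axis=0)
        constraints.append(basis(x) @ alpha <= l(x, u) + gamma * e_basis_next @ alpha)

# Objective of (6): the c-weighted integral of V_hat, here a sample average.
objective = cp.Maximize(cp.sum(np.stack([basis(x) for x in xs]) @ alpha) / len(xs))
cp.Problem(objective, constraints).solve()
print("fitted coefficients:", alpha.value)
```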
Let $\hat{V}^*$ denote the optimizer of problem (6). Then a natural choice for the online policy is,
$$\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U}} \; l(x,u) + \gamma\, \mathbb{E}\Big[ \hat{V}^*\big( g(x,u,\xi) \big) \Big]\,, \qquad (7)$$
called an approximate greedy policy. Unless $\hat{V}^*$ is restricted to be convex when solving (6), difficulty (D5) will still be present. If convexity of $\hat{V}^*$ is not enforced, results from global polynomial optimization, [22], may assist. The policy we propose in Section III-D requires that the $\hat{V}^*$ are convex, and hence that a convexity constraint is added to (6).
C. Choice of the weighting c(·)
As discussed in [16], the choice of $c(\cdot)$ does not affect problem (5). Intuitively speaking, the reason is that the space $\mathcal{F}(\mathcal{X})$ is rich enough to satisfy $V(x) \le (T^M V)(x)$ with equality, point-wise for all $x \in \mathcal{X}$. In contrast, once the restriction $\hat{\mathcal{F}}(\mathcal{X}) \subset \mathcal{F}(\mathcal{X})$ is made this is no longer true. The choice of $c(\cdot)$, referred to as the state relevance weighting, provides a trade-off between elements of $\hat{\mathcal{F}}(\mathcal{X})$ over the set of states. Thus, $c(\cdot)$ is a tuning parameter of the approximate LP and influences the optimizer, $\hat{V}^*$.
A good approximation of the value function should achieve near optimal online performance when it replaces $V^*$ in the greedy policy. Intuitively, we see from (7) that the online policy depends on the gradient of the approximate value function: two value functions that differ by a constant will make identical decisions. However, the approximate LP finds the closest fit to $V^*$, relative to the choice of $c(\cdot)$, and does not attempt to match the gradient of $V^*$.
We now provide the intuition behind the approach proposed in the next sub-section. Consider two choices of the state relevance weighting, $c_A(\cdot)$ and $c_B(\cdot)$, that separately place weight on narrow, disjoint regions of the state space, denoted $A$ and $B$, and zero weight elsewhere. For each choice, the solution of (6), $\hat{V}_A^*$ and $\hat{V}_B^*$ respectively, will be the closest under-estimator to $V^*$ over the respective region. On region $A$, $\hat{V}_B^*$ will be lower than $\hat{V}_A^*$, otherwise $\hat{V}_B^*$ would have been the solution of (6) for the weighting $c_A(\cdot)$; the reverse holds on region $B$. Thus if we construct an approximate Value function as the point-wise maximum of $\hat{V}_A^*$ and $\hat{V}_B^*$, it is expected to give the best estimate of $V^*$ over both regions. Finally, by fitting $V^*$ closely over a larger region, it is expected that the gradient approximation will be improved. In this way, our proposed point-wise maximum approach immunizes against the tuning errors that occur when choosing a single $c(\cdot)$.
D. Point-wise maximum Value function and policy
We will solve problem (6) for several choices of $c(\cdot)$, and denote by $\hat{V}_j^*$ the solution for a corresponding $c_j(\cdot)$. Letting $\mathcal{J}$ denote the index set, we define the point-wise maximum Value function as follows:
$$\hat{V}_{\mathrm{pwm}}(x) \,:=\, \max_{j \in \mathcal{J}} \big\{ \hat{V}_j^*(x) \big\}\,, \qquad \forall x \in \mathcal{X}\,.$$
Problem (6) ensures that each $\hat{V}_j^*$ is a point-wise under-estimator of $V^*$. Hence $\hat{V}_{\mathrm{pwm}}$ is a better under-estimator of $V^*$ in the following sense:
$$\big\| V^* - \hat{V}_{\mathrm{pwm}} \big\|_\infty \;\le\; \big\| V^* - \hat{V}_j^* \big\|_\infty\,, \qquad \forall j \in \mathcal{J}\,.$$
The natural choice for the online policy now is to use $\hat{V}_{\mathrm{pwm}}$ in the approximate greedy policy, i.e.,
$$\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U}} \; l(x,u) + \gamma\, \mathbb{E}\Big[ \max_{j \in \mathcal{J}} \hat{V}_j^*\big( g(x,u,\xi) \big) \Big]\,. \qquad (8)$$
However, the point-wise maximum value function reintroduces difficulty (D4): as $\hat{V}_{\mathrm{pwm}} \notin \hat{\mathcal{F}}(\mathcal{X})$, evaluating (8) may not be tractable. The difficulty in (8) is that evaluating the expectation over the disturbance $\xi$ requires Monte Carlo sampling, and this makes the optimization over $u$ prohibitively slow.
Exchanging the expectation and maximization in (8) circumvents this difficulty, and leads to,
$$\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U}} \; l(x,u) + \gamma\, \max_{j \in \mathcal{J}} \, \mathbb{E}\Big[ \hat{V}_j^*\big( g(x,u,\xi) \big) \Big]\,. \qquad (9)$$
This approximation induces a tractable reformulation, and by Jensen's inequality (9) is still a lower bound. A similar approach was proposed in [12] in the context of min-max approximate dynamic programming. It is not clear how the exchange will affect the performance of the approximate greedy policy. In the next section we propose an alternative formulation in terms of Q-functions that alleviates the need to use Jensen's inequality.
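The effect of the exchange can be checked numerically: for any family of functions, the maximum over j of the expectations never exceeds the expectation of the point-wise maximum, which is why (9) under-estimates the quantity minimized in (8). The sketch below illustrates this with arbitrary placeholder quadratics and disturbance samples, not fitted value functions.

```python
import numpy as np

# Numerical check of the exchange used to pass from (8) to (9): for any
# family of functions, max_j E[V_j(.)] <= E[max_j V_j(.)]. The quadratics
# and the next-state samples below are arbitrary placeholders.
rng = np.random.default_rng(1)
next_states = rng.normal(size=(5000, 2))        # stand-ins for g(x, u, xi)

P_list = [np.diag([1.0, 2.0]), np.diag([3.0, 0.5]), np.eye(2)]
values = np.stack([np.einsum("ni,ij,nj->n", next_states, P, next_states)
                   for P in P_list])            # V_j evaluated at every sample

max_of_expectations = values.mean(axis=1).max() # term appearing in (9)
expectation_of_max = values.max(axis=0).mean()  # term appearing in (8)
print(max_of_expectations, "<=", expectation_of_max)
```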
IV. Q-FUNCTION APPROACH TO ADP
In this section, we alternatively define the greedy policy using Q-functions instead of Value functions. We will show that the resulting greedy policy does not suffer from difficulty (D4). Additionally, the greedy policy can be efficiently computed when using a point-wise maximum of approximate Q-functions. Finally, in Section IV-D, we provide error bounds for point-wise maximum Q-functions.
A. Iterated F-operator Inequality and the Approximate LP
The Bellman equation for the Q-function formulation, (3), is relaxed to the iterated $F$-operator inequality,
$$Q(x,u) \,\le\, \big(F^M Q\big)(x,u)\,, \qquad \forall x \in \mathcal{X},\, u \in \mathcal{U}\,, \qquad (10)$$
for some $M \in \mathbb{Z}_+$ being the number of iterations, where $F^M$ denotes $M$ applications of the $F$-operator. As shown in [14], any $Q$ satisfying (10) will be a point-wise under-estimator of $Q^*$ for all elements of the set $(\mathcal{X} \times \mathcal{U})$.
An exact LP reformulation of (3) is analogous to (5) and also requires optimization over an infinite dimensional function space. For brevity we omit this formulation, and move directly to the functional approximation. Similar to Section III-B, we restrict the admissible Q-functions to those that can be expressed as a linear combination of basis functions. In particular, given basis functions $\hat{Q}^{(i)} : \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \to \mathbb{R}$, we parameterize a restricted function space as,
$$\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U}) \,=\, \Big\{ \hat{Q}(\cdot,\cdot) \;\Big|\; \hat{Q}(x,u) = \textstyle\sum_{i=1}^{K} \alpha_i\, \hat{Q}^{(i)}(x,u) \Big\}\,,$$
for some $\alpha_i \in \mathbb{R}$. Using $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, an approximate solution to (3) is obtained through the solution of the following approximate LP:
$$\begin{aligned}
\max_{\hat{Q}} \quad & \int_{\mathcal{X} \times \mathcal{U}} \hat{Q}(x,u)\, c(d(x,u)) \\
\text{s.t.} \quad & \hat{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})\,, \\
& \hat{Q}(x,u) \,\le\, \big(F^M \hat{Q}\big)(x,u)\,, \qquad \forall x \in \mathcal{X},\, u \in \mathcal{U}\,,
\end{aligned} \qquad (11)$$
where the optimization variables are the $\alpha_i$'s in the definition of $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, hence the LP is finite dimensional. The weighting parameter in the objective, $c(\cdot,\cdot)$, needs to be defined over the $(\mathcal{X} \times \mathcal{U})$ space. Again, the iterated $F$-operator inequality is not a convex constraint on the optimization variables. However, it can be replaced by a constraint that is convex in the $\alpha_i$'s and implies that the iterated $F$-operator inequality is satisfied. The reformulation, presented in Appendix II, combines the reformulations found in [11] and [14].
The difficulties (D1)-(D5), described for the Value function formulation, apply equally to the Q-function formulation. Similar to the discussion in Section III-B, difficulties (D2)-(D4) remain present in (11). Overcoming (D2)-(D4) requires the basis functions of $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ to be chosen appropriately for a particular problem instance. The quadratic, polynomial, and radial basis functions described in [11], [10], and [8] for the Value function formulation can be used equivalently for the Q-function formulation.
Let $\hat{Q}^*$ denote the optimizer of problem (11). Then a natural choice for an approximate greedy policy is,
$$\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U}} \; \hat{Q}^*(x,u)\,. \qquad (12)$$
Overcoming difficulty (D5) for Q-functions requires that $\hat{Q}^*$ is restricted to be convex in $u$ when solving (11).
B. Choice of the weighting c(·, ·)
The weighting in the objective of (11), $c(\cdot,\cdot)$, is analogous to the weighting in the Value function formulation: it influences the solution of the approximate LP and is difficult to choose. In general, it is not possible to find a $c(\cdot,\cdot)$ so that the associated $\hat{Q}^*$ approximates $Q^*$ equally well over the whole state-by-input space. It is not clear how one would choose $c(\cdot,\cdot)$ such that $\hat{Q}^*$ (i) provides a tight under-estimate of $Q^*$, and (ii) achieves near optimal online performance. Next, we introduce the point-wise maximum Q-function, which removes the need to choose a single $c(\cdot,\cdot)$ by combining the best fit over multiple approximate Q-functions.
C. Point-wise maximum Q-functions and policy
We propose the point-wise maximum of Q-functions as a method to alleviate the burden of tuning a single weighting parameter. We will show that: (i) the proposed method induces a computationally tractable policy; and (ii) a better under-estimator results, giving a tighter lower bound on $Q^*$ and potentially better performance of the online policy.
We now define the point-wise maximum Q-function. To this end, we solve problem (11) for several choices of $c(\cdot,\cdot)$ and denote by $\hat{Q}_j^*$ the solution for a corresponding $c_j(\cdot,\cdot)$. Letting $\mathcal{J}$ denote the index set, $\hat{Q}_{\mathrm{pwm}}$ is defined as,
$$\hat{Q}_{\mathrm{pwm}}(x,u) \,:=\, \max_{j \in \mathcal{J}} \big\{ \hat{Q}_j^*(x,u) \big\}\,, \qquad (13)$$
point-wise for all $x \in \mathcal{X}$ and $u \in \mathcal{U}$.
Replacing $Q^*$ by $\hat{Q}_{\mathrm{pwm}}$ in equation (2) leads to the approximate greedy policy,
$$\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U}} \; \max_{j \in \mathcal{J}} \big\{ \hat{Q}_j^*(x,u) \big\}\,. \qquad (14)$$
The advantage of problem (14) compared to (8) is that it does not involve the multidimensional integration over the disturbance and hence avoids the re-introduction of difficulty (D4). Thus, (14) is equivalently reformulated as,
$$\begin{aligned}
\hat{\pi}(x) \,=\, \arg\min_{u \in \mathcal{U},\, t} \quad & t \\
\text{s.t.} \quad & \hat{Q}_j^*(x,u) \,\le\, t\,, \qquad \forall j \in \mathcal{J}\,.
\end{aligned} \qquad (15)$$
This reformulation is a convex optimization program if each $\hat{Q}_j^*$ is restricted to be convex in $u$ when solving (11). The numerical example presented in Section V uses convex quadratics as the basis function space. This means that $\hat{Q}_{\mathrm{pwm}}$ is a convex piece-wise quadratic function and solving (15) is a convex Quadratically Constrained Quadratic Program (QCQP).
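A minimal sketch of the epigraph policy (15) for a single state is given below, assuming each Q-function is a quadratic that is convex in u, as in the numerical example of Section V. The coefficient data and the box input set are random placeholders standing in for actual solutions of (11).

```python
import numpy as np
import cvxpy as cp

# Sketch of the epigraph policy (15) for a fixed state x, with each Q_hat_j
# quadratic and convex in u (P_uu positive semi-definite). The coefficients
# are random placeholders; only the u-dependent terms are kept, since the
# u-independent part of Q_hat_j(x, u) does not change the argmin over u.
rng = np.random.default_rng(2)
n_x, n_u, J = 2, 2, 4
x = np.array([1.0, -0.5])

q_data = []
for _ in range(J):
    A = rng.normal(size=(n_u, n_u))
    Puu = A @ A.T + np.eye(n_u)                  # enforces convexity in u
    Pxu = rng.normal(size=(n_x, n_u))
    pu = rng.normal(size=n_u)
    q_data.append((Puu, Pxu, pu))

u = cp.Variable(n_u)
t = cp.Variable()
constraints = [cp.abs(u) <= 1.0]                 # placeholder input set U
for Puu, Pxu, pu in q_data:
    q_j = cp.quad_form(u, Puu) + 2 * (x @ Pxu) @ u + pu @ u
    constraints.append(q_j <= t)                 # Q_hat_j(x, u) <= t

cp.Problem(cp.Minimize(t), constraints).solve()
print("greedy input:", u.value)
```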
D. Fitting bound for the approximate Q-function
We now provide a bound on how closely a solution of (11) approximates $Q^*$. This result allows us to show that $\hat{Q}_{\mathrm{pwm}}$ will provide a better estimate of $Q^*$ than the $\hat{Q}_j^*$ from which it is composed. Combining the ideas from [11] and [14], we prove the following theorem.
Theorem 4.1: Given that $Q^*$ is the solution of (3) and $\hat{Q}^*$ is the solution of (11) for a given choice of $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ and $c(\cdot,\cdot)$, the following bound holds:
$$\big\| Q^* - \hat{Q}^* \big\|_{1,c(x,u)} \;\le\; \frac{2}{1-\gamma^M}\, \min_{\hat{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})} \big\| Q^* - \hat{Q} \big\|_\infty\,. \qquad (16)$$
Proof: See Appendix I.
The theorem says that when $Q^*$ is close to the span of the restricted function space, then the under-estimator $\hat{Q}^*$ will also be close to $Q^*$. In fact, the bound indicates that if $Q^* \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, the approximate LP will recover $\hat{Q}^* = Q^*$.
Notice that Theorem 4.1 holds for any choice of $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ and any choice of $c(\cdot,\cdot)$. We will now argue that Theorem 4.1 applies also to $\hat{Q}_{\mathrm{pwm}}$ as defined in (13). This will allow us to conclude that $\hat{Q}_{\mathrm{pwm}}$ provides a better estimate of $Q^*$.
A valid choice of the restricted function space is the following point-wise maximum of $N$ approximate Q-functions:
$$\hat{\mathcal{F}}^{N}_{\mathrm{pwm}}(\mathcal{X} \times \mathcal{U}) \,=\, \bigg\{ \hat{Q}(\cdot,\cdot) \;\bigg|\; \hat{Q}_k(x,u) \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})\,,\;\; \hat{Q}(x,u) = \max_{k=1,\dots,N} \big\{ \hat{Q}_k(x,u) \big\} \bigg\}\,.$$
Difficulties (D1)-(D4) will still exist if one attempts to solve (11) using $\hat{\mathcal{F}}_{\mathrm{pwm}}(\mathcal{X} \times \mathcal{U})$ as the restricted function space. However, the bound given in Theorem 4.1 applies regardless. Note that the right-hand side of the bound in Theorem 4.1 depends only on the restricted function space and not on the solution of the approximate LP. Therefore, $\hat{\mathcal{F}}_{\mathrm{pwm}}(\mathcal{X} \times \mathcal{U}) \supseteq \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ implies that the minimization on the right-hand side of the bound has a greater feasible region, and hence will achieve a tighter bound, for the point-wise maximum function space. For the theorem to apply to our choice of $\hat{Q}_{\mathrm{pwm}}$ in (13), we must show that it is feasible for the approximate LP with $\hat{\mathcal{F}}_{\mathrm{pwm}}(\mathcal{X} \times \mathcal{U})$ as the restricted function space. This is achieved by the following lemma.
Lemma 4.2: Let $\{\hat{Q}_j\}_{j \in \mathcal{J}}$ be Q-functions such that for each $j \in \mathcal{J}$ the following inequality holds:
$$\hat{Q}_j(x,u) \,\le\, \big(F^M \hat{Q}_j\big)(x,u)\,, \qquad \forall x \in \mathcal{X},\, u \in \mathcal{U}\,.$$
Then the function $\hat{Q}_{\mathrm{pwm}}(x,u)$, defined in (13), also satisfies the iterated $F$-operator inequality, i.e.,
$$\hat{Q}_{\mathrm{pwm}}(x,u) \,\le\, \big(F^M \hat{Q}_{\mathrm{pwm}}\big)(x,u)\,.$$
Proof: See Appendix I.
Although we cannot tractably solve (11) with $\hat{\mathcal{F}}_{\mathrm{pwm}}(\mathcal{X} \times \mathcal{U})$ as the restricted function space, Lemma 4.2 states that $\hat{Q}_{\mathrm{pwm}}$ is a feasible point of that problem. Therefore, there likely exists a choice of $c(\cdot,\cdot)$ such that $\hat{Q}_{\mathrm{pwm}}$ is the solution. Theorem 4.1 states that under this choice of $c(\cdot,\cdot)$, $\hat{Q}_{\mathrm{pwm}}$ approximates $Q^*$ per the bound given in the theorem.
We close this section by re-iterating the benefits of $\hat{Q}_{\mathrm{pwm}}$ compared to $\hat{V}_{\mathrm{pwm}}$. Both give improved lower bounds, but $\hat{Q}_{\mathrm{pwm}}$ has the advantage that the policy can be implemented without the need to introduce an additional approximation as in the case of (8). This will be further demonstrated in the following numerical example.
V. NUMERICAL RESULTS
In this section, we present a numerical case study to
highlight the benefits of the proposed point-wise maximum
policies, see Sections III-D and IV-C. We use a dynamic
portfolio optimization example taken directly from [12] to
compare the online performance of both the Value function
and Q-function formulations. The model is briefly described
here, using the same notation as [12, Section IV].
The task is to manage a portfolio of $n$ assets with the objective of maximizing revenue over a discounted infinite horizon. The state of the system, $x_t \in \mathbb{R}^n$, is the value of the assets in the portfolio, while the input, $u_t \in \mathbb{R}^n$, is the amount to buy or sell of each asset. By convention, a negative value for an element of $u_t$ means the respective asset is sold, while a positive value means it is purchased. The stochastic disturbance affecting the system is the return of the assets occurring over a time period, denoted $\xi_t \in \mathbb{R}^n$. Under the influence of $u_t$ and $\xi_t$, the value of the portfolio evolves over time as,
$$x_{t+1} \,=\, \mathrm{diag}(\xi_t)\, (x_t + u_t)\,.$$
The dynamics represent an example of a linear system affected by multiplicative uncertainty.
The transaction fees are parameterized by $\kappa \in \mathbb{R}^n_+$ and $R \in \mathbb{R}^{n \times n}$, and hence the stage cost incurred at each time step is given by,
$$\mathbf{1}_n^\top u_t \,+\, \kappa^\top |u_t| \,+\, u_t^\top R\, u_t\,.$$
The first term represents the gross cash from purchases and sales, and the final two terms represent the transaction cost. As revenue corresponds to a negative cost, the objective is to minimize the discounted infinite sum of the stage costs. The discount factor represents the time value of money. A restriction is placed on the risk of the portfolio by enforcing the following constraint on the return variance over a time step:
$$(x_t + u_t)^\top \hat{\Sigma}\, (x_t + u_t) \,\le\, l\,,$$
where $l \in \mathbb{R}_+$ is the maximum variance allowed, and $\hat{\Sigma}$ is the covariance of the uncertain return $\xi_t$. Table I lists the parameter values that we use in the numerical instance presented here.
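As a concrete illustration of the model just described, the sketch below simulates the portfolio dynamics and accumulates the discounted stage cost for a trivial placeholder policy. The risk limit l and the estimate of the return covariance are assumptions made for the sketch (Table I does not list l); in the paper the inputs would instead come from the approximate greedy policies, which enforce the risk constraint directly.

```python
import numpy as np

# Simulation sketch of the portfolio model: multiplicative return dynamics,
# transaction-cost stage cost, and the variance (risk) constraint check.
# The policy, the risk limit l, and the covariance estimate are placeholders.
rng = np.random.default_rng(3)
n, gamma = 8, 0.96
kappa = 0.04 * np.ones(n)
R = np.diag([0.028, 0.034, 0.020, 0.026, 0.023, 0.022, 0.024, 0.027])
returns = np.exp(rng.multivariate_normal(np.zeros(n), 0.15 * np.eye(n), 10000))
Sigma_hat = np.cov(returns.T)                    # empirical covariance of xi_t
risk_limit = 50.0                                # placeholder for the bound l

def stage_cost(x, u):
    # 1_n' u + kappa' |u| + u' R u   (negative values correspond to revenue)
    return float(np.ones(n) @ u + kappa @ np.abs(u) + u @ R @ u)

x = rng.multivariate_normal(np.zeros(n), 10 * np.eye(n))   # x_0 as in Table I
total_cost = 0.0
for t in range(300):
    u = -0.05 * x                                # placeholder proportional policy
    if (x + u) @ Sigma_hat @ (x + u) > risk_limit:
        u = -x                                   # crude fallback: close all positions
    total_cost += gamma ** t * stage_cost(x, u)
    xi = np.exp(rng.multivariate_normal(np.zeros(n), 0.15 * np.eye(n)))
    x = xi * (x + u)                             # x_{t+1} = diag(xi_t)(x_t + u_t)
print("discounted cost:", total_cost)
```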
We solved both the Value function and Q-function approximate LP using quadratic basis functions. The equations for fitting the Value functions and Q-functions via (6) and (11) follow directly from [12, Equations (14)-(17)].
TABLE I: Parameters used for portfolio example
$n$ = 8
$\gamma$ = 0.96
$x_0$ = $\mathcal{N}(0_{n \times 1},\, 10\, I_n)$
$\log(\xi_t)$ = $\mathcal{N}(0_{n \times 1},\, 0.15\, I_n)$
$\kappa$ = $0.04 \cdot \mathbf{1}_n$
$R$ = diag(0.028, 0.034, 0.020, 0.026, 0.023, 0.022, 0.024, 0.027)
The quadratic basis functions used are parameterized as follows:
$$\hat{V}(x) \,=\, x^\top P\, x + p^\top x + s_V\,,$$
$$\hat{Q}(x,u) \,=\, \begin{bmatrix} x \\ u \end{bmatrix}^\top \begin{bmatrix} P_{xx} & P_{xu} \\ P_{xu}^\top & P_{uu} \end{bmatrix} \begin{bmatrix} x \\ u \end{bmatrix} \,+\, \begin{bmatrix} p_x \\ p_u \end{bmatrix}^\top \begin{bmatrix} x \\ u \end{bmatrix} \,+\, s_Q\,,$$
where $P, P_{xx}, P_{uu} \in \mathbb{S}^n$, $P_{xu} \in \mathbb{R}^{n \times n}$, $p, p_x, p_u \in \mathbb{R}^n$, and $s_V, s_Q \in \mathbb{R}$ are the coefficients of the linear combinations in the basis function sets $\hat{\mathcal{F}}(\mathcal{X})$ and $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$. A positive semi-definite constraint is enforced on $P$ and $P_{uu}$ to make the point-wise maximum policies convex and tractable. This restricts the Value functions to be convex and the Q-functions to be convex in $u$.
A family of 10 weighting parameters was used. One of the weighting parameters, denoted $c_0(\cdot)$, was chosen to be the same as the initial distribution over $\mathcal{X}$, and represents the method suggested in [11]. The remaining 9 weighting parameters, denoted $\{c_j(\cdot)\}_{j=1}^{9}$, were chosen to be low variance normal distributions centred at random locations in the state space. For the Q-function formulation, the same 10 weighting parameters were expanded with a fixed uniform distribution over $\mathcal{U}$. For each $c_j(\cdot)$, problems (6) and (11) were solved to obtain $\hat{V}_j^*$ and $\hat{Q}_j^*$, $j = 0, 1, \dots, 9$. The lower bound and the online performance were computed for the individual $\hat{V}_j^*$ and $\hat{Q}_j^*$, and also for $\hat{V}_{\mathrm{pwm}}$ and $\hat{Q}_{\mathrm{pwm}}$ with the point-wise maximum taken over the whole family. The lower bound was computed as the average over 2000 samples, while the online performance was averaged over 2000 simulations, each of length 300 time steps. The results are shown in Table II.
TABLE II: Lower Bound and online performance
Row   Lower Bound / Online                 Value
 1    Online,       V̂_0                    140.1
 2    Online,       Q̂_0                    140.1
 3    Online,       Q̂_j*, j = 1,...,9      [13.0, 142.8]
 4    Online,       V̂_j*, j = 1,...,9      [95.1, 142.8]
 5    Online,       V̂_pwm                  145.2
 6    Online,       Q̂_pwm                  145.5
 7    Lower Bound,  Q̂_pwm                  165.1
 8    Lower Bound,  Q̂_0                    165.1
 9    Lower Bound,  Q̂_j*, j = 1,...,9      [568.4, 165.1]
10    Lower Bound,  V̂_pwm                  165.8
11    Lower Bound,  V̂_0                    165.8
12    Lower Bound,  V̂_j*, j = 1,...,9      [220.7, 166.0]
The results show that when using a single choice of $c(\cdot)$, both the lower bound and the online performance can be arbitrarily bad. This is shown by the large range of values in rows 3, 4, 9, and 12 of Table II. This large range indicates that at least one of the weighting parameters was a poor choice. As the point-wise maximum function uses all 10 choices of $c(\cdot)$, it also includes this poor choice of the weighting parameter. We see from the results that the point-wise maximum function achieves the tightest lower bound and the best online performance. This highlights that the point-wise maximum function immunizes against the poor choice of the weighting parameter. Using the initial distribution as the weighting parameter gives reasonable results for this example, see rows 1, 2, 8, and 11 of Table II; thus, the suggestion of [11] is reasonable. The benefit of the point-wise maximum policy is that the practitioner can explore other choices of $c(\cdot)$ without risk of degraded performance. Finally, we note that in the best case the Q-functions perform slightly better than the Value functions, indicating that exchanging the expectation and maximization in (8) had little impact for this example. However, in the worst case, the Q-function performs very badly for a poor choice of the weighting parameter, highlighting further the importance of using our proposed approach to immunize against such sensitivity. This example demonstrates the features and trends indicated by the theory.
VI. CONCLUSIONS
In this paper, we addressed the difficulty of tuning the state
relevance weighting parameter in the Linear Programming
approach to Approximate Dynamic Programming. We proposed an approximate greedy policy that alleviates the tuning
sensitivity of previous methods by allowing for a family of
parameters to be used and automatically choosing the best
parameter at each time step. This is achieved by using a
point-wise maximum of functions that individually underestimate the optimal cost-to-go. We render the online policy
tractable by using Q-functions. We proved that the proposed
approach gives a satisfactory lower bound on the best achievable cost, and used a numerical example to demonstrate that
the online performance is indeed immunized against poor
choices of the weighting parameter. Future work will include
improved theoretical bounds on the online performance of
the policy and a deeper understanding of when the Value
function or Q-function approach is preferable.
APPENDIX I
PROOFS OF THEOREM 4.1 AND LEMMA 4.2
The proof requires two auxiliary lemmas that are presented
first, and then we present the proof of Theorem 4.1. Lemma 1.2 provides a point-wise bound on how much the $M$-iterated $F$-operator inequality is violated for any given Q-function. This is used in the proof of Lemma 1.3, which shows that, given a $\hat{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, it can be downshifted by a certain constant amount to satisfy the iterated $F$-operator inequality. The constant by which it is downshifted relates directly to the constant on the RHS of Theorem 4.1.
We start by stating the monotone and contractive properties of the $F$-operator, which are needed in the proofs.
Proposition 1.1: The $F$-operator is (i) monotone and (ii) $\gamma$-contractive, in the sense that for any given $Q_1, Q_2 : (\mathcal{X} \times \mathcal{U}) \to \mathbb{R}$,
(i) $Q_1(x,u) \le Q_2(x,u)\,, \; \forall x \in \mathcal{X},\, u \in \mathcal{U} \;\Rightarrow\; (F Q_1)(x,u) \le (F Q_2)(x,u)\,, \; \forall x \in \mathcal{X},\, u \in \mathcal{U}$;
(ii) $\| F Q_1 - F Q_2 \|_\infty \,\le\, \gamma\, \| Q_1 - Q_2 \|_\infty$.
These properties can be found in [14].
Lemma 1.2: Let $M \in \mathbb{Z}_+$, and let $Q : (\mathcal{X} \times \mathcal{U}) \to \mathbb{R}$ be any Q-function. Then violations of the iterated $F$-operator inequality can be bounded as,
$$Q(x,u) - \big(F^M Q\big)(x,u) \;\le\; \big(1 + \gamma^M\big)\, \| Q^* - Q \|_\infty\,,$$
for all $x \in \mathcal{X}$, $u \in \mathcal{U}$.
Proof: Starting from the left-hand side,
$$\begin{aligned}
Q(x,u) - \big(F^M Q\big)(x,u)
&\;\le\; \| Q - Q^* \|_\infty + Q^*(x,u) - \big(F^M Q\big)(x,u) \\
&\;\le\; \| Q - Q^* \|_\infty + \big(F^M Q^*\big)(x,u) - \big(F^M Q\big)(x,u) \\
&\;\le\; \| Q - Q^* \|_\infty + \gamma^M\, \| Q^* - Q \|_\infty\,.
\end{aligned}$$
The first inequality follows from the definition of the $\infty$-norm, the second inequality comes from $Q^*(x,u) = (F Q^*)(x,u)$ and the $\infty$-norm definition, and the third inequality is due to the $\gamma$-contractive property of the $F$-operator. Re-arranging, the result follows.
Lemma 1.3: Let $\hat{Q}(x,u) \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ be an arbitrary element from the basis functions set, and let $\tilde{Q}(x,u)$ be a Q-function defined as,
$$\tilde{Q}(x,u) \;=\; \hat{Q}(x,u) \,-\, \underbrace{\frac{1+\gamma^M}{1-\gamma^M}\, \| Q^* - \hat{Q} \|_\infty}_{\text{downwards shift term}}\,, \qquad (17)$$
then $\tilde{Q}(x,u)$ satisfies the iterated $F$-operator inequality, i.e.,
$$\tilde{Q}(x,u) \;\le\; \big(F^M \tilde{Q}\big)(x,u)\,, \qquad \forall x \in \mathcal{X},\, u \in \mathcal{U}\,,$$
and if $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ allows for affine combinations of the basis functions, then $\tilde{Q}$ is also an element of $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$.
Proof: Let $\beta \in \mathbb{R}$ denote the constant downwards shift term of (17) for notational convenience. Using the definition of the $F$-operator we see that, for any function $Q(x,u)$,
$$\big(F (Q + \beta)\big)(x,u) \;=\; l(x,u) + \gamma\, \mathbb{E}\Big[ \min_{v \in \mathcal{U}} \big\{ Q\big( g(x,u,\xi), v \big) + \beta \big\} \Big] \;=\; (F Q)(x,u) + \gamma \beta\,,$$
where the equalities hold for all $x \in \mathcal{X}$, $u \in \mathcal{U}$. The first equality comes from the definition of the $F$-operator, and the second equality holds as $\beta$ is an additive constant in the objective of the minimization. Iterating the same argumentation $M$ times leads to
$$\big(F^M (Q + \beta)\big)(x,u) \,=\, \big(F^{M-1} (F(Q+\beta))\big)(x,u) \,=\, \big(F^{M-1} ((F Q) + \gamma\beta)\big)(x,u) \,=\, \dots \,=\, \big(F^M Q\big)(x,u) + \gamma^M \beta\,. \qquad (18)$$
Now we show that $\tilde{Q}$ satisfies the iterated $F$-operator inequality,
$$\begin{aligned}
\big(F^M \tilde{Q}\big)(x,u)
&\;=\; \big(F^M \hat{Q}\big)(x,u) \,-\, \gamma^M\, \frac{1+\gamma^M}{1-\gamma^M}\, \| Q^* - \hat{Q} \|_\infty \\
&\;\ge\; \hat{Q}(x,u) \,-\, \big(1+\gamma^M\big) \| Q^* - \hat{Q} \|_\infty \,-\, \gamma^M\, \frac{1+\gamma^M}{1-\gamma^M}\, \| Q^* - \hat{Q} \|_\infty \\
&\;=\; \hat{Q}(x,u) \,-\, \frac{1+\gamma^M}{1-\gamma^M}\, \| Q^* - \hat{Q} \|_\infty \;=\; \tilde{Q}(x,u)\,,
\end{aligned}$$
where the first equality comes from (18) applied with the shift $-\beta$, the inequality is a direct application of Lemma 1.2 to the term $(F^M \hat{Q})$ and holds for all $x \in \mathcal{X}$, $u \in \mathcal{U}$, and the final equality follows from (17).
Finally, if $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ allows for affine combinations of the basis functions, then $\hat{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ implies $\tilde{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, as the downward shift term is an additive constant.
Now we have all the ingredients to prove Theorem 4.1.
Proof of Theorem 4.1: Given any approximate Q-function from the basis, $\hat{Q}(x,u) \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, Lemma 1.3 allows us to construct
$$\tilde{Q} \;=\; \hat{Q} \,-\, \frac{1+\gamma^M}{1-\gamma^M}\, \| Q^* - \hat{Q} \|_\infty \;\in\; \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})\,,$$
which is feasible for (11). It can be shown that maximizing $\int_{\mathcal{X} \times \mathcal{U}} \hat{Q}(x,u)\, c(d(x,u))$ is equivalent to minimizing $\| Q^* - \hat{Q} \|_{1,c(x,u)}$ subject to the constraints of (11) that ensure $\hat{Q}$ is an under-estimator of $Q^*$, see [17, Lemma 1] for an example proof. Thus, starting from the left-hand side of (16),
$$\begin{aligned}
\big\| Q^* - \hat{Q}^* \big\|_{1,c(x,u)}
&\;\le\; \big\| Q^* - \tilde{Q} \big\|_{1,c(x,u)} \\
&\;\le\; \big\| Q^* - \tilde{Q} \big\|_{\infty} \\
&\;\le\; \big\| Q^* - \hat{Q} \big\|_{\infty} + \big\| \hat{Q} - \tilde{Q} \big\|_{\infty} \\
&\;=\; \big\| Q^* - \hat{Q} \big\|_{\infty} + \frac{1+\gamma^M}{1-\gamma^M}\, \big\| Q^* - \hat{Q} \big\|_{\infty}
\;=\; \frac{2}{1-\gamma^M}\, \big\| Q^* - \hat{Q} \big\|_{\infty}\,,
\end{aligned}$$
where the first inequality holds because $\tilde{Q}$ is also feasible for (11), the second inequality holds by assuming w.l.o.g. that $c(x,u)$ is a probability distribution, the third inequality is an application of the triangle inequality, the first equality stems directly from the definition of $\tilde{Q}$, and the final equality is an algebraic manipulation. As this argumentation holds for any $\hat{Q} \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$, the result follows.
Proof of Lemma 4.2: Starting from the definition of $\hat{Q}_{\mathrm{pwm}}$ we get,
$$\begin{aligned}
& \hat{Q}_j(x,u) \,\le\, \hat{Q}_{\mathrm{pwm}}(x,u)\,, && \forall j \in \mathcal{J} \\
\Rightarrow\;\; & \big(F^M \hat{Q}_j\big)(x,u) \,\le\, \big(F^M \hat{Q}_{\mathrm{pwm}}\big)(x,u)\,, && \forall j \in \mathcal{J} \\
\Leftrightarrow\;\; & \max_{j \in \mathcal{J}} \Big\{ \big(F^M \hat{Q}_j\big)(x,u) \Big\} \,\le\, \big(F^M \hat{Q}_{\mathrm{pwm}}\big)(x,u)\,,
\end{aligned}$$
where the inequalities hold point-wise for all $x \in \mathcal{X}$, $u \in \mathcal{U}$. The implication follows from the monotonicity property of the $F$-operator, and the equivalence holds as $j$ appears only on the left-hand side of the inequality.
The assumption of the lemma, that the iterated $F$-operator inequality is satisfied for each $j$, implies the following inequality,
$$\max_{j \in \mathcal{J}} \big\{ \hat{Q}_j(x,u) \big\} \;\le\; \max_{k \in \mathcal{J}} \Big\{ \big(F^M \hat{Q}_k\big)(x,u) \Big\}\,,$$
which holds point-wise for all $x \in \mathcal{X}$, $u \in \mathcal{U}$. Noting that the left-hand side of this inequality is the definition of $\hat{Q}_{\mathrm{pwm}}(x,u)$, and combining it with the chain above, the claim follows.
APPENDIX II
REFORMULATION OF F-OPERATOR INEQUALITY
This convex alternative applies to (11). A sufficient condition for $\hat{Q}_1 \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ to satisfy the iterated $F$-operator inequality is the following:
$$\hat{Q}_1(x,u) \,\le\, \big(F^M \hat{Q}_1\big)(x,u) \qquad (19a)$$
is implied by the existence of $\hat{Q}_j \in \hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ and $\hat{V}_j \in \hat{\mathcal{F}}(\mathcal{X})$, $j = 1, \dots, M$, such that
$$\begin{aligned}
& \hat{Q}_j(x,u) \,\le\, l(x,u) + \gamma\, \mathbb{E}\big[ \hat{V}_j\big( g(x,u,\xi) \big) \big]\,, \qquad && j = 1, \dots, M, \\
& \hat{V}_{j-1}(x) \,\le\, \hat{Q}_j(x,u)\,, && j = 2, \dots, M, \\
& \hat{V}_M(x) \,\le\, \hat{Q}_1(x,u)\,,
\end{aligned} \qquad (19b)$$
where all inequalities hold for all $x \in \mathcal{X}$ and $u \in \mathcal{U}$. The definitions of $\hat{\mathcal{F}}(\mathcal{X})$ and $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ are given in Sections III-B and IV-A respectively. The reformulation (19b) is linear in the additional Value function and Q-function variables introduced. Hence (19b) is a tractable set of constraints, given that $\hat{\mathcal{F}}(\mathcal{X})$ and $\hat{\mathcal{F}}(\mathcal{X} \times \mathcal{U})$ were chosen to overcome difficulties (D1)-(D4). Note that (19a) $\Leftrightarrow$ (19b) only when infinite dimensional function spaces are used for the additional variables, meaning that it is intractable to use an equivalent reformulation of the $F$-operator inequality.
REFERENCES
[1] R. E. Bellman, “On the theory of dynamic programming,” Proceedings
of the National Academy of Sciences of the United States of America,
vol. 38, no. 8, pp. 716–719, 1952.
[2] D. P. Bertsekas, Dynamic programming and optimal control. Athena
Scientific Belmont, MA, 2005, vol. 1, no. 3.
[3] W. B. Powell, Approximate Dynamic Programming: Solving the
Curses of Dimensionality, 2nd ed. Wiley, 2011.
[4] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.
Athena Scientific, 1996.
[5] P. J. Schweitzer and A. Seidmann, “Generalized polynomial approximations in Markovian decision processes,” Journal of Mathematical
Analysis and Applications, vol. 110, pp. 568–582, 1985.
[6] G. Tesauro, “Practical issues in temporal difference learning,” Machine
Learning, vol. 8, pp. 257–277, 1992.
[7] A. G. Barto and R. H. Crites, “Improving elevator performance using
reinforcement learning,” Advances in neural information processing
systems, vol. 8, pp. 1017–1023, 1996.
[8] N. Kariotoglou, S. Summers, T. Summers, M. Kamgarpour, and
J. Lygeros, “Approximate dynamic programming for stochastic reachability,” in Control Conference (ECC), 2013 European. IEEE, 2013,
pp. 584–589.
[9] A. Keshavarz and S. Boyd, “Quadratic approximate dynamic programming for input-affine systems,” International Journal of Robust and
Nonlinear Control, vol. 24, no. 3, pp. 432–449, July 2012.
[10] T. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers,
and J. Lygeros, “Approximate dynamic programming via sum of
squares programming,” in Control Conference (ECC), 2013 European.
IEEE, 2013, pp. 191–197.
[11] Y. Wang, B. O’Donoghue, and S. Boyd, “Approximate dynamic
programming via iterated bellman inequalities,” International Journal
of Robust and Nonlinear Control, 2014.
[12] B. O’Donoghue, Y. Wang, and S. Boyd, “Min-max approximate
dynamic programming,” in Computer-Aided Control System Design
(CACSD), 2011 IEEE International Symposium on. IEEE, 2011, pp.
424–431.
[13] J. B. Rawlings and D. Q. Mayne, Model Predictive Control: Theory
and Design. Nob Hill Publishing, 2009.
[14] R. Cogill, M. Rotkowitz, B. Van Roy, and S. Lall, “An approximate
dynamic programming approach to decentralized control of stochastic
systems,” in Control of Uncertain Systems: Modelling, Approximation,
and Design. Springer, 2006, pp. 243–256.
[15] J. Casti, “The linear-quadratic control problem: some recent results
and outstanding problems,” SIAM Review, vol. 22, no. 4, pp. 459–
485, 1980.
[16] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control
processes: basic optimality criteria. Springer Science & Business
Media, 2012, vol. 30.
[17] D. P. De Farias and B. Van Roy, “The linear programming approach
to approximate dynamic programming,” Operations Research, vol. 51,
no. 6, pp. 850–865, November-December 2003.
[18] W. B. Powell, “What you should know about approximate dynamic
programming,” Naval Research Logistics (NRL), vol. 56, no. 3, pp.
239–249, February 2009.
[19] K. Krishnamoorthy, M. Pachter, S. Darbha, and P. Chandler, “Approximate dynamic programming with state aggregation applied to
UAV perimeter patrol,” International Journal of Robust and Nonlinear
Control, vol. 21, no. 12, pp. 1396–1409, 2011.
[20] D. P. De Farias and B. Van Roy, “On constraint sampling in the
linear programming approach to approximate dynamic programming,”
INFORMS - Mathematics of Operations Research, vol. 29, no. 3, pp.
462–478, August 2004.
[21] T. Sutter, P. M. Esfahani, and J. Lygeros, “Approximation of constrained average cost Markov control processes,” in Decision and
Control (CDC), 2014 IEEE 53rd Annual Conference on. IEEE, 2014,
pp. 6597–6602.
[22] J. B. Lasserre, “Global optimization with polynomials and the problem
of moments,” SIAM Journal on Optimization, vol. 11, no. 3, pp. 796–
817, 2001.