Solving a Dynamic Adverse Selection Model Through Finite
Policy Graphs
Hao Zhang
Marshall School of Business, University of Southern California, Los Angeles, CA 90089
Abstract
This paper studies an infinite-horizon adverse selection model with an underlying Markov
information process and a risk-neutral agent. It introduces a graphic representation of continuation contracts and continuation-payoff frontiers, namely finite policy graphs, and provides
an algorithm that generates a sequence of such graphs to approximate the optimal policy
graph. The algorithm performs an additional step after each value iteration—replacing dominated points on the previous continuation-payoff frontier by points on the new frontier and
reevaluating the new frontier. This dominance-free reevaluation step accelerates the convergence of the continuation-payoff frontiers. Numerical examples demonstrate the effectiveness
of this algorithm and properties of the optimal contracts.
1 Introduction
The principal-agent model with hidden information (or adverse selection) provides a powerful
tool for analyzing bilateral interactions tangled with asymmetric information. The literature
on single-period adverse selection problems is vast (see e.g., the textbooks by Fudenberg and
Tirole 1991, Laffont and Martimort 2002, and Bolton and Dewatripont 2005). There is
also a large literature on multi-period adverse selection problems, but the majority of it
focuses on settings in which the hidden information is either constant or independent across
time periods (see the examples and references in Salanie 1997 and Bolton and Dewatripont
2005). Because the intriguing dynamics are assumed away from the information structure,
the results and implications obtained from these models need not extend to more general
settings. In recent years, Markov information structures, under which the private information
follows a Markov process or Markov decision process, have attracted increasing attention,
such as the endowment process in Fernandes and Phelan (2000), general state process in Cole
and Kocherlakota (2001), consumer preference process in Battaglini (2005), income process
in Doepke and Townsend (2006), productivity process in Kapicka (2008), and inventory
process in Zhang, Nagarajan, and Sosic (2010), to name a few.
In this paper, we study an infinite-horizon adverse selection model with an underlying
Markov decision process and a risk-neutral agent, which is a counterpart of the finite-horizon
model studied in Zhang and Zenios (2008). We discuss two applications of the model next.
First, consider a simple supply chain with asymmetric inventory information. A monopolistic
supplier sells a product to a retailer in multiple periods, and the retailer stocks inventory to
hedge against demand uncertainty. In each period t, the retailer observes its initial inventory
level xt and orders quantity qt from the supplier; the (random) demand is realized at the end
of the period and excess inventory is carried over to the next period. The inventory process
is a Markov decision process with transition probabilities p(xt+1 |xt , qt ) determined by the
demand distribution. The supplier cannot observe the retailer’s inventory although it must
be taken into account when designing the contract. The short-term contracting version of
this problem (in which a one-period contract is offered by the supplier in every period) is
studied in Zhang, Nagarajan and Sosic (2010), while the long-term contracting version fits
into the scope of this paper. Second, consider a dynamic pricing problem with changing
customer types. A firm sells a non-durable product (or service) in multiple periods. Each
customer has a type θt that affects his or her utility in period t and evolves according to a
Markov decision process, with transition probabilities p(θt+1 |θt , qt ), where qt is the purchasing
quantity (or quality) in period t. The firm does not observe customer types yet strives to
maximize its expected profit through a dynamic pricing mechanism. A two-state case of
this problem is studied in Battaglini (2005), with emphasis on the structure of the optimal
contracts. In contrast, the focus of this paper is on the numerical solution of the general
model.
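
To make the first application concrete, the following sketch (not from the paper; the demand support, probabilities, and capacity cap are hypothetical) shows how the retailer's inventory transition probabilities p(x_{t+1} | x_t, q_t) could be generated from a discrete demand distribution:

```python
# A minimal sketch (hypothetical data) of the inventory application's
# transition probabilities p(x_{t+1} | x_t, q_t) induced by a discrete
# demand distribution.
import numpy as np

demand_values = np.array([0, 1, 2, 3])          # hypothetical demand support
demand_probs  = np.array([0.1, 0.4, 0.3, 0.2])  # hypothetical probabilities
max_inventory = 5                               # hypothetical cap, keeps the state set finite

def transition_probs(x, q):
    """Return p(x_{t+1} = y | x_t = x, q_t = q) for y = 0, ..., max_inventory.

    Excess inventory is carried over: x_{t+1} = max(x + q - D, 0), capped at
    max_inventory so that the state set stays finite.
    """
    probs = np.zeros(max_inventory + 1)
    for d, pd in zip(demand_values, demand_probs):
        y = min(max(x + q - d, 0), max_inventory)
        probs[y] += pd
    return probs

print(transition_probs(x=2, q=1))   # one row of the transition matrix P(q)
```

Each call returns one row of the transition matrix P(q), which is all that the Markov information structure of Section 2 requires.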
In spite of many potential applications, dynamic adverse selection problems with Markov
transitions remain relatively under-represented in the literature, partly due to the technical
complexity of such problems. The optimal contracts tend to be history dependent, and
finding such a contract is computationally costly—a phenomenon known as the “curse of
dimensionality.” In this paper, we present an algorithm to find optimal long-term contracts
(under full commitment of the principal). The algorithm differs from the existing methods
in the literature in a significant way, as summarized below.
The main methodology for tackling a dynamic adverse selection problem originates
from Abreu, Pearce, and Stacchetti (1990) (APS, hereinafter) on repeated multi-agent games
with imperfect monitoring. Under the APS approach, the set of continuation payoffs in equilibrium is recursively defined through a functional operator, and the equilibrium payoff set
can be approximated by a sequence of continuation-payoff sets generated by the operator.
This approach is applicable to a wide range of dynamic games, but approximating the equilibrium payoff set is often computationally challenging. A common remedy is to discretize the
continuation-payoff sets, as in Doepke and Townsend (2006) on a hybrid adverse selection
and moral hazard problem between a social planner and a representative agent who has
private income information and can make private investment efforts.¹ Another way of implementing an APS recursion is proposed by Judd, Yeltekin, and Conklin (2003). They bound the equilibrium payoff set by convex polytopes from inside and outside and provide algorithms that generate inner and outer polytopes to approximate the equilibrium payoff set. This approach is applied by Sleet (2001) and Sleet and Yeltekin (2007) to dynamic signaling games, in which a government designs a monetary policy for an economy populated by a continuum of households and firms, and the government has private information on the state of the economy (in the former paper) or on whether it will commit to the socially optimal monetary policy (in the latter paper).

¹ Recursive expression of the continuation-payoff set (or function) is also applicable to dynamic moral hazard problems. Discretizing the continuation-payoff set (or function) is common as well, as in Phelan and Townsend (1991) on a repeated moral hazard problem between a social planner and a continuum of agents with private production efforts, which is generalized by Sleet and Yeltekin (2001) by allowing the planner (or firm) to lay off workers.
In this paper, we first introduce a graphic representation of (infinite-horizon) long-term
contracts and continuation-payoff frontiers, namely finite policy graphs, and then present an
algorithm that generates a sequence of such graphs to approach the optimal policy graph.
We take advantage of an important property of the infinite-horizon model that the optimal
(equilibrium) continuation-payoff set is a fixed point of a functional operator (of the APS
style). In contrast with the aforementioned methods of implementing APS-style recursions,
our algorithm is in the spirit of policy iteration for solving Markov decision processes (Howard
1960) and finite state controller for solving partially observable Markov decision processes
(Hansen 1998), as it strives to improve the structure of the policy graph in each iteration
through rerouting existing branches. More specifically, the algorithm performs an extra step,
namely dominance-free reevaluation, after each value iteration: identifying dominated points
on the previous continuation-payoff frontier, replacing them by those on the newly created
frontier, and recalculating the new frontier through a set of linear equations (determined
by the dominance relations). We show analytically and through numerical experiments that
this algorithm outperforms its value iteration counterpart. It has the additional advantage
of facilitating the exploration of optimal contract structures.
We assume in the model that the agent is risk neutral toward monetary uncertainties.
This assumption is common in the operations research and management science literature,
as the agents are often firms or businesses which have the capacity to bear or transfer some
financial risks; it is not uncommon in the economics literature as well, in which the agents
under consideration are often individual consumers. The two examples presented earlier fit
well in the model studied in the paper. Other examples can be found in Battaglini and Coate
(2008) and Tchistyi (2006).
The remainder of the paper is organized as follows. Section 2 introduces the model and
some basic results. Section 3 presents the graphic representation of long-term contracts and
continuation-payoff frontiers and the algorithm that generates a converging sequence of such
graphs. Section 4 analyzes two numerical examples that demonstrate the main features of
the algorithm and properties of the optimal solution. The last section concludes with future
research suggestions.
2 A Dynamic Adverse Selection Model and Basic Results
In this section, we introduce the dynamic adverse selection model proposed in Zhang and
Zenios (2008) and discuss some basic results of the model. Vectors and matrices will be
denoted by bold letters throughout the paper, e.g., φ, px (a), and P(a).
A Dynamic Adverse Selection Model. At the beginning of the horizon, the principal
makes a take-it-or-leave-it offer to the agent in the form of a long-term contract that covers
T ≤ ∞ periods (henceforth, the principal will be referred to as “she” and the agent as
“he”). If the agent accepts the offer, the contract execution starts. Within each period t,
the following events take place. First, the agent privately observes the state of a Markov
decision process, denoted by xt . The state set is finite and denoted by X = {1, · · · , n}.
Next, the agent takes a public action at and incurs a cost cxt (at ). The action set is also
finite and is denoted by A = {1, · · · , m}. At the end of the period, the principal receives
a reward rxt (at ). She then pays the agent st as specified in the contract, contingent upon
publicly observable and verifiable information. Finally, the hidden state moves to xt+1 , with
transition probabilities $\Pr(x_{t+1}=y \mid x_t=x, a_t=a)$, or simply $p_{xy}(a)$. A row vector $p_x(a) = (p_{x1}(a), \ldots, p_{xn}(a))$ is defined for each $x \in X$ and $a \in A$. The distribution of the initial state $x_1$ is publicly known and is described by probabilities $\beta_{x_1}$ such that $\sum_{x_1 \in X} \beta_{x_1} = 1$. The total discounted payoff for the principal is given by $\sum_{t=1}^{T} \delta^{t-1}\left(r_{x_t}(a_t) - s_t\right)$, and that for the agent is $\sum_{t=1}^{T} \delta^{t-1}\left(s_t - c_{x_t}(a_t)\right)$. The state history $(x_1, \cdots, x_t)$ and action history $(a_1, \cdots, a_t)$ are abbreviated as $x^t$ and $a^t$, respectively. The beginning of period t is referred to as time t.
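
As a quick numerical illustration of this payoff accounting (a minimal sketch with made-up numbers, not taken from the paper), the two discounted sums can be computed as:

```python
# The principal collects sum_t delta^{t-1} (r_{x_t}(a_t) - s_t) and the agent
# collects sum_t delta^{t-1} (s_t - c_{x_t}(a_t)); the sample path below is hypothetical.
delta = 0.95                               # discount factor (symbol from the model)
rewards  = [1.0, 0.8, 1.2]                 # r_{x_t}(a_t) along a hypothetical path
costs    = [0.3, 0.5, 0.2]                 # c_{x_t}(a_t) along the same path
payments = [0.6, 0.7, 0.5]                 # s_t along the same path

principal = sum(delta**t * (r - s) for t, (r, s) in enumerate(zip(rewards, payments)))
agent     = sum(delta**t * (s - c) for t, (s, c) in enumerate(zip(payments, costs)))
print(principal, agent)
```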
The principal’s problem is to design a long-term contract to maximize her expected
total payoff subject to the agent’s incentive compatibility and participation (or individual
rationality) constraints. We assume that the principal can make full commitment not to
renegotiate with the agent during contract execution. As in the standard setting, we assume
that the state xt cannot be inferred (or verified) from the reward rxt (at ).
Revelation Contracts. Although the types of possible long-term contracts are numerous,
there is no loss of generality to focus on the following revelation contracts:²

A dynamic randomized revelation contract defines the following sequence of events in any period t after any public history $(\hat{x}^{t-1}, z^{t-1})$: the agent reports a state $\hat{x}_t$; a public random variable $z_t$ is drawn from the set $Z = \{1, \cdots, n\}$ with probabilities $\theta_t(\hat{x}^t, z^t)$ such that $\sum_{z_t \in Z} \theta_t(\hat{x}^t, z^{t-1}, z_t) = 1$; the agent takes action $a_t(\hat{x}^t, z^t)$; and the principal pays the agent $s_t(\hat{x}^t, z^t)$. The contract is denoted by $\sigma = \{\theta_1(\hat{x}_1, z_1), a_1(\hat{x}_1, z_1), s_1(\hat{x}_1, z_1); \theta_2(\hat{x}^2, z^2), a_2(\hat{x}^2, z^2), s_2(\hat{x}^2, z^2); \cdots\}_{\hat{x}^T \in X^T, z^T \in Z^T}$, or $\sigma_t(\hat{x}^{t-1}, z^{t-1}) = \{\theta_t(\hat{x}^t, z_t), a_t(\hat{x}^t, z_t), s_t(\hat{x}^t, z_t), \sigma_{t+1}(\hat{x}^t, z^t)\}_{\hat{x}_t \in X, z_t \in Z}$ recursively.
The above contract generalizes the familiar static revelation contract to multiple periods and introduces randomization at the beginning of every period, after the state is reported and before the action is taken.³ In the deterministic special case, the contract can be simplified to $\sigma = \{a_1(\hat{x}_1), s_1(\hat{x}_1); a_2(\hat{x}^2), s_2(\hat{x}^2); \cdots\}_{\hat{x}^T \in X^T}$, or $\sigma_t(\hat{x}^{t-1}) = \{a_t(\hat{x}^t), s_t(\hat{x}^t), \sigma_{t+1}(\hat{x}^t)\}_{\hat{x}_t \in X}$ recursively. We call a dynamic revelation contract a truthful revelation contract if it induces the agent to reveal the true state in every period. We call the part of a long-term contract starting from time t (after any public information history) a time-t continuation contract.

² A dynamic revelation principle is first shown by Myerson (1986) for multi-player multi-stage communication games in which the players can freely communicate through a central mediator at the beginning of each stage.

³ Because any point in a facet of an n-dimensional polytope can be expressed as a convex combination of no more than n vertices, the random variable $z_t$ need not take more than n values, and hence it is sufficient to define $Z = \{1, \cdots, n\}$. The above randomized revelation contract differs from the one defined in Zhang and Zenios (2008), in which the randomization is over the set of actions A. It can be shown that the two definitions are equivalent, but the one given here will be more convenient for the algorithm presented in Subsection 3.4. In terms of the total number of variables in the contract, there is no clear winner between these two definitions: if |A| < n, the format in Zhang and Zenios (2008) is more parsimonious than the one given here; if |A| > n, the reverse is true.
The principal's problem can be formulated within the class of truthful revelation contracts. The notation can be simplified as follows: suppressing the history $(\hat{x}^{t-1}, z^{t-1})$, moving $\hat{x}_t$ to the subscript (without the "hat"), moving $z_t$ to the superscript, and removing the index t when it is clear from the context. Thus, a time-t randomized revelation contract can be compactly written as $\sigma_t = \{\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z\}_{x \in X, z \in Z}$. The tuple $(\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z)_{z \in Z}$, given reported state x, is referred to as a submenu for state x.
If the agent reports truthfully under $\sigma_t$, the two parties' and system's expected future payoffs in state x are given by:

$$u_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ s_x^z - c_x(a_x^z) + \delta\, p_x(a_x^z)\, u(\sigma_{t+1,x}^z) \right], \qquad (1)$$

$$\pi_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - s_x^z + \delta\, p_x(a_x^z)\, \pi(\sigma_{t+1,x}^z) \right], \qquad (2)$$

$$\phi_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - c_x(a_x^z) + \delta\, p_x(a_x^z)\, \phi(\sigma_{t+1,x}^z) \right], \qquad (3)$$

where $u(\sigma) = (u_x(\sigma))_{x \in X}$, $\pi(\sigma) = (\pi_x(\sigma))_{x \in X}$, and $\phi(\sigma) = (\phi_x(\sigma))_{x \in X}$ are column vectors and pu is the matrix multiplication of a row vector and a column vector. Thus, every time-t
truthful revelation contract σt generates a triple of continuation-payoff vectors (u, π, φ)(σt ).
Because φ(σt ) = u(σt ) + π(σt ), it suffices to concentrate on the pair (u, φ)(σt ).
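
The following sketch evaluates expressions (1)-(3) for a single submenu, assuming the continuation-payoff vectors of the successor contracts are already known; all numerical inputs are hypothetical placeholders.

```python
# A minimal sketch (hypothetical data) of expressions (1)-(3): the agent's,
# principal's, and system's expected payoffs in reported state x, given the
# submenu for x and the payoff vectors of the successor contracts.
import numpy as np

delta = 0.95
theta = np.array([0.6, 0.4])            # theta_x^z, sums to 1
s     = np.array([0.5, 0.7])            # payments s_x^z
c     = np.array([0.3, 0.2])            # costs c_x(a_x^z) of the prescribed actions
r     = np.array([1.0, 0.9])            # rewards r_x(a_x^z)
p     = np.array([[0.8, 0.2],           # transition rows p_x(a_x^z), one per z
                  [0.3, 0.7]])
u_next   = np.array([[2.0, 1.0], [1.5, 0.8]])   # u(sigma_{t+1,x}^z), one row per z
phi_next = np.array([[3.0, 2.5], [2.8, 2.2]])   # phi(sigma_{t+1,x}^z), one row per z

u_x   = np.sum(theta * (s - c + delta * np.einsum('zy,zy->z', p, u_next)))
phi_x = np.sum(theta * (r - c + delta * np.einsum('zy,zy->z', p, phi_next)))
pi_x  = phi_x - u_x                      # since phi = u + pi
print(u_x, pi_x, phi_x)
```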
Policy Trees. A long-term revelation contract can be expressed as a policy tree. Figure
1(a) illustrates a deterministic contract. A branch of the tree corresponds to a reported
state xt and is associated with an action-payment pair (at , st ). Figure 1(b) illustrates a
randomized contract: a branch emanating from a node corresponds to a reported state xt ;
it splits into a collection of sub-branches, each of which corresponds to a realization of the random variable $z_t$ and is associated with a probability-action-payment triple $(\theta_t, a_t, s_t)$.

Figure 1: Policy tree of a long-term revelation contract in (a) the deterministic case, and (b) the randomized case.
Every node of the policy tree has two interpretations. Consider the deterministic case for instance. A node at time t, viewed from the top down, corresponds to a history of the reported states $\hat{x}^{t-1}$, associated with a history of action-payment pairs. Viewed from the bottom up, the node corresponds to a continuation contract $\sigma_t$, associated with a continuation-payoff pair $(u, \phi)$. The bottom-up (or backward) perspective will be the emphasis of this paper.

Continuation-Payoff Frontiers and the Principal's Problem. Given any continuation-payoff vector of the agent $u_t$, there exists a maximum continuation-payoff vector for the system, $\phi^*_t$, attainable by a continuation contract that yields $u_t$ for the agent. This gives rise to the time-t continuation-payoff frontier (or function), $\phi^*_t(u_t) = \sup\{\phi(\sigma_t) : \sigma_t \in \Sigma^{\mathrm{TRC}}_t \text{ and } u(\sigma_t) = u_t\}$, for $u_t \in U_t$, where $\Sigma^{\mathrm{TRC}}_t$ is the set of time-t truthful revelation contracts and $U_t = \{u(\sigma_t) : \sigma_t \in \Sigma^{\mathrm{TRC}}_t\}$ is the time-t continuation-agent-payoff set.⁴,⁵

⁴ We can also define continuation-payoff frontiers in terms of the principal's continuation payoffs (against the agent's), which is more common in the literature. The two types of frontiers are equivalent because they have a one-to-one correspondence. However, analyzing continuation payoffs from the system's perspective is more convenient for us because the payment terms $s_x^z$ are cancelled out in the system payoff expression (3) due to the risk neutrality of the principal and agent. This technical convenience is inessential to the algorithm presented in Subsection 3.4.

⁵ Notice that the notation differs from the standard notation in economics, in which $\phi^*_t(u_t)$ is usually expressed in components (i.e., $\phi^*_{t,x}(u_t)$), denoted by $V_t(u_t, x_t)$ or the like. The notation introduced here emphasizes the multi-dimensional nature of the continuation payoffs for the principal, agent, and system, which is consistent with the emphasis of the vector-valued continuation-payoff frontiers (functions) throughout this paper.
The continuation-payoff frontiers φ∗t : Ut → Rn can be obtained through backward
induction, which is facilitated by the following problem, called the auxiliary planning problem
in the economics literature. The goal of the problem is to find the xth component of the
time-t continuation-payoff frontier, φ∗t,x (·), from the time-(t + 1) frontier φ∗t+1 (·). We refer
to φ∗t,x (·) as a frontier component or component function. If the agent is offered (promised)
a continuation-payoff vector ut , φ∗t,x (ut ) is the maximum continuation payoff for the system
from state x onward, obtained through optimal choices of the randomization probabilities
θxz , actions azx , and the agent’s time-(t + 1) continuation-payoff vectors uzx :
$$\phi^*_{t,x}(u_t) = \max_{\{\theta_x^z \in [0,1],\, a_x^z \in A,\, u_x^z \in U_{t+1}\}_{z \in Z}} \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - c_x(a_x^z) + \delta\, p_x(a_x^z)\, \phi^*_{t+1}(u_x^z) \right] \qquad (4)$$

$$\text{s.t.}\quad u_{t,x'} - u_{t,x} \ge \sum_{z \in Z} \theta_x^z \left\{ c_x(a_x^z) - c_{x'}(a_x^z) + \delta\, [p_{x'}(a_x^z) - p_x(a_x^z)]\, u_x^z \right\}, \quad x' (\ne x) \in X, \qquad (5)$$

$$\sum_{z \in Z} \theta_x^z = 1. \qquad (6)$$
The incentive compatibility (IC) constraints (5) reflect a change of variables, from period-t
payments szx to the agent’s time-t continuation payoffs ut,x (the former can be easily recovered
from the latter, with some redundancy). The constraints can be derived as follows. If the
agent reports a state x truthfully, his continuation payoff at time t would be given by
$u_{t,x} = \sum_{z \in Z} \theta_x^z \{ s_x^z - c_x(a_x^z) + \delta\, p_x(a_x^z)\, u_x^z \}$. If the true state is $x'$ but the agent reports $x$, his continuation payoff would be $u_{t,x|x'} = \sum_{z \in Z} \theta_x^z \{ s_x^z - c_{x'}(a_x^z) + \delta\, p_{x'}(a_x^z)\, u_x^z \} = u_{t,x} + \sum_{z \in Z} \theta_x^z \{ c_x(a_x^z) - c_{x'}(a_x^z) + \delta\, [p_{x'}(a_x^z) - p_x(a_x^z)]\, u_x^z \}$ (recall that the subscripts in $\theta_x^z$, $s_x^z$, and $u_x^z$ refer to the reported state, and the ones in $c_x(a)$ and $p_x(a)$ refer to the true state). Therefore, constraints (5) are in fact $u_{t,x|x'} \le u_{t,x'}$, which prevent the agent from misreporting state $x'$ as $x$. Intuitively, because the payment term $s_x^z$ and cost term $c_x(a_x^z)$ are additively separable in the agent's payoff function, the expected period-t payment $\sum_{z \in Z} \theta_x^z s_x^z$ (given that the agent reports $x$) contributes to the agent's continuation payoff uniformly, independent of the true state. In other words, the difference between $u_{t,x}$ and $u_{t,x|x'}$ is independent of the payments, and thus the agent's gain from misrepresenting $x'$ as $x$, i.e., $u_{t,x|x'} - u_{t,x'}$, can be conveniently translated into $u_{t,x} - u_{t,x'}$.
The problem (4)-(6) only involves the submenu $(\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z)_{z \in Z}$ for state x ($s_x^z$ and $\sigma_{t+1,x}^z$ are replaced by $u_{t,x}$ and $u_x^z$, respectively), and the constraint set (5) is only a subset
of the IC constraints that define a time-t truthful revelation contract. For any ut ∈ Ut ,
a continuation contract can be formed by combining the n submenus found through the
auxiliary planning problems given ut . The continuation-payoff vector ut promised to the
agent serves as a parameter of the problem (4)-(6). The resulting optimal objective function
φ∗t,x (·) possesses useful properties such as piece-wise linearity, concavity, and monotonicity.
The problem is only feasible for certain values of ut , denoted by set Ut,x , which is endogenously determined by the IC constraints (5). The time-t continuation-agent-payoff set Ut
defined earlier can be obtained by Ut = ∩x∈X Ut,x , because the collection of IC constraints
from all time-t auxiliary planning problems defines Ut exactly.
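
As an illustration of how a candidate submenu can be screened against constraint (5), the sketch below performs the check directly; all primitives and candidate values are hypothetical two-state data.

```python
# A sketch (hypothetical data) of the incentive-compatibility constraints (5):
# given the submenu chosen for reported state x and the promised vector u_t,
# check that u_{t,x'} - u_{t,x} >= sum_z theta_x^z { c_x(a) - c_{x'}(a)
#                                   + delta [p_{x'}(a) - p_x(a)] u_x^z }  for all x' != x.
import numpy as np

delta = 0.95
n = 2                                       # number of states
x = 0                                       # reported state (0-indexed)
u_t = np.array([2.0, 1.2])                  # promised time-t continuation payoffs
theta = np.array([0.7, 0.3])                # theta_x^z, one entry per z
actions = [0, 1]                            # a_x^z (indices into the primitives below)
c = np.array([[0.1, 0.6], [0.8, 0.2]])      # hypothetical costs c[x, a]
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),    # hypothetical transition matrices P(a)
     np.array([[0.2, 0.8], [0.5, 0.5]])]
u_next = np.array([[1.8, 0.9], [1.5, 1.1]]) # u_x^z, one row per z (hypothetical)

for xp in range(n):
    if xp == x:
        continue
    rhs = sum(theta[z] * (c[x, a] - c[xp, a]
                          + delta * (P[a][xp] - P[a][x]) @ u_next[z])
              for z, a in enumerate(actions))
    print(f"IC against misreporting x'={xp} as x={x}:", u_t[xp] - u_t[x] >= rhs)
```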
The problem formulation (4)-(6) implies that $\phi^*_t(u_t) = \phi^*_t(u_t + \lambda \mathbf{1})$ for any $u_t \in U_t$ and $\lambda \in \mathbb{R}$, where $\mathbf{1}$ is a column vector of ones. Thus, the set $U_t$ has one degree of freedom, and it is convenient to focus on the agent's relative continuation payoffs $u_{t,x'} - u_{t,x}$. For any pair of states $x \ne x'$, the IC constraints in the auxiliary planning problems for states $x$ and $x'$ take the forms "$u_{t,x'} - u_{t,x} \ge \cdots$" and "$u_{t,x} - u_{t,x'} \ge \cdots$," respectively, and hence the relative continuation payoffs are bounded from both sides.⁶
After the continuation-payoff frontier $\phi^*_1 : U_1 \to \mathbb{R}^n$ is found, the principal solves a simple problem at time 1: $\max_{u_1 \in U_1} \{\beta(\phi^*_1(u_1) - u_1) : u_1 \ge 0\}$. The participation constraint
u1 ≥ 0 is based on the assumption that the agent knows the initial state x1 when signing
the contract. Participation constraints are unnecessary when finding φ∗t : Ut → Rn because
they can be easily satisfied by transferring payments across periods.
⁶ The dimensional reduction of the agent's continuation-payoff set significantly simplifies the illustration of continuation-payoff frontiers in the two-state case, as evident in Figures 2, 7, and 8. However, this simplification is inessential for the algorithm discussed in Subsection 3.4.
Convergence of Continuation-Payoff Frontiers Over the Infinite Horizon. The
problem (4)-(6) for state x in effect defines a functional operator Γ∗x that maps the vector-valued function φ∗t+1 : Ut+1 → Rn to the scalar-valued function φ∗t,x : Ut,x → R. The problem
can be expressed succinctly as φ∗t,x (·) = Γ∗x φ∗t+1 (·). Each iteration in the backward induction,
from continuation-payoff frontier φ∗t+1 (·) to φ∗t (·), also defines a functional operator (of the
APS style), denoted by Γ∗ and referred to as the value-iteration operator. The iteration can
be conveniently written as φ∗t (·) = Γ∗ φ∗t+1 (·). Clearly, Γ∗ = (Γ∗x )x∈X , where the domain of
Γ∗ φ∗t+1 (·) equals the intersection of the domains of Γ∗x φ∗t+1 (·). It can be shown that:
Given a bounded continuous function φ∗0 : U0 → Rn with a convex domain
U0 ⊂ Rn , the sequence of continuation-payoff functions φ∗k (·) = Γ∗ φ∗k−1 (·) converges to a unique bounded continuous function φ∗∞ : U∞ → Rn . That is, φ∗∞ (·)
is the unique fixed point of the operator Γ∗ .
In other words, over the infinite horizon, the continuation-payoff frontiers are identical in
all periods, as given by φ∗∞ (·) and referred to as the optimal (continuation-payoff ) frontier.7
The convergence occurs at two levels: the sequence of domains Uk converges to U∞ (under
the Hausdorff metric), and the sequence of functions φ∗k (·) converges to φ∗∞ (·).8 In the
remainder of this paper, we will concentrate on φ∗∞ (·) and remove the index ∞ for simplicity
(all other time indices will be suppressed from the notation as well).
⁷ Because the frontier is unique in the infinite-horizon case, the modifier "optimal" is somewhat redundant. It is adopted to distinguish the true continuation-payoff frontier from other approximated ones.

⁸ It is commonly seen in the economics literature that an APS-style operator like Γ∗ has a unique fixed point if the domains of the continuation-payoff functions are fixed at U∞ (e.g., Fernandes and Phelan 2000, and Doepke and Townsend 2006). The result stated here frees the domains and justifies the algorithm presented in the next section. The proof of the result is available from the author.

3 Finding the Optimal Continuation-Payoff Frontier Through Finite Policy Graphs

In this section, we solve the infinite-horizon version of the model presented in Section 2. We first focus on the two-state model, which will be useful for demonstrations throughout the
paper and for numerical experiments in Section 4. We then introduce a graphic representation for long-term contracts and continuation-payoff frontiers, using finitely many nodes, and
present an algorithm that generates a sequence of such graphs to approximate the optimal
frontier. We also discuss a circular structure in the optimal frontier. Finally, we show that
optimal contracts can be easily constructed from the optimal frontier.
3.1 Optimal Continuation-Payoff Frontier in the Two-State Case
In the two-state case, X = {1, 2}. Due to the redundancy in the agent's continuation-payoff set, the agent's absolute continuation-payoff vector u can be replaced by a relative continuation-payoff variable $u = u_1 - u_2$, which reduces the unbounded continuation-payoff set U to a compact interval $U = [\underline{u}, \overline{u}]$ (with a slight abuse of notation). The auxiliary planning problem (4)-(6) can be rewritten as follows, given u. For x = 1,

$$\phi^*_1(u) = \max_{\{\theta_1^z \in [0,1],\, a_1^z \in A,\, v_1^z \in U\}_{z \in Z}} \sum_{z \in Z} \theta_1^z \left\{ r_1(a_1^z) - c_1(a_1^z) + \delta\, p_1(a_1^z)\, \phi^*(v_1^z) \right\} \qquad (7)$$

$$\text{s.t.}\quad u \le \sum_{z \in Z} \theta_1^z \left\{ c_2(a_1^z) - c_1(a_1^z) + \delta\, \left(p_{11}(a_1^z) - p_{21}(a_1^z)\right) v_1^z \right\}, \qquad (8)$$

$$\sum_{z \in Z} \theta_1^z = 1; \qquad (9)$$

and for x = 2,

$$\phi^*_2(u) = \max_{\{\theta_2^z \in [0,1],\, a_2^z \in A,\, v_2^z \in U\}_{z \in Z}} \sum_{z \in Z} \theta_2^z \left\{ r_2(a_2^z) - c_2(a_2^z) + \delta\, p_2(a_2^z)\, \phi^*(v_2^z) \right\} \qquad (10)$$

$$\text{s.t.}\quad u \ge \sum_{z \in Z} \theta_2^z \left\{ c_2(a_2^z) - c_1(a_2^z) + \delta\, \left(p_{11}(a_2^z) - p_{21}(a_2^z)\right) v_2^z \right\}, \qquad (11)$$

$$\sum_{z \in Z} \theta_2^z = 1. \qquad (12)$$
The incentive compatibility constraint (8) prevents the agent from misreporting state 2 as
state 1, and constraint (11) does the opposite. The function φ∗ (·) in the objective functions
is the optimal continuation-payoff frontier, and the optimal objective functions φ∗x (·), x = 1
and 2, give its two components; the domain of φ∗ (·), U , is endogenously determined by
the constraints. This formulation underscores the fact that φ∗ (·) is a fixed point of the
value-iteration operator Γ∗ .
The component functions φ∗1(·) and φ∗2(·) can be depicted in a two-dimensional space, namely the (continuation-payoff) component space, which is useful for constructing and illustrating continuation-payoff frontiers. Any continuation-payoff pair (u, φ) can be represented by two points (u, φ1) and (u, φ2) in the component space. As illustrated in Figure 2, a component function φ∗x(·) can be constructed from certain intermediate functions φx^(a)(·), a ∈ A, which are in turn obtained from φ∗(·) through simple transformations.

To define these intermediate functions, we define an operator Γx^(a) : R³ → R² for any x ∈ X and a ∈ A that maps a continuation-payoff pair (v, ψ) ∈ R × R² to a continuation-payoff component (u, φx) ∈ R × R as follows:

$$\phi_x = r_x(a) - c_x(a) + \delta\, p_x(a)\,\psi, \qquad (13)$$

$$u = c_2(a) - c_1(a) + \delta\, \left(p_{11}(a) - p_{21}(a)\right) v. \qquad (14)$$

Applying Γx^(a) to the optimal continuation-payoff function φ∗(·) point by point—i.e., mapping each point (v, φ∗(v)) to a component point Γx^(a)(v, φ∗(v))—we obtain an intermediate function φx^(a)(·). Used this way, Γx^(a) becomes a functional operator, and we can write φx^(a)(·) = Γx^(a) φ∗(·). As will be useful to the algorithm in Subsection 3.4, we define another operator Γ^(a) : R³ → R³, which maps a continuation-payoff pair (v, ψ) ∈ R × R² to another pair (u, φ) ∈ R × R² according to equations (13)-(14).⁹
Now we show that the components of the optimal continuation-payoff frontier can be constructed from the intermediate functions φx^(a)(·). In the problem formulation (7)-(9), if the IC constraint (8) were instead an equality, the optimal objective function φ∗1(·) would be the convex hull of the intermediate functions φ1^(a)(·), a ∈ A. Because of the inequality, any feasible solution to the problem (7)-(9) under parameter u′ must be feasible under parameter u′′ ≤ u′ as well. Thus, φ∗1(u′′) ≥ φ∗1(u′) for any u′′ ≤ u′, and the inequality (8) effectuates a projection along the −u direction, making the left tail of φ∗1(·) flat. Similarly, the component function φ∗2(·) is the convex hull of the intermediate functions φ2^(a)(·), a ∈ A, with a flat right tail.

⁹ The relationship between Γ^(a) and Γx^(a) is similar to that between Γ∗ and Γ∗x. However, when used as functional operators, the domain of Γ^(a) φ∗(·) is identical to those of Γx^(a) φ∗(·), while the domain of Γ∗ φ∗(·) is the intersection of those of Γ∗x φ∗(·).

Figure 2: Optimal continuation-payoff frontier and intermediate functions for Example 1. (The figure also marks the projection directions for the two component functions.)
Example 1. Consider a model with two states, two actions, and the following parameters: δ = 0.95,

$$C = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}, \quad R - C = \begin{pmatrix} -0.5 & 1 \\ 0.5 & -1 \end{pmatrix}, \quad P(1) = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix}, \quad P(2) = \begin{pmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{pmatrix},$$

where each row of C and R − C corresponds to a state, each column corresponds to an action, and each transition matrix P(a) consists of probabilities pxy(a). The optimal frontier components φ∗x(·) and intermediate functions φx^(a)(·) are illustrated in Figure 2, in the component space.
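
For concreteness, here is a small sketch of the operator Γx^(a) in equations (13)-(14), instantiated with the Example 1 parameters; the continuation point (v, ψ) fed to it is an arbitrary placeholder, since the true frontier is what the algorithm of Subsection 3.4 sets out to compute.

```python
# The operator Gamma_x^(a) of equations (13)-(14), written out for the
# two-state case and instantiated with the Example 1 primitives.
import numpy as np

delta = 0.95
C    = np.array([[0.0, 1.0], [2.0, 0.0]])      # costs c_x(a), rows = states
R_MC = np.array([[-0.5, 1.0], [0.5, -1.0]])    # r_x(a) - c_x(a), rows = states
P    = [np.array([[0.8, 0.2], [0.3, 0.7]]),    # P(1)
        np.array([[0.2, 0.8], [0.8, 0.2]])]    # P(2)

def gamma(x, a, v, psi):
    """Map a continuation-payoff pair (v, psi) to a component point (u, phi_x)."""
    phi_x = R_MC[x, a] + delta * P[a][x] @ psi                      # equation (13)
    u = C[1, a] - C[0, a] + delta * (P[a][0, 0] - P[a][1, 0]) * v   # equation (14)
    return u, phi_x

# Image of a hypothetical continuation point (v, psi) under each (state, action) pair.
v, psi = 1.0, np.array([1.2, 0.9])
for x in range(2):
    for a in range(2):
        print(f"x={x+1}, a={a+1}:", gamma(x, a, v, psi))
```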
Since the optimal frontier φ∗(·) is a fixed point of the value-iteration operator Γ∗, the construction of φ∗x(·) is circular in nature: the intermediate functions φx^(a)(·) can be obtained from the component functions through linear transformations, defined through (13)-(14), while the component functions φ∗x(·) can be constructed from the intermediate functions through convex hull operations and projections.
Remark 1 The agent's relative continuation-payoff set $U = [\underline{u}, \overline{u}]$ can be determined as follows. Define the sets $A^+ = \{a \in A : p_{11}(a) - p_{21}(a) > 0\}$, $A^- = \{a \in A : p_{11}(a) - p_{21}(a) < 0\}$, and $A^0 = \{a \in A : p_{11}(a) - p_{21}(a) = 0\}$, and the operator $\Gamma^{(a)}_{[u]} : \mathbb{R} \to \mathbb{R}$ according to equation (14), i.e., $\Gamma^{(a)}_{[u]}(v) = c_2(a) - c_1(a) + \delta\,(p_{11}(a) - p_{21}(a))\,v$. Then, the bounds $\underline{u}$ and $\overline{u}$ are determined by two equations simultaneously:

$$\underline{u} = \min\left( \{\Gamma^{(a)}_{[u]}(\underline{u})\}_{a \in A^+} \cup \{\Gamma^{(a)}_{[u]}(\overline{u})\}_{a \in A^-} \cup \{\Gamma^{(a)}_{[u]}(0)\}_{a \in A^0} \right), \qquad (15)$$

$$\overline{u} = \max\left( \{\Gamma^{(a)}_{[u]}(\overline{u})\}_{a \in A^+} \cup \{\Gamma^{(a)}_{[u]}(\underline{u})\}_{a \in A^-} \cup \{\Gamma^{(a)}_{[u]}(0)\}_{a \in A^0} \right). \qquad (16)$$

In Example 1, because $A^+ = \{1\}$, $A^- = \{2\}$, and $A^0 = \emptyset$, we have $\underline{u} = \min\{\Gamma^{(1)}_{[u]}(\underline{u}), \Gamma^{(2)}_{[u]}(\overline{u})\}$ and $\overline{u} = \max\{\Gamma^{(1)}_{[u]}(\overline{u}), \Gamma^{(2)}_{[u]}(\underline{u})\}$. Through straightforward calculations (or with the aid of Figure 2), we further obtain $\overline{u} = \Gamma^{(1)}_{[u]}(\overline{u})$ and $\underline{u} = \Gamma^{(2)}_{[u]}(\overline{u})$, and therefore, $\overline{u} = (c_2(1) - c_1(1))/(1 - \delta(p_{11}(1) - p_{21}(1))) \approx 3.8095$ and $\underline{u} = c_2(2) - c_1(2) + \delta(p_{11}(2) - p_{21}(2))\,\overline{u} \approx -3.1714$. In general, $\underline{u}$ and $\overline{u}$ can be determined by comparing various action pairs or through the algorithm presented in Subsection 3.4, because the continuation-payoff frontiers generated by the algorithm, along with their domains, converge to the optimal one.
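
The Remark 1 calculation for Example 1 can be reproduced in a few lines (a sketch that simply evaluates the two closed-form expressions above):

```python
# Reproducing the Remark 1 calculation for Example 1: with A+ = {1} and
# A- = {2}, the upper bound solves u_bar = Gamma^(1)_[u](u_bar) and the lower
# bound is u_low = Gamma^(2)_[u](u_bar).
delta = 0.95
c2 = [2.0, 0.0]              # c_2(a) for a = 1, 2
c1 = [0.0, 1.0]              # c_1(a) for a = 1, 2
dp = [0.8 - 0.3, 0.2 - 0.8]  # p_11(a) - p_21(a) for a = 1, 2

u_bar = (c2[0] - c1[0]) / (1.0 - delta * dp[0])    # ~= 3.8095
u_low = c2[1] - c1[1] + delta * dp[1] * u_bar      # ~= -3.1714
print(u_bar, u_low)
```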
3.2 Finite Policy Graphs
An effective representation of a solution facilitates both the design and implementation of
an algorithm. The policy tree for an infinite-horizon contract consists of an infinite number
of nodes, as illustrated in Figure 1. If cycles are allowed, however, an infinite policy tree
may be reduced to a finite graph, namely a finite policy graph. Due to the cycles, the time
index of a node is meaningless in a finite policy graph and can be replaced by an arbitrary
label. A finite policy graph is made up of the following nodes and branches: (1) each node
has a unique label i from a finite set I and corresponds to a continuation contract σ(i);
(2) n directed branches emanate from each node, each of which corresponds to a submenu
of the continuation contract; (3) in the deterministic case, each branch is attached with
an action-payment pair (ax (i), sx (i)) and points to a unique successor node τx (i); and (4)
in the randomized case, each branch splits into n sub-branches, each of which corresponds
to a realization of the random variable z, is attached with a probability-action-payment
triple (θxz (i), azx (i), szx (i)), and points to a unique successor node τxz (i). A simple two-state
example is illustrated in Figure 3. There are two continuation contracts, σ(1) and σ(2),
intertwined with each other, both consisting of a deterministic submenu and a randomized
14
•••
•••
•••
•••
(a)
(b)
Figure 3: A finite policy graph consisting of two nodes and deterministic and randomized
1
branches.
1
2
1
1
2
2
1
one. For clarity, the branches corresponding to state 1 2are marked by a short bar, and those
(a)
corresponding to state 2 are marked by two bars.
(b)
Each node i of a finite policy graph generates a continuation-payoff pair (u, φ)(i), determined by the structure of the graph in a recursive manner. By expressions (1)-(3), the continuation-payoff component (ux, φx)(i) associated with the xth branch of node i can be computed as follows. In the randomized case,

$$\phi_x(i) = \sum_{z \in Z} \theta_x^z(i) \left\{ r_x(a_x^z(i)) - c_x(a_x^z(i)) + \delta\, p_x(a_x^z(i))\, \phi(\tau_x^z(i)) \right\}, \qquad (17)$$

$$u_x(i) = \sum_{z \in Z} \theta_x^z(i) \left\{ s_x^z(i) - c_x(a_x^z(i)) + \delta\, p_x(a_x^z(i))\, u(\tau_x^z(i)) \right\}, \qquad (18)$$

and in the deterministic case,

$$\phi_x(i) = r_x(a_x(i)) - c_x(a_x(i)) + \delta\, p_x(a_x(i))\, \phi(\tau_x(i)), \qquad (19)$$

$$u_x(i) = s_x(i) - c_x(a_x(i)) + \delta\, p_x(a_x(i))\, u(\tau_x(i)). \qquad (20)$$
A finite policy graph can be used in two ways: to describe a long-term contract, and to
describe a continuation-payoff frontier. As demonstrated in Figure 2, each frontier component φ∗x (·) is piecewise linear and concave and can be described by a set of extreme points
where the slope of the function changes. An extreme point of a frontier component results
in an extreme point of the frontier φ∗ (·). The latter can be represented by a node in a policy
graph, and thus a whole frontier can be represented by a whole policy graph.
As an example, the policy graph for the optimal frontier in Example 1 (Figure 2) is
depicted in Figure 4. It is a deterministic policy graph and a simplified one, consisting of (1) the label i for each node; (2) the agent's relative continuation payoff u(i) and the system's continuation-payoff vector (φ1, φ2)(i) at each node; (3) the corresponding state for each branch, represented by one or two bars; and (4) the associated action at each branch. Payment variables are omitted from the graph. As discussed after the problem formulation (4)-(6), the agent's relative continuation payoffs are more essential. In the algorithm to be presented, the agent's relative continuation-payoff vector associated with any node is computed when the node is created and is kept the same afterward. Those vectors are computed from equation (22) below (free of payment variables) instead of (18) or (20).

Figure 4: The (simplified) optimal policy graph for Example 1. (The graph shows a chain of nodes ···, L4, L3, L2, L1, L0, R0, R1, R2, R3, R4, ···, each annotated with u(i) and (φ1, φ2)(i).)
3.3 Agent's Relative Continuation Payoffs in the Multi-State Case

We generalize the definition of the operator Γ^(a) from the two-state case to the n-state case. Because of the one-degree redundancy in the agent's absolute continuation-payoff sets, it is convenient to replace his absolute continuation-payoff vector u by a relative continuation-payoff vector $\tilde{u}$, defined by $\tilde{u}_x = u_x - u_n$, x = 1, ···, n − 1 (any state other than n may also serve as the reference state). For ease of exposition, we drop the "tilde" symbol from the notation for the agent's relative continuation-payoff vectors and sets. Define the truncated transition-probability vectors $\tilde{p}_x(a) = (p_{x1}(a), p_{x2}(a), \cdots, p_{x,n-1}(a))$, for x ∈ X and a ∈ A. Given a ∈ A, Γ^(a) maps a continuation-payoff-vector pair $(v, \psi) \in \mathbb{R}^{n-1} \times \mathbb{R}^n$ to another pair $(u, \phi) \in \mathbb{R}^{n-1} \times \mathbb{R}^n$ as the following:

$$\phi_x = r_x(a) - c_x(a) + \delta\, p_x(a)\,\psi, \qquad x \in X, \qquad (21)$$

$$u_x = c_n(a) - c_x(a) + \delta\,(\tilde{p}_x(a) - \tilde{p}_n(a))\, v, \qquad x \in X \setminus \{n\}, \qquad (22)$$

which are analogous to equations (13)-(14) in the two-state case.
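
A direct transcription of equations (21)-(22) into code might look as follows; the three-state primitives used to exercise it are hypothetical placeholders.

```python
# The n-state operator Gamma^(a) of equations (21)-(22) as a small function.
import numpy as np

def gamma_a(a, v, psi, r, c, P, delta=0.95):
    """Map (v, psi) in R^{n-1} x R^n to (u, phi) in R^{n-1} x R^n for action a."""
    n = P[a].shape[0]
    phi = r[:, a] - c[:, a] + delta * P[a] @ psi                    # equation (21)
    p_trunc = P[a][:, :n - 1]                                       # truncated rows p~_x(a)
    u = c[n - 1, a] - c[:n - 1, a] + delta * (p_trunc[:n - 1] - p_trunc[n - 1]) @ v  # (22)
    return u, phi

# Hypothetical three-state, two-action primitives.
r = np.array([[1.0, 0.8], [0.9, 1.1], [0.7, 0.6]])
c = np.array([[0.2, 0.5], [0.4, 0.1], [0.3, 0.3]])
P = [np.ones((3, 3)) / 3,
     np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])]
print(gamma_a(1, v=np.array([0.5, -0.2]), psi=np.array([2.0, 1.5, 1.0]), r=r, c=c, P=P))
```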
3.4 Augmenting Finite Policy Graphs
According to the convergence result stated at the end of Section 2, the optimal continuation-payoff frontier can be approached arbitrarily closely by a sequence of continuation-payoff
frontiers. Thus, the corresponding optimal policy graph can be approximated by a sequence
of finite policy graphs. In general, an optimal policy graph contains infinitely many nodes, as
the one in Figure 4. However, there are some special cases in which the optimal policy graphs
are finite. It is straightforward to verify if a finite policy graph is optimal, as described in
the following proposition (which directly follows from the fact that the functional operator
Γ∗ has a unique fixed point).
Proposition 1 If the continuation-payoff frontier determined by a finite policy graph is
invariant under the value-iteration operator Γ∗ , the frontier and the policy graph are both
optimal.
The algorithm presented below is inspired by the policy-iteration approach for solving
Markov decision processes (MDPs) and the finite-state-controller approach for solving partially observable Markov decision processes (POMDPs). It is noted in the literature that
policy iterations converge faster than value iterations for many MDPs (Puterman 1994)
and that finite state controllers outperform value iterations for many POMDPs (Hansen
1998).10,11
¹⁰ Our algorithm is based on finite policy graphs, which are generally history dependent, while the policy-iteration approach is based on stationary deterministic policies, which are sequences of repeated single-period policies. For an infinite-horizon MDP with finite state and action sets, the policy-iteration method can find an optimal policy in finite time; however, there are infinitely many history-dependent deterministic policies in an infinite-horizon adverse selection problem, and therefore finite convergence to an optimal long-term contract is not guaranteed even if randomization is disregarded.

¹¹ More discussions of finite state controllers and analyses of POMDPs can be found in Zhang (2010a).
In preparation for the algorithm, we define some notation. First, in any finite policy graph generated by the algorithm, every node is associated with a unique relative continuation-payoff vector u for the agent and hence can be uniquely identified by a function i(u). Second, for two relative continuation-payoff vectors u and u′ for the agent, we define u ≽x u′ if every feasible solution of the problem (4)-(6) for state x given parameter u′ is also feasible to the problem given parameter u. The algorithm is described next.
Algorithm: Finite Policy Graph Augmentation

1. (Initialization) Define an initial finite policy graph. Determine the extreme points of the corresponding continuation-payoff frontier φ⁻(·), and record their corresponding continuation-agent-payoff vectors in U⁻. Select a precision level ε > 0.

2. (Augmentation) Apply the Γ^(a) operators to the extreme points of φ⁻(·). For each x ∈ X, determine the extreme points of the new frontier component φ⁺x(·), and record their corresponding continuation-agent-payoff vectors in U⁺x. Let U⁺ = ∪x∈X U⁺x. Each u ∈ U⁺ is coupled with a continuation-system-payoff vector φ so that (u, φ) is generated from an extreme point of φ⁻(·) by a certain Γ^(a) operator (if multiple such φ exist, take their component-wise maximum).

(2.a) For each u ∈ U⁺\U⁻, create a new node i(u) with the branch for each state x ∈ X determined as follows. (i) If (u, φx) is an extreme point of the frontier component φ⁺x(·), make a deterministic branch with the action and successor node used to generate (u, φx); (ii) if (u, φx) is dominated by an extreme point (u′, φ⁺x(u′)) in that u ≽x u′ and φx < φ⁺x(u′), copy the xth branch of node i(u′); (iii) if (u, φx) is dominated by a set of extreme points {(u^z, φ⁺x(u^z))}z∈Z in that u = Σz∈Z λz u^z and φx < Σz∈Z λz φ⁺x(u^z) for some λz ≥ 0 and Σz∈Z λz = 1, make a randomized branch that copies the xth branch of node i(u^z) with probability λz.

(2.b) For each u ∈ U⁺ ∩ U⁻, modify the branches of the existing node i(u) as in (2.a).

3. (Truncation) For each u ∈ U⁻\U⁺, if the node i(u) is not used to create any extreme point of any φ⁺x(·), remove the node from the policy graph and delete u from U⁻; otherwise, modify the branches of the node as in (2.a.iii).

4. (Reevaluation) If any pre-existing branch is modified above, recalculate the continuation-system-payoff vectors for all nodes using equation (17) or (19), and update the extreme points of the continuation-payoff frontier φ⁺(·).

5. (Termination) If d(φ⁻(·), φ⁺(·)) ≤ ε, exit. Otherwise, replace U⁻ by U⁻ ∪ U⁺, φ⁻(·) by φ⁺(·), and return to step 2.
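
Step (2.a.iii) requires finding convex-combination weights λz that dominate a candidate point. One way to carry out this test is a small linear program, sketched below with hypothetical data; the tolerance and the particular solver call are implementation choices, not part of the paper's algorithm.

```python
# A sketch of the dominance test behind cases (2.a.ii)-(2.a.iii): search for
# weights lambda_z >= 0 with sum lambda_z = 1, sum_z lambda_z u^z = u, and
# sum_z lambda_z phi_x(u^z) as large as possible.  If the optimum exceeds
# phi_x, the candidate point is dominated and its branch can be rerouted.
import numpy as np
from scipy.optimize import linprog

def dominating_weights(u, phi_x, U_ext, phi_ext, tol=1e-9):
    """U_ext: (m, d) extreme relative-payoff vectors; phi_ext: (m,) their phi_x values."""
    m = len(phi_ext)
    A_eq = np.vstack([np.asarray(U_ext, dtype=float).T, np.ones(m)])
    b_eq = np.append(np.atleast_1d(u).astype(float), 1.0)
    res = linprog(-np.asarray(phi_ext, dtype=float),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    if res.success and -res.fun > phi_x + tol:
        return res.x            # dominated: reroute with these probabilities
    return None                 # not dominated: keep the point as an extreme point

U_ext = np.array([[-1.0], [0.5], [2.0]])      # hypothetical extreme points (d = 1)
phi_ext = np.array([0.2, 1.0, 0.4])
print(dominating_weights(u=1.0, phi_x=0.5, U_ext=U_ext, phi_ext=phi_ext))
```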
We explain the algorithm in more detail below.
The initial finite policy graph can be as simple as a single node with circular branches.
The performance of the algorithm is influenced by the initial graph to some extent, as will
be discussed in the next subsection.
In the augmentation step, we conduct the value iteration once and modify the policy graph. More specifically, we apply the Γ^(a) operators to the extreme points of the current continuation-payoff frontier φ⁻(·), identify the extreme points of the new frontier φ⁺(·), record the corresponding u vectors in U⁺, create a new node for every u ∈ U⁺\U⁻ (which does not exist in the current graph), and modify the existing node for every u ∈ U⁺ ∩ U⁻ (to improve the associated system-payoff vector). The two situations can be treated in the same way, and the xth branch of any node i(u), u ∈ U⁺, is determined from the new frontier component φ⁺x(·). Consider any given x. Notice that each u ∈ U⁺ is coupled with a continuation system payoff φx when generated by a certain Γx^(a) operator from an extreme point of φ⁻(·) (corresponding to a node in the old policy graph). There are three cases about the point (u, φx). (1) It contributes to φ⁺x(·). Then the new branch should identify the action and node (in the old policy graph) that resulted in (u, φx). (2) The point is dominated by a point (u′, φ′x) on φ⁺x(·), typically happening at the tail of φ⁺x(·), as in Figure 2. In such a case, we couple u with the (larger) continuation system payoff φ′x by copying the xth branch of node i(u′). (3) The point (u, φx) is dominated by a convex combination of points {(u^z, φ^z_x)}z∈Z on φ⁺x(·). Then, the new branch should split into n sub-branches, copying the xth branch of nodes {i(u^z)}z∈Z. The new branch is deterministic in the first two cases and randomized in the third.
In the truncation step, we check all remaining nodes in the policy graph created in
previous iterations. If a node is not used directly or indirectly to create any extreme point
of the new continuation-payoff frontier φ+ (·), it will never be useful in the future and can
be safely removed from the graph. Even if the contribution of an existing node is indirect,
it cannot be removed without harming the completeness of the graph. But because such a
node must be dominated by the nodes on the new frontier, an improvement can be made by
rerouting its branches as in (2.a.iii).
In the reevaluation step, we evaluate the continuation-system-payoff vectors for the modified policy graph according to equation (17) or (19). This involves solving a system of
linear equations, which can be efficiently done. This step is only necessary when rerouting
of pre-existing branches has occurred in the previous two steps.
In the termination step, we can use the Hausdorff metric dH (·, ·) or other more convenient
criteria (one of which will be seen in Subsection 4.1) to measure the distance between two
consecutive continuation-payoff frontiers and terminate the algorithm when a given precision
is reached.
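
For the termination test, a discrete Hausdorff-style distance between the extreme-point sets of two consecutive frontiers can serve as d(·,·); the helper below is one possible implementation (the point sets shown are hypothetical).

```python
# A small helper (not from the paper) for the termination check: a discrete
# Hausdorff distance between two frontiers represented by their extreme
# points, treated as finite point sets in (u, phi) space.
import numpy as np

def hausdorff(points_a, points_b):
    """points_a, points_b: arrays of shape (m, k) and (l, k)."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

A = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 1.2]])   # hypothetical extreme points
B = np.array([[0.0, 1.1], [1.0, 1.6], [2.0, 1.25]])
print(hausdorff(A, B))    # compare against the precision level epsilon
```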
Note that if we omit the truncation and reevaluation steps and extend part (a) of the
augmentation step to all u ∈ U + (omitting step 2.b), we obtain a value-iteration algorithm,
which will be referred to as the value-iteration counterpart of the algorithm presented above.
Therefore, the above algorithm can be viewed as an improvement over a value-iteration
algorithm. One disadvantage of the value-iteration counterpart is that the size of the policy
graph may grow much faster than under our algorithm, because there are no replacements
or removals of existing nodes. The performance gap between the two algorithms will be
demonstrated through numerical experiments in Subsection 4.1.
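
To convey what the value-iteration counterpart does, the following is a deliberately crude, grid-based stand-in for Γ∗ in the two-state case of Example 1. It restricts attention to deterministic submenus (so it ignores the randomization that the concave-envelope step would supply) and therefore only approximates the frontier from below; it is meant as an illustration, not as the paper's extreme-point procedure.

```python
# A simplified, grid-based value-iteration step for Example 1, restricted to
# deterministic submenus: for each target u, maximize the system payoff over
# actions a and continuation points v subject to the IC constraint (8)/(11).
import numpy as np

delta = 0.95
R_MC = np.array([[-0.5, 1.0], [0.5, -1.0]])
C    = np.array([[0.0, 1.0], [2.0, 0.0]])
P    = [np.array([[0.8, 0.2], [0.3, 0.7]]), np.array([[0.2, 0.8], [0.8, 0.2]])]
u_low, u_bar = -3.1714, 3.8095          # domain bounds from Remark 1
grid = np.linspace(u_low, u_bar, 201)   # grid of the agent's relative payoff u

def value_iteration_step(phi):
    """phi: array of shape (2, len(grid)) holding the current frontier components."""
    new_phi = np.full_like(phi, -np.inf)
    for a in range(2):
        # Images of the grid points under Gamma^(a): equation (14) and the
        # system-payoff parts of (7)/(10).
        u_img = C[1, a] - C[0, a] + delta * (P[a][0, 0] - P[a][1, 0]) * grid
        val = R_MC[:, a][:, None] + delta * (P[a] @ phi)   # shape (2, len(grid))
        for k, u in enumerate(grid):
            feas1 = u <= u_img            # IC constraint (8) for state 1
            feas2 = u >= u_img            # IC constraint (11) for state 2
            if feas1.any():
                new_phi[0, k] = max(new_phi[0, k], val[0, feas1].max())
            if feas2.any():
                new_phi[1, k] = max(new_phi[1, k], val[1, feas2].max())
    return new_phi

phi = np.zeros((2, len(grid)))
for _ in range(50):
    phi = value_iteration_step(phi)
print(phi[:, len(grid) // 2])   # rough frontier values at the middle grid point
```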
Now, we show that the continuation-payoff frontiers generated by the finite-policy-graph algorithm follow a certain order. Define the hypograph of a function φ† : U → Rⁿ as hypo(φ†(·)) = {(u, φ) ∈ U × Rⁿ : φ ≤ φ†(u)}. We say that two functions φ† : U → Rⁿ and ψ† : V → Rⁿ satisfy φ†(·) ≼ ψ†(·), or ψ†(·) ≽ φ†(·), if hypo(φ†(·)) ⊂ hypo(ψ†(·))—i.e., U ⊂ V and φ†(u) ≤ ψ†(u) for all u ∈ U. The "≼" relation defines a weakly increasing partial order. Equipped with this order, we show that:
Theorem 2 The finite-policy-graph algorithm generates a sequence of finite policy graphs
with weakly increasing continuation-payoff frontiers. Before the algorithm terminates, each
frontier in the sequence strictly improves upon the previous one at some node or has a strictly
larger domain. The algorithm terminates with an optimal policy graph or after a finite
number of iterations. In addition, the algorithm converges faster than its value-iteration
counterpart if rerouting ever occurs.
Proof. Let φ̂k−1(·) denote the continuation-payoff frontier at the end of iteration k−1. At the beginning of iteration k, φ̂k−1(·) is relabeled as φ⁻k(·). At the beginning of the augmentation step, a new frontier φ⁺k(·) is formed following the Γ^(a) operations. Because new extreme points are added to the existing ones, we must have φ⁺k(·) ≽ φ⁻k(·). The remainder of this step ensures that every extreme point of φ⁺k(·) has a corresponding node in the policy graph, by either adding a new node or modifying an existing one. In the latter case, the branches of an existing node are rerouted in a way that the associated continuation-system-payoff vector will be improved in the subsequent reevaluation step. The truncation step has no effect on the continuation-payoff frontier. Thus, the reevaluation step results in a weakly improved frontier φ̂k(·) ≥ φ⁺k(·) if rerouting has occurred; otherwise, the step is skipped, and φ̂k(·) = φ⁺k(·). Thus, φ̂k(·) ≽ φ⁻k(·) = φ̂k−1(·) in any case, and {φ̂k(·)}k=1,2,··· is a sequence of weakly increasing continuation-payoff frontiers.

Suppose the algorithm does not stop after iteration k. If the frontier φ̂k(·) has the same domain as φ̂k−1(·) and is not strictly better than φ̂k−1(·) at any point, then φ̂k(·) ≽ φ̂k−1(·) implies φ̂k(·) = φ̂k−1(·), and the algorithm should have stopped after iteration k, which is a contradiction. Therefore, φ̂k(·) must be strictly better than φ̂k−1(·) at some point or have a strictly larger domain. The algorithm terminates when φ̂k(·) = φ̂k−1(·), in which case φ̂k(·) is the optimal frontier by Proposition 1, or when the distance between φ̂k(·) and φ̂k−1(·) is smaller than ε after a finite number of iterations.

Finally, if rerouting never occurs during the entire procedure, we would have φ̂k(·) = φ⁺k(·) for all k, and the frontier sequence {φ̂k(·)} would be identical to the sequence obtained by the value-iteration counterpart. The latter sequence converges to the optimal frontier from below. If rerouting ever happens, we must have φ̂k(·) ≥ φ⁺k(·) (strictly better at some point) for some k as discussed above, and thus {φ̂k(·)} must converge faster than the sequence obtained by the value-iteration counterpart.
The last part of the theorem suggests that the finite-policy-graph algorithm improves over
its value-iteration counterpart by seizing possible opportunities to reroute existing branches
in the policy graphs.
3.5 Cycles in Finite Policy Graphs and Initialization of the Algorithm
Some interesting features of the optimal continuation-payoff frontier and policy graph for
Example 1 are demonstrated in Figures 2 and 4. One of them is the cycle contained in the
optimal policy graph, referred to as an optimal cycle, which consists of two nodes, R0 and
L0. Such a cycle is an absorbing sink in the graph and can be reached from other nodes.
It helps pin down the optimal frontier: in Figure 2, the two extreme points of φ∗ (·) located
at u(R0) and u(L0) can be directly computed from equations (19) and (22); other extreme
points of the frontier can be obtained subsequently, as will be seen in Figure 6 in Section 4.
An optimal cycle, if it exists, carries important structural information about the optimal solution—given any initial state distribution, the resulting optimal contract will become a cyclical policy after a finite number of periods (when the state and action sets are finite). The possible existence of an optimal cycle reflects the circular structure of the optimal frontier, as discussed after Example 1. More specifically, the central (or pivotal) part of a component function φ∗x(·) (the part between u(R0) and u(L0) in Figure 6) is formed from the convex hull of the central parts of the intermediate functions {φx^(a)(·)}a∈A, while the central part of an intermediate function φx^(a)(·) is a weighted average of those of the component functions (with a shrunk domain, by a factor of δ(p11(a) − p21(a)) in the two-state case). Thus, the central part of φ∗(·) must be generated from itself, which corresponds to a cycle in the policy graph.
A rigorous proof of the general existence of an optimal cycle is left for future research, but in
some special cases such as the two-state-two-action case of Example 1 and the ICFB scenario
discussed next, a unique optimal cycle can be identified analytically.
When the state is the agent’s private information, a first-best solution is normally suboptimal for the principal (due to the high information rent yielded to the agent) but it
may still be incentive compatible. We call such a scenario the incentive-compatible first-best
(ICFB) scenario, in which the first-best system efficiency can be achieved over a non-empty
set of the agent’s continuation payoffs. This scenario includes an important special case
called “private values” in which the private state (or agent type) does not affect the principal’s
one-period reward function—i.e., rx (a) can be simplified to r(a). Many interesting adverse
selection problems fall in this category. For a problem with ICFB, the cycle in the optimal
policy graph can be directly constructed, as shown in Zhang (2010b). Example 2 in Section
4, illustrated by Figures 8 and 9, belongs to this case (but not of private values).
Clearly, to describe an infinite-horizon contract by finitely many nodes, a finite policy
graph must contain a cycle so that every node in the graph corresponds to a legitimate
continuation contract (with an infinite future). Thus, the finite-policy-graph algorithm can
naturally be initialized with a cycle. The performance of the algorithm may be affected by
the choice of the initial graph. As demonstrated by the two examples in Section 4, if started
with an optimal cycle, the algorithm would simply add new nodes to the policy graph in
subsequent iterations, without modifying the existing ones. Nevertheless, an optimal cycle
may not be easy to identify in general. As a remedy, we may include multiple cycles in the
initial policy graph, such as singleton nodes and node pairs (in the two-state case), which
need not be all connected. These initial nodes may be removed by the algorithm in later
iterations, but the benefit of including a true optimal cycle early on can be substantial. This
initial treatment can be viewed as a heuristic in general, and whether an optimal cycle exists
or not does not affect the validity of the algorithm.
3.6 Constructing Optimal Contracts
After an optimal policy graph is obtained or approximated by the algorithm, an optimal
long-term contract with respect to a given initial state distribution can be constructed conveniently. For simplicity, we discuss the two-state case, but the process can be generalized to the multi-state case.

Figure 5: The principal's problem at time 1: (a) $\underline{u} < 0 \le \overline{u}$; (b) $0 \le \underline{u} < \overline{u}$.
Given the continuation-payoff frontier $\phi^*(\cdot)$, the principal solves the following problem at time 1: $\max_{u \in U}\{\beta[\phi^*(u) - u] : u \ge 0\}$. This formulation is based on the agent's absolute continuation-payoff vector $u = (u_1, u_2)$, while the continuation-payoff frontier obtained through the algorithm is based on his relative continuation payoff $u = u_1 - u_2$—i.e., in the form of $\phi^*(u_1 - u_2)$. Thus, the problem can be rewritten as $\max_{u \in U}\{\beta[\phi^*(u_1 - u_2) - u] : u \ge 0\}$. At an optimal solution $u^*$, one of its coordinates must be zero, and thus $u^*$ must take the form $(u, 0)$ for some $u \ge 0$ or $(0, -u)$ for some $u \le 0$. Then the problem can be transformed into $\max_{u \in \tilde{U}}\{\beta\phi^*(u) - \rho(u)\}$, where $\tilde{U}$ is the agent's relative continuation-payoff set and

$$\rho(u) = \begin{cases} \beta_1 u, & u \ge 0, \\ -\beta_2 u, & u \le 0. \end{cases}$$

The terms $\beta\phi^*(\cdot)$ and $\rho(\cdot)$ in the objective function represent the expected system payoff and expected information rent, respectively, depending on the agent's relative continuation payoff u. Thus, the objective of this problem reflects the tradeoff between system-efficiency maximization and information-rent extraction. Various parts of the objective function are demonstrated in Figure 5: panel (a) illustrates the case in which the agent's relative continuation-payoff set $[\underline{u}, \overline{u}]$ contains zero, and panel (b) illustrates the case in which the entire interval lies in the positive half of the u axis. The third case, in which the interval lies in the negative half of the axis, is symmetric to case (b).
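
The time-1 trade-off can be resolved by checking finitely many candidates: since βφ*(u) is piecewise linear and concave and ρ(u) is piecewise linear and convex with its only kink at zero, the maximum is attained at an extreme point of the frontier or at u = 0 (when zero lies in the interval, as in panel (a) of Figure 5). The sketch below does exactly that with hypothetical extreme points.

```python
# A sketch of the principal's time-1 problem: maximize beta*phi*(u) - rho(u)
# by evaluating the objective at the frontier's extreme points and at u = 0.
# The extreme points below are hypothetical placeholders.
import numpy as np

beta = np.array([0.6, 0.4])                          # initial state distribution
u_ext   = np.array([-3.2, -1.7, 1.2, 3.8])           # hypothetical extreme u's
phi_ext = np.array([[2.20, -0.90], [2.05, -0.15],    # hypothetical phi*(u) values
                    [1.18,  0.70], [-0.10,  0.85]])

def phi_star(u):                                     # piecewise-linear frontier
    return np.array([np.interp(u, u_ext, phi_ext[:, 0]),
                     np.interp(u, u_ext, phi_ext[:, 1])])

def rho(u):                                          # expected information rent
    return beta[0] * u if u >= 0 else -beta[1] * u

candidates = list(u_ext) + [0.0]
best = max(candidates, key=lambda u: beta @ phi_star(u) - rho(u))
print(best, beta @ phi_star(best) - rho(best))
```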
For any initial state distribution β, only a portion of the optimal frontier or optimal policy
graph is needed. Given any β, an optimal contract can be constructed by traversing the
optimal policy graph, starting from the node corresponding to u∗ which solves the principal’s
time-1 problem. If that node is created during the policy graph augmentation procedure, the
task is over. This fact may be incorporated into the termination criterion of the algorithm.
4 Two-State Examples
In this section, we demonstrate some key characteristics of the algorithm and the optimal
continuation-payoff frontier through two-state examples. A two-state version of the finite-policy-graph algorithm is implemented in MATLAB. For the purpose of this paper, we do
not go into the details of the implementation, but it is understood that the performance of a
sophisticated algorithm like ours is heavily influenced by the details of the algorithm design.
A full-scale implementation of the algorithm and thorough numerical study are left for future
research.12 We show two examples below, starting with Example 1 introduced in Section
3. To minimize the overlap, the emphasis of the first example is on the performance of the
algorithm, and that of the second is on the structure of the optimal frontier and contracts.
More examples can be obtained from the author.
4.1 Example 1 Revisited
In this example, the optimal policy graph contains a unique optimal cycle, consisting of
nodes R0 and L0 in Figure 4. If this cycle is chosen as the initial policy graph, the finite-policy-graph algorithm will add node R1 in the first iteration, nodes R2 and L1 in the second
iteration, nodes R3 and L2 in the third iteration, and so on. No rerouting or truncating
12
The implementation of the algorithm can be greatly facilitated by some existing algorithms for
solving standard computational geometry problems, such as the point-inquiry problem, the convex-hull problem, and the projection problem. There are efficient algorithms for solving these problems in the two-state case.
For instance, we used the heap sort algorithm to sort a set
of the agent’s relative continuation payoffs (based on the pseudocode available at the website
http://en.wikipedia.org/wiki/Heapsort#cite_note-1) and Andrew’s monotone-chain convex-hull algorithm
to identify the extreme points of the continuation-payoff frontiers (based on the C++ code provided at
the website http://softsurfer.com/Archive/algorithm_0109/algorithm_0109.htm). For the multi-state case,
more sophisticated algorithms are needed.
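For illustration only, the upper-hull half of Andrew's monotone-chain algorithm, which is all that is needed to extract the extreme points of a concave frontier component from a set of candidate points, can be sketched in MATLAB as follows (the candidate points below are made up; this is not the paper's implementation).

    u = [0.2 1.0 0.5 0.0 0.8 0.3];      % made-up candidate relative payoffs
    v = [0.9 0.1 1.0 0.5 0.6 0.7];      % made-up frontier values at those payoffs
    [u, order] = sort(u);  v = v(order); % sort candidates by u
    hull = [];                           % indices (into sorted arrays) of upper-hull points
    for i = 1:numel(u)
        while numel(hull) >= 2
            j = hull(end-1);  k = hull(end);
            % pop the last kept point if it lies on or below the chord j -> i
            if (u(k)-u(j))*(v(i)-v(j)) - (v(k)-v(j))*(u(i)-u(j)) >= 0
                hull(end) = [];
            else
                break
            end
        end
        hull(end+1) = i;
    end
    extremeU = u(hull);  extremeV = v(hull);   % extreme points of the upper envelope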
Figure 6: Continuation-payoff frontiers for Example 1 at the end of the first four iterations.
is needed in the process, and the iterations reduce to mere value iterations. The resulting
continuation-payoff frontiers at the end of the first four iterations are illustrated in Figure 6.
If the optimal cycle is unknown at the beginning, we can initialize the algorithm with
an arbitrary small graph. For instance, if we start with a singleton node based on action
1 (both branches from the node are circular and are associated with action 1), then the
first 10 continuation-payoff frontiers generated by the algorithm are depicted in Figure 7(a).
Figure 7(b) shows the first 50 frontiers generated by the value-iteration counterpart of the
algorithm.
The figures show that the frontiers generated by value iterations converge at a steady
speed, while those generated by policy graph augmentations (policy iterations) may advance in large leaps. This seems to be a general phenomenon not restricted to a particular example. We
also observe that the last frontier in Figure 7(a) dominates that in Figure 7(b). To compare
the outputs of the two algorithms more precisely, we compare them with the optimal frontier
illustrated in Figure 2.
Figure 7: Continuation-payoff frontiers for Example 1 from an initial node based on action
1, through: (a) finite policy graph augmentations, and (b) value iterations.
Because the continuation-payoff frontiers after the first few iterations are roughly parallel
with one another, the distance between two frontiers can be conveniently measured by the
distance between their “middle” points (in terms of the agent’s relative continuation payoffs).
In this example, the middle points are located at um ≈ 0.3191. The corresponding optimal
continuation-system-payoff vector is φ∗ (um ) ≈ (1.2655, 0.6750), as in Figure 2. Table 1
reports the results of several experiments started from a single node based on action 1, and
Table 2 records the results started from an initial policy graph consisting of two independent
nodes based on actions 1 and 2, respectively. The tables consist of the following (groups of)
columns: (1) the type of algorithm ("p" for policy iteration, "v" for value iteration); (2) the number
of iterations performed; (3) the total run time;13 (4) the continuation-system-payoff vector
φ(um ) at the middle of the last frontier in the experiment; (5) the absolute difference between
φ(um ) and the optimal continuation-system-payoff vector, i.e., ∆φ(um ) = φ∗ (um ) − φ(um );
and (6) the relative difference, defined as [∆φ1(um) + ∆φ2(um)] / [φ∗1(um) + φ∗2(um)].
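As a worked check of this definition (illustration only), the first row of Table 1 together with φ∗(um) ≈ (1.2655, 0.6750) gives:

    relDiff = (0.0034 + 0.0034) / (1.2655 + 0.6750)   % approximately 0.0035, as reported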
The two sets of experiments generate the same qualitative results, confirming the prediction of Theorem 2. In the first set of experiments, it takes roughly 140 value iterations to beat the output precision of 10 policy iterations, and the run time is at least 50 times longer. The results of the second set of experiments are similar.
13
Recorded on a computer with a 1.60 GHz Pentium M processor and 512 MB RAM.
Table 1: Experiments Started from a Single Node Based on Action 1

Alg.  Iter.  Time (s)  φ1(um)  φ2(um)  ∆φ1(um)  ∆φ2(um)  Rel. Diff.
p     10     1.88      1.2621  0.6716  0.0034   0.0034   0.0035
v     60     29.16     1.1387  0.5482  0.1268   0.1268   0.1307
v     80     51.72     1.2201  0.6295  0.0454   0.0454   0.0468
v     100    68.79     1.2492  0.6587  0.0163   0.0163   0.0168
v     120    91.97     1.2597  0.6691  0.0058   0.0058   0.0060
v     140    119.44    1.2634  0.6729  0.0021   0.0021   0.0022

Table 2: Experiments Started from Two Separate Nodes Based on Actions 1 & 2

Alg.  Iter.  Time (s)  φ1(um)  φ2(um)  ∆φ1(um)  ∆φ2(um)  Rel. Diff.
p     10     2.17      1.2646  0.6741  0.0009   0.0009   0.0009
v     60     32.97     1.2201  0.6296  0.0454   0.0454   0.0467
v     80     52.75     1.2492  0.6587  0.0163   0.0163   0.0168
v     100    74.30     1.2597  0.6692  0.0058   0.0058   0.0060
v     120    92.88     1.2634  0.6729  0.0021   0.0021   0.0022
Although the run-time numbers from a particular implementation of the algorithm should not be taken literally, the qualitative implication is clear: the finite-policy-graph algorithm takes far fewer iterations and much less time than its value-iteration counterpart to achieve the same level of precision.
We briefly comment on the optimal continuation-payoff frontier of this example. Recall that the agent's one-period costs and the system's one-period payoffs are given by C = (cx(a))x∈X,a∈A = [0 1; 2 0] and R − C = (rx(a) − cx(a))x∈X,a∈A = [−0.5 1; 0.5 −1],
respectively. The action plan that maximizes the expected system payoff is given by a∗1 = 2
and a∗2 = 1, and the one that minimizes the expected agent cost is given by a1 = 1 and
a2 = 2 (the claim still holds if long-run payoffs are considered, taking state transitions into
account). There is clearly a severe conflict of interests between the agent and the system.
As a result, in the second-best scenario when the agent’s incentive issue must be addressed,
the optimal continuation-payoff frontier bears a severe loss of efficiency, compared with the
first-best continuation-system-payoff vector φF B ≈ (13.0594, 12.6027).
Figure 8: The optimal continuation-payoff frontier for Example 2.
4.2
Example 2
The next example possesses the ICFB property discussed in Subsection 3.5. Its optimal solution exhibits markedly different characteristics from the example above.
Example 2 Consider a model with two states, two actions, and the following parameters: δ = 0.9, C = [0 0; 2.5 1.4], R − C = [0.5 1; −1 −0.5], P(1) = [0.1 0.9; 0.8 0.2], and P(2) = [0.15 0.85; 0.75 0.25], where each row of C and R − C corresponds to a state, each column corresponds to an action, and each transition matrix P(a) consists of probabilities pxy(a).

The optimal frontier components φ∗x(·) and intermediate functions φx(·) are illustrated in Figure 8, and the optimal policy graph is shown in Figure 9.
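As a numerical sanity check of these parameters (an illustration, not part of the original analysis), the continuation-system-payoff vector generated by always taking action 2, which is the vector attached to the self-circular node 0 in Figure 9, solves V = (r(2) − c(2)) + δ P(2) V and can be computed in MATLAB as:

    delta = 0.9;
    P2    = [0.15 0.85; 0.75 0.25];
    rc2   = [1; -0.5];                   % second column of R - C
    V     = (eye(2) - delta*P2) \ rc2    % approximately (2.55, 1.57)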
In this example, the cost matrix C implies that the agent has the incentive to take action
2 in both states to minimize cost, and the system payoff matrix R−C implies that the system
also prefers action 2 in both states to maximize system payoff (which is still true when long-run payoffs are considered). Thus, the agent's incentive is aligned with the system objective,
and the first-best action plan a∗1 = a∗2 = 2 is incentive compatible. Because the first-best
action plan yields the highest possible continuation-system-payoff vector and it solely involves
action 2, the self-circular node based on action 2 (the node 0 in Figure 9) must be contained
in the optimal policy graph and hence constitutes an optimal cycle. This node corresponds
Figure 9: The optimal policy graph for Example 2.
to the agent’s relative continuation payoff u(0) ≈ 0.9091 and the aforementioned “central
part” of the optimal frontier φ∗ (·) in Subsection 3.5. The rest of φ∗ (·) can be determined by
expanding from this center. First, by the definition of intermediate functions, the peak of φ1^(1)(·), located at u(R1), is mapped from the peak of φ∗(·) and becomes an extreme point of the frontier component φ∗1(·), by the properties of φ∗1(·); the peak of φ2^(1)(·), also located at u(R1), is dominated by φ∗2(u(0)). Thus, the node R1 in the optimal policy graph and the part of φ∗(·) between u(0) and u(R1) are determined as in Figures 8 and 9. Then, we can show that the second-highest extreme point of φ2^(2)(·) is mapped from the second-highest extreme point of φ∗(·), located at u(R1).14 Thus, the node L1 in the optimal policy graph and the
part of φ∗ (·) between u(0) and u(L1) are determined. Repeating this process, we obtain the
optimal continuation-payoff frontier and policy graph. If the algorithm is initialized with the
node 0, it will add nodes R1, L1, R2, L2, R3, L3, ..., one in each iteration.
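A small worked check (illustration only): since node 0 prescribes action 2 in both states and circles back to itself, u(0) is the fixed point of u = c2(2) − c1(2) + δ(p11(2) − p21(2)) u, that is,

    delta = 0.9;
    u0 = (1.4 - 0) / (1 - delta*(0.15 - 0.75))   % = 1.4/1.54, approximately 0.9091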
Now, we derive the optimal contracts. The agent’s relative continuation-payoff set can
be found to be Ũ ≈ [0.0758, 2.4523].15 Following the discussion in Subsection 3.6, the
14
Because p11(2) − p21(2) < 0, by equations (13)-(14), φ2^(2)(·) is a weighted average of φ∗1(·) and φ∗2(·), flipping at u(0). Thus, a point of φ2^(2)(·) on the left of u(0) is mapped from a point of φ∗(·) on the right of u(0).
15
The bounds can be determined from equations (15)-(16). Because A− = {1, 2} and A+ = A0 = ∅, we have u = min{Γ[u]^(1)(u), Γ[u]^(2)(u)} and ū = max{Γ[u]^(1)(u), Γ[u]^(2)(u)}. Figure 8 suggests that u = Γ[u]^(2)(u) and ū = Γ[u]^(1)(u), i.e., u = c2(2) − c1(2) + δ(p11(2) − p21(2))ū and ū = c2(1) − c1(1) + δ(p11(1) − p21(1))u. Solving the two equations simultaneously, we obtain u ≈ 0.0758 and ū ≈ 2.4523.
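Numerically, the two equations form a 2-by-2 linear system; a minimal MATLAB check (illustration only) is:

    delta = 0.9;
    A = [1, -delta*(0.15 - 0.75); -delta*(0.1 - 0.8), 1];   % coefficient matrix for the two bounds
    b = [1.4; 2.5];                                         % c2(2) - c1(2) and c2(1) - c1(1)
    bounds = A \ b                                          % approximately (0.0758, 2.4523)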
Figure 10: The principal’s expected total-profit function π ∗ (u) for various β1 .
principal maximizes her expected continuation payoff at time 1 (or, expected total profit),
i.e., π ∗ (u) = βφ∗ (u)−β1 u = β1 φ∗1 (u)+(1−β1 )φ∗2 (u)−β1 u. The optimal u∗ must be associated
with an extreme point of φ∗ (·) and lie in the interval [u, u(0)] = [0.0758, 0.9091], as can be
seen from Figures 8 and 5(b). It also depends upon the initial state probability β1 . Figure 10
illustrates the function π ∗ (·) in the relevant domain, for different values of β1 . The extreme
point of each π∗(·) associated with u∗ is indicated by an arrow. An optimal contract can
be constructed by traversing the policy graph from the node corresponding to u∗ . The path
always leads to the optimal cycle in finite time, which implies that the contract converges to
a first-best contract in finite time, a general structural result in the ICFB scenario (Zhang
2010b).
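Schematically, the traversal can be sketched in MATLAB as follows (the graph data below are a made-up toy example, not the optimal policy graph of this section): each node stores an action and a successor node for each reported state, and the path is followed from the node associated with u∗.

    action   = [2 2; 1 2; 2 2];      % action(n, x): action at node n when state x is reported (toy data)
    nextNode = [1 1; 1 3; 2 1];      % nextNode(n, x): successor of node n after report x (toy data)
    reports  = [2 1 2 2 1];          % a sample sequence of state reports
    node = 3;                        % start from the node associated with u*
    for t = 1:numel(reports)
        x = reports(t);
        fprintf('period %d: state %d, action %d\n', t, x, action(node, x));
        node = nextNode(node, x);    % move along the branch for the reported state
    end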
An optimal contract reflects the trade-off between system-efficiency maximization and
information-rent extraction. A first-best policy achieves the highest possible system efficiency, βφ∗ (u), but can only be sustained at u(0) ≈ 0.9091, which represents a substantial
information rent in state 1 (from Subsection 3.6, a relative continuation payoff u ≥ 0 corre-
sponds to an absolute continuation-payoff vector u = (u, 0)). Thus, the principal’s expected
total profit βφ∗ (u) − β1 u is not maximized at u(0). According to Figures 8 and 9, the
extreme point of φ∗ (·) on the left of u(0) (i.e., at u(L1)) corresponds to a contract that
attaches two periods of inefficient action plans in front of the first-best policy. Compared
with the first best, this new contract reduces the agent’s information rent at the expense of
system efficiency—both β1 u and βφ∗ (u) are reduced. The trade-off can be translated into
the comparison between the slopes of β1 u and βφ∗ (u). Viewed along the −u direction, the
former slope measures the marginal rent reduction, which is constant at β1 , and the latter
measures the marginal efficiency loss, which increases as more inefficient periods are attached
to the front of the contract. An optimal contract equalizes these two marginal values.
The above interpretation also helps examine the impact of the initial state distribution on
the optimal contracts. In this example, because the slope of φ∗1 (·) is always 0 in the relevant
domain, the slope of βφ∗ (·) is given by the slope of β2 φ∗2 (·). Thus, we are in fact comparing
the latter slope to β1 , or the slope of φ∗2 (·) to β1 /β2 . If we increase β1 (and hence β1 /β2 ),
the optimal u∗ will move toward the left, away from u(0), as shown in Figure 10. Intuitively,
when β1 increases, the information rent in state 1 becomes more salient to the principal,
and a less efficient contract that gives up less information rent in state 1 may become more
desirable.
5
Conclusion
In this paper, we presented an algorithm to solve an infinite-horizon adverse selection problem, based on the finite-policy-graph representation of long-term contracts and continuation-payoff frontiers. The algorithm augments the policy graph through value iterations and
exploits possible improvements in the structure of the policy graph, which may lead to substantial gains. The finite-policy-graph representation and the algorithm not only offer a
numerical solution to the problem but also facilitate the exploration of the structure of optimal contracts, as demonstrated by the two examples.
In this paper, we considered pure adverse selection (with unobservable state and observable action) and assumed a risk-neutral agent. The algorithm is based on an APS-style recursion. Interestingly, continuation-value recursion is also applicable to moral hazard problems,
often with risk-averse agents (e.g., Spear and Srivastava 1987, and Phelan and Townsend
1991). Thus, from the perspective of the general solution approach, adverse selection and
moral hazard problems are closely related. For instance, Fernandes and Phelan (2000) study
both types of problems and Doepke and Townsend (2006) analyze a hybrid adverse selection and moral hazard problem. The finite-policy-graph algorithm presented in the paper is
not fundamentally restricted to the pure adverse selection setting with a risk-neutral agent.
In theory, the key step of the algorithm—i.e., replacing dominated points on the previous continuation-value frontier with the points on the new frontier and recalculating the
new frontier—can be applied to any finite-state finite-action principal-agent problem with a
continuation-value recursion. Extending the algorithm to other principal-agent problems is
a promising direction for future research.
As mentioned in the introduction, there exist different approaches for solving dynamic
principal-agent problems in the literature. It is an important future research topic to compare
the performances of the finite-policy-graph algorithm and the existing algorithms.
Acknowledgements
The author thanks Kenneth L. Judd and two anonymous referees for their thorough and
constructive reviews, which helped improve the paper significantly, and Garrett J. van Ryzin
for his valuable inputs on this paper.
References
[1] Abreu, D., D. Pearce, and E. Stacchetti. 1990. Towards a theory of discounted repeated
games with imperfect monitoring. Econometrica 58 1041–1064.
[2] Battaglini, M. 2005. Long-term contracting with Markovian consumers. American Economic Review 95 637–658.
[3] Battaglini, M., and S. Coate. 2008. Pareto efficient income taxation with stochastic
abilities. Journal of Public Economics 92 844–868.
[4] Bolton, P., and M. Dewatripont. 2005. Contract Theory. MIT Press, Cambridge, MA.
[5] Cole, H., and N. Kocherlakota. 2001. Dynamic games with hidden actions and hidden
states. Journal of Economic Theory 98 114–126.
[6] Doepke, M., and R. M. Townsend. 2006. Dynamic mechanism design with hidden income
and hidden actions. Journal of Economic Theory 126 235–285.
[7] Fernandes, A., and C. Phelan. 2000. A recursive formulation for repeated agency with
history dependence. Journal of Economic Theory 91 223–247.
[8] Fudenberg, D., and J. Tirole. 1991. Game Theory. MIT Press, Cambridge, MA.
[9] Hansen, E. A. 1998. An improved policy iteration algorithm for partially observable
MDPs. Advances in Neural Information Processing Systems 10 (NIPS-97), MIT Press, Cambridge, MA, 1015–1021.
[10] Howard, R. 1960. Dynamic Programming and Markov Processes. Technology Press-Wiley, Cambridge, MA.
[11] Judd, K. L., S. Yeltekin, and J. Conklin. 2003. Computing supergame equilibria. Econometrica 71 1239–1254.
[12] Kapicka, M. 2008. Efficient allocations in dynamic private information economies with
persistent shocks: A first-order approach. Working paper, University of California, Santa
Barbara.
[13] Laffont, J.-J., and D. Martimort. 2002. The Theory of Incentives: The Principal-Agent
Model. Princeton University Press, Princeton, NJ.
[14] Myerson, R. B. 1986. Multistage games with communication. Econometrica 54(2) 323–
358.
[15] Phelan, C., and R. M. Townsend. 1991. Computing multi-period, information-constrained optima. Review of Economic Studies 58 853–881.
[16] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
[17] Salanie, B. 1997. The Economics of Contracts. MIT Press, Cambridge, MA.
[18] Sleet, C. 2001. On credible monetary policy and private government information. Journal of Economic Theory 99 338–376.
[19] Sleet, C., and S. Yeltekin. 2001. Dynamic labor contracts with temporary layoffs and
permanent separations. Economic Theory 18 207–235.
[20] Sleet, C., and S. Yeltekin. 2007. Recursive monetary policy games with incomplete
information. Journal of Economic Dynamics & Control 31 1557–1583.
[21] Spear, S. E., and S. Srivastava. 1987. On repeated moral hazard with discounting.
Review of Economic Studies 54 599–617.
[22] Tchistyi, A. 2006. Security design with correlated hidden cash flows: The optimality of
performance pricing. Working paper, New York University.
[23] Zhang, H. 2010a. Partially observable Markov decision processes: A geometric technique
and analysis. Operations Research 58(1) 214–228.
[24] Zhang, H. 2010b. Structural analysis of a dynamic adverse-selection model. Working
paper, University of Southern California, Los Angeles, CA.
[25] Zhang, H., M. Nagarajan, and G. Sosic. 2010. Dynamic supplier contracts under asymmetric inventory information. Operations Research, forthcoming.
[26] Zhang, H., and S. Zenios. 2008. A dynamic principal-agent model with hidden information: Sequential optimality through truthful state revelation. Operations Research
56(3) 681–696.