Solving a Dynamic Adverse Selection Model Through Finite
Policy Graphs
Hao Zhang
Marshall School of Business, University of Southern California, Los Angeles, CA 90089
Abstract
This paper studies an infinite-horizon adverse selection model with an underlying Markov
information process and a risk-neutral agent. It introduces a graphic representation of continuation contracts and continuation-payoff frontiers, namely finite policy graphs, and provides
an algorithm that generates a sequence of such graphs to approximate the optimal policy
graph. The algorithm performs an additional step after each value iteration—replacing dominated points on the previous continuation-payoff frontier by points on the new frontier and
reevaluating the new frontier. This dominance-free reevaluation step accelerates the convergence of the continuation-payoff frontiers. Numerical examples demonstrate the effectiveness
of this algorithm and properties of the optimal contracts.
1 Introduction
The principal-agent model with hidden information (or adverse selection) provides a powerful
tool for analyzing bilateral interactions tangled with asymmetric information. The literature
on single-period adverse selection problems is vast (see e.g., the textbooks by Fudenberg and
Tirole 1991, Laffont and Martimort 2002, and Bolton and Dewatripont 2005). There is
also a large literature on multi-period adverse selection problems, but the majority of it
focuses on settings in which the hidden information is either constant or independent across
time periods (see the examples and references in Salanie 1997 and Bolton and Dewatripont
2005). Because the intriguing dynamics are assumed away from the information structure,
the results and implications obtained from these models need not extend to more general
settings. In recent years, Markov information structures, under which the private information
follows a Markov process or Markov decision process, have attracted increasing attention,
such as the endowment process in Fernandes and Phelan (2000), general state process in Cole
and Kocherlakota (2001), consumer preference process in Battaglini (2005), income process
in Doepke and Townsend (2006), productivity process in Kapicka (2008), and inventory
process in Zhang, Nagarajan, and Sosic (2010), to name a few.
In this paper, we study an infinite-horizon adverse selection model with an underlying
Markov decision process and a risk-neutral agent, which is a counterpart of the finite-horizon
model studied in Zhang and Zenios (2008). We discuss two applications of the model next.
First, consider a simple supply chain with asymmetric inventory information. A monopolistic
supplier sells a product to a retailer in multiple periods, and the retailer stocks inventory to
hedge against demand uncertainty. In each period t, the retailer observes its initial inventory
level xt and orders quantity qt from the supplier; the (random) demand is realized at the end
of the period and excess inventory is carried over to the next period. The inventory process
is a Markov decision process with transition probabilities p(xt+1 |xt , qt ) determined by the
demand distribution. The supplier cannot observe the retailer’s inventory although it must
be taken into account when designing the contract. The short-term contracting version of
this problem (in which a one-period contract is offered by the supplier in every period) is
studied in Zhang, Nagarajan and Sosic (2010), while the long-term contracting version fits
into the scope of this paper. Second, consider a dynamic pricing problem with changing
customer types. A firm sells a non-durable product (or service) in multiple periods. Each
customer has a type θt that affects his or her utility in period t and evolves according to a
Markov decision process, with transition probabilities p(θt+1 |θt , qt ), where qt is the purchasing
quantity (or quality) in period t. The firm does not observe customer types yet strives to
maximize its expected profit through a dynamic pricing mechanism. A two-state case of
this problem is studied in Battaglini (2005), with emphasis on the structure of the optimal
contracts. In contrast, the focus of this paper is on the numerical solution of the general
model.
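
To make the first application concrete, the following sketch (not from the paper; the demand support, probabilities, and capacity cap are hypothetical) shows how the retailer's inventory transition probabilities p(x_{t+1} | x_t, q_t) could be generated from a discrete demand distribution:

```python
# A minimal sketch (hypothetical data) of the inventory application's
# transition probabilities p(x_{t+1} | x_t, q_t) induced by a discrete
# demand distribution.
import numpy as np

demand_values = np.array([0, 1, 2, 3])          # hypothetical demand support
demand_probs  = np.array([0.1, 0.4, 0.3, 0.2])  # hypothetical probabilities
max_inventory = 5                               # hypothetical cap, keeps the state set finite

def transition_probs(x, q):
    """Return p(x_{t+1} = y | x_t = x, q_t = q) for y = 0, ..., max_inventory.

    Excess inventory is carried over: x_{t+1} = max(x + q - D, 0), capped at
    max_inventory so that the state set stays finite.
    """
    probs = np.zeros(max_inventory + 1)
    for d, pd in zip(demand_values, demand_probs):
        y = min(max(x + q - d, 0), max_inventory)
        probs[y] += pd
    return probs

print(transition_probs(x=2, q=1))   # one row of the transition matrix P(q)
```

Each call returns one row of the transition matrix P(q), which is all that the Markov information structure of Section 2 requires.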
In spite of many potential applications, dynamic adverse selection problems with Markov
transitions remain relatively under-represented in the literature, partly due to the technical
complexity of such problems. The optimal contracts tend to be history dependent, and
finding such a contract is computationally costly—a phenomenon known as the “curse of
dimensionality.” In this paper, we present an algorithm to find optimal long-term contracts
(under full commitment of the principal). The algorithm differs from the existing methods
in the literature in a significant way, as summarized below.
The main methodology for tackling a dynamic adverse selection problem originates
from Abreu, Pearce, and Stacchetti (1990) (APS, hereinafter) on repeated multi-agent games
with imperfect monitoring. Under the APS approach, the set of continuation payoffs in equilibrium is recursively defined through a functional operator, and the equilibrium payoff set
can be approximated by a sequence of continuation-payoff sets generated by the operator.
This approach is applicable to a wide range of dynamic games, but approximating the equilibrium payoff set is often computationally challenging. A common remedy is to discretize the
continuation-payoff sets, as in Doepke and Townsend (2006) on a hybrid adverse selection
and moral hazard problem between a social planner and a representative agent who has
private income information and can make private investment efforts.¹ Another way of implementing an APS recursion is proposed by Judd, Yeltekin, and Conklin (2003). They bound the equilibrium payoff set by convex polytopes from inside and outside and provide algorithms that generate inner and outer polytopes to approximate the equilibrium payoff set. This approach is applied by Sleet (2001) and Sleet and Yeltekin (2007) to dynamic signaling games, in which a government designs a monetary policy for an economy populated by a continuum of households and firms, and the government has private information on the state of the economy (in the former paper) or on whether it will commit to the socially optimal monetary policy (in the latter paper).

¹ Recursive expression of the continuation-payoff set (or function) is also applicable to dynamic moral hazard problems. Discretizing the continuation-payoff set (or function) is common as well, as in Phelan and Townsend (1991) on a repeated moral hazard problem between a social planner and a continuum of agents with private production efforts, which is generalized by Sleet and Yeltekin (2001) by allowing the planner (or firm) to lay off workers.
In this paper, we first introduce a graphic representation of (infinite-horizon) long-term
contracts and continuation-payoff frontiers, namely finite policy graphs, and then present an
algorithm that generates a sequence of such graphs to approach the optimal policy graph.
We take advantage of an important property of the infinite-horizon model that the optimal
(equilibrium) continuation-payoff set is a fixed point of a functional operator (of the APS
style). In contrast with the aforementioned methods of implementing APS-style recursions,
our algorithm is in the spirit of policy iteration for solving Markov decision processes (Howard
1960) and finite state controller for solving partially observable Markov decision processes
(Hansen 1998), as it strives to improve the structure of the policy graph in each iteration
through rerouting existing branches. More specifically, the algorithm performs an extra step,
namely dominance-free reevaluation, after each value iteration: identifying dominated points
on the previous continuation-payoff frontier, replacing them by those on the newly created
frontier, and recalculating the new frontier through a set of linear equations (determined
by the dominance relations). We show analytically and through numerical experiments that
this algorithm outperforms its value iteration counterpart. It has the additional advantage
of facilitating the exploration of optimal contract structures.
We assume in the model that the agent is risk neutral toward monetary uncertainties.
This assumption is common in the operations research and management science literature,
as the agents are often firms or businesses which have the capacity to bear or transfer some
financial risks; it is not uncommon in the economics literature as well, in which the agents
under consideration are often individual consumers. The two examples presented earlier fit
well in the model studied in the paper. Other examples can be found in Battaglini and Coate
(2008) and Tchistyi (2006).
The remainder of the paper is organized as follows. Section 2 introduces the model and
some basic results. Section 3 presents the graphic representation of long-term contracts and
continuation-payoff frontiers and the algorithm that generates a converging sequence of such
graphs. Section 4 analyzes two numerical examples that demonstrate the main features of
the algorithm and properties of the optimal solution. The last section concludes with future
research suggestions.
2 A Dynamic Adverse Selection Model and Basic Results
In this section, we introduce the dynamic adverse selection model proposed in Zhang and
Zenios (2008) and discuss some basic results of the model. Vectors and matrices will be
denoted by bold letters throughout the paper, e.g., φ, px (a), and P(a).
A Dynamic Adverse Selection Model. At the beginning of the horizon, the principal
makes a take-it-or-leave-it offer to the agent in the form of a long-term contract that covers
T ≤ ∞ periods (henceforth, the principal will be referred to as “she” and the agent as
“he”). If the agent accepts the offer, the contract execution starts. Within each period t,
the following events take place. First, the agent privately observes the state of a Markov
decision process, denoted by xt . The state set is finite and denoted by X = {1, · · · , n}.
Next, the agent takes a public action at and incurs a cost cxt (at ). The action set is also
finite and is denoted by A = {1, · · · , m}. At the end of the period, the principal receives
a reward rxt (at ). She then pays the agent st as specified in the contract, contingent upon
publicly observable and verifiable information. Finally, the hidden state moves to xt+1 , with
transition probabilities $\Pr(x_{t+1}=y \mid x_t=x, a_t=a)$, or simply $p_{xy}(a)$. A row vector $p_x(a) = (p_{x1}(a), \ldots, p_{xn}(a))$ is defined for each $x \in X$ and $a \in A$. The distribution of the initial state $x_1$ is publicly known and is described by probabilities $\beta_{x_1}$ such that $\sum_{x_1 \in X} \beta_{x_1} = 1$. The total discounted payoff for the principal is given by $\sum_{t=1}^{T} \delta^{t-1}\left(r_{x_t}(a_t) - s_t\right)$, and that for the agent is $\sum_{t=1}^{T} \delta^{t-1}\left(s_t - c_{x_t}(a_t)\right)$. The state history $(x_1, \cdots, x_t)$ and action history $(a_1, \cdots, a_t)$ are abbreviated as $x^t$ and $a^t$, respectively. The beginning of period t is referred to as time t.
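
As a quick numerical illustration of this payoff accounting (a minimal sketch with made-up numbers, not taken from the paper), the two discounted sums can be computed as:

```python
# The principal collects sum_t delta^{t-1} (r_{x_t}(a_t) - s_t) and the agent
# collects sum_t delta^{t-1} (s_t - c_{x_t}(a_t)); the sample path below is hypothetical.
delta = 0.95                               # discount factor (symbol from the model)
rewards  = [1.0, 0.8, 1.2]                 # r_{x_t}(a_t) along a hypothetical path
costs    = [0.3, 0.5, 0.2]                 # c_{x_t}(a_t) along the same path
payments = [0.6, 0.7, 0.5]                 # s_t along the same path

principal = sum(delta**t * (r - s) for t, (r, s) in enumerate(zip(rewards, payments)))
agent     = sum(delta**t * (s - c) for t, (s, c) in enumerate(zip(payments, costs)))
print(principal, agent)
```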
The principal’s problem is to design a long-term contract to maximize her expected
total payoff subject to the agent’s incentive compatibility and participation (or individual
rationality) constraints. We assume that the principal can make full commitment not to
renegotiate with the agent during contract execution. As in the standard setting, we assume
that the state xt cannot be inferred (or verified) from the reward rxt (at ).
Revelation Contracts. Although the types of possible long-term contracts are numerous,
there is no loss of generality to focus on the following revelation contracts:²

A dynamic randomized revelation contract defines the following sequence of events in any period t after any public history $(\hat{x}^{t-1}, z^{t-1})$: the agent reports a state $\hat{x}_t$; a public random variable $z_t$ is drawn from the set $Z = \{1, \cdots, n\}$ with probabilities $\theta_t(\hat{x}^t, z^t)$ such that $\sum_{z_t \in Z} \theta_t(\hat{x}^t, z^{t-1}, z_t) = 1$; the agent takes action $a_t(\hat{x}^t, z^t)$; and the principal pays the agent $s_t(\hat{x}^t, z^t)$. The contract is denoted by $\sigma = \{\theta_1(\hat{x}_1, z_1), a_1(\hat{x}_1, z_1), s_1(\hat{x}_1, z_1); \theta_2(\hat{x}^2, z^2), a_2(\hat{x}^2, z^2), s_2(\hat{x}^2, z^2); \cdots\}_{\hat{x}^T \in X^T, z^T \in Z^T}$, or $\sigma_t(\hat{x}^{t-1}, z^{t-1}) = \{\theta_t(\hat{x}^t, z_t), a_t(\hat{x}^t, z_t), s_t(\hat{x}^t, z_t), \sigma_{t+1}(\hat{x}^t, z^t)\}_{\hat{x}_t \in X, z_t \in Z}$ recursively.
The above contract generalizes the familiar static revelation contract to multiple periods and introduces randomization at the beginning of every period, after the state is reported and before the action is taken.³ In the deterministic special case, the contract can be simplified to $\sigma = \{a_1(\hat{x}_1), s_1(\hat{x}_1); a_2(\hat{x}^2), s_2(\hat{x}^2); \cdots\}_{\hat{x}^T \in X^T}$, or $\sigma_t(\hat{x}^{t-1}) = \{a_t(\hat{x}^t), s_t(\hat{x}^t), \sigma_{t+1}(\hat{x}^t)\}_{\hat{x}_t \in X}$ recursively. We call a dynamic revelation contract a truthful revelation contract if it induces the agent to reveal the true state in every period. We call the part of a long-term contract starting from time t (after any public information history) a time-t continuation contract.

² A dynamic revelation principle is first shown by Myerson (1986) for multi-player multi-stage communication games in which the players can freely communicate through a central mediator at the beginning of each stage.

³ Because any point in a facet of an n-dimensional polytope can be expressed as a convex combination of no more than n vertices, the random variable $z_t$ need not take more than n values, and hence it is sufficient to define $Z = \{1, \cdots, n\}$. The above randomized revelation contract differs from the one defined in Zhang and Zenios (2008), in which the randomization is over the set of actions A. It can be shown that the two definitions are equivalent, but the one given here will be more convenient for the algorithm presented in Subsection 3.4. In terms of the total number of variables in the contract, there is no clear winner between these two definitions: if |A| < n, the format in Zhang and Zenios (2008) is more parsimonious than the one given here; if |A| > n, the reverse is true.
The principal's problem can be formulated within the class of truthful revelation contracts. The notation can be simplified as follows: suppressing the history $(\hat{x}^{t-1}, z^{t-1})$, moving $\hat{x}_t$ to the subscript (without the "hat"), moving $z_t$ to the superscript, and removing the index t when it is clear from the context. Thus, a time-t randomized revelation contract can be compactly written as $\sigma_t = \{\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z\}_{x \in X, z \in Z}$. The tuple $(\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z)_{z \in Z}$, given reported state x, is referred to as a submenu for state x.
If the agent reports truthfully under $\sigma_t$, the two parties' and system's expected future payoffs in state x are given by:

$$u_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ s_x^z - c_x(a_x^z) + \delta\, p_x(a_x^z)\, u(\sigma_{t+1,x}^z) \right], \qquad (1)$$

$$\pi_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - s_x^z + \delta\, p_x(a_x^z)\, \pi(\sigma_{t+1,x}^z) \right], \qquad (2)$$

$$\phi_x(\sigma_t) = \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - c_x(a_x^z) + \delta\, p_x(a_x^z)\, \phi(\sigma_{t+1,x}^z) \right], \qquad (3)$$

where $u(\sigma) = (u_x(\sigma))_{x \in X}$, $\pi(\sigma) = (\pi_x(\sigma))_{x \in X}$, and $\phi(\sigma) = (\phi_x(\sigma))_{x \in X}$ are column vectors and pu is the matrix multiplication of a row vector and a column vector. Thus, every time-t
truthful revelation contract σt generates a triple of continuation-payoff vectors (u, π, φ)(σt ).
Because φ(σt ) = u(σt ) + π(σt ), it suffices to concentrate on the pair (u, φ)(σt ).
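
The following sketch evaluates expressions (1)-(3) for a single submenu, assuming the continuation-payoff vectors of the successor contracts are already known; all numerical inputs are hypothetical placeholders.

```python
# A minimal sketch (hypothetical data) of expressions (1)-(3): the agent's,
# principal's, and system's expected payoffs in reported state x, given the
# submenu for x and the payoff vectors of the successor contracts.
import numpy as np

delta = 0.95
theta = np.array([0.6, 0.4])            # theta_x^z, sums to 1
s     = np.array([0.5, 0.7])            # payments s_x^z
c     = np.array([0.3, 0.2])            # costs c_x(a_x^z) of the prescribed actions
r     = np.array([1.0, 0.9])            # rewards r_x(a_x^z)
p     = np.array([[0.8, 0.2],           # transition rows p_x(a_x^z), one per z
                  [0.3, 0.7]])
u_next   = np.array([[2.0, 1.0], [1.5, 0.8]])   # u(sigma_{t+1,x}^z), one row per z
phi_next = np.array([[3.0, 2.5], [2.8, 2.2]])   # phi(sigma_{t+1,x}^z), one row per z

u_x   = np.sum(theta * (s - c + delta * np.einsum('zy,zy->z', p, u_next)))
phi_x = np.sum(theta * (r - c + delta * np.einsum('zy,zy->z', p, phi_next)))
pi_x  = phi_x - u_x                      # since phi = u + pi
print(u_x, pi_x, phi_x)
```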
Policy Trees. A long-term revelation contract can be expressed as a policy tree. Figure
1(a) illustrates a deterministic contract. A branch of the tree corresponds to a reported
state xt and is associated with an action-payment pair (at , st ). Figure 1(b) illustrates a
randomized contract: a branch emanating from a node corresponds to a reported state xt ;
it splits into a collection of sub-branches, each of which corresponds to a realization of the random variable $z_t$ and is associated with a probability-action-payment triple $(\theta_t, a_t, s_t)$.

Figure 1: Policy tree of a long-term revelation contract in (a) the deterministic case, and (b) the randomized case.
Every node of the policy tree has two interpretations. Consider the deterministic case for instance. A node at time t, viewed from the top down, corresponds to a history of the reported states $\hat{x}^{t-1}$, associated with a history of action-payment pairs. Viewed from the bottom up, the node corresponds to a continuation contract $\sigma_t$, associated with a continuation-payoff pair $(u, \phi)$. The bottom-up (or backward) perspective will be the emphasis of this paper.

Continuation-Payoff Frontiers and the Principal's Problem. Given any continuation-payoff vector of the agent $u_t$, there exists a maximum continuation-payoff vector for the system, $\phi^*_t$, attainable by a continuation contract that yields $u_t$ for the agent. This gives rise to the time-t continuation-payoff frontier (or function), $\phi^*_t(u_t) = \sup\{\phi(\sigma_t) : \sigma_t \in \Sigma^{\mathrm{TRC}}_t \text{ and } u(\sigma_t) = u_t\}$, for $u_t \in U_t$, where $\Sigma^{\mathrm{TRC}}_t$ is the set of time-t truthful revelation contracts and $U_t = \{u(\sigma_t) : \sigma_t \in \Sigma^{\mathrm{TRC}}_t\}$ is the time-t continuation-agent-payoff set.⁴,⁵

⁴ We can also define continuation-payoff frontiers in terms of the principal's continuation payoffs (against the agent's), which is more common in the literature. The two types of frontiers are equivalent because they have a one-to-one correspondence. However, analyzing continuation payoffs from the system's perspective is more convenient for us because the payment terms $s_x^z$ are cancelled out in the system payoff expression (3) due to the risk neutrality of the principal and agent. This technical convenience is inessential to the algorithm presented in Subsection 3.4.

⁵ Notice that the notation differs from the standard notation in economics, in which $\phi^*_t(u_t)$ is usually expressed in components (i.e., $\phi^*_{t,x}(u_t)$), denoted by $V_t(u_t, x_t)$ or the like. The notation introduced here emphasizes the multi-dimensional nature of the continuation payoffs for the principal, agent, and system, which is consistent with the emphasis of the vector-valued continuation-payoff frontiers (functions) throughout this paper.
The continuation-payoff frontiers φ∗t : Ut → Rn can be obtained through backward
induction, which is facilitated by the following problem, called the auxiliary planning problem
in the economics literature. The goal of the problem is to find the xth component of the
time-t continuation-payoff frontier, φ∗t,x (·), from the time-(t + 1) frontier φ∗t+1 (·). We refer
to φ∗t,x (·) as a frontier component or component function. If the agent is offered (promised)
a continuation-payoff vector ut , φ∗t,x (ut ) is the maximum continuation payoff for the system
from state x onward, obtained through optimal choices of the randomization probabilities
θxz , actions azx , and the agent’s time-(t + 1) continuation-payoff vectors uzx :
$$\phi^*_{t,x}(u_t) = \max_{\{\theta_x^z \in [0,1],\, a_x^z \in A,\, u_x^z \in U_{t+1}\}_{z \in Z}} \sum_{z \in Z} \theta_x^z \left[ r_x(a_x^z) - c_x(a_x^z) + \delta\, p_x(a_x^z)\, \phi^*_{t+1}(u_x^z) \right] \qquad (4)$$

$$\text{s.t.}\quad u_{t,x'} - u_{t,x} \ge \sum_{z \in Z} \theta_x^z \left\{ c_x(a_x^z) - c_{x'}(a_x^z) + \delta\, [p_{x'}(a_x^z) - p_x(a_x^z)]\, u_x^z \right\}, \quad x' (\ne x) \in X, \qquad (5)$$

$$\sum_{z \in Z} \theta_x^z = 1. \qquad (6)$$
The incentive compatibility (IC) constraints (5) reflect a change of variables, from period-t
payments szx to the agent’s time-t continuation payoffs ut,x (the former can be easily recovered
from the latter, with some redundancy). The constraints can be derived as follows. If the
agent reports a state x truthfully, his continuation payoff at time t would be given by
$u_{t,x} = \sum_{z \in Z} \theta_x^z \{ s_x^z - c_x(a_x^z) + \delta\, p_x(a_x^z)\, u_x^z \}$. If the true state is $x'$ but the agent reports $x$, his continuation payoff would be $u_{t,x|x'} = \sum_{z \in Z} \theta_x^z \{ s_x^z - c_{x'}(a_x^z) + \delta\, p_{x'}(a_x^z)\, u_x^z \} = u_{t,x} + \sum_{z \in Z} \theta_x^z \{ c_x(a_x^z) - c_{x'}(a_x^z) + \delta\, [p_{x'}(a_x^z) - p_x(a_x^z)]\, u_x^z \}$ (recall that the subscripts in $\theta_x^z$, $s_x^z$, and $u_x^z$ refer to the reported state, and the ones in $c_x(a)$ and $p_x(a)$ refer to the true state). Therefore, constraints (5) are in fact $u_{t,x|x'} \le u_{t,x'}$, which prevent the agent from misreporting state $x'$ as $x$. Intuitively, because the payment term $s_x^z$ and cost term $c_x(a_x^z)$ are additively separable in the agent's payoff function, the expected period-t payment $\sum_{z \in Z} \theta_x^z s_x^z$ (given that the agent reports $x$) contributes to the agent's continuation payoff uniformly, independent of the true state. In other words, the difference between $u_{t,x}$ and $u_{t,x|x'}$ is independent of the payments, and thus the agent's gain from misrepresenting $x'$ as $x$, i.e., $u_{t,x|x'} - u_{t,x'}$, can be conveniently translated into $u_{t,x} - u_{t,x'}$.
The problem (4)-(6) only involves the submenu $(\theta_x^z, a_x^z, s_x^z, \sigma_{t+1,x}^z)_{z \in Z}$ for state x ($s_x^z$ and $\sigma_{t+1,x}^z$ are replaced by $u_{t,x}$ and $u_x^z$, respectively), and the constraint set (5) is only a subset
of the IC constraints that define a time-t truthful revelation contract. For any ut ∈ Ut ,
a continuation contract can be formed by combining the n submenus found through the
auxiliary planning problems given ut . The continuation-payoff vector ut promised to the
agent serves as a parameter of the problem (4)-(6). The resulting optimal objective function
φ∗t,x (·) possesses useful properties such as piece-wise linearity, concavity, and monotonicity.
The problem is only feasible for certain values of ut , denoted by set Ut,x , which is endogenously determined by the IC constraints (5). The time-t continuation-agent-payoff set Ut
defined earlier can be obtained by Ut = ∩x∈X Ut,x , because the collection of IC constraints
from all time-t auxiliary planning problems defines Ut exactly.
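
As an illustration of how a candidate submenu can be screened against constraint (5), the sketch below performs the check directly; all primitives and candidate values are hypothetical two-state data.

```python
# A sketch (hypothetical data) of the incentive-compatibility constraints (5):
# given the submenu chosen for reported state x and the promised vector u_t,
# check that u_{t,x'} - u_{t,x} >= sum_z theta_x^z { c_x(a) - c_{x'}(a)
#                                   + delta [p_{x'}(a) - p_x(a)] u_x^z }  for all x' != x.
import numpy as np

delta = 0.95
n = 2                                       # number of states
x = 0                                       # reported state (0-indexed)
u_t = np.array([2.0, 1.2])                  # promised time-t continuation payoffs
theta = np.array([0.7, 0.3])                # theta_x^z, one entry per z
actions = [0, 1]                            # a_x^z (indices into the primitives below)
c = np.array([[0.1, 0.6], [0.8, 0.2]])      # hypothetical costs c[x, a]
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),    # hypothetical transition matrices P(a)
     np.array([[0.2, 0.8], [0.5, 0.5]])]
u_next = np.array([[1.8, 0.9], [1.5, 1.1]]) # u_x^z, one row per z (hypothetical)

for xp in range(n):
    if xp == x:
        continue
    rhs = sum(theta[z] * (c[x, a] - c[xp, a]
                          + delta * (P[a][xp] - P[a][x]) @ u_next[z])
              for z, a in enumerate(actions))
    print(f"IC against misreporting x'={xp} as x={x}:", u_t[xp] - u_t[x] >= rhs)
```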
The problem formulation (4)-(6) implies that $\phi^*_t(u_t) = \phi^*_t(u_t + \lambda \mathbf{1})$ for any $u_t \in U_t$ and $\lambda \in \mathbb{R}$, where $\mathbf{1}$ is a column vector of ones. Thus, the set $U_t$ has one degree of freedom, and it is convenient to focus on the agent's relative continuation payoffs $u_{t,x'} - u_{t,x}$. For any pair of states $x \ne x'$, the IC constraints in the auxiliary planning problems for states $x$ and $x'$ take the forms "$u_{t,x'} - u_{t,x} \ge \cdots$" and "$u_{t,x} - u_{t,x'} \ge \cdots$," respectively, and hence the relative continuation payoffs are bounded from both sides.⁶
After the continuation-payoff frontier $\phi^*_1 : U_1 \to \mathbb{R}^n$ is found, the principal solves a simple problem at time 1: $\max_{u_1 \in U_1} \{\beta(\phi^*_1(u_1) - u_1) : u_1 \ge 0\}$. The participation constraint
u1 ≥ 0 is based on the assumption that the agent knows the initial state x1 when signing
the contract. Participation constraints are unnecessary when finding φ∗t : Ut → Rn because
they can be easily satisfied by transferring payments across periods.
⁶ The dimensional reduction of the agent's continuation-payoff set significantly simplifies the illustration of continuation-payoff frontiers in the two-state case, as evident in Figures 2, 7, and 8. However, this simplification is inessential for the algorithm discussed in Subsection 3.4.
Convergence of Continuation-Payoff Frontiers Over the Infinite Horizon. The
problem (4)-(6) for state x in effect defines a functional operator Γ∗x that maps the vector-valued function φ∗t+1 : Ut+1 → Rn to the scalar-valued function φ∗t,x : Ut,x → R. The problem
can be expressed succinctly as φ∗t,x (·) = Γ∗x φ∗t+1 (·). Each iteration in the backward induction,
from continuation-payoff frontier φ∗t+1 (·) to φ∗t (·), also defines a functional operator (of the
APS style), denoted by Γ∗ and referred to as the value-iteration operator. The iteration can
be conveniently written as φ∗t (·) = Γ∗ φ∗t+1 (·). Clearly, Γ∗ = (Γ∗x )x∈X , where the domain of
Γ∗ φ∗t+1 (·) equals the intersection of the domains of Γ∗x φ∗t+1 (·). It can be shown that:
Given a bounded continuous function φ∗0 : U0 → Rn with a convex domain
U0 ⊂ Rn , the sequence of continuation-payoff functions φ∗k (·) = Γ∗ φ∗k−1 (·) converges to a unique bounded continuous function φ∗∞ : U∞ → Rn . That is, φ∗∞ (·)
is the unique fixed point of the operator Γ∗ .
In other words, over the infinite horizon, the continuation-payoff frontiers are identical in
all periods, as given by φ∗∞ (·) and referred to as the optimal (continuation-payoff ) frontier.7
The convergence occurs at two levels: the sequence of domains Uk converges to U∞ (under
the Hausdorff metric), and the sequence of functions φ∗k (·) converges to φ∗∞ (·).8 In the
remainder of this paper, we will concentrate on φ∗∞ (·) and remove the index ∞ for simplicity
(all other time indices will be suppressed from the notation as well).
⁷ Because the frontier is unique in the infinite-horizon case, the modifier "optimal" is somewhat redundant. It is adopted to distinguish the true continuation-payoff frontier from other approximated ones.

⁸ It is commonly seen in the economics literature that an APS-style operator like Γ∗ has a unique fixed point if the domains of the continuation-payoff functions are fixed at U∞ (e.g., Fernandes and Phelan 2000, and Doepke and Townsend 2006). The result stated here frees the domains and justifies the algorithm presented in the next section. The proof of the result is available from the author.

3 Finding the Optimal Continuation-Payoff Frontier Through Finite Policy Graphs

In this section, we solve the infinite-horizon version of the model presented in Section 2. We first focus on the two-state model, which will be useful for demonstrations throughout the
paper and for numerical experiments in Section 4. We then introduce a graphic representation for long-term contracts and continuation-payoff frontiers, using finitely many nodes, and
present an algorithm that generates a sequence of such graphs to approximate the optimal
frontier. We also discuss a circular structure in the optimal frontier. Finally, we show that
optimal contracts can be easily constructed from the optimal frontier.
3.1 Optimal Continuation-Payoff Frontier in the Two-State Case
In the two-state case, X = {1, 2}. Due to the redundancy in the agent's continuation-payoff set, the agent's absolute continuation-payoff vector u can be replaced by a relative continuation-payoff variable $u = u_1 - u_2$, which reduces the unbounded continuation-payoff set U to a compact interval $U = [\underline{u}, \overline{u}]$ (with a slight abuse of notation). The auxiliary planning problem (4)-(6) can be rewritten as follows, given u. For x = 1,

$$\phi^*_1(u) = \max_{\{\theta_1^z \in [0,1],\, a_1^z \in A,\, v_1^z \in U\}_{z \in Z}} \sum_{z \in Z} \theta_1^z \left\{ r_1(a_1^z) - c_1(a_1^z) + \delta\, p_1(a_1^z)\, \phi^*(v_1^z) \right\} \qquad (7)$$

$$\text{s.t.}\quad u \le \sum_{z \in Z} \theta_1^z \left\{ c_2(a_1^z) - c_1(a_1^z) + \delta\, \left(p_{11}(a_1^z) - p_{21}(a_1^z)\right) v_1^z \right\}, \qquad (8)$$

$$\sum_{z \in Z} \theta_1^z = 1; \qquad (9)$$

and for x = 2,

$$\phi^*_2(u) = \max_{\{\theta_2^z \in [0,1],\, a_2^z \in A,\, v_2^z \in U\}_{z \in Z}} \sum_{z \in Z} \theta_2^z \left\{ r_2(a_2^z) - c_2(a_2^z) + \delta\, p_2(a_2^z)\, \phi^*(v_2^z) \right\} \qquad (10)$$

$$\text{s.t.}\quad u \ge \sum_{z \in Z} \theta_2^z \left\{ c_2(a_2^z) - c_1(a_2^z) + \delta\, \left(p_{11}(a_2^z) - p_{21}(a_2^z)\right) v_2^z \right\}, \qquad (11)$$

$$\sum_{z \in Z} \theta_2^z = 1. \qquad (12)$$
The incentive compatibility constraint (8) prevents the agent from misreporting state 2 as
state 1, and constraint (11) does the opposite. The function φ∗ (·) in the objective functions
is the optimal continuation-payoff frontier, and the optimal objective functions φ∗x (·), x = 1
and 2, give its two components; the domain of φ∗ (·), U , is endogenously determined by
the constraints. This formulation underscores the fact that φ∗ (·) is a fixed point of the
value-iteration operator Γ∗ .
The component functions φ∗1(·) and φ∗2(·) can be depicted in a two-dimensional space, namely the (continuation-payoff) component space, which is useful for constructing and illustrating continuation-payoff frontiers. Any continuation-payoff pair (u, φ) can be represented by two points (u, φ1) and (u, φ2) in the component space. As illustrated in Figure 2, a component function φ∗x(·) can be constructed from certain intermediate functions φx^(a)(·), a ∈ A, which are in turn obtained from φ∗(·) through simple transformations.

To define these intermediate functions, we define an operator Γx^(a) : R³ → R² for any x ∈ X and a ∈ A that maps a continuation-payoff pair (v, ψ) ∈ R × R² to a continuation-payoff component (u, φx) ∈ R × R as follows:

$$\phi_x = r_x(a) - c_x(a) + \delta\, p_x(a)\,\psi, \qquad (13)$$

$$u = c_2(a) - c_1(a) + \delta\, \left(p_{11}(a) - p_{21}(a)\right) v. \qquad (14)$$

Applying Γx^(a) to the optimal continuation-payoff function φ∗(·) point by point—i.e., mapping each point (v, φ∗(v)) to a component point Γx^(a)(v, φ∗(v))—we obtain an intermediate function φx^(a)(·). Used this way, Γx^(a) becomes a functional operator, and we can write φx^(a)(·) = Γx^(a) φ∗(·). As will be useful to the algorithm in Subsection 3.4, we define another operator Γ^(a) : R³ → R³, which maps a continuation-payoff pair (v, ψ) ∈ R × R² to another pair (u, φ) ∈ R × R² according to equations (13)-(14).⁹
Now we show that the components of the optimal continuation-payoff frontier can be constructed from the intermediate functions φx^(a)(·). In the problem formulation (7)-(9), if the IC constraint (8) were instead an equality, the optimal objective function φ∗1(·) would be the convex hull of the intermediate functions φ1^(a)(·), a ∈ A. Because of the inequality, any feasible solution to the problem (7)-(9) under parameter u′ must be feasible under parameter u′′ ≤ u′ as well. Thus, φ∗1(u′′) ≥ φ∗1(u′) for any u′′ ≤ u′, and the inequality (8) effectuates a projection along the −u direction, making the left tail of φ∗1(·) flat. Similarly, the component function φ∗2(·) is the convex hull of the intermediate functions φ2^(a)(·), a ∈ A, with a flat right tail.

⁹ The relationship between Γ^(a) and Γx^(a) is similar to that between Γ∗ and Γ∗x. However, when used as functional operators, the domain of Γ^(a) φ∗(·) is identical to those of Γx^(a) φ∗(·), while the domain of Γ∗ φ∗(·) is the intersection of those of Γ∗x φ∗(·).

Figure 2: Optimal continuation-payoff frontier and intermediate functions for Example 1. (The figure also marks the projection directions for the two component functions.)
Example 1. Consider a model with two states, two actions, and the following parameters: δ = 0.95,

$$C = \begin{pmatrix} 0 & 1 \\ 2 & 0 \end{pmatrix}, \quad R - C = \begin{pmatrix} -0.5 & 1 \\ 0.5 & -1 \end{pmatrix}, \quad P(1) = \begin{pmatrix} 0.8 & 0.2 \\ 0.3 & 0.7 \end{pmatrix}, \quad P(2) = \begin{pmatrix} 0.2 & 0.8 \\ 0.8 & 0.2 \end{pmatrix},$$

where each row of C and R − C corresponds to a state, each column corresponds to an action, and each transition matrix P(a) consists of probabilities pxy(a). The optimal frontier components φ∗x(·) and intermediate functions φx^(a)(·) are illustrated in Figure 2, in the component space.
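
For concreteness, here is a small sketch of the operator Γx^(a) in equations (13)-(14), instantiated with the Example 1 parameters; the continuation point (v, ψ) fed to it is an arbitrary placeholder, since the true frontier is what the algorithm of Subsection 3.4 sets out to compute.

```python
# The operator Gamma_x^(a) of equations (13)-(14), written out for the
# two-state case and instantiated with the Example 1 primitives.
import numpy as np

delta = 0.95
C    = np.array([[0.0, 1.0], [2.0, 0.0]])      # costs c_x(a), rows = states
R_MC = np.array([[-0.5, 1.0], [0.5, -1.0]])    # r_x(a) - c_x(a), rows = states
P    = [np.array([[0.8, 0.2], [0.3, 0.7]]),    # P(1)
        np.array([[0.2, 0.8], [0.8, 0.2]])]    # P(2)

def gamma(x, a, v, psi):
    """Map a continuation-payoff pair (v, psi) to a component point (u, phi_x)."""
    phi_x = R_MC[x, a] + delta * P[a][x] @ psi                      # equation (13)
    u = C[1, a] - C[0, a] + delta * (P[a][0, 0] - P[a][1, 0]) * v   # equation (14)
    return u, phi_x

# Image of a hypothetical continuation point (v, psi) under each (state, action) pair.
v, psi = 1.0, np.array([1.2, 0.9])
for x in range(2):
    for a in range(2):
        print(f"x={x+1}, a={a+1}:", gamma(x, a, v, psi))
```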
Since the optimal frontier φ∗(·) is a fixed point of the value-iteration operator Γ∗, the construction of φ∗x(·) is circular in nature: the intermediate functions φx^(a)(·) can be obtained from the component functions through linear transformations, defined through (13)-(14), while the component functions φ∗x(·) can be constructed from the intermediate functions through convex hull operations and projections.
Remark 1 The agent's relative continuation-payoff set $U = [\underline{u}, \overline{u}]$ can be determined as follows. Define the sets $A^+ = \{a \in A : p_{11}(a) - p_{21}(a) > 0\}$, $A^- = \{a \in A : p_{11}(a) - p_{21}(a) < 0\}$, and $A^0 = \{a \in A : p_{11}(a) - p_{21}(a) = 0\}$, and the operator $\Gamma^{(a)}_{[u]} : \mathbb{R} \to \mathbb{R}$ according to equation (14), i.e., $\Gamma^{(a)}_{[u]}(v) = c_2(a) - c_1(a) + \delta\,(p_{11}(a) - p_{21}(a))\,v$. Then, the bounds $\underline{u}$ and $\overline{u}$ are determined by two equations simultaneously:

$$\underline{u} = \min\left( \{\Gamma^{(a)}_{[u]}(\underline{u})\}_{a \in A^+} \cup \{\Gamma^{(a)}_{[u]}(\overline{u})\}_{a \in A^-} \cup \{\Gamma^{(a)}_{[u]}(0)\}_{a \in A^0} \right), \qquad (15)$$

$$\overline{u} = \max\left( \{\Gamma^{(a)}_{[u]}(\overline{u})\}_{a \in A^+} \cup \{\Gamma^{(a)}_{[u]}(\underline{u})\}_{a \in A^-} \cup \{\Gamma^{(a)}_{[u]}(0)\}_{a \in A^0} \right). \qquad (16)$$

In Example 1, because $A^+ = \{1\}$, $A^- = \{2\}$, and $A^0 = \emptyset$, we have $\underline{u} = \min\{\Gamma^{(1)}_{[u]}(\underline{u}), \Gamma^{(2)}_{[u]}(\overline{u})\}$ and $\overline{u} = \max\{\Gamma^{(1)}_{[u]}(\overline{u}), \Gamma^{(2)}_{[u]}(\underline{u})\}$. Through straightforward calculations (or with the aid of Figure 2), we further obtain $\overline{u} = \Gamma^{(1)}_{[u]}(\overline{u})$ and $\underline{u} = \Gamma^{(2)}_{[u]}(\overline{u})$, and therefore, $\overline{u} = (c_2(1) - c_1(1))/(1 - \delta(p_{11}(1) - p_{21}(1))) \approx 3.8095$ and $\underline{u} = c_2(2) - c_1(2) + \delta(p_{11}(2) - p_{21}(2))\,\overline{u} \approx -3.1714$. In general, $\underline{u}$ and $\overline{u}$ can be determined by comparing various action pairs or through the algorithm presented in Subsection 3.4, because the continuation-payoff frontiers generated by the algorithm, along with their domains, converge to the optimal one.
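
The Remark 1 calculation for Example 1 can be reproduced in a few lines (a sketch that simply evaluates the two closed-form expressions above):

```python
# Reproducing the Remark 1 calculation for Example 1: with A+ = {1} and
# A- = {2}, the upper bound solves u_bar = Gamma^(1)_[u](u_bar) and the lower
# bound is u_low = Gamma^(2)_[u](u_bar).
delta = 0.95
c2 = [2.0, 0.0]              # c_2(a) for a = 1, 2
c1 = [0.0, 1.0]              # c_1(a) for a = 1, 2
dp = [0.8 - 0.3, 0.2 - 0.8]  # p_11(a) - p_21(a) for a = 1, 2

u_bar = (c2[0] - c1[0]) / (1.0 - delta * dp[0])    # ~= 3.8095
u_low = c2[1] - c1[1] + delta * dp[1] * u_bar      # ~= -3.1714
print(u_bar, u_low)
```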
3.2 Finite Policy Graphs
An effective representation of a solution facilitates both the design and implementation of
an algorithm. The policy tree for an infinite-horizon contract consists of an infinite number
of nodes, as illustrated in Figure 1. If cycles are allowed, however, an infinite policy tree
may be reduced to a finite graph, namely a finite policy graph. Due to the cycles, the time
index of a node is meaningless in a finite policy graph and can be replaced by an arbitrary
label. A finite policy graph is made up of the following nodes and branches: (1) each node
has a unique label i from a finite set I and corresponds to a continuation contract σ(i);
(2) n directed branches emanate from each node, each of which corresponds to a submenu
of the continuation contract; (3) in the deterministic case, each branch is attached with
an action-payment pair (ax (i), sx (i)) and points to a unique successor node τx (i); and (4)
in the randomized case, each branch splits into n sub-branches, each of which corresponds
to a realization of the random variable z, is attached with a probability-action-payment
triple (θxz (i), azx (i), szx (i)), and points to a unique successor node τxz (i). A simple two-state
example is illustrated in Figure 3. There are two continuation contracts, σ(1) and σ(2),
intertwined with each other, both consisting of a deterministic submenu and a randomized
14
•••
•••
•••
•••
(a)
(b)
Figure 3: A finite policy graph consisting of two nodes and deterministic and randomized
1
branches.
1
2
1
1
2
2
1
one. For clarity, the branches corresponding to state 1 2are marked by a short bar, and those
(a)
corresponding to state 2 are marked by two bars.
(b)
Each node i of a finite policy graph generates a continuation-payoff pair (u, φ)(i), determined by the structure of the graph in a recursive manner. By expressions (1)-(3), the continuation-payoff component (ux, φx)(i) associated with the xth branch of node i can be computed as follows. In the randomized case,

$$\phi_x(i) = \sum_{z \in Z} \theta_x^z(i) \left\{ r_x(a_x^z(i)) - c_x(a_x^z(i)) + \delta\, p_x(a_x^z(i))\, \phi(\tau_x^z(i)) \right\}, \qquad (17)$$

$$u_x(i) = \sum_{z \in Z} \theta_x^z(i) \left\{ s_x^z(i) - c_x(a_x^z(i)) + \delta\, p_x(a_x^z(i))\, u(\tau_x^z(i)) \right\}, \qquad (18)$$

and in the deterministic case,

$$\phi_x(i) = r_x(a_x(i)) - c_x(a_x(i)) + \delta\, p_x(a_x(i))\, \phi(\tau_x(i)), \qquad (19)$$

$$u_x(i) = s_x(i) - c_x(a_x(i)) + \delta\, p_x(a_x(i))\, u(\tau_x(i)). \qquad (20)$$
A finite policy graph can be used in two ways: to describe a long-term contract, and to
describe a continuation-payoff frontier. As demonstrated in Figure 2, each frontier component φ∗x (·) is piecewise linear and concave and can be described by a set of extreme points
where the slope of the function changes. An extreme point of a frontier component results
in an extreme point of the frontier φ∗ (·). The latter can be represented by a node in a policy
graph, and thus a whole frontier can be represented by a whole policy graph.
As an example, the policy graph for the optimal frontier in Example 1 (Figure 2) is
depicted in Figure 4. It is a deterministic policy graph and a simplified one, consisting of (1) the label i for each node; (2) the agent's relative continuation payoff u(i) and the system's continuation-payoff vector (φ1, φ2)(i) at each node; (3) the corresponding state for each branch, represented by one or two bars; and (4) the associated action at each branch. Payment variables are omitted from the graph. As discussed after the problem formulation (4)-(6), the agent's relative continuation payoffs are more essential. In the algorithm to be presented, the agent's relative continuation-payoff vector associated with any node is computed when the node is created and is kept the same afterward. Those vectors are computed from equation (22) below (free of payment variables) instead of (18) or (20).

Figure 4: The (simplified) optimal policy graph for Example 1. (The graph shows a chain of nodes ···, L4, L3, L2, L1, L0, R0, R1, R2, R3, R4, ···, each annotated with u(i) and (φ1, φ2)(i).)
3.3 Agent's Relative Continuation Payoffs in the Multi-State Case

We generalize the definition of the operator Γ^(a) from the two-state case to the n-state case. Because of the one-degree redundancy in the agent's absolute continuation-payoff sets, it is convenient to replace his absolute continuation-payoff vector u by a relative continuation-payoff vector $\tilde{u}$, defined by $\tilde{u}_x = u_x - u_n$, x = 1, ···, n − 1 (any state other than n may also serve as the reference state). For ease of exposition, we drop the "tilde" symbol from the notation for the agent's relative continuation-payoff vectors and sets. Define the truncated transition-probability vectors $\tilde{p}_x(a) = (p_{x1}(a), p_{x2}(a), \cdots, p_{x,n-1}(a))$, for x ∈ X and a ∈ A. Given a ∈ A, Γ^(a) maps a continuation-payoff-vector pair $(v, \psi) \in \mathbb{R}^{n-1} \times \mathbb{R}^n$ to another pair $(u, \phi) \in \mathbb{R}^{n-1} \times \mathbb{R}^n$ as the following:

$$\phi_x = r_x(a) - c_x(a) + \delta\, p_x(a)\,\psi, \qquad x \in X, \qquad (21)$$

$$u_x = c_n(a) - c_x(a) + \delta\,(\tilde{p}_x(a) - \tilde{p}_n(a))\, v, \qquad x \in X \setminus \{n\}, \qquad (22)$$

which are analogous to equations (13)-(14) in the two-state case.
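
A direct transcription of equations (21)-(22) into code might look as follows; the three-state primitives used to exercise it are hypothetical placeholders.

```python
# The n-state operator Gamma^(a) of equations (21)-(22) as a small function.
import numpy as np

def gamma_a(a, v, psi, r, c, P, delta=0.95):
    """Map (v, psi) in R^{n-1} x R^n to (u, phi) in R^{n-1} x R^n for action a."""
    n = P[a].shape[0]
    phi = r[:, a] - c[:, a] + delta * P[a] @ psi                    # equation (21)
    p_trunc = P[a][:, :n - 1]                                       # truncated rows p~_x(a)
    u = c[n - 1, a] - c[:n - 1, a] + delta * (p_trunc[:n - 1] - p_trunc[n - 1]) @ v  # (22)
    return u, phi

# Hypothetical three-state, two-action primitives.
r = np.array([[1.0, 0.8], [0.9, 1.1], [0.7, 0.6]])
c = np.array([[0.2, 0.5], [0.4, 0.1], [0.3, 0.3]])
P = [np.ones((3, 3)) / 3,
     np.array([[0.5, 0.3, 0.2], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]])]
print(gamma_a(1, v=np.array([0.5, -0.2]), psi=np.array([2.0, 1.5, 1.0]), r=r, c=c, P=P))
```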
3.4 Augmenting Finite Policy Graphs
According to the convergence result stated at the end of Section 2, the optimal continuation-payoff frontier can be approached arbitrarily closely by a sequence of continuation-payoff
frontiers. Thus, the corresponding optimal policy graph can be approximated by a sequence
of finite policy graphs. In general, an optimal policy graph contains infinitely many nodes, as
the one in Figure 4. However, there are some special cases in which the optimal policy graphs
are finite. It is straightforward to verify if a finite policy graph is optimal, as described in
the following proposition (which directly follows from the fact that the functional operator
Γ∗ has a unique fixed point).
Proposition 1 If the continuation-payoff frontier determined by a finite policy graph is
invariant under the value-iteration operator Γ∗ , the frontier and the policy graph are both
optimal.
The algorithm presented below is inspired by the policy-iteration approach for solving
Markov decision processes (MDPs) and the finite-state-controller approach for solving partially observable Markov decision processes (POMDPs). It is noted in the literature that
policy iterations converge faster than value iterations for many MDPs (Puterman 1994)
and that finite state controllers outperform value iterations for many POMDPs (Hansen
1998).10,11
¹⁰ Our algorithm is based on finite policy graphs, which are generally history dependent, while the policy-iteration approach is based on stationary deterministic policies, which are sequences of repeated single-period policies. For an infinite-horizon MDP with finite state and action sets, the policy-iteration method can find an optimal policy in finite time; however, there are infinitely many history-dependent deterministic policies in an infinite-horizon adverse selection problem, and therefore finite convergence to an optimal long-term contract is not guaranteed even if randomization is disregarded.

¹¹ More discussions of finite state controllers and analyses of POMDPs can be found in Zhang (2010a).
In preparation for the algorithm, we define some notation. First, in any finite policy graph generated by the algorithm, every node is associated with a unique relative continuation-payoff vector u for the agent and hence can be uniquely identified by a function i(u). Second, for two relative continuation-payoff vectors u and u′ for the agent, we define u ≽x u′ if every feasible solution of the problem (4)-(6) for state x given parameter u′ is also feasible to the problem given parameter u. The algorithm is described next.
Algorithm: Finite Policy Graph Augmentation

1. (Initialization) Define an initial finite policy graph. Determine the extreme points of the corresponding continuation-payoff frontier φ⁻(·), and record their corresponding continuation-agent-payoff vectors in U⁻. Select a precision level ε > 0.

2. (Augmentation) Apply the Γ^(a) operators to the extreme points of φ⁻(·). For each x ∈ X, determine the extreme points of the new frontier component φ⁺x(·), and record their corresponding continuation-agent-payoff vectors in U⁺x. Let U⁺ = ∪x∈X U⁺x. Each u ∈ U⁺ is coupled with a continuation-system-payoff vector φ so that (u, φ) is generated from an extreme point of φ⁻(·) by a certain Γ^(a) operator (if multiple such φ exist, take their component-wise maximum).

(2.a) For each u ∈ U⁺\U⁻, create a new node i(u) with the branch for each state x ∈ X determined as follows. (i) If (u, φx) is an extreme point of the frontier component φ⁺x(·), make a deterministic branch with the action and successor node used to generate (u, φx); (ii) if (u, φx) is dominated by an extreme point (u′, φ⁺x(u′)) in that u ≽x u′ and φx < φ⁺x(u′), copy the xth branch of node i(u′); (iii) if (u, φx) is dominated by a set of extreme points {(u^z, φ⁺x(u^z))}z∈Z in that u = Σz∈Z λz u^z and φx < Σz∈Z λz φ⁺x(u^z) for some λz ≥ 0 and Σz∈Z λz = 1, make a randomized branch that copies the xth branch of node i(u^z) with probability λz.

(2.b) For each u ∈ U⁺ ∩ U⁻, modify the branches of the existing node i(u) as in (2.a).

3. (Truncation) For each u ∈ U⁻\U⁺, if the node i(u) is not used to create any extreme point of any φ⁺x(·), remove the node from the policy graph and delete u from U⁻; otherwise, modify the branches of the node as in (2.a.iii).

4. (Reevaluation) If any pre-existing branch is modified above, recalculate the continuation-system-payoff vectors for all nodes using equation (17) or (19), and update the extreme points of the continuation-payoff frontier φ⁺(·).

5. (Termination) If d(φ⁻(·), φ⁺(·)) ≤ ε, exit. Otherwise, replace U⁻ by U⁻ ∪ U⁺, φ⁻(·) by φ⁺(·), and return to step 2.
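
Step (2.a.iii) requires finding convex-combination weights λz that dominate a candidate point. One way to carry out this test is a small linear program, sketched below with hypothetical data; the tolerance and the particular solver call are implementation choices, not part of the paper's algorithm.

```python
# A sketch of the dominance test behind cases (2.a.ii)-(2.a.iii): search for
# weights lambda_z >= 0 with sum lambda_z = 1, sum_z lambda_z u^z = u, and
# sum_z lambda_z phi_x(u^z) as large as possible.  If the optimum exceeds
# phi_x, the candidate point is dominated and its branch can be rerouted.
import numpy as np
from scipy.optimize import linprog

def dominating_weights(u, phi_x, U_ext, phi_ext, tol=1e-9):
    """U_ext: (m, d) extreme relative-payoff vectors; phi_ext: (m,) their phi_x values."""
    m = len(phi_ext)
    A_eq = np.vstack([np.asarray(U_ext, dtype=float).T, np.ones(m)])
    b_eq = np.append(np.atleast_1d(u).astype(float), 1.0)
    res = linprog(-np.asarray(phi_ext, dtype=float),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * m)
    if res.success and -res.fun > phi_x + tol:
        return res.x            # dominated: reroute with these probabilities
    return None                 # not dominated: keep the point as an extreme point

U_ext = np.array([[-1.0], [0.5], [2.0]])      # hypothetical extreme points (d = 1)
phi_ext = np.array([0.2, 1.0, 0.4])
print(dominating_weights(u=1.0, phi_x=0.5, U_ext=U_ext, phi_ext=phi_ext))
```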
We explain the algorithm in more detail below.
The initial finite policy graph can be as simple as a single node with circular branches.
The performance of the algorithm is influenced by the initial graph to some extent, as will
be discussed in the next subsection.
In the augmentation step, we conduct the value iteration once and modify the policy graph. More specifically, we apply the Γ^(a) operators to the extreme points of the current continuation-payoff frontier φ⁻(·), identify the extreme points of the new frontier φ⁺(·), record the corresponding u vectors in U⁺, create a new node for every u ∈ U⁺\U⁻ (which does not exist in the current graph), and modify the existing node for every u ∈ U⁺ ∩ U⁻ (to improve the associated system-payoff vector). The two situations can be treated in the same way, and the xth branch of any node i(u), u ∈ U⁺, is determined from the new frontier component φ⁺x(·). Consider any given x. Notice that each u ∈ U⁺ is coupled with a continuation system payoff φx when generated by a certain Γx^(a) operator from an extreme point of φ⁻(·) (corresponding to a node in the old policy graph). There are three cases about the point (u, φx). (1) It contributes to φ⁺x(·). Then the new branch should identify the action and node (in the old policy graph) that resulted in (u, φx). (2) The point is dominated by a point (u′, φ′x) on φ⁺x(·), typically happening at the tail of φ⁺x(·), as in Figure 2. In such a case, we couple u with the (larger) continuation system payoff φ′x by copying the xth branch of node i(u′). (3) The point (u, φx) is dominated by a convex combination of points {(u^z, φ^z_x)}z∈Z on φ⁺x(·). Then, the new branch should split into n sub-branches, copying the xth branch of nodes {i(u^z)}z∈Z. The new branch is deterministic in the first two cases and randomized in the third.
In the truncation step, we check all remaining nodes in the policy graph created in
previous iterations. If a node is not used directly or indirectly to create any extreme point
of the new continuation-payoff frontier φ+ (·), it will never be useful in the future and can
be safely removed from the graph. Even if the contribution of an existing node is indirect,
it cannot be removed without harming the completeness of the graph. But because such a
node must be dominated by the nodes on the new frontier, an improvement can be made by
rerouting its branches as in (2.a.iii).
In the reevaluation step, we evaluate the continuation-system-payoff vectors for the modified policy graph according to equation (17) or (19). This involves solving a system of
linear equations, which can be efficiently done. This step is only necessary when rerouting
of pre-existing branches has occurred in the previous two steps.
In the termination step, we can use the Hausdorff metric dH (·, ·) or other more convenient
criteria (one of which will be seen in Subsection 4.1) to measure the distance between two
consecutive continuation-payoff frontiers and terminate the algorithm when a given precision
is reached.
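
For the termination test, a discrete Hausdorff-style distance between the extreme-point sets of two consecutive frontiers can serve as d(·,·); the helper below is one possible implementation (the point sets shown are hypothetical).

```python
# A small helper (not from the paper) for the termination check: a discrete
# Hausdorff distance between two frontiers represented by their extreme
# points, treated as finite point sets in (u, phi) space.
import numpy as np

def hausdorff(points_a, points_b):
    """points_a, points_b: arrays of shape (m, k) and (l, k)."""
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

A = np.array([[0.0, 1.0], [1.0, 1.5], [2.0, 1.2]])   # hypothetical extreme points
B = np.array([[0.0, 1.1], [1.0, 1.6], [2.0, 1.25]])
print(hausdorff(A, B))    # compare against the precision level epsilon
```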
Note that if we omit the truncation and reevaluation steps and extend part (a) of the
augmentation step to all u ∈ U + (omitting step 2.b), we obtain a value-iteration algorithm,
which will be referred to as the value-iteration counterpart of the algorithm presented above.
Therefore, the above algorithm can be viewed as an improvement over a value-iteration
algorithm. One disadvantage of the value-iteration counterpart is that the size of the policy
graph may grow much faster than under our algorithm, because there are no replacements
or removals of existing nodes. The performance gap between the two algorithms will be
demonstrated through numerical experiments in Subsection 4.1.
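
To convey what the value-iteration counterpart does, the following is a deliberately crude, grid-based stand-in for Γ∗ in the two-state case of Example 1. It restricts attention to deterministic submenus (so it ignores the randomization that the concave-envelope step would supply) and therefore only approximates the frontier from below; it is meant as an illustration, not as the paper's extreme-point procedure.

```python
# A simplified, grid-based value-iteration step for Example 1, restricted to
# deterministic submenus: for each target u, maximize the system payoff over
# actions a and continuation points v subject to the IC constraint (8)/(11).
import numpy as np

delta = 0.95
R_MC = np.array([[-0.5, 1.0], [0.5, -1.0]])
C    = np.array([[0.0, 1.0], [2.0, 0.0]])
P    = [np.array([[0.8, 0.2], [0.3, 0.7]]), np.array([[0.2, 0.8], [0.8, 0.2]])]
u_low, u_bar = -3.1714, 3.8095          # domain bounds from Remark 1
grid = np.linspace(u_low, u_bar, 201)   # grid of the agent's relative payoff u

def value_iteration_step(phi):
    """phi: array of shape (2, len(grid)) holding the current frontier components."""
    new_phi = np.full_like(phi, -np.inf)
    for a in range(2):
        # Images of the grid points under Gamma^(a): equation (14) and the
        # system-payoff parts of (7)/(10).
        u_img = C[1, a] - C[0, a] + delta * (P[a][0, 0] - P[a][1, 0]) * grid
        val = R_MC[:, a][:, None] + delta * (P[a] @ phi)   # shape (2, len(grid))
        for k, u in enumerate(grid):
            feas1 = u <= u_img            # IC constraint (8) for state 1
            feas2 = u >= u_img            # IC constraint (11) for state 2
            if feas1.any():
                new_phi[0, k] = max(new_phi[0, k], val[0, feas1].max())
            if feas2.any():
                new_phi[1, k] = max(new_phi[1, k], val[1, feas2].max())
    return new_phi

phi = np.zeros((2, len(grid)))
for _ in range(50):
    phi = value_iteration_step(phi)
print(phi[:, len(grid) // 2])   # rough frontier values at the middle grid point
```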
Now, we show that the continuation-payoff frontiers generated by the finite-policy-graph algorithm follow a certain order. Define the hypograph of a function φ† : U → Rⁿ as hypo(φ†(·)) = {(u, φ) ∈ U × Rⁿ : φ ≤ φ†(u)}. We say that two functions φ† : U → Rⁿ and ψ† : V → Rⁿ satisfy φ†(·) ≼ ψ†(·), or ψ†(·) ≽ φ†(·), if hypo(φ†(·)) ⊂ hypo(ψ†(·))—i.e., U ⊂ V and φ†(u) ≤ ψ†(u) for all u ∈ U. The "≼" relation defines a weakly increasing partial order. Equipped with this order, we show that:
Theorem 2 The finite-policy-graph algorithm generates a sequence of finite policy graphs
with weakly increasing continuation-payoff frontiers. Before the algorithm terminates, each
frontier in the sequence strictly improves upon the previous one at some node or has a strictly
larger domain. The algorithm terminates with an optimal policy graph or after a finite
number of iterations. In addition, the algorithm converges faster than its value-iteration
counterpart if rerouting ever occurs.
Proof. Let φ̂k−1(·) denote the continuation-payoff frontier at the end of iteration k−1. At the beginning of iteration k, φ̂k−1(·) is relabeled as φ⁻k(·). At the beginning of the augmentation step, a new frontier φ⁺k(·) is formed following the Γ^(a) operations. Because new extreme points are added to the existing ones, we must have φ⁺k(·) ≽ φ⁻k(·). The remainder of this step ensures that every extreme point of φ⁺k(·) has a corresponding node in the policy graph, by either adding a new node or modifying an existing one. In the latter case, the branches of an existing node are rerouted in a way that the associated continuation-system-payoff vector will be improved in the subsequent reevaluation step. The truncation step has no effect on the continuation-payoff frontier. Thus, the reevaluation step results in a weakly improved frontier φ̂k(·) ≥ φ⁺k(·) if rerouting has occurred; otherwise, the step is skipped, and φ̂k(·) = φ⁺k(·). Thus, φ̂k(·) ≽ φ⁻k(·) = φ̂k−1(·) in any case, and {φ̂k(·)}k=1,2,··· is a sequence of weakly increasing continuation-payoff frontiers.

Suppose the algorithm does not stop after iteration k. If the frontier φ̂k(·) has the same domain as φ̂k−1(·) and is not strictly better than φ̂k−1(·) at any point, then φ̂k(·) ≽ φ̂k−1(·) implies φ̂k(·) = φ̂k−1(·), and the algorithm should have stopped after iteration k, which is a contradiction. Therefore, φ̂k(·) must be strictly better than φ̂k−1(·) at some point or have a strictly larger domain. The algorithm terminates when φ̂k(·) = φ̂k−1(·), in which case φ̂k(·) is the optimal frontier by Proposition 1, or when the distance between φ̂k(·) and φ̂k−1(·) is smaller than ε after a finite number of iterations.

Finally, if rerouting never occurs during the entire procedure, we would have φ̂k(·) = φ⁺k(·) for all k, and the frontier sequence {φ̂k(·)} would be identical to the sequence obtained by the value-iteration counterpart. The latter sequence converges to the optimal frontier from below. If rerouting ever happens, we must have φ̂k(·) ≥ φ⁺k(·) (strictly better at some point) for some k as discussed above, and thus {φ̂k(·)} must converge faster than the sequence obtained by the value-iteration counterpart.
The last part of the theorem suggests that the finite-policy-graph algorithm improves over
its value-iteration counterpart by seizing possible opportunities to reroute existing branches
in the policy graphs.
3.5 Cycles in Finite Policy Graphs and Initialization of the Algorithm
Some interesting features of the optimal continuation-payoff frontier and policy graph for
Example 1 are demonstrated in Figures 2 and 4. One of them is the cycle contained in the
optimal policy graph, referred to as an optimal cycle, which consists of two nodes, R0 and
L0. Such a cycle is an absorbing sink in the graph and can be reached from other nodes.
It helps pin down the optimal frontier: in Figure 2, the two extreme points of φ∗ (·) located
at u(R0) and u(L0) can be directly computed from equations (19) and (22); other extreme
points of the frontier can be obtained subsequently, as will be seen in Figure 6 in Section 4.
An optimal cycle, if it exists, carries important structural information about the optimal solution—given any initial state distribution, the resulting optimal contract will become a cyclical policy after a finite number of periods (when the state and action sets are finite). The possible existence of an optimal cycle reflects the circular structure of the optimal frontier, as discussed after Example 1. More specifically, the central (or pivotal) part of a component function φ∗x(·) (the part between u(R0) and u(L0) in Figure 6) is formed from the convex hull of the central parts of the intermediate functions {φx^(a)(·)}a∈A, while the central part of an intermediate function φx^(a)(·) is a weighted average of those of the component functions (with a shrunk domain, by a factor of δ(p11(a) − p21(a)) in the two-state case). Thus, the central part of φ∗(·) must be generated from itself, which corresponds to a cycle in the policy graph.
A rigorous proof of the general existence of an optimal cycle is left for future research, but in
some special cases such as the two-state-two-action case of Example 1 and the ICFB scenario
discussed next, a unique optimal cycle can be identified analytically.
When the state is the agent’s private information, a first-best solution is normally suboptimal for the principal (due to the high information rent yielded to the agent) but it
may still be incentive compatible. We call such a scenario the incentive-compatible first-best
(ICFB) scenario, in which the first-best system efficiency can be achieved over a non-empty
set of the agent’s continuation payoffs. This scenario includes an important special case
called “private values” in which the private state (or agent type) does not affect the principal’s
one-period reward function—i.e., rx (a) can be simplified to r(a). Many interesting adverse
selection problems fall in this category. For a problem with ICFB, the cycle in the optimal
policy graph can be directly constructed, as shown in Zhang (2010b). Example 2 in Section
4, illustrated by Figures 8 and 9, belongs to this case (but not of private values).
Clearly, to describe an infinite-horizon contract by finitely many nodes, a finite policy
graph must contain a cycle so that every node in the graph corresponds to a legitimate
continuation contract (with an infinite future). Thus, the finite-policy-graph algorithm can
naturally be initialized with a cycle. The performance of the algorithm may be affected by
the choice of the initial graph. As demonstrated by the two examples in Section 4, if started
with an optimal cycle, the algorithm would simply add new nodes to the policy graph in
subsequent iterations, without modifying the existing ones. Nevertheless, an optimal cycle
may not be easy to identify in general. As a remedy, we may include multiple cycles in the
initial policy graph, such as singleton nodes and node pairs (in the two-state case), which
need not be all connected. These initial nodes may be removed by the algorithm in later
iterations, but the benefit of including a true optimal cycle early on can be substantial. This
initial treatment can be viewed as a heuristic in general, and whether an optimal cycle exists
or not does not affect the validity of the algorithm.
3.6 Constructing Optimal Contracts
After an optimal policy graph is obtained or approximated by the algorithm, an optimal
long-term contract with respect to a given initial state distribution can be constructed conveniently. For simplicity, we discuss the two-state case, but the process can be generalized to the multi-state case.

Figure 5: The principal's problem at time 1: (a) $\underline{u} < 0 \le \overline{u}$; (b) $0 \le \underline{u} < \overline{u}$.
Given the continuation-payoff frontier $\phi^*(\cdot)$, the principal solves the following problem at time 1: $\max_{u \in U}\{\beta[\phi^*(u) - u] : u \ge 0\}$. This formulation is based on the agent's absolute continuation-payoff vector $u = (u_1, u_2)$, while the continuation-payoff frontier obtained through the algorithm is based on his relative continuation payoff $u = u_1 - u_2$—i.e., in the form of $\phi^*(u_1 - u_2)$. Thus, the problem can be rewritten as $\max_{u \in U}\{\beta[\phi^*(u_1 - u_2) - u] : u \ge 0\}$. At an optimal solution $u^*$, one of its coordinates must be zero, and thus $u^*$ must take the form $(u, 0)$ for some $u \ge 0$ or $(0, -u)$ for some $u \le 0$. Then the problem can be transformed into $\max_{u \in \tilde{U}}\{\beta\phi^*(u) - \rho(u)\}$, where $\tilde{U}$ is the agent's relative continuation-payoff set and

$$\rho(u) = \begin{cases} \beta_1 u, & u \ge 0, \\ -\beta_2 u, & u \le 0. \end{cases}$$

The terms $\beta\phi^*(\cdot)$ and $\rho(\cdot)$ in the objective function represent the expected system payoff and expected information rent, respectively, depending on the agent's relative continuation payoff u. Thus, the objective of this problem reflects the tradeoff between system-efficiency maximization and information-rent extraction. Various parts of the objective function are demonstrated in Figure 5: panel (a) illustrates the case in which the agent's relative continuation-payoff set $[\underline{u}, \overline{u}]$ contains zero, and panel (b) illustrates the case in which the entire interval lies in the positive half of the u axis. The third case, in which the interval lies in the negative half of the axis, is symmetric to case (b).
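
The time-1 trade-off can be resolved by checking finitely many candidates: since βφ*(u) is piecewise linear and concave and ρ(u) is piecewise linear and convex with its only kink at zero, the maximum is attained at an extreme point of the frontier or at u = 0 (when zero lies in the interval, as in panel (a) of Figure 5). The sketch below does exactly that with hypothetical extreme points.

```python
# A sketch of the principal's time-1 problem: maximize beta*phi*(u) - rho(u)
# by evaluating the objective at the frontier's extreme points and at u = 0.
# The extreme points below are hypothetical placeholders.
import numpy as np

beta = np.array([0.6, 0.4])                          # initial state distribution
u_ext   = np.array([-3.2, -1.7, 1.2, 3.8])           # hypothetical extreme u's
phi_ext = np.array([[2.20, -0.90], [2.05, -0.15],    # hypothetical phi*(u) values
                    [1.18,  0.70], [-0.10,  0.85]])

def phi_star(u):                                     # piecewise-linear frontier
    return np.array([np.interp(u, u_ext, phi_ext[:, 0]),
                     np.interp(u, u_ext, phi_ext[:, 1])])

def rho(u):                                          # expected information rent
    return beta[0] * u if u >= 0 else -beta[1] * u

candidates = list(u_ext) + [0.0]
best = max(candidates, key=lambda u: beta @ phi_star(u) - rho(u))
print(best, beta @ phi_star(best) - rho(best))
```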
For any initial state distribution β, only a portion of the optimal frontier or optimal policy
graph is needed. Given any β, an optimal contract can be constructed by traversing the
optimal policy graph, starting from the node corresponding to u∗ which solves the principal’s
time-1 problem. If that node is created during the policy graph augmentation procedure, the
task is over. This fact may be incorporated into the termination criterion of the algorithm.
4 Two-State Examples
In this section, we demonstrate some key characteristics of the algorithm and the optimal
continuation-payoff frontier through two-state examples. A two-state version of the finite-policy-graph algorithm is implemented in MATLAB. For the purpose of this paper, we do
not go into the details of the implementation, but it is understood that the performance of a
sophisticated algorithm like ours is heavily influenced by the details of the algorithm design.
A full-scale implementation of the algorithm and thorough numerical study are left for future
research.12 We show two examples below, starting with Example 1 introduced in Section
3. To minimize the overlap, the emphasis of the first example is on the performance of the
algorithm, and that of the second is on the structure of the optimal frontier and contracts.
More examples can be obtained from the author.
4.1 Example 1 Revisited
In this example, the optimal policy graph contains a unique optimal cycle, consisting of
nodes R0 and L0 in Figure 4. If this cycle is chosen as the initial policy graph, the finite-policy-graph algorithm will add node R1 in the first iteration, nodes R2 and L1 in the second
iteration, nodes R3 and L2 in the third iteration, and so on. No rerouting or truncating
12
The implementation of the algorithm can be greatly facilitated by some existing algorithms for
solving standard computational geometry problems, such as the point-inquiry problem, the convex-hull problem, and the projection problem. There are efficient algorithms for solving these problems in the two-state case.
For instance, we used the heap sort algorithm to sort a set
of the agent’s relative continuation payoffs (based on the pseudocode available at the website
http://en.wikipedia.org/wiki/Heapsort#cite_note-1) and Andrew’s monotone-chain convex-hull algorithm
to identify the extreme points of the continuation-payoff frontiers (based on the C++ code provided at
the website http://softsurfer.com/Archive/algorithm_0109/algorithm_0109.htm). For the multi-state case,
more sophisticated algorithms are needed.
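For illustration only, the upper-hull half of Andrew's monotone-chain algorithm, which is all that is needed to extract the extreme points of a concave frontier component from a set of candidate points, can be sketched in MATLAB as follows (the candidate points below are made up; this is not the paper's implementation).

    u = [0.2 1.0 0.5 0.0 0.8 0.3];      % made-up candidate relative payoffs
    v = [0.9 0.1 1.0 0.5 0.6 0.7];      % made-up frontier values at those payoffs
    [u, order] = sort(u);  v = v(order); % sort candidates by u
    hull = [];                           % indices (into sorted arrays) of upper-hull points
    for i = 1:numel(u)
        while numel(hull) >= 2
            j = hull(end-1);  k = hull(end);
            % pop the last kept point if it lies on or below the chord j -> i
            if (u(k)-u(j))*(v(i)-v(j)) - (v(k)-v(j))*(u(i)-u(j)) >= 0
                hull(end) = [];
            else
                break
            end
        end
        hull(end+1) = i;
    end
    extremeU = u(hull);  extremeV = v(hull);   % extreme points of the upper envelope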
Figure 6: Continuation-payoff frontiers for Example 1 at the end of the first four iterations.
is needed in the process, and the iterations reduce to mere value iterations. The resulting
continuation-payoff frontiers at the end of the first four iterations are illustrated in Figure 6.
If the optimal cycle is unknown at the beginning, we can initialize the algorithm with
an arbitrary small graph. For instance, if we start with a singleton node based on action
1 (both branches from the node are circular and are associated with action 1), then the
first 10 continuation-payoff frontiers generated by the algorithm are depicted in Figure 7(a).
Figure 7(b) shows the first 50 frontiers generated by the value-iteration counterpart of the
algorithm.
The figures show that the frontiers generated by value iterations converge at a steady
speed, while those generated by policy graph augmentations (policy iterations) may advance in large leaps. This seems to be a general phenomenon not restricted to a particular example. We
also observe that the last frontier in Figure 7(a) dominates that in Figure 7(b). To compare
the outputs of the two algorithms more precisely, we compare them with the optimal frontier
illustrated in Figure 2.
Figure 7: Continuation-payoff frontiers for Example 1 from an initial node based on action
1, through: (a) finite policy graph augmentations, and (b) value iterations.
Because the continuation-payoff frontiers after the first few iterations are roughly parallel
with one another, the distance between two frontiers can be conveniently measured by the
distance between their “middle” points (in terms of the agent’s relative continuation payoffs).
In this example, the middle points are located at um ≈ 0.3191. The corresponding optimal
continuation-system-payoff vector is φ∗ (um ) ≈ (1.2655, 0.6750), as in Figure 2. Table 1
reports the results of several experiments started from a single node based on action 1, and
Table 2 records the results started from an initial policy graph consisting of two independent
nodes based on actions 1 and 2, respectively. The tables consist of the following (groups of)
columns: (1) the type of algorithm ("p" for policy iteration, "v" for value iteration); (2) the number
of iterations performed; (3) the total run time;13 (4) the continuation-system-payoff vector
φ(um ) at the middle of the last frontier in the experiment; (5) the absolute difference between
φ(um ) and the optimal continuation-system-payoff vector, i.e., ∆φ(um ) = φ∗ (um ) − φ(um );
and (6) the relative difference, defined as [∆φ1(um) + ∆φ2(um)] / [φ∗1(um) + φ∗2(um)].
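As a worked check of this definition (illustration only), the first row of Table 1 together with φ∗(um) ≈ (1.2655, 0.6750) gives:

    relDiff = (0.0034 + 0.0034) / (1.2655 + 0.6750)   % approximately 0.0035, as reported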
The two sets of experiments generate the same qualitative results, confirming the prediction of Theorem 2. In the first set of experiments, it takes roughly 140 value iterations to beat the output precision of 10 policy iterations, and the run time is at least 50 times longer. The results of the second set of experiments are similar.
13
Recorded on a computer with a 1.60 GHz Pentium M processor and 512 MB RAM.
Table 1: Experiments Started from a Single Node Based on Action 1

Alg.  Iter.  Time (s)  φ1(um)  φ2(um)  ∆φ1(um)  ∆φ2(um)  Rel. Diff.
p     10     1.88      1.2621  0.6716  0.0034   0.0034   0.0035
v     60     29.16     1.1387  0.5482  0.1268   0.1268   0.1307
v     80     51.72     1.2201  0.6295  0.0454   0.0454   0.0468
v     100    68.79     1.2492  0.6587  0.0163   0.0163   0.0168
v     120    91.97     1.2597  0.6691  0.0058   0.0058   0.0060
v     140    119.44    1.2634  0.6729  0.0021   0.0021   0.0022

Table 2: Experiments Started from Two Separate Nodes Based on Actions 1 & 2

Alg.  Iter.  Time (s)  φ1(um)  φ2(um)  ∆φ1(um)  ∆φ2(um)  Rel. Diff.
p     10     2.17      1.2646  0.6741  0.0009   0.0009   0.0009
v     60     32.97     1.2201  0.6296  0.0454   0.0454   0.0467
v     80     52.75     1.2492  0.6587  0.0163   0.0163   0.0168
v     100    74.30     1.2597  0.6692  0.0058   0.0058   0.0060
v     120    92.88     1.2634  0.6729  0.0021   0.0021   0.0022
Although the run-time numbers from a particular implementation of the algorithm should not be taken literally, the qualitative implication is clear: the finite-policy-graph algorithm takes far fewer iterations and much less time than its value-iteration counterpart to achieve the same level of precision.
We briefly comment on the optimal continuation-payoff frontier of this example. Recall that the agent's one-period costs and the system's one-period payoffs are given by C = (cx(a))x∈X,a∈A = [0 1; 2 0] and R − C = (rx(a) − cx(a))x∈X,a∈A = [−0.5 1; 0.5 −1],
respectively. The action plan that maximizes the expected system payoff is given by a∗1 = 2
and a∗2 = 1, and the one that minimizes the expected agent cost is given by a1 = 1 and
a2 = 2 (the claim still holds if long-run payoffs are considered, taking state transitions into
account). There is clearly a severe conflict of interests between the agent and the system.
As a result, in the second-best scenario when the agent’s incentive issue must be addressed,
the optimal continuation-payoff frontier bears a severe loss of efficiency, compared with the
first-best continuation-system-payoff vector φF B ≈ (13.0594, 12.6027).
Figure 8: The optimal continuation-payoff frontier for Example 2.
4.2
Example 2
The next example possesses the ICFB property discussed in Subsection 3.5. Its optimal solution exhibits markedly different characteristics from the example above.
Example 2 Consider a model with two states, two actions, and the following parameters: δ = 0.9, C = [0 0; 2.5 1.4], R − C = [0.5 1; −1 −0.5], P(1) = [0.1 0.9; 0.8 0.2], and P(2) = [0.15 0.85; 0.75 0.25], where each row of C and R − C corresponds to a state, each column corresponds to an action, and each transition matrix P(a) consists of probabilities pxy(a).

The optimal frontier components φ∗x(·) and intermediate functions φx(·) are illustrated in Figure 8, and the optimal policy graph is shown in Figure 9.
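As a numerical sanity check of these parameters (an illustration, not part of the original analysis), the continuation-system-payoff vector generated by always taking action 2, which is the vector attached to the self-circular node 0 in Figure 9, solves V = (r(2) − c(2)) + δ P(2) V and can be computed in MATLAB as:

    delta = 0.9;
    P2    = [0.15 0.85; 0.75 0.25];
    rc2   = [1; -0.5];                   % second column of R - C
    V     = (eye(2) - delta*P2) \ rc2    % approximately (2.55, 1.57)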
In this example, the cost matrix C implies that the agent has the incentive to take action
2 in both states to minimize cost, and the system payoff matrix R−C implies that the system
also prefers action 2 in both states to maximize system payoff (which is still true when long-run payoffs are considered). Thus, the agent's incentive is aligned with the system objective,
and the first-best action plan a∗1 = a∗2 = 2 is incentive compatible. Because the first-best
action plan yields the highest possible continuation-system-payoff vector and it solely involves
action 2, the self-circular node based on action 2 (the node 0 in Figure 9) must be contained
in the optimal policy graph and hence constitutes an optimal cycle. This node corresponds
Figure 9: The optimal policy graph for Example 2.
to the agent’s relative continuation payoff u(0) ≈ 0.9091 and the aforementioned “central
part” of the optimal frontier φ∗ (·) in Subsection 3.5. The rest of φ∗ (·) can be determined by
expanding from this center. First, by the definition of intermediate functions, the peak of φ1^(1)(·), located at u(R1), is mapped from the peak of φ∗(·) and becomes an extreme point of the frontier component φ∗1(·), by the properties of φ∗1(·); the peak of φ2^(1)(·), also located at u(R1), is dominated by φ∗2(u(0)). Thus, the node R1 in the optimal policy graph and the part of φ∗(·) between u(0) and u(R1) are determined as in Figures 8 and 9. Then, we can show that the second-highest extreme point of φ2^(2)(·) is mapped from the second-highest extreme point of φ∗(·), located at u(R1).14 Thus, the node L1 in the optimal policy graph and the
part of φ∗ (·) between u(0) and u(L1) are determined. Repeating this process, we obtain the
optimal continuation-payoff frontier and policy graph. If the algorithm is initialized with the
node 0, it will add nodes R1, L1, R2, L2, R3, L3, ..., one in each iteration.
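A small worked check (illustration only): since node 0 prescribes action 2 in both states and circles back to itself, u(0) is the fixed point of u = c2(2) − c1(2) + δ(p11(2) − p21(2)) u, that is,

    delta = 0.9;
    u0 = (1.4 - 0) / (1 - delta*(0.15 - 0.75))   % = 1.4/1.54, approximately 0.9091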
Now, we derive the optimal contracts. The agent’s relative continuation-payoff set can
be found to be Ũ ≈ [0.0758, 2.4523].15 Following the discussion in Subsection 3.6, the
14
Because p11(2) − p21(2) < 0, by equations (13)-(14), φ2^(2)(·) is a weighted average of φ∗1(·) and φ∗2(·), flipping at u(0). Thus, a point of φ2^(2)(·) on the left of u(0) is mapped from a point of φ∗(·) on the right of u(0).
15
The bounds can be determined from equations (15)-(16). Because A− = {1, 2} and A+ = A0 = ∅, we have u = min{Γ[u]^(1)(u), Γ[u]^(2)(u)} and ū = max{Γ[u]^(1)(u), Γ[u]^(2)(u)}. Figure 8 suggests that u = Γ[u]^(2)(u) and ū = Γ[u]^(1)(u), i.e., u = c2(2) − c1(2) + δ(p11(2) − p21(2))ū and ū = c2(1) − c1(1) + δ(p11(1) − p21(1))u. Solving the two equations simultaneously, we obtain u ≈ 0.0758 and ū ≈ 2.4523.
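Numerically, the two equations form a 2-by-2 linear system; a minimal MATLAB check (illustration only) is:

    delta = 0.9;
    A = [1, -delta*(0.15 - 0.75); -delta*(0.1 - 0.8), 1];   % coefficient matrix for the two bounds
    b = [1.4; 2.5];                                         % c2(2) - c1(2) and c2(1) - c1(1)
    bounds = A \ b                                          % approximately (0.0758, 2.4523)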
Figure 10: The principal’s expected total-profit function π ∗ (u) for various β1 .
principal maximizes her expected continuation payoff at time 1 (or, expected total profit),
i.e., π ∗ (u) = βφ∗ (u)−β1 u = β1 φ∗1 (u)+(1−β1 )φ∗2 (u)−β1 u. The optimal u∗ must be associated
with an extreme point of φ∗ (·) and lie in the interval [u, u(0)] = [0.0758, 0.9091], as can be
seen from Figures 8 and 5(b). It also depends upon the initial state probability β1 . Figure 10
illustrates the function π ∗ (·) in the relevant domain, for different values of β1 . The extreme
point of each π∗(·) associated with u∗ is indicated by an arrow. An optimal contract can
be constructed by traversing the policy graph from the node corresponding to u∗ . The path
always leads to the optimal cycle in finite time, which implies that the contract converges to
a first-best contract in finite time, a general structural result in the ICFB scenario (Zhang
2010b).
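Schematically, the traversal can be sketched in MATLAB as follows (the graph data below are a made-up toy example, not the optimal policy graph of this section): each node stores an action and a successor node for each reported state, and the path is followed from the node associated with u∗.

    action   = [2 2; 1 2; 2 2];      % action(n, x): action at node n when state x is reported (toy data)
    nextNode = [1 1; 1 3; 2 1];      % nextNode(n, x): successor of node n after report x (toy data)
    reports  = [2 1 2 2 1];          % a sample sequence of state reports
    node = 3;                        % start from the node associated with u*
    for t = 1:numel(reports)
        x = reports(t);
        fprintf('period %d: state %d, action %d\n', t, x, action(node, x));
        node = nextNode(node, x);    % move along the branch for the reported state
    end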
An optimal contract reflects the trade-off between system-efficiency maximization and
information-rent extraction. A first-best policy achieves the highest possible system efficiency, βφ∗ (u), but can only be sustained at u(0) ≈ 0.9091, which represents a substantial
information rent in state 1 (from Subsection 3.6, a relative continuation payoff u ≥ 0 corre-
sponds to an absolute continuation-payoff vector u = (u, 0)). Thus, the principal’s expected
total profit βφ∗ (u) − β1 u is not maximized at u(0). According to Figures 8 and 9, the
extreme point of φ∗ (·) on the left of u(0) (i.e., at u(L1)) corresponds to a contract that
attaches two periods of inefficient action plans in front of the first-best policy. Compared
with the first best, this new contract reduces the agent’s information rent at the expense of
system efficiency—both β1 u and βφ∗ (u) are reduced. The trade-off can be translated into
the comparison between the slopes of β1 u and βφ∗ (u). Viewed along the −u direction, the
former slope measures the marginal rent reduction, which is constant at β1 , and the latter
measures the marginal efficiency loss, which increases as more inefficient periods are attached
to the front of the contract. An optimal contract equalizes these two marginal values.
The above interpretation also helps examine the impact of the initial state distribution on
the optimal contracts. In this example, because the slope of φ∗1 (·) is always 0 in the relevant
domain, the slope of βφ∗ (·) is given by the slope of β2 φ∗2 (·). Thus, we are in fact comparing
the latter slope to β1 , or the slope of φ∗2 (·) to β1 /β2 . If we increase β1 (and hence β1 /β2 ),
the optimal u∗ will move toward the left, away from u(0), as shown in Figure 10. Intuitively,
when β1 increases, the information rent in state 1 becomes more salient to the principal,
and a less efficient contract that gives up less information rent in state 1 may become more
desirable.
5
Conclusion
In this paper, we presented an algorithm to solve an infinite-horizon adverse selection problem, based on the finite-policy-graph representation of long-term contracts and continuation-payoff frontiers. The algorithm augments the policy graph through value iterations and
exploits possible improvements in the structure of the policy graph, which may lead to substantial gains. The finite-policy-graph representation and the algorithm not only offer a
numerical solution to the problem but also facilitate the exploration of the structure of optimal contracts, as demonstrated by the two examples.
In this paper, we considered pure adverse selection (with unobservable state and observable action) and assumed a risk-neutral agent. The algorithm is based on an APS-style recursion. Interestingly, continuation-value recursion is also applicable to moral hazard problems,
often with risk-averse agents (e.g., Spear and Srivastava 1987, and Phelan and Townsend
1991). Thus, from the perspective of the general solution approach, adverse selection and
moral hazard problems are closely related. For instance, Fernandes and Phelan (2000) study
both types of problems and Doepke and Townsend (2006) analyze a hybrid adverse selection and moral hazard problem. The finite-policy-graph algorithm presented in the paper is
not fundamentally restricted to the pure adverse selection setting with a risk-neutral agent.
In theory, the key step of the algorithm—i.e., replacing dominated points on the previous continuation-value frontier with the points on the new frontier and recalculating the
new frontier—can be applied to any finite-state finite-action principal-agent problem with a
continuation-value recursion. Extending the algorithm to other principal-agent problems is
a promising direction for future research.
As mentioned in the introduction, there exist different approaches for solving dynamic
principal-agent problems in the literature. It is an important future research topic to compare
the performances of the finite-policy-graph algorithm and the existing algorithms.
Acknowledgements
The author thanks Kenneth L. Judd and two anonymous referees for their thorough and
constructive reviews, which helped improve the paper significantly, and Garrett J. van Ryzin
for his valuable inputs on this paper.
References
[1] Abreu, D., D. Pearce, and E. Stacchetti. 1990. Towards a theory of discounted repeated
games with imperfect monitoring. Econometrica 58 1041–1064.
[2] Battaglini, M. 2005. Long-term contracting with Markovian consumers. American Economic Review 95 637–658.
[3] Battaglini, M., and S. Coate. 2008. Pareto efficient income taxation with stochastic
abilities. Journal of Public Economics 92 844–868.
[4] Bolton, P., and M. Dewatripont. 2005. Contract Theory. MIT Press, Cambridge, MA.
[5] Cole, H., and N. Kocherlakota. 2001. Dynamic games with hidden actions and hidden
states. Journal of Economic Theory 98 114–126.
[6] Doepke, M., and R. M. Townsend. 2006. Dynamic mechanism design with hidden income
and hidden actions. Journal of Economic Theory 126 235–285.
[7] Fernandes, A., and C. Phelan. 2000. A recursive formulation for repeated agency with
history dependence. Journal of Economic Theory 91 223–247.
[8] Fudenberg, D., and J. Tirole. 1991. Game Theory. MIT Press, Cambridge, MA.
[9] Hansen, E. A. 1998. An improved policy iteration algorithm for partially observable
MDPs. Advances in Neural Information Processing Systems 10 (NIPS-97), MIT Press, Cambridge, MA, 1015–1021.
[10] Howard, R. 1960. Dynamic Programming and Markov Processes. Technology Press-Wiley, Cambridge, MA.
[11] Judd, K. L., S. Yeltekin, and J. Conklin. 2003. Computing supergame equilibria. Econometrica 71 1239–1254.
[12] Kapicka, M. 2008. Efficient allocations in dynamic private information economies with
persistent shocks: A first-order approach. Working paper, University of California, Santa
Barbara.
[13] Laffont, J.-J., and D. Martimort. 2002. The Theory of Incentives: The Principal-Agent
Model. Princeton University Press, Princeton, NJ.
[14] Myerson, R. B. 1986. Multistage games with communication. Econometrica 54(2) 323–
358.
[15] Phelan, C., and R. M. Townsend. 1991. Computing multi-period, information-constrained optima. Review of Economic Studies 58 853–881.
[16] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York.
[17] Salanie, B. 1997. The Economics of Contracts. MIT Press, Cambridge, MA.
[18] Sleet, C. 2001. On credible monetary policy and private government information. Journal of Economic Theory 99 338–376.
[19] Sleet, C., and S. Yeltekin. 2001. Dynamic labor contracts with temporary layoffs and
permanent separations. Economic Theory 18 207–235.
[20] Sleet, C., and S. Yeltekin. 2007. Recursive monetary policy games with incomplete
information. Journal of Economic Dynamics & Control 31 1557–1583.
[21] Spear, S. E., and S. Srivastava. 1987. On repeated moral hazard with discounting.
Review of Economic Studies 54 599–617.
[22] Tchistyi, A. 2006. Security design with correlated hidden cash flows: The optimality of
performance pricing. Working paper, New York University.
[23] Zhang, H. 2010a. Partially observable Markov decision processes: A geometric technique
and analysis. Operations Research 58(1) 214–228.
[24] Zhang, H. 2010b. Structural analysis of a dynamic adverse-selection model. Working
paper, University of Southern California, Los Angeles, CA.
[25] Zhang, H., M. Nagarajan, and G. Sosic. 2010. Dynamic supplier contracts under asymmetric inventory information. Operations Research, forthcoming.
[26] Zhang, H., and S. Zenios. 2008. A dynamic principal-agent model with hidden information: Sequential optimality through truthful state revelation. Operations Research
56(3) 681–696.