Markov Models for Multi-Agent Coordination
Maayan Roth
Multi-Robot Reading Group, April 13, 2005

Multi-Agent Coordination
Teams:
– Agents work together to achieve a common goal
– No individual motivations
Objective:
– Generate policies (individually or globally) that yield the best team performance

Observability [Pynadath and Tambe, 2002]
"Observability" – the degree to which agents can, either individually or as a team, identify the current world state
– Individual observability – MMDP [Boutilier, 1996]
– Collective observability – O1 + O2 + … + On uniquely identify the state, e.g. [Xuan, Lesser, Zilberstein, 2001]
– Collective partial observability – DEC-POMDP [Bernstein et al., 2000], POIPSG [Peshkin, Kim, Meuleau, Kaelbling, 2000]
– Non-observability

Communication [Pynadath and Tambe, 2002]
"Communication" – explicit message-passing from one agent to another
– Free communication – no cost to send messages; transforms an MMDP into an MDP and a DEC-POMDP into a POMDP
– General communication – communication is available but has a cost or is limited
– No communication – no explicit message-passing

Taxonomy of Cooperative MAS
– Full observability: MMDP (no/general communication), MDP (free communication)
– Collective observability: DEC-MDP (no/general communication), MDP (free communication)
– Collective partial observability: DEC-POMDP (no/general communication), POMDP (free communication)
– Unobservability: open-loop control?

Complexity
                  Individually     Collectively     Collectively partially
                  observable       observable       observable
No comm.          P-complete       NEXP-complete    NEXP-complete
General comm.     P-complete       NEXP-complete    NEXP-complete
Free comm.        P-complete       P-complete       PSPACE-complete

Multi-Agent MDP (MMDP) [Boutilier, 1996]
Also called IPSG (identical payoff stochastic games)
M = <S, {A_i}, T, R>
– S is the set of possible world states
– {A_i} defines the set of joint actions <a_1, …, a_m>, where a_i ∈ A_i
– T defines transition probabilities over joint actions
– R is the team reward function
The state is fully observable by each agent
P-complete

Coordination Problems
[Figure: example MMDP with states s1–s6; coordinated joint actions (<a,a> or <b,b>) lead to a reward of +10, miscoordinated actions (<a,b> or <b,a>) lead to -10, and an alternative branch yields a safe +5]
Multiple equilibria – two optimal joint policies

Randomization with Learning
Expand the MMDP to include an FSM that selects actions based on whether the agents are coordinated or uncoordinated
[Figure: coordination FSM with an uncoordinated state U, where the agent plays random(a,b), and coordinated states A and B, reached after joint actions <a,a> or <b,b>, where the agent repeats a or b]
Build the expanded MMDP by adding a coordination mechanism at every state where a coordination problem is discovered

Multi-Agent POMDP
DEC-POMDP [Bernstein et al., 2000], MTDP [Pynadath et al., 2002], POIPSG [Peshkin et al., 2000]
M = <S, {A_i}, T, {Ω_i}, O, R>
– S is the set of possible world states
– {A_i} defines the set of joint actions <a_1, …, a_m>, where a_i ∈ A_i
– T defines transition probabilities over joint actions
– {Ω_i} defines the set of joint observations <ω_1, …, ω_m>, where ω_i ∈ Ω_i
– O defines observation probabilities over joint actions and joint observations
– R is the team reward function
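To make the components of the tuple concrete, the sketch below shows one way the DEC-POMDP model could be represented in code. It is only an illustration: the class name DecPOMDP, the callable fields for T, O, and R, and the joint_actions() helper are assumptions, not part of the original slides.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Tuple

JointAction = Tuple[str, ...]       # <a_1, ..., a_m>, one action per agent
JointObservation = Tuple[str, ...]  # <w_1, ..., w_m>, one observation per agent

@dataclass
class DecPOMDP:
    """Minimal container for the tuple M = <S, {A_i}, T, {Omega_i}, O, R> (illustrative)."""
    states: List[str]                      # S
    agent_actions: List[List[str]]         # A_i for each agent i
    agent_observations: List[List[str]]    # Omega_i for each agent i
    # T(s, a, s') = P(s' | s, a) for joint action a
    transition: Callable[[str, JointAction, str], float]
    # O(s', a, w) = P(w | s', a) for joint observation w
    observation: Callable[[str, JointAction, JointObservation], float]
    # R(s, a) = immediate team reward
    reward: Callable[[str, JointAction], float]

    def joint_actions(self) -> List[JointAction]:
        """Enumerate the joint action set A_1 x ... x A_m."""
        return list(product(*self.agent_actions))
```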
Complexity [Bernstein, Zilberstein, Immerman, 2000]
For all m ≥ 2, the DEC-POMDP problem with m agents is NEXP-complete
– "For a given DEC-POMDP with finite time horizon T and an integer K, is there a policy which yields a total reward ≥ K?"
– Agents must reason about all possible actions and observations of their teammates
For all m ≥ 3, the DEC-MDP problem with m agents is NEXP-complete

Dynamic Programming [Hansen, Bernstein, Zilberstein, 2004]
Finds the optimal solution
Partially observable stochastic game (POSG) framework
– Iterate between a DP step and a pruning step that removes dominated strategies
– Experimental results show speed-ups in some domains
Still NEXP-complete, so hyper-exponential in the worst case
No communication

Transition Independence [Becker, Zilberstein, Lesser, Goldman, 2003]
DEC-MDP – collective observability
Transition independence:
– Local state transitions: each agent observes its local state
– Individual actions only affect local state transitions
– The team is connected through the joint reward
Coverage-set algorithm – finds the optimal policy quickly in experimental domains
No communication

Efficient Policy Computation [Nair, Pynadath, Yokoo, Tambe, Marsella, 2003]
Joint Equilibrium-Based Search for Policies (JESP):
– Initialize random policies for all agents
– For each agent i: fix the policies of all other agents and find the optimal policy for agent i
– Repeat until no policies change
Finds locally optimal policies, with potentially exponential speed-ups over finding the global optimum
No communication

Communication
COM-MTDP framework for analyzing communication [Pynadath and Tambe, 2002]
– Enhance the model with {Σ_i}, where Σ_i is the set of possible messages agent i can send
– The reward function includes the cost (≤ 0) of communication
If communication is free:
– The DEC-POMDP is reducible to a single-agent POMDP (PSPACE-complete)
– The optimal communication policy is to communicate at every time step
Under general communication (no restrictions on cost), the DEC-POMDP is still NEXP-complete
– Σ_i may contain every possible observation history

COMM-JESP [Nair, Roth, Yokoo, Tambe, 2004]
Add a SYNC action to the domain
– If one agent chooses SYNC, all other agents SYNC
– At SYNC, send the entire observation history since the last SYNC
SYNC brings agents to a synchronized belief over world states
Policies are indexed by the root synchronized belief and the observation history since the last SYNC
"At-most K" heuristic – there must be a SYNC within at most K timesteps
[Figure: belief tree for the two-agent tiger domain, starting from a synchronized belief (SL 0.5, SR 0.5) at t=0; the joint action {Listen, Listen} branches on each agent hearing HL or HR, and a SYNC at t=2 returns the team to a common belief such as (SL 0.97, SR 0.03)]

Dec-Comm [Roth, Simmons, Veloso, 2005]
At plan time, assume communication is free – generate a joint policy using a single-agent POMDP solver
Reason over possible joint beliefs to execute the policy
Use communication to integrate local observations into the team belief

Tiger Domain: States, Actions
Two-agent tiger problem [Nair et al., 2003]
– S: {SL, SR} – the tiger is either behind the left door or behind the right door
– Individual actions: a_i ∈ {OpenL, OpenR, Listen} – each robot can open the left door, open the right door, or listen

Tiger Domain: Observations
– Individual observations: ω_i ∈ {HL, HR} – each robot can hear the tiger behind the left door or behind the right door
– Observations are noisy and independent

Tiger Domain: Reward
Coordination problem – agents must act together for maximum reward
– Listen has a small cost (-1 per agent)
– Both agents opening the door with the tiger gives a medium negative reward (-50)
– Maximum reward (+20) when both agents open the door with the treasure
– Minimum reward (-100) when only one agent opens the door with the tiger

Joint Beliefs
Joint belief b^t – a distribution over world states
Why compute possible joint beliefs?
– Action coordination
– The transition and observation functions depend on the joint action
– An agent cannot accurately estimate its belief if the joint action is unknown
To ensure action coordination, agents can only reason over information known by all teammates

Possible Joint Beliefs
Starting from the root belief (P(SL) = 0.5, p(b) = 1.0, ω = {}) and taking the joint action a = <Listen, Listen>, expand one child per possible joint observation:
[Figure: after one step, the leaves have P(SL) = 0.8, 0.5, 0.5, 0.2 and probabilities p(b) = 0.29, 0.21, 0.21, 0.29 for joint observations {HL,HL}, {HL,HR}, {HR,HL}, {HR,HR}]
Leaf probabilities are propagated as p^t = p^{t-1} · P(ω^t | b^{t-1}, a^{t-1})
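As a rough illustration of this leaf expansion and the propagation rule p^t = p^{t-1} · P(ω^t | b^{t-1}, a^{t-1}), here is a minimal sketch that assumes the hypothetical DecPOMDP container from the earlier sketch; the function name and leaf representation are illustrative, not the authors' implementation.

```python
from itertools import product

def expand_joint_beliefs(model, leaves, joint_action):
    """One-step expansion of the tree of possible joint beliefs (sketch).

    `leaves` is a list of (belief, prob, obs_history) tuples, where a belief is a
    dict mapping state -> probability.  For every leaf and every joint observation
    w, a child leaf is created with p^t = p^{t-1} * P(w | b^{t-1}, a^{t-1}).
    """
    new_leaves = []
    for belief, prob, history in leaves:
        for w in product(*model.agent_observations):
            # P(w | b, a) = sum_s sum_s' b(s) * T(s, a, s') * O(s', a, w)
            p_w = sum(
                belief[s] * model.transition(s, joint_action, s2)
                          * model.observation(s2, joint_action, w)
                for s in model.states for s2 in model.states
            )
            if p_w == 0.0:
                continue  # this joint observation cannot occur from this leaf
            # Bayesian belief update conditioned on (a, w)
            new_belief = {
                s2: sum(belief[s] * model.transition(s, joint_action, s2)
                        for s in model.states)
                    * model.observation(s2, joint_action, w) / p_w
                for s2 in model.states
            }
            new_leaves.append((new_belief, prob * p_w, history + [w]))
    return new_leaves
```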
How should agents select actions over joint beliefs?

Q-POMDP Heuristic
Select a joint action over the possible joint beliefs
Q-MDP [Littman et al., 1995] – approximate solution to a large POMDP using the underlying MDP:
Q_MDP(b) = argmax_a Σ_{s ∈ S} b(s) · V_a(s)
Q-POMDP – approximate solution to a DEC-POMDP using the underlying single-agent POMDP:
Q_POMDP(L^t) = argmax_a Σ_{L_i^t ∈ L^t} p(L_i^t) · Q(b(L_i^t), a)
Choose the joint action by computing the expected reward over all leaves of the joint belief tree
Agents will independently select the same joint action… but the action choice is very conservative (always <Listen, Listen>)
DEC-COMM: use communication to add local observations to the joint belief

Dec-Comm Example
[Figure: agent 1 hears HL; after one <Listen, Listen> step the leaves have P(SL) = 0.8, 0.5, 0.5, 0.2 with p(b) = 0.29, 0.21, 0.21, 0.29 for joint observations {HL,HL}, {HL,HR}, {HR,HL}, {HR,HR}]
aNC = Q-POMDP(L1) = <Listen, Listen>
L* = circled nodes (leaves consistent with agent 1's own observations)
aC = Q-POMDP(L*) = <Listen, Listen>
Don't communicate

Dec-Comm Example (cont'd)
[Figure: agent 1 hears HL again; after a second <Listen, Listen> step the two-step tree contains leaves ranging from P(SL) = 0.97 (p(b) = 0.12) for the observation history {<HL,HL>, <HL,HL>} down to P(SL) = 0.16 (p(b) = 0.06) for {<HL,HR>, <HR,HR>}]
aNC = Q-POMDP(L2) = <Listen, Listen>
L* = circled nodes
aC = Q-POMDP(L*) = <OpenR, OpenR>
V(aC) - V(aNC) > ε, so agent 1 communicates…

Dec-Comm Example (cont'd)
Agent 1 communicates <HL, HL>
Agent 2 communicates <HL, HL>
Q-POMDP(L2) = <OpenR, OpenR>
Agents open the right door!
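The Q-POMDP selection and the communication test used in the example can be written compactly. The sketch below is illustrative only: the leaf representation and the q_value function (the joint Q-value from the plan-time single-agent POMDP policy) are assumptions carried over from the earlier sketches, not the authors' code.

```python
def q_pomdp(leaves, q_value, joint_actions):
    """Q_POMDP(L) = argmax_a sum_{L_i in L} p(L_i) * Q(b(L_i), a) (sketch)."""
    def expected(a):
        return sum(prob * q_value(belief, a) for belief, prob, _ in leaves)
    best = max(joint_actions, key=expected)
    return best, expected(best)

def consistent_leaves(leaves, my_history, agent_index):
    """The 'circled' leaves: joint observation histories that agree with this
    agent's own observation in its component at every time step."""
    return [
        (belief, prob, hist) for belief, prob, hist in leaves
        if all(w[agent_index] == o for w, o in zip(hist, my_history))
    ]

def dec_comm_decision(leaves, my_history, agent_index,
                      q_value, joint_actions, epsilon):
    """Dec-Comm rule (sketch): communicate iff integrating the local observation
    history would improve the expected value of the chosen joint action by more
    than epsilon."""
    a_nc, _ = q_pomdp(leaves, q_value, joint_actions)            # action without communication
    star = consistent_leaves(leaves, my_history, agent_index)    # L*
    a_c, v_c = q_pomdp(star, q_value, joint_actions)             # action if observations were shared
    v_nc = sum(prob * q_value(belief, a_nc) for belief, prob, _ in star)
    return a_nc, (v_c - v_nc) > epsilon
```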
References

Becker, R., Zilberstein, S., Lesser, V., Goldman, C. V. Transition-independent decentralized Markov decision processes. In Proceedings of the International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2003.
Bernstein, D., Zilberstein, S., Immerman, N. The complexity of decentralized control of Markov decision processes. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.
Boutilier, C. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, 1999.
Hansen, E. A., Bernstein, D. S., Zilberstein, S. Dynamic programming for partially observable stochastic games. In Proceedings of the National Conference on Artificial Intelligence, 2004.
Nair, R., Roth, M., Yokoo, M., Tambe, M. Communication for improving policy computation in distributed POMDPs. To appear in Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2004.
Nair, R., Pynadath, D., Yokoo, M., Tambe, M., Marsella, S. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the International Joint Conference on Artificial Intelligence, 2003.
Peshkin, L., Kim, K.-E., Meuleau, N., Kaelbling, L. P. Learning to cooperate via policy search. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2000.
Pynadath, D. and Tambe, M. The communicative multiagent team decision problem: Analyzing teamwork theories and models. Journal of Artificial Intelligence Research, 2002.
Roth, M., Simmons, R., Veloso, M. Reasoning about joint beliefs for execution-time communication decisions. To appear in Proceedings of the International Joint Conference on Autonomous Agents and Multi-Agent Systems, 2005.
Xuan, P., Lesser, V., Zilberstein, S. Communication decisions in multi-agent cooperation: Model and experiments. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.