Introduction to Game Theory
7. Repeated Games
Dana Nau
University of Maryland
Nau: Game Theory

Repeated Games
  Used by game theorists, economists, social and behavioral scientists as highly simplified models of various real-world situations:
    •  Iterated Prisoner’s Dilemma
    •  Iterated Chicken Game
    •  Roshambo
    •  Iterated Battle of the Sexes
    •  Repeated Ultimatum Game
    •  Repeated Stag Hunt
    •  Repeated Matching Pennies

Finitely Repeated Games
  In repeated games, some game G is played multiple times by the same set of agents
  G is called the stage game
    •  Usually (but not always), G is a normal-form game
  Each occurrence of G is called an iteration or a round
  Usually each agent knows what all the agents did in the previous iterations, but not what they’re doing in the current iteration
  Thus, an imperfect-information game with perfect recall
  Usually each agent’s payoff function is additive

  Prisoner’s Dilemma (rows: agent 1; columns: agent 2):
             C        D
    C      3, 3     0, 5
    D      5, 0     1, 1

  Iterated Prisoner’s Dilemma, with 2 iterations:
                      Agent 1     Agent 2
    Round 1:             C           C
    Round 2:             D           C
    Total payoff:     3+5 = 8     3+0 = 3

Strategies
  The repeated game has a much bigger strategy space than the stage game
  One kind of strategy is a stationary strategy:
    •  Use the same strategy at every iteration
  More generally, an agent’s play at each stage may depend on the history
    •  What happened in previous iterations
  [Figure: Iterated Prisoner’s Dilemma with 2 iterations]

Backward Induction
  If the number of iterations is finite and known, we can use backward induction to get a subgame-perfect equilibrium (SPE)
  Example: finitely many repetitions of the Prisoner’s Dilemma
  In the last round, the dominant strategy is D
  That’s common knowledge
  So in the 2nd-to-last round, D also is the dominant strategy
  …
  The SPE is (D,D) on every round

                Agent 1     Agent 2
    Round 1:       D           D
    Round 2:       D           D
    Round 3:       D           D
    Round 4:       D           D
    …

  This argument is vulnerable to both empirical and theoretical criticisms, as
with the Centipede game.

Backward Induction when G is 0-sum
  As before, backward induction works much better in zero-sum games
  In the last round, equilibrium is the minimax profile
    •  Each agent uses his/her minimax strategy
  That’s common knowledge
  So in the 2nd-to-last round, the equilibrium is again the minimax profile
  …
  The SPE is the minimax profile on every round

Infinitely Repeated Games
  An infinitely repeated game in extensive form would be an infinite tree
  Payoffs can’t be attached to any terminal nodes
  Payoffs can’t be the sums of the payoffs in the stage games (generally infinite)
  Two common ways around this problem
  Let $r_i^{(1)}, r_i^{(2)}, \ldots$ be an infinite sequence of payoffs for agent i
  Agent i’s average reward is
    $\lim_{k \to \infty} \sum_{j=1}^{k} r_i^{(j)} / k$
  Agent i’s future discounted reward is the discounted sum of the payoffs, i.e.,
    $\sum_{j=1}^{\infty} \beta^{j} r_i^{(j)}$
  where $\beta$ (with $0 \le \beta \le 1$) is a constant called the discount factor
  Two ways to interpret the discount factor:
    1.  The agent cares more about the present than the future
    2.  The agent cares about the future, but the game ends at any round with probability $1 - \beta$

Example
  Example move sequences (each pair of columns is one match; each row is one move):

    AllC, Grim,   AllC, Grim,      TFT or
    or TFT        or TFT           Grim      AllD        TFT     Tester
       C             C               C         D          C        D
       C             C               D         D          D        C
       C             C               D         D          C        C
       C             C               D         D          C        C
       C             C               D         D          C        C
       C             C               D         D          C        C
       C             C               D         D          C        C
       …             …               …         …          …        …

  Some well-known strategies for the Iterated Prisoner’s Dilemma:
    »  AllC: always cooperate
    »  AllD (the Hawk strategy): always defect
    »  Grim: cooperate until the other agent defects, then defect forever
    »  Tit-for-Tat (TFT): cooperate on the first move. On the nth move, repeat the other agent’s (n–1)th move
    »  Tester: defect on move 1. If the other agent retaliates, play TFT.
       Otherwise, randomly intersperse cooperation and defection
  (When AllC, Grim, or TFT plays against AllC, Grim, or TFT, both agents cooperate on every move.)
  If the discount factor is large enough, each of the following is a Nash equilibrium:
    •  (TFT, TFT), (TFT, Grim), and (Grim, Grim)

Equilibrium Payoffs for Repeated Games
  There’s a “folk theorem” that tells what the possible equilibrium payoffs are in repeated games
  It says roughly the following:
  In an infinitely repeated game whose stage game is G, there is a Nash equilibrium whose average payoffs are (p1, p2, …, pn) if and only if
  G has a mixed-strategy profile (s1, s2, …, sn) with the following property:
    •  For each i, si’s payoff would be ≤ pi if the other agents used minimax strategies against i (that is, pi is at least agent i’s minimax value)

Proof and Examples
  There’s a large family of such theorems, for various conditions on the game
  The proof proceeds in 2 parts:
  Use the definitions of minimax and best-response to show that in every equilibrium, an agent’s average payoff ≥ the agent’s minimax value
  Show how to construct an equilibrium that gives each agent i the average payoff pi, given certain constraints on (p1, p2, …, pn)
    •  In this equilibrium, the agents cycle in lock-step through a sequence of game outcomes that achieve (p1, p2, …, pn)
    •  If any agent i deviates, then the others punish i forever, by playing their minimax strategies against i
  [Figure: example move sequences. Example 1: IPD with (p1, p2) = (3,3), Grim against the other agent. Example 2: IPD with (p1, p2) = (2.5,2.5), Agent 1 against Agent 2.]

Zero-Sum Repeated Games
  For two-player zero-sum repeated games, the folk theorem is still true, but it becomes vacuous
  Suppose we iterate a two-player zero-sum game G
  Let V be the value of G (from the Minimax Theorem)
  If agent 2 uses a minimax strategy against 1, then 1’s maximum payoff is V
    •  Thus max value for p1 is V, so min value for p2 is –V
  If agent 1 uses a minimax strategy against 2,
  then 2’s maximum payoff is –V
    •  Thus max value for p2 is –V, so min value for p1 is V
  Thus in the iterated game, the only Nash-equilibrium payoff profile is (V,–V)
  The only way to get this is if each agent always plays his/her minimax strategy
    •  If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best response, 2’s expected payoff will be higher than –V

Roshambo (Rock, Paper, Scissors)
  Payoffs (rows: A1; columns: A2):
                 Rock      Paper     Scissors
    Rock         0, 0      –1, 1     1, –1
    Paper        1, –1     0, 0      –1, 1
    Scissors     –1, 1     1, –1     0, 0

  Nash equilibrium for the stage game:
    •  choose randomly, P=1/3 for each move
  Nash equilibrium for the repeated game:
    •  always choose randomly, P=1/3 for each move
  Expected payoff = 0
  Let’s see how that works out in practice …

Roshambo (Rock, Paper, Scissors)
  1999 international roshambo programming competition
  www.cs.ualberta.ca/~darse/rsbpc1.html
  Round-robin tournament:
    •  55 programs, 1000 iterations for each pair of programs
    •  Lowest possible score = –55000, highest possible score = 55000
  Average over 25 tournaments:
    •  Highest score (Iocaine Powder): 13038
    •  Lowest score (Cheesebot): –36006
  Very different from the game-theoretic prediction

  A Nash equilibrium strategy is best for you if the other agents also use their Nash equilibrium strategies
  In many cases, the other agents won’t use Nash equilibrium strategies
  If you can forecast their actions accurately, you may be able to do much better than the Nash equilibrium strategy
  Why won’t the other agents use their Nash equilibrium strategies?
    •  Because they may be trying to forecast your actions too
  Something analogous can happen in non-zero-sum games

Iterated Prisoner’s Dilemma
  Multiple iterations of the Prisoner’s Dilemma

  Prisoner’s Dilemma (rows: P1; columns: P2):
                  Cooperate    Defect
    Cooperate       3, 3        0, 5
    Defect          5, 0        1, 1
  Nash equilibrium: (Defect, Defect)

  Widely used to study the emergence of cooperative behavior among agents
    •  e.g., Axelrod (1984), The Evolution of Cooperation
  Axelrod ran a famous set of tournaments
    •  People contributed strategies encoded as computer programs
    •  Axelrod played them against each other
  Reasoning that favors cooperation: “If I defect now, he might punish me by defecting next time”

TFT with Other Agents
  In Axelrod’s tournaments, TFT usually did best
    »  It could establish and maintain cooperation with many other agents
    »  It could prevent malicious agents from taking advantage of it

    TFT  AllC     TFT  AllD     TFT  Grim     TFT  TFT     TFT  Tester
     C    C        C    D        C    C        C    C       C    D
     C    C        D    D        C    C        C    C       D    C
     C    C        D    D        C    C        C    C       C    C
     C    C        D    D        C    C        C    C       C    C
     C    C        D    D        C    C        C    C       C    C
     C    C        D    D        C    C        C    C       C    C
     C    C        D    D        C    C        C    C       C    C
     …    …        …    …        …    …        …    …       …    …

Example:
  A real-world example of the IPD, described in Axelrod’s book:
    •  World War I trench warfare
  Incentive to cooperate:
    •  If I attack the other side, then they’ll retaliate and I’ll get hurt
    •  If I don’t attack, maybe they won’t either
  Result: evolution of cooperation
    •  Although the two infantries were supposed to be enemies, they avoided attacking each other

IPD with Noise
  In noisy environments,
    •  There’s a nonzero probability (e.g., 10%) that a “noise gremlin” will change some of the actions
    •  Cooperate (C) becomes Defect (D), and vice versa
  Can use this to model accidents
    •  Compute the score using the changed action
  Can also model misinterpretations
    •  Compute the score using the original action
  [Figure: a move sequence in which noise changes one C to D, leaving the other agent to wonder: “Did he really intend to do that?”]
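The accident model above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the slides: the names `play_noisy_ipd`, `tit_for_tat`, and `flip` are hypothetical, each action is assumed to be flipped independently with probability `noise_prob`, and only accidents are modeled (both the score and what the other agent observes use the changed action).

```python
import random

# Stage-game payoffs from the slides: (my move, other's move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_moves, other_moves):
    """Cooperate first, then repeat the other agent's previous observed move."""
    return other_moves[-1] if other_moves else "C"


def flip(move):
    """The noise gremlin's change: C becomes D, and vice versa."""
    return "D" if move == "C" else "C"


def play_noisy_ipd(strategy1, strategy2, rounds, noise_prob, rng):
    """Play the IPD with accident noise; return both totals and the executed moves."""
    moves1, moves2 = [], []
    score1 = score2 = 0
    for _ in range(rounds):
        a1 = strategy1(moves1, moves2)
        a2 = strategy2(moves2, moves1)
        # Assumption: each intended action is changed independently.
        if rng.random() < noise_prob:
            a1 = flip(a1)
        if rng.random() < noise_prob:
            a2 = flip(a2)
        # Accident model: the score is computed from the changed actions.
        score1 += PAYOFF[(a1, a2)]
        score2 += PAYOFF[(a2, a1)]
        moves1.append(a1)
        moves2.append(a2)
    return score1, score2, list(zip(moves1, moves2))
```

For example, `play_noisy_ipd(tit_for_tat, tit_for_tat, 100, 0.1, random.Random(0))` plays two TFT agents under the 10% noise level mentioned above. Modeling misinterpretations instead would mean scoring with the original actions while recording the changed ones in the histories.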
Example of Noise
  Story from a British army officer in World War I:
    I was having tea with A Company when we heard a lot of shouting and went out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage. Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: “We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.”
  The salvo wasn’t the German infantry’s intention
    •  They neither expected nor desired it

Noise Makes it Difficult to Maintain Cooperation
  Consider two agents who both use TFT
  One accident or misinterpretation can cause a long string of retaliations
  [Figure: move sequences for two TFT agents in which a single noise-induced defection triggers alternating retaliations]

Some Strategies for the Noisy IPD
  Principle: be more forgiving in the face of defections
  Tit-For-Two-Tats (TFTT)
    »  Retaliate only if the other agent defects twice in a row
      •  Can tolerate isolated instances of defections, but susceptible to exploitation of its generosity
      •  Beaten by the Tester strategy described earlier
  Generous Tit-For-Tat (GTFT)
    »  Forgive randomly: small probability of cooperation if the other agent defects
    »  Better than TFTT at avoiding exploitation, but worse at maintaining cooperation
  Pavlov
    »  Win-Stay, Lose-Shift
      •  Repeat previous move if I earned 3 or 5 points in the previous iteration
      •  Reverse previous move if I earned 0 or 1 points in the previous iteration
    »  Thus if the other agent defects continuously, Pavlov will alternately cooperate and defect

Discussion
  The British army officer’s story:
    •  a German shouted, “We are very sorry about that; we hope no one was hurt. It is not our fault.
       It is that damned Prussian artillery.”
  The apology avoided a conflict
    •  It was convincing because it was consistent with the German infantry’s past behavior
    •  The British had ample evidence that the German infantry wanted to keep the peace
  If you can tell which actions are affected by noise, you can avoid reacting to the noise
  IPD agents often behave deterministically
    •  For others to cooperate with you it helps if you’re predictable
  This makes it feasible to build a model from observed behavior

The DBS Agent
  Work by my recent PhD graduate, Tsz-Chiu Au
    •  Now a postdoc at University of Texas
  From the other agent’s recent behavior, build a model π of the other agent’s strategy
  Use the model to filter noise
  Use the model to help plan our next move
  References:
    •  Au & Nau. Accident or intention: That is the question (in the iterated prisoner’s dilemma). AAMAS, 2006.
    •  Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy iterated prisoner’s dilemma. In G. Kendall (ed.), The Iterated Prisoners Dilemma: 20 Years On. World Scientific, 2007.

Modeling the other agent
  A set of rules of the following form:
    if our last move was m and their last move was m'
    then P[their next move will be C]
  Four rules: one for each of (C,C), (C,D), (D,C), and (D,D)
  For example, TFT can be described as
    •  (C,C) → 1, (C,D) → 1, (D,C) → 0, (D,D) → 0
  How to get the probabilities?
  One way: look at the agent’s behavior in the recent past
    •  During the last k iterations, what fraction of the time did the other agent cooperate at iteration j when the agents’ moves were (x,y) at iteration j–1?
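The frequency estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DBS implementation: `estimate_policy` and `RULE_KEYS` are hypothetical names, the history is assumed to be a list of (our move, their move) pairs, and a rule with no observations in the window is reported as `None`.

```python
from collections import defaultdict

# The four rule conditions: (our last move, their last move).
RULE_KEYS = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def estimate_policy(history, k):
    """Estimate P[their next move is C] for each rule from the last k iterations.

    history is a list of (our_move, their_move) pairs, oldest first.
    """
    # k iterations j need k+1 entries, so that each j has a predecessor j-1.
    window = history[-(k + 1):]
    seen = defaultdict(int)   # times the joint move (x, y) occurred at j-1
    coop = defaultdict(int)   # ... and the other agent cooperated at j
    for prev, curr in zip(window, window[1:]):
        seen[prev] += 1
        if curr[1] == "C":
            coop[prev] += 1
    return {s: (coop[s] / seen[s] if seen[s] else None) for s in RULE_KEYS}
```

For instance, if the other agent has been playing TFT, the history `[("C","C"), ("D","C"), ("D","D"), ("C","D"), ("C","C")]` with `k = 4` yields 1.0 for (C,C) and (C,D) and 0.0 for (D,C) and (D,D), which matches the TFT rule set given above.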
Modeling the other agent
  π can only model a very small set of strategies
  It doesn’t even model the Grim strategy correctly:
    •  If Grim defects, it may be defecting because of something that happened many moves ago
  But we’re not trying to model an agent’s entire strategy, just its recent behavior
  If an agent’s behavior changes, then the probabilities in π will change
    •  e.g., after Grim defects a few times, the rules will give a very low probability of it cooperating again

Noise Filtering
  Suppose the applicable rule is deterministic
    •  P[their next move will be C] = 0 or 1
  If the other agent’s next move isn’t what the rule predicts, then
    •  Assume the observed action is noise
    •  Behave as if the action were what the rule predicted
  [Figure: a move sequence with isolated defections by the other agent. Annotations: “The other agent cooperates when I do”; “So I won’t retaliate here. I think these defections are actually noise”]

Change of Behavior
  [Figure annotation: “I am Grim. If you ever betray me, I will never forgive you.”]
  Anomalies in observed behavior can be due either to noise or to a genuine change of behavior
  Changes of behavior occur because
    •  The other agent can change its strategy anytime
    •  E.g., if noise affects one of Agent 1’s actions, this may trigger a change in Agent 2’s behavior
      •  Agent 1 does not know this
  How to distinguish noise from a real change of behavior?
  [Figure: a move sequence that ends in sustained defection. Annotation: “These moves are not noise”]

Detection of a Change of Behavior
  Temporary tolerance:
  When we observe unexpected behavior from the other agent
    •  Don’t immediately decide whether it’s noise or a real change of behavior
    •  Instead, defer judgment for a few iterations
  If the anomaly persists, then recompute π based on the other agent’s recent behavior
  [Figure: a move sequence in which the other agent starts defecting persistently. Annotations: “The other agent cooperates when I do”; “The defections might be accidents, so I shouldn’t lose my temper too soon”; “I think the other agent’s behavior has really changed, so I’ll change mine too”]

Move generation
  Modified version of game-tree search
  Use the policy π to predict probabilities of the other agent’s moves
  Compute the expected utility for move x as
    $u_1(x) = \sum_{y \in \{C,D\}} u_1(x,y) \times P(y \mid \pi, \text{previous moves})$
    where x = my move, y = other agent’s move
  Choose the move with the highest expected utility
  [Figure: game tree branching on the joint moves (C,C), (C,D), (D,C), (D,D) at the current iteration, the next iteration, and the iteration after next]

Example
  Suppose we have the rules
    1. (C,C) → 0.7
    2. (C,D) → 0.4
    3. (D,C) → 0.1
    4. (D,D) → 0.1
  [Figure: recent move history ending in (C,C); the next joint move (??) is one of (C,C), (C,D), (D,C), (D,D)]
  Rule 1 predicts P(C) = 0.7, P(D) = 0.3
  Suppose we search to depth 1
    u1(C) = 0.7 u1(C,C) + 0.3 u1(C,D) = 2.1 + 0 = 2.1
    u1(D) = 0.7 u1(D,C) + 0.3 u1(D,D) = 3.5 + 0.3 = 3.8
    »  So D looks better
  Is D really what we should choose?

Example
  The rules and the move history are the same as in the previous example.
  Rule 1 predicts P(C) = 0.7, P(D) = 0.3
  It’s not wise to choose D
    »  On the move after that, the opponent will retaliate with P=0.9
    »  The depth-1 search didn’t see this
  But if we search to depth d>1, we’ll see it
    •  C will look better and we’ll choose it instead
  In general, it’s best to look far ahead
    »  e.g., 60 moves

How to Search Deeper
  Game trees grow exponentially with search depth
    »  How can we search the tree deeply?
  Key assumption: π accurately models the other agent’s future behavior
  Then we can use dynamic programming
    »  Makes the search polynomial in the search depth
    »  Can easily search to depth 60
    »  Equivalent to solving an acyclic MDP of depth 60
  This generates fairly good moves
  [Figure: layered graph with the four joint moves (C,C), (C,D), (D,C), (D,D) at the current iteration, the next iteration, and the iteration after next]

20th Anniversary IPD Competition
  http://www.prisoners-dilemma.com
  Category 2: IPD with noise
    •  165 programs participated
  DBS dominated the top 10 places
  Two agents scored higher than DBS
    •  They both used master-and-slaves strategies

Master & Slaves Strategy
  Each participant could submit up to 20 programs
  Some submitted programs that could recognize each other
    •  (by communicating pre-arranged sequences of Cs and Ds)
  The 20 programs worked as a team
    •  1 master, 19 slaves
  When a slave plays with its master
    •  Slave cooperates, master defects => maximizes the master’s payoff
  When a slave plays with an agent not in its team
    •  It defects => minimizes the other agent’s payoff
  [Figure annotations: “My goons give me all their money …”; “… and they beat up everyone else”]

Comparison
  Analysis
    •  Each master-slaves team’s average score was much lower than DBS’s
    •  If BWIN and IMM01 had each been restricted to ≤ 10 slaves, DBS would have placed 1st
    •  Without any slaves, BWIN and IMM01 would have done badly
  In contrast, DBS had no slaves
    •  DBS established cooperation with many other
       agents
    •  DBS did this despite the noise, because it filtered out the noise

Summary
  Finitely repeated games: backward induction
  Infinitely repeated games
    •  average reward, future discounted reward
    •  equilibrium payoffs
  Non-equilibrium strategies
    •  opponent modeling in roshambo
    •  iterated prisoner’s dilemma with noise
      •  opponent models based on observed behavior
      •  detection and removal of noise
      •  game-tree search against the opponent model
  20th anniversary IPD competition
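As a concrete illustration of the game-tree search against the opponent model, the sketch below memoizes the expected-utility recursion over the four possible joint moves, so the work grows linearly with search depth instead of exponentially. It is a hedged sketch, not the DBS code: `expected_utilities` is a hypothetical name, `PI` holds the example rules from the slides, the horizon is a fixed depth with no discounting, and π is assumed to stay fixed for the whole search.

```python
from functools import lru_cache

# Stage-game payoffs: (my move, other's move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Example rules from the slides:
# (our last move, their last move) -> P[their next move is C].
PI = {("C", "C"): 0.7, ("C", "D"): 0.4, ("D", "C"): 0.1, ("D", "D"): 0.1}


def expected_utilities(state, depth, pi=PI):
    """Return {move: expected total payoff over `depth` iterations} from `state`.

    `state` is the last joint move (our move, their move).
    """

    @lru_cache(maxsize=None)
    def value(s, d):
        # Best achievable expected payoff with d iterations left after state s.
        if d == 0:
            return 0.0
        return max(q(s, x, d) for x in "CD")

    def q(s, x, d):
        # Expected payoff of playing x now, then acting optimally afterwards.
        p_c = pi[s]
        total = 0.0
        for y, p in (("C", p_c), ("D", 1.0 - p_c)):
            total += p * (PAYOFF[(x, y)] + value((x, y), d - 1))
        return total

    return {x: q(state, x, depth) for x in "CD"}
```

With the last joint move (C,C), depth 1 reproduces the numbers from the example (2.1 for C, 3.8 for D), while depth 2 already gives about 5.54 for C against 5.20 for D, so C looks better once the predicted retaliation is visible. Calling `expected_utilities(("C", "C"), 60)` searches 60 moves ahead with only four states per level.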
