Introduction to Game Theory
7. Repeated Games
Dana Nau
University of Maryland
Repeated Games
  Used by game theorists, economists, social and behavioral scientists
as highly simplified models of various real-world situations:
  Iterated Prisoner's Dilemma
  Iterated Chicken Game
  Roshambo
  Iterated Battle of the Sexes
  Repeated Ultimatum Game
  Repeated Stag Hunt
  Repeated Matching Pennies
Finitely Repeated Games
  In repeated games, some game G is played
multiple times by the same set of agents
  G is called the stage game
•  Usually (but not always), G is a normal-form game
  Each occurrence of G is called an iteration or a round
  Usually each agent knows what all the agents did in the previous iterations,
but not what they're doing in the current iteration
  Thus, an imperfect-information game with perfect recall
  Usually each agent's payoff function is additive

Prisoner's Dilemma (the stage game; rows are agent 1's moves, columns are agent 2's):

            C       D
  C        3, 3    0, 5
  D        5, 0    1, 1

Iterated Prisoner's Dilemma, with 2 iterations:

            Round 1   Round 2   Total payoff
  Agent 1:     C         D       3 + 5 = 8
  Agent 2:     C         C       3 + 0 = 3
Strategies
  The repeated game has a much bigger strategy space than the stage game
  One kind of strategy is a stationary strategy:
  Use the same strategy at every iteration
  More generally, an agent's play at each stage may depend on the history
  What happened in previous iterations
[Figure: game tree for the Iterated Prisoner's Dilemma with 2 iterations]
Backward Induction
  If the number of iterations is finite and known, we can use backward
induction to get a subgame-perfect equilibrium
  Example: finitely many repetitions of the Prisoner's Dilemma
  In the last round, the dominant strategy is D
  That's common knowledge
  So in the 2nd-to-last round, D also is the dominant strategy
  …
  The SPE is (D,D) on every round:

            Round 1   Round 2   Round 3   Round 4   …
  Agent 1:     D         D         D         D      …
  Agent 2:     D         D         D         D      …

  As with the Centipede game, this argument is vulnerable to
both empirical and theoretical criticisms
Backward Induction when G is 0-sum
  As before, backward induction works much better in zero-sum games
  In the last round, equilibrium is the minimax profile
•  Each agent uses his/her minimax strategy
  That’s common knowledge
  So in the 2nd-to-last round, the equilibrium
again is the minimax strategy profile
…
  The SPE is the minimax profile on every round
Infinitely Repeated Games
  An infinitely repeated game in extensive form would be an infinite tree
  Payoffs can’t be attached to any terminal nodes
  Payoffs can’t be the sums of the payoffs in the stage games (generally infinite)
  Two common ways around this problem
  Let ri(1), ri(2), … be an infinite sequence of payoffs for agent i
  Agent i's average reward is
      lim_{k→∞} (1/k) ∑_{j=1}^{k} ri(j)
  Agent i's future discounted reward is the discounted sum of the payoffs, i.e.,
      ∑_{j=1}^{∞} β^j ri(j),  where β (with 0 ≤ β ≤ 1) is a constant called the discount factor
  Two ways to interpret the discount factor:
1.  The agent cares more about the present than the future
2.  The agent cares about the future, but the game ends at any round with
probability 1 − β
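As a quick illustration (a minimal Python sketch, not from the slides; the truncation length and the example payoff stream are assumptions), both criteria can be computed for a finite prefix of the payoff sequence:

```python
def average_reward(payoffs):
    """Average reward over a finite prefix: (1/k) * sum_{j=1..k} r_i^(j)."""
    return sum(payoffs) / len(payoffs)

def discounted_reward(payoffs, beta):
    """Future discounted reward: sum_j beta^j * r_i^(j),
    with the exponent starting at j = 1 as in the slide's summation."""
    return sum(beta**j * r for j, r in enumerate(payoffs, start=1))

# Example: an agent who earns 3 every round (mutual cooperation in the PD)
payoffs = [3] * 1000
print(average_reward(payoffs))          # 3.0
print(discounted_reward(payoffs, 0.9))  # ≈ 3 * 0.9 / (1 - 0.9) = 27
```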
Example
  Some well-known strategies for the Iterated Prisoner's Dilemma
(sketched in code below):
»  AllC: always cooperate
»  AllD (the Hawk strategy): always defect
»  Grim: cooperate until the other agent defects, then defect forever
»  Tit-for-Tat (TFT): cooperate on the first move.
On the nth move, repeat the other agent's (n–1)th move
»  Tester: defect on move 1. If the other agent retaliates, play TFT.
Otherwise, randomly intersperse cooperation and defection
[Figure: example move sequences for AllC, Grim, or TFT vs. AllC, Grim, or TFT;
for TFT or Grim vs. AllD; and for TFT vs. Tester]
  If the discount factor is large enough, each of the following is a Nash equilibrium:
  (TFT, TFT), (TFT, GRIM), and (GRIM, GRIM)
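Here is one way these strategies might be encoded (an illustrative Python sketch; the function names and the history representation are assumptions, not from the slides):

```python
import random

# Each strategy maps (my_history, their_history) -> 'C' or 'D'.

def all_c(me, them):
    return 'C'

def all_d(me, them):
    return 'D'

def grim(me, them):
    # Cooperate until the other agent has ever defected, then defect forever.
    return 'D' if 'D' in them else 'C'

def tft(me, them):
    # Cooperate on the first move; afterwards copy the other agent's last move.
    return 'C' if not them else them[-1]

def tester(me, them):
    # Defect on move 1; if the other agent has retaliated, fall back to TFT;
    # otherwise randomly intersperse cooperation and defection.
    if not me:
        return 'D'
    if 'D' in them:
        return tft(me, them)
    return random.choice(['C', 'D'])
```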
Equilibrium Payoffs for Repeated Games
  There’s a “folk theorem” that tells what the possible equilibrium payoffs
are in repeated games
  It says roughly the following:
  In an infinitely repeated game whose stage game is G, there is a
Nash equilibrium whose average payoffs are (p1, p2, …, pn)
if and only if
  G has a mixed-strategy profile (s1, s2, …, sn) with the following
property:
•  For each i, si’s payoff would be ≥ pi if the other agents used
minimax strategies against i
Proof and Examples
  The proof proceeds in 2 parts:
  Use the definitions of minimax and best-response to show that in every equilibrium,
an agent's average payoff ≥ the agent's minimax value
  Show how to construct an equilibrium that gives each agent i the average payoff pi,
given certain constraints on (p1, p2, …, pn)
•  In this equilibrium, the agents cycle in lock-step through a sequence of game
outcomes that achieve (p1, p2, …, pn)
•  If any agent i deviates, then the others punish i forever, by playing their
minimax strategies against i
  There's a large family of such theorems, for various conditions on the game
  Example 1: IPD with (p1, p2) = (3,3)
  Example 2: IPD with (p1, p2) = (2.5, 2.5)
[Figure: move sequences for the two example equilibria; if either agent deviates,
the other switches to Grim-style punishment with its minimax strategy]
Zero-Sum Repeated Games
  For two-player zero-sum repeated games, the folk theorem is still true, but it
becomes vacuous
  Suppose we iterate a two-player zero-sum game G
  Let V be the value of G (from the Minimax Theorem)
  If agent 2 uses a minimax strategy against 1, then 1’s maximum payoff is V
•  Thus max value for p1 is V, so min value for p2 is –V
  If agent 1 uses a minimax strategy against 2, then 2’s maximum payoff is –V
•  Thus max value for p2 is –V, so min value for p1 is V
  Thus in the iterated game, the only Nash-equilibrium payoff profile is (V,–V)
  The only way to get this is if each agent always plays his/her minimax strategy
•  If agent 1 plays a non-minimax strategy s1 and agent 2 plays his/her best
response, 2’s expected payoff will be higher than –V
Roshambo (Rock, Paper, Scissors)
(rows = A1's move, columns = A2's move)

              Rock     Paper    Scissors
  Rock        0, 0     –1, 1    1, –1
  Paper       1, –1    0, 0     –1, 1
  Scissors    –1, 1    1, –1    0, 0
  Nash equilibrium for the stage game:
  choose randomly, P=1/3 for each move
  Nash equilibrium for the repeated game:
  always choose randomly, P=1/3 for each move
  Expected payoff = 0
  Let’s see how that works out in practice …
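Before that, a quick check of the equilibrium claim (a Python sketch, not from the slides): against the uniform strategy, every pure counter-move has expected payoff 0, so no deviation helps.

```python
# Row player's payoffs; the game is zero-sum, so the column player's are the negatives.
PAYOFF = {
    ('R', 'R'): 0, ('R', 'P'): -1, ('R', 'S'): 1,
    ('P', 'R'): 1, ('P', 'P'): 0, ('P', 'S'): -1,
    ('S', 'R'): -1, ('S', 'P'): 1, ('S', 'S'): 0,
}
uniform = {'R': 1/3, 'P': 1/3, 'S': 1/3}

for my_move in 'RPS':
    ev = sum(prob * PAYOFF[(my_move, their_move)]
             for their_move, prob in uniform.items())
    print(my_move, round(ev, 10))   # all 0.0: every response is equally good
```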
Roshambo (Rock, Paper, Scissors)
  1999 international roshambo programming competition
www.cs.ualberta.ca/~darse/rsbpc1.html
  Round-robin tournament:
•  55 programs, 1000 iterations
for each pair of programs
•  Lowest possible score = –55000,
highest possible score = 55000
  Average over 25 tournaments:
•  Highest score (Iocaine Powder): 13038
•  Lowest score (Cheesebot): –36006
  Very different from the game-theoretic prediction
  A Nash equilibrium strategy is best for you
if the other agents also use their Nash equilibrium strategies
  In many cases, the other agents won’t use Nash equilibrium strategies
  If you can forecast their actions accurately, you may be able to do
much better than the Nash equilibrium strategy
  Why won’t the other agents use their Nash equilibrium strategies?
  Because they may be trying to forecast your actions too
  Something analogous can happen in non-zero-sum games
Iterated Prisoner’s Dilemma
  Multiple iterations of the Prisoner's Dilemma
  Widely used to study the emergence of cooperative behavior among agents
  e.g., Axelrod (1984), The Evolution of Cooperation

Prisoner's Dilemma (the stage game); the Nash equilibrium is (Defect, Defect):

              Cooperate   Defect
  Cooperate     3, 3       0, 5
  Defect        5, 0       1, 1

  Axelrod ran a famous set of tournaments
  People contributed strategies encoded as computer programs
  Axelrod played them against each other
  An agent may reason: "If I defect now, he might punish me by defecting next time"
TFT with Other Agents
  In Axelrod’s tournaments, TFT usually did best
»  It could establish and maintain cooperation with many other agents
»  It could prevent malicious agents from taking advantage of it
[Figure: example move sequences for TFT vs. AllC, TFT vs. AllD, TFT vs. Grim, TFT vs. TFT, and TFT vs. Tester]
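A match between two of the strategy sketches given earlier can be simulated with a short loop (Python; play_match and the payoff table are illustrative, not from the slides):

```python
# Stage-game payoffs for (my_move, their_move).
PD_PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
             ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def play_match(strategy1, strategy2, rounds):
    """Play an IPD match between two history-based strategies
    (e.g., the tft, grim, all_d sketches above) and return total payoffs."""
    hist1, hist2 = [], []
    total1 = total2 = 0
    for _ in range(rounds):
        m1 = strategy1(hist1, hist2)   # each strategy sees its own history first
        m2 = strategy2(hist2, hist1)
        p1, p2 = PD_PAYOFF[(m1, m2)]
        total1, total2 = total1 + p1, total2 + p2
        hist1.append(m1)
        hist2.append(m2)
    return total1, total2

# Example: TFT loses only the first round to AllD, then both defect forever:
# play_match(tft, all_d, 10)  ->  (9, 14)
```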
Example:
  A real-world example of the IPD, described in Axelrod’s book:
  World War I trench warfare
  Incentive to cooperate:
  If I attack the other side, then they’ll retaliate and I’ll get hurt
  If I don’t attack, maybe they won’t either
  Result: evolution of cooperation
  Although the two infantries were supposed to be enemies, they
avoided attacking each other
IPD with Noise
  In noisy environments, there's a nonzero probability (e.g., 10%)
that a "noise gremlin" will change some of the actions
•  Cooperate (C) becomes Defect (D), and vice versa
  Can use this to model accidents
  Compute the score using the changed action
  Can also model misinterpretations ("Did he really intend to do that?")
  Compute the score using the original action
[Figure: a move sequence in which noise flips one of an agent's C moves to D]
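A sketch of the noise model in code (Python; the 10% rate and the function names are illustrative assumptions):

```python
import random

def flip(move):
    return 'D' if move == 'C' else 'C'

def noisy(intended_move, noise_prob=0.1):
    """With probability noise_prob, the 'noise gremlin' changes the move."""
    return flip(intended_move) if random.random() < noise_prob else intended_move

# Accident: the changed action is what actually happens, so it is used for scoring.
# Misinterpretation: the intended action is carried out and scored, but the other
# agent *observes* the changed action and may react to it.
```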
Example of Noise
  Story from a British army officer in World War I:
  I was having tea with A Company when we heard a lot of shouting and went
out to investigate. We found our men and the Germans standing on their
respective parapets. Suddenly a salvo arrived but did no damage.
Naturally both sides got down and our men started swearing at the Germans,
when all at once a brave German got onto his parapet and shouted out:
“We are very sorry about that; we hope no one was hurt. It is not our
fault. It is that damned Prussian artillery.”
  The salvo wasn’t the German infantry’s intention
  They neither expected it nor desired it
Noise Makes it Difficult to Maintain Cooperation
  Consider two agents
who both use TFT
  One accident or
misinterpretation
can cause a long
string of retaliations
[Figure: two TFT agents; after noise flips one move to D, each agent's retaliation triggers the other's, and the retaliations continue indefinitely]
Some Strategies for the Noisy IPD
  Principle: be more forgiving in the face of defections
  Tit-For-Two-Tats (TFTT)
»  Retaliate only if the other agent defects twice in a row
•  Can tolerate isolated instances of defections, but susceptible to exploitation
of its generosity
•  Beaten by the TESTER strategy I described earlier
  Generous Tit-For-Tat (GTFT)
»  Forgive randomly: small probability of cooperation if the other agent defects
»  Better than TFTT at avoiding exploitation, but worse at maintaining cooperation
  Pavlov
»  Win-Stay, Lose-Shift
•  Repeat previous move if I earn 3 or 5 points in the previous iteration
•  Reverse previous move if I earn 0 or 1 points in the previous iteration
»  Thus if the other agent defects continuously, Pavlov will alternately cooperate
and defect
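These noisy-IPD strategies can be sketched in the same history-based form as the earlier ones (Python; the 10% forgiveness probability in GTFT is an illustrative assumption):

```python
import random

PD_PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
             ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def flip_move(move):
    return 'D' if move == 'C' else 'C'

def tftt(me, them):
    # Tit-For-Two-Tats: retaliate only if the other agent defected twice in a row.
    return 'D' if them[-2:] == ['D', 'D'] else 'C'

def gtft(me, them, forgive_prob=0.1):
    # Generous TFT: like TFT, but forgive a defection with a small probability.
    if them and them[-1] == 'D' and random.random() >= forgive_prob:
        return 'D'
    return 'C'

def pavlov(me, them):
    # Win-Stay, Lose-Shift: repeat my last move after earning 3 or 5 points,
    # reverse it after earning 0 or 1. Cooperate on the first move.
    if not me:
        return 'C'
    my_payoff = PD_PAYOFF[(me[-1], them[-1])][0]
    return me[-1] if my_payoff in (3, 5) else flip_move(me[-1])
```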
Discussion
  The British army officer’s story:
  a German shouted, "We are very sorry about that; we hope no one was
hurt. It is not our fault. It is that damned Prussian artillery."
  The apology avoided a conflict
  It was convincing because it was consistent with the German infantry’s
past behavior
  The British had ample evidence that the German infantry wanted to
keep the peace
  If you can tell which actions are affected by noise, you can avoid reacting
to the noise
  IPD agents often behave deterministically
  For others to cooperate with you it helps if you’re predictable
  This makes it feasible to build a model from observed behavior
The DBS Agent
  Work by my recent PhD graduate, Tsz-Chiu Au
  Now a postdoc at University of Texas
  From the other agent’s recent behavior, build a model π of the other
agent’s strategy
  Use the model to filter noise
  Use the model to help plan our next move
Au & Nau. Accident or intention: That is the question (in the iterated
prisoner's dilemma). AAMAS, 2006.
Au & Nau. Is it accidental or intentional? A symbolic approach to the noisy
iterated prisoner's dilemma. In G. Kendall (ed.), The Iterated Prisoner's
Dilemma: 20 Years On. World Scientific, 2007.
Modeling the other agent
  A set of rules of the following form
if our last move was m and their last move was m',
then P[their next move will be C] = p
  Four rules: one for each of (C,C), (C,D), (D,C), and (D,D)
  For example, TFT can be described as
  (C,C) → 1,  (C,D) → 1,  (D,C) → 0,  (D,D) → 0
  How to get the probabilities?
  One way: look at the agent’s behavior in the recent past
  During the last k iterations,
  What fraction of the time did the other agent cooperate at iteration j
when the agents’ moves were (x,y) at iteration j–1?
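A sketch of this estimation step (Python; the window size k and the data layout are assumptions):

```python
from collections import defaultdict

def estimate_rules(my_moves, their_moves, k=20):
    """Estimate P[other agent cooperates | (my last move, their last move)]
    from the last k iterations of observed play."""
    counts = defaultdict(lambda: [0, 0])   # (x, y) -> [times they then played C, total]
    start = max(1, len(their_moves) - k)
    for j in range(start, len(their_moves)):
        key = (my_moves[j - 1], their_moves[j - 1])
        counts[key][1] += 1
        if their_moves[j] == 'C':
            counts[key][0] += 1
    return {key: c / n for key, (c, n) in counts.items()}

# Against a TFT-like opponent this tends toward
# {('C','C'): 1.0, ('C','D'): 1.0, ('D','C'): 0.0, ('D','D'): 0.0}
```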
Modeling the other agent
  π can only model a very small set of strategies
  It doesn’t even model the Grim strategy correctly:
  If Grim defects, it may be defecting because of something that
happened many moves ago
  But we’re not trying to model an agent’s entire strategy, just its recent
behavior
  If an agent’s behavior changes, then the probabilities in π will change
  e.g., after Grim defects a few times, the rules will give a very low
probability of it cooperating again
Noise Filtering
  Suppose the applicable rule is
deterministic
  P[their next move will be C] = 0 or 1
  If the other agent’s next move
isn’t what the rule predicts, then
  Assume the
observed action
is noise
  Behave as if the
action were what
the rule predicted
[Figure: the learned rule "the other agent cooperates when I do" is deterministic, so a couple of observed defections are treated as noise and not retaliated against]
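A minimal sketch of this filtering step (Python; assumes a rule table like the one produced by the estimate_rules sketch above, and treats probabilities of exactly 0 or 1 as deterministic):

```python
def filter_noise(rules, last_pair, observed_move):
    """If the applicable rule is deterministic and the observation contradicts it,
    treat the observation as noise and return the predicted move instead."""
    p_cooperate = rules.get(last_pair)
    if p_cooperate == 1.0 and observed_move == 'D':
        return 'C'   # rule says they always cooperate here: assume the D was noise
    if p_cooperate == 0.0 and observed_move == 'C':
        return 'D'   # rule says they always defect here: assume the C was noise
    return observed_move
```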
Change of Behavior
  Anomalies in observed behavior can be due either to noise or to a genuine
change of behavior
  Changes of behavior occur because the other agent can change its strategy
anytime
  E.g., if noise affects one of Agent 1's actions, this may trigger a change in
Agent 2's behavior
•  Agent 1 does not know this
  How to distinguish noise from a real change of behavior?
[Figure: Agent 2 plays Grim ("I am Grim. If you ever betray me, I will never
forgive you."); after noise flips one of Agent 1's moves to D, Agent 2's
subsequent defections are a real change of behavior, not noise]
Detection of a Change of Behavior
Temporary tolerance:
  When we observe unexpected
behavior from the other agent
  Don’t immediately decide
whether it’s noise or a real
change of behavior
  Instead, defer judgment
for a few iterations
  If the anomaly persists, then
recompute π based on the
other agent’s recent behavior
[Figure: the other agent normally cooperates; when unexpected defections appear, the agent reasons "the defections might be accidents, so I shouldn't lose my temper too soon," and only after they persist concludes that the other agent's behavior has really changed and changes its own]
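One possible way to code this deferred judgment (a Python sketch; the tolerance window of 4 iterations is an illustrative assumption):

```python
class ToleranceFilter:
    """Temporary tolerance: defer judgment on unexpected moves for a few
    iterations; if the anomaly persists, signal that the model pi should be
    recomputed from the other agent's recent behavior."""

    def __init__(self, tolerance=4):
        self.tolerance = tolerance
        self.streak = 0

    def observe(self, predicted_move, observed_move):
        if observed_move == predicted_move:
            self.streak = 0
            return 'as expected'
        self.streak += 1
        if self.streak <= self.tolerance:
            return 'defer judgment'    # could still be noise
        self.streak = 0
        return 'recompute model'       # anomaly persisted: behavior really changed
```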
Move generation
  Modified version of game-tree search
  Use the policy π to predict probabilities of the other agent’s moves
  Compute the expected utility for move x as
u1(x) = ∑ y∈{C,D} u1(x,y) × P(y | π, previous moves)
where x = my move, y = other agent's move (see the code sketch below)
  Choose the move with the highest expected utility
[Figure: search tree whose branches at each level are the joint moves (C,C), (C,D), (D,C), (D,D), expanded for the current iteration, the next iteration, and the iteration after next]
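A depth-1 version of this move selection in code (a Python sketch; u1 is agent 1's stage-game payoff, and the rule table follows the earlier estimate_rules sketch):

```python
PD_PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
             ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def u1(x, y):
    return PD_PAYOFF[(x, y)][0]

def best_move_depth1(rules, last_pair):
    """Pick my move x maximizing sum_y u1(x, y) * P(y | pi, previous moves),
    where P comes from the learned rule for the last pair of moves."""
    p_c = rules[last_pair]                  # P[other agent plays C next]
    probs = {'C': p_c, 'D': 1 - p_c}
    expected = {x: sum(u1(x, y) * p for y, p in probs.items()) for x in 'CD'}
    return max(expected, key=expected.get), expected

# With the rules of the example below, rules[('C','C')] = 0.7, this gives
# expected = {'C': 2.1, 'D': 3.8}, so a depth-1 search myopically picks D.
```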
Example
  Suppose we have the rules
1. (C,C) → 0.7
2. (C,D) → 0.4
3. (D,C) → 0.1
4. (D,D) → 0.1
and that the last pair of moves was (C,C), so rule 1 predicts P(C) = 0.7, P(D) = 0.3
[Figure: the current history of moves, with the next move to be decided]
  Suppose we search to depth 1
u1(C) = 0.7 u1(C,C) + 0.3 u1(C,D) = 2.1 + 0 = 2.1
u1(D) = 0.7 u1(D,C) + 0.3 u1(D,D) = 3.5 + 0.3 = 3.8
»  So D looks better
  Is D really what we should choose?
Example (continued)
  It's not wise to choose D
»  On the move after that, the opponent will retaliate with P = 0.9
»  The depth-1 search didn't see this
  But if we search to depth d > 1, we'll see it
  C will look better and we'll choose it instead
  In general, it's best to look far ahead
»  e.g., 60 moves
How to Search Deeper
  Game trees grow exponentially with search depth
»  How can we search the tree deeply?
  Key assumption: π accurately models the other agent’s future behavior
  Then we can use dynamic programming
»  Makes the search polynomial in the search depth
»  Can easily search to depth 60
»  Equivalent to solving an acyclic MDP of depth 60
  This generates fairly good moves
[Figure: because the relevant state is just the last joint move, each level of the search has only the four outcomes (C,C), (C,D), (D,C), (D,D), repeated for the current iteration, the next iteration, and the iteration after next]
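A sketch of that dynamic-programming search (Python; it assumes, as the slide does, that the learned rules fully describe the other agent's future behavior, and the variable names are illustrative):

```python
PD_PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
             ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def plan_move(rules, last_pair, depth=60):
    """Backward induction over the four 'last joint move' states.
    Cost is polynomial in the search depth, so looking 60 iterations ahead is cheap."""
    states = [('C', 'C'), ('C', 'D'), ('D', 'C'), ('D', 'D')]
    value = {s: 0.0 for s in states}       # value with 0 moves left
    best = {}
    for _ in range(depth):
        new_value = {}
        for s in states:
            p_c = rules[s]                  # P[other agent cooperates next | s]
            probs = {'C': p_c, 'D': 1 - p_c}
            q = {x: sum(p * (PD_PAYOFF[(x, y)][0] + value[(x, y)])
                        for y, p in probs.items())
                 for x in 'CD'}
            new_value[s] = max(q.values())
            best[s] = max(q, key=q.get)
        value = new_value
    return best[last_pair]

# With the example rules {('C','C'): 0.7, ('C','D'): 0.4, ('D','C'): 0.1, ('D','D'): 0.1},
# the deep search chooses C after (C,C), unlike the depth-1 search above.
```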
20th Anniversary IPD Competition
http://www.prisoners-dilemma.com
  Category 2: IPD with noise
  165 programs participated
  DBS dominated the
top 10 places
  Two agents scored
higher than DBS
  They both used
master-and-slaves strategies
Master & Slaves Strategy
  Each participant could submit up to 20 programs
  Some submitted programs that could recognize each other
  (by communicating pre-arranged sequences of Cs and Ds)
  The 20 programs worked as a team
•  1 master, 19 slaves
  When a slave plays with its master
•  Slave cooperates, master defects
=> maximizes the master's payoff ("My goons give me all their money …")
  When a slave plays with an agent not in its team
•  It defects
=> minimizes the other agent's payoff ("… and they beat up everyone else")
Comparison
  Analysis
  Each master-slaves team’s average score was much lower than DBS’s
  If BWIN and IMM01 had each been restricted to ≤ 10 slaves,
DBS would have placed 1st
  Without any slaves, BWIN and IMM01 would have done badly
  In contrast, DBS had no slaves
  DBS established cooperation
with many other agents
  DBS did this despite the noise,
because it filtered out the noise
Summary
  Finitely repeated games – backward induction
  Infinitely repeated games
  average reward, future discounted reward
  equilibrium payoffs
  Non-equilibrium strategies
  opponent modeling in roshambo
  iterated prisoner’s dilemma with noise
•  opponent models based on observed behavior
•  detection and removal of noise
•  game-tree search against the opponent model
  20th anniversary IPD competition