Planning, Learning, Prediction, and Games
Two–Player Zero–Sum Games and von Neumann's Minimax Theorem: 12/18/09 - 01/15/10
Lecturer: Patrick Briest, Peter Pietrzyk
Scribe: Philipp Brandes, Peter Pietrzyk, Patrick Briest

4 Two–Player Zero–Sum Games and von Neumann's Minimax Theorem
Definition 4.1 A normal–form game is specified by
• $I$ – a set of players,
• $A_i$ – a set of strategies for each player $i \in I$, and
• $u_i : \prod_{i \in I} A_i \to \mathbb{R}$ – a payoff function for each player $i \in I$.
Normal–form games with 2 players can be written in matrix form, such that
• the elements of A1 index rows,
• the elements of A2 index columns, and
• the entry in row r and column c is (u1 (r, c), u2 (r, c)).
We call player 1 the row player and player 2 the column player. We can also think of the game
as being described by two matrices R and C specifying the payoffs of the row and column players,
respectively.
Example 4.2 (Rock, Paper, Scissors)
Rock beats Scissors, Scissors beats Paper, Paper beats Rock.
Strategies: R, P, S
Payoffs: 1 for winning, −1 for losing, 0 for draw.
Payoff matrix:

          R          P          S
  R    (0, 0)    (−1, 1)    (1, −1)
  P    (1, −1)    (0, 0)    (−1, 1)
  S    (−1, 1)   (1, −1)     (0, 0)
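To make the matrix representation concrete, here is a small Python sketch (not part of the original notes; it assumes numpy is available) that stores the Rock, Paper, Scissors game as the two matrices R and C described above:

```python
import numpy as np

# Strategies of both players, indexing rows (player 1) and columns (player 2).
strategies = ["R", "P", "S"]

# R[r][c] = payoff of the row player, C[r][c] = payoff of the column player.
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
C = -R  # zero-sum: u2 = -u1

# Entry in row "P" and column "R": Paper beats Rock, so (1, -1).
r, c = strategies.index("P"), strategies.index("R")
print(int(R[r, c]), int(C[r, c]))  # 1 -1
```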
Example 4.3 (The Prisoners’ Dilemma)
Two players are accused of a crime. If both admit, both go to jail for 2 years. If they both keep
quiet, they will go to jail for only one year. If only one of them admits, he becomes a principal
witness and goes free, while the other goes to jail for 3 years.
Define the "payoff" of x years in jail as (3 − x).
Strategies of both players: A – admit, Q – keep quiet
Payoff matrix:

          Q         A
  Q    (2, 2)    (0, 3)
  A    (3, 0)    (1, 1)
Definition 4.4 For a set of strategies A, let
$$\Delta(A) = \Big\{\, p : A \to [0, 1] \;\Big|\; \sum_{a \in A} p(a) = 1 \,\Big\}$$
denote the set of all probability distributions on A. For some player i ∈ I, we call the elements of
∆(Ai ) her mixed strategies.
Mixed strategies are rules for randomly picking a strategy. Picking a strategy deterministically is
called a pure strategy.
• Elements of $\prod_{i \in I} A_i$ are called pure strategy profiles.
• Elements of $\prod_{i \in I} \Delta(A_i)$ are called mixed strategy profiles.
The expected payoff of player $i \in I$ given a mixed strategy profile $(p_1, p_2, \ldots, p_{|I|}) \in \prod_{i \in I} \Delta(A_i)$ is
$$u_i(p_1, p_2, \ldots, p_{|I|}) = \sum_{(a_1, a_2, \ldots, a_{|I|}) \in \prod_{i \in I} A_i} \Big( \prod_{j \in I} p_j(a_j) \Big) \cdot u_i(a_1, a_2, \ldots, a_{|I|}),$$
i.e., the expected payoff if every player samples according to their random strategy independently.
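For two players the formula above reduces to a bilinear form: with the row player's payoffs stored in a matrix R, the expected payoff is p^T R q. A minimal sketch (my own illustration, assuming numpy; not part of the notes):

```python
import numpy as np

# Payoff matrix of player 1 for Rock, Paper, Scissors (rows = player 1).
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def expected_payoff(R, p, q):
    """Expected payoff of the row player when the row player mixes with p and the
    column player mixes with q (the two players randomize independently)."""
    # Sum over all pure profiles (a1, a2) of p(a1) * q(a2) * u1(a1, a2).
    return float(p @ R @ q)

p = np.array([0.5, 0.5, 0.0])    # row player never plays Scissors
q = np.array([1/3, 1/3, 1/3])    # column player mixes uniformly
print(expected_payoff(R, p, q))  # 0.0 -- uniform play makes the opponent indifferent
```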
For strategy profile $a = (a_1, a_2, \ldots, a_{|I|})$ and $a_i' \in A_i$, let
$$(a_i', a_{-i}) = (a_1, \ldots, a_{i-1}, a_i', a_{i+1}, \ldots, a_{|I|}),$$
i.e., the strategy profile obtained from a by changing player i's strategy from $a_i$ to $a_i'$. We will use
the same notation for mixed strategy profiles.
Definition 4.5 A mixed strategy profile p = (p1 , . . . , p|I| ) is a mixed Nash equilibrium, if for all
i ∈ I and qi ∈ ∆(Ai ),
ui (qi , p−i ) ≤ ui (pi , p−i ).
If each pi is a pure strategy (assigning probability 1 to a single element from Ai ), we call p a pure
Nash equilibrium.
Example 4.6 The "Prisoners' Dilemma" game has one pure Nash equilibrium (A, A).
"Rock, Paper, Scissors" has no pure Nash equilibrium, but a mixed equilibrium in which both players
choose R, P, S with probability 1/3 each. (Easy to check: If one player randomizes uniformly, every
mixed strategy performs equally well for the other player, so deviating does not increase the payoff.)
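The parenthetical claim can also be checked numerically; a small sketch (illustration only, assuming numpy) showing that against a uniform opponent every pure strategy in Rock, Paper, Scissors earns payoff 0:

```python
import numpy as np

# Player 1's payoff matrix for Rock, Paper, Scissors.
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
uniform = np.full(3, 1/3)

# Payoff of each pure row strategy against a uniformly mixing column player:
print(R @ uniform)         # [0. 0. 0.]
# Payoff of each pure column strategy (u2 = -u1) against a uniformly mixing row player:
print(uniform @ (-R))      # [0. 0. 0.]
# Every response earns 0, so neither player can gain by deviating from uniform play.
```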
Example 4.7 (Bach or Mozart)
Two players want to decide whether to go to a Bach concert or a Mozart concert. They want to go
together, but prefer different alternatives:
          B         M
  B    (2, 1)    (0, 0)
  M    (0, 0)    (1, 2)
There are 2 pure Nash equilibria:
• PureE1: (B, B)
• PureE2: (M, M)
There is also a third mixed Nash equilibrium MixedE:
• row player: Pr(B) = 2/3, Pr(M) = 1/3
• column player: Pr(B) = 1/3, Pr(M) = 2/3
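One way to verify these probabilities (not spelled out in the notes) is via the indifference conditions: in the mixed equilibrium each player must make the other player indifferent between B and M:
$$\begin{aligned}
\text{column player indifferent:}\quad & 1 \cdot \Pr_{\text{row}}(B) = 2 \cdot \Pr_{\text{row}}(M) = 2\,\big(1 - \Pr_{\text{row}}(B)\big) &&\Rightarrow\; \Pr_{\text{row}}(B) = \tfrac{2}{3},\\
\text{row player indifferent:}\quad & 2 \cdot \Pr_{\text{col}}(B) = 1 \cdot \Pr_{\text{col}}(M) = 1 - \Pr_{\text{col}}(B) &&\Rightarrow\; \Pr_{\text{col}}(B) = \tfrac{1}{3}.
\end{aligned}$$
The resulting expected payoffs are 2 · 1/3 = 2/3 for the row player and 1 · 2/3 = 2/3 for the column player, which is exactly the payoff pair listed for MixedE below.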
Some critiques of the Nash equilibrium concept:
1. The idea of a Nash equilibrium is that player 1 plays her side of the equilibrium, because she
believes that player 2 plays her side of the equilibrium, because she thinks that player 1 plays
her side of the equilibrium, because ... But if there are multiple equilibria, why should we
believe that players will be able to coordinate their beliefs?
2. Look at the payoffs in "Bach or Mozart":
• PureE1: (2, 1)
• PureE2: (1, 2)
• MixedE: (2/3, 2/3)
So different equilibria result in different payoffs. If we can’t predict which Nash equilibrium
will be reached, we also can’t predict the payoffs.
In this lecture we will address these critiques, showing that players arrive at an equilibrium by
playing a game repeatedly and using learning rules to adapt to their opponent's behavior.
Definition 4.8 A two–player zero–sum game is one in which I = {1, 2} and u2 (a1 , a2 ) = −u1 (a1 , a2 )
for all pure strategy profiles (a1 , a2 ).
"Rock, Paper, Scissors" is an example of a zero–sum game (actually, even a win/lose/draw–game with
payoffs in {−1, 0, 1}).
4.1 Von Neumann's Minimax Theorem
Theorem 4.9 Let a two-player zero-sum game $G = (I, (A_i), (u_i))$ be given. Define
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \quad\text{and}\quad v_1^{\max} = \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q).$$
It holds that $v_1^{\min} = v_1^{\max}$. We call this value V the game value of G.
Intuitively, Theorem 4.9 says the following: The best payoff player 1 can guarantee for herself, if
she has to pick a strategy first and player 2 is then allowed to play a best response ($v_1^{\min}$), is equal
to the minimum payoff she can achieve if player 2 has to go first, and player 1 is allowed to respond
optimally ($v_1^{\max}$).
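Both values can be computed by linear programming, because for a fixed mixed strategy of one player the inner optimum is always attained at a pure strategy of the other. The following Python sketch (illustration only; it assumes numpy and scipy are available, and the LP formulation is the standard one, not taken from the notes) computes both values for Rock, Paper, Scissors:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(M):
    """Compute max_{p in Delta(rows)} min_{q in Delta(cols)} p^T M q for the row player of M.
    Since the inner minimum is attained at a pure column strategy, this is the LP
        maximize v  s.t.  sum_i p_i * M[i, j] >= v for every column j,  p a distribution.
    Variables are (p_1, ..., p_m, v); linprog minimizes, so we minimize -v."""
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # minimize -v, i.e. maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - sum_i p_i M[i, j] <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                              # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]         # p_i in [0, 1], v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Rock, Paper, Scissors: both values are 0 and the optimal strategies are uniform.
R = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])
v1_min, p_opt = maximin_value(R)       # player 1 commits to p first
v2_min, q_opt = maximin_value(-R.T)    # player 2 commits to q first (player 2's own maximin)
v1_max = -v2_min                       # as noted in the proof below, v1_max = -v2_min
print(v1_min, v1_max)                  # both approximately 0
```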
This has some very nice consequences.
Corollary 4.10 Let G be a two-player zero-sum game with game value V . In every mixed Nash
equilibrium (p∗ , q ∗ ), we have u1 (p∗ , q ∗ ) = V .
Proof: Since $p^*$ is a best response to $q^*$,
$$u_1(p^*, q^*) = \max_{p \in \Delta(A_1)} u_1(p, q^*) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max} = V.$$
Similarly, since $q^*$ is a best response to $p^*$, $u_2(p^*, q^*) \ge -V$. So $u_1(p^*, q^*) = -u_2(p^*, q^*) \le V$.
Corollary 4.11 Let G be a two-player zero-sum game. A mixed strategy profile (p∗ , q ∗ ) is a mixed
Nash equilibrium, if and only if
$$p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \quad\text{and}\quad q^* \in \operatorname*{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q).$$
In particular, the set of mixed Nash equilibria is non-empty.
Proof: "⇒": Let $(p^*, q^*)$ be a Nash equilibrium. By Corollary 4.10, $u_1(p^*, q^*) = V$. Since $q^*$ is a
best response to $p^*$,
$$\min_{q \in \Delta(A_2)} u_1(p^*, q) = u_1(p^*, q^*) = V = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q).$$
Thus, $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$. By symmetry, the same argument applies to player
2, as well.
"⇐": Let a strategy profile $(p^*, q^*)$ with $p^*$ and $q^*$ taken from the respective argmax- and argmin-sets above be given.
Since $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$, we have $u_1(p^*, q^*) \ge V$. On the other hand, for any
$p \in \Delta(A_1)$,
$$u_1(p, q^*) \le \max_{p \in \Delta(A_1)} u_1(p, q^*) = \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = V,$$
where the last equality follows because $q^* \in \operatorname*{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q)$. So player 1 has no
incentive to defect to a different strategy. Again, by symmetry, the same argument applies to
player 2.
Remark Note that Corollary 4.11 resolves one of the critiques of the Nash equilibrium concept.
Players don’t have to coordinate their actions in order to reach an equilibrium, but can pick
strategies from their respective argmax-sets independently.
Recall the MaxHedge algorithm from Homework Assignment 4. We showed:
Theorem 4.12 Algorithm MaxHedge (with n experts, rewards in [0, 1]) has regret $O\big(\sqrt{T \ln(n)}\big)$
against the class of adaptive adversaries. In particular, for any sequence of reward functions $r^t :
[n] \to [0, 1]$, the sequence of experts $x_1, \ldots, x_T$ selected by the algorithm satisfies
$$\mathbb{E}\left[\sum_{t=1}^{T} r^t(x_t)\right] \ge \mathbb{E}\left[\max_{p \in \Delta([n])} \sum_{t=1}^{T} \sum_{x=1}^{n} p_x\, r^t(x)\right] - O\big(\sqrt{T \ln(n)}\big).$$
In Theorem 4.12 above we compare the algorithm to the best mixture of experts rather than to the
best single expert. However, this bound follows immediately from the single-expert regret guarantee
by observing that the maximum on the right hand side is always achieved by a single best expert.
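MaxHedge itself is not reproduced here; the following is a minimal multiplicative-weights sketch in the same spirit (the learning rate and all names are my own choices and need not match the homework algorithm):

```python
import numpy as np

def max_hedge(reward_fns, n, eta, seed=0):
    """A minimal multiplicative-weights learner for rewards in [0, 1].

    reward_fns: a list of T functions; reward_fns[t](x) is the reward of expert x at time t.
    Returns the sequence of experts x_1, ..., x_T chosen by the learner."""
    rng = np.random.default_rng(seed)
    weights = np.ones(n)
    chosen = []
    for r_t in reward_fns:
        p = weights / weights.sum()                       # current distribution over experts
        chosen.append(rng.choice(n, p=p))                 # sample the expert to play
        rewards = np.array([r_t(x) for x in range(n)])    # full-information feedback
        weights *= np.exp(eta * rewards)                  # multiplicative (maximizing) update
    return chosen

# With eta on the order of sqrt(ln(n) / T), updates of this kind achieve regret
# O(sqrt(T ln n)), matching the bound quoted in Theorem 4.12.
```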
Proof of Theorem 4.9: We start with the easy direction and prove that $v_1^{\min} \le v_1^{\max}$:
For any strategy profile $(\hat{p}, \hat{q})$,
$$u_1(\hat{p}, \hat{q}) \le \max_{p \in \Delta(A_1)} u_1(p, \hat{q}).$$
Thus,
$$\min_{q \in \Delta(A_2)} u_1(\hat{p}, q) \le \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q)$$
and, taking the maximum of both sides,
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \le \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}.$$
To prove the other direction v1max ≤ v1min , we will use the existence of expert learning algorithms
with vanishing per-time-step regret (as, e.g., Hedge).
We assume w.l.o.g. that u1 (p, q) ∈ [0, 1] (and, thus, u2 (p, q) ∈ [−1, 0]) for all p ∈ ∆ (A1 ) , q ∈
∆ (A2 ). This can always be achieved by applying an appropriate linear transformation to the game
matrix.
Note that
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} \big(-u_2(p, q)\big) = -\min_{p \in \Delta(A_1)} \max_{q \in \Delta(A_2)} u_2(p, q) = -v_2^{\max}.$$
Similarly, $v_1^{\max} = -v_2^{\min}$.
Now assume that for T steps, both players use an expert learning algorithm to determine their
strategy. Formally, let $n = \max\{|A_1|, |A_2|\}$. Player 1 applies the algorithm with one expert for
each $a \in A_1$. If player 2 plays strategy b at time t, the reward of expert a is $r^t(a) = u_1(a, b)$. Player
2 applies the algorithm with an expert for each $b \in A_2$ and reward functions $r^t(b) = u_2(a, b) + 1$,
where a is the strategy played by player 1 at time t (making sure rewards are in [0, 1]).
Let $a_1, \ldots, a_T$ and $b_1, \ldots, b_T$ be the strategies selected by the two players and assume that both
algorithms have regret $O\big(\sqrt{T \ln(n)}\big)$.
Define mixed strategies
$$p = \frac{1}{T} \sum_{t=1}^{T} a_t \quad\text{and}\quad q = \frac{1}{T} \sum_{t=1}^{T} b_t.$$
(The above is somewhat sloppy notation. We associate each $a_t$ with the vector $\vec{a}_t \in \Delta(A_1)$ that
assigns probability 1 to strategy $a_t \in A_1$.) Intuitively, strategies p and q mix pure strategies
proportional to the frequency with which they have been played.
Let $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} u_1(p, q)$ and $q^* \in \operatorname*{argmax}_{q \in \Delta(A_2)} u_2(p, q)$ be best responses to q and p, respectively. Note that
$$u_1(p^*, q) = \max_{p \in \Delta(A_1)} u_1(p, q) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}.$$
Analogously, $u_2(p, q^*) \ge v_2^{\max}$. By our regret bound,
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(a_t, b_t)\right] \ge \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(p^*, b_t)\right] - O\big(\sqrt{\ln(n)/T}\big) = \mathbb{E}\big[u_1(p^*, q)\big] - O\big(\sqrt{\ln(n)/T}\big).$$
Finally, combining the above,
$$\begin{aligned}
v_1^{\max} - O\big(\sqrt{\ln(n)/T}\big) &\le \mathbb{E}\big[u_1(p^*, q)\big] - O\big(\sqrt{\ln(n)/T}\big) \\
&\le \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(a_t, b_t)\right] \\
&= -\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_2(a_t, b_t)\right] \\
&\le -\mathbb{E}\big[u_2(p, q^*)\big] + O\big(\sqrt{\ln(n)/T}\big) \\
&\le -v_2^{\max} + O\big(\sqrt{\ln(n)/T}\big) \\
&= v_1^{\min} + O\big(\sqrt{\ln(n)/T}\big).
\end{aligned}$$
Now the claim follows for $T \to \infty$ by compactness of $\Delta(A_1), \Delta(A_2)$.
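The construction used in the proof can be simulated directly. In the following sketch (illustration only, assuming numpy) both players run a multiplicative-weights update on Rock, Paper, Scissors with payoffs shifted into [0, 1]; the empirical mixed strategies p and q approach the uniform equilibrium:

```python
import numpy as np

# Player 1's payoffs for Rock, Paper, Scissors, shifted into [0, 1] as in the proof
# (the original payoffs are in {-1, 0, 1}).
R = (np.array([[ 0., -1.,  1.],
               [ 1.,  0., -1.],
               [-1.,  1.,  0.]]) + 1) / 2

T, n = 20000, 3
eta = np.sqrt(np.log(n) / T)                 # learning rate for the O(sqrt(T ln n)) regret bound
rng = np.random.default_rng(0)
w1, w2 = np.ones(n), np.ones(n)              # weights of the two learners
counts1, counts2 = np.zeros(n), np.zeros(n)  # how often each pure strategy was played

for _ in range(T):
    a = rng.choice(n, p=w1 / w1.sum())       # pure strategy of player 1 at time t
    b = rng.choice(n, p=w2 / w2.sum())       # pure strategy of player 2 at time t
    counts1[a] += 1
    counts2[b] += 1
    w1 *= np.exp(eta * R[:, b])              # player 1's rewards: u1(., b), already in [0, 1]
    w2 *= np.exp(eta * (1 - R[a, :]))        # player 2's rewards: u2(a, .) shifted into [0, 1]

p, q = counts1 / T, counts2 / T              # empirical mixed strategies, as in the proof
print(p, q)                                  # both roughly [1/3, 1/3, 1/3]; the game value is 0
```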
Remark By the last set of inequalities in the proof of Theorem 4.9, running the regret-minimizing
algorithms for $\Omega(\ln(n)/(\delta/2)^2)$ steps yields a strategy profile (p, q) such that
$$\mathbb{E}\left[\max_{p \in \Delta(A_1)} u_1(p, q)\right] = \mathbb{E}[u_1(p^*, q)] \le V + \delta/2.$$
Consequently, $\mathbb{E}[u_1(p, q)] \le V + \delta/2$ and, since the game is zero sum, $\mathbb{E}[u_2(p, q)] \ge -V - \delta/2$.
Similarly, for player 2 we have that
$$\mathbb{E}\left[\max_{q \in \Delta(A_2)} u_2(p, q)\right] = \mathbb{E}[u_2(p, q^*)] \le -V + \delta/2.$$
Thus, $\mathbb{E}[u_2(p, q)] \le -V + \delta/2$ and $\mathbb{E}[u_1(p, q)] \ge V - \delta/2$.
Such a strategy profile, in which no player can gain more than δ by deviating unilaterally, is called
a δ-approximate Nash equilibrium.
Since we only run the algorithms for O(log n) steps, the mixed strategies p and q assign positive
probability to at most O(log n) pure strategies. We conclude that every two-player zero-sum game
possesses δ-approximate Nash equilibria with small support of size O(log n).
Remark Using Markov's inequality we can turn the above existence result into a constructive
procedure. Running the algorithms for $\Omega(\ln(n)/(\delta/2c)^2)$ steps for any c > 0, we obtain a mixed
strategy profile (p, q) with
$$\operatorname{Prob}\left[\max_{p \in \Delta(A_1)} u_1(p, q) \ge V + \frac{\delta}{2}\right] \le \frac{\delta/(2c)}{\delta/2} = \frac{1}{c}$$
and support of size O(log n). Running the procedure repeatedly yields the desired δ-approximate
Nash equilibrium with probability exponentially close to 1.
Remark Assume that a zero-sum game is played repeatedly. By using a regret-minimizing algorithm like Hedge, player 1 comes close to the best possible payoff against whatever distribution of
strategies player 2 happens to use. It is not necessary to assume that player 2 is acting rationally.
4.2 Yao's Minimax Principle
Consider a computational problem with
• I - a finite set of possible inputs and
• A - a finite set of possible algorithms.
For all i ∈ I, a ∈ A denote by t(i, a) the running time of algorithm a on input i. We can think
of this as a game, in which player 1 gets to select the input, while player 2 is allowed to pick an
algorithm.
Similar to the payoff functions in the previous sections, we can generalize t to distributions p ∈ ∆(I),
q ∈ ∆(A) as
$$t(p, q) = \sum_{i \in I} \sum_{a \in A} t(i, a)\, p(i)\, q(a).$$
In words, t(p, q) denotes the expected running time of an algorithm from distribution q on an input
from distribution p.
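In matrix form, with the running times stored in a matrix T, the quantity t(p, q) is again a bilinear form; a tiny sketch with made-up numbers (illustration only):

```python
import numpy as np

# Hypothetical running times t(i, a): rows = inputs, columns = deterministic algorithms.
T = np.array([[3.0, 5.0],
              [6.0, 2.0]])
p = np.array([0.5, 0.5])      # distribution over inputs
q = np.array([0.25, 0.75])    # randomized algorithm = distribution over deterministic ones
print(p @ T @ q)              # expected running time t(p, q) = 3.75
```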
Theorem 4.13 In the setting described above, it holds that
$$\max_{p \in \Delta(I)} \min_{a \in A} t(p, a) = \min_{q \in \Delta(A)} \max_{i \in I} t(i, q).$$
Proof: This follows immediately from von Neumann’s Minimax Theorem 4.9 and the observation
that for any mixed strategy p, q of one of the players, the other player has a pure strategy that
constitutes a best response.
Yao’s Minimax Principle states that, for finite sets of algorithms and inputs, the best worst-case
running time achievable by any randomized algorithm (the right hand side), is equal to the best
running time obtainable by a deterministic algorithm on a worst-case distribution of inputs.
This is very helpful in proving lower bounds on the performance guarantee of randomized algorithms (e.g., in online computation). If one can construct a distribution on inputs, on which no
deterministic algorithm performs well in expectation (which is often much easier than arguing about
randomized algorithms), then it follows that there must exist an instance on which no randomized
algorithm performs well.
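The lower-bound direction can be illustrated numerically: for any fixed input distribution p, the expected running time min_a t(p, a) of the best deterministic algorithm lower-bounds the worst-case expected running time max_i t(i, q) of every randomized algorithm q. A small sketch with hypothetical running times (illustration only, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.uniform(1.0, 10.0, size=(4, 5))   # hypothetical running times: 4 inputs x 5 algorithms

p = np.array([0.4, 0.3, 0.2, 0.1])        # a (supposedly hard) distribution over inputs
lower_bound = (p @ T).min()               # best deterministic algorithm against p

for _ in range(1000):                     # try many randomized algorithms q
    q = rng.dirichlet(np.ones(5))
    worst_case = (T @ q).max()            # worst input against the randomized algorithm q
    assert worst_case >= lower_bound - 1e-9
print("every randomized algorithm tried has worst-case expected time >=", round(lower_bound, 3))
```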