Planning, Learning, Prediction, and Games
Two–Player Zero–Sum Games and von Neumann's Minimax Theorem: 12/18/09 - 01/15/10
Lecturer: Patrick Briest, Peter Pietrzyk
Scribe: Philipp Brandes, Peter Pietrzyk, Patrick Briest

4 Two–Player Zero–Sum Games and von Neumann's Minimax Theorem
Definition 4.1 A normal–form game is specified by
• $I$ – a set of players,
• $A_i$ – a set of strategies for each player $i \in I$, and
• $u_i : \prod_{i \in I} A_i \to \mathbb{R}$ – a payoff function for each player $i \in I$.
Normal–form games with 2 players can be written in matrix form, such that
• the elements of A1 index rows,
• the elements of A2 index columns, and
• the entry in row r and column c is (u1 (r, c), u2 (r, c)).
We call player 1 the row player and player 2 the column player. We can also think of the game
as being described by two matrices R and C specifying the payoffs of the row and column players,
respectively.
Example 4.2 (Rock, Paper, Scissors)
Rock beats Scissors, Scissors beats Paper, Paper beats Rock.
Strategies: R, P, S
Payoffs: 1 for winning, −1 for losing, 0 for draw.
Payoff matrix:

          R          P          S
  R    (0, 0)    (−1, 1)    (1, −1)
  P    (1, −1)    (0, 0)    (−1, 1)
  S    (−1, 1)   (1, −1)     (0, 0)
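To make the matrix representation concrete, here is a small Python sketch (not part of the original notes; it assumes numpy is available) that stores the Rock, Paper, Scissors game as the two matrices R and C described above:

```python
import numpy as np

# Strategies of both players, indexing rows (player 1) and columns (player 2).
strategies = ["R", "P", "S"]

# R[r][c] = payoff of the row player, C[r][c] = payoff of the column player.
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])
C = -R  # zero-sum: u2 = -u1

# Entry in row "P" and column "R": Paper beats Rock, so (1, -1).
r, c = strategies.index("P"), strategies.index("R")
print(int(R[r, c]), int(C[r, c]))  # 1 -1
```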
Example 4.3 (The Prisoners’ Dilemma)
Two players are accused of a crime. If both admit, both go to jail for 2 years. If they both keep
quiet, they will go to jail for only one year. If only one of them admits, he becomes a principal
witness and goes free, while the other goes to jail for 3 years.
Define the "payoff" of x years in jail as (3 − x).
Strategies of both players: A – admit, Q – keep quiet
Payoff matrix:

          Q         A
  Q    (2, 2)    (0, 3)
  A    (3, 0)    (1, 1)
Definition 4.4 For a set of strategies A, let
$$\Delta(A) = \Big\{\, p : A \to [0, 1] \;\Big|\; \sum_{a \in A} p(a) = 1 \,\Big\}$$
denote the set of all probability distributions on A. For some player i ∈ I, we call the elements of
∆(Ai ) her mixed strategies.
Mixed strategies are rules for randomly picking a strategy. Picking a strategy deterministically is
called a pure strategy.
• Elements of $\prod_{i \in I} A_i$ are called pure strategy profiles.
• Elements of $\prod_{i \in I} \Delta(A_i)$ are called mixed strategy profiles.
The expected payoff of player $i \in I$ given a mixed strategy profile $(p_1, p_2, \ldots, p_{|I|}) \in \prod_{i \in I} \Delta(A_i)$ is
$$u_i(p_1, p_2, \ldots, p_{|I|}) = \sum_{(a_1, a_2, \ldots, a_{|I|}) \in \prod_{i \in I} A_i} \Big( \prod_{j \in I} p_j(a_j) \Big) \cdot u_i(a_1, a_2, \ldots, a_{|I|}),$$
i.e., the expected payoff if every player samples according to their random strategy independently.
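For two players the formula above reduces to a bilinear form: with the row player's payoffs stored in a matrix R, the expected payoff is p^T R q. A minimal sketch (my own illustration, assuming numpy; not part of the notes):

```python
import numpy as np

# Payoff matrix of player 1 for Rock, Paper, Scissors (rows = player 1).
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def expected_payoff(R, p, q):
    """Expected payoff of the row player when the row player mixes with p and the
    column player mixes with q (the two players randomize independently)."""
    # Sum over all pure profiles (a1, a2) of p(a1) * q(a2) * u1(a1, a2).
    return float(p @ R @ q)

p = np.array([0.5, 0.5, 0.0])    # row player never plays Scissors
q = np.array([1/3, 1/3, 1/3])    # column player mixes uniformly
print(expected_payoff(R, p, q))  # 0.0 -- uniform play makes the opponent indifferent
```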
For strategy profile $a = (a_1, a_2, \ldots, a_{|I|})$ and $a_i' \in A_i$, let
$$(a_i', a_{-i}) = (a_1, \ldots, a_{i-1}, a_i', a_{i+1}, \ldots, a_{|I|}),$$
i.e., the strategy profile obtained from a by changing player i's strategy from $a_i$ to $a_i'$. We will use
the same notation for mixed strategy profiles.
Definition 4.5 A mixed strategy profile p = (p1 , . . . , p|I| ) is a mixed Nash equilibrium, if for all
i ∈ I and qi ∈ ∆(Ai ),
ui (qi , p−i ) ≤ ui (pi , p−i ).
If each pi is a pure strategy (assigning probability 1 to a single element from Ai ), we call p a pure
Nash equilibrium.
Example 4.6 The "Prisoners' Dilemma" game has one pure Nash equilibrium (A, A).
"Rock, Paper, Scissors" has no pure Nash equilibrium, but a mixed equilibrium in which both players
choose R, P, S with probability 1/3 each. (Easy to check: If one player randomizes uniformly, every
mixed strategy performs equally well for the other player, so deviating does not increase the payoff.)
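The parenthetical claim can also be checked numerically; a small sketch (illustration only, assuming numpy) showing that against a uniform opponent every pure strategy in Rock, Paper, Scissors earns payoff 0:

```python
import numpy as np

# Player 1's payoff matrix for Rock, Paper, Scissors.
R = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
uniform = np.full(3, 1/3)

# Payoff of each pure row strategy against a uniformly mixing column player:
print(R @ uniform)         # [0. 0. 0.]
# Payoff of each pure column strategy (u2 = -u1) against a uniformly mixing row player:
print(uniform @ (-R))      # [0. 0. 0.]
# Every response earns 0, so neither player can gain by deviating from uniform play.
```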
Example 4.7 (Bach or Mozart)
Two players want to decide whether to go to a Bach concert or a Mozart concert. They want to go
together, but prefer different alternatives:
          B         M
  B    (2, 1)    (0, 0)
  M    (0, 0)    (1, 2)
There are 2 pure Nash equilibria:
• PureE1: (B, B)
• PureE2: (M, M)
There is also a third mixed Nash equilibrium MixedE:
• row player: Pr(B) = 2/3, Pr(M) = 1/3
• column player: Pr(B) = 1/3, Pr(M) = 2/3
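One way to verify these probabilities (not spelled out in the notes) is via the indifference conditions: in the mixed equilibrium each player must make the other player indifferent between B and M:
$$\begin{aligned}
\text{column player indifferent:}\quad & 1 \cdot \Pr_{\text{row}}(B) = 2 \cdot \Pr_{\text{row}}(M) = 2\,\big(1 - \Pr_{\text{row}}(B)\big) &&\Rightarrow\; \Pr_{\text{row}}(B) = \tfrac{2}{3},\\
\text{row player indifferent:}\quad & 2 \cdot \Pr_{\text{col}}(B) = 1 \cdot \Pr_{\text{col}}(M) = 1 - \Pr_{\text{col}}(B) &&\Rightarrow\; \Pr_{\text{col}}(B) = \tfrac{1}{3}.
\end{aligned}$$
The resulting expected payoffs are 2 · 1/3 = 2/3 for the row player and 1 · 2/3 = 2/3 for the column player, which is exactly the payoff pair listed for MixedE below.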
Some critiques of the Nash equilibrium concept:
1. The idea of a Nash equilibrium is that player 1 plays her side of the equilibrium, because she
believes that player 2 plays her side of the equilibrium, because she thinks that player 1 plays
her side of the equilibrium, because ... But if there are multiple equilibria, why should we
believe that players will be able to coordinate their beliefs?
2. Look at the payoffs in "Bach or Mozart":
• PureE1: (2, 1)
• PureE2: (1, 2)
• MixedE: (2/3, 2/3)
So different equilibria result in different payoffs. If we can’t predict which Nash equilibrium
will be reached, we also can’t predict the payoffs.
In this lecture we will address these critiques, showing that players arrive at an equilibrium by
playing a game repeatedly and using learning rules to adapt to their opponent's behavior.
Definition 4.8 A two–player zero–sum game is one in which I = {1, 2} and u2 (a1 , a2 ) = −u1 (a1 , a2 )
for all pure strategy profiles (a1 , a2 ).
"Rock, Paper, Scissors" is an example of a zero–sum game (actually, even a win/lose/draw–game with
payoffs in {−1, 0, 1}).
4.1 Von Neumann's Minimax Theorem
Theorem 4.9 Let a two-player zero-sum game $G = (I, (A_i), (u_i))$ be given. Define
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \quad\text{and}\quad v_1^{\max} = \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q).$$
It holds that $v_1^{\min} = v_1^{\max}$. We call this value V the game value of G.
Intuitively, Theorem 4.9 says the following: The best payoff player 1 can guarantee for herself, if
she has to pick a strategy first and player 2 is then allowed to play a best response ($v_1^{\min}$), is equal
to the minimum payoff she can achieve if player 2 has to go first, and player 1 is allowed to respond
optimally ($v_1^{\max}$).
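Both values can be computed by linear programming, because for a fixed mixed strategy of one player the inner optimum is always attained at a pure strategy of the other. The following Python sketch (illustration only; it assumes numpy and scipy are available, and the LP formulation is the standard one, not taken from the notes) computes both values for Rock, Paper, Scissors:

```python
import numpy as np
from scipy.optimize import linprog

def maximin_value(M):
    """Compute max_{p in Delta(rows)} min_{q in Delta(cols)} p^T M q for the row player of M.
    Since the inner minimum is attained at a pure column strategy, this is the LP
        maximize v  s.t.  sum_i p_i * M[i, j] >= v for every column j,  p a distribution.
    Variables are (p_1, ..., p_m, v); linprog minimizes, so we minimize -v."""
    m, n = M.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # minimize -v, i.e. maximize v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - sum_i p_i M[i, j] <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.zeros((1, m + 1))
    A_eq[0, :m] = 1.0                              # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]         # p_i in [0, 1], v unbounded
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Rock, Paper, Scissors: both values are 0 and the optimal strategies are uniform.
R = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])
v1_min, p_opt = maximin_value(R)       # player 1 commits to p first
v2_min, q_opt = maximin_value(-R.T)    # player 2 commits to q first (player 2's own maximin)
v1_max = -v2_min                       # as noted in the proof below, v1_max = -v2_min
print(v1_min, v1_max)                  # both approximately 0
```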
This has some very nice consequences.
Corollary 4.10 Let G be a two-player zero-sum game with game value V . In every mixed Nash
equilibrium (p∗ , q ∗ ), we have u1 (p∗ , q ∗ ) = V .
Proof: Since $p^*$ is a best response to $q^*$,
$$u_1(p^*, q^*) = \max_{p \in \Delta(A_1)} u_1(p, q^*) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max} = V.$$
Similarly, since $q^*$ is a best response to $p^*$, $u_2(p^*, q^*) \ge -V$. So $u_1(p^*, q^*) = -u_2(p^*, q^*) \le V$.
Corollary 4.11 Let G be a two-player zero-sum game. A mixed strategy profile (p∗ , q ∗ ) is a mixed
Nash equilibrium, if and only if
$$p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \quad\text{and}\quad q^* \in \operatorname*{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q).$$
In particular, the set of mixed Nash equilibria is non-empty.
Proof: "⇒": Let $(p^*, q^*)$ be a Nash equilibrium. By Corollary 4.10, $u_1(p^*, q^*) = V$. Since $q^*$ is a
best response to $p^*$,
$$\min_{q \in \Delta(A_2)} u_1(p^*, q) = u_1(p^*, q^*) = V = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q).$$
Thus, $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$. By symmetry, the same argument applies to player
2, as well.
"⇐": Let a strategy profile $(p^*, q^*)$ with $p^*$ and $q^*$ taken from the respective argmax- and argmin-sets above be given.
Since $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q)$, we have $u_1(p^*, q^*) \ge V$. On the other hand, for any
$p \in \Delta(A_1)$,
$$u_1(p, q^*) \le \max_{p \in \Delta(A_1)} u_1(p, q^*) = \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = V,$$
where the last equality follows because $q^* \in \operatorname*{argmin}_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q)$. So player 1 has no
incentive to defect to a different strategy. Again, by symmetry, the same argument applies to
player 2.
Remark Note that Corollary 4.11 resolves one of the critiques of the Nash equilibrium concept.
Players don’t have to coordinate their actions in order to reach an equilibrium, but can pick
strategies from their respective argmax-sets independently.
Recall the MaxHedge algorithm from Homework Assignment 4. We showed:
Theorem 4.12 Algorithm MaxHedge (with n experts, rewards in [0, 1]) has regret $O\big(\sqrt{T \ln(n)}\big)$
against the class of adaptive adversaries. In particular, for any sequence of reward functions $r^t :
[n] \to [0, 1]$, the sequence of experts $x_1, \ldots, x_T$ selected by the algorithm satisfies
$$\mathbb{E}\left[\sum_{t=1}^{T} r^t(x_t)\right] \ge \mathbb{E}\left[\max_{p \in \Delta([n])} \sum_{t=1}^{T} \sum_{x=1}^{n} p_x\, r^t(x)\right] - O\big(\sqrt{T \ln(n)}\big).$$
In Theorem 4.12 above we compare the algorithm to the best mixture of experts rather than to the
best single expert. However, this bound follows immediately from the single-expert regret guarantee
by observing that the maximum on the right hand side is always achieved by a single best expert.
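MaxHedge itself is not reproduced here; the following is a minimal multiplicative-weights sketch in the same spirit (the learning rate and all names are my own choices and need not match the homework algorithm):

```python
import numpy as np

def max_hedge(reward_fns, n, eta, seed=0):
    """A minimal multiplicative-weights learner for rewards in [0, 1].

    reward_fns: a list of T functions; reward_fns[t](x) is the reward of expert x at time t.
    Returns the sequence of experts x_1, ..., x_T chosen by the learner."""
    rng = np.random.default_rng(seed)
    weights = np.ones(n)
    chosen = []
    for r_t in reward_fns:
        p = weights / weights.sum()                       # current distribution over experts
        chosen.append(rng.choice(n, p=p))                 # sample the expert to play
        rewards = np.array([r_t(x) for x in range(n)])    # full-information feedback
        weights *= np.exp(eta * rewards)                  # multiplicative (maximizing) update
    return chosen

# With eta on the order of sqrt(ln(n) / T), updates of this kind achieve regret
# O(sqrt(T ln n)), matching the bound quoted in Theorem 4.12.
```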
Proof of Theorem 4.9: We start with the easy direction and prove that $v_1^{\min} \le v_1^{\max}$:
For any strategy profile $(\hat{p}, \hat{q})$,
$$u_1(\hat{p}, \hat{q}) \le \max_{p \in \Delta(A_1)} u_1(p, \hat{q}).$$
Thus,
$$\min_{q \in \Delta(A_2)} u_1(\hat{p}, q) \le \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q)$$
and, taking the maximum of both sides,
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) \le \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}.$$
To prove the other direction v1max ≤ v1min , we will use the existence of expert learning algorithms
with vanishing per-time-step regret (as, e.g., Hedge).
We assume w.l.o.g. that u1 (p, q) ∈ [0, 1] (and, thus, u2 (p, q) ∈ [−1, 0]) for all p ∈ ∆ (A1 ) , q ∈
∆ (A2 ). This can always be achieved by applying an appropriate linear transformation to the game
matrix.
Note that
$$v_1^{\min} = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} u_1(p, q) = \max_{p \in \Delta(A_1)} \min_{q \in \Delta(A_2)} \big(-u_2(p, q)\big) = -\min_{p \in \Delta(A_1)} \max_{q \in \Delta(A_2)} u_2(p, q) = -v_2^{\max}.$$
Similarly, $v_1^{\max} = -v_2^{\min}$.
Now assume that for T steps, both players use an expert learning algorithm to determine their
strategy. Formally, let $n = \max\{|A_1|, |A_2|\}$. Player 1 applies the algorithm with one expert for
each $a \in A_1$. If player 2 plays strategy b at time t, the reward of expert a is $r^t(a) = u_1(a, b)$. Player
2 applies the algorithm with an expert for each $b \in A_2$ and reward functions $r^t(b) = u_2(a, b) + 1$,
where a is the strategy played by player 1 at time t (making sure rewards are in [0, 1]).
Let $a_1, \ldots, a_T$ and $b_1, \ldots, b_T$ be the strategies selected by the two players and assume that both
algorithms have regret $O\big(\sqrt{T \ln(n)}\big)$.
Define mixed strategies
$$p = \frac{1}{T} \sum_{t=1}^{T} a_t \quad\text{and}\quad q = \frac{1}{T} \sum_{t=1}^{T} b_t.$$
(The above is somewhat sloppy notation. We associate each $a_t$ with the vector $\vec{a}_t \in \Delta(A_1)$ that
assigns probability 1 to strategy $a_t \in A_1$.) Intuitively, strategies p and q mix pure strategies
proportional to the frequency with which they have been played.
Let $p^* \in \operatorname*{argmax}_{p \in \Delta(A_1)} u_1(p, q)$ and $q^* \in \operatorname*{argmax}_{q \in \Delta(A_2)} u_2(p, q)$ be best responses to q and p, respectively. Note that
$$u_1(p^*, q) = \max_{p \in \Delta(A_1)} u_1(p, q) \ge \min_{q \in \Delta(A_2)} \max_{p \in \Delta(A_1)} u_1(p, q) = v_1^{\max}.$$
Analogously, $u_2(p, q^*) \ge v_2^{\max}$. By our regret bound,
$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(a_t, b_t)\right] \ge \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(p^*, b_t)\right] - O\big(\sqrt{\ln(n)/T}\big) = \mathbb{E}\big[u_1(p^*, q)\big] - O\big(\sqrt{\ln(n)/T}\big).$$
Finally, combining the above,
$$\begin{aligned}
v_1^{\max} - O\big(\sqrt{\ln(n)/T}\big) &\le \mathbb{E}\big[u_1(p^*, q)\big] - O\big(\sqrt{\ln(n)/T}\big) \\
&\le \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_1(a_t, b_t)\right] \\
&= -\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} u_2(a_t, b_t)\right] \\
&\le -\mathbb{E}\big[u_2(p, q^*)\big] + O\big(\sqrt{\ln(n)/T}\big) \\
&\le -v_2^{\max} + O\big(\sqrt{\ln(n)/T}\big) \\
&= v_1^{\min} + O\big(\sqrt{\ln(n)/T}\big).
\end{aligned}$$
Now the claim follows for $T \to \infty$ by compactness of $\Delta(A_1), \Delta(A_2)$.
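The construction used in the proof can be simulated directly. In the following sketch (illustration only, assuming numpy) both players run a multiplicative-weights update on Rock, Paper, Scissors with payoffs shifted into [0, 1]; the empirical mixed strategies p and q approach the uniform equilibrium:

```python
import numpy as np

# Player 1's payoffs for Rock, Paper, Scissors, shifted into [0, 1] as in the proof
# (the original payoffs are in {-1, 0, 1}).
R = (np.array([[ 0., -1.,  1.],
               [ 1.,  0., -1.],
               [-1.,  1.,  0.]]) + 1) / 2

T, n = 20000, 3
eta = np.sqrt(np.log(n) / T)                 # learning rate for the O(sqrt(T ln n)) regret bound
rng = np.random.default_rng(0)
w1, w2 = np.ones(n), np.ones(n)              # weights of the two learners
counts1, counts2 = np.zeros(n), np.zeros(n)  # how often each pure strategy was played

for _ in range(T):
    a = rng.choice(n, p=w1 / w1.sum())       # pure strategy of player 1 at time t
    b = rng.choice(n, p=w2 / w2.sum())       # pure strategy of player 2 at time t
    counts1[a] += 1
    counts2[b] += 1
    w1 *= np.exp(eta * R[:, b])              # player 1's rewards: u1(., b), already in [0, 1]
    w2 *= np.exp(eta * (1 - R[a, :]))        # player 2's rewards: u2(a, .) shifted into [0, 1]

p, q = counts1 / T, counts2 / T              # empirical mixed strategies, as in the proof
print(p, q)                                  # both roughly [1/3, 1/3, 1/3]; the game value is 0
```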
Remark By the last set of inequalities in the proof of Theorem 4.9, running the regret-minimizing
algorithms for $\Omega(\ln(n)/(\delta/2)^2)$ steps yields a strategy profile (p, q) such that
$$\mathbb{E}\left[\max_{p \in \Delta(A_1)} u_1(p, q)\right] = \mathbb{E}[u_1(p^*, q)] \le V + \delta/2.$$
Consequently, $\mathbb{E}[u_1(p, q)] \le V + \delta/2$ and, since the game is zero sum, $\mathbb{E}[u_2(p, q)] \ge -V - \delta/2$.
Similarly, for player 2 we have that
$$\mathbb{E}\left[\max_{q \in \Delta(A_2)} u_2(p, q)\right] = \mathbb{E}[u_2(p, q^*)] \le -V + \delta/2.$$
Thus, $\mathbb{E}[u_2(p, q)] \le -V + \delta/2$ and $\mathbb{E}[u_1(p, q)] \ge V - \delta/2$.
Such a strategy profile, in which no player can gain more than δ by deviating unilaterally, is called
a δ-approximate Nash equilibrium.
Since we only run the algorithms for O(log n) steps, the mixed strategies p and q assign positive
probability to at most O(log n) pure strategies. We conclude that every two-player zero-sum game
possesses δ-approximate Nash equilibria with small support of size O(log n).
Remark Using Markov's inequality we can turn the above existence result into a constructive
procedure. Running the algorithms for $\Omega(\ln(n)/(\delta/2c)^2)$ steps for any c > 0, we obtain a mixed
strategy profile (p, q) with
$$\operatorname{Prob}\left[\max_{p \in \Delta(A_1)} u_1(p, q) \ge V + \frac{\delta}{2}\right] \le \frac{\delta/(2c)}{\delta/2} = \frac{1}{c}$$
and support of size O(log n). Running the procedure repeatedly yields the desired δ-approximate
Nash equilibrium with probability exponentially close to 1.
Remark Assume that a zero-sum game is played repeatedly. By using a regret-minimizing algorithm like Hedge, player 1 comes close to the best possible payoff against whatever distribution of
strategies player 2 happens to use. It is not necessary to assume that player 2 is acting rationally.
4.2 Yao's Minimax Principle
Consider a computational problem with
• I - a finite set of possible inputs and
• A - a finite set of possible algorithms.
For all i ∈ I, a ∈ A denote by t(i, a) the running time of algorithm a on input i. We can think
of this as a game, in which player 1 gets to select the input, while player 2 is allowed to pick an
algorithm.
Similar to the payoff functions in the previous sections, we can generalize t to distributions p ∈ ∆(I),
q ∈ ∆(A) as
$$t(p, q) = \sum_{i \in I} \sum_{a \in A} t(i, a)\, p(i)\, q(a).$$
In words, t(p, q) denotes the expected running time of an algorithm from distribution q on an input
from distribution p.
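In matrix form, with the running times stored in a matrix T, the quantity t(p, q) is again a bilinear form; a tiny sketch with made-up numbers (illustration only):

```python
import numpy as np

# Hypothetical running times t(i, a): rows = inputs, columns = deterministic algorithms.
T = np.array([[3.0, 5.0],
              [6.0, 2.0]])
p = np.array([0.5, 0.5])      # distribution over inputs
q = np.array([0.25, 0.75])    # randomized algorithm = distribution over deterministic ones
print(p @ T @ q)              # expected running time t(p, q) = 3.75
```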
Theorem 4.13 In the setting described above, it holds that
$$\max_{p \in \Delta(I)} \min_{a \in A} t(p, a) = \min_{q \in \Delta(A)} \max_{i \in I} t(i, q).$$
Proof: This follows immediately from von Neumann’s Minimax Theorem 4.9 and the observation
that for any mixed strategy p, q of one of the players, the other player has a pure strategy that
constitutes a best response.
Yao’s Minimax Principle states that, for finite sets of algorithms and inputs, the best worst-case
running time achievable by any randomized algorithm (the right hand side), is equal to the best
running time obtainable by a deterministic algorithm on a worst-case distribution of inputs.
This is very helpful in proving lower bounds on the performance guarantee of randomized algorithms (e.g., in online computation). If one can construct a distribution on inputs, on which no
deterministic algorithm performs well in expectation (which is often much easier than arguing about
randomized algorithms), then it follows that there must exist an instance on which no randomized
algorithm performs well.
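The lower-bound direction can be illustrated numerically: for any fixed input distribution p, the expected running time min_a t(p, a) of the best deterministic algorithm lower-bounds the worst-case expected running time max_i t(i, q) of every randomized algorithm q. A small sketch with hypothetical running times (illustration only, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
T = rng.uniform(1.0, 10.0, size=(4, 5))   # hypothetical running times: 4 inputs x 5 algorithms

p = np.array([0.4, 0.3, 0.2, 0.1])        # a (supposedly hard) distribution over inputs
lower_bound = (p @ T).min()               # best deterministic algorithm against p

for _ in range(1000):                     # try many randomized algorithms q
    q = rng.dirichlet(np.ones(5))
    worst_case = (T @ q).max()            # worst input against the randomized algorithm q
    assert worst_case >= lower_bound - 1e-9
print("every randomized algorithm tried has worst-case expected time >=", round(lower_bound, 3))
```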