Planning, Learning, Prediction, and Games                01/15/10 - 01/29/10
Lecturer: Patrick Briest                                 Scribe: Philipp Brandes

5 Learning in Non-Zero-Sum Games

The property that players using no-regret learning algorithms converge to the set of Nash equilibria is specific to the class of zero-sum games. So do Nash equilibria even exist in general? This question is answered by Nash's famous theorem:

Theorem 5.1  For every normal-form game $G = (I, (A_i), (u_i))$ with $|I|, |A_i| < \infty$ for all $i \in I$, there exists a mixed Nash equilibrium.

We will not prove this. Instead, we will see that in non-zero-sum games, players using (a slightly stronger form of) no-regret learning algorithms converge to a different kind of equilibrium.

5.1 Correlated Equilibria

In a Nash equilibrium, each player samples a strategy according to her mixed strategy independently. Correlated equilibria allow correlation among different players' strategies via a trusted sampling device. Here's an example. In the Traffic Light Game, two cars approach an intersection from perpendicular directions. Both have the option to stop or to go. Let's define their payoffs as follows:

                Stop     Go
    Stop       (4,4)   (1,5)
    Go         (5,1)   (0,0)

There are 3 Nash equilibria, written here as joint distributions on strategy profiles: the two pure equilibria (Stop, Go) and (Go, Stop), and the mixed equilibrium in which both players stop with probability 1/2:

            Stop  Go            Stop  Go            Stop   Go
    Stop     0    1     Stop     0    0     Stop    1/4   1/4
    Go       0    0     Go       1    0     Go      1/4   1/4

None of these seems quite right: either one car never gets to go, or the cars get wrecked 25% of the time. We could use a traffic light to sample a pair of strategies. If both players believe that the recommendation they receive from the traffic light is a best response to the recommendation given to the other player, they have no incentive to deviate. This idea is captured in the following definition:

Definition 5.2  Let $G = (I, (A_i), (u_i))$ be a normal-form game with $|I|, |A_i| < \infty$ for all $i \in I$. A probability distribution $p$ on $\prod_{j \in I} A_j$ is a correlated equilibrium, if for all $i \in I$ and all $\alpha, \beta \in A_i$, we have

    $\mathbb{E}_{a \sim p}[u_i(\alpha, a_{-i}) \mid a_i = \alpha] \ge \mathbb{E}_{a \sim p}[u_i(\beta, a_{-i}) \mid a_i = \alpha]$.    (1)

By definition of the conditional expectation, the above says

    $\sum_{a_{-i}} \Pr(a_{-i} \mid a_i = \alpha) \cdot u_i(\alpha, a_{-i}) \;\ge\; \sum_{a_{-i}} \Pr(a_{-i} \mid a_i = \alpha) \cdot u_i(\beta, a_{-i})$,

where $\Pr(a_{-i} \mid a_i = \alpha) = \Pr((\alpha, a_{-i})) / \Pr(a_i = \alpha)$. By multiplying both sides with $\Pr(a_i = \alpha)$ we obtain the equivalent formulation

    $\sum_{a_{-i}} (u_i(\alpha, a_{-i}) - u_i(\beta, a_{-i})) \cdot \Pr((\alpha, a_{-i})) \ge 0$.    (2)

Remark  Note that every Nash equilibrium is also a correlated equilibrium. A probability distribution is a correlated equilibrium, if it satisfies the linear constraints (2) for all $i \in I$ and $\alpha, \beta \in A_i$. Since the requirement that $p$ is a probability distribution on $\prod_{i \in I} A_i$ can also be expressed by linear constraints, it follows that the set of all correlated equilibria is convex. This is in contrast to the set of all Nash equilibria, which may be a collection of isolated points.

Some correlated equilibria in the Traffic Light Game are as follows:

             Stop    Go                 Stop    Go
    Stop      0     1/2        Stop     1/3    1/3
    Go       1/2     0         Go       1/3     0

In the second correlated equilibrium: if player 1 gets the recommendation to stop, his expected payoff is $\frac{1}{2} \cdot 4 + \frac{1}{2} \cdot 1 = 2.5$. If he deviates from the recommendation and goes, his expected payoff is $\frac{1}{2} \cdot 5 + \frac{1}{2} \cdot 0 = 2.5$. Thus, he has no incentive to deviate.

5.2 Internal Regret

Definition 5.3  Let $A$ be a set of actions, $a_1, \dots, a_T \in A$ a sequence of choices from $A$ and $r_1, \dots, r_T : A \to [0,1]$ a sequence of reward functions. For a function $f : A \to A$, define

    $\hat{R}_f(\vec{a}, \vec{r}, T) = \frac{1}{T} \sum_{t=1}^{T} \big( r_t(f(a_t)) - r_t(a_t) \big)$,

where $\vec{a} = (a_1, \dots, a_T)$ and $\vec{r} = (r_1, \dots, r_T)$. We define the internal regret of the sequence of actions $\vec{a}$ as

    $\hat{R}_{\mathrm{int}}(\vec{a}, \vec{r}, T) = \max_{f : A \to A} \hat{R}_f(\vec{a}, \vec{r}, T)$.

For two actions $a, b \in A$, we define the pairwise regret of the sequence $\vec{a}$ as

    $\hat{R}_{a,b}(\vec{a}, \vec{r}, T) = \frac{1}{T} \sum_{t=1}^{T} \big( r_t(b) - r_t(a) \big) \cdot \mathbb{1}[a_t = a]$,

where $\mathbb{1}[a_t = a] = 1$ if $a_t = a$, and $0$ otherwise.

Remark  In the definition above, think of $f$ as an advisor function. $\hat{R}_f(\vec{a}, \vec{r}, T)$ is the algorithm's average per-time-step regret for not following the advisor function $f$.
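The quantities of Definition 5.3 can be computed directly on a small instance; since $A$ is finite, the internal regret can even be found by brute force over all $|A|^{|A|}$ advisor functions. A minimal sketch in Python (the action set, play sequence, and reward functions below are made-up toy data, not from the lecture):

```python
from itertools import product

def advisor_regret(actions, rewards, f):
    """R_f: average gain from replacing each played action a_t by f(a_t)."""
    T = len(actions)
    return sum(r[f[a]] - r[a] for a, r in zip(actions, rewards)) / T

def internal_regret(actions, rewards, A):
    """R_int: maximum of R_f over all |A|^|A| advisor functions f: A -> A."""
    return max(advisor_regret(actions, rewards, dict(zip(A, images)))
               for images in product(A, repeat=len(A)))

def pairwise_regret(actions, rewards, a, b):
    """R_{a,b}: average gain from replacing every play of a by b."""
    T = len(actions)
    return sum(r[b] - r[a] for at, r in zip(actions, rewards) if at == a) / T

# Toy instance: two actions, reward functions r_t: A -> [0, 1] given as dicts.
A = [0, 1]
actions = [0, 0, 1, 0]
rewards = [{0: 0.2, 1: 1.0}, {0: 0.5, 1: 0.5},
           {0: 1.0, 1: 0.0}, {0: 0.0, 1: 0.9}]

r_int = internal_regret(actions, rewards, A)
max_pair = max(pairwise_regret(actions, rewards, a, b) for a in A for b in A)
print(r_int, max_pair)  # approx. 0.675 and 0.425
```

On this instance the best advisor function swaps the two actions, and its regret equals the sum of the two pairwise regrets, which lies between the maximal pairwise regret and $|A|$ times it.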
$\hat{R}_{\mathrm{int}}(\vec{a}, \vec{r}, T)$ is its regret relative to the best advisor function. Pairwise regret is a special case of internal regret: $\hat{R}_{a,b}(\vec{a}, \vec{r}, T) = \hat{R}_f(\vec{a}, \vec{r}, T)$, where $f(a) = b$ and $f(c) = c$ for all $c \in A$, $c \neq a$. Regret as we defined it in previous sections is also called external regret. Note that external regret is the regret relative to the best constant advisor function: in contrast to internal regret, external regret compares the algorithm's performance to a fixed (external) benchmark that is independent of the algorithm's own play.

Definition 5.4  A randomized algorithm has no internal regret, if for all adaptive adversaries given by reward functions $\vec{r} = (r_1, \dots, r_T)$, the algorithm outputs a sequence of actions $\vec{a} = (a_1, \dots, a_T)$, such that

    $\lim_{T \to \infty} \mathbb{E}\big[\hat{R}_{\mathrm{int}}(\vec{a}, \vec{r}, T)\big] = 0$.

It turns out that an algorithm has no internal regret, if and only if it has no pairwise regret.

Theorem 5.5  An algorithm has no internal regret, if and only if

    $\lim_{T \to \infty} \max_{a,b \in A} \mathbb{E}\big[\hat{R}_{a,b}(\vec{a}, \vec{r}, T)\big] = 0$.

Theorem 5.5 is an immediate consequence of the following lemma:

Lemma 5.6  It holds that

    $\max_{a,b \in A} \hat{R}_{a,b}(\vec{a}, \vec{r}, T) \;\le\; \hat{R}_{\mathrm{int}}(\vec{a}, \vec{r}, T) \;\le\; |A| \cdot \max_{a,b \in A} \hat{R}_{a,b}(\vec{a}, \vec{r}, T)$.

Proof: For the first inequality, fix any $a, b \in A$ and define $f : A \to A$ as $f(a) = b$ and $f(c) = c$ for all $c \neq a$. Then

    $\hat{R}_{a,b}(\vec{a}, \vec{r}, T) = \hat{R}_f(\vec{a}, \vec{r}, T) \le \max_{f : A \to A} \hat{R}_f(\vec{a}, \vec{r}, T) = \hat{R}_{\mathrm{int}}(\vec{a}, \vec{r}, T)$.

Taking the maximum over all $a, b \in A$ yields the claim.

For the second inequality, note that for any $f : A \to A$,

    $\hat{R}_f(\vec{a}, \vec{r}, T)
      = \frac{1}{T} \sum_{t=1}^{T} \big( r_t(f(a_t)) - r_t(a_t) \big)
      = \frac{1}{T} \sum_{t=1}^{T} \sum_{a \in A} \big( r_t(f(a)) - r_t(a) \big) \cdot \mathbb{1}[a_t = a]$
    $\phantom{\hat{R}_f(\vec{a}, \vec{r}, T)}
      = \sum_{a \in A} \frac{1}{T} \sum_{t=1}^{T} \big( r_t(f(a)) - r_t(a) \big) \cdot \mathbb{1}[a_t = a]
      = \sum_{a \in A} \hat{R}_{a,f(a)}(\vec{a}, \vec{r}, T)
      \le |A| \cdot \max_{a,b \in A} \hat{R}_{a,b}(\vec{a}, \vec{r}, T)$.

5.3 Internal Regret & Correlated Equilibria

Definition 5.7  For $S \subseteq \mathbb{R}^N$ and $x \in \mathbb{R}^N$, let

    $\mathrm{dist}(x, S) = \inf_{s \in S} \|x - s\|_2$.

We say that an infinite sequence $x_1, x_2, \dots \in \mathbb{R}^N$ converges to $S$, if $\lim_{n \to \infty} \mathrm{dist}(x_n, S) = 0$.

Theorem 5.8  Let $G$ be a normal-form game with a finite number $k$ of players and a finite number of strategies per player. Suppose the players play $G$ repeatedly and each player $i$ chooses her sequence of strategies $(a_i^t)_{t=1}^{\infty}$ by applying a no-internal-regret algorithm with rewards $r_i^t(a) = u_i(a, a_{-i}^t)$. Let $C$ be the set of correlated equilibria of $G$ and $p(T)$ the uniform distribution on the multi-set $\{(a_1^t, \dots, a_k^t) \mid 1 \le t \le T\}$ of strategy profiles. Then the sequence $(p(T))_{T=1}^{\infty}$ converges to $C$.

Proof: By contradiction. If the sequence does not converge to $C$, then for some $\delta > 0$ there exists an infinite subsequence of distributions that have distance at least $\delta$ from $C$. Since $C$ and the space of all probability distributions on $\prod_{i \in I} A_i$ are compact (bounded and closed), so is the space of probability distributions at distance $\delta$ or greater from $C$. Thus, there is an infinite subsequence that converges to some distribution $p$ with $\mathrm{dist}(p, C) \ge \delta$. Denote this subsequence as $p(T_1), p(T_2), p(T_3), \dots$

Since $p \notin C$, there exist a player $i$, two strategies $\alpha, \beta \in A_i$ and $\varepsilon > 0$, such that

    $\sum_{a_{-i}} \big( u_i(\beta, a_{-i}) - u_i(\alpha, a_{-i}) \big) \cdot p(\alpha, a_{-i}) = \varepsilon$.

Since $p$ is the limit point of the sequence $p(T_s)$, for any sufficiently large $s$, it holds that

    $\sum_{a_{-i}} \big( u_i(\beta, a_{-i}) - u_i(\alpha, a_{-i}) \big) \cdot p(T_s)(\alpha, a_{-i}) \ge \frac{\varepsilon}{2}$.

Recall that $p(T_s)(a)$ is defined as the number of times $a$ was played in the first $T_s$ rounds, divided by $T_s$. Thus,

    $\frac{1}{T_s} \sum_{t=1}^{T_s} \sum_{a_{-i}} \big( u_i(\beta, a_{-i}) - u_i(\alpha, a_{-i}) \big) \cdot \mathbb{1}[a^t = (\alpha, a_{-i})] \ge \frac{\varepsilon}{2}$.

Note that $\mathbb{1}[a^t = (\alpha, a_{-i})] = \mathbb{1}[a_{-i}^t = a_{-i}] \cdot \mathbb{1}[a_i^t = \alpha]$. Changing the order of summation, we obtain the following:

    $\frac{\varepsilon}{2}
      \le \frac{1}{T_s} \sum_{t=1}^{T_s} \sum_{a_{-i}} \big( u_i(\beta, a_{-i}) - u_i(\alpha, a_{-i}) \big) \cdot \mathbb{1}[a_{-i}^t = a_{-i}] \cdot \mathbb{1}[a_i^t = \alpha]$
    $\phantom{\frac{\varepsilon}{2}}
      = \frac{1}{T_s} \sum_{t=1}^{T_s} \big( u_i(\beta, a_{-i}^t) - u_i(\alpha, a_{-i}^t) \big) \cdot \mathbb{1}[a_i^t = \alpha]
      = \hat{R}_{\alpha,\beta}\big( (a_i^1, \dots, a_i^{T_s}), (r_i^1, \dots, r_i^{T_s}), T_s \big)$.

This contradicts the fact that player $i$ is using a no-internal-regret algorithm.
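Membership in the set $C$ can be tested directly via the linear constraints (2). A small sketch for the Traffic Light Game of Section 5.1 (the encoding of actions as 0 = Stop, 1 = Go is ours); it also confirms the earlier remark that a Nash equilibrium, viewed as a product distribution, is a correlated equilibrium:

```python
from itertools import product

# Traffic Light Game payoffs, action 0 = Stop and 1 = Go:
# U[i][a1][a2] is player i's payoff under the profile (a1, a2).
U = {1: [[4, 1], [5, 0]],
     2: [[4, 5], [1, 0]]}

def is_correlated_eq(p, eps=1e-9):
    """Check constraints (2): for every player i and every pair of actions
    (alpha, beta), deviating from recommendation alpha to beta must not help."""
    for i, alpha, beta in product((1, 2), (0, 1), (0, 1)):
        gain = 0.0
        for a_other in (0, 1):  # the other player's recommendation
            if i == 1:
                rec, dev = (alpha, a_other), (beta, a_other)
            else:
                rec, dev = (a_other, alpha), (a_other, beta)
            gain += (U[i][rec[0]][rec[1]] - U[i][dev[0]][dev[1]]) * p[rec]
        if gain < -eps:
            return False
    return True

# The traffic-light distribution and the mixed Nash equilibrium are
# correlated equilibria; putting all mass on (Go, Go) is not.
fair_light = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0}
mixed_nash = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
crash      = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}

print(is_correlated_eq(fair_light), is_correlated_eq(mixed_nash),
      is_correlated_eq(crash))  # True True False
```

Since the constraints are linear in $p$, the same check extends to testing, or optimizing over, correlated equilibria of any finite game via linear programming.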
5.4 A No-Internal-Regret Algorithm

Let $A_1, \dots, A_n$ be no-external-regret algorithms. So, if $R_i(n, T)$ denotes the expected external per-time-step regret of algorithm $A_i$ on an instance with $n$ experts of length $T$, then $\lim_{T \to \infty} R_i(n, T) = 0$. We combine $A_1, \dots, A_n$ into a new algorithm as follows:

Algorithm 1: NoIntRegret
• Initialize independent no-external-regret algorithms $A_1, \dots, A_n$.
• At time $t$:
  • Let $q_i^t = (q_{i,1}^t, \dots, q_{i,n}^t)$ be the distribution according to which algorithm $A_i$ samples its expert at this time. Define the matrix $Q^t$ whose $i$-th row is $q_i^t$.
  • Find a distribution $p^t = (p_1^t, \dots, p_n^t)$ with $p^t = p^t \cdot Q^t$.
  • Sample an expert according to $p^t$ and observe the reward vector $r^t = (r_1^t, \dots, r_n^t)$.
  • Report the reward vector $p_i^t \cdot r^t$ to each algorithm $A_i$.

Remark: Note that $p^t$ in the algorithm is well-defined, since we can view it as a stationary distribution of the Markov chain induced by the matrix $Q^t$, which is known to exist.

The idea of the algorithm can be described as follows: for every (advisor) function $f$, we want to use algorithm $A_i$ to ensure low pairwise regret of the $i \to f(i)$ variety. This works because we can choose $p^t$ such that $p_i^t$ can be viewed both as the probability of choosing expert $i$ at time $t$ and as the probability of choosing algorithm $A_i$ and then following its advice, for which reason we can report the reward vector $p_i^t \cdot r^t$, the expected reward due to $A_i$, to each of the algorithms.

Theorem 5.9  Let $A_1, \dots, A_n$ have expected per-time-step external regret at most $R(n, T)$ against the class of adaptive adversaries. Then algorithm NoIntRegret has internal regret at most $n \cdot R(n, T)$ against the same class of adversaries.

Proof: By our assumption on $A_1, \dots, A_n$, we have

    $\mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} p_i^t \cdot \langle q_i^t, r^t \rangle \Big] \;\ge\; \mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} p_i^t \, r_j^t \Big] - R(n, T)$    (3)

for all $1 \le i, j \le n$, since every $A_i$ has regret at most $R(n, T)$ relative to each expert $j$. The sum of expected rewards of $A_1, \dots, A_n$ at time $t$ is

    $\mathbb{E}\Big[ \sum_{i=1}^{n} p_i^t \cdot \langle q_i^t, r^t \rangle \Big] = \mathbb{E}\big[ p^t Q^t (r^t)^{\top} \big] = \mathbb{E}\big[ p^t (r^t)^{\top} \big]$,

since $p^t Q^t = p^t$. Let $f : \{1, \dots, n\} \to \{1, \dots, n\}$ be an arbitrary function. Summing (3) over all $i$ and choosing for each right-hand side $j = f(i)$, we obtain

    $\mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} p^t (r^t)^{\top} \Big] \;\ge\; \mathbb{E}\Big[ \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{n} p_i^t \, r_{f(i)}^t \Big] - n \cdot R(n, T)$.

Taking the maximum over all functions $f$ yields the claim.

Corollary 5.10  There exists an algorithm with expected internal per-time-step regret $O\big( n \cdot \sqrt{\ln(n)/T} \big)$ against the class of (randomized) adaptive adversaries.

Remark: It is possible to improve the bound in Corollary 5.10 to $O\big( \sqrt{n \ln(n)/T} \big)$ (see Problem Set 10).
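The fixed-point step $p^t = p^t \cdot Q^t$ is the only nontrivial computation in NoIntRegret. A minimal sketch of that single step in Python (the matrix $Q$ below is a made-up example of two experts' distributions, and plain power iteration is assumed to converge, which holds for this $Q$):

```python
def stationary(Q, iters=1000):
    """Return a row vector p with p = p * Q, found by power iteration.
    Q is row-stochastic: row i is the expert distribution q_i of algorithm A_i."""
    n = len(Q)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return p

# Example: two algorithms currently sampling their expert according to
# q_1 = (0.9, 0.1) and q_2 = (0.4, 0.6).
Q = [[0.9, 0.1],
     [0.4, 0.6]]
p = stationary(Q)  # approx. (0.8, 0.2), since 0.8 = 0.9*0.8 + 0.4*0.2

# Each algorithm A_i is then fed the scaled reward vector p_i * r^t.
r = [1.0, 0.5]
report_to_A1 = [p[0] * x for x in r]
print(p, report_to_A1)
```

For a general row-stochastic $Q$ a stationary distribution always exists, but plain power iteration can oscillate (e.g. on periodic chains); averaging the iterates, or solving the linear system $p(Q - I) = 0$ together with $\sum_i p_i = 1$, is the robust choice.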