Planning, Learning, Prediction, and Games
Learning in Non-Zero-Sum Games: 01/15/10 - 01/29/10
Lecturer: Patrick Briest
Scribe: Philipp Brandes

5 Learning in Non-Zero-Sum Games
The property that players using no-regret learning algorithms converge to the set of Nash equilibria
is specific to the class of zero-sum games.
So do Nash equilibria even exist in general? This question is answered by Nash’s famous theorem:
Theorem 5.1 For every normal-form game G = (I, (A_i), (u_i)) with |I|, |A_i| < ∞ for all i ∈ I,
there exists a mixed Nash equilibrium.
We will not prove this. Instead, we will see that in non-zero-sum games, players using (a slightly
stronger form of) no-regret learning algorithms converge to a different kind of equilibrium.
5.1 Correlated Equilibria
In a Nash equilibrium, each player samples a strategy according to her mixed strategy independently. Correlated equilibria allow correlation among different players’ strategies via a trusted
sampling device.
Here’s an example. In the Traffic Light Game, two cars approach an intersection from perpendicular
directions. Both have the option to stop or to go. Their payoffs (payoff of player 1, payoff of
player 2), with player 1 as the row player, are defined as follows:

                 Stop     Go
        Stop    (4,4)   (1,5)
        Go      (5,1)   (0,0)
There are 3 Nash equilibria. Written as joint distributions over strategy profiles, they are:

                Stop   Go               Stop   Go               Stop   Go
        Stop      0     1       Stop      0     0       Stop    1/4   1/4
        Go        0     0       Go        1     0       Go      1/4   1/4

In the first, player 1 stops and player 2 goes with probability 1; in the second, the roles are
reversed; in the third, both players independently play Stop and Go with probability 1/2 each.
None of these seems quite right: either one car never gets to go, or the cars get wrecked 25% of the
time.
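
As a quick numerical sanity check, here is a small sketch in Python with NumPy (the variable names are ours) that verifies the indifference conditions behind the third, mixed equilibrium and the 25% crash probability:

    import numpy as np

    # Traffic Light Game payoffs; rows/columns are (Stop, Go), player 1 is the row player.
    U1 = np.array([[4.0, 1.0],
                   [5.0, 0.0]])   # player 1's payoffs
    U2 = np.array([[4.0, 5.0],
                   [1.0, 0.0]])   # player 2's payoffs

    p1 = np.array([0.5, 0.5])     # player 1's mixed strategy in the third equilibrium
    p2 = np.array([0.5, 0.5])     # player 2's mixed strategy

    # In a mixed Nash equilibrium every action played with positive probability must be
    # a best response, so each player must be indifferent between Stop and Go.
    print(U1 @ p2)                # [2.5 2.5] -> player 1 is indifferent
    print(p1 @ U2)                # [2.5 2.5] -> player 2 is indifferent

    # Probability that both cars go (and crash) under independent play:
    print(p1[1] * p2[1])          # 0.25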
We could use a traffic light to sample a pair of strategies. If both players believe that the recommendation they receive from the traffic light is a best response to the recommendation given to the other
player, they have no incentive to deviate.
This idea is captured in the following definition:
Definition 5.2 Let G = (I, (A_i), (u_i)) be a normal-form game with |I|, |A_i| < ∞ for all i ∈ I. A
probability distribution p on ∏_{j∈I} A_j is a correlated equilibrium, if for all i ∈ I and all α, β ∈ A_i,
we have

    E_{a∼p}[ u_i(α, a_{-i}) | a_i = α ]  ≥  E_{a∼p}[ u_i(β, a_{-i}) | a_i = α ].        (1)
By definition of the conditional expectation, the above says

    Σ_{a_{-i}} Pr(a_{-i} | a_i = α) · u_i(α, a_{-i})  ≥  Σ_{a_{-i}} Pr(a_{-i} | a_i = α) · u_i(β, a_{-i}),

where Pr(a_{-i} | a_i = α) = Pr((α, a_{-i})) / Pr(a_i = α).
By multiplying both sides with Pr(a_i = α) we obtain the equivalent formulation

    Σ_{a_{-i}} ( u_i(α, a_{-i}) − u_i(β, a_{-i}) ) · Pr((α, a_{-i}))  ≥  0.        (2)
Remark Note that every Nash equilibrium is also a correlated equilibrium.
A probability distribution is a correlated equilibrium, if it satisfies the linear constraints (2) for all
i ∈ I and α, β ∈ A_i. Since the requirement that p is a probability distribution on ∏_{i∈I} A_i can also be
expressed by linear constraints, it follows that the set of all correlated equilibria is convex. This is
in contrast to the set of all Nash equilibria, which may be a collection of isolated points.
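
Since (2) is a finite collection of linear constraints, checking whether a given joint distribution is a correlated equilibrium amounts to evaluating finitely many inequalities. Here is a minimal sketch in Python with NumPy (the function name and payoff encoding are ours); it can then be applied to the example distributions given next:

    import itertools
    import numpy as np

    def is_correlated_equilibrium(p, payoffs, tol=1e-9):
        """Check the linear constraints (2) for a joint distribution p over strategy
        profiles; p has one axis per player and payoffs[i][profile] is u_i(profile)."""
        action_sets = [range(s) for s in p.shape]
        for i in range(len(payoffs)):
            for alpha in action_sets[i]:
                for beta in action_sets[i]:
                    lhs = 0.0
                    # Sum over all profiles in which player i is recommended alpha.
                    for profile in itertools.product(*action_sets):
                        if profile[i] != alpha:
                            continue
                        deviated = profile[:i] + (beta,) + profile[i + 1:]
                        lhs += (payoffs[i][profile] - payoffs[i][deviated]) * p[profile]
                    if lhs < -tol:     # constraint (2) violated for (i, alpha, beta)
                        return False
        return True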
Some correlated equilibria in the Traffic Light Game are as follows:

                Stop    Go               Stop    Go
        Stop      0    1/2       Stop    1/3   1/3
        Go       1/2    0        Go      1/3    0
In the second correlated equilibrium: If player 1 gets the recommendation to stop, his expected
value is 1/2 · 4 + 1/2 · 1 = 2.5. If he deviates from the recommendation and goes, his expected value is
1/2 · 5 + 1/2 · 0 = 2.5. Thus, he has no incentive to deviate.
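
Using the is_correlated_equilibrium sketch from above (same hypothetical payoff encoding as in the earlier snippet), both example distributions pass the check, while a point mass on (Stop, Stop), say, does not:

    U1 = np.array([[4.0, 1.0], [5.0, 0.0]])            # player 1's payoffs (row player)
    U2 = np.array([[4.0, 5.0], [1.0, 0.0]])            # player 2's payoffs
    payoffs = [U1, U2]

    first_example  = np.array([[0.0, 0.5], [0.5, 0.0]])
    second_example = np.array([[1/3, 1/3], [1/3, 0.0]])
    always_stop    = np.array([[1.0, 0.0], [0.0, 0.0]])

    print(is_correlated_equilibrium(first_example, payoffs))   # True
    print(is_correlated_equilibrium(second_example, payoffs))  # True
    print(is_correlated_equilibrium(always_stop, payoffs))     # False: Go is the better response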
5.2 Internal Regret
Definition 5.3 Let A be a set of actions, a_1, . . . , a_T ∈ A a sequence of choices from A and
r_1, . . . , r_T : A → [0, 1] a sequence of reward functions. For a function f : A → A, define

    R̂_f(~a, ~r, T) = (1/T) · Σ_{t=1}^{T} ( r_t(f(a_t)) − r_t(a_t) ),

where ~a = (a_1, . . . , a_T) and ~r = (r_1, . . . , r_T). We define the internal regret of the sequence of actions
~a as

    R̂_int(~a, ~r, T) = max_{f : A → A} R̂_f(~a, ~r, T).

For two actions a, b ∈ A, we define the pairwise regret of the sequence ~a as

    R̂_{a,b}(~a, ~r, T) = (1/T) · Σ_{t=1}^{T} ( r_t(b) − r_t(a) ) · 1[a_t = a],

where 1[a_t = a] equals 1 if a_t = a and 0 otherwise.
Remark In the definition above, think of f as an advisor function. R̂_f(~a, ~r, T) is the algorithm's
average per-time-step regret for not following the advisor function f. R̂_int(~a, ~r, T) is its regret
relative to the best advisor function.
Pairwise regret is a special case of internal regret: R̂_{a,b}(~a, ~r, T) = R̂_f(~a, ~r, T), where f(a) = b and
f(c) = c for all c ∈ A, c ≠ a.
Regret as we defined it in previous sections is also called external regret. Note that external regret
is the regret relative to the best constant advisor function. In contrast to internal regret, external
regret compares the algorithm’s performance to a fixed (external) benchmark that is independent
of the algorithm.
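
To make these definitions concrete, here is a small sketch in Python with NumPy (the helper names are ours) that computes external, pairwise, and internal regret for a given sequence of actions and a T × |A| array of rewards:

    import numpy as np

    def external_regret(actions, rewards):
        """Average regret against the best single (constant) action in hindsight."""
        T = len(actions)
        realized = rewards[np.arange(T), actions].sum()
        return (rewards.sum(axis=0).max() - realized) / T

    def pairwise_regret(actions, rewards, a, b):
        """R̂_{a,b}: average gain from replacing every play of a by b."""
        mask = (actions == a)
        return (rewards[mask, b] - rewards[mask, a]).sum() / len(actions)

    def internal_regret(actions, rewards):
        """R̂_int: regret against the best advisor function f : A -> A. The best f
        picks, for each action a separately, the most profitable replacement b, so
        the maximum over f decomposes into a sum of per-action maxima."""
        n = rewards.shape[1]
        return sum(max(pairwise_regret(actions, rewards, a, b) for b in range(n))
                   for a in range(n))

    # Example: 1000 rounds, 3 actions, random rewards, a player who always plays action 0.
    rng = np.random.default_rng(0)
    rewards = rng.uniform(size=(1000, 3))
    actions = np.zeros(1000, dtype=int)
    print(external_regret(actions, rewards), internal_regret(actions, rewards))

For a player who always plays the same action the two values coincide; in general, external regret never exceeds internal regret, because the best constant function is just one particular advisor function.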
Definition 5.4 A randomized algorithm has no internal regret, if for all adaptive adversaries given
by reward functions ~r = (r_1, . . . , r_T), the algorithm outputs a sequence of actions ~a = (a_1, . . . , a_T),
such that

    lim_{T→∞} E[ R̂_int(~a, ~r, T) ] = 0.
It turns out that an algorithm has no internal regret if and only if it has no pairwise regret.

Theorem 5.5 An algorithm has no internal regret if and only if

    lim_{T→∞} max_{a,b∈A} E[ R̂_{a,b}(~a, ~r, T) ] = 0.
Theorem 5.5 above is an immediate consequence of the following lemma:
Lemma 5.6 It holds that

    max_{a,b∈A} R̂_{a,b}(~a, ~r, T)  ≤  R̂_int(~a, ~r, T)  ≤  |A| · max_{a,b∈A} R̂_{a,b}(~a, ~r, T).
Proof: For the first inequality, fix any a, b ∈ A and define f : A → A as f(a) = b and f(c) = c for
all c ≠ a. Then

    R̂_{a,b}(~a, ~r, T) = R̂_f(~a, ~r, T) ≤ max_{f : A → A} R̂_f(~a, ~r, T) = R̂_int(~a, ~r, T).

Taking the maximum over all a, b ∈ A yields the claim.
For the second inequality, note that

    R̂_f(~a, ~r, T) = (1/T) · Σ_{t=1}^{T} ( r_t(f(a_t)) − r_t(a_t) )
                  = (1/T) · Σ_{t=1}^{T} Σ_{a∈A} ( r_t(f(a)) − r_t(a) ) · 1[a_t = a]
                  = Σ_{a∈A} (1/T) · Σ_{t=1}^{T} ( r_t(f(a)) − r_t(a) ) · 1[a_t = a]
                  = Σ_{a∈A} R̂_{a,f(a)}(~a, ~r, T)
                  ≤ |A| · max_{a,b∈A} R̂_{a,b}(~a, ~r, T).
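
Continuing the sketch from Section 5.2 (reusing the hypothetical pairwise_regret and internal_regret helpers introduced there), the two bounds of Lemma 5.6 can be checked numerically on random data:

    # Reusing pairwise_regret and internal_regret from the earlier sketch.
    rng = np.random.default_rng(1)
    T, n = 2000, 4
    rewards = rng.uniform(size=(T, n))
    actions = rng.integers(n, size=T)

    max_pairwise = max(pairwise_regret(actions, rewards, a, b)
                       for a in range(n) for b in range(n))
    r_int = internal_regret(actions, rewards)
    assert max_pairwise <= r_int + 1e-9 and r_int <= n * max_pairwise + 1e-9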
5.3 Internal Regret & Correlated Equilibria
Definition 5.7 For S ⊆ R^N and x ∈ R^N, let

    dist(x, S) = inf_{s∈S} ‖x − s‖_2.

We say that an infinite sequence x_1, x_2, . . . ∈ R^N converges to S, if

    lim_{n→∞} dist(x_n, S) = 0.
Theorem 5.8 Let G be a normal-form game with a finite number k of players and a finite number of
strategies per player. Suppose the players play G repeatedly and each player i chooses her sequence
of strategies (a_i^t)_{t=1}^∞ by applying a no-internal-regret algorithm with rewards r_i^t(a) = u_i(a, a_{-i}^t).
Let C be the set of correlated equilibria of G and p(T) the uniform distribution on the multi-set
{ (a_1^t, . . . , a_k^t) | 1 ≤ t ≤ T } of strategy profiles. Then the sequence (p(T))_{T=1}^∞ converges to C.
Proof: By contradiction. If the sequence does not converge to C, then for some δ > 0 there exists
an infinite subsequence of distributions that have distance at least δ from C.
Since C and the space of all probability distributions on ∏_{i∈I} A_i are compact (bounded and closed),
so is the space of probability distributions at distance δ or greater from C. Thus, there is an infinite
subsequence that converges to some distribution p with dist(p, C) ≥ δ. Denote this subsequence as
p(T_1), p(T_2), p(T_3), . . .
Since p ∉ C, there exists a player i, two strategies α, β ∈ A_i and ε > 0, such that

    Σ_{a_{-i}} ( u_i(β, a_{-i}) − u_i(α, a_{-i}) ) · p(α, a_{-i}) = ε.

Since p is the limit point of the sequence p(T_s), for any sufficiently large s, it holds that

    Σ_{a_{-i}} ( u_i(β, a_{-i}) − u_i(α, a_{-i}) ) · p(T_s)(α, a_{-i}) ≥ ε/2.
Recall that p(T_s)(a) is defined as the number of times a was played in the first T_s rounds divided
by T_s. Thus,

    Σ_{a_{-i}} ( u_i(β, a_{-i}) − u_i(α, a_{-i}) ) · (1/T_s) · Σ_{t=1}^{T_s} 1[a^t = (α, a_{-i})] ≥ ε/2.

Note that 1[a^t = (α, a_{-i})] = 1[a_{-i}^t = a_{-i}] · 1[a_i^t = α]. Changing the order of summation, we obtain
the following:

    ε/2 ≤ (1/T_s) · Σ_{t=1}^{T_s} Σ_{a_{-i}} ( u_i(β, a_{-i}) − u_i(α, a_{-i}) ) · 1[a_{-i}^t = a_{-i}] · 1[a_i^t = α]
        = (1/T_s) · Σ_{t=1}^{T_s} ( u_i(β, a_{-i}^t) − u_i(α, a_{-i}^t) ) · 1[a_i^t = α]
        = R̂_{α,β}( (a_i^1, . . . , a_i^{T_s}), (r_i^1, . . . , r_i^{T_s}), T_s ).
This contradicts the fact that player i is using a no-internal-regret algorithm.
5.4 A No-Internal-Regret Algorithm
Let A_1, . . . , A_n be no-external-regret algorithms. So, if R_i(n, T) denotes the expected external
per-time-step regret of algorithm A_i on an instance with n experts of length T, then

    lim_{T→∞} R_i(n, T) = 0.
We combine A_1, . . . , A_n into a new algorithm as follows:

Algorithm 1: NoIntRegret

• Initialize independent no-external-regret algorithms A_1, . . . , A_n.
• At time t:
  • Let q_i^t = (q_{i,1}^t, . . . , q_{i,n}^t) be the distribution according to which algorithm A_i samples its
    expert at this time. Define the matrix Q^t whose i-th row is q_i^t.
  • Find a distribution p^t = (p_1^t, . . . , p_n^t) with p^t = p^t · Q^t.
  • Sample an expert according to p^t and observe the reward vector r^t = (r_1^t, . . . , r_n^t).
  • Report the reward vector p_i^t · r^t to each algorithm A_i.
Remark: Note that p^t in the algorithm is well-defined, since we can view it as the stationary
distribution of the Markov chain induced by the matrix Q^t, which is known to exist.
The idea of the algorithm can be described as follows: for every (advisor) function f, we want to
use algorithm A_i to ensure low pairwise regret of the i → f(i) variety. This works because we choose
p^t such that p_i^t can be viewed both as the probability of choosing expert i at time t and as the
probability of choosing algorithm A_i and then following its advice. For this reason we can report
the reward vector p_i^t · r^t to each algorithm A_i as the expected reward due to A_i.
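
The following is a minimal sketch of this reduction in Python with NumPy, assuming multiplicative weights (Hedge) as the underlying no-external-regret algorithms and computing the fixed point p^t = p^t · Q^t by power iteration. A fixed learning rate is used for simplicity; a properly tuned (e.g. decaying) rate would be needed for the formal guarantees below. All class and function names are ours.

    import numpy as np

    class Hedge:
        """Multiplicative-weights (no-external-regret) algorithm over n experts."""
        def __init__(self, n, eta):
            self.weights = np.ones(n)
            self.eta = eta

        def distribution(self):
            return self.weights / self.weights.sum()

        def update(self, reward_vector):
            # Rewards are assumed to lie in [0, 1]; the scaled rewards p_i^t * r^t do.
            self.weights *= np.exp(self.eta * np.asarray(reward_vector))

    def stationary(Q, iters=200):
        """Approximate p with p = p @ Q by power iteration. Here Q has strictly
        positive entries (Hedge never assigns zero probability to an expert), so
        the stationary distribution is unique and the iteration converges."""
        p = np.full(Q.shape[0], 1.0 / Q.shape[0])
        for _ in range(iters):
            p = p @ Q
        return p / p.sum()

    class NoIntRegret:
        def __init__(self, n, eta=0.05, rng=None):
            self.algs = [Hedge(n, eta) for _ in range(n)]   # one copy A_i per expert i
            self.rng = rng if rng is not None else np.random.default_rng()

        def act(self):
            # Row i of Q^t is the distribution q_i^t used by algorithm A_i.
            Q = np.array([alg.distribution() for alg in self.algs])
            self.p = stationary(Q)
            return int(self.rng.choice(len(self.p), p=self.p))

        def observe(self, r):
            # Report the scaled reward vector p_i^t * r^t to each algorithm A_i.
            for i, alg in enumerate(self.algs):
                alg.update(self.p[i] * np.asarray(r))

A usage sketch on the Traffic Light Game (payoffs rescaled to [0, 1]); by Theorem 5.8, the empirical distribution of play should approach the set of correlated equilibria:

    U1 = np.array([[4.0, 1.0], [5.0, 0.0]]) / 5.0
    U2 = np.array([[4.0, 5.0], [1.0, 0.0]]) / 5.0
    players = [NoIntRegret(2), NoIntRegret(2)]
    counts = np.zeros((2, 2))
    for t in range(20000):
        a1, a2 = players[0].act(), players[1].act()
        counts[a1, a2] += 1
        players[0].observe(U1[:, a2])   # reward of each of player 1's actions against a2
        players[1].observe(U2[a1, :])   # reward of each of player 2's actions against a1
    print(counts / counts.sum())        # empirical distribution of play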
Theorem 5.9 Let A_1, . . . , A_n have expected per-time-step external regret of at most R(n, T) against
the class of adaptive adversaries. Then algorithm NoIntRegret has internal regret at most
n · R(n, T) against the same class of adversaries.
Proof: By our assumption on A_1, . . . , A_n, we have

    E[ (1/T) · Σ_{t=1}^{T} p_i^t · (q_i^t · r^t) ]  ≥  E[ (1/T) · Σ_{t=1}^{T} p_i^t · r_j^t ] − R(n, T)        (3)
for all 1 ≤ i, j ≤ n, since every A_i has regret at most R(n, T) relative to each expert j. The sum
of expected rewards of A_1, . . . , A_n at time t is

    E[ Σ_{i=1}^{n} p_i^t · (q_i^t · r^t) ]  =  E[ p^t Q^t (r^t)^T ]  =  E[ p^t (r^t)^T ],

since p^t Q^t = p^t. Let f : {1, . . . , n} → {1, . . . , n} be an arbitrary function. Summing (3) over all i
and choosing for each right-hand side j = f(i), we obtain
"
#
#
"
T
T
n
1 X t t T
1 XX t t
E
p r
≥E
pi rf (i) − n · R (n, T ) .
T
T
t=1
t=1 i=1
Taking the maximum over all functions f yields the claim.
Corollary 5.10 There exists an algorithm with expected internal per-time-step regret O( n · √(ln(n)/T) )
against the class of (randomized) adaptive adversaries.

Remark: It is possible to improve the bound in Corollary 5.10 to O( √(n · ln(n)/T) ) (see Problem
Set 10).