Learning Algorithms for Solving MDPs

References: Barto, Bradtke and Singh (1995), "Learning to Act Using Real-Time Dynamic Programming", Machine Learning (also available on the WWW).
1. Q-Learning

Given an MDP, define the Q function by

    Q(s,a) = u(s,a) + \beta \sum_{s'} p(s'|s,a) V(s'),

i.e. Q(s,a) is the expected present discounted value of taking action a in state s, where V(s) = \max_{a \in A(s)} Q(s,a). Let s'_t denote the (random) successor to s_t, i.e. a random draw from p(\cdot|s_t,a_t). Then if Q_t denotes the estimate of the Q function at iteration t of the Q-learning algorithm, we have:

    Q_{t+1}(s_t,a_t) = Q_t(s_t,a_t) + \gamma_t [ u(s_t,a_t) + \beta \max_{a' \in A(s'_t)} Q_t(s'_t,a') - Q_t(s_t,a_t) ]

Comment: Only the pair (s_t,a_t) actually visited is updated at each step t of the Q-learning algorithm.

At any stage t the recommended action in state s is the action for which the corresponding Q_t(s,a) is highest, i.e.

    \alpha_t(s) = \operatorname{argmax}_{a \in A(s)} Q_t(s,a).

Comment: \{\gamma_t\} is a sequence of possibly stochastic step size functions satisfying, for each (s,a) pair,

    \sum_{t=0}^{\infty} \gamma_t(s,a) = \infty    w.p.1
    \sum_{t=0}^{\infty} \gamma_t^2(s,a) < \infty   w.p.1
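Below is a minimal runnable sketch of this algorithm for a tabular MDP, assuming utilities and transition probabilities are stored as arrays u[s,a] and p[s,a,s']; the array names, the uniform exploration rule, and the 1/n step sizes are illustrative choices, not from the source.

    import numpy as np

    def q_learning(u, p, beta, n_steps, seed=0):
        """Tabular Q-learning: back up one (s,a) pair per step."""
        rng = np.random.default_rng(seed)
        n_states, n_actions = u.shape
        Q = np.zeros((n_states, n_actions))
        n_visits = np.zeros((n_states, n_actions))
        s = 0
        for _ in range(n_steps):
            a = rng.integers(n_actions)                # uniform exploration hits every
                                                       # (s,a) infinitely often w.p.1
            s_next = rng.choice(n_states, p=p[s, a])   # random draw from p(.|s,a)
            n_visits[s, a] += 1
            gamma = 1.0 / n_visits[s, a]               # sum = inf, sum of squares < inf
            target = u[s, a] + beta * Q[s_next].max()
            Q[s, a] += gamma * (target - Q[s, a])      # only the visited pair is updated
            s = s_next
        return Q

The recommended action \alpha_t(s) is then Q[s].argmax().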
Relationship to Stochastic Approximation

1. Standard stochastic approximation: Find the fixed point Q = \Gamma(Q) by

    Q_{t+1} = Q_t + \gamma_t [ \Gamma(Q_t) - Q_t + \eta_t ]

where \{\eta_t\} is a sequence of stochastic shocks satisfying E\{\eta_t | Q_t\} = 0 and var(\eta_t) < \infty. Then Q_t \to Q = \Gamma(Q).
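A toy numeric illustration of this scheme (not from the source), assuming the scalar contraction \Gamma(x) = x/2 + 1, whose fixed point is 2, observed with zero-mean noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = 0.0                                  # initial guess Q_0
    for t in range(1, 50001):
        gamma = 1.0 / t                      # sum gamma_t = inf, sum gamma_t^2 < inf
        shock = rng.normal(scale=0.1)        # eta_t with zero mean
        x += gamma * ((0.5 * x + 1.0) - x + shock)   # Q_{t+1} = Q_t + gamma_t [Gamma(Q_t) - Q_t + eta_t]
    print(x)                                 # approaches the fixed point 2.0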
2. Asynchronous stochastic approximation: Only update or “back up” some of the components of Q at time t. Let T^i be an infinite sequence of times at which component i is updated. Then

    Q_{t+1}(i) = Q_t(i) + \gamma_t [ \Gamma(\bar{Q}_t)(i) - Q_t(i) + \eta_t(i) ]   if t \in T^i
    Q_{t+1}(i) = Q_t(i)                                                           otherwise

where

    \bar{Q}_t = ( Q_{\tau_1(t)}(1), \ldots, Q_{\tau_n(t)}(n) )

and \tau_i(t) denotes the last time component i of Q was updated. Obviously \tau_i(t) \le t.
Theorem (Tsitsiklis, 1993): If for all i and t we have

    \lim_{t \to \infty} \tau_i(t) = \infty   with probability 1,

and if \Gamma is a contraction mapping, then Q_t \to Q, where Q = \Gamma(Q), with probability 1.
Corollary: If every (s,a) pair is backed up infinitely often and the step sizes satisfy the conditions above, then we have Q_t(s,a) \to Q(s,a) with probability 1 for each (s,a) \in S \times A.

Comment: It is not necessary to back up each (s,a) pair at every iteration of the Q-learning algorithm. However it is necessary that every (s,a) pair be backed up infinitely often. This corresponds to the assumption that \lim_{t \to \infty} \tau_i(t) = \infty with probability 1 in Tsitsiklis’s theorem.
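A sketch of the asynchronous scheme under the theorem's conditions (a toy illustration, not from the source): \Gamma(x) = Ax + b is a linear contraction on R^2, and each step backs up one uniformly chosen component, so \tau_i(t) \to \infty w.p.1. For simplicity the backup uses the current iterate rather than an outdated \bar{Q}_t.

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[0.3, 0.2],
                  [0.1, 0.4]])               # row sums < 1: Gamma(x) = Ax + b is a contraction
    b = np.array([1.0, 2.0])
    x = np.zeros(2)
    n_updates = np.zeros(2)
    for _ in range(200000):
        i = rng.integers(2)                  # back up component i only
        n_updates[i] += 1
        gamma = 1.0 / n_updates[i]
        shock = rng.normal(scale=0.1)
        x[i] += gamma * ((A @ x + b)[i] - x[i] + shock)
    print(x, np.linalg.solve(np.eye(2) - A, b))   # both near the fixed point (I - A)^{-1} b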
Real Time Dynamic Programming

A learning algorithm closely related to successive approximations that works in real time, i.e. you update the value function during the process of controlling the system.
0. Successive Approximations

    V_{t+1}(s) = \max_{a \in A(s)} [ u(s,a) + \beta \sum_{s'} p(s'|s,a) V_t(s') ]

1. Gauss-Seidel

    V_{t+1}(s) = \max_{a \in A(s)} [ u(s,a) + \beta \sum_{s'} p(s'|s,a) \tilde{V}_t(s') ]

where

    \tilde{V}_t(s') = V_{t+1}(s')   if s' < s
    \tilde{V}_t(s') = V_t(s')       otherwise,

i.e. the backup of state s uses whichever components of V have already been updated in the current sweep.
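A sketch of both sweeps for a tabular MDP, reusing the hypothetical arrays u[s,a] and p[s,a,s'] from the Q-learning sketch; the Gauss-Seidel sweep updates V in place, so states later in the sweep see the already-updated components.

    import numpy as np

    def bellman(V, u, p, beta, s):
        """max over a in A(s) of u(s,a) + beta * sum_{s'} p(s'|s,a) V(s')."""
        return (u[s] + beta * p[s] @ V).max()

    def successive_approximations_sweep(V, u, p, beta):
        # every state is backed up using the old values V_t
        return np.array([bellman(V, u, p, beta, s) for s in range(len(V))])

    def gauss_seidel_sweep(V, u, p, beta):
        # states s' < s already hold V_{t+1}(s') when state s is backed up
        V = V.copy()
        for s in range(len(V)):
            V[s] = bellman(V, u, p, beta, s)
        return V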
2. Asynchronous Successive Approximations

    V_{t+1}(s) = \max_{a \in A(s)} [ u(s,a) + \beta \sum_{s'} p(s'|s,a) V_t(s') ]   if s \in S_t
    V_{t+1}(s) = V_t(s)                                                            otherwise.
Comment: S_t \subseteq S denotes the set of states which are backed up at iteration t of the asynchronous successive approximations algorithm. Successive approximations is the special case when S_t = S (all states backed up), and the Gauss-Seidel method is the special case when the states are backed up one at a time in cyclic order:

    S_1 = \{1\}, S_2 = \{2\}, \ldots, S_n = \{n\}, S_{n+1} = \{1\}, \ldots
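Restricting the backup to an arbitrary set S_t nests all of the variants above; a sketch, continuing with the hypothetical bellman helper from the previous block:

    def asynchronous_sweep(V, u, p, beta, backup_set):
        """Back up only the states in S_t; leave V_t(s) unchanged elsewhere."""
        V = V.copy()
        for s in backup_set:
            V[s] = bellman(V, u, p, beta, s)
        return V

    # backup_set = range(n_states)  -> successive approximations
    # backup_set = [t % n_states]   -> Gauss-Seidel, one state per step in cyclic order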
3. Real Time Dynamic Programming

This is a special case of asynchronous successive approximations when S_t = \{s_t\}, where s_t is the realized state of the process at time t.
Theorem: In discounted finite MDPs, if \{V_t\} denotes the sequence of value functions from asynchronous successive approximations, then V_t \to V with probability 1 provided each state s \in S is backed up infinitely often.
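A sketch of real-time dynamic programming with the same assumed arrays: the backed-up state at each step is the state actually visited while controlling the (simulated) system, i.e. S_t = \{s_t\}. Greedy control alone need not visit every state infinitely often, as the theorem requires, so the epsilon-greedy exploration here is an illustrative device, not part of the source.

    import numpy as np

    def rtdp(u, p, beta, n_steps, s0=0, eps=0.1, seed=0):
        rng = np.random.default_rng(seed)
        n_states, n_actions = u.shape
        V = np.zeros(n_states)
        s = s0
        for _ in range(n_steps):
            q = u[s] + beta * p[s] @ V            # one-step value of each action at s_t
            V[s] = q.max()                        # back up only the realized state s_t
            if rng.random() < eps:
                a = rng.integers(n_actions)       # occasional random exploration
            else:
                a = int(q.argmax())               # otherwise act greedily w.r.t. current V
            s = rng.choice(n_states, p=p[s, a])   # the system moves to a random successor
        return V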