8. Markov Decision Processes

Decisions may not be static in many situations, especially under uncertainty and dynamics. The outcome of a decision may affect the system or the mechanism. For example, in a business transaction in which user A is negotiating a price with user B, if A can first determine whether B is trustworthy, the subsequent price negotiation can be affected. Consequently, the system state (i.e. status) can be introduced into the decision mechanism. In case the system can be modeled as a Markov chain, we have a Markov decision process (MDP).

8.1 Mathematical Formulation

Real-world decision making often encounters multiple-objective or non-commensurate situations, decisions under uncertainty and risk, or impacts from the decision itself. A mathematical optimization of discrete-stage sequential decisions in a stochastic environment can be developed as a Markov decision process (MDP), via the dynamics of a controlled Markov process. The Markov process becomes a controlled Markov process when the transition probabilities can be affected by the action (i.e. decision). With the following additional information, we can form an MDP:
- The length of the planning horizon
- The cost/reward structure
- A criterion of measure of performance
- The information pattern (i.e. what the decision maker knows at the decision epoch)
The objective of MDP analysis is to determine a rule for each decision epoch (together constituting a policy) for selecting an action, as a function of the specific information pattern, such that the policy optimizes some criterion value.

Let $X_t$, $t = 0, 1, 2, \dots, T$ be a Markov chain subject to control, with
$$p_{ij}(a) = P(X_{t+1} = j \mid X_t = i, A_t = a)$$
where $A_t$ is the action (or decision) taken at decision epoch $t$, $a \in A$, $i, j \in S$ (state space). More generally, the action space can depend on the current state, that is, $A_t \in A(X_t)$. $T \le \infty$ denotes the number of decision epochs in the planning, which we call the horizon length. The (real-valued) reward $r(i, a)$ is accrued at decision epoch $t \le T$ when $X_t = i$, $A_t = a$. If the reward also depends on the state at the next decision epoch, $r(i, a)$ is taken as the expected reward to be accrued at the next epoch. The terminal reward $R(i)$ is received at the final epoch when $X_T = i$; a terminal reward is only accrued for a finite horizon $T < \infty$. For all $t = 0, 1, 2, \dots, T$ we assume that the decision maker is allowed to know $X_t$ during the process of selecting $A_t$. A decision rule $d_t$ is a mapping from the state space to the action space, and a policy is a sequence of rules $\pi = (d_0, \dots, d_T)$, where $d_t$ is the rule at decision epoch $t$. We can therefore treat the following as conditional probabilities:
- Transition probability: $P(X_{t+1} \mid X_t, A_t)$
- Reward probability: $P(R_t \mid X_t, A_t)$
- Policy: $\pi(A_t \mid X_t)$
To ensure convergence, we define the discount factor $0 \le \gamma < 1$. Our decision criterion is
$$E\left[\sum_{t=0}^{T-1} \gamma^t\, r(X_t, A_t) + \gamma^T R(X_T) \,\middle\vert\, X_0 = i\right].$$
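The ingredients above (states, actions, transition probabilities $p_{ij}(a)$, rewards $r(i, a)$, and discount factor $\gamma$) can be collected in a small data structure. The sketch below is illustrative only and not from the text; the names (`TabularMDP`, `P`, `R`, `gamma`) and the toy two-state, two-action numbers are assumptions chosen for the example.

```python
import numpy as np

# A minimal tabular MDP container: |S| states, |A| actions.
# P[a, i, j] = p_ij(a): probability of moving from state i to j under action a.
# R[i, a]    = r(i, a): expected immediate reward when taking action a in state i.
class TabularMDP:
    def __init__(self, P, R, gamma):
        self.P = np.asarray(P, dtype=float)   # shape (|A|, |S|, |S|)
        self.R = np.asarray(R, dtype=float)   # shape (|S|, |A|)
        self.gamma = gamma                    # discount factor, 0 <= gamma < 1
        assert np.allclose(self.P.sum(axis=2), 1.0), "rows of P must sum to 1"

# Toy example with 2 states and 2 actions (numbers are arbitrary).
P = [[[0.9, 0.1],   # action 0, from state 0
      [0.2, 0.8]],  # action 0, from state 1
     [[0.5, 0.5],   # action 1, from state 0
      [0.0, 1.0]]]  # action 1, from state 1
R = [[1.0, 0.0],    # rewards r(i, a)
     [0.0, 2.0]]
mdp = TabularMDP(P, R, gamma=0.9)
```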
Such an MDP is said to have a finite horizon. When $\gamma < 1$ and $T = \infty$, we have the infinite-horizon discounted MDP. Our objective is to select a policy maximizing the criterion value over all policies. In implementation, a policy can be viewed as a matrix or a look-up table, with the action prescribed for state $i$ at decision epoch $t$ as the $(i, t)$th entry. A solution to the standard MDP includes
- An optimal policy
- The criterion value generated by the policy for each state
At first sight such a policy exists only for a given initial state. However, policies exist that are simultaneously optimal for all states, and such policies are designated as optimal policies. We typically determine a solution based on dynamic programming. Let $v_n(i)$ denote the value of the criterion generated by an optimal policy given initial state $i$ and horizon length $n$. The vector $v_n = \{v_n(i) : i \in S\}$ can be determined by the following recursion equation:
$$v_{n+1}(i) = \max_{a \in A} \left[ r(i, a) + \gamma \sum_{j \in S} p_{ij}(a)\, v_n(j) \right], \quad n = 0, \dots, T-1, \qquad v_0(i) = R(i).$$
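The recursion above is backward induction over the horizon. A minimal sketch in Python, assuming the hypothetical `TabularMDP` container from the earlier example and a terminal reward vector `R_T` (both assumptions, not from the text):

```python
import numpy as np

def backward_induction(mdp, R_T, T):
    """Finite-horizon DP: v_{n+1}(i) = max_a [ r(i,a) + gamma * sum_j p_ij(a) v_n(j) ]."""
    v = np.asarray(R_T, dtype=float)       # v_0(i) = R(i), the terminal reward
    rules = []                             # optimal rule for each number of remaining epochs
    for _ in range(T):
        # Q[i, a] = r(i, a) + gamma * sum_j p_ij(a) v(j)
        Q = mdp.R + mdp.gamma * np.einsum('aij,j->ia', mdp.P, v)
        rules.append(Q.argmax(axis=1))     # rule with this many epochs remaining
        v = Q.max(axis=1)                  # v_{n+1}
    return v, rules[::-1]                  # value for the full horizon, rules ordered by epoch

# Example usage with the toy MDP above:
# v_T, policy = backward_induction(mdp, R_T=[0.0, 0.0], T=5)
```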
It is straightforward to show that a decision rule $d^*_{n+1}$ is optimal when $n+1$ decision epochs remain if and only if the action it prescribes attains the maximum in the above equation for every $i \in S$; that is, $A_t = d^*_{n+1}(i)$ when $X_t = i$ and $n+1$ epochs remain. The rules $d^*_1, \dots, d^*_T$, applied according to the number of remaining decision epochs, constitute an optimal policy $\pi^*$. For the finite-horizon case, the existence of an optimal rule is guaranteed at each decision epoch.

Proposition: We define two operators $T_d$ and $T$, for any real-valued function $u$ on $S$, by
$$(T_d u)(i) = r(i, d(i)) + \gamma \sum_{j \in S} p_{ij}(d(i))\, u(j)$$
$$T u = \sup_d T_d u$$
Then, the recursion equation can be rewritten as $v_{n+1} = T v_n$, $n = 0, \dots, T-1$, and $d = d^*_{n+1}$ if and only if $T_d v_n = T v_n$.

Now we turn our attention to the infinite-horizon discounted problem. Let $U$ be the set of all real-valued functions on $S$, $u : S \to \mathbb{R}$, and define $\|u\| = \max\{|u(i)| : i \in S\}$. Furthermore, for all $u, v \in U$,
$$\|T u - T v\| \le \gamma\, \|u - v\|,$$
and $\gamma < 1$ guarantees that $T$ is a contraction operator and thus has the following properties:
- $T$ has a unique fixed point in $U$. That is, there exists a unique $v^* \in U$ such that $v^* = T v^*$.
- For any sequence $\{v_n\}$ such that $v_0 \in U$ and $v_{n+1} = T v_n$, $\lim_{n \to \infty} \|v^* - v_n\| = 0$.
Another nice property is that $T_d$ is also a contraction operator for any rule $d$. We have:
- The criterion value function of an optimal policy is the fixed point $v^*$.
- There exists an optimal policy that is stationary; that is, there is a rule $d^*$ such that the policy $\pi^* = (d^*, d^*, \dots)$ is optimal.
- A stationary policy $\pi = (d, d, \dots)$ is optimal if and only if $d$ attains the maximum in $T v^*$, i.e. $T_d v^* = T v^*$.
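Iterating $v_{n+1} = T v_n$ until the iterates stop changing is the familiar value iteration algorithm, and the contraction property above guarantees its convergence to $v^*$. A minimal sketch, again assuming the hypothetical `TabularMDP` container introduced earlier (the tolerance `tol` is likewise an assumption):

```python
import numpy as np

def value_iteration(mdp, tol=1e-8, max_iter=10_000):
    """Repeatedly apply T: v <- max_a [ r(., a) + gamma * P_a v ] until convergence."""
    n_actions, n_states, _ = mdp.P.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        Q = mdp.R + mdp.gamma * np.einsum('aij,j->ia', mdp.P, v)
        v_new = Q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:   # sup-norm stopping test
            return v_new
        v = v_new
    return v

# v_star = value_iteration(mdp)   # approximate fixed point v* = T v*
```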
8.2 Recursive Property

An MDP can be delineated as in Figure 1. The next challenge is how to obtain the numerical solution(s).

[Figure 1: MDP process. Under policy $\pi$, the states $x_0, x_1, x_2, \dots$ and actions $a_0, a_1, a_2, \dots$ evolve in turn, generating rewards $r_0, r_1, r_2, \dots$]

For a given policy $\pi$ and a starting state $x_0$, we can compute the expectation of future rewards. The value function $V^\pi(x_0)$ measures the expected discounted reward
$$V^\pi(x_0) = E\left[\sum_{t=0}^{\infty} \gamma^t\, r(x_t, a_t) \,\middle\vert\, x_0; \pi\right].$$
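Since $V^\pi(x_0)$ is an expectation over trajectories, it can also be estimated by averaging the discounted returns of simulated rollouts whenever the model can be sampled. The sketch below is illustrative only; it assumes the same hypothetical `TabularMDP` container and a deterministic policy given as an array `policy[i] -> action`, and the rollout length and episode count are arbitrary truncation choices.

```python
import numpy as np

def mc_value_estimate(mdp, policy, x0, n_episodes=2000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(x0): average of truncated discounted returns."""
    rng = np.random.default_rng(seed)
    n_states = mdp.P.shape[1]
    total = 0.0
    for _ in range(n_episodes):
        x, discount, ret = x0, 1.0, 0.0
        for _ in range(horizon):
            a = policy[x]
            ret += discount * mdp.R[x, a]
            x = rng.choice(n_states, p=mdp.P[a, x])   # sample next state from p_x.(a)
            discount *= mdp.gamma
        total += ret
    return total / n_episodes

# v_hat = mc_value_estimate(mdp, policy=np.array([0, 1]), x0=0)
```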
For every starting state, we can find the best policy $\pi$ and its value:
$$V^*(x_0) = \max_{\pi} V^\pi(x_0).$$
A policy $\pi^*$ is optimal if it maximizes the value for each starting state. That is,
$$V^{\pi^*}(x_0) = V^*(x_0), \quad \forall x_0 \in S.$$
Please note that every MDP has at least one optimal policy. Furthermore, there exists at least one non-randomized optimal policy. Optimal control of an MDP is to compute an optimal policy $\pi^*$ when the model $P(x_{t+1} \mid x_t, a_t)$, the reward model $P(r_t \mid x_t, a_t)$, and the discount factor $\gamma$ are known. The value function $V^\pi(x_0)$ is an expectation of a series of rewards satisfying a recursive property:
$$V^\pi(x) = E\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle\vert\, x_0 = x; \pi\right] = r(x, \pi(x)) + \gamma\, E\left[\sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle\vert\, x_0 = x; \pi\right]$$
$$V^\pi(x) = r(x, \pi(x)) + \gamma \sum_{x'} P(x_1 = x' \mid x_0 = x, a_0 = \pi(x))\, E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle\vert\, x_1 = x'; \pi\right]$$
In other words,
$$V^\pi(x) = r(x, \pi(x)) + \gamma \sum_{x'} P(x' \mid x, \pi(x))\, V^\pi(x').$$
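For a finite state space this recursion is a linear system in the unknowns $V^\pi(x)$, so policy evaluation can be done by a direct solve of $(I - \gamma P^\pi) V^\pi = r^\pi$. A minimal sketch, assuming the hypothetical `TabularMDP` container and a deterministic policy array as before:

```python
import numpy as np

def policy_evaluation(mdp, policy):
    """Solve (I - gamma * P_pi) V = r_pi exactly for the value of a deterministic policy."""
    n_states = mdp.P.shape[1]
    states = np.arange(n_states)
    P_pi = mdp.P[policy, states]      # P_pi[i, j] = P(j | i, pi(i)), shape (|S|, |S|)
    r_pi = mdp.R[states, policy]      # r_pi[i]    = r(i, pi(i))
    return np.linalg.solve(np.eye(n_states) - mdp.gamma * P_pi, r_pi)

# V_pi = policy_evaluation(mdp, policy=np.array([0, 1]))
```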
This recursive property surely holds for the optimal value function $V^*$.

Proposition (Bellman Optimality Equation):
$$V^*(x) = \max_a \left[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x') \right]$$
$$\pi^*(x) = \arg\max_a \left[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x') \right]$$
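The second relation says the optimal policy is greedy with respect to $V^*$: once $V^*$ is known (e.g. from the value iteration sketch above), an optimal action for each state follows from a single argmax. A minimal sketch under the same assumed `TabularMDP` container:

```python
import numpy as np

def greedy_policy(mdp, v_star):
    """pi*(x) = argmax_a [ r(x, a) + gamma * sum_x' P(x' | x, a) v*(x') ]."""
    Q = mdp.R + mdp.gamma * np.einsum('aij,j->ia', mdp.P, v_star)
    return Q.argmax(axis=1)

# pi_star = greedy_policy(mdp, value_iteration(mdp))
```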
Remark: The Bellman optimality equation plays a central role in MDPs. If the system is deterministic, it can be interpreted as follows: from any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem initiated at that point.

8.3 Decision and Learning

When we have computed the optimal value function $V^*(x)$, the optimal policy can be derived directly from the Bellman optimality equation. If we have only computed the value function $V^\pi(x)$ for the current policy (i.e. policy evaluation), we can apply the policy iteration algorithm, which alternates between evaluating and improving the policy (a sketch in code follows the steps below):

Proposition (Policy Iteration Algorithm):
(i) Initialization: $k = 0$ and initial policy $\pi_0$ (chosen randomly)
(ii) Policy evaluation: compute $V^{\pi_k}$
(iii) Policy update: $\pi_{k+1}(x) = \arg\max_a \left[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^{\pi_k}(x') \right]$
(iv) Repeat the iteration ($k \leftarrow k + 1$) from (ii)
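A minimal sketch of the loop above, reusing the hypothetical `policy_evaluation` and `TabularMDP` helpers from the earlier examples (all names are assumptions, not from the text):

```python
import numpy as np

def policy_iteration(mdp, max_iter=1000):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = mdp.P.shape
    policy = np.zeros(n_states, dtype=int)        # pi_0: arbitrary initial policy
    for _ in range(max_iter):
        V = policy_evaluation(mdp, policy)        # step (ii): compute V^{pi_k}
        Q = mdp.R + mdp.gamma * np.einsum('aij,j->ia', mdp.P, V)
        new_policy = Q.argmax(axis=1)             # step (iii): greedy improvement
        if np.array_equal(new_policy, policy):    # stable policy => optimal
            return policy, V
        policy = new_policy                       # step (iv): k <- k + 1
    return policy, V

# pi_star, V_star = policy_iteration(mdp)
```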
Typical optimal control assumes knowledge of $P(x_{t+1} \mid x_t, a_t)$, also known as the system model, and intends to find the optimal (control) policy. A new branch of study called reinforcement learning goes in a different direction, that is, to learn from experience when $P(x_{t+1} \mid x_t, a_t)$ is not known a priori. There are three common approaches:
- Model-based reinforcement learning: use experience to learn $P(x_{t+1} \mid x_t, a_t)$ and $P(r_t \mid x_t, a_t)$, and then derive a policy
- Model-free reinforcement learning: use experience to directly learn the value function
- Policy search: use experience to evaluate different policies and to search directly in the space of policies, for example using gradient ascent
Typical applications of MDP modeling include
- Inventory problems
- Routing
- Sequential resource allocation
- The secretary problem (i.e. dynamic programming)

Example: Let us consider a set of frequency bands to represent the general case, though higher-dimensional radio resources can be considered. Suppose the frequency bands that we are interested in (typically PS operating) are a set of numbered bands, $\mathcal{M} = \{1, \dots, M\}$. At time $t_n$, CRN operation allows an update of spectrum utilization. The $n$th observation (or allocation) time interval is $[t_n, t_{n+1})$. Due to the opportunistic nature of each link (thus each frequency band), modeled as a Markov chain, the $i$th frequency band is available following a Bernoulli process with availability probability $p_i$, which is invariant to time. We define the following indicator function (just as the clear channel indicator defined in Chapter 5):
$$c_i[n] = \begin{cases} 1, & \text{channel } i \text{ available in } [t_n, t_{n+1}) \\ 0, & \text{otherwise} \end{cases} \qquad (8.1)$$
For perfect spectrum sensing, we can determine (8.1) in a reliable way. However, any spectrum sensing has some vulnerable situations, and thus we need to consider more in order to decide medium access control. Following (8.1), the probability mass function (pmf) of the Bernoulli random variable at the $i$th frequency band is
$$f_{c_i[n]}(x) = p_i\, \delta(x - 1) + (1 - p_i)\, \delta(x) \qquad (8.2)$$
It is reasonable to assume the $c_i[n]$, $n = 1, \dots, N$, to be independent for each $i = 1, \dots, M$, where $N$ denotes the observation interval depth. Denote $\mathbf{p} = [p_1, \dots, p_M]$. For reliable CR operation, spectrum sensing is necessary, so that the CR-Tx can have information about the availability of each frequency band. However, for network operations on top of CR links, the strategy would be highly related to $\mathbf{p}$.
Case 1: $\mathbf{p}$ is known.
Case 2: $\mathbf{p}$ is unknown.
Case 3: $\mathbf{p}$ can be detected or estimated via some CRN sensing or tomography methods (network tomography is introduced in Chapter 9).
A traditional CR functions as follows: at time $t_n$, the CR learns the availability of a selected frequency band $s_n$ (typically via spectrum sensing). If $c_{s_n}[n] = 1$, an information amount $B$ can be successfully transmitted. For $N$ time durations, the overall throughput is
$$W = B \sum_{n=1}^{N} c_{s_n}[n] \qquad (8.3)$$
In case $\mathbf{p}$ is known, the spectrum sensing strategy for a CR is simply to select the channel $i^* = \arg\max_{i \in \mathcal{M}} p_i$ to sense. The access decision is then made optimally or sub-optimally based on certain decision criteria and conditions, such as a partially observed Markov decision process.
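The known-$\mathbf{p}$ strategy in this example (always sense the band with the largest availability probability) is easy to simulate. The sketch below is illustrative only: the probability vector, the per-slot information amount `B`, and the horizon `N` are assumed values, not from the text, and each slot's availability is drawn as an independent Bernoulli trial as in (8.1)-(8.2).

```python
import numpy as np

def simulate_throughput(p, N=1000, B=1.0, seed=0):
    """Always sense/access the band with the largest availability probability p_i
    and accumulate the throughput of (8.3)."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p)
    best = int(np.argmax(p))            # i* = argmax_i p_i
    c = rng.random(N) < p[best]         # Bernoulli availability c_{i*}[n] for each slot n
    return B * c.sum()                  # W = B * sum_n c_{s_n}[n], with s_n = i*

# p = [0.2, 0.7, 0.5]                   # assumed availability probabilities
# print(simulate_throughput(p))         # expected value is about N * B * max(p)
```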