Reinforcement Learning: Markov Decision Processes Notes

Reinforcement Learning
• You can think of supervised learning as the teacher providing answers (the class labels)
• In reinforcement learning, the agent learns based on a punishment/reward scheme
• Before we can talk about reinforcement learning, we need to introduce Markov Decision Processes
Decision Processes: General Description
• Decide what action to take next, given that your action will affect what happens in the future
• Real-world examples:
– Robot path planning
– Elevator scheduling
– Travel route planning
– Aircraft navigation
– Manufacturing processes
– Network switching and routing
Sequential Decisions
• Assume a fully observable, deterministic environment
• Each grid cell is a state
• The goal state is marked +1
• At each time step, the agent must move Up, Right, Down, or Left
• How do you get from the start state to the goal state?
Sequential Decisions
• Suppose the environment is now stochastic
• With 0.8 probability you go in the direction you intend
• With 0.2 probability you move at right angles to the intended direction (0.1 in either direction; if you hit the wall you stay put)
• What is the optimal solution now?
Sequential Decisions
• Up, Up, Right, Right, Right reaches the goal state with probability 0.8^5 = 0.32768
• But in this stochastic world, going Up, Up, Right, Right, Right might end up with you actually going Right, Right, Up, Up, Right, with probability (0.1^4)(0.8) = 0.00008
• Even worse, you might end up in the -1 state accidentally
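As a quick sanity check, the two path probabilities above can be computed directly; a minimal sketch, assuming the 0.8/0.1/0.1 motion noise defined on the previous slide:

```python
# Probability of every move in the 5-step plan going in the intended direction:
p_intended_path = 0.8 ** 5            # Up, Up, Right, Right, Right all succeed
# Probability of slipping sideways four times and then moving as intended once:
p_sideways_path = (0.1 ** 4) * 0.8
print(p_intended_path)                # 0.32768
print(p_sideways_path)                # ~0.00008
```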
Transition Model
• Transition model: a specification of the outcome probabilities for each action in each possible state
• T(s, a, s') = probability of going to state s' if you are in state s and do action a
• The transitions are Markovian, i.e. the probability of reaching state s' from s depends only on s and not on the history of earlier states (aka the Markov Property)
• Mathematically: suppose you visited the following states in chronological order: s_0, …, s_t. Then
P(s_{t+1} | a, s_0, …, s_t) = P(s_{t+1} | a, s_t)
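As a concrete illustration, a transition model can be stored as a table keyed by (state, action). The sketch below is hypothetical (the grid-cell states and helper name are illustrative, not from the slides) and uses the 0.8/0.1/0.1 noise model; only the entries for one state-action pair are shown.

```python
# T(s, a, s') as a nested dictionary: T[(s, a)][s_next] = probability.
T = {
    ((1, 1), "Up"): {
        (1, 2): 0.8,   # intended direction
        (2, 1): 0.1,   # slip to the right
        (1, 1): 0.1,   # slip to the left, hit the wall, stay put
    },
}

def transition_prob(s, a, s_next):
    """T(s, a, s') = P(s' | s, a); it depends only on the current state s (Markov property)."""
    return T.get((s, a), {}).get(s_next, 0.0)

print(transition_prob((1, 1), "Up", (1, 2)))  # 0.8
```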
Markov Property Example
Suppose s_0 = (1,1), s_1 = (1,2), s_2 = (1,3).
If I go Right from state s_2, the probability of going to s_3 only depends on the fact that I am at state s_2, and not on the entire state history {s_0, s_1, s_2}.
(Figure: grid cells labeled s_0, s_1, s_2, s_3.)
The Reward Function
• Depends on the sequence of states (known as the environment history)
• At each state s, the agent receives a reward R(s), which may be positive or negative (but must be bounded)
• For now, we'll define the utility of an environment history as the sum of the rewards received
Utility Function Example
R(4,3) = +1 (the agent wants to get here)
R(4,2) = -1 (the agent wants to avoid this)
R(s) = -0.04 (for all other states)

U(s_1, …, s_n) = R(s_1) + … + R(s_n)

If the agent reaches the goal via the actions Up, Up, Right, Right, Right, the utility of this environment history is:
-0.04 - 0.04 - 0.04 - 0.04 - 0.04 + 1 = 0.80
Utility Function Example
If there's no uncertainty, then the agent would find the sequence of actions that maximizes the sum of the rewards of the visited states.
Markov Decision Process
The specification of a sequential decision problem for a fully observable environment with a Markovian transition model and additive rewards is called a Markov Decision Process (MDP).

An MDP has the following components:
1. A finite set of states S, along with an initial state s_0
2. A finite set of actions A
3. Transition model: T(s, a, s') = P(s' | a, s)
4. Reward function: R(s)
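To make the definition concrete, here is a minimal, hypothetical sketch of how the four components could be bundled together in code (the class name and field names are illustrative, not part of the slides):

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = Tuple[int, int]   # grid cell (col, row)
Action = str              # "Up", "Right", "Down", "Left"

@dataclass
class MDP:
    states: List[State]                                            # finite set S
    initial_state: State                                           # s_0
    actions: List[Action]                                          # finite set A
    transitions: Dict[Tuple[State, Action], Dict[State, float]]    # T(s, a, s')
    rewards: Dict[State, float]                                    # R(s)

    def T(self, s: State, a: Action, s_next: State) -> float:
        return self.transitions.get((s, a), {}).get(s_next, 0.0)

    def R(self, s: State) -> float:
        return self.rewards.get(s, 0.0)
```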
Solutions to an MDP
• Why is the following not a satisfactory solution to the MDP?
[1,1] - Up
[1,2] - Up
[1,3] - Right
[2,3] - Right
[3,3] - Right
A Policy
• Policy: mapping from a state to an action
• Need this to be defined for all states so that the agent will always know what to do
• Notation:
– π denotes a policy
– π(s) denotes the action recommended by the policy π for state s
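A policy can be represented as a simple lookup table; a minimal sketch (the grid coordinates and action names follow the convention used in these notes, and the variable names are illustrative):

```python
# A policy maps every state to an action: pi(s) -> a.
pi = {
    (1, 1): "Up",
    (1, 2): "Up",
    (1, 3): "Right",
    (2, 3): "Right",
    (3, 3): "Right",
    # ...a real policy must cover *every* non-terminal state, which is what
    # distinguishes it from the fixed action sequence on the previous slide.
}

def policy(s):
    return pi[s]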
Optimal Policy
• There are obviously many different policies for an MDP
• Some are better than others. The "best" one is called the optimal policy π* (we will define "best" more precisely in later slides)
• Note: every time we start at the initial state and execute a policy, we get a different environment history (due to the stochastic nature of the environment)
• This means we get a different utility every time we execute a policy
• We need to measure expected utility, i.e. the average of the utilities of the possible environment histories generated by the policy
Optimal Policy Example
R(s) = -0.04
Notice the tradeoff between risk and reward!
Roadmap for the Next Few Slides
We need to describe how to compute optimal policies.
1. Before we can do that, we need to define the utility of a state
2. Before we can do (1), we need to explain the stationarity assumption
3. Before we can do (2), we need to explain finite/infinite horizons
Finite/Infinite Horizons
• Finite horizon: a fixed time N after which nothing matters (think of this as a deadline)
• Suppose our agent starts at (3,1), R(s) = -0.04, and N = 3. Then to get to the +1 state, the agent must go up.
• If N = 100, the agent can take the safe route around
Nonstationary Policies
• Nonstationary policy: the optimal action in a given state changes over time
• With a finite horizon, the optimal policy is nonstationary
• With an infinite horizon, there is no incentive to behave differently in the same state at different times
• With an infinite horizon, the optimal policy is stationary
• We will assume infinite horizons
Utility of a State Sequence
Under stationarity, there are two ways to assign utilities to sequences:

1. Additive rewards: the utility of a state sequence is
U(s_0, s_1, s_2, …) = R(s_0) + R(s_1) + R(s_2) + …

2. Discounted rewards: the utility of a state sequence is
U(s_0, s_1, s_2, …) = R(s_0) + γR(s_1) + γ^2 R(s_2) + …

where 0 ≤ γ ≤ 1 is the discount factor.
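A short sketch of both definitions, applied to the reward sequence from the earlier grid-world example (γ = 0.9 is an illustrative choice):

```python
def additive_utility(rewards):
    # U = R(s0) + R(s1) + R(s2) + ...
    return sum(rewards)

def discounted_utility(rewards, gamma):
    # U = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, -0.04, -0.04, 1.0]
print(additive_utility(rewards))          # 0.8
print(discounted_utility(rewards, 0.9))   # ~0.427 (future rewards count less)
```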
The Discount Factor
• Describes preference for current rewards over future rewards
• Compensates for uncertainty in available time (models mortality)
• E.g. being promised $10,000 next year is only worth 90% of being promised $10,000 now
• γ near 0 means future rewards don't mean anything
• γ = 1 makes discounted rewards equivalent to additive rewards
Utilities
We assume infinite horizons. This means that if the agent doesn't get to a terminal state, then environment histories are infinite, and utilities with additive rewards are infinite. How do we deal with this?

Discounted rewards make utility finite. Assuming the largest possible reward is R_max and γ < 1:

U(s_0, s_1, s_2, …) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ Σ_{t=0}^{∞} γ^t R_max = R_max / (1 - γ)
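A quick numerical illustration of the bound (assuming, purely for illustration, R_max = 1 and γ = 0.9, so the bound is 1 / (1 - 0.9) = 10):

```python
# Partial sums of the geometric series approach R_max / (1 - gamma) from below.
R_max, gamma = 1.0, 0.9
partial_sum = sum(gamma ** t * R_max for t in range(1000))
print(partial_sum, R_max / (1 - gamma))   # both ~10
```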
Computing Optimal Policies
• A policy π generates a sequence of states
• But the world is stochastic, so a policy π has a range of possible state sequences, each of which has some probability of occurring
• The value of a policy is the expected sum of discounted rewards obtained
The Optimal Policy
• Given a policy π, we write the expected sum of discounted rewards obtained as:

E[ Σ_{t=0}^{∞} γ^t R(s_t) | π ]

• An optimal policy π* is the policy that maximizes the expected sum above:

π* = argmax_π E[ Σ_{t=0}^{∞} γ^t R(s_t) | π ]
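One way to see what "expected sum of discounted rewards" means operationally is to estimate it by simulation. A minimal, hypothetical sketch, assuming helpers not defined in the slides: pi(s) returns the policy's action, R(s) the reward, and sample_next_state(s, a) draws s' from T(s, a, ·):

```python
def rollout_return(s0, pi, R, sample_next_state, gamma=0.9, horizon=200):
    """Discounted return of one simulated environment history under policy pi."""
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):              # truncates the infinite sum; gamma^200 is negligible
        total += discount * R(s)
        s = sample_next_state(s, pi(s))   # stochastic transition according to T
        discount *= gamma
    return total

def expected_utility(s0, pi, R, sample_next_state, n=10_000):
    """Monte Carlo estimate of E[ sum_t gamma^t R(s_t) | pi ]."""
    return sum(rollout_return(s0, pi, R, sample_next_state) for _ in range(n)) / n
```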
The Optimal Policy
• For every MDP, there exists an optimal policy
• There is no better option (in terms of expected sum of rewards) than to follow this policy
• How do you calculate this optimal policy? Can't evaluate all policies… too many of them
• First, need to calculate the utility of each state
• Then use the state utilities to select an optimal action in each state
Rewards vs Utilities
• What's the difference between R(s), the reward for a state, and U(s), the utility of a state?
– R(s): the short-term reward for being in s
– U(s): the long-term total expected reward for the sequence of states starting at s (not just the reward for state s)
Utilities in the Maze Example
Start at state (1,1). Let's suppose we choose the action Up.

U(1,1) = R(1,1) + …
(R(1,1) is the reward for the current state)
Utilities in the Maze Example
Start at state (1,1). Let's choose the action Up.

U(1,1) = R(1,1) + 0.8*U(1,2) + 0.1*U(2,1) + 0.1*U(1,1)
(0.8: probability of moving up; 0.1: probability of moving right; 0.1: probability of moving left into the wall and staying put)
Utilities in the Maze Example
Now let's throw in the discounting factor:

U(1,1) = R(1,1) + γ*0.8*U(1,2) + γ*0.1*U(2,1) + γ*0.1*U(1,1)
The Utility of a State
If we choose action a at state s, the expected future (discounted) rewards are:

U(s) = R(s) + γ Σ_{s'} T(s, a, s') U(s')

where:
– R(s) is the reward at the current state s
– T(s, a, s') is the probability of moving from state s to state s' by doing action a
– U(s') is the expected sum of future discounted rewards starting at state s'
– U(s) is the expected sum of future discounted rewards starting at state s
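The equation above can be turned into a simple fixed-point iteration for evaluating a fixed policy. A minimal sketch, assuming the hypothetical MDP class and policy dictionary pi from the earlier sketches (terminal-state handling is simplified):

```python
def evaluate_policy(mdp, pi, gamma=0.9, iterations=100):
    """Repeatedly apply U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U(s')."""
    U = {s: 0.0 for s in mdp.states}
    for _ in range(iterations):
        U = {
            s: mdp.R(s) + gamma * sum(p * U[s_next]
                                      for s_next, p in mdp.transitions.get((s, pi[s]), {}).items())
            for s in mdp.states
        }
    return U
```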