Action-value functions

1. Action-value functions

We've seen how to define and evaluate policies. But what if we need to improve policies? Let's see how action-value functions help us achieve this.

2. Action-value functions (Q-values)

Action-value functions, also known as Q-values, provide an estimate of the expected return of starting in a state, taking a certain action, and then following a policy thereafter. Therefore, the action value is the sum of the immediate reward received after performing the action and the discounted value of the resulting state, computed for a specific policy. While state-value functions give us a broad overview of the desirability of states, action-value functions break this down further, giving us insight into the desirability of individual actions within those states.
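Written as a compact formula in standard notation (this notation is not spelled out in the video, and it assumes deterministic transitions, as in the grid world used here):

Q^{\pi}(s, a) = r(s, a) + \gamma \, V^{\pi}(s')

where s' is the state reached by taking action a in state s, and gamma is the discount factor.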

3. Grid world

Recall the nine states and the policy dictating the agent's deterministic movements. We previously evaluated this policy using state-value functions. Now we need to compute action values, which means that, for each state, we have to compute four values, one per action. We'll keep the state values on the right, as we will need them for the action-value computation.

4. Q-values - state 4

Suppose the agent starts in state 4.

5. Q-values - state 4

The agent can choose to move up, down, left, or right.

6. State 4 - action down

If the agent moves down from state 4, it receives a -2 reward and lands in a state having a value of 5, which we've previously calculated.

7. State 4 - action down

The Q-value for moving down from state 4 combines the -2 reward with the next state's value of 5, giving 3, assuming a discount factor of 1.

8. State 4 - action left

Moving left yields a Q-value of 1, calculated by adding a reward of -1 to the value of the resulting state, which is 2.

9. State 4 - action up

Similarly, moving up results in a Q-value of 7, derived from a -1 reward and the new state's value of 8.

10. State 4 - action right

Finally, when the agent moves right, it receives a reward of -1 and reaches a state with a value of 10, leading to a Q-value of 9.
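As a quick check, the four Q-values for state 4 work out as follows, using the rewards and state values above and a discount factor of 1:

Q(4, down)  = -2 + 1 * 5  = 3
Q(4, left)  = -1 + 1 * 2  = 1
Q(4, up)    = -1 + 1 * 8  = 7
Q(4, right) = -1 + 1 * 10 = 9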

11. All Q-values

This process is repeated until Q-values for all state-action pairs are computed.

12. Computing Q-values

To do this in code, we define a function compute_q_value that takes a state and an action as input. If the state is terminal, it returns None. Otherwise, it calculates the immediate reward and adds the discounted value of the next state, as per the Bellman equation.
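A minimal sketch of such a function is shown below. The course's actual environment isn't reproduced in this transcript, so the grid numbering (a 3-by-3 grid labeled 0 to 8, with state 4 in the centre), the terminal_state, and the transitions and V dictionaries are illustrative assumptions; only the rewards and the neighbouring state values for state 4 are taken from the slides.

gamma = 1                                  # discount factor used on the slides
terminal_state = 8                         # hypothetical terminal state

# State values of state 4's neighbours, as computed earlier in the course
V = {1: 8, 3: 2, 5: 10, 7: 5}

# (state, action) -> (next state, reward); deterministic movements
transitions = {
    (4, "up"): (1, -1),
    (4, "down"): (7, -2),
    (4, "left"): (3, -1),
    (4, "right"): (5, -1),
}

def compute_q_value(state, action):
    if state == terminal_state:            # terminal states have no action values
        return None
    next_state, reward = transitions[(state, action)]
    # Bellman relation: immediate reward plus discounted value of the next state
    return reward + gamma * V[next_state]

print(compute_q_value(4, "down"))          # 3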

13. Computing Q-values

Then, we define a dictionary Q that maps every state-action pair to a Q-value using a dictionary comprehension. For every state, and for every action available in that state, we calculate the Q-value using the compute_q_value() function. Finally, we print Q.
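Continuing the sketch above (same assumed compute_q_value; the states and actions collections here cover only the single state worked through on the slides):

states = [4]                               # illustrative; the course iterates over all nine states
actions = ["up", "down", "left", "right"]

# Map every state-action pair to its Q-value
Q = {(state, action): compute_q_value(state, action)
     for state in states
     for action in actions}

print(Q)   # {(4, 'up'): 7, (4, 'down'): 3, (4, 'left'): 1, (4, 'right'): 9}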

14. Computing Q-values

We see that the computed values match our manual calculations, providing a corresponding Q-value for each state-action pair.

15. Improving the policy

Now, since each state has four Q-values, each corresponding to an action, we can improve our initial policy

16. Improving the policy

by selecting the action with the highest Q-value for each state.

17. Improving the policy

Therefore, the new policy would be as follows, and we know it is better than the previous one because its state-values are greater than or equal to those of the old policy in every state.

18. Improving the policy

To do that in code, we create an empty dictionary to hold our improved_policy. For each state, we identify the action that yields the highest Q-value. We achieve this by using the max function combined with a lambda function that fetches the Q-value of each action within that state. The action with the highest Q-value is then mapped to its corresponding state in the improved_policy dictionary. Finally, we print the improved_policy.
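Continuing the same illustrative sketch, the greedy improvement step might look like this:

improved_policy = {}
for state in states:
    # Pick the action whose Q-value is highest in this state
    best_action = max(actions, key=lambda action: Q[(state, action)])
    improved_policy[state] = best_action

print(improved_policy)   # {4: 'right'} with the numbers used above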

19. Let's practice!

Time to put this into practice.