Policies and state-value functions

1. Policies and state-value functions

After exploring environment modeling with MDPs, we now turn to solving these environments. This involves delving into policies and state-value functions.

2. Policies

The core objective in RL is formulating effective policies, which act as road maps guiding the agent by specifying which action to take in each state so as to maximize the return.

3. Grid world example

Consider a custom grid-world environment where the agent aims to reach a diamond quickly, avoiding mountains that slow progress. The environment features nine states, with deterministic movements: up, down, right, or left.

4. Grid world example - rewards

The rewards are given based on states, where reaching the diamond yields a +10 reward;

5. Grid world example - rewards

encountering a mountain incurs a -2 penalty,

6. Grid world example - rewards

and moving in other states costs -1, to encourage reaching the diamond as quickly as possible. As we can see, multiple paths lead to the goal, and therefore many policies are possible.

7. Grid world example: policy

Suppose the agent follows this policy to reach the diamond. To define it in code, we create a dictionary mapping each state to an action: in state 0, the agent moves down, in state 1, it moves right, and so on. Then we initialize the environment, and at each step of the episode the agent chooses its action based on the policy and executes it.
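
The course's exact environment code isn't reproduced in this transcript, so here is a minimal, self-contained sketch of the idea. The 3x3 state numbering, the action encoding, the mountain positions, and the per-state actions beyond those mentioned above are assumptions made purely for illustration, and a hand-built transition table P stands in for the environment's dynamics.

```python
# Illustrative stand-in for the 3x3 grid world: layout, action encoding, and
# mountain positions are assumptions, not the course's exact environment.
UP, DOWN, RIGHT, LEFT = 0, 1, 2, 3
GOAL_STATE = 8            # diamond assumed in the bottom-right corner
MOUNTAINS = {6, 7}        # mountain placement assumed for illustration

def build_transitions():
    """P[state][action] = [(prob, next_state, reward, terminated)], deterministic."""
    moves = {UP: (-1, 0), DOWN: (1, 0), RIGHT: (0, 1), LEFT: (0, -1)}
    P = {}
    for state in range(9):
        row, col = divmod(state, 3)
        P[state] = {}
        for action, (dr, dc) in moves.items():
            next_state = min(max(row + dr, 0), 2) * 3 + min(max(col + dc, 0), 2)
            if next_state == GOAL_STATE:
                reward = 10   # reaching the diamond
            elif next_state in MOUNTAINS:
                reward = -2   # stepping onto a mountain
            else:
                reward = -1   # any other move
            P[state][action] = [(1.0, next_state, reward, next_state == GOAL_STATE)]
    return P

P = build_transitions()

# The policy: a dictionary mapping each state to an action
# (state 0 moves down, state 1 moves right, and so on; state 8 is terminal).
policy_one = {0: DOWN, 1: RIGHT, 2: DOWN, 3: DOWN, 4: DOWN,
              5: DOWN, 6: RIGHT, 7: RIGHT, 8: DOWN}

# Run one episode from state 0, choosing each action from the policy.
state, terminated = 0, False
while not terminated:
    action = policy_one[state]
    _, state, reward, terminated = P[state][action][0]
    print(f"moved to state {state}, reward {reward}")
```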

8. State-value functions

To evaluate a policy, we utilize state-value functions. These functions assess the worth of a state by calculating the discounted return an agent accumulates, starting from that state and adhering to the policy. This involves discounting rewards by a factor, gamma, over time, and summing these discounted rewards.
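
In standard notation (the symbols follow the usual convention rather than anything shown on the slide), the state-value of a state s under a policy pi is the expected discounted sum of the rewards collected from s onward:

$$V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right]$$

In a deterministic environment with a deterministic policy, the expectation disappears, and this is simply the sum of discounted rewards along the single trajectory the policy produces.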

9. Grid world example: State-values

In our example, there are nine states, so we need to compute nine state-values. For simplicity, we use a discount factor gamma of 1.

10. Value of goal state

Suppose the agent is born in the goal state. According to the policy, it won't move, therefore its return starting from that state and following the policy will be zero.

11. Value of state 5

If the agent starts in state 5, and follows the policy, it directly moves to the goal state, receiving a reward of 10. Therefore, the state-value is 10.

12. Value of state 2

Now, starting in state 2 and following the policy yields two rewards, -1 and 10, so the state-value is -1 + 10 = 9.

13. All state values

And so on, until all state-values are computed.

14. Bellman equation

In practice, the Bellman equation, a recursive formula, computes state values by combining the immediate reward of the current state with the discounted value of the next state, thereby connecting each state's value to its successors. In deterministic environments like ours, this standard formula suffices, whereas non-deterministic environments require modifications to incorporate transition probabilities.
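
Written out in the usual notation (again, standard symbols rather than anything taken from the slide), with s' denoting the state reached from s under the policy pi, the deterministic form is:

$$V_{\pi}(s) = R(s, \pi(s)) + \gamma\, V_{\pi}(s')$$

and the non-deterministic form sums over possible successor states using the transition probabilities:

$$V_{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma\, V_{\pi}(s') \right]$$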

15. Computing state-values

To code this, we create a compute_state_value() function that accepts a state. It returns 0 for terminal states. For non-terminal states, it selects the next action per the policy, computes the next_state and reward with env.unwrapped.P, and returns the state value using the Bellman equation: the immediate reward plus gamma times the next state's value.
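
A sketch of that function, reusing the stand-in transition table P and the assumed terminal state from the grid-world sketch above (the signature is illustrative; the policy, terminal state, and gamma are passed in explicitly so the same function can be reused for another policy later):

```python
def compute_state_value(state, policy, terminal_state, gamma):
    """Value of a state under the given policy, via the Bellman equation."""
    if state == terminal_state:
        return 0                                    # no further return from the goal
    action = policy[state]                          # action the policy prescribes
    # P stands in for env.unwrapped.P; entries are (prob, next_state, reward, terminated)
    _, next_state, reward, _ = P[state][action][0]
    # Immediate reward plus gamma times the value of the next state
    return reward + gamma * compute_state_value(next_state, policy, terminal_state, gamma)
```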

16. Computing state-values

We define the terminal state and gamma, then initialize a dictionary V with states as keys and their state-values, computed with the compute_state_value() function, as values. As we can see, the numerical results align with our manual calculations.
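
Continuing the sketch (the values for states 2, 5, and 8 match the manual calculations above; the remaining entries depend on the assumed mountain placement):

```python
terminal_state = 8
gamma = 1

# One state-value per state under policy_one
V_one = {state: compute_state_value(state, policy_one, terminal_state, gamma)
         for state in range(9)}
print(V_one)   # e.g. V_one[8] == 0, V_one[5] == 10, V_one[2] == 9
```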

17. Changing policies

Suppose we define a policy_two that asks the agent to always go right if possible and otherwise to go down. To compare it with policy_one, we compute its state-values just as before,
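
Under the same assumptions as the earlier sketch, policy_two and its state-values might look like this (state 8 is terminal, so its entry is never used):

```python
# Always go right if possible, otherwise go down
policy_two = {0: RIGHT, 1: RIGHT, 2: DOWN,
              3: RIGHT, 4: RIGHT, 5: DOWN,
              6: RIGHT, 7: RIGHT, 8: DOWN}

V_two = {state: compute_state_value(state, policy_two, terminal_state, gamma)
         for state in range(9)}
```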

18. Comparing policies

and compare them with the state-values calculated for policy_one. Since, for every state, the value under policy_two is greater than or equal to the value under policy_one, we can confidently say that policy_two is better: it yields at least as high an expected return from every state.
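
The comparison itself is a per-state check; with the layout assumed in the sketch, the same conclusion holds:

```python
# policy_two is at least as good if its value is >= policy_one's value in every state
print(all(V_two[s] >= V_one[s] for s in range(9)))   # True under the assumed layout
```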

19. Let's practice!

Time to practice!