
Policy iteration and value iteration

1. Policy iteration and value iteration

Welcome back! Now that you understand policy evaluation and improvement, let's dive into finding optimal policies with the policy iteration and value iteration algorithms.

2. Policy iteration

Policy iteration is an iterative process to find the optimal policy. Here's how it works: we start with a random policy.

3. Policy iteration

Then, we evaluate it by computing the state-value function.

4. Policy iteration

Then, improve it by choosing actions at each state that maximize the state-value function.

5. Policy iteration

This evaluate-improve cycle repeats until the policy stabilizes,

6. Policy iteration

signifying we've found the optimal policy.

7. Grid world

Let's apply policy iteration to our grid world environment starting with this policy.

8. Policy evaluation

To do that, we need several functions. The policy_evaluation() function receives a policy as input, computes state-values using compute_state_value(), and returns these values in a dictionary V.
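As a rough illustration, here is a minimal sketch of what these pieces could look like. The deterministic 3x3 grid world, the reward of 1 for stepping onto the goal, the discount factor gamma, and the bounded-horizon rollout inside compute_state_value are all assumptions made for this sketch, not the course's exact environment or code.

```python
# Minimal sketch (assumptions throughout): a deterministic 3x3 grid world with
# states 0-8, a goal at state 8, reward 1 for stepping onto the goal, discount gamma.
n_rows, n_cols = 3, 3
n_states = n_rows * n_cols
terminal_state = n_states - 1
actions = [0, 1, 2, 3]  # left, down, right, up
gamma = 0.9

def next_state(state, action):
    # Deterministic movement; bumping into a wall leaves the agent in place
    row, col = divmod(state, n_cols)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, n_rows - 1)
    elif action == 2:
        col = min(col + 1, n_cols - 1)
    else:
        row = max(row - 1, 0)
    return row * n_cols + col

def reward(state, action):
    # Assumed reward structure: 1 for stepping onto the goal, 0 otherwise
    return 1.0 if next_state(state, action) == terminal_state else 0.0

def compute_state_value(state, policy, max_steps=100):
    # Discounted return of following the policy from this state
    # (bounded horizon so the sketch never loops forever)
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        if state == terminal_state:
            break
        action = policy[state]
        total += discount * reward(state, action)
        discount *= gamma
        state = next_state(state, action)
    return total

def policy_evaluation(policy):
    # Evaluate the policy: map every state to its state-value under that policy
    V = {state: compute_state_value(state, policy) for state in range(n_states)}
    return V
```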

9. Policy improvement

The policy_improvement() function receives the policy we want to improve, initializes an improved_policy, and populates a Q dictionary using compute_q_value() for each state-action pair. For each state, max_action, the action with the highest Q-value, is computed, and the improved_policy for that state is updated. Finally, the improved_policy is returned.
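Continuing the sketch above (same assumed grid world and helpers), the improvement step could look like this; the compute_q_value signature here is a plausible stand-in, not necessarily the one shown on the slide.

```python
def compute_q_value(state, action, policy):
    # Q(s, a): immediate reward plus the discounted value of following
    # the policy from the resulting next state (sketch, not the course's code)
    if state == terminal_state:
        return 0.0
    return reward(state, action) + gamma * compute_state_value(next_state(state, action), policy)

def policy_improvement(policy):
    # Start from some default action in every non-terminal state
    improved_policy = {state: 0 for state in range(n_states - 1)}

    # Q-value of every state-action pair under the current policy
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(n_states - 1)
         for action in actions}

    # In each state, act greedily with respect to Q
    for state in range(n_states - 1):
        max_action = max(actions, key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action

    return improved_policy
```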

10. Policy iteration

policy_iteration() begins by initializing the policy we aim to optimize, then enters a loop with two main steps: policy_evaluation and policy_improvement, which yield the value function V and an improved_policy, respectively. If the improved_policy is identical to the old one, the algorithm has converged and the loop terminates; otherwise, the policy is updated for the next iteration. Finally, the policy and V are returned.
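Putting the two sketched helpers together, a minimal policy_iteration() could look like the following; again this is an illustrative sketch under the same assumptions, not the course's exact implementation.

```python
def policy_iteration():
    # Start from an arbitrary policy over the non-terminal states
    policy = {state: 0 for state in range(n_states - 1)}
    while True:
        V = policy_evaluation(policy)                 # evaluate the current policy
        improved_policy = policy_improvement(policy)  # act greedily w.r.t. its Q-values
        if improved_policy == policy:                 # no action changed: converged
            break
        policy = improved_policy
    return policy, V

policy, V = policy_iteration()
print(policy)  # in this toy grid, every state should point toward the bottom-right goal
```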

11. Optimal policy

Calling the function, we see that the new policy is indeed better: the agent reaches the goal more quickly.

12. Value iteration

Now let's move to value iteration, a more efficient algorithm. Value iteration simplifies and speeds up the process by combining policy evaluation and improvement into a single step. It directly computes the optimal state-value function and derives the policy from it. Here’s the process: we begin with an initial value function V, typically set to zero for all states.

13. Value iteration

Then, at each time step, we compute the action-value function using the state-values V of the previous iteration.

14. Value iteration

We update the value of each state by considering the maximum expected return achievable by any action from that state.

15. Value iteration

And we continue this process until the changes are below a small threshold that we define.

16. Value iteration

Only then, we consider that the algorithm converged and we found the optimal policy.

17. Implementing value-iteration

To implement value iteration, we start by initializing the state_values V and the policy as dictionaries filled with zeros, and we define a threshold variable. Then we start a loop: we initialize new_V with zeros, and for each state we get the max_action and max_q_value using a helper function we'll define next. We update new_V for that state with max_q_value, and the policy with max_action. Finally, if the difference between new_V and V is below the threshold, we break the loop; otherwise, we update V with new_V and iterate again.
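As a hedged sketch of that loop, reusing the toy grid world assumed earlier and the get_max_action_and_value helper sketched below:

```python
def value_iteration(threshold=1e-6):
    # State-values and policy initialized to zeros
    V = {state: 0.0 for state in range(n_states)}
    policy = {state: 0 for state in range(n_states - 1)}

    while True:
        new_V = {state: 0.0 for state in range(n_states)}
        for state in range(n_states - 1):
            # Greedy action and its Q-value, computed from the previous V
            max_action, max_q_value = get_max_action_and_value(state, V)
            new_V[state] = max_q_value
            policy[state] = max_action
        # Stop once no state-value changes by more than the threshold
        if all(abs(new_V[state] - V[state]) < threshold for state in V):
            break
        V = new_V

    return policy, V
```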

18. Getting optimal actions and values

The get_max_action_and_value() function calculates the Q_values for a given state using V. It identifies max_action, the action with the highest Q-value, and the corresponding max_q_value, and returns both.
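A sketch of this helper under the same assumptions, relying on the V-based Q-value computation described on the next slide:

```python
def get_max_action_and_value(state, V):
    # Q-value of every action in this state, computed from the current V
    Q_values = {action: compute_action_value(state, action, V) for action in actions}
    # Greedy action and its Q-value
    max_action = max(Q_values, key=Q_values.get)
    max_q_value = Q_values[max_action]
    return max_action, max_q_value
```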

19. Computing Q-values

Note that in the compute_action_value function, the return statement has been modified. Rather than using compute_state_value to compute the value of the next state, it now utilizes the V dictionary, which contains values from the previous iteration. This modification is crucial for updating state-values iteratively.
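In terms of the sketch above, this V-based variant might look like the following one-step lookahead; the name and signature are assumptions for the sketch.

```python
def compute_action_value(state, action, V):
    # One-step lookahead: immediate reward plus the discounted value of the
    # next state, looked up in V instead of being recomputed recursively
    if state == terminal_state:
        return 0.0
    return reward(state, action) + gamma * V[next_state(state, action)]
```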

20. Optimal policy

Applying this algorithm to our environment leads to the same outcome as policy iteration, confirming that both algorithms converge to the optimal solution.

21. Let's practice!

Time to practice!