1. Policy iteration and value iteration
Welcome back! Now that you understand policy evaluation and improvement, let's dive into finding optimal policies with the policy iteration and value iteration algorithms.
2. Policy iteration
Policy iteration is an iterative process to find the optimal policy. Here's how it works:
we start with a random policy.
3. Policy iteration
Then, we evaluate it by computing the state-value function.
4. Policy iteration
Then, we improve it by choosing, in each state, the action with the highest Q-value.
5. Policy iteration
This evaluate-improve cycle repeats until the policy stabilizes,
6. Policy iteration
signifying we've found the optimal policy.
7. Grid world
Let's apply policy iteration to our grid world environment starting with this policy.
8. Policy evaluation
To do that, we need several functions. The policy_evaluation() function receives a policy as input,
computes state-values using compute_state_value(),
and returns these values in a dictionary V.
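As a rough sketch (the exact signatures may differ in the course code), policy_evaluation() could look like this, assuming compute_state_value(state, policy) from the previous video and a hypothetical num_states constant describing the grid world:

```python
# Minimal sketch: assumes compute_state_value(state, policy) from the previous
# video and a hypothetical num_states constant for the grid world.
def policy_evaluation(policy):
    # Compute the state-value of every state under the given policy
    V = {state: compute_state_value(state, policy) for state in range(num_states)}
    return V
```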
9. Policy improvement
policy_improvement() receives the policy we want to improve,
initializes an improved_policy,
and populates a Q dictionary using compute_q_value() for each state-action pair.
For each state, max_action, the action with the highest Q-value, is computed, and the improved_policy for that state is updated.
Finally, the improved_policy is returned.
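A sketch of this function, assuming compute_q_value(state, action, policy) from an earlier video plus hypothetical num_states and actions variables describing the grid world:

```python
# Sketch: assumes compute_q_value(state, action, policy) from an earlier video,
# plus hypothetical num_states and actions (e.g. actions = range(4)).
def policy_improvement(policy):
    # Start from a placeholder policy that we overwrite state by state
    improved_policy = {state: 0 for state in range(num_states)}

    # Q-value of every state-action pair under the current policy
    Q = {(state, action): compute_q_value(state, action, policy)
         for state in range(num_states) for action in actions}

    for state in range(num_states):
        # Pick the action with the highest Q-value in this state
        max_action = max(actions, key=lambda action: Q[(state, action)])
        improved_policy[state] = max_action

    return improved_policy
```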
10. Policy iteration
policy_iteration() begins by initializing the policy we aim to optimize,
then enters a loop where two main steps are executed: policy_evaluation and policy_improvement, yielding the value function V and an improved_policy, respectively.
If the improved_policy is identical to the old one, it signifies convergence, prompting the loop to terminate. Otherwise, the policy is updated for the next iteration.
Finally, the policy and V are returned.
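Putting the two sketches together, policy_iteration() might look like the following; the initial policy (always taking action 0) is an assumption:

```python
# Sketch: relies on the policy_evaluation() and policy_improvement() sketches
# above; starting from action 0 in every state is an assumption.
def policy_iteration():
    policy = {state: 0 for state in range(num_states)}  # arbitrary starting policy
    while True:
        V = policy_evaluation(policy)
        improved_policy = policy_improvement(policy)
        if improved_policy == policy:  # no state changed its action: converged
            break
        policy = improved_policy
    return policy, V
```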
11. Optimal policy
Calling the function, we see that the new policy is indeed better, as the agent reaches the goal more quickly.
12. Value iteration
Now let's move to value iteration, a more efficient algorithm. Value iteration simplifies and speeds up the process by
combining policy evaluation and improvement into a single step.
It directly computes the optimal state-value function
and derives the policy from it.
Here’s the process: we begin with an initial value function V, typically set to zero for all states.
13. Value iteration
Then, at each time step, we compute the action-value function using the state-values V of the previous iteration.
14. Value iteration
We update the value of each state by considering the maximum expected return achievable by any action from that state.
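In the usual Bellman optimality notation (assuming transition probabilities P, rewards R, and discount factor gamma, which the slide does not spell out), this update reads:

$$
V_{k+1}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, V_k(s')\bigr]
$$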
15. Value iteration
And we continue this process until the changes are below a small threshold that we define.
16. Value iteration
Only then do we consider that the algorithm has converged and that we've found the optimal policy.
17. Implementing value-iteration
To implement value iteration, we start by initializing the state_values V and the policy as dictionaries with zero values. Additionally, we define a threshold variable.
Then, we start a loop, initializing new_V with zeros, and
for each state, we get the max_action and max_q_value using a helper function we'll define next.
We update new_V of that state with max_q_value, and the policy with max_action.
Finally, if the difference between new_V and V is below the given threshold, we break the loop; otherwise, we update V with new_V and iterate again.
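As a sketch, assuming the same hypothetical num_states as before, a chosen threshold value, and the helper defined on the next slide:

```python
# Sketch: num_states and the threshold value are assumptions; the helper
# get_max_action_and_value(state, V) is defined on the next slide.
def value_iteration(threshold=0.001):
    V = {state: 0 for state in range(num_states)}
    policy = {state: 0 for state in range(num_states)}

    while True:
        new_V = {state: 0 for state in range(num_states)}
        for state in range(num_states):
            max_action, max_q_value = get_max_action_and_value(state, V)
            new_V[state] = max_q_value
            policy[state] = max_action

        # Stop once every state-value changed by less than the threshold
        if all(abs(new_V[state] - V[state]) < threshold for state in range(num_states)):
            break
        V = new_V

    return policy, V
```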
18. Getting optimal actions and values
The get_max_action_and_value() function first calculates the Q_values for a given state using V.
It identifies the max_action, the action associated with the highest Q_value,
and then determines the corresponding max_q_value.
Ultimately, the function returns both the max_action and its max_q_value.
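A sketch, assuming integer actions (e.g. actions = range(4)) and the Q-value helper described on the next slide:

```python
# Sketch: assumes integer actions (e.g. actions = range(4)) and a Q-value
# helper compute_action_value(state, action, V) shown on the next slide.
def get_max_action_and_value(state, V):
    Q_values = [compute_action_value(state, action, V) for action in actions]
    max_action = max(actions, key=lambda action: Q_values[action])
    max_q_value = Q_values[max_action]
    return max_action, max_q_value
```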
19. Computing Q-values
Note that in the compute_action_value function, the return statement has been modified. Rather than using compute_state_value to compute the value of the next state, it now utilizes the V dictionary, which contains values from the previous iteration. This modification is crucial for updating state-values iteratively.
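A sketch of that modified helper, assuming a deterministic grid world with a hypothetical get_next_state_and_reward(state, action) function, a terminal_state, and a discount factor gamma:

```python
# Sketch: get_next_state_and_reward(), terminal_state and gamma are assumptions
# about the grid world; only the return statement differs from the earlier
# Q-value helper, which called compute_state_value() instead of reading V.
def compute_action_value(state, action, V):
    if state == terminal_state:
        return 0  # no future return once the goal is reached (assumption)
    next_state, reward = get_next_state_and_reward(state, action)
    # Look up the next state's value from the previous iteration's V
    return reward + gamma * V[next_state]
```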
20. Optimal policy
Applying this algorithm to our environment leads to the same outcome as policy iteration, confirming that both algorithms converge to the optimal solution.
21. Let's practice!
Time to practice!