Improving a policy
In the previous exercise, you computed the Q-values for each state-action pair in the MyGridWorld environment. Now, you'll use these Q-values to improve the existing policy. Policy improvement is a critical step in reinforcement learning: in each state, you replace the current choice with the action that has the highest Q-value, that is, the highest expected return. After improving the policy, you will render the agent's movements under this improved policy.
The environment has been imported as env, along with the Q-values as Q and the render() function.
This exercise is part of the course Reinforcement Learning with Gymnasium in Python.
Exercise instructions
- Find the best action for each state based on Q-values.
- Select the right action based on the improved_policy.
- Execute the selected action to observe its outcome.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
improved_policy = {}

for state in range(num_states-1):
    # Find the best action for each state based on Q-values
    max_action = ____
    improved_policy[state] = max_action

terminated = False
while not terminated:
    # Select action based on policy
    action = ____
    # Execute the action
    state, reward, terminated, truncated, info = ____
    render()
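
To see how a loop like the one above behaves once a policy is in place, here is a minimal, self-contained sketch. It does not use MyGridWorld: LineWorld is a hypothetical stand-in environment written for this example, and demo_env, demo_policy, visited and total_reward are illustrative names. The only things borrowed from Gymnasium are the shapes of the return values, where reset() gives (observation, info) and step() gives (observation, reward, terminated, truncated, info).

```python
class LineWorld:
    """Hypothetical stand-in for env: 4 cells in a row, goal at the right end."""

    def __init__(self):
        self.num_states = 4
        self.num_actions = 2  # 0 = left, 1 = right
        self.state = 0

    def reset(self):
        # Mirrors the Gymnasium convention of returning (observation, info).
        self.state = 0
        return self.state, {}

    def step(self, action):
        # Move one cell left or right, staying inside the row.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.num_states - 1)
        terminated = self.state == self.num_states - 1
        reward = 10 if terminated else -1
        # Five values, as in Gymnasium: obs, reward, terminated, truncated, info.
        return self.state, reward, terminated, False, {}


demo_env = LineWorld()

# Assumed improved policy for the non-terminal states: always move right.
demo_policy = {0: 1, 1: 1, 2: 1}

state, info = demo_env.reset()
terminated = truncated = False
visited = [state]
total_reward = 0

while not (terminated or truncated):
    # Select the action the policy prescribes for the current state.
    action = demo_policy[state]
    # Execute it and observe the outcome.
    state, reward, terminated, truncated, info = demo_env.step(action)
    visited.append(state)
    total_reward += reward

print(visited, total_reward)  # [0, 1, 2, 3] 8
```

Unlike the exercise loop, this sketch also stops when truncated is set. Gymnasium uses that flag for episodes cut short by a time limit, so checking it guards against a loop that never ends if a policy fails to reach a terminal state.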