
Improving a policy

In the previous exercise, you computed the Q-values for each state-action pair in the MyGridWorld environment. Now, you'll use these Q-values to improve the existing policy. Policy improvement is a key step in reinforcement learning: in each state, you replace the current action with the one that maximizes the expected utility (Q-value). After improving the policy, you will render the agent's movements as it follows the improved policy.
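
As a rough illustration of the improvement step described above, here is a minimal sketch on a made-up problem. It assumes the Q-values are stored in a dict keyed by (state, action) pairs; the numbers, the `greedy_policy` helper, and the sizes are hypothetical and are not taken from MyGridWorld.

```python
# Hypothetical Q-values for a toy problem with 3 states and 2 actions,
# keyed by (state, action). The numbers are made up for illustration.
Q = {
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.7, (1, 1): 0.2,
    (2, 0): 0.0, (2, 1): 0.9,
}
num_states, num_actions = 3, 2


def greedy_policy(Q, num_states, num_actions):
    """Return a dict mapping each state to its highest-valued action."""
    policy = {}
    for state in range(num_states):
        # Compare the Q-values of all actions in this state and keep the best one
        policy[state] = max(range(num_actions), key=lambda a: Q[(state, a)])
    return policy


policy = greedy_policy(Q, num_states, num_actions)
print(policy)  # {0: 1, 1: 0, 2: 1}
```

If two actions tie in a state, `max` returns the first one it encounters, so the lower-numbered action wins.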

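The environment has been loaded as env, along with the Q-values as Q and the render() function. The sample code also uses num_states, the number of states in the environment, which is assumed to be defined as well.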

This exercise is part of the course

Reinforcement Learning with Gymnasium in Python


Exercise instructions

  • Find the best action for each state based on Q-values.
  • Select the right action based on the improved_policy.
  • Execute the selected action to observe its outcome.
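
The last two steps (select an action from the policy, then execute it) can be sketched on a stand-in environment. `ToyCorridor` and `run_episode` below are hypothetical; the class only mimics the return shapes of Gymnasium's `reset()` and `step()` and is not the MyGridWorld environment used in this exercise.

```python
class ToyCorridor:
    """Hypothetical 4-cell corridor: start in cell 0, goal in cell 3.

    Action 0 moves left, action 1 moves right.
    """

    def __init__(self):
        self.n_cells = 4
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor
        self.state = min(max(self.state + move, 0), self.n_cells - 1)
        terminated = self.state == self.n_cells - 1
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}


def run_episode(env, policy, max_steps=20):
    """Follow `policy` from the start state and return the visited states."""
    state, info = env.reset()
    visited = [state]
    for _ in range(max_steps):  # cap guards against a policy that never reaches the goal
        action = policy[state]  # look up the action for the current state
        state, reward, terminated, truncated, info = env.step(action)
        visited.append(state)
        if terminated or truncated:
            break
    return visited


# No entry for the goal cell: no action is taken once the episode ends
always_right = {0: 1, 1: 1, 2: 1}
path = run_episode(ToyCorridor(), always_right)
print(path)  # [0, 1, 2, 3]
```

The step cap is a design choice for this sketch: a poor policy could otherwise loop forever, since the toy environment never sets `truncated`.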

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

improved_policy = {}

for state in range(num_states-1):
    # Find the best action for each state based on Q-values
    max_action = ____
    improved_policy[state] = max_action

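# Reset so that state holds the agent's actual starting position,
# not the last value left over from the loop above
state, info = env.reset()
terminated = False
while not terminated:
    # Select action based on policy
    action = ____
    # Execute the action
    state, reward, terminated, truncated, info = ____
    render()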