
Improving a policy

In the previous exercise, you computed the Q-values for each state-action pair in the MyGridWorld environment. Now, you'll use these Q-values to improve the existing policy. Policy improvement is a critical step in reinforcement learning, where you enhance the policy by choosing actions that maximize the expected utility (Q-value) in each state. After improving the policy, you will render the new movements according to this improved policy.
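The greedy improvement step described above can be sketched on a toy example. This sketch assumes the Q-values are stored in a dictionary keyed by `(state, action)` pairs. That layout, the made-up numbers, and the helper name `greedy_policy` are illustrative assumptions, not the exercise's actual data.

```python
# Toy Q-table: keys are (state, action) pairs, values are Q-values.
# The numbers are invented purely for illustration.
Q = {
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.7, (1, 1): 0.2,
    (2, 0): 0.0, (2, 1): 0.3,
}
num_states = 3
num_actions = 2


def greedy_policy(Q, num_states, num_actions):
    """Return a dict mapping each state to the action with the highest Q-value."""
    policy = {}
    for state in range(num_states):
        # Compare the Q-values of all actions available in this state
        policy[state] = max(range(num_actions), key=lambda a: Q[(state, a)])
    return policy


improved_policy = greedy_policy(Q, num_states, num_actions)
print(improved_policy)  # → {0: 1, 1: 0, 2: 1}
```

If two actions tie in a state, `max` returns the first one it encounters, so the lowest-numbered action wins.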

The environment has been imported as env, along with the Q-values as Q, and the render() function.

This exercise is part of the course

Reinforcement Learning with Gymnasium in Python


Instructions

  • Find the best action for each state based on Q-values.
  • Select the right action based on the improved_policy.
  • Execute the selected action to observe its outcome.
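As a rough illustration of the last two steps, the sketch below follows a fixed policy in a small stand-in environment. `TinyCorridor` is a hypothetical class written for this example, not MyGridWorld. It only mimics the Gymnasium convention that `reset()` returns `(state, info)` and `step()` returns `(state, reward, terminated, truncated, info)`.

```python
class TinyCorridor:
    """Hypothetical stand-in environment: states 0..3 on a line, state 3 is the goal."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}

    def step(self, action):
        # action 0 = left, action 1 = right; positions are clipped to [0, 3]
        move = 1 if action == 1 else -1
        self.state = min(3, max(0, self.state + move))
        terminated = self.state == 3
        reward = 1 if terminated else 0
        return self.state, reward, terminated, False, {}


env = TinyCorridor()
# Always move right; the terminal state 3 needs no entry in the policy
policy = {0: 1, 1: 1, 2: 1}

state, info = env.reset()
terminated = truncated = False
visited = [state]
while not (terminated or truncated):
    # Select the action the policy prescribes for the current state
    action = policy[state]
    # Execute it and observe the outcome
    state, reward, terminated, truncated, info = env.step(action)
    visited.append(state)

print(visited)  # → [0, 1, 2, 3]
```

The policy here has no entry for the goal state because the episode ends as soon as that state is reached, so no action is ever looked up for it.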

Hands-on interactive exercise

Try this exercise by completing this sample code.

improved_policy = {}

for state in range(num_states-1):
    # Find the best action for each state based on Q-values
    max_action = ____
    improved_policy[state] = max_action

terminated = False
while not terminated:
    # Select action based on policy
    action = ____
    # Execute the action
    state, reward, terminated, truncated, info = ____
    render()