Improving a policy
In the previous exercise, you computed the Q-values for each state-action pair in the MyGridWorld
environment. Now, you'll use these Q-values to improve the existing policy. Policy improvement is a critical step in reinforcement learning, where you enhance the policy by choosing actions that maximize the expected utility (Q-value) in each state. After improving the policy, you will render the new movements according to this improved policy.
The environment has been imported as env, along with the Q-values as Q and the render() function.
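As a rough illustration of the improvement step described above, here is a minimal sketch on a made-up three-state problem. The toy Q-table, its (state, action) key layout, the helper name greedy_policy, and the convention that the last state is terminal are all assumptions for this example, not details taken from MyGridWorld.

```python
# Hypothetical toy Q-table: keys are (state, action) pairs, values are Q-values.
num_states, num_actions = 3, 2
Q = {
    (0, 0): 1.0, (0, 1): 2.5,
    (1, 0): 0.7, (1, 1): 0.2,
    (2, 0): 0.0, (2, 1): 0.0,  # assumed terminal state, no decision needed
}

def greedy_policy(Q, num_states, num_actions):
    """Return {state: action}, picking the highest-Q action in each non-terminal state."""
    policy = {}
    for state in range(num_states - 1):  # assumes the last state is terminal
        policy[state] = max(range(num_actions), key=lambda a: Q[(state, a)])
    return policy

improved_policy = greedy_policy(Q, num_states, num_actions)
print(improved_policy)  # {0: 1, 1: 0}
```

In state 0 the second action has the larger Q-value (2.5 versus 1.0), so the greedy policy picks action 1 there; in state 1 it picks action 0.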
This exercise is part of the course
Reinforcement Learning with Gymnasium in Python
Instructions
- Find the best action for each state based on Q-values.
- Select the right action based on the improved_policy.
- Execute the selected action to observe its outcome.
Hands-on interactive exercise
Try this exercise by completing this sample code.
improved_policy = {}
for state in range(num_states-1):
    # Find the best action for each state based on Q-values
    max_action = ____
    improved_policy[state] = max_action
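# Start from the initial state: the loop above reused `state` as its loop variable
state, info = env.reset()
terminated = False
while not terminated:
    # Select action based on policy
    action = ____
    # Execute the action
    state, reward, terminated, truncated, info = ____
    render()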
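The rollout pattern in the template can be sketched without the course environment. The ToyCorridor class below is a hypothetical stand-in that only mimics the Gymnasium-style reset() and step() return shapes; the run_policy helper and the max_steps cap are likewise assumptions added for illustration, so that a policy that never reaches the goal cannot loop forever.

```python
class ToyCorridor:
    """Hypothetical stand-in environment: states 0..3 in a row, goal at state 3."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state, {}  # (observation, info)

    def step(self, action):
        # Action 1 moves right; any other action moves left (clamped at 0).
        self.state = max(0, self.state + (1 if action == 1 else -1))
        terminated = self.state == 3
        reward = 1.0 if terminated else 0.0
        return self.state, reward, terminated, False, {}

def run_policy(env, policy, max_steps=50):
    """Follow a {state: action} policy until the episode ends or max_steps is hit."""
    state, info = env.reset()
    visited = [state]
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy[state]
        state, reward, terminated, truncated, info = env.step(action)
        visited.append(state)
        total_reward += reward
        if terminated or truncated:
            break
    return visited, total_reward

visited, total_reward = run_policy(ToyCorridor(), {0: 1, 1: 1, 2: 1})
print(visited, total_reward)  # [0, 1, 2, 3] 1.0
```

Checking both terminated and truncated is a slightly more defensive choice than the template's loop, which stops only on terminated.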