
Evaluating a policy on a slippery Frozen Lake

In a slippery Frozen Lake environment, merely reading the policy off a learned Q-table isn't enough to gauge its effectiveness: transitions are stochastic, so the same policy can earn different returns from one episode to the next. To assess a learned policy reliably, you must play multiple episodes and observe the average reward achieved. This exercise compares the learned policy against a baseline established by following a random policy during training. Your task is to execute the learned policy over several episodes and analyze its performance based on the average reward collected, contrasting it with the average reward collected during the random-policy phase.

The Q-table Q, num_states, num_actions, and avg_reward_per_random_episode have been pre-loaded for you. The NumPy library has been imported as np.
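For context, here is a minimal sketch of how these pre-loaded objects might be created. The names mirror the exercise, but this is an assumption, not the course's exact code, and the Q-learning training phase itself is omitted:

import gymnasium as gym
import numpy as np

# Hypothetical setup mirroring the pre-loaded variables
env = gym.make("FrozenLake-v1", is_slippery=True)
num_states = env.observation_space.n      # 16 states on the default 4x4 map
num_actions = env.action_space.n          # 4 actions: left, down, right, up
Q = np.zeros((num_states, num_actions))   # would be filled by a prior Q-learning phase
# avg_reward_per_random_episode would be recorded during the random-policy phase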

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.


Exercise instructions

  • In each iteration, select the best action to take based on the learned Q-table Q.
  • Compute the average reward per learned episode avg_reward_per_learned_episode.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

reward_per_learned_episode = []  # return collected in each evaluation episode
for episode in range(10000):
    state, info = env.reset()
    terminated = False
    truncated = False
    episode_reward = 0
    # Play until the episode ends, whether by termination or truncation
    while not (terminated or truncated):
        # Select the best action based on the learned Q-table
        action = ____
        new_state, reward, terminated, truncated, info = env.step(action)
        state = new_state
        episode_reward += reward
    reward_per_learned_episode.append(episode_reward)
# Compute and print the average reward per learned episode
avg_reward_per_learned_episode = ____
print("Average reward per learned episode: ", avg_reward_per_learned_episode)
print("Average reward per random episode: ", ____)