Implementing first-visit Monte Carlo
The goal of Monte Carlo algorithms is to estimate the Q-table from sampled episodes in order to derive an optimal policy. In this exercise, you will implement the first-visit Monte Carlo method to estimate the action-value function Q, and then compute the optimal policy to solve the custom environment you've seen in the previous exercise. When computing returns, assume a discount factor of 1.
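Because the discount factor is 1, the return from any time step reduces to the plain sum of the rewards collected from that step to the end of the episode. A minimal sketch of that computation is shown below; episode and t are hypothetical placeholders here, with episode assumed to be a list of (state, action, reward) tuples as in the sample code further down.

# With a discount factor of 1, the return from step t is the
# undiscounted sum of all remaining rewards in the episode.
G = sum(reward for (_, _, reward) in episode[t:])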
The numpy arrays Q, returns_sum, and returns_count, storing the Q-values, the cumulative sum of rewards, and the count of visits for each state-action pair, respectively, have been initialized and pre-loaded for you.
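For reference, here is one way such arrays could be initialized; the state and action counts below are hypothetical placeholders, since the real dimensions come from the custom environment.

import numpy as np

# Hypothetical environment size; the pre-loaded arrays use the
# dimensions of the custom environment instead.
num_states, num_actions = 16, 4

Q = np.zeros((num_states, num_actions))              # action-value estimates
returns_sum = np.zeros((num_states, num_actions))    # cumulative returns per (state, action) pair
returns_count = np.zeros((num_states, num_actions))  # number of first visits per (state, action) pair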
Exercise instructions
- Define the if condition that should be tested in the first-visit Monte Carlo algorithm.
- Update the returns (returns_sum), their counts (returns_count), and the visited_states.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
for i in range(100):
    episode = generate_episode()
    visited_states = set()
    for j, (state, action, reward) in enumerate(episode):
        # Define the first-visit condition
        if ____ not in ____:
            # Update the returns, their counts and the visited states
            returns_sum[state, action] += ____([____ for ____ in ____])
            returns_count[state, action] += ____
            visited_states.____(____)

nonzero_counts = returns_count != 0
Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
render_policy(get_policy())
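After you have attempted the blanks, one possible completed version of the loop is sketched below for comparison; generate_episode(), get_policy(), and render_policy() are assumed to be the pre-loaded helpers from this exercise's environment, and the blanks are filled in with one way the first-visit updates can be written.

for i in range(100):
    episode = generate_episode()
    visited_states = set()
    for j, (state, action, reward) in enumerate(episode):
        # First-visit condition: only the first occurrence of a
        # state-action pair in the episode contributes to the return.
        if (state, action) not in visited_states:
            # With a discount factor of 1, the return is the sum of
            # rewards from this step to the end of the episode.
            returns_sum[state, action] += sum([r for (_, _, r) in episode[j:]])
            returns_count[state, action] += 1
            visited_states.add((state, action))

nonzero_counts = returns_count != 0
Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
render_policy(get_policy())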