
Implementing first-visit Monte Carlo

The goal of Monte Carlo algorithms is to estimate the Q-table in order to derive an optimal policy. In this exercise, you will implement the first-visit Monte Carlo method to estimate the action-value function Q, and then compute the optimal policy to solve the custom environment you saw in the previous exercise. When computing the return, assume a discount factor of 1, so the return from a given time step is simply the sum of all rewards collected from that step onward.

The NumPy arrays Q, returns_sum, and returns_count, which store the Q-values, the accumulated returns, and the visit counts for each state-action pair, respectively, have been initialized and pre-loaded for you.
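For reference, here is a minimal sketch of how such arrays might be set up for a discrete Gymnasium environment. The environment name, the env handle, and the num_states/num_actions variables are illustrative assumptions; the exercise's own environment and arrays are already pre-loaded for you.

import numpy as np
import gymnasium as gym

# Hypothetical discrete environment; the exercise uses its own custom one
env = gym.make("FrozenLake-v1")
num_states = env.observation_space.n
num_actions = env.action_space.n

# One entry per state-action pair, all starting at zero
Q = np.zeros((num_states, num_actions))
returns_sum = np.zeros((num_states, num_actions))
returns_count = np.zeros((num_states, num_actions))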

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.

Exercise instructions

  • Define the if condition that should be tested in the first-visit Monte Carlo algorithm.
  • Update the returns (returns_sum), their counts (returns_count), and the visited_states set.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

for i in range(100):
  episode = generate_episode()
  visited_states = set()
  for j, (state, action, reward) in enumerate(episode):
    # Define the first-visit condition
    if ____ not in ____:
      # Update the returns, their counts and the visited states
      returns_sum[state, action] += ____([____ for ____ in ____])
      returns_count[state, action] += ____
      visited_states.____(____)

# Average the accumulated returns for each visited state-action pair
nonzero_counts = returns_count != 0
Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
render_policy(get_policy())
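If you get stuck, one possible completion of the loop is sketched below. It assumes generate_episode() returns a list of (state, action, reward) tuples and treats the first occurrence of each state-action pair in an episode as the only update point; since the discount factor is 1, the return from step j is just the sum of the remaining rewards in the episode.

for i in range(100):
  episode = generate_episode()
  visited_states = set()
  for j, (state, action, reward) in enumerate(episode):
    # First-visit condition: update only on the first occurrence of this pair
    if (state, action) not in visited_states:
      # With a discount factor of 1, the return is the sum of rewards from step j onward
      returns_sum[state, action] += sum([r for (s, a, r) in episode[j:]])
      returns_count[state, action] += 1
      visited_states.add((state, action))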