
Implementing first-visit Monte Carlo

The goal of Monte Carlo algorithms is to estimate the Q-table in order to derive an optimal policy. In this exercise, you will implement the First-Visit Monte Carlo method to estimate the action-value function Q, and then compute the optimal policy to solve the custom environment you've seen in the previous exercise. Whenever computing the return, assume a discount factor of 1.
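
With a discount factor of 1, the return from a given time step is simply the sum of all rewards collected from that step until the end of the episode. As a minimal sketch (the rewards and step index below are hypothetical, not part of the preloaded code):

rewards = [0, 0, 1, 0, 5]   # hypothetical rewards collected during one episode
j = 2                       # step at which a state-action pair is first visited
G = sum(rewards[j:])        # return from step j with a discount factor of 1
print(G)                    # prints 6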

The numpy arrays Q, returns_sum, and returns_count, which store the Q-values, the cumulative sum of returns, and the count of first visits for each state-action pair, respectively, have been initialized and pre-loaded for you.
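
For reference, arrays like these are typically created as zero arrays shaped by the number of states and actions. A minimal sketch of how they might have been initialized (the dimensions are assumptions; the real environment may differ):

import numpy as np

num_states, num_actions = 9, 4                        # assumed sizes of the custom environment
Q = np.zeros((num_states, num_actions))               # action-value estimates
returns_sum = np.zeros((num_states, num_actions))     # cumulative returns per state-action pair
returns_count = np.zeros((num_states, num_actions))   # number of first visits per state-action pair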

This exercise is part of the course

Reinforcement Learning with Gymnasium in Python


Instructions

  • Define the if condition that should be tested in the first-visit Monte Carlo algorithm.
  • Update the returns (returns_sum), their counts (returns_count), and the visited_states set.

Interactive practice exercise

Try this exercise by completing the sample code below.

for i in range(100):
  episode = generate_episode()
  visited_states = set()
  for j, (state, action, reward) in enumerate(episode):
    # Define the first-visit condition
    if ____ not in ____:
      # Update the returns, their counts and the visited states
      returns_sum[state, action] += ____([____ for ____ in ____])
      returns_count[state, action] += ____
      visited_states.____(____)

nonzero_counts = returns_count != 0

Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
render_policy(get_policy())
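
For comparison, one possible way to fill in the blanks is sketched below. It assumes generate_episode returns a list of (state, action, reward) tuples and that render_policy and get_policy are the preloaded helpers used above; treat it as a sketch rather than the official solution.

for i in range(100):
  episode = generate_episode()
  visited_states = set()
  for j, (state, action, reward) in enumerate(episode):
    # First-visit condition: only the first occurrence of (state, action) counts
    if (state, action) not in visited_states:
      # Return from step j onward; with a discount factor of 1 it is a plain sum of rewards
      returns_sum[state, action] += sum([x[2] for x in episode[j:]])
      returns_count[state, action] += 1
      visited_states.add((state, action))

# Average the accumulated returns only where at least one visit was recorded
nonzero_counts = returns_count != 0
Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
render_policy(get_policy())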