
Implementing every-visit Monte Carlo

The Every-Visit Monte Carlo method differs from the First-Visit variant by updating values every time a state-action pair appears, rather than only on first encounters. While this approach provides a comprehensive evaluation of the policy by utilizing all the available information from the episodes, it may also introduce more variance in the value estimates because it includes all samples, regardless of when they occur in the episode. Your task is to complete the implementation of the every_visit_mc() function, which estimates the action-value function Q over num_episodes episodes.
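As a rough illustration of that difference, here is a small sketch built around a made-up three-step episode (the episode contents, variable names, and the use of undiscounted returns are illustrative assumptions, not part of the exercise): the every-visit estimate collects a return sample for both occurrences of the repeated state-action pair, while the first-visit estimate keeps only the first.

# Hypothetical 3-step episode; the pair (state=0, action=1) occurs at steps 0 and 2
episode = [(0, 1, 1.0), (2, 0, 0.0), (0, 1, 2.0)]

first_visit_returns, every_visit_returns = {}, {}
seen = set()
for j, (state, action, reward) in enumerate(episode):
    G = sum(x[2] for x in episode[j:])  # undiscounted return from step j onward
    every_visit_returns.setdefault((state, action), []).append(G)
    if (state, action) not in seen:  # first-visit keeps only the first occurrence
        seen.add((state, action))
        first_visit_returns.setdefault((state, action), []).append(G)

print(first_visit_returns[(0, 1)])  # [3.0]       one return sample
print(every_visit_returns[(0, 1)])  # [3.0, 2.0]  one sample per occurrence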

The dictionaries returns_sum and returns_count, with state-action pairs as keys, have been initialized and pre-loaded for you, along with the generate_episode() function.

This exercise is part of the course Reinforcement Learning with Gymnasium in Python.

Exercise instructions

  • Generate an episode using the generate_episode() function.
  • Update the returns and their counts for each state-action pair within an episode.
  • Compute the estimated Q-values.

Hands-on interactive exercise

Complete this exercise by filling in the sample code below.

Q = np.zeros((num_states, num_actions))
for i in range(num_episodes):
  # Generate an episode
  episode = ____
  # Update the returns and their counts
  for j, (state, action, reward) in ____:
    returns_sum[(state, action)] += sum(____)
    returns_count[(state, action)] += ____

# Update the Q-values for visited state-action pairs
nonzero_counts = ____
Q[nonzero_counts] = ____

render_policy(get_policy())
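
For reference, here is a minimal sketch of what a completed every_visit_mc() might look like. It follows the structure of the scaffold above but makes some assumptions: num_states, num_actions, and generate_episode (pre-loaded globals in the exercise environment) are passed in as parameters so the sketch is self-contained; generate_episode() returns a list of (state, action, reward) tuples; returns_sum and returns_count are treated as NumPy arrays indexed by (state, action), which the boolean-mask lines in the scaffold suggest; and returns are undiscounted (gamma = 1).

import numpy as np

def every_visit_mc(num_episodes, num_states, num_actions, generate_episode):
    """Every-Visit Monte Carlo estimate of the action-value function Q."""
    Q = np.zeros((num_states, num_actions))
    returns_sum = np.zeros((num_states, num_actions))
    returns_count = np.zeros((num_states, num_actions))

    for _ in range(num_episodes):
        # One episode as a list of (state, action, reward) tuples
        episode = generate_episode()
        # Every-visit: update on every occurrence of a state-action pair
        for j, (state, action, reward) in enumerate(episode):
            # Return from step j = sum of rewards from step j to the end (gamma = 1)
            G = sum(x[2] for x in episode[j:])
            returns_sum[state, action] += G
            returns_count[state, action] += 1

    # Average the accumulated returns only where a pair was actually visited
    nonzero_counts = returns_count != 0
    Q[nonzero_counts] = returns_sum[nonzero_counts] / returns_count[nonzero_counts]
    return Q

In the exercise itself you would keep the pre-loaded returns_sum, returns_count, and generate_episode() rather than creating or passing them in; the parameterization here is only to keep the sketch runnable on its own.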