
Monte Carlo methods

1. Monte Carlo methods

Welcome back! In our journey through RL, we now explore model-free learning, beginning with Monte Carlo methods.

2. Recap: model-based learning

So far, we've covered policy iteration and value iteration, two model-based strategies that optimize policies or value functions. They depend on the environment's dynamics, such as transition probabilities and rewards, and do not require interaction with the environment to derive the optimal policy.

3. Model-free learning

Model-free learning, however, does not rely on knowledge of the environment's dynamics. Instead, the agent interacts with the environment and learns the optimal policy through trial and error, which makes these methods highly adaptable to complex scenarios.

4. Monte Carlo methods

Monte Carlo methods are model-free techniques used to estimate Q-values and derive policies from experience without needing a model of the environment's dynamics. The process starts by collecting several episodes where the agent performs random actions.

5. Monte Carlo methods

Then, Q-values for each state-action pair are estimated from the episodes' returns.

6. Monte Carlo methods

Lastly, the optimal policy is derived by choosing actions with the highest Q-value in each state. In RL, we distinguish between first-visit and every-visit Monte Carlo methods.

7. Custom grid world

Consider this custom environment.

8. Collecting two episodes

Suppose we collect two random episodes where states, actions, and rewards are as shown. Returns are then computed for every state-action pair, using a discount factor of 1.
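As a small sketch of how these returns could be computed, here is an illustrative helper (not the course's code) that works backwards through an episode's rewards; with a discount factor of 1, each return is simply the sum of the remaining rewards.

```python
def compute_returns(rewards, gamma=1.0):
    """Compute the return G_t for every time step t by working backwards through the rewards."""
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G   # return at this step = reward + discounted future return
        returns.append(G)
    return list(reversed(returns))

# Example: with gamma = 1, each return is the sum of the remaining rewards
print(compute_returns([0, 0, 1]))  # [1.0, 1.0, 1.0]
```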

9. Estimating Q-values

We fill the Q-table, a table containing Q-values for state-action pairs. To fill a specific pair, we look for instances of this pair in the collected episodes and average their returns.

10. Q(4, left), Q(4, up), and Q(1, down)

(4, left), (4, up), and (1, down) each occur only once across the episodes, so we fill the Q-table directly with their returns.

11. Q(4, right)

(4, right) occurs once in each episode. We calculate its Q-value by averaging the returns from both episodes, which gives a Q-value of 10. The distinction between first-visit and every-visit Monte Carlo emerges when handling repeated state-action pairs within episodes. (3, right), for instance, appears twice in the first episode and once in the second.

12. Q(3, right) - first-visit Monte Carlo

First-visit Monte Carlo averages only the return of the first occurrence in each episode, leading to a Q-value of 5.

13. Q(3, right) - every-visit Monte Carlo

Every-visit Monte Carlo averages the returns of every occurrence, giving a Q-value of 6.
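To see where the difference comes from, here is a small sketch with hypothetical returns for (3, right), chosen only so that they reproduce the Q-values above; the actual returns come from the collected episodes shown on the slide.

```python
# Hypothetical returns for (3, right): two occurrences in episode 1, one in episode 2
episode_1_returns = [4, 8]   # first and second occurrence (illustrative values)
episode_2_returns = [6]      # single occurrence (illustrative value)

# First-visit: average only the first occurrence in each episode
q_first_visit = (episode_1_returns[0] + episode_2_returns[0]) / 2   # (4 + 6) / 2 = 5.0

# Every-visit: average the returns of every occurrence
all_returns = episode_1_returns + episode_2_returns
q_every_visit = sum(all_returns) / len(all_returns)                 # (4 + 8 + 6) / 3 = 6.0
```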

14. Generating an episode

To translate this into code, we need several functions. generate_episode() initializes an episode, resets the environment, samples random actions, executes them, and records state-action-reward tuples until the episode concludes, returning the episode's complete data.
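A minimal sketch of what generate_episode() might look like, assuming a Gymnasium-style environment with discrete states and actions (env.reset(), env.step(), env.action_space.sample()); the course's exact implementation may differ, for example if env.step() returns the older four-value tuple.

```python
def generate_episode(env, max_steps=100):
    """Roll out one episode with random actions and return a list of (state, action, reward) tuples."""
    episode = []
    state, _ = env.reset()                    # reset the environment to a starting state
    for _ in range(max_steps):
        action = env.action_space.sample()    # sample a random action
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state = next_state
        if terminated or truncated:           # stop when the episode concludes
            break
    return episode
```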

15. First-visit Monte Carlo

first_visit_mc() initializes three NumPy arrays: Q, returns_sum, and returns_count, which store Q-values, cumulative returns, and visit counts, respectively. In each iteration, an episode is generated and a set, visited_states_actions, is initialized to keep track of which state-action pairs have been visited. We then iterate over each state-action-reward tuple. Only on the first encounter of a state-action pair within an episode does the function calculate the return from that step onwards and add it to returns_sum; it also increments returns_count and adds the pair to visited_states_actions. Then, state-action pairs with non-zero counts are identified to avoid division by zero. Finally, Q is updated by dividing returns_sum by returns_count.
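A possible sketch of this function, assuming a discrete environment exposing observation_space.n and action_space.n and the generate_episode() helper above; it is an illustration of the steps described, not the course's exact code.

```python
import numpy as np

def first_visit_mc(env, num_episodes, gamma=1.0):
    """Estimate Q-values with first-visit Monte Carlo."""
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))

    for _ in range(num_episodes):
        episode = generate_episode(env)
        visited_states_actions = set()
        for i, (state, action, _) in enumerate(episode):
            # Only the first occurrence of (state, action) in this episode counts
            if (state, action) not in visited_states_actions:
                # Return from step i onwards, discounted by gamma
                G = sum(gamma ** j * r for j, (_, _, r) in enumerate(episode[i:]))
                returns_sum[state, action] += G
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))

    # Average the returns, avoiding division by zero
    nonzero = returns_count > 0
    Q[nonzero] = returns_sum[nonzero] / returns_count[nonzero]
    return Q
```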

16. Every-visit Monte Carlo

every_visit_mc() uses similar code, but does not track visits. It therefore updates returns_sum and returns_count for every occurrence of each state-action pair, without maintaining a set of visited_states_actions.
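A corresponding sketch, under the same assumptions as first_visit_mc() above; the only difference is that every occurrence contributes a return.

```python
def every_visit_mc(env, num_episodes, gamma=1.0):
    """Estimate Q-values with every-visit Monte Carlo."""
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    returns_sum = np.zeros((n_states, n_actions))
    returns_count = np.zeros((n_states, n_actions))

    for _ in range(num_episodes):
        episode = generate_episode(env)
        for i, (state, action, _) in enumerate(episode):
            # Every occurrence of (state, action) contributes a return
            G = sum(gamma ** j * r for j, (_, _, r) in enumerate(episode[i:]))
            returns_sum[state, action] += G
            returns_count[state, action] += 1

    nonzero = returns_count > 0
    Q[nonzero] = returns_sum[nonzero] / returns_count[nonzero]
    return Q
```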

17. Getting the optimal policy

Finally, the get_policy() function extracts the policy from the Q-values, selecting the action with the highest Q-value for each state.
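One way get_policy() could be written; whether the course version returns a dictionary or an array is an assumption here.

```python
def get_policy(Q):
    """Map each state to the action with the highest Q-value."""
    return {state: int(np.argmax(Q[state])) for state in range(Q.shape[0])}
```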

18. Putting things together

We execute first_visit_mc and every_visit_mc for 1000 episodes and derive the resulting policies. As we can see, for a large number of episodes, both methods converge to the same optimal policy.
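A hypothetical usage sketch, assuming env is the custom grid world and the functions defined above:

```python
# Estimate Q-values with both methods over 1000 episodes
Q_first = first_visit_mc(env, num_episodes=1000)
Q_every = every_visit_mc(env, num_episodes=1000)

# Derive and compare the resulting policies
policy_first = get_policy(Q_first)
policy_every = get_policy(Q_every)
print(policy_first)
print(policy_every)  # with enough episodes, both policies typically agree
```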

19. Let's practice!

Now, let's practice!