1. Monte Carlo methods
Welcome back! In our journey through RL, we now explore model-free learning, beginning with Monte Carlo methods.
2. Recap: model-based learning
So far, we've covered policy iteration and value iteration, two model-based strategies that optimize policies or value functions.
They depend on the environment's dynamics, such as transition probabilities and rewards,
and do not require interaction with the environment to derive the optimal policy.
3. Model-free learning
Model-free learning, however, does not rely on knowledge about the environment's dynamics.
Instead, it requires the agent to interact with the environment,
learning the optimal policy through trial and error,
which makes it highly adaptable to complex scenarios.
4. Monte Carlo methods
Monte Carlo methods are model-free techniques used to
estimate Q-values and derive policies from experience without needing a model of the environment's dynamics.
The process starts by collecting several episodes where the agent performs random actions.
5. Monte Carlo methods
Then, Q-values for each state-action pair are estimated from the episodes' returns.
6. Monte Carlo methods
Lastly, the optimal policy is derived by choosing actions with the highest Q-value in each state.
In RL, we distinguish between first-visit and every-visit Monte Carlo methods.
7. Custom grid world
Consider this custom environment.
8. Collecting two episodes
Suppose we collect two random episodes whose states, actions, and rewards are as shown. Returns are computed for every state-action pair using a discount factor of 1.
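For reference, with a discount factor of 1 the return at each step is simply the sum of all rewards from that step to the end of the episode. Here is a minimal sketch of that computation; the function name and the (state, action, reward) episode format are assumptions made for illustration:

```python
# Minimal sketch: compute the return for every step of an episode.
# With gamma=1, as in the narration, the return is the plain sum of later rewards.
def compute_returns(episode, gamma=1.0):
    returns = []
    G = 0.0
    # Walk the episode backwards so each step's return accumulates all later rewards
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        returns.append((state, action, G))
    return list(reversed(returns))
```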
9. Estimating Q-values
We fill the Q-table, a table containing Q-values for state-action pairs. To fill a specific pair, we look for instances of that pair in the collected episodes and average their returns.
10. Q(4, left), Q(4, up), and Q(1, down)
(4, left), (4, up), and (1, down) each occur only once across the two episodes, so we fill the Q-table with their returns directly.
11. Q(4, right)
(4, right) occurs once in each episode. We calculate its Q-value by averaging the returns from both episodes, which gives a Q-value of 10.
The distinction between first-visit and every-visit Monte Carlo emerges in how repeated state-action pairs within an episode are handled. (3, right), for instance, appears twice in the first episode and once in the second.
12. Q(3, right) - first-visit Monte Carlo
First-visit Monte Carlo averages only the return of the first occurrence in each episode, leading to a Q-value of 5.
13. Q(3, right) - every-visit Monte Carlo
Every-visit Monte Carlo averages the returns of every occurrence, giving a Q-value of 6.
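To make the two averaging rules concrete, here is a small numeric sketch. The individual per-occurrence returns below are hypothetical, chosen only so that the two rules reproduce the averages quoted above; the slides show the averages, not these individual values.

```python
# Hypothetical returns for (3, right) -- illustrative values only
ep1_returns = [4, 8]   # two occurrences in episode 1 (first occurrence -> 4)
ep2_returns = [6]      # one occurrence in episode 2

first_visit_q = (ep1_returns[0] + ep2_returns[0]) / 2    # 5.0: only first occurrences count
every_visit_q = sum(ep1_returns + ep2_returns) / 3       # 6.0: every occurrence counts
```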
14. Generating an episode
To translate this into code, we need several functions.
generate_episode() initializes an episode, resets the environment,
samples random actions,
executes them,
and records state-action-reward tuples until the episode concludes,
returning the episode's complete data.
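One possible implementation, assuming a gymnasium-style discrete environment stored in env; the environment construction itself is not shown here:

```python
# env is assumed to be a gymnasium-style discrete environment created elsewhere,
# e.g. the custom grid world from the slides.
def generate_episode():
    """Roll out one episode with uniformly random actions."""
    episode = []
    state, info = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()                                 # random action
        next_state, reward, terminated, truncated, info = env.step(action)
        episode.append((state, action, reward))                            # record the transition
        state = next_state
        done = terminated or truncated
    return episode
```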
15. First-visit Monte Carlo
first_visit_mc() initializes three numpy arrays, Q, returns_sum, and returns_count, for storing Q-values, summed returns, and visit counts, respectively.
For each iteration, an episode is generated, and a set visited_states_actions is initialized to keep track of which state-action pairs have been visited.
Then, we iterate over each state-action-reward tuple.
Only on the first encounter of a state-action pair within an episode
does the function calculate the return from that step onwards and add it to returns_sum.
It also increments the returns_count and adds the pair to visited_states_actions.
Then, state-action pairs with non-zero counts are identified to avoid division by zero.
Finally, Q is updated by dividing returns_sum by returns_count for those pairs.
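A sketch of that logic, reusing generate_episode() from above; nb_states, nb_actions, and nb_episodes are assumed names for the grid world's dimensions and the episode budget:

```python
import numpy as np

def first_visit_mc(nb_episodes):
    Q = np.zeros((nb_states, nb_actions))
    returns_sum = np.zeros((nb_states, nb_actions))
    returns_count = np.zeros((nb_states, nb_actions))

    for _ in range(nb_episodes):
        episode = generate_episode()
        visited_states_actions = set()
        for i, (state, action, reward) in enumerate(episode):
            # Only the first occurrence of (state, action) in this episode contributes
            if (state, action) not in visited_states_actions:
                # Return from this step onwards, with a discount factor of 1
                returns_sum[state, action] += sum(r for _, _, r in episode[i:])
                returns_count[state, action] += 1
                visited_states_actions.add((state, action))

    # Only update pairs that were actually visited, avoiding division by zero
    nonzero = returns_count > 0
    Q[nonzero] = returns_sum[nonzero] / returns_count[nonzero]
    return Q
```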
16. Every-visit Monte Carlo
every_visit_mc() uses similar code but does not track visits. It updates returns_sum and returns_count for every occurrence of each state-action pair, without maintaining a visited_states_actions set.
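The every-visit variant under the same assumptions; the visit-tracking set is simply removed:

```python
def every_visit_mc(nb_episodes):
    Q = np.zeros((nb_states, nb_actions))
    returns_sum = np.zeros((nb_states, nb_actions))
    returns_count = np.zeros((nb_states, nb_actions))

    for _ in range(nb_episodes):
        episode = generate_episode()
        for i, (state, action, reward) in enumerate(episode):
            # Every occurrence of (state, action) contributes a return
            returns_sum[state, action] += sum(r for _, _, r in episode[i:])
            returns_count[state, action] += 1

    nonzero = returns_count > 0
    Q[nonzero] = returns_sum[nonzero] / returns_count[nonzero]
    return Q
```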
17. Getting the optimal policy
Finally, the get_policy() function extracts the policy from the Q-values, selecting the action with the highest Q-value for each state.
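A compact sketch of this step; passing the Q-table in explicitly is a choice made for this sketch rather than something shown in the narration:

```python
def get_policy(Q):
    """Map each state to the action with the highest Q-value."""
    return {state: int(np.argmax(Q[state])) for state in range(nb_states)}
```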
18. Putting things together
We execute first_visit_mc and every_visit_mc for 1000 episodes
and derive the resulting policies.
As we can see, for a large number of episodes, both methods converge to the same optimal policy.
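Putting the sketches above together; the 1000-episode count comes from the narration, while everything else follows the assumed names used earlier:

```python
Q_first = first_visit_mc(1000)
Q_every = every_visit_mc(1000)

policy_first = get_policy(Q_first)
policy_every = get_policy(Q_every)

# With enough episodes, both policies are expected to agree
print(policy_first == policy_every)
```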
19. Let's practice!
Now, let's practice!