1. Temporal difference learning
Now we dive into Temporal Difference Learning, a model-free approach that plays a pivotal role in RL.
2. TD learning vs. Monte Carlo
Temporal Difference Learning, or TD Learning, like Monte Carlo methods, is model-free, meaning it does not require a model of the environment's dynamics;
instead, it estimates the Q-table based on interaction with the environment. However, the critical difference lies in when and how the two approaches update their value estimates.
Monte Carlo methods can't update their estimates until at least one episode is done. They rely on the final outcome of the episode,
which makes them well-suited for environments where episodes are clearly defined and relatively short.
TD Learning, on the other hand, updates value estimates at each step within an episode, based on the most recent experience.
This quality makes TD Learning more flexible and efficient, particularly in environments with long or indefinite episodes.
3. TD learning as weather forecasting
You can think of TD learning as weather forecasting, where predictions are constantly updated as new data, like the current weather conditions, comes in, rather than waiting for the outcome of the whole day.
4. SARSA
Now, let's focus on SARSA, a specific TD Learning algorithm.
SARSA stands for State-Action-Reward-State-Action, which outlines the data involved in its update process.
As an on-policy method, SARSA learns the value of the policy it's currently following, adjusting its strategy based on the actions it takes.
In SARSA, the agent learns by observing the current state, taking an action, receiving a reward, observing the next state, and then taking the next action. The value of the current state-action pair is updated based on this experience. Let's see how!
5. SARSA update rule
The new Q-value of the current state-action pair is equal to (1 - alpha) times the old Q-value of the current state-action pair, plus alpha times the sum of the reward and gamma times the Q-value of the next state-action pair.
Here, alpha is the learning rate, controlling the speed of change in Q-values,
and gamma is the discount factor, controlling how much the agent values future rewards.
Both values lie between 0 and 1 and typically need to be tuned.
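In symbols, the update rule described above reads as follows, where s and a are the current state and action, s' and a' the next state and action, and r the reward:

$$
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big( r + \gamma\, Q(s', a') \big)
$$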
6. Frozen Lake
Now, let's implement SARSA to solve the Frozen Lake environment.
7. Initialization
We start by creating the environment,
getting the number of states and actions,
and initializing an array Q with zeros.
Also, we initialize the learning rate alpha to 0.1, gamma to 1, and the number of training episodes to 1000.
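A minimal sketch of this initialization, assuming the Gymnasium API and the non-slippery variant of Frozen Lake (both are assumptions, since the exact setup isn't shown here):

```python
import numpy as np
import gymnasium as gym  # assumption: the Gymnasium API is used

# Create the Frozen Lake environment (is_slippery=False is an assumption made for simplicity)
env = gym.make("FrozenLake-v1", is_slippery=False)

# Number of states and actions
num_states = env.observation_space.n
num_actions = env.action_space.n

# Q-table initialized with zeros
Q = np.zeros((num_states, num_actions))

# Hyperparameters
alpha = 0.1          # learning rate
gamma = 1.0          # discount factor
num_episodes = 1000  # number of training episodes
```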
8. SARSA loop
Next, we define the SARSA training loop, iterating over a number of episodes.
Each episode starts by resetting the environment and randomly selecting an initial action.
During the episode, after executing an action, we observe the resulting reward and the next state.
We randomly choose the subsequent action - a placeholder strategy that we'll enhance later in the course.
The Q-table is updated with the update_q_table() function that we define next.
Finally, we set the current state and action to their next counterparts for the upcoming iteration.
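Under the same assumptions, the training loop might look like the sketch below; update_q_table() is the function defined in the next step:

```python
for episode in range(num_episodes):
    # Reset the environment and randomly select an initial action
    state, info = env.reset()
    action = env.action_space.sample()
    terminated = truncated = False

    while not (terminated or truncated):
        # Execute the action; observe the reward and the next state
        next_state, reward, terminated, truncated, info = env.step(action)

        # Placeholder strategy: choose the next action at random
        next_action = env.action_space.sample()

        # Update the Q-table (function defined in the next step)
        update_q_table(state, action, reward, next_state, next_action)

        # The next state and action become the current ones
        state, action = next_state, next_action
```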
9. SARSA updates
The update_q_table function receives the state, action, reward, next_state, and next_action.
First, it retrieves the old Q-value of the state-action pair, represented by the red term in the formula.
Then, it retrieves the Q-value of the next state-action pair, represented by the green term in the formula.
Finally, it calculates the new Q-value for the state-action pair using the formula.
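One way to write this function, matching the formula above (Q, alpha, and gamma are the variables created during initialization):

```python
def update_q_table(state, action, reward, next_state, next_action):
    # Old Q-value of the current state-action pair (the red term)
    old_value = Q[state, action]

    # Q-value of the next state-action pair (the green term)
    next_value = Q[next_state, next_action]

    # SARSA update rule
    Q[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_value)
```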
10. Deriving the optimal policy
Calling the get_policy() function defined earlier, we get the optimal policy,
and we see that the agent has learned the optimal action in each state, avoiding falling into the holes.
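The definition of get_policy() isn't shown here; a plausible version, given the Q-table above, simply picks the highest-valued action in each state:

```python
def get_policy():
    # For each state, select the action with the highest Q-value
    return {state: int(np.argmax(Q[state])) for state in range(num_states)}

policy = get_policy()
print(policy)
```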
11. Let's practice!
Time for some practice!