1. Expected SARSA
Welcome back to our exploration of RL techniques! Now, we dive into Expected SARSA, a sophisticated twist on the SARSA and Q-learning algorithms that enhances our agent's decision-making process. Let's jump in!
2. Expected SARSA
Expected SARSA, like its counterparts SARSA and Q-learning,
is a Temporal Difference or TD learning method
used in model-free RL,
where we start by initializing a Q-table. Then, repeatedly, the agent chooses an action, receives a reward, and updates the table, until convergence is achieved.
However, the key distinction of Expected SARSA over SARSA and Q-learning lies in its update rule.
3. Expected SARSA update
While SARSA relies on the actual next action taken to update Q-values,
and while Q-learning updates Q-values based on the maximum Q-value attainable in the next state, regardless of the policy being followed,
Expected SARSA calculates the expected Q-value of the next state over all possible actions. This makes Expected SARSA more robust to changes and uncertainties, as it considers the average outcome over all possible next actions, weighted according to the current policy, as sketched below.
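To make the contrast concrete, all three methods share the same update form and differ only in the TD target; here is a sketch of the targets in standard notation (the symbols are the usual ones, not taken from the slide itself):

```latex
% Common update: Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (y_t - Q(s_t, a_t))
\begin{aligned}
\text{SARSA:} \quad          & y_t = r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) \\
\text{Q-learning:} \quad     & y_t = r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) \\
\text{Expected SARSA:} \quad & y_t = r_{t+1} + \gamma \sum_{a} \pi(a \mid s_{t+1})\, Q(s_{t+1}, a)
\end{aligned}
```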
4. Expected value of next state
Expected SARSA's formula reflects this approach by focusing on the expected value of the next state.
This is achieved by calculating the sum of the Q-values from all possible actions initiated from this state. Each Q-value is weighted by the probability of its corresponding action being selected under the current policy.
In our context, since actions are chosen uniformly at random during training for now, every action has an equal probability of being selected.
Therefore, the expected value simplifies to the mean of the Q-values for all actions in the next state.
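Under that uniform random policy, each action's probability is one over the number of actions, so the expectation collapses to the mean; a small derivation, assuming the next state offers the action set A:

```latex
\sum_{a} \pi(a \mid s_{t+1})\, Q(s_{t+1}, a)
  = \sum_{a} \frac{1}{|\mathcal{A}|}\, Q(s_{t+1}, a)
  = \frac{1}{|\mathcal{A}|} \sum_{a} Q(s_{t+1}, a)
```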
5. Implementation with Frozen Lake
Let's apply Expected SARSA to the Frozen Lake environment, where our agent navigates a surface to reach a goal.
We begin by setting up our environment and initializing our Q-table as a numpy array of zeros, with dimensions matching the number of states and the number of actions.
We also define a learning rate, alpha, a discount factor, gamma, and the total number of episodes we want for training.
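A minimal sketch of this setup, assuming the Gymnasium FrozenLake-v1 environment and illustrative hyperparameter values (the exact numbers are assumptions, not necessarily the course's):

```python
import gymnasium as gym
import numpy as np

# Create the Frozen Lake environment (is_slippery=False is an assumption here)
env = gym.make("FrozenLake-v1", is_slippery=False)

# Q-table of zeros: one row per state, one column per action
num_states = env.observation_space.n
num_actions = env.action_space.n
Q = np.zeros((num_states, num_actions))

# Hyperparameters (illustrative values)
alpha = 0.1          # learning rate
gamma = 0.99         # discount factor
num_episodes = 1000  # total training episodes
```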
6. Expected SARSA update rule
The core of our implementation is the update_q_table function that receives a state, an action, the next_state, and the reward as input,
and calculates the expected Q-value by averaging the Q-values of the next state for all actions, reflecting our assumption of random action selection.
This expected Q is then used to update the current state-action pair's Q-value, incorporating the immediate reward and the discounted expected future reward, as per the Expected SARSA update formula.
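A sketch of such a function, assuming the Q-table and hyperparameters defined above:

```python
def update_q_table(state, action, next_state, reward):
    # Expected Q-value of the next state: the mean over all actions,
    # since actions are selected uniformly at random during training
    expected_q = np.mean(Q[next_state])

    # Expected SARSA update for the current state-action pair
    Q[state, action] += alpha * (reward + gamma * expected_q - Q[state, action])
```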
7. Training
Our training loop iterates over the specified number of episodes, each starting with a fresh environment state.
Within each episode, actions are selected randomly.
For each action taken, we observe the new state and reward,
then we call our update_q_table function to adjust our Q-values based on the Expected SARSA rule.
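A sketch of this loop using the Gymnasium API, under the same assumptions as above:

```python
for episode in range(num_episodes):
    state, info = env.reset()
    terminated = truncated = False

    while not (terminated or truncated):
        # Uniformly random action selection
        action = env.action_space.sample()

        # Take the action and observe the new state and reward
        next_state, reward, terminated, truncated, info = env.step(action)

        # Adjust Q-values with the Expected SARSA rule
        update_q_table(state, action, next_state, reward)
        state = next_state
```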
8. Agent's policy
After training, we evaluate our agent's performance by printing its learned policy.
As we can see, the agent's policy tends to avoid the holes, which demonstrates the ability of the Expected SARSA algorithm to converge to the optimal policy.
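One way to extract and print the learned greedy policy from the Q-table (the exact printing format used in the course may differ):

```python
# Greedy policy: the highest-valued action in each state
policy = {state: int(np.argmax(Q[state])) for state in range(num_states)}
print(policy)
```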
9. Let's practice!
Let's get coding and see Expected SARSA in action!