
Q-learning

1. Q-learning

Now that you know about SARSA, it's time to explore Q-learning, another temporal difference method widely used in the RL landscape. Let's get started!

2. Introduction to Q-learning

Q-learning, which stands for 'quality learning,' is a model-free technique that helps an agent learn the optimal Q-table by interacting with an environment. Just like SARSA, the Q-learning algorithm starts by initializing a Q-table. Then, repeatedly, the agent chooses an action to perform, receives a reward from the environment, and updates the table. The agent repeats this loop over a number of episodes until the Q-values converge.

3. Q-learning vs. SARSA

The main difference between Q-learning and SARSA lies in the way the Q-table is updated. In SARSA, the new Q-value for a state-action pair is updated based on the action a' actually taken when the agent is in state s'. Conversely, Q-learning updates the Q-value by considering the maximum possible Q-value that could be obtained from the next state s', regardless of the actual action taken. In other words, while SARSA updates the Q-table based on the action taken, Q-learning does so independently of the agent's specific actions. This distinction categorizes SARSA as an on-policy learner and Q-learning as an off-policy learner.
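For reference, the two update rules can be written side by side in standard temporal difference notation, with alpha as the learning rate and gamma as the discount factor:

```latex
% SARSA (on-policy): bootstraps on the action a' actually taken in state s'
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[\, r + \gamma \, Q(s', a') - Q(s, a) \,\bigr]

% Q-learning (off-policy): bootstraps on the best action available in state s'
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\bigr]
```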

4. Q-learning implementation

Suppose we need to solve the slippery Frozen Lake environment with Q-learning. We create the environment and specify parameters such as the number of episodes, the learning rate, alpha, and the discount factor, gamma. We then initialize the Q-table, which has dimensions num_states by num_actions, with zeros. Additionally, we initialize an empty list to record the total reward for each episode; this will be helpful for evaluation later on.
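A minimal setup sketch along these lines, assuming Gymnasium's FrozenLake-v1 environment with is_slippery=True and illustrative values for num_episodes, alpha, and gamma (the exact values are not specified here):

```python
import gymnasium as gym
import numpy as np

# Create the slippery Frozen Lake environment
env = gym.make("FrozenLake-v1", is_slippery=True)

# Training parameters (illustrative values)
num_episodes = 10000   # number of training episodes
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor

# Q-table with one row per state and one column per action, initialized to zeros
num_states = env.observation_space.n
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))

# Total reward collected in each random-action training episode
reward_per_random_episode = []
```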

5. Q-learning implementation

For each episode, we initialize the agent's state, a boolean variable indicating whether the agent has reached a terminal state, and the episode reward, which starts at zero. While the episode has not concluded, the agent selects an action, which, for now, is assumed to be random; later in the course, we will explore improved methods for action selection during training. After executing an action, the agent receives a reward and a new state, and then updates the Q-table. The reward is accumulated in the total episode_reward, and the state is updated. Once the episode concludes, we append the episode_reward to the list reward_per_random_episode.
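One way to write this training loop, continuing the sketch above (update_q_table is the function described on the next slide; Gymnasium's step returns separate terminated and truncated flags, combined here into a single done flag):

```python
for episode in range(num_episodes):
    state, info = env.reset()
    done = False               # True once the agent reaches a terminal state
    episode_reward = 0

    while not done:
        # For now, actions are chosen at random; better selection comes later
        action = env.action_space.sample()

        # Execute the action and observe the reward and new state
        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Update the Q-table with the observed transition
        update_q_table(state, action, reward, new_state)

        episode_reward += reward
        state = new_state

    reward_per_random_episode.append(episode_reward)
```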

6. Q-learning update

The update_q_table function is a direct implementation of the Q-learning update formula. It receives a state, action, reward, and a new_state as inputs. First, it retrieves the old value of the state-action pair, represented by the red term in the formula. Then, it computes the next maximum value based on the new state, represented by the green term in the formula. Finally, it calculates the new value for the state-action pair.
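A direct translation of that description into code, under the same assumptions as the sketches above (q_table, alpha, and gamma defined globally):

```python
def update_q_table(state, action, reward, new_state):
    # Old value of the state-action pair (the term being corrected)
    old_value = q_table[state, action]

    # Maximum Q-value achievable from the new state, regardless of the action taken
    next_max = np.max(q_table[new_state])

    # Q-learning update: move the old value toward the TD target
    q_table[state, action] = (1 - alpha) * old_value + alpha * (reward + gamma * next_max)
```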

7. Using the policy

After training, since the slippery environment is stochastic, we let the agent play several episodes with the learned policy to evaluate it more reliably. We execute the same loop of performing an action, receiving a reward, and updating the state; however, at each step, the agent now selects the action according to its learned policy. Additionally, we track the episode_reward collected in each episode and store these values in the list reward_per_learned_episode.
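An evaluation loop in the same spirit might look like this, where the learned policy is taken to be greedy with respect to the Q-table and the number of evaluation episodes is an assumption:

```python
reward_per_learned_episode = []

for episode in range(num_episodes):
    state, info = env.reset()
    done = False
    episode_reward = 0

    while not done:
        # Follow the learned policy: pick the action with the highest Q-value
        action = np.argmax(q_table[state])

        new_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        episode_reward += reward
        state = new_state

    reward_per_learned_episode.append(episode_reward)
```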

8. Q-learning evaluation

Finally, we compute the average reward per random episode and average reward per learned episode using NumPy's mean function, and compare them using a bar chart. We observe that, on average, the agent collects a much higher reward when following the learned policy. This result underscores the success of the learning process.
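The comparison could be produced along these lines, continuing the earlier sketches and using Matplotlib for the bar chart (labels and styling are assumptions):

```python
import matplotlib.pyplot as plt

# Average reward per episode for both behaviors
avg_random = np.mean(reward_per_random_episode)
avg_learned = np.mean(reward_per_learned_episode)

# Bar chart comparing the two averages
plt.bar(["Random policy", "Learned policy"], [avg_random, avg_learned])
plt.ylabel("Average reward per episode")
plt.title("Random vs. learned policy on slippery Frozen Lake")
plt.show()
```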

9. Let's practice!

Now let's practice!