
Double DQN

1. Double DQN

In this video, we will study the Double DQN algorithm.

2. Double Q-learning

You have seen in your prerequisite course that Q-learning has a tendency to overestimate Q-values. This is because the calculation for the target Q-value involves taking the maximum across all actions; this maximum is not taken from the real action value function, but from our current best estimate, which is noisy. When we take the maximum of a noisy function, we tend to be overoptimistic, even if our estimates are correct on average. We call this the maximization bias. It leads to slower and less stable learning. You have also seen that Double Q-learning is a way to address this problem. Double Q-learning introduces a second Q-table and alternates between the two tables, ensuring that action selection is decoupled from value estimation. This inspired a comparable solution for Deep Q-Learning, called Double DQN, or DDQN.
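To make the maximization bias concrete, here is a minimal sketch (not from the course): every action has a true value of zero and the estimates are unbiased but noisy, yet the maximum of the estimates is clearly positive on average.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Every action's true value is 0; the estimates are unbiased but noisy
noisy_q = rng.normal(loc=0.0, scale=1.0, size=(n_trials, n_actions))

print("Average estimate:      ", noisy_q.mean())              # close to 0
print("Average of the maximum:", noisy_q.max(axis=1).mean())  # clearly above 0 (around 1.5)
```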

3. The idea behind DDQN

Double DQN uses the same core concepts as vanilla DQN with fixed Q-targets. In DQN, when we calculate the TD target in the Bellman error, we use the target network both for action selection and for value estimation; this is subject to the maximization bias. Instead of introducing yet another network, DDQN proposes to use the target network only for value estimation, with the online network now being used for next-action selection. This is not exactly like Double Q-learning, as we do not alternate between two networks. This algorithm was introduced to get most of the benefit of Double Q-learning while making the smallest possible change to the original DQN algorithm.
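Written as TD targets, the difference can be summarized as follows. The notation here is an assumption, not fixed by the video: theta denotes the online-network parameters and theta-minus the target-network parameters.

```latex
% Assumed notation: \theta = online-network parameters, \theta^{-} = target-network parameters
\begin{align*}
\text{DQN target:}  \quad y &= r + \gamma \, Q_{\theta^{-}}\!\bigl(s',\ \arg\max_{a'} Q_{\theta^{-}}(s', a')\bigr) \\
\text{DDQN target:} \quad y &= r + \gamma \, Q_{\theta^{-}}\!\bigl(s',\ \arg\max_{a'} Q_{\theta}(s', a')\bigr)
\end{align*}
```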

4. Double DQN implementation

Let us look at DQN and DDQN side by side, as the differences are minimal but important to grasp. The online and target networks are instantiated the same way. The target network updates are also identical. The only difference appears in the target Q-value calculations. In DQN, we use the target network both for next action selection and for next value estimation. This is implicit here as we just take the maximum value in the output layer with the amax function. But for better comparison with DDQN, let us rewrite this in a way which is more explicit, though equivalent.
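As a sketch of what the DQN target calculation looks like, assuming a PyTorch setup; the network sizes, variable names, and toy batch below are illustrative assumptions, not the course's exact code.

```python
import torch
import torch.nn as nn

state_dim, n_actions, batch_size, gamma = 4, 2, 8, 0.99

# Assumed toy networks; in practice these would share the DQN architecture
online_network = nn.Linear(state_dim, n_actions)
target_network = nn.Linear(state_dim, n_actions)
target_network.load_state_dict(online_network.state_dict())

# Assumed toy batch of transitions
rewards = torch.rand(batch_size)
next_states = torch.rand(batch_size, state_dim)
dones = torch.zeros(batch_size)

with torch.no_grad():
    # DQN: the target network selects AND evaluates the next action,
    # implicitly, by taking the maximum over the action dimension with amax
    next_q = target_network(next_states).amax(dim=1)
    td_target = rewards + gamma * next_q * (1 - dones)
```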

5. Double DQN implementation

We use the target network to first select the action with argmax, and then in a separate step evaluate its value. In DDQN, we decouple those. It is the online network which is used to determine the next action; the target network is used only to evaluate the value of this action. As we can see, the algorithm change between DQN and DDQN truly is minimal.
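Reusing the assumed networks and toy batch from the previous sketch, the explicit-but-equivalent DQN target and the DDQN target might look as follows; only the network inside the argmax changes.

```python
with torch.no_grad():
    # DQN, written explicitly: the target network selects the action (argmax),
    # then the same target network evaluates it (gather)
    next_actions = target_network(next_states).argmax(dim=1, keepdim=True)
    next_q_dqn = target_network(next_states).gather(1, next_actions).squeeze(1)
    td_target_dqn = rewards + gamma * next_q_dqn * (1 - dones)

    # DDQN: the ONLINE network selects the action, the target network evaluates it
    next_actions = online_network(next_states).argmax(dim=1, keepdim=True)
    next_q_ddqn = target_network(next_states).gather(1, next_actions).squeeze(1)
    td_target_ddqn = rewards + gamma * next_q_ddqn * (1 - dones)
```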

6. DDQN performance

Many Deep Q-Learning research papers report the performance of their algorithms on a suite of around 50 Atari games, comparing their scores against each other and against top human players. The original DQN was the first algorithm to match human players on many of these games, sometimes beating them by a large margin. Agents trained with DDQN exhibit substantially higher median and average scores across the same Atari environments. It should be said, however, that this is task-dependent: in some instances, vanilla DQN may perform as well as or better than DDQN. In practice, it generally pays off to try both approaches.

7. Let's practice!

Let's practice!
