The complete DQN algorithm
1. The complete DQN algorithm
Welcome back!
2. The DQN algorithm
With experience replay, we are reasonably close to the complete DQN algorithm as published by DeepMind in 2015. However, we are still missing two components. Epsilon-greediness allows the agent to explore more. Fixed Q-targets enhance the stability of the learning process by ensuring that the target Q-value in the loss function does not shift too rapidly over time.
3. Epsilon-greediness in the DQN algorithm
Epsilon-greediness lets the agent occasionally choose a random action over the highest-value one. We use decayed epsilon-greediness to focus more on exploration early in training and on exploitation later. This is implemented by modifying the select_action function. It takes five arguments: the q_values, which determine the optimal action; the current step number; and three parameters describing the epsilon decay: start, end, and decay. The threshold epsilon is calculated, and a random number between zero and one is drawn. If that number is less than epsilon, a random action is taken; otherwise, the action with the highest Q-value is selected. The resulting schedule depends on the decay parameter: higher values for decay result in slower decay.
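A minimal sketch of such a function, assuming the common exponential schedule epsilon = end + (start - end) * exp(-step / decay) and a PyTorch tensor of Q-values (the exact decay formula used in the course may differ):

```python
import math
import random

import torch


def select_action(q_values: torch.Tensor, step: int,
                  start: float, end: float, decay: int) -> int:
    """Epsilon-greedy action selection with exponentially decayed epsilon."""
    # Epsilon starts near `start` and tends towards `end`;
    # larger `decay` values mean a slower decay.
    epsilon = end + (start - end) * math.exp(-step / decay)
    if random.random() < epsilon:
        # Explore: pick a random action index.
        return random.randint(0, q_values.shape[-1] - 1)
    # Exploit: pick the action with the highest Q-value.
    return torch.argmax(q_values).item()
```

With start=0.9, end=0.05, and decay=1000, for example, epsilon is 0.9 at step 0 and roughly 0.36 after 1000 steps.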
4. Fixed Q-targets
Now let's discuss fixed Q-targets. The DQN loss, based on the Bellman error, involves the network twice: once for the current action value (which we want to learn) and once for the TD target. This causes the target to shift with each update, which destabilizes training. To stabilize training, we introduce a second Q-network. We refer to it as the target network, and to the original one as the online network. The target network is used to calculate the TD targets and is updated much more slowly.
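For reference, the per-transition loss with fixed Q-targets can be written as below, with theta the online network parameters and theta-minus the target network parameters (this notation is an assumption, not taken from the slides):

```latex
L(\theta) = \left( r + \gamma \max_{a'} Q_{\theta^{-}}(s', a') - Q_{\theta}(s, a) \right)^{2}
```

In practice, this error is averaged over the sampled replay batch.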
5. Implementing fixed Q-targets
First, we instantiate both networks with the same initial parameters. One way to do this is to load the online network's state dictionary into the target network. The state_dict method returns the state dictionary, which contains all the network weights layer by layer, as illustrated with this smaller network. This state dictionary can then be loaded into the target network with the load_state_dict method. We will use gradient descent to update the online network at each step. However, we also want to update the target network parameters: at every step, they need to get a bit closer to those of the online network, as if they had a lot of inertia. One way to do that is to iterate over the state dictionaries of both networks, taking a weighted average for each layer. The weight on the online network is a hyperparameter tau, typically with a small value like 0.001; the higher tau is, the faster the target network shifts towards the online network. We must then load the updated state dictionary back into the target network to update its parameters.
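A minimal sketch of both steps, assuming the networks are plain nn.Module instances; the network architecture here is illustrative, and the exact signature of update_target_network in the course may differ:

```python
import torch
import torch.nn as nn

# Illustrative architectures; the online and target networks must match.
online_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))

# Initialize the target network with the online network's weights.
target_network.load_state_dict(online_network.state_dict())


def update_target_network(target_network: nn.Module,
                          online_network: nn.Module,
                          tau: float = 0.001) -> None:
    """Soft update: nudge the target network towards the online network."""
    target_state_dict = target_network.state_dict()
    online_state_dict = online_network.state_dict()
    for key in target_state_dict:
        # Weighted average, layer by layer; higher tau means faster tracking.
        target_state_dict[key] = (
            tau * online_state_dict[key] + (1 - tau) * target_state_dict[key]
        )
    # Load the updated state dictionary back into the target network.
    target_network.load_state_dict(target_state_dict)
```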
6. Loss calculation with fixed Q-targets
The changes to the training loop occur in the inner loop, after action selection. The loss calculation from an experience replay batch now requires a couple of adjustments. The online and target networks now provide the Q-values and the target Q-values, respectively. Since we are not interested in optimizing the target network weights by gradient descent, we use the torch.no_grad context to tell PyTorch not to track gradients for the target_q_values, ensuring that they are not updated in the optimizer step. For the loss itself, we still use the Mean Squared Bellman Error. Finally, we use the update_target_network function to nudge target_network towards online_network.
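A sketch of how the loss calculation for one batch could look; the compute_loss name, the batch layout, and the discount factor gamma are assumptions for illustration:

```python
import torch
import torch.nn as nn

gamma = 0.99  # discount factor (illustrative value)


def compute_loss(batch, online_network, target_network):
    """Mean Squared Bellman Error with fixed Q-targets for one replay batch."""
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions actually taken, from the online network.
    q_values = online_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target Q-values from the target network; gradients are not tracked,
    # so the target network is untouched by the optimizer step.
    with torch.no_grad():
        next_q_values = target_network(next_states).max(dim=1).values
        target_q_values = rewards + gamma * next_q_values * (1 - dones)

    return nn.functional.mse_loss(q_values, target_q_values)
```

After the optimizer step on this loss, update_target_network(target_network, online_network, tau) nudges the target network towards the online network.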
7. Let's practice!
Let's practice!