Implementing the complete DQN algorithm

The time has finally arrived! All the prerequisites are complete; you will now implement the full DQN algorithm and use it to train a Lunar Lander agent. This means that your algorithm will use not just Experience Replay, but also Decayed Epsilon-Greediness and Fixed Q-Targets.

The select_action() function implementing Decayed Epsilon Greediness is available for you to use, as is the update_target_network() function from the last exercise. All that remains is to fit those functions into the DQN training loop and ensure that you are correctly using the Target Network in the loss calculations.

You need to keep a new step counter, total_steps, to decay the value for \(\varepsilon\) over time. This variable is initialized for you with value 0.
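As a rough illustration of how a running step counter can drive the decay, here is a minimal sketch that assumes an exponential schedule. The names epsilon_at and select_action_sketch, and the parameters start, end, and decay, are hypothetical; the provided select_action() may use a different schedule or signature.

```python
import math
import random


def epsilon_at(step, start=0.9, end=0.05, decay=1000):
    """Decay epsilon exponentially from `start` toward `end` (assumed schedule)."""
    return end + (start - end) * math.exp(-step / decay)


def select_action_sketch(q_values, step, rng=random):
    """Hypothetical epsilon-greedy choice over a plain list of Q-values."""
    if rng.random() < epsilon_at(step):
        # Explore: pick a uniformly random action index.
        return rng.randrange(len(q_values))
    # Exploit: pick the index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

Because the step argument is the total across all episodes, exploration keeps shrinking as training progresses instead of resetting at the start of each episode.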

Use select_action() to implement Decayed Epsilon Greediness and select the agent's action; you will need to use total_steps, the running total across episodes.
Before calculating the TD target, switch off gradient tracking.
After obtaining the next state, get the next state Q-Values from the target network.
Update the target network at the end of each step.
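
The loss-related steps above can be sketched as follows, assuming PyTorch and batched tensors sampled from the replay buffer. The function td_loss, its argument names, and the default gamma are hypothetical and only illustrate where gradient tracking is switched off and where the target network is used; the exercise's own variable names may differ.

```python
import torch
import torch.nn as nn


def td_loss(online_network, target_network, states, actions,
            rewards, next_states, dones, gamma=0.99):
    """Hypothetical DQN loss with Fixed Q-Targets for one sampled batch."""
    # Q-values of the actions actually taken, from the online network.
    q_values = online_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # No gradients should flow through the TD target.
    with torch.no_grad():
        # Next-state Q-values come from the target network.
        next_q_values = target_network(next_states).max(dim=1).values
        # Terminal transitions (dones == 1) contribute only their reward.
        td_target = rewards + gamma * next_q_values * (1 - dones)
    return nn.functional.mse_loss(q_values, td_target)
```

After calling backward() on this loss, only the online network receives gradients; the target network changes solely through update_target_network().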
