Implementing the complete DQN algorithm
The time has finally arrived! All the prerequisites are complete; you will now implement the full DQN algorithm and use it to train a Lunar Lander agent. This means that your algorithm will use not just Experience Replay, but also Decayed Epsilon-Greediness and Fixed Q-Targets.
The select_action() function implementing Decayed Epsilon Greediness is available for you to use, as is the update_target_network() function from the last exercise. All that remains is to fit those functions into the DQN training loop and to ensure that you are correctly using the Target Network in the loss calculations.
You need to keep a new step counter, total_steps, to decay the value of \(\varepsilon\) over time. This variable is initialized for you with the value 0.
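For reference, select_action() is provided by the course, so you do not need to write it yourself. Its signature (Q-values, step count, and start, end, decay keyword arguments) suggests an exponentially decayed epsilon-greedy schedule; the sketch below is an assumed reconstruction of such a helper, not necessarily the course's exact implementation.

import math
import random

import torch

def select_action(q_values, total_steps, start=0.9, end=0.05, decay=1000):
    """Assumed epsilon-greedy action selection with exponential epsilon decay."""
    # Epsilon slides from `start` toward `end` as total_steps grows
    epsilon = end + (start - end) * math.exp(-total_steps / decay)
    if random.random() < epsilon:
        # Explore: pick a random action index
        return random.randrange(q_values.shape[-1])
    # Exploit: pick the action with the highest Q-value
    return torch.argmax(q_values).item()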
Exercise instructions
- Use select_action() to implement Decayed Epsilon Greediness and select the agent's action; you will need to use total_steps, the running total across episodes.
- Before calculating the TD target, switch off gradient tracking.
- After obtaining the next state, get the next state Q-values.
- Update the target network at the end of each step (see the soft-update sketch after this list).
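Similarly, update_target_network() is the helper you built in the previous exercise. As a reminder of what it does, and under the assumption that it performs a Polyak-style soft update (which is what the tau argument in the sample code suggests), a minimal sketch could look like this:

import torch.nn as nn

def update_target_network(online_network: nn.Module, target_network: nn.Module, tau: float = 0.005) -> None:
    """Assumed soft (Polyak) update of the target network weights."""
    for online_param, target_param in zip(online_network.parameters(),
                                          target_network.parameters()):
        # Each target weight moves a small step (tau) toward the online weight,
        # keeping the TD targets slowly moving and therefore stable
        target_param.data.copy_(tau * online_param.data + (1.0 - tau) * target_param.data)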
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
for episode in range(10):
    state, info = env.reset()
    done = False
    step = 0
    episode_reward = 0
    while not done:
        step += 1
        total_steps += 1
        q_values = online_network(state)
        # Select the action with decayed epsilon greediness
        action = ____(____, ____, start=.9, end=.05, decay=1000)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Store the transition in the replay buffer
        replay_buffer.push(state, action, reward, next_state, done)
        if len(replay_buffer) >= batch_size:
            # Sample a minibatch of past transitions
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
            # Q-values of the actions actually taken, from the online network
            q_values = online_network(states).gather(1, actions).squeeze(1)
            # Ensure gradients are not tracked
            with ____:
                # Obtain the next state Q-values
                next_q_values = ____(next_states).amax(1)
                target_q_values = rewards + gamma * next_q_values * (1 - dones)
            loss = nn.MSELoss()(q_values, target_q_values)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Update the target network weights
            ____(____, ____, tau=.005)
        state = next_state
        episode_reward += reward
    describe_episode(episode, reward, episode_reward, step)
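Once you have attempted the exercise, you can compare your answer against the completion below. It is one plausible way to fill in the blanks, assuming the helpers behave as sketched above: select the action with select_action(), compute the next-state Q-values with the target network (here assumed to be named target_network) inside a torch.no_grad() block, and soft-update its weights with update_target_network(). These lines slot into the loop above rather than standing alone.

# Select the action with decayed epsilon greediness
action = select_action(q_values, total_steps, start=.9, end=.05, decay=1000)

# Ensure gradients are not tracked while building the TD target
with torch.no_grad():
    # Obtain the next state Q-values from the target network (Fixed Q-Targets)
    next_q_values = target_network(next_states).amax(1)
    target_q_values = rewards + gamma * next_q_values * (1 - dones)

# Update the target network weights with a soft update
update_target_network(online_network, target_network, tau=.005)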