
Implementing the complete DQN algorithm

The time has finally arrived! All the prerequisites are complete; you will now implement the full DQN algorithm and use it to train a Lunar Lander agent. This means that your algorithm will use not just Experience Replay, but also Decayed Epsilon-Greediness and Fixed Q-Targets.

The select_action() function implementing Decayed Epsilon-Greediness is available for you to use, as is the update_target_network() function from the last exercise. All that remains is to fit those functions into the DQN training loop and to ensure that you are correctly using the Target Network in the loss calculations.
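
Both helpers are provided in the exercise environment, so you do not need to write them yourself. As a rough reference only, a minimal sketch consistent with how they are called in the sample code below might look like the following; the exponential decay schedule, the argument order of update_target_network(), and the soft-update rule are assumptions, not the course's exact implementation.

import math
import random
import torch

def select_action(q_values, total_steps, start=0.9, end=0.05, decay=1000):
    # Decayed Epsilon-Greediness: epsilon shrinks exponentially with total_steps
    epsilon = end + (start - end) * math.exp(-total_steps / decay)
    if random.random() < epsilon:
        # Explore: choose a random action index
        return random.randrange(q_values.shape[-1])
    # Exploit: choose the action with the highest Q-value
    return torch.argmax(q_values).item()

def update_target_network(target_network, online_network, tau=0.005):
    # Soft update: nudge each target weight a fraction tau toward the online weight
    for target_param, online_param in zip(target_network.parameters(),
                                          online_network.parameters()):
        target_param.data.copy_(tau * online_param.data + (1 - tau) * target_param.data)

Using a small tau means the target network changes only slowly, which keeps the TD targets stable between updates.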

You need to keep a new step counter, total_steps, to decay the value of \(\varepsilon\) over time. This variable is initialized to 0 for you.
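
Assuming the exponential schedule from the sketch above, \(\varepsilon\) at step \(t\) (that is, total_steps) is

\[
\varepsilon_t = \varepsilon_{\text{end}} + \left(\varepsilon_{\text{start}} - \varepsilon_{\text{end}}\right) e^{-t/\text{decay}}
\]

so exploration begins near \(\varepsilon_{\text{start}} = 0.9\) and decays toward \(\varepsilon_{\text{end}} = 0.05\) as training progresses.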

This exercise is part of the course Deep Reinforcement Learning in Python.

Exercise instructions

  • Use select_action() to implement Decayed Epsilon-Greediness and select the agent's action; you will need to use total_steps, the running total across episodes.
  • Before calculating the TD target, switch off gradient tracking.
  • After obtaining the next state, get the next-state Q-values.
  • Update the target network at the end of each step; a generic sketch of this pattern follows this list.
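
For reference, the last three steps together form the Fixed Q-Targets pattern. A generic sketch of that pattern (with illustrative names, not the exercise solution itself) is shown below; in the exercise you will write the equivalent lines inline.

import torch

def compute_td_targets(target_network, next_states, rewards, dones, gamma):
    # Fixed Q-Targets: evaluate the target network with gradient tracking switched off
    with torch.no_grad():
        # Highest Q-value for each next state, according to the target network
        next_q_values = target_network(next_states).amax(1)
        # Zero out the bootstrap term for terminal transitions
        return rewards + gamma * next_q_values * (1.0 - dones)

After the optimizer step, the target network is then nudged toward the online network with the soft-update helper.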

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

for episode in range(10):
    state, info = env.reset()
    done = False
    step = 0
    episode_reward = 0
    while not done:
        step += 1
        total_steps += 1
        q_values = online_network(state)
        # Select the action with epsilon greediness
        action = ____(____, ____, start=.9, end=.05, decay=1000)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay_buffer.push(state, action, reward, next_state, done)        
        if len(replay_buffer) >= batch_size:
            states, actions, rewards, next_states, dones = replay_buffer.sample(64)
            q_values = online_network(states).gather(1, actions).squeeze(1)
            # Ensure gradients are not tracked
            with ____:
                # Obtain the next state Q-values
                next_q_values = ____(next_states).amax(1)
                target_q_values = rewards + gamma * next_q_values * (1-dones)
            loss = nn.MSELoss()(q_values, target_q_values)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()   
            # Update the target network weights
            ____(____, ____, tau=.005)
        state = next_state
        episode_reward += reward    
    describe_episode(episode, reward, episode_reward, step)