
Implementing the complete DQN algorithm

The time has finally arrived! All the prerequisites are complete; you will now implement the full DQN algorithm and use it to train a Lunar Lander agent. This means that your algorithm will use not just Experience Replay, but also Decayed Epsilon-Greediness and Fixed Q-Targets.

The select_action() function implementing Decayed Epsilon-Greediness is available for you to use, as is the update_target_network() function from the last exercise. All that remains is to fit those functions into the DQN training loop and to ensure that you use the Target Network correctly in the loss calculations.

You need to keep a new step counter, total_steps, to decay the value of \(\varepsilon\) over time. This variable is initialized to 0 for you.
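
The implementations of select_action() and update_target_network() are provided by the exercise environment and are not reproduced here. As a rough sketch only, assuming an exponential \(\varepsilon\)-decay schedule driven by total_steps and a soft (Polyak) update controlled by tau, they might look something like this:

import math
import random
import torch

def select_action(q_values, total_steps, start=0.9, end=0.05, decay=1000):
    # Decayed epsilon-greedy: epsilon falls exponentially from `start` toward `end`
    # as the running step count grows (assumed schedule).
    epsilon = end + (start - end) * math.exp(-total_steps / decay)
    if random.random() < epsilon:
        # Explore: uniformly random action index
        return random.randrange(q_values.shape[-1])
    # Exploit: greedy action with the highest Q-value
    return torch.argmax(q_values).item()

def update_target_network(target_network, online_network, tau=0.005):
    # Soft (Polyak) update: move each target parameter a small step (tau)
    # toward the corresponding online parameter.
    for target_param, online_param in zip(target_network.parameters(),
                                          online_network.parameters()):
        target_param.data.copy_(tau * online_param.data + (1 - tau) * target_param.data)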

This exercise is part of the course Deep Reinforcement Learning in Python.

Instructions

  • Use select_action() to implement Decayed Epsilon-Greediness and select the agent's action; you will need to pass total_steps, the running total across episodes.
  • Before calculating the TD target, switch off gradient tracking (the resulting target is written out after this list).
  • After obtaining the next state, get the next-state Q-values.
  • Update the target network at the end of each step.
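
Concretely, with Fixed Q-Targets the TD target computed inside the no-gradient block is \( y = r + \gamma \,(1 - d)\, \max_{a'} Q_{\theta^-}(s', a') \), where \(Q_{\theta^-}\) is the target network and \(d\) is 1 for terminal transitions, so no bootstrapped value is added once the episode has ended. This corresponds directly to the target_q_values line in the code below.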

Hands-on interactive exercise

Try this exercise by completing this sample code.

for episode in range(10):
    state, info = env.reset()
    done = False
    step = 0
    episode_reward = 0
    while not done:
        step += 1
        total_steps += 1
        q_values = online_network(state)
        # Select the action with epsilon greediness
        action = ____(____, ____, start=.9, end=.05, decay=1000)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay_buffer.push(state, action, reward, next_state, done)        
        if len(replay_buffer) >= batch_size:
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
            q_values = online_network(states).gather(1, actions).squeeze(1)
            # Ensure gradients are not tracked
            with ____:
                # Obtain the next state Q-values
                next_q_values = ____(next_states).amax(1)
                target_q_values = rewards + gamma * next_q_values * (1-dones)
            loss = nn.MSELoss()(q_values, target_q_values)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()   
            # Update the target network weights
            ____(____, ____, tau=.005)
        state = next_state
        episode_reward += reward    
    describe_episode(episode, reward, episode_reward, step)
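
For reference, here is one possible way to fill in the blanked-out lines, shown in isolation. It assumes the same setup as the scaffold above (env, online_network, target_network, replay_buffer, optimizer, gamma, and total_steps already defined), and it assumes that select_action() expects the current Q-values followed by the step counter and that update_target_network() expects the target network followed by the online network; check the helper signatures in your environment, since those argument orders are an assumption here.

# Inside the while loop, right after computing q_values
# (argument order for select_action is assumed):
action = select_action(q_values, total_steps, start=.9, end=.05, decay=1000)

# Inside the replay-buffer branch, before the loss is computed:
with torch.no_grad():
    # Next-state Q-values come from the target network (Fixed Q-Targets)
    next_q_values = target_network(next_states).amax(1)
    target_q_values = rewards + gamma * next_q_values * (1 - dones)

# After optimizer.step(), soft-update the target network weights
# (argument order for update_target_network is assumed):
update_target_network(target_network, online_network, tau=.005)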