
Training the A2C algorithm

Time to train our Lunar Lander using the A2C algorithm! You have all the building blocks; now it's time to put them together.

The actor and critic networks have been instantiated as actor and critic, as have their optimizers actor_optimizer and critic_optimizer.
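
If you want to reproduce this setup outside the exercise, the sketch below shows one way the pre-instantiated objects could be defined. The layer sizes, learning rates, and environment version are assumptions rather than the course's exact choices; Lunar Lander has an 8-dimensional observation and 4 discrete actions.

import torch.nn as nn
import torch.optim as optim
import gymnasium as gym

env = gym.make("LunarLander-v2")  # "LunarLander-v3" on newer Gymnasium releases

class Actor(nn.Module):
    """Policy network: maps a state (8 values) to action probabilities (4 actions)."""
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Value network: maps a state to a single state-value estimate."""
    def __init__(self, state_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state):
        return self.net(state)

actor = Actor()
critic = Critic()
actor_optimizer = optim.Adam(actor.parameters(), lr=0.001)    # learning rates are assumptions
critic_optimizer = optim.Adam(critic.parameters(), lr=0.001)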

Your REINFORCE select_action() function and the calculate_losses() function from the previous exercise are also available for you to use here.
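
As a reminder, a common formulation of those two helpers looks like the sketch below; it illustrates the idea and is not necessarily the exact code from the previous exercises. select_action samples from the actor's action distribution and returns the action with its log-probability, while calculate_losses uses the critic's one-step TD error as the advantage: the actor loss is the negative log-probability weighted by that advantage, and the critic loss is the squared TD error. The discount factor gamma is an assumed value.

import torch
from torch.distributions import Categorical

gamma = 0.99  # discount factor; the exact value used in the course is an assumption

def select_action(actor, state):
    """Sample an action from the actor's policy and return it with its log-probability."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    probs = actor(state_t)
    dist = Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def calculate_losses(critic, action_log_prob, reward, state, next_state, done):
    """A2C losses: the one-step TD error serves as the advantage for the actor."""
    state_t = torch.as_tensor(state, dtype=torch.float32)
    next_state_t = torch.as_tensor(next_state, dtype=torch.float32)
    value = critic(state_t)
    with torch.no_grad():  # the bootstrap target is treated as a constant
        next_value = critic(next_state_t)
    td_target = reward + gamma * next_value * (1 - int(done))
    td_error = td_target - value
    actor_loss = -action_log_prob * td_error.detach()  # policy gradient weighted by the advantage
    critic_loss = td_error ** 2                        # squared TD error for the critic
    return actor_loss, critic_loss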

This exercise is part of the course

Deep Reinforcement Learning in Python


Exercise instructions

  • Let the actor select the action, given the state.
  • Calculate the losses for both actor and critic.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:
        step += 1
        # Select the action
        ____
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        # Calculate the losses
        ____, ____ = ____(
            critic, action_log_prob, 
            reward, state, next_state, done)        
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()
        state = next_state
    describe_episode(episode, reward, episode_reward, step)
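
For reference, one possible completed version of the loop is shown below, assuming select_action(actor, state) returns the chosen action and its log-probability, as in the sketch above; describe_episode is the course-provided helper that prints the episode summary.

for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:
        step += 1
        # Let the actor select the action, given the current state
        action, action_log_prob = select_action(actor, state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        # Calculate the actor and critic losses from the TD error
        actor_loss, critic_loss = calculate_losses(
            critic, action_log_prob,
            reward, state, next_state, done)
        # Update the actor
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        # Update the critic
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()
        state = next_state
    describe_episode(episode, reward, episode_reward, step)  # course-provided helper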