Aan de slagGa gratis aan de slag

Training the PPO algorithm

You will now use the familiar A2C training loop to train the PPO algorithm.

This training loop does not take full advantage of the clipped surrogate objective function, and as a result this algorithm should not perform much better than A2C; it serves as illustration of the concepts learned around the clipped surrogate objective and the entropy bonus.

Deze oefening maakt deel uit van de cursus

Deep Reinforcement Learning in Python

Cursus bekijken

Oefeninstructies

  • Remove the entropy bonus from the actor loss, using value 0.01 for the \(c_{entropy}\) parameter.

Praktische interactieve oefening

Probeer deze oefening eens door deze voorbeeldcode in te vullen.

for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:    
        step += 1
        action, action_log_prob, entropy = select_action(actor, state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        done = terminated or truncated
        actor_loss, critic_loss = calculate_losses(critic, action_log_prob, action_log_prob,
                                                   reward, state, next_state, done)
        # Remove the entropy bonus from the actor loss
        actor_loss -= ____ * ____
        actor_optimizer.zero_grad(); actor_loss.backward(); actor_optimizer.step()
        critic_optimizer.zero_grad(); critic_loss.backward(); critic_optimizer.step()
        state = next_state
    describe_episode(episode, reward, episode_reward, step)
Code bewerken en uitvoeren