Training the PPO algorithm
You will now use the familiar A2C training loop to train the PPO algorithm.
This training loop does not take full advantage of the clipped surrogate objective function, so the algorithm should not perform much better than A2C. It serves as an illustration of the concepts learned about the clipped surrogate objective and the entropy bonus.
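As a rough illustration of why this loop gains little from clipping, here is a minimal, framework-free sketch of the per-step clipped surrogate objective. The function name `clipped_surrogate`, the `epsilon=0.2` default, and the numeric inputs are assumptions made for this example, not the course's helper code. It also assumes that the two `action_log_prob` arguments passed to `calculate_losses` in the exercise stand for the new and old policy log-probabilities, in which case the probability ratio is always 1 and the clip never activates.

```python
import math


def clipped_surrogate(log_prob, old_log_prob, advantage, epsilon=0.2):
    """Per-step PPO clipped surrogate objective (a quantity to maximize).

    Hypothetical helper for illustration; epsilon=0.2 is an assumed default.
    """
    # Probability ratio between the new and the old policy for the taken action
    ratio = math.exp(log_prob - old_log_prob)
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon]
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Keep the more pessimistic of the unclipped and clipped terms
    return min(ratio * advantage, clipped_ratio * advantage)


# Identical old and new log-probabilities: ratio is 1, clipping has no effect
same_policy = clipped_surrogate(-0.5, -0.5, advantage=2.0)

# The new policy is far more likely to pick the action: the ratio gets clipped
moved_policy = clipped_surrogate(-0.1, -1.0, advantage=2.0)

print(same_policy, moved_policy)
```

With identical log-probabilities the objective reduces to the plain advantage, which is essentially the A2C policy-gradient signal. Clipping only matters once the policy being updated has drifted away from the policy that collected the data.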
This exercise is part of the course
Deep Reinforcement Learning in Python
Exercise instructions
- Subtract the entropy bonus from the actor loss, using the value 0.01 for the \(c_{entropy}\) parameter.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:
        step += 1
        action, action_log_prob, entropy = select_action(actor, state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode_reward += reward
        done = terminated or truncated
        actor_loss, critic_loss = calculate_losses(critic, action_log_prob, action_log_prob,
                                                   reward, state, next_state, done)
        # Subtract the entropy bonus from the actor loss
        actor_loss -= ____ * ____
        actor_optimizer.zero_grad(); actor_loss.backward(); actor_optimizer.step()
        critic_optimizer.zero_grad(); critic_loss.backward(); critic_optimizer.step()
        state = next_state
    describe_episode(episode, reward, episode_reward, step)
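To make the entropy bonus more concrete outside the training loop, the sketch below computes the entropy of a discrete action distribution in plain Python and shows how subtracting a scaled entropy term lowers a loss value. The helper names `policy_entropy` and `with_entropy_bonus`, the example distributions, and the loss value 1.0 are assumptions chosen for illustration; in the exercise, the entropy is returned by `select_action` rather than computed this way.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def with_entropy_bonus(actor_loss, entropy, c_entropy=0.01):
    """Hypothetical helper: subtract a scaled entropy term from a loss value."""
    return actor_loss - c_entropy * entropy


# A uniform policy over four actions has the maximum entropy, ln(4)
uniform_entropy = policy_entropy([0.25, 0.25, 0.25, 0.25])

# A nearly deterministic policy has an entropy close to zero
peaked_entropy = policy_entropy([0.97, 0.01, 0.01, 0.01])

# For the same base loss, the more exploratory policy ends up with the lower loss
loss_uniform = with_entropy_bonus(1.0, uniform_entropy)
loss_peaked = with_entropy_bonus(1.0, peaked_entropy)

print(uniform_entropy, peaked_entropy, loss_uniform, loss_peaked)
```

Because the optimizer minimizes the loss, subtracting the entropy term rewards policies that keep their action probabilities spread out. The small coefficient keeps this exploration incentive from dominating the policy-gradient signal.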