Training the A2C algorithm
Time to train our Lunar Lander using the A2C algorithm! You have all the building blocks; now it's a matter of putting them together.
The actor and critic networks have been instantiated as actor and critic, as have their optimizers, actor_optimizer and critic_optimizer.
Your REINFORCE select_action() function and the calculate_losses() function from the previous exercise are also available for you to use here.
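As a reminder of what a REINFORCE-style action selector typically returns, here is a minimal sketch in plain Python. It is not the course's select_action(): the name sample_action_sketch, the rng argument, and the example probabilities are all hypothetical, and a real implementation would get the probabilities from the actor network and keep the log-probability as a tensor so gradients can flow through it.

```python
import math
import random


def sample_action_sketch(action_probs, rng=random):
    """Sample an action index from a categorical distribution.

    Hypothetical stand-in for illustration only. Returns the sampled
    action and the log-probability the policy assigned to it.
    """
    actions = list(range(len(action_probs)))
    # Draw one action, weighted by the policy's probabilities.
    action = rng.choices(actions, weights=action_probs, k=1)[0]
    # The log-probability of the chosen action is what the actor loss uses.
    log_prob = math.log(action_probs[action])
    return action, log_prob


# Lunar Lander has four discrete actions; these probabilities are made up.
probs = [0.1, 0.2, 0.3, 0.4]
action, action_log_prob = sample_action_sketch(probs, random.Random(0))
```

The two return values matter for different reasons: the action is what gets passed to env.step(), while the log-probability is what the actor loss is built from.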
This exercise is part of the course Deep Reinforcement Learning in Python.
Exercise instructions
- Let the actor select the action, given the state.
- Calculate the losses for both actor and critic.
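To make the second step concrete, the sketch below works through one common way of computing one-step actor and critic losses from a single transition. It is an assumption that calculate_losses() follows this temporal-difference form; the function name td_losses_sketch, the gamma default, and the numbers are hypothetical. It takes the critic's value estimates directly instead of the critic network and states, so it runs without any deep learning library.

```python
def td_losses_sketch(value, next_value, action_log_prob, reward, done, gamma=0.99):
    """One-step TD losses for a single transition (illustrative only).

    value and next_value are the critic's estimates for the current and
    next state. In an autograd framework, the TD error used in the actor
    loss would usually be treated as a constant (detached).
    """
    # Bootstrapped target; (1 - done) drops the next-state value at episode end.
    td_target = reward + gamma * next_value * (1 - done)
    # How much better or worse the outcome was than the critic predicted.
    td_error = td_target - value
    # Actor: raise the log-probability of actions with positive TD error.
    actor_loss = -action_log_prob * td_error
    # Critic: squared error between the prediction and the TD target.
    critic_loss = td_error ** 2
    return actor_loss, critic_loss


# Mid-episode transition: the next state's value contributes to the target.
actor_loss, critic_loss = td_losses_sketch(
    value=1.0, next_value=2.0, action_log_prob=-0.5,
    reward=1.0, done=False, gamma=0.9)

# Final transition: the target is just the reward.
final_actor_loss, final_critic_loss = td_losses_sketch(
    value=1.0, next_value=2.0, action_log_prob=-0.5,
    reward=1.0, done=True, gamma=0.9)
```

In the first call the TD target is 1.0 + 0.9 * 2.0 = 2.8, so the TD error is about 1.8. In the second call done is True, the target collapses to the reward of 1.0, and both losses are zero because the critic's estimate already matched it.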
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:
        step += 1
        if done:
            break
        # Select the action
        ____
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        # Calculate the losses
        ____, ____ = ____(
            critic, action_log_prob,
            reward, state, next_state, done)
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()
        state = next_state
    describe_episode(episode, reward, episode_reward, step)