Training the REINFORCE algorithm

You are ready to train your Lunar Lander using REINFORCE! All you need to do is implement the REINFORCE training loop, including the REINFORCE loss calculation.

Because the steps to calculate the loss span both the inner and the outer loop, you will not use a calculate_loss() function this time.

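The inner loop accumulates two quantities: the log-probabilities of the selected actions and the discounted episode return. When the episode is complete, you can use both of these quantities to calculate the loss.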
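
As a rough illustration of how the two quantities combine, here is a minimal sketch in plain Python with made-up numbers. The helper name `reinforce_loss_value`, the discount factor `gamma`, and the toy rewards are assumptions for illustration only. The negative sign follows the standard REINFORCE formulation, and the discounting mirrors the template's convention of incrementing `step` before using it.

```python
import math

def reinforce_loss_value(rewards, log_probs, gamma=0.99):
    """Return (R, loss) for one finished episode (illustrative helper).

    rewards   -- per-step rewards, in order
    log_probs -- log-probabilities of the actions actually taken
    gamma     -- discount factor (assumed value, not given in the exercise)
    """
    R = 0.0
    for step, reward in enumerate(rewards, start=1):
        # Same convention as the exercise template: step starts at 1,
        # so the first reward is weighted by gamma ** 1.
        R += (gamma ** step) * reward
    # Standard REINFORCE loss: minus the return times the summed log-probs.
    loss = -R * sum(log_probs)
    return R, loss

# Toy episode: three steps with hand-picked rewards and action probabilities.
rewards = [1.0, 0.0, 2.0]
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
R, loss = reinforce_loss_value(rewards, log_probs, gamma=0.5)
print(R, loss)
```

In the exercise itself the log-probabilities are tensors, so that calling `loss.backward()` can propagate gradients to the policy network; this sketch only shows the arithmetic.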

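For reference, this is the expression for the REINFORCE loss function, where $R_\tau$ is the discounted return of the episode and the sum runs over the steps of the episode:

$$L(\theta) = -R_\tau \sum_{t} \log \pi_\theta(a_t \mid s_t)$$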

You will again use the describe_episode() function to print how your agent is doing at each episode.
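
The course provides `describe_episode()` and its implementation is not shown here. If you want to run the loop outside the course environment, a hypothetical stand-in with the same four arguments could look like the sketch below. The output format is an assumption, not the course's actual one.

```python
def describe_episode(episode, reward, episode_reward, step):
    """Hypothetical stand-in for the course helper: print a one-line summary.

    episode        -- zero-based episode index
    reward         -- reward received at the final step
    episode_reward -- undiscounted sum of rewards over the episode
    step           -- number of steps taken in the episode
    """
    summary = (f"Episode {episode + 1}: {step} steps, "
               f"final reward {reward:.2f}, total reward {episode_reward:.2f}")
    print(summary)
    return summary

summary = describe_episode(0, 100.0, 245.5, 312)
```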

This exercise is part of the course

Deep Reinforcement Learning in Python

Exercise instructions

Append the log-probability of the selected action to the episode's log-probabilities.
Increment the episode return with the discounted reward of the current step.
Calculate the REINFORCE episode loss.
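
The first instruction relies on `torch.cat`, which cannot concatenate zero-dimensional tensors, so each log-probability needs shape `[1]` before it is appended. The sketch below illustrates this with a stand-in policy and random placeholder states. The tiny network and this version of `select_action` are assumptions, not the course's implementations. Only the sizes (8 observations, 4 discrete actions) match Lunar Lander.

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)

# Hypothetical stand-in policy: 8 observations -> probabilities over 4 actions.
policy_network = torch.nn.Sequential(
    torch.nn.Linear(8, 4),
    torch.nn.Softmax(dim=-1),
)

def select_action(policy_network, state):
    """Sample an action; return it with its log-probability (shape [1])."""
    action_probs = policy_network(torch.as_tensor(state, dtype=torch.float32))
    dist = Categorical(probs=action_probs)
    action = dist.sample()
    # reshape(1) turns the scalar log-probability into a 1-D tensor for torch.cat
    return action.item(), dist.log_prob(action).reshape(1)

episode_log_probs = torch.tensor([])
for _ in range(3):
    state = torch.randn(8)  # placeholder state, not a real environment
    action, log_prob = select_action(policy_network, state)
    # Append this step's log-probability to the running tensor.
    episode_log_probs = torch.cat((episode_log_probs, log_prob))

print(episode_log_probs.shape)  # torch.Size([3])
```

Because the concatenated tensor stays attached to the computation graph, a loss built from its sum can be backpropagated to the policy parameters.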

Hands-on interactive exercise

Try to solve this exercise by completing the sample code.

for episode in range(50):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    episode_log_probs = torch.tensor([])
    R = 0
    while not done:
        step += 1
        action, log_prob = select_action(policy_network, state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        # Append to the episode action log probabilities
        episode_log_probs = torch.cat((____, ____))
        # Increment the episode return
        R += (____ ** step) * ____
        state = next_state
    # Calculate the episode loss
    loss = ____ * ____.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    describe_episode(episode, reward, episode_reward, step)

