DRL training loop

To let the agent experience the environment repeatedly, you need to set up a training loop.

Many DRL algorithms share this core structure:

Loop over episodes
Loop over steps within each episode
At each step, select an action, calculate the loss, and update the network
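
The structure above can be tried out in plain Python before a real environment or network is involved. This is only a minimal sketch: `ToyEnv`, `choose_action`, and `update` are hypothetical stand-ins that are not part of the course code, and `ToyEnv` merely imitates the `reset()` and `step()` return shapes used later in this exercise.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: every episode ends after `horizon` steps."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}  # (state, info)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        terminated = self.t >= self.horizon
        truncated = False  # this toy environment never cuts an episode short
        return self.t, reward, terminated, truncated, {}


def choose_action(state):
    # Placeholder policy (assumption): pick one of two actions at random.
    return random.choice([0, 1])


def update(state, action, next_state, reward, done):
    # Placeholder for "calculate the loss and update the network"; does nothing here.
    return None


env = ToyEnv(horizon=5)
steps_per_episode = []

for episode in range(3):                  # loop over episodes
    state, info = env.reset()
    done = False
    steps = 0
    while not done:                       # loop over steps within the episode
        action = choose_action(state)     # select an action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        update(state, action, next_state, reward, done)  # loss + network update would go here
        state = next_state
        steps += 1
    steps_per_episode.append(steps)

print(steps_per_episode)  # → [5, 5, 5]
```

Because the toy environment always terminates after five steps, each episode runs the inner loop exactly five times; in a real environment the episode length varies.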

You are given placeholder select_action() and calculate_loss() functions so that the code can run. The Network and optimizer defined in the previous exercise are also available to you.

This exercise is part of the course Deep Reinforcement Learning in Python.

Exercise instructions

Make sure the outer loop (over episodes) runs for ten episodes.
Make sure the inner loop (over steps) runs until the episode is complete.
In the env environment, take the action selected by select_action().
At the end of each inner-loop iteration, update the state before starting the next step.

Hands-on interactive exercise

Try this exercise by completing the sample code.

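# Assumes gym, network, optimizer, select_action() and calculate_loss()
# are already defined in the exercise session.
env = gym.make("LunarLander-v2")
# Run ten episodes
for episode in ____:
    # Start a new episode and get the initial state
    state, info = env.reset()
    done = False
    # Run through steps until done
    while ____:
        action = select_action(network, state)
        # Take the action
        next_state, reward, terminated, truncated, _ = ____
        # The episode ends when it terminates or is truncated
        done = terminated or truncated
        # Compute the loss and take one gradient step
        loss = calculate_loss(network, state, action, next_state, reward, done)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Update the state
        state = ____
    print(f"Episode {episode} complete.")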
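
One possible way to fill in the four blanks is sketched below, wrapped in a hypothetical `train()` function that receives the environment, network, optimizer, and the two helper functions as arguments. The wrapper and the stand-ins `FakeLoss`, `FakeOptimizer`, and `TwoStepEnv` are assumptions made for illustration so the sketch runs without gym or PyTorch; they only mimic the method names the loop calls and are not the course's objects.

```python
def train(env, network, optimizer, select_action, calculate_loss, num_episodes=10):
    """Run the exercise's loop with the four blanks filled in."""
    for episode in range(num_episodes):        # blank 1: ten episodes by default
        state, info = env.reset()
        done = False
        while not done:                        # blank 2: run until the episode ends
            action = select_action(network, state)
            # blank 3: take the action in the environment
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            loss = calculate_loss(network, state, action, next_state, reward, done)
            optimizer.zero_grad()              # clear gradients from the previous step
            loss.backward()                    # backpropagate the loss
            optimizer.step()                   # apply the parameter update
            state = next_state                 # blank 4: move on to the next state
        print(f"Episode {episode} complete.")


# --- Hypothetical stand-ins so the sketch runs without gym or PyTorch ---
calls = []  # records the order of the update calls


class FakeLoss:
    def backward(self):
        calls.append("backward")


class FakeOptimizer:
    def zero_grad(self):
        calls.append("zero_grad")

    def step(self):
        calls.append("step")


class TwoStepEnv:
    """Ends every episode on its second step (terminated=True)."""

    def reset(self):
        self.t = 0
        return 0, {}

    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 2, False, {}


train(
    env=TwoStepEnv(),
    network=None,
    optimizer=FakeOptimizer(),
    select_action=lambda network, state: 0,
    calculate_loss=lambda *args: FakeLoss(),
)
```

With the objects provided in the exercise session, the same function would presumably be called as `train(env, network, optimizer, select_action, calculate_loss)`. Passing the dependencies in as arguments is a design choice of this sketch; the exercise itself uses them as globals.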


Discover how deep reinforcement learning improves upon traditional Reinforcement Learning while studying and implementing your first Deep Q Learning algorithm.

Exercise 1: Introduction to deep reinforcement learning
Exercise 2: Setting up the environment and neural network
Exercise 3: DRL training loop (current exercise)
Exercise 4: Introduction to deep Q-learning
Exercise 5: Deep learning and DQN
Exercise 6: The Q-Network architecture
Exercise 7: Instantiating the Q-network
Exercise 8: The barebone DQN algorithm
Exercise 9: Barebone DQN action selection
Exercise 10: Barebone DQN loss function
Exercise 11: Training a barebone DQN

Dive into Deep Q-learning by implementing the original DQN algorithm, featuring Experience Replay, epsilon-greediness and fixed Q-targets. Beyond DQN, you will then explore two fascinating extensions that improve the performance and stability of Deep Q-learning: Double DQN and Prioritized Experience Replay.

Exercise 1: DQN with experience replay
Exercise 2: The double-ended queue
Exercise 3: Experience replay buffer
Exercise 4: DQN with experience replay
Exercise 5: The complete DQN algorithm
Exercise 6: Epsilon-greediness
Exercise 7: Fixed Q-targets
Exercise 8: Implementing the complete DQN algorithm
Exercise 9: Double DQN
Exercise 10: Online network and target network in DDQN
Exercise 11: Training the double DQN
Exercise 12: Prioritized experience replay
Exercise 13: Prioritized experience replay buffer
Exercise 14: Sampling from the PER buffer
Exercise 15: DQN with prioritized experience replay

Learn about the foundational concepts of policy gradient methods found in DRL. You will begin with the policy gradient theorem, which forms the basis for these methods. Then, you will implement the REINFORCE algorithm, a powerful approach to learning policies. The chapter will then guide you through Actor-Critic methods, focusing on the Advantage Actor-Critic (A2C) algorithm, which combines the strengths of both policy gradient and value-based methods to enhance learning efficiency and stability.

Exercise 1: Introduction to policy gradient
Exercise 2: The policy network architecture
Exercise 3: Working with discrete distributions
Exercise 4: Policy gradient and REINFORCE
Exercise 5: Action selection in REINFORCE
Exercise 6: Training the REINFORCE algorithm
Exercise 7: Advantage Actor Critic
Exercise 8: Critic network
Exercise 9: Actor Critic loss calculations
Exercise 10: Training the A2C algorithm

Explore Proximal Policy Optimization (PPO) for robust DRL performance. Next, you will examine using an entropy bonus in PPO, which encourages exploration by preventing premature convergence to deterministic policies. You'll also learn about batch updates in policy gradient methods. Finally, you will learn about hyperparameter optimization with Optuna, a powerful tool for optimizing performance in your DRL models.

Exercise 1: Proximal policy optimization
Exercise 2: The clipped probability ratio
Exercise 3: The clipped surrogate objective function
Exercise 4: Entropy bonus and PPO
Exercise 5: Entropy playground
Exercise 6: Training the PPO algorithm
Exercise 7: Batch updates in policy gradient
Exercise 8: Minibatch and DRL
Exercise 9: A2C with batch updates
Exercise 10: Hyperparameter optimization with Optuna
Exercise 11: Hyperparameter or not?
Exercise 12: Hands-on with Optuna
Exercise 13: Congratulations!