A2C with batch updates

So far in this course, you have used variations of the same basic DRL training loop. In practice, there are several ways to extend this structure, for example to support batch updates.

You will now return to the A2C training loop in the Lunar Lander environment; but instead of updating the networks at every step, you will wait for 10 steps to elapse before running the gradient descent step. By averaging the losses over 10 steps, you will get slightly more stable updates.
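To make the stability claim concrete, here is a minimal sketch that compares the spread of noisy single-step losses with the spread of their 10-step averages. The loss values are simulated (a constant plus independent Gaussian noise), which is an assumption made purely for illustration; real per-step A2C losses are correlated, so the actual reduction will typically be smaller.

```python
import torch

torch.manual_seed(0)

# Simulated per-step losses (an assumption for illustration):
# a constant value of 1.0 plus independent unit-variance noise.
per_step_losses = 1.0 + torch.randn(1000)

# Group consecutive steps into batches of 10 and average each batch.
batch_losses = per_step_losses.view(100, 10).mean(dim=1)

single_std = per_step_losses.std().item()
batch_std = batch_losses.std().item()
print(f"std of single-step losses:  {single_std:.3f}")
print(f"std of 10-step batch means: {batch_std:.3f}")
```

Under the independence assumption, the standard deviation of a 10-step average shrinks by roughly a factor of the square root of 10, which is why the averaged loss gives a less noisy gradient signal.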

This exercise is part of the course Deep Reinforcement Learning in Python.

Exercise instructions

Append the losses from each step to the loss tensors for the current batch.
Calculate the batch losses.
Reinitialize the loss tensors.
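The three steps above follow a general accumulate, average, reset pattern. The sketch below shows that pattern on a toy problem rather than on the A2C networks: the linear model, the random data, and the names `BATCH_SIZE`, `losses`, and `num_updates` are all hypothetical stand-ins, and the per-step loss here is a scalar that has to be reshaped before concatenation (the course's `calculate_losses` helper may return a different shape).

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a network: one linear layer trained on random data.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

BATCH_SIZE = 10            # number of steps to accumulate before each update
losses = torch.tensor([])  # empty 1-D tensor that collects per-step losses
num_updates = 0

for step in range(25):
    x = torch.randn(4)
    target = torch.randn(1)
    step_loss = (model(x) - target).pow(2).sum()  # 0-dim (scalar) tensor

    # 1. Append: torch.cat needs 1-D inputs, so the scalar is reshaped first.
    losses = torch.cat((losses, step_loss.unsqueeze(0)))

    if len(losses) >= BATCH_SIZE:
        # 2. Batch loss: average the accumulated per-step losses.
        batch_loss = losses.mean()
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()
        num_updates += 1
        # 3. Reset: start a fresh, empty tensor for the next batch.
        losses = torch.tensor([])

print(f"updates: {num_updates}, losses left over: {len(losses)}")
```

Because the parameters are only changed after `backward()` has run on the whole batch, every loss in a batch is computed with the same weights. Any losses left over when the loop ends are simply not used in this sketch.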

Hands-on interactive exercise

Complete the sample code below to finish this exercise.

actor_losses = torch.tensor([])
critic_losses = torch.tensor([])
for episode in range(10):
    state, info = env.reset()
    done = False
    episode_reward = 0
    step = 0
    while not done:
        step += 1
        action, action_log_prob = select_action(actor, state)                
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_reward += reward
        actor_loss, critic_loss = calculate_losses(
            critic, action_log_prob, 
            reward, state, next_state, done)
        # Append to the loss tensors
        actor_losses = torch.cat((____, ____))
        critic_losses = torch.cat((____, ____))
        if len(actor_losses) >= 10:
            # Calculate the batch losses
            actor_loss_batch = actor_losses.____
            critic_loss_batch = critic_losses.____
            actor_optimizer.zero_grad()
            actor_loss_batch.backward()
            actor_optimizer.step()
            critic_optimizer.zero_grad()
            critic_loss_batch.backward()
            critic_optimizer.step()
            # Reinitialize the loss tensors
            actor_losses = ____
            critic_losses = ____
        state = next_state
    describe_episode(episode, reward, episode_reward, step)

