Fixed Q-targets

You are getting ready to train your Lunar Lander with fixed Q-targets. As a prerequisite, you need to instantiate both the online network, which selects the action, and the target network, which is used to compute the TD target.

You also need to write an update_target_network function that you can call at every training step. The target network is not updated by gradient descent; instead, update_target_network nudges its weights a small amount towards the Q-network, so that it stays fairly stable over time.
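The nudge described above can be written as a weighted average. The symbols below are assumptions chosen for illustration and do not appear in the exercise: \theta stands for the online network's parameters, \theta^{-} for the target network's parameters, and \tau for the tau argument of update_target_network.

```latex
\theta^{-} \leftarrow \tau\,\theta + (1 - \tau)\,\theta^{-}, \qquad 0 < \tau \le 1
```

With \tau = 0.001, each call moves the target parameters 0.1% of the way towards the online parameters. If the online parameters were held fixed, the remaining gap after n calls would shrink by a factor of (1 - \tau)^{n}.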

Note: for this exercise only, you use a very small network so that its state dictionary is easy to print and inspect. It has a single hidden layer of size two, and both the action space and the state space have dimension 2.
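To make the size of such a network concrete, here is a minimal sketch in PyTorch. The exercise does not show how its networks are built, so the architecture (nn.Sequential with a ReLU activation) and the helper names make_tiny_network and show_state_dict are assumptions for illustration; show_state_dict is only a stand-in for the environment's print_state_dict(), whose real implementation is not given.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random initial weights


def make_tiny_network(state_size=2, hidden_size=2, action_size=2):
    """Hypothetical helper: a Q-network with one hidden layer of two units."""
    return nn.Sequential(
        nn.Linear(state_size, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, action_size),
    )


def show_state_dict(network):
    """Hypothetical stand-in for print_state_dict(): print each tensor by name."""
    for name, tensor in network.state_dict().items():
        print(name, tensor.tolist())


online_network = make_tiny_network()
target_network = make_tiny_network()

# The state dict maps parameter names to tensors: two weight matrices
# and two bias vectors, twelve numbers in total.
show_state_dict(online_network)
n_params = sum(t.numel() for t in online_network.state_dict().values())
print("number of parameters:", n_params)
```

With only twelve parameters, the effect of a target-network update can be checked by eye.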

The print_state_dict() function is available in your environment for printing the state dictionary.

This exercise is part of the course Deep Reinforcement Learning in Python.

Exercise instructions

Obtain the .state_dict() for both the target and the online networks.
Update the target network's state dictionary by taking a weighted average of the online network's parameters and the target network's parameters, using tau as the weight of the online network and 1 - tau as the weight of the target network.
Load the updated state dictionary back into the target network.

Hands-on interactive exercise

Complete this sample code to finish the exercise.

def update_target_network(target_network, online_network, tau):
    # Obtain the state dicts for both networks
    target_net_state_dict = ____
    online_net_state_dict = ____
    for key in online_net_state_dict:
        # Calculate the updated state dict for the target network
        target_net_state_dict[key] = (online_net_state_dict[____] * ____ + target_net_state_dict[____] * ____)
    # Load the updated state dict into the target network (once, after the loop)
    target_network.____
    return None

print("online network weights:", print_state_dict(online_network))
print("target network weights (pre-update):", print_state_dict(target_network))
update_target_network(target_network, online_network, 0.001)
print("target network weights (post-update):", print_state_dict(target_network))
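For reference, one possible way to fill in the blanks is sketched below. It is a self-contained illustration rather than the course's official solution: the tiny networks are built with a hypothetical make_tiny_network helper (the exercise does not show how its networks are constructed), and the state dictionaries are printed directly because print_state_dict() exists only in the exercise environment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random initial weights


def make_tiny_network():
    """Hypothetical helper: 2 inputs, one hidden layer of 2 units, 2 outputs."""
    return nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))


def update_target_network(target_network, online_network, tau):
    """Move the target network's weights a fraction tau towards the online network."""
    # Obtain the state dicts for both networks
    target_net_state_dict = target_network.state_dict()
    online_net_state_dict = online_network.state_dict()
    for key in online_net_state_dict:
        # Weighted average: tau for the online network, 1 - tau for the target
        target_net_state_dict[key] = (
            online_net_state_dict[key] * tau
            + target_net_state_dict[key] * (1 - tau)
        )
    # Load the updated state dict into the target network
    target_network.load_state_dict(target_net_state_dict)


online_network = make_tiny_network()
target_network = make_tiny_network()

# Keep a copy of the target weights so the change can be inspected afterwards.
before = {k: v.clone() for k, v in target_network.state_dict().items()}

update_target_network(target_network, online_network, 0.001)

for key, value in target_network.state_dict().items():
    print(key, "before:", before[key].tolist(), "after:", value.tolist())
```

With tau set to 0.001 the printed values change only in the later decimal places, which is the intended behaviour: the target network trails the online network slowly instead of copying it.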
