Fixed Q-targets

You are preparing to train Lunar Lander with fixed Q-targets. As a prerequisite, you need to instantiate both the online network (which chooses actions) and the target network (which is used to calculate the TD-target).
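As a minimal sketch of this setup, assuming PyTorch, the two networks could be instantiated as follows. The class name QNetwork, its layer layout, and the ReLU activation are hypothetical stand-ins for whatever the course defines. Copying the online weights into the target network at the start is a common choice, not something this exercise states.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Hypothetical tiny Q-network: state_size -> hidden_size -> action_size."""

    def __init__(self, state_size=2, hidden_size=2, action_size=2):
        super().__init__()
        self.fc1 = nn.Linear(state_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, action_size)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        return self.fc2(x)  # one Q-value per action


online_network = QNetwork()  # selects actions and is trained by gradient descent
target_network = QNetwork()  # only used to compute TD-targets

# Assumption: start both networks from identical weights.
target_network.load_state_dict(online_network.state_dict())
```

Both objects share the same architecture, which is what allows one network's state dict to be loaded into the other.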

You also need to implement an update_target_network function that can be called at every training step. The target network is not updated by gradient descent; instead, update_target_network nudges its weights toward those of the online network (the Q-network) by a small amount, so that it stays fairly stable over time.
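To make "a small amount" concrete: this kind of update is commonly written as target = tau * online + (1 - tau) * target. The sketch below applies that rule to a single scalar weight in plain Python. The starting values and the number of steps are illustrative assumptions, and the online weight is held fixed here, whereas in real training it changes at every step.

```python
def soft_update(target_value, online_value, tau):
    """Move target_value a fraction tau of the way toward online_value."""
    return tau * online_value + (1 - tau) * target_value


online_w = 1.0  # illustrative online-network weight, held fixed in this sketch
target_w = 0.0  # illustrative target-network weight
tau = 0.001

for step in range(1, 5001):
    target_w = soft_update(target_w, online_w, tau)
    if step in (1, 1000, 5000):
        # The remaining gap after n updates is (1 - tau) ** n of the initial gap.
        print(step, round(target_w, 4))
```

With tau = 0.001, a single update closes 0.1% of the gap, and roughly 1/tau updates are needed to close about 63% of it. This slow drift is what keeps the TD-targets stable.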

Note that, for this exercise only, you use a very small network so that its state dictionary can easily be printed and inspected. The network has a single hidden layer of size two; its action space and state space also have dimension 2.

The print_state_dict() function is available in your environment to print the state dict.
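The implementation of print_state_dict() is not shown in this exercise. To experiment outside the course environment, a hypothetical stand-in might look like the sketch below; its output format and return value are assumptions, not the course's actual helper.

```python
import torch.nn as nn


def print_state_dict(network):
    """Hypothetical stand-in: print every parameter tensor by name."""
    for name, tensor in network.state_dict().items():
        print(f"{name}: {tensor.tolist()}")


# Example with a small layer (2 inputs, 2 outputs), matching the exercise's sizes.
layer = nn.Linear(2, 2)
print_state_dict(layer)
```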

This exercise is part of the course

Deep Reinforcement Learning in Python

Exercise instructions

Obtain the .state_dict() for both the target network and the online network.
Update the target network's state dict by taking a weighted average of the online network's and the target network's parameters, using tau as the weight for the online network (and therefore 1 - tau for the target network).
Load the updated state dict back into the target network.

Hands-on interactive exercise

Try this exercise by completing the sample code below.

def update_target_network(target_network, online_network, tau):
    # Obtain the state dicts for both networks
    target_net_state_dict = ____
    online_net_state_dict = ____
    for key in online_net_state_dict:
        # Calculate the updated state dict for the target network
        target_net_state_dict[key] = (online_net_state_dict[____] * ____ + target_net_state_dict[____] * ____)
        # Load the updated state dict into the target network
        target_network.____
    return None
  
print("online network weights:", print_state_dict(online_network))
print("target network weights (pre-update):", print_state_dict(target_network))
update_target_network(target_network, online_network, .001)
print("target network weights (post-update):", print_state_dict(target_network))
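One way the blanks above might be filled in is sketched below, assuming both networks are PyTorch modules with identical architectures. The make_network helper is a hypothetical stand-in for the exercise's predefined networks. Unlike the scaffold, this sketch loads the state dict once after the loop, which produces the same final weights as loading it on every iteration.

```python
import torch
import torch.nn as nn


def update_target_network(target_network, online_network, tau):
    """Soft update: target <- tau * online + (1 - tau) * target."""
    target_net_state_dict = target_network.state_dict()
    online_net_state_dict = online_network.state_dict()
    for key in online_net_state_dict:
        # Weighted average: tau for the online network, 1 - tau for the target
        target_net_state_dict[key] = (
            online_net_state_dict[key] * tau
            + target_net_state_dict[key] * (1 - tau)
        )
    # Load the updated state dict into the target network
    target_network.load_state_dict(target_net_state_dict)


def make_network():
    # Hypothetical architecture: 2 state inputs, hidden layer of 2, 2 actions
    return nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))


online_network = make_network()
target_network = make_network()

# Clone first: state_dict() tensors share memory with the live parameters.
before = {k: v.clone() for k, v in target_network.state_dict().items()}
update_target_network(target_network, online_network, tau=0.001)
after = target_network.state_dict()
```

Comparing before and after for any key should show each target weight shifted by 0.1% of its distance to the corresponding online weight.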
