Sampling from the PER buffer

Before you can use the Prioritized Experience Replay buffer class (PrioritizedReplayBuffer) to train your agent, you still need to implement its .sample() method. This method takes the size of the sample you want to draw as its argument and returns the sampled transitions as tensors, together with their indices in the memory buffer and their importance weights.

A buffer with capacity 10 has already been loaded into your environment as buffer, so you can sample from it.
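The preloaded buffer itself is not shown in this exercise. As a rough sketch of what such a buffer could look like, assuming a deque-backed store and a default priority of 1.0 for the first transition: the class name ToyPrioritizedBuffer, its push method, and the alpha and beta defaults are hypothetical stand-ins, not the course's implementation. Only the attribute names memory and priorities are taken from the template code in this exercise.

```python
from collections import deque

import numpy as np


class ToyPrioritizedBuffer:
    """Hypothetical stand-in for the preloaded buffer (not the course's class)."""

    def __init__(self, capacity, alpha=0.6, beta=0.4):
        # deque(maxlen=...) drops the oldest entry once the capacity is reached
        self.memory = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha  # assumed: how strongly priorities skew sampling
        self.beta = beta    # assumed: strength of the importance correction

    def push(self, state, action, reward, next_state, done):
        # Assumption: a new transition gets the current max priority (1.0 if empty)
        priority = max(self.priorities) if self.priorities else 1.0
        self.memory.append((state, action, reward, next_state, done))
        self.priorities.append(priority)

    def __len__(self):
        return len(self.memory)


rng = np.random.default_rng(0)
toy_buffer = ToyPrioritizedBuffer(capacity=10)
for _ in range(12):  # push more than the capacity to show eviction
    state, next_state = rng.normal(size=4), rng.normal(size=4)
    toy_buffer.push(state, int(rng.integers(2)), 1.0, next_state, False)

print(len(toy_buffer))  # 10
```

Because memory and priorities share the same maxlen and are appended to together, position i in one always refers to the same transition as position i in the other, which is what index-based sampling relies on.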

This exercise is part of the course

Deep Reinforcement Learning in Python

Exercise instructions

Calculate the sampling probability associated with each transition.
Draw the indices corresponding to the transitions in the sample; np.random.choice(a, s, p=p) draws a sample of size s with replacement from the array a, using the probability array p.
Calculate the importance weight associated with each transition.
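The three steps above can be tried on a toy priority array before touching the buffer. This is a minimal sketch, not the exercise solution: it assumes the usual prioritized-replay form in which probabilities are proportional to priority raised to alpha and weights are raised to beta, and the values of priorities, alpha, and beta are made up for illustration.

```python
import numpy as np

# Toy values; alpha and beta are assumed hyperparameters, not taken from the exercise
priorities = np.array([1.0, 2.0, 1.0, 4.0])
alpha, beta = 1.0, 0.5
batch_size = 3

# Step 1: sampling probabilities, proportional to priority ** alpha
scaled = priorities ** alpha
probabilities = scaled / np.sum(scaled)  # 1/8, 2/8, 1/8, 4/8

# Step 2: draw indices with replacement according to those probabilities
indices = np.random.choice(len(priorities), batch_size, p=probabilities)

# Step 3: importance weights, normalized so the largest weight is 1
weights = (1 / (len(priorities) * probabilities)) ** beta
weights /= np.max(weights)  # 1, 1/sqrt(2), 1, 0.5

print("probabilities:", probabilities)
print("sampled indices:", indices)
print("weights of the sample:", weights[indices])
```

Note the inverse relationship: the transition with the highest priority is sampled most often but receives the smallest weight, so it does not dominate the update.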

Hands-on interactive exercise

Try this exercise by completing the sample code below.

import numpy as np
import torch

def sample(self, batch_size):
    priorities = np.array(self.priorities)
    # Calculate the sampling probabilities
    probabilities = ____ / np.sum(____)
    # Draw the indices for the sample
    indices = np.random.choice(____)
    # Calculate the importance weights
    weights = (1 / (len(self.memory) * ____)) ** ____
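    # Normalize so the largest weight is 1
    weights /= np.max(weights)
    # Unpack the sampled transitions into one tuple per field
    states, actions, rewards, next_states, dones = zip(*[self.memory[idx] for idx in indices])
    # Keep only the weights of the sampled transitions
    weights = [weights[idx] for idx in indices]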
    states_tensor = torch.tensor(states, dtype=torch.float32)
    rewards_tensor = torch.tensor(rewards, dtype=torch.float32)
    next_states_tensor = torch.tensor(next_states, dtype=torch.float32)
    dones_tensor = torch.tensor(dones, dtype=torch.float32)
    weights_tensor = torch.tensor(weights, dtype=torch.float32)
    actions_tensor = torch.tensor(actions, dtype=torch.long).unsqueeze(1)
    return (states_tensor, actions_tensor, rewards_tensor, next_states_tensor,
            dones_tensor, indices, weights_tensor)

PrioritizedReplayBuffer.sample = sample
print("Sampled transitions:\n", buffer.sample(3))


Discover how deep reinforcement learning improves upon traditional Reinforcement Learning while studying and implementing your first Deep Q Learning algorithm.

Exercise 1: Introduction to deep reinforcement learning
Exercise 2: Environment and neural network setup
Exercise 3: DRL training loop
Exercise 4: Introduction to deep Q learning
Exercise 5: Deep learning and DQN
Exercise 6: The Q-Network architecture
Exercise 7: Instantiating the Q-Network
Exercise 8: The barebone DQN algorithm
Exercise 9: Barebone DQN action selection
Exercise 10: Barebone DQN loss function
Exercise 11: Training the barebone DQN

Dive into Deep Q-learning by implementing the original DQN algorithm, featuring Experience Replay, epsilon-greediness and fixed Q-targets. Beyond DQN, you will then explore two fascinating extensions that improve the performance and stability of Deep Q-learning: Double DQN and Prioritized Experience Replay.

Exercise 1: DQN with experience replay
Exercise 2: The double-ended queue
Exercise 3: Experience replay buffer
Exercise 4: DQN with experience replay
Exercise 5: The complete DQN algorithm
Exercise 6: Epsilon-greediness
Exercise 7: Fixed Q-targets
Exercise 8: Implementing the complete DQN algorithm
Exercise 9: Double DQN
Exercise 10: Online network and target network in DDQN
Exercise 11: Training the Double DQN
Exercise 12: Prioritized experience replay
Exercise 13: Prioritized experience replay buffer
Exercise 14: Sampling from the PER buffer (current exercise)
Exercise 15: DQN with prioritized experience replay

Learn about the foundational concepts of policy gradient methods found in DRL. You will begin with the policy gradient theorem, which forms the basis for these methods. Then, you will implement the REINFORCE algorithm, a powerful approach to learning policies. The chapter will then guide you through Actor-Critic methods, focusing on the Advantage Actor-Critic (A2C) algorithm, which combines the strengths of both policy gradient and value-based methods to enhance learning efficiency and stability.

Exercise 1: Introduction to policy gradient
Exercise 2: The policy network architecture
Exercise 3: Working with discrete distributions
Exercise 4: Policy gradient and REINFORCE
Exercise 5: Action selection in REINFORCE
Exercise 6: Training the REINFORCE algorithm
Exercise 7: Advantage Actor Critic
Exercise 8: Critic network
Exercise 9: Actor Critic loss calculations
Exercise 10: Training the A2C algorithm

Explore Proximal Policy Optimization (PPO) for robust DRL performance. Next, you will examine using an entropy bonus in PPO, which encourages exploration by preventing premature convergence to deterministic policies. You'll also learn about batch updates in policy gradient methods. Finally, you will learn about hyperparameter optimization with Optuna, a powerful tool for optimizing performance in your DRL models.

Exercise 1: Proximal policy optimization
Exercise 2: The clipped probability ratio
Exercise 3: The clipped surrogate objective function
Exercise 4: Entropy bonus and PPO
Exercise 5: Entropy playground
Exercise 6: Training the PPO algorithm
Exercise 7: Batch updates in policy gradient
Exercise 8: Minibatch and DRL
Exercise 9: A2C with batch updates
Exercise 10: Hyperparameter optimization with Optuna
Exercise 11: Hyperparameter or not?
Exercise 12: Hands-on with Optuna
Exercise 13: Congratulations!