
The clipped probability ratio

You will now implement the clipped probability ratio, an essential component of the PPO objective function.

For reference, the probability ratio is defined as: $$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

And the clipped probability ratio is: \(\mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\).
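As a quick numeric illustration of the two formulas above, here is a minimal sketch in plain Python. The helper names `clip` and `clipped_ratio` are hypothetical and not part of the exercise; the probabilities 0.5 and 0.4 and the value of epsilon are the ones used in the exercise's sample code.

```python
def clip(value, low, high):
    """Clamp value into the closed interval [low, high]."""
    return max(low, min(value, high))


def clipped_ratio(prob_new, prob_old, epsilon):
    """Return (ratio, clipped ratio) for one action's probabilities."""
    ratio = prob_new / prob_old
    return ratio, clip(ratio, 1 - epsilon, 1 + epsilon)


# The action is more likely under the new policy (0.5) than the old one (0.4).
ratio, clipped = clipped_ratio(0.5, 0.4, epsilon=0.2)
print(f"ratio={ratio:.3f}, clipped={clipped:.3f}")
```

With these values the ratio is 0.5 / 0.4 = 1.25, which lies above the upper bound 1 + 0.2 = 1.2, so clipping caps it at 1.2. A ratio already inside the interval would pass through unchanged.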

This exercise is part of the course

Deep Reinforcement Learning in Python


Exercise instructions

  • Obtain the action probability prob from action_log_prob, and prob_old from action_log_prob_old.
  • Detach the old action probability from the torch gradient computation graph.
  • Calculate the probability ratio.
  • Clip the probability ratio to the interval from 1 - epsilon to 1 + epsilon.
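The steps above can be sketched as follows. This is one possible approach rather than the course's reference solution. It assumes the log-probabilities are scalar tensors, and it calls `requires_grad_()` on both inputs (an addition not present in the exercise) only so that the effect of detaching becomes visible after a backward pass.

```python
import torch

# Sample log-probabilities of one action under the new and old policies.
# requires_grad_() is an assumption added here to make gradients observable.
action_log_prob = torch.tensor(0.5).log().requires_grad_()
action_log_prob_old = torch.tensor(0.4).log().requires_grad_()
epsilon = 0.2

# Exponentiate the log-probabilities to recover the probabilities.
prob = action_log_prob.exp()
prob_old = action_log_prob_old.exp()

# Detach so that no gradient flows back into the old policy.
prob_old_detached = prob_old.detach()

# Probability ratio and its clipped version.
ratio = prob / prob_old_detached
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)

# Backpropagate through the unclipped ratio to see where gradients land.
ratio.backward()
print("grad w.r.t. new log prob:", action_log_prob.grad)
print("grad w.r.t. old log prob:", action_log_prob_old.grad)
```

Because the old probability is detached, only the new policy's log-probability receives a gradient; the old one is treated as a constant.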

Hands-on interactive exercise

Try this exercise by completing the sample code below.

import torch

log_prob = torch.tensor(.5).log()
log_prob_old = torch.tensor(.4).log()

def calculate_ratios(action_log_prob, action_log_prob_old, epsilon):
    # Obtain prob and prob_old
    prob = ____
    prob_old = ____
    # Detach the old action probability
    prob_old_detached = ____.____()
    # Calculate the probability ratio
    ratio = ____ / ____
    # Apply clipping
    clipped_ratio = torch.____(ratio, ____, ____)
    print(f"+{'-'*29}+\n|         Ratio: {str(ratio)} |\n| Clipped ratio: {str(clipped_ratio)} |\n+{'-'*29}+\n")
    return (ratio, clipped_ratio)

_ = calculate_ratios(log_prob, log_prob_old, epsilon=.2)