
The clipped probability ratio

You will now implement the clipped probability ratio, an essential component of the PPO objective function.

For reference, the probability ratio between the current policy and the old policy is defined as: $$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

And the clipped probability ratio is: \(\mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\).
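As a quick sanity check of the two formulas, here is a minimal plain-Python sketch. The helper name `clip` is hypothetical, and the probabilities 0.5 and 0.4 are simply the values used in the exercise code; an actual PPO implementation would work on tensors rather than floats.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(value, high))


epsilon = 0.2
prob, prob_old = 0.5, 0.4  # illustrative action probabilities

# Probability ratio r_t(theta) = pi_theta / pi_theta_old
ratio = prob / prob_old

# Clipped ratio: clip(r_t(theta), 1 - epsilon, 1 + epsilon)
clipped_ratio = clip(ratio, 1 - epsilon, 1 + epsilon)

print(f"ratio={ratio:.2f}, clipped_ratio={clipped_ratio:.2f}")
```

Here the ratio (about 1.25) exceeds the upper bound, so clipping pulls it down to about 1.2. A ratio below 0.8 would be raised to 0.8 in the same way.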

This exercise is part of the course

Deep Reinforcement Learning in Python

Exercise instructions

  • Obtain the action probability prob from action_log_prob, and prob_old from action_log_prob_old.
  • Detach the old action log prob from the torch gradient computation graph.
  • Calculate the probability ratio.
  • Clip the probability ratio to the range from \(1-\varepsilon\) to \(1+\varepsilon\).

Hands-on interactive exercise

Try this exercise by completing the sample code below.

import torch

log_prob = torch.tensor(.5).log()
log_prob_old = torch.tensor(.4).log()

def calculate_ratios(action_log_prob, action_log_prob_old, epsilon):
    # Obtain prob and prob_old
    prob = ____
    prob_old = ____
    # Detach the old action log prob
    prob_old_detached = ____.____()
    # Calculate the probability ratio
    ratio = ____ / ____
    # Apply clipping
    clipped_ratio = torch.____(ratio, ____, ____)
    print(f"+{'-'*29}+\n|         Ratio: {str(ratio)} |\n| Clipped ratio: {str(clipped_ratio)} |\n+{'-'*29}+\n")
    return (ratio, clipped_ratio)

_ = calculate_ratios(log_prob, log_prob_old, epsilon=.2)
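One possible completion of the skeleton above is sketched here. It is not necessarily the course's reference solution: it assumes `.exp()` to turn log-probabilities back into probabilities, `.detach()` to take the old probability out of the gradient graph, and `torch.clamp` for the clipping step (`torch.clip` is an alias). The formatted print is left out for brevity.

```python
import torch


def calculate_ratios(action_log_prob, action_log_prob_old, epsilon):
    # Recover probabilities from log-probabilities
    prob = action_log_prob.exp()
    prob_old = action_log_prob_old.exp()
    # Treat the old policy's probability as a constant during backpropagation
    prob_old_detached = prob_old.detach()
    # Probability ratio r_t(theta)
    ratio = prob / prob_old_detached
    # Clip the ratio to [1 - epsilon, 1 + epsilon]
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return ratio, clipped_ratio


log_prob = torch.tensor(0.5).log()
log_prob_old = torch.tensor(0.4).log()

ratio, clipped_ratio = calculate_ratios(log_prob, log_prob_old, epsilon=0.2)
print(ratio, clipped_ratio)
```

With these inputs the unclipped ratio is roughly 1.25, which lies above \(1+\varepsilon\), so the clipped value is roughly 1.2.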