1. Học hỏi
  2. /
  3. Khoa Học
  4. /
  5. Deep Reinforcement Learning in Python

Connected

Bài tập

The clipped probability ratio

You will now implement the clipped probability ratio, an essential component of the PPO objective function.

For reference, the probability ratio is defined as: $$\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

And the clipped probability ratio is: \(\mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\).

Hướng dẫn

100 XP
  • Obtain the action probability prob from action_log_prob, and prob_old from action_log_prob_old.
  • Detach the old action log prob from the torch gradient computation graph.
  • Calculate the probability ratio.
  • Clip the surrogate objective.