Learn

/

Courses

/

Deep Reinforcement Learning in Python

Connected

Exercise

The clipped probability ratio

You will now implement the clipped probability ratio, an essential component of the PPO objective function.

For reference, the probability ratio is defined as: $$\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$

And the clipped probability ratio is: \(\mathrm{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)\).

Instructions

100 XP

Obtain the action probability prob from action_log_prob, and prob_old from action_log_prob_old.
Detach the old action log prob from the torch gradient computation graph.
Calculate the probability ratio.
Clip the surrogate objective.