Expected SARSA update rule

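In this exercise, you will implement the update rule for Expected SARSA, a model-free RL algorithm based on temporal difference learning. Expected SARSA estimates the expected value of the current policy by averaging over all possible actions, which gives a more stable update target than SARSA. The formula used by Expected SARSA is shown below.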

[Image: the mathematical formula of the Expected SARSA update rule]
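For reference, the update is commonly written as Q(s, a) <- Q(s, a) + alpha * [r + gamma * sum over a' of pi(a' | s') * Q(s', a') - Q(s, a)]. The sketch below shows one way this could be implemented with NumPy. The function name `expected_sarsa_update`, its parameters, the tabular `Q` array layout (states by actions), and the default uniform-random policy are all assumptions made for illustration and are not taken from the exercise itself.

```python
import numpy as np


def expected_sarsa_update(Q, state, action, reward, next_state,
                          alpha=0.1, gamma=0.99, policy_probs=None):
    """Apply one Expected SARSA update to a tabular Q array in place.

    Q is assumed to have shape (n_states, n_actions).
    policy_probs holds pi(a' | next_state); if omitted, a uniform
    random policy is assumed (an illustrative default).
    """
    n_actions = Q.shape[1]
    if policy_probs is None:
        policy_probs = np.full(n_actions, 1.0 / n_actions)

    # Expected value of the next state under the policy:
    # sum over a' of pi(a' | s') * Q(s', a')
    expected_q = np.dot(policy_probs, Q[next_state])

    # TD target and update toward it with step size alpha
    td_target = reward + gamma * expected_q
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q


# Small usage example with a hypothetical 3-state, 2-action table
Q = np.zeros((3, 2))
Q[1] = [1.0, 3.0]
expected_sarsa_update(Q, state=0, action=1, reward=1.0,
                      next_state=1, alpha=0.5, gamma=0.9)
print(Q[0, 1])
```

With a uniform policy the expected value reduces to the plain mean of `Q[next_state]`. Passing `policy_probs` explicitly (for example, epsilon-greedy probabilities) weights each next action by how likely the policy is to choose it.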