Re: Model-free RL, policy gradient

by Sophia Becker
Hi,

Remember that V_theta(s_0) depends only on s_0: to compute this V-value, we have already taken the expectation over the state-action sequence sampled under theta, conditioned on the initial state s_0. Therefore, when we place V_theta(s_0) inside the expectation over the state-action sequence sampled under theta' (going from Eq. (4) to Eq. (5)), the expectation leaves it unchanged, because V_theta(s_0) is a constant with respect to that sequence.
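
In symbols, writing tau for the state-action sequence sampled under theta' and p_{theta'}(tau | s_0) for its distribution (this notation is only a sketch and may differ slightly from the one used in Eqs. (4) and (5)):

$$ \mathbb{E}_{\tau \sim p_{\theta'}(\tau \mid s_0)}\left[ V_\theta(s_0) \right] = V_\theta(s_0), $$

since V_theta(s_0) does not depend on tau.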

Hope it's clearer now!

Best,
Your TAs