Ex 7.2 a)

by Ali Bakly

For exercise 7.2 a) we are asked to derive a batch update rule for the policy gradient algorithm. I find the proposed solution very long and hard to follow. In class we just did something like this:
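Roughly the following (reconstructing from my notes, so the notation may differ from the lecture): using the Bellman equation for the value function together with the log-likelihood trick,

$$ \nabla_\theta v^{\pi}(s) \;=\; \sum_a \pi_\theta(a \mid s)\Big[\, \nabla_\theta \ln \pi_\theta(a \mid s)\, Q^{\pi}(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \nabla_\theta v^{\pi}(s') \,\Big], $$

and then expanding the recursion in $\nabla_\theta v^{\pi}(s')$ step by step until the end of the episode.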

I have a tough time understanding why the solution doesn't use the Bellman equation and work with that, as in the lecture, instead of working with the integrals.
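If I understand the exercise solution correctly, it instead starts from something like

$$ \nabla_\theta J(\theta) \;=\; \nabla_\theta \!\int\! d\tau\, P_\theta(\tau)\, R(\tau) \;=\; \int\! d\tau\, P_\theta(\tau)\, \nabla_\theta \ln P_\theta(\tau)\, R(\tau), \qquad \nabla_\theta \ln P_\theta(\tau) \;=\; \sum_t \nabla_\theta \ln \pi_\theta(a_t \mid s_t), $$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a whole state-action sequence and the transition probabilities drop out of the logarithm because they do not depend on $\theta$ (again, my own notation, not necessarily the one in the solution).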

In reply to Ali Bakly

Re: Ex 7.2 a)

by Sophia Becker
Hi,

Both the derivation from class and the one in the exercise are valid ways to obtain the batch rule for the policy gradient algorithm. The intention behind the exercise was to make you think about the derivation from a different angle: without explicitly using the Bellman equation, we can derive the batch rule directly from the multi-step return by writing out the integral over all possible state-action sequences. However, note that we use the same idea that underlies the Bellman equation when we write the following (i.e. that the return associated with a given state and action choice depends only on the future rewards).
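In my notation (which may differ slightly from the exercise sheet), the step is roughly this: after applying the log-likelihood trick under the path integral, each term $\nabla_\theta \ln \pi_\theta(a_t \mid s_t)$ multiplies the full return of the sequence, but the rewards collected before time $t$ average to zero (the expectation of $\nabla_\theta \ln \pi_\theta(a_t \mid s_t)$ over $a_t$, conditioned on everything up to $s_t$, vanishes), so only the future rewards survive (dropping discount factors for brevity):

$$ \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \ln \pi_\theta(a_t \mid s_t)\, r_{t'}\big] \;=\; 0 \;\; \text{for } t' < t, \qquad\Longrightarrow\qquad \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \nabla_\theta \ln \pi_\theta(a_t \mid s_t) \sum_{t' \ge t} r_{t'}\Big]. $$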

In the approach on the blackboard, when we 'iteratively expand' the value estimate (it also depends on the policy, so we need to apply the log-likelihood trick there as well), we effectively sum over the entire path of the episode, similar to the path integral that we write in the exercise.
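For concreteness, here is a minimal sketch, not taken from the exercise, of the batch rule that both derivations arrive at. It assumes a linear-softmax policy over discrete actions, episodes stored as lists of (features, action, reward) tuples, and a return-to-go weighting, so all names and the setup are just illustrative:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def grad_log_pi(theta, x, a):
    # Gradient of log pi(a|x) for a linear-softmax policy.
    # theta: (n_actions, n_features), x: (n_features,), a: int
    p = softmax(theta @ x)
    g = -np.outer(p, x)      # -pi(b|x) * x for every action b
    g[a] += x                # + x for the action that was taken
    return g

def batch_policy_gradient_update(theta, episodes, eta=0.1, gamma=1.0):
    # One batch update: sum the log-likelihood gradients over all time steps
    # of all episodes, each weighted by the return-to-go R_t.
    dtheta = np.zeros_like(theta)
    for episode in episodes:                    # episode = [(x_t, a_t, r_t), ...]
        R, returns = 0.0, []
        for (_, _, r) in reversed(episode):     # accumulate R_t backwards in time
            R = r + gamma * R
            returns.append(R)
        returns.reverse()
        for (x, a, _), R_t in zip(episode, returns):
            dtheta += R_t * grad_log_pi(theta, x, a)
    return theta + eta * dtheta / len(episodes)

# Tiny usage example with random data: 2 episodes, 3 features, 2 actions.
rng = np.random.default_rng(0)
theta = np.zeros((2, 3))
episodes = [[(rng.normal(size=3), int(rng.integers(2)), float(rng.normal()))
             for _ in range(5)] for _ in range(2)]
theta = batch_policy_gradient_update(theta, episodes)
print(theta)

Whether you sum or average the per-episode gradients only rescales the learning rate.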

Hope it's clearer now.