MP2-A2C n-step bootstrapping

by Laura Gambaretto

My teammate and I are quite confused about the n-step bootstrapping algorithm we are required to implement (we read a previous post on this forum about the topic, but it did not help us). What we understood is:

1. From any step t (far enough from termination), compute the n-step return (from t), the (n-1)-step return (from t), ..., the 1-step return (still from t).

2. Take the average of those n targets (call it R_av) and compute the advantage at time t as A_t = R_av - V(s_t) (see the sketch below).
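
To make our current understanding concrete, here is a minimal sketch of steps 1 and 2 (purely illustrative names, assuming no termination within the n steps):

```python
import torch

def averaged_advantage(rewards, values, gamma):
    """rewards: (n,) tensor r_t, ..., r_{t+n-1};
    values: (n+1,) tensor V(s_t), ..., V(s_{t+n})."""
    n = rewards.shape[0]
    discounted_sum = 0.0
    k_step_returns = []
    for k in range(n):
        discounted_sum = discounted_sum + gamma**k * rewards[k]  # first k+1 rewards
        # (k+1)-step return from t: rewards so far plus bootstrap at s_{t+k+1}
        k_step_returns.append(discounted_sum + gamma**(k + 1) * values[k + 1])
    r_av = torch.stack(k_step_returns).mean()  # average of the n targets
    return r_av - values[0]                    # A_t = R_av - V(s_t)
```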

Is that correct? Moreover, we know that we need to account for the StopGradient operation when computing our targets, so it applies only in step 1, not in step 2 (because we need the gradient of V(s_t), as opposed to the gradients of the later steps used in step 1). Are we correct?

Your advice on whether to use .detach() or torch.no_grad() is also welcome! .detach() seems more versatile, but we are not sure it is as reliable as torch.no_grad().

Many thanks for your help!

In reply to Laura Gambaretto

Re: MP2-A2C n-step bootstrapping

by Skander Moalla
Hello,

What you are describing is similar to generalized advantage estimation (GAE), where you combine multiple n-step returns (for several values of n) to estimate the return of a single state. This is not what is asked in the project.

The project asks you to compute a single return per step: if at time t you sampled n steps, you use all n steps to estimate the return of state s_t, then the n-1 steps after t+1 to estimate the return of state s_t+1, ..., until state s_t+n, whose return is estimated only from the reward at t+n (all of them bootstrap to finish the return estimate, of course).
There is no averaging to be done to estimate the return of a given state.
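
Here is a minimal sketch of that recursion, assuming one indexing convention (rewards r_t, ..., r_{t+n-1}, a single bootstrap value V(s_{t+n}), and no termination inside the rollout); the name and signature are illustrative, not the project's template:

```python
import torch

def nstep_returns(rewards, last_value, gamma):
    """Per-step bootstrapped returns for one rollout of n transitions.

    rewards:    (n,) tensor with r_t, ..., r_{t+n-1}
    last_value: scalar tensor V(s_{t+n}), the single bootstrap value
    gamma:      discount factor
    Returns an (n,) tensor whose k-th entry estimates the return of s_{t+k}.
    """
    n = rewards.shape[0]
    returns = []
    g = last_value
    # Backward recursion: the last sampled state uses one reward plus the
    # bootstrap, and each earlier state stacks one more reward on top.
    for k in reversed(range(n)):
        g = rewards[k] + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.stack(returns)
```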

Yes, you should account for StopGradient, as the method is a semi-gradient method.
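
Concretely, reusing the nstep_returns sketch above (critic and states, holding the n+1 visited states, are placeholder names):

```python
with torch.no_grad():  # targets are constants for autodiff (semi-gradient)
    bootstrap = critic(states[-1]).squeeze(-1)      # V(s_{t+n}), no gradient
    targets = nstep_returns(rewards, bootstrap, gamma)

values = critic(states[:-1]).squeeze(-1)            # V(s_t), ..., with gradient
critic_loss = (targets - values).pow(2).mean()      # gradient flows only through values
advantages = (targets - values).detach()            # plain constants for the actor loss
```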

Prefer torch.no_grad, as it does not build the computation graph for autodiff and is thus faster than .detach, which removes the tensor from the graph only after the graph has been built.
Both are "reliable".
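
For intuition, here is the difference on a single bootstrap value (again with hypothetical names):

```python
# torch.no_grad(): autograd records nothing inside the block,
# so no computation graph is ever built.
with torch.no_grad():
    bootstrap = critic(last_state)

# .detach(): the graph for critic(last_state) is built first,
# then the output tensor is cut out of it.
bootstrap = critic(last_state).detach()
```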