(Project) n-step A2C

by Ali Bakly -

For the n-step A2C part of the project it says:

I am not sure I am understanding this correctly. When we did n-step SARSA we had to "look ahead" n steps: to update Q(s_0, a_0) we had to step all the way to s_n, and to update Q(s_1, a_1) we had to step all the way to s_{n+1}, on the same trajectory.

Isn't that exactly what we want to do here, except that we compute advantages instead? I think it would be helpful to see the n-step A2C algorithm written out (currently, lecture 8 only provides the algorithm for n=1).
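
For reference, the n-step SARSA look-ahead described above can be written out as follows. This is the standard textbook form rewritten in the indexing used in this post (rewards r_1, ..., r_n collected on the way from s_0 to s_n); the discount \gamma and step size \alpha are symbols assumed here, not taken from the project description.

```latex
G_{0:n} = r_1 + \gamma r_2 + \dots + \gamma^{n-1} r_n + \gamma^{n} Q(s_n, a_n),
\qquad
Q(s_0, a_0) \leftarrow Q(s_0, a_0) + \alpha \left[ G_{0:n} - Q(s_0, a_0) \right]
```

The update for Q(s_1, a_1) uses the same expression shifted by one step, which is why it needs the trajectory up to s_{n+1}.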

In reply to Ali Bakly

Re: (Project) n-step A2C

by Skander Moalla -
Hello,

No, the algorithm described in the miniproject is different from the algorithm you attached.
In the project you are asked to implement a loop that alternates between data collection and learning (data collection -> learning, data collection -> learning, ...), where each learning step uses all the data just collected. "n-step" in this case means that each worker collects n steps per iteration.
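
The collect-then-learn loop described above could be sketched roughly as below. This is only an illustration, not the project's reference implementation: `step_fn`, `policy_fn`, and `update_fn` are hypothetical callables standing in for the K vectorized environments, the current policy, and the A2C update, and the environments are assumed to reset themselves when an episode ends.

```python
import numpy as np


def collect_rollout(step_fn, policy_fn, states, n_steps):
    """Collect n_steps transitions from each of the K workers with a fixed policy."""
    obs, acts, rews, dones = [], [], [], []
    for _ in range(n_steps):
        actions = policy_fn(states)  # one action per worker, shape (K,)
        next_states, rewards, done = step_fn(states, actions)
        obs.append(states)
        acts.append(actions)
        rews.append(rewards)
        dones.append(done)
        states = next_states
    batch = {
        "obs": np.stack(obs),        # (n_steps, K, ...)
        "actions": np.stack(acts),   # (n_steps, K)
        "rewards": np.stack(rews),   # (n_steps, K)
        "dones": np.stack(dones),    # (n_steps, K)
        "last_obs": states,          # states to bootstrap from, shape (K, ...)
    }
    return batch, states


def train(step_fn, policy_fn, update_fn, init_states, n_steps, num_updates):
    """Alternate data collection and learning; return the environment steps used."""
    states = init_states
    total_env_steps = 0
    for _ in range(num_updates):
        # The policy is not updated while this batch is being collected.
        batch, states = collect_rollout(step_fn, policy_fn, states, n_steps)
        update_fn(batch)  # one learning step on all K * n_steps transitions
        total_env_steps += n_steps * len(states)
    return total_env_steps
```

Because the policy stays fixed during `collect_rollout`, every transition in a batch comes from the same policy that is then updated.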

Side note: this would be a more proper way to do on-policy learning. In fact, if you look at the algorithm you attached, the policy is updated multiple times before a state gets its value updated, so the returns it is updated from come from multiple policies and are not exactly on-policy.
In reply to Skander Moalla

Re: (Project) n-step A2C

by Ali Bakly -
Hi,
Thank you for the answer. After your clarification, I wrote out the algorithm as I currently understand it (heavily inspired by lecture 8).

Is my understanding correct? Since each worker learns from all n steps, would we increment the time t by n, as I highlighted in red? Thanks!
In reply to Ali Bakly

Re: (Project) n-step A2C

by Skander Moalla -
Yes, that's the idea. However, be careful: environments may have trajectories/episodes that end/reset at different timesteps, and the advantage computation should be robust to that.
Also, you want to keep a global counter of the total environment steps used to train the agent, incremented according to the number of workers and the steps per worker.
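
One possible way to make the advantage computation robust to episodes ending mid-rollout is to mask the bootstrap with a per-step done flag, as in the sketch below. The function name, the array shapes, and the default `gamma` are assumptions made for illustration. It also treats every done flag as a true terminal state; time-limit truncations would ideally still bootstrap from the value of the next state, which this sketch does not handle.

```python
import numpy as np


def n_step_advantages(rewards, values, dones, last_values, gamma=0.99):
    """Compute returns and advantages for a (n_steps, K) rollout.

    rewards, values, dones: arrays of shape (n_steps, K), where dones[t, k]
    is True if the transition at step t ended the episode of worker k.
    last_values: shape (K,), critic estimate V(s_n) for the state after the rollout.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    not_done = 1.0 - np.asarray(dones).astype(np.float64)
    returns = np.zeros_like(rewards)
    running = np.asarray(last_values, dtype=np.float64)
    for t in reversed(range(rewards.shape[0])):
        # Cut the bootstrap at episode boundaries so that rewards and values
        # from the next episode do not leak into this one.
        running = rewards[t] + gamma * running * not_done[t]
        returns[t] = running
    advantages = returns - np.asarray(values, dtype=np.float64)
    return advantages, returns


def env_steps_per_update(num_workers, n_steps):
    """Amount to add to the global environment-step counter after each update."""
    return num_workers * n_steps
```

With this masking, a worker whose episode ends at step t gets a return of just r_t at that step, while the steps after the reset bootstrap from `last_values` as usual.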