MP2 - Question 3.1

by Thomas Brunet
Number of replies: 4
Hello,

For the first agent, the success criteria say "after training has stabilized around an optimal policy (value loss less than 1e-4)". We were thus wondering whether the value loss is the loss of the critic network, and if so, should our critic loss really be below 1e-4? Because that is not the case for us at all, and we can't figure out what would be wrong in our implementation. A teaching assistant told us yesterday that it was fine, but was not so sure.

Thanks.
In reply to Thomas Brunet

Re: MP2 - Question 3.1

by Skander Moalla
Hello,

Yes, the value loss refers to the critic network's loss, and within the scope of Question 3.1 (i.e. before you introduce reward stochasticity) this loss should converge once the agent has reached optimal performance.
Around convergence the loss should be less than 1e-4, as described in the success criteria (much less, even: in our runs it is around 1e-10).
For this, you have to nail the details of the algorithm, in particular bootstrapping.

After Question 3.1 this will no longer be the case, since you introduce reward stochasticity.
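For concreteness, this is the standard 1-step bootstrapped target (stated here as an illustration, not quoted from the handout):

    target_t   = r_t + gamma * V_phi(s_{t+1})   if s_{t+1} is not terminal
    target_t   = r_t                            if s_{t+1} is terminal
    value loss = (target_t - V_phi(s_t))^2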
In reply to Skander Moalla

Re: MP2 - Question 3.1

by Thomas Brunet
Hm ok... The thing is that our loss is stuck around 1, while the rewards during the evaluations every 20k steps do increase.
When we talk about the loss, is it indeed the formula used to update phi in the A2C algorithm? So basically the advantage squared?
Thanks.

In reply to Thomas Brunet

Re: MP2 - Question 3.1

by Skander Moalla

Yes, the loss is the squared (L2) distance between the estimated return and the value function. A common error is not handling autograd correctly when bootstrapping (line 6 of the algorithm). Think about which V's autograd should move and which it shouldn't.
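
A minimal PyTorch sketch of that detail, just to illustrate the idea (the network, batch, and variable names below are illustrative placeholders, not taken from the handout):

    import torch
    import torch.nn as nn

    value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # toy critic V_phi
    gamma = 0.99
    # One dummy batch of transitions (shapes are illustrative only):
    states      = torch.randn(8, 4)
    next_states = torch.randn(8, 4)
    rewards     = torch.randn(8)
    dones       = torch.zeros(8)

    values = value_net(states).squeeze(-1)                    # V_phi(s_t): gradients must flow through this
    with torch.no_grad():                                     # the bootstrap term is a fixed regression target,
        next_values = value_net(next_states).squeeze(-1)      # so no gradient through V_phi(s_{t+1})
    targets = rewards + gamma * (1.0 - dones) * next_values   # r_t + gamma * V(s_{t+1}), zeroed at terminal states
    value_loss = ((targets - values) ** 2).mean()             # squared (L2) critic loss
    value_loss.backward()

An equivalent choice is to call .detach() on next_values instead of using torch.no_grad(); the point is only that the critic's parameters should not receive gradients through the bootstrap target.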