MP2, Standard Deviation in Task 4.1

by Ali Bakly -
Number of replies: 3

We are told "The output of the actor’s network gives the mean of the distribution given the state, and the standard deviation is learned separately, in a state-independent manner." So the actor should only output the mean? Why would we not output the (log) standard deviation in the actor as well? Am I misunderstanding something? Thanks!

In reply to Ali Bakly

Re: MP2, Standard Deviation in Task 4.1

by Skander Moalla -

Hello,

Yes, the actor only outputs the mean, meaning that only the mean depends on the state.

It's also valid to have the actor output the (log) standard deviation as well, which would make the std state-dependent too. Here we have chosen not to; it's an implicit bias meant to make training more stable.

You can refer to Appendix B.8 and decision C59 in this paper https://arxiv.org/pdf/2006.05990 for an empirical discussion.

In reply to Skander Moalla

Re: MP2, Standard Deviation in Task 4.1

by Ali Bakly -
OK, but you can still output the standard deviation in the actor while keeping it state-independent, like this:


As you can see, in the forward method log_std does not depend on x (the state). Is this acceptable, since log_std will indeed be state-independent?

In reply to Ali Bakly

Re: MP2, Standard Deviation in Task 4.1

by Skander Moalla -
Yes, the implementation doesn't matter, as long as the mean is state-dependent, the std is state-independent, and both are learned.
Typically you would put both of them in the policy class, yes.
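
For illustration, the arrangement discussed in this thread could look roughly like the PyTorch sketch below. It is not the snippet from the earlier post, only one possible way to do it: the class name GaussianActor, the layer sizes, the Tanh activation and the zero initialisation of log_std are all assumptions. The mean comes from a network applied to the state, while log_std is a plain learnable parameter held in the same policy class.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianActor(nn.Module):
    """Hypothetical actor: state-dependent mean, state-independent learned std."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        # The mean is a function of the state (sizes and activation are assumptions).
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )
        # One learnable log-std per action dimension; it never sees the state.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x: torch.Tensor) -> Normal:
        mean = self.mean_net(x)                   # depends on x
        std = self.log_std.exp().expand_as(mean)  # identical for every state
        return Normal(mean, std)


actor = GaussianActor(state_dim=4, action_dim=2)
states = torch.randn(8, 4)                      # batch of 8 dummy states
dist = actor(states)
actions = dist.sample()
log_probs = dist.log_prob(actions).sum(dim=-1)  # sum over independent action dims
```

Because log_std is registered as an nn.Parameter, it appears in actor.parameters() and is updated by the optimizer together with the mean network, so both quantities are learned while only the mean is state-dependent.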