CS-456: MP2, Standard Deviation in Task 4.1

We are told "The output of the actor’s network gives the mean of the distribution given the state, and the standard deviation is learned separately, in a state-independent manner." So the actor should only output the mean? Why would we not output the (log) standard deviation in the actor as well? Am I misunderstanding something? Thanks!

Re: MP2, Standard Deviation in Task 4.1

by Skander Moalla - Thursday, 2 May 2024, 10:49

Hello,

Yes, the actor only outputs the mean, meaning that only the mean depends on the state.

It's also valid to output the (log) standard deviation in the actor as well, meaning that the std becomes state-dependent too. Here we have chosen not to; it's an implicit bias to make things more stable.

You can refer to Appendix B.8 and decision C59 in this paper https://arxiv.org/pdf/2006.05990 for an empirical discussion.
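
In symbols, the two options can be sketched as follows. The notation ($\mu_\theta$, $\sigma_\theta$, $\rho$) is an assumption made for illustration and is not taken from the assignment text:

```latex
% State-independent std (the choice made in this task):
% \rho is a learned vector of log standard deviations, one entry per action dimension.
\pi_{\theta,\rho}(a \mid s) = \mathcal{N}\!\left(a;\ \mu_\theta(s),\ \operatorname{diag}\!\left(e^{\rho}\right)^{2}\right)

% State-dependent std (also valid, not used here):
% the network outputs both the mean and the log standard deviation.
\pi_{\theta}(a \mid s) = \mathcal{N}\!\left(a;\ \mu_\theta(s),\ \operatorname{diag}\!\left(\sigma_\theta(s)\right)^{2}\right)
```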

Re: MP2, Standard Deviation in Task 4.1

by Ali Bakly - Thursday, 2 May 2024, 12:06

Ok, but you can still output the standard deviation in the actor while keeping it state-independent, like this.

As you can see, in the forward method log_std does not depend on x (the state). Is this acceptable, given that log_std will indeed be state-independent?

Re: MP2, Standard Deviation in Task 4.1

by Skander Moalla - Thursday, 2 May 2024, 12:34

Yes, the implementation doesn't matter, as long as the mean is state-dependent, the std is state-independent, and both are learned.
Typically you would put both of them in the policy class, yes.
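
For readers following the thread, here is a minimal sketch of a policy class with a state-dependent mean and a learned, state-independent log-std. It assumes PyTorch; the class name `GaussianActor`, the layer sizes, and the zero initialization of `log_std` are hypothetical choices for illustration, not the assignment's reference solution.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


class GaussianActor(nn.Module):
    """Hypothetical actor: state-dependent mean, state-independent learned log-std."""

    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 64):
        super().__init__()
        # The network maps the state to the mean of the action distribution.
        self.mean_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, action_dim),
        )
        # One learnable log-std per action dimension. It is a plain parameter,
        # so it receives gradients but is never computed from the state.
        # Initializing to zero (std = 1) is an assumption of this sketch.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, x: torch.Tensor) -> Normal:
        mean = self.mean_net(x)                   # depends on x (the state)
        std = self.log_std.exp().expand_as(mean)  # does not depend on x
        return Normal(mean, std)


actor = GaussianActor(state_dim=3, action_dim=2)
states = torch.randn(5, 3)
dist = actor(states)
actions = dist.sample()
# Sum per-dimension log-probabilities to get the log-probability of each action vector.
log_prob = dist.log_prob(actions).sum(dim=-1)
log_prob.mean().backward()
```

Because `log_std` is registered as an `nn.Parameter`, it shows up in `actor.parameters()` and is updated by the same optimizer as the mean network, which is what "both are learned" amounts to in this sketch.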