Policy gradient loss in PyTorch

Problem description

Version 1

y = episode_a.argmax(-1)                      # episode_a has shape [T, n_actions]; y holds action indices
action_preds = self.net(ep_s)                 # action_preds are logits (before softmax)
neg_log_like = self.loss_fn(action_preds, y)  # negative log-likelihood of the taken actions
loss = torch.mean(r * neg_log_like)           # r is the reward

Version 2

y = torch.tensor(episode_a, requires_grad=True)
action_preds = model(ep_s)
neg_log_like = -y * torch.log(action_preds)
loss = torch.sum(neg_log_like, 1).mean()

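Versions 1 and 2 seem to produce the same loss value. The difference is that y requires grad in version 2. But this is like a supervised-learning backpropagation step, so y should not need requires_grad. I don't understand why version 1 fails to train the policy while version 2 succeeds.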
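One way to check whether the two formulations really agree is to compute both on the same data and compare losses and gradients. The following is a minimal sketch, not the asker's code: the tensors `logits`, `actions`, `episode_a` and `r` are random stand-ins, `self.loss_fn` is assumed to be a cross-entropy with `reduction="none"`, the version 2 probabilities are assumed to come from a softmax over the same logits, and the reward weighting is applied to both versions so that they are comparable (the version 2 snippet in the question does not use `r`).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, n_actions = 5, 3

# Hypothetical stand-ins for the tensors in the question.
logits = torch.randn(T, n_actions, requires_grad=True)  # plays the role of the network output
actions = torch.randint(n_actions, (T,))                 # action indices, like episode_a.argmax(-1)
episode_a = F.one_hot(actions, n_actions).float()        # one-hot actions, shape [T, n_actions]
r = torch.rand(T)                                        # per-step reward (assumed shape [T])

# Version 1 style: per-sample cross-entropy on logits, weighted by the reward.
# Assumption: loss_fn behaves like cross-entropy with reduction="none".
nll_v1 = F.cross_entropy(logits, actions, reduction="none")  # shape [T]
loss_v1 = torch.mean(r * nll_v1)

# Version 2 style: explicit -y * log(p) with one-hot y.
# Assumption: p is the softmax of the same logits; r is added here for comparison.
log_probs = F.log_softmax(logits, dim=-1)
nll_v2 = torch.sum(-episode_a * log_probs, dim=1)            # shape [T]
loss_v2 = torch.mean(r * nll_v2)

# Gradients with respect to the logits for each formulation.
(grad_v1,) = torch.autograd.grad(loss_v1, logits)
(grad_v2,) = torch.autograd.grad(loss_v2, logits)

same_loss = torch.allclose(loss_v1, loss_v2)
same_grad = torch.allclose(grad_v1, grad_v2)
print(same_loss, same_grad)
```

Under these assumptions the two forms yield the same loss and the same gradient with respect to the logits, and `y` needs no gradient in either. If the two versions train differently in practice, it may be worth checking the points where the real code departs from the sketch: the reduction setting of `loss_fn` (a scalar result would change what `r * neg_log_like` means), whether the reward enters the loss at all, and whether the network outputs logits or probabilities.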


deep-learning policy-gradient-descent pytorch reinforcement-learning