1 Algorithm flow in the original paper

[Figure: the algorithm as given in the original paper]

2 Core techniques of the algorithm

2.1 Accumulated error

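With $\delta(s_t,a_t)$ denoting the TD error, the learned value function satisfies:
$Q_\theta(s_t,a_t)=r_t + \gamma \cdot E[Q_\theta(s_{t+1},a_{t+1})]-\delta(s_t,a_t)$
Unrolling this recursion over the remaining steps gives:
$Q_\theta(s_t,a_t)=E_{s_i \sim p_\pi ,a_i \sim \pi}[\sum_{i=t}^T \gamma^{i-t} \cdot(r_i - \delta_i)]$
So the action-value estimate learns the expected discounted return minus the expected discounted sum of future TD errors, which means the per-step errors accumulate in the estimate.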
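
The unrolled identity can be checked numerically on a single trajectory. The sketch below is only an illustration: the rewards and value estimates are made-up numbers, the helper names `td_errors` and `discounted_sum` are hypothetical, and the value after the last step is assumed to be zero.

```python
def td_errors(rewards, q_values, gamma):
    """delta_t = r_t + gamma * Q_{t+1} - Q_t, assuming Q = 0 after the last step."""
    deltas = []
    for t, r in enumerate(rewards):
        q_next = q_values[t + 1] if t + 1 < len(q_values) else 0.0
        deltas.append(r + gamma * q_next - q_values[t])
    return deltas


def discounted_sum(rewards, deltas, gamma, t):
    """sum_{i=t}^{T} gamma^(i-t) * (r_i - delta_i)."""
    return sum(gamma ** (i - t) * (rewards[i] - deltas[i])
               for i in range(t, len(rewards)))


rewards = [1.0, 0.5, -0.2, 2.0]   # hypothetical rewards r_t
q_values = [3.0, 1.7, 2.4, 0.9]   # hypothetical, imperfect estimates Q(s_t, a_t)
gamma = 0.9

deltas = td_errors(rewards, q_values, gamma)
# The discounted sum of (r_i - delta_i) reproduces each estimate Q(s_t, a_t).
reconstructed = [discounted_sum(rewards, deltas, gamma, t)
                 for t in range(len(rewards))]
```

Because $r_i - \delta_i = Q_i - \gamma Q_{i+1}$, the sum telescopes, so `reconstructed` matches `q_values` up to floating-point error regardless of how inaccurate the estimates are.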

2.2 Clipped Double Q-learning

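$y_1 = r+\gamma\cdot \min_{i=1,2}Q_{\theta_i'}(s',\pi_{\phi_1}(s'))$

Taking the minimum over the two target value networks avoids overestimation. It may introduce underestimation, but underestimation is preferable to overestimation, because an underestimated action is simply less likely to be selected and its error is not propagated through the policy update.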
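
A minimal sketch of the clipped target for a single transition follows. The function name and the `done` flag (dropping the bootstrap term at terminal states) are assumptions added for illustration; they do not appear in the notes above.

```python
def clipped_double_q_target(reward, gamma, q1_next, q2_next, done=False):
    """y = r + gamma * min(Q1'(s', a'), Q2'(s', a')).

    `done` is an assumed terminal flag: no bootstrapping past the episode end.
    """
    if done:
        return reward
    return reward + gamma * min(q1_next, q2_next)


# The two target critics disagree; the smaller estimate (8.0) is used.
y = clipped_double_q_target(reward=1.0, gamma=0.99, q1_next=10.0, q2_next=8.0)
```

Both value networks are regressed toward the same target `y`, so an overestimate in one network is capped by the other.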

2.3 Target Network

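Two action-value networks and one policy network are used, each with its own target network, giving three target networks. Each target network starts as a copy of its online network:
$Q_{\theta_1'}\gets Q_{\theta_1}$
$Q_{\theta_2'}\gets Q_{\theta_2}$
$\pi_{\phi'}\gets \pi_{\phi}$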

Two action-value networks are used to further reduce overestimation.
A single policy network is used to simplify computation.

2.4 Delayed Policy Updates

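Updating the policy network while the value error is high tends to produce divergent actions.
The policy network should therefore be updated less frequently than the value networks,
so that the value error is as small as possible before each policy update.
The target network parameters are likewise updated only once the TD error has become small.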
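
In practice the delay is usually a fixed interval rather than an explicit error threshold. The sketch below only counts updates to show the resulting schedule; `update_schedule` and `policy_delay` (the interval $d$) are hypothetical names, and the actual network updates are omitted.

```python
def update_schedule(total_steps, policy_delay):
    """Count critic and actor updates when the actor (and the target
    networks) are updated once every `policy_delay` critic updates."""
    critic_updates = 0
    actor_updates = 0
    for t in range(1, total_steps + 1):
        critic_updates += 1            # value networks: every step
        if t % policy_delay == 0:      # policy and targets: every d-th step
            actor_updates += 1
    return critic_updates, actor_updates


critic_n, actor_n = update_schedule(total_steps=10, policy_delay=2)
```

With `policy_delay=2`, the value networks receive two gradient steps for every policy step, giving the value estimate time to settle before the policy is moved.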

2.5 Target Policy Smoothing Regularization

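A deterministic policy can overfit to narrow peaks in the value estimate, so function approximation error makes the learning target inaccurate and increases its variance.
To counter this, similar actions should have similar action values, which is enforced by adding clipped noise to the target action:
$y = r + \gamma \cdot Q_{\theta'}(s',\pi_{\phi'}(s')+\epsilon)$
$\epsilon \sim \mathrm{clip}(N(0,\delta),-c,c)$
Here $\delta$ is the noise scale (not the TD error of Section 2.1) and $c$ is the clipping bound.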
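
The smoothed target action can be sketched for a scalar action as below. All names are hypothetical, `noise_scale` is treated as a standard deviation, and the final clip to the action bounds is an extra assumption that is not part of the formula above.

```python
import random


def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, noise_scale, noise_clip,
                           action_low, action_high, rng=random):
    """pi'(s') + eps with eps ~ clip(N(0, noise_scale), -noise_clip, noise_clip)."""
    noise = clip(rng.gauss(0.0, noise_scale), -noise_clip, noise_clip)
    # Assumption: keep the perturbed action inside the valid action range.
    return clip(target_action + noise, action_low, action_high)


rng = random.Random(0)  # seeded generator so the run is repeatable
samples = [smoothed_target_action(0.3, 0.2, 0.5, -1.0, 1.0, rng)
           for _ in range(1000)]
```

Every sample stays within `noise_clip` of the unperturbed target action, so the target value is averaged over a small neighbourhood of actions rather than read off a single point.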

3 Algorithm steps

Initialize the value networks $Q_{\theta_1}$, $Q_{\theta_2}$ and the policy network $\pi_{\phi}$ with random parameters
Initialize the target network parameters $\theta_1'\gets \theta_1$, $\theta_2'\gets \theta_2$, $\phi'\gets \phi$
Initialize the replay buffer
for t=1 to T do:
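-------- Select an action with exploration noise: $a\sim \pi_{\phi}(s)+\epsilon$, where $\epsilon \sim N(0,\delta)$
-------- Observe the reward $r$ and the next state $s'$
-------- Store the transition $(s,a,r,s')$ in the replay buffer
-------- Sample a random mini-batch of $N$ transitions $(s,a,r,s')$ from the replay buffer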
-------- $\hat{a}\gets \pi_{\phi'}(s')+\epsilon$, where $\epsilon \sim \mathrm{clip}(N(0,\delta),-c,c)$
-------- $y\gets r+\gamma\cdot \min_{i=1,2}Q_{\theta_i'}(s',\hat{a})$
-------- Update the value networks: $\theta_i \gets \arg\min_{\theta_i}N^{-1}\sum{(y-Q_{\theta_i}(s,a))^2}$
-------- if t mod d = 0 then:
---------------- Update the policy network with the deterministic policy gradient:
---------------- $\nabla_\phi J(\phi)=N^{-1}\sum\nabla_a Q_{\theta_1}(s,a)|_{a=\pi_\phi(s)}\cdot\nabla_\phi \pi_\phi(s)$
---------------- Update the target networks:
---------------- $\theta_1'\gets \tau \cdot \theta_1 + (1-\tau)\cdot \theta_1'$
---------------- $\theta_2'\gets \tau \cdot \theta_2 + (1-\tau)\cdot \theta_2'$
---------------- $\phi'\gets \tau \cdot \phi + (1-\tau)\cdot \phi'$
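
The target network update in the last three lines is a soft (Polyak) average. A minimal sketch on plain lists of floats follows; the flat parameter lists and the value `tau=0.005` are illustrative assumptions, and a real implementation would apply the same rule to every tensor of each network.

```python
def soft_update(target_params, online_params, tau):
    """Return tau * online + (1 - tau) * target, element by element."""
    return [tau * w + (1.0 - tau) * w_target
            for w, w_target in zip(online_params, target_params)]


theta = [1.0, -2.0, 0.5]          # hypothetical online parameters
theta_target = [0.0, 0.0, 0.0]    # hypothetical target parameters

# Three soft updates move the target only slightly toward the online values.
for _ in range(3):
    theta_target = soft_update(theta_target, theta, tau=0.005)
```

With a small `tau` the targets trail the online networks slowly, which keeps the regression target $y$ stable between updates.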

By CyrusMay 2022.08.23


Reinforcement Learning: Twin Delayed Deep Deterministic Policy Gradient (TD3 algorithm)
