Suboptimal convergence in PyTorch compared to TensorFlow when using the Adam optimizer

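Problem description

My PyTorch training program converges worse than the equivalent TensorFlow implementation. When I use SGD instead of Adam, the losses are identical. With Adam, the losses differ from the first epoch onward. I believe I am using the same settings in both programs. Any ideas on how to debug this would be helpful!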

Losses with SGD

PyTorch

0.1504615843296051
0.10858417302370071
0.08603279292583466

TensorFlow

0.15046157
0.108584
0.08603277

Losses with Adam

PyTorch

0.0031117501202970743
0.0020642257295548916
0.0019268309697508812
0.0016333406092599034
0.0017334128497168422
0.0014430736191570759
0.0010424457723274827
0.0012145100627094507
0.0011195113183930516
0.0009501167223788798
0.0009987876983359456
0.0007953296881169081
0.00075263757025823
0.0008374055614694953
0.000735406531020999

TensorFlow

0.0036667113
0.0032563617
0.0021536187
0.0015266595
0.0013580231
0.0013878695
0.0011856346
0.0011136091
0.00091276
0.000890126
0.00088381825
0.0007283067
0.00081382995
0.0006670901
0.00046282331

Adam optimizer settings

TF 1.15.3:

adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)

# default parameters, from the source at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam"

PyTorch

torch.optim.Adam(params=model.parameters(), lr=5e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0)
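
One thing that may be worth checking with these settings is where each framework applies epsilon. As far as I understand the TF 1.x documentation, `tf.train.AdamOptimizer` adds `epsilon` to the square root of the uncorrected second moment (the "epsilon hat" form in the Adam paper), while `torch.optim.Adam` adds `eps` after bias correction. The scalar sketch below is not either library's code: the function names are hypothetical and the update rules are my reading of the two formulations. It only illustrates that the same `1e-8` can act differently when gradients are small.

```python
import math


def adam_step_tf1(p, g, m, v, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam step, eps added to the *uncorrected* sqrt(v) (assumed TF 1.x form)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias correction is folded into the step size.
    lr_t = lr * math.sqrt(1 - b2 ** t) / (1 - b1 ** t)
    return p - lr_t * m / (math.sqrt(v) + eps), m, v


def adam_step_torch(p, g, m, v, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam step, eps added *after* bias-correcting sqrt(v) (assumed PyTorch form)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    denom = math.sqrt(v) / math.sqrt(1 - b2 ** t) + eps
    return p - (lr / (1 - b1 ** t)) * m / denom, m, v


if __name__ == "__main__":
    # Compare the first update for a large and a very small gradient.
    for g in (1e-2, 1e-7):
        p_tf, _, _ = adam_step_tf1(0.0, g, 0.0, 0.0, t=1)
        p_pt, _, _ = adam_step_torch(0.0, g, 0.0, 0.0, t=1)
        print(f"grad={g:g}  tf1-style={p_tf:.3e}  torch-style={p_pt:.3e}")
```

With `eps=0.0` the two functions agree, so any gap between them comes from the epsilon term alone. If this turns out to matter here, one option would be to scale the epsilon on one side and see whether the loss curves move closer together.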

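Training

I initialize both models with the same weights, loaded from a file.
I train and test on a single data sample, also loaded from a file. I use 1000 iterations for training and 1 iteration for testing, with a batch size of 1.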

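Debugging so far

As noted above, I use the same parameters and data.
I ran a single forward and backward pass with the Adam optimizer and saved the data and gradients for every layer. I plotted the results. Everything looks the same, with differences between 1e-6 and 1e-10. The losses also match to within rounding error.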
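
Since a single forward and backward pass already matches, the same check could be repeated on the weights after each optimizer step, which is where an Adam difference would first show up. Below is a minimal sketch of such a comparison. It assumes the per-layer arrays from both frameworks have been collected into dictionaries keyed by a shared layer name and already transposed to a common layout. The helper name and the dictionary layout are hypothetical.

```python
import numpy as np


def compare_dumps(torch_arrays, tf_arrays, atol=1e-6):
    """Return {name: (max_abs_diff, within_atol)} for arrays present in both dicts."""
    report = {}
    for name in sorted(set(torch_arrays) & set(tf_arrays)):
        a = np.asarray(torch_arrays[name], dtype=np.float64)
        b = np.asarray(tf_arrays[name], dtype=np.float64)
        if a.shape != b.shape:
            raise ValueError(f"{name}: shape mismatch {a.shape} vs {b.shape}")
        diff = float(np.max(np.abs(a - b)))
        report[name] = (diff, diff <= atol)
    return report


if __name__ == "__main__":
    # Toy data standing in for real dumps of one layer's weights.
    pt = {"layer1/weight": np.array([0.10, -0.20, 0.30])}
    tf_ = {"layer1/weight": np.array([0.10, -0.20, 0.30 + 5e-8])}
    for name, (diff, ok) in compare_dumps(pt, tf_).items():
        print(f"{name}: max abs diff = {diff:.2e}, within tolerance = {ok}")
```

Running this on the weights saved after step 1, step 2, and so on would show at which step, and in which layer, the two runs first drift apart by more than rounding error.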

Saving and loading the PyTorch model

import numpy as np
import torch

def train(...):
    ...
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break

        # `in` is a reserved word in Python, so the input array is named `inp` here
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")

        # adjust from TF
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2

        inp, out1, out2 = (
            torch.from_numpy(inp).to(device),
            torch.from_numpy(out1).to(device),
            torch.from_numpy(out2).to(device),
        )

        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)

        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()

        optimizer.step()

    save_checkpoint({'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}, latest_filename=latest_checkpoint_path)
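
Because the PyTorch loop restores the optimizer from the checkpoint, it might be worth confirming what that state contains. To my understanding, `load_state_dict` brings back the saved Adam moment estimates and step counts, plus the saved hyperparameters, so the first updates would differ from those of a freshly created optimizer. The helper below is a hypothetical sketch that works on any dictionary shaped like the output of `torch.optim.Adam.state_dict()`. The toy dictionaries in the demo stand in for a real checkpoint.

```python
def summarize_adam_state(opt_state_dict):
    """Summarize a dict shaped like the output of torch.optim.Adam.state_dict()."""
    state = opt_state_dict.get("state", {})
    # `step` may be an int or a 0-dim tensor depending on the PyTorch version.
    steps = sorted({int(float(s["step"])) for s in state.values() if "step" in s})
    group = opt_state_dict["param_groups"][0]
    return {
        "fresh": len(state) == 0,  # True when no moments have been accumulated yet
        "steps": steps,
        "lr": group.get("lr"),
        "betas": group.get("betas"),
        "eps": group.get("eps"),
    }


if __name__ == "__main__":
    hyper = {"lr": 5e-5, "betas": (0.9, 0.999), "eps": 1e-8}
    fresh = {"state": {}, "param_groups": [dict(hyper)]}
    resumed = {
        "state": {0: {"step": 1000, "exp_avg": [0.0], "exp_avg_sq": [0.0]}},
        "param_groups": [dict(hyper)],
    }
    print(summarize_adam_state(fresh))
    print(summarize_adam_state(resumed))
```

In the training function above this could be called as `summarize_adam_state(checkpoint['optimizer'])` right after `torch.load`, and the result compared with whatever Adam slot variables the TensorFlow checkpoint restores.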

Saving and loading the TensorFlow model

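import numpy as np
import tensorflow as tf  # TF 1.15

sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter(my_path, graph=sess.graph)

restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, load_path)

saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)

counter = 0
while run:
    counter += 1
    if counter > 1000:
        break

    # `in` is a reserved word in Python. The original snippet reuses one name for
    # both the placeholder and the array, so the *_ph and *_data names are stand-ins.
    in_data = np.load("")
    out1_data = np.load("")
    out2_data = np.load("")
    out1_data = out1_data[0, :, :]
    out1_data = out1_data[:, np.newaxis]
    out2_data = out2_data[0, :]
    out2_data = out2_data[:, np.newaxis]
    in_data = in_data[0, :]
    in_data = in_data[:, np.newaxis]
    # `optimizer` here is the training op returned by adam_optimizer.minimize(...)
    _, _loss = sess.run([optimizer, loss], feed_dict={in_ph: in_data, out1_ph: out1_data, out2_ph: out2_data})

save_path = saver.save(sess, my_save_path, global_step=int(_global_step))

sess.close()
tf.reset_default_graph()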


deep-learning gradient-descent pytorch tensorflow