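Problem description
My PyTorch training program gives worse results than my TensorFlow implementation. When I switch from Adam to SGD, the losses are identical. With Adam, the losses differ from the first epoch onward. I believe both programs use the same settings. Any ideas on how to debug this would be helpful!
Loss with SGD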
PyTorch
0.1504615843296051
0.10858417302370071
0.08603279292583466
TensorFlow
0.15046157
0.108584
0.08603277
Loss with Adam
PyTorch
0.0031117501202970743
0.0020642257295548916
0.0019268309697508812
0.0016333406092599034
0.0017334128497168422
0.0014430736191570759
0.0010424457723274827
0.0012145100627094507
0.0011195113183930516
0.0009501167223788798
0.0009987876983359456
0.0007953296881169081
0.00075263757025823
0.0008374055614694953
0.000735406531020999
TensorFlow:
0.0036667113
0.0032563617
0.0021536187
0.0015266595
0.0013580231
0.0013878695
0.0011856346
0.0011136091
0.00091276
0.000890126
0.00088381825
0.0007283067
0.00081382995
0.0006670901
0.00046282331
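To put numbers on the two comparisons, the per-epoch relative difference can be computed directly from the logged values. This is only a small sketch: the helper name `relative_diffs` is made up for illustration, and only the first three Adam epochs are included.

```python
def relative_diffs(a, b):
    """Relative difference |a_i - b_i| / |b_i| for paired loss values."""
    return [abs(x - y) / abs(y) for x, y in zip(a, b)]


# Loss values copied from the logs in this post.
sgd_pytorch = [0.1504615843296051, 0.10858417302370071, 0.08603279292583466]
sgd_tf = [0.15046157, 0.108584, 0.08603277]
adam_pytorch = [0.0031117501202970743, 0.0020642257295548916, 0.0019268309697508812]
adam_tf = [0.0036667113, 0.0032563617, 0.0021536187]

sgd_rel = relative_diffs(sgd_pytorch, sgd_tf)
adam_rel = relative_diffs(adam_pytorch, adam_tf)

print("SGD :", ["%.1e" % r for r in sgd_rel])
print("Adam:", ["%.1e" % r for r in adam_rel])
```

With SGD the two runs agree to within a few parts per million, while with Adam the first epoch already differs by roughly 15%.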
Adam optimizer settings
TF 1.15.3:
adam_optimizer = tf.train.AdamOptimizer(learning_rate=5e-5)
# default parameters from the documentation at https://github.com/tensorflow/tensorflow/blob/v1.15.0/tensorflow/python/training/adam.py#L32-L235:
# AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam")
PyTorch
torch.optim.Adam(params=model.parameters(),lr=5e-5,betas=(0.9,0.999),eps=1e-08,weight_decay=0.0)
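One thing that may be worth ruling out is that the same nominal `eps=1e-8` does not mean the same thing in both optimizers. The TensorFlow documentation notes that `tf.train.AdamOptimizer` uses the "epsilon hat" formulation from the Kingma and Ba paper, where epsilon is added to the uncorrected `sqrt(v)`, whereas `torch.optim.Adam` adds `eps` after the bias correction. The sketch below re-implements both update rules for a single scalar parameter on the first step. It is not either library's code, the function name `adam_first_step` is hypothetical, and whether this accounts for the gap seen in this post is not established.

```python
import math


def adam_first_step(grad, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8, style="pytorch"):
    """Size of the first Adam update for a scalar gradient, starting from m = v = 0."""
    t = 1
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    if style == "pytorch":
        # eps is added after sqrt(v) has been bias-corrected
        denom = math.sqrt(v) / math.sqrt(1 - beta2 ** t) + eps
        return lr / (1 - beta1 ** t) * m / denom
    if style == "tf1":
        # bias corrections are folded into the learning rate;
        # eps is added to the uncorrected sqrt(v) ("epsilon hat")
        lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        return lr_t * m / (math.sqrt(v) + eps)
    raise ValueError("unknown style: %r" % style)


for g in (1.0, 1e-4, 1e-8):
    pt = adam_first_step(g, style="pytorch")
    tf1 = adam_first_step(g, style="tf1")
    print("grad=%g  pytorch-style=%.3e  tf1-style=%.3e" % (g, pt, tf1))
```

For gradients much larger than `eps` the two rules give nearly the same update. They separate once gradients approach `eps`, which would affect the weights after the first Adam step but not the gradients of a single forward and backward pass.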
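Training
Debugging so far
- As mentioned above, I use the same parameters and data.
- I ran a single forward and backward pass with the Adam optimizer and saved the data and gradients of every layer. I plotted the results. Everything looks the same, and the values are within 1e-6 to 1e-10 of each other. The losses are also identical up to rounding error.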
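The check described in the list covers activations and gradients, both of which are computed before the optimizer applies its update. The same kind of comparison could also be run on the weights after one optimizer step. Below is a minimal sketch of such a comparison, assuming each framework's values have been dumped to a dict of NumPy arrays keyed by layer name. The helper `compare_dumps` and the layer name `"layer1/w"` are hypothetical, and the arrays are synthetic stand-ins rather than data from this post.

```python
import numpy as np


def compare_dumps(a, b, rtol=1e-5, atol=1e-8):
    """Compare two dicts of arrays; return {name: (max_abs_diff, is_close)}."""
    report = {}
    for name in sorted(a.keys() & b.keys()):
        x = np.asarray(a[name], dtype=np.float64)
        y = np.asarray(b[name], dtype=np.float64)
        if x.shape != y.shape:
            raise ValueError("%s: shape mismatch %s vs %s" % (name, x.shape, y.shape))
        max_abs = float(np.max(np.abs(x - y)))
        report[name] = (max_abs, bool(np.allclose(x, y, rtol=rtol, atol=atol)))
    return report


# Synthetic stand-ins for weights dumped after one optimizer step.
base = np.linspace(-1.0, 1.0, 5)
pytorch_dump = {"layer1/w": base}
tf_dump_close = {"layer1/w": base + 1e-9}  # rounding-level difference
tf_dump_far = {"layer1/w": base + 1e-3}    # a real discrepancy

print(compare_dumps(pytorch_dump, tf_dump_close))
print(compare_dumps(pytorch_dump, tf_dump_far))
```

Layers whose weights stop matching after the update, even though their gradients matched before it, would point at the optimizer rather than the model or the data.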
Saving and loading the PyTorch model
def train(...):
    ...
    # Restore the model weights and the optimizer state from the checkpoint
    checkpoint = torch.load(checkpoint_file, map_location=device)
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])
    ...
    counter = 0
    while run:
        counter += 1
        if counter > 1000:
            break
        # `inp` is used instead of `in`, which is a reserved word in Python
        inp = np.load("debug_data/in.npy")
        out1 = np.load("debug_data/out1.npy")
        out2 = np.load("debug_data/out2.npy")
        # adjust from the TF layout
        inp = inp.squeeze(3)
        inp = np.expand_dims(inp, axis=0)
        # ... do the same for out1 and out2
        inp, out1, out2 = (
            torch.from_numpy(inp).to(device),
            torch.from_numpy(out1).to(device),
            torch.from_numpy(out2).to(device),
        )
        optimizer.zero_grad()
        out1_hat, out2_hat = model(inp)
        train_loss = loss_fn(out1_hat, out1) + loss_fn(out2_hat, out2)
        train_loss.backward()
        optimizer.step()
    save_checkpoint({'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}, latest_filename=latest_checkpoint_path)
Saving and loading the TensorFlow model
sess.run(tf.global_variables_initializer())
writer = tf.summary.FileWriter(my_path, graph=sess.graph)
# Restore all global variables from the checkpoint
restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, load_path)
saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
counter = 0
while run:
    counter += 1
    if counter > 1000:
        break
    # `inp` is used instead of `in`, which is a reserved word in Python
    inp = np.load("")
    out1 = np.load("")
    out2 = np.load("")
    out1 = out1[0, :, :]
    out1 = out1[:, np.newaxis]
    out2 = out2[0, :]
    out2 = out2[:, np.newaxis]
    inp = inp[0, :]
    inp = inp[:, np.newaxis]
    # inp_ph, out1_ph and out2_ph stand for the graph's input placeholders (defined elsewhere)
    _, _loss = sess.run([optimizer, loss], feed_dict={inp_ph: inp, out1_ph: out1, out2_ph: out2})
save_path = saver.save(sess, my_save_path, global_step=int(_global_step))
sess.close()
tf.reset_default_graph()