Chainer Standard Updater causes intermittent training failure

Problem description

I am training a Chainer SSD300 model on a GeForce RTX 2080 Ti.

The scenario below always uses the same dataset; I have tried the whole range of batch sizes (4-32) and 100-250 epochs.

Sometimes the training process completes smoothly and produces a good model. But sometimes, within 30 epochs, training throws the following error:

Exception in main training loop: merge_sort: Failed to synchronize: an illegal memory access was encountered
Traceback (most recent call last):
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainer/training/trainer.py", line 316, in run
    update()
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 175, in update
    self.update_core()
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainer/training/updaters/standard_updater.py", line 187, in update_core
    optimizer.update(loss_func, *in_arrays)
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainer/optimizer.py", line 800, in update
    loss = lossfun(*args, **kwds)
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainer/link.py", line 294, in __call__
    out = forward(*args, **kwargs)
  File "/home/malini/aim/top/process_runner1.py", line 291, in forward
    mb_locs, mb_confs, gt_mb_locs, gt_mb_labels, self.k)
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainercv/links/model/ssd/multibox_loss.py", line 91, in multibox_loss
    hard_negative = _hard_negative(conf_loss.array, positive, k)
  File "/home/malini/anaconda3/envs/chainer/lib/python3.7/site-packages/chainercv/links/model/ssd/multibox_loss.py", line 19, in _hard_negative
    rank = (x * (positive - 1)).argsort(axis=1).argsort(axis=1)
  File "cupy/core/core.pyx", line 627, in cupy.core.core.ndarray.argsort
  File "cupy/core/core.pyx", line 644, in cupy.core.core.ndarray.argsort
  File "cupy/core/_routines_sorting.pyx", line 101, in cupy.core._routines_sorting._ndarray_argsort
  File "cupy/cuda/thrust.pyx", line 135, in cupy.cuda.thrust.argsort
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
  File "cupy/cuda/driver.pyx", line 193, in cupy.cuda.driver.moduleUnload
  File "cupy/cuda/driver.pyx", line 82, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The stack trace says the error comes from this Chainer training class:

training.updaters.StandardUpdater(train_iter, optimizer, device=gpu_id)

In my case, train_iter always comes from the same dataset, the optimizer is the same, and the device is always 0.
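To make the call path in the traceback concrete: on each iteration, StandardUpdater pulls a batch from the iterator and hands it to optimizer.update(), which runs the loss function (here, the SSD multibox loss) forward and then backward. A minimal sketch of that control flow, where ToyIterator, ToyOptimizer and toy_multibox_loss are illustrative stand-ins, not Chainer's real implementations:

```python
class ToyIterator:
    """Stand-in for chainer.iterators.MultiprocessIterator."""
    def __init__(self, data, batchsize):
        self.data, self.batchsize, self.pos = data, batchsize, 0

    def next(self):
        batch = self.data[self.pos:self.pos + self.batchsize]
        self.pos = (self.pos + self.batchsize) % len(self.data)
        return batch

class ToyOptimizer:
    """Stand-in for MomentumSGD: calls the loss function it is given."""
    def update(self, lossfun, *in_arrays):
        # The real optimizer also backprops and updates parameters here.
        return lossfun(*in_arrays)

def toy_multibox_loss(batch):
    # In the real code this is where _hard_negative -> argsort runs on the GPU.
    return sum(batch)

def update_core(iterator, optimizer, lossfun):
    # The step the traceback enters: draw a batch, run optimizer.update().
    in_arrays = iterator.next()
    return optimizer.update(lossfun, in_arrays)

it = ToyIterator([1, 2, 3, 4], batchsize=2)
print(update_core(it, ToyOptimizer(), toy_multibox_loss))  # -> 3
```

The point of the sketch is that the crash happens inside the forward pass of the loss function, not in the updater's own bookkeeping.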

Here is the training code:

chainer.cuda.set_max_workspace_size(1024 * 1024 * 1024)
chainer.config.autotune = True

batchsize = 4

gpu_id = 0

out = user_dir + user_id + '/Result'
initial_lr = 0.001
training_epoch = 100

log_interval = 1, 'epoch'
lr_decay_rate = 0.1
lr_decay_timing = [200, 250]

transformed_train_dataset = TransformDataset(train_dataset, Transform(model.coder, model.insize, model.mean))

train_iter = chainer.iterators.MultiprocessIterator(transformed_train_dataset, batchsize)

valid_iter = chainer.iterators.SerialIterator(valid_dataset, batchsize, repeat=False, shuffle=False)

optimizer = chainer.optimizers.MomentumSGD()
optimizer.setup(train_chain)
for param in train_chain.params():
    if param.name == 'b':
        param.update_rule.add_hook(GradientScaling(2))
    else:
        param.update_rule.add_hook(WeightDecay(0.0005))

updater = training.updaters.StandardUpdater(train_iter, optimizer, device=gpu_id)

trainer = training.Trainer(updater,(training_epoch,'epoch'),out)

trainer.extend(extensions.ExponentialShift('lr', lr_decay_rate, init=initial_lr), trigger=triggers.ManualScheduleTrigger(lr_decay_timing, 'epoch'))
trainer.extend(DetectionVOCEvaluator(valid_iter, model, use_07_metric=False, label_names=bccd_labels), trigger=(1, 'epoch'))
trainer.extend(extensions.LogReport(['epoch'], trigger=log_interval, log_name='file.txt'))
trainer.extend(extensions.observe_lr(), trigger=log_interval)
trainer.extend(extensions.PrintReport(['epoch']), trigger=log_interval)
trainer.extend(extensions.snapshot(filename='fin_{.updater.epoch}.npz'), trigger=(50, 'epoch'))

trainer.run()

Judging from the stack trace, something seems to go wrong during training when the optimizer gets into the multibox loss computation.
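Specifically, the failing line is ChainerCV's hard negative mining: for each image it ranks the confidence losses of the negative default boxes and keeps only the k × (number of positives) hardest ones. The same logic on the CPU with NumPy (the GPU version runs these argsorts through CuPy, which dispatches to Thrust's merge_sort — the last frame in the traceback):

```python
import numpy as np

def hard_negative(x, positive, k):
    # Negatives get -loss, positives get 0, so an ascending argsort puts
    # the largest-loss negatives first; the second argsort converts the
    # sorted order into a per-element rank.
    rank = (x * (positive - 1)).argsort(axis=1).argsort(axis=1)
    # Keep only the k * (number of positives) hardest negatives per row.
    return rank < (k * positive.sum(axis=1, keepdims=True))

conf_loss = np.array([[0.9, 0.1, 0.8, 0.3, 0.2]])
positive = np.array([[1, 0, 0, 0, 0]])   # one positive default box
mask = hard_negative(conf_loss, positive, k=2)
print(mask)  # True only for the two hardest negatives (losses 0.8 and 0.3)
```

Nothing in this computation depends on the optimizer state, which suggests the bad memory access originates in the data fed to the loss rather than in the update step itself.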

  1. Can anyone share what the actual problem is and what I can do to overcome it?

  2. Why does it happen only sometimes rather than every time?

Is this problem related in any way to this open issue on GitHub? And also to this github link?
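Not an answer, but one way to narrow an intermittent illegal-address error down: such crashes are often triggered by a single bad sample (e.g. a class label outside [0, n_classes) or a degenerate bounding box) that only fails when it happens to land in a batch, which would also explain why some runs finish cleanly. A sketch of a pre-training scan, assuming VOC-style (img, bbox, label) samples; check_dataset is a hypothetical helper, not part of ChainerCV:

```python
import numpy as np

def check_dataset(dataset, n_classes):
    """Scan every sample for out-of-range labels and degenerate boxes
    before anything is copied to the GPU."""
    bad = []
    for i in range(len(dataset)):
        _, bbox, label = dataset[i]
        if label.size and (label.min() < 0 or label.max() >= n_classes):
            bad.append((i, 'label out of range'))
        # ChainerCV boxes are (y_min, x_min, y_max, x_max): max must exceed min.
        if bbox.size and np.any(bbox[:, 2:] <= bbox[:, :2]):
            bad.append((i, 'degenerate bbox'))
    return bad

# Toy dataset: sample 1 has both a bad label and an inverted box.
dataset = [
    (None, np.array([[0., 0., 10., 10.]]), np.array([1])),
    (None, np.array([[5., 5., 2., 8.]]), np.array([3])),
]
print(check_dataset(dataset, n_classes=3))
```

Running one training attempt with the environment variable CUDA_LAUNCH_BLOCKING=1 can also help, since it makes CUDA report the failing kernel synchronously instead of at the next sync point.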

Thanks.
