Training the ONNX pretrained model Emotion FerPlus raises the exception "cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED"

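Problem description

I am testing training of the Emotion FerPlus emotion recognition model. Training fails with the error cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED. I am using an Nvidia Titan RTX GPU with 24 GB of memory. I changed minibatch_size from 32 to 1, but the error still occurs. I am running the cntk-GPU Docker image. The full error message is: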

About to throw exception 'cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,&C::One,m_inT,ptr(in),*m_kernelT,ptr(kernel),*m_conv,m_fwdAlgo.selectedAlgo,ptr(workspace),workspace.BufferSize(),&C::Zero,m_outT,ptr(out))'
cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,ptr(out))
Traceback (most recent call last):
  File "train.py",line 193,in <module>
    main(args.base_folder,args.training_mode)
  File "train.py",line 124,in main
    trainer.train_minibatch({input_var : images,label_var : labels})
  File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/train/trainer.py",line 184,in train_minibatch
    device)
  File "/root/anaconda3/envs/cntk-py35/lib/python3.5/site-packages/cntk/cntk_py.py",line 3065,in train_minibatch
    return _cntk_py.Trainer_train_minibatch(self,*args)
RuntimeError: cuDNN failure 8: CUDNN_STATUS_EXECUTION_FAILED ; GPU=0 ; hostname=d9150da5d531 ; expr=cudnnConvolutionForward(*m_cudnn,ptr(out))

[CALL STACK]
[0x7fc04da7ce89]                                                       + 0x732e89
[0x7fc045a71aaf]                                                       + 0xeabaaf
[0x7fc045a7b613]    Microsoft::MSR::cntk::CuDnnConvolutionEngine<float>::  ForwardCore  (Microsoft::MSR::cntk::Matrix<float> const&,Microsoft::MSR::cntk::Matrix<float> const&,Microsoft::MSR::cntk::Matrix<float>&,Microsoft::MSR::cntk::Matrix<float>&) + 0x1a3
[0x7fc04dd4f8d3]    Microsoft::MSR::cntk::ConvolutionNode<float>::  ForwardProp  (Microsoft::MSR::cntk::FrameRange const&) + 0xa3
[0x7fc04dfba654]    Microsoft::MSR::cntk::computationNetwork::PARTraversalFlowControlNode::  ForwardProp  (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&,Microsoft::MSR::cntk::FrameRange const&) + 0xf4
[0x7fc04dcb6e33]    std::_Function_handler<void (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&),void Microsoft::MSR::cntk::computationNetwork::ForwardProp<std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>>>(std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>> const&)::{lambda(std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&)#1}>::  _M_invoke  (std::_Any_data const&,std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&) + 0x63
[0x7fc04dd04ed9]    void Microsoft::MSR::cntk::computationNetwork::  TravserseInSortedGlobalEvalOrder  <std::vector<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>,std::allocator<std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase>>> const&,std::function<void (std::shared_ptr<Microsoft::MSR::cntk::computationNodeBase> const&)> const&) + 0x5b9
[0x7fc04dca64da]    cntk::CompositeFunction::  Forward  (std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>,std::hash<cntk::Variable>,std::equal_to<cntk::Variable>,std::allocator<std::pair<cntk::Variable const,std::shared_ptr<cntk::Value>>>> const&,std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>>>>&,cntk::DeviceDescriptor const&,std::unordered_set<cntk::Variable,std::allocator<cntk::Variable>> const&,std::allocator<cntk::Variable>> const&) + 0x15da
[0x7fc04dc3d603]    cntk::Function::  Forward  (std::unordered_map<cntk::Variable,std::allocator<cntk::Variable>> const&) + 0x93
[0x7fc04ddbf91b]    cntk::Trainer::  ExecuteForwardBackward  (std::unordered_map<cntk::Variable,std::shared_ptr<cntk::Value>>>>&) + 0x36b
[0x7fc04ddc06e4]    cntk::Trainer::  TrainLocalMinibatch  (std::unordered_map<cntk::Variable,bool,cntk::DeviceDescriptor const&) + 0x94
[0x7fc04ddc178a]    cntk::Trainer::  TrainMinibatch  (std::unordered_map<cntk::Variable,cntk::DeviceDescriptor const&) + 0x5a
[0x7fc04ddc1852]    cntk::Trainer::  TrainMinibatch  (std::unordered_map<cntk::Variable,cntk::DeviceDescriptor const&) + 0x52
[0x7fc04eb2db22]                                                       + 0x229b22
[0x7fc057ea15e9]    PyCFunction_Call                                   + 0xf9
[0x7fc057f267c0]    PyEval_EvalFrameEx                                 + 0x6ba0
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f28df5]    PyEval_EvalFrameEx                                 + 0x91d5
[0x7fc057f29b49]                                                       + 0x144b49
[0x7fc057f29cd8]    PyEval_EvalCodeEx                                  + 0x48
[0x7fc057f29d1b]    PyEval_EvalCode                                    + 0x3b
[0x7fc057f4f020]    PyRun_FileExFlags                                  + 0x130
[0x7fc057f50623]    PyRun_SimpleFileExFlags                            + 0x173
[0x7fc057f6b8c7]    Py_Main                                            + 0xca7
[0x400add]          main                                               + 0x15d
[0x7fc056f06830]    __libc_start_main                                  + 0xf0
[0x4008b9]                                                            

Solution

CNTK is now in maintenance mode (essentially deprecated). Although CNTK exports to ONNX well, importing ONNX models is not well supported.

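ONNX Runtime (https://github.com/microsoft/onnxruntime) now supports training, so give it a try. ONNX Runtime training is actively developed and supported, so if something does not work, it will most likely be fixed quickly.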
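
Before moving the model over, it may help to confirm that ONNX Runtime can see the GPU inside the container at all. The following is only a minimal sketch, not part of the original answer: it assumes the `onnxruntime` (or `onnxruntime-gpu`) Python package might be installed, and the helper names `available_providers` and `pick_providers` are made up for illustration. It checks the runtime's execution providers only and does not exercise the training API itself.

```python
# Sketch: check which ONNX Runtime execution providers are visible
# and prefer the CUDA provider when it is present.
# Assumption: onnxruntime may or may not be installed in this environment.


def available_providers():
    """Return the providers ONNX Runtime reports, or [] if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return []
    return list(ort.get_available_providers())


def pick_providers(available):
    """Order providers so CUDA is tried first and CPU is the fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return [p for p in preferred if p in available]


if __name__ == "__main__":
    found = available_providers()
    print("available:", found)
    print("would use:", pick_providers(found))
```

If `CUDAExecutionProvider` is missing from the result, the GPU is not usable from that environment and the container setup is worth checking first. The list returned by `pick_providers` can be passed as the `providers` argument of `onnxruntime.InferenceSession` to load the Emotion FerPlus ONNX file for a quick forward-pass check.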