Intel DevCloud oneAPI error: Python script killed during execution

Problem description

I am trying to run a Python file on DevCloud. The job script, job.sh, is as follows:

#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh  > /dev/null 2>&1
python master.py

I submit it from my Mac terminal with this command:

qsub -l nodes=1:xeon:batch:ppn=2 -d . job.sh

The job ran for about 3 hours and produced 2 output files: job.sh.e934264 and job.sh.o934264

The job.sh.e934264 file is as follows:

2021-07-26 03:49:45.014693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:45.014777: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-26 03:49:50.062319: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:50.062403: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-26 03:49:50.062449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (s001-n061): /proc/driver/nvidia/version does not exist
2021-07-26 03:49:50.062948: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-26 03:52:31.660446: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-26 03:52:31.679568: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3400000000 Hz
/var/spool/torque/mom_priv/jobs/934264.v-qsvr-1.aidevcloud.SC: line 4: 110188 Killed                  python master.py
 

The job.sh.o934264 file is:

########################################################################
#      Date:           Mon 26 Jul 2021 03:49:38 AM PDT
#    Job ID:           934264.v-qsvr-1.aidevcloud
#      User:           u65358
# Resources:           neednodes=1:xeon:batch:ppn=2,nodes=1:xeon:batch:ppn=2,walltime=06:00:00
########################################################################


########################################################################
# End of output for job 934264.v-qsvr-1.aidevcloud
# Date: Mon 26 Jul 2021 06:52:21 AM PDT
########################################################################

I run into this error instead of getting the expected output. Can someone help me resolve it? Thanks.

Solution

No working solution to this problem has been found yet.
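
The log itself does not say why the process was killed. The "Killed" line printed by the shell means that python master.py received SIGKILL. The job ran for roughly 3 hours (03:49:38 to 06:52:21), well within the walltime=06:00:00 shown in the Resources line, so the walltime limit is probably not what ended it. A common reason for this message is the process exhausting the memory available on the node, but that is an assumption here and nothing in the log confirms it. As a starting point for checking it, the following is a minimal sketch that periodically writes the script's peak memory use to stderr, so the last value appears in the .e file just before the "Killed" line. The helper names peak_rss_mib and start_memory_logger are hypothetical and not part of the original script.

```python
import resource
import sys
import threading
import time


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def start_memory_logger(interval_s=60.0, stream=sys.stderr):
    """Log peak memory every interval_s seconds from a daemon thread."""

    def _loop():
        while True:
            print(f"[mem] peak RSS: {peak_rss_mib():.1f} MiB",
                  file=stream, flush=True)
            time.sleep(interval_s)

    # A daemon thread does not keep the interpreter alive once the
    # main script finishes.
    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```

Calling start_memory_logger() once at the top of master.py would be enough. If the logged value keeps climbing until the job dies, memory is a plausible culprit and reducing the batch size or the amount of data held in memory would be the next thing to try; if it stays flat, the cause lies elsewhere.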
