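Problem description
I have been using a PBS-managed compute cluster at my school for several years. A few months ago I ran into this problem, and the people running the cluster have never been able to figure it out. When I submit jobs, they are queued, and some start running right away. But jobs that I believe should stay queued for lack of resources instead die almost immediately. This happens intermittently, depending on how many nodes I can use at one time. For example, sometimes I submit 10 jobs: the first two run, the next three fail, and then the remaining five run.
No stdout or stderr files are created for the failed jobs; the jobs that run do create them. When one of these jobs dies I receive an e-mail, which I have included here with some identifying information removed. Exit status -9 means "could not create/open stdout stderr files", but I do not know how to fix that, since the failure is so intermittent.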
PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name: mc1055
Exec host: m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
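Because the failures are intermittent, it may help to tally the abort e-mails by execution node. The following is a minimal Python sketch of that bookkeeping, not a fix. It assumes the notifications are saved as plain text in the format shown above; `parse_pbs_mail` and `failures_by_node` are hypothetical helper names, not part of PBS.

```python
import re
from collections import Counter

# Field patterns taken from the e-mail layout quoted above (an assumption:
# other PBS/Torque versions may format the notification differently).
PATTERNS = {
    "job_id": r"^PBS Job Id:\s*(\S+)",
    "exec_host": r"^Exec host:\s*(\S+)",
    "exit_status": r"^Exit_status=(-?\d+)",
}


def parse_pbs_mail(text):
    """Extract job id, exec host and exit status from one abort e-mail."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            fields[name] = match.group(1)
    if "exit_status" in fields:
        fields["exit_status"] = int(fields["exit_status"])
    return fields


def failures_by_node(mails):
    """Count e-mails with a non-zero exit status, grouped by node name."""
    counts = Counter()
    for text in mails:
        info = parse_pbs_mail(text)
        if info.get("exit_status", 0) != 0 and "exec_host" in info:
            node = info["exec_host"].split("/")[0]  # "m09/5" -> "m09"
            counts[node] += 1
    return counts


# Shortened copy of the e-mail above, used only as sample input.
SAMPLE_MAIL = """PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name: mc1055
Exec host: m09/5
Aborted by PBS Server
Exit_status=-9
"""

if __name__ == "__main__":
    print(parse_pbs_mail(SAMPLE_MAIL))
    print(failures_by_node([SAMPLE_MAIL]))
```

If most of the -9 aborts turn out to map to one node, that would be worth passing on to the administrators. The sketch cannot say why the files could not be created.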
When a job failed I also looked at qstat -f, shown below. If I do not catch it right away, the job disappears from qstat.
Job Id: 11339.pearl.hpcc.XXX.edu
Job_Name = mc1059
Job_Owner = USERNAME@CLUSTERNAME.hpcc.XXX.edu
resources_used.cput = 00:00:00
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
resources_used.mem = 0kb
resources_used.energy_used = 0
job_state = C
queue = default
server = CLUSTERNAME.hpcc.XXX.edu
Account_Name = ADVISOR
Checkpoint = u
ctime = Mon Jan 4 20:02:25 2021
Error_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.e11339
exec_host = m09/9
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon Jan 4 20:03:14 2021
Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
Priority = 0
qtime = Mon Jan 4 20:02:25 2021
Rerunable = True
Resource_List.nodes = 1:ppn=1
Resource_List.walltime = 50:00:00
Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
Resource_List.nodect = 1
session_id = 0
Variable_List = PBS_O_QUEUE=largeq,PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,PBS_O_PATH=lots of things
PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,PBS_O_WORKDIR=/PATH/TOSCRIPT/run,PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
euser = USERNAME
egroup = physics
queue_type = E
etime = Mon Jan 4 20:02:25 2021
exit_status = -9
submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
/PATH/TOSCRIPT/run/tmp/montec_1059
start_time = Mon Jan 4 20:03:14 2021
start_count = 1
fault_tolerant = False
comp_time = Mon Jan 4 20:03:14 2021
job_radix = 0
total_runtime = 7.218811
submit_host = CLUSTERNAME.hpcc.XXX.edu
init_work_dir = /PATH/TOSCRIPT/run
request_version = 1
req_information.task_count.0 = 1
req_information.lprocs.0 = 1
req_information.thread_usage_policy.0 = allowthreads
req_information.hostlist.0 = m09:ppn=1
req_information.task_usage.0.task.0.cpu_list = 5
req_information.task_usage.0.task.0.mem_list = 1
req_information.task_usage.0.task.0.cores = 0
req_information.task_usage.0.task.0.threads = 1
req_information.task_usage.0.task.0.host = m09
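Since the job record disappears from qstat quickly, one option is to capture `qstat -f` as soon as a job is submitted and pull out the few attributes that matter here. This is a rough Python sketch under assumptions: the function names are hypothetical, only simple `key = value` lines are parsed, and wrapped continuation lines (such as the rest of Variable_List) are ignored.

```python
import subprocess


def capture_qstat(job_id):
    """Run `qstat -f <job_id>` and return its text output (needs qstat on PATH)."""
    result = subprocess.run(
        ["qstat", "-f", job_id], capture_output=True, text=True
    )
    return result.stdout


def parse_qstat_full(text):
    """Parse `qstat -f` text for one job into a dict of attribute -> value."""
    attrs = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Job Id:"):
            attrs["Job Id"] = line.split(":", 1)[1].strip()
        elif " = " in line:
            key, value = line.split(" = ", 1)
            if " " not in key:  # skip wrapped lines that merely contain " = "
                attrs[key] = value.strip()
    return attrs


def summarize(attrs):
    """Reduce the parsed attributes to the fields relevant to this failure."""
    status = int(attrs.get("exit_status", "0"))
    return {
        "job": attrs.get("Job Id"),
        "node": attrs.get("exec_host", "").split("/")[0],
        "exit_status": status,
        "stdout": attrs.get("Output_Path"),
        "failed": status != 0,
    }


# Shortened copy of the listing above, used only as sample input.
SAMPLE_QSTAT = """Job Id: 11339.pearl.hpcc.XXX.edu
    Job_Name = mc1059
    job_state = C
    exec_host = m09/9
    Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
    Variable_List = PBS_O_QUEUE=largeq,PBS_O_HOME=/PATH,
        PBS_O_SHELL=/bin/bash
    exit_status = -9
"""

if __name__ == "__main__":
    print(summarize(parse_qstat_full(SAMPLE_QSTAT)))
```

Both the e-mail and the qstat -f listing in this post name m09 as the execution host. Two samples prove nothing, but collecting more summaries like this would show whether the aborts cluster on particular nodes.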