PBS jobs in the queue sometimes exit immediately

Problem description

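I have been using a PBS-managed compute cluster at my school for a few years. A few months ago I ran into this problem, and nobody there has been able to figure it out. When I submit jobs, they are queued and some start running right away. The jobs that I believe should stay queued for lack of resources instead die almost immediately. This happens intermittently, depending on how many nodes I can use at once. For example, I might submit 10 jobs: the first two run, the next three fail, and the last five run.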

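No stdout or stderr files are created for the failed jobs; the jobs that do run create them as expected. When a job dies I receive an email, which I have included below with some identifying information removed. An exit status of -9 means "could not create/open stdout stderr files", but I do not know how to fix this because it is so intermittent.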

PBS Job Id: 11335.pearl.hpcc.XXX.edu
Job Name:   mc1055
Exec host:  m09/5
Aborted by PBS Server
Job cannot be executed
See Administrator for help
Exit_status=-9
resources_used.cput=00:00:00
resources_used.vmem=0kb
resources_used.walltime=00:00:02
resources_used.mem=0kb
resources_used.energy_used=0
req_information.task_count.0=1
req_information.lprocs.0=1
req_information.thread_usage_policy.0=allowthreads
req_information.hostlist.0=m09:ppn=1
req_information.task_usage.0.task.0={"task":{"cpu_list":"9","mem_list":"0","cores":0,"threads":1,"host":"m09"}}
Error_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.e11335
Output_Path: CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335
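Since the reported meaning of exit status -9 is that the stdout/stderr files could not be created, one thing that could be checked is whether the directory named in Output_Path and Error_Path is visible and writable from the node in Exec host. The following is only a minimal sketch of such a check, not a confirmed diagnosis: the helper names `can_create_file` and `split_pbs_path` are hypothetical, and it assumes the script is run on the compute node itself (for example from a short test job or an interactive session) as the same user.

```python
import os
import tempfile


def split_pbs_path(pbs_path):
    """Split a 'host:/path/to/file' value into (host, path).

    Returns ("", pbs_path) when there is no 'host:' prefix.
    """
    host, sep, path = pbs_path.partition(":")
    return (host, path) if sep else ("", pbs_path)


def can_create_file(directory):
    """Return (ok, reason): can this process create a file in `directory`?"""
    if not os.path.isdir(directory):
        return False, "directory does not exist or is not visible from this host"
    try:
        # Really create (and automatically remove) a file, because a
        # permission check alone can be misleading on network filesystems.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".pbs_write_test_"):
            pass
    except OSError as exc:
        return False, f"cannot create file: {exc}"
    return True, "ok"


if __name__ == "__main__":
    # Placeholder value copied from the redacted email above; substitute the real path.
    host, path = split_pbs_path("CLUSTERNAME.hpcc.XXX.edu:/PATH/TOSCRIPT/run/mc1055.o11335")
    print(host, can_create_file(os.path.dirname(path)))
```

If the check fails on one node but passes on others, that would point at a node-specific mount or permission problem rather than at the job script.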

When a job failed I also looked at its qstat -f output, shown below. If I do not catch it right away, the job disappears from qstat.

Job Id: 11339.pearl.hpcc.XXX.edu
    Job_Name = mc1059
    Job_Owner = USERNAME@CLUSTERNAME.hpcc.XXX.edu
    resources_used.cput = 00:00:00
    resources_used.vmem = 0kb
    resources_used.walltime = 00:00:00
    resources_used.mem = 0kb
    resources_used.energy_used = 0
    job_state = C
    queue = default
    server = CLUSTERNAME.hpcc.XXX.edu
    Account_Name = ADVISOR
    Checkpoint = u
    ctime = Mon Jan  4 20:02:25 2021
    Error_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.e11339
    exec_host = m09/9
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Mon Jan  4 20:03:14 2021
    Output_Path = CLUSTERNAME.hpcc.XXX.edu/PATH/TOSCRIPT/mc1059.o11339
    Priority = 0
    qtime = Mon Jan  4 20:02:25 2021
    Rerunable = True
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 50:00:00
    Resource_List.var = mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2
    Resource_List.nodect = 1
    session_id = 0
    Variable_List = PBS_O_QUEUE=largeq,PBS_O_HOME=/PATH,PBS_O_LOGNAME=USERNAME,PBS_O_PATH=lots of things
    PBS_O_MAIL=/var/spool/mail/USERNAME,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_US,KRB5CCNAME=FILE:/tmp/krb5cc_404112_hd4Yty,PBS_O_WORKDIR=/PATH/TOSCRIPT/run,PBS_O_HOST=CLUSTERNAME.hpcc.XXX.edu,PBS_O_SERVER=CLUSTERNAME.hpcc.XXX.edu
    euser = USERNAME
    egroup = physics
    queue_type = E
    etime = Mon Jan  4 20:02:25 2021
    exit_status = -9
    submit_args = -l var=mkuuid:1e94a3e50dd44803bab2d3a7c2286ee2 -v KRB5CCNAME
     /PATH/TOSCRIPT/run/tmp/montec_1059
    start_time = Mon Jan  4 20:03:14 2021
    start_count = 1
    fault_tolerant = False
    comp_time = Mon Jan  4 20:03:14 2021
    job_radix = 0
    total_runtime = 7.218811
    submit_host = CLUSTERNAME.hpcc.XXX.edu
    init_work_dir = /PATH/TOSCRIPT/run
    request_version = 1
    req_information.task_count.0 = 1
    req_information.lprocs.0 = 1
    req_information.thread_usage_policy.0 = allowthreads
    req_information.hostlist.0 = m09:ppn=1
    req_information.task_usage.0.task.0.cpu_list = 5
    req_information.task_usage.0.task.0.mem_list = 1
    req_information.task_usage.0.task.0.cores = 0
    req_information.task_usage.0.task.0.threads = 1
    req_information.task_usage.0.task.0.host = m09
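Because the failed jobs disappear from qstat quickly, it may help to save the `qstat -f` text whenever a failure is caught and tally the failures by node afterwards. Both failed jobs shown above ran on m09, which may or may not be significant. The sketch below is one possible way to do that; `parse_qstat_f` and `failures_by_node` are hypothetical helper names, and the parser assumes the `key = value` layout shown above, with wrapped values continued on indented lines that are joined without a separator.

```python
import re
from collections import Counter

# An attribute line looks like "    exit_status = -9".
_ATTR = re.compile(r"^\s+([\w.]+) = (.*)$")


def parse_qstat_f(text):
    """Parse a single job record from `qstat -f` output into a dict of strings."""
    job = {}
    last_key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            job["Job_Id"] = line.split(":", 1)[1].strip()
            last_key = None
            continue
        match = _ATTR.match(line)
        if match:
            last_key = match.group(1)
            job[last_key] = match.group(2).strip()
        elif line.strip() and last_key is not None:
            # Assumption: any other non-blank line continues the previous value.
            job[last_key] += line.strip()
    return job


def failures_by_node(jobs, bad_status="-9"):
    """Count jobs whose exit_status equals `bad_status`, grouped by node name."""
    counts = Counter()
    for job in jobs:
        if job.get("exit_status") == bad_status:
            # exec_host looks like "m09/9"; keep only the node part.
            node = job.get("exec_host", "unknown").split("/")[0]
            counts[node] += 1
    return counts
```

Feeding the saved records of several failed jobs through `failures_by_node` would show whether the failures cluster on particular nodes or are spread evenly across the cluster.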
