Problem description
I recently set up a queuing system for a server with one node and 72 CPUs. Here is the conf file:
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=hoffmann
##ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
# ---- Here to get more than one job running per node, seems to cause data transmission failure ----
#SelectType=select/cons_res
#SelectTypeParameters=CR_CPU_Memory
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=hoffmann
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/SlurmctldLogFile
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/SlurmdLogFile
#
#
# COMPUTE NODES
NodeName=hoffmann CPUs=72 CoresPerSocket=18 ThreadsPerCore=2 State=UNKNOWN
PartitionName=queuing Nodes=hoffmann Default=YES MaxTime=INFINITE State=UP
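It works well, with one limitation: whatever I request, every job is allocated all the CPUs, so only one job can run at a time. Here is the batch script I am running: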
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/home/ubuntu/test.out
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=500:00
sleep 50
echo 'done'
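When I launch two of these and check with sinfo -o "%all", I see that the whole node is allocated. I guess I made a mistake in the conf file. Any idea what it could be? Thanks.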
Solution
You need to uncomment the lines under this comment:
# ---- Here to get more than one job running per node, seems to cause data transmission failure ----
That is, uncomment SelectType and SelectTypeParameters.
Did you comment those lines out and add that note yourself? This setting does not cause any failure.
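As a minimal sketch, the SCHEDULING section might look like this after the change. One assumption on my part that the question does not state: the existing SelectType=select/linear line is commented out so that only one SelectType remains active. The parameter spelling follows the slurm.conf man page.

```
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
# select/linear allocates whole nodes to a job, so it is disabled here (assumed change)
#SelectType=select/linear
# Treat individual CPUs and memory as consumable resources
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
```

A change to SelectType normally needs a restart of slurmctld rather than just scontrol reconfigure. Because CR_CPU_Memory also tracks memory, jobs may need an explicit --mem request (or a DefMemPerCPU default in the conf file) before several of them can share the node; this is worth checking against the documentation for your Slurm version.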