SLURM resource limits for child processes

Problem description

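I am having trouble controlling resource usage with SLURM when a job in turn spawns other processes. In case it is relevant, the jobs are submitted by the Cromwell WDL executor, and I would like to ask for general advice before reporting this as an issue to them. The executor submits jobs like this: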

sbatch -J cromwell_eaa218e4_CollectAggregationMetrics -D /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics -o /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stdout -e /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stderr -t 600 -p short \
-c 1 \
--mem 7168 \
--wrap "docker run -v /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics:/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics us.gcr.io/broad-gotc-prod/genomes-in-the-cloud@sha256:93ae8b895f4e83dfc20dd2651e87baccac7dd2cbe0af602bf50afcc4b9e6f925 /bin/bash /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/script"

The job spawns two processes (according to scontrol listpids):

50176    14542    batch  0       0       
50192    14542    batch  -       -
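
To repeat this check for other jobs, the PID column of that listing can be pulled out with a small filter. This is only a sketch: job_pids is a hypothetical helper name, and it assumes the usual scontrol listpids column order (PID, JOBID, STEPID, LOCALID, GLOBALID).

```shell
#!/bin/sh
# Print the PIDs belonging to one job, given `scontrol listpids`-style
# text on stdin (assumed columns: PID JOBID STEPID LOCALID GLOBALID).
# job_pids is a hypothetical helper, not part of SLURM.
job_pids() {
  # Rows whose second column is not the job ID (including any header) are skipped.
  awk -v job="$1" '$2 == job { print $1 }'
}
```

It could be used as `scontrol listpids 14542 | job_pids 14542`. Note that it only lists what SLURM itself tracks, so a process started outside the job's process tree would not appear.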

top shows these two processes:

50176 root      20   0    4628    440    440 S   0.0  0.0   0:00.00 /bin/sh /var/spool/slurmd/job14542/slurm_script                                      
 50192 root      20   0  897336   6912   2364 S   0.0  0.0   0:03.93 docker run -v /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics:/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics us.gcr.io/broad-gotc-prod/genomes-in-the-cloud@sha256:93ae8b895f4e83dfc20dd2651e87baccac7dd2cbe0af602bf50afcc4b9e6f925 /bin/bash /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/script

The content of this script is:

#!/bin/bash

cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
tmpDir=$(mkdir -p "/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/tmp.62f45f38" && echo "/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/tmp.62f45f38")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution

)
outeaa218e4="${tmpDir}/out.$$" erreaa218e4="${tmpDir}/err.$$"
mkfifo "$outeaa218e4" "$erreaa218e4"
trap 'rm "$outeaa218e4" "$erreaa218e4"' EXIT
tee '/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stdout' < "$outeaa218e4" &
tee '/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stderr' < "$erreaa218e4" >&2 &
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution


# These are optionally generated, but need to exist for Cromwell's sake
touch CGND-HDA-00475.gc_bias.detail_metrics \
  CGND-HDA-00475.gc_bias.pdf \
  CGND-HDA-00475.gc_bias.summary_metrics \
  CGND-HDA-00475.insert_size_metrics \
  CGND-HDA-00475.insert_size_histogram.pdf

java -Xms5000m -jar /usr/gitc/picard.jar \
  CollectMultipleMetrics \
  INPUT=/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/inputs/-1469962972/CGND-HDA-00475.bam \
  REFERENCE_SEQUENCE=/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/inputs/-701641790/Homo_sapiens_assembly38.fasta \
  OUTPUT=CGND-HDA-00475 \
  ASSUME_SORTED=true \
  PROGRAM=null \
  PROGRAM=CollectAlignmentSummaryMetrics \
  PROGRAM=CollectInsertSizeMetrics \
  PROGRAM=CollectSequencingArtifactMetrics \
  PROGRAM=QualityScoreDistribution \
  PROGRAM="CollectGcBiasMetrics" \
  METRIC_ACCUMULATION_LEVEL=null \
  METRIC_ACCUMULATION_LEVEL=SAMPLE \
  METRIC_ACCUMULATION_LEVEL=LIBRARY
)  > "$outeaa218e4" 2> "$erreaa218e4"
echo $? > /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc.tmp
(
# add a .file in every empty directory to facilitate directory delocalization on the cloud
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
find . -type d -exec sh -c '[ -z "$(ls -A '"'"'{}'"'"')" ] && touch '"'"'{}'"'"'/.file' \;
)
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
sync


)
mv /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc.tmp /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc

The script above spawns process 50615, which appears to be untracked by SLURM and uses 13G of RAM, which means it is not subject to the cgroup limits. As evidence:

>cat /proc/50176/cgroup
12:pids:/system.slice/slurmd.service
11:net_cls,net_prio:/
10:hugetlb:/
9:cpuset:/slurm/uid_0/job_14542/step_batch
8:freezer:/slurm/uid_0/job_14542/step_batch
7:memory:/slurm/uid_0/job_14542/step_batch
6:rdma:/
5:blkio:/system.slice/slurmd.service
4:perf_event:/
3:devices:/system.slice/slurmd.service
2:cpu,cpuacct:/system.slice/slurmd.service
1:name=systemd:/system.slice/slurmd.service
0::/system.slice/slurmd.service

>cat /proc/50615/cgroup
12:pids:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
11:net_cls,net_prio:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
10:hugetlb:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
9:cpuset:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
8:freezer:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
7:memory:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
6:rdma:/
5:blkio:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
4:perf_event:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
3:devices:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
2:cpu,cpuacct:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
1:name=systemd:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
0::/system.slice/containerd.service
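
The comparison between the two listings can be reduced to the memory controller line. The following is a minimal sketch, assuming the cgroup v1 layout shown above (ID:controllers:path); memory_cgroup and in_slurm_cgroup are hypothetical helper names.

```shell
#!/bin/sh
# Print the memory-controller cgroup path from /proc/<pid>/cgroup-style
# text on stdin (cgroup v1 format: "ID:controllers:path").
memory_cgroup() {
  awk -F: '$2 == "memory" { print $3 }'
}

# Hypothetical helper: report whether a PID sits under the /slurm hierarchy.
in_slurm_cgroup() {
  pid=$1
  path=$(memory_cgroup < "/proc/$pid/cgroup")
  case "$path" in
    /slurm/*) echo "$pid: inside SLURM cgroup $path" ;;
    *)        echo "$pid: outside SLURM cgroup ($path)" ;;
  esac
}
```

With the listings above, `in_slurm_cgroup 50176` would report the /slurm/uid_0/job_14542/step_batch path, while `in_slurm_cgroup 50615` would report the /docker/... path.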

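The docker/<hash> cgroup in the listing above does exist, but it has no meaningful limits. Why aren't the constraints inherited, and where does this new docker cgroup come from?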
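
For context: the docker command run by the batch script is only a client. It asks the Docker daemon to start the container, so the container's processes descend from the daemon (via containerd) rather than from the SLURM batch step, and they are placed in a cgroup created by Docker. One possible workaround, sketched below and not a confirmed fix, is to pass the SLURM allocation to docker run explicitly through its --memory and --cpus options. docker_limit_args is a hypothetical helper. The sketch assumes SLURM_MEM_PER_NODE (in megabytes, set when --mem is given) and SLURM_CPUS_PER_TASK (set when -c is given) are present in the job environment.

```shell
#!/bin/sh
# Build docker run resource flags from the SLURM allocation.
# Assumptions: SLURM_MEM_PER_NODE is in megabytes (set by --mem) and
# SLURM_CPUS_PER_TASK is set by -c; a variable that is unset adds no flag.
# docker_limit_args is a hypothetical helper name.
docker_limit_args() {
  args=""
  if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
    args="--memory=${SLURM_MEM_PER_NODE}m"
  fi
  if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
    args="${args:+$args }--cpus=${SLURM_CPUS_PER_TASK}"
  fi
  printf '%s\n' "$args"
}
```

Inside the wrapped command this would be used as `docker run $(docker_limit_args) -v ... <image> /bin/bash <script>`. For the submission above (-c 1, --mem 7168) it would expand to --memory=7168m --cpus=1. Because the script starts the JVM with -Xms5000m, a Docker memory limit would then apply to the process that currently reaches 13G.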
