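Problem description
I am having trouble controlling resource usage with SLURM when a task in turn spawns other processes. In case it is relevant, this is done by the Cromwell WDL executor, and I want to check for general advice before reporting it as an issue on their side. The executor spawns a process like this: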
sbatch -J cromwell_eaa218e4_CollectAggregationMetrics -D /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics -o /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stdout -e /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stderr -t 600 -p short \
-c 1 \
--mem 7168 \
--wrap "docker run -v /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics:/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics us.gcr.io/broad-gotc-prod/genomes-in-the-cloud@sha256:93ae8b895f4e83dfc20dd2651e87baccac7dd2cbe0af602bf50afcc4b9e6f925 /bin/bash /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/script"
The job spawns two processes (according to scontrol listpids):
50176 14542 batch 0 0
50192 14542 batch - -
top shows these two processes:
50176 root 20 0 4628 440 440 S 0.0 0.0 0:00.00 /bin/sh /var/spool/slurmd/job14542/slurm_script
50192 root 20 0 897336 6912 2364 S 0.0 0.0 0:03.93 docker run -v /data/og/NYG/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics:/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics us.gcr.io/broad-gotc-prod/genomes-in-the-cloud@sha256:93ae8b895f4e83dfc20dd2651e87baccac7dd2cbe0af602bf50afcc4b9e6f925 /bin/bash /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/script
The contents of this script are:
#!/bin/bash
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
tmpDir=$(mkdir -p "/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/tmp.62f45f38" && echo "/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/tmp.62f45f38")
chmod 777 "$tmpDir"
export _JAVA_OPTIONS=-Djava.io.tmpdir="$tmpDir"
export TMPDIR="$tmpDir"
export HOME="$HOME"
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
)
outeaa218e4="${tmpDir}/out.$$" erreaa218e4="${tmpDir}/err.$$"
mkfifo "$outeaa218e4" "$erreaa218e4"
trap 'rm "$outeaa218e4" "$erreaa218e4"' EXIT
tee '/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stdout' < "$outeaa218e4" &
tee '/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/stderr' < "$erreaa218e4" >&2 &
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
# These are optionally generated, but need to exist for Cromwell's sake
touch CGND-HDA-00475.gc_bias.detail_metrics \
CGND-HDA-00475.gc_bias.pdf \
CGND-HDA-00475.gc_bias.summary_metrics \
CGND-HDA-00475.insert_size_metrics \
CGND-HDA-00475.insert_size_histogram.pdf
java -Xms5000m -jar /usr/gitc/picard.jar \
CollectMultipleMetrics \
INPUT=/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/inputs/-1469962972/CGND-HDA-00475.bam \
REFERENCE_SEQUENCE=/cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/inputs/-701641790/Homo_sapiens_assembly38.fasta \
OUTPUT=CGND-HDA-00475 \
ASSUME_SORTED=true \
PROGRAM=null \
PROGRAM=CollectAlignmentSummaryMetrics \
PROGRAM=CollectInsertSizeMetrics \
PROGRAM=CollectSequencingArtifactMetrics \
PROGRAM=QualityScoreDistribution \
PROGRAM="CollectGcBiasMetrics" \
METRIC_ACCUMULATION_LEVEL=null \
METRIC_ACCUMULATION_LEVEL=SAMPLE \
METRIC_ACCUMULATION_LEVEL=LIBRARY
) > "$outeaa218e4" 2> "$erreaa218e4"
echo $? > /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc.tmp
(
# add a .file in every empty directory to facilitate directory delocalization on the cloud
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
find . -type d -exec sh -c '[ -z "$(ls -A '"'"'{}'"'"')" ] && touch '"'"'{}'"'"'/.file' \;
)
(
cd /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution
sync
)
mv /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc.tmp /cromwell-executions/WholeGenomeGermlineSingleSample/c2fcae53-0cab-43cc-98aa-2ac3da5d7818/call-AggregatedBamQC/AggregatedBamQC/eaa218e4-d1ca-48f7-bbdc-dc43d0dc9472/call-CollectAggregationMetrics/execution/rc
That last script spawns a process, 50615, which appears to be untracked and is using 13G of RAM, which means it is not subject to the cgroup rules. As evidence:
>cat /proc/50176/cgroup
12:pids:/system.slice/slurmd.service
11:net_cls,net_prio:/
10:hugetlb:/
9:cpuset:/slurm/uid_0/job_14542/step_batch
8:freezer:/slurm/uid_0/job_14542/step_batch
7:memory:/slurm/uid_0/job_14542/step_batch
6:rdma:/
5:blkio:/system.slice/slurmd.service
4:perf_event:/
3:devices:/system.slice/slurmd.service
2:cpu,cpuacct:/system.slice/slurmd.service
1:name=systemd:/system.slice/slurmd.service
0::/system.slice/slurmd.service
>cat /proc/50615/cgroup
12:pids:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
11:net_cls,net_prio:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
10:hugetlb:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
9:cpuset:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
8:freezer:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
7:memory:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
6:rdma:/
5:blkio:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
4:perf_event:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
3:devices:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
2:cpu,cpuacct:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
1:name=systemd:/docker/31faaa50328998b41dcd772ddcc0673ad02ec05284be945e14ae2782a428275f
0::/system.slice/containerd.service
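The two listings above can also be compared programmatically. The following is a minimal sketch, assuming the cgroup v1 "ID:controllers:path" line format shown above; the function names cgroup_path and same_cgroup are illustrative and are not part of SLURM or Docker.

```shell
#!/bin/bash
# Print the cgroup path of one controller (default: memory) from a
# listing in the "ID:controllers:path" format of /proc/<pid>/cgroup.
# Hypothetical helper, written only for illustration.
cgroup_path() {
    local file=$1 controller=${2:-memory}
    awk -F: -v c="$controller" '
        {
            # Field 2 may name several controllers, e.g. "cpu,cpuacct".
            n = split($2, names, ",")
            for (i = 1; i <= n; i++)
                if (names[i] == c) {
                    # Re-join fields 3..NF in case the path contains ":".
                    path = $3
                    for (j = 4; j <= NF; j++) path = path ":" $j
                    print path
                    exit
                }
        }' "$file"
}

# Succeed only if both listings place the controller in the same cgroup.
same_cgroup() {
    [ "$(cgroup_path "$1" "$3")" = "$(cgroup_path "$2" "$3")" ]
}

# Example call with the PIDs from this question:
#   same_cgroup /proc/50176/cgroup /proc/50615/cgroup memory \
#       || echo "processes are in different memory cgroups"
```

For the two processes above, the memory paths differ (/slurm/uid_0/job_14542/step_batch versus /docker/31faaa50...), so same_cgroup would return a non-zero status.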
The docker/hash cgroup in the listing above does exist, but it has no meaningful limits. Why are the constraints not inherited? Where does this new docker cgroup come from?
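For context, docker run is only a client: the container is started by the Docker daemon, so the processes inside it descend from the daemon rather than from the SLURM batch step, and Docker places them in a cgroup of its own. One possible mitigation is to pass the limits to Docker explicitly. The sketch below assumes that SLURM_MEM_PER_NODE (the --mem value, in megabytes) and SLURM_CPUS_PER_TASK (the -c value) are set in the job environment; the function name docker_limit_flags is hypothetical, and this has not been tested with Cromwell.

```shell
#!/bin/bash
# Build "docker run" resource flags from the SLURM job environment.
# Assumptions: SLURM_MEM_PER_NODE is in megabytes and SLURM_CPUS_PER_TASK
# is a CPU count; either may be unset, in which case its flag is omitted.
docker_limit_flags() {
    local flags=""
    if [ -n "${SLURM_MEM_PER_NODE:-}" ]; then
        flags="--memory ${SLURM_MEM_PER_NODE}m"
    fi
    if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
        # Add a separating space only if a memory flag is already present.
        flags="${flags:+$flags }--cpus ${SLURM_CPUS_PER_TASK}"
    fi
    printf '%s\n' "$flags"
}

# Example use inside the batch script (left unquoted on purpose so the
# flags split into separate words):
#   docker run $(docker_limit_flags) -v ... <image> /bin/bash <script>
```

With the sbatch options shown earlier (-c 1 --mem 7168), this would print "--memory 7168m --cpus 1". These limits are enforced by Docker in the container's own cgroup; the container is still not part of the SLURM job's cgroup.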