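Problem description
I have a long-running task submitted to a Dask cluster (each worker runs 1 process and 1 thread), and I am using tracemalloc to track memory usage. The task can run long enough that memory usage builds up and causes all sorts of problems. Here is the structure of how I use tracemalloc.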
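import tracemalloc

def task():
    tracemalloc.start()
    ...
    # Baseline snapshot taken before the long-running loop
    snapshot1 = tracemalloc.take_snapshot()
    for i in range(10):
        ...
        # Compare the current allocations against the baseline on every iteration
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, "lineno")
        print("[ Top 6 differences ]")
        for stat in top_stats[:6]:
            print(str(stat))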
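I get the following output (cleaned up a little), which shows the profiler in Dask distributed accumulating memory. This is after the second iteration, and these numbers grow linearly.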
[ Top 6 differences ]
/usr/local/lib/python3.8/site-packages/distributed/profile.py:112: size=137 MiB (+113 MiB), count=1344168 (+1108779), average=107 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:68: size=135 MiB (+110 MiB), count=1329005 (+1095393), average=106 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:48: size=93.7 MiB (+78.6 MiB), count=787568 (+655590), average=125 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:118: size=82.3 MiB (+66.5 MiB), count=513462 (+414447), average=168 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:67: size=64.4 MiB (+53.1 MiB), count=778747 (+647905), average=87 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:115: size=48.1 MiB (+40.0 MiB), count=787415 (+655449), average=64 B
Does anyone know how to clean up the profiler, or how to avoid using it at all (we don't use the dashboard, so we don't need it)?
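For anyone who wants to reproduce the measurement outside a cluster, here is a minimal, self-contained sketch of the same snapshot comparison. The names `measure_growth`, `leaky_work`, and `retained` are hypothetical stand-ins and are not part of the original task; the leak is deliberate so that the diff has something to report.

```python
import tracemalloc


def measure_growth(work, iterations=3, top=6):
    """Run work() repeatedly and return the largest allocation differences."""
    tracemalloc.start()
    baseline = tracemalloc.take_snapshot()
    for _ in range(iterations):
        work()
    current = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Entries are sorted by the absolute size difference, largest first
    return current.compare_to(baseline, "lineno")[:top]


# Hypothetical stand-in for a leaking task: each call keeps 10,000 new strings alive
retained = []


def leaky_work():
    retained.extend(str(i) * 10 for i in range(10_000))


top_stats = measure_growth(leaky_work)
print("[ Top 6 differences ]")
for stat in top_stats:
    print(stat)
```

On a real worker, the snapshots could probably be narrowed to the profiler module with `snapshot.filter_traces((tracemalloc.Filter(True, "*/distributed/profile.py"),))` before comparing, so that only allocations made from that file show up.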
Solution
I set the following environment variables on the worker pods, which greatly reduces the amount of profiling. It seems to work.
DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms
DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms
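As far as I understand Dask's configuration rules, an environment variable is turned into a config key by dropping the `DASK_` prefix, lowercasing the rest, and treating each double underscore as a nesting level. The helper below, `env_to_config_key`, is a hypothetical illustration of that naming convention only; it is not Dask's own code and does not touch any Dask configuration.

```python
def env_to_config_key(name, prefix="DASK_"):
    """Translate an environment variable name into a dotted config key.

    Illustrative sketch of the naming convention, not Dask's implementation.
    """
    if not name.startswith(prefix):
        raise ValueError(f"{name!r} does not start with {prefix!r}")
    # Drop the prefix, lowercase, and map "__" to a nesting separator
    return name[len(prefix):].lower().replace("__", ".")


interval_key = env_to_config_key("DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL")
cycle_key = env_to_config_key("DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE")
print(interval_key)
print(cycle_key)
```

So the two variables should correspond to the `interval` and `cycle` entries under `distributed.worker.profile`.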
The default values can be found here: https://github.com/dask/distributed/blob/master/distributed/distributed.yaml#L74-L76
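To get a rough feel for how much less the profiler runs with these settings, here is a small back-of-the-envelope calculation. It assumes the defaults at that link are `interval: 10ms` and `cycle: 1000ms`; please check the file for your installed version, since that is an assumption on my part. `parse_ms` is a hypothetical helper that only understands the `ms` suffix.

```python
def parse_ms(text):
    """Parse a duration such as '10000ms' into whole milliseconds."""
    if not text.endswith("ms"):
        raise ValueError(f"only 'ms' durations are handled, got {text!r}")
    return int(text[:-2])


# Assumed defaults (not confirmed here; check distributed.yaml for your version)
ASSUMED_DEFAULT_INTERVAL = "10ms"
ASSUMED_DEFAULT_CYCLE = "1000ms"

# Values set on the worker pods
configured_interval = "10000ms"
configured_cycle = "1000000ms"

# How many times less often each timer fires compared with the assumed defaults
interval_reduction = parse_ms(configured_interval) / parse_ms(ASSUMED_DEFAULT_INTERVAL)
cycle_reduction = parse_ms(configured_cycle) / parse_ms(ASSUMED_DEFAULT_CYCLE)

print(f"samples taken {interval_reduction:.0f}x less often")
print(f"cycles started {cycle_reduction:.0f}x less often")
```

Under those assumed defaults, both timers fire about a thousand times less often, which would explain why the memory held by the profiler grows so much more slowly.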
ETA: @rpanai this is what we use in our K8s manifest for the deployment:
spec:
  template:
    spec:
      containers:
      - env:
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL
          value: 10000ms
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE
          value: 1000000ms