Memory leak from ever-increasing memory usage in the Dask distributed profiler?

Problem description

I have a long-running task submitted to a Dask cluster (the workers each run 1 process and 1 thread), and I use tracemalloc to track memory usage. The task runs long enough that the growing memory usage causes various problems. Here is the structure of how I use tracemalloc.

import tracemalloc

def task():
    tracemalloc.start()
    ...
    snapshot1 = tracemalloc.take_snapshot()
    for i in range(10):
        ...
        snapshot2 = tracemalloc.take_snapshot()
        top_stats = snapshot2.compare_to(snapshot1, "lineno")
        print("[ Top 6 differences ]")
        for stat in top_stats[:6]:
            print(stat)
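As a sanity check that this pattern actually catches growth, here is a self-contained sketch using only the standard library; the growing list is a stand-in for whatever the real task allocates between snapshots:

```python
import tracemalloc

tracemalloc.start()
snapshot1 = tracemalloc.take_snapshot()

# Simulate a structure that grows across iterations.
leak = []
for _ in range(1_000):
    leak.append("x" * 100)

snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, "lineno")
# Each stat prints as "file:line: size=... (+diff), count=..., average=..."
for stat in top_stats[:6]:
    print(stat)
```

The `+` diffs in the output correspond to the `size_diff`/`count_diff` attributes on each statistic, which is what surfaces the profiler lines below.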

I get the following (cleaned up a little), which suggests that the profiler in dask distributed is accumulating memory. This is after the second iteration, and these memory counts grow linearly.

[ Top 6 differences ]
/usr/local/lib/python3.8/site-packages/distributed/profile.py:112:
    size=137 MiB (+113 MiB), count=1344168 (+1108779), average=107 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:68:
    size=135 MiB (+110 MiB), count=1329005 (+1095393), average=106 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:48:
    size=93.7 MiB (+78.6 MiB), count=787568 (+655590), average=125 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:118:
    size=82.3 MiB (+66.5 MiB), count=513462 (+414447), average=168 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:67:
    size=64.4 MiB (+53.1 MiB), count=778747 (+647905), average=87 B
/usr/local/lib/python3.8/site-packages/distributed/profile.py:115:
    size=48.1 MiB (+40.0 MiB), count=787415 (+655449), average=64 B

Does anyone know how to clean up the profiler, or how to avoid using it altogether (we don't use the dashboard, so we don't need it)?

Workaround

I set the following environment variables on the worker pods, which greatly reduces how often profiling runs. It seems to work.

DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL=10000ms 
DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE=1000000ms
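The same settings can also be applied from Python, since Dask maps `DASK_`-prefixed environment variables onto its configuration tree (double underscores become nesting levels). A minimal sketch, assuming the variables must be set before `dask.distributed` is imported in the worker process:

```python
import os

# Dask translates DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL into the
# config key distributed.worker.profile.interval (and likewise for CYCLE).
# These must be in the environment before dask reads its config at import.
os.environ["DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL"] = "10000ms"
os.environ["DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE"] = "1000000ms"
```

Alternatively, `dask.config.set` can apply the same keys programmatically, but the environment-variable route works without any code changes in the workers.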

The defaults can be found here: https://github.com/dask/distributed/blob/master/distributed/distributed.yaml#L74-L76
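For non-Kubernetes deployments, the equivalent settings can go in a Dask config file (e.g. `~/.config/dask/distributed.yaml`). This fragment mirrors the environment variables above; the key names follow the standard `distributed.worker.profile` layout from the linked defaults file, and the comments are my reading of what each knob controls:

```yaml
distributed:
  worker:
    profile:
      interval: 10000ms    # how often the profiler samples the running thread
      cycle: 1000000ms     # how often accumulated samples are aggregated
```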

ETA: @rpanai This is what we use for the deployment in our K8s manifest

spec:
  template:
    spec:
      containers:
      - env:
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__INTERVAL
          value: 10000ms
        - name: DASK_DISTRIBUTED__WORKER__PROFILE__CYCLE
          value: 1000000ms