TensorFlow / WSL2: GPU runs out of memory and cannot use all available memory?

Problem description

I am trying to fine-tune a medium-sized model on a TITAN RTX (24 GB) under WSL2, but it appears to run out of memory. Small models fit fine. If I boot the same machine into a live Ubuntu instead, the medium and large models train without any problem.

```
2020-09-23 13:19:36.310992: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23b7a0000 next 260 of size 4194304
2020-09-23 13:19:36.310995: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23bba0000 next 266 of size 16777216
2020-09-23 13:19:36.310998: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23cba0000 next 268 of size 16777216
2020-09-23 13:19:36.311001: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23dba0000 next 270 of size 12582912
2020-09-23 13:19:36.311004: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23e7a0000 next 272 of size 4194304
2020-09-23 13:19:36.311006: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23eba0000 next 278 of size 16777216
2020-09-23 13:19:36.311009: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x23fba0000 next 280 of size 16777216
2020-09-23 13:19:36.311012: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x240ba0000 next 282 of size 12582912
2020-09-23 13:19:36.311015: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2417a0000 next 284 of size 4194304
2020-09-23 13:19:36.311020: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x241ba0000 next 290 of size 16777216
2020-09-23 13:19:36.311023: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x242ba0000 next 18446744073709551615 of size 29360128
2020-09-23 13:19:36.311026: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 130543104
2020-09-23 13:19:36.311029: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2447a0000 next 294 of size 12582912
2020-09-23 13:19:36.311032: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2453a0000 next 296 of size 4194304
2020-09-23 13:19:36.311035: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2457a0000 next 302 of size 16777216
2020-09-23 13:19:36.311037: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2467a0000 next 304 of size 16777216
2020-09-23 13:19:36.311040: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2477a0000 next 306 of size 12582912
2020-09-23 13:19:36.311043: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2483a0000 next 308 of size 4194304
2020-09-23 13:19:36.311046: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2487a0000 next 314 of size 16777216
2020-09-23 13:19:36.311049: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x2497a0000 next 316 of size 16777216
2020-09-23 13:19:36.311052: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24a7a0000 next 318 of size 12582912
2020-09-23 13:19:36.311055: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 0x24b3a0000 next 320 of size 4194304
2020-09-23 13:19:36.311058: I tensorflow/core/common_runtime/bfc_allocator.cc:905] Free  at 0x24b7a0000 next 18446744073709551615 of size 13102592
2020-09-23 13:19:36.311061: I tensorflow/core/common_runtime/bfc_allocator.cc:914]      Summary of in-use Chunks by size: 
2020-09-23 13:19:36.311065: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 98 Chunks of size 256 totalling 24.5KiB
2020-09-23 13:19:36.311069: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 113 Chunks of size 4096 totalling 452.0KiB
2020-09-23 13:19:36.311073: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 12288 totalling 228.0KiB
2020-09-23 13:19:36.311076: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 16384 totalling 288.0KiB
2020-09-23 13:19:36.311079: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 32256 totalling 31.5KiB
2020-09-23 13:19:36.311083: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 19 Chunks of size 4194304 totalling 76.00MiB
2020-09-23 13:19:36.311086: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 18 Chunks of size 12582912 totalling 216.00MiB
2020-09-23 13:19:36.311089: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 36 Chunks of size 16777216 totalling 576.00MiB
2020-09-23 13:19:36.311093: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 29360128 totalling 28.00MiB
2020-09-23 13:19:36.311096: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 268435456 totalling 256.00MiB
2020-09-23 13:19:36.311099: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 1.13GiB
2020-09-23 13:19:36.311102: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 1222110720 memory_limit_: 68719476736 available bytes: 67497366016 curr_region_allocation_bytes_: 2147483648
2020-09-23 13:19:36.311108: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                 68719476736
InUse:                  1209008128
MaxInUse:               1209008128
NumAllocs:                     762
MaxAllocSize:            268435456
```

Not sure what to do from here.

Solution

OOM problems can have many causes; below are some common ones and the corresponding workarounds.

  • Make sure you are not running evaluation and training on the same GPU: the evaluation process holds on to memory and can trigger the OOM. Try running evaluation on a different GPU.
  • Reducing the batch size slows training down, but avoids the OOM problem.
  • If your data is large (for example image data), try reducing its size, or feed it through the tf.data.Dataset format to cut memory consumption (see the sketch after this list).
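
As a minimal sketch of the last two points (and the optional second-GPU evaluation), the snippet below uses synthetic placeholder arrays and a toy Keras model — those names and shapes are assumptions for illustration, not from the original question. It streams small batches through tf.data.Dataset instead of feeding whole arrays, and pins evaluation to a second GPU only if one is actually present:

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data so the sketch runs end to end; swap in your own arrays.
train_images = np.random.rand(1024, 32).astype("float32")
train_labels = np.random.randint(0, 10, size=(1024,))
eval_images = np.random.rand(256, 32).astype("float32")
eval_labels = np.random.randint(0, 10, size=(256,))

# A smaller batch size trades training speed for a lower peak memory footprint.
BATCH_SIZE = 8

# Feed data through tf.data.Dataset so only one small batch at a time
# needs to be resident on the GPU, rather than the whole dataset.
train_ds = (
    tf.data.Dataset.from_tensor_slices((train_images, train_labels))
    .shuffle(1024)
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)
eval_ds = (
    tf.data.Dataset.from_tensor_slices((eval_images, eval_labels))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Toy model used only to make the example self-contained.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

model.fit(train_ds, epochs=1)

# If a second GPU exists, pin evaluation to it so training and evaluation
# do not compete for the same card's memory; otherwise evaluate as usual.
if len(tf.config.list_physical_devices("GPU")) > 1:
    with tf.device("/GPU:1"):
        model.evaluate(eval_ds)
else:
    model.evaluate(eval_ds)
```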