Out-of-memory error when submitting a TensorFlow 2 job on Google AI Platform Engine

Problem description

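I am trying to submit a TensorFlow 2 training job (fine-tuning an object detection model) to Google AI Platform Engine using gcloud. My dataset is not large (the raccoon dataset, about 10M). I have tried many configurations, but every time I get the same error: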

The replica master 0 ran out-of-memory and exited with a non-zero status of 9(SIGKILL)

My command:

gcloud ai-platform jobs submit training OD_ssd_fpn_large \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-east1 \
--config cloud.yml \
--  \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}

My latest attempt at the cloud.yml file used the large_model machine type:

trainingInput:
  runtimeVersion: "2.2"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: large_model
  workerCount: 5
  workerType: large_model
  parameterServerCount: 3
  parameterServerType: large_model

The error is always the same. Any hints or help would be greatly appreciated.

Solution

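Reading all the data consumes RAM, which is why the job runs out of memory. You need a larger instance type (large_model or complex_model_l; see the documentation on machine types for more information).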

trainingInput:
  scaleTier: CUSTOM
  masterType: n1-highcpu-16
  workerType: n1-highcpu-16
  parameterServerType: n1-highmem-8
  evaluatorType: n1-highcpu-16
  workerCount: 9
  parameterServerCount: 3
  evaluatorCount: 1

Alternatively, you need to reduce the dataset.
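The answer does not say how to reduce the dataset. One possible approach, shown here only as a minimal sketch, is to pick a reproducible random subset of the example files before regenerating the training records. The helper name subsample_examples and the file names are hypothetical and used only for illustration; adapt them to however your data is actually stored.

```python
import random


def subsample_examples(paths, fraction, seed=0):
    """Return a reproducible random subset of `paths` (hypothetical helper)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Sort first so the result does not depend on the input order.
    ordered = sorted(paths)
    if not ordered:
        return []
    # Keep at least one example, even for very small fractions.
    keep = max(1, int(len(ordered) * fraction))
    # A dedicated Random instance keeps the selection repeatable per seed.
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, keep))


# Hypothetical file names, used only for illustration.
all_images = [f"images/raccoon-{i}.jpg" for i in range(1, 201)]
subset = subsample_examples(all_images, fraction=0.25)
print(len(subset))  # 50
```

Fixing the seed means the same subset is selected on every run, so a failing and a succeeding job can be compared on identical data.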

gcloud object-detection-api out-of-memory tensorflow2.0