TensorFlow 1.15.3 inside Docker crashes with OOM when running inference with a VGG19 model on the Jetson Nano

Problem description

I have a Jetson Nano. I downloaded the SD card image with JetPack 4.4 from jetson-nano-sd-card-image and created a Docker base image using the following Dockerfile:

FROM nvcr.io/nvidia/l4t-base:r32.4.3

WORKDIR /

RUN apt-get update && apt-get install -y --fix-missing make g++

RUN apt-get install -y --fix-missing python3-pip

RUN apt-get install -y python3-h5py

RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata

RUN apt-get install -y python3-opencv

RUN apt-get install -y python3-scipy

RUN apt-get install -y python3-dev

RUN pip3 install numpy cython

RUN apt-get install -y libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran

RUN pip3 install -U pip testresources setuptools

RUN pip3 install -U numpy==1.16.1 future==0.18.2 mock==3.0.5 keras_preprocessing==1.1.1 keras_applications==1.0.8 gast==0.2.2 futures protobuf pybind11

RUN pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 'tensorflow==1.15.3'

RUN pip3 install Keras==2.3.1

RUN apt-get install -y python3-opencv unzip autoconf build-essential libtool

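The goal is to infer the class of an image using a pretrained VGG19 classification TensorFlow model that has been optimized with TensorRT.

I start the Docker container like this: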

docker run -it --gpus all --shm-size=4g --ulimit memlock=-1 inferencecontainer

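My script loads the frozen graph from a given path, creates a Session with the flag tf_config.gpu_options.allow_growth = True, and obtains the input and output tensors by name using tf_sess.graph.get_tensor_by_name().

This is the log from the TensorFlow device creation step: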
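For readers who want to reproduce the setup, the steps described above could look roughly like the sketch below. It is not the actual Predictor code from the question; the function names and the `input_name`/`output_name` parameters are placeholders, and it assumes the TensorFlow 1.15 API (`tf.GraphDef`, `tf.gfile.GFile`, `tf.ConfigProto`, `tf.Session`).

```python
def load_frozen_graph(path):
    """Read a serialized GraphDef (frozen graph) from disk."""
    # Imported inside the function so the module itself can be loaded
    # on a machine that does not have TensorFlow installed.
    import tensorflow as tf  # assumes TensorFlow 1.15

    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, "rb") as f:
        graph_def.ParseFromString(f.read())
    return graph_def


def build_session(graph_def, input_name, output_name):
    """Import the graph, open a Session and look up the I/O tensors.

    input_name and output_name are hypothetical tensor names such as
    "input_1:0"; the real names depend on how the model was exported.
    """
    import tensorflow as tf  # assumes TensorFlow 1.15

    # Import the frozen graph into the default graph without a name prefix.
    tf.import_graph_def(graph_def, name="")

    # Let the GPU allocator grow on demand instead of reserving memory upfront.
    tf_config = tf.ConfigProto()
    tf_config.gpu_options.allow_growth = True
    tf_sess = tf.Session(config=tf_config)

    input_tensor = tf_sess.graph.get_tensor_by_name(input_name)
    output_tensor = tf_sess.graph.get_tensor_by_name(output_name)
    return tf_sess, input_tensor, output_tensor
```

Inference would then be a call such as tf_sess.run(output_tensor, {input_tensor: image_batch}), where image_batch is an array of the expected input size.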

2020-09-25 21:09:32.042986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 65 MB memory) -> physical GPU (device: 0,name: NVIDIA Tegra X1,pci bus id: 0000:00:00.0,compute capability: 5.3)

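(Only 65 MB of memory is allocated.)

When I run the session with tf_sess.run(output_tensor, feed_dict), where feed_dict contains a loaded image of the expected input size, it crashes with the following trace: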

2020-09-25 21:10:28.983061: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 59.51MiB
2020-09-25 21:10:28.983097: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 66060288 memory_limit_: 68411392 available bytes: 2351104 curr_region_allocation_bytes_: 67108864
2020-09-25 21:10:28.983141: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats: 
Limit:                    68411392
InUse:                    62403584
MaxInUse:                 62403584
NumAllocs:                      26
MaxAllocSize:             14680064

2020-09-25 21:10:28.983191: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *********xx********____***********************xxx********************************************xxxxxxx
2020-09-25 21:10:28.983454: W tensorflow/core/framework/op_kernel.cc:1628] OP_REQUIRES failed at constant_op.cc:77 : Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
2020-09-25 21:10:28.983619: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [3,512] and type float
     [[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1365,in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1350,in _run_fn
    target_list,run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1443,in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,512] and type float
     [[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]

During handling of the above exception,another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py",line 193,in _run_module_as_main
    "__main__",mod_spec)
  File "/usr/lib/python3.6/runpy.py",line 85,in _run_code
    exec(code,run_globals)
  File "/app/app/main.py",line 17,in <module>
    prediction = predictor.predict_frame(image)
  File "/app/app/Predictor.py",line 81,in predict_frame
    preds = self.tf_sess.run(self.output_tensor,feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 956,in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1180,in _run
    feed_dict_tensor,options,line 1359,in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1384,in _do_call
    raise type(e)(node_def,op,message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,512] and type float
     [[node vgg19/block5_conv2/Conv2D/ReadVariableOp (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Original stack trace for 'vgg19/block5_conv2/Conv2D/ReadVariableOp':
  File "usr/lib/python3.6/runpy.py",mod_spec)
  File "usr/lib/python3.6/runpy.py",run_globals)
  File "app/app/main.py",line 10,in <module>
    predictor = Predictor(trt_model_path,class_labels,image_size)
  File "app/app/Predictor.py",line 26,in __init__
    tf.import_graph_def(trt_graph,name="")
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py",line 513,in new_func
    return func(*args,**kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 405,in import_graph_def
    producer_op_list=producer_op_list)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 517,in _import_graph_def_internal
    _ProcessNewOps(graph)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 243,in _ProcessNewOps
    for new_op in graph._add_new_tf_operations(compute_devices=False):  # pylint: disable=protected-access
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 3561,in _add_new_tf_operations
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",in <listcomp>
    for c_op in c_api_util.new_tf_operations(self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 3451,in _create_op_from_tf_operation
    ret = Operation(c_op,self)
  File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 1748,in __init__
    self._traceback = tf_stack.extract_stack()
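As a rough sanity check on the numbers in the log above, the snippet below compares the size of the [3,3,512,512] float tensor from the first OOM message with the memory left under the allocator limit. It only does arithmetic on values copied from the log and assumes 4 bytes per float32 element; the helper name tensor_bytes is made up for illustration.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4
MIB = 1024 * 1024


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Values copied from the bfc_allocator stats in the log.
memory_limit = 68411392  # "Limit" / memory_limit_
in_use = 62403584        # "InUse"

needed = tensor_bytes([3, 3, 512, 512])  # tensor named in the OOM message
free_total = memory_limit - in_use       # upper bound on what is still free

print("limit:  %.2f MiB" % (memory_limit / MIB))
print("in use: %.2f MiB" % (in_use / MIB))
print("free:   %.2f MiB" % (free_total / MIB))
print("needed: %.2f MiB" % (needed / MIB))
print("fits:   %s" % (needed <= free_total))
```

The single convolution kernel needs 9 MiB, while less than 6 MiB remains under the 65 MB limit, so the allocation could not succeed even without fragmentation.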

Any ideas about what is causing the problem?

Thanks!


nvidia nvidia-docker nvidia-jetson-nano tensorflow tensorrt
