Problem description
I have a Jetson Nano. I downloaded the JetPack 4.4 SD card image (jetson-nano-sd-card-image) and built a Docker base image with the following Dockerfile:
FROM nvcr.io/nvidia/l4t-base:r32.4.3
WORKDIR /
RUN apt-get update && apt-get install -y --fix-missing make g++
RUN apt-get install -y --fix-missing python3-pip
RUN apt-get install -y python3-h5py
RUN DEBIAN_FRONTEND="noninteractive" apt-get -y install tzdata
RUN apt-get install -y python3-opencv
RUN apt-get install -y python3-scipy
RUN apt-get install -y python3-dev
RUN pip3 install numpy cython
RUN apt-get install -y libhdf5-serial-dev hdf5-tools libhdf5-dev zlib1g-dev zip libjpeg8-dev liblapack-dev libblas-dev gfortran
RUN pip3 install -U pip testresources setuptools
RUN pip3 install -U numpy==1.16.1 future==0.18.2 mock==3.0.5 keras_preprocessing==1.1.1 keras_applications==1.0.8 gast==0.2.2 futures protobuf pybind11
RUN pip3 install --pre --extra-index-url https://developer.download.nvidia.com/compute/redist/jp/v44 'tensorflow==1.15.3'
RUN pip3 install Keras==2.3.1
RUN apt-get install -y python3-opencv unzip autoconf build-essential libtool
The goal is to classify images with a pretrained VGG19 TensorFlow classification model that has been optimized with TensorRT.
I start the Docker container like this:
docker run -it --gpus all --shm-size=4g --ulimit memlock=-1 inferencecontainer
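My script loads the frozen graph from a given path, creates a Session with the flag tf_config.gpu_options.allow_growth = True, defines the input and output tensors, and fetches them by name with tf_sess.graph.get_tensor_by_name().
This is the log from the TensorFlow device creation step: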
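For reference, the loading steps described above might look roughly like the following under TensorFlow 1.15. This is a minimal sketch, not the actual Predictor code: the function names, the path argument, and the op names are hypothetical. TensorFlow is imported inside the function only so the helpers can be defined on a machine without it. A TF-TRT graph may additionally need tensorflow.python.compiler.tensorrt.trt_convert to be imported first so that the TensorRT op is registered; that is an assumption, since the post does not show the imports.

```python
def tensor_name(op_name, index=0):
    """Return the tensor name TensorFlow expects, e.g. 'input_1' -> 'input_1:0'."""
    return "{}:{}".format(op_name, index)


def load_frozen_graph(pb_path, input_op, output_op):
    """Load a frozen graph and return (session, input_tensor, output_tensor).

    pb_path, input_op and output_op are placeholders for illustration;
    the real values depend on how the model was exported.
    """
    import tensorflow as tf  # assumes TensorFlow 1.15, as installed by the Dockerfile

    # Read the serialized GraphDef from disk.
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())

    # Let the GPU allocator grow on demand instead of reserving memory up front.
    tf_config = tf.compat.v1.ConfigProto()
    tf_config.gpu_options.allow_growth = True

    # Import the graph into a fresh Graph and open a session on it.
    graph = tf.Graph()
    with graph.as_default():
        tf.import_graph_def(graph_def, name="")
    tf_sess = tf.compat.v1.Session(graph=graph, config=tf_config)

    # Look the input and output tensors up by name.
    input_tensor = tf_sess.graph.get_tensor_by_name(tensor_name(input_op))
    output_tensor = tf_sess.graph.get_tensor_by_name(tensor_name(output_op))
    return tf_sess, input_tensor, output_tensor
```

Note that allow_growth only changes when memory is claimed; it does not raise the limit the allocator reports.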
2020-09-25 21:09:32.042986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1320] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 65 MB memory) -> physical GPU (device: 0,name: NVIDIA Tegra X1,pci bus id: 0000:00:00.0,compute capability: 5.3)
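(Only 65 MB of memory is allocated.)
When I run the session with tf_sess.run(output_tensor, feed_dict), where feed_dict holds a loaded image of the expected input size, it crashes with the following trace: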
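The Jetson Nano's CPU and GPU share the same physical RAM, so a 65 MB pool may simply reflect how little memory was free when the session was created. One way to check is to read /proc/meminfo inside the container just before creating the session. The sketch below uses only the standard library; the function names are hypothetical, and it is a diagnostic aid under that assumption, not a fix.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most values are in kB; a few counters (e.g. HugePages_Total) have no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Key: <number> [unit]" lines.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_mib(path="/proc/meminfo"):
    """Return the kernel's MemAvailable estimate in MiB."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info["MemAvailable"] / 1024.0
```

Calling available_mib() right before the Session is created gives a rough idea of how much RAM is left for the GPU allocator to claim.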
2020-09-25 21:10:28.983061: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 59.51MiB
2020-09-25 21:10:28.983097: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 66060288 memory_limit_: 68411392 available bytes: 2351104 curr_region_allocation_bytes_: 67108864
2020-09-25 21:10:28.983141: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 68411392
InUse: 62403584
MaxInUse: 62403584
NumAllocs: 26
MaxAllocSize: 14680064
2020-09-25 21:10:28.983191: W tensorflow/core/common_runtime/bfc_allocator.cc:424] *********xx********____***********************xxx********************************************xxxxxxx
2020-09-25 21:10:28.983454: W tensorflow/core/framework/op_kernel.cc:1628] OP_REQUIRES failed at constant_op.cc:77 : Resource exhausted: OOM when allocating tensor of shape [3,3,512,512] and type float
2020-09-25 21:10:28.983619: E tensorflow/core/common_runtime/executor.cc:648] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [3,512] and type float
[[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1365,in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1350,in _run_fn
target_list,run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1443,in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,512] and type float
[[{{node vgg19/block5_conv2/Conv2D/ReadVariableOp}}]]
During handling of the above exception,another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py",line 193,in _run_module_as_main
"__main__",mod_spec)
File "/usr/lib/python3.6/runpy.py",line 85,in _run_code
exec(code,run_globals)
File "/app/app/main.py",line 17,in <module>
prediction = predictor.predict_frame(image)
File "/app/app/Predictor.py",line 81,in predict_frame
preds = self.tf_sess.run(self.output_tensor,feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 956,in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1180,in _run
feed_dict_tensor,options,line 1359,in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py",line 1384,in _do_call
raise type(e)(node_def,op,message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3,512] and type float
[[node vgg19/block5_conv2/Conv2D/ReadVariableOp (defined at usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Original stack trace for 'vgg19/block5_conv2/Conv2D/ReadVariableOp':
File "usr/lib/python3.6/runpy.py",mod_spec)
File "usr/lib/python3.6/runpy.py",run_globals)
File "app/app/main.py",line 10,in <module>
predictor = Predictor(trt_model_path,class_labels,image_size)
File "app/app/Predictor.py",line 26,in __init__
tf.import_graph_def(trt_graph,name="")
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py",line 513,in new_func
return func(*args,**kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 405,in import_graph_def
producer_op_list=producer_op_list)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 517,in _import_graph_def_internal
_ProcessNewOps(graph)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/importer.py",line 243,in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 3561,in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 3451,in _create_op_from_tf_operation
ret = Operation(c_op,self)
File "usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py",line 1748,in __init__
self._traceback = tf_stack.extract_stack()
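The allocator numbers in the trace can be cross-checked with a little arithmetic. The sketch below copies the values from the bfc_allocator lines above, assumes 4 bytes per float32 element, and compares the size of the [3,3,512,512] tensor with what is left under the limit. The variable and function names are hypothetical.

```python
from functools import reduce
from operator import mul

MIB = 1024 * 1024

# Values copied from the bfc_allocator lines in the log.
limit = 68411392             # Limit / memory_limit_
in_use = 62403584            # InUse
region_allocated = 66060288  # total_region_allocated_bytes_


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element for float32."""
    return reduce(mul, shape, 1) * bytes_per_element


needed = tensor_bytes([3, 3, 512, 512])  # 9437184 bytes, exactly 9 MiB
free_under_limit = limit - in_use        # 6007808 bytes
growable = limit - region_allocated      # 2351104, the "available bytes" in the log

print("limit:            %.2f MiB" % (limit / MIB))
print("in use:           %.2f MiB" % (in_use / MIB))
print("free under limit: %.2f MiB" % (free_under_limit / MIB))
print("tensor needs:     %.2f MiB" % (needed / MIB))
print("fits:", needed <= free_under_limit)
```

With roughly 5.7 MiB free under a 65 MiB limit, a 9 MiB weight tensor cannot fit regardless of fragmentation, so the limit itself, rather than this particular layer, looks like the thing to investigate.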
Any ideas about what is causing the problem?
Thanks!