Problem description
I converted my trained model to ONNX format and then created a TensorRT engine file from the ONNX model. I used the snippet below to do this:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
import torch  # host_output is wrapped in a torch.Tensor below

# logger to capture errors, warnings, and other information during the build and inference phases
TRT_LOGGER = trt.Logger()

def build_engine(onnx_file_path):
    # initialize TensorRT engine and parse ONNX model
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)
    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        parser.parse(model.read())
    print('Completed parsing of ONNX file')
    # allow TensorRT to use up to 1GB of GPU memory for tactic selection
    builder.max_workspace_size = 1 << 30
    # we have only one image in batch
    builder.max_batch_size = 1
    # use FP16 mode if possible
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True
    # generate TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")
    return engine, context

# build the engine and create the execution context ('model.onnx' is a placeholder path)
engine, context = build_engine('model.onnx')

# get sizes of input and output and allocate memory required for input data and for output data
for binding in engine:
    if engine.binding_is_input(binding):  # we expect only one input
        input_shape = engine.get_binding_shape(binding)
        input_size = trt.volume(input_shape) * engine.max_batch_size * np.dtype(np.float32).itemsize  # in bytes
        device_input = cuda.mem_alloc(input_size)
    else:  # and one output
        output_shape = engine.get_binding_shape(binding)
        # create page-locked memory buffers (i.e. won't be swapped to disk)
        host_output = cuda.pagelocked_empty(trt.volume(output_shape) * engine.max_batch_size, dtype=np.float32)
        device_output = cuda.mem_alloc(host_output.nbytes)
stream = cuda.Stream()

# preprocess input data (preprocess_image is a helper defined elsewhere in the script)
host_input = np.array(preprocess_image("turkish_coffee.jpg").numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(device_input, host_input, stream)

# run inference
context.execute_async(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()

# postprocess results (postprocess is a helper defined elsewhere in the script)
output_data = torch.Tensor(host_output).reshape(engine.max_batch_size, output_shape[0])
postprocess(output_data)
The code above works correctly for a batch size of one image, but I want to run it with larger batch sizes. I know this one line needs to change:
builder.max_batch_size = 1
What else do I need to change for batch sizes greater than one to work correctly? One thing, I think, is that I have to change from synchronous to asynchronous here, right?
stream.synchronize()
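For reference, below is a minimal sketch of what those changes could look like with the implicit-batch API in TensorRT 5/6. It is an untested outline, not a verified fix: BATCH_SIZE, the image paths, and the assumption that preprocess_image yields one (C, H, W) tensor per image are all placeholders. It reuses the buffers allocated above, which are already sized by engine.max_batch_size; the key point is that execute_async takes a batch_size argument that defaults to 1.

BATCH_SIZE = 4  # placeholder: the largest batch the engine should support

# in build_engine(): build the engine for up to BATCH_SIZE images
builder.max_batch_size = BATCH_SIZE

# stack BATCH_SIZE preprocessed images into one contiguous (N, C, H, W) array
# (assumes preprocess_image returns a single (C, H, W) tensor per image)
image_paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg"]  # placeholders
batch = np.stack([preprocess_image(p).numpy() for p in image_paths])
host_input = np.ascontiguousarray(batch, dtype=np.float32)
cuda.memcpy_htod_async(device_input, host_input, stream)

# pass the actual batch size explicitly; it defaults to 1
context.execute_async(batch_size=BATCH_SIZE,
                      bindings=[int(device_input), int(device_output)],
                      stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
# synchronize() stays as-is: it only waits for the async copies and kernels
# queued on this stream to finish before host_output is read
stream.synchronize()

# one row of outputs per image in the batch
output_data = torch.Tensor(host_output).reshape(BATCH_SIZE, -1)
postprocess(output_data)

Note that stream.synchronize() is not the synchronous/asynchronous switch: the copies and the inference are already enqueued asynchronously on the stream, and the final synchronize is still required before the host reads the output buffer.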
My system:
torch: 1.2.0, torchvision: 0.4.0, albumentations: 0.4.5, onnx: 1.4.1, opencv-python: 4.2.0.34, CUDA: 10.0, Ubuntu: 18.04, TensorRT: 5.x / 6.x
Another option would be to use optimization profiles in TRT 7.x, but I would like to know how to solve this with version 5/6. Is that possible?
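For comparison, the TRT 7.x route mentioned above builds the network in explicit-batch mode and attaches an optimization profile that declares a dynamic batch dimension. A rough sketch, assuming the ONNX model was exported with a dynamic batch axis; the file path, the tensor name "input", and the 3x224x224 shapes are all illustrative:

import tensorrt as trt

TRT_LOGGER = trt.Logger()
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(EXPLICIT_BATCH)
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('model.onnx', 'rb') as model:  # placeholder path
    parser.parse(model.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30

# declare the range of batch sizes the engine must handle
profile = builder.create_optimization_profile()
profile.set_shape('input',            # placeholder input tensor name
                  (1, 3, 224, 224),   # min shape
                  (4, 3, 224, 224),   # shape TensorRT optimizes for
                  (8, 3, 224, 224))   # max shape
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
context = engine.create_execution_context()
# pick the concrete batch size at runtime before enqueueing inference
context.set_binding_shape(0, (4, 3, 224, 224))

This path also requires the ONNX export itself to mark the batch axis as dynamic (e.g. via the dynamic_axes argument of torch.onnx.export).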
Solution
No effective solution to this problem has been found yet.