Batch inference with Nvidia's TensorRT

Problem description

I converted my trained model to ONNX format and then created a TensorRT engine file from the ONNX model. I use the code snippet below to do this:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
import torch  # needed below to wrap the output buffer as a torch.Tensor

# logger to capture errors, warnings, and other information during the build and inference phases
TRT_LOGGER = trt.Logger()

def build_engine(onnx_file_path):
    # initialize TensorRT engine and parse ONNX model
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network()
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # parse ONNX
    with open(onnx_file_path, 'rb') as model:
        print('Beginning ONNX file parsing')
        parser.parse(model.read())
    print('Completed parsing of ONNX file')
    # allow TensorRT to use up to 1GB of GPU memory for tactic selection
    builder.max_workspace_size = 1 << 30
    # we have only one image in batch
    builder.max_batch_size = 1
    # use FP16 mode if possible
    if builder.platform_has_fast_fp16:
        builder.fp16_mode = True

    # generate TensorRT engine optimized for the target platform
    print('Building an engine...')
    engine = builder.build_cuda_engine(network)
    context = engine.create_execution_context()
    print("Completed creating Engine")

    return engine, context

# build the engine and create the execution context
engine, context = build_engine('model.onnx')  # the file name here is a placeholder

# get sizes of input and output and allocate memory required for input data and for output data
for binding in engine:
    if engine.binding_is_input(binding):  # we expect only one input
        input_shape = engine.get_binding_shape(binding)
        input_size = trt.volume(input_shape) * engine.max_batch_size * np.dtype(np.float32).itemsize  # in bytes
        device_input = cuda.mem_alloc(input_size)
    else:  # and one output
        output_shape = engine.get_binding_shape(binding)
        # create page-locked memory buffers (i.e. won't be swapped to disk)
        host_output = cuda.pagelocked_empty(trt.volume(output_shape) * engine.max_batch_size, dtype=np.float32)
        device_output = cuda.mem_alloc(host_output.nbytes)

stream = cuda.Stream()

# preprocess input data
# preprocess_image is assumed to be defined elsewhere and to return a torch tensor
host_input = np.array(preprocess_image("turkish_coffee.jpg").numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(device_input, host_input, stream)

# run inference
context.execute_async(bindings=[int(device_input), int(device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()

# postprocess results (postprocess is likewise assumed to be defined elsewhere)
output_data = torch.Tensor(host_output).reshape(engine.max_batch_size, output_shape[0])
postprocess(output_data)
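
For reference, the ONNX conversion step mentioned at the top is not shown above. A minimal sketch of what it can look like with torch.onnx.export follows (the resnet50 model, the input shape, and the file name are assumptions, not taken from the question):

import torch
import torchvision

# example model; any trained torch.nn.Module works the same way
model = torchvision.models.resnet50(pretrained=True)
model.eval()

# the dummy input fixes the traced input shape; the batch dimension here is 1
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(model, dummy_input, 'model.onnx',
                  input_names=['input'], output_names=['output'])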

The code above works correctly for a batch size of one, but I want to run it with a larger batch size. As far as I can tell, this one thing needs to change:

builder.max_batch_size = 1

What else do I need to change for batch sizes larger than one to work correctly? I think the one thing I have to change is from synchronous to asynchronous execution, right?

stream.synchronize()

How do I get this working for batch sizes larger than one?
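
For what it is worth, here is a sketch of the changes I believe are needed; this is an assumption on my part, not a verified answer. With the implicit-batch API in TensorRT 5.x/6.x, the engine is built with builder.max_batch_size set to the desired batch size (the buffer allocation above already multiplies by engine.max_batch_size), the preprocessed images are stacked into one contiguous array, and the actual batch size is passed to execute_async. stream.synchronize() can stay as-is, since the host still has to wait before reading host_output:

BATCH_SIZE = 4  # example; the engine must have been built with builder.max_batch_size = BATCH_SIZE

# stack BATCH_SIZE preprocessed images into one contiguous float32 array
images = [preprocess_image("turkish_coffee.jpg").numpy() for _ in range(BATCH_SIZE)]
host_input = np.ascontiguousarray(np.stack(images), dtype=np.float32)
cuda.memcpy_htod_async(device_input, host_input, stream)

# implicit-batch API: pass the actual batch size explicitly
context.execute_async(batch_size=BATCH_SIZE,
                      bindings=[int(device_input), int(device_output)],
                      stream_handle=stream.handle)

cuda.memcpy_dtoh_async(host_output, device_output, stream)
stream.synchronize()  # still required before the host reads host_output

output_data = torch.Tensor(host_output).reshape(BATCH_SIZE, output_shape[0])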

My system:

torch: 1.2.0
torchvision: 0.4.0
albumentations: 0.4.5
onnx: 1.4.1
opencv-python: 4.2.0.34
CUDA: 10.0
Ubuntu: 18.04
TensorRT: 5.x / 6.x

Another solution is to use optimization profiles in TRT 7.x, but I would like to know how to solve this with versions 5.x/6.x. Is that possible?
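
For completeness, here is a minimal sketch of the TRT 7.x route mentioned above, using an explicit-batch network plus an optimization profile. The input name 'input' and the shapes are assumptions, and the ONNX model would need to be exported with a dynamic batch dimension (dynamic_axes in torch.onnx.export) for varying batch sizes:

# TRT 7.x only: explicit batch + optimization profile
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(EXPLICIT_BATCH)
parser = trt.OnnxParser(network, TRT_LOGGER)
with open('model.onnx', 'rb') as model:
    parser.parse(model.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30

# tell TensorRT which batch sizes the engine should support and optimize for
profile = builder.create_optimization_profile()
profile.set_shape('input',
                  min=(1, 3, 224, 224),   # smallest allowed input shape
                  opt=(4, 3, 224, 224),   # shape TensorRT optimizes for
                  max=(8, 3, 224, 224))   # largest allowed input shape
config.add_optimization_profile(profile)

engine = builder.build_engine(network, config)
context = engine.create_execution_context()

# at inference time, fix the concrete input shape before executing
context.set_binding_shape(0, (4, 3, 224, 224))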

Solution

No working solution for this problem has been found yet.
