Problem description
I want to measure only the inference time on the Jetson TX2. How can I improve my function to do that? Right now I am measuring the whole do_inference call, i.e. the copy of the image from the CPU to the GPU, the inference itself, and the copy of the result back to the CPU.
Or is that not possible because of the way the GPU works? I mean, if I split the function into three parts, how many times would I have to call stream.synchronize()?
Thanks
Code in INFERENCE.PY
import pycuda.driver as cuda  # imports added so the snippet is self-contained
import tensorrt as trt

def do_inference(engine, pics_1, h_input, d_input, h_output, d_output, stream, batch_size):
    """
    This is the function to run the inference.
    Args:
        engine : The TensorRT engine (ICudaEngine) used to create the execution context.
        pics_1 : Input images to the model.
        h_input: Input buffer on the host (CPU); assumed to already hold the preprocessed pics_1 data.
        d_input: Input buffer on the device (GPU).
        h_output: Output buffer on the host (CPU).
        d_output: Output buffer on the device (GPU).
        stream: CUDA stream.
        batch_size: Batch size for execution time.
    Output:
        The list of output images.
    """
    # Context for executing inference using ICudaEngine
    with engine.create_execution_context() as context:
        # Transfer input data from the CPU to the GPU (host buffer -> device buffer).
        cuda.memcpy_htod_async(d_input, h_input, stream)
        # Run inference.
        # context.profiler = trt.Profiler()  # shows the execution time (ms) of each layer
        context.execute(batch_size=1, bindings=[int(d_input), int(d_output)])
        # Transfer predictions back from the GPU to the CPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()
        # Return the host output.
        out = h_output
        return out
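To the question of how many times stream.synchronize() would be needed: a host-side timer can only see work that has already finished, so the stream has to be synchronized at every point where a timestamp is read. Below is a minimal sketch of how do_inference could be split into three stages so that time.perf_counter() brackets only the inference; it is not the original code, and the helper name do_inference_timed, the use of execute_async, and the two-value return are assumptions.

import time

import pycuda.driver as cuda

def do_inference_timed(context, h_input, d_input, h_output, d_output, stream, batch_size=1):
    # Stage 1: host -> device copy of the input.
    cuda.memcpy_htod_async(d_input, h_input, stream)
    stream.synchronize()  # the copy must be finished before the inference timer starts

    # Stage 2: the inference itself, launched on the same stream.
    start = time.perf_counter()
    context.execute_async(batch_size=batch_size,
                          bindings=[int(d_input), int(d_output)],
                          stream_handle=stream.handle)
    stream.synchronize()  # wait for the kernels so the timer measures the real GPU work
    inference_ms = (time.perf_counter() - start) * 1000.0

    # Stage 3: device -> host copy of the result.
    cuda.memcpy_dtoh_async(h_output, d_output, stream)
    stream.synchronize()  # the result must be on the host before it is used

    return h_output, inference_ms

With this split there are three synchronize calls in total, but only the one after execute_async is required for the measurement itself; the first keeps the copy time out of the timed interval, and the last is needed before h_output is read.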
Code in TIMER.PY
for i in range(count):
    start = time.perf_counter()
    # Classification - calling TX2_classify.py
    # h_input, d_input, h_output, d_output and stream are assumed to be allocated beforehand.
    out = eng.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1)
    inference_time = time.perf_counter() - start
    print("TIME")
    print(inference_time * 1000)
    print("\n")
    pred = postprocess_inception(out)
    print(pred)
    print("\n")