How do I get a kernel's execution time with the NSight Compute CLI?

Question

Suppose I have an executable myapp which takes no command-line arguments and launches a CUDA kernel mykernel. I can invoke:

nv-nsight-cu-cli -k mykernel myapp

and get output that looks like this:

==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== disconnected from process 1234
[1234] myapp@127.0.0.1
  mykernel(),2020-Oct-25 01:23:45,Context 1,Stream 7
    Section: GPU Speed Of Light
    --------------------------------------------------------------------
    Memory Frequency                      cycle/nsecond      1.62
    SOL FB                                %                  1.58
    Elapsed Cycles                        cycle              4,421,067
    SM Frequency                          cycle/nsecond      1.43
    Memory [%]                            %                  61.76
    Duration                              msecond            3.07
    SOL L2                                %                  0.79
    SM Active Cycles                      cycle              4,390,420.69
    (etc. etc.)
    --------------------------------------------------------------------
    (etc. etc. - other sections here)

So far, so good. But now I want just the overall kernel duration of mykernel, and no other output. Looking at nv-nsight-cu-cli --query-metrics, I see, among others:

gpu__time_duration           incremental duration in nanoseconds; isolated measurement is same as gpu__time_active
gpu__time_active             total duration in nanoseconds

So it must be one of those, right? But when I run

nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_duration,gpu__time_active

I get:

==PROF== Connected to process 30446 (/path/to/myapp)
==PROF== Profiling "mykernel": 0%....50%....100% - 13 passes
==PROF== disconnected from process 12345
[12345] myapp@127.0.0.1
  mykernel(),2020-Oct-25 12:34:56,Stream 7
    Section: GPU Speed Of Light
    Section: Command line profiler metrics
    ---------------------------------------------------------------
    gpu__time_active                                   (!) n/a
    gpu__time_duration                                 (!) n/a
    ---------------------------------------------------------------

My questions:

Why am I getting "n/a" values?
How can I get the actual values I'm after, and nothing else?

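Notes:

I'm using CUDA 10.2 with NSight Compute version 2019.5.0 (Build 27346997).
I realize I could filter the standard output stream of the unqualified invocation, but that's not what I'm after.
I actually just want the raw number, but I'm willing to settle for using --csv and taking the last field.
I couldn't find anything relevant in the nvprof transition guide.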

Answer

tl;dr: You need to specify the appropriate "submetric":

nv-nsight-cu-cli -k mykernel myapp --metrics gpu__time_active.avg

(Based on @RobertCrovella's comments.)

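CUDA's profiling mechanism collects "base metrics", which are indeed the ones listed with --list-metrics. For each of these, multiple samples are taken. In version 2019.5 of NSight Compute you can't just get the raw samples; you can only get "submetric" values.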

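"Submetrics" are essentially some aggregation of the sequence of samples into a scalar value. Different metrics have different kinds of submetrics (see this listing); for gpu__time_active, these are: .min, .max, .sum, .avg. And yes, in case you were wondering, second-moment statistics such as the variance or the sample standard deviation are missing.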
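To make the aggregation idea concrete, here is a minimal Python sketch. The sample values are made up and the function name submetrics is hypothetical; it only mimics what the four suffixes mean for a sequence of samples, and does not claim to reflect how NSight Compute computes them internally.

```python
from statistics import mean


def submetrics(samples):
    """Reduce a sequence of samples to the four scalar aggregates
    named by the .min, .max, .sum and .avg suffixes."""
    return {
        "min": min(samples),
        "max": max(samples),
        "sum": sum(samples),
        "avg": mean(samples),
    }


# Made-up durations in nanoseconds, for illustration only.
samples_ns = [3070000, 3072000, 3068000]
print(submetrics(samples_ns))
```

Anything beyond these four, such as a standard deviation, would have to be computed from the raw samples, which this version does not expose.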

So, you must either specify one or more submetrics (see the example above), or upgrade to a newer version of NSight Compute, with which you can apparently just get all the samples.
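Since the question asks for just the raw number, one option is to add --csv to the invocation and pick the last field of the matching row, as the question itself suggests. Below is a rough Python sketch of that filtering step. It assumes that the profiler's status lines start with ==PROF==, that the metric name appears verbatim as one of the CSV fields, and that the value is the last field. The helper name metric_values is hypothetical, so check these assumptions against what your version actually prints.

```python
import csv
import io


def metric_values(cli_output, metric_name):
    """Return the last CSV field of every row that mentions metric_name."""
    # Drop blank lines and profiler status lines (assumed to start with ==PROF==).
    data_lines = [
        line for line in cli_output.splitlines()
        if line.strip() and not line.startswith("==PROF==")
    ]
    values = []
    # csv.reader handles quoted fields, so values such as "4,421,067" stay intact.
    for row in csv.reader(io.StringIO("\n".join(data_lines))):
        if metric_name in row:
            values.append(row[-1])
    return values
```

Call it on the captured standard output of the profiler, for example metric_values(text, "gpu__time_active.avg"). It returns strings rather than numbers, because the value may contain thousands separators.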
