Do time offsets work during streaming audio transcription with Google Speech-To-Text?

Question

Time offsets for streaming audio transcription with Google Speech-To-Text are not working for me. My configuration looks like this:

const request = {
  config: {
    model: 'phoneCall',
    maxAlternatives: 1, // for real-time, we always parse a single alternative.
    enableWordTimeOffsets: true,
    encoding: "MULAW",
    sampleRateHertz: 8000,
    languageCode: "en-GB"
  },
  interimResults: true
};

Once we have a handle on the WebSockets connection, we set up the callbacks for the transcription:

recognizeStream = client
  .streamingRecognize(request)
  .on("error", console.error)
  .on("data", data => {
    console.log(data.results[0].alternatives[0].transcript);
    for (const v in data.results[0].alternatives[0]) {
      console.log(`v=${data.results[0].alternatives[0][v]}`);
    }
    data.results[0].alternatives[0].words.forEach(wordInfo => {
      // NOTE: If you have a time offset exceeding 2^32 seconds, use the
      // wordInfo.{x}Time.seconds.high to calculate seconds.
      const startSecs =
        `${wordInfo.startTime.seconds}` +
        '.' +
        wordInfo.startTime.nanos / 100000000;
      const endSecs =
        `${wordInfo.endTime.seconds}` +
        '.' +
        wordInfo.endTime.nanos / 100000000;
      console.log(`Word: ${wordInfo.word}`);
      console.log(`\t ${startSecs} secs - ${endSecs} secs`);
    });
  });

Then, whenever we receive a chunk of audio, we do the following:

recognizeStream.write(msg.media.payload);

where msg is the JSON object parsed from the WebSockets message:

const msg = JSON.parse(message);

Unfortunately, even though the real-time transcription works as expected, the array data.results[0].alternatives[0].words is always empty.

Has anyone confirmed that time offsets actually work for streaming audio transcription with Google Speech-To-Text?

By the way, this is the git repo for the Node.js API for Google Speech-To-Text.

Answer

There is ample evidence that the time offsets of words transcribed by Google Speech-To-Text are returned only when the is_final flag is True.

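In other words, during real-time transcription, timestamped word boundaries appear to be available only at the end of the transcription, not in the interim results.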
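
If that observation holds, the "data" callback can skip interim results and read word timings only from final ones. The following is a minimal sketch, not a confirmed fix: it assumes the Node.js client exposes the flag as isFinal (the camelCase form of is_final), and the helper names durationToSeconds and extractWordOffsets are made up for illustration.

```javascript
// Hypothetical helper: convert a Duration-like object ({seconds, nanos}) to
// a number of seconds. `seconds` may arrive as a number, a string, or a
// Long-like object, so it is normalized through String() first.
function durationToSeconds(duration) {
  if (!duration) return 0;
  const seconds = Number(String(duration.seconds || 0));
  const nanos = Number(duration.nanos || 0);
  return seconds + nanos / 1e9;
}

// Hypothetical helper: return the word timings of a streaming response,
// or an empty array when the first result is interim (isFinal is not set).
// Assumption: the Node.js client names the flag `isFinal`.
function extractWordOffsets(data) {
  const result = data.results && data.results[0];
  if (!result || !result.isFinal) return [];
  const alternative = result.alternatives && result.alternatives[0];
  if (!alternative || !alternative.words) return [];
  return alternative.words.map(wordInfo => ({
    word: wordInfo.word,
    start: durationToSeconds(wordInfo.startTime),
    end: durationToSeconds(wordInfo.endTime),
  }));
}

// Hand-written sample shaped like a final streaming response (not real API output).
const sampleFinal = {
  results: [{
    isFinal: true,
    alternatives: [{
      transcript: "hello",
      words: [{
        word: "hello",
        startTime: { seconds: "1", nanos: 500000000 },
        endTime: { seconds: "2", nanos: 0 },
      }],
    }],
  }],
};

extractWordOffsets(sampleFinal).forEach(w => {
  console.log(`Word: ${w.word}\t ${w.start} secs - ${w.end} secs`);
});
```

Inside the existing handler this would be used as .on("data", data => extractWordOffsets(data).forEach(...)), so interim results still supply the live transcript while timings are taken only from final results.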

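I know I am not the only API user asking for this feature. I can't imagine it would be hard to implement, and I suspect that such a fix would not break the current API.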

google-cloud-platform google-cloud-speech speech-to-text