How are Speech-to-Text and Video Intelligence SPEECH_TRANSCRIPTION related?

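My goal is to run a speech-to-text model over a number of videos.

What confuses me is that Google has two products that seem to do the same thing.

What are the main differences between them?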

Google Cloud Speech-to-Text: https://cloud.google.com/speech-to-text/docs/basics
- Speech-to-Text has an enhanced "video" model for interpreting audio.
Google Video Intelligence: https://cloud.google.com/video-intelligence/docs/feature-speech-transcription
- VI has the option to request a SPEECH_TRANSCRIPTION feature.

The main difference between the two is the input they accept. The Speech-to-Text API accepts only audio input, while the Video Intelligence API accepts video input.

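As you note in your question, Speech-to-Text has an enhanced "video" model. That means it has a model designed to transcribe audio that originated from video files: the source file is a video, which is then converted to audio. As shown in this tutorial, the video is converted to audio before it is transcribed.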
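As a rough sketch of that route, the snippet below builds the two pieces involved: an ffmpeg command that strips the video stream, and a JSON body for the Speech-to-Text v1 REST method `speech:longrunningrecognize` that selects the enhanced "video" model. It assumes ffmpeg is installed and that the extracted audio has been uploaded to Cloud Storage; the file names, the bucket URI and the helper names are made up for illustration, and nothing here actually calls the API.

```python
import json


def ffmpeg_extract_audio_cmd(video_path, audio_path):
    """Build an ffmpeg command that drops the video stream (-vn) and
    writes mono (-ac 1), 16 kHz (-ar 16000) audio to audio_path."""
    return ["ffmpeg", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path]


def speech_request_body(gcs_audio_uri, language_code="en-US"):
    """Build a request body for POST
    https://speech.googleapis.com/v1/speech:longrunningrecognize
    that asks for the enhanced "video" model."""
    return {
        "config": {
            "encoding": "FLAC",
            "sampleRateHertz": 16000,
            "languageCode": language_code,
            "model": "video",
            "useEnhanced": True,
        },
        "audio": {"uri": gcs_audio_uri},
    }


if __name__ == "__main__":
    # Hypothetical local file and bucket, for illustration only.
    print(" ".join(ffmpeg_extract_audio_cmd("talk.mp4", "talk.flac")))
    print(json.dumps(speech_request_body("gs://my-bucket/talk.flac"), indent=2))
```

The command list can be passed to `subprocess.run`, and the body can be sent with any HTTP client using an authorized access token.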

If you want to transcribe the audio content of your videos directly to text, I suggest using the Video Intelligence API. You can follow this tutorial on how to transcribe text using the Video Intelligence API.
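For comparison, here is a minimal sketch of what the Video Intelligence route could look like when there are several videos to process. It only builds the JSON bodies for the v1 REST method `videos:annotate` with the SPEECH_TRANSCRIPTION feature; the bucket URIs and the helper name are hypothetical, and sending the requests (with authentication) is left out.

```python
import json


def annotate_request_body(gcs_video_uri, language_code="en-US"):
    """Build a request body for POST
    https://videointelligence.googleapis.com/v1/videos:annotate
    that requests speech transcription for one video."""
    return {
        "inputUri": gcs_video_uri,
        "features": ["SPEECH_TRANSCRIPTION"],
        "videoContext": {
            "speechTranscriptionConfig": {
                "languageCode": language_code,
                "enableAutomaticPunctuation": True,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical Cloud Storage URIs, for illustration only.
    videos = ["gs://my-bucket/video1.mp4", "gs://my-bucket/video2.mp4"]
    for uri in videos:
        print(json.dumps(annotate_request_body(uri), indent=2))
```

Each call to `videos:annotate` starts a long-running operation, so with multiple videos you would submit one request per file and poll the returned operations for the transcripts. No separate audio extraction step is needed here.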