How is it possible that YouTube's automatic captions give better results than the Google Speech-to-Text API with model: video and useEnhanced: true?

Problem description

These are the settings I used for the Google Speech-to-Text API (model: video, useEnhanced: true)

This is the Speech-to-Text output file: https://justpaste.it/speechtotext2

This is the YouTube automatic caption output file: https://justpaste.it/ytautotranslate

This is the video link: https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses

This is the audio file of the video that I gave to the Google Speech-to-Text API: https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac

Here are the time-aligned SRT files

YouTube's SRT: https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing

Google Speech-to-Text API's SRT (with the timings taken from YouTube's): https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing

I compared some sentences, and YouTube's automatic captions are definitely better

For example

Google Speech-to-Text: Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.

What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.

YouTube automatic captions: represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons

what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it

I checked many cases, and YouTube guesses the correct word far more often. How is this possible?
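A comparison like the one above can also be done mechanically rather than by eye. The following is a minimal sketch, assuming both transcripts are available as plain strings; the helper name `word_disagreements` is hypothetical and only Python's standard `difflib` and `re` modules are used. It lists the word spans where the two transcripts differ, without saying which side is correct.

```python
import difflib
import re


def word_disagreements(transcript_a, transcript_b):
    """Return (words_from_a, words_from_b) pairs where the transcripts differ.

    Hypothetical helper for illustration: case and punctuation are ignored,
    so only real word-level differences are reported.
    """
    words_a = re.findall(r"[a-z0-9']+", transcript_a.lower())
    words_b = re.findall(r"[a-z0-9']+", transcript_b.lower())
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means a pure insertion or deletion.
            differences.append(
                (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
            )
    return differences


if __name__ == "__main__":
    stt = "Represent the doctor representation is one of the hardest part"
    youtube = "represent the data representation is one of the hardest part"
    for a_words, b_words in word_disagreements(stt, youtube):
        print(f"Speech-to-Text: {a_words!r}  YouTube: {b_words!r}")
```

Running it over the full SRT text from both sources would give a list of candidate words to check against the audio.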

This is the command I used to extract the audio from the video: ffmpeg -i "input.mkv" -af aformat=s16:48000 output.flac

Solution

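Both YouTube's automatic captions and Speech-to-Text transcriptions are generated by machine learning algorithms, so the quality of a transcription can vary for a number of reasons.

It is important to note that the Speech-to-Text API uses machine learning algorithms for transcription. These algorithms improve over time, and the results vary with the input file and the request configuration. One way to help Google's transcription models is to enable data logging, which lets Google collect data from your audio transcription requests and use it to improve the machine learning models it uses to recognize speech audio, including the enhanced models.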

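In addition, the Speech-to-Text API request configuration lets you specify RecognitionConfig settings. These include encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and speechContexts, each of which plays an important role in the accuracy of the transcription.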
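To make the RecognitionConfig fields concrete, here is a minimal sketch of a JSON request body of the kind the REST API accepts. The helper name `build_recognize_request` is hypothetical, and several values are assumptions rather than the asker's actual settings: `en-US` as the language code, 48000 Hz taken from the ffmpeg command in the question, and a `gs://` form of the bucket URL from the question.

```python
import json


def build_recognize_request(gcs_uri, language_code="en-US"):
    """Build a JSON-serializable body for a Speech-to-Text recognize request.

    Hypothetical helper: field names follow RecognitionConfig, while the
    values are illustrative assumptions, not the asker's real settings.
    """
    config = {
        "encoding": "FLAC",
        "sampleRateHertz": 48000,   # assumed to match the extracted audio
        "languageCode": language_code,
        "maxAlternatives": 1,
        "profanityFilter": False,
        "model": "video",           # from the question title
        "useEnhanced": True,        # from the question title
    }
    return {"config": config, "audio": {"uri": gcs_uri}}


if __name__ == "__main__":
    body = build_recognize_request(
        "gs://text_speech_furkan/machine_learning_lecture_1.flac"
    )
    print(json.dumps(body, indent=2))
```

Keeping the configuration in one place like this makes it easier to change a single field, rerun the transcription, and compare the results.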

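For FLAC audio files specifically, lossless compression helps preserve the quality of the audio you provide, because the original digital samples are not degraded. FLAC uses a compression level parameter that ranges from 0 (fastest) to 8 (smallest file size).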
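As an illustration of the compression level, the sketch below builds an ffmpeg argument list for extracting FLAC audio. The helper name `build_flac_extract_command` is hypothetical; the `aformat=s16:48000` filter comes from the command in the question, and `-compression_level` is ffmpeg's encoder option. The 0 to 8 check mirrors the range stated above. Because FLAC is lossless, the level should affect only encoding time and file size, not the decoded samples.

```python
import subprocess


def build_flac_extract_command(src, dst, compression_level=5):
    """Return an ffmpeg argument list that extracts audio as 16-bit 48 kHz FLAC.

    Hypothetical helper for illustration; the default level of 5 is an
    arbitrary middle value chosen for this sketch.
    """
    if not 0 <= compression_level <= 8:
        raise ValueError("compression_level must be between 0 and 8")
    return [
        "ffmpeg",
        "-i", src,
        "-af", "aformat=s16:48000",  # sample format and rate, as in the question
        "-compression_level", str(compression_level),
        dst,
    ]


if __name__ == "__main__":
    command = build_flac_extract_command("input.mkv", "output.flac", 8)
    print(" ".join(command))
    # Uncomment to actually run ffmpeg (it must be installed and on PATH):
    # subprocess.run(command, check=True)
```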

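The Speech-to-Text API also offers several ways to improve transcription accuracy, such as:

Speech adaptation: lets you specify words and/or phrases that Speech-to-Text should recognize more often in your audio data
Speech adaptation boost: lets you add numeric weights to words and/or phrases according to how often they should be recognized in your audio data
Phrase hints: send a list of words and phrases that act as hints for the speech recognition task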
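The features listed above are passed through the speechContexts field of the request configuration. Below is a minimal sketch, assuming the REST JSON shape where each context has a `phrases` list and an optional `boost` weight; the helper name `build_speech_contexts`, the example phrases (taken from the words misrecognized in the question) and the boost value of 10 are illustrative assumptions, not tuned recommendations.

```python
def build_speech_contexts(phrases, boost=None):
    """Return a speechContexts list for a RecognitionConfig dictionary.

    Hypothetical helper: `phrases` are the hints to favor, and `boost`
    is an optional positive weight applied to all of them.
    """
    if not phrases:
        raise ValueError("at least one phrase is required")
    context = {"phrases": list(phrases)}
    if boost is not None:
        if boost <= 0:
            raise ValueError("boost is expected to be a positive number")
        context["boost"] = float(boost)
    return [context]


if __name__ == "__main__":
    config = {
        "languageCode": "en-US",  # assumed language of the lecture
        "model": "video",
        "useEnhanced": True,
        "speechContexts": build_speech_contexts(
            ["data representation", "input"], boost=10
        ),
    }
    print(config)
```

For a lecture like this one, domain terms from the course material are natural candidates for the phrase list.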

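These features may help the Speech-to-Text API recognize your audio file more accurately.

Finally, see the Speech-to-Text best practices for improving the transcription of audio files. These recommendations are designed to improve efficiency and accuracy and to keep the API's response times reasonable.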

google-cloud-platform google-cloud-speech google-speech-to-text-api speech-recognition speech-to-text
