Why is Keras.preprocessing.sequence pad_sequences processing characters instead of words?

Problem description

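I'm trying to transcribe speech to text and ran into a problem (I think) when using pad_sequences in Keras. I pretrained a model that used pad_sequences on a dataframe, which fit the data into an array with the same number of columns and rows for each value. However, when I use pad_sequences on the transcribed text, the number of characters in the spoken string is the number of rows returned in the NumPy array.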

Say I have a string with 4 characters: it returns a 4 X 500 NumPy array. For a string with 6 characters it returns a 6 X 500 NumPy array, and so on.

My code, for clarification:

import speech_recognition as sr
import pyaudio
import pandas as pd
from helperFunctions import *

jurors = ['Zack','Ben']
storage = []
storage_df = pd.DataFrame()
while len(storage) < len(jurors):
    print('Juror' + ' ' + jurors[len(storage)] + ' ' + 'is speaking:')
    init_rec = sr.Recognizer()
    with sr.Microphone() as source:
        audio_data = init_rec.adjust_for_ambient_noise(source)
        audio_data = init_rec.listen(source) #each juror speaks for 10 seconds
    audio_text = init_rec.recognize_google(audio_data)
    print('End of juror' + ' ' + jurors[len(storage)] + ' ' + 'speech')
    storage.append(audio_text)
    cleaned = clean_text(audio_text)
    tokenized = tokenize_text(cleaned)
    padded_text = padding(cleaned,tokenized) #fix padded text elongating rows

I'm using a script of helper functions:

import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

def clean_text(text,stem=False):
    text_clean = r'@\S+|https?:\S|[^A-Za-z0-9]+'
    text = re.sub(text_clean,' ',str(text).lower()).strip()
    #text = tf.strings.substr(text,300) #restrict text size to 300 chars
    return text

def tokenize_text(text):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)
    return tokenizer

def padding(text,tokenizer):
    text = pad_sequences(tokenizer.texts_to_sequences(text),maxlen = 500)
    return text

The returned text will be fed into a pre-trained model, and I'm pretty sure the varying number of rows will cause a problem.

Solution

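The Tokenizer's methods, such as fit_on_texts or texts_to_sequences, expect a list of texts/strings as input (as the name of their argument suggests, i.e. texts). However, you are passing a single text/string to them, so they iterate over its characters as if it were a list of texts.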
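To make the mechanism concrete, here is a minimal sketch in plain Python. The function toy_texts_to_sequences and the word_index dictionary are hypothetical stand-ins written for illustration, not the Keras implementation; the only point is that looping over a str yields one item per character, while looping over a list yields one item per string.

```python
# Toy stand-in (hypothetical, not Keras code): it loops over its input the
# way a texts-style API does, producing one row per item it iterates over.
def toy_texts_to_sequences(texts, word_index):
    sequences = []
    for text in texts:  # a str yields characters, a list yields whole strings
        sequences.append([word_index[w] for w in text.split() if w in word_index])
    return sequences

# Assumed vocabulary for the illustration.
word_index = {"hello": 1, "world": 2, "h": 3}

# Passing a bare string: one row per character (most of them empty).
rows_from_string = toy_texts_to_sequences("hello world", word_index)
print(len(rows_from_string))  # 11

# Passing a list holding one string: a single row for the whole transcript.
rows_from_list = toy_texts_to_sequences(["hello world"], word_index)
print(rows_from_list)  # [[1, 2]]
```

This matches the symptom in the question: the row count tracks the number of characters because each character is treated as its own text.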

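One way to resolve this is to add a check at the beginning of each function that wraps a single string in a list, so the input is always a list of texts. For example:

def padding(text,tokenizer):
    if isinstance(text,str):
        text = [text]
    # the rest would not change...

You should also do this for the tokenize_text function. After this change, your custom functions will work on both a single string and a list of strings.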
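The same check can be factored into one small helper and shared by both functions. The sketch below is plain Python under stated assumptions: as_text_list, toy_pad, toy_padding and the word_index dictionary are hypothetical names, and toy_pad only imitates the default left-padding and left-truncating behaviour of pad_sequences on Python lists.

```python
def as_text_list(text):
    """Return a list of strings whether given one string or several."""
    if isinstance(text, str):
        return [text]
    return list(text)

def toy_pad(sequences, maxlen):
    # Keep the last maxlen ids of each row and left-pad with zeros.
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

def toy_padding(text, word_index, maxlen=5):
    texts = as_text_list(text)  # the check from the answer, applied once
    sequences = [[word_index[w] for w in t.split() if w in word_index]
                 for t in texts]
    return toy_pad(sequences, maxlen)

# Assumed vocabulary for the illustration.
word_index = {"guilty": 1, "not": 2}

single = toy_padding("not guilty", word_index)
batch = toy_padding(["not guilty", "guilty"], word_index)
print(single)  # [[0, 0, 0, 2, 1]]
print(batch)   # [[0, 0, 0, 2, 1], [0, 0, 0, 0, 1]]
```

A single transcript now produces exactly one row, and a list of transcripts produces one row each, which is the fixed-shape input the model was trained on.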

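As an important side note, if the code in your question belongs to the prediction phase, there is a fundamental error in it: you should use the same Tokenizer instance that you used when training the model, so that the word-to-index mapping and the tokenization are identical to the training phase. Creating a new Tokenizer instance for each test sample, or for all of them, does not make sense (unless it has the same mapping and configuration as the one used during training).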
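A small sketch of why this matters, again in plain Python: fit_vocab and encode are hypothetical stand-ins for fitting a tokenizer and converting text to ids (the tie-breaking rule here is an assumption and differs from Keras), and pickle is used only as one possible way to carry the fitted mapping from training to prediction.

```python
import pickle

def fit_vocab(texts):
    """Toy stand-in for fitting a tokenizer: ids ordered by word frequency."""
    counts = {}
    for t in texts:
        for w in t.split():
            counts[w] = counts.get(w, 0) + 1
    # Ties are broken alphabetically here; this is an assumption of the sketch.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(ordered)}

def encode(text, vocab):
    return [vocab[w] for w in text.split() if w in vocab]

# Training time: fit once and persist the mapping.
train_texts = ["the defendant is guilty", "the defendant is not guilty"]
train_vocab = fit_vocab(train_texts)
blob = pickle.dumps(train_vocab)

# Prediction time: reload the training mapping instead of refitting.
loaded_vocab = pickle.loads(blob)
new_text = "not guilty"
with_training_vocab = encode(new_text, loaded_vocab)
with_fresh_vocab = encode(new_text, fit_vocab([new_text]))

print(with_training_vocab)  # [5, 2]
print(with_fresh_vocab)     # [2, 1]
```

The same words get different ids under a freshly fitted mapping, so a model trained on the original ids would receive meaningless input.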
