ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - error when tokenizing with BERT / DistilBERT

Problem description

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.1, random_state=100)

# data_dir is the path to the CSV file (defined elsewhere)
train, test = split_data(data_dir)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

# Split texts and labels together so that four values are returned
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, padding=True)
test_encodings = tokenizer(test_texts, padding=True)

When I try to tokenize the splits taken from my DataFrame with the BERT tokenizer, I get the error shown in the title.

Solution

I had the same error. The problem was that my list contained a None value, for example:

import pandas as pd
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')

# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE','Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46','KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',None]

labels = [1,2,3,1]

d = {'texts': texts,'labels': labels} 
test_df = pd.DataFrame(d)

So I dropped all rows containing None before converting the DataFrame column to a list.

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts,truncation=True,padding=True)

This worked for me.
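To find out whether a list has this problem before calling the tokenizer, a quick check for non-string entries can help. The sketch below is only an illustration: the helper names `find_invalid_texts` and `drop_invalid_pairs` and the sample data are hypothetical and do not come from the thread. It relies on the fact that empty CSV cells read by pandas become NaN, which is a float rather than a string.

```python
def find_invalid_texts(texts):
    """Return (index, value) pairs for entries that are not plain strings."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


def drop_invalid_pairs(texts, labels):
    """Keep only (text, label) pairs whose text is a str, so both lists stay aligned."""
    kept = [(t, l) for t, l in zip(texts, labels) if isinstance(t, str)]
    clean_texts = [t for t, _ in kept]
    clean_labels = [l for _, l in kept]
    return clean_texts, clean_labels


# Hypothetical sample: None and NaN stand in for missing CSV cells.
texts = ["good movie", None, float("nan"), "bad movie"]
labels = [1, 0, 1, 0]

bad = find_invalid_texts(texts)
clean_texts, clean_labels = drop_invalid_pairs(texts, labels)

print("invalid positions:", [i for i, _ in bad])
print("remaining texts:", clean_texts)
print("remaining labels:", clean_labels)
```

Filtering texts and labels together matters once the column has already been turned into lists. Dropping entries from only the text list would shift the labels out of alignment.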

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.2, random_state=100)

# DATA_DIR is the path to the CSV file (defined elsewhere)
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

# Split texts and labels together so that four values are returned
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, random_state=100)

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, padding=True)
valid_encodings = tokenizer(val_texts, padding=True)
test_encodings = tokenizer(test_texts, padding=True)

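Try changing the size of the split (here test_size=0.2 instead of 0.1); that made it work. It suggests that the split did not contain enough data for the tokenizer to tokenize.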

In my case, I had to set is_split_into_words=True.

https://huggingface.co/transformers/main_classes/tokenizer.html

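The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).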
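The ambiguity mentioned in that passage can be shown with a small sketch. The function `describe_tokenizer_input` below is a hypothetical helper written for this illustration. It is not part of the transformers library and does not reproduce its internal checks. It only mirrors the documented rule that a flat list of strings is read as a batch of sentences unless is_split_into_words=True says it is one sentence already split into words.

```python
def describe_tokenizer_input(text, is_split_into_words=False):
    """Hypothetical helper: say how a value is read under the documented rule."""
    if isinstance(text, str):
        return "single sequence"
    if isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text):
        # The ambiguous case: the same list can mean words or sentences,
        # and only the flag tells the two readings apart.
        if is_split_into_words:
            return "one pretokenized sequence"
        return "batch of sequences"
    # Nested lists, sequence pairs, and lists with None/NaN are not covered here.
    return "other"


words = ["Hello", "world"]

print(describe_tokenizer_input("Hello world"))
print(describe_tokenizer_input(words))
print(describe_tokenizer_input(words, is_split_into_words=True))
print(describe_tokenizer_input(["Hello world", None]))
```

The last call falls into the "other" branch because one entry is not a string. That is the kind of input the first answer in this thread removes with dropna().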
