Problem description
I am training a model with the HuggingFace Trainer class. The following code works fine:
!pip install datasets
!pip install transformers
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification,TrainingArguments,Trainer,AutoTokenizer
dataset = load_dataset('glue','mnli')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',use_fast=True)
def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True, padding=True)
encoded_dataset = dataset.map(preprocess_function,batched=True)
args = TrainingArguments(
    "test-glue", learning_rate=3e-5, per_device_train_batch_size=8, num_train_epochs=3, remove_unused_columns=True
)
trainer = Trainer(
    model, args, train_dataset=encoded_dataset["train"], tokenizer=tokenizer
)
trainer.train()
However, setting remove_unused_columns=False results in the following error:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self,tensor_type,prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
706
ValueError: too many dimensions 'str'
During handling of the above exception,another exception occurred:
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self,prepend_batch_axis)
720 )
721 raise ValueError(
--> 722 "Unable to create tensor,you should probably activate truncation and/or padding "
723 "with 'padding=True' 'truncation=True' to have batched tensors with the same length."
724 )
ValueError: Unable to create tensor,you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Any suggestions would be much appreciated.
Solution
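It fails because value at line 705 is a list of str, which corresponds to the hypothesis column. hypothesis is one of the ignored_columns in trainer.py, so it is normally dropped before batching. With remove_unused_columns=False it stays in the dataset, and the string values cannot be converted to a tensor.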
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self,tensor_type,prepend_batch_axis)
704 if not is_tensor(value):
--> 705 tensor = as_tensor(value)
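To see which columns would trip the tensor conversion, one can scan a single example for string values. The following is a minimal sketch in plain Python, not part of the transformers library: string_columns is a hypothetical helper, and the row is hand-written to resemble one tokenized MNLI example (the sentences and token ids are illustrative only).

```python
def string_columns(row):
    """Return the names of columns whose value is a str or a list containing str."""
    def is_text(value):
        if isinstance(value, str):
            return True
        return isinstance(value, (list, tuple)) and any(isinstance(v, str) for v in value)

    return [name for name, value in row.items() if is_text(value)]


# Hand-written row shaped like one tokenized MNLI example (illustrative values).
row = {
    "premise": "A man is playing a guitar.",
    "hypothesis": "A person makes music.",
    "label": 0,
    "idx": 0,
    "input_ids": [101, 1037, 2158, 102],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
}

text_columns = string_columns(row)
print(text_columns)  # ['premise', 'hypothesis']
```

With remove_unused_columns=False, columns like these would have to be dropped by hand before training, along with any other column the model's forward method does not accept (idx, for instance).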
For the remove_unused_columns flag, see the following snippet from trainer.py:
def _remove_unused_columns(self, dataset: "datasets.Dataset", description: Optional[str] = None):
    if not self.args.remove_unused_columns:
        return dataset
    if self._signature_columns is None:
        # Inspect model forward signature to keep only the arguments it accepts.
        signature = inspect.signature(self.model.forward)
        self._signature_columns = list(signature.parameters.keys())
        # Labels may be named label or label_ids, the default data collator handles that.
        self._signature_columns += ["label", "label_ids"]
    columns = [k for k in self._signature_columns if k in dataset.column_names]
    ignored_columns = list(set(dataset.column_names) - set(self._signature_columns))
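The column-filtering logic in the snippet above can be tried in isolation. The sketch below is a standalone approximation, not the Trainer's own code: ToyModel and split_columns are hypothetical names, and the forward signature is assumed to resemble that of a BERT-style sequence classifier.

```python
import inspect


class ToyModel:
    """Hypothetical stand-in for a sequence classification model."""

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
        return None


def split_columns(model, column_names):
    """Split column names into (kept, ignored) based on the model's forward signature."""
    signature_columns = list(inspect.signature(model.forward).parameters.keys())
    # Mirror the snippet: labels may also be named label or label_ids.
    signature_columns += ["label", "label_ids"]
    kept = [k for k in signature_columns if k in column_names]
    ignored = sorted(set(column_names) - set(signature_columns))
    return kept, ignored


# Column names of the tokenized MNLI dataset from the question.
mnli_columns = ["premise", "hypothesis", "label", "idx",
                "input_ids", "token_type_ids", "attention_mask"]

kept, ignored = split_columns(ToyModel(), mnli_columns)
print(kept)     # ['input_ids', 'attention_mask', 'token_type_ids', 'label']
print(ignored)  # ['hypothesis', 'idx', 'premise']
```

Under these assumptions, premise, hypothesis and idx all land in the ignored list, which is why the string columns never reach the collator when the flag is True.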
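There may be a pull request on HuggingFace to provide a fallback for the case where the flag is False. In general, though, the flag's implementation looks incomplete; for example, it cannot be used with TensorFlow.
Conversely, there is no harm in leaving it set to True unless you have a specific need.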