Fine-tuning GPT-2

Problem description

I am trying to fine-tune GPT-2 on the following task: given five consecutive numbers, predict the next consecutive numbers. For example, if input_text = "one | two | three | four | five", then output_text = "six | seven... | ten".
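
For context, this is roughly how such (input_text, output_text) pairs can be generated. The data-generation code is not shown here, so the num2words package and the make_pair helper below are only a minimal sketch for illustration:

from num2words import num2words

def make_pair(start: int, length: int = 5):
    # question: `length` consecutive numbers spelled out and joined with " | "
    question = " | ".join(num2words(n) for n in range(start, start + length))
    # answer: the next `length` consecutive numbers in the same format
    answer = " | ".join(num2words(n) for n in range(start + length, start + 2 * length))
    return question, answer

question, answer = make_pair(1)
print(question)  # one | two | three | four | five
print(answer)    # six | seven | eight | nine | ten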

The relevant parts of the model I am using via the Huggingface API are the following:

from typing import List, Tuple

import pytorch_lightning as pl
import torch
from transformers import GPT2LMHeadModel


class Model(pl.LightningModule):
    def __init__(self, tokenizer, lr: float) -> None:
        super().__init__()
        self.lr = lr
        # the question wraps this in a Tokenizer(...) helper that is not shown; the tokenizer is used directly here
        self.tokenizer = tokenizer
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')

    def common_step(self, batch: Tuple[List[str], List[str]]) -> torch.FloatTensor:
        questions, answers = batch
        # join question and answer with a literal " <EOS> " string (not a special token)
        combined = [question + " <EOS> " + answer for question, answer in zip(questions, answers)]
        tokens = self.tokenizer(combined, padding=True, return_tensors="pt")
        tokens = {k: v.to(self.device) for k, v in tokens.items()}

        # mask out padding positions so they are ignored by the loss
        labels = tokens["input_ids"].clone()
        labels[tokens["attention_mask"] == 0] = -100

        outputs = self.model(
            input_ids=tokens["input_ids"], attention_mask=tokens["attention_mask"], labels=labels, return_dict=True
        )

        return outputs["loss"]

    def training_step(self, batch: Tuple[List[str], List[str]], *args) -> torch.FloatTensor:
        loss = self.common_step(batch)
        return loss

    def generate_examples(self, batch):
        questions, answers = batch
        # at generation time only the question plus the " <EOS> " separator is given as prompt
        combined = [question + " <EOS> " for question in questions]
        tokens = self.tokenizer(combined, padding=True, return_tensors="pt")
        tokens = {k: v.to(self.device) for k, v in tokens.items()}

        generated = self.model.generate(input_ids=tokens["input_ids"])

        print(questions[0])
        print("=" * 30)
        print(self.tokenizer.decode(generated[0]))
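
For completeness, here is a minimal sketch of how this LightningModule could be driven; the dataset, collate function, optimizer, and hyperparameters below are placeholders for illustration only:

import pytorch_lightning as pl
import torch
from torch.utils.data import DataLoader

class TrainableModel(Model):
    # the optimizer setup is omitted above, so assume a plain AdamW here
    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=self.lr)

def collate(batch):
    # turn a list of (question, answer) pairs into the (questions, answers) tuple expected by common_step
    questions, answers = zip(*batch)
    return list(questions), list(answers)

# tiny stand-in dataset of (question, answer) string pairs
pairs = [("one | two | three | four | five", "six | seven | eight | nine | ten")] * 64
loader = DataLoader(pairs, batch_size=8, shuffle=True, collate_fn=collate)

model = TrainableModel(tokenizer=gpt2_tokenizer, lr=1e-4)  # gpt2_tokenizer as defined further below
trainer = pl.Trainer(max_epochs=3)
trainer.fit(model, loader)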

I can generate outputs, but unfortunately they look like the following. The actual output only starts at the position where the labels would begin; before that, the model just copies the prompt. Note that the GPT-2 tokenizer does not have a pad token by default:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
<|endoftext|>five thousand,five hundred and ninety-one| five thousand,five hundred and ninety-two| five thousand,five hundred and ninety-three| five thousand,five hundred and ninety-four| five thousand,five hundred and ninety-five <EOS> <|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|> fifteen thousand,four hundred and thirty-six| ten thousand,six hundred and sixty-seven| fifteen thousand and sixty‑eight| 15 thousand and eighty-nine| fifteen hundred and seventy<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

So the question is why a plausible candidate is only generated after that long run of padding tokens. In the training set, input and output are joined by a literal " <EOS> " string (which is not an actual special token), and the output follows immediately after it, without any padding in between.

Could this have something to do with the tokenizer I am using, which I set up as follows?

from transformers import GPT2Tokenizer

# make sure GPT-2 adds <|endoftext|> at the beginning (BOS) and end (EOS) of every sequence
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs

GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# set pad_token to unk_token -> be careful here, as unk_token == eos_token == bos_token for GPT-2
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token
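
As a quick sanity check, the patched tokenizer behaves roughly like this: the monkey-patched build_inputs_with_special_tokens wraps every encoded sequence in <|endoftext|> (id 50256), padding reuses that same id because pad_token was set to unk_token, and the literal " <EOS> " separator is split into ordinary sub-word tokens rather than treated as a special token. The example strings below are made up for illustration:

batch = ["one | two | three | four | five <EOS> six | seven | eight | nine | ten",
         "one | two | three | four | five <EOS> "]
enc = gpt2_tokenizer(batch, padding=True, return_tensors="pt")

for ids, mask in zip(enc["input_ids"], enc["attention_mask"]):
    # expect <|endoftext|> at both ends, " <EOS> " as plain text, and id 50256 again for the trailing padding
    print(gpt2_tokenizer.decode(ids))
    print(mask.tolist())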

A working example as a colab notebook can be found here.
