Fine-tuning GPT-2 on a large text to generate domain-specific text

Problem description

I am trying to train GPT-2 on a very large text so that it can generate text from a specific domain.
I am working with TensorFlow 2.

For example, say I have all of the Harry Potter books :)
I want to train GPT-2 on them so that I can later generate text from the Harry Potter domain.

from tensorflow.keras.utils import get_file
from transformers import GPT2Tokenizer,TFGPT2Model

text = '...'
# Length of text: 474429 characters
# 84 unique characters

tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = TFGPT2Model.from_pretrained('gpt2-medium')

encoded_input = tokenizer(text,return_tensors='tf') # ERROR
output = model(encoded_input)

input_ids = tokenizer.encode('severus snape',return_tensors='tf')
greedy_output = model.generate(input_ids,max_length=50)
print(tokenizer.decode(greedy_output[0],skip_special_tokens=True))

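Error: Token indices sequence length is longer than the specified maximum sequence length for this model (149887 > 1024). Running this sequence through the model will result in indexing errors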

So how do I make this work?
How do I feed the model a large amount of new text for training?
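
One common way to approach this, sketched here under assumptions rather than taken from the original post, is to tokenize the whole text once and then cut the resulting token IDs into blocks that are no longer than the model's 1024-position limit. The helper name split_into_blocks and the stand-in list of IDs are hypothetical; in practice the IDs would come from something like tokenizer(text)['input_ids'].

```python
def split_into_blocks(token_ids, block_size=1024, drop_last=True):
    """Cut a flat list of token IDs into consecutive blocks of block_size."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [token_ids[i:i + block_size]
              for i in range(0, len(token_ids), block_size)]
    # A shorter trailing block would need padding; here it is simply dropped.
    if drop_last and blocks and len(blocks[-1]) < block_size:
        blocks.pop()
    return blocks


# Stand-in for real token IDs (assumption: 149887 tokens, as in the warning).
token_ids = list(range(149887))
blocks = split_into_blocks(token_ids)
print(len(blocks), len(blocks[0]))
```

Each block can then be used as one training example. Dropping the short trailing block is only one option; padding it together with a matching attention mask is another.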

EDIT:
When I tokenize the text in pieces and concatenate the results, the tokenizer works but the model does not:

import tensorflow as tf
from textwrap import wrap

# Split the raw text into pieces of at most 1000 characters
text_batches = wrap(text, 1000)

encoded_input = None

for tb in text_batches:
    current = tokenizer(tb, return_tensors='tf')

    if encoded_input is None:
        encoded_input = current
    else:
        # Joining along the sequence axis rebuilds one long sequence
        encoded_input['input_ids'] = tf.concat([encoded_input['input_ids'], current['input_ids']], axis=-1)
        encoded_input['attention_mask'] = tf.concat([encoded_input['attention_mask'], current['attention_mask']], axis=1)

output = model(encoded_input) # ERROR

Error: InvalidArgumentError: indices[0,1024] = 1024 is not in [0, 1024) [Op:ResourceGather]

What am I missing?
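
A likely reading of the second error, offered as a sketch rather than a confirmed diagnosis: concatenating along the last axis rebuilds a single row that is longer than 1024 tokens, and GPT-2's position table only covers indices 0 to 1023, so position 1024 is the first one out of range. Stacking fixed-length blocks along the batch axis keeps every row within the limit. The names MAX_POSITIONS, fits_position_table and make_batches, and the dummy blocks, are hypothetical and use plain Python lists in place of tensors.

```python
MAX_POSITIONS = 1024  # GPT-2's position table covers indices 0..1023


def fits_position_table(seq_len, max_positions=MAX_POSITIONS):
    """Return True if a sequence of seq_len tokens has a valid position for every token."""
    return seq_len <= max_positions


def make_batches(blocks, batch_size):
    """Group equal-length blocks into batches; each batch is a list of rows."""
    return [blocks[i:i + batch_size] for i in range(0, len(blocks), batch_size)]


# Hypothetical data: 6 blocks of 1024 token IDs each.
blocks = [[0] * 1024 for _ in range(6)]

# Joining along the sequence axis gives one row of 6144 tokens: too long.
concatenated = [tok for block in blocks for tok in block]
print(fits_position_table(len(concatenated)))  # False

# Stacking along the batch axis keeps every row at 1024 tokens.
batches = make_batches(blocks, batch_size=2)
print(len(batches), len(batches[0]), len(batches[0][0]))
```

Note that stacking requires rows of equal length. The pieces produced by wrap(text, 1000) tokenize to varying lengths, so they would need padding or re-chunking at the token level before they could be stacked.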


deep-learning huggingface-transformers keras nlp tensorflow