Problem description
I have the following piece of code, which lets me extract features with bert-base-uncased using BertModel from pytorch_pretrained_bert:
from pytorch_pretrained_bert.modeling import BertModel
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

def extract_bert_features(self, conll_dataset):
    sentences = [[e.form for e in sentence] for sentence in conll_dataset]
    # data loading
    features = []
    for sentence in sentences:
        bert_tokens, map_to_original_tokens = self.convert_to_bert_tokenization(sentence)
        feature = self.from_bert_tokens_to_features(bert_tokens, map_to_original_tokens)
        features.append(feature)
    all_input_ids = torch.tensor([f.input_ids for f in features], dtype=torch.long)
    # mask with 0's for placeholders
    all_input_mask = torch.tensor([f.input_mask for f in features], dtype=torch.long)
    # tensor with 1...n where n is the number of examples
    all_token_maps = torch.tensor([f.map_to_original_tokens for f in features], dtype=torch.long)
    # indexes that map back to the dataset
    all_example_index = torch.arange(all_input_ids.size(0), dtype=torch.long)
    # create a dataset with the resources needed
    eval_data = TensorDataset(all_input_ids, all_input_mask, all_token_maps, all_example_index)
    # create a sampler which will be used to create the batches
    eval_sampler = SequentialSampler(eval_data)
    eval_dataloader = DataLoader(eval_data, sampler=eval_sampler, batch_size=self.batch_size)
    for input_ids, input_mask, token_maps, example_indices in eval_dataloader:
        input_ids = input_ids.to(self.device)
        input_mask = input_mask.to(self.device)
        ### RUN MODEL: run model to get all 12 layers of bert ###
        all_encoder_layers, _ = self.model(input_ids, token_type_ids=None, attention_mask=input_mask)
        averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
        for i, idx in enumerate(example_indices):
            for j, coll_entry in enumerate(conll_dataset[idx]):
                if token_maps[i, j] < 511:
                    coll_entry.bert = averaged_output[i, token_maps[i, j]].clone().detach().cpu()
                else:
                    coll_entry.bert = averaged_output[i, 511].clone().detach().cpu()
This works fine with bert-base-uncased from pytorch_pretrained_bert, because the all_encoder_layers object is a list of 12 hidden-layer tensors, which lets me pick the layers at the idx positions and average them. Specifically, the dimensions are:
print("All encoder layers: ",all_encoder_layers) # list type
print("Number of layers:",len(all_encoder_layers)) # 12
print("Number of batches:",len(all_encoder_layers[0])) # 1
print("Number of tokens:",len(all_encoder_layers[0][0])) # 512
print("Number of hidden units:",len(all_encoder_layers[0][0][0])) # 768
print("Idx: ",self.layer_indexes) # [-1,-2,-3,-4]
print("Averaged_output len: ",len(averaged_output)) # 1
print("Averaged_output dim: ",averaged_output.shape) # torch.Size([1,512,768])
However, when I migrate the code to the transformers library, importing AutoTokenizer and AutoModel, the resulting all_encoder_layers object is no longer the full list of 12 hidden layers but a single torch tensor, with shape torch.Size([1,512,768]). Specifically, the dimensions now are:
print("All encoder layers: ",all_encoder_layers) # tensor type
print("Number of layers:",len(all_encoder_layers)) # 1
print("Number of batches:",len(all_encoder_layers[0])) # 512
print("Number of tokens:",len(all_encoder_layers[0][0])) # 768
print("Size of all encoder_layers: ",all_encoder_layers.size()) # torch.Size([1,768])
print("Idx: ",self.layer_indexes) # [-1,-4]
When I then try to create the averaged_output, this leads to the following error:

  File "/.../bert_features.py", line 103, in extract_bert_features
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
  File "/.../bert_features.py", in <listcomp>
    averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
IndexError: index -2 is out of bounds for dimension 0 with size 1

The migration documentation states that I should take the first element of the all_encoder_layers object as a replacement, but is doing that the same as the averaging operation I was doing before? If the answer is yes, then I am fine; otherwise, do you have any ideas on how to adapt this line, which works for bert-base-uncased with pytorch_pretrained_bert, so that it also works with the transformers version?

averaged_output = torch.stack([all_encoder_layers[idx] for idx in self.layer_indexes]).mean(0) / len(self.layer_indexes)
Thanks a lot, everyone!
EDIT: the most obvious thing is that the shape of the tensor returned by transformers is the same as that of averaged_output. In fact, using it directly in place of averaged_output makes the code work and gives excellent results. The problem is that, by doing so, we are no longer considering only the last 4 layers, but rather all 12 layers condensed into a single tensor. Does anyone know whether this change matters for BERT (and especially UmBERTo) feature-extraction purposes, or whether I can safely ignore it?
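For the record, here is a sketch of what I imagine an equivalent computation could look like, assuming (this is an assumption on my part, not something I have confirmed against the migration guide) that passing output_hidden_states=True to from_pretrained exposes the per-layer hidden states again:

from transformers import AutoModel, AutoTokenizer
import torch

# Assumption: output_hidden_states=True adds a tuple of 13 hidden-state tensors
# (embedding output + 12 layers) to whatever the model returns.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

input_ids = tokenizer.encode("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids)

# Depending on the version, the hidden states are the last tuple element or the
# .hidden_states attribute of the returned ModelOutput.
hidden_states = outputs.hidden_states if hasattr(outputs, "hidden_states") else outputs[-1]
print(len(hidden_states))  # 13

# Averaging the last four layers, analogous to the original averaged_output line.
layer_indexes = [-1, -2, -3, -4]
averaged_output = torch.stack([hidden_states[idx] for idx in layer_indexes]).mean(0)
print(averaged_output.shape)  # [1, seq_len, 768]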