How to tokenize new vocab in spaCy?

Problem description

I am using spaCy to take advantage of its dependency parsing, but I am having trouble getting the spaCy tokenizer to tokenize new vocab that I added. Here is my code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# look up the new term in the vocab (creates a lexeme entry if it does not exist)
nlp.vocab['bone morphogenetic protein (BMP)-2']

# replace the default tokenizer with a bare Tokenizer (whitespace splitting only)
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text, token.tag_) for token in doc])

Output

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

Desired output

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNS'), ('for', 'IN'), ('BMP receptor type IB', 'NN'), ('(', '('), ('BMPRIB', 'NN'), (')', ')'), ('.', '.')]

How can I get spaCy to tokenize the new vocab that I added?

Solution

See whether Doc.retokenize() can help you:

import spacy
nlp = spacy.load("en_core_web_md")
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

# merge the tokens that make up the multi-word term into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:11])

print([(token.text,token.tag_) for token in doc])

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
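The slice doc[6:11] is hard-coded for this one sentence. As a rough sketch (not part of the original answer), you could instead locate the phrases with spaCy's PhraseMatcher and merge whatever it finds; the term list below is only an illustration, and the list-style PhraseMatcher.add call assumes spaCy v2.2.2+ or v3:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_md")

# phrases that should end up as single tokens (illustrative list)
terms = ["bone morphogenetic protein (BMP)-2", "BMP receptor type IB"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", [nlp.make_doc(t) for t in terms])

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)

# convert matches to spans, drop overlapping ones, and merge each span into one token
spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):
        retokenizer.merge(span)

print([(token.text, token.tag_) for token in doc])

PhraseMatcher matches against the document's own tokenization, so each term must tokenize the same way in isolation as it does inside the sentence.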

I found a solution in nlp.tokenizer.tokens_from_list. I break the sentence into a list of words and then tokenize it as needed:

import spacy

nlp = spacy.load("en_core_web_sm")

# replace the tokenizer so nlp.pipe accepts pre-split word lists
nlp.tokenizer = nlp.tokenizer.tokens_from_list

words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs',
         'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

for doc in nlp.pipe([words]):
    for token in doc:
        print(token, '//', token.dep_)
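One caveat, not mentioned in the original answer: tokens_from_list has been deprecated since spaCy v2 in favour of constructing a Doc from the word list directly. A minimal sketch of that approach, assuming the same word list as above:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs',
         'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

# build a Doc with our own tokenization, then run each pipeline component on it
doc = Doc(nlp.vocab, words=words)
for name, pipe in nlp.pipeline:
    doc = pipe(doc)

for token in doc:
    print(token, '//', token.dep_)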