How to tokenize new vocab in spaCy?

Problem description

I am using spaCy to take advantage of its dependency parsing, but I am having trouble getting the spaCy tokenizer to tokenize new vocab that I added. Here is my code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# look up the new term in the vocab (creates a lexeme entry if it does not exist)
nlp.vocab['bone morphogenetic protein (BMP)-2']

# replace the default tokenizer with a bare Tokenizer (whitespace splitting only)
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text, token.tag_) for token in doc])

Output

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

Desired output

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNS'), ('for', 'IN'), ('BMP receptor type IB', 'NN'), ('(', '('), ('BMPRIB', 'NN'), (')', ')'), ('.', '.')]

How can I get spaCy to tokenize the new vocab that I added?

Solution

See whether Doc.retokenize() can help you:

import spacy
nlp = spacy.load("en_core_web_md")
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

# merge the tokens that make up the multi-word term into a single token
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[6:11])

print([(token.text,token.tag_) for token in doc])

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
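The slice doc[6:11] is hard-coded for this one sentence. As a rough sketch (not part of the original answer), you could instead locate the phrases with spaCy's PhraseMatcher and merge whatever it finds; the term list below is only an illustration, and the list-style PhraseMatcher.add call assumes spaCy v2.2.2+ or v3:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_md")

# phrases that should end up as single tokens (illustrative list)
terms = ["bone morphogenetic protein (BMP)-2", "BMP receptor type IB"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("TERMS", [nlp.make_doc(t) for t in terms])

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)

# convert matches to spans, drop overlapping ones, and merge each span into one token
spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):
        retokenizer.merge(span)

print([(token.text, token.tag_) for token in doc])

PhraseMatcher matches against the document's own tokenization, so each term must tokenize the same way in isolation as it does inside the sentence.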

I found a solution in nlp.tokenizer.tokens_from_list. I break the sentence into a list of words and then tokenize it as needed:

import spacy

nlp = spacy.load("en_core_web_sm")

# replace the tokenizer so nlp.pipe accepts pre-split word lists
nlp.tokenizer = nlp.tokenizer.tokens_from_list

words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs',
         'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

for doc in nlp.pipe([words]):
    for token in doc:
        print(token, '//', token.dep_)
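One caveat, not mentioned in the original answer: tokens_from_list has been deprecated since spaCy v2 in favour of constructing a Doc from the word list directly. A minimal sketch of that approach, assuming the same word list as above:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs',
         'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

# build a Doc with our own tokenization, then run each pipeline component on it
doc = Doc(nlp.vocab, words=words)
for name, pipe in nlp.pipeline:
    doc = pipe(doc)

for token in doc:
    print(token, '//', token.dep_)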