Problem description
I am using spaCy to benefit from its dependency parsing, but I am having trouble getting the spaCy tokenizer to treat a new vocab entry I added as a single token. Here is my code:
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")
nlp.vocab['bone morphogenetic protein (BMP)-2']
nlp.tokenizer = Tokenizer(nlp.vocab)
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)
print([(token.text,token.tag_) for token in doc])
Output:
[('This','DT'),('study','NN'),('describes','VBZ'),('the','DT'),('distributions','NNS'),('of','IN'),('bone','NN'),('morphogenetic','JJ'),('protein','NN'),('(BMP)-2','NNP'),('as','RB'),('well','RB'),('as','IN'),('mRNAs','NNS'),('for','IN'),('BMP','NNP'),('receptor','NN'),('type','NN'),('IB','NNP'),('(BMPRIB).','NN')]
Desired output:
[('This','DT'),('study','NN'),('describes','VBZ'),('the','DT'),('distributions','NNS'),('of','IN'),('bone morphogenetic protein (BMP)-2','NN'),('as','RB'),('well','RB'),('as','IN'),('mRNAs','NNS'),('for','IN'),('BMP receptor type IB','NN'),('(','('),('BMPRIB','NN'),(')',')'),('.','.')]
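For context, a minimal sketch (assuming the standard spaCy v2/v3 API) of why the code above behaves this way: looking a string up in nlp.vocab only creates a Lexeme entry and does not affect tokenization at all, and a bare Tokenizer(nlp.vocab) with no prefix/suffix/infix rules splits on whitespace only, which is why '(BMP)-2' and '(BMPRIB).' come out glued together:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# Looking up a string only adds a Lexeme to the vocabulary; it does not
# teach the tokenizer to emit that string as a single token.
lex = nlp.vocab['bone morphogenetic protein (BMP)-2']
print(type(lex))  # <class 'spacy.lexeme.Lexeme'>

# A Tokenizer constructed without any rules falls back to whitespace splitting.
plain = Tokenizer(nlp.vocab)
print([t.text for t in plain('protein (BMP)-2 (BMPRIB).')])
# expected: ['protein', '(BMP)-2', '(BMPRIB).']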
Solution
See whether Doc.retokenize() can help you:
import spacy
nlp = spacy.load("en_core_web_md")
text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)
with doc.retokenize() as retokenizer:
    # doc[6:11] covers "bone morphogenetic protein (BMP)-2" under the default tokenization
    retokenizer.merge(doc[6:11])
print([(token.text,token.tag_) for token in doc])
[('This','DT'),('study','NN'),('describes','VBZ'),('the','DT'),('distributions','NNS'),('of','IN'),('bone morphogenetic protein (BMP)-2','NN'),('as','RB'),('well','RB'),('as','IN'),('mRNAs','NNP'),('for','IN'),('BMP','NNP'),('receptor','NN'),('type','NN'),('IB','NNP'),('(','-LRB-'),('BMPRIB','NNP'),(')','-RRB-'),('.','.')]
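As a follow-up (a hedged sketch, not part of the original answer, assuming the PhraseMatcher and spacy.util.filter_spans available in spaCy v2.2+/v3): instead of hard-coding the slice doc[6:11], the terms to merge can be located with a PhraseMatcher and passed to the same retokenizer:

import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_md")
terms = ['bone morphogenetic protein (BMP)-2', 'BMP receptor type IB']

# PhraseMatcher works on token sequences, so patterns made with nlp.make_doc
# are tokenized the same way as the document and line up with it.
matcher = PhraseMatcher(nlp.vocab)
matcher.add('TERMS', [nlp.make_doc(term) for term in terms])

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
doc = nlp(text)

# Keep only non-overlapping matches, then merge each one into a single token.
spans = filter_spans([doc[start:end] for _, start, end in matcher(doc)])
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([(token.text, token.tag_) for token in doc])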
Another solution
I found a solution with nlp.tokenizer.tokens_from_list: I split the sentence into a list of words myself, the way I want it tokenized, and then let spaCy process it.
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = nlp.tokenizer.tokens_from_list
for doc in nlp.pipe([['This', 'study', 'describes', 'the', 'distributions', 'of',
                      'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as',
                      'mRNA', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']]):
    for token in doc:
        print(token, '//', token.dep_)
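One caveat: tokens_from_list belongs to older spaCy releases (v2) and was deprecated in favour of constructing a Doc directly. A rough equivalent for newer versions (a sketch, assuming a spaCy v3 pipeline and the Doc(vocab, words=...) constructor) is to build a Doc from the pre-split word list and run the pipeline components over it:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
words = ['This', 'study', 'describes', 'the', 'distributions', 'of',
         'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as',
         'mRNA', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']

# Build a Doc directly from the pre-tokenized words, bypassing the tokenizer,
# then apply each pipeline component (tagger, parser, ...) to it in order.
doc = Doc(nlp.vocab, words=words)
for _, component in nlp.pipeline:
    doc = component(doc)

for token in doc:
    print(token, '//', token.dep_)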