Problem description
I am trying to create a corpus of documents consisting of lemmatized nouns and noun chunks. I am using this code:
import spacy

nlp = spacy.load('en_core_web_sm')

def lemmatizer(doc, allowed_postags=['NOUN']):
    doc = [token.lemma_ for token in doc if token.pos_ in allowed_postags]
    doc = u' '.join(doc)
    return nlp.make_doc(doc)

nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'), after='ner')
nlp.add_pipe(lemmatizer, name='lemm', after='merge_noun_chunks')

doc_list = []
for doc in data:
    pr = nlp(doc)
    doc_list.append(pr)
After the noun chunks are recognized, the sentence 'the euro area has advanced a long way as a monetary union' is tokenized as ['the euro area', 'advanced', 'long', 'way', 'a monetary union'], and after lemmatization it becomes ['euro', 'area', 'monetary', 'union'].
Is there a way to keep the words of each recognized noun chunk together, so that the output is ['the euro area', 'a monetary union'] or ['the_euro_area', 'a_monetary_union']?
Thanks for your help!
Solution
I don't think your problem is related to lemmatization. The following approach works for your example:
# merge noun phrases and entities
def merge_noun_phrase(doc):
    spans = list(doc.ents) + list(doc.noun_chunks)
    spans = spacy.util.filter_spans(spans)
    with doc.retokenize() as retokenizer:
        for span in spans:
            retokenizer.merge(span)
    return doc

sentence = "the euro area has advanced a long way as a monetary union"
doc = nlp(sentence)
doc2 = merge_noun_phrase(doc)
for token in doc2:
    print(token)
# ['the euro area', 'way', 'a monetary union']
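If you want this merging to happen inside the pipeline, as in the question's code, spaCy also ships a built-in merge_noun_chunks component. A minimal sketch, assuming spaCy 3.x and the en_core_web_sm model (the spaCy 2.x equivalent via nlp.create_pipe is shown as a comment):

import spacy

nlp = spacy.load('en_core_web_sm')

# spaCy 3.x: built-in factories are added by name
nlp.add_pipe('merge_noun_chunks', after='ner')
# spaCy 2.x equivalent:
# nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'), after='ner')

doc = nlp("the euro area has advanced a long way as a monetary union")
print([token.text for token in doc])
# noun chunks such as 'the euro area' now appear as single tokens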
I should note that I am using spaCy 2.3.5; maybe spacy.util.filter_spans has been deprecated in the latest version. This answer may help you: :)
Module 'spacy.util' has no attribute 'filter_spans'
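If filter_spans really is unavailable in your installation, a small helper with the same behaviour (prefer the longest spans and drop any span that overlaps one already kept) can stand in for it. This is only a sketch of that idea, not the library's own implementation:

def filter_overlapping_spans(spans):
    # prefer longer spans; break ties by earlier start position
    sorted_spans = sorted(spans, key=lambda s: (s.end - s.start, -s.start), reverse=True)
    seen_tokens = set()
    kept = []
    for span in sorted_spans:
        # keep the span only if none of its token indices are claimed yet
        if not any(i in seen_tokens for i in range(span.start, span.end)):
            kept.append(span)
            seen_tokens.update(range(span.start, span.end))
    kept.sort(key=lambda s: s.start)
    return kept

# usage inside merge_noun_phrase:
# spans = filter_overlapping_spans(list(doc.ents) + list(doc.noun_chunks))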
Also, if you still want to lemmatize the noun chunks, you can do it as follows:
doc = nlp("the euro area has advanced a long way as a monetary union")
for chunk in doc.noun_chunks:
    print(chunk.lemma_)
# ['the euro area', 'a monetary union']
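If you prefer the underscore-joined form from the question, you can build it from the same noun chunks. A small sketch (note that, depending on the model, 'a long way' may also be returned as a chunk):

doc = nlp("the euro area has advanced a long way as a monetary union")
# join the tokens of each noun chunk with underscores, e.g. 'the_euro_area'
joined = ['_'.join(token.text for token in chunk) for chunk in doc.noun_chunks]
print(joined)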
According to the answer to What is the lemma for 'two pets', "it is probably not very useful to look at lemmas on the span level; it makes more sense to work on the token level."
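In that spirit, a lemmatized version of each chunk can be built from its token-level lemmas while still keeping the chunk's words together; a minimal sketch:

doc = nlp("the euro area has advanced a long way as a monetary union")
for chunk in doc.noun_chunks:
    # assemble the chunk's lemma from the lemmas of its individual tokens
    print(' '.join(token.lemma_ for token in chunk))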