How to find proper nouns using spaCy NLP

Question

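I am building a keyword extractor with spaCy. The keyword I am looking for is OpTic Gaming in the following text:

"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"

How can I parse OpTic Gaming out of this text? If I use noun_chunks, I get "OpTic Gaming's main sponsors", and if I use tokens, I get ["OpTic", "Gaming", "'s"].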

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)

The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of

Answer

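spaCy extracts the part of speech of each token for you (proper noun, determiner, verb, etc.). You can access it at the token level with token.pos_.

In your case: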

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for tok in doc:
    print(tok, tok.pos_)

...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...

You can then filter the proper nouns, group the consecutive ones, and slice the doc to obtain the nominal groups:

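def extract_proper_nouns(doc):
    """Return one span per run of consecutive proper-noun tokens."""
    # Indices of all tokens tagged as proper nouns
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []  # completed runs of adjacent indices
    current = []       # run currently being built
    for elt in pos:
        if not current or current[-1] == elt - 1:
            # Start a run, or extend it when elt directly follows the last index
            current.append(elt)
        else:
            # Gap found: close the current run and start a new one
            consecutives.append(current)
            current = [elt]
    if current:
        consecutives.append(current)
    # Slice the doc from the first to the last index of each run
    return [doc[run[0]:run[-1] + 1] for run in consecutives]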

extract_proper_nouns(doc)

[OpTic Gaming, Duty Championship]
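
The grouping step inside extract_proper_nouns can also be looked at on its own, without loading a spaCy model. The following is only a minimal sketch using the standard library: the helper name group_consecutive and the sample indices are hypothetical and chosen for illustration, and itertools.groupby is swapped in for the explicit loop used above.

```python
from itertools import groupby


def group_consecutive(indices):
    """Group a sorted list of ints into (start, end) pairs, end exclusive.

    Hypothetical helper for illustration; not part of spaCy.
    """
    runs = []
    # Inside a run of consecutive integers, value - position stays constant,
    # so that difference can serve as the grouping key.
    for _, group in groupby(enumerate(indices), key=lambda pair: pair[1] - pair[0]):
        members = [value for _, value in group]
        runs.append((members[0], members[-1] + 1))
    return runs


# Sample token indices, assumed for illustration only
print(group_consecutive([6, 7, 22, 23]))  # → [(6, 8), (22, 24)]
```

Each (start, end) pair could then be used to slice the doc, as in doc[start:end], which is what the return statement of extract_proper_nouns does with the first and last index of each run.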

More details here: https://spacy.io/usage/linguistic-features