How to find proper nouns with spaCy NLP

Question

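I am building a keyword extractor with spaCy. The keyword I am looking for is OpTic Gaming in the following text.

"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"

How can I parse OpTic Gaming out of this text? If I use noun_chunks, I get "OpTic Gaming's main sponsors", and if I take the tokens, I get ["OpTic", "Gaming", "'s"].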

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for chunk in doc.noun_chunks:
    print(chunk.text,chunk.root.text,chunk.root.dep_,chunk.root.head.text)

The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of

Answer

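spaCy assigns a part-of-speech tag (proper noun, determiner, verb, etc.) to every token for you. You can access it at the token level with token.pos_.

In your case: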

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")

for tok in doc:
    print(tok,tok.pos_)

...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...

You can then filter for proper nouns, group consecutive proper nouns together, and slice the doc to get the nominal groups:

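def extract_proper_nouns(doc):
    # Indices of all tokens tagged as proper nouns
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []  # finished runs of adjacent indices
    current = []       # run currently being built
    for elt in pos:
        if len(current) == 0:
            current.append(elt)
        else:
            if current[-1] == elt - 1:
                # Directly follows the previous proper noun: extend the run
                current.append(elt)
            else:
                # Gap found: close the current run and start a new one
                consecutives.append(current)
                current = [elt]
    if len(current) != 0:
        consecutives.append(current)
    # Slice the doc so that each run becomes a Span
    return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]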

extract_proper_nouns(doc)

[OpTic Gaming, Duty Championship]
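
The grouping step can also be written more compactly with itertools.groupby. The following is only a sketch under some assumptions: group_proper_nouns and the sample list are hypothetical names introduced here for illustration, the function works on plain (text, pos) pairs rather than on a spaCy Doc, and the tags in sample are written by hand instead of coming from a model.

```python
from itertools import groupby


def group_proper_nouns(tagged_tokens):
    """Join each run of consecutive PROPN tokens into one phrase.

    tagged_tokens: iterable of (text, pos) pairs. With spaCy these could be
    built as [(tok.text, tok.pos_) for tok in doc].
    """
    phrases = []
    # groupby yields runs of adjacent pairs that share the same is-PROPN flag
    for is_propn, run in groupby(tagged_tokens, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(text for text, _ in run))
    return phrases


# Hand-written tags for illustration only (assumed, not actual model output)
sample = [
    ("one", "NUM"), ("of", "ADP"), ("OpTic", "PROPN"), ("Gaming", "PROPN"),
    ("'s", "PART"), ("main", "ADJ"), ("sponsors", "NOUN"),
]
print(group_proper_nouns(sample))  # ['OpTic Gaming']
```

Unlike the function above, this version returns plain strings joined with a single space, so it loses the Span objects and their offsets. If you need those, slicing the doc as shown earlier is the better fit.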

More details here: https://spacy.io/usage/linguistic-features
