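Problem description
I am using spaCy to build a keyword extractor. The keyword I am looking for is OpTic Gaming in the following text.
"The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017"
How can I parse OpTic Gaming out of this text? If I use noun_chunks, I get OpTic Gaming's main sponsors, and if I take the tokens, I get ["OpTic", "Gaming", "'s"].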
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
The company company nsubj was
OpTic Gaming's main sponsors sponsors pobj of
their first Call Call pobj to
Duty Championship Championship pobj of
Solution
spaCy extracts the part of speech for you (proper noun, determiner, verb, etc.). You can read it from each token with token.pos_.
In your case:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The company was also one of OpTic Gaming's main sponsors during the legendary organization's run to their first Call of Duty Championship back in 2017")
for tok in doc:
    print(tok, tok.pos_)
...
one NUM
of ADP
OpTic PROPN
Gaming PROPN
...
You can then filter the proper nouns, group consecutive proper nouns together, and slice the doc to get the nominal groups:
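def extract_proper_nouns(doc):
    # Indices of every token tagged as a proper noun
    pos = [tok.i for tok in doc if tok.pos_ == "PROPN"]
    consecutives = []
    current = []
    for elt in pos:
        if len(current) == 0:
            current.append(elt)
        else:
            if current[-1] == elt - 1:
                # Adjacent to the previous proper noun: extend the current run
                current.append(elt)
            else:
                # Gap found: close the current run and start a new one
                consecutives.append(current)
                current = [elt]
    if len(current) != 0:
        consecutives.append(current)
    # Slice the doc so each run becomes a Span
    return [doc[consecutive[0]:consecutive[-1]+1] for consecutive in consecutives]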
extract_proper_nouns(doc)
[OpTic Gaming, Duty Championship]
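The grouping step can also be written more compactly with itertools.groupby. The sketch below is only an illustration of that idea, not part of the original answer: group_proper_nouns is a hypothetical helper, it works on plain (text, pos) pairs so that it runs without a model, and the tags in sample are written by hand to mirror the output shown earlier rather than produced by spaCy. With a real doc, the pairs could be built as [(tok.text, tok.pos_) for tok in doc].

```python
from itertools import groupby


def group_proper_nouns(tagged):
    """Join runs of adjacent PROPN tokens into phrases.

    `tagged` is a sequence of (text, pos) pairs (hypothetical input format).
    """
    phrases = []
    # groupby yields one group per maximal run of equal keys, so each run of
    # consecutive proper nouns arrives as a single group.
    for is_propn, run in groupby(tagged, key=lambda pair: pair[1] == "PROPN"):
        if is_propn:
            phrases.append(" ".join(text for text, _ in run))
    return phrases


# Hand-written tags for a fragment of the sentence (an assumption, not model output).
sample = [
    ("one", "NUM"), ("of", "ADP"), ("OpTic", "PROPN"), ("Gaming", "PROPN"),
    ("'s", "PART"), ("main", "ADJ"), ("sponsors", "NOUN"),
]
print(group_proper_nouns(sample))  # ['OpTic Gaming']
```

Unlike extract_proper_nouns, this variant returns plain strings rather than Span objects, so it fits best when only the keyword text is needed.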