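Extracting multi-word named entities with NLTK Stanford NER in Python

Problem

I am trying to extract named entities from text using Stanford NER. I have read all the related threads on chunking but have not found anything that solves my problem.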

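Input:

The united nations is holding a meeting in the united states of America.

Expected output:

united nations/ORGANIZATION

united states of America/LOCATION

I am able to get the output below, but it does not join the tokens of multi-word named entities: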

[('The','O'),('united','ORGANIZATION'),('nations','ORGANIZATION'),('is','O'),('holding','O'),('a','O'),('meeting','O'),('in','O'),('the','O'),('united','LOCATION'),('states','LOCATION'),('of','LOCATION'),('America','LOCATION'),('.','O')]

Or in tree format:

(S
  The/O
  united/ORGANIZATION
  nations/ORGANIZATION
  is/O
  holding/O
  a/O
  meeting/O
  in/O
  the/O
  united/LOCATION
  states/LOCATION
  of/LOCATION
  America/LOCATION
  ./O)

I am looking for the following output:

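[('The','O'),('united nations','ORGANIZATION'),('is','O'),('holding','O'),('a','O'),('meeting','O'),('in','O'),('the','O'),('united states of America','LOCATION'),('.','O')]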

When I tried some code from other threads that joins named entities in tree format, it returned an empty list.

import nltk
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences
java_path = r"C:\Program Files (x86)\Java\jre1.8.0_251/java.exe"
os.environ['JAVAHOME'] = java_path

st = StanfordNERTagger(r'stanford-ner-4.0.0/stanford-ner-4.0.0/classifiers/english.all.3class.distsim.crf.ser.gz',r'stanford-ner-4.0.0/stanford-ner-4.0.0/stanford-ner.jar',encoding='utf-8')

text = 'The united nations is holding a meeting in the united states of America.'
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
namedEnt = nltk.ne_chunk(classified_text,binary = True)

#this line makes the tree return an empty list
np = [' '.join([y[0] for y in x.leaves()]) for x in namedEnt.subtrees() if x.label() == "NE"]

print(np)

print(classified_text)

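Solution

The StanfordNERTagger in nltk does not retain information about named-entity boundaries. If you try to parse the tagger's output, there is no way to tell whether two consecutive nouns with the same tag belong to the same entity or are separate entities.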
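If that ambiguity is acceptable for your data, one possible workaround is to merge each run of adjacent tokens that share the same non-O tag. The sketch below is not part of NLTK; the helper name merge_adjacent_entities is made up for illustration, and it rests on the assumption that neighbouring tokens with the same tag always form a single entity, which is exactly what the tagger output cannot guarantee.

```python
from itertools import groupby


def merge_adjacent_entities(tagged_tokens, outside_tag="O"):
    """Join runs of adjacent tokens that carry the same non-O tag.

    Hypothetical helper: assumes neighbouring tokens with the same tag
    belong to one entity, so two distinct adjacent entities of the same
    type would be merged into one.
    """
    merged = []
    # groupby only groups *consecutive* items with an equal key (the tag).
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == outside_tag:
            # Non-entity tokens stay as individual items.
            merged.extend((word, tag) for word in words)
        else:
            merged.append((" ".join(words), tag))
    return merged


# Tagger output copied from the question.
classified_text = [
    ('The', 'O'), ('united', 'ORGANIZATION'), ('nations', 'ORGANIZATION'),
    ('is', 'O'), ('holding', 'O'), ('a', 'O'), ('meeting', 'O'), ('in', 'O'),
    ('the', 'O'), ('united', 'LOCATION'), ('states', 'LOCATION'),
    ('of', 'LOCATION'), ('America', 'LOCATION'), ('.', 'O'),
]

merged = merge_adjacent_entities(classified_text)
entities_only = [pair for pair in merged if pair[1] != 'O']
print(merged)
print(entities_only)
```

The entities_only list keeps just the joined entities, which matches the shorter "expected output" form in the question.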

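Alternatively, https://stanfordnlp.github.io/CoreNLP/other-languages.html#python notes that the Stanford team is actively developing a Python package called Stanza, which also provides a Python interface to Stanford CoreNLP. It is slow, but really easy to use.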

$ pip3 install stanza

>>> import stanza
>>> stanza.download('en')
>>> nlp = stanza.Pipeline('en')
>>> text = 'The united nations is holding a meeting in the united states of America.'
>>> results = nlp(text)

The chunked entities are in results.ents.
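To get from results.ents to the word/TYPE form asked for in the question, something like the following sketch may work. It relies on each entity span exposing .text and .type; the helper names entity_pairs and main are made up for illustration. The label names depend on the NER model Stanza loads, so they may differ from the ORGANIZATION/LOCATION labels of the Stanford 3-class model.

```python
def entity_pairs(doc):
    """Return (text, type) for every entity span in a Stanza document.

    Hypothetical helper: it only reads doc.ents and each span's
    .text and .type attributes.
    """
    return [(ent.text, ent.type) for ent in doc.ents]


def main():
    # Third-party dependency: pip3 install stanza
    import stanza

    stanza.download('en')        # fetches the English models on first use
    nlp = stanza.Pipeline('en')  # default English pipeline, as above
    text = 'The united nations is holding a meeting in the united states of America.'
    results = nlp(text)

    # Each span already covers the whole multi-word entity.
    for entity_text, entity_type in entity_pairs(results):
        print(f'{entity_text}/{entity_type}')
```

Calling main() runs the full pipeline; the first call downloads the models, so it can take a while.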
