How to extract all possible noun phrases from text

Problem description

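I want to automatically extract desirable concepts (noun phrases) from text. My plan is to extract all noun phrases, label each one as either desirable or non-desirable, and then train a classifier on those labels. Right now I am trying to extract every possible phrase to build the training set. For example, given the sentence "Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.", I want to get all phrases such as "shoulder", "richer mix", "shoulder of richer mix", "junctions", "junctions of columns and beams", "columns and beams", "columns", "beams", or any other possible candidate. The desirable phrases are "shoulder", "junctions", and "junctions of columns and beams". I don't care about correctness at this step; I just want to get the training set first. Is there a tool for this task?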
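The labeling plan described above could be sketched roughly as follows. This is only an illustration under assumptions: the candidate list and the desirable set are typed in by hand from the example sentence, and the names `candidates`, `desirable`, and `training_set` are hypothetical, not part of any library.

```python
# Candidate phrases for the example sentence, entered by hand for illustration.
candidates = [
    "shoulder",
    "richer mix",
    "shoulder of richer mix",
    "junctions",
    "junctions of columns and beams",
    "columns and beams",
    "columns",
    "beams",
]

# Hand-picked desirable phrases (an assumption made for this sketch).
desirable = {"shoulder", "junctions", "junctions of columns and beams"}

# Pair every candidate with a binary label: 1 = desirable, 0 = non-desirable.
training_set = [(phrase, int(phrase in desirable)) for phrase in candidates]

for phrase, label in training_set:
    print(label, phrase)
```

Pairs like these could then be turned into features for whatever classifier is chosen; that step is outside this sketch.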

I tried Rake from rake_nltk, but the results did not include the phrases I want (that is, it did not extract all possible phrases):

from rake_nltk import Rake

data = 'Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)

Result ("junctions of columns and beams" is missing here):

['richer mix','shoulder','required','junctions','items','described','columns','beams']

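I also tried phrasemachine, but its results missed some of the desirable phrases as well:

import spacy
import phrasemachine

# Load a spaCy English pipeline for tokenization and POS tagging
# (assumed here: the en_core_web_sm model)
nlp = spacy.load('en_core_web_sm')
doc = nlp(data)  # data is the sentence defined in the Rake example
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
# Ask phrasemachine for candidate phrases as (start, end) token spans
out = phrasemachine.get_phrases(tokens=tokens,postags=pos,output="token_spans")
print(out['token_spans'])
# Print the tokens covered by each span
while len(out['token_spans']):
    start,end = out['token_spans'].pop()
    print(tokens[start:end])

Result: many noun phrases are missing here.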

Solution

You may want to use spaCy's noun_chunks attribute:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    # The base noun chunk itself, e.g. "a shoulder"
    phrases.add(nc.text)
    # The full subtree under the chunk's root, e.g. "a shoulder of richer mix"
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)

Output:

{'junctions of columns and beams','junctions','the items','a shoulder','columns','richer mix','beams','columns and beams','a shoulder of richer mix','these junctions'}
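Some of the chunks above keep their determiners ("a shoulder", "these junctions"), while the question asks for the bare forms ("shoulder", "junctions"). One possible follow-up is to drop leading determiners from each chunk. The sketch below is an assumption-laden illustration, not part of the original answer: `strip_determiners` is a hypothetical helper, and it works on hand-tagged (text, POS) pairs that stand in for what `[(t.text, t.pos_) for t in nc]` might yield in spaCy, so it runs without a model.

```python
from typing import List, Tuple


def strip_determiners(tagged: List[Tuple[str, str]]) -> str:
    """Drop leading DET tokens from a (text, pos) sequence and join the rest."""
    i = 0
    while i < len(tagged) and tagged[i][1] == "DET":
        i += 1
    return " ".join(text for text, _ in tagged[i:])


# Hand-tagged stand-ins for noun chunks (tags are assumed, not model output).
chunks = [
    [("a", "DET"), ("shoulder", "NOUN")],
    [("these", "DET"), ("junctions", "NOUN")],
    [("richer", "ADJ"), ("mix", "NOUN")],
]

bare = {strip_determiners(chunk) for chunk in chunks}
print(sorted(bare))  # ['junctions', 'richer mix', 'shoulder']
```

Adding both the original chunk text and the stripped form to the set keeps the candidate pool as broad as possible, which matches the goal of collecting training candidates first and filtering later.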
