Problem description
I want to automatically extract certain desired concepts (noun phrases) from text. My plan is to extract all noun phrases, label each of them as one of two classes (desired phrase or non-desired phrase), and then train a classifier to separate them. What I am trying to do now is to extract all possible phrases as the training set first. For example, given the sentence Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described. I want to get all phrases such as shoulder, richer mix, shoulder of richer mix, junctions, columns, beams, columns and beams, junctions of columns and beams, or any other possible ones. The desired phrases are shoulder of richer mix and junctions of columns and beams. I do not care about correctness at this step; I just want to obtain the training set first. Is there a tool for this task? (A rough sketch of the later classification step is shown below.)
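For the later classification step, a minimal sketch could look like the following. This is only an assumption on my part using scikit-learn; the phrases and labels are made-up placeholders standing in for whatever hand-labeled training set you end up with:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical hand-labeled candidates: 1 = desired phrase, 0 = not desired
train_phrases = ['shoulder of richer mix', 'junctions of columns and beams',
                 'the items', 'these junctions', 'required']
train_labels = [1, 1, 0, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_phrases, train_labels)

# classify a new candidate phrase
print(clf.predict(['columns and beams']))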
I tried Rake from rake_nltk, but the result failed to include the phrases I want, for example junctions of columns and beams (i.e., it did not extract all possible phrases):
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions,the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams'] (junctions of columns and beams is missing here)
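The long phrase disappears because RAKE splits candidate keywords at stopwords and punctuation, so a span containing of or and can never survive. As a rough illustration (this relies on rake_nltk's stopwords parameter and NLTK's English stopword list; it is only a sketch, not a recommended fix, since removing stopwords makes the keywords much noisier):

from rake_nltk import Rake
from nltk.corpus import stopwords

data = 'Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.'

# keep every English stopword except the ones we want phrases to span
custom_stops = set(stopwords.words('english')) - {'of', 'and'}
r = Rake(stopwords=custom_stops)
r.extract_keywords_from_text(data)
print(r.get_ranked_phrases())  # should now also include 'junctions of columns and beams'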
I also tried phrasemachine, and its results also missed some of the desired phrases, such as junctions of columns and beams:
import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
data = 'Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.'

doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]

# phrasemachine returns the token spans of the phrases it finds
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
(Again, many of the desired noun phrases are missing from the result.)
Solution
You may wish to make use of the noun_chunks attribute:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    # the noun chunk itself, e.g. 'a shoulder', 'richer mix'
    phrases.add(nc.text)
    # the full subtree of the chunk's head, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
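If you also want the bare head nouns such as shoulder or junctions, without the determiners that noun_chunks keeps (a shoulder, these junctions), one possible extension of the same idea (my addition, not part of the answer above) is to also collect each chunk's root token:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.')

# head noun of every noun chunk, with determiners and modifiers stripped
head_nouns = {nc.root.text for nc in doc.noun_chunks}
print(head_nouns)  # roughly {'shoulder', 'mix', 'junctions', 'columns', 'beams', 'items'}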