Problem description
I want to automatically extract certain desired concepts (noun phrases) from text. My plan is to extract all noun phrases, label each of them as one of two classes (desired phrase or non-desired phrase), and then train a classifier to separate them. What I am trying to do now is to extract all possible phrases as the training set first. For example, given the sentence Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described. I want to get all phrases such as shoulder, richer mix, shoulder of richer mix, junctions, columns, beams, columns and beams, junctions of columns and beams, or any other possible ones. The desired phrases are shoulder of richer mix and junctions of columns and beams. I do not care about correctness at this step; I just want to obtain the training set first. Is there a tool for this task? (A rough sketch of the later classification step is shown below.)
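For the later classification step, a minimal sketch could look like the following. This is only an assumption on my part using scikit-learn; the phrases and labels are made-up placeholders standing in for whatever hand-labeled training set you end up with:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical hand-labeled candidates: 1 = desired phrase, 0 = not desired
train_phrases = ['shoulder of richer mix', 'junctions of columns and beams',
                 'the items', 'these junctions', 'required']
train_labels = [1, 1, 0, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_phrases, train_labels)

# classify a new candidate phrase
print(clf.predict(['columns and beams']))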
I tried Rake from rake_nltk, but the result failed to include the phrases I want, for example junctions of columns and beams (i.e., it did not extract all possible phrases):
from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions,the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams'] (junctions of columns and beams is missing here)
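The long phrase disappears because RAKE splits candidate keywords at stopwords and punctuation, so a span containing of or and can never survive. As a rough illustration (this relies on rake_nltk's stopwords parameter and NLTK's English stopword list; it is only a sketch, not a recommended fix, since removing stopwords makes the keywords much noisier):

from rake_nltk import Rake
from nltk.corpus import stopwords

data = 'Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.'

# keep every English stopword except the ones we want phrases to span
custom_stops = set(stopwords.words('english')) - {'of', 'and'}
r = Rake(stopwords=custom_stops)
r.extract_keywords_from_text(data)
print(r.get_ranked_phrases())  # should now also include 'junctions of columns and beams'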
I also tried phrasemachine, and its results also missed some of the desired phrases, such as junctions of columns and beams:
import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
data = 'Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.'

doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]

# phrasemachine returns the token spans of the phrases it finds
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
(Again, many of the desired noun phrases are missing from the result.)
Solution
You may wish to make use of the noun_chunks attribute:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    # the noun chunk itself, e.g. 'a shoulder', 'richer mix'
    phrases.add(nc.text)
    # the full subtree of the chunk's head, e.g. 'a shoulder of richer mix'
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
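If you also want the bare head nouns such as shoulder or junctions, without the determiners that noun_chunks keeps (a shoulder, these junctions), one possible extension of the same idea (my addition, not part of the answer above) is to also collect each chunk's root token:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions,or at junctions of columns and beams,the items are so described.')

# head noun of every noun chunk, with determiners and modifiers stripped
head_nouns = {nc.root.text for nc in doc.noun_chunks}
print(head_nouns)  # roughly {'shoulder', 'mix', 'junctions', 'columns', 'beams', 'items'}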