I'm working on text summarization. Using the nltk library, I can extract unigrams, bigrams and trigrams and sort them by frequency.
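A minimal sketch of the kind of frequency-based extraction described above. The helper name `top_ngrams` and the regex tokenizer are assumptions made for illustration (`nltk.word_tokenize` would also work, but it needs the punkt data to be downloaded first); only `nltk.util.ngrams` comes from nltk itself.

```python
import re
from collections import Counter

from nltk.util import ngrams  # plain helper, no corpus download required


def top_ngrams(text, n, k=5):
    """Return the k most frequent n-grams of `text` as (tuple, count) pairs."""
    # Assumed tokenizer: lowercase, keep runs of letters and digits only.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)


sample = "random walks and random walks again"
for n in (1, 2, 3):
    # One ranked list per n-gram size; the lists are not comparable yet.
    print(n, top_ngrams(sample, n))
```

This gives one ranked list per size, which is exactly where the difficulty starts: nothing here says whether a unigram from the first list should beat a bigram from the second.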
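Since I'm new to this field (NLP), I would like to know whether there is a statistical model that would let me automatically choose the right n-gram size (by size I mean the length of the n-gram: one word for a unigram, two words for a bigram, three words for a trigram).
For example, suppose I have the following text that I want to summarize, and as a summary I keep only the 5 most relevant n-grams: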
"A more principled way to estimate sentence importance is using random walks and eigenvector centrality. LexRank[5] is an algorithm essentially identical to TextRank, and both use this approach for document summarization. The two methods were developed by different groups at the same time, and LexRank simply focused on summarization, but could just as easily be used for keyphrase extraction or any other NLP ranking task." (source: Wikipedia)
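Then, as output, I would like to get: "random walks", "TextRank", "LexRank", "document summarization", "keyphrase extraction", "NLP ranking task".
In other words, my question is: how can I infer whether a unigram is more relevant than a bigram or a trigram? (Using frequency alone as the measure of n-gram relevance does not give me the results I want.)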
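To illustrate why raw frequency is a poor way to compare n-grams of different sizes, here is a small standard-library sketch. The toy text and the helper `ngram_counts` are made up for this example. The point is that a phrase can never occur more often than any word it contains, so a merged frequency ranking is biased toward unigrams (including function words).

```python
import re
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams of a token list (standard library only)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Toy text, invented for this illustration.
text = ("LexRank and TextRank both use random walks. "
        "Random walks rank sentences, and random walks rank keyphrases.")
tokens = re.findall(r"[a-z0-9]+", text.lower())

# Merge unigram, bigram and trigram counts into one table.
counts = Counter()
for n in (1, 2, 3):
    counts.update(ngram_counts(tokens, n))

# A phrase is bounded by the count of its rarest word.
bigram = ("random", "walks")
assert counts[bigram] <= min(counts[(w,)] for w in bigram)

# Ranking the merged table by frequency: single words crowd the top,
# even though "random walks" is the phrase one would want to keep.
print(counts.most_common(5))
```

So some score other than plain counts is needed before unigrams, bigrams and trigrams can be ranked in a single list.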
Can anyone point me to a research paper, an algorithm, or a course that uses or explains such a method?
Thanks in advance.