Counting how often two words occur together in the same sentence

Problem description

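I have a pandas DataFrame in which one column contains preprocessed text.

I want to count how often two given words occur together in the same sentence, and how many times this happens in each document. For example, given "I" and "have", count how many times "I" and "have" appear together in the same sentence of a document.

Ideally, I would like to create a new DataFrame with the two words in one column, the number of times they appear together in a sentence in another column, and the original text in a third column.

My result should look like this: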

text | given_words | frequency_in_sentence
text1 | "I have" | 2 times in same sentence
text2 | "I have" | 3 times in same sentence
text3 | "I have" | 1 time in same sentence

Solution

This is pseudocode, but the approach can be used in any language:

word1 = "whatever"
word2 = "yes"

for (text : texts) {
     sentences = text.getSentences()

     // reset the counter for every text so that counts do not leak between documents
     count = 0
     for (sentence : sentences)
          if (sentence.contains(word1, word2))
               count++

     print("text " + text.name + ": " + word1 + " and " + word2 + " appear in the same sentence " + count + " times")
}

You will then need a method like the following to decide whether a sentence contains all of the given words:

boolean contains(String... words) {
     int args = words.length;
     int matchCount = 0;
     // count how many of the given words occur in this sentence's text
     for (String word : words)
          if (this.text.contains(word))
               matchCount++;

     // true only if every given word was found
     return matchCount == args;
}
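Since the question is about Python, the pseudocode above can be sketched roughly as follows. This is only a minimal illustration: the helper names (`split_sentences`, `contains_all`, `count_cooccurrences`) are made up for this example, the sentence splitter is a naive regular expression that assumes sentences end in `.`, `!` or `?`, and matching is done on whole lowercase tokens. A real tokenizer from nltk or spaCy would probably handle abbreviations and punctuation better.

```python
import re

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def contains_all(sentence, *words):
    """Return True if every word occurs as a whole token in the sentence (case-insensitive)."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    return all(word.lower() in tokens for word in words)

def count_cooccurrences(text, word1, word2):
    """Count the sentences of text that contain both word1 and word2."""
    return sum(contains_all(s, word1, word2) for s in split_sentences(text))

# Made-up sample texts, for illustration only.
texts = {
    "text1": "I have a dog. I really have no idea. You have a cat.",
    "text2": "Have I met you? We have not met.",
}
for name, text in texts.items():
    print(name, count_cooccurrences(text, "I", "have"))
```

Matching on tokens rather than substrings means that "have" is not found inside "haven" or "behave", and the two words do not need to be adjacent within the sentence.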

You can also use count, applied row by row through the DataFrame's apply function:

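def count(sentence, pattern):
    """Count non-overlapping occurrences of the literal string pattern in sentence."""
    return sentence.count(pattern)

# Note: this counts the phrase as a substring of the whole text, not per sentence.
df['frequency_in_sentence'] = df.apply(lambda row: count(row['text'], row['given_words']), axis=1)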
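Note that `str.count` only finds the given words when they appear next to each other as one literal substring, so it does not quite answer the question of co-occurrence within a sentence. A possible variation that counts sentences containing both words and builds the requested columns is sketched below. The function name `sentences_with_both`, the regex-based sentence splitting, and the sample data are all assumptions made for this illustration.

```python
import re
import pandas as pd

def sentences_with_both(text, given_words):
    """Count sentences in text that contain every word of given_words as a whole token."""
    wanted = [w.lower() for w in given_words.split()]
    count = 0
    # Naive sentence split (assumption): break after '.', '!' or '?' plus whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"\w+", sentence.lower()))
        if wanted and all(w in tokens for w in wanted):
            count += 1
    return count

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "text": [
        "I have a dog. I really have no idea. You have a cat.",
        "Have I met you? We have not met.",
    ],
})
df["given_words"] = "I have"
df["frequency_in_sentence"] = df.apply(
    lambda row: sentences_with_both(row["text"], row["given_words"]), axis=1
)
print(df[["text", "given_words", "frequency_in_sentence"]])
```

With this sample data the `frequency_in_sentence` column holds 2 for the first row and 1 for the second.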

nlp nltk python spacy token token