How do I count the frequency of words from a list in a DataFrame column?

Problem description

Suppose my DataFrame has the following layout:

ID#      Response
1234     Covid-19 was a disaster for my business
3456     The way you handled this pandemic was awesome

I want to count how often specific words from a list occur.

list=['covid','COVID','Covid-19','pandemic','coronavirus']

In the end I want to produce a dictionary like the one below:

{'covid':0,'COVID':0,'Covid-19':1,'pandemic':1,'coronavirus':0}

Please help. I am really stuck on how to code this in Python.

Solutions

For each string in the list (called list_of_strings here), count the number of matches in the column.

dict((s,df['Response'].str.count(s).fillna(0).sum()) for s in list_of_strings)

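Note that Series.str.count takes a regular expression as input. You may want to append (?=\b), a positive lookahead for a word boundary, so that a match must end at the end of a word.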

Series.str.count returns NA when it counts on an NA value, so fill those with 0. Then, for each string, sum over the column.
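The steps above can be sketched as a small runnable example. This is only one way to do it, under a few assumptions: the sample DataFrame is rebuilt from the question, the helper name count_terms is hypothetical, each term is escaped with re.escape so it is matched literally, and lookarounds on \w stand in for the word-boundary idea so that terms such as 'Covid-19' are only counted as whole words.

```python
import re

import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df = pd.DataFrame({
    "ID#": [1234, 3456],
    "Response": [
        "Covid-19 was a disaster for my business",
        "The way you handled this pandemic was awesome",
    ],
})
list_of_strings = ["covid", "COVID", "Covid-19", "pandemic", "coronavirus"]


def count_terms(series, terms):
    """Hypothetical helper: count whole-word, case-sensitive matches per term."""
    counts = {}
    for term in terms:
        # Escape regex metacharacters and require non-word characters
        # (or the string edge) on both sides of the term.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        # str.count yields NA for NA rows, so fill with 0 before summing.
        counts[term] = int(series.str.count(pattern).fillna(0).sum())
    return counts


result = count_terms(df["Response"], list_of_strings)
print(result)
```

Matching is case-sensitive here, so 'covid' and 'COVID' do not match the text 'Covid-19'.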

import pandas as pd
import numpy as np


df = pd.DataFrame({'sheet':['sheet1','sheet2','sheet3','sheet2'],'tokenized_text':[['efcc','fficial','billiontwits','since','covid','landed'],['when','people','say','the','fatality','rate','of','coronavirus','is'],['in','coronavirus-induced','crisis','are','cyvbwx'],['be-induced','cyvbwx']] })

print(df)

words_collection = ['covid','COVID','Covid-19','pandemic','coronavirus']

# Extract the words from all lines
all_words = []
for index,row in df.iterrows():
    all_words.extend(row['tokenized_text'])

# Build a dictionary mapping each word in `words_collection` to the number of times it appears
word_to_number_of_occurrences = dict()

# Go over the word collection and set each word's counter
for word in words_collection:
    word_to_number_of_occurrences[word] = all_words.count(word)

# {'covid': 1,'COVID': 0,'Covid-19': 0,'pandemic': 0,'coronavirus': 1}
print(word_to_number_of_occurrences)
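As a possible variation on the loop above, the tokens can be tallied once with collections.Counter and then looked up per word, which avoids rescanning the full token list for every word. This is a sketch, not the answer's own code: it assumes the same tokenized_text column, and the names token_counts and word_counts are made up for illustration.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Same tokenized data as in the answer above.
df = pd.DataFrame({
    "sheet": ["sheet1", "sheet2", "sheet3", "sheet2"],
    "tokenized_text": [
        ["efcc", "fficial", "billiontwits", "since", "covid", "landed"],
        ["when", "people", "say", "the", "fatality", "rate", "of", "coronavirus", "is"],
        ["in", "coronavirus-induced", "crisis", "are", "cyvbwx"],
        ["be-induced", "cyvbwx"],
    ],
})
words_collection = ["covid", "COVID", "Covid-19", "pandemic", "coronavirus"]

# Tally every token across all rows in a single pass.
token_counts = Counter(chain.from_iterable(df["tokenized_text"]))

# A Counter returns 0 for missing keys, so absent words need no special case.
word_counts = {word: token_counts[word] for word in words_collection}
print(word_counts)
```

Because tokens are compared exactly, 'coronavirus-induced' is not counted as 'coronavirus'.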

Try np.hstack together with Counter:

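import numpy as np
from collections import Counter

# lst is the list of words from the question
lst = ['covid','COVID','Covid-19','pandemic','coronavirus']

a = np.hstack(df['Response'].str.split())
dct = {**dict.fromkeys(lst,0),**Counter(a[np.isin(a,lst)])}

{'covid': 0,'COVID': 0,'Covid-19': 1,'pandemic': 1,'coronavirus': 0}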

You can do this easily with a dictionary comprehension:

{x:df.Response.str.count(x).sum() for x in list}

Output

{'covid': 0,'COVID': 0,'Covid-19': 1,'pandemic': 1,'coronavirus': 0}
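The question lists both 'covid' and 'COVID', which suggests that case variants may be meant to count together. If that is the goal (an assumption, since the question does not say so), one possible sketch passes flags=re.IGNORECASE to Series.str.count. The shortened terms list and the name case_insensitive_counts are chosen here for illustration only.

```python
import re

import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df = pd.DataFrame({
    "Response": [
        "Covid-19 was a disaster for my business",
        "The way you handled this pandemic was awesome",
    ],
})

# One entry per term; case variants are covered by the IGNORECASE flag.
terms = ["covid", "pandemic", "coronavirus"]

case_insensitive_counts = {
    # re.escape keeps any regex metacharacters in a term literal.
    term: int(df["Response"].str.count(re.escape(term), flags=re.IGNORECASE).sum())
    for term in terms
}
print(case_insensitive_counts)
```

This counts substrings, so 'covid' also matches inside 'Covid-19'.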
