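Problem description
I want to find the top 10 hashtags for each of the thousands of communities in my dataset. Each user_name in the dataset belongs to one specific community (for example, "a", "b", "c", and "d" belong to community 0). A small sample of the dataset looks like this: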
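# The community_id vector appears here only as c(0,1,3), which data.frame() rejects for 10 rows.
# Users a-d are in community 0 as stated above; the ids for users e-j are an assumed reconstruction.
df <- data.frame(N = c(1,2,3,4,5,6,7,8,9,10),user_name = c("a","b","c","d","e","f","g","h","i","j"),community_id = c(0,0,0,0,1,1,1,3,3,3),hashtags = c("#illness,#ebola","#coronavirus,#covid","#vaccine,#lie","#flue,#ebola,#usa","#vaccine","#flue","#coronavirus","#ebola","#ebola,#vaccine","#china,#virus") )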
To find the top 10 hashtags for a single community (community 0 in the example below), I have to run the following code:
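library(dplyr)
library(tidytext)
# select community 0 (the column in df is called community_id)
df_comm_0 <- df %>%
  filter(community_id == 0)
# remove NAs
df_comm_0 <- na.omit(df_comm_0)
# find top 10 hashtags
# note: token = "tweets" keeps the leading "#"; it may not be available in recent tidytext releases
df_hashtags_0 <- df_comm_0 %>%
  unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)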
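I know that a loop would save me from running this code about 15,000 times (the number of communities in the dataset). I am not familiar with loops, and even after searching for hours I could not write one. The code below is what I came up with, but it gives me the hashtags for the whole dataset!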
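x <- (df$community_id)
for (val in x) {
  # val is never used inside the loop, so every iteration counts the whole dataset
  print(
    df %>%
      unnest_tokens(hashtag, hashtags, token = "tweets") %>%
      count(hashtag, sort = TRUE) %>%
      top_n(10)
  )
}
# stray call from the attempt; print() fails when given no argument
print()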
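Is there a way to loop over all communities, compute the hashtag frequencies for each one, and write each community's top 10 hashtags to a single file (or to separate files)?
Thank you very much for your help.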
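Solution
You could aggregate by community, strsplit the hashtags at the commas, and unlist them. The names of the first ten elements of the sorted table give you the desired top ten hashtags, which you can paste together to get back the original format.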
aggregate(hash ~ community,df1,function(x)
paste(names(sort(table(unlist(strsplit(x,","))),decreasing=TRUE)[1:5]),collapse=","))
# community hash
# 1 1 #covid,#fatalities,#china,#ebola,#illness
# 2 2 #ebola,#lie,#usa,#covid,#fatalities
# 3 3 #vaccine,#farright,#virus
# 4 4 #china,#vaccine,#flue,#virus,#conspiracy
# 5 5 #illness,#conspiracy,#fatalities
# 6 6 #farright,#illness
# 7 7 #virus,#illness,#farright
# 8 8 #lie,#coronavirus,#covid
# 9 9 #conspiracy,#lie
# 10 10 #china,#coronavirus
For clarity I show only the top five hashtags; for the top ten, use [1:10] instead of [1:5] in the function.
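To make the nested call easier to follow, here is a minimal sketch that applies the same steps one at a time to the hashtags of a single community. The vector `tags` and the helper name `top_tags` are made up for illustration and are not part of the answer. It also swaps in `head()` for `[1:5]`, on the assumption that a community with fewer distinct hashtags than requested should simply return what it has, since indexing past the end of a short table produces `NA` names.

```r
# Hypothetical input: the hashtag strings of one community
tags <- c("#ebola,#vaccine", "#ebola", "#flue,#ebola,#vaccine")

# Illustrative helper (an assumption, not from the original answer)
top_tags <- function(x, k = 5) {
  pieces <- unlist(strsplit(x, ","))                 # one element per hashtag
  counts <- sort(table(pieces), decreasing = TRUE)   # most frequent first
  paste(head(names(counts), k), collapse = ",")      # head() never pads with NA
}

top_tags(tags, 2)   # "#ebola,#vaccine"
top_tags(tags)      # "#ebola,#vaccine,#flue" (only three distinct tags exist)
```

A function like this could be passed to aggregate() in place of the anonymous function, if that behaviour is what you want.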
Data:
n <- 100
df1 <- data.frame(user=1:n,community=rep(1:(n/10),each=10))
set.seed(42)
# each user gets three randomly drawn hashtags, joined with commas
df1$hash <-
  replicate(n,paste(sample(c("#illness","#ebola","#coronavirus","#covid","#vaccine","#lie","#flue","#usa","#china","#fatalities","#conspiracy","#farright","#virus"),3),collapse=","))
Using the tidyverse, you could do:
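library(dplyr)
# note: the default word tokenizer lowercases tokens and strips punctuation, so the leading "#" is dropped
df %>%
  group_by(community_id) %>%
  tidytext::unnest_tokens(hashtags, hashtags) %>%
  count(hashtags) %>%
  slice_max(n, n = 5) %>%
  summarise(hashtags = toString(hashtags), .groups = 'drop')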
Split-apply-combine:
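# split the hashtag strings by community, then take the ten most frequent tags in each group
tt_by_cid <- Map(function(x){
  head(names(sort(table(unlist(strsplit(x, ","))), decreasing = TRUE)), 10)
}, with(df, split(sapply(hashtags, as.character), community_id)))
# bind one (community_id, hashtag) block per community;
# SIMPLIFY = FALSE keeps a list even when every community returns the same number of tags
data.frame(do.call(rbind, mapply(cbind, "community_id" = names(tt_by_cid), hashtags = tt_by_cid, SIMPLIFY = FALSE)), stringsAsFactors = FALSE, row.names = NULL)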
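Since the question also asks for a loop that writes the result to a file, here is a rough base R sketch of that idea. It assumes the question's column names (community_id, hashtags); the tiny data frame, the variable names, and the output path are made up for illustration.

```r
# Small made-up data set using the question's column names
df <- data.frame(
  community_id = c(0, 0, 0, 1, 1),
  hashtags = c("#ebola,#usa", "#ebola", NA, "#covid,#virus", "#covid"),
  stringsAsFactors = FALSE
)

k <- 10            # number of hashtags to keep per community
results <- list()
for (cid in unique(df$community_id)) {
  # keep this community's rows only, dropping missing hashtag strings
  x <- na.omit(df$hashtags[df$community_id == cid])
  if (length(x) == 0) next   # nothing to count for this community
  counts <- sort(table(unlist(strsplit(x, ","))), decreasing = TRUE)
  top <- head(counts, k)
  results[[as.character(cid)]] <- data.frame(
    community_id = cid,
    hashtag = names(top),
    n = as.integer(top),
    stringsAsFactors = FALSE
  )
}

# stack the per-community tables into one long data frame
top_by_community <- do.call(rbind, results)
rownames(top_by_community) <- NULL

# Hypothetical output path; a temporary directory keeps the sketch free of side effects
out_file <- file.path(tempdir(), "top_hashtags.csv")
write.csv(top_by_community, out_file, row.names = FALSE)
```

The loop iterates over unique(df$community_id) rather than over every row, and it filters by the loop variable inside the body, which is the step missing from the attempt in the question. For separate files, one option would be to call write.csv() inside the loop with a path built from cid.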