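How do I loop over communities by ID?

Problem

I want to find the top 10 hashtags for each of the thousands of communities in my dataset. Each user_name in the dataset belongs to a specific community (for example, "a", "b", "c" and "d" belong to community 0). A small sample of the dataset is shown below: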

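# community_id needs one value per row; "a" to "d" are in community 0 as described above,
# the remaining IDs are illustrative
df <- data.frame(
  N = c(1,2,3,4,5,6,7,8,9,10),
  user_name = c("a","b","c","d","e","f","g","h","i","j"),
  community_id = c(0,0,0,0,1,1,1,3,3,3),
  hashtags = c("#illness,#ebola","#coronavirus,#covid","#vaccine,#lie","#flue,#ebola,#usa","#vaccine","#flue","#coronavirus","#ebola","#ebola,#vaccine","#china,#virus")
)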

To find the top 10 hashtags for a single community (community 0 in the example below), I have to run the following code:

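library(dplyr)
library(tidytext)

# select community 0
df_comm_0 <- df %>%
  filter(community_id == 0)

# remove NAs
df_comm_0 <- na.omit(df_comm_0)

# find top 10 hashtags
# (token = "tweets" keeps the leading "#"; recent tidytext releases may no longer support this tokenizer)
df_hashtags_0 <- df_comm_0 %>%
  unnest_tokens(hashtag, hashtags, token = "tweets") %>%
  count(hashtag, sort = TRUE) %>%
  top_n(10)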

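I know that a loop would save me from running this code about 15,000 times (the number of communities in the dataset). I am not familiar with loops, and even after several hours of searching I could not write one. The code below is what I came up with, but it gives me the hashtags for the whole dataset: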

x <- (df$community_id)

for (val in x) {
  print(
    df %>%
      unnest_tokens(hashtag, hashtags, token = "tweets") %>%
      count(hashtag, sort = TRUE) %>%
      top_n(10)
  )
}
print()

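Is there a way to compute the hashtag frequencies for all communities by looping over them and writing each community's top 10 hashtags to one file (or to separate files)?

Thank you very much for your help.

Solution

You can aggregate by community: strsplit the hashtags at the commas and unlist them. The names of the first ten elements of the sorted table give you the desired top ten hashtags, which you can paste back into the original format.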

aggregate(hash ~ community,df1,function(x)
  paste(names(sort(table(unlist(strsplit(x,","))),decreasing=TRUE)[1:5]),collapse=","))
#    community                                                     hash
# 1          1            #covid,#fatalities,#china,#ebola,#illness
# 2          2                  #ebola,#lie,#usa,#covid,#fatalities
# 3          3                #vaccine,#farright,#virus
# 4          4             #china,#vaccine,#flue,#virus,#conspiracy
# 5          5         #illness,#conspiracy,#fatalities
# 6          6         #farright,#illness
# 7          7         #virus,#illness,#farright
# 8          8                #lie,#coronavirus,#covid
# 9          9        #conspiracy,#lie
# 10        10 #china,#coronavirus

For clarity I show only the top five hashtags; for the top ten, use [1:10] instead of [1:5] in the function.
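To see what each step of the one-liner does, here is a minimal sketch on a single made-up community (the vector `x` is illustrative and not taken from the data). It also suggests why `head()` can be a safer choice than `[1:5]`: indexing past the end of the table pads the result with `NA`, while `head()` stops at the elements that exist.

```r
# Hypothetical hashtag strings for one community (made-up values)
x <- c("#ebola,#flue", "#ebola,#usa", "#ebola,#flue")

# Split every string at the commas and flatten to one hashtag per element
tags <- unlist(strsplit(x, ","))

# Count each hashtag and put the most frequent first
counts <- sort(table(tags), decreasing = TRUE)

# Only three distinct hashtags exist here, so asking for five differs:
top_idx  <- names(counts[1:5])      # length 5, padded with NA
top_head <- names(head(counts, 5))  # length 3, no padding

# Collapse back into the comma-separated format of the input
paste(top_head, collapse = ",")
```

In the aggregate call above, this would mean replacing `sort(...)[1:5]` with `head(sort(...), 5)`; that is a suggestion, not part of the original answer.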


Data:

n <- 100
df1 <- data.frame(user=1:n,community=rep(1:(n/10),each=10))
set.seed(42)
df1$hash <- 
  replicate(n,paste(sample(c("#illness","#ebola","#coronavirus","#covid","#vaccine","#lie","#flue","#usa","#china","#fatalities","#conspiracy","#farright","#virus"),3),collapse=","))

With the tidyverse you can do:

library(dplyr)

df %>%
  group_by(community_id) %>%
  # the default word tokenizer lowercases the text and drops the leading "#"
  tidytext::unnest_tokens(hashtags, hashtags) %>%
  count(hashtags) %>%
  slice_max(n, n = 5) %>%
  summarise(hashtags = toString(hashtags), .groups = 'drop')

Split-apply-combine:

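# split the hashtag strings by community, then take the 10 most frequent hashtags in each
tt_by_cid <- Map(function(x){
  head(names(sort(table(unlist(strsplit(x, ","))), decreasing = TRUE)), 10)},
  with(df, split(sapply(hashtags, as.character), community_id)))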

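# SIMPLIFY = FALSE keeps one matrix per community in a list, which is what do.call(rbind, ...) expects
data.frame(do.call(rbind, mapply(cbind, "community_id" = names(tt_by_cid), hashtags = tt_by_cid, SIMPLIFY = FALSE)), stringsAsFactors = FALSE, row.names = NULL)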
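
None of the snippets above writes its result to disk, which the question also asks for. Below is a minimal base-R sketch of an explicit loop over the community IDs that collects each community's top hashtags and writes everything to one CSV file. The sample data, the helper name `top_hashtags`, and the output location (a temporary file) are assumptions made for illustration.

```r
# Made-up sample data in the same shape as the question's df
df <- data.frame(
  user_name    = c("a", "b", "c", "d", "e", "f"),
  community_id = c(0, 0, 0, 1, 1, 1),
  hashtags     = c("#ebola,#flue", "#ebola", "#ebola,#flue,#usa",
                   "#covid", "#covid,#vaccine", "#vaccine,#covid"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most frequent hashtags in a character vector
top_hashtags <- function(x, n = 10) {
  tags <- unlist(strsplit(x, ","))
  names(head(sort(table(tags), decreasing = TRUE), n))
}

ids <- unique(df$community_id)
results <- vector("list", length(ids))

for (i in seq_along(ids)) {
  # keep only the rows that belong to the current community
  rows <- df[df$community_id == ids[i], ]
  results[[i]] <- data.frame(
    community_id = ids[i],
    hashtags     = paste(top_hashtags(rows$hashtags), collapse = ","),
    stringsAsFactors = FALSE
  )
}

# One row per community
top_by_community <- do.call(rbind, results)

# Write all communities to a single file (a temporary path is used here)
out_file <- tempfile(fileext = ".csv")
write.csv(top_by_community, out_file, row.names = FALSE)
```

To get one file per community instead, write.csv could be called inside the loop with a path built from ids[i], for example via file.path() and paste0().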