Combining nest() and aggregate() in R?

Problem description

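Looking for help and advice:

I collected tweets with the rtweet package. This gives me a data frame with observations (i.e. tweets) in rows and variables in columns. The variables are at the tweet level (e.g. text, likes, hashtags) and at the account level (number of followers, bio, etc.). I ran a sentiment analysis on the tweets and added a variable holding the tweet-level sentiment score to the data frame.

A mock-up of what my data looks like now (in reality I have more than 100,000 observations and 115 variables):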

df <- data.frame(
  users = c('u1','u2','u3','u4','u5','u1','u6','u6','u6','u1'),
  text = c('this is u1 first tweet','this is another tweet','hello hello','hashtag tweettext','tweet text','this is u1 second tweet','this is u6 first tzeet','this is u6 second tweet','this is u6 third tweet','this is u1 third tweet'),
  likes = sample(1:10,10),
  sentiment = rnorm(10,mean=0,sd=1),
  followers = c(111,200,300,400,500,111,666,666,666,111),
  bio = paste0(rep('lorem ipsum',10)," ",c('u1','u2','u3','u4','u5','u1','u6','u6','u6','u1'))
)
   users                    text likes   sentiment followers            bio
1     u1  this is u1 first tweet     1  0.96445407       111 lorem ipsum u1
2     u2   this is another tweet    10  1.03840459       200 lorem ipsum u2
3     u3             hello hello     7  1.76887362       300 lorem ipsum u3
4     u4       hashtag tweettext     5 -0.57165015       400 lorem ipsum u4
5     u5              tweet text     4 -1.47028289       500 lorem ipsum u5
6     u1 this is u1 second tweet     2 -1.11036644       111 lorem ipsum u1
7     u6  this is u6 first tzeet     3  0.25440339       666 lorem ipsum u6
8     u6 this is u6 second tweet     8  0.02334468       666 lorem ipsum u6
9     u6  this is u6 third tweet     9 -2.71592529       666 lorem ipsum u6
10    u1  this is u1 third tweet     6  1.18528925       111 lorem ipsum u1

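What I want to do now is work at the user-account level. To do that, I want to aggregate the mean likes and mean sentiment per user, while also combining all of each user's tweet texts into one vector (one long string would be fine too). The bio should not be combined.

In general, aggregating is no problem: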

df%>% 
  group_by(users)%>%
  summarise(meanlikes = mean(likes),meansentiment = mean(sentiment))

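As for nesting the data, this is what I have:

df %>%
  select(-likes,-sentiment) %>%
  # tidyr >= 1.0 syntax: nest every column except the account-level ones
  nest(data = -c(users, followers, bio))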

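Combining the two in a single piece of code did not produce anything meaningful. I ran the two operations separately and used inner_join(), which seems to work fine, but this approach is very cumbersome because I have 115 variables.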

d1<- df %>%
  select(-likes,-bio)

d2 <- df %>%
  group_by(users)%>%
  summarise(meanlikes = mean(likes),meansentiment = mean(sentiment))

d1 <- d1 %>%
  inner_join(d2)

Any suggestions?

So, to be clear, what I am looking for is a method or bit of code that gives me this data frame:

  users                                                                    text followers
1    u1 this is u1 first tweet,this is u1 second tweet,this is u1 third tweet       111
2    u2                                                   this is another tweet       200
3    u3                                                             hello hello       300
4    u4                                                       hashtag tweettext       400
5    u5                                                              tweet text       500
6    u6 this is u6 first tzeet,this is u6 second tweet,this is u6 third tweet       666
             bio meanlikes meansentiment
1 lorem ipsum u1  4.333333    -0.2846824
2 lorem ipsum u2  6.000000    -0.5443194
3 lorem ipsum u3  2.000000     1.8001123
4 lorem ipsum u4  4.000000     1.0114402
5 lorem ipsum u5  9.000000    -0.5637166
6 lorem ipsum u6  7.000000     1.2346833

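Hope you can help me!

Solution

You can group_by users and keep the first value of bio and followers, since they are identical within each user. Take the mean of likes and sentiment, and use toString to collapse text into one comma-separated string.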

library(dplyr)

df %>%
  group_by(users) %>%
  summarise(across(c(bio,followers),first),across(c(likes,sentiment),mean),text = toString(text))

#  users bio      followers likes sentiment text             
#  <chr> <chr>        <dbl> <dbl>     <dbl> <chr>            
#1 u1    lorem i…       111  6.67    0.0870 this is u1 first…
#2 u2    lorem i…       200  8      -0.945  this is another …
#3 u3    lorem i…       300  6       0.225  hello hello      
#4 u4    lorem i…       400  3       0.359  hashtag tweettext
#5 u5    lorem i…       500  5      -0.664  tweet text       
#6 u6    lorem i…       666  4.33    0.206  this is u6 first…
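
The question also mentions that a vector of tweets per user would be acceptable. As a minimal sketch of that variant, assuming dplyr >= 1.0 and a small made-up stand-in for the data (the values below are illustrative, not the asker's), text can be kept as a list-column instead of a collapsed string:

```r
library(dplyr)

# Small stand-in for the question's data (values are made up for illustration)
df <- data.frame(
  users     = c("u1", "u2", "u1"),
  text      = c("first tweet", "another tweet", "second tweet"),
  likes     = c(1, 10, 3),
  sentiment = c(0.5, -1, 1.5),
  followers = c(111, 200, 111),
  bio       = c("lorem ipsum u1", "lorem ipsum u2", "lorem ipsum u1")
)

by_user <- df %>%
  group_by(users) %>%
  summarise(
    across(c(bio, followers), first),   # account-level: identical within a user
    across(c(likes, sentiment), mean),  # tweet-level numerics: average per user
    text = list(text)                   # keep all tweets as one character vector per user
  )

# Each element of the list-column is a character vector of that user's tweets
by_user$text[[1]]
```

Each user then occupies one row, and the individual tweets stay separate, which may be more convenient than a single string if the texts themselves contain commas.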

You can try this:

library(dplyr)
library(stringr)  # provides str_c()

# set seed to make df reproducible
set.seed(1234)

df <- data.frame(
  users = c('u1','u2','u3','u4','u5','u1','u6','u6','u6','u1'),
  text = c('this is u1 first tweet','this is another tweet','hello hello','hashtag tweettext','tweet text','this is u1 second tweet','this is u6 first tzeet','this is u6 second tweet','this is u6 third tweet','this is u1 third tweet'),
  likes = sample(1:10,10),
  sentiment = rnorm(10,mean=0,sd=1),
  followers = c(111,200,300,400,500,111,666,666,666,111),
  bio = paste0(rep('lorem ipsum',10)," ",c('u1','u2','u3','u4','u5','u1','u6','u6','u6','u1'))
)

# add per-user summaries to every row, drop the tweet-level columns,
# then keep one row per user
df %>% group_by(users) %>%
  mutate(tweets = str_c(text,collapse = ","),meanlikes = mean(likes),meansentiment = mean(sentiment)) %>%
  select(-text,-likes,-sentiment) %>%
  distinct()
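
With 115 variables, naming every account-level column by hand gets tedious. One possible way to scale either answer, sketched here under the assumption that the tweet-level columns can be listed by name (the vectors tweet_num and tweet_level, and the small data frame, are made up for illustration), is to summarise the listed columns explicitly and take the first value of everything else:

```r
library(dplyr)

# Small stand-in for the question's data (values are made up for illustration)
df <- data.frame(
  users     = c("u1", "u2", "u1"),
  text      = c("first tweet", "another tweet", "second tweet"),
  likes     = c(1, 10, 3),
  sentiment = c(0.5, -1, 1.5),
  followers = c(111, 200, 111),
  bio       = c("lorem ipsum u1", "lorem ipsum u2", "lorem ipsum u1")
)

# Assumption: these are the tweet-level columns; every other column is account-level
tweet_num   <- c("likes", "sentiment")
tweet_level <- c(tweet_num, "text")

by_user <- df %>%
  group_by(users) %>%
  summarise(
    # account-level columns: take the first value. This comes first so the
    # selection cannot pick up the summary columns created below.
    across(-all_of(tweet_level), first),
    # tweet-level numerics: mean per user, named meanlikes, meansentiment
    across(all_of(tweet_num), mean, .names = "mean{.col}"),
    # tweet texts: one comma-separated string per user
    text = toString(text)
  )

by_user
```

The across(..., first) call is placed first on purpose: across() can also select columns created earlier in the same summarise() call, so putting it last would sweep up the new summary columns as well.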

