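Problem description
For my text-mining task, I am trying to build a matrix containing the word counts of three separate texts (which I have already filtered and tokenized). I now have a data frame like this for each text: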
word count
film 82
camera 18
director 10
action 5
character 2
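I have also created a list that combines all the words from the three texts together with their combined counts, but what I am trying to get is the following: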
word       text1  text2  text3
film          82     16      8
camera        18     76      3
director      10      2     91
character      2     20      0
screen         0      4     10
movie         12      0      0
action         5     23     54
dance          0      1     16
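What code should I use for this? As in the example above, I want to fill in the number "0" for every word that does not occur in a given text. I have about 4459 words in total, and the individual texts contain 1804, 1522, and 1133 words respectively.
Thanks a lot!
Solution
If you have already computed the three tables, you only need to do a full merge of them and then replace the resulting NAs with 0. For example: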
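library(dplyr)
# Three made-up count tables with partly overlapping words
first <- data.frame(word = sample(letters, 10), count = sample(1:100, 10))
second <- data.frame(word = sample(letters, 10), count = sample(1:100, 10))
third <- data.frame(word = sample(letters, 10), count = sample(1:100, 10))
# all = TRUE gives a full (outer) merge, so no word is dropped
combined <- merge(first, second, by = "word", all = TRUE)
combined <- merge(combined, third, by = "word", all = TRUE)
# Replace the NAs in the count columns with 0
combined %>%
  mutate_if(is.numeric, function(x) {
    ifelse(is.na(x), 0, x)
  })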
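With more than a couple of texts, the repeated merge can be folded over a list. The following is only a sketch of that idea in base R, not part of the original answer: the three small tables and the names `tables`, `renamed`, and `wide` are made up for illustration, and each count column is renamed after its text so the merged columns stay distinct.

```r
# Hypothetical per-text count tables (illustrative values only)
text1 <- data.frame(word = c("film", "camera", "movie"), count = c(82, 18, 12))
text2 <- data.frame(word = c("film", "camera", "screen"), count = c(16, 76, 4))
text3 <- data.frame(word = c("film", "camera", "screen"), count = c(8, 3, 10))

tables <- list(text1 = text1, text2 = text2, text3 = text3)

# Rename each "count" column after its text so the columns do not clash
renamed <- Map(function(df, name) {
  names(df)[names(df) == "count"] <- name
  df
}, tables, names(tables))

# Fold the list into one table with successive full (outer) merges on "word"
wide <- Reduce(function(x, y) merge(x, y, by = "word", all = TRUE), renamed)

# Words missing from a text come back as NA; turn those into 0
wide[is.na(wide)] <- 0
wide
```

Because the merge is folded with `Reduce`, adding a fourth text should only require adding one more entry to the list.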
A solution using dplyr and tidyr:
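library(dplyr)
library(tidyr)
# Join on "word" only; the suffixes label the two clashing count columns
full_join(df1, df2, by = "word", suffix = c(".text1", ".text2")) %>%
  full_join(., df3, by = "word") %>%
  rename(count.text3 = count) %>%
  # Words absent from a text come back as NA; replace those with 0
  mutate_at(vars(count.text1:count.text3), tidyr::replace_na, 0)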
#>        word count.text1 count.text2 count.text3
#> 1      film          82          16           8
#> 2    camera          18          76           3
#> 3  director          10           2          91
#> 4    action           5          23          54
#> 5 character           2          20           0
#> 6    screen           0           4          10
#> 7     dance           0           1          16
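Example data simulating yours (define these before running the joins):
df1 <- data.frame(
  word = c("film", "camera", "director", "action", "character"),
  count = c(82, 18, 10, 5, 2)
)
df2 <- data.frame(
  word = c("film", "camera", "director", "character", "screen", "action", "dance"),
  count = c(16, 76, 2, 20, 4, 23, 1)
)
df3 <- data.frame(
  word = c("film", "camera", "director", "screen", "action", "dance"),
  count = c(8, 3, 91, 10, 54, 16)
)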
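A different route to the same matrix, sketched here as an alternative rather than as either answerer's method, is to stack the tables into long form and then spread them with `tidyr::pivot_wider`, which can fill the missing word/text pairs with 0 directly. The tables `t1` to `t3` below are small made-up samples, and the column name `text` is an assumption chosen for this example.

```r
library(dplyr)
library(tidyr)

# Hypothetical count tables (illustrative values only)
t1 <- data.frame(word = c("film", "camera", "director", "action", "character"),
                 count = c(82, 18, 10, 5, 2))
t2 <- data.frame(word = c("film", "camera", "screen"), count = c(16, 76, 4))
t3 <- data.frame(word = c("film", "screen", "dance"), count = c(8, 10, 16))

# Stack the tables into long form; .id records which text each row came from
long <- bind_rows(text1 = t1, text2 = t2, text3 = t3, .id = "text")

# One column per text; word/text pairs that never occur are filled with 0
wide <- pivot_wider(long, names_from = text, values_from = count,
                    values_fill = list(count = 0))
wide
```

This avoids the renaming step after each join, since the column names come from the names given to `bind_rows`.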