Problem collapsing factor levels in R

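Problem description

I have a messy factor variable with more levels than it should have. The data come from an open survey, and many participants misspelled their answers or simply gave equivalent answers in different ways.

Here is an example df that represents my problem: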


df <- data.frame(
  ID = 1:10,
  Nationality = c("espanol", "spaniol", "ESPANOL", "spanish", "colombia",
                  "Colombian", "British", "brit", "ESPanol", "UK")
)

My desired output looks like this:

> df
   ID Nationality
1   1     Spanish
2   2     Spanish
3   3     Spanish
4   4     Spanish
5   5   Colombian
6   6   Colombian
7   7     British
8   8     British
9   9     Spanish
10 10     British

To reduce these 10 spurious factor levels to the 3 there should be (Spanish, Colombian, British), I tried this:

library(forcats)

levels(df$Nationality) <- fct_collapse(
  df$Nationality,
  Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
  Colombian = c("colombia", "Colombian"),
  British   = c("British", "brit", "UK")
)

This does reduce my Nationality factor to 3 levels, but the output looks like this and does not match the desired one:

> df
   ID Nationality
1   1   Colombian
2   2     British
3   3     British
4   4     Spanish
5   5     Spanish
6   6     Spanish
7   7     Spanish
8   8     Spanish
9   9   Colombian
10 10     British

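In the larger dataset I am working with it does not work either, and the result is even worse: every case becomes "Spanish", and I have no clue why this happens.

Thanks in advance for your help! Best, Lucas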

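Solutions

Have you tried converting Nationality to a factor first? fct_collapse() already returns the recoded factor, so assign its result to the column itself rather than to levels().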

library(dplyr)    # provides %>% and mutate()
library(forcats)
df <- data.frame(
  ID = 1:10,
  Nationality = c("espanol", "spaniol", "ESPANOL", "spanish", "colombia",
                  "Colombian", "British", "brit", "ESPanol", "UK")
)

df2 <- df %>%
  mutate(Nationality = factor(Nationality)) %>%
  mutate(Nationality = fct_collapse(
    Nationality,
    Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
    Colombian = c("colombia", "Colombian"),
    British   = c("British", "brit", "UK")
  ))



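# more concise: convert and collapse in a single step
df2 <- df %>%
  mutate(across(Nationality, ~ fct_collapse(
    factor(.),
    Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
    Colombian = c("colombia", "Colombian"),
    British   = c("British", "brit", "UK")
  )))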
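Why the original attempt scrambles the rows can be seen in base R alone. The following is a minimal sketch, assuming Nationality was already a factor (as with stringsAsFactors = TRUE, the default before R 4.0.0) and assuming the alphabetical level order written out below. The vector per_row is typed in by hand and stands for what fct_collapse() returns for this data.

```r
vals <- c("espanol", "spaniol", "ESPANOL", "spanish", "colombia",
          "Colombian", "British", "brit", "ESPanol", "UK")

# Assumption: level order as a typical locale would sort it; it is fixed
# explicitly here so the result does not depend on the machine's locale.
x <- factor(vals, levels = c("brit", "British", "colombia", "Colombian",
                             "espanol", "ESPanol", "ESPANOL", "spaniol",
                             "spanish", "UK"))

# One recoded value per ROW (hand-written stand-in for the fct_collapse() result).
per_row <- c("Spanish", "Spanish", "Spanish", "Spanish", "Colombian",
             "Colombian", "British", "British", "Spanish", "British")

# levels<- expects one new name per LEVEL, so the i-th old level is renamed
# to the i-th row value, which has nothing to do with that level.
scrambled <- x
levels(scrambled) <- per_row

as.character(scrambled)
```

Under these assumptions the result matches the scrambled output shown in the question. It may also explain the larger dataset: only the first entries of the assigned vector (one per level) are used for renaming, so if those rows all happen to be "Spanish", every level is renamed to "Spanish".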

Here are some solutions using base R functions:

Solution 1

This solution assumes that the column Nationality is a character variable.

cases <- c(espanol = "Spanish",spaniol = "Spanish",ESPANOL = "Spanish",spanish = "Spanish",British = "British",brit = "British",ESPanol = "Spanish",UK = "British",colombia = "Colombian",Colombian = "Colombian")

df$Nationality <- factor(cases[df$Nationality])
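With a lookup table like the one in Solution 1, any spelling that is missing from cases silently becomes NA, and a factor column would be used as integer positions rather than as names. The sketch below shows one way to guard against both; the input vector raw and its extra spelling "Espana" are hypothetical and only serve as an illustration.

```r
# Same lookup table as in Solution 1.
cases <- c(espanol = "Spanish", spaniol = "Spanish", ESPANOL = "Spanish",
           spanish = "Spanish", British = "British", brit = "British",
           ESPanol = "Spanish", UK = "British",
           colombia = "Colombian", Colombian = "Colombian")

# Hypothetical input with one spelling ("Espana") that is not in `cases`.
raw <- c("espanol", "brit", "Espana", "UK")

# as.character() makes sure the lookup is by name even if `raw` were a factor;
# unname() drops the lookup keys from the result.
recoded <- unname(cases[as.character(raw)])

# Spellings without an entry in the table come back as NA; list them for review.
unmatched <- unique(raw[is.na(recoded)])
unmatched

result <- factor(recoded, levels = c("Spanish", "Colombian", "British"))
```

Checking unmatched before building the factor makes it easy to extend the table when a survey contains spellings that were not anticipated.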

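Solution 2

df$Nationality <- as.factor(df$Nationality)
levels(df$Nationality) <- list(
  Spanish   = c("espanol", "spaniol", "ESPANOL", "spanish", "ESPanol"),
  Colombian = c("colombia", "Colombian"),
  British   = c("British", "brit", "UK")
)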
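One property of the list form of levels<- is worth keeping in mind: old levels that are not mentioned in the list are turned into NA rather than kept. A small sketch on a hypothetical four-element factor, not the survey data:

```r
# Hypothetical factor; "colombia" is deliberately left out of the list below.
x <- factor(c("espanol", "brit", "UK", "colombia"))

# Each list name is a new level; its elements are the old levels it absorbs.
levels(x) <- list(Spanish = "espanol", British = c("brit", "UK"))

as.character(x)  # the "colombia" entry is now NA
levels(x)        # new levels follow the order of the list
```

For that reason the list should cover every spelling that occurs in the data.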

Output

#    ID Nationality
# 1   1     Spanish
# 2   2     Spanish
# 3   3     Spanish
# 4   4     Spanish
# 5   5   Colombian
# 6   6   Colombian
# 7   7     British
# 8   8     British
# 9   9     Spanish
# 10 10     British