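Problem description
I have a survey dataset with more than 100 variables, and nearly every variable takes 1 to 10 coded values. The code values for each column are provided in a separate data frame.
Sample data: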
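# Assumption: the vectors as posted had fewer than 5 entries; values here follow the result shown in the answer.
survey_df = structure(list(resp_id = 1:5, gender = c("1","2","2","1","1"), state = c("1","2","3","1","4"), education = c("1","1","1","2","2")), class = "data.frame", row.names = c(NA,-5L))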
coded_df = structure(list(col = c("state","gender","education"), col_values = c("1-CA,2-TX,3-AZ,4-CO","1-Male,2-Female","1-High School,2-Bachelor")), class = "data.frame", row.names = c(NA,-3L))
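Because the survey columns change over time and across products, I want to avoid any hard-coded recoding, so I have a function that takes a column name and returns a named vector built from coded_df.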
library(dplyr)
library(tidyr)
library(stringr)

# Build a named vector for column `x`: names are the codes, values are the labels
get_named_vec <- function(x) {
  tmp_chr <- coded_df %>%
    filter(col == x) %>%
    mutate(col_values = str_replace_all(col_values, "\\n", "")) %>%
    separate_rows(col_values, sep = ",") %>%
    separate(col_values, into = c("var1", "var2"), sep = "-") %>%
    mutate(var1 = as.character(as.numeric(var1)), var2 = str_trim(var2)) %>%
    pull(var2, var1)
  return(tmp_chr)
}
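For reference, here is a minimal base-R sketch of the kind of named vector the function is meant to return for one column. The names `codes`, `pairs` and `lookup` are illustrative only, and the sketch assumes that no label contains a comma or a hyphen.

```r
# One "code-label" string as stored in coded_df$col_values
codes <- "1-CA,2-TX,3-AZ,4-CO"

# Split into "code-label" items, then split each item into code and label
pairs <- strsplit(strsplit(codes, ",")[[1]], "-")

# Names are the codes, values are the labels
lookup <- setNames(
  vapply(pairs, `[`, "", 2),
  vapply(pairs, `[`, "", 1)
)

print(lookup)
```

Indexing such a vector by code, for example `lookup[["3"]]`, returns the matching label.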
I then use the named vector to update survey_df as follows.
survey_df %>%
  mutate(gender = recode(gender, !!!get_named_vec("gender"), .default = NA_character_))
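To make the `!!!` step concrete, the sketch below compares splicing a named vector into `recode()` with writing the same pairs out by hand. The vector `gender_lookup` and the unmatched code "9" are illustrative assumptions, not part of the survey.

```r
library(dplyr)

# Hypothetical lookup: names are codes, values are labels
gender_lookup <- c("1" = "Male", "2" = "Female")
x <- c("1", "2", "9")  # "9" has no entry in the lookup

# Splicing the named vector into recode() ...
spliced <- recode(x, !!!gender_lookup, .default = NA_character_)

# ... is equivalent to spelling out each code = label pair
spelled <- recode(x, "1" = "Male", "2" = "Female", .default = NA_character_)

print(spliced)
```

Codes without an entry fall through to `.default`, so here they become a missing value.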
So far this works one column at a time, which means 100+ executions!
But how can I run it through mutate_at so that I can selectively recode certain variables in a single call?
# This does not work.
to_update_col <- c("state", "gender")
survey_df %>%
  mutate_at(.vars = all_of(to_update_col), .funs = function(x) recode(x, !!!get_named_vec(x)))
Any help is greatly appreciated!
Thanks,
Vinay
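Solution
I expect it would be simpler and more efficient to turn this into a pivot-join-pivot operation: convert both the source table and the lookup table to long format, join them, and then pivot wide again.
Given this survey data: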
survey_df = structure(list(resp_id = 1:5, gender = c(1L,2L,2L,1L,1L), state = c(1,2,3,1,4), education = c(1L,1L,1L,2L,2L)), class = "data.frame", row.names = c(NA,-5L)) %>%
  mutate(across(-resp_id, as.character))
We can convert the lookup table to long format:
coded_df_long <- coded_df %>%
  separate_rows(col_values, sep = ",") %>%
  separate(col_values, c("old", "new"), extra = "merge")
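As a small illustration of why `extra = "merge"` matters here: with the default separator, `separate()` would otherwise split on every non-alphanumeric character, including the space in "High School". The sketch below uses a made-up data frame `demo`, and the label "Post-graduate" is a hypothetical value that does not appear in the original lookup.

```r
library(tidyr)

# Hypothetical code-label strings, one per row
demo <- data.frame(
  col_values = c("1-High School", "3-Post-graduate"),
  stringsAsFactors = FALSE
)

# Split only at the first separator; everything after it is kept as one piece
out <- separate(demo, col_values, c("old", "new"), extra = "merge")

print(out)
```

The code ends up in `old` and the full label, including any later spaces or hyphens, stays together in `new`.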
Then pivot the survey to long format, join the codes, and pivot wide again.
survey_df %>%
  pivot_longer(-resp_id) %>%
  left_join(coded_df_long, by = c("name" = "col", "value" = "old")) %>%
  select(-value) %>%
  pivot_wider(names_from = name, values_from = new)
Result
# A tibble: 5 x 4
  resp_id gender state education
    <int> <chr>  <chr> <chr>
1       1 Male   CA    High School
2       2 Female TX    High School
3       3 Female AZ    High School
4       4 Male   CA    Bachelor
5       5 Male   CO    Bachelor
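If you would rather stay close to the column-wise approach from the question, one possible sketch uses `across()` together with `cur_column()`, which gives the name of the column currently being processed. It indexes a named vector instead of splicing with `!!!`, since splicing is, as far as I know, resolved before `cur_column()` is available. The objects `survey` and `lookups` are small stand-ins made up for this example; in the question's setup each lookup could presumably come from `get_named_vec()`.

```r
library(dplyr)

# Hypothetical stand-in for survey_df
survey <- data.frame(
  resp_id = 1:3,
  gender = c("1", "2", "2"),
  state = c("1", "2", "4"),
  stringsAsFactors = FALSE
)

# Hypothetical per-column lookups: names are codes, values are labels
lookups <- list(
  gender = c("1" = "Male", "2" = "Female"),
  state = c("1" = "CA", "2" = "TX", "3" = "AZ", "4" = "CO")
)

to_update_col <- c("state", "gender")

# For each selected column, look up every code in that column's named vector
recoded <- survey %>%
  mutate(across(all_of(to_update_col), ~ unname(lookups[[cur_column()]][.x])))

print(recoded)
```

Codes with no entry in the lookup come back as NA, which matches what the left join above would produce for unmatched codes.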