Problem description
Company Subsidiary1 Subsidiary2 Subsidiary3
DE5930 DE5931 NA NA
GB3489 GB3490 NA NA
GB3489 GB3490 GB3491 NA
US2036 US2037 NA NA
US2036 US2037 US2038 NA
US2036 US2037 US2038 GB3491
....# and so on
Now I would like to create a single column that holds all subsidiaries of each company, like this:
Company Subsidiaries
DE5930 DE5931
GB3489 GB3490
GB3489 GB3491
US2036 US2037
US2036 US2038
US2036 GB3491
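The dataset is really large (more than 100,000 rows), and I could not find a solution using the group_by or aggregate functions, because most examples are for numeric variables (e.g., computing a mean).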
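One idea was to remove duplicates with df[!duplicated(df$Subsidiary1), ] so that only the first occurrence of each subsidiary is kept, and then shift the values to the left. The problem is that a subsidiary may belong to more than one company (e.g. "GB3491"), and I don't want to drop those observations. Is there a good way to solve this?
Thanks in advance!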
Solution
I would suggest the following tidyverse approach:
library(tidyverse)
#Data
df <- structure(list(Company = c("DE5930","GB3489","GB3489","US2036","US2036","US2036"),
                     Subsidiary1 = c("DE5931","GB3490","GB3490","US2037","US2037","US2037"),
                     Subsidiary2 = c(NA,NA,"GB3491",NA,"US2038","US2038"),
                     Subsidiary3 = c(NA,NA,NA,NA,NA,"GB3491")),
                class = "data.frame", row.names = c(NA,-6L))
Code:
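df %>%
  pivot_longer(cols = -Company) %>%            # one row per company and subsidiary column
  select(-name) %>%                            # drop the original column names
  filter(!is.na(value)) %>%                    # drop empty subsidiary slots
  filter(!duplicated(paste(Company, value)))   # keep each company/subsidiary pair once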
Output:
# A tibble: 6 x 2
Company value
<chr> <chr>
1 DE5930 DE5931
2 GB3489 GB3490
3 GB3489 GB3491
4 US2036 US2037
5 US2036 US2038
6 US2036 GB3491
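As a possible variant of the same reshape-then-deduplicate idea, the sketch below uses the values_drop_na argument of pivot_longer() and dplyr's distinct() instead of filtering on a pasted key. The data frame is rebuilt from the table in the question, and the names out and Subsidiaries are chosen here for illustration only.

```r
library(dplyr)
library(tidyr)

# Data rebuilt from the table in the question
df <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

out <- df %>%
  pivot_longer(cols = -Company,
               values_to = "Subsidiaries",
               values_drop_na = TRUE) %>%   # NA slots are dropped while reshaping
  select(-name) %>%                         # the source column name is not needed
  distinct()                                # compares rows on Company and Subsidiaries

out
```

Because distinct() compares the two columns directly, a subsidiary such as GB3491 that belongs to two companies is kept once for each company.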
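Another option is to use coalesce. With the subsidiary columns passed in reverse order, it returns the last non-NA subsidiary of each row, which matches the desired output here because each row adds one new subsidiary to the row before it: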
library(dplyr)
df1 %>%
transmute(Company,Subsidiaries =
coalesce(!!! rlang::syms(rev(names(df1)[-1]))))
# Company Subsidiaries
#1 DE5930 DE5931
#2 GB3489 GB3490
#3 GB3489 GB3491
#4 US2036 US2037
#5 US2036 US2038
#6 US2036 GB3491
Or in base R, using max.col:
cbind(df1[1],Subsidiaries = df1[-1][cbind(seq_len(nrow(df1)),max.col(!is.na(df1[-1]),"last"))])
Data
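df1 <- structure(list(Company = c("DE5930","GB3489","GB3489","US2036","US2036","US2036"),
                      Subsidiary1 = c("DE5931","GB3490","GB3490","US2037","US2037","US2037"),
                      Subsidiary2 = c(NA,NA,"GB3491",NA,"US2038","US2038"),
                      Subsidiary3 = c(NA,NA,NA,NA,NA,"GB3491")),
                 class = "data.frame", row.names = c(NA,-6L))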
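To make the max.col one-liner easier to follow, here is a step-by-step sketch of the same computation. The intermediate names (subs, present, last_col, idx, result) are introduced here for illustration, and the data is rebuilt from the table in the question.

```r
# Data rebuilt from the table in the question
df1 <- data.frame(
  Company     = c("DE5930", "GB3489", "GB3489", "US2036", "US2036", "US2036"),
  Subsidiary1 = c("DE5931", "GB3490", "GB3490", "US2037", "US2037", "US2037"),
  Subsidiary2 = c(NA, NA, "GB3491", NA, "US2038", "US2038"),
  Subsidiary3 = c(NA, NA, NA, NA, NA, "GB3491"),
  stringsAsFactors = FALSE
)

subs     <- df1[-1]          # subsidiary columns only
present  <- !is.na(subs)     # logical matrix, TRUE where a value exists

# For each row, the column index of the last TRUE (ties resolved towards the right)
last_col <- max.col(present, ties.method = "last")

# Two-column (row, column) matrix used to pick one cell per row
idx      <- cbind(seq_len(nrow(df1)), last_col)

result   <- cbind(df1[1], Subsidiaries = subs[idx])
result
```

This picks exactly one subsidiary per row, so it relies on each row of the input introducing one new subsidiary at its rightmost filled position, as in the example data.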