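Problem description
I have two data frames that are edge lists. The first two columns of each are "source" and "target", and the second data frame has a third column holding an edge attribute. The two data frames differ in length. I want to (1) retrieve the edges of one data frame that are not in the other, and (2) get the attribute values from the second data frame for the edges that match.
Example: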
> A <- data.frame(source=c("v1","v1","v2","v2"),target=c("v2","v4","v3","v4"))
> B <- data.frame(source=c("V1","V2","v1","V4","V4","V5"),target=c("V2","V5","V3","V3","V2","V4"),variable=c(3,4,0,2,1,0))
> A
source target
1 v1 v2
2 v1 v4
3 v2 v3
4 v2 v4
> B
source target variable
1 V1 V2 3
2 V2 V5 4
3 v1 V3 0
4 V4 V3 2
5 V4 V2 1
6 V5 V4 0
Desired result (1):
source target
1 V2 V5
2 V1 V3
3 V4 V3
4 V5 V4
Desired result (2):
source target variable
1 V1 V2 3
2 V2 V4 1
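How can I do this in R?
Solution
You can get the first result with anti_join. Since direction does not seem to matter in your example, you need to anti-join on both combinations of source and target. Note that I had to use toupper, because the capitalization in your example is inconsistent and your desired output suggests that case should be ignored.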
library(dplyr)
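# Drop edges of B that match A in either orientation (A is upper-cased first)
anti_join(anti_join(B,A %>% mutate_all(toupper),by = c("source","target")),
A %>% mutate_all(toupper),by = c(target = "source",source = "target")) %>%
select(-variable)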
#> source target
#> 1 V2 V5
#> 2 v1 V3
#> 3 V4 V3
#> 4 V5 V4
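Binding the rows of two inner_join calls, one per orientation, gives the second result:
bind_rows(inner_join(B,A %>% mutate_all(toupper),by = c("source","target")),
inner_join(B,A %>% mutate_all(toupper),by = c(source = "target",target = "source")))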
#> source target variable
#> 1 V1 V2 3
#> 2 V4 V2 1
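Both joins above amount to comparing edges by a key that ignores direction and case. Here is a minimal base R sketch of that idea, assuming B holds the six rows printed in the question. The names edge_key, in_A, res1 and res2 are hypothetical and do not appear in the original answer.

```r
# Example data, as printed in the question
A <- data.frame(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"),
                stringsAsFactors = FALSE)
B <- data.frame(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0),
                stringsAsFactors = FALSE)

# Hypothetical helper: upper-case both endpoints and put the smaller one
# first, so "v2"-"v4" and "V4"-"V2" map to the same key
edge_key <- function(df) {
  s <- toupper(df$source)
  t <- toupper(df$target)
  paste(pmin(s, t), pmax(s, t), sep = "-")
}

# TRUE for every edge of B that also occurs in A (in either direction)
in_A <- edge_key(B) %in% edge_key(A)

res1 <- B[!in_A, c("source", "target")]  # (1) edges of B that are not in A
res2 <- B[in_A, ]                        # (2) matching edges with their attribute
```

Because this only tests membership, a duplicated edge in A would not duplicate rows of B, which a join could do.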
Using data.table:
# Load data.table and convert the data.frames to data.tables
library(data.table)
setDT(A)
setDT(B)
# If direction doesn't matter sort "source/target"
# Also need to standardise the data format,toupper()
cols <- c("source","target")
foo <- function(x) paste(toupper(sort(unlist(x))),collapse="-")
A[,oedge := foo(.SD),.SDcols = cols,by = seq_len(nrow(A))]
B[,oedge := foo(.SD),.SDcols = cols,by = seq_len(nrow(B))]
# Do anti-join and inner join
B[!A,.SD,on="oedge",.SDcols=cols]
# source target
# 1: V2 V5
# 2: v1 V3
# 3: V4 V3
# 4: V5 V4
B[A,.SD,on="oedge",.SDcols=c(cols,"variable"),nomatch = NULL]
# source target variable
# 1: V1 V2 3
# 2: V4 V2 1
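The per-row grouping (by = seq_len(nrow(...))) calls the key function once per edge, which may become slow on large edge lists. As a possible alternative, the sketch below builds the same kind of order-insensitive key column-wise with pmin/pmax. It assumes the endpoint columns are character vectors and that B holds the six rows printed in the question; only_in_B and matched are hypothetical names.

```r
library(data.table)

A <- data.table(source = c("v1", "v1", "v2", "v2"),
                target = c("v2", "v4", "v3", "v4"))
B <- data.table(source = c("V1", "V2", "v1", "V4", "V4", "V5"),
                target = c("V2", "V5", "V3", "V3", "V2", "V4"),
                variable = c(3, 4, 0, 2, 1, 0))

# Vectorised key: upper-case both endpoints, smaller endpoint first
A[, oedge := paste(pmin(toupper(source), toupper(target)),
                   pmax(toupper(source), toupper(target)), sep = "-")]
B[, oedge := paste(pmin(toupper(source), toupper(target)),
                   pmax(toupper(source), toupper(target)), sep = "-")]

# Anti-join: edges of B with no counterpart in A
only_in_B <- B[!A, on = "oedge"][, .(source, target)]

# Inner join: edges of B that also occur in A, with their attribute
matched <- B[A, on = "oedge", nomatch = NULL][, .(source, target, variable)]
```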