Delete rows until a column is identical across multiple data frames

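Problem description

I have 4 data frames named w, x, y and z, each with 3 columns and the same column names. I want to delete rows until the column named Type is identical in all four data frames.

To achieve this, I used a while loop, as shown below: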


# Compare the Type column of all four data frames against the first one
list_df <- list(z,w,x,y)
tmp <- lapply(list_df,`[[`,'Type')
i <- as.integer(as.logical(all(sapply(tmp,function(x) all(x == tmp[[1]])))))

while (i == 0) {

 # Keep only rows whose Type also occurs in x
 z <- z[(z$Type %in% x$Type),]
 y <- y[(y$Type %in% x$Type),]
 w <- w[(w$Type %in% x$Type),]

 # ... in w
 z <- z[(z$Type %in% w$Type),]
 y <- y[(y$Type %in% w$Type),]
 x <- x[(x$Type %in% w$Type),]

 # ... in y
 z <- z[(z$Type %in% y$Type),]
 x <- x[(x$Type %in% y$Type),]
 w <- w[(w$Type %in% y$Type),]

 # ... in z
 x <- x[(x$Type %in% z$Type),]
 w <- w[(w$Type %in% z$Type),]
 y <- y[(y$Type %in% z$Type),]

 # Re-test whether the Type columns are now identical
 list_df <- list(z,w,x,y)
 tmp <- lapply(list_df,`[[`,'Type')
 i <- as.integer(as.logical(all(sapply(tmp,function(x) all(x == tmp[[1]])))))
}

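In this code, a list is built from the Type column of each data frame. The value i then tests whether those columns are identical, giving 0 if they are not and 1 if they are. The while loop deletes the rows that are not present in every data frame until i becomes 1.

This code works, but applying it to larger data can make it run for a very long time. Does anyone know how to simplify this?

For a reproducible example: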

w <- structure(list(Type = c("26809D","28503C","360254","69298N","32708V","680681","329909","696978","32993F","867609","51206K","130747"),X1980 = c(NA,NA,271835,NA),X1981 = c(NA,290314,NA)),row.names = c("2","4","7","8","10","11","13","16","17","21","22","23"),class = "data.frame")

x <- structure(list(Type = c("26809D","329909"),1026815,826849,"13"),class = "data.frame")

y <- structure(list(Type = c("26809D","32708V"),X1980 = c(NA_real_,NA_real_,NA_real_),X1981 = c(NA_real_,NA_real_)),"10"),class = "data.frame")

z <- structure(list(Type = c("26809D","130747","50610H"),0.264736101439889,0.351108848169376,"23","24"
),class = "data.frame")

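Solution

We assume the question is how to obtain the Type values that are common to the 4 data frames, each of which has a Type column containing unique values.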

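Form a list L of the data frames, extract the Type column from each using lapply and [, and iterate merge over the result using Reduce:

L <- list(w,x,y,z)
L.Type <- lapply(L,"[",TRUE,"Type",drop = FALSE) # list of data frames with only the Type column
Reduce(merge,L.Type)$Type
## [1] "26809D" "28503C" "32708V" "360254" "69298N"

Alternatively, replace the last line with this, which gives the same result except for order. The Type columns are pulled out as plain vectors first, because base R's intersect compares data frames column by column rather than row by row:

Reduce(intersect,lapply(L.Type,"[[","Type"))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another approach, which is a bit tedious but does reduce the computation to a single line, is to iterate intersect manually:

intersect(w$Type,intersect(x$Type,intersect(y$Type,z$Type)))
## [1] "26809D" "28503C" "360254" "69298N" "32708V"

Another example

The example data does not illustrate this well, because every data frame has the same Type values, so let us create another example. BOD is a built-in data frame with 6 rows. We assign it to X and rename the columns so that the first one is named Type. Then, for i equal to 1, 2, 3, 4, we delete the i-th row, giving 4 data frames of 5 rows each, with 2 values of Type common to all 4. The result correctly shows that 5 and 7 are the only common Type values.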
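The BOD example described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the original code. The names common and L.common are introduced here purely for illustration. The last step is an addition that goes slightly beyond the description: it keeps only the rows with a common Type value in each data frame, which is what the question ultimately asks for, without any loop.

```r
# BOD is a built-in data frame with 6 rows; its first column (Time)
# holds the values 1, 2, 3, 4, 5, 7.
X <- BOD
names(X)[1] <- "Type"   # rename the first column to Type

# Four data frames of 5 rows each: the i-th one has row i removed.
L <- lapply(1:4, function(i) X[-i, ])

# Type values present in every data frame (assumes Type is unique
# within each data frame, as stated in the answer).
common <- Reduce(intersect, lapply(L, "[[", "Type"))
common
## [1] 5 7

# Illustrative extra step: drop the rows whose Type is not shared by all.
L.common <- lapply(L, function(d) d[d$Type %in% common, ])
```

After this, every element of L.common has the same set of Type values, so a single pass replaces the repeated filtering inside the while loop.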
