Binding multiple datasets with multiple sheets in R

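Question

I have 4 Excel datasets, each with 15 sheets.

First, I'd like to import all the datasets into R as a list, so that the list contains each dataset (df1, df2, df3, df4) and each dataset contains all 15 sheets (sheet1, sheet2, sheet3, ..., sheet15). The sheets have the same names in every dataset. The dataset filenames all start with the same word, say "coffee", so the files are "coffee_1.xlsx", "coffee_2.xlsx", "coffee_3.xlsx" and "coffee_4.xlsx". Is there a way to import all the datasets at once?

Second, I'd like to regroup all the datasets by sheet. For example, sheet1 of df1 should be combined with sheet1 of df2, df3 and df4.

I don't want to do this by hand, because I have to repeat the process for 100 datasets (15 sheets each).

So far I have tried importing all the datasets separately and combining them into a bigger list, like this: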

df.list<-list(df.list1,df.list2,df.list3,df.list4,df.list5)

Each list contains 15 sheets. Then I tried to bind them with do.call:

df.list.big<-do.call(rbind,df.list)

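But then I can't get at the data sheet by sheet. Any help on this would be much appreciated. Thanks!

Answers

I'll use openxlsx to create some sample xlsx files: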

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb,"tab1")
openxlsx::writeData(wb,"tab1",data.frame(a = 1101:1103,b = 1111:1113))
openxlsx::addWorksheet(wb,"tab2")
openxlsx::writeData(wb,"tab2",data.frame(a = 1201:1203,b = 1211:1213))
openxlsx::addWorksheet(wb,"tab3")
openxlsx::writeData(wb,"tab3",data.frame(a = 1301:1303,b = 1311:1313))
openxlsx::saveWorkbook(wb,"book1.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb,"tab1")
openxlsx::writeData(wb,"tab1",data.frame(a = 2101:2103,b = 2111:2113))
openxlsx::addWorksheet(wb,"tab2")
openxlsx::writeData(wb,"tab2",data.frame(a = 2201:2203,b = 2211:2213))
openxlsx::addWorksheet(wb,"tab3")
openxlsx::writeData(wb,"tab3",data.frame(a = 2301:2303,b = 2311:2313))
openxlsx::saveWorkbook(wb,"book2.xlsx")

wb <- openxlsx::createWorkbook()
openxlsx::addWorksheet(wb,"tab1")
openxlsx::writeData(wb,"tab1",data.frame(a = 3101:3103,b = 3111:3113))
openxlsx::addWorksheet(wb,"tab2")
openxlsx::writeData(wb,"tab2",data.frame(a = 3201:3203,b = 3211:3213))
openxlsx::addWorksheet(wb,"tab3")
openxlsx::writeData(wb,"tab3",data.frame(a = 3301:3303,b = 3311:3313))
openxlsx::saveWorkbook(wb,"book3.xlsx")

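General workflow

I'm not sure why you'd prefer to keep one frame per sheet; if you're doing the same thing to different groups of data, it can still make a lot of sense to have a single frame, keeping as much context as possible so that grouping comes naturally.

While base R does have grouping operations, I find them slightly less intuitive/flexible than those in the data.table or dplyr packages, so I'll stick with those two here (and let you decide which one to use, if either, and adapt your processing to work by group).

Either way, this is my workflow:

We need a function that reads all the sheets in a workbook, and then we iterate it over a vector of filenames;
I'll demonstrate putting all the data into one frame (my recommendation); and then
I'll demonstrate grouping them by sheet.

I'll start with data.table, but will give the dplyr equivalents later.

A basic read-all-sheets function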
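To make the "one frame, grouped" idea concrete, here is a minimal sketch. It uses base R's aggregate only so that it runs without extra packages; the `combined` frame and its values are made up for illustration and are not the sample workbooks used in this answer.

```r
# Hypothetical combined frame: one row per observation, with its origin kept as columns
combined <- data.frame(
  workbook = rep(c("book1.xlsx", "book2.xlsx"), each = 4),
  sheet    = rep(rep(c("tab1", "tab2"), each = 2), times = 2),
  a        = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Mean of `a` per sheet, pooled across all workbooks
per_sheet <- aggregate(a ~ sheet, data = combined, FUN = mean)

# Mean of `a` per workbook-and-sheet combination
per_book_sheet <- aggregate(a ~ workbook + sheet, data = combined, FUN = mean)
```

Because the workbook and sheet labels travel with every row, switching from "per sheet" to "per workbook and sheet" is only a change to the grouping formula, not a restructuring of the data.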

readOneBook <- function(fn) {
  shtnms <- openxlsx::getSheetNames(fn)
  sheets <- lapply(setNames(nm = shtnms),openxlsx::readWorkbook,xlsxFile = fn)
  sheets
}
readOneBook("book1.xlsx")
# $tab1
#      a    b
# 1 1101 1111
# 2 1102 1112
# 3 1103 1113
# $tab2
#      a    b
# 1 1201 1211
# 2 1202 1212
# 3 1203 1213
# $tab3
#      a    b
# 1 1301 1311
# 2 1302 1312
# 3 1303 1313

So we'll use that to create a list of workbooks, each of which is itself a list of sheets:

workbooks <- lapply(setNames(nm = list.files(pattern = "\\.xlsx$")),readOneBook)
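The pattern above picks up every .xlsx file in the working directory. If only the files starting with "coffee" from the question are wanted, a narrower regular expression might look like the sketch below. The file names in `all_files` are hypothetical and only stand in for what list.files() would return.

```r
# Hypothetical directory listing; in practice this would come from list.files()
all_files <- c("coffee_1.xlsx", "coffee_2.xlsx", "notes.txt", "tea_1.xlsx", "coffee_old.xls")

# Keep names that start with "coffee_" and end in ".xlsx"
coffee_pattern <- "^coffee_.*\\.xlsx$"
coffee_files <- all_files[grepl(coffee_pattern, all_files)]
coffee_files

# The same pattern can be passed straight to list.files():
# workbooks <- lapply(setNames(nm = list.files(pattern = coffee_pattern)), readOneBook)
```

Anchoring the pattern with `^` and `$` avoids accidentally matching files such as backups or other formats that merely contain "coffee" somewhere in their name.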

data.table

This gives a list in which each element is one workbook:

library(data.table)
lapply(workbooks,rbindlist,idcol = "sheet")
# $book1.xlsx
#    sheet    a    b
# 1:  tab1 1101 1111
# 2:  tab1 1102 1112
# 3:  tab1 1103 1113
# 4:  tab2 1201 1211
# 5:  tab2 1202 1212
# 6:  tab2 1203 1213
# 7:  tab3 1301 1311
# 8:  tab3 1302 1312
# 9:  tab3 1303 1313
# $book2.xlsx
#    sheet    a    b
# 1:  tab1 2101 2111
# 2:  tab1 2102 2112
# 3:  tab1 2103 2113
# 4:  tab2 2201 2211
# 5:  tab2 2202 2212
# 6:  tab2 2203 2213
# 7:  tab3 2301 2311
# 8:  tab3 2302 2312
# 9:  tab3 2303 2313
# $book3.xlsx
#    sheet    a    b
# 1:  tab1 3101 3111
# 2:  tab1 3102 3112
# 3:  tab1 3103 3113
# 4:  tab2 3201 3211
# 5:  tab2 3202 3212
# 6:  tab2 3203 3213
# 7:  tab3 3301 3311
# 8:  tab3 3302 3312
# 9:  tab3 3303 3313

And this combines everything into one big frame:

rbindlist(
  lapply(workbooks,rbindlist,idcol = "sheet"),idcol = "workbook"
)
#       workbook sheet    a    b
#  1: book1.xlsx  tab1 1101 1111
#  2: book1.xlsx  tab1 1102 1112
#  3: book1.xlsx  tab1 1103 1113
#  4: book1.xlsx  tab2 1201 1211
#  5: book1.xlsx  tab2 1202 1212
# ---                           
# 23: book3.xlsx  tab2 3202 3212
# 24: book3.xlsx  tab2 3203 3213
# 25: book3.xlsx  tab3 3301 3311
# 26: book3.xlsx  tab3 3302 3312
# 27: book3.xlsx  tab3 3303 3313

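A list of sheets is a little different and needs a bit of "transpose" functionality. The approach below guards against (1) sheets that are not present in every workbook; and (2) sheets appearing in a different order.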

commonsheets <- Reduce(intersect,lapply(workbooks,names))
commonsheets
# [1] "tab1" "tab2" "tab3"
lapply(setNames(nm = commonsheets),function(sht) rbindlist(lapply(workbooks,`[[`,sht),idcol = "workbook"))
# $tab1
#      workbook    a    b
# 1: book1.xlsx 1101 1111
# 2: book1.xlsx 1102 1112
# 3: book1.xlsx 1103 1113
# 4: book2.xlsx 2101 2111
# 5: book2.xlsx 2102 2112
# 6: book2.xlsx 2103 2113
# 7: book3.xlsx 3101 3111
# 8: book3.xlsx 3102 3112
# 9: book3.xlsx 3103 3113
# $tab2
#      workbook    a    b
# 1: book1.xlsx 1201 1211
# 2: book1.xlsx 1202 1212
# 3: book1.xlsx 1203 1213
# 4: book2.xlsx 2201 2211
# 5: book2.xlsx 2202 2212
# 6: book2.xlsx 2203 2213
# 7: book3.xlsx 3201 3211
# 8: book3.xlsx 3202 3212
# 9: book3.xlsx 3203 3213
# $tab3
#      workbook    a    b
# 1: book1.xlsx 1301 1311
# 2: book1.xlsx 1302 1312
# 3: book1.xlsx 1303 1313
# 4: book2.xlsx 2301 2311
# 5: book2.xlsx 2302 2312
# 6: book2.xlsx 2303 2313
# 7: book3.xlsx 3301 3311
# 8: book3.xlsx 3302 3312
# 9: book3.xlsx 3303 3313
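For comparison with the do.call(rbind, ...) attempt in the question, the same per-sheet grouping can be sketched in base R alone. This is only an illustration under assumptions: `workbooks_demo` is a hypothetical in-memory stand-in for `workbooks`, with sheets deliberately in a different order and one extra sheet to show why the intersect step matters.

```r
# Hypothetical stand-in for `workbooks`: a named list of workbooks,
# each a named list of data frames
workbooks_demo <- list(
  book1.xlsx = list(tab1 = data.frame(a = 1:2, b = 3:4),
                    tab2 = data.frame(a = 5:6, b = 7:8)),
  book2.xlsx = list(tab2 = data.frame(a = 9:10, b = 11:12),
                    tab1 = data.frame(a = 13:14, b = 15:16),
                    extra = data.frame(a = 0L, b = 0L))
)

# Sheets present in every workbook
common <- Reduce(intersect, lapply(workbooks_demo, names))

by_sheet <- lapply(setNames(nm = common), function(sht) {
  # pull sheet `sht` out of every workbook and tag each frame with its workbook name
  frames <- Map(function(wb, wbname) cbind(workbook = wbname, wb[[sht]]),
                workbooks_demo, names(workbooks_demo))
  out <- do.call(rbind, frames)
  rownames(out) <- NULL
  out
})
```

The key difference from the original attempt is that rbind is applied to the data frames of one sheet at a time, selected by name, rather than to the outer list of workbooks.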

dplyr

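Same functionality and effectively the same data, so I'll only show the commands (really just replacing rbindlist with bind_rows and changing the argument name).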

library(dplyr)

# list,one workbook per element
lapply(workbooks,bind_rows,.id = "sheet")

# one big frame
bind_rows(
  lapply(workbooks,bind_rows,.id = "sheet"),.id = "workbook"
)

# list,one common sheet per element
lapply(setNames(nm = commonsheets),function(sht) bind_rows(lapply(workbooks,`[[`,sht),.id = "workbook"))

Here is another way to get the job done. You will use three packages:

library(readxl)
library(dplyr)
library(purrr)

In this case I'll assume that all your data is in your working directory and that all workbooks have the same number of sheets.
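Since everything that follows relies on that assumption, it may be worth checking it first. The sketch below is one possible check, not part of the original approach; the `sheet_names` list is hypothetical, and with real files it could be built with readxl::excel_sheets on each workbook.

```r
# Hypothetical sheet names per workbook; with real files these could come from
# lapply(files, readxl::excel_sheets)
sheet_names <- list(
  coffee_1.xlsx = c("sheet1", "sheet2", "sheet3"),
  coffee_2.xlsx = c("sheet1", "sheet2", "sheet3"),
  coffee_3.xlsx = c("sheet1", "sheet3")
)

# TRUE for each workbook whose sheets match the first workbook exactly (names and order)
matches_first <- vapply(sheet_names, identical, logical(1), sheet_names[[1]])
matches_first

# Workbooks that break the assumption
names(matches_first)[!matches_first]
```

If any workbook is flagged, reading sheets by position would silently combine different sheets, so reading by sheet name would be the safer choice in that situation.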

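Step 1: list all the files

# list all your workbooks (only the .xlsx files in the working directory)
files <- list.files(pattern = "\\.xlsx$")

Step 2: create a function that takes the file paths and a sheet index and returns a data frame in which that sheet from every workbook is bound together by rows.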

read_workbook_sheets <- function(files,sheet_index = 1) {
  
  # import from all workbooks the sheet in the sheet_index position
  data <- purrr::map(files,~readxl::read_excel(path = .x,sheet = sheet_index))
  
  # bind all the sheets together by rows
  data <- dplyr::bind_rows(data)
  
  # return the dataframe  
  return(data)
}

Step 3: map the function over a sequence of sheet positions, for example 1 to 10

my_list_of_df <- purrr::map(1:10,~read_workbook_sheets(files,.x))

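PS: Sorry for my English grammar, I'm not a native speaker.

Edit: You didn't provide a reproducible example, so I'll make some assumptions: your sheet names are the same and the columns are the same. I made a small dataset that I think fits your description.

I'd recommend using data.table if you can.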

library(data.table)
df.list1 <- list(sheet1 = data.table(a = 1,b = 1),sheet2 = data.table(a = 2,b = 2))
df.list2 <- list(sheet1 = data.table(a = 3,b = 3),sheet2 = data.table(a = 4,b = 4))
df.list <- list(dataset1 = df.list1,dataset2 = df.list2)
# Now I have a dataset like yours 
# First -- transpose them so "sheets" are on the outside
# Then data.table::rbindlist them,keeping the dataset names,if you like
lapply(purrr::transpose(df.list),data.table::rbindlist,idcol = "dataset")

You could try something like this:

library(tidyverse)
library(readxl)

df <- tibble(filename = list.files(path = ".",pattern = "coffee",full.names = TRUE)) %>% 
    mutate(sheet = map(filename,excel_sheets)) %>% 
    unnest(sheet) %>% 
    mutate(data_from_excel = map2(filename,sheet,read_excel)) %>% 
    group_by(sheet) 

df2 <- df %>% group_split
names(df2) <- group_keys(df) %>% pull

df2 %>% map(~summarize(.,bind_rows(data_from_excel)))
