Summing the results from multiple sheets into a single table in R

Question

I am reading an Excel file that contains multiple sheets.

 file_to_read <- "./file_name.xlsx"
 
 # Get all names of sheets in the file
 sheet_names <- readxl::excel_sheets(file_to_read)
 
 # Loop through sheets
 L <- lapply(sheet_names,function(x) {
 all_cells <-
 tidyxl::xlsx_cells(file_to_read,sheets = x)
})

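L now holds all the sheets. Next, I need to take the data from each sheet and merge all columns and rows into one file. To be precise, I want to sum the values for matching rows and columns across the sheets and write the result to a single file.

Here is a simple example to illustrate.

Say this table is in one sheet,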

df1 <- data.frame(x = 1:5,y = 2:6,z = 3:7)
rownames(df1) <- LETTERS[1:5]
df1
M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7

and the second table is in the next sheet,

df2 <- data.frame(x = 1:5,y = 2:6,z = 3:7,w = 8:12)
rownames(df2) <- LETTERS[3:7]
df2
M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12

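My goal is to combine (sum) the matching records from all 100 sheets in one Excel file, to get one large table that contains the total of each value.

The final table should look like this: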

M x y  z   w
A 1 2  3   0
B 2 3  4   0
C 4 6  8   8
D 6 8  10  9
E 8 10 12 10
F 4 5  6  11
G 5 6  7  12

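Is there a way to achieve this in R? I am not an expert in R, but I would like to know how to read all the sheets, compute the sums, and then save the output to a file.

Thanks

Answers

Since you say you have hundreds of sheets, I suggest importing all of them into a single list in R, say my.list (as suggested in this link or this readxl documentation), and then following the strategy below, rather than binding the data frames pairwise, one at a time.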

df1 <- read.table(text = 'M x y z
A 1 2 3
B 2 3 4
C 3 4 5
D 4 5 6
E 5 6 7',header = T)
df2 <- read.table(text = 'M x y z  w
C 1 2 3  8
D 2 3 4  9
E 3 4 5 10
F 4 5 6 11
G 5 6 7 12',header = T)

library(tibble)
library(tidyverse)

my.list <- list(df1,df2)

map_dfr(my.list,~.x)
#>    M x y z  w
#> 1  A 1 2 3 NA
#> 2  B 2 3 4 NA
#> 3  C 3 4 5 NA
#> 4  D 4 5 6 NA
#> 5  E 5 6 7 NA
#> 6  C 1 2 3  8
#> 7  D 2 3 4  9
#> 8  E 3 4 5 10
#> 9  F 4 5 6 11
#> 10 G 5 6 7 12
map_dfr(my.list,~ .x) %>%
  group_by(M) %>%
  summarise(across(everything(),~ sum(.x,na.rm = TRUE)))
#> # A tibble: 7 x 5
#>   M         x     y     z     w
#>   <chr> <int> <int> <int> <int>
#> 1 A         1     2     3     0
#> 2 B         2     3     4     0
#> 3 C         4     6     8     8
#> 4 D         6     8    10     9
#> 5 E         8    10    12    10
#> 6 F         4     5     6    11
#> 7 G         5     6     7    12

Created on 2021-05-26 by the reprex package (v2.0.0)
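Since the question also asks about saving the result, here is a minimal sketch that wraps the summing step in a helper and writes the total to a CSV file. The helper name `sum_tables`, the output file name, and the use of `dplyr::bind_rows()` in place of `map_dfr()` are my own choices for illustration, not part of the answer above; the in-memory `df1` and `df2` stand in for sheets read from a workbook.

```r
library(dplyr)

# Hypothetical helper: stack a list of data frames that share a key column `M`
# and sum every other column per key. Columns missing from a table are filled
# with NA by bind_rows(), and na.rm = TRUE makes them count as 0 in the sum.
sum_tables <- function(tables) {
  bind_rows(tables) %>%
    group_by(M) %>%
    summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))
}

# Stand-ins for two sheets. With a real workbook the list could be built with
# the idiom from the readxl documentation (path is an assumption here):
#   my.list <- lapply(readxl::excel_sheets(path), readxl::read_excel, path = path)
df1 <- data.frame(M = LETTERS[1:5], x = 1:5, y = 2:6, z = 3:7)
df2 <- data.frame(M = LETTERS[3:7], x = 1:5, y = 2:6, z = 3:7, w = 8:12)

total <- sum_tables(list(df1, df2))

# Save the summed table; tempdir() is used only so the sketch runs anywhere.
out_file <- file.path(tempdir(), "summed_sheets.csv")
write.csv(total, out_file, row.names = FALSE)
```

Because the helper takes a list, it should behave the same for two tables or a hundred, as long as every sheet has the key column `M` and the remaining columns are numeric.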

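One workable approach is the following sequence of steps:

Read each sheet into a list
Convert each sheet to long format
Bind the results into a single data frame
Group that long data frame and sum within each group
Convert back to a tabular (wide) format

This should work for N sheets with any combination of row and column headers in those sheets. For example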

library(dplyr)  # provides %>%, group_by() and summarise() used below

file <- "D:\\Book1.xlsx"
sheet_names <- readxl::excel_sheets(file)
sheet_data <- lapply(sheet_names,function(sheet_name) {
  readxl::read_xlsx(path = file,sheet = sheet_name)
})

# use pivot_longer on each sheet to make long data
long_sheet_data <- lapply(sheet_data,function(data) {
  long <- tidyr::pivot_longer(
    data = data,cols = !M,names_to = "col",values_to = "val"
  )
})

# combine into a single tibble
long_data <- dplyr::bind_rows(long_sheet_data)

# sum up matching pairs of `M` and `col`
summarised <- long_data %>%
  group_by(M,col) %>%
  dplyr::summarise(agg = sum(val))
  
# convert to a tabular format
tabular <- summarised %>%
  tidyr::pivot_wider(
    names_from = col,values_from = agg,values_fill = 0
  )

tabular

Using a spreadsheet built from your initial input, I get this output:

> tabular
# A tibble: 7 x 5
# Groups:   M [7]
  M         x     y     z     w
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A         1     2     3     0
2 B         2     3     4     0
3 C         4     6     8     8
4 D         6     8    10     9
5 E         8    10    12    10
6 F         4     5     6    11
7 G         5     6     7    12
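To try the long-format steps without a workbook, here is a minimal sketch that uses two in-memory data frames in place of the sheets (an assumption made only so it runs anywhere). It also passes `.groups = "drop"` to `summarise()`, which is my addition; it returns an ungrouped tibble, so the `# Groups:` line no longer appears in the printout.

```r
library(dplyr)
library(tidyr)

# Stand-ins for the sheets that readxl::read_xlsx() would return.
sheet_data <- list(
  data.frame(M = LETTERS[1:5], x = 1:5, y = 2:6, z = 3:7),
  data.frame(M = LETTERS[3:7], x = 1:5, y = 2:6, z = 3:7, w = 8:12)
)

# One row per (M, col) cell for each sheet, then stack all sheets.
long_data <- lapply(sheet_data, function(data) {
  pivot_longer(data, cols = !M, names_to = "col", values_to = "val")
}) %>%
  bind_rows()

# Sum cells that share the same row label and column name, then go back
# to a wide table; cells that never occurred in any sheet become 0.
tabular <- long_data %>%
  group_by(M, col) %>%
  summarise(agg = sum(val, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = col, values_from = agg, values_fill = 0)

tabular
```

The long format is what makes the mismatched columns harmless: a sheet without a `w` column simply contributes no `w` rows, and `values_fill = 0` fills the gap when the table is widened again.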

You can use dplyr and tidyr to get the result you want:

Given

df <- data.frame(subject=c(rep("Mother",2),rep("Child",2)),modifier=c("chart2","child","tech","unkn"),mother_chart2=1:4,mother_child=5:8,child_tech=9:12,child_unkn=13:16)
> df
  subject modifier mother_chart2 mother_child child_tech child_unkn
1  Mother   chart2             1            5          9         13
2  Mother    child             2            6         10         14
3   Child     tech             3            7         11         15
4   Child     unkn             4            8         12         16

and

df2 <- data.frame(subject=c(rep("Mother",2),rep("Child",2)),modifier=c("chart","child","tech","unkn"),mother_chart=101:104,mother_child=105:108,child_tech=109:112,child_unkn=113:116)

> df2
  subject modifier mother_chart mother_child child_tech child_unkn
1  Mother    chart          101          105        109        113
2  Mother    child          102          106        110        114
3   Child     tech          103          107        111        115
4   Child     unkn          104          108        112        116

then

library(dplyr)
library(tidyr)

df2_tmp <- df2 %>%
  pivot_longer(cols=-c("subject","modifier"))

df %>%
  pivot_longer(cols=-c("subject","modifier")) %>%
  full_join(df2_tmp,by=c("subject","modifier","name")) %>%
  mutate(across(starts_with("value"),~ replace_na(.,0)),sum = value.x + value.y) %>%
  select(-value.x,-value.y) %>%
  pivot_wider(names_from=name,values_from=sum,values_fill=0)

# A tibble: 5 x 7
  subject modifier mother_chart2 mother_child child_tech child_unkn mother_chart
  <chr>   <chr>            <dbl>        <dbl>      <dbl>      <dbl>        <dbl>
1 Mother  chart2               1            5          9         13            0
2 Mother  child                2          112        120        128          102
3 Child   tech                 3          114        122        130          103
4 Child   unkn                 4          116        124        132          104
5 Mother  chart                0          105        109        113          101
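The pairwise join above handles exactly two tables. For more than two, one possible variant (my sketch, not part of this answer) is to stack the tables and sum within each `subject`/`modifier` pair. The two data frames below are rebuilt from the printed tables above.

```r
library(dplyr)

df <- data.frame(subject = c(rep("Mother", 2), rep("Child", 2)),
                 modifier = c("chart2", "child", "tech", "unkn"),
                 mother_chart2 = 1:4, mother_child = 5:8,
                 child_tech = 9:12, child_unkn = 13:16)
df2 <- data.frame(subject = c(rep("Mother", 2), rep("Child", 2)),
                  modifier = c("chart", "child", "tech", "unkn"),
                  mother_chart = 101:104, mother_child = 105:108,
                  child_tech = 109:112, child_unkn = 113:116)

# bind_rows() fills columns that a table lacks with NA; summing with
# na.rm = TRUE then treats those missing cells as 0.
combined <- bind_rows(df, df2) %>%
  group_by(subject, modifier) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = "drop")

combined
```

The values should agree with the joined result, although the rows come back sorted by the grouping keys rather than in the join order. Adding a third table only means adding it to the `bind_rows()` call.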
