R: nesting and then unnesting a data frame gives a different object

Problem description

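I am using the nest / unnest functions in R for the first time, and I don't understand the result. I nest and then immediately unnest, and compare the data frames before and after. Why are the data frames not identical?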

> library(tidyverse)  
> concentration_original <- readRDS("./Data/concentration.Rds")
> print(concentration_original,n=15)
# A tibble: 12 x 5
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02 0      0.74
 2      1  79.6  4.02 0.25   2.84
 3      1  79.6  4.02 0.570  6.57
 4      1  79.6  4.02 1.12  10.5 
 5      1  79.6  4.02 2.02   9.66
 6      1  79.6  4.02 3.82   8.58
 7      2  72.4  4.4  0      0   
 8      2  72.4  4.4  0.27   1.72
 9      2  72.4  4.4  0.52   7.91
10      2  72.4  4.4  1      8.31
11      2  72.4  4.4  1.92   8.33
12      2  72.4  4.4  3.5    6.85
> 
> concentration_nested <- concentration_original %>% nest(data = c(TIME,CONC))
> concentration_nested
# A tibble: 2 x 4
  SUBJID    WT  DOSE data            
   <dbl> <dbl> <dbl> <list>          
1      1  79.6  4.02 <tibble [6 × 2]>
2      2  72.4  4.4  <tibble [6 × 2]>
> 
> concentration_unnested <- unnest(concentration_nested,cols = c(data))
> print(concentration_unnested,n=15)
# A tibble: 12 x 5
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02 0      0.74
 2      1  79.6  4.02 0.25   2.84
 3      1  79.6  4.02 0.570  6.57
 4      1  79.6  4.02 1.12  10.5 
 5      1  79.6  4.02 2.02   9.66
 6      1  79.6  4.02 3.82   8.58
 7      2  72.4  4.4  0      0   
 8      2  72.4  4.4  0.27   1.72
 9      2  72.4  4.4  0.52   7.91
10      2  72.4  4.4  1      8.31
11      2  72.4  4.4  1.92   8.33
12      2  72.4  4.4  3.5    6.85
> 
> if (identical(concentration_unnested,concentration_original)) {
+   print("After nest/unnest,we have a dataframe which IS IDENTICAL to the original")
+ } else {
+   print("After nest/unnest,we have a dataframe which IS NOT IDENTICAL to the original")
+ }
[1] "After nest/unnest,we have a dataframe which IS NOT IDENTICAL to the original"
> 
> all.equal(concentration_unnested,concentration_original)
[1] "Attributes: < Length mismatch: comparison on first 2 components >"
>

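Note that I used all.equal, which suggests the problem is related to attributes. If I use all_equal instead, the result is TRUE, but identical still reports that the data frames are not identical. Thanks for your help!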
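The gap between a strict and a value-only comparison can be reproduced without reading any file. The following is a minimal base R sketch, assuming a small made-up data frame and a hand-attached placeholder `spec` attribute (a stand-in, not a real readr column specification). `identical()` takes attributes into account, while `all.equal()` can be told to ignore them.

```r
# Small illustrative data frame (values are made up, not the original data).
plain <- data.frame(SUBJID = c(1, 1, 2, 2), TIME = c(0, 0.25, 0, 0.27))

# Copy it and attach a placeholder attribute named "spec".
# This is a hand-made stand-in, not a real readr column specification.
tagged <- plain
attr(tagged, "spec") <- list(note = "placeholder")

# identical() compares attributes too, so the extra attribute makes it fail.
strict_same <- identical(tagged, plain)

# all.equal() reports the attribute difference by default ...
default_check <- all.equal(tagged, plain)

# ... but looks only at the values when check.attributes = FALSE.
values_same <- isTRUE(all.equal(tagged, plain, check.attributes = FALSE))

print(strict_same)
print(default_check)
print(values_same)
```

This matches the pattern in the question: a value-level comparison passes, while any comparison that includes attributes does not.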

Added the dput output of the original df and of the nested/unnested df.

> dput(concentration_original)
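structure(list(SUBJID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
    WT = c(79.6, 79.6, 79.6, 79.6, 79.6, 79.6, 72.4, 72.4, 72.4, 72.4, 72.4, 72.4),
    DOSE = c(4.02, 4.02, 4.02, 4.02, 4.02, 4.02, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4),
    TIME = c(0, 0.25, 0.57, 1.12, 2.02, 3.82, 0, 0.27, 0.52, 1, 1.92, 3.5),
    CONC = c(0.74, 2.84, 6.57, 10.5, 9.66, 8.58, 0, 1.72, 7.91, 8.31, 8.33, 6.85)),
    spec = structure(list(cols = list(
        SUBJID = structure(list(), class = c("collector_double", "collector")),
        WT = structure(list(), class = c("collector_double", "collector")),
        DOSE = structure(list(), class = c("collector_double", "collector")),
        TIME = structure(list(), class = c("collector_double", "collector")),
        CONC = structure(list(), class = c("collector_double", "collector"))),
        default = structure(list(), class = c("collector_guess", "collector")),
        skip = 1), class = "col_spec"),
    row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))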
> dput(concentration_unnested)
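structure(list(SUBJID = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
    WT = c(79.6, 79.6, 79.6, 79.6, 79.6, 79.6, 72.4, 72.4, 72.4, 72.4, 72.4, 72.4),
    DOSE = c(4.02, 4.02, 4.02, 4.02, 4.02, 4.02, 4.4, 4.4, 4.4, 4.4, 4.4, 4.4),
    TIME = c(0, 0.25, 0.57, 1.12, 2.02, 3.82, 0, 0.27, 0.52, 1, 1.92, 3.5),
    CONC = c(0.74, 2.84, 6.57, 10.5, 9.66, 8.58, 0, 1.72, 7.91, 8.31, 8.33, 6.85)),
    row.names = c(NA, -12L), class = c("tbl_df", "tbl", "data.frame"))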
>

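Additional information: I think I found the problem. The spec attribute on the original tibble holds information from when the tibble was created with read_csv. When the tibble goes through nest/unnest, the spec attribute is dropped. Another thread mentions that the spec attribute can get out of sync with the tibble's contents: Remove attributes from data read in readr::read_csv. In that thread, they suggest removing the spec attribute: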

attr(df,'spec') <- NULL
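To apply that suggestion without modifying the original object in place, the assignment can be wrapped in a small function. This is a minimal sketch: `drop_spec` is a hypothetical helper name, the data are made up, and the `spec` attribute is a hand-made placeholder standing in for the col_spec object that readr would attach.

```r
# Hypothetical helper: return a copy of `df` without its "spec" attribute.
drop_spec <- function(df) {
  attr(df, "spec") <- NULL
  df
}

# Illustrative data; the "spec" attribute is a hand-made placeholder,
# standing in for the col_spec object that readr would attach.
reference <- data.frame(SUBJID = c(1, 2), DOSE = c(4.02, 4.4))
from_file <- reference
attr(from_file, "spec") <- structure(list(), class = "col_spec")

# Compare before and after stripping the attribute.
before <- identical(from_file, reference)
after  <- identical(drop_spec(from_file), reference)
print(c(before = before, after = after))
```

For the data in the question, the analogous check would be `identical(drop_spec(concentration_original), concentration_unnested)`, which should hold only if `spec` is the sole difference between the two objects.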

Solution

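From what I found, your original data frame is not identical to the output because the original carries a spec attribute (an object of class col_spec), whereas the output does not.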

Using the newer waldo package (maintained by the tidyverse team), I ran the following:

waldo::compare(df, df %>% nest(data = c(TIME, CONC)) %>% unnest(cols = c(data)))
`attr(old,'spec')` is an S3 object of class <col_spec>
`attr(new,'spec')` is absent

It appears you read the data with readr, so df carries a spec attribute of class col_spec. Nesting the original df drops this attribute.

attr(df %>% nest(data = c(TIME,CONC)),'spec')
NULL

So when you unnest, the resulting df is not identical to the original.
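If waldo is not available, a rough base R check can list which attribute names disappeared between two objects. This is only a sketch: `lost_attributes` is a hypothetical helper, the data are made up, and the `spec` attribute is a placeholder rather than a real col_spec. It compares attribute names only, not their values.

```r
# Hypothetical helper: names of attributes present on `old` but missing on `new`.
lost_attributes <- function(old, new) {
  setdiff(names(attributes(old)), names(attributes(new)))
}

# Illustrative "before" object with a placeholder "spec" attribute.
before_df <- data.frame(SUBJID = c(1, 2), WT = c(79.6, 72.4))
attr(before_df, "spec") <- list(note = "placeholder")

# Illustrative "after" object: same data, attribute removed by hand.
after_df <- before_df
attr(after_df, "spec") <- NULL

lost <- lost_attributes(before_df, after_df)
print(lost)
```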
