R: nesting and then unnesting a data frame yields a different object

Question

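I am using the nest / unnest functions in R for the first time, and I don't understand the result. I nest and then immediately unnest, and compare the data frame before and after. Why are the two data frames not identical?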

> library(tidyverse)  
> concentration_original <- readRDS("./Data/concentration.Rds")
> print(concentration_original,n=15)
# A tibble: 12 x 5
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02 0      0.74
 2      1  79.6  4.02 0.25   2.84
 3      1  79.6  4.02 0.570  6.57
 4      1  79.6  4.02 1.12  10.5 
 5      1  79.6  4.02 2.02   9.66
 6      1  79.6  4.02 3.82   8.58
 7      2  72.4  4.4  0      0   
 8      2  72.4  4.4  0.27   1.72
 9      2  72.4  4.4  0.52   7.91
10      2  72.4  4.4  1      8.31
11      2  72.4  4.4  1.92   8.33
12      2  72.4  4.4  3.5    6.85
> 
> concentration_nested <- concentration_original %>% nest(data = c(TIME,CONC))
> concentration_nested
# A tibble: 2 x 4
  SUBJID    WT  DOSE data            
   <dbl> <dbl> <dbl> <list>          
1      1  79.6  4.02 <tibble [6 × 2]>
2      2  72.4  4.4  <tibble [6 × 2]>
> 
> concentration_unnested <- unnest(concentration_nested,cols = c(data))
> print(concentration_unnested,n=15)
# A tibble: 12 x 5
   SUBJID    WT  DOSE  TIME  CONC
    <dbl> <dbl> <dbl> <dbl> <dbl>
 1      1  79.6  4.02 0      0.74
 2      1  79.6  4.02 0.25   2.84
 3      1  79.6  4.02 0.570  6.57
 4      1  79.6  4.02 1.12  10.5 
 5      1  79.6  4.02 2.02   9.66
 6      1  79.6  4.02 3.82   8.58
 7      2  72.4  4.4  0      0   
 8      2  72.4  4.4  0.27   1.72
 9      2  72.4  4.4  0.52   7.91
10      2  72.4  4.4  1      8.31
11      2  72.4  4.4  1.92   8.33
12      2  72.4  4.4  3.5    6.85
> 
> if (identical(concentration_unnested,concentration_original)) {
+   print("After nest/unnest,we have a dataframe which IS IDENTICAL to the original")
+ } else {
+   print("After nest/unnest,we have a dataframe which IS NOT IDENTICAL to the original")
+ }
[1] "After nest/unnest,we have a dataframe which IS NOT IDENTICAL to the original"
> 
> all.equal(concentration_unnested,concentration_original)
[1] "Attributes: < Length mismatch: comparison on first 2 components >"
> 

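Note that I used all.equal to see that the problem might have to do with attributes. If I use all_equal instead, the result is TRUE, yet the identical function still says that the data frames are not the same. Thanks for your help!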
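
To narrow down a mismatch like this, it can help to list which attributes exist on one object but not on the other. The following is a minimal sketch, not the data from the question: `original` and `rebuilt` are hypothetical tibbles, and the `spec` attribute is attached by hand as a plain string that merely stands in for the col_spec object readr would add.

```r
library(tibble)

# Hypothetical stand-ins for the two data frames being compared.
original <- tibble(SUBJID = c(1, 2), WT = c(79.6, 72.4))
# Assumption: a placeholder string mimics the `spec` attribute from readr.
attr(original, "spec") <- "placeholder standing in for a readr col_spec"

rebuilt <- tibble(SUBJID = c(1, 2), WT = c(79.6, 72.4))

# Attribute names present on `original` but missing from `rebuilt`.
extra_attrs <- setdiff(names(attributes(original)), names(attributes(rebuilt)))
print(extra_attrs)

# Column-by-column comparison of the values only.
same_values <- all(mapply(identical, original, rebuilt))
print(same_values)

# identical() also compares attributes, so an extra attribute is enough
# to make it report a difference.
print(identical(original, rebuilt))
```

If the column values match but identical() still disagrees, the difference lies in the attributes, which is what the all.equal message above hints at.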

I have added the dput output of the original df and of the nested/unnested df.

> dput(concentration_original)
structure(list(SUBJID = c(1,1,2,2),WT = c(79.6,79.6,72.4,72.4),DOSE = c(4.02,4.02,4.4,4.4),TIME = c(0,0.25,0.57,1.12,2.02,3.82,0.27,0.52,1.92,3.5),CONC = c(0.74,2.84,6.57,10.5,9.66,8.58,1.72,7.91,8.31,8.33,6.85)),spec = structure(list(cols = list(SUBJID = structure(list(),class = c("collector_double","collector")),WT = structure(list(),DOSE = structure(list(),TIME = structure(list(),CONC = structure(list(),"collector"))),default = structure(list(),class = c("collector_guess",skip = 1),class = "col_spec"),row.names = c(NA,-12L),class = c("tbl_df","tbl","data.frame"))
> dput(concentration_unnested)
structure(list(SUBJID = c(1,"data.frame"))
> 

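Additional information: I think I found the problem. The spec attribute of the original tibble holds information from when the tibble was created with read_csv. When the tibble goes through nest/unnest, the spec attribute is dropped. Another thread mentions that the spec attribute can get out of sync with the tibble's contents: "Remove attributes from data read in readr::read_csv". In that case, they suggest removing the spec attribute: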

attr(df,'spec') <- NULL
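
As a rough illustration of that suggestion, the sketch below strips the attribute from a copy before comparing. It uses a small hypothetical tibble rather than the real concentration data, and the `spec` attribute is a hand-made placeholder (an assumption, not a real col_spec). Whether the final comparison succeeds in your session depends on `spec` being the only difference, so no output is shown.

```r
library(tibble)
library(tidyr)

# Hypothetical miniature of the data in the question.
df <- tibble(
  SUBJID = c(1, 1, 2, 2),
  WT     = c(79.6, 79.6, 72.4, 72.4),
  TIME   = c(0, 0.25, 0, 0.27),
  CONC   = c(0.74, 2.84, 0, 1.72)
)
# Assumption: a placeholder string mimics the `spec` attribute from readr.
attr(df, "spec") <- "placeholder standing in for a readr col_spec"

# Nest and immediately unnest, as in the question.
round_trip <- unnest(nest(df, data = c(TIME, CONC)), cols = c(data))

# Work on a copy so the original object keeps its attribute.
df_clean <- df
attr(df_clean, "spec") <- NULL

# Compare the round-tripped tibble against both versions.
print(identical(round_trip, df))
print(identical(round_trip, df_clean))
```

Stripping the attribute from a copy, rather than from df itself, leaves the original object untouched in case the column specification is still needed elsewhere.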

Answer

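From what I found, your original data frame is not identical to the output because the original carries a spec attribute of class col_spec, while the output does not.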

Using the newer waldo package (from the tidyverse developers; it needs to be loaded with library(waldo)), I ran the following:

compare(df,df %>% nest(data = c(TIME,CONC)) %>% unnest(cols = c(data)))
`attr(old,'spec')` is an S3 object of class <col_spec>
`attr(new,'spec')` is absent

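It appears that you read the data with readr, so the resulting df has a spec attribute of class col_spec. Nesting the original df drops this attribute.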

attr(df %>% nest(data = c(TIME,CONC)),'spec')
NULL

So when you unnest, the resulting df is not identical to the original.