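Problem description
The example below is a simple one that tries to assert that column y is always positive (y > 0). How can I extract the erroneous data (row 3, which has a negative value, perhaps into a data frame or any other convenient object) while still allowing the workflow to continue with the "cleaned" data?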
library(tidyverse)
library(assertr)
#> Warning: package 'assertr' was built under R version 4.0.5
test = tribble(
  ~x, ~y,
  1, 4,
  3, 8,
  1, -2,
  3, 1
)
test %>%
  assert(within_bounds(0,100),y) %>%
  group_by(x) %>%
  summarise(avg = mean(y))
#> Column 'y' violates assertion 'within_bounds(0,100)' 1 time
#>     verb redux_fn            predicate column index value
#> 1 assert       NA within_bounds(0,100)      y     3    -2
#> Error: assertr stopped execution
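Created on 2021-04-12 by the reprex package (v0.3.0)
Update
After trying @Sirius's solution, it worked the first time I ran it, but every subsequent attempt failed with the message shown in the reprex below: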
library(tidyverse)
library(assertr)
#> Warning: package 'assertr' was built under R version 4.0.5
test = tribble(
  ~x, ~y,
  1, 4,
  3, 8,
  1, -2,
  3, 1,
  5, 81
)
my_error_fun = function( ... ) {
  args = list(...)
  do.call( just_warn,args )
  bad.data = args[[1]][[1]]$error_df
  these.Failed.rows = args$data %>%
    slice( bad.data$index )
  if(!exists("my.Failed.rows",inherits=TRUE)) {
    my.Failed.rows = NULL
  }
  my.Failed.rows = rbind( my.Failed.rows,these.Failed.rows )
  assign( "my.Failed.rows",my.Failed.rows,envir=parent.frame(n=3) )
  good.rows = args$data %>% slice(-bad.data$index)
  return(good.rows)
}
my.result <- test %>%
  assert(within_bounds(0,100),y,error_fun = my_error_fun) %>%
  group_by(x) %>%
  summarise(avg = mean(y))
#> Column 'y' violates assertion 'within_bounds(0,100)' 1 time
#>     verb redux_fn            predicate column index value
#> 1 assert       NA within_bounds(0,100)      y     3    -2
#> Warning: assertr encountered errors
#> `summarise()` ungrouping output (override with `.groups` argument)
print(my.result)
#> # A tibble: 3 x 2
#>       x   avg
#>   <dbl> <dbl>
#> 1     1   4
#> 2     3   4.5
#> 3     5  81
print(my.Failed.rows)
#> Error in print(my.Failed.rows): object 'my.Failed.rows' not found
Created on 2021-05-02 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Windows 10 x64
#> system x86_64,mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz Africa/Nairobi
#> date 2021-05-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertr * 2.8 2021-01-25 [1] CRAN (R 4.0.5)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.9 2020-08-24 [1] CRAN (R 4.0.2)
#> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.2)
#> broom 0.7.0 2020-07-09 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
#> dbplyr 1.4.4 2020-05-27 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
#> haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#> jsonlite 1.7.1 2020-09-07 [1] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.3)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
#> readr * 1.3.1 2018-12-21 [1] CRAN (R 4.0.2)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.3)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.2)
#> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.4)
#> rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble * 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
#> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.3)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.2)
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] C:/Users/tnkil/OneDrive/Documents/R/win-library/4.0
#> [2] C:/Program Files/R/R-4.0.2/library
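Solution
This is tricky, and the answer below does not solve the problem 100%. assertr lets you handle errors/stops in a number of different ways; just look at ?error_stop (which is the default).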
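As a hedged illustration of one of those built-in handlers: error_df_return, listed on the ?error_stop help page, returns the table of violations instead of halting. The sketch below uses its index column to split the data by hand. It uses a plain data.frame with the same values as the question, and the names violations, bad_rows and good_rows are only illustrative. It also assumes at least one violation occurs, because with the default success function assert() returns the data itself when nothing fails.

```r
library(assertr)

# Same values as the question's example, as a plain data.frame.
test <- data.frame(x = c(1, 3, 1, 3), y = c(4, 8, -2, 1))

# error_df_return gives back the violation table (one row per offending value)
# instead of stopping. Assumption: at least one row fails the check.
violations <- assert(test, within_bounds(0, 100), y, error_fun = error_df_return)

# Split the data using the row indices reported by assertr.
bad_rows  <- test[violations$index, , drop = FALSE]
good_rows <- test[-violations$index, , drop = FALSE]

print(bad_rows)
print(good_rows)
```

This breaks the pipeline into two steps, which is why a custom handler is still useful when the chain has to keep going.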
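You need not only to filter out the failing rows, but also to collect them (all of them) for later inspection.
Below I have written my own error handler. It extracts the failing rows, filters them out, and stores them in the global environment under the variable my.failed.rows.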
library(tidyverse)
library(assertr)
#> Warning: package 'assertr' was built under R version 4.0.5
test = tribble(
  ~x, ~y,
  1, 4,
  3, 8,
  1, -2,
  3, 1
)
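# Custom error handler: assertr calls it with the list of errors and data = <the data>.
my_error_fun <- function( ... ) {
  args <- list(...)
  # Print the usual violation summary as a warning instead of stopping.
  do.call( just_warn,args )
  # Violation table (predicate, column, index, value) for the first column with violations.
  bad.data <- args[[1]][[1]]$error_df
  these.failed.rows <- args$data %>% slice( bad.data$index )
  # Start a fresh collection if none exists yet.
  if(!exists("my.failed.rows",inherits=TRUE)) {
    my.failed.rows <- NULL
  }
  # Append the new failures and write the collection back out of the handler.
  my.failed.rows <- rbind( my.failed.rows,these.failed.rows )
  assign( "my.failed.rows",my.failed.rows,envir=parent.frame(n=3) )
  # Hand the rows that passed on to the rest of the pipeline.
  good.rows <- args$data %>% slice( -bad.data$index )
  return( good.rows )
}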
my.result <- test %>%
  assert(within_bounds(0,100),y,error_fun = my_error_fun ) %>%
  group_by(x) %>%
  summarise(avg = mean(y))
print(my.result)
print(my.failed.rows)
Output:
> print(my.result)
# A tibble: 2 x 2
      x   avg
  <dbl> <dbl>
1     1   4
2     3   4.5
> print(my.failed.rows)
# A tibble: 1 x 2
      x     y
  <dbl> <dbl>
1     1    -2
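On the next run it will keep appending to my.failed.rows, so truncating or deleting it once you have finished inspecting it is up to you. I have not figured out a way to automate this; it would essentially require a way to detect that a new chain of dplyr operations is starting.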
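One possible variation, offered as a sketch rather than a tested fix: parent.frame(n=3) points at whatever frame happens to sit three levels up the call stack, which need not be the global environment when the code runs through reprex or knitr, and that may be why my.failed.rows was not found in the update above. Keeping the collected rows in a dedicated environment avoids depending on stack depth, and an explicit reset before each chain replaces the manual truncation. The names failed_store, reset_failed_rows and collect_failed_rows are hypothetical helpers, not part of assertr; the handler signature (errors, data) follows the one used by just_warn.

```r
library(assertr)

# Hypothetical helpers (not part of assertr): a dedicated environment holds the
# collected rows, so nothing depends on how deep the call stack is.
failed_store <- new.env()
assign("rows", NULL, envir = failed_store)

# Call this before each new chain to start with an empty collection.
reset_failed_rows <- function() {
  assign("rows", NULL, envir = failed_store)
  invisible(NULL)
}

# Custom error handler with the same (errors, data) signature as just_warn.
collect_failed_rows <- function(errors, data = NULL) {
  # Gather the violating row indices across every column that failed.
  bad_index <- unique(as.integer(unlist(lapply(errors, function(e) e$error_df$index))))
  # Append the failing rows to whatever has been collected so far.
  collected <- rbind(get("rows", envir = failed_store),
                     data[bad_index, , drop = FALSE])
  assign("rows", collected, envir = failed_store)
  message("Set aside ", length(bad_index), " failing row(s)")
  # Return only the rows that passed so the chain can continue.
  data[-bad_index, , drop = FALSE]
}

# Same values as the example above, as a plain data.frame.
test <- data.frame(x = c(1, 3, 1, 3), y = c(4, 8, -2, 1))

reset_failed_rows()
clean <- assert(test, within_bounds(0, 100), y, error_fun = collect_failed_rows)
avg <- aggregate(y ~ x, data = clean, FUN = mean)

print(avg)
print(failed_store$rows)
```

The same handler can be passed as error_fun inside a %>% chain. Calling reset_failed_rows() is still a manual step, so this does not detect the start of a new chain automatically either.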