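Problem description
I have a data frame containing different IDs and observations on consecutive days. If an ID has no data for several consecutive days, I want to remove it.
I used the diff(days) function to show the differences between days, but I can only make that work for a single ID.
My df looks like this: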
ani_id_year days
1 ID468_2006 2006-04-01
2 ID468_2006 2006-04-02
3 ID468_2006 2006-04-03
4 ID468_2006 2006-04-04
5 ID468_2006 2006-04-05
6 ID599_2006 2006-03-06
7 ID599_2006 2006-03-14
8 ID599_2006 2006-03-15
9 ID599_2006 2006-03-16
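So I can see that there is a 7-day gap in ID599_2006, and I would like to remove that ID automatically if gap == 7. Since I have hundreds of IDs, I cannot do this by hand.
Maybe you can help me. Thank you very much!
Best, Christian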
Solution
If you want to remove all entries of every ID that contains such a gap, here is one approach.
library(tidyverse)
# Example data from the question: 5 rows for ID468_2006, 4 rows for ID599_2006
df <- structure(list(ani_id_year = c("ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID599_2006","ID599_2006","ID599_2006","ID599_2006"),days = c("2006-04-01","2006-04-02","2006-04-03","2006-04-04","2006-04-05","2006-03-06","2006-03-14","2006-03-15","2006-03-16")),row.names = c(NA,-9L),class = c("tbl_df","tbl","data.frame"))
data <- as_tibble(df) %>%
  mutate(days = as.Date(days))
data %>% group_by(ani_id_year) %>%
  # days since the previous observation of the same ID (NA for the first row)
  mutate(difference = as.numeric(days - lag(days))) %>%
  # keep an ID only if its largest gap is at most 7 days
  mutate(to_delete = ifelse(max(difference,na.rm = TRUE) <= 7,"keep","remove")) %>%
  filter(to_delete == "keep")
#> # A tibble: 5 x 4
#> # Groups: ani_id_year [1]
#> ani_id_year days difference to_delete
#> <chr> <date> <dbl> <chr>
#> 1 ID468_2006 2006-04-01 NA keep
#> 2 ID468_2006 2006-04-02 1 keep
#> 3 ID468_2006 2006-04-03 1 keep
#> 4 ID468_2006 2006-04-04 1 keep
#> 5 ID468_2006 2006-04-05 1 keep
Created on 2020-08-18 by the reprex package (v0.3.0)
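Before filtering, it can help to look at the largest gap per ID. The following is a minimal sketch rather than part of the original answer: the name gap_summary is made up here, and it assumes dplyr 1.0.0 or later (for the .groups argument) and at least two rows per ID, since max() of an empty diff() returns -Inf with a warning.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  ani_id_year = rep(c("ID468_2006", "ID599_2006"), times = c(5, 4)),
  days = as.Date(c("2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05", "2006-03-06", "2006-03-14", "2006-03-15",
                   "2006-03-16"))
)

# Largest difference (in days) between consecutive observations of each ID.
# sort() guards against rows that are not in date order within an ID.
gap_summary <- df %>%
  group_by(ani_id_year) %>%
  summarise(max_gap = max(as.numeric(diff(sort(days)))), .groups = "drop")

print(gap_summary)
```

For the example data, the largest gap in ID599_2006 is 8 days (2006-03-06 to 2006-03-14) and in ID468_2006 it is 1 day, which is why a threshold of 7 keeps only ID468_2006.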
1. base solution
subset(df,!ani_id_year %in% ani_id_year[c(FALSE,diff(days) > 7)])
2. dplyr solution
library(dplyr)
- Option 1
df %>% filter(!ani_id_year %in% ani_id_year[c(FALSE,diff(days) > 7)])
- Option 2
df %>% group_by(ani_id_year) %>% filter(!any(diff(days) > 7))
Output
# ani_id_year days
# 1 ID468_2006 2006-04-01
# 2 ID468_2006 2006-04-02
# 3 ID468_2006 2006-04-03
# 4 ID468_2006 2006-04-04
# 5 ID468_2006 2006-04-05
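Data
df <- structure(list(ani_id_year = c("ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID599_2006","ID599_2006","ID599_2006","ID599_2006"),days = structure(c(13239,13240,13241,13242,13243,13213,13221,13222,13223),class = "Date")),class = "data.frame",row.names = c(NA,-9L))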
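One thing to watch with the ungrouped variants (the base solution and dplyr option 1): diff(days) runs over the whole column, so the jump between the last row of one ID and the first row of the next is also compared against the threshold. The sketch below illustrates this with the same observations in a different row order. The reordered data frame df2 and the names ungrouped, has_gap and grouped are assumptions made here for illustration, and ave() is swapped in as one possible grouped base R alternative, not something the original answer used.

```r
# Same observations as in the question, but with the ID599_2006 rows first
# (this ordering is an assumption made for illustration).
df2 <- data.frame(
  ani_id_year = rep(c("ID599_2006", "ID468_2006"), times = c(4, 5)),
  days = as.Date(c("2006-03-06", "2006-03-14", "2006-03-15", "2006-03-16",
                   "2006-04-01", "2006-04-02", "2006-04-03", "2006-04-04",
                   "2006-04-05"))
)

# Ungrouped diff(): the jump from 2006-03-16 to 2006-04-01 is treated as a
# gap, so the first ID468_2006 row is flagged and both IDs are dropped.
ungrouped <- subset(df2, !ani_id_year %in% ani_id_year[c(FALSE, diff(days) > 7)])

# Grouped check with ave(): diff() is evaluated separately within each ID.
# ave() returns a numeric vector, so TRUE/FALSE come back as 1/0.
has_gap <- ave(as.numeric(df2$days), df2$ani_id_year,
               FUN = function(d) any(diff(d) > 7))
grouped <- df2[has_gap == 0, ]

nrow(ungrouped)  # 0 rows: ID468_2006 was removed as well
nrow(grouped)    # 5 rows: only ID468_2006 remains
```

With the row order shown in the question the ungrouped and grouped versions agree, because the date goes backwards at the ID boundary. The grouped forms (dplyr option 2, or ave() as sketched here) do not depend on that.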
With data.table
library(data.table)
setDT(df)[,.SD[!any(diff(days) > 7)],by = ani_id_year]
# ani_id_year days
#1: ID468_2006 2006-04-01
#2: ID468_2006 2006-04-02
#3: ID468_2006 2006-04-03
#4: ID468_2006 2006-04-04
#5: ID468_2006 2006-04-05
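Data
df <- structure(list(ani_id_year = c("ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID468_2006","ID599_2006","ID599_2006","ID599_2006","ID599_2006"),days = structure(c(13239,13240,13241,13242,13243,13213,13221,13222,13223),class = "Date")),class = "data.frame",row.names = c(NA,-9L))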