How can I make R flag a specific pattern (a column value changing from 1 to 0) every time it occurs within a group of observations?

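Problem description

The reprex below simulates my data: for each person, I have different values of 'res' at different times. I need an indicator variable ('flag') that tells me each time 'res' changes from 1 to 0 within a given person. I want 'flag' = 1 at the first (and only the first) 'res' = 0 that follows a 'res' = 1. Finally, I want to count the number of times 'flag' = 1 for each person.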

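My code has two problems:

1. It flags every 'res' = 0 that comes after a 'res' = 1 (but I need 'flag' = 1 only at the first 'res' = 0).
2. Counting the number of times 'flag' = 1 does not work.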

Note: the last 'res_next_time' is inevitably NA. By definition in my data, I will never have 'flag' = 1 there, so it is fine for it to default to 0.

Thanks for your help!

#Load packages
library(Hmisc)
#> Loading required package: lattice
#> Loading required package: survival
#> Loading required package: Formula
#> Loading required package: ggplot2
#> 
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:base':
#> 
#>     format.pval,units
library(dplyr)
#> Warning: package 'dplyr' was built under R version 4.0.4
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:Hmisc':
#> 
#>     src,summarize
#> The following objects are masked from 'package:stats':
#> 
#>     filter,lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect,setdiff,setequal,union
library(tidyr)
#> Warning: package 'tidyr' was built under R version 4.0.5

#Create data set
person <- c(1,1,1,2,3,3,3,3,3,3)
time <- c(1,2,3,1,2,1,2,3,4,5)
res <- c(1,0,1,1,1,0,0,1,0,1)

#Populate data frame
d <- cbind(person,time,res)
d <- as.data.frame(d)

#Create new variable equal to 'res' at the person's next time point
d$res_next_time <- Lag(d$res,-1)

#Group times by person
d %>% 
  group_by(person) %>% 
#Create a new variable 'flag' = 1 when a person's 'res' changes from 1 to 0,and 'flag' = 0 otherwise
  mutate(flag = case_when(res_next_time < 1 ~ 1,TRUE ~ 0)) %>%
#Because 'flag'= 1 is at the time of 'res'= 1 before 'res'= 0,we lag it to have 'flag' = 1 at 'res' = 0
  mutate(flag_res0 = Lag(flag,+1)) %>%
#Replace the NAs in 'flag_res0' with 0
  replace_na(list(flag_res0 = 0)) %>%
  #mutate(flag_res0 = as.numeric(flag_res0 & cumsum(flag_res0) <= 1)) %>%
#Count number of flags per person
  mutate(mig_freq = sum(flag_res0)) %>%
#Limit the data to only include the final indicator
  select('person','time','res','flag_res0')
#> # A tibble: 10 x 4
#> # Groups:   person [3]
#>    person  time   res flag_res0
#>     <dbl> <dbl> <dbl>     <dbl>
#>  1      1     1     1         0
#>  2      1     2     0         1
#>  3      1     3     1         0
#>  4      2     1     1         0
#>  5      3     2     1         0
#>  6      3     1     0         1
#>  7      3     2     0         1
#>  8      3     3     1         0
#>  9      3     4     0         1
#> 10      3     5     1         0

Created on 2021-04-15 by the reprex package (v0.3.0)

Solutions

My solution does not need the column res_next_time. I think @Paul PR's is more concise, though.

# using your data d
d %>% 
  group_by(person) %>% 
  mutate(flag2 = if_else(lag(res) == 1 & res == 0 &  
                           !(duplicated(lag(res) == 1 & res == 0)),1,0))

You can add ungroup() at the end. That may matter, depending on what you do next. The condition is basically "if TRUE and TRUE and not duplicated, then ...".
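
To make the "not duplicated" part concrete, here is a minimal base R sketch. The vector is_drop is a hypothetical stand-in for the result of lag(res) == 1 & res == 0 within one person; it is not taken from the question's data.

```r
# Hypothetical transition indicator for one person:
# TRUE wherever res went from 1 to 0
is_drop <- c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)

# duplicated() marks every repeat of a value already seen, so only the
# first TRUE (and the first FALSE) is reported as not duplicated
dup <- duplicated(is_drop)

# Keep the flag only where the transition is TRUE and has not occurred before
first_drop <- as.numeric(is_drop & !dup)
print(first_drop)
```

The first FALSE is also "not duplicated", but it is dropped again because it is combined with is_drop itself using &.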

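Your comment suggests that you are not looking for the first occurrence, but for any occurrence within the group.

That is actually much simpler.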

(d %>% 
  group_by(person) %>% 
  # flag = 1 whenever the previous res was 1 and the current res is 0
  mutate(flag = if_else(lag(res) == 1 & res == 0,1,0)))

The output looks like this. (I appended a few rows to the end of the example data to show what I mean by repeated occurrences.)

# # A tibble: 13 x 4
# # Groups:   person [3]
#    person  time   res  flag
#     <dbl> <dbl> <dbl> <dbl>
#  1      1     1     1     0
#  2      1     2     0     1
#  3      1     3     1     0
#  4      2     1     1     0
#  5      3     2     1     0
#  6      3     1     0     1
#  7      3     2     0     0
#  8      3     3     1     0
#  9      3     4     0     1
# 10      3     5     1     0
# 11      3     6     0     1
# 12      1     7     1     0
# 13      1     8     0     1
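
The question also asks for the number of flags per person, which the code above does not compute. A minimal sketch of one way to do it, assuming the ten-row data printed in the question; the column name n_flags is made up for illustration, and lag(res, default = 0) is my substitution so that each person's first row compares against 0 instead of NA.

```r
library(dplyr)

# Example data as printed in the question's output
d <- data.frame(
  person = c(1, 1, 1, 2, 3, 3, 3, 3, 3, 3),
  time   = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 5),
  res    = c(1, 0, 1, 1, 1, 0, 0, 1, 0, 1)
)

counts <- d %>%
  group_by(person) %>%
  # default = 0: a person's first row has no previous value, treat it as 0
  mutate(flag = if_else(lag(res, default = 0) == 1 & res == 0, 1, 0)) %>%
  # n_flags is a hypothetical column name: one row per person
  summarise(n_flags = sum(flag))

print(counts)
```

If the per-person count should stay next to every row instead of collapsing to one row per person, mutate(n_flags = sum(flag)) in place of summarise() would be the analogous call.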

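Here is a solution that solves the problem in two steps:

1. Use dplyr's lag function to compute the previous value of res, rather than the next value of res. We do this on a grouped data frame, so the first value of res_last_time is NA for each person.
2. Use cumsum on the grouped data frame to keep only the first flag = 1 for each person.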

d %>% 
    group_by(person) %>% 
    mutate(res_last_time = lag(res,1)) %>%
    mutate(flag = res == 0 & res_last_time == 1) %>%
    mutate(flag = as.numeric(flag & cumsum(flag) <= 1))

Using the same data frame d, this is the result I get:

#> # A tibble: 10 x 5
#> # Groups:   person [3]
#>    person  time   res res_last_time  flag
#>     <dbl> <dbl> <dbl>         <dbl> <dbl>
#>  1      1     1     1            NA     0
#>  2      1     2     0             1     1
#>  3      1     3     1             0     0
#>  4      2     1     1            NA     0
#>  5      3     2     1            NA     0
#>  6      3     1     0             1     1
#>  7      3     2     0             0     0
#>  8      3     3     1             0     0
#>  9      3     4     0             1     0
#> 10      3     5     1             0     0

Created on 2021-04-15 by the reprex package (v1.0.0)
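
One caveat with the cumsum step: if a person's first row has res = 0, then res == 0 & res_last_time == 1 evaluates to NA on that row, and cumsum carries the NA through the rest of that person's rows. This does not happen in the question's data, where every person starts with res = 1. A minimal sketch of one way around it, assuming a made-up data frame d2 and using lag(res, default = 0) in place of the plain lag:

```r
library(dplyr)

# Hypothetical data: this person starts at res = 0, unlike the question's data
d2 <- data.frame(
  person = c(1, 1, 1, 1, 1),
  time   = c(1, 2, 3, 4, 5),
  res    = c(0, 1, 0, 1, 0)
)

out <- d2 %>%
  group_by(person) %>%
  # default = 0 avoids an NA in the first row of each person
  mutate(res_last_time = lag(res, default = 0)) %>%
  mutate(flag = res == 0 & res_last_time == 1) %>%
  # cumsum(flag) counts transitions so far; keep only the first one
  mutate(flag = as.numeric(flag & cumsum(flag) <= 1)) %>%
  ungroup()

print(out)
```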
