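R: Check and count strings in a vector within group_by, taking the order in which the strings occur into account

Question

The data has the following format, and I have to group it by Date. For convenience, I show the dates as numbers.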

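# One message per date; the "Stop" on Date 20 is the row mentioned in the edit note
Msg <- c("Errors", "Errors", "Start", "Stop", "LostControl", "Stop", "Failed", "Error", "Stop")
Date <- c(11, 11, 12, 14, 19, 20, 21, 22, 22)
data <- data.frame(Msg, Date)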

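I need to count the number of failures in each START-STOP cycle and summarise the counts by date.
The data has three types of messages. Error and Failed are the two types of failure message, while LostControl is not a failure. The condition is that, within a START-STOP cycle, a Failed message must not be preceded by a LostControl message. If it is preceded only by Error, it counts as a failure. Also, if only "Error" messages are found, that does not count as a failure.

Edit: if two "Start" or two "Stop" messages appear in a row in the Msg vector, the START-STOP cycle runs from the outermost Start to the outermost Stop. A START that is not followed by a STOP is ignored.

Edit: one row has been added (Msg = Stop, Date = 20).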

Answer

We can adapt the function I wrote yesterday for your earlier post.

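between_valid_anchors <- function(x, bgn = "Start", end = "Stop") {
  # Work only on the anchor messages, remembering their original positions
  are_anchors <- x %in% c(bgn, end)
  xid <- seq_along(x)
  id <- xid[are_anchors]
  x <- x[are_anchors]
  # Outermost Start: a Start that is the first anchor or directly follows a Stop
  start_pos <- id[which(x == bgn & c("", head(x, -1L)) %in% c("", end))]
  # Outermost Stop: a Stop that is the last anchor or directly precedes a Start
  stop_pos <- id[which(x == end & c(tail(x, -1L), "") %in% c("", bgn))]
  # Ignore any Stop before the first Start and any Start after the last Stop
  stop_pos <- stop_pos[stop_pos > min(start_pos, Inf)]
  start_pos <- start_pos[start_pos < max(stop_pos, -Inf)]
  if (length(start_pos) < 1L || length(stop_pos) < 1L)
    return(logical(length(xid)))
  # TRUE for every position inside a Start-Stop cycle (anchors included)
  xid %in% unlist(mapply(`:`, start_pos, stop_pos, SIMPLIFY = FALSE))
}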
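
The core of the function is the comparison of each anchor with its neighbour: `c("", head(x, -1L))` shifts the anchors one step forward (the previous anchor), and `c(tail(x, -1L), "")` shifts them one step back (the next anchor). A minimal sketch of that step on its own, using a made-up anchor sequence that is not taken from the question's data (the variable names here are only for illustration):

```r
# Hypothetical anchor sequence, i.e. the messages left after dropping non-anchors
anchors <- c("Start", "Start", "Stop", "Stop", "Start", "Stop")

prev_anchor <- c("", head(anchors, -1L))  # anchor before each element, "" at the start
next_anchor <- c(tail(anchors, -1L), "")  # anchor after each element, "" at the end

# A cycle opens at a Start that comes first or right after a Stop
is_first_start <- anchors == "Start" & prev_anchor %in% c("", "Stop")
# A cycle closes at a Stop that comes last or right before a Start
is_last_stop <- anchors == "Stop" & next_anchor %in% c("", "Start")

which(is_first_start)  # 1 5
which(is_last_stop)    # 4 6
```

So the repeated Start at position 2 and the repeated Stop at position 3 are skipped, and the two cycles run over positions 1 to 4 and 5 to 6.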

Then simply:

library(dplyr)

data %>% 
  group_by(Date) %>% 
  filter(between_valid_anchors(Msg)) %>% 
  summarise(Msg = sum(Msg %in% c("Err","Errors","Failed")))

Output

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 6 x 2
   Date   Msg
  <dbl> <int>
1    11     0
2    12     0
3    14     0
4    19     1
5    21     1
6    22     2

Update

You can add one more filter to keep only the messages of interest (i.e. Start, Stop, Failed, LostControl). Then just sum the cases where Msg == "Failed" and lag(Msg) is not "LostControl":

library(dplyr)

data %>% 
  group_by(Date) %>% 
  filter(between_valid_anchors(Msg)) %>% 
  filter(Msg %in% c("Start","Stop","Failed","LostControl")) %>% 
  summarise(Msg = sum(Msg == "Failed" & lag(Msg,default = "") != "LostControl"))

Output

`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 7 x 2
   Date   Msg
  <dbl> <int>
1    11     0
2    12     0
3    14     0
4    19     0
5    20     0
6    21     1
7    22     1
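
To see why the extra filter matters, here is a small sketch on a made-up message vector (not taken from the question's data; `msg`, `kept` and the two counters are names chosen for illustration). It applies the same `lag()` test before and after dropping the messages that are not of interest, assuming all messages belong to one Start-Stop cycle:

```r
library(dplyr)

# Hypothetical messages from a single Start-Stop cycle
msg <- c("Start", "Error", "Failed", "LostControl", "Error", "Failed", "Stop")

# Without the extra filter, both Failed messages directly follow an Error
n_unfiltered <- sum(msg == "Failed" & lag(msg, default = "") != "LostControl")

# Keep only the messages of interest, so that lag() skips over the Error rows
kept <- msg[msg %in% c("Start", "Stop", "Failed", "LostControl")]
n_filtered <- sum(kept == "Failed" & lag(kept, default = "") != "LostControl")

n_unfiltered  # 2
n_filtered    # 1
```

After filtering, the second Failed has LostControl as its immediate predecessor and is no longer counted. Note that the test only looks at the message directly before each Failed among the kept rows, and inside the grouped pipeline `lag()` is evaluated separately for each Date.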