R: lag does not work inside a function [goal: match similar, adjacent rows]

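Problem description

I recently tried to match adjacent identical rows in a data frame based on two variables (Condition1 and Outcome1 below). I have seen people do this across all rows, but not for adjacent rows only, which is why I developed the following three-step workaround (I hope I am not overthinking this):

- I lagged the variables I want to match on.

- I compared each variable with its lagged version.

- I deleted the rows where both comparisons were equal (and dropped the helper columns that were no longer needed).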

Case <- c("Case 1","Case 2","Case 3","Case 4","Case 5")
Condition1 <- c(0,1,0,0,1)
Outcome1 <- c(0,0,0,0,1)
mwa.df <- data.frame(Case,Condition1,Outcome1)

new.df <- mwa.df
Condition_lag <- c(new.df$Condition1[-1],0)
Outcome_lag <- c(new.df$Outcome1[-1],0)
new.df <- cbind(new.df,Condition_lag,Outcome_lag)
new.df$Comp <- 0
new.df$Comp[new.df$Outcome1 == new.df$Outcome_lag & new.df$Condition1 == new.df$Condition_lag] <- 1
new.df <- subset(new.df,Comp == 0)
new.df <- subset(new.df,select = -c(Condition_lag,Outcome_lag,Comp))

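This works fine. However, I have to do this for a large number of data frames, so I tried to wrap it in a function, and there the lag does not work (i.e. the operations condition_lag <- c(new.df$condition[-1],0) and outcome_lag <- c(new.df$outcome[-1],0) are not carried out). The function code is: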

FLC.Dframe <- function(old.df,condition,outcome){
      new.df <- old.df
      condition_lag <- c(new.df$condition[-1],0)
      outcome_lag <- c(new.df$outcome[-1],0)
      new.df <- cbind(new.df,condition_lag,outcome_lag)
      new.df$comp <- 0
      new.df$comp[new.df$outcome == new.df$outcome_lag & new.df$condition == new.df$condition_lag] <- 1
      new.df <- subset(new.df,comp == 0)
      new.df <- subset(new.df,select = -c(condition_lag,outcome_lag,comp))
      return(new.df)
}

To use the function, I wrote new.df <- FLC.Dframe(mwa.df,Outcome1)

Can somebody help me? Many thanks in advance.

Solution

Just generate run-length IDs and drop the duplicates.

with(mwa.df,mwa.df[!duplicated(data.table::rleid(Condition1,Outcome1)),])

Output

    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1
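
To make the idea of a run-length ID concrete, here is a minimal base-R sketch that avoids the data.table dependency. The helper name run_id is hypothetical (it is not part of any package), the data values are the ones shown in the output above, and the sketch assumes the key columns contain no missing values.

```r
# Sample data with the values used in this thread
mwa.df <- data.frame(
  Case = paste("Case", 1:5),
  Condition1 = c(0, 1, 0, 0, 1),
  Outcome1 = c(0, 0, 0, 0, 1)
)

# Hypothetical helper: a base-R stand-in for the run-length ID idea.
# Assumes the columns named in `cols` contain no NA values.
run_id <- function(df, cols) {
  n <- nrow(df)
  if (n == 0) return(integer(0))
  # For each key column, flag rows that differ from the row directly above,
  # then combine the flags: a new run starts if any key column changed.
  changed <- Reduce(`|`, lapply(cols, function(col) {
    df[[col]][-1] != df[[col]][-n]
  }))
  # The first row always starts a run; cumsum turns the flags into IDs.
  cumsum(c(TRUE, changed))
}

ids <- run_id(mwa.df, c("Condition1", "Outcome1"))
ids                          # 1 2 3 3 4
mwa.df[!duplicated(ids), ]   # keeps rows 1, 2, 3 and 5
```

Rows 3 and 4 share the ID 3 because they are adjacent and identical in both columns, so duplicated() drops row 4. Row 1 has the same values but belongs to a different run, so it is kept.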

If you want a function, then

FLC.Dframe <- function(df,cols) df[!duplicated(data.table::rleidv(df[,cols])),]

Call the function like this

> FLC.Dframe(mwa.df,c("Condition1","Outcome1"))

    Case Condition1 Outcome1
1 Case 1          0        0
2 Case 2          1        0
3 Case 3          0        0
5 Case 5          1        1

The main problem with your function is the incorrect use of $. This operator takes its right-hand side literally instead of evaluating it. For example, in new.df$condition the $ operator looks in new.df for a column named "condition", not for "Condition1", which is the value of condition. If you rewrite the function as follows, it will work.

FLC.Dframe <- function(old.df,condition,outcome){
  new.df <- old.df
  condition_lag <- c(new.df[[condition]][-1],0)
  outcome_lag <- c(new.df[[outcome]][-1],0)
  new.df <- cbind(new.df,condition_lag,outcome_lag)
  new.df$comp <- 0
  new.df$comp[new.df[[outcome]] == new.df[["outcome_lag"]] & new.df[[condition]] == new.df[["condition_lag"]]] <- 1
  new.df <- subset(new.df,comp == 0)
  new.df <- subset(new.df,select = -c(condition_lag,outcome_lag,comp))
  return(new.df)
}

You also need to call it like this (note that the column names must be passed as character strings)

FLC.Dframe(mwa.df,"Condition1","Outcome1")
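
The difference between $ and [[ can be seen in isolation. The following is a minimal sketch; the one-column data frame df is made up purely for illustration.

```r
# Illustrative data frame (not from the question)
df <- data.frame(Condition1 = c(0, 1, 0))

# The column name is stored in a variable, as it would be inside a function
condition <- "Condition1"

# `$` uses the literal name "condition"; no such column exists, so this is NULL
df$condition

# `[[` evaluates `condition` first and then looks up the column "Condition1"
df[[condition]]

# This is why the lag silently collapsed in the original function:
# NULL[-1] is still NULL, so the result is just the padding value 0
c(df$condition[-1], 0)
```

Because $ returns NULL instead of raising an error for a missing column, the original function ran without complaint and simply built the lag columns from a single recycled 0.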