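How to vectorize nested loops and update a data frame

Problem description

I have a data frame with a column called Product (with many products), a column called Timestamp (dates represented as a discrete ordinal variable), and a column called Rating. I am trying to compute the moving average and moving standard deviation of Rating for each product, taking the timestamps into account.

The data looks like this: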

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

Now I add columns for the moving average and the moving standard deviation:

DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))

Finally, I use this code with nested for loops to get the desired result:

for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product==product,]$Timestamp){
    if (timestamp==1) {
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        DF[DF$Product==product &
             DF$Timestamp==timestamp,]$Rating
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 0
    }else{
      index_start <- which(DF$Product==product &
                             DF$Timestamp==1)
      index_end <- which(DF$Product==product &
                           DF$Timestamp==timestamp)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        mean(DF[index_start:index_end,]$Rating)
  
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 
        sd(DF[index_start:index_end,]$Rating)
    }
  }
}

The code works correctly, but it is too slow. How can I use vectorization to make it faster?

Answers

If you want to do all of the vectorization in base R, you could try:

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
  do.call(rbind, lapply(seq(nrow(x)), function(y) {
    c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))
  }))
})))
#>    Product Timestamp Rating Moving.avg Moving.sd
#> 1        a         1      4   4.000000        NA
#> 2        a         2      3   3.500000 0.7071068
#> 3        a         3      5   4.000000 1.0000000
#> 4        a         4      3   3.750000 0.9574271
#> 5        b         1      3   3.000000        NA
#> 6        b         2      4   3.500000 0.7071068
#> 7        b         3      5   4.000000 1.0000000
#> 8        c         1      3   3.000000        NA
#> 9        c         2      1   2.000000 1.4142136
#> 10       c         3      1   1.666667 1.1547005
#> 11       c         4      2   1.750000 0.9574271
#> 12       c         5      5   2.400000 1.6733201

Note that the sd of a single number is NA rather than 0. If needed, replacing those NA values with 0 is simple.

Created on 2020-08-31 by the reprex package (v0.3.0)
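The split()/lapply() approach still recomputes mean() and sd() for every row of every product. As a rough sketch of a more fully vectorized base R variation (not the answerer's code), the same cumulative statistics can be derived from running sums with ave() and cumsum(). The sample data here is reconstructed from the printed output, and reporting 0 for a single observation is an assumption that follows the question's loop. The sum-of-squares shortcut can lose precision when ratings are large relative to their spread, so treat it as illustrative.

```r
# Sample data reconstructed from the printed output in this thread
DF <- data.frame(
  Product   = rep(c("a", "b", "c"), times = c(4, 3, 5)),
  Timestamp = c(1:4, 1:3, 1:5),
  Rating    = c(4, 3, 5, 3, 3, 4, 5, 3, 1, 1, 2, 5)
)

# Sort so cumulative statistics follow timestamp order within each product
DF <- DF[order(DF$Product, DF$Timestamp), ]

# Running count, sum, and sum of squares within each product
n  <- ave(DF$Rating, DF$Product, FUN = seq_along)
s  <- ave(DF$Rating, DF$Product, FUN = cumsum)
ss <- ave(DF$Rating^2, DF$Product, FUN = cumsum)

DF$Moving.avg <- s / n

# Sample variance from running sums: (sum(x^2) - sum(x)^2 / n) / (n - 1)
# pmax() guards against tiny negative values caused by rounding
v <- pmax(ss - s^2 / n, 0) / (n - 1)

# Assumption: report 0 (not NA) when only one observation is available
DF$Moving.sd <- ifelse(n == 1, 0, sqrt(v))
```

Every step operates on whole columns, so no explicit loop over rows is needed.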

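I think you are looking for the cumulative mean and the cumulative standard deviation.

For the cumulative mean you can use the cummean function from dplyr, and TTR::runSD for the cumulative standard deviation.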

library(dplyr)

DF %>%
  group_by(Product) %>%
  mutate(cum_avg = cummean(Rating),cum_std = TTR::runSD(Rating,n = 1,cumulative = TRUE))

#  Product Timestamp Rating cum_avg cum_std
#   <chr>       <dbl>  <dbl>   <dbl>   <dbl>
# 1 a               1      4    4    NaN    
# 2 a               2      3    3.5    0.707
# 3 a               3      5    4      1    
# 4 a               4      3    3.75   0.957
# 5 b               1      3    3    NaN    
# 6 b               2      4    3.5    0.707
# 7 b               3      5    4      1    
# 8 c               1      3    3    NaN    
# 9 c               2      1    2      1.41 
#10 c               3      1    1.67   1.15 
#11 c               4      2    1.75   0.957
#12 c               5      5    2.4    1.67
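As a quick cross-check of what these cumulative columns mean, here is a minimal base R sketch for product "a" only. The ratings are copied from the output above; replacing the undefined first value with 0 is an assumption taken from the question's loop, not something cummean or runSD does.

```r
# Ratings for product "a", copied from the printed output
rating_a <- c(4, 3, 5, 3)

# Cumulative mean: running sum divided by running count
cum_avg <- cumsum(rating_a) / seq_along(rating_a)

# Cumulative sd: sd() of the first i ratings, for each i
cum_std <- vapply(
  seq_along(rating_a),
  function(i) sd(rating_a[seq_len(i)]),
  numeric(1)
)

# sd() of a single value is NA; the question's loop uses 0 there instead
cum_std[is.na(cum_std)] <- 0
```

The same is.na() replacement also catches NaN, which is what the first row of each product shows in the tibble above.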

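Does this example work for you? Here I am using the runner() function from the runner package. runner() applies a function over a window that you define, and it works well with dplyr's group_by(). You set the window size with the k argument.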

library(runner)
library(dplyr)
library(magrittr)

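DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

# Sort by timestamp within each product; the result stays grouped by Product
DF <- DF %>%
  group_by(Product) %>%
  arrange(Timestamp, .by_group = TRUE)

# Apply mean() and sd() over a trailing window of up to k = 3 rows per product
DF <- DF %>%
  mutate(
    average = runner(Rating, f = function(x) mean(x), k = 3),
    deviation = runner(Rating, f = function(x) sd(x), k = 3)
  )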

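It is worth mentioning that, for the first rows of each group (each product), the function grows the window until it reaches the size defined by the k argument. So in the first two rows, where there are not yet 3 values available, runner() applies the function to just the rows seen so far.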
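To make the growing-then-sliding window concrete, here is a small base R sketch that imitates the behavior described above without using the runner package. The helper name trailing_window is hypothetical, and the ratings are those of product "c" from the sample data.

```r
# Hypothetical helper (not part of the runner package): applies f to the
# last k values up to and including position i, using fewer values when
# fewer than k are available
trailing_window <- function(x, k, f) {
  vapply(
    seq_along(x),
    function(i) f(x[max(1, i - k + 1):i]),
    numeric(1)
  )
}

rating_c <- c(3, 1, 1, 2, 5)  # ratings for product "c"

window_avg     <- trailing_window(rating_c, k = 3, f = mean)
cumulative_avg <- cumsum(rating_c) / seq_along(rating_c)
```

Under this reading, the two results agree for the first three rows and differ afterwards, because a window of 3 drops older ratings. If the cumulative statistics from the question are what you need, the window would have to be at least as long as the largest group.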

Building on this answer to a related question, you can also do it with dplyr like this:

library(dplyr)
library(purrr)  # provides map_dbl()

DF <- DF %>%
  # Sort by product, then by timestamp within product
  arrange(Product, Timestamp) %>%
  # Group the data by product
  group_by(Product) %>%
  mutate(
    # Cumulative mean of Rating within each product
    Moving.avg = cummean(Rating),
    # Standard deviation of all ratings up to and including each row
    Moving.sd = map_dbl(seq_along(Timestamp), ~ sd(Rating[1:.x])),
    # Set Moving.sd to 0 where Timestamp takes its smallest value
    Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0, TRUE ~ Moving.sd)
  ) %>%
  # Ungroup the data
  ungroup()
