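Problem description
I have a data frame with a column called Product (containing many products), a column called Timestamp (the date, encoded as a discrete ordinal variable), and a column called Rating. I am trying to compute the moving average and moving standard deviation of Rating for each product, taking the timestamp into account.
The data looks like this: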
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
Now I add columns for the moving average and the moving standard deviation:
DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))
Finally, I use this code with nested for loops to get the desired result:
for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product==product,]$Timestamp) {
    if (timestamp==1) {
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <-
        DF[DF$Product==product &
             DF$Timestamp==timestamp,]$Rating
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 0
    } else {
      index_start <- which(DF$Product==product &
                             DF$Timestamp==1)
      index_end <- which(DF$Product==product &
                           DF$Timestamp==timestamp)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <-
        mean(DF[index_start:index_end,]$Rating)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <-
        sd(DF[index_start:index_end,]$Rating)
    }
  }
}
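This code works correctly, but it is too slow. How can I use vectorization to make it faster?
Solution
If you want to do it all in base R, without explicit for loops, you could try: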
DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
# For each product, compute mean and sd over the first y ratings, y = 1..n
cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
  do.call(rbind, lapply(seq(nrow(x)), function(y) {
    c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))
  }))
})))
#>    Product Timestamp Rating Moving.avg Moving.sd
#> 1        a         1      4   4.000000        NA
#> 2        a         2      3   3.500000 0.7071068
#> 3        a         3      5   4.000000 1.0000000
#> 4        a         4      3   3.750000 0.9574271
#> 5        b         1      3   3.000000        NA
#> 6        b         2      4   3.500000 0.7071068
#> 7        b         3      5   4.000000 1.0000000
#> 8        c         1      3   3.000000        NA
#> 9        c         2      1   2.000000 1.4142136
#> 10       c         3      1   1.666667 1.1547005
#> 11       c         4      2   1.750000 0.9574271
#> 12       c         5      5   2.400000 1.6733201
Note, though, that the sd of a single number is NA rather than 0. It is simple to replace these values if needed.
Created on 2020-08-31 by the reprex package (v0.3.0)
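As a further base R option, the cumulative mean and standard deviation can be derived from running sums with cumsum(), applied per product through ave(). This is only a sketch under a few assumptions: the data frame is the one from the question, the helper names cum_mean and cum_sd are made up for illustration, and the sd for a product's first row is set to 0 to match the question's loop.

```r
# Example data as in the question
DF <- data.frame(
  Product   = c("a","a","a","a","b","b","b","c","c","c","c","c"),
  Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
  Rating    = c(4,3,5,3,3,4,5,3,1,1,2,5)
)

# Sort so that each product's rows are in time order
DF <- DF[order(DF$Product, DF$Timestamp), ]

# Hypothetical helper: cumulative mean = running sum / running count
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Hypothetical helper: cumulative sample sd from running sums
cum_sd <- function(x) {
  n <- seq_along(x)
  m <- cumsum(x) / n
  v <- (cumsum(x^2) - n * m^2) / (n - 1)  # n = 1 gives 0/0 = NaN
  v[n == 1] <- 0                          # assumption: sd of a single value is 0
  sqrt(pmax(v, 0))                        # guard against tiny negatives from rounding
}

# ave() applies each helper within the groups defined by Product
DF$Moving.avg <- ave(DF$Rating, DF$Product, FUN = cum_mean)
DF$Moving.sd  <- ave(DF$Rating, DF$Product, FUN = cum_sd)
```

The running-sum formula for the variance can lose precision when ratings are large relative to their spread, so for such data the explicit sd() over each prefix may be the safer choice.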
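I think you are looking for the cumulative mean and the cumulative standard deviation.
For the cumulative mean you can use the cummean function, and TTR::runSD for the cumulative standard deviation.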
library(dplyr)
DF %>%
group_by(Product) %>%
  mutate(cum_avg = cummean(Rating),
         cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))
# Product Timestamp Rating cum_avg cum_std
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 a 1 4 4 NaN
# 2 a 2 3 3.5 0.707
# 3 a 3 5 4 1
# 4 a 4 3 3.75 0.957
# 5 b 1 3 3 NaN
# 6 b 2 4 3.5 0.707
# 7 b 3 5 4 1
# 8 c 1 3 3 NaN
# 9 c 2 1 2 1.41
#10 c 3 1 1.67 1.15
#11 c 4 2 1.75 0.957
#12 c 5 5 2.4 1.67
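Does this example work for you? Here I am using the runner() function from the runner package. runner() applies a function over a window that you define, and it works with dplyr's group_by(). The size of the window is set with the k argument.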
library(runner)
library(dplyr)
library(magrittr)
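DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))
# Group by product and sort by timestamp within each group
DF <- DF %>%
  group_by(Product) %>%
  arrange(Timestamp, .by_group = TRUE)
# Apply mean and sd over a window of up to k = 3 rows
DF <- DF %>%
  mutate(
    average = runner(Rating, f = function(x) mean(x), k = 3),
    deviation = runner(Rating, f = function(x) sd(x), k = 3)
  )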
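It is worth mentioning that the function grows the window over the first rows of each group (each product) in the data.frame until it reaches the size defined by the k argument. So in the first two rows, where there are not yet 3 values available, runner() applies the function only to the rows seen so far.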
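To make the growing window concrete, here is a small base R sketch. The helper rolling_apply is hypothetical and is not how the runner package is implemented. It only mimics the behavior described above for a window of size k, using the ratings of product "c" from the question.

```r
# Hypothetical helper (not part of the runner package): apply f to the last
# k values up to each position, using a shorter window near the start.
rolling_apply <- function(x, f, k) {
  vapply(seq_along(x), function(i) {
    f(x[max(1, i - k + 1):i])
  }, numeric(1))
}

rating_c <- c(3, 1, 1, 2, 5)  # ratings of product "c" in the question

# Windows used: {3}, {3,1}, {3,1,1}, {1,1,2}, {1,2,5}
rolling_apply(rating_c, mean, k = 3)

# The first element is NA because sd() of a single value is NA
rolling_apply(rating_c, sd, k = 3)
```

Unlike the cumulative statistics in the other answers, a fixed k drops older ratings once the window is full, so the two approaches agree only on the first k rows of each product.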
Building on an answer to a related question, you can also do this with dplyr:
library(dplyr)
library(purrr)  # provides map_dbl()
DF <- DF %>%
  # Sort in order of product and then timestamp within product
  arrange(Product, Timestamp) %>%
  # group data by product
  group_by(Product) %>%
  mutate(
    # use the cumulative mean function to calculate the means
    Moving.avg = cummean(Rating),
    # use the map_dbl function to calculate standard deviations up to a certain index value
    Moving.sd = map_dbl(seq_along(Timestamp), ~sd(Rating[1:.x])),
    # replace Moving.sd=0 when Timestamp takes on its smallest value
    Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0, TRUE ~ Moving.sd)
  ) %>%
  # ungroup the data
  ungroup()