Algorithm in R to smooth a vector while preserving rank order

Problem description

I need to write a function that smooths a vector without losing the original rank order of its values. What I have in mind is the following:

#1 Sort all values of the vector in ascending order
#2 For the kth value s_k in the ordered list, collect the 2N+1 values in the window [s_{k-N}, s_{k+N}]
#3 By definition, s_k is the median of the values in that window
#4 Replace s_k with the mean of the values in that same window, for all values of k
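The four steps above can be sketched in base R. This is only a minimal illustration, not a definitive implementation: the function name `smooth_ranked` is hypothetical, and it assumes that windows are truncated at the two ends of the sorted vector (so the first and last N values are averaged over fewer than 2N+1 neighbours) rather than returning missing values there.

```r
# Hypothetical helper: rank-preserving smoothing with half-width N.
# Assumption: windows are clipped at the ends of the sorted vector.
smooth_ranked <- function(x, N) {
  stopifnot(is.numeric(x), length(N) == 1, N >= 0)
  n <- length(x)
  ord <- order(x)                    # step 1: sort ascending
  sorted <- x[ord]
  cs <- c(0, cumsum(sorted))         # prefix sums give cheap window sums
  lo <- pmax(seq_len(n) - N, 1)      # step 2: window bounds, clipped to [1, n]
  hi <- pmin(seq_len(n) + N, n)
  means <- (cs[hi + 1] - cs[lo]) / (hi - lo + 1)  # step 4: window means
  out <- numeric(n)
  out[ord] <- means                  # put results back in the original positions
  out
}

smooth_ranked(c(5, 1, 3, 2, 4), N = 1)
```

Because the input to each window is sorted, moving the window one step to the right never lowers its mean, so the smoothed values keep the original ordering (ties can appear, but no two values swap places).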

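Ideally I would like to write this as a function that relies on dbplyr, since I am working with remote data. That is not strictly necessary, because I can pull the data down in chunks, so base R would work too. It could equally be all PostgreSQL code, or part SQL and part dbplyr; it is all the same to me. There are a few requirements, though. I need to be able to parameterize N, and I need to be able to feed the function a list of data frames or a set of tables (if in a database) to loop over. In R this is simple: a function with the single parameter N inside an lapply wrapper.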

Here is what I have so far for N=3:

library(dplyr)

# Example data
s <- rnorm(1000, mean = 50, sd = 10)
test.in <- as.data.frame(s)
test.in$id <- seq_along(s)

# Non-parameterized attempt
test.out <- test.in %>%
  arrange(s) %>% # step 1: sort ascending
  mutate(lag_k_3 = lag(s, 3), lead_k_3 = lead(s, 3),
         lag_k_2 = lag(s, 2), lead_k_2 = lead(s, 2),
         lag_k_1 = lag(s, 1), lead_k_1 = lead(s, 1)) %>%
  mutate(window_mean = (lag_k_3 + lead_k_3 + lag_k_2 + lead_k_2 + lag_k_1 + lead_k_1 + s) / 7) %>%
  select(id, s, window_mean)

The logical problem with the approach above is that I cannot parameterize N, because each additional value of N requires two more mutate clauses.
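For local data frames, one way around the hard-coded clauses is to accumulate the lag and lead terms in a loop. The sketch below is an illustration under assumptions, not the approach from the answer: the helper name `window_mean_n` is hypothetical, it expects a column called `s`, and it only works on in-memory data (it will not translate to SQL through dbplyr). As with the lag/lead attempt, rows whose window is incomplete get NA.

```r
library(dplyr)

# Hypothetical helper: centered moving mean over the sorted column `s`
# for any half-width N, built from N lag/lead pairs.
window_mean_n <- function(df, N) {
  df <- arrange(df, s)                   # step 1: sort ascending
  total <- df$s                          # the k-th value itself
  for (k in seq_len(N)) {
    total <- total + dplyr::lag(df$s, k) + dplyr::lead(df$s, k)
  }
  df$window_mean <- total / (2 * N + 1)  # NA where the window is incomplete
  df
}

# The same N applied to a list of data frames via lapply
set.seed(1)
dfs <- list(
  data.frame(id = 1:20, s = rnorm(20, mean = 50, sd = 10)),
  data.frame(id = 1:30, s = rnorm(30, mean = 50, sd = 10))
)
smoothed <- lapply(dfs, window_mean_n, N = 3)
```

Note that the result comes back sorted by `s`; the `id` column can be used to restore the original row order if needed.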

Solution

What you are looking for is called a window frame in SQL (I am going by two references on the topic here). In SQL, such a command might look like this:

SELECT Col1,Col2,SUM(Col2) OVER(ORDER BY Col1 ROWS BETWEEN N PRECEDING AND N FOLLOWING) AS window_sum
FROM db.table

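N is the parameter for how many rows before and after the current row to include. The command above therefore produces a moving sum over 2N+1 rows.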
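If you would rather send raw SQL than use dbplyr, N can be substituted into the query text from R. The following is a rough sketch only: `window_mean_sql` is a hypothetical helper, it uses AVG instead of SUM to get a moving mean, and it pastes the table and column names in without quoting or escaping, so it should only be used with trusted identifiers.

```r
# Hypothetical helper: build a moving-mean query for a given half-width N.
# Assumption: `table` and `value_col` are trusted, valid SQL identifiers.
window_mean_sql <- function(table, value_col, N) {
  stopifnot(is.numeric(N), length(N) == 1, N >= 0, N == as.integer(N))
  sprintf(
    paste(
      "SELECT %s, AVG(%s) OVER (ORDER BY %s",
      "ROWS BETWEEN %d PRECEDING AND %d FOLLOWING) AS window_mean",
      "FROM %s"
    ),
    value_col, value_col, value_col, as.integer(N), as.integer(N), table
  )
}

cat(window_mean_sql("db.table", "s", 3))
```

One difference from the lag/lead attempt is worth keeping in mind: a SQL window frame is clipped at the edges of the partition, so the first and last N rows are averaged over fewer than 2N+1 values instead of coming back as NULL.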

In dbplyr, this functionality is provided by window_order and window_frame, both of which are covered in the official dbplyr reference.

Based on their examples, you probably want something like the following:

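library(dplyr)
library(dbplyr)

N <- 3

# test_in must be a remote (lazy) table, not a local data frame
test_out <- test_in %>%
  # optionally add group_by() on a grouping column here to find the moving
  # mean for each group separately; grouping by the unique id would leave
  # one row per window
  window_order(s) %>% # how the data should be sorted (think 'arrange'); often a date
  window_frame(-N, N) %>% # set the width of the window
  mutate(window_mean = mean(s, na.rm = TRUE))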

# check SQL produced
sql_build(test_out)
# or
show_query(test_out)
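To meet the requirement of one N applied across several tables, the pipeline can be wrapped in a function and mapped with lapply. The sketch below is an assumption-laden illustration: `add_window_mean` is a hypothetical name, the column is assumed to be called `s`, and the tables are simulated with dbplyr's `lazy_frame()` and `simulate_postgres()` so the generated SQL can be inspected without a database connection. With a real connection you would pass tables obtained from `tbl()` instead.

```r
library(dplyr)
library(dbplyr)

# Hypothetical wrapper: moving mean of `s` over a window of 2N+1 rows.
add_window_mean <- function(remote_tbl, N) {
  remote_tbl %>%
    window_order(s) %>%
    window_frame(-N, N) %>%
    mutate(window_mean = mean(s, na.rm = TRUE))
}

# Simulated Postgres tables (placeholder data, no connection required)
tables <- list(
  a = lazy_frame(id = 1L, s = 1, con = simulate_postgres()),
  b = lazy_frame(id = 1L, s = 1, con = simulate_postgres())
)

smoothed <- lapply(tables, add_window_mean, N = 3)
show_query(smoothed$a)
```

The frame bounds in the printed query should reflect whatever N was passed in, which makes this a quick way to confirm the parameterization before running it against real data.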

I strongly recommend checking the generated SQL to make sure your R code is doing what you expect.
