Problem description
I am trying to use window functions in PostgreSQL to find the rolling argmax of a column in my database. This is what I have so far:
select *,
       max(case when price = roll_max then row_num end)
           over (partition by roll_max order by s_date) as argmax
from (
    select s_id, s_date, price,
           row_number() over (partition by s_id order by s_date) as row_num,
           max(price) over (partition by s_id order by s_date rows 10 preceding) as roll_max
    from sample_table
) tb1
order by s_date
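The code above is adapted from this answer. I had to add a partition by s_id because there are many different s_id values; the table's unique key is (s_id, s_date). So I need the argmax for each (s_id, s_date) pair, across all available dates.
Here is some sample output (window size 10):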
+-------+--------------+---------+---------+----------+------------------------------------------+
| s_id  | s_date       | price   | row_num | roll_max | argmax                                   |
+-------+--------------+---------+---------+----------+------------------------------------------+
| "ABC" | "2020-06-10" | 322.390 | 1       | 322.390  | 1                                        |
| "ABC" | "2020-06-11" | 312.150 | 2       | 322.390  | 1                                        |
| "ABC" | "2020-06-12" | 309.080 | 3       | 322.390  | 1                                        |
| "ABC" | "2020-06-15" | 308.280 | 4       | 322.390  | 1                                        |
| "ABC" | "2020-06-16" | 315.640 | 5       | 322.390  | 1                                        |
| "ABC" | "2020-06-17" | 314.390 | 6       | 322.390  | 1                                        |
| "ABC" | "2020-06-18" | 312.300 | 7       | 322.390  | 1                                        |
| "ABC" | "2020-06-19" | 314.380 | 8       | 322.390  | 1                                        |
| "ABC" | "2020-06-22" | 311.050 | 9       | 322.390  | 1                                        |
| "ABC" | "2020-06-23" | 314.500 | 10      | 322.390  | 1                                        |
| "ABC" | "2020-06-24" | 310.510 | 11      | 322.390  | 1                                        |
| "ABC" | "2020-06-25" | 307.640 | 12      | 315.640  | NULL /* how to get row_num (5) here? */  |
| "ABC" | "2020-06-26" | 306.390 | 13      | 315.640  | NULL /* how to get row_num (5) here? */  |
| "ABC" | "2020-06-29" | 304.610 | 14      | 315.640  | NULL /* how to get row_num (5) here? */  |
| "ABC" | "2020-06-30" | 310.200 | 15      | 315.640  | NULL /* how to get row_num (5) here? */  |
| "ABC" | "2020-07-01" | 311.890 | 16      | 314.500  | NULL /* how to get row_num (10) here? */ |
| "ABC" | "2020-07-02" | 315.700 | 17      | 315.700  | 17                                       |
| "ABC" | "2020-07-06" | 317.680 | 18      | 317.680  | 18                                       |
+-------+--------------+---------+---------+----------+------------------------------------------+
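I understand that the query above only compares the current row with the rolling maximum and returns the row number when they match. That does not always work, as the table shows: 315.640 is the rolling maximum for rows 12 through 15, but the value comes from an earlier row of the window (row 5), not from the current row.
My question is: in the example above, how do I get the value 5 instead of NULL? That is, for every row, how do I get the row_num of the actual argmax (the row_num of 315.640 is 5)? The argmax row_num can be relative either to the whole table or to each window (window size 10 in this example).
I have looked at other similar questions, but I still cannot get the result I want, because what I need is a rolling argmax rather than the argmax over the whole column of the table.
Can someone suggest a solution? I am also open to using a UDF. I only have basic knowledge of aggregate UDFs, so my approach of keeping the last 10 values in a temporary array and taking its maximum does not seem very efficient (I am not even sure I am doing it right). At this point I am out of ideas :/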
Solution
It is a bit hard to read, but you can do the following:
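- Collect all price values inside the window frame into an array;
- Use array_position to find the position of the rolling maximum price in that array;
- Turn that position into a row number by adding row_number() - 11 to it: rows 10 preceding also includes the current row, so a full frame holds 11 rows and starts at row row_number() - 10;
- Use GREATEST(row_number() - 11, 0) so the offset never goes negative while the frame is still filling up at the start of each partition:
WITH sample_table(s_id, s_date, price) AS (
    VALUES ('ABC', '2020-06-10'::date, 322.390), ('ABC', '2020-06-11'::date, 312.150), ('ABC', '2020-06-12'::date, 309.080),
           ('ABC', '2020-06-15'::date, 308.280), ('ABC', '2020-06-16'::date, 315.640), ('ABC', '2020-06-17'::date, 314.390),
           ('ABC', '2020-06-18'::date, 312.300), ('ABC', '2020-06-19'::date, 314.380), ('ABC', '2020-06-22'::date, 311.050),
           ('ABC', '2020-06-23'::date, 314.500), ('ABC', '2020-06-24'::date, 310.510), ('ABC', '2020-06-25'::date, 307.640),
           ('ABC', '2020-06-26'::date, 306.390), ('ABC', '2020-06-29'::date, 304.610), ('ABC', '2020-06-30'::date, 310.200),
           ('ABC', '2020-07-01'::date, 311.890), ('ABC', '2020-07-02'::date, 315.700), ('ABC', '2020-07-06'::date, 317.680)
)
SELECT s_id, s_date, price,
       row_number() over (PARTITION BY s_id ORDER BY s_date) as row_num,
       max(price) over (partition by s_id order by s_date rows 10 preceding) as roll_max,
       -- rows before the frame start (0 while the frame is filling) + 1-based position of the max inside the frame
       GREATEST(row_number() over (PARTITION BY s_id ORDER BY s_date) - 11, 0)
       + array_position(
           array_agg(price) over (partition by s_id order by s_date rows 10 preceding),
           max(price) over (partition by s_id order by s_date rows 10 preceding)
       ) as argmax
FROM sample_table
ORDER BY s_id, s_date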
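As a sanity check of the offset arithmetic, here is a minimal sketch that evaluates the expression by hand for row 12 of the sample data. Because rows 10 preceding also covers the current row, the frame for row 12 is rows 2 through 12, so the first array element belongs to row 2 and the offset to add to the 1-based array position is 12 - 11 = 1. The array literal was typed in manually from the sample output, and the names w, prices and pos_in_frame are made up for illustration.

```sql
-- Frame for row 12 under "rows 10 preceding": the prices of rows 2..12 (11 values).
SELECT array_position(w.prices, 315.640)                        AS pos_in_frame,  -- 1-based position of the max
       GREATEST(12 - 11, 0) + array_position(w.prices, 315.640) AS argmax         -- offset + position
FROM (
    SELECT ARRAY[312.150, 309.080, 308.280, 315.640, 314.390, 312.300,
                 314.380, 311.050, 314.500, 310.510, 307.640] AS prices
) AS w;
-- pos_in_frame = 4, argmax = 5
```

The result 5 is the row_num the question asks for on row 12. Note that array_position returns the first match, so if the maximum occurs more than once in a frame, the earliest row is reported.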
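Or, with a subquery, which is easier to read:
WITH sample_table(s_id, s_date, price) AS (
    VALUES ('ABC', '2020-06-10'::date, 322.390), ('ABC', '2020-06-11'::date, 312.150), ('ABC', '2020-06-12'::date, 309.080),
           ('ABC', '2020-06-15'::date, 308.280), ('ABC', '2020-06-16'::date, 315.640), ('ABC', '2020-06-17'::date, 314.390),
           ('ABC', '2020-06-18'::date, 312.300), ('ABC', '2020-06-19'::date, 314.380), ('ABC', '2020-06-22'::date, 311.050),
           ('ABC', '2020-06-23'::date, 314.500), ('ABC', '2020-06-24'::date, 310.510), ('ABC', '2020-06-25'::date, 307.640),
           ('ABC', '2020-06-26'::date, 306.390), ('ABC', '2020-06-29'::date, 304.610), ('ABC', '2020-06-30'::date, 310.200),
           ('ABC', '2020-07-01'::date, 311.890), ('ABC', '2020-07-02'::date, 315.700), ('ABC', '2020-07-06'::date, 317.680)
)
SELECT s_id, s_date, price, row_num, roll_max,
       -- rows before the frame start (0 while the frame is filling) + 1-based position of the max inside the frame
       GREATEST(row_num - 11, 0)
       + array_position(
           prices, roll_max
       ) as argmax
FROM (
    SELECT s_id, s_date, price,
           row_number() over (partition by s_id order by s_date) as row_num,
           max(price) over (partition by s_id order by s_date rows 10 preceding) as roll_max,
           array_agg(price)
               over (partition by s_id order by s_date rows 10 preceding) as prices
    FROM sample_table
) as s
ORDER BY s_id, s_date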
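For comparison, here is a sketch of a different technique that is not part of the answer above: a LATERAL subquery that, for each row, looks up the highest-priced row among the current row and the 10 rows before it. The names numbered, t, w and m are made up for illustration, and the range of 10 preceding rows is carried over from the question. On ties it picks the earliest row, which matches what array_position does.

```sql
WITH sample_table(s_id, s_date, price) AS (
    VALUES ('ABC', '2020-06-10'::date, 322.390), ('ABC', '2020-06-11'::date, 312.150), ('ABC', '2020-06-12'::date, 309.080),
           ('ABC', '2020-06-15'::date, 308.280), ('ABC', '2020-06-16'::date, 315.640), ('ABC', '2020-06-17'::date, 314.390),
           ('ABC', '2020-06-18'::date, 312.300), ('ABC', '2020-06-19'::date, 314.380), ('ABC', '2020-06-22'::date, 311.050),
           ('ABC', '2020-06-23'::date, 314.500), ('ABC', '2020-06-24'::date, 310.510), ('ABC', '2020-06-25'::date, 307.640),
           ('ABC', '2020-06-26'::date, 306.390), ('ABC', '2020-06-29'::date, 304.610), ('ABC', '2020-06-30'::date, 310.200),
           ('ABC', '2020-07-01'::date, 311.890), ('ABC', '2020-07-02'::date, 315.700), ('ABC', '2020-07-06'::date, 317.680)
),
numbered AS (
    -- Number the rows per s_id so the window can be expressed as a row_num range.
    SELECT s_id, s_date, price,
           row_number() OVER (PARTITION BY s_id ORDER BY s_date) AS row_num
    FROM sample_table
)
SELECT t.s_id, t.s_date, t.price, t.row_num,
       m.price   AS roll_max,
       m.row_num AS argmax
FROM numbered AS t
CROSS JOIN LATERAL (
    -- Best row among the current row and the 10 rows before it, same s_id.
    SELECT w.price, w.row_num
    FROM numbered AS w
    WHERE w.s_id = t.s_id
      AND w.row_num BETWEEN t.row_num - 10 AND t.row_num
    ORDER BY w.price DESC, w.row_num   -- highest price first, earliest row on ties
    LIMIT 1
) AS m
ORDER BY t.s_id, t.s_date;
```

With the sample data this should report 1 for rows 1 to 11, 5 for rows 12 to 15, 10 for row 16, and 17 and 18 for the last two rows. It rescans up to 11 rows for every output row, so it may be slower than the array approach on large tables.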