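Problem description
I'm looking for an efficient way to query the past n values as an array in ClickHouse for each row, with rows ordered by one column (i.e. Time).
ClickHouse does not support window functions yet (see #1469), so I'm hoping there is a workaround using aggregate functions such as groupArray()?
Table: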
Time | Value
12:11 | 1
12:12 | 2
12:13 | 3
12:14 | 4
12:15 | 5
12:16 | 6
Expected result for window size n=3:
Time | Value
12:13 | [1,2,3]
12:14 | [2,3,4]
12:15 | [3,4,5]
12:16 | [4,5,6]
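What ways/functions does ClickHouse currently offer for efficiently querying a sliding/moving window, and how can I get the desired result?
Edit:
My solution, based on @vladimir's answer: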
select max(Time) as Time, groupArray(Value) as Values
from (
    select
        *,
        rowNumberInAllBlocks() as row_number,
        arrayJoin(range(row_number, row_number + 3)) as window_id
    from (
        /* BEGIN emulate origin dataset */
        select toDateTime(a) as Time, rowNumberInAllBlocks() + 1 as Value
        from (
            select arrayJoin([
                '2020-01-01 12:11:00', '2020-01-01 12:12:00', '2020-01-01 12:13:00',
                '2020-01-01 12:14:00', '2020-01-01 12:15:00', '2020-01-01 12:16:00']) a
        )
        order by Time
        /* END emulate origin dataset */
    )
    order by Time
) s
group by window_id
having length(Values) = 3
order by Time
Note that 3 appears twice in the query and represents the window size n.
Output:
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
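To make the logic of this query explicit outside ClickHouse, here is a minimal Python sketch of the same idea. It is a model for illustration only, not ClickHouse code: the function name sliding_windows and the list-of-tuples input are assumptions made for this example. Each row i is copied into the window ids i .. i+n-1 (the arrayJoin(range(...)) step), the copies are grouped by window id (group by window_id), and only groups holding exactly n values are kept (having length(Values) = n).

```python
from collections import defaultdict


def sliding_windows(rows, n):
    """Model of the query; rows is a list of (time, value) sorted by time."""
    groups = defaultdict(list)
    for row_number, (time, value) in enumerate(rows):
        # arrayJoin(range(row_number, row_number + n)): fan the row out
        # into the n windows that contain it.
        for window_id in range(row_number, row_number + n):
            groups[window_id].append((time, value))

    result = []
    for window_id in sorted(groups):
        members = groups[window_id]
        # having length(Values) = n: drop incomplete windows at both ends.
        if len(members) == n:
            times = [t for t, _ in members]
            values = [v for _, v in members]
            result.append((max(times), values))
    return result


rows = [("12:11", 1), ("12:12", 2), ("12:13", 3),
        ("12:14", 4), ("12:15", 5), ("12:16", 6)]
for time, values in sliding_windows(rows, 3):
    print(time, values)
```

Because every row is copied n times before grouping, the intermediate data grows linearly with the window size, which is worth keeping in mind for large windows.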
Solution
ClickHouse has several data-block-scoped window functions; let's look at neighbor:
SELECT Time, [neighbor(Value, -2), neighbor(Value, -1), neighbor(Value, 0)] Values
FROM (
    /* emulate origin data */
    SELECT toDateTime(data.1) as Time, data.2 as Value
    FROM (
        SELECT arrayJoin([
            ('2020-01-01 12:11:00', 1), ('2020-01-01 12:12:00', 2), ('2020-01-01 12:13:00', 3),
            ('2020-01-01 12:14:00', 4), ('2020-01-01 12:15:00', 5), ('2020-01-01 12:16:00', 6)]) as data)
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [0,0,1] │
│ 2020-01-01 12:12:00 │ [0,1,2] │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
The first two rows are padded with 0, the default that neighbor returns when the offset falls outside the data block.
Another example, using an alternative approach based on repeating each source row window_size times:
SELECT
    arrayReduce('max', arrayMap(x -> x.1, raw_result)) Time,
    arrayMap(x -> x.2, raw_result) Values
FROM (
    SELECT groupArray((Time, Value)) raw_result, max(row_number) max_row_number
    FROM (
        SELECT
            3 AS window_size,
            *,
            rowNumberInAllBlocks() row_number,
            arrayJoin(arrayMap(x -> x + row_number, range(window_size))) window_id
        FROM (
            /* emulate origin dataset */
            SELECT toDateTime(data.1) as Time, data.2 as Value
            FROM (
                SELECT arrayJoin([
                    ('2020-01-01 12:11:00', 1), ('2020-01-01 12:12:00', 2), ('2020-01-01 12:13:00', 3),
                    ('2020-01-01 12:14:00', 4), ('2020-01-01 12:15:00', 5), ('2020-01-01 12:16:00', 6)]) as data)
            ORDER BY Value
        )
    )
    GROUP BY window_id
    HAVING max_row_number = window_id
    ORDER BY window_id
)
/*
┌────────────────Time─┬─Values──┐
│ 2020-01-01 12:11:00 │ [1]     │
│ 2020-01-01 12:12:00 │ [1,2]   │
│ 2020-01-01 12:13:00 │ [1,2,3] │
│ 2020-01-01 12:14:00 │ [2,3,4] │
│ 2020-01-01 12:15:00 │ [3,4,5] │
│ 2020-01-01 12:16:00 │ [4,5,6] │
└─────────────────────┴─────────┘
*/
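The two queries above treat the first n-1 rows differently: the neighbor version pads missing predecessors with a default value, while the row-duplication version with HAVING max_row_number = window_id returns shorter arrays. The following Python sketch models that difference. It illustrates the behaviour only; the function names padded_windows and growing_windows are assumptions made for this example, not ClickHouse functions.

```python
def padded_windows(values, n, default=0):
    """Model of [neighbor(Value, -(n-1)), ..., neighbor(Value, 0)]."""
    result = []
    for i in range(len(values)):
        window = []
        for offset in range(-(n - 1), 1):
            j = i + offset
            # Outside the data, fall back to the default value.
            window.append(values[j] if 0 <= j < len(values) else default)
        result.append(window)
    return result


def growing_windows(values, n):
    """Model of the duplication query: one window ending at each row."""
    return [values[max(0, i - n + 1): i + 1] for i in range(len(values))]


values = [1, 2, 3, 4, 5, 6]
print(padded_windows(values, 3))
print(growing_windows(values, 3))
# Dropping the first n-1 entries of either list leaves only full windows.
print(growing_windows(values, 3)[2:])
```

The model treats the input as a single block. The real neighbor function only sees rows inside the data block currently being processed, so on larger inputs it can also return defaults at block boundaries.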