Problem description
I have a table like the one below, where state takes one of a small, fixed set of values (for example Start, End):
CREATE TABLE event_updates (
event_id Int32,timestamp DateTime,state String
) ENGINE Log;
I would like to be able to run the following query quickly:
SELECT count(*)
FROM (
    SELECT
        event_id,
        minOrNullIf(timestamp, state = 'Start') as start,
        minOrNullIf(timestamp, state = 'End') as end,
        end - start as duration,
        duration < 10 as is_fast,
        duration > 300 as is_slow
FROM event_updates
GROUP BY event_id)
WHERE start >= '2020-08-20 00:00:00'
AND start < '2020-09-20 00:00:00'
AND is_slow;
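However, these queries are slow when there is a lot of data, I assume because the aggregation has to be computed over every row.
Example data: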
┌─event_id─┬───────────timestamp─┬─state─┐
│ 1 │ 2020-08-21 09:58:00 │ Start │
│ 1 │ 2020-08-21 10:18:00 │ End │
│ 2 │ 2020-08-21 10:23:00 │ Start │
│ 2 │ 2020-08-21 10:23:05 │ End │
│ 3 │ 2020-08-21 10:23:00 │ Start │
│ 3 │ 2020-08-21 10:24:00 │ End │
│ 3 │ 2020-08-21 11:24:00 │ End │
│ 4 │ 2020-08-21 10:30:00 │ Start │
└──────────┴─────────────────────┴───────┘
Example query:
SELECT
    event_id,
    minOrNullIf(timestamp, state = 'Start') AS start,
    minOrNullIf(timestamp, state = 'End') AS end,
    end - start AS duration,
    duration < 10 AS is_fast,
    duration > 300 AS is_slow
FROM event_updates
GROUP BY event_id
ORDER BY event_id ASC
┌─event_id─┬───────────────start─┬─────────────────end─┬─duration─┬─is_fast─┬─is_slow─┐
│ 1 │ 2020-08-21 09:58:00 │ 2020-08-21 10:18:00 │ 1200 │ 0 │ 1 │
│ 2 │ 2020-08-21 10:23:00 │ 2020-08-21 10:23:05 │ 5 │ 1 │ 0 │
│ 3 │ 2020-08-21 10:23:00 │ 2020-08-21 10:24:00 │ 60 │ 0 │ 0 │
│ 4 │ 2020-08-21 10:30:00 │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │
└──────────┴─────────────────────┴─────────────────────┴──────────┴─────────┴─────────┘
What I would like to end up with is a precomputed table along these lines:
CREATE TABLE event_stats (
event_id Int32,start Nullable(DateTime),end Nullable(DateTime),duration Nullable(Int32),is_fast Nullable(UInt8),is_slow Nullable(UInt8)
);
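But I don't know how to build this table with a materialized view, or whether there is a better approach.
Solution
To begin with, I would:
- use the MergeTree engine instead of Log, to get the benefit of a sorting key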
CREATE TABLE event_updates (
event_id Int32,timestamp DateTime,state String
) ENGINE MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (timestamp,state);
- constrain the source dataset by applying a WHERE clause on timestamp and state (your query processes the whole dataset)
SELECT count(*)
FROM (
    SELECT
        event_id,
        minOrNullIf(timestamp, state = 'Start') as start,
        minOrNullIf(timestamp, state = 'End') as end,
        end - start as duration,
        duration < 10 as is_fast,
        duration > 300 as is_slow
FROM event_updates
WHERE timestamp >= '2020-08-20 00:00:00' AND timestamp < '2020-09-20 00:00:00'
AND state IN ('Start','End')
GROUP BY event_id
HAVING start >= '2020-08-20 00:00:00' AND start < '2020-09-20 00:00:00'
AND is_slow);
If these steps are not enough, consider using AggregatingMergeTree so that queries operate on precomputed aggregates instead of the raw data.
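A minimal sketch of what the AggregatingMergeTree route could look like. The names event_stats_agg, event_stats_mv, start_ts and end_ts are hypothetical. It also assumes that replacing minOrNullIf with min over a Nullable(DateTime) value is acceptable; the result should be the same, since min skips NULLs.

```sql
-- Hypothetical target table: one partially aggregated row per event_id.
-- SimpleAggregateFunction(min, ...) tells the engine to keep the minimum
-- when it merges rows that share the same sorting key.
CREATE TABLE event_stats_agg (
    event_id Int32,
    start_ts SimpleAggregateFunction(min, Nullable(DateTime)),
    end_ts   SimpleAggregateFunction(min, Nullable(DateTime))
) ENGINE = AggregatingMergeTree()
ORDER BY event_id;

-- Hypothetical materialized view: aggregates every block inserted into
-- event_updates and writes the result into event_stats_agg.
CREATE MATERIALIZED VIEW event_stats_mv TO event_stats_agg AS
SELECT
    event_id,
    min(if(state = 'Start', timestamp, NULL)) AS start_ts,
    min(if(state = 'End', timestamp, NULL))   AS end_ts
FROM event_updates
WHERE state IN ('Start', 'End')
GROUP BY event_id;

-- Background merges are not immediate, so the read side still groups
-- by event_id and re-applies min() to combine rows not yet merged.
SELECT count(*)
FROM (
    SELECT
        event_id,
        min(start_ts) AS start,
        min(end_ts) AS end,
        end - start AS duration,
        duration > 300 AS is_slow
    FROM event_stats_agg
    GROUP BY event_id
    HAVING start >= '2020-08-20 00:00:00' AND start < '2020-09-20 00:00:00'
       AND is_slow);
```

Note that a materialized view only sees rows inserted after it is created, so existing rows in event_updates would need a one-off backfill with an INSERT INTO event_stats_agg SELECT ... using the same aggregation.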