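Optimizing a SQL query that uses GROUP BY and HAVING

Problem description

I have a general question about how to write this kind of query efficiently.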

id	time
d048533c-92d2-11eb-8dbb-fa163e962e00	1617272028623
6b5b455e-92d3-11eb-8dbb-fa163e962e00	1617272279382
024d0a5e-92d3-11eb-8dbb-fa163e962e00	1617272106615

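We have a table like the one above, where time is a timestamp in milliseconds. We want to filter the IDs by the following conditions:

If two or more IDs have times within 3 minutes of each other, we call them a group of two or more.
We have 10000 IDs, and we want to find all groups with at least 10 members.

Here is my attempt: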

SELECT B.ID
FROM TEMP B, TEMP A
WHERE B.ID != A.ID
  AND B.TIME BETWEEN A.TIME - 180000 AND A.TIME + 180000
GROUP BY B.ID
HAVING COUNT(*) >= 9;

Is there a more efficient way?

Solution

If you want every id that is the 10th to occur within a 3-minute span, you can use lag():

select t.*
from (select t.*,lag(time,9) over (order by time) as time_9
      from t
     ) t
where time < time_9 + 3 * 60 * 1000;

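I'm not sure this is exactly what you want. But the key idea is to use window functions in general, and lag() in particular, rather than a self-join.

Performance should be much better: the self-join pairs every row with every other row inside its time range, whereas the window function typically needs only one pass over the rows sorted by time.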
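To see what the lag() query returns on concrete data, here is a minimal sketch that runs it through Python's sqlite3 module. It assumes SQLite 3.25 or later (needed for window functions); the table contents and the id00..id11 names are made up for illustration and are not from the original question.

```python
import sqlite3

# Hypothetical sample data: ten rows spaced 10 s apart, then two
# isolated rows much later. Timestamps are epoch milliseconds.
base = 1617272000000
rows = [(f"id{i:02d}", base + i * 10_000) for i in range(10)]
rows += [("id10", base + 1_000_000), ("id11", base + 2_000_000)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY, time INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# time_9 is the timestamp of the row 9 positions earlier in time order.
# It is NULL for the first nine rows, so they never pass the filter.
query = """
SELECT id, time
FROM (SELECT t.*, LAG(time, 9) OVER (ORDER BY time) AS time_9
      FROM t) AS t
WHERE time < time_9 + 3 * 60 * 1000
ORDER BY time
"""
tenth_ids = [r[0] for r in conn.execute(query)]
print(tenth_ids)  # ['id09']
```

Only id09 is returned: it is the one row whose 9th predecessor lies less than 3 minutes before it. The other nine members of the burst are not reported by this query.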

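EDIT:

If you want all rows that belong to a group of at least 10 rows within 3 minutes, first find the rows that start such a group, then check whether each row falls within the 9 rows following any such starting row: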

with t9 as (
      select t.*,(case when time_9 < time + 3 * 60 * 1000 then 1 else 0 end) as group_start
      from (select t.*,lead(time,9) over (order by time) as time_9
            from t
           ) t
     )
select t9.*
from (select t9.*,sum(group_start) over (order by time rows between 9 preceding and current row) as in_group_flag
      from t9
     ) t9
where in_group_flag > 0
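The query above can be checked the same way. The following is a sketch under the same assumptions (Python's sqlite3 module, SQLite 3.25 or later, invented sample rows); the subquery aliases with_lead and flagged are renamed from the original purely for readability.

```python
import sqlite3

# Hypothetical data: a burst of ten rows 10 s apart, then three
# isolated rows spaced far apart. Timestamps are epoch milliseconds.
base = 1617272000000
rows = [(f"id{i:02d}", base + i * 10_000) for i in range(10)]
rows += [(f"id{10 + k}", base + (k + 1) * 1_000_000) for k in range(3)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT PRIMARY KEY, time INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
WITH t9 AS (
  -- group_start = 1 when the row 9 positions later is under 3 minutes away
  SELECT with_lead.*,
         (CASE WHEN time_9 < time + 3 * 60 * 1000 THEN 1 ELSE 0 END) AS group_start
  FROM (SELECT t.*, LEAD(time, 9) OVER (ORDER BY time) AS time_9
        FROM t) AS with_lead
)
SELECT id
FROM (SELECT t9.*,
             -- a row is in a group if it or any of the 9 rows before it starts one
             SUM(group_start) OVER (ORDER BY time
                                    ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) AS in_group_flag
      FROM t9) AS flagged
WHERE in_group_flag > 0
ORDER BY time
"""
members = [r[0] for r in conn.execute(query)]
print(members)
```

With this data the result is the ten burst rows id00 through id09. The three isolated rows are excluded because no group start lies within the 9 rows before them.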
