Problem description
I have these two queries:
SELECT sp.page_id,u.date_registered
FROM users u
JOIN solved_pages sp ON u.username = sp.solver;
SELECT date_begin::date,(date_begin + '31 day'::interval)::date as date_end
FROM generate_series(timestamp '2020-01-01',timestamp '2021-01-01',interval '1 day') AS date_begin;
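I want to implement some kind of join: for each row's date_begin and date_end, I want to count all page_id values whose date_registered falls between them. Any hints? Thanks in advance :)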
Table users
username | date_registered
------------------------------
user1 | 2020-04-01 20:00:00
user2 | 2020-04-07 21:00:00
user3 | 2020-12-01 14:00:00
Table solved_pages
solver | page_id
------------------------------
user1 | page1
user1 | page2
user1 | page3
user2 | page1
user2 | page2
user3 | page1
Expected output:
date_begin | date_end   | no_solvers
-------------------------------------
2020-01-01 | 2020-02-01 | 0
2020-02-01 | 2020-03-01 | 0
................
2020-04-01 | 2020-05-01 | 2 -> because user1 and user2 registered in that period and both solved page1
Solution
This would look something like:
WITH Registrations AS
(
    SELECT sp.page_id, u.date_registered
    FROM users u
    JOIN solved_pages sp ON u.username = sp.solver
                        AND sp.page_id = 'page1'
), TimeSeries AS
(
    SELECT date_begin::date AS date_begin, (date_begin + '31 day'::interval)::date AS date_end
    FROM generate_series(timestamp '2020-01-01', timestamp '2021-01-01', interval '1 day') AS date_begin
)
-- the LEFT JOIN keeps empty windows; unmatched rows contribute 0 to the sum
SELECT a.date_begin, a.date_end, sum(CASE WHEN b.page_id IS NULL THEN 0 ELSE 1 END) AS no_solvers
FROM TimeSeries a
LEFT JOIN Registrations b
    ON b.date_registered BETWEEN a.date_begin AND a.date_end
GROUP BY a.date_begin, a.date_end
ORDER BY a.date_begin;
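The expected output in the question uses calendar-month windows (2020-01-01 to 2020-02-01, and so on), whereas the query above slides a 31-day window forward one day at a time. If monthly buckets are what is wanted, a minimal sketch could look like the following. The inline users and solved_pages CTEs are there only to make it runnable with the question's sample data (they shadow any real tables of the same name), and the half-open comparison (>= date_begin, < date_end) is an assumption made here so that a registration at exactly midnight on the 1st is not counted in two months.

```sql
WITH users(username, date_registered) AS (
    -- sample rows copied from the question; drop this CTE to use the real table
    VALUES ('user1', timestamp '2020-04-01 20:00:00'),
           ('user2', timestamp '2020-04-07 21:00:00'),
           ('user3', timestamp '2020-12-01 14:00:00')
),
solved_pages(solver, page_id) AS (
    -- sample rows copied from the question; drop this CTE to use the real table
    VALUES ('user1', 'page1'), ('user1', 'page2'), ('user1', 'page3'),
           ('user2', 'page1'), ('user2', 'page2'),
           ('user3', 'page1')
),
registrations AS (
    SELECT sp.page_id, u.date_registered
    FROM users u
    JOIN solved_pages sp ON u.username = sp.solver
    WHERE sp.page_id = 'page1'
),
time_series AS (
    -- one row per calendar month of 2020
    SELECT date_begin::date AS date_begin,
           (date_begin + interval '1 month')::date AS date_end
    FROM generate_series(timestamp '2020-01-01',
                         timestamp '2020-12-01',
                         interval '1 month') AS date_begin
)
SELECT t.date_begin,
       t.date_end,
       count(r.page_id) AS no_solvers  -- count() skips the NULLs produced by the LEFT JOIN
FROM time_series t
LEFT JOIN registrations r
       ON r.date_registered >= t.date_begin
      AND r.date_registered <  t.date_end
GROUP BY t.date_begin, t.date_end
ORDER BY t.date_begin;
```

With the sample data, the row for 2020-04-01 | 2020-05-01 comes out as 2, which matches the example in the question.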
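Generate test data:
BEGIN;
CREATE TABLE foo (id SERIAL PRIMARY KEY, d DATE NOT NULL);
-- 100,000 rows with random dates spread over 2020
INSERT INTO foo (d)
SELECT '2020-01-01'::DATE + '1 DAY'::INTERVAL * ((365 * random())::INTEGER)
FROM generate_series(1, 100000);
COMMIT;
VACUUM ANALYZE foo;
Now, if you only want to count rows in consecutive 31-day intervals, it is much faster and simpler to drop the JOIN and use the fact that date - date returns a number of days: divide that by 31 and group by the result:
SELECT count(*), (d - '2020-01-01') / 31 AS mon FROM foo GROUP BY mon ORDER BY mon;
 count | mon
-------+-----
  8334 |   0
  8535 |   1
  8497 |   2
  8390 |   3
  8529 |   4
  8525 |   5
  8316 |   6
  8486 |   7
  8504 |   8
  8581 |   9
  8553 |  10
  6750 |  11
...but that is not the case here; it looks like you want something like a rolling count. Let's try the join approach:
SELECT dates.*, count(*) FROM
    (SELECT date_begin, date_begin + '31 day'::interval AS date_end FROM
        (SELECT '2020-01-01'::DATE + '1 DAY'::INTERVAL * generate_series(0, 31) AS date_begin) d
    ) dates
JOIN foo ON (foo.d BETWEEN dates.date_begin AND dates.date_end)
GROUP BY date_begin, date_end
ORDER BY date_begin;
This works, but it is slow unless there is an index on the date, and even then it is still slow (about 350 ms). Since the counts have a granularity of one day, it makes sense to reduce the data first by counting the rows per day, and then run the join against that:
SELECT dates.*, sum(cnt) FROM
    (SELECT date_begin, date_begin + '31 day'::interval AS date_end FROM
        (SELECT '2020-01-01'::DATE + '1 DAY'::INTERVAL * generate_series(0, 31) AS date_begin) d
    ) dates
JOIN (SELECT d, count(*) cnt FROM foo
      WHERE d <= '2020-03-03'  -- must reach the last date_end (2020-02-01 + 31 days)
      GROUP BY d) f
ON (f.d BETWEEN dates.date_begin AND dates.date_end)
GROUP BY date_begin, date_end
ORDER BY date_begin;
This gives the same result in much less time (about 10 ms). Note that the WHERE in the subquery limits the rows that get counted; it must cover at least the period spanned by the rolling counts.
It would be neater with a window function, though.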
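The window-function variant mentioned above is not spelled out in the answer, so the following is only a sketch of how it might look, not the author's query. It assumes the foo table from the test data and builds a dense calendar (one row per day, including days with no rows in foo), so that a frame of 31 following rows is the same as 31 following days. The names cal, daily, rolling and rolling_cnt are introduced here for illustration.

```sql
WITH daily AS (
    -- one row per calendar day from 2020-01-01 to 2020-03-03, with its row count (0 if none)
    SELECT cal.d, count(foo.id) AS cnt
    FROM (SELECT date '2020-01-01' + g AS d
          FROM generate_series(0, 62) AS g) AS cal
    LEFT JOIN foo ON foo.d = cal.d
    GROUP BY cal.d
),
rolling AS (
    SELECT d      AS date_begin,
           d + 31 AS date_end,
           -- current day plus the 31 days after it, i.e. the same inclusive
           -- range as BETWEEN date_begin AND date_end in the join version
           sum(cnt) OVER (ORDER BY d
                          ROWS BETWEEN CURRENT ROW AND 31 FOLLOWING) AS rolling_cnt
    FROM daily
)
SELECT date_begin, date_end, rolling_cnt
FROM rolling
WHERE date_begin <= date '2020-02-01'  -- keep only windows that are fully covered by the calendar
ORDER BY date_begin;
```

The outer filter matters: the calendar has to extend to the last date_end, but the rows near the end of it have truncated frames and should not be reported as windows.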