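Problem description
I have a table test with a Batch column and an ID column. I want to count, for each batch, how many IDs have been lost compared with the earliest batch. For example, the query below compares batch 2 with batch 1 and returns the value for batch 2.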
SELECT COUNT(T1.ID) AS LOST_CNT FROM
(SELECT * FROM TEST WHERE BATCH=1)T1
LEFT JOIN (SELECT * FROM TEST WHERE BATCH=2)T2
ON T1.ID=T2.ID WHERE T2.ID IS NULL
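I would like to get lost_cnt for every batch, because the number of batches will grow over time. Something like the query below does not return what I want. (I understand why; I am only including it here as a failed attempt.)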
SELECT A.BATCH, COUNT(DISTINCT CASE WHEN A.ID IS NULL THEN M.ID ELSE NULL END) AS lost_cnt
FROM
(SELECT DISTINCT ID FROM TEST WHERE BATCH=(SELECT MIN(BATCH) FROM TEST)) M
LEFT JOIN TEST A ON M.ID=A.ID
GROUP BY 1;
Is there a way to get what I want?
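Solution
It is not entirely clear what you want to achieve, but I assume you want to find out how many IDs have been lost compared with the first batch. You can filter the table down to the IDs that appear in the first batch, count the distinct IDs in each batch, and subtract that count from the first batch's count.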
with t as (
    select *
    from test
    where id in (
        select id
        from test
        where batch = (select min(batch) from test)
    )
)
select
    batch,
    (select count(distinct id)
     from t
     where batch = (select min(batch) from test)
    ) - count(distinct id) as missing
from t
group by batch
order by batch;
Sample data:
batch id
1 1
1 2
1 3
2 2
2 3
2 4
3 3
3 4
Result:
batch missing
1 0
2 1
3 2
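The query above can be checked against the sample data with a small script. This is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for the real database; the names SAMPLE_ROWS, MISSING_QUERY and missing_per_batch are illustrative only and are not part of the original answer.

```python
import sqlite3

# Sample data from the answer, as (batch, id) pairs.
SAMPLE_ROWS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]

# The answer's query: keep only IDs seen in the first batch, then subtract
# each batch's distinct-ID count from the first batch's distinct-ID count.
MISSING_QUERY = """
with t as (
    select *
    from test
    where id in (
        select id
        from test
        where batch = (select min(batch) from test)
    )
)
select
    batch,
    (select count(distinct id)
     from t
     where batch = (select min(batch) from test)
    ) - count(distinct id) as missing
from t
group by batch
order by batch
"""


def missing_per_batch(rows):
    """Load (batch, id) rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table test (batch integer, id integer)")
        conn.executemany("insert into test (batch, id) values (?, ?)", rows)
        return conn.execute(MISSING_QUERY).fetchall()
    finally:
        conn.close()


print(missing_per_batch(SAMPLE_ROWS))  # [(1, 0), (2, 1), (3, 2)]
```

One thing to keep in mind: a batch that contains none of the first batch's IDs has no rows in t, so it does not appear in the output at all.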
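Alternatively, you can use the LAG analytic function to find each batch's previous batch, and then use NOT EXISTS to find the IDs that are not present in that previous batch, as follows: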
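SELECT T.BATCH, T.ID
FROM ( SELECT A.BATCH, A.ID, B.PREV_BATCH
       FROM YOUR_TABLE A
       JOIN ( -- LAG over the distinct batch values, so that PREV_BATCH is the
              -- preceding batch and not just the preceding row's batch
              SELECT BATCH, LAG(BATCH) OVER (ORDER BY BATCH) AS PREV_BATCH
              FROM ( SELECT DISTINCT BATCH FROM YOUR_TABLE ) D ) B
         ON A.BATCH = B.BATCH ) T
WHERE T.PREV_BATCH IS NOT NULL  -- the first batch has no previous batch to compare with
AND NOT EXISTS (
    SELECT 1
    FROM YOUR_TABLE TT
    WHERE TT.BATCH = T.PREV_BATCH
    AND TT.ID = T.ID)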
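The previous-batch idea can also be turned into a per-batch count. The sketch below is a variation, not the answer's exact query: it counts the IDs of the previous batch that are absent from the current one. It assumes Python's built-in sqlite3 module linked against SQLite 3.25 or later (needed for window functions); the table name test and the function name lost_vs_previous_batch are illustrative.

```python
import sqlite3

# Sample (batch, id) pairs from the first answer.
LAG_SAMPLE_ROWS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]

LOST_VS_PREVIOUS_QUERY = """
WITH batches AS (
    -- LAG over the distinct batch values pairs each batch with its predecessor.
    SELECT batch, LAG(batch) OVER (ORDER BY batch) AS prev_batch
    FROM (SELECT DISTINCT batch FROM test)
)
SELECT b.batch, COUNT(p.id) AS lost_cnt
FROM batches b
JOIN test p ON p.batch = b.prev_batch      -- IDs of the previous batch
WHERE NOT EXISTS (                         -- ...that are gone in this batch
    SELECT 1 FROM test c WHERE c.batch = b.batch AND c.id = p.id
)
GROUP BY b.batch
ORDER BY b.batch
"""


def lost_vs_previous_batch(rows):
    """Return (batch, lost_cnt) for batches that lost at least one ID."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test (batch INTEGER, id INTEGER)")
        conn.executemany("INSERT INTO test (batch, id) VALUES (?, ?)", rows)
        return conn.execute(LOST_VS_PREVIOUS_QUERY).fetchall()
    finally:
        conn.close()


print(lost_vs_previous_batch(LAG_SAMPLE_ROWS))  # [(2, 1), (3, 1)]
```

On the sample data, batch 2 lost ID 1 and batch 3 lost ID 2. Batches that lost nothing, and the first batch, are not listed because of the inner join and the NOT EXISTS filter.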
In Hive, I would use window functions to solve this:
with first_batch as (
    -- IDs of the earliest batch
    select t.id
    from (select t.*, min(batch) over () as min_batch
          from t
         ) t
    where batch = min_batch
)
select t.batch,
       count(fb.id) as num_from_first_batch,
       (n.num_in_first_batch - count(fb.id)) as num_missing_from_first_batch
from t left join
     first_batch fb
     on t.id = fb.id cross join
     -- one-row subquery, so the first batch's size is known even when a batch matches none of its IDs
     (select count(*) as num_in_first_batch from first_batch) n
group by t.batch, n.num_in_first_batch;
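To try the window-function idea without a Hive cluster, here is a minimal sketch written for SQLite rather than Hive, assuming Python's built-in sqlite3 module with SQLite 3.25 or later. It takes the size of the first batch from a one-row cross-joined subquery, so that a batch containing none of the first batch's IDs is still reported. The table name test, the extra batch 4 and the function name missing_vs_first_batch are illustrative assumptions.

```python
import sqlite3

# Sample data from the first answer plus a hypothetical batch 4 that
# shares no ID with the first batch.
WINDOW_SAMPLE_ROWS = [
    (1, 1), (1, 2), (1, 3),
    (2, 2), (2, 3), (2, 4),
    (3, 3), (3, 4),
    (4, 5),
]

MISSING_VS_FIRST_QUERY = """
WITH first_batch AS (
    -- MIN(batch) OVER () tags every row with the earliest batch number.
    SELECT id
    FROM (SELECT id, batch, MIN(batch) OVER () AS min_batch FROM test)
    WHERE batch = min_batch
)
SELECT t.batch,
       n.num_in_first_batch - COUNT(fb.id) AS num_missing
FROM test t
LEFT JOIN first_batch fb ON t.id = fb.id
CROSS JOIN (SELECT COUNT(*) AS num_in_first_batch FROM first_batch) n
GROUP BY t.batch, n.num_in_first_batch
ORDER BY t.batch
"""


def missing_vs_first_batch(rows):
    """Return (batch, num_missing) for every batch in rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test (batch INTEGER, id INTEGER)")
        conn.executemany("INSERT INTO test (batch, id) VALUES (?, ?)", rows)
        return conn.execute(MISSING_VS_FIRST_QUERY).fetchall()
    finally:
        conn.close()


print(missing_vs_first_batch(WINDOW_SAMPLE_ROWS))
# [(1, 0), (2, 1), (3, 2), (4, 3)]
```

The sketch assumes each (batch, id) pair occurs at most once; with duplicate rows the counts would need DISTINCT.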