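SQL: Counting lost values per batch

Problem description

I have a table test with the columns ID and Batch. I want to count how many IDs are lost in each batch compared with the earliest batch. For example, the query below compares batch 2 with batch 1 and returns the lost count for batch 2.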

SELECT COUNT(T1.ID) AS LOST_CNT FROM
(SELECT * FROM TEST WHERE BATCH=1)T1
LEFT JOIN (SELECT * FROM TEST WHERE BATCH=2)T2
ON T1.ID=T2.ID WHERE T2.ID IS NULL

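I would like to get lost_cnt for every batch, because the number of batches will grow over time. Something like the query below does not return what I want. (I understand why; I am only including it here as a failed attempt.)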

SELECT A.BATCH,COUNT(DISTINCT CASE WHEN A.ID IS NULL THEN M.ID ELSE NULL END) AS lost_cnt
FROM
 (SELECT DISTINCT ID FROM TEST WHERE BATCH=(SELECT MIN(BATCH) FROM TEST)) M
LEFT JOIN TEST A ON M.ID=A.ID
GROUP BY 1;

Is there a way to get what I want?

Solutions

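It is not entirely clear what you want to achieve, but I assume you want to find out how many ids have been lost compared with the first batch. You can filter the table down to the ids that appear in the first batch, count the distinct ids in each batch, and subtract that count from the first batch's count.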

with t as (
    select *
    from test
    where id in (
        select id
        from test
        where batch = (select min(batch) from test)
    )
)
select
    batch,(select count(distinct id)
     from t
     where batch = (select min(batch) from test)
    ) - count(distinct id) as missing
from t
group by batch
order by batch;

Sample data:

batch   id
1       1
1       2
1       3
2       2
2       3
2       4
3       3
3       4

Result:

batch   missing
1       0
2       1
3       2
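
To check the query above against the sample data, here is a minimal sketch that runs it in an in-memory SQLite database through Python's standard sqlite3 module. SQLite is only an assumption made for easy testing (the question does not name its database), and the names SAMPLE_ROWS and missing_per_batch are hypothetical helpers introduced for this illustration.

```python
import sqlite3

# Sample data from the answer above, as (batch, id) pairs.
SAMPLE_ROWS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]

# The query from the answer, unchanged apart from whitespace.
MISSING_QUERY = """
with t as (
    select *
    from test
    where id in (
        select id
        from test
        where batch = (select min(batch) from test)
    )
)
select
    batch,
    (select count(distinct id)
     from t
     where batch = (select min(batch) from test)
    ) - count(distinct id) as missing
from t
group by batch
order by batch;
"""


def missing_per_batch(rows):
    """Load (batch, id) rows into a scratch table and run the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table test (batch integer, id integer)")
        conn.executemany("insert into test (batch, id) values (?, ?)", rows)
        return conn.execute(MISSING_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for batch, missing in missing_per_batch(SAMPLE_ROWS):
        print(batch, missing)
```

One thing to keep in mind: because the CTE keeps only rows whose id occurs in the first batch, a later batch that shares no ids with the first batch has no rows left to group and therefore does not appear in the output at all.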

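You can use the LAG analytic function to find the previous batch, and then use NOT EXISTS to find the ids in each batch that were not present in the previous batch, as follows: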

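-- LAG runs over the distinct batch numbers; applied to the raw rows it would
-- often return the same batch, since each batch has several rows.
-- For the first batch PREV_BATCH is NULL, so all of its ids are returned.
SELECT T.BATCH,T.ID
  FROM ( SELECT T.BATCH,T.ID,B.PREV_BATCH
           FROM YOUR_TABLE T
           JOIN ( SELECT BATCH,LAG(BATCH) OVER( ORDER BY BATCH) AS PREV_BATCH
                    FROM ( SELECT DISTINCT BATCH FROM YOUR_TABLE ) D ) B
             ON B.BATCH = T.BATCH ) T
 WHERE NOT EXISTS (
    SELECT 1
      FROM YOUR_TABLE TT
     WHERE TT.BATCH = T.PREV_BATCH
       AND TT.ID = T.ID)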
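
The same LAG plus NOT EXISTS idea can be turned around to list the ids that were lost, that is, ids present in the previous batch but absent from the current one. The sketch below is one possible variant, not the answer's own query. It assumes SQLite 3.25 or later (needed for window functions), uses the question's table name test, and introduces the hypothetical helper lost_since_previous_batch. LAG is applied to the distinct batch numbers so that it returns the preceding batch rather than a neighbouring row of the same batch.

```python
import sqlite3

# Sample (batch, id) pairs, matching the sample data used earlier in the thread.
ROWS = [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (2, 4), (3, 3), (3, 4)]

LOST_QUERY = """
WITH batches AS (
    -- One row per batch, so LAG yields the preceding batch number.
    SELECT batch, LAG(batch) OVER (ORDER BY batch) AS prev_batch
    FROM (SELECT DISTINCT batch FROM test)
)
SELECT b.batch, p.id AS lost_id
FROM batches b
JOIN test p ON p.batch = b.prev_batch      -- ids of the previous batch
WHERE NOT EXISTS (                         -- ...that the current batch lacks
    SELECT 1
    FROM test c
    WHERE c.batch = b.batch
      AND c.id = p.id
)
ORDER BY b.batch, p.id
"""


def lost_since_previous_batch(rows):
    """Return (batch, lost_id) pairs relative to each batch's predecessor."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE test (batch INTEGER, id INTEGER)")
        conn.executemany("INSERT INTO test (batch, id) VALUES (?, ?)", rows)
        return conn.execute(LOST_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(lost_since_previous_batch(ROWS))
```

The earliest batch produces no rows here, because its prev_batch is NULL and the join finds nothing to compare against. Wrapping the query in a COUNT grouped by batch would give a per-batch lost count relative to the previous batch rather than the first one.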

In Hive, I would use window functions to solve this:

-- t stands for the table; assumes each (batch, id) pair appears only once
with first_batch as (
      -- ids belonging to the earliest batch
      select t.id
      from (select t.*,min(batch) over () as min_batch
            from t
           ) t
      where t.batch = t.min_batch
     )
select t.batch,
       count(fb.id) as num_from_first_batch,
       max(n.num_in_first_batch) - count(fb.id) as num_missing_from_first_batch
from t cross join
     (select count(*) as num_in_first_batch from first_batch) n left join
     first_batch fb
     on t.id = fb.id
group by t.batch;