Problem description
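I can usually search and find a solution that fits my situation, but I haven't seen any gaps-and-islands question that matches my case.
I have an SCD Dim table that contains Type 2 Project data. When a Type 2 attribute changes, the existing project record is closed and a new project record is created. Closing a project record populates its RowEndDateTime column with the current date/time and sets its RowIsCurrent flag to 0. A new record is then created from the closed one, using the closed record's RowEndDateTime value as its own RowStartDateTime value, with the RowIsCurrent flag set to 1.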
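To make the close-and-reopen behaviour concrete, here is a minimal T-SQL sketch of one Type 2 change. It is not the actual load process (which the post says runs through an ADF pipeline/data flow); the table #DimProject, the ProjectName attribute, and the variable names are hypothetical and exist only for illustration.

```sql
-- Hypothetical stand-in for the real dimension table.
DROP TABLE IF EXISTS #DimProject;
CREATE TABLE #DimProject (
    ProjectKey       int IDENTITY(1,1) PRIMARY KEY,
    ProjectID        varchar(20)  NOT NULL,
    ProjectName      varchar(100) NOT NULL,   -- hypothetical Type 2 attribute
    RowStartDateTime datetime2(3) NOT NULL,
    RowEndDateTime   datetime2(3) NULL,
    RowIsCurrent     int          NOT NULL
);

-- Initial version of the project: starts at 1900-01-01 and is still open.
INSERT INTO #DimProject (ProjectID, ProjectName, RowStartDateTime, RowEndDateTime, RowIsCurrent)
VALUES ('131789', 'Original name', '1900-01-01', NULL, 1);

DECLARE @ProjectID      varchar(20)  = '131789';
DECLARE @NewProjectName varchar(100) = 'Renamed project';
-- Capture the timestamp once so the closed row's end and the new row's start match exactly.
DECLARE @Now            datetime2(3) = SYSDATETIME();

BEGIN TRANSACTION;

-- Close the current version.
UPDATE #DimProject
SET RowEndDateTime = @Now,
    RowIsCurrent   = 0
WHERE ProjectID = @ProjectID
  AND RowIsCurrent = 1;

-- Open the new version; it starts exactly where the old one ended.
INSERT INTO #DimProject (ProjectID, ProjectName, RowStartDateTime, RowEndDateTime, RowIsCurrent)
VALUES (@ProjectID, @NewProjectName, @Now, NULL, 1);

COMMIT TRANSACTION;

SELECT * FROM #DimProject ORDER BY RowStartDateTime;
```

Reusing a single @Now value for both statements is what produces the contiguous RowEndDateTime/RowStartDateTime pairs that the rest of the question relies on.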
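I have just discovered that the table contains some bad records that should not be there. I have no confirmed root cause, although I suspect records were not closed properly while a related table, one that had no ADF pipeline/data flow of its own, was being repaired. Either way, I need to identify and delete the invalid rows, and update any other tables that reference the invalid ProjectKeys so that they use the correct ProjectKeys.
I have put together a query that almost does what I need, but it does not work correctly when more than one invalid row sits between a pair of valid records.
Here is the test data: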
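drop table if exists #Temp;
create table #Temp (PK int,ProjectID varchar(20),RowStartDateTime datetime2(3),RowEndDateTime datetime2(3),RowIsCurrent int,RowNum int);
insert into #Temp
select *,ROW_NUMBER() OVER(partition by ProjectID order by RowStartDateTime,isnull(RowEndDateTime,'2099-12-31')) as RowNum
from (select 596538 as PK,'131789' as ProjectID,'1900-01-01 00:00:00.000' as RowStartDateTime,'2020-05-06 07:14:21.451' as RowEndDateTime,0 as RowIsCurrent union all
select 601293,'131789','2020-05-05 07:14:40.828','2020-05-22 07:07:00.083',0 union all
select 601424,'131789','2020-05-06 07:14:21.451','2020-05-22 07:07:00.083',0 union all
select 603545,'131789','2020-05-22 07:07:00.083','2020-05-22 07:07:00.083',0 union all
select 603546,'131789','2020-05-22 07:07:00.083',NULL,1 union all
select 601443,'192105','2020-05-06 07:14:21.451',NULL,1 union all
select 601300,'192105','2020-05-05 07:14:40.828','2020-05-05 07:14:40.828',0 union all
select 484832,'192105','2020-02-11 09:45:15.112','2020-05-06 07:14:21.451',0 union all
select 483736,'192105','2020-01-31 07:48:21.447','2020-02-11 09:45:15.112',0 union all
select 482418,'192105','1900-01-01 00:00:00.000','2020-01-31 07:48:21.447',0 union all
select 662565,'201427','2020-08-25 09:34:57.674',NULL,1 union all
select 641261,'201427','2020-07-26 08:36:18.325','2020-08-25 09:34:57.674',0 union all
select 620787,'201427','2020-07-25 08:41:00.695','2020-07-26 08:36:18.325',0 union all
select 601433,'201427','2020-05-06 07:14:21.451','2020-07-25 08:41:00.695',0 union all
select 601295,'201427','2020-05-05 07:14:40.828','2020-05-05 07:14:40.828',0 union all
select 601292,'201427','1900-01-01 00:00:00.000','2020-05-06 07:14:21.451',0 union all
select 601445,'202248','2020-05-06 07:14:21.451',NULL,1 union all
select 601401,'202248','2020-04-30 00:04:32.000','2020-05-06 07:14:21.451',0 union all
select 601298,'202248','2020-05-05 07:14:40.828','2020-05-05 07:14:40.828',0 union all
select 601297,'202248','2020-05-05 07:14:40.828','2020-05-05 07:14:40.828',0 union all
select 597910,'202248','2020-04-19 08:14:52.111','2020-04-30 00:04:32.000',0 union all
select 587915,'202248','1900-01-01 00:00:00.000','2020-04-19 08:14:52.111',0) vals;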
This is my current query:
select *,
       case when RowStartDateTime = '1900-01-01' then 'Keep 1'
            else case when RowStartDateTime = RowEndDateTime then 'Delete 1'
                      else case when RowStartDateTime = LAG(RowEndDateTime,1) OVER (PARTITION BY ProjectID ORDER BY RowStartDateTime)
                                 and LAG(RowEndDateTime,1) OVER (PARTITION BY ProjectID ORDER BY RowStartDateTime) !=
                                     LAG(RowStartDateTime,1) OVER (PARTITION BY ProjectID ORDER BY RowStartDateTime) then 'Keep 2'
                                when RowStartDateTime = LAG(RowEndDateTime,2) OVER (PARTITION BY ProjectID ORDER BY RowStartDateTime) then 'Keep 3'
                                else 'Delete 2'
                           end
                 end
       end AS KeepOrDeleteRow
from #Temp
order by ProjectID,RowNum desc;
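You can see that the initial project record has a RowStartDateTime of 1900-01-01, while the current record has a NULL RowEndDateTime and RowIsCurrent = 1. All valid records should have contiguous RowStart and RowEnd date values, for example: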
PK     ProjectID RowStartDateTime        RowEndDateTime          RowIsCurrent RowNum
====== ========= ======================= ======================= ============ ======
601445 202248    2020-05-06 07:14:21.451 NULL                    1            4
601401 202248    2020-04-30 00:04:32.000 2020-05-06 07:14:21.451 0            3
597910 202248    2020-04-19 08:14:52.111 2020-04-30 00:04:32.000 0            2
587915 202248    1900-01-01 00:00:00.000 2020-04-19 08:14:52.111 0            1
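The contiguity rule can also be written as a check. The sketch below, run against the #Temp test table, lists every row whose RowEndDateTime does not match the RowStartDateTime of the next row in its project, plus any last row that is not an open current record. It only locates breaks in the chain; it does not decide which of the rows around a break is the invalid one. The CTE name Ordered and the column NextRowStart are illustrative.

```sql
WITH Ordered AS (
    SELECT PK, ProjectID, RowStartDateTime, RowEndDateTime, RowIsCurrent,
           -- Start of the following row within the same project (NULL for the last row).
           LEAD(RowStartDateTime) OVER (
               PARTITION BY ProjectID
               ORDER BY RowStartDateTime, ISNULL(RowEndDateTime, '9999-12-31')
           ) AS NextRowStart
    FROM #Temp
)
SELECT *
FROM Ordered
WHERE -- A row followed by another must end exactly where the next one starts.
      (NextRowStart IS NOT NULL
       AND (RowEndDateTime IS NULL OR RowEndDateTime <> NextRowStart))
      -- The last row of a project must be the open, current record.
   OR (NextRowStart IS NULL
       AND (RowEndDateTime IS NOT NULL OR RowIsCurrent <> 1))
ORDER BY ProjectID, RowStartDateTime;
```

For a clean history such as the four rows shown above, this query returns no rows, so it may be useful as a sanity check after any cleanup.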
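The problem is that when several invalid records sit between valid ones, the KeepOrDeleteRow logic fails, because the LAG functions use hard-coded offsets (1 and 2). If you look at the records for ProjectID 202248 at the bottom of the results below, you will see that RowNums 1-5 are classified correctly, but RowNum 6 should be a "Keep". The results are: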
PK     ProjectID RowStartDateTime        RowEndDateTime          RowIsCurrent RowNum KeepOrDeleteRow
====== ========= ======================= ======================= ============ ====== ===============
603546 131789    2020-05-22 07:07:00.083 NULL                    1            5      Keep 3
603545 131789    2020-05-22 07:07:00.083 2020-05-22 07:07:00.083 0            4      Delete 1
601424 131789    2020-05-06 07:14:21.451 2020-05-22 07:07:00.083 0            3      Keep 3
601293 131789    2020-05-05 07:14:40.828 2020-05-22 07:07:00.083 0            2      Delete 2
596538 131789    1900-01-01 00:00:00.000 2020-05-06 07:14:21.451 0            1      Keep 1
601443 192105    2020-05-06 07:14:21.451 NULL                    1            5      Keep 3
601300 192105    2020-05-05 07:14:40.828 2020-05-05 07:14:40.828 0            4      Delete 1
484832 192105    2020-02-11 09:45:15.112 2020-05-06 07:14:21.451 0            3      Keep 2
483736 192105    2020-01-31 07:48:21.447 2020-02-11 09:45:15.112 0            2      Keep 2
482418 192105    1900-01-01 00:00:00.000 2020-01-31 07:48:21.447 0            1      Keep 1
662565 201427    2020-08-25 09:34:57.674 NULL                    1            6      Keep 2
641261 201427    2020-07-26 08:36:18.325 2020-08-25 09:34:57.674 0            5      Keep 2
620787 201427    2020-07-25 08:41:00.695 2020-07-26 08:36:18.325 0            4      Keep 2
601433 201427    2020-05-06 07:14:21.451 2020-07-25 08:41:00.695 0            3      Keep 3
601295 201427    2020-05-05 07:14:40.828 2020-05-05 07:14:40.828 0            2      Delete 1
601292 201427    1900-01-01 00:00:00.000 2020-05-06 07:14:21.451 0            1      Keep 1
601445 202248    2020-05-06 07:14:21.451 NULL                    1            6      Delete 2
601298 202248    2020-05-05 07:14:40.828 2020-05-05 07:14:40.828 0            5      Delete 1
601297 202248    2020-05-05 07:14:40.828 2020-05-05 07:14:40.828 0            4      Delete 1
601401 202248    2020-04-30 00:04:32.000 2020-05-06 07:14:21.451 0            3      Keep 2
597910 202248    2020-04-19 08:14:52.111 2020-04-30 00:04:32.000 0            2      Keep 2
587915 202248    1900-01-01 00:00:00.000 2020-04-19 08:14:52.111 0            1      Keep 1
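I am hoping someone can offer a more elegant, dynamic solution that works without hard-coded offsets.
The question is: what is a better (and accurate) way to get the results I need?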
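One possible direction, offered as a sketch rather than a confirmed fix, is to stop counting rows with LAG and instead walk each project's chain of records with a recursive CTE: start from the 1900-01-01 row and repeatedly follow the row whose RowStartDateTime equals the previous valid row's RowEndDateTime. The sketch assumes that every project has exactly one row starting at 1900-01-01, that zero-length rows (RowStartDateTime = RowEndDateTime) are always invalid, and that at most one non-zero-length row starts at any given timestamp. The CTE name Chain is illustrative.

```sql
WITH Chain AS (
    -- Anchor: the initial record of each project.
    SELECT t.PK, t.ProjectID, t.RowStartDateTime, t.RowEndDateTime
    FROM #Temp AS t
    WHERE t.RowStartDateTime = '1900-01-01'

    UNION ALL

    -- Recursive step: the row that starts exactly where the previous valid row ended.
    -- Zero-length rows are skipped, so each step moves strictly forward in time
    -- and the recursion stops when it reaches a row with a NULL RowEndDateTime.
    SELECT n.PK, n.ProjectID, n.RowStartDateTime, n.RowEndDateTime
    FROM Chain AS c
    JOIN #Temp AS n
      ON  n.ProjectID = c.ProjectID
      AND n.RowStartDateTime = c.RowEndDateTime
      AND (n.RowEndDateTime IS NULL OR n.RowEndDateTime > n.RowStartDateTime)
)
SELECT t.*,
       CASE WHEN c.PK IS NOT NULL THEN 'Keep' ELSE 'Delete' END AS KeepOrDeleteRow
FROM #Temp AS t
LEFT JOIN Chain AS c
  ON c.PK = t.PK
ORDER BY t.ProjectID, t.RowNum DESC
-- Lift the default limit of 100 recursion levels for projects with long histories.
OPTION (MAXRECURSION 0);
```

On the test data above, this marks PKs 601293, 603545, 601300, 601295, 601297 and 601298 as Delete and every other row as Keep, including RowNum 6 of ProjectID 202248. Because the walk follows matching timestamps rather than row offsets, the number of invalid rows between two valid ones does not matter. If the third assumption does not hold, both competing rows would be kept and a tie-breaking rule would be needed.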