Optimizing a query that counts the working days between two dates

Problem

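I have a table with two columns, start_date and end_date, and I need to calculate the total amount of overtime. I created a new calendar table that records the working-day status of each date.

Table: WORKDAYS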

id                  status
2020-01-01          4
2020-01-02          1
2020-01-03          1
2020-01-04          2

4: holiday, 1: working day, 2: weekend

I created a function to count the working days between two dates (excluding weekends and holidays).

create or replace function get_workday_count (start_date in date,end_date in date)
return number is
    day_count int;
begin
    select count(0) into day_count from WORKDAYS
    where Trunc(ID) >= Trunc(start_date)
    and Trunc(ID) <= Trunc(end_date)
    and status in (1,3,5);
    return day_count;
end;

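When I run the following query it takes about 5 minutes to return a result. The erp_sj table has about 200,000 rows.

select count(0) from ERP_SJ where GET_WORKDAY_COUNT(start_date,end_date) > 5;

The columns used in the query are indexed.

How can I optimize this? Or is there a better solution?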

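Solution

First, optimize your function:
1. Add pragma udf (for faster execution when called from SQL).
2. Add the deterministic clause (for caching).
3. Replace count(0) with count(*) (lets the CBO optimize the count).
4. Replace return number with return int.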

create or replace function get_workday_count (start_date in date,end_date in date)
return int deterministic is
    pragma udf;
   day_count int;
begin
    select count(*) into day_count from WORKDAYS w
    where w.ID >= TRUNC(start_date)
    and w.ID <= TRUNC(end_date)
    and status in (1,3,5);
    return day_count;
end;

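Then, you do not need to call the function at all when (end_date - start_date) is too small to contain more than 5 working days: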

select count(*) 
from ERP_SJ 
where 
case 
   when trunc(end_date) - trunc(start_date) > 5 
      then GET_WORKDAY_COUNT(trunc(start_date),trunc(end_date)) 
   else 0
 end > 5

Or use a subquery:

select count(*) 
from ERP_SJ e
where 
case 
   when trunc(end_date) - trunc(start_date) > 5 
      then (select count(*) from WORKDAYS w
    where w.ID >= TRUNC(e.start_date)
    and w.ID <= TRUNC(e.end_date)
    and w.status in (1,5)) 
   else 0
 end > 5
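
The same count can also be written without any function call or CASE expression, as a plain join with GROUP BY and HAVING. This is only a sketch: it assumes that erp_sj has a unique id column (as in the test table defined further down) and that working days are the rows with status 1 or 5, as in the subquery above.

```sql
-- Sketch: count ERP_SJ rows that span more than 5 working days,
-- using a join instead of a per-row function call.
-- Assumes erp_sj.id is unique and WORKDAYS.ID has no time component.
select count(*)
from  ( select e.id
        from   erp_sj e
               join workdays w
                 on w.id between trunc(e.start_date) and trunc(e.end_date)
        where  w.status in (1,5)
        group  by e.id
        having count(*) > 5 );
```

Rows with no working days in their range drop out of the inner join, which is harmless here because they could never satisfy the "> 5" condition anyway.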

The WORKDAY_STATUSES table (included for completeness; it is not used below):

create table workday_statuses
( status number(1) constraint workday_statuses_pk primary key,status_name varchar2(10) not null constraint workday_status_name_uk unique );

insert all
    into workday_statuses values (1,'Weekday')
    into workday_statuses values (2,'Weekend')
    into workday_statuses values (3,'Unknown 1')
    into workday_statuses values (4,'Holiday')
    into workday_statuses values (5,'Unknown 2')
select * from dual;

The WORKDAYS table, with one row for each day of 2020:

create table workdays
( id date constraint workdays_pk primary key,status references workday_statuses not null )
organization index;

insert into workdays (id,status)
select date '2019-12-31' + rownum,case
           when to_char(date '2019-12-31' + rownum,'Dy','nls_language = English') like 'S%' then 2
           when date '2019-12-31' + rownum in
                ( date '2020-01-01',date '2020-04-10',date '2020-04-13',date '2020-05-08',date '2020-05-25',date '2020-08-31',date '2020-12-25',date '2020-12-26',date '2020-12-28' ) then 4
           else 1
       end
from   xmltable('1 to 366')
where  date '2019-12-31' + rownum < date '2021-01-01';

The ERP_SJ table, with 30,000 rows of random data:

create table erp_sj
( id          integer generated always as identity,start_date  date not null,end_date    date not null,filler      varchar2(100) );

insert into erp_sj (start_date,end_date,filler)
select dt,dt + dbms_random.value(0,7),dbms_random.string('x',100)
from   ( select date '2019-12-31' + dbms_random.value(1,366) as dt
         from   xmltable('1 to 30000') );

commit;

The get_workday_count() function:

create or replace function get_workday_count
    ( start_date in date,end_date in date )
    return integer
    deterministic    -- Cache some results
    parallel_enable  -- In case you want to use it in parallel queries
as
    pragma udf;      -- Tell compiler to optimise for SQL
    day_count integer;
begin
    select count(*) into day_count
    from   workdays w
    where  w.id between trunc(start_date) and end_date
    and    w.status in (1,5);

    return day_count;
end;

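Note that you should not truncate w.id, because all of its values already have a time of 00:00:00. (I assumed that if end_date falls somewhere in the middle of a day, you want to count that day, so I did not truncate the end_date parameter.)

Test: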
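
The assumption that every WORKDAYS.ID is already at midnight is easy to verify before relying on it. The following is a minimal sketch; the constraint name workdays_id_midnight_ck is made up for illustration and is not part of the original setup.

```sql
-- Expect 0: every WORKDAYS.ID should already be at 00:00:00
select count(*) as rows_with_time
from   workdays
where  id <> trunc(id);

-- Optional: let the database enforce this from now on
-- (the constraint name is hypothetical)
alter table workdays
    add constraint workdays_id_midnight_ck check (id = trunc(id));
```

If the first query returns anything other than 0, the untruncated comparison on w.id could miss rows, and the data should be cleaned up first.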

select count(*) from erp_sj
where  get_workday_count(start_date,end_date) > 5;

COUNT(*)
--------
    1302

The result is returned after about 1.4 seconds.

Execution plan for the query inside the function:

select count(*)
from   workdays w
where  w.id between trunc(sysdate) and sysdate +10
and    w.status in (1,5);

--------------------------------------------------------------------------------------------
| Id  | Operation          | Name        | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |      1 |        |      1 |00:00:00.01 |       1 |
|   1 |  SORT AGGREGATE    |             |      1 |      1 |      1 |00:00:00.01 |       1 |
|*  2 |   FILTER           |             |      1 |        |      7 |00:00:00.01 |       1 |
|*  3 |    INDEX RANGE SCAN| WORKDAYS_PK |      1 |      7 |      7 |00:00:00.01 |       1 |
--------------------------------------------------------------------------------------------

Now try adding the function as a virtual column and indexing it:

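alter table erp_sj
    add workday_count generated always as (get_workday_count(start_date,end_date)) virtual;

create index erp_sj_workday_count_ix on erp_sj(workday_count);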

select count(*) from erp_sj
where  workday_count > 5;

This returns the same result in 0.035 seconds. Plan:

-------------------------------------------------------------------------------------------------------
| Id  | Operation         | Name                    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |                         |      1 |        |      1 |00:00:00.01 |       5 |
|   1 |  SORT AGGREGATE   |                         |      1 |      1 |      1 |00:00:00.01 |       5 |
|*  2 |   INDEX RANGE SCAN| ERP_SJ_WORKDAY_COUNT_IX |      1 |   1302 |   1302 |00:00:00.01 |       5 |
-------------------------------------------------------------------------------------------------------

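Tested on 19.0.0.

Edit: As Sayan pointed out, the index on the virtual column is not updated automatically when anything changes in WORKDAYS, so this approach carries a risk of wrong results. However, if performance is critical, you could work around that by rebuilding the index on ERP_SJ every time WORKDAYS is updated. Perhaps you could do this in a statement-level trigger on WORKDAYS, or, if updates are rare and ERP_SJ is not too large for an index rebuild, through a scheduled IT maintenance process. If the index is partitioned, rebuilding only the affected partitions may be an option.

Or, skip the index and live with the 1.4-second query execution time.

I understand that there are indexes on both the ID and status columns (and not a function-based index on TRUNC(ID)). So use this query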

SELECT count(0)
  INTO day_count
  FROM WORKDAYS
 WHERE ID BETWEEN TRUNC(start_date) AND TRUNC(end_date)
   AND status in (1,5);

so that the index on the date column ID can also be used.

You could also try scalar subquery caching

(useful if erp_sj has many records with the same start_date and end_date)

select count(0) from ERP_SJ where
 (select GET_WORKDAY_COUNT(start_date,end_date) from dual) > 5
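
Whether scalar subquery caching is likely to help depends on how often the same (start_date, end_date) pair repeats. A rough way to gauge that, sketched here against the erp_sj table from the question, is to compare the total row count with the number of distinct pairs:

```sql
-- Sketch: the further distinct_pairs falls below total_rows,
-- the more calls the scalar subquery cache can potentially avoid.
select ( select count(*) from erp_sj ) as total_rows,
       ( select count(*)
         from  ( select distinct start_date, end_date from erp_sj ) ) as distinct_pairs
from   dual;
```

If the two numbers are nearly equal (for example, because the dates carry random time components), the cache will see few hits and this variant is unlikely to be faster.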

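You are dealing with a data warehouse query (not an OLTP query).

Some best practices suggest that you should

get rid of the function - avoid context switches (the UDF pragma can mitigate this somewhat, but why use a function if you do not need one?)
get rid of the indexes - they are fast for a few rows but slow for a large number of records
get rid of caching - caching is basically a workaround for repeating the same work

So the data warehouse approach to this problem consists of two steps

Extend the workday table

With a small query, the workday table can be extended by a new column MIN_END_DAY, which defines for each (start) day the minimum end day at which the limit of 5 working days is reached.

The query uses the LEAD analytic function to get the 4th working day ahead (check the PARTITION BY clause, which separates working days from the other days).

For non-working days, you simply use LAST_VALUE to take the value of the next working day.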

Example

with wd as (
select ID,STATUS,case when status in (1,5) then
lead(id,4) over (partition by case when status in (1,5) then 'workday' end order by id)  /* 4 working days ahead */
end as min_stop_day
from workdays),wd2 as (
select ID,STATUS,last_value(MIN_STOP_DAY) ignore nulls over (order by id desc) MIN_END_DAY
from wd)
select ID,STATUS,MIN_END_DAY 
from wd2
order by 1;

ID                   STATUS  MIN_END_DAY
01.01.2020 00:00:00  4       08.01.2020 00:00:00
02.01.2020 00:00:00  1       08.01.2020 00:00:00
03.01.2020 00:00:00  1       09.01.2020 00:00:00
04.01.2020 00:00:00  2       10.01.2020 00:00:00
05.01.2020 00:00:00  2       10.01.2020 00:00:00
06.01.2020 00:00:00  1       10.01.2020 00:00:00
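Join to the base table

Now you can simply join the base table to the extended workday table on the start date, and filter the rows by comparing the end date with MIN_END_DAY.

Query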

with wd as (
select ID,STATUS,case when status in (1,5) then
lead(id,4) over (partition by case when status in (1,5) then 'workday' end order by id) 
end as min_stop_day
from workdays),wd2 as (
select ID,last_value(MIN_STOP_DAY) ignore nulls over (order by id desc) MIN_END_DAY
from wd)
select count(*) from erp_sj 
join wd2
on trunc(erp_sj.start_date) = wd2.ID
where trunc(end_date) >= min_end_day

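For large tables, this leads to the expected HASH JOIN execution plan.

Note that I assume that 1) the workday table is complete (otherwise you cannot use an inner join), and 2) it contains enough future data (the last 5 rows are obviously not usable).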
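
If the calendar changes rarely, the threshold could also be stored once instead of being recomputed in every query. The following is only a sketch of that idea: adding MIN_END_DAY as a physical column of WORKDAYS and filling it with MERGE is an assumption of this example, not something the query above requires.

```sql
-- Sketch: materialize MIN_END_DAY in the calendar table.
-- The column is an assumption of this example.
alter table workdays add (min_end_day date);

merge into workdays t
using ( select id,
               last_value(min_stop_day) ignore nulls
                   over (order by id desc) as min_end_day
        from  ( select id,
                       case when status in (1,5) then
                           lead(id,4) over (partition by case when status in (1,5) then 'workday' end
                                            order by id)
                       end as min_stop_day
                from   workdays ) ) s
on    (t.id = s.id)
when matched then update set t.min_end_day = s.min_end_day;

commit;

-- The counting query then becomes a plain join:
select count(*)
from   erp_sj e
       join workdays w on w.id = trunc(e.start_date)
where  trunc(e.end_date) >= w.min_end_day;
```

The MERGE has to be rerun whenever statuses in WORKDAYS change or new dates are added, otherwise the stored thresholds go stale.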
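
Both assumptions can be checked with simple queries before trusting the join. This is a minimal sketch using the table and column names from the question:

```sql
-- 1) Gaps in the calendar: the two numbers should be equal
select count(*)              as actual_days,
       max(id) - min(id) + 1 as expected_days
from   workdays;

-- 2) ERP_SJ rows whose start date has no calendar row
--    (the inner join would silently drop them; expect 0)
select count(*) as unmatched_rows
from   erp_sj e
where  not exists ( select null
                    from   workdays w
                    where  w.id = trunc(e.start_date) );
```

Start dates that fall in the last few calendar rows still need attention, because their MIN_END_DAY is null and the comparison in the WHERE clause then filters the row out.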
