Select values that appear twice within given time ranges

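Problem description

I have a dataset in Google Cloud, queried with Standard SQL, that records the date, time, and customer ID of store visits. I want to keep only those customer IDs that were seen on the same day both between 06-09 (morning) and between 14-16 (afternoon). In other words, I only want customers who were present in both the morning and the afternoon, not those seen only in the morning or only in the afternoon.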

customerID  Date       start_time

1234      01.10.2019    07:52:27
1234      01.10.2019    14:10:18
5678      01.10.2019    15:19:18
5678      01.10.2019    16:54:25
1011      02.10.2019    06:15:00
1011      02.10.2019    17:00:00
2222      02.10.2019    08:00:00
2222      02.10.2019    08:45:00

The output should look like this:

customerID  Date   start_time morning/afternoon

1234    01.10.2019  07:52:27  seen both morning and afternoon
1234    01.10.2019  14:10:18  seen both morning and afternoon
1011    02.10.2019  06:15:00  seen both morning and afternoon
1011    02.10.2019  17:00:00  seen both morning and afternoon

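As you can see, only the customers with a start_time in both the morning (between 06 and 09) and the afternoon (between 14 and 17) are returned. The last column (morning/afternoon) is not needed; it is only there for illustration. I am not sure how to achieve this. I have tried various combinations of AND / OR, WHERE, and WHERE EXISTS, but I am nowhere near a solution. Can anyone help me?

Solutions

I tried to solve this with spark-sql. Please refer to the logic below:

Extract the hour from start_time and flag every visit between 06 and 09 as 1 and every visit between 14 and 17 as -1, in a new column called visit_status.
Group by customerID and then Date, and compute sum(visit_status).
Select the records where sum(visit_status) = 0; these are the customers who visited in both the morning and the afternoon.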

# Step 1: flag morning visits with 1, afternoon visits with -1, anything else with 10
from pyspark.sql.functions import when

df_temp = spark.sql("""select customerID,Date,start_time,case when EXTRACT(HOUR from start_time) between 6 and 9 then 1 
when EXTRACT(HOUR from start_time) between 14 and 17 then -1
else 10
end as visit_status from customers""")

# Step 2: sum the flags per customer and day, keeping the original columns
df_temp.createOrReplaceTempView("temp")
new_df = spark.sql("""select customerID,Date,start_time,sum(visit_status) 
over( partition by customerID,Date) as final_status from temp""")

# Step 3: keep the records for which the sum is 0
# Note: the sum is 0 only when the morning and afternoon visit counts are equal
new_df = new_df.filter("final_status = 0")
new_df = new_df.withColumn("final_status",when(new_df['final_status'] == 0,"seen both morning and afternoon"))
new_df.show(10,False)
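
The signed sum above reaches 0 only when a customer has the same number of morning and afternoon visits on a day. As a rough sketch of the same flagging idea without that restriction, the plain-Python version below records which periods were seen per customer and day instead of summing. The names ROWS, period, and seen_both are hypothetical and not part of the original answer; the hour ranges mirror the Spark query.

```python
from collections import defaultdict

# Sample rows from the question: (customerID, Date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]


def period(start_time):
    """Classify a visit by its hour, like EXTRACT(HOUR ...) BETWEEN in the Spark query."""
    hour = int(start_time[:2])
    if 6 <= hour <= 9:
        return "morning"
    if 14 <= hour <= 17:
        return "afternoon"
    return None


def seen_both(rows):
    """Return the in-range rows of customers seen in both periods on the same day."""
    periods = defaultdict(set)
    for customer, date, start in rows:
        p = period(start)
        if p is not None:
            periods[(customer, date)].add(p)
    keep = {key for key, seen in periods.items() if seen == {"morning", "afternoon"}}
    return [r for r in rows if (r[0], r[1]) in keep and period(r[2]) is not None]


for row in seen_both(ROWS):
    print(row)
```

With the sample data this keeps the two rows of customer 1234 and the two rows of customer 1011, and it would still keep a customer with two morning visits and one afternoon visit.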

The simplest version returns one row per customer (and date):

select
   customerID,min(start_time),max(start_time)
from tab
-- no time before 6 and after 16 
where start_time between time '06:00:00' and time '16:00:00'
-- to filter exactly
-- where start_time between time '06:00:00' and time '09:00:00'
--    or start_time between time '14:00:00' and time '16:00:00'
group by
   customerID,Date
having min(start_time) <= time '09:00:00' -- at least one row between  6 and  9
   and max(start_time) >= time '14:00:00' -- at least one row between 14 and 16
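
To try the grouped query without a warehouse at hand, here is a minimal sketch that runs the exact-filter variant against the question's sample data in an in-memory SQLite database. This is an assumption for illustration only: SQLite stores the times as zero-padded text, so the TIME literals become plain strings, and the helper name morning_and_afternoon and its afternoon_end parameter are hypothetical.

```python
import sqlite3

# Sample rows from the question: (customerID, Date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]


def morning_and_afternoon(rows, afternoon_end="16:00:00"):
    """One row per customer and day with a visit in both time ranges."""
    con = sqlite3.connect(":memory:")
    con.execute("create table tab (customerID integer, Date text, start_time text)")
    con.executemany("insert into tab values (?, ?, ?)", rows)
    query = """
        select customerID, Date, min(start_time), max(start_time)
        from tab
        where start_time between '06:00:00' and '09:00:00'
           or start_time between '14:00:00' and ?
        group by customerID, Date
        having min(start_time) <= '09:00:00'  -- at least one morning row
           and max(start_time) >= '14:00:00'  -- at least one afternoon row
        order by customerID
    """
    result = con.execute(query, (afternoon_end,)).fetchall()
    con.close()
    return result


print(morning_and_afternoon(ROWS))
print(morning_and_afternoon(ROWS, afternoon_end="17:00:00"))
```

With the 16:00 upper bound only customer 1234 is returned, because the 17:00:00 visit of customer 1011 falls outside the range; widening the bound to 17:00:00 returns 1011 as well.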

If you really need both rows, you can apply the same logic using Windowed Aggregates, as follows:

with cte as
 (
    select
       customerID,count(*) over (partition by customerID,Date) as cnt
    from tab
    where start_time between time '06:00:00' and time '09:00:00'
       or start_time between time '14:00:00' and time '16:00:00'
    group by
       customerID,Date,case when start_time between time '06:00:00' and time '09:00:00' then 1
            when start_time between time '14:00:00' and time '16:00:00' then 2
       end
 )
select * 
from cte 
where cnt = 2

Alternatively:

SELECT a.customerid,a.date,MIN(a.start_time) AS am_start,MIN(b.start_time) AS pm_start
FROM   t1 a
       JOIN t1 b
         ON a.customerid = b.customerid
            AND a.date = b.date
            AND a.start_time BETWEEN TIME '06:00:00' AND TIME '09:00:00'
            AND b.start_time BETWEEN TIME '14:00:00' AND TIME '16:00:00'
GROUP BY  a.customerid,a.date;

You can use window functions:

select t.*
from (
    select t.*,
        count(*) filter(where start_time <= '09:00:00') over(partition by customerid,date) cnt_morning,
        count(*) filter(where start_time >= '15:00:00') over(partition by customerid,date) cnt_afternoon
    from mytable t
    where start_time between '06:00:00' and '09:00:00' 
       or start_time between '15:00:00' and '19:00:00' 
) t
where cnt_morning > 0 and cnt_afternoon > 0

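The subquery filters on the two time ranges you are interested in and uses window counts to compute how many times each customer appears in each range on each day. The outer query then simply filters on the counts.

You did not say which database you are running but tagged the question ansi-sql, so this uses the standard filter clause on window functions. A more portable way to express it is: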

        sum(case when start_time <= '09:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_morning,
        sum(case when start_time >= '15:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_afternoon
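
As a rough, runnable sketch of the portable form, the snippet below runs the windowed sum(case ...) query against the question's sample data in an in-memory SQLite database. Several things here are assumptions rather than part of the answer: SQLite 3.25 or later (needed for window functions), times stored as zero-padded text, the question's 06-09 and 14-17 ranges in place of the ranges used above, and the helper name both_periods.

```python
import sqlite3

# Sample rows from the question: (customerid, date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]

QUERY = """
select customerid, date, start_time
from (
    select t.*,
           sum(case when start_time <= '09:00:00' then 1 else 0 end)
               over (partition by customerid, date) as cnt_morning,
           sum(case when start_time >= '14:00:00' then 1 else 0 end)
               over (partition by customerid, date) as cnt_afternoon
    from mytable t
    where start_time between '06:00:00' and '09:00:00'
       or start_time between '14:00:00' and '17:00:00'
) x
where cnt_morning > 0 and cnt_afternoon > 0
order by customerid, start_time
"""


def both_periods(rows):
    """Return every in-range row of customers seen in both periods on a day."""
    con = sqlite3.connect(":memory:")
    con.execute("create table mytable (customerid integer, date text, start_time text)")
    con.executemany("insert into mytable values (?, ?, ?)", rows)
    result = con.execute(QUERY).fetchall()
    con.close()
    return result


for row in both_periods(ROWS):
    print(row)
```

Unlike the grouped version, this keeps the original rows, so customers 1011 and 1234 each come back with their morning and afternoon visit.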

I would be inclined to use exists:

select t.*
from t
where (t.start_time between '06:00:00' and '09:00:00' and
       exists (select 1
               from t t2
               where t2.customerid = t.customerid and
                     t2.date = t.date and
                     t2.start_time between '14:00:00' and '16:00:00'
              )
      ) or
      (t.start_time between '14:00:00' and '16:00:00' and
       exists (select 1
               from t t2
               where t2.customerid = t.customerid and
                     t2.date = t.date and
                     t2.start_time between '06:00:00' and '09:00:00'
              )
      );

I did not include the last column (seen both morning and afternoon). It seems redundant.

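I am not 100% sure this is exactly what you need. For example, for the morning you might mean any of the following:

start_time >= '06:00:00' and start_time < '09:00:00'
start_time >= '06:00:00' and start_time <= '09:00:00'
start_time >= '06:00:00' and start_time < '10:00:00'
start_time >= '06:00:00' and start_time <= '10:00:00'

Based on your sample data, you want the last one (although that was not my first interpretation of the question). But this seems like a minor detail.

I recommend this approach for two reasons:

It keeps all the original rows, even when there are multiple rows in the morning and in the afternoon. That appears to be the intent of your question, although I personally would want just one row per customer.
With an index on the columns that the exists subqueries filter on (customerid, date, start_time), it should have very good performance.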
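
To make the difference between the possible morning boundaries concrete, here is a small sketch that checks a start time against four candidate conditions (half-open and closed ranges ending at 09:00 or 10:00). The names MORNING_RULES and matching_rules are hypothetical, and the sketch assumes times are zero-padded HH:MM:SS strings, which compare in the same order as the times they represent.

```python
# Four candidate definitions of "morning", keyed by a readable label
MORNING_RULES = {
    ">= 06:00 and <  09:00": lambda t: "06:00:00" <= t < "09:00:00",
    ">= 06:00 and <= 09:00": lambda t: "06:00:00" <= t <= "09:00:00",
    ">= 06:00 and <  10:00": lambda t: "06:00:00" <= t < "10:00:00",
    ">= 06:00 and <= 10:00": lambda t: "06:00:00" <= t <= "10:00:00",
}


def matching_rules(start_time):
    """Return the labels of all candidate conditions that accept start_time."""
    return [label for label, rule in MORNING_RULES.items() if rule(start_time)]


for t in ("07:52:27", "09:00:00", "09:30:00", "10:00:00"):
    print(t, matching_rules(t))
```

A visit at 07:52:27 satisfies all four conditions, while visits at exactly 09:00:00, at 09:30:00, or at exactly 10:00:00 are accepted by fewer and fewer of them, which is why the choice matters only for rows near the boundaries.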
