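Selecting values that occur twice within given time ranges

Problem description

I have a dataset in Google Cloud, queried with Standard SQL, that records store visits with a date, a time, and a customer ID. I want to keep only the customer IDs that were seen on the same day in both the 06-09 window (morning) and the 14-16 window (afternoon). That is, only customers who were present in both the morning and the afternoon, not those seen only in the morning or only in the afternoon.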

customerID  Date       start_time

1234      01.10.2019    07:52:27
1234      01.10.2019    14:10:18
5678      01.10.2019    15:19:18
5678      01.10.2019    16:54:25
1011      02.10.2019    06:15:00
1011      02.10.2019    17:00:00
2222      02.10.2019    08:00:00
2222      02.10.2019    08:45:00

The output should look like this:

customerID  Date   start_time morning/afternoon

1234    01.10.2019  07:52:27  seen both morning and afternoon
1234    01.10.2019  14:10:18  seen both morning and afternoon
1011    02.10.2019  06:15:00  seen both morning and afternoon
1011    02.10.2019  17:00:00  seen both morning and afternoon

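As you can see, only the customers with a start_time both in the morning (between 06-09) and in the afternoon (between 14-17) are kept. The last column (morning/afternoon) is not needed; it is only there for illustration. I'm not sure how to achieve this. I have tried various combinations of AND / OR, WHERE, and WHERE EXISTS, but I'm nowhere near a solution. Can anyone help me?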

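Solution

I tried solving this with spark-sql. The logic is as follows:

  1. Extract the hour from start_time and add a new column, visit_status: mark every visit between 06-09 as 1 and every visit between 14-17 as -1
  2. Partition by customerID and then Date, and compute sum(visit_status)
  3. Select the records where sum(visit_status) = 0; these are the customers who visited in both the morning and the afternoon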
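
# Requires an active SparkSession named `spark` and a table or view named `customers`
from pyspark.sql.functions import when

# Step 1: tag each visit: 1 = morning (hours 6-9), -1 = afternoon (hours 14-17), 10 = any other time
df_temp = spark.sql("""select customerID,Date,start_time,case when EXTRACT(HOUR from start_time) between 6 and 9 then 1
when EXTRACT(HOUR from start_time) between 14 and 17 then -1
else 10
end as visit_status from customers""")

# Step 2: sum visit_status per customer and day, keeping the original columns
df_temp.createOrReplaceTempView("temp")
new_df = spark.sql("""select customerID,Date,start_time,sum(visit_status)
over( partition by customerID,Date) as final_status from temp""")

# Step 3: keep the records whose sum is 0 and replace the 0 with a label
new_df = new_df.filter("final_status = 0")
new_df = new_df.withColumn("final_status",when(new_df['final_status'] == 0,"seen both morning and afternoon"))
new_df.show(10,False)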
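
The three steps above can be traced without a Spark cluster. The following is a minimal plain-Python sketch of the same visit_status logic, not the answer's own code; the helper names `visit_status` and `seen_both` are hypothetical, and times are assumed to be `HH:MM:SS` strings. It also makes one limitation visible: the sum is 0 only when morning and afternoon visits balance exactly, so a customer with two morning visits and one afternoon visit would be dropped.

```python
from collections import defaultdict

# Sample rows from the question: (customerID, Date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]


def visit_status(start_time):
    """Step 1: 1 for hours 6-9, -1 for hours 14-17, 10 otherwise."""
    hour = int(start_time.split(":")[0])
    if 6 <= hour <= 9:
        return 1
    if 14 <= hour <= 17:
        return -1
    return 10


def seen_both(rows):
    """Steps 2 and 3: sum the status per (customerID, Date), keep rows whose sum is 0."""
    totals = defaultdict(int)
    for customer_id, date, start_time in rows:
        totals[(customer_id, date)] += visit_status(start_time)
    return [row for row in rows if totals[(row[0], row[1])] == 0]


if __name__ == "__main__":
    for row in seen_both(ROWS):
        print(row)
```

On the sample data this keeps the four rows of customers 1234 and 1011. If several visits per window are possible, a count per window is a safer test than a signed sum.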

The simplest version returns one row per customer and day:

select
   customerID,min(start_time),max(start_time)
from tab
-- no time before 6 and after 16 
where start_time between time '06:00:00' and time '16:00:00'
-- to filter exactly
-- where start_time between time '06:00:00' and time '09:00:00'
--    or start_time between time '14:00:00' and time '16:00:00'
group by
   customerID,Date
having min(start_time) <= time '09:00:00' -- at least one row between  6 and  9
   and max(start_time) >= time '14:00:00' -- at least one row between 14 and 16
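
To try the min/max idea on the question's sample rows, here is a small sketch that runs an equivalent query in an in-memory SQLite database through Python's `sqlite3` module. SQLite stands in for the real engine, so times are stored as `HH:MM:SS` text and compared as strings rather than with `time` literals; the function name `both_windows` and the `afternoon_end` parameter are assumptions made for illustration.

```python
import sqlite3

# Sample rows from the question: (customerID, Date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]


def both_windows(rows, afternoon_end="16:00:00"):
    """Return one row per customer and day seen in both time windows."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tab (customerID INTEGER, Date TEXT, start_time TEXT)")
    con.executemany("INSERT INTO tab VALUES (?, ?, ?)", rows)
    query = """
        SELECT customerID, Date, MIN(start_time), MAX(start_time)
        FROM tab
        -- keep only rows inside one of the two windows
        WHERE start_time BETWEEN '06:00:00' AND '09:00:00'
           OR start_time BETWEEN '14:00:00' AND ?
        GROUP BY customerID, Date
        HAVING MIN(start_time) <= '09:00:00'  -- at least one morning row
           AND MAX(start_time) >= '14:00:00'  -- at least one afternoon row
        ORDER BY customerID
    """
    result = con.execute(query, (afternoon_end,)).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(both_windows(ROWS))
    print(both_windows(ROWS, afternoon_end="17:00:00"))
```

With the 16:00 upper bound only customer 1234 qualifies; customer 1011 from the expected output appears only when the afternoon window is extended to 17:00, which matches the 14-17 range mentioned later in the question.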

If you really need both rows, you can apply the same logic with Windowed Aggregates, as follows:

with cte as
 (
    select
       customerID,Date,count(*) over (partition by customerID,Date) as cnt
    from tab
    where start_time between time '06:00:00' and time '09:00:00'
       or start_time between time '14:00:00' and time '16:00:00'
    group by
       customerID,Date,case when start_time between time '06:00:00' and time '09:00:00' then 1
            when start_time between time '14:00:00' and time '16:00:00' then 2
       end
 )
select * 
from cte 
where cnt = 2 

Or:

SELECT a.customerid,a.date,MIN(a.start_time) AS am_start,MIN(b.start_time) AS pm_start
FROM   t1 a
       JOIN t1 b
         ON a.customerid = b.customerid
            AND a.date = b.date
            AND a.start_time BETWEEN TIME '06:00:00' AND TIME '09:00:00'
            AND b.start_time BETWEEN TIME '14:00:00' AND TIME '16:00:00'
GROUP BY  a.customerid,a.date;

You can use window functions:

select t.*
from (
    select t.*,
        count(*) filter(where start_time <= '09:00:00') over(partition by customerid,date) cnt_morning,
        count(*) filter(where start_time >= '15:00:00') over(partition by customerid,date) cnt_afternoon
    from mytable t
    where start_time between '06:00:00' and '09:00:00' 
       or start_time between '15:00:00' and '19:00:00' 
) t
where cnt_morning > 0 and cnt_afternoon > 0

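The subquery filters on the two time ranges you are interested in and uses window counts to compute, per customer and day, the number of occurrences in each range. The outer query then simply filters on those counts.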

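You did not say which database you are running but tagged the question ansi-sql, so this uses the standard filter clause on window functions. A more portable way to express it is: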

        sum(case when start_time <= '09:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_morning,
        sum(case when start_time >= '15:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_afternoon
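
As a rough check of the portable sum(case ...) form, the sketch below runs it in an in-memory SQLite database via Python's `sqlite3` module (window functions need SQLite 3.25 or newer). The function name `rows_seen_in_both`, the table name `tab`, and the default 14:00-17:00 afternoon window are assumptions taken from the question's sample data, not from the answer, which uses 15:00-19:00.

```python
import sqlite3

# Sample rows from the question: (customerID, Date, start_time)
ROWS = [
    (1234, "01.10.2019", "07:52:27"),
    (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"),
    (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"),
    (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"),
    (2222, "02.10.2019", "08:45:00"),
]


def rows_seen_in_both(rows, afternoon_start="14:00:00", afternoon_end="17:00:00"):
    """Return every original row of customers seen in both windows on a day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tab (customerID INTEGER, Date TEXT, start_time TEXT)")
    con.executemany("INSERT INTO tab VALUES (?, ?, ?)", rows)
    query = """
        SELECT customerID, Date, start_time
        FROM (
            SELECT t.*,
                   -- per customer and day: number of morning rows
                   SUM(CASE WHEN start_time <= '09:00:00' THEN 1 ELSE 0 END)
                       OVER (PARTITION BY customerID, Date) AS cnt_morning,
                   -- per customer and day: number of afternoon rows
                   SUM(CASE WHEN start_time >= ? THEN 1 ELSE 0 END)
                       OVER (PARTITION BY customerID, Date) AS cnt_afternoon
            FROM tab t
            WHERE start_time BETWEEN '06:00:00' AND '09:00:00'
               OR start_time BETWEEN ? AND ?
        ) AS s
        WHERE cnt_morning > 0 AND cnt_afternoon > 0
        ORDER BY customerID, start_time
    """
    params = (afternoon_start, afternoon_start, afternoon_end)
    result = con.execute(query, params).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in rows_seen_in_both(ROWS):
        print(row)
```

Because the counts are computed per window rather than netted against each other, a customer with several morning or afternoon visits keeps all of those rows.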

I would be inclined to use exists:

select t.*
from t
where (t.start_time between '06:00:00' and '09:00:00' and
       exists (select 1
               from t t2
               where t2.customerid = t.customerid and
                     t2.date = t.date and
                     t2.start_time between '14:00:00' and '16:00:00'
              )
      ) or
      (t.start_time between '14:00:00' and '16:00:00' and
       exists (select 1
               from t t2
               where t2.customerid = t.customerid and
                     t2.date = t.date and
                     t2.start_time between '06:00:00' and '09:00:00'
              )
      );

I did not include the last column, seen both morning and afternoon. It seems redundant.

I'm not 100% sure this is exactly what you want. For example, by "morning" you might mean any of the following:

start_time >= '06:00:00' and start_time < '09:00:00'
start_time >= '06:00:00' and start_time <= '09:00:00'
start_time >= '06:00:00' and start_time < '10:00:00'
start_time >= '06:00:00' and start_time <= '10:00:00'

Based on your sample data, you want the last one (although that would not be my first interpretation of the question). That seems like a minor detail, though.

I recommend this approach for two reasons:

  • It keeps all the original rows, even when there are multiple rows in the morning and in the afternoon. That seems to be the intent of your question, although I would personally want just one row per customer.
  • With a suitable index, it should have very good performance.
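
To illustrate the claim that the exists approach keeps every original row, here is a minimal sketch against an in-memory SQLite table using Python's `sqlite3` module. The function name `rows_with_both` and the extra 08:30:00 visit are made up for the demonstration, and the afternoon window is assumed to be 14:00-16:00 as in the query above.

```python
import sqlite3


def rows_with_both(rows):
    """Return all rows in one window whose customer also has a same-day row in the other."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (customerID INTEGER, Date TEXT, start_time TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?, ?)", rows)
    query = """
        SELECT t.customerID, t.Date, t.start_time
        FROM t
        WHERE (t.start_time BETWEEN '06:00:00' AND '09:00:00'
               AND EXISTS (SELECT 1 FROM t t2
                           WHERE t2.customerID = t.customerID
                             AND t2.Date = t.Date
                             AND t2.start_time BETWEEN '14:00:00' AND '16:00:00'))
           OR (t.start_time BETWEEN '14:00:00' AND '16:00:00'
               AND EXISTS (SELECT 1 FROM t t2
                           WHERE t2.customerID = t.customerID
                             AND t2.Date = t.Date
                             AND t2.start_time BETWEEN '06:00:00' AND '09:00:00'))
        ORDER BY t.customerID, t.start_time
    """
    result = con.execute(query).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    demo = [
        (1234, "01.10.2019", "07:52:27"),
        (1234, "01.10.2019", "08:30:00"),  # hypothetical second morning visit
        (1234, "01.10.2019", "14:10:18"),
        (5678, "01.10.2019", "15:19:18"),  # afternoon only, filtered out
    ]
    for row in rows_with_both(demo):
        print(row)
```

All three rows for customer 1234 come back, while the afternoon-only customer 5678 is excluded.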
