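Problem description
I have a dataset in Google Cloud, queried with Standard SQL, that records the date, time and customer ID of store visits. I want to keep only those customer IDs that were seen on the same day both between 06-09 (morning) and between 14-16 (afternoon). That is, only customers who were present in both the morning and the afternoon, not those present in only one of the two.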
customerID Date start_time
1234 01.10.2019 07:52:27
1234 01.10.2019 14:10:18
5678 01.10.2019 15:19:18
5678 01.10.2019 16:54:25
1011 02.10.2019 06:15:00
1011 02.10.2019 17:00:00
2222 02.10.2019 08:00:00
2222 02.10.2019 08:45:00
The output should look like this:
customerID Date start_time morning/afternoon
1234 01.10.2019 07:52:27 seen both morning and afternoon
1234 01.10.2019 14:10:18 seen both morning and afternoon
1011 02.10.2019 06:15:00 seen both morning and afternoon
1011 02.10.2019 17:00:00 seen both morning and afternoon
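As you can see, only customers with a start_time in both the morning (between 06-09) and the afternoon (between 14-17) are kept. The last column (morning/afternoon) is not needed; it is only there for illustration. I am not sure how to achieve this. I have tried various combinations of AND / OR, WHERE and WHERE EXISTS, but I am still far from a solution. Can anyone help me?
Solutions
I tried to solve this with spark-sql. Please refer to the logic below:
- Extract the hour from the start time and flag every visit between 06-09 as 1 and every visit between 14-17 as -1, in a new column called visit_status
- Group by customerID and Date, then compute sum(visit_status)
- Select the records where sum(visit_status) = 0; these are the customers who visited in both the morning and the afternoon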
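from pyspark.sql.functions import when

# Assumes an existing SparkSession named spark and a table or view named customers.
# Step 1: flag each visit (1 = morning, -1 = afternoon, 10 = any other time)
df_temp = spark.sql("""select customerID,Date,start_time,case when EXTRACT(HOUR from start_time) between 6 and 9 then 1
when EXTRACT(HOUR from start_time) between 14 and 17 then -1
else 10
end as visit_status from customers""")
# Step 2: sum the flags per customer and day, keeping the original columns
df_temp.createOrReplaceTempView("temp")
new_df = spark.sql("""select customerID,Date,start_time,sum(visit_status)
over( partition by customerID,Date) as final_status from temp""")
# Step 3: keep the rows whose sum is 0 and replace the sum with a label
new_df = new_df.filter("final_status = 0")
new_df = new_df.withColumn("final_status",when(new_df['final_status'] == 0,"seen both morning and afternoon"))
new_df.show(10,False)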
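One caveat with the +1 / -1 sum is that it only reaches 0 when the number of morning visits equals the number of afternoon visits. The plain-Python sketch below (no Spark needed) compares that rule with a simple "seen in both windows" check. Customer 3333 is a hypothetical addition that is not in the question's data, and the helper names are made up for illustration.

```python
from collections import defaultdict

def status(hour):
    # Same coding as the Spark query: morning = 1, afternoon = -1, other = 10.
    if 6 <= hour <= 9:
        return 1
    if 14 <= hour <= 17:
        return -1
    return 10

def seen_both_by_sum(visits):
    # Sum the flags per (customer, date) and keep the groups whose sum is 0.
    totals = defaultdict(int)
    for customer, date, hour in visits:
        totals[(customer, date)] += status(hour)
    return {key for key, total in totals.items() if total == 0}

def seen_both_by_flags(visits):
    # Keep the groups that have at least one visit in each window.
    morning, afternoon = set(), set()
    for customer, date, hour in visits:
        if 6 <= hour <= 9:
            morning.add((customer, date))
        elif 14 <= hour <= 17:
            afternoon.add((customer, date))
    return morning & afternoon

visits = [
    (1234, "01.10.2019", 7), (1234, "01.10.2019", 14),
    # Hypothetical customer with two morning visits and one afternoon visit.
    (3333, "03.10.2019", 7), (3333, "03.10.2019", 8), (3333, "03.10.2019", 15),
]

print(seen_both_by_sum(visits))    # misses customer 3333 (sum is 1, not 0)
print(seen_both_by_flags(visits))  # contains both customers
```

If repeat visits within a window are possible in the real data, a max-per-window or count-per-window check is likely the safer choice.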
,
The simplest version returns one row per customer and day:
select
customerID,min(start_time),max(start_time)
from tab
-- no time before 6 and after 16
where start_time between time '06:00:00' and time '16:00:00'
-- to filter exactly
-- where start_time between time '06:00:00' and time '09:00:00'
-- or start_time between time '14:00:00' and time '16:00:00'
group by
customerID,Date
having min(start_time) <= time '09:00:00' -- at least one row between 6 and 9
and max(start_time) >= time '14:00:00' -- at least one row between 14 and 16
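To try the GROUP BY / HAVING idea without a BigQuery project, here is a minimal sketch that runs it on the question's sample rows with Python's built-in sqlite3 module. Assumptions: times are stored as zero-padded 'HH:MM:SS' text (so string comparison orders them correctly), SQLite has no TIME literal so plain strings are used, and the afternoon window is taken as 14-17 to match the expected output in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (customerID INTEGER, Date TEXT, start_time TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", [
    (1234, "01.10.2019", "07:52:27"), (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"), (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"), (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"), (2222, "02.10.2019", "08:45:00"),
])

rows = conn.execute("""
    SELECT customerID, Date, MIN(start_time), MAX(start_time)
    FROM tab
    WHERE start_time BETWEEN '06:00:00' AND '09:00:00'
       OR start_time BETWEEN '14:00:00' AND '17:00:00'
    GROUP BY customerID, Date
    HAVING MIN(start_time) <= '09:00:00'   -- at least one morning row
       AND MAX(start_time) >= '14:00:00'   -- at least one afternoon row
    ORDER BY customerID
""").fetchall()

for row in rows:
    print(row)
```

Only customers 1011 and 1234 remain, each collapsed to a single row with their earliest and latest qualifying time.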
If you really need both rows, you can apply the same logic with Windowed Aggregates, like this:
with cte as
(
select
customerID,Date,min(start_time) as start_time, -- first visit in each time window
count(*) over (partition by customerID,Date) as cnt -- number of time windows seen that day
from tab
where start_time between time '06:00:00' and time '09:00:00'
or start_time between time '14:00:00' and time '16:00:00'
group by
customerID,Date,case when start_time between time '06:00:00' and time '09:00:00' then 1
when start_time between time '14:00:00' and time '16:00:00' then 2
end
)
select *
from cte
where cnt = 2
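The trick in this query is that grouping by the time window leaves at most two rows per customer and day, so the windowed count equals the number of distinct windows seen. A sketch of that idea on the sample data, again using sqlite3 as a stand-in engine (window functions need SQLite 3.25 or later). The Date and MIN(start_time) columns and the 14-17 afternoon window are assumptions chosen to reproduce the question's expected output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (customerID INTEGER, Date TEXT, start_time TEXT)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", [
    (1234, "01.10.2019", "07:52:27"), (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"), (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"), (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"), (2222, "02.10.2019", "08:45:00"),
])

rows = conn.execute("""
    WITH cte AS (
        SELECT customerID, Date, MIN(start_time) AS start_time,
               -- evaluated after GROUP BY: counts the windows per customer and day
               COUNT(*) OVER (PARTITION BY customerID, Date) AS cnt
        FROM tab
        WHERE start_time BETWEEN '06:00:00' AND '09:00:00'
           OR start_time BETWEEN '14:00:00' AND '17:00:00'
        GROUP BY customerID, Date,
                 CASE WHEN start_time BETWEEN '06:00:00' AND '09:00:00' THEN 1 ELSE 2 END
    )
    SELECT customerID, Date, start_time
    FROM cte
    WHERE cnt = 2
    ORDER BY customerID, start_time
""").fetchall()

for row in rows:
    print(row)
```

Note that this returns one row per window (the earliest visit in it), not every original row.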
,
Alternatively:
SELECT a.customerid,a.date,MIN(a.start_time) AS am_start,MIN(b.start_time) AS pm_start
FROM t1 a
JOIN t1 b
ON a.customerid = b.customerid
AND a.date = b.date
AND a.start_time BETWEEN TIME '06:00:00' AND TIME '09:00:00'
AND b.start_time BETWEEN TIME '14:00:00' AND TIME '16:00:00'
GROUP BY a.customerid,a.date;
,
You can use window functions:
select t.*
from (
select t.*,
count(*) filter(where start_time <= '09:00:00') over(partition by customerid,date) cnt_morning,
count(*) filter(where start_time >= '15:00:00') over(partition by customerid,date) cnt_afternoon
from mytable t
where start_time between '06:00:00' and '09:00:00'
or start_time between '15:00:00' and '19:00:00'
) t
where cnt_morning > 0 and cnt_afternoon > 0
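The subquery filters on the two time ranges you are interested in and uses window counts to compute, for each customer and day, how many rows fall in each range. The outer query then simply filters on those counts.
You did not say which database you are running, but you tagged the question ansi-sql, so this uses the standard filter clause in the window functions. A more portable way to express it is: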
sum(case when start_time <= '09:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_morning,
sum(case when start_time >= '15:00:00' then 1 else 0 end) over(partition by customerid,date) cnt_afternoon
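For reference, the portable SUM(CASE ...) OVER form can be dropped into the full query as sketched below. This runs on SQLite (3.25 or later) through Python's sqlite3 module purely as a convenient test engine. The time windows here are assumptions taken from the question (06-09 and 14-17) rather than the 15-19 range used in the answer's snippet, and times are assumed to be zero-padded text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (customerid INTEGER, date TEXT, start_time TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1234, "01.10.2019", "07:52:27"), (1234, "01.10.2019", "14:10:18"),
    (5678, "01.10.2019", "15:19:18"), (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"), (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"), (2222, "02.10.2019", "08:45:00"),
])

rows = conn.execute("""
    SELECT customerid, date, start_time
    FROM (
        SELECT t.*,
               SUM(CASE WHEN start_time <= '09:00:00' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY customerid, date) AS cnt_morning,
               SUM(CASE WHEN start_time >= '14:00:00' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY customerid, date) AS cnt_afternoon
        FROM mytable t
        -- rows outside both windows are removed before counting
        WHERE start_time BETWEEN '06:00:00' AND '09:00:00'
           OR start_time BETWEEN '14:00:00' AND '17:00:00'
    ) AS t
    WHERE cnt_morning > 0 AND cnt_afternoon > 0
    ORDER BY customerid, start_time
""").fetchall()

for row in rows:
    print(row)
```

Because nothing is grouped, every qualifying original row of customers 1011 and 1234 is returned.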
,
I would be inclined to use exists:
select t.*
from t
where (t.start_time between '06:00:00' and '09:00:00' and
exists (select 1
from t t2
where t2.customerid = t.customerid and
t2.date = t.date and
t2.start_time between '14:00:00' and '16:00:00'
)
) or
(t.start_time between '14:00:00' and '16:00:00' and
exists (select 1
from t t2
where t2.customerid = t.customerid and
t2.date = t.date and
t2.start_time between '06:00:00' and '09:00:00'
)
);
I did not include the last column. It seems redundant.
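A quick way to check the exists approach is to run it on the sample rows with Python's sqlite3 module, as in the sketch below. Assumptions: times are zero-padded text, the afternoon window is widened to 14-17 to match the question's expected output, and the 08:30:00 row for customer 1234 is a hypothetical extra visit added to show that repeated visits within a window are all kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (customerid INTEGER, date TEXT, start_time TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1234, "01.10.2019", "07:52:27"), (1234, "01.10.2019", "14:10:18"),
    (1234, "01.10.2019", "08:30:00"),  # hypothetical second morning visit
    (5678, "01.10.2019", "15:19:18"), (5678, "01.10.2019", "16:54:25"),
    (1011, "02.10.2019", "06:15:00"), (1011, "02.10.2019", "17:00:00"),
    (2222, "02.10.2019", "08:00:00"), (2222, "02.10.2019", "08:45:00"),
])

rows = conn.execute("""
    SELECT t.*
    FROM t
    WHERE (t.start_time BETWEEN '06:00:00' AND '09:00:00'
           AND EXISTS (SELECT 1 FROM t t2
                       WHERE t2.customerid = t.customerid
                         AND t2.date = t.date
                         AND t2.start_time BETWEEN '14:00:00' AND '17:00:00'))
       OR (t.start_time BETWEEN '14:00:00' AND '17:00:00'
           AND EXISTS (SELECT 1 FROM t t2
                       WHERE t2.customerid = t.customerid
                         AND t2.date = t.date
                         AND t2.start_time BETWEEN '06:00:00' AND '09:00:00'))
    ORDER BY t.customerid, t.start_time
""").fetchall()

for row in rows:
    print(row)
```

Customer 1234 comes back with three rows (two morning, one afternoon) and customer 1011 with two.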
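I am not 100% sure that this is exactly what you want. For example, "morning" could mean any of the following:
start_time >= '06:00:00' and start_time < '09:00:00'
start_time >= '06:00:00' and start_time <= '09:00:00'
start_time >= '06:00:00' and start_time < '10:00:00'
start_time >= '06:00:00' and start_time <= '10:00:00'
Based on your sample data, you want the last one (although that was not my first interpretation of the question). But this seems like a minor detail.
I recommend this approach for two reasons:
- It keeps all the original rows, even when there are multiple rows in the morning and the afternoon. This seems to be the intent of your question, although I personally would want only one row per customer.
- With an index on the columns that the exists subqueries look up (customerid, date, start_time), it should have very good performance.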
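The four candidate definitions of "morning" only disagree at the boundaries. As a small illustration, the following sketch evaluates each of them for three boundary times; the labels and the chosen test times are illustrative and not part of the original answer.

```python
from datetime import time

# The four candidate definitions of "morning", in the same order as in the text.
definitions = {
    "06:00 <= t <  09:00": lambda t: time(6) <= t < time(9),
    "06:00 <= t <= 09:00": lambda t: time(6) <= t <= time(9),
    "06:00 <= t <  10:00": lambda t: time(6) <= t < time(10),
    "06:00 <= t <= 10:00": lambda t: time(6) <= t <= time(10),
}

# Times where the definitions can differ: 09:00:00, 09:30:00 and 10:00:00.
boundary_times = [time(9, 0), time(9, 30), time(10, 0)]

results = {
    label: [is_morning(t) for t in boundary_times]
    for label, is_morning in definitions.items()
}

for label, flags in results.items():
    print(label, flags)
```

A visit at exactly 09:00:00 counts as morning under every definition except the first, so it is worth deciding deliberately between a half-open and a closed interval.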