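Problem description
I need to group data into 1-minute intervals and run a series of operations on each group. My data looks like this: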
id MetroId Time ActionName refererurl
111 a 2020-09-01-09:19:00 First www.stackoverflow/a12345
111 b 2020-09-01-12:36:54 First www.stackoverflow/a12345
111 f 2020-09-01-12:36:56 First www.stackoverflow/xxxx
111 b 2020-09-01-12:36:58 Midpoint www.stackoverflow/a12345
111 f 2020-09-01-12:37:01 Midpoint www.stackoverflow/xxx
111 b 2020-09-01-12:37:03 Third www.stackoverflow/a12345
111 b 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
222 d 2020-09-01-15:17:44 First www.stackoverflow/a2222
222 d 2020-09-01-15:17:48 Midpoint www.stackoverflow/a2222
222 d 2020-09-01-15:18:05 Third www.stackoverflow/a2222
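I need to fetch data under the following condition: for a given id and refererurl (x_id and x_url), if the ActionName column (action_name) has a Complete value, fetch that row. If there is no Complete, take Third, and so on.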
SELECT
  ARRAY_AGG(current_query_result
    ORDER BY CASE ActionName
      WHEN 'Complete' THEN 1
      WHEN 'Third' THEN 2
      WHEN 'Midpoint' THEN 3
      WHEN 'First' THEN 4
    END
    LIMIT 1
  )[OFFSET(0)]
FROM
(
  SELECT d.id, c.Time, c.ActionName, c.refererurl, c.MetroId
  FROM
    `bq_query_table_c` c
    INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
  WHERE
    c.refererurl LIKE "https://www.stackoverflow/%"
    AND c.ActionName IN ('First','Midpoint','Third','Complete')
) current_query_result
GROUP BY
  id, refererurl, MetroId,
  TIMESTAMP_SUB(
    PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time),
    INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time)), 1 * 60) SECOND
  )
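To make the grouping expression concrete, here is a minimal Python sketch of what TIMESTAMP_SUB combined with MOD(UNIX_SECONDS(...), 60) computes: each timestamp is floored to the start of its minute. Python is used only for illustration, and the helper name minute_bucket is hypothetical. It also shows why fixed buckets can split one chain of actions: two rows 5 seconds apart land in different groups.

```python
from datetime import datetime, timedelta, timezone

FMT = "%Y-%m-%d-%H:%M:%S"  # same layout as the Time column

def minute_bucket(time_str):
    """Floor a timestamp to the start of its minute (hypothetical helper)."""
    ts = datetime.strptime(time_str, FMT).replace(tzinfo=timezone.utc)
    # Subtract the seconds past the minute, like MOD(UNIX_SECONDS(ts), 60).
    return ts - timedelta(seconds=int(ts.timestamp()) % 60)

# Two consecutive actions of the same chain, 5 seconds apart.
a = minute_bucket("2020-09-01-12:36:58")
b = minute_bucket("2020-09-01-12:37:03")
print(a == b)  # False: they fall into the 12:36 and 12:37 buckets
```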
Desired output:
id MetroId Time ActionName refererurl
111 a 2020-09-01-09:19:00 First www.stackoverflow/a12345
111 f 2020-09-01-12:37:01 Midpoint www.stackoverflow/xxx
111 b 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
222 c 2020-09-01-15:18:05 Third www.stackoverflow/a2222
Solution
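This reads like a gaps-and-islands problem, where a gap is longer than 1 minute and the islands represent the "chains of actions".
I would start by building groups that represent the islands. For this, you can use lag() to retrieve the previous action time, and a cumulative sum that increments every time two consecutive actions are more than 1 minute apart: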
-- assumes time is a TIMESTAMP; if it is stored as a STRING,
-- wrap it in parse_timestamp('%Y-%m-%d-%H:%M:%S', time) first
select t.*,
    sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
        over(partition by id, refererurl order by time) grp
from (
    select d.id, c.time, c.actionname, c.refererurl,
        lag(c.time) over(partition by d.id, c.refererurl order by c.time) lag_time
    from `bq_query_table_c` c
    inner join `bq_table_d` d on d.id = c.creativeid
    where c.refererurl like "https://www.stackoverflow/%"
        and c.actionname in ('First','Midpoint','Third','Complete')
) t
grp is the island identifier.
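The island-building step can be sketched outside SQL as follows. This is only an illustration in Python using rows from the question's sample data; the function name assign_islands is hypothetical, and the loop stands in for lag() plus the running sum.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H:%M:%S"

# (id, time, action, url) rows taken from the question's sample data.
rows = [
    (111, "2020-09-01-09:19:00", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:54", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:58", "Midpoint", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:03", "Third", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:09", "Complete", "www.stackoverflow/a12345"),
    (222, "2020-09-01-15:17:44", "First", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:17:48", "Midpoint", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:18:05", "Third", "www.stackoverflow/a2222"),
]

def assign_islands(rows, gap=timedelta(minutes=1)):
    """Append a grp counter to each row, restarting per (id, url) partition."""
    partitions = {}
    for r in rows:
        partitions.setdefault((r[0], r[3]), []).append(r)
    out = []
    for part in partitions.values():
        part.sort(key=lambda r: r[1])  # fixed-width time strings sort chronologically
        grp, prev = 0, None
        for r in part:
            ts = datetime.strptime(r[1], FMT)
            # Same test as the CASE expression: a gap of more than 1 minute starts a new island.
            if prev is not None and ts > prev + gap:
                grp += 1
            out.append(r + (grp,))
            prev = ts
    return out

islands = assign_islands(rows)
for r in islands:
    print(r)
```

For id 111, the 09:19:00 action ends up alone in island 0, and the four actions between 12:36:54 and 12:37:09 share island 1 even though they straddle a minute boundary.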
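From there, we can use your original logic to filter for the preferred action in each group. We don't need to aggregate per 1-minute interval; we can use grp instead: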
select
    array_agg(t order by case actionname
        when 'Complete' then 1
        when 'Third' then 2
        when 'Midpoint' then 3
        when 'First' then 4
    end limit 1)[offset(0)]
from (
    select t.*,
        sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
            over(partition by id, refererurl order by time) grp
    from (
        select d.id, c.time, c.actionname, c.refererurl,
            lag(c.time) over(partition by d.id, c.refererurl order by c.time) lag_time
        from `bq_query_table_c` c
        inner join `bq_table_d` d on d.id = c.creativeid
        where c.refererurl like "https://www.stackoverflow/%"
            and c.actionname in ('First','Midpoint','Third','Complete')
    ) t
) t
group by id, refererurl, grp
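Note that if there are two "Complete" actions in a single island, it is undefined which one will be picked (your original query has pretty much the same flaw). To make the result deterministic, you would want to add another sort criterion to ARRAY_AGG(), such as time: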
array_agg(t order by case actionname
    when 'Complete' then 1
    when 'Third' then 2
    when 'Midpoint' then 3
    when 'First' then 4
end, time limit 1)[offset(0)]
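As a rough illustration of that two-level ordering, the Python sketch below picks one row from a single island by action priority first and time second. The island data here is hypothetical (it is not from the question) and was chosen so that it contains two Complete actions; the names PRIORITY and pick are likewise made up for the example.

```python
# Lower number = preferred action, mirroring the CASE expression.
PRIORITY = {"Complete": 1, "Third": 2, "Midpoint": 3, "First": 4}

# Hypothetical island with two Complete actions (not taken from the question).
island = [
    ("2020-09-01-12:37:09", "Complete"),
    ("2020-09-01-12:36:58", "Midpoint"),
    ("2020-09-01-12:37:05", "Complete"),
]

def pick(rows):
    """Return the preferred row: best action first, earliest time as tie-breaker."""
    # The fixed-width time strings compare chronologically, so no parsing is needed.
    return min(rows, key=lambda r: (PRIORITY[r[1]], r[0]))

best = pick(island)
print(best)  # ('2020-09-01-12:37:05', 'Complete')
```

With an ascending sort on time, the earlier of the two Complete rows always wins; sorting descending instead would pick the latest one.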
Below is for BigQuery Standard SQL:
#standardSQL
WITH temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  -- seconds until the next kept row of the same id
  SELECT *, TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    -- one row per id, url and minute: the most advanced action
    SELECT
      AS VALUE ARRAY_AGG(t
        ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url,
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60
You can test and play with the above using the sample data from your question:
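#standardSQL
WITH `project.dataset.bq_table` AS (
  SELECT 111 id, '2020-09-01-09:19:00' time, 'First' action_name, 'www.stackoverflow/a12345' url UNION ALL
  SELECT 111, '2020-09-01-12:36:54', 'First', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:36:58', 'Midpoint', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:03', 'Third', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:09', 'Complete', 'www.stackoverflow/a12345' UNION ALL
  SELECT 222, '2020-09-01-15:17:44', 'First', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:17:48', 'Midpoint', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:18:05', 'Third', 'www.stackoverflow/a2222'
), temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT *, TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    SELECT
      AS VALUE ARRAY_AGG(t
        ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url,
      TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60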
with the result:
Row id time action_name url
1 111 2020-09-01-09:19:00 First www.stackoverflow/a12345
2 111 2020-09-01-12:37:09 Complete www.stackoverflow/a12345
3 222 2020-09-01-15:18:05 Third www.stackoverflow/a2222
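For readers who want to trace how those three rows come about, here is a hedged Python re-expression of the two steps in this query: first keep the most advanced action per id, url and minute, then drop any kept row that is followed within 60 seconds by another kept row of the same id. It is a sketch for illustration only, using the sample rows from the question that share the a12345 and a2222 URLs; the variable names are arbitrary.

```python
from datetime import datetime

FMT = "%Y-%m-%d-%H:%M:%S"
ORDER = "First,Midpoint,Third,Complete"  # later position = more advanced action

rows = [
    (111, "2020-09-01-09:19:00", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:54", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:58", "Midpoint", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:03", "Third", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:09", "Complete", "www.stackoverflow/a12345"),
    (222, "2020-09-01-15:17:44", "First", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:17:48", "Midpoint", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:18:05", "Third", "www.stackoverflow/a2222"),
]

def parse(s):
    return datetime.strptime(s, FMT)

# Step 1: one row per (id, url, minute), keeping the most advanced action.
best = {}
for r in rows:
    key = (r[0], r[3], parse(r[1]).replace(second=0))
    if key not in best or ORDER.find(r[2]) > ORDER.find(best[key][2]):
        best[key] = r

# Step 2: per id, drop a row when the next kept row follows within 60 seconds
# (the NOT IFNULL(time_lag, 777) < 60 filter; a missing next row is kept).
kept = sorted(best.values(), key=lambda r: (r[0], r[1]))
result = []
for i, r in enumerate(kept):
    has_next = i + 1 < len(kept) and kept[i + 1][0] == r[0]
    lag = (parse(kept[i + 1][1]) - parse(r[1])).total_seconds() if has_next else None
    if lag is None or lag >= 60:
        result.append(r)

for r in result:
    print(r)
```

Step 1 leaves a Midpoint row at 12:36:58 and another at 15:17:48; step 2 removes both because a more advanced row of the same id follows 11 and 17 seconds later.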
Note: I am still not 100% sure about your use case, but the above is based on what has been discussed and commented so far.