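Group by 1-minute intervals for a chain of actions in SQL (BigQuery)

Problem description

I need to group data into 1-minute intervals for a chain of actions. My data looks like this: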

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   b         2020-09-01-12:36:54   First        www.stackoverflow/a12345
111   f         2020-09-01-12:36:56   First        www.stackoverflow/xxxx
111   b         2020-09-01-12:36:58   Midpoint     www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:03   Third        www.stackoverflow/a12345
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   d         2020-09-01-15:17:44   First        www.stackoverflow/a2222
222   d         2020-09-01-15:17:48   Midpoint     www.stackoverflow/a2222
222   d         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

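I need to pick one row per chain as follows: if, for a given id and refererurl, there is a row whose ActionName is Complete, take that row. If there is no Complete, take Third, and so on (then Midpoint, then First).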

SELECT
  ARRAY_AGG(current_query_result
    ORDER BY CASE ActionName
      WHEN 'Complete' THEN 1
      WHEN 'Third' THEN 2
      WHEN 'Midpoint' THEN 3
      WHEN 'First' THEN 4
    END
    LIMIT 1
  )[OFFSET(0)]
FROM
    (
        SELECT d.id, c.Time, c.ActionName, c.refererurl, c.MetroId
        FROM
            `bq_query_table_c` c
            INNER JOIN `bq_table_d` d ON d.id = c.CreativeId
        WHERE
            c.refererurl LIKE "https://www.stackoverflow/%"
            AND c.ActionName IN ('First', 'Midpoint', 'Third', 'Complete')
    ) current_query_result
GROUP BY
    id,
    refererurl,
    MetroId,
    TIMESTAMP_SUB(
      PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time),
      INTERVAL MOD(UNIX_SECONDS(PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time)), 1 * 60) SECOND
    )

Desired output

id    MetroId   Time                  ActionName   refererurl
111   a         2020-09-01-09:19:00   First        www.stackoverflow/a12345
111   f         2020-09-01-12:37:01   Midpoint     www.stackoverflow/xxx
111   b         2020-09-01-12:37:09   Complete     www.stackoverflow/a12345
222   c         2020-09-01-15:18:05   Third        www.stackoverflow/a2222

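Solution

This sounds like a gaps-and-islands problem, where a gap is a pause of more than 1 minute between consecutive actions and each island represents a "chain of actions".

I would start by building groups that represent the islands. For this, you can use lag() to retrieve the time of the previous action, and a cumulative sum that increments each time two consecutive actions are more than 1 minute apart: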

select t.*,
    sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
        over(partition by id, refererurl order by time) grp
from (
    -- assumes c.time is a TIMESTAMP; if it is stored as a string, convert it first
    -- with parse_timestamp('%Y-%m-%d-%H:%M:%S', c.time)
    select d.id, c.time, c.actionname, c.refererurl,
        lag(c.time) over(partition by d.id, c.refererurl order by c.time) lag_time
    from `bq_query_table_c` c
    inner join `bq_table_d` d on d.id = c.creativeid
    where c.refererurl like "https://www.stackoverflow/%"
        and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
) t

grp is the island identifier.
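
As a rough cross-check outside BigQuery, the lag-plus-running-sum idea can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `assign_islands` and the tuple layout `(id, time, action, url)` are made up for the sketch, and the rows are the id 111 / a12345 and id 222 rows from the question.

```python
from datetime import datetime, timedelta
from itertools import groupby

FMT = "%Y-%m-%d-%H:%M:%S"  # timestamp layout used in the question's sample data


def assign_islands(rows, gap=timedelta(minutes=1)):
    """Hypothetical helper: return (id, url, grp, time, action) tuples.

    grp starts at 0 for each (id, url) and is bumped whenever the pause since
    the previous row exceeds `gap`, like the running sum over lag() in SQL.
    """
    keyed = sorted(rows, key=lambda r: (r[0], r[3], datetime.strptime(r[1], FMT)))
    out = []
    for (rid, url), part in groupby(keyed, key=lambda r: (r[0], r[3])):
        grp, prev = 0, None
        for _, time, action, _ in part:
            ts = datetime.strptime(time, FMT)
            if prev is not None and ts - prev > gap:
                grp += 1  # a gap of more than one minute starts a new island
            out.append((rid, url, grp, time, action))
            prev = ts
    return out


rows = [
    (111, "2020-09-01-09:19:00", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:54", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:58", "Midpoint", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:03", "Third", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:09", "Complete", "www.stackoverflow/a12345"),
    (222, "2020-09-01-15:17:44", "First", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:17:48", "Midpoint", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:18:05", "Third", "www.stackoverflow/a2222"),
]

for row in assign_islands(rows):
    print(row)
```

With this data, the 09:19:00 row of id 111 ends up alone in its island, while the four rows between 12:36:54 and 12:37:09 share one island even though they straddle a minute boundary.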

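From there, we can use your original logic to pick the preferred action in each group. We do not need to aggregate by 1-minute intervals; we can group by grp instead: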

select
    array_agg(t order by case actionname
        when 'Complete' then 1
        when 'Third'    then 2
        when 'Midpoint' then 3
        when 'First'    then 4
    end limit 1)[offset(0)]
from (
    select t.*,
        sum(case when time > timestamp_add(lag_time, interval 1 minute) then 1 else 0 end)
            over(partition by id, refererurl order by time) grp
    from (
        select d.id, c.time, c.actionname, c.refererurl,
            lag(c.time) over(partition by d.id, c.refererurl order by c.time) lag_time
        from `bq_query_table_c` c
        inner join `bq_table_d` d on d.id = c.creativeid
        where c.refererurl like "https://www.stackoverflow/%"
            and c.actionname in ('First', 'Midpoint', 'Third', 'Complete')
    ) t
) t
group by id, refererurl, grp

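Note that if a single island contains two "Complete" actions, it is undefined which one will be picked (your original query has pretty much the same flaw). To make the result deterministic, add another sort criterion to ARRAY_AGG(), such as time: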

    array_agg(t order by case actionname
        when 'Complete' then 1
        when 'Third'    then 2
        when 'Midpoint' then 3
        when 'First'    then 4
    end, time limit 1)[offset(0)]
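
To see why the extra sort key matters, here is a small plain-Python sketch of the same ordering (priority first, then time). The helper name `pick_preferred`, the `(time, action)` tuple layout, and the second Complete row at 12:37:20 are all hypothetical and exist only to illustrate the tie-break.

```python
from datetime import datetime

FMT = "%Y-%m-%d-%H:%M:%S"  # timestamp layout used in the question's sample data
PRIORITY = {"Complete": 1, "Third": 2, "Midpoint": 3, "First": 4}


def pick_preferred(island_rows):
    """Hypothetical helper: pick one (time, action) row from a single island.

    Rows are ranked by action priority, then by time, which mirrors
    ORDER BY CASE ... END, time LIMIT 1 inside ARRAY_AGG.
    """
    return min(
        island_rows,
        key=lambda r: (PRIORITY[r[1]], datetime.strptime(r[0], FMT)),
    )


island = [
    ("2020-09-01-12:37:20", "Complete"),  # hypothetical duplicate, not in the question
    ("2020-09-01-12:36:54", "First"),
    ("2020-09-01-12:37:09", "Complete"),
    ("2020-09-01-12:37:03", "Third"),
]

print(pick_preferred(island))
print(pick_preferred(list(reversed(island))))
```

Both calls return the earlier Complete row regardless of input order. Without the time component in the key, whichever Complete row happened to come first would win.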

The following is for BigQuery Standard SQL:

#standardSQL
WITH temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  -- time_lag: seconds until the next per-minute winner for the same id
  SELECT *, TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    -- per id, url and minute bucket, keep the most advanced action
    SELECT
      AS VALUE ARRAY_AGG(t
        ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url, TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)
  )
)
-- drop winners that are followed by another winner within 60 seconds
WHERE NOT IFNULL(time_lag, 777) < 60
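
The two stages of the query above (pick a winner per minute bucket, then drop winners that are followed by another one within 60 seconds) can be mirrored in plain Python as a sanity check. This is a sketch of how I read the query, not a replacement for it: the function name `pick_per_minute_then_filter` and the tuple layout `(id, time, action_name, url)` are assumptions, and ties inside a bucket are resolved by first occurrence here, whereas ARRAY_AGG makes no such promise.

```python
from datetime import datetime

FMT = "%Y-%m-%d-%H:%M:%S"  # timestamp layout used in the question's sample data
ORDER = "First,Midpoint,Third,Complete"  # later position = more advanced action


def pick_per_minute_then_filter(rows):
    """Hypothetical helper mirroring the two stages of the SQL above."""
    # Stage 1: per (id, url, minute bucket), keep the most advanced action.
    best = {}
    for row in rows:
        rid, time, action, url = row
        ts = datetime.strptime(time, FMT)
        key = (rid, url, ts.replace(second=0, microsecond=0))
        if key not in best or ORDER.find(action) > ORDER.find(best[key][2]):
            best[key] = row

    # Stage 2: per id, drop a winner if the next winner follows within 60 seconds.
    winners = sorted(best.values(), key=lambda r: (r[0], datetime.strptime(r[1], FMT)))
    kept = []
    for i, row in enumerate(winners):
        nxt = winners[i + 1] if i + 1 < len(winners) else None
        if nxt is not None and nxt[0] == row[0]:
            lag = datetime.strptime(nxt[1], FMT) - datetime.strptime(row[1], FMT)
            if lag.total_seconds() < 60:
                continue  # superseded by a later winner in the same chain
        kept.append(row)
    return kept


rows = [
    (111, "2020-09-01-09:19:00", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:54", "First", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:36:58", "Midpoint", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:03", "Third", "www.stackoverflow/a12345"),
    (111, "2020-09-01-12:37:09", "Complete", "www.stackoverflow/a12345"),
    (222, "2020-09-01-15:17:44", "First", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:17:48", "Midpoint", "www.stackoverflow/a2222"),
    (222, "2020-09-01-15:18:05", "Third", "www.stackoverflow/a2222"),
]

for row in pick_per_minute_then_filter(rows):
    print(row)
```

The second stage is what handles chains that cross a minute boundary: the 12:36 bucket of id 111 produces a Midpoint winner, but it is dropped because the Complete winner of the 12:37 bucket follows 11 seconds later.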

You can test and play with the above using the sample data from your question, as in the example below:

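#standardSQL
WITH `project.dataset.bq_table` AS (
  SELECT 111 id, '2020-09-01-09:19:00' time, 'First' action_name, 'www.stackoverflow/a12345' url UNION ALL
  SELECT 111, '2020-09-01-12:36:54', 'First', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:36:58', 'Midpoint', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:03', 'Third', 'www.stackoverflow/a12345' UNION ALL
  SELECT 111, '2020-09-01-12:37:09', 'Complete', 'www.stackoverflow/a12345' UNION ALL
  SELECT 222, '2020-09-01-15:17:44', 'First', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:17:48', 'Midpoint', 'www.stackoverflow/a2222' UNION ALL
  SELECT 222, '2020-09-01-15:18:05', 'Third', 'www.stackoverflow/a2222'
), temp AS (
  SELECT *, PARSE_TIMESTAMP('%Y-%m-%d-%H:%M:%S', time) ts
  FROM `project.dataset.bq_table`
)
SELECT * EXCEPT (ts, time_lag) FROM (
  SELECT *, TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY id ORDER BY ts), ts, SECOND) time_lag
  FROM (
    SELECT
      AS VALUE ARRAY_AGG(t
        ORDER BY STRPOS('First,Midpoint,Third,Complete', action_name) DESC
        LIMIT 1
      )[OFFSET(0)]
    FROM temp t
    WHERE action_name IN ('First', 'Midpoint', 'Third', 'Complete')
    GROUP BY id, url, TIMESTAMP_SUB(ts, INTERVAL MOD(UNIX_SECONDS(ts), 60) SECOND)
  )
)
WHERE NOT IFNULL(time_lag, 777) < 60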

with the result:

Row     id      time                    action_name     url  
1       111     2020-09-01-09:19:00     First           www.stackoverflow/a12345     
2       111     2020-09-01-12:37:09     Complete        www.stackoverflow/a12345     
3       222     2020-09-01-15:18:05     Third           www.stackoverflow/a2222    

Note: I am still not 100% sure about your use case, but the above is based on what has been discussed in the comments so far.