Join data from an events table with the latest data in a changelog table prior to the event

Question

I want to JOIN data from a changelog-style users table onto an events table with matching IDs.

The tables are as follows:

project_events

Schema

timestamp       TIMESTAMP
event_id        STRING  
user_id         STRING
data            STRING  

Sample data

| timestamp                   | event_id  | user_id | data             |
|-----------------------------|-----------|---------|------------------|
| 2020-08-22 17:01:18.807 UTC | hHZuTE8Y= | ABC123  | {"some":"json" } |
| 2020-08-20 16:57:28.022 UTC | tF5Gky8Q= | ZXY432  | {"foo":"item" }  |
| 2020-08-15 16:44:25.607 UTC | 1dOU8pOo= | ABC123  | {"bar":"val" }   |

users_changelog

Schema

timestamp       TIMESTAMP
event_id        STRING  
operation       STRING  
user_id         STRING
data            STRING  

Sample data

| timestamp                   | event_id  | operation | user_id | data                |
|-----------------------------|-----------|-----------|---------|---------------------|
| 2020-08-30 12:50:59.036 UTC | mGdNKy+o= | DELETE    | ABC123  | {"name":"removed" } |
| 2020-08-20 16:50:59.036 UTC | mGdNKy+o= | UPDATE    | ABC123  | {"name":"final" }   |
| 2020-08-05 20:45:36.936 UTC | mIICo9LY= | UPDATE    | ZXY432  | {"name":"asdf" }    |
| 2020-08-05 20:45:21.023 UTC | nEDKyCks= | UPDATE    | ABC123  | {"name":"other" }   |
| 2020-08-03 12:40:49.036 UTC | GxnbUqQ0= | CREATE    | ABC123  | {"name":"initial" } |
| 1970-01-01 00:00:00 UTC     | 1y+6fVWo= | IMPORT    | ZXY432  | {"name":"test" }    |

Note: operation can be CREATE, UPDATE, DELETE, or IMPORT. A user can be updated many times, so several rows can share the same user_id.

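The goal is to show each event's event_id and data columns together with the data from the latest operation in the users table that matches the ID. With the sample data, the expected result is: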

| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |

I tried the following, but it produces duplicate rows (one for each row in the changelog table with a matching ID):

SELECT
  events.event_id as event_id,events.data as event_data,users.user_id as user_id,users.data as user_data
FROM my_project.my_dataset.project_events as events
LEFT JOIN my_project.my_dataset.users_changelog as users
ON events.user_id = users.user_id AND users.timestamp <= events.timestamp

Answers

Below is for BigQuery Standard SQL

#standardSQL
SELECT
  event_id,
  data AS event_data,
  user_id,
  -- Pick the newest changelog entry strictly before the event.
  (
    SELECT rec.data
    FROM UNNEST(arr) AS rec
    WHERE rec.timestamp < t.timestamp
    ORDER BY rec.timestamp DESC
    LIMIT 1
  ) AS user_data
FROM (
  -- One row per event, carrying all of that user's changelog entries.
  SELECT
    ANY_VALUE(events).*,
    ARRAY_AGG(STRUCT(users.data, users.timestamp)) AS arr
  FROM `my_project.my_dataset.project_events` AS events
  LEFT JOIN `my_project.my_dataset.users_changelog` AS users
    ON events.user_id = users.user_id
  GROUP BY FORMAT('%t', events)
) AS t

If applied to the sample data from the question, the output is

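| Row | event_id  | event_data       | user_id | user_data         |
|-----|-----------|------------------|---------|-------------------|
| 1   | hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| 2   | tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |
| 3   | 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |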
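
A more compact variant of the same idea, offered only as a sketch: keep the timestamp condition in the join and let ARRAY_AGG with ORDER BY and LIMIT 1 pick the newest matching changelog row. The inline CTEs copy the question's sample rows so the query runs without any tables. Two assumptions to note: it uses `<=` as in the question's attempt (the query above uses `<`), and it assumes event_id, data and user_id together identify an event.

```sql
#standardSQL
-- Inline copies of the question's sample rows (stand-ins for the real tables).
WITH project_events AS (
  SELECT * FROM UNNEST([
    STRUCT(TIMESTAMP '2020-08-22 17:01:18.807 UTC' AS `timestamp`,
           'hHZuTE8Y=' AS event_id, 'ABC123' AS user_id, '{"some":"json" }' AS data),
    STRUCT(TIMESTAMP '2020-08-20 16:57:28.022 UTC', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
    STRUCT(TIMESTAMP '2020-08-15 16:44:25.607 UTC', '1dOU8pOo=', 'ABC123', '{"bar":"val" }')
  ])
),
users_changelog AS (
  SELECT * FROM UNNEST([
    STRUCT(TIMESTAMP '2020-08-30 12:50:59.036 UTC' AS `timestamp`, 'mGdNKy+o=' AS event_id,
           'DELETE' AS operation, 'ABC123' AS user_id, '{"name":"removed" }' AS data),
    STRUCT(TIMESTAMP '2020-08-20 16:50:59.036 UTC', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
    STRUCT(TIMESTAMP '2020-08-05 20:45:36.936 UTC', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
    STRUCT(TIMESTAMP '2020-08-05 20:45:21.023 UTC', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
    STRUCT(TIMESTAMP '2020-08-03 12:40:49.036 UTC', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
    STRUCT(TIMESTAMP '1970-01-01 00:00:00 UTC', '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }')
  ])
)
SELECT
  events.event_id,
  events.data AS event_data,
  events.user_id,
  -- Newest changelog row at or before the event; NULL when nothing matches.
  -- IGNORE NULLS is needed because ARRAY_AGG rejects NULL elements, which
  -- also means changelog rows whose data is NULL are skipped.
  ARRAY_AGG(users.data IGNORE NULLS
            ORDER BY users.timestamp DESC
            LIMIT 1)[SAFE_OFFSET(0)] AS user_data
FROM project_events AS events
LEFT JOIN users_changelog AS users
  ON events.user_id = users.user_id
 AND users.timestamp <= events.timestamp
GROUP BY events.event_id, events.data, events.user_id
```

To run it against the real tables, drop the two CTEs and point the FROM and JOIN at `my_project.my_dataset.project_events` and `my_project.my_dataset.users_changelog`.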

I use SQL Server, and took the ROW_NUMBER() route to retrieve the target output:

SELECT event_id, event_data, user_id, user_data
FROM (
    SELECT
        events.event_id AS event_id,
        events.data AS event_data,
        events.user_id AS user_id,
        users.data AS user_data,
        -- Rank each event's matching changelog rows, newest first.
        ROW_NUMBER() OVER (
            PARTITION BY events.user_id, events.event_id
            ORDER BY users.timestamp DESC
        ) AS Count_by_User
    FROM #TEMP1 AS events
    LEFT JOIN #TEMP2 AS users
        ON events.user_id = users.user_id
       AND users.timestamp <= events.timestamp
) AS a
WHERE Count_by_User = 1

Output:

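| event_id  | event_data       | user_id | user_data         |
|-----------|------------------|---------|-------------------|
| 1dOU8pOo= | {"bar":"val" }   | ABC123  | {"name":"other" } |
| hHZuTE8Y= | {"some":"json" } | ABC123  | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" }  | ZXY432  | {"name":"asdf" }  |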
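
Since the question is about BigQuery, the same ROW_NUMBER() idea can probably be written there without a wrapping subquery by using the QUALIFY clause. This is a sketch against the question's table names, not something tested on the asker's data. It partitions by event_id alone, which assumes event_id is unique per event.

```sql
#standardSQL
SELECT
  events.event_id,
  events.data AS event_data,
  events.user_id,
  users.data AS user_data
FROM `my_project.my_dataset.project_events` AS events
LEFT JOIN `my_project.my_dataset.users_changelog` AS users
  ON events.user_id = users.user_id
 AND users.timestamp <= events.timestamp
WHERE TRUE  -- harmless filter; early QUALIFY releases required WHERE, GROUP BY or HAVING
-- Keep only the newest matching changelog row for each event.
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY events.event_id
  ORDER BY users.timestamp DESC
) = 1
```

Events with no earlier changelog row still appear once, with user_data set to NULL, because the LEFT JOIN produces a single unmatched row that gets row number 1.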

Here is the code I used to generate the test tables (in case anyone else wants to verify):

CREATE TABLE #TEMP1
(timestamp VARCHAR(max), event_id VARCHAR(max), user_id VARCHAR(max), data VARCHAR(max))

INSERT INTO #TEMP1 (timestamp, event_id, user_id, data)
VALUES
('2020-08-22 17:01:18.807 UTC', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
('2020-08-20 16:57:28.022 UTC', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
('2020-08-15 16:44:25.607 UTC', '1dOU8pOo=', 'ABC123', '{"bar":"val" }')

CREATE TABLE #TEMP2
(timestamp VARCHAR(max), event_id VARCHAR(max), operation VARCHAR(max), user_id VARCHAR(max), data VARCHAR(max))

INSERT INTO #TEMP2 (timestamp, event_id, operation, user_id, data)
VALUES
('2020-08-30 12:50:59.036 UTC', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
('2020-08-20 16:50:59.036 UTC', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
('2020-08-05 20:45:36.936 UTC', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
('2020-08-05 20:45:21.023 UTC', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
('2020-08-03 12:40:49.036 UTC', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
('1970-01-01 00:00:00 UTC', '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }')
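
As a variation on the ROW_NUMBER() query, SQL Server's OUTER APPLY with TOP (1) can fetch the newest earlier changelog row per event directly. The following is a minimal sketch, not part of the original answer. It is self-contained: the CTE names (events, changelog) and the ts column alias are illustrative, the rows are copied from the question, and the " UTC" suffix is dropped so the strings can be cast to datetime2.

```sql
WITH events AS (
    SELECT CAST(v.ts AS datetime2(3)) AS [timestamp], v.event_id, v.user_id, v.data
    FROM (VALUES
        ('2020-08-22 17:01:18.807', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
        ('2020-08-20 16:57:28.022', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
        ('2020-08-15 16:44:25.607', '1dOU8pOo=', 'ABC123', '{"bar":"val" }')
    ) AS v (ts, event_id, user_id, data)
),
changelog AS (
    SELECT CAST(v.ts AS datetime2(3)) AS [timestamp], v.event_id, v.operation, v.user_id, v.data
    FROM (VALUES
        ('2020-08-30 12:50:59.036', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
        ('2020-08-20 16:50:59.036', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
        ('2020-08-05 20:45:36.936', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
        ('2020-08-05 20:45:21.023', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
        ('2020-08-03 12:40:49.036', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
        ('1970-01-01 00:00:00',     '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }')
    ) AS v (ts, event_id, operation, user_id, data)
)
SELECT
    e.event_id,
    e.data AS event_data,
    e.user_id,
    u.data AS user_data
FROM events AS e
-- For each event, look up the single newest changelog row at or before it.
-- OUTER APPLY keeps events that have no such row (user_data is then NULL).
OUTER APPLY (
    SELECT TOP (1) c.data
    FROM changelog AS c
    WHERE c.user_id = e.user_id
      AND c.[timestamp] <= e.[timestamp]
    ORDER BY c.[timestamp] DESC
) AS u;
```

Casting to datetime2 avoids relying on string ordering of the timestamps, which the VARCHAR(max) temp tables depend on.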
