Problem description
I'm looking to JOIN data from a user changelog-style table onto an events table with matching IDs.
The tables are as follows:
project_events
Schema
timestamp TIMESTAMP
event_id STRING
user_id STRING
data STRING
Sample data
| timestamp | event_id | user_id | data |
|-----------------------------|-----------|---------|------------------|
| 2020-08-22 17:01:18.807 UTC | hHZuTE8Y= | ABC123 | {"some":"json" } |
| 2020-08-20 16:57:28.022 UTC | tF5Gky8Q= | ZXY432 | {"foo":"item" } |
| 2020-08-15 16:44:25.607 UTC | 1dOU8pOo= | ABC123 | {"bar":"val" } |
users_changelog
Schema
timestamp TIMESTAMP
event_id STRING
operation STRING
user_id STRING
data STRING
Sample data
| timestamp | event_id | operation | user_id | data |
|-----------------------------|-----------|-----------|---------|---------------------|
| 2020-08-30 12:50:59.036 UTC | mGdNKy+o= | DELETE | ABC123 | {"name":"removed" } |
| 2020-08-20 16:50:59.036 UTC | mGdNKy+o= | UPDATE | ABC123 | {"name":"final" } |
| 2020-08-05 20:45:36.936 UTC | mIICo9LY= | UPDATE | ZXY432 | {"name":"asdf" } |
| 2020-08-05 20:45:21.023 UTC | nEDKyCks= | UPDATE | ABC123 | {"name":"other" } |
| 2020-08-03 12:40:49.036 UTC | GxnbUqQ0= | CREATE | ABC123 | {"name":"initial" } |
| 1970-01-01 00:00:00 UTC | 1y+6fVWo= | IMPORT | ZXY432 | {"name":"test" } |
Note: operation can be CREATE, UPDATE, DELETE, or IMPORT. Since a user can be updated multiple times, there can be multiple rows with the same user_id.
The goal is to show, next to each event's event_id and data columns, the data from the latest matching operation in the users table as of the event's timestamp. With the sample data, the expected result would be:
| event_id | event_data | user_id | user_data |
|-----------|------------------|---------|-------------------|
| hHZuTE8Y= | {"some":"json" } | ABC123 | {"name":"final" } |
| tF5Gky8Q= | {"foo":"item" } | ZXY432 | {"name":"asdf" } |
| 1dOU8pOo= | {"bar":"val" } | ABC123 | {"name":"other" } |
I tried the following, but it produces duplicate rows (one for each row in the changelog table with a matching ID):
SELECT
  events.event_id AS event_id,
  events.data AS event_data,
  users.user_id AS user_id,
  users.data AS user_data
FROM my_project.my_dataset.project_events AS events
LEFT JOIN my_project.my_dataset.users_changelog AS users
  ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
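To see the fan-out concretely, here is a minimal sketch using SQLite from Python with the sample data above (the ` UTC` suffix is dropped so the ISO-format timestamps compare correctly as strings):

```python
import sqlite3

# In-memory tables mirroring the question's schemas.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project_events  (timestamp TEXT, event_id TEXT, user_id TEXT, data TEXT);
CREATE TABLE users_changelog (timestamp TEXT, event_id TEXT, operation TEXT, user_id TEXT, data TEXT);
""")
conn.executemany("INSERT INTO project_events VALUES (?,?,?,?)", [
    ("2020-08-22 17:01:18.807", "hHZuTE8Y=", "ABC123", '{"some":"json"}'),
    ("2020-08-20 16:57:28.022", "tF5Gky8Q=", "ZXY432", '{"foo":"item"}'),
    ("2020-08-15 16:44:25.607", "1dOU8pOo=", "ABC123", '{"bar":"val"}'),
])
conn.executemany("INSERT INTO users_changelog VALUES (?,?,?,?,?)", [
    ("2020-08-30 12:50:59.036", "mGdNKy+o=", "DELETE", "ABC123", '{"name":"removed"}'),
    ("2020-08-20 16:50:59.036", "mGdNKy+o=", "UPDATE", "ABC123", '{"name":"final"}'),
    ("2020-08-05 20:45:36.936", "mIICo9LY=", "UPDATE", "ZXY432", '{"name":"asdf"}'),
    ("2020-08-05 20:45:21.023", "nEDKyCks=", "UPDATE", "ABC123", '{"name":"other"}'),
    ("2020-08-03 12:40:49.036", "GxnbUqQ0=", "CREATE", "ABC123", '{"name":"initial"}'),
    ("1970-01-01 00:00:00",     "1y+6fVWo=", "IMPORT", "ZXY432", '{"name":"test"}'),
])

# The naive join: every changelog row at or before the event matches,
# so each event fans out into several rows instead of exactly one.
rows = conn.execute("""
    SELECT events.event_id, events.data, users.user_id, users.data
    FROM project_events AS events
    LEFT JOIN users_changelog AS users
      ON events.user_id = users.user_id
     AND users.timestamp <= events.timestamp
""").fetchall()
print(len(rows))  # 7 rows for only 3 events
```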
Solution
Below is for BigQuery Standard SQL:
#standardSQL
SELECT
  event_id,
  data AS event_data,
  user_id,
  (
    SELECT data
    FROM UNNEST(arr) rec
    WHERE rec.timestamp < t.timestamp
    ORDER BY rec.timestamp DESC
    LIMIT 1
  ) AS user_data
FROM (
  SELECT
    ANY_VALUE(events).*,
    ARRAY_AGG(STRUCT(users.data, users.timestamp)) arr
  FROM `my_project.my_dataset.project_events` AS events
  LEFT JOIN `my_project.my_dataset.users_changelog` AS users
  ON events.user_id = users.user_id
  GROUP BY FORMAT('%t', events)
) t
Applied to the sample data from the question, the output is:
Row event_id event_data user_id user_data
1 hHZuTE8Y= {"some":"json" } ABC123 {"name":"final" }
2 tF5Gky8Q= {"foo":"item" } ZXY432 {"name":"asdf" }
3 1dOU8pOo= {"bar":"val" } ABC123 {"name":"other" }
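The logic of the query above can be mirrored in plain Python as a sketch on the question's sample data (` UTC` dropped so the ISO timestamps compare as strings): the ARRAY_AGG step collects every changelog (timestamp, data) pair per user, and the correlated subselect keeps the latest pair strictly before each event's timestamp.

```python
# Sample rows from the question.
events = [
    ("2020-08-22 17:01:18.807", "hHZuTE8Y=", "ABC123", '{"some":"json"}'),
    ("2020-08-20 16:57:28.022", "tF5Gky8Q=", "ZXY432", '{"foo":"item"}'),
    ("2020-08-15 16:44:25.607", "1dOU8pOo=", "ABC123", '{"bar":"val"}'),
]
changelog = [
    ("2020-08-30 12:50:59.036", "ABC123", '{"name":"removed"}'),
    ("2020-08-20 16:50:59.036", "ABC123", '{"name":"final"}'),
    ("2020-08-05 20:45:36.936", "ZXY432", '{"name":"asdf"}'),
    ("2020-08-05 20:45:21.023", "ABC123", '{"name":"other"}'),
    ("2020-08-03 12:40:49.036", "ABC123", '{"name":"initial"}'),
    ("1970-01-01 00:00:00",     "ZXY432", '{"name":"test"}'),
]

# ARRAY_AGG step: collect every (timestamp, data) pair per user_id.
arr = {}
for ts, uid, data in changelog:
    arr.setdefault(uid, []).append((ts, data))

# Correlated-subselect step: for each event, take the latest pair
# strictly before the event's timestamp (ORDER BY ... DESC LIMIT 1).
result = []
for ev_ts, event_id, uid, ev_data in events:
    earlier = [p for p in arr.get(uid, []) if p[0] < ev_ts]
    user_data = max(earlier)[1] if earlier else None
    result.append((event_id, ev_data, uid, user_data))

for row in result:
    print(row)
```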
I use SQL Server, and went the ROW_NUMBER() route to retrieve the target output:
SELECT event_id, event_data, user_id, user_data
FROM (
  SELECT
    events.event_id AS event_id,
    events.data AS event_data,
    users.user_id AS user_id,
    users.data AS user_data,
    ROW_NUMBER() OVER (PARTITION BY users.user_id, events.event_id ORDER BY users.timestamp DESC) AS Count_by_User
  FROM #TEMP1 AS events
  LEFT JOIN #TEMP2 AS users
    ON events.user_id = users.user_id AND users.timestamp <= events.timestamp
) AS a
WHERE Count_by_User = 1
Output:
event_id event_data user_id user_data
1dOU8pOo= {"bar":"val" } ABC123 {"name":"other" }
hHZuTE8Y= {"some":"json" } ABC123 {"name":"final" }
tF5Gky8Q= {"foo":"item" } ZXY432 {"name":"asdf" }
Here is the code I used to generate the test tables, in case anyone else wants to verify:
create table #TEMP1
(timestamp VARCHAR(max), event_id VARCHAR(max), user_id VARCHAR(max), data VARCHAR(max))

INSERT INTO #TEMP1 (timestamp, event_id, user_id, data)
VALUES
('2020-08-22 17:01:18.807 UTC', 'hHZuTE8Y=', 'ABC123', '{"some":"json" }'),
('2020-08-20 16:57:28.022 UTC', 'tF5Gky8Q=', 'ZXY432', '{"foo":"item" }'),
('2020-08-15 16:44:25.607 UTC', '1dOU8pOo=', 'ABC123', '{"bar":"val" }')

create table #TEMP2
(timestamp VARCHAR(max), event_id VARCHAR(max), operation VARCHAR(max), user_id VARCHAR(max), data VARCHAR(max))

INSERT INTO #TEMP2 (timestamp, event_id, operation, user_id, data)
VALUES
('2020-08-30 12:50:59.036 UTC', 'mGdNKy+o=', 'DELETE', 'ABC123', '{"name":"removed" }'),
('2020-08-20 16:50:59.036 UTC', 'mGdNKy+o=', 'UPDATE', 'ABC123', '{"name":"final" }'),
('2020-08-05 20:45:36.936 UTC', 'mIICo9LY=', 'UPDATE', 'ZXY432', '{"name":"asdf" }'),
('2020-08-05 20:45:21.023 UTC', 'nEDKyCks=', 'UPDATE', 'ABC123', '{"name":"other" }'),
('2020-08-03 12:40:49.036 UTC', 'GxnbUqQ0=', 'CREATE', 'ABC123', '{"name":"initial" }'),
('1970-01-01 00:00:00 UTC', '1y+6fVWo=', 'IMPORT', 'ZXY432', '{"name":"test" }')
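For anyone without SQL Server at hand, the same ROW_NUMBER() filter can be checked with SQLite (3.25+ for window functions) from Python; the only changes from the T-SQL above are that the temp tables are renamed to temp1/temp2 and the ` UTC` suffix is dropped so the ISO timestamps compare correctly as strings:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE temp1 (timestamp TEXT, event_id TEXT, user_id TEXT, data TEXT);
CREATE TABLE temp2 (timestamp TEXT, event_id TEXT, operation TEXT, user_id TEXT, data TEXT);
""")
conn.executemany("INSERT INTO temp1 VALUES (?,?,?,?)", [
    ("2020-08-22 17:01:18.807", "hHZuTE8Y=", "ABC123", '{"some":"json"}'),
    ("2020-08-20 16:57:28.022", "tF5Gky8Q=", "ZXY432", '{"foo":"item"}'),
    ("2020-08-15 16:44:25.607", "1dOU8pOo=", "ABC123", '{"bar":"val"}'),
])
conn.executemany("INSERT INTO temp2 VALUES (?,?,?,?,?)", [
    ("2020-08-30 12:50:59.036", "mGdNKy+o=", "DELETE", "ABC123", '{"name":"removed"}'),
    ("2020-08-20 16:50:59.036", "mGdNKy+o=", "UPDATE", "ABC123", '{"name":"final"}'),
    ("2020-08-05 20:45:36.936", "mIICo9LY=", "UPDATE", "ZXY432", '{"name":"asdf"}'),
    ("2020-08-05 20:45:21.023", "nEDKyCks=", "UPDATE", "ABC123", '{"name":"other"}'),
    ("2020-08-03 12:40:49.036", "GxnbUqQ0=", "CREATE", "ABC123", '{"name":"initial"}'),
    ("1970-01-01 00:00:00",     "1y+6fVWo=", "IMPORT", "ZXY432", '{"name":"test"}'),
])

# Rank changelog rows per (user, event), newest first, and keep rank 1.
rows = conn.execute("""
    SELECT event_id, event_data, user_id, user_data
    FROM (
        SELECT
            events.event_id AS event_id,
            events.data     AS event_data,
            users.user_id   AS user_id,
            users.data      AS user_data,
            ROW_NUMBER() OVER (
                PARTITION BY users.user_id, events.event_id
                ORDER BY users.timestamp DESC
            ) AS count_by_user
        FROM temp1 AS events
        LEFT JOIN temp2 AS users
          ON events.user_id = users.user_id
         AND users.timestamp <= events.timestamp
    ) AS a
    WHERE count_by_user = 1
    ORDER BY event_id
""").fetchall()
for row in rows:
    print(row)
```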