Athena/Presto: UNNEST with a left join

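Problem

I have 2 JSON arrays that need to be unnested and then joined on a key inside the JSON structure. Easy in theory, but without a "left join unnest" capability it all gets messy.

I have achieved what I want by grouping the results, but I am concerned that it performs 2 cross joins, effectively generating thousands of superfluous rows (in the live environment) before filtering them out again.

So my question is really about finding a more efficient strategy for the same logic. I am well aware that my Presto experience and knowledge are still in their infancy!

Thanks for any guidance!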

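Workings:

Basic logic: every item in the 'left' array has a $.id value. For some 'left' items there is a matching 'right' item with the same value in $.a.id.

Examples:

  1. The first SQL and result below show the setup, though not the desired result.
  2. The second set shows my current solution.

(1) Raw result of the cross join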

with cte as (
    select
        123 as record_id,
        '[ {"id":"01","key1":["val1"]},{"id":"02","key1":["val2"]},{"id":"03","key1":["val3"]} ]' as "left",
        '[ {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} ]' as "right"
)
select
    record_id,
    l.i as "left",
    r.i as "right",
    json_extract(l.i, '$.id') as left_id,
    json_extract(r.i, '$.a.id') as right_id
from
    cte,
    unnest(cast(json_parse("left") as array(json))) as l(i),    -- left array
    unnest(cast(json_parse("right") as array(json))) as r(i)    -- right array

Output

record_id | left | right | left_id | right_id
123 | {"id":"01","key1":["val1"]} | {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}} | "01" | "02"
123 | {"id":"01","key1":["val1"]} | {"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} | "01" | "01"
123 | {"id":"02","key1":["val2"]} | {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}} | "02" | "02"
123 | {"id":"02","key1":["val2"]} | {"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} | "02" | "01"
123 | {"id":"03","key1":["val3"]} | {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}} | "03" | "02"
123 | {"id":"03","key1":["val3"]} | {"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} | "03" | "01"

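(2) Current solution

-- uses the cte defined in (1)
select
    record_id,
    l.i as "left",
    max( if(json_extract(l.i, '$.id') = json_extract(r.i, '$.a.id'), json_format(r.i), null) ) as match
from
    cte,
    unnest(cast(json_parse("left") as array(json))) as l(i),    -- left array
    unnest(cast(json_parse("right") as array(json))) as r(i)    -- right array
group by
    record_id, l.i

Output

record_id | left | match
123 | {"id":"01","key1":["val1"]} | {"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}}
123 | {"id":"02","key1":["val2"]} | {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}}
123 | {"id":"03","key1":["val3"]} |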

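Solution

Unnest each array in its own CTE, then left join the two CTEs. This eliminates the cross join between the two arrays, although the code is a bit longer: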

with cte as (
    select
        123 as record_id,
        '[ {"id":"01","key1":["val1"]},{"id":"02","key1":["val2"]},{"id":"03","key1":["val3"]} ]' as "left",
        '[ {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} ]' as "right"
), "left" as (
    select
        record_id, l.i as "left", json_extract(l.i, '$.id') as left_id
    from
        cte, unnest(cast(json_parse("left") as array(json))) as l(i)    -- left array
), "right" as (
    select
        record_id, r.i as "right", json_extract(r.i, '$.a.id') as right_id
    from
        cte, unnest(cast(json_parse("right") as array(json))) as r(i)    -- right array
)
select
    l.record_id, l."left", r."right", l.left_id, r.right_id
from
    "left" l
    left join "right" r on l.record_id = r.record_id and l.left_id = r.right_id

Result:

record_id | left | right | left_id | right_id
123 | {"id":"01","key1":["val1"]} | {"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} | "01" | "01"
123 | {"id":"02","key1":["val2"]} | {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}} | "02" | "02"
123 | {"id":"03","key1":["val3"]} | \N | "03" | \N
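
As a possible variation (not part of the approach above, and untested here), the right array could be left un-unnested and searched per left item with a lambda. This is only a sketch: it assumes an engine version where the array functions filter and element_at and lambda expressions are available, and it keeps only the first matching right item if several share the same id. It still compares every left item with every right item, but it never materialises the left x right row combinations.

```sql
with cte as (
    select
        123 as record_id,
        '[ {"id":"01","key1":["val1"]},{"id":"02","key1":["val2"]},{"id":"03","key1":["val3"]} ]' as "left",
        '[ {"a":{"id":"02","key1":["apples"]},"b":{"lala":"bananas"}},{"a":{"id":"01","key1":["one"]},"b":{"lala":"oneone"}} ]' as "right"
)
select
    record_id,
    l.i as "left",
    -- keep the right items whose $.a.id equals this left item's $.id,
    -- then take the first one; element_at returns NULL for an empty array,
    -- which gives the "left join" behaviour for unmatched left items
    element_at(
        filter(
            cast(json_parse("right") as array(json)),
            r -> json_extract_scalar(r, '$.a.id') = json_extract_scalar(l.i, '$.id')
        ),
        1
    ) as "right"
from
    cte
    cross join unnest(cast(json_parse("left") as array(json))) as l(i)    -- only the left array is unnested
```

json_extract_scalar is used for the comparison so that the ids are compared as plain varchar values rather than as JSON values. Whether this performs better than the two-CTE left join on real data would need to be measured.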