Select the top 3 by an element in a JSON array

Problem

The source data is:

user_id video_interest
1 [{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]
2 [{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]

I need to keep only the elements with score > 0 and then, for each user_id, select the top 3 video_interest elements by score in descending order.

Solution

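Explode the JSON array, extract the score, and compute two window values per user: the maximum score (used to order the final output by score descending, if needed) and a row_number ordered by score, which is used to keep the top 3. Then collect the array again and, if necessary, concatenate it into a STRING. See the comments in the code. I sorted both the array and the whole output because it was not initially clear what exactly should be sorted, the array or the final output. Remove the max_score ordering if you do not need it.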
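
The explode step can be tried in isolation first. The following is a minimal sketch, assuming a Hive version that allows a select without a from clause; the ',,,' marker is an assumption and can be any string that cannot occur inside an array element.

```sql
-- Sketch: turn a JSON array string into one row per array element.
-- ',,,' is an assumed marker; a plain ',' would also split inside each element.
select explode(
         split(
           regexp_replace(
             regexp_replace('[{"category":"a","score":1},{"category":"b","score":2}]',
                            '\\[|\\]', ''),   -- strip the surrounding [ ]
             '\\},\\{', '},,,{'),             -- mark the boundaries between elements
           ',,,')                             -- split on the marker only
       ) as category_score;
```

Each resulting row should hold one complete JSON object, which get_json_object can then read with the path '$.score'.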

Demo:

with mytable as (
select stack(2,
  1,'[{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]',
  2,'[{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]'
) as (user_id,video_interest)
)

select --collect array and convert to JSON string
      user_id,max_score,concat('[',concat_ws(',',collect_list(category_score)),']') as video_interest
from
(
select user_id,category_score,score,max_score
from
(
select --extract score, filter and number the rows
      --cast to int so scores compare numerically, not as strings (assumes integer scores)
      user_id,vi.category_score,
      cast(get_json_object(vi.category_score,'$.score') as int) as score,
      row_number() over(partition by user_id order by cast(get_json_object(vi.category_score,'$.score') as int) desc) rn,
      max(cast(get_json_object(vi.category_score,'$.score') as int)) over (partition by user_id) max_score
from
(--prepare for exploding array
select user_id,regexp_replace(regexp_replace(video_interest,'\\[|\\]',''), --remove []
                          '\\},\\{','},,,{') as video_interest --replace , between array elements with ,,, to split
  from mytable
)s
--split and explode
lateral view outer explode(split(video_interest,',,,')) vi as category_score
where cast(get_json_object(vi.category_score,'$.score') as int)>0
)s
where rn<=3 --filter top 3
distribute by user_id sort by score desc --sort the collection, remove if not necessary
)s
group by user_id,max_score
order by max_score desc --sort users by max_score desc, remove if not necessary

Result:

user_id max_score   video_interest
1       4           [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2       2           [{"category":"f","score":2},{"category":"e","score":1}]

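First, I explode video_interest and extract category and score as separate fields. Second, I use row_number() partitioned by user_id and ordered by score descending, so that each row is numbered by its position within its group, and then filter on that number.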

select  user_id,collect_list(pos) as first_video_interest_top3
from    (
            select  user_id,category,score,pos,row_number() over(
                    partition by
                                user_id
                    order by
                                score desc
                    ) rNum
            from    (
                        select  user_id,pos.category,pos.score,pos
                        from    myData
                        lateral view explode(video_interest) e as pos
                        ) t1
            where   score > 0
            ) t2
where   rNum <= 3
group by
            user_id
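
The query above calls explode directly on video_interest, so it only works if that column is an array of structs rather than a JSON string. The following is a minimal sketch for trying it without a real table. The inline myData rows are hypothetical test data built with array and named_struct, and it assumes a Hive version that supports a select without a from clause and collect_list on struct values.

```sql
-- Hypothetical test data: video_interest typed as array<struct<category:string,score:int>>
with myData as (
  select 1 as user_id,
         array(named_struct('category','a','score',1),
               named_struct('category','b','score',2),
               named_struct('category','c','score',3),
               named_struct('category','d','score',4)) as video_interest
  union all
  select 2 as user_id,
         array(named_struct('category','e','score',1),
               named_struct('category','f','score',2),
               named_struct('category','g','score',-3)) as video_interest
)
select  user_id, collect_list(pos) as first_video_interest_top3
from    (
            select  user_id, category, score, pos,
                    row_number() over(partition by user_id order by score desc) rNum
            from    (
                        -- one row per array element; pos is the struct itself
                        select  user_id, pos.category, pos.score, pos
                        from    myData
                        lateral view explode(video_interest) e as pos
                    ) t1
            where   score > 0          -- drop non-positive scores before ranking
        ) t2
where   rNum <= 3                      -- keep the top 3 per user
group by
            user_id;
```

collect_list does not promise any particular element order, so if the order inside the collected array matters, an explicit distribute by user_id sort by score desc in the subquery, as in the first answer, is probably the safer choice.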