Problem description
The source data is:
user_id video_interest
1 [{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]
2 [{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]
I need to keep only the entries with score > 0, and then select the top 3 video_interest entries for each user_id, ordered by score descending.
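Solution
Explode the JSON array, extract the score, and compute two window values: the maximum score per user (used to order the final output by score descending, if necessary) and a row_number ordered by score (used to keep the top 3). Then collect the array again and, if necessary, concatenate it into a STRING. See the comments in the code. I added sorting of both the array and the whole output because it was initially unclear what exactly should be sorted: the array or the final output. If you do not need it, remove the max_score sorting.
Demo: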
with mytable as (
  select stack(2,
    1, '[{"category":"a","score":1},{"category":"b","score":2},{"category":"c","score":3},{"category":"d","score":4}]',
    2, '[{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]'
  ) as (user_id, video_interest)
)
select --collect array and convert to JSON string
       user_id, max_score,
       concat('[', concat_ws(',', collect_list(category_score)), ']') as video_interest
from
(
  select user_id, category_score, score, max_score
  from
  (
    select --extract score, filter and sort
           --the cast assumes integer scores; it makes the ordering numeric rather than lexical
           user_id, vi.category_score,
           cast(get_json_object(vi.category_score, '$.score') as int) as score,
           row_number() over (partition by user_id order by cast(get_json_object(vi.category_score, '$.score') as int) desc) rn,
           max(cast(get_json_object(vi.category_score, '$.score') as int)) over (partition by user_id) max_score
    from
    (--prepare for exploding array
      select user_id,
             regexp_replace(regexp_replace(video_interest, '\\[|\\]', ''), --remove []
                            '\\},\\{', '},,,{') as video_interest --replace , between array elements with ,,, to split
      from mytable
    ) s
    --split and explode
    lateral view outer explode(split(video_interest, ',,,')) vi as category_score
    where cast(get_json_object(vi.category_score, '$.score') as int) > 0
  ) s
  where rn <= 3 --filter top 3
  distribute by user_id sort by score desc --sort collection, remove if not necessary
) s
group by user_id, max_score
order by max_score desc --sort users by max_score desc, remove if not necessary
Result:
user_id max_score video_interest
1 4 [{"category":"d","score":4},{"category":"c","score":3},{"category":"b","score":2}]
2 2 [{"category":"f","score":2},{"category":"e","score":1}]
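The array-splitting step in the query above is the least obvious part, so here is a minimal Python sketch of the same idea, outside Hive: strip the brackets, mark the boundary between elements with a delimiter that cannot occur inside a flat element (`,,,` is assumed here), and split on it. The helper name `split_json_array` is hypothetical, and the sketch assumes a flat array of objects with no nested arrays or brackets inside string values.

```python
import json
import re


def split_json_array(text):
    """Split a flat JSON array string into its element strings."""
    inner = re.sub(r"\[|\]", "", text)         # remove [ and ]
    marked = re.sub(r"\},\{", "},,,{", inner)  # mark element boundaries with ,,,
    return marked.split(",,,")                 # one string per array element


raw = '[{"category":"e","score":1},{"category":"f","score":2},{"category":"g","score":-3}]'
elements = split_json_array(raw)

# Each element is itself valid JSON, so the score can be extracted per element,
# which is what get_json_object(..., '$.score') does in the query.
scores = [json.loads(e)["score"] for e in elements]

print(elements)
print(scores)
```

Splitting on a plain `,` would not work, because each element also contains a comma between its `category` and `score` fields.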
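Alternatively (the query below accesses pos.category and pos.score, so it assumes video_interest is stored as an array of structs rather than as a JSON string): first, I explode video_interest and expose its category and score as separate fields. Second, I use the row_number() function, partitioned by user_id and ordered by score descending, to number each row within its group, and then keep only the rows where that number is at most 3.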
select user_id, collect_list(pos) as first_video_interest_top3
from (
  select user_id, category, score, pos,
         row_number() over (partition by user_id order by score desc) rNum
  from (
    select user_id, pos.category, pos.score, pos
    from myData
    lateral view explode(video_interest) e as pos
  ) t1
  where score > 0
) t2
where rNum <= 3
group by user_id
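To check what this query is expected to return, the filter and row_number() logic can be mirrored in plain Python. This is only a sketch under assumptions: the `rows` list is hypothetical data shaped like myData after the explode, using the sample values from the question, and `top_n_per_user` is an illustrative name, not part of any library.

```python
from collections import defaultdict

# Hypothetical rows shaped like myData after the explode: (user_id, category, score)
rows = [
    (1, "a", 1), (1, "b", 2), (1, "c", 3), (1, "d", 4),
    (2, "e", 1), (2, "f", 2), (2, "g", -3),
]


def top_n_per_user(rows, n=3):
    """Keep positive scores, then the n highest-scoring entries per user."""
    by_user = defaultdict(list)
    for user_id, category, score in rows:
        if score > 0:  # where score > 0
            by_user[user_id].append((category, score))

    result = {}
    for user_id, items in by_user.items():
        # partition by user_id order by score desc
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        result[user_id] = ranked[:n]  # rNum <= n
    return result


top3 = top_n_per_user(rows)
print(top3)
```

For the sample data this keeps d, c, b for user 1 and f, e for user 2, which matches the result shown for the first answer.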