Problem description
My input is as follows:
+----+-------+--------+
| id | txt1 | txt2 |
+----+-------+--------+
| 1 | null | aaaaaa |
+----+-------+--------+
| 1 | bbbbb | null |
+----+-------+--------+
| 1 | cccc | null |
+----+-------+--------+
| 1 | dddd | null |
+----+-------+--------+
| 1 | null | eeeee |
+----+-------+--------+
and the expected output is:
+----+-------+--------+
| id | txt1 | txt2 |
+----+-------+--------+
| 1 | bbbbb | aaaaaa |
+----+-------+--------+
| 1 | cccc | eeeee |
+----+-------+--------+
| 1 | dddd | null |
+----+-------+--------+
How can I achieve this in Hive? I tried the query below, but it behaves like a cross join.
select distinct id,myq,myq1 from (
select id,collect_set(txt1) as txt1_set,collect_set(txt2) as txt2_set
from add_flat group by addr_who)
lateral view explode(txt1_set) q as myq
lateral view explode(txt2_set) q1 as myq1
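Solution
This is a bit tricky. You want to "stack" the columns. One approach is to unpivot the values, filter out the nulls, and re-aggregate: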
select id,
       max(case when which = 1 then txt end) as txt1,
       max(case when which = 2 then txt end) as txt2
from (select tt.*,
             -- number the non-null values separately per id and per source column
             row_number() over (partition by id, which order by txt) as seqnum
      from ((select id, txt1 as txt, 1 as which
             from t
            ) union all
            (select id, txt2 as txt, 2 as which
             from t
            )
           ) tt
      where txt is not null
     ) tt
group by id, seqnum
order by id, seqnum;
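Note that the values in each column are ordered by the order by inside row_number(). SQL tables represent unordered sets (technically multisets), so this is the only way to control the ordering.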
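To make the unpivot / number / re-aggregate idea easier to follow, here is a minimal Python sketch of the same logic outside of SQL. The function name stack_columns and the tuple layout (id, txt1, txt2) are assumptions made for illustration only; sorting each list plays the role of the order by inside row_number().

```python
from collections import defaultdict
from itertools import zip_longest


def stack_columns(rows):
    """Pair the non-null txt1 and txt2 values of each id by their rank."""
    txt1_by_id = defaultdict(list)
    txt2_by_id = defaultdict(list)

    # "Unpivot" and filter: keep only non-null values, per id and per column.
    for row_id, txt1, txt2 in rows:
        if txt1 is not None:
            txt1_by_id[row_id].append(txt1)
        if txt2 is not None:
            txt2_by_id[row_id].append(txt2)

    result = []
    for row_id in sorted(set(txt1_by_id) | set(txt2_by_id)):
        # Sorting stands in for row_number() over (... order by txt).
        left = sorted(txt1_by_id[row_id])
        right = sorted(txt2_by_id[row_id])
        # "Re-aggregate": values with the same rank share one output row;
        # the shorter column is padded with None (null).
        for a, b in zip_longest(left, right):
            result.append((row_id, a, b))
    return result


rows = [
    (1, None, "aaaaaa"),
    (1, "bbbbb", None),
    (1, "cccc", None),
    (1, "dddd", None),
    (1, None, "eeeee"),
]
stacked = stack_columns(rows)
for row in stacked:
    print(row)
```

With the sample input from the question, this produces the three rows of the expected output, with None where the shorter column runs out of values.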
You can use full join and row_number as follows:
select coalesce(t1.id, t2.id) as id, t1.txt1, t2.txt2
from
  (select t.*, row_number() over (partition by id order by txt1) as rn
   from t where txt1 is not null) t1
full join
  (select t.*, row_number() over (partition by id order by txt2) as rn
   from t where txt2 is not null) t2
on t1.id = t2.id and t1.rn = t2.rn
Alternatively, you can use posexplode and a FULL JOIN on the position:
with mytable as (
select stack (5,
1,null,'aaaaaa',
1,'bbbbb',null,
1,'cccc',null,
1,'dddd',null,
1,null,'eeeee'
) as ( id,txt1,txt2)
),agg as (
select id,collect_set(txt1) as txt1_set,collect_set(txt2) as txt2_set
from mytable group by id
)
select coalesce(a.id,b.id) id,a.myq,b.myq1
from
(select id,myq,pos
from agg
lateral view outer posexplode(txt1_set) q as pos,myq
) a full join
(select id,myq1,pos
from agg
lateral view outer posexplode(txt2_set) q as pos,myq1
) b ON a.id=b.id and a.pos=b.pos
Result:
id a.myq b.myq1
1 bbbbb aaaaaa
1 cccc eeeee
1 dddd NULL
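As a rough illustration of why the query in the question behaves like a cross join while the posexplode version does not: chaining two lateral view explode calls emits every combination of the two arrays, whereas joining on the position pairs elements index by index. The Python sketch below models both behaviours. The two lists stand in for the collect_set results, and their element order is assumed here for the sake of the example.

```python
from itertools import product

# Stand-ins for the two collected arrays (element order assumed).
txt1_set = ["bbbbb", "cccc", "dddd"]
txt2_set = ["aaaaaa", "eeeee"]

# Two chained explodes: every txt1 value is combined with every txt2 value.
cross = list(product(txt1_set, txt2_set))

# posexplode: each value carries its position within the array.
left_by_pos = dict(enumerate(txt1_set))
right_by_pos = dict(enumerate(txt2_set))

# Full join on position: keep every position seen on either side,
# filling the missing side with None (null).
positions = sorted(set(left_by_pos) | set(right_by_pos))
paired = [(left_by_pos.get(p), right_by_pos.get(p)) for p in positions]

print(len(cross))
for pair in paired:
    print(pair)
```

The cross product has 3 x 2 = 6 rows, while the positional pairing yields only as many rows as the longer array, which matches the result shown above.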