Collapsing rows in Hive

Problem description

My input is as follows:

+----+-------+--------+
| id | txt1  | txt2   |
+----+-------+--------+
| 1  | null  | aaaaaa |
+----+-------+--------+
| 1  | bbbbb | null   |
+----+-------+--------+
| 1  | cccc  | null   |
+----+-------+--------+
| 1  | dddd  | null   |
+----+-------+--------+
| 1  | null  | eeeee  |
+----+-------+--------+

and the expected output is:

+----+-------+--------+
| id | txt1  | txt2   |
+----+-------+--------+
| 1  | bbbbb | aaaaaa |
+----+-------+--------+
| 1  | cccc  | eeeee  |
+----+-------+--------+
| 1  | dddd  | null   |
+----+-------+--------+

How can I achieve this in Hive? I tried the query below, but it behaves like a cross join.

select distinct id,myq,myq1 from (
select id,collect_set(txt1)  as txt1_set,collect_set(txt2)  as txt2_set
from add_flat group by addr_who)
lateral view explode(txt1_set) q as myq
lateral view explode(txt2_set) q1 as myq1

Solutions

This is a bit tricky. You want to "stack" the columns. One approach is to unpivot the values, filter out the nulls, and re-aggregate:

select id,
       max(case when which = 1 then txt end) as txt1,
       max(case when which = 2 then txt end) as txt2
from (select tt.*,
             -- number the non-null values per id and per source column
             row_number() over (partition by id, which order by txt) as seqnum
      from (
            -- unpivot: one row per value, tagged with the column it came from
            select id, txt1 as txt, 1 as which
            from t
            union all
            select id, txt2 as txt, 2 as which
            from t
           ) tt
      where txt is not null
     ) tt
group by id, seqnum
order by id, seqnum;

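Note that the values within each column come out in the order defined by the order by inside row_number(). SQL tables represent unordered sets (technically multisets), so an explicit order by is the only way to control that ordering.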
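
As a sketch of the ordering point: if the table had a column recording the intended order (the hypothetical `ts` below, which is not part of the original table), it could take the place of `txt` in the window's order by. This is an assumption for illustration only, following the same unpivot, filter, re-aggregate pattern:

```sql
-- Sketch only: `ts` is a hypothetical ordering column, not in the original table.
select id,
       max(case when which = 1 then txt end) as txt1,
       max(case when which = 2 then txt end) as txt2
from (select u.*,
             -- values are numbered in `ts` order instead of alphabetical order
             row_number() over (partition by id, which order by ts) as seqnum
      from (select id, ts, txt1 as txt, 1 as which from t
            union all
            select id, ts, txt2 as txt, 2 as which from t
           ) u
      where txt is not null
     ) x
group by id, seqnum
order by id, seqnum;
```

Without such a column, sorting by the value itself (as above) is the only deterministic choice available.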

You can use a full join together with row_number, as follows:

Select coalesce(t1.id, t2.id) as id, t1.txt1, t2.txt2
From
  -- number the non-null txt1 values within each id
  (Select t.*, row_number() over (partition by id order by txt1) as rn
   From t where txt1 is not null) t1
Full join
  -- number the non-null txt2 values within each id
  (Select t.*, row_number() over (partition by id order by txt2) as rn
   From t where txt2 is not null) t2
On t1.id = t2.id and t1.rn = t2.rn;

You need posexplode and a FULL JOIN on the position:

with mytable as (
-- sample data matching the input table in the question
select stack (5,
  1, null,    'aaaaaa',
  1, 'bbbbb', null,
  1, 'cccc',  null,
  1, 'dddd',  null,
  1, null,    'eeeee'
) as (id, txt1, txt2)
), agg as (
select id, collect_set(txt1) as txt1_set, collect_set(txt2) as txt2_set
from mytable group by id
)

select coalesce(a.id,b.id) id,a.myq,b.myq1
from
(select id,myq,pos
       from agg
       lateral view outer posexplode(txt1_set) q as pos,myq
) a full join 
(select id,myq1,pos
       from agg
       lateral view outer posexplode(txt2_set) q as pos,myq1
) b ON a.id=b.id and a.pos=b.pos

Result:

id  a.myq   b.myq1
1   bbbbb   aaaaaa
1   cccc    eeeee
1   dddd    NULL
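
One caveat worth checking, offered as a hedged note rather than part of the original answer: collect_set drops duplicate values, and the element order of the resulting array should not be relied on, so pairing by position may not match the order you expect. A minimal sketch on made-up sample values, comparing it with sort_array(collect_list(...)), which keeps duplicates and sorts them in ascending order:

```sql
-- Sketch only: the sample values below are invented for illustration.
with sample as (
  select stack(3,
    1, 'b',
    1, 'a',
    1, 'b'
  ) as (id, txt1)
)
select id,
       collect_set(txt1)              as as_set,         -- duplicates removed
       sort_array(collect_list(txt1)) as as_sorted_list  -- duplicates kept, sorted
from sample
group by id;
```

If that behavior suits your data, the two collect_set calls in the agg CTE could be swapped for sort_array(collect_list(...)) before applying posexplode.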