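Problem description
I have a two-column Hive table. The first column is a time and the second is a discrete object. I'd like to find every group of the same object that is continuous in time, and get the first and last record of each group. How can I achieve this in Hive?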
id time object
1 10:01:00 a
2 10:02:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
7 10:07:00 a
8 10:08:00 a
9 10:09:00 a
10 10:10:00 a
11 10:11:00 c
I would like to get this (because object 'a' is continuous from 10:01:00 to 10:03:00 and from 10:06:00 to 10:10:00, rows 1 and 3 and rows 6 and 10 are picked):
id time object
1 10:01:00 a
3 10:03:00 a
4 10:04:00 b
5 10:05:00 b
6 10:06:00 a
10 10:10:00 a
11 10:11:00 c
What should I do to achieve this?
Solution
You can use lead and lag, respectively:
select id,time,object from
(select *,(lead(object) over (order by time) != object or row_number() over (order by time desc) = 1) cond1,(lag(object) over (order by time) != object or row_number() over (order by time) = 1) cond2
from your_table) t
where cond1 or cond2;
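This selects the rows whose object differs from the previous row or from the next row. The first and last rows also need to be checked explicitly so that they are included.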
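The lag/lead comparison described above can be checked outside Hive with a small Python sketch. This is only an illustration of the logic, not Hive code: the function name boundary_rows and the tuple layout (id, time, object) are assumptions made for the example, and it assumes the object column never contains NULL.

```python
# Sample rows from the question: (id, time, object).
ROWS = [
    (1, "10:01:00", "a"), (2, "10:02:00", "a"), (3, "10:03:00", "a"),
    (4, "10:04:00", "b"), (5, "10:05:00", "b"),
    (6, "10:06:00", "a"), (7, "10:07:00", "a"), (8, "10:08:00", "a"),
    (9, "10:09:00", "a"), (10, "10:10:00", "a"),
    (11, "10:11:00", "c"),
]


def boundary_rows(rows):
    """Keep rows whose object differs from the previous or the next row.

    Hypothetical helper: mimics lag(object) and lead(object) over
    (order by time). A missing neighbour is represented by None, which
    never equals a real object, so the first and last rows are kept.
    """
    ordered = sorted(rows, key=lambda r: r[1])  # order by time
    kept = []
    for i, row in enumerate(ordered):
        prev_obj = ordered[i - 1][2] if i > 0 else None                 # lag
        next_obj = ordered[i + 1][2] if i + 1 < len(ordered) else None  # lead
        if prev_obj != row[2] or next_obj != row[2]:
            kept.append(row)
    return kept


if __name__ == "__main__":
    for row in boundary_rows(ROWS):
        print(*row)
```

Run on the sample data, this keeps rows 1, 3, 4, 5, 6, 10 and 11, which matches the desired output in the question.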
This is a gaps-and-islands problem. You can use the row_number analytic function as follows:
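select * from
-- rn - rn_o is constant within a run of the same object; object is part of the key
-- because different objects can share the same difference
(select t.*,row_number() over (partition by object,rn-rn_o order by time) as rn_first,row_number() over (partition by object,rn-rn_o order by time desc) as rn_last
from
(select t.*,row_number() over (order by time) as rn,row_number() over (partition by object order by time) as rn_o
from your_table t) t) t
where rn_first = 1 or rn_last = 1;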
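To make the row-number-difference idea easier to follow, here is a rough Python sketch of the same grouping. The function name island_endpoints and the tuple layout (id, time, object) are assumptions for illustration only. Note that the grouping key here is the pair (object, rn - rn_o): the difference alone is not always unique, since two different objects can produce the same value.

```python
from collections import defaultdict

# Sample rows from the question: (id, time, object).
ROWS = [
    (1, "10:01:00", "a"), (2, "10:02:00", "a"), (3, "10:03:00", "a"),
    (4, "10:04:00", "b"), (5, "10:05:00", "b"),
    (6, "10:06:00", "a"), (7, "10:07:00", "a"), (8, "10:08:00", "a"),
    (9, "10:09:00", "a"), (10, "10:10:00", "a"),
    (11, "10:11:00", "c"),
]


def island_endpoints(rows):
    """Return the first and last row of every run of identical objects.

    Hypothetical helper: rn is the overall row number by time, rn_o the
    row number within the same object. Their difference stays constant
    inside one continuous run, so (object, rn - rn_o) identifies a run.
    """
    ordered = sorted(rows, key=lambda r: r[1])  # order by time
    seen = defaultdict(int)                     # per-object counter (rn_o)
    islands = {}                                # key -> rows, in order of appearance
    for rn, row in enumerate(ordered, start=1):
        seen[row[2]] += 1
        key = (row[2], rn - seen[row[2]])
        islands.setdefault(key, []).append(row)

    result = []
    for members in islands.values():
        result.append(members[0])               # first row of the run
        if len(members) > 1:
            result.append(members[-1])          # last row, if distinct
    return result


if __name__ == "__main__":
    for row in island_endpoints(ROWS):
        print(*row)
```

With the sequence a, b, a the differences are 0, 1 and 1, so grouping on the difference alone would merge the b row with the second a row. Keeping the object in the key avoids that.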
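I don't think this is a gaps-and-islands problem. You just seem to want the rows where a change occurs. That suggests a simple application of lag() and lead():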
select t.*
from (select t.*,lag(object) over (order by time) as prev_object,lead(object) over (order by time) as next_object
from t
) t
where (prev_object is null or prev_object <> object) or
(next_object is null or next_object <> object);
Hive supports the NULL-safe comparison operator (<=>), so you can phrase the where clause as:
where not (prev_object <=> object) or
not (next_object <=> object)
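For readers unfamiliar with NULL-safe comparison, the following Python sketch contrasts it with ordinary equality, using None in place of NULL. The function names plain_eq and null_safe_eq are made up for this illustration and are not part of Hive.

```python
def plain_eq(a, b):
    """Emulate SQL a = b: a NULL operand makes the result NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b):
    """Emulate a <=> b: two NULLs compare equal, one NULL compares unequal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


if __name__ == "__main__":
    # First row of the table: prev_object is NULL, object is 'a'.
    print(plain_eq(None, "a"))           # None, so "prev_object <> object" alone is not true
    print(not null_safe_eq(None, "a"))   # True, so the row is kept
```

This is why the plain version needs the explicit "is null" checks, while the <=> version does not.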