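How do I select the first and last records of each run in a time series in Hive?

Problem description

I have a Hive table with two columns. The first column is a time and the second is an object, with the objects scattered across the rows. I want to find every group of rows in which the same object appears consecutively in time, and get the first and last record of each group. How can I do this in Hive?

For example, I have a table like this: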

id   time      object
1   10:01:00   a
2   10:02:00   a
3   10:03:00   a
4   10:04:00   b
5   10:05:00   b
6   10:06:00   a
7   10:07:00   a
8   10:08:00   a
9   10:09:00   a
10  10:10:00   a
11  10:11:00   c

I want to get this (object 'a' is consecutive from 10:01:00 to 10:03:00 and again from 10:06:00 to 10:10:00, so rows 1 and 3 as well as rows 6 and 10 are picked):

id   time      object
1   10:01:00   a
3   10:03:00   a
4   10:04:00   b
5   10:05:00   b
6   10:06:00   a
10  10:10:00   a
11  10:11:00   c

What should I do to achieve this?

Solutions

You can use lead and lag to pick the rows whose object differs from that of the next row or the previous row. The first and last rows need an explicit check so that they are included as well.

select id, time, object
from (select t.*,
             (lead(object) over (order by time) != object or row_number() over (order by time desc) = 1) as cond1,
             (lag(object) over (order by time) != object or row_number() over (order by time) = 1) as cond2
      from your_table t) s
where cond1 or cond2;

This is a gaps-and-islands problem. You can solve it with the row_number analytic function as follows:

select id, time, object
from (select t.*,
             -- position of each row inside its island, counted from the start and from the end
             row_number() over (partition by object, rn - rn_o order by time) as rn_first,
             row_number() over (partition by object, rn - rn_o order by time desc) as rn_last
      from (select t.*,
                   row_number() over (order by time) as rn,
                   row_number() over (partition by object order by time) as rn_o
            from your_table t) t) t
where rn_first = 1 or rn_last = 1;
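
To see why the difference of the two row numbers identifies a run, here is a minimal sketch in Python (not Hive) that replays the idea on the sample data from the question. The function name `first_and_last_per_island` and the tuple layout are assumptions made for illustration. The sketch keys each island on the object together with the difference, since two different objects can end up with the same difference.

```python
# Sample rows from the question: (id, time, object).
rows = [
    (1, "10:01:00", "a"), (2, "10:02:00", "a"), (3, "10:03:00", "a"),
    (4, "10:04:00", "b"), (5, "10:05:00", "b"),
    (6, "10:06:00", "a"), (7, "10:07:00", "a"), (8, "10:08:00", "a"),
    (9, "10:09:00", "a"), (10, "10:10:00", "a"),
    (11, "10:11:00", "c"),
]

def first_and_last_per_island(rows):
    """Return the first and last row of every run of identical objects."""
    ordered = sorted(rows, key=lambda r: r[1])  # order by time
    seen_per_object = {}                        # running count per object (rn_o)
    islands = {}                                # (object, rn - rn_o) -> rows in time order
    for rn, row in enumerate(ordered, start=1):
        obj = row[2]
        seen_per_object[obj] = seen_per_object.get(obj, 0) + 1
        rn_o = seen_per_object[obj]
        # Inside one run, rn and rn_o both grow by 1 per row, so rn - rn_o stays constant.
        islands.setdefault((obj, rn - rn_o), []).append(row)
    picked = set()
    for members in islands.values():
        picked.add(members[0])   # first row of the island
        picked.add(members[-1])  # last row of the island (same row for a run of one)
    return sorted(picked, key=lambda r: r[1])

print([r[0] for r in first_and_last_per_island(rows)])  # [1, 3, 4, 5, 6, 10, 11]
```

A run of length one, such as object 'c' here, is returned once because its first and last row are the same row.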

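I don't think this is a gaps-and-islands problem. You seem to want only the rows where a change occurs. That suggests a simple application of lag() and lead():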

select t.*
from (select t.*,lag(object) over (order by time) as prev_object,lead(object) over (order by time) as next_object
      from t
     ) t
where (prev_object is null or prev_object <> object) or
      (next_object is null or next_object <> object);

Hive supports the NULL-safe comparison operator <=>, so you can write the where clause as:

where not (prev_object <=> object) or
      not (next_object <=> object)
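
The filter can be checked outside Hive with a minimal Python sketch on the sample data. The function name `changed_rows` is an assumption for illustration, and objects are assumed to be non-NULL. Python's `None` stands in for the NULL that lag() and lead() return at the edges, and `!=` against `None` behaves like `not (x <=> y)` here.

```python
# Sample rows from the question: (id, time, object).
rows = [
    (1, "10:01:00", "a"), (2, "10:02:00", "a"), (3, "10:03:00", "a"),
    (4, "10:04:00", "b"), (5, "10:05:00", "b"),
    (6, "10:06:00", "a"), (7, "10:07:00", "a"), (8, "10:08:00", "a"),
    (9, "10:09:00", "a"), (10, "10:10:00", "a"),
    (11, "10:11:00", "c"),
]

def changed_rows(rows):
    """Keep rows whose object differs from the previous or the next row."""
    ordered = sorted(rows, key=lambda r: r[1])  # order by time
    objects = [r[2] for r in ordered]
    prev_objects = [None] + objects[:-1]        # lag(object): None before the first row
    next_objects = objects[1:] + [None]         # lead(object): None after the last row
    return [
        row
        for row, prev_obj, next_obj in zip(ordered, prev_objects, next_objects)
        if prev_obj != row[2] or next_obj != row[2]
    ]

print([r[0] for r in changed_rows(rows)])  # [1, 3, 4, 5, 6, 10, 11]
```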