Selecting the values of columns not included in groupBy in a Spark DataFrame in Java

Question

I have a DataFrame like the following:

col1, col2, version_time, col3

root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
 |-- version_time: timestamp (nullable = true) 
 |-- col3: string (nullable = true)

Here are some sample rows:

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2020-05-09T13:53:20.219Z   E
 1     A    2019-05-09T13:53:20.219Z   F

What I want is to group by col1 and col2, aggregate on max(timestamp), and still return all the columns:

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D

If I use groupBy on the DataFrame, col3 is dropped, and I would have to join back to the original DataFrame to get the value of col3:

    col1  col2  timestamp                 
     1     A    2021-05-09T13:53:20.219Z
     2     A    2021-01-09T13:53:20.219Z
     3     A    2021-02-09T13:53:20.219Z

If I use Window.partitionBy, I still get 5 rows, where every row with the same col1 and col2 carries the same timestamp value, which is not what I want:

col1  col2  timestamp                 col3
 1     A    2021-05-09T13:53:20.219Z   B
 2     A    2021-01-09T13:53:20.219Z   C
 3     A    2021-02-09T13:53:20.219Z   D
 1     A    2021-05-09T13:53:20.219Z   E
 1     A    2021-05-09T13:53:20.219Z   F

Is there any other option?

Solution

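You can use the rank window function, partitioned by col1 and col2 and ordered by timestamp in descending order, and then select the records where the rank is 1. The Spark SQL equivalent would look something like this.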

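-- df is a placeholder for the name of the temp view registered from the DataFrame
select col1, col2, timestamp, col3 from (select *, rank() over (partition by col1, col2 order by timestamp desc) as rnk from df) temp where rnk = 1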
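
With the Java DataFrame API, the same idea might look roughly like the sketch below. The class and method names (LatestPerGroup, latestPerGroup) and the helper column rnk are made up for illustration. It also assumes the column is called timestamp, as in the sample rows; if yours is version_time, as in the printed schema, swap the name.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.rank;

public class LatestPerGroup {

    // Hypothetical helper: for each (col1, col2) pair, keeps the row(s) with the
    // newest timestamp and leaves every original column in place.
    public static Dataset<Row> latestPerGroup(Dataset<Row> df) {
        // Partition by the grouping keys and order newest first within each partition.
        // Assumption: the ordering column is named "timestamp".
        WindowSpec byKeyNewestFirst = Window
                .partitionBy(col("col1"), col("col2"))
                .orderBy(col("timestamp").desc());

        return df
                .withColumn("rnk", rank().over(byKeyNewestFirst)) // 1 = newest in its group
                .filter(col("rnk").equalTo(1))                    // keep only the newest row(s)
                .drop("rnk");                                     // drop the helper column again
    }
}
```

Note that rank() gives every tied row the same rank, so if two rows in a group share the newest timestamp, both are returned. If you need exactly one row per group, functions.row_number() can be used in place of rank(); which of the tied rows survives is then arbitrary unless you add another ordering column.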

apache-spark apache-spark-sql dataframe spark-java