Problem description
---------------------------------------------------------
primaryKey | start_timestamp | end_timestamp
---------------------------------------------------------
key1 | 2020-08-13 15:40:00 | 2020-08-13 15:44:47
key2 | 2020-08-14 12:00:00 | 2020-08-14 12:01:13
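I want to create a dataframe that, for every key, contains the time series between start_timestamp and end_timestamp at intervals of x seconds. For example, with an interval of x = 120 seconds, the output would be: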
-----------------------------------------------------------
primaryKey | start_timestamp_new | end_timestamp_new
-----------------------------------------------------------
key1 | 2020-08-13 15:40:00 | 2020-08-13 15:41:59
key1 | 2020-08-13 15:42:00 | 2020-08-13 15:43:59
key1 | 2020-08-13 15:44:00 | 2020-08-13 15:45:59
key2 | 2020-08-14 12:00:00 | 2020-08-14 12:01:59
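I am trying to use the approach described at https://docs.microsoft.com/en-us/azure/devops/pipelines/build/triggers?view=azure-devops, but I have not been able to apply it to a Spark dataframe.
Any pointers on how to create this would be a great help.
Solution
You can use the sequence function to generate the start of each window, then explode the resulting array into one row per window.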
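from pyspark.sql.functions import col, explode, expr, sequence, to_timestamp

x = 120  # interval length in seconds
# Cast both columns to timestamps, expand each row into one row per x-second step,
# then set each window's end to one second before the next window would start.
df.withColumn('start_timestamp', to_timestamp('start_timestamp')) \
  .withColumn('end_timestamp', to_timestamp('end_timestamp')) \
  .withColumn('start_timestamp', explode(sequence('start_timestamp', 'end_timestamp', expr(f'interval {x} seconds')))) \
  .withColumn('end_timestamp', col('start_timestamp') + expr(f'interval {x - 1} seconds')) \
  .show()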
+----------+-------------------+-------------------+
|primaryKey|    start_timestamp|      end_timestamp|
+----------+-------------------+-------------------+
|      key1|2020-08-13 15:40:00|2020-08-13 15:41:59|
|      key1|2020-08-13 15:42:00|2020-08-13 15:43:59|
|      key1|2020-08-13 15:44:00|2020-08-13 15:45:59|
|      key2|2020-08-14 12:00:00|2020-08-14 12:01:59|
+----------+-------------------+-------------------+
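To sanity-check the window arithmetic without a Spark session, the same logic can be sketched in plain Python. This is only an illustration of what sequence plus explode computes here, not part of the Spark solution; the helper name split_into_windows and the hard-coded rows are assumptions made for the example. Note that, as in the output above, the last window of a key can end after the original end_timestamp.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def split_into_windows(start, end, step_seconds):
    """Return (window_start, window_end) pairs for one key.

    Window starts run from `start` up to and including `end` in steps of
    `step_seconds`; each window ends one second before the next one starts.
    """
    step = timedelta(seconds=step_seconds)
    windows = []
    current = start
    while current <= end:
        windows.append((current, current + step - timedelta(seconds=1)))
        current += step
    return windows


# Sample rows copied from the question.
rows = [
    ("key1", "2020-08-13 15:40:00", "2020-08-13 15:44:47"),
    ("key2", "2020-08-14 12:00:00", "2020-08-14 12:01:13"),
]

x = 120  # interval length in seconds

# One output row per (key, window), like explode() over the generated sequence.
result = [
    (key, w_start.strftime(FMT), w_end.strftime(FMT))
    for key, start, end in rows
    for w_start, w_end in split_into_windows(
        datetime.strptime(start, FMT), datetime.strptime(end, FMT), x
    )
]

for row in result:
    print(row)
```

If the final window should not run past the original end_timestamp, one option is to cap each window end at the row's end value; that is a design choice the question's expected output does not ask for.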