问题描述
df.select(
*df.columns[:2],*[F.sum(F.col(i)).over(Window.orderBy('Month')).alias(i) for i in df.columns[2:8]]
)
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|Month |month_index|QA_count|BS_count|BV_count|QT_count|B |QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09|0 |3 |0 |1 |1 |2 |3 |7 |
|2020-10|1 |4 |1 |2 |2 |7 |12 |8 |
|2020-11|2 |5 |2 |3 |3 |12 |21 |9 |
|2020-12|3 |6 |3 |4 |4 |17 |30 |10 | |
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
我目前有一个数据集,像上面的数据集一样按月显示列的累积总和,但是即使我还没有其他新数据,我也希望自动添加当前月份的行。我想要的输出看起来与此类似
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|Month |month_index|QA_count|BS_count|BV_count|QT_count|B |QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09|0 |3 |0 |1 |1 |2 |3 |7 |
|2020-10|1 |4 |1 |2 |2 |7 |12 |8 |
|2020-11|2 |5 |2 |3 |3 |12 |21 |9 |
|2020-12|3 |6 |3 |4 |4 |17 |30 |10 |
|2021-01|4 |6 |3 |4 |4 |17 |30 |10 |
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
ps:但是,当 2021-01 月份有新的计数时,它应该自动将该新计数添加到累积总和中。
解决方法
import pyspark.sql.functions as F
df2 = df.select(
*df.columns[:2],*[F.sum(F.col(i)).over(Window.orderBy('Month')).alias(i) for i in df.columns[2:8]]
)
# check if there is any new data. if there isn't,add the same row as the last row.
if df2.select('Month').orderBy(F.desc('Month')).head(1)[0] != df2.select(F.date_format(F.current_date(),'yyyy-MM')).head(1)[0]:
df3 = df2.union(
df2.orderBy(F.desc('Month')).limit(1)
.withColumn('Month',F.date_format(F.current_date(),'yyyy-MM'))
.withColumn('month_index',F.col('month_index')+1)
)
else:
df3 = df2
df3.show()
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
| Month|month_index|QA_count|BS_count|BV_count|QT_count| B|QB_count|BT_count|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+
|2020-09| 0| 3| 0| 1| 1| 2| 3| 7|
|2020-10| 1| 4| 1| 2| 2| 7| 12| 8|
|2020-11| 2| 5| 2| 3| 3| 12| 21| 9|
|2020-12| 3| 6| 3| 4| 4| 17| 30| 10|
|2021-01| 4| 6| 3| 4| 4| 17| 30| 10|
+-------+-----------+--------+--------+--------+--------+---+--------+--------+