Fill missing row values with the previous and next non-missing values

Problem description

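I know that missing values can be forward- or backward-filled with the previous or next non-missing value by combining the last (or first) function with a window function.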

But my data looks like this:

Area,Date,Population
A,1/1/2000,10000
A,2/1/2000,
A,3/1/2000,
A,4/1/2000,10030
A,5/1/2000,

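In this example, filling the May population with 10030 is easy. For February and March, however, I want to fill in the average of 10000 and 10030 (that is, 10015), not 10000 or 10030.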

Do you know how to achieve this?

Thanks

Solution

Get the previous and next non-null values and average them, as shown below:

    // Input DataFrame and its schema
    df2.show(false)
    df2.printSchema()
    /**
      * +----+--------+----------+
      * |Area|Date    |Population|
      * +----+--------+----------+
      * |A   |1/1/2000|10000     |
      * |A   |2/1/2000|null      |
      * |A   |3/1/2000|null      |
      * |A   |4/1/2000|10030     |
      * |A   |5/1/2000|null      |
      * +----+--------+----------+
      *
      * root
      * |-- Area: string (nullable = true)
      * |-- Date: string (nullable = true)
      * |-- Population: integer (nullable = true)
      */

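    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{coalesce, first, last}
    // $"..." also needs `import spark.implicits._` (assuming the SparkSession is named `spark`)
    // Note: Date is a string here, so orderBy("Date") sorts lexicographically; parse it with to_date for real data.
    // w1: start of the partition up to the current row; w2: current row to the end of the partition
    val w1 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
    val w2 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
    df2.withColumn("previous", last("Population", ignoreNulls = true).over(w1)) // last non-null at or before this row
      .withColumn("next", first("Population", ignoreNulls = true).over(w2))     // first non-null at or after this row
      // each coalesce falls back to the only available neighbour when the other one is missing
      .withColumn("new_Population", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
      .drop("next", "previous")
      .show(false)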

    /**
      * +----+--------+----------+--------------+
      * |Area|Date    |Population|new_Population|
      * +----+--------+----------+--------------+
      * |A   |1/1/2000|10000     |10000.0       |
      * |A   |2/1/2000|null      |10015.0       |
      * |A   |3/1/2000|null      |10015.0       |
      * |A   |4/1/2000|10030     |10030.0       |
      * |A   |5/1/2000|null      |10030.0       |
      * +----+--------+----------+--------------+
      */
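
To sanity-check the rule outside Spark, here is a minimal plain-Python sketch of the same logic. The function name fill_with_neighbor_mean is made up for this illustration and is not part of any Spark API; it only mirrors what the previous/next columns and the two coalesce calls do on one partition that is already sorted by date.

```python
from typing import List, Optional


def fill_with_neighbor_mean(values: List[Optional[float]]) -> List[Optional[float]]:
    """Average the nearest non-missing values at or before and at or after each row."""
    n = len(values)

    # previous[i]: last non-missing value at or before row i (the role of window w1)
    previous: List[Optional[float]] = [None] * n
    last_seen: Optional[float] = None
    for i, v in enumerate(values):
        if v is not None:
            last_seen = v
        previous[i] = last_seen

    # following[i]: first non-missing value at or after row i (the role of window w2)
    following: List[Optional[float]] = [None] * n
    next_seen: Optional[float] = None
    for i in range(n - 1, -1, -1):
        if values[i] is not None:
            next_seen = values[i]
        following[i] = next_seen

    filled: List[Optional[float]] = []
    for p, q in zip(previous, following):
        left = p if p is not None else q    # coalesce(previous, next)
        right = q if q is not None else p   # coalesce(next, previous)
        # Both sides are None only when the whole partition is empty
        filled.append(None if left is None or right is None else (left + right) / 2)
    return filled


population = [10000, None, None, 10030, None]
print(fill_with_neighbor_mean(population))
# [10000.0, 10015.0, 10015.0, 10030.0, 10030.0]
```

Rows that already have a value are unchanged because both neighbours resolve to the row itself, and a trailing gap such as May falls back to the only neighbour that exists.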

Here is my attempt.

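w1 and w2 are used to split the rows into groups, while w3 and w4 are used to pick up the preceding and following values within those groups. After that, you can add a condition that decides how Population is filled.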

import pyspark.sql.functions as f
from pyspark.sql import Window

# w1/w2: running windows used to count the non-null values seen so far (from the top / from the bottom)
w1 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding,Window.currentRow)
w2 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow,Window.unboundedFollowing)
# w3/w4: one group per counter value, so every row in a group shares the same neighbouring value
w3 = Window.partitionBy('Area','partition1').orderBy('Date')
w4 = Window.partitionBy('Area','partition2').orderBy(f.desc('Date'))

# Note: 'Date' is a string column, so it is ordered lexicographically here; parse it with f.to_date for real data.
df.withColumn('check',f.col('Population').isNotNull().cast('int')) \
  .withColumn('partition1',f.sum('check').over(w1)) \
  .withColumn('partition2',f.sum('check').over(w2)) \
  .withColumn('first',f.first('Population').over(w3)) \
  .withColumn('last',f.first('Population').over(w4)) \
  .withColumn('fill',f.when(f.col('first').isNotNull() & f.col('last').isNotNull(),(f.col('first') + f.col('last')) / 2).otherwise(f.coalesce('first','last'))) \
  .withColumn('Population',f.coalesce('Population','fill')) \
  .orderBy('Date') \
  .select(*df.columns).show(10,False)

+----+--------+----------+
|Area|Date    |Population|
+----+--------+----------+
|A   |1/1/2000|10000.0   |
|A   |2/1/2000|10015.0   |
|A   |3/1/2000|10015.0   |
|A   |4/1/2000|10030.0   |
|A   |5/1/2000|10030.0   |
+----+--------+----------+
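
The grouping idea behind partition1 and partition2 may be easier to see in plain Python. This is only an illustrative trace on the sample column, not Spark code; the list names reuse the column names from the attempt above.

```python
from itertools import accumulate

population = [10000, None, None, 10030, None]

# 1 where a value exists, 0 where it is missing (the 'check' column)
check = [int(v is not None) for v in population]

# Running count from the top (the role of w1): a gap gets the same id
# as the last non-missing row above it.
partition1 = list(accumulate(check))

# Running count from the bottom (the role of w2): a gap gets the same id
# as the first non-missing row below it.
partition2 = list(accumulate(reversed(check)))[::-1]

print(partition1)  # [1, 1, 1, 2, 2]
print(partition2)  # [2, 1, 1, 1, 0]
```

February and March share partition1 = 1 with January, so they pick up 10000 as their preceding value, and they share partition2 = 1 with April, so they pick up 10030 as their following value. May has partition2 = 0, a group with no non-missing row, so its following value stays null and the coalesce fallback applies.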

apache-spark-sql pyspark pyspark-dataframes