Updating column values in a Spark DataFrame based on another column

Question

I have the Spark DataFrame shown below.

import spark.implicits._  // needed for toDF outside spark-shell

val data = spark.sparkContext.parallelize(Seq(
    (1,"","SNACKS","BISCUITS - AMBIENT","BISCUITS - AMBIENT","","REFLETS DE FRANCE CROQUANT","UNCOATED  BISCUIT","NO PROMOTION","","","400G","",""),
    (2,"GROCERY","BISCUITS","SWEET BISCUITS ","BISCUITS - AMBIENT","","","AMBIENT BISCUIT","NO PROMOTION","","","400G","","CHOCOS")
  ))
  .toDF("id","c4","c1001","c1002","c1003","c1008","c1008_unmasked","c1009","c1011","c1012","c1013","c1015","c1016","c1016_unmasked")

data.show(false)

Sample input:

+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008|c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|     |REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |     |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|     |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |     |CHOCOS        |
+---+-------+--------+------------------+------------------+-----+--------------------------+-----------------+------------+-----+-----+-----+-----+--------------+

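I need to fill column cXXXX with the value "MASKED" only when the matching cXXXX_unmasked column has a value in it. Please check the sample output below for a better understanding.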

+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|id |c4     |c1001   |c1002             |c1003             |c1008 |c1008_unmasked            |c1009            |c1011       |c1012|c1013|c1015|c1016 |c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+
|1  |       |SNACKS  |BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE CROQUANT|UNCOATED  BISCUIT|NO PROMOTION|     |     |400G |      |              |
|2  |GROCERY|BISCUITS|SWEET BISCUITS    |BISCUITS - AMBIENT|      |                          |AMBIENT BISCUIT  |NO PROMOTION|     |     |400G |MASKED|CHOCOS        |
+---+-------+--------+------------------+------------------+------+--------------------------+-----------------+------------+-----+-----+-----+------+--------------+

Thanks in advance.

Answer

Here is my attempt.

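import org.apache.spark.sql.functions.{col, lit, when}

// Every column ending in "_unmasked" has a sibling cXXXX column to fill.
val cols = data.columns.filter(_.endsWith("_unmasked"))

// Fold over the unmasked columns, rewriting the sibling column each time.
val new_data = cols.foldLeft(data) { (df, c) =>
    val target = c.stripSuffix("_unmasked")
    // "MASKED" when the unmasked value is non-empty; otherwise keep the current value.
    df.withColumn(target, when(col(c) =!= "" && col(c).isNotNull, lit("MASKED")).otherwise(col(target)))
}
new_data.show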

+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
| id|     c4|   c1001|             c1002|             c1003| c1008|      c1008_unmasked|            c1009|       c1011|c1012|c1013|c1015| c1016|c1016_unmasked|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
|  1|       |  SNACKS|BISCUITS - AMBIENT|BISCUITS - AMBIENT|MASKED|REFLETS DE FRANCE...|UNCOATED  BISCUIT|NO PROMOTION|     |     | 400G|      |              |
|  2|GROCERY|BISCUITS|   SWEET BISCUITS |BISCUITS - AMBIENT|      |                    |  AMBIENT BISCUIT|NO PROMOTION|     |     | 400G|MASKED|        CHOCOS|
+---+-------+--------+------------------+------------------+------+--------------------+-----------------+------------+-----+-----+-----+------+--------------+
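
The fold above calls withColumn once per _unmasked column, which is fine for two columns but adds one projection per call when there are many. As a rough alternative, the same rule can be expressed as a single select. The sketch below is not part of the original answer: the helper name maskColumns and its suffix parameter are made up for illustration, and it assumes every cXXXX_unmasked column has an existing cXXXX sibling (a missing sibling is silently skipped rather than created). When the unmasked value is empty or null it keeps the current value of cXXXX.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Hypothetical helper (an assumption, not from the answer): masks each cXXXX
// column whose cXXXX_unmasked sibling holds a non-empty, non-null value.
def maskColumns(df: DataFrame, suffix: String = "_unmasked"): DataFrame = {
  // Map each target column name to the name of its unmasked sibling.
  val targets: Map[String, String] = df.columns
    .filter(_.endsWith(suffix))
    .map(u => u.stripSuffix(suffix) -> u)
    .toMap

  // Build the whole projection once; all other columns pass through unchanged.
  val projected = df.columns.map { name =>
    targets.get(name) match {
      case Some(unmasked) =>
        when(col(unmasked).isNotNull && col(unmasked) =!= "", lit("MASKED"))
          .otherwise(col(name)) // keep the existing value when there is nothing to mask
          .as(name)
      case None => col(name)
    }
  }
  df.select(projected: _*)
}

// Usage on a tiny stand-in DataFrame (local session assumed for the example).
val spark = SparkSession.builder().master("local[1]").appName("mask-sketch").getOrCreate()
import spark.implicits._

val sample = Seq((1, "", "REFLETS DE FRANCE CROQUANT"), (2, "", ""))
  .toDF("id", "c1008", "c1008_unmasked")

maskColumns(sample).show(false)
```

Calling maskColumns(data) on the DataFrame from the question should give the same result as the fold, since both apply the same when/otherwise rule per column pair.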