Replace a column's values based on certain conditions

Problem description

Input:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          11
a       x     8        2020    colgate        18
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          40
b       y     8        2020    colgate        9
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

Expected output:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          0
a       x     8        2020    colgate        0
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          0
b       y     8        2020    colgate        0
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

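Explanation:

For each combination of item, loc, month, year:

If chocolate > 0, then every value except chocolate and gum becomes 0 (this happens for items a and b).

If there is no chocolate row, the values stay unchanged (this is the case for item = c and loc = s).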
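To make the rule concrete before looking at SQL, here is a minimal plain-Python sketch applied to the sample rows. The names `Row`, `KEEP`, and `apply_rule` are made up for this illustration and are not part of the question.

```python
from collections import namedtuple

# Hypothetical row type mirroring the columns of the sample table.
Row = namedtuple("Row", "item loc month year qty_name qty_value")

rows = [
    Row("a", "x", 8, 2020, "chocolate", 10),
    Row("a", "x", 8, 2020, "gum", 15),
    Row("a", "x", 8, 2020, "maggi", 11),
    Row("a", "x", 8, 2020, "colgate", 18),
    Row("b", "y", 8, 2020, "chocolate", 20),
    Row("b", "y", 8, 2020, "gum", 30),
    Row("b", "y", 8, 2020, "maggi", 40),
    Row("b", "y", 8, 2020, "colgate", 9),
    Row("c", "s", 8, 2020, "gum", 15),
    Row("c", "s", 8, 2020, "maggi", 11),
    Row("c", "s", 8, 2020, "colgate", 18),
]

# Quantity names that are never zeroed out.
KEEP = {"chocolate", "gum"}


def apply_rule(rows):
    # Combinations (item, loc, month, year) that have a chocolate row
    # with a positive quantity.
    flagged = {
        (r.item, r.loc, r.month, r.year)
        for r in rows
        if r.qty_name == "chocolate" and r.qty_value > 0
    }
    # In flagged combinations, zero every quantity except chocolate and gum.
    return [
        r._replace(qty_value=0)
        if (r.item, r.loc, r.month, r.year) in flagged and r.qty_name not in KEEP
        else r
        for r in rows
    ]


result = apply_rule(rows)
for r in result:
    print(*r)
```

Items a and b have a positive chocolate row, so their maggi and colgate quantities become 0; item c has no chocolate row and is left untouched.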

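Solutions

If you are on MySQL 8 or later, you can use a window function. Here COUNT() OVER() counts the chocolate rows into an extra column and gives every row of the same group the same value. The outer query can then check that count.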

SELECT ITEM,LOC,MONTH,YEAR,QTY_NAME,CASE
          WHEN QTY_NAME NOT IN ('chocolate','gum') AND CNT > 0 THEN 0
          ELSE QTY_VALUE
       END
          QTY_VALUE
  FROM (  SELECT ITEM,LOC,MONTH,YEAR,QTY_NAME,QTY_VALUE,
                 -- chocolate rows with a positive quantity in the same combination
                 COUNT(CASE WHEN QTY_NAME = 'chocolate' AND QTY_VALUE > 0 THEN 1 ELSE NULL END)
                    OVER (PARTITION BY ITEM,LOC,MONTH,YEAR)
                    CNT
            FROM TEST_TABLE) T;
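The window-function idea can be tried without a MySQL server. The following is only a sketch using Python's built-in sqlite3 module, assuming the bundled SQLite is 3.25 or newer (the first release with window functions). It also assumes the count is meant to be taken per item, loc, month, year combination and only over chocolate rows with a positive quantity, as the question describes. The table name test_table follows the answer; everything else is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10),
        ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11),
        ("a", "x", 8, 2020, "colgate", 18),
        ("b", "y", 8, 2020, "chocolate", 20),
        ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "maggi", 40),
        ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

query = """
SELECT item, loc, month, year, qty_name,
       CASE
         WHEN qty_name NOT IN ('chocolate', 'gum') AND cnt > 0 THEN 0
         ELSE qty_value
       END AS qty_value
FROM (
  -- cnt repeats, on every row, the number of positive chocolate rows
  -- found in that row's (item, loc, month, year) combination
  SELECT t.*,
         COUNT(CASE WHEN qty_name = 'chocolate' AND qty_value > 0 THEN 1 END)
           OVER (PARTITION BY item, loc, month, year) AS cnt
  FROM test_table AS t
)
ORDER BY item, qty_name
"""

result_rows = conn.execute(query).fetchall()
for row in result_rows:
    print(row)
```

Because this is a SELECT, the table itself is not modified; the adjusted quantities exist only in the query result.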

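The solution below assumes that there is at most one "chocolate" record per item, loc, month, year combination, as in the sample data. Under that assumption there is no need to aggregate per combination.

Simply set the quantity to zero for every record that is not "chocolate" or "gum" and for which the same combination has a "chocolate" record with a quantity greater than 0.

Sample data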

create table quantities
(
  item nvarchar(1),loc nvarchar(1),month int,year int,qty_name nvarchar(10),qty_value int
);

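insert into quantities (item,loc,month,year,qty_name,qty_value) values
('a','x',8,2020,'chocolate',10),('a','x',8,2020,'gum',15),('a','x',8,2020,'maggi',11),('a','x',8,2020,'colgate',18),
('b','y',8,2020,'chocolate',20),('b','y',8,2020,'gum',30),('b','y',8,2020,'maggi',40),('b','y',8,2020,'colgate',9),
('c','s',8,2020,'gum',15),('c','s',8,2020,'maggi',11),('c','s',8,2020,'colgate',18);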

Solution

update quantities q
join quantities q2
  on  q2.item = q.item
  and q2.loc = q.loc
  and q2.month = q.month
  and q2.year = q.year
  and q2.qty_name = 'chocolate'
  and q2.qty_value > 0
set q.qty_value = 0
where q.qty_name not in ('chocolate','gum');

Result

select * from quantities;

item    loc month   year    qty_name    qty_value
------- --- ------- ------- ----------- ----------
a       x   8       2020    chocolate   10
a       x   8       2020    gum         15
a       x   8       2020    maggi       0
a       x   8       2020    colgate     0
b       y   8       2020    chocolate   20
b       y   8       2020    gum         30
b       y   8       2020    maggi       0
b       y   8       2020    colgate     0
c       s   8       2020    gum         15
c       s   8       2020    maggi       11
c       s   8       2020    colgate     18


EDIT: This is a MySQL solution, because the question was previously tagged with it. I don't have an Apache Spark SQL engine at hand to verify this solution.
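Since UPDATE ... JOIN is MySQL syntax, the same rule can also be written with a correlated EXISTS for engines that lack it. The following is only a sketch, run here against SQLite through Python's built-in sqlite3 module. It is not verified on MySQL or Spark SQL, and an engine may restrict subqueries on the table being updated. The column types are simplified to TEXT and INT for SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quantities "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO quantities VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10),
        ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11),
        ("a", "x", 8, 2020, "colgate", 18),
        ("b", "y", 8, 2020, "chocolate", 20),
        ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "maggi", 40),
        ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

# Zero every non-chocolate, non-gum row whose combination has a
# chocolate row with a positive quantity.
conn.execute("""
UPDATE quantities
SET qty_value = 0
WHERE qty_name NOT IN ('chocolate', 'gum')
  AND EXISTS (
    SELECT 1
    FROM quantities AS q2
    WHERE q2.item = quantities.item
      AND q2.loc = quantities.loc
      AND q2.month = quantities.month
      AND q2.year = quantities.year
      AND q2.qty_name = 'chocolate'
      AND q2.qty_value > 0
  )
""")

# rowid preserves insertion order for this freshly created table.
updated = conn.execute(
    "SELECT item, qty_name, qty_value FROM quantities ORDER BY rowid"
).fetchall()
for row in updated:
    print(row)
```

Unlike the join version, EXISTS does not multiply rows, so it behaves the same even if a combination contains more than one "chocolate" record.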

Here is the PySpark way.

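import pyspark.sql.functions as f

# df is assumed to be an existing DataFrame holding the sample data.
# df2: one row per combination that has a chocolate row with qty_value > 0, tagged with a marker.
df2 = df.filter("qty_name = 'chocolate' and qty_value > 0") \
  .select('item','loc','month','year') \
  .withColumn('marker',f.lit('Y'))

# Left join on the full combination key, then zero everything except chocolate and gum in marked combinations.
df.join(df2,['item','loc','month','year'],'left') \
  .withColumn('qty_value',f.when(f.expr("marker = 'Y' and qty_name not in ('chocolate','gum')"),0).otherwise(f.col('qty_value'))) \
  .drop('marker').show(12,False)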

+----+---+-----+----+---------+---------+
|item|loc|month|year|qty_name |qty_value|
+----+---+-----+----+---------+---------+
|a   |x  |8    |2020|chocolate|10       |
|a   |x  |8    |2020|gum      |15       |
|a   |x  |8    |2020|maggi    |0        |
|a   |x  |8    |2020|colgate  |0        |
|b   |y  |8    |2020|chocolate|20       |
|b   |y  |8    |2020|gum      |30       |
|b   |y  |8    |2020|maggi    |0        |
|b   |y  |8    |2020|colgate  |0        |
|c   |s  |8    |2020|gum      |15       |
|c   |s  |8    |2020|maggi    |11       |
|c   |s  |8    |2020|colgate  |18       |
+----+---+-----+----+---------+---------+
