Problem description
Input:
item loc month year qty_name qty_value
a x 8 2020 chocolate 10
a x 8 2020 gum 15
a x 8 2020 maggi 11
a x 8 2020 colgate 18
b y 8 2020 chocolate 20
b y 8 2020 gum 30
b y 8 2020 maggi 40
b y 8 2020 colgate 9
c s 8 2020 gum 15
c s 8 2020 maggi 11
c s 8 2020 colgate 18
Expected output:
item loc month year qty_name qty_value
a x 8 2020 chocolate 10
a x 8 2020 gum 15
a x 8 2020 maggi 0
a x 8 2020 colgate 0
b y 8 2020 chocolate 20
b y 8 2020 gum 30
b y 8 2020 maggi 0
b y 8 2020 colgate 0
c s 8 2020 gum 15
c s 8 2020 maggi 11
c s 8 2020 colgate 18
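Explanation:
For each item, loc, month, year combination:
If chocolate > 0, every value other than chocolate and gum becomes 0 (this happens for items a and b).
If there is no chocolate record, the values stay unchanged (this is the case for item = c and loc = s).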
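The rule above can be sketched in plain Python before turning to SQL. This is only an illustration of the intended semantics, not part of any answer; the names `ROWS`, `KEEP` and `apply_rule` are hypothetical, and the rows simply mirror the sample input.

```python
# Rows mirror the sample input: (item, loc, month, year, qty_name, qty_value).
ROWS = [
    ("a", "x", 8, 2020, "chocolate", 10),
    ("a", "x", 8, 2020, "gum", 15),
    ("a", "x", 8, 2020, "maggi", 11),
    ("a", "x", 8, 2020, "colgate", 18),
    ("b", "y", 8, 2020, "chocolate", 20),
    ("b", "y", 8, 2020, "gum", 30),
    ("b", "y", 8, 2020, "maggi", 40),
    ("b", "y", 8, 2020, "colgate", 9),
    ("c", "s", 8, 2020, "gum", 15),
    ("c", "s", 8, 2020, "maggi", 11),
    ("c", "s", 8, 2020, "colgate", 18),
]

KEEP = {"chocolate", "gum"}  # quantities that are never zeroed


def apply_rule(rows):
    """Zero every non-chocolate, non-gum quantity in groups that have chocolate > 0."""
    # Groups (item, loc, month, year) that contain a chocolate row with a positive value.
    flagged = {r[:4] for r in rows if r[4] == "chocolate" and r[5] > 0}
    return [
        r[:5] + (0,) if r[:4] in flagged and r[4] not in KEEP else r
        for r in rows
    ]


for row in apply_rule(ROWS):
    print(*row)
```

Groups a/x and b/y are flagged because they hold a positive chocolate quantity, so their maggi and colgate values drop to 0; group c/s has no chocolate row and passes through unchanged.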
Solutions
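If you are on MySQL 8 or later, you can use window functions. Here COUNT() OVER (PARTITION BY ...) counts the chocolate rows with a positive quantity into an extra column, so every row of the same item, loc, month, year combination carries the same count. The outer query can then check that count.
SELECT ITEM, LOC, MONTH, YEAR, QTY_NAME,
       CASE
           WHEN QTY_NAME NOT IN ('chocolate', 'gum') AND CNT > 0 THEN 0
           ELSE QTY_VALUE
       END AS QTY_VALUE
FROM (
    -- CNT = number of chocolate rows with a positive quantity in the same combination
    SELECT ITEM, LOC, MONTH, YEAR, QTY_NAME, QTY_VALUE,
           COUNT(CASE WHEN QTY_NAME = 'chocolate' AND QTY_VALUE > 0 THEN 1 END)
               OVER (PARTITION BY ITEM, LOC, MONTH, YEAR) AS CNT
    FROM TEST_TABLE
) T;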
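To try the window-function idea without a MySQL server, the sketch below runs a partitioned count in Python's built-in `sqlite3` module. SQLite is a stand-in here, not the engine the answer targets, and it needs SQLite 3.25 or newer for window functions; the table layout and the `QUERY` name are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10), ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11), ("a", "x", 8, 2020, "colgate", 18),
        ("b", "y", 8, 2020, "chocolate", 20), ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "maggi", 40), ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15), ("c", "s", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

# The inner query attaches a per-combination chocolate count to every row;
# the outer query zeroes everything except chocolate and gum where that count is positive.
QUERY = """
SELECT item, loc, month, year, qty_name,
       CASE WHEN qty_name NOT IN ('chocolate', 'gum') AND cnt > 0 THEN 0
            ELSE qty_value END AS qty_value
FROM (
    SELECT item, loc, month, year, qty_name, qty_value,
           COUNT(CASE WHEN qty_name = 'chocolate' AND qty_value > 0 THEN 1 END)
               OVER (PARTITION BY item, loc, month, year) AS cnt
    FROM test_table
)
ORDER BY item, qty_name
"""

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```

The `PARTITION BY` clause is what keeps the count local to one combination; with an empty `OVER ()` the chocolate rows of items a and b would also zero out item c.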
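The solution below assumes that a given item, loc, month, year combination never holds more than one "chocolate" record, as is the case in the sample data. With that assumption there is no need to aggregate per combination.
It simply updates the quantity to zero for every record that is not "chocolate" or "gum" and for which the same combination has a "chocolate" record with a quantity greater than 0.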
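The single-chocolate-record assumption can be checked before running the update. The following is a minimal sketch in plain Python; `duplicate_chocolate_groups` and `ROWS` are hypothetical names, and the rows are the chocolate and gum records from the sample input.

```python
from collections import Counter

# (item, loc, month, year, qty_name, qty_value) rows taken from the sample input.
ROWS = [
    ("a", "x", 8, 2020, "chocolate", 10),
    ("a", "x", 8, 2020, "gum", 15),
    ("b", "y", 8, 2020, "chocolate", 20),
    ("b", "y", 8, 2020, "gum", 30),
    ("c", "s", 8, 2020, "gum", 15),
]


def duplicate_chocolate_groups(rows):
    """Return the (item, loc, month, year) groups with more than one chocolate row."""
    counts = Counter(r[:4] for r in rows if r[4] == "chocolate")
    return [key for key, n in counts.items() if n > 1]


print(duplicate_chocolate_groups(ROWS))  # [] for the sample data
```

An empty list means the assumption holds for the data at hand; any returned key points at a combination worth inspecting first.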
Sample data
create table quantities
(
item nvarchar(1),loc nvarchar(1),month int,year int,qty_name nvarchar(10),qty_value int
);
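insert into quantities (item,loc,month,year,qty_name,qty_value) values
('a','x',8,2020,'chocolate',10),('a','x',8,2020,'gum',15),('a','x',8,2020,'maggi',11),('a','x',8,2020,'colgate',18),
('b','y',8,2020,'chocolate',20),('b','y',8,2020,'gum',30),('b','y',8,2020,'maggi',40),('b','y',8,2020,'colgate',9),
('c','s',8,2020,'gum',15),('c','s',8,2020,'maggi',11),('c','s',8,2020,'colgate',18);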
Solution
update quantities q
join quantities q2
on q2.item = q.item
and q2.loc = q.loc
and q2.month = q.month
and q2.year = q.year
and q2.qty_name = 'chocolate'
and q2.qty_value > 0
set q.qty_value = 0
where q.qty_name not in ('chocolate','gum');
Result
select * from quantities;
item loc month year qty_name qty_value
------- --- ------- ------- ----------- ----------
a x 8 2020 chocolate 10
a x 8 2020 gum 15
a x 8 2020 maggi 0
a x 8 2020 colgate 0
b y 8 2020 chocolate 20
b y 8 2020 gum 30
b y 8 2020 maggi 0
b y 8 2020 colgate 0
c s 8 2020 gum 15
c s 8 2020 maggi 11
c s 8 2020 colgate 18
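EDIT: This is a MySQL solution, because the question was previously tagged with it. I do not have an Apache Spark SQL engine at hand to verify this solution.
Here is the PySpark way.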
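import pyspark.sql.functions as f
# df is assumed to be the input DataFrame with the six columns shown in the question.
# Mark every (item, loc, month, year) combination that has chocolate > 0.
df2 = df.filter("qty_name = 'chocolate' and qty_value > 0").select('item','loc','month','year').withColumn('marker',f.lit('Y'))
# Left join on the full combination, then zero everything except chocolate and gum in marked groups.
df.join(df2,['item','loc','month','year'],'left') \
.withColumn('qty_value',f.when(f.expr("marker = 'Y' and qty_name not in ('chocolate','gum')"),0).otherwise(f.col('qty_value'))) \
.drop('marker').show(12,False)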
+----+---+-----+----+---------+---------+
|item|loc|month|year|qty_name |qty_value|
+----+---+-----+----+---------+---------+
|a |x |8 |2020|chocolate|10 |
|a |x |8 |2020|gum |15 |
|a |x |8 |2020|maggi |0 |
|a |x |8 |2020|colgate |0 |
|b |y |8 |2020|chocolate|20 |
|b |y |8 |2020|gum |30 |
|b |y |8 |2020|maggi |0 |
|b |y |8 |2020|colgate |0 |
|c |s |8 |2020|gum |15 |
|c |s |8 |2020|maggi |11 |
|c |s |8 |2020|colgate |18 |
+----+---+-----+----+---------+---------+