How do I filter a PySpark DataFrame in a loop and append the results to a single DataFrame?

Problem description

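I have a function that filters a PySpark DataFrame by a column value. I want to run it in a loop for different values and append the output of each iteration to a single DataFrame. My current code overwrites the DataFrame on every iteration. How can I append each iteration's result instead of overwriting it?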

Here is my PySpark DataFrame (df):

+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|227729        |2018-04-19 16:50:09|2  months of global news|
|228544        |2018-04-28 17:01:16|18 months of global news|
|231795        |2018-06-11 20:27:48|36 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
|237883        |2018-08-07 18:52:53|1 months of global news |
|237924        |2018-08-08 01:31:13|6 months of global news |
|238892        |2018-08-14 02:45:51|9 months of global news |
|242200        |2018-08-19 21:22:05|3 months of global news |
|249034        |2018-10-11 15:01:06|16 months of global news|
|254415        |2018-12-28 12:13:18|16 months of global news|
|257317        |2019-02-09 18:49:12|11 months of global news|
+--------------+-------------------+------------------------+

For example, this is my function for selecting the "12 months of global news" product:

def renewal_filter(df,n):
    prod_type = str(n)+' months of global news'
    df_first_xmo = df.filter(df.product == prod_type)
    return df_first_xmo

If I call the function in a loop, each iteration overwrites the DataFrame.

month = [12,2]
for x in month:
    # each call returns a new, separate DataFrame; show() prints it
    renewal_filter(df,x).show(truncate=False)
+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
+--------------+-------------------+------------------------+


+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|227729        |2018-04-19 16:50:09|2  months of global news|
+--------------+-------------------+------------------------+

How can I change the loop logic so that each iteration appends to the DataFrame instead of overwriting it, giving this result?

+--------------+-------------------+------------------------+
|user_id       |purchase_date_all  |product                 |
+--------------+-------------------+------------------------+
|226575        |2018-04-04 17:41:23|12 months of global news|
|234206        |2018-07-19 00:52:10|12 months of global news|
|234607        |2018-07-23 20:41:47|12 months of global news|
|235133        |2018-07-30 02:34:58|12 months of global news|
|227729        |2018-04-19 16:50:09|2  months of global news|
+--------------+-------------------+------------------------+
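
One possible approach, sketched under a few assumptions: each call to renewal_filter returns a new DataFrame and nothing in the loop keeps it, so the per-value results have to be collected and combined explicitly. The helper names renewal_filter_many and renewal_filter_isin below are hypothetical (they are not part of PySpark or of the question). The sketch assumes df is a pyspark.sql.DataFrame with a string column named product, and it relies only on DataFrame.filter, DataFrame.unionByName, DataFrame.limit and Column.isin.

```python
from functools import reduce


def renewal_filter(df, n):
    """Keep only the rows whose product is '<n> months of global news'."""
    prod_type = str(n) + ' months of global news'
    return df.filter(df.product == prod_type)


def renewal_filter_many(df, months):
    """Filter once per value in `months` and stack the results.

    Hypothetical helper, not from the question. Assumes `df` is a
    pyspark.sql.DataFrame with a string column named `product`.
    """
    # Keep every filtered DataFrame instead of discarding it.
    parts = [renewal_filter(df, n) for n in months]
    if not parts:
        # No values requested: return an empty DataFrame with the same schema.
        return df.limit(0)
    # unionByName appends rows, matching columns by name.
    return reduce(lambda left, right: left.unionByName(right), parts)


def renewal_filter_isin(df, months):
    """Alternative without a loop: a single filter matching any of the labels."""
    labels = [str(n) + ' months of global news' for n in months]
    return df.filter(df.product.isin(labels))
```

With the df from the question, renewal_filter_many(df, [12, 2]).show(truncate=False) should print a single table, typically with the 12-month rows followed by the 2-month rows, provided the product strings match the generated labels exactly. The isin variant applies one filter and leaves the rows in their original order, so it is probably the simpler choice when the order of the groups does not matter.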
