问题描述
我想为每组名称选择第二行。我使用orderby按名称排序,然后按购买日期/时间戳排序。请务必为每个名称选择第二次购买(按日期时间)。
以下是构建数据框的数据:
data = [
('George',datetime(2020,3,24,19,58),datetime(2018,2,22,55)),('Andrew',datetime(2019,12,17,21,30),7,14,22)),('Micheal',11,13,29,40),5,8,10,19)),('Maggie',31,23),6,33)),('ravi',1,4,47),('Xien',33,51),50)),('George',45)),9,11)),37)),42)),17)),54)),30,41)),datetime(2017,38)),12)),24)),]
df = sqlContext.createDataFrame(data,['name','trial_start','purchase'])
df.show(truncate=False)
我先按名称订购数据,然后购买
df.orderBy("name","purchase").show()
产生结果:
+-------+-------------------+-------------------+
| name| trial_start| purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-07-21 06:14:22|
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| Andrew|2019-12-12 22:21:30|2019-09-19 05:14:11|
| George|2020-03-24 07:19:58|2018-02-24 08:22:55|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| George|2020-03-24 07:19:58|2020-04-24 07:22:54|
| Maggie|2019-02-08 08:31:23|2018-02-19 11:11:42|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
| Maggie|2019-02-08 08:31:23|2020-03-19 10:11:12|
|Micheal|2018-11-22 18:29:40|2017-05-17 12:10:38|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
|Micheal|2018-11-22 18:29:40|2018-08-19 11:11:37|
| ravi|2019-01-01 09:19:47|2018-02-01 09:22:24|
| ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
| ravi|2019-01-01 09:19:47|2019-01-01 09:22:55|
| Xien|2020-03-02 09:33:51|2018-09-21 11:11:41|
| Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
| Xien|2020-03-02 09:33:51|2020-06-21 11:11:11|
+-------+-------------------+-------------------+
如何获得每个名称的第二行?在大熊猫中很容易。我可以只使用nth。我一直在看sql,但没有找到解决方案。任何建议表示赞赏。
我正在寻找的输出将是:
+-------+-------------------+-------------------+
| name| trial_start| purchase|
+-------+-------------------+-------------------+
| Andrew|2019-12-12 22:21:30|2019-08-30 07:12:41|
| George|2020-03-24 07:19:58|2020-03-24 07:22:45|
| Maggie|2019-02-08 08:31:23|2019-05-19 10:11:33|
|Micheal|2018-11-22 18:29:40|2018-05-17 12:10:19|
| ravi|2019-01-01 09:19:47|2019-01-01 09:22:17|
| Xien|2020-03-02 09:33:51|2020-05-21 11:11:50|
+-------+-------------------+-------------------+
解决方法
尝试使用window row_number()
函数,然后在按 2
排序后仅过滤 purchase
行。
Example:
from pyspark.sql import *
from pyspark.sql.functions import *
w=Window.partitionBy("name").orderBy(col("purchase"))
df.withColumn("rn",row_number().over(w)).filter(col("rn") ==2).drop(*["rn"]).show()
SQL Api:
df.createOrReplaceTempView("tmp")
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
sql("select `(rn)?+.+` from (select *,row_number() over(partition by name order by purchase) rn from tmp) e where rn =2").\
show()