Pyspark: 202001 and 202053 (yyyyww) to_date gives null

Problem Description

I have a dataframe with a yearweek column that I want to convert to a date. The code I wrote seems to work for every week except "202001" and "202053", for example:

import pyspark.sql.functions as F

df = spark.createDataFrame([
(1,"202001"),(2,"202002"),(3,"202003"),(4,"202052"),(5,"202053")
],['id','week_year'])

df.withColumn("date",F.to_date(F.col("week_year"),"yyyyw")).show()

I can't figure out what the error is for these weeks or how to fix it. How can I convert weeks 202001 and 202053 into valid dates?

Solution

Handling ISO weeks in Spark is indeed a headache; in fact, this functionality was deprecated (removed?) in Spark 3. I think using Python's datetime utilities inside a UDF is a more flexible approach.

import datetime
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' is for specifying the first day of the week
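    # '%G%V%u' parses ISO year, ISO week number and ISO weekday (available in Python 3.6+)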
    return datetime.datetime.strptime(week_year + '1','%G%V%u')

df = spark.createDataFrame([
(1,"202001"),(2,"202002"),(3,"202003"),(4,"202052"),(5,"202053")
],['id','week_year'])

df.withColumn("date",week_year_to_date('week_year')).show()
+---+---------+----------+
| id|week_year|      date|
+---+---------+----------+
|  1|   202001|2019-12-30|
|  2|   202002|2020-01-06|
|  3|   202003|2020-01-13|
|  4|   202052|2020-12-21|
|  5|   202053|2020-12-28|
+---+---------+----------+
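
As a quick sanity check (plain Python, separate from the Spark job, and not part of the original answer), each of those dates is the Monday of the corresponding ISO week:

import datetime

# isocalendar() returns (ISO year, ISO week, ISO weekday); weekday 1 is Monday.
for week_year, d in [("202001", datetime.date(2019, 12, 30)),
                     ("202053", datetime.date(2020, 12, 28))]:
    iso_year, iso_week, iso_day = d.isocalendar()
    print(week_year, iso_year, iso_week, iso_day)
# 202001 2020 1 1
# 202053 2020 53 1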

Based on mck's answer, this is the solution I ended up using for Python version 3.5.2:

import datetime
from dateutil.relativedelta import relativedelta
import pyspark.sql.functions as F

@F.udf('date')
def week_year_to_date(week_year):
    # the '1' is for specifying the first day of the week
    return datetime.datetime.strptime(week_year + '1','%Y%W%w') - relativedelta(weeks = 1)

df = spark.createDataFrame([
(9,"201952"),(1,"202001"),(2,"202002"),(3,"202003"),(4,"202052"),(5,"202053")
],['id','week_year'])

df.withColumn("date",week_year_to_date('week_year')).show()

Without '%G%V%u', which was only added in Python 3.6, I had to subtract one week from the date to get the correct dates.
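
As a rough illustration of that offset (a minimal sketch over the sample weeks above, not part of the UDF itself): '%Y%W%w' numbers weeks starting from the year's first Monday, so for these inputs it lands one week after the ISO Monday, and subtracting a week reproduces the dates shown in the output table.

import datetime
from dateutil.relativedelta import relativedelta

# Parse with the non-ISO pattern, then shift back one week.
for wk in ["202001", "202052", "202053"]:
    non_iso = datetime.datetime.strptime(wk + '1', '%Y%W%w')
    shifted = non_iso - relativedelta(weeks=1)
    print(wk, non_iso.date(), shifted.date())
# 202001 2020-01-06 2019-12-30
# 202052 2020-12-28 2020-12-21
# 202053 2021-01-04 2020-12-28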