Problem description
I am trying to lowercase all column names in a PySpark DataFrame schema, including the element names of complex-type columns.
Example:
original_df
|-- USER_ID: long (nullable = true)
|-- COMPLEX_COL_ARRAY: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- KEY: timestamp (nullable = true)
| | |-- VALUE: integer (nullable = true)
target_df
|-- user_id: long (nullable = true)
|-- complex_col_array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: timestamp (nullable = true)
| | |-- value: integer (nullable = true)
However, I have only managed to lowercase the top-level column names, using the script below:
from pyspark.sql.types import StructField
schema = df.schema
schema.fields = list(map(lambda field: StructField(field.name.lower(),field.dataType),schema.fields))
I know I can access the field names of nested elements with the following syntax:
for f in schema.fields:
    # Only array columns whose elements are structs have named nested fields
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())
Thanks for your help!
Solution
Suggesting an answer to my own question, inspired by this question: Rename nested field in spark dataframe
from pyspark.sql.types import StructField

# Read parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema

# Lowercase the names of all top-level fields, keeping each field's nullability
schema.fields = list(map(lambda field: StructField(field.name.lower(), field.dataType, field.nullable), schema.fields))
# Keep the cached name list in sync with the renamed fields
schema.names = [field.name for field in schema.fields]

for f in schema.fields:
    # If the field is an array of structs, lowercase all element field names
    # (this handles one level of nesting only)
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        for e in f.dataType.elementType.fieldNames():
            schema[f.name].dataType.elementType[e].name = schema[f.name].dataType.elementType[e].name.lower()
            ind = schema[f.name].dataType.elementType.names.index(e)
            schema[f.name].dataType.elementType.names[ind] = e.lower()

# Recreate dataframe with lowercase schema
df_lowercase = spark.createDataFrame(df.rdd, schema)
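
The loop above only covers arrays of structs one level deep. For deeper nesting (structs inside structs, maps, arrays of arrays), one possible alternative is to work on the JSON form of the schema and lowercase names recursively. The sketch below is not part of the original answer; the helper name lowercase_schema_json is hypothetical, and it assumes the dictionary layout that df.schema.jsonValue() produces ("struct" with "fields", "array" with "elementType", "map" with "keyType" and "valueType").

```python
def lowercase_schema_json(node):
    """Return a copy of a Spark schema JSON value with all struct field names lowercased.

    `node` is assumed to follow the layout of `StructType.jsonValue()`:
    primitive types are plain strings, complex types are dicts with a "type" key.
    """
    if not isinstance(node, dict):
        # Primitive type such as "long" or "timestamp": nothing to rename
        return node

    kind = node.get("type")
    if kind == "struct":
        # Rename every field and recurse into its data type
        return {
            **node,
            "fields": [
                {**field,
                 "name": field["name"].lower(),
                 "type": lowercase_schema_json(field["type"])}
                for field in node["fields"]
            ],
        }
    if kind == "array":
        return {**node, "elementType": lowercase_schema_json(node["elementType"])}
    if kind == "map":
        return {
            **node,
            "keyType": lowercase_schema_json(node["keyType"]),
            "valueType": lowercase_schema_json(node["valueType"]),
        }
    # Any other dict-shaped type (for example a user-defined type) is left unchanged
    return node
```

With a DataFrame at hand, the result could then be turned back into a schema with StructType.fromJson(lowercase_schema_json(df.schema.jsonValue())) and passed to spark.createDataFrame(df.rdd, new_schema) as in the answer above. The function builds new dictionaries rather than mutating df.schema in place, so the original schema stays untouched.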