Problem description
We are generating a DataFrame as follows:
val res_df = df.select($"id",$"type",$"key",from_json($"value",schema).as("s")).select("id","type","key","s.*")
However, we need to rename all the columns produced by "s.*" so that each field name is prefixed with "s_".
Solution
Here is one way to solve your problem:
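import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.functions._

object renameNestedColumn extends App {
  // The original mixed in a project-local trait (common.sparkSession) that supplied `spark`;
  // a local session is built here instead so the example is self-contained.
  val spark = SparkSession.builder().master("local[*]").appName("renameNestedColumn").getOrCreate()

  // Schema with a nested struct column "value"
  val schema = new StructType()
    .add(StructField("id", StringType))
    .add(StructField("value", new StructType()
      .add("city", StringType)
      .add("age", StringType)))

  val data = Seq(Row("1", Row("montreal", "30")), Row("2", Row("ny", "25")))
  val rdd = spark.sparkContext.parallelize(data)
  val df = spark.createDataFrame(rdd, schema)
  df.printSchema()

  // Expand the struct only to read its field names, then alias each field with the prefix
  val nestedCols = df.select("value.*").columns.map(c => col(s"value.$c").as(s"prefix_$c")).toSeq ++ Seq(col("id"))
  df.select(nestedCols: _*).show(false)
}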
Nested schema
root
 |-- id: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- age: string (nullable = true)
Flattened output with prefixed nested columns
+-----------+----------+---+
|prefix_city|prefix_age|id |
+-----------+----------+---+
|montreal   |30        |1  |
|ny         |25        |2  |
+-----------+----------+---+
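Applied to the DataFrame from the question, the same idea might look like the sketch below. The helper name flattenWithPrefix, the object name PrefixStructFields, the sample row, and the two-field JSON schema are all assumptions made for illustration; they do not come from the question. The helper keeps every non-struct column as is and aliases each field of the struct column with the given prefix.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

object PrefixStructFields {
  // Hypothetical helper: flattens the struct column `structCol`, prefixing each
  // of its fields with `prefix`, and keeps all other columns unchanged.
  def flattenWithPrefix(df: DataFrame, structCol: String, prefix: String): DataFrame = {
    val kept: Seq[Column] = df.columns.filter(_ != structCol).map(col).toSeq
    // Expanding the struct is only used to discover its field names
    val renamed: Seq[Column] = df.select(s"$structCol.*").columns
      .map(f => col(s"$structCol.$f").as(s"$prefix$f")).toSeq
    df.select(kept ++ renamed: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("PrefixStructFields").getOrCreate()
    import spark.implicits._

    // Assumed JSON schema and sample row, standing in for the question's data
    val schema = new StructType().add("city", StringType).add("age", StringType)
    val df = Seq(("1", "a", "k1", """{"city":"montreal","age":"30"}"""))
      .toDF("id", "type", "key", "value")

    // Same first select as in the question, but "s" is kept as a struct for now
    val parsed = df.select($"id", $"type", $"key", from_json($"value", schema).as("s"))
    val res_df = flattenWithPrefix(parsed, "s", "s_")
    res_df.printSchema()

    spark.stop()
  }
}
```

With this sample input, res_df has the columns id, type, key, s_city, s_age. Field names that contain dots or other special characters would need backtick quoting inside col(...), which this sketch does not handle.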