Union can only be performed on tables with the compatible column types

Problem

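u"Union can only be performed on tables with the compatible column types. map<string,int> <> struct<int:int,long:null> at the Nth column of the second table"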

This is what the schemas look like:

Dataset 1

root
 |-- name: string (nullable = true)
 |-- count: struct (nullable = true)
 |    |-- int: integer (nullable = true)
 |    |-- long: null (nullable = true)

Dataset 2

root
 |-- name: string (nullable = true)
 |-- count: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

The union fails on these DataFrames when I run the following command:

data = dataset1_df.union(dataset2_df)

How can I resolve this?

Update: I want to change the schemas so that both look like this:

Dataset 1

root
 |-- name: string (nullable = true)
 |-- count: long

Dataset 2

root
 |-- name: string (nullable = true)
 |-- count: long

Solution

A simple solution is to cast the mismatched column in one of the DataFrames so that its type matches the other, as follows:

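    // Needed outside spark-shell; assumes a SparkSession named `spark`
    import org.apache.spark.sql.functions.map
    import spark.implicits._

    val df1 = spark.sql("select 'foo' name,named_struct('int',1,'long',null) count")
    df1.show(false)
    df1.printSchema()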
    /**
      * +----+-----+
      * |name|count|
      * +----+-----+
      * |foo |[1,] |
      * +----+-----+
      *
      * root
      * |-- name: string (nullable = false)
      * |-- count: struct (nullable = false)
      * |    |-- int: integer (nullable = false)
      * |    |-- long: null (nullable = true)
      */

    val df2 = spark.sql("select 'bar' name,map('2',3) count")
    df2.show(false)
    df2.printSchema()

    /**
      * +----+--------+
      * |name|count   |
      * +----+--------+
      * |bar |[2 -> 3]|
      * +----+--------+
      *
      * root
      * |-- name: string (nullable = false)
      * |-- count: map (nullable = false)
      * |    |-- key: string
      * |    |-- value: integer (valueContainsNull = false)
      */

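    // Rebuild df1's struct as a map<string,int> so it matches df2's column type:
    // the `int` field (cast to string) becomes the key, the `long` field (cast to integer) the value
    df1.withColumn("count",map($"count.int".cast("string"),$"count.long".cast("integer")))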
      .union(df2)
      .show(false)

    /**
      * +----+--------+
      * |name|count   |
      * +----+--------+
      * |foo |[1 ->]  |
      * |bar |[2 -> 3]|
      * +----+--------+
      */
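
The update in the question asks for a plain `count: long` column in both datasets rather than a map. The following is a minimal sketch of one way to get there, not a definitive answer, and it rests on two assumptions that are not stated in the question: for the struct, whichever of `count.long` or `count.int` is non-null is taken as the value; for the map, the values are summed into a single number. The object name `UnionAsLong` and the method `build` are hypothetical, and the `aggregate` and `map_values` SQL functions require Spark 2.4 or later.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, expr}

// Hypothetical helper: normalises both `count` columns to long before the union.
object UnionAsLong {
  def build(spark: SparkSession): DataFrame = {
    // Same sample data as in the answer above
    val df1 = spark.sql("select 'foo' name, named_struct('int', 1, 'long', null) count")
    val df2 = spark.sql("select 'bar' name, map('2', 3) count")

    // Assumption: the struct holds the value in either `long` or `int`;
    // take the first non-null one and cast it to long.
    val df1Long = df1.withColumn(
      "count",
      coalesce(col("count.long").cast("long"), col("count.int").cast("long")))

    // Assumption: a map is reduced to one number by summing its values.
    // Replace this with whatever rule fits the real data.
    val df2Long = df2.withColumn(
      "count",
      expr("aggregate(map_values(`count`), 0L, (acc, v) -> acc + cast(v as bigint))"))

    // Both `count` columns are now long, so the union is accepted.
    df1Long.union(df2Long)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("union-as-long")
      .getOrCreate()
    val result = build(spark)
    result.printSchema()
    result.show(false)
    spark.stop()
  }
}
```

If a different rule suits the map column better (for example, picking the value for one specific key), only the `df2Long` expression needs to change; the union works as long as both columns end up with the same type.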