Spark 3.0 reads all columns of an ORC table written by Spark 2.2 even though column pruning is applied

Problem description

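Some background first: we currently run Spark 2.2 and plan to migrate to Spark 3.0 soon. Before migrating, we ran a set of queries on both Spark 2.2 and Spark 3.0 to check for potential problems. The source tables for these queries are in ORC format and were written by Spark 2.2.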

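Spark 3.0 reads ORC files with the native reader by default (two flags enable this: set spark.sql.hive.convertMetastoreOrc=true and set spark.sql.orc.impl=native). I found that even when column pruning is applied, the native reader in Spark 3.0 still reads all columns, which slows down reads.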
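For reference, the two flags mentioned above can also be set when the session is built. This is only a minimal configuration sketch: the application name is a made-up placeholder, and it assumes the spark-hive module is on the classpath so that enableHiveSupport() is available.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a session that reads metastore ORC tables with the native reader.
// "orc-reader-check" is a hypothetical application name, not something from the question.
val spark = SparkSession.builder()
  .appName("orc-reader-check")
  .enableHiveSupport()
  // Convert Hive metastore ORC tables to Spark's own ORC data source.
  .config("spark.sql.hive.convertMetastoreOrc", "true")
  // Use the native ORC implementation instead of the Hive one.
  .config("spark.sql.orc.impl", "native")
  .getOrCreate()
```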

For example, take the query: select col_a from table_a;

table_a has 100 columns and this query needs only 1 of them. Column pruning shows up in the physical plan, but FileScanRDD still reads all 100 columns.

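I then attached a remote debugger. The requestedColumnIds method in OrcUtils.scala checks whether all the ORC field names start with "_col". In my case they do, for example "_col1" and "_col2", so pruneCols is not done and the code path below reads every column. The code is: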

  def requestedColumnIds(
      isCaseSensitive: Boolean,
      dataSchema: StructType,
      requiredSchema: StructType,
      reader: Reader,
      conf: Configuration): Option[(Array[Int], Boolean)] = {
    ...
      if (orcFieldNames.forall(_.startsWith("_col"))) {
        // This is a ORC file written by Hive, no field names in the physical schema, assume the
        // physical schema maps to the data scheme by index.
        assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
          s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
          "no idea which columns were dropped, fail to read.")
        // for ORC file written by Hive, no field names
        // in the physical schema, there is a need to send the
        // entire dataSchema instead of required schema.
        // So pruneCols is not done in this case
        Some(requiredSchema.fieldNames.map { name =>
          val index = dataSchema.fieldIndex(name)
          if (index < orcFieldNames.length) {
            index
          } else {
            -1
          }
        }, false)
    ...
  }

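Does this code comment mean that Spark's native reader does not support column pruning for the old ORC layout, where the physical schema holds column indexes instead of the actual column names?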
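To make the question concrete, here is a small Spark-free model of the branch quoted above. It is a sketch, not Spark's implementation: OrcColumnIdModel and positionalColumnIds are hypothetical names, schemas are reduced to plain lists of field names, and the parts elided with "..." in the quoted code are not modeled (the sketch simply returns None for them).

```scala
object OrcColumnIdModel {
  /**
   * Hypothetical, simplified stand-in for the "_col" branch quoted in the question.
   * Returns the column ids for the required fields plus a flag that plays the role
   * of the second tuple element in the quoted code (false = no column pruning).
   */
  def positionalColumnIds(
      dataSchema: Seq[String],
      requiredSchema: Seq[String],
      orcFieldNames: Seq[String]): Option[(Seq[Int], Boolean)] = {
    if (orcFieldNames.forall(_.startsWith("_col"))) {
      // The physical schema carries no real names, so fields are matched by position.
      require(orcFieldNames.length <= dataSchema.length,
        "data schema has fewer fields than the ORC physical schema")
      val ids = requiredSchema.map { name =>
        val index = dataSchema.indexOf(name)
        require(index >= 0, s"unknown field: $name")
        // A field beyond the end of the physical schema is marked as missing.
        if (index < orcFieldNames.length) index else -1
      }
      // The flag is always false on this path: the whole data schema is sent to the reader.
      Some((ids, false))
    } else {
      None // other branches are elided in the question and not modeled here
    }
  }
}
```

With a three-column data schema and physical names "_col0", "_col1", "_col2", asking only for col_a still yields the flag false, which matches the behavior described in the question: the required column is located by index, but pruning is switched off for the read.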

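Switching back to the Hive reader by setting the flags set spark.sql.hive.convertMetastoreOrc=false and set spark.sql.orc.impl=hive works around the problem. But that is not a good idea, because the native reader provides optimizations the Hive reader lacks, such as vectorized reading.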

Is there a way to use the native reader without running into this column pruning problem?

Any help or suggestions would be appreciated. Thanks!

Solution

No effective solution to this problem has been found yet.


apache-spark apache-spark-sql hive orc pruning