Error when reading from amazon-s3 in Spark with a brace expansion of more than 25 files

Problem description

I just upgraded from Spark 2.4 to Spark 3.

The following code ran fine in Spark 2.4:

df = spark.read.parquet('s3a://bucket/path/{'+
                                      'file1,'+
                                      'file2,'+
                                      'file3,'+
                                      'file4,'+
                                      'file5,'+
                                      'file6,'+
                                      'file7,'+
                                      'file8,'+
                                      'file9,'+
                                      'file10,'+
                                      'file11,'+
                                      'file12,'+
                                      'file13,'+
                                      'file14,'+
                                      'file15,'+
                                      'file16,'+
                                      'file17,'+
                                      'file18,'+
                                      'file19,'+
                                      'file20,'+
                                      'file21,'+
                                      'file22,'+
                                      'file23,'+
                                      'file24,'+
                                      'file25'+
                                      '}')

But in Spark 3 I get this error:

Py4JJavaError: An error occurred while calling o944.parquet.
: org.apache.hadoop.fs.s3a.AWSS3IOException: getFileStatus on s3a://

...

com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: aaa),S3 Extended Request ID:

If I reduce the number of files to fewer than about 24, the query completes successfully in Spark 3.

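I can't find any reference to a limit on the number of files in a brace expansion like this in S3. What could be going wrong, and how can I fix it?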

Solution

Why not point Spark at the whole directory and let it scan the files?

df = spark.read.parquet('s3a://bucket/path/')

AWS limits the query to 1024 characters, and a long brace expansion can exceed that. Somehow this was not a problem in Spark 2.
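If you only need some of the files in the directory, one option that may help is to pass each file as its own path instead of packing them all into a single brace glob. This is only a sketch: `brace_glob`, `build_paths`, `read_files` and `QUERY_LIMIT` are names made up for illustration, the bucket and file names are the placeholders from the question, and the 1024 figure is simply the one quoted above, not something verified here. It relies on `spark.read.parquet` accepting several paths as separate arguments.

```python
# Limit quoted in the answer above; treated as an assumption, not verified.
QUERY_LIMIT = 1024


def brace_glob(prefix, names):
    """Build the single brace-expansion path used in the question."""
    return prefix.rstrip("/") + "/{" + ",".join(names) + "}"


def build_paths(prefix, names):
    """Build one full path per file, so no single path grows very long."""
    base = prefix.rstrip("/")
    return [base + "/" + name for name in names]


def read_files(spark, prefix, names):
    """Read the listed parquet files by passing each path separately.

    `spark` is an existing SparkSession; parquet() takes multiple paths.
    """
    return spark.read.parquet(*build_paths(prefix, names))


# Placeholder names from the question: file1 ... file25.
names = ["file%d" % i for i in range(1, 26)]
glob_path = brace_glob("s3a://bucket/path/", names)
paths = build_paths("s3a://bucket/path/", names)

# Quick diagnostic: how long is the glob compared with the longest single path?
glob_too_long = len(glob_path) > QUERY_LIMIT
longest_single_path = max(len(p) for p in paths)
```

With a live session you would call `df = read_files(spark, 's3a://bucket/path/', names)`. Whether this avoids the 400 error in your setup depends on what actually triggers it, so treat it as something to try rather than a confirmed fix.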

amazon-s3 apache-spark brace-expansion pyspark