Unable to use multiple KMS keys with AvroParquetWriter and Hadoop Configuration for SSE

Problem description

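In a Java application running on an AWS EC2 instance (not a Hadoop cluster), I use the parquet-hadoop/avro libraries to create AvroParquetWriters that generate Parquet files, which are then written to a bucket in S3. I create multiple AvroParquetWriters with different Configurations that specify different KMS keys for encryption, but all of the files end up encrypted with the same KMS key (whichever key was in the first Configuration used).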

Here is how I create the Configurations and the writers:

Configuration conf1 = new Configuration();

conf1.set("fs.s3a.server-side-encryption.key",awsKmsId1);
conf1.set("fs.s3a.server-side-encryption-algorithm","SSE-KMS");
conf1.set("fs.s3a.connection.ssl.enabled","true");
conf1.set("fs.s3a.endpoint",s3Endpoint);


Configuration conf2 = new Configuration();

conf2.set("fs.s3a.server-side-encryption.key",awsKmsId2);
conf2.set("fs.s3a.server-side-encryption-algorithm","SSE-KMS");
conf2.set("fs.s3a.connection.ssl.enabled","true");
conf2.set("fs.s3a.endpoint",s3Endpoint);



ParquetWriter<GenericRecord> writer1 = AvroParquetWriter.<GenericRecord>builder(path)
                    .withSchema(parquetSchema)
                    .withConf(conf1)
                    .withWriteMode(ParquetFileWriter.Mode.CREATE)
                    .build();

ParquetWriter<GenericRecord> writer2 = AvroParquetWriter.<GenericRecord>builder(path)
                    .withSchema(parquetSchema)
                    .withConf(conf2)
                    .withWriteMode(ParquetFileWriter.Mode.CREATE)
                    .build();

writer1 and writer2 create different files, but both files are encrypted with the awsKmsId1 key, even though I specified a different key for each writer.

Solution

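I found a fix for this! The problem is caused by the FileSystem cache in hadoop-common (3.3.0). The cache does not use the Configuration object when building its cache key, so when a FileSystem is looked up in the cache, the old FileSystem is returned because the URI scheme and authority (here, the bucket) are the same. I solved it by disabling the cache with conf.set("fs.s3a.impl.disable.cache","true");. The problem is described in this Apache Jira issue.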
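To make the caching behaviour concrete, here is a small toy model in plain Java. It is not Hadoop's code: FsCacheSketch, ToyFileSystem and the settings helper are hypothetical names made up for illustration, the bucket name is invented, and the cache key is simplified to scheme plus authority (as far as I know, the real key also includes the current user). It only sketches why the second Configuration is ignored and why disabling the cache helps.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Toy model only, NOT Hadoop's implementation. It mimics a cache that keys
// file systems by URI scheme and authority and ignores the other settings.
final class FsCacheSketch {

    /** Stand-in for a file system that remembers the KMS key it was created with. */
    static final class ToyFileSystem {
        final String kmsKey;
        ToyFileSystem(String kmsKey) { this.kmsKey = kmsKey; }
    }

    private final Map<String, ToyFileSystem> cache = new HashMap<>();

    /** Returns a file system for the URI; the settings map stands in for a Configuration. */
    ToyFileSystem get(URI uri, Map<String, String> settings) {
        boolean cacheDisabled = Boolean.parseBoolean(
                settings.getOrDefault("fs.s3a.impl.disable.cache", "false"));
        String kmsKey = settings.get("fs.s3a.server-side-encryption.key");
        if (cacheDisabled) {
            // Cache bypassed: a fresh instance is built from these settings.
            return new ToyFileSystem(kmsKey);
        }
        // The cache key contains nothing from the settings, so the first
        // instance created for this bucket is reused for every later lookup.
        String cacheKey = uri.getScheme() + "://" + uri.getAuthority();
        return cache.computeIfAbsent(cacheKey, k -> new ToyFileSystem(kmsKey));
    }

    /** Hypothetical helper that builds the settings for one writer. */
    static Map<String, String> settings(String kmsKey, boolean disableCache) {
        Map<String, String> s = new HashMap<>();
        s.put("fs.s3a.server-side-encryption.key", kmsKey);
        s.put("fs.s3a.impl.disable.cache", String.valueOf(disableCache));
        return s;
    }

    public static void main(String[] args) {
        FsCacheSketch fs = new FsCacheSketch();
        URI a = URI.create("s3a://my-bucket/a.parquet");
        URI b = URI.create("s3a://my-bucket/b.parquet");

        // Cache enabled: the second lookup silently reuses the first key.
        System.out.println(fs.get(a, settings("key-1", false)).kmsKey); // key-1
        System.out.println(fs.get(b, settings("key-2", false)).kmsKey); // key-1

        // Cache disabled: the instance is built from the settings passed in.
        System.out.println(fs.get(b, settings("key-2", true)).kmsKey);  // key-2
    }
}
```

In this model, as in the fix above, the flag has to be set on every Configuration that needs its own key. The trade-off is that each lookup then builds a new instance instead of sharing one.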
