Does AWS Spectrum really need = in the S3 location to understand it as Hive format?

Problem description

I ran some tests with Spectrum.

I created two AWS Glue crawlers.

The first one, named hive-tst, scans:

s3://hive-test/type='a'/year='2021'/month='01'
s3://hive-test/type='b'/year='2021'/month='01'
s3://hive-test/type='c'/year='2021'/month='01'
s3://hive-test/type='d'/year='2021'/month='01'
s3://hive-test/type='e'/year='2021'/month='01'

The second one scans:

s3://non-hive-test/a/2021/01
s3://non-hive-test/b/2021/01
s3://non-hive-test/c/2021/01
s3://non-hive-test/d/2021/01
s3://non-hive-test/e/2021/01

Each partition in each bucket holds two Parquet files of 50 MB each.

Then I ran a test that queries the first partition of each Spectrum table:

select distinct event from test.hive_tst;

It took 8s 272ms.

select distinct partition_0 from test.nonhive_tst;

8s 66ms

So adding = does not seem to improve performance. I also checked whether both tables have the Hive format in their partitions.

select *
from svv_external_partitions
where schemaname='test'
and tablename='hive_tst';

values	location	input_format	output_format	serialization_lib
["a","2021","01"]	s3://hive-test/event=a/year=2021/month=01/	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetoutputFormat	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

select *
from svv_external_partitions
where schemaname='test'
and tablename='nonhive_tst';

values	location	input_format	output_format	serialization_lib
["a","01"]	s3://hive-test/a/2021/01/	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetoutputFormat	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

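Maybe there is not enough data in the folders to test this properly, but everything looks the same for both tables: the execution time and the partition format reported by svv_external_partitions.

So the question is:

Does AWS Spectrum really need = in the S3 location to understand it as Hive format?

Solution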

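In the end, after a lot of searching and reading, I came to a conclusion: both S3 buckets are partitioned, and when we use AWS Glue, all partitions are added automatically.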

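The only difference is that a prefix such as year=2020 follows the Hive naming convention, so AWS Glue knows how to name the partition column when it adds the partitions. The partition then gets a readable name such as year instead of a generic one such as partition_0.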
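
The naming difference can be sketched in a few lines of Python. This is only an illustration of the convention, not Glue's actual implementation: the helper partition_columns is hypothetical, and it assumes the prefix is given relative to the table root.

```python
from typing import Dict


def partition_columns(prefix: str) -> Dict[str, str]:
    """Map each folder level of a prefix to a partition column name.

    Hypothetical helper: a key=value folder keeps its key as the column
    name, any other folder gets a positional partition_<n> name.
    """
    columns: Dict[str, str] = {}
    folders = [part for part in prefix.strip("/").split("/") if part]
    for index, folder in enumerate(folders):
        key, separator, value = folder.partition("=")
        if separator and key:
            # Hive naming convention: the folder carries the column name.
            columns[key] = value
        else:
            # No = in the folder: fall back to a positional name.
            columns[f"partition_{index}"] = folder
    return columns


print(partition_columns("type=a/year=2021/month=01"))
print(partition_columns("a/2021/01"))
```

In both cases three partition levels are found; only the column names differ.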

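So, to answer the question: does AWS Spectrum really need = in the S3 location to understand it as Hive format?

No. You do not need = for the data to be recognized as partitioned; without it the partition columns are simply named in the partition_x format. You need = only if you want the partitions named according to the Hive naming convention.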
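
One plausible reading of the identical timings is that partition selection works on the values stored in the catalog, not on the text of the S3 path. The sketch below is an assumption about that behavior, not a description of Spectrum's internals. The records mimic the values and location columns of svv_external_partitions, the entries are illustrative, and prune is a hypothetical helper.

```python
from typing import Dict, List, Union

Partition = Dict[str, Union[List[str], str]]

# Illustrative catalog entries: a list of partition values plus a location.
hive_style: List[Partition] = [
    {"values": ["a", "2021", "01"], "location": "s3://hive-test/event=a/year=2021/month=01/"},
    {"values": ["b", "2021", "01"], "location": "s3://hive-test/event=b/year=2021/month=01/"},
]
plain_style: List[Partition] = [
    {"values": ["a", "2021", "01"], "location": "s3://non-hive-test/a/2021/01/"},
    {"values": ["b", "2021", "01"], "location": "s3://non-hive-test/b/2021/01/"},
]


def prune(partitions: List[Partition], position: int, wanted: str) -> List[str]:
    """Return the locations whose partition value at `position` equals `wanted`."""
    return [
        str(partition["location"])
        for partition in partitions
        if partition["values"][position] == wanted
    ]


# The same filter selects one location from each table,
# whether or not the path contains =.
print(prune(hive_style, 0, "a"))
print(prune(plain_style, 0, "a"))
```

Under this assumption the path layout only affects how the partition columns are named, which matches the test results in the question.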

amazon-redshift amazon-redshift-spectrum amazon-s3 amazon-web-services hive