How to save a table from EMR in CSV format with a header, stored as a text file in Glue

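Problem description

The current behavior of Spark on EMR (version 5.26) with an associated Glue catalog, when saving data to S3 and metadata to Glue, is as follows.

I have an EMR cluster on which I am running the following commands.

Scenario 1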

Seq(1,2,3).toDF("id")
    .write
    .option("header","true")
    .option("delimiter","|")
    .format("csv")
    .saveAsTable("testdb.spark_csv_test_v1")

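This produces:

S3 files that are correct, with a header and "|"-separated data
Glue metadata with input format (org.apache.hadoop.mapred.SequenceFileInputFormat), output format (org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat), and serialization library (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
Schema:

#	Column name	Data type	Partition key	Comment
1	col	array	-	from deserializer

This works well with EMR, but raises an error when the table is queried from Redshift.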

Scenario 2

Seq(1,3).toDF("id").createOrReplaceTempView("df_test")
spark.sql("""
CREATE TABLE testdb.spark_csv_test_v2
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (header='true','field.delim'='|')
TBLPROPERTIES ('skip.header.line.count'='1','classification'='csv','delimiter'='|')
STORED AS TEXTFILE
AS 
select * from df_test
""")

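This produces:

S3 files that are generated correctly with "|"-separated data, but without a header row
Glue metadata with input format (org.apache.hadoop.mapred.TextInputFormat), output format (org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat), and serialization library (org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe)
Schema:

#	Column name	Data type	Partition key	Comment
1	id	int	-	-

This works with both EMR and Redshift, but there is no header row.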

Question: Is there a way to write data from EMR + Spark to S3 such that:
a. the S3 files have a header row
b. the format is CSV with the provided delimiter
c. the Glue metadata is set properly with a schema (not a single array column named col)
d. the input format is TextInputFormat and the output format is HiveIgnoreKeyTextOutputFormat
e. the data can be read from Redshift Spectrum
f. the data can be read from Spark

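In scenario 1 above, I get (a, b, f) but not (c, d, e).
In scenario 2 above, I get (b, c, e, f) but not (a).

If scenario 2 could somehow write the files with a header, that would be the solution. I believe we might pass option("header", "true"), but this does not work well in the CTAS syntax.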

Answer

In scenario 2 you add "skip.header.line.count"="1" to the table properties, which, according to the AWS documentation, skips the header row. Could you try it without this option?
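
A different route that might cover points (a) through (e) is to split the work in two: write the files with the DataFrame CSV writer (which does emit a header), then register an external TEXTFILE table over the same S3 location with an explicit schema. The sketch below only builds the DDL for the second step. The helper name externalCsvTableDdl, the table name spark_csv_test_v3, and the bucket path are assumptions made for illustration, and this has not been verified on EMR 5.26.

```scala
// Hypothetical helper (not part of Spark or Glue): builds Hive-style DDL for an
// external text table over CSV files that already contain a header row.
def externalCsvTableDdl(
    table: String,
    columns: Seq[(String, String)], // (column name, Hive type)
    delimiter: String,
    location: String): String = {
  val cols = columns.map { case (name, tpe) => s"`$name` $tpe" }.mkString(", ")
  // ROW FORMAT DELIMITED ... STORED AS TEXTFILE selects LazySimpleSerDe with
  // TextInputFormat; skip.header.line.count tells readers that honor it to
  // skip the first line of each file.
  s"""CREATE EXTERNAL TABLE $table ($cols)
     |ROW FORMAT DELIMITED FIELDS TERMINATED BY '$delimiter'
     |STORED AS TEXTFILE
     |LOCATION '$location'
     |TBLPROPERTIES ('skip.header.line.count'='1')""".stripMargin
}

// Placeholder location; replace with a real bucket and prefix.
val path = "s3://my-bucket/spark_csv_test_v3/"
val ddl = externalCsvTableDdl("testdb.spark_csv_test_v3", Seq("id" -> "int"), "|", path)
```

In spark-shell this would be used by first writing the data with Seq(1,2,3).toDF("id").write.option("header","true").option("delimiter","|").csv(path) and then running spark.sql(ddl). One caveat for point (f): Spark's own reader for Hive text tables may not honor skip.header.line.count, in which case reading the path directly with spark.read.option("header","true").option("delimiter","|").csv(path) is a possible fallback.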

amazon-emr amazon-redshift aws-glue aws-glue-data-catalog