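Problem description
Is there a way to flatten deeply nested JSON using a Glue ETL job? The JSON contains nested arrays. I tried running a Glue crawler on the JSON, and the catalog it returned has only one field, PerPlayer, with a struct data type. In the Glue ETL job, should I use the catalog, or should I read the JSON into a DynamicFrame and apply some transformations to flatten it? If there were only one record (no arrays), I could flatten it with Relationalize, but my input has multiple records in an array structure, and some of those records contain nested arrays.
I'm completely new to Glue ETL, so any advice or suggestions would be greatly appreciated.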
{
  "PerPlayer": {
    "requestNo": "REQ912",
    "Batch_Number": "1",
    "Total_No_Of_Batches": "1",
    "player": [
      {
        "username": "user1",
        "characteristics": {
          "race": "Human",
          "class": "Warlock",
          "subclass": "Dawnblade",
          "power": 300,
          "playercountry": "USA"
        },
        "arsenal": [
          {
            "kinetic": {"name": "Sweet Business", "type": "Auto Rifle", "element": "Kinetic"},
            "energy": {"name": "MIDA Mini-Tool", "type": "Submachine Gun", "element": "Solar"},
            "power": {"name": "Play of the Game", "type": "Grenade Launcher", "element": "Arc"}
          },
          {
            "kinetic": {"name": "Sweet Business1", "type": "Auto Rifle1", "element": "Kinetic1"},
            "energy": {"element": "Solar1"},
            "power": {"name": "Play of the Game1", "type": "Grenade Launcher1", "element": "Arc1"}
          }
        ],
        "armor": {
          "head": "Eye of Another World",
          "arms": "Philomath Gloves",
          "chest": "Philomath Robes",
          "leg": "Philomath Boots",
          "classitem": "Philomath Bond"
        },
        "location": {"map": "Titan", "waypoint": "The Rig"}
      },
      {
        "username": "user2",
        "characteristics": {
          "race": "Alien",
          "class": "Starwars",
          "arsenal": {
            "kinetic": {"name": "salt Business", "element": "Kinetic"},
            "energy": {"name": "MIDA Mini-Tool", "element": "Solar"},
            "power": {"name": "Play of the Game", "power": 400, "element": "Arc"}
          },
          "waypoint": "The Rig"
        }
      }
    ]
  }
}
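Solution
Unfortunately, this isn't possible with Glue crawlers: the service only creates tables that look like the data, it doesn't change the data. Athena also has no feature for mapping a nested hierarchy to a flat structure at the SerDe level.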
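You may be able to do this with Glue ETL by transforming the data into a new, flattened dataset, but in general it seems to me that people who try Glue end up with more problems than they solve.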
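To make "transforming the data into a new, flattened dataset" concrete, here is a minimal plain-Python sketch of the row-per-array-element logic. It is not Glue code and uses no Glue API; the function names `flatten_struct` and `flatten_players` are hypothetical, the sample is a trimmed version of the question's data, and it assumes that `arsenal` sits directly under each player and may be either an array or a single object.

```python
from typing import Any, Dict, Iterator


def flatten_struct(value: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into one dict with dotted keys; lists are left as-is."""
    flat: Dict[str, Any] = {}
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten_struct(item, prefix=f"{name}."))
        else:
            flat[name] = item
    return flat


def flatten_players(doc: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per (player, arsenal entry) pair."""
    per_player = doc["PerPlayer"]
    # Request-level scalars are repeated on every output row.
    header = {k: v for k, v in per_player.items() if k != "player"}
    for player in per_player.get("player", []):
        arsenal = player.get("arsenal", [])
        # Assumption: arsenal may be a single object instead of an array.
        entries = arsenal if isinstance(arsenal, list) else [arsenal]
        base = flatten_struct({k: v for k, v in player.items() if k != "arsenal"})
        # A player without arsenal entries still produces one row.
        for entry in entries or [{}]:
            row = dict(header)
            row.update(base)
            row.update(flatten_struct(entry, prefix="arsenal."))
            yield row


# Trimmed, illustrative sample based on the question's data.
sample = {
    "PerPlayer": {
        "requestNo": "REQ912",
        "player": [
            {
                "username": "user1",
                "characteristics": {"race": "Human"},
                "arsenal": [
                    {"kinetic": {"name": "Sweet Business"}},
                    {"kinetic": {"name": "Sweet Business1"}},
                ],
            },
            {
                "username": "user2",
                "characteristics": {"race": "Alien"},
                "arsenal": {"kinetic": {"name": "salt Business"}},
            },
        ],
    }
}

rows = list(flatten_players(sample))
for row in rows:
    print(row)
```

With this sample, user1 yields two rows (one per arsenal entry) and user2 yields one. In an actual Glue job the same shape would have to be produced with Spark or DynamicFrame transforms, which this sketch does not attempt to show.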
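What you can do is take the table the crawler created for you and create a view in Athena that flattens the hierarchy. There is an operator called UNNEST that promotes array elements to rows. It could look something like this: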
SELECT
  PerPlayer.requestNo,
  PerPlayer.Batch_Number,
  PerPlayer.Total_No_Of_Batches,
  player.username,
  player.characteristics.race,
  player.characteristics.class
  -- and so on (add further columns here, separated by commas)
FROM original_table, UNNEST(PerPlayer.player) AS t(player)
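The result has one row for each element of the player array in each original row, and you can access the columns of both the original row and the player element. The AS t(player) syntax just means that the virtual table containing the array elements should be called t and have a column called player.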
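The row-multiplying behaviour described above can be modelled in a few lines of plain Python. This is only an illustration of the semantics, not how Athena implements it; the helper name `cross_join_unnest` and the second row with request number REQ913 are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List


def cross_join_unnest(
    rows: Iterable[Dict[str, Any]],
    get_array: Callable[[Dict[str, Any]], List[Any]],
    column: str,
) -> Iterator[Dict[str, Any]]:
    """Pair each input row with each element of its array,
    similar to FROM tbl, UNNEST(arr) AS t(column)."""
    for row in rows:
        for element in get_array(row) or []:
            # The original columns stay available next to the new column.
            yield {**row, column: element}


# Hypothetical table: the second row has an empty player array.
original_table = [
    {"PerPlayer": {"requestNo": "REQ912",
                   "player": [{"username": "user1"}, {"username": "user2"}]}},
    {"PerPlayer": {"requestNo": "REQ913", "player": []}},
]

result = [
    (r["PerPlayer"]["requestNo"], r["player"]["username"])
    for r in cross_join_unnest(
        original_table, lambda r: r["PerPlayer"]["player"], "player"
    )
]
print(result)
```

The first row produces one output row per player. The second row produces none, because a cross join against an empty array has nothing to pair with; keep that in mind if some records have no players.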
There are many other questions about UNNEST on Stack Overflow that you can look to for inspiration.
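If you want to run queries against this data repeatedly, you can create a view from the query above and then query that view. Apart from performance, it will be just as if your data had been flattened.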
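Performance will depend on many details, so don't optimize unless you have to. If you do, you can use the query above with CTAS to create a new, flat dataset.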