How to aggregate data with ETL (AWS Glue) so that Athena can select documents where only a certain percentage of the data matches a given attribute

Problem description

If there is a better place to ask this kind of question, please let me know.

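I have a service that stores documents and tries to read them. For each document, the service extracts the lines and words it can read, each with a confidence score.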

The payload for a single document (before ETL) looks like this:

{
    "Blocks": [
        {
            "Type": "LINE",
            "Confidence": 90,
            "Value": "this is a sentence"
            ...
        },
        {
            "Type": "WORD",
            "Confidence": 99,
            "Value": "this"
            ...
        },
        {
            "Type": "WORD",
            "Confidence": 97,
            "Value": "is"
            ...
        },
        {
            "Type": "WORD",
            "Confidence": 89,
            "Value": "a"
            ...
        },
        {
            "Type": "WORD",
            "Confidence": 99,
            "Value": "sentence"
            ...
        },
        {
            "Confidence": 50
        },
        {
            "Type": "LINE",
            "Confidence": 90,
            "Value": "example of another line"
            ...
        },
        ...
    ]
}

I am looking for a high-level algorithm or idea for an ETL aggregation function, so that I can run an Athena query that gives me results like

“Give me all documents in which 30% of the words have a confidence > 60”
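
To make the intended aggregation concrete, here is a minimal Python sketch of one possible per-document summary. The function name `summarize_words`, the `docid` field, and the output column names are hypothetical and not part of the service; the sketch only assumes the `Blocks` / `Type` / `Confidence` shape shown in the payload above.

```python
import json


def summarize_words(doc_id, payload, min_confidence=60):
    """Reduce one document payload to per-document word counts.

    Only blocks whose Type is "WORD" are counted; LINE blocks are ignored.
    """
    words = [b for b in payload.get("Blocks", []) if b.get("Type") == "WORD"]
    confident = sum(1 for b in words if b.get("Confidence", 0) > min_confidence)
    return {
        "docid": doc_id,  # hypothetical identifier, not present in the payload
        "word_count": len(words),
        "confident_word_count": confident,
    }


# Sample payload modelled on the example above (elided fields left out).
payload = {
    "Blocks": [
        {"Type": "LINE", "Confidence": 90, "Value": "this is a sentence"},
        {"Type": "WORD", "Confidence": 99, "Value": "this"},
        {"Type": "WORD", "Confidence": 97, "Value": "is"},
        {"Type": "WORD", "Confidence": 89, "Value": "a"},
        {"Type": "WORD", "Confidence": 99, "Value": "sentence"},
    ]
}

summary = summarize_words("doc-1", payload)
print(json.dumps(summary))
```

With one such summary row per document, the percentage condition reduces to comparing `confident_word_count / word_count` against 0.3.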

Solution

You don't need ETL; Athena can read JSON natively, see

https://docs.aws.amazon.com/athena/latest/ug/querying-JSON.html

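Once you have created the table, the next step is to write the right query for your task. Your statement "give me all documents where x% of the words have some confidence" is pretty standard. The details depend on how you define the table columns, but you could do something like this: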

SELECT docid
FROM mytable
GROUP BY docid
HAVING count_if(confidence > 60) * 1.0 / count(*) > 0.3
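
For readers who want to check the grouping logic outside Athena, the following Python sketch mirrors what the query computes. It assumes the table has already been flattened to one `(docid, confidence)` row per word; the function name and the sample rows are made up for illustration. If the table also contains LINE blocks, they would need to be filtered out first, otherwise they are counted in the denominator as well.

```python
from collections import defaultdict


def docs_with_confident_words(rows, min_confidence=60, min_fraction=0.3):
    """Return the docids whose share of rows with confidence above
    min_confidence is greater than min_fraction (GROUP BY ... HAVING)."""
    totals = defaultdict(int)     # count(*) per docid
    confident = defaultdict(int)  # count_if(confidence > min_confidence) per docid
    for docid, confidence in rows:
        totals[docid] += 1
        if confidence > min_confidence:
            confident[docid] += 1
    return sorted(d for d in totals if confident[d] / totals[d] > min_fraction)


# Hypothetical flattened rows: one (docid, confidence) pair per word.
rows = [
    ("doc-1", 99), ("doc-1", 97), ("doc-1", 89), ("doc-1", 99),
    ("doc-2", 50), ("doc-2", 40), ("doc-2", 65), ("doc-2", 30),
]

result = docs_with_confident_words(rows)
print(result)
```

Here `doc-1` qualifies because all four of its words exceed the threshold, while `doc-2` does not because only one of its four words (25%) does.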

amazon-athena aws-glue bigdata etl pyspark