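Problem
I am trying to get the number of distinct users per event per day, while maintaining a running total for every hour. I am using Athena/Presto as the query engine.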
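I tried two queries, listed below in order.
The first computes COUNT(DISTINCT ...) per hour and then takes a running SUM of those hourly counts. After seeing the results, I realized that taking the SUM of COUNT(DISTINCT ...) is not correct, because distinct counts are not additive.
So I then tried the second query.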
-- Attempt 1: running SUM of the hourly COUNT(DISTINCT ...) values
SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400) / 3600 AS hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) AS count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
      AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1, 2
    ORDER BY 1, 2
);
-- Attempt 2: SUM(COUNT(DISTINCT ...)) applied directly as a window function
SELECT
    eventname,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
      AND eventname IN ('e1', 'e4')
);
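To make the non-additivity concrete, here is a small Python sketch on made-up data (the hours and user IDs are hypothetical, not taken from clickstream.moengage). A user who is active in two different hours is counted once in each hour, so the running sum of hourly distinct counts drifts above the true running distinct count.

```python
from itertools import accumulate

# Hypothetical (hour, user_id) rows for a single event on a single day.
events = [(0, "u1"), (0, "u2"), (1, "u1"), (1, "u3"), (2, "u2")]

hours = sorted({hour for hour, _ in events})

# Distinct users per hour, like COUNT(DISTINCT user) ... GROUP BY hour.
hourly_distinct = [len({u for h, u in events if h == hour}) for hour in hours]

# Running SUM of the hourly distinct counts (what the first attempt computes).
running_sum = list(accumulate(hourly_distinct))

# True running distinct count: distinct users seen up to and including each hour.
running_distinct = [len({u for h, u in events if h <= hour}) for hour in hours]

print(hourly_distinct)   # [2, 2, 1]
print(running_sum)       # [2, 4, 5]
print(running_distinct)  # [2, 3, 3]
```

Here u1 and u2 each appear in two hours, which is exactly the difference of 2 between the last running sum (5) and the true count (3).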
Solution
To compute a running distinct count, you can collect the user IDs into a set (an array of distinct values) and take its size:
cardinality(set_agg(moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
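This is an analytic (window) function, so it returns a value for every input row rather than one row per group. Rows that share the same eventname, date, and time_bucket all get the same value, so you can collapse them in an outer query with an aggregate such as max().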
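The behaviour described above can be sketched in Python on made-up rows. This is only an emulation under stated assumptions: the rows are hypothetical, they belong to a single (eventname, date) partition, and the window is assumed to use the default frame, which for an ORDER BY on time_bucket includes all rows up to and including the current row's bucket.

```python
# Hypothetical (time_bucket, user_id) rows for one (eventname, date) partition.
rows = [(0, "u1"), (0, "u2"), (3600, "u1"), (3600, "u3"), (7200, "u2")]


def running_distinct_per_row(rows):
    """Emulate a running distinct count ordered by time_bucket.

    For each input row, count the distinct users in all rows whose bucket
    is less than or equal to the current row's bucket.
    """
    out = []
    for bucket, _user in rows:
        seen = {u for b, u in rows if b <= bucket}
        out.append((bucket, len(seen)))
    return out


# One output value per input row; rows in the same bucket repeat the value.
per_row = running_distinct_per_row(rows)
print(per_row)  # [(0, 2), (0, 2), (3600, 3), (3600, 3), (7200, 3)]

# Collapse to one row per bucket, as an outer max(...) GROUP BY would.
per_bucket = {}
for bucket, n in per_row:
    per_bucket[bucket] = max(n, per_bucket.get(bucket, 0))
print(per_bucket)  # {0: 2, 3600: 3, 7200: 3}
```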
Alternatively, count each user only the first time they appear:
SELECT eventname,
       date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400) / 3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       -- seqnum = 1 marks a user's first row for the event, so the inner SUM counts new users per hour
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY time_bucket) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid,
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) AS seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20'
        AND eventname IN ('e1', 'e2', 'e3', 'e4')
     ) m
GROUP BY eventname, time_bucket
ORDER BY eventname, time_bucket;
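The first-appearance idea can be checked on a tiny made-up dataset. The Python sketch below is an illustration, not the query itself: the rows, timestamps, and user IDs are hypothetical, and it covers a single event on a single day. It marks each user's earliest row (the equivalent of seqnum = 1), counts new users per bucket, and accumulates those counts.

```python
from itertools import accumulate

# Hypothetical (time_bucket, user_id, eventtimestamp) rows for one event.
rows = [
    (0, "u1", 10),
    (0, "u2", 20),
    (3600, "u1", 3700),
    (3600, "u3", 3800),
    (7200, "u2", 7300),
]

# Bucket of each user's earliest row, i.e. the row where seqnum would be 1.
first_seen = {}
for bucket, user, _ts in sorted(rows, key=lambda r: r[2]):
    first_seen.setdefault(user, bucket)

buckets = sorted({bucket for bucket, _, _ in rows})

# New users per bucket: the inner SUM(CASE WHEN seqnum = 1 ...).
new_users = [sum(1 for b in first_seen.values() if b == bucket) for bucket in buckets]

# Running total of new users: the outer windowed SUM.
running_distinct_count = list(accumulate(new_users))

print(new_users)               # [2, 1, 0]
print(running_distinct_count)  # [2, 3, 3]
```

Because every user contributes exactly one first appearance, the per-bucket counts are additive and the running total matches the true running distinct count.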