Problem description
create table db.temp
location '/user/temp' as
SELECT t1.mobile_no
FROM db.temp t1
WHERE NOT EXISTS (SELECT NULL
                  FROM db.temp t2
                  WHERE t1.mobile_no = t2.mobile_no
                    AND t1.cell != t2.cell
                    AND t2.access_time BETWEEN t1.access_time
                                           AND t1.access_time_5);
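I need to get all users who used the same cell throughout the 5-hour interval that starts at access_time and ends at access_time_5. This query works perfectly well in Impala, but it does not work in Hive.
It fails with this error:
"Error while compiling statement: FAILED: SemanticException [Error 10249]: line 23:25 Unsupported SubQuery Expression"
I have looked at similar questions about this error but could not work out a solution. Any help would be greatly appreciated!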
Solution
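Hive does not support a correlated BETWEEN inside a subquery, and it does not support non-equi joins either. Try rewriting the query with a LEFT JOIN on the equality condition only, counting the rows that match your remaining conditions and then filtering on that count: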
select mobile_no
from (
      SELECT t1.mobile_no,
             sum(case when t1.cell != t2.cell
                       and t2.access_time between t1.access_time and t1.access_time_5
                      then 1 else 0
                 end) as cnt_exclude
      FROM db.temp t1
      LEFT JOIN db.temp t2 on t1.mobile_no = t2.mobile_no
      GROUP BY t1.mobile_no
     ) s
where cnt_exclude = 0
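The drawback of this solution is that the LEFT JOIN pairs every row with every other row for the same mobile_no, which can multiply the row count enormously and hurt performance. If the data is not too large, it may still work.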
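Note that the query above groups by mobile_no only, so it returns one row per user, whereas the original NOT EXISTS returns one row per qualifying t1 row. If the per-row behaviour matters, a variant along the following lines might work. It is only a sketch and assumes that (mobile_no, cell, access_time, access_time_5) identifies a row; exact duplicate rows would collapse into one.

```sql
-- Sketch only: group by the columns assumed to identify a t1 row,
-- so each qualifying access record is kept rather than one row per user.
SELECT s.mobile_no
FROM (
      SELECT t1.mobile_no, t1.cell, t1.access_time, t1.access_time_5,
             SUM(CASE WHEN t1.cell != t2.cell
                       AND t2.access_time BETWEEN t1.access_time AND t1.access_time_5
                      THEN 1 ELSE 0
                 END) AS cnt_exclude
      FROM db.temp t1
      LEFT JOIN db.temp t2 ON t1.mobile_no = t2.mobile_no
      GROUP BY t1.mobile_no, t1.cell, t1.access_time, t1.access_time_5
     ) s
WHERE s.cnt_exclude = 0;
```

The join still carries only the equality condition, so the performance caveat applies unchanged.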
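In my opinion, window functions are a better fit for both databases. Let me assume that access_time is Unix time (that is, measured in seconds); a value stored in another format can easily be converted to this. Then: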
-- 5 hours = 18000 seconds; the frame [access_time - 17999, access_time]
-- covers 18000 one-second values and looks backward from the current row.
SELECT t1.mobile_no
FROM (SELECT t1.*,
             MIN(t1.cell) OVER (PARTITION BY mobile_no
                                ORDER BY access_time
                                RANGE BETWEEN 17999 PRECEDING AND CURRENT ROW
                               ) AS min_cell,
             MAX(t1.cell) OVER (PARTITION BY mobile_no
                                ORDER BY access_time
                                RANGE BETWEEN 17999 PRECEDING AND CURRENT ROW
                               ) AS max_cell
      FROM db.temp t1
     ) t1
WHERE min_cell = max_cell;
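The conversion to Unix time that this answer mentions could look roughly like this. The sketch assumes that access_time is stored as a string in 'yyyy-MM-dd HH:mm:ss' format, which the question does not confirm; access_ts is a hypothetical column name introduced only for illustration.

```sql
-- Assumption: access_time is a 'yyyy-MM-dd HH:mm:ss' string.
-- unix_timestamp() returns seconds since 1970-01-01 00:00:00,
-- so a 5-hour window is 5 * 60 * 60 = 18000 seconds wide.
SELECT t.mobile_no,
       t.cell,
       unix_timestamp(t.access_time) AS access_ts  -- hypothetical column name
FROM db.temp t;
```

With this as the inner source, the window functions would order by access_ts instead of access_time, so that the RANGE frame is measured in seconds.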