AppendStreamTableSink doesn't support consuming update changes produced by node Join(joinType=[InnerJoin])

Problem description

When I execute the following statement with Flink SQL, it fails with the error below:

Requirement

Group the data in user_behavior_kafka_table by the user_id field, then take from each group the row with the largest ts value.

SQL executed

SELECT user_id, item_id, ts
FROM user_behavior_kafka_table AS a
WHERE ts = (SELECT MAX(b.ts)
            FROM user_behavior_kafka_table AS b
            WHERE a.user_id = b.user_id);

Flink version

1.11.2

Error message

AppendStreamTableSink doesn't support consuming update changes which is produced by node Join(joinType=[InnerJoin],where=[((user_id = user_id0) AND (ts = EXPR$0))],select=[user_id,ts,user_id0,EXPR$0],leftInputSpec=[NoUniqueKey],rightInputSpec=[JoinKeyContainsUniqueKey])
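The planner rejects the query because the self-join introduced by the MAX() subquery emits update changes as new events arrive for a user_id, while the sink is an append-only AppendStreamTableSink that can only consume inserts. To see the plan the planner builds, Flink's EXPLAIN statement can be used; this is just a sketch, I have not run it against the job above:

EXPLAIN PLAN FOR
SELECT user_id, item_id, ts
FROM user_behavior_kafka_table AS a
WHERE ts = (SELECT MAX(b.ts)
            FROM user_behavior_kafka_table AS b
            WHERE a.user_id = b.user_id);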

Job deployment

On YARN

Table data

  • user_behavior_kafka_table: data consumed from the Kafka topic (a hypothetical DDL sketch for both tables follows this list)

{"user_id":"aaa","item_id":"11-222-333","comment":"aaa access item at","ts":100}
{"user_id":"ccc","item_id":"11-222-334","comment":"ccc access item at","ts":200}
{"user_id":"ccc","ts":300}
{"user_id":"bbb","comment":"bbb access item at","ts":200}
{"user_id":"aaa","ts":400}
{"user_id":"ccc","ts":400}
{"user_id":"vvv","comment":"vvv access item at","ts":200}
{"user_id":"bbb","ts":300}
{"user_id":"aaa","ts":300}
{"user_id":"ccc","ts":100}
{"user_id":"bbb","ts":100}

  • user_behavior_hive_table: expected result

{"user_id":"aaa","ts":400}
{"user_id":"bbb","ts":200}

Solution

To get the result you expect from this query, it needs to be executed in batch mode. As a streaming query, the Flink SQL planner can't cope with it, and even if it could, it would produce a stream of results in which the last result for each user_id matches the expected output, but with additional intermediate results along the way.

For example, for user aaa, the following results would appear:

aaa 11-222-333 100
aaa 11-222-333 200
aaa 11-222-334 400

But the row with ts=300 would be skipped, because at no point is it the row with the largest ts.
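As a rough sketch of what a batch run could involve in the Flink 1.11 legacy SQL client: execution.type is the property used in the client's environment file (sql-client-defaults.yaml), and treating it as switchable via SET, as below, is an assumption on my part. Note also that the 1.11 Kafka connector is an unbounded streaming source, so a true batch run would have to read the same data from a bounded source instead.

-- assumed session property; the name comes from the legacy client's environment file
SET execution.type=batch;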

If you want to do this in streaming mode, try rewriting it as a top-n query:

SELECT user_id, item_id, ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS row_num
  FROM user_behavior_kafka_table)
WHERE row_num = 1;

I believe this should work, but I haven't been able to test it easily.
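Wired up to the sink from the question, the streaming job would look roughly like the sketch below; the INSERT INTO wrapping is mine, not part of the original answer. One caveat: in Flink 1.11 this keep-last-row deduplication still emits update changes, so the sink must be able to accept them.

INSERT INTO user_behavior_hive_table
SELECT user_id, item_id, ts
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) AS row_num
  FROM user_behavior_kafka_table)
WHERE row_num = 1;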