Error when querying a Hive table with a map data type in Spark SQL, although the same query works in HiveQL

Problem description

My Hive table looks like this:

+---------------+--------------+----------------------+
| column_value  | metric_name  |         key          |
+---------------+--------------+----------------------+
| A37B          | Mean         | {0:"202006",1:"1"}   |
| ACCOUNT_ID    | Mean         | {0:"202006",1:"2"}   |
| ANB_200       | Mean         | {0:"202006",1:"3"}   |
| ANB_201       | Mean         | {0:"202006",1:"4"}   |
| AS82_RE       | Mean         | {0:"202006",1:"5"}   |
| ATTR001       | Mean         | {0:"202007",1:"2"}   |
| ATTR001_RE    | Mean         | {0:"202007",1:"3"}   |
| ATTR002       | Mean         | {0:"202007",1:"4"}   |
| ATTR002_RE    | Mean         | {0:"202007",1:"5"}   |
| ATTR003       | Mean         | {0:"202008",1:"3"}   |
| ATTR004       | Mean         | {0:"202008",1:"4"}   |
| ATTR005       | Mean         | {0:"202008",1:"5"}   |
| ATTR006       | Mean         | {0:"202009",1:"4"}   |
| ATTR006       | Mean         | {0:"202009",1:"5"}   |
+---------------+--------------+----------------------+

I need to write a Spark SQL query that filters on the key column (a map) using a NOT IN condition that matches both map entries together.

The following query works fine as HiveQL in Beeline:

select * from your_data where key[0] between '202006' and '202009' and key NOT IN ( map(0,"202009",1,"5") );

But when I try the same query in Spark SQL, I get an error:

cannot resolve due to data type mismatch: map
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:115)
	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$3.applyOrElse(CheckAnalysis.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:278)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:277)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:324)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:275)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:275)
	at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)

Please help!

Workaround

I got the answer from another previously asked question. Instead of comparing the whole map against another map, the filter compares the individual entries key[0] and key[1], and this query works fine:

select * from your_data where key[0] between 202006 and 202009 and NOT (key[0]="202009" and key[1]="5" );
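To make the logic of the workaround concrete, here is a minimal plain-Python sketch (not Spark) of the same row filter. Rows are modeled as dicts, with column and key names mirroring the question's table; the data values are a small hypothetical subset.

```python
# Sketch of the workaround's logic: rather than testing the whole map
# against map(0,"202009",1,"5"), test the two entries separately.
rows = [
    {"column_value": "ATTR005", "metric_name": "Mean", "key": {0: "202008", 1: "5"}},
    {"column_value": "ATTR006", "metric_name": "Mean", "key": {0: "202009", 1: "4"}},
    {"column_value": "ATTR006", "metric_name": "Mean", "key": {0: "202009", 1: "5"}},
]

def keep(row):
    k = row["key"]
    # key[0] BETWEEN '202006' AND '202009'  (string comparison, as in the query)
    in_range = "202006" <= k[0] <= "202009"
    # NOT (key[0] = '202009' AND key[1] = '5')
    excluded = k[0] == "202009" and k[1] == "5"
    return in_range and not excluded

filtered = [r for r in rows if keep(r)]
# The row whose key is {0:"202009", 1:"5"} is dropped; the other two remain.
```

The key point is that both conditions on the map entries are combined with AND inside the negation, which is equivalent to excluding the single map value targeted by the original NOT IN.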