Is it possible to iterate over a hierarchical array column to fetch and roll up results from another DataFrame?

Problem description

Say I have some data in dfA, for example a key (pid) and an array-type column (category_ids_array):

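// createDF is not part of core Spark; this assumes the spark-daria extension
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
import org.apache.spark.sql.types.{ArrayType, StringType}

val dfA = spark.createDF(
  List(
    ("10009004", Array("10009004", "10348794", "546313", "546264", "2173952")),
    ("10086262", Array("10086262", "23009642", "3617058", "2173952"))
  ), List(
    ("pid", StringType, true),
    ("category_ids_array", ArrayType(StringType, true), true)
  )
)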

dfA

+----------+-----------------------------------------+
|pid       |category_ids_array                       |
+----------+-----------------------------------------+
|10009004  |[10009004,10348794,546313,546264,2173952]|
|10086262  |[10086262,23009642,3617058,2173952]      |
+----------+-----------------------------------------+

I also have a second DataFrame, dfB, which looks like this:

+----------+------------+---------------------+
|pid       |attribute_id|attribute_value      |
+----------+------------+---------------------+
|10086262  |10002948    |Rabbit               |
|10086262  |10002950    |Unconjugated         |
|10009004  |10670938    |BCS207B              |
|10086262  |10670938    |BP215734             |
|10009004  |10671048    |0000011756           |
|10086262  |10671048    |19397                |
|10086262  |10671049    |SCIENCE              |
|10009004  |10671049    |SCIENCE,LLC          |
|10009004  |10671050    |CRYO BLUE            |
|10086262  |10671050    |CBR4                 |
|10348794  |606921      |Green and Blue       |
|23009642  |606921      |Purple and Yellow    |
+----------+------------+---------------------+

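My question is: how can I iterate over each string value in the array-type column of dfA and pull the matching rows from dfB, while flattening them in hierarchy order? dfA holds a unique list of pids as the "input", and dfB contains many rows for the same pid, each with a different attribute_id / attribute_value, which need to be rolled up under the input pid.

What makes this hard for me is that the strings in each array are ordered hierarchically, so the result set for each string must override the result set for the next string in the array. For example, for the dfA row 10009004, the result set for the first element must override the result set for 10348794, and so on (where results exist) down to the end of that row's array, while still keeping results from later elements whose attribute_id does not collide with an earlier one. There may be hundreds of attribute_ids.

I can't figure out how to approach this. Maybe zipWith? Overlaying maps? Any ideas? The output would look like: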

+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+
|product_id|10002948|10002950     |10671048   |10670938  |10671049      |10671050   |606921            |
+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+
|10086262  |Rabbit  |Unconjugated |19397      |BP215734  |SCIENCE       |CBR4       |Purple and Yellow |
|10009004  |[null]  |[null]       |0000011756 |BCS207B   |SCIENCE,LLC   |CRYO BLUE  |Green and Blue    |
+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+
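To make the override rule concrete, here is a small plain-Scala model of it (no Spark involved). This is only a sketch of the semantics described above: the names `OverrideRuleSketch`, `attrsByPid` and `rollUp` are made up for illustration, and the data is copied from the dfB table.

```scala
// Plain-Scala model of the override rule (no Spark needed).
object OverrideRuleSketch {
  // (pid, attribute_id, attribute_value) rows copied from the dfB table
  val dfB: Seq[(String, String, String)] = Seq(
    ("10086262", "10002948", "Rabbit"),
    ("10086262", "10002950", "Unconjugated"),
    ("10009004", "10670938", "BCS207B"),
    ("10086262", "10670938", "BP215734"),
    ("10009004", "10671048", "0000011756"),
    ("10086262", "10671048", "19397"),
    ("10086262", "10671049", "SCIENCE"),
    ("10009004", "10671049", "SCIENCE,LLC"),
    ("10009004", "10671050", "CRYO BLUE"),
    ("10086262", "10671050", "CBR4"),
    ("10348794", "606921", "Green and Blue"),
    ("23009642", "606921", "Purple and Yellow")
  )

  // attribute_id -> attribute_value for every pid that appears in dfB
  val attrsByPid: Map[String, Map[String, String]] =
    dfB.groupBy(_._1).map { case (pid, rows) =>
      pid -> rows.map(r => r._2 -> r._3).toMap
    }

  // Fold from the last array element to the first, so that entries from
  // earlier elements overwrite entries from later ones on attribute_id clashes.
  def rollUp(hierarchy: Seq[String]): Map[String, String] =
    hierarchy.foldRight(Map.empty[String, String]) { (id, acc) =>
      acc ++ attrsByPid.getOrElse(id, Map.empty[String, String])
    }
}
```

Calling `OverrideRuleSketch.rollUp(Seq("10009004", "10348794", "546313", "546264", "2173952"))` yields the four attributes stored directly under 10009004 plus 606921 inherited from 10348794, which matches the second row of the expected output.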

Thanks.
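One possible way to express this in Spark, offered as a sketch rather than a confirmed answer: explode the array together with each element's position, join to dfB on the exploded id, keep for every (input pid, attribute_id) pair the row with the lowest position, then pivot. The column names `product_id`, `pos`, `lookup_pid` and `rn` are chosen here for illustration, the local SparkSession is an assumption, and dfA is built with `toDF` instead of `createDF` so that the snippet only needs core Spark.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, posexplode, row_number}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rollup-sketch").getOrCreate()
import spark.implicits._

val dfA = Seq(
  ("10009004", Seq("10009004", "10348794", "546313", "546264", "2173952")),
  ("10086262", Seq("10086262", "23009642", "3617058", "2173952"))
).toDF("pid", "category_ids_array")

val dfB = Seq(
  ("10086262", "10002948", "Rabbit"),
  ("10086262", "10002950", "Unconjugated"),
  ("10009004", "10670938", "BCS207B"),
  ("10086262", "10670938", "BP215734"),
  ("10009004", "10671048", "0000011756"),
  ("10086262", "10671048", "19397"),
  ("10086262", "10671049", "SCIENCE"),
  ("10009004", "10671049", "SCIENCE,LLC"),
  ("10009004", "10671050", "CRYO BLUE"),
  ("10086262", "10671050", "CBR4"),
  ("10348794", "606921", "Green and Blue"),
  ("23009642", "606921", "Purple and Yellow")
).toDF("pid", "attribute_id", "attribute_value")

// One row per (input pid, position in the hierarchy, id at that position).
val exploded = dfA.select(
  $"pid".as("product_id"),
  posexplode($"category_ids_array").as(Seq("pos", "lookup_pid"))
)

// Attach every attribute found for any id in the hierarchy.
val joined = exploded.join(dfB, exploded("lookup_pid") === dfB("pid"))

// For each (product_id, attribute_id), the earliest array position wins.
val byPosition = Window.partitionBy($"product_id", $"attribute_id").orderBy($"pos")
val winners = joined
  .withColumn("rn", row_number().over(byPosition))
  .filter($"rn" === 1)

// One column per attribute_id; attributes never seen for a product stay null.
val result = winners
  .groupBy("product_id")
  .pivot("attribute_id")
  .agg(first("attribute_value"))

result.show(false)
```

Ordering the window by `pos` is what encodes "earlier elements override later ones". With hundreds of attribute_ids, passing the known ids explicitly to `pivot` avoids the extra pass Spark otherwise needs to discover the distinct values.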


apache-spark apache-spark-sql arrays scala