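scala: Is it possible to iterate over a hierarchical array column to fetch and aggregate results from another DataFrame?

Suppose I have some data in dfA, for example a key (pid) and an array-type column (category_ids_array):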

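// Note: createDF is not part of core Spark; it is assumed here to come from
// the spark-daria library's SparkSessionExt.
import org.apache.spark.sql.types.{ArrayType, StringType}
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

val dfA = spark.createDF(
  Array(
    ("10009004", Array("10009004", "10348794", "546313", "546264", "2173952")),
    ("10086262", Array("10086262", "23009642", "3617058", "2173952"))
  ), List(
    ("pid", StringType, true),
    ("category_ids_array", ArrayType(StringType, true), true)
  )
)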

dfA:

+----------+---------------------------------------------------+
|pid       |category_ids_array                                 |
+----------+---------------------------------------------------+
|10009004  |[10009004, 10348794, 546313, 546264, 2173952]      |
|10086262  |[10086262, 23009642, 3617058, 2173952]             |
+----------+---------------------------------------------------+

I also have a DataFrame dfB, which looks like:

+----------+------------+---------------------+
|pid       |attribute_id|attribute_value      |
+----------+------------+---------------------+
|10086262  |10002948    |Rabbit               |
|10086262  |10002950    |Unconjugated         |
|10009004  |10670938    |BCS207B              |
|10086262  |10670938    |BP215734             |
|10009004  |10671048    |0000011756           |
|10086262  |10671048    |19397                |
|10086262  |10671049    |SCIENCE              |
|10009004  |10671049    |SCIENCE, LLC         |
|10009004  |10671050    |CRYO BLUE            |
|10086262  |10671050    |CBR4                 |
|10348794  |606921      |Green and Blue       |
|23009642  |606921      |Purple and Yellow    |
+----------+------------+---------------------+

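My question is: if it is possible, how can I loop through each string value in the array-type column of dfA, pull the matching results from dfB, and flatten them in hierarchical order? dfA holds a unique list of pids as the "input", and dfB contains many rows for the same pid with different attribute values/IDs, which need to be rolled up under the input pid. This is difficult for me because the result set for each input string in dfA must overwrite that of the next string in the array, since the array strings are in hierarchical order. For example, for row 1 of dfA, the result set for 10009004 must overwrite the one for 10348794, and so on (if present) until the end of that row's array, while still keeping earlier results whose attribute_id differs. There can be hundreds of attribute IDs. I'm not sure how to approach this. Maybe using zipWith? Some kind of Map override? Any ideas? The output would look something like: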

+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+
|product_id|10002948|10002950     |10671048   |10670938  |10671049      |10671050   |606921            |
+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+
|10086262  |Rabbit  |Unconjugated |19397      |BP215734  |SCIENCE       |CBR4       |Purple and Yellow |
|10009004  |[null]  |[null]       |0000011756 |BCS207B   |SCIENCE, LLC  |CRYO BLUE  |Green and Blue    |
+----------+--------+-------------+-----------+----------+--------------+-----------+------------------+

Thanks in advance.
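
One possible way to express the roll-up described above, as a minimal sketch rather than a definitive answer: explode the array together with each element's position, join against dfB, keep for every (product, attribute_id) pair the value coming from the earliest array position, and pivot the attribute IDs into columns. It assumes that "overwrite" means earlier array entries take precedence over later ones, and it builds the sample data with plain `toDF` instead of `createDF`. The names `HierarchicalRollup`, `rollUp`, `product_id`, `pos`, `source_pid` and `best` are hypothetical and chosen for illustration only.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, first, min, posexplode, struct}

object HierarchicalRollup {

  // Assumption: a lower array position means higher precedence.
  def rollUp(dfA: DataFrame, dfB: DataFrame): DataFrame = {
    // One row per (input pid, array element), keeping the element's position.
    val exploded = dfA.select(
      col("pid").as("product_id"),
      posexplode(col("category_ids_array")).as(Seq("pos", "source_pid"))
    )

    // Attach every attribute that belongs to any id in the hierarchy.
    // Inner join: a pid with no matching attributes at all is dropped.
    val joined = exploded.join(dfB, exploded("source_pid") === dfB("pid"))

    // Structs compare field by field, so min(struct(pos, value)) picks the
    // value contributed by the earliest position for each attribute.
    val winners = joined
      .groupBy(col("product_id"), col("attribute_id"))
      .agg(min(struct(col("pos"), col("attribute_value"))).as("best"))
      .select(
        col("product_id"),
        col("attribute_id"),
        col("best.attribute_value").as("attribute_value")
      )

    // One column per attribute_id; attributes a product lacks become null.
    winners
      .groupBy("product_id")
      .pivot("attribute_id")
      .agg(first("attribute_value"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("hierarchical-rollup")
      .getOrCreate()
    import spark.implicits._

    val dfA = Seq(
      ("10009004", Seq("10009004", "10348794", "546313", "546264", "2173952")),
      ("10086262", Seq("10086262", "23009642", "3617058", "2173952"))
    ).toDF("pid", "category_ids_array")

    val dfB = Seq(
      ("10086262", "10002948", "Rabbit"),
      ("10086262", "10002950", "Unconjugated"),
      ("10009004", "10670938", "BCS207B"),
      ("10086262", "10670938", "BP215734"),
      ("10009004", "10671048", "0000011756"),
      ("10086262", "10671048", "19397"),
      ("10086262", "10671049", "SCIENCE"),
      ("10009004", "10671049", "SCIENCE, LLC"),
      ("10009004", "10671050", "CRYO BLUE"),
      ("10086262", "10671050", "CBR4"),
      ("10348794", "606921", "Green and Blue"),
      ("23009642", "606921", "Purple and Yellow")
    ).toDF("pid", "attribute_id", "attribute_value")

    rollUp(dfA, dfB).show(false)
    spark.stop()
  }
}
```

The sample data contains no conflicting attribute_id within a hierarchy, so it cannot confirm the precedence direction; if later entries should win instead, swapping `min` for `max` on the same struct would be the corresponding change. The column order produced by `pivot` may differ from the table in the question. With hundreds of attribute IDs, passing the known list of IDs as the second argument to `pivot` avoids the extra pass Spark otherwise needs to discover them.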

No answers yet.
