Map a column based on arrays in PySpark

hmae6n7t · posted 2021-07-12 in Spark

I'm still fairly new to PySpark DataFrames, especially when arrays are stored in a column, and I'm looking for help mapping a column based on two PySpark DataFrames, one of which is a reference df.
Reference DataFrame (the number of subgroups differs per group):

| Group | Subgroup |       Size        |      Type        |
| ----  | -------- | ------------------| ---------------  |
|A      | A1       |['Small','Medium'] | ['A','B']        |
|A      | A2       |['Small','Medium'] | ['C','D']        |
|B      | B1       |['Small']          | ['A','B','C','D']|

Source DataFrame:

| ID    | Size     |  Type    |     
| ----  | -------- | ---------| 
|ID_001 | 'Small'  |'A'       | 
|ID_002 | 'Medium' |'B'       | 
|ID_003 | 'Small'  |'D'       |

In the result, each ID is checked against every group but, based on the reference df, belongs to at most one of that group's subgroups, so the result would look like this:

| ID    | Size     |  Type    |  A_Subgroup  |   B_Subgroup  |
| ----  | -------- | ---------|  ----------  | ------------- |
|ID_001 | 'Small'  |'A'       | 'A1'         |  'B1'         |
|ID_002 | 'Medium' |'B'       | 'A1'         |  Null         |
|ID_003 | 'Small'  |'D'       | 'A2'         |  'B1'         |
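
For anyone who wants to reproduce this, here is a minimal sketch (not part of the original question) that builds the two example DataFrames with a local SparkSession; the variable names `ref` and `source` are assumptions that match the answer below:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Reference DataFrame: Size and Type hold arrays of allowed values
ref = spark.createDataFrame(
    [('A', 'A1', ['Small', 'Medium'], ['A', 'B']),
     ('A', 'A2', ['Small', 'Medium'], ['C', 'D']),
     ('B', 'B1', ['Small'], ['A', 'B', 'C', 'D'])],
    ['Group', 'Subgroup', 'Size', 'Type'])

# Source DataFrame: one scalar Size/Type value per ID
source = spark.createDataFrame(
    [('ID_001', 'Small', 'A'),
     ('ID_002', 'Medium', 'B'),
     ('ID_003', 'Small', 'D')],
    ['ID', 'Size', 'Type'])
```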

ct3nt3jp · Answer #1

You can join on an array_contains condition and pivot the result:

import pyspark.sql.functions as F

# Left join each source row to the reference rows whose Size and Type
# arrays contain that row's scalar Size and Type values
result = source.alias('source').join(
    ref.alias('ref'),
    F.expr("""
        array_contains(ref.Size, source.Size) and
        array_contains(ref.Type, source.Type)
    """),
    'left'
).groupBy(
    'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))  # one column per Group, holding the matched Subgroup

result.show()
+------+------+----+---+----+
|    ID|  Size|Type|  A|   B|
+------+------+----+---+----+
|ID_003| Small|   D| A2|  B1|
|ID_002|Medium|   B| A1|null|
|ID_001| Small|   A| A1|  B1|
+------+------+----+---+----+
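
The pivoted columns are named after the group values (`A`, `B`). If you want the `A_Subgroup` / `B_Subgroup` names shown in the expected output, one way (not part of the original answer, column names taken from the question) is to rename them afterwards:

```python
# Rename the pivoted group columns to match the names in the expected output
result = (result
          .withColumnRenamed('A', 'A_Subgroup')
          .withColumnRenamed('B', 'B_Subgroup'))
result.show()
```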
