python—如何将行的元素列表与sparkDataframe中的值进行匹配

slmsl1lt  于 2021-07-09  发布在  Spark
关注(0)|答案(1)|浏览(259)

我有两个大的日期框架字典框架和输入框架。我想对照输入检查字典框架。

class_dictionary = pd.DataFrame( {   
    'Subject' :   ["qqq", "rrr", "sss", "ttt", "uuu"],
    'Class'    : ["A type",    "B type",   "C type",  "C type",   "A type"],
    })
class_dictionary

    Subject     Class
0   qqq     A type
1   rrr     B type
2   sss     C type
3   ttt     C type
4   uuu     A type

我的输入框是,

input_db = pd.DataFrame( {   
    'Obj' :   ["name1", "name2", "name3", "name5",  "name10million"],
    'Subject List'    : ["qqq, ttt, ZZZ(not in the dict)",    "qqq, ttt, sss",   "uuu",  "rrr",   "uknown"],
    })
input_db

    Obj     Subject List
0   name1   qqq, ttt, ZZZ(not in the dict)
1   name2   qqq, ttt, sss
2   name3   uuu
3   name5   rrr
4   name10million   uknown

输出应该像这样,

sample_output = pd.DataFrame( {   
    'Obj' :   ["name1", "name2", "name10million"],
    'Values'    : ["qqq, ttt, ZZZ(not in the dict)", "qqq, ttt, sss", "uknown"],
    'Calculated (can be different new columns)' : ["A type: qqq, C type: ttt", "A type: qqq, C Type: ttt, C type: sss", "unk"],
    'Count of types' : ["2", "2", "0"]
    })
sample_output

     Obj    Values-Calculated (can be different new columns)          Count of types
0   name1   qqq, ttt, ZZZ(not in the dict)  A type: qqq, C type: ttt         2
1   name2   qqq, ttt, sss   A type: qqq, C Type: ttt, C type: sss            2
2   name10million   uknown  unk                                              0

我知道用python做这件事的一种非常懒惰的方法,这并不能解决我的问题。我想用pyspark做这个。
我知道它比较复杂,任何帮助都将不胜感激。谢谢您。

xpcnnkqh

xpcnnkqh1#

您可以拆分和分解主题列表,然后加入Dataframe并进行聚合:

import pyspark.sql.functions as F

df1 = spark.createDataFrame(class_dictionary)
df2 = spark.createDataFrame(input_db)

result = df2.withColumn(
    'Subject', 
    F.explode(F.split('Subject List', ', '))
).join(
    df1, 
    'Subject', 
    'left'
).groupBy('Obj', 'Subject List').agg(
    F.collect_list(
        F.when(
            F.col('Class').isNotNull(), 
            F.struct('Class', 'Subject')
        )
    ).alias('Calculated'), 
    F.countDistinct('Class').alias('Count of types')
)

result.show(truncate=False)
+-------------+------------------------------+---------------------------------------------+--------------+
|Obj          |Subject List                  |Calculated                                   |Count of types|
+-------------+------------------------------+---------------------------------------------+--------------+
|name10million|uknown                        |[]                                           |0             |
|name2        |qqq, ttt, sss                 |[[A type, qqq], [C type, ttt], [C type, sss]]|2             |
|name3        |uuu                           |[[A type, uuu]]                              |1             |
|name5        |rrr                           |[[B type, rrr]]                              |1             |
|name1        |qqq, ttt, ZZZ(not in the dict)|[[C type, ttt], [A type, qqq]]               |2             |
+-------------+------------------------------+---------------------------------------------+--------------+

相关问题