How to find the top n keys based on the value in PySpark?

Posted by qrjkbowd on 2021-05-18 in Spark


I have a PySpark DataFrame with the following schema:

root
 |-- query: string (nullable = true)
 |-- collect_list(docId): array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- prod_count_dict: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

The DataFrame looks like this:

+--------------------+--------------------+--------------------+
|               query| collect_list(docId)|     prod_count_dict|
+--------------------+--------------------+--------------------+
|1/2 inch plywood ...|[471097-153-12CC,...|[530320-62634-100...|
|             1416445|[1416445-83-HHM5S...|[1054482-2251-FFC...

Note that the column prod_count_dict is a dictionary of key-value pairs, such as:

{x: 12, a: 16, b:1, f:3, ....}

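What I want is to select only the keys with the top n largest values from the key:value pairs and store them in another column, as a list for that row, e.g. [x, a, …].
I tried the code below, but it gives me an error. Is there a way to solve this?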

@F.udf(StringType())
def create_label(x):

# If the length of dictionary is less then 20, I want to return the keys of all the items in the dict.

    if len(x) >= 20:  
        val_sort = sorted(list(x.values()), reverse = True)
        cutoff = {k: v for (k, v) in x.items() if v > val_sort[20]}
        return cutoff.keys()
    else:
        return x.keys()

label_df = label_count_df.withColumn("label", create_label("prod_count_dict"))
label_df.show()

apache-spark pyspark apache-spark-sql user-defined-functions

Source: https://stackoverflow.com/questions/64634141/how-to-find-the-top-n-keys-based-on-the-value-in-pyspark

2 Answers


huwehgph1#

First, I would explode the map column:

df = df.select("*", f.explode("prod_count_dict").alias("key", "value"))

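After that, you can use a window function to get the top n entries for each row. The window is partitioned by query here, assuming query identifies a row uniquely; partitioning by key would instead rank each key's values across all rows.

import pyspark.sql.functions as f
from pyspark.sql import Window

# Rank the exploded entries of each query, highest value first
w = Window.partitionBy('query').orderBy(f.col('value').desc())

n = 2  # set N here
top_n = (
    df.select('*', f.rank().over(w).alias('rank'))
      .filter(f.col('rank') <= n)
      .drop('rank')
)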
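
One thing to watch with rank() in the snippet above: tied values share the same rank, so filtering on rank <= N can keep more than N entries. Below is a plain-Python emulation of the two ranking rules (not Spark code; the helper names and the sample counts are made up for illustration) to show the difference.

```python
def rank_like(sorted_values):
    """Emulate rank(): equal values share a rank, the next rank skips ahead."""
    # Rank = position of the first occurrence of the value in the sorted list, plus 1
    return [sorted_values.index(v) + 1 for v in sorted_values]


def row_number_like(sorted_values):
    """Emulate row_number(): 1, 2, 3, ... regardless of ties."""
    return list(range(1, len(sorted_values) + 1))


# Hypothetical counts for a single query, sorted in descending order
values = sorted([12, 16, 1, 12, 3], reverse=True)
n = 2

ranks = rank_like(values)
kept_by_rank = [v for v, r in zip(values, ranks) if r <= n]
kept_by_row_number = [v for v, r in zip(values, row_number_like(values)) if r <= n]

print(ranks)               # [1, 2, 2, 4, 5]
print(kept_by_rank)        # [16, 12, 12]
print(kept_by_row_number)  # [16, 12]
```

If exactly N keys per row are required, swapping f.rank() for f.row_number() in the window expression should give that behaviour, at the cost of breaking ties arbitrarily.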

Answered 2021-05-19

b09cbbtk2#

The logic of the UDF you wrote is essentially fine. You only need to change how it is applied. It works if you use .map on the rdd:


# Keep the udf that you have written as a normal Python function

def create_label(x):
    # With 20 or fewer entries, return the keys of all the items in the dict.
    if len(x) > 20:
        val_sort = sorted(x.values(), reverse=True)
        # Keep the keys whose value is strictly greater than the 21st-largest value
        cutoff = {k: v for (k, v) in x.items() if v > val_sort[20]}
        return list(cutoff.keys())
    else:
        # Return a list so Spark can infer an array column in toDF
        return list(x.keys())

The part you need to change is:

label_df_col = ['query','prod_count_dict']
label_df = label_count_df.rdd.map(lambda x:(x.query, create_label(x.prod_count_dict))).toDF(label_df_col)
label_df.show()

This should work.
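
If you would rather stay in the DataFrame API, a UDF can also work. The question does not show the error, but a likely cause is that the UDF is declared as StringType() while it returns a collection of keys. A minimal sketch under that assumption: top_n_keys and make_top_n_udf are hypothetical helper names, and the Spark imports are done lazily so the pure-Python part can be tried without a cluster.

```python
import heapq


def top_n_keys(counts, n=20):
    """Return up to n keys of `counts`, ordered by descending value."""
    if not counts:  # covers both a null map (None) and an empty dict
        return []
    # Iterating a dict yields its keys; counts.get supplies the sort value
    return heapq.nlargest(n, counts, key=counts.get)


def make_top_n_udf(n=20):
    """Wrap top_n_keys as a Spark UDF that returns array<string>."""
    # Imported here so top_n_keys stays usable without a Spark installation
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    return F.udf(lambda m: top_n_keys(m, n), ArrayType(StringType()))


# Sample dict taken from the question
print(top_n_keys({"x": 12, "a": 16, "b": 1, "f": 3}, n=2))  # ['a', 'x']
```

With the question's DataFrame this would be applied as label_count_df.withColumn("label", make_top_n_udf(20)("prod_count_dict")). Unlike the cutoff comparison, heapq.nlargest always returns at most n keys, even when several values tie at the boundary.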

Answered 2021-05-18
