Group columns by pattern in PySpark

Asked by edqdpe6u on 2022-12-22, in Spark

I have the input DataFrame below. I want to group each price and qty column pair into a dictionary, as shown in the expected output.

-------------------------------------------------------------------------------------
| item_name | price_1  |  qty_1  |  price_2 |  qty_2 | price_3 | qty_3 |     url     |
-------------------------------------------------------------------------------------
| Samsung Z |   10000  |    5    |    9000  |    10  |  7000   |   20  | amazon.com  |
| Moto G4   |   12000  |    10   |    10000 |    20  |  6000   |   50  | ebay.com    |
| Mi 4i     |   15000  |    8    |    12000 |    20  |  10000  |   25  | deals.com   |
| Moto G3   |   20000  |    5    |    18000 |    12  |  15000  |   30  | ebay.com    |
-------------------------------------------------------------------------------------
Expected output:
---------------------------------------------------------------------------------------------------------------------------------------
| item_name |      price_range                                                                                          |     url     |
---------------------------------------------------------------------------------------------------------------------------------------
| Samsung Z |   [{price:10000,qty:5, comments:""},{price:9000,qty:10, comments:""},{price:7000,qty:20, comments:""}]    | amazon.com  |
| Moto G4   |   [{price:12000,qty:10, comments:""},{price:10000,qty:20, comments:""},{price:6000,qty:50, comments:""}]  | ebay.com    |
| Mi 4i     |   [{price:15000,qty:8, comments:""},{price:12000,qty:20, comments:""},{price:10000,qty:25, comments:""}]  | deals.com   |
| Moto G3   |   [{price:20000,qty:5, comments:""},{price:18000,qty:12, comments:""},{price:15000,qty:30, comments:""}]  | ebay.com    |
---------------------------------------------------------------------------------------------------------------------------------------

Answer 1 (e0bqpujr):

Iterate over the columns to find each price/qty pair and use create_map() to build a map column from each pair.
Then use array() to collect all the map columns into a single array column.

import pyspark.sql.functions as F

df = spark.createDataFrame(
    data=[["Samsung Z", 10000, 5, 9000, 10, 7000, 20, "amazon.com"],
          ["Moto G4", 12000, 10, 10000, 20, 6000, 50, "ebay.com"],
          ["Mi 4i", 15000, 8, 12000, 20, 10000, 25, "deals.com"],
          ["Moto G3", 20000, 5, 18000, 12, 15000, 30, "ebay.com"]],
    schema=["item_name", "price_1", "qty_1", "price_2", "qty_2", "price_3", "qty_3", "url"])

df = df.withColumn("comments", F.lit(""))

# Create map columns for each price, qty pair.
for i, c in enumerate(list(filter(lambda x: "price_" in x, df.columns))):
  df = df.withColumn(f"map{i+1}", F.create_map(F.lit(c), F.col(c), F.lit(f"qty_{i+1}"), F.col(f"qty_{i+1}"), F.lit("comments"), F.col("comments"))) \
         .drop(c, f"qty_{i+1}")

# Collect all map columns as array.
df = df.withColumn("price_range", F.array([f"map{j+1}" for j in range(0, i+1)])) \
       .drop(*[f"map{j+1}" for j in range(0,i+1)], "comments")

df.show(truncate=False)

Output:

+---------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
|item_name|url       |price_range                                                                                                                                 |
+---------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
|Samsung Z|amazon.com|[{price_1 -> 10000, qty_1 -> 5, comments -> }, {price_2 -> 9000, qty_2 -> 10, comments -> }, {price_3 -> 7000, qty_3 -> 20, comments -> }]  |
|Moto G4  |ebay.com  |[{price_1 -> 12000, qty_1 -> 10, comments -> }, {price_2 -> 10000, qty_2 -> 20, comments -> }, {price_3 -> 6000, qty_3 -> 50, comments -> }]|
|Mi 4i    |deals.com |[{price_1 -> 15000, qty_1 -> 8, comments -> }, {price_2 -> 12000, qty_2 -> 20, comments -> }, {price_3 -> 10000, qty_3 -> 25, comments -> }]|
|Moto G3  |ebay.com  |[{price_1 -> 20000, qty_1 -> 5, comments -> }, {price_2 -> 18000, qty_2 -> 12, comments -> }, {price_3 -> 15000, qty_3 -> 30, comments -> }]|
+---------+----------+--------------------------------------------------------------------------------------------------------------------------------------------+
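
If the dictionary keys should be literally price, qty and comments (as in the expected output above) rather than the source column names price_1, qty_1, ..., an array of structs is one alternative to an array of maps. The following is a minimal sketch, not part of the original answer, assuming the same sample data and an active SparkSession named spark:

import pyspark.sql.functions as F

df2 = spark.createDataFrame(
    data=[["Samsung Z", 10000, 5, 9000, 10, 7000, 20, "amazon.com"],
          ["Moto G4", 12000, 10, 10000, 20, 6000, 50, "ebay.com"]],
    schema=["item_name", "price_1", "qty_1", "price_2", "qty_2", "price_3", "qty_3", "url"])

# Number of price/qty pairs, derived from the column names.
n_pairs = len([c for c in df2.columns if c.startswith("price_")])

# Build one struct per price/qty pair so the field names are exactly price, qty, comments.
df2 = df2.withColumn(
    "price_range",
    F.array(*[F.struct(F.col(f"price_{k}").alias("price"),
                       F.col(f"qty_{k}").alias("qty"),
                       F.lit("").alias("comments"))
              for k in range(1, n_pairs + 1)])
).select("item_name", "price_range", "url")

df2.show(truncate=False)

Each element of price_range is then a struct with fields price, qty and comments; if a JSON string is needed instead, F.to_json("price_range") can convert the array column.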
