Get the row number only for filtered rows in PySpark

xvw2m8pv  posted on 2023-10-15  in Spark

I have the following DataFrame:
| id_cnt | id_prd | type | price |
| --|--|--|--|
| 1 | A | SS | 10 |
| 2 | A | AA | 20 |
| 3 | A | AA | 25 |
| 1 | B | AA | 55 |
| 2 | B | SS | 50 |
| 3 | B | AA | 75 |
| 4 | B | AA | 80 |
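I need to add a new column, rownumber. For each id_prd, I want to sort by price (highest first) and get the row number, but only for rows where type = "AA". All other rows should get null.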
Expected output:
| id_cnt | id_prd | type | price | rownumber |
| --|--|--|--|--|
| 1 | A | SS | 10 | null |
| 2 | A | AA | 20 | 2 |
| 3 | A | AA | 25 | 1 |
| 1 | B | AA | 55 | 3 |
| 2 | B | SS | 50 | null |
| 3 | B | AA | 75 | 2 |
| 4 | B | AA | 80 | 1 |

zd287kbt  #1

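You can use a Window partitioned by id_prd and type and ordered by price in descending order, then keep the row_number only for rows where type = "AA". Partitioning by type as well means the AA rows are numbered among themselves, so SS rows do not use up any row numbers.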

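from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc, row_number, when
from pyspark.sql.window import Window

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

data = [
    (1, "A", "SS", 10),
    (2, "A", "AA", 20),
    (3, "A", "AA", 25),
    (1, "B", "AA", 55),
    (2, "B", "SS", 50),
    (3, "B", "AA", 75),
    (4, "B", "AA", 80),
]

df = spark.createDataFrame(data, ["id_cnt", "id_prd", "type", "price"])

# Number rows within each (id_prd, type) group, highest price first.
window = Window.partitionBy("id_prd", "type").orderBy(desc("price"))

# when() without otherwise() yields null for rows whose type is not "AA".
df.withColumn("row_number", when(col("type") == "AA", row_number().over(window))).show()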
+------+------+----+-----+----------+
|id_cnt|id_prd|type|price|row_number|
+------+------+----+-----+----------+
|     3|     A|  AA|   25|         1|
|     2|     A|  AA|   20|         2|
|     1|     A|  SS|   10|      null|
|     4|     B|  AA|   80|         1|
|     3|     B|  AA|   75|         2|
|     1|     B|  AA|   55|         3|
|     2|     B|  SS|   50|      null|
+------+------+----+-----+----------+
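
If you want to sanity-check the numbers above without a Spark session, the same rule can be sketched in plain Python. This is only an illustration of the logic, not a replacement for the window function: the helper name `add_rownumber` and its `wanted_type` parameter are made up for this example and are not part of PySpark. It assumes prices are unique within each id_prd; with ties, Spark's row_number makes no promise about which tied row comes first.

```python
from collections import defaultdict

# Sample rows from the question: (id_cnt, id_prd, type, price)
rows = [
    (1, "A", "SS", 10),
    (2, "A", "AA", 20),
    (3, "A", "AA", 25),
    (1, "B", "AA", 55),
    (2, "B", "SS", 50),
    (3, "B", "AA", 75),
    (4, "B", "AA", 80),
]


def add_rownumber(rows, wanted_type="AA"):
    """Hypothetical helper: append a rownumber field (None unless type matches)."""
    # Collect the positions of the matching rows for each id_prd.
    by_prd = defaultdict(list)
    for idx, (_, id_prd, type_, _) in enumerate(rows):
        if type_ == wanted_type:
            by_prd[id_prd].append(idx)

    numbers = {}
    for indices in by_prd.values():
        # Highest price gets 1, mirroring orderBy(desc("price")).
        indices.sort(key=lambda i: rows[i][3], reverse=True)
        for rank, idx in enumerate(indices, start=1):
            numbers[idx] = rank

    # Rows that were never ranked get None, the analogue of null.
    return [row + (numbers.get(i),) for i, row in enumerate(rows)]


result = add_rownumber(rows)
for row in result:
    print(row)
```

The rows come back in their original order, so the last field can be compared line by line with the rownumber column in the expected output from the question.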
