Generate a sequential index column for duplicated values in a PySpark column

e3bfsja2  published 2022-11-28 in Spark

How can I create a column dup_DUNS_NUMBER that assigns an index number, incremented by 1, to each group of duplicated values?

DUNS_NUMBER  dup_DUNS_NUMBER
0             0              1.0
1             0              1.0
2             0              1.0
3             0              1.0
4       1000231              NaN
5       1000236              NaN
6       1000363              2.0
7       1000363              2.0
8       1000368              NaN
9       1000467              NaN
10      1000470              3.0
11      1000470              3.0
12      1000470              3.0
13      1000553              4.0
14      1000553              4.0
15      1000574              NaN
16      1000657              5.0
17      1000657              5.0
18      1000694              NaN
19      1000744              6.0
20      1000744              6.0
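For comparison, the expected output above is pandas-formatted, and the same column can be reproduced on a plain pandas Series; a minimal sketch using `duplicated` to mask singletons and a dense rank over the remaining duplicated values (the sample data below is a subset of the table above):

```python
import pandas as pd

duns = pd.Series([0, 0, 0, 0, 1000231, 1000236, 1000363, 1000363,
                  1000368, 1000467, 1000470, 1000470, 1000470])

# keep only values that occur more than once, then dense-rank them;
# singletons stay NaN, and all duplicates of one value share one index
dup_mask = duns.duplicated(keep=False)
dup_duns_number = duns.where(dup_mask).rank(method="dense")
```

Here `rank(method="dense")` propagates NaN for the masked singletons, which matches the NaN entries in the desired output.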

s6fujrry1#

I'd suggest doing this in two steps: first filter out the duplicated values and rank them to assign the increasing index number, then union the non-duplicated values back in, like so:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, count, lit, dense_rank 

df = spark.createDataFrame([(0,), (0,), (0,), (0,), (1000231,), (1000236,), (1000363,), (1000363,)], "DUNS_NUMBER: int")

# per-value window used to count how many times each DUNS_NUMBER occurs
count_window = Window.partitionBy("DUNS_NUMBER")
# single global window that ranks the duplicated values in ascending order
order_window = Window.partitionBy().orderBy("DUNS_NUMBER")

df_with_count = df.withColumn("dup_COUNT", count(col("DUNS_NUMBER")).over(count_window))

df_with_count \
    .filter("dup_COUNT > 1") \
    .withColumn("dup_DUNS_NUMBER", dense_rank().over(order_window)) \
    .union( \
        df_with_count.filter("dup_COUNT = 1").withColumn("dup_DUNS_NUMBER", lit("NaN"))) \
    .select(["DUNS_NUMBER", "dup_DUNS_NUMBER"]) \
    .orderBy("DUNS_NUMBER") \
    .show()

+-----------+---------------+
|DUNS_NUMBER|dup_DUNS_NUMBER|
+-----------+---------------+
|          0|              1|
|          0|              1|
|          0|              1|
|          0|              1|
|    1000231|            NaN|
|    1000236|            NaN|
|    1000363|              2|
|    1000363|              2|
+-----------+---------------+
