PySpark: concatenate strings (by partition)

9w11ddsr · posted 2021-07-09 in Spark

I have a DataFrame:

+----+----------+
|id  | device   |
+----+----------+
| 123| phone    |
| 124| phone    |
| 555| phone    |
| 898| tablet   |
| 999| tablet   |
|1111| tv       |
+----+----------+

I want a new column that associates each device with a sequential id, like:

+----+----------+--------------+
|id  | device   | device_id    |
+----+----------+--------------+
| 123| phone    | phone_00001  |
| 124| phone    | phone_00002  |
| 555| phone    | phone_00003  |
| 898| tablet   | tablet_00001 |
| 999| tablet   | tablet_00002 |
|1111| tv       | tv_00001     |
+----+----------+--------------+

In R this would look like:

df %>% group_by(device) %>% mutate(device_id = paste0(device, '_', sprintf("%05d", row_number())))

I'm looking for the same thing in PySpark.

ohtdti5x

Similar to the R approach: assign a row number over a window partitioned by device, and use `format_string` to get the desired output format:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'device_id',
    F.format_string(
        '%s_%05d',  # device name, underscore, zero-padded 5-digit counter
        F.col('device'),
        # number rows within each device partition, ordered by id
        F.row_number().over(Window.partitionBy('device').orderBy('id'))
    )
)

df2.show()
+----+------+------------+
|  id|device|   device_id|
+----+------+------------+
| 123| phone| phone_00001|
| 124| phone| phone_00002|
| 555| phone| phone_00003|
|1111|    tv|    tv_00001|
| 898|tablet|tablet_00001|
| 999|tablet|tablet_00002|
+----+------+------------+
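The window function numbers rows within each `device` partition in `id` order. The same logic can be sketched in plain Python (a toy illustration with the question's data, assuming the rows fit in memory) to make the numbering explicit:

```python
from itertools import groupby
from operator import itemgetter

rows = [
    (123, "phone"), (124, "phone"), (555, "phone"),
    (898, "tablet"), (999, "tablet"), (1111, "tv"),
]

result = []
# Sort by (device, id) so each group is contiguous and ordered by id,
# mirroring Window.partitionBy('device').orderBy('id').
for device, group in groupby(sorted(rows, key=itemgetter(1, 0)), key=itemgetter(1)):
    for n, (id_, _) in enumerate(group, start=1):
        # f"{device}_{n:05d}" mirrors format_string('%s_%05d', ...)
        result.append((id_, device, f"{device}_{n:05d}"))
```

This makes clear that the counter restarts at 1 for each device, which is exactly what `row_number()` does per partition.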
