Removing a specific number of leading zeros in PySpark

7nbnzgx9  posted on 2021-05-24  in  Spark

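I want to remove a specific number of leading zeros from a column in PySpark.
For example, I want to strip the leading zero only from values that have exactly one leading zero, so the output should be: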

    +-----------+------+
    |subcategory|output|
    +-----------+------+
    |      00EEE| 00EEE|
    |    0000EEE|000EEE|
    |       0EEE|   EEE|
    +-----------+------+

Similarly, if I want to strip the leading zeros only from values that have exactly two of them, the output should be:

    +-----------+------+
    |subcategory|output|
    +-----------+------+
    |      00EEE|   EEE|
    |    0000EEE|000EEE|
    |       0EEE|  0EEE|
    +-----------+------+

Is there a way to do this?

u5i3ibmn  #1

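I wrote a generic function that removes the leading "0" characters only when a value starts with exactly the requested number of them: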

    from pyspark.sql import functions as F


    def remove_lead_zero(col, n):
        """
        col: name of the column you want to modify
        n: number of leading 0 you want to remove
        """
        return F.when(
            # Match exactly n leading zeros followed by a non-zero character
            F.regexp_extract(col, "^0{{{n}}}[^0]".format(n=n), 0) != "",
            # substring is 1-based, so starting at n + 1 skips the n zeros
            F.expr("substring({col}, {n}, length({col}))".format(col=col, n=n + 1)),
        ).otherwise(F.col(col))


    df.withColumn("output", remove_lead_zero("subcategory", 2)).show()
    # +-----------+-------+
    # |subcategory| output|
    # +-----------+-------+
    # |      00EEE|    EEE|
    # |    0000EEE|0000EEE|
    # |       0EEE|   0EEE|
    # +-----------+-------+

    df.withColumn("output", remove_lead_zero("subcategory", 1)).show()
    # +-----------+-------+
    # |subcategory| output|
    # +-----------+-------+
    # |      00EEE|  00EEE|
    # |    0000EEE|0000EEE|
    # |       0EEE|    EEE|
    # +-----------+-------+
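
To check the rule without starting a Spark session, here is a minimal plain-Python sketch that mirrors the same regular expression with the standard `re` module. The helper name `remove_lead_zero_py` is hypothetical and not part of the answer above. It only illustrates the matching logic: exactly `n` leading zeros, followed by a non-zero character.

```python
import re


def remove_lead_zero_py(value: str, n: int) -> str:
    """Hypothetical plain-Python mirror of the Spark function above."""
    # Same pattern as the Spark version: n zeros, then one non-zero character.
    if re.match(r"^0{%d}[^0]" % n, value):
        return value[n:]  # drop the n leading zeros
    return value  # any other value is left unchanged


samples = ["00EEE", "0000EEE", "0EEE"]
print([remove_lead_zero_py(s, 2) for s in samples])  # ['EEE', '0000EEE', '0EEE']
print([remove_lead_zero_py(s, 1) for s in samples])  # ['00EEE', '0000EEE', 'EEE']
```

Because the pattern requires a non-zero character after the zeros, a value consisting only of zeros (for example "00" with n=2) is left unchanged, in this sketch and in the Spark version alike.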