Converting a full file path into multiple rows of parent absolute paths in PySpark

mf98qq94 posted on 2021-07-13 in Spark

In a PySpark DataFrame, I want to turn a string holding a full file path into multiple rows, one for each parent path.
Input DataFrame value:

    ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

Output: each row should contain one absolute path, with the / separator included:

    ParentFolder/
    ParentFolder/Folder1/
    ParentFolder/Folder1/Folder2/
    ParentFolder/Folder1/Folder2/Folder3/
    ParentFolder/Folder1/Folder2/Folder3/Folder4/
    ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt

l0oc07j2 #1

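You can split the value column on the / delimiter to get all the parts of the path. Then, inside the transform function, use slice and array_join to rebuild the parent path for each index: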

    from pyspark.sql import functions as F

    # Split the path into its components, then build one prefix per component.
    # In the lambda, x is the array element and i is its 0-based index.
    df1 = df.withColumn("value", F.split(F.col("value"), "/")) \
        .selectExpr("""
            explode(
                transform(value,
                    (x, i) -> struct(i+1 as rn, array_join(slice(value, 1, i+1), '/') ||
                                     IF(i+1 < size(value), '/', '') as path)
                )
            ) as paths
        """).select("paths.*")

    df1.show(truncate=False)

    # +---+---------------------------------------------------------+
    # |rn |path                                                     |
    # +---+---------------------------------------------------------+
    # |1  |ParentFolder/                                            |
    # |2  |ParentFolder/Folder1/                                    |
    # |3  |ParentFolder/Folder1/Folder2/                            |
    # |4  |ParentFolder/Folder1/Folder2/Folder3/                    |
    # |5  |ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
    # |6  |ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
    # +---+---------------------------------------------------------+
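
To make the role of each index in the lambda easier to follow, the same split / slice / array_join logic can be sketched in plain Python, without a Spark session. The helper name `parent_paths` and the shortened sample path are assumptions made for illustration; they are not part of the answer.

```python
def parent_paths(value: str, sep: str = "/"):
    """Plain-Python mirror of the SQL expression: one (rn, path) pair per component."""
    parts = value.split(sep)                # F.split(F.col("value"), "/")
    rows = []
    for i in range(len(parts)):             # i plays the role of the 0-based lambda index
        prefix = sep.join(parts[: i + 1])   # array_join(slice(value, 1, i+1), '/')
        if i + 1 < len(parts):              # IF(i+1 < size(value), '/', '')
            prefix += sep                   # every row except the last gets a trailing "/"
        rows.append((i + 1, prefix))
    return rows


rows = parent_paths("ParentFolder/Folder1/Folder2/TestFile.txt")
for rn, path in rows:
    print(rn, path)
```

Only the last element (the file name itself) skips the trailing separator, which is what the `IF(i+1 < size(value), ...)` branch expresses in the SQL version.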

For Spark < 2.4, you can use a UDF like this:

    import os
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StringType

    def get_all_paths(path: str):
        # Start with the full path, then strip one trailing component per "/".
        paths = [path]
        for _ in range(path.count("/")):
            path, _base = os.path.split(path)
            paths.append(path + "/")
        # Parents were collected innermost first; reverse to list them from the top down.
        return list(reversed(paths))

    decompose_path = F.udf(get_all_paths, ArrayType(StringType()))
    df1 = df.select(F.explode(decompose_path(F.col("value"))).alias("paths"))
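
Because the UDF wraps an ordinary Python function, that function can be checked locally before it is registered with Spark. The sketch below repeats `get_all_paths` with one assumed change: it uses `posixpath.split` instead of `os.path.split`, so the check behaves the same on any operating system. The sample inputs are made up for illustration.

```python
import posixpath


def get_all_paths(path: str):
    # Same logic as the UDF body; posixpath is swapped in here (an assumption)
    # so that "/" is the separator regardless of the local platform.
    paths = [path]
    for _ in range(path.count("/")):
        path, _base = posixpath.split(path)
        paths.append(path + "/")
    return list(reversed(paths))


example = get_all_paths("ParentFolder/Folder1/TestFile.txt")
no_separator = get_all_paths("TestFile.txt")
print(example)
print(no_separator)
```

Inputs that begin with a leading `/` are worth checking separately, since `split` returns `/` itself as the head once it reaches the root.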

3xiyfsfu #2

You can use substring_index, as follows (the query assumes the path column is named col):

    # x runs over 1..n (the number of path components); i is its 0-based index.
    # The last element returns the full path; every other element returns the
    # prefix up to the x-th "/" plus a trailing "/".
    df2 = df.selectExpr("""
        explode(
            transform(
                sequence(1, size(split(col, '/'))),
                (x, i) -> case when i = size(split(col, '/')) - 1
                               then col
                               else substring_index(col, '/', x) || '/'
                          end
            )
        ) as col
    """)

    df2.show(20, truncate=False)

    # +---------------------------------------------------------+
    # |col                                                      |
    # +---------------------------------------------------------+
    # |ParentFolder/                                            |
    # |ParentFolder/Folder1/                                    |
    # |ParentFolder/Folder1/Folder2/                            |
    # |ParentFolder/Folder1/Folder2/Folder3/                    |
    # |ParentFolder/Folder1/Folder2/Folder3/Folder4/            |
    # |ParentFolder/Folder1/Folder2/Folder3/Folder4/TestFile.txt|
    # +---------------------------------------------------------+
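
As a rough illustration of what the query above computes, here is a plain-Python sketch of the same idea. The local `substring_index` is only a stand-in for the Spark SQL function of that name and covers positive counts only; `parent_paths_by_index` is a hypothetical helper name.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Stand-in for Spark SQL's substring_index (positive count only):
    everything before the count-th occurrence of delim."""
    return delim.join(s.split(delim)[:count])


def parent_paths_by_index(col: str):
    n = len(col.split("/"))                       # size(split(col, '/'))
    out = []
    for i, x in enumerate(range(1, n + 1)):       # sequence(1, n); i is the 0-based index
        if i == n - 1:
            out.append(col)                       # last element: the full path, unchanged
        else:
            out.append(substring_index(col, "/", x) + "/")
    return out


result = parent_paths_by_index("ParentFolder/Folder1/TestFile.txt")
print(result)
```

Unlike the first answer, this variant never materializes the split array as a column; it only uses the component count to drive the sequence.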