Python: fill in the remaining column values for each value of one column in a DataFrame

ac1kyiln  posted on 2021-07-13  in Spark

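I have a DataFrame with a few columns:

+---+------+----+
| ID|string|test|
+---+------+----+
|  0|  zero| abc|
|  1|   one| def|
|  2|   two| ghi|
|  3| three| jkl|
|  4|  four| mno|
+---+------+----+

For every ID value I want to fill in the remaining columns, so that each ID is paired with every (string, test) row:

+---+------+----+
| ID|string|test|
+---+------+----+
|  0|  zero| abc|
|  0|   one| def|
|  0|   two| ghi|
|  0| three| jkl|
|  0|  four| mno|
|  1|  zero| abc|
|  1|   one| def|
|  1|   two| ghi|
|  1| three| jkl|
|  1|  four| mno|
|  2|  zero| abc|
|  2|   one| def|
|  2|   two| ghi|
|  2| three| jkl|
|  2|  four| mno|
|  3|  zero| abc|
|  3|   one| def|
|  3|   two| ghi|
|  3| three| jkl|
|  3|  four| mno|
|  4|  zero| abc|
|  4|   one| def|
|  4|   two| ghi|
|  4| three| jkl|
|  4|  four| mno|
+---+------+----+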

nue99wik  #1

You can do a cross join:

result = df.select('ID').crossJoin(df.select('string', 'test'))

result.show(99)
+---+------+----+
| ID|string|test|
+---+------+----+
|  0|  zero| abc|
|  1|  zero| abc|
|  2|  zero| abc|
|  3|  zero| abc|
|  4|  zero| abc|
|  0|   one| def|
|  1|   one| def|
|  2|   one| def|
|  3|   one| def|
|  4|   one| def|
|  0|   two| ghi|
|  1|   two| ghi|
|  2|   two| ghi|
|  3|   two| ghi|
|  4|   two| ghi|
|  0| three| jkl|
|  1| three| jkl|
|  2| three| jkl|
|  3| three| jkl|
|  4| three| jkl|
|  0|  four| mno|
|  1|  four| mno|
|  2|  four| mno|
|  3|  four| mno|
|  4|  four| mno|
+---+------+----+
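
As a sanity check on what the cross join computes, the same pairing can be sketched in plain Python without Spark. This is only an illustration: the `rows` list is a hypothetical stand-in for `df` built from the question's sample data, and `itertools.product` plays the role of `crossJoin`.

```python
from itertools import product

# Hypothetical stand-in for df: the sample rows from the question.
rows = [
    (0, "zero", "abc"),
    (1, "one", "def"),
    (2, "two", "ghi"),
    (3, "three", "jkl"),
    (4, "four", "mno"),
]

ids = [r[0] for r in rows]              # like df.select('ID')
payload = [(r[1], r[2]) for r in rows]  # like df.select('string', 'test')

# Cartesian product: every ID paired with every (string, test) pair.
result = [(i, s, t) for i, (s, t) in product(ids, payload)]

print(len(result))  # 25
```

With 5 IDs and 5 (string, test) pairs the result has 5 × 5 = 25 rows, matching the table above. If `ID` can contain duplicates in the real data, applying `distinct()` to `df.select('ID')` before the join is probably what you want.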

cidc1ykv  #2

You can cross join the DataFrame with itself:

df1 = df.alias("a").crossJoin(df.alias("b")) \
    .select("a.ID", "b.string", "b.test")

df1.show()  # show() prints only the first 20 rows by default; the full result has 25

# +---+------+----+
# | ID|string|test|
# +---+------+----+
# |  0|   one| def|
# |  0|   two| ghi|
# |  0| three| jkl|
# |  0|  four| mno|
# |  1|  zero| abc|
# |  1|   two| ghi|
# |  1| three| jkl|
# |  1|  four| mno|
# |  3|  zero| abc|
# |  3|   one| def|
# |  3|   two| ghi|
# |  3|  four| mno|
# |  2|  zero| abc|
# |  2|   one| def|
# |  2| three| jkl|
# |  2|  four| mno|
# |  4|  zero| abc|
# |  4|   one| def|
# |  4|   two| ghi|
# |  4| three| jkl|
# +---+------+----+

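Another approach is to collect the distinct ID values and use a list comprehension to build one DataFrame per ID. Each one takes the rows belonging to the other IDs, keeps their string and test values, and sets the ID column to that ID as a literal. Unioning these DataFrames with the original df (which supplies each ID's own row) gives the desired result: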

import functools
from pyspark.sql import functions as F
from pyspark.sql import DataFrame

dfs = [
    df.filter(F.col("ID") != r.ID).selectExpr(f"{r.ID} as ID", "string", "test")
    for r in df.select("ID").distinct().collect()
]

df1 = functools.reduce(DataFrame.union, dfs + [df])
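
The union approach can be mirrored in plain Python to check that it yields the same rows as the cross join. This is a sketch under assumptions: `rows` is a hypothetical stand-in for `df` with the question's sample data, and list concatenation stands in for `DataFrame.union`.

```python
from itertools import product

# Hypothetical stand-in for df: the sample rows from the question.
rows = [
    (0, "zero", "abc"),
    (1, "one", "def"),
    (2, "two", "ghi"),
    (3, "three", "jkl"),
    (4, "four", "mno"),
]

# One list per distinct ID: the rows of all *other* IDs, relabelled with this ID
# (mirrors df.filter(ID != r.ID).selectExpr(f"{r.ID} as ID", "string", "test")).
parts = [
    [(i, s, t) for (other, s, t) in rows if other != i]
    for i in sorted({r[0] for r in rows})
]

# Concatenate the parts and append the original rows (mirrors the reduce/union step).
union_result = [row for part in parts for row in part] + rows

# Reference: Cartesian product of the IDs with the (string, test) pairs.
cross_result = [(i, s, t) for i, (_, s, t) in product([r[0] for r in rows], rows)]

assert sorted(union_result) == sorted(cross_result)
print(len(union_result))  # 25
```

In Spark itself this variant collects the distinct IDs to the driver and builds one filtered DataFrame per ID, so the cross join is likely the simpler choice when there are many IDs.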
