PySpark: replace values in multiple columns based on the value of one column

9fkzdhlc  posted on 2021-07-13 in Spark

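I have a PySpark DataFrame with many columns (around 320).
I need to look for the keyword baz in column A. On rows where A is baz, I want to replace the existing values in all the columns listed in columns_for_replacement with None.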

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

I am trying to modify the following code to do this:

for i in columns_for_replacement:
    df = df.withColumn(i,when((col(i)=='baz'),None).otherwise(col(i)))

The code above only tests each column against its own value, which is not what I need.
Base DataFrame:

A   B   C   D   E   F   G   H   I   J   
baz abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
def abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
map abc abc abc abc abc abc abc abc abc
baz abc abc abc abc abc abc abc abc abc
noo abc abc abc abc abc abc abc abc abc

Expected DataFrame:

A   B   C   D   E   F   G   H   I   J   
baz                                 abc
baz                                 abc
def abc abc abc abc abc abc abc abc abc
baz                                 abc
map abc abc abc abc abc abc abc abc abc
baz                                 abc
noo abc abc abc abc abc abc abc abc abc

gz5pxeao  #1

I think you meant to use column A in the when condition, rather than column i:

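from pyspark.sql.functions import col, lit, when

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

# Test column A, not column i, and null out column i on matching rows.
for i in columns_for_replacement:
    df = df.withColumn(i, when(col("A") == "baz", lit(None)).otherwise(col(i)))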

Another way is to do it in a single select:

columns_for_replacement = ["B", "C", "D", "E", "F", "G", "H", "I"]

df2 = df.select([
    when((col("A")=='baz'), lit(None)).otherwise(col(c)).alias(c)
    if c in columns_for_replacement 
    else col(c)
    for c in df.columns
])

ocebsuys  #2

You need to check column A:

from pyspark.sql import functions as F

df1 = df.select(
    F.col("A"),
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    F.col("J")
)

df1.show()

# +---+----+----+----+----+----+----+----+----+---+
# |  A|   B|   C|   D|   E|   F|   G|   H|   I|  J|
# +---+----+----+----+----+----+----+----+----+---+
# |baz|null|null|null|null|null|null|null|null|abc|
# |baz|null|null|null|null|null|null|null|null|abc|
# |def| abc| abc| abc| abc| abc| abc| abc| abc|abc|
# |baz|null|null|null|null|null|null|null|null|abc|
# |map| abc| abc| abc| abc| abc| abc| abc| abc|abc|
# |baz|null|null|null|null|null|null|null|null|abc|
# |noo| abc| abc| abc| abc| abc| abc| abc| abc|abc|
# +---+----+----+----+----+----+----+----+----+---+

Or, more generally, to avoid listing every column that is not in the replacement list:

df1 = df.select(
    *[F.when(F.col("A") == "baz", F.lit(None)).otherwise(F.col(c)).alias(c) for c in columns_for_replacement],
    *[F.col(c) for c in df.columns if c not in columns_for_replacement]
).select(*df.columns)  # restore the original order by name; toDF would only rename by position
