Formatting one column using another column in a PySpark DataFrame

az31mfrm · Published 2023-04-19 in Spark
Answers (3) | Views (131)

I have a business case where one column must be updated based on the values of two other columns. Here is an example:

+-------------------------------+------+------+------------------------------+
|ErrorDescBefore                |name  |value |ErrorDescAfter                |
+-------------------------------+------+------+------------------------------+
|The error is in %s value is %s.|xx    |z     |The error is in xx value is z.|
|The new cond is in %s is %s.   |y     |ww    |The new cond is in y is ww.   |
+-------------------------------+------+------+------------------------------+

The ErrorDescBefore column contains two placeholders (%s), which should be filled in with the name and value columns; the result is ErrorDescAfter.
Can this be done in PySpark? I tried string_format and realized that was not the right approach. Any help would be appreciated.
Thank you


yqkkidmi 1#

You can always use a UDF to handle custom requirements, for example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.appName("DateDataFrame").getOrCreate()
data = [
    ("The error is in %s value is %s.", "xx", "z"),
    ("The new cond is in %s is %s.", "y", "ww"),
]
df = spark.createDataFrame(data, ['ErrorDescBefore', 'name', 'value'])

# str.replace with count=1 substitutes only the first occurrence, so the
# first call fills the first %s with name and the second call fills the
# remaining %s with value.
format_udf = udf(lambda s, name, value: s.replace('%s', name, 1).replace('%s', value, 1))

df.withColumn("ErrorDescAfter", format_udf(col("ErrorDescBefore"), col("name"), col("value"))).show(truncate=False)

Result:

+-------------------------------+----+-----+------------------------------+
|ErrorDescBefore                |name|value|ErrorDescAfter                |
+-------------------------------+----+-----+------------------------------+
|The error is in %s value is %s.|xx  |z    |The error is in xx value is z.|
|The new cond is in %s is %s.   |y   |ww   |The new cond is in y is ww.   |
+-------------------------------+----+-----+------------------------------+
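The replace-once trick this UDF relies on is plain Python, so it can be sanity-checked on the driver before wiring it into Spark (the helper name `fill_placeholders` below is ours, not from the answer):

```python
def fill_placeholders(template: str, name: str, value: str) -> str:
    # str.replace with count=1 substitutes only the first occurrence,
    # so chaining two calls fills the placeholders left to right.
    return template.replace('%s', name, 1).replace('%s', value, 1)

print(fill_placeholders("The error is in %s value is %s.", "xx", "z"))
# → The error is in xx value is z.
```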

rqcrx0a6 2#

You can split ErrorDescBefore into an array using %s as the delimiter, then use the concat function to interleave its elements with the name and value columns.

import pyspark.sql.functions as F

...
df = df.withColumn('ErrorDescAfter', F.split('ErrorDescBefore', '%s')).withColumn(
    'ErrorDescAfter',
    # Interleave the split pieces with the name and value columns.
    F.concat(
        F.col('ErrorDescAfter')[0], F.col('name'),
        F.col('ErrorDescAfter')[1], F.col('value'),
        F.col('ErrorDescAfter')[2],
    ),
)
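The split/interleave logic itself is easy to see in plain Python. Note that a template like `"... %s is %s."` splits into three pieces (text before, text between, and the trailing `"."`), which is why the code above indexes up to `[2]`; in Spark, an out-of-range array index and `F.concat` over a null both yield null, so templates with fewer than two placeholders would produce a null result.

```python
template = "The new cond is in %s is %s."
parts = template.split('%s')  # ['The new cond is in ', ' is ', '.']
result = parts[0] + 'y' + parts[1] + 'ww' + parts[2]
print(result)
# → The new cond is in y is ww.
```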

jpfvwuh4 3#

If you know the format in the ErrorDescBefore column will stay consistent, you can split ErrorDescBefore on the string %s and concatenate each piece with the name and value columns:

import pyspark.sql.functions as F

df.withColumn(
    'ErrorDescAfter',
    F.concat(
        F.split(F.col('ErrorDescBefore'), '%s').getItem(0),
        F.col('name'),
        F.split(F.col('ErrorDescBefore'), '%s').getItem(1),
        F.col('value'),
        F.split(F.col('ErrorDescBefore'), '%s').getItem(2),
    ),
).show(truncate=False)

+-------------------------------+----+-----+------------------------------+
|ErrorDescBefore                |name|value|ErrorDescAfter                |
+-------------------------------+----+-----+------------------------------+
|The error is in %s value is %s.|xx  |z    |The error is in xx value is z.|
|The new cond is in %s is %s.   |y   |ww   |The new cond is in y is ww.   |
+-------------------------------+----+-----+------------------------------+
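If templates may carry an arbitrary number of placeholders, the replace-once idea from the first answer generalizes to a fold over the substitution values. Here is a driver-side Python sketch of that logic (the helper name `fill_all` is hypothetical, not from the answers); the same function body could serve as a UDF taking an array column of values:

```python
from functools import reduce

def fill_all(template: str, values) -> str:
    # Fill each %s in order with the corresponding value,
    # replacing one occurrence per step.
    return reduce(lambda s, v: s.replace('%s', v, 1), values, template)

print(fill_all("The error is in %s value is %s.", ["xx", "z"]))
# → The error is in xx value is z.
```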
