Format one column with another column in a PySpark DataFrame

Posted by az31mfrm on 2023-04-19 in Spark


I have a business case in which one column must be updated based on the values of two other columns. Here is an example:

+-------------------------------+------+------+------------------------------+
|ErrorDescBefore                |name  |value |ErrorDescAfter                |
+-------------------------------+------+------+------------------------------+
|The error is in %s value is %s.|xx    |z     |The error is in xx value is z.|
|The new cond is in %s is %s.   |y     |ww    |The new cond is in y is ww.   |
+-------------------------------+------+------+------------------------------+

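The ErrorDescBefore column contains two placeholders (%s), which should be filled with the values of the name and value columns, in that order. The expected output is ErrorDescAfter.
Can this be done in PySpark? I tried string_format and realized it is not the right approach. Any help would be appreciated.
Thank you.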

pyspark

Source: https://stackoverflow.com/questions/75996371/format-one-column-with-another-column-in-pyspark-dataframe

3 answers

yqkkidmi #1

You can always fall back on a UDF for custom requirements like this, for example:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.appName("DateDataFrame").getOrCreate()
data = [
    ("The error is in %s value is %s.", "xx", "z"),
    ("The new cond is in %s is %s.", "y", "ww"),
]
df = spark.createDataFrame(data, ['ErrorDescBefore', 'name', 'value'])

# Replace the first %s with name, then the next %s with value.
# udf() returns a string column by default.
format_udf = udf(lambda template, name, value: template.replace('%s', name, 1).replace('%s', value, 1))

df.withColumn("ErrorDescAfter", format_udf(col("ErrorDescBefore"), col("name"), col("value"))).show(truncate=False)

Result:

+-------------------------------+----+-----+------------------------------+
|ErrorDescBefore                |name|value|ErrorDescAfter                |
+-------------------------------+----+-----+------------------------------+
|The error is in %s value is %s.|xx  |z    |The error is in xx value is z.|
|The new cond is in %s is %s.   |y   |ww   |The new cond is in y is ww.   |
+-------------------------------+----+-----+------------------------------+
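
The substitution logic of a UDF like this can be checked in plain Python before it is wrapped in Spark. The following is only a minimal sketch: the function name fill_placeholders is hypothetical, and the None handling is an assumption about how null cells should behave (Spark passes null cells to a Python UDF as None, on which a bare str.replace call would raise an exception). It also substitutes value only after the point where name was inserted, so a %s that happens to occur inside name is left untouched.

```python
from typing import Optional


def fill_placeholders(
    template: Optional[str], name: Optional[str], value: Optional[str]
) -> Optional[str]:
    """Fill the first two '%s' markers of `template` with `name` and `value`.

    Hypothetical helper for illustration; returns None if any input is None
    (an assumed policy for null cells).
    """
    if template is None or name is None or value is None:
        return None
    # Cut the template at the first marker.
    head, marker, rest = template.partition("%s")
    if not marker:
        # No placeholder at all: return the template unchanged.
        return template
    # Fill the second marker only in the remainder, never inside `name`.
    return head + name + rest.replace("%s", value, 1)


if __name__ == "__main__":
    print(fill_placeholders("The error is in %s value is %s.", "xx", "z"))
    print(fill_placeholders("The new cond is in %s is %s.", "y", "ww"))
```

If this behavior is what you want, the function can be passed to udf in place of the lambda.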

2023-04-19

rqcrx0a6 #2

You can split ErrorDescBefore into an array using %s as the delimiter, then use the concat function to join its elements with name and value.

import pyspark.sql.functions as F

...
df = df.withColumn('ErrorDescAfter', F.split('ErrorDescBefore', '%s')).withColumn(
    'ErrorDescAfter',
    F.concat(F.col('ErrorDescAfter')[0], 'name', F.col('ErrorDescAfter')[1], 'value', F.col('ErrorDescAfter')[2])
)

2023-04-19

jpfvwuh4 #3

If you know the format in the ErrorDescBefore column will stay consistent, you can split ErrorDescBefore on the string %s and concatenate each piece with the name and value columns:

df.withColumn(
    'ErrorDescAfter',
    F.concat(
        F.split(F.col('ErrorDescBefore'), '%s').getItem(0),
        F.col('name'),
        F.split(F.col('ErrorDescBefore'), '%s').getItem(1),
        F.col('value'),
        F.split(F.col('ErrorDescBefore'), '%s').getItem(2),
    )
)

+-------------------------------+----+-----+------------------------------+
|ErrorDescBefore                |name|value|ErrorDescAfter                |
+-------------------------------+----+-----+------------------------------+
|The error is in %s value is %s.|xx  |z    |The error is in xx value is z.|
|The new cond is in %s is %s.   |y   |ww   |The new cond is in y is ww.   |
+-------------------------------+----+-----+------------------------------+
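
The split-and-concatenate idea relies on every template having exactly two placeholders, which yields exactly three pieces. A plain-Python sketch of the same technique makes that assumption explicit. The function name interleave is hypothetical, and raising an error on a mismatched placeholder count is an assumed policy, not something the answers above specify. Note also that Python's str.split takes a literal separator, whereas Spark's split takes a regular expression; %s contains no regex metacharacters, so the two behave the same here.

```python
from itertools import zip_longest


def interleave(template: str, *values: str) -> str:
    """Split `template` on '%s' and weave `values` between the pieces.

    Hypothetical helper for illustration; raises ValueError when the number
    of placeholders does not match the number of values.
    """
    parts = template.split("%s")
    # N placeholders always produce N + 1 pieces.
    if len(parts) != len(values) + 1:
        raise ValueError(
            f"expected {len(parts) - 1} values, got {len(values)}"
        )
    woven = []
    # The last piece has no value after it, so pad with an empty string.
    for part, val in zip_longest(parts, values, fillvalue=""):
        woven.append(part)
        woven.append(val)
    return "".join(woven)


if __name__ == "__main__":
    print(interleave("The error is in %s value is %s.", "xx", "z"))
    print(interleave("The new cond is in %s is %s.", "y", "ww"))
```

In the Spark version, a template with fewer than two placeholders would make getItem(2) return null rather than raise (with ANSI mode off), and concat returns null when any of its inputs is null, so such rows would end up with a null ErrorDescAfter.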

2023-04-19
