Create a new row based on other column values in a PySpark DataFrame

Posted by gwbalxhn on 2024-01-06 in Spark

I have a PySpark DataFrame like this:

c1      c2
111     null    
null    222
333     444
null    null

I need an additional column, as in the final DataFrame below:

c1      c2      new_col
111     null    111
null    222     222
333     444     333
333     444     444
null    null    null

If both columns have a value, I need to create a new row for each of the c1 and c2 values. This is what I have tried:

from pyspark.sql.functions import col, when

df = df.withColumn('new_col', when(col('c1').isNull(), col('c2')) \
        .otherwise(when(col('c2').isNull(), col('c1')).otherwise(col('c2'))))

I need a new row to be created when both c1 and c2 have values. Can anyone suggest a solution?

pyspark

Source: https://stackoverflow.com/questions/77717687/create-a-new-row-based-on-other-column-values-in-pyspark-dataframe

4 Answers

#1 pn9klfpd

You can use unionAll to create the new rows. This is written in Scala, but it is easy to convert to Python:

df.withColumn("new_col", coalesce($"c1", $"c2"))
  .unionAll(
    df.where($"c1".isNotNull && $"c2".isNotNull)
      .withColumn("new_col", $"c2")
  )

Test result:

+----+----+-------+
|c1  |c2  |new_col|
+----+----+-------+
|111 |null|111    |
|null|222 |222    |
|333 |444 |333    |
|null|null|null   |
|333 |444 |444    |
+----+----+-------+

#2 o75abkj4

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
data = [(111, None), (None, 222), (333, 444), (None, None)]
columns = ["c1", "c2"]
df = spark.createDataFrame(data, columns)
df = df.withColumn("c1", F.col("c1").cast('int'))
df = df.withColumn("c2", F.col("c2").cast('int'))
# Split the rows: df1 has both columns set, df2 has at least one null.
both_set = F.col("c1").isNotNull() & F.col("c2").isNotNull()
df1 = df.filter(both_set)
df2 = df.filter(~both_set)
# Rows with both values: pack c1 and c2 into an array and explode it into two rows.
df1 = (df1.withColumn("new_col", F.array("c1", "c2"))
       .withColumn("new_col", F.explode("new_col")))
# Remaining rows: take whichever column is non-null (null when both are null).
df2 = df2.withColumn("new_col", F.coalesce("c1", "c2"))
df1.show()
df2.show()
final_df = df2.unionByName(df1)
final_df.show()

Here is the result, which should be the output you want:

+----+----+-------+
|  c1|  c2|new_col|
+----+----+-------+
| 111|NULL|    111|
|NULL| 222|    222|
|NULL|NULL|   NULL|
| 333| 444|    333|
| 333| 444|    444|
+----+----+-------+

#3 wmvff8tz

You can build an array, using coalesce for the rows where at most one column is set, and then use explode to create rows from it:

from pyspark.sql.functions import expr, explode

df \
.withColumn(
    "array_col",
    # Trailing spaces keep the concatenated SQL fragments separated.
    expr(
        "CASE WHEN c1 IS NOT NULL AND c2 IS NOT NULL THEN array(c1, c2) " +
        "ELSE array(coalesce(c1, c2)) " +
        "END"
    )
) \
.withColumn("new_col", explode("array_col")) \
.drop("array_col") \
.show()

Output:

+----+----+-------+
|  c1|  c2|new_col|
+----+----+-------+
| 111|NULL|    111|
|NULL| 222|    222|
| 333| 444|    333|
| 333| 444|    444|
|NULL|NULL|   NULL|
+----+----+-------+

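#4 jecbmhm3

To get the desired result, you can populate the new column (new_col) with an array built according to the conditions you mentioned and then explode that array into rows. Here is how to do it: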

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample DataFrame
data = [(111, None), (None, 222), (333, 444), (None, None)]
columns = ["c1", "c2"]
df = spark.createDataFrame(data, columns)
# Create a new DataFrame with an array-typed new_col.
# Every branch must return an array: mixing scalar and array branches
# in one when/otherwise chain fails with a data type mismatch.
new_df = df.withColumn("new_col", F.when(df["c1"].isNull(), F.array(df["c2"]))
                                 .when(df["c2"].isNull(), F.array(df["c1"]))
                                 .otherwise(F.array(df["c1"], df["c2"])))
# Explode the array in new_col to create separate rows
result_df = new_df.select("c1", "c2", F.explode("new_col").alias("new_col"))
# Show the result
result_df.show()

This gives you the following DataFrame:

+----+----+-------+
|  c1|  c2|new_col|
+----+----+-------+
| 111|null|    111|
|null| 222|    222|
| 333| 444|    333|
| 333| 444|    444|
|null|null|   null|
+----+----+-------+

Here, F.array(df["c1"], df["c2"]) creates an array column new_col containing the values of c1 and c2. F.explode then splits that array into separate rows, so you get one new row for each value in the array.
