基于包含concat值的现有列添加新列spark dataframe

ecr0jaav 于 2021-05-27 发布在 Spark

关注(0)|答案(1)|浏览(501)

我想根据以下条件在我的Dataframe中创建一个新列。
我的Dataframe是这样的：

my_string 

2020 test 

2020 prod 

2020 dev

我的状况：

value1=subtract string after space from my_string

value2=subtract first four digit from my_string

If value 1 contains string 'test' then new_col=value2+"01"

If value 1 contains string 'prod' then new_col=value2+"kk"

If value 1 contains string 'dev' then new_col=value2+"ff"

我需要这样的结果：

my_string       |  new_col

2020 test        | 202001

2020 prod        | 2020kk 

2020 dev        | 2020ff

有人能帮帮我吗？

apache-spark pyspark apache-spark-sql spark-streaming

来源：https://stackoverflow.com/questions/61822351/add-new-column-based-on-existing-column-with-concat-values-spark-dataframe

1条答案

按热度按时间

7z5jn7bk1#

使用 row_number 窗口函数 monotonically_increasing_id() ```
from pyspark.sql import *
from pyspark.sql.functions import *
w = Window.orderBy(monotonically_increasing_id())
df.withColumn("new_col",concat(split(col("my_string")," ")[0], lpad(row_number().over(w),2,"0"))).show()

+---------+-------+

|my_string|new_col|

+---------+-------+

|2020 test| 202001|

|2020 prod| 202002|

| 2020 dev| 202003|

+---------+-------+

``` UPDATE: 使用 when+otherwise 声明。

df.withColumn("dyn_col",when(lower(split(col("my_string")," ")[1]) =="prod","kk").\
when(lower(split(col("my_string")," ")[1]) =="dev","ff").\
when(lower(split(col("my_string")," ")[1]) =="test","01").\
otherwise("null")).\
withColumn("new_col",concat(split(col("my_string")," ")[0], col("dyn_col"))).\
drop("dyn_col").\
show()

# +---------+-------+

# |my_string|new_col|

# +---------+-------+

# |2020 test| 202001|

# |2020 prod| 2020kk|

# | 2020 dev| 2020ff|

# +---------+-------+

``` `In Scala:` ```
df.withColumn("dyn_col",when(lower(split(col("my_string")," ")(1)) ==="prod","kk").
when(lower(split(col("my_string")," ")(1)) ==="dev","ff").
when(lower(split(col("my_string")," ")(1)) ==="test","01").
otherwise("null")).
withColumn("new_col",concat(split(col("my_string")," ")(0), col("dyn_col"))).
drop("dyn_col").
show()

//+---------+-------+
//|my_string|new_col|
//+---------+-------+
//|2020 test| 202001|
//|2020 prod| 2020kk|
//| 2020 dev| 2020ff|
//+---------+-------+

赞(0）回复(0）举报 2021-05-27

我来回答

基于包含concat值的现有列添加新列spark dataframe

1条答案

+---------+-------+

|my_string|new_col|

+---------+-------+

|2020 test| 202001|

|2020 prod| 202002|

| 2020 dev| 202003|

+---------+-------+

相关问题

热门标签

最新问答