PySpark: how to add a selected column based on a value

Posted by wb1gzix0 on 2021-07-09 in Spark

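For the data structure below, I want to return a new DataFrame whose value column is chosen by the condition column. For example, if "condition" == 'A', the new DataFrame should take its value from the group1 column; if "condition" == 'B', it should take its value from the group2 column. The catch is that I don't want to hardcode the column names, because there may be many more columns later for other values. How can I do this? Any help is much appreciated. For example, for this input DataFrame: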

+---------+---------+---------+
|condition|   group1|   group2|
+---------+---------+---------+
|        A|{SEA, WA}|{PDX, OR}|
|        B| {NY, NY}| {LA, CA}|
+---------+---------+---------+

I want to get this output:

+---------+---------+
|condition|    group|
+---------+---------+
|        A|{SEA, WA}|
|        B| {LA, CA}|
+---------+---------+

The input DataFrame above was created from these JSON strings:

jsonStrings = ['{"condition":"A","group1":{"city":"SEA","state":"WA"},"group2":{"city":"PDX","state":"OR"}}','{"condition":"B","group1":{"city":"NY","state":"NY"},"group2":{"city":"LA","state":"CA"}}']

Answer #1 (o75abkj4)

You can use when and chain the conditions as below; the chain can also be built dynamically from a list of conditions.

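from pyspark.sql.functions import col, when

# Pick group1 when condition is 'A' and group2 when it is 'B';
# any other condition value yields NULL.
conditions = when(col('condition') == 'A', col("group1"))\
    .when(col('condition') == 'B', col("group2")).otherwise(None)

# df1 is the input DataFrame shown in the question.
df1.select(col('condition'), conditions.alias("group")).show(truncate=False)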

Output:

+---------+---------+
|condition|group    |
+---------+---------+
|A        |{SEA, WA}|
|B        |{LA, CA} |
+---------+---------+
