python pySpark:将列名连接为字符串into column

zqdjd7g9  于 2022-12-10  发布在  Python
关注(0)|答案(1)|浏览(154)

我有一个pySpark数据框,其中有很多属性列(大约160个)。这些列是1和0,用来显示一个帐户是否有属性。我需要对属性的组合进行分析,所以我想在一个新列中放置一个字符串,其中包含该帐户所拥有的属性的名称。下面是一个示例:我有这些列-帐户,然后是一些其他列,然后是属性。我想添加的列是“att_list”。

我所尝试的是这样的:
我有一个变量中的属性列表

# create a list of all the attributes available
att_names=df1.drop('Account','other_col1','other_col1')
attlist=[x for x in att_names.columns ]

我试着用一个函数--扩展一个现有的:

def func_att_list(df, cols=[]):
    
    att_list_column = ','.join([when(f.col(i) > 0, i) for i in cols])

    return df.withColumn('att_list', att_list_column )

df2 = func_att_list(df1, cols=[i for i in attlist])

这才报出错误。
我也试过这个:

att_list_column = [when(df1.col(i) > 0, i) for i in attlist]
df1 = df1.withColumn('att_list', ','.join([i for i in att_list_column ])

这也行不通。
我对功能没有信心,觉得它们有点像“黑匣子”。我将非常感谢任何帮助。

b09cbbtk

b09cbbtk1#

您可以使用concat_ws并为每个属性列传递一个case when条件列表-条件可以是 if attribute column has 1 then attribute column name
下面是一个小的测试示例

# sample input creation
data_ls = [
    [random.randint(0, 1) for i in range(10)] for j in range(100)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['attr'+str(k) for k in range(10)])

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# |    0|    0|    1|    1|    1|    1|    0|    0|    0|    0|
# |    1|    1|    0|    0|    1|    1|    1|    1|    1|    1|
# |    0|    1|    0|    1|    0|    0|    1|    0|    0|    0|
# |    1|    1|    0|    0|    0|    0|    0|    1|    1|    0|
# |    1|    0|    1|    0|    1|    0|    1|    1|    1|    0|
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
# only showing top 5 rows

# concatenate when().otherwise() for each attribute field
data_sdf. \
    withColumn('attr_list', 
               func.concat_ws(',', 
                              *[func.when(func.col(c) == 1, func.lit(c))
                                for c in data_sdf.columns if c.startswith('attr')]
                              )
               ). \
    show(5, truncate=False)

# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |attr0|attr1|attr2|attr3|attr4|attr5|attr6|attr7|attr8|attr9|attr_list                                      |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# |0    |0    |1    |1    |1    |1    |0    |0    |0    |0    |attr2,attr3,attr4,attr5                        |
# |1    |1    |0    |0    |1    |1    |1    |1    |1    |1    |attr0,attr1,attr4,attr5,attr6,attr7,attr8,attr9|
# |0    |1    |0    |1    |0    |0    |1    |0    |0    |0    |attr1,attr3,attr6                              |
# |1    |1    |0    |0    |0    |0    |0    |1    |1    |0    |attr0,attr1,attr7,attr8                        |
# |1    |0    |1    |0    |1    |0    |1    |1    |1    |0    |attr0,attr2,attr4,attr6,attr7,attr8            |
# +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------------------------------------------+
# only showing top 5 rows

列表解析将导致以下结果

[Column<'CASE WHEN (attr0 = 1) THEN attr0 END'>,
 Column<'CASE WHEN (attr1 = 1) THEN attr1 END'>,
 Column<'CASE WHEN (attr2 = 1) THEN attr2 END'>,
 Column<'CASE WHEN (attr3 = 1) THEN attr3 END'>,
 Column<'CASE WHEN (attr4 = 1) THEN attr4 END'>,
 Column<'CASE WHEN (attr5 = 1) THEN attr5 END'>,
 Column<'CASE WHEN (attr6 = 1) THEN attr6 END'>,
 Column<'CASE WHEN (attr7 = 1) THEN attr7 END'>,
 Column<'CASE WHEN (attr8 = 1) THEN attr8 END'>,
 Column<'CASE WHEN (attr9 = 1) THEN attr9 END'>]

相关问题