How can I split one column of a PySpark DataFrame into multiple columns while keeping the other columns?

kxe2p93d asked on 2021-07-12 in Spark

I have data like this:

>>> data = sc.parallelize([[1,5,10,0,[1,2,3,4,5,6]],[0,10,20,1,[2,3,4,5,6,7]],[1,15,25,0,[3,4,5,6,7,8]],[0,30,40,1,[4,5,6,7,8,9]]]).toDF(('a','b','c',"d","e"))
>>> data.show()
+---+---+---+---+------------------+
|  a|  b|  c|  d|                 e|
+---+---+---+---+------------------+
|  1|  5| 10|  0|[1, 2, 3, 4, 5, 6]|
|  0| 10| 20|  1|[2, 3, 4, 5, 6, 7]|
|  1| 15| 25|  0|[3, 4, 5, 6, 7, 8]|
|  0| 30| 40|  1|[4, 5, 6, 7, 8, 9]|
+---+---+---+---+------------------+

# columns that should be kept in the result

keep_cols = ["a","b"]

# column 'e' should be split into split_e_cols

split_e_cols = ["one","two","three","four","five","six"]

# I hope the resulting DataFrame has keep_cols + split_e_cols

I want to split column e into multiple columns while keeping columns a and b at the same time.
I tried:

data.select(*(col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))))

data.select("e").rdd.flatMap(lambda x:x).toDF(split_e_cols)

Neither of them keeps columns a and b.
Can anyone help me? Thanks.

ggazkfy8 1#

Try this:

from pyspark.sql.functions import col

select_cols = [col(c) for c in keep_cols] + [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]

data.select(*select_cols).show()

# +---+---+---+---+-----+----+----+---+
# |  a|  b|one|two|three|four|five|six|
# +---+---+---+---+-----+----+----+---+
# |  1|  5|  1|  2|    3|   4|   5|  6|
# |  0| 10|  2|  3|    4|   5|   6|  7|
# |  1| 15|  3|  4|    5|   6|   7|  8|
# |  0| 30|  4|  5|    6|   7|   8|  9|
# +---+---+---+---+-----+----+----+---+

Or use a for loop with withColumn:

# keep only the wanted columns plus the array column
data = data.select(keep_cols + ["e"])

# add one new column per array element, then drop the array
for i in range(len(split_e_cols)):
    data = data.withColumn(split_e_cols[i], col("e").getItem(i))

data.drop("e").show()
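
If you are on PySpark 3.3 or newer, the loop can also be collapsed into a single withColumns call. This is a minimal sketch of the same idea; it assumes Spark 3.3+, since older versions only provide withColumn:

from pyspark.sql.functions import col

# assumes PySpark 3.3+: DataFrame.withColumns takes a dict of
# column name -> Column and adds all of them in one pass
data = data.select(keep_cols + ["e"])
data = data.withColumns(
    {c: col("e").getItem(i) for i, c in enumerate(split_e_cols)}
)
data.drop("e").show()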
3j86kqsm 2#

You can concatenate the two lists of columns with +:

from pyspark.sql.functions import col

data.select(
    keep_cols +
    [col("e").getItem(i).alias(split_e_cols[i]) for i in range(len(split_e_cols))]
).show()
+---+---+---+---+-----+----+----+---+
|  a|  b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
|  1|  5|  1|  2|    3|   4|   5|  6|
|  0| 10|  2|  3|    4|   5|   6|  7|
|  1| 15|  3|  4|    5|   6|   7|  8|
|  0| 30|  4|  5|    6|   7|   8|  9|
+---+---+---+---+-----+----+----+---+

A more Pythonic way is to use enumerate instead of range(len()):

from pyspark.sql.functions import col

data.select(
    keep_cols +
    [col("e").getItem(i).alias(c) for (i, c) in enumerate(split_e_cols)]
).show()
+---+---+---+---+-----+----+----+---+
|  a|  b|one|two|three|four|five|six|
+---+---+---+---+-----+----+----+---+
|  1|  5|  1|  2|    3|   4|   5|  6|
|  0| 10|  2|  3|    4|   5|   6|  7|
|  1| 15|  3|  4|    5|   6|   7|  8|
|  0| 30|  4|  5|    6|   7|   8|  9|
+---+---+---+---+-----+----+----+---+
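
Since select also accepts columns as individual arguments, the two lists can be unpacked with *, and bracket indexing col("e")[i] is shorthand for col("e").getItem(i). A minimal sketch of the same query:

from pyspark.sql.functions import col

# * unpacks both lists into select's argument list;
# col("e")[i] is equivalent to col("e").getItem(i)
data.select(
    *keep_cols,
    *[col("e")[i].alias(c) for i, c in enumerate(split_e_cols)]
).show()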
