Merging columns in a DataFrame

Posted by oxosxuxt on 2021-07-12 in Spark

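I have a DataFrame with 4 columns and want to merge the first 2 columns and the last 2 columns into a new DataFrame.
Both pairs hold the same kind of data, the row order is irrelevant, and any duplicates must be kept.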

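from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or start a local one.
spark = SparkSession.builder.getOrCreate()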

df = spark.createDataFrame([
["This is line 1","xxxx12","This is line 5","hhhh29"],
["This is line 2","yyyy23","This is line 6","kkkk47"],
["This is line 3","zzzz64","This is line 7","llll88"],
["This is line 4","gggg37","This is line 8","ssss84"],
]).toDF("col_a", "col_b", "col_c", "col_d")

New DataFrame:

+---------------+-------+
|col_1          |col_2  |
+---------------+-------+
|This is line 1 |xxxx12 |
|This is line 5 |hhhh29 |
|This is line 2 |yyyy23 |
|This is line 6 |kkkk47 |
|This is line 3 |zzzz64 |
|This is line 7 |llll88 |
|This is line 4 |gggg37 |
|This is line 8 |ssss84 |
+---------------+-------+

How can I do this?

apache-spark pyspark apache-spark-sql merge

Source: https://stackoverflow.com/questions/66553001/merge-columns-from-the-a-dataframe

1 Answer

8e2ybdfx1#

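If the order doesn't matter, you can use unionAll (an alias of union in PySpark), which appends the rows of the second selection to the first and keeps duplicates: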

df2 = df.selectExpr(
    "col_a as col_1", "col_b as col_2"
).unionAll(
    df.selectExpr("col_c as col_1", "col_d as col_2")
)

df2.show()
+--------------+------+
|         col_1| col_2|
+--------------+------+
|This is line 1|xxxx12|
|This is line 2|yyyy23|
|This is line 3|zzzz64|
|This is line 4|gggg37|
|This is line 5|hhhh29|
|This is line 6|kkkk47|
|This is line 7|llll88|
|This is line 8|ssss84|
+--------------+------+
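
The same union idea can be generalised to any number of column groups. The helper below is only a sketch: the name merge_column_pairs and its parameters are made up for illustration and are not part of PySpark. It relies only on DataFrame.select, DataFrame.toDF and DataFrame.union, and assumes every group has as many columns as there are output names, with compatible types in each position.

```python
from functools import reduce


def merge_column_pairs(df, groups, names=("col_1", "col_2")):
    """Stack several column groups of `df` on top of each other.

    `groups` is a list of column-name tuples, e.g. [("col_a", "col_b"), ...].
    Hypothetical helper for illustration; not a PySpark API.
    """
    if not groups:
        raise ValueError("at least one column group is required")
    # One projection per group, each renamed to the shared output names.
    parts = [df.select(*group).toDF(*names) for group in groups]
    # union matches columns by position and does not remove duplicates.
    return reduce(lambda left, right: left.union(right), parts)
```

With the df from the question, merge_column_pairs(df, [("col_a", "col_b"), ("col_c", "col_d")]) should give the same rows as the unionAll snippet above. As with any union, the row order of the result is not something to rely on.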

Or you can use stack, which keeps the order: the two rows generated from each input row stay next to each other:

df2 = df.selectExpr("stack(2, col_a, col_b, col_c, col_d) as (col_1, col_2)")

df2.show()
+--------------+------+
|         col_1| col_2|
+--------------+------+
|This is line 1|xxxx12|
|This is line 5|hhhh29|
|This is line 2|yyyy23|
|This is line 6|kkkk47|
|This is line 3|zzzz64|
|This is line 7|llll88|
|This is line 4|gggg37|
|This is line 8|ssss84|
+--------------+------+
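
To see why stack(2, col_a, col_b, col_c, col_d) yields two 2-column rows per input row, here is a plain-Python model of the layout. It is not Spark's implementation, and stack_rows is a made-up name. It assumes the documented behaviour that stack(n, ...) splits its arguments into n rows and pads a short last row with nulls.

```python
from math import ceil


def stack_rows(n, *values):
    """Model how stack(n, v1, ..., vk) lays out the values of one input row."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Each output row gets ceil(k / n) columns.
    width = ceil(len(values) / n)
    # Pad with None (null) so the last row is complete.
    padded = list(values) + [None] * (n * width - len(values))
    return [tuple(padded[i * width:(i + 1) * width]) for i in range(n)]


row = ("This is line 1", "xxxx12", "This is line 5", "hhhh29")
print(stack_rows(2, *row))
# [('This is line 1', 'xxxx12'), ('This is line 5', 'hhhh29')]
```

Because both output rows come from the same input row, they appear together, which is why lines 1 and 5 are adjacent in the result above.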
