PySpark - explode the arrays in all columns and zip the results into rows

9fkzdhlc · posted 2023-01-13 in Apache
Follow (0) | Answers (3) | Views (139)

Is there a way in PySpark to explode the arrays/lists in all columns at once, and zip/merge the exploded values together into rows?
The number of columns can be dynamic, depending on other factors.
From this DataFrame:

|col1   |col2   |col3   |
|[a,b,c]|[d,e,f]|[g,h,i]|
|[j,k,l]|[m,n,o]|[p,q,r]|

To this DataFrame:

|col1|col2|col3|
|a   |d   |g   |
|b   |e   |h   |
|c   |f   |i   |
|j   |m   |p   |
|k   |n   |q   |
|l   |o   |r   |

ne5o7dgx1#

Here is one way to do this using `rdd.flatMap()`:

cols = df.columns
df.rdd.flatMap(lambda x: zip(*[x[c] for c in cols])).toDF(cols).show(truncate=False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|a   |d   |g   |
#|b   |e   |h   |
#|c   |f   |i   |
#|j   |m   |p   |
#|k   |n   |q   |
#|l   |o   |r   |
#+----+----+----+
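One caveat worth noting (an editorial aside, not part of the original answer): Python's built-in `zip()` stops at the shortest sequence, so if the arrays within a row have unequal lengths, the extra elements are silently dropped. If padding is preferred, `itertools.zip_longest` can be substituted inside the `flatMap` lambda. A quick pure-Python illustration of the difference:

```python
from itertools import zip_longest

# One row whose arrays have unequal lengths
col1, col2 = ['a', 'b', 'c'], ['d', 'e']

# zip() truncates to the shortest array: the 'c' element is dropped
print(list(zip(col1, col2)))          # [('a', 'd'), ('b', 'e')]

# zip_longest() pads the shorter array with None instead
print(list(zip_longest(col1, col2)))  # [('a', 'd'), ('b', 'e'), ('c', None)]
```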

gpfsuwkq2#

Try this:

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

a = [(['a','b','c'],['d','e','f'],['g','h','i']),(['j','k','l'],['m','n','o'],['p','q','r'])]
a = spark.createDataFrame(a, ['a','b','c'])

cols = ['col1','col2','col3']
# one UDF per output column, each extracting one field of the exploded struct
# (note: this hardcodes 3 columns / array length 3, matching the sample data)
splits = [F.udf(lambda val, i=i: val[i], StringType()) for i in range(3)]

def exploding(colnames):
    # build one struct per array index i, collect the structs into an array,
    # then explode that array into one row per index
    return F.explode(F.array([F.struct([F.col(c).getItem(i).alias(c)
                                        for c in colnames]) for i in range(3)]))

a = a.withColumn("new_col", exploding(["a", "b", "c"]))\
     .select([s('new_col').alias(c) for s, c in zip(splits, cols)])
a.show()
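The core idea in the snippet above is a per-row transpose: for each index `i`, build a struct holding the `i`-th element of every column, collect those structs into an array, and explode it into rows. A minimal pure-Python sketch of the same reshaping (an illustration only, not Spark code):

```python
rows = [(['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']),
        (['j', 'k', 'l'], ['m', 'n', 'o'], ['p', 'q', 'r'])]

# For each row, build one "struct" (tuple) per index i from the i-th element
# of every column, then flatten ("explode") the structs into output rows.
out = [tuple(col[i] for col in row)
       for row in rows
       for i in range(len(row[0]))]

print(out[:3])  # [('a', 'd', 'g'), ('b', 'e', 'h'), ('c', 'f', 'i')]
```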

nbysray53#

from pyspark.sql.functions import explode, arrays_zip

data = [(['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']), (['j', 'k', 'l'], ['m', 'n', 'o'], ['p', 'q', 'r'])]
df1 = spark.createDataFrame(data, ['col1', 'col2', 'col3'])
df1.printSchema()

root
 |-- col1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- col3: array (nullable = true)
 |    |-- element: string (containsNull = true)

# zip the arrays element-wise into a single array-of-structs column
df2 = df1.select(arrays_zip('col1', 'col2', 'col3').alias('zipped'))
# explode into one row per struct; explode's default output column is named 'col'
df3 = df2.select(explode('zipped'))
# expand the struct fields back into top-level columns
df4 = df3.select('col.*')

df4.show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|a   |d   |g   |
|b   |e   |h   |
|c   |f   |i   |
|j   |m   |p   |
|k   |n   |q   |
|l   |o   |r   |
+----+----+----+
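For readers unfamiliar with `arrays_zip`, here is a pure-Python sketch of what the three steps above produce (an editorial illustration, not Spark code; structs are modeled as dicts). It also shows why `'col.*'` works: `explode` names its output column `col`, and `.*` expands that struct's fields:

```python
names = ['col1', 'col2', 'col3']
row = {'col1': ['a', 'b', 'c'], 'col2': ['d', 'e', 'f'], 'col3': ['g', 'h', 'i']}

# arrays_zip: element-wise zip of the arrays into an array of structs
zipped = [dict(zip(names, vals)) for vals in zip(*(row[n] for n in names))]
print(zipped[0])  # {'col1': 'a', 'col2': 'd', 'col3': 'g'}

# explode: one output row per struct; selecting 'col.*' then turns each
# struct's fields back into the top-level columns col1, col2, col3
for struct in zipped:
    print(tuple(struct[n] for n in names))
```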
