我有以下pyspark dataframe:
import pyspark
arrayData = [
('1','temperature','21', 'Celsius'),
('2','humidity|temperature','88|21.8', 'Percent|Celsius'),
('3','temperature','21.2', 'Celsius'),
('4','temperature','19.9', 'Celsius'),
('5','humidity','85.5', 'Percent')]
df = spark.createDataFrame(data=arrayData, schema = ['id','types','values','labels'])
df.show()
+----+--------------------+---------+---------------+
| id| types| values| labels|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2|humidity|temperature| 88|21.8|Percent|Celsius|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+
我想分开类型、值和标签,以便每个值都在其对应的行中。就像这样:
+----+--------------------+---------+---------------+
| id| type| value| label|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2| temperature| 21.8| Celsius|
| 2| humidity| 88| Percent|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+
我尝试使用split和exploit函数,但它为每个值创建了一个新行:
from pyspark.sql.functions import explode,col,split
df_2 = df.withColumn("id",col("id"))\
.withColumn("type",explode(split("types", "\|")))\
.withColumn("value",explode(split("values", "\|")))\
.withColumn("label",explode(split("labels", "\|")))\
df_2.show()
+----+--------------------+---------+---------------+
| id| type| value| label|
+----+--------------------+---------+---------------+
| 1| temperature| 21| Celsius|
| 2| temperature| 21.8| Celsius|
| 2| temperature| 21.8| Percent|
| 2| temperature| 88| Celsius|
| 2| temperature| 88| Percent|
| 2| humidity| 21.8| Celsius|
| 2| humidity| 21.8| Percent|
| 2| humidity| 88| Celsius|
| 2| humidity| 88| Percent|
| 3| temperature| 21.2| Celsius|
| 4| temperature| 19.9| Celsius|
| 5| humidity| 85.5| Percent|
+----+--------------------+---------+---------------+
2条答案
按热度按时间evrscar21#
围绕
|
拆分列,然后压缩拆分后的列,以便压缩后的数组中的每个项都是(types, values, labels)
的结构。然后分解arr
列,并为(types, values, labels)
中的每一列从结构中提取相应的值。i5desfxk2#