PySpark: split pipe-delimited values in multiple columns into rows

huwehgph posted on 2023-10-15 in Spark

I have the following PySpark DataFrame:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

arrayData = [
        ('1','temperature','21', 'Celsius'),
        ('2','humidity|temperature','88|21.8', 'Percent|Celsius'),
        ('3','temperature','21.2', 'Celsius'),
        ('4','temperature','19.9', 'Celsius'),
        ('5','humidity','85.5', 'Percent')]
df = spark.createDataFrame(data=arrayData, schema=['id', 'types', 'values', 'labels'])
df.show()

+----+--------------------+---------+---------------+
|  id|               types|   values|         labels|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|humidity|temperature|  88|21.8|Percent|Celsius|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+

I want to split types, values and labels so that each value ends up on its own row together with its matching type and label, like this:

+----+--------------------+---------+---------------+
|  id|                type|    value|          label|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|         temperature|     21.8|        Celsius|
|   2|            humidity|       88|        Percent|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+

I tried using the split and explode functions, but chaining several explode calls takes the Cartesian product of the split arrays, so I get a row for every combination of values:

from pyspark.sql.functions import explode, col, split

df_2 = df.withColumn("id", col("id")) \
         .withColumn("type", explode(split("types", r"\|"))) \
         .withColumn("value", explode(split("values", r"\|"))) \
         .withColumn("label", explode(split("labels", r"\|")))
df_2.show()

+----+--------------------+---------+---------------+
|  id|                type|    value|          label|
+----+--------------------+---------+---------------+
|   1|         temperature|       21|        Celsius|
|   2|         temperature|     21.8|        Celsius|
|   2|         temperature|     21.8|        Percent|
|   2|         temperature|       88|        Celsius|
|   2|         temperature|       88|        Percent|
|   2|            humidity|     21.8|        Celsius|
|   2|            humidity|     21.8|        Percent|
|   2|            humidity|       88|        Celsius|
|   2|            humidity|       88|        Percent|
|   3|         temperature|     21.2|        Celsius|
|   4|         temperature|     19.9|        Celsius|
|   5|            humidity|     85.5|        Percent|
+----+--------------------+---------+---------------+
evrscar2 1#

Split each column on |, then zip the split columns so that each item of the zipped array is a struct of (types, values, labels). Then explode the arr column and, for each of (types, values, labels), extract the corresponding field from the struct.

from pyspark.sql import functions as F

cols = ['types', 'values', 'labels']
# Split each column on "|" and zip the results into an array of (types, values, labels) structs
arr = F.arrays_zip(*[F.split(c, r'\|').alias(c) for c in cols])

df1 = df.withColumn('arr', F.explode(arr))
df1 = df1.select('id', *[F.col('arr')[c].alias(c) for c in cols])
df1.show()

+---+-----------+------+-------+
| id|      types|values| labels|
+---+-----------+------+-------+
|  1|temperature|    21|Celsius|
|  2|   humidity|    88|Percent|
|  2|temperature|  21.8|Celsius|
|  3|temperature|  21.2|Celsius|
|  4|temperature|  19.9|Celsius|
|  5|   humidity|  85.5|Percent|
+---+-----------+------+-------+
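
If you also want the singular column names shown in the expected output (type, value, label), a rename on top of the result above is enough; a minimal sketch:

df1 = df1.withColumnRenamed('types', 'type') \
         .withColumnRenamed('values', 'value') \
         .withColumnRenamed('labels', 'label')
df1.show()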
i5desfxk 2#

spark.sql("""
    SELECT
        id,
        inline(
            cast(
               arrays_zip(
                 split(types, '[|]'), 
                 split(values, '[|]'), 
                 split(labels, '[|]')
               ) 
               AS array<struct<types: string, values: string, labels: string>>
            )
        )
    FROM input
""")
.show(false)

// Exiting paste mode, now interpreting.

+---+-----------+------+-------+
|id |types      |values|labels |
+---+-----------+------+-------+
|1  |temperature|21    |Celsius|
|2  |humidity   |88    |Percent|
|2  |temperature|21.8  |Celsius|
|3  |temperature|21.2  |Celsius|
|4  |temperature|19.9  |Celsius|
|5  |humidity   |85.5  |Percent|
+---+-----------+------+-------+
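
The same inline-based approach can also be written against the DataFrame API, without registering a temp view, by wrapping the generator in F.expr. A minimal sketch (the singular struct field names and the df3 variable are my own additions, not part of the original answer):

from pyspark.sql import functions as F

# inline() expands an array of structs into one row per struct,
# with one output column per struct field.
df3 = df.select(
    'id',
    F.expr("""
        inline(
            cast(
                arrays_zip(split(types, '[|]'), split(values, '[|]'), split(labels, '[|]'))
                AS array<struct<type: string, value: string, label: string>>
            )
        )
    """)
)
df3.show(truncate=False)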
