pyspark：如何将字符串类型的列分解为sparkDataframe的行和列

rbpvctlc 于 2021-05-29 发布在 Spark

关注(0)|答案(2)|浏览(462)

我正在使用spark 2.3。我有一个sparkDataframe，格式如下

| person_id | person_attributes
____________________________________________________________________________
| id_1    "department=Sales__title=Sales_executive__level=junior"
| id_2    "department=Engineering__title=Software Engineer__level=entry-level"

等等。
person\u attributes列的类型为 string 如果没有 level 属性\u键

| person_id | attribute_key| attribute_value
____________________________________________________________________________
| id_1        department       Sales
| id_1        title            Sales_executive
| id_2        department       Engineering
| id_2        title            Software Engineer

这是一个大的分布式Dataframe，因此，转换为Pandas或缓存不是一个选项

apache-spark pyspark apache-spark-sql pyspark-dataframes Explode

来源：https://stackoverflow.com/questions/62539548/pyspark-how-to-explode-a-column-of-string-type-into-rows-and-columns-of-a-spar

2条答案

按热度按时间

xwbd5t1u1#

试试这个，

import org.apache.spark.sql.functions._

 df
  .withColumn("attributes_splitted",  split(col("person_attributes"), "__")) // Split by delimiter `__`
  .withColumn("exploded", explode(col("attributes_splitted"))) // explode the splitted column
  .withColumn("temp", split(col("exploded"), "=")) // again split based on delimiter `=`
  .withColumn("attribute_key", col("temp").getItem(0))
  .withColumn("attribute_value", col("temp").getItem(1))
  .drop("attributes_splitted", "exploded", "temp", "person_attributes")
  .show(false)

赞(0）回复(0）举报 2021-05-29

lrl1mhuk2#

试试这个 Spark2.3 :

from pyspark.sql import functions as F

df.withColumn("arr", F.split("person_attributes",'\=|__'))\
  .withColumn("map", F.create_map(F.lit('department'),F.col("arr")[1]\
                                  ,F.lit('title'),F.col("arr")[3]))\
  .select("person_id", F.explode("map").alias("attribute_key","attribute_value"))\
  .show(truncate=False)

# +---------+-------------+-----------------+

# |person_id|attribute_key|attribute_value  |

# +---------+-------------+-----------------+

# |id_1     |department   |Sales            |

# |id_1     |title        |Sales_executive  |

# |id_2     |department   |Engineering      |

# |id_2     |title        |Software Engineer|

# +---------+-------------+-----------------+

试试这个 Spark2.4+ ```
from pyspark.sql import functions as F

df.withColumn("arr", F.split("person_attributes",'=|__'))
.withColumn("map", F.map_from_arrays(F.expr("""filter(arr,(x,i)->i%2=0)""")
,F.expr("""filter(arr,(x,i)->i%2!=0)""")))
.select("person_id", F.explode("map").alias("attribute_key","attribute_value")).filter("""attribute_key!='level'""")
.show(truncate=False)

+---------+-------------+-----------------+

|person_id|attribute_key|attribute_value |

+---------+-------------+-----------------+

|id_1 |department |Sales |

|id_1 |title |Sales_executive |

|id_2 |department |Engineering |

|id_2 |title |Software Engineer|

+---------+-------------+-----------------+

赞(0）回复(0）举报 2021-05-29

我来回答

pyspark：如何将字符串类型的列分解为sparkDataframe的行和列

2条答案

+---------+-------------+-----------------+

|person_id|attribute_key|attribute_value |

+---------+-------------+-----------------+

|id_1 |department |Sales |

|id_1 |title |Sales_executive |

|id_2 |department |Engineering |

|id_2 |title |Software Engineer|

+---------+-------------+-----------------+

相关问题

热门标签

最新问答