pyspark: How to explode a string-type column into rows and columns of a Spark DataFrame

rbpvctlc  posted 2021-05-29 in Spark

I'm using Spark 2.3. I have a Spark DataFrame in the following format:

| person_id | person_attributes                                                     |
____________________________________________________________________________
| id_1      | "department=Sales__title=Sales_executive__level=junior"              |
| id_2      | "department=Engineering__title=Software Engineer__level=entry-level" |

and so on.
The person_attributes column is of type string. How can I explode it into the format below, without the level attribute_key?

| person_id | attribute_key | attribute_value   |
____________________________________________________________________________
| id_1      | department    | Sales             |
| id_1      | title         | Sales_executive   |
| id_2      | department    | Engineering       |
| id_2      | title         | Software Engineer |

This is a large, distributed DataFrame, so converting it to pandas or caching it is not an option.
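
(For anyone who wants to test the answers below, a minimal DataFrame matching this layout can be built as follows; this snippet is only a reproduction aid, not part of the original question.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with the same two columns as the question's example
df = spark.createDataFrame(
    [("id_1", "department=Sales__title=Sales_executive__level=junior"),
     ("id_2", "department=Engineering__title=Software Engineer__level=entry-level")],
    ["person_id", "person_attributes"],
)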


xwbd5t1u1#

Try this (note: this answer is written in Scala, not PySpark):

import org.apache.spark.sql.functions._

df
  .withColumn("attributes_splitted", split(col("person_attributes"), "__")) // split on the `__` delimiter
  .withColumn("exploded", explode(col("attributes_splitted")))              // one row per key=value pair
  .withColumn("temp", split(col("exploded"), "="))                          // split each pair on `=`
  .withColumn("attribute_key", col("temp").getItem(0))
  .withColumn("attribute_value", col("temp").getItem(1))
  .filter(col("attribute_key") =!= "level")                                 // exclude the `level` key, as the question requires
  .drop("attributes_splitted", "exploded", "temp", "person_attributes")
  .show(false)
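
Since the question asks for PySpark, a rough Python translation of the Scala above might look like this (my sketch, not part of the original answer):

from pyspark.sql import functions as F

(df
  .withColumn("pair", F.explode(F.split("person_attributes", "__")))  # one row per key=value pair
  .withColumn("kv", F.split("pair", "="))                             # split each pair on `=`
  .select("person_id",
          F.col("kv").getItem(0).alias("attribute_key"),
          F.col("kv").getItem(1).alias("attribute_value"))
  .filter(F.col("attribute_key") != "level")                          # drop the unwanted `level` key
  .show(truncate=False))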

lrl1mhuk2#

Try this for Spark 2.3:

from pyspark.sql import functions as F

df.withColumn("arr", F.split("person_attributes",'\=|__'))\
  .withColumn("map", F.create_map(F.lit('department'),F.col("arr")[1]\
                                  ,F.lit('title'),F.col("arr")[3]))\
  .select("person_id", F.explode("map").alias("attribute_key","attribute_value"))\
  .show(truncate=False)

# +---------+-------------+-----------------+
# |person_id|attribute_key|attribute_value  |
# +---------+-------------+-----------------+
# |id_1     |department   |Sales            |
# |id_1     |title        |Sales_executive  |
# |id_2     |department   |Engineering      |
# |id_2     |title        |Software Engineer|
# +---------+-------------+-----------------+
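
A Spark 2.3-compatible variant that avoids hardcoding the array positions (and so survives attributes arriving in a different order) is the built-in `str_to_map` SQL function. This is my sketch, not part of the original answer, and it assumes every pair contains exactly one `=`:

from pyspark.sql import functions as F

# str_to_map splits the string into a map using `__` as the pair delimiter
# and `=` as the key/value delimiter, so attribute order no longer matters.
df.select("person_id",
          F.explode(F.expr("str_to_map(person_attributes, '__', '=')"))
           .alias("attribute_key", "attribute_value")) \
  .filter("attribute_key != 'level'") \
  .show(truncate=False)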

Try this for newer versions (caveat: the two-argument `(x, i)` form of `filter` used below was, as far as I can tell, only added in Spark 3.0; a 2.4-compatible variant follows the output):
from pyspark.sql import functions as F

# Split into [key1, val1, key2, val2, ...]; keys sit at even indices and
# values at odd indices, so two index-aware filters rebuild the map.
df.withColumn("arr", F.split("person_attributes", "=|__")) \
  .withColumn("map", F.map_from_arrays(F.expr("filter(arr, (x, i) -> i % 2 = 0)"),
                                       F.expr("filter(arr, (x, i) -> i % 2 != 0)"))) \
  .select("person_id", F.explode("map").alias("attribute_key", "attribute_value")) \
  .filter("attribute_key != 'level'") \
  .show(truncate=False)

# +---------+-------------+-----------------+
# |person_id|attribute_key|attribute_value  |
# +---------+-------------+-----------------+
# |id_1     |department   |Sales            |
# |id_1     |title        |Sales_executive  |
# |id_2     |department   |Engineering      |
# |id_2     |title        |Software Engineer|
# +---------+-------------+-----------------+
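
If you are on Spark 2.4 itself, where `filter` does not yet accept an index argument, the same even/odd split can be written with `sequence` and `transform`, both available since 2.4. This is a sketch, not part of the original answer:

from pyspark.sql import functions as F

# sequence(0, n-1, 2) enumerates the even indices (keys) and
# sequence(1, n-1, 2) the odd indices (values); transform then
# looks each index up in the split array.
df.withColumn("arr", F.split("person_attributes", "=|__")) \
  .withColumn("map", F.map_from_arrays(
      F.expr("transform(sequence(0, size(arr) - 1, 2), i -> arr[i])"),
      F.expr("transform(sequence(1, size(arr) - 1, 2), i -> arr[i])"))) \
  .select("person_id", F.explode("map").alias("attribute_key", "attribute_value")) \
  .filter("attribute_key != 'level'") \
  .show(truncate=False)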
