I'm using Spark 2.3. I have a Spark DataFrame in the following format:
| person_id | person_attributes                                                     |
|-----------|-----------------------------------------------------------------------|
| id_1      | "department=Sales__title=Sales_executive__level=junior"              |
| id_2      | "department=Engineering__title=Software Engineer__level=entry-level" |
and so on.
The `person_attributes` column is of type `string`.
How can I convert this into a DataFrame of the following format, without the `level` attribute key?
| person_id | attribute_key | attribute_value   |
|-----------|---------------|-------------------|
| id_1      | department    | Sales             |
| id_1      | title         | Sales_executive   |
| id_2      | department    | Engineering       |
| id_2      | title         | Software Engineer |
This is a large, distributed DataFrame, so converting it to Pandas or caching it is not an option.
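For reference, a minimal reproduction of the sample data (a hypothetical setup assuming a local `SparkSession`; `df` is the DataFrame the answers below operate on):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the format described above
df = spark.createDataFrame(
    [("id_1", "department=Sales__title=Sales_executive__level=junior"),
     ("id_2", "department=Engineering__title=Software Engineer__level=entry-level")],
    ["person_id", "person_attributes"])
```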
2 Answers
xwbd5t1u1#
Try this:
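The code block for this answer appears to be missing from the page. As a stand-in, here is a minimal sketch of one way to solve this on Spark 2.3 using the built-in `str_to_map` SQL function; this is my illustration, not necessarily the approach the original answer took:

```
from pyspark.sql import functions as F

(df
 # Parse "k1=v1__k2=v2__..." into a map<string,string> in one step:
 # '__' separates pairs, '=' separates a key from its value
 .withColumn("attrs", F.expr("str_to_map(person_attributes, '__', '=')"))
 # Produce one row per map entry
 .select("person_id", F.explode("attrs").alias("attribute_key", "attribute_value"))
 # Drop the unwanted 'level' attribute
 .filter("attribute_key != 'level'")
 .show(truncate=False))
```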
lrl1mhuk2#
Try this for Spark 2.3:
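The Spark 2.3 snippet also seems to have been lost in extraction. A minimal sketch of a 2.3-compatible approach (an assumption on my part: split the string into `key=value` pairs, explode, then split each pair on `=`):

```
from pyspark.sql import functions as F

(df
 # Split the attribute string into ["department=Sales", "title=...", "level=..."]
 # and produce one row per "key=value" pair
 .select("person_id",
         F.explode(F.split("person_attributes", "__")).alias("kv"))
 # Break each pair into its key and value
 .select("person_id",
         F.split("kv", "=").getItem(0).alias("attribute_key"),
         F.split("kv", "=").getItem(1).alias("attribute_value"))
 # Drop the unwanted 'level' attribute
 .filter("attribute_key != 'level'")
 .show(truncate=False))
```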
And try this for Spark 2.4+:
```
from pyspark.sql import functions as F

(df
 # Split "k=v__k=v__..." on both '=' and '__' into one flat array:
 # ["department", "Sales", "title", "Sales_executive", "level", "junior"]
 .withColumn("arr", F.split("person_attributes", "=|__"))
 # Even-indexed elements are keys, odd-indexed elements are values;
 # zip them into a map<string,string> (map_from_arrays is Spark 2.4+)
 .withColumn("map", F.map_from_arrays(
     F.expr("filter(arr, (x, i) -> i % 2 = 0)"),
     F.expr("filter(arr, (x, i) -> i % 2 != 0)")))
 # One row per map entry, then drop the 'level' attribute
 .select("person_id", F.explode("map").alias("attribute_key", "attribute_value"))
 .filter("attribute_key != 'level'")
 .show(truncate=False))
```
```
+---------+-------------+-----------------+
|person_id|attribute_key|attribute_value  |
+---------+-------------+-----------------+
|id_1     |department   |Sales            |
|id_1     |title        |Sales_executive  |
|id_2     |department   |Engineering      |
|id_2     |title        |Software Engineer|
+---------+-------------+-----------------+
```
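Every step above (`split`, `map_from_arrays`, `explode`, `filter`) is a per-row transformation, so the work stays distributed across executors; nothing is collected to the driver or converted to Pandas, which satisfies the constraint in the question.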