How to explode structs with PySpark explode()

Posted by hgc7kmma on 2021-07-13 in Spark

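How can I convert the JSON below into the relational rows that follow? The part I am stuck on is that the explode() function throws an exception because of a type mismatch. I have not found a way to coerce the data into a suitable format so that one row can be created from each object under the source key of the sample_json object.
JSON input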

sample_json = """
{
"dc_id": "dc-101",
"source": {
    "sensor-igauge": {
      "id": 10,
      "ip": "68.28.91.22",
      "description": "Sensor attached to the container ceilings",
      "temp":35,
      "c02_level": 1475,
      "geo": {"lat":38.00, "long":97.00}                        
    },
    "sensor-ipad": {
      "id": 13,
      "ip": "67.185.72.1",
      "description": "Sensor ipad attached to carbon cylinders",
      "temp": 34,
      "c02_level": 1370,
      "geo": {"lat":47.41, "long":-122.00}
    },
    "sensor-inest": {
      "id": 8,
      "ip": "208.109.163.218",
      "description": "Sensor attached to the factory ceilings",
      "temp": 40,
      "c02_level": 1346,
      "geo": {"lat":33.61, "long":-111.89}
    },
    "sensor-istick": {
      "id": 5,
      "ip": "204.116.105.67",
      "description": "Sensor embedded in exhaust pipes in the ceilings",
      "temp": 40,
      "c02_level": 1574,
      "geo": {"lat":35.93, "long":-85.46}
    }
  }
}"""

Expected output

dc_id    source_name    id    description
-------------------------------------------------------------------------------
dc-101   sensor-igauge  10    Sensor attached to the container ceilings
dc-101   sensor-ipad    13    Sensor ipad attached to carbon cylinders
dc-101   sensor-inest    8    Sensor attached to the factory ceilings
dc-101   sensor-istick   5    Sensor embedded in exhaust pipes in the ceilings

PySpark code

from pyspark.sql.functions import *
df_sample_data = spark.read.json(sc.parallelize([sample_json]))
df_expanded = df_sample_data.withColumn("one_source",explode_outer(col("source")))
display(df_expanded)

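Error
AnalysisException: cannot resolve 'explode(`source`)' due to data type mismatch: input to function explode should be array or map type, not struct....
I put together this Databricks notebook to demonstrate the problem further and show the error clearly. I can use the notebook to test any suggestions offered here.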

JSON apache-spark pyspark apache-spark-sql pyspark-dataframes

Source: https://stackoverflow.com/questions/66130815/how-to-explode-structs-with-pyspark-explode

1 answer

vwhgwdsa1#

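You can't use explode on a struct, but you can get the column names inside the struct source (with df.select("source.*").columns). Using a list comprehension, you build an array holding the fields you want from each nested struct, then explode that array to get the desired result: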

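from pyspark.sql import functions as F

# df is the DataFrame read from sample_json (df_sample_data in the question).
df1 = df.select(
    "dc_id",
    F.explode(
        # Build one struct per sensor, collect them into an array, then explode the array.
        F.array(*[
            F.struct(
                F.lit(s).alias("source_name"),  # sensor name taken from the schema
                F.col(f"source.{s}.id").alias("id"),
                F.col(f"source.{s}.description").alias("description")
            )
            # The field names of the source struct are the sensor names.
            for s in df.select("source.*").columns
        ])
    ).alias("sources")
).select("dc_id", "sources.*")  # flatten the exploded struct into top-level columns
df1.show(truncate=False)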
# +------+-------------+---+------------------------------------------------+
# |dc_id |source_name  |id |description                                     |
# +------+-------------+---+------------------------------------------------+
# |dc-101|sensor-igauge|10 |Sensor attached to the container ceilings       |
# |dc-101|sensor-inest |8  |Sensor attached to the factory ceilings         |
# |dc-101|sensor-ipad  |13 |Sensor ipad attached to carbon cylinders        |
# |dc-101|sensor-istick|5  |Sensor embedded in exhaust pipes in the ceilings|
# +------+-------------+---+------------------------------------------------+

Posted 2021-07-13