spark json嵌套数组到dataframe

zf2sa74q  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(429)

我需要使用以下模式处理json文件:

root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)

当我将它加载到一个Dataframe中时,只有一行,但该行在有效负载数组中包含大量的数据和事件元素(一个元素既有数据也有事件,但决不能两者都有)
我想得到所有的事件,这样我就可以对它们执行一些进一步的操作,或者以后在db表中加载它们等等。。。为了做到这一点,我将需要有效负载的所有元素有事件,我不需要一个只有“数据”。最好是在最后有一个Dataframe,它只包含事件的成员的行。。。
不幸的是,当我尝试这样的事情时: df.select("Payload.Event") 或者 df.select(Payload).filter(...) 然后它仍然在根目录上进行过滤,但因为Dataframe中只有一行没有太大帮助。如何过滤内部数组,并将其元素作为单独的Dataframe获取?
示例json:

{
    "Header": {
        "Version": {
            "vfield": "0.6"
        },
        "Format": "DEFAULT"
    },
    "Payload": [
        {"Data": [
            [0, 1, 2],
            [5, 6]
        ]},

        {"Event": {
            "eventName" : "event1",
            "eventCount": 123
        }},
        {"Event": {
            "eventName" : "event2",
            "eventCount": 124
        }},
        { "Data": [
            [5,8],
            [1,2,6]
        ] }
    ]        
}
vuktfyat

vuktfyat1#

因为 Payload 属于类型 array ,如果您在没有 explode 会给你类型的结果 array 改变 df.select("Payload.Event")df.withColumn("Payload",explode("Payload")).select("Payload.Event") 检查以下代码。

scala> df.printSchema
root
 |-- Header: struct (nullable = true)
 |    |-- Format: string (nullable = true)
 |    |-- Version: struct (nullable = true)
 |    |    |-- vfield: string (nullable = true)
 |-- Payload: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Data: array (nullable = true)
 |    |    |    |-- element: array (containsNull = true)
 |    |    |    |    |-- element: long (containsNull = true)
 |    |    |-- Event: struct (nullable = true)
 |    |    |    |-- eventCount: long (nullable = true)
 |    |    |    |-- eventName: string (nullable = true)

scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event").printSchema
root
 |-- Event: struct (nullable = true)
 |    |-- eventCount: long (nullable = true)
 |    |-- eventName: string (nullable = true)

scala> df.withColumn("Payload",explode($"Payload")).select("Payload.Event.*").printSchema
root
 |-- eventCount: long (nullable = true)
 |-- eventName: string (nullable = true)

相关问题