Spark processing of polymorphic JSON

Posted by w1e3prcc on 2021-07-14 in Spark

Consider the following JSON input:

{
  "common": { "type":"A", "date":"2020-01-01T12:00:00" },
  "data": {
    "name":"Dave",
    "pets": [ "dog", "cat" ]
  }
}
{
  "common": { "type": "B", "date":"2020-01-01T12:00:00" },
  "data": {
    "whatever": { "X": {"foo":3}, "Y":"bar" },
    "favoriteInts": [ 0, 1, 7]
  }
}

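I am familiar with json-schema, where I can describe that the data substructure is either name,pets or whatever,favoriteInts. We use the common.type field to identify which type a record is.
Is this possible in a Spark schema definition? My initial experiments were along these lines: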

from pyspark.sql.types import StructField, StructType

schema = StructType([
    StructField("common", StructType(common_schema)),  # .. because the type is consistent
    StructField("data", StructType()),  # attempting to declare a "generic" struct
])
df = spark.read.option("multiline", "true").json(source, schema)

This does not work. On reading a data struct that contains, well, anything (in this particular example, 2 fields), we get:

+--------------------+----+                                                     
|              common|data|
+--------------------+----+
|{2020-01-01T12:00...|  {}|
+--------------------+----+

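Attempting to extract any of the named fields then fails with No such struct field <whatever>. Leaving the "generic struct" out of the schema definition altogether yields a DataFrame with no field called data at all, never mind the fields inside it.
Beyond this, I ultimately want to do something like this: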

df = spark.read.json(source)

def processA(frame):
    # frame.data.name would resolve to the Column.name method (an alias of alias()),
    # so the field is selected by its dotted path instead
    frame.select("data.name")  # we KNOW name exists for type A
    ...

def processB(frame):
    frame.select("data.favoriteInts")  # we KNOW favoriteInts exists for type B
    ...

processA(df.filter(df.common.type == "A"))
processB(df.filter(df.common.type == "B"))

JSON apache-spark pyspark schema

Source: https://stackoverflow.com/questions/67233557/spark-processing-of-polymorphic-json

1 Answer

tkclm6bt1#

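You can accommodate the uncertainty with nested, nullable types (nullability is set by passing True). Declare the fields of both variants in one schema; fields that a record does not contain simply come back as null.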

from pyspark.sql.types import StructType, StringType, ArrayType, StructField, IntegerType

data_schema = StructType([
    # Type A related attributes
    StructField("name", StringType(), True),  # True implies nullable
    StructField("pets", ArrayType(StringType()), True),
    # Type B related attributes
    StructField("whatever", StructType([
        StructField("X", StructType([
            StructField("foo", IntegerType(), True)
        ]), True),
        StructField("Y", StringType(), True)
    ]), True),  # True implies nullable
    StructField("favoriteInts", ArrayType(IntegerType()), True),
])

schema = StructType([
    StructField("common", StructType(common_schema)),  # .. because the type is consistent
    StructField("data", data_schema)
])

Posted 2021-07-14
