Consider the following JSON input:
{
  "common": { "type": "A", "date": "2020-01-01T12:00:00" },
  "data": {
    "name": "Dave",
    "pets": [ "dog", "cat" ]
  }
}
{
  "common": { "type": "B", "date": "2020-01-01T12:00:00" },
  "data": {
    "whatever": { "X": {"foo": 3}, "Y": "bar" },
    "favoriteInts": [ 0, 1, 7 ]
  }
}
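I'm familiar with json-schema, and there I could describe the data substructure as being either name, pets or whatever, favoriteInts. We use the common.type field to identify which type a record is.
Is this possible in a Spark schema definition? My initial attempt looks like this: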
from pyspark.sql.types import StructField, StructType

schema = StructType([
    StructField("common", StructType(common_schema)),  # .. because the type is consistent
    StructField("data", StructType())  # attempting to declare a "generic" struct
])
df = spark.read.option("multiline", "true").json(source, schema)
This does not work. When reading a data struct that contains, well, anything (in this particular example, 2 fields), we get:
+--------------------+----+
| common|data|
+--------------------+----+
|{2020-01-01T12:00...| {}|
+--------------------+----+
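and trying to extract any of the specified fields fails with: No such struct field <whatever>. Leaving the "generic struct" out of the schema definition entirely yields a DataFrame with no field named data at all, never mind the fields inside it.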
Beyond that, what I ultimately want to do is something like this:
df = spark.read.json(source)

def processA(frame):
    # we KNOW name exists for type A; bracket access is used because
    # Column.name is an alias of Column.alias, not a field lookup
    frame.select( frame.data["name"] )
    ...

def processB(frame):
    frame.select( frame.data.favoriteInts ) # we KNOW favoriteInts exists for type B
    ...

processA( df.filter(df.common.type == "A") )
processB( df.filter(df.common.type == "B") )
1 Answer
You can use nested, nullable types (by passing True as the nullable argument of each StructField) to accommodate the uncertainty.