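I need some help with my first attempt at parsing JSON from Kafka in Spark Structured Streaming.
I am struggling to convert the incoming JSON into a flat DataFrame for further processing.
My input is: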
[
  { "siteId": "30:47:47:BE:16:8F", "siteData":
    [
      { "dataseries": "trend-255", "values":
        [
          {"ts": 1502715600, "value": 35.74 },
          {"ts": 1502715660, "value": 35.65 },
          {"ts": 1502715720, "value": 35.58 },
          {"ts": 1502715780, "value": 35.55 }
        ]
      },
      { "dataseries": "trend-256", "values":
        [
          {"ts": 1502715840, "value": 18.45 },
          {"ts": 1502715900, "value": 18.35 },
          {"ts": 1502715960, "value": 18.32 }
        ]
      }
    ]
  },
  { "siteId": "30:47:47:BE:16:FF", "siteData":
    [
      { "dataseries": "trend-255", "values":
        [
          {"ts": 1502715600, "value": 35.74 },
          {"ts": 1502715660, "value": 35.65 },
          {"ts": 1502715720, "value": 35.58 },
          {"ts": 1502715780, "value": 35.55 }
        ]
      },
      { "dataseries": "trend-256", "values":
        [
          {"ts": 1502715840, "value": 18.45 },
          {"ts": 1502715900, "value": 18.35 },
          {"ts": 1502715960, "value": 18.32 }
        ]
      }
    ]
  }
]
The Spark schema is:
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

data1_spark_schema = ArrayType(
    StructType([
        StructField("siteId", StringType(), False),
        StructField("siteData", ArrayType(StructType([
            StructField("dataseries", StringType(), False),
            StructField("values", ArrayType(StructType([
                StructField("ts", IntegerType(), False),
                StructField("value", StringType(), False)
            ]), False), False)
        ]), False), False)
    ]), False
)
My very simple code is:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, from_json

from config.general import kafka_instance
from config.general import topic
from schemas.schema import data1_spark_schema

spark = SparkSession \
    .builder \
    .appName("Structured_BMS_Feed") \
    .getOrCreate()

# Read the raw Kafka records as a streaming DataFrame.
stream = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_instance) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("max.poll.records", 100) \
    .option("failOnDataLoss", False) \
    .load()

# Cast the binary payload to a string and parse it with the schema.
stream_records = stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as bms_data1") \
    .select(from_json("bms_data1", data1_spark_schema).alias("bms_data1"))

# One row per site; siteData is still a nested array at this point.
sites = stream_records.select(explode("bms_data1").alias("site")) \
    .select("site.*")

sites.printSchema()

# Print each micro-batch to the console for debugging.
stream_debug = sites.writeStream \
    .outputMode("append") \
    .format("console") \
    .option("numRows", 20) \
    .option("truncate", False) \
    .start()

stream_debug.awaitTermination()
When I run this code, the schema is printed like this:
root
 |-- siteId: string (nullable = false)
 |-- siteData: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- dataseries: string (nullable = false)
 |    |    |-- values: array (nullable = false)
 |    |    |    |-- element: struct (containsNull = false)
 |    |    |    |    |-- ts: integer (nullable = false)
 |    |    |    |    |-- value: string (nullable = false)
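Is it possible to flatten this schema so that all fields end up in a flat DataFrame instead of nested JSON? For each ts and value, I want one row that also carries its parent dataseries and site id.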
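To make the target shape concrete, here is a minimal plain-Python sketch (no Spark involved) of the rows I am after. The helper name `flatten_payload` and the shortened `sample` payload are made up for illustration; only the field names come from the input above.

```python
import json


def flatten_payload(payload):
    """Return one (siteId, dataseries, ts, value) tuple per measurement."""
    rows = []
    for site in json.loads(payload):
        for series in site["siteData"]:
            for point in series["values"]:
                rows.append(
                    (site["siteId"], series["dataseries"], point["ts"], point["value"])
                )
    return rows


# Shortened version of the input shown earlier (hypothetical subset).
sample = """
[
  {"siteId": "30:47:47:BE:16:8F", "siteData": [
    {"dataseries": "trend-255", "values": [
      {"ts": 1502715600, "value": 35.74},
      {"ts": 1502715660, "value": 35.65}
    ]},
    {"dataseries": "trend-256", "values": [
      {"ts": 1502715840, "value": 18.45}
    ]}
  ]}
]
"""

rows = flatten_payload(sample)
for row in rows:
    print(row)
# First printed row: ('30:47:47:BE:16:8F', 'trend-255', 1502715600, 35.74)
```

Each nested `values` entry becomes its own row, with the site id and data series name repeated on every row.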
1 Answer
Answering my own question: I managed to flatten it with the following lines: