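So how should we modify the code to read from multiple JSON columns instead of just one?
Currently only the "colJson" column is parsed into the DataFrame, but a list of columns needs to be read in the same way. The list of column names is stored in a List[String] variable.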
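import org.apache.spark.sql.functions.from_json
import spark.implicits._ // for toDF, as[String] and the $"..." syntax; assumes a SparkSession named spark

val data = Seq(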
(77, "email1", """{"key1":38,"key3":39}"""),
(78, "email2", """{"key1":38,"key4":39}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""")
).toDF("id", "name", "colJson")
data.show(false)
// +---+-------+---------------------------------------------------------------+
// |id |name   |colJson                                                        |
// +---+-------+---------------------------------------------------------------+
// |77 |email1 |{"key1":38,"key3":39}                                          |
// |78 |email2 |{"key1":38,"key4":39}                                          |
// |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|
// |178|email8 |{"sub1":"qwerty","sub2":"42"}                                  |
// +---+-------+---------------------------------------------------------------+
val schema = spark.read.json(data.select("colJson").as[String]).schema
val res = data.select($"id", $"name", from_json($"colJson", schema).as("s")).select("id", "name", "s.*")
res.show(false)
// +---+-------+-----------+-----+----+----+----+------+----+
// |id |name   |key1       |key10|key3|key4|key6|sub1  |sub2|
// +---+-------+-----------+-----+----+----+----+------+----+
// |77 |email1 |38         |null |39  |null|null|null  |null|
// |78 |email2 |38         |null |null|39  |null|null  |null|
// |178|email21|when string|false|null|36  |test|null  |null|
// |178|email8 |null       |null |null|null|null|qwerty|42  |
// +---+-------+-----------+-----+----+----+----+------+----+
val df1 = res.filter($"sub1" === "qwerty")
df1.show(false)
// +---+------+----+-----+----+----+----+------+----+
// |id |name  |key1|key10|key3|key4|key6|sub1  |sub2|
// +---+------+----+-----+----+----+----+------+----+
// |178|email8|null|null |null|null|null|qwerty|42  |
// +---+------+----+-----+----+----+----+------+----+
1 answer
ifsvaxew #1
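Check the code below.
Added another column containing JSON data.
Created an implicit fromjson function: you can pass multiple columns to it, and it will parse each one and extract its fields from the JSON.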
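As a rough illustration of the approach described above, here is a minimal sketch of what such an implicit helper might look like. It is not the answer's original code: the names `JsonOps`, `fromJson`, `run`, and the second JSON column `colJson2` (with its sample values) are assumptions made for this example. It infers a schema per column with `spark.read.json` and parses with `from_json`, the same two calls the question already uses.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}

object MultiJsonColumns {

  // Hypothetical helper (all names are assumptions): adds fromJson to any DataFrame.
  implicit class JsonOps(df: DataFrame) {
    // Parses each named string column as JSON and replaces it with a struct column.
    def fromJson(jsonCols: String*): DataFrame =
      jsonCols.foldLeft(df) { (acc, name) =>
        // Infer a schema from all JSON strings found in this column.
        val schema = df.sparkSession.read
          .json(df.select(name).as[String](Encoders.STRING))
          .schema
        acc.withColumn(name, from_json(col(name), schema))
      }
  }

  // Builds sample data with two JSON columns and flattens both of them.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // colJson2 and its values are made up for this sketch.
    val data = Seq(
      (77, "email1", """{"key1":38,"key3":39}""", """{"age":30,"city":"Oslo"}"""),
      (78, "email2", """{"key1":38,"key4":39}""", """{"age":41,"city":"Lima"}""")
    ).toDF("id", "name", "colJson", "colJson2")

    // The column names to parse, kept in a List[String] as in the question.
    val jsonCols: List[String] = List("colJson", "colJson2")

    val parsed = data.fromJson(jsonCols: _*)

    // Expand every struct into top-level columns: "colJson.*", "colJson2.*".
    val selected = Seq("id", "name") ++ jsonCols.map(c => s"$c.*")
    parsed.select(selected.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("multi-json").getOrCreate()
    run(spark).show(false)
    spark.stop()
  }
}
```

Note that if two JSON columns share a key name, the flattened result will contain duplicate column names. In that case it may be safer to keep the structs and select fields explicitly, or to rename the fields with a per-column prefix.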