I have a simple program with a dataset that has a column `resource_serialized`
whose value is a JSON string, like this:
```
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object TestApp {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setAppName("Loading Data").setMaster("local[*]")
    val spark = SparkSession
      .builder
      .config(sparkConf)
      .appName("Test")
      .getOrCreate()

    val json = "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    df.printSchema()
    df.show()
  }
}
```
The printed schema is:

```
root
 |-- id: string (nullable = true)
 |-- resource_serialized: string (nullable = true)
```
The dataset printed on the console is:

```
+--------------------+--------------------+
|                  id| resource_serialized|
+--------------------+--------------------+
|00529e54-0f3d-4c7...|{"createdOn":"200...|
+--------------------+--------------------+
```
The `resource_serialized` field holds a JSON string, i.e. (from the debug console)
Now I need to create a dataset/dataframe out of that JSON string. How can I achieve this?
My goal is to get a dataset like this:

```
+--------------------+--------------------+----------+
|                  id|           createdOn|genderCode|
+--------------------+--------------------+----------+
|00529e54-0f3d-4c7...|2000-07-20 00:00    |         0|
+--------------------+--------------------+----------+
```
2 Answers

kr98yfug1#
The solution below lets you map `resource_serialized` to a `(String, String)` map whose entries can be parsed later. The output looks like
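The answer's original code block was not captured on this page. Below is a runnable sketch of the approach it describes, assuming `from_json` with `MapType(StringType, StringType)` and the question's sample JSON; the object name `MapAnswerSketch` and helper `parse` are made up for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

object MapAnswerSketch {
  // Same sample JSON as in the question above.
  val json =
    "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

  def parse(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    // Parse resource_serialized into a Map[String, String] column,
    // then pull the individual keys out of the map.
    df.select(
        col("id"),
        from_json(col("resource_serialized"), MapType(StringType, StringType)).alias("m"))
      .select(
        col("id"),
        col("m")("createdOn").alias("createdOn"),
        col("m")("genderCode").alias("genderCode"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config(new SparkConf().setAppName("MapSketch").setMaster("local[*]"))
      .getOrCreate()
    parse(spark).show(false)
    spark.stop()
  }
}
```

The map variant is handy when the serialized JSON's keys are not known up front; if the schema is fixed, a `StructType` (as in the second answer) is the more direct choice.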
a9wyjsp72#

Use the `from_json` function to convert the JSON string into DataFrame columns. Example:
```
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val sch = new StructType().add("createdOn", StringType).add("genderCode", StringType)

df.select(col("id"), from_json(col("resource_serialized"), sch).alias("str")).
  select("id", "str.*").
  show(10, false)

//result
//+----------------------+---------------------+----------+
//|id                    |createdOn            |genderCode|
//+----------------------+---------------------+----------+
//|00529e54-0f3d-4c76-9d3|2000-07-20 00:00:00.0|0         |
//+----------------------+---------------------+----------+
```
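Note that this leaves `createdOn` as a plain string. If it should become a real timestamp, as the desired output in the question suggests, one possible follow-up is to cast the parsed column with `to_timestamp` (assumptions: its default parser accepts the `2000-07-20 00:00:00.0` format; the object name `TimestampSketch` and helper `flatten` are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object TimestampSketch {
  // Same sample JSON as in the question above.
  val json =
    "[{\"resource_serialized\":\"{\\\"createdOn\\\":\\\"2000-07-20 00:00:00.0\\\",\\\"genderCode\\\":\\\"0\\\"}\",\"id\":\"00529e54-0f3d-4c76-9d3\"}]"

  def flatten(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = spark.read.json(Seq(json).toDS)
    val sch = new StructType().add("createdOn", StringType).add("genderCode", StringType)
    // Flatten the parsed struct, then cast createdOn from string to timestamp.
    df.select(col("id"), from_json(col("resource_serialized"), sch).alias("str"))
      .select("id", "str.*")
      .withColumn("createdOn", to_timestamp(col("createdOn")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config(new SparkConf().setAppName("TimestampSketch").setMaster("local[*]"))
      .getOrCreate()
    flatten(spark).printSchema() // createdOn is now a timestamp column
    flatten(spark).show(false)
    spark.stop()
  }
}
```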