How do I create a schema from a JSON file for a subset of fields using Spark Scala?

Posted by x8goxv8g on 2021-05-17 in Spark


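I am trying to create a schema for a nested JSON file so that it can be loaded as a DataFrame.
However, I only need "id" and "text" from the JSON file (a subset), and I am not sure whether there is a way to create the schema without defining every field in the file.
I am currently using Scala in the spark-shell. As you can see from the file, I downloaded it from HDFS as part-00000.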


JSON apache-spark apache-spark-sql spark-streaming

Source: https://stackoverflow.com/questions/64925875/how-to-create-a-schema-from-json-file-using-spark-scala-for-subset-of-fields

1 Answer

pjngdqdw1#

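According to the JSON manual:
Use the .schema method. The read then returns only the columns specified in the schema.
So the approach you have in mind works: define a schema that contains only the fields you need.
For example: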

import org.apache.spark.sql.types.{StringType, StructType}

// Declare only the field(s) to read; fields missing from this schema are not returned.
val schema = new StructType()
      .add("op_ts", StringType, true)

// `spark` is the SparkSession that spark-shell provides.
val df = spark.read.schema(schema)
              .option("multiLine", true).option("mode", "PERMISSIVE")
              .json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)

This returns:

root
 |-- op_ts: string (nullable = true)

+--------------------------+
|op_ts                     |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
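The same idea extends to nested fields, which is what the question asks about. Below is a minimal sketch, not a definitive implementation: the one-record `sample` string is a hypothetical stand-in shaped like the file in this answer, and the names `subset`, `nested`, and `flat` are illustrative. It assumes Spark 2.2 or later, where `spark.read.json` accepts a `Dataset[String]`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Local session for a standalone run; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("subset-schema").getOrCreate()
import spark.implicits._

// Hypothetical sample record (assumption: shaped like the file used in this answer).
val sample = Seq(
  """{"op_ts":"2019-05-31 04:24:34.000327","op_type":"U","after":{"ID":1,"CODE":"A","STATUS":"OK"}}"""
).toDS()

// Declare only the wanted fields, including a single field of the nested struct.
val subset = new StructType()
  .add("op_ts", StringType, nullable = true)
  .add("after", new StructType().add("ID", LongType, nullable = true), nullable = true)

val nested = spark.read.schema(subset).json(sample)

// Pull the nested field up into a top-level column.
val flat = nested.select($"op_ts", $"after.ID".as("id"))
flat.printSchema()
flat.show(false)
```

For the question's own file, the field names would presumably be "id" and "text" at whatever nesting level they actually occur; that layout is not shown in the post, so the struct above would need to be adapted.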

out of this full schema:

root
 |-- after: struct (nullable = true)
 |    |-- CODE: string (nullable = true)
 |    |-- CREATED: string (nullable = true)
 |    |-- ID: long (nullable = true)
 |    |-- STATUS: string (nullable = true)
 |    |-- UPDATE_TIME: string (nullable = true)
 |-- before: string (nullable = true)
 |-- current_ts: string (nullable = true)
 |-- op_ts: string (nullable = true)
 |-- op_type: string (nullable = true)
 |-- pos: string (nullable = true)
 |-- primary_keys: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- table: string (nullable = true)
 |-- tokens: struct (nullable = true)
 |    |-- csn: string (nullable = true)
 |    |-- txid: string (nullable = true)

which was obtained from the same file using:

val df = spark.read
              .option("multiLine", true).option("mode", "PERMISSIVE")
              .json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)

The latter is only there to show the full schema for comparison.
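For contrast, a different technique from the one this answer recommends is to let Spark infer the full schema and then select the columns afterwards. The sketch below illustrates that alternative under the same assumptions as before: the `sample` record is hypothetical, and `full` and `picked` are illustrative names. Inference needs a pass over the data to discover the fields, which is the cost that an explicit subset schema avoids.

```scala
import org.apache.spark.sql.SparkSession

// Local session for a standalone run; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("infer-then-select").getOrCreate()
import spark.implicits._

// Hypothetical sample record (assumption: shaped like the file used in this answer).
val sample = Seq(
  """{"op_ts":"2019-05-31 04:24:34.000327","op_type":"U","after":{"ID":1,"CODE":"A"}}"""
).toDS()

// No schema supplied, so Spark infers every field it finds in the data.
val full = spark.read.json(sample)

// Keep only the wanted columns after the fact.
val picked = full.select($"op_ts", $"after.ID".as("id"))
picked.printSchema()
```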

Answered 2021-05-17
