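When I read JSON with a custom schema, every value comes back as NULL. I know the reason (the actual data types do not match the types in the custom schema), but I don't know how to fix it, other than reading the file with the with open method. I would prefer a solution that does not read the file with the json module.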
from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, BooleanType, IntegerType, MapType,
                               StringType, StructField, StructType)

spark = SparkSession \
    .builder \
    .appName("JSON test") \
    .getOrCreate()

# Custom schema describing a single product item
schema = StructType([
    StructField("_links", MapType(StringType(), MapType(StringType(), StringType()))),
    StructField("identifier", StringType()),
    StructField("enabled", BooleanType()),
    StructField("family", StringType()),
    StructField("categories", ArrayType(StringType())),
    StructField("groups", ArrayType(StringType())),
    StructField("parent", StringType()),
    StructField("values", MapType(StringType(), ArrayType(MapType(StringType(), StringType())))),
    StructField("created", StringType()),
    StructField("updated", StringType()),
    StructField("associations", MapType(StringType(), MapType(StringType(), ArrayType(StringType())))),
    StructField("quantified_associations", MapType(StringType(), IntegerType())),
    StructField("metadata", MapType(StringType(), StringType())),
])

# Read every JSON file under the products folder with the custom schema
df = spark.read.format("json") \
    .schema(schema) \
    .load('/mnt/bronze/products/**/*.json')
df.display()
Original JSON structure:
root
|-- _embedded: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- _links: struct (nullable = true)
| | | | |-- self: struct (nullable = true)
| | | | | |-- href: string (nullable = true)
| | | |-- associations: struct (nullable = true)
| | | | |-- ERP_PIM: struct (nullable = true)
| | | | | |-- groups: array (nullable = true)
| | | | | | |-- element: string (containsNull = true)
| | | | | |-- product_models: array (nullable = true)
| | | |-- categories: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- created: string (nullable = true)
| | | |-- enabled: boolean (nullable = true)
| | | |-- family: string (nullable = true)
| | | |-- groups: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- identifier: string (nullable = true)
| | | |-- metadata: struct (nullable = true)
| | | | |-- workflow_status: string (nullable = true)
| | | |-- parent: string (nullable = true)
| | | |-- updated: string (nullable = true)
| | | |-- values: struct (nullable = true)
| | | | |-- Contrex_table: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- data: string (nullable = true)
| | | | | | |-- locale: string (nullable = true)
| | | | | | |-- scope: string (nullable = true)
| | | | |-- UFI_Table: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
|-- _links: struct (nullable = true)
| |-- first: struct (nullable = true)
| | |-- href: string (nullable = true)
| |-- next: struct (nullable = true)
| | |-- href: string (nullable = true)
| |-- self: struct (nullable = true)
| | |-- href: string (nullable = true)
1 Answer
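On the first read, I would suggest reading the data in its raw form. For example, if the JSON stores a boolean as a string, such as
{"enabled" : "true"}
I would read that pseudo-boolean as a string (that is, change BooleanType() to StringType()) and then cast it to a boolean in a later step, once the read has succeeded. This should stop the null values, because in its default PERMISSIVE mode Spark returns null, rather than raising an error, when the data does not match the declared type.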