Reading JSON with a custom schema - PySpark

k3fezbri  posted on 2023-01-21  in  Apache

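When I read JSON with a custom schema, every value comes back as NULL. I know the reason (the actual data types do not match the types in the custom schema), but I do not know how to fix it, other than reading the file with the `with open` approach. I would prefer not to read it with the json module.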

from pyspark.sql import SparkSession
from pyspark.sql.types import (ArrayType, BooleanType, IntegerType, MapType,
                               StringType, StructField, StructType)

spark = SparkSession \
    .builder \
    .appName("JSON test") \
    .getOrCreate()

# Custom schema; every field is declared at the top level of the document.
schema = StructType([
    StructField("_links", MapType(StringType(), MapType(StringType(), StringType()))),
    StructField("identifier", StringType()),
    StructField("enabled", BooleanType()),
    StructField("family", StringType()),
    StructField("categories", ArrayType(StringType())),
    StructField("groups", ArrayType(StringType())),
    StructField("parent", StringType()),
    StructField("values", MapType(StringType(), ArrayType(MapType(StringType(), StringType())))),
    StructField("created", StringType()),
    StructField("updated", StringType()),
    StructField("associations", MapType(StringType(), MapType(StringType(), ArrayType(StringType())))),
    StructField("quantified_associations", MapType(StringType(), IntegerType())),
    StructField("metadata", MapType(StringType(), StringType())),
])

# Read every JSON file under the mount point with the custom schema applied.
df = spark.read.format("json") \
    .schema(schema) \
    .load('/mnt/bronze/products/**/*.json')
df.display()  # display() is a Databricks notebook helper; df.show() works in plain PySpark

Original JSON structure:

root
 |-- _embedded: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _links: struct (nullable = true)
 |    |    |    |    |-- self: struct (nullable = true)
 |    |    |    |    |    |-- href: string (nullable = true)
 |    |    |    |-- associations: struct (nullable = true)
 |    |    |    |    |-- ERP_PIM: struct (nullable = true)
 |    |    |    |    |    |-- groups: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- product_models: array (nullable = true)
 |    |    |    |-- categories: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- created: string (nullable = true)
 |    |    |    |-- enabled: boolean (nullable = true)
 |    |    |    |-- family: string (nullable = true)
 |    |    |    |-- groups: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- identifier: string (nullable = true)
 |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |-- workflow_status: string (nullable = true)
 |    |    |    |-- parent: string (nullable = true)
 |    |    |    |-- updated: string (nullable = true)
 |    |    |    |-- values: struct (nullable = true)
 |    |    |    |    |-- Contrex_table: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- data: string (nullable = true)
 |    |    |    |    |    |    |-- locale: string (nullable = true)
 |    |    |    |    |    |    |-- scope: string (nullable = true)
 |    |    |    |    |-- UFI_Table: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |-- _links: struct (nullable = true)
 |    |-- first: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- next: struct (nullable = true)
 |    |    |-- href: string (nullable = true)
 |    |-- self: struct (nullable = true)
 |    |    |-- href: string (nullable = true)

Answer #1 (nbysray5)

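On the first read, I suggest reading the data in its raw form. For example, if the JSON stores a boolean as a string, such as {"enabled" : "true"}, I would read that pseudo-boolean as a string (so change BooleanType() to StringType()) and then convert it to a boolean in a later step, once the read has succeeded.
This should stop the null values, because Spark returns null for a field whose data does not match the declared type.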
