AWS Glue DynamicFrame with varying schema in JSON files

Posted by qlckcl4x on 2021-07-12 in Spark


Example: I have a partitioned table in the Glue catalog with the following DDL:

CREATE EXTERNAL TABLE `test`(
  `id` int, 
  `data` struct<a:string,b:string>)
PARTITIONED BY ( 
  `partition_0` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe'

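The underlying data in S3 consists of JSON files with varying schemas, meaning some elements may be absent from some files while present in others.
In this example, partition_0='01' contains JSON files with all elements: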

{"id": 1,"data": {"a": "value-a", "b": "value-b"}}

The files in partition_0='02' do not contain the element data.b:

{"id": 1,"data": {"a": "value-a"}}

Problem: when I create a DynamicFrame in Glue (I use Python), its schema depends on the data I query. If I include data from partition_0='01', all elements are present in the schema.

id_partition_predicate="partition_0 = '01'"
print("partition with 'b'")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition with 'b'
root
|-- id: int
|-- data: struct
|    |-- a: string
|    |-- b: string
|-- partition_0: string

print("both partitions")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test").printSchema()
both partitions
root
|-- id: int
|-- data: struct
|    |-- a: string
|    |-- b: string
|-- partition_0: string

If I query only the data in partition_0='02', the element data.b is absent from the DynamicFrame schema, even though it exists in the table definition.

print("partition without 'b'")
id_partition_predicate="partition_0 = '02'"
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition without 'b'
root
|-- id: int
|-- data: struct
|    |-- a: string
|-- partition_0: string
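
The three schemas printed above are consistent with the schema being derived from the records that are actually read, not from the table definition. The following is a simplified pure-Python model of that behaviour, using the two sample records from this question. It is not Glue's actual inference code, and `infer_fields` is a hypothetical helper introduced only for illustration.

```python
import json


def infer_fields(records):
    """Return the sorted union of leaf field paths seen across all records."""
    paths = set()

    def walk(obj, prefix):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            if isinstance(value, dict):
                walk(value, path)   # descend into nested structs
            else:
                paths.add(path)     # record a leaf field

    for record in records:
        walk(record, "")
    return sorted(paths)


# The sample records from the two partitions in the question.
partition_01 = [json.loads('{"id": 1,"data": {"a": "value-a", "b": "value-b"}}')]
partition_02 = [json.loads('{"id": 1,"data": {"a": "value-a"}}')]

print(infer_fields(partition_01))                 # ['data.a', 'data.b', 'id']
print(infer_fields(partition_01 + partition_02))  # ['data.a', 'data.b', 'id']
print(infer_fields(partition_02))                 # ['data.a', 'id']
```

In this model, data.b only appears when at least one record that contains it is read, which matches the three printSchema() results.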

Question: how can I create a DynamicFrame or DataFrame that always contains all elements from the Glue table schema?
Thanks in advance!

apache-spark pyspark aws-glue aws-glue-spark

Source: https://stackoverflow.com/questions/66447617/aws-glue-dynamicframe-with-varying-schema-in-json-files

1 Answer


sxpgvts3 #1

I came up with this solution:

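from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType
from pyspark.sql.utils import AnalysisException

id_partition_predicate = "partition_0 = '02'"
dyf = glueContext.create_dynamic_frame.from_catalog(database=glue_source_database, table_name="test", push_down_predicate=id_partition_predicate)
dyf.printSchema()
df = dyf.toDF()
try:
    # Extract data.b when the struct actually contains that field.
    df = df.withColumn("b", col("data").getItem("b"))
except AnalysisException:
    # Spark rejects a reference to a missing struct field; fall back to a null string column.
    df = df.withColumn("b", lit(None).cast(StringType()))
df.show()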

Output:

root
|-- id: int
|-- data: struct
|    |-- a: string
|-- partition_0: string
+---+---------+-----------+----+
| id|     data|partition_0|   b|
+---+---------+-----------+----+
|  1|[value-a]|         02|null|
+---+---------+-----------+----+
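
The try/except above handles one known field. A possible generalization, sketched here under stated assumptions, is to compare the schema found in the data with the expected table schema and generate Spark SQL expressions that fill every missing field with a typed NULL. The helpers `struct`, `conform_expr`, and `select_exprs` are hypothetical names, not part of Glue or PySpark. The schema dictionaries follow the name/type layout of `StructType.jsonValue()`, and field names are assumed to be simple identifiers that need no quoting. Only atomic and struct types are handled.

```python
def struct(fields):
    """Build a minimal StructType.jsonValue()-style dict from {name: type}."""
    return {"type": "struct",
            "fields": [{"name": n, "type": t} for n, t in fields.items()]}


def _fields(struct_type):
    """Map field name -> type for a struct-type dict."""
    return {f["name"]: f["type"] for f in struct_type["fields"]}


def _is_struct(t):
    return isinstance(t, dict) and t.get("type") == "struct"


def conform_expr(path, actual, target):
    """Spark SQL expression that yields `path` shaped like `target`.

    `actual` is the type found in the data, or None if the field is absent.
    """
    if _is_struct(target):
        have = _fields(actual) if _is_struct(actual) else {}
        parts = [
            f"'{name}', {conform_expr(f'{path}.{name}', have.get(name), sub)}"
            for name, sub in _fields(target).items()
        ]
        # Note: a row whose struct is NULL becomes a struct of NULL fields.
        return f"named_struct({', '.join(parts)})"
    if actual is None:
        if isinstance(target, dict):
            raise ValueError(f"unsupported missing complex field: {path}")
        return f"CAST(NULL AS {target})"   # missing atomic field -> typed NULL
    return path                            # field exists: pass it through


def select_exprs(actual_schema, target_schema):
    """One 'expr AS name' string per top-level field of the target schema."""
    have = _fields(actual_schema)
    return [f"{conform_expr(name, have.get(name), t)} AS {name}"
            for name, t in _fields(target_schema).items()]


# Schema of the Glue table vs. the schema seen when reading partition_0='02'.
table_schema = struct({
    "id": "integer",
    "data": struct({"a": "string", "b": "string"}),
    "partition_0": "string",
})
partition_02_schema = struct({
    "id": "integer",
    "data": struct({"a": "string"}),
    "partition_0": "string",
})

for expr in select_exprs(partition_02_schema, table_schema):
    print(expr)
```

With a real DataFrame, the generated strings could be passed to `df.selectExpr(*exprs)`, taking the actual schema from `df.schema.jsonValue()`. Unlike the solution above, this keeps b inside the data struct instead of adding a separate top-level column.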


Answered 2021-07-12
