PySpark glueContext create_dynamic_frame_from_options: how do I exclude one file type from loading?

bnlyeluc  posted on 2023-01-04  in Spark
raw_data_input_path = "s3a://{}/logs/application_id={}/component_id={}/".format(s3BucketName, application_id, component_id)

df = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": [raw_data_input_path],
        "recurse": True,
    },
    format="json",
    transformation_ctx=dbInstance,
)

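My bucket key contains 10 JSON files and 1 txt file, and I only want to include the JSON files in the dynamic frame. Is that what the 'format' parameter of create_dynamic_frame_from_options is for?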
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options
"format - A format specification (optional). This is used for an Amazon S3 or an AWS Glue connection that supports multiple formats."

col17t5w  #1

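The format parameter only tells Glue how to parse the files it reads; it does not filter files by extension. The exclusions parameter in the connection_options object is what lets you exclude files from the load. See https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3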

raw_data_input_path = "s3a://{}/logs/application_id={}/component_id={}/".format(s3BucketName, application_id, component_id)

df = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": [raw_data_input_path],
        "recurse": True,
        "exclusions": ["**.txt"],
    },
    format="json",
    transformation_ctx=dbInstance,
)
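
One caveat on the snippet above: the AWS documentation for the S3 connection describes exclusions as a string containing a JSON list of glob patterns, not a Python list. If the plain list does not behave as expected, a minimal sketch like the following builds the documented string form. The helper names build_s3_options and roughly_excluded, and the bucket path, are hypothetical and only for illustration. The fnmatch check is a rough local approximation and is not the glob engine Glue uses.

```python
import json
from fnmatch import fnmatch


def build_s3_options(path, exclude_globs):
    """Build a connection_options dict for an S3 source (hypothetical helper).

    The exclusions value is serialized to a JSON-list string, the form
    described in the AWS Glue S3 connection documentation.
    """
    return {
        "paths": [path],
        "recurse": True,
        "exclusions": json.dumps(exclude_globs),
    }


def roughly_excluded(key, exclude_globs):
    """Approximate, local-only check; fnmatch is NOT Glue's glob matcher."""
    return any(fnmatch(key, pattern) for pattern in exclude_globs)


# Hypothetical bucket and prefix, used only to exercise the helper.
options = build_s3_options(
    "s3://my-bucket/logs/application_id=1/component_id=2/",
    ["**.txt"],
)
print(options["exclusions"])

# Sanity-check the pattern against a couple of sample keys before running the job.
keys = ["logs/a.json", "logs/notes.txt"]
kept = [k for k in keys if not roughly_excluded(k, ["**.txt"])]
print(kept)
```

The resulting dict can then be passed as connection_options to create_dynamic_frame_from_options in place of the inline dict.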
