aws dms sql server到s3 parquet-更改数据类型转换规则和'parquet type not supported:int32(uint_)'

a14dhokn  于 2021-05-27  发布在  Spark
关注(0)|答案(2)|浏览(751)

我们使用awsdms将sqlserverdbs作为Parquet文件转储到s3中。我们的想法是用spark over parquets进行一些分析。当一个完整的负载是完整的,那么它是不可能阅读Parquet,因为他们有 UINT 架构中的字段。斯帕克拒绝和我一起读 Parquet type not supported: INT32 (UINT_8) . 我们使用转换规则覆盖数据类型 UINT 柱。但看起来它们并没有被dms引擎接收到。为什么?
有许多规则,如“convert unit to int”(请注意uint1是1字节无符号dms数据类型):

{
  "rule-type": "transformation",
  "rule-id": "7",
  "rule-name": "uintToInt",
  "rule-action": "change-data-type",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%",
    "column-name": "%",
    "data-type": "uint1"
  },
  "data-type": {
    "type": "int4"
  }
}

s3级 DataFormat=parquet;ParquetVersion=parquet_2_0 dms引擎版本是 3.3.2 然而,仍然得到Parquet图案与uint。见下表:

id: int32
name: string
value: string
status: uint8

尝试阅读这样的Parquet用Spark给我

org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_8);
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotSupported$1(ParquetSchemaConverter.scala:100)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:136)

为什么不触发dms转换规则?

nxagd54h

nxagd54h1#

将数据直接从uint转换为dms上的int可以解决这个问题。Map规则应如下所示:

{
"rules": [
    ...
    {
        "rule-type": "transformation",
        "rule-id": "2",
        "rule-name": "unit1-to-int1",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint1"
        },
        "data-type": {
            "type": "int1"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "3",
        "rule-name": "unit2-to-int2",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint2"
        },
        "data-type": {
            "type": "int2"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "4",
        "rule-name": "unit4-to-int4",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint4"
        },
        "data-type": {
            "type": "int4"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "5",
        "rule-name": "unit8-to-int8",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint8"
        },
        "data-type": {
            "type": "int8"
        }
    }
]}

文档:https://docs.aws.amazon.com/dms/latest/userguide/chap_tasks.customizingtasks.tablemapping.html#chap_tasks.customizingtasks.tablemapping.selectiontransformation.transformations

1u4esq0p

1u4esq0p2#

唯一能让转换在Parquet文件上工作的方法是指定要转换的确切列。例如:

{
   "rules": [
   ...
   {
    "rule-type": "transformation",
    "rule-id": "2",
    "rule-name": "unit1-to-int1",
    "rule-action": "change-data-type",
    "rule-target": "column",
    "object-locator": {
        "schema-name": "acessa",
        "table-name": "<table_name>",
        "column-name": "<column_name>"
    },
    "data-type": {
        "type": "int1"
    }
   }
  ]
}

在对象定位器中使用通配符%作为列名是行不通的

相关问题