Importing data into Hudi: hdfsparquetimport performance issues

Posted by tktrz96b on 2021-05-26 in Spark


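We are trying to use HDFSParquetImporter to import data from Parquet (source) into Hudi on S3 (target), but we are running into performance issues.
Hudi version 0.5.3 (EMR 5.30.1).
Here are the important facts about our data:
Parquet size on S3: 536.8 GiB
Number of Parquet files: 8785
Total rows: >8 billion
Not partitioned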
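For context, the averages implied by these figures can be worked out with a short sketch. It uses only the numbers quoted above; the row count is stated as a lower bound (">8 billion"), so the rows-per-file value is a minimum and the bytes-per-row value is a maximum.

```python
# Figures quoted in the post.
TOTAL_GIB = 536.8             # total Parquet size on S3
FILE_COUNT = 8785             # number of Parquet files
TOTAL_ROWS = 8_000_000_000    # lower bound: the post says ">8 billion"

# Average size of one source file, in MiB.
avg_file_mib = TOTAL_GIB * 1024 / FILE_COUNT

# Average number of rows per source file (at least this many).
avg_rows_per_file = TOTAL_ROWS / FILE_COUNT

# Average compressed bytes per row (at most this many).
avg_bytes_per_row = TOTAL_GIB * 1024**3 / TOTAL_ROWS

print(f"average file size: {avg_file_mib:.1f} MiB")
print(f"average rows per file: {avg_rows_per_file:,.0f}")
print(f"average bytes per row: {avg_bytes_per_row:.1f}")
```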
Partitioning rules on Hudi:
We use multi-level partitioning: organization/year/month/day
We have thousands of organizations
Data schema in Avro format:

{ "type": "record", "name": "UsageFact", "doc": "Usage Fact", "fields": [ { "name": "sk_usage_id", "type": "string" }, { "name": "sk_comm_capability_id", "type": "string" }, { "name": "time", "type": "string" }, { "name": "mt_load_time", "type": "string" }, { "name": "direction", "type": "string" }, { "name": "channel", "type": "string" }, { "name": "provider", "type": "string" }, { "name": "metric", "type": "string" }, { "name": "sk_comm_capability_name", "type": "string" }, { "name": "sk_operation_id", "type": "string" }, { "name": "sk_operation_name", "type": "string" }, { "name": "country", "type": "string" }, { "name": "subcategory", "type": "string" }, { "name": "category", "type": "string" }, { "name": "quantity", "type": "int" }, { "name": "partition_path", "type": "string" } ] }

The "partition_path" column holds the partition rule (for example: "organization=/year=2020/month=01/day=01").
So we run hdfsparquetimport:
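As an illustration of the layout described above, here is a minimal sketch of how such a "partition_path" value could be assembled. The post does not show how the column is actually produced, so everything here is an assumption: the function name `build_partition_path` is hypothetical, the organization value "acme" is made up, and the `time` column is assumed to be an ISO-8601 string.

```python
from datetime import datetime


def build_partition_path(organization: str, time_str: str) -> str:
    """Build an organization/year/month/day partition path.

    Assumes time_str is ISO-8601 (e.g. "2020-01-01T10:15:00"); the real
    format of the `time` column is not stated in the post.
    """
    ts = datetime.fromisoformat(time_str)
    # Zero-padded month and day, matching the example in the post.
    return (
        f"organization={organization}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
    )


# "acme" is a hypothetical organization identifier.
print(build_partition_path("acme", "2020-01-01T10:15:00"))
```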

hdfsparquetimport --upsert false --srcPath "[PARQUET_SOURCE_PATH]" --targetPath "[HUDI_TARGET_PATH]" --tableName [TABLE_NAME] --tableType COPY_ON_WRITE --rowKeyField [ROW_IDENTIFIER] --partitionPathField "partition_path" --parallelism 5000 --schemaFilePath "[AVRO SCHEMA]" --format parquet --sparkMemory 20g --retry 3

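The problems are:
In our first test, the importer took 33 minutes to import 20 million rows, so we are worried about using it to ingest all 8 billion rows. We tried changing some performance parameters (sparkMemory and parallelism), without any good results. The Spark job created by the importer uses only 30% of the cluster's resources, and we have not been able to make it use more.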
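To put the concern in numbers, the following sketch extrapolates the test result to the full dataset. It assumes throughput stays constant as the input grows, which is an assumption and may not hold in practice; the helper names are illustrative.

```python
# Figures quoted in the post.
TEST_ROWS = 20_000_000        # rows imported in the first test
TEST_MINUTES = 33             # wall-clock time of that test
TOTAL_ROWS = 8_000_000_000    # lower bound: the post says ">8 billion"


def rows_per_second(rows: int, minutes: float) -> float:
    """Observed throughput of a run."""
    return rows / (minutes * 60)


def projected_hours(total_rows: int, rows: int, minutes: float) -> float:
    """Linear extrapolation, assuming constant throughput."""
    return total_rows / rows * minutes / 60


throughput = rows_per_second(TEST_ROWS, TEST_MINUTES)
hours = projected_hours(TOTAL_ROWS, TEST_ROWS, TEST_MINUTES)

print(f"throughput: {throughput:,.0f} rows/s")
print(f"projected full import: {hours:.0f} hours ({hours / 24:.1f} days)")
```

At the observed rate the full import would take roughly 220 hours, a little over nine days.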
We have already tried writing to Hudi with Spark and bulk insert, but the performance was worse than with the importer (1 million rows in more than an hour).
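The post does not include the Spark bulk insert job itself. For reference, a bulk insert through the Hudi Spark datasource is usually configured with options along these lines. This is a sketch, not the poster's code: the helper name is hypothetical, the record key "sk_usage_id" and precombine field "mt_load_time" are guesses based on the schema, and the option keys should be checked against the Hudi 0.5.3 configuration reference.

```python
def hudi_bulk_insert_options(
    table_name: str,
    record_key_field: str,
    precombine_field: str,
    parallelism: int,
) -> dict:
    """Return Spark datasource options for a Hudi bulk insert (sketch)."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.datasource.write.recordkey.field": record_key_field,
        # Partition column named in the post.
        "hoodie.datasource.write.partitionpath.field": "partition_path",
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Spark options are passed as strings.
        "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
    }


# Field choices below are assumptions, not taken from the post.
opts = hudi_bulk_insert_options(
    table_name="usage_fact",
    record_key_field="sk_usage_id",
    precombine_field="mt_load_time",
    parallelism=5000,
)

# With a SparkSession and a DataFrame `df` read from the source Parquet,
# the write would look roughly like this (not executed here):
#   df.write.format("hudi").options(**opts).mode("append").save(target_path)
```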
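So our questions are:
How can we tune this importer? How can we allocate more resources to this job?
Can we run multiple importers in parallel against the same Hudi table in bulk mode?
Is there any other good way to import a large amount of data into Hudi? What is the best option in terms of performance?
Thanks for your help.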

apache-spark apache-hudi

Source: https://stackoverflow.com/questions/65183314/import-data-to-hudi-hdfsparquetimport-performance-issues
