How to accelerate a Spark GROUP BY query on a large Hive table?

Posted by dldeef67 on 2021-06-25 in Hive


I have an input table intab:

create table intab (
  ds string comment 'date partition field'
  , id1 string comment 'id1'
  , id2 string comment 'id2'
  , n int comment 'n'
) comment 'test'
partition by list(ds)(partition default);

I need to compute outtab:

create table outtab as select
  id1, id2, sum(n) as sum_n
from intab group by id1, id2;

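Both intab and outtab are stored as RCFile in Hive.
When computing outtab from a large input with Spark, I often hit errors such as "ERROR TransportResponseHandler: Still have 1 requests outstanding when connection", and the Spark tasks fail.
This is similar to the problem described at https://forums.databricks.com/questions/10872/error-transportresponsehandler-still-have-1-reques-1.html, and the error only appears with large inputs, i.e. 15 TB.
The input table intab is very large, with more than 500 billion records and 15 TB of storage.
I configured the following Spark parameters, but they did not help: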

spark.sql.shuffle.partitions=100000
spark.blacklist.enabled=true
spark.network.timeout=600s
spark.sql.broadcastTimeout=1000
spark.driver.maxResultSize=2g
spark.executor.memoryOverhead=2048
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false

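I have checked that there is no data skew on the key pair (id1, id2) used in the GROUP BY statement.
Any help with this problem would be much appreciated, for example by optimizing the storage or partition structure, or through Spark options.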
++
In the input table intab, id1 and id2 have more than 1 billion distinct values.
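
As a rough sanity check on the figures above, here is a minimal Python sketch of the average data volume per shuffle partition and the average number of rows per (id1, id2) group. It is only an estimate under stated assumptions: it treats the full 15 TB as being shuffled evenly (a simplification, since partial aggregation before the shuffle and RCFile compression both change the real volume), counts 1 TB as 1024^4 bytes, and reads the "1 billion distinct values" figure as the number of distinct (id1, id2) pairs. The function names are illustrative only and not part of any Spark API.

```python
# Back-of-the-envelope sizing for the GROUP BY shuffle.
# Figures come from the question; the unit convention and the
# even-distribution assumption are simplifications (see lead-in).

TB = 1024 ** 4   # assumption: "15 TB" is interpreted with binary units
MIB = 1024 ** 2


def per_partition_bytes(total_bytes: int, num_partitions: int) -> float:
    """Average bytes per shuffle partition if data were spread evenly."""
    return total_bytes / num_partitions


def avg_rows_per_group(total_rows: int, distinct_groups: int) -> float:
    """Average number of input rows that collapse into one output group."""
    return total_rows / distinct_groups


input_bytes = 15 * TB                 # table size stated in the question
shuffle_partitions = 100_000          # spark.sql.shuffle.partitions
total_rows = 500_000_000_000          # "more than 500 billion records"
distinct_keys = 1_000_000_000         # assumed count of distinct (id1, id2) pairs

avg_mib = per_partition_bytes(input_bytes, shuffle_partitions) / MIB
rows_per_group = avg_rows_per_group(total_rows, distinct_keys)

print(f"average volume per shuffle partition: {avg_mib:.1f} MiB")
print(f"average rows per (id1, id2) group: {rows_per_group:.0f}")
```

Under these assumptions the estimate comes to roughly 157 MiB per shuffle partition and about 500 input rows per output group, so the aggregation should reduce the row count substantially before and after the shuffle.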

sql Hive apache-spark bigdata hadoop-partitioning

Source: https://stackoverflow.com/questions/58917597/how-to-accelerate-large-hive-table-spark-group-by-query
