spark.sql.shuffle.partitions in Spark 2.0.3 doesn't take effect

Posted by wbgh16ku on 2021-05-31 in Hadoop


I want to run a Hive SQL query on Spark (Hive on Spark). The query and settings are as follows:

select a,b,sum(c) from tbl_a group by a,b

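set hive.execution.engine=spark; set spark.sql.shuffle.partitions=1201;
After the application started, I could see only 82 tasks running in parallel on the Spark YARN web page, which is not what I expected. I also tested another, more complex SQL query (with GROUP BY CUBE and nested subqueries), and it produced only 17 tasks in Stage-2, which leads to heavy full GC. Any idea why spark.sql.shuffle.partitions has no effect? Thanks!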

hadoop Hive apache-spark apache-spark-sql partitioning

Source: https://stackoverflow.com/questions/49859229/spark-sql-shuffle-partitions-in-spark-2-0-3-doesnt-take-effect

1 answer


1sbrub3j1#

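After forcing the following two settings, the number of partitions went back to normal and matched what I expected. set hive.exec.reducers.bytes.per.reducer=67108864; set mapred.reduce.tasks=1201; It seems that Hive on Spark still applies some Hadoop parameters when deciding the execution plan, and these mask the corresponding Spark parameters.
I have written a post with a detailed replay of the scenario and references.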
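For reference, the engine setting from the question and the two overrides from this answer can be combined in a single Hive session, roughly as sketched below. The table `tbl_a`, the query, and the value 1201 come from the question. The comments on how Hive chooses the reducer count describe general Hive behavior and are an assumption for the exact Hive on Spark version involved here.

```sql
-- Run the query on the Spark engine (Hive on Spark), as in the question.
set hive.execution.engine=spark;

-- Target amount of input per reducer: 67108864 bytes = 64 * 1024 * 1024 (64 MiB).
-- Assumption: when no fixed reducer count is given, Hive estimates the count
-- from the input size divided by this value.
set hive.exec.reducers.bytes.per.reducer=67108864;

-- Request a fixed number of reduce tasks instead of an estimate.
set mapred.reduce.tasks=1201;

-- spark.sql.shuffle.partitions is intentionally left out: per the answer,
-- it did not influence the plan that Hive generated.
select a, b, sum(c) from tbl_a group by a, b;
```

A plausible reading of the answer is that `spark.sql.shuffle.partitions` is a Spark SQL option, whereas in Hive on Spark the plan is built by Hive, which is why the Hive and Hadoop reducer parameters appear to be the ones that matter.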

2021-06-01
