Spark: more executors on one machine, longer duration for each task

Posted by r3i60tvu on 2021-05-29 in Spark

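When I run LogisticRegression in Spark, one stage behaves differently from the rest: as the number of executors increases, the average processing time per task gets longer. Why does this happen?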

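Environment:

All servers are on-premises, no cloud.
Server 1: 6 cores, 10 GB memory (Spark master, HDFS master, HDFS slave).
Server 2: 6 cores, 10 GB memory (HDFS slave).
Server 3: 6 cores, 10 GB memory (Spark slave, HDFS slave).
Deployed in standalone mode.
Input file size: large enough to satisfy the parallelism requirement. Spark reads the file from HDFS.
All workloads use the same input file.
As you can see, only server 3 takes part in the computation, because it is the only Spark worker.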

DAG of the special stage (figure omitted)

1 core, 1 GB memory

spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 1 ....

Median task duration: 1 s

2 cores, 2 GB memory

spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 2 ....

Median task duration: 2 s

3 cores, 3 GB memory

spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 3 ....

Median task duration: 2 s

4 cores, 4 GB memory

spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 4 ....

Median task duration: 3 s

5 cores, 5 GB memory

spark-submit --executor-cores 1 --executor-memory 1g --total-executor-cores 5 ....

Median task duration: 3 s
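
To make the "N cores, N GB memory" headings concrete, here is a rough sketch of how the three flags translate into executors on server 3. It assumes the standalone worker offers all 6 cores and 10 GB to Spark and ignores any memory overhead. The `Worker` class and `executors_launched` function are hypothetical helpers for illustration, not part of Spark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    """Resources a standalone worker offers (assumed, simplified)."""
    cores: int
    memory_gb: int


def executors_launched(executor_cores, executor_memory_gb,
                       total_executor_cores, worker):
    """Estimate how many executors a single worker hosts for one application.

    The count is limited by the requested total cores and by the worker's
    free cores and memory. This is a simplification of the real scheduler.
    """
    by_request = total_executor_cores // executor_cores
    by_cores = worker.cores // executor_cores
    by_memory = worker.memory_gb // executor_memory_gb
    return min(by_request, by_cores, by_memory)


# Server 3 from the question: the only Spark worker.
server3 = Worker(cores=6, memory_gb=10)

for total in range(1, 6):
    n = executors_launched(1, 1, total, server3)
    print(f"--total-executor-cores {total}: "
          f"{n} executor(s), {n} core(s), {n} GB in total")
```

Under these assumptions each run adds one more single-core, 1 GB executor JVM to the same machine, so the five runs differ only in how many executors share server 3.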

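As the results above show, the more executors there are on the machine, the longer a single task takes on average. Why does this happen? I did not see any disk spill on the executors, so memory should be sufficient.
Note: only this stage shows this behavior; the other stages do not have this problem.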
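
One way to read the numbers above is to turn them into aggregate throughput for this stage. The sketch below uses only the median durations reported in the question. It assumes that the median is representative of all tasks in the stage and that each single-core executor runs one task at a time. `relative_throughput` is a hypothetical helper name.

```python
# Median task duration in seconds, keyed by number of executors,
# copied from the five runs reported in the question.
median_task_seconds = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3}


def relative_throughput(executors, median_seconds):
    """Tasks completed per second if each executor runs one task at a time."""
    return executors / median_seconds


rows = []
for executors, seconds in sorted(median_task_seconds.items()):
    throughput = round(relative_throughput(executors, seconds), 2)
    rows.append((executors, seconds, throughput))

print("executors  median_s  tasks_per_s")
for executors, seconds, throughput in rows:
    print(f"{executors:>9}  {seconds:>8}  {throughput:>11}")
```

With these figures the estimated throughput is 1.0, 1.0, 1.5, 1.33 and 1.67 tasks per second for 1 to 5 executors, so for this stage five executors give well under twice the throughput of one.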

hadoop scala apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/62319481/spark-more-executors-in-one-machine-longer-duration-time-for-each-task
