Spark loses executor: "Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues"

Posted by js5cn81o on 2021-07-09 in Spark

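I am running a master and one worker in standalone mode on a GPU server. After I submit a job, it acquires and loses its executor several times before timing out, and TaskSchedulerImpl logs the error shown in the logs below.

spark-submit command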

    spark-submit \
      --conf spark.plugins=com.nvidia.spark.SQLPlugin \
      --conf spark.rapids.memory.gpu.pooling.enabled=false \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.task.resource.gpu.amount=1 \
      --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR}. \
      --master spark://<ip>:7077 \
      --driver-memory 16g \
      --executor-memory 16g \
      --conf spark.cores.max=1 \
      --class com.spark.examples.Class \
      app.jar \
      -dataPath=spark/data.csv \
      -format=csv \
      -numWorkers=1 \
      -treeMethod=gpu_hist \
      -numRound=100 \
      -maxDepth=8

Logs

    Removal of executor 1 requested
    21/03/30 17:50:07 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
    21/03/30 17:50:07 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
    21/03/30 17:50:07 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-202103302 on worker-20330778-<ip>-33921 (<ip>:3721921) with 1 core(s)
    21/03/30 17:50:07 INFO StandaloneSchedulerBackend: Granted executor ID app-2024944-000/2 on hostPort <ip>:37921 with 1 core(s), 16.0 GiB RAM
    21/03/30 17:50:07 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-2044-0010/2 is now RUNNING
    21/03/30 17:50:09 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (<ip>:41196) with ID 2, ResourceProfileId 0
    21/03/30 17:50:09 INFO BlockManagerMasterEndpoint: Registering block manager <ip>:45111 with 9.4 GiB RAM, BlockManagerId(2, <ip>, 41351, None)
    21/03/30 17:50:16 ERROR TaskSchedulerImpl: Lost executor 2 on <ip>: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    21/03/30 17:50:16 INFO DAGScheduler: Executor lost: 2 (epoch 2)
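
The error message itself suggests checking the driver log for WARN messages. The following is a minimal sketch of how that log could be filtered, assuming the driver output has been saved to a file (the name `driver.log` and both function names are hypothetical, not part of Spark). In standalone mode the worker also keeps each executor's stdout and stderr under its work directory (by default `SPARK_HOME/work`), which is often where the actual reason for an executor exit appears.

```shell
# Hypothetical helper: count "Lost executor" events in a driver log.
# Reads log text from stdin; prints 0 if there are none.
count_lost_executors() {
    grep -c 'Lost executor' || true
}

# Hypothetical helper: keep only WARN and ERROR lines, assuming the
# "yy/MM/dd HH:mm:ss LEVEL ..." layout seen in the logs above.
warn_and_error_lines() {
    grep -E '^[0-9/]+ [0-9:]+ (WARN|ERROR) ' || true
}
```

Possible usage, with `driver.log` standing in for wherever the driver output was captured: `count_lost_executors < driver.log` and `warn_and_error_lines < driver.log`.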

Specs
I am using an AWS EC2 g4dn machine.

    GPU: TU104GL [Tesla T4]
    15109MiB
    Driver Version: 460.32.03
    CUDA Version: 11.2
    1 worker: 1 core, 16GB of memory.
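
One thing that may be worth ruling out is host memory pressure. The sketch below simply adds up the heap sizes requested in the spark-submit command and compares them with the host memory. It assumes the driver runs on the same 16 GB machine as the executor, which the question does not confirm, and it ignores off-heap and GPU-related host memory. The function name `heap_budget_status` is hypothetical.

```shell
# Values taken from the spark-submit command and specs, in GiB.
driver_gib=16      # --driver-memory 16g
executor_gib=16    # --executor-memory 16g
host_gib=16        # ASSUMPTION: driver and executor share one 16 GB host

# Hypothetical helper: $1 = driver heap, $2 = executor heap, $3 = host memory.
heap_budget_status() {
    requested=$(( $1 + $2 ))
    if [ "$requested" -gt "$3" ]; then
        echo "over by $(( requested - $3 )) GiB"
    else
        echo "within budget"
    fi
}

heap_budget_status "$driver_gib" "$executor_gib" "$host_gib"
# prints "over by 16 GiB"
```

If that assumption holds, the requested heaps could together exceed physical memory, which would be one possible (not confirmed) reason for the executor process being killed. Lowering `--driver-memory` and `--executor-memory` would be a cheap experiment to test it.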
