Spark not utilizing the GPU: taskResourceAssignments Map(gpu -> [0]

Posted by v1l68za4 on 2021-07-09 in Spark

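I can see tasks being assigned a GPU, but GPU utilization stays at 0%. How can I get the job to actually use the GPU? I am running the master and one worker on a GPU server in standalone mode.
Spark submit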

    spark-submit \
      --master spark://<ip>:7077 \
      --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
      --conf spark.worker.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
      --conf spark.task.resource.gpu.amount=1 \
      --conf spark.executor.resource.gpu.amount=1 \
      --conf spark.worker.resource.gpu.amount=1 \
      --class com.spark.Class \
      app.jar
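
For context on the two `discoveryScript` settings: Spark expects such a script to write a JSON object with a resource name and an array of address strings to stdout. The contents of `/opt/getGpusResources.sh` are not shown in this question, so the following is only a sketch of that output shape, written in Python for illustration. `resource_json` is a hypothetical helper, not part of Spark.

```python
import json


def resource_json(name, addresses):
    """Build the JSON text a resource discovery script is expected to print.

    `name` is the resource name (here "gpu"); `addresses` are the device
    identifiers, which are emitted as strings.
    """
    return json.dumps({"name": name, "addresses": [str(a) for a in addresses]})


# Assumption: a single GPU with index 0, matching the address seen in the logs.
print(resource_json("gpu", [0]))
```

A script that prints this for one GPU would be consistent with the `addresses: 0` assignment that appears in the logs.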

Logs

    21/03/30 23:19:25 INFO DAGScheduler: Submitting 10 missing tasks from ShuffleMapStage 251 (MapPartitionsRDD[306] at collect at ClusteringMetrics.scala:102) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
    21/03/30 23:19:25 INFO TaskSchedulerImpl: Adding task set 251.0 with 10 tasks resource profile 0
    21/03/30 23:19:25 INFO TaskSetManager: Starting task 0.0 in stage 251.0 (TID 2178) (<ip>, executor 0, partition 0, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_319_piece0 in memory on <ip>:34559 (size: 17.3 KiB, free: 4.0 GiB)
    21/03/30 23:19:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 83 to <ip>:34520
    21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_316_piece0 in memory on <ip>:34559 (size: 547.0 B, free: 4.0 GiB)
    21/03/30 23:19:25 INFO TaskSetManager: Starting task 1.0 in stage 251.0 (TID 2179) (<ip>, executor 0, partition 1, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:25 INFO TaskSetManager: Finished task 0.0 in stage 251.0 (TID 2178) in 225 ms on <ip> (executor 0) (1/10)
    21/03/30 23:19:25 INFO TaskSetManager: Starting task 2.0 in stage 251.0 (TID 2180) (<ip>, executor 0, partition 2, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:25 INFO TaskSetManager: Finished task 1.0 in stage 251.0 (TID 2179) in 181 ms on <ip> (executor 0) (2/10)
    21/03/30 23:19:26 INFO TaskSetManager: Starting task 3.0 in stage 251.0 (TID 2181) (<ip>, executor 0, partition 3, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:26 INFO TaskSetManager: Finished task 2.0 in stage 251.0 (TID 2180) in 226 ms on <ip> (executor 0) (3/10)
    21/03/30 23:19:26 INFO TaskSetManager: Starting task 4.0 in stage 251.0 (TID 2182) (<ip>, executor 0, partition 4, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:26 INFO TaskSetManager: Finished task 3.0 in stage 251.0 (TID 2181) in 187 ms on <ip> (executor 0) (4/10)
    21/03/30 23:19:26 INFO TaskSetManager: Starting task 5.0 in stage 251.0 (TID 2183) (<ip>, executor 0, partition 5, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:26 INFO TaskSetManager: Finished task 4.0 in stage 251.0 (TID 2182) in 180 ms on <ip> (executor 0) (5/10)
    21/03/30 23:19:26 INFO TaskSetManager: Starting task 6.0 in stage 251.0 (TID 2184) (<ip>, executor 0, partition 6, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:26 INFO TaskSetManager: Finished task 5.0 in stage 251.0 (TID 2183) in 179 ms on <ip> (executor 0) (6/10)
    21/03/30 23:19:26 INFO TaskSetManager: Starting task 7.0 in stage 251.0 (TID 2185) (<ip>, executor 0, partition 7, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:26 INFO TaskSetManager: Finished task 6.0 in stage 251.0 (TID 2184) in 179 ms on <ip> (executor 0) (7/10)
    21/03/30 23:19:27 INFO TaskSetManager: Starting task 8.0 in stage 251.0 (TID 2186) (<ip>, executor 0, partition 8, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:27 INFO TaskSetManager: Finished task 7.0 in stage 251.0 (TID 2185) in 216 ms on <ip> (executor 0) (8/10)
    21/03/30 23:19:27 INFO TaskSetManager: Starting task 9.0 in stage 251.0 (TID 2187) (<ip>, executor 0, partition 9, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
    21/03/30 23:19:27 INFO TaskSetManager: Finished task 8.0 in stage 251.0 (TID 2186) in 179 ms on <ip> (executor 0) (9/10)
    21/03/30 23:19:27 INFO TaskSetManager: Finished task 9.0 in stage 251.0 (TID 2187) in 179 ms on <ip> (executor 0) (10/10)
    21/03/30 23:19:27 INFO TaskSchedulerImpl: Removed TaskSet 251.0, whose tasks have all completed, from pool
    21/03/30 23:19:27 INFO DAGScheduler: ShuffleMapStage 251 (collect at ClusteringMetrics.scala:102) finished in 1.934 s
    21/03/30 23:19:27 INFO DAGScheduler: looking for newly runnable stages
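
To summarize what these log lines record, the task-start entries can be scanned for the GPU address handed to each task. The sketch below is only an illustration in Python. `TASK_START` and `gpu_assignments` are hypothetical names, not part of Spark, and the pattern assumes the exact log format shown above. Note that these entries record which address the scheduler reserved for a task. On their own they do not show that the task's code ran anything on that device.

```python
import re
from collections import Counter

# Matches the "Starting task" lines emitted by TaskSetManager in the log above.
TASK_START = re.compile(
    r"Starting task (?P<task>\d+\.\d+) in stage (?P<stage>\d+\.\d+) "
    r"\(TID (?P<tid>\d+)\).*?"
    r"taskResourceAssignments Map\(gpu -> "
    r"\[name: gpu, addresses: (?P<addrs>[^\]]*)\]\)"
)


def gpu_assignments(log_lines):
    """Return a list of (task id, [gpu addresses]) for each task-start line."""
    result = []
    for line in log_lines:
        match = TASK_START.search(line)
        if match:
            # Assumption: multiple addresses would be comma-separated.
            addrs = [a.strip() for a in match.group("addrs").split(",") if a.strip()]
            result.append((int(match.group("tid")), addrs))
    return result


# Three lines copied from the log in the question.
sample = [
    "21/03/30 23:19:25 INFO TaskSetManager: Starting task 0.0 in stage 251.0 (TID 2178) (<ip>, executor 0, partition 0, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])",
    "21/03/30 23:19:25 INFO TaskSetManager: Finished task 0.0 in stage 251.0 (TID 2178) in 225 ms on <ip> (executor 0) (1/10)",
    "21/03/30 23:19:25 INFO TaskSetManager: Starting task 1.0 in stage 251.0 (TID 2179) (<ip>, executor 0, partition 1, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])",
]

assignments = gpu_assignments(sample)
# Count how many tasks were given each GPU address.
usage = Counter(addr for _, addrs in assignments for addr in addrs)
print(assignments)
print(dict(usage))
```

Run over the full log, this kind of summary would show every task in stage 251 being given GPU address 0, which matches the single-GPU setup described in the question.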

Specs
I am using an AWS EC2 g4dn machine.

    GPU: TU104GL [Tesla T4]
    15109MiB
    Driver Version: 460.32.03
    CUDA Version: 11.2
    1 worker: 1 core, 7GB of memory.
