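I can see tasks being assigned to the GPU, but GPU utilization stays at 0%. How can I get the job to actually use the GPU? I am running the master and 1 worker on the same GPU server in standalone mode.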
spark-submit
spark-submit \
--master spark://<ip>:7077 \
--conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
--conf spark.worker.resource.gpu.discoveryScript=/opt/getGpusResources.sh \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.worker.resource.gpu.amount=1 \
--class com.spark.Class \
app.jar
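For context, the contents of /opt/getGpusResources.sh are not shown here. A discovery script is expected to print a single JSON object with a resource name and a list of addresses. The following is only a minimal sketch of such a script, assuming `nvidia-smi` is on the PATH; the helper name `gpu_json` is hypothetical. Note that this mechanism only tells Spark which GPU addresses exist so the scheduler can reserve them for tasks; it does not by itself make task code run on the GPU.

```shell
#!/bin/sh
# Hypothetical sketch of a GPU discovery script; NOT the actual /opt/getGpusResources.sh.

# gpu_json: read newline-separated GPU indices on stdin and print the
# JSON object Spark expects from a discovery script.
gpu_json() {
  # Strip spaces/CRs, drop blank lines, then join the indices with "," separators.
  addrs=$(tr -d ' \r' | grep -v '^$' | paste -sd, - | sed 's/,/","/g')
  if [ -z "$addrs" ]; then
    # No GPUs reported: emit an empty address list.
    printf '{"name": "gpu", "addresses": []}\n'
  else
    printf '{"name": "gpu", "addresses": ["%s"]}\n' "$addrs"
  fi
}

# Query the NVIDIA driver only if nvidia-smi is available (assumption).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=index --format=csv,noheader | gpu_json
fi
```

On a machine with a single GPU, a script like this would report address 0, which is consistent with a scheduler assigning `addresses: 0` to each task.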
Logs
21/03/30 23:19:25 INFO DAGScheduler: Submitting 10 missing tasks from ShuffleMapStage 251 (MapPartitionsRDD[306] at collect at ClusteringMetrics.scala:102) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
21/03/30 23:19:25 INFO TaskSchedulerImpl: Adding task set 251.0 with 10 tasks resource profile 0
21/03/30 23:19:25 INFO TaskSetManager: Starting task 0.0 in stage 251.0 (TID 2178) (<ip>, executor 0, partition 0, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_319_piece0 in memory on <ip>:34559 (size: 17.3 KiB, free: 4.0 GiB)
21/03/30 23:19:25 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 83 to <ip>:34520
21/03/30 23:19:25 INFO BlockManagerInfo: Added broadcast_316_piece0 in memory on <ip>:34559 (size: 547.0 B, free: 4.0 GiB)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 1.0 in stage 251.0 (TID 2179) (<ip>, executor 0, partition 1, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 0.0 in stage 251.0 (TID 2178) in 225 ms on <ip> (executor 0) (1/10)
21/03/30 23:19:25 INFO TaskSetManager: Starting task 2.0 in stage 251.0 (TID 2180) (<ip>, executor 0, partition 2, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:25 INFO TaskSetManager: Finished task 1.0 in stage 251.0 (TID 2179) in 181 ms on <ip> (executor 0) (2/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 3.0 in stage 251.0 (TID 2181) (<ip>, executor 0, partition 3, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 2.0 in stage 251.0 (TID 2180) in 226 ms on <ip> (executor 0) (3/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 4.0 in stage 251.0 (TID 2182) (<ip>, executor 0, partition 4, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 3.0 in stage 251.0 (TID 2181) in 187 ms on <ip> (executor 0) (4/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 5.0 in stage 251.0 (TID 2183) (<ip>, executor 0, partition 5, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 4.0 in stage 251.0 (TID 2182) in 180 ms on <ip> (executor 0) (5/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 6.0 in stage 251.0 (TID 2184) (<ip>, executor 0, partition 6, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 5.0 in stage 251.0 (TID 2183) in 179 ms on <ip> (executor 0) (6/10)
21/03/30 23:19:26 INFO TaskSetManager: Starting task 7.0 in stage 251.0 (TID 2185) (<ip>, executor 0, partition 7, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:26 INFO TaskSetManager: Finished task 6.0 in stage 251.0 (TID 2184) in 179 ms on <ip> (executor 0) (7/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 8.0 in stage 251.0 (TID 2186) (<ip>, executor 0, partition 8, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 7.0 in stage 251.0 (TID 2185) in 216 ms on <ip> (executor 0) (8/10)
21/03/30 23:19:27 INFO TaskSetManager: Starting task 9.0 in stage 251.0 (TID 2187) (<ip>, executor 0, partition 9, NODE_LOCAL, 4446 bytes) taskResourceAssignments Map(gpu -> [name: gpu, addresses: 0])
21/03/30 23:19:27 INFO TaskSetManager: Finished task 8.0 in stage 251.0 (TID 2186) in 179 ms on <ip> (executor 0) (9/10)
21/03/30 23:19:27 INFO TaskSetManager: Finished task 9.0 in stage 251.0 (TID 2187) in 179 ms on <ip> (executor 0) (10/10)
21/03/30 23:19:27 INFO TaskSchedulerImpl: Removed TaskSet 251.0, whose tasks have all completed, from pool
21/03/30 23:19:27 INFO DAGScheduler: ShuffleMapStage 251 (collect at ClusteringMetrics.scala:102) finished in 1.934 s
21/03/30 23:19:27 INFO DAGScheduler: looking for newly runnable stages
Specs
I am using an AWS EC2 g4dn machine.
GPU: TU104GL [Tesla T4]
15109MiB
Driver Version: 460.32.03
CUDA Version: 11.2
1 worker: 1 core, 7GB of memory.