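So, I've run into a problem while trying to follow this tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html
While I'm able to run the training script on the CPU, I can't get it to work with my GPU at all. Specifically, it is model_main_tf2.py (under the "Training Custom Object Detector" section of the tutorial) that is giving me trouble. I also added the line "tf.debugging.set_log_device_placement(True)" to get a more complete log.
The weird thing is that it detects the GPU at the start of execution and (as far as I understand) uses it for some tasks, but at some point it switches to the CPU without any error. That part of the log is at the bottom of the post.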
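Some system, hardware and software specs:
OS: Ubuntu 10.044
GPU: GTX 1660 Ti Mobile
nvidia-smi output: | NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4 |
tensorflow-gpu version: 2.5.0
tensorflow version: 2.5.0
Python version: 3.9.5
Log where the script switches to using the CPU: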
2021-07-26 17:00:34.027672: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028022: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028148: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028357: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028418: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028506: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028683: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028741: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028817: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0
WARNING:tensorflow:From /home/[user]/miniconda3/envs/tensorflowGPU4/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W0726 17:00:34.090738 140198244909888 deprecation.py:330] From /home/[user]/miniconda3/envs/tensorflowGPU4/lib/python3.9/site-packages/object_detection/model_lib_v2.py:557: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
2021-07-26 17:00:34.091483: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op HashTableV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.091597: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op LookupTableImportV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.091910: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op HashTableV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.091969: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op LookupTableImportV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.092269: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op HashTableV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.092327: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op LookupTableImportV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.092609: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op HashTableV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.092666: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op LookupTableImportV2 in device /job:localhost/replica:0/task:0/device:CPU:0
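A log like the one above gets long quickly, so it can help to tally which ops landed on which device rather than reading it line by line. The following is a minimal sketch, assuming the log was produced with tf.debugging.set_log_device_placement(True) and has the "Executing op ... in device ..." shape shown above. The function name summarize_placement and the sample_log string are hypothetical and only serve as illustration; the sample lines are copied from the log in this post.

```python
import re
from collections import Counter, defaultdict

# Matches placement lines such as:
# "... Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0"
PLACEMENT_RE = re.compile(r"Executing op (\S+) in device \S*/device:(\w+:\d+)")


def summarize_placement(log_text):
    """Return {device: Counter({op_name: count})} for every placement line."""
    summary = defaultdict(Counter)
    for line in log_text.splitlines():
        match = PLACEMENT_RE.search(line)
        if match:
            op_name, device = match.groups()
            summary[device][op_name] += 1
    return dict(summary)


# Hypothetical sample: four lines copied from the log in this post.
sample_log = """\
2021-07-26 17:00:34.027672: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.028022: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2021-07-26 17:00:34.091483: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op HashTableV2 in device /job:localhost/replica:0/task:0/device:CPU:0
2021-07-26 17:00:34.091597: I tensorflow/core/common_runtime/eager/execute.cc:733] Executing op LookupTableImportV2 in device /job:localhost/replica:0/task:0/device:CPU:0
"""

placement = summarize_placement(sample_log)
for device, ops in sorted(placement.items()):
    print(device, dict(ops))
```

Run against the full log file instead of sample_log, this would show at a glance whether anything other than HashTableV2 and LookupTableImportV2 (the only CPU ops visible in the excerpt above) is being placed on the CPU.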
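If anyone could help, I'd greatly appreciate it. At this point I've pretty much scoured the internet for anyone with the same problem, without any luck :/
Edit:
Just an update to my question. I've tried the exact same setup, with the same OS, graphics driver, CUDA version and all, on my old PC with a GTX 1060, but I still get the same behavior. So I'd guess this is just a configuration issue or a misunderstanding somewhere?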
1 Answer
Try changing the training batch size. Try a power of 2, for example: 2, 4, 8, 16, ...
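As a rough sketch of how the batch size could be changed: in the TensorFlow Object Detection API the training batch size is normally set by the batch_size field inside the train_config block of pipeline.config. The helper below edits that text directly with the standard library. The function name set_train_batch_size and the sample_config fragment are hypothetical, and the code assumes the first batch_size after "train_config" is the training one.

```python
import re


def set_train_batch_size(config_text, batch_size):
    """Return config_text with the first batch_size after 'train_config' replaced."""
    # Accept only positive powers of 2 (2, 4, 8, 16, ...; 1 also passes the bit test).
    if batch_size < 1 or batch_size & (batch_size - 1):
        raise ValueError("batch_size must be a positive power of 2")
    start = config_text.find("train_config")
    if start == -1:
        raise ValueError("no train_config block found")
    head, tail = config_text[:start], config_text[start:]
    # Replace only the first occurrence so a later eval batch size is left alone.
    new_tail, count = re.subn(
        r"(\bbatch_size\s*:\s*)\d+", rf"\g<1>{batch_size}", tail, count=1
    )
    if count == 0:
        raise ValueError("no batch_size field found after train_config")
    return head + new_tail


# Hypothetical fragment in the style of a pipeline.config file.
sample_config = """\
train_config {
  batch_size: 6
  num_steps: 1000
}
eval_config {
  batch_size: 1
}
"""

updated = set_train_batch_size(sample_config, 8)
print(updated)
```

Editing the file by hand works just as well; the point is only that the value to change lives in train_config, not in the training script itself.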