TensorFlow in nvidia-docker: failed call to cuInit: CUDA_ERROR_UNKNOWN

Posted by plupiseo on 2023-01-25 in Docker

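I have been trying to get an application that relies on TensorFlow to work as a Docker container under nvidia-docker. I built my application on top of the tensorflow/tensorflow:latest-gpu-py3 image. I run my Docker container with the following command:
sudo nvidia-docker run -d -p 9090:9090 -v /src/weights:/weights myname/myrepo:mylabel
When I look at the logs through portainer, I see the following: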

2017-05-16 03:41:47.715682: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715896: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715948: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.715978: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.716002: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-05-16 03:41:47.718076: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_UNKNOWN
2017-05-16 03:41:47.718177: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: 1e22bdaf82f1
2017-05-16 03:41:47.718216: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: 1e22bdaf82f1
2017-05-16 03:41:47.718298: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 367.57.0
2017-05-16 03:41:47.718398: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  367.57  Mon Oct  3 20:37:01 PDT 2016
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) 
"""
2017-05-16 03:41:47.718455: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 367.57.0
2017-05-16 03:41:47.718484: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 367.57.0

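The container does seem to start properly, and my application appears to be running. When I send it prediction requests, the predictions come back correctly, but at the slow speed I would expect from running inference on the CPU, so I think it is fairly clear that the GPU is not being used for some reason. I also tried running nvidia-smi from inside the same container to make sure it sees my GPU, and these are the results: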

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K1             Off  | 0000:00:07.0     Off |                  N/A |
| N/A   28C    P8     7W /  31W |     25MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

I am certainly no expert in this area, but the GPU does appear to be visible from inside the container. Any ideas on how to get this working with TensorFlow?
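
The log above shows TensorFlow reading the driver version file and comparing the kernel module version with the libcuda version. As a minimal sketch of the first half of that check, the following extracts the version from text in the format quoted in the log. The function names are hypothetical, and the path /proc/driver/nvidia/version is assumed to be the file TensorFlow reports as the "driver version file" on Linux.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed location of the driver version file on Linux; it exists only
# while the NVIDIA kernel module is loaded.
DRIVER_VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_kernel_driver_version(contents: str) -> Optional[str]:
    """Return the version number from an 'NVRM version:' line, or None."""
    match = re.search(r"NVRM version:.*?\s(\d+\.\d+(?:\.\d+)*)\s", contents)
    return match.group(1) if match else None


def read_kernel_driver_version(path: Path = DRIVER_VERSION_FILE) -> Optional[str]:
    """Read and parse the version file; return None if it cannot be read."""
    try:
        return parse_kernel_driver_version(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("kernel module version:", read_kernel_driver_version())
```

Comparing this value with the driver version that nvidia-smi prints inside the container is one way to confirm that both see the same driver, as they do in the output above (367.57).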

q5lcpyga #1

I run TensorFlow on an Ubuntu 16.04 desktop.
A few days ago my code ran fine on the GPU, but today the following code could not find the GPU device:
import tensorflow as tf
from tensorflow.python.client import device_lib as _device_lib

with tf.Session() as sess:
    # List every device TensorFlow can see (CPU and GPU)
    local_device_protos = _device_lib.list_local_devices()
    print(local_device_protos)
    for x in local_device_protos:
        print(x.name)
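When I ran tf.Session(), I noticed the following error:
cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
I checked my NVIDIA driver in System Details, and used nvcc -V and nvidia-smi to check the driver, CUDA and cuDNN. Everything seemed fine.
Then I went to Additional Drivers to look at the driver details, where I found that many versions of the NVIDIA driver were listed and the latest one was selected. When I first installed the driver there was only one.
So I selected the older version and applied the change.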

Then I ran tf.Session() again and the problem was still there. I figured I should reboot my computer, and after the reboot the problem was gone.
sess = tf.Session()
2018-07-01 12:02:41.336648: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-07-01 12:02:41.464166: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-07-01 12:02:41.464482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.8225
pciBusID: 0000:01:00.0
totalMemory: 7.93GiB freeMemory: 7.27GiB
2018-07-01 12:02:41.464494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-01 12:02:42.308689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-01 12:02:42.308721: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-07-01 12:02:42.308729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-07-01 12:02:42.309686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7022 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability:
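
The device listing in this answer uses the TensorFlow 1.x API. In TensorFlow 2.x, where tf.Session is no longer available at the top level, a rough equivalent might look like the sketch below. It assumes TensorFlow 2.1 or later, and visible_gpu_names is a hypothetical helper name, not part of TensorFlow.

```python
from typing import List, Optional


def visible_gpu_names() -> Optional[List[str]]:
    """Return the names of the GPUs TensorFlow can see.

    Returns None when TensorFlow cannot be imported at all.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Assumes TF >= 2.1, where tf.config.list_physical_devices is available.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    names = visible_gpu_names()
    if names is None:
        print("TensorFlow is not installed")
    elif not names:
        print("TensorFlow is installed but sees no GPU")
    else:
        print("GPUs visible to TensorFlow:", names)
```

An empty list here, together with a working nvidia-smi, corresponds to the situation described in the question: the driver sees the GPU but TensorFlow does not.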

ee7vknir #2

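The problem may be related to file permissions on the JIT cache files created by the GPU. On Linux, the cache files are created in ~/.nv/ComputeCache by default. Setting a different directory for the JIT cache can fix the problem. Run

export CUDA_CACHE_PATH=/tmp/nvidia

before running anything on the GPU.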
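
If the environment variable cannot be exported in the shell (for example, when the process is started by a service manager), the same setting could be applied from Python before CUDA is initialized. The following is a minimal sketch under that assumption; use_cuda_cache_dir is a hypothetical helper, and the writability check is an addition that the answer itself does not mention.

```python
import os
import tempfile
from pathlib import Path


def use_cuda_cache_dir(cache_dir: str) -> str:
    """Create cache_dir if needed, check it is writable, and set CUDA_CACHE_PATH."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    # Creating a throwaway file raises an OSError early if the directory
    # is not writable by the current user.
    with tempfile.NamedTemporaryFile(dir=path):
        pass
    os.environ["CUDA_CACHE_PATH"] = str(path)
    return str(path)
```

Calling use_cuda_cache_dir("/tmp/nvidia") has to happen before TensorFlow, or any other library that initializes CUDA, is imported, because the variable is expected to be read when the driver starts up in the process.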

f3temu5u #3

I tried installing nvidia-modprobe but still got the same error; after that, a simple system reboot fixed it for me.
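
nvidia-modprobe is the helper that loads the NVIDIA kernel module and creates its device nodes, so one thing worth checking before and after a reboot is whether those nodes exist. The sketch below assumes a single-GPU Linux machine and the usual node names (nvidiactl, nvidia-uvm, nvidia0); whether a missing node is what caused the error in this answer is only a guess, and missing_nvidia_nodes is a hypothetical helper.

```python
from pathlib import Path
from typing import List

# Device nodes the CUDA driver normally needs on Linux (assumed names);
# nvidia0 is the node for the first GPU.
EXPECTED_NODES = ("nvidiactl", "nvidia-uvm", "nvidia0")


def missing_nvidia_nodes(dev_dir: str = "/dev") -> List[str]:
    """Return the expected NVIDIA device nodes that are absent from dev_dir."""
    root = Path(dev_dir)
    return [name for name in EXPECTED_NODES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_nvidia_nodes()
    print("missing device nodes:", ", ".join(missing) if missing else "none")
```

The same check can be run inside a container to see which device nodes were actually passed through to it.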

x8diyxa7 #4

In my case, this command failed:

docker run --gpus all --runtime=nvidia -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

Adding --privileged fixed the problem:

docker run --gpus all --runtime=nvidia --privileged -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
