How do I get Docker to recognize the NVIDIA driver?

4smxwvx5 · posted 2022-12-18 in Docker

I have a container that loads a PyTorch model, and every time I try to start it I get this error:

Traceback (most recent call last):
  File "server/start.py", line 166, in <module>
    start()
  File "server/start.py", line 94, in start
    app.register_blueprint(create_api(), url_prefix="/api/1")
  File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
    atomic_demo_model = DemoModel(model_filepath, comet_dir)
  File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
    model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
  File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
    model.to(cfg.device)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

I know that nvidia-docker2 is working:

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1A:00.0 Off |                  N/A |
|  0%   44C    P0    72W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:1B:00.0 Off |                  N/A |
|  0%   44C    P0    66W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:1E:00.0 Off |                  N/A |
|  0%   44C    P0    48W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:3E:00.0 Off |                  N/A |
|  0%   41C    P0    54W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce RTX 208...  Off  | 00000000:3F:00.0 Off |                  N/A |
|  0%   42C    P0    48W / 260W |      0MiB / 10989MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce RTX 208...  Off  | 00000000:41:00.0 Off |                  N/A |
|  0%   42C    P0     1W / 260W |      0MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, I keep getting the error above.
I have tried the following:
1. Setting "default-runtime": "nvidia" in /etc/docker/daemon.json
2. Using docker run --runtime=nvidia <IMAGE_ID>
3. Adding the following variables to my Dockerfile:

ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
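For the first attempt, the default-runtime key usually goes together with a runtimes entry in /etc/docker/daemon.json. A sketch of what that file commonly looks like (the binary path is the one installed by the nvidia-container-runtime package and may vary):

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

After editing the file, the Docker daemon has to be restarted for the change to take effect.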

I expect this container to run; we have a working version in production that does not have these problems. I know Docker can find the drivers, as the output above shows. Any ideas?


d4so4syb1#

For Docker to use the host's GPU driver and GPUs, a few steps are necessary.

  • Make sure the nvidia driver is installed on the host system
  • Follow the steps here to set up the nvidia container toolkit
  • Make sure CUDA and cuDNN are installed in the image
  • Run the container with the --gpus flag (as explained in the link above)
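The container-toolkit setup from the second bullet can be sketched as follows for Ubuntu/Debian. These are the repository URLs and package names from NVIDIA's install guide of that era; the nvidia-docker repository has since been deprecated in favor of the libnvidia-container one, so check the current guide before copying:

```shell
# Derive the distribution id (e.g. "ubuntu20.04") for the repo URL
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")

# Add NVIDIA's package repository and signing key
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L "https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list" | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install the toolkit and restart the daemon so it picks up the nvidia runtime
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```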

I assume you have already covered the first three points, since nvidia-docker2 is working. So, given that there is no --gpus flag in your run command, that is probably the problem.
I usually run my containers with the following command

docker run --name <container_name> --gpus all -it <image_name>

-it simply makes the container interactive and starts a bash shell.


pod7payv2#

I got the same error. After trying several solutions, I found that the following works

docker run -ti --runtime=nvidia -e NVIDIA_DRIVER_CAPABILITIES=compute,utility -e NVIDIA_VISIBLE_DEVICES=all <image_name>

biswetbf3#

In my case, I was running from a plain Ubuntu base Docker image, i.e.

FROM ubuntu

Switching to a base Docker image provided by NVIDIA solved the problem for me:

FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04
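A minimal Dockerfile along those lines might look like this. The Python packages and entry point are illustrative, not from the original post; note that the CUDA runtime image only provides the CUDA user-space libraries, while the driver itself is still injected from the host by the NVIDIA runtime:

```dockerfile
# CUDA user-space libraries come from the base image;
# the kernel driver is mounted in from the host at run time.
FROM nvidia/cuda:11.2.1-runtime-ubuntu20.04

RUN apt-get update && apt-get install -y --no-install-recommends python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install torch

COPY . /app
WORKDIR /app
CMD ["python3", "server/start.py"]
```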

vpfxa7rd4#

If you are running your solution on a GPU-backed AWS EC2 machine and using an EKS-optimized accelerated AMI (as we were), there is no need to set the runtime to nvidia yourself, because that is the accelerated AMI's default runtime.

  • ssh into the AWS machine
  • cat /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf to confirm it

All that is needed is to set these two environment variables, as Chirag suggested in the answer here (NVIDIA Container Toolkit user guide):

  • -e NVIDIA_DRIVER_CAPABILITIES=compute,utility or -e NVIDIA_DRIVER_CAPABILITIES=all
  • -e NVIDIA_VISIBLE_DEVICES=all

Before finding the final solution, I also tried setting the runtime in daemon.json. First of all, the AMI we use has no daemon.json file; it ships a key.json file instead. I tried setting the runtime in both files, but restarting Docker always overwrote the changes in key.json, or simply deleted the daemon.json file.
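For context, a systemd drop-in that makes nvidia the default runtime typically looks something like the sketch below. This is a generic illustration of the drop-in mechanism (an empty ExecStart= clears the unit's original command before the override re-sets it); the actual contents of the AMI's nvidia-docker-dropin.conf may differ:

```ini
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --default-runtime nvidia -H fd://
```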


esyap4oy5#

Just use docker run --gpus all, i.e. add --gpus all (or --gpus 0)!
