I have a container that loads a PyTorch model. Every time I try to start it, I get this error:
Traceback (most recent call last):
File "server/start.py", line 166, in <module>
start()
File "server/start.py", line 94, in start
app.register_blueprint(create_api(), url_prefix="/api/1")
File "/usr/local/src/skiff/app/server/server/api.py", line 30, in create_api
atomic_demo_model = DemoModel(model_filepath, comet_dir)
File "/usr/local/src/comet/comet/comet/interactive/atomic_demo.py", line 69, in __init__
model = interactive.make_model(opt, n_vocab, n_ctx, state_dict)
File "/usr/local/src/comet/comet/comet/interactive/functions.py", line 98, in make_model
model.to(cfg.device)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
param.data = fn(param.data)
File "/usr/local/lib/python3.7/site-packages/torch/nn/modules/module.py", line 379, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
_check_driver()
File "/usr/local/lib/python3.7/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
I know that nvidia-docker2 is working:
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Jul 16 22:09:40 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:1A:00.0 Off | N/A |
| 0% 44C P0 72W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:1B:00.0 Off | N/A |
| 0% 44C P0 66W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 208... Off | 00000000:1E:00.0 Off | N/A |
| 0% 44C P0 48W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GeForce RTX 208... Off | 00000000:3E:00.0 Off | N/A |
| 0% 41C P0 54W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 GeForce RTX 208... Off | 00000000:3F:00.0 Off | N/A |
| 0% 42C P0 48W / 260W | 0MiB / 10989MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
| 5 GeForce RTX 208... Off | 00000000:41:00.0 Off | N/A |
| 0% 42C P0 1W / 260W | 0MiB / 10989MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
However, I keep getting the error above.
I have tried the following:
1. Setting "default-runtime": "nvidia" in /etc/docker/daemon.json
2. Using docker run --runtime=nvidia <IMAGE_ID>
3. Adding the following variables to my Dockerfile:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
LABEL com.nvidia.volumes.needed="nvidia_driver"
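For reference, the daemon.json change from step 1 normally looks like the sketch below (the "path" value assumes the standard nvidia-container-runtime install location; a Docker daemon restart is required afterwards):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```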
I expect this container to run; we have a working version in production without these problems. I know Docker can find the drivers, as the output above shows. Any ideas?
5 Answers

d4so4syb1#
For Docker to use the host's GPU driver and the GPUs, a few steps need to be performed … running the container with the --gpus flag (as described in the link above). I'm guessing you have already completed the first three points, since nvidia-docker2 is working. So, given that there is no --gpus flag in your run command, that is likely the problem. I usually run my containers with the following command (-it just makes the container interactive and starts a bash environment).
pod7payv2#
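The run command this answer alludes to would presumably look something like the sketch below. The image name is a placeholder, and the --gpus flag requires Docker 19.03 or newer:

```shell
# "my-pytorch-image" is a placeholder; substitute your actual image name.
# --gpus all exposes every host GPU to the container; --gpus 0 would expose
# only the first GPU.
docker run --rm -it --gpus all my-pytorch-image \
    python -c "import torch; print(torch.cuda.is_available())"
```

If the container is set up correctly, the inline Python check prints True; on the broken setup described in the question it would raise the same "Found no NVIDIA driver" assertion.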
I was getting the same error. After trying several solutions, I found the fix below.
biswetbf3#
In my case, I was building from a plain ubuntu base Docker image. Switching to a Docker base image provided by NVIDIA solved the problem for me:
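A minimal sketch of that switch is below. The specific CUDA tag is an assumption; pick one that matches your host driver and PyTorch build:

```dockerfile
# Before (PyTorch cannot see the driver libraries):
# FROM ubuntu:18.04

# After: start from an NVIDIA-provided CUDA base image instead.
FROM nvidia/cuda:10.0-cudnn7-runtime-ubuntu18.04

# Install Python and PyTorch on top (plus the rest of your dependencies).
RUN apt-get update && apt-get install -y python3-pip \
    && pip3 install torch
```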
vpfxa7rd4#
If you are running the solution on a GPU-backed AWS EC2 instance and are using the EKS-optimized accelerated AMI (as was the case for us), there is no need to set the runtime to nvidia yourself, because that is the accelerated AMI's default runtime (configured in /etc/systemd/system/docker.service.d/nvidia-docker-dropin.conf).
All that is needed is to set these two environment variables, as suggested by Chirag in the answer above and here (the NVIDIA Container Toolkit user guide):
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility or -e NVIDIA_DRIVER_CAPABILITIES=all
-e NVIDIA_VISIBLE_DEVICES=all
Before finding the final solution, I also tried setting the runtime in daemon.json. For one thing, the AMI we use ships no daemon.json file; it contains a key.json file instead. I tried setting the runtime in both files, but restarting Docker always overwrote the changes in key.json, or simply deleted the daemon.json file.
esyap4oy5#
Just use "docker run --gpus all": add "--gpus all" or "--gpus 0" to your command!