PyTorch says CUDA is not available (on Ubuntu)

Posted by 9njqaruj on 2022-11-22 in Linux

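I am trying to run PyTorch on my laptop. It is an older model, but it does have an Nvidia graphics card. I realize it is probably not powerful enough for real machine learning, but I am doing this to learn the process of installing CUDA.
I followed the steps in the installation guide for Ubuntu 18.04 (my specific distribution is Xubuntu).
My graphics card is a GeForce 845M, verified with lspci | grep nvidia: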

01:00.0 3D controller: NVIDIA Corporation GM107M [GeForce 845M] (rev a2)
01:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)

I also have gcc 7.5 installed, verified with gcc --version:

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have the correct kernel headers installed, verified by trying to install them with sudo apt-get install linux-headers-$(uname -r):

Reading package lists... Done
Building dependency tree       
Reading state information... Done
linux-headers-4.15.0-106-generic is already the newest version (4.15.0-106.107).

I then followed the installation instructions using the local .deb for version 10.1.
Now, when I run nvidia-smi, I get:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce 845M        On   | 00000000:01:00.0 Off |                  N/A |
| N/A   40C    P0    N/A /  N/A |     88MiB /  2004MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       982      G   /usr/lib/xorg/Xorg                            87MiB |
+-----------------------------------------------------------------------------+

Running nvcc -V gives:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

I then carried out the post-installation instructions from section 6.1, so echo $PATH now looks like this:

/home/isaek/anaconda3/envs/stylegan2_pytorch/bin:/home/isaek/anaconda3/bin:/home/isaek/anaconda3/condabin:/usr/local/cuda-10.1/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

echo $LD_LIBRARY_PATH looks like this:

/usr/local/cuda-10.1/lib64

My /etc/udev/rules.d/40-vm-hotadd.rules file looks like this:

# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"

LABEL="vm_hotadd_apply"

# Memory hotadd request

# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"

LABEL="vm_hotadd_end"

After all of this, I even compiled and ran the samples. ./deviceQuery returns:

./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 845M"
  CUDA Driver Version / Runtime Version          10.1 / 10.1
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 2004 MBytes (2101870592 bytes)
  ( 4) Multiprocessors, (128) CUDA Cores/MP:     512 CUDA Cores
  GPU Max Clock rate:                            863 MHz (0.86 GHz)
  Memory Clock rate:                             1001 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

And ./bandwidthTest returns:

[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce 845M
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(GB/s)
   32000000         11.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(GB/s)
   32000000         11.8

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(GB/s)
   32000000         14.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

But after all of this, the following Python snippet (run in a conda environment with all dependencies installed):

import torch
torch.cuda.is_available()

returns False.
Does anyone know how to fix this? I tried adding /usr/local/cuda-10.1/bin to /etc/environment like this:

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
PATH=$PATH:/usr/local/cuda-10.1/bin

I restarted the terminal, but that did not fix it. I really don't know what else to try.

EDIT - results of collect_env, for @kHarshit:

Collecting environment information...
PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
CMake version: Could not collect

Python version: 3.6
Is CUDA available: No
CUDA runtime version: 10.1.243
GPU models and configuration: GPU 0: GeForce 845M
Nvidia driver version: 418.87.00
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] numpy==1.18.5
[pip] pytorch-ranger==0.1.1
[pip] stylegan2-pytorch==0.12.0
[pip] torch==1.5.0
[pip] torch-optimizer==0.0.1a12
[pip] torchvision==0.6.0
[pip] vector-quantize-pytorch==0.0.2
[conda] numpy                     1.18.5                   pypi_0    pypi
[conda] pytorch-ranger            0.1.1                    pypi_0    pypi
[conda] stylegan2-pytorch         0.12.0                   pypi_0    pypi
[conda] torch                     1.5.0                    pypi_0    pypi
[conda] torch-optimizer           0.0.1a12                 pypi_0    pypi
[conda] torchvision               0.6.0                    pypi_0    pypi
[conda] vector-quantize-pytorch   0.0.2                    pypi_0    pypi

tkqqtvp1 #1

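PyTorch does not use the system's CUDA library. When you install PyTorch from the precompiled binaries using pip or conda, it ships with its own locally installed copy of the specified version of the CUDA library. In fact, you do not even need CUDA installed on your system to use PyTorch with CUDA support.
There are two scenarios that could have caused your issue.
1. You installed the CPU-only version of PyTorch. In this case PyTorch was not compiled with CUDA support, so it cannot use CUDA.
2. You installed the CUDA 10.2 version of PyTorch. In this case the problem is that your graphics card currently uses the 418.87 driver, which only supports up to CUDA 10.1. The two possible fixes are to install an updated driver (version >= 440.33, according to Table 2) or to install a version of PyTorch compiled against CUDA 10.1.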
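The driver check in the second scenario can be sketched in a few lines of Python. This is only an illustration: the names `MIN_LINUX_DRIVER`, `parse_version` and `driver_supports` are made up for this example, and the table holds just three minimum Linux driver versions as I recall them from NVIDIA's release notes, so verify them against the current table before relying on them.

```python
# Minimum Linux driver version for a few CUDA toolkit releases.
# Incomplete on purpose; values should be checked against NVIDIA's table.
MIN_LINUX_DRIVER = {
    "10.0": (410, 48),
    "10.1": (418, 39),
    "10.2": (440, 33),
}


def parse_version(text):
    """Turn '418.87.00' into (418, 87, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(driver_version, cuda_version):
    """Return True if the driver is new enough for the given CUDA version."""
    required = MIN_LINUX_DRIVER[cuda_version]
    return parse_version(driver_version)[:len(required)] >= required


# Values from the question: driver 418.87.00, PyTorch built for CUDA 10.2.
print(driver_supports("418.87.00", "10.2"))  # False: 10.2 needs >= 440.33
print(driver_supports("418.87.00", "10.1"))  # True
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "418.9" sorting after "418.87".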
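To determine the appropriate command to use when installing PyTorch, you can use the widget in the "Install PyTorch" section at pytorch.org. Just select the appropriate operating system, package manager, and CUDA version, then run the recommended command.
In your case, one solution is to use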

conda install pytorch torchvision cudatoolkit=10.1 -c pytorch

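which explicitly tells conda that you want to install the version of PyTorch compiled against CUDA 10.1.
For more information about the compatibility of PyTorch's CUDA builds with drivers and hardware, see this answer.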

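Edit: after you added the output of collect_env, we can see that the problem is that you have the CUDA 10.2 version of PyTorch installed. Based on that, an alternative solution is to update the graphics driver, as described in item 2 and the linked answer.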
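To tell the two scenarios apart from inside Python, something like the following rough sketch may help. It relies on `torch.version.cuda`, which is `None` for builds without CUDA support, and on `torch.cuda.is_available()`; the helper name `diagnose` and its messages are invented for this example.

```python
def diagnose(build_cuda, cuda_available):
    """Classify a PyTorch install from two values.

    build_cuda     -- value of torch.version.cuda (None for CPU-only builds)
    cuda_available -- value of torch.cuda.is_available()
    """
    if build_cuda is None:
        return "CPU-only build: reinstall a CUDA-enabled PyTorch"
    if not cuda_available:
        return ("built for CUDA %s but no usable GPU: check that the driver "
                "supports this CUDA version" % build_cuda)
    return "CUDA %s build with a usable GPU" % build_cuda


try:
    import torch
except ImportError:
    print("PyTorch is not installed in this environment")
else:
    print(diagnose(torch.version.cuda, torch.cuda.is_available()))
```

With the values from the question (built for CUDA 10.2, `is_available()` returning False) this would report the second scenario.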

xuo3flqw #2

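TL;DR

1. Install the NVIDIA toolkit provided by Canonical or by NVIDIA's third-party PPA.
2. Reboot the workstation.
3. Create a clean Python virtual environment (or reinstall all CUDA-dependent packages).

Description

First install the NVIDIA CUDA Toolkit provided by Canonical: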

sudo apt install -y nvidia-cuda-toolkit

or follow the NVIDIA developers instructions:

# ENVARS ADDED **ONLY FOR READABILITY**
NVIDIA_CUDA_PPA=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/
NVIDIA_CUDA_PREFERENCES=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
NVIDIA_CUDA_PUBKEY=https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub

# Add NVIDIA Developers 3rd-Party PPA
sudo wget ${NVIDIA_CUDA_PREFERENCES} -O /etc/apt/preferences.d/nvidia-cuda
sudo apt-key adv --fetch-keys ${NVIDIA_CUDA_PUBKEY}
echo "deb ${NVIDIA_CUDA_PPA} /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list

# Install development tools
sudo apt update
sudo apt install -y cuda

Then reboot the OS so that the kernel loads with the NVIDIA drivers.

Create an environment using your favourite manager (conda, venv, etc.):

conda create -n stack-overflow pytorch torchvision
conda activate stack-overflow

Or reinstall pytorch and torchvision into the existing environment:

conda activate stack-overflow
conda install --force-reinstall pytorch torchvision

Otherwise the NVIDIA CUDA C/C++ bindings may not be detected correctly.
Finally, make sure CUDA is detected correctly:

(stack-overflow)$ python3 -c 'import torch; print(torch.cuda.is_available())'
True
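If the one-liner prints True and you want a slightly stronger check, a script along these lines also runs a tiny computation on the GPU. This is a sketch, not part of the steps above; the function name `cuda_report` and the dictionary keys are made up for this example.

```python
def cuda_report():
    """Collect basic CUDA facts from PyTorch without failing if it is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Run a tiny computation on the GPU to confirm kernels actually launch.
        x = torch.ones(4, device="cuda")
        report["gpu_sum"] = float(x.sum().item())
    return report


for key, value in cuda_report().items():
    print(key, "=", value)
```

If the GPU computation raises an error even though `cuda_available` is True, the installed binary may not include kernels for your card's compute capability.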


aij0ehis #3

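In my case, simply rebooting the machine made the GPU active again. The initial message I got was that the GPU was currently in use by another application, but when I looked at nvidia-smi I saw nothing. So no dependencies changed; it just started working again.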

flmtquvp #4

Another possible cause is that the environment variable CUDA_VISIBLE_DEVICES was not set correctly before installing PyTorch.
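A quick way to see what the variable looks like in the process that launches Python is sketched below. The CUDA runtime reads the variable when it initializes in a process, so the value that matters is the one in that process's environment. The helper name `describe_visible_devices` is made up for this example.

```python
import os


def describe_visible_devices(environ=None):
    """Explain what CUDA_VISIBLE_DEVICES means for the given environment."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "unset: all GPUs are visible"
    if value.strip() == "":
        return "empty: all GPUs are hidden"
    return "restricted to: " + value


print(describe_visible_devices())
```

An empty value is the case to watch for, since it hides every GPU and makes `torch.cuda.is_available()` return False even with a working driver.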

p1tboqfb #5

In my case, the following worked.
Remove the CUDA drivers:

sudo apt-get remove --purge 'nvidia*'

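Then get the exact installation script for the driver, depending on your distribution and system, from the following link: https://developer.nvidia.com/cuda-downloads?target_os=Linux
In my case it was Debian on x64, so I did: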

wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda

And now nvidia-smi works as expected!
I hope this helps.

7bsow1i6 #6

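You will see this problem if your CUDA version does not match the version PyTorch expects.
On Arch / Manjaro:

Do not update to a CUDA version newer than the one PyTorch expects. If PyTorch wants 11.6 and you have updated to 11.7, you will get an error message.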
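A mismatch like this can be spotted by comparing the toolkit release that nvcc reports with the version PyTorch was built for. The sketch below assumes a PyTorch build that links against the system toolkit (builds installed with pip or conda bundle their own CUDA libraries, so the comparison matters less there). The function names are made up for this example, and the sample line is the nvcc output from the question.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract 'major.minor' from the text printed by `nvcc -V`, or None."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return "%s.%s" % match.groups() if match else None


def versions_match(system_cuda, pytorch_cuda):
    """Compare two 'major.minor' version strings numerically."""
    def as_tuple(version):
        return tuple(int(part) for part in version.split("."))
    return as_tuple(system_cuda) == as_tuple(pytorch_cuda)


sample = "Cuda compilation tools, release 10.1, V10.1.243"
system_cuda = parse_nvcc_release(sample)
print(system_cuda)                          # 10.1
print(versions_match(system_cuda, "10.2"))  # False
```

In a real check, the second argument would come from `torch.version.cuda` and the first from running `nvcc -V` on the machine.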

yqhsw0fo #7

Make sure to set os.environ['CUDA_VISIBLE_DEVICES'] = '0' after if __name__ == "__main__":. Your code should then look like this:

import torch
import os

if __name__ == "__main__":
    # Expose only GPU 0; this must run before CUDA is first initialized.
    os.environ['CUDA_VISIBLE_DEVICES'] = '0'
    print(torch.cuda.is_available())  # True, as reported by the answerer
    ...
