Your current environment
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
**OS: Ubuntu 20.04.6 LTS (x86_64)**
**GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0**
Clang version: Could not collect
CMake version: version 3.30.1
**Libc version: glibc-2.31**
Python version: 3.10.2 | packaged by conda-forge | (main, Feb 1 2022, 19:29:00) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-1068-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
GPU 2: NVIDIA A100-SXM4-40GB
GPU 3: NVIDIA A100-SXM4-40GB
GPU 4: NVIDIA A100-SXM4-40GB
GPU 5: NVIDIA A100-SXM4-40GB
GPU 6: NVIDIA A100-SXM4-40GB
GPU 7: NVIDIA A100-SXM4-40GB
Nvidia driver version: 535.183.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.2.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7V12 64-Core Processor
Stepping: 0
CPU MHz: 2445.440
BogoMIPS: 4890.88
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow: Mitigation; safe RET, no microcode
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pytorch-triton==3.0.0+dedb7bdf33
[pip3] torch==2.3.1
[pip3] torchaudio==2.4.0.dev20240722+cu124
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.1
[pip3] triton==2.3.1
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] pytorch-triton 3.0.0+dedb7bdf33 pypi_0 pypi
[conda] torch 2.3.1 pypi_0 pypi
[conda] torchaudio 2.4.0.dev20240722+cu124 pypi_0 pypi
[conda] torchvision 0.18.1 pypi_0 pypi
[conda] transformers 4.43.1 pypi_0 pypi
[conda] triton 2.3.1 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE SYS SYS SYS SYS SYS SYS 24-47 1 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 NODE NODE SYS SYS SYS SYS SYS SYS 24-47 1 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS SYS NODE NODE SYS SYS SYS SYS 0-23 0 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS SYS NODE NODE SYS SYS SYS SYS 0-23 0 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS NODE NODE SYS SYS 72-95 3 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS NODE NODE SYS SYS 72-95 3 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS SYS SYS SYS NODE NODE 48-71 2 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS SYS SYS SYS NODE NODE 48-71 2 N/A
NIC0 NODE NODE SYS SYS SYS SYS SYS SYS X NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE SYS SYS SYS SYS SYS SYS NODE X SYS SYS SYS SYS SYS SYS
NIC2 SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS X NODE SYS SYS SYS SYS
NIC3 SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS X NODE SYS SYS
NIC5 SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS NODE X SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS X NODE
NIC7 SYS SYS SYS SYS SYS SYS NODE NODE SYS SYS SYS SYS SYS SYS NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
🐛 Describe the bug
Unable to upgrade from v0.4 on Ubuntu 20.04 with glibc 2.31, which is likely the case for all Azure DGX VMs today. Azure has no immediate plans to move to 22.04. Everything works fine on a local 22.04 host.
Disabling tensor parallelism or Punica is not a viable workaround for now, since we need low-latency hot-swapping at scale.
Hypothesis: a vllm 0.5+ requirement (torch?) now expects GLIBC > 2.31, which breaks TP+Punica support on 20.04.
Repro steps, error, and traceback are below. Thanks for your support, cc: @njhill
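To confirm whether a host matches the Ubuntu 20.04 / glibc 2.31 environment described above, the runtime glibc version can be queried directly. This is a small illustrative helper, not part of vLLM; it assumes a glibc-based Linux host where `libc.so.6` exports `gnu_get_libc_version`.

```python
# Quick sanity check of the runtime glibc version on a glibc-based Linux host.
# Illustrative only; not part of vLLM.
import ctypes
import platform


def glibc_version() -> str:
    """Return the version string reported by the loaded C library."""
    libc = ctypes.CDLL("libc.so.6")
    libc.gnu_get_libc_version.restype = ctypes.c_char_p
    return libc.gnu_get_libc_version().decode()


if __name__ == "__main__":
    # platform.libc_ver() reports the same information without ctypes.
    print("ctypes:  ", glibc_version())       # e.g. "2.31" on Ubuntu 20.04
    print("platform:", platform.libc_ver())   # e.g. ("glibc", "2.31")
```

On the Azure DGX VMs described in this report, both calls should report 2.31, matching the `Libc version: glibc-2.31` line in the environment dump.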
3 answers
Answer 1
The glibc issue should be fixed by @tlrmchlsmth in #6517. Can you try with the latest version?
Answer 2
Thanks a lot @youkaichao, glad to see this addressed. Unfortunately, the issue persists (vllm==0.5.3.post1, torch==2.3.1).
Answer 3
Workaround (not ideal): #6517 plus aggressively reducing max_model_len (<8K tokens). At a minimum, error handling should be added here as an improvement.
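As a sketch of that workaround, a vLLM 0.5.x OpenAI-compatible server launch capping the context length might look like the following. The model name, port, and exact token limit are placeholders, not values from this report; `--tensor-parallel-size 8` matches the 8x A100 topology above, and `--enable-lora` keeps the Punica (LoRA serving) path active.

```shell
# Illustrative launch only; model, port, and limits are placeholders.
# Caps the context below the ~8K-token threshold mentioned above while
# keeping tensor parallelism and LoRA (Punica) serving enabled.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --tensor-parallel-size 8 \
    --enable-lora \
    --max-model-len 7936 \
    --port 8000
```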