I am trying to run mistralai/Mixtral-8x7B-Instruct-v0.1 on a server with two A100s (80 GB of GPU RAM in total).

$ python -m vllm.entrypoints.openai.api_server \
	--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
	--tensor-parallel-size 2

vLLM appears to use all of the GPU memory and then throws a CUDA out of memory error.
Here is the nvidia-smi output.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
| N/A   25C    P0              54W / 250W |  40327MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   25C    P0              50W / 250W |  40327MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    209623      C   ray::RayWorkerVllm                        40314MiB |
|    1   N/A  N/A    209624      C   ray::RayWorkerVllm                        40314MiB |
+---------------------------------------------------------------------------------------+

Here is the error message.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 10.81 MiB is free. Including non-PyTorch memory, this process has 39.37 GiB memory in use. Of the allocated memory 38.78 GiB is allocated by PyTorch, and 17.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When I run the same model with Ollama, it uses only 26 GB of GPU RAM. Here is the nvidia-smi output.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
| N/A   24C    P0              35W / 250W |  26591MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   23C    P0              30W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    210236      C   /bin/ollama                               26578MiB |
+---------------------------------------------------------------------------------------+

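Also, according to their model page, this model requires 48 GB of RAM.
This may be an apples-to-oranges comparison, but does mistralai/Mixtral-8x7B-Instruct-v0.1, a 46.7B-parameter model, really need more than 80 GB of RAM, or is this a bug or a configuration mistake?
I repeated the experiment with vLLM versions 0.2.6 and 0.2.7 and with various optional arguments (such as --enforce-eager), and the result did not change.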
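
A rough back-of-the-envelope check may help with the question above. The sketch below estimates the memory taken by the weights alone, starting from the 46.7B parameter count quoted in the question. The bits-per-parameter figures are assumptions, not measurements: 16 bits for an unquantized checkpoint, and about 4.5 bits for the kind of quantized build that Ollama typically serves. The helper name `weight_gib` is hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the weights in GiB (no KV cache, no activations)."""
    return num_params * bits_per_param / 8 / GIB


PARAMS = 46.7e9  # parameter count quoted in the question

# Both precision figures below are assumptions made for illustration.
for label, bits in [("16-bit, assumed unquantized", 16),
                    ("4.5-bit, assumed quantized", 4.5)]:
    print(f"{label}: {weight_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions, the 16-bit weights alone come to roughly 87 GiB, which is more than the two 40 GB cards offer together, while a build at about 4.5 bits per parameter lands near 24 GiB, in the same range as the 26 GB observed with Ollama. If that is what is happening here, the two numbers are not comparable because the two tools are loading the model at different precisions.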

8 answers

ubby3x7f1#

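I ran into the same problem on an A30 GPU.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 23.50 GiB of which 42.06 MiB is free. Including non-PyTorch memory, this process has 23.40 GiB memory in use. Of the allocated memory 22.99 GiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Does anyone know how to reduce the batch size?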

pexxcrt22#

I hit the same problem with 2x A100 (40GB).

o2g1uqev3#

Same problem here. I have 2x A100 80GB but cannot load Mixtral-8x7B-Instruct-v0.1.

ttp71kqs4#

Same problem here with an RTX 4090.

3phpmpom5#

I hit the same problem with 2x NVIDIA L4 (48GB).

q7solyqu6#

Same problem here with 4x A100.

egdjgwm87#

Same problem. Is there still no solution?

lfapxunr8#

Lowering the GPU memory utilization worked for me (8x A800 80GB).

python -m vllm.entrypoints.openai.api_server --model /Qwen-7B-Chat --dtype bfloat16 --api-key token-abc123 --trust-remote-code --gpu-memory-utilization 0.3 --max-model-len 4096
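
For readers wondering why this flag matters: as far as I understand, vLLM claims a fraction of each GPU's memory up front (the `--gpu-memory-utilization` value, 0.9 by default) and uses whatever is left after the weights for the KV cache, so near-full usage in nvidia-smi is expected. The sketch below is plain arithmetic under stated assumptions (weights split evenly across tensor-parallel ranks, activation and CUDA context overhead ignored). `kv_cache_budget_gib` is a hypothetical helper, not a vLLM API, and the weight sizes are rough guesses.

```python
def kv_cache_budget_gib(gpu_total_gib: float,
                        utilization: float,
                        weights_total_gib: float,
                        tensor_parallel: int = 1) -> float:
    """Memory left for the KV cache on one GPU; negative means the weights do not fit."""
    per_gpu_weights = weights_total_gib / tensor_parallel  # assumes an even split
    return gpu_total_gib * utilization - per_gpu_weights


# Setup from the question: 39.39 GiB per GPU (from the error message), 2 GPUs,
# and about 87 GiB of 16-bit weights (an assumption, not a measured value).
mixtral = kv_cache_budget_gib(39.39, 0.90, 87.0, tensor_parallel=2)

# Setup from this answer: one nominal 80 GiB GPU at utilization 0.3, with
# about 14 GiB assumed for a 7B model in bfloat16.
qwen = kv_cache_budget_gib(80.0, 0.30, 14.0)

print(f"Mixtral on 2x A100-40GB: {mixtral:.1f} GiB left for KV cache")
print(f"Qwen-7B at 0.3 utilization: {qwen:.1f} GiB left for KV cache")
```

If these assumptions hold, the budget for the original 2x A100-40GB setup is negative, so no value of the flag would make unquantized Mixtral fit there, whereas the 7B model in this answer still has room at 0.3. This does not necessarily explain every report in the thread.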
