vllm --tensor-parallel-size 2 fails to load on GCP

8qgya5xd · posted 2 months ago in Other

Hi,

I'm trying to set up vLLM with Mixtral 8x7B on GCP. I have a VM with two A100 80GB GPUs and the following setup:
Docker image: vllm/vllm-openai:v0.3.0
Model: mistralai/Mixtral-8x7B-Instruct-v0.1
The command I run inside the VM:
python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --port 8888
After a while it produces this output:

File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 1858, in softmax
    ret = input.softmax(dim, dtype=dtype)
RuntimeError: CUDA error: invalid device function

nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   32C    P0    62W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   31C    P0    61W / 400W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

What is going wrong? Is this a bug in vLLM?
Additional diagnostics:

  • Mistral Instruct 7B fails with the same error.
  • Without tensor parallelism it works. (Not an option for 8x7B, since it doesn't fit on a single GPU.)
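
A quick sanity check for "invalid device function" (just a sketch; run it inside the same container/environment that launches vLLM, and it assumes torch imports cleanly) is to compare the GPU's compute capability with the architectures the installed PyTorch build targets, since this error typically means a kernel was compiled for an architecture the GPU/driver cannot execute:

# compute capability of GPU 0 (an A100 should report (8, 0))
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"
# architectures the installed PyTorch build was compiled for (should include sm_80)
python3 -c "import torch; print(torch.cuda.get_arch_list())"
# NCCL version PyTorch was built against
python3 -c "import torch; print(torch.cuda.nccl.version())"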

deikduxw1#

I hit the same problem with the qwen72b model.

bksxznpy2#

I ran into the same problem earlier with CodeLlama34b-Python-hf.

3j86kqsm3#

Did you manage to solve this? I can't run any model with a TP value greater than 1.

ffvjumwh4#

Possibly related to #4431? I finally got --tensor-parallel-size 2 working, and it has been reliable after testing with a number of models.

jogvjijk5#

@chrisbraddock Could you post minimal working code, please? And also, are you running in the official vLLM docker container? If not, how did you install vLLM (from source, from pypi)? Are you running locally, or on a cloud instance?

sr4lhrrt6#

@chrisbraddock Could you post minimal working code, please? And also, are you running in the official vLLM docker container? If not, how did you install vLLM (from source, from pypi)? Are you running locally, or on a cloud instance?
@RomanKoshkin I've tried a few ways. What I have working now is pip installing the 0.4.2 tag. I have it broken into a few scripts, so this will look a little strange, but it's copy/paste:

# create conda env
export ENV_NAME=vllm-pip-install
conda create --name ${ENV_NAME} python=3.8

# activate the conda env ... not scripted

# install vLLM
export TAG=0.4.2
pip install -vvv vllm==${TAG}

# start Ray - https://github.com/vllm-project/vllm/issues/4431#issuecomment-2084839647
export NUM_CPUS=10
ray start --head --num-cpus=$NUM_CPUS

# start vLLM
# model defaults
export DTYPE=auto
export QUANT=gptq_marlin

export NUM_GPUS=2

# this is the line that fixed my CUDA issues:
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH

export MODEL=facebook/opt-125m

# start OpenAI compatible server
#
# https://docs.vllm.ai/en/latest/models/engine_args.html
python -m vllm.entrypoints.openai.api_server \
    --model $MODEL \
    --dtype $DTYPE \
    --tensor-parallel-size $NUM_GPUS \
    --quantization $QUANT
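
Once the server reports it is ready, a quick smoke test against the OpenAI-compatible endpoint looks roughly like this (just a sketch; it assumes the default port 8000, since the script above doesn't pass --port, and the model name has to match the --model value):

# list the models the server is serving
curl http://localhost:8000/v1/models
# send a small completion request
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "facebook/opt-125m", "prompt": "San Francisco is a", "max_tokens": 16}'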

wz3gfoph7#

@chrisbraddock I got it working in a very similar way (I described it here). The key is to run ray in a separate terminal session and to specify LD_LIBRARY_PATH correctly.
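
In shell terms, the two pieces that mattered look roughly like this (a sketch; the NCCL path is the same one used in the script above):

# terminal session 1: start Ray before launching vLLM
ray start --head
ray status                                   # confirm the head node is up

# terminal session 2: point the loader at the NCCL copy vLLM downloaded, then launch as usual
ls $HOME/.config/vllm/nccl/cu12/             # should contain libnccl.so.*
export LD_LIBRARY_PATH=$HOME/.config/vllm/nccl/cu12:$LD_LIBRARY_PATH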

zlwx9yxi8#

@chrisbraddock I got it working in a very similar way (I described it here).
@RomanKoshkin I definitely drew on some of your information. I didn't fully understand how you were using the library, so I ended up with the path modification instead.
Next I'll re-enable Flash Attention and see whether anything breaks. I think that's my last outstanding issue.
