[Bug]: VLLM usage on AWS Inferentia instances

kgqe7b3p · posted 2 months ago

Your current environment

See below for detailed setup and run script that I use.

🐛 Describe the bug
Hello, I am trying to deploy llama-8b with vLLM on an AWS Inferentia (inf2.8xlarge) instance. After many hacks and tiring attempts, I have gotten the vLLM server to start correctly. However, when I run inference on a simple "hi" prompt, the console prints a warning and I get nothing back from the LLM in the Gradio UI I set up. See the thread for the code-related details. I would appreciate it if someone could help me with the issues below. I am deploying with SkyPilot:

(task, pid=33413) INFO 06-21 09:15:21 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(task, pid=33413) INFO:     127.0.0.1:60198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(task, pid=33413) INFO 06-21 09:15:27 async_llm_engine.py:582] Received request cmpl-410ee0fe3db44e05a79d0112fb3ec571: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a great ai assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128009, 128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2025, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2294, 16796, 18328, 13, 128009, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271], lora_request: None.
(task, pid=33413) WARNING 06-21 09:15:27 scheduler.py:683] Input prompt (23 tokens) is too long and exceeds the capacity of block_manager

Here is the vLLM-specific setup on my instance:

. /etc/os-release
  sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
  deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
  EOF
  wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -

  sudo apt-get update -y

  # Install OS headers
  sudo apt-get install linux-headers-$(uname -r) -y

  # Install git
  sudo apt-get install git -y

  # Install Neuron Driver
  sudo apt-get install aws-neuronx-dkms=2.* -y

  # Install Neuron Runtime
  sudo apt-get install aws-neuronx-collectives=2.* -y
  sudo apt-get install aws-neuronx-runtime-lib=2.* -y

  # Install Neuron Tools
  sudo apt-get install aws-neuronx-tools=2.* -y

  # Add PATH
  export PATH=/opt/aws/neuron/bin:$PATH

  # Install Python venv
  sudo apt-get install -y python3.10-venv g++

  # Create Python venv
  python3.10 -m venv aws_neuron_venv_pytorch

  # Activate Python venv
  source aws_neuron_venv_pytorch/bin/activate

  # Install Jupyter notebook kernel
  pip install ipykernel
  python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
  pip install jupyter notebook
  pip install environment_kernels

  # Set pip repository pointing to the Neuron repository
  python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

  # Install wget, awscli
  python -m pip install wget
  python -m pip install awscli

  # Update Neuron Compiler and Framework
  python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx

  # Install vLLM from source
  git clone https://github.com/vllm-project/vllm.git
  # Create an empty __init__.py file in the neuron models directory
  touch ./vllm/model_executor/models/neuron/__init__.py
  cd vllm
  pip install -U -r requirements-neuron.txt
  pip install .

  # Install Gradio for web UI
  pip install gradio openai
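
As a quick sanity check, a minimal sketch like the following (run inside the venv created above) can confirm that the Neuron Python stack used by vLLM's neuron backend is importable:

  # Minimal import check for the Neuron stack installed above (run inside the venv).
  import torch
  import torch_neuronx          # from the torch-neuronx package
  import transformers_neuronx   # used by vLLM's neuron device backend

  print("torch", torch.__version__)
  print("Neuron Python stack imports OK")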

Here is how I run the server:

source aws_neuron_venv_pytorch/bin/activate
  echo 'Starting vllm api server...'
  export LD_LIBRARY_PATH="/opt/conda/lib/:$LD_LIBRARY_PATH"
  export PATH=/opt/aws/neuron/bin:$PATH
  export NEURON_RT_VISIBLE_CORES=0-1

  # NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
  python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --max-num-seqs 1 \
    --device neuron \
    --max-model-len 2048 \
    2>&1 | tee api_server.log &

  while ! grep -q 'Uvicorn running on' api_server.log; do
    echo 'Waiting for vllm api server to start...'
    sleep 5
  done

  echo 'Starting gradio server...'
  git clone https://github.com/vllm-project/vllm.git || true
  python vllm/examples/gradio_openai_chatbot_webserver.py \
    -m $MODEL_NAME \
    --port 8811 \
    --model-url http://localhost:8081/v1 \
    --stop-token-ids 128009,128001
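
For reference, the Gradio UI goes through the OpenAI-compatible endpoint started above. A minimal sketch for exercising that endpoint directly (assuming the openai>=1.0 Python client installed earlier, with MODEL_NAME substituted) looks like this:

  # Minimal direct test of the OpenAI-compatible endpoint started above.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8081/v1", api_key="EMPTY")
  resp = client.chat.completions.create(
      model="<MODEL_NAME>",  # substitute the same model name passed to the server
      messages=[
          {"role": "system", "content": "You are a great ai assistant."},
          {"role": "user", "content": "hi"},
      ],
      max_tokens=64,
      # vLLM accepts stop_token_ids through the client's extra_body passthrough
      extra_body={"stop_token_ids": [128009, 128001]},
  )
  print(resp.choices[0].message.content)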

The first thing that came to mind was the NEURON_RT_VISIBLE_CORES environment variable. I tried increasing its range beyond 0-1, for example to 0-3, but then the vLLM server failed and would not even start. This is on an inf2.8xlarge instance. Each inf2 accelerator has 8 cores (and the 8xlarge has a single Inferentia accelerator), so 0-7 should have been possible, yet even smaller ranges than that don't work?
I also tried increasing max-model-len to 4096, but even that makes the vLLM server fail to start:

(task, pid=34615) performing partition vectorization on AG_2[[0, 1032, 0, 0, 0, 0]]{2 nodes (1 sources, 0 stops)}. dags covered: {dag_1036_TC_SRC, dag_1032}
(task, pid=34615) ..Waiting for vllm api server to start...
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing/process.py
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing
(task, pid=34615) root = /opt/conda/lib/python3.10
(task, pid=34615) root = /opt/conda/lib
(task, pid=34615) root = /opt/conda
(task, pid=34615) root = /opt
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866:  38168  ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-06-21T09:33:40Z [PGT002] Too many instructions after unroll! - Compiling under --optlevel=1 may result in smaller graphs. If you are using a transformer model, try using a smaller context_length_estimate value.
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866:  38168  ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb after 0 retries.
(task, pid=34615) 2024-06-21 09:33:40.000867:  38168  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) Waiting for vllm api server to start...
(task, pid=34615) Compiler status PASS
(task, pid=34615) 2024-06-21 09:36:42.000494:  38167  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) concurrent.futures.process._RemoteTraceback:
(task, pid=34615) """
(task, pid=34615) Traceback (most recent call last):
(task, pid=34615)   File "/opt/conda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
(task, pid=34615)     r = call_item.fn(*call_item.args, **call_item.kwargs)
(task, pid=34615)   File "/home/ubuntu/sky_workdir/aws_neuron_venv_pytorch/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py", line 163, in call_neuron_compiler
(task, pid=34615)     raise subprocess.CalledProcessError(res.returncode, cmd, stderr=error_info)
(task, pid=34615) subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.

(task, pid=34615) subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.

Increasing --max-num-seqs to >1 also causes the vLLM server to fail to start. Can anyone help me with this? 🙏
I have tried many approaches, but most of them fail on the vLLM side. 😦
Please help me with the issues above!

toe95027 2#

@liangfu Hoping you can help with the issue above!

jdgnovmf 3#

@aws-patlange could you please look into this?

hlswsv35 4#

We currently don't support paged attention in the Neuron integration. You need to explicitly set block-size to max-model-len. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html
This may need some edits here so that the value can be passed through one of the API entrypoints provided in vLLM.

ybzsozfc 5#

Please try the following, after editing the argument parser that currently restricts --block-size to a few specific values (a rough sketch of that parser edit follows the command below):

python -u -m vllm.entrypoints.openai.api_server \
    --port 8081 \
    --model $MODEL_NAME \
    --trust-remote-code \
    --max-num-seqs 1 \
    --device neuron \
    --max-model-len 2048 \
    --block-size 2048 \
    2>&1 | tee api_server.log &
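The parser edit mentioned above is only sketched here, assuming the restriction comes from a choices= list on the --block-size argument (in vllm/engine/arg_utils.py in recent checkouts; the exact file and allowed values may differ in your version):

  # Hypothetical sketch of relaxing the --block-size restriction in vllm/engine/arg_utils.py.
  # The Neuron backend needs block-size == max-model-len, so that value must be allowed here.
  parser.add_argument(
      "--block-size",
      type=int,
      default=EngineArgs.block_size,
      # originally restricted to a short list such as [8, 16, 32]; add the value you need,
      # or drop `choices` entirely:
      choices=[8, 16, 32, 2048],
      help="Token block size; for --device neuron set this equal to --max-model-len.",
  )
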
mspsb9vt 6#

Hi, I used your command, but I ran into an error:
TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker. Any suggestions? Thanks!
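
Not a confirmed fix, but a possible direction: that error means NeuronWorker never received a concrete execute_worker implementation, so Python refuses to instantiate it. A hypothetical, untested sketch of the kind of edit that might unblock it would be a no-op override added inside the existing NeuronWorker class in vllm/worker/neuron_worker.py, with a signature matching the abstract method on the worker base class:

  # Hypothetical, untested sketch: add inside the existing NeuronWorker class in
  # vllm/worker/neuron_worker.py. The body is a no-op because the Neuron path runs a
  # single local worker with no extra per-step worker-side coordination to perform.
  def execute_worker(self, worker_input) -> None:
      pass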
