[Bug]: Gemma-2-2b-it model loading hangs with vLLM==0.5.1 on a Tesla T4 GPU

Posted by djmepvbi · 2 months ago

Current environment

Output of python collect_env.py:

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31

Python version: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35)  [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.19.95-35-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
/bin/sh: lscpu: not found

Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.44.0
[pip3] triton==2.3.0
[conda] flashinfer                0.0.8+cu121torch2.3          pypi_0    pypi
[conda] numpy                     1.24.4                   pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] torch                     2.3.0                    pypi_0    pypi
[conda] torchvision               0.18.0                   pypi_0    pypi
[conda] transformers              4.44.0                   pypi_0    pypi
[conda] triton                    2.3.0                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity
GPU0     X      24-47,72-95     1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

from vllm import LLM, SamplingParams

import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_DO_NOT_TRACK"] = "1"

llm = LLM(
    model="/data/test/gemma2_2b_it_prod",
    max_model_len=2048,
    trust_remote_code=False,
    block_size=4,
    max_num_seqs=2,
    swap_space=16,
    max_seq_len_to_capture=512,
    load_format="auto",
    dtype="float16",
    kv_cache_dtype="auto",
    seed=0,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=1,
    worker_use_ray=False,
)

When I run the code above, model loading hangs:

WARNING 08-13 07:04:00 config.py:1354] Casting torch.bfloat16 to torch.float16.
WARNING 08-13 07:04:00 utils.py:562] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-13 07:04:00 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/mnt/posfs/globalmount/gemma-2-2b-it', speculative_config=None, tokenizer='/mnt/posfs/globalmount/gemma-2-2b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/posfs/globalmount/gemma-2-2b-it, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
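
The engine initialization never gets past this point. To see where the process is actually stuck, one option is to ask the standard-library faulthandler module to dump every thread's Python stack after a timeout. This is only a minimal diagnostic sketch, reusing the model path from the repro above; the 120-second timeout is an arbitrary choice.

import faulthandler
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Print the stack of every thread if we are still alive after 120 s, without exiting.
faulthandler.dump_traceback_later(120, exit=False)

from vllm import LLM

llm = LLM(model="/data/test/gemma2_2b_it_prod", dtype="float16", enforce_eager=True)
faulthandler.cancel_dump_traceback_later()  # reached only if loading completes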

bvuwiixz1#

I can run this successfully on an A10, but it hangs on a T4.
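
For reference, the two cards differ in compute capability: the A10 is sm_86 (Ampere) while the T4 is sm_75 (Turing), and FlashInfer kernels of that era were primarily built for sm_80+ GPUs, which could explain why only the T4 hangs. This is an assumption, not something confirmed by the logs above. A minimal check:

import torch

# The A10 reports (8, 6); the T4 reports (7, 5).
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("This GPU predates sm_80; the FLASHINFER backend may not be fully supported here.")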


xtfmy6hx2#

This could be a FlashInfer issue, cc @LiuXiaoxuanPKU
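
If FlashInfer is indeed the culprit, one low-effort experiment on the T4 is to select a different attention backend, e.g. xFormers. A minimal sketch under two assumptions: XFORMERS is an accepted value of VLLM_ATTENTION_BACKEND, and Gemma-2's attention logit soft-capping may only be applied by the FlashInfer backend in vLLM 0.5.1, so other backends could warn or degrade output quality.

import os

# Hypothetical workaround: try the xFormers attention backend instead of FlashInfer.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(
    model="/data/test/gemma2_2b_it_prod",  # same local path as the repro above
    max_model_len=2048,
    dtype="float16",
    enforce_eager=True,
    gpu_memory_utilization=0.95,
)

If this loads on the T4, it narrows the hang down to the FlashInfer code path.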
