Your current environment
The output of python collect_env.py:
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: Could not collect
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.31
Python version: 3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-4.19.95-35-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 470.161.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
/bin/sh: lscpu: not found
Versions of relevant libraries:
[pip3] flashinfer==0.0.8+cu121torch2.3
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.44.0
[pip3] triton==2.3.0
[conda] flashinfer 0.0.8+cu121torch2.3 pypi_0 pypi
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] torchvision 0.18.0 pypi_0 pypi
[conda] transformers 4.44.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity
GPU0 X 24-47,72-95 1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
from vllm import LLM, SamplingParams
import os

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_DO_NOT_TRACK"] = "1"

llm = LLM(
    model="/data/test/gemma2_2b_it_prod",
    max_model_len=2048,
    trust_remote_code=False,
    block_size=4,
    max_num_seqs=2,
    swap_space=16,
    max_seq_len_to_capture=512,
    load_format='auto',
    dtype='float16',
    kv_cache_dtype='auto',
    seed=0,
    enforce_eager=True,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=1,
    worker_use_ray=False,
)
When I run the code above, model loading hangs. The log output is:
WARNING 08-13 07:04:00 config.py:1354] Casting torch.bfloat16 to torch.float16.
WARNING 08-13 07:04:00 utils.py:562] Gemma 2 uses sliding window attention for every odd layer, which is currently not supported by vLLM. Disabling sliding window and capping the max length to the sliding window size (4096).
INFO 08-13 07:04:00 llm_engine.py:169] Initializing an LLM engine (v0.5.1) with config: model='/mnt/posfs/globalmount/gemma-2-2b-it', speculative_config=None, tokenizer='/mnt/posfs/globalmount/gemma-2-2b-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/mnt/posfs/globalmount/gemma-2-2b-it, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
INFO 08-13 07:04:01 selector.py:79] Using Flashinfer backend.
WARNING 08-13 07:04:01 selector.py:80] Flashinfer will be stuck on llama-2-7b, please avoid using Flashinfer as the backend when running on llama-2-7b.
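For completeness, the script was presumably meant to continue along these lines once the engine initialized (a hypothetical continuation: the prompt and sampling settings below are illustrative and not part of the original report, and this point is never reached because the LLM(...) call above hangs):

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
# Never reached in practice, since engine initialization does not return.
outputs = llm.generate(["Why is the sky blue?"], sampling_params)  # illustrative prompt
for output in outputs:
    print(output.prompt, output.outputs[0].text)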
2 answers

bvuwiixz1#
I can run this successfully on an A10, but it hangs on a T4.

xtfmy6hx2#
This may be a flashinfer issue, cc @LiuXiaoxuanPKU
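A quick way to test that hypothesis is to rerun the same setup with a different attention backend and check whether loading still hangs on the T4. This is only a sketch, assuming VLLM_ATTENTION_BACKEND also accepts the value XFORMERS (as in vLLM 0.5.x); the model path and arguments are taken from the repro above:

import os

# Select a non-FlashInfer backend before vLLM is imported, to isolate
# whether the hang is specific to the FlashInfer backend on this GPU.
os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"
os.environ["VLLM_DO_NOT_TRACK"] = "1"

from vllm import LLM

# Same model and key settings as the original repro; the remaining
# arguments can be kept as they were.
llm = LLM(
    model="/data/test/gemma2_2b_it_prod",
    max_model_len=2048,
    dtype='float16',
    enforce_eager=True,
    gpu_memory_utilization=0.95,
    tensor_parallel_size=1,
)

If this loads cleanly while the FLASHINFER run still hangs, that would point at the FlashInfer backend on the T4 rather than at the model or the engine configuration.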