vllm [Bug]: Segmentation fault (core dumped) when loading the deepseek coder v2 lite model

tyky79it · posted 3 months ago in Other

Current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-3.10.0-1160.42.2.el7.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-40GB
Nvidia driver version: 470.57.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Address sizes:       46 bits physical, 57 bits virtual
Byte Order:          Little Endian
CPU(s):              112
On-line CPU(s) list: 0-111
Vendor ID:           GenuineIntel
Model name:          Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU family:          6
Model:               106
Thread(s) per core:  2
Core(s) per socket:  28
Socket(s):           2
Stepping:            6
BogoMIPS:            5187.80
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear spec_ctrl intel_stibp arch_capabilities
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           2.6 MiB (56 instances)
L1i cache:           1.8 MiB (56 instances)
L2 cache:            70 MiB (56 instances)
L3 cache:            96 MiB (2 instances)
NUMA node(s):        2
NUMA node0 CPU(s):   0-55
NUMA node1 CPU(s):   56-111

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] sentence-transformers==2.6.1
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    CPU Affinity    NUMA Affinity
GPU0     X      56-111  1

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

vllm was built from the latest source (commit af9ad46). It works fine with other models such as opt-125m, but it always crashes when running deepseek coder v2 lite. When I tried to debug with export VLLM_TRACE_FUNCTION=1 it did not crash; after unsetting the variable it crashed again.
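For completeness, the same trace toggle can be flipped from inside the test script instead of the shell. This is a minimal sketch, assuming vLLM reads VLLM_TRACE_FUNCTION from the environment when the engine is initialized (so setting it before vllm is imported is enough); the model path is the local one used in this report:

import os

# Assumption: vLLM picks up VLLM_TRACE_FUNCTION from the environment when the
# engine is initialized, so set it before vllm is imported.
# Comment this line out to return to the crashing configuration.
os.environ["VLLM_TRACE_FUNCTION"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct",
    trust_remote_code=True,
    max_model_len=8192,
    enforce_eager=True,
)
print(llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95)))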

zxd@zxd-cuda121-0:/code/code-complete$ export VLLM_TRACE_FUNCTION=1

zxd@zxd-cuda121-0:/code/code-complete$ python3.11 testdp2.py
INFO 07-01 05:20:47 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123', speculative_config=None, tokenizer='/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-01 05:20:47 logger.py:146] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 07-01 05:20:47 logger.py:150] Trace frame log is saved to /tmp/vllm/vllm-instance-631f378da60646139e5c846da27af5d7/VLLM_TRACE_FUNCTION_for_process_32095_thread_139855533970048_at_2024-07-01_05:20:47.470343.log
DEBUG 07-01 05:20:49 parallel_state.py:788] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.18.162.132:39522 backend=nccl
Cache shape torch.Size([163840, 64])
INFO 07-01 05:21:15 model_runner.py:234] Loading model weights took 29.3010 GB
INFO 07-01 05:21:17 gpu_executor.py:83] # GPU blocks: 672, # CPU blocks: 606
Processed prompts: 100%|█| 2/2 [00:02<00:00,  1.29s/it, est. speed input: 5.44 toks/s, output:
Prompt: 'Hello, my name is', Generated text: ' ***\\<Your Name\\>*** and I am a ***\\<Your Profession'
Prompt: 'The president of the United States is', Generated text: ' not only the leader of the free world but also the commander-in-chief'

zxd@zxd-cuda121-0:/code/code-complete$ unset VLLM_TRACE_FUNCTION

zxd@zxd-cuda121-0:/code/code-complete$ python3.11 testdp2.py
INFO 07-01 05:23:43 llm_engine.py:169] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123', speculative_config=None, tokenizer='/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct-24-06-17-1123)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
DEBUG 07-01 05:23:45 parallel_state.py:788] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.18.162.132:38231 backend=nccl
Cache shape torch.Size([163840, 64])
INFO 07-01 05:24:07 model_runner.py:234] Loading model weights took 29.3010 GB
Segmentation fault (core dumped)

I am trying to debug the core with gdb python <THE_CORE_FILE>, but did not find anything useful. Can anyone help me extract more information from the core file? (A faulthandler-based alternative is sketched after the repro code below.)
Here is the code to reproduce the problem:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

model_name = '/data/models/deepseek/deepseek-ai__deepseek-coder-v2-lite-instruct'
# model_name = '/data/models/opt-125m'
llm = LLM(model=model_name, trust_remote_code=True, max_model_len=8192, enforce_eager=True)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
cgvd09ve 1#

> When trying to debug with export VLLM_TRACE_FUNCTION=1, the program did not crash. After unsetting it, the program crashed again. This is very strange...
> Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]

One thing I noticed is that you are using a release candidate of Python. Have you tried switching between Python versions, for example the officially released Python 3.10/3.11?

yr9zkbsy 2#

> I noticed you are using a release candidate of Python. Have you tried switching between Python versions, for example the officially released Python 3.10/3.11?

OK, I will try an official Python 3.11 release. The current 3.11 rc1 comes with the nvidia/cuda:12.1.0-devel-ubuntu22.04 image.

3mpgtkmj 3#

After switching to Python 3.10, I no longer see the crash.

oxalkeyp 4#

I also ran into this problem with Python 3.9.

klr1opcd 5#

In my testing it appears to be random; sometimes it crashes with a core dump and sometimes it does not.
