vllm [Bug]: enable_prefix_caching causes persistent illegal memory access errors

brvekthn · asked 2 months ago

Current environment

Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35

Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1064-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           64 MiB (4 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.11.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.1
[pip3] torcheval==0.0.7
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	0-31	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

After running the following code

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from outlines.integrations.vllm import RegexLogitsProcessor

import os
os.environ["HF_TOKEN"] = ""

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

proc = RegexLogitsProcessor(r'yes|no', llm)
sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1, logits_processors=[proc])

prompts = ["some long text up to the max model length / 20000 chars", "some long text up to the max model length / 20000 chars", ...] <- list of length 100 to 1000

formatted_prompts = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    formatted_prompts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

output = llm.generate(formatted_prompts, sampling_params)

I get the following error:

RuntimeError: CUDA error: an illegal memory access was encountered
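
For context, the full traceback further below was captured with synchronous CUDA launches and vLLM function tracing enabled (the same environment variables that appear in the trace); a minimal sketch of that setup:

import os

# CUDA reports illegal memory accesses asynchronously; forcing synchronous
# kernel launches makes the Python stack trace point closer to the failing call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# vLLM function tracing, set the same way as in the traceback below.
os.environ["VLLM_TRACE_FUNCTION"] = "TRACE"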

The error seems to occur randomly: sometimes I run the exact same command in the same environment with the same versions and do not hit it. I have investigated and confirmed the following (the settings involved are collected in a short sketch after this list):

  • Setting enable_prefix_caching=False eliminates the error
  • Prompt length does not seem to matter much; shortening the 20k-character prompts to 2k characters did not eliminate the error
  • Removing the RegexLogitsProcessor does not fix the issue
  • Trying 0.4.2 and other versions did not help
  • Reducing GPU memory utilization to 0.8 did not help either
  • Setting os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS" instead makes the Python process exit with code 139 (SIGSEGV: Segmentation fault)
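
For completeness, a minimal sketch (with illustrative values, not a recommended configuration) collecting the settings from the list above in one place; in my testing only enable_prefix_caching=False avoided the crash:

import os

# Attention backend override from the last bullet; with XFORMERS the process
# died with SIGSEGV instead of raising the illegal-memory-access error.
# os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"

from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    enable_prefix_caching=False,   # the only change that avoided the crash
    gpu_memory_utilization=0.8,    # tried as a mitigation, no effect on the error
)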

I have seen quite a few different issues about enable_prefix_caching; can anyone comment on whether this feature works reliably for them? In our use case 80-90% of the prompt content is duplicated, so prefix caching gives a huge speedup. Any suggestions are much appreciated!

Full error details

Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File <command-1575781236477471>, line 6
      4 os.environ["VLLM_TRACE_FUNCTION"]="TRACE"
      5 os.environ["CUDA_LAUNCH_BLOCKING"]="1"
----> 6 output = llm.generate(formatted_prompts[300:1000], sampling_params)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/utils.py:838, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
    831             msg += f" {additional_message}"
    833         warnings.warn(
    834             DeprecationWarning(msg),
    835             stacklevel=3,  # The inner function takes up one level
    836         )
--> 838 return fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:316, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
    308     sampling_params = SamplingParams()
    310 self._validate_and_add_requests(
    311     inputs=inputs,
    312     params=sampling_params,
    313     lora_request=lora_request,
    314     prompt_adapter_request=prompt_adapter_request)
--> 316 outputs = self._run_engine(use_tqdm=use_tqdm)
    317 return LLMEngine.validate_outputs(outputs, RequestOutput)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:569, in LLM._run_engine(self, use_tqdm)
    567 total_out_toks = 0
    568 while self.llm_engine.has_unfinished_requests():
--> 569     step_outputs = self.llm_engine.step()
    570     for output in step_outputs:
    571         if output.finished:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/engine/llm_engine.py:911, in LLMEngine.step(self)
    901     finished_requests_ids = self.scheduler[
    902         0].get_and_reset_finished_requests_ids()
    903     execute_model_req = ExecuteModelRequest(
    904         seq_group_metadata_list=seq_group_metadata_list,
    905         blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
   (...)
    909         running_queue_size=scheduler_outputs.running_queue_size,
    910         finished_requests_ids=finished_requests_ids)
--> 911     output = self.model_executor.execute_model(
    912         execute_model_req=execute_model_req)
    913 else:
    914     output = []
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:110, in GPUExecutor.execute_model(self, execute_model_req)
    107 def execute_model(
    108     self, execute_model_req: ExecuteModelRequest
    109 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 110     output = self.driver_worker.execute_model(execute_model_req)
    111     return output
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/worker_base.py:272, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
    268 if not get_pp_group().is_first_rank:
    269     intermediate_tensors = IntermediateTensors(
    270         get_pp_group().recv_tensor_dict())
--> 272 output = self.model_runner.execute_model(
    273     model_input, self.kv_cache[worker_input.virtual_engine]
    274     if self.kv_cache is not None else None, intermediate_tensors,
    275     num_steps)
    277 if not get_pp_group().is_last_rank:
    278     # output is IntermediateTensors
    279     get_pp_group().send_tensor_dict(output.tensors)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/model_runner.py:1334, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
   1331     return []
   1333 # Sample the next token.
-> 1334 output: SamplerOutput = self.model.sample(
   1335     logits=logits,
   1336     sampling_metadata=model_input.sampling_metadata,
   1337 )
   1339 if self.return_hidden_states:
   1340     # we only need to pass hidden states of most recent token
   1341     assert model_input.sampling_metadata is not None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/models/llama.py:437, in LlamaForCausalLM.sample(self, logits, sampling_metadata)
    432 def sample(
    433     self,
    434     logits: torch.Tensor,
    435     sampling_metadata: SamplingMetadata,
    436 ) -> Optional[SamplerOutput]:
--> 437     next_tokens = self.sampler(logits, sampling_metadata)
    438     return next_tokens
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:91, in Sampler.forward(self, logits, sampling_metadata)
     89 # Prepare sampling tensors with pinned memory to avoid blocking.
     90 if not sampling_metadata.reuse_sampling_tensors:
---> 91     self._init_sampling_tensors(logits, sampling_metadata)
     92 elif self._do_penalties:
     93     # In this case, the sampling tensors logic depends on
     94     # "output_tokens" of a sequence. As a result, we cannot
     95     # reuse sampling tensors, since "output_tokens" changes
     96     # between decode runs.
     97     self._init_sampling_tensors(logits, sampling_metadata)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:68, in Sampler._init_sampling_tensors(self, logits, sampling_metadata)
     64 self._sampling_tensors = None
     66 # Initialize new sampling tensors
     67 (sampling_tensors, do_penalties, do_top_p_top_k,
---> 68  do_min_p) = SamplingTensors.from_sampling_metadata(
     69      sampling_metadata, vocab_size, logits.device, logits.dtype)
     71 self._sampling_tensors = sampling_tensors
     72 self._do_penalties = do_penalties
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:443, in SamplingTensors.from_sampling_metadata(cls, sampling_metadata, vocab_size, device, dtype, extra_seeds_to_generate, extra_entropy)
    440                 prompt_tokens.append(list(seq_data.prompt_token_ids))
    441                 output_tokens.append(list(seq_data.output_token_ids))
--> 443 sampling_tensors = SamplingTensors.from_lists(
    444     temperatures, top_ps, top_ks, min_ps, presence_penalties,
    445     frequency_penalties, repetition_penalties, sampling_seeds,
    446     sample_indices, prompt_tokens, output_tokens, vocab_size,
    447     extra_seeds_to_generate, device, dtype)
    448 return (sampling_tensors, do_penalties, do_top_p_top_k, do_min_p)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:487, in SamplingTensors.from_lists(cls, temperatures, top_ps, top_ks, min_ps, presence_penalties, frequency_penalties, repetition_penalties, sampling_seeds, sample_indices, prompt_tokens, output_tokens, vocab_size, extra_seeds_to_generate, device, dtype)
    484     prompt_t = empty_tensor
    485     output_t = empty_tensor
--> 487 temperatures_t = torch.tensor(
    488     temperatures,
    489     device="cpu",
    490     dtype=dtype,
    491     pin_memory=pin_memory,
    492 )
    493 top_ps_t = torch.tensor(
    494     top_ps,
    495     device="cpu",
    496     dtype=dtype,
    497     pin_memory=pin_memory,
    498 )
    499 min_ps_t = torch.tensor(
    500     min_ps,
    501     device="cpu",
    502     dtype=dtype,
    503     pin_memory=pin_memory,
    504 )

xxe27gdn · answer #1

Could you share the exact prompts you are sending? This issue only happens occasionally, so detailed reproduction instructions would help us a lot.
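
A minimal repro sketch along these lines may be easier to share than the real prompts: it generates synthetic prompts sharing a long common prefix (mirroring the reported 80-90% duplication). The prefix text, prompt count, and lengths are illustrative assumptions rather than the reporter's actual data, and the RegexLogitsProcessor is omitted since it reportedly did not affect the crash:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
llm = LLM(model=model_id, enable_prefix_caching=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Synthetic prompts: a long shared prefix plus a short unique suffix, so most
# of each prompt can be served from the prefix cache (illustrative sizes only).
shared_prefix = "Answer yes or no. Context: " + ("lorem ipsum " * 1500)
prompts = [shared_prefix + f"Question #{i}: is this relevant?" for i in range(500)]

formatted = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in prompts
]

sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1)
outputs = llm.generate(formatted, sampling_params)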
