Current environment
Collecting environment information...
PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.1
Libc version: glibc-2.35
Python version: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-1064-aws-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 535.161.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7R32
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 0
BogoMIPS: 5599.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 512 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 8 MiB (16 instances)
L3 cache: 64 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] optree==0.11.0
[pip3] sentence-transformers==2.7.0
[pip3] torch==2.3.1
[pip3] torcheval==0.0.7
[pip3] torchvision==0.18.1
[pip3] transformers==4.43.2
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
After running the following code:
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from outlines.integrations.vllm import RegexLogitsProcessor
import os
os.environ["HF_TOKEN"] = ""
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_prefix_caching=True)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
proc = RegexLogitsProcessor(r'yes|no', llm)
sampling_params = SamplingParams(temperature=0.6, top_p=0.15, max_tokens=1, logits_processors=[proc])
# Placeholder: a list of 100 to 1000 long prompts, each up to the max model length (about 20,000 characters)
prompts = ["some long text up to the max model length / 20000 chars", "some long text up to the max model length / 20000 chars", ...]
formatted_prompts = []
for prompt in prompts:
    # Each prompt is a plain string, so it is passed directly as the message content
    messages = [{"role": "user", "content": prompt}]
    formatted_prompts.append(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
output = llm.generate(formatted_prompts, sampling_params)
I get the following error:
RuntimeError: CUDA error: an illegal memory access was encountered
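This error seems to occur at random: sometimes running the same command in the same environment and with the same versions does not produce it. I investigated the following and confirmed:
- Setting enable_prefix_caching=False eliminates the error.
- Prompt length seems to have little effect on the error: shortening the 20k-character prompts to 2k characters did not eliminate it.
- Removing the RegexLogitsProcessor does not fix the problem.
- Trying 0.4.2 and other versions did not help.
- Lowering GPU memory utilization to 0.8 did not help either.
- Setting os.environ["VLLM_ATTENTION_BACKEND"]="XFORMERS" results in:
The Python process exited with exit code 139 (SIGSEGV: Segmentation fault)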
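Because a CUDA illegal memory access or a segfault takes the whole Python process down with it, one way to compare the configurations listed above is to run each attempt in a fresh subprocess and classify its exit status. The following is only a minimal sketch: `run_trial` and `describe_exit` are hypothetical helper names, and the placeholder `script` would need to be replaced with the actual reproduction code. An exit code of 139 is 128 + 11, the shell convention for a process killed by SIGSEGV.

```python
import os
import signal
import subprocess
import sys


def run_trial(code: str, env_overrides: dict) -> int:
    """Run `code` in a fresh Python process and return its exit status."""
    # Hypothetical helper: the child inherits the current environment
    # plus the overrides, so each trial starts from a clean CUDA context.
    env = {**os.environ, **env_overrides}
    completed = subprocess.run([sys.executable, "-c", code], env=env)
    return completed.returncode


def describe_exit(returncode: int) -> str:
    """Translate an exit status into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        signum = -returncode        # subprocess convention on POSIX
    elif returncode > 128:
        signum = returncode - 128   # shell convention, e.g. 139 = 128 + 11
    else:
        return f"failed (exit code {returncode})"
    try:
        return f"killed by {signal.Signals(signum).name}"
    except ValueError:
        return f"killed by signal {signum}"


if __name__ == "__main__":
    # Placeholder script: substitute the real reproduction code here.
    script = "print('replace with the reproduction script')"
    for overrides in ({}, {"VLLM_ATTENTION_BACKEND": "XFORMERS"}):
        print(overrides, describe_exit(run_trial(script, overrides)))
```

Since the failure is intermittent, each configuration would likely need to be run several times before drawing a conclusion from the results.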
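I have seen many different issues involving enable_prefix_caching; can anyone comment on whether this feature works for them? Our use case has a large number of prompts that are 80-90% duplicated, so prefix caching gives a huge speedup. Any suggestions would be much appreciated!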
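To put a number on the "80-90% duplicated" figure when describing a workload like this, a rough character-level estimate can be computed from the prompts themselves. This is a sketch under assumptions: `shared_prefix_ratio` and `mean_prefix_overlap` are hypothetical helpers, and character overlap is only a proxy, since prefix caching operates on tokens rather than characters.

```python
import os


def shared_prefix_ratio(a: str, b: str) -> float:
    """Fraction of the shorter string covered by the common prefix of a and b."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    # os.path.commonprefix compares character by character, so it works
    # on arbitrary strings, not only on file paths.
    return len(os.path.commonprefix([a, b])) / shorter


def mean_prefix_overlap(prompts: list) -> float:
    """Average shared-prefix ratio between neighbours in sorted order."""
    if len(prompts) < 2:
        return 0.0
    # Sorting places prompts that share a prefix next to each other.
    ordered = sorted(prompts)
    ratios = [shared_prefix_ratio(x, y) for x, y in zip(ordered, ordered[1:])]
    return sum(ratios) / len(ratios)
```

For example, `mean_prefix_overlap(formatted_prompts)` would give a single summary value for the batch passed to `llm.generate`.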
Full error details
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
File <command-1575781236477471>, line 6
4 os.environ["VLLM_TRACE_FUNCTION"]="TRACE"
5 os.environ["CUDA_LAUNCH_BLOCKING"]="1"
----> 6 output = llm.generate(formatted_prompts[300:1000], sampling_params)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/utils.py:838, in deprecate_kwargs.<locals>.wrapper.<locals>.inner(*args, **kwargs)
831 msg += f" {additional_message}"
833 warnings.warn(
834 DeprecationWarning(msg),
835 stacklevel=3, # The inner function takes up one level
836 )
--> 838 return fn(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:316, in LLM.generate(self, prompts, sampling_params, prompt_token_ids, use_tqdm, lora_request, prompt_adapter_request)
308 sampling_params = SamplingParams()
310 self._validate_and_add_requests(
311 inputs=inputs,
312 params=sampling_params,
313 lora_request=lora_request,
314 prompt_adapter_request=prompt_adapter_request)
--> 316 outputs = self._run_engine(use_tqdm=use_tqdm)
317 return LLMEngine.validate_outputs(outputs, RequestOutput)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/entrypoints/llm.py:569, in LLM._run_engine(self, use_tqdm)
567 total_out_toks = 0
568 while self.llm_engine.has_unfinished_requests():
--> 569 step_outputs = self.llm_engine.step()
570 for output in step_outputs:
571 if output.finished:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/engine/llm_engine.py:911, in LLMEngine.step(self)
901 finished_requests_ids = self.scheduler[
902 0].get_and_reset_finished_requests_ids()
903 execute_model_req = ExecuteModelRequest(
904 seq_group_metadata_list=seq_group_metadata_list,
905 blocks_to_swap_in=scheduler_outputs.blocks_to_swap_in,
(...)
909 running_queue_size=scheduler_outputs.running_queue_size,
910 finished_requests_ids=finished_requests_ids)
--> 911 output = self.model_executor.execute_model(
912 execute_model_req=execute_model_req)
913 else:
914 output = []
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/executor/gpu_executor.py:110, in GPUExecutor.execute_model(self, execute_model_req)
107 def execute_model(
108 self, execute_model_req: ExecuteModelRequest
109 ) -> Optional[List[Union[SamplerOutput, PoolerOutput]]]:
--> 110 output = self.driver_worker.execute_model(execute_model_req)
111 return output
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/worker_base.py:272, in LocalOrDistributedWorkerBase.execute_model(self, execute_model_req)
268 if not get_pp_group().is_first_rank:
269 intermediate_tensors = IntermediateTensors(
270 get_pp_group().recv_tensor_dict())
--> 272 output = self.model_runner.execute_model(
273 model_input, self.kv_cache[worker_input.virtual_engine]
274 if self.kv_cache is not None else None, intermediate_tensors,
275 num_steps)
277 if not get_pp_group().is_last_rank:
278 # output is IntermediateTensors
279 get_pp_group().send_tensor_dict(output.tensors)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/worker/model_runner.py:1334, in ModelRunner.execute_model(self, model_input, kv_caches, intermediate_tensors, num_steps)
1331 return []
1333 # Sample the next token.
-> 1334 output: SamplerOutput = self.model.sample(
1335 logits=logits,
1336 sampling_metadata=model_input.sampling_metadata,
1337 )
1339 if self.return_hidden_states:
1340 # we only need to pass hidden states of most recent token
1341 assert model_input.sampling_metadata is not None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/models/llama.py:437, in LlamaForCausalLM.sample(self, logits, sampling_metadata)
432 def sample(
433 self,
434 logits: torch.Tensor,
435 sampling_metadata: SamplingMetadata,
436 ) -> Optional[SamplerOutput]:
--> 437 next_tokens = self.sampler(logits, sampling_metadata)
438 return next_tokens
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:91, in Sampler.forward(self, logits, sampling_metadata)
89 # Prepare sampling tensors with pinned memory to avoid blocking.
90 if not sampling_metadata.reuse_sampling_tensors:
---> 91 self._init_sampling_tensors(logits, sampling_metadata)
92 elif self._do_penalties:
93 # In this case, the sampling tensors logic depends on
94 # "output_tokens" of a sequence. As a result, we cannot
95 # reuse sampling tensors, since "output_tokens" changes
96 # between decode runs.
97 self._init_sampling_tensors(logits, sampling_metadata)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/layers/sampler.py:68, in Sampler._init_sampling_tensors(self, logits, sampling_metadata)
64 self._sampling_tensors = None
66 # Initialize new sampling tensors
67 (sampling_tensors, do_penalties, do_top_p_top_k,
---> 68 do_min_p) = SamplingTensors.from_sampling_metadata(
69 sampling_metadata, vocab_size, logits.device, logits.dtype)
71 self._sampling_tensors = sampling_tensors
72 self._do_penalties = do_penalties
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:443, in SamplingTensors.from_sampling_metadata(cls, sampling_metadata, vocab_size, device, dtype, extra_seeds_to_generate, extra_entropy)
440 prompt_tokens.append(list(seq_data.prompt_token_ids))
441 output_tokens.append(list(seq_data.output_token_ids))
--> 443 sampling_tensors = SamplingTensors.from_lists(
444 temperatures, top_ps, top_ks, min_ps, presence_penalties,
445 frequency_penalties, repetition_penalties, sampling_seeds,
446 sample_indices, prompt_tokens, output_tokens, vocab_size,
447 extra_seeds_to_generate, device, dtype)
448 return (sampling_tensors, do_penalties, do_top_p_top_k, do_min_p)
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-35999335-6bd3-46be-85fc-719f24a36190/lib/python3.11/site-packages/vllm/model_executor/sampling_metadata.py:487, in SamplingTensors.from_lists(cls, temperatures, top_ps, top_ks, min_ps, presence_penalties, frequency_penalties, repetition_penalties, sampling_seeds, sample_indices, prompt_tokens, output_tokens, vocab_size, extra_seeds_to_generate, device, dtype)
484 prompt_t = empty_tensor
485 output_t = empty_tensor
--> 487 temperatures_t = torch.tensor(
488 temperatures,
489 device="cpu",
490 dtype=dtype,
491 pin_memory=pin_memory,
492 )
493 top_ps_t = torch.tensor(
494 top_ps,
495 device="cpu",
496 dtype=dtype,
497 pin_memory=pin_memory,
498 )
499 min_ps_t = torch.tensor(
500 min_ps,
501 device="cpu",
502 dtype=dtype,
503 pin_memory=pin_memory,
504 )
1 answer
xxe27gdn1#
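Can you share the exact prompts you sent? This issue only happens occasionally, so detailed reproduction instructions would be very helpful to us.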
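One way to capture the exact prompts for a reproduction is to write them to a JSON Lines file together with a checksum, so that whoever reruns the case can confirm they have byte-identical inputs. This is a minimal sketch; `dump_prompts`, `load_prompts`, and the record fields are hypothetical names chosen for illustration, not part of vLLM.

```python
import hashlib
import json


def dump_prompts(prompts: list, path: str) -> int:
    """Write one JSON record per prompt and return the number written."""
    with open(path, "w", encoding="utf-8") as f:
        for index, prompt in enumerate(prompts):
            record = {
                "index": index,
                "num_chars": len(prompt),
                # The checksum lets a reader verify the prompt was not altered.
                "sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
                "prompt": prompt,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(prompts)


def load_prompts(path: str) -> list:
    """Read the prompts back, checking each one against its stored checksum."""
    prompts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            digest = hashlib.sha256(record["prompt"].encode("utf-8")).hexdigest()
            if digest != record["sha256"]:
                raise ValueError(f"checksum mismatch at index {record['index']}")
            prompts.append(record["prompt"])
    return prompts
```

In the script from the report, this could be called as `dump_prompts(formatted_prompts, "prompts.jsonl")` just before `llm.generate`, with the file name being an arbitrary choice.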