Current environment
The output of `python collect_env.py`:

```
vllm 0.4.3
RTX 4090
Driver 555.99
```
🐛 Describe the bug
```
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f1e8f46caf0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f1e847108e0>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f1e8f46caf0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f1e847108e0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
    output = self.model.sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
    sample_results, maybe_sampled_tokens_tensor = _sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
    return _sample_with_torch(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
    sample_results = _random_sample(seq_groups,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
    random_samples = random_samples.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 265, in __call__
    await wrap(partial(self.listen_for_disconnect, receive))
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 238, in listen_for_disconnect
    message = await receive()
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 54, in wrapped_receive
    msg = await self.receive()
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
    await self.message_event.wait()
  File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
    await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f1e7f94bd60

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 192, in __call__
    await response(scope, wrapped_receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 258, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 87, in collapse_excgroups
    yield
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 190, in __call__
    async with anyio.create_task_group() as task_group:
  File "/usr/local/lib/python3.10/dist-packages/anyio/_backends/_asyncio.py", line 680, in __aexit__
    raise BaseExceptionGroup(
exceptiongroup.ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/base.py", line 189, in __call__
    with collapse_excgroups():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/dist-packages/starlette/_utils.py", line 93, in collapse_excgroups
    raise exc
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in wrap
    await func()
  File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 250, in stream_response
    async for chunk in self.body_iterator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", line 227, in chat_completion_stream_generator
    async for res in result_generator:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
    async for output in self._process_request(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
    async for request_output in stream:
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
    raise result
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
    task.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
  File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
    request_outputs = await self.engine.step_async()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
    output = await self.model_executor.execute_model_async(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
    output = await make_async(self.driver_worker.execute_model
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 272, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 738, in execute_model
    output = self.model.sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 345, in sample
    next_tokens = self.sampler(logits, sampling_metadata)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 96, in forward
    sample_results, maybe_sampled_tokens_tensor = _sample(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 655, in _sample
    return _sample_with_torch(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 544, in _sample_with_torch
    sample_results = _random_sample(seq_groups,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/sampler.py", line 324, in _random_sample
    random_samples = random_samples.cpu()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

INFO 06-10 05:32:41 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
INFO 06-10 05:32:51 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
INFO 06-10 05:33:01 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 11.4%, CPU KV cache usage: 0.0%.
```

After the crash the engine loop is dead: the metrics lines above show a request stuck in `Running` with zero throughput, and the server never recovers (`AsyncEngineDeadError`).
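For context, the failure is hit on the streaming chat path (`chat_completion_stream_generator` appears in the second traceback), i.e. while a client is streaming a chat completion from the OpenAI-compatible server. A minimal client-side sketch of that request shape follows; the base URL, API key, and model name are all assumptions (the report only shows `qwen2.py` in the stack, not the checkpoint):

```python
# Hypothetical client sketch of the request path seen in the traceback:
# a streaming chat completion against vLLM's OpenAI-compatible server.
# base_url, api_key, and the model name are assumptions, not from the report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen/Qwen2-7B-Instruct",  # assumed: some Qwen2 checkpoint
    messages=[{"role": "user", "content": "hello"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; the stream breaks mid-response
    # when the CUDA error kills the engine loop on the server side.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```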
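The error message itself suggests the first debugging step: rerun with `CUDA_LAUNCH_BLOCKING=1` so kernels launch synchronously and the illegal access is attributed to the kernel that actually faulted, rather than to the next synchronization point (`random_samples.cpu()` here). A minimal offline sketch, assuming the same model can be driven through the synchronous `LLM` API (model name again assumed):

```python
# Synchronous-launch debugging sketch. CUDA_LAUNCH_BLOCKING must be set
# before CUDA is initialized, i.e. before torch/vllm are imported.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

from vllm import LLM, SamplingParams

# Model name is an assumption; the report does not name the checkpoint.
llm = LLM(model="Qwen/Qwen2-7B-Instruct")
params = SamplingParams(temperature=0.8, max_tokens=64)

# With blocking launches, an illegal memory access should raise at the
# kernel that faulted instead of at a later .cpu() device-to-host copy.
print(llm.generate(["hello"], params)[0].outputs[0].text)
```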
1 Answer

bxgwgixi #1
Seeing a similar segfault, sometimes with 0.5.0.post1 as well:
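When comparing behavior across releases (0.4.3 in the report vs. 0.5.0.post1 here), it helps to record exactly which stack each run used. A small sketch, using only standard version attributes:

```python
# Record the exact stack for each run when comparing releases.
import torch
import vllm

print("vllm :", vllm.__version__)
print("torch:", torch.__version__, "- CUDA", torch.version.cuda)
print("gpu  :", torch.cuda.get_device_name(0))
```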