vllm [Bug]: with --enable-prefix-caching, /completions with echo=True crashes the server above certain prompt lengths

pgccezyw · posted 10 months ago in Other

Current environment

    vLLM 0.4.3
    RTX 4090 24 GB (also reproduces on an A100)

🐛 Describe the bug

Hi,
When the server is started as follows:

    python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --enable-prefix-caching

and the following client code is run:

    import openai
    client = openai.OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="foo"
    )
    prompt = [1] * 256
    out = client.completions.create(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        prompt=prompt,
        max_tokens=1,
        logprobs=5,
        echo=True
    )
    print(out)

the following assertion failure is triggered:

    INFO: 127.0.0.1:39724 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
    ERROR: Exception in ASGI application
    Traceback (most recent call last):
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
        result = await app( # type: ignore[func-returns-value]
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
        return await self.app(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
        await super().__call__(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
        await self.middleware_stack(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
        raise exc
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
        await self.app(scope, receive, _send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/cors.py", line 85, in __call__
        await self.app(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
        await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
        raise exc
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
        await app(scope, receive, sender)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 756, in __call__
        await self.middleware_stack(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 776, in app
        await route.handle(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 297, in handle
        await self.app(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 77, in app
        await wrap_app_handling_exceptions(app, request)(scope, receive, send)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
        raise exc
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
        await app(scope, receive, sender)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/starlette/routing.py", line 72, in app
        response = await func(request)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 278, in app
        raw_response = await run_endpoint_function(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
        return await dependant.call(**values)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 118, in create_completion
        generator = await openai_serving_completion.create_completion(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
        async for i, res in result_generator:
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 244, in consumer
        raise e
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 235, in consumer
        raise item
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/utils.py", line 219, in producer
        async for item in iterator:
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 662, in generate
        async for output in self._process_request(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
        raise e
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
        async for request_output in stream:
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
        raise result
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
        task.result()
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
        has_requests_in_progress = await asyncio.wait_for(
      File "/usr/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
        return fut.result()
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
        request_outputs = await self.engine.step_async()
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
        output = await self.model_executor.execute_model_async(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 117, in execute_model_async
        output = await make_async(self.driver_worker.execute_model
      File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/worker.py", line 272, in execute_model
        output = self.model_runner.execute_model(seq_group_metadata_list,
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 738, in execute_model
        output = self.model.sample(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 378, in sample
        next_tokens = self.sampler(logits, sampling_metadata)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 112, in forward
        prompt_logprobs, sample_logprobs = _get_logprobs(
      File "/home/user/code/play-vllm/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 760, in _get_logprobs
        assert len(next_token_ids) == len(query_indices)
    AssertionError

The server then ends up dead: vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop has errored already.
Given that the error above only triggers beyond a certain prompt-length threshold, I suspect this is an OOM masked by the assert.
If I give the server more memory headroom by adding --gpu-memory-utilization 0.5, which leaves 12 GB of my RTX 4090's 24 GB free, the error instead appears once the prompt size is increased to 512 tokens.
This does not happen without echo=True.
In the example above, without --enable-prefix-caching it handles prompts up to a maximum size of 2047 tokens.
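To locate the threshold more precisely, a sweep along the lines of the sketch below can be used; it simply replays the same request with increasing prompt lengths (the specific lengths and the exception class caught are illustrative, not taken from this report):

    # Sketch: sweep prompt lengths against the running server to find where
    # echo=True starts failing. Assumes the server started above is reachable
    # on localhost:8000.
    import openai

    client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="foo")

    for length in (128, 256, 512, 1024, 2047):
        try:
            client.completions.create(
                model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                prompt=[1] * length,   # dummy token ids, as in the repro above
                max_tokens=1,
                logprobs=5,
                echo=True,             # without echo=True the failure does not occur
            )
            print(f"{length} tokens: OK")
        except openai.APIError as exc:
            print(f"{length} tokens: failed ({exc})")
            break  # once the background loop has died, later requests fail as well
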
Thanks!

hts6caw3 #1

Also seeing this on the LLM entrypoint with large batches.
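
For completeness, a minimal sketch of that offline path (assuming prompt_logprobs is the offline counterpart of echo=True with logprobs; the model, batch size, and prompt are illustrative, not from this thread):

    # Sketch: exercise the same code path through the offline LLM entrypoint.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        enable_prefix_caching=True,
    )
    # prompt_logprobs requests logprobs over the prompt tokens, analogous to
    # echo=True + logprobs on the /v1/completions endpoint.
    params = SamplingParams(max_tokens=1, logprobs=5, prompt_logprobs=5)

    # A large batch of prompts sharing a long prefix, so prefix caching is exercised.
    prompts = ["Tell me a story. " * 100] * 256
    outputs = llm.generate(prompts, params)
    print(len(outputs))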

ewm0tg9j #2

@KuntaiDu, do you have time to take a look at this?

ff29svar #3

Same here. Possibly an issue caused by 40-series cards and flash-attn.
#5678
#5537
#5376

#5376 (comment)
