vllm [Bug]: Qwen/Qwen2-72B-Instruct 128k server down

vddsk6oq  于 6个月前  发布在  其他
关注(0)|答案(9)|浏览(57)

描述bug

当前环境使用的PyTorch版本为2.3.0+cu121,并且没有使用ROCM进行构建。在Linux系统上,使用了Ubuntu Jammy Jellyfish(开发分支)作为操作系统。GCC和Clang的版本无法收集到,而CMake的版本为3.29.5。Libc的版本为glibc-2.35。Python的版本为3.10.14(主分支,于2024年5月6日发布)。当前环境中启用了CUDA,并加载了CUDA 12.1和cuDNN。GPU模型和配置信息如下:

GPU 0:NVIDIA L20
GPU 1:NVIDIA L20
GPU 2:NVIDIA L20
GPU 3:NVIDIA L20
GPU 4:NVIDIA L20
GPU 5:NVIDIA L20
GPU 6:NVIDIA L20
GPU 7:NVIDIA L20
Nvidia驱动版本为535.161.07
cuDNN版本无法收集到
HIP运行时版本为N/A
MIOpen运行时版本为N/A
XNNPACK可用。
CPU架构为x86_64,支持32位和64位指令集。地址大小为52位物理地址和57位虚拟地址。字节序为Little Endian。CPU数量为128个。供应商ID为GenuineIntel,型号为Intel(R) Xeon(R) Gold 6462C CPU。CPU家族为6,模型为143。每个CPU核心的线程数为2,每个CPU核心的缓存大小为2MB。有两个NUMA节点,节点编号分别为0和1。
使用vllm/vllm-openai:v0.5.0,在两台L40机器上分布式执行,每台机器8张卡,总共16张卡。

在头节点上

ray start --head

在工作节点上

ray start --address=xxx:6379
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --trust-remote-code --tensor-parallel-size 16 --model /model/Qwen2-72B-Instruct --max-num-seqs 1
90000 token no problem ,100000 token server is down
然后我得到了以下错误:
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 265, in **call
await wrap(partial(self.listen_for_disconnect, receive))
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 261, in wrap
await func()
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/responses.py", line 238, in listen_for_disconnect
message = await receive()
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 553, in receive
await self.message_event.wait()
File "/root/miniconda3/envs/vllm/lib/python3.10/asyncio/locks.py", line 214, in wait
await fut
asyncio.exceptions.CancelledError: Cancelled by cancel scope 7f24f6169600
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 70, in **call
return await self.app(scope, receive, send)
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in **call
await super().**call(scope, receive, send)
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 123, in **call
await self.middleware_stack(scope, receive, send)
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in **call
raise exc
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in **call
await self.app(scope, receive, _send)
File "/root/miniconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware//cors.py", line 85, in **call
await self.app(scope, receive, send)
File "/root/miniconda3/envs

dzhpxtsq

dzhpxtsq1#

请帮我检查一下,我怀疑这是一个超时问题。谢谢

z8dt9xmd

z8dt9xmd2#

它每次都发生吗?你有没有一个请求可以稳定地触发错误?

xzv2uavs

xzv2uavs3#

我遇到了同样的问题。在运行了几个小时后,qwen2-72b和qwen1.5-110b都产生了asyncio.exceptions.CancelledError。

cygmwpex

cygmwpex4#

完全相同的问题,我从6月14日开始就遇到了。也许是因为信号量的问题?

u91tlkcl

u91tlkcl5#

今天早上又发生了这种情况。似乎生成吞吐量变得越来越慢,最后无法响应(上下文长度没有增长,每个请求都是一个新的聊天)。我想在请求完成后有一些资源没有被释放。

rmbxnbpk

rmbxnbpk6#

当我们基于Qwen2-72B-Instruct-128K使用4个L40 GPU进行推理时,遇到了相同的问题。我们通过设置以下两个参数解决了这个问题:

llm = LLM(
    ...,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192
)

您可以根据您的GPU资源适当调整max_num_batched_tokens

b1payxdu

b1payxdu7#

@junior-zsy 我们在使用基于Qwen2-72B-Instruct-128K的4 L40 GPU进行推理时遇到了相同的问题。我们通过设置以下两个参数解决了这个问题:

llm = LLM(
    ...,
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192
)

您可以根据您的GPU资源适当调整 max_num_batched_tokens
谢谢,但上述方法对我来说没有用,我已经找到了真正的原因。

rdrgkggo

rdrgkggo8#

@youkaichao I have discovered the ultimate cause of this problem,The problem lies in run_engine_loop function

has_requests_in_progress = False
        while True:
            if not has_requests_in_progress:
                logger.debug("Waiting for new requests...")
                await self._request_tracker.wait_for_new_requests()
                logger.debug("Got new requests!")

            # Abort if iteration takes too long due to unrecoverable errors
            # (eg. NCCL timeouts).
            try:
                async with asyncio_timeout(ENGINE_ITERATION_TIMEOUT_S):
                    has_requests_in_progress = await self.engine_step()
            except asyncio.TimeoutError as exc:
                logger.error(
                    "Engine iteration timed out. This should never happen!")
                self.set_errored(exc)
                raise # If a timeout is triggered, the entire server crashes because there is no try exception caught on this exception
            await asyncio.sleep(0)

def start_background_loop(self) -> None:
        """Start the background loop."""
        if self.errored:
            raise AsyncEngineDeadError(
                "Background loop has errored already.") from self._errored_with
        if self.is_running:
            raise RuntimeError("Background loop is already running.")
        # Initialize the RequestTracker here so it uses the right event loop.
        self._request_tracker = RequestTracker()
       # Failure to capture exceptions resulted in program crash
        self._background_loop_unshielded = asyncio.get_event_loop(
        ).create_task(self.run_engine_loop())
        self._background_loop_unshielded.add_done_callback(
            partial(_log_task_completion, error_callback=self._error_callback))
        self.background_loop = asyncio.shield(self._background_loop_unshielded)

When a request triggers a timeout, a single exception can cause the entire service to crash. The raise passes an exception to the upper layer, but the upper layer's code does not try exception to catch the exception. In the end, the entire service crashes,I can solve the above problem by setting VLLM-ENGINE-ITERATION-TIMEOUT, but I don't understand why vllm code needs to be written like this, raise, but the upper layer did not capture it. Thank you

8i9zcol2

8i9zcol29#

我发现了这个问题的根本原因,问题出在run_engine_loop函数中。当一个请求触发超时时,单个异常可能导致整个服务崩溃。raise将异常传递给上层,但上层的代码没有尝试捕获这个异常。最后,整个服务崩溃了。

我可以通过设置VLLM-ENGINE-ITERATION-TIMEOUT来解决这个问题,但我不明白为什么vllm代码需要这样写,raise,而上层却没有捕获它。谢谢。

我已经尝试设置VLLM-ENGINE-ITERATION-TIMEOUT=600,在start_background_loop中使用"try-catch",并在run_engine_loop中中止这个无效的请求,但它们都没有效果。你还有其他关于如何处理这个错误的想法吗?

相关问题