vLLM ignores my requests when I increase the number of concurrent requests

ux6nzvsh  posted 2 months ago in Other

I'm running vLLM in a RunPod container.
Template: runpod/pytorch:2.1.1-py3.10-cuda12.1.1-devel-ubuntu22.04
GPU Cloud: 1 x RTX 3090 | 12 vCPU 31 GB RAM
With 9 concurrent requests everything runs fine, but as soon as I increase it to 10 it starts to hang.
python -m vllm.entrypoints.openai.api_server --model openchat/openchat_3.5 --tensor-parallel-size 1

...
INFO:     127.0.0.1:46228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:46230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 02-05 04:53:20 async_llm_engine.py:111] Finished request cmpl-672a8058f6cb4d1d8f5ba5397af93575.
INFO 02-05 04:53:20 async_llm_engine.py:111] Finished request cmpl-4314994fe17a4b708bdbc0570668107b.
INFO 02-05 04:53:20 async_llm_engine.py:111] Finished request cmpl-85089ac09b6241f781d49b2b05fec1c6.
INFO 02-05 04:53:20 async_llm_engine.py:111] Finished request cmpl-b66387e22ebb4b33a010835b5d31f499.
INFO 02-05 04:53:21 llm_engine.py:706] Avg prompt throughput: 1137.2 tokens/s, Avg generation throughput: 193.7 tokens/s, Running: 5 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.2%, CPU KV cache usage: 0.0%
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-e9f50d97a01148308ccb3e8626b6feb6.
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-b0e0c9e0b76c450c8d4990fbab5f9fa6.
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-90c28be5bdb44d079f8e3bc4b281cc29.
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-55440ec21be24922b2ef820fdead76bc.
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-e2a78285e0814d518750ef67f28a7af4.
INFO 02-05 04:53:21 async_llm_engine.py:111] Finished request cmpl-e14764b23e38400393e22015ea6c6fd7.

It just stops processing the last input and hangs there.

Processing files:  67%|█████████████████████▎          | 2/3 [01:34<00:50, 50.91s/files]
User 0: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:06<00:01,  1.88s/lines]
User 1: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.13s/lines]
User 2: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.08s/lines]
User 5: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.00s/lines]
User 3: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.02s/lines]
User 9: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.01s/lines]
User 8: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.16s/lines]
User 4: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:01,  1.99s/lines]
User 7: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:01,  1.91s/lines]
User 6: Processing intent.txt:  80%|███████████████▏   | 4/5 [00:07<00:02,  2.06s/lines]

I tried adding --swap-space 0, but the problem persists and nothing changes.
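
For reference, the kind of load involved can be approximated with a stripped-down client like the one below. This is a hypothetical sketch, not the actual processing script; the model name and the default port 8000 are taken from the launch command above.

import asyncio
from openai import AsyncOpenAI

# Hypothetical minimal reproduction: fire N concurrent chat completions at the
# OpenAI-compatible server started above (no --port given, so the default 8000).
client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

async def one_request(i: int) -> str:
    completion = await client.chat.completions.create(
        model="openchat/openchat_3.5",
        messages=[{"role": "user", "content": f"Request {i}: suggest a dinner meal"}],
        temperature=0.0,
    )
    return completion.choices[0].message.content

async def main(n: int) -> None:
    # With n = 9 all requests complete; with n = 10 the last one reportedly hangs.
    results = await asyncio.gather(*(one_request(i) for i in range(n)))
    print(f"Got {len(results)} responses")

asyncio.run(main(10))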

pokxtpni 1#

The problem is most likely in your processing script. I've handled hundreds of concurrent requests without running into anything like this. Could you share your processing script?

ne5o7dgx 2#

@savannahfung @WoosukKwon @hmellor I'm hitting a problem similar to the one described above, including with vllm.entrypoints.api_server. At first it handles concurrent requests fine, but eventually it starts to hang. My hypothesis is that this is related to GPU KV cache usage steadily climbing to 99.4%, which leads to the crash/hang and leaves the remaining requests in the "Pending" state. This is most likely related to #2731.
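
One way to check that hypothesis is to watch the scheduler and cache gauges while the requests are in flight. A rough sketch, assuming a vLLM build that exposes the Prometheus endpoint at /metrics (the exact metric names may differ between versions):

import time
import requests  # pip install requests

# Poll vLLM's Prometheus endpoint and print the scheduler/cache gauges.
# The metric names below are assumptions and may vary by vLLM version.
METRICS_URL = "http://localhost:8000/metrics"
WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:gpu_cache_usage_perc",
)

while True:
    body = requests.get(METRICS_URL, timeout=5).text
    for line in body.splitlines():
        if line.startswith(WATCHED):
            print(line)
    print("---")
    time.sleep(2)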

gudnpqoy 3#

vtwuwzda 4#

That's odd, because before it ever got near 99.9%, GPU KV cache usage was only 4.2% and CPU KV cache usage was 0.0%. Unless it suddenly spiked to 99.9%.

r3i60tvu 5#

Hey @savannahfung, just letting you know that I was able to run 100 concurrent requests without problems. However, when I tried sending the 101st request, I noticed GPU KV cache usage spiking to 99.4%, and as a result all subsequent requests ended up in the "Pending" state.
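
Until the scheduling issue itself is fixed, one possible client-side mitigation, just a sketch based on the observation above, is to cap the number of in-flight requests with a semaphore so the server never sees more than it can admit at once. The limit of 100, the model name, and the port are assumptions:

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
# At most 100 requests are outstanding at any time (the limit is only an example).
semaphore = asyncio.Semaphore(100)

async def limited_request(messages):
    async with semaphore:
        completion = await client.chat.completions.create(
            model="openchat/openchat_3.5",
            messages=messages,
            temperature=0.0,
        )
        return completion.choices[0].message.content

async def main():
    msgs = [{"role": "user", "content": "suggest a dinner meal"}]
    # 150 requests total, but never more than 100 in flight against the server.
    results = await asyncio.gather(*(limited_request(msgs) for _ in range(150)))
    print(f"Got {len(results)} responses")

asyncio.run(main())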

2sbarzqh 6#

I meant the part of your script that makes the requests; you included so much extra code that it's hard to pin down where the problem is. Could you put together a minimal reproducible script?

I'm following this issue because I'm running into it as well: exactly as @nehalvaghasiya described, 100 concurrent requests cycle in and out of the pending queue indefinitely.

hrirmatl 7#

Hi everyone,
I'm running into the same problem.
Python==3.11.5
vllm==0.4.0.post1
openai==1.23.1
This is how I start the OpenAI server:
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2 --uvicorn-log-level debug --port 8001 > vllm_server_log.txt 2>&1 &
And this is the Python code that triggers the error:

import asyncio
from openai import AsyncOpenAI

model_name='mistralai/Mistral-7B-Instruct-v0.2'
client=AsyncOpenAI(api_key="EMPTY",base_url="http://localhost:8001/v1/")

async def _send_chat_completion(messages):
    completion = await client.chat.completions.create(model=model_name, messages=messages, temperature=0.0)
    return completion.choices[0].message.content.strip()

async def _send_async_requests(prompts_messages):
    tasks = [_send_chat_completion(msgs) for msgs in prompts_messages]
    responses = await asyncio.gather(*tasks)
    return responses

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]
print('Starting first run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
print('Starting second run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))

The second run never completes, and the server log doesn't even mention that it received the requests.
I'd point other users facing a similar problem to the corresponding openai GitHub issue, where they report that they are actively working on a fix, but it seems to be a deeper problem involving other modules that openai uses (see openai/openai-python#769).
My workaround is to use raw requests, where I haven't seen this bug (although in the issue linked above openai reports that you may hit the same problem there as well). Adjusting the code above accordingly:

import asyncio
import aiohttp
async def _send_chat_completion(messages):
    print('starting openai request')
    async with aiohttp.ClientSession() as session:
        response = await session.post(url="http://localhost:8001/v1/chat/completions",
                                      json={"messages": messages, "model": "mistralai/Mistral-7B-Instruct-v0.2"},
                                      headers={"Content-Type": "application/json"})
        return await response.json()

async def _send_async_requests(prompts_messages):
    tasks = [_send_chat_completion(msgs) for msgs in prompts_messages]
    responses = await asyncio.gather(*tasks)
    responses = [resp['choices'][0]['message']['content'].strip() for resp in responses]
    return responses

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]
print('Starting first run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
print('Starting second run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
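
If the hang really does come from the shared AsyncOpenAI client holding connections tied to the first (already closed) event loop, another possible workaround, untested and only a sketch, is to create and close the client inside each asyncio.run call so that nothing outlives its loop:

import asyncio
from openai import AsyncOpenAI

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'

async def _send_async_requests(prompts_messages):
    # Create the client on the currently running loop and close it before the
    # loop shuts down, so no connections are reused across asyncio.run calls.
    client = AsyncOpenAI(api_key="EMPTY", base_url="http://localhost:8001/v1/")
    try:
        tasks = [
            client.chat.completions.create(model=model_name, messages=msgs, temperature=0.0)
            for msgs in prompts_messages
        ]
        completions = await asyncio.gather(*tasks)
        return [c.choices[0].message.content.strip() for c in completions]
    finally:
        await client.close()

prompts_msgs = [{'role': 'user', 'content': 'suggest a dinner meal'}]
print('Starting first run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))
print('Starting second run..')
responses = asyncio.run(_send_async_requests([prompts_msgs] * 5))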

Hopefully this gets fixed soon.
