Trailing newlines ("\n") with Qwen1.5-7B-Chat-AWQ and vllm v0.3.0

Posted by stszievb 2 months ago in Other


I am running vLLM in a Docker container with the following arguments:
["--quantization", "awq", "--enforce-eager", "--disable-custom-all-reduce", "--max-num-batched-tokens", "4096", "--max-model-len", "4096", "--model", "LoneStriker/Qwen1.5-7B-Chat-AWQ", "--host", "0.0.0.0", "--port", "8080", "--chat-template", "/chat_template/qwen1.5-7b-chat.jinja2"]
Chat template:

{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
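To check the exact prompt the model sees when a message history is included, the template above can be mirrored in plain Python. This is only a minimal sketch: the helper name render_chatml and the sample history are made up for illustration, and it assumes the template joins role and content with "\n" as in the usual ChatML layout.

```python
def render_chatml(messages, add_generation_prompt=True):
    # Hypothetical helper: plain-Python equivalent of the Jinja chat template.
    prompt = ""
    for message in messages:
        prompt += (
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        prompt += "<|im_start|>assistant\n"
    return prompt


# Made-up multi-turn history (more than one message, the case that misbehaves).
history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "How are you?"},
]

prompt = render_chatml(history)
print(repr(prompt))
```

Comparing this string with what the server actually renders may help rule out a template problem, such as a missing or doubled newline between turns.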

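Every 2 to 3 queries, the model fills all remaining completion tokens with newlines ("\n"). This only happens when I send more than one message to the model, that is, when I include a message history. Has anyone come across this? I tried running without --enforce-eager and --disable-custom-all-reduce. I also noticed that with v0.3.0 I get log output like the following, as if someone were calling the model: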
INFO 02-14 15:30:16 llm_engine.py:877] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
This may be irrelevant, but...

Update: I am using streaming with vllm.entrypoints.openai.api_server.
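Since the responses are streamed, one way to spot the padding early on the client side is to watch the tail of the accumulated text. The sketch below is not a fix for the underlying issue, only a hypothetical detection helper: the function names and the max_run threshold are assumptions, and deltas stands for the text pieces taken from each streamed chunk (for example choices[0].delta.content in the OpenAI-compatible response format).

```python
def trailing_newlines(text):
    # Number of consecutive newline characters at the end of text.
    return len(text) - len(text.rstrip("\n"))


def collect_until_newline_run(deltas, max_run=8):
    # Join streamed text pieces; stop once the tail is max_run newlines long.
    # Returns (text, padded): padded is True if the stream was cut short.
    text = ""
    for delta in deltas:
        text += delta
        if trailing_newlines(text) >= max_run:
            return text.rstrip("\n"), True
    return text, False


# Simulated stream: a short answer followed by newline padding.
simulated = ["Hi", " there"] + ["\n"] * 20
answer, padded = collect_until_newline_run(simulated)
print(repr(answer), padded)
```

Stopping the client-side read this way avoids waiting for the whole completion budget to be spent on newlines, and logging the padded flag shows how often the problem occurs.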


Source: https://github.com/vllm-project/vllm/issues/2870