vLLM reports 0% KV cache usage for a Mistral model


Question

I loaded a merged, fine-tuned Mistral model and tried to run Triton server with the vLLM backend, following https://github.com/triton-inference-server/vllm_backend. When I start the server and run inference, the log lines below report 0% GPU and CPU KV cache usage. Is that even possible?

INFO 01-31 22:08:42 llm_engine.py:649] Avg prompt throughput: 85.9 tokens/s, Avg generation throughput: 45.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:47 llm_engine.py:649] Avg prompt throughput: 82.3 tokens/s, Avg generation throughput: 46.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:52 llm_engine.py:649] Avg prompt throughput: 84.3 tokens/s, Avg generation throughput: 45.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:57 llm_engine.py:649] Avg prompt throughput: 82.3 tokens/s, Avg generation throughput: 45.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:02 llm_engine.py:649] Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 45.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:07 llm_engine.py:649] Avg prompt throughput: 84.8 tokens/s, Avg generation throughput: 45.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:12 llm_engine.py:649] Avg prompt throughput: 81.4 tokens/s, Avg generation throughput: 45.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:18 llm_engine.py:649] Avg prompt throughput: 82.6 tokens/s, Avg generation throughput: 45.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%

Latency per request is very high, around 400 ms per result. I have tried tuning the engine arguments, but latency stays high and the reported cache utilization stays near zero.

Configuration

Triton config:

backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
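
For reference, the vllm_backend expects this config.pbtxt to sit next to a model.json inside a standard Triton model repository; the layout below is illustrative (directory and model names are placeholders, not my exact paths):

model_repository/
└── vllm_model/
    ├── 1/
    │   └── model.json
    └── config.pbtxt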

model.json

{
    "model":"/code/triton/examples/llama/merged_model_data",
    "disable_log_requests": "true",
    "gpu_memory_utilization": 0.95,
    "dtype": "float16",
    "max_model_len": 128,
    "tensor_parallel_size": 8,
    "max_num_seqs": 32,
    "swap_space": 4,
    "tokenizer": "mistralai/Mistral-7B-v0.1"
}
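
As a sanity check on the 0.0% readings: vLLM's log reports KV cache usage as the fraction of cache blocks currently allocated, so a large cache combined with short sequences can legitimately round to zero. A rough back-of-the-envelope sketch (the Mistral-7B architecture numbers are taken from its public config; the GPU memory figures are assumptions, substitute your own):

# Rough KV cache sizing for Mistral-7B; architecture values below are taken
# from the public Mistral-7B config, GPU memory figures are hypothetical.
num_layers = 32
num_kv_heads = 8           # grouped-query attention
head_dim = 128
dtype_bytes = 2            # float16

# Bytes of K and V stored per token, summed over all layers (whole model)
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(kv_bytes_per_token)                  # 131072 bytes = 128 KiB per token

# Hypothetical: 8 GPUs with roughly 70 GB each left for KV cache after weights
cache_bytes = 8 * 70 * 1024**3
max_cached_tokens = cache_bytes // kv_bytes_per_token
print(max_cached_tokens)                   # ~4.6 million tokens of capacity

# A single request capped at max_model_len=128 occupies a tiny fraction of that
print(f"{128 / max_cached_tokens:.4%}")    # ~0.003%, which the log rounds to 0.0%

With max_model_len at 128 and max_num_seqs at 32, even a full batch stays well under 0.1% of a cache this size, so 0.0% in the log does not by itself mean the cache is unused.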

My vLLM request looks like this:

vllm_request = {
    "text_input": context,
    "parameters": {
        "stream": False,
        "temperature": 0,
        "max_tokens": 20
    }
}
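
For completeness, this is roughly how the request is sent (a minimal sketch assuming Triton's HTTP generate endpoint and a model registered as vllm_model; the URL, port, model name, and prompt are placeholders):

import requests

context = "Summarize the Mistral architecture."   # placeholder prompt

vllm_request = {
    "text_input": context,
    "parameters": {
        "stream": False,
        "temperature": 0,
        "max_tokens": 20,
    },
}

# POST to Triton's generate endpoint; adjust host, port, and model name.
resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",
    json=vllm_request,
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("text_output"))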

Please help me understand why my KV cache appears unused and why the latency is so high.

0kjbasz6 1#

@nikhilshandilya, are you still running into this issue?
