Question
I loaded a merged, fine-tuned Mistral model and tried to run Triton server with the vLLM backend, following https://github.com/triton-inference-server/vllm_backend. When I start the server and run inference, I see log lines reporting that neither the GPU nor the CPU KV cache is being used. Is that even possible?
INFO 01-31 22:08:42 llm_engine.py:649] Avg prompt throughput: 85.9 tokens/s, Avg generation throughput: 45.5 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:47 llm_engine.py:649] Avg prompt throughput: 82.3 tokens/s, Avg generation throughput: 46.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:52 llm_engine.py:649] Avg prompt throughput: 84.3 tokens/s, Avg generation throughput: 45.7 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:08:57 llm_engine.py:649] Avg prompt throughput: 82.3 tokens/s, Avg generation throughput: 45.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:02 llm_engine.py:649] Avg prompt throughput: 83.6 tokens/s, Avg generation throughput: 45.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:07 llm_engine.py:649] Avg prompt throughput: 84.8 tokens/s, Avg generation throughput: 45.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:12 llm_engine.py:649] Avg prompt throughput: 81.4 tokens/s, Avg generation throughput: 45.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
INFO 01-31 22:09:18 llm_engine.py:649] Avg prompt throughput: 82.6 tokens/s, Avg generation throughput: 45.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
Per-request latency is very high, with results coming back in about 400 ms. I have tried adjusting the engine arguments, but latency remains high and cache utilization stays near zero.
Configuration
Triton config:
backend: "vllm"

# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
model.json:

{
  "model": "/code/triton/examples/llama/merged_model_data",
  "disable_log_requests": "true",
  "gpu_memory_utilization": 0.95,
  "dtype": "float16",
  "max_model_len": 128,
  "tensor_parallel_size": 8,
  "max_num_seqs": 32,
  "swap_space": 4,
  "tokenizer": "mistralai/Mistral-7B-v0.1"
}
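As a sanity check, the same engine arguments can be exercised outside Triton with vLLM's offline LLM API; if its stats logger also reports 0.0% KV cache usage there, the behavior comes from the vLLM engine itself rather than from the Triton backend. A minimal sketch, assuming the model path from the question and a vLLM version whose LLM constructor accepts these engine arguments:

from vllm import LLM, SamplingParams

# Same engine arguments as in model.json above; the model path and
# tokenizer are copied from the question and may need adjusting.
llm = LLM(
    model="/code/triton/examples/llama/merged_model_data",
    tokenizer="mistralai/Mistral-7B-v0.1",
    dtype="float16",
    max_model_len=128,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.95,
    max_num_seqs=32,
    swap_space=4,
)

# Mirror the request parameters: greedy decoding, 20 new tokens.
params = SamplingParams(temperature=0, max_tokens=20)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)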
My vLLM request looks like this:
vllm_request = {
    "text_input": context,
    "parameters": {
        "stream": False,
        "temperature": 0,
        "max_tokens": 20
    }
}
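For reference, the ~400 ms number can be reproduced with a small timing script against Triton's generate endpoint; a minimal sketch, assuming the model is deployed under the hypothetical name vllm_model on Triton's default HTTP port 8000:

import time
import requests

# "vllm_model" is a hypothetical deployment name; substitute the name
# used in your model repository. Port 8000 is Triton's default HTTP port.
URL = "http://localhost:8000/v2/models/vllm_model/generate"

payload = {
    "text_input": "Hello, my name is",
    "parameters": {"stream": False, "temperature": 0, "max_tokens": 20},
}

start = time.perf_counter()
resp = requests.post(URL, json=payload)
latency_ms = (time.perf_counter() - start) * 1000
print(f"{latency_ms:.1f} ms -> {resp.json().get('text_output')}")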
Please help me understand why my KV cache is not being used, causing latency to be this high.
1 Answer
0kjbasz61:
@nikhilshandilya, are you still running into this issue?