vllm: Have you considered sharing a single GPU's KV cache across multiple models?

gcxthw6b posted 2 months ago in Other

Sometimes it is more cost-effective to deploy several models on the same set of GPUs. But vLLM only supports serving a single model, and it claims a fixed amount of GPU memory right from startup. This makes it hard to share one GPU across multiple models.

pvcm50d1 #1

You may find --gpu-memory-utilization helpful.

noj0wjuj #2

You may find --gpu-memory-utilization helpful.
That setting is fixed when vLLM starts. If two models run on the same GPU with 50% of the memory each, one model can never use more than 50% of the card, even while the other model is completely idle.
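
For reference, the fraction being discussed is fixed when the engine is constructed, both via the CLI flag and via the Python API. A minimal sketch of that static split, using vLLM's offline LLM API (the model name and fraction here are placeholders, not a recommendation):

from vllm import LLM, SamplingParams

# The weight + KV-cache budget is carved out once, here, at engine construction.
# It is never rebalanced afterwards, even if another model sharing the GPU sits idle.
llm = LLM(model="Open-Orca/Mistral-7B-OpenOrca", gpu_memory_utilization=0.45)
outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

A second engine started in a separate process would claim its own fixed slice the same way; neither slice can borrow from the other at runtime.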

omjgkv6w #3

+1 for this feature. At work I have an Nvidia A100 80 GB card, and I would like to load several models on it: one for general language and a couple for coding. Something like:

python -m vllm.entrypoints.openai.api_server \
  --model="Open-Orca/Mistral-7B-OpenOrca" \
  --model="TheBloke/deepseek-coder-6.7B-instruct-AWQ" \
  --trust-remote-code

fcipmucu #4

@tianliplus @hewr1993 Were you able to run two models with --gpu-memory-utilization? I tried to run two models by launching python -m vllm.entrypoints.api_server --model ./open_llama_7B --swap-space 16 --gpu-memory-utilization 0.4 in two terminals, but only one model loads successfully; the other fails with:

INFO 12-28 17:20:57 llm_engine.py:73] Initializing an LLM engine with config: model='./open_llama_7B', tokenizer='./open_llama_7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
INFO 12-28 17:21:17 llm_engine.py:223] # GPU blocks: 0, # CPU blocks: 2048
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 113, in __init__
    self._init_cache()
  File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 227, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

It cannot allocate any cache blocks on the GPU.
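
One thing that may help when debugging this: check how much device memory is actually free right before launching the second engine. A small diagnostic sketch using PyTorch (it assumes both servers target cuda:0):

import torch

# mem_get_info reports (free_bytes, total_bytes) for the device, counting
# memory held by *all* processes on it, not just the current one.
free, total = torch.cuda.mem_get_info()
print(f"total:          {total / 1024**3:.1f} GiB")
print(f"free:           {free / 1024**3:.1f} GiB")
print(f"already in use: {(total - free) / 1024**3:.1f} GiB")

If the "already in use" figure is close to (or above) 0.4 × total because of the first server, that would be consistent with the second engine ending up with 0 GPU blocks.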

e1xvtsh3 #5

Hi,

I ran into the same problem today. So far in my testing, it works if you assign 0.9 to the second model (tested with codellama-7b and vicuna on an A100).

So I suspect this factor may only be applied against the free memory when the calculation is done. That makes the parameter pretty awkward to use, but I could not find anything about this in the source code.

Best regards
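
For what it's worth, here is a rough sketch of one accounting that would reproduce both observations in this thread (0.4 failing, 0.9 working). It assumes the fraction is applied to the total device memory and that memory already held by the other process is then subtracted; this is an assumption based on the error message above, not a verified reading of that vLLM version's source:

# back-of-the-envelope numbers for an 80 GB A100 (all figures are rough guesses)
total_gb = 80.0
first_engine_gb = 0.45 * total_gb   # memory already held by the first model's server
second_weights_gb = 14.0            # ~7B fp16 weights + activation workspace, roughly

for util in (0.4, 0.9):
    budget_gb = util * total_gb                    # fraction applied to *total* memory
    used_gb = first_engine_gb + second_weights_gb  # memory already consumed
    kv_cache_gb = budget_gb - used_gb              # what remains for KV-cache blocks
    print(f"gpu_memory_utilization={util}: ~{kv_cache_gb:.0f} GB left for KV cache")

# 0.4 -> about -18 GB, i.e. nothing left ("No available memory for the cache blocks")
# 0.9 -> about +22 GB, which matches the report above that 0.9 works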
