INFO 12-28 17:20:57 llm_engine.py:73] Initializing an LLM engine with config: model='./open_llama_7B', tokenizer='./open_llama_7B', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
INFO 12-28 17:21:17 llm_engine.py:223] # GPU blocks: 0, # CPU blocks: 2048
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/entrypoints/api_server.py", line 80, in <module>
engine = AsyncLLMEngine.from_engine_args(engine_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 496, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
self.engine = self._init_engine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 314, in _init_engine
return engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 113, in __init__
self._init_cache()
File "/mnt/sda/2022-0526/home/hlh/python_venvs/vllm-py3.11.6/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 227, in _init_cache
raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
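For context, the "# GPU blocks: 0" line above is the proximate cause of the ValueError. The sketch below is a simplified model of that accounting, not vLLM's actual implementation; the function name and the block size are assumptions for illustration only.

```python
# Simplified sketch (NOT vLLM's actual code) of how the KV-cache block
# count can come out as zero before the ValueError is raised.

def num_gpu_blocks(total_mem_gb, peak_used_gb, gpu_memory_utilization,
                   block_size_gb):
    # Budget = the allowed fraction of total GPU memory, minus whatever
    # is already in use after loading and profiling the model.
    budget = total_mem_gb * gpu_memory_utilization - peak_used_gb
    return max(int(budget // block_size_gb), 0)

# A 7B fp16 model (~14 GB weights) alone on a 24 GB card leaves room
# for cache blocks:
assert num_gpu_blocks(24, 15, 0.90, 0.002) > 0

# If memory already in use on the device exceeds the budget, the count
# drops to 0 and the engine raises
# "No available memory for the cache blocks."
assert num_gpu_blocks(24, 23, 0.90, 0.002) == 0
```

With a budget of zero blocks there is nowhere to store the KV cache, so the engine refuses to start rather than run with no cache at all.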
5 answers

pvcm50d1 #1
You may find --gpu-memory-utilization helpful.

noj0wjuj #2
You may find --gpu-memory-utilization helpful. Note that this setting is fixed when vLLM starts. If two models share one GPU and each claims 50% of its memory, neither can ever use more than 50%, even while the other model sits idle.
omjgkv6w #3

Upvoting this feature. I have an Nvidia A100 80 GB at work and would like to load several models on it: one for natural language and a few for coding.
fcipmucu #4

@tianliplus @hewr1993 Were you able to run two models with --gpu-memory-utilization? I tried launching
python -m vllm.entrypoints.api_server --model ./open_llama_7B --swap-space 16 --gpu-memory-utilization 0.4
in two terminals, but only one model loads successfully; the other fails with an error saying it cannot allocate memory blocks on the GPU.
e1xvtsh3 #5

Hi,
I hit the same problem today. So far I have found that setting the second model to 0.9 makes it work (tested with codellama-7b and vicuna on an A100). So I suspect the factor may be applied against free memory only, which makes the parameter rather hard to use, but I could not find anything confirming this in the source code.
Best regards
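The guess above can be made concrete with a back-of-the-envelope model. It is an assumption consistent with both reports in this thread, not verified against vLLM's source: suppose the cache budget is total * gpu_memory_utilization minus all memory already resident on the device, including the other server's allocation.

```python
# Hypothetical accounting (an assumption, not traced from vLLM source)
# that would explain why 0.4 + 0.4 fails but 0.4 + 0.9 works.

TOTAL_GB = 80.0  # A100 80GB

def cache_budget_gb(utilization, other_process_gb, own_weights_gb):
    # Budget = allowed fraction of TOTAL memory, minus everything already
    # on the device (the other server's allocation counts too).
    used = other_process_gb + own_weights_gb
    return TOTAL_GB * utilization - used

# Two servers at 0.4: the first pins about 32 GB, so for the second
# (7B fp16 weights ~14 GB) the budget goes negative -> "# GPU blocks: 0".
print(cache_budget_gb(0.4, other_process_gb=32, own_weights_gb=14))  # -14.0

# Raising the second server to 0.9 leaves headroom, matching the
# codellama-7b + vicuna observation above.
print(cache_budget_gb(0.9, other_process_gb=32, own_weights_gb=14))  # 26.0
```

Under this reading, the second server's fraction must cover the first server's memory as well as its own, which is why only a value much larger than the "fair" share gets it past initialization.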