vllm: loading multiple models on the same GPU in a cluster

Posted by 6bc51xsx 5 months ago in Other


Current environment

vllm 0.3.0
ray 2.9.2

🐛 Describe the bug

I am trying to run two models (tinyllama 1b) on the same GPU. I have a set of A10 GPUs (22G RAM), so I use @serve.deployment(ray_actor_options={"num_gpus": 0.4},) and set the following arguments:

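from vllm.engine.arg_utils import AsyncEngineArgs

# model_path is defined elsewhere in the poster's code
ENGINE_ARGS = AsyncEngineArgs(
    gpu_memory_utilization=0.4,
    model=model_path,
    max_model_len=128,
    enforce_eager=True,
)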

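I can only start the model on one replica. That replica uses 40% of the GPU, and the model reserves 10G/22G of GPU RAM. However, when I try to start the second model, I get this error, even though another replica is created and the cluster's GPU usage is now 0.8/1.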


Source: https://github.com/vllm-project/vllm/issues/4242

7 answers


stszievb1#

Hmm, I'm not sure whether vllm supports sharing one GPU between two models. At least I'm not aware of any tests covering this. @simon-mo @ywang96, do you know?

5 months ago

qzlgjiam2#

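I found that vllm gets the total memory from free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info(). This does not respect which replica owns what, i.e. the 0.2 share of GPU memory. The replica still sees the whole memory size!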
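To make this observation concrete, here is a minimal sketch in plain Python (no GPU needed). It assumes the worker derives its peak usage as total minus free from the device-wide numbers, which is consistent with the values logged later in this thread. The function name is hypothetical and is not part of vllm.

```python
def peak_memory_seen(total_gpu_memory: int, free_gpu_memory: int) -> int:
    """Memory in use on the whole device, regardless of which process owns it."""
    return total_gpu_memory - free_gpu_memory


TOTAL = 23609475072  # whole A10, as logged in this thread

# Free memory logged when each replica profiled the device.
first_replica_peak = peak_memory_seen(TOTAL, 20509491200)
second_replica_peak = peak_memory_seen(TOTAL, 16070606848)

print(first_replica_peak, second_replica_peak)
```

Under this assumption, the second replica's figure includes everything the first replica already holds, because the query is device-wide rather than per process.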

5 months ago

nlejzf6q3#

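After starting the first model, I modified the code. I changed
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
so that it yields
(ServeReplica:model2:MyModel pid=658303) free_gpu_memory: 13923123200 total_gpu_memory: 16070606848 (replica I added)
instead of the values it produced originally:
(ServeReplica:model2:MyModel pid=658303) free_gpu_memory: 16070606848 total_gpu_memory: 23609475072
With that, it works! The problem is that total_gpu_memory returns the memory of the entire GPU, which is wrong. It should return the memory the replica can see based on its allocation.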
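A rough check of why the overridden values help, as a sketch only. It assumes the block count follows the formula implied by the logs in this thread (budget = total * gpu_memory_utilization, minus peak usage, floor-divided by the cache block size), that peak usage is total minus free, and that this run used gpu_memory_utilization 0.2 with the cache_block_size of 360448 printed in the thread. The function name is hypothetical.

```python
def num_gpu_blocks(total: int, free: int, utilization: float, block_bytes: int) -> int:
    """Estimate KV-cache blocks from the memory numbers a replica sees."""
    peak = total - free  # assumed: usage is derived from the two reported values
    return int((total * utilization - peak) // block_bytes)


# Values the poster substituted for the second replica.
patched = num_gpu_blocks(16070606848, 13923123200, 0.2, 360448)

# Values originally returned for the second replica.
original = num_gpu_blocks(23609475072, 16070606848, 0.2, 360448)

print(patched, original)
```

With the overridden numbers the estimate is positive, while the original device-wide numbers give a negative block count.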

5 months ago

bxjv4tth4#

@hahmad2008, could you explain what you mean by "I just modify the code after starting the first model"?

5 months ago

i7uaboj45#

I made the change once the first model was up: before starting the second model, I modified free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info().

5 months ago

d4so4syb6#

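Here is what I printed from the cache block allocation when starting two models sequentially on an A10, each with gpu_memory_utilization: 0.2 and @serve.deployment(ray_actor_options={"num_gpus": 0.2},)
These are the values printed when starting model 1 (tinyllama 1b):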

(ServeReplica:model1:MyModel pid=654872) free_gpu_memory:  20509491200 total_gpu_memory:  23609475072
(ServeReplica:model1:MyModel pid=654872) peak_memory:  3099983872
(ServeReplica:model1:MyModel pid=654872) head_size:  64 num_heads:  4 num_layers:  22
(ServeReplica:model1:MyModel pid=654872) cache_block_size:  360448
(ServeReplica:model1:MyModel pid=654872) num_gpu_blocks:  4499
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 3099983872, cache_block_size: 360448
(ServeReplica:model1:MyModel pid=654872) INFO 04-22 16:24:51 llm_engine.py:322] # GPU blocks: 4499, # CPU blocks: 11915
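The logged cache_block_size can be reproduced from the logged head_size, num_heads and num_layers. The sketch below assumes 16 tokens per block and a 2-byte dtype (fp16 or bf16); neither value appears in the thread, so treat both as assumptions. The function name is hypothetical.

```python
def cache_block_size(block_tokens: int, num_heads: int, head_size: int,
                     num_layers: int, dtype_bytes: int) -> int:
    """Bytes needed for one KV-cache block across all layers."""
    key_elements = block_tokens * num_heads * head_size
    value_elements = key_elements  # values have the same shape as keys
    return dtype_bytes * num_layers * (key_elements + value_elements)


# head_size=64, num_heads=4, num_layers=22 come from the log above;
# block_tokens=16 and dtype_bytes=2 are assumed.
size = cache_block_size(block_tokens=16, num_heads=4, head_size=64,
                        num_layers=22, dtype_bytes=2)
print(size)
```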

These are the values when starting model 2:

(ServeReplica:model2:MyModel pid=658303) free_gpu_memory:  16070606848 total_gpu_memory:  23609475072
(ServeReplica:model2:MyModel pid=658303) peak_memory:  7538868224
(ServeReplica:model2:MyModel pid=658303) head_size:  64 num_heads:  4 num_layers:  22
(ServeReplica:model2:MyModel pid=658303) cache_block_size:  360448
(ServeReplica:model2:MyModel pid=658303) num_gpu_blocks:  -7816
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 7538868224, cache_block_size: 360448
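The two num_gpu_blocks values can be reproduced from the other logged numbers. This is a sketch that assumes the budget is total_gpu_memory * gpu_memory_utilization minus peak_memory, floor-divided by cache_block_size. That formula is inferred from the logs rather than quoted from the vllm source, and the function name is hypothetical.

```python
def num_gpu_blocks(total_gpu_memory: int, gpu_memory_utilization: float,
                   peak_memory: int, cache_block_size: int) -> int:
    """KV-cache blocks that fit in the budget left after peak usage."""
    budget = total_gpu_memory * gpu_memory_utilization
    return int((budget - peak_memory) // cache_block_size)


TOTAL = 23609475072   # device-wide total, identical for both replicas
BLOCK = 360448        # cache_block_size from the logs

model1_blocks = num_gpu_blocks(TOTAL, 0.2, 3099983872, BLOCK)
model2_blocks = num_gpu_blocks(TOTAL, 0.2, 7538868224, BLOCK)

print(model1_blocks, model2_blocks)
```

For model 2 the logged peak_memory (about 7.5 GB) already exceeds the 0.2 budget (about 4.7 GB), so the result is negative and no cache blocks can be allocated.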

5 months ago

kmynzznz7#

@rkooo567 @ywang96 could you take a look at this issue?

5 months ago