vllm [Fix] Speed up model loading with the safetensors format

Posted by um6iljoc 2 months ago in Other


In /vllm/model_executor/weight_utils.py, the safetensors branch currently reads:

elif use_safetensors:
    for st_file in hf_weights_files:
        with safe_open(st_file, framework="pt") as f:
            for name in f.keys():  # noqa: SIM118
                param = f.get_tensor(name)
                yield name, param

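The default device for safe_open is 'cpu', which in some cases severely slows down weight loading. In our case, for example, loading a llama2-7b model takes 194 seconds.
To address this, we made a small change to the implementation, passing the target device to safe_open: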

elif use_safetensors:
    for st_file in hf_weights_files:
        # `device` is the target device (e.g. "cuda"); it must be defined by the surrounding code.
        with safe_open(st_file, framework="pt", device=device) as f:
            for name in f.keys():  # noqa: SIM118
                param = f.get_tensor(name)
                yield name, param

This reduced the loading time from 194 seconds to 6 seconds, a speedup of about 32x.
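To check whether the same change helps on other hardware, the two loading paths can be timed side by side. The following is only a minimal sketch, not vLLM code: the helper names `iter_safetensors_weights` and `time_weight_loading` and the `opener` parameter are hypothetical and introduced here for illustration. The only library call assumed is `safetensors.safe_open` with its `framework` and `device` arguments.

```python
import time


def iter_safetensors_weights(files, device="cpu", opener=None):
    """Yield (name, tensor) pairs from safetensors files, loading onto `device`."""
    if opener is None:
        # Imported lazily so the helper can be exercised without safetensors installed.
        from safetensors import safe_open
        opener = safe_open
    for st_file in files:
        with opener(st_file, framework="pt", device=device) as f:
            for name in f.keys():
                yield name, f.get_tensor(name)


def time_weight_loading(files, device="cpu", opener=None):
    """Return (elapsed_seconds, tensor_count) for one full pass over the weights."""
    start = time.perf_counter()
    count = sum(1 for _ in iter_safetensors_weights(files, device, opener))
    return time.perf_counter() - start, count


# Hypothetical usage; the file names are placeholders:
# files = ["model-00001-of-00002.safetensors", "model-00002-of-00002.safetensors"]
# cpu_seconds, _ = time_weight_loading(files, device="cpu")
# gpu_seconds, _ = time_weight_loading(files, device="cuda:0")
```

When timing a GPU device, it may be worth calling torch.cuda.synchronize() before reading the clock so that pending transfers are included, and the OS file cache can make a second run look faster than the first. The numbers will depend on the storage and hardware involved and need not match the 194 s and 6 s reported above.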


Source: https://github.com/vllm-project/vllm/issues/3182