[Usage]: GPU not utilized when using vLLM AutoAWQ with 4 GPUs

falq053o · asked 2 months ago · in Other

Current environment

...

How would you like to use vllm

I have downloaded a model. Now, on my 4-GPU instance, I am trying to quantize it with AutoAWQ.
Whenever I run the script below, GPU utilization stays at 0%.
Can anyone help explain why this is happening?

import json
from huggingface_hub import snapshot_download
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os

# some other code here
# ////////////////
# some code here

# Load model
model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)

# Load quantization config from file
if args.quant_config:
    quant_config = json.loads(args.quant_config)
else:
    # Default quantization config
    print("Using default quantization config")
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Quantize
print("Quantizing the model")
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
if args.quant_path:
    print("Saving the model")
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)
else:
    print("No quantized model path provided, not saving quantized model.")

g6ll5ycj1#

Try this:

import torch
from accelerate import Accelerator, DeepSpeedPlugin
from awq import AutoAWQForCausalLM

# tokenizer, quant_config, quant_path and output_model_path are assumed to
# come from the question's script above.
deepspeed_config = {
    "train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {
        "stage": 3  # ZeRO stages go up to 3; stage 4 does not exist
    },
    "fp16": {
        "enabled": True
    }
}
deepspeed_plugin = DeepSpeedPlugin(hf_ds_config=deepspeed_config)
accelerator = Accelerator(mixed_precision='fp16', deepspeed_plugin=deepspeed_plugin)

model = AutoAWQForCausalLM.from_pretrained(output_model_path, torch_dtype=torch.float16, device_map="auto")
model = accelerator.prepare(model)
model.quantize(tokenizer, quant_config=quant_config)
if accelerator.is_main_process:
    model.save_quantized("./" + quant_path, safetensors=True)
    tokenizer.save_pretrained("./" + quant_path)
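
Once a quantized checkpoint has been saved, the 4 GPUs come into play again at inference time by loading the AWQ model into vLLM with tensor parallelism. A minimal sketch, assuming quant_path points at the directory produced by save_quantized above and that vLLM is installed:

from vllm import LLM, SamplingParams

# Load the AWQ-quantized checkpoint and shard it across the 4 GPUs.
llm = LLM(model=quant_path, quantization="awq", tensor_parallel_size=4)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)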
