Current environment
...
How would you like to use vllm
I have downloaded a model. Now, on my 4-GPU instance, I am trying to quantize it with AutoAWQ.
Whenever I run the script below, I get 0% GPU utilization.
Can anyone help explain why this is happening?
import json
from huggingface_hub import snapshot_download
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import os
# some other code here
# ////////////////
# some code here
# Load model
model = AutoAWQForCausalLM.from_pretrained(args.model_path, device_map="auto", **{"low_cpu_mem_usage": True})
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
# Load quantization config from the file given on the command line
if args.quant_config:
    with open(args.quant_config) as f:
        quant_config = json.load(f)
else:
    # Default quantization config
    print("Using default quantization config")
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
# Quantize
print("Quantizing the model")
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model and tokenizer
if args.quant_path:
    print("Saving the model")
    model.save_quantized(args.quant_path)
    tokenizer.save_pretrained(args.quant_path)
else:
    print("No quantized model path provided, not saving quantized model.")
1 Answer
g6ll5ycj1#
Try this:
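The answer's snippet is not reproduced here; a minimal sketch of what "try this" most likely points to, following AutoAWQ's own README example, is to load the model on the CPU without device_map="auto" and let quantize() move one layer at a time onto the GPU:

# Sketch of the AutoAWQ README loading pattern (an assumption, since the
# original snippet is not shown): no device_map="auto"; quantize() itself
# moves each layer to the GPU, runs calibration, and moves it back.
model = AutoAWQForCausalLM.from_pretrained(
    args.model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(args.model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)

Because only one layer sits on the GPU at any given moment, utilization is bursty: most wall-clock time goes to CPU work and host-device transfers, so a single nvidia-smi sample can easily read 0%.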