I am trying to quantize lightblue/qarasu-14B-chat-plus-unleashed, which is based on Qwen/Qwen-14B-Chat.
Transformers (4.36.2) together with autoawq (0.1.8) works fine.
vLLM (0.2.7) with autoawq (0.1.8) and tensor_parallel_size=1 also works fine.
However, with vLLM (0.2.7), autoawq (0.1.8), and tensor_parallel_size=2, the ray workers die with an error.
Engine args
INFO 01-11 08:55:54 llm_engine.py:70] Initializing an LLM engine with config: model='/usr/local/model/llm', tokenizer='/usr/local/model/llm', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)
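For reference, a minimal sketch of how an engine with this config can be constructed (the wrapper in the traceback below is my own code; the parameters mirror the logged config, and the model path is a placeholder):

    from vllm.engine.arg_utils import AsyncEngineArgs
    from vllm.engine.async_llm_engine import AsyncLLMEngine

    # Engine args mirroring the logged config (model path is a placeholder)
    engine_args = AsyncEngineArgs(
        model="/usr/local/model/llm",
        trust_remote_code=True,
        dtype="float16",
        max_model_len=2048,
        tensor_parallel_size=2,  # works with 1, ray workers die with 2
        quantization="awq",
    )
    engine = AsyncLLMEngine.from_engine_args(engine_args)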
Error
File "/usr/local/api/chat_models/chat_local_vllm.py", line 112, in _prepare_vllm
engine = AsyncLLMEngine.from_engine_args(engine_args)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
engine = cls(parallel_config.worker_use_ray,
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
return engine_class(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
self._run_workers(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 795, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 81, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 64, in load_model
self.model = get_model(self.model_config)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 65, in get_model
model = model_class(model_config.hf_config, linear_method)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 231, in __init__
self.transformer = QWenModel(config, linear_method)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 193, in __init__
self.h = nn.ModuleList([
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 194, in <listcomp>
QWenBlock(config, linear_method)
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 147, in __init__
self.mlp = QWenMLP(config.hidden_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 49, in __init__
self.c_proj = RowParallelLinear(intermediate_size,
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 495, in __init__
self.linear_weights = self.linear_method.create_weights(
File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.
TheBloke/Qwen-14B-Chat-AWQ raises the same error.
Qwen-7B-Chat-AWQ works fine with the same arguments.
Is the architecture of Qwen-14B-Chat different?
8 Answers
hm2xizp91#
Same problem with csdc-atl/Baichuan2-7B-Chat-GPTQ-Int4: --tensor-parallel-size=2 works fine, but --tensor-parallel-size=4 raises the same ValueError. With TP=4, input_size_per_partition is 2752 (11008 / 4) while self.quant_config.group_size is 128; since 2752 % 128 = 64 != 0, the alignment check in GPTQLinearMethod.create_weights fails and the error is raised.
mitkmikd2#
What does quantization=awq look like?
lf5gs5x23#
with autoawq==0.1.8
pkbketx94#
Using TP (tensor_parallel_size) = 2 means vLLM shards the model across 2 GPUs.
The error comes from this check in the quantization code (AWQ and GPTQ have similar checks):
vllm/vllm/model_executor/layers/quantization/gptq.py, lines 99 to 108:

    if input_size_per_partition % self.quant_config.group_size != 0:
        raise ValueError(
            "The input size is not aligned with the quantized "
            "weight shape. This can be caused by too large "
            "tensor parallel size.")
    if output_size_per_partition % self.quant_config.pack_factor.numerator != 0:
        raise ValueError(
            "The output size is not aligned with the quantized "
            "weight shape. This can be caused by too large "
            "tensor parallel size.")

If you check the config.json of the Qwen models, you will find different intermediate_size values at different scales: 11008 for Qwen-7B and 13696 for Qwen-14B. The group_size in quantization_config is 128 (as for almost all quantized models), so the number of quantization groups available to shard is intermediate_size / group_size: 86 for Qwen-7B and 107 for Qwen-14B. The per-GPU slice is intermediate_size / tensor_parallel_size, so the tensor parallel size must evenly divide the group count; otherwise input_size_per_partition % self.quant_config.group_size != 0 and this error is raised. P.S. So far I have not found a way to shard the intermediate dimension unevenly in vLLM. Alternatively, you could test with plain HF multi-GPU loading instead.
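To make the arithmetic concrete, here is a small hypothetical helper (not part of vLLM; the names are mine) that enumerates which tensor_parallel_size values keep the per-GPU slice aligned with the group size:

    # Hypothetical helper: which TP sizes keep the per-GPU input slice
    # of a row-parallel layer divisible by the quantization group size?
    def valid_tp_sizes(intermediate_size, group_size, max_gpus=8):
        return [tp for tp in range(1, max_gpus + 1)
                if intermediate_size % tp == 0
                and (intermediate_size // tp) % group_size == 0]

    print(valid_tp_sizes(11008, 128))  # Qwen-7B:  [1, 2]  (86 groups = 2 * 43)
    print(valid_tp_sizes(13696, 128))  # Qwen-14B: [1]     (107 groups, prime)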
eimct9ow5#
(quoting pkbketx94#'s answer above)
107 is a prime number. Does that mean Qwen-14B can only be loaded onto 1 GPU or 107 GPUs, and cannot support any other GPU count?
nfeuvbwi6#
I just ran into this problem as well, but I don't know how to solve it.
zhte4eai7#
For Qwen-72B, which group_size works with vLLM AWQ? Thanks.
weylhg0b8#
(quoting pkbketx94#'s answer above)
Actually, both checks just need to pass, i.e. input_size_per_partition % self.quant_config.group_size == 0 and output_size_per_partition % self.quant_config.pack_factor.numerator == 0.
So you can try setting group_size to 64 during quantization: for Qwen-14B at TP=2 the per-GPU input size is 13696 / 2 = 6848, which is divisible by 64, so the check passes. However, a smaller group_size means more quantization groups, and thus more scales and zero points to store, which can increase the compute and storage overhead.
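For reference, a minimal sketch of re-quantizing with AutoAWQ using q_group_size=64 (the model and output paths are placeholders, and calibration settings are left at AutoAWQ's defaults; adjust to your setup):

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "Qwen/Qwen-14B-Chat"      # base model (placeholder)
    quant_path = "Qwen-14B-Chat-AWQ-g64"   # output dir (placeholder)

    # q_group_size=64 instead of the default 128, so the TP=2 slice
    # (13696 / 2 = 6848) stays aligned with the group size
    quant_config = {"zero_point": True, "q_group_size": 64,
                    "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)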