Has anyone gotten Qwen-14B-Chat-AWQ working with vLLM and tensor parallelism (TP)?

sqxo8psd · asked 2 months ago · in: Other
Follow (0) | Answers (8) | Views (67)

I am trying to quantize lightblue/qarasu-14B-chat-plus-unleashed, which is based on Qwen/Qwen-14B-Chat.
With Transformers (4.36.2) and autoawq (0.1.8) it works fine.
With vLLM (0.2.7) and autoawq (0.1.8) at tensor_parallel_size=1 it also works fine.
However, with vLLM (0.2.7) and autoawq (0.1.8) at tensor_parallel_size=2, the Ray workers die with the error below.

Engine arguments

INFO 01-11 08:55:54 llm_engine.py:70] Initializing an LLM engine with config: model='/usr/local/model/llm', tokenizer='/usr/local/model/llm', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, enforce_eager=False, seed=0)

Error

File "/usr/local/api/chat_models/chat_local_vllm.py", line 112, in _prepare_vllm
     engine = AsyncLLMEngine.from_engine_args(engine_args)
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
     engine = cls(parallel_config.worker_use_ray,
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
     self.engine = self._init_engine(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
     return engine_class(*args, **kwargs)
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 109, in __init__
     self._init_workers_ray(placement_group)
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
     self._run_workers(
   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 795, in _run_workers
     driver_worker_output = getattr(self.driver_worker,
   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 81, in load_model
     self.model_runner.load_model()
   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 64, in load_model
     self.model = get_model(self.model_config)
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader.py", line 65, in get_model
     model = model_class(model_config.hf_config, linear_method)
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 231, in __init__
     self.transformer = QWenModel(config, linear_method)
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 193, in __init__
     self.h = nn.ModuleList([
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 194, in <listcomp>
     QWenBlock(config, linear_method)
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 147, in __init__
     self.mlp = QWenMLP(config.hidden_size,
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen.py", line 49, in __init__
     self.c_proj = RowParallelLinear(intermediate_size,
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 495, in __init__
     self.linear_weights = self.linear_method.create_weights(
   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/awq.py", line 85, in create_weights
     raise ValueError(
 ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

TheBloke/Qwen-14B-Chat-AWQ raises the same error.
Qwen-7B-Chat-AWQ works fine with the same arguments.
Is Qwen-14B-Chat's architecture different?

hm2xizp9 1#

The same problem occurs with csdc-atl/Baichuan2-7B-Chat-GPTQ-Int4:
--tensor-parallel-size=2 works fine, but --tensor-parallel-size=4 raises the same ValueError.

Engine arguments

INFO 01-12 13:00:00 llm_engine.py:70] Initializing an LLM engine with config: model='csdc-atl/Baichuan2-7B-Chat-GPTQ-Int4', tokenizer='csdc-atl/Baichuan2-7B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=3903d0d0b05d4adc0dd340fdf86d9a3d787ed54d, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=4, quantization=gptq, enforce_eager=False, seed=0)

Error

Traceback (most recent call last):
  File "src/vllm_demo.py", line 83, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 500, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 273, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "env/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 318, in _init_engine
    return engine_class(*args, **kwargs)
  File "env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 249, in _init_workers_ray
    self._run_workers(
  File "env/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 795, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "env/lib/python3.10/site-packages/vllm/worker/worker.py", line 81, in load_model
    self.model_runner.load_model()
  File "env/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 64, in load_model
    self.model = get_model(self.model_config)
  File "env/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 65, in get_model
    model = model_class(model_config.hf_config, linear_method)
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 375, in __init__
    super().__init__(config, "ROPE", linear_method)
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 297, in __init__
    self.model = BaiChuanModel(config, position_embedding, linear_method)
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 260, in __init__
    self.layers = nn.ModuleList([
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 261, in <listcomp>
    BaiChuanDecoderLayer(config, position_embedding, linear_method)
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 205, in __init__
    self.mlp = BaiChuanMLP(
  File "env/lib/python3.10/site-packages/vllm/model_executor/models/baichuan.py", line 89, in __init__
    self.down_proj = RowParallelLinear(intermediate_size,
  File "env/lib/python3.10/site-packages/vllm/model_executor/layers/linear.py", line 495, in __init__
    self.linear_weights = self.linear_method.create_weights(
  File "env/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/gptq.py", line 102, in create_weights
    raise ValueError(
ValueError: The input size is not aligned with the quantized weight shape. This can be caused by too large tensor parallel size.

Here input_size_per_partition is 2752 while self.quant_config.group_size is 128, so the alignment condition in GPTQLinearMethod.create_weights is not satisfied and the ValueError is raised.
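
As a quick sanity check of that arithmetic (a minimal sketch; the intermediate_size of 11008 is inferred from the reported per-partition size of 2752 at TP=4):

# Per-partition input size of the row-parallel down_proj for each TP size,
# and whether it is a multiple of the GPTQ group_size (128).
intermediate_size = 11008   # assumed: 2752 * 4, matching the error above
group_size = 128
for tp in (1, 2, 4):
    per_partition = intermediate_size // tp
    print(tp, per_partition, per_partition % group_size == 0)
# tp=1 -> 11008 True, tp=2 -> 5504 True, tp=4 -> 2752 False (2752 % 128 == 64)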

mitkmikd 2#

What does your quantization config look like (quantization=awq)?

lf5gs5x2 3#

with autoawq==0.1.8

quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
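
For reference, a minimal AutoAWQ quantization sketch using that config (paths are placeholders and calibration defaults are assumed, so treat it as an illustration rather than the exact command used):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lightblue/qarasu-14B-chat-plus-unleashed"   # placeholder: FP16 model to quantize
quant_path = "qarasu-14B-chat-awq"                        # placeholder: output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the FP16 model and tokenizer (Qwen-based models need trust_remote_code).
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize with the config above and save an AWQ checkpoint that vLLM can load.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
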
pkbketx9 4#

Using TP (tensor_parallel_size) = 2 means vLLM shards the model across 2 GPUs.
The error comes from this check in the quantization code (AWQ and GPTQ have a similar check):
vllm/vllm/model_executor/layers/quantization/gptq.py, lines 99 to 108:

if input_size_per_partition % self.quant_config.group_size != 0:
    raise ValueError(
        "The input size is not aligned with the quantized "
        "weight shape. This can be caused by too large "
        "tensor parallel size.")
if output_size_per_partition % self.quant_config.pack_factor.numerator != 0:
    raise ValueError(
        "The output size is not aligned with the quantized "
        "weight shape. This can be caused by too large "
        "tensor parallel size.")

  1. If you check config.json for the Qwen models, you will find different intermediate_size values at different scales: 11008 for Qwen-7B and 13696 for Qwen-14B.
  2. In those files the group_size in quantization_config is 128 (as for almost all quantized models). From that you can work out which splits of the intermediate dimension are possible (see the sketch after this answer):
  • Qwen-7B: 11008 = 128 * 86 = 128 * (1 * 2 * 43)  # you can run tensor parallel on 1, 2, or 43 GPUs
  • Qwen-14B: 13696 = 128 * 107 = 128 * (1 * 107)  # cannot run on 2 GPUs
  3. As a result, input_size_per_partition % self.quant_config.group_size != 0 is triggered and this error is raised.
    P.S. So far I have not found any way to make vLLM split the intermediate dimension unevenly. Alternatively, you can test with plain HF multi-GPU loading.
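
To make the divisibility argument concrete, here is a small sketch (not vLLM code) that reproduces only the input-size check quoted above, for a given intermediate_size and group_size:

# Which tensor_parallel_size values keep the per-partition input size
# (intermediate_size / tp) a multiple of the quantization group_size?
def usable_tp_sizes(intermediate_size, group_size=128, max_tp=128):
    return [tp for tp in range(1, max_tp + 1)
            if intermediate_size % tp == 0
            and (intermediate_size // tp) % group_size == 0]

print(usable_tp_sizes(11008))   # Qwen-7B  -> [1, 2, 43, 86]
print(usable_tp_sizes(13696))   # Qwen-14B -> [1, 107]

The real loader also checks output alignment against pack_factor; this sketch only models the input-size check, which is the one failing in the tracebacks above.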
eimct9ow 5#

> (quoting answer 4# above)
107 is a prime number. Does that mean Qwen-14B can only be loaded on 1 GPU or 107 GPUs, and no other GPU counts are supported?
nfeuvbwi 6#

I just ran into this problem as well, but I don't know how to solve it.

zhte4eai 7#

For Qwen-72B, which group_size works with vLLM AWQ? Thanks.

weylhg0b 8#

> (quoting answer 4# above)
Actually, you just need both alignment checks to pass, i.e. input_size_per_partition % self.quant_config.group_size == 0 and output_size_per_partition % self.quant_config.pack_factor.numerator == 0.
So you can try setting group_size to 64 during quantization. However, a smaller group_size means more quantization groups and more storage for their parameters, which can increase the compute and memory overhead.
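
As a quick check of that suggestion (a sketch using Qwen-14B's intermediate_size of 13696 mentioned above):

# With group_size = 64, Qwen-14B's intermediate dimension splits cleanly at TP = 2.
intermediate_size = 13696                 # Qwen-14B
per_partition = intermediate_size // 2    # tensor_parallel_size = 2
print(per_partition % 128)   # 64 -> fails the check with group_size = 128
print(per_partition % 64)    # 0  -> passes with group_size = 64 (6848 = 64 * 107)

In AutoAWQ terms that would mean setting "q_group_size": 64 in the quant_config shown in answer 3#.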
