vllm [Bug]: export failed when kv cache fp8 quantizing Qwen1.5-72B-Chat-GPTQ-Int4

woobm2wo · asked 3 months ago

Current environment info:

  • vllm 0.4.2 and nvidia-ammo 0.7.1 installed
  • PyTorch version: 2.3.0+cu121
  • Is debug build: No
  • CUDA used to build PyTorch: 12.1
  • ROCM used to build PyTorch: No
  • OS: Ubuntu 22.04.3 LTS (x86_64)
  • GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • Clang version: Could not collect
  • CMake version: 3.27.6
  • Libc version: glibc-2.35
  • Python version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit)
  • Python platform: Linux-3.10.0-1160.108.1.el7.x86_64-x86_64-with-glibc2.35
  • Is CUDA available: Yes
  • CUDA runtime version: 12.2.140
  • GPU models and configuration:
    GPU 0: NVIDIA A10
    GPU 1: NVIDIA A10
    GPU 2: NVIDIA A10
    GPU 3: NVIDIA A10
    GPU 4: NVIDIA A10
    GPU 5: NVIDIA A10
    GPU 6: NVIDIA A10
    GPU 7: NVIDIA A10
  • Nvidia driver version: 535.129.03
  • cuDNN version: Probably one of:
    /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.5
    /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.5
  • HIP runtime version: N/A
  • MIOpen runtime version: N/A
  • Is XNNPACK available: Yes
  • GPU topology legend:
    SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
    NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
    PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
    PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
    PIX  = Connection traversing at most a single PCIe bridge
    NV#  = Connection traversing a bonded set of # NVLinks
  • NIC Legend:
    NIC0: mlx5_0
    NIC1: mlx5_1

🐛 Describe the bug

Command:

python quantize.py --model_dir /workspace/models2/Qwen1.5-72B-Chat-GPTQ-Int4 --dtype float16 \
    --qformat fp8 --kv_cache_dtype fp8 --output_dir /workspace/output_models/qwen-72b_int4_fp8 \
    --calib_size 512 --tp_size 4

The export fails with an error.


sirbozc5 · Answer #1

I hit a similar error when trying to convert Qwen-14B-GPTQ-int4 using the latest released modelopt library.

Cannot export model to the model_config. The modelopt-optimized model state_dict (including the quantization factors) is saved to ../Qwen1.5-14B-Chat-GPTQ-Int4-g128-fpcache/modelopt_model.0.pth using torch.save for further inspection.
Detailed export error: 'QuantLinear' object has no attribute 'weight'
Traceback (most recent call last):
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/modelopt/torch/export/model_config_export.py", line 364, in export_tensorrt_llm_checkpoint
    for tensorrt_llm_config, weights in torch_to_tensorrt_llm_checkpoint(
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/modelopt/torch/export/model_config_export.py", line 220, in torch_to_tensorrt_llm_checkpoint
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/modelopt/torch/export/layer_utils.py", line 1179, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/modelopt/torch/export/layer_utils.py", line 649, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/modelopt/torch/export/layer_utils.py", line 592, in build_linear_config
    torch_weight = module.weight.detach()
                   ^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/vllm_pr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'QuantLinear' object has no attribute 'weight'
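
For reference, the modelopt_model.0.pth file mentioned in the message above can be loaded with torch.load to confirm what the GPTQ layers actually store. A minimal inspection sketch; the "self_attn.o_proj" name filter is an assumption about the Qwen checkpoint's module naming, not something confirmed in this thread:

import torch

# Path taken from the export error message above.
ckpt_path = "../Qwen1.5-14B-Chat-GPTQ-Int4-g128-fpcache/modelopt_model.0.pth"
state_dict = torch.load(ckpt_path, map_location="cpu")

# Print the tensors stored for one attention output projection. For a GPTQ
# checkpoint we would expect packed qweight/qzeros/scales (and g_idx) entries
# and no plain ".weight" entry, which is exactly what build_linear_config
# trips on when it calls module.weight.detach().
for name, value in state_dict.items():
    if torch.is_tensor(value) and "self_attn.o_proj" in name:
        print(name, tuple(value.shape), value.dtype)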

It looks like ammo/modelopt cannot handle already-quantized models; they have no support for GPTQ-quantized weights. Perhaps modelopt.torch.export.export_tensorrt_llm_checkpoint (and the related functions in ammo) should be adapted to handle QuantLinear layers.
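
As a rough sketch of what such an adaptation could look like (not the actual modelopt/ammo API), the helper below rebuilds a dense fp16 weight from an AutoGPTQ-style QuantLinear, so that code expecting module.weight would have a tensor to export. The packing conventions assumed here (4-bit values packed into int32 qweight, qzeros stored as zero minus one, per-group scales indexed by g_idx) come from the common AutoGPTQ layout and are assumptions, not verified against this model.

import torch

def dequantize_gptq_weight(qweight, qzeros, scales, g_idx, bits=4):
    """Rebuild a dense [out_features, in_features] weight from AutoGPTQ-style
    packed tensors. Assumed shapes:
      qweight: [in_features // (32 // bits), out_features], int32
      qzeros:  [n_groups, out_features // (32 // bits)], int32
      scales:  [n_groups, out_features], fp16
      g_idx:   [in_features], group index of each input channel
    """
    mask = (1 << bits) - 1
    shifts = torch.arange(0, 32, bits, device=qweight.device, dtype=torch.int32)

    # Unpack weights: each int32 packs 32 // bits consecutive input channels.
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & mask   # [rows, pack, out]
    w = w.reshape(-1, qweight.shape[1])                          # [in_features, out]

    # Unpack zero points: packed along the output dimension.
    z = (qzeros.unsqueeze(2) >> shifts.view(1, 1, -1)) & mask    # [groups, cols, pack]
    z = z.reshape(qzeros.shape[0], -1)                           # [groups, out]

    # Assumption: AutoGPTQ stores zero points as (zero - 1), so add 1 back.
    zeros = (z + 1).to(scales.dtype)
    groups = g_idx.long()
    w_fp = (w.to(scales.dtype) - zeros[groups]) * scales[groups] # [in, out]

    # Transpose to match nn.Linear.weight layout: [out_features, in_features].
    return w_fp.t().contiguous()

An exporter could then fall back to something like a hasattr(module, "qweight") check and use this dequantized tensor where it currently calls module.weight.detach().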
