AutoGPTQ CUDA inference: broken when group_size = 1024 and desc_act = False (Triton is unaffected)

Asked by 4si2a6ki · 4 months ago

Hi @PanQiWei and @qwopqwop200,

I've hit a strange bug that only affects group_size = 1024 + desc_act = False + CUDA inference.
Last night I ran a large batch of quantisations covering every permutation of the quantisation parameters.
Today, while testing perplexity, I found that models quantised with group_size = 1024 + desc_act = False cannot be called with the model(tokens) syntax under CUDA. model.generate(..) works fine.
Here is test code that demonstrates the problem:

import os

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import numpy as np
import torch
import torch.nn as nn
import argparse

def get_wikitext2(nsamples, seed, seqlen, tokenizer):
    from datasets import load_dataset

    wikidata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    wikilist = [' \n' if s == '' else s for s in wikidata['text'] ]

    text = ''.join(wikilist)
    trainenc = tokenizer(text, return_tensors='pt')

    import random
    random.seed(seed)
    np.random.seed(0)
    torch.random.manual_seed(0)

    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset

pretrained_model_dir = "/workspace/models/huggyllama_llama-7b"
quantized_model_dir = "/workspace/test-1024g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

if not os.path.isdir(quantized_model_dir):
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=1024,
        desc_act=False
    )

    traindataset = get_wikitext2(128, 0, 2048, tokenizer)
    # load un-quantized model, the model will always be force loaded into cpu
    print("Loading model")
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    print("Quantising")
    model.quantize(traindataset, use_triton=False)

    os.makedirs(quantized_model_dir, exist_ok=True)
    model.save_quantized(quantized_model_dir, use_safetensors=True)

print("Reloading model just quantised")
for triton in [ True, False ]:
    print(f"Testing with use_triton = {triton}")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=triton, use_safetensors=True)

    # Make a long text
    sentence = "auto gptq is " * 500
    input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to("cuda:0")
    # Run model on first 512 tokens
    try:
        output = model(input_ids = input_ids[:, 0:512])
        print(f"Succeeded for triton = {triton}")
    except:
        print(f"FAILED for triton = {triton}")
        raise

Output:

root@1f66221a311b:/workspace/gptq-ppl-test# python test_1024.py
Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.48k/8.48k [00:00<00:00, 4.79MB/s]
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.84k/6.84k [00:00<00:00, 5.21MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.25k/9.25k [00:00<00:00, 6.48MB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.72M/4.72M [00:01<00:00, 4.61MB/s]
Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
Token indices sequence length is longer than the specified maximum sequence length for this model (335688 > 2048). Running this sequence through the model will result in indexing errors
Loading model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:21<00:00, 40.74s/it]
Quantising
Reloading model just quantised
Testing with use_triton = True
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:33<00:00,  2.80s/it]
Succeeded for triton = True
Testing with use_triton = False
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
FAILED for triton = False
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/gptq-ppl-test/test_1024.py:66 in <module>                                             │
│                                                                                                  │
│   63 │   input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to(    │
│   64 │   # Run model on first 512 tokens                                                         │
│   65 │   try:                                                                                    │
│ ❱ 66 │   │   output = model(input_ids[:, 0:512])                                                 │
│   67 │   │   print(f"Succeeded for triton = {triton}")                                           │
│   68 │   except:                                                                                 │
│   69 │   │   print(f"FAILED for triton = {triton}")                                              │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:374 in forward               │
│                                                                                                  │
│   371 │   │   return self.model.to(device)                                                       │
│   372 │                                                                                          │
│   373 │   def forward(self, *args, **kwargs):                                                    │
│ ❱ 374 │   │   return self.model(*args, **kwargs)                                                 │
│   375 │                                                                                          │
│   376 │   def generate(self, **kwargs):                                                          │
│   377 │   │   """shortcut for model.generate"""                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:688 in       │
│ forward                                                                                          │
│                                                                                                  │
│   685 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return   │
│   686 │   │                                                                                      │
│   687 │   │   # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)    │
│ ❱ 688 │   │   outputs = self.model(                                                              │
│   689 │   │   │   input_ids=input_ids,                                                           │
│   690 │   │   │   attention_mask=attention_mask,                                                 │
│   691 │   │   │   position_ids=position_ids,                                                     │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:578 in       │
│ forward                                                                                          │
│                                                                                                  │
│   575 │   │   │   │   │   None,                                                                  │
│   576 │   │   │   │   )                                                                          │
│   577 │   │   │   else:                                                                          │
│ ❱ 578 │   │   │   │   layer_outputs = decoder_layer(                                             │
│   579 │   │   │   │   │   hidden_states,                                                         │
│   580 │   │   │   │   │   attention_mask=attention_mask,                                         │
│   581 │   │   │   │   │   position_ids=position_ids,                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:306 in       │
│ forward                                                                                          │
│                                                                                                  │
│   303 │   │   # Fully Connected                                                                  │
│   304 │   │   residual = hidden_states                                                           │
│   305 │   │   hidden_states = self.post_attention_layernorm(hidden_states)                       │
│ ❱ 306 │   │   hidden_states = self.mlp(hidden_states)                                            │
│   307 │   │   hidden_states = residual + hidden_states                                           │
│   308 │   │                                                                                      │
│   309 │   │   outputs = (hidden_states,)                                                         │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:158 in       │
│ forward                                                                                          │
│                                                                                                  │
│   155 │   │   self.act_fn = ACT2FN[hidden_act]                                                   │
│   156 │                                                                                          │
│   157 │   def forward(self, x):                                                                  │
│ ❱ 158 │   │   return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))            │
│   159                                                                                            │
│   160                                                                                            │
│   161 class LlamaAttention(nn.Module):                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl            │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward                   │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py:221 in forward       │
│                                                                                                  │
│   218 │   │   │                                                                                  │
│   219 │   │   │      weight = torch.bitwise_right_shift(torch.unsqueeze(self.qweight, 1).expan   │
│   220 │   │   │      torch.bitwise_and(weight,(2 ** self.bits) - 1, out=weight)                  │
│ ❱ 221 │   │   │      weight = weight.reshape(-1, self.group_size, weight.shape[2])               │
│   222 │   │   │   elif self.bits == 3:                                                           │
│   223 │   │   │      zeros = self.qzeros.reshape(self.qzeros.shape[0], self.qzeros.shape[1]//3   │
│   224 │   │   │      zeros = (zeros >> self.wf.unsqueeze(0))                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[-1, 1024, 4096]' is invalid for input of size 45088768

As you can see, Triton inference works without any problem.
But CUDA inference on the model quantised with group_size = 1024 + desc_act = False fails with this error.
The error does not occur with CUDA + group_size = 1024 + desc_act = True.
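
For what it's worth, the numbers in the error message look like a plain divisibility problem: the failing layer appears to be down_proj (in_features = 11008, out_features = 4096 for LLaMA-7B), and 11008 is not a multiple of 1024, so the reshape(-1, group_size, out_features) shown at qlinear_old.py:221 in the traceback cannot succeed. A minimal sketch of the arithmetic, assuming those standard LLaMA-7B shapes:

import torch

# Hypothetical repro of the failing reshape, using LLaMA-7B down_proj shapes.
in_features, out_features, group_size = 11008, 4096, 1024

weight = torch.empty(in_features, out_features)
print(weight.numel())               # 45088768, matching the error message
print(in_features % group_size)     # 768 -> in_features is not a multiple of 1024

try:
    # Same reshape target as qlinear_old.py (line 221 in the traceback above)
    weight.reshape(-1, group_size, out_features)
except RuntimeError as e:
    print(e)                        # shape '[-1, 1024, 4096]' is invalid for input of size 45088768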

yvfmudvl1#

@TheBloke do you have a plain-English summary of the benefits of using Triton versus not using it? I haven't observed a performance difference myself, but online I see plenty of code examples that use it and plenty that don't.

7uhlpewt2#

Triton doesn't support Windows, which makes it unusable for a lot of people.
For those on Linux, I currently recommend against Triton because it is slower than CUDA. All the models I publish use a format that is guaranteed CUDA-compatible, i.e. I don't use desc_act together with group_size.
The one exception where CUDA is not faster: if desc_act and group_size are used together, CUDA performance drops to around 5 tokens/sec, while Triton does a little better.
Triton also tends to use less VRAM.
So:

  • On Windows, always use CUDA;
  • For maximum performance with models that don't combine group_size and desc_act: CUDA;
  • For maximum performance with models that use both group_size and desc_act (and are theoretically the most accurate): Triton;
  • If you need to minimise VRAM usage, e.g. when trying to load a 4-bit 65B model on 2 x 24GB cards: Triton.

Ideally all these differences would get resolved so the two backends were directly comparable, but right now we don't seem to be close to that.
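
As a rough illustration of that rule of thumb (just a sketch, not an official recommendation; the model path is a placeholder), you can pick the backend by reading the quantize_config.json saved alongside the model before loading it:

import json, os, sys

from auto_gptq import AutoGPTQForCausalLM

quantized_model_dir = "/workspace/test-1024g"   # placeholder path

# Read the quantisation settings that were saved with the model
with open(os.path.join(quantized_model_dir, "quantize_config.json")) as f:
    qcfg = json.load(f)

uses_group_size = qcfg.get("group_size", -1) != -1
uses_desc_act = qcfg.get("desc_act", False)

# Rule of thumb from above: Triton only wins when group_size and desc_act are
# combined, and it is not available on Windows at all.
use_triton = uses_group_size and uses_desc_act and sys.platform != "win32"

model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_triton=use_triton,
    use_safetensors=True,
)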

4zcjmb1e3#

That's a fantastic reply, thank you.
To clarify, looking at some of your quantised models, does the example below mean only desc_act is used? So Triton wouldn't give any extra performance?

{
  "bits": 4,
  "group_size": -1,
  "damp_percent": 0.01,
  "desc_act": true,
  "sym": true,
  "true_sequential": true
}

Reference: https://huggingface.co/TheBloke/WizardLM-30B-Uncensored-GPTQ/blob/main/quantize_config.json

exdqitrt4#

Correct; in my testing, Triton performs much worse in that case.

mcdcgff05#

I hit the same error when trying to quantise openlm-research/open_llama_3b and psmathur/orca_mini_3b. My source code is the same as in the README; this is what I get:

weight = torch.bitwise_right_shift(torch.unsqueeze(self.qweight, 1).expan
torch.bitwise_and(weight,(2 ** self.bits) - 1, out=weight)
weight = weight.reshape(-1, self.group_size, weight.shape[2])
elif self.bits == 3:
zeros = self.qzeros.reshape(self.qzeros.shape[0], self.qzeros.shape[1]//3
zeros = (zeros >> self.wf.unsqueeze(0))

RuntimeError: shape '[-1, 128, 3200]' is invalid for input of size 27648000

I'm not sure whether this is the same bug, but my environment is Windows (no Triton).
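
The numbers do seem to fit the same divisibility pattern described above. Assuming open_llama_3b's down_proj is 8640 x 3200 (I haven't double-checked the model config), a quick check:

# Hypothetical check that this second error matches the same pattern:
# in_features not divisible by group_size.
in_features, out_features, group_size = 8640, 3200, 128

print(in_features * out_features)   # 27648000, matching the error message
print(in_features % group_size)     # 64 -> 8640 is not a multiple of 128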
