Hi @PanQiWei and @qwopqwop200,

I've hit a strange bug that only affects group_size = 1024 + desc_act=False + CUDA inference.

Last night I ran a big batch of quantisations covering every permutation of the quantisation parameters. Today I was testing perplexity and found that models quantised with group_size = 1024 + desc_act = False do not support the model(tokens) syntax on CUDA. model.generate(..) works fine, though.

Here is test code that demonstrates the problem:
import os
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import numpy as np
import torch
import torch.nn as nn
import argparse

def get_wikitext2(nsamples, seed, seqlen, tokenizer):
    from datasets import load_dataset
    wikidata = load_dataset('wikitext', 'wikitext-2-raw-v1', split='test')
    wikilist = [' \n' if s == '' else s for s in wikidata['text'] ]

    text = ''.join(wikilist)
    trainenc = tokenizer(text, return_tensors='pt')

    import random
    random.seed(seed)
    np.random.seed(0)
    torch.random.manual_seed(0)

    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({'input_ids':inp,'attention_mask': attention_mask})
    return traindataset
pretrained_model_dir = "/workspace/models/huggyllama_llama-7b"
quantized_model_dir = "/workspace/test-1024g"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

if not os.path.isdir(quantized_model_dir):
    quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=1024,
        desc_act=False
    )

    traindataset = get_wikitext2(128, 0, 2048, tokenizer)

    # load un-quantized model, the model will always be force loaded into cpu
    print("Loading model")
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    print("Quantising")
    model.quantize(traindataset, use_triton=False)

    os.makedirs(quantized_model_dir, exist_ok=True)
    model.save_quantized(quantized_model_dir, use_safetensors=True)

print("Reloading model just quantised")
for triton in [ True, False ]:
    print(f"Testing with use_triton = {triton}")
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", use_triton=triton, use_safetensors=True)

    # Make a long text
    sentence = "auto gptq is " * 500
    input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to("cuda:0")

    # Run model on first 512 tokens
    try:
        output = model(input_ids = input_ids[:, 0:512])
        print(f"Succeeded for triton = {triton}")
    except:
        print(f"FAILED for triton = {triton}")
        raise
Output:
root@1f66221a311b:/workspace/gptq-ppl-test# python test_1024.py
Downloading builder script: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8.48k/8.48k [00:00<00:00, 4.79MB/s]
Downloading metadata: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6.84k/6.84k [00:00<00:00, 5.21MB/s]
Downloading readme: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.25k/9.25k [00:00<00:00, 6.48MB/s]
Downloading and preparing dataset wikitext/wikitext-2-raw-v1 to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...
Downloading data: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.72M/4.72M [00:01<00:00, 4.61MB/s]
Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.
Token indices sequence length is longer than the specified maximum sequence length for this model (335688 > 2048). Running this sequence through the model will result in indexing errors
Loading model
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [01:21<00:00, 40.74s/it]
Quantising
Reloading model just quantised
Testing with use_triton = True
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:33<00:00, 2.80s/it]
Succeeded for triton = True
Testing with use_triton = False
The safetensors archive passed at /workspace/test-1024g/gptq_model-4bit-1024g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
FAILED for triton = False
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /workspace/gptq-ppl-test/test_1024.py:66 in <module> │
│ │
│ 63 │ input_ids = tokenizer(sentence, return_tensors="pt", truncation=False).input_ids.to( │
│ 64 │ # Run model on first 512 tokens │
│ 65 │ try: │
│ ❱ 66 │ │ output = model(input_ids[:, 0:512]) │
│ 67 │ │ print(f"Succeeded for triton = {triton}") │
│ 68 │ except: │
│ 69 │ │ print(f"FAILED for triton = {triton}") │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py:374 in forward │
│ │
│ 371 │ │ return self.model.to(device) │
│ 372 │ │
│ 373 │ def forward(self, *args, **kwargs): │
│ ❱ 374 │ │ return self.model(*args, **kwargs) │
│ 375 │ │
│ 376 │ def generate(self, **kwargs): │
│ 377 │ │ """shortcut for model.generate""" │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:688 in │
│ forward │
│ │
│ 685 │ │ return_dict = return_dict if return_dict is not None else self.config.use_return │
│ 686 │ │ │
│ 687 │ │ # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn) │
│ ❱ 688 │ │ outputs = self.model( │
│ 689 │ │ │ input_ids=input_ids, │
│ 690 │ │ │ attention_mask=attention_mask, │
│ 691 │ │ │ position_ids=position_ids, │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:578 in │
│ forward │
│ │
│ 575 │ │ │ │ │ None, │
│ 576 │ │ │ │ ) │
│ 577 │ │ │ else: │
│ ❱ 578 │ │ │ │ layer_outputs = decoder_layer( │
│ 579 │ │ │ │ │ hidden_states, │
│ 580 │ │ │ │ │ attention_mask=attention_mask, │
│ 581 │ │ │ │ │ position_ids=position_ids, │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:306 in │
│ forward │
│ │
│ 303 │ │ # Fully Connected │
│ 304 │ │ residual = hidden_states │
│ 305 │ │ hidden_states = self.post_attention_layernorm(hidden_states) │
│ ❱ 306 │ │ hidden_states = self.mlp(hidden_states) │
│ 307 │ │ hidden_states = residual + hidden_states │
│ 308 │ │ │
│ 309 │ │ outputs = (hidden_states,) │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:158 in │
│ forward │
│ │
│ 155 │ │ self.act_fn = ACT2FN[hidden_act] │
│ 156 │ │
│ 157 │ def forward(self, x): │
│ ❱ 158 │ │ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) │
│ 159 │
│ 160 │
│ 161 class LlamaAttention(nn.Module): │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1501 in _call_impl │
│ │
│ 1498 │ │ if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks │
│ 1499 │ │ │ │ or _global_backward_pre_hooks or _global_backward_hooks │
│ 1500 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1501 │ │ │ return forward_call(*args, **kwargs) │
│ 1502 │ │ # Do not call functions when jit is used │
│ 1503 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1504 │ │ backward_pre_hooks = [] │
│ │
│ /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:165 in new_forward │
│ │
│ 162 │ │ │ with torch.no_grad(): │
│ 163 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 164 │ │ else: │
│ ❱ 165 │ │ │ output = old_forward(*args, **kwargs) │
│ 166 │ │ return module._hf_hook.post_forward(module, output) │
│ 167 │ │
│ 168 │ module.forward = new_forward │
│ │
│ /usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py:221 in forward │
│ │
│ 218 │ │ │ │
│ 219 │ │ │ weight = torch.bitwise_right_shift(torch.unsqueeze(self.qweight, 1).expan │
│ 220 │ │ │ torch.bitwise_and(weight,(2 ** self.bits) - 1, out=weight) │
│ ❱ 221 │ │ │ weight = weight.reshape(-1, self.group_size, weight.shape[2]) │
│ 222 │ │ │ elif self.bits == 3: │
│ 223 │ │ │ zeros = self.qzeros.reshape(self.qzeros.shape[0], self.qzeros.shape[1]//3 │
│ 224 │ │ │ zeros = (zeros >> self.wf.unsqueeze(0)) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: shape '[-1, 1024, 4096]' is invalid for input of size 45088768
As you can see, there is no problem with Triton inference.

But CUDA inference on the model quantised with group_size = 1024 + desc_act = False hits this error. It does not happen with CUDA + group_size = 1024 + desc_act = True.
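For what it's worth, the failing line (qlinear_old.py:221) is weight.reshape(-1, self.group_size, weight.shape[2]), which can only succeed when the layer's input-feature count is an exact multiple of group_size. The error message itself implies in_features = 45088768 / 4096 = 11008, which is not a multiple of 1024; identifying the layer as LLaMA-7B's mlp.down_proj is my assumption, since the traceback only shows the mlp forward. A quick arithmetic sketch (not part of the repro script):

# Quick arithmetic sketch; dimensions derived from the error message,
# layer identity (mlp.down_proj of LLaMA-7B) is an assumption.
in_features, out_features = 11008, 4096
group_size = 1024

print(in_features * out_features)   # 45088768 -> matches "input of size 45088768"
print(in_features % group_size)     # 768 -> 11008 is not a multiple of 1024,
                                    #        so reshape(-1, 1024, 4096) cannot work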
5 Answers

yvfmudvl1#
@TheBloke do you have a plain-English explanation of the benefits of using Triton versus not using it? I haven't observed a performance difference myself, but I see lots of code examples online that use it and lots that don't.
7uhlpewt2#
Triton isn't supported on Windows, which makes it unusable for a lot of people.

For those on Linux, I currently recommend against Triton because it is slower than CUDA. All the models I release use a format that is guaranteed to be CUDA-compatible - i.e. I don't use desc_act and group_size together.

The one exception to CUDA being faster: if desc_act and group_size are used together, CUDA performance drops to about 5 tokens/sec, while Triton does slightly better.

Also, Triton tends to have lower VRAM usage.

So: ideally all these differences would get resolved so the two methods are directly comparable, but it doesn't seem like we're anywhere near that yet.
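For illustration only, here is a minimal sketch of the kind of "CUDA-compatible" config described above (a group_size set, desc_act left off); the group_size=128 value is my own example, not something stated in this thread:

from auto_gptq import BaseQuantizeConfig

# Sketch: use a group_size without desc_act, not both together (illustrative values).
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=False
)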
4zcjmb1e3#
That's an amazing reply, thank you.

To clarify, looking at some of your quantised models, does the example below mean that only desc_act is used? So Triton wouldn't give any extra performance? Reference: https://huggingface.co/TheBloke/WizardLM-30B-Uncensored-GPTQ/blob/main/quantize_config.json
exdqitrt4#
Correct - in my testing, Triton performs much worse.
mcdcgff05#
I hit the same error when trying to quantize openlm-research/open_llama_3b or psmathur/orca_mini_3b. My source code is the same as the readme. The error:

weight = torch.bitwise_right_shift(torch.unsqueeze(self.qweight, 1).expan
torch.bitwise_and(weight,(2 ** self.bits) - 1, out=weight)
weight = weight.reshape(-1, self.group_size, weight.shape[2])
elif self.bits == 3:
zeros = self.qzeros.reshape(self.qzeros.shape[0], self.qzeros.shape[1]//3
zeros = (zeros >> self.wf.unsqueeze(0))
RuntimeError: shape '[-1, 128, 3200]' is invalid for input of size 27648000
I'm not sure whether this is the same bug, but my environment is Windows (non-Triton).
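This does look consistent with the same reshape constraint. The error message implies in_features = 27648000 / 3200 = 8640, which is not a multiple of the group_size of 128 shown in the reshape target; identifying the layer as open_llama_3b's mlp.down_proj is my assumption. A quick check:

# Sanity-check sketch; dimensions derived from the error message,
# layer identity (mlp.down_proj of open_llama_3b) is an assumption.
in_features, out_features = 8640, 3200
group_size = 128

print(in_features * out_features)   # 27648000 -> matches "input of size 27648000"
print(in_features % group_size)     # 64 -> 8640 is not a multiple of 128,
                                    #       so reshape(-1, 128, 3200) fails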