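When I run the code given in the README under "The simplest way to use auto_gptq to quantize a model and run inference after quantization", I get this error. The same error occurs when I run the scripts in AutoGPTQ/examples/quantization. Here is the full stack trace: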

C:\Users\username\anaconda3\lib\site-packages\transformers\generation\utils.py:1346: UserWarning: Using `max_length`'s default (20) to control the generation length. This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\username\Documents\LangChain\auto_gptq_example.py:38 in <module>                     │
│                                                                                                  │
│   35 model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir)                             │
│   36                                                                                             │
│   37 # inference with model.generate                                                             │
│ ❱ 38 print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").t    │
│   39                                                                                             │
│   40 # or you can also use pipeline                                                              │
│   41 pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)                         │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\auto_gptq\modeling\_base.py:372 in generate     │
│                                                                                                  │
│   369 │   def generate(self, **kwargs):                                                          │
│   370 │   │   """shortcut for model.generate"""                                                  │
│   371 │   │   with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):     │
│ ❱ 372 │   │   │   return self.model.generate(**kwargs)                                           │
│   373 │                                                                                          │
│   374 │   def prepare_inputs_for_generation(self, *args, **kwargs):                              │
│   375 │   │   """shortcut for model.prepare_inputs_for_generation"""                             │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\utils\_contextlib.py:115 in               │
│ decorate_context                                                                                 │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\transformers\generation\utils.py:1515 in        │
│ generate                                                                                         │
│                                                                                                  │
│   1512 │   │   │   │   )                                                                         │
│   1513 │   │   │                                                                                 │
│   1514 │   │   │   # 11. run greedy search                                                       │
│ ❱ 1515 │   │   │   return self.greedy_search(                                                    │
│   1516 │   │   │   │   input_ids,                                                                │
│   1517 │   │   │   │   logits_processor=logits_processor,                                        │
│   1518 │   │   │   │   stopping_criteria=stopping_criteria,                                      │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\transformers\generation\utils.py:2332 in        │
│ greedy_search                                                                                    │
│                                                                                                  │
│   2329 │   │   │   model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)  │
│   2330 │   │   │                                                                                 │
│   2331 │   │   │   # forward pass to get next token                                              │
│ ❱ 2332 │   │   │   outputs = self(                                                               │
│   2333 │   │   │   │   **model_inputs,                                                           │
│   2334 │   │   │   │   return_dict=True,                                                         │
│   2335 │   │   │   │   output_attentions=output_attentions,                                      │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl   │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\accelerate\hooks.py:165 in new_forward          │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\transformers\models\opt\modeling_opt.py:938 in  │
│ forward                                                                                          │
│                                                                                                  │
│    935 │   │   return_dict = return_dict if return_dict is not None else self.config.use_return  │
│    936 │   │                                                                                     │
│    937 │   │   # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)   │
│ ❱  938 │   │   outputs = self.model.decoder(                                                     │
│    939 │   │   │   input_ids=input_ids,                                                          │
│    940 │   │   │   attention_mask=attention_mask,                                                │
│    941 │   │   │   head_mask=head_mask,                                                          │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl   │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\accelerate\hooks.py:165 in new_forward          │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\transformers\models\opt\modeling_opt.py:704 in  │
│ forward                                                                                          │
│                                                                                                  │
│    701 │   │   │   │   │   None,                                                                 │
│    702 │   │   │   │   )                                                                         │
│    703 │   │   │   else:                                                                         │
│ ❱  704 │   │   │   │   layer_outputs = decoder_layer(                                            │
│    705 │   │   │   │   │   hidden_states,                                                        │
│    706 │   │   │   │   │   attention_mask=causal_attention_mask,                                 │
│    707 │   │   │   │   │   layer_head_mask=(head_mask[idx] if head_mask is not None else None),  │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl   │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\accelerate\hooks.py:165 in new_forward          │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\transformers\models\opt\modeling_opt.py:326 in  │
│ forward                                                                                          │
│                                                                                                  │
│    323 │   │                                                                                     │
│    324 │   │   # 125m, 1.7B, ..., 175B applies layer norm BEFORE attention                       │
│    325 │   │   if self.do_layer_norm_before:                                                     │
│ ❱  326 │   │   │   hidden_states = self.self_attn_layer_norm(hidden_states)                      │
│    327 │   │                                                                                     │
│    328 │   │   # Self Attention                                                                  │
│    329 │   │   hidden_states, self_attn_weights, present_key_value = self.self_attn(             │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\modules\module.py:1501 in _call_impl   │
│                                                                                                  │
│   1498 │   │   if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks   │
│   1499 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hooks                   │
│   1500 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1501 │   │   │   return forward_call(*args, **kwargs)                                          │
│   1502 │   │   # Do not call functions when jit is used                                          │
│   1503 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1504 │   │   backward_pre_hooks = []                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\accelerate\hooks.py:165 in new_forward          │
│                                                                                                  │
│   162 │   │   │   with torch.no_grad():                                                          │
│   163 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   164 │   │   else:                                                                              │
│ ❱ 165 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   166 │   │   return module._hf_hook.post_forward(module, output)                                │
│   167 │                                                                                          │
│   168 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\modules\normalization.py:190 in        │
│ forward                                                                                          │
│                                                                                                  │
│   187 │   │   │   init.zeros_(self.bias)                                                         │
│   188 │                                                                                          │
│   189 │   def forward(self, input: Tensor) -> Tensor:                                            │
│ ❱ 190 │   │   return F.layer_norm(                                                               │
│   191 │   │   │   input, self.normalized_shape, self.weight, self.bias, self.eps)                │
│   192 │                                                                                          │
│   193 │   def extra_repr(self) -> str:                                                           │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\nn\functional.py:2515 in layer_norm       │
│                                                                                                  │
│   2512 │   │   return handle_torch_function(                                                     │
│   2513 │   │   │   layer_norm, (input, weight, bias), input, normalized_shape, weight=weight, b  │
│   2514 │   │   )                                                                                 │
│ ❱ 2515 │   return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.c  │
│   2516                                                                                           │
│   2517                                                                                           │
│   2518 def group_norm(                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

System info -
OS - Windows 11
CPU - Intel i7
GPU - Nvidia RTX 4080

Interestingly, when I add the following two lines to my code:

import torch
torch.set_default_dtype(torch.float16)

I get a different error:

C:\Users\username\Documents\LangChain>python auto_gptq_example.py
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ C:\Users\username\Documents\LangChain\auto_gptq_example.py:28 in <module>                     │
│                                                                                                  │
│   25 model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)          │
│   26                                                                                             │
│   27 # quantize model, the examples should be list of dict whose keys can only be "input_ids"    │
│ ❱ 28 model.quantize(examples)                                                                    │
│   29                                                                                             │
│   30 # save quantized model                                                                      │
│   31 model.save_quantized(quantized_model_dir)                                                   │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\torch\utils\_contextlib.py:115 in               │
│ decorate_context                                                                                 │
│                                                                                                  │
│   112 │   @functools.wraps(func)                                                                 │
│   113 │   def decorate_context(*args, **kwargs):                                                 │
│   114 │   │   with ctx_factory():                                                                │
│ ❱ 115 │   │   │   return func(*args, **kwargs)                                                   │
│   116 │                                                                                          │
│   117 │   return decorate_context                                                                │
│   118                                                                                            │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\auto_gptq\modeling\_base.py:298 in quantize     │
│                                                                                                  │
│   295 │   │   │   │                                                                              │
│   296 │   │   │   │   for name in subset:                                                        │
│   297 │   │   │   │   │   logger.info(f'Quantizing {name} in layer {i + 1}/{len(layers)}...')    │
│ ❱ 298 │   │   │   │   │   scale, zero, g_idx = gptq[name].fasterquant(                           │
│   299 │   │   │   │   │   │   percdamp=self.quantize_config.damp_percent,                        │
│   300 │   │   │   │   │   │   groupsize=self.quantize_config.group_size,                         │
│   301 │   │   │   │   │   │   actorder=self.quantize_config.desc_act                             │
│                                                                                                  │
│ C:\Users\username\anaconda3\lib\site-packages\auto_gptq\quantization\gptq.py:94 in            │
│ fasterquant                                                                                      │
│                                                                                                  │
│    91 │   │   damp = percdamp * torch.mean(torch.diag(H))                                        │
│    92 │   │   diag = torch.arange(self.columns, device=self.dev)                                 │
│    93 │   │   H[diag, diag] += damp                                                              │
│ ❱  94 │   │   H = torch.linalg.cholesky(H)                                                       │
│    95 │   │   H = torch.cholesky_inverse(H)                                                      │
│    96 │   │   H = torch.linalg.cholesky(H, upper=True)                                           │
│    97 │   │   Hinv = H                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: "cholesky_cusolver" not implemented for 'Half'

jljoyd4f1#

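Hi, could you share more information, such as your versions of transformers, PyTorch, CUDA, and so on? If you think this is a bug (since you added the bug label), you should use the bug report template; otherwise this issue will not be treated as high priority.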


4si2a6ki2#

@PanQiWei

>>> transformers.__version__
'4.29.2'
>>> torch.__version__
'2.0.1+cu117'
>nvidia-smi.exe
Wed May 24 23:09:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 531.79                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                      TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4080 L...  WDDM | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P8                6W /  N/A|      0MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     23660    C+G   ...inaries\Win64\EpicGamesLauncher.exe    N/A      |
|    0   N/A  N/A     25172    C+G   ...ne\Binaries\Win64\EpicWebHelper.exe    N/A      |
+---------------------------------------------------------------------------------------+

pu3pd22g3#

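Can you confirm that the script you are using is this one? I ask because my software versions are exactly the same as yours.
Also, you could try loading the quantized model and then letting the program sleep, to check whether the model has actually been loaded onto the GPU.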
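
One way to run the check suggested above without watching nvidia-smi is to ask the loaded module where its tensors live. The following is a minimal sketch, not part of AutoGPTQ: `summarize_placement` is a hypothetical helper name, and a small `torch.nn.Sequential` stands in for the quantized model.

```python
from collections import Counter

import torch


def summarize_placement(module: torch.nn.Module) -> Counter:
    """Count parameters and buffers by (device type, dtype)."""
    counts = Counter()
    for tensor in list(module.parameters()) + list(module.buffers()):
        counts[(tensor.device.type, tensor.dtype)] += 1
    return counts


# Stand-in for the loaded model (assumption: any torch.nn.Module works here).
demo = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.LayerNorm(4))
placement = summarize_placement(demo)
print(placement)
```

If every entry reports `cpu` together with `torch.float16`, the model was most likely never moved to the GPU, which would match the traceback in the question.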
Regarding RuntimeError: "LayerNormKernelImpl" not implemented for 'Half':
this usually happens when inference is run on the CPU with the half (float16) dtype.
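
A common way to avoid this combination is to use float16 only when a CUDA device is actually available and to fall back to float32 on the CPU. The sketch below illustrates that idea with a bare `torch.nn.LayerNorm`; `pick_device_and_dtype` is a hypothetical helper, not something AutoGPTQ provides.

```python
import torch


def pick_device_and_dtype():
    """Return (device, dtype), choosing float16 only when CUDA is available."""
    if torch.cuda.is_available():
        return torch.device("cuda:0"), torch.float16
    return torch.device("cpu"), torch.float32


device, dtype = pick_device_and_dtype()

# Create the layer and its input on the same device with the same dtype.
layer_norm = torch.nn.LayerNorm(8).to(device=device, dtype=dtype)
x = torch.randn(2, 8, device=device, dtype=dtype)
y = layer_norm(x)
print(y.device, y.dtype)
```

For the script in the question, the equivalent step would be to make sure the quantized model ends up on the GPU before `generate` is called.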
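Regarding "Interestingly, when I add the following two lines to my code":
if you do that, you need to set the default back with torch.set_default_dtype(torch.float) after the specific code blocks that need float16.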
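
Restoring the default dtype by hand is easy to forget, so one option is a small context manager that changes it only for a limited block. This is a sketch under that assumption; `default_dtype` is a hypothetical name, built on `torch.get_default_dtype` and `torch.set_default_dtype`.

```python
import contextlib

import torch


@contextlib.contextmanager
def default_dtype(dtype):
    """Temporarily change torch's default floating-point dtype."""
    previous = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        # Always restore the previous default, even if the block raises.
        torch.set_default_dtype(previous)


before = torch.get_default_dtype()
with default_dtype(torch.float16):
    inside = torch.zeros(2).dtype   # created while float16 is the default
outside = torch.zeros(2).dtype      # created after the default is restored
print(inside, outside)
```

Code that runs outside the `with` block, such as the Cholesky step in `fasterquant`, then sees the original default again.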

ht4b089n4#

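It is not possible to pass the model and pipeline to HuggingFacePipeline. In the code you provided, the error is probably caused by the model and the pipeline being on different devices (for example, one on the GPU and the other on the CPU). To fix this, make sure the model and the pipeline are on the same device. You can try the following:

Move the model to the GPU: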

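import transformers
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from langchain.llms import HuggingFacePipeline  # import path may differ between langchain versions

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
# load the model, then move it to the GPU
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config, trust_remote_code=True, device_map="auto").cuda()
# wrap the model and tokenizer in a text-generation pipeline
llm_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=1024
)
llm = HuggingFacePipeline(pipeline=llm_pipeline)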

If your machine has no GPU available, you can try running the model on the CPU:

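import transformers
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from langchain.llms import HuggingFacePipeline  # import path may differ between langchain versions

model_name = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
# load the model without the explicit .cuda() call
model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config, trust_remote_code=True, device_map="auto")
# wrap the model and tokenizer in a text-generation pipeline
llm_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=1024
)
llm = HuggingFacePipeline(pipeline=llm_pipeline)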

Note that this may reduce performance, because a CPU cannot process large amounts of computation as quickly as a GPU.
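
Whichever device is used, the inputs have to sit on the same device as the model's weights. The following sketch shows one way to enforce that; `move_to_module_device` is a hypothetical helper and a `torch.nn.Linear` stands in for the real model.

```python
import torch


def move_to_module_device(batch: dict, module: torch.nn.Module) -> dict:
    """Move every tensor in `batch` to the device of the module's first parameter."""
    device = next(module.parameters()).device
    return {
        key: value.to(device) if torch.is_tensor(value) else value
        for key, value in batch.items()
    }


model = torch.nn.Linear(3, 1)
if torch.cuda.is_available():
    model = model.cuda()

# Tensors are moved; non-tensor entries are passed through unchanged.
batch = {"features": torch.ones(2, 3), "note": "kept as-is"}
moved = move_to_module_device(batch, model)
output = model(moved["features"])
print(output.shape)
```

With a tokenizer, the same idea applies to the dictionary of tensors it returns before that dictionary is passed to `generate`.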

AutoGPTQ [BUG] RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
