I have a simple script that takes the opt-6.7b model and fine-tunes it. When I run this code in Google Colab (Tesla T4, 16 GB) it runs without any problem. But when I try to run the same code on an AWS p3.2xlarge instance (Tesla V100 GPU, 16 GB), it fails with:
RuntimeError: expected scalar type Half but found Float
To be able to fine-tune on a single GPU I use LoRA and peft. Both were installed the same way (pip install) in both environments. If I wrap the training call in with torch.autocast("cuda"): the error goes away, but then the training loss behaves very strangely: instead of decreasing gradually it fluctuates over a wide range (0-5), and if I switch the model to GPT-J the loss stays at 0 the whole time. On Colab, without autocast, the loss decreases gradually. So I am not sure whether using with torch.autocast("cuda"): is a good idea at all.
The transformers version is 4.28.0.dev0 in both cases. The torch version on Colab is reported as 1.13.1+cu116, while on the p3 it is reported as 1.13.1 (does that mean it was built without CUDA support? I doubt it, because running torch.cuda.is_available() there returns True).
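For reference, this is a quick way to check what the installed torch build actually supports (a minimal diagnostic sketch; the values in the comments are examples, not output I am quoting from either machine):

import torch

# The "+cu116" suffix is only part of the wheel's local version tag; the
# authoritative checks are torch.version.cuda and torch.cuda.is_available().
print(torch.__version__)              # e.g. "1.13.1+cu116" or "1.13.1"
print(torch.version.cuda)             # CUDA version the build was compiled against (None for CPU-only)
print(torch.cuda.is_available())      # True if a usable GPU is visible
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" or "Tesla V100"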
The only big difference I can see is that on Colab, bitsandbytes prints the following setup log:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
while on the p3 it prints the following:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
What am I missing? I have not posted the code here, but it really is a very basic script that takes opt-6.7b and fine-tunes it on the Alpaca dataset with LoRA and peft. Why does it run on Colab but not on the p3? Any help is welcome :)
EDIT
I am posting a minimal code example of what I actually tried:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    load_in_8bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
import transformers
from datasets import load_dataset
tokenizer.pad_token_id = 0
CUTOFF_LEN = 256
data = load_dataset("tatsu-lab/alpaca")
data = data.shuffle().map(
    lambda data_point: tokenizer(
        data_point['text'],
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    ),
    batched=True
)
# data = load_dataset("Abirate/english_quotes")
# data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=400,
        learning_rate=2e-5,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
Below is the full stack trace:
/tmp/ipykernel_24622/2601578793.py:2 in <module> │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_24622/2601578793.py' │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1639 in train │
│ │
│ 1636 │ │ inner_training_loop = find_executable_batch_size( │
│ 1637 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1638 │ │ ) │
│ ❱ 1639 │ │ return inner_training_loop( │
│ 1640 │ │ │ args=args, │
│ 1641 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1642 │ │ │ trial=trial, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1906 in │
│ _inner_training_loop │
│ │
│ 1903 │ │ │ │ │ with model.no_sync(): │
│ 1904 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1905 │ │ │ │ else: │
│ ❱ 1906 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1907 │ │ │ │ │
│ 1908 │ │ │ │ if ( │
│ 1909 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:2662 in │
│ training_step │
│ │
│ 2659 │ │ │ loss = loss / self.args.gradient_accumulation_steps │
│ 2660 │ │ │
│ 2661 │ │ if self.do_grad_scaling: │
│ ❱ 2662 │ │ │ self.scaler.scale(loss).backward() │
│ 2663 │ │ elif self.use_apex: │
│ 2664 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2665 │ │ │ │ scaled_loss.backward() │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py:488 in backward │
│ │
│ 485 │ │ │ │ create_graph=create_graph, │
│ 486 │ │ │ │ inputs=inputs, │
│ 487 │ │ │ ) │
│ ❱ 488 │ │ torch.autograd.backward( │
│ 489 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 490 │ │ ) │
│ 491 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py:157 in backward │
│ │
│ 154 │ │ │ raise RuntimeError( │
│ 155 │ │ │ │ "none of output has requires_grad=True," │
│ 156 │ │ │ │ " this checkpoint() is not necessary") │
│ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │
│ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None │
│ 159 │ │ │ │ │ for inp in detached_inputs) │
│ 160 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:456 in │
│ backward │
│ │
│ 453 │ │ │ │
│ 454 │ │ │ elif state.CB is not None: │
│ 455 │ │ │ │ CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul │
│ ❱ 456 │ │ │ │ grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype │
│ 457 │ │ │ elif state.CxB is not None: │
│ 458 │ │ │ │ │
│ 459 │ │ │ │ if state.tile_indices is None:
(Sorry if this is a very newbie question, but I have no solution at the moment :)
1 Answer
I had the same error. After googling around, I finally got rid of it by adding torch.autocast("cuda") around my training call.
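A minimal sketch of that wrapping, assuming the same trainer object as defined in the question above (the exact snippet is my reconstruction, not a quote from the answer):

import torch

# Run training under autocast so mixed fp16/fp32 ops are handled by AMP
# instead of raising "expected scalar type Half but found Float".
with torch.autocast("cuda"):
    trainer.train()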