I have a simple script that takes the opt-6.7b model and fine-tunes it. When I run this code in Google Colab (Tesla T4, 16 GB) it runs without any problem. But when I try to run the same code on an AWS p3.2xlarge instance (Tesla V100 GPU, 16 GB), it fails with:
RuntimeError: expected scalar type Half but found Float
To be able to fine-tune on a single GPU I use LoRA and peft. Both were installed the same way (pip install) in both environments. If I wrap the training call in with torch.autocast("cuda"): the error goes away, but then the training loss behaves very strangely: instead of decreasing gradually it fluctuates over a wide range (0-5), and if I switch the model to GPT-J the loss stays at 0 the whole time. On Colab, without autocast, the loss decreases gradually. So I am not sure whether using with torch.autocast("cuda"): is a good idea at all.
The transformers version is 4.28.0.dev0 in both cases. The torch version on Colab is reported as 1.13.1+cu116, while on the p3 it is reported as 1.13.1 (does that mean it was built without CUDA support? I doubt it, because running torch.cuda.is_available() there returns True).
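For reference, this is a quick way to check what the installed torch build actually supports (a minimal diagnostic sketch; the values in the comments are examples, not output I am quoting from either machine):

import torch

# The "+cu116" suffix is only part of the wheel's local version tag; the
# authoritative checks are torch.version.cuda and torch.cuda.is_available().
print(torch.__version__)              # e.g. "1.13.1+cu116" or "1.13.1"
print(torch.version.cuda)             # CUDA version the build was compiled against (None for CPU-only)
print(torch.cuda.is_available())      # True if a usable GPU is visible
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" or "Tesla V100"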
The only big difference I can see is that on Colab, bitsandbytes prints the following setup log:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
while on the p3 it prints the following:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...
What am I missing? I have not posted the code here, but it really is a very basic script that takes opt-6.7b and fine-tunes it on the Alpaca dataset with LoRA and peft. Why does it run on Colab but not on the p3? Any help is welcome :)
EDIT
I am posting a minimal code example of what I actually tried:
import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    load_in_8bit=True,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
    def forward(self, x): return super().forward(x).to(torch.float32)

model.lm_head = CastOutputToFloat(model.lm_head)
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )
from peft import LoraConfig, get_peft_model
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
print_trainable_parameters(model)
import transformers
from datasets import load_dataset
tokenizer.pad_token_id = 0
CUTOFF_LEN = 256
data = load_dataset("tatsu-lab/alpaca")
data = data.shuffle().map(
    lambda data_point: tokenizer(
        data_point['text'],
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    ),
    batched=True
)
# data = load_dataset("Abirate/english_quotes")
# data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)
trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=100,
        max_steps=400,
        learning_rate=2e-5,
        fp16=True,
        logging_steps=1,
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
Below is the full stack trace:
/tmp/ipykernel_24622/2601578793.py:2 in <module> │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_24622/2601578793.py' │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1639 in train │
│ │
│ 1636 │ │ inner_training_loop = find_executable_batch_size( │
│ 1637 │ │ │ self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size │
│ 1638 │ │ ) │
│ ❱ 1639 │ │ return inner_training_loop( │
│ 1640 │ │ │ args=args, │
│ 1641 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1642 │ │ │ trial=trial, │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1906 in │
│ _inner_training_loop │
│ │
│ 1903 │ │ │ │ │ with model.no_sync(): │
│ 1904 │ │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1905 │ │ │ │ else: │
│ ❱ 1906 │ │ │ │ │ tr_loss_step = self.training_step(model, inputs) │
│ 1907 │ │ │ │ │
│ 1908 │ │ │ │ if ( │
│ 1909 │ │ │ │ │ args.logging_nan_inf_filter │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:2662 in │
│ training_step │
│ │
│ 2659 │ │ │ loss = loss / self.args.gradient_accumulation_steps │
│ 2660 │ │ │
│ 2661 │ │ if self.do_grad_scaling: │
│ ❱ 2662 │ │ │ self.scaler.scale(loss).backward() │
│ 2663 │ │ elif self.use_apex: │
│ 2664 │ │ │ with amp.scale_loss(loss, self.optimizer) as scaled_loss: │
│ 2665 │ │ │ │ scaled_loss.backward() │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py:488 in backward │
│ │
│ 485 │ │ │ │ create_graph=create_graph, │
│ 486 │ │ │ │ inputs=inputs, │
│ 487 │ │ │ ) │
│ ❱ 488 │ │ torch.autograd.backward( │
│ 489 │ │ │ self, gradient, retain_graph, create_graph, inputs=inputs │
│ 490 │ │ ) │
│ 491 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py:157 in backward │
│ │
│ 154 │ │ │ raise RuntimeError( │
│ 155 │ │ │ │ "none of output has requires_grad=True," │
│ 156 │ │ │ │ " this checkpoint() is not necessary") │
│ ❱ 157 │ │ torch.autograd.backward(outputs_with_grad, args_with_grad) │
│ 158 │ │ grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None │
│ 159 │ │ │ │ │ for inp in detached_inputs) │
│ 160 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward │
│ │
│ 194 │ # The reason we repeat same the comment below is that │
│ 195 │ # some Python versions print out the first line of a multi-line function │
│ 196 │ # calls in the traceback and some print out the last line │
│ ❱ 197 │ Variable._execution_engine.run_backward( # Calls into the C++ engine to run the bac │
│ 198 │ │ tensors, grad_tensors_, retain_graph, create_graph, inputs, │
│ 199 │ │ allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to ru │
│ 200 │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply │
│ │
│ 264 │ │ │ │ │ │ │ "Function is not allowed. You should only implement one " │
│ 265 │ │ │ │ │ │ │ "of them.") │
│ 266 │ │ user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn │
│ ❱ 267 │ │ return user_fn(self, *args) │
│ 268 │ │
│ 269 │ def apply_jvp(self, *args): │
│ 270 │ │ # _forward_cls is defined by derived class │
│ │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:456 in │
│ backward │
│ │
│ 453 │ │ │ │
│ 454 │ │ │ elif state.CB is not None: │
│ 455 │ │ │ │ CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul │
│ ❱ 456 │ │ │ │ grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype │
│ 457 │ │ │ elif state.CxB is not None: │
│ 458 │ │ │ │ │
│ 459 │ │ │ │ if state.tile_indices is None:
(Sorry if this is a very newbie question, but I have no solution at the moment :)
1 Answer
I had the same error. After googling around, I finally got rid of it by adding torch.autocast("cuda") around my training call.
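A minimal sketch of that wrapping, assuming the same trainer object as defined in the question above (the exact snippet is my reconstruction, not a quote from the answer):

import torch

# Run training under autocast so mixed fp16/fp32 ops are handled by AMP
# instead of raising "expected scalar type Half but found Float".
with torch.autocast("cuda"):
    trainer.train()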