PyTorch RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0, when using a Transformer architecture!

gorkyyrv · posted 2022-11-29 · in: Other

I ran into a multi-GPU problem while practicing the Transformer in PyTorch. All of my previous PyTorch training runs could be parallelized simply by wrapping the model object in nn.DataParallel. That approach kept working fine up through seq2seq, but the Transformer returns the following error:

RuntimeError                              Traceback (most recent call last)
Cell In [44], line 66
     63 for epoch in range(N_EPOCHS):
     64     start_time = time.time() # record start time
---> 66     train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
     67     valid_loss = evaluate(model, validation_iterator, criterion)
     69     end_time = time.time() # record end time

Cell In [41], line 15, in train(model, iterator, optimizer, criterion, clip)
     11 optimizer.zero_grad()
     13 # exclude the last index (<eos>) of the output words
     14 # the input is arranged so that it starts from <sos>
---> 15 output, _ = model(src, trg[:,:-1])
     17 # output: [batch size, trg_len - 1, output_dim]
     18 # trg: [batch size, trg_len]
     20 output_dim = output.shape[-1]

File ~/anaconda3/envs/jki_pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
...
    return forward_call(*input, **kwargs)
  File "/tmp/ipykernel_212252/284771533.py", line 31, in forward
    src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

Currently the device is set to cuda, and nn.DataParallel is applied only to the final Transformer model, not to the encoder and decoder.

# declare the encoder and decoder objects
enc = Encoder(INPUT_DIM, HIDDEN_DIM, ENC_LAYERS, ENC_HEADS, ENC_PF_DIM, ENC_DROPOUT, device)
dec = Decoder(OUTPUT_DIM, HIDDEN_DIM, DEC_LAYERS, DEC_HEADS, DEC_PF_DIM, DEC_DROPOUT, device)

# declare the Transformer object and apply data parallelism
model = nn.DataParallel(Transformer(enc, dec, SRC_PAD_IDX, TRG_PAD_IDX, device).to(device))

I also tried nn.DataParallel on the encoder and decoder objects, but it still returns the same error. Has anyone run into this error before? How did you solve it? I am using two 2080 Ti cards, and the device value is as follows.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

>>> cuda

Because of the current memory problem the batch size has to be very small, which inevitably hurts both training quality and training time. I'm looking forward to your help.

mf98qq94 · answer 1

This happens when, as the error says, you have two parameters (tensors) on different GPUs.
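Any operation that mixes tensors living on two different GPUs reproduces exactly this message, for example (a minimal sketch, assuming at least two visible GPUs):

import torch

a = torch.randn(2, 2, device="cuda:0")
b = torch.randn(2, 2, device="cuda:1")

# raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cuda:1!
c = a + b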
Without seeing the full code it's hard to tell exactly what the problem is, but I'd suggest the following:
1. Try running your code on a single GPU first. Just add this step at the very beginning of your script (before any other imports):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

2. Make sure to use .cuda() rather than .to(device) on all your tensors and models, and don't send any tensor to a different device; DataParallel will handle the rest :) (see the sketch after this list)
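In your case, judging from the traceback, a common culprit with this kind of Transformer code is a helper tensor (e.g. the pos index tensor) created inside forward() on a fixed device passed in at construction time: DataParallel runs one replica per GPU, so a tensor pinned to cuda:0 ends up combined with embeddings that live on cuda:1. Below is a minimal sketch of the device-agnostic pattern, using a simplified stand-in encoder (the names and dimensions are placeholders, not your actual code):

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, max_length=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(input_dim, hidden_dim)
        self.pos_embedding = nn.Embedding(max_length, hidden_dim)
        self.dropout = nn.Dropout(dropout)
        # register the scale as a buffer so each DataParallel replica gets its own copy
        self.register_buffer("scale", torch.sqrt(torch.tensor(float(hidden_dim))))

    def forward(self, src):
        batch_size, src_len = src.shape
        # build pos on src.device (not on a device stored at construction time),
        # so every replica keeps all of its tensors on its own GPU
        pos = torch.arange(src_len, device=src.device).unsqueeze(0).repeat(batch_size, 1)
        return self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))

With that change (and the same idea for any mask tensors), the module no longer needs a device argument at all, and nn.DataParallel can place each replica's tensors correctly.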
