PyTorch RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED

Posted by 4urapxun on 2023-08-05 in Other

My PyTorch Lightning model runs fine on the CPU with this Trainer configuration:

trainer = Trainer(
    gpus=0,
    max_epochs=10,
    gradient_clip_val=2,
    callbacks=[pl.callbacks.progress.TQDMProgressBar(refresh_rate=5)],
)

trainer.fit(model)

However, running the exact same model on the GPU (by changing gpus=0 to gpus=-1 or gpus=1 in the code above) triggers the following error:

RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED
at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.


The model is as follows:

class TorchModel(LightningModule):
    def __init__(self):
        super(TorchModel, self).__init__()
        self.cat_layers = ModuleList([TorchCatEmbedding(cat) for cat in columns_to_embed])
        self.num_layers = ModuleList([LambdaLayer(lambda x: x[:, idx:idx+1]) for _, idx in numeric_columns])
        self.ffo = TorchFFO(len(self.num_layers) + sum([embed_dim(l) for l in self.cat_layers]), y.shape[1])
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, inputs):
        cats = [c(inputs) for c in self.cat_layers]
        nums = [n(inputs) for n in self.num_layers]
        concat = torch.cat(cats + nums, dim=1)
        out = self.ffo(concat)
        out = self.softmax(out)
        return out

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        loss = cce(torch.log(torch.maximum(torch.tensor(1e-8), y_hat)), y.argmax(dim=1))
        acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
        self.log("loss", loss)
        self.log("acc", acc, prog_bar=True)
        self.log("lr", self.scheduler.get_last_lr()[-1], prog_bar=True)
        return loss


Here TorchCatEmbedding and TorchFFO are two sub-models.
Is there a way to fix this?
PyTorch version:

>>> torch.__version__
'1.10.1+cu113'


CUDA info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+

Answer #1 (eqzww0vc)

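This happens because the tensor created by torch.tensor() in the training step is never moved to the GPU, so it stays on the CPU while y_hat lives on the GPU: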

def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    y_hat = self.forward(x)
    loss = cce(torch.log(torch.maximum(torch.tensor(1e-8), y_hat)), y.argmax(dim=1))
    acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
    self.log("loss", loss)
    self.log("acc", acc, prog_bar=True)
    self.log("lr", self.scheduler.get_last_lr()[-1], prog_bar=True)
    return loss

Changing this:

loss = cce(
    torch.log(torch.maximum(torch.tensor(1e-8), y_hat)),
    y.argmax(dim=1)
)


by adding .type_as(y_hat):

loss = cce(
    torch.log(torch.maximum(torch.tensor(1e-8).type_as(y_hat), y_hat)), 
    y.argmax(dim=1)
)


solved the problem.
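As a complement to the .type_as(y_hat) fix above, here is a minimal sketch of two other ways to keep the epsilon floor on the same device as the predictions. The helper names safe_log and safe_log_clamp are hypothetical and not part of the original code; the sketch falls back to the CPU when no GPU is present.

```python
import torch


def safe_log(y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical helper: create the floor directly on y_hat's device and
    # dtype, so both arguments of torch.maximum live on the same device.
    floor = torch.tensor(eps, device=y_hat.device, dtype=y_hat.dtype)
    return torch.log(torch.maximum(floor, y_hat))


def safe_log_clamp(y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Hypothetical helper: clamp takes a plain Python float, so no second
    # tensor (and no device placement) is needed at all.
    return torch.log(y_hat.clamp(min=eps))


# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
probs = torch.tensor([[0.0, 1.0], [0.25, 0.75]], device=device)

log_a = safe_log(probs)
log_b = safe_log_clamp(probs)
```

Both variants should produce finite values even where a probability is exactly zero, and neither depends on where the constant was originally created.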

Answer #2 (kmbjn2e3)

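I ran into the same problem even though all of my tensors had been moved to the device! To get more insight into the actual problem, run the code on the CPU, which prints a more informative error. I fixed it simply by upgrading torch and torchvision, after which everything worked fine: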

pip install --upgrade torch torchvision
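Before or after upgrading, it may help to check which build is installed and where the model's tensors actually live. The following is a rough sketch; environment_report and devices_in are hypothetical helper names, and the Linear layer is only a stand-in for a real model.

```python
import torch


def environment_report() -> dict:
    # Collect the installed PyTorch version and its CUDA status.
    return {
        "torch": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


def devices_in(module: torch.nn.Module, *tensors: torch.Tensor) -> set:
    # Gather the device of every parameter, buffer, and extra tensor.
    # More than one entry in the result points to a device mismatch.
    found = {str(p.device) for p in module.parameters()}
    found |= {str(b.device) for b in module.buffers()}
    found |= {str(t.device) for t in tensors}
    return found


report = environment_report()
layer = torch.nn.Linear(4, 2)   # stand-in model, created on the CPU
batch = torch.zeros(3, 4)       # stand-in batch, also on the CPU
devices = devices_in(layer, batch)
print(report)
print(devices)
```

Note that this only inspects module state and the tensors passed in; a constant created inside training_step, such as torch.tensor(1e-8), would still have to be checked by hand.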

Good luck!
