PyTorch RuntimeError t == DeviceType::CUDAINTERNAL ASSERT FAILED

Posted by 4urapxun on 2023-08-05 in Other


A PyTorch Lightning model runs fine on the CPU with this Trainer configuration:

trainer = Trainer(
    gpus=0,
    max_epochs=10,
    gradient_clip_val=2,
    callbacks=[pl.callbacks.progress.TQDMProgressBar(refresh_rate=5)],
)

trainer.fit(model)

However, running exactly the same model on the GPU (by changing the code above to gpus=-1 or gpus=1) triggers the following error:

RuntimeError: t == DeviceType::CUDAINTERNAL ASSERT FAILED
at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.

The model is as follows:

class TorchModel(LightningModule):
    def __init__(self):
        super(TorchModel, self).__init__()
        self.cat_layers = ModuleList([TorchCatEmbedding(cat) for cat in columns_to_embed])
        self.num_layers = ModuleList([LambdaLayer(lambda x: x[:, idx:idx+1]) for _, idx in numeric_columns])
        self.ffo = TorchFFO(len(self.num_layers) + sum([embed_dim(l) for l in self.cat_layers]), y.shape[1])
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, inputs):
        cats = [c(inputs) for c in self.cat_layers]
        nums = [n(inputs) for n in self.num_layers]
        concat = torch.cat(cats + nums, dim=1)
        out = self.ffo(concat)
        out = self.softmax(out)
        return out

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        y_hat = self.forward(x)
        loss = cce(torch.log(torch.maximum(torch.tensor(1e-8), y_hat)), y.argmax(dim=1))
        acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
        self.log("loss", loss)
        self.log("acc", acc, prog_bar=True)
        self.log("lr", self.scheduler.get_last_lr()[-1], prog_bar=True)
        return loss

Here TorchCatEmbedding and TorchFFO are two sub-models.
Is there a way to fix this?
PyTorch version:

>>> torch.__version__
'1.10.1+cu113'

CUDA information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+

Tags: pytorch

Source: https://stackoverflow.com/questions/70594827/pytorch-runtimeerror-t-devicetypecudainternal-assert-failed

2 Answers

#1 eqzww0vc

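This happens because the tensor created with torch.tensor() in the training step is never moved to the GPU, so torch.maximum receives one CPU tensor and one CUDA tensor: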

def training_step(self, train_batch, batch_idx):
    x, y = train_batch
    y_hat = self.forward(x)
    loss = cce(torch.log(torch.maximum(torch.tensor(1e-8), y_hat)), y.argmax(dim=1))
    acc = tm.functional.accuracy(y_hat.argmax(dim=1), y.argmax(dim=1))
    self.log("loss", loss)
    self.log("acc", acc, prog_bar=True)
    self.log("lr", self.scheduler.get_last_lr()[-1], prog_bar=True)
    return loss

Change this:

loss = cce(
    torch.log(torch.maximum(torch.tensor(1e-8), y_hat)),
    y.argmax(dim=1)
)

by adding .type_as(y_hat):

loss = cce(
    torch.log(torch.maximum(torch.tensor(1e-8).type_as(y_hat), y_hat)), 
    y.argmax(dim=1)
)

This resolves the problem.
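As a minimal sketch of the fix above, the three variants below all keep the floor value on the same device as y_hat. The helper names (log_floor_type_as, log_floor_device, log_floor_clamp) are made up for this illustration and do not appear in the question. The clamp variant is an alternative, not the answer's method: it avoids creating a second tensor at all.

```python
import torch


def log_floor_type_as(y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Fix from the answer: type_as() casts the new tensor to y_hat's
    # dtype and moves it to y_hat's device.
    floor = torch.tensor(eps).type_as(y_hat)
    return torch.log(torch.maximum(floor, y_hat))


def log_floor_device(y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Equivalent: create the tensor directly with matching dtype and device.
    floor = torch.tensor(eps, dtype=y_hat.dtype, device=y_hat.device)
    return torch.log(torch.maximum(floor, y_hat))


def log_floor_clamp(y_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Alternative (assumption, not from the answer): clamp with a Python
    # scalar, so no extra tensor exists that could sit on the wrong device.
    return torch.log(torch.clamp(y_hat, min=eps))


# Falls back to the CPU when no GPU is present, so the sketch runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
y_hat = torch.tensor([[0.0, 0.25, 0.75]], device=device)

a = log_floor_type_as(y_hat)
b = log_floor_device(y_hat)
c = log_floor_clamp(y_hat)
```

All three results live on the same device as y_hat, and the zero entry is floored before the logarithm, so no -inf appears in the loss input.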

Posted 2023-08-05

#2 kmbjn2e3

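I ran into the same problem even though all of my tensors had been moved to the device! To get a clearer picture of the actual problem, run the code on the CPU, which prints a more informative error. I fixed it simply by upgrading torch and torchvision, after which everything worked fine: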

pip install --upgrade torch torchvision

Good luck!
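Before upgrading, it can help to record which versions are installed and where the model's weights live. The sketch below is only an illustration: environment_report and devices_in_use are hypothetical helper names, and the small nn.Sequential model stands in for the real one. It inspects registered parameters and buffers only, so it would not catch a tensor created on the fly inside training_step.

```python
import torch
from torch import nn


def environment_report() -> dict:
    # Versions worth noting before and after an upgrade.
    return {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


def devices_in_use(module: nn.Module) -> set:
    # Collect the device of every registered parameter and buffer.
    params = {str(p.device) for p in module.parameters()}
    buffers = {str(b.device) for b in module.buffers()}
    return params | buffers


# Stand-in model (assumption); BatchNorm1d contributes buffers as well.
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))

report = environment_report()
found = devices_in_use(model)
print(report)
print(found)
```

A result with more than one entry in the returned set would indicate that part of the model was left on a different device.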

Posted 2023-08-05
