pytorch ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream

Posted by ahy6op9u on 2023-05-07 in Other


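I am using the GCP Vertex platform for distributed training. The model is trained in parallel on 4 GPUs using PyTorch and HuggingFace. After training, when I save the model from the local container to a GCP bucket, it throws an error.
Here is the code:
I launch train.py as follows: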

python -m torch.distributed.launch --nproc_per_node 4  train.py

After training completes, I save the model files like this. There are 3 files to save.

trainer.save_model("model_mlm") #Saves in local directory
subprocess.call('gsutil -o GSUtil:parallel_composite_upload_threshold=0  cp -r /pythonPackage/trainer/model_mlm gs://*****/model_mlm', shell=True, stdout=subprocess.PIPE) #from local to GCP

Error:

ResumableUploadAbortException: Upload complete with 1141101995 additional bytes left in stream; this can happen if a file changes size while being uploaded

Sometimes I get this error instead:

ResumableUploadAbortException: 409 The object has already been created in an earlier attempt and was overwritten, possibly due to a race condition.

pytorch

Source: https://stackoverflow.com/questions/71868757/resumableuploadabortexception-upload-complete-with-1141101995-additional-bytes

2 Answers


gev0vcfq #1

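According to the documentation on name conflicts, you are trying to overwrite a file that has already been created.
So I suggest changing the destination location for each training run by adding a unique identifier, so that you don't get this type of error. For example, append a timestamp in string format to the end of the bucket path, as shown below: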

- gs://pypl_bkt_prd_row_std_aiml_vertexai/model_mlm_vocab_exp2_50epocs/20220407150000
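
A minimal sketch of that suggestion, assuming the upload is still driven from Python as in the question. The helper names `unique_destination` and `build_upload_command`, and the paths in the usage note, are hypothetical; only the `gsutil cp -r` invocation and its `-o` option come from the question itself.

```python
from datetime import datetime, timezone
from typing import List, Optional


def unique_destination(base_uri: str, now: Optional[datetime] = None) -> str:
    """Append a YYYYMMDDHHMMSS timestamp to a gs:// URI (hypothetical helper)."""
    if now is None:
        # Use UTC so runs started in different time zones still sort consistently.
        now = datetime.now(timezone.utc)
    return f"{base_uri.rstrip('/')}/{now.strftime('%Y%m%d%H%M%S')}"


def build_upload_command(
    local_dir: str, base_uri: str, now: Optional[datetime] = None
) -> List[str]:
    """Build the gsutil command as an argument list, so no shell quoting is needed."""
    return [
        "gsutil",
        "-o", "GSUtil:parallel_composite_upload_threshold=0",
        "cp", "-r",
        local_dir,
        unique_destination(base_uri, now),
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. For example, `build_upload_command("/pythonPackage/trainer/model_mlm", "gs://my-bucket/model_mlm")` (with `my-bucket` as a placeholder) targets a fresh sub-path on every run. A timestamp with one-second resolution only separates runs; processes started within the same second would still share a destination.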

I would also mention that this kind of error is retryable, as described in the error documentation.

2023-05-07

y1aodyip #2

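I ran into this problem too. It seems to happen when a file's contents change while rsync is uploading it. This can happen with large files, because file writes are not guaranteed to be transactional.
I solved it by simply retrying the gsutil rsync command.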
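
The retry can be sketched roughly as follows. This is only an illustration: `run_with_retries`, its parameters, and the back-off schedule are assumptions, and it treats any non-zero exit code as retryable, which may be too broad for some failures.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay_seconds=5.0, runner=subprocess.run):
    """Run cmd until it exits with code 0 or the attempts are used up.

    Returns the result of the last attempt. `runner` is injectable so the
    function can be exercised without calling a real command.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Wait a little longer after each failure so that any writer
            # still touching the files has time to finish.
            time.sleep(delay_seconds * attempt)
    return result
```

A call such as `run_with_retries(["gsutil", "rsync", "-r", local_dir, destination])` would then repeat the upload up to three times, where `local_dir` and `destination` stand for your own paths.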

2023-05-07
