SageMaker PyTorch: "Artifact upload failed"

Posted by nhaq1z21 on 2022-12-13 in Other

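I am running a test SageMaker PyTorch training job.
It creates the estimator and runs the training successfully. However, it dies at the "Uploading generated training model" step.
The error is "Training job pytorch-training-2022-12-05-19-45-41-370: Failed. Reason: ClientError: Artifact upload failed:Too many files are written"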

from sagemaker.pytorch import PyTorch

# role, checkpoint_s3_bucket, checkpoint_local_path, use_spot_instances,
# max_run and max_wait are defined earlier in the notebook.
estimator = PyTorch(  # create the estimator
    entry_point="CloudSeg.py",
    input_mode="FastFile",
    TrainingInputMode="FastFile",  # API-style name for the same setting; input_mode is the SDK argument
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    checkpoint_s3_uri=checkpoint_s3_bucket,
    checkpoint_local_path=checkpoint_local_path,
    use_spot_instances=use_spot_instances,
    max_run=max_run,
    max_wait=max_wait,
    hyperparameters={"epochs": 1, "backend": "nccl"},
)

estimator.fit({"training": "s3://bucket/DATA/"})  # fit with the training data

The output of fit is:

2022-12-05 19:54:10 Training - Training image download completed. Training in progress.
2022-12-05 19:54:10 Uploading - Uploading generated training model
2022-12-05 19:54:10 Failed - Training job failed
ProfilerReport-1670269542: Stopping

UnexpectedStatusException                 Traceback (most recent call last)
/tmp/ipykernel_19821/1489485288.py in <cell line: 1>()
----> 1 estimator.fit({"training": 's3://picard-prov/38-cloud-simple-unet_DATA/'})
...
~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3891
   3892         if wait:
-> 3893             self._check_job_status(job_name, description, "TrainingJobStatus")
   3894             if dot:
   3895                 print()

~/anaconda3/envs/pytorch_p38/lib/python3.8/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3429                     actual_status=status,
   3430                 )
-> 3431             raise exceptions.UnexpectedStatusException(
   3432                 message=message,
   3433                 allowed_statuses=["Completed", "Stopped"],

UnexpectedStatusException: Error for Training job pytorch-training-2022-12-05-19-45-41-370: Failed. Reason: ClientError: Artifact upload failed:Too many files are written

Is there any way to fix this?

Thank you.
I tried removing FastFile mode. It did not help.

pytorch

Source: https://stackoverflow.com/questions/74694537/sagemaker-pytorch-artifact-upload-failed

1 Answer

xwbd5t1u1#

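After training completes, SageMaker will process training outputs, which includes uploading the files that CloudSeg.py placed in /opt/ml/model. Check how many files you end up placing in the output folders below, which SageMaker uploads to S3 on your behalf (according to the error message, there are too many files).
/opt/ml/model
/opt/ml/output
You can write code that prints the files stored in these folders as the last step of your algorithm, or use SageMaker SSH Helper to inspect interactively what is going on.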
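A minimal sketch of that last-step check, assuming it is called at the very end of CloudSeg.py. The helper names `iter_files` and `report_outputs` are hypothetical, not part of any SageMaker API; only the two folder paths come from the answer above. It counts the file entries under each folder and prints the first few paths, so you can see what is about to be uploaded.

```python
import os
from itertools import islice

# Folders named in the answer; their contents are uploaded to S3 by SageMaker.
OUTPUT_DIRS = ["/opt/ml/model", "/opt/ml/output"]


def iter_files(root):
    """Yield the path of every file entry under root (nothing if root is missing)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)


def report_outputs(dirs=OUTPUT_DIRS, show=20):
    """Print the file count and the first `show` paths per folder; return the counts."""
    counts = {}
    for root in dirs:
        counts[root] = sum(1 for _ in iter_files(root))
        print(f"{root}: {counts[root]} file(s)")
        for path in islice(iter_files(root), show):
            print("   ", path)
    return counts


if __name__ == "__main__":
    # Hypothetical placement: run this after training and saving have finished.
    report_outputs()
```

If one folder reports an unexpectedly large count, the printed paths should point to what is writing there (for example, per-step files or intermediate data saved next to the final model).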

2022-12-13
