Ludwig: resuming training inside an MLflow run does not read from or write to artifacts correctly

knsnq2tg  posted 2 months ago  in Other

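Describe the bug

I am resuming training of a model inside an MLflow run. I started training with MlflowCallback and then interrupted it with CTRL+C. I wrote the following script, which restarts the run and resumes training: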

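import logging
import pathlib
import tempfile

import ludwig.api
import ludwig.contribs.mlflow
import mlflow
import pandas as pd


def resume_training(run_id):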
    """Resume training a model."""
    with mlflow.start_run(run_id=run_id) as run:
        # get params and model of the train run to resume
        params = run.data.params
        mlmodel_path = mlflow.get_artifact_uri()
        logged_thresholds_path = pathlib.Path(
            mlflow.get_artifact_uri("model/thresholds.json")
        )
        model_resume_path = mlmodel_path.replace("file://", "")

        csv_path = params.get("csv_path")
        model_config = params.get("model_config")
        thresholds = params.get("thresholds")
        deterministic = int(params.get("deterministic"))
        random_seed = int(params.get("random_seed"))
        tmpdir = tempfile.mkdtemp()

        if not logged_thresholds_path.exists():
            mlflow.log_artifact(thresholds, "model")

        allow_parallel_threads = deterministic == 0
        model = ludwig.api.LudwigModel(
            config=model_config,
            logging_level=logging.INFO,
            callbacks=[ludwig.contribs.mlflow.MlflowCallback()],
            allow_parallel_threads=allow_parallel_threads,
        )

        try:
            data = pd.read_csv(csv_path)
        except pd.errors.EmptyDataError:
            raise pd.errors.EmptyDataError(
                f"File {csv_path} is empty. Please check the data."
            )
        model.train(
            dataset=data,
            random_seed=random_seed,
            output_directory=tmpdir,
            model_resume_path=model_resume_path,
        )
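One fragile step in the script above is turning the artifact URI into a filesystem path with `str.replace("file://", "")`. A minimal sketch of a more careful conversion follows, assuming a local `file://` artifact store on a POSIX system. The helper name `artifact_uri_to_local_path` is hypothetical and is not part of MLflow or Ludwig.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def artifact_uri_to_local_path(uri: str) -> Path:
    """Convert a local artifact URI (hypothetical helper) to a Path.

    Accepts either a file:// URI or a plain filesystem path.
    Remote schemes such as s3:// are rejected, because they have
    no local directory to resume from.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"Not a local artifact URI: {uri}")
    # url2pathname also decodes percent-escapes such as %20.
    return Path(url2pathname(parsed.path))
```

In the script, the result of `mlflow.get_artifact_uri()` could be passed through this helper before it is used as `model_resume_path` or tested with `exists()`.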

The following screenshot shows the artifacts logged to the resumed run:

After the data is preprocessed, the output reads:
Resuming training of model: [...]/mlruns/0/d3dd4d3d24204fe69b2c7a4727998afa/artifacts/model/training_progress.json
This is immediately followed by:
FileNotFoundError: [Errno 2] No such file or directory: '[...]/mlruns/0/d3dd4d3d24204fe69b2c7a4727998afa/artifacts/model/training_progress.json'
Ludwig assumes that training_progress.json sits under "model" in the run artifacts, but it is actually located under "model/model" (see the screenshot above).
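Judging from the error message, the file is looked up at `<model_resume_path>/model/training_progress.json`. A minimal sketch for locating a directory that satisfies this layout in a local artifact store is shown here. The helper `find_resume_dir` is hypothetical, and the lookup rule is inferred only from the paths in the log output above, not from Ludwig's documentation.

```python
from pathlib import Path
from typing import Optional


def find_resume_dir(
    artifact_root: Path, filename: str = "training_progress.json"
) -> Optional[Path]:
    """Return a directory D such that D/model/<filename> exists.

    Searches recursively below artifact_root and returns None when
    no matching file is found.
    """
    for hit in sorted(artifact_root.rglob(filename)):
        # Only accept files stored directly inside a "model" directory.
        if hit.parent.name == "model":
            return hit.parent.parent
    return None
```

For the layout described in this report, calling the helper on the artifact root would return the `model` directory, which matches the path that makes training resume.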
If I change the following line:

mlmodel_path = mlflow.get_artifact_uri()

to:

mlmodel_path = mlflow.get_artifact_uri("model")

then training resumes normally, but the description.json file is placed in model instead of the artifact root.

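To reproduce

Steps to reproduce the behavior:

  1. Start training a LudwigModel with MlflowCallback
  2. Interrupt the training
  3. Resume training with the script above

Expected behavior

Training should resume correctly, and the description.json file should be located in the artifact root of the training run.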
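The expected placement of description.json can be checked after a resumed run with a short script. This is only a sketch for a local artifact directory. The function name `check_artifact_layout` is hypothetical, and it tests nothing beyond the two locations mentioned in this report.

```python
from pathlib import Path
from typing import List


def check_artifact_layout(artifact_root: Path) -> List[str]:
    """Return a list of layout problems; an empty list means none found."""
    problems = []
    # Expected: description.json directly in the artifact root.
    if not (artifact_root / "description.json").is_file():
        problems.append("description.json is missing from the artifact root")
    # Reported symptom: description.json written under model/ instead.
    if (artifact_root / "model" / "description.json").is_file():
        problems.append("description.json was written under model/")
    return problems
```

An empty result means description.json is where this report expects it. Any returned message names the misplaced or missing file.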


9q78igpj  #1

Hi Peetee06. Thanks for pointing this out. I plan to address it early next week and will let you know as soon as there is an update.
