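Describe the bug
I am resuming the training of a model from an MLflow run. I started training with MlflowCallback and then interrupted it with CTRL+C. I wrote the following script, which restarts the run and resumes training: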
import logging
import pathlib
import tempfile

import ludwig.api
import ludwig.contribs.mlflow
import mlflow
import pandas as pd


def resume_training(run_id):
    """Resume training a model."""
    with mlflow.start_run(run_id=run_id) as run:
        # get params and model of the train run to resume
        params = run.data.params
        mlmodel_path = mlflow.get_artifact_uri()
        logged_thresholds_path = pathlib.Path(
            mlflow.get_artifact_uri("model/thresholds.json")
        )
        # strip the URI scheme to get a local filesystem path
        model_resume_path = mlmodel_path.replace("file://", "")
        csv_path = params.get("csv_path")
        model_config = params.get("model_config")
        thresholds = params.get("thresholds")
        deterministic = int(params.get("deterministic"))
        random_seed = int(params.get("random_seed"))
        tmpdir = tempfile.mkdtemp()
        # log the thresholds file only if the run does not have it yet
        if not logged_thresholds_path.exists():
            mlflow.log_artifact(thresholds, "model")
        allow_parallel_threads = deterministic == 0
        model = ludwig.api.LudwigModel(
            config=model_config,
            logging_level=logging.INFO,
            callbacks=[ludwig.contribs.mlflow.MlflowCallback()],
            allow_parallel_threads=allow_parallel_threads,
        )
        try:
            data = pd.read_csv(csv_path)
        except pd.errors.EmptyDataError as err:
            raise pd.errors.EmptyDataError(
                f"File {csv_path} is empty. Please check the data."
            ) from err
        model.train(
            dataset=data,
            random_seed=random_seed,
            output_directory=tmpdir,
            model_resume_path=model_resume_path,
        )
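As a side note on the script above: it derives the local path by removing "file://" with str.replace. The following is a minimal sketch of a slightly stricter conversion using only the Python standard library. The helper name artifact_uri_to_path is hypothetical (it is not part of MLflow or Ludwig), and the sketch assumes a POSIX-style local artifact store like the one shown in this report.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def artifact_uri_to_path(uri: str) -> Path:
    """Convert a local artifact URI (or a plain path) to a filesystem path.

    Hypothetical helper for illustration only; it rejects non-local
    schemes instead of silently returning an unusable path.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"Not a local artifact location: {uri}")
    # url2pathname also decodes percent-escapes such as %20
    return Path(url2pathname(parsed.path))
```

With such a helper, the existence check for thresholds.json could be run on the converted path rather than on the raw URI, since pathlib.Path applied directly to a "file://" string generally does not point at the intended file.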
The following screenshot shows the artifacts logged to the resumed run:
After the data is preprocessed, the output reads: Resuming training of model: [...]/mlruns/0/d3dd4d3d24204fe69b2c7a4727998afa/artifacts/model/training_progress.json
This is immediately followed by: FileNotFoundError: [Errno 2] No such file or directory: '[...]/mlruns/0/d3dd4d3d24204fe69b2c7a4727998afa/artifacts/model/training_progress.json'
Ludwig assumes that training_progress.json is located under "model" in the run artifacts, but it is actually located under "model/model" (see the screenshot above).
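To check where the file really is before resuming, the artifact directory can be searched directly. The sketch below is a hypothetical helper (find_model_resume_path is not a Ludwig or MLflow function). It assumes, based only on the error message above, that Ludwig looks for "model/training_progress.json" relative to model_resume_path, and it returns the directory for which that relative path exists.

```python
from pathlib import Path
from typing import Optional


def find_model_resume_path(
    artifact_root, progress_name: str = "training_progress.json"
) -> Optional[Path]:
    """Return a directory D such that D/model/<progress_name> exists.

    Assumption taken from the error message in this report: the
    progress file is looked up as "model/training_progress.json"
    relative to model_resume_path.
    """
    root = Path(artifact_root)
    # sorted() only makes the choice deterministic if several files match
    for progress_file in sorted(root.rglob(progress_name)):
        if progress_file.parent.name == "model":
            return progress_file.parent.parent
    return None
```

For the layout described here, where the file sits under "model/model", this would return the "model" directory inside the artifact root.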
If I change the following line:
mlmodel_path = mlflow.get_artifact_uri()
to:
mlmodel_path = mlflow.get_artifact_uri("model")
then training resumes correctly, but the description.json file is placed in model instead of the artifact root:
To Reproduce
Steps to reproduce the behavior:
- Start training a LudwigModel with MlflowCallback
- Interrupt the training
- Resume training with the script above
Expected behavior
Training should resume correctly, and the description.json file should be placed in the artifact root of the training run.
1 answer
Hi Peetee06. Thanks for pointing this out. I plan to address it early next week and will let you know as soon as there is an update.