Azure ML: sweep job has completed, but its status does not change from "Running"

dsf9zpds  posted on 2023-10-22  in Other

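Using the azureml Python SDK on my local machine, I was able to configure a Hyperdrive sweep job and submit it to a compute cluster in Azure Machine Learning Studio. Below is a sample of my code, based on this tutorial from Microsoft Learn: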

    # Import required libraries
    from azure.ai.ml import MLClient, command
    from azure.ai.ml.sweep import Choice
    from azure.identity import DefaultAzureCredential
    # from azure.ai.ml.entities import Environment

    # Connect to the workspace
    ml_client = MLClient.from_config(DefaultAzureCredential())


    def build_sweep_job(experiment_name, jobname):
        # Create your base command job
        env = '...'
        compute = '...'
        params = [1, 2]
        n_jobs = len(params)
        max_trials = n_jobs
        cmd = 'python train.py --param ${{inputs.params}}'
        inputs = {
            'params': params[0],
        }
        command_job = command(
            code='codepath',
            command=cmd,
            environment=env,
            inputs=inputs,
            compute=compute,
            experiment_name=experiment_name,
        )

        # Override your inputs with parameter expressions
        command_job_for_sweep = command_job(params=Choice(values=params))
        sweep_job = command_job_for_sweep.sweep(
            compute=compute,
            sampling_algorithm='grid',
            primary_metric='Best value',
            goal='Minimize',
        )

        # Specify your experiment details
        sweep_job.display_name = jobname
        sweep_job.experiment_name = experiment_name
        sweep_job.description = 'Run a hyperparameter sweep job.'
        sweep_job.set_limits(max_concurrent_trials=n_jobs, max_total_trials=max_trials)
        sweep_job.early_termination = None
        return sweep_job


    sweep_job = build_sweep_job(experiment_name='my_experiment', jobname='todays_job')
    returned_sweep_job = ml_client.create_or_update(sweep_job)
    print(returned_sweep_job.services["Studio"].endpoint)

The code creates and runs my job correctly. Eventually, the ML Studio web interface shows the job as completed.

However, in my Python notebook, which runs in VS Code, the status still shows "Running" even an hour after the job completed.

How can I fix this? I want to find out, without leaving my local VS Code, that my job has finished successfully.

Answer #1 (waxmsbnn)

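Based on the information provided, returned_sweep_job.status appears to lag behind the job's actual state.
One possible way to get the correct status is to call ml_client.jobs.list(), which retrieves up-to-date details for each job.
Here is a sample snippet (adapt it to your needs):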

    for run in ml_client.jobs.list():
        if run.experiment_name == "<Experiment Name>":
            print("Name", run.display_name)
            print("Type", run.type)
            print("Compute", run.compute)
            print("Status", run.status)

With the snippet above, I was able to get the updated status.
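If listing every job is more than you need, a narrower option is to re-fetch just the one job by name and poll until it reaches a final state. The sketch below is not part of the original answer. It assumes the object returned by create_or_update is a snapshot taken at submission time, so its status has to be fetched again from the service, for example with ml_client.jobs.get(name). The helper wait_for_job, its parameters, and the set of terminal status strings are hypothetical names and values chosen for illustration; check them against the statuses your workspace actually reports.

```python
import time
from typing import Callable

# Assumed terminal statuses; verify against what your workspace reports.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Call fetch_status repeatedly until it returns a terminal status.

    fetch_status must query the service on every call, so that each
    poll sees fresh state rather than a cached job object.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still in status {status!r} after timeout")
        time.sleep(poll_seconds)


# Usage with the question's objects (requires a live workspace, so it is
# left commented out here):
#
# final_status = wait_for_job(
#     lambda: ml_client.jobs.get(returned_sweep_job.name).status
# )
# print("Final status:", final_status)
```

Passing a callable rather than the job object keeps the helper independent of the SDK, and makes it explicit that every poll issues a new request. If you would rather block while watching the logs, ml_client.jobs.stream(returned_sweep_job.name) is another option worth trying.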
