ludwig Hyperopt Bug: Ray Actors sometimes die when time_budget_s is reached during hyperparameter optimization with AsyncHyperband

dphi5xsq · posted 2 months ago in Other

Once time_budget_s is reached in hyperopt, all of the Ray Actors die one after another.
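For reference, here is a minimal sketch of the kind of hyperopt section involved. This is not the exact config from this run; the budget, trial count, and search space are inferred from the log below, and the remaining field names and values are assumptions following Ludwig's Ray-executor hyperopt schema:

# Sketch of a Ludwig hyperopt config reproducing the setup (assumed schema;
# concrete values inferred from the trial table and timeout message below).
hyperopt_config = {
    "goal": "minimize",
    "metric": "loss",  # assumption: whatever metric_score tracks in this run
    "parameters": {
        "trainer.learning_rate": {"space": "choice", "categories": [0.001, 0.005]},
        "trainer.decay_steps": {"space": "choice", "categories": [2000, 8000, 10000]},
    },
    "executor": {
        "type": "ray",
        "num_samples": 4,         # matches "Number of trials: 4/4"
        "time_budget_s": 180,     # matches "Reached timeout of 180 seconds"
        "scheduler": {
            "type": "async_hyperband",
            "time_attr": "time_total_s",
        },
    },
}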

2022-08-04 17:47:10,441	INFO stopper.py:350 -- Reached timeout of 180 seconds. Stopping all trials.
== Status ==
Current time: 2022-08-04 17:47:10 (running for 00:03:01.46)
Memory usage on this node: 20.1/246.4 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/150 CPUs, 0/12 GPUs, 0.0/500.0 GiB heap, 0.0/500.0 GiB objects (0.0/3.0 accelerator_type:T4)
Current best trial: b74f3_00002 with metric_score=0.130369803722715 and parameters={'trainer.learning_rate': 0.005, 'trainer.decay_steps': 10000}
Result logdir: /home/ray/src/hyperopt
Number of trials: 4/4 (4 TERMINATED)
+-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------+
| Trial name        | status     | loc                  |   trainer.decay_steps |   trainer.learning_rate |   iter |   total time (s) |   metric_score |
|-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------|
| trial_b74f3_00000 | TERMINATED | 192.168.44.226:44064 |                 10000 |                   0.001 |     10 |          165.849 |       0.132162 |
| trial_b74f3_00001 | TERMINATED | 192.168.74.69:5093   |                  2000 |                   0.005 |     11 |          172.588 |       0.131108 |
| trial_b74f3_00002 | TERMINATED | 192.168.72.27:6452   |                 10000 |                   0.005 |     10 |          166.155 |       0.13037  |
| trial_b74f3_00003 | TERMINATED | 192.168.45.45:55382  |                  8000 |                   0.001 |     10 |          162.189 |       0.132678 |
+-------------------+------------+----------------------+-----------------------+-------------------------+--------+------------------+----------------+

Training:  11%|█         | 120/1100 [02:29<10:47,  1.51it/s]
2022-08-04 17:47:10,603	WARNING worker.py:1382 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 77003339209b3e674c1826ef52407c82b1d419681c000000 Worker ID: fd87e59263039f1b712913a4de1750c0f527bae210284a5c54307c2b Node ID: 0b05b201854acb7ec8473e64f8b224140dc47236ddc8ecfb9903c3fe Worker IP address: 192.168.45.45 Worker port: 10201 Worker PID: 56401
(BaseWorkerMixin pid=5516, ip=192.168.35.224) The actor is dead because its owner has died. Owner Id: 7304cad6ec56a8c825c4de04dcb3f0106c885bc42b6ab195439eefe6 Owner Ip address: 192.168.45.45 Owner worker exit type: INTENDED_EXIT
2022-08-04 17:47:13,554	WARNING worker.py:1382 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: 2fad8fa976dfdce26f68298242b39ff6194200d51c000000 Worker ID: a0a4f773d2064aa4385a9d7bf97b055a254d7e86b10b8ef35ffb6e91 Node ID: 0017c30b633d4339dba7461ec73a43b1f6e65ae839edf4bae757dcc9 Worker IP address: 192.168.86.8 Worker port: 10193 Worker PID: 957

Things then seem to hang for a while (somewhere between 30 seconds and 1 minute), after which Ray Tune returns the trial results.
Ideally, the Ray workers/actors should not die.
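For what it's worth, the "Reached timeout" line comes from Ray Tune's stopper machinery: passing time_budget_s to tune.run installs a TimeoutStopper that stops all running trials at once when the budget elapses. A standalone sketch (plain Ray Tune, no Ludwig; the trainable is a made-up dummy) that exercises the same shutdown path:

# Standalone Ray Tune sketch hitting the same code path: time_budget_s makes
# Tune stop all running trials once 180 s have elapsed and tear down their
# actors, which is where the "worker died" warnings in the log above appear.
import time

from ray import tune


def trainable(config):
    # Dummy long-running trial so the time budget, not completion, ends it.
    for step in range(10_000):
        time.sleep(1)
        tune.report(metric_score=1.0 / (step + 1))


tune.run(
    trainable,
    num_samples=4,       # same number of trials as the run above
    time_budget_s=180,   # same budget as the run above
)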


r9f1avp5 · 1#

@jeffkinnison @arnavgarg1 Does this cause model training to fail, or is it just a temporary delay? If the former, I'd suggest P0; if the latter, P1/P2. cc @tgaddair


8yparm6h · 2#

@drishi I haven't seen this cause any training failures, but there is definitely a delay. I agree this is probably a P1. On the Predibase side it may just make for a confusing experience, since any of the trials may go several minutes without updating its metrics before moving on to the evaluation stage.
