ludwig [ray] Fix dynamic resource allocation with RayDatasets

ulmd4ohb asked 2 months ago

It looks like there is a deadlock when using Ray Tune with dynamic resource allocation + RayDatasets. Everything works when we set cache_format = 'parquet', but with the new default, cache_format = 'ray', trials hang, presumably because RayDatasets is holding on to some of the resources that dynamic allocation needs.
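For reference, the failing setup is roughly the sketch below. The RayBackend import path and the cache_format keyword are assumptions inferred from the issue text and may be spelled differently across Ludwig versions:

from ludwig.backend.ray import RayBackend  # import path assumed
from ludwig.hyperopt.run import hyperopt

# cache_format="ray" (the new default) reproduces the hang under Ray Tune;
# cache_format="parquet" works. The keyword name follows the issue text
# and is not verified against any particular Ludwig release.
backend = RayBackend(cache_format="ray")

results = hyperopt(
    config="config.yaml",  # Ludwig config with a hyperopt section
    dataset="train.csv",
    backend=backend,
)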
Even if we increase the resources in the cluster, we still end up in the same situation:

2021-10-29 12:55:53,822	WARNING worker.py:1227 -- The actor or task with ID eae16da7b2a06f615e017e6d60f6e1ebc6be08064045c345 cannot be scheduled right now. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this actor or task, which may take time to install.  Otherwise, this is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increasing the resources available to this Ray cluster.
Required resources for this actor or task: {CPU_group_098d048e4d6df88103b3ad1b5a1e6f44: 1.000000}
Available resources on this node: {5.998000/8.000000 CPU, 292485120.019531 GiB/292485120.019531 GiB memory, 7680000.000000 GiB/7680000.000000 GiB object_store_memory, 1.000000/1.000000 CPU_group_1_f26449396dc25d88645e711373b91bd4, 1000.000000/1000.000000 bundle_group_1_098d048e4d6df88103b3ad1b5a1e6f44, 0.000000/0.001000 CPU_group_0_098d048e4d6df88103b3ad1b5a1e6f44, 1.000000/1.000000 CPU_group_1_098d048e4d6df88103b3ad1b5a1e6f44, 2000.000000/2000.000000 bundle_group_098d048e4d6df88103b3ad1b5a1e6f44, 1000.000000/1000.000000 bundle_group_0_098d048e4d6df88103b3ad1b5a1e6f44, 1000.000000/1000.000000 bundle_group_0_f26449396dc25d88645e711373b91bd4, 0.000000/1.001000 CPU_group_098d048e4d6df88103b3ad1b5a1e6f44, 0.000000/0.001000 CPU_group_0_f26449396dc25d88645e711373b91bd4, 1.000000/1.000000 node:192.168.4.54, 2000.000000/2000.000000 bundle_group_f26449396dc25d88645e711373b91bd4, 0.000000/1.001000 CPU_group_f26449396dc25d88645e711373b91bd4, 1000.000000/1000.000000 bundle_group_1_f26449396dc25d88645e711373b91bd4}
In total there are 6 pending tasks and 0 pending actors on this node.
== Status ==
Memory usage on this node: 9.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 2.002/8 CPUs, 0/0 GPUs, 0.0/5.58 GiB heap, 0.0/0.15 GiB objects
Result logdir: /tmp/mock-client-8fb5/trainable_func_fiYzhHE
Number of trials: 2/2 (2 RUNNING)
+-------------------+----------+-------+------------------------+------------------------------+--------------------------+
| Trial name        | status   | loc   |   binary_46001.fc_size |   binary_46001.num_fc_layers |   training.learning_rate |
|-------------------+----------+-------+------------------------+------------------------------+--------------------------|
| trial_23bf9_00000 | RUNNING  |       |                    124 |                            4 |               0.00561152 |
| trial_23bf9_00001 | RUNNING  |       |                    220 |                            2 |               0.0291064  |
+-------------------+----------+-------+------------------------+------------------------------+--------------------------+
rryofs0p:

Hey @tgaddair, one workaround is to use the max_concurrent_trials argument of tune.run. It ensures that at most N trials are scheduled at a time. If you set it so that some resources are left unclaimed by trials, the Ray Data workers will be able to run. In the dynamic resource allocation case, the function used to compute trial resources would need to be modified to keep some CPUs free (which should be simple).
We will look into a proper fix for this issue.
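A minimal sketch of the workaround with max_concurrent_trials (the trainable here is an illustrative stand-in; in Ludwig the actual trainable is produced by the hyperopt Ray executor rather than written by hand):

from ray import tune

# Stand-in trainable for illustration only.
def train_fn(config):
    tune.report(loss=config["learning_rate"])

analysis = tune.run(
    train_fn,
    config={"learning_rate": tune.loguniform(1e-4, 1e-1)},
    num_samples=2,
    # Cap concurrency so some CPUs remain unclaimed by trial placement
    # groups and stay available to Ray Datasets tasks.
    max_concurrent_trials=1,
)

With two trials and max_concurrent_trials=1, the second trial waits instead of claiming the remaining CPUs, which leaves headroom for the Ray Data workers that would otherwise be starved.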
