Ludwig: 4 out of 5 trials fail due to insufficient memory

Posted by 7gyucuyw 6 months ago in Other


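Problem description: When running experiments with Ludwig's AutoML, 4 out of 5 trials fail because of insufficient memory. The author has 4 GPUs (RTX 2080 Super, 8 GB each) and 64 GB of RAM, but AutoML does not appear to make full use of the GPUs to get the best results.

Likely cause: AutoML may be allocating GPU resources incorrectly, so that some trials cannot run properly.

Suggested solution: Try increasing the experiment's time budget so that trials have more time to run. Also check the AutoML configuration to make sure it correctly detects and uses all available GPUs.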

ludwig

Source: https://github.com/ludwig-ai/ludwig/issues/4010

2 answers

bnl4lu3b #1

Hey @diegotxegp,

Could you try setting max_concurrent_trials to a value such as 1 or 2? https://ludwig.ai/latest/configuration/hyperparameter_optimization/#executor
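For illustration, here is a minimal sketch of where that setting would sit. The dictionary mirrors the executor keys from the AutoML config posted in this thread; the helper name limit_concurrency is hypothetical and not part of Ludwig, and nothing here calls Ludwig itself.

```python
import copy

# Executor keys copied from the AutoML config posted in this thread.
hyperopt_config = {
    "executor": {
        "type": "ray",
        "num_samples": 5,
        "cpu_resources_per_trial": 5,
        "gpu_resources_per_trial": 1,
        "max_concurrent_trials": None,
        "time_budget_s": 7200,
    },
}


def limit_concurrency(hyperopt, max_trials):
    """Return a copy of the hyperopt section with max_concurrent_trials set.

    Hypothetical helper for illustration; it only edits a plain dict.
    """
    if max_trials < 1:
        raise ValueError("max_trials must be at least 1")
    updated = copy.deepcopy(hyperopt)
    updated.setdefault("executor", {})["max_concurrent_trials"] = max_trials
    return updated


limited = limit_concurrency(hyperopt_config, 1)
print(limited["executor"]["max_concurrent_trials"])
```

The original dictionary is left untouched, so the two versions can be compared side by side before a run.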

Regarding GPU usage: is your CUDA_VISIBLE_DEVICES environment variable set?
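One way to answer that question from inside the script is sketched below. It uses only the standard library; the function name visible_gpu_ids is a hypothetical helper, and the device list "0,1,2,3" assumes the four GPUs mentioned in the question.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES, or None if unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset: CUDA does not restrict which GPUs the process can see.
        return None
    # An empty string yields an empty list, which hides every GPU.
    return [part.strip() for part in raw.split(",") if part.strip()]


# The variable has to be set before any CUDA-backed library initializes,
# so this line belongs at the very top of the script.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
print(visible_gpu_ids())
```

Note that this only reports what the environment variable requests; it does not confirm that the driver or the training framework actually sees those devices.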

6 months ago

esyap4oy #2

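Thanks for the quick reply. The point is that I am trying to use AutoML to run the experiments automatically. Since the error appeared, I have added os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" as you suggested, and I set max_concurrent_trials as shown below, but it did not make much difference:
Code: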

from ludwig.automl import auto_train
auto_train_results = auto_train(
    dataset=self.df,
    target=selected_targets[0],
    time_limit_s=7200,
    num_samples=4,
    cpu_resources_per_trial=5,
    gpu_resources_per_trial=1,
    max_concurrent_trials=1,
)

AutoML config:

{
  'eval_split': 'validation',
  'executor': {
    'cpu_resources_per_trial': 5,
    'gpu_resources_per_trial': 1,
    'kubernetes_namespace': None,
    'max_concurrent_trials': None,
    'num_samples': 5,
    'scheduler': {
      'brackets': 1,
      'grace_period': 72,
      'max_t': 7200,
      'metric': None,
      'mode': None,
      'reduction_factor': 5.0,
      'stop_last_trials': True,
      'time_attr': 'time_total_s',
      'type': 'async_hyperband'},
    'time_budget_s': 7200,
    'trial_driver_resources': {'CPU': 1, 'GPU': 0},
    'type': 'ray'},
  'goal': 'maximize',
  'metric': 'roc_auc',
  'output_feature': 'recommended',
  'parameters': {
    'combiner.dropout': { 'lower': 0.0, 'space': 'uniform', 'upper': 0.1},
    'combiner.num_fc_layers': { 'lower': 1, 'space': 'randint', 'upper': 4},
    'combiner.output_size': { 'categories': [128, 256], 'space': 'choice'},
    'trainer.batch_size': { 'categories': [64, 128, 256, 512, 1024], 'space': 'choice'},
    'trainer.learning_rate': { 'lower': 2e-05, 'space': 'loguniform', 'upper': 0.001}},
  'search_alg': {'type': 'hyperopt'},
  'split': 'validation'
}
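The printed config lists 'max_concurrent_trials': None and 'num_samples': 5, while the auto_train call above passes max_concurrent_trials=1 and num_samples=4. The thread does not say whether this config was printed from the same run, so the following is only a sketch of how to spot such differences; executor_mismatches is a hypothetical helper, not a Ludwig function.

```python
def executor_mismatches(requested, executor):
    """Return {key: (requested, effective)} for settings whose values differ."""
    return {
        key: (value, executor.get(key))
        for key, value in requested.items()
        if executor.get(key) != value
    }


# Keyword arguments taken from the auto_train call in this post.
requested = {
    "num_samples": 4,
    "cpu_resources_per_trial": 5,
    "gpu_resources_per_trial": 1,
    "max_concurrent_trials": 1,
}

# Matching keys from the 'executor' section of the printed AutoML config.
executor = {
    "num_samples": 5,
    "cpu_resources_per_trial": 5,
    "gpu_resources_per_trial": 1,
    "max_concurrent_trials": None,
}

mismatches = executor_mismatches(requested, executor)
for key, (want, got) in mismatches.items():
    print(f"{key}: requested {want!r}, config shows {got!r}")
```

If the config was indeed produced by this call, a concurrency limit that never reached the executor would be consistent with the change making little difference, but that is an assumption the thread does not confirm.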

6 months ago
