ludwig RuntimeError: Expected all tensors to be on the same device

jaxagkaj asked 2 months ago

Describe the bug

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
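
For context, this class of device-mismatch error can be reproduced in plain PyTorch, independent of Ludwig: the embedding lookup fails when the embedding weight lives on the CPU while the token indices live on the GPU. A minimal sketch (not Ludwig code):

import torch
import torch.nn as nn

# The embedding weight stays on the CPU by default ...
embedding = nn.Embedding(num_embeddings=100, embedding_dim=16)

# ... while the token indices are placed on the GPU.
token_ids = torch.tensor([[1, 2, 3]], device="cuda:0")

# The lookup reaches torch.embedding(weight, input, ...) and raises:
# RuntimeError: Expected all tensors to be on the same device ...
output = embedding(token_ids)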

To Reproduce

  1. Setup: Ray 2.3 + Ludwig master branch.
  2. Configure the Ray cluster: 1 node, 1 GPU.
  3. Run python3 llm_text_generation/simple_model_training.py on the Ray head node.
  4. See the error:
preds, _ = model.predict(test_set, skip_save_predictions=False)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/api.py", line 897, in predict
    predictions = predictor.batch_predict(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 695, in batch_predict
    predictions = self.df_engine.from_ray_dataset(predictions)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/data/dataframe/dask.py", line 247, in from_ray_dataset
    return dataset.to_dask()
  File "/usr/local/lib/python3.9/dist-packages/ray/data/dataset.py", line 3432, in to_dask
    schema = self.schema(fetch_if_missing=True)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/dataset.py", line 2234, in schema
    return self._plan.schema(fetch_if_missing=fetch_if_missing)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/plan.py", line 360, in schema
    self.execute()
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/plan.py", line 539, in execute
    blocks = execute_to_legacy_block_list(
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 84, in execute_to_legacy_block_list
    bundles = executor.execute(dag, initial_stats=stats)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 82, in execute
    return execute_recursive(dag)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 63, in execute_recursive
    output = _naive_run_until_complete(op)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 106, in _naive_run_until_complete
    op.notify_work_completed(ready)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 160, in notify_work_completed
    task.output = self._map_ref_to_ref_bundle(ref)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/map_operator.py", line 296, in _map_ref_to_ref_bundle
    block_metas = ray.get(all_refs[-1])
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2380, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_MapWorker.submit() (pid=7154, ip=10.78.204.13)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 272, in submit
    yield from _map_task(fn, ctx, *blocks)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/map_operator.py", line 351, in _map_task
    for b_out in fn(iter(blocks), ctx):
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 219, in do_map
    yield from block_fn(blocks, ctx, *fn_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/planner/map_batches.py", line 102, in fn
    yield from process_next_batch(batch)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/planner/map_batches.py", line 66, in process_next_batch
    batch = batch_fn(batch, *fn_args, **fn_kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 207, in fn
    return ray.data._cached_fn(item)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 782, in __call__
    predictions = self.predict(batch=dataset).set_index(df.index)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 163, in predict_single
    preds = self._predict(batch)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 188, in _predict
    outputs = self._predict_on_inputs(inputs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 334, in _predict_on_inputs
    return self.dist_model.generate(inputs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/llm.py", line 359, in generate
    model_outputs = self.model.generate(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 1524, in generate
    return self.beam_search(
  File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
    outputs = self(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 938, in forward
    outputs = self.model.decoder(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 635, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
(_MapWorker pid=7154) /usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
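
The UserWarning above names the underlying mismatch: the generation inputs end up on cuda:0 while the wrapped model is still on the CPU. A hedged sketch of the device alignment that would avoid the crash; here model and input_ids stand in for the objects Ludwig's predictor actually builds, not the real code path:

# Move the inputs to whichever device the model's weights live on
# before calling generate(); in this failure mode that device is the CPU.
device = next(model.parameters()).device
input_ids = input_ids.to(device)

outputs = model.generate(input_ids=input_ids)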

Expected behavior

It should pass, since this is one of the bundled examples.

Environment (please complete the following information):

  • OS: Debian-11
  • Python version: 3.9
  • Ludwig version: master branch
7dl7o3gd 1#

I added the $x_1m^0n^1x$ config, but I still hit the same error.

$x_1a^0b^1x$

oalqel3c 2#

It looks like class 'ludwig.trainers.trainer_llm.NoneTrainer' is the root cause: it never initializes the distributed backend.
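
If that is the case, the failure reduces to the model never being moved off the CPU while the predictor places the batch on cuda:0. A hedged illustration of the mismatch and of the fix (load_model and build_batch are placeholders for this example, not Ludwig APIs):

import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = load_model()              # stays on the CPU when no backend initializes/moves it
batch = build_batch().to(device)  # the predictor places the inputs on cuda:0

model.to(device)                  # putting model and batch on one device resolves the mismatch
outputs = model.generate(batch)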

bqucvtff 3#

Nice catch! @chongxiaoc, would you be willing to contribute a fix for this to Ludwig?

fumotvh3 4#

Hey @chongxiaoc, I think this may have been fixed in the latest version (with 4-bit training). Could you try again?
