Describe the bug
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
To reproduce
- Setup: Ray 2.3 + Ludwig master branch.
- Configure a Ray cluster: 1 node, 1 GPU.
- On the Ray head node, run
python3 llm_text_generation/simple_model_training.py
- See the error
preds, _ = model.predict(test_set, skip_save_predictions=False)
File "/usr/local/lib/python3.9/dist-packages/ludwig/api.py", line 897, in predict
predictions = predictor.batch_predict(
File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 695, in batch_predict
predictions = self.df_engine.from_ray_dataset(predictions)
File "/usr/local/lib/python3.9/dist-packages/ludwig/data/dataframe/dask.py", line 247, in from_ray_dataset
return dataset.to_dask()
File "/usr/local/lib/python3.9/dist-packages/ray/data/dataset.py", line 3432, in to_dask
schema = self.schema(fetch_if_missing=True)
File "/usr/local/lib/python3.9/dist-packages/ray/data/dataset.py", line 2234, in schema
return self._plan.schema(fetch_if_missing=fetch_if_missing)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/plan.py", line 360, in schema
self.execute()
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/plan.py", line 539, in execute
blocks = execute_to_legacy_block_list(
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 84, in execute_to_legacy_block_list
bundles = executor.execute(dag, initial_stats=stats)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 82, in execute
return execute_recursive(dag)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 63, in execute_recursive
output = _naive_run_until_complete(op)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/bulk_executor.py", line 106, in _naive_run_until_complete
op.notify_work_completed(ready)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 160, in notify_work_completed
task.output = self._map_ref_to_ref_bundle(ref)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/map_operator.py", line 296, in _map_ref_to_ref_bundle
block_metas = ray.get(all_refs[-1])
File "/usr/local/lib/python3.9/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/ray/_private/worker.py", line 2380, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::_MapWorker.submit() (pid=7154, ip=10.78.204.13)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/actor_pool_map_operator.py", line 272, in submit
yield from _map_task(fn, ctx, *blocks)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/operators/map_operator.py", line 351, in _map_task
for b_out in fn(iter(blocks), ctx):
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 219, in do_map
yield from block_fn(blocks, ctx, *fn_args, **fn_kwargs)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/planner/map_batches.py", line 102, in fn
yield from process_next_batch(batch)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/planner/map_batches.py", line 66, in process_next_batch
batch = batch_fn(batch, *fn_args, **fn_kwargs)
File "/usr/local/lib/python3.9/dist-packages/ray/data/_internal/execution/legacy_compat.py", line 207, in fn
return ray.data._cached_fn(item)
File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 782, in __call__
predictions = self.predict(batch=dataset).set_index(df.index)
File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 163, in predict_single
preds = self._predict(batch)
File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 188, in _predict
outputs = self._predict_on_inputs(inputs)
File "/usr/local/lib/python3.9/dist-packages/ludwig/models/predictor.py", line 334, in _predict_on_inputs
return self.dist_model.generate(inputs)
File "/usr/local/lib/python3.9/dist-packages/ludwig/models/llm.py", line 359, in generate
model_outputs = self.model.generate(
File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 1524, in generate
return self.beam_search(
File "/usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py", line 2810, in beam_search
outputs = self(
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 938, in forward
outputs = self.model.decoder(
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/transformers/models/opt/modeling_opt.py", line 635, in forward
inputs_embeds = self.embed_tokens(input_ids)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/usr/local/lib/python3.9/dist-packages/torch/nn/functional.py", line 2210, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
(_MapWorker pid=7154) /usr/local/lib/python3.9/dist-packages/transformers/generation/utils.py:1405: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
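The traceback ends in `torch.embedding`, and the warning reports `input_ids` on `cuda` while the model is on `cpu`. As a minimal sketch of how to check for and work around this kind of mismatch in plain PyTorch, the helpers below compare each input tensor's device with the module's device. The names `module_device`, `find_mismatched`, and `move_inputs` are hypothetical and are not part of Ludwig, Ray, or transformers; the small `nn.Embedding` only stands in for the `embed_tokens` layer seen in the traceback.

```python
import torch
from torch import nn


def module_device(module: nn.Module) -> torch.device:
    """Return the device of the first parameter (assumes a single-device module)."""
    return next(module.parameters()).device


def find_mismatched(module: nn.Module, tensors: dict) -> dict:
    """Map each tensor name to its device when it differs from the module's device."""
    target = module_device(module)
    return {name: t.device for name, t in tensors.items() if t.device != target}


def move_inputs(module: nn.Module, tensors: dict) -> dict:
    """Return copies of the tensors placed on the module's device."""
    target = module_device(module)
    return {name: t.to(target) for name, t in tensors.items()}


# Stand-in for the embedding layer at the bottom of the traceback.
embed_tokens = nn.Embedding(num_embeddings=16, embedding_dim=4)
inputs = {"input_ids": torch.tensor([[1, 2, 3]])}

# Report any tensors that live on a different device than the module.
mismatched = find_mismatched(embed_tokens, inputs)

# Align the inputs with the module before the forward pass.
aligned = move_inputs(embed_tokens, inputs)
inputs_embeds = embed_tokens(aligned["input_ids"])
```

In the case reported here the model, not the input, appears to be on the wrong device, so moving the model with `module.to(device)` is probably the more appropriate direction. The sketch only shows how to detect the mismatch and make the two sides agree.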
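Expected behavior
It should pass, since this is an example.
Environment (please complete the following information):
- OS: Debian-11
- Python version: 3.9
- Ludwig version: master branch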
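For a device-placement error like this one, the installed versions of Ray, PyTorch, and transformers are often relevant too. The following is a small sketch, using only the Python standard library, that prints such a list. The package selection in `PACKAGES` and the function name `collect_versions` are assumptions made for illustration.

```python
import platform
from importlib import metadata

# Packages that appear in the traceback; adjust as needed (assumed list).
PACKAGES = ("ludwig", "ray", "torch", "transformers")


def collect_versions(packages=PACKAGES) -> dict:
    """Collect interpreter, OS, and installed package versions."""
    versions = {"python": platform.python_version(), "os": platform.platform()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


report = collect_versions()
for name, version in report.items():
    print(f"- {name}: {version}")
```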
4 answers
7dl7o3gd #1
I added the $x_1m^0n^1x$ configuration, but I still get the same error.
$x_1a^0b^1x$
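oalqel3c #2
It looks like
class 'ludwig.trainers.trainer_llm.NoneTrainer'
is the root cause: it does not initialize the distributed backend.
bqucvtff #3
Good catch! @chongxiaoc, would you be willing to contribute a fix for this to Ludwig?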
fumotvh3 #4
Hey @chongxiaoc, I think this issue may already be fixed in the latest version (4-bit training). Could you try again?