Describe the bug
I get the following error when running these commands:
ludwig train (with use_gpu: false)
ludwig experiment (with any GPU setting)
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
To Reproduce
model_type: ecd
input_features:
  - name: quantity
    type: numerical
    encoder:
      type: dense
      dropout: 0.2
      num_layers: 1
      activation: relu
      output_size: 16
output_features:
  - name: unit
    type: category
    calibration: true
    loss:
      type: softmax_cross_entropy
    top_k: 1
trainer:
  early_stop: 3
  epochs: 100
  batch_size: 32
  learning_rate: 0.001
  optimizer:
    type: adam
backend:
  type: ray
  trainer:
    use_gpu: false
Dataset (randomly generated):
data.csv
Commands:
ludwig train \
--dataset data.csv \
--config config.yaml
ludwig experiment \
--dataset data.csv \
--config config.yaml
Expected behavior
I expect to be able to run ludwig train and ludwig experiment on the model without any CUDA-related errors.
Environment (please complete the following information):
- OS: CentOS
- Version: 7
- Python version: 3.10.13
- Ludwig version: 0.8.6
- Ray version: 2.3.1
- The instance has an A10G
Additional context
When using ludwig train:
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 552, in <lambda>
(TorchTrainer pid=18223) lambda config: tune_batch_size_fn(**config),
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 266, in tune_batch_size_fn
(TorchTrainer pid=18223) model = ray.get(model_ref)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(TorchTrainer pid=18223) return func(*args, **kwargs)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2382, in get
(TorchTrainer pid=18223) raise value
(TorchTrainer pid=18223) ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(TorchTrainer pid=18223) traceback: Traceback (most recent call last):
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(TorchTrainer pid=18223) obj = self._deserialize_object(data, metadata, object_ref)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(TorchTrainer pid=18223) return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(TorchTrainer pid=18223) python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(TorchTrainer pid=18223) obj = pickle.loads(in_band)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(TorchTrainer pid=18223) return torch.load(io.BytesIO(b))
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(TorchTrainer pid=18223) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(TorchTrainer pid=18223) result = unpickler.load()
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(TorchTrainer pid=18223) wrap_storage=restore_location(obj, location),
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(TorchTrainer pid=18223) result = fn(storage, location)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(TorchTrainer pid=18223) device = validate_cuda_device(location)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(TorchTrainer pid=18223) raise RuntimeError('Attempting to deserialize object on a CUDA '
(TorchTrainer pid=18223) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
When using ludwig experiment:
MapBatches(postprocess_batch): 0%| | 0/1 [00:10<?, ?it/s]
(_map_task pid=4697) 2023-11-22 22:28:35,512 ERROR serialization.py:371 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(_map_task pid=4697) Traceback (most recent call last):
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(_map_task pid=4697) obj = self._deserialize_object(data, metadata, object_ref)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(_map_task pid=4697) return self._deserialize_msgpack_data(data, metadata_fields)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(_map_task pid=4697) python_objects = self._deserialize_pickle5_data(pickle5_data)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(_map_task pid=4697) obj = pickle.loads(in_band)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(_map_task pid=4697) return torch.load(io.BytesIO(b))
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(_map_task pid=4697) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(_map_task pid=4697) result = unpickler.load()
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(_map_task pid=4697) wrap_storage=restore_location(obj, location),
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(_map_task pid=4697) result = fn(storage, location)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(_map_task pid=4697) device = validate_cuda_device(location)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(_map_task pid=4697) raise RuntimeError('Attempting to deserialize object on a CUDA '
(_map_task pid=4697) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
2 answers
9q78igpj #1
Hi philippe-solodov-wd, could you describe your use case in more detail? Are you trying to train without the GPU on a machine that has one?
lnlaulya #2
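Hey @philippe-solodov-wd, it looks like the problem is that the Ludwig entry point initializes the model weights on the GPU, but the Ray workers cannot deserialize them because they have no GPU visibility. The fix on our side is to move the model to the CPU before putting it into the Ray object store whenever the workers have no GPU.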
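The idea behind that fix can be sketched in plain PyTorch. This is only an illustration, not Ludwig's actual code: the helper names to_cpu_for_transfer and load_on_cpu are hypothetical, a small nn.Linear stands in for the real model, and an in-memory buffer stands in for the Ray object store. It shows the two places where the device mismatch can be avoided: moving the weights to the CPU before serializing them, and passing map_location when deserializing.

```python
import io

import torch
import torch.nn as nn


def to_cpu_for_transfer(model: nn.Module) -> nn.Module:
    """Hypothetical helper: move parameters and buffers to the CPU so the
    serialized model carries no CUDA storages."""
    return model.cpu()


def load_on_cpu(payload: bytes):
    """Hypothetical helper: deserialize on a process without GPU visibility.
    map_location remaps any CUDA storages to the CPU."""
    return torch.load(io.BytesIO(payload), map_location=torch.device("cpu"))


# Stand-in for the model built by the entry point (assumption: any nn.Module).
model = nn.Linear(1, 16)
if torch.cuda.is_available():
    model = model.cuda()  # mimics weights being initialized on the GPU

# Sender side: move to CPU before serializing. In a Ray setup, this step
# would come before the model is placed in the object store.
cpu_model = to_cpu_for_transfer(model)
buffer = io.BytesIO()
torch.save(cpu_model.state_dict(), buffer)

# Receiver side: a CPU-only worker loads the weights.
state = load_on_cpu(buffer.getvalue())
devices = {tensor.device.type for tensor in state.values()}
print(devices)
```

Note that the tracebacks above fail inside Ray's own unpickling, where torch.load is called without map_location, so the worker cannot apply the second half of this sketch itself. That is why the move to the CPU has to happen on the sending side.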
As a temporary workaround, you can try running the following command: