Ludwig: CUDA deserialization fails on a GPU machine

ltqd579y  posted 4 months ago  in Other

Describe the bug

I get this error when running either of the following commands:

  • ludwig train (with use_gpu: false)
  • ludwig experiment (with any GPU setting)
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with 
map_location=torch.device('cpu') to map your storages to the CPU.
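
For reference, the remedy named in the error message looks roughly like the sketch below when you control the torch.load call yourself. The helper names pick_map_location and load_on_available_device are hypothetical and are not part of PyTorch or Ludwig. This alone is unlikely to fix the Ray case reported here: in the tracebacks further down, torch.load is called from torch/storage.py (_load_from_bytes) during unpickling, where the caller has no way to pass map_location.

```python
def pick_map_location(cuda_available: bool) -> str:
    """Return the device string that saved storages should be remapped onto."""
    return "cuda" if cuda_available else "cpu"


def load_on_available_device(path_or_buffer):
    """Load a torch-saved object (e.g. a state_dict), remapping CUDA storages
    to the CPU when this process cannot see a GPU."""
    import torch  # lazy import so pick_map_location works without torch installed

    device = torch.device(pick_map_location(torch.cuda.is_available()))
    return torch.load(path_or_buffer, map_location=device)
```
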

To reproduce

model_type: ecd
input_features:
-
    name: quantity
    type: numerical
    encoder:
      type: dense
      dropout: 0.2
      num_layers: 1
      activation: relu
      output_size: 16
output_features:
-
    name: unit
    type: category
    calibration: true
    loss:
      type: softmax_cross_entropy
    top_k: 1

trainer:
    early_stop: 3
    epochs: 100
    batch_size: 32
    learning_rate: 0.001
    optimizer:
        type: adam

backend:
    type: ray
    trainer:
        use_gpu: false

Dataset (randomly generated):
data.csv
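
The original data.csv is not included in the post. A minimal sketch for generating a comparable file is shown here; the column names quantity and unit come from the config above, while the row count, value range, and category labels are assumptions made only for illustration.

```python
import csv
import random


def write_random_dataset(path, num_rows=1000, seed=0):
    """Write a CSV with a numeric `quantity` column and a categorical `unit` column."""
    rng = random.Random(seed)
    units = ["kg", "g", "l", "ml"]  # assumed labels; the real ones are unknown
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["quantity", "unit"])
        for _ in range(num_rows):
            # Value range 0..100 is an arbitrary choice for this sketch.
            writer.writerow([round(rng.uniform(0.0, 100.0), 3), rng.choice(units)])
    return path
```

Calling write_random_dataset("data.csv") produces a file that the commands below can consume.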
Commands:

ludwig train \
  --dataset data.csv \
  --config config.yaml

ludwig experiment \
  --dataset data.csv \
  --config config.yaml

Expected behavior

I expect to be able to run ludwig train and ludwig experiment on this model without CUDA-related errors.

Environment (please complete the following information):

  • OS: CentOS
  • Version: 7
  • Python version: 3.10.13
  • Ludwig version: 0.8.6
  • Ray version: 2.3.1
  • The instance has an A10G GPU

Additional context

When using ludwig train:

(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 552, in <lambda>
(TorchTrainer pid=18223)   lambda config: tune_batch_size_fn(**config),
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 266, in tune_batch_size_fn
(TorchTrainer pid=18223)   model = ray.get(model_ref)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(TorchTrainer pid=18223)   return func(*args, **kwargs)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2382, in get
(TorchTrainer pid=18223)   raise value
(TorchTrainer pid=18223) ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(TorchTrainer pid=18223) traceback: Traceback (most recent call last):
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(TorchTrainer pid=18223)   obj = self._deserialize_object(data, metadata, object_ref)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(TorchTrainer pid=18223)   return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(TorchTrainer pid=18223)   python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(TorchTrainer pid=18223)   obj = pickle.loads(in_band)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(TorchTrainer pid=18223)   return torch.load(io.BytesIO(b))
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(TorchTrainer pid=18223)   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(TorchTrainer pid=18223)   result = unpickler.load()
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(TorchTrainer pid=18223)   wrap_storage=restore_location(obj, location),
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(TorchTrainer pid=18223)   result = fn(storage, location)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(TorchTrainer pid=18223)   device = validate_cuda_device(location)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(TorchTrainer pid=18223)   raise RuntimeError('Attempting to deserialize object on a CUDA '
(TorchTrainer pid=18223) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

When using ludwig experiment:

MapBatches(postprocess_batch):  0%|     | 0/1 [00:10<?, ?it/s]
(_map_task pid=4697) 2023-11-22 22:28:35,512	ERROR serialization.py:371 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(_map_task pid=4697) Traceback (most recent call last):
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(_map_task pid=4697)   obj = self._deserialize_object(data, metadata, object_ref)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(_map_task pid=4697)   return self._deserialize_msgpack_data(data, metadata_fields)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(_map_task pid=4697)   python_objects = self._deserialize_pickle5_data(pickle5_data)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(_map_task pid=4697)   obj = pickle.loads(in_band)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(_map_task pid=4697)   return torch.load(io.BytesIO(b))
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(_map_task pid=4697)   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(_map_task pid=4697)   result = unpickler.load()
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(_map_task pid=4697)   wrap_storage=restore_location(obj, location),
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(_map_task pid=4697)   result = fn(storage, location)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(_map_task pid=4697)   device = validate_cuda_device(location)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(_map_task pid=4697)   raise RuntimeError('Attempting to deserialize object on a CUDA '
(_map_task pid=4697) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
9q78igpj  #1

Hi philippe-solodov-wd, could you describe your use case in more detail? Are you trying to train without the GPU on a machine that has one?

lnlaulya  #2

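Hey @philippe-solodov-wd, it looks like the problem is that the Ludwig entry point initializes the model weights on the GPU, but the Ray workers cannot deserialize them because they have no GPU visibility. The fix on our side is to move the model to the CPU before putting it into the Ray object store whenever the workers do not have GPUs.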
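
The fix described above can be sketched as follows. This is not Ludwig's actual patch: prepare_model_for_object_store and the workers_use_gpu flag are hypothetical names, and the model is only assumed to behave like a torch.nn.Module, whose .to(device) method moves its parameters and returns the module.

```python
def prepare_model_for_object_store(model, workers_use_gpu: bool):
    """Return the model on a device that every worker is able to deserialize.

    `model` is assumed to expose a torch.nn.Module-style `.to(device)` method.
    """
    if not workers_use_gpu:
        # CPU-only workers cannot rebuild CUDA storages, so move the weights
        # to host memory before the model is pickled.
        model = model.to("cpu")
    return model


# Hypothetical call site (requires ray; not taken from Ludwig's source):
#   model_ref = ray.put(prepare_model_for_object_store(model, workers_use_gpu=False))
```
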
As a temporary workaround, you can try running the following command:

CUDA_VISIBLE_DEVICES="" ludwig train ...
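
If the command is launched from a Python script rather than a shell, the same environment override can be sketched as shown here. The helper names are hypothetical. The sketch assumes that Ray is started locally by the ludwig process, so that its workers inherit the environment; a cluster started separately would not pick up this setting.

```python
import os
import subprocess
import sys


def env_without_gpus(base_env=None):
    """Copy an environment and hide all CUDA devices from child processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # empty string: no GPU is visible
    return env


def child_sees_cuda_visible_devices(env):
    """Start a child Python process and report the value it observes."""
    probe = "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Usage with the command from the report (requires ludwig on PATH):
#   subprocess.run(
#       ["ludwig", "train", "--dataset", "data.csv", "--config", "config.yaml"],
#       env=env_without_gpus(), check=True,
#   )
```
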
