Chinese-CLIP: running MUGE finetune + distllation fails, need help

eagi6jfj · asked 8 months ago · in: Other

The run fails; asking for help.

Error message:

  Note that --use_env is set by default in torchrun.
  If your script expects `--local_rank` argument to be set, please
  change it to read from `os.environ['LOCAL_RANK']` instead. See
  https://pytorch.org/docs/stable/distributed.html#launch-utility for
  further instructions
  FutureWarning,
  python3: can't open file '…': [Errno 2] No such file or directory
  ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 2) local_rank: 0 (pid: 36860) of binary: /home/user/anaconda3/envs/jina3/bin/python3
  Traceback (most recent call last):
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
      "__main__", mod_spec)
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/runpy.py", line 85, in _run_code
      exec(code, run_globals)
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
      main()
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
      launch(args)
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
      run(args)
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
      )(*cmd_args)
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
      return launch_agent(self._config, self._entrypoint, list(args))
    File "/home/user/anaconda3/envs/jina3/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
      failures=result.failures,
  torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
  ============================================================
  FAILED
  ------------------------------------------------------------
  Failures:
    <NO_OTHER_FAILURES>
  ------------------------------------------------------------
  Root Cause (first observed failure):
  [0]:
    time      : 2023-11-01_07:59:28
    host      : localhost
    rank      : 0 (local_rank: 0)
    exitcode  : 2 (pid: 36860)
    error_file: <N/A>
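
For reference, the `FutureWarning` at the top of the log is about the deprecated `torch.distributed.launch` wrapper: with `--use_env`, the worker script is expected to read its rank from the environment rather than from a `--local_rank` argument. A minimal sketch of that pattern (a hypothetical snippet, not the actual `cn_clip/training/main.py` code):

  import os

  import torch
  import torch.distributed as dist

  # With `torch.distributed.launch --use_env` (or torchrun), the launcher
  # exports LOCAL_RANK into the environment instead of passing --local_rank.
  local_rank = int(os.environ.get("LOCAL_RANK", 0))
  torch.cuda.set_device(local_rank)

  # Single-node NCCL setup is assumed here; RANK, WORLD_SIZE, MASTER_ADDR
  # and MASTER_PORT are provided by the launcher environment.
  dist.init_process_group(backend="nccl")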

Parameter configuration:

  #!/usr/bin/env bash
  GPUS_PER_NODE=1
  # Number of GPU workers, for single-worker training, please set to 1
  WORKER_CNT=1
  # The ip address of the rank-0 worker, for single-worker training, please set to localhost
  export MASTER_ADDR=localhost
  # The port for communication
  export MASTER_PORT=8514
  # The rank of this worker, should be in {0, ..., WORKER_CNT-1}, for single-worker training, please set to 0
  export RANK=0
  export PYTHONPATH=${PYTHONPATH}:`pwd`/cn_clip/
  DATAPATH=${1}
  # data options
  train_data=${DATAPATH}/datasets/MUGE/lmdb/train
  val_data=${DATAPATH}/datasets/MUGE/lmdb/valid # if val_data is not specified, validation is automatically disabled
  # restore options
  resume=${DATAPATH}/pretrained_weights/clip_cn_vit-b-16.pt # or specify your custom ckpt path to resume
  reset_data_offset="--reset-data-offset"
  reset_optimizer="--reset-optimizer"
  # reset_optimizer=""
  # output options
  output_base_dir=${DATAPATH}/experiments/
  name=muge_finetune_vit-b-16_roberta-base_bs128_8gpu
  save_step_frequency=999999 # disable it
  save_epoch_frequency=1
  log_interval=1
  report_training_batch_acc="--report-training-batch-acc"
  # report_training_batch_acc=""
  # training hyper-params
  context_length=52
  warmup=100
  batch_size=128
  valid_batch_size=128
  accum_freq=1
  lr=5e-5
  wd=0.001
  max_epochs=3 # or you can alternatively specify --max-steps
  valid_step_interval=150
  valid_epoch_interval=1
  vision_model=ViT-B-16
  text_model=RoBERTa-wwm-ext-base-chinese
  use_augment="--use-augment"
  # use_augment=""
  distllation="--distllation"
  teacher_model_name="damo/multi-modal_team-vit-large-patch14_multi-modal-similarity"
  python3 -m torch.distributed.launch --use_env --nproc_per_node=${GPUS_PER_NODE} --nnodes=${WORKER_CNT} --node_rank=${RANK} \
          --master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} cn_clip/training/main.py \
          --train-data=${train_data} \
          --val-data=${val_data} \
          --resume=${resume} \
          ${reset_data_offset} \
          ${reset_optimizer} \
          --logs=${output_base_dir} \
          --name=${name} \
          --save-step-frequency=${save_step_frequency} \
          --save-epoch-frequency=${save_epoch_frequency} \
          --log-interval=${log_interval} \
          ${report_training_batch_acc} \
          --context-length=${context_length} \
          --warmup=${warmup} \
          --batch-size=${batch_size} \
          --valid-batch-size=${valid_batch_size} \
          --valid-step-interval=${valid_step_interval} \
          --valid-epoch-interval=${valid_epoch_interval} \
          --accum-freq=${accum_freq} \
          --lr=${lr} \
          --wd=${wd} \
          --max-epochs=${max_epochs} \
          --vision-model=${vision_model} \
          ${use_augment} \
          --text-model=${text_model} \
          ${distllation} \
          --teacher-model-name=${teacher_model_name}
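
Note that the `can't open file … No such file or directory` line in the log means the interpreter could not find the entry script `cn_clip/training/main.py`, which is passed as a relative path, so the launch script has to be started from the Chinese-CLIP repository root (e.g. `bash run_scripts/muge_finetune_vit-b-16_rbt-base.sh ${DATAPATH}`). A small pre-flight check along these lines (just a sketch; the paths mirror the variables in the script above):

  import os
  import sys

  # Sketch of a pre-flight check for the exitcode-2 failure above:
  # verify that the relative entry script and the DATAPATH-derived
  # inputs actually resolve from the current working directory.
  datapath = sys.argv[1] if len(sys.argv) > 1 else "."  # same positional arg as the shell script
  paths = [
      "cn_clip/training/main.py",  # relative: requires running from the repo root
      os.path.join(datapath, "datasets/MUGE/lmdb/train"),
      os.path.join(datapath, "datasets/MUGE/lmdb/valid"),
      os.path.join(datapath, "pretrained_weights/clip_cn_vit-b-16.pt"),
  ]
  for p in paths:
      print(("OK   " if os.path.exists(p) else "MISS ") + p)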

Environment:

torch 1.13.1
torchvision 0.14.1
Linux dev 5.4.0-152-generic #169 ~18.04.1-Ubuntu SMP Wed Jun 7 22:22:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Single GPU: NVIDIA GeForce RTX 2080 Ti; Driver Version: 530.30.02; CUDA Version: 12.1

sd2nnvve 1#

Hi, could you share the exact command you ran?
Also, does the basic finetuning script muge_finetune_vit-b-16_rbt-base.sh run correctly for you?

mqkwyuun 2#

In my case, the problem was that image_b64 was empty.
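
For anyone hitting the same symptom: Chinese-CLIP's LMDB is built from TSV files in which each line holds an image id and a base64-encoded image, so empty image_b64 fields can be caught by scanning the raw TSV before conversion. A rough check (the TSV path is an example, and the two-column `id<TAB>base64` layout is an assumption based on the MUGE data-prep format):

  import base64

  # Rough scan for empty or undecodable image_b64 fields in the raw TSV.
  # Assumed layout: one image per line, "image_id\tbase64_string".
  tsv_path = "datasets/MUGE/train_imgs.tsv"  # example path, adjust to your DATAPATH

  with open(tsv_path) as f:
      for lineno, line in enumerate(f, 1):
          parts = line.rstrip("\n").split("\t")
          if len(parts) < 2 or not parts[1]:
              print(f"line {lineno}: empty image_b64")
              continue
          try:
              base64.b64decode(parts[1], validate=True)
          except Exception:
              print(f"line {lineno}: invalid base64")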
