PaddleNLP [Question]: Help needed: chatglm2 single-GPU SFT runs out of memory

waxmsbnn · posted 2 months ago in Other

Please try setting the model's per_device_train_batch_size and per_device_eval_batch_size to smaller values. For reference, a full SFT configuration looks like:

{
 "model_name_or_path": "/home/duyi/paddle",
 "dataset_name_or_path": "/home/duyi/ChatGLM2-6B/ptuning/AdvertiseGen",
 "output_dir": "./checkpoints/chatglm2_sft_ckpts",
 "per_device_train_batch_size": 16,
 "gradient_accumulation_steps": 4,
 "per_device_eval_batch_size": 16,
 "eval_accumulation_steps":16,
 "num_train_epochs": 3,
 "learning_rate": 3e-05,
 "warmup_steps": 30,
 "logging_steps": 1,
 "evaluation_strategy": "epoch",
 "save_strategy": "epoch",
 "src_length": 1024,
 "max_length": 2048,
 "fp16": true,
 "fp16_opt_level": "O2",
 "do_train": true,
 "do_eval": true,
 "disable_tqdm": true,
 "load_best_model_at_end": true,
 "eval_with_do_generation": false,
 "metric_for_best_model": "accuracy",
 "recompute": true,
 "save_total_limit": 1,
 "sharding_parallel_degree": 4,
 "sharding": "stage3",
 "zero_padding": false,
 "use_flash_attention": false
}

If the problem persists, check whether other processes are using GPU 0; if so, stop them, or run PaddlePaddle on a different GPU.
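For an out-of-memory error, the usual lever is the product per_device_train_batch_size × gradient_accumulation_steps (× number of GPUs), which fixes the effective global batch size: lowering the per-device value while raising accumulation keeps the optimizer's view of training unchanged but uses less activation memory per card. A quick arithmetic sketch (the GPU count of 8 is assumed from the launch command later in the thread):

```python
# Effective global batch size = per-device batch * accumulation steps * GPUs.
per_device_train_batch_size = 16
gradient_accumulation_steps = 4
num_gpus = 8  # assumed from the --gpus "0,1,2,3,4,5,6,7" launch command

effective = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective)  # 512

# Halving the per-device batch and doubling accumulation keeps the effective
# batch size identical while roughly halving activation memory per GPU:
print(8 * 8 * num_gpus)  # 512
```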

bweufnob 1#

We could not reproduce the problem after multiple attempts in an identical local environment. The reproduction command:

python -u  -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" finetune_generation.py chatglm2/sft_argument.json

Before running again, please try the following:

  1. Update paddle to the develop build:
python -m pip install paddlepaddle-gpu==0.0.0.post120 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html
  2. Confirm the dataset is placed in the correct location.

If the problem persists, please upload the full log and the reproduction command so we can debug.
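After installing the develop build, it can be worth confirming which wheel is actually active before re-running. A minimal check using only the standard library (the None fallback for a missing package is illustrative, not Paddle-specific):

```python
from importlib import metadata

def installed_version(package: str):
    """Return the installed version string for a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# On a develop install this should report something like 0.0.0.post120;
# None means the wheel is not installed in the current environment.
print(installed_version("paddlepaddle-gpu"))
```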
pvcm50d1 2#

> We could not reproduce the problem after multiple attempts in an identical local environment. The reproduction command:
> python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" finetune_generation.py chatglm2/sft_argument.json
> Before running again, please try: 1. update paddle to the develop build (python -m pip install paddlepaddle-gpu==0.0.0.post120 -f https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html); 2. confirm the dataset is placed in the correct location. If the problem persists, please upload the full log and the reproduction command so we can debug.

I am using run_finetune.py; there is no finetune_generation file under the llm directory. I also tried running on multiple GPUs, but it cannot connect to the port:
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0715 07:49:49.644336  2696 tcp_utils.cc:181] The server starts to listen on IP_ANY:46524

After this it hangs with no further output.
My launch command:
python -u -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" run_finetune.py config/chatglm2/sft_argument.json
Full log:

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2024-07-15 07:49:45,444 -----------  Configuration  ----------------------
LAUNCH INFO 2024-07-15 07:49:45,445 auto_parallel_config: None
LAUNCH INFO 2024-07-15 07:49:45,445 auto_tuner_json: None
LAUNCH INFO 2024-07-15 07:49:45,445 devices: 0,1,2,3,4,5,6,7
LAUNCH INFO 2024-07-15 07:49:45,445 elastic_level: -1
LAUNCH INFO 2024-07-15 07:49:45,445 elastic_timeout: 30
LAUNCH INFO 2024-07-15 07:49:45,445 enable_gpu_log: True
LAUNCH INFO 2024-07-15 07:49:45,445 gloo_port: 6767
LAUNCH INFO 2024-07-15 07:49:45,445 host: None
LAUNCH INFO 2024-07-15 07:49:45,445 ips: None
LAUNCH INFO 2024-07-15 07:49:45,445 job_id: default
LAUNCH INFO 2024-07-15 07:49:45,445 legacy: False
LAUNCH INFO 2024-07-15 07:49:45,445 log_dir: log
LAUNCH INFO 2024-07-15 07:49:45,445 log_level: INFO
LAUNCH INFO 2024-07-15 07:49:45,445 log_overwrite: False
LAUNCH INFO 2024-07-15 07:49:45,445 master: None
LAUNCH INFO 2024-07-15 07:49:45,445 max_restart: 3
LAUNCH INFO 2024-07-15 07:49:45,445 nnodes: 1
LAUNCH INFO 2024-07-15 07:49:45,445 nproc_per_node: None
LAUNCH INFO 2024-07-15 07:49:45,445 rank: -1
LAUNCH INFO 2024-07-15 07:49:45,445 run_mode: collective
LAUNCH INFO 2024-07-15 07:49:45,445 server_num: None
LAUNCH INFO 2024-07-15 07:49:45,445 servers: 
LAUNCH INFO 2024-07-15 07:49:45,445 sort_ip: False
LAUNCH INFO 2024-07-15 07:49:45,445 start_port: 6070
LAUNCH INFO 2024-07-15 07:49:45,445 trainer_num: None
LAUNCH INFO 2024-07-15 07:49:45,445 trainers: 
LAUNCH INFO 2024-07-15 07:49:45,445 training_script: run_finetune.py
LAUNCH INFO 2024-07-15 07:49:45,445 training_script_args: ['config/chatglm2/sft_argument.json']
LAUNCH INFO 2024-07-15 07:49:45,445 with_gloo: 1
LAUNCH INFO 2024-07-15 07:49:45,445 --------------------------------------------------
LAUNCH INFO 2024-07-15 07:49:45,449 Job: default, mode collective, replicas 1[1:1], elastic False
LAUNCH INFO 2024-07-15 07:49:45,451 Run Pod: ojhcwq, replicas 8, status ready
LAUNCH INFO 2024-07-15 07:49:45,607 Watching Pod: ojhcwq, replicas 8, status running
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
[2024-07-15 07:49:48,214] [ WARNING] - if you run ring_flash_attention.py, please ensure you install the paddlenlp_ops by following the instructions provided at https://github.com/PaddlePaddle/PaddleNLP/blob/develop/csrc/README.md
[2024-07-15 07:49:49,643] [    INFO] distributed_strategy.py:214 - distributed strategy initialized
======================= Modified FLAGS detected =======================
FLAGS(name='FLAGS_selected_gpus', current_value='0', default_value='')
=======================================================================
I0715 07:49:49.644336  2696 tcp_utils.cc:181] The server starts to listen on IP_ANY:46524
72qzrwbm 3#

I found that run_check() connects to 127.0.0.1 successfully, but once the training script runs, it tries a different address and fails to connect:
I0715 07:52:00.361024 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.
I0715 07:54:13.480999 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.

tjvv9vkg 4#

> When running run_check(), the connection to 127.0.0.1 succeeds, but once the training script runs, it tries a different address and fails to connect.

Is there a place in PaddlePaddle where this can be manually changed to 127.0.0.1?

uttx8gqw 5#

  1. You can specify the address explicitly with --master:

# Use a standalone etcd service
python -m paddle.distributed.launch --master=etcd://10.11.60.193:2379 --nnodes=4 --devices=1,2,3 train.py

or

# Use an HTTP service: a training node plus an available port
python -m paddle.distributed.launch --master=10.11.60.193:2379 --nnodes=4 --devices=1,2,3 train.py

  2. If the problem persists, follow the official installation guide to choose a configuration that fits your environment and install the develop build.
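Before launching with an explicit --master, it can help to confirm the rendezvous address is actually accepting connections from the node you launch on. A small probe using only the standard library (the helper name is made up for illustration, not a Paddle API):

```python
import socket

def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if host:port is currently accepting TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe the etcd/master endpoint from this node before starting:
# print(is_listening("10.11.60.193", 2379))
```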
daolsyd0 6#

> I found that run_check() connects to 127.0.0.1 successfully, but once the training script runs, it tries a different address and fails to connect: I0715 07:52:00.361024 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening. I0715 07:54:13.480999 2696 tcp_utils.cc:107] Retry to connect to 172.31.3.19:46524 while the server is not yet listening.

If 172.31.3.19 is not this machine's IP, check the machine's environment. At startup, Paddle reads environment variables and launches using the local IP it finds there.
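To check whether 172.31.3.19 is in fact one of this machine's own addresses, a quick standard-library sketch (note that hostname resolution may not list every interface, so treat a negative result as a hint, not proof):

```python
import socket

def local_ipv4_addresses() -> set:
    """IPv4 addresses this host resolves for its own hostname, plus loopback."""
    addrs = {"127.0.0.1"}
    try:
        addrs.update(socket.gethostbyname_ex(socket.gethostname())[2])
    except socket.gaierror:
        pass  # hostname does not resolve; fall back to loopback only
    return addrs

# The address Paddle is retrying, taken from the log above:
print("172.31.3.19" in local_ipv4_addresses())
```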
