torch distributed.init out of memory



Setting the GPU environment:

  os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"  # no spaces between device ids
  local_rank = 0
  torch.cuda.set_device(local_rank)

cuda(0) normally refers to physical GPU 0, but once CUDA_VISIBLE_DEVICES is set, cuda(0) refers to the first GPU listed in CUDA_VISIBLE_DEVICES (physical GPU 1 in the example above).
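A quick way to observe the remapping (a minimal sketch; the device ids match the example above and are otherwise arbitrary):

  import os

  # Must be set before the first CUDA call, or it has no effect.
  os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3"

  import torch

  torch.cuda.set_device(0)              # logical cuda:0 -> physical GPU 1
  print(torch.cuda.current_device())    # 0 (logical index)
  print(torch.cuda.get_device_name(0))  # reports the name of physical GPU 1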

Code where distributed.init raised out of memory:
  import argparse
  import logging
  import os
  import time
  import torch
  import torch.distributed as dist
  import torch.nn.functional as F
  import torch.utils.data.distributed

  def main(args):
      try:
          # Launched by a distributed launcher: read rendezvous info from env vars.
          world_size = int(os.environ['WORLD_SIZE'])
          rank = int(os.environ['RANK'])
          dist_url = "tcp://{}:{}".format(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
      except KeyError:
          # Single-process fallback. The original line was truncated after
          # "tcp://127.0.0.1"; the port below is a placeholder.
          world_size = 1
          rank = 0
          dist_url = "tcp://127.0.0.1:12345"
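The listing is cut off before the initialization itself. A minimal sketch of how it would typically continue, assuming the NCCL backend and an argparse --local_rank argument (both assumptions, not shown in the original). The point is that torch.cuda.set_device must run before dist.init_process_group; otherwise every process creates its CUDA context on cuda:0, which is what triggers the out-of-memory error during init:

      # Pin this process to its own GPU *before* init_process_group;
      # skipping this makes every rank allocate its context on cuda:0 and OOM.
      torch.cuda.set_device(args.local_rank)
      dist.init_process_group(
          backend="nccl",        # assumed backend for multi-GPU training
          init_method=dist_url,
          rank=rank,
          world_size=world_size,
      )

  if __name__ == "__main__":
      parser = argparse.ArgumentParser()
      parser.add_argument("--local_rank", type=int, default=0)
      main(parser.parse_args())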
