Distributed package doesn't have NCCL built in

x33g5p2x, reposted on 2022-05-11 in Other


Problem description:
On Windows, the Python call dist.init_process_group(backend, rank, world_size) fails with "RuntimeError: Distributed package doesn't have NCCL built in". The traceback is:

File "D:\Software\Anaconda\Anaconda3\envs\segmenter\lib\site-packages\torch\distributed\distributed_c10d.py", line 531, in init_process_group
    timeout=timeout)
  File "D:\Software\Anaconda\Anaconda3\envs\segmenter\lib\site-packages\torch\distributed\distributed_c10d.py", line 625, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL "
RuntimeError: Distributed package doesn't have NCCL built in

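Cause:
Windows does not support the NCCL backend. The PyTorch build used there is compiled without it, which is what the error message reports.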

Solution:
Set backend='gloo' before the dist.init_process_group call so that this value is what gets passed in. In other words, use Gloo instead of NCCL on Windows.
————————————————
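Copyright notice: this is an original article by CSDN blogger "StarCap", released under the CC 4.0 BY-SA license. Reposts must include the original source link and this notice.
Original link: https://blog.csdn.net/StarCap/article/details/120070425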
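
The fix described above can also be written so that the same script runs on both Windows and Linux. The following is only a sketch under a few assumptions: the helper names pick_backend and init_single_process are made up for this example and are not part of PyTorch or insightface, the port is the one used in the training code below, and NCCL is treated as usable only when PyTorch reports it as built in and a CUDA device is present.

```python
import sys


def pick_backend(platform: str = sys.platform, nccl_available: bool = False) -> str:
    """Return "nccl" only where it can work; otherwise fall back to "gloo".

    Hypothetical helper for this example, not a PyTorch API.
    """
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


def init_single_process(port: int = 12584) -> str:
    """Initialise a one-process group on localhost and return the backend used."""
    # Imported lazily so that pick_backend can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    # Assumption: NCCL is only worth selecting if it is compiled in and a GPU exists.
    nccl_ok = dist.is_nccl_available() and torch.cuda.is_available()
    backend = pick_backend(sys.platform, nccl_ok)
    dist.init_process_group(
        backend=backend,
        init_method=f"tcp://127.0.0.1:{port}",
        rank=0,
        world_size=1,
    )
    return backend
```

Calling init_single_process() on Windows should select "gloo", while a Linux machine with a CUDA build of PyTorch should select "nccl". Multi-process launches would need real rank and world_size values instead of the fixed 0 and 1 used here.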

insightface training code:

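import os

from torch import distributed

try:
    # Started by a launcher: rank information comes from the environment.
    world_size = int(os.environ["WORLD_SIZE"])
    rank = int(os.environ["RANK"])
    # distributed.init_process_group("nccl")  # fails on Windows
    distributed.init_process_group("gloo")
except KeyError:
    # Started directly with python: fall back to a single local process.
    world_size = 1
    rank = 0
    distributed.init_process_group(
        backend="gloo",  # NCCL is not available on Windows
        init_method="tcp://127.0.0.1:12584",
        rank=rank,
        world_size=world_size,
    )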
