DeepSpeed-MII runtime error: the server socket failed to listen on any local network address; it failed to bind to [::]:29700 (errno: 98 - Address already in use)

68de4m5k · asked 3 months ago

When I try to start the server with deepspeed --num_gpus 2 xxx.py, I get an error. If I start it with python3 xxx.py instead, it runs fine. I want to deploy llama-70b (roughly 140 GB) across two A100 GPUs (80 GB each), so I have to launch it with deepspeed. Here is the output:

[2024-01-20 10:15:26,416] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:26,676] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:26,846] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-01-20 10:15:26,846] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-01-20 10:15:26,846] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-01-20 10:15:26,846] [INFO] [launch.py:163:main] dist_world_size=2
[2024-01-20 10:15:26,846] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-01-20 10:15:26,967] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:26,967] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:27,150] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:27,150] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:27,259] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-01-20 10:15:27,260] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-01-20 10:15:27,260] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-01-20 10:15:27,260] [INFO] [launch.py:163:main] dist_world_size=2
[2024-01-20 10:15:27,260] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-01-20 10:15:28,970] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:29,041] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:29,083] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:29,117] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-20 10:15:29,509] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-20 10:15:29,576] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-20 10:15:29,576] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-01-20 10:15:29,804] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-20 10:15:29,805] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:436] [c10d] The server socket has failed to bind to [::]:29700 (errno: 98 - Address already in use).
[W socket.cpp:436] [c10d] The server socket has failed to bind to 0.0.0.0:29700 (errno: 98 - Address already in use).
[E socket.cpp:472] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/launch/multi_gpu_server.py", line 105, in <module>
    main()
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/launch/multi_gpu_server.py", line 98, in main
    inference_pipeline = async_pipeline(args.model_config)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/api.py", line 167, in async_pipeline
    inference_engine = load_model(model_config)
                       ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/modeling/models.py", line 14, in load_model
    init_distributed(model_config)
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/utils.py", line 187, in init_distributed
    deepspeed.init_distributed(dist_backend="nccl", timeout=timedelta(seconds=1e9))
  File "/home/infer/miniconda3/lib/python3.11/site-packages/deepspeed/comm/comm.py", line 670, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/infer/miniconda3/lib/python3.11/site-packages/deepspeed/comm/torch.py", line 146, in init_process_group
    torch.distributed.init_process_group(backend,
  File "/home/infer/miniconda3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 74, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1141, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 241, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29700 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29700 (errno: 98 - Address already in use).
[2024-01-20 10:15:29,822] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-20 10:15:29,878] [INFO] [engine_v2.py:82:__init__] Building model...
[2024-01-20 10:15:29,944] [INFO] [engine_v2.py:82:__init__] Building model...
Using /home/infer/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Using /home/infer/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
[2024-01-20 10:15:30,593] [INFO] [engine_v2.py:82:__init__] Building model...
Using /home/infer/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
[2024-01-20 10:15:30,848] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1648350
[2024-01-20 10:15:30,848] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1648351
[2024-01-20 10:15:31,004] [ERROR] [launch.py:321:sigkill_handler] ['/home/infer/miniconda3/bin/python', '-m', 'mii.launch.multi_gpu_server', '--deployment-name', 'llama-deployment', '--load-balancer-port', '50050', '--restful-gateway-port', '28080', '--restful-gateway-host', 'localhost', '--restful-gateway-procs', '32', '--server-port', '50051', '--zmq-port', '25555', '--model-config', 'eyJtb2RlbF9uYW1lX29yX3BhdGgiOiAiL21udC9MbGFtYS0yLTdiLWNoYXQtaGYiLCAidG9rZW5pemVyIjogIi9tbnQvTGxhbWEtMi03Yi1jaGF0LWhmIiwgInRhc2siOiAidGV4dC1nZW5lcmF0aW9uIiwgInRlbnNvcl9wYXJhbGxlbCI6IDIsICJpbmZlcmVuY2VfZW5naW5lX2NvbmZpZyI6IHsidGVuc29yX3BhcmFsbGVsIjogeyJ0cF9zaXplIjogMn0sICJzdGF0ZV9tYW5hZ2VyIjogeyJtYXhfdHJhY2tlZF9zZXF1ZW5jZXMiOiAyMDQ4LCAibWF4X3JhZ2dlZF9iYXRjaF9zaXplIjogNzY4LCAibWF4X3JhZ2dlZF9zZXF1ZW5jZV9jb3VudCI6IDUxMiwgIm1heF9jb250ZXh0IjogODE5MiwgIm1lbW9yeV9jb25maWciOiB7Im1vZGUiOiAicmVzZXJ2ZSIsICJzaXplIjogMTAwMDAwMDAwMH0sICJvZmZsb2FkIjogZmFsc2V9fSwgInRvcmNoX2Rpc3RfcG9ydCI6IDI5NzAwLCAiem1xX3BvcnRfbnVtYmVyIjogMjU1NTUsICJyZXBsaWNhX251bSI6IDEsICJyZXBsaWNhX2NvbmZpZ3MiOiBbeyJob3N0bmFtZSI6ICJsb2NhbGhvc3QiLCAidGVuc29yX3BhcmFsbGVsX3BvcnRzIjogWzUwMDUxLCA1MDA1Ml0sICJ0b3JjaF9kaXN0X3BvcnQiOiAyOTcwMCwgImdwdV9pbmRpY2VzIjogWzAsIDFdLCAiem1xX3BvcnQiOiAyNTU1NX1dLCAiZGV2aWNlX21hcCI6ICJhdXRvIiwgIm1heF9sZW5ndGgiOiBudWxsLCAiYWxsX3Jhbmtfb3V0cHV0IjogZmFsc2UsICJzeW5jX2RlYnVnIjogZmFsc2UsICJwcm9maWxlX21vZGVsX3RpbWUiOiBmYWxzZX0='] exits with return code = 1
[2024-01-20 10:15:31,968] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:31,968] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:32,151] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
[2024-01-20 10:15:32,151] [INFO] [server.py:65:_wait_until_server_is_live] waiting for server to start...
Traceback (most recent call last):
  File "/home/infer/deepspeed-fastgen/quest.py", line 26, in <module>
    client = mii.serve("/mnt/Llama-2-7b-chat-hf", deployment_name="llama-deployment", replica_num=1,   #replica_num=2 tensor_parallel=2
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/api.py", line 124, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/llama-deployment/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/backend/server.py", line 47, in __init__
    self._wait_until_server_is_live(processes,
  File "/home/infer/miniconda3/lib/python3.11/site-packages/mii/backend/server.py", line 62, in _wait_until_server_is_live
    raise RuntimeError(
RuntimeError: server crashed for some reason, unable to proceed
[2024-01-20 10:15:33,306] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1647573
[2024-01-20 10:15:33,306] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1647574
[2024-01-20 10:15:33,342] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1648352
[2024-01-20 10:15:33,404] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1648353
[2024-01-20 10:15:33,463] [INFO] [launch.py:324:sigkill_handler] Main process received SIGTERM, exiting
[2024-01-20 10:15:33,917] [ERROR] [launch.py:321:sigkill_handler] ['/home/infer/miniconda3/bin/python', '-u', 'quest.py', '--local_rank=1'] exits with return code = 1

At first I thought some process was simply occupying the port, so I changed it to 29700. But as you can see, that did not solve the problem. What should I do? The code follows the example (but uses llama-7b):

import mii
client = mii.serve("/mnt/Llama-2-7b-chat-hf", deployment_name="llama-deployment", tensor_parallel=2)

bjg7j2ky · Answer 1

If you start the server with mii.serve, you do not need the deepspeed launcher to get tensor parallelism. mii.serve invokes the DeepSpeed launcher itself, so when you run the script with deepspeed --num_gpus 2 you are effectively trying to start two inference servers (hence the "address already in use" error).
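For reference, a minimal sketch of the corrected launch: start the script below with plain python3 (the file name serve_llama.py is just an example), and let mii.serve spawn one inference process per GPU. The prompt text and max_new_tokens value are placeholders; client.generate follows the usual MII client usage.

# run with: python3 serve_llama.py   (do NOT wrap this in `deepspeed --num_gpus 2`)
import mii

# mii.serve invokes the DeepSpeed launcher internally and starts one
# inference process per GPU for tensor_parallel=2
client = mii.serve("/mnt/Llama-2-7b-chat-hf",
                   deployment_name="llama-deployment",
                   tensor_parallel=2)

# query the running deployment (prompt and token count are illustrative)
response = client.generate(["Hello, my name is"], max_new_tokens=128)
print(response)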


e1xvtsh3 · Answer 2

This code has the same problem:
from mii import pipeline
pipe = pipeline("mistralai/Mistral-7B-Instruct-v0.1")
output = pipe(["Hello, my name is", "DeepSpeed is"], max_new_tokens=128)
print(output)
Error message:
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use)
It only uses the pipeline; there is no additional call to mii.serve.
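In either case, a first step is to establish whether the rendezvous port is genuinely held by another process (for example a stale worker left over from a previous crashed run) or whether the conflict comes from processes started by the run itself, as described in the first answer. Below is a minimal sketch, using only the standard library and assuming 29500 is the default port shown in the error above, that probes the port before launching:

import socket

def port_is_free(port, host="0.0.0.0"):
    # Returns True if a TCP socket can bind to the port right now.
    # SO_REUSEADDR lets the bind succeed even if old sockets are in TIME_WAIT,
    # so only an active listener makes this return False.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            return True
        except OSError:        # errno 98: address already in use
            return False

print("port 29500 free:", port_is_free(29500))

If the check reports the port as free yet the bind error still appears, the collision is between processes started by the run itself (the double-launch case above); if the port is busy, stopping the stale process or moving the deployment to an unused port should clear the error.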
