I have two nodes, each with a single 16GB GPU. I want to run the llama-2-13b-hf model across these two nodes as one replica, sharding the model over both GPUs.
Contents of /job/hostfile:
deepspeed-mii-inference-worker-0 slots=1
deepspeed-mii-inference-worker-1 slots=1
Server code:
import mii

client = mii.serve(
    "/data/Llama-2-13b-hf/",
    deployment_name="llama2-deployment",
    enable_restful_api=True,
    restful_api_port=28080,
    tensor_parallel=2,
    replica_num=1,
)
Error log:
[2024-03-08 09:16:45,938] [INFO] [multinode_runner.py:80:get_cmd] Running on the following workers: deepspeed-mii-inference-worker-0,deepspeed-mii-inference-worker-1
[2024-03-08 09:16:45,938] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w deepspeed-mii-inference-worker-0,deepspeed-mii-inference-worker-1 export NCCL_VERSION=2.19.3-1; export PYTHONPATH=/data; cd /data; /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJkZWVwc3BlZWQtbWlpLWluZmVyZW5jZS13b3JrZXItMCI6IFswXSwgImRlZXBzcGVlZC1taWktaW5mZXJlbmNlLXdvcmtlci0xIjogWzBdfQ== --node_rank=%n --master_addr=10.11.1.207 --master_port=29500 deepspeed-mii-server.py
deepspeed-mii-inference-worker-0: Warning: Permanently added 'deepspeed-mii-inference-worker-0' (ED25519) to the list of known hosts.
deepspeed-mii-inference-worker-1: Warning: Permanently added 'deepspeed-mii-inference-worker-1' (ED25519) to the list of known hosts.
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:49,525] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:49,682] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,274] [INFO] [launch.py:138:main] 1 NCCL_VERSION=2.19.3-1
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,274] [INFO] [launch.py:145:main] WORLD INFO DICT: {'deepspeed-mii-inference-worker-0': [0], 'deepspeed-mii-inference-worker-1': [0]}
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,274] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=1
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,274] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'deepspeed-mii-inference-worker-0': [0], 'deepspeed-mii-inference-worker-1': [1]})
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,275] [INFO] [launch.py:163:main] dist_world_size=2
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,275] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:50,275] [INFO] [launch.py:253:main] process 1208 spawned with command: ['/usr/bin/python3', '-u', 'deepspeed-mii-server.py', '--local_rank=0']
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,547] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3-1
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:145:main] WORLD INFO DICT: {'deepspeed-mii-inference-worker-0': [0], 'deepspeed-mii-inference-worker-1': [0]}
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:151:main] nnodes=2, num_local_procs=1, node_rank=0
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'deepspeed-mii-inference-worker-0': [0], 'deepspeed-mii-inference-worker-1': [1]})
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:163:main] dist_world_size=2
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:50,548] [INFO] [launch.py:253:main] process 1257 spawned with command: ['/usr/bin/python3', '-u', 'deepspeed-mii-server.py', '--local_rank=0']
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:53,103] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:53,717] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:53,973] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:53,973] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
deepspeed-mii-inference-worker-1: Traceback (most recent call last):
deepspeed-mii-inference-worker-1: File "/data/deepspeed-mii-server.py", line 6, in <module>
deepspeed-mii-inference-worker-1: client = mii.serve(
deepspeed-mii-inference-worker-1: File "/usr/local/lib/python3.10/dist-packages/mii/api.py", line 124, in serve
deepspeed-mii-inference-worker-1: import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
deepspeed-mii-inference-worker-1: File "/tmp/mii_cache/llama2-deployment/score.py", line 33, in init
deepspeed-mii-inference-worker-1: mii.backend.MIIServer(mii_config)
deepspeed-mii-inference-worker-1: File "/usr/local/lib/python3.10/dist-packages/mii/backend/server.py", line 44, in __init__
deepspeed-mii-inference-worker-1: mii_config.generate_replica_configs()
deepspeed-mii-inference-worker-1: File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 302, in generate_replica_configs
deepspeed-mii-inference-worker-1: replica_pool = _allocate_devices(self.hostfile,
deepspeed-mii-inference-worker-1: File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 350, in _allocate_devices
deepspeed-mii-inference-worker-1: raise ValueError(
deepspeed-mii-inference-worker-1: ValueError: Only able to place 0 replicas, but 1 replicas were requested.
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:54,706] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:54,706] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
deepspeed-mii-inference-worker-0: Traceback (most recent call last):
deepspeed-mii-inference-worker-0: File "/data/deepspeed-mii-server.py", line 6, in <module>
deepspeed-mii-inference-worker-0: client = mii.serve(
deepspeed-mii-inference-worker-0: File "/usr/local/lib/python3.10/dist-packages/mii/api.py", line 124, in serve
deepspeed-mii-inference-worker-0: import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
deepspeed-mii-inference-worker-0: File "/tmp/mii_cache/llama2-deployment/score.py", line 33, in init
deepspeed-mii-inference-worker-0: mii.backend.MIIServer(mii_config)
deepspeed-mii-inference-worker-0: File "/usr/local/lib/python3.10/dist-packages/mii/backend/server.py", line 44, in __init__
deepspeed-mii-inference-worker-0: mii_config.generate_replica_configs()
deepspeed-mii-inference-worker-0: File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 302, in generate_replica_configs
deepspeed-mii-inference-worker-0: replica_pool = _allocate_devices(self.hostfile,
deepspeed-mii-inference-worker-0: File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 350, in _allocate_devices
deepspeed-mii-inference-worker-0: raise ValueError(
deepspeed-mii-inference-worker-0: ValueError: Only able to place 0 replicas, but 1 replicas were requested.
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:55,279] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1208
deepspeed-mii-inference-worker-1: [2024-03-08 09:16:55,280] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'deepspeed-mii-server.py', '--local_rank=0'] exits with return code = 1
pdsh@deepspeed-mii-inference-launcher: deepspeed-mii-inference-worker-1: ssh exited with exit code 1
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:56,554] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 1257
deepspeed-mii-inference-worker-0: [2024-03-08 09:16:56,554] [ERROR] [launch.py:322:sigkill_handler] ['/usr/bin/python3', '-u', 'deepspeed-mii-server.py', '--local_rank=0'] exits with return code = 1
2 answers

Answer 1:
Hi @gujingit, we currently do not support splitting a model across nodes, only across the GPUs within a single node, with replicas placed on different nodes.
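This also explains the ValueError above: MII places each replica's tensor-parallel group entirely within one host, so two hosts with one slot each can never hold a tensor_parallel=2 replica. A rough sketch of that placement rule (a hypothetical helper for illustration, not MII's actual _allocate_devices implementation):

```python
def placeable_replicas(slots_per_host, tensor_parallel):
    """Count how many replicas fit when each replica's tensor-parallel
    group must live entirely on a single host (no cross-node sharding)."""
    return sum(slots // tensor_parallel for slots in slots_per_host)

# The failing setup: two hosts with 1 GPU each, tensor_parallel=2.
# No single host has 2 GPUs, hence "Only able to place 0 replicas".
print(placeable_replicas([1, 1], tensor_parallel=2))  # 0

# A supported layout: one host with 2 GPUs holds the whole TP group.
print(placeable_replicas([2], tensor_parallel=2))  # 1
```

With this rule, the original two-node, one-GPU-per-node cluster cannot serve a tensor_parallel=2 deployment at all; the model would have to fit on a single GPU, or the nodes would need more GPUs each.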
Answer 2:
@mrwyattii Hi~ Does DeepSpeed-MII support multiple replicas on a single node? For example, I have one node with 8 A100 GPUs, and I set tensor_parallel to 4 and replica_num to 2. I found that only 4 GPUs are working at any given time while the other 4 just sit idle. A bit strange!
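For reference, the single-node multi-replica layout described in this question would look roughly like the following sketch (assumptions: a hostfile advertising 8 slots on one host, and the same mii.serve keyword arguments used in the original server code; this is a config fragment, not a verified fix for the idle-GPU behavior):

```python
# /job/hostfile (one 8-GPU node), e.g.:
#   my-a100-node slots=8
import mii

# Two replicas, each sharded across 4 of the node's 8 GPUs,
# so both replicas fit on the single host.
client = mii.serve(
    "/data/Llama-2-13b-hf/",
    deployment_name="llama2-deployment",
    tensor_parallel=4,
    replica_num=2,
)
```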