DeepSpeed-MII error: "Only able to place X replicas, but Y replicas were requested"

igetnqfo  posted 3 months ago in Other

I ran an example on an AWS ml.g5.12xlarge instance, which has 4 GPUs, and got this error (Only able to place 1 replicas, but 2 replicas were requested). When I used client.generate(inputs, max_new_tokens=128, replica_num=4), I got a similar error (Only able to place 1 replicas, but 4 replicas were requested).
I am running it with AWS DJL DeepSpeed, using the following serving.properties file:

engine=DeepSpeed
option.entrypoint=model.py

model.py is a custom file containing the code above plus the other simple scripts needed when running on the DJL server.

oalqel3c #1

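Hi @spring1915. The tensor_parallel and replica_num values should be passed to mii.serve. In #386 I updated MII so that it raises an error when unsupported extra kwargs are passed to the generate method. Please update your code to the following and try again: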

import mii
# Deployment settings (tensor_parallel, replica_num) go to mii.serve, not to generate
client = mii.serve("mistralai/Mistral-7B-Instruct-v0.2", tensor_parallel=2, replica_num=2)
response = client.generate(inputs, max_new_tokens=128)
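
For anyone trying to work out which combinations of these two settings can fit on a machine, here is a rough sketch of the arithmetic behind the error message. It is a simplified model and not MII's actual allocation code: it assumes each replica needs tensor_parallel GPUs on a single host and that the available GPUs are given as a host-to-slot-count mapping. The names placeable_replicas and check_placement are made up for illustration.

```python
from typing import Dict


def placeable_replicas(slots_per_host: Dict[str, int], tensor_parallel: int) -> int:
    """Count the replicas that fit if each one needs `tensor_parallel` GPUs on one host."""
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be >= 1")
    # A replica is assumed not to span hosts, so leftover GPUs per host are wasted.
    return sum(slots // tensor_parallel for slots in slots_per_host.values())


def check_placement(slots_per_host: Dict[str, int],
                    tensor_parallel: int,
                    replica_num: int) -> int:
    """Return the number of replicas placed, or raise in the style of the error above."""
    placed = min(placeable_replicas(slots_per_host, tensor_parallel), replica_num)
    if placed < replica_num:
        raise ValueError(
            f"Only able to place {placed} replicas, "
            f"but {replica_num} replicas were requested."
        )
    return placed


if __name__ == "__main__":
    # Hypothetical 4-GPU host: 2 replicas x 2 GPUs each fits.
    print(check_placement({"localhost": 4}, tensor_parallel=2, replica_num=2))
```

Under this model, a 4-GPU host fits two replicas at tensor_parallel=2 but only one at tensor_parallel=4, and a host with fewer visible GPUs than tensor_parallel fits none.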

jqjz2hbq #2

I am getting the following error:

python3 -m api_server
[2024-02-19 20:51:50,842] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
args.replica_num  1
[2024-02-19 20:51:51,516] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
[2024-02-19 20:51:51,516] [INFO] [server.py:38:__init__] Hostfile /job/hostfile not found, creating hostfile.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/api_server.py", line 192, in <module>
    mii.serve(args.model,
  File "/usr/local/lib/python3.10/dist-packages/mii/api.py", line 124, in serve
    import_score_file(mii_config.deployment_name, DeploymentType.LOCAL).init()
  File "/tmp/mii_cache/deepspeed-mii/score.py", line 33, in init
    mii.backend.MIIServer(mii_config)
  File "/usr/local/lib/python3.10/dist-packages/mii/backend/server.py", line 44, in __init__
    mii_config.generate_replica_configs()
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 302, in generate_replica_configs
    replica_pool = _allocate_devices(self.hostfile,
  File "/usr/local/lib/python3.10/dist-packages/mii/config.py", line 350, in _allocate_devices
    raise ValueError(
ValueError: Only able to place 0 replicas, but 1 replicas were requested.

Does DeepSpeed-MII work in a single-GPU environment (one A40)?

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:00:07.0 Off |                    0 |
|  0%   53C    P8              23W / 300W |      4MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Please help with this issue.
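
One thing that may be worth checking in a case like this is how many GPUs the serving process itself can see, since that can differ from what nvidia-smi shows on the host (for example inside a container, or when CUDA_VISIBLE_DEVICES is set). The sketch below uses only the standard library and the nvidia-smi -L listing. The helper names are hypothetical, and this is a diagnostic aid under those assumptions, not a confirmed explanation of the traceback above.

```python
import os
import shutil
import subprocess


def count_gpus_from_listing(listing: str) -> int:
    """Count the lines of `nvidia-smi -L` output that describe a physical GPU."""
    return sum(1 for line in listing.splitlines() if line.strip().startswith("GPU "))


def visible_gpu_report() -> dict:
    """Collect the GPU-related facts visible to the current process."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi_gpus": None,  # stays None if nvidia-smi is missing or fails
    }
    if shutil.which("nvidia-smi"):
        try:
            result = subprocess.run(
                ["nvidia-smi", "-L"],
                capture_output=True, text=True, check=False, timeout=10,
            )
        except (OSError, subprocess.TimeoutExpired):
            return report
        if result.returncode == 0:
            report["nvidia_smi_gpus"] = count_gpus_from_listing(result.stdout)
    return report


if __name__ == "__main__":
    print(visible_gpu_report())
```

Running this from the same environment that launches api_server, and comparing the count with the tensor_parallel and replica_num values passed to mii.serve, should show whether the request can fit on the visible devices.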
