DeepSpeed-MII waiting for the server to start...

w9apscun · asked 6 months ago

Hi, I'm deploying on a node with 4 GPUs and have set tensor_parallel to 2. The program just keeps waiting for the server to start.


The code is:

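(The exact snippet is not shown here; a minimal MII deployment with "tensor_parallel": 2 would look roughly like the sketch below, where the model and deployment names are placeholders rather than the poster's actual values.)

import mii

# Sketch only: placeholder model and deployment names.
mii_configs = {
    "tensor_parallel": 2,   # shard the model across 2 of the 4 GPUs
    "dtype": "fp16",
}

mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",         # placeholder model
           deployment_name="example_deployment",  # placeholder name
           mii_config=mii_configs)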

The hostfile is:

127.0.0.1 slots=2

2lpgd968 1#

Can you see any GPU memory usage (via nvidia-smi)? I'm wondering whether there is a problem loading the model. In any case, I think we can improve the feedback we give users so it's more descriptive about what the server is doing in the background.
Also, could you try it without the gRPC server? Set deployment_type=mii.DeploymentType.NON_PERSISTENT when calling mii.deploy() and launch with deepspeed --num_gpus 2 your_script.py.
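Roughly, the non-persistent variant would look like the sketch below (model and deployment names are placeholders); in this mode the model lives inside the launched script itself rather than behind a background gRPC server:

import mii

# Sketch: launch with `deepspeed --num_gpus 2 this_script.py`.
mii.deploy(task="text-generation",
           model="gpt2",                              # placeholder model
           deployment_name="example_non_persistent",  # placeholder name
           mii_config={"tensor_parallel": 2, "dtype": "fp16"},
           deployment_type=mii.DeploymentType.NON_PERSISTENT)

# A non-persistent deployment keeps the pipeline in this process,
# so it can be queried right after deploy() returns.
generator = mii.mii_query_handle("example_non_persistent")
print(generator.query({"query": ["DeepSpeed is", "Microsoft is"]}))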

disbfnqx 2#

I'm facing the same problem with both the persistent and the non-persistent deployment. Neither loads the model onto the GPU. I've tried DeepSpeed with both zero2 and zero3.

atmip9wb 3#

[2023-09-04 12:23:19,159] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['.local/lib/python3.10/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.10.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 28.73 GB

kjthegm6 4#

I'm facing the same issue with both the persistent and non-persistent deployments. The model is not getting loaded onto the GPU. I've tried DeepSpeed with zero2 and zero3.
@infosechoudini What behavior do you see when you use the non-persistent deployment type, or when you load the model with just DeepSpeed? Does a simple script like the one below work for you?

import torch
import deepspeed
import os
from transformers import pipeline

local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

task_name = "text-generation"
model_name = "gpt2"
input_strs = ["DeepSpeed is", "Microsoft is"]

def run():
    pipe = pipeline(task_name, model_name, torch_dtype=torch.float16, device=local_rank)

    pipe.model = deepspeed.init_inference(
        pipe.model,
        replace_with_kernel_inject=True,
        mp_size=world_size,
        dtype=torch.float16,
    )

    output = pipe(input_strs)
    print(output)

if __name__ == "__main__":
    run()

Run with deepspeed script.py
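(To actually shard across two GPUs, the launch would be deepspeed --num_gpus 2 script.py, which sets WORLD_SIZE=2 so that mp_size=world_size splits the model over both devices.)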

kyks70gy 5#

Hey,
DeepSpeed itself works fine; I trained a model with DeepSpeed just yesterday. I've been trying to figure out the problem but can't find a solution.
It just hangs waiting for the server to start and then crashes after the timeout.

lyfkaqu1 6#

I'd like to determine whether this is a bug in MII or something in your environment that is causing the hang. I see you are setting "tensor_parallel": 5. I've seen problems in the past with sharding models across an odd number of GPUs. Could you try running with 4 GPUs?
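In config terms, the suggestion is simply to use an even tensor-parallel degree, e.g.:

mii_configs = {
    "tensor_parallel": 4,   # even degree; most models' attention-head counts are not divisible by 5
    "dtype": "fp16",
}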

i86rm4rw 7#

Hi @mrwyattii, how can I keep the RESTful server alive? This is my script:

import mii

mii_configs = {
    "tensor_parallel": 2, 
    "dtype": "fp16",
    "enable_restful_api": True, 
    "restful_api_port": 35215,
    "skip_model_check": True
}
mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="MY_DEPLOYMENT",
           mii_config=mii_configs,
           deployment_type=mii.DeploymentType.NON_PERSISTENT
           )

It looks like after I run deepspeed --num_gpus 2 api.py, the process just exits. The model gets loaded onto the GPUs, but the server does not stay alive. Could you help me with this?
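(For reference, the mode that leaves a standing gRPC/REST server running after mii.deploy() returns is the persistent deployment, i.e. the default mii.DeploymentType.LOCAL, which is launched with plain python rather than the deepspeed launcher. A sketch reusing the same config might look like this:)

import mii

mii_configs = {
    "tensor_parallel": 2,
    "dtype": "fp16",
    "enable_restful_api": True,
    "restful_api_port": 35215,
    "skip_model_check": True,
}

# DeploymentType.LOCAL (the default) starts background server processes that
# keep running after this script exits; NON_PERSISTENT only keeps the model
# inside the current process, which is why it goes away when the launcher exits.
mii.deploy(task="text-generation",
           model="/path/to/my/model",
           deployment_name="MY_DEPLOYMENT",
           mii_config=mii_configs)

# The standing deployment can then be queried from any other process:
generator = mii.mii_query_handle("MY_DEPLOYMENT")
print(generator.query({"query": ["DeepSpeed is"]}))

# ...and shut down later with mii.terminate("MY_DEPLOYMENT").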

fwzugrvs 8#

Hi @mrwyattii, what could be the reason that server.py keeps waiting for the server to start?
When I run the model with the test.py you provided it seems to work, so I don't think the model itself is the cause. But when I run server.py it just keeps waiting for the server to start. nvidia-smi shows about 448MB of memory in use on each GPU while I'm trying to load a 7B model, so I don't think the model is being loaded properly. Why would there be a difference between the persistent and non-persistent deployments?
