How to do batch inference with DeepSpeed-MII

woobm2wo · posted 3 months ago in Other

Hi,
I'm new to MII and just starting to explore it.
When I try to do batch inference, I run into the following problem:

start_text = ["blablabla..."]
batch_size = 10
result = generator.query({"query": start_text * batch_size})

It looks like the prompts are processed sequentially, since the time grows with the batch size. So I did this instead:

start_text = ["blablabla..."]
batch_size = 10
result = generator.query({"query": start_text * batch_size}, batch_size=batch_size)

and got this error:

Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with `pipe.tokenizer.pad_token_id = model.config.eos_token_id`.

I think I am close to the right answer, because I know this error comes from the Hugging Face pipeline combined with deepspeed.init_inference(). I know how to set the pad token in that setting, but I don't know how to do it with MII, since everything is wrapped up.
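
For reference, this is roughly what that fix looks like with a plain Hugging Face pipeline plus deepspeed.init_inference() (a minimal sketch; gpt2 is just a stand-in for the model I actually use):

import torch
import deepspeed
from transformers import pipeline

# Plain Hugging Face pipeline (illustrative model choice)
pipe = pipeline("text-generation", model="gpt2", device=0)

# The fix suggested by the error message: give the tokenizer a pad token
# so a batch of prompts can be padded to the same length
pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id

# Wrap the underlying model with DeepSpeed-inference
pipe.model = deepspeed.init_inference(pipe.model,
                                      dtype=torch.half,
                                      replace_with_kernel_inject=True)

result = pipe(["blablabla..."] * 10, batch_size=10,
              do_sample=False, max_new_tokens=32)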
Thanks!


zengzsys1#

Hi Emerald01, thanks for reporting this issue. Could you provide some more details, such as which model and task you are using?


pkbketx92#

Hi @mrwyattii, thanks for your help.

We are trying to figure out how to use DeepSpeed MII to host models at lower cost and higher efficiency. We are testing CodeGen (it is not directly supported by MII, but it runs very well when the model check is skipped, as discussed in another issue), and we will also test other, larger models.

Without batching it runs very well. But when I pass a batched input, like the list of prompts in this example, the output latency grows linearly with the batch size, so I wondered whether there is a way to do batched inference. I then passed batch_size=, which looked right, but got the error above. With plain DeepSpeed + pipeline I can easily fix this by setting pipe.tokenizer.pad_token_id = pipe.model.config.eos_token_id, but since MII wraps everything, I don't know how I should interact with your API to make the tokenizer happy with batching.

Please let me know if you have any suggestions or if I am missing something. Thanks a lot!

import mii
import time
import torch

batch_size = 10
new_tokens = 128

mii_configs = {"tensor_parallel": 1, "dtype": "fp16", "skip_model_check": True}

mii.deploy(task="text-generation",
           model="Salesforce/codegen-16B-multi",
           deployment_name="codegen16Bmulti_deployment",
           mii_config=mii_configs,
           enable_deepspeed=True,
           enable_zero=False,
           )

generator = mii.mii_query_handle("codegen16Bmulti_deployment")

start_text = ["Calculate the mean of ask, bid and volume respectively for Bitcoin over the past 10 days"]

# benchmark
t0 = time.time()
result = generator.query({"query": start_text * batch_size},
                         # batch_size=batch_size,
                         do_sample=False,
                         max_new_tokens=new_tokens,
                         pad_token_id=50256)
torch.cuda.synchronize()
t1 = time.time()

3zwtqj6y3#

@Emerald01 I was able to reproduce this on my system. We currently don't have a way in the MII API to make the change needed to fix this tokenizer padding issue. However, I think it would be a valuable addition in the future; I will need to discuss with the team how to enable this kind of capability.

In the meantime, you can work around the problem by installing a patched version:

  1. git clone https://github.com/microsoft/DeepSpeed-MII.git
  2. cd DeepSpeed-MII
  3. touch tokenizer_padding.patch
  4. Open tokenizer_padding.patch with your preferred text editor and paste in the following:
diff --git a/mii/models/load_models.py b/mii/models/load_models.py
index 431fc34..5beb1a1 100644
--- a/mii/models/load_models.py
+++ b/mii/models/load_models.py
@@ -43,6 +43,7 @@ def load_models(task_name,
             assert mii_config.dtype == torch.half or mii_config.dtype == torch.int8, "Bloom models only support fp16/int8"
             assert mii_config.enable_cuda_graph == False, "Bloom models do no support Cuda Graphs"
         inference_pipeline = hf_provider(model_path, model_name, task_name, mii_config)
+        inference_pipeline.tokenizer.pad_token_id = inference_pipeline.model.config.eos_token_id
     elif provider == mii.constants.ModelProvider.ELEUTHER_AI:
         from mii.models.providers.eleutherai import eleutherai_provider
         assert mii_config.dtype == torch.half, "gpt-neox only support fp16"
  5. git apply tokenizer_padding.patch
  6. pip install .

Let me know if this works, and look forward to future updates that allow more customization of the pipeline object!
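
After reinstalling, the batched query from your benchmark should go through with batch_size enabled, e.g. (a sketch reusing the deployment name and parameters from your script):

import mii

batch_size = 10
new_tokens = 128
start_text = ["blablabla..."]

# With the patched install the tokenizer has a pad token, so batch_size no
# longer triggers the "cannot do batching" error
generator = mii.mii_query_handle("codegen16Bmulti_deployment")
result = generator.query({"query": start_text * batch_size},
                         batch_size=batch_size,
                         do_sample=False,
                         max_new_tokens=new_tokens)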

kx1ctssn4#

@mrwyattii
This works! Thank you.
However, I noticed that each GPU actually holds a full copy of the model, instead of the model being split across devices as I had imagined. I guess this is a separate issue, but I wanted to raise it here in case there is a quick fix; I can open a new issue if you think that is more appropriate.
Here is part of the log, which confirms I launched with 2 A100 GPUs:

[2023-01-26 17:51:19,823] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.11.4
[2023-01-26 17:51:19,823] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-01-26 17:51:19,823] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-01-26 17:51:19,823] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-01-26 17:51:19,823] [INFO] [launch.py:162:main] dist_world_size=2
[2023-01-26 17:51:19,823] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
...
[2023-01-26 18:00:03,542] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
> --------- MII Settings: ds_optimize=True, replace_with_kernel_inject=True, enable_cuda_graph=False 
[2023-01-26 18:00:06,951] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-01-26 18:00:06,952] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-01-26 18:00:06,955] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-01-26 18:00:08,547] [INFO] [server_client.py:117:_wait_until_server_is_live] waiting for server to start...
> --------- MII Settings: ds_optimize=True, replace_with_kernel_inject=True, enable_cuda_graph=False 
[2023-01-26 18:00:09,732] [INFO] [logging.py:68:log_dist] [Rank -1] DeepSpeed info: version=0.8.0, git-hash=unknown, git-branch=unknown
[2023-01-26 18:00:09,733] [INFO] [logging.py:68:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
About to start server
...

It runs well now, until...
Finally, when I run with batch_size = 32, it raises an OOM (out of memory) error. I see it using about 36 GB, which is the size of the whole model. Each A100 has 40 GB of memory, so the OOM makes sense. I had expected each device to hold only about half of the model, roughly 18 GB, so memory would not blow up. Is my understanding of DeepSpeed off somewhere?

debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:50050 {grpc_message:"Exception calling application: CUDA out of memory. Tried to allocate 66.00 MiB (GPU 0; 39.59 GiB total capacity; 36.40 GiB already allocated; 68.19 MiB free; 36.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF", grpc_status:2, created_time:"2023-01-26T18:00:55.029350976+00:00"}"

bvuwiixz5#

@Emerald01

The reason you are not seeing any memory savings is that DeepSpeed-Inference does not currently support automatic kernel injection for CodeGen models. Without the DeepSpeed kernels we do not shard the model across GPUs. If you try a model that supports auto-injection (e.g., gpt2), you will see the per-GPU memory go down.

For models without automatic kernel injection support, we do allow custom injection policies. You can see an example in our unit tests: https://github.com/microsoft/DeepSpeed/blob/ef6a958e70fe0106afbff9c2c885878cc659f4ac/tests/unit/inference/test_inference.py#L405

Additionally, you can use ZeRO to offload from GPU memory. See our example here: https://github.com/microsoft/DeepSpeed-MII/blob/main/examples/local/text-generation-zero-example.py
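
For illustration, a custom injection policy with plain deepspeed.init_inference() looks roughly like this (a sketch based on the GPT-J case; the block class and the attention/MLP output-projection names are assumptions that must match whatever model you actually deploy):

import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.gptj.modeling_gptj import GPTJBlock

# Illustrative model; for CodeGen you would use its transformer block class
# and the corresponding output-projection module names instead
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B",
                                             torch_dtype=torch.half)

# Custom injection policy: tell DeepSpeed which submodules hold the output
# projections of attention and MLP so the block can be sharded across GPUs
model = deepspeed.init_inference(
    model,
    mp_size=2,                  # tensor-parallel degree (2 GPUs assumed)
    dtype=torch.half,
    injection_policy={GPTJBlock: ("attn.out_proj", "mlp.fc_out")},
)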


a5g8bdjr6#

@mrwyattii I added this line:
inference_pipeline.tokenizer.pad_token_id = inference_pipeline.model.config.eos_token_id
after this line in DeepSpeed-MII/mii/models/load_models.py (line 47 at commit 737c247):
inference_pipeline = hf_provider(model_path, model_name, task_name, mii_config)
However, I get:

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Traceback (most recent call last):
  File "gen.py", line 9, in <module>
    result = generator.query({"query": ["DeepSpeed is", "Seattle is"]*batch_size}, do_sample=True, max_new_tokens=100 , batch_size=batch_size) # 
  File "/admin/home/DeepSpeed-MII/mii/client.py", line 125, in query
    response = self.asyncio_loop.run_until_complete(
  File "/admin/home/anaconda3/envs/test/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/admin/home/DeepSpeed-MII/mii/client.py", line 109, in _query_in_tensor_parallel
    await responses[0]
  File "/admin/home/DeepSpeed-MII/mii/client.py", line 72, in _request_async_response
    proto_response = await getattr(self.stub, conversions["method"])(proto_request)
  File "/admin/home/anaconda3/envs/test/lib/python3.8/site-packages/grpc/aio/_call.py", line 290, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: The specified pointer resides on host memory and is not registered with any CUDA device."
        debug_error_string = "UNKNOWN:Error received from peer ipv6:%5B::1%5D:50050 {created_time:"2023-04-15T05:32:19.698693653+00:00", grpc_status:2, grpc_message:"Exception calling application: The specified pointer resides on host memory and is not registered with any CUDA device."}"

Which points to

Traceback (most recent call last):
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/mii/grpc_related/modelresponse_server.py", line 91, in _run_inference
    response = self.inference_pipeline(*args, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/text_generation.py", line 209, in __call__
    return super().__call__(text_inputs, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1090, in __call__
    outputs = list(final_iterator)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/pt_utils.py", line 124, in __next__
    item = next(self.iterator)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/pt_utils.py", line 125, in __next__
    processed = self.infer(item, **self.params)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/base.py", line 1015, in forward
    model_outputs = self._forward(model_inputs, **forward_params)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/pipelines/text_generation.py", line 251, in _forward
    generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/deepspeed/inference/engine.py", line 603, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1075, in forward
    transformer_outputs = self.transformer(
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
    outputs = block(
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 147, in forward
    self.attention(input,
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 160, in forward
    context_layer, key_layer, value_layer = self.compute_attention(qkv_out=qkv_out,
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/ds_attention.py", line 99, in compute_attention
    attn_key_value = self.score_context_func(
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home//anaconda3/envs/test/lib/python3.8/site-packages/deepspeed/ops/transformer/inference/op_binding/softmax_context.py", line 31, in forward
    output = self.softmax_context_func(query_key_value, attn_mask, self.config.rotary_dim, self.config.rotate_half,
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

I found out that "alibi" was on the CPU, but the error occurs regardless of that.

Here is how the error occurs:

  1. Run inference with a certain batch size.
  2. Now increase the batch size and run inference again. This gives the error.
    You can choose any initial batch size, but once the server has started and a single inference has been done with that batch size, you cannot increase it; you can only keep it the same or decrease it for subsequent requests.
    I sometimes also get:
!!!! kernel execution error. (batch: 48, m: 3, n: 3, k: 64, error: 13) 
!!!! kernel execution error. (batch: 48, m: 64, n: 3, k: 3, error: 13)

Reproduce:

  1. Start server
import mii
mii_configs = {"tensor_parallel": 1, "dtype": "fp32"}

model = "gpt2" # model can be anything
# model = "EleutherAI/pythia-160m" 
mii.deploy(task="text-generation",
           model=model,
           deployment_name=model + "_deploy",
           mii_config=mii_configs)
  2. Generate with a certain batch size.
model="gpt2"
generator = mii.mii_query_handle(model + "_deploy")
batch_size = 4
result = generator.query({"query": ['DeepSpeed is', 'Seattle is', 'DeepSpeed is', 'DeepSpeed is']*2}, do_sample=True, max_new_tokens=100, batch_size=batch_size)
  3. Infer again with a larger batch size --> gives the error
model="gpt2"
generator = mii.mii_query_handle(model + "_deploy")
batch_size = 8
result = generator.query({"query": ['DeepSpeed is', 'Seattle is', 'DeepSpeed is', 'DeepSpeed is']*2}, do_sample=True, max_new_tokens=100, batch_size=batch_size)
  4. Terminate & start again
model="gpt2"
mii.terminate(model + "_deploy")

mii_configs = {"tensor_parallel": 1, "dtype": "fp32"}
model = "gpt2"
mii.deploy(task="text-generation",
           model=model,
           deployment_name=model + "_deploy",
           mii_config=mii_configs)
  5. Infer with batch size 8 -> runs without error
model="gpt2"
generator = mii.mii_query_handle(model + "_deploy")
batch_size = 8
result = generator.query({"query": ['DeepSpeed is', 'Seattle is', 'DeepSpeed is', 'DeepSpeed is']*2}, do_sample=True, max_new_tokens=100, batch_size=batch_size)

tjjdgumg7#

Hey @mrwyattii, could you look into this issue, or assign someone to it?
