text-generation-inference mistralai/Mixtral-8x22B-Instruct-v0.1: Successful warmup, crashes on inference

tvokkenx · asked 2 months ago

System info

TGI version: v2.0.4
Model: mistralai/Mixtral-8x22B-Instruct-v0.1
Hardware: 4x Nvidia H100 70GB HBM3
Deployment specificities: OpenShift

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run TGI with the following settings:

  • MAX_BATCH_PREFILL_TOKENS=35000
  • MAX_INPUT_LENGTH=35000
  • MAX_TOTAL_TOKENS=36864
  • NUM_SHARD=4

Warmup succeeds:

2024-06-17T14:22:01.460404Z  INFO text_generation_router: router/src/main.rs:354: Setting max batch total tokens to 42768

The model successfully handles relatively small requests. When a larger request is sent (close to, but still below, the maximum input length), the prefill crashes (a minimal client-side sketch of such a request follows the traceback below):

2024-06-17T14:22:46.490042Z ERROR text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 258, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
    return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
    return await response
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
    raise error
  File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 144, in Prefill
    generations, next_batch, timings = self.model.generate_token(batch)
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 960, in generate_token
    raise e
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 957, in generate_token
    out, speculative_logits = self.forward(batch)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 523, in forward
    logits, speculative_logits = self.model.forward(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 647, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 589, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 529, in forward
    moe_output = self.moe(normed_attn_res_output)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mixtral_modeling.py", line 367, in forward
    out = fused_moe(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 430, in fused_moe
    intermediate_cache3 = torch.empty((M, topk_ids.shape[1], w2.shape[1]),
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 758.00 MiB. GPU
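
For reference, a minimal client-side sketch of the kind of request that triggers this. The host, port, and prompt construction are assumptions on my side; any input that tokenizes close to, but below, MAX_INPUT_LENGTH behaves the same way:

import requests

# Hypothetical long input; "word " * 33_000 tokenizes to very roughly 33k tokens,
# i.e. close to but below MAX_INPUT_LENGTH=35000. Adjust host/port for your deployment.
prompt = "word " * 33_000

resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
    timeout=600,
)
print(resp.status_code, resp.text[:200])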

I tried setting MAX_BATCH_SIZE=1 to make sure the number of tokens in a batch stays below the max batch total tokens computed by TGI. The error still occurs.

Expected behavior

Since the max batch total tokens is computed correctly, no OOM should occur.
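
As a rough sanity check (relying on assumed values: Mixtral-8x22B's hidden size of 6144, top-2 expert routing, and bf16 activations, none of which appear in the logs above), the failing intermediate_cache3 allocation in fused_moe scales linearly with the number of prompt tokens, so the 35,000-token warmup should already have made a larger allocation than the ~758 MiB that fails here:

# Back-of-the-envelope check; the constants below are assumptions, not values
# taken from this thread.
HIDDEN_SIZE = 6144      # assumed Mixtral-8x22B hidden size (w2.shape[1])
TOP_K = 2               # assumed experts per token (topk_ids.shape[1])
BYTES_PER_ELEM = 2      # bfloat16

def cache3_bytes(num_prompt_tokens: int) -> int:
    # Mirrors the failing allocation: torch.empty((M, topk_ids.shape[1], w2.shape[1]))
    return num_prompt_tokens * TOP_K * HIDDEN_SIZE * BYTES_PER_ELEM

print(f"warmup (35000 tokens): {cache3_bytes(35_000) / 2**20:.0f} MiB")  # ~820 MiB
failing_bytes = 758 * 2**20  # from the traceback
print(f"failing request: ~{failing_bytes // (TOP_K * HIDDEN_SIZE * BYTES_PER_ELEM)} tokens")  # ~32.3k

If those numbers are in the right ballpark, the OOM happens on an allocation smaller than one the warmup already succeeded at, which would point at allocator state rather than the token budget itself.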


lrl1mhuk1#

Thanks for the report @alexanderdicke-webcom!
After a quick discussion with Olivier, it seems to be related to a problem with the torch allocator. You're right that after warmup it should be able to handle the sequences passed to the model, so we need to take a closer look.
cc @OlivierDehaene
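
Not something discussed in this thread, but if the torch allocator is the suspect, one quick check is to compare reserved vs. allocated memory on the failing shard right before the torch.empty call; a sketch:

import torch

# If "reserved" sits far above "allocated" at the moment of the failure, the OOM is
# more likely allocator fragmentation than a genuine lack of free memory.
stats = torch.cuda.memory_stats()
allocated_mib = stats["allocated_bytes.all.current"] / 2**20
reserved_mib = stats["reserved_bytes.all.current"] / 2**20
print(f"allocated: {allocated_mib:.0f} MiB, reserved: {reserved_mib:.0f} MiB")
print(torch.cuda.memory_summary(abbreviated=True))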


o2g1uqev2#

I'm hitting a similar OOM (a 34B llama on 2x A6000 with sha-74b0231), but it works fine with sha-eade737.
