vLLM keeps hanging when using djl-deepspeed

xam8gpfp · asked 2 months ago · in: Other

I'm trying to deploy the Mistral 7B Instruct v0.2 model with vLLM to an asynchronous SageMaker endpoint on an A10 (g5.2xlarge) instance, but I keep seeing

[INFO ] PyProcess - W-309-model-stdout: INFO 02-18 18:45:24 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%

The logs show the deployment succeeded:

2024-02-15T17:06:15.629-05:00 | [INFO ] ModelInfo - Available GPU memory: 22481 MB, required: 0 MB, reserved: 500 MB
2024-02-15T17:06:15.879-05:00 | [INFO ] ModelInfo - Loading model model on gpu(0)
2024-02-15T17:06:15.879-05:00 | [INFO ] WorkerPool - scaling up min workers by 1 (from 0 to 1) workers. Total range is min 1 to max 1
2024-02-15T17:06:15.879-05:00 | [INFO ] PyProcess - Start process: 19000 - retry: 0
2024-02-15T17:06:16.129-05:00 | [INFO ] Connection - Set CUDA_VISIBLE_DEVICES=0
2024-02-15T17:06:20.242-05:00 | [INFO ] PyProcess - W-332-model-stdout: 332 - djl_python_engine started with args: ['--sock-type', 'unix', '--sock-name', '/tmp/djl_sock.19000', '--model-dir', '/opt/ml/model', '--entry-point', 'djl_python.huggingface', '--device-id', '0']
2024-02-15T17:06:28.404-05:00 | [INFO ] PyProcess - W-332-model-stdout: Python engine started.
2024-02-15T17:06:28.655-05:00 | [INFO ] PyProcess - W-332-model-stdout: Using device map auto
2024-02-15T17:06:28.655-05:00 | [INFO ] PyProcess - W-332-model-stdout: WARNING 02-15 22:06:28 config.py:457] Casting torch.bfloat16 to torch.float16.
2024-02-15T17:06:33.242-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:06:28 llm_engine.py:70] Initializing an LLM engine with config: model='/tmp/.djl.ai/download/6ac9fecd08e21f9f01b9c13e5b30ed7ff37c5cbf', tokenizer='/tmp/.djl.ai/download/6ac9fecd08e21f9f01b9c13e5b30ed7ff37c5cbf', tokenizer_mode=auto, revision=b70aa86578567ba3301b21c8a27bea4e8f6d6d61, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=8847, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
2024-02-15T17:06:46.930-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:06:45 llm_engine.py:275] # GPU blocks: 2050, # CPU blocks: 2048
2024-02-15T17:06:46.930-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:06:46 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
2024-02-15T17:06:51.241-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:06:46 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode.
2024-02-15T17:06:51.962-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:06:51 model_runner.py:547] Graph capturing finished in 5 secs.
2024-02-15T17:06:51.962-05:00 | [INFO ] PyProcess - Model [model] initialized.
2024-02-15T17:06:51.962-05:00 | [INFO ] ModelServer - Initialize BOTH server with: EpollServerSocketChannel.
2024-02-15T17:06:56.241-05:00 | [INFO ] ModelServer - BOTH API bind to: http://0.0.0.0:8080
2024-02-15T17:40:28.241-05:00 | [INFO ] PyProcess - W-332-model-stdout: INFO 02-15 22:40:23 llm_engine.py:706] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%

The platform I'm using is djl-deepspeed==0.12.6 (the second inference Docker image from the top here), with the following properties file:

engine=Python
option.entryPoint=djl_python.huggingface
option.model_id=s3addresstothemodelfiles
option.dtype=fp16
option.task=text-generation
option.rolling_batch=vllm
option.tensor_parallel_degree=1
option.device_map=auto
option.revision=b70aa86578567ba3301b21c8a27bea4e8f6d6d61
option.trust_remote_code=true
option.max_model_len=8847
option.max_rolling_batch_size=100
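
For reference, a minimal deployment sketch along these lines (the IAM role ARN, container image URI, and S3 paths below are placeholders; the real DLC image URI comes from the AWS deep learning containers list):

import sagemaker
from sagemaker.model import Model
from sagemaker.async_inference import AsyncInferenceConfig

role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder role
image_uri = "<djl-inference deepspeed-0.12.6 DLC URI>"  # placeholder image

# model_data points to a tar.gz containing the serving.properties above;
# the model weights themselves are pulled from option.model_id at startup.
model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/mistral/model.tar.gz",  # placeholder path
    role=role,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    async_inference_config=AsyncInferenceConfig(
        output_path="s3://my-bucket/async-output/",  # placeholder path
    ),
)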

It runs the following file, https://github.com/deepjavalibrary/djl-serving/blob/master/engines/python/setup/djl_python/huggingface.py, but it appears to be stuck, since I never get any response from the container:

2024-02-18T18:45:24.695: [sagemaker logs] [d7f5b1f8-e361-4f09-9e93-b46878902182] The response from container primary did not specify the required Content-Length header
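
For reference, the async endpoint is invoked roughly like this (endpoint name and S3 paths are placeholders). With async inference the request payload must already be uploaded to S3, and the response is written to the configured output path rather than returned inline, which is where the Content-Length error above surfaces:

import boto3

smr = boto3.client("sagemaker-runtime")

# The request body at InputLocation is, I believe, the usual
# {"inputs": "...", "parameters": {...}} shape expected by the
# djl_python.huggingface handler.
response = smr.invoke_endpoint_async(
    EndpointName="mistral-7b-async",                    # placeholder name
    InputLocation="s3://my-bucket/input/payload.json",  # placeholder path
    ContentType="application/json",
)

# The call returns immediately; the model output (or an error file)
# appears later at the returned S3 location.
print(response["OutputLocation"])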
zxlwwiss · answer #1

Have you solved this yet? I used a similar setup: Mistral 7B, vLLM, and a DJL prebuilt container. Everything works fine when I configure it for real-time inference, but if I create and try to invoke an asynchronous endpoint, I get the same log messages.
