vllm ConnectionResetError: [Errno 104] Connection reset by peer

hvvq6cgz · posted 4 months ago in Other

I occasionally run into this error:

+ python3 -m vllm.entrypoints.openai.api_server --host xxxxx --port 8003 --served-model-name qwen1.5-72b-chat-int4 --model /home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4 --trust-remote-code --tokenizer-mode auto --max-num-batched-tokens 32768 --tensor-parallel-size 4
INFO 02-29 14:28:09 api_server.py:228] args: Namespace(host='xxxxx', port=8003, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name='qwen1.5-72b-chat-int4', lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=32768, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 02-29 14:28:09 config.py:186] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 02-29 14:28:09 config.py:421] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-02-29 14:28:12,795 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-29 14:28:15 llm_engine.py:87] Initializing an LLM engine with config: model='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer='/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=4, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 236, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 625, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 321, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 366, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 126, in __init__
    self._init_workers_ray(placement_group)
  File "/workspace/vllm/engine/llm_engine.py", line 303, in _init_workers_ray
    self._run_workers("init_model",
  File "/workspace/vllm/engine/llm_engine.py", line 1036, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/vllm/worker/worker.py", line 94, in init_model
    init_distributed_environment(self.parallel_config, self.rank,
  File "/workspace/vllm/worker/worker.py", line 275, in init_distributed_environment
    cupy_utils.init_process_group(
  File "/workspace/vllm/model_executor/parallel_utils/cupy_utils.py", line 90, in init_process_group
    _NCCL_BACKEND = NCCLBackendWithBFloat16(world_size, rank, host, port)
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_nccl_comm.py", line 70, in __init__
    self._init_with_tcp_store(n_devices, rank, host, port)
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_nccl_comm.py", line 94, in _init_with_tcp_store
    self._store_proxy['nccl_id'] = shifted_nccl_id
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_store.py", line 148, in __setitem__
    self._send_recv(_store_actions.Set(key, value))
  File "/usr/local/lib/python3.10/dist-packages/cupyx/distributed/_store.py", line 130, in _send_recv
    result_bytes = s.recv(sizeof(
ConnectionResetError: [Errno 104] Connection reset by peer

mkh04yzy #1

Hi,
Have you solved this problem? I'm running into the same issue.

wqnecbli #2

Hi,
Have you solved this problem? I'm running into the same issue.

It hasn't been solved yet; I still hit it occasionally.

bzzcjhmw #3

I think there is an issue with the cupy backend when tensor parallelism is used. If you use enforce_eager=True, the problem may go away (although it will hurt performance). As for the error itself, I think https://github.com/cupy/cupy is probably a better place to report it.
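
For reference, the quickest way to try this workaround is to relaunch the API server with eager mode forced on; assuming the same launch command from the top of this issue, that just means appending the --enforce-eager flag (the CLI equivalent of enforce_eager=True):

python3 -m vllm.entrypoints.openai.api_server --host xxxxx --port 8003 --served-model-name qwen1.5-72b-chat-int4 --model /home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4 --trust-remote-code --tokenizer-mode auto --max-num-batched-tokens 32768 --tensor-parallel-size 4 --enforce-eager

With eager mode on, cuda graph capture is skipped, so initialization should no longer go through the cupy NCCL setup that fails in the traceback above, at the cost of some decoding throughput.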

pb3s4cty #4

I think there is an issue with the cupy backend when tensor parallelism is used. If you use enforce_eager=True, the problem may go away (although it will hurt performance). As for the error itself, I think https://github.com/cupy/cupy is probably a better place to report it.

Hi,

Thanks for the suggestion. I will report this on the cupy side as well. However, since vLLM must have been run successfully in multi-node environments before release, while my deployments fail every time, I think something in my environment must be blocking the deployment.

bq8i3lrv #5

I think the cupy backend was only recently introduced for cuda graph (which is forcibly disabled by enforce_eager=True). My guess is that this backend doesn't work well in some environments, but it's hard to troubleshoot without reproducing the problem. If you can share the details of your instance, I can try to reproduce the issue.
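
As a smaller reproduction, the same engine initialization can be exercised through vLLM's offline LLM API instead of the API server. This is only a sketch, assuming the model path and settings from the failing command above; enforce_eager is left at its default (False) so the cuda-graph/cupy path is exercised:

from vllm import LLM, SamplingParams

# Same model, quantized weights, and tensor parallelism as the failing server
# command. Leaving enforce_eager at the default (False) keeps the
# cuda-graph/cupy initialization path that raises ConnectionResetError above;
# passing enforce_eager=True is the suggested workaround.
llm = LLM(
    model="/home/vllm/model/Qwen1.5-72B-Chat-GPTQ-Int4",
    trust_remote_code=True,
    tensor_parallel_size=4,
    max_num_batched_tokens=32768,
)

# A trivial request just to confirm that the engine initializes and generates.
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)

If this script also fails intermittently, the problem is in engine/worker initialization rather than in the OpenAI API server layer, which should narrow down the environment-specific cause.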
