Current environment
In my setup I'm running version 0.5 with the vllm_openai target as part of the Dockerfile, with the following docker-compose parameters:
environment:
  - NCCL_SOCKET_IFNAME=eth0
restart: unless-stopped
ulimits:
  memlock: -1
  stack: -1
ports:
  - "3010:8000"
ipc: host
command:
  - "--model"
  - "/models/Mixtral-8x22B-Instruct-v0.1-FP8"
  - "--gpu-memory-utilization"
  - "0.95"
  - "--tensor-parallel-size"
  - "8"
  - "--host"
  - "0.0.0.0"
  - "--max-num-seqs"
  - "70"
  - "--quantization"
  - "fp8"
  - "--download-dir"
  - "/models"
🐛 Describe the bug
When I load Mixtral-8x22B-Instruct-v0.1-FP8 onto 8× L40S GPUs, the following error occurs:
Attaching to vllm1-1
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | WARNING 06-13 00:51:55 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | INFO 06-13 00:51:55 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:57 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | *** SIGABRT received at time=1718239918 on cpu 73 ***
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | PC: @ 0x14c3bc9359fc (unknown) pthread_kill
vllm1-1 | @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,039 E 1 1] logging.cc:343: *** SIGABRT received at time=1718239918 on cpu 73 ***
vllm1-1 | [2024-06-13 00:51:58,039 E 1 1] logging.cc:343: PC: @ 0x14c3bc9359fc (unknown) pthread_kill
vllm1-1 | [2024-06-13 00:51:58,040 E 1 1] logging.cc:343: @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | Fatal Python error: Aborted
vllm1-1 |
vllm1-1 | Stack (most recent call first):
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 173 in make_llir
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 199 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 193 in compile
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416 in run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245 in invoke_fused_moe_kernel
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 427 in fused_experts
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515 in fused_moe
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 271 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 424 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 468 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 535 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749 in execute_model
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844 in profile_run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 119 in _run_workers
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313 in _initialize_kv_caches
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473 in _init_engine
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398 in from_engine_args
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196 in <module>
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 86 in _run_code
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 196 in _run_module_as_main
vllm1-1 |
vllm1-1 | Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, cuda_utils (total: 103)
vllm1-1 | [failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x14c3bc8c7898 while already in AbslFailureSignalHandler()
vllm1-1 | *** SIGSEGV received at time=1718239918 on cpu 73 ***
vllm1-1 | PC: @ 0x14c3bc8c7898 (unknown) abort
vllm1-1 | @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | @ 0x14c3bc89e1c0 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,044 E 1 1] logging.cc:343: *** SIGSEGV received at time=1718239918 on cpu 73 ***
vllm1-1 | [2024-06-13 00:51:58,044 E 1 1] logging.cc:343: PC: @ 0x14c3bc8c7898 (unknown) abort
vllm1-1 | [2024-06-13 00:51:58,046 E 1 1] logging.cc:343: @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,048 E 1 1] logging.cc:343: @ 0x14c3bc89e1c0 (unknown) (unknown)
vllm1-1 | Fatal Python error: Segmentation fault
vllm1-1 |
vllm1-1 | Stack (most recent call first):
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 173 in make_llir
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 199 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 193 in compile
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416 in run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245 in invoke_fused_moe_kernel
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 427 in fused_experts
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515 in fused_moe
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 271 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 424 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 468 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 535 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749 in execute_model
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844 in profile_run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 119 in _run_workers
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313 in _initialize_kv_caches
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473 in _init_engine
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398 in from_engine_args
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196 in <module>
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 86 in _run_code
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 196 in _run_module_as_main
vllm1-1 |
vllm1-1 | Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, cuda_utils (total: 103)
vllm1-1 | /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown
vllm1-1 | warnings.warn('resource_tracker: There appear to be %d '
vllm1-1 exited with code 0
Any help would be greatly appreciated!
5 answers
0kjbasz6 #1
"Conversion from/to f8e4m3nv is only supported on compute capability >= 90."
The L40S has compute capability == 89 (i.e. 8.9, Ada Lovelace); you'll need H100s to run inference on the fp8 model.
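To confirm what your GPUs report, you can query the compute capability yourself; a minimal check, assuming PyTorch is available in the container (the nvidia-smi compute_cap query field additionally needs a reasonably recent driver):

  python -c "import torch; print(torch.cuda.get_device_capability())"   # an L40S prints (8, 9)
  nvidia-smi --query-gpu=name,compute_cap --format=csv

Triton encodes the capability as major*10 + minor, which is where the "89" vs ">= 90" in the error message comes from.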
uqxowvwt #2
Thanks!
mw3dktmi #3
You can try uninstalling your Triton and installing the Triton nightly; the guide is here: https://github.com/triton-lang/triton?tab=readme-ov-file#quick-installation
The Triton 2.3 we currently require (because of PyTorch) cannot support FP8 on Ada Lovelace, but future releases will.
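As a rough sketch of that swap (the nightly index URL below is the one the linked quick-installation guide documented at the time; treat it as an assumption and check the guide for the current one):

  pip uninstall -y triton
  pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

Run this inside the vLLM image after vLLM itself is installed, so the nightly replaces the pinned 2.3 rather than being downgraded again.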
zfciruhq #4
Thanks! I'm a bit new to Triton — is that the custom kernel layer on top of CUDA that vLLM uses? If so, I believe I'd just need to build the kernels with the nightly using this: https://llvm.org/docs/CMake.html. Being able to run FP8 on my own Ada hardware would be legendary.
llycmphe #5
Triton itself isn't a custom kernel; it's a library that JIT-compiles kernels at runtime. So you only need to upgrade the installed Python package — no building from source required. After installing vLLM, try uninstalling Triton and installing a newer or nightly release to see whether they've fixed this.
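If you try that, you can verify which Triton vLLM actually picks up inside the container with a quick check:

  python -c "import triton; print(triton.__version__)"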