Current environment
In my setup I'm running version 0.5 with the vllm_openai target as part of the Dockerfile, with the following docker-compose parameters:
environment:
  - NCCL_SOCKET_IFNAME=eth0
restart: unless-stopped
ulimits:
  memlock: -1
  stack: -1
ports:
  - "3010:8000"
ipc: host
command:
  - "--model"
  - "/models/Mixtral-8x22B-Instruct-v0.1-FP8"
  - "--gpu-memory-utilization"
  - "0.95"
  - "--tensor-parallel-size"
  - "8"
  - "--host"
  - "0.0.0.0"
  - "--max-num-seqs"
  - "70"
  - "--quantization"
  - "fp8"
  - "--download-dir"
  - "/models"
🐛 Describe the bug
When I load Mixtral-8x22B-Instruct-v0.1-FP8 onto 8× L40S GPUs, the following error occurs:
Attaching to vllm1-1
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:33 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
vllm1-1 | INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:34 utils.py:637] Found nccl from library libnccl.so.2
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:34 pynccl.py:63] vLLM is using nccl==2.20.5
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | Traceback (most recent call last):
vllm1-1 | File "/usr/lib/python3.10/multiprocessing/resource_tracker.py", line 209, in main
vllm1-1 | cache[rtype].remove(name)
vllm1-1 | KeyError: '/psm_38be8863'
vllm1-1 | WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:34 custom_all_reduce.py:165] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:34 fp8.py:48] Detected fp8 checkpoint. Please note that the format is experimental and subject to change.
vllm1-1 | WARNING 06-13 00:51:55 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | INFO 06-13 00:51:55 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14204) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14208) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14205) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14206) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14208) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14204) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14207) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14210) WARNING 06-13 00:51:56 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14205) INFO 06-13 00:51:56 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14209) WARNING 06-13 00:51:57 utils.py:465] Found input_scales that are not equal for fp8 MoE layer. Using the maximum across experts for each layer.
vllm1-1 | (VllmWorkerProcess pid=14206) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14207) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14210) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | (VllmWorkerProcess pid=14209) INFO 06-13 00:51:57 model_runner.py:160] Loading model weights took 16.4319 GB
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | *** SIGABRT received at time=1718239918 on cpu 73 ***
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | PC: @ 0x14c3bc9359fc (unknown) pthread_kill
vllm1-1 | @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,039 E 1 1] logging.cc:343: *** SIGABRT received at time=1718239918 on cpu 73 ***
vllm1-1 | [2024-06-13 00:51:58,039 E 1 1] logging.cc:343: PC: @ 0x14c3bc9359fc (unknown) pthread_kill
vllm1-1 | [2024-06-13 00:51:58,040 E 1 1] logging.cc:343: @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | Fatal Python error: Aborted
vllm1-1 |
vllm1-1 | Stack (most recent call first):
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 173 in make_llir
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 199 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 193 in compile
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416 in run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245 in invoke_fused_moe_kernel
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 427 in fused_experts
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515 in fused_moe
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 271 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 424 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 468 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 535 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749 in execute_model
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844 in profile_run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 119 in _run_workers
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313 in _initialize_kv_caches
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473 in _init_engine
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398 in from_engine_args
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196 in <module>
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 86 in _run_code
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 196 in _run_module_as_main
vllm1-1 |
vllm1-1 | Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, cuda_utils (total: 103)
vllm1-1 | [failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x14c3bc8c7898 while already in AbslFailureSignalHandler()
vllm1-1 | *** SIGSEGV received at time=1718239918 on cpu 73 ***
vllm1-1 | PC: @ 0x14c3bc8c7898 (unknown) abort
vllm1-1 | @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | Conversion from/to f8e4m3nv is only supported on compute capability >= 90
vllm1-1 |
vllm1-1 | UNREACHABLE executed at /project/lib/Conversion/TritonGPUToLLVM/ElementwiseOpToLLVM.cpp:823!
vllm1-1 | @ 0x14c3bc89e1c0 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,044 E 1 1] logging.cc:343: *** SIGSEGV received at time=1718239918 on cpu 73 ***
vllm1-1 | [2024-06-13 00:51:58,044 E 1 1] logging.cc:343: PC: @ 0x14c3bc8c7898 (unknown) abort
vllm1-1 | [2024-06-13 00:51:58,046 E 1 1] logging.cc:343: @ 0x14c3bc8e1520 (unknown) (unknown)
vllm1-1 | [2024-06-13 00:51:58,048 E 1 1] logging.cc:343: @ 0x14c3bc89e1c0 (unknown) (unknown)
vllm1-1 | Fatal Python error: Segmentation fault
vllm1-1 |
vllm1-1 | Stack (most recent call first):
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 173 in make_llir
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/backends/cuda.py", line 199 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 193 in compile
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 416 in run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 167 in <lambda>
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 245 in invoke_fused_moe_kernel
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 427 in fused_experts
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/fused_moe/fused_moe.py", line 515 in fused_moe
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 271 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 424 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 468 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/mixtral.py", line 535 in forward
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541 in _call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532 in _wrapped_call_impl
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 749 in execute_model
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 844 in profile_run
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 162 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115 in decorate_context
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 119 in _run_workers
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 38 in determine_num_available_blocks
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 313 in _initialize_kv_caches
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 236 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 473 in _init_engine
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 349 in __init__
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 398 in from_engine_args
vllm1-1 | File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 196 in <module>
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 86 in _run_code
vllm1-1 | File "/usr/lib/python3.10/runpy.py", line 196 in _run_module_as_main
vllm1-1 |
vllm1-1 | Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, charset_normalizer.md, simplejson._speedups, requests.packages.charset_normalizer.md, requests.packages.chardet.md, yaml._yaml, psutil._psutil_linux, psutil._psutil_posix, msgpack._cmsgpack, google._upb._message, setproctitle, uvloop.loop, ray._raylet, sentencepiece._sentencepiece, ujson, regex._regex, scipy._lib._ccallback_c, numba.core.typeconv._typeconv, numba._helperlib, numba._dynfunc, numba._dispatcher, numba.core.runtime._nrt_python, numba.np.ufunc._internal, numba.experimental.jitclass._box, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, multidict._multidict, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, xxhash._xxhash, pyarrow._json, markupsafe._speedups, PIL._imaging, cuda_utils (total: 103)
vllm1-1 | /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown
vllm1-1 | warnings.warn('resource_tracker: There appear to be %d '
vllm1-1 exited with code 0
Any help would be greatly appreciated!
5 answers
0kjbasz6 #1
"Conversion from/to f8e4m3nv is only supported on compute capability >= 90."
The L40S has compute capability == 89 (i.e. 8.9, Ada Lovelace); you'll need H100s to run inference on the fp8 model.
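To confirm what your GPUs report, you can query the compute capability yourself; a minimal check, assuming PyTorch is available in the container (the nvidia-smi compute_cap query field additionally needs a reasonably recent driver):

  python -c "import torch; print(torch.cuda.get_device_capability())"   # an L40S prints (8, 9)
  nvidia-smi --query-gpu=name,compute_cap --format=csv

Triton encodes the capability as major*10 + minor, which is where the "89" vs ">= 90" in the error message comes from.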
uqxowvwt #2
Thanks!
mw3dktmi #3
You can try uninstalling your Triton and installing the Triton nightly; the guide is here: https://github.com/triton-lang/triton?tab=readme-ov-file#quick-installation
The Triton 2.3 we currently require (because of PyTorch) cannot support FP8 on Ada Lovelace, but future releases will.
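As a rough sketch of that swap (the nightly index URL below is the one the linked quick-installation guide documented at the time; treat it as an assumption and check the guide for the current one):

  pip uninstall -y triton
  pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly

Run this inside the vLLM image after vLLM itself is installed, so the nightly replaces the pinned 2.3 rather than being downgraded again.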
zfciruhq #4
Thanks! I'm a bit new to Triton — is that the custom kernel layer on top of CUDA that vLLM uses? If so, I believe I'd just need to build the kernels with the nightly using this: https://llvm.org/docs/CMake.html. Being able to run FP8 on my own Ada hardware would be legendary.
llycmphe #5
Triton itself isn't a custom kernel; it's a library that JIT-compiles kernels at runtime. So you only need to upgrade the installed Python package — no building from source required. After installing vLLM, try uninstalling Triton and installing a newer or nightly release to see whether they've fixed this.
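If you try that, you can verify which Triton vLLM actually picks up inside the container with a quick check:

  python -c "import triton; print(triton.__version__)"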