mlc-llm [Bug] Unhandled CUDA error, related to ROCm 5.7

atmip9wb · asked 2 months ago · in: Other

🐛 Bug

When trying to run Mixtral 8x7B sharded across two GPUs with q4f16_1 quantization, I get an unhandled CUDA error from RCCL.

To Reproduce

Steps to reproduce the behavior:

  1. Install MLC-LLM
  2. Convert the Mixtral weights
  3. Compile Mixtral for ROCm with --tensor-parallel-shard 2
  4. Try to run it with NCCL_DEBUG=INFO using the following code:
from mlc_llm import ChatModule

cm = ChatModule(
    model="./dist/Mixtral-8x7B-Instruct-v0.1-MLC",
    model_lib_path="./dist/libs/Mixtral-8x7B-Instruct-v0.1-q4f16_1-rocm.so",
    device="rocm",
)

# Simple interactive loop
while True:
    prompt = input("> ")
    cm.generate(prompt)

This produces the following log:

[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:12] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-18 15:41:13] INFO auto_device.py:76: Found device: rocm:0
[2024-04-18 15:41:13] INFO auto_device.py:76: Found device: rocm:1
[2024-04-18 15:41:13] INFO chat_module.py:379: Using model folder: /home/user/mlc/dist/Mixtral-8x7B-Instruct-v0.1-MLC
[2024-04-18 15:41:13] INFO chat_module.py:380: Using mlc chat config: /home/user/mlc/dist/Mixtral-8x7B-Instruct-v0.1-MLC/mlc-chat-config.json
[2024-04-18 15:41:13] INFO chat_module.py:529: Using library model: ./dist/libs/Mixtral-8x7B-Instruct-v0.1-q4f16_1-rocm.so
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:13] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[2024-04-18 15:41:13] INFO model_metadata.py:96: Total memory usage: 15417.07 MB (Parameters: 12599.13 MB. KVCache: 0.00 MB. Temporary buffer: 2817.94 MB)
[2024-04-18 15:41:13] INFO model_metadata.py:105: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
localhost:7497:7497 [0] NCCL INFO Bootstrap : Using enp12s0:192.168.1.236<0>
localhost:7497:7497 [0] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 0 : librccl-net.so: cannot open shared object file: No such file or directory
localhost:7497:7497 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
localhost:7497:7497 [0] NCCL INFO Kernel version: 6.5.0-27-generic
localhost:7497:7578 [0] NCCL INFO ROCr version 1.1
localhost:7497:7578 [0] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
RCCL version 2.17.1+hip5.7 HEAD:3d014cc+
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
[15:41:14] /workspace/tvm/src/target/parsers/aprofile.cc:97: Warning: Cannot parse target features. LLVM was not compiled with support for Arm(R)-based targets.
terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [15:41:14] /workspace/tvm/src/runtime/disco/nccl/nccl.cc:87: rcclErrror: unhandled cuda error
Stack trace:
  0: _ZN3tvm7runtime6deta
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  5: execute_native_thread_routine
        at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
  6: start_thread
        at ./nptl/pthread_create.c:442
  7: 0x000076773652684f
        at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
  8: 0xffffffffffffffff

localhost:7579:7579 [1] NCCL INFO ROCr version 1.1
localhost:7579:7579 [1] NCCL INFO Dmabuf feature disabled without NCCL_ENABLE_DMABUF_SUPPORT=1
localhost:7579:7579 [1] NCCL INFO Bootstrap : Using enp12s0:192.168.1.236<0>
localhost:7579:7579 [1] NCCL INFO NET/Plugin : Plugin load (librccl-net.so) returned 2 : librccl-net.so: cannot open shared object file: No such file or directory
localhost:7579:7579 [1] NCCL INFO NET/Plugin : No plugin found, using internal implementation
localhost:7579:7579 [1] NCCL INFO Kernel version: 6.5.0-27-generic
localhost:7579:7579 [1] NCCL INFO Failed to open libibverbs.so[.1]
localhost:7579:7579 [1] NCCL INFO NET/Socket : Using [0]enp12s0:192.168.1.236<0>
localhost:7579:7579 [1] NCCL INFO Using network Socket
localhost:7579:7579 [1] NCCL INFO rocm_smi_lib: version 5.0.0.0
localhost:7579:7579 [1] NCCL INFO Setting affinity for GPU 1 to ffffff
localhost:7579:7579 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 comm 0x3726740 nRanks 02 busId 6000
localhost:7579:7579 [1] NCCL INFO P2P Chunksize set to 131072
localhost:7579:7579 [1] NCCL INFO Channel 00/0 : 1[6000] -> 0[3000] via P2P/IPC comm 0x3726740 nRanks 02
localhost:7579:7579 [1] NCCL INFO Channel 01/0 : 1[6000] -> 0[3000] via P2P/IPC comm 0x3726740 nRanks 02

localhost:7579:7579 [1] /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:228 NCCL WARN Cuda failure 'invalid device pointer'
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport/p2p.cc:342 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/transport.cc:160 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1269 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1500 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1701 -> 1
localhost:7579:7579 [1] NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/rccl/build/hipify/src/init.cc:1738 -> 1
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 51, in <module>
    main()
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/mlc_llm/cli/worker.py", line 46, in main
    worker_func(worker_id, num_workers, reader, writer)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 277, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
  File "/home/user/miniconda3/envs/mlc-prebuilt/lib/python3.11/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  6: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (int, int, long, long)>::AssignTypedLambda<void (*)(int, int, long, long)>(void (*)(int, int, long, long), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  5: tvm::runtime::WorkerProcess(int, int, long, long)
  4: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
  3: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
  2: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::TypedPackedFunc<void (tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>::AssignTypedLambda<void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)>(void (*)(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(tvm::runtime::TVMArgs const&, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
  1: tvm::runtime::nccl::InitCCLPerWorker(tvm::runtime::ShapeTuple, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
  0: _ZN3tvm7runtime6deta
  File "/workspace/tvm/src/runtime/disco/nccl/nccl.cc", line 87
rcclErrror: unhandled cuda error

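The exception is raised from InitCCLPerWorker at nccl.cc:87, i.e. while the RCCL communicator is being set up, before any generation happens. Below is a minimal sketch that exercises only that initialization path through TVM's disco runtime; the tvm.runtime.disco names and signatures are written from memory of the TVM Unity Python API and may not match the installed wheel exactly.

from tvm.runtime import disco

# Diagnostic sketch only: spawn one disco worker per GPU and initialize
# RCCL between them, which should hit the same InitCCLPerWorker code path
# as MLC does. ProcessSession/init_ccl are assumed to be exposed by the
# prebuilt TVM Unity wheel.
sess = disco.ProcessSession(num_workers=2)
sess.init_ccl("rccl", 0, 1)
print("RCCL initialized across rocm:0 and rocm:1")
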
Expected behavior

Being able to chat with Mixtral 8x7B for testing.

Environment

  • Platform: ROCm 5.7
  • Operating system: Ubuntu 22.04.4
  • Device: 2x Radeon Instinct MI25 16GB
  • How you installed MLC-LLM: conda
  • How you installed TVM-Unity: pip
  • Python version (e.g. 3.10): 3.11
  • TVM Unity hash tag: 0c81069ea42f393f6ff24efc47f15bc2316cfb10

Additional context

Fresh install of Ubuntu 22.04.4 with kernel 6.5.0-27-generic.
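
Possibly relevant to the RCCL warning above ("Cuda failure 'invalid device pointer'" in transport/p2p.cc): it may be worth confirming that peer-to-peer access between the two GPUs works at all. A rough check, assuming a ROCm build of PyTorch happens to be installed (PyTorch is not part of the MLC setup here, it is only a convenient way to query peer access):

import torch  # ROCm build; torch.cuda maps to HIP on ROCm

# Diagnostic only: report whether each GPU can access the other's memory.
print("visible GPUs:", torch.cuda.device_count())
print("0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
print("1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))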

j5fpnvbx 1#

OK, good to know that this also happens with non-bridged GPUs.


xpszyzbs 2#

Thanks for reporting. Let's hope we get an upstream fix soon 🤞. In the meantime we are also waiting on it.
