mlc-llm [Bug] tvm._ffi.base.TVMError: TVMError: Assert fail: T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1]) == 256

b4qexyjb · posted 2 months ago in Other

🐛 Bug

I am trying to optimize the Qwen/Qwen1.5-4B-Chat model. Since my Mac M1 only has 8 GB of RAM, I used 3-bit quantization and a fairly small prefill chunk size of 2048. When running mlc_llm chat $IR_FILES, I hit the error below.
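For reference, the failing invocation is simply the following (IR_FILES is set to the converted output directory, dist/shards, in the reproduction steps further down; the jit.py log line below shows the model lib being picked up from the cache automatically):

IR_FILES=dist/shards
mlc_llm chat $IR_FILES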

[2024-07-28 00:47:14] INFO auto_config.py:70: Found model configuration: dist/shards/mlc-chat-config.json
[2024-07-28 00:47:14] INFO auto_target.py:84: Detecting target device: metal:0
[2024-07-28 00:47:14] INFO auto_target.py:86: Found target: {"thread_warp_size": 32, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
[2024-07-28 00:47:14] INFO auto_target.py:103: Found host LLVM triple: arm64-apple-darwin22.2.0
[2024-07-28 00:47:14] INFO auto_target.py:104: Found host LLVM CPU: apple-m1
[2024-07-28 00:47:14] INFO auto_config.py:154: Found model type: qwen2. Use `--model-type` to override.
Compiling with arguments:
--config          QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
--quantization    GroupQuantize(name='q3f16_1', kind='group-quant', group_size=40, quantize_dtype='int3', storage_dtype='uint32', model_dtype='float16', linear_weight_layout='NK', quantize_embedding=True, quantize_final_fc=True, num_elem_per_storage=10, num_storage_per_group=4, max_int_value=3)
--model-type      qwen2
--target          {"thread_warp_size": 32, "host": {"mtriple": "arm64-apple-darwin22.2.0", "tag": "", "kind": "llvm", "mcpu": "apple-m1", "keys": ["arm_cpu", "cpu"]}, "max_threads_per_block": 1024, "max_function_args": 31, "max_num_threads": 256, "kind": "metal", "max_shared_memory_per_block": 32768, "tag": "", "keys": ["metal", "gpu"]}
--opt             flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=0;cutlass=0;ipc_allreduce_strategy=NONE
--system-lib-prefix ""
--output          /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
--overrides       context_window_size=None;sliding_window_size=None;prefill_chunk_size=None;attention_sink_size=None;max_batch_size=None;tensor_parallel_shards=1
[2024-07-28 00:47:14] INFO config.py:107: Overriding tensor_parallel_shards from 1 to 1
[2024-07-28 00:47:14] INFO compile.py:127: Creating model from: QWen2Config(hidden_act='silu', hidden_size=2560, intermediate_size=6912, num_attention_heads=20, num_hidden_layers=40, num_key_value_heads=20, rms_norm_eps=1e-06, rope_theta=5000000.0, vocab_size=151936, context_window_size=32768, prefill_chunk_size=2048, tensor_parallel_shards=1, head_dim=128, dtype='float32', max_batch_size=80, kwargs={})
[2024-07-28 00:47:14] INFO compile.py:145: Exporting the model to TVM Unity compiler
[2024-07-28 00:47:17] INFO compile.py:151: Running optimizations using TVM Unity
[2024-07-28 00:47:17] INFO compile.py:171: Registering metadata: {'model_type': 'qwen2', 'quantization': 'q3f16_1', 'context_window_size': 32768, 'sliding_window_size': -1, 'attention_sink_size': -1, 'prefill_chunk_size': 2048, 'tensor_parallel_shards': 1, 'kv_state_kind': 'kv_cache', 'max_batch_size': 80}
[2024-07-28 00:47:18] INFO pipeline.py:52: Running TVM Relax graph-level optimizations
[2024-07-28 00:47:26] INFO pipeline.py:52: Lowering to TVM TIR kernels
[2024-07-28 00:47:33] INFO pipeline.py:52: Running TVM TIR-level optimizations
[2024-07-28 00:47:59] INFO pipeline.py:52: Running TVM Dlight low-level optimizations
[2024-07-28 00:48:01] INFO pipeline.py:52: Lowering to VM bytecode
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `alloc_embedding_tensor`: 10.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_decode`: 50.70 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_prefill`: 157.76 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `batch_verify`: 1298.00 MB
[2024-07-28 00:48:05] INFO estimate_memory_usage.py:58: [Memory usage] Function `create_tir_paged_kv_cache`: 0.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `decode`: 0.63 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `embed`: 10.00 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `prefill`: 111.58 MB
[2024-07-28 00:48:06] INFO estimate_memory_usage.py:58: [Memory usage] Function `softmax_with_temperature`: 0.00 MB
[2024-07-28 00:48:07] INFO pipeline.py:52: Compiling external modules
[2024-07-28 00:48:07] INFO pipeline.py:52: Compilation complete! Exporting to disk
[2024-07-28 00:48:17] INFO model_metadata.py:95: Total memory usage without KV cache:: 2994.43 MB (Parameters: 1696.43 MB. Temporary buffer: 1298.00 MB)
[2024-07-28 00:48:17] INFO model_metadata.py:103: To reduce memory usage, tweak `prefill_chunk_size`, `context_window_size` and `sliding_window_size`
[2024-07-28 00:48:17] INFO compile.py:193: Generated: /var/folders/3d/_9ftlcj54396cwckpfssmw_h0000gn/T/tmpuh24ym5s/lib.dylib
[2024-07-28 00:48:17] INFO jit.py:128: Using compiled model lib: /Users/prashantdandriyal/.cache/mlc_llm/model_lib/82941e4cf5dae160d69bd8844e5ef61e.dylib
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 593, prefill chunk size will be set to 593. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 611, prefill chunk size will be set to 611. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:621: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 138, prefill chunk size will be set to 2048. 
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:701: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 611, prefill chunk size is 611.
[00:48:18] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:706: Estimated total single GPU memory usage: 4641.777 MB (Parameters: 1696.427 MB. KVCache: 327.001 MB. Temporary buffer: 2618.349 MB). The actual usage might be slightly larger than the estimated number.
You can use the following special commands:
/help               print the special commands
/exit               quit the cli
/stats              print out stats of last request (token/sec)
/metrics            print out full engine metrics
/reset              restart a fresh chat
/set [overrides]    override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.

Exception in thread Thread-1:
Traceback (most recent call last):
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 332, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 263, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 252, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 182, in tvm._ffi._cy3.core.CHECK_CALL
File "/Users/prashantdandriyal/miniforge3/envs/mlc/lib/python3.12/site-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: TVMError: Assert fail: T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1]) == 256, Argument fused_fused_dequantize_take1.p_lv2656.shape[1] has an unsatisfied constraint: 256 == T.Cast("int32", fused_fused_dequantize_take1_p_lv2656_shape[1])

To Reproduce

Steps to reproduce the behavior:

1. Download the model
huggingface-cli download --local-dir dist Qwen/Qwen1.5-4B-Chat
2. Convert the weights
# convert weights
MODEL_PATH=dist
QUANTIZATION=q4f16_1
MODEL_NAME=Qwen1.5-4B-Chat
IR_FILES=dist/shards
mlc_llm convert_weight $MODEL_PATH/ --quantization $QUANTIZATION -o $IR_FILES
3. Generate the MLC Chat config
MODEL_PATH=dist
QUANTIZATION=q3f16_1
IR_FILES=dist/shards
mlc_llm gen_config  $MODEL_PATH \
--prefill-chunk-size 2048 \
--quantization $QUANTIZATION --conv-template redpajama_chat \
-o $IR_FILES

4. Compile the model into a library (.so)

# Create output directory for the model library compiled
mkdir dist/libs

# compile
MLC_CHAT_CONFIG=dist/shards/mlc-chat-config.json
QUANTIZATION=q3f16_1
mlc_llm compile $MLC_CHAT_CONFIG \
--device metal -o dist/libs/Qwen1.5-4B-Chat-3B-$QUANTIZATION-metal.so

5. Run chat

Finally, run the chat command: mlc_llm chat dist/shards --model-lib dist/libs/Qwen1.5-4B-Chat-3B-q3f16_1-metal.so

yh2wf1be · Answer #1

Thanks for reporting this. We will look into the underlying issue with q3f16_1. Given that this is a 4B model, would you mind trying 4-bit quantization (q4f16_1) in the meantime?
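For anyone following along, re-running the pipeline from the reproduction steps with 4-bit quantization would look roughly like this. This is only a sketch that reuses the question's paths and flags; the output library name is an example and can be anything:

# re-run the full pipeline with 4-bit quantization (q4f16_1)
MODEL_PATH=dist
IR_FILES=dist/shards
QUANTIZATION=q4f16_1
mlc_llm convert_weight $MODEL_PATH/ --quantization $QUANTIZATION -o $IR_FILES
mlc_llm gen_config $MODEL_PATH \
  --prefill-chunk-size 2048 \
  --quantization $QUANTIZATION --conv-template redpajama_chat \
  -o $IR_FILES
mlc_llm compile $IR_FILES/mlc-chat-config.json \
  --device metal -o dist/libs/Qwen1.5-4B-Chat-$QUANTIZATION-metal.so
mlc_llm chat $IR_FILES --model-lib dist/libs/Qwen1.5-4B-Chat-$QUANTIZATION-metal.so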
