mlc-llm [Bug] Fine-tuned model fails when deployed with WebLLM

v7pvogib · asked 6 months ago

🐛 Bug

I'm running into a problem with a DoRA fine-tuned Qwen2-0.5B model deployed with WebLLM. Inference always fails with the same error.
Error trace:
background.js:53 thread '<unnamed>' panicked at /home/cfruan/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rayon-core-1.12.1/src/registry.rs:168:10:
The global thread pool has not been initialized.: ThreadPoolBuildError { kind: IOError(Os { code: 6, kind: WouldBlock, message: "Resource temporarily unavailable" }) }
Any help would be very welcome; this has been blocking me for several days.

To Reproduce

Steps to reproduce the behavior:

    1. Fine-tune a Qwen2-0.5B model with the HuggingFace PEFT package, setting use_dora=True in the LoraConfig
    2. Merge the LoRA adapter into the base weights with .merge_and_unload() (see the sketch after this list)
    3. Test the merged weights with MLCEngine
    4. Quantize (q4f16_1) and compile the wasm file with mlc-llm (latest nightly)
    5. Deploy with web-llm and try to run inference
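
A minimal sketch of steps 1-2, assuming a causal-LM fine-tune with the HuggingFace transformers and peft packages (peft recent enough to support use_dora); the rank, target modules, and training loop are illustrative placeholders, not the exact setup used:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")

# DoRA is enabled through the regular LoraConfig via use_dora=True.
lora_config = LoraConfig(
    r=16,                    # rank (illustrative)
    lora_alpha=32,           # scaling (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    use_dora=True,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

# ... fine-tune `model` here (e.g. with the transformers Trainer) ...

# Step 2: fold the adapter into the base weights and save a plain
# HF checkpoint that mlc_llm can later convert and quantize.
merged = model.merge_and_unload()
merged.save_pretrained("./qwen2-0.5B-dora-merged")
tokenizer.save_pretrained("./qwen2-0.5B-dora-merged")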

Expected behavior

The model should output tokens with no errors in the console.

Environment

  • Platform (e.g. WebGPU/Vulkan/iOS/Android/CUDA): WebGPU (compiled with Metal)
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): macOS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): MacBook Pro M1
  • How you installed MLC-LLM (conda, source): source
  • How you installed TVM-Unity (pip, source): source
  • Python version (e.g. 3.10): NA
  • GPU driver version (if applicable): NA
  • CUDA/cuDNN version (if applicable): NA
  • TVM Unity hash tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
  • Any other relevant information:

Additional context

I can share code samples and the problematic weights if needed.

f0brbegy #1

Could you try running the original Qwen2-0.5B model? Also, does your fine-tuned model run on other devices, e.g. CUDA?

zzwlnbp8 #2

Yes, I can run the original Qwen2-0.5B on WebLLM (compiled from the source weights), and I can run the fine-tuned model on Metal with the mlc-llm Python library. Only the fine-tuned model fails on WebLLM.

oxosxuxt #3

This seems related to how we do the packaging and to the latest wasm runtime. If you have a custom build, it would help if you could run the original Qwen with it and reproduce the error. Alternatively, it would be great if you could share a reproducible command along with the model that triggers the error.
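
For reference, a minimal sketch of the kind of reproducible conversion/compile pipeline being asked for here. It assumes the standard mlc_llm CLI subcommands (convert_weight, gen_config, compile); the paths and the conv template name are illustrative assumptions, and exact flags may differ on the nightly build:

import subprocess

MODEL_DIR = "./qwen2-0.5B-dora-merged"       # merged fine-tuned HF checkpoint (assumption)
OUT_DIR = "./dist/qwen2-0.5B-dora-q4f16_1"   # MLC output directory (assumption)

# Convert and quantize the merged weights to MLC format (q4f16_1).
subprocess.run(["mlc_llm", "convert_weight", MODEL_DIR,
                "--quantization", "q4f16_1", "-o", OUT_DIR], check=True)

# Generate the chat config; the conv template name is an assumption.
subprocess.run(["mlc_llm", "gen_config", MODEL_DIR,
                "--quantization", "q4f16_1", "--conv-template", "qwen2",
                "-o", OUT_DIR], check=True)

# Compile a WebGPU model library (wasm) for WebLLM.
subprocess.run(["mlc_llm", "compile", f"{OUT_DIR}/mlc-chat-config.json",
                "--device", "webgpu",
                "-o", f"{OUT_DIR}/qwen2-0.5B-q4f16_1-webgpu.wasm"], check=True)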

7ajki6be #4

These are the weights I'm trying to deploy; they run fine with the Python backend on Metal:

from mlc_llm import MLCEngine

# Create engine
model = "HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot"
engine = MLCEngine(model)

# Run a streaming chat completion via the OpenAI-compatible API.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "What is the meaning of life?"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)
print("\n")

engine.terminate()

/opt/homebrew/Caskroom/miniforge/base/envs/mlc/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
[2024-06-25 10:15:01] INFO auto_device.py:88: Not found device: cuda:0
[2024-06-25 10:15:02] INFO auto_device.py:88: Not found device: rocm:0
[2024-06-25 10:15:03] INFO auto_device.py:79: Found device: metal:0
[2024-06-25 10:15:04] INFO auto_device.py:88: Not found device: vulkan:0
[2024-06-25 10:15:05] INFO auto_device.py:88: Not found device: opencl:0
[2024-06-25 10:15:05] INFO auto_device.py:35: Using device: metal:0
[2024-06-25 10:15:05] INFO download_cache.py:227: Downloading model from HuggingFace: HF://OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO download_cache.py:29: MLC_DOWNLOAD_CACHE_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO download_cache.py:166: Weights already downloaded: /Users/User/.cache/mlc_llm/model_weights/hf/OpilotAI/qwen2-0.5B-pii-masking-lora-merged-q4f16_1-Opilot
[2024-06-25 10:15:05] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2024-06-25 10:15:05] INFO jit.py:158: Using cached model lib: /Users/User/.cache/mlc_llm/model_lib/c5f2c474b97ac6bb95cf167c9cc9dba8.dylib
[2024-06-25 10:15:05] INFO engine_base.py:179: The selected engine mode is local. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-06-25 10:15:05] INFO engine_base.py:204: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[2024-06-25 10:15:05] INFO engine_base.py:209: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:668: Under mode "server", max batch size will be set to 80, max KV cache token capacity will be set to 32768, prefill chunk size will be set to 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:748: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[10:15:05] /Users/catalyst/Workspace/mlc-ai-package-self-runner/_work/package/package/mlc-llm/cpp/serve/config.cc:753: Estimated total single GPU memory usage: 2959.723 MB (Parameters: 265.118 MB. KVCache: 152.245 MB. Temporary buffer: 2542.361 MB). The actual usage might be slightly larger than the estimated number.
As an AI language model, I don't have personal beliefs or experiences. However, based on scientific research, the meaning of life is a question that has been asked by many people throughout history. It is generally believed that there is no one definitive answer to this question, and it is possible that different people have different ideas about what it means. Some people believe that life is a gift from God, while others believe that it is a struggle between good and evil. Ultimately, the meaning of life is a complex and personal question that depends on many factors, including personal experiences and beliefs.

With WebLLM:

Original weights:

--> works as expected

Fine-tuned weights:

--> error as described above

Interestingly, we also tried the following combinations:

  • Original weights with the fine-tuned wasm library --> works as expected
  • Fine-tuned weights with the original wasm library --> same error

So the problem seems to come from the weights, but then why do they work fine with the Python library?

jpfvwuh4 #5

For now you might try the unquantized qwen2-0.5B: the mlc-llm team has not released a q4f16 webgpu.wasm yet, only q0f16, and the wasm you generate yourself seems to have some issue that does not show up with the original model but does with the fine-tuned one. I am also waiting for the official qwen2-0.5B q4f16 wasm release. I would suggest trying your fine-tuned q4f16 model again once the official wasm is out.

ego6inou #6

@bil-ash I tried q0f16 quantization last week and got the same error. I don't think there is any difference between the wasm I generate and the "official" one.
