mlc-llm [Question] Llama3: How to fix the insufficient GPU memory error on Pixel 8 Pro?

Posted by cdmah0mi 4 months ago in Other

When I try to run Llama-3-8B-Instruct with 4-bit quantization on Android, I get an insufficient GPU memory error. The details are as follows:

Process: ai.mlc.mlcchat, PID: 5790
    org.apache.tvm.Base$TVMError: TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 4352.000 MB, which is less than the sum of model weight size (4308.133 MB) and temporary buffer size (278.504 MB).
    1. You can set a larger "gpu_memory_utilization" value.
    2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
    3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
    Stack trace:
      File "/home/ll/MLC/Llama-3-8B-Instruct/mlc-llm/cpp/serve/threaded_engine.cc", line 283
    
    ...

Regarding the three solutions suggested in the crash log:

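1. I do not know how to change gpu_memory_utilization. Can it be changed for Android at all?
2. Support for tensor-parallel-shards on Android is insufficient, so I am skipping this one.
3. Reducing prefill-chunk-size only shrinks the temporary buffer, but the model weights (4308.133 MB) already take up almost all of the available single-GPU memory (4352.000 MB), so this approach does not help either.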
So, is there any way to get Llama-3-8B-Instruct running properly on Android with 4-bit quantization?
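
As a sanity check on the numbers in the crash log, here is a minimal Python sketch of the comparison the error message describes (available memory versus weights plus temporary buffer). The function name `memory_budget` is hypothetical and only for illustration. The three constants are copied from the log, and the sketch ignores anything else the runtime might need, such as the KV cache.

```python
# Figures copied verbatim from the crash log (all in MB).
AVAILABLE_MB = 4352.000
WEIGHTS_MB = 4308.133
TEMP_BUFFER_MB = 278.504


def memory_budget(available_mb, weights_mb, temp_buffer_mb):
    """Return (required, shortfall, headroom_after_weights), all in MB.

    Hypothetical helper: it mirrors the comparison in the error message,
    not the actual MLC LLM implementation.
    """
    required = weights_mb + temp_buffer_mb
    shortfall = required - available_mb   # > 0 means the check fails
    headroom = available_mb - weights_mb  # left over once weights are loaded
    return required, shortfall, headroom


required, shortfall, headroom = memory_budget(AVAILABLE_MB, WEIGHTS_MB, TEMP_BUFFER_MB)
print(f"required:               {required:.3f} MB")
print(f"shortfall:              {shortfall:.3f} MB")
print(f"headroom after weights: {headroom:.3f} MB")
```

With these figures the requirement comes to about 4586.6 MB, which is about 234.6 MB over the limit, and only about 43.9 MB remains once the weights are loaded. That is why shrinking the temporary buffer alone looks unlikely to be enough.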
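By the way, as far as I know, the CPU and GPU on Android appear to share memory. If that is true, why is the single-GPU memory limit on the Pixel 8 Pro (which has 12 GB of RAM) 4352.000 MB?
My Pixel 5 shows the same limit. Has the single-GPU memory size changed at all between the Pixel 5 and the Pixel 8 Pro?
Or is it some parameter in the model that limits how much memory a single GPU can use?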
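
One possible reading of the 4352.000 MB figure, sketched below in Python, is that it is a reported total multiplied by the `gpu_memory_utilization` fraction. The 0.85 used here is an assumption (it is not stated anywhere in this post), and the helper names are hypothetical. Treat this as a back-of-the-envelope check, not as a description of how MLC LLM actually computes the limit.

```python
# Figures from the crash log (MB).
AVAILABLE_MB = 4352.000
REQUIRED_MB = 4308.133 + 278.504  # weights + temporary buffer

# ASSUMPTION: the "available" figure is total * gpu_memory_utilization,
# with a utilization fraction of 0.85. Neither is confirmed by the post.
ASSUMED_UTILIZATION = 0.85


def implied_total_mb(available_mb, utilization):
    """Total memory the runtime would have to report under the assumption."""
    return available_mb / utilization


def utilization_needed(required_mb, total_mb):
    """Fraction of the total needed to cover weights + temporary buffer."""
    return required_mb / total_mb


total_mb = implied_total_mb(AVAILABLE_MB, ASSUMED_UTILIZATION)
needed = utilization_needed(REQUIRED_MB, total_mb)
print(f"implied total:      {total_mb:.1f} MB ({total_mb / 1024:.2f} GiB)")
print(f"utilization needed: {needed:.3f}")
```

Under that assumption the implied total is 5120 MB (5 GiB), and a fraction of roughly 0.9 would be needed to cover the weights and temporary buffer alone, before any KV cache. If the assumption holds, it would also suggest that the identical limit on the Pixel 5 and Pixel 8 Pro comes from what the runtime reports, not from the phones' physical RAM, but this is unverified.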

Source: https://github.com/mlc-ai/mlc-llm/issues/2703