ollama error loading model: unable to allocate backend buffer when the model is larger than VRAM and multiple GPUs are used


What is the issue?
ollama run llava:34b write me a poem

Error: llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model

Hardware
The system has 2 discrete GPUs:

  • AMD RX 7600 XT (16 GB)
  • NVIDIA GTX 1050 Ti (4 GB)

RAM: 48 GB
CPU: AMD 7600X

What I tried
I tried manipulating the CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES environment variables. Setting either one to -1 makes ollama run on the remaining GPU.
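Roughly what that looks like (a minimal sketch, assuming the server is started manually with ollama serve; with a systemd service the variables would go into the unit's environment instead):

# Hide the NVIDIA card so only the AMD RX 7600 XT is used
CUDA_VISIBLE_DEVICES=-1 ollama serve

# Hide the AMD card so only the NVIDIA 1050 Ti is used
HIP_VISIBLE_DEVICES=-1 ollama serve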
Logs:
both.txt
amd_only.txt
nvidia_only.txt
Excerpt from both.txt:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15863.15 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
  what():  unable to allocate backend buffer
time=2024-07-20T10:45:15.144+02:00 level=ERROR source=sched.go:480 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer"

This happens with every model I have tried that is larger than 16 GB, whenever both GPUs are available.
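For reference, the model sizes and per-GPU VRAM can be checked with standard tooling (assuming ollama, nvidia-smi and rocm-smi are on the PATH):

ollama list    # on-disk size of each pulled model
nvidia-smi     # memory usage on the 1050 Ti
rocm-smi       # usage/VRAM% on the RX 7600 XT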

OS

Linux

GPU

Nvidia, AMD

CPU

AMD

Ollama version

0.2.1
