ollama error loading model: unable to allocate backend buffer when the model is larger than VRAM and multiple GPUs are used


What is the issue?
ollama run llava:34b write me a poem

Error: llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model

Hardware
The system has 2 discrete GPUs:

  • AMD RX 7600 XT (16 GB)
  • NVIDIA GTX 1050 Ti (4 GB)

RAM: 48 GB
CPU: AMD 7600X

What I tried
I tried manipulating the CUDA_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES environment variables. Setting either one to -1 makes ollama run on the remaining GPU.
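Roughly what that looks like (a minimal sketch, assuming the server is started manually with ollama serve; with a systemd service the variables would go into the unit's environment instead):

# Hide the NVIDIA card so only the AMD RX 7600 XT is used
CUDA_VISIBLE_DEVICES=-1 ollama serve

# Hide the AMD card so only the NVIDIA 1050 Ti is used
HIP_VISIBLE_DEVICES=-1 ollama serve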
Logs:
both.txt
amd_only.txt
nvidia_only.txt
Excerpt from both.txt:

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 15863.15 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
  what():  unable to allocate backend buffer
time=2024-07-20T10:45:15.144+02:00 level=ERROR source=sched.go:480 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) error loading model: unable to allocate backend buffer"

This happens with every model I have tried that is larger than 16 GB, whenever both GPUs are available.
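For reference, the model sizes and per-GPU VRAM can be checked with standard tooling (assuming ollama, nvidia-smi and rocm-smi are on the PATH):

ollama list    # on-disk size of each pulled model
nvidia-smi     # memory usage on the 1050 Ti
rocm-smi       # usage/VRAM% on the RX 7600 XT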

OS

Linux

GPU

Nvidia, AMD

CPU

AMD

Ollama version

0.2.1
