Ollama v0.2+ increased RAM consumption with phi3:mini

hl0ma9xz · asked 2 months ago

What is the issue?

On a project where we are integrating with an LLM, we run Ollama with the phi3:mini model in a container as a local test environment. The project originally used version 0.1.48, which ran on a fairly small VM and was well suited for local testing, needing only 2.8 GB of RAM. After upgrading to v0.2, however, Ollama needs at least 5.6 GB of RAM to run the same model. That is a 2.8 GB increase in the memory required to run the same model between 0.1.48 and 0.2.6. This applies to all 0.2.x versions, although the issue may only have started with 0.2.4. It almost looks as if the model is being loaded into memory twice.

The container without a model loaded uses only 28 MB of RAM.

A colleague running the same project on Windows does not see this increase in RAM usage between the different versions.

OS

macOS

GPU

Apple

CPU

Apple Silicon M3

Ollama version

0.2.6


l7wslrjt1#

The server logs might help diagnose the issue.


yb3bgrhw2#

That makes sense. I initially didn't include them because they didn't seem to contain any extra detail, but looking through them again I did spot something, so thanks for prompting me to add the logs.
I noticed the log message msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1, so I decided to see what happens if I set OLLAMA_MAX_LOADED_MODELS=1. That made no difference, but setting OLLAMA_NUM_PARALLEL=1 did help: the container now consumes the same amount of memory as it did before 0.2.x. That makes sense, since 0.2.x is the release that introduced parallel requests.
I also experimented with different OLLAMA_NUM_PARALLEL values, and every increment consumes more RAM. The default of 0 effectively seems to mean 4, because it consumes the same amount of RAM as originally reported.
This may well be the intended behavior, but one could argue that defaulting to 4 is unnecessary, since I assume most people only run a single request at a time, especially given that 0.1.x had no parallelism at all.
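To make the scaling concrete, here is a rough back-of-envelope sketch. The weight and per-slot sizes are illustrative assumptions, not Ollama's actual accounting: the model weights are loaded once, but each parallel slot gets its own context/KV-cache buffers, so an effective default of 4 slots roughly doubles the footprint of a small model like phi3:mini.

```python
# Back-of-envelope estimate of how OLLAMA_NUM_PARALLEL affects the memory request.
# The constants below are illustrative assumptions, not Ollama's real bookkeeping.

def estimated_ram_gib(weights_gib: float, per_slot_gib: float, num_parallel: int) -> float:
    """Weights are loaded once; each parallel slot adds its own KV cache and buffers."""
    return weights_gib + per_slot_gib * num_parallel

WEIGHTS_GIB = 2.2    # assumed size of the quantized phi3:mini weights
PER_SLOT_GIB = 0.85  # assumed KV cache + buffers per parallel request slot

for n in (1, 2, 4):  # OLLAMA_NUM_PARALLEL; the default of 0 auto-selects 4 (or 1)
    print(f"OLLAMA_NUM_PARALLEL={n}: ~{estimated_ram_gib(WEIGHTS_GIB, PER_SLOT_GIB, n):.1f} GiB")
```

With these assumed numbers, a single slot lands near the ~2.8 GiB reported for 0.1.48, while four slots land at roughly the 5.6 GiB the server now requests.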
The debug logs with the default values are below:

2024/07/18 16:35:57 routes.go:1096: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
2024-07-18T16:35:57.153195894Z time=2024-07-18T16:35:57.153Z level=INFO source=images.go:778 msg="total blobs: 5"
2024-07-18T16:35:57.154634086Z time=2024-07-18T16:35:57.154Z level=INFO source=images.go:785 msg="total unused blobs removed: 0"
2024-07-18T16:35:57.156344318Z time=2024-07-18T16:35:57.156Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.2.6)"
2024-07-18T16:35:57.156962019Z time=2024-07-18T16:35:57.156Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama277969984/runners
2024-07-18T16:35:57.157094309Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/arm64/cpu/bin/ollama_llama_server.gz
2024-07-18T16:35:57.157300390Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcublas.so.11.gz
2024-07-18T16:35:57.157303890Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcublasLt.so.11.gz
2024-07-18T16:35:57.157946175Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcudart.so.11.0.gz
2024-07-18T16:35:57.157948592Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/ollama_llama_server.gz
2024-07-18T16:35:59.631799045Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cpu/ollama_llama_server
2024-07-18T16:35:59.631810753Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cuda_v11/ollama_llama_server
2024-07-18T16:35:59.631812087Z time=2024-07-18T16:35:59.631Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cuda_v11]"
2024-07-18T16:35:59.631812962Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-07-18T16:35:59.631813795Z time=2024-07-18T16:35:59.631Z level=DEBUG source=sched.go:102 msg="starting llm scheduler"
2024-07-18T16:35:59.631814503Z time=2024-07-18T16:35:59.631Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
2024-07-18T16:35:59.631815420Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
2024-07-18T16:35:59.631816170Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
2024-07-18T16:35:59.631860961Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-07-18T16:35:59.632050584Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[]
2024-07-18T16:35:59.632052501Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcudart.so*
2024-07-18T16:35:59.632054084Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama277969984/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
2024-07-18T16:35:59.632286123Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[/tmp/ollama277969984/runners/cuda_v11/libcudart.so.11.0]
2024-07-18T16:35:59.632654035Z cudaSetDevice err: 35
2024-07-18T16:35:59.632705201Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:533 msg="Unable to load cudart" library=/tmp/ollama277969984/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
2024-07-18T16:35:59.632706993Z time=2024-07-18T16:35:59.632Z level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
2024-07-18T16:35:59.632707826Z time=2024-07-18T16:35:59.632Z level=INFO source=gpu.go:346 msg="no compatible GPUs were discovered"
2024-07-18T16:35:59.632753700Z time=2024-07-18T16:35:59.632Z level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="5.8 GiB" available="5.0 GiB"
2024-07-18T16:36:02.156722849Z [GIN] 2024/07/18 - 16:36:02 | 200 |      71.582µs |       127.0.0.1 | HEAD     "/"
2024-07-18T16:36:02.168472601Z [GIN] 2024/07/18 - 16:36:02 | 200 |    11.21226ms |       127.0.0.1 | POST     "/api/show"
2024-07-18T16:36:02.179252034Z time=2024-07-18T16:36:02.179Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="5.8 GiB" before.free="5.0 GiB" before.free_swap="0 B" now.total="5.8 GiB" now.free="5.0 GiB" now.free_swap="0 B"
2024-07-18T16:36:02.179296116Z time=2024-07-18T16:36:02.179Z level=DEBUG source=sched.go:177 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
2024-07-18T16:36:02.184381253Z time=2024-07-18T16:36:02.184Z level=DEBUG source=sched.go:201 msg="cpu mode with first model, loading"
2024-07-18T16:36:02.184981578Z time=2024-07-18T16:36:02.184Z level=DEBUG source=server.go:100 msg="system memory" total="5.8 GiB" free="5.0 GiB" free_swap="0 B"
2024-07-18T16:36:02.184987245Z time=2024-07-18T16:36:02.184Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cpu/ollama_llama_server
2024-07-18T16:36:02.184989662Z time=2024-07-18T16:36:02.184Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cuda_v11/ollama_llama_server
2024-07-18T16:36:02.184991537Z time=2024-07-18T16:36:02.184Z level=DEBUG source=memory.go:101 msg=evaluating library=cpu gpu_count=1 available="[5.0 GiB]"
2024-07-18T16:36:02.184993287Z time=2024-07-18T16:36:02.184Z level=WARN source=server.go:132 msg="model request too large for system" requested="5.6 GiB" available=5330853888 total="5.8 GiB" free="5.0 GiB" swap="0 B"
2024-07-18T16:36:02.184995370Z time=2024-07-18T16:36:02.184Z level=INFO source=sched.go:416 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a error="model requires more system memory (5.6 GiB) than is available (5.0 GiB)"
2024-07-18T16:36:02.184997578Z [GIN] 2024/07/18 - 16:36:02 | 500 |   15.662614ms |       127.0.0.1 | POST     "/api/chat"
2024-07-18T16:36:02.185393531Z Error: model requires more system memory (5.6 GiB) than is available (5.0 GiB)

2fjabf4q3#

The FAQ says "The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory", so it seems it is up to the user to set this parameter explicitly if they want deterministic behavior.
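For reference, pinning the value explicitly rather than relying on auto-selection could look like the minimal sketch below. It simply sets the environment variable before launching ollama serve; in a containerized setup like the one above, the same variable would be set in the container's environment instead. The variable names come from the server log; everything else is illustrative.

```python
# Minimal sketch: start the Ollama server with parallelism pinned to 1 so the
# memory footprint stays deterministic regardless of the auto-selection logic.
# (In a container you would set the same variables via the container environment.)
import os
import subprocess

env = os.environ.copy()
env["OLLAMA_NUM_PARALLEL"] = "1"        # one request slot -> one context/KV cache
env["OLLAMA_MAX_LOADED_MODELS"] = "1"   # optional: keep only one model resident

subprocess.run(["ollama", "serve"], env=env, check=True)
```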


jq6vz3qz4#

That explains the behavior I'm seeing, but it also raises the question of whether the auto-scaling behavior should be tuned a bit.
5 GB of available memory is on the low side for any model, so allowing up to 4 parallel requests in that situation seems excessive: it currently consumes more than half of the available memory and even prevents one of the smallest models from loading at all.
But as you wrote, users can set this value themselves for such scenarios, which solves my immediate problem. If you don't see a need for any changes, I'm happy to go ahead and close this issue.
