ollama: Why is GPU utilization low, with the CPU doing the computation, when I run Llama 3 on an NVIDIA T2000 GPU?

Posted by jrcvhitl 5 months ago in Other


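Question: When I run Llama 3 or any other model with Ollama, GPU utilization keeps swinging between high and low and is never fully used, while CPU usage stays at around 40%. I have tried enabling various parameters, to no avail.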

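Answer: This is most likely because the model does not fit entirely in the GPU's memory, so the GPU is only partly used. The server log further down shows 16 of 33 layers offloaded to the GPU (layers.offload=16, layers.model=33); the remaining layers run on the CPU, and the GPU is likely idle while it waits for them. You can try the following:

Adjust the --ctx-size parameter. It sets the context window in tokens, and the KV cache grows in proportion to it, so a smaller value (for example 2048) leaves more VRAM for model layers.
Adjust the --batch-size parameter. It sets how many prompt tokens are processed per batch (it is not a training batch size). Lowering it, for example to 256 or less, shrinks the compute buffer and frees some VRAM.
Check whether memory leaks or inefficient computation are tying up part of the GPU and hurting overall performance. A tool such as NVIDIA Visual Profiler can help locate the bottleneck.
If possible, upgrade the hardware, for example to a GPU with more VRAM and more compute power. This will improve overall performance.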
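The --ctx-size and --batch-size settings above are flags of the underlying llama.cpp runner. Through Ollama's REST API the corresponding request options are, to my understanding, num_ctx and num_batch, with num_gpu setting how many layers go to the GPU. The following is a minimal sketch, assuming a local server at http://localhost:11434 and a model tagged llama3; the helper names and the values are illustrative and do not come from the thread.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_request(prompt, model="llama3", num_ctx=2048, num_batch=256, num_gpu=None):
    """Build the JSON body for /api/generate with explicit runner options."""
    options = {"num_ctx": num_ctx, "num_batch": num_batch}
    if num_gpu is not None:
        # Number of layers to place on the GPU; omit it to let Ollama decide.
        options["num_gpu"] = num_gpu
    return {"model": model, "prompt": prompt, "stream": False, "options": options}


def generate(prompt, **kwargs):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With a server running, generate("Hello", num_ctx=2048) would send one non-streaming request. Comparing GPU utilization across a few num_ctx values is one way to see whether the context size is what pushes layers onto the CPU.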
time=2024-08-13T18:12:10.310+08:00 level=DEBUG source=server.go:410 msg=subprocess environment="[CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\cuda;C:\Program Files\NVIDIA GPU Computing Toolkit\cuda\v12.6 CUDA_PATH_V12_3=C:\Program Files\NVIDIA GPU Computing Toolkit\cuda\v12.3;C:\Program Files\NVIDIA GPU Computing Toolkit\cuda\v12.6 CUDA_PATH_V12_6=C:\Program Files\NVIDIA GPU Computing Toolkit\cuda\v12.6;PATH=C:\Users\pewjs\AppData\Local\Programs\Ollama;C:\Users\pewjs\AppData\Local\Programs\ollama_runners\cuda_v11.3;C:\Users\pewjs\AppData\Local\Programs;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.6\libnvvp;C:\Program Files\Common Files\Oracle\Java\javapath;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.3\libnvvp;D:
"]"
time=2024-08-13T18:12:10.340+08:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T18:12:10.340+08:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T18:12:10.341+08:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
This is a log of the Llama model being loaded; it contains the model's parameters and configuration.
llama_new_context_with_model: KV self size = 1280.00 MiB, K (f16): 640.00 MiB, V (f16): 640.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 725.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 28.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 224
time=2024-08-13T18:12:14.341+08:00 level=DEBUG source=server.go:641 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=1 tid="3064" timestamp=1723543936
DEBUG [initialize] new slot | n_ctx_slot=10240 slot_id=0 tid="3064" timestamp=1723543936
INFO [wmain] model loaded | tid="3064" timestamp=1723543936
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="3064" timestamp=1723543936
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="3064" timestamp=1723543936
time=2024-08-13T18:12:16.233+08:00 level=INFO source=server.go:632 msg="llama runner started in 5.89 seconds"
time=2024-08-13T18:12:16.233+08:00 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=D:\ollama\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-08-13T18:12:16.233+08:00 level=DEBUG source=routes.go:1361 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\n介绍一下大模型的学习方法500字<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="3064" timestamp=1723543936
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=2 tid="3064" timestamp=1723543936
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=19 slot_id=0 task_id=2 tid="3064" timestamp=1723543936
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=2 tid="3064" timestamp=1723543936
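The "KV self size = 1280.00 MiB" line in the log above can be reproduced by hand, which shows how much VRAM the context window alone takes on a 4 GiB card. The sketch below assumes the model is Llama 3 8B with 32 layers, 8 KV heads and a head dimension of 128. The log does not state these values, so treat them as assumptions; the function name is likewise illustrative.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed for the K and V caches together (f16 = 2 bytes per element)."""
    per_tensor = n_ctx * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return 2 * per_tensor  # one tensor for K, one for V


MIB = 1024 * 1024

# Assumed Llama 3 8B architecture (not stated in the log).
LLAMA3_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

kv_10240 = kv_cache_bytes(10240, **LLAMA3_8B) / MIB  # n_ctx_slot from the log
kv_2048 = kv_cache_bytes(2048, **LLAMA3_8B) / MIB    # a smaller context for comparison

print(f"n_ctx=10240: {kv_10240:.2f} MiB")  # 1280.00 MiB
print(f"n_ctx=2048:  {kv_2048:.2f} MiB")   # 256.00 MiB
```

Under these assumptions the 10240-token context accounts for the full 1280 MiB reported, while a 2048-token context needs 256 MiB, the same figure that appears as memory_required.kv in the later server log in this thread.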
PS C:\WINDOWS\system32> nvidia-smi -q -d POWER,TEMPERATURE,PERFORMANCE


Source: https://github.com/ollama/ollama/issues/6337

6 answers


e37o9pze1#

I see you are using Ollama 0.1.35.
Could you try the latest version?

5 months ago

hwazgwia2#

I have upgraded to version 0.1.36, but the result is the same: GPU utilization fluctuates, CPU usage is at 43%, and the output is delayed.

5 months ago

agyaoht73#

2024/08/14 09:07:52 routes.go:1125: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG: OLLAMA_FLASH_ATTENTION: OLLAMA_HOST: http://0.0.0.0:11434 OLLAMA_INTEL_GPU: false OLLAMA_KEEP_ALIVE: 2m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS: 0 OLLAMA_MAX_QUEUE: 512 OLLAMA_MODELS: D:\ollama OLLAMA_NOHISTORY: false OLLAMA_NOPRUNE: false OLLAMA_NUM_PARALLEL: 0 OLLAMA_ORIGINS: [* http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: C:\Users\pewjs\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_SCHED_SPREAD: false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-14T09:07:52.497+08:00 level=INFO source=images.go:782 msg="total blobs: 22"
time=2024-08-14T09:07:52.498+08:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-14T09:07:52.501+08:00 level=INFO source=routes.go:1172 msg="Listening on [::]:11434 (version 0.3.6)"
time=2024-08-14T09:07:52.506+08:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v6.1 cpu cpu_avx]"
time=2024-08-14T09:07:52.507+08:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-14T09:07:53.771+08:00 level=INFO source=gpu.go:288 msg="detected OS VRAM overhead" id=GPU-8480663ce-4d0c-3d38-a715-655311eef7b7 library=cuda compute=7.5 driver=12.6 name="Quadro T2000" overhead="642.2 MiB"
time=2024-08-14T09:07:53.781+08:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-8480663ce-4d0c-3d38-a715-655311eef7b7 library=cuda compute=7.5 driver=12.6 name="Quadro T2000" total="4.0 GiB" available="3.2 GiB"
[GIN] 2024/08/14 - 09:08:07 | 200 | 543.9μs | 127.0.0.1 | GET "/api/version"
time=2024-08-14T09:08:17.692+08:09 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=16 layers.split="" memory="[3.2 GiB]" memory_available="[3Gib]" memory_required="full" memory_required.full="5.5 GiB" memory_required.partial="3.2 GiB" memory_required.kv="256.0 MiB" memory_required.allocations="[3Gib]" memory_weights.total="3.9 GiB" memory_weights.repeating="3.5 GiB" memory_weights.nonrepeating="411.MiB" memory_graph.full="164.MiB" memory_graph.partial="677.5MiB"
time=2024-
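This is log output from loading and initializing the Llama model. Llama is a Transformer-based pretrained language model used for text generation. The log shows that the model was loaded and initialized successfully, and it includes details such as the model's dimensions and tokenizer. It also reports resource figures such as memory usage and compute buffers, along with the input prompt and the generated output.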
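The "offload to cuda" line in the log above already carries the numbers that explain the mixed CPU/GPU load. As a rough sketch, the key=value fields can be pulled out with a regular expression and compared. The helper names are hypothetical, the line is abridged from the log, and the 3.2 GiB figure is the "available" value from the "inference compute" line.

```python
import re

# Matches key=value pairs where the value is either "quoted" or a bare token.
FIELD = re.compile(r'([\w.]+)=(?:"([^"]*)"|(\S+))')


def parse_fields(line):
    """Return the key=value pairs of an Ollama log line as a dict of strings."""
    return {key: quoted or bare for key, quoted, bare in FIELD.findall(line)}


def gib(text):
    """Convert a string such as '5.5 GiB' to a float number of GiB."""
    return float(text.split()[0])


# Abridged copy of the "offload to cuda" line from the log above.
line = (
    'level=INFO source=memory.go:309 msg="offload to cuda" '
    'layers.requested=-1 layers.model=33 layers.offload=16 '
    'memory_required.full="5.5 GiB" memory_required.partial="3.2 GiB"'
)

fields = parse_fields(line)
offloaded = int(fields["layers.offload"])
total = int(fields["layers.model"])
share = offloaded / total
fits = gib(fields["memory_required.full"]) <= 3.2  # 3.2 GiB reported as available

print(f"{offloaded}/{total} layers on GPU ({share:.0%}); fits entirely: {fits}")
```

For this log the script reports 16 of 33 layers on the GPU (about 48%) and that the full 5.5 GiB requirement does not fit, which is consistent with the CPU handling the remaining layers.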