ollama slower than llama.cpp

tgabmvqs · posted 9 months ago

What is the problem?

When benchmarking Ollama with llm-benchmark (https://github.com/MinhNgyuen/llm-benchmark), I get roughly 80 t/s with gemma 2 2b. Asking llama.cpp the same questions in conversation mode, I get about 130 t/s. The llama.cpp command I'm running is ".\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv"
Why is Ollama roughly 38% slower than llama.cpp here?
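
For reference, one way to sanity-check the Ollama-side number independently of llm-benchmark is to call Ollama's /api/generate endpoint directly and compute tokens per second from the eval_count and eval_duration fields it returns. A minimal sketch in Python (the gemma2:2b tag and the prompt are placeholders):

    # Minimal sketch: measure Ollama decode speed from a non-streaming /api/generate call.
    # Assumes the default server at 127.0.0.1:11434 and a locally pulled "gemma2:2b" tag.
    import json
    import urllib.request

    payload = {
        "model": "gemma2:2b",  # placeholder tag; use whatever `ollama list` shows
        "prompt": "Explain why the sky is blue in about 200 words.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)

    # eval_count = generated tokens, eval_duration = decode time in nanoseconds
    tokens = body["eval_count"]
    seconds = body["eval_duration"] / 1e9
    print(f"{tokens} tokens in {seconds:.2f}s -> {tokens / seconds:.1f} t/s")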

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.5

bihw5rsg1#

The server logs would help with debugging.

sbtkgmzw2#

  1. 2024/08/13 09:25:42 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Philip\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
  2. time=2024-08-13T09:25:42.918-04:00 level=INFO source=images.go:782 msg="total blobs: 10"
  3. time=2024-08-13T09:25:42.926-04:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
  4. time=2024-08-13T09:25:42.927-04:00 level=INFO source=routes.go:1170 msg="Listening on 127.0.0.1:11434 (version 0.3.5)"
  5. time=2024-08-13T09:25:42.928-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v6.1]"
  6. time=2024-08-13T09:25:42.928-04:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
  7. time=2024-08-13T09:25:43.218-04:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060 Ti" total="8.0 GiB" available="7.0 GiB"
  8. [GIN] 2024/08/13 - 09:25:53 | 200 | 0s | 127.0.0.1 | HEAD "/"
  9. [GIN] 2024/08/13 - 09:25:53 | 200 | 52.6791ms | 127.0.0.1 | POST "/api/show"
  10. time=2024-08-13T09:25:53.238-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6771941376 required="3.3 GiB"
  11. time=2024-08-13T09:25:53.238-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.3 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
  12. time=2024-08-13T09:25:53.251-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 60908"
  13. time=2024-08-13T09:25:53.279-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
  14. time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
  15. time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
  16. INFO [wmain] build info | build=3535 commit="1e6f6554" tid="9372" timestamp=1723555553
  17. INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="9372" timestamp=1723555553 total_threads=16
  18. INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="60908" tid="9372" timestamp=1723555553
  19. llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
  20. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  21. llama_model_loader: - kv 0: general.architecture str = gemma2
  22. llama_model_loader: - kv 1: general.type str = model
  23. llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
  24. llama_model_loader: - kv 3: general.finetune str = it-transformers
  25. llama_model_loader: - kv 4: general.basename str = gemma-2.0
  26. llama_model_loader: - kv 5: general.size_label str = 2B
  27. llama_model_loader: - kv 6: general.license str = gemma
  28. llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
  29. llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
  30. llama_model_loader: - kv 9: gemma2.block_count u32 = 26
  31. llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
  32. llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
  33. llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
  34. llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
  35. llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
  36. llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
  37. llama_model_loader: - kv 16: general.file_type u32 = 2
  38. llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
  39. llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
  40. llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
  41. llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
  42. llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
  43. time=2024-08-13T09:25:53.791-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
  44. llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
  45. llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
  46. llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
  47. llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
  48. llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
  49. llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
  50. llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
  51. llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
  52. llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
  53. llama_model_loader: - kv 31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
  54. llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
  55. llama_model_loader: - kv 33: general.quantization_version u32 = 2
  56. llama_model_loader: - type f32: 105 tensors
  57. llama_model_loader: - type q4_0: 182 tensors
  58. llama_model_loader: - type q6_K: 1 tensors
  59. llm_load_vocab: special tokens cache size = 249
  60. llm_load_vocab: token to piece cache size = 1.6014 MB
  61. llm_load_print_meta: format = GGUF V3 (latest)
  62. llm_load_print_meta: arch = gemma2
  63. llm_load_print_meta: vocab type = SPM
  64. llm_load_print_meta: n_vocab = 256000
  65. llm_load_print_meta: n_merges = 0
  66. llm_load_print_meta: vocab_only = 0
  67. llm_load_print_meta: n_ctx_train = 8192
  68. llm_load_print_meta: n_embd = 2304
  69. llm_load_print_meta: n_layer = 26
  70. llm_load_print_meta: n_head = 8
  71. llm_load_print_meta: n_head_kv = 4
  72. llm_load_print_meta: n_rot = 256
  73. llm_load_print_meta: n_swa = 4096
  74. llm_load_print_meta: n_embd_head_k = 256
  75. llm_load_print_meta: n_embd_head_v = 256
  76. llm_load_print_meta: n_gqa = 2
  77. llm_load_print_meta: n_embd_k_gqa = 1024
  78. llm_load_print_meta: n_embd_v_gqa = 1024
  79. llm_load_print_meta: f_norm_eps = 0.0e+00
  80. llm_load_print_meta: f_norm_rms_eps = 1.0e-06
  81. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  82. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  83. llm_load_print_meta: f_logit_scale = 0.0e+00
  84. llm_load_print_meta: n_ff = 9216
  85. llm_load_print_meta: n_expert = 0
  86. llm_load_print_meta: n_expert_used = 0
  87. llm_load_print_meta: causal attn = 1
  88. llm_load_print_meta: pooling type = 0
  89. llm_load_print_meta: rope type = 2
  90. llm_load_print_meta: rope scaling = linear
  91. llm_load_print_meta: freq_base_train = 10000.0
  92. llm_load_print_meta: freq_scale_train = 1
  93. llm_load_print_meta: n_ctx_orig_yarn = 8192
  94. llm_load_print_meta: rope_finetuned = unknown
  95. llm_load_print_meta: ssm_d_conv = 0
  96. llm_load_print_meta: ssm_d_inner = 0
  97. llm_load_print_meta: ssm_d_state = 0
  98. llm_load_print_meta: ssm_dt_rank = 0
  99. llm_load_print_meta: model type = 2B
  100. llm_load_print_meta: model ftype = Q4_0
  101. llm_load_print_meta: model params = 2.61 B
  102. llm_load_print_meta: model size = 1.51 GiB (4.97 BPW)
  103. llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers
  104. llm_load_print_meta: BOS token = 2 '<bos>'
  105. llm_load_print_meta: EOS token = 1 '<eos>'
  106. llm_load_print_meta: UNK token = 3 '<unk>'
  107. llm_load_print_meta: PAD token = 0 '<pad>'
  108. llm_load_print_meta: LF token = 227 '<0x0A>'
  109. llm_load_print_meta: EOT token = 107 '<end_of_turn>'
  110. llm_load_print_meta: max token length = 48
  111. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  112. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  113. ggml_cuda_init: found 1 CUDA devices:
  114. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  115. llm_load_tensors: ggml ctx size = 0.26 MiB
  116. llm_load_tensors: offloading 26 repeating layers to GPU
  117. llm_load_tensors: offloading non-repeating layers to GPU
  118. llm_load_tensors: offloaded 27/27 layers to GPU
  119. llm_load_tensors: CUDA_Host buffer size = 461.43 MiB
  120. llm_load_tensors: CUDA0 buffer size = 1548.29 MiB
  121. llama_new_context_with_model: n_ctx = 8192
  122. llama_new_context_with_model: n_batch = 512
  123. llama_new_context_with_model: n_ubatch = 512
  124. llama_new_context_with_model: flash_attn = 0
  125. llama_new_context_with_model: freq_base = 10000.0
  126. llama_new_context_with_model: freq_scale = 1
  127. llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB
  128. llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB
  129. llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB
  130. llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB
  131. llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB
  132. llama_new_context_with_model: graph nodes = 1050
  133. llama_new_context_with_model: graph splits = 2
  134. INFO [wmain] model loaded | tid="9372" timestamp=1723555557
  135. time=2024-08-13T09:25:57.268-04:00 level=INFO source=server.go:632 msg="llama runner started in 3.99 seconds"
  136. [GIN] 2024/08/13 - 09:25:57 | 200 | 4.1007296s | 127.0.0.1 | POST "/api/chat"
  137. [GIN] 2024/08/13 - 09:25:58 | 200 | 0s | 127.0.0.1 | HEAD "/"
  138. [GIN] 2024/08/13 - 09:26:12 | 200 | 8.8789117s | 127.0.0.1 | POST "/api/chat"
  139. [GIN] 2024/08/13 - 09:29:02 | 200 | 5.7099ms | 127.0.0.1 | GET "/api/tags"
  140. [GIN] 2024/08/13 - 09:29:06 | 200 | 3.427132s | 127.0.0.1 | POST "/api/chat"
  141. [GIN] 2024/08/13 - 09:29:14 | 200 | 7.9011742s | 127.0.0.1 | POST "/api/chat"
  142. time=2024-08-13T09:29:14.135-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
  143. time=2024-08-13T09:29:14.447-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6820524032 required="6.1 GiB"
  144. time=2024-08-13T09:29:14.447-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.4 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
  145. time=2024-08-13T09:29:14.455-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61318"
  146. time=2024-08-13T09:29:14.461-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
  147. time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
  148. time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
  149. INFO [wmain] build info | build=3535 commit="1e6f6554" tid="14936" timestamp=1723555754
  150. INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="14936" timestamp=1723555754 total_threads=16
  151. INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61318" tid="14936" timestamp=1723555754
  152. llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
  153. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  154. llama_model_loader: - kv 0: general.architecture str = phi3
  155. llama_model_loader: - kv 1: general.name str = Phi3
  156. llama_model_loader: - kv 2: phi3.context_length u32 = 131072
  157. llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
  158. llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
  159. llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
  160. llama_model_loader: - kv 6: phi3.block_count u32 = 32
  161. llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
  162. llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
  163. llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
  164. llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
  165. llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
  166. llama_model_loader: - kv 12: general.file_type u32 = 2
  167. llama_model_loader: - kv 13: phi3.rope.scaling.attn_factor f32 = 1.190238
  168. llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
  169. llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
  170. llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  171. llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
  172. llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  173. llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
  174. llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32000
  175. llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
  176. llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
  177. llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
  178. llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
  179. llama_model_loader: - kv 25: tokenizer.chat_template str = {% for message in messages %}{% if me...
  180. llama_model_loader: - kv 26: general.quantization_version u32 = 2
  181. llama_model_loader: - type f32: 67 tensors
  182. llama_model_loader: - type q4_0: 129 tensors
  183. llama_model_loader: - type q6_K: 1 tensors
  184. llm_load_vocab: special tokens cache size = 67
  185. llm_load_vocab: token to piece cache size = 0.1690 MB
  186. llm_load_print_meta: format = GGUF V3 (latest)
  187. llm_load_print_meta: arch = phi3
  188. llm_load_print_meta: vocab type = SPM
  189. llm_load_print_meta: n_vocab = 32064
  190. llm_load_print_meta: n_merges = 0
  191. llm_load_print_meta: vocab_only = 0
  192. llm_load_print_meta: n_ctx_train = 131072
  193. llm_load_print_meta: n_embd = 3072
  194. llm_load_print_meta: n_layer = 32
  195. llm_load_print_meta: n_head = 32
  196. llm_load_print_meta: n_head_kv = 32
  197. llm_load_print_meta: n_rot = 96
  198. llm_load_print_meta: n_swa = 0
  199. llm_load_print_meta: n_embd_head_k = 96
  200. llm_load_print_meta: n_embd_head_v = 96
  201. llm_load_print_meta: n_gqa = 1
  202. llm_load_print_meta: n_embd_k_gqa = 3072
  203. llm_load_print_meta: n_embd_v_gqa = 3072
  204. llm_load_print_meta: f_norm_eps = 0.0e+00
  205. llm_load_print_meta: f_norm_rms_eps = 1.0e-05
  206. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  207. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  208. llm_load_print_meta: f_logit_scale = 0.0e+00
  209. llm_load_print_meta: n_ff = 8192
  210. llm_load_print_meta: n_expert = 0
  211. llm_load_print_meta: n_expert_used = 0
  212. llm_load_print_meta: causal attn = 1
  213. llm_load_print_meta: pooling type = 0
  214. llm_load_print_meta: rope type = 2
  215. llm_load_print_meta: rope scaling = linear
  216. llm_load_print_meta: freq_base_train = 10000.0
  217. llm_load_print_meta: freq_scale_train = 1
  218. llm_load_print_meta: n_ctx_orig_yarn = 4096
  219. llm_load_print_meta: rope_finetuned = unknown
  220. llm_load_print_meta: ssm_d_conv = 0
  221. llm_load_print_meta: ssm_d_inner = 0
  222. llm_load_print_meta: ssm_d_state = 0
  223. llm_load_print_meta: ssm_dt_rank = 0
  224. llm_load_print_meta: model type = 3B
  225. llm_load_print_meta: model ftype = Q4_0
  226. llm_load_print_meta: model params = 3.82 B
  227. llm_load_print_meta: model size = 2.03 GiB (4.55 BPW)
  228. llm_load_print_meta: general.name = Phi3
  229. llm_load_print_meta: BOS token = 1 '<s>'
  230. llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
  231. llm_load_print_meta: UNK token = 0 '<unk>'
  232. llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
  233. llm_load_print_meta: LF token = 13 '<0x0A>'
  234. llm_load_print_meta: EOT token = 32007 '<|end|>'
  235. llm_load_print_meta: max token length = 48
  236. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  237. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  238. ggml_cuda_init: found 1 CUDA devices:
  239. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  240. time=2024-08-13T09:29:14.722-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
  241. llm_load_tensors: ggml ctx size = 0.21 MiB
  242. llm_load_tensors: offloading 32 repeating layers to GPU
  243. llm_load_tensors: offloading non-repeating layers to GPU
  244. llm_load_tensors: offloaded 33/33 layers to GPU
  245. llm_load_tensors: CUDA_Host buffer size = 52.84 MiB
  246. llm_load_tensors: CUDA0 buffer size = 2021.84 MiB
  247. llama_new_context_with_model: n_ctx = 8192
  248. llama_new_context_with_model: n_batch = 512
  249. llama_new_context_with_model: n_ubatch = 512
  250. llama_new_context_with_model: flash_attn = 0
  251. llama_new_context_with_model: freq_base = 10000.0
  252. llama_new_context_with_model: freq_scale = 1
  253. llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
  254. llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
  255. llama_new_context_with_model: CUDA_Host output buffer size = 0.54 MiB
  256. llama_new_context_with_model: CUDA0 compute buffer size = 564.00 MiB
  257. llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
  258. llama_new_context_with_model: graph nodes = 1286
  259. llama_new_context_with_model: graph splits = 2
  260. INFO [wmain] model loaded | tid="14936" timestamp=1723555757
  261. time=2024-08-13T09:29:17.340-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.88 seconds"
  262. [GIN] 2024/08/13 - 09:29:18 | 200 | 4.660208s | 127.0.0.1 | POST "/api/chat"
  263. [GIN] 2024/08/13 - 09:29:24 | 200 | 5.7101586s | 127.0.0.1 | POST "/api/chat"
  264. [GIN] 2024/08/13 - 09:29:43 | 200 | 1.3685ms | 127.0.0.1 | GET "/api/tags"
  265. time=2024-08-13T09:29:44.060-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="370.5 MiB"
  266. time=2024-08-13T09:29:44.403-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6819233792 required="3.3 GiB"
  267. time=2024-08-13T09:29:44.403-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.4 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
  268. time=2024-08-13T09:29:44.412-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61381"
  269. time=2024-08-13T09:29:44.417-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
  270. time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
  271. time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
  272. INFO [wmain] build info | build=3535 commit="1e6f6554" tid="22872" timestamp=1723555784
  273. INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="22872" timestamp=1723555784 total_threads=16
  274. INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61381" tid="22872" timestamp=1723555784
  275. llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
  276. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  277. llama_model_loader: - kv 0: general.architecture str = gemma2
  278. llama_model_loader: - kv 1: general.type str = model
  279. llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
  280. llama_model_loader: - kv 3: general.finetune str = it-transformers
  281. llama_model_loader: - kv 4: general.basename str = gemma-2.0
  282. llama_model_loader: - kv 5: general.size_label str = 2B
  283. llama_model_loader: - kv 6: general.license str = gemma
  284. llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
  285. llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
  286. llama_model_loader: - kv 9: gemma2.block_count u32 = 26
  287. llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
  288. llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
  289. llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
  290. llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
  291. llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
  292. llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
  293. llama_model_loader: - kv 16: general.file_type u32 = 2
  294. llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
  295. llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
  296. llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
  297. llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
  298. llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
  299. llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
  300. llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
  301. llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
  302. llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
  303. llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
  304. llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
  305. llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
  306. llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
  307. llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
  308. llama_model_loader: - kv 31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
  309. llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
  310. llama_model_loader: - kv 33: general.quantization_version u32 = 2
  311. llama_model_loader: - type f32: 105 tensors
  312. llama_model_loader: - type q4_0: 182 tensors
  313. llama_model_loader: - type q6_K: 1 tensors
  314. time=2024-08-13T09:29:44.673-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
  315. llm_load_vocab: special tokens cache size = 249
  316. llm_load_vocab: token to piece cache size = 1.6014 MB
  317. llm_load_print_meta: format = GGUF V3 (latest)
  318. llm_load_print_meta: arch = gemma2
  319. llm_load_print_meta: vocab type = SPM
  320. llm_load_print_meta: n_vocab = 256000
  321. llm_load_print_meta: n_merges = 0
  322. llm_load_print_meta: vocab_only = 0
  323. llm_load_print_meta: n_ctx_train = 8192
  324. llm_load_print_meta: n_embd = 2304
  325. llm_load_print_meta: n_layer = 26
  326. llm_load_print_meta: n_head = 8
  327. llm_load_print_meta: n_head_kv = 4
  328. llm_load_print_meta: n_rot = 256
  329. llm_load_print_meta: n_swa = 4096
  330. llm_load_print_meta: n_embd_head_k = 256
  331. llm_load_print_meta: n_embd_head_v = 256
  332. llm_load_print_meta: n_gqa = 2
  333. llm_load_print_meta: n_embd_k_gqa = 1024
  334. llm_load_print_meta: n_embd_v_gqa = 1024
  335. llm_load_print_meta: f_norm_eps = 0.0e+00
  336. llm_load_print_meta: f_norm_rms_eps = 1.0e-06
  337. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  338. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  339. llm_load_print_meta: f_logit_scale = 0.0e+00
  340. llm_load_print_meta: n_ff = 9216
  341. llm_load_print_meta: n_expert = 0
  342. llm_load_print_meta: n_expert_used = 0
  343. llm_load_print_meta: causal attn = 1
  344. llm_load_print_meta: pooling type = 0
  345. llm_load_print_meta: rope type = 2
  346. llm_load_print_meta: rope scaling = linear
  347. llm_load_print_meta: freq_base_train = 10000.0
  348. llm_load_print_meta: freq_scale_train = 1
  349. llm_load_print_meta: n_ctx_orig_yarn = 8192
  350. llm_load_print_meta: rope_finetuned = unknown
  351. llm_load_print_meta: ssm_d_conv = 0
  352. llm_load_print_meta: ssm_d_inner = 0
  353. llm_load_print_meta: ssm_d_state = 0
  354. llm_load_print_meta: ssm_dt_rank = 0
  355. llm_load_print_meta: model type = 2B
  356. llm_load_print_meta: model ftype = Q4_0
  357. llm_load_print_meta: model params = 2.61 B
  358. llm_load_print_meta: model size = 1.51 GiB (4.97 BPW)
  359. llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers
  360. llm_load_print_meta: BOS token = 2 '<bos>'
  361. llm_load_print_meta: EOS token = 1 '<eos>'
  362. llm_load_print_meta: UNK token = 3 '<unk>'
  363. llm_load_print_meta: PAD token = 0 '<pad>'
  364. llm_load_print_meta: LF token = 227 '<0x0A>'
  365. llm_load_print_meta: EOT token = 107 '<end_of_turn>'
  366. llm_load_print_meta: max token length = 48
  367. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  368. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  369. ggml_cuda_init: found 1 CUDA devices:
  370. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  371. llm_load_tensors: ggml ctx size = 0.26 MiB
  372. llm_load_tensors: offloading 26 repeating layers to GPU
  373. llm_load_tensors: offloading non-repeating layers to GPU
  374. llm_load_tensors: offloaded 27/27 layers to GPU
  375. llm_load_tensors: CUDA_Host buffer size = 461.43 MiB
  376. llm_load_tensors: CUDA0 buffer size = 1548.29 MiB
  377. llama_new_context_with_model: n_ctx = 8192
  378. llama_new_context_with_model: n_batch = 512
  379. llama_new_context_with_model: n_ubatch = 512
  380. llama_new_context_with_model: flash_attn = 0
  381. llama_new_context_with_model: freq_base = 10000.0
  382. llama_new_context_with_model: freq_scale = 1
  383. llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB
  384. llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB
  385. llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB
  386. llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB
  387. llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB
  388. llama_new_context_with_model: graph nodes = 1050
  389. llama_new_context_with_model: graph splits = 2
  390. INFO [wmain] model loaded | tid="22872" timestamp=1723555786
  391. time=2024-08-13T09:29:46.550-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.13 seconds"
  392. [GIN] 2024/08/13 - 09:29:49 | 200 | 5.8284308s | 127.0.0.1 | POST "/api/chat"
  393. [GIN] 2024/08/13 - 09:29:59 | 200 | 9.386519s | 127.0.0.1 | POST "/api/chat"
  394. time=2024-08-13T09:29:59.225-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
  395. time=2024-08-13T09:29:59.522-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6812811264 required="6.1 GiB"
  396. time=2024-08-13T09:29:59.523-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.3 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
  397. time=2024-08-13T09:29:59.531-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61409"
  398. time=2024-08-13T09:29:59.536-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
  399. time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
  400. time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
  401. INFO [wmain] build info | build=3535 commit="1e6f6554" tid="12124" timestamp=1723555799
  402. INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="12124" timestamp=1723555799 total_threads=16
  403. INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61409" tid="12124" timestamp=1723555799
  404. llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
  405. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  406. llama_model_loader: - kv 0: general.architecture str = phi3
  407. llama_model_loader: - kv 1: general.name str = Phi3
  408. llama_model_loader: - kv 2: phi3.context_length u32 = 131072
  409. llama_model_loader: - kv 3: phi3.rope.scaling.original_context_length u32 = 4096
  410. llama_model_loader: - kv 4: phi3.embedding_length u32 = 3072
  411. llama_model_loader: - kv 5: phi3.feed_forward_length u32 = 8192
  412. llama_model_loader: - kv 6: phi3.block_count u32 = 32
  413. llama_model_loader: - kv 7: phi3.attention.head_count u32 = 32
  414. llama_model_loader: - kv 8: phi3.attention.head_count_kv u32 = 32
  415. llama_model_loader: - kv 9: phi3.attention.layer_norm_rms_epsilon f32 = 0.000010
  416. llama_model_loader: - kv 10: phi3.rope.dimension_count u32 = 96
  417. llama_model_loader: - kv 11: phi3.rope.freq_base f32 = 10000.000000
  418. llama_model_loader: - kv 12: general.file_type u32 = 2
  419. llama_model_loader: - kv 13: phi3.rope.scaling.attn_factor f32 = 1.190238
  420. llama_model_loader: - kv 14: tokenizer.ggml.model str = llama
  421. llama_model_loader: - kv 15: tokenizer.ggml.pre str = default
  422. llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32064] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
  423. llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32064] = [-1000.000000, -1000.000000, -1000.00...
  424. llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32064] = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
  425. llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1
  426. llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 32000
  427. llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0
  428. llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 32000
  429. llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
  430. llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
  431. llama_model_loader: - kv 25: tokenizer.chat_template str = {% for message in messages %}{% if me...
  432. llama_model_loader: - kv 26: general.quantization_version u32 = 2
  433. llama_model_loader: - type f32: 67 tensors
  434. llama_model_loader: - type q4_0: 129 tensors
  435. llama_model_loader: - type q6_K: 1 tensors
  436. llm_load_vocab: special tokens cache size = 67
  437. llm_load_vocab: token to piece cache size = 0.1690 MB
  438. llm_load_print_meta: format = GGUF V3 (latest)
  439. llm_load_print_meta: arch = phi3
  440. llm_load_print_meta: vocab type = SPM
  441. llm_load_print_meta: n_vocab = 32064
  442. llm_load_print_meta: n_merges = 0
  443. llm_load_print_meta: vocab_only = 0
  444. llm_load_print_meta: n_ctx_train = 131072
  445. llm_load_print_meta: n_embd = 3072
  446. llm_load_print_meta: n_layer = 32
  447. llm_load_print_meta: n_head = 32
  448. llm_load_print_meta: n_head_kv = 32
  449. llm_load_print_meta: n_rot = 96
  450. llm_load_print_meta: n_swa = 0
  451. llm_load_print_meta: n_embd_head_k = 96
  452. llm_load_print_meta: n_embd_head_v = 96
  453. llm_load_print_meta: n_gqa = 1
  454. llm_load_print_meta: n_embd_k_gqa = 3072
  455. llm_load_print_meta: n_embd_v_gqa = 3072
  456. llm_load_print_meta: f_norm_eps = 0.0e+00
  457. llm_load_print_meta: f_norm_rms_eps = 1.0e-05
  458. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  459. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  460. llm_load_print_meta: f_logit_scale = 0.0e+00
  461. llm_load_print_meta: n_ff = 8192
  462. llm_load_print_meta: n_expert = 0
  463. llm_load_print_meta: n_expert_used = 0
  464. llm_load_print_meta: causal attn = 1
  465. llm_load_print_meta: pooling type = 0
  466. llm_load_print_meta: rope type = 2
  467. llm_load_print_meta: rope scaling = linear
  468. llm_load_print_meta: freq_base_train = 10000.0
  469. llm_load_print_meta: freq_scale_train = 1
  470. llm_load_print_meta: n_ctx_orig_yarn = 4096
  471. llm_load_print_meta: rope_finetuned = unknown
  472. llm_load_print_meta: ssm_d_conv = 0
  473. llm_load_print_meta: ssm_d_inner = 0
  474. llm_load_print_meta: ssm_d_state = 0
  475. llm_load_print_meta: ssm_dt_rank = 0
  476. llm_load_print_meta: model type = 3B
  477. llm_load_print_meta: model ftype = Q4_0
  478. llm_load_print_meta: model params = 3.82 B
  479. llm_load_print_meta: model size = 2.03 GiB (4.55 BPW)
  480. llm_load_print_meta: general.name = Phi3
  481. llm_load_print_meta: BOS token = 1 '<s>'
  482. llm_load_print_meta: EOS token = 32000 '<|endoftext|>'
  483. llm_load_print_meta: UNK token = 0 '<unk>'
  484. llm_load_print_meta: PAD token = 32000 '<|endoftext|>'
  485. llm_load_print_meta: LF token = 13 '<0x0A>'
  486. llm_load_print_meta: EOT token = 32007 '<|end|>'
  487. llm_load_print_meta: max token length = 48
  488. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  489. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  490. ggml_cuda_init: found 1 CUDA devices:
  491. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  492. time=2024-08-13T09:29:59.798-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
  493. llm_load_tensors: ggml ctx size = 0.21 MiB
  494. llm_load_tensors: offloading 32 repeating layers to GPU
  495. llm_load_tensors: offloading non-repeating layers to GPU
  496. llm_load_tensors: offloaded 33/33 layers to GPU
  497. llm_load_tensors: CUDA_Host buffer size = 52.84 MiB
  498. llm_load_tensors: CUDA0 buffer size = 2021.84 MiB
  499. llama_new_context_with_model: n_ctx = 8192
  500. llama_new_context_with_model: n_batch = 512
  501. llama_new_context_with_model: n_ubatch = 512
  502. llama_new_context_with_model: flash_attn = 0
  503. llama_new_context_with_model: freq_base = 10000.0
  504. llama_new_context_with_model: freq_scale = 1
  505. llama_kv_cache_init: CUDA0 KV buffer size = 3072.00 MiB
  506. llama_new_context_with_model: KV self size = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
  507. llama_new_context_with_model: CUDA_Host output buffer size = 0.54 MiB
  508. llama_new_context_with_model: CUDA0 compute buffer size = 564.00 MiB
  509. llama_new_context_with_model: CUDA_Host compute buffer size = 22.01 MiB
  510. llama_new_context_with_model: graph nodes = 1286
  511. llama_new_context_with_model: graph splits = 2
  512. INFO [wmain] model loaded | tid="12124" timestamp=1723555801
  513. time=2024-08-13T09:30:01.721-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.18 seconds"
  514. [GIN] 2024/08/13 - 09:30:08 | 200 | 9.7045794s | 127.0.0.1 | POST "/api/chat"
  515. [GIN] 2024/08/13 - 09:30:19 | 200 | 11.0502895s | 127.0.0.1 | POST "/api/chat"
  516. [GIN] 2024/08/13 - 09:31:52 | 200 | 0s | 127.0.0.1 | HEAD "/"
  517. [GIN] 2024/08/13 - 09:31:52 | 500 | 535.8µs | 127.0.0.1 | DELETE "/api/delete"
  518. [GIN] 2024/08/13 - 09:32:00 | 200 | 0s | 127.0.0.1 | HEAD "/"
  519. [GIN] 2024/08/13 - 09:32:01 | 200 | 179.651ms | 127.0.0.1 | DELETE "/api/delete"
  520. [GIN] 2024/08/13 - 09:32:04 | 200 | 783µs | 127.0.0.1 | GET "/api/tags"
  521. time=2024-08-13T09:32:04.557-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="472.3 MiB"
  522. time=2024-08-13T09:32:04.891-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6926028800 required="3.3 GiB"
  523. time=2024-08-13T09:32:04.891-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.5 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
  524. time=2024-08-13T09:32:04.899-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61980"
  525. time=2024-08-13T09:32:04.904-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
  526. time=2024-08-13T09:32:04.904-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
  527. time=2024-08-13T09:32:04.905-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
  528. INFO [wmain] build info | build=3535 commit="1e6f6554" tid="20144" timestamp=1723555924
  529. INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="20144" timestamp=1723555924 total_threads=16
  530. INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61980" tid="20144" timestamp=1723555924
  531. llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
  532. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  533. llama_model_loader: - kv 0: general.architecture str = gemma2
  534. llama_model_loader: - kv 1: general.type str = model
  535. llama_model_loader: - kv 2: general.name str = Gemma 2.0 2b It Transformers
  536. llama_model_loader: - kv 3: general.finetune str = it-transformers
  537. llama_model_loader: - kv 4: general.basename str = gemma-2.0
  538. llama_model_loader: - kv 5: general.size_label str = 2B
  539. llama_model_loader: - kv 6: general.license str = gemma
  540. llama_model_loader: - kv 7: gemma2.context_length u32 = 8192
  541. llama_model_loader: - kv 8: gemma2.embedding_length u32 = 2304
  542. llama_model_loader: - kv 9: gemma2.block_count u32 = 26
  543. llama_model_loader: - kv 10: gemma2.feed_forward_length u32 = 9216
  544. llama_model_loader: - kv 11: gemma2.attention.head_count u32 = 8
  545. llama_model_loader: - kv 12: gemma2.attention.head_count_kv u32 = 4
  546. llama_model_loader: - kv 13: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
  547. llama_model_loader: - kv 14: gemma2.attention.key_length u32 = 256
  548. llama_model_loader: - kv 15: gemma2.attention.value_length u32 = 256
  549. llama_model_loader: - kv 16: general.file_type u32 = 2
  550. llama_model_loader: - kv 17: gemma2.attn_logit_softcapping f32 = 50.000000
  551. llama_model_loader: - kv 18: gemma2.final_logit_softcapping f32 = 30.000000
  552. llama_model_loader: - kv 19: gemma2.attention.sliding_window u32 = 4096
  553. llama_model_loader: - kv 20: tokenizer.ggml.model str = llama
  554. llama_model_loader: - kv 21: tokenizer.ggml.pre str = default
  555. llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
  556. llama_model_loader: - kv 23: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
  557. llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
  558. llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 2
  559. llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 1
  560. llama_model_loader: - kv 27: tokenizer.ggml.unknown_token_id u32 = 3
  561. llama_model_loader: - kv 28: tokenizer.ggml.padding_token_id u32 = 0
  562. llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = true
  563. llama_model_loader: - kv 30: tokenizer.ggml.add_eos_token bool = false
  564. llama_model_loader: - kv 31: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
  565. llama_model_loader: - kv 32: tokenizer.ggml.add_space_prefix bool = false
  566. llama_model_loader: - kv 33: general.quantization_version u32 = 2
  567. llama_model_loader: - type f32: 105 tensors
  568. llama_model_loader: - type q4_0: 182 tensors
  569. llama_model_loader: - type q6_K: 1 tensors
  570. time=2024-08-13T09:32:05.169-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
  571. llm_load_vocab: special tokens cache size = 249
  572. llm_load_vocab: token to piece cache size = 1.6014 MB
  573. llm_load_print_meta: format = GGUF V3 (latest)
  574. llm_load_print_meta: arch = gemma2
  575. llm_load_print_meta: vocab type = SPM
  576. llm_load_print_meta: n_vocab = 256000
  577. llm_load_print_meta: n_merges = 0
  578. llm_load_print_meta: vocab_only = 0
  579. llm_load_print_meta: n_ctx_train = 8192
  580. llm_load_print_meta: n_embd = 2304
  581. llm_load_print_meta: n_layer = 26
  582. llm_load_print_meta: n_head = 8
  583. llm_load_print_meta: n_head_kv = 4
  584. llm_load_print_meta: n_rot = 256
  585. llm_load_print_meta: n_swa = 4096
  586. llm_load_print_meta: n_embd_head_k = 256
  587. llm_load_print_meta: n_embd_head_v = 256
  588. llm_load_print_meta: n_gqa = 2
  589. llm_load_print_meta: n_embd_k_gqa = 1024
  590. llm_load_print_meta: n_embd_v_gqa = 1024
  591. llm_load_print_meta: f_norm_eps = 0.0e+00
  592. llm_load_print_meta: f_norm_rms_eps = 1.0e-06
  593. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  594. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  595. llm_load_print_meta: f_logit_scale = 0.0e+00
  596. llm_load_print_meta: n_ff = 9216
  597. llm_load_print_meta: n_expert = 0
  598. llm_load_print_meta: n_expert_used = 0
  599. llm_load_print_meta: causal attn = 1
  600. llm_load_print_meta: pooling type = 0
  601. llm_load_print_meta: rope type = 2
  602. llm_load_print_meta: rope scaling = linear
  603. llm_load_print_meta: freq_base_train = 10000.0
  604. llm_load_print_meta: freq_scale_train = 1
  605. llm_load_print_meta: n_ctx_orig_yarn = 8192
  606. llm_load_print_meta: rope_finetuned = unknown
  607. llm_load_print_meta: ssm_d_conv = 0
  608. llm_load_print_meta: ssm_d_inner = 0
  609. llm_load_print_meta: ssm_d_state = 0
  610. llm_load_print_meta: ssm_dt_rank = 0
  611. llm_load_print_meta: model type = 2B
  612. llm_load_print_meta: model ftype = Q4_0
  613. llm_load_print_meta: model params = 2.61 B
  614. llm_load_print_meta: model size = 1.51 GiB (4.97 BPW)
  615. llm_load_print_meta: general.name = Gemma 2.0 2b It Transformers
  616. llm_load_print_meta: BOS token = 2 '<bos>'
  617. llm_load_print_meta: EOS token = 1 '<eos>'
  618. llm_load_print_meta: UNK token = 3 '<unk>'
  619. llm_load_print_meta: PAD token = 0 '<pad>'
  620. llm_load_print_meta: LF token = 227 '<0x0A>'
  621. llm_load_print_meta: EOT token = 107 '<end_of_turn>'
  622. llm_load_print_meta: max token length = 48
  623. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  624. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  625. ggml_cuda_init: found 1 CUDA devices:
  626. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  627. llm_load_tensors: ggml ctx size = 0.26 MiB
  628. llm_load_tensors: offloading 26 repeating layers to GPU
  629. llm_load_tensors: offloading non-repeating layers to GPU
  630. llm_load_tensors: offloaded 27/27 layers to GPU
  631. llm_load_tensors: CUDA_Host buffer size = 461.43 MiB
  632. llm_load_tensors: CUDA0 buffer size = 1548.29 MiB
  633. llama_new_context_with_model: n_ctx = 8192
  634. llama_new_context_with_model: n_batch = 512
  635. llama_new_context_with_model: n_ubatch = 512
  636. llama_new_context_with_model: flash_attn = 0
  637. llama_new_context_with_model: freq_base = 10000.0
  638. llama_new_context_with_model: freq_scale = 1
  639. llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB
  640. llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB
  641. llama_new_context_with_model: CUDA_Host output buffer size = 3.94 MiB
  642. llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB
  643. llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB
  644. llama_new_context_with_model: graph nodes = 1050
  645. llama_new_context_with_model: graph splits = 2
  646. INFO [wmain] model loaded | tid="20144" timestamp=1723555926
  647. time=2024-08-13T09:32:07.053-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.15 seconds"
  648. [GIN] 2024/08/13 - 09:32:10 | 200 | 5.9310484s | 127.0.0.1 | POST "/api/chat"
  649. [GIN] 2024/08/13 - 09:32:18 | 200 | 8.0926735s | 127.0.0.1 | POST "/api/chat"
  650. [GIN] 2024/08/13 - 09:41:44 | 200 | 510.7µs | 127.0.0.1 | GET "/api/version"
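
One detail visible in the log above: the gemma2 blob that Ollama loads reports "model ftype = Q4_0" (1.51 GiB), while the llama.cpp run in the question uses gemma-2-2b-it-Q4_K_M.gguf, so the two measurements are not on the same quantization. A minimal sketch, assuming the default server address, of confirming which quantization a given Ollama tag resolves to via the /api/show endpoint seen in the log (the gemma2:2b tag is a placeholder):

    # Minimal sketch: inspect which quantization an Ollama tag maps to via /api/show.
    import json
    import urllib.request

    req = urllib.request.Request(
        "http://127.0.0.1:11434/api/show",
        data=json.dumps({"name": "gemma2:2b"}).encode(),  # placeholder tag name
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        info = json.load(resp)

    # The "details" block reports family, parameter size and quantization level.
    print(info.get("details", {}))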
nwsw7zdq3#

In case it helps, here is the llama.cpp output:

  1. \llama-b3542-bin-win-cuda-cu12.2.0-x64> .\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
  2. Log start
  3. main: build = 3542 (15fa07a5)
  4. main: built with MSVC 19.29.30154.0 for x64
  5. main: seed = 1723556264
  6. llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest))
  7. llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
  8. llama_model_loader: - kv 0: general.architecture str = gemma2
  9. llama_model_loader: - kv 1: general.type str = model
  10. llama_model_loader: - kv 2: general.name str = Gemma 2 2b It
  11. llama_model_loader: - kv 3: general.finetune str = it
  12. llama_model_loader: - kv 4: general.basename str = gemma-2
  13. llama_model_loader: - kv 5: general.size_label str = 2B
  14. llama_model_loader: - kv 6: general.license str = gemma
  15. llama_model_loader: - kv 7: general.tags arr[str,2] = ["conversational", "text-generation"]
  16. llama_model_loader: - kv 8: gemma2.context_length u32 = 8192
  17. llama_model_loader: - kv 9: gemma2.embedding_length u32 = 2304
  18. llama_model_loader: - kv 10: gemma2.block_count u32 = 26
  19. llama_model_loader: - kv 11: gemma2.feed_forward_length u32 = 9216
  20. llama_model_loader: - kv 12: gemma2.attention.head_count u32 = 8
  21. llama_model_loader: - kv 13: gemma2.attention.head_count_kv u32 = 4
  22. llama_model_loader: - kv 14: gemma2.attention.layer_norm_rms_epsilon f32 = 0.000001
  23. llama_model_loader: - kv 15: gemma2.attention.key_length u32 = 256
  24. llama_model_loader: - kv 16: gemma2.attention.value_length u32 = 256
  25. llama_model_loader: - kv 17: general.file_type u32 = 15
  26. llama_model_loader: - kv 18: gemma2.attn_logit_softcapping f32 = 50.000000
  27. llama_model_loader: - kv 19: gemma2.final_logit_softcapping f32 = 30.000000
  28. llama_model_loader: - kv 20: gemma2.attention.sliding_window u32 = 4096
  29. llama_model_loader: - kv 21: tokenizer.ggml.model str = llama
  30. llama_model_loader: - kv 22: tokenizer.ggml.pre str = default
  31. llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,256000] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
  32. llama_model_loader: - kv 24: tokenizer.ggml.scores arr[f32,256000] = [-1000.000000, -1000.000000, -1000.00...
  33. llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,256000] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
  34. llama_model_loader: - kv 26: tokenizer.ggml.bos_token_id u32 = 2
  35. llama_model_loader: - kv 27: tokenizer.ggml.eos_token_id u32 = 1
  36. llama_model_loader: - kv 28: tokenizer.ggml.unknown_token_id u32 = 3
  37. llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 0
  38. llama_model_loader: - kv 30: tokenizer.ggml.add_bos_token bool = true
  39. llama_model_loader: - kv 31: tokenizer.ggml.add_eos_token bool = false
  40. llama_model_loader: - kv 32: tokenizer.chat_template str = {{ bos_token }}{% if messages[0]['rol...
  41. llama_model_loader: - kv 33: tokenizer.ggml.add_space_prefix bool = false
  42. llama_model_loader: - kv 34: general.quantization_version u32 = 2
  43. llama_model_loader: - kv 35: quantize.imatrix.file str = /models_out/gemma-2-2b-it-GGUF/gemma-...
  44. llama_model_loader: - kv 36: quantize.imatrix.dataset str = /training_dir/calibration_datav3.txt
  45. llama_model_loader: - kv 37: quantize.imatrix.entries_count i32 = 182
  46. llama_model_loader: - kv 38: quantize.imatrix.chunks_count i32 = 128
  47. llama_model_loader: - type f32: 105 tensors
  48. llama_model_loader: - type q4_K: 156 tensors
  49. llama_model_loader: - type q6_K: 27 tensors
  50. llm_load_vocab: special tokens cache size = 249
  51. llm_load_vocab: token to piece cache size = 1.6014 MB
  52. llm_load_print_meta: format = GGUF V3 (latest)
  53. llm_load_print_meta: arch = gemma2
  54. llm_load_print_meta: vocab type = SPM
  55. llm_load_print_meta: n_vocab = 256000
  56. llm_load_print_meta: n_merges = 0
  57. llm_load_print_meta: vocab_only = 0
  58. llm_load_print_meta: n_ctx_train = 8192
  59. llm_load_print_meta: n_embd = 2304
  60. llm_load_print_meta: n_layer = 26
  61. llm_load_print_meta: n_head = 8
  62. llm_load_print_meta: n_head_kv = 4
  63. llm_load_print_meta: n_rot = 256
  64. llm_load_print_meta: n_swa = 4096
  65. llm_load_print_meta: n_embd_head_k = 256
  66. llm_load_print_meta: n_embd_head_v = 256
  67. llm_load_print_meta: n_gqa = 2
  68. llm_load_print_meta: n_embd_k_gqa = 1024
  69. llm_load_print_meta: n_embd_v_gqa = 1024
  70. llm_load_print_meta: f_norm_eps = 0.0e+00
  71. llm_load_print_meta: f_norm_rms_eps = 1.0e-06
  72. llm_load_print_meta: f_clamp_kqv = 0.0e+00
  73. llm_load_print_meta: f_max_alibi_bias = 0.0e+00
  74. llm_load_print_meta: f_logit_scale = 0.0e+00
  75. llm_load_print_meta: n_ff = 9216
  76. llm_load_print_meta: n_expert = 0
  77. llm_load_print_meta: n_expert_used = 0
  78. llm_load_print_meta: causal attn = 1
  79. llm_load_print_meta: pooling type = 0
  80. llm_load_print_meta: rope type = 2
  81. llm_load_print_meta: rope scaling = linear
  82. llm_load_print_meta: freq_base_train = 10000.0
  83. llm_load_print_meta: freq_scale_train = 1
  84. llm_load_print_meta: n_ctx_orig_yarn = 8192
  85. llm_load_print_meta: rope_finetuned = unknown
  86. llm_load_print_meta: ssm_d_conv = 0
  87. llm_load_print_meta: ssm_d_inner = 0
  88. llm_load_print_meta: ssm_d_state = 0
  89. llm_load_print_meta: ssm_dt_rank = 0
  90. llm_load_print_meta: model type = 2B
  91. llm_load_print_meta: model ftype = Q4_K - Medium
  92. llm_load_print_meta: model params = 2.61 B
  93. llm_load_print_meta: model size = 1.59 GiB (5.21 BPW)
  94. llm_load_print_meta: general.name = Gemma 2 2b It
  95. llm_load_print_meta: BOS token = 2 '<bos>'
  96. llm_load_print_meta: EOS token = 1 '<eos>'
  97. llm_load_print_meta: UNK token = 3 '<unk>'
  98. llm_load_print_meta: PAD token = 0 '<pad>'
  99. llm_load_print_meta: LF token = 227 '<0x0A>'
  100. llm_load_print_meta: EOT token = 107 '<end_of_turn>'
  101. llm_load_print_meta: max token length = 48
  102. ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
  103. ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
  104. ggml_cuda_init: found 1 CUDA devices:
  105. Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
  106. llm_load_tensors: ggml ctx size = 0.26 MiB
  107. llm_load_tensors: offloading 26 repeating layers to GPU
  108. llm_load_tensors: offloading non-repeating layers to GPU
  109. llm_load_tensors: offloaded 27/27 layers to GPU
  110. llm_load_tensors: CPU buffer size = 461.43 MiB
  111. llm_load_tensors: CUDA0 buffer size = 1623.70 MiB
  112. ..........................................................
  113. llama_new_context_with_model: n_ctx = 8192
  114. llama_new_context_with_model: n_batch = 512
  115. llama_new_context_with_model: n_ubatch = 512
  116. llama_new_context_with_model: flash_attn = 0
  117. llama_new_context_with_model: freq_base = 10000.0
  118. llama_new_context_with_model: freq_scale = 1
  119. llama_kv_cache_init: CUDA0 KV buffer size = 832.00 MiB
  120. llama_new_context_with_model: KV self size = 832.00 MiB, K (f16): 416.00 MiB, V (f16): 416.00 MiB
  121. llama_new_context_with_model: CUDA_Host output buffer size = 3.91 MiB
  122. llama_new_context_with_model: CUDA0 compute buffer size = 504.50 MiB
  123. llama_new_context_with_model: CUDA_Host compute buffer size = 36.51 MiB
  124. llama_new_context_with_model: graph nodes = 1050
  125. llama_new_context_with_model: graph splits = 2
  126. main: chat template example: <start_of_turn>user
  127. You are a helpful assistant
  128. Hello<end_of_turn>
  129. <start_of_turn>model
  130. Hi there<end_of_turn>
  131. <start_of_turn>user
  132. How are you?<end_of_turn>
  133. <start_of_turn>model
  134. system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
  135. main: interactive mode on.
  136. sampling:
  137. repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
  138. top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
  139. mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
  140. sampling order:
  141. CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
  142. generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 1
  143. == Running in interactive mode. ==
  144. - Press Ctrl+C to interject at any time.
  145. - Press Return to return control to the AI.
  146. - To return control without starting a new line, end your input with '/'.
  147. - If you want to submit another line, end your input with '\'.
  148. > What is the sky blue?
  149. The sky appears blue due to a phenomenon called **Rayleigh scattering**. Here's a breakdown:
  150. 1. **Sunlight and its Colors:** Sunlight contains all colors of the rainbow, each with its own wavelength (like visible light).
  151. 2. **Earth's Atmosphere:** Our atmosphere is composed mostly of nitrogen and oxygen molecules.
  152. 3. **Scattering:** When sunlight enters the atmosphere, it interacts with these tiny molecules. The shorter wavelengths of light (blue and violet) are scattered more strongly than longer wavelengths like red and orange.
  153. 4. **Human Perception:** Our eyes are most sensitive to blue light, meaning we perceive this scattered light as the dominant color of the sky.
  154. **Why not other colors?**
  155. * **Violet:** While violet light is scattered even more intensely than blue, our eyes are less sensitive to it, so we don't see it as prominently in the daytime sky.
  156. * **Red and Orange:** These longer wavelengths are scattered less, which is why we see them as dominant during sunrise and sunset.
  157. **In summary:** The blue sky is a result of sunlight being scattered by our atmosphere's molecules, making blue light dominate the color we perceive.
  158. Let me know if you have any further questions!
  159. > Write a report on the financials of Nvidia
  160. ## Nvidia Financial Snapshot: A Deep Dive
  161. This report provides an overview of Nvidia's financial performance, analyzing key financial metrics and identifying key trends.
  162. **Q1 & Q2 2023 Performance:**
  163. * **Revenue**: Strong revenue growth continued in both Q1 and Q2 2023, driven by robust demand for data centers and AI solutions.
  164. * Q1 2023: $7.68 billion (up 14% year-over-year)
  165. * Q2 2023: $8.85 billion (up 29% year-over-year)
  166. * **Net Income**: Nvidia's net income saw a significant increase in both quarters, reflecting the company's strong performance and efficient cost management.
  167. * Q1 2023: $1.94 billion (up 68% year-over-year)
  168. * Q2 2023: $2.17 billion (up 64% year-over-year)
  169. * **Earnings per Share**: EPS also saw significant growth, reflecting the company's profitability and strong financial position.
  170. * Q1 2023: $0.85 per share
  171. * Q2 2023: $1.16 per share
  172. **Drivers of Financial Success:**
  173. * **Data Center Market:** Nvidia's data center business has been a key driver of revenue growth, fueled by demand for its GPUs (Graphics Processing Units) used in AI training and cloud computing.
  174. * **Gaming Segment**: While facing headwinds from increased competition, the gaming segment remains a significant contributor to Nvidia's revenue, benefiting from strong demand for high-performance graphics cards.
  175. * **Automotive Sector:** The company's automotive segment has been experiencing rapid growth, driven by its technology enabling autonomous driving features and connected vehicles.
  176. **Challenges & Risks:**
  177. * **Geopolitical Tensions**: The ongoing geopolitical tensions create uncertainty in the global economy, potentially impacting demand for Nvidia's products in various sectors.
  178. * **Competition**: Competition within the GPU market is intensifying as rival companies like AMD and Intel aggressively enter this space.
  179. * **Macroeconomic Factors**: Economic slowdown and rising inflation pose challenges to overall demand across industries, including Nvidia's key markets.
  180. **Future Outlook:**
  181. * **Continued Growth in Data Centers & AI:** Nvidia expects sustained growth in data center and AI segments as companies invest heavily in cloud computing and artificial intelligence development.
  182. * **Expansion into Automotive and Other Emerging Sectors:** Nvidia is actively pursuing expansion opportunities in automotive, gaming, and other emerging markets to diversify its revenue streams.
  183. **Key Financial Ratios:**
  184. * **Profit Margin**: Nvidia has maintained a high profit margin across recent quarters, reflecting its focus on efficient operations and strong pricing strategies.
  185. * **Return on Equity (ROE)**: The company continues to deliver strong returns on shareholder equity, indicating efficient capital allocation and strong profitability.
  186. * **Debt-to-Equity Ratio**: Nvidia maintains a relatively low debt-to-equity ratio, demonstrating its sound financial position and ability to manage leverage effectively.
  187. **Conclusion:**
  188. Nvidia's financial performance remains strong, driven by robust demand for its technology across multiple market segments. The company has a clear strategic focus on data centers, AI, automotive, and gaming, positioning it well for future growth. However, the company faces challenges from increased competition, geopolitical tensions, and macroeconomic uncertainties.
  189. **Disclaimer:** This report is based on publicly available financial information and should not be construed as financial advice. Please consult with a qualified professional before making any investment decisions.
  190. >
  191. llama_print_timings: load time = 1626.24 ms
  192. llama_print_timings: sample time = 1444.86 ms / 1034 runs ( 1.40 ms per token, 715.64 tokens per second)
  193. llama_print_timings: prompt eval time = 49812.18 ms / 33 tokens ( 1509.46 ms per token, 0.66 tokens per second)
  194. llama_print_timings: eval time = 8107.13 ms / 1032 runs ( 7.86 ms per token, 127.30 tokens per second)
  195. llama_print_timings: total time = 71165.62 ms / 1065 tokens
vdgimpew

vdgimpew4#

One thing I noticed (with the help of an LLM) is that llama.cpp reports FMA = 1 in its system_info, while ollama reports 0.
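
One way to double-check which CPU flags the bundled runner was built with is to restart the server with debug logging and look for the runner's own system_info line. This is only a sketch for the Windows/PowerShell setup shown in the logs above; OLLAMA_DEBUG is the standard verbosity switch, but whether the runner echoes its system_info into the server log depends on the ollama version:

  # PowerShell: restart ollama with verbose logging (sketch, not a guaranteed recipe)
  $env:OLLAMA_DEBUG = "1"
  ollama serve
  # in another terminal, trigger a model load, then search the server log for "system_info"
  ollama run gemma2:2b-instruct-q4_0 "hello"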

aydmsdu9

aydmsdu95#

I also don't see a CUDA 12 runner in AppData\Local\Programs\Ollama\ollama_runners, which could be another cause of the slowdown.
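
As a quick check, you can list which runners your install actually ships and, for testing, pin ollama to a specific one via OLLAMA_LLM_LIBRARY (the variable visible in the server config dump above). A rough PowerShell sketch; the runner name below is just an example taken from the log and should match whatever the listing actually shows:

  dir "$env:LOCALAPPDATA\Programs\Ollama\ollama_runners"
  $env:OLLAMA_LLM_LIBRARY = "cuda_v11.3"   # example value; pick a name from the listing
  ollama serve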

bprjcwpo

bprjcwpo6#

#4958 appears to add a CUDA 12 backend on a branch, but it has not been merged upstream yet.

cotxawn7

cotxawn77#

Quite possibly the difference in build environments is a factor. Note, however, that you are not comparing the same model: llama.cpp is running gemma-2-2b-it-Q4_K_M.gguf, while ollama is running gemma2:2b-instruct-q4_0. Notably, the tensor mix and the model size differ.
gemma2:2b-instruct-q4_0

  1. llama_model_loader: - type q4_0: 182 tensors
  2. llama_model_loader: - type q6_K: 1 tensors
  3. model size = 1.51 GiB (4.97 BPW)

gemma-2-2b-it-Q4_K_M.gguf

  1. llama_model_loader: - type q4_K: 156 tensors
  2. llama_model_loader: - type q6_K: 27 tensors
  3. model size = 1.59 GiB (5.21 BPW)

If you want to rule that out, you can try running llama.cpp with the ollama model blob (not that I expect it to make much difference, but at least it would be a fair comparison):

  1. .\llama-cli -m C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
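
If you'd rather not hunt for the blob hash by hand, ollama show --modelfile prints the FROM line with the blob path for a tag, which you can then pass straight to llama-cli as in the command above. A small sketch (the output path shown is a placeholder, and the exact output format may vary by ollama version):

  ollama show gemma2:2b-instruct-q4_0 --modelfile
  # FROM C:\Users\<user>\.ollama\models\blobs\sha256-...

Alternatively, if the registry offers a K-quant tag for this model, pulling that in ollama would align the quantizations from the other direction.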
j91ykkif

j91ykkif8#

@phly95 I tried a custom build with CUDA v12 and adjusted the cmake flags to match your llama.cpp system info, but I did not see a significant performance difference. Can you share more details about how you built llama.cpp?
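
For reference, a typical CUDA-enabled llama.cpp build on Windows looks roughly like the sketch below (this assumes a recent checkout where the GGML_* cmake options exist; the reporter's exact flags are unknown, so treat it only as a baseline to compare against):

  cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON
  cmake --build build --config Release -j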

yacmzcpb

yacmzcpb9#

One more difference between the two: the ollama build detects 8/16 threads, while llama.cpp reports 16/16. Can you confirm whether your CPU has 16 full cores without SMT (hyper-threading)? llama.cpp has merged code to address this, but the update from upstream is still pending.
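
If you want to rule the thread-count heuristic out, you can pin the thread count per request via the API's num_thread option (a documented ollama parameter); with all layers offloaded to the GPU it may not change much, but it makes the comparison cleaner. A one-line sketch (use curl.exe on Windows so PowerShell doesn't alias it to Invoke-WebRequest):

  curl.exe http://localhost:11434/api/generate -d '{"model": "gemma2:2b-instruct-q4_0", "prompt": "Why is the sky blue?", "options": {"num_thread": 16}}'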
