ollama slower than llama.cpp

tgabmvqs · asked 4 months ago

What is the issue?

When benchmarking Ollama with llm-benchmark ( https://github.com/MinhNgyuen/llm-benchmark ), I get around 80 t/s with gemma 2 2b. When I ask llama.cpp the same questions in conversation mode, I get 130 t/s. The llama.cpp command I am running is ".\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv"
Why is Ollama about 38% slower than llama.cpp here?
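
For reference, the Ollama-side number can also be checked directly against the API rather than through llm-benchmark, since the /api/generate response carries eval_count and eval_duration. A minimal sketch, assuming the default server at 127.0.0.1:11434 and a gemma2:2b tag:

# Rough sketch: ask Ollama's /api/generate for one completion and compute
# tokens/s from the eval_count / eval_duration fields in the response.
# The model tag "gemma2:2b" and the default port are assumptions; adjust them.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "gemma2:2b", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
data = resp.json()

# Durations are reported in nanoseconds.
prompt_tps = data["prompt_eval_count"] / data["prompt_eval_duration"] * 1e9
gen_tps = data["eval_count"] / data["eval_duration"] * 1e9
print(f"prompt eval: {prompt_tps:.1f} t/s, generation: {gen_tps:.1f} t/s")

This should match the "eval rate" that ollama run --verbose reports, and can be set against the timing summary llama-cli prints at the end of a run.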

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.5

bihw5rsg1#

Server logs would help with debugging.
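
On Windows the installed server normally writes to %LOCALAPPDATA%\Ollama\server.log (that path is an assumption based on the default install location); a quick way to grab the tail of it:

# Minimal sketch: print the last lines of the Ollama server log on Windows.
# The default log path is an assumption; adjust it if Ollama lives elsewhere.
import os
from pathlib import Path

log_path = Path(os.environ["LOCALAPPDATA"]) / "Ollama" / "server.log"
tail = log_path.read_text(encoding="utf-8", errors="replace").splitlines()[-200:]
print("\n".join(tail))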

sbtkgmzw2#

2024/08/13 09:25:42 routes.go:1123: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\Philip\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-13T09:25:42.918-04:00 level=INFO source=images.go:782 msg="total blobs: 10"
time=2024-08-13T09:25:42.926-04:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
time=2024-08-13T09:25:42.927-04:00 level=INFO source=routes.go:1170 msg="Listening on 127.0.0.1:11434 (version 0.3.5)"
time=2024-08-13T09:25:42.928-04:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v6.1]"
time=2024-08-13T09:25:42.928-04:00 level=INFO source=gpu.go:204 msg="looking for compatible GPUs"
time=2024-08-13T09:25:43.218-04:00 level=INFO source=types.go:105 msg="inference compute" id=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060 Ti" total="8.0 GiB" available="7.0 GiB"
[GIN] 2024/08/13 - 09:25:53 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:25:53 | 200 |     52.6791ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-13T09:25:53.238-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6771941376 required="3.3 GiB"
time=2024-08-13T09:25:53.238-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.3 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:25:53.251-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 60908"
time=2024-08-13T09:25:53.279-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:25:53.279-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="9372" timestamp=1723555553
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="9372" timestamp=1723555553 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="60908" tid="9372" timestamp=1723555553
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
time=2024-08-13T09:25:53.791-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="9372" timestamp=1723555557
time=2024-08-13T09:25:57.268-04:00 level=INFO source=server.go:632 msg="llama runner started in 3.99 seconds"
[GIN] 2024/08/13 - 09:25:57 | 200 |    4.1007296s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:25:58 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:26:12 | 200 |    8.8789117s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:02 | 200 |      5.7099ms |       127.0.0.1 | GET      "/api/tags"
[GIN] 2024/08/13 - 09:29:06 | 200 |     3.427132s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:14 | 200 |    7.9011742s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-13T09:29:14.135-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
time=2024-08-13T09:29:14.447-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6820524032 required="6.1 GiB"
time=2024-08-13T09:29:14.447-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.4 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
time=2024-08-13T09:29:14.455-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61318"
time=2024-08-13T09:29:14.461-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:14.461-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="14936" timestamp=1723555754
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="14936" timestamp=1723555754 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61318" tid="14936" timestamp=1723555754
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 67
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
time=2024-08-13T09:29:14.722-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =    52.84 MiB
llm_load_tensors:      CUDA0 buffer size =  2021.84 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.54 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="14936" timestamp=1723555757
time=2024-08-13T09:29:17.340-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.88 seconds"
[GIN] 2024/08/13 - 09:29:18 | 200 |     4.660208s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:24 | 200 |    5.7101586s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:43 | 200 |      1.3685ms |       127.0.0.1 | GET      "/api/tags"
time=2024-08-13T09:29:44.060-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="370.5 MiB"
time=2024-08-13T09:29:44.403-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6819233792 required="3.3 GiB"
time=2024-08-13T09:29:44.403-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.4 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:29:44.412-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61381"
time=2024-08-13T09:29:44.417-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:44.417-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="22872" timestamp=1723555784
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="22872" timestamp=1723555784 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61381" tid="22872" timestamp=1723555784
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-13T09:29:44.673-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="22872" timestamp=1723555786
time=2024-08-13T09:29:46.550-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.13 seconds"
[GIN] 2024/08/13 - 09:29:49 | 200 |    5.8284308s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:29:59 | 200 |     9.386519s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-13T09:29:59.225-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="3.1 GiB"
time=2024-08-13T09:29:59.522-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6812811264 required="6.1 GiB"
time=2024-08-13T09:29:59.523-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[6.3 GiB]" memory.required.full="6.1 GiB" memory.required.partial="6.1 GiB" memory.required.kv="3.0 GiB" memory.required.allocations="[6.1 GiB]" memory.weights.total="4.9 GiB" memory.weights.repeating="4.8 GiB" memory.weights.nonrepeating="77.1 MiB" memory.graph.full="512.0 MiB" memory.graph.partial="512.0 MiB"
time=2024-08-13T09:29:59.531-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 4 --port 61409"
time=2024-08-13T09:29:59.536-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:29:59.537-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="12124" timestamp=1723555799
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="12124" timestamp=1723555799 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61409" tid="12124" timestamp=1723555799
llama_model_loader: loaded meta data with 27 key-value pairs and 197 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 131072
llama_model_loader: - kv   3:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv   4:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   5:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   6:                           phi3.block_count u32              = 32
llama_model_loader: - kv   7:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   8:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  11:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 2
llama_model_loader: - kv  13:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q4_0:  129 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 67
llm_load_vocab: token to piece cache size = 0.1690 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi3
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32064
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3072
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_rot            = 96
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 96
llm_load_print_meta: n_embd_head_v    = 96
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 3072
llm_load_print_meta: n_embd_v_gqa     = 3072
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8192
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.82 B
llm_load_print_meta: model size       = 2.03 GiB (4.55 BPW) 
llm_load_print_meta: general.name     = Phi3
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 32000 '<|endoftext|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 32000 '<|endoftext|>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOT token        = 32007 '<|end|>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
time=2024-08-13T09:29:59.798-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: ggml ctx size =    0.21 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =    52.84 MiB
llm_load_tensors:      CUDA0 buffer size =  2021.84 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  3072.00 MiB
llama_new_context_with_model: KV self size  = 3072.00 MiB, K (f16): 1536.00 MiB, V (f16): 1536.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.54 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   564.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="12124" timestamp=1723555801
time=2024-08-13T09:30:01.721-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.18 seconds"
[GIN] 2024/08/13 - 09:30:08 | 200 |    9.7045794s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:30:19 | 200 |   11.0502895s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:31:52 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:31:52 | 500 |       535.8µs |       127.0.0.1 | DELETE   "/api/delete"
[GIN] 2024/08/13 - 09:32:00 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/13 - 09:32:01 | 200 |     179.651ms |       127.0.0.1 | DELETE   "/api/delete"
[GIN] 2024/08/13 - 09:32:04 | 200 |         783µs |       127.0.0.1 | GET      "/api/tags"
time=2024-08-13T09:32:04.557-04:00 level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a library=cuda total="8.0 GiB" available="472.3 MiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b gpu=GPU-1049149e-0164-1c68-40a9-d7ce65e65c8a parallel=4 available=6926028800 required="3.3 GiB"
time=2024-08-13T09:32:04.891-04:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=27 layers.offload=27 layers.split="" memory.available="[6.5 GiB]" memory.required.full="3.3 GiB" memory.required.partial="3.3 GiB" memory.required.kv="832.0 MiB" memory.required.allocations="[3.3 GiB]" memory.weights.total="1.9 GiB" memory.weights.repeating="1.4 GiB" memory.weights.nonrepeating="461.4 MiB" memory.graph.full="504.5 MiB" memory.graph.partial="965.9 MiB"
time=2024-08-13T09:32:04.899-04:00 level=INFO source=server.go:393 msg="starting llama server" cmd="C:\\Users\\Philip\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Philip\\.ollama\\models\\blobs\\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --no-mmap --parallel 4 --port 61980"
time=2024-08-13T09:32:04.904-04:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-13T09:32:04.904-04:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-13T09:32:04.905-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3535 commit="1e6f6554" tid="20144" timestamp=1723555924
INFO [wmain] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="20144" timestamp=1723555924 total_threads=16
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="61980" tid="20144" timestamp=1723555924
llama_model_loader: loaded meta data with 34 key-value pairs and 288 tensors from C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2.0 2b It Transformers
llama_model_loader: - kv   3:                           general.finetune str              = it-transformers
llama_model_loader: - kv   4:                           general.basename str              = gemma-2.0
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   8:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv   9:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  10:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  11:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  12:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  13:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  15:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 2
llama_model_loader: - kv  17:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  18:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  19:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  23:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  27:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  28:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  30:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  32:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-13T09:32:05.169-04:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.51 GiB (4.97 BPW) 
llm_load_print_meta: general.name     = Gemma 2.0 2b It Transformers
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1548.29 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.94 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="20144" timestamp=1723555926
time=2024-08-13T09:32:07.053-04:00 level=INFO source=server.go:632 msg="llama runner started in 2.15 seconds"
[GIN] 2024/08/13 - 09:32:10 | 200 |    5.9310484s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:32:18 | 200 |    8.0926735s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2024/08/13 - 09:41:44 | 200 |       510.7µs |       127.0.0.1 | GET      "/api/version"
nwsw7zdq3#

In case it helps, here is the llama.cpp output:

\llama-b3542-bin-win-cuda-cu12.2.0-x64> .\llama-cli -m gemma-2-2b-it-Q4_K_M.gguf --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
Log start
main: build = 3542 (15fa07a5)
main: built with MSVC 19.29.30154.0 for x64
main: seed  = 1723556264
llama_model_loader: loaded meta data with 39 key-value pairs and 288 tensors from gemma-2-2b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2 2b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-2
llama_model_loader: - kv   5:                         general.size_label str              = 2B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                               general.tags arr[str,2]       = ["conversational", "text-generation"]
llama_model_loader: - kv   8:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   9:                    gemma2.embedding_length u32              = 2304
llama_model_loader: - kv  10:                         gemma2.block_count u32              = 26
llama_model_loader: - kv  11:                 gemma2.feed_forward_length u32              = 9216
llama_model_loader: - kv  12:                gemma2.attention.head_count u32              = 8
llama_model_loader: - kv  13:             gemma2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:                gemma2.attention.key_length u32              = 256
llama_model_loader: - kv  16:              gemma2.attention.value_length u32              = 256
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  19:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  20:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  24:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  28:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  31:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/gemma-2-2b-it-GGUF/gemma-...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 182
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  105 tensors
llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
llm_load_vocab: special tokens cache size = 249
llm_load_vocab: token to piece cache size = 1.6014 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = gemma2
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 256000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 2304
llm_load_print_meta: n_layer          = 26
llm_load_print_meta: n_head           = 8
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 256
llm_load_print_meta: n_swa            = 4096
llm_load_print_meta: n_embd_head_k    = 256
llm_load_print_meta: n_embd_head_v    = 256
llm_load_print_meta: n_gqa            = 2
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 9216
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 2B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 2.61 B
llm_load_print_meta: model size       = 1.59 GiB (5.21 BPW)
llm_load_print_meta: general.name     = Gemma 2 2b It
llm_load_print_meta: BOS token        = 2 '<bos>'
llm_load_print_meta: EOS token        = 1 '<eos>'
llm_load_print_meta: UNK token        = 3 '<unk>'
llm_load_print_meta: PAD token        = 0 '<pad>'
llm_load_print_meta: LF token         = 227 '<0x0A>'
llm_load_print_meta: EOT token        = 107 '<end_of_turn>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.26 MiB
llm_load_tensors: offloading 26 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 27/27 layers to GPU
llm_load_tensors:        CPU buffer size =   461.43 MiB
llm_load_tensors:      CUDA0 buffer size =  1623.70 MiB
..........................................................
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  =  832.00 MiB, K (f16):  416.00 MiB, V (f16):  416.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     3.91 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   504.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    36.51 MiB
llama_new_context_with_model: graph nodes  = 1050
llama_new_context_with_model: graph splits = 2
main: chat template example: <start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model

system_info: n_threads = 16 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 8192, n_batch = 512, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

> What is the sky blue?
The sky appears blue due to a phenomenon called **Rayleigh scattering**. Here's a breakdown:

1. **Sunlight and its Colors:** Sunlight contains all colors of the rainbow, each with its own wavelength (like visible light).
2. **Earth's Atmosphere:** Our atmosphere is composed mostly of nitrogen and oxygen molecules.
3. **Scattering:** When sunlight enters the atmosphere, it interacts with these tiny molecules. The shorter wavelengths of light (blue and violet) are scattered more strongly than longer wavelengths like red and orange.
4. **Human Perception:**  Our eyes are most sensitive to blue light, meaning we perceive this scattered light as the dominant color of the sky.

**Why not other colors?**

* **Violet:** While violet light is scattered even more intensely than blue, our eyes are less sensitive to it, so we don't see it as prominently in the daytime sky.
* **Red and Orange:** These longer wavelengths are scattered less, which is why we see them as dominant during sunrise and sunset.

**In summary:** The blue sky is a result of sunlight being scattered by our atmosphere's molecules, making blue light dominate the color we perceive.

Let me know if you have any further questions!

> Write a report on the financials of Nvidia
## Nvidia Financial Snapshot: A Deep Dive

This report provides an overview of Nvidia's financial performance, analyzing key financial metrics and identifying key trends.

**Q1 & Q2 2023 Performance:**

* **Revenue**: Strong revenue growth continued in both Q1 and Q2 2023, driven by robust demand for data centers and AI solutions.
    * Q1 2023: $7.68 billion (up 14% year-over-year)
    * Q2 2023: $8.85 billion (up 29% year-over-year)
* **Net Income**:  Nvidia's net income saw a significant increase in both quarters, reflecting the company's strong performance and efficient cost management.
    * Q1 2023: $1.94 billion (up 68% year-over-year)
    * Q2 2023: $2.17 billion (up 64% year-over-year)
* **Earnings per Share**:  EPS also saw significant growth, reflecting the company's profitability and strong financial position.
    * Q1 2023: $0.85 per share
    * Q2 2023: $1.16 per share

**Drivers of Financial Success:**

* **Data Center Market:** Nvidia's data center business has been a key driver of revenue growth, fueled by demand for its GPUs (Graphics Processing Units) used in AI training and cloud computing.
* **Gaming Segment**:  While facing headwinds from increased competition, the gaming segment remains a significant contributor to Nvidia's revenue, benefiting from strong demand for high-performance graphics cards.
* **Automotive Sector:** The company's automotive segment has been experiencing rapid growth, driven by its technology enabling autonomous driving features and connected vehicles.

**Challenges & Risks:**

* **Geopolitical Tensions**:  The ongoing geopolitical tensions create uncertainty in the global economy, potentially impacting demand for Nvidia's products in various sectors.
* **Competition**:  Competition within the GPU market is intensifying as rival companies like AMD and Intel aggressively enter this space.
* **Macroeconomic Factors**: Economic slowdown and rising inflation pose challenges to overall demand across industries, including Nvidia's key markets.

**Future Outlook:**

* **Continued Growth in Data Centers & AI:** Nvidia expects sustained growth in data center and AI segments as companies invest heavily in cloud computing and artificial intelligence development.
* **Expansion into Automotive and Other Emerging Sectors:**  Nvidia is actively pursuing expansion opportunities in automotive, gaming, and other emerging markets to diversify its revenue streams.

**Key Financial Ratios:**

* **Profit Margin**: Nvidia has maintained a high profit margin across recent quarters, reflecting its focus on efficient operations and strong pricing strategies.
* **Return on Equity (ROE)**:  The company continues to deliver strong returns on shareholder equity, indicating efficient capital allocation and strong profitability.
* **Debt-to-Equity Ratio**:   Nvidia maintains a relatively low debt-to-equity ratio, demonstrating its sound financial position and ability to manage leverage effectively.

**Conclusion:**

Nvidia's financial performance remains strong, driven by robust demand for its technology across multiple market segments. The company has a clear strategic focus on data centers, AI, automotive, and gaming, positioning it well for future growth. However, the company faces challenges from increased competition, geopolitical tensions, and macroeconomic uncertainties.



**Disclaimer:** This report is based on publicly available financial information and should not be construed as financial advice. Please consult with a qualified professional before making any investment decisions.



>

llama_print_timings:        load time =    1626.24 ms
llama_print_timings:      sample time =    1444.86 ms /  1034 runs   (    1.40 ms per token,   715.64 tokens per second)
llama_print_timings: prompt eval time =   49812.18 ms /    33 tokens ( 1509.46 ms per token,     0.66 tokens per second)
llama_print_timings:        eval time =    8107.13 ms /  1032 runs   (    7.86 ms per token,   127.30 tokens per second)
llama_print_timings:       total time =   71165.62 ms /  1065 tokens
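
For reference, the tokens-per-second figures above are just the timing lines rearranged; a minimal Python sketch of the arithmetic, with the numbers hard-coded from this run:

# Rough sanity check of the llama_print_timings eval line above
# (values copied from this run, not computed by llama.cpp itself)
eval_time_ms = 8107.13   # eval time spent generating
eval_runs    = 1032      # tokens generated

ms_per_token   = eval_time_ms / eval_runs            # ~7.86 ms per token
tokens_per_sec = eval_runs / (eval_time_ms / 1000.0) # ~127.3 tokens per second

print(f"{ms_per_token:.2f} ms per token, {tokens_per_sec:.2f} tokens per second")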
vdgimpew

vdgimpew4#

One thing I noticed (with the help of an LLM) is that llama.cpp reports FMA = 1, while ollama reports 0.

aydmsdu9

aydmsdu95#

I also don't see a CUDA 12 runner in AppData\Local\Programs\Ollama\ollama_runners, which could be another cause of the slowdown.

bprjcwpo

bprjcwpo6#

#4958 appears to add a CUDA 12 backend in a branch, but it has not been merged upstream yet.

cotxawn7

cotxawn77#

Quite possibly, differences in the build environment are a factor. Note, however, that you are not comparing the same model: llama.cpp is using gemma-2-2b-it-Q4_K_M.gguf, while ollama is using gemma2:2b-instruct-q4_0. Notably, the tensor mix and the model size differ.
gemma2:2b-instruct-q4_0

llama_model_loader: - type q4_0:  182 tensors
llama_model_loader: - type q6_K:    1 tensors
model size       = 1.51 GiB (4.97 BPW)

gemma-2-2b-it-Q4_K_M.gguf

llama_model_loader: - type q4_K:  156 tensors
llama_model_loader: - type q6_K:   27 tensors
model size       = 1.59 GiB (5.21 BPW)
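
As a rough sanity check of the BPW (bits per weight) figures above, here is a small Python sketch using the file sizes from the logs and assuming both quantizations share the 2.61 B parameter count reported for the Q4_K_M file; llama.cpp computes BPW slightly differently, so expect small rounding differences:

# Approximate bits per weight = (file size in bits) / (parameter count)
# Values are copied from the logs above; this is an estimate, not
# llama.cpp's exact calculation.
params = 2.61e9

for name, size_gib in [("gemma2:2b-instruct-q4_0", 1.51), ("gemma-2-2b-it-Q4_K_M", 1.59)]:
    bits = size_gib * 1024**3 * 8
    print(f"{name}: ~{bits / params:.2f} BPW")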

If you want to rule that factor out, you could try running llama.cpp with the ollama model (not that I expect it to make much of a difference, but at least the comparison would be apples to apples):

.\llama-cli -m C:\Users\Philip\.ollama\models\blobs\sha256-7462734796d67c40ecec2ca98eddf970e171dbb6b370e43fd633ee75b69abe1b --threads 16 -ngl 27 --mlock --port 11484 --top_k 40 --repeat_penalty 1.1 --min_p 0.05 --top_p 0.95 --prompt-cache-all -cb -np 4 --batch-size 512 -cnv
j91ykkif

j91ykkif8#

@phly95 I tried a custom build with CUDA v12 and adjusted the cmake flags to match the settings in your llama.cpp system info, but I did not see a significant performance difference. Could you share more details on how you built llama.cpp?

yacmzcpb

yacmzcpb9#

One other difference between the two is that the ollama build detects 8/16 threads, whereas llama.cpp shows 16/16 threads. Can you confirm whether your CPU really has 16 full cores without SMT (hyper-threading)? llama.cpp has merged code to address this, but the update from upstream is still pending.
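
If it helps to double-check what the machine reports, here is a minimal Python sketch (assuming the third-party psutil package is installed) that prints logical vs. physical core counts, which is essentially the 8/16 vs. 16/16 discrepancy above:

# Logical cores include SMT/hyper-threaded siblings; physical cores do not.
import os
import psutil  # third-party: pip install psutil

print("logical cores :", os.cpu_count())
print("physical cores:", psutil.cpu_count(logical=False))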
