Ollama model quality drops sharply on the chat/completions endpoint

dphi5xsq · asked 23 days ago · in: Other

What is the issue?

People raised the following issue on my frontend (which talks to ollama):

In short, the problem is that model responses obtained through my application are of very poor quality. Long story short:

  1. I ran export OLLAMA_DEBUG=1 && ollama serve
  2. Ran ollama run qwen2:1.5b --verbose --nowordwrap with the prompt below and got a fairly good answer.
  3. Then I ran
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_LOCAL_API_KEY" \
  -d '{
"model": "qwen2:1.5b",
"messages": [{"role": "user", "content": "create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"}]
}'
  4. Got this garbled result:
{"id":"chatcmpl-521","object":"chat.completion","created":1724524318,"model":"qwen2:1.5b","system_fingerprint":"fp_ollama","choices":[{"index":0,"message":{"role":"assistant","content":"To create a Sublime Text plugin that converts selected text to base64 encoding, replaces it in place, and wraps the entire process in a command-line interface (CLI), you can use a combination of file handling and scripting. Here's how to implement this feature:\n\n1. Start by creating a new project in Sublime Text.\n2. Add the following package to your `Package Info` -\u003e `Packages/User/...` subfolder:\n```\n//sublime-text-commands\n{\n  // Your custom CLI commands here\n}\n``` \n3. Copy and paste the contents of this code snippet into the newly created `.sublime-package` file:\n\n```json\n{\n  \"name\": \"Custom CLI Commands\",\n  \"description\": \"Commands for a Sublime Text plugin\",\n  \"版本\": 1,\n  \"dependencies: {\n    // ... your npm packages here\n    \"command-line-encoder\": \"^2.3.0\"\n  },\n  \"cmd\": [\n      \"perl -e 'print Encode::b64_encode(\\$arg2));'\"\n  ]\n}\n```\n\n4. Save the file and restart Sublime Text.\n\nNow, you should be able to see your plugin under the `\"commands\"` menu when launching the Sublime Text command palette:\n\n1. Choose `View-\u003e Find -\u003e Replace with` \u003e `\u003cPackage name\u003e`.\n2. Select any line in the current document that contains text.\n3. Press `Enter` to apply the above code.\n4. Choose an item from the output list on the right (you should see your plugin's CLI).\n\nTo convert selected text, use a combination of the following commands:\n\n- `\u003cPackage name\u003e`: Open the Sublime Text command palette and type in `\u003cPackage name\u003e`.\n  - Then press `Enter`.\n  - Check the box next to `\"Find\"` to search for any specific text.\n\nHere's an example of how you can do this with regular expressions and the new plugin:\n\n1. Add a custom regular expression to find any line that matches the input:\n- Search: `'(?s)^\\s+'\n- Replace: `'`\n  - The `\\s+` captures one or more whitespace characters before any text.\n\n2. Press `Enter`.\n\n3. To replace selected text with base64 encoding and wrap it in a command prompt and press `Enter`. \n\n```json\n\"cmd\": [ \n    \"perl -e 'print Encode::b64_encode($arg2);'\",\n    \"(perl -e 'print Encode::b64_encode(\\$arg2));'\"\n]\n```\n\n4. Repeat the process by pressing `Enter` a few times for multiple lines."},"finish_reason":"stop"}],"usage":{"prompt_tokens":32,"completion_tokens":541,"total_tokens":573}}

On the server side, I noticed that ollama run hits a different endpoint than chat/completions, and many more requests show up in the log for it than for the curl call.

I haven't dug deeper into this, but my guess is that some additional setup happens when ollama run is invoked.

Here are the logs:

2024/08/24 20:18:58 routes.go:1125: INFO server config env="map[OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/path-to-ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR:]"
time=2024-08-24T20:18:58.262+02:00 level=INFO source=images.go:782 msg="total blobs: 5"
time=2024-08-24T20:18:58.263+02:00 level=INFO source=images.go:790 msg="total unused blobs removed: 0"
[GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.

[GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
 - using env:   export GIN_MODE=release
 - using code:  gin.SetMode(gin.ReleaseMode)

[GIN-debug] POST   /api/pull                 --> github.com/ollama/ollama/server.(*Server).PullModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/generate             --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
[GIN-debug] POST   /api/chat                 --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
[GIN-debug] POST   /api/embed                --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
[GIN-debug] POST   /api/embeddings           --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
[GIN-debug] POST   /api/create               --> github.com/ollama/ollama/server.(*Server).CreateModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/push                 --> github.com/ollama/ollama/server.(*Server).PushModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/copy                 --> github.com/ollama/ollama/server.(*Server).CopyModelHandler-fm (5 handlers)
[GIN-debug] DELETE /api/delete               --> github.com/ollama/ollama/server.(*Server).DeleteModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/show                 --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (5 handlers)
[GIN-debug] POST   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/blobs/:digest        --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
[GIN-debug] GET    /api/ps                   --> github.com/ollama/ollama/server.(*Server).ProcessHandler-fm (5 handlers)
[GIN-debug] POST   /v1/chat/completions      --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
[GIN-debug] POST   /v1/completions           --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
[GIN-debug] POST   /v1/embeddings            --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models                --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (6 handlers)
[GIN-debug] GET    /v1/models/:model         --> github.com/ollama/ollama/server.(*Server).ShowModelHandler-fm (6 handlers)
[GIN-debug] GET    /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] GET    /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] GET    /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
[GIN-debug] HEAD   /                         --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
[GIN-debug] HEAD   /api/tags                 --> github.com/ollama/ollama/server.(*Server).ListModelsHandler-fm (5 handlers)
[GIN-debug] HEAD   /api/version              --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
time=2024-08-24T20:18:58.263+02:00 level=INFO source=routes.go:1172 msg="Listening on 127.0.0.1:11434 (version 0.3.6)"
time=2024-08-24T20:18:58.267+02:00 level=INFO source=payload.go:30 msg="extracting embedded files" dir=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-common.h.gz
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ggml-metal.metal.gz
time=2024-08-24T20:18:58.267+02:00 level=DEBUG source=payload.go:182 msg=extracting variant=metal file=build/darwin/arm64/metal/bin/ollama_llama_server.gz
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:18:58.290+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [metal]"
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-24T20:18:58.290+02:00 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-24T20:18:58.338+02:00 level=INFO source=types.go:105 msg="inference compute" id=0 library=metal compute="" driver=0.0 name="" total="10.7 GiB" available="10.7 GiB"
time=2024-08-24T20:19:07.474+02:00 level=DEBUG source=sched.go:181 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=0x100950390 gpu_count=1
time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=sched.go:219 msg="loading first model" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
time=2024-08-24T20:19:07.489+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]"
time=2024-08-24T20:19:07.490+02:00 level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e gpu=0 parallel=4 available=11453251584 required="1.9 GiB"
time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=server.go:101 msg="system memory" total="16.0 GiB" free="4.0 GiB" free_swap="0 B"
time=2024-08-24T20:19:07.490+02:00 level=DEBUG source=memory.go:101 msg=evaluating library=metal gpu_count=1 available="[10.7 GiB]"
time=2024-08-24T20:19:07.490+02:00 level=INFO source=memory.go:309 msg="offload to metal" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[10.7 GiB]" memory.required.full="1.9 GiB" memory.required.partial="1.9 GiB" memory.required.kv="224.0 MiB" memory.required.allocations="[1.9 GiB]" memory.weights.total="927.4 MiB" memory.weights.repeating="744.8 MiB" memory.weights.nonrepeating="182.6 MiB" memory.graph.full="299.8 MiB" memory.graph.partial="299.8 MiB"
time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:19:07.491+02:00 level=DEBUG source=payload.go:71 msg="availableServers : found" file=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server
time=2024-08-24T20:19:07.492+02:00 level=INFO source=server.go:393 msg="starting llama server" cmd="/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal/ollama_llama_server --model /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --verbose --parallel 4 --port 54635"
time=2024-08-24T20:19:07.492+02:00 level=DEBUG source=server.go:410 msg=subprocess environment="[PATH=/opt/homebrew/opt/ruby/bin:/path-to-ollama/.mint/bin:/Applications/Sublime Merge.app/Contents/SharedSupport/bin:/Applications/Sublime Text.app/Contents/SharedSupport/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Library/Apple/usr/bin:/Applications/Little Snitch.app/Contents/Components:/path-to-ollama/.cargo/bin:/Applications/kitty.app/Contents/MacOS LD_LIBRARY_PATH=/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners/metal:/var/folders/gc/v8tx0lzx4qg7tt1rl88wzgwr0000gn/T/ollama4145360689/runners]"
time=2024-08-24T20:19:07.493+02:00 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-24T20:19:07.493+02:00 level=INFO source=server.go:593 msg="waiting for llama runner to start responding"
time=2024-08-24T20:19:07.494+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=3535 commit="1e6f6554" tid="0x1e9306940" timestamp=1724523548
INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="0x1e9306940" timestamp=1724523548 total_threads=10
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="9" port="54635" tid="0x1e9306940" timestamp=1724523548
llama_model_loader: loaded meta data with 21 key-value pairs and 338 tensors from /path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-1.5B-Instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_0:  196 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-08-24T20:19:08.250+02:00 level=INFO source=server.go:627 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 0.9338 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 1536
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 12
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 8960
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 1.54 B
llm_load_print_meta: model size       = 885.97 MiB (4.81 BPW) 
llm_load_print_meta: general.name     = Qwen2-1.5B-Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.30 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   885.97 MiB, (  886.03 / 10922.67)
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:        CPU buffer size =   182.57 MiB
llm_load_tensors:      Metal buffer size =   885.97 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
llama_kv_cache_init:      Metal KV buffer size =   224.00 MiB
llama_new_context_with_model: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     2.34 MiB
llama_new_context_with_model:      Metal compute buffer size =   299.75 MiB
llama_new_context_with_model:        CPU compute buffer size =    19.01 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 2
time=2024-08-24T20:19:08.501+02:00 level=DEBUG source=server.go:638 msg="model load progress 1.00"
DEBUG [initialize] initializing slots | n_slots=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=1 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=2 tid="0x1e9306940" timestamp=1724523548
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=3 tid="0x1e9306940" timestamp=1724523548
INFO [main] model loaded | tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="0x1e9306940" timestamp=1724523548
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=0 tid="0x1e9306940" timestamp=1724523548
time=2024-08-24T20:19:08.755+02:00 level=INFO source=server.go:632 msg="llama runner started in 1.26 seconds"
time=2024-08-24T20:19:08.755+02:00 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=1 tid="0x1e9306940" timestamp=1724523548
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54639 status=200 tid="0x16ba43000" timestamp=1724523548
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=2 tid="0x1e9306940" timestamp=1724523548
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523548
time=2024-08-24T20:19:08.777+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\nCreating a Sublime Text plugin to perform Base64 encoding on selected text and replace it with converted data is a complex task as you are asking for more than one operation. However, I can provide an outline of how such a feature might be implemented in Sublime Text.\n\nHere's the step-by-step guide to creating such a plugin:\n\n1. **Define Keybinds:** First, you need to define the key bindings that trigger the conversion when the user selects text and presses a specific key.\n\n2. **Create the Plugin:** Create a new Sublime Text plugin file (like `sublime_text_plugin.py`). This file should include the necessary functions for handling command execution, event listeners, etc.\n\n3. **Implement Conversion Function:** In this function, you need to convert the selected text using Base64 encoding. You can use libraries like `base64` in Python to do this.\n\n4. **Insert or Replace Selected Text:** Once the conversion is complete, you need to either insert the converted text into the user's selection or replace it if the user previously typed something there.\n5. **Check for Keybinds to Continue:** If the user presses a key to continue, check whether `execute_command` has been called and if not, call it with the correct parameters.\n\n6. **Event Listening:** Add event listeners in Sublime Text itself so that when changes are made to the selected text (e.g., typ
ed characters), they can trigger the conversion.\n\n7. **Error Handling:** Include error handling for situations where the Base64 encoding process fails or if something else goes wrong during the execution of the command.\n8. **Testing:** Ensure your plugin works as expected by testing it with different scenarios and edge cases, such as when there's no text selected in Sublime Text.\n\nPlease note that this is a high-level overview of creating a Sublime Text plugin. The specifics will depend on the programming language you're using for the plugin (in this case, Python), and how you choose to implement the features described above. For full documentation, follow your chosen platform's official documentation or look up examples online.\n\nRemember that creating plugins like these can be a significant commitment, especially if they are complex and need thorough testing before release.<|im_end|>\n<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=3 tid="0x1e9306940" timestamp=1724523548
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] slot progression | ga_i=0 n_past=0 n_past_se=0 n_prompt_tokens_processed=523 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523548
DEBUG [print_timings] prompt eval time     =     473.73 ms /   523 tokens (    0.91 ms per token,  1104.00 tokens per second) | n_prompt_tokens_processed=523 n_tokens_second=1103.9997298050373 slot_id=0 t_prompt_processing=473.732 t_token=0.9057973231357553 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [print_timings] generation eval time =    4179.91 ms /   322 runs   (   12.98 ms per token,    77.04 tokens per second) | n_decoded=322 n_tokens_second=77.0351330446988 slot_id=0 t_token=12.981090062111802 t_token_generation=4179.911 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [print_timings]           total time =    4653.64 ms | slot_id=0 t_prompt_processing=473.732 t_token_generation=4179.911 t_total=4653.643 task_id=4 tid="0x1e9306940" timestamp=1724523553
DEBUG [update_slots] slot released | n_cache_tokens=845 n_ctx=8192 n_past=844 n_system_tokens=0 slot_id=0 task_id=4 tid="0x1e9306940" timestamp=1724523553 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54640 status=200 tid="0x16bacf000" timestamp=1724523553
[GIN] 2024/08/24 - 20:19:13 | 200 |  5.980737958s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s
time=2024-08-24T20:19:13.432+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0



time=2024-08-24T20:21:08.524+02:00 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=329 tid="0x1e9306940" timestamp=1724523668
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=330 tid="0x1e9306940" timestamp=1724523668
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523668
time=2024-08-24T20:21:08.527+02:00 level=DEBUG source=routes.go:1363 msg="chat request" images=0 prompt="<|im_start|>user\n\"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion\"<|im_end|>\n<|im_start|>assistant\n"
DEBUG [process_single_task] slot data | n_idle_slots=4 n_processing_slots=0 task_id=331 tid="0x1e9306940" timestamp=1724523668
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",193]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] slot progression | ga_i=0 n_past=34 n_past_se=0 n_prompt_tokens_processed=34 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] we have to evaluate at least 1 token to generate logits | slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [update_slots] kv cache rm [p0, end) | p0=33 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523668
DEBUG [print_timings] prompt eval time     =     166.62 ms /    34 tokens (    4.90 ms per token,   204.05 tokens per second) | n_prompt_tokens_processed=34 n_tokens_second=204.0546866560238 slot_id=0 t_prompt_processing=166.622 t_token=4.90064705882353 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [print_timings] generation eval time =    8888.16 ms /   596 runs   (   14.91 ms per token,    67.06 tokens per second) | n_decoded=596 n_tokens_second=67.0554910065198 slot_id=0 t_token=14.913021812080537 t_token_generation=8888.161 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [print_timings]           total time =    9054.78 ms | slot_id=0 t_prompt_processing=166.622 t_token_generation=8888.161 t_total=9054.783 task_id=332 tid="0x1e9306940" timestamp=1724523677
DEBUG [update_slots] slot released | n_cache_tokens=630 n_ctx=8192 n_past=629 n_system_tokens=0 slot_id=0 task_id=332 tid="0x1e9306940" timestamp=1724523677 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=54664 status=200 tid="0x16bb5b000" timestamp=1724523677
[GIN] 2024/08/24 - 20:21:17 | 200 |   9.10044125s |       127.0.0.1 | POST     "/v1/chat/completions"
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:403 msg="context for request finished"
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e duration=5m0s
time=2024-08-24T20:21:17.584+02:00 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/path-to-ollama/.ollama/models/blobs/sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e refCount=0

g0czyy6m #1

I'm not a power user, so I can't judge the quality of the answers; it would help to see the good answer you got from ollama run. That said, the log shows that the ollama run session was a multi-turn conversation, so the quality of its output may have been helped by the earlier answer already in the context.
ollama run uses ollama's native /api/chat API, while your curl call goes through the OpenAI-compatible endpoint /v1/chat/completions. The two endpoints have different defaults for the temperature and top_p parameters. If you pin those parameters and the seed, both endpoints return the same answer. Note that ollama doubles the temperature value passed in through the OpenAI-compatible endpoint, so the default of 0.8 for temperature corresponds to 0.4 on /v1/chat/completions.
The OpenAI-compatible API's response to the prompt:

curl -s localhost:11434/v1/chat/completions -d '{
  "model":"qwen2:1.5b",
  "seed":0,
  "temperature":0.4,
  "top_p":0.9,
  "messages":[{
    "role":"user",
    "content":"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"
  }]
}' > result.openai

The native ollama API's response to the prompt:

curl -s localhost:11434/api/chat -d '{
  "model":"qwen2:1.5b",
  "options":{                              
    "seed":0
  },              
  "stream":false,                                                                                                                                            
  "messages":[{
    "role":"user",
    "content":"create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion"
  }]
}' > result.ollama

Comparing the outputs, we can see that the content is identical:

$ sdiff -b <(jq -r . result.openai) <(jq -r . result.ollama)
{                                                               {
  "id": "chatcmpl-933",                                       <
  "object": "chat.completion",                                <
  "created": 1724539470,                                      <
  "model": "qwen2:1.5b",                                          "model": "qwen2:1.5b",
  "system_fingerprint": "fp_ollama",                          |   "created_at": "2024-08-24T22:43:27.102258036Z",
  "choices": [                                                <
    {                                                         <
      "index": 0,                                             <
      "message": {                                                "message": {
        "role": "assistant",                                        "role": "assistant",
        "content": "Here's an example of a Sublime Text plugi       "content": "Here's an example of a Sublime Text plugin th
      },                                                          },
      "finish_reason": "stop"                                 |   "done_reason": "stop",
    }                                                         |   "done": true,
  ],                                                          |   "total_duration": 1696781939,
  "usage": {                                                  |   "load_duration": 18291392,
    "prompt_tokens": 32,                                      |   "prompt_eval_count": 32,
    "completion_tokens": 280,                                 |   "prompt_eval_duration": 19763000,
    "total_tokens": 312                                       |   "eval_count": 280,
  }                                                           |   "eval_duration": 1616478000
}

The same holds for ollama run:

$ script -c 'ollama run qwen2:1.5b --nowordwrap'
Script started on 2024-08-25 00:51:43+02:00 [TERM="xterm-256color" TTY="/dev/pts/1" COLUMNS="211" LINES="41"]
>>> /set parameter seed 0
Set parameter 'seed' to '0'
>>> create sublime text plugin that takes selected text and convert it by applying base64encoding and replacing selected text with the conversion
Here's an example of a Sublime Text plugin that converts selected text to base64 encoding and replaces it with the converted value:
...

If we then compare the ollama run output with the curl result, we can see they are identical:

$ sdiff <(jq -r '.choices[0].message.content' result.openai) <(ansifilter typescript)
Here's an example of a Sublime Text plugin that converts sele | Script started on 2024-08-25 00:51:43+02:00 [TERM="xterm-256c
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              > Set parameter 'seed' to '0'
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              > ⠋ Here's an example of a Sublime Text plugin that converts se

```python                                                       ```python
// In your Sublime Text preferences, create a new folder call   // In your Sublime Text preferences, create a new folder call
// Alternatively, you can edit this file directly from the Su   // Alternatively, you can edit this file directly from the Su

package = require("sublime-package");                           package = require("sublime-package");

module.exports = {                                              module.exports = {
  init: function() {                                              init: function() {

    var plugin = {};                                                var plugin = {};
                                                                    
    plugin.exec = function(editor) {                                plugin.exec = function(editor) {
      editor.commands.executeCommand("repl.text.edit", "", nu         editor.commands.executeCommand("repl.text.edit", "", nu
      editor.commands.executeCommand("repl.text.replace", "",         editor.commands.executeCommand("repl.text.replace", "",
    };                                                              };

    return plugin;                                                  return plugin;
  }                                                               }
};                                                              };
```                                                             ```

This plugin defines a `init` function that runs when the plug   This plugin defines a `init` function that runs when the plug

To use this plugin, go to the "Preferences" > "Package Contro   To use this plugin, go to the "Preferences" > "Package Contro
                                                              >
                                                              > >>> Send a message (/? for help)Send a message (/? for help)S
                                                              >
                                                              > Script done on 2024-08-25 00:51:54+02:00 [COMMAND_EXIT_CODE="

In summary, controlling the seed, temperature, and top_p parameters gives you the same result from the different endpoints that ollama exposes.

hzbexzde #2

By saying it produces a mess, I mean exactly that 😅 If you expand the response in the full log I posted earlier, you'll see Unicode escape codes, Chinese characters, and so on, which cannot in any way be considered an acceptable answer. My point is that it isn't a matter of correctness, but of not being broken.
I also suspect this is related to the model settings, but I'm not sure.
Thanks for pointing out the different defaults of the two APIs. Is there a way to get the full default configuration for both of them?

qhhrdooz #3

ollama/api/types.go, line 585 (commit 69be940): func DefaultOptions() Options {
ollama/openai/openai.go, line 454 (commit 69be940): if r.Temperature != nil {
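
For orientation only, here is a rough, self-contained Go sketch of what those two references suggest, based on the defaults quoted in answer #1 (temperature 0.8, top_p 0.9) and the doubling discussed further down; the real field names, values, and nil-handling in the ollama source may differ:

package main

import "fmt"

// Options loosely mirrors the struct returned by DefaultOptions() in api/types.go;
// only the two parameters discussed in this thread are shown.
type Options struct {
	Temperature float64
	TopP        float64
}

// DefaultOptions sketches the native-API defaults referenced above.
func DefaultOptions() Options {
	return Options{Temperature: 0.8, TopP: 0.9}
}

// fromOpenAITemperature sketches the conversion implied by the openai.go reference:
// a temperature supplied on /v1/chat/completions is doubled before it reaches the model.
// The fallback used when no temperature is sent is an assumption, not taken from the source.
func fromOpenAITemperature(t *float64) float64 {
	if t != nil {
		return *t * 2.0
	}
	return DefaultOptions().Temperature
}

func main() {
	low, one := 0.4, 1.0
	fmt.Println(fromOpenAITemperature(&low)) // 0.8 — reproduces the native default
	fmt.Println(fromOpenAITemperature(&one)) // 2   — the case reported further down in this thread
	fmt.Println(fromOpenAITemperature(nil))  // 0.8 — assumed fallback
}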

3df52oht #4

Thanks for the answers, they helped a lot. With these responses I was able to track the problem down: the default temperature in my plugin is 1, which then becomes 2 on ollama's side.
So, may I ask why the temperature value is doubled on the /v1/chat/completions endpoint?

1cklez4t #5

It's an attempt to reconcile the different scales the APIs use for temperature. OpenAI uses a temperature scale from 0 to 2 (minimum randomness to maximum randomness), whereas ollama treats 1 as maximum randomness. In that case it should really be /2 rather than *2, though.
Honestly, I don't think temperature is well defined anywhere. Looking at the llama.cpp source, there is no explanation of the possible range beyond a comment noting that 1.5 is "more creative", which implies the range is not [0..1]. Even OpenAI's own API documentation switches between [0..1] and [0..2] across different endpoints.
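
To make the scale mismatch concrete, here is a tiny illustrative comparison (not ollama code) of the doubling the endpoint applies today versus the halving suggested above, assuming OpenAI's range is 0..2 and ollama's is 0..1:

package main

import "fmt"

func main() {
	// Example values a client might send on OpenAI's 0..2 temperature scale.
	for _, t := range []float64{0.0, 0.7, 1.0, 2.0} {
		doubled := t * 2.0 // what /v1/chat/completions currently does
		halved := t / 2.0  // the mapping that would stay inside ollama's 0..1 range
		fmt.Printf("openai=%.1f -> doubled=%.1f, halved=%.1f\n", t, doubled, halved)
	}
}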

dgiusagp #6

"Honestly, I don't think temperature is well defined anywhere."
Yes, and that is exactly my concern. My frontend, which forwards requests to ollama through the OpenAI endpoint, ships with predefined configuration and defaults that in turn mirror the OpenAI API defaults.
But with ollama something goes wrong, because the default temperature of 1 apparently becomes 2, which drives the model crazy.
Even though I've published an FAQ on my site, I'm sure more questions like this will come up, because the behavior is too confusing to figure out on one's own.
So a question for you: have you considered submitting a PR to remove this temperature multiplication?
P.S. llama.cpp does no such adjustment, and the last time I tested it, it worked fine with the OpenAI default configuration values.

fcipmucu #7

You're of course welcome to submit a PR, but whether it gets merged is up to the ollama team (I'm not a member of it).
