Problems running smollm:360m and smollm:135m with ollama

bgibtngc · posted 23 days ago in Other

What is the issue?

I tried the 1.7b version and it ran successfully.

However, when I run either of the two smaller versions, it fails with the error shown below.
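For reference, the runs were roughly the following (tag names taken from the ollama library; I may have used slightly different variants):

$ ollama run smollm:1.7b     # works
$ ollama run smollm:360m     # fails
$ ollama run smollm:135m     # fails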

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.3.6

lokaqttq 1#

Server logs would help with debugging, but as a first guess I suspect your machine does not have enough (V)RAM to host the 135m (92 MB) or 360m (229 MB) model.
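If it helps, you can check both quickly on Windows with something like the following (the log path is the default location from the ollama troubleshooting docs; adjust it if your install differs):

REM show used/total VRAM on the NVIDIA GPU
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

REM open the ollama server log (default Windows location)
notepad %LOCALAPPDATA%\Ollama\server.log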

xv8emn3q 2#

Here is the server log, showing the server's configuration, runtime state, and events.
llama_model_loader: - kv 10: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/ ...
llama_model_loader: - kv 12: general.tags arr[str,3] = ["alignment-handbook", "trl", "sft"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: general.datasets arr[str,4] = ["Magpie-Align/Magpie-Pro-300K-Filter...
llama_model_loader: - kv 15: llama.block_count u32 = 30
llama_model_loader: - kv 16: llama.context_length u32 = 2048
llama_model_loader: - kv 17: llama.embedding_length u32 = 576
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 1536
llama_model_loader: - kv 19: llama.attention.head_count u32 = 9
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 3
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: llama.vocab_size u32 = 49152
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: -

llm_load_tensors: CUDA_Host buffer size = 28.69 MiB
llm_load_tensors: CUDA0 buffer size = 85.82 MiB
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 180.00 MiB
llama_new_context_with_model: KV self size = 180.00 MiB, K (f16): 90.00 MiB, V (f16): 90.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.76 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 164.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.13 MiB
llama_new_context_with_model: graph nodes = 966
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="8360" timestamp=1724639477
time=2024-08-26T10:31:17.560+08:00 level=INFO source=server.go:632 msg="llama runner started in 2.36 seconds"
[GIN] 2024/08/26 - 10:31:17 | 200 | 2.4895053s | 127.0.0.1 | POST "/api/chat"
CUDA error: CUBLAS_STATUS_NOT_SUPPORTED
current device: 0, in function ggml_cuda_mul_mat_batched_cublas at C:\a\ollama\ollama\llm\llama.cpp\ggml\src\ggml-cuda.cu:1889
cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\llm\llama.cpp\ggml\src\ggml-cuda.cu:101: CUDA error
[GIN] 2024/08/26 - 10:31:29 | 200 | 8.9100593s | 127.0.0.1 | POST "/api/chat"
twh00eeo 3#

There seems to be a problem in ollama itself: the q4, q8, and fp16 quantizations of both the 360m and 135m models all fail with the same error.

I even downloaded the safetensors from HF and converted them to GGUF with llama.cpp, but the same error still occurs. However, the GGUF works fine with llama-cli from llama.cpp (at least it does not crash, although the output is not great). So this looks like a problem with the llama.cpp backend inside ollama.
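For anyone who wants to repeat that check, the workflow was roughly the following (local paths and output names are illustrative; the conversion script is the one shipped with llama.cpp at the time):

# convert the HF safetensors checkout to GGUF
python convert_hf_to_gguf.py ./SmolLM-135M-Instruct --outfile smollm-135m-f16.gguf --outtype f16

# sanity-check the GGUF directly with llama-cli
./llama-cli -m smollm-135m-f16.gguf -p "hello" -n 64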

ny6fqffe 4#

Upgrading ollama to 0.3.7 makes the 360m models work; the 135m models output more text but still crash.

$ ollama run smollm:135m-instruct-v0.2-fp16 hello
Hello! How can you are welcome. I am so glad to thank you are you are you are you are the most beautiful and i'm a very much more than just like you are you. You're we have a great, but it is an expert in your 
friend.

Error: an unknown error was encountered while running the model CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2416
  cudaStreamSynchronize(cuda_ctx->stream())
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
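One way to check whether the crash is specific to the CUDA backend (as 3# suspects) is to force CPU-only inference for the same model via the num_gpu option, which controls how many layers are offloaded to the GPU. Roughly, using the interactive /set parameter syntax:

$ ollama run smollm:135m-instruct-v0.2-fp16
>>> /set parameter num_gpu 0
>>> hello

If the model answers (however badly) without crashing on the CPU, that points at the CUDA path in ollama's bundled llama.cpp rather than at the model files themselves.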
