$ ollama run smollm:135m-instruct-v0.2-fp16 hello
Hello! How can you are welcome. I am so glad to thank you are you are you are you are the most beautiful and i'm a very much more than just like you are you. You're we have a great, but it is an expert in your
friend.
Error: an unknown error was encountered while running the model CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:2416
cudaStreamSynchronize(cuda_ctx->stream())
/go/src/github.com/ollama/ollama/llm/llama.cpp/ggml/src/ggml-cuda.cu:101: CUDA error
4 answers
lokaqttq1#
Server logs would help with debugging, but as an initial guess, I suspect your machine doesn't have enough (V)RAM to host the 135m (92MB) or 360m (229MB) models.
xv8emn3q2#
Here is the server log:
llama_model_loader: - kv 10: general.base_model.0.organization str = HuggingFaceTB
llama_model_loader: - kv 11: general.base_model.0.repo_url str = https://huggingface.co/HuggingFaceTB/ ...
llama_model_loader: - kv 12: general.tags arr[str,3] = ["alignment-handbook", "trl", "sft"]
llama_model_loader: - kv 13: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 14: general.datasets arr[str,4] = ["Magpie-Align/Magpie-Pro-300K-Filter...
llama_model_loader: - kv 15: llama.block_count u32 = 30
llama_model_loader: - kv 16: llama.context_length u32 = 2048
llama_model_loader: - kv 17: llama.embedding_length u32 = 576
llama_model_loader: - kv 18: llama.feed_forward_length u32 = 1536
llama_model_loader: - kv 19: llama.attention.head_count u32 = 9
llama_model_loader: - kv 20: llama.attention.head_count_kv u32 = 3
llama_model_loader: - kv 21: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 22: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 23: general.file_type u32 = 2
llama_model_loader: - kv 24: llama.vocab_size u32 = 49152
llama_model_loader: - kv 25: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 26: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: -
twh00eeo3#
This looks like a problem in ollama itself. All q4, q8, and fp16 quantizations of both the 360m and 135m models fail with the same error.
I even downloaded the safetensors from HF and converted them to GGUF with llama.cpp, but I still get the same error. However, the GGUF works fine with llama-cli from llama.cpp (at least it doesn't crash; the output isn't great either). So this appears to be an issue with the llama.cpp backend inside ollama, not with the model files.
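For anyone who wants to reproduce this comparison, the convert-and-test path above can be sketched roughly like this. The local paths and output filename are illustrative; `convert_hf_to_gguf.py` and `llama-cli` are the current llama.cpp tool names and may differ in older checkouts:

```shell
# Assumes the SmolLM weights have already been downloaded from HF
# to /path/to/SmolLM-135M-Instruct (path is illustrative).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert the HF safetensors to an fp16 GGUF.
python convert_hf_to_gguf.py /path/to/SmolLM-135M-Instruct \
    --outfile smollm-135m-f16.gguf --outtype f16

# Build llama.cpp, then run the converted model directly with
# llama-cli to compare its behavior against ollama's backend.
cmake -B build && cmake --build build --config Release
./build/bin/llama-cli -m smollm-135m-f16.gguf -p "hello" -n 64
```

If llama-cli runs the same GGUF without crashing while ollama does not, that points at ollama's bundled llama.cpp/CUDA path rather than the converted weights.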
ny6fqffe4#
Upgrading ollama to 0.3.7 makes the 360m models work; the 135m models output more text but still crash.
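For anyone else landing here, upgrading is just a reinstall; a minimal check-and-upgrade on Linux might look like this (the official install script fetches the latest release, so the exact version you end up with depends on when you run it):

```shell
# Check the currently installed version.
ollama --version

# Re-run the official install script to upgrade to the latest
# release (Linux; on macOS, download the new app instead).
curl -fsSL https://ollama.com/install.sh | sh

# Retry the previously crashing model.
ollama run smollm:360m-instruct-v0.2-fp16 hello
```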