llama.cpp Bug: (CUDA) Corrupted output when offloading to multiple GPUs

Posted by im9ewurl 5 months ago in Other


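What happened?

Problem
Some models produce corrupted output when offloaded to multiple CUDA GPUs. The problem goes away when offloading to a single GPU or when running on the CPU only.
I was able to reproduce the problem with the following models: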

llama 3.0 8B
llama 3.1 8b
glm-4-9b

However, I could not reproduce it with:

Mistral Nemo

Bug 1

When offloaded to multiple GPUs, the model gives a wrong answer. It seems unable to parse the prompt correctly.

Bug 2

When a second prompt is sent, the model reuses information from the first prompt.

Steps to reproduce Bug 1

Download llama 3.0 8B from HF
Download my sample prompt:

llama-multi-gpu-bug.txt

Launch with CUDA_VISIBLE_DEVICES=0 ./llama-cli -c 8192 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -ngl 99 -f /tmp/llama-multi-gpu-bug.txt -sp -s 0
Verify that the model answers correctly: (Alice:7,Bob:8,Charlie:7). The full log is below.
Launch again with CUDA_VISIBLE_DEVICES=0,1 ./llama-cli -c 8192 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -ngl 99 -f /tmp/llama-multi-gpu-bug.txt -sp -s 0
The model answers incorrectly: (Alice:7,Bob:7,Charlie: null), then says that Charlie did not provide an answer.
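
The two runs above differ only in CUDA_VISIBLE_DEVICES, so the comparison can be scripted. The following is a minimal sketch, not part of the original report: the helper names build_run and run are hypothetical, and it assumes llama-cli, the model file, and the prompt file are at the paths used in the steps above.

```python
import os
import shlex
import subprocess

MODEL = "Meta-Llama-3-8B-Instruct-Q6_K.gguf"
PROMPT = "/tmp/llama-multi-gpu-bug.txt"


def build_run(devices):
    """Return (env, argv) for one llama-cli run limited to the given GPUs.

    `devices` is the value for CUDA_VISIBLE_DEVICES, e.g. "0" or "0,1".
    The flags are copied verbatim from the reproduction steps.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=devices)
    argv = shlex.split(
        f"./llama-cli -c 8192 -m {MODEL} -ngl 99 -f {PROMPT} -sp -s 0"
    )
    return env, argv


def run(devices):
    """Run llama-cli once and return its standard output as text."""
    env, argv = build_run(devices)
    result = subprocess.run(argv, env=env, capture_output=True, text=True)
    return result.stdout
```

Calling run("0") and run("0,1") and comparing the two strings reproduces the single-GPU versus dual-GPU comparison with the same seed and prompt.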

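Steps to reproduce Bug 2

Start the server with CUDA_VISIBLE_DEVICES=0,1 ./llama-server -ngl 99 -c 8192 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf
Go to the new UI
Select the llama 3 template.
Copy and paste the contents of the sample prompt:

apples_majority.txt and send it

The model gives the same corrupted answer as in Bug 1
Go back
Copy and paste this new prompt:

apples_simple.txt and send it

The model gives a JSON answer, as if it were answering the first prompt.

The second prompt shares the same prefix as the first.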
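
Since the shared prefix matters for any cache that reuses earlier work, it can help to measure how long it actually is. The sketch below is an illustration only: common_prefix_len is a hypothetical helper, it works on any two sequences (strings or token ID lists), and it does not claim to mirror how llama.cpp decides what to reuse.

```python
def common_prefix_len(a, b):
    """Count how many leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Made-up token IDs, for illustration only: the first three tokens match.
first_prompt = [128000, 9906, 1917, 42, 7]
second_prompt = [128000, 9906, 1917, 99]
shared = common_prefix_len(first_prompt, second_prompt)
print(f"{shared} shared tokens, {len(second_prompt) - shared} new tokens")
```

Running this on the tokenized contents of apples_majority.txt and apples_simple.txt would show how much of the second prompt overlaps with the first.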

Full log of the correct answer

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their step-by-step breakdowns of the problem, but they all reached the conclusion that Matteo discards a quarter of his remaining fruits equally between apples and oranges.",
  "who is right": "It's a tie between Alice and Charlie, as they both answered 7 apples remaining.",
  "answer": 7
}

Note that since Alice and Charlie have the same answer, 7, they are considered the "right" answer based on majority voting.

Full log of the wrong answer

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, which is 7 apples. Since Charlie did not provide an answer, we use majority voting to determine the correct answer.",
  "who is right": "Alice and Bob",
  "answer": 7
}

In this case, both Alice and Bob provided the same answer, which is 7 apples. Since Charlie did not provide an answer, we use majority voting to determine the correct answer, which is 7 apples.

My setup:

Linux, with:

Nvidia 4060 16GB
Nvidia 3060 12GB

Name and version

version: 3463 (dc820804)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
The reference model is llama3.0 8b by bartowski, SHA256: d6f1dcba991f8e629531a5f4cf19e4dbc2a89f80fd20737e256b92aac11572f1
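
Anyone trying to reproduce this can first confirm they have the same model file by checking its digest. A minimal sketch using only the Python standard library follows; the function name sha256_of_file is hypothetical, and the file path in the usage note is the one from the reproduction steps.

```python
import hashlib

# Digest quoted in the report for the reference GGUF file.
EXPECTED_SHA256 = "d6f1dcba991f8e629531a5f4cf19e4dbc2a89f80fd20737e256b92aac11572f1"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a multi-GB model is never fully in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, sha256_of_file("Meta-Llama-3-8B-Instruct-Q6_K.gguf") == EXPECTED_SHA256 should hold if the download matches the file used in the report.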


Source: https://github.com/ggerganov/llama.cpp/issues/8685

7 Answers


exdqitrt1#

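This is not conclusive evidence of an actual bug. Floating-point rounding errors differ between GPUs, so the results will not be identical across an RTX 4060, an RTX 3060, and the two combined. It is therefore expected that some inputs will produce better or worse results purely by chance. I would only consider a statistically significant difference in llama-perplexity to be conclusive evidence.

I get exactly the same results between 1x RTX 4090 and 2x RTX 4090, so I don't think there is a bug in the multi-GPU code.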
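
The rounding argument rests on floating-point addition not being associative: the same numbers summed in a different order can give a slightly different result. The snippet below illustrates that general effect in plain Python with double precision; it says nothing about which summation orders the actual CUDA kernels use on each GPU.

```python
# The same three values, grouped in two different ways.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)

print(grouped_left == grouped_right)      # False
print(abs(grouped_left - grouped_right))  # tiny, but not zero
```

Each such difference is minuscule, but a model evaluation chains a very large number of these operations, which is why bit-identical output across different hardware is not expected.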


prdp8dxp2#

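@JohannesGaessler
Bug 2 should be convincing enough evidence.
When the second prompt is sent, the model answers as if it had received the first prompt.
Here is a recap of what happened:

I sent a question, specifying that the answer must be given in a specific JSON format.
The model answered in JSON.
I sent the same question without specifying an answer format (apples_simple.txt).
The model answered with a JSON response that uses exactly the schema specified in the first prompt, even though the second prompt specified nothing.
Edit:
llama-perplexity gives roughly the same value in both cases, with one GPU and with two GPUs.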


nnvyjq4y3#

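You did not post any output for the second prompt, and even so I don't see why this would be a problem with the multi-GPU setup. It would be a problem if the multi-GPU code produced incorrect numerical output beyond the differences caused by floating-point rounding error. Information leaking between prompts should only happen if something is wrong with KV cache management, but I don't see why that would happen here or, more specifically, why it would happen only with multiple GPUs.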


r1zhe5dt4#

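For this test I changed only the tensor split, to keep the differences to a minimum.
I tokenized the strings in a separate step and sent the tokenized strings to the server.
The first GPU is the 4060, the second is the 3060.

With -ts 100,0: correct replies

Using ./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 100,0
Result from apples_majority.txt:
Here is the output: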

json
{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their individual breakdowns of the problem.",
  "who is right": "It's a tie between Alice and Charlie, both of whom answered 7 apples remaining.",
  "answer": 7
}

After apples_simple.txt:

The answers from the individual users are:

* Alice: 7 apples
* Bob: 8 apples
* Charlie: 7 apples

To determine the answer by majority voting, we can count the number of users who answered each option:

* 7 apples: 2 users (Alice and Charlie)
* 8 apples: 1 user (Bob)

Since 2 users answered 7 apples and only 1 user answered 8 apples, the answer chosen by majority voting is:

* 7 apples

Both answers are consistent with the prompts provided.

With -ts 50,50: unwanted replies

./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 50,50
After apples_majority.txt:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, 7 apples, which is the majority vote.",
  "who is right": "Alice and Bob",
  "answer": 7
}

After apples_simple.txt:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob arrived at the same answer, 7 apples, despite having slightly different steps. Charlie did not provide an answer.",
  "who is right": "Both Alice and Bob",
  "answer": 7
}

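The first answer is wrong, but it respects the requested format. The second reply is an answer to the first prompt, even though the second prompt is different.

With -ts 75,25

With a 75,25 split I get all wrong answers, just as with 50,50.

With -ts 25,75

With a 25,75 split I get all correct answers, the same as with 100,0.

With -ts 0,100

With a 0,100 split I get all correct answers, the same as with 100,0.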
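
Classifying each split as correct or wrong by eye gets tedious once several splits and seeds are involved. The sketch below automates the check for the JSON-format prompt (apples_majority.txt). It is an illustration only: the helper names are hypothetical, and the expected answers are the ones given in the reproduction steps (Alice 7, Bob 8, Charlie 7).

```python
import json

# Answers the model should extract, per the reproduction steps.
EXPECTED = {"alice": 7, "bob": 8, "charlie": 7}


def extract_answers(text):
    """Return the "answers" object from the first JSON object in `text`.

    Returns None when the reply has no parseable JSON object.
    """
    start = text.find("{")
    if start < 0:
        return None
    try:
        # raw_decode parses one JSON value and ignores any trailing prose.
        obj, _ = json.JSONDecoder().raw_decode(text[start:])
    except json.JSONDecodeError:
        return None
    return obj.get("answers") if isinstance(obj, dict) else None


def is_correct(text):
    """True when the reply lists exactly the expected per-user answers."""
    return extract_answers(text) == EXPECTED
```

Applied to the replies above, is_correct would return True for the 100,0 reply and False for the 50,50 reply, where "bob" is 7 and "charlie" is null.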


anhgbhbe5#

I assume you are not using prompt caching, since it is disabled by default. In that case, do you still get the "wrong" JSON reply for apples_simple.txt if you don't run apples_majority.txt first?


utugiqy66#

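> I assume you are not using prompt caching, since it is disabled by default. In that case, do you still get the "wrong" JSON reply for apples_simple.txt if you don't run apples_majority.txt first?

Prompt caching was enabled. I didn't mention that. Sorry.
With the prompt cache set to False, I do not get JSON output from apples_simple. In hindsight, that should have been the first thing to check. (I should open a new issue for this.)
It is still odd that the JSON problem depends on the choice of tensor split.
Even with the cache removed, when the model runs on multiple GPUs it gives me the wrong answer for any choice of seed. This looks like a different problem from rounding error.
Is there a way to rule out rounding error? For example, by running everything in fp32?
Here are the custom parameters I am currently using: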

prompt_cache: False
stream: True
temperature: 0.01
seed: 0
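
For anyone scripting the same test instead of using the web UI, these parameters could be sent in a request body. The sketch below only builds that body. It rests on assumptions that are not in the post: that the request goes to llama-server's /completion endpoint, that the endpoint's name for the caching switch is cache_prompt (the post calls it prompt_cache), and that the prompt is passed as a list of token IDs, as described in the earlier reply. The function name is hypothetical.

```python
import json


def build_completion_payload(prompt_tokens):
    """Build a request body mirroring the custom parameters listed above.

    Assumption: field names follow llama-server's /completion endpoint.
    """
    return {
        "prompt": list(prompt_tokens),  # pre-tokenized prompt
        "cache_prompt": False,          # listed as prompt_cache: False
        "stream": True,
        "temperature": 0.01,
        "seed": 0,
    }


# Made-up token IDs, for illustration only.
body = json.dumps(build_completion_payload([128000, 9906, 1917]))
print(body)
```

Keeping the payload in one function makes it easy to confirm that every run across tensor splits uses identical sampling settings.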

With -ts 100,0

Result from apples_majority.txt with ./llama-server -ngl 175 -t 6 -c 8192 --host 0.0.0.0 -m Meta-Llama-3-8B-Instruct-Q6_K.gguf -fa --override-kv tokenizer.ggml.add_bos_token=bool:false -ts 100,0:

Here is the output:

json
{
  "answers": {
    "alice": 7,
    "bob": 8,
    "charlie": 7
  },
  "reasoning": "The users provided different answers based on their individual breakdowns of the problem.",
  "who is right": "It's a tie between Alice and Charlie, both of whom answered 7 apples remaining.",
  "answer": 7
}

After apples_simple.txt:

The answers from the individual users are:

* Alice: 7 apples
* Bob: 8 apples
* Charlie: 7 apples

To determine the answer by majority voting, we can count the number of users who answered each option:

* 7 apples: 2 users (Alice and Charlie)
* 8 apples: 1 user (Bob)

Since 2 users answered 7 apples and only 1 user answered 8 apples, the answer chosen by majority voting is:

* 7 apples

With -ts 50,50

apples_majority.txt:

Here is the output:

{
  "answers": {
    "alice": 7,
    "bob": 7,
    "charlie": null
  },
  "reasoning": "Both Alice and Bob provided the same answer, 7 apples, which is the majority vote.",
  "who is right": "Alice and Bob",
  "answer": 7
}

apples_simple.txt

The answers from the individual users are:

* Alice: 7 apples
* Bob: 7 apples
* Charlie: 7 apples

The majority voting result is:

* 7 apples

The output of the problem chosen by majority voting is:

Matteo has 7 apples remaining.


eqqqjvef7#

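> Is there a way to rule out rounding error?
Yes. You can rule out rounding error by collecting a lot of data and analyzing it statistically (on average, rounding error should not affect the percentage of correct and wrong answers). For a simple question where you only need to check whether the answer is correct, the uncertainty on the probability p of getting a correct answer, given a sample size n, can be estimated as
$$ \Delta p = \sqrt{\frac{p (1 - p)}{n}} . $$
As you can see, you need a very large sample size (at least 1000) to get good precision (also because this formula is only valid in the limit of large samples). That is why I told you to use llama-perplexity, where the model is evaluated on hundreds of thousands of tokens in a way that allows statistical significance to be estimated.
> For example, by running everything in fp32?
If you use FP32 the rounding error is reduced, but because of the structure of neural networks it is fundamentally impossible to say how much the output will change in response to a small perturbation of the input. This is not a reliable way to rule out differences caused by rounding error; only a large sample size combined with statistical analysis is.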
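
To get a feel for the formula, the uncertainty can be evaluated for a few sample sizes. This is a small sketch of that arithmetic; the function name binomial_std_error and the choice p = 0.5 (the worst case, where the uncertainty is largest) are illustrative assumptions.

```python
import math


def binomial_std_error(p, n):
    """Uncertainty on an observed success rate p from n independent trials."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: delta p = {binomial_std_error(0.5, n):.4f}")
```

With p = 0.5 and n = 1000 the uncertainty is about 0.016, so two configurations would need to differ in accuracy by several percentage points before the difference stands out from noise.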

