vllm [Bug]: NCCL timeout during inference

Posted by yiytaume 6 months ago in Other

Current environment

Using:

vllm 0.4.1
nccl 2.18.1
pytorch 2.2.1
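
When comparing reports like this one, it helps to collect the three versions the same way on every machine. The snippet below is a minimal sketch, not an official vLLM tool: `collect_versions` is a hypothetical helper name, and it assumes the packages are installed under the distribution names `vllm` and `torch`. It falls back to a placeholder string when something is missing.

```python
from importlib import metadata


def collect_versions(packages=("vllm", "torch")):
    """Return a dict of version strings for the given packages plus NCCL."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"

    # The NCCL version is reported by PyTorch itself. Any failure here
    # (torch missing, CPU-only build, unexpected return type) is reported
    # as "unavailable" rather than raised.
    try:
        import torch
        versions["nccl"] = ".".join(str(p) for p in torch.cuda.nccl.version())
    except Exception:
        versions["nccl"] = "unavailable"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name} {version}")
```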

🐛 Describe the bug

During inference, I occasionally hit this error:

(RayWorkerWrapper pid=2376582) [rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, NumelOut=0, Timeout(ms)=600000) ran for 600327 milliseconds before timing out.
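
Reading the line above: on rank 1, a GATHER collective with sequence number 50404 ran for 600327 ms and exceeded the 600000 ms (10 minute) watchdog timeout. If these errors show up repeatedly, pulling the fields out of the logs makes it easier to see which rank and which operation stall. The following is a minimal sketch that assumes the message format is exactly the one shown in this report. `parse_watchdog_timeout` is a hypothetical helper, not part of PyTorch or vLLM.

```python
import re

# The log line from this report, minus the Ray worker prefix.
LOG_LINE = (
    "[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective "
    "operation timeout: WorkNCCL(SeqNum=50404, OpType=GATHER, NumelIn=8000, "
    "NumelOut=0, Timeout(ms)=600000) ran for 600327 milliseconds before timing out."
)

# Assumes the message layout shown above; other PyTorch versions may differ.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?"
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the timeout fields as a dict, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Every captured field is numeric except the operation type.
    return {
        key: (value if key == "op" else int(value))
        for key, value in match.groupdict().items()
    }


if __name__ == "__main__":
    info = parse_watchdog_timeout(LOG_LINE)
    print(info)
    print("timeout in minutes:", info["timeout_ms"] / 60000)
```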

I never saw this error in earlier versions of vllm. Any ideas?

Source: https://github.com/vllm-project/vllm/issues/4653

6 answers

5jvtdoz21#

Same problem here. It occurs at random on my dataset.
vllm 0.4.1
torch 2.2.0+cu118

6 months ago

xxe27gdn2#

I ran into the same problem. Using --disable-custom-all-reduce and --enforce-eager worked for me.

6 months ago

dgenwo3n3#

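See #4430

--disable-custom-all-reduce = True
--enforce-eager = True (may not be needed)
Update to a version that includes [Core] Ignore infeasible swap requests. #4557
These three changes solved my problem. Before them, the NCCL watchdog error occurred several times a day; now it runs fine.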
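
For anyone launching the OpenAI-compatible server from a script, the two flags above can be added as in the sketch below. As far as I know both are plain boolean switches on the command line, so they are passed without a value. `build_server_command` is a hypothetical helper, and the model name and tensor-parallel size are placeholders, not values from this thread.

```python
import shlex


def build_server_command(model, tensor_parallel_size=2,
                         disable_custom_all_reduce=True, enforce_eager=True):
    """Build the argument list for launching the vLLM API server."""
    cmd = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Workaround 1 from this thread: fall back from vLLM's custom
    # all-reduce kernel to NCCL all-reduce.
    if disable_custom_all_reduce:
        cmd.append("--disable-custom-all-reduce")
    # Workaround 2 (possibly optional): run in eager mode instead of
    # capturing CUDA graphs.
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


if __name__ == "__main__":
    # "my-org/my-model" is a placeholder model name.
    print(shlex.join(build_server_command("my-org/my-model")))
```

When using the Python API instead, the `LLM` constructor accepts the same options as the keyword arguments `disable_custom_all_reduce=True` and `enforce_eager=True`. Eager mode can cost some throughput, so it may be worth trying the first flag alone before enabling both.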

6 months ago

km0tfn4u4#

We are also seeing this on version 0.4.2 with Mixtral 8x22B. Disabling custom all-reduce fixes it.

6 months ago

njthzxwz5#

Can we go back to disabling custom all-reduce by default?

6 months ago

pexxcrt26#

"I ran into the same problem. Using --disable-custom-all-reduce and --enforce-eager worked for me."
This worked for me too! Thanks a lot!

6 months ago