vllm [Bug]: Inconsistent batched inference (even with temperature 0)


Current environment

PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A

GCC version: (GCC) 8.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17

Python version: 3.9.19 (main, May  6 2024, 19:43:03)  [GCC 11.2.0] (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.6.124
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version: 535.104.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    32
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106
Model name:            Intel(R) Xeon(R) Platinum 8369B CPU @ 2.90GHz
Stepping:              6
CPU MHz:               3490.909
BogoMIPS:              5807.31
Virtualization:        VT-x
L1d cache:             48K
L1i cache:             32K
L2 cache:              1280K
L3 cache:              49152K
NUMA node0 CPU(s):     0-127

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==1.13.1
[pip3] torchaudio==0.13.1
[pip3] torchvision==0.14.1
[pip3] transformers==4.37.0
[pip3] transformers-stream-generator==0.0.5
[conda] blas                      1.0                         mkl  
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] mkl-service               2.4.0            py39h5eee18b_1  
[conda] mkl_fft                   1.3.8            py39h5eee18b_0  
[conda] mkl_random                1.2.4            py39hdb19cb5_0  
[conda] numpy                     1.26.4           py39h5f9d8c6_0  
[conda] numpy-base                1.26.4           py39hb5e798b_0  
[conda] pytorch                   1.13.1          py3.9_cuda11.6_cudnn8.3.2_0    pytorch
[conda] pytorch-cuda              11.6                 h867d48c_1    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.13.1               py39_cu116    pytorch
[conda] torchvision               0.14.1               py39_cu116    pytorch
[conda] transformers              4.37.0                   pypi_0    pypi
[conda] transformers-stream-generator 0.0.5                    pypi_0    pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     0-127           N/A             N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     0-127           N/A             N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     0-127           N/A             N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     0-127           N/A             N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     PXB     SYS     0-127           N/A             N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     PXB     SYS     0-127           N/A             N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     SYS     PXB     0-127           N/A             N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     SYS     PXB     0-127           N/A             N/A
NIC0    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC1    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3

🐛 Describe the bug

When I run inference with different batch sizes (1 versus any larger number), the responses differ, even though the prompt and the sampling parameters are the same. I would like to understand why this happens and how to keep the responses consistent.
Specifically, with batch size 1 the response is A, while with batch size 2, 10, or 20 (same prompt) every response is B, and B differs from A.
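For context (this is a hedged explanation, not part of the original report): greedy decoding (temperature 0) producing different tokens at different batch sizes is commonly attributed to floating-point non-associativity. Batched GPU matmul/attention kernels reduce sums in a different order depending on batch size, so logits can differ by a few ULPs, which is enough to flip the argmax when two candidate tokens are nearly tied. A minimal, framework-free sketch of the underlying numerical effect:

```python
# Floating-point addition is not associative: summing the same numbers
# in a different order can yield a slightly different result. Batched
# kernels change reduction order with batch size, so logits (and thus
# the greedy argmax) can change even at temperature 0.
vals = [0.1, 0.2, 0.3]

left_to_right = (vals[0] + vals[1]) + vals[2]
right_to_left = vals[0] + (vals[1] + vals[2])

print(left_to_right)                   # 0.6000000000000001
print(right_to_left)                   # 0.6
print(left_to_right == right_to_left)  # False
```

Once a single near-tied token flips, every subsequent token is conditioned on a different prefix, so the two generations diverge completely rather than by one word.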
