Performance discussion
I ran some simple NCCL tests on a multi-node pipeline-parallel setup. I doubled the bandwidth between the nodes but saw no increase in t/s or throughput. I tested both 400 Gbps and 800 Gbps between the nodes and got the same t/s, even though nccl-tests showed 2x the bandwidth between the GPUs.
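For reference, the ./build/all_reduce_perf style binaries used below come from NVIDIA's nccl-tests repository; a minimal build sketch, assuming OpenMPI and CUDA live under typical Ubuntu aarch64 paths (the MPI_HOME/CUDA_HOME values are assumptions to adapt):
# Build nccl-tests with MPI support so a single mpirun can span both nodes.
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make -j MPI=1 MPI_HOME=/usr/lib/aarch64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda
# The binaries land in ./build/, e.g. ./build/all_reduce_perf used in the tests below.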
Questions:
- If it is not the inter-node link, what is the bottleneck in multi-node pipeline parallelism?
- What other tests or profiling can be run to find the bottleneck with pipeline parallelism? (See the profiling sketch after this list.)
- Are there currently any benchmarks in vllm (such as benchmark_throughput.py or benchmark_latency.py) that can be run with --pipeline-parallel-size? (Async is required: "Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely degraded otherwise.")
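For the profiling question, one avenue already exposed by vLLM's CLI (it shows up as ray_workers_use_nsight in the args dump further below) is to capture an Nsight Systems trace of each Ray worker and compare how much of a decode step is spent in NCCL send/recv versus compute kernels. A sketch, assuming the flag behaves as its name suggests and that nsys is installed on both nodes:
# Launch with Nsight Systems profiling of the Ray workers (report paths may vary).
python -m vllm.entrypoints.openai.api_server \
    --model /models/Meta-Llama-3-70B-Instruct \
    --pipeline-parallel-size 2 \
    --distributed-executor-backend ray \
    --ray-workers-use-nsight
# Open the generated .nsys-rep files in Nsight Systems and compare the share of each
# decode step spent in ncclSend/ncclRecv versus GEMM/attention kernels.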
Environment:
(2) x GH200 nodes (aarch64), 800 Gbps Ethernet between the nodes
Model: Llama-3-70B-Instruct
Test 1:
(4) x 200 Gbps links (800 Gbps aggregate)
NCCL tests:
Command:
mpirun -np 2 -host localhost,10.5.6.103 ./build/{all_reduce_perf,all_gather_perf,alltoall_perf} -b 512k -e 1024M -f 2 -g 1
NCCL test (all_reduce_perf): algbw = 70.83 GB/s, busbw = 70.83 GB/s
NCCL test (all_gather_perf): algbw = 137.76 GB/s, busbw = 68.88 GB/s
NCCL test (alltoall_perf): algbw = 47.32 GB/s, busbw = 23.66 GB/s
vLLM test:
(using Open WebUI and the OpenAI API)
python -m vllm.entrypoints.openai.api_server --model /models/Meta-Llama-3-70B-Instruct --pipeline-parallel-size 2 --distributed-executor-backend ray
Logs:
INFO 07-20 15:57:52 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 07-20 15:57:57 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 07-20 15:58:02 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
Test 2:
(2) x 200 Gbps links (400 Gbps aggregate)
NCCL tests:
Command:
mpirun -np 2 -host localhost,10.5.6.103 ./build/{all_reduce_perf,all_gather_perf,alltoall_perf} -b 512k -e 1024M -f 2 -g 1
NCCL test (all_reduce_perf): algbw = 45.42 GB/s, busbw = 45.42 GB/s
NCCL test (all_gather_perf): algbw = 91.32 GB/s, busbw = 45.66 GB/s
NCCL test (alltoall_perf): algbw = 49.29 GB/s, busbw = 24.64 GB/s
vLLM test:
(using Open WebUI and the OpenAI API)
python -m vllm.entrypoints.openai.api_server --model /models/Meta-Llama-3-70B-Instruct --pipeline-parallel-size 2 --distributed-executor-backend ray
Logs:
INFO 07-20 17:44:20 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.
INFO 07-20 17:44:25 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
INFO 07-20 17:44:30 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 21.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.5%, CPU KV cache usage: 0.0%.
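Both tests were driven by single interactive requests through the OpenAI-compatible endpoint; a minimal equivalent request, assuming the default port 8000 (the prompt and max_tokens are just examples), looks like:
# Issue one generation request against the server started above; the ~21 tokens/s
# shows up as "Avg generation throughput" in the server log while it runs.
curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/models/Meta-Llama-3-70B-Instruct",
         "prompt": "Explain pipeline parallelism in one paragraph.",
         "max_tokens": 256}'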
The startup logs show that nccl==2.21.5 is used during inference:
INFO 07-20 17:27:16 api_server.py:212] vLLM API server version 0.5.2
INFO 07-20 17:27:16 api_server.py:213] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='/models/Meta-Llama-3-70B-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend='ray', worker_use_ray=False, pipeline_parallel_size=2, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, model_loader_extra_config=None, preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-07-20 17:27:16,258 INFO worker.py:1603 -- Connecting to existing Ray cluster at address: 10.5.6.100:6379...
2024-07-20 17:27:16,262 INFO worker.py:1788 -- Connected to Ray cluster.
INFO 07-20 17:27:16 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='/models/Meta-Llama-3-70B-Instruct', speculative_config=None, tokenizer='/models/Meta-Llama-3-70B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/models/Meta-Llama-3-70B-Instruct, use_v2_block_manager=False, enable_prefix_caching=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-20 17:27:20 utils.py:737] Found nccl from library libnccl.so.2
INFO 07-20 17:27:20 pynccl.py:63] vLLM is using nccl==2.21.5
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:27:20 utils.py:737] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:27:20 pynccl.py:63] vLLM is using nccl==2.21.5
INFO 07-20 17:34:54 model_runner.py:266] Loading model weights took 67.6673 GB
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:34:57 model_runner.py:266] Loading model weights took 67.6673 GB
INFO 07-20 17:34:59 distributed_gpu_executor.py:56] # GPU blocks: 6149, # CPU blocks: 1638
INFO 07-20 17:35:00 model_runner.py:1007] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 07-20 17:35:00 model_runner.py:1011] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:35:00 model_runner.py:1007] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:35:00 model_runner.py:1011] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerWrapper pid=593, ip=10.5.6.103) INFO 07-20 17:35:12 model_runner.py:1208] Graph capturing finished in 12 secs.
Current environment
Collecting environment information...
WARNING 07-20 21:08:09 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/vllm/vllm/usage/usage_lib.py:19: RuntimeWarning: Failed to read commit hash:
No module named 'vllm.commit_id'
from vllm.version import __version__ as VLLM_VERSION
PyTorch version: 2.4.0a0+f70bd71a48.nv24.06
Is debug build: False
CUDA used to build PyTorch: 12.5
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-1009-nvidia-64k-aarch64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.5.40
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GH200 480GB
Nvidia driver version: 555.42.06
cuDNN version: Probably one of the following:
/usr/lib/aarch64-linux-gnu/libcudnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_adv.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_cnn.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_graph.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_heuristic.so.9.1.0
/usr/lib/aarch64-linux-gnu/libcudnn_ops.so.9.1.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Vendor ID: ARM
Model name: Neoverse-V2
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 72
Socket(s): -
Cluster(s): 1
Stepping: r0p0
Frequency boost: disabled
CPU max MHz: 3483.0000
CPU min MHz: 81.0000
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh bti
L1d cache: 4.5 MiB (72 instances)
L1i cache: 4.5 MiB (72 instances)
L2 cache: 72 MiB (72 instances)
L3 cache: 114 MiB (1 instance)
NUMA node(s): 9
NUMA node0 CPU(s): 0-71
NUMA node1 CPU(s):
NUMA node2 CPU(s):
NUMA node3 CPU(s):
NUMA node4 CPU(s):
NUMA node5 CPU(s):
NUMA node6 CPU(s):
NUMA node7 CPU(s):
NUMA node8 CPU(s):
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Not affected
Vulnerability Spectre v1: Mitigation; __user pointer sanitization
Vulnerability Spectre v2: Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] onnx==1.16.0
[pip3] optree==0.11.0
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==3.0.0+989adb9a2
[pip3] torch==2.4.0a0+f70bd71a48.nv24.6
[pip3] torch-tensorrt==2.4.0a0
[pip3] torchvision==0.19.0a0
[pip3] transformers==4.42.4
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.2
vLLM Build Flags:
CUDA Archs: 9.0+PTX; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS 0-71 0 1
NIC0 SYS X PIX SYS SYS
NIC1 SYS PIX X SYS SYS
NIC2 SYS SYS SYS X PIX
NIC3 SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
4 answers
Answer 1 (4dc9hkyq):
Please use export NCCL_DEBUG=TRACE to inspect the NCCL information. Most likely RDMA is not being used and NCCL is still going over sockets.
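A sketch of how to confirm which transport NCCL selected (the env vars are standard NCCL; the exact log wording varies by NCCL version):
# NCCL_DEBUG=INFO (or TRACE) plus a NET-focused subsystem filter keeps the output readable.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# Re-run a NCCL test (or vLLM) and inspect the transport lines:
#   "... via NET/IB/..."     -> RDMA (InfiniBand/RoCE) is in use
#   "... via NET/Socket/..." -> NCCL fell back to TCP sockets
mpirun -np 2 -host localhost,10.5.6.103 ./build/all_reduce_perf -b 512k -e 1024M -f 2 -g 1 2>&1 | grep "via NET"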
Answer 2 (o75abkj4):
Thanks for the suggestion! You were right: IB was not being used and NCCL was going over sockets. Setting --privileged on the container fixed that. However, I still see the same throughput. I will run more tests and report back if I find any optimizations. Cheers.
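For reference, --privileged is a broad grant; a sketch of a container launch that exposes only what RDMA needs instead (the device path, capability set, image name, and model mount are assumptions to adapt):
docker run --gpus all \
    --network host --ipc host \
    --device /dev/infiniband \
    --cap-add IPC_LOCK --ulimit memlock=-1 \
    -v /models:/models \
    <vllm-aarch64-image> \
    python -m vllm.entrypoints.openai.api_server \
        --model /models/Meta-Llama-3-70B-Instruct \
        --pipeline-parallel-size 2 \
        --distributed-executor-backend ray
# /dev/infiniband holds the uverbs devices RoCE/IB uses; IPC_LOCK plus an unlimited
# memlock limit let the NIC register pinned memory for RDMA.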
Answer 3 (pw9qyyiw):
You can benchmark the performance with https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py.
How did you double the bandwidth? If it is an InfiniBand or RoCE network, make sure NCCL is actually using it.
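A sketch of such a run against the already-started API server (flag names follow the current benchmark_serving.py and may differ slightly in 0.5.2; the ShareGPT dataset path is an assumption):
# Drive the server with many concurrent requests so the measurement is not limited
# by a single interactive stream.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --host localhost --port 8000 \
    --model /models/Meta-Llama-3-70B-Instruct \
    --dataset-name sharegpt \
    --dataset-path /data/ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate 8
# Compare aggregate token throughput between the 400 Gbps and 800 Gbps setups at
# several request rates, not just for a single request.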
Answer 4 (fnatzsnv):
The bandwidth was doubled by adding an extra BlueField-3 card to each server (during the tests I simply disabled 2 of the interfaces).
I will try a more thorough test with benchmark_serving.py. The network is RoCE; all of the InfiniBand tests check out and NCCL works as expected. vLLM also reports NCCL being used correctly (see the "vLLM is using nccl==2.21.5" lines in the startup log above).
I am also running inside a Docker container (required for aarch64). Not sure whether that affects bandwidth.
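One way to make the 2-link vs 4-link comparison explicit without disabling interfaces is to pin NCCL to specific HCAs; the mlx5_* names come from the topology dump above, while NCCL_IB_GID_INDEX=3 is the usual RoCE v2 GID and the management interface name is a placeholder:
# Restrict NCCL to two of the four BlueField-3 ports (~400 Gbps) ...
export NCCL_IB_HCA=mlx5_0,mlx5_1
# ... or allow all four (~800 Gbps):
# export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
export NCCL_IB_GID_INDEX=3              # RoCE v2 GID; verify with show_gids
export NCCL_SOCKET_IFNAME=<mgmt-iface>  # keep NCCL bootstrap traffic off the data NICs
# Re-run nccl-tests and the vLLM benchmark under each setting and compare.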