Current environment
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Red Hat Enterprise Linux release 8.9 (Ootpa) (x86_64)
GCC version: (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28
Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-513.24.1.el8_9.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
Nvidia driver version: 525.147.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7513 32-Core Processor
Stepping: 1
CPU MHz: 2600.000
CPU max MHz: 3681.6399
CPU min MHz: 1500.0000
BogoMIPS: 5199.85
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate pti ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.18.1 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.3.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 NIC0 CPU Affinity NUMA Affinity
GPU0 X NV12 SYS 4-7,68-71 0-1
GPU1 NV12 X SYS 4-7,68-71 0-1
NIC0 SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
How would you like to use vllm
I want to run inference on mistralai/Mixtral-8x7B-Instruct-v0.1 with the OpenAI-compatible server: python -m vllm.entrypoints.openai.api_server --model mistralai/Mixtral-8x7B-Instruct-v0.1 --port 6370 --tensor-parallel-size 2
When I run this command, the program freezes after the following output:
INFO 04-29 08:55:55 api_server.py:229] args: Namespace(host=None, port=6370, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='mistralai/Mixtral-8x7B-Instruct-v0.1', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-29 08:55:56 config.py:413] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.
2024-04-29 08:56:16,845 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8266
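To narrow down whether the hang is in the API server layer or in the distributed engine start-up, a minimal offline sketch along these lines can help (the model name and tensor parallel size come from the command above; the prompt and token budget are arbitrary assumptions):

# Minimal reproduction sketch: load the same model through the offline LLM entrypoint.
# If this also hangs right after "Started a local Ray instance", the problem is in the
# tensor-parallel engine start-up rather than in the OpenAI API server layer.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(outputs[0].outputs[0].text)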
Output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:C0:00.0 Off | 0 |
| N/A 20C P0 58W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:C3:00.0 Off | 0 |
| N/A 23C P0 60W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
When I use the smaller Mistral model and set --tensor-parallel-size to 1, it works as expected.
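Since tensor parallel size 1 works but 2 hangs, it is worth checking outside of vLLM whether NCCL communication between the two GPUs works at all. A minimal sanity-check sketch (the port and tensor values are arbitrary assumptions):

# Spawns one process per GPU and runs a single NCCL all_reduce between them.
# If this hangs, the problem is in the NCCL / GPU interconnect setup, not in vLLM.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(t)  # expect 2.0 on both ranks
    print(f"rank {rank}: all_reduce -> {t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)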
Update 1:
Got further with the container version (see the comment below): podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=xxxxx" -p 6370:8000 --ipc=host vllm/vllm-openai:v0.2.7 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
Update 2:
Successfully ran v0.2.7: podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=xxxxx" -p 6370:8000 --ipc=host vllm/vllm-openai:v0.2.7 --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2
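Once the container is up, the OpenAI-compatible endpoint can be verified with a plain HTTP request against the mapped host port (the prompt and max_tokens below are arbitrary examples):

# Quick smoke test against the OpenAI-compatible /v1/completions endpoint
# exposed on host port 6370 (mapped from 8000 inside the container).
import requests

resp = requests.post(
    "http://localhost:6370/v1/completions",
    json={
        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
        "prompt": "[INST] Write one sentence about GPUs. [/INST]",
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])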
However, v0.3.3 fails due to a Google DNS issue...
(RayWorkerVllm pid=1024) ERROR 04-29 20:41:42 ray_utils.py:44] Possible files are located at ['/lib64/libcuda.so.1'].Please create a symlink of libcuda.so to any of the file.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Update 3:
v0.4.0 shows this error:
(RayWorkerVllm pid=1024) ERROR 04-29 20:41:42 ray_utils.py:44] Possible files are located at ['/lib64/libcuda.so.1'].Please create a symlink of libcuda.so to any of the file.
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
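The error message itself says that only /lib64/libcuda.so.1 is visible and asks for a libcuda.so symlink. A small diagnostic sketch to confirm which names the dynamic loader can resolve inside the container (if libcuda.so fails to load, create the symlink the message asks for, or extend LD_LIBRARY_PATH with a directory that contains one):

# Checks whether the NVIDIA driver library resolves under both names the loader may look for.
import ctypes

for name in ("libcuda.so", "libcuda.so.1"):
    try:
        ctypes.CDLL(name)
        print(f"{name}: loaded OK")
    except OSError as err:
        print(f"{name}: {err}")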
Update 4:
v0.3.2 works as expected.
1 Answer
wfauudbj 1#
After testing the container version, I noticed I could get further,
but loading fails after about 10 minutes with the following output.
Interestingly, I can run mistralai/Mistral-7B-Instruct-v0.2 on 2 GPUs without any problem.