vllm [Bug]: Pending, but Avg generation throughput: 0.0 tokens/s

nwwlzxa7 · posted 2 months ago in Other

The current environment is as follows:

  • PyTorch version: 2.1.2+cu118
  • Is debug build: No
  • CUDA used to build PyTorch: 11.8
  • ROCM used to build PyTorch: N/A
  • OS: Ubuntu 22.04.4 LTS (x86_64)
  • GCC version: Ubuntu 11.4.0-1ubuntu1~22.04
  • Clang version: Could not collect
  • CMake version: 3.29.0
  • Libc version: glibc-2.35
  • Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime)
  • Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
  • Is CUDA available: Yes
  • CUDA runtime version: 11.8.89
  • CUDA_MODULE_LOADING set to: LAZY
  • GPU models and configuration:
    GPU 0:NVIDIA A100-SXM4-80GB
    GPU 1:NVIDIA A100-SXM4-80GB
    GPU 2:NVIDIA A100-SXM4-80GB
    GPU 3:NVIDIA A100-SXM4-80GB
    GPU 4:NVIDIA A100-SXM4-80GB
    GPU 5:NVIDIA A100-SXM4-80GB
    GPU 6:NVIDIA A100-SXM4-80GB
    GPU 7:NVIDIA A100-SXM4-80GB
  • Nvidia driver version: 525.147.05
  • cuDNN version: Could not collect
  • HIP runtime version: N/A
  • MIOpen runtime version: N/A
  • Is XNNPACK available: Yes
  • CPU info:
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Address sizes: 43 bits physical, 48 bits virtual
    Byte Order: Little Endian
    CPU(s): 128
    On-line CPU(s) list: 0-127
    Vendor ID: AuthenticAMD
    Model name: AMD EPYC 7542 32-Core Processor
    CPU family:23
    Model:49
    Thread(s) per core:2
    Core(s) per socket:32
    Socket(s):2
    Stepping:0
    Frequency boost: N/A
    CPU max MHz: 2900.000000
    CPU min MHz: 1500.000000
    BogoMIPS: 5788.97
    Flags: (CPU feature-flag list garbled in the original report; omitted)
This issue may be caused by the model generating overly long text. You can try reducing the value of max_tokens, for example setting it to 500 or lower, to shorten the generated text. The modified code is as follows:
import json
import requests

# api_url is assumed to be defined elsewhere, pointing at the vLLM server
def generate_text(prompt):
    data = {
        "model": "******",
        "prompt": prompt,
        "max_tokens": 500,  # reduce this value
        "temperature": 0.8,
        "repetition_penalty": 1.1,
        "top_k": -1,
        "top_p": 0.8,
        "n": 3,
    }
    headers = {'Content-Type': 'application/json'}
    # stream=True was pointless here: response.content reads the whole body anyway
    response = requests.post(api_url, headers=headers, json=data, timeout=60)
    response.raise_for_status()
    response_data = response.json()
    response_content1 = response_data['choices'][0]["text"]
    response_content2 = response_data['choices'][1]["text"]
    response_content3 = response_data['choices'][2]["text"]
    print(response_content1)
    print(response_content2)
    print(response_content3)
    return response_content1, response_content2, response_content3

If the problem persists, you can try reducing max_tokens further, or check whether the model's training data and parameter settings are appropriate.
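If you actually want incremental output rather than waiting for the full body, note that vLLM's demo api_server has historically delimited streamed JSON chunks with a null byte; the helper below assumes that chunk format (the function name is my own) and can be tried on a raw streamed body:

```python
import json

def split_stream_chunks(buffer: bytes):
    """Split a raw streamed /generate body into JSON objects.

    Assumes vLLM's demo api_server format, where each streamed JSON
    chunk is terminated by a null byte.
    """
    chunks = []
    for part in buffer.split(b"\0"):
        if part:
            chunks.append(json.loads(part.decode("utf-8")))
    return chunks

# Example with a synthetic two-chunk stream; each chunk carries the
# cumulative text so far, so the last chunk holds the final output.
raw = b'{"text": ["Hel"]}\0{"text": ["Hello"]}\0'
chunks = split_stream_chunks(raw)
print(chunks[-1]["text"][0])  # -> Hello
```

With a live server you would feed `response.iter_content()` bytes into a buffer and split on the same delimiter.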

p8h8hvxi 1#

I ran into a similar error. The model is hosted with KServe and vLLM. It runs fine with a 125M model, but after switching to a 10B model the following appears.

INFO 06-07 12:24:29 async_llm_engine.py:117] Received request 4191e1b08e1a48a890dd1d07e55f10ae: prompt: 'Triton 추론 서버란 무엇입니까?', sampling params: SamplingParams(n=2, best_of=2, presence_penalty=0.0, frequency_penalty=0.0, temperature=0.0, top_p=1.0, top_k=-1, use_beam_search=True, stop=[], ignore_eos=False, max_tokens=500, logprobs=None), prompt token ids: None.
 INFO 06-07 12:24:29 llm_engine.py:394] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%
 INFO 06-07 12:24:34 llm_engine.py:394] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 108.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%
 INFO 06-07 12:24:38 async_llm_engine.py:171] Finished request 4191e1b08e1a48a890dd1d07e55f10ae.
 INFO: 127.0.0.6:0 - "POST /generate HTTP/1.1" 200 OK
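The llm_engine stats lines above have a fixed shape, which makes it easy to watch throughput and pending requests over time. A small sketch (the regex and function name are my own, matched against the log format shown here):

```python
import re

# Matches the stats line emitted by vllm's llm_engine, as seen in the log above
STATS_RE = re.compile(
    r"Avg prompt throughput: ([\d.]+) tokens/s, "
    r"Avg generation throughput: ([\d.]+) tokens/s, "
    r"Running: (\d+) reqs, Swapped: (\d+) reqs, Pending: (\d+) reqs"
)

def parse_stats(line: str):
    """Extract throughput and queue counters from one log line, or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "prompt_tps": float(m.group(1)),
        "gen_tps": float(m.group(2)),
        "running": int(m.group(3)),
        "swapped": int(m.group(4)),
        "pending": int(m.group(5)),
    }

line = ("INFO 06-07 12:24:34 llm_engine.py:394] Avg prompt throughput: 0.0 tokens/s, "
        "Avg generation throughput: 108.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, "
        "Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%")
print(parse_stats(line)["gen_tps"])  # -> 108.8
```

A long run of lines where gen_tps stays 0.0 while pending grows is the symptom this issue is about; brief 0.0 readings right after a request arrives (as in the log above) are normal warm-up.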

The /generate API was used as described in the documentation:

import os
import requests

# URL and prompt are assumed to be defined elsewhere
token = os.environ['AUTH_TOKEN']
headers = {"Authorization": f"Bearer {token}"}
pload = {
    "prompt": prompt,
    "n": 2,
    "use_beam_search": True,
    "temperature": 0.0,
    "max_tokens": 500,
    "stream": False,
}
response = requests.post(URL, headers=headers, json=pload)
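For what it's worth, the demo vLLM api_server's non-streaming /generate response is a JSON body of the form {"text": [...]}, with one entry per requested output (n=2 above). A minimal sketch of unpacking it, assuming that shape (the helper name is my own):

```python
def unpack_generate_response(body: dict, n: int = 2):
    """Return the list of generated texts from a /generate response body.

    Assumes the demo api_server shape {"text": [...]}, one entry per
    requested output.
    """
    texts = body.get("text", [])
    if len(texts) != n:
        raise ValueError(f"expected {n} outputs, got {len(texts)}")
    return texts

# Example with a synthetic response body:
body = {"text": ["output one", "output two"]}
print(unpack_generate_response(body))  # -> ['output one', 'output two']
```

With the request above you would call `unpack_generate_response(response.json())` after checking `response.status_code`.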
