text-generation-inference gets stuck when running text-generation-benchmark on an AMD GPU


System Info

Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40
Docker label: sha-96b7b40-rocm

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

I followed the steps on the website and successfully set up the Docker container and the local model server.

docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
    --net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
    ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
    --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8
  1. Open another shell: docker exec -it tgi_container_name /bin/bash
  2. Run the benchmark
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2
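
Before launching the benchmark, the model server itself can also be queried directly over HTTP to confirm it is serving requests. A minimal sanity check, assuming the router listens on its default port 80 (the container runs with --net host):

# Send a single generate request to the running TGI server (port assumed)
curl 127.0.0.1:80/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json'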

The benchmark then gets stuck after the following logs.

2024-06-17T11:01:59.291750Z  INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-06-17T11:01:59.291802Z  INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
2024-06-17T11:01:59.336401Z  INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
2024-06-17T11:01:59.365280Z  INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
2024-06-17T11:01:59.368575Z  INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected

I also tried llama2-7b on a single GPU card with a sequence length of 512 and a decode length of 128, but it got stuck as well.
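
For reference, that single-GPU run used roughly the following benchmark invocation (a sketch; the local model path is taken from the logs below, while the warmup/run counts and batch size are assumptions):

text-generation-benchmark --tokenizer-name /home/zhuh/7b-chat-hf \
    --sequence-length 512 --decode-length 128 --warmups 2 --runs 10 \
    -b 1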

2024-06-17T10:54:34.661975Z  INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z  INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z  INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z  INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z  WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z  INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z  INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z  INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z  WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z  WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z  INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z  INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z  INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z  INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z  INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z  INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z  INFO text_generation_router::server: router/src/server.rs:1868: Connected

Expected behavior

Prefill and decode latency numbers are the expected output, but the benchmark has been stuck for almost an hour without printing anything.
In addition, GPU utilization is zero, whereas it was non-zero during the warmup steps.
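
A minimal way to keep an eye on GPU utilization from another shell while the benchmark runs, assuming rocm-smi and watch are available inside the container:

# Refresh the AMD GPU utilization summary every second
watch -n 1 rocm-smi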

ftf50wuq 1#

Thanks for the report, yuqie!
cc @fxmarty as the author of the benchmark

vjhs03f7 2#

Hello yuqie, thank you. In a second terminal inside the container, launch

text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
    --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
    -b 1 -b 2

What happens after that?
At that point you should get a graphical benchmark interface like the one at https://youtu.be/jlMAX2Oaht0?t=198.
