System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40
Docker label: sha-96b7b40-rocm
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
I followed the steps from the website and successfully set up the Docker container and the local model server:
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
--device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
--net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
--model-id local_path/Meta-Llama-70B-Instruct --num-shard 8
- Open another shell:
docker exec -it tgi_container_name /bin/bash
- Run the benchmark:
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
--sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
-b 1 -b 2
It then hangs after the following logs:
2024-06-17T11:01:59.291750Z INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-06-17T11:01:59.291802Z INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
2024-06-17T11:01:59.336401Z INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
2024-06-17T11:01:59.365280Z INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
2024-06-17T11:01:59.368575Z INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected
I also tried llama2-7b with only one GPU card, sequence length 512 and decode length 128, but it hangs as well:
2024-06-17T10:54:34.661975Z INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z INFO text_generation_router::server: router/src/server.rs:1868: Connected
Expected behavior
The prefill and decode latencies were as expected, but the benchmark has been stuck for almost an hour without producing any output.
Besides, GPU utilization is zero, while it was non-zero during the warmup steps.
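A minimal way to confirm the hang from inside the container is to watch GPU utilization while the benchmark runs (this assumes `rocm-smi` is available in the ROCm image, which it normally is). The warmup log above also notes that TunableOp can be disabled, which helps rule the kernel-tuning step out as the cause:

```shell
# Watch per-GPU utilization inside the container; it should be non-zero
# while prefill/decode are actually running.
watch -n 1 'rocm-smi --showuse'

# Per the warmup log, TunableOp can be turned off when relaunching TGI
# (hypothetical sketch; fill in the rest of the original docker run flags):
# docker run ... -e PYTORCH_TUNABLEOP_ENABLED=0 ... --model-id ...
```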
2 Answers
#1
Thanks for the report, yuqie!
cc @fxmarty as the author of the benchmark
#2
Hi yuqie, thank you. What happens after the benchmark is launched in a second terminal inside the container?
At this point you should have a graphical benchmark like https://youtu.be/jlMAX2Oaht0?t=198.