System information
OS version: Ubuntu 22.04
Model being used: Qwen/Qwen2-72B-Instruct
Hardware being used: 4x 40GB A100
Deployment specificities: running via Docker with the latest tag, as of June 26
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
The command is:
volume=/data/hf_cache
tgi_version="latest"
model_id="Qwen/Qwen2-72B-Instruct"
num_shard="4"
max_input_length=32767
max_batch_prefill_tokens=$max_input_length
max_total_tokens=32768
docker run \
-d \
--name tgi \
--gpus all \
--shm-size 1g \
-p 3000:3001 \
-v $volume:/data \
--restart=always harbor.aip.mitre.org/huggingface/text-generation-inference:$tgi_version \
--model-id $model_id \
--huggingface-hub-cache /data \
--num-shard $num_shard \
--max-input-length $max_input_length \
--max-batch-prefill-tokens $max_batch_prefill_tokens \
--max-total-tokens $max_total_tokens \
--quantize eetq
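For reference, the log output below was collected by following the container's logs. A minimal sketch of that, reusing the container name tgi and host port 3000 from the command above (the curl probe assumes the container-side port in the -p mapping matches the port the router actually listens on):
# Follow the launcher/router logs of the container started above
docker logs -f tgi
# Once the router reports it is connected, probe the published host port;
# adjust the -p mapping if the router listens on a different container port.
curl -s http://localhost:3000/health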
Expected behavior
A running LLM. The output I got instead is below:
2024-06-26T19:46:48.038583Z INFO text_generation_launcher: Args {
model_id: "Qwen/Qwen2-72B-Instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
4,
),
quantize: Some(
Eetq,
),
speculate: None,
dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: Some(
32767,
),
max_total_tokens: Some(
32768,
),
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: Some(
32767,
),
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "290a3e43304e",
port: 80,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: Some(
"/data",
),
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
}
2024-06-26T19:46:48.038727Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-06-26T19:46:48.067446Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-06-26T19:46:48.067460Z INFO text_generation_launcher: Sharding model on 4 processes
2024-06-26T19:46:48.067564Z INFO download: text_generation_launcher: Starting download process.
2024-06-26T19:46:49.913089Z INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:51.763885Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-06-26T19:46:52.576390Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-26T19:46:52.576816Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-26T19:46:52.576827Z INFO shard-manager: text_generation_launcher: Starting shard rank=2
2024-06-26T19:46:52.576829Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-06-26T19:46:52.576870Z INFO shard-manager: text_generation_launcher: Starting shard rank=3
2024-06-26T19:46:54.847550Z INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.864114Z INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.877120Z INFO text_generation_launcher: Detected system cuda
2024-06-26T19:46:54.883280Z INFO text_generation_launcher: Detected system cuda
2024-06-26T19:47:02.590571Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-26T19:47:02.591274Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-26T19:47:02.592424Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-06-26T19:47:02.593881Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
...
...
...
2024-06-26T19:54:43.498217Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=1
2024-06-26T19:54:45.986445Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=2
2024-06-26T19:54:45.998914Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-26T19:54:46.063442Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=3
2024-06-26T19:54:51.284297Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-2
2024-06-26T19:54:51.294666Z INFO shard-manager: text_generation_launcher: Shard ready in 478.716316913s rank=2
2024-06-26T19:54:51.860967Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-26T19:54:51.861355Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-3
2024-06-26T19:54:51.862223Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-06-26T19:54:51.872910Z INFO shard-manager: text_generation_launcher: Shard ready in 479.294074083s rank=3
2024-06-26T19:54:51.908038Z INFO shard-manager: text_generation_launcher: Shard ready in 479.32951286s rank=0
2024-06-26T19:54:51.909521Z INFO shard-manager: text_generation_launcher: Shard ready in 479.330793679s rank=1
2024-06-26T19:54:51.978363Z INFO text_generation_launcher: Starting Webserver
2024-06-26T19:54:52.051402Z INFO text_generation_router: router/src/main.rs:199: Using the Hugging Face API
2024-06-26T19:54:52.051440Z INFO hf_hub: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/hf-hub-0.3.2/src/lib.rs:55: Token file not found "/root/.cache/huggingface/token"
2024-06-26T19:54:52.333418Z INFO text_generation_router: router/src/main.rs:453: Serving revision 1af63c698f59c4235668ec9c1395468cb7cd7e79 of model Qwen/Qwen2-72B-Instruct
2024-06-26T19:54:52.568054Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|endoftext|>' was expected to have ID '151643' but was given ID 'None'
2024-06-26T19:54:52.568081Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_start|>' was expected to have ID '151644' but was given ID 'None'
2024-06-26T19:54:52.568084Z WARN tokenizers::tokenizer::serialization: /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/tokenizers-0.19.1/src/tokenizer/serialization.rs:159: Warning: Token '<|im_end|>' was expected to have ID '151645' but was given ID 'None'
2024-06-26T19:54:52.573774Z INFO text_generation_router: router/src/main.rs:307: Using config Some(Qwen2)
2024-06-26T19:54:52.573792Z WARN text_generation_router: router/src/main.rs:334: Invalid hostname, defaulting to 0.0.0.0
2024-06-26T19:54:52.578282Z INFO text_generation_router::server: router/src/server.rs:1554: Warming up model
2024-06-26T19:54:54.144694Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 961, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1227, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1152, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
mlp_output = self.mlp(normed_attn_res_output)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 202, in forward
return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 218, in forward
out = super().forward(input)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 37, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/eetq.py", line 23, in forward
output = w8_a16_gemm(input, self.weight, self.scale)
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
2024-06-26T19:54:54.144776Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 961, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1227, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1152, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
mlp_output = self.mlp(normed_attn_res_output)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 202, in forward
return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 218, in forward
out = super().forward(input)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 37, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/eetq.py", line 23, in forward
output = w8_a16_gemm(input, self.weight, self.scale)
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
2024-06-26T19:54:54.144786Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 961, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1227, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1152, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
mlp_output = self.mlp(normed_attn_res_output)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 202, in forward
return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 218, in forward
out = super().forward(input)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 37, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/eetq.py", line 23, in forward
output = w8_a16_gemm(input, self.weight, self.scale)
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
2024-06-26T19:54:54.144835Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 106, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.10/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/interceptor.py", line 21, in intercept
return await response
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.10/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 125, in Warmup
max_supported_total_tokens = self.model.warmup(batch)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 961, in warmup
_, batch, _ = self.generate_token(batch)
File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1227, in generate_token
out, speculative_logits = self.forward(batch, adapter_data)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 1152, in forward
logits, speculative_logits = self.model.forward(
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 373, in forward
hidden_states = self.model(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 314, in forward
hidden_states, residual = layer(
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 257, in forward
mlp_output = self.mlp(normed_attn_res_output)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 202, in forward
return self.down_proj(self.act(gate_up_states[:, 0]) * gate_up_states[:, 1])
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 218, in forward
out = super().forward(input)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 37, in forward
return self.linear.forward(x)
File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/eetq.py", line 23, in forward
output = w8_a16_gemm(input, self.weight, self.scale)
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
2024-06-26T19:54:54.358640Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.379404Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.382335Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
2024-06-26T19:54:54.388277Z ERROR warmup{max_input_length=32767 max_prefill_tokens=32767 max_total_tokens=32768 max_batch_size=None}:warmup: text_generation_client: router/client/src/lib.rs:46: Server error: CANCELLED
Error: WebServer(Warmup(Generation("CANCELLED")))
2024-06-26T19:54:54.536419Z ERROR text_generation_launcher: Webserver Crashed
2024-06-26T19:54:54.536440Z INFO text_generation_launcher: Shutting down shards
2024-06-26T19:54:54.577779Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-06-26T19:54:54.579074Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-06-26T19:54:54.600131Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-06-26T19:54:54.601631Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-06-26T19:54:54.612581Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-06-26T19:54:54.614073Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-06-26T19:54:54.615939Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-06-26T19:54:54.617687Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-06-26T19:54:55.380226Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-06-26T19:54:55.503083Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-06-26T19:54:55.515224Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
Error: WebserverFailed
2024-06-26T19:54:55.619102Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
1 Answer
Thanks for the issue @michaelthreet! Interestingly, the error you are hitting,
RuntimeError: [FT Error] Heurisitc failed to find a valid config.
appears to come from TRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/cutlass_kernels/cutlass_heuristic.cpp#L372. If you have run into this error before, you can reach out to @Narsil or @mfuntowicz.
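One way to narrow this down (a diagnostic sketch, not a confirmed fix: the 4096-token budget and the alternative quantizer mentioned in the comments are assumptions, and the variables reuse the ones defined in the reproduction above):
# Diagnostic A: keep EETQ but shrink the warmup prefill shape, in case the
# w8_a16_gemm heuristic only fails for the 32k-token warmup batch.
docker run --rm --gpus all --shm-size 1g -v $volume:/data \
  harbor.aip.mitre.org/huggingface/text-generation-inference:$tgi_version \
  --model-id $model_id \
  --huggingface-hub-cache /data \
  --num-shard $num_shard \
  --max-input-length 4096 \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 4097 \
  --quantize eetq
# Diagnostic B: relaunch with a different quantizer (e.g. --quantize bitsandbytes-nf4)
# or without --quantize at all, to confirm the failure is specific to the EETQ
# kernel path. Note that unquantized 72B weights may not fit in 4x40GB, so an
# unquantized run can fail for memory reasons rather than this heuristic error.
If diagnostic A warms up cleanly, that would point at the EETQ GEMM heuristic failing only on very large prefill shapes rather than at the model or sharding setup.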