System Info
docker exec -it text-generation-inference text-generation-launcher --env
(base) ➜ huggingface-text-generation-inference docker exec -it 401ba897d58aa498e6fffa0e717144c47fea4cf56c0578fbb4b384b42bcf6040 text-generation-launcher --env
2023-06-03T03:36:08.324157Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Sat Jun 3 03:36:08 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01 Driver Version: 515.105.01 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 37C P8 13W / 310W | 693MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-03T03:36:08.324179Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: None, quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
(base) ➜ huggingface-text-generation-inference curl 127.0.0.1:8080/info | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 455 100 455 0 0 444k 0 --:--:-- --:--:-- --:--:-- 444k
{
"model_id": "/data/bigcode/starcoder",
"model_sha": null,
"model_dtype": "torch.float32",
"model_device_type": "cpu",
"model_pipeline_tag": null,
"max_concurrent_requests": 128,
"max_best_of": 2,
"max_stop_sequences": 4,
"max_input_length": 1000,
"max_total_tokens": 1512,
"waiting_served_ratio": 1.2,
"max_batch_total_tokens": 32000,
"max_waiting_tokens": 20,
"validation_workers": 2,
"version": "0.8.2",
"sha": "e7248fe90e27c7c8e39dd4cac5874eb9f96ab182",
"docker_label": "sha-e7248fe"
}
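The device and dtype the model ended up with can also be read straight from the same endpoint (a minimal check, assuming jq is available):

curl -s 127.0.0.1:8080/info | jq -r '.model_device_type, .model_dtype'

which here returns "cpu" and "torch.float32".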
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
1. Ubuntu 20.04
2. Start text-generation-inference with Docker:
model=/data/bigcode/starcoder
num_shard=1
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --disable-custom-kernels
3. Send a request using the VS Code extension.
4. I got the following error (a container-level GPU sanity check is sketched right after this log):
➜ huggingface-text-generation-inference docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --disable-custom-kernels
2023-06-03T03:33:15.272607Z INFO text_generation_launcher: Args { model_id: "/data/bigcode/starcoder", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }
2023-06-03T03:33:15.272886Z INFO text_generation_launcher: Starting download process.
2023-06-03T03:33:16.389565Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-03T03:33:16.775719Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-03T03:33:16.776087Z INFO text_generation_launcher: Starting shard 0
2023-06-03T03:33:26.786743Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:36.797049Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:46.807792Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:33:56.818618Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:06.830109Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:16.839934Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:26.850552Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:36.861382Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:46.873280Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:34:56.885746Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:35:06.896503Z INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-03T03:35:12.065627Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
rank=0
2023-06-03T03:35:12.103705Z INFO text_generation_launcher: Shard 0 ready in 115.326268544s
2023-06-03T03:35:12.191281Z INFO text_generation_launcher: Starting Webserver
2023-06-03T03:35:12.271308Z WARN text_generation_router: router/src/main.rs:158: no pipeline tag found for model /data/bigcode/starcoder
2023-06-03T03:35:12.276164Z INFO text_generation_router: router/src/main.rs:178: Connected
2023-06-03T03:43:43.852322Z ERROR shard-manager: text_generation_launcher: Method Prefill encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
self._run_once()
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
handle._run()
File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.9/site-packages/grpc_interceptor/server.py", line 159, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.9/site-packages/text_generation_server/interceptor.py", line 20, in intercept
return await response
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 82, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 73, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 61, in Prefill
generations, next_batch = self.model.generate_token(batch)
File "/opt/conda/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/causal_lm.py", line 575, in generate_token
next_token_id, logprobs = next_token_chooser(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/tokens.py", line 71, in __call__
scores, next_logprob = self.static_warper(scores)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/logits_process.py", line 47, in __call__
self.cuda_graph = torch.cuda.CUDAGraph()
RuntimeError: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
rank=0
2023-06-03T03:43:43.852597Z ERROR batch{batch_size=1}:prefill:prefill{id=0 size=1}:prefill{id=0 size=1}: text_generation_client: router/client/src/lib.rs:33: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
2023-06-03T03:43:43.853127Z ERROR HTTP request{otel.name=POST /generate http.client_ip= http.flavor=1.1 http.host=192.168.1.9:8080 http.method=POST http.route=/generate http.scheme=HTTP http.target=/generate http.user_agent=node-fetch otel.kind=server trace_id=92dbf3a1bfd4c5408c7350b41e793129}:generate{parameters=GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None }}:generate{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:generate_stream{request=GenerateRequest { inputs: "<?php\n\necho \"hello world\";\n", parameters: GenerateParameters { best_of: None, temperature: Some(0.2), repetition_penalty: None, top_k: None, top_p: Some(0.95), typical_p: None, do_sample: true, max_new_tokens: 60, return_full_text: None, stop: ["<|endoftext|>"], truncate: None, watermark: false, details: false, seed: None } }}:infer:send_error: text_generation_router::infer: router/src/infer.rs:533: Request failed during generation: Server error: CUDA error: forward compatibility was attempted on non supported HW
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Device-side assertions were explicitly omitted for this error check; the error probably arose while initializing the DSA handlers.
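As a sanity check on the reproduction above (a sketch only, assuming the NVIDIA Container Toolkit is set up so that nvidia-smi gets injected into the container), the same image can be started with nvidia-smi as the entrypoint and its view of the GPU compared with the output in the System Info section:

docker run --rm --gpus all --entrypoint nvidia-smi ghcr.io/huggingface/text-generation-inference:0.8.2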
Expected behavior
No errors are expected.
3 Answers
#1
For some reason the model got loaded on the CPU — "model_device_type": "cpu" shows up in the /info output.
Could you run the following directly?
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
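It may also help to check whether CUDA initializes at all inside that image. A rough sketch (it assumes the image's Python sits at /opt/conda/bin/python, consistent with the conda paths in the traceback, and that the NVIDIA Container Toolkit is set up):

docker run --rm --gpus all --entrypoint /opt/conda/bin/python ghcr.io/huggingface/text-generation-inference:0.8.2 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

If this prints False, or fails with the same forward-compatibility error, the host driver and the CUDA runtime bundled in the image are disagreeing, independently of text-generation-inference.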
#2
For some reason the model got loaded on the CPU — "model_device_type": "cpu" in /info.
Could you run the following directly?
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
Output:
#3
I'm running into the same issue here — have you found a solution?