System Info
A100-80GB * 4
Information
- Docker
- The CLI directly
Tasks
- An officially supported command
- My own modifications
Reproduction
docker run -d \
--gpus '"device=4,5,6,7"' \
--shm-size 1g \
--name $model_name \
-p ${external_port}:80 -v $model_path:/data/CmwCoder \
-e WEIGHTS_CACHE_OVERRIDE="/data/CmwCoder" \
tgi:2.2.0 \
--weights-cache-override="/data/CmwCoder" \
--model-id "/data/CmwCoder" --num-shard $num_shard \
--max-input-length 14000 \
--max-total-tokens 16000 \
--max-batch-prefill-tokens 14000 \
--trust-remote-code \
--quantize gptq
Expected behavior
Describe the bug
When running inference on the GPTQ-quantized DeepSeek-Coder-V2 model with text-generation-inference 2.2.0, I hit the error: Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized.
config.json
{
"_name_or_path": "/var/mntpkg/deepseek-coder-v2-instruct",
"architectures": [
"DeepseekV2ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV2Config",
"AutoModel": "modeling_deepseek.DeepseekV2Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV2ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 100000,
"eos_token_id": 100001,
"ep_size": 1,
"first_k_dense_replace": 1,
"hidden_act": "silu",
"hidden_size": 5120,
"initializer_range": 0.02,
"intermediate_size": 12288,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v2",
"moe_intermediate_size": 1536,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 160,
"n_shared_experts": 2,
"norm_topk_prob": false,
"num_attention_heads": 128,
"num_experts_per_tok": 6,
"num_hidden_layers": 60,
"num_key_value_heads": 128,
"pretraining_tp": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"quantization_config": {
"bits": 4,
"checkpoint_format": "gptq",
"damp_percent": 0.005,
"desc_act": true,
"dynamic_bits": null,
"group_size": 128,
"lm_head": false,
"meta": {
"quantizer": "gptqmodel:0.9.10-dev0"
},
"model_file_base_name": null,
"model_name_or_path": null,
"quant_method": "gptq",
"static_groups": false,
"sym": true,
"true_sequential": true
},
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 16.0,
"scoring_func": "softmax",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 3,
"topk_method": "group_limited_greedy",
"torch_dtype": "bfloat16",
"transformers_version": "4.43.3",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 102400
}
quantize_config.json
{
"bits": 4,
"dynamic_bits": null,
"group_size": 128,
"desc_act": true,
"static_groups": false,
"sym": true,
"lm_head": false,
"damp_percent": 0.005,
"true_sequential": true,
"model_name_or_path": "deepseek-coder-v2-instruct-gptq",
"model_file_base_name": "model",
"quant_method": "gptq",
"checkpoint_format": "gptq",
"meta": {
"quantizer": "gptqmodel:0.9.10-dev0"
}
}
Error logs
2024-08-02 03:31:18.315 | INFO | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
You are using a model of type deepseek_v2 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/gptq/__init__.py", line 153, in get_weights
[rank0]: qweight = weights.get_tensor(f"{prefix}.qweight")
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
[rank0]: filename, tensor_name = self.get_filename(tensor_name)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
[rank0]: raise RuntimeError(f"weight {tensor_name} does not exist")
[rank0]: RuntimeError: weight model.layers.59.self_attn.q_a_proj.qweight does not exist
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/opt/conda/bin/text-generation-server", line 8, in <module>
[rank0]: sys.exit(app())
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
[rank0]: server.serve(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
[rank0]: asyncio.run(
[rank0]: File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
[rank0]: return loop.run_until_complete(main)
[rank0]: File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
[rank0]: return future.result()
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
[rank0]: model = get_model(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 490, in get_model
[rank0]: return FlashCausalLM(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
[rank0]: model = model_class(prefix, config, weights)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 764, in __init__
[rank0]: self.model = DeepseekV2Model(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 703, in __init__
[rank0]: [
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 704, in <listcomp>
[rank0]: DeepseekV2Layer(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 626, in __init__
[rank0]: self.self_attn = DeepseekV2Attention(
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_deepseek_v2_modeling.py", line 236, in __init__
[rank0]: weight=weights.get_weights(f"{prefix}.q_a_proj"),
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 344, in get_weights
[rank0]: return self.weights_loader.get_weights(self, prefix)
[rank0]: File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/gptq/__init__.py", line 155, in get_weights
[rank0]: raise RuntimeError(
[rank0]: RuntimeError: Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized
rank=0
2024-08-02T04:51:40.980892Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-02T04:51:40.980912Z INFO text_generation_launcher: Shutting down shards
2024-08-02T04:51:40.983834Z INFO shard-manager: text_generation_launcher: Terminating shard rank=2
2024-08-02T04:51:40.984152Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=2
2024-08-02T04:51:40.985398Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-08-02T04:51:40.985884Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-08-02T04:51:41.008435Z INFO shard-manager: text_generation_launcher: Terminating shard rank=3
2024-08-02T04:51:41.008681Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=3
2024-08-02T04:51:48.113698Z INFO shard-manager: text_generation_launcher: shard terminated rank=3
2024-08-02T04:51:48.589678Z INFO shard-manager: text_generation_launcher: shard terminated rank=2
2024-08-02T04:51:49.291857Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: ShardCannotStart
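The traceback shows the loader looking for model.layers.59.self_attn.q_a_proj.qweight and not finding it. A quick way to check which tensors the checkpoint actually contains for that module (a diagnostic sketch, not part of the original report; it assumes the checkpoint is sharded safetensors files under /data/CmwCoder, the directory mounted above):

import glob

from safetensors import safe_open

# Diagnostic sketch: list the tensors stored for the module the loader fails on.
prefix = "model.layers.59.self_attn.q_a_proj"
for shard in sorted(glob.glob("/data/CmwCoder/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        hits = [key for key in f.keys() if key.startswith(prefix)]
        if hits:
            # A GPTQ-quantized linear exposes qweight/qzeros/scales (plus g_idx with
            # desc_act=true); a bare `.weight` means the layer was left unquantized.
            print(shard, hits)

If only q_a_proj.weight shows up, the quantizer skipped this projection, which would explain why TGI's GPTQ loader cannot find a qweight to repack for Marlin.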
Additional context
I had no inference problems with GPTQModel.from_quantized().
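Roughly the load that works outside TGI (a minimal sketch; the exact keyword arguments may vary across gptqmodel versions, and the path is the same checkpoint directory mounted above):

from gptqmodel import GPTQModel

# Sketch: load the checkpoint with the gptqmodel package recorded in
# quantize_config.json ("quantizer": "gptqmodel:0.9.10-dev0").
model = GPTQModel.from_quantized(
    "/data/CmwCoder",
    device="cuda:0",
    trust_remote_code=True,  # DeepSeek-V2 ships custom modeling code
)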
2 Answers
Running into a similar issue: TGI does not seem to use the custom model map (auto_map) in config.json even when it is present, and instead falls back to AutoModel.
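For comparison, plain transformers does honor the auto_map (a sketch of the resolution the comment refers to; nothing here is TGI API):

from transformers import AutoConfig, AutoModelForCausalLM

# With trust_remote_code=True, transformers follows the auto_map in config.json
# and loads modeling_deepseek.DeepseekV2ForCausalLM from the checkpoint directory,
# rather than falling back to a generic model class.
config = AutoConfig.from_pretrained("/data/CmwCoder", trust_remote_code=True)
print(type(config).__name__)  # -> DeepseekV2Config

model = AutoModelForCausalLM.from_pretrained("/data/CmwCoder", trust_remote_code=True)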
Hi @Cucunnber 👋
Thanks for reporting this issue. I don't think we have the bandwidth to tackle it directly right now, so I'll tag @danieldk, since he is the expert on Marlin and GPTQ.