text-generation-inference runtime error: "weight lm_head.weight does not exist" when loading the qwen2-0.5B-Instruct model

oyxsuwqo · posted 23 days ago · 3 answers

I ran into a problem loading the qwen2-0.5B-Instruct model with TGI. The error raised is "RuntimeError: weight lm_head.weight does not exist".

I suspect this is caused by the 'safetensors' file not preserving the tied parameter: since 'lm_head' and 'embed_tokens' share one tensor, only the embedding weight is serialized. It seems the problem can be avoided by untying 'lm_head' from 'embed_tokens' before calling 'model.save_pretrained()'.
Interestingly, the issue does not occur with the larger Qwen models, e.g. the 7B or 72B versions (whose embeddings are not tied). I am wondering whether this is expected behavior or an accidental bug.
I am using the latest official image: ghcr.io/huggingface/text-generation-inference:2.2.0.
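The tying/untying behavior described above can be demonstrated with a minimal sketch using plain PyTorch (the `TinyLM` class is a made-up toy model, not Qwen's architecture): when two parameters share storage, safetensors-based serializers keep only one copy, which is why `lm_head.weight` disappears from the saved file; cloning the weight before saving gives `lm_head` its own storage.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    # toy stand-in for a causal LM with tied input/output embeddings
    def __init__(self, vocab=10, dim=4):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)
        self.lm_head.weight = self.embed_tokens.weight  # tie: one shared tensor

model = TinyLM()
sd = model.state_dict()
# tied: both state-dict entries point at the same storage, so a
# safetensors-style serializer keeps only one of them
print(sd["lm_head.weight"].data_ptr() == sd["embed_tokens.weight"].data_ptr())  # True

# untie before saving: clone so lm_head gets its own storage
model.lm_head.weight = nn.Parameter(model.embed_tokens.weight.detach().clone())
sd = model.state_dict()
print(sd["lm_head.weight"].data_ptr() == sd["embed_tokens.weight"].data_ptr())  # False
```

After the clone, both tensors would be written out separately, and TGI's lookup of `lm_head.weight` would succeed.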

Traceback:

2024-08-07T16:32:00.053201Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0

    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 953, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 344, in __init__
    self.lm_head = SpeculativeHead.load(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py", line 40, in load
    lm_head = TensorParallelHead.load(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 67, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist

 rank=0
2024-08-07T16:32:05.257583Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-07T16:32:05.257616Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
yi0zb3m4 #1

Hi @boyang-nlp 👋
Thanks for reporting this issue!
I think we're a bit constrained on bandwidth at the moment, but I hope to take a look next week. If you have time to dig into possible solutions in the meantime, feel free to post them here 👍

jgzswidk #2

Sure! No worries, take whatever time you need. I'll keep looking for a solution and update you. Thanks! 👍

dffbzjpn #3

Hi @boyang-nlp and @ErikKaum,
We also hit this issue with Qwen2-1.5B. Here is a temporary workaround (it should work for Qwen2-0.5B as well):
Open a shell in the Hugging Face docker image (ghcr.io/huggingface/text-generation-inference:2.2.0) and edit the speculative.py file inside it:

vi /opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py

In that file, add the following lines at line 40 (between the "FIX START" and "FIX END" comments):

import torch
...

class SpeculativeHead(torch.nn.Module):
    ...
    @staticmethod
    def load(config, prefix: str, weights):
        speculator = config.speculator
        if speculator:
            ...
        else:
            # FIX START
            if config._name_or_path == "Qwen/Qwen2-1.5B":
                if prefix == "lm_head":
                    # lm_head is tied to the input embeddings, so load those instead
                    prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)
            speculator = None
        return SpeculativeHead(lm_head, speculator)
...
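A less brittle variant of the patch above would key off the model config's `tie_word_embeddings` flag instead of hardcoding one model name, so it covers every checkpoint with tied embeddings. This is only a sketch; the helper name `resolve_lm_head_prefix` is made up for illustration, and the dummy configs below stand in for real transformers config objects:

```python
from types import SimpleNamespace

def resolve_lm_head_prefix(config, prefix):
    # redirect lm_head to the embedding weights whenever the config
    # declares tied input/output embeddings (hypothetical helper)
    if getattr(config, "tie_word_embeddings", False) and prefix == "lm_head":
        return "model.embed_tokens"
    return prefix

tied = SimpleNamespace(tie_word_embeddings=True)     # e.g. Qwen2-0.5B / 1.5B
untied = SimpleNamespace(tie_word_embeddings=False)  # e.g. Qwen2-7B / 72B

print(resolve_lm_head_prefix(tied, "lm_head"))    # model.embed_tokens
print(resolve_lm_head_prefix(untied, "lm_head"))  # lm_head
```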

The fix above works because, for Qwen2-1.5B, the embeddings are tied.
To verify this, we ran the following script:

from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype="auto",
    device_map="auto",  # place the model on the available device(s)
)
# check that the output head shares its weights with the input embeddings
print(torch.all(model.model.embed_tokens.weight == model.lm_head.weight))

The output is:

tensor(True, device='cuda:0')

which means the lm_head weights are identical to the model.embed_tokens weights.
