text-generation-inference runtime error: "weight lm_head.weight does not exist" when loading the qwen2-0.5B-Instruct model

oyxsuwqo · posted 9 months ago in Other

I ran into a problem while loading the qwen2-0.5B-Instruct model with the TGI library. The error raised is "RuntimeError: weight lm_head.weight does not exist".

I suspect this is caused by the safetensors file not preserving the tied parameter. It seems the problem can be avoided by untying lm_head from embed_tokens before calling model.save_pretrained().
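For illustration, here is a minimal sketch of that untying step (the output directory name is hypothetical, not from the original report):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

# Untie the head: give lm_head its own copy of the embedding weights so that
# save_pretrained() serializes an explicit lm_head.weight tensor.
model.config.tie_word_embeddings = False
model.lm_head.weight = torch.nn.Parameter(model.model.embed_tokens.weight.clone())

model.save_pretrained("qwen2-0.5B-instruct-untied")  # hypothetical output path
```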
Interestingly, the problem does not occur with the larger Qwen models, e.g. the 7B or 72B versions. I'm wondering whether this is expected behavior or an unintended bug.
I'm using the latest official image: ghcr.io/huggingface/text-generation-inference:2.2.0.

Traceback:

```
2024-08-07T16:32:00.053201Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
      return loop.run_until_complete(main)
    File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
      self.run_forever()
    File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
      self._run_once()
    File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
      handle._run()
    File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
  > File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
      model = get_model(
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 953, in get_model
      return FlashCausalLM(
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
      model = model_class(prefix, config, weights)
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 344, in __init__
      self.lm_head = SpeculativeHead.load(
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py", line 40, in load
      lm_head = TensorParallelHead.load(config, prefix, weights)
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 67, in load
      weight = weights.get_tensor(f"{prefix}.weight")
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
      filename, tensor_name = self.get_filename(tensor_name)
    File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
      raise RuntimeError(f"weight {tensor_name} does not exist")
  RuntimeError: weight lm_head.weight does not exist

  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 953, in get_model
    return FlashCausalLM(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_causal_lm.py", line 898, in __init__
    model = model_class(prefix, config, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_qwen2_modeling.py", line 344, in __init__
    self.lm_head = SpeculativeHead.load(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py", line 40, in load
    lm_head = TensorParallelHead.load(config, prefix, weights)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/tensor_parallel.py", line 67, in load
    weight = weights.get_tensor(f"{prefix}.weight")
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 212, in get_tensor
    filename, tensor_name = self.get_filename(tensor_name)
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/weights.py", line 193, in get_filename
    raise RuntimeError(f"weight {tensor_name} does not exist")
RuntimeError: weight lm_head.weight does not exist
rank=0
2024-08-07T16:32:05.257583Z ERROR text_generation_launcher: Shard 0 failed to start
2024-08-07T16:32:05.257616Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
```
yi0zb3m4 #1

Hi @boyang-nlp 👋
Thanks for reporting this issue!
I think we're a bit bandwidth-constrained at the moment, but I hope to take a look next week. If you have time to dig into possible solutions in the meantime, feel free to post them here 👍

jgzswidk #2

Sure, no worries, take the time you need. I'll keep looking for a solution and will keep you updated. Thanks! 👍

dffbzjpn #3

Hi @boyang-nlp and @ErikKaum,
We also ran into this issue with Qwen2-1.5B. Here is a temporary workaround (it should also work for Qwen2-0.5B):
Start the Hugging Face Docker image (ghcr.io/huggingface/text-generation-inference:2.2.0) and open the speculative.py file inside it:

```
vi /opt/conda/lib/python3.10/site-packages/text_generation_server/layers/speculative.py
```

In that file, add the following lines around line 40 (between the "FIX START" and "FIX END" comments):

```python
import torch
...
class SpeculativeHead(torch.nn.Module):
    ...
    @staticmethod
    def load(config, prefix: str, weights):
        speculator = config.speculator
        if speculator:
            ...
        else:
            # FIX START
            if config._name_or_path == "Qwen/Qwen2-1.5B":
                if prefix == "lm_head":
                    prefix = "model.embed_tokens"
            # FIX END
            lm_head = TensorParallelHead.load(config, prefix, weights)
            speculator = None
        return SpeculativeHead(lm_head, speculator)
    ...
```
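A more general variant of the same workaround (a sketch, not an upstream fix; it assumes the model config exposes tie_word_embeddings, as the Qwen2 configs do) would key off that flag instead of hard-coding the model name, so it also covers Qwen2-0.5B:

```python
# FIX START (generalized sketch: fall back to the tied embedding weights
# whenever the config declares tied word embeddings)
if prefix == "lm_head" and getattr(config, "tie_word_embeddings", False):
    prefix = "model.embed_tokens"
# FIX END
```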

The workaround works because, for Qwen2-1.5B, the embeddings are tied: lm_head reuses the model.embed_tokens weights.
To verify this, we ran the following script:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda"  # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype="auto",
    device_map="auto",
)
torch.all(model.model.embed_tokens.weight == model.lm_head.weight)
```

The output:

```
tensor(True, device='cuda:0')
```

This means the lm_head weights are identical to the model.embed_tokens weights.
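One way to confirm the other half of the diagnosis (a sketch; the shard filename below is illustrative) is to list the tensors actually stored in the checkpoint's safetensors file; for the tied Qwen2 checkpoints, lm_head.weight is simply absent:

```python
from safetensors import safe_open

# List the tensor names stored in a checkpoint shard (illustrative path).
with safe_open("model.safetensors", framework="pt") as f:
    keys = set(f.keys())

print("lm_head.weight" in keys)             # expected: False for tied models
print("model.embed_tokens.weight" in keys)  # expected: True
```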

