text-generation-inference cannot import Flash Attention enabled models: cannot import name 'FastLayerNorm'

xurqigkl · posted 2 months ago in Other

System information

OS version: WSL 2, Ubuntu 22.04
Model: llama3-8B-Instruct
Hardware: no GPU
There is no GPU, but I installed the nvcc toolchain in WSL with: sudo apt install nvidia-cuda-toolkit
$CUDA_HOME and $LD_LIBRARY_PATH are not set.
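To double-check which variables are missing, they can be read from the environment. This is a minimal sketch, nothing TGI-specific; the helper name is made up for illustration:

```python
import os

def missing_vars(env):
    """Return which of the CUDA-related variables are absent from env."""
    return [v for v in ("CUDA_HOME", "LD_LIBRARY_PATH") if v not in env]

# In the poster's setup this reports both variables as missing.
print(missing_vars(os.environ))
```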

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ nvidia-smi
Command 'nvidia-smi' not found

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. In the WSL shell, I ran the following command:
docker run --shm-size 1g -p 8080:80 \
  -v ${hf_model_download_path}:/data \
  -e HF_TOKEN=${my_hf_api_token} \
  --name tgi \
  ghcr.io/huggingface/text-generation-inference:latest --model-id meta-llama/Meta-Llama-3-8B-Instruct --disable-custom-kernels
  2. Error log
...
2024-06-29T07:29:12.599418Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-29T07:29:12.637348Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-29T07:29:21.467781Z  INFO text_generation_launcher: Detected system cpu
2024-06-29T07:29:22.678981Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)
2024-06-29T07:29:32.697783Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-29T07:29:42.713623Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
...
  3. So I entered the Docker container directly, ran make install, and found the error log.
$ docker run --rm --entrypoint /bin/bash -it  \
  -e HF_TOKEN=${my_hf_api_token} \
  -v ${hf_model_download_path}:/data -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest  

root@984a3b8b4a4c:/usr/src/server# pip install flash-attn==v2.5.9.post1
Collecting flash-attn==v2.5.9.post1
  Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [23 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 113, in <module>
          _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 65, in get_cuda_bare_metal_version
          raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
        File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/opt/conda/lib/python3.10/subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/opt/conda/lib/python3.10/subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/opt/conda/lib/python3.10/subprocess.py", line 1863, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

      torch.__version__  = 2.3.0

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
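The traceback above shows why the pip install fails before any compilation happens: flash-attn's setup.py shells out to $CUDA_HOME/bin/nvcc to read the CUDA version, and that binary does not exist in a CPU-only container. A simplified sketch of that check (modeled on the get_cuda_bare_metal_version call visible in the traceback, not the exact source):

```python
import os
import subprocess

def get_cuda_bare_metal_version(cuda_dir):
    """Simplified form of flash-attn's setup.py check:
    run `nvcc -V` from the given CUDA installation and return its output."""
    return subprocess.check_output(
        [os.path.join(cuda_dir, "bin", "nvcc"), "-V"], universal_newlines=True
    )

try:
    get_cuda_bare_metal_version("/usr/local/cuda")
except FileNotFoundError as exc:
    # This is the FileNotFoundError in the log: no nvcc in the container,
    # so metadata generation aborts immediately.
    print("nvcc missing:", exc)
```

This means installing flash-attn inside the CPU image cannot work without a full CUDA toolkit present, independent of the original FastLayerNorm warning.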

Expected behavior

Even though I removed the --gpus flag and added the --disable-custom-kernels flag following the TGI GitHub guide, the Flash Attention error still occurs. Please tell me how to run TGI on CPU.
Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.

2mbi3lxu #1

I think Flash Attention may be a red herring here. The error:

2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)

indicates that FastLayerNorm cannot be imported. Without a GPU, the system type is detected as CPU, and FastLayerNorm only has CUDA, ROCm, and IPEX implementations.
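In other words, the module dispatches on the detected system, and on CPU the accelerated class is simply never defined, so the import fails. A minimal illustration of that pattern (not TGI's actual code; the module and class names just mirror the error message):

```python
import types

def build_layernorm_module(system):
    """Simulate a layernorm module that only defines FastLayerNorm
    when a supported accelerator backend is detected."""
    mod = types.ModuleType("layernorm")
    if system in ("cuda", "rocm", "ipex"):
        class FastLayerNorm:  # stand-in for the accelerated kernel wrapper
            pass
        mod.FastLayerNorm = FastLayerNorm
    return mod

# With no GPU the system is detected as "cpu", the name is absent, and
# `from ...layernorm import FastLayerNorm` raises ImportError.
print(hasattr(build_layernorm_module("cpu"), "FastLayerNorm"))   # False
print(hasattr(build_layernorm_module("cuda"), "FastLayerNorm"))  # True
```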
