text-generation-inference cannot import Flash Attention enabled models: cannot import name 'FastLayerNorm'

xurqigkl · posted 2 months ago in Other

System information

OS version: WSL 2, Ubuntu 22.04
Model: llama3-8B-Instruct
Hardware: no GPU
There is no GPU, but I installed the nvcc toolchain in WSL with: sudo apt install nvidia-cuda-toolkit
$CUDA_HOME and $LD_LIBRARY_PATH are not set.
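To double-check which variables are missing, they can be read from the environment. This is a minimal sketch, nothing TGI-specific; the helper name is made up for illustration:

```python
import os

def missing_vars(env):
    """Return which of the CUDA-related variables are absent from env."""
    return [v for v in ("CUDA_HOME", "LD_LIBRARY_PATH") if v not in env]

# In the poster's setup this reports both variables as missing.
print(missing_vars(os.environ))
```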

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

$ nvidia-smi
Command 'nvidia-smi' not found

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. In the WSL shell, I ran the following command:
docker run --shm-size 1g -p 8080:80 \
  -v ${hf_model_download_path}:/data \
  -e HF_TOKEN=${my_hf_api_token} \
  --name tgi \
  ghcr.io/huggingface/text-generation-inference:latest --model-id meta-llama/Meta-Llama-3-8B-Instruct --disable-custom-kernels
  2. Error log
...
2024-06-29T07:29:12.599418Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-29T07:29:12.637348Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-29T07:29:21.467781Z  INFO text_generation_launcher: Detected system cpu
2024-06-29T07:29:22.678981Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)
2024-06-29T07:29:32.697783Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-29T07:29:42.713623Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
...
  3. So I entered the Docker container directly, ran make install, and found the error log.
$ docker run --rm --entrypoint /bin/bash -it  \
  -e HF_TOKEN=${my_hf_api_token} \
  -v ${hf_model_download_path}:/data -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest  

root@984a3b8b4a4c:/usr/src/server# pip install flash-attn==v2.5.9.post1
Collecting flash-attn==v2.5.9.post1
  Downloading flash_attn-2.5.9.post1.tar.gz (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 3.7 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [23 lines of output]
      No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
      fatal: not a git repository (or any of the parent directories): .git
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 113, in <module>
          _, bare_metal_version = get_cuda_bare_metal_version(CUDA_HOME)
        File "/tmp/pip-install-mc4cargc/flash-attn_b8fe41f0c83d4045a248ec2027dda9da/setup.py", line 65, in get_cuda_bare_metal_version
          raw_output = subprocess.check_output([cuda_dir + "/bin/nvcc", "-V"], universal_newlines=True)
        File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/opt/conda/lib/python3.10/subprocess.py", line 503, in run
          with Popen(*popenargs, **kwargs) as process:
        File "/opt/conda/lib/python3.10/subprocess.py", line 971, in __init__
          self._execute_child(args, executable, preexec_fn, close_fds,
        File "/opt/conda/lib/python3.10/subprocess.py", line 1863, in _execute_child
          raise child_exception_type(errno_num, err_msg, err_filename)
      FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

      torch.__version__  = 2.3.0

      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
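The traceback above shows why the pip install fails before any compilation happens: flash-attn's setup.py shells out to $CUDA_HOME/bin/nvcc to read the CUDA version, and that binary does not exist in a CPU-only container. A simplified sketch of that check (modeled on the get_cuda_bare_metal_version call visible in the traceback, not the exact source):

```python
import os
import subprocess

def get_cuda_bare_metal_version(cuda_dir):
    """Simplified form of flash-attn's setup.py check:
    run `nvcc -V` from the given CUDA installation and return its output."""
    return subprocess.check_output(
        [os.path.join(cuda_dir, "bin", "nvcc"), "-V"], universal_newlines=True
    )

try:
    get_cuda_bare_metal_version("/usr/local/cuda")
except FileNotFoundError as exc:
    # This is the FileNotFoundError in the log: no nvcc in the container,
    # so metadata generation aborts immediately.
    print("nvcc missing:", exc)
```

This means installing flash-attn inside the CPU image cannot work without a full CUDA toolkit present, independent of the original FastLayerNorm warning.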

Expected behavior

Even though I removed the --gpus flag and added the --disable-custom-kernels flag following the TGI GitHub guide, the Flash Attention error still occurs. Please tell me how to run TGI on CPU.
Note: To use NVIDIA GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12.2 or higher. For running the Docker container on a machine with no GPUs or CUDA support, it is enough to remove the --gpus all flag and add --disable-custom-kernels. Please note that CPU is not the intended platform for this project, so performance might be subpar.

2mbi3lxu #1

I think Flash Attention may be a red herring here. The error:

2024-06-29T07:29:25.389048Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.layers.layernorm' (/opt/conda/lib/python3.10/site-packages/text_generation_server/layers/layernorm.py)

indicates that FastLayerNorm cannot be imported. Without a GPU, the system type is detected as CPU, and FastLayerNorm only has CUDA, ROCm, and IPEX implementations.
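In other words, the module dispatches on the detected system, and on CPU the accelerated class is simply never defined, so the import fails. A minimal illustration of that pattern (not TGI's actual code; the module and class names just mirror the error message):

```python
import types

def build_layernorm_module(system):
    """Simulate a layernorm module that only defines FastLayerNorm
    when a supported accelerator backend is detected."""
    mod = types.ModuleType("layernorm")
    if system in ("cuda", "rocm", "ipex"):
        class FastLayerNorm:  # stand-in for the accelerated kernel wrapper
            pass
        mod.FastLayerNorm = FastLayerNorm
    return mod

# With no GPU the system is detected as "cpu", the name is absent, and
# `from ...layernorm import FastLayerNorm` raises ImportError.
print(hasattr(build_layernorm_module("cpu"), "FastLayerNorm"))   # False
print(hasattr(build_layernorm_module("cuda"), "FastLayerNorm"))  # True
```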
