Should PyTorch 8-bit quantization make Whisper inference faster on a GPU?

Posted by jxct1oxe on 2024-01-09 in Other

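I am running Whisper inference with Hugging Face transformers. The load_in_8bit quantization is provided by bitsandbytes.
When I load whisper-large-v3 in 8-bit mode on an NVIDIA T4 GPU, inference on a sample file takes about 5x longer than without quantization. GPU utilization reported by nvidia-smi is 33%.
Shouldn't quantization improve inference speed on a GPU? https://pytorch.org/docs/stable/quantization.html
Similar question: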

import torch

from transformers import (
    AutoModelForSpeechSeq2Seq,  # was missing from the original imports
    WhisperFeatureExtractor,
    WhisperTokenizerFast,
)
from transformers.pipelines.audio_classification import ffmpeg_read

MODEL_NAME = "openai/whisper-large-v3"

tokenizer = WhisperTokenizerFast.from_pretrained(MODEL_NAME)
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)

# Load the model with 8-bit weights (requires bitsandbytes and a CUDA GPU)
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    device_map='auto',
    load_in_8bit=True)

sample = "sample.mp3"  # 27 s long

with torch.inference_mode():
    # Decode the audio file to a mono waveform at the model's sampling rate
    with open(sample, "rb") as f:
        inputs = f.read()
    inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)

    # Compute log-mel input features
    input_features = feature_extractor(inputs, sampling_rate=feature_extractor.sampling_rate, return_tensors='pt')['input_features']

    # Move the existing tensor to the GPU in half precision
    # (torch.tensor() on a tensor makes a redundant copy and emits a warning)
    input_features = input_features.to(device='cuda', dtype=torch.float16)

    forced_decoder_ids_output = model_8bit.generate(input_features=input_features, return_timestamps=False)

    out = tokenizer.decode(forced_decoder_ids_output.squeeze())
    print(out)


Answer 1, by hts6caw3

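It is expected that the int8-quantized model is slower. Quantization adds extra operations to the model's forward pass. You can read more about this in the int8 quantization paper. There are also benchmarks that show the same result.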
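To make the "extra operations" concrete, here is a minimal pure-Python sketch of absmax int8 quantization around a single matrix-vector product. It is an illustration only and not the bitsandbytes kernel; the function names (`quantize_absmax`, `matvec_fp`, `matvec_int8`) and the toy weights are made up for this example.

```python
# Minimal sketch of absmax int8 quantization around one matrix-vector product.
# Pure Python for illustration; this is NOT how bitsandbytes implements it.

def quantize_absmax(values):
    """Map floats to integers in [-127, 127]; return (ints, scale)."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid division by zero
    scale = max_abs / 127.0
    return [round(v / scale) for v in values], scale

def matvec_fp(weight_rows, x):
    """Baseline: a single multiply-accumulate pass, no extra steps."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def matvec_int8(weight_rows, x):
    """Quantized path: quantize the input, integer matmul, dequantize the output."""
    x_q, x_scale = quantize_absmax(x)  # extra step: quantize the activations
    out = []
    for row in weight_rows:
        # Weights would normally be quantized once at load time, not per call.
        w_q, w_scale = quantize_absmax(row)
        acc = sum(wq * xq for wq, xq in zip(w_q, x_q))  # integer accumulate
        out.append(acc * w_scale * x_scale)  # extra step: dequantize the result
    return out

# Hypothetical toy data
weights = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, -4.0]

ref = matvec_fp(weights, x)
approx = matvec_int8(weights, x)
print(ref)
print(approx)
```

The quantized path gives nearly the same result, but it has to quantize the activations and dequantize the output on every call. That per-layer overhead is the kind of extra work referred to above.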
The reason to use int8 quantization is to reduce the model's memory footprint, which lets you load larger models on less capable hardware.
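As a rough back-of-the-envelope sketch of the memory saving: the parameter count below (about 1.55 billion for whisper-large-v3) is an approximation assumed for illustration, and the helper `weight_footprint_gib` is hypothetical. It counts weights only, ignoring activations, caches, and any layers the library keeps in higher precision.

```python
def weight_footprint_gib(num_params, bytes_per_param):
    """Approximate size of the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Assumption: roughly 1.55 billion parameters for whisper-large-v3
APPROX_PARAMS = 1_550_000_000

fp16_gib = weight_footprint_gib(APPROX_PARAMS, 2)  # 2 bytes per fp16 weight
int8_gib = weight_footprint_gib(APPROX_PARAMS, 1)  # 1 byte per int8 weight

print(f"fp16 weights: {fp16_gib:.2f} GiB")
print(f"int8 weights: {int8_gib:.2f} GiB")
```

Under these assumptions the weights take about half the space in int8, which is the trade being made: less memory, at the cost of slower inference.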
