I am running Whisper inference with Hugging Face Transformers. The load_in_8bit quantization is provided by bitsandbytes.
When whisper-large-v3 is loaded in 8-bit mode on an NVIDIA T4 GPU, inference on a sample file takes much longer (about 5x). GPU utilization in nvidia-smi sits at 33%.
Shouldn't quantization improve inference speed on the GPU? https://pytorch.org/docs/stable/quantization.html

Similar question:
import torch
from transformers import (
    AutoModelForSpeechSeq2Seq,
    WhisperFeatureExtractor,
    WhisperTokenizerFast,
)
from transformers.pipelines.audio_classification import ffmpeg_read

MODEL_NAME = "openai/whisper-large-v3"

tokenizer = WhisperTokenizerFast.from_pretrained(MODEL_NAME)
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)

# 8-bit weights via bitsandbytes
model_8bit = AutoModelForSpeechSeq2Seq.from_pretrained(
    MODEL_NAME,
    device_map='auto',
    load_in_8bit=True)

sample = "sample.mp3"  # 27 s long

with torch.inference_mode():
    with open(sample, "rb") as f:
        inputs = f.read()

    # Decode the mp3 to a float waveform at the model's sampling rate
    inputs = ffmpeg_read(inputs, feature_extractor.sampling_rate)
    input_features = feature_extractor(
        inputs,
        sampling_rate=feature_extractor.sampling_rate,
        return_tensors='pt')['input_features']
    input_features = input_features.to('cuda', dtype=torch.float16)

    predicted_ids = model_8bit.generate(input_features=input_features, return_timestamps=False)
    out = tokenizer.decode(predicted_ids.squeeze())
    print(out)
1 Answer
An int8-quantized model is expected to be slower. Quantization adds extra operations to the model's forward pass (with bitsandbytes' LLM.int8() scheme, matmuls are decomposed and outlier features are computed in a separate fp16 path). You can read more about this in the LLM.int8() quantization paper, and you can also find benchmarks showing the same behavior.

The reason to use int8 quantization is to reduce the model's memory footprint; it makes it possible to load larger models on less hardware.
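To see this trade-off on your own hardware, here is a minimal sketch (not part of the original answer) that loads whisper-large-v3 once in float16 and once in 8-bit and compares memory footprint and generate() latency. The sample file name, the use of ffmpeg_read for decoding, and the assumption that the two copies are loaded one after the other and both fit on the GPU are illustrative assumptions.

import time
import torch
from transformers import AutoModelForSpeechSeq2Seq, WhisperProcessor
from transformers.pipelines.audio_classification import ffmpeg_read

MODEL_NAME = "openai/whisper-large-v3"
SAMPLE = "sample.mp3"  # hypothetical local file, as in the question

processor = WhisperProcessor.from_pretrained(MODEL_NAME)

# Decode the audio once at Whisper's expected sampling rate (16 kHz)
with open(SAMPLE, "rb") as f:
    audio = ffmpeg_read(f.read(), processor.feature_extractor.sampling_rate)
features = processor(audio,
                     sampling_rate=processor.feature_extractor.sampling_rate,
                     return_tensors="pt").input_features

def bench(model, input_features, runs=3):
    # One warm-up call, then average a few generate() calls
    model.generate(input_features=input_features)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model.generate(input_features=input_features)
    torch.cuda.synchronize()
    return (time.time() - start) / runs

for label, kwargs in [("fp16", dict(torch_dtype=torch.float16)),
                      ("int8", dict(load_in_8bit=True))]:
    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_NAME, device_map="auto", **kwargs)
    feats = features.to("cuda", dtype=torch.float16)
    print(f"{label}: {model.get_memory_footprint() / 1e9:.2f} GB weights, "
          f"{bench(model, feats):.1f} s per generate()")
    del model
    torch.cuda.empty_cache()

On a T4 the int8 copy should show roughly half the weight memory of the fp16 copy, at the cost of the slower generate() calls described above.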