Strange behavior of CTranslate2 on a V100 32GB

i5desfxk · posted 2 months ago in Other

Hello.

I recently ran some benchmarks on an NVIDIA V100 32GB GPU. First, I benchmarked Llama2-7B-chat with both HuggingFace Transformers and CTranslate2. With ct2 I saw a clear latency reduction (12 s vs. 7.5 s).

However, when I tried the 13B version, I saw no latency improvement at all (18 s vs. 18 s), although VRAM usage was slightly lower.

Why is that? Am I doing something wrong?

Here is the code I am using:

input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end-start

print(f"GPU:\tV100\nTime(s):\t{t}\nResult: {output}")

ujv3wf0j1#

Hi,
Can you share the code you are using to run the model with HuggingFace Transformers?
Also, what parameters do you set when converting the model and then loading it into CTranslate2?
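(For context, the parameters usually meant here are the converter's --quantization flag and the Generator's compute_type argument; a minimal sketch with a placeholder model directory, not the poster's actual settings:)

import ctranslate2

# Loading parameters in question; the model directory is a placeholder.
# compute_type="default" keeps the precision saved at conversion time,
# but it can be overridden here (e.g. "float16" or "int8_float16").
generator = ctranslate2.Generator(
    "Llama-2-13B-Chat-ct2",
    device="cuda",
    compute_type="float16",
)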


polkgigr2#

Sure.
Here's the code:
HF Transformers

import transformers
import torch
import time

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
chatbot = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

chatbot.pad_token_id = tokenizer.eos_token_id
# Warm up the model
chatbot(
    "Who is the president of the US?",
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=10,
)
input = llama2_chat_prompt_template.format(transcript=transcript)
start = time.time()
sequences = chatbot(
    [input, input, input],
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1200,
    batch_size=5
)
end = time.time()
t = end-start
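(The snippet above computes t but never prints it; a one-line addition in the same style as the CTranslate2 script, using only the variables already defined:)

print(f"GPU:\tV100\nTime(s):\t{t}")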

CTranslate2

import ctranslate2
import transformers
import time

# Load and warmup the model
start = time.time()
generator = ctranslate2.Generator("/content/Llama-2-13B-Chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Who is the president of the US?"))
results = generator.generate_batch([tokens], max_length=5, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
end = time.time()
t = end - start
print("Time:\t", t)
print(output)

# Run speed test
input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end-start

Conversion script:

# Convert to CTranslate2 format, keeping FP16 weights (no integer quantization)
import time
import os
start = time.time()
os.system("ct2-transformers-converter --model TheBloke/Llama-2-13B-Chat-fp16 --quantization float16 --output_dir Llama-2-13B-Chat-ct2")
end = time.time()
t = end - start
print("Time:\t", t)
