Strange behavior of CTranslate2 on a V100 32GB

Posted by i5desfxk 2 months ago in Other


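Hello.

I recently ran some benchmarks on an NVIDIA V100 32GB GPU. First, I benchmarked Llama2-7B-chat with Hugging Face Transformers and with CTranslate2. With ct2 I saw lower latency (12 s with Transformers versus 7.5 s with CTranslate2).

However, when I tried the 13B version, I saw no latency improvement at all (18 s in both cases), although vRAM usage was slightly lower.

Why is that? Am I doing something wrong?

This is the code I am using: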

# Build the prompt from the chat template
input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

# CTranslate2 takes token strings rather than token IDs
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end - start

print(f"GPU:\tV100\nTime(s):\t{t}\nResult: {output}")


Source: https://github.com/OpenNMT/CTranslate2/issues/1431

2 answers


ujv3wf0j #1

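Hi,
Can you share the code you are using to run the model with HuggingFace transformers?
Also, what parameters do you set when converting the model and then loading it into CTranslate2?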

2 months ago

polkgigr #2

> Hi,
> Can you share the code you are using to run the model with HuggingFace transformers?
> Also what parameters do you set when converting and then loading the model to CTranslate2?

Sure. Here's the code:
HF Transformers

import transformers
import torch
import time

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
chatbot = transformers.pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

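# Batched generation needs a padding token; the Llama 2 tokenizer defines none, so reuse EOS
chatbot.tokenizer.pad_token_id = tokenizer.eos_token_id
# Warm up the model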
chatbot(
    "Who is the president of the US?",
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=10,
)
input = llama2_chat_prompt_template.format(transcript=transcript)
start = time.time()
sequences = chatbot(
    [input, input, input],
    do_sample=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=1200,
    batch_size=5
)
end = time.time()
t = end-start

CTranslate2

import ctranslate2
import transformers
import time

# Load and warmup the model
start = time.time()
generator = ctranslate2.Generator("/content/Llama-2-13B-Chat-ct2", device="cuda")
tokenizer = transformers.AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode("Who is the president of the US?"))
results = generator.generate_batch([tokens], max_length=5, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])
end = time.time()
t = end - start
print("Time:\t", t)
print(output)

# Run speed test
input = llama2_chat_prompt_template.format(transcript=transcript)

start = time.time()

tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input))
results = generator.generate_batch([tokens], max_length=512, include_prompt_in_result=False)
output = tokenizer.decode(results[0].sequences_ids[0])

end = time.time()

t = end-start
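
The two scripts above do different amounts of work: the Transformers run batches three copies of the prompt with max_length=1200, while the CTranslate2 run generates for a single prompt with max_length=512, so the raw wall-clock times are hard to compare directly. Below is a minimal sketch of one way to normalize the measurement by the number of generated tokens. The helper name `measure_throughput` and the `generate_fn` callable are hypothetical and do not come from the original post.

```python
import time
from statistics import mean


def measure_throughput(generate_fn, warmup_runs=1, timed_runs=3):
    """Return (mean seconds per run, mean generated tokens per second).

    generate_fn is a hypothetical zero-argument callable that runs one
    generation and returns the number of newly generated tokens.
    """
    # Warm-up runs are executed but not timed (CUDA init, caches, etc.).
    for _ in range(warmup_runs):
        generate_fn()

    durations = []
    rates = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        n_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        rates.append(n_tokens / elapsed)
    return mean(durations), mean(rates)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a GPU; replace it with a
    # wrapper around generator.generate_batch(...) or chatbot(...).
    def fake_generate():
        time.sleep(0.01)
        return 50

    seconds, tokens_per_second = measure_throughput(fake_generate)
    print(f"Time(s):\t{seconds:.3f}\nTokens/s:\t{tokens_per_second:.1f}")
```

For the CTranslate2 script, the callable could return `len(results[0].sequences_ids[0])`; for the Transformers pipeline, the generated text would need to be re-tokenized and the prompt length subtracted to get a comparable count.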

Conversion script:

# Unquantized
import time
import os
start = time.time()
os.system("ct2-transformers-converter --model TheBloke/Llama-2-13B-Chat-fp16 --quantization float16 --output_dir Llama-2-13B-Chat-ct2")
end = time.time()
t = end - start
print("Time:\t", t)
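
The conversion script above discards the exit status returned by os.system, so a failed conversion would go unnoticed. Here is a sketch of the same command run through subprocess, using only the flags that appear in the original command. The helper names `build_converter_command` and `convert` are hypothetical.

```python
import subprocess
import time


def build_converter_command(model, output_dir, quantization="float16"):
    """Build the ct2-transformers-converter argument list used in the post."""
    return [
        "ct2-transformers-converter",
        "--model", model,
        "--quantization", quantization,
        "--output_dir", output_dir,
    ]


def convert(model, output_dir, quantization="float16"):
    """Run the converter and return the elapsed time in seconds."""
    start = time.time()
    # check=True raises CalledProcessError if the converter exits with a
    # non-zero status, instead of silently continuing.
    subprocess.run(
        build_converter_command(model, output_dir, quantization), check=True
    )
    return time.time() - start


if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run it.
    command = build_converter_command(
        "TheBloke/Llama-2-13B-Chat-fp16", "Llama-2-13B-Chat-ct2"
    )
    print(" ".join(command))
```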

2 months ago
