CTranslate2: very slow generation speed for the Llama 2 70B chat model

Posted by 2hh7jdfx 2 months ago in Other


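I was able to benchmark the Llama 2 7B chat model (int8) and got about 600 tokens in roughly 12 seconds on an A100 GPU, whereas the HF pipeline takes about 25 seconds for the same input and parameters.
However, when I tried the Llama 2 70B chat model (int8), it was very slow: about 90 seconds for 500 tokens, whereas the HF pipeline takes about 32 seconds (although the pipeline uses multiple GPUs, so perhaps this is not a fair comparison?). Is this expected, or am I doing something wrong?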
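To put these figures on a common scale, the timings quoted above can be converted to tokens per second. The sketch below uses only the approximate numbers from this post; the helper name `tokens_per_second` and the dictionary labels are introduced here for illustration.

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Return generation throughput in tokens per second."""
    return tokens / seconds


# Approximate figures quoted in the post: (tokens generated, seconds elapsed).
runs = {
    "CTranslate2 int8, 7B": (600, 12),
    "HF pipeline, 7B": (600, 25),
    "CTranslate2 int8, 70B": (500, 90),
    "HF pipeline, 70B (multi-GPU)": (500, 32),
}

throughput = {name: tokens_per_second(t, s) for name, (t, s) in runs.items()}

for name, tps in throughput.items():
    print(f"{name}: {tps:.1f} tokens/s")
```

On these numbers the 7B model runs at roughly 50 tokens/s under CTranslate2 against 24 tokens/s for the pipeline, while the 70B model drops to about 5.6 tokens/s against roughly 15.6 tokens/s.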
Here is my code:

import time

import ctranslate2
import transformers

CT2_INT8_MODEL_CKPT_LLAMA_7B = "llama-2-7b-chat-ct2"
CT2_INT8_MODEL_CKPT_LLAMA_70B = "llama-2-70b-chat-ct2"

# Load the int8 CTranslate2 model on the GPU.
generator = ctranslate2.Generator(CT2_INT8_MODEL_CKPT_LLAMA_70B, device="cuda")
# NOTE: LLAMA_PATH_7B is not defined in the original post; it must point to the
# Hugging Face Llama 2 7B chat checkpoint that provides the tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(LLAMA_PATH_7B)


def predict(prompt: str):
    """Generate text given a prompt and time the whole request."""
    start = time.perf_counter()
    # generate_batch takes token strings rather than token IDs.
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    results = generator.generate_batch([tokens],
                                       sampling_temperature=0.8,
                                       sampling_topk=0,
                                       sampling_topp=1,
                                       max_length=1000,
                                       include_prompt_in_result=False)
    # Token IDs of the first hypothesis; the prompt is excluded.
    tokens = results[0].sequences_ids[0]
    output = tokenizer.decode(tokens)
    request_time = time.perf_counter() - start
    return {'tok_count': len(tokens),
            'time': request_time,
            'question': prompt,
            'answer': output,
            'note': 'CTranslate2 int8 quantization'}


print('benchmarking ctranslate2...\n')
time_taken = []
results = []

# Send the same prompt 10 times and record the wall-clock time of each request.
for _ in range(10):
    start = time.perf_counter()
    out = predict("explain rotary positional embeddings")
    print(out)
    results.append(out)
    request_time = time.perf_counter() - start
    time_taken.append(request_time)
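The loop above collects one dictionary per request but does not aggregate them. A minimal sketch of how the collected `results` could be summarized follows; the helper `summarize` and its `skip_warmup` parameter are hypothetical names, and dropping the first run is an assumption based on the common practice of excluding warm-up iterations from GPU benchmarks.

```python
from statistics import mean


def summarize(results, skip_warmup=1):
    """Aggregate per-request dicts that carry 'tok_count' and 'time' keys.

    skip_warmup (hypothetical parameter): number of leading runs to drop,
    since the first GPU request is often slower than steady state.
    """
    runs = results[skip_warmup:]
    if not runs:
        raise ValueError("no runs left after skipping warm-up")
    total_tokens = sum(r["tok_count"] for r in runs)
    total_time = sum(r["time"] for r in runs)
    return {
        "runs": len(runs),
        "mean_time_s": mean(r["time"] for r in runs),
        "mean_tokens": mean(r["tok_count"] for r in runs),
        # Overall throughput across the kept runs.
        "tokens_per_s": total_tokens / total_time,
    }
```

Calling `summarize(results)` after the loop would give the mean request time and an overall tokens-per-second figure. Because sampled outputs vary in length from run to run, throughput is easier to compare across models than raw wall-clock time.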


Source: https://github.com/OpenNMT/CTranslate2/issues/1388