vllm [Usage]: The definition of an nn.Module seems to affect the output tokens; I don't know why

fykwrbwg · posted 2 months ago in Other

Current environment

Environment: CPU device
vLLM version: 0.4.2+cpu

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)  # Unused module, never assigned or called.
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)  # Unused module, never assigned or called.
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM, with no extra nn.Module defined.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

In this code, as long as I define a torch.nn module in the scope of the current vLLM model (i.e. after LLM() is created and before generate()), the output tokens change, even though I never use the module. In other words, if I move these nn modules (which I don't need) above the LLM() definition, the results are not affected.
llm1 and llm2 produce the same output, because both of them have an nn.Module defined after the model was created. llm3 is different, because nothing extra is defined, and llm3 gives the result I actually want.
Shouldn't all three results be the same? Please see the output below.
Output:

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.22s/it]
outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665805.6874118, last_token_time=1715665805.6874118, first_scheduled_time=1715665805.689108, first_token_time=1715665805.8463485, time_in_queue=0.0016961097717285156, finished_time=1715665806.759257), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是小助手 本AI 欢迎你随时向我提问,我会尽力回答', token_ids=[31123, 33030, 54603, 42481, 35786, 23833, 30910, 32616, 54622, 34498, 46993, 37817, 31123, 35094, 40328, 33287], cumulative_logprob=-17.481587450020015, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665811.4080832, last_token_time=1715665811.4080832, first_scheduled_time=1715665811.4091282, first_token_time=1715665811.539016, time_in_queue=0.0010449886322021484, finished_time=1715665812.7462144), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是 ChatGLM2-6B, 我是基于大型语言模型', token_ids=[31123, 33030, 22011, 10461, 30944, 30943, 30941, 30978, 30949, 31123, 30910, 33030, 33053, 32997, 32330, 34030], cumulative_logprob=-8.741462323308497, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715665822.238591, last_token_time=1715665822.238591, first_scheduled_time=1715665822.2395456, first_token_time=1715665822.5107977, time_in_queue=0.0009546279907226562, finished_time=1715665823.461715), lora_request=None)]

Moreover, if I change the out_features of the torch.nn module, that also affects the output tokens.

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)  # Unused module, out_features=8888.
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)  # Create an LLM.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)  # Unused module, out_features=9999.
outputs2 = llm2.generate(prompts)

I only changed out_features, yet the results differ.
Output:

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',是一名人工智能助手。 \n\n如果你需要帮助,请告诉我具体问题', token_ids=[31123, 38628, 34797, 42481, 31155, 30910, 13, 13, 32763, 31665, 31934, 30932, 55073, 38953, 32149, 31639], cumulative_logprob=-21.3015581928193, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666711.2086165, last_token_time=1715666711.2086165, first_scheduled_time=1715666711.2102835, first_token_time=1715666711.3079636, time_in_queue=0.001667022705078125, finished_time=1715666712.208443), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',小河流段便会非常活跃。很多体载货物的鱼类 difficult,', token_ids=[31123, 54603, 36773, 55005, 42237, 31685, 35203, 31155, 31679, 54618, 55387, 55466, 34090, 49426, 2529, 30932], cumulative_logprob=-96.62851423444226, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715666716.799589, last_token_time=1715666716.799589, first_scheduled_time=1715666716.8003457, first_token_time=1715666716.8765712, time_in_queue=0.0007567405700683594, finished_time=1715666718.0433056), lora_request=None)]

As you can see, I never actually use these nn modules, yet they clearly affect the results. The five outputs above are not consistent with one another, and the only thing that changes between runs is the nn.Module definition.
I could use some help. Thanks!

How to use vllm

It seems that defining an nn.Module can affect the output tokens. I don't know why.
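One way to narrow this down would be to force greedy decoding, so that any remaining difference has to come from the model's logits rather than from sampling. Below is a minimal sketch, not part of the original report; it assumes the same local model path and the vLLM 0.4.x SamplingParams API:

from vllm import LLM, SamplingParams
import torch

prompts = ["你好"]
greedy = SamplingParams(temperature=0, max_tokens=16)  # greedy decoding, no sampling randomness

llm_a = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)  # unused module
out_a = llm_a.generate(prompts, greedy)

llm_b = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)
out_b = llm_b.generate(prompts, greedy)

# With greedy decoding, any mismatch here points at the forward pass, not the sampler.
print(out_a[0].outputs[0].token_ids == out_b[0].outputs[0].token_ids)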


3htmauhk1#

This is quite interesting. Could you double-check by setting the seed?
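For reference, the seed can be passed straight to the LLM constructor (it already defaults to 0); re-seeding torch right before each call is an extra, possibly redundant, safeguard. A short sketch, assuming the same model path:

import torch
from vllm import LLM

prompts = ["你好"]

llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)
torch.manual_seed(666)   # belt-and-braces: also seed torch's global RNG
first = llm.generate(prompts)
torch.manual_seed(666)
second = llm.generate(prompts)

# Same instance, same seed: the two runs should at least agree with each other.
print(first[0].outputs[0].token_ids == second[0].outputs[0].token_ids)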


oyt4ldly2#

If this is real, I suspect it is related to a memory leak and the PyTorch caching allocator. Maybe we are leaking some object references, and when you create the new nn module, the PyTorch caching allocator recycles memory that it thinks is no longer used but is actually still referenced somewhere?
Anyway, I might be wrong. If that is the case, the root cause will be quite hard to debug.
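If that theory is right, dropping the reference to the unused module and forcing a garbage collection before generating should bring the output back to the no-module baseline. A rough, hypothetical diagnostic along those lines:

import gc
import torch
from vllm import LLM

prompts = ["你好"]

llm = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True)

# Allocate the unused module, then release it and collect before generating.
layer = torch.nn.Linear(in_features=4096, out_features=4608, bias=True, dtype=torch.bfloat16)
del layer
gc.collect()

outputs = llm.generate(prompts)
print(outputs)  # if this now matches the run with no nn.Module, reused memory is the likely culprit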


gkn4icbw3#

Hi,

from vllm import LLM
import torch

prompts = ["你好"]
llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Create an LLM with an explicit seed.
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)  # Unused module.
outputs1 = llm1.generate(prompts)  # Generate texts from the prompts.
print(outputs1)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Same seed as llm1.
torch.nn.Linear(in_features=4096, out_features=9999, bias=True, dtype=torch.bfloat16)  # Unused module with different out_features.
outputs2 = llm2.generate(prompts)  # Generate texts from the prompts.

llm3 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)  # Same seed, no extra nn.Module.
outputs3 = llm3.generate(prompts)  # Generate texts from the prompts.

print("outputs1 = ", outputs1)
print("outputs2 = ", outputs2)
print("outputs3 = ", outputs3)

I set the same seed, but it still produced three different results. In fact, LLM() already has a default seed (seed: int = 0).

outputs1 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=', p更 爱 你 要 是 你 要 是 你 要 是 你 要 是', token_ids=[31123, 281, 54664, 47802, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369, 36474, 43159, 35369], cumulative_logprob=-41.74734868388623, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824550.3473322, last_token_time=1715824550.3473322, first_scheduled_time=1715824550.3491716, first_token_time=1715824555.3297749, time_in_queue=0.0018393993377685547, finished_time=1715824620.9681613), lora_request=None)]
outputs2 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text='老师和同学们,今天我带了人民调解委员会调解费收据 我不知道', token_ids=[42116, 32812, 31123, 31869, 54546, 54882, 54537, 31657, 36122, 32007, 36122, 55000, 54821, 54830, 34211, 32522], cumulative_logprob=-43.803544878959656, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824629.7847252, last_token_time=1715824629.7847252, first_scheduled_time=1715824629.7856104, first_token_time=1715824633.9895625, time_in_queue=0.0008852481842041016, finished_time=1715824653.5920393), lora_request=None)]
outputs3 =  [RequestOutput(request_id=0, prompt='你好', prompt_token_ids=[64790, 64792, 36474, 54591], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=',我是人工智能助手。 根据用户名登录后,我的作用是提供咨询', token_ids=[31123, 33030, 34797, 42481, 31155, 47383, 32053, 54653, 36782, 54585, 31123, 31791, 31827, 54532, 31692, 32539], cumulative_logprob=-32.18759796023369, logprobs=None, finish_reason=length, stop_reason=None)], finished=True, metrics=RequestMetrics(arrival_time=1715824663.3346176, last_token_time=1715824663.3346176, first_scheduled_time=1715824663.3352196, first_token_time=1715824663.549846, time_in_queue=0.0006020069122314453, finished_time=1715824664.6953938), lora_request=None)]
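A possible next step would be to compare the distributions directly instead of the sampled text: request per-step logprobs under greedy decoding, so that identical model state must produce identical numbers. A sketch, again assuming the vLLM 0.4.x SamplingParams API:

from vllm import LLM, SamplingParams
import torch

prompts = ["你好"]
params = SamplingParams(temperature=0, max_tokens=16, logprobs=5)  # greedy decoding + top-5 logprobs per step

llm1 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)
torch.nn.Linear(in_features=4096, out_features=8888, bias=True, dtype=torch.bfloat16)  # unused module
out1 = llm1.generate(prompts, params)

llm2 = LLM(model="/home/zhenzhong/model/chatglm2-6b", trust_remote_code=True, seed=666)
out2 = llm2.generate(prompts, params)

# If the per-step logprobs differ between the two runs, the forward pass itself is being
# perturbed, independently of any sampling seed.
print(out1[0].outputs[0].logprobs)
print(out2[0].outputs[0].logprobs)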
