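Thank you very much for your excellent work! When using the internlm model, I found that the features produced by vLLM's first forward pass differ from those produced by HF for the same input. Why is this? Is it caused by differences in the underlying implementation?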

Here is the experiment configuration:
Environment

cuda==11.8
python==3.9
torch==2.1.0+cu118
xformers==0.0.22.post7+cu118
transformers==4.35.0
GPU: V100

Code used to produce the results

vLLM code:
from vllm import LLM, SamplingParams
prompts = ['请介绍下爱因斯坦的生平。']
sampling_params = SamplingParams(
    temperature=0, top_p=1, max_tokens=128, repetition_penalty=1.1,
    use_beam_search=True, best_of=5)
llm = LLM(model="internlm/internlm-7b", trust_remote_code=True)
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)

Hugging Face code:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-7b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()
prompts = ['请介绍下爱因斯坦的生平。']
for prompt in prompts:
    inputs = tokenizer([prompt], return_tensors="pt")
    for k,v in inputs.items():
        inputs[k] = v.cuda()
    gen_kwargs = {"num_beams": 5, "max_length": 128, "top_p": 1, "temperature": 0., "do_sample": False, "repetition_penalty": 1.1}  # official version
    output = model.generate(**inputs, **gen_kwargs)
    output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)

The inputs are guaranteed to be identical, but the hidden_state output by the first forward pass differs.

huggingface framework
input_embeds-features before being fed to the model
array([[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
        -0.001038],
       [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
        -0.021   ],
       [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
         0.01636 ],
       ...,
       [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
        -0.02856 ],
       [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
        -0.0232  ],
       [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
        -0.00787 ]], dtype=float16)
hidden_state-the output of the first round forward
array([[-0.0571 ,  1.743  ,  0.521  , ..., -1.4795 , -5.82   , -0.3972 ],
       [-0.671  , -2.166  ,  1.967  , ...,  0.2404 , -1.173  , -0.0839 ],
       [-0.8433 , -5.168  , -0.03244, ...,  5.035  ,  2.578  , -0.507  ],
       ...,
       [-0.547  , -4.03   ,  2.383  , ...,  3.295  ,  0.3582 ,  0.737  ],
       [-1.602  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  , -0.1273 ],
       [-1.817  , -5.45   ,  0.1937 , ...,  5.4    ,  3.84   , -0.3865 ]],
      dtype=float16)

vLLM framework
vLLM==0.2.2(https://github.com/vllm-project/vllm/releases/download/v0.2.2/vllm-0.2.2+cu118-cp39-cp39-manylinux1_x86_64.whl#sha256=7a8b51f0565baaa820f8dc0376e1ff5a732fcabda26397a55becd90e07b5fc63)
input_embeds-features before being fed to the model
array([[[-0.007782, -0.001129,  0.001808, ...,  0.001305, -0.001099,
         -0.001038],
        [-0.01233 , -0.02148 , -0.00812 , ..., -0.002289,  0.01782 ,
         -0.021   ],
        [ 0.003204,  0.009766,  0.004364, ..., -0.02527 ,  0.005524,
          0.01636 ],
        ...,
        [-0.007263,  0.003021,  0.01721 , ..., -0.06006 , -0.02747 ,
         -0.02856 ],
        [-0.00412 ,  0.01068 ,  0.006622, ...,  0.00705 ,  0.007538,
         -0.0232  ],
        [-0.0381  , -0.02625 ,  0.0065  , ...,  0.02722 ,  0.02759 ,
         -0.00787 ]]], dtype=float16)
hidden_state-the output of the first round forward
array([[[-0.0643 ,  1.74   ,  0.5254 , ..., -1.48   , -5.82   ,
         -0.3816 ],
        [-0.674  , -2.17   ,  1.966  , ...,  0.2505 , -1.162  ,
         -0.0839 ],
        [-0.8413 , -5.168  , -0.03452, ...,  5.04   ,  2.582  ,
         -0.5073 ],
        ...,
        [-0.5483 , -4.035  ,  2.38   , ...,  3.295  ,  0.3564 ,
          0.7373 ],
        [-1.603  , -4.344  ,  0.466  , ...,  4.594  ,  3.092  ,
         -0.1282 ],
        [-1.816  , -5.45   ,  0.1952 , ...,  5.4    ,  3.836  ,
         -0.3877 ]]], dtype=float16)
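
To put a number on how far apart the two hidden_state dumps are, here is a minimal sketch that spot-checks only the values visible in the first printed row of each array. The elided middle columns are not available, so this is not a full comparison. The helper name `compare` and the tolerances are my own assumptions, not something taken from vLLM or transformers.

```python
import numpy as np

# First row of hidden_state as printed above (visible columns only).
hf_row = np.array([-0.0571, 1.743, 0.521, -1.4795, -5.82, -0.3972], dtype=np.float16)
vllm_row = np.array([-0.0643, 1.74, 0.5254, -1.48, -5.82, -0.3816], dtype=np.float16)


def compare(a, b, atol=2e-2, rtol=1e-2):
    """Return the largest absolute gap and whether a and b agree within tolerance.

    atol and rtol are illustrative assumptions for float16 activations,
    not official thresholds from either framework.
    """
    # Upcast so the subtraction itself does not add float16 rounding error.
    a32 = a.astype(np.float32)
    b32 = b.astype(np.float32)
    abs_diff = np.abs(a32 - b32)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "allclose": bool(np.allclose(a32, b32, atol=atol, rtol=rtol)),
    }


if __name__ == "__main__":
    print(compare(hf_row, vllm_row))
```

On these visible values the largest gap is in the last column (-0.3972 vs -0.3816). The same helper could be applied to the full arrays once both are saved with np.save.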

Hello,
I ran into this problem too, but my case is worse. I am using the Llama2-7B-Chat-Hf model with the following configuration:
vllm 0.2.6
transformers 4.36.2
LLM Llama2-7B-Chat-Hf
Python 3.10.12
Ubuntu 22.04
GPU NVIDIA 4090 24gb
The code is as follows:

llm = VLLM(model="meta-llama/Llama-2-7b-chat-hf", tensor_parallel_size=1, trust_remote_code=True, temperature=0.6, top_k=5, top_p=0.9, torch_dtype=torch.bfloat16, max_new_tokens=500)
llm("hello")

The answer returned by llm("hello") is:
@matthew-mitchell.com
www.matthew-mitchell.com
Matthew Mitchell is a composer ...
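My question is:
Also, I did not expect Llama2-7b-Chat to occupy about 21 GB of GPU memory. (In fact, I ran exactly the same code earlier and it only used about 14 GB.)
Has anyone else seen strange answers and high GPU memory usage like this? If so, how did you solve it?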
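
For reference, a back-of-the-envelope sketch of where the two memory figures could come from. This is only an assumption, not a confirmed diagnosis: vLLM's `gpu_memory_utilization` engine argument defaults to 0.9, so the engine may reserve that fraction of the card up front for weights plus KV cache, while the fp16/bf16 weights alone need about 2 bytes per parameter. The parameter count (roughly 6.74 billion) and the helper names below are illustrative.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Approximate size of the model weights in GB (2 bytes per param for fp16/bf16)."""
    return num_params * bytes_per_param / 1e9


def reserved_gb(total_gpu_gb, gpu_memory_utilization=0.9):
    """Memory budget vLLM would claim if it reserves this fraction of the GPU."""
    return total_gpu_gb * gpu_memory_utilization


if __name__ == "__main__":
    # Assumed values: ~6.74e9 parameters for Llama-2-7B, 24 GB card (RTX 4090).
    print(f"weights only: {weights_gb(6.74e9):.2f} GB")
    print(f"reserved at 0.9 utilization: {reserved_gb(24.0):.2f} GB")
```

If this is what is happening, lowering `gpu_memory_utilization` should shrink the reserved amount, at the cost of a smaller KV cache.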