python - How to perform inference on a Llava Llama model deployed to SageMaker from Hugging Face?

Posted by kh212irz on 2024-01-05 in Python

I deployed the Llava Llama Hugging Face model (https://huggingface.co/liuhaotian/llava-llama-2-13b-chat-lightning-preview/discussions/3) to a SageMaker Domain + Endpoint using the deployment card provided by Hugging Face:

import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

# Hub Model configuration. https://huggingface.co/models
hub = {
    'HF_MODEL_ID': 'liuhaotian/llava-llama-2-13b-chat-lightning-preview',
    'HF_TASK': 'text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role, 
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1, # number of instances
    instance_type='ml.m5.xlarge' # ec2 instance type
)

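The deployment sets HF_TASK to text-generation. However, Llava Llama is a multimodal text + image model, so the big question is how I would perform inference / prediction. I need to pass both the image and the text prompt. Other image + text APIs (such as Imagen or Imagch) accept base64-encoded image data. I know I need to do more than that, because, for example, models are trained on datasets with specific dimensions (I think for the Llava Llama model that may be 336 x 336), while Imagen or PaaS services take care of resizing / cropping / padding.
Llava Llama has a demo page, https://llava-vl.github.io/, which uses a Gradio user interface, so I cannot tell where and how the model is hosted. However, we may be able to decipher the solution from the source code. I think this get_image function is important; it resizes / crops / pads the image: https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/conversation.py#L109, and it is called from https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/serve/gradio_web_server.py#L138
We can see that there are some magic tokens that mark the start and end of the image and separate the text prompt (https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/serve/gradio_web_server.py#L154). We can see that the text prompt is truncated to 1536 tokens (?) in text-to-image generation mode and to 1200 tokens in image QnA mode, and that a composite prompt is assembled with the help of these tokens (https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/conversation.py#L287) and a template (https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/conversation.py#L71). The image is also attached as a base64 string, in PNG format: https://github.com/haotian-liu/LLaVA/blob/a4269fbf014af3cab1f1d172914493fae8b74820/llava/conversation.py#L154
When I try to call the endpoint for inference / prediction,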

from sagemaker.predictor import Predictor
from base64 import b64encode

endpoint = 'huggingface-pytorch-inference-2023-09-23-08-55-26-117'
ENCODING = "utf-8"
IMAGE_NAME = "eiffel_tower_336.jpg"

payload = {
    "inputs": "Describe the content of the image in great detail ",
}
with open(IMAGE_NAME, 'rb') as f:
    byte_content = f.read()
    base64_bytes = b64encode(byte_content)
    base64_string = base64_bytes.decode(ENCODING)

predictor = Predictor(endpoint)
inference_response = predictor.predict(data=payload)
print (inference_response)


I get an error: ParamValidationError: Parameter validation failed: Invalid type for parameter Body, value: {'inputs': 'Describe the content of the image in great detail '}, type: <class 'dict'>, valid types: <class 'bytes'>, <class 'bytearray'>, file-like object
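The error above suggests the request body reached boto3 as a Python dict rather than as bytes. A minimal sketch of one way around that, assuming the endpoint expects a JSON body: serialize the payload before sending it. The `image` key is a hypothetical field name; nothing in the stock text-generation handler is known to read it, so it only helps if the server-side code is written to expect it.

```python
import base64
import json


def build_request_body(prompt, image_bytes=None):
    """Serialize a prompt (and optionally an image) into JSON-encoded bytes."""
    payload = {"inputs": prompt}
    if image_bytes is not None:
        # Hypothetical field name: the endpoint's handler must be written to read it.
        payload["image"] = base64.b64encode(image_bytes).decode("utf-8")
    # InvokeEndpoint accepts bytes, bytearray or a file-like object as Body.
    return json.dumps(payload).encode("utf-8")


body = build_request_body("Describe the content of the image in great detail")
print(type(body).__name__, len(body))
```

With the SageMaker SDK, constructing the predictor as `Predictor(endpoint, serializer=JSONSerializer(), deserializer=JSONDeserializer())` (classes from `sagemaker.serializers` and `sagemaker.deserializers`) should perform the same serialization. This only addresses the client-side type error, not whether the endpoint understands images.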
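This Hugging Face discussion, https://discuss.huggingface.co/t/can-text-to-image-models-be-deployed-to-a-sagemaker-endpoint/20120, says that an inference.py needs to be created. I don't know what Llava Llama provides in that respect. I tried to look at the model's files, but I don't see any relevant metadata.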
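For orientation, here is a rough sketch of the shape such an inference.py could take, using the `model_fn` / `input_fn` / `predict_fn` hooks that the Hugging Face SageMaker inference toolkit looks for. The request fields (`inputs`, `image`) and the assumption that the loaded model is a callable taking `(prompt, image_bytes)` are hypothetical; the actual LLaVA loading code is deliberately left out.

```python
import base64
import json


def model_fn(model_dir):
    """Called once at container start; should return the loaded model."""
    # Placeholder: loading LLaVA weights and its image processor is not shown here.
    raise NotImplementedError("load the LLaVA model from model_dir here")


def input_fn(request_body, content_type):
    """Decode the JSON request into a prompt and raw image bytes."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    request = json.loads(request_body)
    return {
        "prompt": request["inputs"],
        # Hypothetical field: the client is assumed to send a base64 string here.
        "image_bytes": base64.b64decode(request["image"]),
    }


def predict_fn(data, model):
    """Run the model; `model` is whatever model_fn returned."""
    # Assumption: the model object is callable as model(prompt, image_bytes).
    text = model(data["prompt"], data["image_bytes"])
    return {"generated_text": text}
```

The file would go into a `code/` directory inside the model archive, which is the layout the toolkit expects for custom handlers.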
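This StackOverflow entry, "How to do model inference on a multimodal model from hugginface using sagemaker", is about a serverless deployment, but it uses a custom TextImageSerializer serializer. Should I try something similar?
This Redditor suggests (https://www.reddit.com/r/LocalLLaMA/comments/16pzn88/how_to_parametrize_a_llava_llama_model/) some kind of CLIP encoding. I'm not sure whether I really need to do that, or whether the model can do the encoding itself.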

Answer #1 (7gyucuyw)

TL;DR: Until there is a better model card (https://huggingface.co/shauray/Llava-Llama-2-7B-hf does seem to show meaningful usage, but I don't know where from transformers import LlavaProcessor, LlavaForCausalLM comes from, see https://huggingface.co/shauray/Llava-Llama-2-7B-hf/discussions/1), things seem to be all over the place. So I got fed up and went with replicate.com instead, see https://replicate.com/yorickvp/llava-13b/api

import replicate
output = replicate.run(
    "yorickvp/llava-13b:2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591",
    input=dict(
        image=open("eiffel_tower_336.jpg", "rb"),
        prompt="Describe what is on the photo in great detail, be very verbose"
    )
)
# The yorickvp/llava-13b model can stream output as it's running.
# The predict method returns an iterator, and you can iterate over that output.
for item in output:
    # https://replicate.com/yorickvp/llava-13b/versions/2facb4a474a0462c15041b78b1ad70952ea46b5ec6ad29583c0b29dbd4249591/api#output-schema
    print(item, end="")

It's that simple.
Detailed debugging: the LLaVA GitHub repo should show how the CLI or the GUI interacts with the backend. As for the CLI: https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/cli.py
1. We can see that there is a conv_mode that depends on the model type. In our case it is Llama 2: https://github.com/haotian-liu/LLaVA/blob/6bfd90754621c0277672d1418336a31d976c4ec3/llava/serve/cli.py#L34C5-L35C36

if 'llama-2' in model_name.lower():
    conv_mode = "llava_llama_2"


Which means https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/conversation.py#L277

conv_llava_llama_2 = Conversation(
    system="You are a helpful language and vision assistant. "
           "You are able to understand the visual content that the user provides, "
           "and assist the user with a variety of tasks using natural language.",
    roles=("USER", "ASSISTANT"),
    version="llama_v2",
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_2,
    sep="<s>",
    sep2="</s>",
)


Here we can see the separators and the separator style. The LLAMA_2 SeparatorStyle:

# Excerpt from Conversation.get_prompt() in llava/conversation.py
elif self.sep_style == SeparatorStyle.LLAMA_2:
    wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n"
    wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
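
To make the two wrappers concrete, here is a small sketch of how a first user turn could be assembled from them. It is a simplification under stated assumptions: it covers a single turn only and does not reproduce how the repo joins multiple turns with `sep` and `sep2`. The function name `build_single_turn_prompt` is made up for illustration.

```python
# System prompt copied from conv_llava_llama_2 above.
SYSTEM = (
    "You are a helpful language and vision assistant. "
    "You are able to understand the visual content that the user provides, "
    "and assist the user with a variety of tasks using natural language."
)


def wrap_sys(msg):
    """Wrap the system prompt in the Llama 2 system markers."""
    return f"<<SYS>>\n{msg}\n<</SYS>>\n\n"


def wrap_inst(msg):
    """Wrap a user turn in the Llama 2 instruction markers."""
    return f"[INST] {msg} [/INST]"


def build_single_turn_prompt(user_message, system=SYSTEM):
    """Hypothetical helper: system block plus first user message in one [INST] pair."""
    return wrap_inst(wrap_sys(system) + user_message)


print(build_single_turn_prompt("Describe the image."))
```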


2. Image processing: https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/serve/cli.py#L54

image = load_image(args.image_file)
# Similar operation in model_worker.py
image_tensor = process_images([image], image_processor, args)
if type(image_tensor) is list:
    image_tensor = [image.to(model.device, dtype=torch.float16) for image in image_tensor]
else:
    image_tensor = image_tensor.to(model.device, dtype=torch.float16)


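load_image simply loads the binary image data regardless of its format (jpg, png, etc.). process_images then converts the image to a tensor: https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/mm_utils.py#L28C5-L28C19 Before the conversion it pads the image to a square. To get the best results, I already supplied a square 336 x 336 px or 224 x 224 px image. The image is then referenced from the prompt, which appears to get prefixed with image token markers: https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/serve/cli.py#L75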

if model.config.mm_use_im_start_end:
    inp = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + '\n' + inp
else:
    inp = DEFAULT_IMAGE_TOKEN + '\n' + inp
conv.append_message(conv.roles[0], inp)


For the actual inference, image_tensor is passed to model as an argument. However, when calling an API, model does not exist on the client side.
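The pad-to-square step mentioned above can be illustrated with plain arithmetic. This is a sketch of the geometry only, assuming the image is centered on a square canvas whose side equals the longer edge; it is not the repo's code, and `square_padding` is a made-up name.

```python
def square_padding(width, height):
    """Return (side, left, top): canvas side and the offset that centers the image."""
    side = max(width, height)
    left = (side - width) // 2
    top = (side - height) // 2
    return side, left, top


for size in [(336, 336), (640, 480), (300, 500)]:
    print(size, "->", square_padding(*size))
```

With Pillow, the offsets would be the position passed to `Image.paste` when copying the original onto a new square canvas. An image that is already square, such as the 336 x 336 one used here, needs no padding at all.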
3. Let's see how the Gradio frontend calls the backend: https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/gradio_web_server.py This is where the payload is assembled: https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/serve/gradio_web_server.py#L224

pload = {
    "model": model_name,
    "prompt": prompt,
    "temperature": float(temperature),
    "top_p": float(top_p),
    "max_new_tokens": min(int(max_new_tokens), 1536),
    "stop": state.sep if state.sep_style in [SeparatorStyle.SINGLE, SeparatorStyle.MPT] else state.sep2,
    "images": f'List of {len(state.get_images())} images: {all_image_hash}',
}


images is then replaced with pload['images'] = state.get_images(). state is the conversation object we have already looked at: https://github.com/haotian-liu/LLaVA/blob/f47c16e4aeac6d4d61259800ca9cd33b26824113/llava/conversation.py#L109 It preprocesses the images and also appears to base64-encode them in PNG format.
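Putting those pieces together, a payload with the same fields as the Gradio `pload` could be built as sketched below. The default sampling values and the function name are assumptions for illustration, and the caller is assumed to pass PNG-encoded bytes. This mirrors what the Gradio frontend sends to the LLaVA model worker; whether a SageMaker endpoint accepts the same shape depends entirely on its handler.

```python
import base64
import json


def build_worker_payload(model_name, prompt, png_images,
                         temperature=0.2, top_p=0.7, max_new_tokens=512):
    """Hypothetical helper mirroring the fields of the Gradio pload dict."""
    return {
        "model": model_name,
        "prompt": prompt,
        "temperature": float(temperature),
        "top_p": float(top_p),
        # Same cap as in the Gradio web server excerpt.
        "max_new_tokens": min(int(max_new_tokens), 1536),
        # sep2 of conv_llava_llama_2, used as the stop string for the LLAMA_2 style.
        "stop": "</s>",
        # One base64 string per image, as the frontend appears to send them.
        "images": [base64.b64encode(img).decode("utf-8") for img in png_images],
    }


payload = build_worker_payload("llava-llama-2-13b-chat", "[INST] Describe it. [/INST]", [b"fake-png-bytes"])
print(json.dumps(payload)[:80])
```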
I got tired of trying combinations; I kept getting

ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
  "code": 400,
  "type": "InternalServerException",
  "message": "\u0027llava\u0027"
}
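
A note on reading that message: `"\u0027llava\u0027"` is the JSON-escaped form of `'llava'`, which is how Python renders a `KeyError` raised for the key `llava`. One plausible reading, which is an assumption and not confirmed by any log shown here, is that the container looked up the model type `llava` in a registry that has no such entry. The dictionary below is a hypothetical stand-in for such a registry.

```python
import json

# Hypothetical stand-in for a model-type registry that predates LLaVA support.
known_model_types = {"llama": "LlamaConfig", "bert": "BertConfig"}

try:
    known_model_types["llava"]
except KeyError as err:
    local_message = str(err)

# The "message" field exactly as it appears in the 400 response.
server_message = json.loads('"\\u0027llava\\u0027"')

print(local_message == server_message)
```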

Answer #2 (0x6upsns)

LLaVa is now natively supported in the Transformers library, so it should be easier to deploy now.
Docs: https://huggingface.co/docs/transformers/main/model_doc/llava
Checkpoints on the Hub: https://huggingface.co/llava-hf
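A minimal sketch of local inference with the native Transformers support, assuming a recent Transformers release, Pillow, and the `llava-hf/llava-1.5-7b-hf` checkpoint (an example choice from the llava-hf organization, not the Llama 2 based model from the question). The prompt layout follows the format documented for the llava-1.5 checkpoints; other checkpoints may expect a different template.

```python
def build_prompt(question):
    """Prompt layout documented for the llava-1.5 checkpoints."""
    return f"USER: <image>\n{question} ASSISTANT:"


def describe_image(image_path, question, model_id="llava-hf/llava-1.5-7b-hf"):
    """Load the model, run one image + text query, and return the decoded text."""
    # Imported lazily so build_prompt works without the heavy dependencies.
    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=build_prompt(question), images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output_ids[0], skip_special_tokens=True)


print(build_prompt("Describe the content of the image in great detail"))
```

Calling `describe_image("eiffel_tower_336.jpg", "Describe the content of the image in great detail")` downloads several gigabytes of weights on first use, so it is best tried on a machine with enough memory before wiring the same logic into an endpoint handler.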
