Hi Ollama team,
ViP-LLaVA is a region-level large multimodal model from the LLaVA team that can understand visual prompts such as scribbles, bounding boxes, and arrows.
It requires only a few lines of changes on top of the original LLaVA code, and Hugging Face has already integrated ViP-LLaVA into the official Transformers library: https://huggingface.co/docs/transformers/main/model_doc/vipllava
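For reference, here is a minimal sketch of running ViP-LLaVA through that Transformers integration. It assumes the llava-hf/vip-llava-7b-hf checkpoint and the prompt template shown in the linked docs; the sample image URL and question are just placeholders for illustration.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-7b-hf"  # assumed checkpoint name on the HF hub
model = VipLlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# ViP-LLaVA uses its own prompt template (see the Transformers docs linked above)
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
    "###Human: <image>\nWhat is shown in this image?###Assistant:"
)

# Placeholder image; in practice this would carry the visual prompt (e.g. a drawn arrow)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```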
Would you consider adding ViP-LLaVA here?
Thank you!
Mu Cai
1 answer
Related issue in the llama.cpp repository: ggerganov/llama.cpp#4515