Image prompting with LLaVa
In this example, we'll deploy the LLaVa multi-modal vision and language model to a REST endpoint.
Setup
We recommend using a Google Colab notebook with high memory and a T4 GPU for this example. Because finding A100s on Colab can be hit-or-miss, we'll start with a version of the model that fits on a T4. Scroll down for a guide to deploying a larger version.
First, install the accelerate, bitsandbytes, transformers, Pillow, and modelbit packages:
!pip install accelerate bitsandbytes transformers Pillow modelbit
Import your dependencies:
from transformers import AutoProcessor, LlavaForConditionalGeneration
from huggingface_hub import snapshot_download
from PIL import Image
import requests
import bitsandbytes
import accelerate
import modelbit
And log into Modelbit:
mb = modelbit.login()
Finally, download the LLaVa weights from HuggingFace:
snapshot_download(repo_id="llava-hf/llava-1.5-7b-hf", local_dir="/content/llava-hf")
Building the model and performing an inference
First, we'll load the model using modelbit.setup:
with modelbit.setup(name="load_model"):
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf",
                                                          local_files_only=True,
                                                          load_in_8bit=True)
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True, load_in_8bit=True)
Note load_in_8bit=True, which quantizes the model so it fits in the VRAM of a T4 GPU.
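If your transformers version warns that load_in_8bit is deprecated as a direct from_pretrained argument, you can pass the same setting through a BitsAndBytesConfig object instead. A minimal sketch of the equivalent call (same 8-bit behavior, only the configuration style changes):

from transformers import BitsAndBytesConfig

# 8-bit quantization expressed as a config object instead of load_in_8bit=True
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = LlavaForConditionalGeneration.from_pretrained("./llava-hf",
                                                      local_files_only=True,
                                                      quantization_config=bnb_config)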
Next we'll write our function that prompts the model and returns the result:
def prompt_llava(url: str, prompt: str):
    image = Image.open(requests.get(url, stream=True).raw)
    mb.log_image(image)  # Log the input image in Modelbit
    full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
    inputs = processor(text=full_prompt, images=image, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=15)
    response = processor.batch_decode(generate_ids,
                                      skip_special_tokens=True,
                                      clean_up_tokenization_spaces=False)[0].split("ASSISTANT:")[1]
    return response
This function downloads the picture from the URL and prompts LLaVa with the picture and the text prompt, returning just the model's response.
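If you want a clearer error when the image URL can't be fetched, you can check the HTTP response before opening the image. A minimal sketch of that check (the 10-second timeout is an assumption, not part of the original function):

resp = requests.get(url, stream=True, timeout=10)  # assumed timeout value
resp.raise_for_status()  # fail fast if the image download didn't succeed
image = Image.open(resp.raw)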
Test your model by calling prompt_llava:
prompt_llava("https://doc.modelbit.com/img/cat.jpg", "What animals are in this picture?")
Deployment
From here, deployment to a REST API is just one line of code:
mb.deploy(prompt_llava,
          extra_files=["llava-hf"],
          python_packages=["bitsandbytes==0.43.1", "accelerate==0.32.1"],
          setup="load_model",
          require_gpu=True)
We want to make sure to bring along the weight files from the llava-hf directory using the extra_files parameter. We also need to specify the bitsandbytes and accelerate dependencies explicitly, because transformers needs them in this case but does not declare that dependency. Finally, of course, this model requires a GPU.
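Once deployed, you can call the endpoint from Python with modelbit.get_inference, passing the arguments in the same order as the function's parameters. For example, with placeholder workspace and region values:

response = mb.get_inference(deployment="prompt_llava",
                            workspace="<YOUR_WORKSPACE>",
                            region="<YOUR_REGION>",
                            data=["https://doc.modelbit.com/img/cat.jpg", "What animals are in this picture?"])
print(response)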
Using a base64-encoded image instead of an image URL
It may be more convenient or more performant to encode the image itself into the REST request, instead of passing an image URL.
In that case, we'll add a couple more imports to our code:
import base64
from io import BytesIO
We'll update the definition of our inference function to make it clear we expect base64-encoded image bytes, and the second line of the inference function to load the image directly from the passed-in value.
Here's the entire updated inference function:
def prompt_llava(image_b64: str, prompt: str):
    image = Image.open(BytesIO(base64.b64decode(image_b64)))
    mb.log_image(image)
    full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
    inputs = processor(text=full_prompt, images=image, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=15)
    response = processor.batch_decode(generate_ids,
                                      skip_special_tokens=True,
                                      clean_up_tokenization_spaces=False)[0].split("ASSISTANT:")[1]
    return response
Note that the model and processor from before are unchanged.
If we have that same cat.jpg file locally, we can test our function like so:
def b64_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

response = prompt_llava(b64_image("cat.jpg"), "How many cats are in the image?")
print(response)
Finally, the code to deploy the model is the same:
mb.deploy(prompt_llava,
          extra_files=["llava-hf"],
          python_packages=["bitsandbytes==0.43.1", "accelerate==0.32.1"],
          setup="load_model",
          require_gpu=True)
And to call the model using Modelbit's Python API, the code would be:
response = mb.get_inference(deployment="prompt_llava",
                            workspace="<YOUR_WORKSPACE>",
                            region="<YOUR_REGION>",
                            data=[b64_image("cat.jpg"), "How many cats are in the image?"])
print(response)
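You can also hit the REST endpoint directly over HTTP from any client. The URL below is only a placeholder pattern, so copy the exact endpoint from your deployment's page in Modelbit; this sketch assumes the standard {"data": [...]} request body:

import requests

# Placeholder; use the endpoint URL shown for your deployment in Modelbit
url = "https://<YOUR_WORKSPACE>.app.modelbit.com/v1/prompt_llava/latest"
payload = {"data": [b64_image("cat.jpg"), "How many cats are in the image?"]}
print(requests.post(url, json=payload).json())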
Deploying a larger version
If you can get an A100 from Colab, you can build a larger (non-quantized) version of the model and deploy that to Modelbit!
To do so, simply remove load_in_8bit=True from your from_pretrained calls. Since quantized models automatically load onto CUDA but non-quantized models do not, you'll also need to add .to("cuda") to your model construction. Here's the new setup definition:
with modelbit.setup(name="load_model"):
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", local_files_only=True).to("cuda")
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True)
The inference function is unchanged. Finally, when deploying, make sure you specify a large enough GPU:
mb.deploy(prompt_llava,
          extra_files=["llava-hf"],
          setup="load_model",
          require_gpu="A10G")
No need for accelerate or bitsandbytes since we're no longer quantizing.
You can now call your LLaVa model from its production REST endpoint!