
Example LLaVa Multi-Modal Model Deployment w/GPU

In this example, we'll deploy the LLaVa multi-modal vision and language model to a REST endpoint.


We recommend using a Google Colab notebook with high memory and a T4 GPU for this example. Because finding A100s on Colab can be hit-or-miss, we'll start with a quantized version of the model that fits on a T4. Scroll down for a guide to deploying a larger version.

First, install the accelerate, bitsandbytes and modelbit packages:

!pip install accelerate bitsandbytes modelbit

Go ahead and log in to Modelbit:

import modelbit
mb = modelbit.login()

Import the rest of your dependencies:

from transformers import AutoProcessor, LlavaForConditionalGeneration
from huggingface_hub import snapshot_download
from PIL import Image
import requests
import bitsandbytes
import accelerate
from functools import cache

Finally, download the LLaVa weights from HuggingFace:

snapshot_download(repo_id="llava-hf/llava-1.5-7b-hf", local_dir="/content/llava-hf")

Building the model and performing an inference

First we'll write a function that loads the model:

@cache
def load_model():
    # Load the 8-bit quantized model and its processor from the downloaded weights
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", local_files_only=True, load_in_8bit=True)
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True, load_in_8bit=True)
    return model, processor

Note load_in_8bit=True, which quantizes the model's weights to 8 bits so that it fits in the VRAM of a T4 GPU.
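To see why 8-bit loading matters on a T4, here is a back-of-the-envelope estimate of weight memory. It assumes roughly 7 billion parameters (suggested by the "7b" in the model name) and the T4's 16 GB of VRAM. It counts weights only and ignores activations and CUDA overhead, so treat the numbers as rough guidance rather than a measurement.

```python
# Rough weight-memory estimate; all figures are approximations.
PARAMS = 7e9        # assumption: ~7 billion parameters ("7b" in the repo name)
T4_VRAM_GB = 16     # memory size of an NVIDIA T4

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_gb(dtype: str, params: float = PARAMS) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    size = weight_gb(dtype)
    headroom = T4_VRAM_GB - size  # negative means the weights alone exceed VRAM
    print(f"{dtype:>8}: ~{size:.0f} GB of weights, {headroom:+.0f} GB headroom on a T4")
```

Under these assumptions, 8-bit weights leave several gigabytes free for the image and generation buffers, while wider data types leave little or none.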

The @cache decorator will cause this function to only load the model once. After that, it stays in memory. The same behavior will be preserved in production in Modelbit.
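The effect of @cache can be seen in isolation with a stand-in loader. This is a minimal sketch: the placeholder strings and the load_count counter are illustrative and not part of the real deployment.

```python
from functools import cache

load_count = 0  # counts how often the expensive function body actually runs

@cache
def load_model():
    """Stand-in for the real loader; returns placeholder objects."""
    global load_count
    load_count += 1
    return "model", "processor"

first = load_model()
second = load_model()

print(load_count)       # 1: the body ran only once
print(first is second)  # True: the second call returned the cached tuple
```

Because load_model takes no arguments, every call after the first hits the same cache entry and skips the load entirely.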

Next we'll write our function that prompts the model and returns the result:

def prompt_llava(url: str, prompt: str):
    model, processor = load_model()
    # Download the image from the URL and open it with PIL
    image = Image.open(requests.get(url, stream=True).raw)
    mb.log_image(image) # Log the input image in Modelbit
    full_prompt = f"USER: <image>\n{prompt} ASSISTANT:"
    inputs = processor(text=full_prompt, images=image, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**inputs, max_new_tokens=15)
    # Decode the output and keep only the text after "ASSISTANT:"
    response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0].split("ASSISTANT:")[1]
    return response

This function downloads the picture from the URL and prompts LLaVa with the picture and the text prompt, returning just the model's response.
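The prompt formatting and response extraction can be tried without a GPU. The sketch below pulls those two string operations into helpers; build_prompt, extract_response, and the decoded string are hypothetical stand-ins for illustration, not output from the model.

```python
def build_prompt(prompt: str) -> str:
    """Wrap the user's text in the USER/ASSISTANT template used above."""
    return f"USER: <image>\n{prompt} ASSISTANT:"

def extract_response(decoded: str) -> str:
    """Keep only the text that follows the ASSISTANT: marker."""
    return decoded.split("ASSISTANT:")[1]

# Hypothetical decoded output, made up for this example
decoded = "USER:  \nWhat is in this picture? ASSISTANT: A dog on a beach"

print(repr(build_prompt("What is in this picture?")))
print(repr(extract_response(decoded)))  # ' A dog on a beach'
```

Note the leading space in the extracted text. If that matters to your callers, calling .strip() on the result is a simple way to remove it.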


From here, deployment to a REST API is just one line of code:

mb.deploy(prompt_llava,
          extra_files=["llava-hf"],
          python_packages=["bitsandbytes==0.43.1", "accelerate==0.30.1"],
          require_gpu=True)

We want to make sure to bring along the weight files from the llava-hf directory using the extra_files parameter.

We also want to specify the bitsandbytes and accelerate dependencies explicitly, because transformers needs them for 8-bit loading but does not declare them as dependencies of its own.

Finally, of course, this model requires a GPU.
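When pinning package versions for deployment, it helps to pin whatever is actually installed in the notebook, so that production matches the environment you tested in. Here is a small sketch using the standard library's importlib.metadata; the pinned helper is a hypothetical name, and packages that are not installed are simply skipped.

```python
from importlib.metadata import PackageNotFoundError, version

def pinned(packages):
    """Return 'name==version' strings for each package that is installed."""
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Not installed in this environment; leave it out of the list
            pass
    return pins

print(pinned(["bitsandbytes", "accelerate"]))
```

The resulting list has the same shape as the python_packages value shown above.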

Deploying a larger version

If you can get an A100 from Colab, you can build a larger (non-quantized) version of the model and deploy that to Modelbit!

To do so, simply remove load_in_8bit=True from your from_pretrained calls. Since quantized models are loaded onto the GPU automatically but default models are not, you'll also need to add .to("cuda") to your model construction. Here's the new load_model definition:

@cache
def load_model():
    # Load the full-precision model and move it to the GPU explicitly
    model = LlavaForConditionalGeneration.from_pretrained("./llava-hf", local_files_only=True).to("cuda")
    processor = AutoProcessor.from_pretrained("./llava-hf", local_files_only=True)
    return model, processor

The inference function is unchanged. Finally, when deploying, make sure you specify a large enough GPU in your mb.deploy call.

There's no need to specify accelerate or bitsandbytes this time, since we're no longer quantizing.

You can now call your LLaVa model from its production REST endpoint!
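As a rough sketch of what a call might look like, the snippet below builds a POST request with the standard library. Everything specific here is an assumption: the workspace name is a placeholder, and the URL layout and the {"data": [...]} body (the function's arguments in order) should be checked against the API details shown for your deployment in Modelbit.

```python
import json
import urllib.request

# Assumptions: placeholder workspace name and an assumed URL layout.
WORKSPACE = "your-workspace"  # hypothetical
ENDPOINT = f"https://{WORKSPACE}.app.modelbit.com/v1/prompt_llava/latest"

def build_request(image_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST carrying the two arguments."""
    body = json.dumps({"data": [image_url, prompt]}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("https://example.com/cat.jpg", "What is in this picture?")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it (needs network access and a live deployment):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Building the request separately from sending it makes the payload easy to inspect before any traffic reaches the GPU-backed endpoint.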