
Example Llama 7B Deployment w/GPU

In this example, we'll use llama-cpp-python to deploy a Llama2 7B model as a REST endpoint.


This example only works on GPUs. Make sure you enable GPUs when deploying this model.


We recommend using Google Colab with high memory and a T4 GPU for this example.

First, download llama-2-7b.Q5_K_M.gguf from Hugging Face to your notebook environment. Store this file in the same directory as your notebook.


Then pip install llama_cpp_python with GPU support. The CMAKE_ARGS setting tells the build to compile against CUDA's cuBLAS so inference can run on the GPU:

CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama_cpp_python==0.2.11

Creating the deployment

We'll use two functions in our deployment. The first instantiates the Llama model and caches the result, so repeated calls to the deployment in Modelbit don't reload the weights. The second simply passes a prompt to the Llama model and returns the generated text.

from llama_cpp.llama import Llama
from functools import cache

@cache
def get_llm():
    return Llama(model_path="./llama-2-7b.Q5_K_M.gguf", n_gpu_layers=100, n_ctx=2048)

def run_prompt(prompt: str, max_tokens: int):
    llm = get_llm()
    return llm(prompt, max_tokens=max_tokens)["choices"][0]["text"]
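The caching matters because constructing Llama loads gigabytes of weights from disk; functools.cache ensures that work happens only once per process, and every later call reuses the same instance. A minimal sketch of that pattern, using a hypothetical stand-in class instead of the real model:

```python
from functools import cache

load_count = 0

class FakeModel:
    """Stand-in for an expensive-to-construct model like Llama."""
    def __init__(self):
        global load_count
        load_count += 1  # count how many times the "weights" get loaded

@cache
def get_model():
    return FakeModel()

# Repeated calls return the same cached instance; the constructor runs once.
first = get_model()
second = get_model()
print(first is second, load_count)  # prints: True 1
```

Because get_llm takes no arguments, the cache holds exactly one entry: the loaded model.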

Call run_prompt to see it work locally:

run_prompt("Larry is a fun llama.", 100)

Which will return something like the following:

And he’s not wearing his pajamas.

With your function working locally, you're ready to deploy!

Deploying your Llama to Modelbit

First log in to Modelbit:

import modelbit
mb = modelbit.login()

Then deploy run_prompt. When Modelbit builds your environment, it will automatically configure the CMAKE_ARGS and related NVIDIA libraries necessary to run your Llama model on the GPU.


Notice that we loaded llama-2-7b.Q5_K_M.gguf with a relative path in get_llm. We'll upload it with the same relative path in our deployment using extra_files.
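Putting that together, the deploy call might look like the sketch below. The extra_files and require_gpu arguments shown here are assumptions about mb.deploy's signature; check them against the current Modelbit documentation for your version of the package.

```python
# Sketch of the deploy call (verify parameter names in the Modelbit docs).
# extra_files uploads the weights file alongside the deployment at the same
# relative path get_llm uses; require_gpu requests a GPU-backed environment.
mb.deploy(
    run_prompt,
    extra_files=["llama-2-7b.Q5_K_M.gguf"],
    require_gpu=True,
)
```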

After run_prompt has deployed, you can call it over REST using your deployment's API endpoint URL:

curl -s -XPOST "" -d '{"data": ["Larry is a fun llama.", 100]}' | json_pp

You have successfully deployed a Llama2 7B model as a REST endpoint!